Prompt Caching
Definition
Prompt caching (offered by Anthropic as 'cache_control' and by OpenAI as 'cached input tokens') extends the KV cache optimization to the API level. When multiple requests share a common prefix—the same system prompt, a large context document, or conversation history—the API provider can cache the computed key-value tensors for that prefix after the first request. Subsequent requests with the same prefix reuse the cached computation instead of reprocessing those tokens. Anthropic charges 10% of the normal input token price for cache hits versus 125% for initial cache writes; OpenAI provides a 50% discount on cached input tokens. For applications with large, stable system prompts or reference documents, prompt caching can reduce both cost and latency significantly.
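Under Anthropic's multipliers (1.25× of the base input price to write the cache, 0.10× to read it), a quick calculation shows that caching pays for itself from the second request onward. A minimal sketch with illustrative numbers; the function names and the 2,000-token prefix are assumptions for demonstration:

```python
def cost_with_cache(n_requests: int, prefix_tokens: int,
                    write_mult: float = 1.25, read_mult: float = 0.10) -> float:
    """Token-equivalents billed for a cached prefix: one write, then reads."""
    return prefix_tokens * (write_mult + (n_requests - 1) * read_mult)

def cost_without_cache(n_requests: int, prefix_tokens: int) -> float:
    """Token-equivalents billed when every request reprocesses the prefix."""
    return float(n_requests * prefix_tokens)

# A single request costs more with caching (the write premium),
# but from the second request on, caching is already cheaper.
for n in (1, 2, 10):
    print(n, cost_with_cache(n, 2_000), cost_without_cache(n, 2_000))
```

With a 2,000-token prefix, one request costs 2,500 token-equivalents cached versus 2,000 uncached; at two requests the cached total is 2,700 versus 4,000, and the gap widens with every reuse.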
Why It Matters
System prompts, reference documents, and conversation histories are often large (hundreds to thousands of tokens) and identical across many requests. Without prompt caching, these tokens are processed and billed at full cost on every single API call. With prompt caching, a 10,000-token legal reference document included in every query is processed and billed at full price only once per cache window; subsequent requests that reuse the same document are billed at a 90% discount on Anthropic's pricing. For 99helpers customers who include extensive knowledge base context or large system prompts in every request, prompt caching can reduce input token costs by 40-80% on typical workloads.
How It Works
Anthropic prompt caching implementation: mark the cacheable prefix with cache_control: {type: 'ephemeral'} in the messages array. Cached content must be at least 1,024 tokens. The cache lifetime is 5 minutes (ephemeral), and the TTL is refreshed on every cache hit. Example content array: [{type: 'text', text: large_system_doc, cache_control: {type: 'ephemeral'}}, {type: 'text', text: user_question}]. The API response reports cache_creation_input_tokens and cache_read_input_tokens in its usage object, enabling cost tracking. Strategy: place the longest, most reused prefix first in the message, and put dynamic content (the user query) after the cacheable portion.
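A minimal sketch of the request body following the Messages API shape described above. The variables large_system_doc and user_question are illustrative placeholders, and the model name is an assumption; substitute any cache-capable model:

```python
# Sketch of an Anthropic Messages API request body with a cached prefix.
# `large_system_doc` stands in for a reference document of >= 1,024 tokens.
large_system_doc = "... full knowledge base text ..." * 200
user_question = "What is the refund policy?"

request_body = {
    "model": "claude-sonnet-4-20250514",  # assumed; any cache-capable model
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": [
                # Static, reusable prefix first — this is the cache key.
                {"type": "text", "text": large_system_doc,
                 "cache_control": {"type": "ephemeral"}},
                # Dynamic content goes after the cacheable prefix.
                {"type": "text", "text": user_question},
            ],
        }
    ],
}
```

On the first call, the response's usage object reports the prefix under cache_creation_input_tokens; within the next 5 minutes, identical prefixes show up under cache_read_input_tokens instead.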
[Chart: Prompt Caching — First Request vs. Cache Hit; savings over repeated calls with a 2,000-token system prompt]
Real-World Example
A 99helpers chatbot includes a 5,000-token knowledge base context in every user query to answer product questions. Without caching, 100 queries × 5,000 context tokens = 500,000 input tokens at full price per 5-minute window. With Anthropic prompt caching, the first query in each window pays 125% for the cache write (6,250 token-equivalents); each subsequent query pays 10% (500 token-equivalents). Total for 100 queries: 6,250 + 99 × 500 = 55,750 token-equivalents versus 500,000 without caching, an 89% reduction in context token costs. In practice, because each hit refreshes the 5-minute TTL, steady traffic keeps the cache warm and only the very first request ever pays the write cost.
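The arithmetic above can be checked directly. A sketch using the token-equivalent multipliers from the example:

```python
context_tokens = 5_000
queries_per_window = 100

# Without caching: every query reprocesses the full context.
uncached = queries_per_window * context_tokens             # 500,000

# With caching: one write at 125%, the rest read at 10%.
write = context_tokens * 1.25                              # 6,250
reads = (queries_per_window - 1) * context_tokens * 0.10   # 99 x 500 = 49,500
cached = write + reads                                     # 55,750

savings = 1 - cached / uncached
print(f"{cached:,.0f} vs {uncached:,} token-equivalents "
      f"({savings:.0%} reduction)")  # → 55,750 vs 500,000 token-equivalents (89% reduction)
```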
Common Mistakes
- ✕ Placing dynamic content (user questions) before static content (system prompts, documents)—the cache key is the prefix; dynamic content must come after the cacheable portion.
- ✕ Trying to cache very short prefixes (under 1,024 tokens)—most providers require a minimum prefix length for caching to be beneficial.
- ✕ Not monitoring cache hit rates—if the cached prefix changes frequently (e.g., timestamp in the system prompt), cache misses eliminate savings.
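For the last point, cache effectiveness can be tracked from the usage object each response returns. A sketch, assuming usage dicts carrying the Anthropic field names mentioned above; the helper name is illustrative:

```python
def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of prefix tokens served from cache across a batch of responses.

    Each usage dict is assumed to carry the Anthropic fields
    `cache_read_input_tokens` and `cache_creation_input_tokens`.
    """
    read = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    written = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = read + written
    return read / total if total else 0.0

# Illustrative: one cache write followed by three hits on a 5,000-token prefix.
usages = [{"cache_creation_input_tokens": 5000}] + \
         [{"cache_read_input_tokens": 5000}] * 3
print(f"hit rate: {cache_hit_rate(usages):.0%}")  # → hit rate: 75%
```

A prefix that changes on every request (e.g., an embedded timestamp) shows up here as a hit rate near zero: every call is a fresh cache write at the 125% premium.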
Related Terms
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
LLM API
An LLM API is a cloud service interface that provides programmatic access to large language models, allowing developers to send prompts and receive completions without managing model infrastructure.
Token
A token is the basic unit of text an LLM processes—roughly 4 characters or 3/4 of an English word. LLM APIs charge per token, context windows are measured in tokens, and generation speed is measured in tokens per second.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.