Prompt Caching

Definition

Prompt caching (offered by Anthropic as 'cache_control' and by OpenAI as 'cached input tokens') extends the KV cache optimization to the API level. When multiple requests share a common prefix—the same system prompt, a large context document, or conversation history—the API provider can cache the computed key-value tensors for that prefix after the first request. Subsequent requests with the same prefix reuse the cached computation instead of reprocessing those tokens. Anthropic charges 10% of the normal input token price for cache hits versus 125% for initial cache writes; OpenAI provides a 50% discount on cached input tokens. For applications with large, stable system prompts or reference documents, prompt caching can reduce both cost and latency significantly.

Why It Matters

System prompts, reference documents, and conversation histories are often large (hundreds to thousands of tokens) and identical across many requests. Without prompt caching, these tokens are processed and billed at full price on every single API call. With prompt caching, a 10,000-token legal reference document included in every query is processed and billed at full cost only once per cache window; subsequent requests that reuse the same document get it at a 90% discount. For 99helpers customers who include extensive knowledge base context or large system prompts in every request, prompt caching can reduce input token costs by 40-80% on typical workloads.

How It Works

Anthropic prompt caching implementation: attach cache_control: {type: 'ephemeral'} to the last content block of the stable prefix; everything up to and including that block becomes the cached prefix. Cached content must be at least 1,024 tokens. Cache lifetime is 5 minutes (ephemeral), refreshed on each cache hit. Example: {role: 'user', content: [{type: 'text', text: large_system_doc, cache_control: {type: 'ephemeral'}}, {type: 'text', text: user_question}]}. The API response reports cache_creation_input_tokens and cache_read_input_tokens in the usage object, enabling cost tracking. Strategy: place the longest, most-reused content first in the request, and put dynamic content (the user query) after the cacheable prefix.
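The request shape described above can be sketched in Python. This is a minimal illustration, not production code: the model id, document, and question are stand-in values, and only the payload structure matters here.

```python
# Sketch: shaping an Anthropic Messages API request so the large, stable
# document prefix is cacheable. `large_system_doc`, the model id, and the
# question below are illustrative stand-ins.

def build_cached_request(large_system_doc: str, user_question: str) -> dict:
    """Stable prefix first, marked with cache_control; the dynamic
    question follows it and stays outside the cached span."""
    return {
        "model": "claude-sonnet-4-20250514",  # example model id
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {   # cacheable prefix: identical across requests
                    "type": "text",
                    "text": large_system_doc,
                    "cache_control": {"type": "ephemeral"},
                },
                {   # dynamic suffix: changes on every request
                    "type": "text",
                    "text": user_question,
                },
            ],
        }],
    }

payload = build_cached_request("<5,000-token reference doc>",
                               "What is the refund policy?")
```

With the anthropic SDK this dict would be passed as client.messages.create(**payload); on the response, usage.cache_creation_input_tokens and usage.cache_read_input_tokens indicate whether the prefix was written to or read from the cache.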

Prompt Caching: First Request vs. Cache Hit

First request (cache miss) · ~2,000 ms · full compute
  • System prompt (2,000 tokens): full KV computation, every token processed
  • User message (50 tokens): processed normally
  • KV cache stored for the system prompt prefix
  • Cost: 100% of input tokens billed · Latency: baseline

Repeat request (cache hit) · ~300 ms · ~85% cost saving
  • System prompt (2,000 tokens): loaded from the KV cache, no recomputation
  • New user message (50 tokens): only the new tokens are computed
  • Cost: ~10% of input tokens billed · Latency: 6–7× faster TTFT

Savings over repeated calls (2,000-token system prompt): without caching, every call bills 100% of the prefix tokens; with caching, repeat calls bill ~10%.

Real-World Example

A 99helpers chatbot includes a 5,000-token knowledge base context in every user query to answer product questions. Without caching, 100 queries bill 100 × 5,000 = 500,000 context tokens at full price. With Anthropic prompt caching, the first query in each 5-minute window pays 125% for the cache write (6,250 token-equivalents); subsequent queries pay 10% (500 token-equivalents each). For 100 queries per 5-minute window, the total is 6,250 + 99 × 500 = 55,750 token-equivalents versus 500,000 without caching, an 89% reduction in context token costs.
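The arithmetic in this example is easy to verify. The snippet below recomputes it with the pricing multipliers stated earlier (1.25× for a cache write, 0.10× for a cache read); the traffic numbers are the example's own assumptions.

```python
# Back-of-the-envelope check of the example: 100 queries per 5-minute
# cache window, each carrying a 5,000-token context prefix.

CONTEXT_TOKENS = 5_000
QUERIES_PER_WINDOW = 100
CACHE_WRITE_MULT = 1.25  # Anthropic cache write: 125% of base input price
CACHE_READ_MULT = 0.10   # Anthropic cache read: 10% of base input price

without_cache = QUERIES_PER_WINDOW * CONTEXT_TOKENS  # 500,000
with_cache = (CONTEXT_TOKENS * CACHE_WRITE_MULT                     # first query writes the cache
              + (QUERIES_PER_WINDOW - 1) * CONTEXT_TOKENS * CACHE_READ_MULT)

savings = 1 - with_cache / without_cache
print(f"{with_cache:,.0f} token-equivalents, {savings:.0%} saved")
# → 55,750 token-equivalents, 89% saved
```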

Common Mistakes

  • Placing dynamic content (user questions) before static content (system prompts, documents)—the cache key is the prefix; dynamic content must come after the cacheable portion.
  • Trying to cache very short prefixes (under 1,024 tokens)—most providers require a minimum prefix length for caching to be beneficial.
  • Not monitoring cache hit rates—if the cached prefix changes frequently (e.g., timestamp in the system prompt), cache misses eliminate savings.
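The hit-rate monitoring suggested in the last bullet can be sketched from the usage fields the API returns (cache_creation_input_tokens and cache_read_input_tokens). The sample usage records below are made up for illustration.

```python
# Sketch: track what fraction of cacheable prefix tokens were served
# from cache across a batch of requests. Field names match Anthropic's
# usage object; the sample values are invented.

def cache_hit_rate(usages: list[dict]) -> float:
    """Fraction of prefix tokens read from cache vs. written to it."""
    reads = sum(u.get("cache_read_input_tokens", 0) for u in usages)
    writes = sum(u.get("cache_creation_input_tokens", 0) for u in usages)
    total = reads + writes
    return reads / total if total else 0.0

usages = [
    {"cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
    {"cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000},
]
rate = cache_hit_rate(usages)  # 2 of 3 requests hit the cache
```

A rate that drifts toward zero usually means the prefix is not byte-identical across requests, for example a timestamp embedded in the system prompt.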
