KV Cache
Definition
The KV (key-value) cache is a critical inference optimization that eliminates redundant computation in autoregressive LLM generation. During the prefill phase, the transformer computes key and value matrices for every prompt token at every layer. During decoding, each new token is generated by attending over all previous tokens, including every prompt token. Without caching, each generation step would recompute keys and values for the entire sequence so far, making total generation cost scale quadratically with sequence length. The KV cache instead stores the computed key-value tensors in GPU memory after the prefill phase. Each decode step then computes keys and values only for the new token and appends them to the cache, cutting per-step key/value computation from O(sequence length) to O(1). (The attention dot products against the cached keys still grow with sequence length, but they are cheap relative to a full recompute.)
Why It Matters
KV caching is mandatory for efficient LLM inference: without it, generating a 500-token response from a 1,000-token prompt would cost on the order of 500 × 1,000² attention computations instead of one 1,000² prefill plus 499 cheap incremental steps. The cache is what makes the per-token cost of generation nearly constant in practice. KV cache memory consumption is the primary bottleneck for LLM serving throughput: each cached token requires roughly 2 (one tensor each for K and V) × number_of_layers × number_of_kv_heads × head_dimension × 2 bytes (float16). For large models serving many concurrent requests with long prompts, the KV cache can consume most available GPU memory, limiting batch size and thus throughput.
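A back-of-envelope check of the savings above. This is a sketch, not a profiler run: it counts pairwise attention-score computations, treating an uncached step as quadratic in sequence length and a cached step as linear.

```python
# Rough operation counts for the scenario above: a 1,000-token prompt
# followed by 500 generated tokens (units: attention-score computations).
prompt, gen = 1_000, 500

# Without a KV cache, each decode step re-runs attention over the whole
# sequence, so step i costs roughly (prompt + i)^2 operations.
no_cache = sum((prompt + i) ** 2 for i in range(gen))

# With a KV cache, the prompt is prefilled once (~prompt^2), then each
# decode step attends one new query over the cached keys (~prompt + i).
with_cache = prompt ** 2 + sum(prompt + i for i in range(gen))

print(f"without cache: {no_cache:.2e}")
print(f"with cache:    {with_cache:.2e}")
print(f"ratio:         ~{no_cache / with_cache:.0f}x")
```

The ratio lands in the hundreds for these lengths, which is why decoding without a cache is impractical.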
How It Works
KV cache memory formula: memory = 2 (one tensor each for K and V) × (bytes per value: 2 for float16) × (num_layers) × (num_kv_heads) × (head_dim) × (sequence_length) × (batch_size). For Llama-3-8B (32 layers, 8 KV heads via grouped-query attention, head_dim 128): 2 × 2 × 32 × 8 × 128 × 4,096 × 32 ≈ 17.2GB for 32 concurrent requests with 4K context. vLLM's PagedAttention treats the KV cache like virtual memory in an OS: fixed-size blocks ('pages') of KV cache are allocated dynamically and shared across requests that have common prefixes (e.g., identical system prompts). This enables much higher throughput by eliminating KV cache fragmentation and allowing fine-grained memory management. Prompt caching (offered by some API providers) extends KV caching to the API level, reusing computed cache states across requests with common prefixes.
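The formula above is easy to turn into a calculator. The function below is a straightforward transcription; the Llama-3-8B numbers (32 layers, 8 KV heads, head_dim 128) are from its published configuration.

```python
# KV cache size per the formula above. The leading 2 is for the two
# cached tensors (K and V); bytes_per_value is 2 for float16.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    return (2 * bytes_per_value * num_layers * num_kv_heads
            * head_dim * seq_len * batch_size)

# Llama-3-8B config: 32 layers, 8 KV heads (GQA), head_dim 128.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=4_096, batch_size=32)
print(f"{total / 1e9:.1f} GB")  # ~17.2 GB for 32 requests at 4K context
```

Note how grouped-query attention already helps here: with 32 full attention heads instead of 8 KV heads, the cache would be four times larger.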
KV Cache — Reuse Keys & Values Across Decode Steps
Without KV Cache
Generating token 7 of "The quick brown fox jumps over the" requires recomputing K/V for all six previous tokens at every step. Cost: O(n) compute per new token, i.e. a full recompute over the growing sequence for each of n tokens.
With KV Cache
The K/V pairs for "The quick brown fox jumps over" are computed once and stored; only the new token "the" needs computation. Cost: O(1) compute per new token, a massive throughput gain.
Compute saved: ~85% on a 150-token response. Latency impact: 3–5× faster decode. Memory tradeoff: GPU RAM to store the K/V tensors.
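The cached decode loop sketched above can be written out in a few lines. This is a toy single-head version with made-up weights and dimensions, not a real transformer layer: it only shows that each step projects one new token and appends its K/V to the cache.

```python
import numpy as np

# Toy single-head attention decode loop with a KV cache.
rng = np.random.default_rng(0)
d = 16                                   # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []                # one cached K/V pair per token

def decode_step(x_new):
    """Attend the newest token over all cached keys/values."""
    q = x_new @ Wq                       # only the new token is projected
    k_cache.append(x_new @ Wk)           # O(1) work added to the cache
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                      # attention output for the new token

# Seven token embeddings, standing in for "The quick brown fox jumps over the".
for token_embedding in rng.standard_normal((7, d)):
    out = decode_step(token_embedding)

print(len(k_cache))  # 7 cached K/V pairs, each computed exactly once
```

Without the cache, `decode_step` would have to re-project every earlier token through `Wk` and `Wv` on every call; here each projection happens once.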
Real-World Example
A 99helpers deployment uses a 2,000-token system prompt for every chatbot query. Without KV cache optimization, computing attention over the system prompt costs the same at step 1 and step 200 of generation, and every new request pays the full prefill again. With vLLM's automatic prefix caching (which caches and reuses KV blocks for common prompt prefixes), the 2,000-token system prompt is computed once and its KV tensors are reused across all concurrent requests. For 50 concurrent users, 49 of the 50 system-prompt prefills become redundant, saving ~40ms of prefill time per query and increasing throughput by 35%.
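The reuse pattern in that example reduces to keying a cache on the shared prefix. The sketch below uses hypothetical names (`prefill`, `serve`) and Python objects as stand-in "KV state"; real servers cache GPU-resident KV blocks, but the bookkeeping is the same idea.

```python
import hashlib

prefix_cache = {}   # hash of shared prefix -> precomputed "KV state"

def prefill(tokens):
    """Stand-in for the expensive prefill forward pass."""
    return {"kv_for": tuple(tokens)}     # placeholder for KV tensors

def serve(system_prompt_tokens, user_tokens):
    key = hashlib.sha256(repr(system_prompt_tokens).encode()).hexdigest()
    if key not in prefix_cache:          # first request pays the prefill
        prefix_cache[key] = prefill(system_prompt_tokens)
    prefix_kv = prefix_cache[key]        # later requests reuse it
    suffix_kv = prefill(user_tokens)     # only the suffix is computed
    return prefix_kv, suffix_kv

sys_tokens = list(range(2_000))          # the 2,000-token system prompt
for _ in range(50):                      # 50 concurrent users
    serve(sys_tokens, user_tokens=[1, 2, 3])

print(len(prefix_cache))  # 1 -> prefix prefilled once, reused 49 times
```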
Common Mistakes
- ✕ Treating KV cache as unlimited: KV cache size is bounded by available GPU VRAM; very long sequences or large batch sizes can exhaust cache memory.
- ✕ Ignoring KV cache eviction policies in vLLM deployments: when the cache is full, least-recently-used entries are evicted, causing recomputation for evicted sequences.
- ✕ Not accounting for KV cache memory in capacity planning: a 70B model may use 140GB for weights plus 50+GB for KV cache at moderate batch sizes.
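The capacity-planning pitfall is worth making concrete. The sketch below assumes a Llama-3-70B-style KV configuration (80 layers, 8 KV heads, head_dim 128, float16) and illustrative hardware numbers; it answers "how many concurrent sequences fit once the weights are loaded?".

```python
# Max concurrent sequences that fit in VRAM after weights are loaded.
def max_batch(vram_gb, weights_gb, num_layers, num_kv_heads, head_dim,
              seq_len, bytes_per_value=2):
    # KV bytes per sequence: 2 (K and V) x bytes x layers x heads x dim x len
    per_seq = (2 * bytes_per_value * num_layers * num_kv_heads
               * head_dim * seq_len)
    free_bytes = (vram_gb - weights_gb) * 1e9
    return int(free_bytes // per_seq)

# Illustrative: 160GB of VRAM, ~140GB of fp16 70B weights, 8K context each.
print(max_batch(vram_gb=160, weights_gb=140, num_layers=80,
                num_kv_heads=8, head_dim=128, seq_len=8_192))
```

Even with 20GB to spare, only a handful of 8K-context sequences fit, which is exactly the batch-size ceiling the bullet above warns about.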
Related Terms
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Speculative Decoding
Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, achieving 2-3x inference speedup without changing output quality.
Prompt Caching
Prompt caching is an LLM API feature that stores the computed KV cache state of a common prompt prefix server-side, so repeated requests sharing that prefix can skip its processing—reducing latency and input token costs.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.