KV Cache
Definition
The KV (key-value) cache is a critical inference optimization that eliminates redundant computation in autoregressive LLM generation. During the prefill phase, the transformer computes key and value matrices for every prompt token at every layer. During decoding, each new token is generated by attending over all previous tokens, including every prompt token. Without caching, each generation step would recompute keys and values for the entire sequence so far, making total generation cost scale quadratically with sequence length. The KV cache instead stores the computed key-value tensors in GPU memory after the prefill phase. Each decode step then computes keys and values only for the new token and appends them to the cache, cutting per-step key/value computation from O(sequence length) to O(1). (The attention dot products against the cached keys still grow with sequence length, but they are cheap relative to a full recompute.)
Why It Matters
KV caching is mandatory for efficient LLM inference: without it, generating a 500-token response from a 1,000-token prompt would cost on the order of 500 × 1,000² attention computations instead of one 1,000² prefill plus 499 cheap incremental steps. The cache is what makes the per-token cost of generation nearly constant in practice. KV cache memory consumption is the primary bottleneck for LLM serving throughput: each cached token requires roughly 2 (one tensor each for K and V) × number_of_layers × number_of_kv_heads × head_dimension × 2 bytes (float16). For large models serving many concurrent requests with long prompts, the KV cache can consume most available GPU memory, limiting batch size and thus throughput.
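A back-of-envelope check of the savings above. This is a sketch, not a profiler run: it counts pairwise attention-score computations, treating an uncached step as quadratic in sequence length and a cached step as linear.

```python
# Rough operation counts for the scenario above: a 1,000-token prompt
# followed by 500 generated tokens (units: attention-score computations).
prompt, gen = 1_000, 500

# Without a KV cache, each decode step re-runs attention over the whole
# sequence, so step i costs roughly (prompt + i)^2 operations.
no_cache = sum((prompt + i) ** 2 for i in range(gen))

# With a KV cache, the prompt is prefilled once (~prompt^2), then each
# decode step attends one new query over the cached keys (~prompt + i).
with_cache = prompt ** 2 + sum(prompt + i for i in range(gen))

print(f"without cache: {no_cache:.2e}")
print(f"with cache:    {with_cache:.2e}")
print(f"ratio:         ~{no_cache / with_cache:.0f}x")
```

The ratio lands in the hundreds for these lengths, which is why decoding without a cache is impractical.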
How It Works
KV cache memory formula: memory = 2 (one tensor each for K and V) × (bytes per value: 2 for float16) × (num_layers) × (num_kv_heads) × (head_dim) × (sequence_length) × (batch_size). For Llama-3-8B (32 layers, 8 KV heads via grouped-query attention, head_dim 128): 2 × 2 × 32 × 8 × 128 × 4,096 × 32 ≈ 17.2GB for 32 concurrent requests with 4K context. vLLM's PagedAttention treats the KV cache like virtual memory in an OS: fixed-size blocks ('pages') of KV cache are allocated dynamically and shared across requests that have common prefixes (e.g., identical system prompts). This enables much higher throughput by eliminating KV cache fragmentation and allowing fine-grained memory management. Prompt caching (offered by some API providers) extends KV caching to the API level, reusing computed cache states across requests with common prefixes.
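The formula above is easy to turn into a calculator. The function below is a straightforward transcription; the Llama-3-8B numbers (32 layers, 8 KV heads, head_dim 128) are from its published configuration.

```python
# KV cache size per the formula above. The leading 2 is for the two
# cached tensors (K and V); bytes_per_value is 2 for float16.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim,
                   seq_len, batch_size, bytes_per_value=2):
    return (2 * bytes_per_value * num_layers * num_kv_heads
            * head_dim * seq_len * batch_size)

# Llama-3-8B config: 32 layers, 8 KV heads (GQA), head_dim 128.
total = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128,
                       seq_len=4_096, batch_size=32)
print(f"{total / 1e9:.1f} GB")  # ~17.2 GB for 32 requests at 4K context
```

Note how grouped-query attention already helps here: with 32 full attention heads instead of 8 KV heads, the cache would be four times larger.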
KV Cache — Reuse Keys & Values Across Decode Steps
Without KV Cache
Generating token 7 of "The quick brown fox jumps over the" requires recomputing K/V for all six previous tokens at every step. Cost: O(n) compute per new token, i.e. a full recompute over the growing sequence for each of n tokens.
With KV Cache
The K/V pairs for "The quick brown fox jumps over" are computed once and stored; only the new token "the" needs computation. Cost: O(1) compute per new token, a massive throughput gain.
Compute saved: ~85% on a 150-token response. Latency impact: 3–5× faster decode. Memory tradeoff: GPU RAM to store the K/V tensors.
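The cached decode loop sketched above can be written out in a few lines. This is a toy single-head version with made-up weights and dimensions, not a real transformer layer: it only shows that each step projects one new token and appends its K/V to the cache.

```python
import numpy as np

# Toy single-head attention decode loop with a KV cache.
rng = np.random.default_rng(0)
d = 16                                   # head dimension (illustrative)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

k_cache, v_cache = [], []                # one cached K/V pair per token

def decode_step(x_new):
    """Attend the newest token over all cached keys/values."""
    q = x_new @ Wq                       # only the new token is projected
    k_cache.append(x_new @ Wk)           # O(1) work added to the cache
    v_cache.append(x_new @ Wv)
    K, V = np.stack(k_cache), np.stack(v_cache)
    attn = softmax(q @ K.T / np.sqrt(d))
    return attn @ V                      # attention output for the new token

# Seven token embeddings, standing in for "The quick brown fox jumps over the".
for token_embedding in rng.standard_normal((7, d)):
    out = decode_step(token_embedding)

print(len(k_cache))  # 7 cached K/V pairs, each computed exactly once
```

Without the cache, `decode_step` would have to re-project every earlier token through `Wk` and `Wv` on every call; here each projection happens once.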
Real-World Example
A 99helpers deployment uses a 2,000-token system prompt for every chatbot query. Without KV cache optimization, computing attention over the system prompt costs the same at step 1 and step 200 of generation, and every new request pays the full prefill again. With vLLM's automatic prefix caching (which caches and reuses KV blocks for common prompt prefixes), the 2,000-token system prompt is computed once and its KV tensors are reused across all concurrent requests. For 50 concurrent users, 49 of the 50 system-prompt prefills become redundant, saving ~40ms of prefill time per query and increasing throughput by 35%.
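The reuse pattern in that example reduces to keying a cache on the shared prefix. The sketch below uses hypothetical names (`prefill`, `serve`) and Python objects as stand-in "KV state"; real servers cache GPU-resident KV blocks, but the bookkeeping is the same idea.

```python
import hashlib

prefix_cache = {}   # hash of shared prefix -> precomputed "KV state"

def prefill(tokens):
    """Stand-in for the expensive prefill forward pass."""
    return {"kv_for": tuple(tokens)}     # placeholder for KV tensors

def serve(system_prompt_tokens, user_tokens):
    key = hashlib.sha256(repr(system_prompt_tokens).encode()).hexdigest()
    if key not in prefix_cache:          # first request pays the prefill
        prefix_cache[key] = prefill(system_prompt_tokens)
    prefix_kv = prefix_cache[key]        # later requests reuse it
    suffix_kv = prefill(user_tokens)     # only the suffix is computed
    return prefix_kv, suffix_kv

sys_tokens = list(range(2_000))          # the 2,000-token system prompt
for _ in range(50):                      # 50 concurrent users
    serve(sys_tokens, user_tokens=[1, 2, 3])

print(len(prefix_cache))  # 1 -> prefix prefilled once, reused 49 times
```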
Common Mistakes
- ✕ Treating KV cache as unlimited: KV cache size is bounded by available GPU VRAM; very long sequences or large batch sizes can exhaust cache memory.
- ✕ Ignoring KV cache eviction policies in vLLM deployments: when the cache is full, least-recently-used entries are evicted, causing recomputation for evicted sequences.
- ✕ Not accounting for KV cache memory in capacity planning: a 70B model may use 140GB for weights plus 50+GB for KV cache at moderate batch sizes.
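The capacity-planning pitfall is worth making concrete. The sketch below assumes a Llama-3-70B-style KV configuration (80 layers, 8 KV heads, head_dim 128, float16) and illustrative hardware numbers; it answers "how many concurrent sequences fit once the weights are loaded?".

```python
# Max concurrent sequences that fit in VRAM after weights are loaded.
def max_batch(vram_gb, weights_gb, num_layers, num_kv_heads, head_dim,
              seq_len, bytes_per_value=2):
    # KV bytes per sequence: 2 (K and V) x bytes x layers x heads x dim x len
    per_seq = (2 * bytes_per_value * num_layers * num_kv_heads
               * head_dim * seq_len)
    free_bytes = (vram_gb - weights_gb) * 1e9
    return int(free_bytes // per_seq)

# Illustrative: 160GB of VRAM, ~140GB of fp16 70B weights, 8K context each.
print(max_batch(vram_gb=160, weights_gb=140, num_layers=80,
                num_kv_heads=8, head_dim=128, seq_len=8_192))
```

Even with 20GB to spare, only a handful of 8K-context sequences fit, which is exactly the batch-size ceiling the bullet above warns about.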
Related Terms
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Speculative Decoding
Speculative decoding uses a small 'draft' model to generate multiple candidate tokens quickly, then verifies them in parallel with the large target model, achieving 2-3x inference speedup without changing output quality.
Prompt Caching
Prompt caching is an LLM API feature that stores the computed KV cache state of a common prompt prefix server-side, so repeated requests sharing that prefix can skip its processing—reducing latency and input token costs.
Context Length
Context length is the maximum number of tokens an LLM can process in a single request—encompassing the system prompt, conversation history, retrieved documents, and the response—determining how much information the model can consider simultaneously.
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.