Speculative Decoding
Definition
Speculative decoding addresses the sequential bottleneck in autoregressive LLM generation. Normally, a large target model generates one token per forward pass—slow but high quality. Speculative decoding introduces a small, fast 'draft' model (same family as the target but much smaller—e.g., a 7B draft for a 70B target) that generates K candidate tokens in K cheap sequential steps. These K draft tokens are then verified in a single parallel forward pass of the large target model, which accepts or rejects each one. Accepted tokens are kept; at the first rejection, a replacement token is sampled from an adjusted target distribution, which guarantees the overall output distribution exactly matches the target model's. When the draft model's distribution closely matches the target, most tokens are accepted, and effective throughput approaches K tokens per target model forward pass instead of 1.
Why It Matters
Speculative decoding is one of the most impactful inference optimizations available, delivering 2-3x throughput improvements on appropriate hardware without any change to output quality—the output distribution is mathematically identical to sampling from the target model alone. This matters for latency-sensitive applications where response speed directly impacts user experience. For 99helpers chatbots, 2-3x faster token generation means 2-3x shorter response times for the same hardware investment. The optimization is particularly effective for conversational responses where the draft model (trained on similar data) predicts common phrases accurately, achieving high acceptance rates.
How It Works
Speculative decoding algorithm: (1) the draft model generates K tokens autoregressively: [t1, t2, ..., tK]; (2) the target model processes [prompt, t1, t2, ..., tK] in one forward pass, computing its probabilities for every position; (3) each draft token t_i is accepted with probability min(1, target_prob(t_i) / draft_prob(t_i)) (equivalently, draw u ~ Uniform(0, 1) and accept if target_prob(t_i) >= draft_prob(t_i) × u); (4) at the first rejection, say position i, a replacement is sampled from the normalized residual distribution max(0, target − draft), and drafting restarts from position i+1; if all K tokens are accepted, one bonus token is sampled from the target's distribution at position K+1. The acceptance rate depends on how closely draft and target distributions match—higher acceptance = more speedup. Matching model families (Llama-3-8B as draft for Llama-3-70B) typically achieve 60-80% acceptance rates and 2-2.5x speedup.
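The four steps above can be sketched in a few lines of Python. This is a toy model, not a real inference stack: `TARGET` and `DRAFT` are made-up next-token distributions over a 4-token vocabulary standing in for the two models, whereas a real system recomputes these distributions at every position.

```python
import random

VOCAB = [0, 1, 2, 3]
TARGET = [0.5, 0.3, 0.1, 0.1]  # hypothetical target-model distribution
DRAFT = [0.4, 0.4, 0.1, 0.1]   # hypothetical draft-model distribution

def sample(probs, rng):
    return rng.choices(VOCAB, weights=probs, k=1)[0]

def speculative_step(K, rng):
    """Return the tokens produced by one draft-then-verify cycle."""
    drafts = [sample(DRAFT, rng) for _ in range(K)]        # (1) draft K tokens
    out = []
    for t in drafts:                                       # (2)+(3) verify each
        if rng.random() <= min(1.0, TARGET[t] / DRAFT[t]):
            out.append(t)                                  # accepted
        else:
            # (4) rejected: resample from the residual max(0, p - q),
            # renormalized; drafting restarts after this correction token
            residual = [max(0.0, p - q) for p, q in zip(TARGET, DRAFT)]
            z = sum(residual)
            out.append(sample([r / z for r in residual], rng))
            return out
    out.append(sample(TARGET, rng))  # all K accepted: bonus token from target
    return out
```

Averaged over many steps, the tokens this procedure emits follow `TARGET` exactly, which is the exactness guarantee the accept/reject rule is designed to provide.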
Speculative Decoding — Draft → Verify → Accept/Reject
Stage 1 — Small draft model generates 5 candidate tokens in 5 cheap sequential steps
Stage 2 — Large verifier checks all draft tokens in one forward pass
Stage 3 — Accepted tokens are output + verifier supplies a correction for the first rejection
Standard decoding: 5 sequential passes, 1 token per forward pass. Speculative decoding: 1 verifier pass; 3 accepted tokens = 3× speedup for that pass.
Real-World Example
A 99helpers self-hosted deployment uses Llama-3-70B for response quality. Without speculative decoding: 30 tokens/second on 4 A100 GPUs. With Llama-3-8B as a speculative draft model (running on 1 spare A100) and K=4 draft tokens per step: acceptance rate of 73%, effective throughput of 4 × 0.73 ≈ 2.9 tokens per target step → 30 × 2.9 = ~87 tokens/second—a 2.9x speedup. Average response time drops from 5 seconds to 1.7 seconds for a 150-token response. The draft model's A100 adds $2/hour to infrastructure but the speedup allows serving 3x more concurrent users on the same target model hardware.
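The arithmetic in this example can be checked with a quick back-of-envelope calculation. This is a simplified model: it treats the acceptance rate as the average fraction of the K drafts that survive, and ignores the bonus token and the draft model's own latency.

```python
K = 4                # draft tokens per target step
acceptance = 0.73    # measured acceptance rate (from the example)
base_tps = 30        # target-only tokens/second, i.e. target steps/second
response_len = 150   # tokens in a typical response

tokens_per_step = K * acceptance       # ~2.9 tokens per target pass
spec_tps = base_tps * tokens_per_step  # ~87 tokens/second
baseline_s = response_len / base_tps   # 5.0 s without speculation
spec_s = response_len / spec_tps       # ~1.7 s with speculation
```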
Common Mistakes
- ✕ Using a draft model from a different model family than the target—the token distributions must be similar for high acceptance rates; mismatched families produce low acceptance rates and no meaningful speedup.
- ✕ Applying speculative decoding when the bottleneck is prefill rather than decode—speculative decoding only helps during the decode phase; long prompt prefill is not accelerated.
- ✕ Thinking speculative decoding changes output quality—the algorithm is designed to be mathematically equivalent to target model sampling; any quality difference indicates an implementation bug.
Related Terms
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →