Speculative Decoding
Definition
Speculative decoding addresses the sequential bottleneck in autoregressive LLM generation. Normally, a large target model generates one token per forward pass—slow but high quality. Speculative decoding introduces a small, fast 'draft' model (same family as the target but much smaller—e.g., a 7B draft for a 70B target) that generates K candidate tokens in K cheap sequential steps. These K draft tokens are then verified in a single parallel forward pass of the large target model, which accepts or rejects each one. Accepted tokens are kept; at the first rejection, a replacement token is sampled from an adjusted target distribution, which guarantees the overall output distribution exactly matches the target model's. When the draft model's distribution closely matches the target, most tokens are accepted, and effective throughput approaches K tokens per target model forward pass instead of 1.
Why It Matters
Speculative decoding is one of the most impactful inference optimizations available, delivering 2-3x throughput improvements on appropriate hardware without any change to output quality—the output distribution is mathematically identical to sampling from the target model alone. This matters for latency-sensitive applications where response speed directly impacts user experience. For 99helpers chatbots, 2-3x faster token generation means 2-3x shorter response times for the same hardware investment. The optimization is particularly effective for conversational responses where the draft model (trained on similar data) predicts common phrases accurately, achieving high acceptance rates.
How It Works
Speculative decoding algorithm: (1) the draft model generates K tokens autoregressively: [t1, t2, ..., tK]; (2) the target model processes [prompt, t1, t2, ..., tK] in one forward pass, computing its probabilities for every position; (3) each draft token t_i is accepted with probability min(1, target_prob(t_i) / draft_prob(t_i)) (equivalently, draw u ~ Uniform(0, 1) and accept if target_prob(t_i) >= draft_prob(t_i) × u); (4) at the first rejection, say position i, a replacement is sampled from the normalized residual distribution max(0, target − draft), and drafting restarts from position i+1; if all K tokens are accepted, one bonus token is sampled from the target's distribution at position K+1. The acceptance rate depends on how closely draft and target distributions match—higher acceptance = more speedup. Matching model families (Llama-3-8B as draft for Llama-3-70B) typically achieve 60-80% acceptance rates and 2-2.5x speedup.
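The four steps above can be sketched in a few lines of Python. This is a toy model, not a real inference stack: `TARGET` and `DRAFT` are made-up next-token distributions over a 4-token vocabulary standing in for the two models, whereas a real system recomputes these distributions at every position.

```python
import random

VOCAB = [0, 1, 2, 3]
TARGET = [0.5, 0.3, 0.1, 0.1]  # hypothetical target-model distribution
DRAFT = [0.4, 0.4, 0.1, 0.1]   # hypothetical draft-model distribution

def sample(probs, rng):
    return rng.choices(VOCAB, weights=probs, k=1)[0]

def speculative_step(K, rng):
    """Return the tokens produced by one draft-then-verify cycle."""
    drafts = [sample(DRAFT, rng) for _ in range(K)]        # (1) draft K tokens
    out = []
    for t in drafts:                                       # (2)+(3) verify each
        if rng.random() <= min(1.0, TARGET[t] / DRAFT[t]):
            out.append(t)                                  # accepted
        else:
            # (4) rejected: resample from the residual max(0, p - q),
            # renormalized; drafting restarts after this correction token
            residual = [max(0.0, p - q) for p, q in zip(TARGET, DRAFT)]
            z = sum(residual)
            out.append(sample([r / z for r in residual], rng))
            return out
    out.append(sample(TARGET, rng))  # all K accepted: bonus token from target
    return out
```

Averaged over many steps, the tokens this procedure emits follow `TARGET` exactly, which is the exactness guarantee the accept/reject rule is designed to provide.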
Speculative Decoding — Draft → Verify → Accept/Reject
Stage 1 — Small draft model generates 5 candidate tokens in 5 cheap sequential steps
Stage 2 — Large verifier checks all draft tokens in one forward pass
Stage 3 — Accepted tokens are output + verifier supplies a correction for the first rejection
Standard decoding: 5 sequential passes, 1 token per forward pass. Speculative decoding: 1 verifier pass; 3 accepted tokens = 3× speedup for that pass.
Real-World Example
A 99helpers self-hosted deployment uses Llama-3-70B for response quality. Without speculative decoding: 30 tokens/second on 4 A100 GPUs. With Llama-3-8B as a speculative draft model (running on 1 spare A100) and K=4 draft tokens per step: acceptance rate of 73%, effective throughput of 4 × 0.73 ≈ 2.9 tokens per target step → 30 × 2.9 = ~87 tokens/second—a 2.9x speedup. Average response time drops from 5 seconds to 1.7 seconds for a 150-token response. The draft model's A100 adds $2/hour to infrastructure but the speedup allows serving 3x more concurrent users on the same target model hardware.
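The arithmetic in this example can be checked with a quick back-of-envelope calculation. This is a simplified model: it treats the acceptance rate as the average fraction of the K drafts that survive, and ignores the bonus token and the draft model's own latency.

```python
K = 4                # draft tokens per target step
acceptance = 0.73    # measured acceptance rate (from the example)
base_tps = 30        # target-only tokens/second, i.e. target steps/second
response_len = 150   # tokens in a typical response

tokens_per_step = K * acceptance       # ~2.9 tokens per target pass
spec_tps = base_tps * tokens_per_step  # ~87 tokens/second
baseline_s = response_len / base_tps   # 5.0 s without speculation
spec_s = response_len / spec_tps       # ~1.7 s with speculation
```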
Common Mistakes
- ✕ Using a draft model from a different model family than the target—the token distributions must be similar for high acceptance rates; mismatched families produce low acceptance rates and no meaningful speedup.
- ✕ Applying speculative decoding when the bottleneck is prefill rather than decode—speculative decoding only helps during the decode phase; long prompt prefill is not accelerated.
- ✕ Thinking speculative decoding changes output quality—the algorithm is designed to be mathematically equivalent to target model sampling; any quality difference indicates an implementation bug.
Related Terms
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
KV Cache
The KV cache stores the key and value attention tensors computed during the prefill phase, allowing subsequent token generation to reuse these computations rather than recomputing them for every new token.
Model Quantization
Model quantization reduces the numerical precision of LLM weights from 32-bit or 16-bit floats to 8-bit or 4-bit integers, dramatically reducing memory requirements and inference costs with minimal quality loss.
GPU Inference
GPU inference is the use of graphics processing units to run LLM predictions, leveraging their massive parallel compute capabilities to achieve the high throughput and low latency required for production AI applications.
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Ready to build your AI chatbot?
Put these concepts into practice with 99helpers — no code required.
Start free trial →