Use a small draft model to guess the next K tokens, then verify all K in parallel with the large model. Same output distribution as the large model alone — but 2-3x faster. Lossless speedup.
Suppose you're running inference on a GPT-3-scale model. Each token requires a full forward pass through the entire model: 175 billion parameters across 96 layers. The model generates one token, then feeds it back in to generate the next. One at a time. This is autoregressive decoding, and it's agonizingly slow.
The bottleneck is fundamental: generating N tokens requires N serial forward passes. You can't parallelize this naively because each token depends on all previous tokens. Token 5 can't be generated until tokens 1-4 exist. This creates a strict sequential dependency.
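The serial dependency can be made concrete with a minimal sketch. `toy_next_token_probs` is a hypothetical stand-in for one full forward pass (it is not a real model), and the vocabulary size is an arbitrary assumption. The point is only that each step needs the whole prefix before it can run.

```python
import numpy as np

VOCAB = 8  # toy vocabulary size (assumption)

def toy_next_token_probs(tokens):
    """Hypothetical stand-in for one full forward pass of a large model."""
    # Pseudo-logits derived from the whole prefix, so every step genuinely
    # depends on all earlier tokens.
    seed = sum((i + 1) * t for i, t in enumerate(tokens)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=VOCAB)
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def autoregressive_decode(prompt, n_new, rng):
    tokens = list(prompt)
    forward_passes = 0
    for _ in range(n_new):
        probs = toy_next_token_probs(tokens)  # cannot start before the prefix exists
        forward_passes += 1
        tokens.append(int(rng.choice(VOCAB, p=probs)))
    return tokens, forward_passes

tokens, passes = autoregressive_decode([1, 2, 3], n_new=5, rng=np.random.default_rng(0))
print(len(tokens), passes)  # 3 prompt tokens + 5 new ones, 5 serial passes
```

Generating N new tokens always costs N passes here; nothing in the loop can be reordered or run side by side.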
Think of it like a writing assistant. The assistant (small model) types out a rough draft of the next sentence. The expert (big model) reviews the entire sentence at once, accepting the parts that match what they would have written and correcting the parts that don't. The expert reads the whole sentence in one glance — much faster than writing it word by word.
Modern GPUs can perform trillions of operations per second, but generating one token only uses a fraction of that compute. The bottleneck is memory bandwidth: loading the model's weights from GPU memory for each token takes longer than the actual matrix multiplications. This means the GPU is mostly waiting for data, not computing. Speculative decoding amortizes this memory loading over multiple tokens.
| Operation | Time (illustrative) | Limiting factor |
|---|---|---|
| Load 175B params from HBM | ~10ms | Memory bandwidth |
| Matrix multiply (1 token) | ~1ms | Compute |
| Standard: generate 1 token | ~10ms, i.e. 10ms per token | Memory bandwidth |
| Batch verify K=5 tokens | ~12ms (almost the same), i.e. ~2.4ms per token if all 5 are accepted | Memory bandwidth |
Verifying 5 tokens costs barely more than generating 1, because the weight loading cost is amortized. This is the source of speculative decoding's speedup.
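The amortization argument can be written as a small cost model. This is a sketch under stated assumptions: the 10ms weight-loading time comes from the table, while `extra_ms_per_token` is an assumed marginal cost chosen so that K=5 reproduces the table's ~12ms. Real hardware will not be exactly linear.

```python
def verify_time_ms(k, load_ms=10.0, extra_ms_per_token=0.4):
    """Time for one target-model pass over k tokens.

    load_ms: weight-loading time, paid once per pass (from the table).
    extra_ms_per_token: assumed marginal cost per additional token.
    """
    return load_ms + k * extra_ms_per_token

for k in (1, 5):
    total = verify_time_ms(k)
    print(f"K={k}: {total:.1f} ms per pass, {total / k:.2f} ms per token if all accepted")
```

The per-token figure only holds when every drafted token is accepted; rejected drafts push the effective cost back up.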
Speculative decoding uses two models: a small, fast draft model (e.g., GPT-2 124M) and a large, accurate target model (e.g., GPT-3 175B). The algorithm alternates between drafting and verifying.
```python
# Speculative decoding loop
import numpy as np

rng = np.random.default_rng()

def sample(probs):
    """Draw one token id from a probability vector (NumPy array)."""
    return int(rng.choice(len(probs), p=probs))

def speculative_decode(target_model, draft_model, prompt, K=5, max_new_tokens=64):
    tokens = list(prompt)
    limit = len(tokens) + max_new_tokens
    while len(tokens) < limit:  # stopping rule: a fixed token budget
        # Step 1: Draft K tokens with small model
        draft_tokens, draft_probs = [], []
        for _ in range(K):
            q = draft_model.next_token_probs(tokens + draft_tokens)
            t = sample(q)
            draft_tokens.append(t)
            draft_probs.append(q)

        # Step 2: Verify all K tokens with big model (ONE forward pass)
        # target_probs: K probability distributions, one per draft position
        target_probs = target_model.forward(tokens + draft_tokens)

        # Step 3: Accept/reject each draft token, left to right
        for i in range(K):
            p = target_probs[i][draft_tokens[i]]  # target prob
            q = draft_probs[i][draft_tokens[i]]   # draft prob (> 0: it was sampled)
            if rng.random() < min(1.0, p / q):
                tokens.append(draft_tokens[i])    # accept
            else:
                # Step 4: Sample from adjusted distribution
                adjusted = np.maximum(0.0, target_probs[i] - draft_probs[i])
                tokens.append(sample(adjusted / adjusted.sum()))
                break  # restart draft from here
    return tokens[:limit]
```
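Why can one pass score all K draft positions? A minimal sketch, using a hypothetical `ToyBigramModel` whose next-token distribution depends only on the previous token (a real model attends to the full prefix). The explicit `k` argument to `forward` is an assumption of this sketch. It shows that the batched pass returns the same distributions as K separate single-position calls.

```python
import numpy as np

class ToyBigramModel:
    """Hypothetical stand-in for a language model: next-token probabilities
    depend only on the previous token."""

    def __init__(self, vocab_size, seed):
        logits = np.random.default_rng(seed).normal(size=(vocab_size, vocab_size))
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        self.table = exp / exp.sum(axis=1, keepdims=True)  # row = previous token

    def next_token_probs(self, tokens):
        """One 'forward pass' that scores a single next position."""
        return self.table[tokens[-1]]

    def forward(self, tokens, k):
        """One 'forward pass' that scores the last k positions at once.

        Row i is the distribution for tokens[-k + i], conditioned on
        everything before it. Requires len(tokens) > k.
        """
        prev = np.asarray(tokens[-k - 1:-1])
        return self.table[prev]  # shape (k, vocab_size), a single batched lookup

model = ToyBigramModel(vocab_size=6, seed=0)
prefix, draft = [2, 4], [1, 3, 5]

batched = model.forward(prefix + draft, k=len(draft))
serial = np.stack([model.next_token_probs(prefix + draft[:i]) for i in range(len(draft))])
print(np.allclose(batched, serial))  # the two routes agree
```

Because the draft tokens are already known, every position's input is available up front, which is exactly what the serial loop lacks.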
The term comes from speculative execution in CPU design. Modern CPUs predict which branch an if-statement will take and start executing instructions speculatively. If the prediction is right, the work is kept. If wrong, it's discarded. Speculative decoding does the same: the draft model "speculates" on the next K tokens, and correct speculations are kept.
The heart of speculative decoding is the rejection sampling scheme that decides whether to accept or reject each draft token. This is what makes the method lossless — the output distribution is mathematically identical to the target model alone.
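For each draft token x_i, compute the acceptance probability:

$$\Pr[\text{accept } x_i] = \min\left(1, \frac{p(x_i)}{q(x_i)}\right)$$

Here p(x_i) is the target model's probability for token x_i and q(x_i) is the draft model's probability. Accept with probability min(1, p/q). This is a modified form of rejection sampling: after a rejection, instead of redrawing from the draft distribution, the replacement is drawn from a corrected distribution.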
Case 1: p ≥ q (target likes it more). Acceptance probability = 1. Always accept. The draft model generated a token the target model likes even more.
Case 2: p < q (target likes it less). Acceptance probability = p/q < 1. Sometimes reject. The draft model was more enthusiastic about this token than the target would be.
Case 3: p = 0 (target would never generate this). Acceptance probability = 0. Always reject. The draft model generated something impossible under the target.
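The three cases can be checked with a few lines of arithmetic. The probabilities below are made-up illustrative values, not measurements.

```python
def acceptance_probability(p, q):
    """Probability of keeping a draft token with target prob p and draft prob q.

    q must be positive: the token was sampled from the draft model.
    """
    return min(1.0, p / q)

cases = {
    "Case 1 (p >= q)": (0.6, 0.3),
    "Case 2 (p < q)": (0.2, 0.8),
    "Case 3 (p = 0)": (0.0, 0.5),
}
for name, (p, q) in cases.items():
    print(f"{name}: p={p}, q={q}, accept with probability {acceptance_probability(p, q):.2f}")
```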
When a token is rejected, we sample a replacement from an adjusted distribution that corrects for the bias introduced by the draft model:

$$p'(x) = \frac{\max\big(0,\ p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\ p(x') - q(x')\big)}$$
This correction distribution samples tokens that the target model prefers but the draft model underweights. Together, the acceptance + correction scheme produces exact samples from the target distribution.
```python
# Rejection sampling for one token
import random

import numpy as np

def accept_or_reject(draft_token, target_probs, draft_probs):
    p = target_probs[draft_token]  # target probability
    q = draft_probs[draft_token]   # draft probability

    # Accept with probability min(1, p/q)
    if random.random() < min(1.0, p / q):
        return draft_token, True  # accepted

    # Rejected: sample from correction distribution
    adjusted = np.maximum(0, target_probs - draft_probs)
    adjusted /= adjusted.sum()  # normalize
    corrected_token = np.random.choice(len(adjusted), p=adjusted)
    return corrected_token, False  # rejected, corrected
```
The most remarkable property of speculative decoding: the output distribution is mathematically identical to the target model's distribution. Not approximately equal. Not "close enough." Exactly identical. Every sample from speculative decoding could have been produced by the target model alone.
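For each token position, the effective sampling distribution combines the accept-draft and reject-correct paths:

$$P(x) = q(x)\,\min\left(1, \frac{p(x)}{q(x)}\right) + \left(1 - \sum_{x'} \min\big(p(x'), q(x')\big)\right) p'(x)$$

The first term covers tokens accepted from the draft. The second term covers tokens sampled from the correction distribution p'(x) (the normalized max(0, p − q)) after a rejection, weighted by the total rejection probability. When you simplify this expression (the paper provides the full derivation), you get P(x) = p(x) for all x. The distribution is exact.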
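The simplification can be sketched in a few lines. Here β is introduced as shorthand for the total acceptance probability (a symbol not used elsewhere in this lesson), and the derivation assumes β < 1, since with β = 1 the two distributions coincide and no rejection ever occurs.

```latex
\begin{align*}
\beta &= \sum_{x'} \min\big(p(x'), q(x')\big)
  && \text{(total acceptance probability)} \\
\sum_{x'} \max\big(0,\ p(x') - q(x')\big)
  &= \sum_{x'} \Big(p(x') - \min\big(p(x'), q(x')\big)\Big) = 1 - \beta \\
P(x) &= \underbrace{q(x)\,\min\Big(1, \tfrac{p(x)}{q(x)}\Big)}_{=\ \min(p(x),\ q(x))}
      + (1 - \beta)\,\frac{\max\big(0,\ p(x) - q(x)\big)}{1 - \beta} \\
     &= \min\big(p(x), q(x)\big) + \Big(p(x) - \min\big(p(x), q(x)\big)\Big) = p(x)
\end{align*}
```

The normalizer of the correction distribution is exactly the rejection probability, so the two factors cancel and the draft distribution drops out entirely.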
Even with a terrible draft model that matches the target 0% of the time, speculative decoding never produces worse output. In the worst case, every draft token is rejected, and the correction distribution exactly reproduces the target model's output. Each iteration then yields a single token, so there is no speedup (and the wasted drafting makes it somewhat slower than standard decoding), but the quality is unchanged.
This is the critical guarantee: speculative decoding can never hurt output quality. The speedup depends on draft model quality, but the correctness is absolute.
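The guarantee can be checked numerically without any sampling. The sketch below computes the exact output distribution of one accept-or-correct step for two made-up draft distributions, a reasonable one and a useless one; the function name and the example vectors are illustrative assumptions.

```python
import numpy as np

def effective_distribution(p, q):
    """Exact output distribution of one accept-or-correct step.

    p: target probabilities, q: draft probabilities (both sum to 1).
    """
    accepted = np.minimum(p, q)         # q(x) * min(1, p(x)/q(x))
    reject_prob = 1.0 - accepted.sum()  # chance the draft token is rejected
    residual = np.maximum(0.0, p - q)
    if residual.sum() > 0:
        correction = residual / residual.sum()
    else:                               # p == q: rejection never happens
        correction = np.zeros_like(p)
    return accepted + reject_prob * correction

p = np.array([0.5, 0.3, 0.2, 0.0])

good_draft = np.array([0.4, 0.4, 0.1, 0.1])
print(np.allclose(effective_distribution(p, good_draft), p))

# Worst case: the draft only ever proposes a token the target never would.
useless_draft = np.array([0.0, 0.0, 0.0, 1.0])
print(np.allclose(effective_distribution(p, useless_draft), p))
```

In the worst case every draft token is rejected and the correction distribution is the target distribution itself, which is why quality is unaffected.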
| Draft Quality | Acceptance Rate | Speedup | Output Quality |
|---|---|---|---|
| Perfect match | 100% | ~Kx | Identical to target |
| Good match | ~70% | ~3x | Identical to target |
| Poor match | ~30% | ~1.4x | Identical to target |
| No match | 0% | 1x (no speedup) | Identical to target |
The draft model determines the speedup. A better draft model (higher acceptance rate) means more tokens accepted per verification step, which means faster inference. But the draft model must also be fast — otherwise its generation cost offsets the savings.
An ideal draft model is both fast (so drafting K tokens is cheap) and accurate (so most tokens are accepted). These goals are in tension: bigger models are more accurate but slower.
| Draft Model | Size | Speed (tok/s) | Acceptance Rate | Net Speedup |
|---|---|---|---|---|
| GPT-2 Small | 124M | 5000 | ~50% | ~1.8x |
| GPT-2 Medium | 345M | 2000 | ~65% | ~2.3x |
| GPT-2 Large | 774M | 1000 | ~75% | ~2.5x |
| Same family 1B | 1B | 800 | ~80% | ~2.8x |
Same family, smaller size. Use Llama 2 7B as the draft for Llama 2 70B. Same tokenizer, similar distribution. High acceptance rate. This is the most common approach.
Distilled model. Train a small model specifically to approximate the target. Even better acceptance rates, but requires a one-time training cost.
N-gram model. Use the prompt's own n-gram statistics to predict the next tokens. Near-zero compute cost for drafting, but low acceptance rate.
Self-drafting. Use early exit from the target model itself as the draft. No separate model needed, but harder to implement efficiently.
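Of the strategies above, n-gram drafting is simple enough to sketch in full. This is one possible variant, not a reference implementation: `ngram_draft` and its parameters are hypothetical names, and words stand in for token ids. It looks for the most recent earlier occurrence of the last `n` tokens and proposes whatever followed it.

```python
def ngram_draft(tokens, n=2, k=4):
    """Propose up to k draft tokens by copying what followed the most recent
    earlier occurrence of the last n tokens. Returns [] if there is none."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan right to left over earlier positions; skip the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []

context = "the cat sat on the mat and the cat".split()
print(ngram_draft(context, n=2, k=3))  # ['sat', 'on', 'the']
```

Such a draft is deterministic, so q is 1 for each proposed token and the acceptance probability min(1, p/q) reduces to the target probability p.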
```python
# Choosing draft model and K
def optimal_K(acceptance_rate, draft_cost, verify_cost, max_K=19):
    """
    acceptance_rate: probability that a draft token is accepted (alpha)
    draft_cost: time to generate 1 draft token
    verify_cost: time for 1 target forward pass (assumed the same for any K)
    Returns (best_K, best_speedup); (1, 1.0) means no K beat standard decoding.
    """
    alpha = acceptance_rate
    best_K = 1
    best_speedup = 1.0
    for K in range(1, max_K + 1):
        # Expected tokens emitted per iteration (accepted drafts plus the
        # corrected token on a rejection):
        # E = sum_{i=0}^{K-1} alpha^i = (1 - alpha^K) / (1 - alpha)
        if alpha == 1.0:
            expected_tokens = float(K)  # the geometric sum degenerates to K
        else:
            expected_tokens = (1 - alpha**K) / (1 - alpha)
        iteration_cost = K * draft_cost + verify_cost
        tokens_per_second = expected_tokens / iteration_cost
        baseline = 1 / verify_cost  # standard decoding: 1 token per pass
        speedup = tokens_per_second / baseline
        if speedup > best_speedup:
            best_speedup = speedup
            best_K = K
    return best_K, best_speedup
```
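The closed-form expectation (1 - alpha^K) / (1 - alpha) can be sanity-checked by simulation. The sketch assumes each draft token is accepted independently with the same probability alpha, and that a rejection emits one corrected token and ends the iteration. Both are simplifications: real acceptance rates vary by position and context.

```python
import numpy as np

def simulate_tokens_per_iteration(alpha, k, n_iterations, rng):
    """Average tokens emitted per draft-verify iteration.

    Assumes i.i.d. acceptance with probability alpha, and one corrected
    token whenever a draft token is rejected.
    """
    accepts = rng.random((n_iterations, k)) < alpha
    all_accepted = accepts.all(axis=1)
    # Accepted prefix length: index of the first rejection, or k if none.
    first_reject = np.where(all_accepted, k, np.argmin(accepts, axis=1))
    emitted = first_reject + (~all_accepted)  # +1 corrected token on rejection
    return emitted.mean()

alpha, k = 0.7, 5
closed_form = (1 - alpha**k) / (1 - alpha)
simulated = simulate_tokens_per_iteration(alpha, k, 200_000, np.random.default_rng(0))
print(f"closed form: {closed_form:.3f}, simulated: {simulated:.3f}")
```

The two numbers should agree to within sampling noise, which supports using the closed form when comparing values of K.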
A higher acceptance rate means more tokens accepted per iteration. A higher K means more potential tokens per iteration, but more wasted draft work after a rejection.
The paper evaluates speculative decoding on translation (T5-XXL, 11B) and summarization tasks, using smaller T5 variants as draft models.
| Task | Target Model | Draft Model | K | Speedup | Quality Change |
|---|---|---|---|---|---|
| Translation (EN-DE) | T5-XXL (11B) | T5-Small (60M) | 4 | 2.0x | 0% (lossless) |
| Translation (EN-DE) | T5-XXL (11B) | T5-Base (220M) | 5 | 2.5x | 0% (lossless) |
| Summarization | T5-XXL (11B) | T5-Small (60M) | 4 | 1.9x | 0% (lossless) |
| Code generation | PaLM (540B) | PaLM (62B) | 3 | 2.3x | 0% (lossless) |
Consistent 2-2.5x speedups across tasks and model sizes, with zero quality degradation. The speedup is highest when the draft model closely matches the target (same family).
Not all tokens are equally predictable. The paper finds that acceptance rates vary dramatically by token type:
| Token Type | Acceptance Rate | Why |
|---|---|---|
| Function words (the, is, of) | ~90% | Highly predictable, small/large models agree |
| Common content words | ~70% | Fairly predictable from context |
| Rare words / names | ~40% | Large model has better knowledge |
| Creative / reasoning tokens | ~30% | Small model can't match large model's reasoning |
This explains why speculative decoding works better for translation (predictable patterns) than creative writing (unpredictable choices). The more predictable the text, the higher the acceptance rate and the faster the inference.
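To see how the token mix drives the overall acceptance rate, the per-type rates from the table can be combined into a weighted average. The two mixes below are hypothetical proportions invented for illustration, not measured data.

```python
# Per-token-type acceptance rates, taken from the table above.
ACCEPTANCE = {"function": 0.90, "common": 0.70, "rare": 0.40, "creative": 0.30}

# Hypothetical token mixes (fractions sum to 1); not measured data.
MIXES = {
    "translation-like": {"function": 0.45, "common": 0.40, "rare": 0.10, "creative": 0.05},
    "creative-like": {"function": 0.30, "common": 0.25, "rare": 0.15, "creative": 0.30},
}

def blended_acceptance(mix):
    """Average acceptance rate, weighting each token type by its share."""
    return sum(share * ACCEPTANCE[kind] for kind, share in mix.items())

for name, mix in MIXES.items():
    print(f"{name}: blended acceptance ~{blended_acceptance(mix):.2f}")
```

Under these assumed mixes the translation-like text lands well above the creative-like text, which is the direction the paragraph above describes.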
Speculative decoding launched a family of inference acceleration techniques that all share the draft-verify paradigm.
| Method | Year | Relationship to Speculative Decoding |
|---|---|---|
| Speculative Decoding | 2023 | Original: small model drafts, big model verifies. Lossless. |
| Medusa | 2023 | Multiple small prediction heads instead of separate draft model. |
| EAGLE | 2024 | Auto-regressive draft head trained on target's features. Higher acceptance. |
| Lookahead Decoding | 2024 | Uses n-gram patterns from Jacobi iteration. No draft model. |
| Self-Speculative | 2024 | Early exit from target model as draft. No separate model. |
The lossless guarantee. No quality trade-off. This enabled adoption in production systems where any quality loss is unacceptable.
The insight that verification is cheap. The fundamental observation — that verifying K tokens costs nearly the same as generating 1 — applies broadly and has spawned many follow-up techniques.
Batch inference. Speculative decoding is designed for single-sequence inference. Adapting it to batched inference (serving many requests simultaneously) is an active research area.
Draft model availability. You need a compatible draft model for every target model. Universal draft models that work across targets would be more practical.
Related: Scaling Test-Time Compute, the compute-optimal framework that includes speculative decoding's speedups.