Yaniv Leviathan, Matan Kalman, Yossi Matias — Google Research — ICML 2023

Speculative Decoding: Fast Inference from Transformers

Use a small draft model to guess the next K tokens, then verify all K in parallel with the large model. Same output distribution as the large model alone — but 2-3x faster. Lossless speedup.

Prerequisites: Autoregressive decoding + Basic probability + Token-by-token generation. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: The Inference Bottleneck

You're running inference on a GPT-3-class model. Each token requires a full forward pass through the entire model: 175 billion parameters across 96 layers. The model generates one token, then feeds it back in to generate the next. One at a time. This is autoregressive decoding, and it's agonizingly slow.

The bottleneck is fundamental: generating N tokens requires N serial forward passes. You can't parallelize this naively because each token depends on all previous tokens. Token 5 can't be generated until tokens 1-4 exist. This creates a strict sequential dependency.

The insight: Even though generation is sequential, verification can be parallel. A transformer can process a sequence of K tokens and compute the probability of each token given all its predecessors — in a single forward pass. So if someone guesses the next K tokens, the big model can verify all K simultaneously. Speculative decoding exploits this asymmetry: draft (sequential, small model) then verify (parallel, big model).

Think of it like a writing assistant. The assistant (small model) types out a rough draft of the next sentence. The expert (big model) reviews the entire sentence at once, accepting the parts that match what they would have written and correcting the parts that don't. The expert reads the whole sentence in one glance — much faster than writing it word by word.
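The claim that one forward pass scores every position can be checked on a toy example. The sketch below is a single causal self-attention head in NumPy; the weights, sizes, and the `causal_attention` helper are made up for illustration and are not taken from the paper. Because of the causal mask, the output at position i from one pass over the whole sequence equals the output you would get by running the layer on just the prefix ending at i.

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention over a (T, d) sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[1])
    # Causal mask: position i may only attend to positions <= i.
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d, T = 8, 6  # toy sizes, chosen arbitrarily
x = rng.normal(size=(T, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

# One pass over all T positions.
parallel = causal_attention(x, wq, wk, wv)

# T separate passes, each over a longer prefix, keeping only the last position.
sequential = np.stack(
    [causal_attention(x[: i + 1], wq, wk, wv)[-1] for i in range(T)]
)

print(np.allclose(parallel, sequential))
```

The two results agree up to floating-point error, which is why a verifier can score K drafted positions in a single pass instead of K passes.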

Why standard decoding is memory-bound, not compute-bound

Modern GPUs can perform trillions of operations per second, but generating one token only uses a fraction of that compute. The bottleneck is memory bandwidth: loading the model's weights from GPU memory for each token takes longer than the actual matrix multiplications. This means the GPU is mostly waiting for data, not computing. Speculative decoding amortizes this memory loading over multiple tokens.

Operation | Time | Bottleneck / per-token cost
Load 175B params from HBM | ~10 ms | Memory bandwidth
Matrix multiply (1 token) | ~1 ms | Compute
Standard: generate 1 token | ~10 ms (memory-bound) | 10 ms per token
Batch verify K=5 tokens | ~12 ms (almost same!) | ~2.4 ms per token

Verifying 5 tokens costs barely more than generating 1, because the weight loading cost is amortized. This is the source of speculative decoding's speedup.
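The arithmetic behind the table can be written as a tiny cost model. This is only a sketch: it assumes a pass over k tokens costs the single-token time plus a small marginal cost per extra token, and the 0.5 ms marginal cost is a made-up value chosen so that k = 5 lands on the table's ~12 ms. The function names are illustrative.

```python
def verify_time_ms(k, single_token_ms=10.0, marginal_ms=0.5):
    """Time for one target pass over k tokens under a simple linear model."""
    return single_token_ms + (k - 1) * marginal_ms

def per_token_ms(k, **kwargs):
    """Amortized cost per token if all k verified tokens are kept."""
    return verify_time_ms(k, **kwargs) / k

for k in (1, 5):
    print(k, verify_time_ms(k), per_token_ms(k))
```

The per-token figure is a best case: it assumes every verified token is accepted, which the later chapters relax.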

Sequential vs Parallel Processing

Compare standard autoregressive decoding (one token at a time) with speculative decoding's verify-in-parallel approach. Watch how the big model processes K tokens in nearly the same time as one.

Why can a transformer verify K draft tokens in nearly the same time as generating 1 token?

Chapter 1: Draft-Then-Verify

Speculative decoding uses two models: a small, fast draft model (e.g., GPT-2 124M) and a large, accurate target model (e.g., GPT-3 175B). The algorithm alternates between drafting and verifying.

Step 1: Draft K tokens
The small model generates K tokens autoregressively (fast, ~K ms for a 124M model).
Step 2: Verify all K
The large model processes all K draft tokens in ONE forward pass, computing the probability of each.
Step 3: Accept/reject
Each draft token is accepted or rejected using a rejection sampling scheme. All tokens up to the first rejection are kept.
Step 4: Sample correction
At the first rejection point, sample a corrected token from an adjusted distribution. Continue from there.
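python
# Speculative decoding loop (simplified sketch)
import random

import numpy as np

def speculative_decode(target_model, draft_model, prompt, K=5):
    # target_model, draft_model, and is_done are placeholders for your own
    # model wrappers and stopping rule; distributions are NumPy float arrays.
    tokens = list(prompt)

    while not is_done(tokens):
        # Step 1: Draft K tokens with small model
        draft_tokens = []
        draft_probs = []
        for _ in range(K):
            q_dist = draft_model.next_token_probs(tokens + draft_tokens)
            t = int(np.random.choice(len(q_dist), p=q_dist))
            draft_tokens.append(t)
            draft_probs.append(q_dist)

        # Step 2: Verify all K tokens with big model (ONE forward pass)
        target_probs = target_model.forward(tokens + draft_tokens)
        # Assumed to return K distributions: target_probs[i] is the target's
        # distribution for draft position i, given tokens + draft_tokens[:i]

        # Step 3: Accept/reject each draft token
        for i in range(K):
            p = target_probs[i][draft_tokens[i]]  # target prob
            q = draft_probs[i][draft_tokens[i]]   # draft prob (> 0: it was sampled)

            if random.random() < min(1.0, p / q):
                tokens.append(draft_tokens[i])  # accept
            else:
                # Step 4: Sample from adjusted distribution
                adjusted = np.maximum(0.0, target_probs[i] - draft_probs[i])
                adjusted /= adjusted.sum()  # normalize
                tokens.append(int(np.random.choice(len(adjusted), p=adjusted)))
                break  # discard later drafts, restart drafting from here
        # Note: when all K drafts are accepted, the paper also samples one
        # extra token from the target's distribution for the next position.

    return tokens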

Why "speculative"?

The term comes from speculative execution in CPU design. Modern CPUs predict which branch an if-statement will take and start executing instructions speculatively. If the prediction is right, the work is kept. If wrong, it's discarded. Speculative decoding does the same: the draft model "speculates" on the next K tokens, and correct speculations are kept.

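The speedup formula: Let α be the acceptance rate (the probability that a draft token is accepted, assumed independent across positions) and K the lookahead length. The loop above produces (1 − α^K)/(1 − α) tokens per target-model pass in expectation, which approaches 1/(1 − α) as K grows. With α = 0.7 (70% acceptance), that limit is about 3.3 tokens per pass. With α = 0.5, it is 2. This is an upper bound on the speedup, because it ignores the time spent drafting. The better the draft model, the faster the inference.
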
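To see how quickly the 1/(1 − α) limit is approached, the sketch below tabulates the expected number of tokens per target pass for a few finite values of K. It assumes each draft token is accepted independently with probability α and counts the correction token, matching the loop above. The function names are illustrative, and drafting time is ignored, so these numbers are upper bounds on the speedup.

```python
def tokens_per_pass(alpha, k):
    """Expected tokens per target pass: sum of alpha**i for i in 0..k-1."""
    return sum(alpha**i for i in range(k))

def large_k_limit(alpha):
    """Limit of tokens_per_pass as k grows, valid for alpha < 1."""
    return 1.0 / (1.0 - alpha)

for alpha in (0.5, 0.7):
    row = [round(tokens_per_pass(alpha, k), 2) for k in (1, 3, 5, 10)]
    print(alpha, row, round(large_k_limit(alpha), 2))
```

Most of the benefit arrives within the first few draft tokens, which is one reason small values of K are common in practice.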
Draft-Then-Verify Pipeline

Watch the speculative decoding pipeline in action. The small model drafts tokens, the big model verifies them in one pass. Green = accepted, red = rejected.

What happens when the large model rejects a draft token?

Chapter 2: Rejection Sampling

The heart of speculative decoding is the rejection sampling scheme that decides whether to accept or reject each draft token. This is what makes the method lossless — the output distribution is mathematically identical to the target model alone.

The acceptance criterion

For each draft token x_i, compute the acceptance probability:

accept_prob = min(1, p(x_i) / q(x_i))

Here p(x_i) is the target model's probability for token x_i and q(x_i) is the draft model's probability. Accept with probability min(1, p/q). This is a modified form of rejection sampling: after a rejection, the replacement token is drawn from a correction distribution rather than from the draft model again.

Three cases

Case 1: p ≥ q (target likes it more). Acceptance probability = 1. Always accept. The draft model generated a token the target model likes even more.

Case 2: p < q (target likes it less). Acceptance probability = p/q < 1. Sometimes reject. The draft model was more enthusiastic about this token than the target would be.

Case 3: p = 0 (target would never generate this). Acceptance probability = 0. Always reject. The draft model generated something impossible under the target.

Why p/q works: This ratio captures how "surprised" the target model would be by the draft's choice. If the draft picks a token that the target also strongly prefers (p/q ≥ 1), accept it. If the draft picks a token the target considers unlikely (p/q << 1), reject it with high probability. This reweighting corrects for the difference between the draft and target distributions.
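The three cases can be checked with a one-line function. This is a minimal sketch; the name `acceptance_probability` and the example probabilities are chosen here for illustration.

```python
def acceptance_probability(p, q):
    """Probability of accepting a draft token with target prob p and draft prob q."""
    if q <= 0:
        raise ValueError("q must be positive: the token was sampled from the draft")
    return min(1.0, p / q)

print(acceptance_probability(0.6, 0.4))  # Case 1: p >= q
print(acceptance_probability(0.2, 0.4))  # Case 2: p < q
print(acceptance_probability(0.0, 0.4))  # Case 3: p = 0
```

The guard on q reflects that a drafted token always has q > 0, since it was sampled from the draft distribution in the first place.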

The correction distribution

When a token is rejected, we need to sample a replacement from an adjusted distribution that corrects for the bias introduced by the draft model:

p_adj(x) = max(0, p(x) − q(x)) / Σ_{x'} max(0, p(x') − q(x'))

This correction distribution samples tokens that the target model prefers but the draft model underweights. Together, the acceptance + correction scheme produces exact samples from the target distribution.

python
# Rejection sampling for one token
import random

import numpy as np

def accept_or_reject(draft_token, target_probs, draft_probs):
    """Return (token, accepted) for one draft token.

    target_probs and draft_probs are 1-D NumPy float arrays over the vocabulary.
    """
    p = target_probs[draft_token]  # target probability
    q = draft_probs[draft_token]   # draft probability (> 0: the token was sampled from it)

    # Accept with probability min(1, p/q)
    if random.random() < min(1.0, p / q):
        return draft_token, True  # accepted

    # Rejected: sample from correction distribution
    adjusted = np.maximum(0, target_probs - draft_probs)
    adjusted /= adjusted.sum()  # normalize
    corrected_token = int(np.random.choice(len(adjusted), p=adjusted))
    return corrected_token, False  # rejected, corrected

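As a sanity check on the claim that acceptance plus correction reproduces the target distribution, the sketch below runs the single-token scheme many times on a toy four-token vocabulary and compares the empirical frequencies with p. The distributions p and q, the sample count, and the helper name `speculative_sample` are all assumptions made for this example.

```python
import numpy as np

def speculative_sample(p, q, rng):
    """Draw one token: sample from q, accept with min(1, p/q), else correct."""
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    residual = np.maximum(0.0, p - q)
    return rng.choice(len(p), p=residual / residual.sum())

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.15, 0.05])    # toy target distribution
q = np.array([0.25, 0.25, 0.25, 0.25])  # toy draft distribution

samples = [speculative_sample(p, q, rng) for _ in range(50_000)]
empirical = np.bincount(samples, minlength=len(p)) / len(samples)

print("target   :", p)
print("empirical:", empirical.round(3))
```

The empirical frequencies should track p, not q, even though every proposal came from q.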
Acceptance Rate Visualizer

See how the acceptance criterion works for different p/q ratios. When p ≥ q (target likes the token more), always accept. When p < q, accept with probability p/q.

When is a draft token always accepted (acceptance probability = 1)?

Chapter 3: The Lossless Guarantee

The most remarkable property of speculative decoding: the output distribution is mathematically identical to the target model's distribution. Not approximately equal. Not "close enough." Exactly identical. Every sample from speculative decoding could have been produced by the target model alone.

The proof sketch

For each token position, the effective sampling distribution combines the accept-draft and reject-correct paths:

P(x) = q(x) · min(1, p(x)/q(x)) + [1 − Σ_{x'} q(x') · min(1, p(x')/q(x'))] · p_adj(x)

The first term covers tokens accepted from the draft. The second term covers tokens sampled from the correction distribution after rejection. When you simplify this expression (the paper provides the full derivation), you get P(x) = p(x) for all x. The distribution is exact.
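The simplification is short enough to write out. The steps below use only the definitions already given (acceptance probability min(1, p/q) and correction distribution proportional to max(0, p − q)) plus the fact that p sums to 1.

```latex
\begin{align*}
\text{Accept term:}\quad
  & q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right) = \min\big(p(x),\, q(x)\big) \\[4pt]
\text{Rejection probability:}\quad
  & 1 - \sum_{x'} \min\big(p(x'),\, q(x')\big)
    = \sum_{x'} \Big[\, p(x') - \min\big(p(x'),\, q(x')\big) \Big]
    = \sum_{x'} \max\big(0,\, p(x') - q(x')\big) \\[4pt]
\text{Correction term:}\quad
  & \Big[\sum_{x'} \max\big(0,\, p(x') - q(x')\big)\Big]
    \cdot \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\, p(x') - q(x')\big)}
    = \max\big(0,\, p(x) - q(x)\big) \\[4pt]
\text{Total:}\quad
  & P(x) = \min\big(p(x),\, q(x)\big) + \max\big(0,\, p(x) - q(x)\big) = p(x)
\end{align*}
```

The last line holds in both cases: if p(x) ≥ q(x) the sum is q(x) + (p(x) − q(x)), and if p(x) < q(x) it is p(x) + 0.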

Why lossless matters: Many inference acceleration techniques (quantization, pruning, distillation) trade quality for speed. Speculative decoding does not: it provides speedup with zero quality loss, because its output follows the same distribution as the target model alone. This makes it suitable for settings such as medical, legal, or financial applications, where approximation is unacceptable.

No degradation, ever

Even with a terrible draft model that matches the target 0% of the time, speculative decoding never produces worse output. In the worst case, every draft token is rejected, and the correction distribution exactly reproduces the target model's output. Each iteration then yields one token per target pass, just like standard decoding, but it also pays for K wasted draft steps, so it runs somewhat slower than standard decoding. The quality is unchanged.

This is the critical guarantee: speculative decoding can never hurt output quality. The speedup depends on draft model quality, but the correctness is absolute.
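The speed side of the worst case can be made concrete with a small cost model. This is a sketch under stated assumptions: draft tokens are accepted independently with probability α, each iteration costs K draft steps plus one target pass, and `draft_cost_ratio` (one draft step measured in units of one target pass) is a hypothetical parameter set to 0.02 here purely for illustration.

```python
def relative_speed(alpha, k, draft_cost_ratio):
    """Throughput relative to standard decoding (1.0 means the same speed).

    draft_cost_ratio: time of one draft step divided by time of one target pass.
    """
    tokens = sum(alpha**i for i in range(k))  # expected tokens per iteration
    cost = k * draft_cost_ratio + 1.0         # k draft steps + 1 target pass
    return tokens / cost

print(round(relative_speed(0.0, 5, 0.02), 3))  # no draft token ever accepted
print(round(relative_speed(0.7, 5, 0.02), 3))  # a reasonably good draft model
```

With α = 0 the result falls below 1.0, since every iteration still produces one token but pays for drafting on top. Output quality is unaffected in both cases.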

Draft Quality | Acceptance Rate | Speedup | Output Quality
Perfect match | 100% | ~Kx | Identical to target
Good match | ~70% | ~3x | Identical to target
Poor match | ~30% | ~1.4x | Identical to target
No match | 0% | ≤1x (no speedup; drafting cost is wasted) | Identical to target

Distribution Comparison

Compare the output distribution of standard decoding vs speculative decoding. They are mathematically identical. Click "Sample" to see tokens drawn from each — they follow the same distribution.

What happens if the draft model is very poor (almost no tokens match the target)?

Chapter 4: Choosing the Draft Model

The draft model determines the speedup. A better draft model (higher acceptance rate) means more tokens accepted per verification step, which means faster inference. But the draft model must also be fast — otherwise its generation cost offsets the savings.

The draft model trade-off

An ideal draft model is both fast (so drafting K tokens is cheap) and accurate (so most tokens are accepted). These goals are in tension: bigger models are more accurate but slower.

Draft Model | Size | Speed (tok/s) | Acceptance Rate | Net Speedup
GPT-2 Small | 124M | 5000 | ~50% | ~1.8x
GPT-2 Medium | 345M | 2000 | ~65% | ~2.3x
GPT-2 Large | 774M | 1000 | ~75% | ~2.5x
Same family 1B | 1B | 800 | ~80% | ~2.8x

Draft model strategies

Same family, smaller size. Use Llama 2 7B as the draft for Llama 2 70B. Same tokenizer, similar distribution. High acceptance rate. This is the most common approach.

Distilled model. Train a small model specifically to approximate the target. Even better acceptance rates, but requires a one-time training cost.

N-gram model. Use the prompt's own n-gram statistics to predict the next tokens. Near-zero compute cost for drafting, but low acceptance rate.

Self-drafting. Use early exit from the target model itself as the draft. No separate model needed, but harder to implement efficiently.
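Of the strategies above, the n-gram one is the easiest to sketch. The function below is one possible illustration, not a method from the paper: it looks for the most recent earlier occurrence of the last n tokens in the context and proposes whatever followed it. The name `ngram_draft` and its parameters are hypothetical.

```python
def ngram_draft(tokens, n=2, k=3):
    """Propose up to k draft tokens by matching the last n tokens earlier in the context."""
    if len(tokens) < n:
        return []
    suffix = tokens[-n:]
    # Scan backwards for the most recent earlier occurrence of the suffix.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            # Copy the tokens that followed that occurrence.
            return tokens[start + n:start + n + k]
    return []  # no match: nothing to propose this round

print(ngram_draft([1, 2, 3, 4, 1, 2], n=2, k=3))
print(ngram_draft([5, 6, 7], n=2, k=3))
```

Because such a drafter is deterministic, its draft distribution puts probability 1 on the proposed token, so the acceptance rule min(1, p/q) reduces to accepting with probability p.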

The optimal K (lookahead length): More draft tokens (larger K) means more potential speedup but also more wasted work, because every draft token after the first rejection is discarded. The optimal K depends on the acceptance rate α and on how cheap drafting is relative to a target pass. A rough rule of thumb is K* ≈ 1/(1-α): for α = 0.7, K* ≈ 3, and for α = 0.9, K* ≈ 10. In practice, K = 3-5 works well.

python
# Choosing draft model and K
def optimal_K(acceptance_rate, draft_cost, verify_cost):
    """
    acceptance_rate: fraction of tokens accepted
    draft_cost: time to generate 1 draft token
    verify_cost: time for 1 target forward pass (any K)
    Returns (best_K, best_speedup), with speedup relative to standard decoding.
    """
    # Expected tokens per iteration (accepted drafts plus the correction token):
    # E[tokens] = sum_{i=0}^{K-1} alpha^i = (1 - alpha^K) / (1 - alpha)
    best_K = 1
    best_speedup = 0.0  # start at 0 so the true best is reported even if it is < 1
    for K in range(1, 20):
        if acceptance_rate >= 1.0:
            expected_tokens = K  # every draft accepted; avoids division by zero
        else:
            expected_tokens = (1 - acceptance_rate**K) / (1 - acceptance_rate)
        iteration_cost = K * draft_cost + verify_cost
        tokens_per_second = expected_tokens / iteration_cost
        baseline = 1 / verify_cost  # standard decoding: 1 token per target pass
        speedup = tokens_per_second / baseline
        if speedup > best_speedup:
            best_speedup = speedup
            best_K = K
    return best_K, best_speedup

Draft Model Trade-off Explorer

Adjust the acceptance rate and K to find the optimal configuration. Higher acceptance rate = more tokens accepted per iteration. Higher K = more potential tokens but more wasted work on rejection.

Why is using a model from the same family (e.g., Llama 2 7B for Llama 2 70B) the most common draft model strategy?

Chapter 5: Results

The paper evaluates speculative decoding on translation (T5-XXL, 11B) and summarization tasks, using smaller T5 variants as draft models.

Benchmark results

Task | Target Model | Draft Model | K | Speedup | Quality Change
Translation (EN-DE) | T5-XXL (11B) | T5-Small (60M) | 4 | 2.0x | 0% (lossless)
Translation (EN-DE) | T5-XXL (11B) | T5-Base (220M) | 5 | 2.5x | 0% (lossless)
Summarization | T5-XXL (11B) | T5-Small (60M) | 4 | 1.9x | 0% (lossless)
Code generation | PaLM (540B) | PaLM (62B) | 3 | 2.3x | 0% (lossless)

Speedups of roughly 1.9x to 2.5x across the listed tasks and model sizes, with zero quality degradation. The speedup is highest when the draft model closely matches the target (same family).

2-3x speedup without retraining. Speculative decoding gives a 2-3x inference speedup with zero quality loss, no retraining of the target model, and no architecture changes. The costs are the extra memory and compute needed to run the draft model. This has made it one of the most widely adopted inference optimizations. It's now implemented in vLLM, Hugging Face Transformers (as assisted generation), and other production inference frameworks.

Acceptance rate across token types

Not all tokens are equally predictable. Acceptance rates vary widely by token type. The rough figures below are illustrative:

Token Type | Acceptance Rate | Why
Function words (the, is, of) | ~90% | Highly predictable, small/large models agree
Common content words | ~70% | Fairly predictable from context
Rare words / names | ~40% | Large model has better knowledge
Creative / reasoning tokens | ~30% | Small model can't match large model's reasoning

This explains why speculative decoding works better for translation (predictable patterns) than creative writing (unpredictable choices). The more predictable the text, the higher the acceptance rate and the faster the inference.

Speedup by Task

Compare speedups across different task types. Predictable tasks (translation) get higher speedups than creative tasks.

Why does speculative decoding give higher speedups on translation than on creative writing?

Chapter 6: Speculative Decoding Simulator

Experience speculative decoding interactively. This simulator shows the draft-verify-accept cycle in real time. Watch how tokens are drafted, verified, and either accepted (green) or rejected (red) with a corrected replacement.

Speculative Decoding Live Simulator

Click "Generate" to produce tokens. Watch the draft model propose K tokens (gray), the target model verify them (green/red), and the overall speedup accumulate. Adjust the acceptance rate to see its effect on performance.

Watch the speedup counter. It shows how many tokens were generated vs how many target model forward passes were used. With high acceptance rate, the ratio can reach 3-4x — meaning 3-4 tokens generated per forward pass of the large model. With low acceptance rate, it approaches 1x (no speedup).
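The counter's behavior can be reproduced offline with a few lines of Monte Carlo. This sketch abstracts the models away entirely: it assumes each draft token is accepted independently with probability α, and that a rejection yields one corrected token, as in the loop from Chapter 1. The function name, iteration count, and seed are arbitrary choices for the example.

```python
import random

def simulate(alpha, k, iterations=20_000, seed=0):
    """Average tokens produced per target pass over many draft-verify cycles."""
    rng = random.Random(seed)
    total_tokens = 0
    for _ in range(iterations):
        accepted = 0
        # Accept draft tokens one by one until the first rejection or k is reached.
        while accepted < k and rng.random() < alpha:
            accepted += 1
        # A rejection adds one corrected token; a fully accepted draft adds none here.
        total_tokens += accepted + (1 if accepted < k else 0)
    # Each cycle uses exactly one target pass, so this is tokens per pass.
    return total_tokens / iterations

for alpha in (0.3, 0.7, 0.9):
    print(alpha, round(simulate(alpha, k=5), 2))
```

The ratio counts tokens per target pass only. It does not include drafting time, so the wall-clock speedup is somewhat lower.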
In the simulator, what happens when you increase K (lookahead) beyond the optimal point?

Chapter 7: Connections

Speculative decoding launched a family of inference acceleration techniques that all share the draft-verify paradigm.

Method | Year | Relationship to Speculative Decoding
Speculative Decoding | 2023 | Original: small model drafts, big model verifies. Lossless.
Medusa | 2023 | Multiple small prediction heads instead of separate draft model.
EAGLE | 2024 | Auto-regressive draft head trained on target's features. Higher acceptance.
Lookahead Decoding | 2024 | Uses n-gram patterns from Jacobi iteration. No draft model.
Self-Speculative | 2024 | Early exit from target model as draft. No separate model.

What speculative decoding got right

The lossless guarantee. No quality trade-off. This enabled adoption in production systems where any quality loss is unacceptable.

The insight that verification is cheap. The fundamental observation — that verifying K tokens costs nearly the same as generating 1 — applies broadly and has spawned many follow-up techniques.

What it left open

Batch inference. Speculative decoding is designed for single-sequence inference. Adapting it to batched inference (serving many requests simultaneously) is an active research area.

Draft model availability. You need a compatible draft model for every target model. Universal draft models that work across targets would be more practical.

From inference trick to production standard. Speculative decoding moved from paper to production quickly. vLLM, TensorRT-LLM, and Hugging Face TGI all implement it, and many inference providers use some form of it. It's one of the rare ML papers that directly changed how production systems work.


Inference Acceleration Timeline

See how inference acceleration techniques evolved from speculative decoding to modern approaches.

What property makes speculative decoding unique among inference acceleration techniques?