Speculative Decoding (Leviathan 2023)

Chapter 0: The Inference Bottleneck

You're running inference on a GPT-3-class model. Each token requires a full forward pass through the entire model: 175 billion parameters, 96 layers. The model generates one token, then feeds it back in to generate the next. One at a time. This is autoregressive decoding, and it's agonizingly slow.

The bottleneck is fundamental: generating N tokens requires N serial forward passes. You can't parallelize this naively because each token depends on all previous tokens. Token 5 can't be generated until tokens 1-4 exist. This creates a strict sequential dependency.

The insight: Even though generation is sequential, verification can be parallel. A transformer can process a sequence of K tokens and compute the probability of each token given all its predecessors — in a single forward pass. So if someone guesses the next K tokens, the big model can verify all K simultaneously. Speculative decoding exploits this asymmetry: draft (sequential, small model) then verify (parallel, big model).
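To make the asymmetry concrete, here is a minimal sketch of parallel verification. It assumes a hypothetical toy "target model" that is just a bigram table (a real transformer conditions on the whole prefix, not only the previous token); the names `BIGRAM` and `forward` are illustrative and not from the paper. The point is only that one call returns a next-token distribution for every position, so all draft tokens can be scored at once.

```python
import numpy as np

VOCAB = 4
rng = np.random.default_rng(0)

# Hypothetical toy target model: row t is the distribution of the token that follows t.
BIGRAM = rng.dirichlet(np.ones(VOCAB), size=VOCAB)

def forward(tokens):
    """One 'forward pass': next-token distributions for every position at once."""
    return BIGRAM[np.asarray(tokens)]  # shape (len(tokens), VOCAB)

context = [0, 2]      # tokens already generated
draft = [1, 3, 3]     # K = 3 guessed tokens

probs = forward(context + draft)  # a single call covers all positions

# The distribution that scores draft[i] sits at the position just before it.
start = len(context) - 1
p_draft = [float(probs[start + i][tok]) for i, tok in enumerate(draft)]
print(p_draft)  # target probability of each draft token
```

Drafting the same three tokens with this model would take three separate calls, one per token; scoring them takes one.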

Think of it like a writing assistant. The assistant (small model) types out a rough draft of the next sentence. The expert (big model) reviews the entire sentence at once, accepting the parts that match what they would have written and correcting the parts that don't. The expert reads the whole sentence in one glance — much faster than writing it word by word.

Why standard decoding is memory-bound, not compute-bound

Modern GPUs can perform trillions of operations per second, but generating one token only uses a fraction of that compute. The bottleneck is memory bandwidth: loading the model's weights from GPU memory for each token takes longer than the actual matrix multiplications. This means the GPU is mostly waiting for data, not computing. Speculative decoding amortizes this memory loading over multiple tokens.

Operation	Time (illustrative)	Limiting factor
Load 175B params from HBM	~10ms	Memory bandwidth
Matrix multiply (1 token)	~1ms	Compute
Standard: generate 1 token	~10ms	Memory bandwidth (10ms per token)
Batch verify K=5 tokens	~12ms (almost the same)	Memory bandwidth (~2.4ms per token)

Verifying 5 tokens costs barely more than generating 1, because the weight loading cost is amortized. This is the source of speculative decoding's speedup.
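The arithmetic behind those numbers can be sketched with a simple cost model. This is an assumption for illustration, not a measurement: the weight load is treated as a fixed ~10ms per forward pass, and each additional token in the pass is assumed to add about 0.5ms, a value chosen only so that K = 5 reproduces the ~12ms in the table.

```python
def verify_time_ms(k, weight_load_ms=10.0, marginal_ms_per_token=0.5):
    """Assumed cost of one forward pass over k tokens: fixed load plus a small per-token extra."""
    return weight_load_ms + (k - 1) * marginal_ms_per_token

def per_token_ms(k, **kwargs):
    """Cost per token if all k tokens in the pass end up being kept."""
    return verify_time_ms(k, **kwargs) / k

for k in (1, 2, 5, 10):
    print(f"K={k:2d}  pass={verify_time_ms(k):5.1f} ms  per token={per_token_ms(k):5.2f} ms")
```

The per-token figure is a best case: it assumes every token in the pass is accepted, which is what the acceptance rate discussed later corrects for.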

Sequential vs Parallel Processing

Compare standard autoregressive decoding (one token at a time) with speculative decoding's verify-in-parallel approach. Watch how the big model processes K tokens in nearly the same time as one.

Why can a transformer verify K draft tokens in nearly the same time as generating 1 token?

Because the bottleneck is loading model weights from memory, not computation: loading 175B parameters takes ~10ms regardless of whether you process 1 token or K tokens, so verifying K tokens amortizes the memory bandwidth cost
Because the model uses less computation for verification
Because verification uses a smaller model

Chapter 1: Draft-Then-Verify

Speculative decoding uses two models: a small, fast draft model (e.g., GPT-2 124M) and a large, accurate target model (e.g., GPT-3 175B). The algorithm alternates between drafting and verifying.

Step 1: Draft K tokens

The small model generates K tokens autoregressively (fast, ~K ms for a 124M model).

↓

Step 2: Verify all K

The large model processes all K draft tokens in ONE forward pass, computing the probability of each.

↓

Step 3: Accept/reject

Each draft token is accepted or rejected using a rejection sampling scheme. All tokens up to the first rejection are kept.

↓

Step 4: Sample correction

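At the first rejection point, sample a corrected token from an adjusted distribution. If all K draft tokens are accepted, sample one extra token from the target's distribution at the next position, which the same forward pass already computed. Continue from there.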

python
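# Speculative decoding loop (sketch: the model objects and is_done are placeholders)
import numpy as np

rng = np.random.default_rng()

def sample(probs):
    """Draw one token id from a probability vector."""
    return int(rng.choice(len(probs), p=probs))

def speculative_decode(target_model, draft_model, prompt, K=5):
    tokens = list(prompt)

    while not is_done(tokens):
        # Step 1: Draft K tokens with small model
        draft_tokens = []
        draft_probs = []
        for _ in range(K):
            q = draft_model.next_token_probs(tokens + draft_tokens)
            t = sample(q)
            draft_tokens.append(t)
            draft_probs.append(q)

        # Step 2: Verify all K tokens with big model (ONE forward pass)
        target_probs = target_model.forward(tokens + draft_tokens)
        # target_probs is assumed to hold K+1 distributions (NumPy arrays):
        # entry i scores draft position i, and entry K is the distribution
        # for the token that follows all K drafts.

        # Step 3: Accept/reject each draft token
        for i in range(K):
            p = target_probs[i][draft_tokens[i]]  # target prob
            q = draft_probs[i][draft_tokens[i]]   # draft prob (> 0: the token was sampled from it)

            if rng.random() < min(1.0, p / q):
                tokens.append(draft_tokens[i])  # accept
            else:
                # Step 4: Sample from adjusted distribution
                adjusted = np.maximum(0.0, target_probs[i] - draft_probs[i])
                tokens.append(sample(adjusted / adjusted.sum()))
                break  # discard later drafts; restart drafting from here
        else:
            # All K drafts accepted: take one extra token from the same pass
            tokens.append(sample(target_probs[K]))

    return tokens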

Why "speculative"?

The term comes from speculative execution in CPU design. Modern CPUs predict which branch an if-statement will take and start executing instructions speculatively. If the prediction is right, the work is kept. If wrong, it's discarded. Speculative decoding does the same: the draft model "speculates" on the next K tokens, and correct speculations are kept.

The speedup formula: If each draft token is accepted with probability α (the acceptance rate) and K is the lookahead length, one target forward pass yields (1 − α^(K+1)) / (1 − α) tokens in expectation, which approaches 1/(1−α) as K grows. Ignoring the cost of drafting, that ratio is the speedup: with α = 0.7 (70% acceptance) the limit is ~3.3x, and with α = 0.5 it is ~2x. The better the draft model, the faster the inference.
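The 1/(1−α) limit can be checked numerically. The sketch below assumes each draft token is accepted independently with probability α and counts the one extra token (corrected or bonus) that every verification pass emits; the function name `expected_tokens_per_pass` is illustrative. It ignores drafting cost, so it is an upper bound on the real speedup.

```python
def expected_tokens_per_pass(alpha, k):
    """Expected tokens produced per target forward pass: sum of alpha**i for i = 0..k."""
    if alpha >= 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

for alpha in (0.5, 0.7):
    row = ", ".join(f"K={k}: {expected_tokens_per_pass(alpha, k):.2f}" for k in (1, 3, 5, 10, 50))
    print(f"alpha={alpha}: {row}  (limit {1 / (1 - alpha):.2f})")
```

Most of the gain arrives at small K: the curve flattens quickly, which is why lookahead values of 3 to 5 are common.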

Draft-Then-Verify Pipeline

Watch the speculative decoding pipeline in action. The small model drafts tokens, the big model verifies them in one pass. Green = accepted, red = rejected.

What happens when the large model rejects a draft token?

All draft tokens before the rejection are kept, a corrected token is sampled from an adjusted distribution at the rejection point, and all subsequent draft tokens are discarded; drafting restarts from the corrected token
All draft tokens are discarded and regenerated
Only the rejected token is replaced

Chapter 2: Rejection Sampling

The heart of speculative decoding is the rejection sampling scheme that decides whether to accept or reject each draft token. This is what makes the method lossless — the output distribution is mathematically identical to the target model alone.

The acceptance criterion

For each draft token x_i, compute the acceptance probability:

accept_prob = min(1, p(x_i) / q(x_i))

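Where p(x_i) is the target model's probability for token x_i and q(x_i) is the draft model's probability. Accept with probability min(1, p/q). This is a modified form of rejection sampling (the paper calls the procedure speculative sampling): after a rejection it does not redraw from q but samples from a corrected distribution, described below.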

Three cases

Case 1: p ≥ q (target likes it more). Acceptance probability = 1. Always accept. The draft model generated a token the target model likes even more.

Case 2: p < q (target likes it less). Acceptance probability = p/q < 1. Sometimes reject. The draft model was more enthusiastic about this token than the target would be.

Case 3: p = 0 (target would never generate this). Acceptance probability = 0. Always reject. The draft model generated something impossible under the target.

Why p/q works: This ratio captures how "surprised" the target model would be by the draft's choice. If the draft picks a token that the target also strongly prefers (p/q ≥ 1), accept it. If the draft picks a token the target considers unlikely (p/q << 1), reject it with high probability. This reweighting corrects for the difference between the draft and target distributions.
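The three cases reduce to one line of arithmetic. A minimal sketch, with the probabilities chosen purely for illustration:

```python
def acceptance_probability(p, q):
    """min(1, p/q) for a token that the draft model sampled (so q > 0)."""
    if q <= 0:
        raise ValueError("q must be positive: the draft token was sampled from q")
    return min(1.0, p / q)

cases = {
    "Case 1 (p >= q)": (0.6, 0.4),
    "Case 2 (p < q)": (0.2, 0.4),
    "Case 3 (p = 0)": (0.0, 0.4),
}
for name, (p, q) in cases.items():
    print(f"{name}: p={p}, q={q}, accept with probability {acceptance_probability(p, q):.2f}")
```

Case 3 is the limiting form of Case 2: as p shrinks toward zero, so does the chance of acceptance.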

The correction distribution

When a token is rejected, we need to sample a replacement from an adjusted distribution that corrects for the bias introduced by the draft model:

p_adj(x) = normalize(max(0, p(x) − q(x)))

This correction distribution samples tokens that the target model prefers but the draft model underweights. Together, the acceptance + correction scheme produces exact samples from the target distribution.

python
# Rejection sampling for one token
import random
import numpy as np

def accept_or_reject(draft_token, target_probs, draft_probs):
    """target_probs and draft_probs are 1-D NumPy float arrays over the vocabulary."""
    p = target_probs[draft_token]  # target probability
    q = draft_probs[draft_token]   # draft probability (> 0: the token was sampled from it)

    # Accept with probability min(1, p/q)
    if random.random() < min(1.0, p / q):
        return draft_token, True  # accepted

    # Rejected: sample from correction distribution
    adjusted = np.maximum(0, target_probs - draft_probs)
    adjusted = adjusted / adjusted.sum()  # normalize (sum > 0 whenever a rejection can occur)
    corrected_token = int(np.random.choice(len(adjusted), p=adjusted))
    return corrected_token, False  # rejected, corrected
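One way to build confidence in the scheme is to run it many times and compare the empirical token frequencies with the target distribution. The sketch below is self-contained and uses two made-up distributions over a three-token vocabulary; `speculative_sample` is an illustrative name for a single draft-then-accept-or-correct step.

```python
import numpy as np

def speculative_sample(p, q, rng):
    """Draw from q, then accept or correct, so that the result is distributed as p."""
    x = rng.choice(len(q), p=q)
    if rng.random() < min(1.0, p[x] / q[x]):
        return int(x)
    residual = np.maximum(0.0, p - q)
    return int(rng.choice(len(p), p=residual / residual.sum()))

rng = np.random.default_rng(0)
p = np.array([0.5, 0.3, 0.2])   # target distribution (illustrative)
q = np.array([0.2, 0.2, 0.6])   # draft distribution (illustrative, deliberately mismatched)

n = 20_000
samples = [speculative_sample(p, q, rng) for _ in range(n)]
empirical = np.bincount(samples, minlength=len(p)) / n
print("target   :", p)
print("empirical:", np.round(empirical, 3))
```

Even though every sample starts from the badly mismatched q, the empirical frequencies should track p up to sampling noise.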

Acceptance Rate Visualizer

See how the acceptance criterion works for different p/q ratios. When p ≥ q (target likes the token more), always accept. When p < q, accept with probability p/q. With the default settings, p (target) = 0.60 and q (draft) = 0.40, so p ≥ q and the draft token is always accepted.

When is a draft token always accepted (acceptance probability = 1)?

When p ≥ q: the target model assigns equal or higher probability to the token than the draft model, meaning the target model "agrees with" or "likes even more" the draft's choice
When the token is a common word
When the draft model is confident

Chapter 3: The Lossless Guarantee

The most remarkable property of speculative decoding: the output distribution is mathematically identical to the target model's distribution. Not approximately equal. Not "close enough." Exactly identical. Every sample from speculative decoding could have been produced by the target model alone.

The proof sketch

For each token position, the effective sampling distribution combines the accept-draft and reject-correct paths:

P(x) = q(x) · min(1, p(x)/q(x)) + [1 − ∑_x' q(x') min(1, p(x')/q(x'))] · p_adj(x)

The first term covers tokens accepted from the draft. The second term covers tokens sampled from the correction distribution after rejection. When you simplify this expression (the paper provides the full derivation), you get P(x) = p(x) for all x. The distribution is exact.
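The simplification is short enough to write out. The steps below use only the definitions already given (the acceptance probability min(1, p/q) and the correction distribution p_adj), plus the fact that p sums to 1:

```latex
\begin{align*}
q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right) &= \min\big(p(x),\, q(x)\big) \\[4pt]
1 - \sum_{x'} \min\big(p(x'),\, q(x')\big)
  &= \sum_{x'} \Big( p(x') - \min\big(p(x'),\, q(x')\big) \Big)
   = \sum_{x'} \max\big(0,\, p(x') - q(x')\big) \\[4pt]
p_{\mathrm{adj}}(x) &= \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\, p(x') - q(x')\big)} \\[4pt]
P(x) &= \min\big(p(x),\, q(x)\big) + \max\big(0,\, p(x) - q(x)\big) = p(x)
\end{align*}
```

The rejection probability in the second line is exactly the normalizer of p_adj, so the two cancel. If p = q everywhere, the rejection probability is zero and the second term vanishes, which leaves P(x) = p(x) as well.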

Why lossless matters: Many inference acceleration techniques (quantization, pruning, distillation) trade quality for speed. Speculative decoding does not: it changes how samples are drawn, not the distribution they are drawn from, so the speedup comes with no loss in output quality. That makes it a candidate for settings such as medical, legal, or financial applications, where approximating the target model is unacceptable.

No degradation, ever

Even with a terrible draft model whose tokens the target never accepts, speculative decoding never produces worse output. In the worst case, every draft token is rejected at the first position, and the correction distribution exactly reproduces the target model's output. Each iteration then yields one token per target forward pass, as in standard decoding, but it also pays for the wasted drafts, so it runs somewhat slower than standard decoding while the quality is unchanged.

This is the critical guarantee: a bad draft model can cost speed, but it can never cost quality. The speedup depends on draft model quality; the correctness does not.
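The worst case is easy to verify by hand or in code. In this sketch the draft and target distributions are made up and have disjoint support, so the draft can never propose a token the target would emit:

```python
import numpy as np

p = np.array([0.7, 0.3, 0.0, 0.0])  # target puts all mass on tokens 0 and 1 (illustrative)
q = np.array([0.0, 0.0, 0.5, 0.5])  # draft puts all mass on tokens 2 and 3 (illustrative)

# Probability that a drafted token is accepted: sum over x of q(x) * min(1, p(x)/q(x)).
accept = np.array([min(1.0, p[x] / q[x]) if q[x] > 0 else 0.0 for x in range(len(p))])
p_accept_any = float(np.sum(q * accept))

# Correction distribution used after the (certain) rejection.
residual = np.maximum(0.0, p - q)
correction = residual / residual.sum()

print("acceptance probability:", p_accept_any)
print("correction distribution:", correction)
```

Every draft is rejected, and the correction distribution is the target distribution itself, so the output is sampled exactly as the target alone would sample it.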

Draft Quality	Acceptance Rate	Speedup (ideal, ignoring draft cost)	Output Quality
Perfect match	100%	up to ~(K+1)x	Identical to target
Good match	~70%	~3x	Identical to target
Poor match	~30%	~1.4x	Identical to target
No match	0%	1x at best (no speedup)	Identical to target

Distribution Comparison

Compare the output distribution of standard decoding vs speculative decoding. They are mathematically identical. Click "Sample" to see tokens drawn from each — they follow the same distribution.

What happens if the draft model is very poor (almost no tokens match the target)?

Almost every draft token is rejected, and the correction distribution reproduces the target model's exact output. The speedup falls to about 1x or slightly below (the drafts are wasted work), but the output quality is still identical to the target model, never worse
The output quality degrades
The algorithm crashes

Chapter 4: Choosing the Draft Model

The draft model determines the speedup. A better draft model (higher acceptance rate) means more tokens accepted per verification step, which means faster inference. But the draft model must also be fast — otherwise its generation cost offsets the savings.

The draft model trade-off

An ideal draft model is both fast (so drafting K tokens is cheap) and accurate (so most tokens are accepted). These goals are in tension: bigger models are more accurate but slower.

Draft Model	Size	Speed (tok/s)	Acceptance Rate	Net Speedup
GPT-2 Small	124M	5000	~50%	~1.8x
GPT-2 Medium	345M	2000	~65%	~2.3x
GPT-2 Large	774M	1000	~75%	~2.5x
Same family 1B	1B	800	~80%	~2.8x

Draft model strategies

Same family, smaller size. Use Llama-2-7B as the draft for Llama-2-70B. Same tokenizer, similar distribution. High acceptance rate. This is the most common approach.

Distilled model. Train a small model specifically to approximate the target. Even better acceptance rates, but requires a one-time training cost.

N-gram model. Use the prompt's own n-gram statistics to predict the next tokens. Near-zero compute cost for drafting, but a low acceptance rate unless the output repeats text from the prompt.

Self-drafting. Use early exit from the target model itself as the draft. No separate model needed, but harder to implement efficiently.
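Of the four strategies, n-gram drafting is the simplest to sketch. The function below is one possible reading of the idea, not a reference implementation: it finds the most recent earlier occurrence of the last n tokens and proposes whatever followed them. The name `ngram_draft` and its parameters are illustrative.

```python
def ngram_draft(tokens, n=2, k=3):
    """Propose up to k draft tokens by copying what followed the last n-gram earlier in the text."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    # Scan backwards so the most recent earlier match wins.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tokens[start:start + n] == suffix:
            return tokens[start + n:start + n + k]
    return []  # no match: fall back to ordinary decoding for this step

history = [1, 2, 3, 4, 1, 2]
print(ngram_draft(history))          # tokens that followed the earlier "1, 2"
print(ngram_draft([5, 6, 7, 8]))     # no repeated bigram, so no draft
```

The proposals are still verified by the target model like any other draft, so a wrong guess costs time but not correctness.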

The optimal K (lookahead length): More draft tokens (larger K) means more potential speedup but also more wasted work when tokens are rejected, since everything after the first rejection is discarded. A rough heuristic ties the optimal K to the acceptance rate α: K* ≈ 1/(1-α). For α = 0.7, K* ≈ 3. For α = 0.9, K* ≈ 10. The true optimum also depends on how cheap the draft model is relative to the target. In practice, K = 3-5 works well.

python
# Choosing draft model and K
def optimal_K(acceptance_rate, draft_cost, verify_cost):
    """
    acceptance_rate: probability that a draft token is accepted (alpha)
    draft_cost: time to generate 1 draft token
    verify_cost: time for 1 target forward pass (assumed the same for any K)
    Returns (best_K, best_speedup); best_K = 0 means plain decoding is fastest.
    """
    alpha = acceptance_rate
    best_K = 0
    best_speedup = 1.0
    for K in range(1, 20):
        # Expected tokens per iteration, counting the corrected or bonus token:
        # E[tokens] = sum_{i=0}^{K} alpha^i = (1 - alpha^(K+1)) / (1 - alpha)
        if alpha >= 1.0:
            expected_tokens = K + 1  # avoid dividing by zero when every draft is accepted
        else:
            expected_tokens = (1 - alpha**(K + 1)) / (1 - alpha)
        iteration_cost = K * draft_cost + verify_cost
        tokens_per_second = expected_tokens / iteration_cost
        baseline = 1 / verify_cost  # standard decoding: 1 token per target pass
        speedup = tokens_per_second / baseline
        if speedup > best_speedup:
            best_speedup = speedup
            best_K = K
    return best_K, best_speedup
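The same trade-off can be written in terms of a single cost ratio c = draft_cost / verify_cost, which gives an expected walltime speedup of (1 − α^(K+1)) / ((1 − α)(K·c + 1)). The sketch below follows that form; it assumes independent acceptances with probability α and a verification cost that does not grow with K. The names `walltime_speedup` and `best_lookahead` are illustrative.

```python
def walltime_speedup(alpha, k, cost_ratio):
    """Expected speedup over standard decoding for lookahead k and cost ratio c = draft/target."""
    expected_tokens = (k + 1) if alpha >= 1.0 else (1 - alpha ** (k + 1)) / (1 - alpha)
    return expected_tokens / (k * cost_ratio + 1)

def best_lookahead(alpha, cost_ratio, max_k=20):
    """Lookahead in 1..max_k with the highest expected speedup."""
    return max(range(1, max_k + 1), key=lambda k: walltime_speedup(alpha, k, cost_ratio))

for alpha in (0.5, 0.7, 0.9):
    k = best_lookahead(alpha, cost_ratio=0.05)
    print(f"alpha={alpha}: best K={k}, speedup={walltime_speedup(alpha, k, 0.05):.2f}x")

# A draft model that is never accepted makes decoding slower, not faster.
print(f"alpha=0.0, K=5: {walltime_speedup(0.0, 5, 0.1):.2f}x")
```

Raising the cost ratio pulls the best K down, and raising the acceptance rate pushes it up, which matches the intuition in the paragraph above.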

Draft Model Trade-off Explorer

Adjust the acceptance rate and K to find the optimal configuration. Higher acceptance rate = more tokens accepted per iteration. Higher K = more potential tokens but more wasted work on rejection. Defaults: acceptance rate α = 0.70, K (lookahead) = 5.

Why is using a model from the same family (e.g., Llama-2-7B for Llama-2-70B) such a popular draft model strategy?

Because same-family models share the same tokenizer and similar learned distributions, giving high acceptance rates: the smaller model's token predictions closely match the larger model's, maximizing accepted tokens per iteration
Because they use less memory
Because they're easier to download

Chapter 5: Results

The paper evaluates speculative decoding on translation (T5-XXL, 11B) and summarization tasks, using smaller T5 variants as draft models.

Benchmark results

Task	Target Model	Draft Model	K	Speedup	Quality Change
Translation (EN-DE)	T5-XXL (11B)	T5-Small (60M)	4	2.0x	0% (lossless)
Translation (EN-DE)	T5-XXL (11B)	T5-Base (220M)	5	2.5x	0% (lossless)
Summarization	T5-XXL (11B)	T5-Small (60M)	4	1.9x	0% (lossless)
Code generation	PaLM (540B)	PaLM (62B)	3	2.3x	0% (lossless)

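The rows above show speedups of roughly 2-2.5x with no change in output quality; the paper's own headline figure is a 2x-3x speedup for T5-XXL. Exact values depend on the draft model, K, and sampling temperature. The speedup is highest when the draft model closely matches the target while remaining cheap to run.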

2-3x speedup for free. Speculative decoding gives a 2-3x inference speedup with no quality loss, no retraining of the target model, and modest code changes, provided a suitable draft model exists. That combination has made it one of the most widely adopted inference optimizations: it is implemented in vLLM, HuggingFace Transformers, and other production inference frameworks.

Acceptance rate across token types

Not all tokens are equally predictable, and acceptance rates vary widely by token type. The figures below are illustrative:

Token Type	Acceptance Rate	Why
Function words (the, is, of)	~90%	Highly predictable, small/large models agree
Common content words	~70%	Fairly predictable from context
Rare words / names	~40%	Large model has better knowledge
Creative / reasoning tokens	~30%	Small model can't match large model's reasoning

This explains why speculative decoding works better for translation (predictable patterns) than creative writing (unpredictable choices). The more predictable the text, the higher the acceptance rate and the faster the inference.

Speedup by Task

Compare speedups across different task types. Predictable tasks (translation) get higher speedups than creative tasks.

Why does speculative decoding give higher speedups on translation than on creative writing?

Because translation produces more predictable token sequences: function words and common patterns have high acceptance rates (~90%), while creative/reasoning tokens are less predictable (~30%), so the draft model matches better for translation
Because translation is easier
Because translation uses shorter sequences

Chapter 6: Speculative Decoding Simulator

Experience speculative decoding interactively. This simulator shows the draft-verify-accept cycle in real time. Watch how tokens are drafted, verified, and either accepted (green) or rejected (red) with a corrected replacement.

Speculative Decoding Live Simulator

Click "Generate" to produce tokens. Watch the draft model propose K tokens (gray), the target model verify them (green/red), and the overall speedup accumulate. Adjust the acceptance rate to see its effect on performance. Defaults: acceptance rate 70%, K (lookahead) = 5.

Watch the speedup counter. It shows how many tokens were generated vs how many target model forward passes were used. With high acceptance rate, the ratio can reach 3-4x — meaning 3-4 tokens generated per forward pass of the large model. With low acceptance rate, it approaches 1x (no speedup).
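The counter's behavior can be reproduced offline with a few lines of Monte Carlo. This is a simplified stand-in for the simulator, not its actual code: each draft token is assumed to be accepted independently with probability α, and every target pass emits one extra token (the correction, or a bonus token if all K drafts pass).

```python
import random

def simulate_tokens_per_pass(alpha, k, iterations, seed=0):
    """Average number of tokens produced per target forward pass."""
    rng = random.Random(seed)
    total_tokens = 0
    for _ in range(iterations):
        accepted = 0
        # Accept drafts one by one until the first rejection or until all k pass.
        while accepted < k and rng.random() < alpha:
            accepted += 1
        total_tokens += accepted + 1  # +1 for the corrected or bonus token
    return total_tokens / iterations

alpha, k = 0.7, 5
measured = simulate_tokens_per_pass(alpha, k, iterations=20_000)
predicted = (1 - alpha ** (k + 1)) / (1 - alpha)
print(f"measured {measured:.2f} tokens/pass, predicted {predicted:.2f}")
```

The measured ratio should sit close to the closed-form prediction; raising α toward 0.9 with a larger K moves it into the range the simulator shows for high acceptance rates.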

In the simulator, what happens when you increase K (lookahead) beyond the optimal point?

The extra draft tokens are increasingly likely to be discarded (the i-th draft token survives only if all earlier ones were accepted, which happens with probability α^i), so the drafting cost grows but the number of accepted tokens plateaus, wasting compute on draft tokens that will be thrown away
Speedup always increases with K
Output quality degrades

Chapter 7: Connections

Speculative decoding launched a family of inference acceleration techniques that all share the draft-verify paradigm.

Method	Year	Relationship to Speculative Decoding
Speculative Decoding	2023	Original: small model drafts, big model verifies. Lossless.
Medusa	2023	Multiple small prediction heads instead of separate draft model.
EAGLE	2024	Auto-regressive draft head trained on target's features. Higher acceptance.
Lookahead Decoding	2024	Uses n-gram patterns from Jacobi iteration. No draft model.
Self-Speculative	2024	Early exit from target model as draft. No separate model.

What speculative decoding got right

The lossless guarantee. No quality trade-off. This enabled adoption in production systems where any quality loss is unacceptable.

The insight that verification is cheap. The fundamental observation — that verifying K tokens costs nearly the same as generating 1 — applies broadly and has spawned many follow-up techniques.

What it left open

Batch inference. Speculative decoding is designed for single-sequence inference. Adapting it to batched inference (serving many requests simultaneously) is an active research area.

Draft model availability. You need a compatible draft model for every target model. Universal draft models that work across targets would be more practical.

From inference trick to production standard. Speculative decoding moved from paper to production quickly. vLLM, TensorRT-LLM, and HuggingFace TGI all implement it, and many inference providers use some form of it. It's one of the rare ML papers that directly changed how production systems work.


Inference Acceleration Timeline

See how inference acceleration techniques evolved from speculative decoding to modern approaches.

What property makes speculative decoding unique among inference acceleration techniques?

It is provably lossless: the output distribution is mathematically identical to the target model alone, unlike quantization, pruning, or distillation, which all trade quality for speed
It is the fastest acceleration technique
It doesn't require any additional models

Source paper: Leviathan, Kalman, and Matias, "Fast Inference from Transformers via Speculative Decoding" (ICML 2023).