Scaling Test-Time Compute (Snell 2024)

Chapter 0: The Scaling Question

You have a fixed compute budget for inference. You can either: (A) run a 14x larger model once, or (B) run a small model many times and pick the best answer. Which is better?

Before this paper, the conventional wisdom was "bigger model = better answers." Training-time compute scaling (Chinchilla scaling laws) showed that investing in larger models and more training data consistently improves performance. But nobody had formally studied whether the same principle applies at inference time.

This paper asks: given a fixed compute budget at inference, what's the optimal way to spend it? The answer is surprising: it depends on the problem difficulty. For easy problems, a larger model is better. For hard problems, a smaller model with more test-time compute can win.

The core question: Is it more compute-efficient to train a bigger model or to use a smaller model that "thinks harder" at inference time? The answer: for hard enough problems, test-time compute scaling is more efficient than model scaling. A Llama-8B model with optimal test-time compute can match or beat a Llama-70B model on difficult math problems.

This result has profound implications. It suggests that the future of AI isn't just bigger models — it's smarter inference. Instead of spending billions on training ever-larger models, we can invest compute at the point of use, adapting to each problem's difficulty.

The two scaling axes

Axis	When It's Used	Example
Training-time scaling	Before deployment (one-time cost)	Train GPT-4 on 13T tokens
Test-time scaling	At inference (per-query cost)	Generate 100 solutions, verify, pick best

Model Size vs Test-Time Compute

Compare two approaches: running a big model once vs running a small model many times with extra compute. Click to see which wins for different problem difficulties.

What does this paper demonstrate about the relationship between model size and inference compute?

For hard enough problems, a smaller model with optimal test-time compute (sampling more, verifying, searching) can match or beat a much larger model thinking once — the optimal strategy depends on problem difficulty Bigger models are always better regardless of compute budget Test-time compute never helps

Chapter 1: Two Strategies

The paper identifies two fundamentally different ways to spend test-time compute: parallel sampling (generate many independent solutions) and sequential revision (iteratively improve one solution). Each has different scaling properties.

Strategy 1: Parallel sampling (search)

Generate N independent solutions, score each with a verifier (reward model or process reward model), and select the best. This is the same as best-of-N sampling. The compute cost scales linearly with N.

accuracy(N) ≈ 1 − (1 − p)^N

Where p is the per-attempt success probability. With diminishing returns: doubling N gives less than doubling accuracy. This follows the "coverage" curve — eventually, you've sampled enough that adding more doesn't help.

Strategy 2: Sequential revision

Generate one solution, then iteratively refine it. At each revision step, the model sees its previous attempt and feedback from a verifier. The model can fix errors, try different approaches, or elaborate on partial solutions. This is analogous to a student revising an essay based on feedback.

python
# Strategy 1: Parallel sampling (best-of-N)
def parallel_search(model, verifier, problem, N):
    solutions = [model.generate(problem) for _ in range(N)]
    scores = [verifier.score(s) for s in solutions]
    return solutions[argmax(scores)]
    # Cost: N * (generate_cost + verify_cost)

# Strategy 2: Sequential revision
def sequential_revise(model, verifier, problem, T):
    solution = model.generate(problem)
    for _ in range(T):
        feedback = verifier.critique(solution)
        solution = model.revise(problem, solution, feedback)
    return solution
    # Cost: (T+1) * generate_cost + T * verify_cost

Parallel vs sequential trade-off: Parallel sampling explores breadth — many independent attempts at the problem. Sequential revision explores depth — one attempt, refined multiple times. Parallel is better when the model has a reasonable chance of getting it right on any single try (high p). Sequential is better when errors are systematic and can be corrected with feedback.

The hybrid approach

The paper also considers hybrids: generate M initial solutions, then revise each T times, for a total of M * (T+1) forward passes. The optimal M and T depend on the problem difficulty and the quality of the verifier.

Parallel vs Sequential

Compare parallel sampling (breadth) vs sequential revision (depth) for the same compute budget. Adjust the compute budget and see which strategy wins.

Compute budget 10x

When is parallel sampling (best-of-N) a better strategy than sequential revision?

When the model has a reasonable per-attempt success probability — parallel sampling explores many independent approaches, which helps when correct solutions are findable but not guaranteed on any single try Always, because more samples is always better Never, because revision is always more efficient

Chapter 2: Search with Verifiers

The quality of test-time compute scaling depends critically on the verifier — the reward model that scores solutions. A perfect verifier makes best-of-N scale optimally. A poor verifier limits how much test-time compute helps.

Verifier types

Verifier	What It Scores	Quality	Cost
Majority voting	Answer frequency	Low	Free
ORM (outcome RM)	Final answer correctness	Medium	Moderate
PRM (process RM)	Each reasoning step	High	Expensive
Oracle	Ground truth	Perfect	Impossible in practice

The paper compares these verifiers and shows that the verifier quality is the bottleneck for test-time scaling. With an oracle verifier, accuracy scales smoothly with N. With majority voting, it plateaus quickly. PRMs are the sweet spot — high enough quality to enable meaningful scaling without requiring ground truth.

The verifier bottleneck: Test-time compute scaling is limited by verifier quality, not generator quality. Once you have enough samples that at least one is correct (high coverage), the bottleneck becomes: can the verifier identify which one is correct? A weak verifier can't, so adding more samples doesn't help. A strong PRM can, so accuracy continues to improve with N.

Search with tree-structured reasoning

The paper also studies beam search over reasoning steps (using a PRM as the value function). Instead of generating complete solutions and scoring them, the model generates step by step, with the PRM evaluating each step. Bad branches are pruned early, saving compute.

python
# Beam search with PRM
def beam_search_prm(model, prm, problem, beam_width=5, max_steps=10):
    beams = [("", 1.0)]  # (partial solution, cumulative score)

    for step in range(max_steps):
        candidates = []
        for sol, score in beams:
            # Generate K continuations for each beam
            for _ in range(3):
                next_step = model.generate_step(problem, sol)
                step_score = prm.score_step(problem, sol + next_step)
                candidates.append((sol + next_step, score * step_score))

        # Keep top beam_width candidates
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_width]

    return beams[0][0]  # highest-scoring complete solution

Verifier Quality Impact

See how verifier quality affects test-time scaling. With a perfect verifier, accuracy grows smoothly. With a weak verifier, it plateaus. Select verifier type to compare.

What is the main bottleneck for scaling test-time compute?

Verifier quality — once enough samples ensure at least one correct solution exists (high coverage), the bottleneck becomes whether the verifier can identify which solution is correct — weak verifiers plateau quickly while strong PRMs enable continued scaling GPU memory The generator model's size

Chapter 3: Sequential Revision

The paper studies a second test-time strategy: instead of sampling many solutions in parallel, generate one solution and revise it iteratively. At each revision, the model sees its previous attempt plus verifier feedback and produces an improved version.

How revision works

The revision model is a fine-tuned LLM that takes (problem, previous_solution, feedback) as input and generates an improved solution. The feedback can come from a PRM ("Step 3 has an error"), a simple correctness check ("Answer is wrong, try again"), or the model's own self-evaluation.

Revision vs sampling: complementary strengths

Sampling excels at exploration. Each sample is independent, so N samples explore N different regions of solution space. If the correct approach exists somewhere in the model's distribution, enough samples will find it.

Revision excels at exploitation. Each revision builds on the previous attempt, fixing specific errors. If the model's first attempt is almost correct (one calculation error), revision can fix it efficiently. But if the first attempt uses a fundamentally wrong approach, revision may get stuck polishing a bad idea.

When revision beats sampling: Revision is more efficient when errors are local (a calculation mistake, a sign error) and the overall approach is sound. Sampling is more efficient when the problem requires a specific insight or approach that the model might miss entirely. For hard math problems, the paper finds that revision works best after an initial round of sampling — start with best-of-N to find a good starting point, then revise it.

Revision Steps Visualizer

Watch how sequential revision improves a solution step by step. Each revision fixes errors identified by the verifier. The solution quality improves with each iteration.

When is sequential revision more efficient than parallel sampling?

When errors are local (calculation mistakes, sign errors) and the overall approach is sound — revision can fix specific issues efficiently, while sampling would waste compute generating many independent attempts at problems that are almost solved Always, because revision uses less compute Never, because sampling always finds the right answer faster

Chapter 4: Compute-Optimal Allocation

Given a fixed test-time compute budget C, how should you split it between sampling (N attempts) and revision (T revisions per attempt)? The paper formulates this as an optimization problem.

max_N,T accuracy(N, T) subject to N · (T+1) ≤ C

The key finding: the optimal allocation depends on problem difficulty.

Difficulty	Optimal Strategy	Why
Easy	N=C, T=0 (all sampling)	First attempt is usually close. Many samples find a correct one quickly.
Medium	Balanced N and T	Need both exploration (find good approach) and exploitation (fix errors).
Hard	N=1, T=C-1 (all revision)	No single attempt is likely correct. Need deep iterative improvement.

The compute-optimal insight: There is no single best strategy. The optimal allocation between sampling and revision depends on how hard the problem is for the model. A compute-optimal system should estimate problem difficulty first, then allocate compute accordingly. Easy problem? Sample a few times. Hard problem? Revise deeply.

Difficulty-conditional allocation

The paper proposes a difficulty-conditional compute allocation policy. Before spending compute, estimate the problem difficulty (e.g., using the verifier's score on the first attempt). Then allocate:

python
# Compute-optimal test-time allocation
def compute_optimal_inference(model, verifier, problem, budget):
    # Step 1: Estimate difficulty with one quick attempt
    first_try = model.generate(problem)
    difficulty = 1 - verifier.score(first_try)  # 0=easy, 1=hard

    # Step 2: Allocate compute based on difficulty
    if difficulty < 0.3:  # easy
        N = budget; T = 0
    elif difficulty < 0.7:  # medium
        N = int(budget**0.5); T = budget // N - 1
    else:  # hard
        N = 2; T = budget // 2 - 1

    # Step 3: Execute strategy
    best_sol = None; best_score = -1
    for _ in range(N):
        sol = model.generate(problem)
        for _ in range(T):
            sol = model.revise(problem, sol, verifier.critique(sol))
        score = verifier.score(sol)
        if score > best_score:
            best_score = score; best_sol = sol

    return best_sol

Compute Allocation by Difficulty

Adjust problem difficulty and compute budget. The bar chart shows the optimal split between sampling (N) and revision (T). Easy problems: all sampling. Hard problems: deep revision.

Difficulty 50%

Budget 20x

For a very hard problem, how should test-time compute be allocated?

Mostly revision (few initial samples, many revision steps) — because no single attempt is likely correct, deep iterative improvement is needed to gradually build toward the right solution All sampling with no revision Equal split between sampling and revision

Chapter 5: The Key Result

The paper's headline finding: a Llama-8B model with compute-optimal test-time scaling matches a Llama-70B model on hard math problems — using the same total FLOPs.

The comparison

Approach	Model	FLOPs	MATH Accuracy
Standard	Llama-70B (1 pass)	~70B per token	52%
TTC-optimal	Llama-8B (N=14, T=0)	~112B per token	52%
TTC-optimal	Llama-8B (difficulty-conditional)	~56B per token	52%

With difficulty-conditional allocation, the 8B model matches the 70B model while using less total compute. The key: easy problems are solved with one pass (cheap), and the saved compute is redirected to hard problems (more samples + revision).

The implication is radical: Instead of training a 70B model, you could train an 8B model and spend the saved compute at inference time — adaptively, problem by problem. On easy questions (80% of real-world queries), the 8B model answers in one pass. On the remaining 20% hard questions, it uses 10-50x more compute. The average cost is lower than running 70B for every query.

Scaling curves

The paper plots accuracy vs. test-time FLOPs for different strategies:

Best-of-N with PRM scales as roughly accuracy ∝ log(N). Each doubling of N gives a constant accuracy improvement.

Revision scales faster initially (first few revisions fix obvious errors) but plateaus sooner (diminishing returns from polishing).

Compute-optimal (combining both) outperforms either strategy alone at every budget level.

Test-Time Compute Scaling Curves

See how accuracy scales with test-time compute for different strategies. The compute-optimal curve (combining sampling and revision) dominates. The dashed line shows the 70B model baseline.

How does a Llama-8B model with difficulty-conditional compute allocation match a Llama-70B model while using less total compute?

By solving easy problems (80% of queries) with one cheap 8B pass and redirecting saved compute to hard problems (20%) with many samples + revision — the average cost is lower than running 70B for every query By using a better training dataset By quantizing the 70B model

Chapter 6: Compute Allocator Simulator

Experience compute-optimal inference. This simulator presents problems of varying difficulty and lets you allocate compute. See how difficulty-conditional allocation outperforms fixed strategies.

Compute Allocation Simulator

Problems arrive with varying difficulty. Choose how to allocate your compute budget for each. Watch the overall accuracy and compute efficiency accumulate. The optimal strategy adapts to each problem.

Why does adaptive compute allocation outperform a fixed N=10 strategy across mixed-difficulty problem sets?

Because it spends minimal compute on easy problems (solved in 1-2 attempts) and redirects that savings to hard problems (which need 20-50 attempts) — fixed N=10 wastes compute on easy problems and under-invests on hard ones Because it uses a better model Because it skips hard problems

Chapter 7: Connections

This paper formalized what OpenAI's o1 and DeepSeek-R1 demonstrated empirically: spending more compute at inference can substitute for model scale. It connects forward to every reasoning-at-inference-time technique.

Method	Year	Relationship
Self-Consistency	2023	Naive parallel sampling. This paper generalizes it with verifiers.
Let's Verify Step by Step	2023	PRMs as verifiers. This paper uses them for compute-optimal search.
OpenAI o1	2024	Extended thinking = test-time compute. This paper provides theory.
DeepSeek-R1	2025	RL-trained extended thinking. Emergent test-time compute scaling.

What this paper got right

The difficulty-conditional insight. The optimal strategy depends on problem difficulty. This insight has been adopted by every subsequent reasoning system.

What it left open

How to estimate difficulty in practice. The paper assumes you can estimate difficulty cheaply. In practice, this is hard — you often need to attempt the problem to know how hard it is.

The paradigm shift. Before this paper, scaling meant "make the model bigger." After it, scaling has two dimensions: model size (training-time) and thinking time (test-time). The most efficient systems will optimize both simultaneously.

Self-Consistency — The simplest test-time compute method. Read the SC lesson →

Let's Verify Step by Step — PRMs as verifiers for search. Read the PRM lesson →

DeepSeek-R1 — RL-trained test-time compute. Read the R1 lesson →

Test-Time Compute Landscape

What paradigm shift does this paper represent for AI scaling?

It introduces a second scaling axis — test-time compute — alongside model size, showing that the most efficient systems optimize both dimensions simultaneously, adapting inference compute per-problem rather than using fixed model size for all queries It proves that bigger models are always better It eliminates the need for large models entirely