Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar — UC Berkeley & Google DeepMind, 2024

Scaling Test-Time Compute

A smaller model thinking harder can beat a 14x larger model thinking once. The foundational paper on compute-optimal inference: when to sample more, when to revise, and when to search.

Prerequisites: LLM basics + Reward models + Best-of-N sampling. That's it.
8
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: The Scaling Question

You have a fixed compute budget for inference. You can either: (A) run a 14x larger model once, or (B) run a small model many times and pick the best answer. Which is better?

Before this paper, the conventional wisdom was "bigger model = better answers." Training-time compute scaling (Chinchilla scaling laws) showed that investing in larger models and more training data consistently improves performance. But nobody had formally studied whether the same principle applies at inference time.

This paper asks: given a fixed compute budget at inference, what's the optimal way to spend it? The answer is surprising: it depends on the problem difficulty. For easy problems, a larger model is better. For hard problems, a smaller model with more test-time compute can win.

The core question: Is it more compute-efficient to train a bigger model or to use a smaller model that "thinks harder" at inference time? The answer: for hard enough problems, test-time compute scaling is more efficient than model scaling. A Llama-8B model with optimal test-time compute can match or beat a Llama-70B model on difficult math problems.

This result has profound implications. It suggests that the future of AI isn't just bigger models — it's smarter inference. Instead of spending billions on training ever-larger models, we can invest compute at the point of use, adapting to each problem's difficulty.

The two scaling axes

AxisWhen It's UsedExample
Training-time scalingBefore deployment (one-time cost)Train GPT-4 on 13T tokens
Test-time scalingAt inference (per-query cost)Generate 100 solutions, verify, pick best
Model Size vs Test-Time Compute

Compare two approaches: running a big model once vs running a small model many times with extra compute. Click to see which wins for different problem difficulties.

What does this paper demonstrate about the relationship between model size and inference compute?

Chapter 1: Two Strategies

The paper identifies two fundamentally different ways to spend test-time compute: parallel sampling (generate many independent solutions) and sequential revision (iteratively improve one solution). Each has different scaling properties.

Strategy 1: Parallel sampling (search)

Generate N independent solutions, score each with a verifier (reward model or process reward model), and select the best. This is the same as best-of-N sampling. The compute cost scales linearly with N.

accuracy(N) ≈ 1 − (1 − p)N

Where p is the per-attempt success probability. With diminishing returns: doubling N gives less than doubling accuracy. This follows the "coverage" curve — eventually, you've sampled enough that adding more doesn't help.

Strategy 2: Sequential revision

Generate one solution, then iteratively refine it. At each revision step, the model sees its previous attempt and feedback from a verifier. The model can fix errors, try different approaches, or elaborate on partial solutions. This is analogous to a student revising an essay based on feedback.

python
# Strategy 1: Parallel sampling (best-of-N)
def parallel_search(model, verifier, problem, N):
    solutions = [model.generate(problem) for _ in range(N)]
    scores = [verifier.score(s) for s in solutions]
    return solutions[argmax(scores)]
    # Cost: N * (generate_cost + verify_cost)

# Strategy 2: Sequential revision
def sequential_revise(model, verifier, problem, T):
    solution = model.generate(problem)
    for _ in range(T):
        feedback = verifier.critique(solution)
        solution = model.revise(problem, solution, feedback)
    return solution
    # Cost: (T+1) * generate_cost + T * verify_cost
Parallel vs sequential trade-off: Parallel sampling explores breadth — many independent attempts at the problem. Sequential revision explores depth — one attempt, refined multiple times. Parallel is better when the model has a reasonable chance of getting it right on any single try (high p). Sequential is better when errors are systematic and can be corrected with feedback.

The hybrid approach

The paper also considers hybrids: generate M initial solutions, then revise each T times, for a total of M * (T+1) forward passes. The optimal M and T depend on the problem difficulty and the quality of the verifier.

Parallel vs Sequential

Compare parallel sampling (breadth) vs sequential revision (depth) for the same compute budget. Adjust the compute budget and see which strategy wins.

Compute budget 10x
When is parallel sampling (best-of-N) a better strategy than sequential revision?

Chapter 2: Search with Verifiers

The quality of test-time compute scaling depends critically on the verifier — the reward model that scores solutions. A perfect verifier makes best-of-N scale optimally. A poor verifier limits how much test-time compute helps.

Verifier types

VerifierWhat It ScoresQualityCost
Majority votingAnswer frequencyLowFree
ORM (outcome RM)Final answer correctnessMediumModerate
PRM (process RM)Each reasoning stepHighExpensive
OracleGround truthPerfectImpossible in practice

The paper compares these verifiers and shows that the verifier quality is the bottleneck for test-time scaling. With an oracle verifier, accuracy scales smoothly with N. With majority voting, it plateaus quickly. PRMs are the sweet spot — high enough quality to enable meaningful scaling without requiring ground truth.

The verifier bottleneck: Test-time compute scaling is limited by verifier quality, not generator quality. Once you have enough samples that at least one is correct (high coverage), the bottleneck becomes: can the verifier identify which one is correct? A weak verifier can't, so adding more samples doesn't help. A strong PRM can, so accuracy continues to improve with N.

Search with tree-structured reasoning

The paper also studies beam search over reasoning steps (using a PRM as the value function). Instead of generating complete solutions and scoring them, the model generates step by step, with the PRM evaluating each step. Bad branches are pruned early, saving compute.

python
# Beam search with PRM
def beam_search_prm(model, prm, problem, beam_width=5, max_steps=10):
    beams = [("", 1.0)]  # (partial solution, cumulative score)

    for step in range(max_steps):
        candidates = []
        for sol, score in beams:
            # Generate K continuations for each beam
            for _ in range(3):
                next_step = model.generate_step(problem, sol)
                step_score = prm.score_step(problem, sol + next_step)
                candidates.append((sol + next_step, score * step_score))

        # Keep top beam_width candidates
        beams = sorted(candidates, key=lambda x: -x[1])[:beam_width]

    return beams[0][0]  # highest-scoring complete solution
Verifier Quality Impact

See how verifier quality affects test-time scaling. With a perfect verifier, accuracy grows smoothly. With a weak verifier, it plateaus. Select verifier type to compare.

What is the main bottleneck for scaling test-time compute?

Chapter 3: Sequential Revision

The paper studies a second test-time strategy: instead of sampling many solutions in parallel, generate one solution and revise it iteratively. At each revision, the model sees its previous attempt plus verifier feedback and produces an improved version.

How revision works

The revision model is a fine-tuned LLM that takes (problem, previous_solution, feedback) as input and generates an improved solution. The feedback can come from a PRM ("Step 3 has an error"), a simple correctness check ("Answer is wrong, try again"), or the model's own self-evaluation.

Revision vs sampling: complementary strengths

Sampling excels at exploration. Each sample is independent, so N samples explore N different regions of solution space. If the correct approach exists somewhere in the model's distribution, enough samples will find it.

Revision excels at exploitation. Each revision builds on the previous attempt, fixing specific errors. If the model's first attempt is almost correct (one calculation error), revision can fix it efficiently. But if the first attempt uses a fundamentally wrong approach, revision may get stuck polishing a bad idea.

When revision beats sampling: Revision is more efficient when errors are local (a calculation mistake, a sign error) and the overall approach is sound. Sampling is more efficient when the problem requires a specific insight or approach that the model might miss entirely. For hard math problems, the paper finds that revision works best after an initial round of sampling — start with best-of-N to find a good starting point, then revise it.
Revision Steps Visualizer

Watch how sequential revision improves a solution step by step. Each revision fixes errors identified by the verifier. The solution quality improves with each iteration.

When is sequential revision more efficient than parallel sampling?

Chapter 4: Compute-Optimal Allocation

Given a fixed test-time compute budget C, how should you split it between sampling (N attempts) and revision (T revisions per attempt)? The paper formulates this as an optimization problem.

maxN,T accuracy(N, T) subject to N · (T+1) ≤ C

The key finding: the optimal allocation depends on problem difficulty.

DifficultyOptimal StrategyWhy
EasyN=C, T=0 (all sampling)First attempt is usually close. Many samples find a correct one quickly.
MediumBalanced N and TNeed both exploration (find good approach) and exploitation (fix errors).
HardN=1, T=C-1 (all revision)No single attempt is likely correct. Need deep iterative improvement.
The compute-optimal insight: There is no single best strategy. The optimal allocation between sampling and revision depends on how hard the problem is for the model. A compute-optimal system should estimate problem difficulty first, then allocate compute accordingly. Easy problem? Sample a few times. Hard problem? Revise deeply.

Difficulty-conditional allocation

The paper proposes a difficulty-conditional compute allocation policy. Before spending compute, estimate the problem difficulty (e.g., using the verifier's score on the first attempt). Then allocate:

python
# Compute-optimal test-time allocation
def compute_optimal_inference(model, verifier, problem, budget):
    # Step 1: Estimate difficulty with one quick attempt
    first_try = model.generate(problem)
    difficulty = 1 - verifier.score(first_try)  # 0=easy, 1=hard

    # Step 2: Allocate compute based on difficulty
    if difficulty < 0.3:  # easy
        N = budget; T = 0
    elif difficulty < 0.7:  # medium
        N = int(budget**0.5); T = budget // N - 1
    else:  # hard
        N = 2; T = budget // 2 - 1

    # Step 3: Execute strategy
    best_sol = None; best_score = -1
    for _ in range(N):
        sol = model.generate(problem)
        for _ in range(T):
            sol = model.revise(problem, sol, verifier.critique(sol))
        score = verifier.score(sol)
        if score > best_score:
            best_score = score; best_sol = sol

    return best_sol
Compute Allocation by Difficulty

Adjust problem difficulty and compute budget. The bar chart shows the optimal split between sampling (N) and revision (T). Easy problems: all sampling. Hard problems: deep revision.

Difficulty 50%
Budget 20x
For a very hard problem, how should test-time compute be allocated?

Chapter 5: The Key Result

The paper's headline finding: a Llama-8B model with compute-optimal test-time scaling matches a Llama-70B model on hard math problems — using the same total FLOPs.

The comparison

ApproachModelFLOPsMATH Accuracy
StandardLlama-70B (1 pass)~70B per token52%
TTC-optimalLlama-8B (N=14, T=0)~112B per token52%
TTC-optimalLlama-8B (difficulty-conditional)~56B per token52%

With difficulty-conditional allocation, the 8B model matches the 70B model while using less total compute. The key: easy problems are solved with one pass (cheap), and the saved compute is redirected to hard problems (more samples + revision).

The implication is radical: Instead of training a 70B model, you could train an 8B model and spend the saved compute at inference time — adaptively, problem by problem. On easy questions (80% of real-world queries), the 8B model answers in one pass. On the remaining 20% hard questions, it uses 10-50x more compute. The average cost is lower than running 70B for every query.

Scaling curves

The paper plots accuracy vs. test-time FLOPs for different strategies:

Best-of-N with PRM scales as roughly accuracy ∝ log(N). Each doubling of N gives a constant accuracy improvement.

Revision scales faster initially (first few revisions fix obvious errors) but plateaus sooner (diminishing returns from polishing).

Compute-optimal (combining both) outperforms either strategy alone at every budget level.

Test-Time Compute Scaling Curves

See how accuracy scales with test-time compute for different strategies. The compute-optimal curve (combining sampling and revision) dominates. The dashed line shows the 70B model baseline.

How does a Llama-8B model with difficulty-conditional compute allocation match a Llama-70B model while using less total compute?

Chapter 6: Compute Allocator Simulator

Experience compute-optimal inference. This simulator presents problems of varying difficulty and lets you allocate compute. See how difficulty-conditional allocation outperforms fixed strategies.

Compute Allocation Simulator

Problems arrive with varying difficulty. Choose how to allocate your compute budget for each. Watch the overall accuracy and compute efficiency accumulate. The optimal strategy adapts to each problem.

Why does adaptive compute allocation outperform a fixed N=10 strategy across mixed-difficulty problem sets?

Chapter 7: Connections

This paper formalized what OpenAI's o1 and DeepSeek-R1 demonstrated empirically: spending more compute at inference can substitute for model scale. It connects forward to every reasoning-at-inference-time technique.

MethodYearRelationship
Self-Consistency2023Naive parallel sampling. This paper generalizes it with verifiers.
Let's Verify Step by Step2023PRMs as verifiers. This paper uses them for compute-optimal search.
OpenAI o12024Extended thinking = test-time compute. This paper provides theory.
DeepSeek-R12025RL-trained extended thinking. Emergent test-time compute scaling.

What this paper got right

The difficulty-conditional insight. The optimal strategy depends on problem difficulty. This insight has been adopted by every subsequent reasoning system.

What it left open

How to estimate difficulty in practice. The paper assumes you can estimate difficulty cheaply. In practice, this is hard — you often need to attempt the problem to know how hard it is.

The paradigm shift. Before this paper, scaling meant "make the model bigger." After it, scaling has two dimensions: model size (training-time) and thinking time (test-time). The most efficient systems will optimize both simultaneously.

Self-Consistency — The simplest test-time compute method. Read the SC lesson →

Let's Verify Step by Step — PRMs as verifiers for search. Read the PRM lesson →

DeepSeek-R1 — RL-trained test-time compute. Read the R1 lesson →

Test-Time Compute Landscape
What paradigm shift does this paper represent for AI scaling?