A smaller model thinking harder can beat a 14x larger model thinking once. The foundational paper on compute-optimal inference: when to sample more, when to revise, and when to search.
You have a fixed compute budget for inference. You can either: (A) run a 14x larger model once, or (B) run a small model many times and pick the best answer. Which is better?
Before this paper, the conventional wisdom was "bigger model = better answers." Training-time compute scaling (Chinchilla scaling laws) showed that investing in larger models and more training data consistently improves performance. But nobody had formally studied whether the same principle applies at inference time.
This paper asks: given a fixed compute budget at inference, what's the optimal way to spend it? The answer is surprising: it depends on the problem difficulty. For easy problems, a larger model is better. For hard problems, a smaller model with more test-time compute can win.
This result has profound implications. It suggests that the future of AI isn't just bigger models — it's smarter inference. Instead of spending billions on training ever-larger models, we can invest compute at the point of use, adapting to each problem's difficulty.
| Axis | When It's Used | Example |
|---|---|---|
| Training-time scaling | Before deployment (one-time cost) | Train GPT-4 on 13T tokens |
| Test-time scaling | At inference (per-query cost) | Generate 100 solutions, verify, pick best |
Compare two approaches: running a big model once vs running a small model many times with extra compute. Click to see which wins for different problem difficulties.
The paper identifies two fundamentally different ways to spend test-time compute: parallel sampling (generate many independent solutions) and sequential revision (iteratively improve one solution). Each has different scaling properties.
Generate N independent solutions, score each with a verifier (reward model or process reward model), and select the best. This is the same as best-of-N sampling. The compute cost scales linearly with N.
Where p is the per-attempt success probability. With diminishing returns: doubling N gives less than doubling accuracy. This follows the "coverage" curve — eventually, you've sampled enough that adding more doesn't help.
Generate one solution, then iteratively refine it. At each revision step, the model sees its previous attempt and feedback from a verifier. The model can fix errors, try different approaches, or elaborate on partial solutions. This is analogous to a student revising an essay based on feedback.
python # Strategy 1: Parallel sampling (best-of-N) def parallel_search(model, verifier, problem, N): solutions = [model.generate(problem) for _ in range(N)] scores = [verifier.score(s) for s in solutions] return solutions[argmax(scores)] # Cost: N * (generate_cost + verify_cost) # Strategy 2: Sequential revision def sequential_revise(model, verifier, problem, T): solution = model.generate(problem) for _ in range(T): feedback = verifier.critique(solution) solution = model.revise(problem, solution, feedback) return solution # Cost: (T+1) * generate_cost + T * verify_cost
The paper also considers hybrids: generate M initial solutions, then revise each T times, for a total of M * (T+1) forward passes. The optimal M and T depend on the problem difficulty and the quality of the verifier.
Compare parallel sampling (breadth) vs sequential revision (depth) for the same compute budget. Adjust the compute budget and see which strategy wins.
The quality of test-time compute scaling depends critically on the verifier — the reward model that scores solutions. A perfect verifier makes best-of-N scale optimally. A poor verifier limits how much test-time compute helps.
| Verifier | What It Scores | Quality | Cost |
|---|---|---|---|
| Majority voting | Answer frequency | Low | Free |
| ORM (outcome RM) | Final answer correctness | Medium | Moderate |
| PRM (process RM) | Each reasoning step | High | Expensive |
| Oracle | Ground truth | Perfect | Impossible in practice |
The paper compares these verifiers and shows that the verifier quality is the bottleneck for test-time scaling. With an oracle verifier, accuracy scales smoothly with N. With majority voting, it plateaus quickly. PRMs are the sweet spot — high enough quality to enable meaningful scaling without requiring ground truth.
The paper also studies beam search over reasoning steps (using a PRM as the value function). Instead of generating complete solutions and scoring them, the model generates step by step, with the PRM evaluating each step. Bad branches are pruned early, saving compute.
python # Beam search with PRM def beam_search_prm(model, prm, problem, beam_width=5, max_steps=10): beams = [("", 1.0)] # (partial solution, cumulative score) for step in range(max_steps): candidates = [] for sol, score in beams: # Generate K continuations for each beam for _ in range(3): next_step = model.generate_step(problem, sol) step_score = prm.score_step(problem, sol + next_step) candidates.append((sol + next_step, score * step_score)) # Keep top beam_width candidates beams = sorted(candidates, key=lambda x: -x[1])[:beam_width] return beams[0][0] # highest-scoring complete solution
See how verifier quality affects test-time scaling. With a perfect verifier, accuracy grows smoothly. With a weak verifier, it plateaus. Select verifier type to compare.
The paper studies a second test-time strategy: instead of sampling many solutions in parallel, generate one solution and revise it iteratively. At each revision, the model sees its previous attempt plus verifier feedback and produces an improved version.
The revision model is a fine-tuned LLM that takes (problem, previous_solution, feedback) as input and generates an improved solution. The feedback can come from a PRM ("Step 3 has an error"), a simple correctness check ("Answer is wrong, try again"), or the model's own self-evaluation.
Sampling excels at exploration. Each sample is independent, so N samples explore N different regions of solution space. If the correct approach exists somewhere in the model's distribution, enough samples will find it.
Revision excels at exploitation. Each revision builds on the previous attempt, fixing specific errors. If the model's first attempt is almost correct (one calculation error), revision can fix it efficiently. But if the first attempt uses a fundamentally wrong approach, revision may get stuck polishing a bad idea.
Watch how sequential revision improves a solution step by step. Each revision fixes errors identified by the verifier. The solution quality improves with each iteration.
Given a fixed test-time compute budget C, how should you split it between sampling (N attempts) and revision (T revisions per attempt)? The paper formulates this as an optimization problem.
The key finding: the optimal allocation depends on problem difficulty.
| Difficulty | Optimal Strategy | Why |
|---|---|---|
| Easy | N=C, T=0 (all sampling) | First attempt is usually close. Many samples find a correct one quickly. |
| Medium | Balanced N and T | Need both exploration (find good approach) and exploitation (fix errors). |
| Hard | N=1, T=C-1 (all revision) | No single attempt is likely correct. Need deep iterative improvement. |
The paper proposes a difficulty-conditional compute allocation policy. Before spending compute, estimate the problem difficulty (e.g., using the verifier's score on the first attempt). Then allocate:
python # Compute-optimal test-time allocation def compute_optimal_inference(model, verifier, problem, budget): # Step 1: Estimate difficulty with one quick attempt first_try = model.generate(problem) difficulty = 1 - verifier.score(first_try) # 0=easy, 1=hard # Step 2: Allocate compute based on difficulty if difficulty < 0.3: # easy N = budget; T = 0 elif difficulty < 0.7: # medium N = int(budget**0.5); T = budget // N - 1 else: # hard N = 2; T = budget // 2 - 1 # Step 3: Execute strategy best_sol = None; best_score = -1 for _ in range(N): sol = model.generate(problem) for _ in range(T): sol = model.revise(problem, sol, verifier.critique(sol)) score = verifier.score(sol) if score > best_score: best_score = score; best_sol = sol return best_sol
Adjust problem difficulty and compute budget. The bar chart shows the optimal split between sampling (N) and revision (T). Easy problems: all sampling. Hard problems: deep revision.
The paper's headline finding: a Llama-8B model with compute-optimal test-time scaling matches a Llama-70B model on hard math problems — using the same total FLOPs.
| Approach | Model | FLOPs | MATH Accuracy |
|---|---|---|---|
| Standard | Llama-70B (1 pass) | ~70B per token | 52% |
| TTC-optimal | Llama-8B (N=14, T=0) | ~112B per token | 52% |
| TTC-optimal | Llama-8B (difficulty-conditional) | ~56B per token | 52% |
With difficulty-conditional allocation, the 8B model matches the 70B model while using less total compute. The key: easy problems are solved with one pass (cheap), and the saved compute is redirected to hard problems (more samples + revision).
The paper plots accuracy vs. test-time FLOPs for different strategies:
Best-of-N with PRM scales as roughly accuracy ∝ log(N). Each doubling of N gives a constant accuracy improvement.
Revision scales faster initially (first few revisions fix obvious errors) but plateaus sooner (diminishing returns from polishing).
Compute-optimal (combining both) outperforms either strategy alone at every budget level.
See how accuracy scales with test-time compute for different strategies. The compute-optimal curve (combining sampling and revision) dominates. The dashed line shows the 70B model baseline.
Experience compute-optimal inference. This simulator presents problems of varying difficulty and lets you allocate compute. See how difficulty-conditional allocation outperforms fixed strategies.
Problems arrive with varying difficulty. Choose how to allocate your compute budget for each. Watch the overall accuracy and compute efficiency accumulate. The optimal strategy adapts to each problem.
This paper formalized what OpenAI's o1 and DeepSeek-R1 demonstrated empirically: spending more compute at inference can substitute for model scale. It connects forward to every reasoning-at-inference-time technique.
| Method | Year | Relationship |
|---|---|---|
| Self-Consistency | 2023 | Naive parallel sampling. This paper generalizes it with verifiers. |
| Let's Verify Step by Step | 2023 | PRMs as verifiers. This paper uses them for compute-optimal search. |
| OpenAI o1 | 2024 | Extended thinking = test-time compute. This paper provides theory. |
| DeepSeek-R1 | 2025 | RL-trained extended thinking. Emergent test-time compute scaling. |
The difficulty-conditional insight. The optimal strategy depends on problem difficulty. This insight has been adopted by every subsequent reasoning system.
How to estimate difficulty in practice. The paper assumes you can estimate difficulty cheaply. In practice, this is hard — you often need to attempt the problem to know how hard it is.
Self-Consistency — The simplest test-time compute method. Read the SC lesson →
Let's Verify Step by Step — PRMs as verifiers for search. Read the PRM lesson →
DeepSeek-R1 — RL-trained test-time compute. Read the R1 lesson →