The discovery that reshaped 2024: a model can get dramatically smarter not by being bigger, but by being allowed to think longer before it answers.
Ask a person a hard question — a tricky math problem, a subtle logic puzzle — and watch what they do. They pause. They scribble. They try an approach, hit a dead end, back up, try another. They spend more time thinking on harder problems. A standard language model does the opposite: it produces an answer at a fixed, constant rate, one token after another, spending exactly the same effort on “What is 2+2?” as on a competition-level proof.
For years, the only way we knew to make models smarter was to make them bigger: more parameters, more training data, more training compute. That works, but it's brutally expensive and slow — you retrain a giant model and wait months. Then, in 2024, a different axis exploded into view: test-time compute (also called inference-time compute). The idea is simple and profound — let the model do more computation when answering, not just when training.
The headline result: a fixed, already-trained model can become dramatically more accurate on hard problems if you let it spend more compute at inference — by sampling many answers, by thinking step-by-step, by searching, by checking its own work. You're not changing the model's weights at all. You're changing how hard it's allowed to think before it commits to an answer.
This reframes “how good is the model?” into a question with two knobs. Train-time compute sets the raw capability baked into the weights. Test-time compute sets how much of that capability you can extract on a given question by letting the model deliberate. A small model that thinks for a long time can match a huge model that answers instantly — on the right kind of problem. That trade, between model size and thinking time, is the central economic question of modern AI.
The widget shows accuracy on a hard problem set as you spend more test-time compute (more “thinking budget”). Drag the budget up and accuracy climbs: a model that scored 40% answering instantly might reach 80% when allowed to think. Notice the curve's shape: steep early gains, then diminishing returns. That roughly log-linear climb, where each doubling of compute buys a similar-sized gain until the curve flattens, is the inference scaling law, and it's why this matters.
Same fixed model, harder thinking. Drag the test-time compute budget and watch accuracy rise with diminishing returns — a brand-new way to buy capability without retraining.
The simplest way to spend test-time compute needs no new training and barely any code: ask the model the same question many times and take the most common answer. Because sampling is random, each run may reason differently and reach a different answer. Aggregate them with a majority vote, and the consensus is far more reliable than any single run. This is self-consistency.
Here's the intuition. On a problem the model half-understands, a single sampled answer is a coin-flip — sometimes the right reasoning path, sometimes a wrong one. But the wrong answers tend to be scattered (there are many ways to be wrong, each rare), while the right answer is a single target that multiple correct reasoning paths converge on. So even if no single run is reliable, the correct answer is usually the most frequent one across many runs. Majority vote surfaces it.
A model is asked a math problem 7 times (sampled with some randomness), and the seven final answers are collected.
Tally them: “42” appears 4 times, “17” twice, “36” once. The majority answer is 42, with 4 of 7 votes. Notice that any single run had only a 4-in-7 (57%) chance of being right — barely better than a coin flip on this problem. But the vote is correct as long as the right answer is merely the most common, which is a much weaker and more reliable condition. Sample 50 times instead of 7 and the vote becomes near-certain even when individual runs are unreliable.
```python
from collections import Counter

# `model` and `extract_final_answer` are supplied by the caller; they are not defined here.
def self_consistency(model, question, n=40, temperature=0.7):
    answers = []
    for _ in range(n):
        chain = model.generate(question, temperature=temperature)  # sample a reasoning path
        answers.append(extract_final_answer(chain))                # pull out the boxed answer
    # the most common final answer wins
    return Counter(answers).most_common(1)[0][0]
```
The cost is roughly n times that of a single inference (sampled chains vary in length): you spend about n× the test-time compute to buy reliability. That's the trade in its purest form: more samples, more compute, more accuracy, with no change to the model. Self-consistency was one of the first demonstrations that inference compute alone could lift reasoning accuracy by large margins.
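How quickly the vote firms up as n grows can be checked with a small simulation. This is an illustrative sketch only: it assumes each sample is independently correct with probability `p_correct` and otherwise lands uniformly on one of `n_wrong` distinct wrong answers. Those parameter names and the default numbers are assumptions chosen for the demo, not measurements from any real model.

```python
import random
from collections import Counter

def vote_accuracy(n, p_correct=0.4, n_wrong=10, trials=2000, seed=0):
    """Estimate how often a majority vote over n samples picks the right answer."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        answers = []
        for _ in range(n):
            if rng.random() < p_correct:
                answers.append("correct")
            else:
                # Wrong answers are scattered over several distinct values.
                answers.append(f"wrong_{rng.randrange(n_wrong)}")
        # Ties are broken in favour of whichever answer appeared first.
        winner = Counter(answers).most_common(1)[0][0]
        wins += winner == "correct"
    return wins / trials

for n in (1, 5, 25):
    print(n, vote_accuracy(n))
```

Under these assumptions a single sample is right well under half the time, while a 25-sample vote is right in the large majority of trials, because the correct answer only has to beat each scattered wrong answer individually.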
Each sample is a die-like draw: usually the right answer, sometimes a scattered wrong one. Increase the number of samples and watch the majority lock onto the correct answer, even though each individual sample stays unreliable. The accuracy of the vote climbs far above the accuracy of a single sample.
Bars = how many samples gave each answer. Green = correct. Raise N and watch the correct answer's lead grow. Single-sample accuracy stays fixed; the vote's accuracy climbs.
Majority vote treats every sample equally and just counts. But what if we could judge which answers are good and pick the best one, instead of voting? That's best-of-N: sample N candidate solutions, score each with a verifier, and return the highest-scoring one. When you have a good verifier, this beats majority voting — especially when the right answer is rare but recognizable once you see it.
This works because of a deep asymmetry: verifying a solution is often much easier than generating one. It's hard to find the proof; it's easy to check each step. Hard to write the working code; easier to run the tests. A model (or a separate verifier model) that's mediocre at producing correct answers can still be quite good at recognizing them. Best-of-N exploits that gap: let the generator throw many attempts at the wall, and let the verifier — which only needs to judge, not create — spot the one that sticks.
Five sampled solutions to a hard problem, with their (hidden) correctness and a verifier's quality score:
| solution | final answer | actually correct? | verifier score |
|---|---|---|---|
| S1 | 17 | no | 0.3 |
| S2 | 17 | no | 0.4 |
| S3 | 42 | yes | 0.9 |
| S4 | 17 | no | 0.2 |
| S5 | 9 | no | 0.3 |
Majority vote picks “17” (3 of 5 votes) — wrong, because the model's common failure mode produced the same wrong answer repeatedly. Best-of-N with the verifier picks S3 (score 0.9) — correct, because the verifier recognized the sound solution even though it was a minority of 1. This is the regime where verifiers win: the correct answer is rare, so counting fails, but quality is recognizable, so judging succeeds.
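The two selection rules can be run side by side on the five solutions from the table. This is a minimal sketch: the tuple layout and the function names are illustrative, and `coverage` additionally assumes an idealised verifier that always recognises a correct sample when one exists.

```python
from collections import Counter

# (name, final answer, verifier score), copied from the table above
solutions = [
    ("S1", "17", 0.3),
    ("S2", "17", 0.4),
    ("S3", "42", 0.9),
    ("S4", "17", 0.2),
    ("S5", "9", 0.3),
]

def majority_vote(solutions):
    """Return the most frequent final answer, ignoring verifier scores."""
    return Counter(answer for _, answer, _ in solutions).most_common(1)[0][0]

def best_of_n(solutions):
    """Return the (name, answer) of the solution the verifier scores highest."""
    name, answer, _score = max(solutions, key=lambda s: s[2])
    return name, answer

def coverage(p, n):
    """Chance that at least one of n independent samples is correct."""
    return 1 - (1 - p) ** n

print(majority_vote(solutions))       # 17
print(best_of_n(solutions))           # ('S3', '42')
print(round(coverage(0.2, 10), 3))    # 0.893
```

The last line shows why sampling more helps best-of-N: if each sample is correct only 20% of the time, ten samples still contain at least one correct solution about 89% of the time, and the verifier's job is to find it.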
The same N sampled solutions, judged two ways. Toggle between majority vote (counts answers) and best-of-N (picks the verifier's highest score). Lower the per-sample correctness to make the right answer rare, and watch voting fail while the verifier still finds it — until the verifier itself gets noisy.
Each dot is a sampled solution (green = correct). Vote counts answers; verifier picks the top score. Lower correctness to see voting fail where the verifier still succeeds.
Sampling and verifying spend test-time compute across many attempts. But there's a way to spend more compute within a single attempt: let the model write out its reasoning step by step before committing to an answer. This is chain of thought (CoT), and understanding why it works reveals something fundamental about how transformers compute.
Here's the key insight. A transformer does a fixed amount of computation to produce each token: one forward pass through the network. If you force the model to jump straight to a final answer, it gets roughly one forward pass's worth of computation to solve the whole problem. But a hard problem might need more computation than fits in a single forward pass. By writing intermediate reasoning steps, the model gives itself more forward passes, one per token of reasoning, to work toward the answer. The reasoning text is a scratchpad, and each token written on it is another slice of computation spent.
A subtle point: why must the model write out the steps — couldn't it think silently? Because a transformer has no hidden scratchpad that persists between tokens beyond what it has already written. Its only working memory across steps is the sequence of tokens itself. So to use the result of step 1 in step 3, it must write step 1 down — the text is the memory. Forcing the reasoning into the visible token stream is what lets later computation build on earlier computation. (This also makes the reasoning inspectable, which is a safety bonus.)
“A shop has 3 shelves with 14 books each, sells 9, then receives 2 new boxes of 6. How many books now?”
No chain of thought — the model must compute everything in one forward pass and blurts: “47” — wrong, it fumbled the arithmetic with no room to work.
With chain of thought — each step is a forward pass building on the last: “Start: 3 × 14 = 42 books. After selling 9: 42 − 9 = 33. New books: 2 × 6 = 12. Total: 33 + 12 = 45.” Correct. The model didn't get smarter — it got more steps of computation, and it stored each intermediate result (42, 33, 12) in the text so the next step could use it. Same model, more thinking, right answer.
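The "text is the memory" idea can be mimicked in ordinary code. The sketch below is an analogy, not a model of a transformer: each step is a small function that can read only what earlier steps have written into `notes`, and the names `run_chain`, `notes`, and the step labels are invented for illustration.

```python
def run_chain(steps):
    """Run steps in order; each step sees only what earlier steps wrote down."""
    notes = {}          # named intermediate results written so far
    transcript = []     # the visible reasoning text
    for name, description, compute in steps:
        notes[name] = compute(notes)   # a step may read any earlier note
        transcript.append(f"{description} = {notes[name]}")
    return notes, transcript

steps = [
    ("start", "3 x 14", lambda n: 3 * 14),
    ("after_sale", "start - 9", lambda n: n["start"] - 9),
    ("new_books", "2 x 6", lambda n: 2 * 6),
    ("total", "after_sale + new_books", lambda n: n["after_sale"] + n["new_books"]),
]

notes, transcript = run_chain(steps)
print("\n".join(transcript))
print(notes["total"])  # 45
```

If the `after_sale` step were skipped, the `total` step would have nothing to read and would fail, which mirrors why the intermediate results have to appear in the token stream.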
The widget shows a problem of adjustable difficulty and a reasoning chain of adjustable length. A problem needs enough reasoning steps to be solved — too few and the model runs out of computation before reaching the answer. Drag the difficulty up and watch how many steps are needed; drag the reasoning length and see the answer flip from wrong to right once there's enough thinking.
The chain (steps shown) needs to be long enough for the problem's difficulty. Too short = the model runs out of compute before the answer. Adjust both and watch it solve or fail.
Self-consistency samples whole solutions independently. But a reasoning process is really a tree: at each step there are several plausible next moves, and a wrong early step dooms everything after it. The most powerful use of test-time compute treats reasoning as a search problem — explore the tree of possible reasoning steps, keep the promising branches, and prune the dead ends. This is where a process reward model (Chapter 2) becomes the engine.
Here's the picture. Start at the problem. Generate a few candidate first steps. Score each with a PRM — how promising is this step? Keep the best few (this is a beam), expand each into candidate second steps, score again, keep the best, and so on. Bad branches get low scores and are abandoned; good branches get explored further. Instead of blindly sampling 100 complete solutions and hoping, you spend your compute where it's likely to pay off — deepening the branches that look correct.
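The loop described above can be sketched as a generic beam search. Everything concrete here is a stand-in chosen so the example runs on its own: the "problem" is reaching a target number from 1, `expand` proposes three arithmetic moves instead of reasoning steps, and `toy_prm` scores a state by its distance to the target rather than with a trained process reward model.

```python
def beam_search(start, expand, score, is_solved, beam_width=3, max_depth=6):
    """Keep the beam_width best partial paths at each depth, judged by score."""
    beam = [(start, [])]                      # (state, steps taken so far)
    for _ in range(max_depth):
        candidates = []
        for state, path in beam:
            for step, next_state in expand(state):
                candidates.append((next_state, path + [step]))
        # Rank every candidate with the scorer, then prune to the beam.
        candidates.sort(key=lambda c: score(c[0]), reverse=True)
        beam = candidates[:beam_width]
        if is_solved(beam[0][0]):
            return beam[0]
    return None                               # budget exhausted without a solution

TARGET = 20

def expand(value):
    """Candidate next steps from a state (toy stand-in for sampled reasoning steps)."""
    return [("+3", value + 3), ("*2", value * 2), ("-1", value - 1)]

def toy_prm(value):
    """Closer to the target means more promising (toy stand-in for a PRM)."""
    return -abs(TARGET - value)

result = beam_search(1, expand, toy_prm, lambda v: v == TARGET)
print(result)
print(beam_search(1, expand, toy_prm, lambda v: v == TARGET, beam_width=1))
```

With a beam of 3 the search reaches the target within the six-step budget, while the greedy run (`beam_width=1`) commits to the locally best move each time and returns `None` under the same depth limit. A wider beam costs more scorer calls per depth, which is the compute-for-coverage trade the paragraph describes.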
Press step to grow the search. At each depth, candidate next-steps appear, the PRM scores them (greener = more promising), and the search keeps the top branches (the beam) while pruning the rest (they fade). Watch the search ignore the dead ends and drive toward a high-scoring solution leaf. Raise the beam width to explore more branches at once — more compute, better coverage.
Nodes = partial reasoning states, colored by PRM score (green = promising). The beam keeps the best; pruned branches fade. Step the search and watch it home in on a solution.
Everything so far — voting, verifiers, chain of thought, search — is a different way to convert test-time compute into accuracy. This simulator puts them head to head. Pick a problem difficulty, then slide the compute budget (how many samples / how much search) and watch each strategy's accuracy curve. This is the picture researchers actually plot when they study inference scaling laws.
Run the comparisons that define the field.
Each curve is a strategy's accuracy as test-time compute grows (log scale). Adjust difficulty, slide the compute budget (vertical line), and highlight a strategy. Watch where each one plateaus or keeps climbing.
No quiz — the simulator is the test. If you can predict which strategy wins at a tiny budget versus a huge one, you understand inference scaling.
Everything so far bolts test-time compute onto a normal model from the outside: we sample, vote, verify, and search using a model that was never specifically trained to reason at length. The breakthrough that began in 2024 with OpenAI's o1, followed by DeepSeek-R1 in early 2025 and others, was to train the model itself to produce long, self-correcting chains of thought. The thinking moves inside the model. It learns to deliberate.
The recipe is reinforcement learning with a beautifully simple reward. Give the model problems with verifiable answers (math, code). Let it generate a long chain of thought and a final answer. Check the answer — right or wrong is the reward. Then use RL to make reasoning that led to correct answers more likely. Crucially, the model isn't told how to reason — only whether it succeeded. Over training, it discovers effective strategies on its own: breaking problems down, trying approaches, backtracking when stuck, double-checking its work. Behaviors that look remarkably like human deliberation emerge purely from rewarding correct final answers.
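The outcome-only reward can be illustrated with a deliberately tiny stand-in. The sketch below is not how o1 or DeepSeek-R1 is trained: it replaces the language model with a softmax policy over three hypothetical "reasoning strategies", each with an assumed chance of producing a correct final answer, and applies the plain REINFORCE update with a running-average baseline. The function names, success rates, and hyperparameters are all assumptions.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train(success_rates, steps=5000, lr=0.1, seed=0):
    """REINFORCE on a toy policy that picks one of several reasoning strategies."""
    rng = random.Random(seed)
    logits = [0.0] * len(success_rates)
    baseline = 0.0                                # running average reward
    for _ in range(steps):
        probs = softmax(logits)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        # Outcome-only reward: 1 if the final answer checks out, else 0.
        reward = 1.0 if rng.random() < success_rates[choice] else 0.0
        advantage = reward - baseline
        # Gradient of log pi(choice) w.r.t. logit i is (1[i == choice] - probs[i]).
        for i in range(len(logits)):
            grad = (1.0 if i == choice else 0.0) - probs[i]
            logits[i] += lr * advantage * grad
        baseline += 0.05 * (reward - baseline)
    return softmax(logits)

probs = train([0.2, 0.5, 0.9])
print([round(p, 2) for p in probs])
```

The policy is never told which strategy is good. It only sees whether each attempt succeeded, yet probability mass drifts toward the strategy that succeeds most often, which is the same mechanism, in miniature, that rewards whole reasoning behaviours in the real recipe.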
o1 revealed something striking: reasoning models improve along two compute axes at once. More train-time RL compute (more reasoning practice during training) makes the model better. And more test-time compute (letting it think longer at inference) also makes it better — both follow smooth, predictable scaling curves. You can spend compute to train a better reasoner, and spend compute to let that reasoner think longer, and both pay off. This doubled the levers available for pushing capability.
A practical hallmark: these models produce a long internal monologue before answering — often thousands of tokens of “Let me try… no wait, that's wrong… let me reconsider…” — then a concise final answer. That monologue is the test-time compute being spent, and its self-correction (“wait, that's wrong”) is the learned backtracking that plain models lack. The model trained itself to do, in one stream, what we previously had to orchestrate with external search.
Adjust both knobs: train-time RL compute (how well the model learned to reason) and test-time thinking (how long it's allowed to deliberate now). Accuracy rises with both. A well-trained reasoner thinking briefly can match a lightly-trained one thinking hard — the two axes trade off, and you get to spend on whichever is cheaper.
Accuracy as a surface over two axes. More RL training (better reasoner) and more thinking time both raise it. The dot shows your current operating point.
Now the economic heart of it. You have a compute budget. You can spend it making the model bigger and better-trained, or you can spend it letting a smaller model think longer at inference. These trade off against each other — and the right split depends on how you'll use the model. This is one of the most consequential questions in AI today.
A landmark finding (DeepMind, 2024): for many problems, you can cut model size and make up the lost capability with test-time compute — a smaller model that thinks harder matches a bigger one that answers instantly. But the trade isn't free or unlimited. On easy problems, extra thinking is wasted — a bigger model is better. On hard problems within the model's reach, test-time compute is remarkably effective — thinking wins. And on problems beyond the model's capability, no amount of thinking helps — you need the bigger model.
This reframes the famous training scaling laws. The old question was “given a training budget, what's the best model size and dataset?” The new question adds a dimension: “given a total budget across training AND all future inference, how should we split it?” For a model serving heavy traffic, the answer shifts toward bigger models (training amortizes). For reasoning-heavy, low-volume use, it shifts toward smaller models with big test-time budgets. There's no universal answer — only an optimal split for your deployment.
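A back-of-the-envelope version of that split can be written down with two widely used approximations for dense transformers: training costs about 6·N·D FLOPs for N parameters and D training tokens, and generating one token costs about 2·N FLOPs. The sketch below is illustrative only. The two model configurations and their token counts are hypothetical, and it assumes both options reach the same accuracy on the workload, which is the premise of the trade rather than something the code checks.

```python
def total_flops(params, train_tokens, tokens_per_query, queries):
    """Lifetime compute: one-off training plus per-query inference."""
    train = 6 * params * train_tokens                 # one-off training cost
    serve = 2 * params * tokens_per_query * queries   # grows with traffic
    return train + serve

# Hypothetical options, assumed to reach the same accuracy.
big_fast = dict(params=70e9, train_tokens=1.4e12, tokens_per_query=500)
small_thinker = dict(params=7e9, train_tokens=1.4e12, tokens_per_query=20_000)

def cheaper(queries):
    """Name the option with the lower lifetime compute at this query volume."""
    big = total_flops(queries=queries, **big_fast)
    small = total_flops(queries=queries, **small_thinker)
    return "small_thinker" if small < big else "big_fast"

for q in (1e6, 1e8, 1e10):
    print(f"{q:.0e} queries -> {cheaper(q)}")
```

With these made-up numbers the small model that thinks for 20,000 tokens is cheaper overall until roughly 2.5 billion queries, after which the larger model's training cost has amortized and its shorter answers win. Changing any of the assumed figures moves that break-even point.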
You have a fixed total compute budget. Slide how it splits between training (model quality) and test-time (thinking per query), and set the problem difficulty. Watch resulting accuracy: on easy problems the peak sits toward training; on hard problems it shifts toward test-time thinking. There's a sweet split, and it moves with difficulty.
Slide the split of a fixed total budget. The curve shows accuracy across all splits; the marker is your choice. Change difficulty and watch the optimal split move.
Test-time compute is powerful but not a free lunch, and the honest failure modes are as important as the wins. Three things can go wrong: thinking too much, gaming the verifier, and applying it to the wrong kind of problem.
More thinking is not monotonically better. Past a point, accuracy can fall. A reasoning model given an enormous budget may “talk itself out of” a correct early answer — second-guessing, introducing errors in later steps, or wandering into needless complexity on a problem it had already solved. The accuracy-vs-thinking curve is often an inverted U: rising, peaking, then declining. Knowing when to stop thinking is itself a skill, and overthinking wastes compute and can lower accuracy.
When you optimize against a verifier (best-of-N, search, or RL reward), the generator learns to produce solutions that score well — which is not always the same as solutions that are correct. If the verifier has blind spots, the model finds them: confident-sounding but wrong reasoning, answers formatted to please the reward model, exploiting quirks in how the verifier scores. This is reward hacking applied to reasoning, and it's why a verifier must be robust: the harder you optimize against it, the more its flaws get exploited.
Test-time compute helps reasoning — problems where the answer can be worked out or verified from what the model knows. It does little for pure knowledge: if the model doesn't know a fact (a specific date, an obscure name), no amount of thinking will conjure it. You can't deliberate your way to information you never learned. This is why the dramatic test-time-compute gains show up on math, code, and logic — verifiable reasoning — and barely on trivia. Match the tool to the task.
| Situation | Test-time compute verdict |
|---|---|
| Hard math / code / logic (verifiable) | Big win — the ideal case |
| Easy problems | Skip it — risk of overthinking, wasted cost |
| Pure factual recall | No help — thinking can't create missing knowledge |
| High-volume serving | Costly — paid per query; weigh against a bigger model |
| Latency-critical | Careful — long thinking means slow answers |
Drag the thinking budget on a problem of adjustable difficulty. Watch accuracy rise to a peak — then, if you keep going, fall as the model overthinks. The peak sits further right for harder problems (they need more thinking) and further left for easy ones (they're quickly solved and then spoiled by extra steps). The sweet spot is real, and it moves.
Accuracy vs thinking budget — an inverted U. Too little = unsolved; too much = talked out of the right answer. The peak shifts with difficulty.
You now have the full landscape: why thinking longer is a second scaling axis, the simple lever of self-consistency voting, best-of-N with verifiers, chain of thought as literal computation, tree search over reasoning, the strategy-comparison curves, RL-trained reasoning models like o1, the train-versus-test economic tradeoff, and the limits where thinking stops helping. The thread: capability is no longer a single number baked into the weights — it's a curve over how much compute you let the model spend thinking, and the art is spending it well.
| Method | How it spends compute | Needs |
|---|---|---|
| Self-consistency | sample N, majority vote | nothing extra |
| Best-of-N | sample N, verifier picks best | a verifier (outcome reward model, ORM) |
| Chain of thought | reasoning tokens within one attempt | prompting / training |
| Tree search | explore + prune reasoning steps | a process reward model (PRM) |
| Reasoning model (o1) | learned long self-correcting CoT | RL training on verifiable rewards |
“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” — and a model, given more time to think, learns to sharpen its reasoning before it answers.