Test-Time Compute — Letting the Model Think Longer

Chapter 0: A New Place to Spend Compute

Ask a person a hard question, such as a tricky math problem or a subtle logic puzzle, and watch what they do. They pause. They scribble. They try an approach, hit a dead end, back up, try another. They spend more time thinking on harder problems. A standard language model does the opposite: it produces its answer at a fixed rate, one token after another, spending the same compute on each token whether the question is “What is 2+2?” or a competition-level proof.

For years, the only way we knew to make models smarter was to make them bigger: more parameters, more training data, more training compute. That works, but it's brutally expensive and slow — you retrain a giant model and wait months. Then, in 2024, a different axis exploded into view: test-time compute (also called inference-time compute). The idea is simple and profound — let the model do more computation when answering, not just when training.

The headline result: a fixed, already-trained model can become dramatically more accurate on hard problems if you let it spend more compute at inference — by sampling many answers, by thinking step-by-step, by searching, by checking its own work. You're not changing the model's weights at all. You're changing how hard it's allowed to think before it commits to an answer.

The one-sentence version. Test-time compute is a second scaling axis: instead of (or in addition to) spending more compute during training to get a smarter model, you spend more compute during inference to get a smarter answer from the model you already have. On hard reasoning problems, a model that is given time to think can beat a much bigger model that answers instantly.

Two budgets, one capability

This reframes “how good is the model?” into a question with two knobs. Train-time compute sets the raw capability baked into the weights. Test-time compute sets how much of that capability you can extract on a given question by letting the model deliberate. A small model that thinks for a long time can match a huge model that answers instantly — on the right kind of problem. That trade, between model size and thinking time, is the central economic question of modern AI.

See it: the new scaling axis

The widget shows accuracy on a hard problem set as you spend more test-time compute (more “thinking budget”). Drag the budget up and accuracy climbs: a model that scored 40% answering instantly might reach 80% when allowed to think. Notice the curve's shape: steep early gains, then diminishing returns. That roughly log-linear climb (accuracy rising about linearly with the logarithm of compute until it saturates) is what people mean by an inference scaling law, and it's why this matters.

[Interactive widget: Accuracy vs. Thinking Budget. Same fixed model, harder thinking: accuracy rises with the test-time compute budget, with diminishing returns, a new way to buy capability without retraining. Controls: test-time compute budget, problem difficulty.]

Common misconception. “Test-time compute makes any model arbitrarily smart — just let it think forever.” No. Thinking longer mostly helps when the model could get the answer but often doesn't on the first try — reasoning problems with verifiable answers. It can't conjure knowledge the model never had, and the returns diminish (and can even reverse, as we'll see in Chapter 8). It's a powerful lever on the right problems, not a universal “make smarter” button.

What does test-time compute scale, and how does it differ from train-time compute?

It scales the number of parameters during training
It scales the compute spent at inference (sampling, thinking, searching) to get a better answer from a fixed model, with no weight changes
It scales the training dataset size

Chapter 1: Self-Consistency — Just Ask Many Times

The simplest way to spend test-time compute needs no new training and barely any code: ask the model the same question many times and take the most common answer. Because sampling is random, each run may reason differently and reach a different answer. Aggregate them with a majority vote, and the consensus is far more reliable than any single run. This is self-consistency.

Why voting works

Here's the intuition. On a problem the model half-understands, a single sampled answer is a coin-flip — sometimes the right reasoning path, sometimes a wrong one. But the wrong answers tend to be scattered (there are many ways to be wrong, each rare), while the right answer is a single target that multiple correct reasoning paths converge on. So even if no single run is reliable, the correct answer is usually the most frequent one across many runs. Majority vote surfaces it.

Wrong answers disagree; right answers agree. This is the engine of self-consistency. There's typically one correct answer and many distinct wrong ones. Correct reasoning chains, even when they differ in their steps, land on the same final answer — so it accumulates votes. Wrong chains splatter across many different wrong answers, so each gets few votes. The truth wins by concentration, not by any single run being trustworthy.

Worked example: a vote

A model is asked a math problem 7 times (sampled with some randomness). The final answers come back as:

42, 42, 17, 42, 36, 42, 17

Tally them: “42” appears 4 times, “17” twice, “36” once. The majority answer is 42, with 4 of 7 votes. Notice that any single run had only a 4-in-7 (57%) chance of being right — barely better than a coin flip on this problem. But the vote is correct as long as the right answer is merely the most common, which is a much weaker and more reliable condition. Sample 50 times instead of 7 and the vote becomes near-certain even when individual runs are unreliable.

From scratch: self-consistency

python
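import re
from collections import Counter

def extract_final_answer(chain):
    # Assumed format: the chain states its answer as \boxed{...}; None if absent.
    boxed = re.findall(r"\\boxed\{([^{}]*)\}", chain)
    return boxed[-1].strip() if boxed else None

def self_consistency(model, question, n=40, temperature=0.7):
    # `model.generate` is a placeholder for any sampler that returns text.
    answers = []
    for _ in range(n):
        chain = model.generate(question, temperature=temperature)  # sample a reasoning path
        answer = extract_final_answer(chain)                       # pull out the boxed answer
        if answer is not None:                                     # skip chains with no answer
            answers.append(answer)
    # the most common final answer wins
    return Counter(answers).most_common(1)[0][0] if answers else None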

The cost is exactly n times one inference — you spent n× the test-time compute to buy reliability. That's the trade in its purest form: more samples, more compute, more accuracy, with no change to the model. Self-consistency was one of the first demonstrations that inference compute alone could lift reasoning accuracy by large margins.

See it: the vote sharpening with N

Each sample is a die-like draw: usually the right answer, sometimes a scattered wrong one. Increase the number of samples and watch the majority lock onto the correct answer, even though each individual sample stays unreliable. The accuracy of the vote climbs far above the accuracy of a single sample.

[Interactive widget: Majority Vote: Reliability from Repetition. Bars show how many samples gave each answer (green = correct). As N rises, the correct answer's lead grows; single-sample accuracy stays fixed while the vote's accuracy climbs. Controls: number of samples N, per-sample correctness.]

Common misconception. “If each sample is only 45% accurate, the vote can't beat 45%.” It easily can — that's the whole point. As long as the correct answer is the single most likely outcome (even at 45%, if the wrong answers are split among several options each below 45%), the majority vote's accuracy rises toward 100% as you add samples. Voting converts “most likely” into “almost certainly” — the condition is on being the plurality, not the majority.
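A small simulation makes the plurality claim concrete. This is a sketch under stated assumptions, not a model of any real system: each sample is drawn independently, the correct answer has probability 0.45, and the remaining probability is split equally among four distinct wrong answers (the number of wrong answers and the equal split are illustrative choices).

```python
import random
from collections import Counter

def vote_accuracy(n_samples, p_correct=0.45, n_wrong=4, trials=2000, seed=0):
    """Estimate how often a plurality vote over n_samples picks the right answer."""
    rng = random.Random(seed)
    # Answer 0 is correct; answers 1..n_wrong are distinct wrong answers
    # that share the remaining probability mass equally (an assumption).
    answers = list(range(n_wrong + 1))
    weights = [p_correct] + [(1 - p_correct) / n_wrong] * n_wrong
    wins = 0
    for _ in range(trials):
        samples = rng.choices(answers, weights=weights, k=n_samples)
        # Ties go to whichever tied answer was sampled first.
        winner = Counter(samples).most_common(1)[0][0]
        wins += (winner == 0)
    return wins / trials

for n in (1, 7, 51):
    print(n, round(vote_accuracy(n), 3))
```

With one sample the estimate sits near the per-sample rate of 0.45; as the sample count grows, the vote's accuracy climbs well above it, because the correct answer only has to outpoll each individual wrong answer.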

Why can majority voting be far more accurate than any single sampled answer?

Because sampling at high temperature is always correct
Correct reasoning paths converge on the same answer (it accumulates votes) while wrong answers scatter across many options, so the truth is the most frequent even if no single run is reliable
Because the model remembers its previous answers

Chapter 2: Best-of-N — Generate Many, Pick the Best

Majority vote treats every sample equally and just counts. But what if we could judge which answers are good and pick the best one, instead of voting? That's best-of-N: sample N candidate solutions, score each with a verifier, and return the highest-scoring one. When you have a good verifier, this beats majority voting — especially when the right answer is rare but recognizable once you see it.

The generator–verifier gap

This works because of a deep asymmetry: verifying a solution is often much easier than generating one. It's hard to find the proof; it's easy to check each step. Hard to write the working code; easier to run the tests. A model (or a separate verifier model) that's mediocre at producing correct answers can still be quite good at recognizing them. Best-of-N exploits that gap: let the generator throw many attempts at the wall, and let the verifier — which only needs to judge, not create — spot the one that sticks.

Why this beats voting when answers are rare. Majority vote needs the correct answer to be the most common. But on very hard problems, the model might find the right answer only 1 time in 20 — a minority. Voting would discard it. A verifier doesn't care about frequency: it can pick that single correct-but-rare solution out of the 20, as long as it can recognize quality. Verifiers shine exactly where voting fails — the hardest problems.
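The 1-in-20 case can be put into numbers. Assuming samples are independent and the verifier is ideal (it always recognizes a correct solution when one is present, which real verifiers only approximate), best-of-N succeeds whenever at least one of the N samples is correct. The function name below is illustrative.

```python
def best_of_n_coverage(p_correct, n):
    """Probability that at least one of n independent samples is correct."""
    return 1 - (1 - p_correct) ** n

for n in (1, 5, 20, 100):
    print(n, round(best_of_n_coverage(0.05, n), 3))
```

At a 5% per-sample success rate, 20 samples contain at least one correct solution about 64% of the time, while a majority vote over those same samples would almost never return it.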

Two kinds of verifier: outcome vs process

Outcome Reward Model (ORM): scores only the final answer — was the end result right? Simple to train (you just need to know if the final answer was correct), but it gives no credit for good intermediate reasoning and can be fooled by a wrong chain that stumbles into a right answer.
Process Reward Model (PRM): scores every reasoning step — is each step valid? Much richer signal: it can catch an error the moment it happens and rank solutions by the soundness of their reasoning, not just the final number. PRMs are more expensive to train (you need step-level labels) but more powerful, and they're what enables the step-by-step search in Chapter 4.
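The difference between the two signals can be sketched with toy scoring functions. Real ORMs and PRMs are learned models; here the outcome score is a stand-in that compares against a known reference, the step scores are made-up numbers, and collapsing step scores with a minimum is one possible aggregation (a product would be another), not something the text prescribes.

```python
def orm_score(final_answer, reference):
    """Outcome-style score: 1.0 if the final answer matches, else 0.0."""
    return 1.0 if final_answer == reference else 0.0

def prm_score(step_scores):
    """Process-style score: a chain is only as sound as its weakest step."""
    return min(step_scores)

# Hypothetical per-step validity scores for two chains that both end in "42".
sound_chain = {"answer": "42", "steps": [0.95, 0.90, 0.92]}
lucky_chain = {"answer": "42", "steps": [0.90, 0.15, 0.88]}  # one invalid step

for chain in (sound_chain, lucky_chain):
    print(orm_score(chain["answer"], "42"), prm_score(chain["steps"]))
```

Both chains get the same outcome score, but the process score separates the sound chain from the one that reached the right answer through an invalid step.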

Worked example: voting vs best-of-N

Five sampled solutions to a hard problem, with their (hidden) correctness and a verifier's quality score:

solution	final answer	actually correct?	verifier score
S1	17	no	0.3
S2	17	no	0.4
S3	42	yes	0.9
S4	17	no	0.2
S5	9	no	0.3

Majority vote picks “17” (3 of 5 votes) — wrong, because the model's common failure mode produced the same wrong answer repeatedly. Best-of-N with the verifier picks S3 (score 0.9) — correct, because the verifier recognized the sound solution even though it was a minority of 1. This is the regime where verifiers win: the correct answer is rare, so counting fails, but quality is recognizable, so judging succeeds.
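The same comparison, run on the five rows of the table, fits in a few lines. The data are copied from the table; the variable names are illustrative.

```python
from collections import Counter

# (name, final answer, verifier score) from the worked example's table.
solutions = [
    ("S1", 17, 0.3),
    ("S2", 17, 0.4),
    ("S3", 42, 0.9),
    ("S4", 17, 0.2),
    ("S5", 9, 0.3),
]

# Majority vote: count final answers and take the most common.
vote_answer = Counter(ans for _, ans, _ in solutions).most_common(1)[0][0]

# Best-of-N: ignore frequency and take the highest verifier score.
best_name, best_answer, best_score = max(solutions, key=lambda s: s[2])

print("majority vote:", vote_answer)
print("best-of-N:", best_name, best_answer)
```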

See it: vote vs verifier

The same N sampled solutions, judged two ways. Toggle between majority vote (counts answers) and best-of-N (picks the verifier's highest score). Lower the per-sample correctness to make the right answer rare, and watch voting fail while the verifier still finds it — until the verifier itself gets noisy.

[Interactive widget: Majority Vote vs. Best-of-N Verifier. Each dot is a sampled solution (green = correct). Vote counts answers; verifier picks the top score. Controls: per-sample correctness, verifier reliability.]

Common misconception. “A verifier is always better than voting.” Only if the verifier is good. A weak or hackable verifier can confidently pick a wrong answer (and best-of-N will dutifully return it). Voting is more robust to a missing verifier; verifiers are more powerful when reliable. And a verifier that can be gamed invites the generator to produce answers that look good to the verifier without being correct — the verifier-hacking problem we'll meet in Chapter 8.
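How much a noisy verifier costs can be explored with a toy simulation. The setup is an assumption made for illustration: each sample is correct with probability 0.2, and the verifier's score is 1 for a correct sample and 0 for an incorrect one, plus Gaussian noise whose standard deviation plays the role of unreliability.

```python
import random

def best_of_n_accuracy(noise, n=20, p_correct=0.2, trials=2000, seed=0):
    """Fraction of trials in which the top-scored sample is actually correct."""
    rng = random.Random(seed)
    wins = 0
    for _trial in range(trials):
        best_score, best_is_correct = float("-inf"), False
        for _sample in range(n):
            is_correct = rng.random() < p_correct
            # Toy verifier: true quality plus Gaussian noise.
            score = (1.0 if is_correct else 0.0) + rng.gauss(0.0, noise)
            if score > best_score:
                best_score, best_is_correct = score, is_correct
        wins += best_is_correct
    return wins / trials

for noise in (0.1, 0.5, 2.0):
    print(noise, round(best_of_n_accuracy(noise), 3))
```

With low noise the verifier finds a correct sample whenever one exists; as the noise grows, its top pick drifts back toward the per-sample rate, and best-of-N returns confident wrong answers.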

On a very hard problem where the model finds the right answer only 1 time in 20, why does best-of-N with a good verifier beat majority voting?

Voting is faster but less accurate by design
The correct answer is a minority, so voting discards it; a verifier judges quality, not frequency, so it can pick the rare correct solution
Verifiers generate better answers than the model

Chapter 3: Chain of Thought — Thinking Is Computing

Sampling and verifying spend test-time compute across many attempts. But there's a way to spend more compute within a single attempt: let the model write out its reasoning step by step before committing to an answer. This is chain of thought (CoT), and understanding why it works reveals something fundamental about how transformers compute.

Every token is a fixed slice of compute

Here's the key insight. A transformer does a roughly fixed amount of computation to produce each token: one forward pass through the network. If you force the model to jump straight to a final answer, it gets only about one forward pass' worth of computation to solve the whole problem. But a hard problem might need more computation than fits in a single forward pass. By writing intermediate reasoning steps, the model gives itself more forward passes, one per token of reasoning, to work toward the answer. The reasoning text is a scratchpad, and each token written on it is another slice of computation spent.

Reasoning tokens buy computation. A model answering “immediately” is trying to solve the problem in one step of thinking — impossible for anything genuinely hard. Chain of thought lets the model decompose the problem and spend a forward pass on each sub-step, carrying intermediate results forward in the text it has written. The length of the reasoning is literally the amount of test-time compute spent on that one attempt. Harder problems need longer chains.
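A back-of-the-envelope calculation shows how reasoning length translates into compute. It uses the common rule of thumb that a forward pass costs about 2 FLOPs per parameter per generated token, which ignores attention cost and prompt processing; the model size and token counts are hypothetical.

```python
def forward_flops(n_params, n_tokens):
    """Rough generation cost: about 2 FLOPs per parameter per generated token."""
    return 2 * n_params * n_tokens

n_params = 7_000_000_000                          # hypothetical 7B-parameter model
direct = forward_flops(n_params, n_tokens=5)      # a short answer with no reasoning
with_cot = forward_flops(n_params, n_tokens=500)  # reasoning plus the final answer

print(with_cot / direct)  # how many times more compute the chain spends
```

Under this approximation, a 500-token chain spends 100 times the compute of a 5-token answer on the same question, using the same weights.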

Why writing it down matters

A subtle point: why must the model write out the steps — couldn't it think silently? Because a transformer has no hidden scratchpad that persists between tokens beyond what it has already written. Its only working memory across steps is the sequence of tokens itself. So to use the result of step 1 in step 3, it must write step 1 down — the text is the memory. Forcing the reasoning into the visible token stream is what lets later computation build on earlier computation. (This also makes the reasoning inspectable, which is a safety bonus.)

Worked example: with and without the chain

“A shop has 3 shelves with 14 books each, sells 9, then receives 2 new boxes of 6. How many books now?”

No chain of thought — the model must compute everything in one forward pass and blurts: “47” — wrong, it fumbled the arithmetic with no room to work.

With chain of thought — each step is a forward pass building on the last: “Start: 3 × 14 = 42 books. After selling 9: 42 − 9 = 33. New books: 2 × 6 = 12. Total: 33 + 12 = 45.” Correct. The model didn't get smarter — it got more steps of computation, and it stored each intermediate result (42, 33, 12) in the text so the next step could use it. Same model, more thinking, right answer.
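The chain's arithmetic can be checked step by step, with each intermediate value stored under a name in the same way the chain stores it in text. The variable names are illustrative.

```python
books = 3 * 14                   # Start: 3 shelves with 14 books each
after_sale = books - 9           # After selling 9
new_books = 2 * 6                # 2 new boxes of 6
total = after_sale + new_books   # Total now

print(books, after_sale, new_books, total)  # prints: 42 33 12 45
```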

See it: reasoning length vs. solvable difficulty

The widget shows a problem of adjustable difficulty and a reasoning chain of adjustable length. A problem needs enough reasoning steps to be solved — too few and the model runs out of computation before reaching the answer. Drag the difficulty up and watch how many steps are needed; drag the reasoning length and see the answer flip from wrong to right once there's enough thinking.

[Interactive widget: Reasoning Steps vs. Problem Difficulty. The chain must be long enough for the problem's difficulty; too short and the model runs out of compute before the answer. Controls: reasoning length (steps), problem difficulty (steps needed).]

Common misconception. “Chain of thought is just the model explaining an answer it already has.” It's the opposite — the reasoning is the computation that produces the answer, not a post-hoc explanation. Remove the chain and the model genuinely can't solve the problem, because it lacks the forward passes to compute the intermediate results. The steps aren't narration; they're the calculation, externalized into tokens because that's the only working memory a transformer has across positions.

Why does writing out reasoning steps let a transformer solve harder problems?

It makes the model's weights larger
Each token is one forward pass of computation; reasoning tokens give the model more forward passes, and writing steps down stores intermediate results that later steps build on
It retrieves the answer from a database

Chapter 4: Search — Exploring the Tree of Reasoning

Self-consistency samples whole solutions independently. But a reasoning process is really a tree: at each step there are several plausible next moves, and a wrong early step dooms everything after it. The most powerful use of test-time compute treats reasoning as a search problem — explore the tree of possible reasoning steps, keep the promising branches, and prune the dead ends. This is where a process reward model (Chapter 2) becomes the engine.

From sampling to guided search

Here's the picture. Start at the problem. Generate a few candidate first steps. Score each with a PRM — how promising is this step? Keep the best few (this is a beam), expand each into candidate second steps, score again, keep the best, and so on. Bad branches get low scores and are abandoned; good branches get explored further. Instead of blindly sampling 100 complete solutions and hoping, you spend your compute where it's likely to pay off — deepening the branches that look correct.

Search spends compute intelligently. Best-of-N spreads compute uniformly: 100 full attempts, most wasted on doomed early mistakes. Tree search concentrates compute on promising partial solutions, catching errors early (the PRM flags a bad step before the model wastes 50 more tokens on it) and exploring alternatives at exactly the points where the reasoning could fork. For the same compute budget, guided search finds correct solutions that uniform sampling would almost never stumble onto.

The methods: beam, lookahead, MCTS

Beam search over steps: keep the top-b partial solutions by PRM score at each depth, expand them, repeat. Simple and effective.
Lookahead / rollouts: before committing to a step, simulate a few steps ahead to see if it leads somewhere good — a cheap preview of each branch's future.
Monte Carlo Tree Search (MCTS): the algorithm behind AlphaGo, adapted to reasoning. Balance exploring new branches against exploiting known-good ones, using rollouts to estimate each branch's promise. The most sophisticated (and expensive) option.
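The beam variant can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: "reasoning steps" are arithmetic moves on a number, `propose_steps` plays the role of sampling candidate next steps from the model, and `toy_prm` is a simple distance heuristic rather than a learned process reward model.

```python
def propose_steps(value):
    """Candidate next steps from a partial state (toy stand-in for model sampling)."""
    return [("+3", value + 3), ("*2", value * 2), ("-1", value - 1)]

def toy_prm(value, target):
    """Stand-in step scorer: closer to the target counts as more promising."""
    return -abs(target - value)

def beam_search(start, target, beam_width=2, max_depth=6):
    beam = [([], start)]  # each entry: (steps taken so far, current value)
    for _ in range(max_depth):
        candidates = []
        for path, value in beam:
            for label, new_value in propose_steps(value):
                candidates.append((path + [label], new_value))
        # Keep only the top-b partial solutions by score; the rest are pruned.
        candidates.sort(key=lambda c: toy_prm(c[1], target), reverse=True)
        beam = candidates[:beam_width]
        for path, value in beam:
            if value == target:
                return path
    return None  # no solution found within max_depth

print(beam_search(start=1, target=11))  # prints: ['+3', '*2', '+3']
```

With a beam width of 2, the search expands only 6 candidates per depth instead of the full tree, and it reaches the target in three steps. The result depends entirely on the scorer: a misleading `toy_prm` would steer the beam into dead ends just as confidently.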

See it: a reasoning tree being searched

Press step to grow the search. At each depth, candidate next-steps appear, the PRM scores them (greener = more promising), and the search keeps the top branches (the beam) while pruning the rest (they fade). Watch the search ignore the dead ends and drive toward a high-scoring solution leaf. Raise the beam width to explore more branches at once — more compute, better coverage.

[Interactive widget: PRM-Guided Tree Search Over Reasoning Steps. Nodes are partial reasoning states, colored by PRM score (green = promising). The beam keeps the best; pruned branches fade. Control: beam width.]

Common misconception. “Search always beats sampling, so always use MCTS.” Search needs a good PRM to guide it — with a noisy step-scorer, search confidently deepens wrong branches and can do worse than simple sampling. Search is also far more complex to implement and slower per solution. For many problems, plain best-of-N with a decent verifier is the better compute-per-accuracy deal. Reach for tree search when problems are deep, steps are verifiable, and you have a reliable PRM.

How does PRM-guided tree search use test-time compute more efficiently than best-of-N sampling?

It uses less memory
It concentrates compute on promising partial solutions and prunes bad branches early, instead of spreading compute uniformly across many full attempts (most doomed by early mistakes)
It doesn't need the model at all

Chapter 5: The Inference-Scaling Simulator

Everything so far — voting, verifiers, chain of thought, search — is a different way to convert test-time compute into accuracy. This simulator puts them head to head. Pick a problem difficulty, then slide the compute budget (how many samples / how much search) and watch each strategy's accuracy curve. This is the picture researchers actually plot when they study inference scaling laws.

Run the comparisons that define the field:

Single sample is a flat line — no test-time compute, no improvement. The baseline.
Majority vote climbs, then plateaus: once the most-common answer is locked in, more samples don't help. It caps below 100% on hard problems where the right answer is rare.
Best-of-N (verifier) climbs higher and keeps climbing — a good verifier can find rare correct answers that voting's plateau can never reach.
Tree search is steepest early — it gets the most accuracy per unit of compute by spending it intelligently — though it needs a good PRM.

[Interactive widget: Accuracy vs. Compute, by Strategy. Each curve is a strategy's accuracy as test-time compute grows (log scale), with a vertical line marking the compute budget. Controls: compute budget (samples), problem difficulty.]

What to take away. There is no single “best” strategy: it depends on the problem and the compute budget. At small budgets, search and best-of-N pull ahead. Voting is cheap and robust but plateaus. The shape of these curves, and where they cross, is exactly what the well-known 2024 results showing a small model with test-time compute beating a much bigger model are measuring. Capability is now a curve over compute, not a single number.

Common misconception. “Just pick the strategy with the highest ceiling.” The ceiling only matters if you can afford the compute to reach it. At a fixed, modest budget — which is the real-world constraint — the steepest early curve wins, even if another strategy would be better with 1000× the compute. Compute-optimal strategy selection means matching the curve to your actual budget, not chasing the highest asymptote.

No quiz — the simulator is the test. If you can predict which strategy wins at a tiny budget versus a huge one, you understand inference scaling.

Chapter 6: o1 & Reasoning Models — Learning to Think

Everything so far bolts test-time compute onto a normal model from the outside: we sample, vote, verify, and search using a model that was never specifically trained to reason at length. The breakthrough that began in 2024 with OpenAI's o1, followed by DeepSeek-R1 in early 2025 and others, was to train the model itself to produce long, self-correcting chains of thought. The thinking moves inside the model. It learns to deliberate.

How they're trained: RL on reasoning

The recipe is reinforcement learning with a beautifully simple reward. Give the model problems with verifiable answers (math, code). Let it generate a long chain of thought and a final answer. Check the answer — right or wrong is the reward. Then use RL to make reasoning that led to correct answers more likely. Crucially, the model isn't told how to reason — only whether it succeeded. Over training, it discovers effective strategies on its own: breaking problems down, trying approaches, backtracking when stuck, double-checking its work. Behaviors that look remarkably like human deliberation emerge purely from rewarding correct final answers.
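The reward itself is small enough to write out. This sketch assumes the model marks its final answer as `\boxed{...}` and that answers can be compared as exact strings; both are illustrative conventions, and the RL update that would consume these rewards is omitted.

```python
import re

def extract_boxed(chain):
    """Return the contents of the last \\boxed{...} in a chain, or None."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", chain)
    return matches[-1].strip() if matches else None

def outcome_reward(chain, reference):
    """1.0 if the final answer matches the reference, else 0.0.

    The reward says nothing about how the chain reasoned, only whether it succeeded.
    """
    return 1.0 if extract_boxed(chain) == reference else 0.0

# Two hypothetical sampled chains for the same problem.
chains = [
    "Try 6*7... wait, check: 6*7 = 42. \\boxed{42}",
    "I think it's 6+7 = 13. \\boxed{13}",
]
rewards = [outcome_reward(c, "42") for c in chains]
print(rewards)  # prints: [1.0, 0.0]
```

Training then raises the probability of chains like the first and lowers it for chains like the second, without ever specifying which reasoning moves to use.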

The model learns to spend test-time compute well. Earlier chapters had us decide how to spend inference compute (how many samples, how deep a search). A reasoning model internalizes that decision: it learns, on its own, to write a long chain when a problem is hard and a short one when it's easy, and to course-correct mid-stream. The “search” happens inside one long generation — the model explores, rejects, and retries in natural language. It's test-time compute, but learned rather than hand-orchestrated.

Two scaling laws, one model

o1 revealed something striking: reasoning models improve along two compute axes at once. More train-time RL compute (more reasoning practice during training) makes the model better. And more test-time compute (letting it think longer at inference) also makes it better — both follow smooth, predictable scaling curves. You can spend compute to train a better reasoner, and spend compute to let that reasoner think longer, and both pay off. This doubled the levers available for pushing capability.

The visible thinking

A practical hallmark: these models produce a long internal monologue before answering — often thousands of tokens of “Let me try… no wait, that's wrong… let me reconsider…” — then a concise final answer. That monologue is the test-time compute being spent, and its self-correction (“wait, that's wrong”) is the learned backtracking that plain models lack. The model trained itself to do, in one stream, what we previously had to orchestrate with external search.

See it: the two-axis scaling

Adjust both knobs: train-time RL compute (how well the model learned to reason) and test-time thinking (how long it's allowed to deliberate now). Accuracy rises with both. A well-trained reasoner thinking briefly can match a lightly-trained one thinking hard — the two axes trade off, and you get to spend on whichever is cheaper.

[Interactive widget: Reasoning Models: Train-Time RL × Test-Time Thinking. Accuracy as a surface over two axes; more RL training and more thinking time both raise it. The dot shows the current operating point. Controls: train-time RL compute, test-time thinking.]

Common misconception. “o1 is just GPT-4 with chain-of-thought prompting.” No — prompting a base model to “think step by step” helps a little, but a reasoning model was RL-trained to generate long, self-correcting reasoning, which it does far better and more reliably than any prompt can elicit. The backtracking, the self-verification, the knowing-when-to-think-longer — those are learned skills, not prompt tricks. That's why reasoning models leap ahead on hard math and code while base models plateau.

What is fundamentally different about how a reasoning model like o1 uses test-time compute?

It uses a bigger external verifier
It was RL-trained to produce long, self-correcting reasoning natively: it learned to deliberate, backtrack, and decide how long to think, rather than having search orchestrated externally
It stores all answers in memory

Chapter 7: Train-Time vs. Test-Time — Where to Spend

Now the economic heart of it. You have a compute budget. You can spend it making the model bigger and better-trained, or you can spend it letting a smaller model think longer at inference. These trade off against each other — and the right split depends on how you'll use the model. This is one of the most consequential questions in AI today.

The trade, made concrete

A landmark finding (DeepMind, 2024): for many problems, you can cut model size and make up the lost capability with test-time compute — a smaller model that thinks harder matches a bigger one that answers instantly. But the trade isn't free or unlimited. On easy problems, extra thinking is wasted — a bigger model is better. On hard problems within the model's reach, test-time compute is remarkably effective — thinking wins. And on problems beyond the model's capability, no amount of thinking helps — you need the bigger model.

The asymmetry that decides it: training is paid once, inference is paid forever. Training compute is a one-time cost. Test-time compute is paid every single time the model answers. So the split depends massively on usage. Serving billions of queries? Amortize a big training run — make the model great so each cheap inference is great. Running a few extremely hard queries (frontier math, research)? Spend lavishly on test-time thinking per query. The same capability has wildly different optimal splits depending on how many times you'll run it.

Compute-optimal allocation

This reframes the famous training scaling laws. The old question was “given a training budget, what's the best model size and dataset?” The new question adds a dimension: “given a total budget across training AND all future inference, how should we split it?” For a model serving heavy traffic, the answer shifts toward bigger models (training amortizes). For reasoning-heavy, low-volume use, it shifts toward smaller models with big test-time budgets. There's no universal answer — only an optimal split for your deployment.
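The amortization argument reduces to a break-even calculation. The numbers below are hypothetical, in arbitrary compute units, and the sketch assumes both options reach the same accuracy on the target problems, which is the case where the trade applies at all.

```python
def total_cost(train_cost, cost_per_query, n_queries):
    """One-time training cost plus inference cost paid on every query."""
    return train_cost + cost_per_query * n_queries

# Hypothetical numbers, for illustration only.
big_model = {"train_cost": 1000, "cost_per_query": 1}      # answers instantly
small_thinker = {"train_cost": 100, "cost_per_query": 10}  # thinks hard each time

def cheaper_option(n_queries):
    big = total_cost(n_queries=n_queries, **big_model)
    small = total_cost(n_queries=n_queries, **small_thinker)
    if big == small:
        return "tie"
    return "big_model" if big < small else "small_thinker"

for n in (10, 100, 1_000_000):
    print(n, cheaper_option(n))
```

With these numbers the two options cost the same at 100 queries. Below that, the small model with heavy thinking is cheaper; above it, the bigger model's training cost amortizes and it wins by a growing margin.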

See it: splitting a fixed budget

You have a fixed total compute budget. Slide how it splits between training (model quality) and test-time (thinking per query), and set the problem difficulty. Watch resulting accuracy: on easy problems the peak sits toward training; on hard problems it shifts toward test-time thinking. There's a sweet split, and it moves with difficulty.

[Interactive widget: Allocating a Fixed Budget: Train vs. Test. The curve shows accuracy across all splits of a fixed total budget; the marker is the chosen split. Controls: budget split (train vs. test), problem difficulty.]

Common misconception. “Test-time compute makes big models obsolete — just use a small model and think harder.” Only sometimes. For high-volume serving, the per-query cost of heavy thinking dwarfs the one-time cost of a bigger model — so big models still win. And no amount of thinking rescues a problem beyond the model's reach. Test-time compute is a powerful complement to scale, not a replacement for it. The future is both, allocated wisely.

Why does the optimal split between train-time and test-time compute depend heavily on deployment, not just the model?

Because inference is always cheaper than training
Training is a one-time cost while test-time compute is paid on every query, so high-volume serving favors bigger models, while low-volume hard queries favor heavy thinking
Because bigger models can't use test-time compute

Chapter 8: Limits — When Thinking Hurts

Test-time compute is powerful but not a free lunch, and the honest failure modes are as important as the wins. Three things can go wrong: thinking too much, gaming the verifier, and applying it to the wrong kind of problem.

Overthinking: the inverted U

More thinking is not monotonically better. Past a point, accuracy can fall. A reasoning model given an enormous budget may “talk itself out of” a correct early answer — second-guessing, introducing errors in later steps, or wandering into needless complexity on a problem it had already solved. The accuracy-vs-thinking curve is often an inverted U: rising, peaking, then declining. Knowing when to stop thinking is itself a skill, and overthinking wastes compute and can lower accuracy.

Why overthinking happens. Each extra reasoning step is another chance to make a mistake. On an easy problem the model solves in two steps, forcing twenty more steps just adds twenty chances to introduce an error or rationalize a wrong revision. The optimal thinking length is matched to the problem's difficulty — and a fixed, large budget over-thinks the easy ones. Good reasoning models learn to think briefly on easy problems precisely to avoid this.
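A toy model reproduces the inverted U from exactly this mechanism. It is an illustration, not a fitted law: the chance of having reached the answer is assumed to grow as 1 - exp(-steps / difficulty), and every step is assumed to carry an independent 2% chance of spoiling the result.

```python
import math

def toy_accuracy(steps, difficulty, step_error=0.02):
    """Toy accuracy: chance of reaching the answer times chance of not spoiling it."""
    p_solved = 1 - math.exp(-steps / difficulty)  # more steps: more likely solved
    p_unspoiled = (1 - step_error) ** steps       # each step may introduce a mistake
    return p_solved * p_unspoiled

def best_budget(difficulty, max_steps=200):
    """Thinking length (in steps) that maximizes the toy accuracy."""
    return max(range(1, max_steps + 1), key=lambda k: toy_accuracy(k, difficulty))

for d in (2, 10):
    print(d, best_budget(d))
```

In this model the best budget is finite for every difficulty and sits further right for harder problems, and accuracy at the maximum budget is well below the peak.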

Verifier hacking

When you optimize against a verifier (best-of-N, search, or RL reward), the generator learns to produce solutions that score well — which is not always the same as solutions that are correct. If the verifier has blind spots, the model finds them: confident-sounding but wrong reasoning, answers formatted to please the reward model, exploiting quirks in how the verifier scores. This is reward hacking applied to reasoning, and it's why a verifier must be robust: the harder you optimize against it, the more its flaws get exploited.

Reasoning vs. knowledge

Test-time compute helps reasoning — problems where the answer can be worked out or verified from what the model knows. It does little for pure knowledge: if the model doesn't know a fact (a specific date, an obscure name), no amount of thinking will conjure it. You can't deliberate your way to information you never learned. This is why the dramatic test-time-compute gains show up on math, code, and logic — verifiable reasoning — and barely on trivia. Match the tool to the task.

Situation	Test-time compute verdict
Hard math / code / logic (verifiable)	Big win — the ideal case
Easy problems	Skip it — risk of overthinking, wasted cost
Pure factual recall	No help — thinking can't create missing knowledge
High-volume serving	Costly — paid per query; weigh against a bigger model
Latency-critical	Careful — long thinking means slow answers

See it: the overthinking curve

Vary the thinking budget on a problem of a given difficulty and accuracy rises to a peak, then falls as the model overthinks if the budget keeps growing. The peak sits further right for harder problems (they need more thinking) and further left for easy ones (they're quickly solved and then spoiled by extra steps). The sweet spot is real, and it moves.

[Interactive figure: The Overthinking Curve. Accuracy vs thinking budget, an inverted U. Too little = unsolved; too much = talked out of the right answer. The peak shifts with difficulty. Controls: thinking budget, problem difficulty.]
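The inverted U and its moving peak can be reproduced with a two-factor toy model. This is a sketch under stated assumptions, not a fit to any real model: the chance of having solved the problem grows as 1 - exp(-budget / difficulty), and every step carries a fixed chance `p_spoil` of ruining a correct answer. The names `toy_accuracy` and `best_budget`, and the choice to measure budget in steps, are hypothetical.

```python
import math


def toy_accuracy(budget: float, difficulty: float, p_spoil: float = 0.02) -> float:
    """Toy accuracy after `budget` reasoning steps on a problem that needs
    roughly `difficulty` steps: P(solved by now) * P(never spoiled)."""
    p_solved = 1.0 - math.exp(-budget / difficulty)
    p_unspoiled = (1.0 - p_spoil) ** budget
    return p_solved * p_unspoiled


def best_budget(difficulty: float, max_budget: int = 200) -> int:
    """Integer budget in [1, max_budget] with the highest toy accuracy."""
    return max(range(1, max_budget + 1), key=lambda b: toy_accuracy(b, difficulty))


for difficulty in (2.0, 10.0, 40.0):
    peak = best_budget(difficulty)
    print(f"difficulty {difficulty:>4}: peak at budget {peak}, "
          f"accuracy {toy_accuracy(peak, difficulty):.2f}, "
          f"at budget 200: {toy_accuracy(200, difficulty):.2f}")
```

In this toy, the best budget grows with difficulty, and accuracy at the maximum budget of 200 steps sits well below the peak for every difficulty.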

Common misconception. “A reasoning model should always be told to think as long as possible.” Forcing maximum thinking wastes money (you pay per token), increases latency, and can lower accuracy through overthinking. The frontier of the field is teaching models to spend just enough compute — adaptive thinking that matches effort to difficulty — rather than always maxing it out. More is not always better; right-sized is better.
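One simple way to spend just enough compute is to stop sampling as soon as the votes are decisive. The sketch below uses early-stopped majority voting as a stand-in for adaptive thinking; it is not how reasoning models adapt internally, where the behavior is learned. `adaptive_vote`, its parameters, and the two stub samplers are hypothetical.

```python
import random
from collections import Counter
from typing import Callable


def adaptive_vote(sample_answer: Callable[[], str],
                  max_samples: int = 32,
                  min_samples: int = 3,
                  lead_to_stop: int = 3) -> tuple[str, int]:
    """Draw answers one at a time; stop once the top answer leads the
    runner-up by `lead_to_stop` votes or the budget is exhausted.
    Returns (winning_answer, samples_used)."""
    if max_samples < 1:
        raise ValueError("max_samples must be at least 1")
    votes = Counter()
    used = 0
    for used in range(1, max_samples + 1):
        votes[sample_answer()] += 1
        ranked = votes.most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if used >= min_samples and ranked[0][1] - runner_up >= lead_to_stop:
            break
    return votes.most_common(1)[0][0], used


rng = random.Random(0)  # fixed seed so the demo is repeatable


def easy_sampler() -> str:
    return "4"  # stub: the model gives the same answer every time


def hard_sampler() -> str:
    return rng.choice(["17", "17", "17", "19", "23"])  # stub: answers disagree


easy_answer, easy_used = adaptive_vote(easy_sampler)
hard_answer, hard_used = adaptive_vote(hard_sampler)
print(f"easy: answer {easy_answer} after {easy_used} samples")
print(f"hard: answer {hard_answer} after {hard_used} samples")
```

The easy stub stops after the minimum of three samples. The hard stub can never stop sooner and usually needs more draws, so effort tracks disagreement rather than a fixed maximum.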

Why can giving a reasoning model an enormous thinking budget sometimes lower its accuracy?

The model runs out of memory
Longer answers always score worse
Overthinking: each extra step is another chance to introduce an error or second-guess a correct answer, so past the peak the accuracy curve declines

Chapter 9: Connections & Cheat Sheet

You now have the full landscape: why thinking longer is a second scaling axis, the simple lever of self-consistency voting, best-of-N with verifiers, chain of thought as literal computation, tree search over reasoning, the strategy-comparison curves, RL-trained reasoning models like o1, the train-versus-test economic tradeoff, and the limits where thinking stops helping. The thread: capability is no longer a single number baked into the weights — it's a curve over how much compute you let the model spend thinking, and the art is spending it well.

The methods at a glance

Method	How it spends compute	Needs
Self-consistency	sample N, majority vote	nothing extra
Best-of-N	sample N, verifier picks best	a verifier (ORM)
Chain of thought	reasoning tokens within one attempt	prompting / training
Tree search	explore + prune reasoning steps	a process reward model (PRM)
Reasoning model (o1)	learned long self-correcting CoT	RL training on verifiable rewards
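The first two rows of the table differ only in how the winner is picked from the same N samples. Here is a minimal sketch, assuming the final answers have already been extracted as strings. The sample list and the `checks_out` verifier are hypothetical; a real outcome reward model would be a learned scorer, not an exact check.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(answers: Sequence[str]) -> str:
    """Majority vote over final answers; needs nothing but the samples."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(answers: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the answer that the verifier scores highest."""
    return max(answers, key=score)


def checks_out(answer: str) -> float:
    """Stand-in verifier for 'find x > 0 with x * x = 1764':
    1.0 if the answer squares to 1764, else 0.0."""
    return 1.0 if int(answer) ** 2 == 1764 else 0.0


# Hypothetical final answers from N = 7 sampled chains of thought.
samples = ["41", "42", "41", "40", "42", "41", "17"]

print("self-consistency picks:", self_consistency(samples))
print("best-of-N picks:       ", best_of_n(samples, checks_out))
```

In this made-up pool the most common answer is wrong, so voting returns "41" while the verifier rescues the minority answer "42". Voting only works when the correct answer is also the most frequent one; a good verifier removes that requirement, which is why it appears in the "Needs" column.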

The cheat sheet

Test-time compute: spend compute at inference, not training, to get a better answer

Self-consistency: correct answers converge, wrong ones scatter → majority vote wins

Generator–verifier gap: checking is easier than generating → best-of-N works

ORM vs PRM: score the final answer vs score every step (PRM enables search)

CoT: each token = one forward pass; writing steps = more compute + working memory

Two scaling axes: train-time RL (better reasoner) AND test-time thinking both help

Train is paid once, test is paid per query: deployment volume decides the split

Limits: overthinking (inverted U), verifier hacking, no help for pure knowledge
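The "paid once vs paid per query" line turns into simple arithmetic. The sketch below assumes two options that reach the same accuracy and differ only in cost; the `Option` class, the cost units, and every number are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    name: str
    train_cost: float       # paid once
    cost_per_query: float   # paid on every query

    def total(self, n_queries: int) -> float:
        return self.train_cost + self.cost_per_query * n_queries


def break_even(a: Option, b: Option) -> float:
    """Query volume at which the two options cost the same."""
    return (b.train_cost - a.train_cost) / (a.cost_per_query - b.cost_per_query)


# Hypothetical costs in arbitrary units.
small_thinker = Option("small model, long thinking", train_cost=1_000_000, cost_per_query=0.05)
big_fast = Option("big model, short answers", train_cost=10_000_000, cost_per_query=0.005)

crossover = break_even(small_thinker, big_fast)
print(f"break-even at about {crossover:,.0f} queries")
for n in (1_000_000, 1_000_000_000):
    cheaper = min((small_thinker, big_fast), key=lambda o: o.total(n))
    print(f"{n:>13,} queries: cheaper option is '{cheaper.name}'")
```

With these numbers the crossover is 200 million queries: below it the small model that thinks longer is cheaper overall, above it the bigger model wins because its training cost has been amortized.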

A decision guide

Verifiable problem (math/code)?

Yes → test-time compute is a big win. No → limited help.

↓

Have a good verifier/PRM?

Yes → best-of-N or tree search. No → self-consistency voting.

↓

High query volume?

Yes → lean toward a bigger model (training cost amortizes over many queries). No (low volume, hard problems) → spend on thinking.

↓

Need top reasoning out of the box?

Yes → use a reasoning model (o1-style): thinking is trained in and adapts to difficulty.
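The guide above can be read as four yes/no questions asked in order. Here is one possible encoding, a sketch rather than a rule: the function name `recommend`, its parameter names, and the exact advice strings are hypothetical paraphrases of the chart.

```python
def recommend(verifiable: bool, good_verifier: bool,
              high_volume: bool, want_top_reasoning: bool) -> list[str]:
    """Walk the four questions in order and collect one piece of advice each."""
    advice = []
    advice.append("test-time compute is a big win" if verifiable
                  else "expect limited help from test-time compute")
    advice.append("use best-of-N or tree search" if good_verifier
                  else "use self-consistency voting")
    advice.append("lean toward a bigger model" if high_volume
                  else "spend on thinking")
    if want_top_reasoning:
        advice.append("use an o1-style reasoning model")
    return advice


# Example: a verifiable task, no trained verifier, low traffic.
for line in recommend(verifiable=True, good_verifier=False,
                      high_volume=False, want_top_reasoning=False):
    print("-", line)
```

Real decisions blend these factors rather than answering them independently, so treat the output as a checklist of considerations.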

Where this connects

CS224N: Reasoning I & Reasoning II — the lecture-grade treatment of CoT, self-consistency, and reasoning.
CS224R: RL for Reasoning — how o1-style models are trained with RL on verifiable rewards.
Loss Functions — reward models (ORM/PRM) are trained classifiers; the verifier is a learned scorer.
Reward & Alignment — verifier hacking is reward hacking; robust rewards matter here too.
Mixture of Experts — the other big efficiency lever: MoE scales capacity cheaply, test-time compute extracts capability cheaply.
GPT & Transformer — the per-token forward pass that makes “each reasoning token = a slice of compute” literally true.

The one thing to remember. For most of deep learning's history, a model's intelligence was fixed the moment training ended — you got one forward pass per answer, take it or leave it. Test-time compute broke that: a fixed model can be made dramatically smarter on hard, verifiable problems by letting it think, sample, verify, and search. Capability became a dial you can turn at inference. The frontier now is spending that compute wisely — enough to solve, not so much that you overthink, and only on problems where reasoning, not knowledge, is the bottleneck.

You have a fixed, already-trained model and a hard but verifiable math benchmark it scores 40% on. Which is the soundest way to raise the score without retraining?

Nothing: accuracy is fixed once training ends
Ask it once at temperature 0 and accept the answer
Spend test-time compute: sample many chains-of-thought and aggregate with self-consistency or a verifier (best-of-N), optionally with guided search, extracting more capability from the same weights

“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” — and a model, given more time to think, learns to sharpen its reasoning before it answers.