Scaling training made GPT-4. Scaling inference made o1. The next frontier is thinking harder, not just knowing more.
The last decade of AI can be told as a story of scaling: more data, more compute, better results. The numbers are staggering.
| Model | Year | Params | Training Cost (est.) |
|---|---|---|---|
| GPT-2 | 2019 | 1.5B | ~$4,600 |
| GPT-3 | 2020 | 175B | ~$4.6M |
| GPT-4 | 2023 | ~1.8T (MoE) | ~$50M+ |
Each generation: roughly 10× more compute, dramatically better capabilities. Chinchilla scaling laws (Hoffmann et al., 2022) formalize this: loss decreases as a power law in both parameters and tokens seen.
Consider this problem from the lecture: "You're organizing a Christmas party at work. 5 people have dietary restrictions. The caterer needs 2 weeks notice. Today is December 1st. When should you contact the caterer, and what information do you need from the 5 people?"
| Model | Response Quality |
|---|---|
| GPT-2 | Incoherent — can't follow multi-step instructions |
| GPT-3 | Understands the question but gives wrong date math |
| GPT-3.5 | Reasonable structure, still makes logical errors |
| GPT-4 | Correct: contact by Dec 11, collect restrictions first |
More scale → better reasoning. But is this the whole story?
Performance improves log-linearly with compute. A 10× increase in training FLOPs yields a roughly constant improvement in benchmark accuracy. This held from 2019–2023. But the question Brown poses: what happens when you run out of training data, or the cost becomes prohibitive?
Here's a devastating failure case. Give GPT-4 — the most powerful language model of its era — a game of tic-tac-toe. Not chess (where it can lean on memorized openings). A game so simple that a child can play it perfectly.
GPT-4 fails. It makes illegal moves. It doesn't notice wins. It doesn't block obvious threats.
LLMs excel at tasks where the answer exists in training data (or is pattern-matchable from training data). They struggle when they need to reason through novel combinations — even simple ones — that weren't explicitly in the training set. Tic-tac-toe has ~5,000 legal game states. Chess has ~1044. The LLM memorized chess positions but can't generalize game logic.
Consider what the model is doing internally. Given a board state, it predicts the most likely next token in its training distribution. For chess, there's enormous amounts of chess game notation (PGN) in the training data — it's essentially doing lookup. For tic-tac-toe played in natural language? Almost no training data exists in that format.
The deeper issue: there's no mechanism for the model to "think" about consequences. It produces output in a single forward pass. No lookahead. No backtracking. No search.
Training-time compute buys you knowledge (compressed patterns from data). But knowledge alone doesn't give you reasoning (the ability to explore, evaluate, and select among possibilities at decision time). For that, you need compute at inference time.
Here's the paradigm shift. Instead of only scaling training (bigger model, more data), what if we also scale inference? Let the model "think longer" on hard problems.
This is exactly what OpenAI's o1 does. And the results are remarkable:
AIME 2024 (competition math, 30 problems): GPT-4 scores ~12%. o1 with high test-time compute scores ~83%. Same base knowledge, but o1 can reason through multiple solution paths before committing to an answer.
Why was this overlooked? Because for years, inference was cheap. Training GPT-4 cost ~$50M. Running it cost fractions of a cent per query. The economics didn't demand optimization of inference efficiency. But reasoning models change this calculus: o1 on a hard math problem might generate 10,000+ tokens internally before producing a 100-token answer.
Training compute is a fixed cost (pay once, use forever). Test-time compute is a variable cost (pay per query). The optimal split depends on query volume: high-volume easy queries → invest in training. Low-volume hard queries → invest in test-time reasoning.
The idea of spending compute at decision time is not new. It's been the dominant strategy in game AI for decades. Brown argues this is no coincidence — it's a fundamental principle.
TD-Gammon learned to play backgammon through self-play using temporal-difference learning. Its value network alone played at an intermediate level. But add just 2–3 ply of lookahead (searching 2–3 moves ahead) and it became world-class. The neural network provided evaluation; the search provided reasoning.
Deep Blue defeated Kasparov not because it had a better evaluation function than earlier programs, but because it searched 200 million positions per second. Alpha-beta pruning with 12+ ply search. It spent minutes per move — enormous test-time compute for 1997.
One half-move in game tree search. A "3-ply search" means looking 3 half-moves ahead (e.g., my move, opponent's response, my next move). Depth doubles the computational cost exponentially: branching factor b to depth d = bd nodes.
1994: neural eval + shallow search = superhuman backgammon. 1997: handcrafted eval + deep search = superhuman chess. 2016: neural eval + MCTS search = superhuman Go. 2024: LLM + reasoning chains = superhuman math/coding. The recipe is always: learned evaluation + search at test time.
Brown's own research provides the clearest evidence. The Annual Computer Poker Competition (ACPC) was the proving ground. Two key milestones:
Claudico played heads-up no-limit Texas Hold'em against top professionals. It used a fixed strategy (no real-time adaptation). Result: lost by 9.16 big blinds per 100 hands (bb/100). Statistically significant defeat.
Libratus added subgame solving — real-time computation at each decision point to refine its strategy. Same base algorithm family. Result: won by 14.7 bb/100, p-value ≈ 0.0002. Crushingly significant victory.
In poker, you can't search the full game tree (10164 decision points in no-limit Hold'em). Subgame solving works by: (1) starting with a coarse blueprint strategy computed offline, then (2) at each actual decision point, zooming into the local subtree and computing a near-optimal strategy in real-time, conditioned on what has happened so far.
The gap between Claudico and Libratus is the gap between "generate one answer" and "search through possibilities at inference time." It's the same principle o1 applies to reasoning: don't just output the first completion — explore, evaluate, refine.
Go is the ultimate test case. Branching factor ~250 (vs. ~35 in chess). Traditional alpha-beta search is hopeless. AlphaGo Zero (Silver et al., 2017) combined a neural network with Monte Carlo Tree Search (MCTS) to achieve superhuman play from scratch.
Monte Carlo Tree Search builds a search tree incrementally through four repeated phases:
Repeat for N iterations (more iterations = more test-time compute = better decisions). After all iterations, select the action with the highest visit count from the root.
UCB1 balances exploitation (high Q → pick known-good moves) and exploration (low visit count → try under-explored moves). The √(ln N / n) term grows when a child is under-visited relative to its siblings. As visits → ∞, the tree converges to the minimax-optimal strategy. This is the same explore/exploit tradeoff from multi-armed bandits.
UCB1 achieves O(log T) regret over T rounds. This is optimal (Lai & Robbins lower bound). Your task: Explain intuitively why the exploration term uses ln(N) not N, and why this gives logarithmic rather than linear regret.
Key idea: UCB pulls arm i when Qi + √(2 ln t / ni) ≥ Q* (the best arm's true mean). By Hoeffding, this "mistake" happens with probability at most t-4 after enough samples. Summing over all t gives a convergent series.
Concretely: a suboptimal arm with gap Δ gets pulled at most O(8 ln T / Δ²) times total. Multiply by Δ for regret contribution: O(8 ln T / Δ). Sum over K arms: total regret = O(Σi:Δi>0 ln T / Δi) = O(K · ln T / Δmin).
For MCTS: This means each branch of the tree converges to its true value at rate O(log N / N). More search → better estimates → better decisions. The logarithmic regret means you waste very few evaluations on bad branches.
Across three decades of game AI, the recipe hasn't changed:
| Domain | Value Function | Search Method | Test-Time Compute |
|---|---|---|---|
| Backgammon | TD-learned NN | 2-3 ply expectimax | Seconds/move |
| Chess | Handcrafted eval | Alpha-beta, 12+ ply | Minutes/move |
| Go | Policy + value NN | MCTS, 1600 sims | Seconds/move |
| Poker | Blueprint strategy | Subgame solving | Seconds/decision |
| LLMs (o1) | Trained LLM | Chain-of-thought search | Seconds–minutes/query |
The value function tells you "how good is this position?" The search tells you "what's the best move from here?" Neither alone is sufficient. A perfect value function without search can't look ahead. A search without a value function can't prune — it must explore everything.
We're now doing for language what we did for games 20 years ago. The LLM is the value/policy network. Chain-of-thought, verification, tree search — these are the search component. The next leap won't come from bigger models alone. It'll come from better search algorithms applied to language.
The simplest form of "test-time compute" for LLMs: make the model show its work. Wei et al. (2022) showed that adding "Let's think step by step" or demonstrating worked examples dramatically improves reasoning.
Problem: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many does he have now?"
Standard prompting: Model outputs "11" (correct) or "9" (wrong) — one forward pass, no intermediate reasoning.
CoT prompting: "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11." The step-by-step generation is the reasoning.
Results from the original paper:
| Benchmark | Standard | Chain-of-Thought | Gain |
|---|---|---|---|
| GSM8K | 17.9% | 58.1% | +40.2% |
| MultiArith | 33.8% | 93.0% | +59.2% |
| SVAMP | 68.0% | 79.0% | +11.0% |
CoT only works at scale. Models below ~100B parameters show no benefit from CoT prompting — their chains are incoherent and don't improve answers. Above 100B, CoT becomes dramatically effective. This suggests there's a minimum "reasoning capability" needed before you can harness it through prompting.
Why does showing work help? Each intermediate step conditions the next step. The model can use its own generated text as working memory. Without CoT, it must compute the answer in a single forward pass (bounded compute). With CoT, it gets T additional forward passes where T = number of generated tokens. More tokens = more compute = harder problems solvable.
CoT gives us one reasoning chain. But what if we generate many chains and pick the best? This is the next level of test-time compute.
Generate N independent solutions to the same problem. Take the most common final answer. Wang et al. (2022) showed this dramatically improves over single-chain CoT:
Single sample: 33.6% accuracy. Majority vote @64 samples: 50.3% accuracy. A 50% relative improvement just from sampling more and voting.
Consider a problem where the model gets the right answer with probability p per sample. With N independent samples, the probability that the majority is correct:
Majority vote saturates because it can only select from answers the model actually generates. If a problem requires a creative insight that the model never produces in any of its N samples, voting 1000 times won't help. This is the coverage limitation.
The probability that at least one of k generated solutions is correct. Formally:
where n = total samples generated, c = number of correct samples among them. Equivalently: pass@k = 1 − (1−p)k when samples are independent with success probability p.
The Large Language Monkeys paper (Brown et al., 2024) showed that coverage grows monotonically with k: more samples = higher chance at least one is right. But consensus (majority vote accuracy) can plateau or even decrease, because wrong answers may cluster.
If majority vote saturates, can we do better? Yes — use a reward model (verifier) to score each solution and pick the highest-scored one:
Best-of-N with a good reward model tracks coverage (which keeps growing) rather than consensus (which saturates). This is strictly better than majority vote when the reward model is well-calibrated.
Verification is easier than generation. It's easier to check if a math proof is correct than to generate one. This asymmetry is why reward models work: they need to rank solutions, not produce them. Even an imperfect verifier, when combined with many candidates, dramatically outperforms single-shot generation.
Setup: A model solves a MATH problem correctly 20% of the time (p=0.2). You generate N=16 samples.
Task 1: Compute pass@16 (probability at least one sample is correct).
Task 2: Compute the probability that majority vote (most common answer) is correct, assuming wrong answers are uniformly distributed over 4 distractors (each wrong with probability 0.8/4 = 0.2 per distractor).
Task 1: pass@16
pass@16 = 1 − (1 − 0.2)16 = 1 − 0.816
0.816 = (0.84)4 = 0.40964 ≈ 0.0281
pass@16 ≈ 97.2% — almost certainly at least one correct solution exists.
Task 2: Majority vote
This is the tragic case. Each of 5 possible answers (1 correct + 4 wrong) has per-sample probability 0.2. Expected votes per answer: 16 × 0.2 = 3.2. By symmetry, the correct answer wins the plurality ~1/5 = 20% of the time. Majority vote gives almost no improvement over single-shot (20% vs 20%)!
The lesson: pass@16 = 97.2% but majority vote = 20%. The correct answer IS in the samples, but you can't identify it without a verifier. This is exactly why Best-of-N with a reward model dominates majority voting.
OpenAI's reasoning models are the culmination of everything above: test-time compute scaling applied to LLMs.
The key empirical finding: o1's accuracy on AIME (competition math) scales log-linearly with test-time compute. Double the inference budget → roughly constant accuracy improvement.
o3 improves on o1 not by using more compute, but by using it more efficiently. Same accuracy at lower cost:
| Model | AIME Score | Approximate Cost/Problem |
|---|---|---|
| o1-low | ~65% | $0.60 |
| o1-high | ~83% | $4.80 |
| o3-low | ~76% | $0.15 |
| o3-high | ~91% | $1.20 |
Scenario: You need to solve 30 AIME problems with at least 80% accuracy.
o1-high: 30 × $4.80 = $144, gets ~83% accuracy. Meets threshold.
o3-low: 30 × $0.15 = $4.50, gets ~76% accuracy. Below threshold.
o3-high: 30 × $1.20 = $36, gets ~91% accuracy. Meets threshold at 4× less cost than o1-high.
The Pareto frontier shifted: same quality for 4× less money.
We now have two ways to improve AI: train better base models (shifts the starting point) and search more efficiently at inference (shifts the Pareto frontier). o3 improved on both axes simultaneously. Future progress will come from advancing both.
How do you actually train a reasoning model? The DeepSeek-R1 paper (2024) introduced GRPO — Group Relative Policy Optimization — a remarkably elegant algorithm that trains reasoning without needing a critic network.
PPO (Proximal Policy Optimization) requires a value function (critic) to estimate advantages. For LLM reasoning, the value function is hard to train: the "state" is a partial chain-of-thought, and the "reward" only comes at the end (correct/incorrect). GRPO eliminates the critic entirely.
Instead of estimating advantages from a learned value function, GRPO estimates them by comparing completions within a group:
That's it. The advantage of completion i is just how much better (or worse) its reward is compared to the group average, normalized by the group's standard deviation.
This is the PPO clipped objective, but with group-relative advantages instead of critic-based advantages, plus a KL penalty against a reference policy (the base model before RL).
Consider a hard problem where all completions score 0. The z-score is 0/0 (undefined) — no gradient signal. Good: we shouldn't update on a problem we can't solve at all yet. Consider an easy problem where all score 1. Same thing: no gradient. Also good: nothing to learn.
Gradient signal only flows when there's variance within the group — when some completions succeed and others fail. This automatically focuses learning on problems at the frontier of the model's ability.
GRPO is a variant of REINFORCE with a very specific baseline: the group mean. Recall from policy gradients: ∇J ≈ Σ ∇ log π · (R − b). Setting b = mean(Rgroup) and normalizing by std(Rgroup) gives GRPO's advantage. The z-scoring acts as both baseline subtraction AND reward normalization simultaneously.
Task 1: Show that the GRPO advantage Âi = (ri − μ) / σ has zero mean within any group (i.e., Σi Âi = 0).
Task 2: Explain why this zero-mean property means GRPO doesn't suffer from the "all-positive rewards" problem that plagues vanilla REINFORCE.
Task 3: What happens when G=2 (minimum group size)? Derive the advantage formula.
Task 1: Σi=1G Âi = Σi (ri − μ)/σ = (1/σ)(Σ ri − Gμ) = (1/σ)(Gμ − Gμ) = 0. ■
Task 2: When advantages sum to zero, the policy update has a "conservation" property: total probability mass pushed UP = total pushed DOWN. Concretely: Σ Âi · ∇ log π(oi) explicitly reinforces some completions while suppressing others. In vanilla REINFORCE with all-positive rewards, the gradient tries to increase ALL action probabilities — a much noisier signal that relies on normalization (softmax) to implicitly decrease others.
Task 3 (G=2): Let r1, r2 be the two rewards. μ = (r1 + r2)/2. σ = |r1 − r2|/√2 (sample std with Bessel correction: actually σ = |r1 − r2|/√2 or |r1−r2|/2 depending on convention).
Using population std: σ = |r1−r2|/2. Then Â1 = (r1 − μ) / σ = ((r1−r2)/2) / (|r1−r2|/2) = sign(r1−r2).
Key insight for G=2: The advantage is just +1 or −1 — a binary signal saying "this one was better/worse than the other." It's a pure pairwise comparison. This is equivalent to DPO (Direct Preference Optimization) on self-generated pairs! As G grows, the signal becomes richer (continuous z-scores) and lower variance.
DeepSeek-R1 (2024) used GRPO to train a reasoning model from a base LLM with no explicit reasoning capability:
| Stage | What Happens | Key Detail |
|---|---|---|
| 1. Cold Start | SFT on long CoT examples | ~5000 human-written reasoning chains |
| 2. GRPO Training | RL with outcome verification | G=64, math problems with auto-verifiable answers |
| 3. Distillation | Distill reasoning into smaller models | R1-32B → 7B, 14B via SFT on R1 outputs |
PPO needs a critic (value network) of similar size to the policy. For a 70B policy, that's another 70B model just for advantages. GRPO replaces this with sampling (G completions) + z-scoring. The trade: more generation compute per step, but no critic to train, tune, or risk instability from. For reasoning tasks with binary outcomes (correct/incorrect), GRPO's simplicity wins.
G=8, 3 correct: Rewards = [1,1,1,0,0,0,0,0]. μ = 3/8 = 0.375. σ = √((Σ(ri−μ)²)/G) = √((3×0.625² + 5×0.375²)/8) = √((3×0.3906 + 5×0.1406)/8) = √((1.172 + 0.703)/8) = √(0.234) = 0.484.
Âcorrect = (1 − 0.375)/0.484 = 1.29. Âincorrect = (0 − 0.375)/0.484 = −0.775.
G=64, 24 correct (same 3/8 ratio): Same μ = 0.375, same population σ = 0.484. Same advantages! The z-score depends only on the ratio, not the count.
BUT: the gradient variance is lower with G=64 because we average over more samples. The advantages are the same per-sample, but the overall gradient estimate is 8× lower variance (by CLT: var ∝ 1/G). Larger G = more compute per step but more reliable updates.
| Question | Think Fast | Think Hard |
|---|---|---|
| Difficulty | Easy / lookup | Hard / novel |
| Cost sensitivity | High volume, cost matters | Low volume, quality matters |
| Latency | Real-time requirement | Can wait seconds/minutes |
| Verifiability | Hard to verify (creative writing) | Easy to verify (math, code) |
| Example | "Translate this sentence" | "Prove this theorem" |
ORMs score final answers only. PRMs score intermediate reasoning steps. PRMs give denser training signal (reward per step, not per chain) but are much harder to train (who labels intermediate reasoning quality?). The optimal balance is unknown.
Given a total inference budget, how do you split it between: (a) number of samples, (b) length per sample (more reasoning tokens), (c) verification passes? This is an active research question with no clean answer.
Current reasoning models work best on math/code (easy to verify). Can the same approach work for domains without clear verification signals? Creative reasoning, ethical reasoning, scientific hypothesis generation? The verification bottleneck is the key limitation.
| This Lecture | Connects To |
|---|---|
| GRPO | Policy Gradients (REINFORCE + baseline), PPO clipping |
| MCTS | Multi-armed bandits (UCB), tree search |
| Majority Vote | Ensemble methods, Condorcet jury theorem |
| Test-time compute | RLHF, reward modeling |
| Best-of-N | Rejection sampling, reward models |
Training gives you knowledge. Search gives you reasoning. The future of AI is learning when and how to think, not just what to know.