← Gleams
Stanford CS 224R · Lecture 10 · Noam Brown (OpenAI)

RL for LLMs: Reasoning

Scaling training made GPT-4. Scaling inference made o1. The next frontier is thinking harder, not just knowing more.

Test-time compute MCTS from scratch GRPO derived Worked cost examples
Roadmap

What You'll Master

Chapter 01

Scaling Laws Revisited

The last decade of AI can be told as a story of scaling: more data, more compute, better results. The numbers are staggering.

ModelYearParamsTraining Cost (est.)
GPT-220191.5B~$4,600
GPT-32020175B~$4.6M
GPT-42023~1.8T (MoE)~$50M+

Each generation: roughly 10× more compute, dramatically better capabilities. Chinchilla scaling laws (Hoffmann et al., 2022) formalize this: loss decreases as a power law in both parameters and tokens seen.

The Christmas Party Test

Consider this problem from the lecture: "You're organizing a Christmas party at work. 5 people have dietary restrictions. The caterer needs 2 weeks notice. Today is December 1st. When should you contact the caterer, and what information do you need from the 5 people?"

ModelResponse Quality
GPT-2Incoherent — can't follow multi-step instructions
GPT-3Understands the question but gives wrong date math
GPT-3.5Reasonable structure, still makes logical errors
GPT-4Correct: contact by Dec 11, collect restrictions first

More scale → better reasoning. But is this the whole story?

Scaling Laws VisualizationInteractive
The Scaling Hypothesis

Performance improves log-linearly with compute. A 10× increase in training FLOPs yields a roughly constant improvement in benchmark accuracy. This held from 2019–2023. But the question Brown poses: what happens when you run out of training data, or the cost becomes prohibitive?

Chapter 02

Is Scaling All You Need?

Here's a devastating failure case. Give GPT-4 — the most powerful language model of its era — a game of tic-tac-toe. Not chess (where it can lean on memorized openings). A game so simple that a child can play it perfectly.

GPT-4 fails. It makes illegal moves. It doesn't notice wins. It doesn't block obvious threats.

The Failure Mode

LLMs excel at tasks where the answer exists in training data (or is pattern-matchable from training data). They struggle when they need to reason through novel combinations — even simple ones — that weren't explicitly in the training set. Tic-tac-toe has ~5,000 legal game states. Chess has ~1044. The LLM memorized chess positions but can't generalize game logic.

Why Pure Pattern Matching Fails

Consider what the model is doing internally. Given a board state, it predicts the most likely next token in its training distribution. For chess, there's enormous amounts of chess game notation (PGN) in the training data — it's essentially doing lookup. For tic-tac-toe played in natural language? Almost no training data exists in that format.

The deeper issue: there's no mechanism for the model to "think" about consequences. It produces output in a single forward pass. No lookahead. No backtracking. No search.

Pattern Matching vs. ReasoningInteractive
The Key Insight

Training-time compute buys you knowledge (compressed patterns from data). But knowledge alone doesn't give you reasoning (the ability to explore, evaluate, and select among possibilities at decision time). For that, you need compute at inference time.

Chapter 03

Reasoning as Test-Time Compute

Here's the paradigm shift. Instead of only scaling training (bigger model, more data), what if we also scale inference? Let the model "think longer" on hard problems.

This is exactly what OpenAI's o1 does. And the results are remarkable:

Concrete Result

AIME 2024 (competition math, 30 problems): GPT-4 scores ~12%. o1 with high test-time compute scores ~83%. Same base knowledge, but o1 can reason through multiple solution paths before committing to an answer.

Two Dimensions of Scaling

The Two Axes Performance = f(Training Compute, Test-Time Compute)

Traditional scaling: increase training compute (data × params × steps)
New axis: increase test-time compute (tokens generated before answer)

Why was this overlooked? Because for years, inference was cheap. Training GPT-4 cost ~$50M. Running it cost fractions of a cent per query. The economics didn't demand optimization of inference efficiency. But reasoning models change this calculus: o1 on a hard math problem might generate 10,000+ tokens internally before producing a 100-token answer.

Training vs. Test-Time Compute TradeoffInteractive
The Economic Insight

Training compute is a fixed cost (pay once, use forever). Test-time compute is a variable cost (pay per query). The optimal split depends on query volume: high-volume easy queries → invest in training. Low-volume hard queries → invest in test-time reasoning.

Chapter 04

Historical Precedent: Games

The idea of spending compute at decision time is not new. It's been the dominant strategy in game AI for decades. Brown argues this is no coincidence — it's a fundamental principle.

TD-Gammon (Tesauro, 1994)

TD-Gammon learned to play backgammon through self-play using temporal-difference learning. Its value network alone played at an intermediate level. But add just 2–3 ply of lookahead (searching 2–3 moves ahead) and it became world-class. The neural network provided evaluation; the search provided reasoning.

Deep Blue (Campbell et al., 1997)

Deep Blue defeated Kasparov not because it had a better evaluation function than earlier programs, but because it searched 200 million positions per second. Alpha-beta pruning with 12+ ply search. It spent minutes per move — enormous test-time compute for 1997.

Definition
Ply

One half-move in game tree search. A "3-ply search" means looking 3 half-moves ahead (e.g., my move, opponent's response, my next move). Depth doubles the computational cost exponentially: branching factor b to depth d = bd nodes.

Pattern Across Decades

1994: neural eval + shallow search = superhuman backgammon. 1997: handcrafted eval + deep search = superhuman chess. 2016: neural eval + MCTS search = superhuman Go. 2024: LLM + reasoning chains = superhuman math/coding. The recipe is always: learned evaluation + search at test time.

Chapter 05

Poker: Search Beats Scale

Brown's own research provides the clearest evidence. The Annual Computer Poker Competition (ACPC) was the proving ground. Two key milestones:

Claudico (2015): No Search

Claudico played heads-up no-limit Texas Hold'em against top professionals. It used a fixed strategy (no real-time adaptation). Result: lost by 9.16 big blinds per 100 hands (bb/100). Statistically significant defeat.

Libratus (2017): With Search

Libratus added subgame solving — real-time computation at each decision point to refine its strategy. Same base algorithm family. Result: won by 14.7 bb/100, p-value ≈ 0.0002. Crushingly significant victory.

The Libratus Improvement Without search: −9.16 bb/100 (loses to humans)
With search: +14.7 bb/100 (dominates humans)

Delta: 23.9 bb/100 — from search alone

Brown & Sandholm, "Safe and Nested Subgame Solving" — NeurIPS 2017 Best Paper

What is Subgame Solving?

In poker, you can't search the full game tree (10164 decision points in no-limit Hold'em). Subgame solving works by: (1) starting with a coarse blueprint strategy computed offline, then (2) at each actual decision point, zooming into the local subtree and computing a near-optimal strategy in real-time, conditioned on what has happened so far.

Search vs. No-Search: Distance from Nash EquilibriumInteractive
Why This Matters for LLMs

The gap between Claudico and Libratus is the gap between "generate one answer" and "search through possibilities at inference time." It's the same principle o1 applies to reasoning: don't just output the first completion — explore, evaluate, refine.

Chapter 06

Go: AlphaGo Zero & MCTS

Go is the ultimate test case. Branching factor ~250 (vs. ~35 in chess). Traditional alpha-beta search is hopeless. AlphaGo Zero (Silver et al., 2017) combined a neural network with Monte Carlo Tree Search (MCTS) to achieve superhuman play from scratch.

The Numbers

AlphaGo Zero: Search vs. Raw Network Raw policy network (no search): ~3000 Elo
Full AlphaGo Zero (with MCTS): ~5200 Elo

Gap: ~2200 Elo from search alone

To get 5200 Elo without search: estimated ~100,000× more training compute
Empirical finding: 2× test-time search ≈ 2× model parameters
Rate: ~120 Elo per doubling of search compute

MCTS: The Algorithm from Scratch

Monte Carlo Tree Search builds a search tree incrementally through four repeated phases:

Monte Carlo Tree Search (MCTS)
  1. Selection: Starting from root, traverse the tree using UCB1 to select the most promising unexplored path. At each node, pick the child maximizing:
    UCB(s,a) = Q(s,a) + c · √(ln N(s) / N(s,a))
    where Q(s,a) = average value of action a from state s, N(s) = visit count of parent, N(s,a) = visit count of child, c = exploration constant.
  2. Expansion: When selection reaches a leaf node (not yet in the tree), add it to the tree. In AlphaGo Zero: use the policy network πθ(a|s) to initialize prior probabilities of its children.
  3. Simulation (Evaluation): Evaluate the new node. In classic MCTS: random rollout to game end. In AlphaGo Zero: use the value network vθ(s) directly (no rollout needed — the NN IS the evaluation).
  4. Backpropagation: Update Q values and visit counts for all nodes on the path from root to the evaluated leaf. Each ancestor's Q is updated toward the new value estimate.

Repeat for N iterations (more iterations = more test-time compute = better decisions). After all iterations, select the action with the highest visit count from the root.

MCTS Visualization: Selection → Expansion → Evaluation → BackpropInteractive
Why UCB Works

UCB1 balances exploitation (high Q → pick known-good moves) and exploration (low visit count → try under-explored moves). The √(ln N / n) term grows when a child is under-visited relative to its siblings. As visits → ∞, the tree converges to the minimax-optimal strategy. This is the same explore/exploit tradeoff from multi-armed bandits.

🔨 Derivation UCB1 Regret Bound: Why the Logarithm? ✓ ATTEMPTED

UCB1 achieves O(log T) regret over T rounds. This is optimal (Lai & Robbins lower bound). Your task: Explain intuitively why the exploration term uses ln(N) not N, and why this gives logarithmic rather than linear regret.

As total visits N grows, ln(N) grows very slowly. So the exploration bonus √(ln N / ni) eventually becomes negligible for well-visited arms. But for an arm visited only k times while N grows, the bonus √(ln N / k) keeps growing — forcing eventual exploration. The log ensures you explore every arm infinitely often, but the RATE of exploration decays.
The confidence interval for the mean of ni samples shrinks as 1/√ni (Hoeffding). UCB adds this confidence width to the empirical mean. After O(log T) samples of a suboptimal arm, its upper confidence bound drops below the optimal arm's mean — and we stop pulling it. Total regret from each suboptimal arm: O(log T / Δi) where Δi is the gap.

Key idea: UCB pulls arm i when Qi + √(2 ln t / ni) ≥ Q* (the best arm's true mean). By Hoeffding, this "mistake" happens with probability at most t-4 after enough samples. Summing over all t gives a convergent series.

Concretely: a suboptimal arm with gap Δ gets pulled at most O(8 ln T / Δ²) times total. Multiply by Δ for regret contribution: O(8 ln T / Δ). Sum over K arms: total regret = O(Σi:Δi>0 ln T / Δi) = O(K · ln T / Δmin).

For MCTS: This means each branch of the tree converges to its true value at rate O(log N / N). More search → better estimates → better decisions. The logarithmic regret means you waste very few evaluations on bad branches.

Chapter 07

The General Pattern

Across three decades of game AI, the recipe hasn't changed:

The Universal Recipe Superhuman Performance = Learned Value Function + Search at Test Time
DomainValue FunctionSearch MethodTest-Time Compute
BackgammonTD-learned NN2-3 ply expectimaxSeconds/move
ChessHandcrafted evalAlpha-beta, 12+ plyMinutes/move
GoPolicy + value NNMCTS, 1600 simsSeconds/move
PokerBlueprint strategySubgame solvingSeconds/decision
LLMs (o1)Trained LLMChain-of-thought searchSeconds–minutes/query

The value function tells you "how good is this position?" The search tells you "what's the best move from here?" Neither alone is sufficient. A perfect value function without search can't look ahead. A search without a value function can't prune — it must explore everything.

Brown's Thesis

We're now doing for language what we did for games 20 years ago. The LLM is the value/policy network. Chain-of-thought, verification, tree search — these are the search component. The next leap won't come from bigger models alone. It'll come from better search algorithms applied to language.

Chapter 08

Chain of Thought

The simplest form of "test-time compute" for LLMs: make the model show its work. Wei et al. (2022) showed that adding "Let's think step by step" or demonstrating worked examples dramatically improves reasoning.

Standard Prompting vs. CoT

Example: GSM8K

Problem: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many does he have now?"

Standard prompting: Model outputs "11" (correct) or "9" (wrong) — one forward pass, no intermediate reasoning.

CoT prompting: "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11." The step-by-step generation is the reasoning.

Results from the original paper:

BenchmarkStandardChain-of-ThoughtGain
GSM8K17.9%58.1%+40.2%
MultiArith33.8%93.0%+59.2%
SVAMP68.0%79.0%+11.0%

The Emergence Threshold

CoT only works at scale. Models below ~100B parameters show no benefit from CoT prompting — their chains are incoherent and don't improve answers. Above 100B, CoT becomes dramatically effective. This suggests there's a minimum "reasoning capability" needed before you can harness it through prompting.

CoT as Implicit Search

Why does showing work help? Each intermediate step conditions the next step. The model can use its own generated text as working memory. Without CoT, it must compute the answer in a single forward pass (bounded compute). With CoT, it gets T additional forward passes where T = number of generated tokens. More tokens = more compute = harder problems solvable.

CoT: Tokens as ComputeInteractive
Chapter 09

Verification & Majority Vote

CoT gives us one reasoning chain. But what if we generate many chains and pick the best? This is the next level of test-time compute.

Majority Voting (Self-Consistency)

Generate N independent solutions to the same problem. Take the most common final answer. Wang et al. (2022) showed this dramatically improves over single-chain CoT:

Minerva Results (MATH benchmark)

Single sample: 33.6% accuracy. Majority vote @64 samples: 50.3% accuracy. A 50% relative improvement just from sampling more and voting.

Why Majority Vote Works (And Why It Saturates)

Consider a problem where the model gets the right answer with probability p per sample. With N independent samples, the probability that the majority is correct:

Majority Vote Probability P(majority correct) = Σk=⌈N/2⌉N C(N,k) · pk · (1−p)N−k

If p > 0.5: as N → ∞, P → 1 (Condorcet jury theorem)
If p < 0.5: as N → ∞, P → 0 (majority locks onto wrong answer)
If p = 0.5: P = 0.5 regardless of N (no signal to amplify)
The Saturation Problem

Majority vote saturates because it can only select from answers the model actually generates. If a problem requires a creative insight that the model never produces in any of its N samples, voting 1000 times won't help. This is the coverage limitation.

Coverage (pass@k) vs. Consensus

Definition
pass@k (Coverage)

The probability that at least one of k generated solutions is correct. Formally:

pass@k = 1 − C(n−c, k) / C(n, k)

where n = total samples generated, c = number of correct samples among them. Equivalently: pass@k = 1 − (1−p)k when samples are independent with success probability p.

The Large Language Monkeys paper (Brown et al., 2024) showed that coverage grows monotonically with k: more samples = higher chance at least one is right. But consensus (majority vote accuracy) can plateau or even decrease, because wrong answers may cluster.

Coverage vs. Consensus: Why Majority Vote SaturatesInteractive

Best-of-N with a Reward Model

If majority vote saturates, can we do better? Yes — use a reward model (verifier) to score each solution and pick the highest-scored one:

Best-of-N Selection a* = argmaxi ∈ {1,...,N} RM(problem, solutioni)

Expected accuracy ≈ pass@N (if RM is perfect)
In practice: RM quality caps the gain

Best-of-N with a good reward model tracks coverage (which keeps growing) rather than consensus (which saturates). This is strictly better than majority vote when the reward model is well-calibrated.

The Verification Gap

Verification is easier than generation. It's easier to check if a math proof is correct than to generate one. This asymmetry is why reward models work: they need to rank solutions, not produce them. Even an imperfect verifier, when combined with many candidates, dramatically outperforms single-shot generation.

🔨 Worked Example Computing pass@k and Majority Vote Probability ✓ ATTEMPTED

Setup: A model solves a MATH problem correctly 20% of the time (p=0.2). You generate N=16 samples.

Task 1: Compute pass@16 (probability at least one sample is correct).

Task 2: Compute the probability that majority vote (most common answer) is correct, assuming wrong answers are uniformly distributed over 4 distractors (each wrong with probability 0.8/4 = 0.2 per distractor).

pass@k = 1 − (1−p)k = 1 − 0.816. Compute 0.816: 0.82 = 0.64, 0.84 = 0.4096, 0.88 = 0.1678, 0.816 = 0.0281. So pass@16 = 1 − 0.0281 = 0.972.
With p=0.2 for the correct answer and p=0.2 for each of 4 distractors, ALL answers are equally likely per-sample. Majority vote essentially picks randomly among tied answers. Expected correct votes: 16×0.2 = 3.2. Expected per-distractor: also 3.2. The correct answer wins the plurality only ~1/5 of the time — barely better than random!

Task 1: pass@16

pass@16 = 1 − (1 − 0.2)16 = 1 − 0.816

0.816 = (0.84)4 = 0.40964 ≈ 0.0281

pass@16 ≈ 97.2% — almost certainly at least one correct solution exists.

Task 2: Majority vote

This is the tragic case. Each of 5 possible answers (1 correct + 4 wrong) has per-sample probability 0.2. Expected votes per answer: 16 × 0.2 = 3.2. By symmetry, the correct answer wins the plurality ~1/5 = 20% of the time. Majority vote gives almost no improvement over single-shot (20% vs 20%)!

The lesson: pass@16 = 97.2% but majority vote = 20%. The correct answer IS in the samples, but you can't identify it without a verifier. This is exactly why Best-of-N with a reward model dominates majority voting.

Chapter 10

OpenAI o1 and o3

OpenAI's reasoning models are the culmination of everything above: test-time compute scaling applied to LLMs.

o1: Log-Linear Scaling

The key empirical finding: o1's accuracy on AIME (competition math) scales log-linearly with test-time compute. Double the inference budget → roughly constant accuracy improvement.

o1 Scaling (Empirical) AIME accuracy ≈ a + b · log(test_time_compute)

At low compute: ~50% accuracy
At high compute: ~83% accuracy
~10× compute per +10% accuracy

o3: Shifting the Pareto Frontier

o3 improves on o1 not by using more compute, but by using it more efficiently. Same accuracy at lower cost:

ModelAIME ScoreApproximate Cost/Problem
o1-low~65%$0.60
o1-high~83%$4.80
o3-low~76%$0.15
o3-high~91%$1.20
Cost Worked Example

Scenario: You need to solve 30 AIME problems with at least 80% accuracy.

o1-high: 30 × $4.80 = $144, gets ~83% accuracy. Meets threshold.

o3-low: 30 × $0.15 = $4.50, gets ~76% accuracy. Below threshold.

o3-high: 30 × $1.20 = $36, gets ~91% accuracy. Meets threshold at 4× less cost than o1-high.

The Pareto frontier shifted: same quality for 4× less money.

Pareto Frontiers: o1 vs o3Interactive
The Implication

We now have two ways to improve AI: train better base models (shifts the starting point) and search more efficiently at inference (shifts the Pareto frontier). o3 improved on both axes simultaneously. Future progress will come from advancing both.

Chapter 11

GRPO: Group Relative Policy Optimization

How do you actually train a reasoning model? The DeepSeek-R1 paper (2024) introduced GRPO — Group Relative Policy Optimization — a remarkably elegant algorithm that trains reasoning without needing a critic network.

The Problem GRPO Solves

PPO (Proximal Policy Optimization) requires a value function (critic) to estimate advantages. For LLM reasoning, the value function is hard to train: the "state" is a partial chain-of-thought, and the "reward" only comes at the end (correct/incorrect). GRPO eliminates the critic entirely.

Core Idea

Instead of estimating advantages from a learned value function, GRPO estimates them by comparing completions within a group:

GRPO: Setup For each prompt q, sample G completions: {o₁, o₂, ..., oG} ~ πθold(· | q)
Compute rewards: ri = R(q, oi)   (binary: correct/incorrect, or continuous score)

Group-relative advantage:
i = (ri − mean({rj})) / std({rj})

Z-score within the group. No critic needed.

That's it. The advantage of completion i is just how much better (or worse) its reward is compared to the group average, normalized by the group's standard deviation.

The Full GRPO Objective

GRPO Objective JGRPO(θ) = 𝔼q ~ D [ (1/G) Σi=1G min(
   (πθ(oi|q) / πθold(oi|q)) · Âi ,
   clip(πθ(oi|q) / πθold(oi|q), 1−ε, 1+ε) · Âi
) − β · DKLθ || πref) ]

This is the PPO clipped objective, but with group-relative advantages instead of critic-based advantages, plus a KL penalty against a reference policy (the base model before RL).

GRPO Algorithm
  1. Sample prompts: Draw a batch of prompts {q1,...,qB} from the training set.
  2. Generate groups: For each prompt qj, sample G completions from πθold.
  3. Score: Evaluate each completion with the outcome verifier: ri = R(qj, oi). For math: 1 if final answer correct, 0 otherwise.
  4. Compute advantages: Within each group, z-score the rewards: Âi = (ri − μj) / σj.
  5. Policy update: Gradient ascent on JGRPO(θ) using PPO-style clipping.
  6. Update reference: Periodically set θold ← θ. Keep πref fixed.

Why Z-Scoring Works

Consider a hard problem where all completions score 0. The z-score is 0/0 (undefined) — no gradient signal. Good: we shouldn't update on a problem we can't solve at all yet. Consider an easy problem where all score 1. Same thing: no gradient. Also good: nothing to learn.

Gradient signal only flows when there's variance within the group — when some completions succeed and others fail. This automatically focuses learning on problems at the frontier of the model's ability.

Connection to REINFORCE

GRPO is a variant of REINFORCE with a very specific baseline: the group mean. Recall from policy gradients: ∇J ≈ Σ ∇ log π · (R − b). Setting b = mean(Rgroup) and normalizing by std(Rgroup) gives GRPO's advantage. The z-scoring acts as both baseline subtraction AND reward normalization simultaneously.

GRPO: Group Sampling and Advantage ComputationInteractive
🔨 Derivation GRPO Advantage: Unbiasedness and Variance Properties ✓ ATTEMPTED

Task 1: Show that the GRPO advantage Âi = (ri − μ) / σ has zero mean within any group (i.e., Σii = 0).

Task 2: Explain why this zero-mean property means GRPO doesn't suffer from the "all-positive rewards" problem that plagues vanilla REINFORCE.

Task 3: What happens when G=2 (minimum group size)? Derive the advantage formula.

Σi (ri − μ) / σ = (1/σ) · Σi (ri − μ) = (1/σ) · (Σ ri − Gμ) = (1/σ) · (Gμ − Gμ) = 0.
In vanilla REINFORCE without baseline: ∇J = Σ ∇ log π · R. If all R > 0 (e.g., sparse reward = 1 for success), we increase probability of ALL actions — including bad ones. We just increase good ones MORE. But with finite samples, this is extremely noisy. Z-scoring guarantees some advantages are negative, explicitly pushing down below-average completions.

Task 1: Σi=1Gi = Σi (ri − μ)/σ = (1/σ)(Σ ri − Gμ) = (1/σ)(Gμ − Gμ) = 0. ■

Task 2: When advantages sum to zero, the policy update has a "conservation" property: total probability mass pushed UP = total pushed DOWN. Concretely: Σ Âi · ∇ log π(oi) explicitly reinforces some completions while suppressing others. In vanilla REINFORCE with all-positive rewards, the gradient tries to increase ALL action probabilities — a much noisier signal that relies on normalization (softmax) to implicitly decrease others.

Task 3 (G=2): Let r1, r2 be the two rewards. μ = (r1 + r2)/2. σ = |r1 − r2|/√2 (sample std with Bessel correction: actually σ = |r1 − r2|/√2 or |r1−r2|/2 depending on convention).

Using population std: σ = |r1−r2|/2. Then Â1 = (r1 − μ) / σ = ((r1−r2)/2) / (|r1−r2|/2) = sign(r1−r2).

Key insight for G=2: The advantage is just +1 or −1 — a binary signal saying "this one was better/worse than the other." It's a pure pairwise comparison. This is equivalent to DPO (Direct Preference Optimization) on self-generated pairs! As G grows, the signal becomes richer (continuous z-scores) and lower variance.

The DeepSeek-R1 Recipe

DeepSeek-R1 (2024) used GRPO to train a reasoning model from a base LLM with no explicit reasoning capability:

StageWhat HappensKey Detail
1. Cold StartSFT on long CoT examples~5000 human-written reasoning chains
2. GRPO TrainingRL with outcome verificationG=64, math problems with auto-verifiable answers
3. DistillationDistill reasoning into smaller modelsR1-32B → 7B, 14B via SFT on R1 outputs
Why GRPO Over PPO?

PPO needs a critic (value network) of similar size to the policy. For a 70B policy, that's another 70B model just for advantages. GRPO replaces this with sampling (G completions) + z-scoring. The trade: more generation compute per step, but no critic to train, tune, or risk instability from. For reasoning tasks with binary outcomes (correct/incorrect), GRPO's simplicity wins.

Checkpoint — Before the summary
You're training a reasoning model with GRPO. Group size G=8. For a given math problem, 3 out of 8 completions get the right answer. Compute the advantage for a correct completion and an incorrect one. Then explain: if you increased G to 64 and still got 3/8 correct (same ratio: 24/64), would the advantages be the same? Why or why not?
✓ Gate cleared
Model Answer

G=8, 3 correct: Rewards = [1,1,1,0,0,0,0,0]. μ = 3/8 = 0.375. σ = √((Σ(ri−μ)²)/G) = √((3×0.625² + 5×0.375²)/8) = √((3×0.3906 + 5×0.1406)/8) = √((1.172 + 0.703)/8) = √(0.234) = 0.484.

correct = (1 − 0.375)/0.484 = 1.29. Âincorrect = (0 − 0.375)/0.484 = −0.775.

G=64, 24 correct (same 3/8 ratio): Same μ = 0.375, same population σ = 0.484. Same advantages! The z-score depends only on the ratio, not the count.

BUT: the gradient variance is lower with G=64 because we average over more samples. The advantages are the same per-sample, but the overall gradient estimate is 8× lower variance (by CLT: var ∝ 1/G). Larger G = more compute per step but more reliable updates.

Chapter 12

Summary & Open Problems

The Big Picture

The Two Frontiers Training-time: bigger models, more data → better base capabilities
Test-time: search, verification, reasoning chains → better reasoning

2019–2023: dominated by training scaling (GPT-2 → GPT-4)
2024+: test-time scaling emerges as equally important (o1, o3, R1)

When to Think Harder vs. Answer Fast

QuestionThink FastThink Hard
DifficultyEasy / lookupHard / novel
Cost sensitivityHigh volume, cost mattersLow volume, quality matters
LatencyReal-time requirementCan wait seconds/minutes
VerifiabilityHard to verify (creative writing)Easy to verify (math, code)
Example"Translate this sentence""Prove this theorem"

Open Problems

Open Problem
Process Reward Models (PRM) vs. Outcome Reward Models (ORM)

ORMs score final answers only. PRMs score intermediate reasoning steps. PRMs give denser training signal (reward per step, not per chain) but are much harder to train (who labels intermediate reasoning quality?). The optimal balance is unknown.

Open Problem
Compute-Optimal Inference

Given a total inference budget, how do you split it between: (a) number of samples, (b) length per sample (more reasoning tokens), (c) verification passes? This is an active research question with no clean answer.

Open Problem
Generalization of Reasoning

Current reasoning models work best on math/code (easy to verify). Can the same approach work for domains without clear verification signals? Creative reasoning, ethical reasoning, scientific hypothesis generation? The verification bottleneck is the key limitation.

Connections

This LectureConnects To
GRPOPolicy Gradients (REINFORCE + baseline), PPO clipping
MCTSMulti-armed bandits (UCB), tree search
Majority VoteEnsemble methods, Condorcet jury theorem
Test-time computeRLHF, reward modeling
Best-of-NRejection sampling, reward models
The One Sentence

Training gives you knowledge. Search gives you reasoning. The future of AI is learning when and how to think, not just what to know.

The Full Landscape: Training vs. Inference ComputeInteractive