Teaching models to think step by step — from prompting tricks to reinforcement learning.
Here's a simple math problem: "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many do they have?" Most humans get 9 instantly. You subtract 20 from 23 to get 3, then add 6 to get 9. Two steps, trivially obvious.
Now feed that same problem to a large language model with no special prompting. It might say "27." Or "29." Or some other confident, wrong number. Not because the model is stupid — GPT-4 has passed the bar exam, after all — but because the model is trying to jump directly from question to answer without working through intermediate steps.
This is the fundamental tension. LLMs generate tokens left to right, one at a time. Each token prediction is essentially a single "step" of computation. For simple lookups ("What is the capital of France?"), one step is enough. But for multi-step reasoning ("If A implies B and B implies C, and A is true, what about C?"), the model needs to chain multiple logical steps — and it has no built-in mechanism for doing so.
Daniel Kahneman described two systems of thinking: System 1 (fast, automatic, intuitive) and System 2 (slow, deliberate, analytical). A standard LLM prompt is System 1: instant pattern-matching. The entire field of reasoning in LLMs is about giving models System 2 — the ability to slow down, decompose, and work through a problem methodically.
The simulation below shows what happens when a model tries to answer a multi-step math problem directly versus when it's forced to show its work. Click "Direct Answer" to see the model jump to a (wrong) conclusion. Then click "Step-by-Step" to see what happens when we give the model scratch space to think.
Direct prompting jumps to a wrong answer. Step-by-step prompting reveals intermediate reasoning that leads to the correct answer.
The gap is stark. The direct approach treats reasoning as a single token prediction: question in, answer out. The step-by-step approach gives the model "thinking tokens" — intermediate text that decomposes the problem into sub-problems the model can actually solve.
This insight, that models reason better when they write out their reasoning, launched an entire subfield. In 2022, Kojima and colleagues showed that simply adding "Let's think step by step" to a prompt could boost accuracy on the MultiArith arithmetic benchmark from 17.7% to 78.7%. No retraining. No new parameters. Just a different prompt.
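We'll build the full reasoning stack, from simple prompting tricks to reinforcement learning: chain-of-thought prompting, self-consistency, zero-shot CoT, process versus outcome reward models, RL-trained reasoning with DeepSeek-R1, and the GRPO and DAPO algorithms behind it.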
In January 2022, Jason Wei et al. at Google Brain published a deceptively simple idea: instead of showing the model question-answer pairs, show it question-reasoning-answer triples. Give the model examples where the reasoning is written out, and it will learn to write out reasoning too.
This is Chain-of-Thought (CoT) prompting. The "chain" is the sequence of intermediate reasoning steps between the question and the final answer. Each link in the chain is a logical step the model can verify and build upon.
Here's the critical difference. In standard few-shot prompting, you provide input-output examples:
standard prompt
Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: 11.
Q: The cafeteria had 23 apples. Used 20, bought 6 more. How many?
A:
The model sees "question → number" and tries to pattern-match directly. Now compare with CoT prompting:
chain-of-thought prompt
Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: Roger started with 5. He bought 2 cans of 3, so 2 × 3 = 6.
5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. Used 20, bought 6 more. How many?
A:
Now the model sees "question → step 1 → step 2 → ... → answer." It learns to generate its own intermediate steps before committing to a final answer. Each step is a simple operation the model can handle, even if the full problem is complex.
Left: standard prompting jumps from question to answer. Right: CoT prompting generates intermediate steps. Click "Step Through" to watch the reasoning chain build token by token.
The key insight is that CoT gives the model intermediate tokens to compute with. A Transformer processes information through attention and feedforward layers, but each layer has finite computational capacity. By generating intermediate reasoning tokens, the model effectively gets more "layers" of computation, because each generated token becomes context for the next prediction. Feng et al. (2023) formalized this: a constant-depth Transformer that must answer directly is limited to roughly the circuit complexity class TC0, while the same architecture generating a chain of thought can solve problems believed to lie outside that class, such as evaluating arithmetic expressions. Related theoretical work extends this to every problem in P (polynomial time) when the chain may contain polynomially many steps. The intermediate tokens expand the class of problems the model can solve.
Think of it like a student taking an exam. A standard prompt is like asking "write only the final answer." A CoT prompt is like giving them scratch paper. The scratch paper doesn't change their knowledge — it lets them organize and apply it step by step.
There's a beautiful connection to computation theory here. A Turing machine's tape serves the same purpose as CoT tokens: it provides external memory that the finite-state control can read and write during computation. Without the tape, the machine can only solve problems bounded by its finite state space. With the tape, it can solve anything computable. CoT tokens are, in a real sense, a Transformer's Turing tape.
Wei et al. discovered a remarkable property: CoT is an emergent ability of scale. Small models (under ~10B parameters) show no benefit from CoT prompting — they just generate incoherent intermediate steps. But above a critical scale threshold (~100B parameters), CoT dramatically improves performance.
| Model Size | Standard Accuracy (GSM8K) | CoT Accuracy (GSM8K) |
|---|---|---|
| 8B | ~4% | ~4% (no benefit) |
| 62B | ~15% | ~33% |
| 540B (PaLM) | ~17% | ~58% |
Why? Smaller models can't reliably generate correct intermediate steps. CoT requires the model to already know how to do individual sub-steps (addition, comparison, logical deduction). If the sub-steps are wrong, chaining them just compounds errors. Scale gives the model enough capacity to get each sub-step right.
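The compounding effect can be made concrete with a back-of-the-envelope calculation. The sketch below assumes that every step succeeds independently with the same probability, which is a simplification and not something the scaling results above measure; `chain_success_probability` is an illustrative name.

```python
def chain_success_probability(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps are correct, assuming independent steps."""
    if not 0.0 <= p_step <= 1.0:
        raise ValueError("p_step must be in [0, 1]")
    if n_steps < 0:
        raise ValueError("n_steps must be non-negative")
    return p_step ** n_steps


# Per-step accuracies here are made-up values for illustration.
for p_step in (0.70, 0.90, 0.99):
    print(p_step, round(chain_success_probability(p_step, 5), 3))
```

Under this assumption, a model that gets each step right 70% of the time completes a five-step chain correctly only about 17% of the time, while 99% per-step accuracy gives about 95%.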
This scale dependency sparked intense debate. Is CoT a fundamental capability that emerges from scale, or is it a shallow pattern-matching trick that happens to work at large scale? The DeepSeek-R1 results (Chapter 5) provide evidence for the former: when you train with RL, even models that initially can't do CoT learn to generate it. The capability is real, not an artifact — but it requires a minimum level of base competence to bootstrap.
```python
# Standard prompting: one forward pass per answer token
prompt = "Q: 23 - 20 + 6 = ?\nA:"
output = model.generate(prompt, max_tokens=3)
# → " 27" (wrong: pattern-matched without computing)

# CoT prompting: multiple tokens of "scratch work"
prompt = """Q: Roger has 5 balls. Buys 2 cans of 3. How many?
A: 5 + (2 × 3) = 5 + 6 = 11. The answer is 11.
Q: 23 apples. Used 20, bought 6 more. How many?
A:"""
output = model.generate(prompt, max_tokens=50)
# → " 23 - 20 = 3. 3 + 6 = 9. The answer is 9."
# Each intermediate token conditions the next prediction
```
Notice: the model generates "23 - 20 = 3" as text, then conditions on that text to generate "3 + 6 = 9." The intermediate result "3" exists as a token in the context, not as a hidden state. This is crucial — the reasoning is explicit and inspectable.
CoT doesn't help everything equally. The benefit is largest for tasks that require multi-step composition — where the answer depends on chaining several sub-operations. Here's the breakdown:
| Task Type | CoT Benefit | Why |
|---|---|---|
| Multi-step arithmetic | Huge (+40%) | Each arithmetic operation is a sub-step the model can do individually |
| Symbolic reasoning | Huge (+30%) | Logical deduction requires explicit chaining of rules |
| Commonsense reasoning | Moderate (+10%) | Some problems need inference chaining, others are direct lookup |
| Factual QA | Minimal (+2%) | Single-hop retrieval from parameters — no chain needed |
| Translation | None or negative | Already a sequence-to-sequence task; intermediate steps add noise |
The rule of thumb: if a human would need scratch paper to solve the problem, CoT will help. If a human can answer instantly from memory, CoT is wasted tokens.
Chain-of-Thought gives the model scratch paper. But what if the model makes an arithmetic mistake halfway through, or takes a wrong logical turn? A single chain is fragile — one bad link breaks the whole thing. What if we could get multiple opinions?
This is the insight behind Self-Consistency (Wang et al., 2022): sample multiple independent reasoning chains from the model, extract the final answer from each, and take a majority vote. Different chains might make different mistakes, but the correct answer should appear most frequently.
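The algorithm is beautifully simple: sample K reasoning chains at a nonzero temperature, extract the final answer from each chain, and return the answer that appears most often.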
Why does this work? Think of it like asking five mathematicians to solve the same problem independently. Each might take a different approach (algebra, geometry, estimation), and some might make errors. But if four out of five get "9," it's probably 9 — even if the fifth got "7" due to a sign error.
Five reasoning chains sampled for the same question. Each takes a slightly different path. The majority answer (highlighted) is selected as the final output.
Self-consistency provides consistent improvements over single-chain CoT across every benchmark tested:
| Benchmark | CoT (1 chain) | SC (40 chains) | Improvement |
|---|---|---|---|
| GSM8K (math) | 56.5% | 74.4% | +17.9 |
| SVAMP (math) | 79.0% | 86.6% | +7.6 |
| ARC-Challenge | 85.2% | 88.7% | +3.5 |
| StrategyQA | 73.4% | 79.3% | +5.9 |
The cost is obvious: sampling 40 chains costs 40x the compute of a single chain. In practice, most gains come from the first 5-10 samples — the marginal benefit of each additional chain decreases rapidly. After about 20 samples, you're paying a lot more in compute for diminishing returns. The practical sweet spot for most applications is K=5 to K=10.
The basic version of self-consistency uses unweighted majority voting: each chain gets one vote regardless of its confidence. But you can do better. Some chains are more confident than others, because the model assigns higher probability to their tokens. A weighted majority vote gives more weight to chains where the model was more confident, for example by weighting each chain's vote by its length-normalized sequence probability.
In practice, weighted voting helps on the margins but unweighted voting is simpler and nearly as effective. The diversity from multiple chains matters more than the confidence weighting.
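A minimal sketch of confidence-weighted voting follows. It assumes each chain comes with a length-normalized (mean per-token) log-probability and uses its exponential as the vote weight; that weighting scheme, the function name `weighted_vote`, and the numbers are illustrative choices, not a prescribed formula.

```python
import math
from collections import defaultdict


def weighted_vote(answers, mean_logprobs):
    """Return the answer whose chains carry the largest total weight.

    answers:       final answer extracted from each chain
    mean_logprobs: mean per-token log-probability of each chain (assumed input)
    """
    if not answers or len(answers) != len(mean_logprobs):
        raise ValueError("need one log-probability per answer")
    totals = defaultdict(float)
    for answer, logprob in zip(answers, mean_logprobs):
        totals[answer] += math.exp(logprob)  # weight lies in (0, 1]
    return max(totals, key=totals.get)


# Made-up example: three chains say 9, one says 7 with high confidence.
answers = [9, 9, 7, 9, 11]
mean_logprobs = [-0.9, -1.1, -0.2, -1.0, -0.8]
print(weighted_vote(answers, mean_logprobs))  # 9
```

With these numbers the three moderately confident chains still outweigh the single very confident one, which matches the observation that diversity matters more than weighting.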
Here's the elegant argument. Suppose each chain independently gets the right answer with probability p (say p=0.6). What's the probability that the majority of K=5 chains are correct? This is a binomial distribution: P(majority correct) = P(3 or more correct out of 5). For p=0.6, this works out to about 0.68. For p=0.7, it's 0.84. For p=0.8, it's 0.94.
The key insight: majority voting amplifies accuracy. If each individual chain is better than random (p > 0.5), then the majority vote is strictly better than any individual chain, and the improvement grows with K. This is the Condorcet jury theorem applied to language models.
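The numbers in this argument can be reproduced directly from the binomial distribution. The sketch below assumes independent chains and treats each chain as simply right or wrong (so all wrong chains count against the majority); `majority_accuracy` is an illustrative name.

```python
from math import comb


def majority_accuracy(p: float, k: int) -> float:
    """P(more than half of k independent chains are correct), for odd k."""
    if k % 2 == 0:
        raise ValueError("use an odd k so that ties cannot occur")
    needed = k // 2 + 1
    return sum(comb(k, i) * p**i * (1 - p) ** (k - i) for i in range(needed, k + 1))


for p in (0.6, 0.7, 0.8):
    print(p, round(majority_accuracy(p, 5), 2))  # 0.68, 0.84, 0.94

# The amplification grows with K when p > 0.5.
for k in (1, 5, 11, 41):
    print(k, round(majority_accuracy(0.6, k), 3))
```

In practice wrong answers tend to scatter across several values, so a plurality vote can succeed even when fewer than half of the chains are correct; the binary model above is therefore a conservative picture.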
The assumption that matters: chains must make independent errors. If all chains make the same mistake (because the model has a systematic bias), majority voting can't help. This is why temperature is important — it ensures diversity.
Temperature controls the diversity of the sampled chains. At T=0 (greedy), every chain is identical — self-consistency degenerates to single-chain CoT. At T=1.0, chains are highly diverse but more chains contain errors. The sweet spot is around T=0.5-0.7: diverse enough to explore different reasoning paths, constrained enough that most paths are coherent.
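To see why temperature controls diversity, it helps to look at how it reshapes the next-token distribution. The sketch below applies a temperature-scaled softmax to three made-up logits; the values are purely illustrative and not taken from any model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for greedy decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for t in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures push almost all probability onto the top token, so sampled chains look alike; higher temperatures flatten the distribution and let less likely continuations through.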
```python
from collections import Counter

# Self-consistency implementation
def self_consistency(prompt, model, K=40, temp=0.7):
    answers = []
    for _ in range(K):
        chain = model.generate(prompt, temperature=temp)
        answer = extract_answer(chain)  # parse final number
        answers.append(answer)
    # Majority vote
    vote = Counter(answers).most_common(1)[0][0]
    return vote

# Example: K=5 chains give answers [9, 9, 7, 9, 11]
# Majority vote: 9 (appears 3/5 times) ✓
```
Chain-of-Thought (Chapter 1) requires carefully crafted few-shot examples with hand-written reasoning chains. What if you don't have examples for your specific task? What if you just want reasoning on any arbitrary question?
In 2022, Kojima et al. discovered something remarkable: you can trigger chain-of-thought reasoning with zero examples. Just append a single sentence to any question: "Let's think step by step."
That's it. Five words. No examples. No task-specific engineering. The model was already capable of step-by-step reasoning — it just needed permission to do it.
Zero-shot CoT isn't the only trigger phrase. Researchers have tested dozens of alternatives, and the exact wording matters more than you'd expect:
| Prompt Strategy | Trigger Text | Accuracy (MultiArith) |
|---|---|---|
| Standard (no CoT) | none | 17.7% |
| Zero-shot CoT | "Let's think step by step" | 78.7% |
| Alternative 1 | "Let's solve this problem by splitting it into steps" | 72.2% |
| Alternative 2 | "First," | 77.3% |
| Alternative 3 | "Let's think about this logically" | 74.5% |
| Negative | "Let's think step by step but be brief" | 41.5% |
The phrase "Let's think step by step" wins consistently. Why? Most likely because the model's training data contains many instances where this phrase precedes detailed explanations in tutorials, textbooks, and forum answers. The phrase activates an "explain your work" mode that the model already learned during pretraining.
This reveals something profound about how language models store and access capabilities. The model already knew how to reason step by step — it just needed the right prompt to activate that behavior. Think of it like this: during pretraining, the model learned statistical associations between phrases and what follows them. "Let's think step by step" is statistically followed by detailed, structured explanations. By using this phrase, we're essentially setting up the right conditional distribution for the model to sample from.
This has implications beyond reasoning. It suggests that large language models have many latent capabilities that can be unlocked with the right prompt. The model is not a fixed function — it's a family of functions, and the prompt selects which member of that family to use. Prompt engineering is, in a deep sense, function selection.
This also explains why different phrases work differently. "Let's think step by step" activates tutorial-style explanations. "First," activates enumerated lists. "Let's think about this logically" activates a more philosophical, less computational style. Each trigger phrase selects a different distribution over reasoning formats — and some formats are more effective for mathematical problem-solving than others.
The practical lesson: prompt engineering is not magic. It's applied statistics. You're choosing which region of the model's output distribution to sample from, and some regions contain better reasoning strategies than others. The best prompt isn't the cleverest — it's the one that most reliably activates the model's strongest reasoning mode for your specific task type.
Zero-shot CoT actually works in two stages, not one:
Stage 1: Reasoning extraction. Append "Let's think step by step" to the question. The model generates a reasoning chain. Don't extract the answer yet — the model might not have stated a clear final answer.
Stage 2: Answer extraction. Append "Therefore, the answer is" to the generated reasoning. This prompts the model to produce a clean, parseable final answer.
```python
# Stage 1: Generate reasoning
prompt1 = "Q: If a train travels 120 miles in 2 hours, " + \
          "what is its speed in mph?\n" + \
          "A: Let's think step by step."
reasoning = model.generate(prompt1)
# → "The train goes 120 miles in 2 hours.
#    Speed = distance / time = 120 / 2 = 60."

# Stage 2: Extract answer
prompt2 = prompt1 + reasoning + "\nTherefore, the answer is"
answer = model.generate(prompt2, max_tokens=10)
# → " 60 mph."
```
| Criterion | Few-Shot CoT | Zero-Shot CoT |
|---|---|---|
| Setup effort | High (craft examples) | None |
| Accuracy (best case) | Higher (task-specific) | Slightly lower |
| Generalization | Limited to similar tasks | Any task |
| Prompt length | Long (examples + reasoning) | Short (just 5 extra words) |
| Best for | Production, high-stakes | Exploration, diverse tasks |
We've seen that models can generate reasoning chains. But how do we know if a chain is good? There are two fundamentally different ways to evaluate reasoning, and the choice between them has profound implications for how we train and improve models.
Outcome Reward Models (ORMs) look only at the final answer. Did the model get "9"? Then the chain is good, regardless of how it got there. The entire reasoning chain gets a single score: correct or incorrect.
Process Reward Models (PRMs) score each intermediate step independently. Step 1: correct. Step 2: correct. Step 3: error! Even if the model stumbles back to the right final answer, the PRM identifies exactly where the reasoning went wrong.
Consider two reasoning chains for "What is 23 - 20 + 6?":
Chain A: "23 - 20 = 3. 3 + 6 = 9. The answer is 9." — Every step correct.
Chain B: "23 - 20 = 13. 13 - 4 = 9. The answer is 9." — Wrong intermediate steps, right final answer by luck.
An ORM gives both chains the same score. A PRM catches that Chain B got lucky — its reasoning is flawed, and it will fail on similar problems. This distinction is critical: we want models that reason correctly, not models that get lucky.
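The difference between the two views can be illustrated on Chains A and B with a toy checker. The sketch below is not a reward model: it only verifies the arithmetic of each `a op b = c` step with a regular expression, whereas a real PRM is a learned classifier that also judges whether a step follows from the problem. The helper names are illustrative.

```python
import re

STEP = re.compile(r"(\d+)\s*([+\-])\s*(\d+)\s*=\s*(\d+)")


def check_steps(chain: str) -> list[bool]:
    """Process-style view: is each 'a op b = c' step arithmetically valid?"""
    results = []
    for a, op, b, c in STEP.findall(chain):
        expected = int(a) + int(b) if op == "+" else int(a) - int(b)
        results.append(expected == int(c))
    return results


def final_answer(chain: str) -> int:
    """Outcome-style view: read only the stated final answer."""
    match = re.search(r"The answer is (\d+)", chain)
    if match is None:
        raise ValueError("no final answer found")
    return int(match.group(1))


chain_a = "23 - 20 = 3. 3 + 6 = 9. The answer is 9."
chain_b = "23 - 20 = 13. 13 - 4 = 9. The answer is 9."

for name, chain in (("A", chain_a), ("B", chain_b)):
    print(name, final_answer(chain) == 9, check_steps(chain))
# A True [True, True]
# B True [False, True]
```

Both chains pass the outcome check, but the step-level check flags the first step of Chain B. Note that the toy checker accepts "13 - 4 = 9" because the arithmetic is valid, even though that step has nothing to do with the problem; catching that kind of error is what a learned PRM is for.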
A 5-step reasoning chain scored by both models. The PRM assigns per-step scores (green/red), while the ORM only scores the final answer. Toggle between the two to see the difference.
Lightman et al. (2023) at OpenAI built PRM800K — a dataset of 800,000 step-level human labels for math reasoning. Human annotators read each step of a model-generated solution and labeled it as "correct," "incorrect," or "neutral." This is expensive — each solution requires a mathematician to carefully verify every intermediate step — but it produces a reward model that can pinpoint errors with high precision.
The labeling process is careful: annotators are shown the problem and the solution up to step k, then asked "Is step k mathematically valid?" They don't just check the arithmetic — they verify logical coherence, correct use of definitions, and whether the step follows from previous steps. A step like "Since x > 0, we know x² > 0" is labeled correct. A step like "Since x > 0, we know 1/x > 1" is labeled incorrect (it's only true for 0 < x < 1).
The PRM is trained as a classifier: given a reasoning chain up to step k, predict whether step k is correct. At inference time, you can use the PRM to score every step of each sampled chain and keep the chain whose steps score best.
| Method | MATH Accuracy (maj@1) | MATH Accuracy (best-of-1860) |
|---|---|---|
| ORM (outcome supervision) | ~50% | ~72% |
| PRM (process supervision) | ~50% | ~78.2% |
At maj@1 (single sample), ORMs and PRMs perform similarly. The gap opens up at scale: when you sample many chains and use the reward model to select the best one. PRMs are dramatically better at identifying chains with sound reasoning, because they can reject chains that reach the right answer through flawed logic.
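Best-of-N selection itself is a one-liner once a scoring function exists. In the sketch below, `score_chain` stands in for a trained reward model (ORM or PRM); the dictionary of scores is made up so the example runs without any model.

```python
from typing import Callable, Sequence


def best_of_n(chains: Sequence[str], score_chain: Callable[[str], float]) -> str:
    """Return the candidate chain with the highest reward-model score."""
    if not chains:
        raise ValueError("need at least one candidate chain")
    return max(chains, key=score_chain)


# Hypothetical scores a process reward model might assign.
toy_scores = {
    "23 - 20 = 3. 3 + 6 = 9. The answer is 9.": 0.95,
    "23 - 20 = 13. 13 - 4 = 9. The answer is 9.": 0.10,
    "23 + 6 = 29. The answer is 29.": 0.05,
}

best = best_of_n(list(toy_scores), toy_scores.get)
print(best)
```

The quality of the selected chain depends entirely on the scorer: with more candidates, a scorer that cannot tell sound reasoning from lucky answers has more chances to pick a flawed chain.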
The elephant in the room: process labels are expensive. ORM labels are cheap — you just need the correct final answer (often automatically verifiable for math). PRM labels require a human expert to read and judge every single step. For PRM800K, this meant paying mathematicians to annotate 800,000 individual steps. Most teams can't afford this.
Can we generate process labels automatically? Partially. One approach: generate many solutions to the same problem, and for each step, check whether the remaining steps can reach the correct answer. If a step leads to a dead end (no subsequent path reaches the right answer), it's probably wrong. This Monte Carlo estimation approach gives noisy but scalable process labels without human annotators. We'll see more of this in Lecture 13.
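The Monte Carlo idea can be sketched in a few lines. Here `complete` is a placeholder for "sample a continuation from the model and return its final answer"; the toy version below is deterministic so the example runs on its own, and all names are illustrative.

```python
from typing import Callable, List


def mc_step_labels(steps: List[str],
                   complete: Callable[[List[str]], str],
                   correct_answer: str,
                   n_rollouts: int = 8) -> List[float]:
    """For each prefix of the chain, estimate the fraction of rollouts
    that still reach the correct final answer."""
    labels = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(complete(prefix) == correct_answer for _ in range(n_rollouts))
        labels.append(hits / n_rollouts)
    return labels


def toy_complete(prefix: List[str]) -> str:
    # Stand-in for model sampling: a chain that contains the bad step never recovers.
    return "19" if "23 - 20 = 13" in prefix else "9"


good_chain = ["23 - 20 = 3", "3 + 6 = 9"]
bad_chain = ["23 - 20 = 13", "13 + 6 = 19"]
print(mc_step_labels(good_chain, toy_complete, "9"))  # [1.0, 1.0]
print(mc_step_labels(bad_chain, toy_complete, "9"))   # [0.0, 0.0]
```

With a real model the rollouts are stochastic, so the labels are noisy estimates; more rollouts per step reduce the noise at a proportional compute cost.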
```python
# ORM: scores the entire chain at once
def orm_score(chain, correct_answer):
    final = extract_answer(chain)
    return 1.0 if final == correct_answer else 0.0

# PRM: scores each step independently
def prm_score(chain, prm_model):
    steps = split_into_steps(chain)
    scores = []
    context = ""
    for step in steps:
        context += step
        score = prm_model.score(context)  # P(correct | context)
        scores.append(score)
    return min(scores)  # chain is only as strong as weakest step
```
Everything we've seen so far uses prompting to elicit reasoning. Chain-of-thought, self-consistency, zero-shot CoT — these are all inference-time tricks. The model's weights never change. What if we could train the model to reason?
In January 2025, DeepSeek published R1 alongside R1-Zero, a variant that discovered chain-of-thought reasoning through reinforcement learning alone. No human-written reasoning chains. No supervised examples. Just a reward signal ("did you get the right answer?"), and the model independently invented step-by-step reasoning as a strategy to maximize that reward.
The training pipeline has multiple stages, but the breakthrough comes from Stage 1:
The most remarkable finding from DeepSeek-R1 is what they called the "aha moment." During pure RL training (Stage 1, R1-Zero), the model spontaneously developed behaviors that nobody programmed:
Self-reflection: The model learned to re-read its own reasoning, notice errors, and correct them. "Wait, let me reconsider..." appeared naturally in generated text.
Exploration: The model learned to try multiple approaches to a problem. "Let me try a different method..." — essentially discovering self-consistency on its own.
Extended thinking: Reasoning chains grew from tens of tokens to thousands. The model discovered that longer chains (more thinking) led to higher rewards on hard problems.
Watch the RL training loop unfold. Early on, the model gives short, often wrong answers. As training progresses, it discovers that writing out reasoning steps leads to higher rewards. Chain length and accuracy both increase.
Previous approaches to reasoning (CoT prompting, supervised fine-tuning on reasoning chains) all required human-written reasoning examples. Someone had to write out "First, subtract 20 from 23..." for each training problem. This is expensive, it doesn't scale, and it limits reasoning to patterns humans have already demonstrated.
R1-Zero needs none of this. The only human input is a verifier that checks if the final answer is correct (trivial for math: just compare to the known answer). Everything else — the format of reasoning, the length of chains, the strategies for self-correction — emerges from RL exploration. This is a fundamentally different paradigm: instead of imitating human reasoning, the model discovers its own reasoning strategies optimized for its own architecture.
| Behavior | Emerged At | Example |
|---|---|---|
| Step-by-step | ~100 steps | "First, I need to find x..." |
| Self-correction | ~500 steps | "Wait, I made an error. Let me redo step 2." |
| Verification | ~1000 steps | "Let me verify: if x=3, then 3×4=12. Yes." |
| Alternative approaches | ~2000 steps | "Let me try solving this differently..." |
These behaviors were never in the training data as labeled examples. They emerged because RL discovered that models which exhibit these behaviors get higher rewards. This is the power of RL — it doesn't tell the model how to reason, it tells the model what success looks like and lets it figure out the rest.
R1 is enormous — 671B parameters. Too large for most practical deployments. But the reasoning patterns it discovered can be transferred to smaller models through distillation.
The process: generate thousands of reasoning chains from the large R1 model. Use these chains as supervised fine-tuning (SFT) data for smaller models (1.5B, 7B, 14B, 32B, 70B). The small model learns to imitate R1's reasoning style without ever doing RL itself.
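As a rough illustration of the data-preparation step, the sketch below turns teacher outputs into prompt/completion records for SFT. The record layout, the field names, and the choice to drop chains whose final answer is wrong are assumptions made for this example, not a description of DeepSeek's actual pipeline.

```python
def build_sft_records(samples, keep_only_correct=True):
    """Convert teacher samples into prompt/completion pairs for fine-tuning.

    Each sample is a dict with the (assumed) keys:
    'question', 'reasoning', 'answer', 'reference'.
    """
    records = []
    for sample in samples:
        if keep_only_correct and sample["answer"] != sample["reference"]:
            continue  # assumption: discard chains that end in a wrong answer
        completion = f"<think>\n{sample['reasoning']}\n</think>\n{sample['answer']}"
        records.append({"prompt": sample["question"], "completion": completion})
    return records


# Made-up teacher outputs for illustration.
teacher_samples = [
    {"question": "23 - 20 + 6 = ?", "reasoning": "23 - 20 = 3. 3 + 6 = 9.",
     "answer": "9", "reference": "9"},
    {"question": "23 - 20 + 6 = ?", "reasoning": "23 + 6 = 29.",
     "answer": "29", "reference": "9"},
]

records = build_sft_records(teacher_samples)
print(len(records), records[0]["completion"])
```

The student is then fine-tuned with an ordinary next-token loss on these completions, so it sees the teacher's full reasoning text and not only the final answers.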
Results are remarkable: a 14B distilled model matches or exceeds many 70B+ models that weren't trained with reasoning. The distilled 32B model surpasses OpenAI's o1-mini on several math benchmarks. Reasoning ability transfers through imitation, even to models 50x smaller than the teacher.
| Model | Size | AIME 2024 | Method |
|---|---|---|---|
| GPT-4o | Undisclosed | 9.3% | Standard |
| R1-Distill-Qwen-14B | 14B | 69.7% | Distilled from R1 |
| R1-Distill-Qwen-32B | 32B | 72.6% | Distilled from R1 |
| OpenAI o1-mini | Undisclosed | 63.6% | RL-trained |
| DeepSeek-R1 | 671B | 79.8% | RL-trained |
The lesson: you pay the cost of RL training once (on the largest model), then distribute the benefits to smaller models via cheap SFT. This is the reasoning equivalent of the pretrain-then-distill paradigm that has been so successful for general language modeling.
There's a catch, though. Distilled models inherit the patterns of reasoning but not the capacity for it. A distilled 1.5B model can write chains that look like R1's reasoning, but it makes more errors in the intermediate steps because it has fewer parameters to perform each sub-computation. Distillation transfers the format of reasoning, not the depth. For the deepest reasoning, you still need the largest models.
R1-Zero uses two reward components:
```python
# DeepSeek-R1-Zero reward function (simplified)
def reward(response, correct_answer):
    r = 0.0
    # 1. Accuracy reward: did you get the right answer?
    final = extract_answer(response)
    if final == correct_answer:
        r += 1.0
    # 2. Format reward: did you use <think>...</think> tags?
    if has_think_tags(response):
        r += 0.1
    return r

# No reward for reasoning quality: just outcome + format
```
Notice: there's no reward for good reasoning. The model only gets rewarded for correct answers and proper formatting. Yet it discovers that good reasoning is the best strategy for getting correct answers. The reasoning emerges as a means, not an end.
The analogy to evolution is apt. Evolution doesn't design eyes — it rewards survival, and eyes emerge because seeing helps you survive. Similarly, RL doesn't design reasoning — it rewards correct answers, and reasoning emerges because thinking step-by-step maximizes the probability of correctness. Reasoning is the optimal strategy discovered through selection pressure, not design.
DeepSeek-R1 used reinforcement learning to train reasoning, but which RL algorithm? The standard approach is Proximal Policy Optimization (PPO), the same algorithm used in RLHF (Lecture 8). But PPO has a dirty secret: it's expensive. PPO requires a critic model (value function) that's the same size as the policy model — doubling your GPU memory requirement.
DeepSeek introduced Group Relative Policy Optimization (GRPO), which eliminates the critic entirely. GRPO estimates the baseline by sampling a group of responses and using the group's mean reward as the baseline. No separate value network. No extra model to train.
PPO computes the advantage for each token as: A(s,a) = R - V(s), where V(s) is a learned value function. This value function requires a separate neural network (the critic) that maps states to expected future rewards. Training the critic requires its own loss function, learning rate, and GPU memory — essentially doubling the system complexity.
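GRPO replaces V(s) with a simple group statistic. For each question, sample G responses. Compute each response's reward R_i. The advantage for response i is its reward standardized within the group: A_i = (R_i - mean(R_1, ..., R_G)) / std(R_1, ..., R_G).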
That's it. No neural network. No backpropagation for the baseline. Just sample, score, normalize. Responses better than the group average get positive advantages (reinforced). Responses worse than average get negative advantages (discouraged).
PPO requires a separate critic network. GRPO eliminates it by using group statistics. Drag the group size slider to see how GRPO's advantage estimates change with more samples.
DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization; ByteDance, 2025) addresses several failure modes of GRPO:
Problem 1: Entropy collapse. During RL training, the model's output distribution becomes increasingly peaked: it converges on a single reasoning pattern and stops exploring. DAPO counters this with Clip-Higher, which decouples the lower and upper clipping bounds on the policy ratio and raises the upper one. Low-probability tokens then have more room to gain probability, which keeps the model from "forgetting" alternative approaches.
Problem 2: Length bias. GRPO applies a KL penalty per-token, which implicitly penalizes longer responses. Since reasoning often requires more tokens, this discourages the model from thinking more. DAPO removes the KL penalty entirely and relies on the clipped objective alone; it also averages the loss over all tokens in a batch instead of per response, so long chains are not down-weighted. This is critical: we want models to think longer on hard problems. A per-token penalty fights this directly by making every reasoning token "costly" in the loss function.
The result is dramatic. Without the length penalty, DAPO-trained models generate longer, more thorough reasoning chains on hard problems while keeping chains short on easy ones. The model learns to adaptively allocate "thinking budget" based on problem difficulty — exactly the behavior we want.
Problem 3: Reward hacking. Models find shortcuts, like always outputting the most common answer in the training set. DAPO uses dynamic sampling: if all G samples in a group get the same reward (all correct or all incorrect), that question is filtered out and further questions are sampled until the batch is full. A group with identical rewards gives every response an advantage of zero, so training only on questions where some responses succeed and some fail provides the most informative gradient signal.
| Feature | PPO | GRPO | DAPO |
|---|---|---|---|
| Critic network | Required | None | None |
| Baseline | Learned V(s) | Group mean | Group mean + filtering |
| KL penalty | Per-token | Per-token | Removed (clip-only) |
| Entropy management | Manual coefficient | None | Clip-Higher |
| GPU memory | 2x (policy + critic) | 1x | 1x |
| AIME 2024 accuracy | ~30% | ~47% | ~50% |
Entropy collapse deserves special attention because it's the most insidious failure mode. During RL training, the model's output distribution gradually becomes a spike — it converges on a single phrasing, a single reasoning template, a single answer format. Diversity evaporates.
Why is this fatal for reasoning? Because reasoning requires exploration. A model that always starts with "Let me calculate..." will never discover that some problems are better solved by "Let me think about this differently..." or "Wait, what if I work backwards?" DAPO's Clip-Higher counters this by raising the upper clipping bound on the policy ratio, so low-probability tokens can still gain probability and output diversity is maintained throughout training.
The practical impact: DAPO-trained models continue to discover new reasoning strategies late in training, while GRPO-trained models plateau because they've lost the ability to explore alternatives.
```python
# GRPO advantage computation (simplified)
def grpo_advantages(question, policy, reward_fn, G=8):
    # Sample G responses from the current policy
    responses = [policy.sample(question) for _ in range(G)]
    rewards = [reward_fn(r) for r in responses]
    # Group statistics replace the critic
    mu = sum(rewards) / G
    sigma = (sum((r - mu)**2 for r in rewards) / G) ** 0.5
    # DAPO addition: filter if all same reward
    if sigma < 1e-6:
        return None  # skip: no signal
    # Normalize advantages
    advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]
    return list(zip(responses, advantages))
```
Time to see everything in action. The simulation below runs the same math problem through four different reasoning strategies side by side: Standard (direct answer), Chain-of-Thought (single chain), Self-Consistency (multiple chains with majority vote), and RL-trained (DeepSeek-R1-style model with emergent reasoning).
Click "New Problem" to generate a fresh multi-step math question. Then click "Run All" to watch each strategy process the problem simultaneously. Pay attention to the differences: how many tokens each generates, whether it catches its own errors, and whether it reaches the correct answer.
Four strategies on the same problem. Standard barely thinks. CoT writes out steps. Self-Consistency votes across chains. RL-trained self-corrects and verifies.
Standard: Jumps straight to an answer. Often wrong on multi-step problems. Uses the fewest tokens. Think of this as "System 1" thinking — fast, automatic, error-prone.
Chain-of-Thought: Decomposes the problem into steps. Usually gets it right on 2-3 step problems. But a single mistake in the chain propagates to the final answer. No error correction.
Self-Consistency: Runs multiple chains. Even if individual chains have errors, the majority vote usually converges on the correct answer. Token cost scales with the number of chains: 5 chains use ~5x the tokens of a single chain, 40 chains ~40x. The green highlight shows the majority answer.
RL-trained: Generates reasoning similar to CoT, but with self-correction. You'll see phrases like "Wait, let me reconsider" and "Let me verify this." The model learned to double-check its own work through RL training. It uses more tokens than basic CoT but fewer than self-consistency, achieving comparable or better accuracy.
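The voting step in self-consistency is simple enough to sketch in a few lines. The example below assumes the final answer has already been extracted from each chain as a string. The five answers are invented for illustration.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers from five independently sampled chains.
chain_answers = ["9", "9", "8", "9", "29"]
winner, share = majority_vote(chain_answers)

print(winner, share)
```

Two of the five chains went wrong in different ways, so their errors do not pile up on a single wrong answer, while the three correct chains agree.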
| Strategy | Tokens Generated | Accuracy (GSM8K) | Relative Cost (vs. Standard) |
|---|---|---|---|
| Standard | ~5 | ~17% | 1x |
| CoT (1 chain) | ~80 | ~58% | ~16x |
| Self-Consistency (40) | ~3200 | ~74% | ~640x |
| RL-trained (R1) | ~200 | ~94% | ~40x |
The RL-trained model is the clear winner on the efficiency frontier. It achieves the highest accuracy at moderate cost. Self-consistency is the safest (ensemble effect), but expensive. The lesson: training the model to reason (RL) is more cost-effective than forcing it to reason at inference time (prompting).
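The cost column and the "efficiency frontier" claim can both be checked with a little arithmetic. The sketch below takes the approximate token counts and accuracies from the table as given, treats tokens as the only cost, and calls a strategy dominated when another is at least as accurate for no more tokens. That dominance rule is an assumption of this sketch.

```python
# (tokens generated, GSM8K accuracy in %) copied from the table above.
rows = {
    "Standard": (5, 17),
    "CoT (1 chain)": (80, 58),
    "Self-Consistency (40)": (3200, 74),
    "RL-trained (R1)": (200, 94),
}

# Relative cost is tokens divided by the Standard baseline's tokens.
base_tokens = rows["Standard"][0]
relative_cost = {name: tokens / base_tokens for name, (tokens, _) in rows.items()}

def is_dominated(name):
    """True if some other strategy is no more expensive and no less accurate."""
    tokens, acc = rows[name]
    return any(
        other != name and o_tokens <= tokens and o_acc >= acc
        for other, (o_tokens, o_acc) in rows.items()
    )

frontier = [name for name in rows if not is_dominated(name)]

print(relative_cost)
print(frontier)
```

On these numbers, only self-consistency falls off the frontier: the RL-trained row beats it on accuracy with a sixteenth of the tokens. Standard and CoT stay on the frontier because nothing cheaper is more accurate.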
Look carefully at the RL-trained column. It does something the others don't: self-correction. When it detects an intermediate result that seems off, it backs up and tries again. "Wait, let me reconsider" is not window dressing — it's a learned behavior that reduces the error rate.
Self-consistency achieves a similar effect through brute force (many chains + voting), but it pays for every extra chain: five chains cost about 5x the tokens of one, and the 40-chain setup in the table uses ~3200 tokens against ~200 for the RL-trained model. The RL-trained model achieves it with a single chain that has built-in error correction. Think of it as the difference between having five proofreaders (self-consistency) and one careful writer who re-reads their own work (RL-trained).
This is the fundamental trade-off of the reasoning landscape: you can get accuracy through ensemble methods (expensive, simple) or through better training (expensive upfront, cheap at inference). The trend is clearly toward the latter — invest in training once, benefit at every inference.
Reasoning chains look impressive. A model that writes "Let me think... first I calculate X... then Y... therefore Z" feels trustworthy. But appearances can deceive. The most dangerous failure mode in reasoning models is unfaithful reasoning: the model generates a chain that looks logical but doesn't actually correspond to how it arrived at its answer.
Faithful reasoning means the written chain accurately reflects the model's internal computation. If the model writes "23 - 20 = 3, then 3 + 6 = 9," it actually used these intermediate results to compute the final answer. The chain is the real "scratchpad."
Unfaithful reasoning means the model decided on the answer first (perhaps through pattern-matching), then generated a plausible-looking justification after the fact. The chain is a post-hoc rationalization, not a real reasoning process. The closest parallels in psychology are rationalization and motivated reasoning.
How can you tell? Here's the litmus test: if you change a step in the middle of the chain, does the final answer change? In faithful reasoning, it should — each step genuinely depends on the previous one. In unfaithful reasoning, the model might reach the same answer regardless of what intermediate steps you inject.
Toggle between faithful and unfaithful reasoning chains. In the unfaithful chain, notice how the conclusion doesn't logically follow from the (incorrect) intermediate steps — the model just pattern-matched the answer.
1. Unfaithful chains (post-hoc rationalization). The model arrives at an answer through pattern-matching, then constructs a reasoning chain that appears to justify it. The chain is fluent and plausible but doesn't reflect the actual computation. Most common on problems the model has seen similar versions of in training.
2. Error propagation without correction. An early step makes a small error (e.g., 23 - 20 = 2 instead of 3). All subsequent steps faithfully build on this wrong intermediate result, producing a confidently wrong final answer. The chain is internally consistent but externally wrong.
3. Overthinking (reasoning degradation). On simple problems, long reasoning chains can actually hurt accuracy. The model introduces unnecessary complexity, considers irrelevant edge cases, or "talks itself out of" the correct answer. This is especially common with RL-trained models that were incentivized to generate long chains.
4. Sycophantic reasoning. If the user's question contains a hint or suggestion ("I think the answer is 7..."), the model may construct a reasoning chain that arrives at the user's suggested answer, even if it's wrong. The chain is unfaithful in a specific way: it's optimized to agree with the user, not to find the truth.
5. Language-dependent reasoning. Models sometimes reason better in English than in other languages, even when the math is identical. This suggests the reasoning isn't purely logical but depends on the language patterns available in training data. DeepSeek-R1 notably exhibited "language mixing" during pure RL training — it would switch between Chinese and English mid-chain, using whichever language gave it access to better reasoning patterns for a given sub-step.
6. Reward hacking. RL-trained reasoning models can find shortcuts that game the reward function. For example, a model might learn that writing "Let me verify: [repeats the answer]" always increases its score from the reward model, even when no actual verification occurs. The chain looks like it self-corrects, but it's just learned a surface pattern that the PRM gives credit for.
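Failure mode 2 (error propagation) is easy to reproduce on the apple problem. The sketch below replays the two-step calculation and optionally injects a single arithmetic slip. The `run_chain` helper and its `slip_at` / `slip` parameters are hypothetical names used only for this illustration.

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}

def run_chain(start, ops, slip_at=None, slip=0):
    """Apply ops in order; optionally add `slip` to the result of one step."""
    value, trace = start, []
    for i, (op, operand) in enumerate(ops):
        value = OPS[op](value, operand)
        if i == slip_at:
            value += slip  # a single arithmetic slip, injected on purpose
        trace.append(value)
    return trace

# 23 apples, use 20, buy 6 more.
apple_ops = [("-", 20), ("+", 6)]
clean = run_chain(23, apple_ops)                        # intermediate results: 3, then 9
slipped = run_chain(23, apple_ops, slip_at=0, slip=-1)  # step 1 reports 2 instead of 3

print(clean, slipped)
```

After the slip, the second step is still carried out correctly (2 + 6 = 8), so the chain stays internally consistent while the final answer is wrong.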
| Failure Mode | Symptom | Mitigation |
|---|---|---|
| Unfaithful chains | Chain looks right, but perturbing a step doesn't change the answer | Process reward models (Ch 4) |
| Error propagation | One wrong step cascades through the whole chain | Self-consistency (Ch 2) |
| Overthinking | Simple problems get wrong answers with long chains | Adaptive chain length |
| Sycophancy | Reasoning chain steered by user's suggestion | Remove user hints from prompt |
| Language bias | Same problem, different accuracy by language | Multilingual training data |
| Reward hacking | Fake verification steps that game the reward model | Diverse reward models, adversarial evaluation |
How do researchers detect unfaithful reasoning? The main technique is causal intervention: modify something in the reasoning chain and check if the answer changes appropriately. If you corrupt step 2 of a 4-step chain (making it obviously wrong), a faithful model should produce a different (wrong) final answer. An unfaithful model will produce the same answer regardless — because the answer was never derived from the chain in the first place.
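The intervention test can be sketched with two toy stand-ins for a model. Both functions below are hypothetical: one continues the calculation from whatever intermediate result it is given, and the other ignores the chain and returns a memorized answer. A real experiment would instead re-prompt an actual model with the corrupted chain.

```python
OPS = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}

def faithful_model(prefix_result, remaining_ops):
    """Toy faithful model: continues from the last stated intermediate result."""
    value = prefix_result
    for op, operand in remaining_ops:
        value = OPS[op](value, operand)
    return value

def unfaithful_model(prefix_result, remaining_ops):
    """Toy unfaithful model: ignores the chain and returns a memorized answer."""
    return 9

def answer_moves(model, clean_result, corrupted_result, remaining_ops):
    """Causal intervention: corrupt one intermediate result, compare final answers."""
    return model(clean_result, remaining_ops) != model(corrupted_result, remaining_ops)

# Apple problem: step 1 states "23 - 20 = 3"; the remaining step is "+ 6".
remaining = [("+", 6)]
faithful_moves = answer_moves(faithful_model, 3, 2, remaining)
unfaithful_moves = answer_moves(unfaithful_model, 3, 2, remaining)

print(faithful_moves, unfaithful_moves)
```

Corrupting the intermediate result from 3 to 2 shifts the faithful toy's answer from 9 to 8, while the unfaithful toy keeps answering 9.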
Lanham et al. (2023) tested this systematically. They found that early in the chain, perturbations often DO change the answer (suggesting some faithfulness). But for the last step or two, perturbations frequently DON'T change the answer — the model has already "decided" and is just filling in the conclusion. This suggests a mix: models are partially faithful for hard problems (where they genuinely need the scratch work) and unfaithful for easy problems (where they pattern-match the answer and backfill reasoning).
Turpin et al. (2023) showed that CoT explanations are often biased by features that shouldn't matter. When they added irrelevant features to the prompt (like the order of multiple-choice answers), the model's reasoning chains changed to justify answers influenced by those features — while the chains looked perfectly logical in isolation.
This means we can't simply read the chain to verify the model's reasoning. The chain might be a fabrication. This is a fundamental limitation: we can inspect the chain, but we can't verify that the chain caused the answer. The model's actual reasoning happens in its hidden states, which we cannot yet reliably interpret even when we can read them out.
It would be a mistake to think of faithfulness as binary. In practice, reasoning chains exist on a spectrum:
| Level | Description | Example |
|---|---|---|
| Fully faithful | Every step genuinely contributes to the answer. Perturbing any step changes the output. | Hard math problems the model hasn't memorized |
| Partially faithful | Some steps are genuine computation, others are filler or post-hoc rationalization | Medium-difficulty problems with familiar patterns |
| Fully unfaithful | The entire chain is fabricated after the model has already decided the answer | Easy factual questions where the answer is memorized |
The troubling implication: we can't tell which level we're at by reading the chain. A fully unfaithful chain can look identical to a fully faithful one. This is why mechanistic interpretability — understanding what's happening inside the model's hidden states — is considered one of the most important research directions in AI safety. Only by understanding the internal computation can we verify that the external reasoning chain is genuine.
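Although reading the chain cannot reveal its level, intervening on it step by step can give a rough estimate. The sketch below perturbs each stated intermediate result in turn and maps the outcome onto the three levels in the table. The three toy models, the "+1" corruption, and the all/none/some mapping are assumptions made for illustration, not an established metric.

```python
def sensitivity_profile(model, stated_results):
    """For each step, perturb its stated result and record whether the answer moves."""
    baseline = model(stated_results)
    moved = []
    for i in range(len(stated_results)):
        perturbed = list(stated_results)
        perturbed[i] += 1  # arbitrary corruption; an assumption of this sketch
        moved.append(model(perturbed) != baseline)
    return moved

def faithfulness_level(moved):
    """Map a per-step sensitivity profile onto the three levels in the table."""
    if all(moved):
        return "fully faithful"
    if not any(moved):
        return "fully unfaithful"
    return "partially faithful"

# Hypothetical toy models that map stated intermediate results to a final answer.
def uses_every_step(results):
    return sum(results)      # the answer depends on every stated result

def uses_last_only(results):
    return results[-1] * 2   # earlier steps are filler

def memorized(results):
    return 9                 # the chain is ignored entirely

stated = [3, 9, 4]
levels = {
    fn.__name__: faithfulness_level(sensitivity_profile(fn, stated))
    for fn in (uses_every_step, uses_last_only, memorized)
}
print(levels)
```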
For now, the practical advice: trust reasoning chains on hard, novel problems (where the model genuinely needs the scratch work). Be skeptical of chains on easy, familiar problems (where the model may be rationalizing a memorized answer). And where step-level supervision is available, prefer process reward models over outcome-only reward models, because PRMs at least check that each step is individually plausible, even if they can't guarantee faithfulness and can themselves be gamed.
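The difference between the two reward types shows up on a chain that reaches the right answer through wrong steps. In the sketch below, a simple arithmetic check stands in for a learned PRM, and the step scores are aggregated by taking the minimum. Both choices are assumptions of this illustration, as are all the function names.

```python
import re

# Matches steps of the form "a + b = c" or "a - b = c" with integer operands.
STEP = re.compile(r"(-?\d+)\s*([+\-])\s*(-?\d+)\s*=\s*(-?\d+)")

def step_is_valid(step):
    """Stand-in for a learned PRM: checks the arithmetic of one step."""
    m = STEP.fullmatch(step.strip())
    if m is None:
        return False
    a, op, b, claimed = int(m[1]), m[2], int(m[3]), int(m[4])
    return (a + b if op == "+" else a - b) == claimed

def outcome_reward(chain, gold):
    """Outcome reward: looks only at the final answer."""
    final = chain[-1].split("=")[-1].strip()
    return 1.0 if final == gold else 0.0

def process_reward(chain):
    """Process reward: the chain is only as good as its weakest step."""
    return min(1.0 if step_is_valid(s) else 0.0 for s in chain)

# Two wrong steps that happen to land on the right answer.
lucky_chain = ["23 - 20 = 2", "2 + 6 = 9"]
orm_score = outcome_reward(lucky_chain, gold="9")
prm_score = process_reward(lucky_chain)

print(orm_score, prm_score)
```

The outcome reward gives the lucky chain full credit, while the step-level check gives it none. A step-level score still measures only plausibility: a chain could pass every check and be unrelated to how the model actually produced its answer.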
A possible silver lining: RL training selects for strategies that reliably produce correct answers, and faithful reasoning should generalize to new problems where rationalization only works on familiar ones. If that holds, the same selection pressure that makes models smarter would also make them more honest, not by design, but because honesty is the better long-run strategy. Treat this as a hypothesis rather than a result: Lanham et al. (2023) found that larger, more capable models often produced less faithful reasoning, and reward hacking shows that RL can select for appearances too.
Reasoning sits at the frontier of what LLMs can do. Everything in this lesson builds on the foundations laid in earlier lectures: pretraining (L07) gives the model knowledge, post-training (L08) teaches it to follow instructions, and PEFT (L09) lets us adapt it efficiently. Reasoning is what happens when we push these capabilities to solve problems that require multiple steps of logical deduction.
How each technique builds on the previous. Prompting methods require no training. RL methods modify the model's weights to internalize reasoning.
| Paper | Year | Contribution |
|---|---|---|
| Chain-of-Thought (Wei et al.) | 2022 | Few-shot CoT prompting. Showed step-by-step reasoning dramatically improves math/logic accuracy. Emergent ability of scale. |
| Self-Consistency (Wang et al.) | 2022 | Sample multiple CoT chains and majority vote. Trade compute for accuracy with no extra training. |
| Zero-Shot CoT (Kojima et al.) | 2022 | "Let's think step by step" — triggers reasoning with zero examples. |
| PRM800K (Lightman et al.) | 2023 | Process reward model with step-level human labels. Process supervision beats outcome supervision. |
| DeepSeek-R1 (DeepSeek) | 2025 | RL-trained reasoning from scratch. Model discovers CoT, self-correction, and verification through reward optimization alone. |
| DAPO (ByteDance) | 2025 | Fixes GRPO's entropy collapse and length bias. Dynamic sampling + truncated filtering for stable RL training. |
| Lesson | Connection |
|---|---|
| L08: Post-training | RLHF teaches general instruction-following. RL for reasoning teaches specific problem-solving strategies. Same algorithm (PPO/GRPO), different reward signal. |
| L09: PEFT | LoRA can fine-tune reasoning capabilities at lower cost than full fine-tuning. Distillation from R1 into smaller models uses SFT, not RL. |
| L11: Evaluation | Reasoning evaluation requires checking both the final answer AND the reasoning chain. Process reward models (PRMs) are evaluation tools as much as training tools. |
| L13: Reasoning Part 2 | Continues with tree search, Monte Carlo methods, and test-time compute scaling. How to make reasoning more systematic and reliable. |
The progression in this lesson tells a clear story:
| Era | Technique | Where Reasoning Lives |
|---|---|---|
| 2022 | CoT Prompting | In the prompt (inference-time) |
| 2022 | Self-Consistency | In the sampling (inference-time) |
| 2023 | Process Reward Models | In the evaluator (training-time) |
| 2025 | GRPO / DAPO / R1 | In the weights (training-time) |
We started by prompting models to reason (cheap, no training). We moved to evaluating reasoning quality (PRMs, requires labeled data). We ended with training models to reason from scratch (RL, expensive but powerful). The trend is clear: reasoning capabilities are moving from inference-time tricks into the model's weights.
But this is only Part 1. In Lecture 13 (Reasoning Part 2), we'll explore what happens when you combine these techniques with search — tree-of-thought, Monte Carlo tree search, and test-time compute scaling. The idea: instead of generating a single chain and hoping it's correct, systematically explore the space of possible reasoning paths and pick the best one. If CoT gives the model scratch paper, tree search gives it an entire whiteboard.
The fundamental question remains open: are we teaching models to reason, or teaching them to simulate reasoning? The distinction matters. True reasoning would generalize to entirely novel problems and domains. Simulated reasoning might look identical but fail on problems that deviate from training distribution patterns. The DeepSeek-R1 results suggest we're closer to the former than skeptics expected — but the jury is still out.