CS224N Lecture 12

Reasoning Part 1

Teaching models to think step by step — from prompting tricks to reinforcement learning.

Prerequisites: L09 PEFT + L08 Post-training. That's it.
10 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: Why Reasoning?

Here's a simple math problem: "A cafeteria had 23 apples. They used 20 to make lunch and bought 6 more. How many do they have?" Most humans get 9 instantly. You subtract 20 from 23 to get 3, then add 6 to get 9. Two steps, trivially obvious.

Now feed that same problem to a large language model with no special prompting. It might say "27." Or "29." Or some other confident, wrong number. Not because the model is stupid — GPT-4 has passed the bar exam, after all — but because the model is trying to jump directly from question to answer without working through intermediate steps.

This is the fundamental tension. LLMs generate tokens left to right, one at a time. Each token prediction is essentially a single "step" of computation. For simple lookups ("What is the capital of France?"), one step is enough. But for multi-step reasoning ("If A implies B and B implies C, and A is true, what about C?"), the model needs to chain multiple logical steps — and it has no built-in mechanism for doing so.

Daniel Kahneman described two systems of thinking: System 1 (fast, automatic, intuitive) and System 2 (slow, deliberate, analytical). A standard LLM prompt is System 1: instant pattern-matching. The entire field of reasoning in LLMs is about giving models System 2 — the ability to slow down, decompose, and work through a problem methodically.

The Reasoning Gap

The simulation below shows what happens when a model tries to answer a multi-step math problem directly versus when it's forced to show its work. Click "Direct Answer" to see the model jump to a (wrong) conclusion. Then click "Step-by-Step" to see what happens when we give the model scratch space to think.

[Simulation: The Reasoning Gap] Direct prompting jumps to a wrong answer. Step-by-step prompting reveals intermediate reasoning that leads to the correct answer.

The gap is stark. The direct approach treats reasoning as a single token prediction: question in, answer out. The step-by-step approach gives the model "thinking tokens" — intermediate text that decomposes the problem into sub-problems the model can actually solve.

This insight, that models reason better when they write out their reasoning, launched an entire subfield. In 2022, Jason Wei and colleagues at Google showed that few-shot examples with written-out reasoning sharply improve accuracy on math benchmarks such as GSM8K, and Kojima et al. showed that simply appending "Let's think step by step" to a prompt could boost accuracy from 17.7% to 78.7% on MultiArith. No retraining. No new parameters. Just a different prompt.

LLMs don't reason by default — they pattern-match. Multi-step problems require intermediate computation, but standard prompting asks for a single-hop answer. The entire field of "reasoning" in LLMs is about giving models explicit or implicit scratch space to chain logical steps together.

What This Lesson Covers

We'll build the full reasoning stack, from simple prompting tricks to reinforcement learning:

Chain-of-Thought (Ch 1)
Few-shot examples with step-by-step reasoning. The foundational technique.
Self-Consistency (Ch 2)
Sample multiple chains, vote on the answer. Trade compute for accuracy.
Zero-Shot CoT (Ch 3)
"Let's think step by step" — no examples needed.
Reward Models (Ch 4)
Score reasoning PROCESSES, not just final answers.
RL for Reasoning (Ch 5-6)
Train models to discover reasoning strategies via reinforcement learning.
Why do LLMs struggle with multi-step math problems when prompted directly?

Chapter 1: Chain-of-Thought

In January 2022, Jason Wei et al. at Google Brain published a deceptively simple idea: instead of showing the model question-answer pairs, show it question-reasoning-answer triples. Give the model examples where the reasoning is written out, and it will learn to write out reasoning too.

This is Chain-of-Thought (CoT) prompting. The "chain" is the sequence of intermediate reasoning steps between the question and the final answer. Each link in the chain is a logical step the model can verify and build upon.

Standard vs. CoT Prompting

Here's the critical difference. In standard few-shot prompting, you provide input-output examples:

standard prompt
Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: 11.

Q: The cafeteria had 23 apples. Used 20, bought 6 more. How many?
A:

The model sees "question → number" and tries to pattern-match directly. Now compare with CoT prompting:

chain-of-thought prompt
Q: Roger has 5 tennis balls. He buys 2 cans of 3. How many now?
A: Roger started with 5. He bought 2 cans of 3, so 2 × 3 = 6.
   5 + 6 = 11. The answer is 11.

Q: The cafeteria had 23 apples. Used 20, bought 6 more. How many?
A:

Now the model sees "question → step 1 → step 2 → ... → answer." It learns to generate its own intermediate steps before committing to a final answer. Each step is a simple operation the model can handle, even if the full problem is complex.
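
The question-reasoning-answer format above is easy to assemble programmatically. The sketch below is only one possible way to do it: the function name `build_cot_prompt` and the tuple layout are assumptions made for illustration, not part of any library or of Wei et al.'s code.

```python
from typing import List, Tuple


def build_cot_prompt(examples: List[Tuple[str, str, str]], question: str) -> str:
    """Format (question, reasoning, answer) triples into a few-shot CoT prompt.

    Hypothetical helper: the "Q:/A:" layout mirrors the prompt shown above.
    """
    blocks = []
    for q, reasoning, answer in examples:
        # Each demonstration shows the reasoning before the final answer.
        blocks.append(f"Q: {q}\nA: {reasoning} The answer is {answer}.")
    # The new question ends with a bare "A:" so the model continues from there.
    blocks.append(f"Q: {question}\nA:")
    return "\n\n".join(blocks)


examples = [(
    "Roger has 5 tennis balls. He buys 2 cans of 3. How many now?",
    "Roger started with 5. He bought 2 cans of 3, so 2 × 3 = 6. 5 + 6 = 11.",
    "11",
)]
prompt = build_cot_prompt(
    examples,
    "The cafeteria had 23 apples. Used 20, bought 6 more. How many?",
)
print(prompt)
```

Keeping the demonstrations as data makes it simple to swap in task-specific examples without touching the formatting logic.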

[Simulation: Standard vs. Chain-of-Thought Prompting] Left: standard prompting jumps from question to answer. Right: CoT prompting generates intermediate steps. Click "Step Through" to watch the reasoning chain build token by token.

Why Does This Work?

The key insight is that CoT gives the model intermediate tokens to compute with. A Transformer processes information through attention and feedforward layers, but a fixed stack of layers has finite computational capacity per forward pass. By generating intermediate reasoning tokens, the model effectively gets more "layers" of computation, because each generated token becomes context for the next prediction. Feng et al. (2023) formalized this: under standard complexity assumptions, constant-depth Transformers that must answer directly are limited to problems in the circuit class TC⁰, but with CoT they can solve problems believed to lie outside it, such as evaluating arithmetic expressions. Related theoretical results show that with polynomially many intermediate steps, such models can solve problems in P (polynomial time). The intermediate tokens literally expand the class of problems the model can solve.

Think of it like a student taking an exam. A standard prompt is like asking "write only the final answer." A CoT prompt is like giving them scratch paper. The scratch paper doesn't change their knowledge — it lets them organize and apply it step by step.

There's a beautiful connection to computation theory here. A Turing machine's tape serves the same purpose as CoT tokens: it provides external memory that the finite-state control can read and write during computation. Without the tape, the machine can only solve problems bounded by its finite state space. With the tape, it can solve anything computable. CoT tokens are, in a real sense, a Transformer's Turing tape.

The Scaling Effect

Wei et al. discovered a remarkable property: CoT is an emergent ability of scale. Small models (under ~10B parameters) show no benefit from CoT prompting — they just generate incoherent intermediate steps. But above a critical scale threshold (~100B parameters), CoT dramatically improves performance.

| Model Size | Standard Accuracy (GSM8K) | CoT Accuracy (GSM8K) |
|---|---|---|
| 8B | ~4% | ~4% (no benefit) |
| 62B | ~15% | ~33% |
| 540B (PaLM) | ~17% | ~58% |

Why? Smaller models can't reliably generate correct intermediate steps. CoT requires the model to already know how to do individual sub-steps (addition, comparison, logical deduction). If the sub-steps are wrong, chaining them just compounds errors. Scale gives the model enough capacity to get each sub-step right.

This scale dependency sparked intense debate. Is CoT a fundamental capability that emerges from scale, or is it a shallow pattern-matching trick that happens to work at large scale? The DeepSeek-R1 results (Chapter 5) provide evidence for the former: when you train with RL, even models that initially can't do CoT learn to generate it. The capability is real, not an artifact — but it requires a minimum level of base competence to bootstrap.

Chain-of-Thought prompting is compute at inference time. Each intermediate token is effectively an extra layer of computation. The cost is more generated tokens (slower, more expensive). The benefit is dramatically higher accuracy on multi-step tasks. This is the fundamental trade-off of all reasoning techniques: trade inference compute for accuracy.

Data Flow: What Happens Inside

python
# Standard prompting: one forward pass per answer token
prompt = "Q: 23 - 20 + 6 = ?\nA:"
output = model.generate(prompt, max_tokens=3)
# → " 27"  (wrong — pattern-matched without computing)

# CoT prompting: multiple tokens of "scratch work"
prompt = """Q: Roger has 5 balls. Buys 2 cans of 3. How many?
A: 5 + (2 × 3) = 5 + 6 = 11. The answer is 11.

Q: 23 apples. Used 20, bought 6 more. How many?
A:"""
output = model.generate(prompt, max_tokens=50)
# → " 23 - 20 = 3. 3 + 6 = 9. The answer is 9."
# Each intermediate token conditions the next prediction

Notice: the model generates "23 - 20 = 3" as text, then conditions on that text to generate "3 + 6 = 9." The intermediate result "3" exists as a token in the context, not as a hidden state. This is crucial — the reasoning is explicit and inspectable.

What Kinds of Problems Benefit?

CoT doesn't help everything equally. The benefit is largest for tasks that require multi-step composition — where the answer depends on chaining several sub-operations. Here's the breakdown:

| Task Type | CoT Benefit | Why |
|---|---|---|
| Multi-step arithmetic | Huge (+40%) | Each arithmetic operation is a sub-step the model can do individually |
| Symbolic reasoning | Huge (+30%) | Logical deduction requires explicit chaining of rules |
| Commonsense reasoning | Moderate (+10%) | Some problems need inference chaining, others are direct lookup |
| Factual QA | Minimal (+2%) | Single-hop retrieval from parameters; no chain needed |
| Translation | None or negative | Already a sequence-to-sequence task; intermediate steps add noise |

The rule of thumb: if a human would need scratch paper to solve the problem, CoT will help. If a human can answer instantly from memory, CoT is wasted tokens.

Why does Chain-of-Thought prompting improve reasoning accuracy?

Chapter 2: Self-Consistency

Chain-of-Thought gives the model scratch paper. But what if the model makes an arithmetic mistake halfway through, or takes a wrong logical turn? A single chain is fragile — one bad link breaks the whole thing. What if we could get multiple opinions?

This is the insight behind Self-Consistency (Wang et al., 2022): sample multiple independent reasoning chains from the model, extract the final answer from each, and take a majority vote. Different chains might make different mistakes, but the correct answer should appear most frequently.

How It Works

The algorithm is beautifully simple:

Step 1: Sample K Chains
Use the same CoT prompt, but sample with temperature > 0 (e.g., T=0.7). Each sample generates a different reasoning path.
Step 2: Extract Answers
Parse the final answer from each chain. Ignore the reasoning, keep only the conclusion.
Step 3: Majority Vote
The answer that appears most frequently across K chains is the final answer. Ties broken randomly.

Why does this work? Think of it like asking five mathematicians to solve the same problem independently. Each might take a different approach (algebra, geometry, estimation), and some might make errors. But if four out of five get "9," it's probably 9 — even if the fifth got "7" due to a sign error.

[Simulation: Self-Consistency: Majority Voting] Five reasoning chains sampled for the same question. Each takes a slightly different path. The majority answer (highlighted) is selected as the final output.

The Numbers

Self-consistency provides consistent improvements over single-chain CoT across every benchmark tested:

| Benchmark | CoT (1 chain) | SC (40 chains) | Improvement (points) |
|---|---|---|---|
| GSM8K (math) | 56.5% | 74.4% | +17.9 |
| SVAMP (math) | 79.0% | 86.6% | +7.6 |
| ARC-Challenge | 85.2% | 88.7% | +3.5 |
| StrategyQA | 73.4% | 79.3% | +5.9 |

The cost is obvious: sampling 40 chains costs 40x the compute of a single chain. In practice, most gains come from the first 5-10 samples — the marginal benefit of each additional chain decreases rapidly. After about 20 samples, you're paying a lot more in compute for diminishing returns. The practical sweet spot for most applications is K=5 to K=10.

Weighted vs. Unweighted Voting

The basic version of self-consistency uses unweighted majority voting: each chain gets one vote regardless of its confidence. But you can do better. Some chains are more confident than others — the model assigns higher probability to tokens in some chains. A weighted majority vote gives more weight to chains where the model was more confident:

$$\text{answer}^{*} = \arg\max_{a} \sum_{i:\,\text{answer}_i = a} P(\text{chain}_i)$$

In practice, weighted voting helps on the margins but unweighted voting is simpler and nearly as effective. The diversity from multiple chains matters more than the confidence weighting.
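
To make the difference between the two voting schemes concrete, here is a minimal sketch. It assumes each chain comes with a sequence log-probability; the function name `weighted_vote` and the numbers in the example are made up for illustration, chosen so that the two schemes disagree.

```python
import math
from collections import Counter, defaultdict
from typing import Hashable, List, Tuple


def weighted_vote(chains: List[Tuple[Hashable, float]]) -> Hashable:
    """Pick the answer with the largest summed chain probability.

    `chains` holds (final_answer, log_probability_of_chain) pairs.
    """
    totals = defaultdict(float)
    for answer, logprob in chains:
        totals[answer] += math.exp(logprob)  # convert log-prob back to probability
    return max(totals, key=totals.get)


# Illustrative (invented) chain probabilities: three low-confidence chains say 9,
# one high-confidence chain says 7.
chains = [
    (9, math.log(0.10)),
    (9, math.log(0.05)),
    (7, math.log(0.30)),
    (9, math.log(0.08)),
    (11, math.log(0.02)),
]

unweighted = Counter(answer for answer, _ in chains).most_common(1)[0][0]
weighted = weighted_vote(chains)
print("unweighted:", unweighted)
print("weighted:", weighted)
```

In this constructed case the unweighted vote picks 9 while the probability-weighted vote picks 7, which shows why weighting can also hurt: one overconfident wrong chain can outweigh several modest correct ones.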

Why Majority Vote Works: The Math

Here's the elegant argument. Suppose each chain independently gets the right answer with probability p (say p=0.6). What's the probability that the majority of K=5 chains are correct? This is a binomial distribution: P(majority correct) = P(3 or more correct out of 5). For p=0.6, this works out to about 0.68. For p=0.7, it's 0.84. For p=0.8, it's 0.94.

The key insight: majority voting amplifies accuracy. If each individual chain is better than random (p > 0.5), then the majority vote is strictly better than any individual chain, and the improvement grows with K. This is the Condorcet jury theorem applied to language models.

$$P(\text{majority correct}) = \sum_{k=\lceil K/2 \rceil}^{K} \binom{K}{k}\, p^{k} (1-p)^{K-k}$$

The assumption that matters: chains must make independent errors. If all chains make the same mistake (because the model has a systematic bias), majority voting can't help. This is why temperature is important — it ensures diversity.
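
The binomial sum above can be checked numerically. This is a small sketch under the same idealized assumptions (independent chains, a single right-or-wrong outcome, odd K); the function name `majority_correct_prob` is chosen here for illustration.

```python
from math import comb


def majority_correct_prob(p: float, K: int) -> float:
    """Probability that more than half of K independent chains are correct.

    Assumes K is odd, so a strict majority always exists.
    """
    threshold = K // 2 + 1  # equals ceil(K / 2) for odd K
    return sum(
        comb(K, k) * p**k * (1 - p) ** (K - k)
        for k in range(threshold, K + 1)
    )


for p in (0.6, 0.7, 0.8):
    print(f"p={p}: P(majority of 5 correct) = {majority_correct_prob(p, 5):.2f}")
# p=0.6 -> 0.68, p=0.7 -> 0.84, p=0.8 -> 0.94
```

Trying a value below 0.5, such as `majority_correct_prob(0.4, 5)`, shows the flip side: when individual chains are worse than a coin flip, voting amplifies the error instead.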

Temperature and Diversity

Temperature controls the diversity of the sampled chains. At T=0 (greedy), every chain is identical — self-consistency degenerates to single-chain CoT. At T=1.0, chains are highly diverse but more chains contain errors. The sweet spot is around T=0.5-0.7: diverse enough to explore different reasoning paths, constrained enough that most paths are coherent.
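
The effect of temperature is easiest to see on a single next-token distribution. The sketch below applies the standard temperature-scaled softmax to a toy set of logits; the logits themselves are invented for illustration and do not come from any model.

```python
import math
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # T=0 is the greedy limit (argmax), which this formula cannot express.
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.0]  # hypothetical scores for three candidate tokens
for T in (0.1, 0.7, 1.0):
    probs = softmax_with_temperature(toy_logits, T)
    print(T, [round(p, 3) for p in probs])
```

Low temperatures push almost all probability onto the top token, so sampled chains barely differ; higher temperatures flatten the distribution and let sampling reach alternative continuations.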

python
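# Self-consistency implementation
# (assumes `model` and `extract_answer` are defined elsewhere)
from collections import Counter

def self_consistency(prompt, model, K=40, temp=0.7):
    answers = []
    for _ in range(K):
        # Sampling with temperature > 0 lets each chain take a different path
        chain = model.generate(prompt, temperature=temp)
        answer = extract_answer(chain)  # parse final number
        answers.append(answer)

    # Majority vote: most_common(1) returns [(answer, count)].
    # Ties go to the answer seen first, not to a random one.
    vote = Counter(answers).most_common(1)[0][0]
    return vote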

# Example: K=5 chains give answers [9, 9, 7, 9, 11]
# Majority vote: 9 (appears 3/5 times) ✓
Self-consistency is embarrassingly parallel and requires no extra training. You're trading compute (K sampled generations instead of one) for accuracy (majority voting filters out errors). The insight: correct reasoning paths are more robust than incorrect ones. Errors are random, but truth is consistent.
A model generates 5 reasoning chains with final answers: [42, 38, 42, 42, 38]. What answer does self-consistency select, and why?

Chapter 3: Zero-Shot CoT

Chain-of-Thought (Chapter 1) requires carefully crafted few-shot examples with hand-written reasoning chains. What if you don't have examples for your specific task? What if you just want reasoning on any arbitrary question?

In 2022, Kojima et al. discovered something remarkable: you can trigger chain-of-thought reasoning with zero examples. Just append a single sentence to any question: "Let's think step by step."

That's it. Five words. No examples. No task-specific engineering. The model was already capable of step-by-step reasoning — it just needed permission to do it.

The Prompt Engineering Landscape

Zero-shot CoT isn't the only trigger phrase. Researchers have tested dozens of alternatives, and the exact wording matters more than you'd expect:

| Prompt Strategy | Trigger Text | Accuracy (MultiArith) |
|---|---|---|
| Standard (no CoT) | none | 17.7% |
| Zero-shot CoT | "Let's think step by step" | 78.7% |
| Alternative 1 | "Let's solve this problem by splitting it into steps" | 72.2% |
| Alternative 2 | "First," | 66.4% |
| Alternative 3 | "Let's think about this logically" | 56.6% |
| Negative | "Let's think step by step but be brief" | 41.5% |

The phrase "Let's think step by step" wins consistently. Why? A plausible explanation is that the model's training data contains many instances where this phrase precedes detailed explanations: in tutorials, textbooks, forum answers. The phrase activates an "explain your work" mode that the model already learned during pretraining.

The "Trigger Phrase" Phenomenon

This reveals something profound about how language models store and access capabilities. The model already knew how to reason step by step — it just needed the right prompt to activate that behavior. Think of it like this: during pretraining, the model learned statistical associations between phrases and what follows them. "Let's think step by step" is statistically followed by detailed, structured explanations. By using this phrase, we're essentially setting up the right conditional distribution for the model to sample from.

This has implications beyond reasoning. It suggests that large language models have many latent capabilities that can be unlocked with the right prompt. The model is not a fixed function — it's a family of functions, and the prompt selects which member of that family to use. Prompt engineering is, in a deep sense, function selection.

This also explains why different phrases work differently. "Let's think step by step" activates tutorial-style explanations. "First," activates enumerated lists. "Let's think about this logically" activates a more philosophical, less computational style. Each trigger phrase selects a different distribution over reasoning formats — and some formats are more effective for mathematical problem-solving than others.

The practical lesson: prompt engineering is not magic. It's applied statistics. You're choosing which region of the model's output distribution to sample from, and some regions contain better reasoning strategies than others. The best prompt isn't the cleverest — it's the one that most reliably activates the model's strongest reasoning mode for your specific task type.

[Simulation: Zero-Shot CoT: Prompt Builder] Toggle between prompt strategies and see how each affects the model's reasoning behavior. The accuracy counter shows results across a batch of math problems.

The Two-Stage Process

Zero-shot CoT actually works in two stages, not one:

Stage 1: Reasoning extraction. Append "Let's think step by step" to the question. The model generates a reasoning chain. Don't extract the answer yet — the model might not have stated a clear final answer.

Stage 2: Answer extraction. Append "Therefore, the answer is" to the generated reasoning. This prompts the model to produce a clean, parseable final answer.
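
Even after Stage 2, the completion is free text (for example " 60 mph.") and still has to be parsed into a value. Below is one possible sketch of the `extract_answer` helper that the code in this lesson refers to but never defines. The "take the last number" rule is an assumption suited to numeric math problems, not a method prescribed by Kojima et al.

```python
import re
from typing import Optional

# Matches integers and decimals, with an optional leading minus sign.
_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def extract_answer(text: str) -> Optional[float]:
    """Return the last number appearing in `text`, or None if there is none."""
    # Drop thousands separators so "1,000" parses as 1000.
    matches = _NUMBER.findall(text.replace(",", ""))
    return float(matches[-1]) if matches else None


print(extract_answer(" 60 mph."))
print(extract_answer("23 - 20 = 3. 3 + 6 = 9. The answer is 9."))
print(extract_answer("I am not sure."))
```

A rule this simple will misfire on answers that end with an unrelated number (such as a year or a unit conversion), which is one reason the dedicated answer-extraction prompt is useful: it keeps the text to be parsed short.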

python
# Stage 1: Generate reasoning
prompt1 = "Q: If a train travels 120 miles in 2 hours, " + \
          "what is its speed in mph?\n" + \
          "A: Let's think step by step."
reasoning = model.generate(prompt1)
# → "The train goes 120 miles in 2 hours.
#    Speed = distance / time = 120 / 2 = 60."

# Stage 2: Extract answer
prompt2 = prompt1 + reasoning + "\nTherefore, the answer is"
answer = model.generate(prompt2, max_tokens=10)
# → " 60 mph."

Few-Shot vs. Zero-Shot: When to Use Each

| Criterion | Few-Shot CoT | Zero-Shot CoT |
|---|---|---|
| Setup effort | High (craft examples) | None |
| Accuracy (best case) | Higher (task-specific) | Slightly lower |
| Generalization | Limited to similar tasks | Any task |
| Prompt length | Long (examples + reasoning) | Short (just 5 extra words) |
| Best for | Production, high-stakes | Exploration, diverse tasks |
"Let's think step by step" is prompt engineering's greatest hit. It works because the model already knows how to reason — it just needs a trigger phrase that activates the "show your work" behavior learned during pretraining. The implication: LLMs have latent capabilities that the right prompt can unlock.
Why does "Let's think step by step" trigger better reasoning, even without any examples?

Chapter 4: Process vs. Outcome Reward

We've seen that models can generate reasoning chains. But how do we know if a chain is good? There are two fundamentally different ways to evaluate reasoning, and the choice between them has profound implications for how we train and improve models.

Outcome Reward Models (ORMs) look only at the final answer. Did the model get "9"? Then the chain is good, regardless of how it got there. The entire reasoning chain gets a single score: correct or incorrect.

Process Reward Models (PRMs) score each intermediate step independently. Step 1: correct. Step 2: correct. Step 3: error! Even if the model stumbles back to the right final answer, the PRM identifies exactly where the reasoning went wrong.

Why Process Matters

Consider two reasoning chains for "What is 23 - 20 + 6?":

Chain A: "23 - 20 = 3. 3 + 6 = 9. The answer is 9." — Every step correct.

Chain B: "23 - 20 = 13. 13 - 4 = 9. The answer is 9." — Wrong intermediate steps, right final answer by luck.

An ORM gives both chains the same score. A PRM catches that Chain B got lucky — its reasoning is flawed, and it will fail on similar problems. This distinction is critical: we want models that reason correctly, not models that get lucky.

[Simulation: PRM vs. ORM Scoring] A 5-step reasoning chain scored by both models. The PRM assigns per-step scores (green/red), while the ORM only scores the final answer. Toggle between the two to see the difference.

Building a Process Reward Model

Lightman et al. (2023) at OpenAI built PRM800K — a dataset of 800,000 step-level human labels for math reasoning. Human annotators read each step of a model-generated solution and labeled it as "correct," "incorrect," or "neutral." This is expensive — each solution requires a mathematician to carefully verify every intermediate step — but it produces a reward model that can pinpoint errors with high precision.

The labeling process is careful: annotators are shown the problem and the solution up to step k, then asked "Is step k mathematically valid?" They don't just check the arithmetic — they verify logical coherence, correct use of definitions, and whether the step follows from previous steps. A step like "Since x > 0, we know x² > 0" is labeled correct. A step like "Since x > 0, we know 1/x > 1" is labeled incorrect (it's only true for 0 < x < 1).

The PRM is trained as a classifier: given a reasoning chain up to step k, predict whether step k is correct. At inference time, you can use the PRM to:

Best-of-N Selection
Generate N chains, score each with PRM, pick the one with highest minimum step score. This catches "lucky" chains.
Step-Level Search
At each step, generate multiple candidates, score with PRM, keep the best. Like beam search but over reasoning steps.
Training Signal
Use PRM scores as rewards in RL training. The model learns which reasoning patterns are reliable, not just which answers are correct.
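
Best-of-N selection with a minimum-step-score rule can be sketched as follows. A real PRM is a trained classifier; here it is replaced by a hypothetical `toy_step_scorer` that just checks the arithmetic of the latest step, so the example stays self-contained. All names are illustrative assumptions.

```python
import re
from typing import Callable, List, Sequence

_STEP = re.compile(r"\s*(\d+)\s*([+\-*])\s*(\d+)\s*=\s*(\d+)\s*")


def toy_step_scorer(prefix: List[str]) -> float:
    """Stand-in for a PRM: 1.0 if the latest step's arithmetic is right, else 0.0."""
    m = _STEP.fullmatch(prefix[-1])
    if not m:
        return 0.5  # cannot judge this step
    a, op, b, c = int(m[1]), m[2], int(m[3]), int(m[4])
    result = {"+": a + b, "-": a - b, "*": a * b}[op]
    return 1.0 if result == c else 0.0


def best_of_n(
    chains: Sequence[List[str]],
    step_scorer: Callable[[List[str]], float],
) -> List[str]:
    """Return the chain whose weakest step scores highest (chains must be non-empty)."""
    def chain_score(steps: List[str]) -> float:
        # Score step k given the chain up to and including step k.
        return min(step_scorer(steps[: k + 1]) for k in range(len(steps)))
    return max(chains, key=chain_score)


chain_a = ["23 - 20 = 3", "3 + 6 = 9"]     # every step correct
chain_b = ["23 - 20 = 13", "13 - 4 = 9"]   # right answer, flawed first step
print(best_of_n([chain_b, chain_a], toy_step_scorer))
```

Both chains end in 9, so an outcome-only score could not separate them; the minimum over step scores rejects the chain with the faulty first step.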

PRM vs. ORM: Results

| Method | MATH Accuracy (maj@1) | MATH Accuracy (best-of-1860) |
|---|---|---|
| ORM (outcome supervision) | ~50% | ~72% |
| PRM (process supervision) | ~50% | ~78.2% |

At maj@1 (single sample), ORMs and PRMs perform similarly. The gap opens up at scale: when you sample many chains and use the reward model to select the best one. PRMs are dramatically better at identifying chains with sound reasoning, because they can reject chains that reach the right answer through flawed logic.

The Cost of Process Labels

The elephant in the room: process labels are expensive. ORM labels are cheap — you just need the correct final answer (often automatically verifiable for math). PRM labels require a human expert to read and judge every single step. For PRM800K, this meant paying mathematicians to annotate 800,000 individual steps. Most teams can't afford this.

Can we generate process labels automatically? Partially. One approach: generate many solutions to the same problem, and for each step, check whether the remaining steps can reach the correct answer. If a step leads to a dead end (no subsequent path reaches the right answer), it's probably wrong. This Monte Carlo estimation approach gives noisy but scalable process labels without human annotators. We'll see more of this in Lecture 13.
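
The Monte Carlo idea described above can be sketched in a few lines. The `rollout` argument stands for "sample a completion from the model given this prefix and return its final answer"; in the example it is replaced by a deterministic toy function, which is purely an assumption made to keep the sketch runnable without a model.

```python
from typing import Callable, List


def mc_step_values(
    steps: List[str],
    rollout: Callable[[List[str]], int],
    correct_answer: int,
    n_rollouts: int = 8,
) -> List[float]:
    """Estimate, for each step, the fraction of rollouts from that prefix
    that end at the correct answer."""
    values = []
    for k in range(1, len(steps) + 1):
        prefix = steps[:k]
        hits = sum(rollout(prefix) == correct_answer for _ in range(n_rollouts))
        values.append(hits / n_rollouts)
    return values


def toy_rollout(prefix: List[str]) -> int:
    """Hypothetical stand-in for model sampling on the problem 23 - 20 + 6.

    It reaches 9 only when the first step was computed correctly.
    """
    return 9 if prefix[0] == "23 - 20 = 3" else 19


good = mc_step_values(["23 - 20 = 3", "3 + 6 = 9"], toy_rollout, correct_answer=9)
bad = mc_step_values(["23 - 20 = 13", "13 + 6 = 19"], toy_rollout, correct_answer=9)
print(good, bad)
```

With a real model the rollouts are stochastic, so the estimates are noisy; a step can then be labeled correct if its estimated value exceeds some threshold, which is a design choice rather than a fixed rule.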

Outcome rewards create models that get lucky. Process rewards create models that reason correctly. The PRM approach aligns training incentives with what we actually want: correct reasoning, not just correct answers. This mirrors how human math education works — teachers grade the work, not just the final number.
python
# ORM: scores the entire chain at once
def orm_score(chain, correct_answer):
    final = extract_answer(chain)
    return 1.0 if final == correct_answer else 0.0

# PRM: scores each step independently
def prm_score(chain, prm_model):
    steps = split_into_steps(chain)
    scores = []
    context = ""
    for step in steps:
        context += step
        score = prm_model.score(context)  # P(correct | context)
        scores.append(score)
    return min(scores)  # chain is only as strong as weakest step
Chain A: "20-15=5, 5×3=15. Answer: 15." (all steps correct). Chain B: "20-15=3, 3×5=15. Answer: 15." (wrong intermediate, right answer). An ORM scores both equally. Why is this dangerous?

Chapter 5: RL for Reasoning (DeepSeek-R1)

Everything we've seen so far uses prompting to elicit reasoning. Chain-of-thought, self-consistency, zero-shot CoT — these are all inference-time tricks. The model's weights never change. What if we could train the model to reason?

In January 2025, DeepSeek published R1. Its precursor, R1-Zero, developed chain-of-thought reasoning through reinforcement learning alone. No human-written reasoning chains. No supervised examples. Just a reward signal: "did you get the right answer?" The model independently arrived at step-by-step reasoning as a strategy to maximize that reward.

The DeepSeek-R1 Pipeline

The training pipeline has multiple stages, but the breakthrough comes from Stage 1:

Stage 1: Pure RL (R1-Zero)
Start from DeepSeek-V3-Base. Train with RL only. Reward = correct final answer + format compliance. NO supervised reasoning examples.
Stage 2: Cold Start SFT
Collect high-quality CoT examples (some from R1-Zero, some human). Fine-tune a new base model on these examples to create a better starting point.
Stage 3: RL Again
Apply RL on top of the SFT model. Now the model starts from a better initialization and can discover even more sophisticated reasoning strategies.
Stage 4: Distillation
Use the large R1 model to generate reasoning chains. Train smaller models (1.5B-70B) on these chains via SFT. Small models inherit large model's reasoning.

The Aha Moment

The most remarkable finding from DeepSeek-R1 is what they called the "aha moment." During pure RL training (Stage 1, R1-Zero), the model spontaneously developed behaviors that nobody programmed:

Self-reflection: The model learned to re-read its own reasoning, notice errors, and correct them. "Wait, let me reconsider..." appeared naturally in generated text.

Exploration: The model learned to try multiple approaches to a problem. "Let me try a different method..." — essentially discovering self-consistency on its own.

Extended thinking: Reasoning chains grew from tens of tokens to thousands. The model discovered that longer chains (more thinking) led to higher rewards on hard problems.

[Simulation: RL Training: Model Discovers CoT] Watch the RL training loop unfold. Early on, the model gives short, often wrong answers. As training progresses, it discovers that writing out reasoning steps leads to higher rewards. Chain length and accuracy both increase.

Why This Matters: No Human Reasoning Data Needed

Previous approaches to reasoning (CoT prompting, supervised fine-tuning on reasoning chains) all required human-written reasoning examples. Someone had to write out "First, subtract 20 from 23..." for each training problem. This is expensive, it doesn't scale, and it limits reasoning to patterns humans have already demonstrated.

R1-Zero needs none of this. The only human input is a verifier that checks if the final answer is correct (trivial for math: just compare to the known answer). Everything else — the format of reasoning, the length of chains, the strategies for self-correction — emerges from RL exploration. This is a fundamentally different paradigm: instead of imitating human reasoning, the model discovers its own reasoning strategies optimized for its own architecture.

R1-Zero: Emergent Behaviors

| Behavior | Emerged At | Example |
|---|---|---|
| Step-by-step | ~100 steps | "First, I need to find x..." |
| Self-correction | ~500 steps | "Wait, I made an error. Let me redo step 2." |
| Verification | ~1000 steps | "Let me verify: if x=3, then 3×4=12. Yes." |
| Alternative approaches | ~2000 steps | "Let me try solving this differently..." |

These behaviors were never in the training data as labeled examples. They emerged because RL discovered that models which exhibit these behaviors get higher rewards. This is the power of RL — it doesn't tell the model how to reason, it tells the model what success looks like and lets it figure out the rest.

Distillation: Making It Practical

R1 is enormous — 671B parameters. Too large for most practical deployments. But the reasoning patterns it discovered can be transferred to smaller models through distillation.

The process: generate a large set of reasoning chains from the large R1 model (the R1 report describes roughly 800K curated samples). Use these chains as supervised fine-tuning (SFT) data for smaller models (1.5B, 7B, 8B, 14B, 32B, 70B). The small model learns to imitate R1's reasoning style without ever doing RL itself.

Results are remarkable: a 14B distilled model matches or exceeds many 70B+ models that weren't trained with reasoning. The distilled 32B model surpasses OpenAI's o1-mini on several math benchmarks. Reasoning ability transfers through imitation, even to models 50x smaller than the teacher.

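| Model | Size | AIME 2024 | Method |
|---|---|---|---|
| GPT-4o | undisclosed | 9.3% | Standard |
| R1-Distill-Qwen-14B | 14B | 69.7% | Distilled from R1 |
| R1-Distill-Qwen-32B | 32B | 72.6% | Distilled from R1 |
| OpenAI o1-mini | undisclosed | 63.6% | RL-trained |
| DeepSeek-R1 | 671B | 79.8% | RL-trained |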

The lesson: you pay the cost of RL training once (on the largest model), then distribute the benefits to smaller models via cheap SFT. This is the reasoning equivalent of the pretrain-then-distill paradigm that has been so successful for general language modeling.

There's a catch, though. Distilled models inherit the patterns of reasoning but not the capacity for it. A distilled 1.5B model can write chains that look like R1's reasoning, but it makes more errors in the intermediate steps because it has fewer parameters to perform each sub-computation. Distillation transfers the format of reasoning, not the depth. For the deepest reasoning, you still need the largest models.

The Reward Design

R1-Zero uses two reward components:

python
# DeepSeek-R1-Zero reward function (simplified)
def reward(response, correct_answer):
    r = 0.0

    # 1. Accuracy reward: did you get the right answer?
    final = extract_answer(response)
    if final == correct_answer:
        r += 1.0

    # 2. Format reward: did you use <think>...</think> tags?
    if has_think_tags(response):
        r += 0.1

    return r  # No reward for reasoning quality — just outcome + format

Notice: there's no reward for good reasoning. The model only gets rewarded for correct answers and proper formatting. Yet it discovers that good reasoning is the best strategy for getting correct answers. The reasoning emerges as a means, not an end.
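
The simplified reward function relies on two helpers it never defines. The sketch below fills them in under stated assumptions: the answer is taken to be whatever text follows the closing `</think>` tag, and it is compared as an exact string. The helper names and the 1.0 / 0.1 weights follow the simplified code in this lesson, not DeepSeek's actual implementation.

```python
import re

_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def has_think_tags(response: str) -> bool:
    """True if the response contains a complete <think>...</think> block."""
    return _THINK.search(response) is not None


def extract_final_answer(response: str) -> str:
    """Text after the last </think> tag (or the whole response if there is none)."""
    _, _, tail = response.rpartition("</think>")
    return tail.strip()


def reward(response: str, correct_answer: str) -> float:
    r = 0.0
    if extract_final_answer(response) == correct_answer:
        r += 1.0  # accuracy reward
    if has_think_tags(response):
        r += 0.1  # format reward
    return r


with_reasoning = "<think>23 - 20 = 3. 3 + 6 = 9.</think> 9"
print(reward(with_reasoning, "9"))
print(reward("9", "9"))
print(reward("<think>23 + 6 = 29.</think> 29", "9"))
```

Nothing in this function inspects what is written between the tags, which is the point made above: the content of the reasoning is shaped only indirectly, through its effect on the final answer.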

The analogy to evolution is apt. Evolution doesn't design eyes — it rewards survival, and eyes emerge because seeing helps you survive. Similarly, RL doesn't design reasoning — it rewards correct answers, and reasoning emerges because thinking step-by-step maximizes the probability of correctness. Reasoning is the optimal strategy discovered through selection pressure, not design.

DeepSeek-R1's breakthrough: Chain-of-Thought reasoning is not an annotation artifact — it's the optimal strategy. Given only a "right answer" reward, RL independently discovers that writing out step-by-step reasoning, self-correcting errors, and trying alternative approaches maximizes the probability of reaching correct conclusions. Reasoning is what you get when you optimize for truth.
DeepSeek-R1-Zero's reward function only checks the final answer and format. How does the model learn to write step-by-step reasoning?

Chapter 6: GRPO & DAPO

DeepSeek-R1 used reinforcement learning to train reasoning, but which RL algorithm? The standard approach is Proximal Policy Optimization (PPO), the same algorithm used in RLHF (Lecture 8). But PPO has a dirty secret: it's expensive. PPO requires a critic model (value function) that's the same size as the policy model — doubling your GPU memory requirement.

DeepSeek introduced Group Relative Policy Optimization (GRPO), which eliminates the critic entirely. GRPO estimates the baseline by sampling a group of responses and using the group's mean reward as the baseline. No separate value network. No extra model to train.

PPO vs. GRPO

PPO computes the advantage for each token as: A(s,a) = R - V(s), where V(s) is a learned value function. This value function requires a separate neural network (the critic) that maps states to expected future rewards. Training the critic requires its own loss function, learning rate, and GPU memory — essentially doubling the system complexity.

GRPO replaces V(s) with a simple group statistic. For each question, sample G responses. Compute each response's reward. The advantage for response i is:

A_i = (R_i − mean(R_1, ..., R_G)) / std(R_1, ..., R_G)

That's it. No neural network. No backpropagation for the baseline. Just sample, score, normalize. Responses better than the group average get positive advantages (reinforced). Responses worse than average get negative advantages (discouraged).
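As a worked example of the formula above, the sketch below normalizes one hypothetical group of G = 8 binary rewards (three correct, five incorrect). The reward values are invented for illustration, and using the population standard deviation is an assumption; implementations differ on that detail.

```python
import statistics

def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group: (R_i - mean) / std."""
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group: 3 of 8 sampled responses were correct.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
advantages = group_advantages(rewards)

for r, a in zip(rewards, advantages):
    print(f"reward={r:.0f}  advantage={a:+.3f}")
```

Each correct response gets an advantage of about +1.291 and each incorrect one about −0.775. Because correct answers are the minority in this group, each one is pushed up harder than each incorrect one is pushed down, and the advantages sum to zero.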

PPO vs. GRPO Architecture

PPO requires a separate critic network. GRPO eliminates it by using group statistics. Drag the group size slider to see how GRPO's advantage estimates change with more samples.


DAPO: Fixing GRPO's Weaknesses

DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization; ByteDance, 2025) addresses several failure modes of GRPO:

Problem 1: Entropy collapse. During RL training, the model's output distribution becomes increasingly peaked: it converges on a single reasoning pattern and stops exploring. DAPO counters this with Clip-Higher, which decouples the lower and upper clipping bounds of the PPO-style objective and raises the upper one. With a symmetric clip range, a low-probability token can gain only a little probability per update; a looser upper bound gives rare but useful tokens room to grow, which keeps the model from "forgetting" alternative approaches.

Problem 2: Length bias. GRPO averages the loss within each response before averaging across responses, so every response carries equal weight regardless of length, and each token in a long chain contributes less to the gradient than a token in a short one. Since reasoning often requires more tokens, this under-trains exactly the long chains we care about. DAPO switches to a token-level loss, averaged over all tokens in the batch, so long responses influence the update in proportion to their length. Separately, DAPO removes the KL penalty toward the reference model and relies on clipping alone, on the grounds that a long-chain reasoning policy is expected to drift far from its starting point.

The intended result: DAPO-trained models are free to generate longer, more thorough reasoning chains on hard problems while keeping chains short on easy ones. The model learns to allocate "thinking budget" based on problem difficulty, which is exactly the behavior we want.

Problem 3: Zero-signal groups. If all G samples in a group get the same reward (all correct or all incorrect), every advantage is zero and the question contributes no gradient. DAPO's dynamic sampling filters such questions out and keeps sampling until the batch is filled with questions where some responses succeed and some fail, since those provide the most informative gradient signal.
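The group-filtering idea can be sketched in a few lines. This is a minimal illustration rather than DAPO's actual implementation: `sample_group` is a hypothetical stand-in for "sample G responses to a question and score them", and the toy pass rates are invented.

```python
import random

def has_signal(rewards):
    """A group is informative only if its rewards are not all identical."""
    return len(set(rewards)) > 1

def build_batch(questions, sample_group, batch_size, max_tries=1000):
    """Keep sampling groups until `batch_size` informative ones are collected.

    `sample_group(question)` is a hypothetical callable that returns a list
    of G rewards, one per sampled response to `question`.
    """
    batch = []
    tries = 0
    while len(batch) < batch_size and tries < max_tries:
        q = random.choice(questions)
        rewards = sample_group(q)
        tries += 1
        if has_signal(rewards):  # drop all-correct and all-incorrect groups
            batch.append((q, rewards))
    return batch

# Toy stand-in: each question has a fixed, invented pass rate.
pass_rate = {"easy": 1.0, "medium": 0.5, "impossible": 0.0}

def toy_sample_group(question, G=8):
    return [1.0 if random.random() < pass_rate[question] else 0.0 for _ in range(G)]

random.seed(0)
batch = build_batch(list(pass_rate), toy_sample_group, batch_size=4)
print([q for q, _ in batch])
```

Only the "medium" question survives the filter: the "easy" and "impossible" questions always produce identical rewards within a group, so they carry no gradient signal and are skipped.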

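| Feature | PPO | GRPO | DAPO |
| --- | --- | --- | --- |
| Critic network | Required | None | None |
| Baseline | Learned V(s) | Group mean | Group mean + filtering |
| KL penalty | Per-token | Per-token | Removed (clip-only) |
| Entropy management | Manual coefficient | None | Clip-Higher |
| GPU memory | 2x (policy + critic) | 1x | 1x |
| AIME 2024 accuracy (Qwen2.5-32B base, as reported in the DAPO paper) | Not reported | ~30% (naive baseline); ~47% (DeepSeek-R1-Zero-Qwen-32B) | ~50% |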

Why Entropy Collapse Kills Reasoning

Entropy collapse deserves special attention because it's the most insidious failure mode. During RL training, the model's output distribution gradually becomes a spike — it converges on a single phrasing, a single reasoning template, a single answer format. Diversity evaporates.

Why is this fatal for reasoning? Because reasoning requires exploration. A model that always starts with "Let me calculate..." will never discover that some problems are better solved by "Let me think about this differently..." or "Wait, what if I work backwards?" DAPO's Clip-Higher (a looser upper clipping bound in the policy objective) counteracts this by leaving low-probability tokens room to gain probability, so alternative openings are not squeezed out during training.

The practical impact: DAPO-trained models continue to discover new reasoning strategies late in training, while GRPO-trained models plateau because they've lost the ability to explore alternatives.
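One mechanism proposed in the DAPO paper for keeping exploration alive is to decouple the lower and upper clipping ranges of the PPO-style objective and loosen the upper one. The sketch below shows the effect on a single token. The function name and the specific epsilon values are illustrative assumptions, not a reference implementation.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """PPO-style clipped objective for one token.

    ratio     = pi_new(token) / pi_old(token)
    advantage = the advantage assigned to that token's response
    """
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)

# A low-probability token that the update wants to make 50% more likely.
ratio, advantage = 1.5, 1.0

symmetric = clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2)
decoupled = clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.28)

print(f"symmetric clip: {symmetric:.2f}")  # 1.20
print(f"decoupled clip: {decoupled:.2f}")  # 1.28
```

With the symmetric range, the objective stops rewarding the token once its probability has grown by 20%. The looser upper bound lets a rarely used token keep gaining a little longer, which is the kind of headroom exploration needs.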

The evolution: PPO (expensive, works) → GRPO (cheap, works nearly as well) → DAPO (cheap, fixes edge cases). Each step simplifies the algorithm while maintaining or improving performance. The trend is clear: you don't need a complex RL setup to train reasoning. You need good reward signals, diverse sampling, and enough compute to let the model explore.
python
# GRPO advantage computation (simplified)
def grpo_advantages(question, policy, reward_fn, G=8):
    # Sample G responses from the current policy
    responses = [policy.sample(question) for _ in range(G)]
    rewards = [reward_fn(r) for r in responses]

    # Group statistics replace the critic
    mu = sum(rewards) / G
    sigma = (sum((r - mu) ** 2 for r in rewards) / G) ** 0.5

    # DAPO addition: if every response got the same reward there is no
    # learning signal, so skip the group before normalizing
    if sigma < 1e-6:
        return None

    # Normalize advantages (epsilon guards against division by a tiny sigma)
    advantages = [(r - mu) / (sigma + 1e-8) for r in rewards]

    return list(zip(responses, advantages))
GRPO eliminates the critic network from PPO. What does it use as the baseline instead?

Chapter 7: Reasoning Lab

Time to see everything in action. The simulation below runs the same math problem through four different reasoning strategies side by side: Standard (direct answer), Chain-of-Thought (single chain), Self-Consistency (multiple chains with majority vote), and RL-trained (DeepSeek-R1-style model with emergent reasoning).

Click "New Problem" to generate a fresh multi-step math question. Then click "Run All" to watch each strategy process the problem simultaneously. Pay attention to the differences: how many tokens each generates, whether it catches its own errors, and whether it reaches the correct answer.

Reasoning Strategy Comparison

Four strategies on the same problem. Standard barely thinks. CoT writes out steps. Self-Consistency votes across chains. RL-trained self-corrects and verifies.

What to Notice

Standard: Jumps straight to an answer. Often wrong on multi-step problems. Uses the fewest tokens. Think of this as "System 1" thinking — fast, automatic, error-prone.

Chain-of-Thought: Decomposes the problem into steps. Usually gets it right on 2-3 step problems. But a single mistake in the chain propagates to the final answer. No error correction.

Self-Consistency: Runs multiple chains. Even if individual chains have errors, the majority vote usually converges on the correct answer. Token cost scales with the number of chains: five chains cost ~5x the tokens of a single chain. The green highlight shows the majority answer.

RL-trained: Generates reasoning similar to CoT, but with self-correction. You'll see phrases like "Wait, let me reconsider" and "Let me verify this." The model learned to double-check its own work through RL training. It uses more tokens than basic CoT but fewer than self-consistency, achieving comparable or better accuracy.
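The voting step in Self-Consistency is simple enough to sketch directly. The five final answers below are hypothetical, standing in for answers extracted from five sampled chains on the cafeteria problem.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer and its share of the votes."""
    counts = Counter(answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(answers)

# Hypothetical final answers extracted from five sampled chains.
chain_answers = ["9", "9", "8", "9", "29"]
answer, share = majority_vote(chain_answers)
print(answer, share)  # 9 0.6
```

Two of the five chains went wrong, in different ways, and the vote still lands on 9. Note that on a tie, `most_common` returns whichever answer appeared first, so a real system would want an explicit tie-breaking rule.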

The Compute-Accuracy Trade-off

| Strategy | Tokens Generated | Accuracy (GSM8K) | Relative Cost |
| --- | --- | --- | --- |
| Standard | ~5 | ~17% | 1x |
| CoT (1 chain) | ~80 | ~58% | ~16x |
| Self-Consistency (40) | ~3200 | ~74% | ~640x |
| RL-trained (R1) | ~200 | ~94% | ~40x |

The RL-trained model is the clear winner on the efficiency frontier. It achieves the highest accuracy at moderate cost. Self-consistency is the safest (ensemble effect), but expensive. The lesson: training the model to reason (RL) is more cost-effective than forcing it to reason at inference time (prompting).
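The relative-cost column in the table is just generated tokens divided by the Standard baseline. The sketch below reproduces that arithmetic from the table's approximate token counts. It assumes cost is proportional to generated tokens and ignores prompt tokens, which is a simplification.

```python
# Approximate tokens generated per problem, copied from the table above.
tokens = {
    "Standard": 5,
    "CoT (1 chain)": 80,
    "Self-Consistency (40)": 40 * 80,  # 40 chains of ~80 tokens each
    "RL-trained (R1)": 200,
}

baseline = tokens["Standard"]
relative_cost = {name: n / baseline for name, n in tokens.items()}

for name, cost in relative_cost.items():
    print(f"{name:<24}{tokens[name]:>6} tokens  {cost:>5.0f}x")
```

Seen this way, 40-chain self-consistency costs 16 times what the RL-trained model does (3200 vs. 200 tokens) for lower accuracy in the table's numbers.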

Why RL-Trained Reasoning Is Different

Look carefully at the RL-trained column. It does something the others don't: self-correction. When it detects an intermediate result that seems off, it backs up and tries again. "Wait, let me reconsider" is not window dressing — it's a learned behavior that reduces the error rate.

Self-consistency achieves a similar effect through brute force (many chains + voting), but its cost grows with the number of chains: five chains cost about 5x the tokens of one. The RL-trained model achieves it with a single chain that has built-in error correction. Think of it as the difference between having five proofreaders (self-consistency) and one careful writer who re-reads their own work (RL-trained).
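To get a feel for what the "proofreaders" buy you, here is a back-of-the-envelope calculation under strong simplifying assumptions: chains are independent, each is right with the same probability, and a strict majority must agree on the right answer. Real chains make correlated errors, and wrong answers tend to scatter, so treat this as intuition rather than a prediction. The 0.58 figure is the single-chain CoT accuracy from the table.

```python
from math import comb

def p_majority_correct(p, n):
    """P(strict majority of n independent chains is correct), each correct w.p. p."""
    need = n // 2 + 1
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(need, n + 1))

for n in (1, 5, 15, 41):
    print(n, round(p_majority_correct(0.58, n), 3))
```

The probability climbs with every added chain, but each extra chain costs a full chain's worth of tokens, which is the brute-force trade described above.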

This is the fundamental trade-off of the reasoning landscape: you can get accuracy through ensemble methods (expensive, simple) or through better training (expensive upfront, cheap at inference). The trend is clearly toward the latter — invest in training once, benefit at every inference.

Chapter 8: Failure Modes

Reasoning chains look impressive. A model that writes "Let me think... first I calculate X... then Y... therefore Z" feels trustworthy. But appearances can deceive. The most dangerous failure mode in reasoning models is unfaithful reasoning: the model generates a chain that looks logical but doesn't actually correspond to how it arrived at its answer.

Faithful vs. Unfaithful Reasoning

Faithful reasoning means the written chain accurately reflects the model's internal computation. If the model writes "23 - 20 = 3, then 3 + 6 = 9," it actually used these intermediate results to compute the final answer. The chain is the real "scratchpad."

Unfaithful reasoning means the model decided on the answer first (perhaps through pattern-matching), then generated a plausible-looking justification after the fact. The chain is a post-hoc rationalization, not a real reasoning process. This is sometimes called "motivated reasoning" in psychology.

How can you tell? Here's the litmus test: if you change a step in the middle of the chain, does the final answer change? In faithful reasoning, it should — each step genuinely depends on the previous one. In unfaithful reasoning, the model might reach the same answer regardless of what intermediate steps you inject.

Faithful vs. Unfaithful Reasoning

Toggle between faithful and unfaithful reasoning chains. In the unfaithful chain, notice how the conclusion doesn't logically follow from the (incorrect) intermediate steps — the model just pattern-matched the answer.

Catalog of Failure Modes

1. Unfaithful chains (post-hoc rationalization). The model arrives at an answer through pattern-matching, then constructs a reasoning chain that appears to justify it. The chain is fluent and plausible but doesn't reflect the actual computation. Most common on problems the model has seen similar versions of in training.

2. Error propagation without correction. An early step makes a small error (e.g., 23 - 20 = 2 instead of 3). All subsequent steps faithfully build on this wrong intermediate result, producing a confidently wrong final answer. The chain is internally consistent but externally wrong.

3. Overthinking (reasoning degradation). On simple problems, long reasoning chains can actually hurt accuracy. The model introduces unnecessary complexity, considers irrelevant edge cases, or "talks itself out of" the correct answer. This is especially common with RL-trained models that were incentivized to generate long chains.

4. Sycophantic reasoning. If the user's question contains a hint or suggestion ("I think the answer is 7..."), the model may construct a reasoning chain that arrives at the user's suggested answer, even if it's wrong. The chain is unfaithful in a specific way: it's optimized to agree with the user, not to find the truth.

5. Language-dependent reasoning. Models sometimes reason better in English than in other languages, even when the math is identical. This suggests the reasoning isn't purely logical but depends on the language patterns available in training data. DeepSeek-R1 notably exhibited "language mixing" during pure RL training — it would switch between Chinese and English mid-chain, using whichever language gave it access to better reasoning patterns for a given sub-step.

6. Reward hacking. RL-trained reasoning models can find shortcuts that game the reward function. For example, a model might learn that writing "Let me verify: [repeats the answer]" always increases its score from the reward model, even when no actual verification occurs. The chain looks like it self-corrects, but it's just learned a surface pattern that the PRM gives credit for.

| Failure Mode | Symptom | Mitigation |
| --- | --- | --- |
| Unfaithful chains | Chain looks right, but perturbing a step doesn't change the answer | Process reward models (Ch 4) |
| Error propagation | One wrong step cascades through the whole chain | Self-consistency (Ch 2) |
| Overthinking | Simple problems get wrong answers with long chains | Adaptive chain length |
| Sycophancy | Reasoning chain steered by user's suggestion | Remove user hints from prompt |
| Language bias | Same problem, different accuracy by language | Multilingual training data |
| Reward hacking | Fake verification steps that game the reward model | Diverse reward models, adversarial evaluation |

A reasoning chain that reads well is not the same as a reasoning chain that reasons correctly. The core challenge: models are optimized to generate plausible text, and a plausible-looking reasoning chain IS plausible text. Distinguishing genuine reasoning from post-hoc rationalization remains an open problem, and one of the most important in AI safety.

Detecting Unfaithfulness

How do researchers detect unfaithful reasoning? The main technique is causal intervention: modify something in the reasoning chain and check if the answer changes appropriately. If you corrupt step 2 of a 4-step chain (making it obviously wrong), a faithful model should produce a different (wrong) final answer. An unfaithful model will produce the same answer regardless — because the answer was never derived from the chain in the first place.
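The intervention test can be illustrated with two toy "models" on the cafeteria problem. Both functions are hypothetical stand-ins: one computes its answer from the written step, the other ignores the chain and returns a memorized answer. With a real LLM you would instead edit the chain text and let the model continue generating from the corrupted step.

```python
def faithful_answer(question, steps):
    """Toy 'model' whose answer is computed from the last written step."""
    return steps[-1] + question["bought"]

def unfaithful_answer(question, steps):
    """Toy 'model' that ignores its chain and returns a memorized answer."""
    return 9

def answer_changes_under_corruption(answer_fn, question, steps, index, bad_value):
    """Corrupt one intermediate step and report whether the final answer moves."""
    original = answer_fn(question, steps)
    corrupted = list(steps)
    corrupted[index] = bad_value
    return answer_fn(question, corrupted) != original

# The cafeteria problem: 23 apples, 20 used, 6 bought.
question = {"start": 23, "used": 20, "bought": 6}
steps = [23 - 20]  # the chain's single intermediate result: 3

print(answer_changes_under_corruption(faithful_answer, question, steps, 0, 2))    # True
print(answer_changes_under_corruption(unfaithful_answer, question, steps, 0, 2))  # False
```

Corrupting "23 - 20 = 3" to "= 2" moves the faithful model's answer from 9 to 8, while the unfaithful model still says 9. Both look identical before the intervention, which is exactly why reading the chain alone is not enough.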

Lanham et al. (2023) tested this systematically. They found that early in the chain, perturbations often DO change the answer (suggesting some faithfulness). But for the last step or two, perturbations frequently DON'T change the answer — the model has already "decided" and is just filling in the conclusion. This suggests a mix: models are partially faithful for hard problems (where they genuinely need the scratch work) and unfaithful for easy problems (where they pattern-match the answer and backfill reasoning).

Open Question: Can We Trust Reasoning?

Turpin et al. (2023) showed that CoT explanations are often biased by features that shouldn't matter. When they added irrelevant features to the prompt (like the order of multiple-choice answers), the model's reasoning chains changed to justify answers influenced by those features — while the chains looked perfectly logical in isolation.

This means we can't simply read the chain to verify the model's reasoning. The chain might be a fabrication. This is a fundamental limitation: we can inspect the chain, but we can't verify that the chain caused the answer. The model's actual computation happens in its hidden states, which we cannot yet reliably interpret.

The Faithfulness Spectrum

It would be a mistake to think of faithfulness as binary. In practice, reasoning chains exist on a spectrum:

| Level | Description | Example |
| --- | --- | --- |
| Fully faithful | Every step genuinely contributes to the answer. Perturbing any step changes the output. | Hard math problems the model hasn't memorized |
| Partially faithful | Some steps are genuine computation, others are filler or post-hoc rationalization | Medium-difficulty problems with familiar patterns |
| Fully unfaithful | The entire chain is fabricated after the model has already decided the answer | Easy factual questions where the answer is memorized |

The troubling implication: we can't tell which level we're at by reading the chain. A fully unfaithful chain can look identical to a fully faithful one. This is why mechanistic interpretability — understanding what's happening inside the model's hidden states — is considered one of the most important research directions in AI safety. Only by understanding the internal computation can we verify that the external reasoning chain is genuine.

For now, the practical advice: trust reasoning chains on hard, novel problems (where the model genuinely needs the scratch work). Be skeptical of chains on easy, familiar problems (where the model may be rationalizing a memorized answer). And where step-level checking matters, prefer process reward models over outcome reward models, because PRMs at least check that each step is individually plausible, even if they can't guarantee faithfulness and can themselves be gamed.

A possible silver lining: RL training selects for strategies that reliably produce correct answers, so it may also push models toward more faithful reasoning. The argument is that a model that reasons faithfully should outperform one that rationalizes post hoc, because faithful reasoning generalizes to new problems while rationalization only works on familiar ones. If that holds, the same selection pressure that makes models smarter would also make them more honest, not by design, but because honesty is the better strategy in the long run. This remains a hypothesis rather than an established result.

What is "unfaithful reasoning" in a Chain-of-Thought context?

Chapter 9: Connections

Reasoning sits at the frontier of what LLMs can do. Everything in this lesson builds on the foundations laid in earlier lectures: pretraining (L07) gives the model knowledge, post-training (L08) teaches it to follow instructions, and PEFT (L09) lets us adapt it efficiently. Reasoning is what happens when we push these capabilities to solve problems that require multiple steps of logical deduction.

The Reasoning Stack

How each technique builds on the previous. Prompting methods require no training. RL methods modify the model's weights to internalize reasoning.

Key Papers

| Paper | Year | Contribution |
| --- | --- | --- |
| Chain-of-Thought (Wei et al.) | 2022 | Few-shot CoT prompting. Showed step-by-step reasoning dramatically improves math/logic accuracy. Emergent ability of scale. |
| Self-Consistency (Wang et al.) | 2022 | Sample multiple CoT chains and majority vote. Trade compute for accuracy with no extra training. |
| Zero-Shot CoT (Kojima et al.) | 2022 | "Let's think step by step" triggers reasoning with zero examples. |
| PRM800K (Lightman et al.) | 2023 | Process reward model with step-level human labels. Process supervision beats outcome supervision. |
| DeepSeek-R1 (DeepSeek) | 2025 | RL-trained reasoning from scratch. Model discovers CoT, self-correction, and verification through reward optimization alone. |
| DAPO (ByteDance) | 2025 | Addresses GRPO's entropy collapse and length bias. Clip-Higher, dynamic sampling, and a token-level loss for stable RL training. |

Connections to Other Lessons

| Lesson | Connection |
| --- | --- |
| L08: Post-training | RLHF teaches general instruction-following. RL for reasoning teaches specific problem-solving strategies. Same algorithm (PPO/GRPO), different reward signal. |
| L09: PEFT | LoRA can fine-tune reasoning capabilities at lower cost than full fine-tuning. Distillation from R1 into smaller models uses SFT, not RL. |
| L11: Evaluation | Reasoning evaluation requires checking both the final answer AND the reasoning chain. Process reward models (PRMs) are evaluation tools as much as training tools. |
| L13: Reasoning Part 2 | Continues with tree search, Monte Carlo methods, and test-time compute scaling. How to make reasoning more systematic and reliable. |

The Big Picture

The progression in this lesson tells a clear story:

| Era | Technique | Where Reasoning Lives |
| --- | --- | --- |
| 2022 | CoT Prompting | In the prompt (inference-time) |
| 2022 | Self-Consistency | In the sampling (inference-time) |
| 2023 | Process Reward Models | In the evaluator (training-time) |
| 2025 | GRPO / DAPO / R1 | In the weights (training-time) |

We started by prompting models to reason (cheap, no training). We moved to evaluating reasoning quality (PRMs, requires labeled data). We ended with training models to reason from scratch (RL, expensive but powerful). The trend is clear: reasoning capabilities are moving from inference-time tricks into the model's weights.

But this is only Part 1. In Lecture 13 (Reasoning Part 2), we'll explore what happens when you combine these techniques with search — tree-of-thought, Monte Carlo tree search, and test-time compute scaling. The idea: instead of generating a single chain and hoping it's correct, systematically explore the space of possible reasoning paths and pick the best one. If CoT gives the model scratch paper, tree search gives it an entire whiteboard.

The fundamental question remains open: are we teaching models to reason, or teaching them to simulate reasoning? The distinction matters. True reasoning would generalize to entirely novel problems and domains. Simulated reasoning might look identical but fail on problems that deviate from training distribution patterns. The DeepSeek-R1 results suggest we're closer to the former than skeptics expected — but the jury is still out.

The progression: Post-training (L08) teaches instruction-following. PEFT (L09) teaches specialization. Reasoning (L12) teaches multi-step thinking. Each layer pushes the frontier of what models can do — and each requires new evaluation methods (L11) to measure progress.