DeepSeek-AI — January 2025

DeepSeek-R1: Emergent Reasoning via Reinforcement Learning

Train a model with RL alone — no supervised fine-tuning on reasoning data — and watch self-verification, reflection, and multi-step problem solving emerge from scratch. Group Relative Policy Optimization (GRPO) is the engine.

Prerequisites: Policy gradient basics + PPO intuition + Chain-of-thought prompting. That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: The SFT Bottleneck

How do you teach a language model to reason? The standard approach in 2024 was supervised fine-tuning (SFT): collect thousands of human-written reasoning traces (step-by-step solutions) and train the model to imitate them. The leading reasoning model of the time, OpenAI's o1, was proprietary: OpenAI described it as trained with large-scale reinforcement learning but did not publish the recipe.

SFT works, but it has a fundamental bottleneck: you're limited by the quality and diversity of human demonstrations. A human can only write so many reasoning traces. The traces reflect the human's reasoning style, not necessarily the optimal style for the model. And collecting these demonstrations is expensive — you need PhD-level mathematicians writing step-by-step proofs.

DeepSeek asked a radical question: what if you skip SFT entirely? What if you just give the model a reward signal ("your answer is correct" or "your answer is wrong") and let it figure out how to reason on its own through trial and error?

The core bet of DeepSeek-R1: Reinforcement learning alone — without any supervised reasoning demonstrations — can teach a language model to reason. The model doesn't imitate human reasoning. It discovers its own reasoning strategies through reward optimization. And the strategies it discovers are sometimes better than human demonstrations.

This is a bold claim. The conventional wisdom was that RL is too unstable and sample-inefficient to learn complex reasoning from scratch: you need SFT to give the model a "warm start" that teaches it the format of reasoning before RL refines it. DeepSeek-R1-Zero showed that the warm start is helpful but not required.

The two-model progression

DeepSeek-R1-Zero: Pure RL, no SFT at all. Trained directly from the base model using only outcome-based rewards. This is the scientific proof-of-concept: reasoning CAN emerge from RL alone.

DeepSeek-R1: A small amount of SFT ("cold start") followed by RL. This is the production model. The cold start fixes formatting issues (mixed languages, unreadable outputs) that R1-Zero suffers from, while RL still does the heavy lifting for reasoning quality.

SFT vs RL Training Paradigms

Compare the two approaches: SFT (imitate human demonstrations) vs RL (learn from reward signals). Click to see how each paradigm trains the model differently.

What is the fundamental limitation of SFT-based reasoning training that DeepSeek-R1 addresses?

Chapter 1: RL-Only Training

DeepSeek-R1-Zero is trained with a stunningly simple setup. Start with the DeepSeek-V3 base model (671B parameters, MoE architecture). Give it math and coding problems. Check if the answer is correct. Use that binary signal to update the model via GRPO. That's it.

The reward function

R1-Zero uses only two types of rewards, both completely automatic (no human labelers):

| Reward type | How it works | Example |
| --- | --- | --- |
| Accuracy reward | +1 if the final answer is correct, 0 otherwise. For math: compare to ground truth. For code: run test cases. | Math: "What is 2+3?" Answer: "5" → reward = 1 |
| Format reward | Small reward for putting reasoning in <think>...</think> tags and the final answer in a designated format. | <think>reasoning</think> Answer: 5 → format reward = 0.1 |

Notice what's not in the reward: there's no reward for step quality, reasoning clarity, or intermediate steps. The model is ONLY told whether its final answer is right or wrong. Everything about how to reason must be discovered by the model itself.

Why outcome-only rewards work: The model generates reasoning traces of varying length and quality. Traces that lead to correct answers get reinforced. Traces that lead to wrong answers get suppressed. Over thousands of RL steps, the model learns that certain reasoning patterns (careful step-by-step work, double-checking, exploring alternatives) tend to produce correct answers. These patterns emerge without anyone explicitly teaching them.
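
The two rule-based rewards can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the `Answer:` output convention, the regular expressions, the exact-string comparison, the function names, and the 0.1 format bonus are all assumptions borrowed from the example in the table above.

```python
import re

# Hypothetical output convention: <think>...</think> followed by "Answer: X".
THINK_RE = re.compile(r"<think>.+?</think>", re.DOTALL)
ANSWER_RE = re.compile(r"Answer:\s*(.+?)\s*$", re.DOTALL)


def extract_answer(response: str):
    """Return the text after 'Answer:' that follows the reasoning block, or None."""
    tail = response.rsplit("</think>", 1)[-1]
    match = ANSWER_RE.search(tail)
    return match.group(1) if match else None


def has_think_tags(response: str) -> bool:
    """True if the response contains a non-empty <think>...</think> block."""
    return THINK_RE.search(response) is not None


def rule_based_reward(response: str, ground_truth: str) -> float:
    """Outcome-only reward: accuracy (0 or 1) plus a small format bonus."""
    accuracy = 1.0 if extract_answer(response) == ground_truth else 0.0
    fmt = 0.1 if has_think_tags(response) else 0.0
    return accuracy + fmt


good = "<think>2 + 3 = 5</think> Answer: 5"
wrong = "<think>guess</think> Answer: 6"
print(rule_based_reward(good, "5"))         # correct and well formatted
print(rule_based_reward("Answer: 5", "5"))  # correct, but no reasoning tags
print(rule_based_reward(wrong, "5"))        # well formatted, but wrong
```

Nothing in this function inspects the reasoning itself: the text inside the tags only has to exist, which is exactly why the reasoning strategy is left for RL to discover.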

The training data

The training problems are carefully curated:

Math: Competition-level problems (AIME, AMC, Olympiad-style) where ground truth can be verified automatically. These are hard enough that the base model fails on most of them, giving RL room to improve.

Code: LeetCode-style problems with test suites. Correctness is verified by running the generated code against test cases. Binary: all tests pass = reward 1, any test fails = reward 0.

No easy problems: Easy problems (where the base model already scores 90%+) provide little learning signal. The training set is biased toward hard problems where RL has the most to learn.
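
The all-or-nothing code reward described above can be sketched as follows. This is an illustration under stated assumptions: `code_reward`, the `(args, expected)` test-case format, and the use of `exec` are choices made for brevity, not details from the paper. Running model-generated code with `exec` is unsafe outside a toy setting; a real pipeline would use a sandbox with time and memory limits.

```python
def code_reward(source: str, func_name: str, test_cases) -> float:
    """Return 1.0 only if every test case passes, otherwise 0.0.

    WARNING: exec() runs arbitrary code. This is a toy sketch only.
    """
    namespace = {}
    try:
        exec(source, namespace)
        func = namespace[func_name]
        for args, expected in test_cases:
            if func(*args) != expected:
                return 0.0
    except Exception:
        # Syntax errors, a missing function, and runtime errors all count as failure.
        return 0.0
    return 1.0


tests = [((1, 2), 3), ((-1, 1), 0)]
correct = "def add(a, b):\n    return a + b"
buggy = "def add(a, b):\n    return a - b"
print(code_reward(correct, "add", tests))
print(code_reward(buggy, "add", tests))
```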

```python
# R1-Zero training loop (simplified sketch).
# `model`, `extract_answer`, `has_think_tags`, and `grpo_update` are
# placeholders for your own generation, parsing, and update code.
G = 64  # group size: samples per problem


def train_r1_zero(model, training_problems, extract_answer, has_think_tags, grpo_update):
    for problem in training_problems:
        # Generate G responses per problem (one GRPO group)
        group = []
        for _ in range(G):
            response = model.generate(problem, temperature=1.0)
            answer = extract_answer(response)

            # Compute reward (purely automatic, no human labels)
            accuracy_reward = 1.0 if answer == problem.ground_truth else 0.0
            format_reward = 0.1 if has_think_tags(response) else 0.0
            group.append((response, accuracy_reward + format_reward))

        # GRPO update: relative advantages within the group
        grpo_update(model, group)  # see Chapter 2
```

What emerges

Over the course of training, R1-Zero's behavior changes dramatically. The model starts out producing short answers with little visible reasoning, then passes through roughly four stages:

Early training: Short responses, often just guessing an answer. 15.6% pass@1 on AIME 2024 at the start of RL.
Mid training: Longer responses with some structure. Basic step-by-step reasoning appears. ~30% accuracy.
Late training: Multi-paragraph reasoning with self-verification. "Wait, let me check..." emerges. ~70% accuracy.
Converged: Sophisticated reasoning with exploration, backtracking, and error correction. 71.0% pass@1 on AIME 2024.

R1-Zero Training Progression

Watch how R1-Zero's behavior evolves during training. Drag the slider to see how response length, reasoning quality, and accuracy change as RL progresses.

What types of rewards does DeepSeek-R1-Zero use during training?

Chapter 2: GRPO Algorithm

The engine behind R1 is Group Relative Policy Optimization (GRPO). It's a variant of PPO designed specifically for LLM reasoning tasks. The key difference: GRPO doesn't need a separate critic/value network. Instead, it estimates advantages by comparing responses within the same group.

Why not PPO?

Standard PPO requires a value function (critic) that estimates the expected return from each state. For LLMs, this means training a second model of similar size to predict "how good is the current position in the reasoning chain?" This doubles memory and compute requirements.

GRPO eliminates the critic entirely. Instead of comparing each response to a learned baseline, it compares responses to each other within a group of G samples for the same problem.

The GRPO algorithm

For each problem, GRPO samples G responses (typically G = 64). Each response gets a reward $r_i$. The advantage of response $i$ is computed relative to the group:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$

This is simply the z-score of the reward within the group. Responses with above-average rewards get positive advantages (reinforced). Below-average responses get negative advantages (suppressed).

Why group-relative advantages work: For a hard problem where the base model gets it right 10% of the time, the ~6 correct responses (out of 64) get large positive advantages, and the ~58 wrong responses get negative advantages. The model learns to generate more responses like the correct ones. No critic needed — the group IS the baseline.
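
To make the arithmetic concrete, here is the 6-out-of-64 case worked numerically. This is a small sketch with binary rewards only (no format bonus), and it uses the population standard deviation; a sample standard deviation, which some libraries use by default, gives slightly different values.

```python
import statistics

G = 64
num_correct = 6                      # roughly a 10% success rate on a hard problem
rewards = [1.0] * num_correct + [0.0] * (G - num_correct)

mean = statistics.fmean(rewards)
std = statistics.pstdev(rewards)     # population standard deviation

adv_correct = (1.0 - mean) / std     # advantage of each correct response
adv_wrong = (0.0 - mean) / std       # advantage of each wrong response

print(f"mean reward        = {mean:.4f}")
print(f"std of rewards     = {std:.4f}")
print(f"advantage, correct = {adv_correct:+.3f}")
print(f"advantage, wrong   = {adv_wrong:+.3f}")
```

Each correct response ends up with an advantage of about +3.11 and each wrong one about -0.32. The rare successes get a strong push, the common failures a mild one, and the advantages sum to zero across the group.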

The GRPO objective

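The full GRPO objective combines the clipped policy-gradient term (from PPO) with a KL divergence penalty. The objective is maximized; the training loss is its negative:

$$J(\theta) = \mathbb{E}_i\left[\min\left(r_t(\theta)\,A_i,\ \operatorname{clip}\left(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\right) A_i\right)\right] - \beta\,\mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right)$$

Where $r_t(\theta)$ is the probability ratio between the new and old policy, $A_i$ is the group-relative advantage, $\varepsilon$ is the clipping parameter (typically 0.2), and $\beta$ controls the KL penalty that prevents the model from deviating too far from the reference policy.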
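
The effect of the clipping term is easiest to see on single numbers. The helper below is a hypothetical name introduced for illustration; it evaluates one sample's contribution to the clipped term for a given probability ratio and advantage.

```python
def clipped_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """One sample's contribution: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Good response (A > 0) whose probability already rose by 50%:
# the gain is capped at (1 + eps) * A, so there is no incentive to push further.
print(clipped_term(1.5, +1.0))

# Bad response (A < 0) whose probability already fell by 50%:
# the term is held at (1 - eps) * A, so its gradient vanishes.
print(clipped_term(0.5, -1.0))

# Bad response whose probability rose instead: the unclipped, more negative
# value is kept, so the penalty is felt in full.
print(clipped_term(1.5, -1.0))
```

The `min` makes the bound one-sided: moves in the helpful direction stop paying off once the ratio leaves the interval [1 - eps, 1 + eps], while moves in the harmful direction are never excused.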

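```python
# GRPO step (simplified sketch; sequence-level log-probs for brevity).
# `model.generate`, `model.log_prob`, and `reward_fn` are placeholder
# interfaces, not a real library API.
import torch


def grpo_step(model, ref_model, problem, reward_fn, optimizer,
              G=64, epsilon=0.2, beta=0.01):
    # Step 1: Sample G responses and score them under the old and reference policies
    with torch.no_grad():
        responses = [model.generate(problem) for _ in range(G)]
        old_log_probs = torch.stack([model.log_prob(r) for r in responses])
        ref_log_probs = torch.stack([ref_model.log_prob(r) for r in responses])

    # Step 2: Compute rewards (float dtype so mean/std are defined)
    rewards = torch.tensor([reward_fn(r, problem) for r in responses],
                           dtype=torch.float32)

    # Step 3: Group-relative advantages (z-score normalization)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Step 4: Probability ratios. On the first update after sampling the ratio
    # is exactly 1; it departs from 1 only when the same samples are reused
    # for several gradient steps.
    log_probs = torch.stack([model.log_prob(r) for r in responses])
    ratios = torch.exp(log_probs - old_log_probs)

    # Step 5: Clipped surrogate objective (same form as PPO)
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Step 6: KL penalty toward the reference policy, using the non-negative
    # estimator exp(d) - d - 1 with d = log(pi_ref / pi_theta)
    d = ref_log_probs - log_probs
    kl = (torch.exp(d) - d - 1).mean()

    # Total loss is the negated objective: policy loss plus KL penalty
    loss = policy_loss + beta * kl
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```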
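
A note on the KL term: the plain log-ratio log(πθ/πref) averaged over samples is an unbiased estimate of the KL divergence, but individual samples can be negative, which makes the penalty noisy. The GRPO formulation in the DeepSeekMath paper instead uses πref/πθ − log(πref/πθ) − 1, which has the same expectation and is never negative. The toy distributions below are made up purely to show the difference.

```python
import math

p = [0.5, 0.3, 0.2]   # current policy pi_theta over three outcomes (toy values)
q = [0.4, 0.4, 0.2]   # reference policy pi_ref (toy values)

# Per-outcome value of each estimator, for a sample x drawn from p.
naive = [math.log(pi / qi) for pi, qi in zip(p, q)]               # log(p/q)
k3 = [qi / pi - math.log(qi / pi) - 1.0 for pi, qi in zip(p, q)]  # r - log r - 1, r = q/p

# Expectation of each estimator under p: both equal KL(p || q).
kl_naive = sum(pi * v for pi, v in zip(p, naive))
kl_k3 = sum(pi * v for pi, v in zip(p, k3))

print("per-sample naive:", [round(v, 4) for v in naive])
print("per-sample k3:   ", [round(v, 4) for v in k3])
print("expected values: ", round(kl_naive, 6), round(kl_k3, 6))
```

The second outcome gives a negative naive sample because the reference assigns it more probability than the policy does, while every value of the second estimator stays at or above zero.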

GRPO vs PPO comparison

| Property | PPO | GRPO |
| --- | --- | --- |
| Critic/value network | Required (same size as policy) | Not needed |
| Baseline | Learned value function V(s) | Group mean reward |
| Memory | 2x (policy + critic) | 1x (policy only) |
| Advantage estimation | GAE (temporal) | Z-score within group |
| Samples per problem | 1-4 typically | 64 typically |
| Clipping | Yes (ε) | Yes (ε) |

GRPO Advantage Computation

See how GRPO computes advantages from a group of responses. Each bar is a response's reward. The dotted line is the group mean. Responses above the mean get positive advantages (green). Below: negative (red). Click "New Group" to resample.

How does GRPO compute advantages without a critic network?

Chapter 3: Emergent Behaviors

The most remarkable aspect of R1-Zero is what emerges without being taught. Nobody told the model to double-check its work. Nobody demonstrated self-reflection. Nobody wrote training data showing how to explore alternative approaches. These behaviors appeared spontaneously through RL optimization.

Self-verification ("Wait, let me check...")

After reaching an answer, R1-Zero learned to pause and verify its own work. This pattern appears in its reasoning traces as phrases like "Wait, let me re-examine this step" or "Let me verify by substituting back." The model literally learned to be its own checker.

```text
R1-Zero response (actual, condensed):
<think>
To solve x² + 5x + 6 = 0, I'll factor.
Looking for two numbers that multiply to 6 and add to 5.
Those are 2 and 3. So (x+2)(x+3) = 0.
Solutions: x = -2, x = -3.

Wait, let me verify. If x = -2:
(-2)² + 5(-2) + 6 = 4 - 10 + 6 = 0 ✓
If x = -3:
(-3)² + 5(-3) + 6 = 9 - 15 + 6 = 0 ✓
Both check out.
</think>
Answer: x = -2, x = -3
```

Reflection and backtracking

When R1-Zero reaches a dead end or suspects an error, it backtracks and tries a different approach. This is NOT prompted or demonstrated — it emerged because responses that backtrack from errors and find the correct answer get rewarded, while responses that stubbornly continue down wrong paths don't.

```text
R1-Zero response (actual pattern):
<think>
Let me try approach A...
[several steps]
Hmm, this leads to a contradiction. The determinant
can't be negative here. Let me reconsider.

Actually, I should use approach B instead...
[different approach, arrives at correct answer]
</think>
```

Extended thinking (the "aha moment")

Perhaps the most striking emergence: R1-Zero's responses get longer over training. The model learns that thinking longer — considering more possibilities, checking more conditions — leads to more correct answers. Average response length grows from ~100 tokens early in training to ~3,000+ tokens at convergence.

The "aha moment" in R1-Zero's training: The R1 paper uses this phrase for an intermediate checkpoint where the model interrupts its own solution, flags that it has noticed something, and re-evaluates its initial approach, in effect allocating more thinking time to the problem. In this lesson's simplified timeline the shift falls around training step ~2000. Before it, responses are short and heuristic. After it, the model generates multi-paragraph reasoning with clear structure, self-verification, and alternative approaches, and AIME accuracy climbs from roughly 20% to roughly 50%. Treat the step count and the accuracy figures as approximate.

Why emergence, not memorization

These behaviors are unlikely to be simple memorization of pre-training data, for three reasons:

1. The base model (DeepSeek-V3) doesn't reliably exhibit these behaviors when prompted. It takes RL to make them consistent.

2. The RL stage contains no demonstrations to copy. R1-Zero never sees a reasoning trace with "Wait, let me check..." in it, only a reward for the final answer.

3. The behaviors strengthen over the course of RL training, in step with accuracy improvements. Memorized behavior would be present from the first step.

Emergent Behavior Timeline

See which behaviors emerge at different training stages. Early training: short guesses. Mid training: basic reasoning. Late training: self-verification, backtracking, and extended thinking all appear.

Why is self-verification ("Wait, let me check...") considered an emergent behavior rather than a memorized one?

Chapter 4: The Full Pipeline

R1-Zero proves that RL-only reasoning works, but it has practical problems: mixed languages in outputs, poor formatting, and difficulty with non-reasoning tasks like creative writing. The production DeepSeek-R1 model uses a four-stage pipeline to fix these issues.

Stage 1: Cold start SFT

Train the base model on a small set (~thousands) of high-quality reasoning examples. This teaches the model the basic output format: use <think> tags, structure reasoning into steps, output the final answer clearly. This is NOT teaching the model how to reason — it's teaching it how to FORMAT reasoning.

Stage 2: RL on reasoning tasks

The main training phase. Apply GRPO on math and coding problems with outcome-based rewards, as in R1-Zero, but starting from the cold-started model instead of the raw base model. The R1 paper also adds a language-consistency reward at this stage to curb language mixing. This stage runs for the longest and is responsible for most of the reasoning capability.
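
One way to discourage the language mixing seen in R1-Zero during this stage is a language-consistency reward, which the R1 paper describes as the proportion of target-language words in the chain of thought. A rough sketch follows. Treating "ASCII-only word" as "English" is a crude stand-in chosen only for illustration, and the function name is hypothetical.

```python
import re


def language_consistency(cot: str) -> float:
    """Fraction of words in the chain of thought that look English.

    Assumption: a word counts as English if it is made only of ASCII
    letters (plus inner apostrophes or hyphens). Real language
    identification is more careful than this.
    """
    words = re.findall(r"[^\W\d_]+(?:['-][^\W\d_]+)*", cot)
    if not words:
        return 0.0
    english = [w for w in words if w.isascii()]
    return len(english) / len(words)


print(language_consistency("First, factor the quadratic."))
print(language_consistency("First, 我们 factor the 方程."))
```

The first trace is entirely English and scores 1.0. The second mixes in two Chinese words out of five and scores 0.6, so it would earn a smaller bonus.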

Stage 3: Rejection sampling + SFT

After RL training, use the model to generate many solutions for each problem. Keep only the correct ones (rejection sampling). Also generate high-quality responses for non-reasoning tasks (writing, summarization, general QA). Fine-tune on this combined dataset to create a well-rounded model.
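
Rejection sampling in this sense is just "generate many, keep the verified ones". The sketch below shows the filtering loop with toy stand-ins for the model and the checker. All function names and the record layout are assumptions made for illustration.

```python
import random


def rejection_sample(problems, generate, is_correct, samples_per_problem=4):
    """Collect (prompt, response) pairs whose final answer is verified correct."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            response = generate(problem["prompt"])
            if is_correct(response, problem["answer"]):
                dataset.append({"prompt": problem["prompt"], "response": response})
    return dataset


# Toy stand-ins for the RL-trained model and the answer checker (hypothetical).
rng = random.Random(0)


def toy_generate(prompt):
    return f"Answer: {rng.choice(['4', '5'])}"


def toy_is_correct(response, answer):
    return response.strip().endswith(f"Answer: {answer}")


problems = [{"prompt": "What is 2+3?", "answer": "5"}]
sft_data = rejection_sample(problems, toy_generate, toy_is_correct,
                            samples_per_problem=8)
print(len(sft_data), "of 8 samples kept")
```

Every record that survives the filter is correct by construction, so the resulting SFT set contains only verified solutions.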

Stage 4: RL on all tasks

A final RL stage that optimizes both reasoning accuracy AND general helpfulness/safety. This uses both rule-based rewards (accuracy on math/code) and a learned reward model (for general quality).

Stage 1: Cold Start SFT. ~Thousands of examples. Teaches output format (think tags, structure). Quick and cheap.
Stage 2: Reasoning RL (GRPO). Main training. Math + code problems. Outcome rewards only. The heavy lifting.
Stage 3: Rejection Sampling + SFT. Generate many solutions, keep correct ones. Add general-purpose data. Fine-tune.
Stage 4: All-Task RL. RL on reasoning + general tasks. Rule-based + learned rewards. Final polish.

Each stage has a purpose. Stage 1 prevents formatting chaos. Stage 2 builds reasoning muscle. Stage 3 broadens the model beyond math/code. Stage 4 aligns everything together. Skip any stage and you get a weaker model: skip Stage 1 and you get R1-Zero's formatting mess; skip Stage 3 and you get a math-only model.

Distillation: small models that reason

DeepSeek also distilled R1's reasoning ability into smaller models (1.5B to 70B). The process is simple: use R1 to generate reasoning traces, then fine-tune smaller models on those traces. The results are striking — a 32B distilled model outperforms many 70B+ models on math reasoning.

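| Model | Size | AIME 2024 | MATH-500 |
| --- | --- | --- | --- |
| DeepSeek-R1 | 671B (MoE) | 79.8% | 97.3% |
| R1-Distill-Qwen-32B | 32B | 72.6% | 94.3% |
| R1-Distill-Qwen-14B | 14B | 69.7% | 93.9% |
| R1-Distill-Qwen-7B | 7B | 55.5% | 92.8% |
| R1-Distill-Qwen-1.5B | 1.5B | 28.9% | 83.9% |
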
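
Distillation as described here is ordinary supervised fine-tuning on teacher outputs, so the main engineering step is packing each teacher trace into a training example. A minimal sketch, assuming a prompt/completion record layout and the `<think>` / `Answer:` format used earlier in this lesson; both are illustrative choices, not the paper's data format.

```python
def build_distillation_example(prompt: str, reasoning: str, answer: str) -> dict:
    """Pack one teacher trace into a prompt/completion pair for SFT."""
    completion = f"<think>\n{reasoning.strip()}\n</think>\nAnswer: {answer.strip()}"
    return {"prompt": prompt, "completion": completion}


# A hand-written stand-in for a trace generated by the teacher model.
teacher_traces = [
    {
        "prompt": "Solve x^2 + 5x + 6 = 0.",
        "reasoning": "Factor: (x+2)(x+3) = 0, so x = -2 or x = -3.",
        "answer": "x = -2, x = -3",
    },
]

sft_dataset = [build_distillation_example(**t) for t in teacher_traces]
print(sft_dataset[0]["completion"])
```

The student is then trained with the usual next-token loss on the completion, so it learns to reproduce the teacher's reasoning style without running any RL of its own.
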
Training Pipeline Visualizer

Click each stage to see what it does and what data it uses. Compare the R1-Zero path (RL only) to the R1 path (4 stages).

Why does the production R1 model use a cold-start SFT stage before RL, unlike R1-Zero?

Chapter 5: Results

DeepSeek-R1 achieves performance competitive with OpenAI's o1 across math, code, and science reasoning — at a fraction of the training cost and fully open-weight.

Benchmark comparison

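| Benchmark | DeepSeek-R1 | OpenAI o1 | GPT-4o | Claude 3.5 |
| --- | --- | --- | --- | --- |
| AIME 2024 | 79.8% | 79.2% | 9.3% | 16.0% |
| MATH-500 | 97.3% | 96.4% | 76.6% | 78.3% |
| Codeforces (Elo rating) | 2029 | 2061 | 808 | 717 |
| GPQA Diamond | 71.5% | 78.0% | 53.6% | 65.0% |
| MMLU | 90.8% | 91.8% | 87.5% | 88.3% |
| LiveCodeBench | 65.9% | 63.4% | 33.4% | 36.3% |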

The results are striking. R1 matches or slightly beats o1 on math (AIME, MATH-500) and on LiveCodeBench, while being open-weight. It trails slightly on Codeforces and general knowledge (MMLU), and by a wider margin on science reasoning (GPQA).

The significance: DeepSeek-R1 demonstrated that OpenAI's o1-level reasoning can be reproduced with a simple, well-understood algorithm (GRPO), minimal human supervision (no process reward models, no MCTS), and full openness (open weights, open paper). This democratized "reasoning model" technology overnight.

R1-Zero vs R1

| Model | AIME 2024 | MATH-500 | Formatting | General tasks |
| --- | --- | --- | --- | --- |
| R1-Zero | 71.0% | 95.6% | Poor (mixed languages) | Weak |
| R1 | 79.8% | 97.3% | Clean | Strong |

R1-Zero already achieves impressive reasoning (71% AIME), confirming that RL alone can produce strong reasoning. R1 adds ~9 points on AIME through the four-stage pipeline, mainly from better formatting and broader training.

The cost advantage

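Because GRPO doesn't require a critic network, it avoids training and storing a second model of roughly the policy's size, which PPO needs. That saving is partly offset by the larger number of samples GRPO draws per problem, so the net cost advantage depends on the setup. The open-weight release means anyone with suitable hardware can deploy R1 themselves instead of paying per API call. The smaller distilled models can run on consumer hardware, with quantization for the larger ones.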

Model Comparison Dashboard

Compare DeepSeek-R1 against other models across benchmarks. Select a benchmark to see head-to-head results.

What is the practical significance of DeepSeek-R1's results compared to OpenAI's o1?

Chapter 6: GRPO Simulator

Let's see GRPO in action. This simulator shows how the algorithm works step by step: sample responses, compute rewards, normalize advantages, and update the policy.

GRPO Training Simulator

Watch GRPO train on a simulated problem. Each column is one response in the group. Green = correct answer (reward = 1). Red = wrong (reward = 0). The height shows the advantage (z-scored reward). Click "GRPO Step" to run one training step and see the policy improve.


What to watch for

As you run GRPO steps, notice:

1. The accuracy climbs. The policy's per-problem accuracy starts at the base model's level (~10-20% for hard problems) and climbs as correct responses are reinforced.

2. Response length increases. Longer, more careful responses tend to be correct more often. GRPO reinforces these, so average response length grows — matching the pattern observed in R1-Zero.

3. Variance decreases. Early in training, the group has high reward variance (some correct, many wrong). As the policy improves, more responses in each group are correct, and more groups come back all-correct. A group with identical rewards gives every response zero advantage, so the update signal shrinks as the model converges.
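
How the group's advantages change as accuracy climbs can be checked numerically. The sketch below assumes binary rewards, a group of 16, and the population standard deviation with a small epsilon for stability. The helper name is made up for this example.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Z-score each reward within its group (population std, eps for stability)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


G = 16
for num_correct in (2, 8, 15, 16):
    rewards = [1.0] * num_correct + [0.0] * (G - num_correct)
    adv = group_advantages(rewards)
    print(f"{num_correct:2d}/{G} correct: "
          f"A(first) = {adv[0]:+.2f}, A(last) = {adv[-1]:+.2f}")
```

With 2 of 16 correct, the successes get a large positive advantage. With 15 of 16 correct, the single failure gets a large negative one. With 16 of 16 correct, every advantage is zero and the group contributes no gradient, which is why problems the model has fully solved stop teaching it anything.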

GRPO's elegance: The entire training signal comes from comparing the model's own responses to each other. No human labels beyond "correct/incorrect." No separate reward model. No critic network. Just: "within this group, which responses were better than average? Do more of those." Simple, scalable, and effective.
During GRPO training, what happens to the response length and why?

Chapter 7: Connections

DeepSeek-R1 sits at the intersection of two major trends: reinforcement learning for LLMs and inference-time compute scaling. It connects forward to DAPO and backward to PPO/DPO.

The RL for LLM reasoning lineage

| Method | Year | Relationship to R1 |
| --- | --- | --- |
| PPO for RLHF | 2022 | Original RL algorithm for LLMs. R1 replaces the critic with group comparison. |
| DPO | 2023 | Simplifies RLHF by skipping RL. R1 shows that online RL with verifiable rewards is highly effective for reasoning. |
| Self-Consistency | 2023 | Inference-time voting. R1 makes the base model stronger so fewer samples are needed. |
| Process Reward Models | 2023 | Step-level rewards. R1 achieves similar results with outcome-only rewards. |
| OpenAI o1 | 2024 | Proprietary reasoning model. R1 is an open-weight model that reaches comparable benchmark results. |
| DAPO | 2025 | Fixes GRPO's remaining issues: clip bias, entropy collapse, length bias. |

What R1 got right

RL alone is enough for reasoning. This was the most surprising finding. The conventional wisdom that SFT is a prerequisite for reasoning was wrong.

Simplicity scales. GRPO is simpler than PPO (no critic), uses simpler rewards (binary accuracy), and achieves comparable results to much more complex systems.

Open-weight release. By releasing model weights, DeepSeek enabled the entire research community to build on R1. This accelerated the field significantly.

What R1 got wrong (or left open)

Language mixing in R1-Zero. Without SFT, the model produces reasoning in unpredictable languages. This is a practical limitation of pure RL.

No process rewards. R1 uses only outcome rewards. Adding step-level rewards (as in process reward models) could improve efficiency — rewarding good intermediate steps instead of only complete solutions.

Reward hacking. With long training, the model can learn to game the reward function (e.g., exploiting edge cases in test suites for code problems). DAPO addresses some of these issues.
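
The difference between outcome and process rewards shows up in how credit is spread over tokens. With an outcome reward, every token in a response inherits the same advantage. With step-level rewards, each step can be credited separately. The sketch below is a simplified illustration: the step boundaries, the advantage values, and both function names are made up, and real process-supervised schemes are more involved.

```python
def outcome_token_advantages(num_tokens: int, advantage: float) -> list:
    """Outcome supervision: every token inherits the response-level advantage."""
    return [advantage] * num_tokens


def process_token_advantages(step_lengths, step_advantages) -> list:
    """Process supervision (simplified): each token gets its own step's advantage."""
    per_token = []
    for length, adv in zip(step_lengths, step_advantages):
        per_token.extend([adv] * length)
    return per_token


step_lengths = [4, 3, 2]   # tokens in each of three reasoning steps (toy values)

# A wrong final answer: the outcome reward penalizes all nine tokens equally.
outcome = outcome_token_advantages(sum(step_lengths), advantage=-0.5)

# Hypothetical step scores: only the second step contained the mistake.
process = process_token_advantages(step_lengths, step_advantages=[0.3, -1.0, 0.1])

print(outcome)
print(process)
```

The outcome-only version blames the correct first step as much as the faulty second one. That is the inefficiency referred to above, and it is the price R1 pays for not needing a step-level reward model.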

The R1 legacy. DeepSeek-R1 proved three things that shaped the field: (1) RL-only reasoning works, (2) you don't need process reward models or MCTS for strong reasoning, and (3) open-weight reasoning models can match proprietary ones. Many reasoning models released since January 2025 build on insights from this paper.

DAPO: Fixes GRPO's clipping bias and entropy collapse.

Self-Consistency: Inference-time voting that complements R1's strong base reasoning.

Let's Verify Step by Step: Process reward models that R1 showed aren't strictly necessary.

Scaling Test-Time Compute: R1's extended thinking is a form of test-time compute scaling.

RL for Reasoning Timeline

See how RL-based reasoning evolved from PPO/RLHF to R1's GRPO to DAPO and beyond.

What three things did DeepSeek-R1 prove that shaped the reasoning model landscape?