Train a model with RL alone — no supervised fine-tuning on reasoning data — and watch self-verification, reflection, and multi-step problem solving emerge from scratch. Group Relative Policy Optimization (GRPO) is the engine.
How do you teach a language model to reason? The standard approach in 2024 was supervised fine-tuning (SFT): collect thousands of human-written reasoning traces (step-by-step solutions) and train the model to imitate them. OpenAI's o1 was a notable exception: OpenAI describes it as trained with large-scale reinforcement learning, but has not published the training details.
SFT works, but it has a fundamental bottleneck: you're limited by the quality and diversity of human demonstrations. A human can only write so many reasoning traces. The traces reflect the human's reasoning style, not necessarily the optimal style for the model. And collecting these demonstrations is expensive — you need PhD-level mathematicians writing step-by-step proofs.
DeepSeek asked a radical question: what if you skip SFT entirely? What if you just give the model a reward signal ("your answer is correct" or "your answer is wrong") and let it figure out how to reason on its own through trial and error?
This is a bold claim. The conventional wisdom was that RL is too unstable and sample-inefficient to learn complex reasoning from scratch. You need SFT to give the model a "warm start": teach it the format of reasoning before RL refines it. DeepSeek-R1-Zero showed that this warm start is not strictly required.
DeepSeek-R1-Zero: Pure RL, no SFT at all. Trained directly from the base model using only outcome-based rewards. This is the scientific proof-of-concept: reasoning CAN emerge from RL alone.
DeepSeek-R1: A small amount of SFT ("cold start") followed by RL. This is the production model. The cold start fixes formatting issues (mixed languages, unreadable outputs) that R1-Zero suffers from, while RL still does the heavy lifting for reasoning quality.
DeepSeek-R1-Zero is trained with a stunningly simple setup. Start with the DeepSeek-V3 base model (671B parameters, MoE architecture). Give it math and coding problems. Check if the answer is correct. Use that binary signal to update the model via GRPO. That's it.
R1-Zero uses only two types of rewards, both completely automatic (no human labelers):
| Reward Type | How It Works | Example |
|---|---|---|
| Accuracy reward | +1 if the final answer is correct, 0 otherwise. For math: compare to ground truth. For code: run test cases. | Math: "What is 2+3?" Answer: "5" → reward = 1 |
| Format reward | Small reward for putting reasoning in <think>...</think> tags and the final answer in a designated format. | <think>reasoning</think> Answer: 5 → format reward = 0.1 |
Notice what's not in the reward: there's no reward for step quality, reasoning clarity, or intermediate steps. The model is ONLY told whether its final answer is right or wrong. Everything about how to reason must be discovered by the model itself.
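The two rewards in the table can be sketched as plain string checks. This is a minimal illustration, not DeepSeek's implementation: the values 1.0 and 0.1 follow the table above, while the `Answer:` marker, the regular expression, and all function names are assumptions made for the example.

```python
import re

# Assumed output format: "<think>...</think> Answer: <final answer>"
FORMAT_RE = re.compile(r"\s*<think>.*?</think>\s*Answer:.+", re.DOTALL)

def extract_answer(response: str):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    parts = response.rsplit("Answer:", 1)
    return parts[1].strip() if len(parts) == 2 else None

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted final answer matches the ground truth, else 0.0."""
    return 1.0 if extract_answer(response) == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """0.1 if reasoning is wrapped in <think> tags before the answer, else 0.0."""
    return 0.1 if FORMAT_RE.fullmatch(response) else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    # Nothing here inspects the reasoning itself: only the final answer
    # and the outer format are scored.
    return accuracy_reward(response, ground_truth) + format_reward(response)

print(total_reward("<think>2 + 3 = 5</think> Answer: 5", "5"))
print(total_reward("Answer: 5", "5"))
```

Exact string matching is the simplest possible checker; a real math verifier would also need to treat equivalent forms (for example `0.5` and `1/2`) as equal.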
The training problems are carefully curated:
Math: Competition-level problems (AIME, AMC, Olympiad-style) where ground truth can be verified automatically. These are hard enough that the model gets most of them wrong at the start (R1-Zero's pass@1 on AIME 2024 begins at 15.6%), giving RL room to improve.
Code: LeetCode-style problems with test suites. Correctness is verified by running the generated code against test cases. Binary: all tests pass = reward 1, any test fails = reward 0.
No easy problems: Easy problems (where the base model already scores 90%+) provide little learning signal. The training set is biased toward hard problems where RL has the most to learn.
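The all-or-nothing code reward described above can be sketched as follows. This is a simplified illustration under stated assumptions: the candidate is passed in as an ordinary Python callable, and `code_reward`, `correct_max`, and `buggy_max` are hypothetical names. A real pipeline would run untrusted generated code in a sandbox with time and memory limits.

```python
from typing import Any, Callable, Sequence, Tuple

# One test case: (positional arguments, expected return value)
TestCase = Tuple[tuple, Any]

def code_reward(candidate: Callable, tests: Sequence[TestCase]) -> float:
    """Return 1.0 only if the candidate passes every test, else 0.0."""
    for args, expected in tests:
        try:
            if candidate(*args) != expected:
                return 0.0  # any wrong output fails the whole problem
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0

def correct_max(xs):
    return max(xs)

def buggy_max(xs):
    return xs[0]  # only right when the largest element comes first

tests = [(([3, 1, 2],), 3), (([1, 5, 2],), 5)]
print(code_reward(correct_max, tests))  # 1.0
print(code_reward(buggy_max, tests))    # 0.0
```

Note that `buggy_max` passes the first test and still earns nothing, which is exactly the binary behavior the training setup relies on.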
```python
# R1-Zero training loop (simplified pseudocode; helper functions are placeholders)
for problem in training_problems:
    # Generate G responses per problem (the GRPO group)
    responses = []
    for _ in range(64):  # G = 64 samples per problem
        response = model.generate(problem, temperature=1.0)
        answer = extract_answer(response)

        # Compute reward (purely automatic)
        accuracy_reward = 1.0 if answer == problem.ground_truth else 0.0
        format_reward = 0.1 if has_think_tags(response) else 0.0
        reward = accuracy_reward + format_reward
        responses.append((response, reward))

    # GRPO update: relative advantages within the group
    grpo_update(model, responses)  # see Chapter 2
```
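Over the course of training, R1-Zero's behavior changes dramatically. Early responses are short guesses with little visible reasoning. Gradually, the model learns to reason step by step, verify its own answers, backtrack from dead ends, and spend more tokens thinking before it commits to an answer.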
The engine behind R1 is Group Relative Policy Optimization (GRPO). It's a variant of PPO designed specifically for LLM reasoning tasks. The key difference: GRPO doesn't need a separate critic/value network. Instead, it estimates advantages by comparing responses within the same group.
Standard PPO requires a value function (critic) that estimates the expected return from each state. For LLMs, this means training a second model of similar size to predict "how good is the current position in the reasoning chain?" This doubles memory and compute requirements.
GRPO eliminates the critic entirely. Instead of comparing each response to a learned baseline, it compares responses to each other within a group of G samples for the same problem.
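For each problem, GRPO samples G responses (typically G = 64). Each response gets a reward $r_i$. The advantage of response $i$ is computed relative to the group:

$$A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}$$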
This is simply the z-score of the reward within the group. Responses with above-average rewards get positive advantages (reinforced). Below-average responses get negative advantages (suppressed).
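A small worked example makes the normalization concrete. The sketch below uses only the standard library and assumes the population standard deviation (`statistics.pstdev`); an implementation built on `torch.std` would use the sample standard deviation by default and give slightly smaller magnitudes. The function name and the small `eps` guard are illustrative choices.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Z-score each reward against its own group (no critic involved)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group of 8 responses to one problem: 2 correct, 6 wrong
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

print(round(advantages[0], 3))  # 1.732  (correct response, reinforced)
print(round(advantages[1], 3))  # -0.577 (wrong response, suppressed)

# If all responses agree, nobody stands out and every advantage is zero
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Because only 2 of 8 responses are correct, each correct one receives a large positive advantage while each wrong one receives a small negative one. Rare successes on hard problems are therefore reinforced strongly.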
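The full GRPO loss combines the clipped policy gradient (from PPO) with a KL divergence penalty:

$$\mathcal{L}_{\text{GRPO}}(\theta) = -\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\min\Big(r_t(\theta)\,A_i,\ \operatorname{clip}\big(r_t(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,A_i\Big) + \beta\,D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\text{ref}}\big)$$

Where $r_t(\theta)$ is the probability ratio between the new and old policy at token $t$, $o_i$ is the $i$-th sampled response with $|o_i|$ tokens, $A_i$ is the group-relative advantage, $\varepsilon$ is the clipping parameter (typically 0.2), and $\beta$ controls the KL penalty that prevents the model from deviating too far from the reference policy $\pi_{\text{ref}}$.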
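To see what the clipping does, it helps to evaluate the clipped term for a single token by hand. The sketch below is a plain-Python illustration of the PPO-style `min(...)` expression only (no KL term, no averaging); the function name `clipped_surrogate` is an assumption for the example.

```python
def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-token objective: min(ratio * A, clip(ratio, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)

# Good response (A > 0) whose probability already rose by 50%:
# the gain is capped at 1 + epsilon, so there is no incentive to push further.
print(clipped_surrogate(1.5, +1.0))  # 1.2

# Good response whose probability fell: not clipped, the full penalty applies.
print(clipped_surrogate(0.5, +1.0))  # 0.5

# Bad response (A < 0) whose probability rose: not clipped, full penalty.
print(clipped_surrogate(1.5, -1.0))  # -1.5

# Bad response already pushed down by 50%: capped at 1 - epsilon.
print(clipped_surrogate(0.5, -1.0))  # -0.8
```

The `min` makes the objective pessimistic: clipping only ever removes the reward for moving further in the direction the advantage already favors, never the penalty for moving the wrong way.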
```python
# GRPO step (simplified sketch). `model.generate`, `model.log_prob`, and
# `reward_fn` are placeholder interfaces, not a real library API.
import torch

def grpo_step(model, ref_model, problem, G=64, epsilon=0.2, beta=0.01):
    # Step 1: Sample G responses from the current policy
    responses = [model.generate(problem) for _ in range(G)]

    # Step 2: Compute rewards
    rewards = torch.tensor([reward_fn(r, problem) for r in responses],
                           dtype=torch.float32)

    # Step 3: Group-relative advantages (z-score normalization)
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Step 4: Compute probability ratios. The responses were just sampled from
    # the current policy, so the ratio is 1 here; gradients still flow
    # through log_probs.
    log_probs = torch.stack([model.log_prob(r) for r in responses])
    old_log_probs = log_probs.detach()
    ratios = torch.exp(log_probs - old_log_probs)

    # Step 5: Clipped surrogate objective (same as PPO)
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    policy_loss = -torch.min(surr1, surr2).mean()

    # Step 6: KL penalty to keep the policy close to the reference model
    # (the reference model is frozen, so no gradients are needed for it)
    with torch.no_grad():
        ref_log_probs = torch.stack([ref_model.log_prob(r) for r in responses])
    kl = (log_probs - ref_log_probs).mean()

    # Total loss; the caller is expected to run the optimizer step afterwards
    loss = policy_loss + beta * kl
    loss.backward()
    return loss.item()
```
| Property | PPO | GRPO |
|---|---|---|
| Critic/value network | Required (same size as policy) | Not needed |
| Baseline | Learned value function V(s) | Group mean reward |
| Memory | 2x (policy + critic) | 1x (policy only) |
| Advantage estimation | GAE (temporal) | Z-score within group |
| Samples per problem | 1-4 typically | 64 typically |
| Clipping | Yes (ε) | Yes (ε) |
The most remarkable aspect of R1-Zero is what emerges without being taught. Nobody told the model to double-check its work. Nobody demonstrated self-reflection. Nobody wrote training data showing how to explore alternative approaches. These behaviors appeared spontaneously through RL optimization.
After reaching an answer, R1-Zero learned to pause and verify its own work. This pattern appears in its reasoning traces as phrases like "Wait, let me re-examine this step" or "Let me verify by substituting back." The model literally learned to be its own checker.
```text
R1-Zero response (actual, condensed):
<think>
To solve x² + 5x + 6 = 0, I'll factor.
Looking for two numbers that multiply to 6 and add to 5.
Those are 2 and 3. So (x+2)(x+3) = 0.
Solutions: x = -2, x = -3.

Wait, let me verify.
If x = -2: (-2)² + 5(-2) + 6 = 4 - 10 + 6 = 0 ✓
If x = -3: (-3)² + 5(-3) + 6 = 9 - 15 + 6 = 0 ✓
Both check out.
</think>
Answer: x = -2, x = -3
```
When R1-Zero reaches a dead end or suspects an error, it backtracks and tries a different approach. This is NOT prompted or demonstrated — it emerged because responses that backtrack from errors and find the correct answer get rewarded, while responses that stubbornly continue down wrong paths don't.
```text
R1-Zero response (actual pattern):
<think>
Let me try approach A...
[several steps]
Hmm, this leads to a contradiction. The determinant can't be negative here.
Let me reconsider. Actually, I should use approach B instead...
[different approach, arrives at correct answer]
</think>
```
Perhaps the most striking emergence: R1-Zero's responses get longer over training. The model learns that thinking longer — considering more possibilities, checking more conditions — leads to more correct answers. Average response length grows from ~100 tokens early in training to ~3,000+ tokens at convergence.
Several observations suggest these behaviors are not simply memorized from pre-training data:
1. The base model (DeepSeek-V3) doesn't reliably exhibit these behaviors when prompted. RL is what brings them out.
2. Nothing in the reward asks for phrases like "Wait, let me check...". Only the final answer and the output format are scored, so the phrases persist only because they help produce correct answers.
3. The behaviors appear gradually during RL training, correlating with accuracy improvements, rather than all at once, as memorized behavior would.
The behaviors emerge in stages. Early in training, the model produces short guesses. By mid-training, it shows basic step-by-step reasoning. Late in training, self-verification, backtracking, and extended thinking all appear.
R1-Zero proves that RL-only reasoning works, but it has practical problems: mixed languages in outputs, poor formatting, and difficulty with non-reasoning tasks like creative writing. The production DeepSeek-R1 model uses a four-stage pipeline to fix these issues.
Stage 1: Cold-start SFT. Train the base model on a small set (thousands) of high-quality reasoning examples. This teaches the model the basic output format: use <think> tags, structure reasoning into steps, and state the final answer clearly. The goal is NOT to teach the model how to reason but how to FORMAT its reasoning.
Stage 2: Reasoning-oriented RL. The main training phase. Apply GRPO on math and coding problems with outcome-based rewards, as in R1-Zero, but starting from the cold-started model instead of the raw base model. This stage runs the longest and is responsible for most of the reasoning capability.
Stage 3: Rejection sampling and SFT. After RL training, use the model to generate many solutions for each problem and keep only the correct ones (rejection sampling). Also generate high-quality responses for non-reasoning tasks (writing, summarization, general QA). Fine-tune on this combined dataset to create a well-rounded model.
Stage 4: RL for all scenarios. A final RL stage that optimizes both reasoning accuracy AND general helpfulness/safety. It uses both rule-based rewards (accuracy on math/code) and a learned reward model (for general quality).
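The rejection-sampling step in the third stage reduces to "sample several times, keep what verifies". The sketch below illustrates that idea with a deterministic toy generator standing in for the RL-trained model. The function names, the record layout, and `samples_per_problem=4` are assumptions for the example, not details from the R1 pipeline.

```python
def rejection_sample(problems, generate, is_correct, samples_per_problem=4):
    """Sample several responses per problem and keep only the verified ones."""
    dataset = []
    for problem in problems:
        for _ in range(samples_per_problem):
            response = generate(problem)
            if is_correct(problem, response):
                dataset.append({"prompt": problem["question"], "response": response})
    return dataset

def make_toy_generator():
    """Stand-in for the RL-trained model: every second sample is wrong."""
    calls = 0
    def generate(problem):
        nonlocal calls
        calls += 1
        return problem["answer"] if calls % 2 else "wrong"
    return generate

def exact_match(problem, response):
    return response == problem["answer"]

problems = [
    {"question": "2 + 3", "answer": "5"},
    {"question": "4 + 4", "answer": "8"},
]
sft_data = rejection_sample(problems, make_toy_generator(), exact_match)
print(len(sft_data))  # 4 of the 8 samples survive the filter
```

The surviving records form the reasoning portion of the fine-tuning set. The non-reasoning data mentioned above would be added separately, since it has no automatic correctness check.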
DeepSeek also distilled R1's reasoning ability into smaller models (1.5B to 70B). The process is simple: use R1 to generate reasoning traces, then fine-tune smaller models on those traces. The results are striking — a 32B distilled model outperforms many 70B+ models on math reasoning.
| Model | Size | AIME 2024 | MATH-500 |
|---|---|---|---|
| DeepSeek-R1 | 671B (MoE) | 79.8% | 97.3% |
| R1-Distill-Qwen-32B | 32B | 72.6% | 94.3% |
| R1-Distill-Qwen-14B | 14B | 69.7% | 93.9% |
| R1-Distill-Qwen-7B | 7B | 55.5% | 92.8% |
| R1-Distill-Qwen-1.5B | 1.5B | 28.9% | 83.9% |
DeepSeek-R1 achieves performance competitive with OpenAI's o1 across math, code, and science reasoning — at a fraction of the training cost and fully open-weight.
| Benchmark | DeepSeek-R1 | OpenAI o1 | GPT-4o | Claude 3.5 |
|---|---|---|---|---|
| AIME 2024 | 79.8% | 79.2% | 9.3% | 16.0% |
| MATH-500 | 97.3% | 96.4% | 76.6% | 78.3% |
| Codeforces (Elo rating) | 2029 | 2061 | 808 | 717 |
| GPQA Diamond | 71.5% | 78.0% | 53.6% | 65.0% |
| MMLU | 90.8% | 91.8% | 87.5% | 88.3% |
| LiveCodeBench | 65.9% | 63.4% | 33.4% | 36.3% |
The results are striking. R1 matches or beats o1 on math (AIME, MATH-500) and on code generation (LiveCodeBench), while being open-weight. It trails slightly on competitive programming (Codeforces), science reasoning (GPQA Diamond), and general knowledge (MMLU).
| Model | AIME 2024 | MATH-500 | Formatting | General Tasks |
|---|---|---|---|---|
| R1-Zero | 71.0% | 95.9% | Poor (mixed languages) | Weak |
| R1 | 79.8% | 97.3% | Clean | Strong |
R1-Zero already achieves impressive reasoning (71% AIME), confirming that RL alone can produce strong reasoning. R1 adds ~9 points on AIME through the four-stage pipeline, mainly from better formatting and broader training.
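Because GRPO doesn't require a critic network, R1's RL training avoids the memory and compute of the second, policy-sized model that PPO would need, although sampling a large group of responses per problem adds generation cost. The open-weight release means anyone with sufficient hardware can deploy R1 themselves instead of paying for API access. The distilled models (7B-32B) can run on high-end consumer hardware, typically with quantization at the larger sizes.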
Let's see GRPO in action on a simulated problem. At each training step, the algorithm samples a group of responses, scores them (reward = 1 for a correct answer, 0 for a wrong one), normalizes the rewards into z-scored advantages, and updates the policy.
Over repeated GRPO steps, three trends stand out:
1. The accuracy climbs. The policy's per-problem accuracy starts at the base model's level (~10-20% for hard problems) and climbs as correct responses are reinforced.
2. Response length increases. Longer, more careful responses tend to be correct more often. GRPO reinforces these, so average response length grows — matching the pattern observed in R1-Zero.
3. Variance decreases. Early in training, the group has high reward variance (some correct, many wrong). As the policy improves, more responses in each group are correct, and the advantages become smaller — the model is converging.
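The first and third trends can be reproduced with a toy simulation. The sketch below is a deliberately reduced model, not GRPO as used for R1: the "policy" is a single logit `theta` whose sigmoid is the probability of a correct response, the update is a plain policy-gradient step weighted by group-relative advantages, and clipping, the KL penalty, and response length are all left out. Every name and default value (`group_size=16`, `lr=0.5`, `seed=0`) is an assumption for illustration.

```python
import math
import random
from statistics import mean, pstdev

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def simulate_grpo(initial_accuracy=0.15, group_size=16, steps=50, lr=0.5, seed=0):
    """Track the policy's accuracy over repeated group-relative updates."""
    rng = random.Random(seed)
    theta = math.log(initial_accuracy / (1.0 - initial_accuracy))  # logit
    history = [sigmoid(theta)]
    for _ in range(steps):
        p = sigmoid(theta)
        # Sample a group: reward 1.0 for a correct response, 0.0 otherwise
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(group_size)]
        sigma = pstdev(rewards)
        if sigma > 0:  # an all-right or all-wrong group carries no signal
            mu = mean(rewards)
            advantages = [(r - mu) / sigma for r in rewards]
            # d/d(theta) of log-probability: (1 - p) if correct, -p if wrong
            grads = [(1.0 - p) if r == 1.0 else -p for r in rewards]
            theta += lr * mean([a * g for a, g in zip(advantages, grads)])
        history.append(sigmoid(theta))
    return history

history = simulate_grpo()
print(f"accuracy: start {history[0]:.2f}, end {history[-1]:.2f}")
```

In this toy setting the update for a mixed group works out to `lr * sigma`, the group's reward standard deviation. Accuracy therefore never decreases, and the steps shrink toward zero as groups become uniformly correct, which is the "variance decreases" trend in miniature.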
DeepSeek-R1 sits at the intersection of two major trends: reinforcement learning for LLMs and inference-time compute scaling. It connects forward to DAPO and backward to PPO/DPO.
| Method | Year | Relationship to R1 |
|---|---|---|
| PPO for RLHF | 2022 | Original RL algorithm for LLMs. R1 replaces the critic with group comparison. |
| DPO | 2023 | Simplifies RLHF by skipping RL. R1 shows that online RL with verifiable rewards is highly effective for reasoning. |
| Self-Consistency | 2023 | Inference-time voting. R1 makes the base model stronger so fewer samples are needed. |
| Process Reward Models | 2023 | Step-level rewards. R1 achieves similar results with outcome-only rewards. |
| OpenAI o1 | 2024 | Proprietary reasoning model. R1 is an open-weight model that reaches comparable reasoning performance. |
| DAPO | 2025 | Fixes GRPO's remaining issues: clip bias, entropy collapse, length bias. |
RL alone is enough for reasoning. This was the most surprising finding. The conventional wisdom that SFT is a prerequisite for reasoning was wrong.
Simplicity scales. GRPO is simpler than PPO (no critic), uses simpler rewards (binary accuracy), and achieves comparable results to much more complex systems.
Open-weight release. By releasing model weights, DeepSeek enabled the entire research community to build on R1. This accelerated the field significantly.
Language mixing in R1-Zero. Without SFT, the model produces reasoning in unpredictable languages. This is a practical limitation of pure RL.
No process rewards. R1 uses only outcome rewards. Adding step-level rewards (as in process reward models) could improve efficiency — rewarding good intermediate steps instead of only complete solutions.
Reward hacking. With long training, the model can learn to game the reward function (e.g., exploiting edge cases in test suites for code problems). DAPO addresses some of these issues.
DAPO: Fixes GRPO's clipping bias and entropy collapse.
Self-Consistency: Inference-time voting that complements R1's strong base reasoning.
Let's Verify Step by Step: Process reward models that R1 showed aren't strictly necessary.
Scaling Test-Time Compute: R1's extended thinking is a form of test-time compute scaling.