DeepSeekMath — Veanors

Chapter 0: The Problem

It is early 2024. GPT-4 can solve competition-level math problems at roughly 53% accuracy on the MATH benchmark. Gemini Ultra hovers around the same mark. These models have hundreds of billions of parameters, were trained on trillions of tokens of undisclosed data, and cost enormous sums to build.

Meanwhile, the best open-source models are far behind. Llemma 34B, a prominent math-specific open model, manages only 25.3% on MATH. Even the strongest general-purpose open models, such as Qwen 72B, top out around 35% (Mistral 7B scores 14.3%). The gap is large: closed-source models lead open-source by more than 15 percentage points on the hardest math benchmark.

Why? Two bottlenecks:

Data. Mathematical content is rare on the internet. OpenWebMath, the best publicly available math corpus, contains only 13.6B tokens. Minerva used ~17B tokens of math data. That is simply not enough for a small model to learn the deep structure of mathematical reasoning.
Post-training. Supervised fine-tuning teaches a model to mimic solutions it has seen. But math requires reasoning: generating novel chains of logic, not just pattern-matching. Reinforcement learning (RL) can push models beyond imitation, but the dominant RL algorithm for LLMs, PPO, requires keeping four models in memory at once (policy, reference, reward, and critic), two of which (policy and critic) are trained. That is prohibitively expensive for most teams.

The question DeepSeekMath asks: Can we close the gap with GPT-4 on math using a 7 billion parameter model? That would mean roughly doubling the MATH accuracy of Llemma 34B (about 5x larger) and clearly surpassing Minerva 540B (77x larger), purely through better data and a smarter training algorithm.

The Math Reasoning Gap (Early 2024)

MATH benchmark accuracy. Open-source models are clustered far below the closed-source frontier. DeepSeekMath-RL closes this gap from 7B parameters.

What are the two main bottlenecks preventing open-source models from matching GPT-4 on math reasoning?

Insufficient high-quality math training data, and the prohibitive cost of RL post-training (PPO requires 4 models)
Open-source models have fewer layers
Mathematical reasoning requires chain-of-thought prompting which open models do not support

Chapter 1: The Key Insight

DeepSeekMath's strategy has two pillars, each addressing one bottleneck.

Pillar 1: Data at Scale

Build a 120B-token math corpus from Common Crawl using an iterative data selection pipeline. This is 9x larger than OpenWebMath and 7x larger than Minerva's math data. Start with a seed corpus, train a classifier, mine more data, annotate, retrain, repeat.

↓

Pillar 2: GRPO

Replace PPO with Group Relative Policy Optimization. The key trick: instead of training a separate value (critic) network to estimate baselines, sample a GROUP of responses for each question and compute advantages by comparing rewards within the group. No critic means one fewer policy-sized network to store and train.

The combination is surprisingly powerful. A 7B model, continued pre-trained on the DeepSeekMath Corpus and then fine-tuned with GRPO, reaches 51.7% on MATH — within 1-2 points of GPT-4 and Gemini Ultra.

Why these two pillars multiply: Data alone gets you to 36.2% (base model). SFT on top pushes to 46.8%. But GRPO adds another 5 points to 51.7% — and crucially, it improves even on benchmarks the RL training data never touched (out-of-domain transfer). The data gives the model knowledge; GRPO teaches it to reason more carefully with that knowledge.

A third surprising finding: code training before math training helps. DeepSeekMath starts from DeepSeek-Coder-Base, not a general LLM. The authors show experimentally that code pre-training improves mathematical reasoning, both with and without tool use. This is partial evidence for the long-standing hypothesis that code training improves general reasoning ability.

What is GRPO's core trick for eliminating the critic network?

Sample a group of responses per question and estimate advantages by comparing rewards within the group, rather than learning a value function
Use a smaller critic network with fewer layers
Set all advantages to 1 and skip value estimation

Chapter 2: The Data Pipeline

How do you find 120 billion tokens of math on the internet? Common Crawl contains 40 billion HTML pages after deduplication. The vast majority are not mathematical. The challenge is finding the needles in an ocean of haystacks.

The iterative selection pipeline

The pipeline starts with a seed corpus — OpenWebMath, a curated collection of 13.6B math tokens. From this seed, the process iterates:

Step 1: Train a Classifier

Train a fastText model on 500K positive examples (from the seed) and 500K random web pages (negatives). This binary classifier learns to distinguish "math-like" pages from everything else.

↓

Step 2: Mine Common Crawl

Run the classifier over all 40B deduplicated pages. Rank by predicted score. Keep the top-scoring pages. First iteration: top 40B tokens.

↓

Step 3: Discover New Domains

Group all pages by base URL domain. If >10% of a domain's pages were selected (e.g., mathoverflow.net), flag the entire domain as math-related.

↓

Step 4: Annotate and Expand

Humans annotate which URL paths within flagged domains contain math (e.g., mathoverflow.net/questions but not mathoverflow.net/users). Add uncollected pages from these paths to the seed. Retrain the classifier.

↻ repeat (4 iterations total)

After four iterations, the pipeline converges: 98% of data found in iteration 4 was already found in iteration 3. The final corpus: 35.5 million web pages, 120 billion tokens.

Why iterative? The first classifier, trained only on OpenWebMath, has limited diversity. It knows what math.stackexchange looks like, but misses math on e-commerce sites, government databases, or foreign-language forums. Each iteration expands the seed with newly discovered domains, making the next classifier broader and more accurate.
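To make Step 3 of the pipeline concrete, here is a minimal sketch of the domain-discovery rule (flag a domain when more than 10% of its pages were selected). The function name `flag_math_domains`, the `(url, was_selected)` input format, and the toy URLs are assumptions for illustration; the paper does not publish this code.

```python
from collections import defaultdict
from urllib.parse import urlparse


def flag_math_domains(pages, threshold=0.10):
    """Return domains where more than `threshold` of pages were selected.

    `pages` is an iterable of (url, was_selected) pairs, where was_selected
    is True if the classifier kept the page in the current iteration.
    """
    total = defaultdict(int)
    selected = defaultdict(int)
    for url, was_selected in pages:
        domain = urlparse(url).netloc
        total[domain] += 1
        if was_selected:
            selected[domain] += 1
    return {d for d in total if selected[d] / total[d] > threshold}


# Toy crawl (hypothetical URLs): did the classifier keep each page?
toy_pages = [
    ("https://mathoverflow.net/questions/1", True),
    ("https://mathoverflow.net/questions/2", True),
    ("https://mathoverflow.net/users/9", False),
    ("https://example-shop.com/item/1", False),
    ("https://example-shop.com/item/2", False),
]
flagged = flag_math_domains(toy_pages)
print(flagged)  # {'mathoverflow.net'}
```

Flagged domains would then go to human annotators (Step 4), who decide which URL paths to add to the seed before the classifier is retrained.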

Quality validation

The authors validate quality by training a 1.3B model on different math corpora for 150B tokens each:

Corpus	Size	GSM8K	MATH
No math training	-	2.9%	3.0%
MathPile (mostly arXiv)	8.9B	2.7%	3.3%
OpenWebMath	13.6B	11.5%	8.9%
Proof-Pile-2	51.9B	14.3%	11.2%
DeepSeekMath Corpus	120.2B	23.8%	13.6%

A counterintuitive finding: ArXiv papers are ineffective for math reasoning. MathPile (85% arXiv) barely improves over no math training. The authors tested both cleaned and raw arXiv corpora on 1.3B and 7B models — no notable improvement on any benchmark. Web-mined math problems and solutions are far more useful than formal mathematical papers.

Decontamination

Any web page containing a 10-gram that exactly matches a substring of GSM8K, MATH, or the Chinese math benchmarks is removed. For benchmark texts shorter than 10 grams but at least 3 grams long, exact matching of the text is used. This helps ensure benchmark scores reflect genuine reasoning, not memorization.
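The decontamination rule can be sketched as a set intersection over n-grams. This is an illustration only: whitespace tokenization, lowercasing, and the names `ngrams` and `is_contaminated` are assumptions, and the benchmark strings below are made up, not taken from GSM8K or MATH.

```python
def ngrams(tokens, n):
    """Return the set of all contiguous n-token tuples in `tokens`."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def is_contaminated(page_text, benchmark_texts, n=10, min_n=3):
    """True if the page shares an exact n-gram with any benchmark text.

    Benchmark texts shorter than n tokens (but at least min_n) are matched
    as one whole contiguous token sequence instead.
    """
    page_tokens = page_text.lower().split()  # whitespace tokens: an assumption
    for text in benchmark_texts:
        bench_tokens = text.lower().split()
        if len(bench_tokens) < min_n:
            continue  # too short to match reliably
        k = min(n, len(bench_tokens))  # whole text if shorter than n
        if ngrams(page_tokens, k) & ngrams(bench_tokens, k):
            return True
    return False


# Made-up benchmark items for illustration.
benchmark = [
    "a train travels 60 miles per hour for 3 hours how far does it go",
    "what is 2+2",
]
print(is_contaminated("quiz: a train travels 60 miles per hour for 3 hours how far does it go", benchmark))  # True
print(is_contaminated("the train station opened in 1902", benchmark))  # False
```

At Common Crawl scale one would hash the benchmark n-grams once and stream pages past them, but the matching rule stays the same.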

Why does the data pipeline iterate multiple times rather than training a single classifier?

Each iteration discovers new math-related domains that the previous classifier missed, expanding diversity and coverage
To increase the training data for the classifier itself
Because fastText models need multiple training passes

Chapter 3: Continued Pre-training

With 120B math tokens in hand, how do you use them? The answer is continued pre-training: take a model that has already learned general language and code, then keep training it on a math-heavy mixture.

Starting point: DeepSeek-Coder-Base-v1.5 7B

A crucial decision: the base model is a code model, not a general-purpose LLM. The authors experimentally verified that code training prior to math training significantly improves mathematical reasoning:

Training Order	GSM8K	MATH
General (400B) then Math (150B)	19.1%	14.4%
Code (400B) then Math (150B)	21.9%	15.3%
Math only (150B)	20.5%	13.1%

Why does code help math? This is a partial answer to a long-standing question. The authors hypothesize that code training develops structured reasoning: decomposing problems into steps, maintaining logical state, applying conditional logic. These are exactly the skills needed for mathematical chain-of-thought reasoning, even when no code is generated.

The training data mix

DeepSeekMath-Base 7B is trained for 500 billion tokens with this distribution:

Pre-training Data Mixture

The training mix for continued pre-training. Math dominates, but code and natural language prevent catastrophic forgetting.

The mix is intentional. Including 24% code tokens (GitHub + AlgebraicStack) preserves the coding ability inherited from DeepSeek-Coder. The 10% natural language from Common Crawl maintains general language understanding. The remaining 66% is math: 56% DeepSeekMath Corpus + 10% arXiv (the arXiv is included despite being individually ineffective, as it may help when mixed with web data).
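A quick back-of-the-envelope calculation turns the percentages above into token budgets. The variable names are illustrative, and the "passes" figure assumes the math share is drawn evenly from the 120.2B-token corpus, which the text does not state.

```python
TOTAL_TOKENS_B = 500  # total continued pre-training budget, in billions of tokens

# Mixture percentages as given in the text (integers keep the arithmetic exact).
MIXTURE_PCT = {
    "DeepSeekMath Corpus": 56,
    "arXiv": 10,
    "code (GitHub + AlgebraicStack)": 24,
    "natural language (Common Crawl)": 10,
}

budget_b = {name: TOTAL_TOKENS_B * pct / 100 for name, pct in MIXTURE_PCT.items()}
for name, tokens in budget_b.items():
    print(f"{name}: {tokens:.0f}B tokens")

# The math corpus holds about 120B tokens, so a 280B-token share implies repeats.
passes_over_math_corpus = budget_b["DeepSeekMath Corpus"] / 120.2
print(f"passes over the math corpus: {passes_over_math_corpus:.1f}")  # 2.3
```

Under that assumption, the model sees each math page a little more than twice during continued pre-training.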

Results: DeepSeekMath-Base 7B

Model	Size	GSM8K	MATH
Minerva	540B	58.8%	33.6%
Llemma	34B	54.0%	25.3%
Mistral	7B	40.3%	14.3%
DeepSeekMath-Base	7B	64.2%	36.2%

A 7B model beats a 540B model. DeepSeekMath-Base 7B outperforms Minerva 540B (which is 77x larger) on both GSM8K and MATH. This is compelling evidence that model size is not the only key factor — high-quality, domain-specific data at scale can compensate for 77x fewer parameters.

Why does DeepSeekMath start from a code-trained base model rather than a general-purpose LLM?

Code training develops structured reasoning skills (decomposition, logic, state tracking) that transfer to mathematical problem-solving
Code models are smaller and faster to train
Math problems are always solved using Python code

Chapter 4: The GRPO Algorithm

This is the core technical contribution. Group Relative Policy Optimization (GRPO) is a variant of PPO that eliminates the critic (value) network by estimating advantages from group comparisons.

The problem with PPO for LLMs

In PPO, the advantage A_t measures "how much better was this action than expected?" Computing it requires a value function V_ψ — a separate neural network (the "critic") that estimates the expected future reward from each state. For LLMs, this means training a model of comparable size alongside the policy. Four models must be loaded simultaneously: policy, reference, reward model, and critic.

There is a deeper problem. In the LLM setting, the reward model only assigns a score to the complete output (did the final answer to the math problem match?). But the value function must estimate reward at every token. Training an accurate per-token value function when only a single end-of-sequence reward exists is unreliable.

GRPO's solution: group-relative advantages

Instead of learning a value function, GRPO does something elegantly simple. For each question q:

Sample a group. Generate G outputs {o₁, o₂, ..., o_G} from the current policy.
Score them. Use the reward model to get G rewards {r₁, r₂, ..., r_G}.
Normalize within the group. For each output i, the advantage is:

Â_i = (r_i − mean(r)) / std(r)

That's it. If your reward is above the group average, your advantage is positive (reinforce this behavior). If below average, your advantage is negative (suppress this behavior). The mean and standard deviation of the group are the baseline — no learned value function needed.

The full GRPO objective

J_GRPO(θ) = E_{q, {o_i}} [ (1/G) ∑_{i=1}^{G} (1/|o_i|) ∑_{t=1}^{|o_i|} ( min( r_t(θ) Â_{i,t}, clip(r_t(θ), 1−ε, 1+ε) Â_{i,t} ) − β D_KL(π_θ || π_ref) ) ]

Where r_t(θ) = π_θ(o_t|q, o_<t) / π_{θ_old}(o_t|q, o_<t) is the familiar PPO probability ratio, and the min/clip mechanism is identical to PPO's clipped surrogate. The key difference is how Â_i,t is computed.
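As a rough illustration of the objective above, the sketch below evaluates it for one group in plain Python. The function name `grpo_objective`, the list-of-lists layout for log-probabilities, and the clip range `eps=0.2` are assumptions made for this example; `beta=0.04` is the KL coefficient reported for DeepSeekMath-RL. A real implementation would operate on batched tensors and backpropagate through the result.

```python
import math


def grpo_objective(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.04):
    """Clipped GRPO objective for one group of sampled outputs.

    logp_new, logp_old, logp_ref: one list of per-token log-probabilities
    per output, under the current, sampling, and reference policies.
    advantages: one scalar per output, shared by all of its tokens.
    """
    total = 0.0
    for new, old, ref, adv in zip(logp_new, logp_old, logp_ref, advantages):
        per_output = 0.0
        for lp_new, lp_old, lp_ref in zip(new, old, ref):
            ratio = math.exp(lp_new - lp_old)                 # PPO probability ratio
            clipped = max(1.0 - eps, min(1.0 + eps, ratio))   # clip(ratio, 1-eps, 1+eps)
            surrogate = min(ratio * adv, clipped * adv)
            log_r = lp_ref - lp_new                           # log(pi_ref / pi_theta)
            kl = math.exp(log_r) - log_r - 1.0                # per-token KL estimate
            per_output += surrogate - beta * kl
        total += per_output / len(new)                        # average over tokens
    return total / len(logp_new)                              # average over the group


# Before any update the three policies agree, so ratio = 1 and KL = 0:
lp = [[-1.0, -2.0], [-0.5]]
print(grpo_objective(lp, lp, lp, advantages=[1.0, -1.0]))  # 0.0
```

When the policies agree, the objective reduces to the mean advantage, which is zero for group-normalized rewards; the learning signal comes from the gradient, which pushes up tokens of above-average outputs and pushes down the rest.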

Worked example with actual numbers. Question: "What is 17 + 38?" We sample G=5 responses:
o₁: "55" (correct) → r₁ = 1.0
o₂: "45" (wrong) → r₂ = 0.0
o₃: "55" (correct) → r₃ = 1.0
o₄: "65" (wrong) → r₄ = 0.0
o₅: "55" (correct) → r₅ = 1.0
mean(r) = 0.6, std(r) = 0.49
Â₁ = (1.0 − 0.6) / 0.49 = +0.82 (reinforce: this was above average)
Â₂ = (0.0 − 0.6) / 0.49 = −1.22 (suppress: this was below average)
The policy gets a positive gradient for the correct responses and a negative gradient for the wrong ones. No critic needed.
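The worked example can be checked in a few lines of Python. This is a minimal sketch: the function name and the small `eps` added to the denominator (to avoid division by zero when every response gets the same reward) are assumptions, and the population standard deviation is used because it reproduces the 0.49 above.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, matching the 0.49 above
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.0, 1.0, 0.0, 1.0]  # the five responses to "What is 17 + 38?"
advantages = group_relative_advantages(rewards)
print([round(a, 2) for a in advantages])  # [0.82, -1.22, 0.82, -1.22, 0.82]
```

Note the edge case: if all G responses are correct (or all wrong), every advantage is zero and that question contributes no gradient.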

The KL regularization

Unlike PPO (which adds KL penalty per-token inside the reward), GRPO adds KL divergence directly to the loss. This is cleaner because it doesn't complicate the advantage calculation. The KL is estimated with an unbiased estimator:

D_KL(π_θ || π_ref) = π_ref(o_t) / π_θ(o_t) − log(π_ref(o_t) / π_θ(o_t)) − 1

This is guaranteed non-negative and avoids the instability of naive KL estimation.
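A small sketch of this per-token estimator, working from log-probabilities. The function name `kl_estimate` and the example probabilities are illustrative assumptions.

```python
import math


def kl_estimate(logp_theta, logp_ref):
    """Per-token estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta."""
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


# Same probability under both models: the estimate is exactly zero.
print(kl_estimate(math.log(0.5), math.log(0.5)))  # 0.0
# A mismatch in either direction gives a positive penalty.
print(kl_estimate(math.log(0.5), math.log(0.25)))
print(kl_estimate(math.log(0.25), math.log(0.5)))
```

Non-negativity follows from x − log x − 1 ≥ 0 for every x > 0, with equality only at x = 1.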

GRPO: Group Relative Advantages

Interactive figure: G responses are sampled for a math question and each gets a reward (correct/wrong). Advantages are computed relative to the group mean. Green = reinforce, red = suppress.

In GRPO, the advantage for output o_i is computed as (r_i − mean(r)) / std(r). What does a negative advantage signal?

This output scored below the group average, so the policy gradient will decrease its probability
The reward model is poorly calibrated
The question was too easy

Chapter 5: GRPO vs PPO

Let's make the comparison concrete. Both algorithms optimize the same clipped surrogate objective. The difference is entirely in how they compute advantages.

PPO's approach: learned value function

PPO uses Generalized Advantage Estimation (GAE):

A_t = δ_t + (γλ) δ_{t+1} + (γλ)² δ_{t+2} + ...

δ_t = r_t + γ V_ψ(s_{t+1}) − V_ψ(s_t)

This requires a learned value function V_ψ — another neural network of comparable size to the policy. It must be accurate at every token position, which is hard when only the final token receives a reward from the reward model.
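For comparison, GAE can be sketched as a backward pass over one trajectory. The function name, the default `gamma` and `lam`, and the toy critic values are assumptions for illustration; the point is that every advantage depends on the critic's estimates V(s_t).

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards[t] is the reward at step t; values[t] is the critic's V(s_t).
    The value after the final step is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running              # discounted sum of residuals
        advantages[t] = running
    return advantages


# Sparse LLM-style reward: only the last token is scored.
rewards = [0.0, 0.0, 0.0, 1.0]
critic_values = [0.5, 0.5, 0.5, 0.5]  # hypothetical critic outputs
print(gae_advantages(rewards, critic_values, gamma=1.0, lam=1.0))  # [0.5, 0.5, 0.5, 0.5]
```

With γ = λ = 1 each advantage is simply the final reward minus V(s_t), so any error in the critic's per-token estimate flows directly into the policy gradient.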

GRPO's approach: group statistics

Â_i,t = (r_i − mean(r)) / std(r) for all tokens t in output i

Every token in an output gets the same advantage — the normalized reward of the entire output. No learned function, no training overhead, no approximation error from a poorly-trained critic.

Memory comparison

For a 7B parameter model, here is what must be loaded into GPU memory:

Component	PPO	GRPO
Policy model π_θ	7B	7B
Reference model π_ref	7B	7B
Reward model r_φ	7B	7B
Critic (value) model V_ψ	7B	0
Total parameters	28B	21B
Memory savings	-	~25%

The memory math. With a 7B policy, PPO needs ~28B parameters loaded (4 models). GRPO needs ~21B (3 models). That is a 25% reduction in model parameters. But the savings are actually larger in practice: the critic also requires optimizer states (Adam momentum, variance) which roughly double its memory footprint. The effective memory savings are closer to 33-40%, making RL feasible on hardware that could not run PPO.
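The range of savings quoted above can be reproduced with a simple model. This is a hedged back-of-the-envelope sketch: memory is counted in units of "one copy of 7B weights", and `train_multiplier` (how many weight-sized copies a trainable model needs once gradients and optimizer states are included) is an assumed knob, not a measured figure. Activations and precision differences are ignored.

```python
def relative_memory(n_trainable, n_frozen, train_multiplier):
    """Memory in units of one weight copy: trained models cost more than frozen ones."""
    return n_trainable * train_multiplier + n_frozen


def grpo_saving(train_multiplier):
    """Fraction of PPO's model memory that GRPO avoids by dropping the critic."""
    ppo = relative_memory(n_trainable=2, n_frozen=2, train_multiplier=train_multiplier)   # policy + critic trained
    grpo = relative_memory(n_trainable=1, n_frozen=2, train_multiplier=train_multiplier)  # only the policy trained
    return 1 - grpo / ppo


for mult in (1, 2, 4):
    print(f"training cost {mult}x weights: GRPO saves {grpo_saving(mult):.0%}")
# training cost 1x weights: GRPO saves 25%
# training cost 2x weights: GRPO saves 33%
# training cost 4x weights: GRPO saves 40%
```

Counting parameters only gives 25%; the more a trained model costs relative to a frozen one, the closer the saving moves toward 40% and beyond.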

PPO vs GRPO: Memory Layout

Visual comparison of GPU memory allocation. PPO loads 4 models; GRPO loads 3. The freed memory can be used for larger batch sizes, enabling more stable training.

What does GRPO lose?

The group-relative advantage assigns the same value to every token in an output. This means GRPO cannot distinguish which specific tokens contributed to a correct or incorrect answer — it only knows whether the entire response was good or bad. PPO's per-token advantages from GAE can, in principle, provide finer-grained credit assignment.

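In practice, this coarse signal works well for math: with outcome rewards, a chain of thought either leads to the right answer or it doesn't, and the binary signal (correct/incorrect) is enough to produce the gains reported in Chapter 7. When finer-grained credit is wanted, the paper also describes a process-supervision variant of GRPO that rewards individual reasoning steps.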

What is the main limitation of GRPO's advantage estimation compared to PPO's GAE-based approach?

GRPO assigns the same advantage to every token in a response, so it cannot identify which specific tokens were responsible for success or failure
GRPO cannot use a clipped surrogate objective
GRPO requires more GPU memory than PPO

Chapter 6: The SFT + RL Pipeline

DeepSeekMath's post-training follows a two-stage pipeline: supervised fine-tuning (SFT) first, then reinforcement learning (GRPO). The order matters.

Stage 1: Supervised Fine-Tuning

The base model (DeepSeekMath-Base 7B) knows math but cannot follow instructions. SFT teaches it the format: given a problem, produce a step-by-step solution.

The SFT data: 776K examples covering English and Chinese math, with solutions in three formats:

Chain-of-thought (CoT): step-by-step natural language reasoning
Program-of-thought (PoT): solutions written as Python programs
Tool-integrated reasoning: mixing natural language with Python code execution

Training: 500 steps, batch size 256, constant learning rate 5e-5, max context length 4K tokens. The result: DeepSeekMath-Instruct 7B at 46.8% MATH.

Stage 2: GRPO Reinforcement Learning

GRPO starts from DeepSeekMath-Instruct 7B. The reward model is built on top of DeepSeekMath-Base 7B and trained to score solutions. The RL training data is a subset of the SFT data: only chain-of-thought questions from GSM8K and MATH, about 144K questions total.

Why only CoT data for RL? The authors deliberately exclude tool-use and program-of-thought questions from the RL phase. This lets them test whether RL transfers to tasks it was never trained on. Spoiler: it does. GRPO improves performance even on Chinese math benchmarks (CMATH: 84.6% → 88.8%) and tool-integrated reasoning (MATH+Python: 57.4% → 58.8%) despite never seeing those formats during RL training.

GRPO hyperparameters

Hyperparameter	Value
Group size G	64 outputs per question
Learning rate	1e-6
KL coefficient β	0.04
Max output length	1024 tokens
Batch size	1024
Updates per exploration	1

The full pipeline

DeepSeek-Coder-Base-v1.5 7B

Code pre-trained, 7B parameters

↓ 500B tokens (56% math, 24% code, 10% NL, 10% arXiv)

DeepSeekMath-Base 7B

36.2% MATH (base, few-shot)

↓ SFT on 776K examples

DeepSeekMath-Instruct 7B

46.8% MATH (instruction-tuned)

↓ GRPO on 144K CoT questions

DeepSeekMath-RL 7B

51.7% MATH (RL-optimized)

Why does the RL stage use ONLY chain-of-thought questions from GSM8K and MATH, excluding tool-use data?

To test whether RL improvements transfer to tasks the model was never RL-trained on (out-of-domain generalization)
Tool-use data is too noisy for RL training
The reward model cannot score tool-use outputs

Chapter 7: Results

DeepSeekMath-RL 7B sets a new high-water mark for open-source math reasoning. Let's look at the full picture.

Competition-level MATH benchmark

Model	Size	MATH
Gemini Ultra	-	53.2%
GPT-4	-	52.9%
DeepSeekMath-RL	7B	51.7%
Baichuan-3	-	49.2%
GLM-4	-	47.9%
InternLM2-Math	20B	37.7%
Qwen	72B	35.2%
WizardMath-v1.1	7B	33.0%

DeepSeekMath-RL at 7B outperforms every open-source model in the table, from 7B to 72B, and surpasses most closed-source models. It is within 1.2 points of GPT-4 and 1.5 points of Gemini Ultra.

Self-consistency pushes further

Sample 64 solutions and take the majority answer: 60.9% on MATH. This was the highest score any open-source model had achieved at the time.
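Majority voting itself is simple to sketch. The function name and the convention that unparsable samples are represented as `None` are assumptions; extracting the final answer from each sampled solution is assumed to happen upstream.

```python
from collections import Counter


def majority_answer(answers):
    """Return the most frequent final answer among sampled solutions."""
    counts = Counter(a for a in answers if a is not None)  # skip unparsable samples
    if not counts:
        return None
    return counts.most_common(1)[0][0]


sampled = ["55", "45", "55", None, "55", "65"]  # final answers parsed from 6 samples
print(majority_answer(sampled))  # 55
```

The method helps when wrong answers are scattered across many values while correct reasoning paths converge on the same one.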

Out-of-domain transfer

Remember, GRPO was trained only on English CoT data from GSM8K and MATH. Yet it improves everywhere:

Benchmark	SFT Only	+ GRPO	Change
GSM8K (in-domain)	82.9%	88.2%	+5.3
MATH (in-domain)	46.8%	51.7%	+4.9
MGSM-zh (Chinese)	73.2%	79.6%	+6.4
CMATH (Chinese)	84.6%	88.8%	+4.2
MATH+Python (tool-use)	57.4%	58.8%	+1.4

The RL transfer finding is remarkable. GRPO was trained on English chain-of-thought reasoning only, yet it improved Chinese math (+6.4 MGSM-zh) and tool-integrated reasoning (+1.4 MATH+Python). This suggests RL doesn't just memorize solutions — it teaches the model to reason more carefully, and this improved reasoning generalizes across languages and output formats.

Impact of Each Training Stage

MATH benchmark accuracy at each stage: Base (few-shot), SFT (instruction-tuned), RL (GRPO-optimized). Each stage adds a meaningful jump.

DeepSeekMath-RL was trained with GRPO only on English CoT questions. What happened on Chinese math benchmarks it never saw during RL?

Performance improved significantly (e.g., CMATH 84.6% to 88.8%), demonstrating that RL-trained reasoning generalizes across languages
Performance stayed the same
Performance degraded due to catastrophic forgetting

Chapter 8: Analysis

The paper provides deep analysis through a unified paradigm and ablation experiments. Here are the most important findings.

The unified paradigm

The authors show that SFT, Rejection Sampling Fine-Tuning (RFT), DPO, PPO, and GRPO are all instances of the same general framework. Every method's gradient has the form:

∇_θ J(θ) = E_{(q,o) ~ D}[1/|o| ∑_t GC(q, o, t, π_rf) ∇_θ log π_θ(o_t|q, o_<t)]

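Here GC is the gradient coefficient: the weight applied to each token's log-likelihood gradient, which depends on the reward function π_rf. The methods differ in where the data comes from, what reward signal is used, and whether sampling is online: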

Method	Data Source D	Reward	Online?
SFT	Human-curated	None (GC=1)	No
RFT	SFT model samples	Rule-based	No
DPO	SFT model pairs	Rule-based	No
Online RFT	Current policy samples	Rule-based	Yes
PPO	Current policy samples	Learned model	Yes
GRPO	Current policy groups	Learned model	Yes

The key distinction: online vs offline. SFT, RFT, and DPO train on a fixed dataset (offline): human-curated solutions for SFT, or samples drawn once from the SFT model for RFT and DPO. Online RFT, PPO, and GRPO train on data from the current policy (online). In the paper's experiments, online methods come out ahead because they explore the current model's frontier: they find and correct mistakes the current model makes, not mistakes an earlier model made.

Online beats offline

A direct comparison: Online RFT achieves 50.2% on MATH vs offline RFT at 47.3%. GRPO with outcome supervision reaches 51.7%. The online signal from the evolving policy is consistently worth 3-5 percentage points.

Outcome vs process supervision

Outcome supervision: a single reward at the end (was the final answer correct?). Process supervision: a reward at each reasoning step (was this step logically sound?). In the paper's ablation, GRPO with process supervision outperforms GRPO with outcome supervision, which suggests that finer-grained, step-aware signals help. Outcome supervision remains attractive because it is simpler: it needs only one score per output.

Iterative RL

As the policy improves, the reward model can fall behind — it was trained on outputs from the original SFT model, which now looks different from the improving RL policy. Iterative GRPO addresses this by periodically re-generating training data for the reward model using the current policy, and continuing to train the reward model with a replay mechanism (10% historical data to prevent forgetting). This yields further improvements.

Why does RL work beyond SFT?

The authors' analysis points to a key insight: SFT treats all training tokens equally (gradient coefficient = 1), while RL assigns different weights based on quality. RL amplifies good reasoning patterns and suppresses bad ones, rather than blindly imitating everything in the training data. This is especially important when training data contains multiple solution paths of varying quality.
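The contrast between methods can be made concrete by computing the per-sample weight each one would assign to the same five responses. This is a simplified sketch: the function names are hypothetical, rewards are binary, and the GRPO coefficient omits its KL term.

```python
from statistics import mean, pstdev


def gc_sft(rewards):
    """SFT: every sample is imitated with weight 1, whatever its quality."""
    return [1.0 for _ in rewards]


def gc_rft(rewards):
    """RFT / Online RFT: weight 1 for correct samples, 0 for incorrect ones."""
    return [1.0 if r > 0 else 0.0 for r in rewards]


def gc_grpo(rewards, eps=1e-8):
    """GRPO (KL term omitted): group-normalized reward, which can be negative."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


rewards = [1.0, 0.0, 1.0, 0.0, 1.0]
print(gc_sft(rewards))                          # [1.0, 1.0, 1.0, 1.0, 1.0]
print(gc_rft(rewards))                          # [1.0, 0.0, 1.0, 0.0, 1.0]
print([round(w, 2) for w in gc_grpo(rewards)])  # [0.82, -1.22, 0.82, -1.22, 0.82]
```

Only the GRPO-style coefficient goes negative: rejection sampling ignores wrong answers, while GRPO actively pushes their probability down.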

What is the key difference between online RL methods (PPO, GRPO) and offline methods (RFT, DPO)?

Online methods train on data from the current (evolving) policy, finding and correcting mistakes the model makes now, while offline methods train on data from a fixed earlier model
Online methods use more GPU memory
Offline methods cannot use reward models

Chapter 9: Connections

What DeepSeekMath built on

PPO (Schulman et al., 2017): GRPO inherits PPO's clipped surrogate objective and probability ratio framework, replacing only the advantage estimation with group-relative computation.

RLHF (Ouyang et al., 2022): The paradigm of SFT followed by RL from a reward model. DeepSeekMath applies this to mathematical reasoning rather than helpfulness/harmlessness alignment.

OpenWebMath (Paster et al., 2023): The seed corpus that bootstraps the iterative data pipeline. Without this starting point, the fastText classifier could not begin.

Rejection Sampling Fine-Tuning (Yuan et al., 2023): The offline precursor to GRPO. RFT samples multiple solutions, keeps the correct ones, and fine-tunes on them. GRPO goes further by using normalized advantages (not just accept/reject) and training online.

What DeepSeekMath enabled

DeepSeek-V3 (2024): The DeepSeek team's flagship 671B MoE model uses GRPO as part of its post-training pipeline. The algorithm scales from 7B to hundreds of billions of parameters.

DeepSeek-R1 (2025): The reasoning-specialized model that achieves state-of-the-art on math and code uses GRPO as its RL algorithm, demonstrating that group-relative advantages scale to frontier reasoning.

DAPO (2025): Decoupled Clip and Dynamic sAmpling Policy Optimization, a successor that builds on GRPO's group-relative approach with changes such as decoupled clipping ranges and dynamic sampling, aimed at long chain-of-thought reasoning RL.

Kimi K2 (Moonshot AI, 2025): Uses GRPO-derived techniques for its agentic RL pipeline, training tool use and multi-step reasoning at trillion-parameter scale.

GRPO's legacy. What started as a memory optimization for 7B math models became a widely adopted RL algorithm for reasoning-focused language models. DeepSeek-R1, one of the most capable open reasoning models at its release, is trained with GRPO. The insight that group-relative advantages can replace learned value functions turned out to be more than a memory trick: it is a practical recipe for RL on language models.

Cheat sheet

Core contribution

GRPO: eliminate the critic by computing advantages from group statistics. Â_i = (r_i − mean(r)) / std(r)

Data contribution

120B math tokens from Common Crawl via iterative classifier pipeline (4 rounds, fastText)

Key result

51.7% MATH (top-1) and 60.9% (maj@64) from a 7B model, approaching GPT-4

Surprising finding

Code pre-training helps math. ArXiv papers do not. RL transfers across languages.

Impact

GRPO became the default RL algorithm for DeepSeek-V3, DeepSeek-R1, and influenced Kimi K2

Which subsequent model demonstrated that GRPO scales to frontier reasoning at much larger parameter counts?

DeepSeek-R1, which uses GRPO as its RL algorithm and achieves state-of-the-art on math and code reasoning
GPT-4, which adopted GRPO in a later update
Llama 3, which uses GRPO for pre-training

DeepSeekMath: Pushing the Limits of Mathematical Reasoning