A 7B model that hits 51.7% on competition-level MATH — approaching GPT-4 — via 120B curated math tokens and a memory-efficient RL algorithm that eliminates the critic network entirely.
It is early 2024. GPT-4 solves competition-level math problems at roughly 53% accuracy on the MATH benchmark, and Gemini Ultra sits at about the same mark. These models are believed to have hundreds of billions of parameters, were trained on undisclosed data likely spanning trillions of tokens, and cost enormous sums to build.
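Meanwhile, the best open-source models are far behind. Llemma 34B, the strongest math-specific open model, manages only 25.3% on MATH. Even the best general-purpose open models, such as Qwen 72B, top out around 35%. The gap is enormous: closed-source models lead open-source by roughly 18 percentage points or more on the hardest math benchmark.
Why? Two bottlenecks: open models lack a large, high-quality math pre-training corpus, and PPO-style reinforcement learning is memory-hungry because it needs a separate critic network.
MATH benchmark accuracy. Open-source models are clustered far below the closed-source frontier. DeepSeekMath-RL nearly closes this gap with only 7B parameters.
DeepSeekMath's strategy has two pillars, each addressing one bottleneck: the DeepSeekMath Corpus, 120B math tokens mined from Common Crawl, and GRPO, an RL algorithm that drops the critic network.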
The combination is surprisingly powerful. A 7B model, continued pre-trained on the DeepSeekMath Corpus and then fine-tuned with GRPO, reaches 51.7% on MATH — within 1-2 points of GPT-4 and Gemini Ultra.
A third surprising finding: code training before math training helps. DeepSeekMath starts from DeepSeek-Coder-Base, not a general LLM. The authors show experimentally that code pre-training improves mathematical reasoning, both with and without tool use. This is partial evidence for the long-standing hypothesis that code training improves general reasoning ability.
How do you find 120 billion tokens of math on the internet? Common Crawl contains 40 billion HTML pages after deduplication. The vast majority are not mathematical. The challenge is finding the needles in an ocean of haystacks.
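The pipeline starts with a seed corpus: OpenWebMath, a curated collection of 13.6B math tokens. From this seed, the process iterates: a fastText classifier trained on the current seed recalls more math-like pages from Common Crawl, and the newly collected pages enlarge the seed for the next round.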
After four iterations, the pipeline converges: 98% of data found in iteration 4 was already found in iteration 3. The final corpus: 35.5 million web pages, 120 billion tokens.
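The stopping criterion above can be illustrated with a small sketch. It only shows the overlap arithmetic behind the 98% figure; the function names, the threshold parameter, and the toy URLs are hypothetical and are not taken from the authors' pipeline.

```python
def fraction_already_seen(previous_pages: set, current_pages: set) -> float:
    """Share of the current iteration's pages that the previous iteration already held."""
    if not current_pages:
        return 1.0
    return len(current_pages & previous_pages) / len(current_pages)


def has_converged(previous_pages: set, current_pages: set, threshold: float = 0.98) -> bool:
    """Assumed stopping rule: stop once almost nothing new is being discovered."""
    return fraction_already_seen(previous_pages, current_pages) >= threshold


# Toy data: 100 pages per iteration, 98 of which overlap.
iteration_3 = {f"https://example.org/page{i}" for i in range(100)}
iteration_4 = {f"https://example.org/page{i}" for i in range(2, 102)}

overlap = fraction_already_seen(iteration_3, iteration_4)
converged = has_converged(iteration_3, iteration_4)
```

A rule like this trades recall for cost: each extra pass over Common Crawl is expensive, so a high overlap is a reasonable signal that further passes would add little.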
The authors validate quality by training a 1.3B model on different math corpora for 150B tokens each:
| Corpus | Size | GSM8K | MATH |
|---|---|---|---|
| No math training | - | 2.9% | 3.0% |
| MathPile (mostly arXiv) | 8.9B | 2.7% | 3.3% |
| OpenWebMath | 13.6B | 11.5% | 8.9% |
| Proof-Pile-2 | 51.9B | 14.3% | 11.2% |
| DeepSeekMath Corpus | 120.2B | 23.8% | 13.6% |
To limit benchmark contamination, any web page containing a 10-gram that exactly matches a substring of GSM8K, MATH, or the Chinese math benchmarks is removed. For benchmark texts shorter than 10 grams but at least 3 grams long, exact matching is used instead. This reduces the risk that benchmark scores reflect memorization rather than genuine reasoning.
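A minimal sketch of this kind of n-gram filter is shown below. It assumes lowercase whitespace tokenization and treats "exact matching" for short benchmark texts as "the page contains the whole text as a contiguous token run"; both choices, and all function names, are illustrative assumptions rather than the authors' implementation.

```python
def tokenize(text):
    # Assumption: lowercase whitespace tokens stand in for "grams".
    return text.lower().split()


def ngrams(tokens, n):
    """All contiguous n-token windows; empty if the text is shorter than n."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_filter(benchmark_texts, n=10, min_short=3):
    """Index long benchmark texts by their n-grams; keep short ones whole."""
    long_grams, short_texts = set(), []
    for text in benchmark_texts:
        tokens = tokenize(text)
        if len(tokens) >= n:
            long_grams |= ngrams(tokens, n)
        elif len(tokens) >= min_short:
            short_texts.append(tuple(tokens))
    return long_grams, short_texts


def is_contaminated(page_text, long_grams, short_texts, n=10):
    tokens = tokenize(page_text)
    if ngrams(tokens, n) & long_grams:
        return True
    # Short benchmark texts must appear in full somewhere in the page.
    return any(short in ngrams(tokens, len(short)) for short in short_texts)


# Made-up benchmark items and pages, for illustration only.
benchmark = [
    "a train leaves the station at noon traveling sixty miles per hour toward the city",
    "what is 2 + 2",
]
long_grams, short_texts = build_filter(benchmark)

page_long = "Problem 4: A train leaves the station at noon traveling sixty miles per hour. Find"
page_short = "quick quiz: what is 2 + 2 today"
page_clean = "the derivative of x squared is 2x"
```

In this sketch `page_long` and `page_short` would be dropped and `page_clean` kept.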
With 120B math tokens in hand, how do you use them? The answer is continued pre-training: take a model that has already learned general language and code, then keep training it on a math-heavy mixture.
A crucial decision: the base model is a code model, not a general-purpose LLM. The authors experimentally verified that code training prior to math training significantly improves mathematical reasoning:
| Training Order | GSM8K | MATH |
|---|---|---|
| General (400B) then Math (150B) | 19.1% | 14.4% |
| Code (400B) then Math (150B) | 21.9% | 15.3% |
| Math only (150B) | 20.5% | 13.1% |
DeepSeekMath-Base 7B is trained for 500 billion tokens with this distribution:
The training mix for continued pre-training. Math dominates, but code and natural language prevent catastrophic forgetting.
The mix is intentional. Including 24% code tokens (GitHub + AlgebraicStack) preserves the coding ability inherited from DeepSeek-Coder. The 10% natural language from Common Crawl maintains general language understanding. The remaining 66% is math: 56% DeepSeekMath Corpus plus 10% arXiv. The arXiv papers are included even though they were ineffective when used alone, since they may still help when mixed with web data.
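To make the proportions concrete, the sketch below converts the percentages into token budgets for the 500B-token run. The dictionary keys are just labels chosen here, and the "passes over the corpus" figure is an arithmetic consequence under the assumption that the 56% share is drawn uniformly from the 120B-token corpus.

```python
TOTAL_TOKENS_B = 500   # total continued pre-training budget, in billions of tokens
CORPUS_SIZE_B = 120    # size of the DeepSeekMath Corpus, in billions of tokens

# Percent of training tokens per source, as described in the text.
mix_percent = {
    "deepseekmath_corpus": 56,
    "arxiv": 10,
    "code_github_algebraicstack": 24,
    "common_crawl_natural_language": 10,
}

# Token budget per source, in billions.
tokens_b = {name: TOTAL_TOKENS_B * pct // 100 for name, pct in mix_percent.items()}

math_share = mix_percent["deepseekmath_corpus"] + mix_percent["arxiv"]

# Assumption: uniform sampling, so the corpus is seen a little over twice.
passes_over_corpus = tokens_b["deepseekmath_corpus"] / CORPUS_SIZE_B
```

Under these assumptions the math corpus contributes 280B training tokens, which is more than the corpus itself contains, so the run implies repeated passes over the web math data.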
| Model | Size | GSM8K | MATH |
|---|---|---|---|
| Minerva | 540B | 58.8% | 33.6% |
| Llemma | 34B | 54.0% | 25.3% |
| Mistral | 7B | 40.3% | 14.3% |
| DeepSeekMath-Base | 7B | 64.2% | 36.2% |
This is the core technical contribution. Group Relative Policy Optimization (GRPO) is a variant of PPO that eliminates the critic (value) network by estimating advantages from group comparisons.
In PPO, the advantage $A_t$ measures "how much better was this action than expected?" Computing it requires a value function Vψ: a separate neural network (the "critic") that estimates the expected future reward from each state. For LLMs, this means training a model of comparable size alongside the policy. Four models must be loaded simultaneously: policy, reference, reward model, and critic.
There is a deeper problem. In the LLM setting, the reward model only assigns a score to the complete output (did the final answer to the math problem match?). But the value function must estimate reward at every token. Training an accurate per-token value function when only a single end-of-sequence reward exists is unreliable.
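Instead of learning a value function, GRPO does something elegantly simple. For each question q, it samples a group of G outputs from the old policy, scores each output with the reward model, and sets each output's advantage to its reward minus the group mean, divided by the group standard deviation.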
That's it. If your reward is above the group average, your advantage is positive (reinforce this behavior). If below average, your advantage is negative (suppress this behavior). The mean and standard deviation of the group are the baseline — no learned value function needed.
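The normalization described above fits in a few lines. This is a sketch under stated assumptions: it uses the population standard deviation and adds a small `eps` so that a group with identical rewards yields zero advantages instead of a division by zero. Neither detail is specified in the text.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    Every token of output i would then share the advantage at index i.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy group of G = 8 sampled answers to one question: 2 correct, 6 wrong.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(rewards)

# A group where every answer is wrong carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

In the toy group, the two correct answers get a positive advantage and the six wrong ones a smaller negative one, so rare successes are reinforced more strongly than common failures are suppressed.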
In the GRPO objective, $r_t(\theta) = \pi_\theta(o_t \mid q, o_{<t}) / \pi_{\theta_{old}}(o_t \mid q, o_{<t})$ is the familiar PPO probability ratio, and the min/clip mechanism is identical to PPO's clipped surrogate. The key difference is how $\hat{A}_{i,t}$ is computed.
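For reference, the objective these terms belong to can be written as follows. This is a reconstruction in the notation of the DeepSeekMath paper, where the ratio carries an extra index i for the i-th output in the group, ε is the clipping range, and β is the KL coefficient.

```latex
\mathcal{J}_{GRPO}(\theta) =
\mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{old}}(O \mid q)}
\left[
  \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}
  \left(
    \min\!\left[
      r_{i,t}(\theta)\, \hat{A}_{i,t},\
      \operatorname{clip}\!\left(r_{i,t}(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_{i,t}
    \right]
    - \beta\, \mathbb{D}_{KL}\!\left[\pi_\theta \,\|\, \pi_{ref}\right]
  \right)
\right],
\qquad
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}
```

The average over tokens inside each output, followed by the average over the G outputs, is what lets a single scalar reward per output drive a per-token update.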
Unlike PPO (which adds a per-token KL penalty inside the reward), GRPO adds the KL divergence between the policy and the reference model directly to the loss. This is cleaner because it doesn't complicate the advantage calculation. The KL is estimated per token with an unbiased estimator built from the probability ratio πref/πθ.
This estimator is guaranteed non-negative and avoids the instability of naive KL estimation.
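In the DeepSeekMath paper the per-token estimator is πref/πθ − log(πref/πθ) − 1, evaluated at each sampled token. The sketch below computes it from log-probabilities; the function name and the sample values are illustrative assumptions.

```python
import math


def kl_estimate(logp_theta, logp_ref):
    """Per-token KL estimate: ratio - log(ratio) - 1, with ratio = pi_ref / pi_theta.

    Since exp(x) >= 1 + x for every real x, the result cannot be negative
    (up to floating-point rounding), and it is zero when the two models agree.
    """
    log_ratio = logp_ref - logp_theta
    return math.exp(log_ratio) - log_ratio - 1.0


# Hypothetical log-probabilities of one sampled token under each model.
same = kl_estimate(-1.0, -1.0)            # models agree
policy_higher = kl_estimate(-1.0, -2.0)   # policy assigns more mass than reference
policy_lower = kl_estimate(-2.0, -1.0)    # policy assigns less mass than reference
```

By contrast, the naive per-token estimate log(πθ/πref) can be negative for individual tokens, which is one reason the non-negative form is attractive as a loss term.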
Let's make the comparison concrete. Both algorithms optimize the same clipped surrogate objective. The difference is entirely in how they compute advantages.
PPO uses Generalized Advantage Estimation (GAE), which builds each token's advantage from the per-token rewards and the critic's value estimates.
This requires a learned value function Vψ — another neural network of comparable size to the policy. It must be accurate at every token position, which is hard when only the final token receives a reward from the reward model.
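For comparison, the standard GAE estimator has the form below. The symbols γ (discount), λ (GAE parameter), s_t (the state at token t), and T (the output length) are the usual ones from the GAE literature and do not appear in the text above; r_t is the per-token reward, which in RLHF-style PPO is typically the KL penalty at every token plus the reward model score at the final token.

```latex
\delta_t = r_t + \gamma V_\psi(s_{t+1}) - V_\psi(s_t),
\qquad
\hat{A}_t = \sum_{l=0}^{T-t-1} (\gamma\lambda)^{l}\, \delta_{t+l},
\qquad
V_\psi(s_T) = 0
```

Every term depends on the critic's estimates, so errors in the value function propagate directly into the advantages.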
GRPO takes the opposite route. Every token in an output gets the same advantage: the normalized reward of the entire output. There is no learned function, no critic training overhead, and no approximation error from a poorly trained critic.
For a 7B parameter model, here is what must be loaded into GPU memory:
| Component | PPO | GRPO |
|---|---|---|
| Policy model πθ | 7B | 7B |
| Reference model πref | 7B | 7B |
| Reward model rφ | 7B | 7B |
| Critic (value) model Vψ | 7B | 0 |
| Total parameters | 28B | 21B |
| Memory savings | - | ~25% |
Visual comparison of GPU memory allocation. PPO loads 4 models; GRPO loads 3. The freed memory can be used for larger batch sizes, enabling more stable training.
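The table's arithmetic can be reproduced with a short sketch. It counts model weights only and assumes 16-bit weights (2 bytes per parameter); optimizer states, gradients, and activations are ignored, so real savings would differ.

```python
BILLION = 10**9
PARAMS_PER_MODEL = 7 * BILLION
BYTES_PER_PARAM = 2  # assumption: bf16/fp16 weights

ppo_models = ["policy", "reference", "reward", "critic"]
grpo_models = ["policy", "reference", "reward"]


def total_params(models):
    """Total parameter count when every model has the same size."""
    return len(models) * PARAMS_PER_MODEL


def weight_gigabytes(models):
    """Memory for weights alone, in GB (10**9 bytes)."""
    return total_params(models) * BYTES_PER_PARAM / 10**9


ppo_params = total_params(ppo_models)
grpo_params = total_params(grpo_models)
savings = 1 - grpo_params / ppo_params
```

Because the critic is trained, it also carries gradients and optimizer state in PPO, so dropping it plausibly saves more than the weights-only 25% suggests.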
The group-relative advantage assigns the same value to every token in an output. This means GRPO cannot distinguish which specific tokens contributed to a correct or incorrect answer — it only knows whether the entire response was good or bad. PPO's per-token advantages from GAE can, in principle, provide finer-grained credit assignment.
In practice, this coarse signal still works well for math: the entire chain-of-thought either leads to the right answer or it doesn't. The binary signal (correct/incorrect) appears sufficient to drive the gains reported below.
DeepSeekMath's post-training follows a two-stage pipeline: supervised fine-tuning (SFT) first, then reinforcement learning (GRPO). The order matters.
The base model (DeepSeekMath-Base 7B) knows math but cannot follow instructions. SFT teaches it the format: given a problem, produce a step-by-step solution.
The SFT data consists of 776K examples covering English and Chinese math, with solutions in three formats: chain-of-thought, program-of-thought, and tool-integrated reasoning.
Training: 500 steps, batch size 256, constant learning rate 5e-5, max context length 4K tokens. The result: DeepSeekMath-Instruct 7B at 46.8% MATH.
GRPO takes DeepSeekMath-Instruct as its starting point, while the reward model is trained from DeepSeekMath-Base 7B. The RL training data is a subset of the SFT data: only chain-of-thought questions from GSM8K and MATH, about 144K questions in total.
| Hyperparameter | Value |
|---|---|
| Group size G | 64 outputs per question |
| Learning rate | 1e-6 |
| KL coefficient β | 0.04 |
| Max output length | 1024 tokens |
| Batch size | 1024 |
| Updates per exploration | 1 |
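For readers who want these settings in code, here is one way to capture the table as a configuration object. The class and field names are hypothetical and do not correspond to any released training script; only the values come from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GRPOConfig:
    """Hyperparameters from the table above; field names are illustrative."""
    group_size: int = 64               # outputs sampled per question (G)
    learning_rate: float = 1e-6        # policy learning rate
    kl_coef: float = 0.04              # beta, weight of the KL term
    max_output_tokens: int = 1024      # maximum length of each sampled output
    batch_size: int = 1024             # training batch size
    updates_per_exploration: int = 1   # policy updates per sampling stage


config = GRPOConfig()
```

A frozen dataclass keeps the settings immutable during a run, which makes accidental mid-training changes an error rather than a silent bug.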
DeepSeekMath-RL 7B sets a new high-water mark for open-source math reasoning. Let's look at the full picture.
| Model | Size | MATH |
|---|---|---|
| Gemini Ultra | - | 53.2% |
| GPT-4 | - | 52.9% |
| DeepSeekMath-RL | 7B | 51.7% |
| Baichuan-3 | - | 49.2% |
| GLM-4 | - | 47.9% |
| InternLM2-Math | 20B | 37.7% |
| Qwen | 72B | 35.2% |
| WizardMath-v1.1 | 7B | 33.0% |
DeepSeekMath-RL at 7B outperforms every open-source model in the table, from 7B to 72B, and surpasses most of the closed-source models listed. It is within 1.5 points of both GPT-4 and Gemini Ultra.
Sample 64 solutions and take the majority answer: 60.9% on MATH. This was the highest score any open-source model had achieved at the time.
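Majority voting over sampled solutions can be sketched as follows. The answer strings are made up, and the convention that `None` marks a sample with no extractable final answer is an assumption of this sketch.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent final answer, ignoring samples with no answer.

    Ties go to the answer that appeared first, which is how
    Counter.most_common orders equal counts.
    """
    counts = Counter(a for a in answers if a is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


# Toy example with 6 samples instead of 64.
sampled_answers = ["42", "41", "42", None, "7", "42"]
voted = majority_vote(sampled_answers)
```

The vote helps when errors are scattered across many different wrong answers while correct reasoning paths agree on one answer.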
Note that the GRPO stage used only English CoT data from GSM8K and MATH. Yet the gains carry over to the other benchmarks:
| Benchmark | SFT Only | + GRPO | Change |
|---|---|---|---|
| GSM8K (in-domain) | 82.9% | 88.2% | +5.3 |
| MATH (in-domain) | 46.8% | 51.7% | +4.9 |
| MGSM-zh (Chinese) | 73.2% | 79.6% | +6.4 |
| CMATH (Chinese) | 84.6% | 88.8% | +4.2 |
| MATH+Python (tool-use) | 57.4% | 58.8% | +1.4 |
MATH benchmark accuracy at each stage: Base (few-shot), SFT (instruction-tuned), RL (GRPO-optimized). Each stage adds a meaningful jump.
The paper provides deep analysis through a unified paradigm and ablation experiments. Here are the most important findings.
The authors show that SFT, Rejection Sampling Fine-Tuning (RFT), DPO, PPO, and GRPO are all instances of the same general framework. Every method's gradient has the form:
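The methods differ in three components: the data source D, the reward function that scores outputs, and the algorithm that turns data and rewards into a per-token gradient coefficient (GC).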
| Method | Data Source D | Reward | Online? |
|---|---|---|---|
| SFT | Human-curated | None (GC=1) | No |
| RFT | SFT model samples | Rule-based | No |
| DPO | SFT model pairs | Rule-based | No |
| Online RFT | Current policy samples | Rule-based | Yes |
| PPO | Current policy samples | Learned model | Yes |
| GRPO | Current policy groups | Learned model | Yes |
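The general gradient form behind this table can be written as below, following the paper's notation: 𝒜 is the algorithm, 𝒟 the data source, π_rf the reward function, and GC the gradient coefficient that weights each token's log-likelihood gradient.

```latex
\nabla_\theta \mathcal{J}_{\mathcal{A}}(\theta) =
\mathbb{E}_{(q, o) \sim \mathcal{D}}
\left[
  \frac{1}{|o|} \sum_{t=1}^{|o|}
  GC_{\mathcal{A}}(q, o, t, \pi_{rf})\,
  \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t})
\right]
```

Setting GC to 1 for every token recovers SFT, which matches the "None (GC=1)" entry in the table.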
A direct comparison: Online RFT achieves 50.2% on MATH versus 47.3% for offline RFT, and GRPO with outcome supervision reaches 51.7%. In these comparisons, sampling from the evolving policy is worth roughly 3 to 4.5 percentage points over the offline baseline.
Outcome supervision gives a single reward at the end (was the final answer correct?). Process supervision gives a reward at each reasoning step (was this step logically sound?). With GRPO, outcome supervision alone already delivers most of the gain, although the paper's comparison reports process supervision performing somewhat better, which suggests that step-level signals still carry useful information.
As the policy improves, the reward model can fall behind — it was trained on outputs from the original SFT model, which now looks different from the improving RL policy. Iterative GRPO addresses this by periodically re-generating training data for the reward model using the current policy, and continuing to train the reward model with a replay mechanism (10% historical data to prevent forgetting). This yields further improvements.
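The replay idea can be sketched as a batch builder for reward model training. This assumes "10% historical data" means that one tenth of each batch is drawn from earlier rounds; the text does not pin this down, and the function and variable names are hypothetical.

```python
import random


def build_reward_model_batch(new_examples, historical_examples, batch_size,
                             replay_fraction=0.10, seed=0):
    """Mix fresh policy samples with a small share of replayed older examples."""
    rng = random.Random(seed)
    n_replay = min(round(batch_size * replay_fraction), len(historical_examples))
    n_new = min(batch_size - n_replay, len(new_examples))
    batch = rng.sample(historical_examples, n_replay) + rng.sample(new_examples, n_new)
    rng.shuffle(batch)
    return batch


# Toy data: tuples tagged by origin so the mix can be inspected.
new_examples = [("new", i) for i in range(1000)]
historical_examples = [("old", i) for i in range(500)]

batch = build_reward_model_batch(new_examples, historical_examples, batch_size=100)
n_old = sum(1 for origin, _ in batch if origin == "old")
```

Keeping a small replay share is a common way to limit forgetting of earlier output distributions while still tracking the current policy.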
The authors' analysis points to a key insight: SFT treats all training tokens equally (gradient coefficient = 1), while RL assigns different weights based on quality. RL amplifies good reasoning patterns and suppresses bad ones, rather than blindly imitating everything in the training data. This is especially important when training data contains multiple solution paths of varying quality.
PPO (Schulman et al., 2017): GRPO inherits PPO's clipped surrogate objective and probability ratio framework, replacing only the advantage estimation with group-relative computation.
RLHF (Ouyang et al., 2022): The paradigm of SFT followed by RL from a reward model. DeepSeekMath applies this to mathematical reasoning rather than helpfulness/harmlessness alignment.
OpenWebMath (Paster et al., 2023): The seed corpus that bootstraps the iterative data pipeline. Without this starting point, the fastText classifier could not begin.
Rejection Sampling Fine-Tuning (Yuan et al., 2023): The offline precursor to GRPO. RFT samples multiple solutions, keeps the correct ones, and fine-tunes on them. GRPO goes further by using normalized advantages (not just accept/reject) and training online.
DeepSeek-V3 (2024): The DeepSeek team's flagship 671B MoE model uses GRPO as part of its post-training pipeline. The algorithm scales from 7B to hundreds of billions of parameters.
DeepSeek-R1 (2025): The reasoning-specialized model that achieves state-of-the-art on math and code uses GRPO as its RL algorithm, demonstrating that group-relative advantages scale to frontier reasoning.
DAPO (2025): Decoupled Clip and Dynamic Sampling Policy Optimization, a successor that builds on GRPO's group-relative approach with changes such as decoupled clipping bounds and dynamic sampling for long chain-of-thought RL.
Kimi K2 (Moonshot AI, 2025): Uses GRPO-derived techniques for its agentic RL pipeline, training tool use and multi-step reasoning at trillion-parameter scale.