Four targeted fixes for four failure modes of large-scale LLM reinforcement learning. Removes clipping bias, prevents entropy collapse, handles length bias, and uses dynamic sampling. Achieves 50 on AIME 2024 with Qwen2.5-32B.
DeepSeek-R1 proved that GRPO can teach language models to reason. But when ByteDance tried to reproduce R1's results at scale, they hit four systematic failure modes that prevented stable training. DAPO is the fix.
Think of GRPO as a car that can go fast but pulls to the left, overheats on long drives, and has unreliable brakes. DAPO doesn't replace the car — it fixes each of these specific problems. The result is a training algorithm that scales reliably to 32B parameters without the constant monitoring and manual intervention that GRPO requires.
| Failure Mode | What Happens | DAPO Fix |
|---|---|---|
| Clipping bias | PPO/GRPO's symmetric clipping suppresses both good and bad deviations equally, biasing the policy toward the old distribution | Clip-Higher: asymmetric clipping with a higher upper bound (εhigh > εlow), so positive-advantage updates get more room while the lower bound stays unchanged |
| Entropy collapse | The policy becomes deterministic too quickly, losing the ability to explore diverse solutions | Dynamic Sampling: filter out problems where accuracy is 0% or 100% to maintain learning signal |
| KL divergence issues | Sequence-level KL penalty is dominated by long responses, creating a length bias | Token-Level KL: compute KL per-token and average, making the penalty length-invariant |
| Length exploitation | Model learns to generate very long responses because longer = more correct on average | Overlong Reward Shaping: penalize responses that exceed a soft length limit |
PPO and GRPO use clipped surrogate objectives to prevent the policy from changing too much in one step. The standard clip limits the probability ratio to [1-ε, 1+ε]. But this symmetric clipping has an asymmetric effect that creates a bias.
Consider a response with positive advantage (good response, we want more of it). The gradient pushes the probability ratio rt upward. But the clip at 1+ε limits how much the ratio can increase. If the gradient is strong enough to push rt beyond 1+ε, the gradient is zeroed out.
Now consider a response with negative advantage (bad response, we want less of it). The gradient pushes rt downward. The clip at 1-ε limits how much it can decrease. So far, symmetric.
But here's the bias: we clip both upward and downward changes equally. This means the policy can't deviate much from the old policy in either direction. For positive advantages, we WANT the policy to increase — but the clip prevents it. This biases the policy toward the old distribution, slowing learning.
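To make the cap concrete, here is a minimal sketch of the largest probability a token can reach in one update before the clipped objective stops rewarding further increases. The helper name `max_prob_after_clip` and the example probabilities are illustrative assumptions, not taken from the paper or its code.

```python
def max_prob_after_clip(old_prob: float, eps: float = 0.2) -> float:
    """Largest new probability that still receives gradient for a
    positive-advantage token: the ratio new/old is capped at 1 + eps."""
    return min(1.0, old_prob * (1.0 + eps))


if __name__ == "__main__":
    for old_prob in (0.01, 0.2, 0.9):
        new_prob = max_prob_after_clip(old_prob)
        gain = new_prob - old_prob
        print(f"old={old_prob:.2f} -> max new={new_prob:.3f} (gain {gain:.3f})")
```

Because the cap is multiplicative, a token the old policy considered unlikely can gain very little absolute probability per update, while an already-likely token is barely constrained. Under this reading, the upper clip is most restrictive for exactly the low-probability responses the policy has the most to learn from.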
DAPO's Clip-Higher technique uses a higher upper clip bound (εhigh) than lower bound (εlow). For positive advantages, the ratio can increase more before being clipped. For negative advantages, the ratio is clipped as before.
In practice, εhigh is set to about 1.4x εlow. This asymmetry lets the policy move more aggressively toward good responses while still being cautious about moving away from bad ones.
```python
import torch


# Standard PPO clipping (symmetric)
def ppo_clip(ratio, advantage, eps=0.2):
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.min(surr1, surr2)


# DAPO Clip-Higher (asymmetric)
def dapo_clip(ratio, advantage, eps_low=0.2, eps_high=0.28):
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage
    return torch.min(surr1, surr2)


# Positive advantages: ratio can go up to 1.28 (more room)
# Negative advantages: ratio clips at 0.80 (same as PPO)
```
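The difference between the two objectives shows up only in a narrow band of ratios. The following is a small, dependency-free sketch that checks whether the objective still responds to a change in the ratio, using a finite difference. The scalar helpers `clipped_surrogate` and `has_gradient` are hypothetical names introduced for illustration, and the ratio 1.25 is an arbitrary test point.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Scalar version of the clipped objective: min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)


def has_gradient(ratio, advantage, eps_low=0.2, eps_high=0.2, h=1e-6):
    """True if nudging the ratio changes the objective (finite difference)."""
    up = clipped_surrogate(ratio + h, advantage, eps_low, eps_high)
    down = clipped_surrogate(ratio - h, advantage, eps_low, eps_high)
    return abs(up - down) > 1e-12


if __name__ == "__main__":
    ratio, advantage = 1.25, 1.0  # good response, ratio already above 1.2
    print("symmetric  :", has_gradient(ratio, advantage, 0.2, 0.2))
    print("clip-higher:", has_gradient(ratio, advantage, 0.2, 0.28))
```

At a ratio of 1.25 with positive advantage, the symmetric objective is flat (the update is clipped away) while the Clip-Higher objective still pushes the ratio upward. For negative advantages below 0.8, both variants behave identically.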
GRPO samples G responses per problem and computes advantages from the reward distribution within the group. But what happens when all responses are correct or all are wrong? The advantages become meaningless.
If all G responses to a problem are correct (all rewards = 1), the mean is 1, the standard deviation is 0, and the z-score advantages are undefined (or all zero after normalization). The model learns nothing from this problem — it already gets it right every time.
Conversely, if all G responses are wrong (all rewards = 0), the mean is 0, std is 0, and again no useful gradient. The model can't learn from problems that are too hard.
As training progresses, easy problems become 100%-correct groups and hard problems remain 0%-correct groups. The fraction of "useful" problems — those with a mix of correct and incorrect responses — shrinks. Learning slows down dramatically.
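The degenerate cases are easy to verify numerically. Below is a minimal sketch of group-normalized advantages. The function name `group_advantages` and the small `eps` added to the denominator are assumptions made for illustration (a common way to avoid dividing by zero), not details confirmed by the source.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own group (GRPO-style).
    `eps` guards the division when the group has zero variance."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    print(group_advantages([1, 1, 1, 1]))  # all correct: every advantage is 0
    print(group_advantages([0, 0, 0, 0]))  # all wrong: every advantage is 0
    print(group_advantages([1, 0, 0, 1]))  # mixed: signed, nonzero advantages
```

With every advantage equal to zero, the policy-gradient term for that problem vanishes, so the group consumes generation compute without moving the weights.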
DAPO's Dynamic Sampling filters out degenerate groups before computing the GRPO update. For each problem, after sampling G responses:
```python
import random


# Dynamic Sampling in DAPO
# reward_fn and grpo_update are supplied by the caller.
def dapo_batch(model, problems, reward_fn, grpo_update, G=64, target_useful=32):
    useful_groups = []
    for problem in random.sample(problems, len(problems)):  # shuffled copy
        # Sample G responses
        responses = [model.generate(problem) for _ in range(G)]
        rewards = [reward_fn(r, problem) for r in responses]

        # Check for degenerate group
        if all(r == rewards[0] for r in rewards):
            continue  # SKIP: all same reward, no learning signal

        # This group has useful signal: keep it
        useful_groups.append((responses, rewards))
        if len(useful_groups) >= target_useful:
            break

    # Compute GRPO update only on useful groups
    return grpo_update(model, useful_groups)
```
Dynamic sampling wastes some compute on problems that get filtered out. If 40% of problems are degenerate, only 60% of sampled problems survive the filter, so you need to sample about 1/0.6 ≈ 1.67x as many problems (roughly 67% more) to fill each batch. This cost increases during training as more problems become easy. But the alternative, training on degenerate groups with zero signal, is worse: it spends the same generation compute and contributes no gradient.
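The sampling overhead follows from a simple expectation. This sketch assumes each problem is degenerate independently with a fixed probability, which is a simplification, and the function name `expected_problems_needed` is hypothetical.

```python
def expected_problems_needed(target_useful: int, degenerate_fraction: float) -> float:
    """Expected number of problems to sample so that `target_useful` of them
    survive the filter, assuming each problem is degenerate independently
    with probability `degenerate_fraction`."""
    if not 0.0 <= degenerate_fraction < 1.0:
        raise ValueError("degenerate_fraction must be in [0, 1)")
    return target_useful / (1.0 - degenerate_fraction)


if __name__ == "__main__":
    for frac in (0.0, 0.2, 0.5, 0.8):
        needed = expected_problems_needed(32, frac)
        print(f"{frac:.0%} degenerate -> ~{needed:.0f} problems for 32 useful groups")
```

The overhead grows as 1 / (1 - fraction), so it is mild while most problems are informative and climbs steeply once the majority of groups become all-correct or all-wrong.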
GRPO uses a KL divergence penalty to prevent the policy from drifting too far from a reference model. But the standard implementation has a subtle length bias that distorts training.
Standard KL is computed at the sequence level: sum the per-token KL divergence across all tokens in the response, then average across responses. This means longer responses accumulate more KL, getting a larger penalty. The model learns: shorter responses = less KL penalty = better.
This creates a perverse incentive: the model can reduce its KL penalty by generating shorter responses, even if those shorter responses are less accurate. For reasoning tasks where longer thinking = better accuracy, this is catastrophic.
DAPO changes the KL computation to be token-level: compute per-token KL, then average over tokens instead of summing. This makes the penalty invariant to response length.
```python
# Sequence-level KL (GRPO default, biased toward short responses)
def seq_kl(policy_logprobs, ref_logprobs, mask):
    # All inputs: [batch, seq_len]; mask is 1 for real tokens, 0 for padding
    per_token_kl = (policy_logprobs - ref_logprobs) * mask  # log-ratio KL estimate
    # Problem: longer response -> larger KL -> larger penalty
    return per_token_kl.sum(dim=-1).mean()  # SUM over tokens, mean over batch


# Token-level KL (DAPO, length-invariant)
def token_kl(policy_logprobs, ref_logprobs, mask):
    per_token_kl = policy_logprobs - ref_logprobs
    # MEAN over real tokens (not sum), then mean over batch
    per_seq_kl = (per_token_kl * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    # Fix: KL per token, averaged -> same penalty for 100 or 1000 tokens
    return per_seq_kl.mean()
```
Reasoning models like R1 generate responses of 1,000-10,000 tokens. With sequence-level KL and the same average per-token KL, a 5,000-token response gets 50x the KL penalty of a 100-token response. This creates enormous pressure to be brief, which directly undermines the "extended thinking" that makes reasoning models powerful. Token-level KL removes this length dependence.
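The 50x figure can be reproduced with plain Python. This sketch assumes a constant per-token KL of 0.01 for both responses, purely for illustration. The function names are hypothetical and operate on lists rather than tensors.

```python
def sequence_level_penalty(per_token_kl):
    """Sum over tokens: grows with response length."""
    return sum(per_token_kl)


def token_level_penalty(per_token_kl):
    """Mean over tokens: independent of response length."""
    return sum(per_token_kl) / len(per_token_kl)


if __name__ == "__main__":
    kl = 0.01  # assumed constant per-token KL, for illustration only
    short_resp, long_resp = [kl] * 100, [kl] * 5000
    ratio_seq = sequence_level_penalty(long_resp) / sequence_level_penalty(short_resp)
    ratio_tok = token_level_penalty(long_resp) / token_level_penalty(short_resp)
    print(f"sequence-level penalty ratio: {ratio_seq:.1f}")
    print(f"token-level penalty ratio:    {ratio_tok:.1f}")
```

In practice per-token KL is not constant across a response, so the real ratio will differ from exactly 50, but the linear dependence on length remains for the summed version.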
Even with token-level KL (which removes the length bias from regularization), the model can still learn to exploit response length. Longer reasoning traces are more likely to stumble on the correct answer by exhaustive enumeration. DAPO adds explicit length control through reward shaping.
During RL training, the model discovers that longer responses have higher accuracy on average. This is partly legitimate (more thinking = better answers) and partly exploitation (the model pads responses with repetitive text or explores many possibilities until it finds the right one by chance).
Without length control, response lengths grow unboundedly. R1-Zero's responses grew from ~100 to ~10,000 tokens. At some point, the extra length is pure waste — the model is rambling, not reasoning.
DAPO adds a reward shaping term that penalizes responses exceeding a soft maximum length Lmax:

$$R_{\text{shaped}} = R_{\text{acc}} - \lambda \cdot \frac{\max(0,\; L - L_{\max})}{L_{\max}}$$

Where R_acc is the accuracy reward, L is the response length in tokens, λ controls the penalty strength, and Lmax is the soft maximum. Responses at or under Lmax tokens get no penalty. Responses over Lmax get a penalty that grows linearly with the number of tokens past the limit.
```python
# Overlong Reward Shaping
def shaped_reward(accuracy_reward, response_len, max_len=8192, penalty_weight=0.5):
    # Fraction by which the response exceeds max_len (0 if under the limit)
    overlong = max(0, response_len - max_len) / max_len
    penalty = penalty_weight * overlong
    return accuracy_reward - penalty


# Examples:
# len=4000,  max=8192: penalty = 0, shaped = accuracy (no change)
# len=10000, max=8192: overlong=0.22, penalty=0.11
# len=16000, max=8192: overlong=0.95, penalty=0.48 (significant!)
```
The paper uses Lmax = 8,192 tokens for most experiments. This allows substantial reasoning (a few pages of text) while penalizing extreme responses. The value is chosen empirically: too low, and the model can't reason through hard problems; too high, and it doesn't constrain waste.
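One way to sanity-check a choice of penalty weight is to ask how long a correct response must be before its shaped reward falls to that of an incorrect one. The sketch below assumes the linear penalty described above and a binary accuracy reward (1 for correct, 0 for incorrect). `break_even_length` is a hypothetical helper, and the default values are illustrative.

```python
def shaped_reward(accuracy_reward, response_len, max_len=8192, penalty_weight=0.5):
    """Accuracy reward minus a linear penalty on tokens past max_len."""
    overlong = max(0, response_len - max_len) / max_len
    return accuracy_reward - penalty_weight * overlong


def break_even_length(max_len=8192, penalty_weight=0.5, accuracy_reward=1.0):
    """Length at which a correct response's shaped reward drops to 0,
    i.e. the same as an incorrect response that stays within the limit."""
    return max_len * (1.0 + accuracy_reward / penalty_weight)


if __name__ == "__main__":
    length = break_even_length()
    print(f"break-even length: {length:.0f} tokens")
    print(f"shaped reward there: {shaped_reward(1.0, length):.2f}")
```

With these defaults a correct answer keeps a positive reward until it reaches three times the soft limit, so the penalty discourages rambling without making a moderately overlong correct answer worse than a wrong one.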
The four DAPO techniques interact cleanly:
| Technique | Controls | Mechanism |
|---|---|---|
| Clip-Higher | Policy update magnitude | Asymmetric clipping of probability ratios |
| Dynamic Sampling | Learning signal quality | Filtering degenerate groups |
| Token-Level KL | Distribution drift | Length-invariant regularization |
| Overlong Shaping | Response length | Soft penalty on overlong responses |
None of these techniques conflict. Clip-Higher controls how much the policy changes per update. Dynamic Sampling ensures every update uses informative data. Token-Level KL prevents distribution collapse. Overlong Shaping prevents length exploitation. They're orthogonal fixes for orthogonal problems.
DAPO is evaluated by training Qwen2.5-32B on math reasoning tasks using DAPO versus GRPO and other baselines. The headline result: DAPO reaches 50/100 on AIME 2024 with a 32B model — competitive with much larger systems.
| Model | Size | AIME 2024 | Algorithm |
|---|---|---|---|
| DeepSeek-R1-Zero | 671B MoE | 71.0 | GRPO |
| DeepSeek-R1 | 671B MoE | 79.8 | GRPO + SFT |
| Qwen2.5-32B + GRPO | 32B | 38.0 | GRPO |
| Qwen2.5-32B + DAPO | 32B | 50.0 | DAPO |
DAPO improves AIME 2024 from 38 to 50 on the same 32B model — a 32% relative improvement. This closes a significant portion of the gap to the 671B R1 model while using only a fraction of the compute.
| Configuration | AIME | Contribution |
|---|---|---|
| GRPO baseline | 38.0 | — |
| + Clip-Higher | 41.5 | +3.5 |
| + Dynamic Sampling | 44.0 | +2.5 |
| + Token-Level KL | 47.0 | +3.0 |
| + Overlong Shaping | 50.0 | +3.0 |
Each added technique contributes roughly 2.5-3.5 AIME points in this cumulative ablation. No single technique dominates, which is consistent with DAPO's design philosophy: four minimal fixes for four specific problems.
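The arithmetic behind the ablation and the headline relative improvement can be checked directly from the table. This is a small sketch using only the scores listed above. The helper names `step_gains` and `relative_improvement` are introduced here for illustration.

```python
ABLATION = [
    ("GRPO baseline", 38.0),
    ("+ Clip-Higher", 41.5),
    ("+ Dynamic Sampling", 44.0),
    ("+ Token-Level KL", 47.0),
    ("+ Overlong Shaping", 50.0),
]


def step_gains(rows):
    """Gain of each row over the row before it (cumulative ablation)."""
    return [(name, score - prev) for (_, prev), (name, score) in zip(rows, rows[1:])]


def relative_improvement(baseline, final):
    """Improvement of `final` over `baseline` as a fraction of `baseline`."""
    return (final - baseline) / baseline


if __name__ == "__main__":
    for name, gain in step_gains(ABLATION):
        print(f"{name:<20} {gain:+.1f}")
    total = sum(gain for _, gain in step_gains(ABLATION))
    print(f"total gain: {total:+.1f}")
    print(f"relative improvement: {relative_improvement(38.0, 50.0):.1%}")
```

Because the techniques are added one after another, each gain is measured on top of the previous ones. The table does not show what any single technique would contribute if applied alone or in a different order.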
Beyond raw performance, DAPO dramatically improves training stability. With GRPO, training often requires manual intervention: restarting from checkpoints when the model collapses, adjusting hyperparameters mid-run, monitoring for reward hacking. With DAPO, training runs to completion without intervention — a necessary property for scaling to larger models and longer training.
The four techniques can be viewed as independent switches on top of GRPO: with all four off you have the GRPO baseline, and with all four on you have full DAPO. The configurations below summarize how each setting affects training stability and final accuracy.
With all off (GRPO baseline): Training is unstable. Accuracy plateaus around 38. Entropy collapses. Response length grows unboundedly.
With Clip-Higher only: Learning is faster initially, but still suffers from entropy collapse and length explosion.
With all four (full DAPO): Training is stable, accuracy reaches ~50, entropy stays healthy, and response length stabilizes around the soft maximum.
DAPO is an incremental but important contribution: it takes the GRPO algorithm (used to train DeepSeek-R1) and makes it production-ready for large-scale training. It sits in the lineage of RL algorithms for LLMs.
| Algorithm | Year | Relationship to DAPO |
|---|---|---|
| PPO | 2017 | The original clipped surrogate objective. DAPO inherits and modifies the clipping. |
| GRPO | 2024 | Removes PPO's critic. Introduced in DeepSeekMath and later used to train DeepSeek-R1. DAPO fixes GRPO's four remaining weaknesses. |
| DAPO | 2025 | Clip-Higher + Dynamic Sampling + Token KL + Overlong Shaping. This paper. |
| ReMax | 2024 | Alternative critic-free method using REINFORCE with baseline. Different approach, similar goal. |
| Dr. GRPO | 2025 | Another GRPO fix, focusing on the biases introduced by length and standard-deviation normalization. Complementary to DAPO. |
Precise diagnosis. Instead of proposing a new algorithm, DAPO precisely identified four failure modes and applied minimal fixes. This approach is more reproducible and easier to build on.
Full open-source. Training code, hyperparameters, and training logs are all released. Other teams can reproduce the results exactly.
Process rewards. DAPO still uses outcome-only rewards. Combining DAPO with process reward models could yield further improvements.
Beyond math. DAPO is evaluated primarily on math reasoning. Its effectiveness on code generation, scientific reasoning, and general instruction following is less clear.
DeepSeek-R1: The paper that showed GRPO can train reasoning models at scale, and the starting point that DAPO improves upon.
PPO: The original clipped surrogate objective.