ByteDance Seed, Tsinghua AIR, HKU — 2025

DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization

Four targeted fixes for four failure modes of large-scale LLM reinforcement learning. Removes clipping bias, prevents entropy collapse, handles length bias, and uses dynamic sampling. Achieves 50 on AIME 2024 with Qwen2.5-32B.

Prerequisites: PPO clipping intuition + GRPO basics + Policy gradient basics. That's it.

Chapter 0: GRPO's Failure Modes

DeepSeek-R1 proved that GRPO can teach language models to reason. But when ByteDance tried to reproduce R1's results at scale, they hit four systematic failure modes that prevented stable training. DAPO is the fix.

Think of GRPO as a car that can go fast but pulls to the left, overheats on long drives, and has unreliable brakes. DAPO doesn't replace the car — it fixes each of these specific problems. The result is a training algorithm that scales reliably to 32B parameters without the constant monitoring and manual intervention that GRPO requires.

The four failure modes

| Failure Mode | What Happens | DAPO Fix |
| --- | --- | --- |
| Clipping bias | PPO/GRPO's symmetric clipping suppresses both good and bad deviations equally, biasing the policy toward the old distribution | Clip-Higher: asymmetric clipping that raises the upper clip bound so good responses can be reinforced more strongly |
| Entropy collapse | The policy becomes deterministic too quickly, losing the ability to explore diverse solutions | Dynamic Sampling: filter out problems where accuracy is 0% or 100% to maintain learning signal |
| KL divergence issues | Sequence-level KL penalty is dominated by long responses, creating a length bias | Token-Level KL: compute KL per-token and average, making the penalty length-invariant |
| Length exploitation | Model learns to generate very long responses because longer = more correct on average | Overlong Reward Shaping: penalize responses that exceed a soft length limit |
DAPO's philosophy: Don't redesign the algorithm — diagnose each failure mode precisely and apply the minimal fix. GRPO is fundamentally sound. It just has four specific engineering weaknesses that emerge at scale. Each of DAPO's four techniques addresses exactly one weakness, and together they make large-scale RL training stable and reproducible.
GRPO Failure Modes Visualizer

Click each failure mode to see what happens during training and how DAPO fixes it.

What is DAPO's relationship to GRPO?

Chapter 1: Clip-Higher

PPO and GRPO use clipped surrogate objectives to prevent the policy from changing too much in one step. The standard clip limits the probability ratio to [1-ε, 1+ε]. But this symmetric clipping has an asymmetric effect that creates a bias.

The problem: symmetric clipping is biased

Consider a response with positive advantage (a good response we want more of). The gradient pushes the probability ratio $r_t$ upward, but once $r_t$ exceeds $1+\varepsilon$ the clipped term becomes constant and the gradient for that token is zero.

Now consider a response with negative advantage (a bad response we want less of). The gradient pushes $r_t$ downward, and the clip at $1-\varepsilon$ limits how far it can fall. So far, symmetric.

But here's the bias: clipping both directions equally keeps the policy close to the old policy in either direction. For positive advantages we WANT the probability to increase, yet the upper clip caps the increase at a factor of $1+\varepsilon$ per update. This biases the policy toward the old distribution and slows learning.
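The zero-gradient behavior described above can be checked numerically. The sketch below is an illustration, not code from the paper: `clipped_surrogate` and `grad_wrt_ratio` are hypothetical helper names, and the gradient is approximated by a central finite difference with respect to the ratio for a single token.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective for one token (to be maximized)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def grad_wrt_ratio(ratio, advantage, eps=0.2, h=1e-6):
    """Central finite-difference derivative of the objective w.r.t. the ratio."""
    up = clipped_surrogate(ratio + h, advantage, eps)
    down = clipped_surrogate(ratio - h, advantage, eps)
    return (up - down) / (2.0 * h)


# Positive advantage: gradient flows inside the clip range, vanishes above 1 + eps.
inside_pos = grad_wrt_ratio(1.10, advantage=1.0)
above_pos = grad_wrt_ratio(1.30, advantage=1.0)

# Negative advantage: gradient flows inside the clip range, vanishes below 1 - eps.
inside_neg = grad_wrt_ratio(0.90, advantage=-1.0)
below_neg = grad_wrt_ratio(0.70, advantage=-1.0)

print(inside_pos, above_pos, inside_neg, below_neg)
```

The upper bound only ever switches off positive-advantage tokens and the lower bound only negative-advantage ones, which is why the two bounds can be tuned separately.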

Standard PPO: $\mathrm{clip}(r_t,\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}})$ with $\varepsilon_{\text{low}} = \varepsilon_{\text{high}} = 0.2$

The fix: asymmetric clipping

DAPO's Clip-Higher technique uses a higher upper clip bound ($\varepsilon_{\text{high}}$) than lower bound ($\varepsilon_{\text{low}}$). For positive advantages, the ratio can increase more before being clipped. For negative advantages, the ratio is clipped as before.

DAPO: $\mathrm{clip}(r_t,\ 1-\varepsilon_{\text{low}},\ 1+\varepsilon_{\text{high}})$ with $\varepsilon_{\text{low}} = 0.2$, $\varepsilon_{\text{high}} = 0.28$

In practice, $\varepsilon_{\text{high}}$ is set to about 1.4x $\varepsilon_{\text{low}}$. This asymmetry lets the policy move more aggressively toward good responses while still being cautious about moving away from bad ones.

python
import torch

# Standard PPO clipping (symmetric)
def ppo_clip(ratio, advantage, eps=0.2):
    """Per-token clipped surrogate objective (to be maximized)."""
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return torch.min(surr1, surr2)

# DAPO Clip-Higher (asymmetric)
def dapo_clip(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Same objective with separate lower and upper clip bounds."""
    # Positive advantages: ratio can go up to 1.28 (more room)
    # Negative advantages: ratio clips at 0.80 (same as PPO)
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage
    return torch.min(surr1, surr2)
Why asymmetry helps: PPO's symmetric clip was tuned on continuous-control and game benchmarks, where conservative updates are the priority. But for LLM reasoning, we WANT the model to aggressively adopt strategies that work. Clip-Higher gives the model more room to increase the probability of correct reasoning patterns while keeping the usual bound on how fast incorrect ones are pushed down.
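One way to see how much room the upper bound leaves is to ask how high a token's probability can go within a single policy update before the clip removes its gradient. The sketch below is a back-of-the-envelope illustration; `max_new_prob` is a hypothetical helper, and it assumes the ratio is the new token probability divided by the old one.

```python
def max_new_prob(p_old, eps_high):
    """Largest token probability that still receives gradient for a positive advantage.

    The upper clip stops the update once p_new / p_old exceeds 1 + eps_high,
    so the reachable probability is capped at p_old * (1 + eps_high), and at 1.0.
    """
    return min(1.0, p_old * (1.0 + eps_high))


for p_old in (0.01, 0.50, 0.90):
    symmetric = max_new_prob(p_old, eps_high=0.20)
    clip_higher = max_new_prob(p_old, eps_high=0.28)
    print(f"p_old={p_old:.2f}  eps_high=0.20 -> {symmetric:.4f}  "
          f"eps_high=0.28 -> {clip_higher:.4f}")
```

Because the cap is multiplicative, a token that starts at a low probability gains very little absolute probability per update, while an already likely token is barely constrained. Raising the upper bound loosens the constraint where it binds hardest.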
Clipping Comparison

Compare symmetric PPO clipping vs DAPO's asymmetric Clip-Higher. Adjust $\varepsilon_{\text{high}}$ to see how it affects the policy's ability to reinforce good responses. The shaded region shows the allowed update range.

Why does DAPO use a higher upper clip bound ($\varepsilon_{\text{high}} = 0.28$) than lower bound ($\varepsilon_{\text{low}} = 0.2$)?

Chapter 2: Dynamic Sampling

GRPO samples G responses per problem and computes advantages from the reward distribution within the group. But what happens when all responses are correct or all are wrong? The advantages become meaningless.

The problem: degenerate groups

If all G responses to a problem are correct (all rewards = 1), the mean is 1, the standard deviation is 0, and the z-score advantages are undefined (or all zero after normalization). The model learns nothing from this problem — it already gets it right every time.

Conversely, if all G responses are wrong (all rewards = 0), the mean is 0, std is 0, and again no useful gradient. The model can't learn from problems that are too hard.
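The degenerate cases are easy to reproduce. The sketch below assumes the usual group normalization, (reward − group mean) / (group std + eps); the helper name `group_advantages`, the small `eps` constant, and the use of the population standard deviation are illustrative assumptions rather than details taken from the paper.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


mixed = group_advantages([1, 0, 0, 1, 0, 0, 0, 0])        # 2 of 8 correct
all_correct = group_advantages([1, 1, 1, 1, 1, 1, 1, 1])  # degenerate
all_wrong = group_advantages([0, 0, 0, 0, 0, 0, 0, 0])    # degenerate

print("mixed:      ", [round(a, 2) for a in mixed])
print("all correct:", all_correct)
print("all wrong:  ", all_wrong)
```

In the mixed group the correct responses get positive advantages and the incorrect ones negative. In both degenerate groups every advantage is zero, so the group contributes nothing to the gradient.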

As training progresses, easy problems become 100%-correct groups and hard problems remain 0%-correct groups. The fraction of "useful" problems — those with a mix of correct and incorrect responses — shrinks. Learning slows down dramatically.
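How quickly the useful fraction shrinks can be estimated with a simple model. The sketch below assumes each of the G responses to a problem is independently correct with probability p, which is a simplification introduced here for illustration; `mixed_group_probability` is a hypothetical helper name.

```python
def mixed_group_probability(p, G):
    """Chance that G independent samples contain both a correct and an incorrect answer."""
    all_correct = p ** G
    all_wrong = (1.0 - p) ** G
    return 1.0 - all_correct - all_wrong


for p in (0.01, 0.50, 0.90, 0.99):
    print(f"per-sample accuracy {p:.2f}: "
          f"P(mixed group, G=16) = {mixed_group_probability(p, G=16):.3f}")
```

Under this model a problem stops producing mixed groups well before the model is perfect on it, and very hard problems rarely produce them either, which matches the shrinking-from-both-ends picture.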

The entropy collapse spiral: As the model improves, more problems become easy (all-correct groups). These provide zero learning signal. The model only learns from the remaining hard problems, but those may be too hard (all-wrong groups). The useful training set shrinks from both ends, and the policy's entropy (diversity) collapses — it becomes deterministic and stops exploring.

The fix: dynamic sampling

DAPO's Dynamic Sampling filters out degenerate groups before computing the GRPO update. For each problem, after sampling G responses:

1. Sample G responses. Generate G outputs for the problem, compute reward for each.
2. Check diversity. Are all rewards identical? If all correct or all wrong, SKIP this problem.
3. Compute advantages. Only compute GRPO advantages for groups with mixed rewards (some correct, some wrong).
4. Oversample to compensate. To maintain batch size, sample more problems. Target: N problems with useful signal per batch.
python
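# Dynamic Sampling in DAPO
import random

def dapo_batch(model, problems, reward_fn, grpo_update, G=64, target_useful=32):
    """Collect groups with mixed rewards, then run one GRPO update on them.

    reward_fn(response, problem) and grpo_update(model, groups) are supplied by the caller.
    """
    useful_groups = []

    # Visit problems in random order without mutating the caller's list
    for problem in random.sample(problems, len(problems)):
        # Sample G responses
        responses = [model.generate(problem) for _ in range(G)]
        rewards = [reward_fn(r, problem) for r in responses]

        # Check for degenerate group
        if all(r == rewards[0] for r in rewards):
            continue  # SKIP: all same reward, no learning signal

        # This group has useful signal: keep it
        useful_groups.append((responses, rewards))

        if len(useful_groups) >= target_useful:
            break

    # Compute GRPO update only on useful groups
    # (may hold fewer than target_useful if the problem pool runs out)
    return grpo_update(model, useful_groups)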

The oversampling cost

Dynamic sampling spends some compute on problems that get filtered out. If 40% of problems are degenerate, only 60% of sampled groups are kept, so filling each batch takes about 1/0.6 ≈ 1.67x as many problems, or roughly 67% more. This cost increases during training as more problems become easy. But the alternative, training on degenerate groups with zero signal, is worse: it spends the same generation compute and provides no learning.
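The oversampling cost follows from a simple expectation: if a fraction f of sampled groups is filtered out, filling a batch with N useful groups takes N / (1 − f) problems on average. The helper below is a hypothetical illustration of that arithmetic; it assumes each problem is degenerate independently with the same probability.

```python
def expected_problems_needed(target_useful, degenerate_fraction):
    """Average number of problems to sample to collect `target_useful` mixed groups."""
    if not 0.0 <= degenerate_fraction < 1.0:
        raise ValueError("degenerate_fraction must be in [0, 1)")
    return target_useful / (1.0 - degenerate_fraction)


for fraction in (0.0, 0.2, 0.4, 0.8):
    needed = expected_problems_needed(32, fraction)
    print(f"{fraction:.0%} degenerate -> about {needed:.1f} problems "
          f"({needed / 32:.2f}x the batch size)")
```

The cost grows faster than the degenerate fraction itself, so it is worth tracking that fraction during training.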

Dynamic Sampling Visualizer

See how the fraction of useful groups changes during training. Early training: most problems have mixed results (useful). Late training: many are all-correct or all-wrong. Dynamic sampling filters these out.

Why does DAPO skip groups where all G responses have the same reward?

Chapter 3: Token-Level KL

GRPO uses a KL divergence penalty to prevent the policy from drifting too far from a reference model. But the standard implementation has a subtle length bias that distorts training.

The problem: sequence-level KL favors short responses

Standard KL is computed at the sequence level: sum the per-token KL divergence across all tokens in the response, then average across responses. This means longer responses accumulate more KL, getting a larger penalty. The model learns: shorter responses = less KL penalty = better.

$\mathrm{KL}_{\text{seq}} = \sum_{t=1}^{T} \mathrm{KL}(t)$ ← grows with response length $T$

This creates a perverse incentive: the model can reduce its KL penalty by generating shorter responses, even if those shorter responses are less accurate. For reasoning tasks where longer thinking = better accuracy, this is catastrophic.

The fix: token-level KL

DAPO changes the KL computation to be token-level: compute per-token KL, then average over tokens instead of summing. This makes the penalty invariant to response length.

$\mathrm{KL}_{\text{token}} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{KL}(t)$ ← length-invariant
python
import torch

# Sequence-level KL (GRPO default, biased toward short responses)
def seq_kl(policy_logprobs, ref_logprobs, mask):
    # policy_logprobs, ref_logprobs, mask: [batch, seq_len]
    # mask is 1 for response tokens and 0 for padding
    # Log-ratio of the sampled tokens: a simple per-token KL estimate
    per_token_kl = policy_logprobs - ref_logprobs   # [batch, seq_len]
    # SUM over tokens, mean over batch
    # Problem: longer response → larger KL → larger penalty
    return (per_token_kl * mask).sum(dim=-1).mean()

# Token-level KL (DAPO, length-invariant)
def token_kl(policy_logprobs, ref_logprobs, mask):
    per_token_kl = policy_logprobs - ref_logprobs
    # MEAN over tokens (not sum), then mean over batch
    # clamp guards against division by zero for an empty response
    num_tokens = mask.sum(dim=-1).clamp(min=1)
    per_seq_kl = (per_token_kl * mask).sum(dim=-1) / num_tokens
    # Fix: KL per token, averaged → same penalty for 100 or 1000 tokens
    return per_seq_kl.mean()
The interaction with response length. Token-level KL interacts with DAPO's overlong reward shaping (Chapter 4). Together, they decouple the KL regularization from length control. KL says "don't deviate too far from the reference model's token-level behavior." Overlong shaping says "don't generate excessively long responses." Each addresses one concern cleanly.

Why this matters for reasoning

Reasoning models like R1 generate responses of 1,000-10,000 tokens. With sequence-level KL and a similar per-token divergence, a 5,000-token response gets 50x the KL penalty of a 100-token response. This creates enormous pressure to be brief, which directly undermines the "extended thinking" that makes reasoning models powerful. Token-level KL removes this length dependence.
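The 50x figure can be reproduced with plain arithmetic. The sketch below assumes every token contributes the same per-token divergence (0.01 is an arbitrary illustrative value) and uses a hypothetical helper, `kl_penalties`, that returns both the summed and the averaged penalty for one response.

```python
def kl_penalties(per_token_kl):
    """Return (sequence-level sum, token-level mean) for one response."""
    total = sum(per_token_kl)
    return total, total / len(per_token_kl)


PER_TOKEN_KL = 0.01  # assumed constant per-token divergence, for illustration only

short_sum, short_mean = kl_penalties([PER_TOKEN_KL] * 100)
long_sum, long_mean = kl_penalties([PER_TOKEN_KL] * 5000)

print(f"sequence-level ratio (long / short): {long_sum / short_sum:.1f}")
print(f"token-level ratio (long / short):    {long_mean / short_mean:.1f}")
```

The summed penalty scales with length while the averaged one does not, so only the first gives the model a reason to shorten its responses.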

Sequence vs Token KL

Compare how sequence-level and token-level KL penalties scale with response length. Sequence-level grows linearly. Token-level stays flat. Adjust response length to see the divergence.

Why does DAPO average per-token KL instead of summing it across the sequence?

Chapter 4: Overlong Reward Shaping

Even with token-level KL (which removes the length bias from regularization), the model can still learn to exploit response length. Longer reasoning traces are more likely to stumble on the correct answer by exhaustive enumeration. DAPO adds explicit length control through reward shaping.

The length exploitation problem

During RL training, the model discovers that longer responses have higher accuracy on average. This is partly legitimate (more thinking = better answers) and partly exploitation (the model pads responses with repetitive text or explores many possibilities until it finds the right one by chance).

Without length control, response lengths can keep growing. R1-Zero's average response length climbed steadily during RL training, reaching roughly 10,000 tokens. At some point, the extra length is pure waste: the model is rambling, not reasoning.

The fix: soft length penalty

DAPO adds a reward shaping term that penalizes responses exceeding a soft maximum length $L_{\max}$:

$r_{\text{shaped}} = r_{\text{accuracy}} - \lambda \cdot \max(0,\ \mathrm{len}(\text{response}) - L_{\max}) / L_{\max}$

Where $\lambda$ controls the penalty strength and $L_{\max}$ is the soft maximum. Responses under $L_{\max}$ tokens get no penalty. Responses over $L_{\max}$ get a linearly increasing penalty, proportional to how much the response exceeds the limit.

python
# Overlong Reward Shaping
def shaped_reward(accuracy_reward, response_len,
                   max_len=8192, penalty_weight=0.5):
    # No penalty if under max_len
    overlong = max(0, response_len - max_len) / max_len
    penalty = penalty_weight * overlong

    return accuracy_reward - penalty

# Examples:
# len=4000, max=8192: penalty = 0, shaped = accuracy (no change)
# len=10000, max=8192: overlong=0.22, penalty=0.11
# len=16000, max=8192: overlong=0.95, penalty=0.48 (significant!)
Soft vs hard length limits: DAPO uses a SOFT limit: responses can exceed $L_{\max}$ but get penalized proportionally. A hard limit (truncate at $L_{\max}$) would cut off reasoning mid-thought, potentially destroying a correct solution. The soft penalty lets the model decide: is the extra reasoning worth the penalty? For truly hard problems, the model may choose to think longer and accept the penalty. For easier problems, it stays concise.
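The "is it worth the penalty?" trade-off has a concrete break-even point under the linear penalty used in this chapter. The sketch below assumes binary accuracy rewards (1 for correct, 0 for wrong) and the example defaults of 8,192 tokens and a penalty weight of 0.5; `break_even_length` is a hypothetical helper, not something defined in the paper.

```python
def shaped_reward(accuracy_reward, response_len, max_len=8192, penalty_weight=0.5):
    """Accuracy reward minus a linear penalty on the excess length."""
    overlong = max(0, response_len - max_len) / max_len
    return accuracy_reward - penalty_weight * overlong


def break_even_length(max_len=8192, penalty_weight=0.5, accuracy_gain=1.0):
    """Length at which an overlong correct answer earns the same as a short wrong one.

    Solves accuracy_gain - penalty_weight * (L - max_len) / max_len = 0 for L.
    """
    return max_len * (1.0 + accuracy_gain / penalty_weight)


limit = break_even_length()
print(f"break-even length: {limit:.0f} tokens")
print(f"correct at 12,000 tokens: {shaped_reward(1.0, 12000):.3f}")
print(f"wrong at 4,000 tokens:    {shaped_reward(0.0, 4000):.3f}")
```

With these defaults a correct answer keeps a positive shaped reward until it is three times the soft limit, so the penalty weight sets how much extra length a correct answer is allowed to buy.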

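Setting $L_{\max}$

The examples in this chapter use $L_{\max}$ = 8,192 tokens, which allows substantial reasoning (a few pages of text) while penalizing extreme responses. The paper's own runs set an expected maximum length of 16,384 tokens plus a 4,096-token soft-penalty buffer, for a generation cap of 20,480 tokens. Either way, the value is chosen empirically: too low, and the model can't reason through hard problems; too high, and it doesn't constrain waste.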

Interaction with other techniques

The four DAPO techniques interact cleanly:

| Technique | Controls | Mechanism |
| --- | --- | --- |
| Clip-Higher | Policy update magnitude | Asymmetric clipping of probability ratios |
| Dynamic Sampling | Learning signal quality | Filtering degenerate groups |
| Token-Level KL | Distribution drift | Length-invariant regularization |
| Overlong Shaping | Response length | Soft penalty on overlong responses |

None of these techniques conflict. Clip-Higher controls how much the policy changes per update. Dynamic Sampling ensures every update uses informative data. Token-Level KL prevents distribution collapse. Overlong Shaping prevents length exploitation. They're orthogonal fixes for orthogonal problems.

Length Penalty Visualizer

See how the overlong reward penalty affects the shaped reward. Adjust response length and $L_{\max}$ to see the penalty kick in.

Why does DAPO use a soft length penalty instead of a hard truncation limit?

Chapter 5: Results

DAPO is evaluated by training Qwen2.5-32B on math reasoning tasks using DAPO versus GRPO and other baselines. The headline result: DAPO reaches 50/100 on AIME 2024 with a 32B model — competitive with much larger systems.

AIME 2024 comparison

| Model | Size | AIME 2024 | Algorithm |
| --- | --- | --- | --- |
| DeepSeek-R1-Zero | 671B MoE | 71.0 | GRPO |
| DeepSeek-R1 | 671B MoE | 79.8 | GRPO + SFT |
| Qwen2.5-32B + GRPO | 32B | 38.0 | GRPO |
| Qwen2.5-32B + DAPO | 32B | 50.0 | DAPO |

DAPO improves AIME 2024 from 38 to 50 on the same 32B model — a 32% relative improvement. This closes a significant portion of the gap to the 671B R1 model while using only a fraction of the compute.

The significance of 50/100 on AIME with 32B: AIME is a competition-level math exam designed for top high school students. Scoring 50/100 means the model solves about half of these challenging problems. Doing this with a 32B model (not 671B) shows how much performance was being lost to training issues rather than model capacity: fixing four of them gave +12 AIME points on the same base model.

Ablation: each technique's contribution

| Configuration | AIME | Contribution |
| --- | --- | --- |
| GRPO baseline | 38.0 | |
| + Clip-Higher | 41.5 | +3.5 |
| + Dynamic Sampling | 44.0 | +2.5 |
| + Token-Level KL | 47.0 | +3.0 |
| + Overlong Shaping | 50.0 | +3.0 |

Each technique contributes roughly 2.5-3.5 AIME points when added in this order. No single technique dominates, which is consistent with the techniques addressing separate failure modes and with DAPO's design philosophy: four minimal fixes for four specific problems.

Training stability

Beyond raw performance, DAPO dramatically improves training stability. With GRPO, training often requires manual intervention: restarting from checkpoints when the model collapses, adjusting hyperparameters mid-run, monitoring for reward hacking. With DAPO, training runs to completion without intervention — a necessary property for scaling to larger models and longer training.

Ablation Results

See how each DAPO technique contributes to the final AIME score. Click "Add" to incrementally add each technique on top of the GRPO baseline.

What does the ablation study reveal about DAPO's four techniques?

Chapter 6: DAPO Simulator

See all four DAPO techniques working together. This simulator shows a training run with toggleable techniques — enable or disable each fix to see its effect on training stability and final accuracy.

DAPO Training Simulator

Toggle each DAPO technique on/off and run training steps to see how they affect accuracy and stability. With all four off = GRPO baseline. With all four on = full DAPO.

What to observe

With all off (GRPO baseline): Training is unstable. Accuracy plateaus around 38. Entropy collapses. Response length grows unboundedly.

With Clip-Higher only: Learning is faster initially, but still suffers from entropy collapse and length explosion.

With all four (full DAPO): Training is stable, accuracy reaches ~50, entropy stays healthy, and response length stabilizes around the soft maximum.

What happens to training stability when you disable Dynamic Sampling while keeping the other three techniques?

Chapter 7: Connections

DAPO is an incremental but important contribution: it takes the GRPO algorithm (introduced in DeepSeekMath and used to train DeepSeek-R1) and makes it production-ready for large-scale training. It sits in the lineage of RL algorithms for LLMs.

| Algorithm | Year | Relationship to DAPO |
| --- | --- | --- |
| PPO | 2017 | The original clipped surrogate objective. DAPO inherits and modifies the clipping. |
| GRPO | 2024 | Removes PPO's critic. DAPO fixes GRPO's four remaining weaknesses. |
| DAPO | 2025 | Clip-Higher + Dynamic Sampling + Token KL + Overlong Shaping. This paper. |
| ReMax | 2024 | Alternative critic-free method using REINFORCE with baseline. Different approach, similar goal. |
| Dr. GRPO | 2025 | Another GRPO fix that removes the length and standard-deviation normalization terms to reduce optimization bias. Complementary to DAPO. |

What DAPO got right

Precise diagnosis. Instead of proposing a new algorithm, DAPO precisely identified four failure modes and applied minimal fixes. This approach is more reproducible and easier to build on.

Full open-source. The training code (built on the verl framework) and the curated math dataset are released alongside the hyperparameters, so other teams can rerun the recipe instead of guessing at it.

What DAPO left open

Process rewards. DAPO still uses outcome-only rewards. Combining DAPO with process reward models could yield further improvements.

Beyond math. DAPO is evaluated primarily on math reasoning. Its effectiveness on code generation, scientific reasoning, and general instruction following is less clear.

The engineering lesson. DAPO teaches that algorithmic improvements often come not from novel theory but from careful debugging of existing methods at scale. The four failure modes DAPO fixes are the kind of issues that only appear with large models and long training runs. Finding and fixing them required careful empirical work — monitoring training curves, analyzing failure cases, and testing hypotheses one at a time.

DeepSeek-R1: the GRPO-trained reasoning model that DAPO builds on. Read the R1 lesson →

PPO — The original clipped surrogate objective. Read the PPO lesson →

RL Algorithm Evolution

See how RL algorithms for LLMs evolved from PPO to GRPO to DAPO.

What is the main lesson from DAPO's approach to improving GRPO?