DAPO (ByteDance 2025)

Chapter 0: GRPO's Failure Modes

DeepSeek-R1 proved that GRPO can teach language models to reason. But when ByteDance tried to reproduce R1's results at scale, they hit four systematic failure modes that prevented stable training. DAPO is the fix.

Think of GRPO as a car that can go fast but pulls to the left, overheats on long drives, and has unreliable brakes. DAPO doesn't replace the car — it fixes each of these specific problems. The result is a training algorithm that scales reliably to 32B parameters without the constant monitoring and manual intervention that GRPO requires.

The four failure modes

Failure Mode	What Happens	DAPO Fix
Clipping bias	PPO/GRPO's symmetric clipping suppresses both good and bad deviations equally, biasing the policy toward the old distribution	Clip-Higher: asymmetric clipping that raises the upper clip bound (ε_high > ε_low) while keeping the lower bound unchanged
Entropy collapse	The policy becomes deterministic too quickly, losing the ability to explore diverse solutions	Dynamic Sampling: filter out problems where accuracy is 0% or 100% to maintain learning signal
KL divergence issues	Sequence-level KL penalty is dominated by long responses, creating a length bias	Token-Level KL: compute KL per-token and average, making the penalty length-invariant
Length exploitation	Model learns to generate very long responses because longer = more correct on average	Overlong Reward Shaping: penalize responses that exceed a soft length limit

DAPO's philosophy: Don't redesign the algorithm — diagnose each failure mode precisely and apply the minimal fix. GRPO is fundamentally sound. It just has four specific engineering weaknesses that emerge at scale. Each of DAPO's four techniques addresses exactly one weakness, and together they make large-scale RL training stable and reproducible.

What is DAPO's relationship to GRPO?

DAPO doesn't replace GRPO: it fixes four specific failure modes (clipping bias, entropy collapse, KL length bias, and length exploitation) that emerge when scaling GRPO to large models, making training stable and reproducible
DAPO is a completely different algorithm from GRPO
DAPO replaces reinforcement learning with supervised learning

Chapter 1: Clip-Higher

PPO and GRPO use clipped surrogate objectives to prevent the policy from changing too much in one step. The standard clip limits the probability ratio to [1-ε, 1+ε]. But this symmetric clipping has an asymmetric effect that creates a bias.

The problem: symmetric clipping is biased

Consider a response with positive advantage (good response, we want more of it). The gradient pushes the probability ratio r_t upward. But the clip at 1+ε limits how much the ratio can increase. If the gradient is strong enough to push r_t beyond 1+ε, the gradient is zeroed out.

Now consider a response with negative advantage (bad response, we want less of it). The gradient pushes r_t downward. The clip at 1-ε limits how much it can decrease. So far, symmetric.

But the two clips are not equally costly in practice. A token that currently has low probability can gain only a tiny amount of absolute probability before its ratio hits 1+ε, so the upper clip is the one that blocks the policy from promoting rare but useful tokens. For positive advantages we WANT those probabilities to rise, but the clip cuts the update short. This keeps the policy close to the old distribution and slows learning.

Standard PPO: clip(r_t, 1−ε_low, 1+ε_high) with ε_low = ε_high = 0.2

The fix: asymmetric clipping

DAPO's Clip-Higher technique uses a higher upper clip bound (ε_high) than lower bound (ε_low). For positive advantages, the ratio can increase more before being clipped. For negative advantages, the ratio is clipped as before.

DAPO: clip(r_t, 1−ε_low, 1+ε_high) with ε_low = 0.2, ε_high = 0.28

In practice, ε_high is set to about 1.4x ε_low. This asymmetry lets the policy move more aggressively toward good responses while still being cautious about moving away from bad ones.
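To make the effect of the upper bound concrete, it helps to look at how far a token's probability can rise before the clip switches its gradient off. The sketch below is illustrative only: `max_prob_after_clip` is a hypothetical helper, and the two starting probabilities are made-up values.

```python
def max_prob_after_clip(old_prob: float, eps_high: float) -> float:
    """Largest new probability that still receives gradient for a positive advantage.

    The ratio new_prob / old_prob is clipped at 1 + eps_high, and a
    probability can never exceed 1.0.
    """
    return min(1.0, old_prob * (1.0 + eps_high))


for old_prob in (0.01, 0.9):
    sym = max_prob_after_clip(old_prob, eps_high=0.2)
    high = max_prob_after_clip(old_prob, eps_high=0.28)
    print(f"old={old_prob:.2f}  eps_high=0.20 -> {sym:.4f}  eps_high=0.28 -> {high:.4f}")
```

A rare token at probability 0.01 has an absolute headroom of only about 0.002 under the symmetric clip, while a token already at 0.9 is limited by the probability ceiling of 1.0 rather than by the clip. Raising ε_high widens the headroom where it is tightest.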

python
import torch

# Standard PPO clipping (symmetric)
def ppo_clip(ratio, advantage, eps=0.2):
    surr1 = ratio * advantage                                  # unclipped surrogate
    surr2 = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage   # clipped surrogate
    return torch.min(surr1, surr2)                             # pessimistic (lower) bound

# DAPO Clip-Higher (asymmetric)
# Positive advantages: ratio can go up to 1.28 before clipping (more room)
# Negative advantages: ratio clips at 0.80 (same as PPO)
def dapo_clip(ratio, advantage, eps_low=0.2, eps_high=0.28):
    surr1 = ratio * advantage
    surr2 = torch.clamp(ratio, 1 - eps_low, 1 + eps_high) * advantage
    return torch.min(surr1, surr2)

Why asymmetry helps: The symmetric clip comes from PPO, which was developed on control and game-playing benchmarks where conservative updates are the priority. For LLM reasoning, we WANT the model to aggressively adopt strategies that work. Clip-Higher gives the model more room to increase the probability of correct reasoning patterns while keeping the same limit on how fast it moves away from incorrect ones.
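The clipping rule can also be read as a switch that decides whether a sample still contributes gradient. The following is a minimal scalar sketch of that reading, using plain Python instead of tensors; the function names and the example ratios are assumptions chosen for illustration.

```python
def clipped_surrogate(ratio: float, advantage: float,
                      eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    """Scalar version of min(r * A, clip(r, 1 - eps_low, 1 + eps_high) * A)."""
    clipped_ratio = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, clipped_ratio * advantage)


def gradient_is_active(ratio: float, advantage: float,
                       eps_low: float = 0.2, eps_high: float = 0.28) -> bool:
    """True when the unclipped term is selected, so the ratio still gets gradient."""
    if advantage > 0:
        return ratio <= 1.0 + eps_high
    if advantage < 0:
        return ratio >= 1.0 - eps_low
    return False  # zero advantage contributes no gradient either way


# A good response whose ratio has already risen to 1.25:
print(gradient_is_active(1.25, +1.0, eps_high=0.2))   # symmetric clip
print(gradient_is_active(1.25, +1.0, eps_high=0.28))  # Clip-Higher
# A bad response whose ratio has fallen to 0.75 is clipped in both settings:
print(gradient_is_active(0.75, -1.0, eps_high=0.28))
```

With the symmetric bound the first sample has already stopped contributing, while under Clip-Higher it keeps pushing the ratio up until it reaches 1.28.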

Why does DAPO use a higher upper clip bound (ε_high = 0.28) than lower bound (ε_low = 0.2)?

Because for positive advantages (correct responses), we want the policy to aggressively increase their probability: symmetric clipping biases toward the old distribution by equally limiting both upward and downward changes, slowing learning of good strategies
Because the upper bound needs to be larger for mathematical reasons
Because negative advantages are more important than positive ones

Chapter 2: Dynamic Sampling

GRPO samples G responses per problem and computes advantages from the reward distribution within the group. But what happens when all responses are correct or all are wrong? The advantages become meaningless.

The problem: degenerate groups

If all G responses to a problem are correct (all rewards = 1), the mean is 1, the standard deviation is 0, and the z-score advantages are undefined (or all zero after normalization). The model learns nothing from this problem — it already gets it right every time.

Conversely, if all G responses are wrong (all rewards = 0), the mean is 0, std is 0, and again no useful gradient. The model can't learn from problems that are too hard.
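The degenerate case is easy to verify numerically. The sketch below z-scores rewards within a group; the small `eps` in the denominator is an assumed guard against division by zero, and implementations may differ in that detail and in whether they use the population or sample standard deviation.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward against its own group (eps guards the zero-std case)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


print(group_advantages([1, 1, 1, 1]))  # all correct: every advantage is zero
print(group_advantages([0, 0, 0, 0]))  # all wrong: every advantage is zero
print(group_advantages([1, 0, 0, 1]))  # mixed group: positive and negative advantages
```

Only the mixed group produces non-zero advantages, so only it contributes to the policy gradient.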

As training progresses, easy problems become 100%-correct groups and hard problems remain 0%-correct groups. The fraction of "useful" problems — those with a mix of correct and incorrect responses — shrinks. Learning slows down dramatically.

The entropy collapse spiral: As the model improves, more problems become easy (all-correct groups). These provide zero learning signal. The model only learns from the remaining hard problems, but those may be too hard (all-wrong groups). The useful training set shrinks from both ends, and the policy's entropy (diversity) collapses — it becomes deterministic and stops exploring.
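How quickly the useful set shrinks can be estimated with a simple model. Assuming each of the G samples for a problem is independently correct with probability p (a simplification, and the group size of 16 is only an example value), the chance of a mixed group is 1 − p^G − (1 − p)^G:

```python
def prob_mixed_group(p_correct: float, group_size: int) -> float:
    """Chance that G independent samples are neither all correct nor all wrong."""
    all_correct = p_correct ** group_size
    all_wrong = (1.0 - p_correct) ** group_size
    return 1.0 - all_correct - all_wrong


for p in (0.01, 0.5, 0.99):
    print(f"p={p:.2f}  P(useful group)={prob_mixed_group(p, group_size=16):.3f}")
```

Problems near 50% accuracy almost always yield a useful group, while problems the model nearly always solves or nearly always fails are filtered out most of the time.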

The fix: dynamic sampling

DAPO's Dynamic Sampling filters out degenerate groups before computing the GRPO update. For each problem, after sampling G responses:

Sample G responses

Generate G outputs for the problem, compute reward for each.

↓

Check diversity

Are all rewards identical? If all correct or all wrong, SKIP this problem.

↓

Compute advantages

Only compute GRPO advantages for groups with mixed rewards (some correct, some wrong).

↓

Oversample to compensate

To maintain batch size, sample more problems. Target: N problems with useful signal per batch.

python
# Dynamic Sampling in DAPO (sketch)
import random

def dapo_batch(model, problems, reward_fn, grpo_update, G=64, target_useful=32):
    # reward_fn(response, problem) and grpo_update(model, groups) are supplied by the caller
    useful_groups = []

    for problem in random.sample(problems, len(problems)):  # visit problems in random order
        # Sample G responses
        responses = [model.generate(problem) for _ in range(G)]
        rewards = [reward_fn(r, problem) for r in responses]

        # Check for degenerate group
        if all(r == rewards[0] for r in rewards):
            continue  # SKIP: all same reward, no learning signal

        # This group has useful signal, so keep it
        useful_groups.append((responses, rewards))

        if len(useful_groups) >= target_useful:
            break

    # Compute GRPO update only on useful groups
    # (the batch may come up short if the problem pool runs out first)
    return grpo_update(model, useful_groups)

The oversampling cost

Dynamic sampling wastes some compute on problems that get filtered out. If 40% of problems are degenerate, only 60% of sampled groups survive, so filling each batch takes about 1/0.6 ≈ 1.67 times as many problems, or roughly 67% more. This cost increases during training as more problems become easy. But the alternative, training on degenerate groups with zero signal, is worse: it spends the compute AND provides no learning.
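A back-of-the-envelope way to size the oversampling: if a fraction p of groups is degenerate, each sampled prompt survives filtering with probability 1 − p, so filling N slots takes about N / (1 − p) prompts on average. The helper below is a hypothetical illustration of that arithmetic, not part of any DAPO codebase.

```python
def prompts_needed(target_useful: int, degenerate_fraction: float) -> float:
    """Expected number of prompts to sample so that target_useful survive filtering."""
    if not 0.0 <= degenerate_fraction < 1.0:
        raise ValueError("degenerate_fraction must be in [0, 1)")
    return target_useful / (1.0 - degenerate_fraction)


for frac in (0.0, 0.4, 0.8):
    n = prompts_needed(32, frac)
    print(f"degenerate={frac:.0%}  prompts needed={n:.1f}  overhead={n / 32 - 1:.0%}")
```

The overhead grows nonlinearly: at 80% degenerate groups, each batch needs about five times as many prompts as it keeps.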

Why does DAPO skip groups where all G responses have the same reward?

Because when all rewards are identical, the z-score advantages are zero: the model gets no gradient signal about which responses are better, wasting compute on a problem that provides no learning
Because identical rewards mean the problem is broken
To save memory

Chapter 3: Token-Level KL

GRPO uses a KL divergence penalty to prevent the policy from drifting too far from a reference model. But the standard implementation has a subtle length bias that distorts training.

The problem: sequence-level KL favors short responses

Standard KL is computed at the sequence level: sum the per-token KL divergence across all tokens in the response, then average across responses. This means longer responses accumulate more KL, getting a larger penalty. The model learns: shorter responses = less KL penalty = better.

KL_seq = ∑_{t=1}^{T} KL(t) ← grows with response length T

This creates a perverse incentive: the model can reduce its KL penalty by generating shorter responses, even if those shorter responses are less accurate. For reasoning tasks where longer thinking = better accuracy, this is catastrophic.

The fix: token-level KL

DAPO changes the KL computation to be token-level: compute per-token KL, then average over tokens instead of summing. This makes the penalty invariant to response length.

KL_token = (1/T) ∑_{t=1}^{T} KL(t) ← length-invariant

python
# Sequence-level KL (GRPO default, biased toward short responses)
# Note: the log-prob difference on sampled tokens is a simple single-sample KL estimate.
# Problem: longer response → larger KL → larger penalty
def seq_kl(policy_logprobs, ref_logprobs, mask):
    # policy_logprobs, ref_logprobs, mask: [batch, seq_len]; mask is 1 for real tokens, 0 for padding
    per_token_kl = (policy_logprobs - ref_logprobs) * mask   # [batch, seq_len]
    return per_token_kl.sum(dim=-1).mean()  # SUM over tokens, mean over batch

# Token-level KL (DAPO, length-invariant)
# Fix: KL per token, averaged → same penalty for 100 or 1000 tokens
def token_kl(policy_logprobs, ref_logprobs, mask):
    per_token_kl = (policy_logprobs - ref_logprobs) * mask
    # MEAN over real tokens (not sum), then mean over batch
    per_seq_kl = per_token_kl.sum(dim=-1) / mask.sum(dim=-1).clamp(min=1)
    return per_seq_kl.mean()

The interaction with response length. Token-level KL interacts with DAPO's overlong reward shaping (Chapter 4). Together, they decouple the KL regularization from length control. KL says "don't deviate too far from the reference model's token-level behavior." Overlong shaping says "don't generate excessively long responses." Each addresses one concern cleanly.

Why this matters for reasoning

Reasoning models like R1 generate responses of 1,000-10,000 tokens. With sequence-level KL, a 5,000-token response gets 50x the KL penalty of a 100-token response. This creates enormous pressure to be brief — which directly undermines the "extended thinking" that makes reasoning models powerful. Token-level KL removes this pressure entirely.
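The scaling is simple to check with numbers. The sketch below assumes every token contributes the same KL (the value 0.01 is made up for illustration) and compares the summed penalty with the averaged one at three response lengths.

```python
def kl_penalties(per_token_kl: float, num_tokens: int):
    """Return (sequence-level sum, token-level mean) for a constant per-token KL."""
    total = per_token_kl * num_tokens
    return total, total / num_tokens


for length in (100, 1000, 5000):
    seq, tok = kl_penalties(per_token_kl=0.01, num_tokens=length)
    print(f"T={length:5d}  sum={seq:7.2f}  mean={tok:.4f}")
```

The summed penalty grows in proportion to length, so the 5,000-token response pays 50 times what the 100-token response pays, while the averaged penalty stays the same at every length.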

Why does DAPO average per-token KL instead of summing it across the sequence?

Because summing KL creates a length bias: longer responses get proportionally larger penalties, incentivizing the model to be brief, which undermines the extended thinking that reasoning models need. Averaging makes the penalty length-invariant.
Because averaging is computationally cheaper
Because individual token KL values are more interpretable

Chapter 4: Overlong Reward Shaping

Even with token-level KL (which removes the length bias from regularization), the model can still learn to exploit response length. Longer reasoning traces are more likely to stumble on the correct answer by exhaustive enumeration. DAPO adds explicit length control through reward shaping.

The length exploitation problem

During RL training, the model discovers that longer responses have higher accuracy on average. This is partly legitimate (more thinking = better answers) and partly exploitation (the model pads responses with repetitive text or explores many possibilities until it finds the right one by chance).

Without length control, response lengths grow unboundedly. R1-Zero's responses grew from ~100 to ~10,000 tokens. At some point, the extra length is pure waste — the model is rambling, not reasoning.

The fix: soft length penalty

DAPO adds a reward shaping term that penalizes responses exceeding a soft maximum length L_max:

r_shaped = r_accuracy − λ · max(0, len(response) − L_max) / L_max

Where λ controls the penalty strength and L_max is the soft maximum. Responses under L_max tokens get no penalty. Responses over L_max get a linearly increasing penalty. The penalty is proportional to how much the response exceeds the limit.

python
# Overlong Reward Shaping
def shaped_reward(accuracy_reward, response_len,
                   max_len=8192, penalty_weight=0.5):
    # No penalty if under max_len
    overlong = max(0, response_len - max_len) / max_len
    penalty = penalty_weight * overlong

    return accuracy_reward - penalty

# Examples:
# len=4000, max=8192: penalty = 0, shaped = accuracy (no change)
# len=10000, max=8192: overlong=0.22, penalty=0.11
# len=16000, max=8192: overlong=0.95, penalty=0.48 (significant!)

Soft vs hard length limits: DAPO uses a SOFT limit — responses can exceed L_max but get penalized proportionally. A hard limit (truncate at L_max) would cut off reasoning mid-thought, potentially destroying a correct solution. The soft penalty lets the model decide: is the extra reasoning worth the penalty? For truly hard problems, the model may choose to think longer and accept the penalty. For easier problems, it stays concise.
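One property of an unbounded linear penalty is that a very long response can receive an arbitrarily negative reward. A bounded variant ramps the penalty linearly across a fixed buffer and then caps it at -1. The sketch below shows that shape; the function name, parameter names, and the 8,192 and 2,048 token values are assumptions chosen for illustration.

```python
def buffered_length_penalty(response_len: int,
                            penalty_free_len: int = 8192,
                            buffer_len: int = 2048) -> float:
    """Penalty in [-1, 0]: zero up to penalty_free_len, linear across the buffer, then -1."""
    if response_len <= penalty_free_len:
        return 0.0
    if response_len <= penalty_free_len + buffer_len:
        return (penalty_free_len - response_len) / buffer_len
    return -1.0


for n in (4000, 8192, 9216, 10240, 16000):
    print(n, buffered_length_penalty(n))
```

The shaped reward is then the accuracy reward plus this penalty. Because the penalty is capped, a correct but overlong answer is never scored worse than an incorrect answer minus one.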

Setting L_max

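The DAPO paper allows up to 16,384 tokens before any penalty applies and ramps the penalty over a further 4,096-token buffer, for a cap of 20,480 generated tokens; the 8,192 default in the code above is only illustrative. This allows substantial reasoning while penalizing extreme responses. The value is chosen empirically: too low, and the model can't reason through hard problems; too high, and it doesn't constrain waste.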

Interaction with other techniques

The four DAPO techniques interact cleanly:

Technique	Controls	Mechanism
Clip-Higher	Policy update magnitude	Asymmetric clipping of probability ratios
Dynamic Sampling	Learning signal quality	Filtering degenerate groups
Token-Level KL	Distribution drift	Length-invariant regularization
Overlong Shaping	Response length	Soft penalty on overlong responses

None of these techniques conflict. Clip-Higher controls how much the policy changes per update. Dynamic Sampling ensures every update uses informative data. Token-Level KL prevents distribution collapse. Overlong Shaping prevents length exploitation. They're orthogonal fixes for orthogonal problems.

Why does DAPO use a soft length penalty instead of a hard truncation limit?

Because hard truncation cuts off reasoning mid-thought, potentially destroying correct solutions: the soft penalty lets the model decide whether extra reasoning is worth the penalty, allowing longer thinking for truly hard problems while discouraging waste on easy ones
Because soft penalties are easier to implement
Because all responses should be the same length

Chapter 5: Results

DAPO is evaluated by training Qwen2.5-32B on math reasoning tasks using DAPO versus GRPO and other baselines. The headline result: DAPO reaches 50/100 on AIME 2024 with a 32B model — competitive with much larger systems.

AIME 2024 comparison

Model	Size	AIME 2024	Algorithm
DeepSeek-R1-Zero	671B MoE	71.0	GRPO
DeepSeek-R1	671B MoE	79.8	GRPO + SFT
Qwen2.5-32B + GRPO	32B	38.0	GRPO
Qwen2.5-32B + DAPO	32B	50.0	DAPO

DAPO improves AIME 2024 from 38 to 50 on the same 32B model — a 32% relative improvement. This closes a significant portion of the gap to the 671B R1 model while using only a fraction of the compute.

The significance of 50/100 on AIME with 32B: AIME is a competition-level math exam designed for top high school students. Scoring 50/100 means the model solves about half of these challenging problems on average. Doing this with a 32B model (not 671B) shows how much algorithmic fixes alone are worth: on the same base model, the four changes add 12 AIME points over the GRPO baseline, although the 671B R1 models in the table above still score considerably higher.

Ablation: each technique's contribution

Configuration	AIME	Contribution
GRPO baseline	38.0	—
+ Clip-Higher	41.5	+3.5
+ Dynamic Sampling	44.0	+2.5
+ Token-Level KL	47.0	+3.0
+ Overlong Shaping	50.0	+3.0

Each technique contributes roughly 2.5-3.5 AIME points. No single technique dominates — they address independent failure modes. This confirms DAPO's design philosophy: four minimal fixes for four specific problems.

Training stability

Beyond raw performance, DAPO dramatically improves training stability. With GRPO, training often requires manual intervention: restarting from checkpoints when the model collapses, adjusting hyperparameters mid-run, monitoring for reward hacking. With DAPO, training runs to completion without intervention — a necessary property for scaling to larger models and longer training.

What does the ablation study reveal about DAPO's four techniques?

Each technique contributes roughly equally (2.5-3.5 AIME points), confirming they address independent failure modes: no single technique dominates, and all four are needed for the full 12-point improvement over GRPO
Clip-Higher is by far the most important technique
The techniques are redundant with each other

Chapter 6: DAPO Simulator

See all four DAPO techniques working together. This simulator shows a training run with toggleable techniques — enable or disable each fix to see its effect on training stability and final accuracy.

What to observe

With all off (GRPO baseline): Training is unstable. Accuracy plateaus around 38. Entropy collapses. Response length grows unboundedly.

With Clip-Higher only: Learning is faster initially, but still suffers from entropy collapse and length explosion.

With all four (full DAPO): Training is stable, accuracy reaches ~50, entropy stays healthy, and response length stabilizes around the soft maximum.

What happens to training stability when you disable Dynamic Sampling while keeping the other three techniques?

Training becomes less stable because without filtering degenerate groups (all-correct or all-wrong), many GRPO updates provide zero learning signal, causing entropy collapse and slower convergence
Nothing changes because the other three techniques compensate
Training becomes faster because there's more data

Chapter 7: Connections

DAPO is an incremental but important contribution: it takes the GRPO algorithm (introduced in DeepSeekMath and popularized by DeepSeek-R1) and makes it production-ready for large-scale training. It sits in the lineage of RL algorithms for LLMs.

Algorithm	Year	Relationship to DAPO
PPO	2017	The original clipped surrogate objective. DAPO inherits and modifies the clipping.
GRPO	2024	Introduced in DeepSeekMath and later used to train DeepSeek-R1. Removes PPO's critic. DAPO fixes GRPO's four remaining weaknesses.
DAPO	2025	Clip-Higher + Dynamic Sampling + Token KL + Overlong Shaping. This paper.
ReMax	2024	Alternative critic-free method using REINFORCE with baseline. Different approach, similar goal.
Dr. GRPO	2025	Another GRPO fix that removes the length and standard-deviation normalization biases in the objective. Complementary to DAPO.

What DAPO got right

Precise diagnosis. Instead of proposing a new algorithm, DAPO precisely identified four failure modes and applied minimal fixes. This approach is more reproducible and easier to build on.

Fully open source. The training code, dataset, and hyperparameters are released, so other teams can rerun the recipe and check the results for themselves.

What DAPO left open

Process rewards. DAPO still uses outcome-only rewards. Combining DAPO with process reward models could yield further improvements.

Beyond math. DAPO is evaluated primarily on math reasoning. Its effectiveness on code generation, scientific reasoning, and general instruction following is less clear.

The engineering lesson. DAPO teaches that algorithmic improvements often come not from novel theory but from careful debugging of existing methods at scale. The four failure modes DAPO fixes are the kind of issues that only appear with large models and long training runs. Finding and fixing them required careful empirical work — monitoring training curves, analyzing failure cases, and testing hypotheses one at a time.

DeepSeek-R1: the reasoning model whose GRPO training recipe DAPO builds on and improves.

PPO: the original clipped surrogate objective.

What is the main lesson from DAPO's approach to improving GRPO?

That algorithmic improvements often come from careful debugging of existing methods at scale, by precisely diagnosing each failure mode and applying minimal, targeted fixes, rather than proposing entirely new algorithms
That completely new algorithms are always needed
That larger models are always the solution

DAPO: Decoupled Clip and Dynamic sAmpling Policy Optimization