Poly-EPO — Veanors

Chapter 0: The Problem

You have a capable base language model. It can solve hard math problems in many different ways — algebraic manipulation, substitution, geometric reasoning, recursion. Each generation might take a completely different path to the same answer. Some paths succeed where others fail. This diversity is a feature: when you sample k times, at least one path usually finds the answer.

Now you apply RL post-training — GRPO, PPO, whatever your favorite algorithm is. The reward is simple: +1 for correct, 0 for wrong. After a few hundred training steps, accuracy on pass@1 goes up. Great.

But something terrible is happening underneath. The model is collapsing. It found two or three strategies that work often, and it is doubling down on those. All the other strategies — the ones that work on the problems the dominant strategies cannot solve — are dying.

The diversity collapse paradox: RL fine-tuning makes your model better at pass@1 but worse at pass@k. After enough GRPO training, the BASE MODEL outperforms the RL-trained model at k=32. The untrained model, with all its naive diversity, finds more unique solutions than the "improved" model locked onto a handful of strategies.

The chart below shows the core problem. GRPO starts with 8+ reasoning clusters (distinct strategy families) and collapses to 2-3 within a few hundred steps. The model forgets how to think differently.

Diversity Collapse

Number of distinct reasoning strategy clusters over training steps. GRPO (teal) collapses rapidly. Poly-EPO (warm) maintains and grows diversity.

This is not just a theoretical concern. On AIME 2024, pass@32 for the base model (Qwen3-4B) is higher than for the GRPO-trained version. The RL model wins at k=1, ties at k=4, and loses at larger k. You traded broad competence for narrow excellence.

Why does this happen? Think about the RL training loop. Each step, you sample responses and reward the correct ones. If strategy A works on 60% of problems and strategy B works on 30%, strategy A gets reinforced more often. Its probability mass grows. Strategy B's mass shrinks. After a few hundred steps, strategy A dominates. But those 30% of problems where B was the only viable approach? The model has lost the ability to solve them.
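The dynamic described above can be illustrated with a toy simulation. This is only a sketch under strong assumptions: two strategies, a multiplicative-weights update standing in for a real GRPO step, and a made-up learning rate `eta`. Only the 60% and 30% success rates come from the paragraph.

```python
import math

def simulate_collapse(success, probs, eta=0.5, steps=300):
    """Toy multiplicative-weights update: each strategy's probability is
    scaled by exp(eta * (its success rate - current mean success))."""
    probs = list(probs)
    for _ in range(steps):
        mean_success = sum(p * s for p, s in zip(probs, success))
        weights = [p * math.exp(eta * (s - mean_success))
                   for p, s in zip(probs, success)]
        total = sum(weights)
        probs = [w / total for w in weights]  # renormalize to a distribution
    return probs

# Strategy A succeeds 60% of the time, strategy B 30%; start at 50/50.
final = simulate_collapse(success=[0.6, 0.3], probs=[0.5, 0.5])
print([round(p, 4) for p in final])
```

In this toy model strategy B's probability mass decays geometrically, even though B may be the only approach that works on some problems.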

The root cause is that standard RL has no signal for diversity. The advantage function asks "is this response better than average?" not "does this response bring something new to the table?" A fifth correct response using strategy A gets the same positive advantage as the first correct response using strategy B, even though the marginal value of yet-another-A is nearly zero while a novel B is enormously valuable for coverage.
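A small numeric check makes the point concrete. The group below is invented for illustration (five correct A's, one correct B, two incorrect C's), and the advantage is the plain group z-score; real GRPO implementations may differ in details such as the standard-deviation estimator.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards):
    """Group-normalized advantage: z-score of each reward within the group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / sigma if sigma > 0 else 0.0 for r in rewards]

strategies = ["A", "A", "A", "A", "A", "B", "C", "C"]
rewards    = [1, 1, 1, 1, 1, 1, 0, 0]
advantages = grpo_advantages(rewards)

for s, r, a in zip(strategies, rewards, advantages):
    print(s, r, round(a, 3))
```

The fifth copy of A and the only copy of B receive exactly the same advantage, because the strategy label never enters the computation.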

This is a well-known problem in reinforcement learning called mode collapse, and it happens in many domains beyond LLMs. In robotics, a policy might collapse onto one grasping strategy. In game-playing, an agent might find one exploit and never learn alternatives. The LLM setting makes it especially costly because the base model's pre-existing diversity — learned from the vast variety of human reasoning in the training corpus — is a precious resource that RL training actively destroys.

The paradox, quantified

Consider a concrete example. The base model has 10 reasoning strategies, each with a 20% success rate. On a problem where each sample succeeds with probability 0.2, the probability that at least one of k=32 samples succeeds is 1 − (1 − 0.2)³² ≈ 0.999. Because the diverse strategies cover different problem types, most problems fall into this regime.

After GRPO training, the model has 2 strategies at 45% success rate each. Pass@1 improved (0.45 vs 0.20). But pass@32? The 2 strategies cover fewer problem types. On problems where neither strategy applies, even 32 samples cannot help. The base model's broad strategy portfolio — despite each strategy being weaker — provides better coverage at high k.
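The trade-off can be sketched with a toy pass@k calculation. The per-problem numbers are assumptions chosen to match the pass@1 figures above: the base model solves every problem with probability 0.20 per sample, while the collapsed model solves 6 of 10 problems with probability 0.75 and the remaining 4 never (so its pass@1 is 0.45). The 60% coverage split is invented for illustration.

```python
def pass_at_k(per_problem_success, k):
    """Mean over problems of P(at least one of k independent samples is correct)."""
    return sum(1 - (1 - p) ** k for p in per_problem_success) / len(per_problem_success)

base = [0.20] * 10                    # broad coverage, weak per-sample accuracy
collapsed = [0.75] * 6 + [0.0] * 4    # strong on 60% of problems, blind on the rest

for k in (1, 4, 32):
    print(k, round(pass_at_k(base, k), 3), round(pass_at_k(collapsed, k), 3))
```

Under these assumptions the collapsed model wins at k=1 (0.45 vs 0.20), roughly ties at k=4 (about 0.598 vs 0.590), and loses at k=32 (about 0.600 vs 0.999), because no amount of resampling helps on the problems it can no longer reach.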

This is the exploration-exploitation trade-off, but at the model level rather than the action level. GRPO exploits known-good strategies aggressively. Poly-EPO maintains a balanced portfolio, accepting slightly lower per-strategy accuracy in exchange for much broader coverage. The portfolio-level optimization is what Set RL provides.

Why can a base model outperform a GRPO-trained model at pass@k for large k?

The base model is just smarter before fine-tuning
GRPO collapses diversity onto few strategies, so repeated sampling keeps trying the same approaches, while the diverse base model explores more paths
GRPO training corrupts the model weights

Chapter 1: The Key Insight

Standard RL scores each generation independently. Generation y_i gets reward r(x, y_i), and the advantage measures how y_i compares to other generations. The problem? There is no signal for whether y_i is different from the other generations. As long as y_i gets a good reward, it gets reinforced — even if it is a carbon copy of every other successful generation.

This is the fundamental design flaw. GRPO, PPO, REINFORCE — all standard policy gradient methods score individual outputs against a baseline. They optimize "how good is this output?" but never ask "how good is this output given what we already have?" The marginal value of the 10th copy of strategy A is near zero, but standard RL gives it the same advantage as the 1st copy of a novel strategy B.

The key insight: Instead of scoring individual generations, score sets of generations. Define an objective over a set {y₁, ..., y_n} that is the product of average reward and diversity. Since both factors are non-negative, the set cannot score well unless it is BOTH high-reward AND diverse. This creates a natural synergy — no λ-tuning required.

This is Poly-EPO: Polychromic Exploratory Policy Optimization. "Polychromic" means "many-colored" — the generations in a set should cover many distinct reasoning strategies, like a spectrum of colors rather than a monochrome beam.

The architecture has three pieces:

Set RL framework. Sample N generations per prompt, construct sets of size n, score each set, then compute a marginal set advantage for each individual generation that captures its contribution to all sets it belongs to.
Polychromic objective. f_poly(x, y_1:n) = (mean reward) × (diversity). A product, not a sum. No hyperparameter to balance them.
LM-judge clustering. Diversity is measured by how many distinct reasoning strategies the set contains, as judged by an LM that reads the solutions and clusters them by approach.

The marginal set advantage decomposes into a beautiful result: it naturally encodes the covariance between reward and diversity in the sets containing each generation. Generations that help sets be BOTH rewarding and diverse get extra positive signal. This covariance term is the "synergy" that reward shaping fundamentally cannot produce.

The name "Polychromic" is deliberate. "Mono-chromatic" light has one wavelength — one color, one strategy. "Poly-chromatic" light is a rich mix of wavelengths — a full spectrum. A white light beam (polychromic) can be separated into its components by a prism, revealing hidden structure. A laser beam (monochromatic) is powerful but reveals nothing new. GRPO trains lasers. Poly-EPO trains white light.

What makes the polychromic objective fundamentally different from adding a diversity bonus to the reward?

It uses a product (not sum) of reward and diversity, so a set cannot score well by optimizing only one factor; this creates a covariance synergy term that reward shaping lacks
It uses a better diversity metric
It trains longer

Chapter 2: Why Reward Shaping Fails

Before Poly-EPO, the natural approach to preserving diversity was reward shaping. You add a diversity bonus to the reward:

r_shaped(x, y) = r(x, y) + λ · d(x, y)

where d(x, y) measures how different y is from the other generations, and λ controls the exploration-exploitation trade-off.

This seems reasonable. It gives a generation credit for being unique, even if it is wrong. But it has three fundamental problems:

Problem 1: the decoupling trap

With additive reward shaping, a generation can get a high shaped reward by being correct (r = 1) with no diversity, OR by being wildly different (high d) with no correctness. The sum does not require both. The gradient signal says "be correct" and "be different" but never "be correct in a different way."

Think of it as optimizing two separate objectives that happen to share a loss function. A student who aces the exam by memorizing one method and a student who experiments wildly but gets everything wrong both look "good" to reward shaping. Only a student who aces the exam using novel methods is what we actually want — and that requires the multiplicative coupling that Poly-EPO provides.

Problem 2: λ-tuning is fragile

Too small a λ and diversity gets drowned out by the reward signal — you are back to GRPO. Too large and the model generates diverse garbage that never answers correctly. The sweet spot depends on the task, the model size, and the training stage. It shifts as training progresses.

Worse, the optimal λ is not just "hard to find" — it may not exist as a single constant. Early in training, when accuracy is low, you want more exploration (high λ). Late in training, when accuracy is high, the exploration budget needs to shrink to avoid wasting capacity. This means λ should be a function of training stage, accuracy, and potentially even per-prompt difficulty. At that point, you have replaced one hyperparameter with a scheduling function, which is arguably harder to tune.

The polychromic product objective avoids this entirely. The product naturally adapts: when accuracy is low (reward ≈ 0), the product is low regardless of diversity, so the model focuses on getting correct first. When accuracy is high and diversity is the bottleneck, the product rewards diversity improvements. No scheduling, no λ, no per-task tuning. The product is the schedule.

Problem 3: the impossible threshold

This is the killer. Consider an incorrect generation (r = 0) that adds diversity d to the group. Under reward shaping with GRPO-style advantages, this generation gets advantage:

A_shaped(y) ∝ λ · d(x,y) − p − λ · d̄

where p is the group accuracy and d̄ is mean diversity. For this incorrect-but-diverse generation to have positive advantage (and thus survive training), we need λd(x,y) > p + λd̄, that is, d(x,y) − d̄ > p/λ. As accuracy p increases toward 1, the required margin approaches 1/λ, which is unattainable unless d(x,y) is astronomically large. The exploration signal vanishes exactly when you need it most: when the model is already good and you want it to keep finding new strategies.

The vanishing exploration window: As the model improves (p → 1), the region where an incorrect-but-exploratory generation gets positive advantage shrinks to zero. Reward shaping guarantees that late-stage training kills diversity, no matter what λ you choose.
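The threshold can be computed directly from the inequality λ·d > p + λ·d̄. The sketch below assumes the unnormalized advantage form given above; the mean diversity d̄ = 0.2 is a made-up value, and the closing remark assumes a diversity score bounded in [0, 1].

```python
def shaped_advantage_incorrect(d, d_bar, p, lam):
    """Unnormalized shaped advantage of an incorrect generation (r = 0):
    lam * d - p - lam * d_bar."""
    return lam * d - p - lam * d_bar

def min_diversity_for_positive(d_bar, p, lam):
    """Diversity an incorrect generation must exceed for positive advantage:
    d > d_bar + p / lam."""
    return d_bar + p / lam

for p in (0.3, 0.6, 0.9):
    print(p, min_diversity_for_positive(d_bar=0.2, p=p, lam=0.5))
```

With λ = 0.5 the required diversity climbs from 0.8 at p = 0.3 to 2.0 at p = 0.9. If the diversity score cannot exceed 1, the window has already closed by p = 0.6.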

Reward Shaping Threshold

The shaded region shows where an incorrect-but-diverse generation gets positive advantage under reward shaping. As group accuracy p increases, the region shrinks until the threshold becomes impossible. (Interactive figure; controls: accuracy p, default 0.30, and λ, default 0.50.)

Poly-EPO sidesteps all three problems. Because the objective is a product, you cannot maximize it without both factors being high. Because it operates on sets, there is no λ to tune. And because incorrect generations can still boost the SET's diversity score, they can get positive marginal advantage even when accuracy is near 1. We will prove this in Chapter 5.

The meta-lesson: Reward shaping is an individual-level fix for a population-level problem. Diversity is a property of the set of generations, not of any single generation. Trying to encode set-level properties into individual rewards creates the decoupling problem. The right solution operates at the set level natively, which is exactly what Set RL does.

As model accuracy p approaches 1, what happens to incorrect-but-exploratory generations under reward shaping?

They get stronger signal because they are rare
Their advantage becomes impossible to make positive because the accuracy term dominates the diversity bonus
They are unaffected by accuracy changes

Chapter 3: Set RL Framework

Standard RL scores individual generations. Set RL scores groups of generations. Here is the full construction.

Step 1: Sample N generations

For a prompt x, sample N responses {y₁, ..., y_N} from the current policy π_θ. Think N = 32 or 64.

Step 2: Construct K sets of size n

From the N generations, construct K subsets, each of size n (e.g., n = 8). These are drawn combinatorially without replacement. With N = 32 and n = 8, there are C(32,8) ≈ 10 million possible sets. In practice, you sample K ≈ a few hundred sets uniformly at random.

Why not use all possible sets? Two reasons. First, C(N,n) is astronomically large for practical N and n. Second, you do not need all sets — a random sample provides an unbiased estimate of the true gradient (this is the U-statistic property). More sets reduce variance but hit diminishing returns quickly. K = 200-500 is typically sufficient.

The sets must be sampled without replacement within each set (no generation appears twice in the same set), but a generation CAN appear in multiple different sets. In fact, this is essential — the marginal set advantage aggregates signal across all sets containing a given generation, so each generation must appear in many sets to get a reliable advantage estimate.
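Steps 1 and 2 can be sketched with the standard library. The helper name `sample_index_sets` and the fixed seed are illustrative choices, not part of the method; sets are stored as indices into the list of N generations.

```python
import math
import random

def sample_index_sets(N, n, K, seed=0):
    """Draw K subsets of size n from range(N). No index repeats inside a set,
    but the same index may appear in many different sets."""
    rng = random.Random(seed)
    return [tuple(sorted(rng.sample(range(N), n))) for _ in range(K)]

# Number of possible sets for N = 32, n = 8.
print(math.comb(32, 8))

sets = sample_index_sets(N=32, n=8, K=300)

# How many of the K sets each generation appears in.
appearances = [sum(i in s for s in sets) for i in range(32)]
print(min(appearances), max(appearances))
```

With K = 300, n = 8 and N = 32, each generation lands in K·n/N = 75 sets on average, which is what gives the marginal set advantage enough signal per generation.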

Step 3: Score each set

Evaluate objective f(x, G_j) for each set G_j = {y_j1, ..., y_jn}. The objective can be anything that maps a set of generations to a scalar score. For Poly-EPO, it will be the polychromic score (Chapter 4). But Set RL is general — you could use any function that captures a desirable property of the SET as a whole.

Crucially, the objective is a function of the set, not of any individual member. This is what allows it to capture emergent properties like diversity, which are meaningless for a single generation. You cannot ask "how diverse is this one response?" You can only ask "how diverse is this collection of responses?"

Step 4: Compute set advantage

Just like GRPO computes advantages by normalizing rewards across a group, Set RL computes advantages by normalizing set scores:

A^♯(x, G_j) = f(x, G_j) − f̄(x)

where f̄(x) is the mean score across all K sets. A set that scores above average has positive advantage.

Step 5: Marginal set advantage

Here is the crucial step. We need to convert the set-level signal back into a per-generation signal. For each generation y_i, its marginal set advantage is the sum of advantages of ALL sets containing y_i:

A^♯_marg(x, y_i) = ∑_{j : y_i ∈ G_j} A^♯(x, G_j)

A generation that appears in many high-scoring sets gets high marginal advantage. A generation that drags down every set it joins gets low (or negative) marginal advantage.

Proposition 3.1 (unbiased gradient): The marginal set advantage is an unbiased estimator of the true set RL policy gradient. This is a consequence of U-statistic theory — the sum over all sets containing y_i exactly produces the right gradient, no bias correction needed.

Step 6: Drop into standard RL

Replace the advantage in GRPO (or PPO) with the marginal set advantage. Everything else — clipping, normalization, optimization — stays the same. Set RL is a drop-in replacement for the advantage computation, not a whole new algorithm.

Worked example

Prompt x = "Solve x² + 3x + 2 = 0." You sample N=4 generations:

y₁: factors (x+1)(x+2) → correct, cluster A
y₂: quadratic formula → correct, cluster B
y₃: completing the square → wrong (arithmetic error), cluster C
y₄: factors (x+1)(x+2) → correct, cluster A

Construct K=3 sets of size n=2: S₁={y₁,y₃}, S₂={y₂,y₄}, S₃={y₁,y₂}.

Under the polychromic objective:

S₁: reward = 0.5, diversity = 2/2 = 1.0, f = 0.50
S₂: reward = 1.0, diversity = 2/2 = 1.0, f = 1.00
S₃: reward = 1.0, diversity = 2/2 = 1.0, f = 1.00

Baseline f̄ = 0.833. Set advantages: A(S₁)=−0.333, A(S₂)=+0.167, A(S₃)=+0.167.

Marginal advantages: y₁ appears in S₁,S₃ → −0.333+0.167 = −0.167. y₂ in S₂,S₃ → +0.333. y₃ in S₁ → −0.333. y₄ in S₂ → +0.167.

Notice y₃ (incorrect, unique cluster C) got the most negative advantage. But compare this to pure GRPO, where y₃'s z-score advantage would be about −1.73. Once the marginal advantages are z-scored the same way, y₃ sits at about −1.26: the penalty is milder because y₃ contributed diversity to S₁. If S₁ had been a larger set with more correct members, y₃'s marginal advantage could flip positive.
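The worked example is small enough to reproduce in a few lines. This sketch hard-codes the four generations and three sets from the example (with 0-based indices) and uses the population standard deviation for the z-scores, which is an assumption about the normalization.

```python
from statistics import mean, pstdev

rewards  = [1, 1, 0, 1]            # y1..y4
clusters = ["A", "B", "C", "A"]
sets = [(0, 2), (1, 3), (0, 1)]    # S1, S2, S3 as 0-based indices

def f_poly(members):
    avg_reward = mean(rewards[i] for i in members)
    diversity = len({clusters[i] for i in members}) / len(members)
    return avg_reward * diversity

scores = [f_poly(s) for s in sets]
f_bar = mean(scores)

# Marginal set advantage: sum of (score - baseline) over sets containing y_i.
marginal = [sum(sc - f_bar for sc, s in zip(scores, sets) if i in s)
            for i in range(len(rewards))]

def zscore(xs):
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

print([round(m, 3) for m in marginal])
print(round(zscore(rewards)[2], 2), round(zscore(marginal)[2], 2))
```

The second line prints y₃'s advantage under plain GRPO next to its z-scored marginal set advantage, so the two penalties can be compared on the same scale.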

How to think about marginal set advantage

The marginal set advantage answers a precise question: "Across all the sets that contain generation y, how far do their scores sit above or below the average set score?" A generation that appears in many high-scoring sets gets a large positive marginal advantage. A generation that drags down every set it joins gets a large negative one.

This is loosely analogous to Shapley values in cooperative game theory. Shapley values ask "what is each player's marginal contribution to every possible coalition?" The marginal set advantage asks a related question: "how well do the sets this generation belongs to perform?" The key mathematical result (Proposition 3.1) is that this quantity gives an unbiased estimator of the policy gradient, meaning that in expectation it points in the right direction for improving the set-level objective.

python
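import numpy as np

# Standard GRPO advantage: z-score of each reward within its group
def grpo_advantage(rewards):
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)  # eps avoids 0/0

# Set RL marginal advantage (drop-in replacement)
# sets[j] holds the indices of the generations in G_j; set_scores[j] = f(x, G_j)
def marginal_set_advantage(num_generations, sets, set_scores):
    f_bar = np.mean(set_scores)
    A_marg = np.zeros(num_generations)
    for G_j, f_j in zip(sets, set_scores):
        for i in G_j:
            A_marg[i] += f_j - f_bar
    return A_marg  # use A_marg[i] instead of A_i in GRPO/PPO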

What makes the marginal set advantage an unbiased gradient estimator?

It averages over all possible sets
By U-statistic theory, summing the advantages of all sets containing y_i produces an unbiased estimate of the true set RL policy gradient
The baseline f-bar removes the bias

Chapter 4: The Polychromic Objective

Set RL is a general framework — it works with any set-level objective f. Poly-EPO uses the polychromic objective:

f_poly(x, y_1:n) = (1/n ∑_i r(x, y_i)) · d(x, y_1:n)

Two factors, multiplied:

Average reward: 1/n ∑ r(x, y_i). For binary rewards (1 or 0), this is the fraction of correct generations in the set.
Diversity: d(x, y_1:n) = |unique strategy clusters| / n. If all n generations use the same strategy, d = 1/n. If every generation uses a different strategy, d = 1 (or close to it).

Because this is a product, both factors must be positive for the set to score well. A set of 8 correct generations all using the same strategy scores low (reward = 1.0, diversity = 1/8 = 0.125 → f = 0.125). A set of 8 in which only 4 are correct but the members span 4 different strategies scores higher (reward = 0.5, diversity = 4/8 = 0.5 → f = 0.25).

The optimal set under f_poly has every member correct AND every member using a different strategy. With 8 correct generations across 8 clusters: reward = 1.0, diversity = 1.0, f = 1.0. This is the "white light" ideal — maximum reward and maximum diversity simultaneously.

Notice the product creates a natural diminishing-returns effect. Adding a 5th copy of strategy A to a set increases reward (if correct) but does not increase diversity. The marginal value of that 5th copy, in terms of f_poly, is less than adding a 1st copy of strategy E — even if strategy E has a lower success rate. This is exactly the incentive structure we want.

Why a product? A sum r + λd lets you optimize either factor independently. A product r · d is zero when either factor is zero, and large only when both are large. No λ to tune. No way to "cheat" by being diverse-but-wrong or correct-but-uniform.
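The scores quoted in this chapter can be checked with a minimal version of the objective. The sketch assumes binary rewards and cluster labels that are already known; how the labels are produced is a separate question.

```python
def polychromic_score(rewards, clusters):
    """f_poly = (mean reward) * (number of distinct clusters / n)."""
    n = len(rewards)
    avg_reward = sum(rewards) / n
    diversity = len(set(clusters)) / n
    return avg_reward * diversity

# 8 correct generations, all using one strategy.
same_strategy = polychromic_score([1] * 8, [1] * 8)

# 4 correct out of 8, with the set spanning 4 strategies in total.
four_of_eight = polychromic_score([1, 1, 1, 1, 0, 0, 0, 0],
                                  [1, 2, 3, 4, 1, 2, 3, 4])

# 8 correct generations across 8 distinct strategies.
white_light = polychromic_score([1] * 8, list(range(1, 9)))

print(same_strategy, four_of_eight, white_light)
```

The three cases give 0.125, 0.25 and 1.0. The half-correct but varied set outscores the fully correct monoculture, and only a set that is both correct and varied reaches the maximum.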

LM-judge clustering

How do you measure diversity? Not by surface text (paraphrases are not diverse). Not by final answer (same answer, different paths). You cluster by reasoning strategy.

An LM judge reads all N generations and assigns each to a cluster based on:

Macro strategy: the overall framework (e.g., recursion vs. closed-form series vs. generating functions vs. geometric argument)
Micro strategy: the specific technique at key decision points (e.g., completing the square vs. using the quadratic formula)

Cluster 100 is reserved for degenerate responses — off-topic, repetitive, or incomprehensible generations. These add no diversity.

Why not simpler diversity metrics?

You might wonder: why not use embedding cosine distance, or BLEU score, or even just "different final answers"? The problem is that these metrics conflate surface-level variation with genuine strategic diversity.

Embedding distance: Two solutions that factor a polynomial but use different variable names can have high embedding distance but zero strategic diversity. Two solutions that use factoring vs. the quadratic formula have genuine diversity but might have similar embeddings (both are "math solutions").
Token-level metrics (BLEU, edit distance): Same problem. A verbose factoring solution and a terse one are far apart in token space but strategically identical.
Answer diversity: Many problems have one correct answer. All correct solutions give the same answer. Answer diversity would say they are all the same, missing the strategic diversity entirely.

Strategy-level clustering requires understanding the reasoning, which is exactly what an LM judge provides. The judge can read "I'll use the quadratic formula" and "Let me factor this" and correctly assign them to different clusters, regardless of surface similarity.

What counts as a "different" strategy?

The paper draws an important distinction between macro and micro diversity:

Macro strategy: The overall framework. "Use recursion" vs. "Use a closed-form series" vs. "Use generating functions." These are fundamentally different approaches to the same problem. A model with macro diversity can attack problems from multiple angles.
Micro strategy: The specific technique at key decision points within a framework. "Complete the square" vs. "use the quadratic formula" within an algebraic framework. These are variations within the same approach.

The LM judge captures both levels. Two generations using recursion with different base cases are micro-diverse (same cluster unless the base cases lead to fundamentally different algorithms). Two generations, one using recursion and one using dynamic programming, are macro-diverse (different clusters). This hierarchical notion of diversity is crucial — what we most want to preserve is macro diversity, since losing an entire approach family is far more costly than losing a minor variation.

Interactive Set RL — Polychromic Scoring

8 generations for a math prompt, each with a reward (correct/incorrect) and a cluster color (strategy). The panel shows how sets are scored and how marginal advantages differ between GRPO and Poly-EPO.

The showcase above reveals the core mechanism. Take an incorrect generation and assign it a cluster that no other generation uses. Under GRPO, its advantage stays negative (wrong = bad). Under Poly-EPO, its advantage can become positive: it boosts the diversity of every set it joins, and those sets score higher overall. This is optimistic exploration: the model keeps trying new strategies even when they fail, because the SET benefits from having them.

A set of 8 generations has 6 correct (all using the same strategy) and 2 incorrect (each using a unique strategy). What is the polychromic score?

f = (6/8) * (3/8) = 0.281: three unique clusters (the dominant one + two unique incorrect ones) divided by 8
f = 6/8 = 0.75: just the reward
f = 3/8 = 0.375: just the diversity

Chapter 5: The Synergy

This is the mathematical heart of the paper. The marginal set advantage under the polychromic objective decomposes into three terms that reveal exactly how Poly-EPO creates synergy between reward and diversity.

The decomposition (Equation 15)

For a generation y with reward r(y), the marginal set advantage under f_poly is:

A^♯_marg(x, y; f_poly) = E_y[reward] · E_y[diversity] − baseline + Cov_y(reward, diversity)

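where E_y[reward] means "average reward of sets containing y," E_y[diversity] means "average diversity of sets containing y," and Cov_y is the covariance between reward and diversity across all sets containing y. Because these are averages over the sets containing y, the right-hand side equals the sum from Chapter 3 divided by the number of such sets; this rescales the advantage without changing its sign.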

This decomposition comes from the identity E[X · Y] = E[X] · E[Y] + Cov(X, Y). Since the polychromic objective is a product of reward and diversity, expanding the expectation naturally produces the covariance term. This is not an approximation — it is exact.
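The identity is easy to confirm numerically. The (reward, diversity) pairs below are made up and stand for four sets that all contain a given generation y; the covariance is the population covariance, which is the form for which the identity holds exactly.

```python
from statistics import mean

def decompose(set_rewards, set_diversities):
    """Return E[R*D], E[R]*E[D] and Cov(R, D) over the sets containing y."""
    e_r, e_d = mean(set_rewards), mean(set_diversities)
    e_rd = mean(r * d for r, d in zip(set_rewards, set_diversities))
    cov = mean((r - e_r) * (d - e_d)
               for r, d in zip(set_rewards, set_diversities))
    return e_rd, e_r * e_d, cov

# Hypothetical mean reward and diversity of four sets that contain y.
R = [1.0, 0.75, 0.5, 0.75]
D = [0.75, 0.5, 0.25, 0.5]

e_rd, product, cov = decompose(R, D)
print(e_rd, product, cov)
```

Here the high-reward sets are also the high-diversity ones, so the covariance is positive and the average polychromic score exceeds the product of the two averages.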

Term 1: the product of expectations

This is analogous to what reward shaping gives you — credit for being in sets that tend to be rewarding AND sets that tend to be diverse. But it still treats them somewhat independently.

Term 2: the covariance — THIS is the synergy

Cov_y(reward, diversity) captures something reward shaping fundamentally cannot. It asks: do the sets containing y tend to have high reward AND high diversity together?

A generation y that helps sets achieve both gets extra positive signal from this term. This is exactly "be correct in a different way" — the thing we wanted but could not get from additive rewards.

When is the covariance positive? When sets containing y that happen to have high reward also happen to have high diversity. This occurs for generations that contribute a unique strategy to already-strong sets. When is it negative? When high-reward sets containing y tend to have low diversity — meaning y is part of a monoculture that happens to be correct.

The critical implication — optimistic exploration: An incorrect generation (r = 0) can get positive marginal advantage under Poly-EPO. How? The sets it belongs to may still have high overall reward (other members are correct) and high diversity (y adds a unique cluster). The product of these is high, and the covariance term adds more. In standard RL with reward shaping, an incorrect generation at high accuracy can NEVER get positive advantage — the exploration signal dies. In Poly-EPO, it persists.

Why this matters for late-stage training

In standard RL, once accuracy p is high, all incorrect generations get hammered with negative advantage. The model learns to never produce them. But some of those incorrect generations were trying novel strategies that haven't been fully explored. They die before they can be refined.

In Poly-EPO, an incorrect generation using a unique strategy still boosts the polychromic score of every set it joins. The marginal advantage stays positive as long as the SET it joins is strong. The model keeps producing diverse attempts, giving those novel strategies time to mature.

To see why this matters, consider a training step where the model is 90% accurate on a problem. In standard GRPO, the 10% of wrong generations get advantage ≈ −3.0 (the z-score of r=0 when mean=0.9). Those generations get strongly suppressed, regardless of whether they were trying something novel or just making arithmetic mistakes.

In Poly-EPO, a wrong generation using cluster C (unique among its peers) joins sets where the other 3 of 4 members might be correct from different clusters. Those sets score well: reward = 0.75, diversity = 1.0, f = 0.75. This could easily be above the baseline. The wrong generation gets credit for its contribution to the set's diversity, even though it individually failed. The generation is not reinforced for being wrong — it is reinforced for trying something different, which happens to correlate with being wrong right now but could become correct with more training.

This is optimistic exploration — the model is optimistic that a currently-failing strategy might eventually succeed, because the diversity benefit outweighs the individual failure. And it does this automatically, without any explicit exploration bonus or temperature scheduling.

Concrete comparison: reward shaping vs. Poly-EPO

Imagine 8 generations for a hard problem. 7 are correct using strategy A. 1 is incorrect using unique strategy B. Group accuracy p = 7/8 = 0.875.

Method	Advantage of y₈ (incorrect, unique)	Sign
Standard GRPO	z-score of r=0 when mean=0.875 → about −2.65	Negative
Reward shaping	λ·d − 0.875 − λ·d̄ → negative unless λ huge	Negative
Poly-EPO	Sets with y₈ have unique cluster B → diversity boost. With set size 4, a set holding y₈ and 3 correct A's scores f = 0.75 × 0.5 = 0.375, versus 1.0 × 0.25 = 0.25 for an all-A set, so it can exceed baseline	Can be positive

The table makes the mechanism concrete. In Poly-EPO, y₈ is "carried" by the correct generations in its sets — they provide the reward, y₈ provides the diversity, and the product rewards the entire set. The marginal set advantage credits y₈ for the diversity it contributed.
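For this small case the comparison can be checked exhaustively. The sketch assumes set size n = 4 (as in the table) and enumerates all C(8, 4) = 70 sets instead of sampling K of them; the GRPO figure uses the population standard deviation.

```python
from itertools import combinations
from statistics import mean, pstdev

rewards  = [1] * 7 + [0]       # y1..y7 correct, y8 incorrect
clusters = ["A"] * 7 + ["B"]   # y8 is the only member of cluster B
n = 4                          # set size assumed in the table

def f_poly(members):
    avg_reward = sum(rewards[i] for i in members) / len(members)
    diversity = len({clusters[i] for i in members}) / len(members)
    return avg_reward * diversity

all_sets = list(combinations(range(8), n))
scores = [f_poly(s) for s in all_sets]
f_bar = mean(scores)

# Marginal set advantage of each generation over every set containing it.
marginal = [sum(sc - f_bar for sc, s in zip(scores, all_sets) if i in s)
            for i in range(8)]

grpo_y8 = (rewards[7] - mean(rewards)) / pstdev(rewards)
print(round(f_bar, 4), round(marginal[7], 4), round(marginal[0], 4),
      round(grpo_y8, 2))
```

In this enumeration the baseline is 0.3125, every set containing y₈ scores 0.375, and y₈'s marginal set advantage is positive, while its GRPO z-score is about −2.65. Each redundant copy of strategy A ends up slightly below zero.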

A biological analogy. In evolutionary biology, maintaining genetic diversity in a population improves long-term fitness even though individual mutations are often harmful. Natural selection at the individual level would eliminate diversity. But selection at the GROUP level (kin selection, multi-level selection) can preserve it. Poly-EPO is essentially multi-level selection for reasoning strategies — scoring groups (sets) rather than individuals.

What does the covariance term in the marginal set advantage decomposition capture?

It measures how much diversity the generation adds on its own
The covariance captures synergy: generations that help sets achieve BOTH high reward and high diversity get extra positive signal, encoding "be correct in a different way"
It measures how correlated the rewards are across generations

Chapter 6: The Algorithm

Poly-EPO consists of two algorithms. Algorithm 1 is the general Set RL recipe. Algorithm 2 computes the polychromic objective specifically.

Algorithm 1: Set RL (general)

python
import random
from statistics import mean

def set_rl_advantage(prompt, policy, objective_fn, N, n, K):
    # Step 1: Sample N generations
    generations = [policy.sample(prompt) for _ in range(N)]

    # Step 2: Construct K sets of size n, stored as index lists so that
    # identical generation texts stay distinct; no repeats within a set
    sets = [random.sample(range(N), n) for _ in range(K)]

    # Step 3: Score each set
    scores = [objective_fn(prompt, [generations[i] for i in s]) for s in sets]
    f_bar = mean(scores)

    # Step 4: Set advantages
    set_adv = [s - f_bar for s in scores]

    # Step 5: Marginal set advantage per generation
    # (sum of the advantages of every set that contains generation i)
    marginal = [0.0] * N
    for adv, s in zip(set_adv, sets):
        for i in s:
            marginal[i] += adv

    # Step 6: Use marginal[i] as drop-in for the GRPO/PPO advantage
    # of generations[i]
    return generations, marginal

Algorithm 2: Polychromic objective

python
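from statistics import mean

def polychromic_score(prompt, gen_set, reward_fn, lm_judge):
    # To pass this to Algorithm 1 as objective_fn, bind reward_fn and
    # lm_judge first (e.g. with functools.partial).

    # Compute per-generation rewards
    rewards = [reward_fn(prompt, y) for y in gen_set]
    avg_reward = mean(rewards)

    # Cluster by reasoning strategy via LM judge.
    # Shown per set for clarity; in practice the judge runs once on all N
    # generations and the labels are looked up (see "LM judge cost" below).
    clusters = lm_judge.cluster(prompt, gen_set)
    # clusters[i] in {1, 2, ..., 99} for real strategies
    # clusters[i] = 100 for degenerate responses

    n = len(gen_set)
    unique_clusters = len(set(
        c for c in clusters if c != 100
    ))
    diversity = unique_clusters / n

    # Product — not sum!
    return avg_reward * diversity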

Practical considerations

N vs K trade-off. Larger N gives more generation diversity to work with. Larger K gives more sets and lower-variance advantage estimates. The paper uses N=32-64 and K=a few hundred.
Set size n. Must be large enough to capture meaningful diversity but small enough that most sets are not identical. n=8 works well.
LM judge cost. Clustering is done once per prompt per iteration, not per set. The judge sees all N generations and produces cluster assignments. Then sets are scored using those pre-computed clusters.
Normalization. The marginal set advantages are normalized (z-scored) before being used as GRPO/PPO advantages, just like standard advantage normalization.
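The considerations above can be combined into one sketch: rewards and cluster labels are computed once for all N generations, every set is scored by index lookup, and the marginal set advantages are z-scored at the end. The function name, the fixed seed, and the example labels are hypothetical; cluster 100 is treated as degenerate, following the convention described earlier.

```python
import random
from statistics import mean, pstdev

def poly_epo_advantages(rewards, clusters, n, K, seed=0):
    """rewards[i] and clusters[i] are precomputed per prompt, so scoring a
    set needs no further judge calls. Returns z-scored marginal advantages."""
    rng = random.Random(seed)
    N = len(rewards)
    sets = [rng.sample(range(N), n) for _ in range(K)]

    def score(members):
        avg_reward = sum(rewards[i] for i in members) / n
        distinct = {clusters[i] for i in members if clusters[i] != 100}
        return avg_reward * len(distinct) / n

    scores = [score(s) for s in sets]
    f_bar = mean(scores)

    marginal = [0.0] * N
    for sc, s in zip(scores, sets):
        for i in s:
            marginal[i] += sc - f_bar

    mu, sigma = mean(marginal), pstdev(marginal)
    return [(m - mu) / sigma if sigma > 0 else 0.0 for m in marginal]

# Hypothetical judge output for N = 8 generations of one prompt.
rewards  = [1, 1, 1, 1, 1, 0, 0, 1]
clusters = [1, 1, 1, 2, 2, 3, 100, 1]
adv = poly_epo_advantages(rewards, clusters, n=4, K=200)
print([round(a, 2) for a in adv])
```

The returned list lines up with the generations by index, so it can be passed wherever a GRPO/PPO implementation expects per-sample advantages.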

Drop-in compatibility: Poly-EPO changes only the advantage computation. The rest of GRPO/PPO — clipping, KL penalty (or lack thereof), optimization — stays identical. You can add Poly-EPO to any existing RL post-training pipeline by swapping one function.

LM judge prompt design

The LM judge receives all N generations for a prompt and must output cluster assignments. The prompt instructs the judge to:

Read all N solutions carefully.
Identify the macro strategy of each (the overall framework or approach).
Within each macro strategy, identify micro variations (specific techniques at key decision points).
Assign generations to numbered clusters. Same cluster = same strategy. Different cluster = meaningfully different approach.
Assign cluster 100 to degenerate responses (off-topic, repetitive loops, incomprehensible).

The judge needs to distinguish between genuine strategic diversity (factoring vs. quadratic formula) and superficial variation (same method, different variable names). This is why an LM judge is used rather than a simpler metric like edit distance or embedding similarity — reasoning strategy is a semantic property that requires understanding the solution.
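
One way to wire up the instructions above is sketched below. The prompt wording, the `index: cluster` reply format, and both function names are assumptions made for illustration; the paper's actual prompt is not reproduced here. Solutions the judge fails to label fall back to cluster 100, which is also an assumption.

```python
import re

DEGENERATE = 100  # cluster id reserved for degenerate responses


def build_judge_prompt(problem, generations):
    """Assemble one prompt containing all N solutions (hypothetical wording)."""
    header = (
        "You will see several solutions to one problem.\n"
        "Identify the macro strategy of each, then micro variations within it.\n"
        "Reply with one line per solution in the form 'index: cluster'.\n"
        f"Use cluster {DEGENERATE} for degenerate responses.\n\n"
        f"Problem:\n{problem}\n\n"
    )
    body = "\n\n".join(
        f"Solution {i}:\n{g}" for i, g in enumerate(generations, start=1)
    )
    return header + body


def parse_clusters(reply, n_generations):
    """Parse 'index: cluster' lines; unlabeled solutions default to DEGENERATE."""
    clusters = [DEGENERATE] * n_generations
    pattern = r"^[ \t]*(\d+)[ \t]*:[ \t]*(\d+)[ \t]*$"
    for m in re.finditer(pattern, reply, flags=re.MULTILINE):
        idx, cluster = int(m.group(1)), int(m.group(2))
        if 1 <= idx <= n_generations and 1 <= cluster <= DEGENERATE:
            clusters[idx - 1] = cluster
    return clusters


prompt = build_judge_prompt("Solve x^2 - 5x + 6 = 0.", ["factor ...", "quadratic formula ..."])
labels = parse_clusters("1: 1\n2: 2\n", n_generations=2)
```

Because the labels are computed once per prompt, every sampled set can look up its members' clusters from `labels` without calling the judge again.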

Computational overhead

Where does the compute go? The main additional cost of Poly-EPO beyond standard GRPO is:

LM judge calls: One call per prompt per iteration (not per generation or per set). With batching, this adds ~10-20% overhead.
Set construction and scoring: O(K · n) per prompt, negligible compared to generation cost.
Marginal advantage computation: O(K · N) per prompt, also negligible.

The dominant expense is generation: sampling N=32-64 responses per prompt, which any GRPO or best-of-N setup that samples multiple responses per prompt already pays. Poly-EPO uses those same N responses more intelligently by organizing them into sets and extracting a richer training signal. You pay a small extra tax (LM judge plus set combinatorics) and get a more informative advantage signal in return.

How does Poly-EPO integrate with existing RL training pipelines like GRPO?

It requires a completely new training framework
It is a drop-in replacement for the advantage computation; everything else (clipping, optimization) stays the same
It requires training a separate diversity reward model

Chapter 7: Results

The experiments use Qwen3-4B-Base trained on POLARIS-53k, a curated dataset of 53k math and reasoning problems. All methods use the same base model, same data, same compute budget. The only difference is the advantage computation.

Pass@k: the headline result

Pass@k measures: "if I sample k solutions, does at least one get the right answer?" This directly measures the value of diversity.

At k=1: Poly-EPO and GRPO are comparable (both improved over base).
At k=4: Poly-EPO starts pulling ahead.
At k=16: Poly-EPO is 10-15% better.
At k=32: Poly-EPO is up to 20% better. GRPO is now WORSE than the base model.
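
For reference, pass@k is commonly estimated from n samples per problem, c of which are correct, using the unbiased estimator 1 - C(n-c, k) / C(n, k). The sketch below implements that standard estimator; whether the paper uses exactly this form is an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k draw contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


print(pass_at_k(n=4, c=1, k=1))
print(pass_at_k(n=4, c=1, k=4))
```

The benchmark-level number is the average of this quantity over problems. A model that solves a few problems very reliably and a model that solves many problems occasionally can share the same pass@1 while differing sharply at large k.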

Pass@k Curves

Pass@k on AIME-equivalent benchmarks. Base model (gray), GRPO (teal), Poly-EPO (warm). Notice GRPO falls below the base model at high k.

GRPO degrades so badly that the base model beats it at k ≥ 32. This is not a small effect. The untrained model, with all its naive variety, finds solutions that the RL-trained model has forgotten how to produce. Poly-EPO avoids this entirely by maintaining and growing diversity throughout training.

Training dynamics

The number of distinct correct strategy clusters over training tells the full story:

GRPO: Starts at 8+ clusters, collapses to 2-3 by step 200, and flatlines. The model "chose" its strategies and stopped exploring.
Poly-EPO: Starts at 8+ clusters and steadily grows to 12-15. New correct strategies emerge throughout training as the model refines initially-failing approaches.

Branching analysis

Where in the token sequence do generations diverge? The paper measures this by computing the average prefix length shared between pairs of generations. A long shared prefix means the generations follow the same reasoning path and only diverge late (superficial diversity). A short shared prefix means they commit to different strategies early (structural diversity).
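
The shared-prefix measurement described above can be sketched as follows, assuming generations are already tokenized into lists of token ids. Whether the paper normalizes by sequence length is not stated here, so this version reports raw token counts; the function names are illustrative.

```python
from itertools import combinations


def shared_prefix_len(a, b):
    """Number of leading tokens that two sequences have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def mean_shared_prefix(token_seqs):
    """Average shared-prefix length over all unordered pairs of generations."""
    pairs = list(combinations(token_seqs, 2))
    if not pairs:
        raise ValueError("need at least two generations")
    return sum(shared_prefix_len(a, b) for a, b in pairs) / len(pairs)


late_branching = [[1, 2, 3, 4], [1, 2, 3, 9], [1, 2, 3, 7]]
early_branching = [[1, 2, 3, 4], [5, 2, 3, 4], [8, 2, 3, 4]]
print(mean_shared_prefix(late_branching), mean_shared_prefix(early_branching))
```

A lower value indicates earlier branching: the three `early_branching` sequences differ from the first token even though their later tokens coincide.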

Poly-EPO branches earlier — generations commit to different strategies from the opening tokens. GRPO generations stay coupled for most of the sequence and only diverge in the final steps. This means GRPO "diversity" is superficial (different final calculations, same approach), while Poly-EPO diversity is structural (fundamentally different solution paths).

This has practical implications. Early branching means the model is choosing its strategy deliberately, not stumbling into diversity through random token-level noise. The diversity is intentional and robust — it persists across different random seeds and temperature settings.

Majority voting

With majority voting, Poly-EPO achieves equal or better accuracy despite a lower vote share for the top answer. Why? More diverse strategies mean the correct strategies cover more problem types. The majority vote is less concentrated but more often right.
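
As a small illustration of the two quantities involved, the sketch below computes the majority answer and its vote share from a list of final answers. The example answers are made up for illustration and are not results from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most common final answer and the fraction of samples that gave it."""
    if not answers:
        raise ValueError("need at least one answer")
    winner, votes = Counter(answers).most_common(1)[0]
    return winner, votes / len(answers)


# Concentrated votes on one answer versus a more spread-out vote.
concentrated = majority_vote(["17", "17", "17", "17"])
spread = majority_vote(["42", "42", "7", "13"])
print(concentrated, spread)
```

Vote share measures how concentrated the samples are, not whether the winning answer is correct, which is why a lower share can coexist with equal or better accuracy.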

Why pass@1 stays competitive

A natural worry: does maintaining diversity hurt the single-sample performance? No. Poly-EPO's pass@1 is comparable to GRPO. The polychromic objective does not penalize the dominant strategy — it just prevents it from killing the other strategies. The best strategy still gets high reward, still gets reinforced. The difference is that alternative strategies survive alongside it rather than being extinguished.

This is the "free lunch" of Poly-EPO: you keep pass@1 and dramatically improve pass@k. You are not trading single-sample quality for diversity; the only price is the modest extra compute for the LM judge.

Numbers in context

Metric	Base	GRPO	Poly-EPO
Pass@1	~15%	~28%	~27%
Pass@8	~44%	~50%	~60%
Pass@32	~72%	~58%	~82%
# Correct clusters	8+	2-3	12-15
Branch point	Early	Late	Early

Ablation: what matters most?

The paper runs ablations removing each component:

Product → sum: Replacing the product with r + d (reward shaping) degrades pass@k by 8-12% at k=32. The synergy is essential.
LM judge → embedding clustering: Using sentence embeddings instead of the LM judge reduces the diversity signal quality. pass@k drops 5-7%. Strategy-level clustering matters.
Set RL → individual RL with diversity reward: Scoring individuals with a diversity bonus (reward shaping) is strictly worse than Set RL. This confirms the theoretical analysis in Chapter 2.
K sets → fewer sets: Reducing K from 200 to 20 increases variance but still outperforms GRPO. The method is robust to the number of sets sampled.

The product objective and LM-judge clustering are the two most critical components. The Set RL framework itself (vs. individual RL) contributes the third-largest effect.
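
The product-versus-sum ablation is easy to illustrate on hand-built sets. The sketch below assumes binary rewards and uses a weight `lam` for the additive variant (`lam=1.0` corresponds to r + d); the three example sets are invented for illustration.

```python
DEGENERATE = 100  # cluster id reserved for degenerate responses


def diversity(clusters):
    """Fraction of the set occupied by distinct non-degenerate clusters."""
    return len({c for c in clusters if c != DEGENERATE}) / len(clusters)


def product_score(rewards, clusters):
    """Polychromic objective: average reward times diversity."""
    return (sum(rewards) / len(rewards)) * diversity(clusters)


def sum_score(rewards, clusters, lam=1.0):
    """Additive reward shaping: average reward plus lam times diversity."""
    return sum(rewards) / len(rewards) + lam * diversity(clusters)


all_wrong_diverse = ([0, 0, 0, 0], [1, 2, 3, 4])
all_right_uniform = ([1, 1, 1, 1], [1, 1, 1, 1])
right_and_diverse = ([1, 1, 1, 1], [1, 2, 3, 4])

for name, (r, c) in [
    ("all wrong, diverse", all_wrong_diverse),
    ("all right, uniform", all_right_uniform),
    ("right and diverse", right_and_diverse),
]:
    print(name, product_score(r, c), sum_score(r, c))
```

Under the sum, a set of four different wrong answers scores 1.0, close to the 1.25 earned by four identical correct answers, so diversity alone is rewarded. Under the product the same set scores 0, and only the set that is both correct and diverse reaches the maximum of 1.0.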

What does the branching analysis reveal about the quality of diversity in Poly-EPO vs GRPO?

Both branch at similar points but Poly-EPO has more branches
Poly-EPO branches earlier in the token sequence, committing to different strategies from the start, while GRPO diversifies only superficially at the end
GRPO branches earlier because it explores more

Chapter 8: Synthetic Domains

The paper validates Poly-EPO on two synthetic domains where the ground-truth set of valid strategies is known exactly. This lets us count strategies precisely, rather than relying on LM-judge clustering.

Polynomial solving

Given a polynomial equation, find the roots. There are many valid approaches: factoring, quadratic formula, completing the square, synthetic division, Newton's method, Vieta's formulas, etc. The paper enumerates the valid strategies and tracks which ones each model discovers.

GRPO: discovers 3-4 strategies, converges on quadratic formula + factoring. By step 200, the model has essentially forgotten the other approaches that the base model could produce.
Poly-EPO: discovers 15+ strategies, including obscure ones like geometric constructions, modular arithmetic approaches, Newton's identities, and Vieta's formulas applied creatively. That is roughly 5x more than GRPO.

Remarkably, some strategies discovered by Poly-EPO are approaches that the base model itself could only produce at very low probability (less than 0.1%). The optimistic exploration mechanism amplified these rare strategies, giving them enough probability mass to be sampled regularly. Without the set diversity signal, these strategies would have been immediately suppressed by standard RL's negative advantage for low-probability, occasionally-wrong attempts.

Multi-digit multiplication

Multiply two large numbers. This is a task with many known algorithms, each with different computational complexity and error profiles. Strategies include: standard long multiplication, lattice method, Russian peasant multiplication, Karatsuba-style divide-and-conquer, repeated addition, partial products with regrouping, etc.

GRPO: locks onto standard long multiplication almost exclusively. By step 300, over 95% of generations use the same column-by-column approach.
Poly-EPO: maintains 5-6 distinct methods, including lattice, Karatsuba-like decompositions, and creative partial-products approaches. Some of these methods have lower error rates on specific number patterns (e.g., lattice multiplication is more reliable for numbers with many repeated digits).

Why synthetic domains matter: In real math benchmarks, we rely on an LM judge to count strategies, which introduces noise. Synthetic domains have a known, finite strategy space. The 5x advantage of Poly-EPO in strategy discovery is a clean, controlled result that validates the approach before scaling to harder tasks.

The synthetic results also show that Poly-EPO's diversity is not "noise" — these are genuinely different, valid solution methods. The model is not just generating random variations; it is discovering fundamentally different algorithms.

Strategy discovery timeline

In polynomial solving, the training dynamics are especially revealing. GRPO discovers its 3-4 strategies within the first 50 steps and then prunes back to 2 as the dominant strategy monopolizes the probability mass. Poly-EPO shows a different pattern: it discovers strategies continuously throughout training, with new approaches appearing even at step 400+. Some of these late-discovered strategies are obscure methods that the base model could theoretically produce but almost never did.

This suggests that the optimistic exploration mechanism is not just preserving existing diversity — it is actively discovering new strategies by keeping the policy's support broad enough for rare approaches to be sampled and then reinforced via the set diversity signal.

Strategy refinement

An important nuance: maintaining diversity does not mean maintaining bad strategies forever. As training progresses, the LM judge may re-cluster generations as strategies refine. A strategy that started as "try recursion" and consistently failed will eventually be distinguished from a refined version "try recursion with memoization" that starts succeeding. The diversity metric rewards the refined version while letting the crude version fade naturally — not because it was suppressed, but because better versions of the idea emerged.

This is qualitatively different from GRPO, where the entire "recursion" branch gets killed after a few failures. In Poly-EPO, the branch survives long enough for the model to refine it into something that works. This is the essence of exploration: you need to tolerate short-term failure to discover long-term success.

Scaling implications

The synthetic results have important implications for scaling test-time compute. If you want pass@1000 to be better than pass@100, you need the model to keep producing genuinely new solution attempts. A model that collapsed onto 3 strategies will saturate quickly — after ~50 samples, you have tried each strategy many times and additional samples add nothing. A model with 15+ strategies keeps finding new answers well into the hundreds of samples.

This connects directly to the "Large Language Monkeys" observation: coverage scales with repeated sampling, but only if the model has enough diversity to fill the coverage space. Poly-EPO is the training-time complement to test-time scaling — it ensures the model has the diversity that repeated sampling needs to be effective.

The compound benefit: A model trained with Poly-EPO and deployed with best-of-k sampling gets two compounding advantages. First, each sample is more likely to try a strategy that earlier samples did not (training-time diversity). Second, because the strategies differ in which problems they solve, k samples cover more of the problem space (test-time coverage). Diversity at training time multiplies the value of compute at test time.

Why are synthetic domains (polynomial solving, multiplication) important for validating Poly-EPO?

They have a known, finite strategy space, allowing precise strategy counting without LM-judge noise; Poly-EPO discovers 5x more valid strategies than GRPO
They are easier for the models to solve
They use less compute for training

Chapter 9: Connections

Cheat sheet

Concept	What it does
Diversity collapse	Standard RL reduces the number of reasoning strategies the model can use, hurting pass@k
Reward shaping	Adding λ·diversity to reward fails because the sum decouples the goals and the threshold becomes impossible at high accuracy
Set RL	Score sets of generations instead of individuals. Marginal set advantage maps set-level signal back to per-generation credit.
Polychromic objective	f = (avg reward) × (diversity). Product forces both to be high. No λ.
Synergy / covariance	Marginal advantage includes Cov(reward, diversity) — generations that help sets be BOTH good and diverse get extra signal
Optimistic exploration	Incorrect generations can get positive advantage if they add diversity to high-reward sets. Persists even at high accuracy, unlike reward shaping.
LM-judge clustering	Cluster by reasoning strategy (macro + micro), not surface text. Cluster 100 = degenerate.
U-statistic	Marginal set advantage is an unbiased gradient estimator (Proposition 3.1). No bias correction needed.
Drop-in replacement	Only the advantage computation changes. Clipping, optimization, and infrastructure stay the same as GRPO/PPO.
Pass@k improvement	Up to 20% better than GRPO at k=32. GRPO degrades below base model; Poly-EPO does not.

Related work

DAPO — fixes four GRPO failure modes (entropy collapse, dead gradients, etc.) for large-scale RL. Poly-EPO addresses a different failure mode (diversity collapse) that DAPO's fixes alone cannot prevent.
DeepSeek-Math / GRPO — the base RL algorithm that Poly-EPO builds on. GRPO eliminates the value function; Poly-EPO adds set-level scoring on top.
Scaling Test-Time Compute — sampling more at inference. Poly-EPO makes those extra samples actually diverse, so pass@k improves rather than saturating.
Large Language Monkeys — repeated sampling reveals that coverage scales with number of samples. Poly-EPO ensures the model maintains the diversity needed for coverage to keep growing.
PPO — the clipped surrogate objective that both GRPO and Poly-EPO use. Poly-EPO is compatible with PPO too — just swap the advantage.

Broader implications

Poly-EPO is among the first works to rigorously formalize and address the diversity collapse problem in LLM RL. But the Set RL framework is more general. You could imagine set-level objectives that optimize for:

Coverage of input space: a set of tool-calling trajectories that together handle all edge cases.
Complementarity: a set of code solutions that together pass all unit tests, even if no single solution passes them all.
Robustness: a set of reasoning chains that arrive at the same answer via independent paths (providing redundancy).

Any property that is meaningful at the population level (not individual level) can potentially be optimized via Set RL. Poly-EPO opened this door; the field is just starting to walk through it.

Perhaps most importantly, Poly-EPO demonstrates that the tools of cooperative game theory (set functions, marginal contributions, U-statistics) have direct applications in LLM training. These mathematical frameworks have been developed for decades in economics and social choice theory. Poly-EPO is among the first to apply them to policy optimization in generative models. The set of possible future work in this direction is, appropriately, highly diverse.

Key takeaways

Three things to remember:
1. Diversity collapse is a real, measurable problem — standard RL actively destroys the base model's strategy diversity, making pass@k worse.
2. The product objective (reward × diversity) creates synergy that additive reward shaping cannot — the covariance term in the marginal advantage is the mathematical signature of this synergy.
3. Set RL is a general framework, and Poly-EPO is one instantiation — the door is open for other set-level objectives tailored to different desiderata.

Open questions

Scaling to larger models. The paper uses Qwen3-4B-Base. Does diversity collapse get worse or better at 70B+? Does Poly-EPO's advantage grow or shrink?
Non-math domains. The polychromic objective assumes binary rewards and clustering by strategy. How does it extend to open-ended generation (creative writing, coding) where "strategy" is harder to define?
LM judge reliability. The LM judge introduces a potential bottleneck. What happens when the judge misclusters? The paper shows robustness to moderate noise, but systematic bias in clustering could misalign the diversity signal.
Optimal set size. n=8 works well empirically, but is there a principled way to choose n based on the strategy space size?
Interaction with DAPO. DAPO fixes entropy collapse, dead gradients, and other GRPO failure modes. Poly-EPO fixes diversity collapse. Are these orthogonal? Can you stack them? The paper suggests yes, but the interaction effects are not fully characterized.
Beyond math. Code generation has verifiable outputs (tests pass/fail) and a natural notion of solution diversity (different algorithms, different libraries, different paradigms). This seems like an ideal next testbed for Set RL.
Multi-turn settings. Can Set RL work for agentic tasks where each "generation" is a full trajectory? The set construction would be over trajectory bundles, which opens interesting questions about what "diversity" means for multi-step reasoning.

The big picture: Poly-EPO shows that the objective for RL post-training should not just ask "is this generation good?" but "does this generation make the MODEL'S REPERTOIRE better?" By scoring sets, you optimize the portfolio of strategies, not just individual outputs. This is a shift from individual to ensemble thinking in RL for LLMs.

Poly-EPO: Exploratory Reasoning