DPO secretly solves a misspecified statistical estimation problem. When the true reward can't be expressed by your policy class, DPO projects onto the wrong answer — reversing preferences and degrading reward. AuxDPO fixes this with auxiliary degrees of freedom.
You want to align an LLM with human preferences. The standard recipe: collect (prompt, preferred response, rejected response) triplets, then fine-tune the model so it favors what humans favor. Two roads diverge.
Road 1: Two-stage RLHF. First, train a separate reward model on the preferences. Then run PPO (or similar RL) to maximize that reward while staying close to the base model via a KL penalty. This works, but it's expensive — you need a reward model, on-policy rollouts, clip ranges, variance reduction, and careful tuning of half a dozen hyperparameters.
Road 2: DPO (Direct Preference Optimization). Rafailov et al. (2023) showed a beautiful trick: the KL-regularized RL objective has a closed-form solution that links the optimal policy directly to the reward. Reparameterize the reward in terms of the policy, substitute into the Bradley-Terry preference model, and you get a single supervised loss. No reward model, no RL — just minimize one cross-entropy loss. It's simple, stable, and fast. The open-source community adopted it almost overnight.
But there's a catch. A hidden assumption. DPO's derivation assumes the policy class is tabular — meaning it can represent every possible conditional distribution over responses given prompts. In a tabular policy, there's one free parameter per (prompt, response) pair. The KL-regularized objective can be solved in closed form because the optimization is over an unconstrained probability simplex.
Real LLMs are not tabular. A 7B-parameter Transformer parameterizes distributions over an astronomically large set of possible responses (roughly the vocabulary size raised to the response length) using only 7 billion shared weights. The policy class is a tiny manifold embedded in the vast space of all possible distributions. The closed-form solution that DPO relies on? It might not live on that manifold.
This isn't a small-sample problem. The failure modes persist even with infinite preference data drawn from a perfect Bradley-Terry model. The issue is geometric: DPO is a misspecified estimator, and misspecified estimators can produce arbitrarily bad results regardless of how much data you feed them.
Left: in a tabular policy, every reward function can be expressed as an implicit reward. Right: in a parametric policy, only a low-dimensional manifold of rewards is reachable. DPO projects onto this manifold, but the projection can land anywhere.
Here is the paper's central insight, stated plainly: DPO loss minimization is equivalent to a weighted KL-projection of the true reward function onto the manifold of rewards that the policy class can express.
Let's unpack that one piece at a time.
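Given any policy $\pi_\theta$ and a reference policy $\pi_{\theta_0}$, DPO defines an implicit reward function:
$$r_\theta^\beta(s, a) = \beta \log \frac{\pi_\theta(a \mid s)}{\pi_{\theta_0}(a \mid s)}$$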
This is just the log-probability-ratio, scaled by β. Every policy parameter θ induces exactly one implicit reward. The set of all such implicit rewards, as θ varies over $\mathbb{R}^d$, forms a manifold $\mathcal{R}^\beta$ inside the full reward space $\mathbb{R}^m$ (where $m = |S| \cdot |A|$).
When you minimize the DPO loss with infinite data, you're finding the implicit reward in $\mathcal{R}^\beta$ that is closest to the true reward $r^*$, where "closest" means minimizing a weighted sum of KL divergences between Bernoulli preference probabilities. The weights are the preference pair counts $n_{s,a,a'}$.
If r* happens to live on the manifold Rβ (i.e., some policy parameter θ yields an implicit reward equal to r*), then DPO finds the RLHF-optimal policy. No problem.
But if r* is off the manifold — which is the generic case when d << m — then DPO projects r* onto Rβ. This projection depends on the weights (preference counts), and can land at a point on the manifold that corresponds to a worse policy than doing nothing at all.
The paper then studies the geometry of two-stage RLHF and discovers that RLHF partitions all reward functions into equivalence classes: rewards that differ by a vector in the nullspace of a certain matrix all yield the same optimal policy. DPO can only search the column space. AuxDPO adds auxiliary variables that search the nullspace too, expanding the search from a d-dimensional manifold to the full m-dimensional reward space. The result: the projection is no longer misspecified.
Before we dive into the proofs, let's make sure every tool is sharp. This chapter covers the three building blocks: the Bradley-Terry preference model, the DPO loss, and the geometry of misspecified estimation.
Given a prompt s and two responses a, a', how do we model which one a human prefers? The Bradley-Terry-Luce (BTL) model says: each response has a latent "reward" r*(s, a), and the probability that a is preferred over a' is:
$$p^{\mathrm{BTL}}_{s,a,a'}(r^*) = \sigma\big(r^*(s, a) - r^*(s, a')\big)$$
where $\sigma(z) = 1/(1 + e^{-z})$ is the sigmoid function.
Let's build some intuition with numbers. Suppose r*(a1) = 3 and r*(a2) = 1. Then:
$$P(a_1 \succ a_2) = \sigma(3 - 1) = \sigma(2) = \frac{1}{1 + e^{-2}} \approx 0.88$$
So a1 is preferred 88% of the time. The bigger the reward gap, the stronger the preference. If rewards are equal, preference is 50-50 (since σ(0) = 0.5).
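The arithmetic above is easy to check numerically. The following is a minimal Python sketch; the helper names `sigmoid` and `btl_preference` are illustrative and do not come from the paper.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def btl_preference(r_a: float, r_b: float) -> float:
    """BTL probability that the response with reward r_a beats the one with r_b."""
    return sigmoid(r_a - r_b)


p_gap_two = btl_preference(3.0, 1.0)      # reward gap of 2
p_equal = btl_preference(1.0, 1.0)        # equal rewards
p_shifted = btl_preference(103.0, 101.0)  # same gap, both rewards shifted by 100
print(round(p_gap_two, 2), p_equal, round(p_shifted, 2))  # prints: 0.88 0.5 0.88
```

Shifting both rewards by the same constant leaves the probability unchanged, which is why only reward differences are identifiable from preference data.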
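Given a reward function r* and a reference (base) policy πref = πθ0, the RLHF objective maximizes expected reward while penalizing deviation from the base:
$$J(\theta; r^*) = \mathbb{E}_{s \sim \rho,\, a \sim \pi_\theta(\cdot \mid s)}\big[r^*(s, a)\big] - \beta\, \mathbb{E}_{s \sim \rho}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid s) \,\|\, \pi_{\theta_0}(\cdot \mid s)\big)\Big]$$
where ρ denotes the distribution over prompts.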
Here β > 0 controls how far the aligned policy can stray from the base. Large β means "stay close to the base" (conservative alignment). Small β means "chase the reward aggressively."
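Worked example: Suppose you have one prompt, two responses, πref(a1) = πref(a2) = 0.5, and r*(a1) = 1, r*(a2) = 0. Keeping β general for the moment, the objective is:
$$J = \pi_\theta(a_1) \cdot 1 + \pi_\theta(a_2) \cdot 0 - \beta \left[\pi_\theta(a_1) \log \frac{\pi_\theta(a_1)}{0.5} + \pi_\theta(a_2) \log \frac{\pi_\theta(a_2)}{0.5}\right]$$
Setting p = πθ(a1):
$$J(p) = p - \beta \left[p \log(2p) + (1 - p) \log\big(2(1 - p)\big)\right]$$
The derivative is $J'(p) = 1 - \beta \log\frac{p}{1 - p}$. Setting it to zero yields the optimal $p^* = e^{1/\beta} / (1 + e^{1/\beta}) = \sigma(1/\beta)$. For β = 1, that's σ(1) ≈ 0.73. So the optimal policy puts 73% probability on the preferred response rather than 100%, because the KL penalty holds it back.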
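One way to sanity-check the closed form is a brute-force search over p. The sketch below is illustrative only (the function names and the grid resolution are assumptions, not from the paper); it evaluates the KL-regularized objective for the two-response example and compares the grid maximizer with σ(1/β).

```python
import math


def objective(p: float, beta: float) -> float:
    """J(p) for rewards (1, 0) and a uniform reference policy (0.5, 0.5)."""
    expected_reward = p * 1.0 + (1.0 - p) * 0.0
    kl = p * math.log(p / 0.5) + (1.0 - p) * math.log((1.0 - p) / 0.5)
    return expected_reward - beta * kl


def grid_argmax(beta: float, steps: int = 20_000) -> float:
    """Brute-force maximizer of J over an open grid in (0, 1)."""
    grid = [(k + 0.5) / steps for k in range(steps)]
    return max(grid, key=lambda p: objective(p, beta))


def closed_form(beta: float) -> float:
    """sigma(1 / beta), the stationary point derived in the text."""
    return 1.0 / (1.0 + math.exp(-1.0 / beta))


for beta in (0.5, 1.0, 2.0):
    print(beta, round(grid_argmax(beta), 3), round(closed_form(beta), 3))
```

Smaller β pushes the maximizer toward 1, while larger β pulls it back toward the reference value 0.5.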
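For a tabular policy class, the RLHF objective has a closed-form maximizer:
$$\pi^*(a \mid s) = \frac{\pi_{\theta_0}(a \mid s)\, \exp\big(r^*(s, a)/\beta\big)}{Z^*(s)}, \qquad Z^*(s) = \sum_{a'} \pi_{\theta_0}(a' \mid s)\, \exp\big(r^*(s, a')/\beta\big)$$
Rearranging for the reward:
$$r^*(s, a) = \beta \log \frac{\pi^*(a \mid s)}{\pi_{\theta_0}(a \mid s)} + \beta \log Z^*(s)$$
Substituting this into the BTL model, the $\beta \log Z^*(s)$ terms cancel (since they're the same for both responses at the same prompt), giving:
$$P(a \succ a' \mid s) = \sigma\left(\beta \log \frac{\pi^*(a \mid s)}{\pi_{\theta_0}(a \mid s)} - \beta \log \frac{\pi^*(a' \mid s)}{\pi_{\theta_0}(a' \mid s)}\right)$$
DPO says: instead of learning r* first and then optimizing the policy, directly learn θ by treating the above as a binary classification loss. The DPO loss is:
$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\mathbb{E}_{(s, a_w, a_l) \sim D}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(a_w \mid s)}{\pi_{\theta_0}(a_w \mid s)} - \beta \log \frac{\pi_\theta(a_l \mid s)}{\pi_{\theta_0}(a_l \mid s)}\right)\right]$$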
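As a rough illustration of this loss, here is a framework-free Python sketch that operates on precomputed sequence log-probabilities. The function name `dpo_loss`, its argument names, and the toy numbers are assumptions made for this example; a real implementation would obtain the log-probabilities from the policy and reference models.

```python
import math
from typing import Sequence


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigma(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    logp_w: Sequence[float],      # log pi_theta(a_w | s) per example
    logp_l: Sequence[float],      # log pi_theta(a_l | s) per example
    ref_logp_w: Sequence[float],  # log pi_ref(a_w | s) per example
    ref_logp_l: Sequence[float],  # log pi_ref(a_l | s) per example
    beta: float,
) -> float:
    """Mean of -log sigma(beta * [log-ratio(winner) - log-ratio(loser)])."""
    total = 0.0
    for pw, pl, rw, rl in zip(logp_w, logp_l, ref_logp_w, ref_logp_l):
        margin = beta * ((pw - rw) - (pl - rl))
        total -= log_sigmoid(margin)
    return total / len(logp_w)


# Toy numbers (hypothetical): two preference pairs.
ref_w, ref_l = [-2.0, -3.0], [-2.5, -1.0]
loss_at_reference = dpo_loss(ref_w, ref_l, ref_w, ref_l, beta=0.1)
loss_after_update = dpo_loss([-1.5, -2.0], [-3.0, -2.0], ref_w, ref_l, beta=0.1)
```

When the policy equals the reference, every margin is zero and the loss is log 2, the value for a 50-50 classifier. Raising the winners' log-probabilities relative to the losers' lowers the loss.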
In statistics, an estimator is misspecified when the true data-generating distribution doesn't belong to the model family being fit. Classic example: fitting a linear model to quadratic data. The fit converges (with infinite data) to the best linear approximation — but that approximation can be misleading.
White (1982) showed that misspecified maximum likelihood estimators converge to the KL-projection of the truth onto the model family. The projection is consistent (it converges) but not to the right thing. DPO exhibits exactly this phenomenon: it's a misspecified estimator in reward function space.
Preference probabilities under the BTL model as the two rewards vary. Notice: only the difference in rewards matters.
Now let's formalize everything. We need precise definitions because the paper's results are about exact mathematical objects, not hand-wavy intuitions.
We have a dataset D of n preference triplets $(s^{(i)}, a_w^{(i)}, a_l^{(i)})$, where $a_w \succ a_l$ means "$a_w$ is preferred over $a_l$ at prompt s." Prompts come from S (finite set), responses from A (finite set). Let m = |S| · |A| be the total number of (prompt, response) pairs.
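The policy $\pi_\theta : S \to \Delta(A)$ is parameterized by $\theta \in \mathbb{R}^d$. The base policy is $\pi_{\theta_0}$. Two key cases:
Tabular: one free parameter per (prompt, response) pair, so d = m and every conditional distribution is representable.
Parametric: shared weights with d << m (an LLM, for example), so only a subset of distributions is representable.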
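For any policy parameter θ, define the implicit reward function:
$$r_\theta^\beta(s, a) = \beta \log \frac{\pi_\theta(a \mid s)}{\pi_{\theta_0}(a \mid s)}$$
Note that $r_{\theta_0}^\beta \equiv 0$ by definition (the base policy has zero implicit reward). This makes sense: we're measuring reward relative to the base.
The set of all achievable implicit rewards forms:
$$\mathcal{R}^\beta = \big\{ r_\theta^\beta : \theta \in \mathbb{R}^d \big\} \subseteq \mathbb{R}^m$$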
For a tabular policy, Rβ = Rm (you can reach any reward). For a parametric policy, Rβ is a d-dimensional manifold embedded in Rm, where d << m. This dimensional mismatch is the source of all trouble.
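With infinite data (population setting), the number of preferences for triplet (s, a, a') is $n_{s,a,a'}$ and the fraction that prefer a over a' is $p^{\mathrm{BTL}}_{s,a,a'}(r^*)$. The population DPO loss is:
$$\mathcal{L}(\theta) = -\sum_{s,a,a'} n_{s,a,a'} \Big[ p^{\mathrm{BTL}}_{s,a,a'}(r^*) \log p^{\mathrm{BTL}}_{s,a,a'}(r_\theta^\beta) + \big(1 - p^{\mathrm{BTL}}_{s,a,a'}(r^*)\big) \log\big(1 - p^{\mathrm{BTL}}_{s,a,a'}(r_\theta^\beta)\big) \Big]$$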
This is a weighted cross-entropy loss. Each pair (s, a, a') contributes a binary cross-entropy term, weighted by how many times that pair was compared (ns,a,a'). The "label" is the true BTL preference probability; the "prediction" is the preference probability induced by the policy's implicit reward.
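A critical object: the d × m matrix $A_{\theta_0}$ whose (s, a)-th column is $\nabla_\theta \log \pi_{\theta_0}(a \mid s)$. This is the score function of the base policy, evaluated at every (prompt, response) pair.
Why does this matter? Because a first-order Taylor expansion of the implicit reward around θ0 gives the linearized implicit reward:
$$r_\theta^\beta \approx \beta\, A_{\theta_0}^\top (\theta - \theta_0)$$
So the column space $C(A_{\theta_0}^\top)$ is the linearized implicit reward manifold. Whatever isn't in this column space is invisible to DPO.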
A 1D policy class (single parameter θ) creates a 1D line of implicit rewards in 3D reward space. The true reward r* is off the line. DPO projects onto the line, and different true rewards project to different (potentially bad) policies.
This is the paper's first main theorem (Proposition 1). It reveals what DPO actually does in the population setting.
Let's derive this step by step, filling in the gaps the paper skips.
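Step 1. The population DPO loss is:
$$\mathcal{L}(\theta) = -\sum_{s,a,a'} n_{s,a,a'} \big[ p^* \log q(\theta) + (1 - p^*) \log(1 - q(\theta)) \big]$$
where $p^* = p^{\mathrm{BTL}}_{s,a,a'}(r^*)$ is the true preference probability and $q(\theta) = p^{\mathrm{BTL}}_{s,a,a'}(r_\theta^\beta)$ is the model's predicted preference probability.
Step 2. Rewrite using the identity $H(p, q) = H(p) + d_{\mathrm{KL}}(p \,\|\, q)$, where H(p, q) is binary cross-entropy and H(p) is entropy:
$$\mathcal{L}(\theta) = \sum_{s,a,a'} n_{s,a,a'} \big[ H(p^*) + d_{\mathrm{KL}}\big(p^* \,\|\, q(\theta)\big) \big]$$
The entropy term H(p*) doesn't depend on θ, so:
$$\arg\min_\theta \mathcal{L}(\theta) = \arg\min_\theta \sum_{s,a,a'} n_{s,a,a'}\, d_{\mathrm{KL}}\big(p^* \,\|\, q(\theta)\big)$$
Step 3. Since θ determines $r_\theta^\beta$, and $r_\theta^\beta$ ranges over $\mathcal{R}^\beta$ as θ varies, this is equivalent to:
$$r_{\mathrm{DPO}} = \arg\min_{r \in \mathcal{R}^\beta} \sum_{s,a,a'} n_{s,a,a'}\, d_{\mathrm{KL}}\big(p^{\mathrm{BTL}}_{s,a,a'}(r^*) \,\|\, p^{\mathrm{BTL}}_{s,a,a'}(r)\big)$$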
This is a weighted KL-projection of r* onto Rβ. QED.
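Step 2 rests on the cross-entropy decomposition for Bernoulli distributions, which can be checked numerically. The sketch below uses illustrative helper names that are not from the paper.

```python
import math


def entropy(p: float) -> float:
    """Entropy H(p) of a Bernoulli(p) distribution, for 0 < p < 1."""
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def cross_entropy(p: float, q: float) -> float:
    """Binary cross-entropy H(p, q) between Bernoulli(p) and Bernoulli(q)."""
    return -(p * math.log(q) + (1.0 - p) * math.log(1.0 - q))


def kl(p: float, q: float) -> float:
    """KL divergence d_KL(Bernoulli(p) || Bernoulli(q))."""
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))


p_true = 0.73  # e.g. a true BTL preference probability
# H(p, q) - H(p) - d_KL(p || q) should vanish for every q.
gaps = [cross_entropy(p_true, q) - entropy(p_true) - kl(p_true, q)
        for q in (0.1, 0.5, 0.73, 0.9)]
```

Because the entropy term is constant in q, minimizing cross-entropy over the model's predictions is the same as minimizing the KL divergence, which is what turns the DPO loss into a projection.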
Three critical implications:
1. Realizable case (r* ∈ Rβ): The KL divergence is zero at r* itself, so the projection trivially finds r*, and DPO recovers the RLHF-optimal policy. This is the tabular case where DPO was designed to work.
2. Misspecified case (r* ∉ Rβ): The projection finds the closest point on Rβ in the weighted KL sense. But "closest in weighted KL" doesn't mean "best policy." The weights ns,a,a' — how many times each pair was compared — control where the projection lands. Different preference distributions give different answers.
3. The weights are the problem: In RLHF, the preference counts affect how well you learn the reward (statistical efficiency), but the learned reward is still the right one with enough data. In DPO, the preference counts determine which wrong answer you converge to. This is fundamentally different.
Let's trace through the paper's 3-response example from Proposition 3. One prompt, three responses a1, a2, a3 with true rewards r* = [1, 2, 0] (after shifting so the minimum is 0). Policy: $\pi_\theta = [e^{\theta}, e^{-\theta}, 1]/Z$ with $Z = e^{\theta} + e^{-\theta} + 1$, base: θ0 = 0 (uniform).
The Jacobian: Aθ0 = [1, −1, 0]. So the implicit reward manifold is span([1, −1, 0]) — a line in R3.
The true reward r* = [1, 2, 0] is not on this line (it would need to be of the form [c, −c, 0]). DPO must project.
Now the projection depends on the preference counts. If n1,3 (comparisons between a1 and a3) dominates, DPO is primarily trying to match the a1 vs. a3 preference probability. Since r*(a1) − r*(a3) = 1 > 0, DPO will set the implicit reward to have rθ(a1) − rθ(a3) > 0, which means rθ ≈ [α, −α, 0] for some α > 0. This makes a1 preferred over a2 — a preference reversal since the true ordering is a2 ≻ a1 ≻ a3.
The true reward r* (gold star) is projected onto the 1D manifold (red line). As the preference count ratio changes, the projection point (orange dot) slides along the manifold, which can produce a preference reversal.
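To see how the counts move the projection, the population loss for this example can be minimized directly. The following sketch assumes β = 1 and uses bisection on the derivative of the loss (the loss is convex in θ because every pairwise logit is linear in θ). The function names and the specific counts are illustrative choices, not values from the paper.

```python
import math

R_TRUE = [1.0, 2.0, 0.0]      # true rewards r*
DIRECTION = [1.0, -1.0, 0.0]  # implicit reward differences follow beta * theta * DIRECTION


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def loss_gradient(theta: float, counts: dict, beta: float = 1.0) -> float:
    """d/dtheta of the population DPO loss for the 3-response example."""
    grad = 0.0
    for (i, j), n in counts.items():
        coeff = beta * (DIRECTION[i] - DIRECTION[j])  # logit(i over j) = coeff * theta
        p_true = sigmoid(R_TRUE[i] - R_TRUE[j])
        grad += n * coeff * (sigmoid(coeff * theta) - p_true)
    return grad


def dpo_projection(counts: dict, beta: float = 1.0) -> float:
    """Minimizer theta of the convex population loss, found by bisection."""
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if loss_gradient(mid, counts, beta) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Keys are 0-based index pairs: (0, 2) is the a1-vs-a3 comparison.
theta_skewed = dpo_projection({(0, 1): 1, (0, 2): 100, (1, 2): 1})
theta_balanced = dpo_projection({(0, 1): 1, (0, 2): 1, (1, 2): 1})
print(theta_skewed, theta_balanced)
```

A positive θ raises a1 above a2 (the reversal described above), while a negative θ favors a2 and matches the true ordering. With these illustrative counts, the skewed dataset lands on the wrong side and the balanced one does not.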
Now we see DPO fail. Not because of bad data, not because of poor optimization, not because of insufficient training — but because of the geometry of misspecification.
This is the paper's most striking result. Let's set it up carefully.
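Setup: One prompt. Three responses a1, a2, a3. True rewards: r* = [2, 3, 1]. True preference order: a2 ≻ a1 ≻ a3. Policy: $\pi_\theta = [e^{\theta}, e^{-\theta}, 1]/Z$ with a single parameter θ and $Z = e^{\theta} + e^{-\theta} + 1$. Base policy: θ0 = 0 (uniform, each action gets probability 1/3).
Since BTL only cares about differences, we shift to r* = [1, 2, 0]. The Jacobian at θ0 = 0 collects the score of each response. Because $\frac{d}{d\theta} \log Z = (e^{\theta} - e^{-\theta})/Z$ vanishes at θ = 0:
$$A_{\theta_0} = \left[\tfrac{d}{d\theta} \log \pi_\theta(a_1),\ \tfrac{d}{d\theta} \log \pi_\theta(a_2),\ \tfrac{d}{d\theta} \log \pi_\theta(a_3)\right]_{\theta = 0} = [1, -1, 0]$$
So $\mathcal{R}^\beta \approx \mathrm{span}([1, -1, 0])$. The manifold is a line through the origin in the direction [1, −1, 0].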
Suppose the dataset is imbalanced: lots of (a1 vs. a3) comparisons, few (a1 vs. a2) or (a2 vs. a3) comparisons. Then DPO mostly cares about matching the a1 vs. a3 preference.
True preference: P(a1 ≻ a3) = σ(1 − 0) = σ(1) ≈ 0.73. DPO finds an implicit reward on the line [α, −α, 0] that matches this. Setting r(a1) − r(a3) = α − 0 = α, DPO needs σ(α) ≈ 0.73, giving α ≈ 1.
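The resulting implicit reward: rDPO ≈ [1, −1, 0]. Now check the implied preference order:
$$r_{\mathrm{DPO}}(a_1) = 1 > r_{\mathrm{DPO}}(a_3) = 0 > r_{\mathrm{DPO}}(a_2) = -1 \quad\Longrightarrow\quad a_1 \succ a_3 \succ a_2$$
The true order is a2 ≻ a1 ≻ a3, so the best response a2 is now ranked last: a preference reversal.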
It gets worse. The expected reward under the DPO policy is lower than under the base policy.
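With θ = α > 0 (taking β = 1, so that the implicit reward gap r(a1) − r(a3) equals θ):
$$\pi_\theta = \frac{[e^{\alpha},\ e^{-\alpha},\ 1]}{e^{\alpha} + e^{-\alpha} + 1}, \qquad \mathbb{E}_{\pi_\theta}[r^*] = \frac{1 \cdot e^{\alpha} + 2 \cdot e^{-\alpha} + 0 \cdot 1}{e^{\alpha} + e^{-\alpha} + 1}$$
Let's compute for α = 1:
$$\mathbb{E}_{\pi_\theta}[r^*] = \frac{2.718 + 2 \cdot 0.368}{2.718 + 0.368 + 1} = \frac{3.454}{4.086} \approx 0.845$$
Compare with the base policy (θ = 0):
$$\mathbb{E}_{\pi_{\theta_0}}[r^*] = \tfrac{1}{3}(1 + 2 + 0) = 1.0$$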
So DPO achieves expected reward 0.845 vs. the base policy's 1.0. DPO made things worse.
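These numbers can be reproduced in a few lines. The sketch below assumes the softmax policy from the setup with β = 1; the function names are illustrative.

```python
import math

R_TRUE = [1.0, 2.0, 0.0]  # shifted true rewards r*


def policy(theta: float) -> list:
    """pi_theta = [e^theta, e^-theta, 1] / Z."""
    weights = [math.exp(theta), math.exp(-theta), 1.0]
    z = sum(weights)
    return [w / z for w in weights]


def expected_reward(theta: float) -> float:
    """Expected true reward under pi_theta."""
    return sum(p * r for p, r in zip(policy(theta), R_TRUE))


reward_dpo = expected_reward(1.0)   # DPO solution with alpha = 1
reward_base = expected_reward(0.0)  # uniform base policy
print(round(reward_dpo, 3), round(reward_base, 3))  # prints: 0.845 1.0
```

Evaluating `expected_reward` at a negative θ, which shifts mass toward a2, gives a value above 1.0, so the degradation comes from moving in the wrong direction rather than from moving at all.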
Two-stage RLHF, by contrast, would learn r* accurately in stage 1, then optimize the policy in stage 2. RLHF's expected reward can only increase from the base policy (the base policy is always a feasible point for the KL-regularized objective, so the optimizer does at least as well).
The killer: which failure mode you get depends entirely on which pairs are compared more often.
Remark 4 in the paper emphasizes: this failure is not due to insufficient coverage. Song et al. (2024) argued that DPO needs a coverage condition $\max_{s,a} \pi_\theta(a \mid s)/\pi_{\theta_0}(a \mid s) \le C$. In our example, the uniform base policy assigns probability 1/3 to every response, so the ratio never exceeds C = 3 and coverage is as good as it can be. But DPO still fails. The problem is geometric (misspecification), not statistical (coverage).
The paper's 3-response, 1-parameter counterexample. As the preference pair counts change, the DPO projection slides along the red manifold line, causing preference reversal and reward degradation.
Having seen DPO fail, the paper now asks: what does two-stage RLHF actually compute, geometrically? Understanding RLHF's local behavior will reveal the path to fixing DPO.
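We approximate J(θ; r*) around the base policy θ0 using Taylor expansions. The expected reward is linear in θ (first-order), and the KL penalty is quadratic (second-order):
$$J(\theta; r^*) \approx J(\theta_0; r^*) + (\theta - \theta_0)^\top A_{\rho,\theta_0}\, r^* - \frac{\beta}{2} (\theta - \theta_0)^\top F_{\rho,\theta_0} (\theta - \theta_0)$$
Let's define the key matrices:
$D_{\rho,\theta_0}$ is the m × m diagonal matrix with entries $\rho(s)\, \pi_{\theta_0}(a \mid s)$.
$A_{\rho,\theta_0} = A_{\theta_0} D_{\rho,\theta_0}$ is the probability-weighted score matrix (d × m).
$F_{\rho,\theta_0} = A_{\theta_0} D_{\rho,\theta_0} A_{\theta_0}^\top$ is the Fisher information matrix of the base policy (d × d).
Taking the gradient of the quadratic approximation and setting it to zero:
$$A_{\rho,\theta_0}\, r^* - \beta\, F_{\rho,\theta_0} (\theta - \theta_0) = 0$$
Solving:
$$\theta^* = \theta_0 + \frac{1}{\beta} F_{\rho,\theta_0}^{\dagger} A_{\rho,\theta_0}\, r^*$$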
Here's the key insight for fixing DPO. The RLHF solution θ* depends on r* only through the product Aρ,θ0 r*. This means:
All reward functions in the same equivalence class produce the same RLHF-optimal policy. Two rewards r1, r2 are equivalent if and only if r1 − r2 ∈ N(Aρ,θ0) — they differ by a nullspace element.
Worked example: For our 1D policy with Aθ0 = [1, −1, 0] and uniform base policy, Aρ,θ0 = (1/3)[1, −1, 0]. Its nullspace N(Aρ,θ0) = {[a, a, b] : a, b ∈ R} — any reward where a1 and a2 have the same reward (regardless of a3's reward).
So r* = [1, 2, 0] and r* + [c, c, d] = [1+c, 2+c, d] all yield the same RLHF policy. That's because the 1D policy can only control the ratio of a1 to a2, and any reward with r(a2) − r(a1) = 1 gives the same θ*.
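The invariance can be checked by computing the linearized RLHF solution for several rewards in the same class. This is a sketch under the assumptions of the worked example (uniform base policy, scalar θ, θ0 = 0). With a scalar parameter the Fisher matrix F = A D Aᵀ is a single number, so the pseudoinverse is ordinary division. Function and variable names are illustrative.

```python
A = [1.0, -1.0, 0.0]    # scores d/dtheta log pi_theta(a_k) at theta_0 = 0
BASE = [1.0 / 3.0] * 3  # uniform base policy pi_{theta_0}


def rlhf_theta(reward: list, beta: float = 1.0) -> float:
    """Linearized RLHF solution theta* = (1 / beta) * F^{-1} * (A_rho r), with theta_0 = 0."""
    a_rho_r = sum(a * p * r for a, p, r in zip(A, BASE, reward))  # A_rho r
    fisher = sum(a * p * a for a, p in zip(A, BASE))              # F = A D A^T = 2/3
    return a_rho_r / (beta * fisher)


r_star = [1.0, 2.0, 0.0]
# (c, d) pairs: each adds the nullspace element [c, c, d] to r*.
shifts = [(0.0, 0.0), (5.0, -3.0), (-2.0, 7.0)]
thetas = [rlhf_theta([r_star[0] + c, r_star[1] + c, r_star[2] + d]) for c, d in shifts]
print(thetas)
```

All three rewards give θ* ≈ −0.5 at β = 1. The value is negative, so the RLHF policy shifts probability toward a2, in line with the true ordering.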
DPO's implicit reward $r_\theta^\beta$ is the minimum-norm representative of the RLHF equivalence class $\mathcal{R}_{\mathrm{eq}}^\beta(\theta)$, measured in the Mahalanobis norm $\|r\|_{D_{\rho,\theta_0}}$.
In other words: for each θ, there's a whole affine subspace of rewards that would make RLHF choose θ. DPO picks the shortest one. This is elegant — but it means DPO is constrained to the column space C(Aθ0T), while the true reward r* may require a nullspace component to be properly represented.
The column space (red line) and nullspace (blue arrows) partition reward space. All rewards in the same equivalence class (blue line) map to the same RLHF policy. DPO can only search the red line.
Now the fix. The authors have identified the problem: DPO searches the column space C(Aθ0T), but the true reward has a nullspace component that distorts the projection. The solution: search both spaces simultaneously.
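Introduce auxiliary variables $\delta \in N(A_{\rho,\theta_0})$ that represent the nullspace component of the reward. Optimize the DPO loss jointly over θ (column space) and δ (nullspace):
$$\min_{\theta \in \mathbb{R}^d,\ \delta \in N(A_{\rho,\theta_0})} \mathcal{L}(\theta, \delta)$$
where L(θ, δ) is the DPO loss but with the reward replaced by $r_\theta^\beta + \delta$:
$$\mathcal{L}(\theta, \delta) = -\sum_{s,a,a'} n_{s,a,a'} \Big[ p^{\mathrm{BTL}}_{s,a,a'}(r^*) \log p^{\mathrm{BTL}}_{s,a,a'}(r_{\theta,\delta}^\beta) + \big(1 - p^{\mathrm{BTL}}_{s,a,a'}(r^*)\big) \log\big(1 - p^{\mathrm{BTL}}_{s,a,a'}(r_{\theta,\delta}^\beta)\big) \Big]$$
where $r_{\theta,\delta}^\beta = r_\theta^\beta + \delta$.
By the rank-nullity theorem (using that $A_{\rho,\theta_0}$ and $A_{\theta_0}$ have the same rank when every entry of $D_{\rho,\theta_0}$ is positive):
$$\dim C(A_{\theta_0}^\top) + \dim N(A_{\rho,\theta_0}) = \mathrm{rank}(A_{\theta_0}) + \big(m - \mathrm{rank}(A_{\theta_0})\big) = m$$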
So searching column space (θ) + nullspace (δ) covers the full m-dimensional reward space Rm. The reward r* = rθ*β + δ* is now realizable in the augmented representation. The misspecification vanishes.
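For the running 3-response example, the decomposition can be written out explicitly. The sketch below assumes β = 1, the linearized implicit reward β·θ·[1, −1, 0], and the nullspace {[a, a, b]} from the worked example; it solves for the θ whose residual r* − rθ lies in that nullspace. Names are illustrative.

```python
R_STAR = [1.0, 2.0, 0.0]
DIRECTION = [1.0, -1.0, 0.0]  # column-space direction A^T
BASE = [1.0 / 3.0] * 3        # uniform base policy, used for the nullspace check


def decompose(reward: list, beta: float = 1.0):
    """Split reward into beta * theta * DIRECTION plus delta with delta[0] == delta[1]."""
    # Nullspace condition delta_1 = delta_2:
    #   reward[0] - beta * theta = reward[1] + beta * theta
    theta = (reward[0] - reward[1]) / (2.0 * beta)
    implicit = [beta * theta * d for d in DIRECTION]
    delta = [r - v for r, v in zip(reward, implicit)]
    return theta, implicit, delta


theta, implicit, delta = decompose(R_STAR)
# A_rho delta should be zero if delta really lies in the nullspace.
a_rho_delta = sum(a * p * d for a, p, d in zip(DIRECTION, BASE, delta))
print(theta, delta, a_rho_delta)
```

The recovered θ = −0.5 coincides with the linearized RLHF solution (1/β) F† Aρ r* for this example, and the leftover δ = [1.5, 1.5, 0] is exactly the part of r* that the policy cannot express.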
The theoretical formulation requires knowing N(Aρ,θ0) and optimizing δ over it. In practice, we can't compute the nullspace of a matrix with millions of rows. The paper uses a clever relaxation:
Step 1: Discretize δ. Instead of defining δ over all m = |S| · |A| pairs, define it only at the 2n data points that appear in the dataset: $\delta = \{\delta(s^{(i)}, a_w^{(i)}),\ \delta(s^{(i)}, a_l^{(i)})\}_{i=1}^{n} \in \mathbb{R}^{2n}$.
Step 2: Enforce the nullspace constraint via a penalty. Replace the hard constraint $\delta \in N(A_{\rho,\theta_0})$ with a penalty term $\|A_{\rho,\theta_0} \delta\|^2$. Since $A_{\rho,\theta_0} \delta = \mathbb{E}_{\rho, \pi_{\theta_0}}[\delta(s, a)\, \nabla \log \pi_{\theta_0}(a \mid s)]$, this can be estimated from the dataset.
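The empirical AuxDPO loss (written here with sample-mean normalization; λ > 0 is the penalty weight):
$$\hat{\mathcal{L}}(\theta, \delta) = -\frac{1}{n} \sum_{i=1}^{n} \log \sigma\Big( r_\theta^\beta(s^{(i)}, a_w^{(i)}) - r_\theta^\beta(s^{(i)}, a_l^{(i)}) + \delta(s^{(i)}, a_w^{(i)}) - \delta(s^{(i)}, a_l^{(i)}) \Big) + \lambda \Big\| \frac{1}{n} \sum_{i=1}^{n} \Big[ \delta(s^{(i)}, a_w^{(i)})\, \nabla \log \pi_{\theta_0}(a_w^{(i)} \mid s^{(i)}) + \delta(s^{(i)}, a_l^{(i)})\, \nabla \log \pi_{\theta_0}(a_l^{(i)} \mid s^{(i)}) \Big] \Big\|^2$$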
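```python
# AuxDPO training loss (PyTorch)
import torch
import torch.nn.functional as F


def auxdpo_loss(theta, delta, batch, ref_model, beta, lam, score_w, score_l):
    """AuxDPO loss for one batch.

    theta:     the trainable policy model
    delta:     learnable tensor of shape [2n], one scalar per (prompt, response)
    batch:     dict with prompts "s", responses "a_w" / "a_l", dataset indices "i"
    score_w/l: cached grad log pi_{theta_0}(a | s) for the batch, shape [batch, d]
    log_prob(model, s, a) is a helper assumed to be defined elsewhere; it should
    return sequence log-probabilities of shape [batch_size].
    """
    s, a_w, a_l, i = batch["s"], batch["a_w"], batch["a_l"], batch["i"]

    # 1. Compute implicit rewards (same as DPO); no gradients through the reference
    with torch.no_grad():
        ref_logp_w = log_prob(ref_model, s, a_w)
        ref_logp_l = log_prob(ref_model, s, a_l)
    r_w = beta * (log_prob(theta, s, a_w) - ref_logp_w)  # shape: [batch_size]
    r_l = beta * (log_prob(theta, s, a_l) - ref_logp_l)  # shape: [batch_size]

    # 2. Add auxiliary variables (per-datapoint scalars)
    delta_w = delta[2 * i]      # shape: [batch_size]
    delta_l = delta[2 * i + 1]  # shape: [batch_size]

    # 3. AuxDPO logit = DPO logit + delta correction
    logit = (r_w - r_l) + (delta_w - delta_l)

    # 4. Binary cross-entropy loss (same as DPO)
    bce = -F.logsigmoid(logit).mean()

    # 5. Nullspace penalty: ||A_{rho,theta_0} delta||^2, estimated on the batch
    a_delta = (delta_w[:, None] * score_w + delta_l[:, None] * score_l).mean(0)  # shape: [d]
    penalty = (a_delta ** 2).sum()

    return bce + lam * penalty
```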
DPO (orange) projects r* onto the column space line. AuxDPO (green) adds a nullspace shift δ so the augmented reward rθβ + δ can reach r*. The result: θ lands at the correct RLHF solution.
Theory says AuxDPO should fix DPO's misspecification. Does it work in practice? The paper tests on two fronts: a didactic bandit setting (where we can verify the theory exactly) and real LLM alignment tasks.
The 3-response, 1-parameter example from Proposition 3 is implemented with a log-linear policy. With n1,3 dominating, DPO produces preference reversal as predicted. AuxDPO with λ = 1.0 correctly recovers θ < 0, favoring a2. The auxiliary variable δ absorbs the nullspace component, steering the projection to the correct equivalence class.
Training data: UltraFeedback (Cui et al., 2024) and its binarized version. Models: Llama3.1-8B, Llama3.2-1B, Qwen3-0.6B.
| Model | Dataset | Setting | DPO | AuxDPO | IPO | DPOP |
|---|---|---|---|---|---|---|
| Llama3.1-8B | MMLU-Pro | ID | 57.14 | 63.26 | 59.18 | 61.22 |
| Llama3.1-8B | MMLU-Pro | OOD | 8.16 | 14.28 | 10.20 | 6.12 |
| Llama3.1-8B | RewardBench v2 | ID | 56.01 | 66.72 | 61.34 | 62.27 |
| Llama3.1-8B | RewardBench v2 | OOD | 14.31 | 32.44 | 20.17 | 19.87 |
| Llama3.2-1B | MMLU-Pro | ID | 39.58 | 45.83 | 43.75 | 44.21 |
| Llama3.2-1B | MMLU-Pro | OOD | 6.25 | 12.52 | 14.58 | 4.16 |
| Llama3.2-1B | RewardBench v2 | ID | 77.21 | 86.37 | 69.72 | 71.21 |
| Llama3.2-1B | RewardBench v2 | OOD | 14.11 | 43.27 | 20.42 | 18.76 |
| Qwen3-0.6B | MMLU-Pro | ID | 53.12 | 61.78 | 47.48 | 56.67 |
| Qwen3-0.6B | MMLU-Pro | OOD | 11.34 | 22.22 | 15.56 | 17.78 |
| Qwen3-0.6B | RewardBench v2 | ID | 55.10 | 65.31 | 53.06 | 51.02 |
| Qwen3-0.6B | RewardBench v2 | OOD | −8.16 | 18.36 | −8.23 | −6.25 |
Values show % change in mean accuracy relative to the base policy. Negative = degradation from base policy.
Several patterns jump out:
1. AuxDPO wins almost everywhere. On 11 of the 12 (model, dataset, setting) combinations in the table, AuxDPO has the highest score. The exception is Llama3.2-1B on MMLU-Pro OOD, where IPO (14.58) edges out AuxDPO (12.52). DPO is never best.
2. The OOD gap is huge. On Llama3.2-1B RewardBench v2 OOD, AuxDPO achieves +43.27% vs. DPO's +14.11%. That's a 3x improvement in OOD generalization. This suggests that DPO's misspecification causes it to overfit to distribution-specific artifacts, while AuxDPO finds more robust alignments.
3. DPO can be catastrophically bad. On Qwen3-0.6B RewardBench v2 OOD, DPO scores −8.16% — meaning it's worse than the base model. This is exactly the reward degradation predicted by the theory (Proposition 3). AuxDPO scores +18.36%, a swing of over 26 percentage points.
4. Model size matters. The smaller the model (fewer parameters d relative to task complexity m), the worse the misspecification. Qwen3-0.6B has the most severe DPO failures, which makes sense: fewer parameters means a lower-dimensional manifold, means more reward functions are unrealizable.
The paper also reports per-subject accuracies on MMLU-Pro. Some highlights:
The gains are not uniform across subjects, but AuxDPO is never worse than DPO in any subject.
Accuracy improvement over base policy (% change) across model sizes and settings. Bars below zero mean the method is worse than doing nothing.
| Equation | What it says | When to use it |
|---|---|---|
| $r_\theta^\beta(s,a) = \beta \log(\pi_\theta / \pi_{\mathrm{ref}})$ | Implicit reward: every policy induces a reward via log-ratio | Understanding DPO's reward space |
| $\mathcal{R}^\beta = \{r_\theta^\beta : \theta \in \mathbb{R}^d\}$ | Implicit reward manifold: d-dimensional subset of $\mathbb{R}^m$ | Checking if misspecification applies |
| $r_{\mathrm{DPO}} = \arg\min_{r \in \mathcal{R}^\beta} \sum n \cdot d_{\mathrm{KL}}$ | DPO = weighted KL-projection of r* onto $\mathcal{R}^\beta$ | Understanding DPO's behavior |
| $r_\theta^\beta \approx \beta A_{\theta_0}^\top(\theta - \theta_0)$ | Linearized implicit reward: lives in $C(A_{\theta_0}^\top)$ | Local analysis near base policy |
| $\theta^* = \theta_0 + (1/\beta) F^\dagger A_\rho r^*$ | RLHF solution = natural policy gradient step | Understanding the target |
| $\mathcal{R}_{\mathrm{eq}}^\beta(\theta) = \{r : A r = \beta F(\theta - \theta_0)\}$ | RLHF equivalence class: all rewards yielding same θ* | Understanding why many rewards are equivalent |
| $\mathcal{L}_{\mathrm{AuxDPO}} = \mathcal{L}_{\mathrm{DPO}}(r_\theta + \delta) + \lambda \lVert A\delta \rVert^2$ | AuxDPO loss: DPO + nullspace auxiliary variables + penalty | Implementation |
This paper reveals a fundamental tension in direct alignment: DPO's elegance comes from reparameterizing the reward in terms of the policy, but this reparameterization constrains the reward space to a low-dimensional manifold. The constraint is invisible in the tabular case (where the manifold is the whole space) but becomes a source of systematic error for parametric policies.
AuxDPO's fix is principled and minimal: add auxiliary variables that search the directions the policy can't reach. The cost is small (2n extra scalars, one hyperparameter), and the theory guarantees recovery of the RLHF solution. The practical gains are substantial, especially for smaller models where misspecification is more severe.
The deeper lesson: whenever you reparameterize one quantity in terms of another (reward in terms of policy, here), you inherit the latter's capacity limitations. If those limitations don't match the true data, you're misspecified — and no amount of data will save you. Only expanding the representational capacity (here, via auxiliary variables) can fix a misspecification problem.