Gopalan, Chowdhury & Banerjee — IISc / IIT Kanpur / HP AI, ICLR 2026 Oral

Why DPO Is a Misspecified Estimator

DPO secretly solves a misspecified statistical estimation problem. When the true reward can't be expressed by your policy class, DPO projects onto the wrong answer — reversing preferences and degrading reward. AuxDPO fixes this with auxiliary degrees of freedom.

Prerequisites: DPO basics + KL divergence + Bradley-Terry model + Linear algebra (nullspace, column space)
10 chapters · 8+ simulations

Chapter 0: The Problem

You want to align an LLM with human preferences. The standard recipe: collect (prompt, preferred response, rejected response) triplets, then fine-tune the model so it favors what humans favor. Two roads diverge.

Road 1: Two-stage RLHF. First, train a separate reward model on the preferences. Then run PPO (or similar RL) to maximize that reward while staying close to the base model via a KL penalty. This works, but it's expensive — you need a reward model, on-policy rollouts, clip ranges, variance reduction, and careful tuning of half a dozen hyperparameters.

Road 2: DPO (Direct Preference Optimization). Rafailov et al. (2023) showed a beautiful trick: the KL-regularized RL objective has a closed-form solution that links the optimal policy directly to the reward. Reparameterize the reward in terms of the policy, substitute into the Bradley-Terry preference model, and you get a single supervised loss. No reward model, no RL — just minimize one cross-entropy loss. It's simple, stable, and fast. The open-source community adopted it almost overnight.

But there's a catch. A hidden assumption. DPO's derivation assumes the policy class is tabular — meaning it can represent every possible conditional distribution over responses given prompts. In a tabular policy, there's one free parameter per (prompt, response) pair. The KL-regularized objective can be solved in closed form because the optimization is over an unconstrained probability simplex.

Real LLMs are not tabular. A 7B-parameter Transformer parameterizes distributions over an astronomically large set of possible responses (on the order of the vocabulary size raised to the response length) using only 7 billion shared weights. The policy class is a thin manifold embedded in the vast space of all possible distributions. The closed-form solution that DPO relies on may not live on that manifold.

The core question this paper asks: When you minimize the DPO loss over a parametric (non-tabular) policy class, what actually happens? Does it find the RLHF-optimal policy? The answer is no, and the failure modes are severe. DPO can reverse the preference ordering and decrease the expected reward below that of the base policy, and its output depends sensitively on how many times each pair was compared in the dataset.

This isn't a small-sample problem. The failure modes persist even with infinite preference data drawn from a perfect Bradley-Terry model. The issue is geometric: DPO is a misspecified estimator, and misspecified estimators can produce arbitrarily bad results regardless of how much data you feed them.

The Tabular vs. Parametric Gap

Left: in a tabular policy, every reward function can be expressed as an implicit reward. Right: in a parametric policy, only a low-dimensional manifold of rewards is reachable. DPO projects onto this manifold, but the projection can land anywhere.

DPO's derivation assumes the policy class is tabular. What does "tabular" mean here, and why does it matter?

Chapter 1: The Key Insight

Here is the paper's central insight, stated plainly: DPO loss minimization is equivalent to a weighted KL-projection of the true reward function onto the manifold of rewards that the policy class can express.

Let's unpack that one piece at a time.

What is an "implicit reward"?

Given any policy π_θ and a reference policy π_{θ0}, DPO defines an implicit reward function:

r_θ^β(s, a) = β log( π_θ(a|s) / π_{θ0}(a|s) )

This is just the log-probability ratio, scaled by β. Every policy parameter θ induces exactly one implicit reward. The set of all such implicit rewards, as θ varies over ℝ^d, forms a manifold R^β inside the full reward space ℝ^m (where m = |S| · |A|).

What does DPO actually minimize?

When you minimize the DPO loss with infinite data, you're finding the implicit reward in Rβ that is closest to the true reward r* — where "closest" means minimizing a weighted sum of KL divergences between Bernoulli preference probabilities. The weights are the preference pair counts ns,a,a'.

When does this go wrong?

If r* happens to live on the manifold Rβ (i.e., some policy parameter θ yields an implicit reward equal to r*), then DPO finds the RLHF-optimal policy. No problem.

But if r* is off the manifold — which is the generic case when d << m — then DPO projects r* onto Rβ. This projection depends on the weights (preference counts), and can land at a point on the manifold that corresponds to a worse policy than doing nothing at all.

The geometry in one picture: Imagine reward space as a high-dimensional room. The policy class carves out a thin curved surface (manifold) in that room. The true reward r* is a point floating above the surface. DPO drops a perpendicular from r* onto the surface — but the "perpendicular" direction is warped by the preference data distribution. Tilt the data, and the projection slides to a completely different spot on the surface. Some spots correspond to good policies. Others correspond to policies that actively reverse human preferences.

The fix: AuxDPO

The paper then studies the geometry of two-stage RLHF and discovers that RLHF partitions all reward functions into equivalence classes: rewards that differ by a vector in the nullspace of a certain matrix all yield the same optimal policy. DPO can only search the column space. AuxDPO adds auxiliary variables that search the nullspace too, expanding the search from a d-dimensional manifold to the full m-dimensional reward space. The result: the projection is no longer misspecified.

DPO
Projects r* onto Rβ = {β log πθ / πref} — a d-dimensional manifold. Projection depends on preference counts. Can land anywhere.
↓ vs.
RLHF
Learns r* accurately (stage 1), then optimizes policy (stage 2). Equivalence classes: all rewards differing by nullspace elements yield same θ*.
↓ insight
AuxDPO
Adds auxiliary variables δ ∈ N(Aρ,θ0) to DPO loss. Searches column space + nullspace = full Rm. No longer misspecified.
Why does DPO become misspecified for parametric policy classes?

Chapter 2: Prerequisites

Before we dive into the proofs, let's make sure every tool is sharp. This chapter covers the three building blocks: the Bradley-Terry preference model, the DPO loss, and the geometry of misspecified estimation.

The Bradley-Terry-Luce (BTL) Model

Given a prompt s and two responses a, a', how do we model which one a human prefers? The BTL model says: each response has a latent "reward" r*(s, a), and the probability that a is preferred over a' is:

p^BTL_{s,a,a'}(r*) = σ( r*(s, a) − r*(s, a') )

where σ(z) = 1/(1 + e^{−z}) is the sigmoid function.

Let's build some intuition with numbers. Suppose r*(a1) = 3 and r*(a2) = 1. Then:

p^BTL(a1 ≻ a2) = σ(3 − 1) = σ(2) = 1/(1 + e^{−2}) ≈ 0.88

So a1 is preferred 88% of the time. The bigger the reward gap, the stronger the preference. If rewards are equal, preference is 50-50 (since σ(0) = 0.5).

Key property of BTL: Only differences in rewards matter. Adding the same constant to all rewards doesn't change any preference probabilities, because σ(r + c − r' − c) = σ(r − r'). This means r* is really an equivalence class up to a per-prompt constant — a fact that will matter later.
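The two facts above (the 0.88 figure and the shift invariance) are easy to check numerically. The following is a minimal sketch; the function names `sigmoid` and `btl_prob` are illustrative and not taken from the paper.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def btl_prob(r_a: float, r_b: float) -> float:
    """P(a is preferred over b) under the Bradley-Terry-Luce model."""
    return sigmoid(r_a - r_b)

p = btl_prob(3.0, 1.0)                  # worked example: sigma(2)
p_shifted = btl_prob(13.0, 11.0)        # same gap after adding 10 to both rewards
p_equal = btl_prob(0.7, 0.7)            # equal rewards give a coin flip

print(round(p, 4), round(p_shifted, 4), p_equal)
```

Only the gap r(a) − r(a') enters the computation, which is why `p` and `p_shifted` coincide.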

The KL-Regularized RLHF Objective

Given a reward function r* and a reference (base) policy πref = πθ0, the RLHF objective maximizes expected reward while penalizing deviation from the base:

J(θ; r*) = E_{ρ,π_θ}[ r*(s, a) ] − β · D_KL( π_θ(·|s) || π_{θ0}(·|s) )

Here β > 0 controls how far the aligned policy can stray from the base. Large β means "stay close to the base" (conservative alignment). Small β means "chase the reward aggressively."

Worked example: Suppose you have one prompt, two responses, πref(a1) = πref(a2) = 0.5, and r*(a1) = 1, r*(a2) = 0. If β = 1:

J(θ) = πθ(a1) · 1 + πθ(a2) · 0 − 1 · [πθ(a1) log(πθ(a1)/0.5) + πθ(a2) log(πθ(a2)/0.5)]

Setting p = πθ(a1):

J = p − [p log(2p) + (1−p) log(2(1−p))]

For general β the bracketed KL term is multiplied by β. Taking the derivative with respect to p and setting it to zero gives log(p/(1−p)) = 1/β, so the optimal p* = e^{1/β} / (1 + e^{1/β}) = σ(1/β). For β = 1, that's σ(1) ≈ 0.73. So the optimal policy puts 73% probability on the preferred response, not 100%, because the KL penalty holds it back.
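As a sanity check on the closed form, one can maximize the objective by brute force over p and compare. This is a minimal sketch for the two-response example only; `objective`, `argmax_on_grid` and `closed_form` are hypothetical helper names.

```python
import math

def objective(p: float, beta: float) -> float:
    """J(p) = p * 1 + (1 - p) * 0 - beta * KL([p, 1-p] || [0.5, 0.5])."""
    kl = p * math.log(2 * p) + (1 - p) * math.log(2 * (1 - p))
    return p - beta * kl

def argmax_on_grid(beta: float, steps: int = 100_000) -> float:
    # Midpoints of a uniform grid, so p never hits 0 or 1 exactly.
    grid = [(k + 0.5) / steps for k in range(steps)]
    return max(grid, key=lambda p: objective(p, beta))

def closed_form(beta: float) -> float:
    return 1.0 / (1.0 + math.exp(-1.0 / beta))

p_grid = argmax_on_grid(beta=1.0)
p_star = closed_form(beta=1.0)
print(round(p_grid, 4), round(p_star, 4))
```

Raising β in both calls moves the optimum toward 0.5, the base policy, which matches the "large β is conservative" reading.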

The DPO Reparameterization

For a tabular policy class, the RLHF objective has a closed-form maximizer:

π_{θ*}(a|s) = (1/Z*(s)) · π_{θ0}(a|s) · exp( r*(s, a) / β )

Rearranging for the reward:

r*(s, a) = β log( π_{θ*}(a|s) / π_{θ0}(a|s) ) + β log Z*(s)

Substituting this into the BTL model, the Z*(s) terms cancel (since they're the same for both responses at the same prompt), giving:

p^BTL(a ≻ a' | s) = σ( β log( π_{θ*}(a|s) / π_{θ0}(a|s) ) − β log( π_{θ*}(a'|s) / π_{θ0}(a'|s) ) )

DPO says: instead of learning r* first and then optimizing the policy, directly learn θ by treating the above as a binary classification loss. The DPO loss is:

L_DPO(θ) = − ∑_i log σ( β log( π_θ(a_w^(i)|s^(i)) / π_{θ0}(a_w^(i)|s^(i)) ) − β log( π_θ(a_l^(i)|s^(i)) / π_{θ0}(a_l^(i)|s^(i)) ) )
The hidden assumption: The reparameterization step — going from the reward to the log-policy-ratio — uses the closed-form solution of the RLHF objective. This closed-form exists only when the policy class is tabular. For parametric policies, the RLHF maximizer θ* doesn't satisfy the closed-form relation. DPO still minimizes the same loss, but the loss no longer corresponds to learning the true reward. It's trying to fit r* into a space (the implicit reward manifold) that may not contain r*.

Misspecified Estimation

In statistics, an estimator is misspecified when the true data-generating distribution doesn't belong to the model family being fit. Classic example: fitting a linear model to quadratic data. The fit converges (with infinite data) to the best linear approximation — but that approximation can be misleading.

White (1982) showed that misspecified maximum likelihood estimators converge to the KL-projection of the truth onto the model family. The projection is consistent (it converges) but not to the right thing. DPO exhibits exactly this phenomenon: it's a misspecified estimator in reward function space.
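The linear-versus-quadratic example can be made concrete. The sketch below (an illustration, not something from the paper) fits a least-squares line to noiseless samples of y = x² on [0, 1]; `fit_line` and `fit_quadratic_data` are hypothetical names.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y ≈ slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def fit_quadratic_data(n):
    xs = [(k + 0.5) / n for k in range(n)]   # evenly spread over [0, 1]
    ys = [x * x for x in xs]                 # true model is quadratic
    return fit_line(xs, ys)

slope_small, intercept_small = fit_quadratic_data(100)
slope_big, intercept_big = fit_quadratic_data(100_000)
print(slope_small, intercept_small)
print(slope_big, intercept_big)
```

Both sample sizes land on essentially the same line, y ≈ x − 1/6. More data makes the estimate more stable, but it stabilizes on a line that predicts negative values near x = 0, where the true function is never negative.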

BTL Preference Probabilities

Preference probabilities under the BTL model as the two rewards vary. Notice: only the difference in rewards matters.

In the BTL model with rewards r*(a1) = 5 and r*(a2) = 3, what is P(a1 ≻ a2)?

Chapter 3: Setup & Definitions

Now let's formalize everything. We need precise definitions because the paper's results are about exact mathematical objects, not hand-wavy intuitions.

The Data

We have a dataset D of n preference triplets (s^(i), a_w^(i), a_l^(i)), where a_w ≻ a_l means "a_w is preferred over a_l at prompt s." Prompts come from S (finite set), responses from A (finite set). Let m = |S| · |A| be the total number of (prompt, response) pairs.

The Policy Class

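π_θ : S → Δ(A) is parameterized by θ ∈ ℝ^d. The base policy is π_{θ0}. Two key cases recur throughout: the tabular class, which has one free parameter per (prompt, response) pair and can represent every conditional distribution, and the parametric class, where d << m and only a thin subset of distributions is reachable.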

The Implicit Reward

For any policy parameter θ, define the implicit reward function:

r_θ^β(s, a) := β log( π_θ(a|s) / π_{θ0}(a|s) )

Note that r_{θ0}^β ≡ 0 by definition (the base policy has zero implicit reward). This makes sense: we're measuring reward relative to the base.

The Implicit Reward Manifold

The set of all achievable implicit rewards forms:

R^β = { r_θ^β : θ ∈ ℝ^d } ⊂ ℝ^m

For a tabular policy, R^β covers all of ℝ^m up to per-prompt constants, which BTL ignores, so any reward is reachable. For a parametric policy, R^β is an at most d-dimensional manifold embedded in ℝ^m, where d << m. This dimensional mismatch is the source of all trouble.

Worked example (counting dimensions): Suppose you have 1 prompt, 3 responses (m = 3), and a 1-dimensional policy parameter θ via the softmax: π_θ = [e^θ, e^{−θ}, 1] / Z. With base parameter θ0 = 0, the implicit reward is β[θ, −θ, 0] minus the constant β log(Z(θ)/Z(0)), which is the same for all three responses. Since BTL only sees reward differences, that constant is irrelevant, and the implicit reward manifold R^β is effectively the set of vectors β[θ, −θ, 0] for all θ. This is a 1-dimensional line in ℝ^3: the span of [1, −1, 0]. Any true reward r* = [r1, r2, r3] that isn't, up to an additive constant, a scalar multiple of [1, −1, 0] can't be exactly represented. And most rewards aren't on this line.
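To see the one-dimensional manifold concretely, the sketch below computes the exact implicit reward of the softmax example and looks only at reward gaps relative to a3, since gaps are all the BTL model sees. The helper names (`policy`, `implicit_reward`, `gaps`) and the particular values of θ and β are illustrative assumptions.

```python
import math

def policy(theta: float):
    weights = [math.exp(theta), math.exp(-theta), 1.0]
    z = sum(weights)
    return [w / z for w in weights]

def implicit_reward(theta: float, beta: float, theta0: float = 0.0):
    pi, pi0 = policy(theta), policy(theta0)
    return [beta * math.log(p / q) for p, q in zip(pi, pi0)]

def gaps(r):
    """Reward differences relative to a3: [r(a1) - r(a3), r(a2) - r(a3)]."""
    return [r[0] - r[2], r[1] - r[2]]

theta, beta = 0.8, 2.0
g = gaps(implicit_reward(theta, beta))   # equals [beta*theta, -beta*theta]
target = gaps([1.0, 2.0, 0.0])           # gaps of a reward such as [1, 2, 0]
print(g, target)
```

Whatever θ is chosen, the two implicit gaps sum to zero, while the target gaps sum to 3. No value of θ can match both, which is the misspecification in miniature.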

The Population DPO Loss

With infinite data (population setting), the number of preferences for triplet (s, a, a') is n_{s,a,a'} and the fraction that prefer a over a' is p^BTL_{s,a,a'}(r*). The population DPO loss is:

L(θ) = − ∑_{s,a,a'} n_{s,a,a'} [ p^BTL_{s,a,a'}(r*) · log p^BTL_{s,a,a'}(r_θ^β) + (1 − p^BTL_{s,a,a'}(r*)) · log(1 − p^BTL_{s,a,a'}(r_θ^β)) ]

This is a weighted cross-entropy loss. Each pair (s, a, a') contributes a binary cross-entropy term, weighted by how many times that pair was compared (ns,a,a'). The "label" is the true BTL preference probability; the "prediction" is the preference probability induced by the policy's implicit reward.

The Jacobian Matrix Aθ0

A critical object: the d × m matrix where the (s, a)-th column is ∇_θ log π_{θ0}(a|s). This is the score function of the base policy, evaluated at every (prompt, response) pair.

A_{θ0} = [ ∇ log π_{θ0}(a_1|s_1), ∇ log π_{θ0}(a_2|s_1), ..., ∇ log π_{θ0}(a_{|A|}|s_{|S|}) ]

Why does this matter? Because the linearized implicit reward is:

r_θ^β ≈ β · A_{θ0}^T (θ − θ0)

So the column space C(A_{θ0}^T) is the linearized implicit reward manifold. Whatever isn't in this column space is invisible to DPO.
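For the 3-response softmax policy π_θ ∝ [e^θ, e^{−θ}, 1] used in this chapter, the Jacobian and the quality of the linearization can be checked numerically. The sketch below uses a central finite difference in place of an analytic gradient; that choice and the helper names are assumptions made for illustration.

```python
import math

def log_policy(theta: float):
    logits = [theta, -theta, 0.0]
    log_z = math.log(sum(math.exp(x) for x in logits))
    return [x - log_z for x in logits]

def jacobian_fd(theta0: float = 0.0, h: float = 1e-5):
    """Central-difference estimate of d/dtheta log pi_theta(a) at theta0."""
    up, down = log_policy(theta0 + h), log_policy(theta0 - h)
    return [(u - d) / (2 * h) for u, d in zip(up, down)]

def linearization_error(theta: float, beta: float = 1.0) -> float:
    """Max gap between the exact implicit reward and beta * A^T * theta."""
    a = jacobian_fd()
    base = log_policy(0.0)
    exact = [beta * (lp - b) for lp, b in zip(log_policy(theta), base)]
    linear = [beta * a_i * theta for a_i in a]
    return max(abs(e - l) for e, l in zip(exact, linear))

A = jacobian_fd()
err_small = linearization_error(0.01)
err_large = linearization_error(0.1)
print([round(x, 6) for x in A], err_small, err_large)
```

The estimated Jacobian is close to [1, −1, 0], and making θ ten times larger makes the error roughly a hundred times larger, as expected from a first-order approximation with a second-order remainder.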

Implicit Reward Manifold

A 1D policy class (single parameter θ) creates a 1D line of implicit rewards in 3D reward space. The true reward r* is off the line. DPO projects onto the line, and different true rewards project to different (potentially bad) policies.

A policy class has d = 100 parameters and there are m = 10,000 (prompt, response) pairs. What is the dimension of the implicit reward manifold Rβ?

Chapter 4: DPO as KL-Projection

This is the paper's first main theorem (Proposition 1). It reveals what DPO actually does in the population setting.

The Theorem

Proposition 1 (DPO is weighted KL-projection). Assume infinite preference data drawn from BTL(r*), with n_{s,a,a'} preference pairs per triplet. If θ_DPO minimizes the population DPO loss, then its implicit reward satisfies:

r_{θ_DPO}^β = argmin_{r ∈ R^β} ∑_{s,a,a'} n_{s,a,a'} · d_KL( p^BTL_{s,a,a'}(r*) || p^BTL_{s,a,a'}(r) )

where d_KL(p || q) is the KL divergence between Bernoulli(p) and Bernoulli(q).

Derivation: Why This is True

Let's derive this step by step, filling in the gaps the paper skips.

Step 1. The population DPO loss is:

L(θ) = − ∑_{s,a,a'} n_{s,a,a'} [ p* · log q(θ) + (1 − p*) · log(1 − q(θ)) ]

where p* = p^BTL_{s,a,a'}(r*) is the true preference probability and q(θ) = p^BTL_{s,a,a'}(r_θ^β) is the model's predicted preference probability.

Step 2. Rewrite using the identity H(p, q) = H(p) + d_KL(p || q), where H(p, q) is binary cross-entropy and H(p) is entropy:

L(θ) = ∑_{s,a,a'} n_{s,a,a'} · H(p*, q(θ)) = ∑_{s,a,a'} n_{s,a,a'} · [ H(p*) + d_KL(p* || q(θ)) ]

The entropy term H(p*) doesn't depend on θ, so:

argmin_θ L(θ) = argmin_θ ∑_{s,a,a'} n_{s,a,a'} · d_KL( p^BTL_{s,a,a'}(r*) || p^BTL_{s,a,a'}(r_θ^β) )

Step 3. Since θ determines r_θ^β, and r_θ^β ranges over R^β as θ varies, this is equivalent to:

r_{θ_DPO}^β = argmin_{r ∈ R^β} ∑_{s,a,a'} n_{s,a,a'} · d_KL( p^BTL_{s,a,a'}(r*) || p^BTL_{s,a,a'}(r) )

This is a weighted KL-projection of r* onto R^β. QED.
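The identity used in Step 2 is the crux of the argument, so it is worth checking numerically. In the sketch below the counts and probabilities are made-up illustrative values, not data from the paper.

```python
import math

def entropy(p: float) -> float:
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))

def cross_entropy(p: float, q: float) -> float:
    return -(p * math.log(q) + (1 - p) * math.log(1 - q))

def kl(p: float, q: float) -> float:
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Hypothetical pair counts, true preference probabilities, model probabilities.
counts = [5, 1, 20]
p_true = [0.73, 0.27, 0.88]
q_model = [0.60, 0.55, 0.70]

loss = sum(n * cross_entropy(p, q) for n, p, q in zip(counts, p_true, q_model))
constant = sum(n * entropy(p) for n, p in zip(counts, p_true))
weighted_kl = sum(n * kl(p, q) for n, p, q in zip(counts, p_true, q_model))
print(loss, constant + weighted_kl)
```

The two printed numbers agree, and `constant` does not involve the model at all, so minimizing the loss over the model is the same as minimizing the count-weighted KL term.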

What This Means

Three critical implications:

1. Realizable case (r* ∈ Rβ): The KL divergence is zero at r* itself, so the projection trivially finds r*, and DPO recovers the RLHF-optimal policy. This is the tabular case where DPO was designed to work.

2. Misspecified case (r* ∉ Rβ): The projection finds the closest point on Rβ in the weighted KL sense. But "closest in weighted KL" doesn't mean "best policy." The weights ns,a,a' — how many times each pair was compared — control where the projection lands. Different preference distributions give different answers.

3. The weights are the problem: In RLHF, the preference counts affect how well you learn the reward (statistical efficiency), but the learned reward is still the right one with enough data. In DPO, the preference counts determine which wrong answer you converge to. This is fundamentally different.

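Which direction of KL? The loss uses d_KL(p* || q), with the true preference probability in the first slot and the model's in the second. Information geometry calls minimizing over the second argument a reverse I-projection; in machine-learning usage it is the forward KL that maximum likelihood always minimizes, which is mass-covering rather than mode-seeking. The divergence is asymmetric: it grows without bound when q approaches 0 or 1 while p* stays in between, so pairs with large counts n_{s,a,a'} can pull the fit strongly toward their own preference probabilities. This asymmetry contributes to the sensitivity to the preference data distribution.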

Worked Numerical Example

Let's trace through the paper's 3-response example from Proposition 3. One prompt, three responses a1, a2, a3 with true rewards r* = [1, 2, 0] (after shifting so the minimum is 0). Policy: π_θ = [e^θ, e^{−θ}, 1]/Z, base: θ0 = 0 (uniform).

The Jacobian: A_{θ0} = [1, −1, 0]. So the implicit reward manifold is span([1, −1, 0]), a line in ℝ^3.

The true reward r* = [1, 2, 0] is not on this line (it would need to be of the form [c, −c, 0]). DPO must project.

Now the projection depends on the preference counts. If n_{1,3} (comparisons between a1 and a3) dominates, DPO is primarily trying to match the a1 vs. a3 preference probability. Since r*(a1) − r*(a3) = 1 > 0, DPO will set the implicit reward to have r_θ(a1) − r_θ(a3) > 0, which means r_θ ≈ [α, −α, 0] for some α > 0. This makes a1 preferred over a2, which is a preference reversal since the true ordering is a2 ≻ a1 ≻ a3.

DPO as Weighted KL-Projection

The true reward r* (gold star) is projected onto the 1D manifold (red line). As the preference count ratio changes, the projection point (orange dot) slides along the manifold and can cross into preference reversal.

In DPO's weighted KL-projection, what role do the preference pair counts ns,a,a' play?

Chapter 5: Failure Modes

Now we see DPO fail. Not because of bad data, not because of poor optimization, not because of insufficient training — but because of the geometry of misspecification.

The 3-Response Counterexample (Proposition 3)

This is the paper's most striking result. Let's set it up carefully.

Setup: One prompt. Three responses a1, a2, a3. True rewards: r* = [2, 3, 1]. True preference order: a2 ≻ a1 ≻ a3. Policy: π_θ = [e^θ, e^{−θ}, 1]/Z with a single parameter θ. Base policy: θ0 = 0 (uniform, each action gets probability 1/3).

Since BTL only cares about differences, we shift to r* = [1, 2, 0]. The Jacobian at θ0 = 0:

A_{θ0} = (1/3)[1 + 2e^0, −(1 + 2e^0), e^0 − e^0] = [1, −1, 0]

So R^β ≈ span([1, −1, 0]). The manifold is a line through the origin in the direction [1, −1, 0].

When n1,3 Dominates

Suppose the dataset is imbalanced: lots of (a1 vs. a3) comparisons, few (a1 vs. a2) or (a2 vs. a3) comparisons. Then DPO mostly cares about matching the a1 vs. a3 preference.

True preference: P(a1 ≻ a3) = σ(1 − 0) = σ(1) ≈ 0.73. DPO finds an implicit reward on the line [α, −α, 0] that matches this. Setting r(a1) − r(a3) = α − 0 = α, DPO needs σ(α) ≈ 0.73, giving α ≈ 1.

The resulting implicit reward: r_DPO ≈ [1, −1, 0]. Now check the implied preference order:

r_DPO(a1) = 1 > r_DPO(a3) = 0 > r_DPO(a2) = −1, so the implied order is a1 ≻ a3 ≻ a2, whereas the true order is a2 ≻ a1 ≻ a3.

Preference reversal! DPO has placed a2 (the truly best response) at the bottom and a1 at the top. The policy π_θ now assigns the highest probability to the second-best response and the lowest probability to the best response. This isn't a small perturbation: the order of a1 and a2 is completely inverted.
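The reversal can be reproduced by actually minimizing the population DPO loss over θ for the three-response example with r* = [1, 2, 0]. The sketch below is a minimal illustration under stated assumptions: β = 1, the implicit reward is taken as θ · [1, −1, 0] (the additive constant cancels in every pairwise gap), and the two count vectors are hypothetical choices rather than values from the paper.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

R_STAR = [1.0, 2.0, 0.0]
DIRECTION = [1.0, -1.0, 0.0]          # implicit reward gaps scale with this vector
PAIRS = [(0, 1), (0, 2), (1, 2)]      # (a1, a2), (a1, a3), (a2, a3)

def loss_gradient(theta: float, counts, beta: float = 1.0) -> float:
    """Derivative of the population DPO loss, a count-weighted logistic loss."""
    grad = 0.0
    for (a, b), n in zip(PAIRS, counts):
        p_true = sigmoid(R_STAR[a] - R_STAR[b])
        slope = beta * (DIRECTION[a] - DIRECTION[b])
        grad += n * slope * (sigmoid(slope * theta) - p_true)
    return grad

def dpo_theta(counts, lo: float = -20.0, hi: float = 20.0, iters: int = 200) -> float:
    """Bisection on the gradient; valid because the loss is convex in theta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if loss_gradient(mid, counts) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

theta_balanced = dpo_theta([1, 1, 1])     # every pair compared equally often
theta_skewed = dpo_theta([1, 100, 1])     # (a1, a3) comparisons dominate
print(round(theta_balanced, 3), round(theta_skewed, 3))
```

With balanced counts the minimizer is negative, so a2 ends up above a1 as it should. With the skewed counts it is positive and close to 1, which puts a1 above a2 even though every preference label came from the same true reward.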

Reward Degradation

It gets worse. The expected reward under the DPO policy is lower than under the base policy.

With θ = α > 0 (taking β = 1, so that the implicit reward gap α equals θ):

π_θ^T r* = (e^α · 1 + e^{−α} · 2 + 1 · 0) / (1 + e^α + e^{−α})

Let's compute for α = 1:

π_θ^T r* = (e · 1 + e^{−1} · 2 + 0) / (1 + e + e^{−1}) = (2.718 + 0.736) / (1 + 2.718 + 0.368) = 3.454 / 4.086 ≈ 0.845

Compare with the base policy (θ = 0):

π_{θ0}^T r* = (1 · 1 + 1 · 2 + 1 · 0) / 3 = 1.0

So DPO achieves expected reward 0.845 vs. the base policy's 1.0. DPO made things worse.
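The arithmetic above can be reproduced in a few lines. This is a sketch for the softmax example only; the value θ = −0.5 in the last call is an arbitrary negative parameter chosen for contrast, not a number taken from the paper.

```python
import math

R_STAR = [1.0, 2.0, 0.0]

def policy(theta: float):
    weights = [math.exp(theta), math.exp(-theta), 1.0]
    z = sum(weights)
    return [w / z for w in weights]

def expected_reward(theta: float) -> float:
    return sum(p * r for p, r in zip(policy(theta), R_STAR))

base = expected_reward(0.0)        # uniform base policy
dpo = expected_reward(1.0)         # the DPO solution when n_{1,3} dominates
contrast = expected_reward(-0.5)   # moving theta in the other direction
print(round(base, 3), round(dpo, 3), round(contrast, 3))
```

Moving θ in the positive direction lowers the expected reward below the base policy's, while moving it in the negative direction, toward a2, raises it.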

Two-stage RLHF, by contrast, would learn r* accurately in stage 1, then optimize the policy in stage 2. RLHF's expected reward cannot fall below the base policy's: the base policy is always feasible and has zero KL penalty, so the maximizer's objective value is at least the base policy's expected reward, and because the KL penalty is nonnegative, the maximizer's expected reward is at least as large as its objective value.

Sensitivity to Preference Distribution

The killer: which failure mode you get depends entirely on which pairs are compared more often.

The insight: Whether DPO succeeds or fails is determined by a factor the practitioner doesn't control — the distribution of which pairs appear in the dataset. In real preference datasets, this distribution is an artifact of how annotators were assigned, which prompts were sampled, and which responses the base model generated. It's essentially arbitrary. This makes DPO's behavior unpredictable for parametric models.

Not a Coverage Problem

Remark 4 in the paper emphasizes: this failure is not due to insufficient coverage. Song et al. (2024) argued that DPO needs a coverage condition max_{s,a} π_θ(a|s) / π_{θ0}(a|s) ≤ C. In our example the condition holds with C = 3, because the uniform base policy puts probability 1/3 on every response, so coverage is as good as it can be. But DPO still fails. The problem is geometric (misspecification), not statistical (coverage).

DPO Failure Modes: Interactive Counterexample

The paper's 3-response, 1-parameter counterexample. As the preference pair counts change, the DPO projection slides along the red manifold line, causing preference reversal and reward degradation.

In the paper's 3-response counterexample, what causes DPO to reverse preferences?

Chapter 6: RLHF Geometry

Having seen DPO fail, the paper now asks: what does two-stage RLHF actually compute, geometrically? Understanding RLHF's local behavior will reveal the path to fixing DPO.

Local Approximation of the RLHF Objective

We approximate J(θ; r*) around the base policy θ0 using Taylor expansions. The expected reward is linear in θ (first-order), and the KL penalty is quadratic (second-order):

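J(θ; r*) ≈ E_{ρ,π_{θ0}}[ r*(s, a) ] + (θ − θ0)^T A_{ρ,θ0} r* − (β/2) (θ − θ0)^T F_{ρ,θ0} (θ − θ0)

The key matrices, as they are used in the rest of this chapter: A_{ρ,θ0} is the d × m score matrix A_{θ0} with each column weighted by the probability ρ(s) π_{θ0}(a|s) of its (prompt, response) pair, so that A_{ρ,θ0} r = E_{ρ,π_{θ0}}[ r(s, a) ∇ log π_{θ0}(a|s) ]; D_{ρ,θ0} is the m × m diagonal matrix of those weights; and F_{ρ,θ0} = A_{θ0} D_{ρ,θ0} A_{θ0}^T is the d × d Fisher information matrix of the base policy.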

The Natural Gradient Solution

Taking the gradient of the quadratic approximation and setting it to zero:

∇_θ J = A_{ρ,θ0} r* − β F_{ρ,θ0} (θ − θ0) = 0

Solving (assuming F_{ρ,θ0} is invertible; otherwise read the inverse as a pseudo-inverse):

θ* = θ0 + (1/β) F_{ρ,θ0}^{−1} A_{ρ,θ0} r*
This is a natural policy gradient step! The update direction F^{−1} ∇J is the natural gradient: the steepest ascent direction measured in KL divergence rather than Euclidean distance. Kakade (2001) introduced the natural policy gradient precisely because it accounts for the geometry of probability distributions. Within this local approximation, the RLHF solution is exactly one natural gradient step from the base policy, with step size 1/β.
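For the one-parameter, three-response example this step can be computed by hand or in code. The sketch below assumes the matrices are the probability-weighted score product and the Fisher information of the base policy (consistent with the expectation form of A_{ρ,θ0} δ given later in the text), with a single prompt and a uniform base policy; `natural_gradient_solution` is a hypothetical name.

```python
A = [1.0, -1.0, 0.0]             # score of the base policy at theta0 = 0
PI0 = [1 / 3, 1 / 3, 1 / 3]      # uniform base policy, one prompt
R_STAR = [1.0, 2.0, 0.0]

def natural_gradient_solution(beta: float, theta0: float = 0.0) -> float:
    # A_{rho,theta0} r* = E_{pi0}[ r*(a) * score(a) ]
    a_rho_r = sum(p * a * r for p, a, r in zip(PI0, A, R_STAR))
    # F_{rho,theta0} = E_{pi0}[ score(a)^2 ]  (a scalar, since d = 1)
    fisher = sum(p * a * a for p, a in zip(PI0, A))
    return theta0 + a_rho_r / (beta * fisher)

theta_star = natural_gradient_solution(beta=1.0)
theta_star_conservative = natural_gradient_solution(beta=5.0)
print(theta_star, theta_star_conservative)
```

Here A_{ρ,θ0} r* = −1/3 and F = 2/3, so θ* = −1/(2β). The sign is negative for every β, meaning the locally optimal RLHF policy shifts probability from a1 toward a2, and a larger β simply shortens the step.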

Equivalence Classes of Reward Functions

Here's the key insight for fixing DPO. The RLHF solution θ* depends on r* only through the product Aρ,θ0 r*. This means:

R_eq^β(θ) = { r ∈ ℝ^m : A_{ρ,θ0} r = β F_{ρ,θ0} (θ − θ0) }

All reward functions in the same equivalence class produce the same RLHF-optimal policy. Two rewards r1, r2 are equivalent if and only if r1 − r2 ∈ N(A_{ρ,θ0}), that is, they differ by a nullspace element.

Worked example: For our 1D policy with A_{θ0} = [1, −1, 0] and uniform base policy, A_{ρ,θ0} = (1/3)[1, −1, 0]. Its nullspace is N(A_{ρ,θ0}) = {[a, a, b] : a, b ∈ ℝ}: any vector in which a1 and a2 receive the same value (regardless of a3's value).

So r* = [1, 2, 0] and r* + [c, c, d] = [1+c, 2+c, d] all yield the same RLHF policy. That's because the 1D policy can only control the ratio of a1 to a2, and any reward with r(a2) − r(a1) = 1 gives the same θ*.
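Membership in the same equivalence class comes down to one dot product. The sketch below checks it for the worked example; the specific shift values are arbitrary illustrative choices.

```python
A_RHO = [1 / 3, -1 / 3, 0.0]     # A_{rho,theta0} for the uniform base policy

def a_rho_times(r) -> float:
    """The product A_{rho,theta0} r, a scalar because d = 1."""
    return sum(a * x for a, x in zip(A_RHO, r))

r_star = [1.0, 2.0, 0.0]
# Add a nullspace element [c, c, d] with c = 4, d = -7.
shifted = [r + n for r, n in zip(r_star, [4.0, 4.0, -7.0])]
# Add a vector that is NOT in the nullspace (changes a1 only).
not_equivalent = [r + n for r, n in zip(r_star, [1.0, 0.0, 0.0])]

print(a_rho_times(r_star), a_rho_times(shifted), a_rho_times(not_equivalent))
```

The first two products are equal, so both rewards lead to the same θ*. The third differs, so that reward belongs to a different equivalence class.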

The DPO-RLHF Connection (Proposition 7)

DPO's implicit reward r_θ^β is the minimum-norm representative of the RLHF equivalence class R_eq^β(θ), measured in the Mahalanobis norm ‖r‖_{D_{ρ,θ0}}.

In other words: for each θ, there's a whole affine subspace of rewards that would make RLHF choose θ. DPO picks the shortest one. This is elegant — but it means DPO is constrained to the column space C(Aθ0T), while the true reward r* may require a nullspace component to be properly represented.

Column space vs. nullspace: Think of Rm as being split into two orthogonal subspaces (under the Mahalanobis inner product): the column space C(Aθ0T) and the nullspace N(Aρ,θ0). DPO searches only the column space. The true reward r* = rθ*β + δ where δ is a nullspace component. DPO can find rθ*β only if it can ignore δ. But the weighted KL-projection doesn't know to ignore δ — it projects the whole r* and gets the wrong answer.
RLHF Equivalence Classes

The column space (red line) and nullspace (blue arrows) partition reward space. All rewards in the same equivalence class (blue line) map to the same RLHF policy. DPO can only search the red line.

Two reward functions r1 = [3, 5, 1] and r2 = [4, 6, 2] yield the same RLHF policy for the 1D softmax policy class with Aρ,θ0 = (1/3)[1, −1, 0]. Why?

Chapter 7: The AuxDPO Algorithm

Now the fix. The authors have identified the problem: DPO searches the column space C(Aθ0T), but the true reward has a nullspace component that distorts the projection. The solution: search both spaces simultaneously.

The Core Idea

Introduce auxiliary variables δ ∈ N(A_{ρ,θ0}) that represent the nullspace component of the reward. Optimize the DPO loss jointly over θ (column space) and δ (nullspace):

minimize over θ ∈ ℝ^d, δ ∈ N(A_{ρ,θ0}):  L(θ, δ)

where L(θ, δ) is the DPO loss but with the implicit reward r_θ^β replaced by r_θ^β + δ:

L(θ, δ) = − ∑_{s,a,a'} n_{s,a,a'} [ p^BTL_{s,a,a'}(r*) log p^BTL_{s,a,a'}(r_{θ,δ}^β) + (1 − p^BTL_{s,a,a'}(r*)) log(1 − p^BTL_{s,a,a'}(r_{θ,δ}^β)) ]

where r_{θ,δ}^β = r_θ^β + δ.

Why This Works (Proposition 9)

By the rank-nullity theorem:

dim(C(A_{θ0}^T)) + dim(N(A_{ρ,θ0})) = m

So searching column space (θ) + nullspace (δ) covers the full m-dimensional reward space Rm. The reward r* = rθ*β + δ* is now realizable in the augmented representation. The misspecification vanishes.
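In the three-response example with r* = [1, 2, 0] this split can be written out explicitly. The sketch below assumes the linearized implicit reward β · θ · [1, −1, 0] and a uniform base policy, in which case the weighting is uniform and an ordinary orthogonal projection gives the column-space part; with a non-uniform base policy the projection would need the corresponding weights.

```python
V = [1.0, -1.0, 0.0]            # column-space direction A_{theta0}^T
R_STAR = [1.0, 2.0, 0.0]

def decompose(r, beta: float = 1.0):
    """Split r into a column-space part beta*theta*V and a remainder delta."""
    coef = sum(x * v for x, v in zip(r, V)) / sum(v * v for v in V)
    theta = coef / beta
    column_part = [coef * v for v in V]
    delta = [x - c for x, c in zip(r, column_part)]
    return theta, column_part, delta

theta, column_part, delta = decompose(R_STAR)
# A_{rho,theta0} delta is proportional to V . delta for the uniform base policy.
residual = sum(d * v for d, v in zip(delta, V))
print(theta, column_part, delta, residual)
```

The remainder δ = [1.5, 1.5, 0] has equal entries for a1 and a2, so it lies in the nullspace, and θ = −0.5 favors a2 over a1. Once δ is allowed to absorb that component, the reward is exactly representable and nothing is left to project.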

Proposition 9: For sufficiently large β and any tolerance ε > 0, the AuxDPO optimization achieves θ = θ* up to O(ε) error. In other words, AuxDPO recovers the RLHF-optimal policy.

From Theory to Practice: The Empirical Loss

The theoretical formulation requires knowing N(A_{ρ,θ0}) and optimizing δ over it. In practice, we can't compute the nullspace of a d × m matrix when d is in the billions and m counts every (prompt, response) pair. The paper uses a relaxation:

Step 1: Discretize δ. Instead of defining δ over all m = |S| · |A| pairs, define it only at the 2n data points that appear in the dataset: δ = { δ(s^(i), a_w^(i)), δ(s^(i), a_l^(i)) }_{i=1}^n ∈ ℝ^{2n}.

Step 2: Enforce the nullspace constraint via a penalty. Replace the hard constraint δ ∈ N(A_{ρ,θ0}) with a penalty term ‖A_{ρ,θ0} δ‖². Since A_{ρ,θ0} δ = E_{ρ,π_{θ0}}[ δ(s, a) ∇ log π_{θ0}(a|s) ], this can be estimated from the dataset.

The empirical AuxDPO loss, writing δ_w^(i) = δ(s^(i), a_w^(i)) and δ_l^(i) = δ(s^(i), a_l^(i)):

L_D(θ, δ) = −(1/n) ∑_{i=1}^n log σ( r_θ^β(s^(i), a_w^(i)) − r_θ^β(s^(i), a_l^(i)) + δ_w^(i) − δ_l^(i) ) + λ · ‖ (1/2n) ∑_{i=1}^n [ δ_w^(i) ∇ log π_{θ0}(a_w^(i)|s^(i)) + δ_l^(i) ∇ log π_{θ0}(a_l^(i)|s^(i)) ] ‖²

Implementation Details

Practical setup:
  • Parameters: d (model weights θ) + 2n (auxiliary variables δ). Since typically n << d (e.g., n = 10K pairs, d = 7B parameters), the overhead is negligible: d + 2n ≈ d.
  • Hyperparameters: λ (nullspace penalty strength) is the only new hyperparameter. The paper doesn't tune it extensively — typical values work across tasks.
  • Gradients: ∇ log πθ0(a|s) is the score function of the frozen reference model. This needs to be computed once per data point and cached. It's a vector of dimension d (same as model parameters), so storing 2n such vectors costs 2n × d memory. In practice, the penalty term is computed in batch.
  • What δ learns: The auxiliary variables absorb the nullspace component of r* that the model's implicit reward can't express. After training, δ is discarded — only θ is used for inference.

Pseudocode

# AuxDPO loss for one mini-batch (PyTorch)
import torch.nn.functional as F

def auxdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                delta, idx, score_w, score_l, beta, lam):
    # logp_w, logp_l:         log pi_theta(a_w|s), log pi_theta(a_l|s), shape [batch_size]
    # ref_logp_w, ref_logp_l: the same log-probabilities under the frozen reference model
    # delta:                  trainable auxiliary vector, shape [2n]
    # idx:                    dataset indices i of the batch examples, shape [batch_size]
    # score_w, score_l:       cached grad log pi_{theta_0}(a|s), shape [batch_size, d]

    # 1. Compute implicit rewards (same as DPO)
    r_w = beta * (logp_w - ref_logp_w)   # shape: [batch_size]
    r_l = beta * (logp_l - ref_logp_l)   # shape: [batch_size]

    # 2. Look up auxiliary variables (per-datapoint scalars):
    #    entry 2i belongs to the preferred response, 2i + 1 to the rejected one
    delta_w = delta[2 * idx]             # shape: [batch_size]
    delta_l = delta[2 * idx + 1]         # shape: [batch_size]

    # 3. AuxDPO logit = DPO logit + delta correction
    logit = (r_w - r_l) + (delta_w - delta_l)

    # 4. Binary cross-entropy loss (same as DPO)
    bce = -F.logsigmoid(logit).mean()

    # 5. Nullspace penalty: batch estimate of ||A_{rho,theta_0} delta||^2.
    #    The 0.5 matches the 1/(2n) averaging over both responses of each pair.
    a_delta = 0.5 * (delta_w[:, None] * score_w + delta_l[:, None] * score_l).mean(0)
    penalty = (a_delta ** 2).sum()

    return bce + lam * penalty
The likely discovery process: The authors probably started by asking "why does DPO fail with parametric policies?" They noticed the KL-projection interpretation (Prop. 1), realized it's a misspecified estimation problem, and studied the RLHF geometry to understand what the correct answer looks like. The equivalence class structure (Lemma 6) revealed that many rewards give the same RLHF policy, and DPO picks the wrong representative. The fix was natural: let the optimizer search the nullspace too, so it can find the right equivalence class. The penalty term is a standard trick for converting a constrained optimization to an unconstrained one.
AuxDPO: Adding Nullspace Degrees of Freedom

DPO (orange) projects r* onto the column space line. AuxDPO (green) adds a nullspace shift δ so the augmented reward r_θ^β + δ can reach r*. The result: θ lands at the correct RLHF solution.

How does AuxDPO fix DPO's misspecification?

Chapter 8: Experiments

Theory says AuxDPO should fix DPO's misspecification. Does it work in practice? The paper tests on two fronts: a didactic bandit setting (where we can verify the theory exactly) and real LLM alignment tasks.

Didactic Bandit Setting

The 3-response, 1-parameter example from Proposition 3 is implemented with a log-linear policy. With n1,3 dominating, DPO produces preference reversal as predicted. AuxDPO with λ = 1.0 correctly recovers θ < 0, favoring a2. The auxiliary variable δ absorbs the nullspace component, steering the projection to the correct equivalence class.

LLM Alignment: Datasets

Training data: UltraFeedback (Cui et al., 2024) and its binarized version. Models: Llama3.1-8B, Llama3.2-1B, Qwen3-0.6B.

Main Results

Key finding: AuxDPO consistently outperforms DPO across all models and datasets, both in-distribution (ID) and out-of-distribution (OOD). The gains are especially large on OOD evaluation, suggesting that AuxDPO learns more transferable alignments.
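
| Model | Dataset | Setting | DPO | AuxDPO | IPO | DPOP |
|---|---|---|---|---|---|---|
| Llama3.1-8B | MMLU-Pro | ID | 57.14 | 63.26 | 59.18 | 61.22 |
| Llama3.1-8B | MMLU-Pro | OOD | 8.16 | 14.28 | 10.20 | 6.12 |
| Llama3.1-8B | RewardBench v2 | ID | 56.01 | 66.72 | 61.34 | 62.27 |
| Llama3.1-8B | RewardBench v2 | OOD | 14.31 | 32.44 | 20.17 | 19.87 |
| Llama3.2-1B | MMLU-Pro | ID | 39.58 | 45.83 | 43.75 | 44.21 |
| Llama3.2-1B | MMLU-Pro | OOD | 6.25 | 12.52 | 14.58 | 4.16 |
| Llama3.2-1B | RewardBench v2 | ID | 77.21 | 86.37 | 69.72 | 71.21 |
| Llama3.2-1B | RewardBench v2 | OOD | 14.11 | 43.27 | 20.42 | 18.76 |
| Qwen3-0.6B | MMLU-Pro | ID | 53.12 | 61.78 | 47.48 | 56.67 |
| Qwen3-0.6B | MMLU-Pro | OOD | 11.34 | 22.22 | 15.56 | 17.78 |
| Qwen3-0.6B | RewardBench v2 | ID | 55.10 | 65.31 | 53.06 | 51.02 |
| Qwen3-0.6B | RewardBench v2 | OOD | −8.16 | 18.36 | −8.23 | −6.25 |

Values show % change in mean accuracy relative to the base policy. Negative = degradation from base policy.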

Reading the Results

Several patterns jump out:

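1. AuxDPO wins almost everywhere. It beats DPO on all 12 (model, dataset, setting) combinations and is the best of the four methods on 11 of them; the exception is Llama3.2-1B on MMLU-Pro OOD, where IPO (14.58) edges out AuxDPO (12.52). DPO is never best.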

2. The OOD gap is huge. On Llama3.2-1B RewardBench v2 OOD, AuxDPO achieves +43.27% vs. DPO's +14.11%. That's a 3x improvement in OOD generalization. This suggests that DPO's misspecification causes it to overfit to distribution-specific artifacts, while AuxDPO finds more robust alignments.

3. DPO can be catastrophically bad. On Qwen3-0.6B RewardBench v2 OOD, DPO scores −8.16%, meaning it's worse than the base model. This is consistent with the reward degradation predicted by the theory (Proposition 3). AuxDPO scores +18.36%, a swing of over 26 percentage points.

4. Model size appears to matter. The theory suggests that the smaller the model (fewer parameters d relative to task complexity m), the worse the misspecification. Qwen3-0.6B is the only model on which DPO falls below the base policy, which fits that picture: fewer parameters means a lower-dimensional manifold, which means more reward functions are unrealizable.
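The patterns above can be checked mechanically against the results table. The sketch below simply re-enters the table's numbers; the variable names are illustrative.

```python
# (model, dataset, setting) -> (DPO, AuxDPO, IPO, DPOP), % change vs. base policy
RESULTS = {
    ("Llama3.1-8B", "MMLU-Pro", "ID"): (57.14, 63.26, 59.18, 61.22),
    ("Llama3.1-8B", "MMLU-Pro", "OOD"): (8.16, 14.28, 10.20, 6.12),
    ("Llama3.1-8B", "RewardBench v2", "ID"): (56.01, 66.72, 61.34, 62.27),
    ("Llama3.1-8B", "RewardBench v2", "OOD"): (14.31, 32.44, 20.17, 19.87),
    ("Llama3.2-1B", "MMLU-Pro", "ID"): (39.58, 45.83, 43.75, 44.21),
    ("Llama3.2-1B", "MMLU-Pro", "OOD"): (6.25, 12.52, 14.58, 4.16),
    ("Llama3.2-1B", "RewardBench v2", "ID"): (77.21, 86.37, 69.72, 71.21),
    ("Llama3.2-1B", "RewardBench v2", "OOD"): (14.11, 43.27, 20.42, 18.76),
    ("Qwen3-0.6B", "MMLU-Pro", "ID"): (53.12, 61.78, 47.48, 56.67),
    ("Qwen3-0.6B", "MMLU-Pro", "OOD"): (11.34, 22.22, 15.56, 17.78),
    ("Qwen3-0.6B", "RewardBench v2", "ID"): (55.10, 65.31, 53.06, 51.02),
    ("Qwen3-0.6B", "RewardBench v2", "OOD"): (-8.16, 18.36, -8.23, -6.25),
}

aux_beats_dpo = sum(1 for dpo, aux, ipo, dpop in RESULTS.values() if aux > dpo)
aux_not_best = [key for key, (dpo, aux, ipo, dpop) in RESULTS.items()
                if aux < max(dpo, ipo, dpop)]
dpo_below_base = [key for key, row in RESULTS.items() if row[0] < 0]

llama_ood = RESULTS[("Llama3.2-1B", "RewardBench v2", "OOD")]
ood_ratio = llama_ood[1] / llama_ood[0]
qwen_ood = RESULTS[("Qwen3-0.6B", "RewardBench v2", "OOD")]
swing = qwen_ood[1] - qwen_ood[0]

print(aux_beats_dpo, aux_not_best, dpo_below_base, round(ood_ratio, 2), round(swing, 2))
```

On these numbers AuxDPO beats DPO in every row, and the only row where another method scores higher than AuxDPO is Llama3.2-1B on MMLU-Pro OOD.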

What the paper doesn't say: The computational overhead of AuxDPO is not extensively analyzed. Adding 2n auxiliary variables and computing the penalty term (which involves reference model score functions) adds memory and compute. For n = 10K and d = 7B, the 20K extra parameters are negligible, but caching/computing ∇ log πθ0 for every data point could be significant for very large datasets. The paper also doesn't test on reward model benchmarks beyond accuracy — calibration, ranking quality, and win rates against human judges would strengthen the case.

Per-Subject Breakdown (MMLU-Pro)

The paper also reports per-subject accuracies on MMLU-Pro. The gains are not uniform across subjects, but AuxDPO is never worse than DPO in any subject.

Results Comparison: DPO vs. AuxDPO

Accuracy improvement over base policy (% change) across model sizes and settings. Bars below zero mean the method is worse than doing nothing.

On Qwen3-0.6B RewardBench v2 OOD, DPO scores −8.16% (worse than base). What does this correspond to in the theory?

Chapter 9: Connections

Cheat Sheet: Every Key Equation

| Equation | What it says | When to use it |
|---|---|---|
| r_θ^β(s,a) = β log(π_θ / π_ref) | Implicit reward: every policy induces a reward via log-ratio | Understanding DPO's reward space |
| R^β = { r_θ^β : θ ∈ ℝ^d } | Implicit reward manifold: d-dimensional subset of ℝ^m | Checking if misspecification applies |
| r_DPO = argmin_{r ∈ R^β} ∑ n · d_KL | DPO = weighted KL-projection of r* onto R^β | Understanding DPO's behavior |
| r_θ^β ≈ β A_{θ0}^T (θ − θ0) | Linearized implicit reward: lives in C(A_{θ0}^T) | Local analysis near base policy |
| θ* = θ0 + (1/β) F^{−1} A_ρ r* | RLHF solution = natural policy gradient step | Understanding the target |
| R_eq^β(θ) = { r : A r = β F (θ − θ0) } | RLHF equivalence class: all rewards yielding same θ* | Understanding why many rewards are equivalent |
| L_AuxDPO = L_DPO(r_θ + δ) + λ ‖A δ‖² | AuxDPO loss: DPO + nullspace auxiliary variables + penalty | Implementation |

Related Papers & Lessons

Open Questions

The Big Picture

This paper reveals a fundamental tension in direct alignment: DPO's elegance comes from reparameterizing the reward in terms of the policy, but this reparameterization constrains the reward space to a low-dimensional manifold. The constraint is invisible in the tabular case (where the manifold is the whole space) but becomes a source of systematic error for parametric policies.

AuxDPO's fix is principled and minimal: add auxiliary variables that search the directions the policy can't reach. The cost is small (2n extra scalars, one hyperparameter), and the theory guarantees recovery of the RLHF solution. The practical gains are substantial, especially for smaller models where misspecification is more severe.

The deeper lesson: whenever you reparameterize one quantity in terms of another (reward in terms of policy, here), you inherit the latter's capacity limitations. If those limitations don't match the true data, you're misspecified — and no amount of data will save you. Only expanding the representational capacity (here, via auxiliary variables) can fix a misspecification problem.

A colleague says: "Just use a bigger model and DPO's misspecification goes away." Is this correct?