Why DPO Is a Misspecified Estimator

Chapter 0: The Problem

You want to align an LLM with human preferences. The standard recipe: collect (prompt, preferred response, rejected response) triplets, then fine-tune the model so it favors what humans favor. Two roads diverge.

Road 1: Two-stage RLHF. First, train a separate reward model on the preferences. Then run PPO (or similar RL) to maximize that reward while staying close to the base model via a KL penalty. This works, but it's expensive — you need a reward model, on-policy rollouts, clip ranges, variance reduction, and careful tuning of half a dozen hyperparameters.

Road 2: DPO (Direct Preference Optimization). Rafailov et al. (2023) showed a beautiful trick: the KL-regularized RL objective has a closed-form solution that links the optimal policy directly to the reward. Reparameterize the reward in terms of the policy, substitute into the Bradley-Terry preference model, and you get a single supervised loss. No reward model, no RL — just minimize one cross-entropy loss. It's simple, stable, and fast. The open-source community adopted it almost overnight.

But there's a catch. A hidden assumption. DPO's derivation assumes the policy class is tabular — meaning it can represent every possible conditional distribution over responses given prompts. In a tabular policy, there's one free parameter per (prompt, response) pair. The KL-regularized objective can be solved in closed form because the optimization is over an unconstrained probability simplex.

Real LLMs are not tabular. A 7B-parameter Transformer parameterizes distributions over an astronomically large set of possible responses (far more than the number of weights) using only 7 billion shared weights. The policy class is a tiny manifold embedded in the vast space of all possible distributions. The closed-form solution that DPO relies on? It might not live on that manifold.

The core question this paper asks: When you minimize the DPO loss over a parametric (non-tabular) policy class, what actually happens? Does it find the RLHF-optimal policy? The answer is no — and the failure modes are worse than anyone expected. DPO can reverse the preference ordering, decrease the expected reward below the base policy, and the result depends sensitively on how many times each pair was compared in the dataset.

This isn't a small-sample problem. The failure modes persist even with infinite preference data drawn from a perfect Bradley-Terry model. The issue is geometric: DPO is a misspecified estimator, and misspecified estimators can produce arbitrarily bad results regardless of how much data you feed them.

The Tabular vs. Parametric Gap

Figure (interactive in the original): Left, in a tabular policy every reward function can be expressed as an implicit reward. Right, in a parametric policy only a low-dimensional manifold of rewards is reachable. DPO projects onto this manifold, but the projection can land anywhere. A slider sets the policy class dimension d.

DPO's derivation assumes the policy class is tabular. What does "tabular" mean here, and why does it matter?

(a) Tabular means the policy has one free parameter per (prompt, response) pair, so it can represent any conditional distribution; the KL-regularized objective has a closed-form solution only under this assumption
(b) Tabular means the policy is stored in a lookup table, which is faster than a neural network
(c) Tabular means the policy has been trained on tabular data rather than text

Chapter 1: The Key Insight

Here is the paper's central insight, stated plainly: DPO loss minimization is equivalent to a weighted KL-projection of the true reward function onto the manifold of rewards that the policy class can express.

Let's unpack that one piece at a time.

What is an "implicit reward"?

Given any policy π_θ and a reference policy π_θ₀, DPO defines an implicit reward function:

r_θ^β(s, a) = β log(π_θ(a|s) / π_θ₀(a|s))

This is just the log-probability-ratio, scaled by β. Every policy parameter θ induces exactly one implicit reward. The set of all such implicit rewards, as θ varies over R^d, forms a manifold R^β inside the full reward space R^m (where m = |S| · |A|).

What does DPO actually minimize?

When you minimize the DPO loss with infinite data, you're finding the implicit reward in R^β that is closest to the true reward r* — where "closest" means minimizing a weighted sum of KL divergences between Bernoulli preference probabilities. The weights are the preference pair counts n_s,a,a'.

When does this go wrong?

If r* happens to live on the manifold R^β (i.e., some policy parameter θ yields an implicit reward equal to r*), then DPO finds the RLHF-optimal policy. No problem.

But if r* is off the manifold — which is the generic case when d << m — then DPO projects r* onto R^β. This projection depends on the weights (preference counts), and can land at a point on the manifold that corresponds to a worse policy than doing nothing at all.

The geometry in one picture: Imagine reward space as a high-dimensional room. The policy class carves out a thin curved surface (manifold) in that room. The true reward r* is a point floating above the surface. DPO drops a perpendicular from r* onto the surface — but the "perpendicular" direction is warped by the preference data distribution. Tilt the data, and the projection slides to a completely different spot on the surface. Some spots correspond to good policies. Others correspond to policies that actively reverse human preferences.

The fix: AuxDPO

The paper then studies the geometry of two-stage RLHF and discovers that RLHF partitions all reward functions into equivalence classes: rewards that differ by a vector in the nullspace of a certain matrix all yield the same optimal policy. DPO can only search the column space. AuxDPO adds auxiliary variables that search the nullspace too, expanding the search from a d-dimensional manifold to the full m-dimensional reward space. The result: the projection is no longer misspecified.

DPO: Projects r* onto R^β = {β log π_θ / π_ref}, a d-dimensional manifold. The projection depends on preference counts and can land anywhere.

↓ vs.

RLHF: Learns r* accurately (stage 1), then optimizes the policy (stage 2). Equivalence classes: all rewards differing by nullspace elements yield the same θ*.

↓ insight

AuxDPO: Adds auxiliary variables δ ∈ N(A_ρ,θ₀) to the DPO loss. Searches column space + nullspace = full R^m. No longer misspecified.

Why does DPO become misspecified for parametric policy classes?

(a) Because the learning rate is too high for non-tabular policies
(b) Because the implicit reward manifold R^β is lower-dimensional than the full reward space, so the true reward r* typically can't be expressed as any policy's implicit reward; DPO must project, and the projection can be arbitrarily bad
(c) Because DPO uses the Bradley-Terry model which is itself misspecified

Chapter 2: Prerequisites

Before we dive into the proofs, let's make sure every tool is sharp. This chapter covers the three building blocks: the Bradley-Terry preference model, the DPO loss, and the geometry of misspecified estimation.

The Bradley-Terry-Luce (BTL) Model

Given a prompt s and two responses a, a', how do we model which one a human prefers? The BTL model says: each response has a latent "reward" r*(s, a), and the probability that a is preferred over a' is:

p_s,a,a'^BTL(r*) = σ(r*(s, a) − r*(s, a'))

where σ(z) = 1/(1 + e^−z) is the sigmoid function.

Let's build some intuition with numbers. Suppose r*(a₁) = 3 and r*(a₂) = 1. Then:

p^BTL(a₁ ≻ a₂) = σ(3 − 1) = σ(2) = 1/(1 + e⁻²) ≈ 0.88

So a₁ is preferred 88% of the time. The bigger the reward gap, the stronger the preference. If rewards are equal, preference is 50-50 (since σ(0) = 0.5).

Key property of BTL: Only differences in rewards matter. Adding the same constant to all rewards doesn't change any preference probabilities, because σ(r + c − r' − c) = σ(r − r'). This means r* is really an equivalence class up to a per-prompt constant — a fact that will matter later.
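The numbers above are easy to check in a few lines. This is a minimal sketch; the helper names `sigmoid` and `btl_prob` are illustrative and not from the paper.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def btl_prob(r_a: float, r_b: float) -> float:
    """BTL probability that the first response is preferred over the second."""
    return sigmoid(r_a - r_b)

p_example = btl_prob(3.0, 1.0)               # rewards from the example above
p_shifted = btl_prob(3.0 + 7.0, 1.0 + 7.0)   # same rewards plus a constant
p_equal = btl_prob(1.5, 1.5)                 # equal rewards

print(round(p_example, 2), round(p_shifted, 2), p_equal)  # 0.88 0.88 0.5
```

Shifting both rewards by the same constant leaves the probability unchanged, which is the shift invariance described above.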

The KL-Regularized RLHF Objective

Given a reward function r* and a reference (base) policy π_ref = π_θ₀, the RLHF objective maximizes expected reward while penalizing deviation from the base:

J(θ; r*) = E_{ρ, π_θ}[r*(s, a)] − β · E_{s∼ρ}[D_KL(π_θ(⋅|s) || π_θ₀(⋅|s))]

Here β > 0 controls how far the aligned policy can stray from the base. Large β means "stay close to the base" (conservative alignment). Small β means "chase the reward aggressively."

Worked example: Suppose you have one prompt, two responses, π_ref(a₁) = π_ref(a₂) = 0.5, and r*(a₁) = 1, r*(a₂) = 0. If β = 1:

J(θ) = π_θ(a₁) · 1 + π_θ(a₂) · 0 − 1 · [π_θ(a₁) log(π_θ(a₁)/0.5) + π_θ(a₂) log(π_θ(a₂)/0.5)]

Setting p = π_θ(a₁):

J = p − [p log(2p) + (1−p) log(2(1−p))]

Taking the derivative and setting it to zero (keeping β general in front of the KL bracket) gives log(p/(1−p)) = 1/β, so the optimal p* = e^{1/β} / (1 + e^{1/β}) = σ(1/β). For β = 1, that's σ(1) ≈ 0.73. So the optimal policy puts 73% probability on the preferred response, not 100%, because the KL penalty holds it back.
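As a sanity check on this worked example, the sketch below maximizes J over a fine grid of p and compares the result with σ(1/β). The function name `rlhf_objective` and the grid resolution are assumptions made for illustration.

```python
import numpy as np

def rlhf_objective(p, beta, reward=(1.0, 0.0), ref=(0.5, 0.5)):
    """KL-regularized objective for a two-response policy with pi(a1) = p."""
    pi = np.array([p, 1.0 - p])
    r = np.array(reward)
    pi_ref = np.array(ref)
    kl = np.sum(pi * np.log(pi / pi_ref))
    return float(pi @ r - beta * kl)

beta = 1.0
grid = np.linspace(0.001, 0.999, 9981)          # candidate values of p, step 1e-4
values = [rlhf_objective(p, beta) for p in grid]
p_numeric = float(grid[int(np.argmax(values))])
p_closed_form = 1.0 / (1.0 + np.exp(-1.0 / beta))   # sigma(1 / beta)

print(round(p_numeric, 3), round(float(p_closed_form), 3))
```

Both values should agree to about three decimals; changing `beta` moves the optimum toward 0.5 (large β) or toward 1 (small β).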

The DPO Reparameterization

For a tabular policy class, the RLHF objective has a closed-form maximizer:

π_θ*(a|s) = (1/Z*(s)) · π_θ₀(a|s) · exp(r*(s, a) / β)

Rearranging for the reward:

r*(s, a) = β log(π_θ*(a|s) / π_θ₀(a|s)) + β log Z*(s)
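The two identities above can be verified numerically for a single prompt. This is a small sketch; the base policy, rewards, and β below are made-up illustrative values.

```python
import numpy as np

beta = 0.5
pi_ref = np.array([0.2, 0.5, 0.3])    # hypothetical base policy at one prompt
r_star = np.array([1.0, 2.0, 0.0])    # hypothetical true rewards

unnormalized = pi_ref * np.exp(r_star / beta)
Z = unnormalized.sum()                 # partition function Z*(s)
pi_opt = unnormalized / Z              # closed-form tabular maximizer

# Invert the relation: the log-ratio term plus beta * log Z gives back r*.
r_recovered = beta * np.log(pi_opt / pi_ref) + beta * np.log(Z)

print(pi_opt, r_recovered)
```

The recovered vector matches `r_star` exactly, but only because `pi_opt` was free to be any distribution on the simplex.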

Substituting this into the BTL model, the Z*(s) terms cancel (since they're the same for both responses at the same prompt), giving:

p_s,a,a'^BTL(r*) = σ(β log(π_θ*(a|s) / π_θ₀(a|s)) − β log(π_θ*(a'|s) / π_θ₀(a'|s)))

DPO says: instead of learning r* first and then optimizing the policy, directly learn θ by treating the above as a binary classification loss. The DPO loss is:

L_DPO(θ) = − ∑_i log σ(β log(π_θ(a_w⁽ⁱ⁾|s⁽ⁱ⁾)/π_θ₀(a_w⁽ⁱ⁾|s⁽ⁱ⁾)) − β log(π_θ(a_l⁽ⁱ⁾|s⁽ⁱ⁾)/π_θ₀(a_l⁽ⁱ⁾|s⁽ⁱ⁾)))
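In code, the loss needs only four log-probabilities per preference pair. The sketch below assumes those sequence log-probabilities have already been computed; the function name `dpo_loss` and the example numbers are hypothetical.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta):
    """Summed DPO loss from per-example sequence log-probabilities."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(x) = log(1 + exp(-x)), computed stably with logaddexp
    return float(np.sum(np.logaddexp(0.0, -margin)))

# Hypothetical reference log-probabilities for two preference pairs.
ref_w = np.array([-4.0, -6.0])
ref_l = np.array([-5.0, -5.5])

# Policy equal to the reference: every margin is 0, so each term is log 2.
loss_at_ref = dpo_loss(ref_w, ref_l, ref_w, ref_l, beta=0.1)
# Policy that raises chosen and lowers rejected log-probabilities.
loss_better = dpo_loss(ref_w + 1.0, ref_l - 1.0, ref_w, ref_l, beta=0.1)

print(loss_at_ref, loss_better)
```

Moving probability mass toward the chosen responses lowers the loss, regardless of whether the resulting log-ratios correspond to any true reward.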

The hidden assumption: The reparameterization step — going from the reward to the log-policy-ratio — uses the closed-form solution of the RLHF objective. This closed-form exists only when the policy class is tabular. For parametric policies, the RLHF maximizer θ* doesn't satisfy the closed-form relation. DPO still minimizes the same loss, but the loss no longer corresponds to learning the true reward. It's trying to fit r* into a space (the implicit reward manifold) that may not contain r*.

Misspecified Estimation

In statistics, an estimator is misspecified when the true data-generating distribution doesn't belong to the model family being fit. Classic example: fitting a linear model to quadratic data. The fit converges (with infinite data) to the best linear approximation — but that approximation can be misleading.

White (1982) showed that misspecified maximum likelihood estimators converge to the KL-projection of the truth onto the model family. The projection is consistent (it converges) but not to the right thing. DPO exhibits exactly this phenomenon: it's a misspecified estimator in reward function space.
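The linear-versus-quadratic example can be made concrete. This sketch uses least squares (the maximum likelihood fit under Gaussian noise) as a stand-in for the general KL-projection; the grid of 2001 noise-free points is an assumption standing in for infinite data.

```python
import numpy as np

x = np.linspace(-1.0, 1.0, 2001)   # dense grid as a stand-in for infinite data
y = x ** 2                          # true relationship is quadratic, noise-free

# Best linear fit in the least-squares sense (highest-degree coefficient first).
slope, intercept = np.polyfit(x, y, deg=1)
max_abs_error = float(np.max(np.abs(slope * x + intercept - y)))

print(slope, intercept, max_abs_error)
```

The fit converges to a flat line near y = 1/3: a well-defined limit, yet one that reports no relationship between x and y at all. More data would not change it.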

BTL Preference Probabilities

Figure (interactive in the original): sliders set r*(a₁) and r*(a₂), and the BTL preference probabilities update. Only the difference in rewards matters.

In the BTL model with rewards r*(a₁) = 5 and r*(a₂) = 3, what is P(a₁ ≻ a₂)?

(a) σ(2) ≈ 0.88, since only the difference r*(a₁) − r*(a₂) = 2 matters
(b) σ(5) ≈ 0.99, based on r*(a₁) alone
(c) 5/8 = 0.625, by normalizing the rewards

Chapter 3: Setup & Definitions

Now let's formalize everything. We need precise definitions because the paper's results are about exact mathematical objects, not hand-wavy intuitions.

The Data

We have a dataset D of n preference triplets (s⁽ⁱ⁾, a_w⁽ⁱ⁾, a_l⁽ⁱ⁾), where a_w ≻ a_l means "a_w is preferred over a_l at prompt s." Prompts come from S (finite set), responses from A (finite set). Let m = |S| · |A| be the total number of (prompt, response) pairs.

The Policy Class

π_θ : S → Δ(A) is parameterized by θ ∈ R^d. The base policy is π_θ₀. Two key cases:

Tabular: d = m, each θ_s,a is a free parameter, π_θ(a|s) ∝ θ_s,a. The policy can represent anything.
Parametric (e.g., neural softmax): d << m. π_θ(a|s) = exp(f_θ(s, a)) / ∑_a' exp(f_θ(s, a')), where f_θ is a neural network. The policy lives on a d-dimensional manifold in the m-dimensional simplex.

The Implicit Reward

For any policy parameter θ, define the implicit reward function:

r_θ^β(s, a) := β log(π_θ(a|s) / π_θ₀(a|s))

Note that r_θ₀^β ≡ 0 by definition (the base policy has zero implicit reward). This makes sense: we're measuring reward relative to the base.

The Implicit Reward Manifold

The set of all achievable implicit rewards forms:

R^β = {r_θ^β : θ ∈ R^d} ⊂ R^m

For a tabular policy, R^β = R^m (you can reach any reward). For a parametric policy, R^β is a d-dimensional manifold embedded in R^m, where d << m. This dimensional mismatch is the source of all trouble.

Worked example (counting dimensions): Suppose you have 1 prompt, 3 responses (m = 3), and a 1-dimensional policy parameter θ via the softmax: π_θ = [e^θ, e^−θ, 1] / Z(θ), with base θ₀ = 0. The implicit reward is β[θ, −θ, 0] − β log(Z(θ)/Z(θ₀)) · [1, 1, 1]. The second term is a per-prompt constant that cancels in every BTL difference, so up to that shift R^β is the set of vectors β[θ, −θ, 0] for all θ. This is a 1-dimensional line in R³: the span of [1, −1, 0]. Any true reward r* = [r₁, r₂, r₃] that isn't, after a constant shift, a scalar multiple of [1, −1, 0] can't be exactly represented. And most rewards aren't on this line.
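The sketch below computes the implicit reward of this 1-parameter softmax directly from its definition and strips the per-prompt constant (the log-partition ratio, which BTL differences ignore). The function names and the test value θ = 0.7 are illustrative assumptions.

```python
import numpy as np

def policy(theta):
    """pi_theta = [e^theta, e^-theta, 1] / Z for one prompt, three responses."""
    w = np.array([np.exp(theta), np.exp(-theta), 1.0])
    return w / w.sum()

def implicit_reward(theta, beta=1.0, theta0=0.0):
    """beta * log(pi_theta / pi_theta0), evaluated at all three responses."""
    return beta * np.log(policy(theta) / policy(theta0))

theta, beta = 0.7, 1.0
r = implicit_reward(theta, beta)
shift = r[2]              # common offset from the partition functions
r_centered = r - shift    # remove the per-prompt constant

print(r, r_centered)
```

After removing the shared offset, the reward is β[θ, −θ, 0]: whatever θ is, the first two entries are forced to be negatives of each other.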

The Population DPO Loss

With infinite data (population setting), the number of preferences for triplet (s, a, a') is n_s,a,a' and the fraction that prefer a over a' is p_s,a,a'^BTL(r*). The population DPO loss is:

L(θ) = − ∑_s,a,a' n_s,a,a' [p_s,a,a'^BTL(r*) · log p_s,a,a'^BTL(r_θ^β) + (1 − p_s,a,a'^BTL(r*)) · log(1 − p_s,a,a'^BTL(r_θ^β))]

This is a weighted cross-entropy loss. Each pair (s, a, a') contributes a binary cross-entropy term, weighted by how many times that pair was compared (n_s,a,a'). The "label" is the true BTL preference probability; the "prediction" is the preference probability induced by the policy's implicit reward.

The Jacobian Matrix A_θ₀

A critical object: the d × m matrix where the (s, a)-th column is ∇ log π_θ₀(a|s). This is the score function of the base policy, evaluated at every (prompt, response) pair.

A_θ₀ = [∇ log π_θ₀(a₁|s₁), ∇ log π_θ₀(a₂|s₁), ..., ∇ log π_θ₀(a_|A||s_|S|)]

Why does this matter? Because the linearized implicit reward is:

r_θ^β ≈ β · A_θ₀^T(θ − θ₀)

So the column space C(A_θ₀^T) is the linearized implicit reward manifold. Whatever isn't in this column space is invisible to DPO.
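For the 1-parameter softmax policy π_θ = [e^θ, e^−θ, 1]/Z used in this chapter, the score matrix can be estimated by finite differences and the linearization checked against the exact implicit reward. This is a sketch under that assumption; `score_matrix` and the step size are illustrative choices.

```python
import numpy as np

def log_policy(theta):
    """Log-probabilities of the 1-parameter softmax policy."""
    logits = np.array([theta, -theta, 0.0])
    return logits - np.log(np.sum(np.exp(logits)))

def score_matrix(theta0, eps=1e-6):
    """Central finite-difference estimate of A_theta0 (here d = 1, m = 3)."""
    grad = (log_policy(theta0 + eps) - log_policy(theta0 - eps)) / (2 * eps)
    return grad.reshape(1, -1)          # shape [d, m]

A = score_matrix(0.0)
theta, beta = 0.05, 1.0
linearized = beta * A.T @ np.array([theta])            # beta * A^T (theta - theta0)
exact = beta * (log_policy(theta) - log_policy(0.0))   # exact implicit reward

print(A, linearized, exact)
```

Near θ₀ the two reward vectors agree up to a small second-order term, which is why the column space of A_θ₀^T is a good local picture of the manifold.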

Implicit Reward Manifold

Figure (interactive in the original): A 1D policy class (single parameter θ) creates a 1D line of implicit rewards in 3D reward space. The true reward r* is off the line, and DPO projects onto the line. Different true rewards project to different (potentially bad) policies.

A policy class has d = 100 parameters and there are m = 10,000 (prompt, response) pairs. What is the dimension of the implicit reward manifold R^β?

(a) 10,000, the same as the reward space
(b) At most 100: bounded by the policy parameter dimension d, making r* generically unrealizable
(c) 10,000 − 100 = 9,900

Chapter 4: DPO as KL-Projection

This is the paper's first main theorem (Proposition 1). It reveals what DPO actually does in the population setting.

The Theorem

Proposition 1 (DPO is weighted KL-projection). Assume infinite preference data drawn from BTL(r*), with n_s,a,a' preference pairs per triplet. If θ_DPO minimizes the population DPO loss, then its implicit reward satisfies:

r_{θ_DPO}^β = argmin_{r ∈ R^β} ∑_s,a,a' n_s,a,a' · d_KL(p_s,a,a'^BTL(r*) || p_s,a,a'^BTL(r))
where d_KL(p || q) is the KL divergence between Bernoulli(p) and Bernoulli(q).

Derivation: Why This is True

Let's derive this step by step, filling in the gaps the paper skips.

Step 1. The population DPO loss is:

L(θ) = − ∑_s,a,a' n_s,a,a' [p* · log q(θ) + (1−p*) · log(1−q(θ))]

where p* = p_s,a,a'^BTL(r*) is the true preference probability and q(θ) = p_s,a,a'^BTL(r_θ^β) is the model's predicted preference probability.

Step 2. Rewrite using the identity H(p, q) = H(p) + d_KL(p || q), where H(p, q) is binary cross-entropy and H(p) is entropy:

L(θ) = ∑_s,a,a' n_s,a,a' · H(p*, q(θ)) = ∑_s,a,a' n_s,a,a' · [H(p*) + d_KL(p* || q(θ))]

The entropy term H(p*) doesn't depend on θ, so:

argmin_θ L(θ) = argmin_θ ∑_s,a,a' n_s,a,a' · d_KL(p_s,a,a'^BTL(r*) || p_s,a,a'^BTL(r_θ^β))

Step 3. Since θ determines r_θ^β, and r_θ^β ranges over R^β as θ varies, this is equivalent to:

r_{θ_DPO}^β = argmin_{r ∈ R^β} ∑_s,a,a' n_s,a,a' · d_KL(p_s,a,a'^BTL(r*) || p_s,a,a'^BTL(r))

This is a weighted KL-projection of r* onto R^β. QED.
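Step 2 can be checked numerically: the weighted cross-entropy and the weighted KL should differ by a constant that does not depend on θ. The sketch below uses a 3-response, 1-parameter policy with rewards r* = [1, 2, 0], β = 1, and hypothetical counts; all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(p, q):
    return -(p * np.log(q) + (1 - p) * np.log(1 - q))

def bernoulli_kl(p, q):
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

r_star = np.array([1.0, 2.0, 0.0])
counts = np.array([10.0, 80.0, 10.0])         # hypothetical n for the pairs below
pairs = [(0, 1), (0, 2), (1, 2)]              # (a1,a2), (a1,a3), (a2,a3)
p_true = np.array([sigmoid(r_star[i] - r_star[j]) for i, j in pairs])

def model_probs(theta):
    r = np.array([theta, -theta, 0.0])        # implicit reward up to a constant
    return np.array([sigmoid(r[i] - r[j]) for i, j in pairs])

thetas = np.linspace(-3, 3, 601)
loss = np.array([counts @ cross_entropy(p_true, model_probs(t)) for t in thetas])
kl = np.array([counts @ bernoulli_kl(p_true, model_probs(t)) for t in thetas])
gap = loss - kl        # weighted entropy of p*, constant in theta

print(gap.min(), gap.max(), thetas[np.argmin(loss)], thetas[np.argmin(kl)])
```

Because the gap is constant, both curves are minimized at the same θ, which is the content of the proposition.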

What This Means

Three critical implications:

1. Realizable case (r* ∈ R^β): The KL divergence is zero at r* itself, so the projection trivially finds r*, and DPO recovers the RLHF-optimal policy. This is the tabular case where DPO was designed to work.

2. Misspecified case (r* ∉ R^β): The projection finds the closest point on R^β in the weighted KL sense. But "closest in weighted KL" doesn't mean "best policy." The weights n_s,a,a' — how many times each pair was compared — control where the projection lands. Different preference distributions give different answers.

3. The weights are the problem: In RLHF, the preference counts affect how well you learn the reward (statistical efficiency), but the learned reward is still the right one with enough data. In DPO, the preference counts determine which wrong answer you converge to. This is fundamentally different.

Which KL is this? The loss uses d_KL(p* || q), with the true probability in the first argument. This is the forward KL that maximum likelihood always minimizes (often called "mass-covering"), not the mode-seeking reverse KL d_KL(q || p*). It penalizes most heavily a prediction q that is confidently wrong, i.e., pushed toward 0 or 1 on a pair where p* is not. Combined with the count weights, the pairs that are compared most often dominate the fit, which amplifies the sensitivity to the preference data distribution.

Worked Numerical Example

Let's trace through the paper's 3-response example from Proposition 3. One prompt, three responses a₁, a₂, a₃ with true rewards r* = [1, 2, 0] (after shifting so the minimum is 0). Policy: π_θ = [e^θ, e^−θ, 1]/Z, base: θ₀ = 0 (uniform).

The Jacobian: A_θ₀ = [1, −1, 0]. So the implicit reward manifold is span([1, −1, 0]) — a line in R³.

The true reward r* = [1, 2, 0] is not on this line (it would need to be of the form [c, −c, 0]). DPO must project.

Now the projection depends on the preference counts. If n_1,3 (comparisons between a₁ and a₃) dominates, DPO is primarily trying to match the a₁ vs. a₃ preference probability. Since r*(a₁) − r*(a₃) = 1 > 0, DPO will set the implicit reward to have r_θ(a₁) − r_θ(a₃) > 0, which means r_θ ≈ [α, −α, 0] for some α > 0. This makes a₁ preferred over a₂ — a preference reversal since the true ordering is a₂ ≻ a₁ ≻ a₃.

DPO as Weighted KL-Projection

Figure (interactive in the original): The true reward r* (gold star) is projected onto the 1D manifold (red line). As the preference count ratio changes, the projection point (orange dot) slides along the manifold and can cross into preference reversal.

In DPO's weighted KL-projection, what role do the preference pair counts n_s,a,a' play?

(a) They determine the learning rate for each pair
(b) They only affect convergence speed, not the final answer
(c) They weight the KL divergence terms, determining WHERE on the manifold the projection lands; different counts yield different (potentially contradictory) policies even with infinite data

Chapter 5: Failure Modes

Now we see DPO fail. Not because of bad data, not because of poor optimization, not because of insufficient training — but because of the geometry of misspecification.

The 3-Response Counterexample (Proposition 3)

This is the paper's most striking result. Let's set it up carefully.

Setup: One prompt. Three responses a₁, a₂, a₃. True rewards: r* = [2, 3, 1]. True preference order: a₂ ≻ a₁ ≻ a₃. Policy: π_θ = [e^θ, e^−θ, 1]/Z with a single parameter θ. Base policy: θ₀ = 0 (uniform, each action gets probability 1/3).

Since BTL only cares about differences, we shift to r* = [1, 2, 0]. The Jacobian at θ₀ = 0:

A_θ₀ = (1/3)[1 + 2e⁰, −(1 + 2e⁰), e⁰ − e⁰] = [1, −1, 0]

So R^β ≈ span([1, −1, 0]). The manifold is a line through the origin in the direction [1, −1, 0].

When n_1,3 Dominates

Suppose the dataset is imbalanced: lots of (a₁ vs. a₃) comparisons, few (a₁ vs. a₂) or (a₂ vs. a₃) comparisons. Then DPO mostly cares about matching the a₁ vs. a₃ preference.

True preference: P(a₁ ≻ a₃) = σ(1 − 0) = σ(1) ≈ 0.73. DPO finds an implicit reward on the line [α, −α, 0] that matches this. Setting r(a₁) − r(a₃) = α − 0 = α, DPO needs σ(α) ≈ 0.73, giving α ≈ 1.

The resulting implicit reward: r_DPO ≈ [1, −1, 0]. Now check the implied preference order:

r_DPO(a₁) = 1, r_DPO(a₂) = −1, r_DPO(a₃) = 0
Implied order: a₁ ≻ a₃ ≻ a₂
True order: a₂ ≻ a₁ ≻ a₃

Preference reversal! DPO has placed a₂ (the truly best response) at the bottom. It has placed a₁ at the top. The policy π_θ now assigns the highest probability to the second-best response and the lowest probability to the best response. This isn't a small perturbation — it's a complete inversion of the preference order between a₁ and a₂.

Reward Degradation

It gets worse. The expected reward under the DPO policy is lower than under the base policy.

With β = 1, the implicit reward [α, −α, 0] corresponds to θ = α > 0:

π_θ^T r* = (e^α · 1 + e^−α · 2 + 1 · 0) / (1 + e^α + e^−α)

Let's compute for α = 1:

π_θ^T r* = (e · 1 + e⁻¹ · 2 + 0) / (1 + e + e⁻¹) = (2.718 + 0.736) / (1 + 2.718 + 0.368) = 3.454 / 4.086 ≈ 0.845

Compare with the base policy (θ = 0):

π_θ₀^T r* = (1 · 1 + 1 · 2 + 1 · 0) / 3 = 1.0

So DPO achieves expected reward 0.845 vs. the base policy's 1.0. DPO made things worse.
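The arithmetic above is reproduced in the sketch below, together with one negative value of θ for comparison. The value θ = −0.5 is only an illustrative choice of a policy that favors a₂.

```python
import numpy as np

def policy(theta):
    """pi_theta = [e^theta, e^-theta, 1] / Z."""
    w = np.array([np.exp(theta), np.exp(-theta), 1.0])
    return w / w.sum()

r_star = np.array([1.0, 2.0, 0.0])            # shifted true rewards
reward_base = float(policy(0.0) @ r_star)     # base policy, theta = 0
reward_dpo = float(policy(1.0) @ r_star)      # DPO policy from the text, theta = 1
reward_neg = float(policy(-0.5) @ r_star)     # a negative theta, for comparison only

print(round(reward_base, 3), round(reward_dpo, 3), round(reward_neg, 3))
```

Positive θ lowers the expected reward below the base policy, while a negative θ raises it: the sign of θ is what separates success from failure in this example.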

Two-stage RLHF, by contrast, would learn r* accurately in stage 1, then optimize the policy in stage 2. RLHF's expected reward cannot fall below the base policy's: the base policy is always feasible with zero KL penalty, so the optimizer's objective value is at least the base policy's expected reward, and since the KL penalty is nonnegative, the optimizer's expected reward is at least as large as its objective value.

Sensitivity to Preference Distribution

The killer: which failure mode you get depends entirely on which pairs are compared more often.

If n_1,2 dominates: DPO needs σ(r(a₁) − r(a₂)) ≈ σ(−1) ≈ 0.27, so r(a₁) − r(a₂) ≈ −1. On the line [α, −α, 0], this means 2α ≈ −1, so α < 0 and θ < 0. The policy correctly favors a₂! Success.
If n_1,3 dominates: Preference reversal and reward degradation (as shown above). Failure.
If n_2,3 dominates: DPO needs σ(r(a₂) − r(a₃)) ≈ σ(2) ≈ 0.88. On the line: r(a₂) − r(a₃) = −α − 0 = −α, so −α ≈ 2, α ≈ −2, θ ≈ −2. Again favors a₂. Success, but degree of success depends on the count.
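The three cases can be checked without any training by grid-searching the population DPO loss along the line [α, −α, 0]. This is a sketch with β = 1 and hypothetical counts of 80/10/10; the function `dpo_alpha` is not from the paper. With mixed counts the minimizer is a compromise between the three targets, so only its sign is examined here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

R_STAR = np.array([1.0, 2.0, 0.0])
PAIRS = [(0, 1), (0, 2), (1, 2)]              # (a1,a2), (a1,a3), (a2,a3)
P_TRUE = np.array([sigmoid(R_STAR[i] - R_STAR[j]) for i, j in PAIRS])
ALPHAS = np.linspace(-4.0, 4.0, 8001)         # candidate values of alpha

def dpo_alpha(n12, n13, n23):
    """Grid-search minimizer of the population DPO loss on the line [a, -a, 0]."""
    counts = np.array([n12, n13, n23], dtype=float)
    # Pair logits: r(a1)-r(a2) = 2a, r(a1)-r(a3) = a, r(a2)-r(a3) = -a
    logits = np.stack([2 * ALPHAS, ALPHAS, -ALPHAS], axis=1)   # shape [grid, 3]
    q = sigmoid(logits)
    ce = -(P_TRUE * np.log(q) + (1 - P_TRUE) * np.log(1 - q))
    return float(ALPHAS[np.argmin(ce @ counts)])

alpha_12 = dpo_alpha(80, 10, 10)   # a1-vs-a2 comparisons dominate
alpha_13 = dpo_alpha(10, 80, 10)   # a1-vs-a3 comparisons dominate
alpha_23 = dpo_alpha(10, 10, 80)   # a2-vs-a3 comparisons dominate

print(alpha_12, alpha_13, alpha_23)
```

Only the run dominated by a₁-versus-a₃ comparisons returns a positive α, even though all three runs see the same true reward.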

The insight: Whether DPO succeeds or fails is determined by a factor the practitioner doesn't control — the distribution of which pairs appear in the dataset. In real preference datasets, this distribution is an artifact of how annotators were assigned, which prompts were sampled, and which responses the base model generated. It's essentially arbitrary. This makes DPO's behavior unpredictable for parametric models.

Not a Coverage Problem

Remark 4 in the paper emphasizes: this failure is not due to insufficient coverage. Song et al. (2024) argued that DPO needs a coverage condition max_s,a π_θ(a|s)/π_θ₀(a|s) ≤ C. In our example, C = 3 for the uniform base policy — coverage is perfect. But DPO still fails. The problem is geometric (misspecification), not statistical (coverage).

DPO Failure Modes: Interactive Counterexample

Figure (interactive in the original): The paper's 3-response, 1-parameter counterexample. Sliders set the preference pair counts n_1,2, n_2,3, and n_1,3; the DPO projection slides along the red manifold line, causing preference reversal and reward degradation.

In the paper's 3-response counterexample, what causes DPO to reverse preferences?

(a) The 1D manifold can only encode the pattern [α, −α, 0]. When the dominant pair in the data forces α > 0, this makes a₁ preferred over a₂ (the opposite of the true ordering) because the manifold can't independently set r(a₁) and r(a₂)
(b) The learning rate is too high and the model overshoots
(c) The Bradley-Terry model is a poor fit for the true preferences

Chapter 6: RLHF Geometry

Having seen DPO fail, the paper now asks: what does two-stage RLHF actually compute, geometrically? Understanding RLHF's local behavior will reveal the path to fixing DPO.

Local Approximation of the RLHF Objective

We approximate J(θ; r*) around the base policy θ₀ using Taylor expansions. The expected reward is linear in θ (first-order), and the KL penalty is quadratic (second-order):

J(θ; r*) ≈ E_{ρ,π_θ₀}[r*(s, a)] + (θ − θ₀)^T A_ρ,θ₀ r* − (β/2)(θ − θ₀)^T F_ρ,θ₀ (θ − θ₀)

Let's define the key matrices:

D_ρ,θ₀: Diagonal matrix with entries ρ(s)π_θ₀(a|s) — the base distribution over (s, a) pairs.
A_ρ,θ₀ = A_θ₀ D_ρ,θ₀: The "scaled Jacobian" — columns are ρ(s) ∇π_θ₀(a|s) (note: ∇π, not ∇logπ).
F_ρ,θ₀ = A_θ₀ D_ρ,θ₀ A_θ₀^T = A_ρ,θ₀ A_θ₀^T: The Fisher information matrix. This is the "natural metric" on policy space — it measures how quickly the KL divergence grows as you move θ.

The Natural Gradient Solution

Taking the gradient of the quadratic approximation and setting it to zero:

∇_θ J = A_ρ,θ₀ r* − β F_ρ,θ₀ (θ − θ₀) = 0

Solving:

θ* = θ₀ + (1/β) F_ρ,θ₀^† A_ρ,θ₀ r*

This is a natural policy gradient step! The update direction F^† ∇J is the natural gradient: the steepest ascent direction measured in KL divergence rather than Euclidean distance. Kakade (2001) introduced the natural policy gradient, which accounts for the geometry of probability distributions instead of treating parameter space as Euclidean. Under the local approximation above, the RLHF solution is exactly one natural gradient step from the base policy, with step size 1/β.
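For the 3-response, 1-parameter example (A_θ₀ = [1, −1, 0], uniform base policy, r* = [1, 2, 0]) the formula can be evaluated by hand or in a few lines. The sketch below assumes β = 1 and a single prompt, so ρ(s) = 1.

```python
import numpy as np

beta = 1.0
theta0 = np.array([0.0])
A = np.array([[1.0, -1.0, 0.0]])      # score matrix A_theta0, shape [d, m]
D = np.diag([1 / 3, 1 / 3, 1 / 3])    # rho(s) * pi_theta0(a|s), uniform base
r_star = np.array([1.0, 2.0, 0.0])

A_rho = A @ D                          # scaled Jacobian
F = A @ D @ A.T                        # Fisher information matrix
theta_star = theta0 + (1.0 / beta) * np.linalg.pinv(F) @ A_rho @ r_star

print(F, theta_star)
```

Here F = 2/3 and A_ρ r* = −1/3, so θ* = −1/2: the local RLHF solution moves in the direction that favors a₂.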

Equivalence Classes of Reward Functions

Here's the key insight for fixing DPO. The RLHF solution θ* depends on r* only through the product A_ρ,θ₀ r*. This means:

R_eq^β(θ) = {r ∈ R^m : A_ρ,θ₀ r = β F_ρ,θ₀ (θ − θ₀)}

All reward functions in the same equivalence class produce the same RLHF-optimal policy. Two rewards r₁, r₂ are equivalent if and only if r₁ − r₂ ∈ N(A_ρ,θ₀) — they differ by a nullspace element.

Worked example: For our 1D policy with A_θ₀ = [1, −1, 0] and uniform base policy, A_ρ,θ₀ = (1/3)[1, −1, 0]. Its nullspace N(A_ρ,θ₀) = {[a, a, b] : a, b ∈ R} — any reward where a₁ and a₂ have the same reward (regardless of a₃'s reward).

So r* = [1, 2, 0] and r* + [c, c, d] = [1+c, 2+c, d] all yield the same RLHF policy. That's because the 1D policy can only control the ratio of a₁ to a₂, and any reward with r(a₂) − r(a₁) = 1 gives the same θ*.
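A short check of this claim, as a sketch: several nullspace shifts [c, c, d] are added to r* and the local RLHF solution is recomputed each time. The shift values and the helper name `rlhf_theta` are arbitrary illustrative choices, with β = 1.

```python
import numpy as np

A_rho = np.array([[1.0, -1.0, 0.0]]) / 3.0    # scaled Jacobian from the example
F = np.array([[2.0 / 3.0]])                    # Fisher matrix for the same example

def rlhf_theta(r, beta=1.0, theta0=0.0):
    """Local RLHF solution theta0 + (1/beta) F^+ A_rho r."""
    return theta0 + float((np.linalg.pinv(F) @ A_rho @ r)[0]) / beta

r_star = np.array([1.0, 2.0, 0.0])
shifts = [np.array([c, c, d]) for c, d in [(0.0, 0.0), (3.0, -1.0), (-2.0, 5.0)]]
thetas = [rlhf_theta(r_star + s) for s in shifts]      # same equivalence class
theta_other = rlhf_theta(np.array([2.0, 1.0, 0.0]))    # a different class

print(thetas, theta_other)
```

Every shifted reward gives the same θ*, while a reward with the opposite gap between a₁ and a₂ lands in a different class.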

The DPO-RLHF Connection (Proposition 7)

DPO's implicit reward r_θ^β is the minimum-norm representative of the RLHF equivalence class R_eq^β(θ), measured in the Mahalanobis norm ||r||_{D_ρ,θ₀}.

In other words: for each θ, there's a whole affine subspace of rewards that would make RLHF choose θ. DPO picks the shortest one. This is elegant — but it means DPO is constrained to the column space C(A_θ₀^T), while the true reward r* may require a nullspace component to be properly represented.

Column space vs. nullspace: Think of R^m as being split into two orthogonal subspaces (under the Mahalanobis inner product): the column space C(A_θ₀^T) and the nullspace N(A_ρ,θ₀). DPO searches only the column space. The true reward r* = r_θ*^β + δ where δ is a nullspace component. DPO can find r_θ*^β only if it can ignore δ. But the weighted KL-projection doesn't know to ignore δ — it projects the whole r* and gets the wrong answer.
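For the running example the split r* = r_θ*^β + δ can be computed explicitly. The sketch below uses the linearized implicit reward β A^T(θ* − θ₀) and β = 1; the variable names are illustrative.

```python
import numpy as np

beta, theta0 = 1.0, 0.0
A = np.array([[1.0, -1.0, 0.0]])
D = np.diag([1 / 3, 1 / 3, 1 / 3])
A_rho, F = A @ D, A @ D @ A.T
r_star = np.array([1.0, 2.0, 0.0])

theta_star = theta0 + float((np.linalg.pinv(F) @ A_rho @ r_star)[0]) / beta
r_col = beta * A.T[:, 0] * (theta_star - theta0)   # column-space part
delta = r_star - r_col                              # what is left over

null_check = A_rho @ delta              # zero if delta lies in N(A_rho)
inner_D = float(r_col @ D @ delta)      # D-weighted inner product of the parts

print(r_col, delta, null_check, inner_D)
```

The output shows r_col = [−0.5, 0.5, 0] and δ = [1.5, 1.5, 0]: the leftover piece treats a₁ and a₂ identically, so it cannot change θ*, yet plain DPO still tries to fit it.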

RLHF Equivalence Classes

Figure (interactive in the original): The column space (red line) and nullspace (blue arrows) partition reward space. All rewards in the same equivalence class (blue line) map to the same RLHF policy. DPO can only search the red line.

Two reward functions r₁ = [3, 5, 1] and r₂ = [4, 6, 2] yield the same RLHF policy for the 1D softmax policy class with A_ρ,θ₀ = (1/3)[1, −1, 0]. Why?

(a) r₁ − r₂ = [−1, −1, −1] is in the nullspace N(A_ρ,θ₀) since (1/3)(1·(−1) + (−1)·(−1) + 0·(−1)) = 0. Both rewards project to the same column-space component, yielding identical θ*
(b) Because r₁ and r₂ have the same preference ordering
(c) Because the KL penalty makes the policy ignore small differences

Chapter 7: The AuxDPO Algorithm

Now the fix. The authors have identified the problem: DPO searches the column space C(A_θ₀^T), but the true reward has a nullspace component that distorts the projection. The solution: search both spaces simultaneously.

The Core Idea

Introduce auxiliary variables δ ∈ N(A_ρ,θ₀) that represent the nullspace component of the reward. Optimize the DPO loss jointly over θ (column space) and δ (nullspace):

minimize_{θ ∈ R^d, δ ∈ N(A_ρ,θ₀)} L(θ, δ)

where L(θ, δ) is the DPO loss but with the reward replaced by r_θ^β + δ:

L(θ, δ) = − ∑_s,a,a' n_s,a,a' [p^BTL(r*) log p^BTL(r_θ,δ^β) + (1−p^BTL(r*)) log(1−p^BTL(r_θ,δ^β))]

where r_θ,δ^β = r_θ^β + δ.

Why This Works (Proposition 9)

By the rank-nullity theorem:

dim(C(A_θ₀^T)) + dim(N(A_ρ,θ₀)) = m

So searching column space (θ) + nullspace (δ) covers the full m-dimensional reward space R^m. The reward r* = r_θ*^β + δ* is now realizable in the augmented representation. The misspecification vanishes.
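The dimension count can be confirmed for the 3-response example with a few NumPy calls. This is a sketch that assumes the base policy has full support (so D is invertible); the nullspace basis is read off from the SVD.

```python
import numpy as np

A = np.array([[1.0, -1.0, 0.0]])          # d = 1, m = 3
D = np.diag([1 / 3, 1 / 3, 1 / 3])
A_rho = A @ D
m = A.shape[1]

col_dim = int(np.linalg.matrix_rank(A.T))  # dim C(A^T)

# Nullspace basis of A_rho: right singular vectors beyond the numerical rank.
_, s, vt = np.linalg.svd(A_rho)
rank = int(np.sum(s > 1e-12))
null_basis = vt[rank:].T                   # shape [m, m - rank]
null_dim = null_basis.shape[1]

# Column-space direction plus nullspace basis should span all of R^m.
combined = np.hstack([A.T, null_basis])
combined_rank = int(np.linalg.matrix_rank(combined))

print(col_dim, null_dim, combined_rank)
```

One column-space direction plus two nullspace directions gives three independent directions, so every reward in R³ can be written as a column-space part plus some δ.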

Proposition 9: For sufficiently large β and any tolerance ε > 0, the AuxDPO optimization achieves θ = θ* up to O(ε) error. In other words, AuxDPO recovers the RLHF-optimal policy.

From Theory to Practice: The Empirical Loss

The theoretical formulation requires knowing N(A_ρ,θ₀) and optimizing δ over it. In practice, we can't compute the nullspace of a d × m matrix when d is in the billions and m is far larger. The paper uses a clever relaxation:

Step 1: Discretize δ. Instead of defining δ over all m = |S| · |A| pairs, define it only at the 2n data points that appear in the dataset: δ = {δ(s⁽ⁱ⁾, a_w⁽ⁱ⁾), δ(s⁽ⁱ⁾, a_l⁽ⁱ⁾)}_{i=1}^n ∈ R^{2n}.

Step 2: Enforce the nullspace constraint via a penalty. Replace the hard constraint δ ∈ N(A_ρ,θ₀) with a penalty term ||A_ρ,θ₀ δ||². Since A_ρ,θ₀ δ = E_{ρ,π_θ₀}[δ(s, a) ∇ log π_θ₀(a|s)], this can be estimated from the dataset.

The empirical AuxDPO loss:

L_D(θ, δ) = −(1/n) ∑_{i=1}^n log σ(r_θ^β(s⁽ⁱ⁾, a_w⁽ⁱ⁾) − r_θ^β(s⁽ⁱ⁾, a_l⁽ⁱ⁾) + δ(s⁽ⁱ⁾, a_w⁽ⁱ⁾) − δ(s⁽ⁱ⁾, a_l⁽ⁱ⁾)) + λ · ||(1/2n) ∑_{i=1}^n [δ(s⁽ⁱ⁾, a_w⁽ⁱ⁾) ∇ log π_θ₀(a_w⁽ⁱ⁾|s⁽ⁱ⁾) + δ(s⁽ⁱ⁾, a_l⁽ⁱ⁾) ∇ log π_θ₀(a_l⁽ⁱ⁾|s⁽ⁱ⁾)]||²

Implementation Details

Practical setup:

Parameters: d (model weights θ) + 2n (auxiliary variables δ). Since typically n << d (e.g., n = 10K pairs, d = 7B parameters), the overhead is negligible: d + 2n ≈ d.
Hyperparameters: λ (nullspace penalty strength) is the only new hyperparameter. The paper doesn't tune it extensively — typical values work across tasks.
Gradients: ∇ log π_θ₀(a|s) is the score function of the frozen reference model. It does not change during training, so it only needs to be computed once per data point. Each score is a vector of dimension d (same as the model parameters), so caching all 2n of them costs 2n × d numbers; at the scale quoted above (n = 10K, d = 7B) that is about 1.4 × 10¹⁴ entries. In practice, the penalty term is computed in batch.
What δ learns: The auxiliary variables absorb the nullspace component of r* that the model's implicit reward can't express. After training, δ is discarded — only θ is used for inference.

Pseudocode

# AuxDPO loss for one batch (PyTorch sketch; inputs are assumed precomputed)
import torch
import torch.nn.functional as F

def auxdpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l,
                delta, idx, score_w, score_l, beta, lam):
    # logp_*:     log pi_theta(a|s) for chosen / rejected responses, shape [B]
    # ref_logp_*: log pi_theta0(a|s) from the frozen reference model, shape [B]
    # delta:      learnable auxiliary variables, shape [2n] (two per triplet)
    # idx:        dataset indices i of the triplets in this batch, LongTensor [B]
    # score_*:    grad log pi_theta0(a|s) for chosen / rejected, shape [B, d]

    # 1. Implicit rewards (same as DPO)
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)

    # 2. Auxiliary variables (per-datapoint scalars)
    delta_w = delta[2 * idx]
    delta_l = delta[2 * idx + 1]

    # 3. AuxDPO logit = DPO logit + delta correction
    logit = (r_w - r_l) + (delta_w - delta_l)

    # 4. Binary cross-entropy loss (same as DPO)
    bce = -F.logsigmoid(logit).mean()

    # 5. Nullspace penalty: batch estimate of ||A_{rho,theta_0} delta||^2,
    #    i.e. ||(1/2B) sum_i [delta_w * score_w + delta_l * score_l]||^2
    a_delta = 0.5 * (delta_w[:, None] * score_w + delta_l[:, None] * score_l).mean(0)
    penalty = (a_delta ** 2).sum()

    return bce + lam * penalty

The likely discovery process: The authors probably started by asking "why does DPO fail with parametric policies?" They noticed the KL-projection interpretation (Prop. 1), realized it's a misspecified estimation problem, and studied the RLHF geometry to understand what the correct answer looks like. The equivalence class structure (Lemma 6) revealed that many rewards give the same RLHF policy, and DPO picks the wrong representative. The fix was natural: let the optimizer search the nullspace too, so it can find the right equivalence class. The penalty term is a standard trick for converting a constrained optimization to an unconstrained one.

AuxDPO: Adding Nullspace Degrees of Freedom

Interactive figure (not reproduced here): DPO (orange) projects r* onto the column space line. AuxDPO (green) adds a nullspace shift δ so the augmented reward r_θ^β + δ can reach r*. The result: θ lands at the correct RLHF solution.

How does AuxDPO fix DPO's misspecification?

It adds auxiliary variables δ in the nullspace of A_ρ,θ₀ to the reward, so the combined search over θ (column space) and δ (nullspace) covers the full reward space, eliminating misspecification
It uses a larger model with more parameters to make the policy class closer to tabular
It replaces the KL divergence with a different divergence that is less sensitive to misspecification

Chapter 8: Experiments

Theory says AuxDPO should fix DPO's misspecification. Does it work in practice? The paper tests on two fronts: a didactic bandit setting (where we can verify the theory exactly) and real LLM alignment tasks.

Didactic Bandit Setting

The 3-response, 1-parameter example from Proposition 3 is implemented with a log-linear policy. With n_1,3 dominating, DPO produces preference reversal as predicted. AuxDPO with λ = 1.0 correctly recovers θ < 0, favoring a₂. The auxiliary variable δ absorbs the nullspace component, steering the projection to the correct equivalence class.
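
To see why a one-parameter log-linear policy is so restrictive, it can help to look at the implicit rewards it is able to express. The sketch below does not reproduce the Proposition 3 construction; the features PHI and the values of BETA and THETA0 are hypothetical choices for illustration. It shows that, whatever θ is, the implicit reward gaps stay in a fixed proportion set by the features, so any true reward whose gaps have a different proportion cannot be represented.

```python
import math

PHI = [1.0, 0.2, -1.0]      # hypothetical features for a1, a2, a3
BETA, THETA0 = 1.0, 0.0     # assumed temperature and base parameter

def log_probs(theta):
    """Log-probabilities of the log-linear policy, pi(a) proportional to exp(theta * phi(a))."""
    logits = [theta * p for p in PHI]
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - log_z for l in logits]

def implicit_reward(theta):
    """r_theta(a) = beta * log(pi_theta(a) / pi_theta0(a)) for each response."""
    lp, lp0 = log_probs(theta), log_probs(THETA0)
    return [BETA * (a - b) for a, b in zip(lp, lp0)]

def gap_ratio(theta):
    """Ratio of the (a1, a3) reward gap to the (a1, a2) reward gap."""
    r = implicit_reward(theta)
    return (r[0] - r[2]) / (r[0] - r[1])

ratios = [gap_ratio(t) for t in (-2.0, -0.5, 0.7, 3.0)]
print(ratios)   # each entry is (phi1 - phi3) / (phi1 - phi2) = 2.5, up to rounding
```

The normalizer cancels in every gap, leaving r(a_i) − r(a_j) = β(θ − θ₀)(φ_i − φ_j). Moving θ rescales all gaps together but can never change their ratio, which is the one-dimensional "column space line" the figure earlier refers to.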

LLM Alignment: Datasets

RewardBench v2 (Malik et al., 2025): 1.87K prompts testing factuality, instruction following, and focus. Each prompt has a chosen and rejected response.
MMLU-Pro (Wang et al., 2024b): 12K multi-task understanding questions, 10 possible answers each. Converted to preference format by pairing the correct answer (chosen) with each incorrect one (rejected).

Training data: UltraFeedback (Cui et al., 2024) and its binarized version. Models: Llama3.1-8B, Llama3.2-1B, Qwen3-0.6B.

Main Results

Key finding: AuxDPO consistently outperforms DPO across all models and datasets, both in-distribution (ID) and out-of-distribution (OOD). The gains are especially large on OOD evaluation, suggesting that AuxDPO learns more transferable alignments.

Model	Dataset	Setting	DPO	AuxDPO	IPO	DPOP
Llama3.1-8B	MMLU-Pro	ID	57.14	63.26	59.18	61.22
	MMLU-Pro	OOD	8.16	14.28	10.20	6.12
	RewardBench v2	ID	56.01	66.72	61.34	62.27
	RewardBench v2	OOD	14.31	32.44	20.17	19.87
Llama3.2-1B	MMLU-Pro	ID	39.58	45.83	43.75	44.21
	MMLU-Pro	OOD	6.25	12.52	14.58	4.16
	RewardBench v2	ID	77.21	86.37	69.72	71.21
	RewardBench v2	OOD	14.11	43.27	20.42	18.76
Qwen3-0.6B	MMLU-Pro	ID	53.12	61.78	47.48	56.67
	MMLU-Pro	OOD	11.34	22.22	15.56	17.78
	RewardBench v2	ID	55.10	65.31	53.06	51.02
	RewardBench v2	OOD	−8.16	18.36	−8.23	−6.25

Values show % change in mean accuracy relative to the base policy. Negative values indicate degradation from the base policy.

Reading the Results

Several patterns jump out:

1. AuxDPO wins almost everywhere. On 11 of the 12 (model, dataset, setting) combinations, AuxDPO scores highest. The one exception is Llama3.2-1B on MMLU-Pro OOD, where IPO (14.58) beats AuxDPO (12.52). DPO is never best.

2. The OOD gap is huge. On Llama3.2-1B RewardBench v2 OOD, AuxDPO achieves +43.27% vs. DPO's +14.11%. That's a 3x improvement in OOD generalization. This suggests that DPO's misspecification causes it to overfit to distribution-specific artifacts, while AuxDPO finds more robust alignments.

3. DPO can be catastrophically bad. On Qwen3-0.6B RewardBench v2 OOD, DPO scores −8.16%, meaning it is worse than the base model (IPO and DPOP also fall below the base policy here, at −8.23% and −6.25%). This is consistent with the reward degradation predicted by the theory (Proposition 3). AuxDPO scores +18.36%, a swing of over 26 percentage points.

4. Model size matters. The smaller the model (fewer parameters d relative to task complexity m), the worse the misspecification. Qwen3-0.6B has the most severe DPO failures, which makes sense: fewer parameters mean a lower-dimensional manifold, which leaves more reward functions unrealizable.
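
The headline numbers in these patterns can be checked directly against the results table. The short script below copies the twelve rows as reported and tallies which method scores highest in each, then recomputes the OOD ratio and the Qwen3-0.6B swing. The variable names are illustrative only.

```python
# Rows copied from the results table: (model, dataset, setting) -> scores per method.
METHODS = ("DPO", "AuxDPO", "IPO", "DPOP")
RESULTS = {
    ("Llama3.1-8B", "MMLU-Pro", "ID"): (57.14, 63.26, 59.18, 61.22),
    ("Llama3.1-8B", "MMLU-Pro", "OOD"): (8.16, 14.28, 10.20, 6.12),
    ("Llama3.1-8B", "RewardBench v2", "ID"): (56.01, 66.72, 61.34, 62.27),
    ("Llama3.1-8B", "RewardBench v2", "OOD"): (14.31, 32.44, 20.17, 19.87),
    ("Llama3.2-1B", "MMLU-Pro", "ID"): (39.58, 45.83, 43.75, 44.21),
    ("Llama3.2-1B", "MMLU-Pro", "OOD"): (6.25, 12.52, 14.58, 4.16),
    ("Llama3.2-1B", "RewardBench v2", "ID"): (77.21, 86.37, 69.72, 71.21),
    ("Llama3.2-1B", "RewardBench v2", "OOD"): (14.11, 43.27, 20.42, 18.76),
    ("Qwen3-0.6B", "MMLU-Pro", "ID"): (53.12, 61.78, 47.48, 56.67),
    ("Qwen3-0.6B", "MMLU-Pro", "OOD"): (11.34, 22.22, 15.56, 17.78),
    ("Qwen3-0.6B", "RewardBench v2", "ID"): (55.10, 65.31, 53.06, 51.02),
    ("Qwen3-0.6B", "RewardBench v2", "OOD"): (-8.16, 18.36, -8.23, -6.25),
}

def best_method(scores):
    """Name of the method with the highest score in one table row."""
    return METHODS[max(range(len(scores)), key=lambda i: scores[i])]

wins = {m: 0 for m in METHODS}
for scores in RESULTS.values():
    wins[best_method(scores)] += 1

llama_ood = RESULTS[("Llama3.2-1B", "RewardBench v2", "OOD")]
ood_ratio = llama_ood[1] / llama_ood[0]          # AuxDPO gain divided by DPO gain

qwen_ood = RESULTS[("Qwen3-0.6B", "RewardBench v2", "OOD")]
swing = qwen_ood[1] - qwen_ood[0]                # AuxDPO minus DPO, in points

print(wins)                  # {'DPO': 0, 'AuxDPO': 11, 'IPO': 1, 'DPOP': 0}
print(round(ood_ratio, 2))   # 3.07
print(round(swing, 2))       # 26.52
```

In this tally AuxDPO takes 11 of the 12 rows, with IPO ahead on Llama3.2-1B MMLU-Pro OOD, and DPO takes none.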

What the paper doesn't say: The computational overhead of AuxDPO is not extensively analyzed. Adding 2n auxiliary variables and computing the penalty term (which involves reference model score functions) adds memory and compute. For n = 10K and d = 7B, the 20K extra parameters are negligible, but caching/computing ∇ log π_θ₀ for every data point could be significant for very large datasets. The paper also doesn't test on reward model benchmarks beyond accuracy — calibration, ranking quality, and win rates against human judges would strengthen the case.
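
A back-of-the-envelope calculation makes the contrast concrete. The sketch below assumes the most naive strategy, caching one full-parameter score vector per response at 2 bytes per value; both the strategy and the precision are assumptions for illustration, and a practical implementation would likely compute the penalty in batch rather than store everything.

```python
def score_cache_bytes(n_pairs, n_params, bytes_per_value=2):
    """Bytes needed to store one score vector per response (two responses per pair)."""
    return 2 * n_pairs * n_params * bytes_per_value

n, d = 10_000, 7_000_000_000      # n = 10K preference pairs, d = 7B parameters
aux_params = 2 * n                # auxiliary scalars delta added by AuxDPO
cache_tb = score_cache_bytes(n, d) / 1e12

print(aux_params)   # 20000
print(cache_tb)     # 280.0 (terabytes at 2 bytes per value)
```

The 20K auxiliary scalars are negligible next to 7B weights, while a full naive cache would run to hundreds of terabytes, which is why the way the penalty is computed matters far more than the number of extra parameters.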

Per-Subject Breakdown (MMLU-Pro)

The paper also reports per-subject accuracies on MMLU-Pro. Some highlights:

Biology OOD: AuxDPO gets 75.07% vs. DPO's 55.93% — a 19-point gap.
Engineering ID: AuxDPO gets 32.18% vs. DPO's 10.22% — tripling the improvement.
Law OOD: DPO gets 23.61% (degraded from base), AuxDPO gets 34.50% (improved).

The gains are not uniform across subjects, but AuxDPO is never worse than DPO in any subject.

Results Comparison: DPO vs. AuxDPO

Interactive chart (not reproduced here): accuracy improvement over base policy (% change) across model sizes and settings. Bars below zero mean the method is worse than doing nothing.

On Qwen3-0.6B RewardBench v2 OOD, DPO scores −8.16% (worse than base). What does this correspond to in the theory?

This is the reward degradation predicted by Proposition 3: DPO's misspecified KL-projection finds a policy with lower expected reward than the base, exactly because the small model's reward manifold can't represent the true reward
The model is too small for the task and needs more training data
The OOD setting is inherently unfair to DPO

Chapter 9: Connections

Cheat Sheet: Every Key Equation

Equation	What it says	When to use it
r_θ^β(s,a) = β log(π_θ/π_ref)	Implicit reward: every policy induces a reward via log-ratio	Understanding DPO's reward space
R^β = {r_θ^β : θ ∈ R^d}	Implicit reward manifold: d-dimensional subset of R^m	Checking if misspecification applies
r_DPO = argmin_{r ∈ R^β} ∑ n · d_KL	DPO = weighted KL-projection of r* onto R^β	Understanding DPO's behavior
r_θ^β ≈ β A_θ₀^T(θ−θ₀)	Linearized implicit reward: lives in C(A_θ₀^T)	Local analysis near base policy
θ* = θ₀ + (1/β) F^† A_ρ r*	RLHF solution = natural policy gradient step	Understanding the target
R_eq^β(θ) = {r : Ar = βF(θ−θ₀)}	RLHF equivalence class: all rewards yielding same θ*	Understanding why many rewards are equivalent
L_AuxDPO = L_DPO(r_θ+δ) + λ||Aδ||²	AuxDPO loss: DPO + nullspace auxiliary variables + penalty	Implementation
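
Two rows of this table combine into a one-line argument for why nullspace shifts are harmless. Taking the cheat-sheet formula for θ* at face value, and writing θ*(r) for the RLHF solution under reward r, any δ with A_ρ δ = 0 leaves the solution unchanged:

```latex
\begin{aligned}
\theta^*(r + \delta)
  &= \theta_0 + \tfrac{1}{\beta} F^{\dagger} A_\rho (r + \delta) \\
  &= \theta_0 + \tfrac{1}{\beta} F^{\dagger} A_\rho r
     + \tfrac{1}{\beta} F^{\dagger} \underbrace{A_\rho \delta}_{=\,0} \\
  &= \theta^*(r).
\end{aligned}
```

So r and r + δ sit in the same RLHF equivalence class, and the penalty λ||Aδ||² in the AuxDPO loss is what pushes the learned δ toward this nullspace.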

Open Questions

How bad is misspecification for specific architectures? The paper shows misspecification exists for any parametric class. But how much does it matter for a 70B Transformer vs. a 1B one? Is there a sweet spot where d is large enough that misspecification is negligible?
Can we estimate misspecification? Is there a diagnostic that tells you "DPO is heavily misspecified on your data" without running AuxDPO? Perhaps by measuring the nullspace penalty term after DPO training.
Scaling the penalty computation. The score function cache for the nullspace penalty grows as O(n × d). For very large datasets and models, this could be prohibitive. Can the penalty be approximated with a sketch or random projection?
Beyond BTL. The analysis assumes the Bradley-Terry preference model. Real preferences are noisier, inconsistent, and context-dependent. Does AuxDPO still help when the BTL assumption itself is misspecified?
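
As a purely speculative illustration of the sketching question above (this is not something the paper proposes or evaluates), one could project each score vector to k dimensions with a fixed random Gaussian matrix and cache only the projections. Because the penalty is a squared norm of a linear combination of score vectors, the projected penalty approximates the exact one. The sizes, the random stand-in "scores", and all function names below are assumptions.

```python
import math
import random

def sketch_vector(vec, proj):
    """Project a d-dimensional vector to k dimensions with a fixed matrix."""
    return [sum(p * v for p, v in zip(row, vec)) for row in proj]

def penalty(vectors, delta):
    """Squared norm of sum_j delta[j] * vectors[j]."""
    dim = len(vectors[0])
    total = [sum(dj * v[i] for dj, v in zip(delta, vectors)) for i in range(dim)]
    return sum(t * t for t in total)

rng = random.Random(0)
d, k, m = 1000, 250, 6            # toy sizes: full dim, sketch dim, number of score vectors

# Random stand-ins for the reference-model score vectors and the auxiliary variables.
scores = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(m)]
delta = [rng.gauss(0, 1) for _ in range(m)]

# Gaussian projection scaled by 1/sqrt(k), so squared norms are preserved in expectation.
proj = [[rng.gauss(0, 1) / math.sqrt(k) for _ in range(d)] for _ in range(k)]
sketched = [sketch_vector(s, proj) for s in scores]   # cache these k-dim vectors instead

exact = penalty(scores, delta)
approx = penalty(sketched, delta)
print(exact, approx)
```

The cache shrinks from O(n × d) to O(n × k), at the price of a relative error on the order of sqrt(2/k) in the penalty. Whether that error is tolerable for AuxDPO training is exactly the open question.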

The Big Picture

This paper reveals a fundamental tension in direct alignment: DPO's elegance comes from reparameterizing the reward in terms of the policy, but this reparameterization constrains the reward space to a low-dimensional manifold. The constraint is invisible in the tabular case (where the manifold is the whole space) but becomes a source of systematic error for parametric policies.

AuxDPO's fix is principled and minimal: add auxiliary variables that search the directions the policy can't reach. The added parameter count is small (2n extra scalars, one hyperparameter), though the penalty's score-function computation can be costly at scale, and the theory guarantees recovery of the RLHF solution. The practical gains are substantial, especially for smaller models where misspecification is more severe.

The deeper lesson: whenever you reparameterize one quantity in terms of another (reward in terms of policy, here), you inherit the latter's capacity limitations. If those limitations don't match the true data, you're misspecified — and no amount of data will save you. Only expanding the representational capacity (here, via auxiliary variables) can fix a misspecification problem.

A colleague says: "Just use a bigger model and DPO's misspecification goes away." Is this correct?

Yes, because a bigger model has more parameters and can represent more reward functions
Partially: a bigger model increases d, making R^β higher-dimensional and reducing misspecification in theory. But as long as d < m (which is always true for real LLMs), misspecification persists. The severity depends on the specific reward and data distribution, not just model size. AuxDPO is a targeted fix regardless of scale.
No, model size has nothing to do with misspecification