Rafailov, Sharma, Mitchell, Ermon, Manning, Finn (2023)

Direct Preference Optimization

Your language model is secretly a reward model — skip the RL loop entirely and align LLMs with a simple classification loss on preference pairs.

Prerequisites: RLHF pipeline + KL divergence + Bradley-Terry model
10 chapters, 5+ simulations

Chapter 0: The Problem

By 2023, RLHF had become the standard recipe for aligning language models with human preferences. ChatGPT, InstructGPT, Claude — all used the same three-stage pipeline:

  1. SFT: Fine-tune a pretrained LM on high-quality demonstrations
  2. Reward model: Train a separate neural network to score outputs using human preference comparisons
  3. RL optimization: Use PPO to maximize the learned reward while staying close to the SFT model via a KL penalty

This pipeline works. But it is a lot of machinery. You need to train and serve a reward model. You need to implement PPO with all its moving parts — value function, advantage estimation, clipping, multiple epochs. You need to sample from the policy during training (expensive for large LMs). You need to tune a KL penalty coefficient. And the whole thing is notoriously brittle: reward hacking, mode collapse, and training instability are constant threats.

Figure: The RLHF Pipeline. The standard three-stage RLHF pipeline vs. DPO's simplified approach.

The wish list: Can we get the same alignment quality without training a reward model, without running PPO, without sampling from the LM during training, and without tuning RL hyperparameters? Can we reduce RLHF to something as simple as supervised fine-tuning? DPO says yes.
What makes the standard RLHF pipeline complex to implement and run?

Chapter 1: The Key Insight

Here is the core idea of DPO in brief:

The optimal policy under the RLHF objective has a closed-form solution. This means you can express the reward function as a function of the optimal policy and the reference policy. Substitute this into the preference model, and the reward model disappears — you get a loss that directly optimizes the policy on preference data.

Let's unpack this step by step. The standard RLHF objective is:

maxπ Ex~D, y~π[r(x,y)] − β DKL(π || πref)

This is a KL-regularized optimization problem: maximize reward while a penalty keeps the policy close to the reference policy. The remarkable fact is that this problem has an analytical solution:

π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x)

where Z(x) = Σy πref(y|x) exp(r(x,y)/β) is a normalizing constant (the partition function). This form is known from the control-as-inference literature; DPO's contribution is to rearrange it. Solving for r:

r(x,y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)

The reward is just β times the log-ratio of the optimal policy to the reference policy, plus a prompt-dependent constant. Now here's the magic: the Bradley-Terry preference model only cares about differences in reward between two completions for the same prompt. And β log Z(x) is the same for both completions. It cancels.

So you can write the probability of preferring yw over yl entirely in terms of policy log-ratios — no reward model needed. Train the policy directly on preference data with a binary cross-entropy loss.
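A small numeric sketch of this cancellation. The three-completion toy problem below (the reference probabilities, rewards, and β) is made up for illustration and does not come from the paper:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta = 0.5
# Hypothetical toy setup: one prompt, three candidate completions.
pi_ref = [0.5, 0.3, 0.2]      # reference policy probabilities
reward = [1.0, 2.0, 0.0]      # "true" rewards r(x, y)

# Closed-form optimal policy: pi* is proportional to pi_ref * exp(r / beta).
unnorm = [p * math.exp(r / beta) for p, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Preference probability for completion 1 over completion 0, computed two ways.
w, l = 1, 0
p_from_reward = sigmoid(reward[w] - reward[l])
log_ratio = [math.log(ps / pr) for ps, pr in zip(pi_star, pi_ref)]
p_from_policy = sigmoid(beta * (log_ratio[w] - log_ratio[l]))

# The two values agree up to floating-point error.
print(p_from_reward, p_from_policy)
```

The second computation uses only policy and reference probabilities; the β log Z(x) offset is identical for both completions and drops out of the difference.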

Why does the partition function Z(x) drop out of the DPO loss?

Chapter 2: The RLHF Objective

Before we derive DPO, we need to understand precisely what RLHF optimizes. The pipeline has two key components.

The reward model

Given a dataset of human preferences — triples (x, yw, yl) where yw is preferred over yl for prompt x — we fit a reward model rφ(x,y) using the Bradley-Terry model. The probability that y1 is preferred over y2 is:

p*(y1 ≻ y2 | x) = exp(r*(x, y1)) / (exp(r*(x, y1)) + exp(r*(x, y2))) = σ(r*(x, y1) − r*(x, y2))

where σ is the sigmoid function. The reward model loss is binary cross-entropy:

LR(rφ) = −E(x, yw, yl)[log σ(rφ(x, yw) − rφ(x, yl))]
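As a rough sketch, the two formulas above can be written directly in Python. The function names and the reward scores are hypothetical; a real reward model would produce the scores from a neural network:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bt_preference_prob(r1, r2):
    """P(y1 preferred over y2) under the Bradley-Terry model."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

def reward_model_loss(pairs):
    """Mean of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs."""
    return -sum(math.log(sigmoid(rw - rl)) for rw, rl in pairs) / len(pairs)

# Hypothetical reward-model scores for (preferred, dispreferred) completions.
scores = [(2.0, 0.5), (0.1, 0.3), (1.2, 1.2)]
print(bt_preference_prob(2.0, 0.5), sigmoid(2.0 - 0.5))  # same value, two forms
print(reward_model_loss(scores))
```

A pair scored identically contributes log 2 to the loss; pairs ranked correctly by a wide margin contribute close to zero.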

The KL-constrained RL objective

Once we have a reward model, we optimize the policy to maximize reward while not drifting too far from the reference (SFT) model:

maxπθ Ex~D, y~πθ[rφ(x,y)] − β DKL(πθ(y|x) || πref(y|x))

The β parameter controls how much the policy can deviate from πref. Large β means stay very close to the reference (conservative). Small β means aggressively chase reward (risky — can lead to reward hacking).
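To make the role of β concrete, here is a toy sketch with three completions and made-up numbers. It evaluates the policy that maximizes this objective for a discrete distribution, which is proportional to πref · exp(r/β), at several values of β:

```python
import math

def optimal_policy(pi_ref, reward, beta):
    """Maximizer of E[r] - beta * KL(pi || pi_ref) for a discrete distribution."""
    unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def expected_reward(pi, reward):
    return sum(p * r for p, r in zip(pi, reward))

def kl(pi, pi_ref):
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0)

# Hypothetical reference policy and rewards for one prompt.
pi_ref = [0.5, 0.3, 0.2]
reward = [1.0, 2.0, 0.0]
for beta in (0.1, 0.5, 2.0, 10.0):
    pi = optimal_policy(pi_ref, reward, beta)
    print(f"beta={beta:<5} E[r]={expected_reward(pi, reward):.3f} KL={kl(pi, pi_ref):.3f}")
```

As β grows, both the expected reward and the KL divergence from the reference shrink: the policy gives up reward in exchange for staying close to πref.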

Why the KL constraint? Without it, the policy would exploit any imperfections in the reward model. A small bias in the reward model becomes a gaping hole that the policy drives a truck through. The KL penalty keeps the policy close to the reference distribution where the reward model's training data lives — preventing the policy from generating out-of-distribution outputs that "trick" the reward model into giving high scores.
Figure: Reward vs. KL Tradeoff. How the KL constraint strength β affects the tradeoff between reward and policy divergence.
What would happen if you maximized reward without any KL constraint (β = 0)?

Chapter 3: The Closed-Form Solution

This is the mathematical heart of DPO. We will derive the closed-form optimal policy for the KL-constrained reward maximization objective, step by step.

Step 1: Expand the KL divergence

The objective is:

maxπ Ey~π[r(x,y)] − β DKL(π || πref)

Expanding the KL term:

= maxπ Ey~π[r(x,y)] − β Ey~π[log(π(y|x) / πref(y|x))]
= maxπ Ey~π[r(x,y) − β log(π(y|x) / πref(y|x))]

Step 2: Write as a KL divergence

We can rearrange to recognize this as a KL divergence. Define:

π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y) / β)

where Z(x) = Σy πref(y|x) exp(r(x,y)/β) ensures normalization. Then the objective can be rewritten as:

maxπ −β DKL(π(y|x) || π*(y|x)) + β log Z(x)
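The algebra behind this rewrite, spelled out using only the definitions of π* and Z(x) given above (the identity used is πref · exp(r/β) = Z(x) · π*):

```latex
\begin{aligned}
\mathbb{E}_{y\sim\pi}\!\left[r(x,y) - \beta\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)}\right]
&= -\beta\,\mathbb{E}_{y\sim\pi}\!\left[\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)} - \frac{r(x,y)}{\beta}\right] \\
&= -\beta\,\mathbb{E}_{y\sim\pi}\!\left[\log\frac{\pi(y|x)}{\pi_{\mathrm{ref}}(y|x)\,\exp\!\big(r(x,y)/\beta\big)}\right] \\
&= -\beta\,\mathbb{E}_{y\sim\pi}\!\left[\log\frac{\pi(y|x)}{Z(x)\,\pi^{*}(y|x)}\right] \\
&= -\beta\,D_{\mathrm{KL}}\!\big(\pi(y|x)\,\|\,\pi^{*}(y|x)\big) + \beta\log Z(x)
\end{aligned}
```

The last step uses the fact that log Z(x) does not depend on y, so it comes out of the expectation unchanged.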

Since DKL ≥ 0 with equality iff π = π*, the maximum is achieved at π = π*. Done.
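A quick numerical sanity check of this argument, on a made-up three-completion problem: the objective evaluated at π* should equal β log Z(x), and no other policy should beat it. The random-search comparison is only an illustration, not a proof:

```python
import math
import random

def objective(pi, pi_ref, reward, beta):
    """E_{y~pi}[r] - beta * KL(pi || pi_ref) for a discrete policy."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, pi_ref, reward) if p > 0)

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]      # hypothetical reference policy
reward = [1.0, 2.0, 0.0]      # hypothetical rewards

unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]
best = objective(pi_star, pi_ref, reward, beta)

# Compare against randomly drawn policies over the same three completions.
rng = random.Random(0)

def random_policy():
    w = [rng.random() + 1e-9 for _ in pi_ref]
    s = sum(w)
    return [v / s for v in w]

others = [objective(random_policy(), pi_ref, reward, beta) for _ in range(1000)]
print(best, beta * math.log(Z), max(others))
```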

Step 3: Reparameterize the reward

Now we rearrange the optimal policy equation to express reward in terms of policy ratios. Starting from:

π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y)/β)

Take the logarithm of both sides:

log π*(y|x) = log πref(y|x) + r(x,y)/β − log Z(x)

Solve for r(x,y):

r(x,y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)
The reparameterization: The reward is just β times the log-ratio of optimal policy to reference policy, plus a prompt-dependent constant β log Z(x). This is the equation that makes DPO possible. Instead of learning a reward model, we can express reward entirely in terms of things we already have: the policy we're training and the frozen reference model.
What is the closed-form optimal policy for the KL-constrained reward maximization objective?

Chapter 4: The DPO Loss

Now we complete the derivation. We have the reward reparameterized as:

r(x,y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)

We substitute this into the Bradley-Terry preference model. The probability of preferring yw over yl is:

p*(yw ≻ yl | x) = σ(r*(x, yw) − r*(x, yl))

Step 4: Substitute and cancel

Plugging in the reparameterized reward:

r*(x, yw) − r*(x, yl) = β log(π*(yw|x) / πref(yw|x)) + β log Z(x) − β log(π*(yl|x) / πref(yl|x)) − β log Z(x)

The β log Z(x) terms cancel! We're left with:

= β log(π*(yw|x) / πref(yw|x)) − β log(π*(yl|x) / πref(yl|x))

Step 5: The final DPO loss

Now we replace π* with our trainable policy πθ and write the maximum likelihood objective:

LDPO(πθ; πref) = −E(x, yw, yl)~D[log σ(β log(πθ(yw|x) / πref(yw|x)) − β log(πθ(yl|x) / πref(yl|x)))]

That's it. The entire DPO algorithm in one equation. No reward model. No RL loop. No value function. Just a supervised loss on preference pairs.
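For a single preference pair, the loss can be sketched in a few lines of Python. This is a minimal illustration that assumes the four sequence log-probabilities have already been computed; the function names and the example numbers are hypothetical:

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -math.log1p(math.exp(-z)) if z >= 0 else z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    logit = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -log_sigmoid(logit)

# Hypothetical sequence log-probabilities (sums of token log-probs).
print(dpo_loss(-12.0, -15.0, -12.0, -15.0))   # policy equals reference: loss is log 2
print(dpo_loss(-10.0, -17.0, -12.0, -15.0))   # policy favors the preferred completion: lower loss
```

Note that the absolute log-probabilities do not matter on their own; only how far the policy has moved relative to the reference on each completion enters the loss.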

What this loss says in plain English: For each preference pair, compute the log-probability ratio of the preferred completion under the current policy vs. the reference policy, minus the same ratio for the dispreferred completion. Push this difference through a sigmoid and do binary cross-entropy. The loss decreases when the policy assigns relatively higher probability to preferred completions (compared to the reference) than to dispreferred ones.
Figure: The DPO Derivation Flow. The five steps from RLHF objective to DPO loss. The key moment: Z(x) cancels in the Bradley-Terry difference.

What is the DPO loss function?

Chapter 5: Why DPO Works

The loss function looks simple, but what is it actually doing to the policy at each gradient step? Let's look at the gradient.

The DPO gradient

The gradient of the DPO loss with respect to the policy parameters θ is:

∇θ LDPO = −β E(x, yw, yl)[σ(r̂θ(x,yl) − r̂θ(x,yw)) · (∇θ log πθ(yw|x) − ∇θ log πθ(yl|x))]

where r̂θ(x,y) = β log(πθ(y|x) / πref(y|x)) is the implicit reward.
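This gradient follows from the chain rule in two short steps. Writing u for the implicit reward gap (a shorthand introduced here, not in the paper) and using the identity 1 − σ(u) = σ(−u):

```latex
\begin{aligned}
u_\theta &= \hat{r}_\theta(x,y_w) - \hat{r}_\theta(x,y_l), \qquad
\mathcal{L} = -\log\sigma(u_\theta) \\
\frac{\partial \mathcal{L}}{\partial u_\theta}
&= -\big(1 - \sigma(u_\theta)\big) = -\sigma(-u_\theta)
 = -\sigma\big(\hat{r}_\theta(x,y_l) - \hat{r}_\theta(x,y_w)\big) \\
\nabla_\theta u_\theta
&= \beta\big(\nabla_\theta\log\pi_\theta(y_w|x) - \nabla_\theta\log\pi_\theta(y_l|x)\big)
 \quad\text{(the } \pi_{\mathrm{ref}} \text{ terms do not depend on } \theta\text{)} \\
\nabla_\theta \mathcal{L}
&= -\beta\,\sigma\big(\hat{r}_\theta(x,y_l) - \hat{r}_\theta(x,y_w)\big)
   \big(\nabla_\theta\log\pi_\theta(y_w|x) - \nabla_\theta\log\pi_\theta(y_l|x)\big)
\end{aligned}
```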

There are three parts to this gradient:

  1. ∇θ log πθ(yw|x): increase the log-probability of the preferred completion
  2. −∇θ log πθ(yl|x): decrease the log-probability of the dispreferred completion
  3. σ(r̂θ(x,yl) − r̂θ(x,yw)): a weighting factor that measures how "wrong" the current policy is

The importance weight

The weighting factor is the sigmoid of (implicit reward of dispreferred − implicit reward of preferred). When the implicit reward incorrectly ranks the dispreferred completion higher than the preferred one, this weight is large — the gradient pushes hard to fix the ordering. When the policy already correctly ranks them, the weight is small — the gradient is gentle, preventing over-optimization.
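The behavior of this weight is easy to tabulate. The sketch below treats the per-pair loss as a function of the implicit reward gap r̂(yw) − r̂(yl), and compares the weight with a finite-difference estimate of the loss slope; the gap values are arbitrary examples:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def pair_loss(reward_gap):
    """-log sigmoid(r_hat_w - r_hat_l) as a function of the implicit reward gap."""
    return -math.log(sigmoid(reward_gap))

def weight(reward_gap):
    """sigmoid(r_hat_l - r_hat_w): the per-example gradient weight."""
    return sigmoid(-reward_gap)

eps = 1e-6
for gap in (-2.0, -0.5, 0.0, 2.0):
    # Central finite difference of the loss with respect to the gap.
    numeric = (pair_loss(gap + eps) - pair_loss(gap - eps)) / (2 * eps)
    print(f"gap={gap:+.1f} weight={weight(gap):.3f} dLoss/dgap={numeric:.3f}")
```

The slope of the loss is the negative of the weight at every gap: a badly misranked pair (negative gap) gets a weight near 1, while a confidently correct pair gets a weight near 0.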

Self-regulating gradients: Without this weighting, a naive approach (just maximize log π(yw) and minimize log π(yl)) causes the model to degenerate. The paper confirms this experimentally — the "unlikelihood" baseline without proper weighting performs poorly. The sigmoid weight is what prevents DPO from collapsing: it automatically focuses learning on examples the policy currently gets wrong.
Figure: DPO Gradient Dynamics. How the implicit reward evolves during training. The weight (orange) is large when the policy ranks a pair incorrectly and shrinks as it learns.
What prevents DPO from causing the model to degenerate by aggressively pushing all preferred completions up and all dispreferred completions down?

Chapter 6: Theoretical Properties

DPO isn't just a practical trick. It has rigorous theoretical backing that guarantees it optimizes exactly the same objective as RLHF.

Property 1: Same objective as RLHF

DPO and PPO-based RLHF optimize the same KL-constrained reward maximization objective. The only difference is how they optimize it. RLHF does it in two stages (fit reward, then RL). DPO does it in one stage (direct classification on preferences). The optimal solution is identical.

Property 2: No loss of expressiveness

The reparameterization r(x,y) = β log(π(y|x) / πref(y|x)) might seem restrictive — are we limiting what rewards we can represent? No. The paper proves (Theorem 1) that every reward function's equivalence class can be represented this way. Two rewards that differ by only a function of x (i.e., r(x,y) − r'(x,y) = f(x)) produce the same preference distribution and the same optimal policy. The reparameterization just picks a canonical representative from each class.
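A toy check of the equivalence-class claim, using made-up numbers: shifting every reward for a prompt by the same constant f(x) leaves the optimal policy unchanged, and the log-ratio form recovers the reward differences either way.

```python
import math

def optimal_policy(pi_ref, reward, beta):
    unnorm = [q * math.exp(r / beta) for q, r in zip(pi_ref, reward)]
    z = sum(unnorm)
    return [u / z for u in unnorm]

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]               # hypothetical reference policy
reward = [1.0, 2.0, 0.0]               # hypothetical reward r(x, y)
shifted = [r + 7.0 for r in reward]    # r'(x, y) = r(x, y) + f(x), with f(x) = 7

pi_a = optimal_policy(pi_ref, reward, beta)
pi_b = optimal_policy(pi_ref, shifted, beta)
print(pi_a)
print(pi_b)

# Canonical representative of the class: beta * log(pi* / pi_ref).
canonical = [beta * math.log(p / q) for p, q in zip(pi_a, pi_ref)]
print(canonical)
```

The canonical reward differs from both r and r' only by a constant per prompt, which is exactly the degree of freedom that preferences cannot identify.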

Property 3: Consistency under Bradley-Terry

If the true preference data is generated by a Bradley-Terry model with some reward r*, then as the dataset grows, the DPO solution converges to the optimal policy for r*. This is a standard consistency guarantee — DPO doesn't introduce any additional bias beyond the Bradley-Terry assumption.

Property 4: Implicit reward model

Your language model is a reward model. The implicit reward at any point during training is:

r̂θ(x,y) = β log(πθ(y|x) / πref(y|x))

You can extract this reward for any (x, y) pair by computing the log-probability under your policy minus the log-probability under the reference, scaled by β. No separate reward model is needed, and by Theorem 1 this implicit reward is as expressive as any explicit one.
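In practice, a completion's log-probability is the sum of its per-token log-probabilities, so the implicit reward can be sketched as below. The helper names and the token log-probabilities are hypothetical; in a real setup they would come from forward passes of the policy and the reference model:

```python
def sequence_logp(token_logps):
    """Log-probability of a completion: the sum of its per-token log-probabilities."""
    return sum(token_logps)

def implicit_reward(policy_token_logps, ref_token_logps, beta=0.1):
    """beta * log(pi_theta(y|x) / pi_ref(y|x)), computed from per-token log-probs."""
    return beta * (sequence_logp(policy_token_logps) - sequence_logp(ref_token_logps))

# Hypothetical per-token log-probs for one three-token completion under each model.
policy_lp = [-0.2, -1.1, -0.4]
ref_lp = [-0.5, -1.3, -0.9]
print(implicit_reward(policy_lp, ref_lp))   # positive: the policy likes this completion more than the reference does
```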

Diagnosing RLHF instability: The paper also uses DPO's framework to explain why PPO-based RLHF can be unstable. The normalization term β log Z(x) in the reward acts like a soft value function for the reference policy. Without proper baselines, the policy gradient has high variance. PPO tries to handle this with a learned value function, but that adds another source of error. DPO sidesteps the issue entirely — the reparameterization yields a reward that doesn't need any baselines.
Does the DPO reparameterization r(x,y) = β log(π/πref) limit what reward functions can be learned?

Chapter 7: Results

The paper evaluates DPO on three tasks: controlled sentiment generation (IMDb), summarization (TL;DR), and single-turn dialogue (Anthropic HH).

Sentiment control (IMDb, GPT-2 large)

DPO achieves the best reward-KL frontier of all methods. For any given KL budget, DPO achieves higher reward than PPO, even when PPO uses the ground-truth reward function (PPO-GT). This is remarkable: DPO optimizes the same objective more efficiently than PPO, despite being much simpler.

Summarization (TL;DR, GPT-J 6B)

With GPT-4 as the evaluator and human-written reference summaries as the baseline, DPO not only wins but is much more robust to sampling temperature. PPO's performance degrades sharply at high temperatures; DPO remains stable.

Dialogue (Anthropic HH, Pythia 2.8B)

DPO is the only computationally efficient method that improves over the preferred completions in the dataset. It matches or exceeds Best-of-128 (a computationally impractical baseline) while being orders of magnitude cheaper.

Figure: DPO vs PPO: Summarization Win Rates. Win rates against human-written summaries on TL;DR, evaluated by GPT-4.

Human validation: The paper also conducted human evaluations. On TL;DR, DPO samples (temperature 0.25) were preferred 58% of the time over PPO samples (temperature 0.0) by human raters. GPT-4's judgments correlated strongly with human preferences, with agreement rates comparable to inter-annotator agreement.
How does DPO compare to PPO on the sentiment control task?

Chapter 8: Practical Advantages

DPO's simplicity isn't just aesthetic — it translates to real practical benefits for training aligned language models.

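What you DON'T need

  1. A separate reward model
  2. An RL loop (PPO) with its value function, advantage estimation, and clipping
  3. Sampling from the policy during training

What you DO need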

  1. A reference model πref: The SFT model, frozen. You need to compute log-probabilities under it for each training example. This is a forward pass only — no gradients, so it can be run in eval mode.
  2. A preference dataset: Triples (x, yw, yl). These can be pre-collected offline.
  3. One hyperparameter β: Controls how conservative the alignment is. The paper uses β ∈ {0.1, 0.5} and finds results are robust.

Implementation

DPO can be implemented in roughly 20 lines of core training code:

  1. Forward: Compute log πθ(yw|x) and log πθ(yl|x) under the current policy
  2. Reference: Look up log πref(yw|x) and log πref(yl|x) (precomputed or cached)
  3. Loss: loss = −log σ(β · ((log πθ(yw|x) − log πref(yw|x)) − (log πθ(yl|x) − log πref(yl|x))))
  4. Update: Standard backprop + Adam. That's it.
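The four steps can be sketched end to end on a toy problem. This is an illustration under strong simplifying assumptions, not the paper's implementation: the "policy" is a softmax over three fixed completions for a single prompt, the gradient is written by hand, and plain gradient descent stands in for Adam. All names and numbers are hypothetical.

```python
import math

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

beta, lr = 0.1, 0.5
ref_logits = [0.0, 0.0, 0.0]          # frozen reference (stand-in for the SFT model)
theta = list(ref_logits)              # trainable policy starts at the reference
ref_logp = log_softmax(ref_logits)    # Reference: computed once and cached
pairs = [(1, 0), (1, 2), (0, 2)]      # (preferred index, dispreferred index)

def mean_loss(logits):
    logp = log_softmax(logits)
    total = 0.0
    for w, l in pairs:
        u = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
        total += -math.log(sigmoid(u))
    return total / len(pairs)

initial_loss = mean_loss(theta)
for step in range(200):
    logp = log_softmax(theta)                       # Forward
    grad = [0.0] * len(theta)
    for w, l in pairs:
        u = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
        g = -beta * sigmoid(-u) / len(pairs)        # Loss: dLoss/du times beta, averaged
        # For a softmax policy, the gradient of (log pi_w - log pi_l)
        # with respect to the logits is e_w - e_l.
        grad[w] += g
        grad[l] -= g
    theta = [t - lr * gv for t, gv in zip(theta, grad)]   # Update
final_loss = mean_loss(theta)
print(initial_loss, final_loss, theta)
```

The loss starts at log 2 (policy equals reference) and falls as the logit of the most-preferred completion rises and that of the least-preferred one drops. With a real language model the structure is the same, with automatic differentiation replacing the hand-written gradient.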
Compute savings: In RLHF, each training step requires (1) generating completions from the policy (autoregressive, slow), (2) scoring them with the reward model (another forward pass), (3) computing advantages, (4) running PPO with multiple epochs. DPO needs only (1) a forward pass through the policy on the fixed dataset, (2) reference log-probabilities, which can be precomputed once, and (3) a backward pass. For a 6B parameter model, this can reduce training time by 5-10x.
What is the most significant computational saving of DPO over PPO-based RLHF?

Chapter 9: Connections

What DPO built on

RLHF (Christiano et al., 2017): The foundational framework for learning from human preferences. DPO optimizes the same objective but bypasses the RL loop entirely.

PPO (Schulman et al., 2017): The RL algorithm used in standard RLHF. DPO makes PPO unnecessary for preference learning by showing the optimal policy has a closed form.

Bradley-Terry model (1952): The preference model that DPO inherits. The key property — depending only on reward differences — is what allows Z(x) to cancel.

Control as inference (Levine, 2018): The framework connecting optimal control to probabilistic inference. The closed-form solution π* ∝ πref exp(r/β) comes directly from this literature.

What DPO enabled

SimPO (Meng et al., 2024): Simplifies DPO further by removing the reference model — uses average log-probability as the implicit reward, adding a target reward margin.

KTO (Ethayarajh et al., 2024): Kahneman-Tversky Optimization — extends DPO to work with unpaired preference data (just "good" or "bad" labels, no pairwise comparisons needed).

IPO (Azar et al., 2024): Identity Preference Optimization — addresses DPO's potential overfitting to Bradley-Terry assumptions by using a different, more robust loss.

ORPO (Hong et al., 2024): Odds Ratio Preference Optimization — combines SFT and preference alignment into a single stage, no reference model.
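Of the variants above, SimPO is the easiest to contrast with DPO in code: the reference log-ratio is replaced by a length-averaged log-probability, and a margin γ is subtracted before the sigmoid. The sketch below follows that description; the function name, default β and γ, and the example log-probabilities are placeholders chosen for illustration, not recommended settings.

```python
import math

def simpo_loss(token_logps_w, token_logps_l, beta=2.0, gamma=0.5):
    """-log sigmoid(beta * (avg_logp_w - avg_logp_l) - gamma) for one preference pair."""
    avg_w = sum(token_logps_w) / len(token_logps_w)   # length-normalized implicit reward
    avg_l = sum(token_logps_l) / len(token_logps_l)
    logit = beta * (avg_w - avg_l) - gamma
    return math.log1p(math.exp(-logit))               # equals -log(sigmoid(logit))

# Hypothetical per-token log-probs under the policy; no reference model is involved.
preferred = [-0.5, -0.5]
dispreferred = [-2.0, -2.0, -2.0]
print(simpo_loss(preferred, dispreferred))
print(simpo_loss(preferred, dispreferred, gamma=0.0))  # without the target margin
```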

Industry adoption

DPO has been adopted broadly. Llama 3 (Meta), Gemma (Google), Mistral, Zephyr, and many other open models use DPO or DPO variants for alignment. It has largely replaced PPO-based RLHF in the open-source community due to its simplicity, while some frontier labs continue to use PPO or hybrid approaches (online DPO, iterative DPO) for maximum performance.

DPO's legacy: DPO showed that RLHF's complexity was mostly unnecessary. The field's reaction was dramatic — within a year, DPO became the default alignment method for open-source models. It spawned a family of "direct alignment" methods (SimPO, KTO, IPO, ORPO) that continue to simplify and improve preference learning. The core insight — that you can reparameterize reward as a function of policy ratios — is one of the most impactful observations in modern ML.

Cheat sheet

Core equation: L = −E[log σ(β log(πθ(yw|x)/πref(yw|x)) − β log(πθ(yl|x)/πref(yl|x)))]
Key insight: r(x,y) = β log(π*(y|x)/πref(y|x)) + β log Z(x), and Z(x) cancels in Bradley-Terry
Mechanism: Implicit reward via policy/reference log-ratio; sigmoid weighting prevents degeneration
Advantage: No reward model, no RL loop, no sampling, no value function; just supervised learning
Impact: Default alignment for open-source LLMs; enabled SimPO, KTO, IPO, ORPO
What is DPO's core mathematical insight that eliminates the need for a reward model?