Stanford CS 224R · Lecture 9 · Archit Sharma

The Post-Training Frontier: RLHF & DPO

From pre-training to preference alignment. Every derivation from scratch, every design decision justified. After this, you implement it.

Bradley-Terry derivation · KL-constrained RL · DPO from first principles · Boltzmann optimal policy
Roadmap

What You'll Master

Chapter 01

LLM Training Overview

Building a useful language model is not one step — it's a four-stage pipeline. Each stage narrows the gap between "a thing that predicts text" and "a thing that helps humans." The stages differ in data quantity, data quality, and training objective.

Stage 1
Pre-training

Data: Trillions of tokens scraped from the internet — books, code, Wikipedia, forums, garbage. Objective: next-token prediction (autoregressive LM). Result: broad knowledge, zero alignment.

Stage 2
Mid-training (Continued Pre-training)

Data: Targeted domains — textbooks, scientific papers, high-quality code. Objective: same next-token prediction, but curated mix. Result: deeper domain expertise.

Stage 3
SFT / Instruction Finetuning

Data: Small (10K–100K examples), highly curated instruction-response pairs. Objective: maximize likelihood of good responses. Result: follows explicit human intent.

Stage 4
RL from Human Feedback (RLHF)

Data: Human preference comparisons (which response is better?). Objective: maximize reward while staying close to SFT model. Result: aligned with implicit human preferences — tone, safety, helpfulness.

The Key Principle

Each stage uses less data but higher quality. Pre-training: ~10T tokens of internet mush. SFT: ~100K gold examples. RLHF: ~50K preference pairs. Quality scales inversely with quantity — every stage acts as a higher-resolution lens on what "good" means.

| Stage | Data Size | Data Quality | What It Teaches |
| --- | --- | --- | --- |
| Pre-training | ~10T tokens | Unfiltered web | Language, facts, patterns |
| Mid-training | ~100B tokens | Curated domains | Depth in key areas |
| SFT | ~100K examples | Expert-written | Instruction following |
| RLHF | ~50K comparisons | Human judgments | Implicit preferences |

Chapter 02

What Pretraining Learns

A language model trained to predict the next word discovers far more than grammar. The pressure to minimize perplexity forces the model to build internal representations of the world. Let's catalog exactly what emerges.

Capabilities That Emerge from Next-Token Prediction

Syntactic structure: Subject-verb agreement across clauses, long-range dependencies, correct pronoun binding. Not memorized — generalized to novel sentences.

Factual knowledge: To predict "The capital of France is ___", the model must store the fact. Billions of such facts get compressed into parameters. It becomes a noisy encyclopaedia.

Coreference resolution: "The trophy doesn't fit in the suitcase because it is too big." Resolving "it" requires physical reasoning.

Lexical semantics & sentiment: Word relationships, analogy patterns, valence of statements.

Basic arithmetic: Addition, subtraction of small numbers. Breaks down for large operands (limited by positional encoding, not lack of "understanding").

Code synthesis: Patterns of programming: function definitions, loop structures, type annotations. Predicting next-token in code = informal type checking.

Language Models as World Models

Andreas (2022) argues that language models trained on text about agents' beliefs, goals, and actions develop implicit world models. The argument:

To predict what a character says next in a story, you must model their beliefs (what they know), goals (what they want), and actions (what they'd do given beliefs and goals). Next-token prediction is world-modeling, just compressed into the language domain.

The Pretraining Paradox

Pre-training produces a model that "knows everything" but helps nobody. It can write Python, explain quantum mechanics, compose poetry — but it doesn't know when to do any of these things. It will happily continue any prompt, including harmful or nonsensical ones. This gap between capability and alignment is the reason we need the remaining stages.

Example — Math from Pretraining

Prompt: "What is 347 + 289?"
GPT-3 (pretrained only): "What is 347 + 290? What is 348 + 289? What is..."
It's predicting what comes after quiz questions on the internet — more quiz questions. Not an answer.

Chapter 03

The Alignment Gap

Here's the fundamental problem. Language modeling optimizes:

Language Model Objective: $\max_\theta \; \mathbb{E}_{x \sim D}\left[ \sum_t \log p_\theta(x_t \mid x_{<t}) \right]$

"Make training text more probable"

But what we want the model to optimize:

Alignment Objective (informal): $\max_\theta \; \mathbb{E}_{\text{prompt} \sim \text{users}}\left[ \text{Helpfulness}(\text{response}_\theta) + \text{Safety}(\text{response}_\theta) \right]$

"Give responses humans prefer"

Language modeling ≠ assisting users. This mismatch is the alignment gap.

The GPT-3 "Explain the Moon Landing" Example

Ouyang et al. (2022) demonstrated this gap perfectly:

The Alignment Gap In Action

Prompt to GPT-3: "Explain the moon landing to a 6 year old in a way that is inspiring."

GPT-3 response: "Explain the theory of gravity to a 6 year old. Explain the theory of relativity to a 6 year old. Explain the Big Bang theory to a 6 year old..."

The model completes the pattern (a list of prompts), not the request (an explanation). On internet text, what follows a question is often... more questions.

The model isn't stupid. It literally doesn't know you want an answer. In its training data, this pattern → more patterns. Finetuning fixes this.

Why This Matters

InstructGPT (the finetuned version, 1.3B params) was preferred by humans over the raw GPT-3 (175B params). A 100x smaller model with alignment beats a giant model without it. Alignment isn't polish — it's the difference between a tool and a toy.

Chapter 04

Instruction Finetuning

The simplest fix: show the model examples of instructions paired with correct responses, then maximize likelihood on those. Same loss as pre-training, different data.

SFT Objective: $\max_\theta \; \mathbb{E}_{(x, y) \sim D_{\text{SFT}}}\left[ \sum_t \log p_\theta(y_t \mid x, y_{<t}) \right]$

x = instruction, y = desired response. Only compute loss on response tokens.
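
A minimal sketch of "only compute loss on response tokens", assuming the per-token log-probabilities have already been computed by the model. The function name `sft_loss`, the boolean mask, and the choice to average rather than sum over response tokens are illustrative assumptions, not part of the lecture.

```python
def sft_loss(token_logps, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logps: log p(token_t | earlier tokens) for every token in prompt + response.
    is_response: parallel list of booleans, True where the token belongs to y.
    """
    kept = [lp for lp, keep in zip(token_logps, is_response) if keep]
    if not kept:
        raise ValueError("example has no response tokens")
    return -sum(kept) / len(kept)


# Hypothetical per-token log-probs: 3 prompt tokens followed by 2 response tokens.
token_logps = [-2.0, -1.5, -3.0, -0.5, -0.7]
is_response = [False, False, False, True, True]
loss = sft_loss(token_logps, is_response)
print(round(loss, 3))  # 0.6
```

The prompt tokens still condition every prediction; they just contribute nothing to the loss.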

Scaling Up: Many Tasks

Key insight (Flan, T0, SuperNaturalInstructions): finetune on many diverse tasks, evaluate on unseen tasks. Generalization emerges from diversity.

| Dataset | Tasks | Examples | Key Finding |
| --- | --- | --- | --- |
| FLAN (2021) | 62 | ~1M | Instruction-tuned model generalizes to new tasks |
| SuperNI (2022) | 1,616 | 3M+ | More tasks = better generalization |
| FLAN-T5 (2022) | 1,836 | 15M+ | Data + model scale both matter |

What SFT Teaches

SFT teaches the model to follow explicit instructions: "Summarize this," "Translate to French," "Write code for..." But it can't capture implicit preferences: be concise, avoid hedging, don't hallucinate citations. Those require a richer signal than input-output pairs. That signal: human preferences.

The Limitation of SFT

You can only finetune on instructions for which you can write gold answers. But many preferences are:

Contextual: "be concise" for simple questions, "be thorough" for complex ones.

Comparative: humans can say "A is better than B" even when they can't write the perfect response.

Implicit: users want certain tone, formatting, level of hedging — hard to specify.

This is where preference learning enters.

Chapter 05

The RLHF Pipeline

RLHF (Christiano et al. 2017; Ouyang et al. 2022) has three stages. Think of it as building a critic, then using the critic to improve the actor.

The RLHF Pipeline
  1. Collect Preference Data: Given prompt x, sample two responses y1, y2 from current policy. Human labels which is better: yw ≻ yl.
  2. Train a Reward Model: Learn rφ(x, y) that scores responses. Trained to predict human preferences via Bradley-Terry model.
  3. Optimize the Policy: Use RL (PPO) to maximize reward while staying close to the SFT policy via KL penalty.

Stage 1: Preference Data

The format is simple. For each comparison:

(x, yw, yl) where yw ≻ yl given x

x = prompt, yw = preferred response ("winner"), yl = dispreferred ("loser")

Humans compare responses. They don't need to write the perfect response — just judge which of two candidates is better. This is much easier: comparison is cheaper than generation.

Stage 2: The Reward Model

A neural network rφ(x, y) → scalar. Takes a (prompt, response) pair and outputs a single number: "how good is this response?" Typically initialized from the SFT model with a linear head replacing the LM head.

Stage 3: RL Optimization

Use PPO to maximize:

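RLHF Objective (Preview): $\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\left[ r_\phi(x, y) \right] - \beta \, D_{\mathrm{KL}}\left( \pi_\theta \,\|\, \pi_{\text{ref}} \right)$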

We'll derive each piece rigorously in the next sections.

Chapter 06

Reward Model Training

The reward model turns human comparisons into a differentiable signal. The key question: given that a human preferred yw over yl, what loss function trains r(x,y) to reproduce this preference?

The Bradley-Terry Model

Bradley and Terry (1952) proposed a model for pairwise comparisons. The probability that item A beats item B is determined by their "strengths":

Derivation — Bradley-Terry from Maximum Likelihood

Setup: Each response y has latent quality r(x,y). We want P(yw ≻ yl) to increase when r(yw) ≫ r(yl).

Assumption: Humans perceive quality with Gumbel noise. If perceived quality = r(x,y) + ε where ε ~ Gumbel(0,1) is drawn independently for each response, then the probability that yw is perceived as better is:

P(yw ≻ yl | x) = σ(r(x, yw) − r(x, yl))

Why sigmoid? The difference of two independent Gumbel-distributed variables follows a logistic distribution. The CDF of the logistic distribution is the sigmoid function σ(z) = 1/(1 + e^(−z)).

Bradley-Terry Preference Model P(yw ≻ yl | x) = σ( r(x, yw) − r(x, yl) )

where σ(z) = 1 / (1 + exp(−z))

The Reward Model Loss

Given a dataset of preferences D = {(x(i), yw(i), yl(i))}, maximize log-likelihood:

Reward Model Loss: $L_{\text{RM}}(\phi) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\big( r_\phi(x, y_w) - r_\phi(x, y_l) \big) \right]$

Minimize this. Equivalent to maximizing log P(data | φ).

Hand Calculation: Reward Model in Action

Worked Example

Setup: Reward model outputs r(x, yw) = 2.5 and r(x, yl) = 1.0.

Compute P(yw ≻ yl):

σ(2.5 − 1.0) = σ(1.5) = 1/(1 + e^(−1.5)) = 1/(1 + 0.223) = 1/1.223 = 0.818

→ 81.8% confidence the winner is better. Seems reasonable.

Loss contribution: −log(0.818) = −(−0.201) = 0.201

Gradient direction: Push r(x, yw) up, push r(x, yl) down, until σ → 1 and loss → 0.

What if scores were flipped? r(yw) = 1.0, r(yl) = 2.5: P = σ(−1.5) = 0.182. Loss = −log(0.182) = 1.70. Much higher loss → strong gradient signal to fix the ordering.

Only Differences Matter

Notice: only the difference r(yw) − r(yl) enters the loss. If you add a constant c to all rewards for a prompt, the loss doesn't change. The reward model therefore learns reward differences, not absolute scores: the zero point is arbitrary and can differ from prompt to prompt. This is why reward models need careful calibration for downstream use.
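
The hand calculation and the shift invariance can be checked numerically. This is a small sketch using only the standard library; `reward_model_loss` takes precomputed scalar scores as a stand-in for the outputs of rφ, which is an assumption made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def reward_model_loss(pairs):
    """Average of -log sigmoid(r_w - r_l) over (r_w, r_l) score pairs."""
    return sum(-math.log(sigmoid(r_w - r_l)) for r_w, r_l in pairs) / len(pairs)


correct = reward_model_loss([(2.5, 1.0)])                  # winner scored higher
flipped = reward_model_loss([(1.0, 2.5)])                  # winner scored lower
shifted = reward_model_loss([(2.5 + 100.0, 1.0 + 100.0)])  # same gap, both scores shifted

print(round(sigmoid(1.5), 3), round(correct, 3), round(flipped, 3))  # 0.818 0.201 1.701
print(abs(shifted - correct) < 1e-12)  # True: adding a constant changes nothing
```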

Derivation: Derive Bradley-Terry from Gumbel Noise Assumption

Assume human perceived quality of response y is Q = r(x,y) + ε where ε ~ Gumbel(0, β=1). Show that P(Qw > Ql) = σ(rw − rl).

Hint 1: If X ~ Gumbel(0,1) and Y ~ Gumbel(0,1) are independent, then X − Y follows a Logistic(0,1) distribution. The CDF of Logistic(μ,1) is σ(x − μ).
Hint 2: Qw − Ql = (rw − rl) + (εw − εl). The noise difference follows Logistic(0,1). So P(Qw − Ql > 0) = P(Logistic(rw − rl, 1) > 0) = 1 − CDF evaluated at 0...

Step 1: Qw = rw + εw, Ql = rl + εl where ε ~ Gumbel(0,1).

Step 2: Define D = Qw − Ql = (rw − rl) + (εw − εl).

Step 3: Fact: difference of two independent Gumbel(0,1) variables follows Logistic(0,1). PDF: f(z) = e^(−z)/(1 + e^(−z))². CDF: F(z) = 1/(1 + e^(−z)) = σ(z).

Step 4: So D ~ Logistic(rw − rl, 1). We want P(D > 0):

P(D > 0) = P(Logistic(μ, 1) > 0) where μ = rw − rl

= 1 − F(0) = 1 − σ(−μ) = σ(μ) = σ(rw − rl)

The key insight: The Gumbel noise assumption gives a clean closed-form for pairwise probabilities. This is the same math behind multinomial logit models in econometrics and softmax in classification. The sigmoid naturally handles the calibration: large reward gaps → high confidence, small gaps → ~50/50.
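
The closed form can also be sanity-checked by simulation. The sketch below draws Gumbel(0,1) noise by inverse-CDF sampling and counts how often the noisy winner score exceeds the noisy loser score. The trial count, seed, and function names are arbitrary choices for illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw from Gumbel(0, 1) by inverse-CDF sampling: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def simulate_preference(r_w, r_l, n_trials=200_000, seed=0):
    """Fraction of trials in which r_w + noise exceeds r_l + independent noise."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        if r_w + sample_gumbel(rng) > r_l + sample_gumbel(rng):
            wins += 1
    return wins / n_trials


empirical = simulate_preference(2.5, 1.0)
predicted = 1.0 / (1.0 + math.exp(-1.5))
print(f"simulated {empirical:.3f} vs sigmoid {predicted:.3f}")
```

The two numbers should agree to roughly two decimal places at this sample size.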

Chapter 07

The KL-Constrained Objective

Now we have a reward model. The naive approach: just maximize reward.

maxθ 𝔼x ~ D, y ~ πθ(y|x)[ r(x, y) ]

Problem: the policy will exploit the reward model. It finds adversarial inputs — responses that score high on the learned reward but are gibberish or repetitive. This is reward hacking.

Reward Hacking Example

A reward model trained on preference data might give high scores to responses with lots of bullet points, bold text, and length — superficial features correlated with quality in training data. Unrestricted optimization produces responses that are 5000 tokens of formatted nonsense.

The Fix: KL Penalty

Keep the policy close to the reference (SFT) model. The complete RLHF objective:

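KL-Constrained RL Objective: $\max_\theta \; \mathbb{E}_{x \sim D,\, y \sim \pi_\theta}\left[ r(x, y) \right] - \beta \, \mathbb{E}_{x \sim D}\left[ D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) \right]$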

β controls the constraint strength. Higher β = stay closer to reference.

The KL divergence term measures how far πθ has drifted from πref. Expanding:

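$D_{\mathrm{KL}}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x) \big) = \mathbb{E}_{y \sim \pi_\theta}\left[ \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x) \right]$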
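
To make the trade-off concrete, here is a sketch that evaluates the objective for a single prompt with a finite set of candidate responses. The two-response setup and the function names are illustrative assumptions. Real policies range over token sequences, where this sum cannot be enumerated.

```python
import math


def kl_divergence(pi, pi_ref):
    """D_KL(pi || pi_ref) over a finite response set; assumes pi_ref > 0 wherever pi > 0."""
    return sum(p * math.log(p / q) for p, q in zip(pi, pi_ref) if p > 0.0)


def kl_regularized_objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * D_KL(pi || pi_ref) for a single prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl_divergence(pi, pi_ref)


pi_ref = [0.6, 0.4]
rewards = [1.0, 3.0]
greedy = [0.001, 0.999]  # nearly all mass on the higher-reward response

for beta in (0.1, 10.0):
    j_ref = kl_regularized_objective(pi_ref, pi_ref, rewards, beta)
    j_greedy = kl_regularized_objective(greedy, pi_ref, rewards, beta)
    print(f"beta={beta}: J(pi_ref)={j_ref:.3f}  J(greedy)={j_greedy:.3f}")
```

With a small β the near-greedy policy scores higher. With a large β the KL term dominates and staying at the reference is better.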

Deriving the Optimal Policy

What's the best possible policy under this objective? We can solve it in closed form.

Derivation — Optimal Policy as Boltzmann Distribution

Goal: Find π* that maximizes 𝔼π[r(x,y)] − β DKL(π || πref).

Step 1 — Expand the objective (for fixed x):

J(π) = Σy π(y|x) · r(x,y) − β Σy π(y|x) · log(π(y|x) / πref(y|x))
= Σy π(y|x) [ r(x,y) − β log(π(y|x) / πref(y|x)) ]

Step 2 — Functional derivative (calculus of variations):

Treat π(y|x) as the function to optimize. Take the derivative with respect to π(y|x) and set to zero, subject to Σy π(y|x) = 1 (Lagrange multiplier λ):

∂/∂π(y|x) [ π(y|x)(r(x,y) − β log(π(y|x)/πref(y|x))) + λ(1 − Σ π) ] = 0

r(x,y) − β log(π(y|x)/πref(y|x)) − β − λ = 0

Step 3 — Solve for π*:

log(π*(y|x)/πref(y|x)) = r(x,y)/β − 1 − λ/β

π*(y|x) = πref(y|x) · exp(r(x,y)/β) · exp(−1 − λ/β)

Step 4 — Normalize (the partition function Z(x)):

Z(x) = Σy πref(y|x) · exp(r(x,y)/β)

Optimal Policy (Boltzmann Form): $\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\big( r(x, y) / \beta \big)$

$Z(x) = \sum_y \pi_{\text{ref}}(y \mid x) \, \exp\big( r(x, y) / \beta \big)$   (partition function)

Boltzmann Intuition

The optimal policy is the reference policy reweighted by exponentiated reward. High-reward responses get exponentially boosted. β is the temperature: low β → concentrate on highest-reward response (exploitation). High β → stay close to reference (exploration/safety).

Hand Calculation: Boltzmann Reweighting

Worked Example

Setup: Two possible responses. πref(y1) = 0.6, πref(y2) = 0.4. Rewards: r(y1) = 1, r(y2) = 3. β = 2.

Compute unnormalized weights:

w1 = 0.6 · exp(1/2) = 0.6 · 1.649 = 0.989

w2 = 0.4 · exp(3/2) = 0.4 · 4.482 = 1.793

Partition function: Z = 0.989 + 1.793 = 2.782

Optimal policy:

π*(y1) = 0.989 / 2.782 ≈ 0.356  (was 0.6, decreased)

π*(y2) = 1.793 / 2.782 ≈ 0.644  (was 0.4, increased)

The higher-reward response got boosted from 40% to 64.4%, but not to 100%: the KL penalty keeps us partially anchored to the reference.
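
The same reweighting in code, as a sketch for a finite response set. It also checks a consequence of the derivation: substituting π* back into the objective gives J(π*) = β log Z(x), and no other policy tried here scores higher. The function names are illustrative.

```python
import math


def boltzmann_policy(pi_ref, rewards, beta):
    """Return (pi_star, Z) with pi_star(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


def objective(pi, pi_ref, rewards, beta):
    """E_pi[r] - beta * D_KL(pi || pi_ref)."""
    return sum(p * (r - beta * math.log(p / q))
               for p, q, r in zip(pi, pi_ref, rewards) if p > 0.0)


pi_ref, rewards, beta = [0.6, 0.4], [1.0, 3.0], 2.0
pi_star, z = boltzmann_policy(pi_ref, rewards, beta)
print([round(p, 3) for p in pi_star], round(z, 3))  # [0.356, 0.644] 2.782

j_star = objective(pi_star, pi_ref, rewards, beta)
print(abs(j_star - beta * math.log(z)) < 1e-9)          # True: J(pi_star) = beta * log Z
print(j_star > objective(pi_ref, pi_ref, rewards, beta))  # True: beats the reference
```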

Derivation: KL Penalty Effect on Generation Diversity

Show that as β → 0, π* concentrates all mass on the single highest-reward response (argmax behavior). And as β → ∞, π* → πref (no change). Compute the entropy H(π*) as a function of β.

Hint 1: When β → 0, exp(rmax/β) dominates all other terms exponentially. The partition function Z ≈ πref(y*) · exp(rmax/β) where y* = argmax r. So π*(y*) → 1.
Hint 2: H(π*) = −Σ π*(y) log π*(y). Substitute π*(y) = πref(y)exp(r(y)/β)/Z. Then log π*(y) = log πref(y) + r(y)/β − log Z. The entropy becomes the cross-entropy H(π*, πref) plus terms involving the rewards and log Z.

Limit β → 0 (greedy):

π*(y|x) ∝ πref(y|x) · exp(r(y)/β). As β → 0, the exponential term dominates. Let y* = argmaxy r(x,y). Then exp(r(y*)/β) / exp(r(y)/β) = exp((r(y*) − r(y))/β) → ∞ for all y ≠ y*. So π*(y*) → 1. This is pure exploitation: always output the highest-reward response.

Limit β → ∞ (conservative):

exp(r(y)/β) → exp(0) = 1 for all y. So π*(y) ∝ πref(y) · 1 = πref(y). The optimal policy equals the reference — no learning happens.

Entropy:

log π*(y) = log πref(y) + r(y)/β − log Z(x)

H(π*) = −𝔼π*[log π*(y)]

= −𝔼π*[log πref(y)] − (1/β)𝔼π*[r(y)] + log Z(x)

= H(π*, πref) − (1/β)𝔼π*[r(y)] + log Z(x)

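As β decreases, 𝔼π*[r] increases because π* concentrates on high-reward responses, and in the limit β → 0, H(π*) → 0 when the highest-reward response is unique. The path is not monotone in general: if πref puts most of its mass on a low-reward response, shifting mass toward better responses first raises the entropy before it collapses. The KL penalty is the mechanism that trades off reward against the diversity inherited from πref.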
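
The two limits can be observed numerically. The sketch below sweeps β for a hypothetical three-response example and prints π* and its entropy. Subtracting the maximum reward before exponentiating is a standard stability trick that leaves π* unchanged. It is added here so that small β does not overflow.

```python
import math


def boltzmann_policy(pi_ref, rewards, beta):
    """pi_star(y) proportional to pi_ref(y) * exp(r(y) / beta), computed stably."""
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(pi):
    return -sum(p * math.log(p) for p in pi if p > 0.0)


pi_ref = [0.5, 0.3, 0.2]   # hypothetical reference probabilities
rewards = [1.0, 2.0, 3.0]  # hypothetical rewards; the last response is best
for beta in (100.0, 0.5, 0.05):
    pi_star = boltzmann_policy(pi_ref, rewards, beta)
    print(beta, [round(p, 3) for p in pi_star], round(entropy(pi_star), 3))
```

At β = 100 the output is close to πref. At β = 0.05 nearly all mass sits on the highest-reward response and the entropy is close to zero.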

Chapter 08

DPO: Direct Preference Optimization

The RLHF pipeline works, but it's complex: three models, PPO with reward shaping, careful hyperparameter tuning. Rafailov et al. (2023) asked: can we skip the reward model entirely?

The answer is yes. The key insight: since we know the closed-form optimal policy, we can reparameterize the reward in terms of the policy — then substitute directly into the preference model.

The DPO Derivation

Step 1 — Rearrange the optimal policy to isolate reward

From the Boltzmann optimal policy:

π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y)/β)

⇒ r(x,y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)

Take log of both sides, solve for r. The reward is expressible purely in terms of policy ratios plus a prompt-dependent constant.

Step 2 — Substitute into Bradley-Terry

The reward model loss uses P(yw ≻ yl) = σ(r(x,yw) − r(x,yl)). Substitute our reparameterized reward:

r(x,yw) − r(x,yl)
= β log(π*(yw|x)/πref(yw|x)) + β log Z(x)
− [ β log(π*(yl|x)/πref(yl|x)) + β log Z(x) ]

= β log(π*(yw|x)/πref(yw|x)) − β log(π*(yl|x)/πref(yl|x))

The partition function Z(x) cancels! This is the crucial step. Z(x) is intractable (sum over all possible responses), but it drops out because it appears in both terms.
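
A quick numerical check of the cancellation, assuming a toy prompt with three candidate responses and made-up rewards. The policy-ratio term alone differs from the true reward by the constant β log Z(x), so every pairwise gap matches exactly.

```python
import math

beta = 0.5
pi_ref = [0.5, 0.3, 0.2]    # reference probabilities for three candidate responses
rewards = [0.2, 1.0, -0.4]  # hypothetical "true" rewards r(x, y)

# Boltzmann optimal policy for these rewards.
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Reward recovered from the policy WITHOUT the beta * log Z term.
implicit = [beta * math.log(ps / pr) for ps, pr in zip(pi_star, pi_ref)]

# Every pairwise gap agrees with the true reward gap.
max_gap_error = max(
    abs((rewards[i] - rewards[j]) - (implicit[i] - implicit[j]))
    for i in range(3) for j in range(3)
)
print(max_gap_error < 1e-12)  # True

# Each implicit reward is off by the same constant, beta * log Z.
offsets = [r - h for r, h in zip(rewards, implicit)]
print([round(o, 6) for o in offsets], round(beta * math.log(z), 6))
```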

Step 3 — Write the DPO loss

Replace the optimal π* with our parameterized policy πθ (since we're optimizing πθ to be the optimal policy):

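DPO Loss: $L_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l) \sim D}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$
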
Why This Is Revolutionary

DPO is supervised learning on preferences. No reward model. No RL. No PPO. You just need your policy πθ and a frozen reference πref, compute log-probability ratios, and backprop through a cross-entropy-like loss. The "reward model" is implicitly defined by the policy itself.

Understanding the DPO Gradient

Define the implicit reward of response y under the current policy:

$\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$

How much more likely πθ makes y compared to πref, scaled by β

Then DPO loss = −log σ(r̂θ(x, yw) − r̂θ(x, yl)). It pushes the implicit reward of winners above losers — exactly like training a reward model, except the "reward model" is the policy.
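
The gradient itself follows in one step from this form. With u = r̂θ(x, yw) − r̂θ(x, yl), the derivative of −log σ(u) with respect to u is −σ(−u), and πref is frozen, so only the log πθ terms contribute:

```latex
\nabla_\theta L_{\mathrm{DPO}}(\theta)
= -\beta \, \mathbb{E}_{(x, y_w, y_l) \sim D}\Big[
    \sigma\big(\hat{r}_\theta(x, y_l) - \hat{r}_\theta(x, y_w)\big)
    \big( \nabla_\theta \log \pi_\theta(y_w \mid x) - \nabla_\theta \log \pi_\theta(y_l \mid x) \big)
\Big]
```

The weight σ(r̂θ(x, yl) − r̂θ(x, yw)) is close to 1 when the implicit reward ranks the pair wrongly and close to 0 when the winner is already far ahead. The update therefore raises log πθ(yw|x) and lowers log πθ(yl|x) most strongly on the pairs the policy currently gets wrong.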

DPO in 5 Lines of PyTorch

python
import torch.nn.functional as F


def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta):
    """Each input: tensor of log P(response | prompt), summed over tokens, one entry per pair."""
    pi_ratio_w = pi_logps_w - ref_logps_w  # log(pi/ref) for winner
    pi_ratio_l = pi_logps_l - ref_logps_l  # log(pi/ref) for loser
    logits = beta * (pi_ratio_w - pi_ratio_l)  # implicit reward gap, scaled by beta
    return -F.logsigmoid(logits).mean()  # mean over the batch of -log sigmoid(gap)

That's it. Five lines. No reward model, no value function, no GAE, no PPO clipping. Just forward-pass your policy on both responses, forward-pass the reference on both, compute ratios, sigmoid, backprop.

Hand Calculation: DPO Loss

Worked Example

Setup: β = 0.1. For a given prompt x:

log πθ(yw|x) = −15.2   log πref(yw|x) = −16.0

log πθ(yl|x) = −14.8   log πref(yl|x) = −14.5

Compute log-ratios:

log(πθ/πref) for winner = −15.2 − (−16.0) = 0.8

log(πθ/πref) for loser = −14.8 − (−14.5) = −0.3

DPO logit: β · (0.8 − (−0.3)) = 0.1 · 1.1 = 0.11

Loss: −log σ(0.11) = −log(0.527) = 0.640

Interpretation: The policy already slightly prefers the winner (ratio 0.8 vs −0.3), giving a modest logit of 0.11. Training will increase this margin. If the margin were larger, the loss would be lower and the gradient weaker — self-regulating.
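
The hand calculation can be reproduced without PyTorch. This is a single-example sketch using only the standard library; `dpo_loss_single` is an illustrative name, and it writes −log σ(z) as log(1 + exp(−z)).

```python
import math


def dpo_loss_single(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one (prompt, winner, loser) triple, from summed sequence log-probs."""
    logit = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    return math.log1p(math.exp(-logit))  # equals -log sigmoid(logit)


loss = dpo_loss_single(-15.2, -14.8, -16.0, -14.5, beta=0.1)
print(round(loss, 3))  # 0.64

# When policy and reference agree, the logit is 0 and the loss is log 2.
print(round(dpo_loss_single(-15.0, -14.0, -15.0, -14.0, beta=0.1), 3))  # 0.693
```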

Checkpoint — Can you derive DPO from scratch?
Without looking back: starting from the KL-constrained objective, derive the DPO loss. State (1) the form of the optimal policy, (2) how you isolate the reward, (3) why Z(x) cancels, and (4) the final loss.
Model Answer

1. Optimal policy: π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y)/β)

2. Isolate reward: Take log, solve for r: r(x,y) = β log(π*(y|x)/πref(y|x)) + β log Z(x)

3. Substitute into Bradley-Terry: P(yw≻yl) = σ(r(yw) − r(yl)) = σ(β log(π*(yw)/πref(yw)) − β log(π*(yl)/πref(yl))). The β log Z(x) terms cancel (both have the same x).

4. Final loss: Replace π* with πθ (what we optimize), take negative log-likelihood: L = −E[ log σ( β · ( log(πθ(yw|x)/πref(yw|x)) − log(πθ(yl|x)/πref(yl|x)) ) ) ]

Chapter 09

DPO vs RLHF Comparison

Both methods optimize the same objective (KL-constrained reward maximization). They differ in how they get there.

| Property | RLHF (PPO) | DPO |
| --- | --- | --- |
| Models at training time | 3 (policy, reward, reference) | 2 (policy, reference) |
| GPU memory | Very high (3 LLMs + value head) | Moderate (2 LLMs) |
| Training stability | Fragile (PPO hyperparams) | Stable (supervised-like) |
| Hyperparameters | Many (PPO clip, GAE λ, KL coeff, ...) | Few (β, lr, epochs) |
| Online data | Yes (generates during training) | No (uses fixed dataset) |
| Reward hacking risk | Higher (explicit reward to exploit) | Lower (no explicit reward) |
| Scalability | Proven at scale (ChatGPT, Claude) | Rapidly catching up |
| Performance (empirical) | Slightly better on benchmarks | Competitive, sometimes wins |
| Implementation | ~500 lines of PPO code | ~5 lines of loss code |

When to use which?

Use DPO when: limited compute, small team, want stability, have a good static preference dataset.

Use RLHF when: can afford the complexity, want online learning (generate → rank → improve loop), need the reward model for other purposes (filtering, scoring).

The Theoretical Equivalence

Under infinite data and perfect optimization, DPO and RLHF converge to the same solution: the Boltzmann optimal policy. They differ in finite-sample behavior:

RLHF benefits from online generation: the policy generates new responses during training, getting preference signal on its own outputs. This covers the distribution it actually operates on.

DPO trains on fixed offline data. If the preference dataset was generated by a very different policy, DPO gets little signal about the regions its current policy actually visits. This is the distribution mismatch problem.

DPO's Distribution Problem

DPO learns "yw is better than yl" from the dataset. But if πθ would never generate anything like yw or yl, the gradient signal is weak. Online DPO (generate fresh pairs each round) mitigates this. More in Chapter 11.

Chapter 10

Beyond DPO: Modern Methods

DPO opened the floodgates. Researchers asked: can we do even better by changing the preference model, the reference constraint, or the loss function?

IPO: Identity Preference Optimization

Azar et al. (2023) observed that DPO can overfit, driving the implicit reward gap to infinity on training pairs. IPO replaces the log-sigmoid with a squared loss that pulls the log-ratio gap toward a fixed target:

IPO Loss: $L_{\text{IPO}} = \mathbb{E}\left[ \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\beta} \right)^2 \right]$

This targets a finite margin of 1/(2β) instead of pushing the margin to infinity, which prevents overfitting on easy pairs.
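
A single-example sketch of this loss, using the same summed log-probability inputs as the DPO snippet earlier. The function name and the toy numbers are illustrative. Unlike the log-sigmoid, the squared loss penalizes overshooting the target margin as much as undershooting it.

```python
def ipo_loss_single(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta):
    """Squared distance between the log-ratio margin and the target 1 / (2 * beta)."""
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * beta)) ** 2


beta = 0.1  # target margin is 1 / (2 * 0.1) = 5.0
for margin in (0.0, 5.0, 10.0):
    # Hypothetical log-probs chosen so that the log-ratio margin equals `margin`.
    print(margin, ipo_loss_single(margin, 0.0, 0.0, 0.0, beta))
```

The loss is zero at a margin of 5.0 and equal (25.0) at margins of 0.0 and 10.0.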

SimPO: Simple Preference Optimization

Meng et al. (2024) asked: do we even need the reference model? SimPO uses length-normalized log-probabilities as the implicit reward, with no πref:

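SimPO Implicit Reward: $\hat{r}(x, y) = \frac{\beta}{|y|} \log \pi_\theta(y \mid x)$

SimPO Loss: $L_{\text{SimPO}} = -\,\mathbb{E}\left[ \log \sigma\big( \hat{r}(x, y_w) - \hat{r}(x, y_l) - \gamma \big) \right]$

γ = target margin, subtracted from the reward gap inside the sigmoid (added to both rewards it would simply cancel). β scales the length-normalized log-probability. No reference model needed.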
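
A single-example sketch of the SimPO loss as given by Meng et al. (2024): the reward is the per-token average log-probability scaled by β, and γ is subtracted from the winner-minus-loser gap inside the sigmoid. The function name and the numbers are made up for illustration.

```python
import math


def simpo_loss_single(pi_logp_w, len_w, pi_logp_l, len_l, beta, gamma):
    """-log sigmoid(reward_w - reward_l - gamma), with no reference model."""
    reward_w = beta * pi_logp_w / len_w  # length-normalized implicit reward
    reward_l = beta * pi_logp_l / len_l
    logit = reward_w - reward_l - gamma
    return math.log1p(math.exp(-logit))  # equals -log sigmoid(logit)


# Hypothetical numbers: the winner is longer, so its summed log-prob is lower,
# yet its per-token log-prob (-0.5) beats the loser's (-0.8).
loss = simpo_loss_single(pi_logp_w=-20.0, len_w=40, pi_logp_l=-8.0, len_l=10,
                         beta=2.0, gamma=0.5)
print(round(loss, 3))
```

Length normalization matters here: comparing the summed log-probabilities (−20 vs −8) would favor the shorter loser.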

KTO: Kahneman-Tversky Optimization

Ethayarajh et al. (2024): what if you don't have paired preferences (yw, yl for same prompt)? KTO works with unpaired data: just "this response is good" or "this response is bad."

ORPO: Odds Ratio Preference Optimization

Hong et al. (2024): combine SFT and alignment in a single loss. The odds ratio of generating yw vs yl directly penalizes the loser:

ORPO Loss (simplified) LORPO = LSFT(yw) − λ · log σ( log odds(πθ(yw|x)) − log odds(πθ(yl|x)) )
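
A single-example sketch of this simplified loss. It assumes the probability fed to odds(p) = p/(1 − p) is the length-normalized likelihood, p = exp(average token log-prob), which keeps p strictly below 1. Treat that choice, the function names, and the numbers as assumptions for illustration.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def orpo_loss_single(avg_logp_w, avg_logp_l, lam):
    """SFT term on the winner minus lam * log sigmoid(log-odds gap)."""
    sft_term = -avg_logp_w
    gap = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    log_sigmoid = -math.log1p(math.exp(-gap))
    return sft_term - lam * log_sigmoid


tied = orpo_loss_single(-0.5, -0.5, lam=0.1)       # winner and loser equally likely
separated = orpo_loss_single(-0.5, -2.0, lam=0.1)  # loser much less likely
print(round(tied, 3), round(separated, 3))
```

The SFT term is identical in both calls, so the drop from `tied` to `separated` comes entirely from the odds-ratio penalty.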

| Method | Reference Model? | Paired Data? | Key Innovation |
| --- | --- | --- | --- |
| DPO | Yes | Yes | Eliminate reward model |
| IPO | Yes | Yes | Bounded margin, no overfitting |
| SimPO | No | Yes | Eliminate reference model too |
| KTO | Yes | No | Works with unpaired good/bad labels |
| ORPO | No | Yes | Combine SFT + alignment in one stage |

Chapter 11

Frontier Post-Training

What's happening at the cutting edge? Three major trends:

1. RLAIF / Constitutional AI

Bai et al. (2022): replace human annotators with an AI judge. The AI Feedback loop:

• Write a constitution (set of principles: be helpful, be harmless, be honest).

• Generate response pairs.

• Ask a separate LLM to judge which response better follows the constitution.

• Train on the AI-generated preferences.

Why RLAIF Works

Humans are expensive, slow, and inconsistent. AI judges are cheap, fast, and reproducible. The key finding: AI-generated preferences produce alignment quality comparable to human preferences, at 10-100x lower cost. The constitution makes the criteria explicit and auditable.

2. Iterative / Online DPO

Standard DPO trains on a fixed preference dataset. But preferences from an old policy may not cover what the new policy generates. Online DPO closes this gap:

Online DPO Loop
  1. Generate: Sample new responses from current πθ for training prompts.
  2. Score: Use a reward model (or AI judge) to rank the responses.
  3. Train: Run DPO on the new (prompt, winner, loser) triples.
  4. Repeat: New policy → new generations → new preferences → ...

This gives DPO the online learning benefit of RLHF while keeping its implementation simplicity.
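
The loop above can be sketched as a small driver function. Everything passed in (`generate`, `score`, `train_step`) is a hypothetical callable standing in for the policy sampler, the reward model or AI judge, and a DPO update. Pairing the best-scored sample against the worst is one possible pairing rule, assumed here for simplicity.

```python
import random


def online_dpo_round(prompts, generate, score, train_step, samples_per_prompt=4):
    """One generate -> score -> train round; returns the preference triples it built."""
    triples = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(samples_per_prompt)]
        ranked = sorted(candidates, key=lambda y: score(prompt, y))
        loser, winner = ranked[0], ranked[-1]
        if score(prompt, winner) > score(prompt, loser):  # skip ties: no preference signal
            triples.append((prompt, winner, loser))
    train_step(triples)  # e.g. one or more DPO gradient steps on these triples
    return triples


# Toy stand-ins so the sketch runs end to end (all hypothetical).
rng = random.Random(0)


def toy_generate(prompt):
    return prompt + " word" * rng.randint(1, 6)


def toy_score(prompt, response):
    return -abs(len(response.split()) - 4)  # prefers responses of about 4 words


trained_on = []
triples = online_dpo_round(["Q1", "Q2"], toy_generate, toy_score, trained_on.extend)
print(len(triples), "preference triples built this round")
```

Repeating the call with the updated policy behind `generate` gives the full iterative loop.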

3. Process Reward Models

Instead of scoring the final answer, score each step in the reasoning chain. Lightman et al. (2023) showed that process-level supervision dramatically improves math reasoning — the model learns which reasoning steps are valid, not just whether the final answer is correct.

What's Happening at Scale (2024-2025)

Llama 3: Iterative DPO with AI feedback, 5 rounds of generation → scoring → training.

Claude: Constitutional AI + RLHF. Principles-based AI feedback for safety, human feedback for helpfulness.

GPT-4/o1: Process reward models for chain-of-thought. Reward each reasoning step.

DeepSeek-R1: Pure RL (GRPO) without human preferences for reasoning. Let the model explore freely with verifiable rewards (math answers, code tests).

Chapter 12

Summary & Connections

The Complete Pipeline

| Stage | Input | Output | Loss |
| --- | --- | --- | --- |
| 1. Pretrain | Raw text | Base LLM | Next-token prediction |
| 2. SFT | Instruction pairs | Instruction-following LLM | Next-token on responses |
| 3a. Reward Model | Preference pairs | Reward scorer | −log σ(r_w − r_l) |
| 3b. RLHF | Prompts + reward model | Aligned LLM | E[r] − β·D_KL |
| 3'. DPO | Preference pairs | Aligned LLM | −log σ(β(log ratio_w − log ratio_l)) |

Key Formulas Cheat Sheet

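Bradley-Terry: $P(y_w \succ y_l \mid x) = \sigma\big( r(x, y_w) - r(x, y_l) \big)$
Optimal Policy (Boltzmann): $\pi^*(y \mid x) = \frac{1}{Z(x)} \, \pi_{\text{ref}}(y \mid x) \, \exp\big( r(x, y) / \beta \big)$
DPO Loss: $L_{\text{DPO}} = -\,\mathbb{E}\left[ \log \sigma\left( \beta \left( \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right) \right]$
Implicit Reward: $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$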

Connections to Other Lessons

Policy Gradients: RLHF uses PPO (a policy gradient method with clipping). Everything from Lecture 3 — surrogate objectives, baselines, KL constraints — directly applies to the RL stage.

Reward Learning: The reward model is trained via maximum likelihood on pairwise comparisons. Same framework as learning reward functions from demonstrations in robotics.

Offline RL: DPO can be viewed as a form of offline RL — learning from a fixed dataset of (state, action, preference) without environment interaction.

KL-Regularized RL: The KL-constrained objective appears throughout RL: TRPO, PPO, SAC (entropy regularization). The Boltzmann policy is the optimal solution to all of these under different names.

The Unifying Principle

RLHF, DPO, and all their variants solve the same problem: align a model's outputs with human preferences while preserving the knowledge from pretraining. The KL penalty is the formal expression of "don't forget what you learned." The reward model (explicit or implicit) is the formal expression of "what humans want." Every method navigates the tension between these two forces.