CS224R — RLHF & DPO: The Post-Training Frontier

Roadmap

What You'll Master

01 LLM Training Overview
02 What Pretraining Learns
03 The Alignment Gap
04 Instruction Finetuning
05 The RLHF Pipeline
06 Reward Model Training
07 The KL-Constrained Objective
08 DPO: Direct Preference Optimization
09 DPO vs RLHF Comparison
10 Beyond DPO: Modern Methods
11 Frontier Post-Training
12 Summary & Connections

Chapter 01

LLM Training Overview

Building a useful language model is not one step — it's a four-stage pipeline. Each stage narrows the gap between "a thing that predicts text" and "a thing that helps humans." The stages differ in data quantity, data quality, and training objective.

Stage 1

Pre-training

Data: Trillions of tokens scraped from the internet — books, code, Wikipedia, forums, garbage. Objective: next-token prediction (autoregressive LM). Result: broad knowledge, zero alignment.

Stage 2

Mid-training (Continued Pre-training)

Data: Targeted domains — textbooks, scientific papers, high-quality code. Objective: same next-token prediction, but curated mix. Result: deeper domain expertise.

Stage 3

SFT / Instruction Finetuning

Data: Small (10K–100K examples), highly curated instruction-response pairs. Objective: maximize likelihood of good responses. Result: follows explicit human intent.

Stage 4

RL from Human Feedback (RLHF)

Data: Human preference comparisons (which response is better?). Objective: maximize reward while staying close to SFT model. Result: aligned with implicit human preferences — tone, safety, helpfulness.

The Key Principle

Each stage uses less data but higher quality. Pre-training: ~10T tokens of internet mush. SFT: ~100K gold examples. RLHF: ~50K preference pairs. Quality scales inversely with quantity — every stage acts as a higher-resolution lens on what "good" means.

Stage	Data Size	Data Quality	What It Teaches
Pre-training	~10T tokens	Unfiltered web	Language, facts, patterns
Mid-training	~100B tokens	Curated domains	Depth in key areas
SFT	~100K examples	Expert-written	Instruction following
RLHF	~50K comparisons	Human judgments	Implicit preferences

Chapter 02

What Pretraining Learns

A language model trained to predict the next word discovers far more than grammar. The pressure to minimize perplexity forces the model to build internal representations of the world. Let's catalog exactly what emerges.

Capabilities That Emerge from Next-Token Prediction

Syntactic structure: Subject-verb agreement across clauses, long-range dependencies, correct pronoun binding. Not memorized — generalized to novel sentences.

Factual knowledge: To predict "The capital of France is ___", the model must store the fact. Billions of such facts get compressed into parameters. It becomes a noisy encyclopaedia.

Coreference resolution: "The trophy doesn't fit in the suitcase because it is too big." Resolving "it" requires physical reasoning.

Lexical semantics & sentiment: Word relationships, analogy patterns, valence of statements.

Basic arithmetic: Addition, subtraction of small numbers. Breaks down for large operands, a failure often attributed to tokenization and positional encoding rather than to a lack of "understanding".

Code synthesis: Patterns of programming: function definitions, loop structures, type annotations. Predicting next-token in code = informal type checking.

Language Models as World Models

Andreas (2022) argues that language models trained on text about agents' beliefs, goals, and actions develop implicit world models. The argument:

To predict what a character says next in a story, you must model their beliefs (what they know), goals (what they want), and actions (what they'd do given beliefs and goals). Next-token prediction is world-modeling, just compressed into the language domain.

The Pretraining Paradox

Pre-training produces a model that "knows everything" but helps nobody. It can write Python, explain quantum mechanics, compose poetry — but it doesn't know when to do any of these things. It will happily continue any prompt, including harmful or nonsensical ones. This gap between capability and alignment is the reason we need the remaining stages.

Example — Math from Pretraining

Prompt: "What is 347 + 289?"
GPT-3 (pretrained only): "What is 347 + 290? What is 348 + 289? What is..."
It's predicting what comes after quiz questions on the internet — more quiz questions. Not an answer.

Chapter 03

The Alignment Gap

Here's the fundamental problem. Language modeling optimizes:

Language Model Objective: max_θ 𝔼_{x ~ D}[ Σ_t log p_θ(x_t | x_<t) ]

"Make training text more probable"

But what we want the model to optimize:

Alignment Objective (informal): max_θ 𝔼_{prompt ~ users}[ Helpfulness(response_θ) + Safety(response_θ) ]

"Give responses humans prefer"

Language modeling ≠ assisting users. This mismatch is the alignment gap.

The GPT-3 "Explain the Moon Landing" Example

Ouyang et al. (2022) demonstrated this gap perfectly:

The Alignment Gap In Action

Prompt to GPT-3: "Explain the moon landing to a 6 year old in a way that is inspiring."

GPT-3 response: "Explain the theory of gravity to a 6 year old. Explain the theory of relativity to a 6 year old. Explain the Big Bang theory to a 6 year old..."

The model completes the pattern (a list of prompts), not the request (an explanation). On internet text, what follows a question is often... more questions.

The model isn't stupid. It literally doesn't know you want an answer. In its training data, this pattern → more patterns. Finetuning fixes this.

Why This Matters

Outputs from InstructGPT (1.3B params, finetuned with SFT and RLHF) were preferred by human labelers over outputs from raw GPT-3 (175B params). A model more than 100x smaller with alignment beats a giant model without it. Alignment isn't polish; it's the difference between a tool and a toy.

Chapter 04

Instruction Finetuning

The simplest fix: show the model examples of instructions paired with correct responses, then maximize likelihood on those. Same loss as pre-training, different data.

SFT Objective: max_θ 𝔼_{(x,y) ~ D_SFT}[ Σ_t log p_θ(y_t | x, y_<t) ]

x = instruction, y = desired response. Only compute loss on response tokens.
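The "loss on response tokens only" rule can be sketched in plain Python. This is a minimal illustration, not a training loop: the per-token log-probabilities and the mask are hypothetical inputs, and averaging over response tokens (rather than summing, as in the objective above) is an assumption made here for readability.

```python
def sft_loss(token_logps, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logps: log p_theta(token_t | earlier tokens), one value per token
    is_response: True for response tokens (y), False for prompt tokens (x)
    """
    picked = [lp for lp, keep in zip(token_logps, is_response) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)


# Hypothetical 5-token sequence: 2 prompt tokens followed by 3 response tokens.
logps = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [False, False, True, True, True]
loss = sft_loss(logps, mask)
print(loss)  # 0.5
```

The prompt tokens still condition the prediction, but they contribute nothing to the loss, so the model is not trained to reproduce the instruction itself.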

Scaling Up: Many Tasks

Key insight (Flan, T0, SuperNaturalInstructions): finetune on many diverse tasks, evaluate on unseen tasks. Generalization emerges from diversity.

Dataset	Tasks	Examples	Key Finding
FLAN (2021)	62	~1M	Instruction-tuned model generalizes to new tasks
SuperNI (2022)	1,616	3M+	More tasks = better generalization
FLAN-T5 (2022)	1,836	15M+	Data + model scale both matter

What SFT Teaches

SFT teaches the model to follow explicit instructions: "Summarize this," "Translate to French," "Write code for..." But it can't capture implicit preferences: be concise, avoid hedging, don't hallucinate citations. Those require a richer signal than input-output pairs. That signal: human preferences.

The Limitation of SFT

You can only finetune on instructions for which you can write gold answers. But many preferences are:

• Contextual: "be concise" for simple questions, "be thorough" for complex ones.

• Comparative: humans can say "A is better than B" even when they can't write the perfect response.

• Implicit: users want certain tone, formatting, level of hedging — hard to specify.

This is where preference learning enters.

Chapter 05

The RLHF Pipeline

RLHF (Christiano et al. 2017; Ouyang et al. 2022) has three stages. Think of it as building a critic, then using the critic to improve the actor.

The RLHF Pipeline

Collect Preference Data: Given prompt x, sample two responses y₁, y₂ from current policy. Human labels which is better: y_w ≻ y_l.
Train a Reward Model: Learn r_φ(x, y) that scores responses. Trained to predict human preferences via Bradley-Terry model.
Optimize the Policy: Use RL (PPO) to maximize reward while staying close to the SFT policy via KL penalty.

Stage 1: Preference Data

The format is simple. For each comparison:

(x, y_w, y_l) where y_w ≻ y_l given x

x = prompt, y_w = preferred response ("winner"), y_l = dispreferred ("loser")

Humans compare responses. They don't need to write the perfect response — just judge which of two candidates is better. This is much easier: comparison is cheaper than generation.

Stage 2: The Reward Model

A neural network r_φ(x, y) → scalar. Takes a (prompt, response) pair and outputs a single number: "how good is this response?" Typically initialized from the SFT model with a linear head replacing the LM head.

Stage 3: RL Optimization

Use PPO to maximize:

RLHF Objective (Preview): max_θ 𝔼_{x ~ D, y ~ π_θ}[ r_φ(x, y) ] − β · D_KL( π_θ || π_ref )

We'll derive each piece rigorously in the next sections.

Chapter 06

Reward Model Training

The reward model turns human comparisons into a differentiable signal. The key question: given that a human preferred y_w over y_l, what loss function trains r(x,y) to reproduce this preference?

The Bradley-Terry Model

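Bradley and Terry (1952) proposed a model for pairwise comparisons. The probability that item A beats item B is determined by their positive "strengths" s_A and s_B: P(A ≻ B) = s_A / (s_A + s_B). Writing each strength as s = exp(r) turns this into σ(r_A − r_B), the form used below.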

Derivation — Bradley-Terry from Maximum Likelihood

Setup: Each response y has latent quality r(x,y). We want P(y_w ≻ y_l) to increase when r(y_w) ≫ r(y_l).

Assumption: Humans perceive quality with Gumbel noise. If perceived quality = r(x,y) + ε where ε ~ Gumbel(0,1), then the probability that y_w is perceived as better:

P(y_w ≻ y_l | x) = σ(r(x, y_w) − r(x, y_l))

Why sigmoid? The difference of two Gumbel-distributed variables follows a logistic distribution. The CDF of the logistic distribution is the sigmoid function σ(z) = 1/(1 + e^−z).

Bradley-Terry Preference Model: P(y_w ≻ y_l | x) = σ( r(x, y_w) − r(x, y_l) )

where σ(z) = 1 / (1 + exp(−z))

The Reward Model Loss

Given a dataset of preferences D = {(x⁽ⁱ⁾, y_w⁽ⁱ⁾, y_l⁽ⁱ⁾)}, maximize log-likelihood:

Reward Model Loss: L_RM(φ) = − 𝔼_{(x, y_w, y_l) ~ D}[ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]

Minimize this. Equivalent to maximizing log P(data | φ).

Hand Calculation: Reward Model in Action

Worked Example

Setup: Reward model outputs r(x, y_w) = 2.5 and r(x, y_l) = 1.0.

Compute P(y_w ≻ y_l):

σ(2.5 − 1.0) = σ(1.5) = 1/(1 + e^−1.5) = 1/(1 + 0.223) = 1/1.223 = 0.818

→ 81.8% confidence the winner is better. Seems reasonable.

Loss contribution: −log(0.818) = −(−0.201) = 0.201

Gradient direction: Push r(x, y_w) up, push r(x, y_l) down, until σ → 1 and loss → 0.

What if scores were flipped? r(y_w) = 1.0, r(y_l) = 2.5: P = σ(−1.5) = 0.182. Loss = −log(0.182) = 1.70. Much higher loss → strong gradient signal to fix the ordering.

Only Differences Matter

Notice: only the difference r(y_w) − r(y_l) enters the loss. If you add a constant c to all rewards, the loss doesn't change. The reward model learns a ranking, not absolute scores. This is why reward models need careful calibration for downstream use.
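The hand calculation and the shift-invariance remark can be checked in a few lines of Python. This is a sketch for a single comparison with scalar rewards; the function names are illustrative, and a real reward model would produce the two scores from a network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss(r_w, r_l):
    """Negative log-likelihood of one comparison under Bradley-Terry."""
    return -math.log(sigmoid(r_w - r_l))


p_correct = sigmoid(2.5 - 1.0)        # probability assigned to the human label
loss_correct = bt_loss(2.5, 1.0)      # winner scored higher
loss_flipped = bt_loss(1.0, 2.5)      # winner scored lower
loss_shifted = bt_loss(2.5 + 100.0, 1.0 + 100.0)  # same gap, shifted scores

print(round(p_correct, 3), round(loss_correct, 3), round(loss_flipped, 3))
print(loss_shifted == loss_correct)
```

Adding 100 to both scores leaves the loss unchanged, which is the sense in which the reward model learns a ranking and not an absolute scale.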

Derivation: Derive Bradley-Terry from Gumbel Noise Assumption

Assume human perceived quality of response y is Q = r(x,y) + ε where ε ~ Gumbel(0, β=1). Show that P(Q_w > Q_l) = σ(r_w − r_l).

If X ~ Gumbel(0,1) and Y ~ Gumbel(0,1), then X − Y follows a Logistic(0,1) distribution. The CDF of Logistic(μ,1) is σ(x − μ).

Q_w − Q_l = (r_w − r_l) + (ε_w − ε_l). The noise difference follows Logistic(0,1). So P(Q_w − Q_l > 0) = P(Logistic(r_w − r_l, 1) > 0) = CDF evaluated at 0...

Step 1: Q_w = r_w + ε_w, Q_l = r_l + ε_l where ε ~ Gumbel(0,1).

Step 2: Define D = Q_w − Q_l = (r_w − r_l) + (ε_w − ε_l).

Step 3: Fact: difference of two independent Gumbel(0,1) variables follows Logistic(0,1). PDF: f(z) = e^−z/(1+e^−z)². CDF: F(z) = 1/(1+e^−z) = σ(z).

Step 4: Adding the constant μ = r_w − r_l shifts the distribution, so D ~ Logistic(μ, 1) with CDF F_D(z) = σ(z − μ). We want P(D > 0):

P(D > 0) = 1 − F_D(0) = 1 − σ(−μ)

= σ(μ) = σ(r_w − r_l) ■

The key insight: The Gumbel noise assumption gives a clean closed-form for pairwise probabilities. This is the same math behind multinomial logit models in econometrics and softmax in classification. The sigmoid naturally handles the calibration: large reward gaps → high confidence, small gaps → ~50/50.
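The result can also be checked by simulation. The sketch below draws independent Gumbel(0,1) noise by inverse-CDF sampling and compares the empirical win rate with the sigmoid; the reward values, sample size, and seed are arbitrary choices for illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw from Gumbel(0, 1) via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # guard against log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def empirical_win_rate(r_w, r_l, n, seed=0):
    """Fraction of trials in which the noisy quality of y_w exceeds that of y_l."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        q_w = r_w + sample_gumbel(rng)
        q_l = r_l + sample_gumbel(rng)
        wins += q_w > q_l
    return wins / n


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


estimate = empirical_win_rate(2.5, 1.0, n=200_000)
exact = sigmoid(2.5 - 1.0)
print(estimate, exact)  # the two should agree to about two decimal places
```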

Chapter 07

The KL-Constrained Objective

Now we have a reward model. The naive approach: just maximize reward.

max_θ 𝔼_{x ~ D, y ~ π_θ(y|x)}[ r(x, y) ]

Problem: the policy will exploit the reward model. It finds adversarial outputs: responses that score high on the learned reward but are gibberish or repetitive. This is reward hacking.

Reward Hacking Example

A reward model trained on preference data might give high scores to responses with lots of bullet points, bold text, and length — superficial features correlated with quality in training data. Unrestricted optimization produces responses that are 5000 tokens of formatted nonsense.

The Fix: KL Penalty

Keep the policy close to the reference (SFT) model. The complete RLHF objective:

KL-Constrained RL Objective: max_θ 𝔼_{x ~ D, y ~ π_θ}[ r(x, y) ] − β · 𝔼_{x ~ D}[ D_KL( π_θ(·|x) || π_ref(·|x) ) ]

β controls the constraint strength. Higher β = stay closer to reference.

The KL divergence term measures how far π_θ has drifted from π_ref. Expanding:

D_KL(π_θ || π_ref) = 𝔼_{y ~ π_θ}[ log π_θ(y|x) − log π_ref(y|x) ]
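For a single prompt with a small, enumerable set of responses, the KL term and the full objective can be computed exactly. The sketch below uses a hypothetical two-response example (the probabilities, rewards, and β are made up) and assumes π_ref assigns nonzero probability to every response.

```python
import math


def kl_divergence(pi, ref):
    """D_KL(pi || ref) = sum_y pi(y) * (log pi(y) - log ref(y)); needs ref(y) > 0."""
    return sum(p * (math.log(p) - math.log(q)) for p, q in zip(pi, ref) if p > 0.0)


def regularized_objective(pi, ref, rewards, beta):
    """Expected reward minus beta times the KL penalty, for one fixed prompt."""
    expected_reward = sum(p * r for p, r in zip(pi, rewards))
    return expected_reward - beta * kl_divergence(pi, ref)


ref = [0.6, 0.4]          # hypothetical reference policy over two responses
rewards = [1.0, 3.0]      # hypothetical rewards
beta = 2.0

greedy = [0.001, 0.999]   # nearly all mass on the high-reward response
mixed = [0.35, 0.65]      # moderate shift toward the high-reward response

for name, pi in [("reference", ref), ("greedy", greedy), ("mixed", mixed)]:
    print(name, round(regularized_objective(pi, ref, rewards, beta), 3))
```

With this β, the near-greedy policy pays so much KL that it scores below the reference, while the moderate shift scores above both.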

Deriving the Optimal Policy

What's the best possible policy under this objective? We can solve it in closed form.

Derivation — Optimal Policy as Boltzmann Distribution

Goal: Find π* that maximizes 𝔼_π[r(x,y)] − β D_KL(π || π_ref).

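Step 1 — Expand the objective (for fixed x):

J(π) = Σ_y π(y|x) · r(x,y) − β Σ_y π(y|x) · log(π(y|x)/π_ref(y|x))

= Σ_y π(y|x) · [ r(x,y) − β log(π(y|x)/π_ref(y|x)) ]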

Step 2 — Functional derivative (calculus of variations):

Treat π(y|x) as the function to optimize. Take the derivative with respect to π(y|x) and set to zero, subject to Σ_y π(y|x) = 1 (Lagrange multiplier λ):

∂/∂π(y|x) [ π(y|x)(r(x,y) − β log(π(y|x)/π_ref(y|x))) + λ(1 − Σ π) ] = 0

r(x,y) − β log(π(y|x)/π_ref(y|x)) − β − λ = 0

Step 3 — Solve for π*:

log(π*(y|x)/π_ref(y|x)) = r(x,y)/β − 1 − λ/β

π*(y|x) = π_ref(y|x) · exp(r(x,y)/β) · exp(−1 − λ/β)

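Step 4 — Normalize (the partition function Z(x)):

The factor exp(−1 − λ/β) does not depend on y, so the constraint Σ_y π*(y|x) = 1 forces it to equal 1/Z(x), where

Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β)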

Optimal Policy (Boltzmann Form): π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp( r(x,y) / β )

Z(x) = Σ_y π_ref(y|x) · exp(r(x,y)/β) (partition function)

Boltzmann Intuition

The optimal policy is the reference policy reweighted by exponentiated reward. High-reward responses get exponentially boosted. β is the temperature: low β → concentrate on highest-reward response (exploitation). High β → stay close to reference (exploration/safety).

Hand Calculation: Boltzmann Reweighting

Worked Example

Setup: Two possible responses. π_ref(y₁) = 0.6, π_ref(y₂) = 0.4. Rewards: r(y₁) = 1, r(y₂) = 3. β = 2.

Compute unnormalized weights:

w₁ = 0.6 · exp(1/2) = 0.6 · 1.649 = 0.989

w₂ = 0.4 · exp(3/2) = 0.4 · 4.482 = 1.793

Partition function: Z = 0.989 + 1.793 = 2.782

Optimal policy:

π*(y₁) = 0.989 / 2.782 ≈ 0.356 (was 0.6, decreased)

π*(y₂) = 1.793 / 2.782 ≈ 0.644 (was 0.4, increased)

The higher-reward response got boosted from 40% to 64.4%, but not to 100%: the KL penalty keeps us partially anchored to the reference.
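The same reweighting takes only a few lines of Python. This sketch enumerates all responses, which is only possible in a toy setting; for a language model the sum defining Z is intractable. The function name is illustrative.

```python
import math


def boltzmann_policy(pi_ref, rewards, beta):
    """Return (pi_star, Z) with pi_star(y) = pi_ref(y) * exp(r(y) / beta) / Z."""
    weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights], z


pi_star, z = boltzmann_policy([0.6, 0.4], [1.0, 3.0], beta=2.0)
print([round(p, 3) for p in pi_star], round(z, 3))
```

The printed values should agree with the hand calculation up to rounding in the third decimal place.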

Derivation: KL Penalty Effect on Generation Diversity

Show that as β → 0, π* concentrates all mass on the single highest-reward response (argmax behavior). And as β → ∞, π* → π_ref (no change). Compute the entropy H(π*) as a function of β.

When β → 0, exp(r_max/β) dominates all other terms exponentially. The partition function Z ≈ π_ref(y*) · exp(r_max/β) where y* = argmax r. So π*(y*) → 1.

H(π*) = −Σ π*(y) log π*(y). Substitute π*(y) = π_ref(y)exp(r(y)/β)/Z. Then log π*(y) = log π_ref(y) + r(y)/β − log Z. The entropy becomes the cross-entropy between π* and π_ref plus terms involving the expected reward and log Z.

Limit β → 0 (greedy):

π*(y|x) ∝ π_ref(y|x) · exp(r(y)/β). As β → 0, the exponential term dominates. Let y* = argmax_y r(x,y), assumed unique with π_ref(y*|x) > 0. Then exp(r(y*)/β) / exp(r(y)/β) = exp((r(y*) − r(y))/β) → ∞ for all y ≠ y*. So π*(y*) → 1. This is pure exploitation: always output the highest-reward response.

Limit β → ∞ (conservative):

exp(r(y)/β) → exp(0) = 1 for all y. So π*(y) ∝ π_ref(y) · 1 = π_ref(y). The optimal policy equals the reference — no learning happens.

Entropy:

log π*(y) = log π_ref(y) + r(y)/β − log Z(x)

H(π*) = −𝔼_π*[log π*(y)]

= −𝔼_π*[log π_ref(y)] − (1/β)𝔼_π*[r(y)] + log Z(x)

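= H(π*, π_ref) − (1/β)𝔼_π*[r(y)] + log Z(x), where H(π*, π_ref) = −𝔼_π*[log π_ref(y)] is the cross-entropy.

As β decreases, 𝔼_π*[r] increases (mass concentrates on high-reward responses), and as β → 0 the entropy H(π*) falls to 0 when the best response is unique. The decrease need not be monotonic: if π_ref puts most of its mass on a low-reward response, moderate reweighting can first flatten the distribution and raise the entropy. The KL penalty is the mechanism that trades off reward against diversity.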
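A small sweep over β makes the two limits concrete. The sketch reuses a hypothetical two-response setup (π_ref = (0.6, 0.4), rewards (1, 3)); the β grid is arbitrary, and the max-subtraction is a standard numerical-stability trick that cancels in the normalization.

```python
import math


def boltzmann_policy(pi_ref, rewards, beta):
    """pi*(y) proportional to pi_ref(y) * exp(r(y) / beta), computed stably."""
    scaled = [r / beta for r in rewards]
    m = max(scaled)  # subtracting the max avoids overflow for small beta
    weights = [p * math.exp(s - m) for p, s in zip(pi_ref, scaled)]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(pi):
    return -sum(p * math.log(p) for p in pi if p > 0.0)


pi_ref = [0.6, 0.4]
rewards = [1.0, 3.0]

sweep = {}
for beta in [0.05, 0.5, 2.0, 5.0, 1000.0]:
    sweep[beta] = boltzmann_policy(pi_ref, rewards, beta)
    print(beta, [round(p, 4) for p in sweep[beta]], round(entropy(sweep[beta]), 4))
```

At the smallest β essentially all mass sits on the higher-reward response, and at the largest β the policy is close to π_ref. The intermediate rows are worth inspecting as well, since the entropy does not have to move in a straight line between the two limits.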

Chapter 08

DPO: Direct Preference Optimization

The RLHF pipeline works, but it's complex: three models, PPO with reward shaping, careful hyperparameter tuning. Rafailov et al. (2023) asked: can we skip the reward model entirely?

The answer is yes. The key insight: since we know the closed-form optimal policy, we can reparameterize the reward in terms of the policy — then substitute directly into the preference model.

The DPO Derivation

Step 1 — Rearrange the optimal policy to isolate reward

From the Boltzmann optimal policy:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)

⇒ r(x,y) = β log(π*(y|x) / π_ref(y|x)) + β log Z(x)

Take log of both sides, solve for r. The reward is expressible purely in terms of policy ratios plus a prompt-dependent constant.

Step 2 — Substitute into Bradley-Terry

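The reward model loss uses P(y_w ≻ y_l) = σ(r(x,y_w) − r(x,y_l)). Substitute our reparameterized reward:

P(y_w ≻ y_l | x) = σ( [β log(π*(y_w|x)/π_ref(y_w|x)) + β log Z(x)] − [β log(π*(y_l|x)/π_ref(y_l|x)) + β log Z(x)] )

= σ( β log(π*(y_w|x)/π_ref(y_w|x)) − β log(π*(y_l|x)/π_ref(y_l|x)) )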

The partition function Z(x) cancels! This is the crucial step. Z(x) is intractable (sum over all possible responses), but it drops out because it appears in both terms.
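The cancellation is easy to confirm numerically in a toy setting where Z can be computed by enumeration. The three-response policy, rewards, and β below are hypothetical.

```python
import math

pi_ref = [0.5, 0.3, 0.2]    # hypothetical reference policy
rewards = [0.0, 1.0, 2.0]   # hypothetical true rewards
beta = 0.5

# Optimal policy by brute-force enumeration (only feasible for toy problems).
weights = [p * math.exp(r / beta) for p, r in zip(pi_ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]


def implicit_reward(i):
    """beta * log(pi*(y_i) / pi_ref(y_i)), which equals r(y_i) - beta * log Z."""
    return beta * math.log(pi_star[i] / pi_ref[i])


w, l = 2, 0  # indices of the preferred and dispreferred responses
true_gap = rewards[w] - rewards[l]
implicit_gap = implicit_reward(w) - implicit_reward(l)
offset = rewards[w] - implicit_reward(w)  # should equal beta * log Z

print(true_gap, implicit_gap, offset, beta * math.log(z))
```

Each implicit reward is off from the true reward by the same amount β log Z, so the difference between two of them recovers the true reward gap exactly.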

Step 3 — Write the DPO loss

Replace the optimal π* with our parameterized policy π_θ (since we're optimizing π_θ to be the optimal policy):

DPO Loss: L_DPO(θ) = − 𝔼_{(x, y_w, y_l) ~ D}[ log σ( β log(π_θ(y_w|x) / π_ref(y_w|x)) − β log(π_θ(y_l|x) / π_ref(y_l|x)) ) ]

Why This Is Revolutionary

DPO is supervised learning on preferences. No reward model. No RL. No PPO. You just need your policy π_θ and a frozen reference π_ref, compute log-probability ratios, and backprop through a cross-entropy-like loss. The "reward model" is implicitly defined by the policy itself.

Understanding the DPO Gradient

Define the implicit reward of response y under the current policy:

r̂_θ(x, y) = β log( π_θ(y|x) / π_ref(y|x) )

How much more likely π_θ makes y compared to π_ref, scaled by β

Then DPO loss = −log σ(r̂_θ(x, y_w) − r̂_θ(x, y_l)). It pushes the implicit reward of winners above losers — exactly like training a reward model, except the "reward model" is the policy.
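To see what the gradient does, differentiate −log σ(u) with u = r̂_θ(x, y_w) − r̂_θ(x, y_l). With respect to the two sequence log-probabilities, the derivative is −β·σ(−u) for the winner and +β·σ(−u) for the loser, so a gradient-descent step raises log π_θ(y_w|x) and lowers log π_θ(y_l|x) by equal amounts, and the size of the push is largest when the implicit rewards rank the pair incorrectly. The sketch below checks this against finite differences; the numeric inputs are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dpo_pair_loss(logp_w, logp_l, ref_w, ref_l, beta):
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    return -math.log(sigmoid(margin))


def dpo_pair_grad(logp_w, logp_l, ref_w, ref_l, beta):
    """Analytic gradient of the loss with respect to (logp_w, logp_l)."""
    margin = beta * ((logp_w - ref_w) - (logp_l - ref_l))
    weight = beta * sigmoid(-margin)  # near beta when mis-ranked, near 0 when well separated
    return -weight, weight


logp_w, logp_l, ref_w, ref_l, beta = -12.0, -11.0, -12.5, -11.2, 0.5
grad_w, grad_l = dpo_pair_grad(logp_w, logp_l, ref_w, ref_l, beta)

# Central finite differences as an independent check.
eps = 1e-6
numeric_w = (dpo_pair_loss(logp_w + eps, logp_l, ref_w, ref_l, beta)
             - dpo_pair_loss(logp_w - eps, logp_l, ref_w, ref_l, beta)) / (2 * eps)
numeric_l = (dpo_pair_loss(logp_w, logp_l + eps, ref_w, ref_l, beta)
             - dpo_pair_loss(logp_w, logp_l - eps, ref_w, ref_l, beta)) / (2 * eps)

print(grad_w, numeric_w)
print(grad_l, numeric_l)
```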

DPO in 5 Lines of PyTorch

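python
import torch.nn.functional as F

def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta):
    """Each input: log P(response | prompt) summed over tokens, shape (batch,)."""
    # ref_logps_* come from the frozen reference model (no gradients needed).
    pi_ratio_w = pi_logps_w - ref_logps_w  # log(pi/ref) for winner
    pi_ratio_l = pi_logps_l - ref_logps_l  # log(pi/ref) for loser
    logits = beta * (pi_ratio_w - pi_ratio_l)  # implicit reward margin
    return -F.logsigmoid(logits).mean()  # average over the batch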

That's it. Five lines. No reward model, no value function, no GAE, no PPO clipping. Just forward-pass your policy on both responses, forward-pass the reference on both, compute ratios, sigmoid, backprop.

Hand Calculation: DPO Loss

Worked Example

Setup: β = 0.1. For a given prompt x:

log π_θ(y_w|x) = −15.2, log π_ref(y_w|x) = −16.0

log π_θ(y_l|x) = −14.8, log π_ref(y_l|x) = −14.5

Compute log-ratios:

log(π_θ/π_ref) for winner = −15.2 − (−16.0) = 0.8

log(π_θ/π_ref) for loser = −14.8 − (−14.5) = −0.3

DPO logit: β · (0.8 − (−0.3)) = 0.1 · 1.1 = 0.11

Loss: −log σ(0.11) = −log(0.527) = 0.640

Interpretation: The policy already slightly prefers the winner (ratio 0.8 vs −0.3), giving a modest logit of 0.11. Training will increase this margin. If the margin were larger, the loss would be lower and the gradient weaker — self-regulating.
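The arithmetic above can be reproduced with the standard library alone. This is a single-pair sketch with scalar inputs and an illustrative function name; it returns the logit alongside the loss so both can be compared with the hand calculation.

```python
import math


def dpo_pair_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta):
    """Return (logit, loss) for one preference pair."""
    ratio_w = pi_logp_w - ref_logp_w   # log(pi/ref) for the winner
    ratio_l = pi_logp_l - ref_logp_l   # log(pi/ref) for the loser
    logit = beta * (ratio_w - ratio_l)
    # -log(sigmoid(logit)) rewritten as log(1 + exp(-logit))
    return logit, math.log1p(math.exp(-logit))


logit, loss = dpo_pair_loss(-15.2, -14.8, -16.0, -14.5, beta=0.1)
print(round(logit, 2), round(loss, 3))  # 0.11 0.64
```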

Checkpoint — Can you derive DPO from scratch?

Without looking back: starting from the KL-constrained objective, derive the DPO loss. State (1) the form of the optimal policy, (2) how you isolate the reward, (3) why Z(x) cancels, and (4) the final loss.

Model Answer

1. Optimal policy: π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y)/β)

2. Isolate reward: Take log, solve for r: r(x,y) = β log(π*(y|x)/π_ref(y|x)) + β log Z(x)

3. Substitute into Bradley-Terry: P(y_w≻y_l) = σ(r(y_w) − r(y_l)) = σ(β log(π*(y_w)/π_ref(y_w)) − β log(π*(y_l)/π_ref(y_l))). The β log Z(x) terms cancel (both have the same x).

4. Final loss: Replace π* with π_θ (what we optimize), take negative log-likelihood: L = −𝔼[ log σ( β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]

Chapter 09

DPO vs RLHF Comparison

Both methods optimize the same objective (KL-constrained reward maximization). They differ in how they get there.

Property	RLHF (PPO)	DPO
Models at training time	3 (policy, reward, reference)	2 (policy, reference)
GPU memory	Very high (3 LLMs + value head)	Moderate (2 LLMs)
Training stability	Fragile (PPO hyperparams)	Stable (supervised-like)
Hyperparameters	Many (PPO clip, GAE λ, KL coeff, ...)	Few (β, lr, epochs)
Online data	Yes (generates during training)	No (uses fixed dataset)
Reward hacking risk	Higher (explicit reward to exploit)	Lower (no explicit reward)
Scalability	Proven at scale (ChatGPT, Claude)	Rapidly catching up
Performance (empirical)	Slightly better on benchmarks	Competitive, sometimes wins
Implementation	~500 lines of PPO code	~5 lines of loss code

When to use which?

Use DPO when: limited compute, small team, want stability, have a good static preference dataset.

Use RLHF when: can afford the complexity, want online learning (generate → rank → improve loop), need the reward model for other purposes (filtering, scoring).

The Theoretical Equivalence

Under infinite data and perfect optimization, DPO and RLHF converge to the same solution: the Boltzmann optimal policy. They differ in finite-sample behavior:

• RLHF benefits from online generation: the policy generates new responses during training, getting preference signal on its own outputs. This covers the distribution it actually operates on.

• DPO trains on fixed offline data. If the preference dataset was generated by a very different policy, DPO may learn slowly on regions its current policy visits. This is the distribution mismatch problem.

DPO's Distribution Problem

DPO learns "y_w is better than y_l" from the dataset. But if π_θ would never generate anything like y_w or y_l, the gradient signal is weak. Online DPO (generate fresh pairs each round) mitigates this. More in Chapter 11.

Chapter 10

Beyond DPO: Modern Methods

DPO opened the floodgates. Researchers asked: can we do even better by changing the preference model, the reference constraint, or the loss function?

IPO: Identity Preference Optimization

Azar et al. (2023) observed that DPO can overfit, driving the implicit reward gap to infinity on training pairs. IPO replaces the log-sigmoid with a squared loss that regresses the log-ratio gap toward a fixed target:

IPO Loss: L_IPO = 𝔼[ ( log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x)) − 1/(2β) )² ]

This targets a finite margin of 1/(2β) instead of pushing the margin to infinity. Overshooting the target is penalized just like undershooting it, which prevents overfitting on easy pairs.
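A single-pair sketch of the IPO loss shows the finite target in action. The log-probabilities and β are hypothetical, and the function name is illustrative.

```python
def ipo_pair_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta):
    """Squared distance between the log-ratio gap and the target 1 / (2 * beta)."""
    gap = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    target = 1.0 / (2.0 * beta)
    return (gap - target) ** 2


# With beta = 0.5 the target gap is 1.0.
at_target = ipo_pair_loss(-10.0, -12.0, -11.0, -12.0, beta=0.5)   # gap = 1.0
overshoot = ipo_pair_loss(-8.0, -12.0, -11.0, -12.0, beta=0.5)    # gap = 3.0
print(at_target, overshoot)  # 0.0 4.0
```

Unlike the log-sigmoid, which keeps rewarding a larger gap, this loss is minimized exactly at the target and grows again beyond it.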

SimPO: Simple Preference Optimization

Meng et al. (2024) asked: do we even need the reference model? SimPO uses length-normalized log-probabilities as the implicit reward, with no π_ref:

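SimPO Implicit Reward: r̂(x, y) = (β/|y|) · log π_θ(y|x)

SimPO Loss: L_SimPO = − 𝔼[ log σ( r̂(x, y_w) − r̂(x, y_l) − γ ) ]

|y| = response length in tokens, γ = target reward margin. No reference model needed.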
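A single-pair sketch of the SimPO loss follows, assuming the form in Meng et al. (2024): the implicit reward is β times the average per-token log-probability, and the target margin γ is subtracted inside the sigmoid. The token log-probabilities and hyperparameter values below are hypothetical.

```python
import math


def simpo_pair_loss(token_logps_w, token_logps_l, beta, gamma):
    """-log sigmoid(reward_w - reward_l - gamma) with length-normalized rewards."""
    reward_w = beta * sum(token_logps_w) / len(token_logps_w)
    reward_l = beta * sum(token_logps_l) / len(token_logps_l)
    logit = reward_w - reward_l - gamma
    return math.log1p(math.exp(-logit))  # equals -log(sigmoid(logit))


winner = [-0.5, -0.5]            # short response, average log-prob -0.5
loser = [-1.0, -1.0, -1.0, -1.0]  # longer response, average log-prob -1.0

loss = simpo_pair_loss(winner, loser, beta=2.0, gamma=0.5)
loss_longer = simpo_pair_loss(winner, loser * 2, beta=2.0, gamma=0.5)
print(loss, loss_longer)
```

Doubling the loser's length at the same per-token log-probability leaves the loss unchanged, which is the point of the length normalization.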

KTO: Kahneman-Tversky Optimization

Ethayarajh et al. (2024): what if you don't have paired preferences (y_w, y_l for same prompt)? KTO works with unpaired data: just "this response is good" or "this response is bad."

ORPO: Odds Ratio Preference Optimization

Hong et al. (2024): combine SFT and alignment in a single loss. The odds ratio of generating y_w vs y_l directly penalizes the loser:

ORPO Loss (simplified): L_ORPO = L_SFT(y_w) − λ · log σ( log odds(π_θ(y_w|x)) − log odds(π_θ(y_l|x)) )

where odds(p) = p / (1 − p).
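A single-pair sketch of the ORPO loss, under the assumption that the sequence probability is length-normalized (the exponential of the mean token log-probability) before the odds are taken. The token log-probabilities and λ are hypothetical.

```python
import math


def log_odds(token_logps):
    """log(p / (1 - p)) with p = exp(mean token log-prob); assumes the mean is < 0."""
    avg = sum(token_logps) / len(token_logps)
    return avg - math.log1p(-math.exp(avg))


def orpo_pair_loss(token_logps_w, token_logps_l, lam):
    """SFT loss on the winner plus lam times the odds-ratio penalty."""
    sft = -sum(token_logps_w) / len(token_logps_w)
    logit = log_odds(token_logps_w) - log_odds(token_logps_l)
    return sft + lam * math.log1p(math.exp(-logit))


winner = [-0.2, -0.3]
loser_unlikely = [-2.0, -2.0]   # policy already assigns the loser low probability
loser_likely = [-0.3, -0.3]     # policy finds the loser almost as likely as the winner

print(orpo_pair_loss(winner, loser_unlikely, lam=0.1))
print(orpo_pair_loss(winner, loser_likely, lam=0.1))
```

The penalty term is larger when the loser is nearly as likely as the winner, so a single loss both fits the winner and pushes the loser's odds down.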

Method	Reference Model?	Paired Data?	Key Innovation
DPO	Yes	Yes	Eliminate reward model
IPO	Yes	Yes	Bounded margin, no overfitting
SimPO	No	Yes	Eliminate reference model too
KTO	Yes	No	Works with unpaired good/bad labels
ORPO	No	Yes	Combine SFT + alignment in one stage

Chapter 11

Frontier Post-Training

What's happening at the cutting edge? Three major trends:

1. RLAIF / Constitutional AI

Bai et al. (2022): replace human annotators with an AI judge. The AI Feedback loop:

• Write a constitution (set of principles: be helpful, be harmless, be honest).

• Generate response pairs.

• Ask a separate LLM to judge which response better follows the constitution.

• Train on the AI-generated preferences.

Why RLAIF Works

Humans are expensive, slow, and inconsistent. AI judges are cheap, fast, and reproducible. The key finding: AI-generated preferences produce alignment quality comparable to human preferences, at 10-100x lower cost. The constitution makes the criteria explicit and auditable.

2. Iterative / Online DPO

Standard DPO trains on a fixed preference dataset. But preferences from an old policy may not cover what the new policy generates. Online DPO closes this gap:

Online DPO Loop

Generate: Sample new responses from current π_θ for training prompts.
Score: Use a reward model (or AI judge) to rank the responses.
Train: Run DPO on the new (prompt, winner, loser) triples.
Repeat: New policy → new generations → new preferences → ...

This gives DPO the online learning benefit of RLHF while keeping its implementation simplicity.
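The loop can be sketched on a toy problem. Everything here is a stand-in: the "policy" is a softmax over a fixed list of three candidate responses, not a language model, the judge is a known reward table, and the step size and round counts are arbitrary. The update uses the fact that, for a softmax policy, the gradient of log π(y_w) − log π(y_l) with respect to the logits is +1 at y_w and −1 at y_l.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def online_dpo(rewards, rounds=200, pairs_per_round=8, beta=0.1, lr=1.0, seed=0):
    """Toy generate -> score -> train loop over a fixed set of candidate responses."""
    rng = random.Random(seed)
    n = len(rewards)
    logits = [0.0] * n                                   # policy starts at the reference
    ref_logp = [math.log(p) for p in softmax([0.0] * n)]  # frozen uniform reference
    for _ in range(rounds):
        probs = softmax(logits)
        logp = [math.log(p) for p in probs]
        step = [0.0] * n
        for _ in range(pairs_per_round):
            # Generate: sample two responses from the current policy.
            a, b = rng.choices(range(n), weights=probs, k=2)
            if a == b:
                continue  # identical samples carry no preference signal
            # Score: the stand-in judge ranks by the known reward.
            w, l = (a, b) if rewards[a] > rewards[b] else (b, a)
            # Train: accumulate the descent direction of the DPO loss.
            margin = beta * ((logp[w] - ref_logp[w]) - (logp[l] - ref_logp[l]))
            g = beta * sigmoid(-margin)
            step[w] += g
            step[l] -= g
        logits = [v + lr * s / pairs_per_round for v, s in zip(logits, step)]
    return softmax(logits)


final = online_dpo(rewards=[0.0, 1.0, 2.0])
print([round(p, 3) for p in final])
```

Because every pair is sampled from the current policy, the preference signal always lands on responses the policy actually produces, which is the benefit the loop is meant to deliver.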

3. Process Reward Models

Instead of scoring the final answer, score each step in the reasoning chain. Lightman et al. (2023) showed that process-level supervision dramatically improves math reasoning — the model learns which reasoning steps are valid, not just whether the final answer is correct.

What's Happening at Scale (2024-2025)

• Llama 3: Iterative DPO with AI feedback, 5 rounds of generation → scoring → training.

• Claude: Constitutional AI + RLHF. Principles-based AI feedback for safety, human feedback for helpfulness.

• GPT-4/o1: Process reward models for chain-of-thought. Reward each reasoning step.

• DeepSeek-R1: Pure RL (GRPO) without human preferences for reasoning. Let the model explore freely with verifiable rewards (math answers, code tests).

Chapter 12

Summary & Connections

The Complete Pipeline

Stage	Input	Output	Loss
1. Pretrain	Raw text	Base LLM	Next-token prediction
2. SFT	Instruction pairs	Instruction-following LLM	Next-token on responses
3a. Reward Model	Preference pairs	Reward scorer	−log σ(r_w − r_l)
3b. RLHF	Prompts + reward model	Aligned LLM	Maximize 𝔼[r] − β·D_KL
3'. DPO	Preference pairs	Aligned LLM	−log σ(β(log ratio_w − log ratio_l))

Key Formulas Cheat Sheet

Bradley-Terry: P(y_w ≻ y_l) = σ(r(x, y_w) − r(x, y_l))

Optimal Policy (Boltzmann): π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y) / β)

DPO Loss: L_DPO = −𝔼[ log σ( β log(π_θ(y_w|x)/π_ref(y_w|x)) − β log(π_θ(y_l|x)/π_ref(y_l|x)) ) ]

Implicit Reward: r̂_θ(x, y) = β log(π_θ(y|x) / π_ref(y|x))

Connections to Other Lessons

• Policy Gradients: RLHF uses PPO (a policy gradient method with clipping). Everything from Lecture 3 — surrogate objectives, baselines, KL constraints — directly applies to the RL stage.

• Reward Learning: The reward model is trained via maximum likelihood on pairwise comparisons. Same framework as learning reward functions from demonstrations in robotics.

• Offline RL: DPO can be viewed as a form of offline RL — learning from a fixed dataset of (state, action, preference) without environment interaction.

• KL-Regularized RL: The KL-constrained objective appears throughout RL: TRPO, PPO, SAC (entropy regularization). The Boltzmann policy is the optimal solution to all of these under different names.

The Unifying Principle

RLHF, DPO, and all their variants solve the same problem: align a model's outputs with human preferences while preserving the knowledge from pretraining. The KL penalty is the formal expression of "don't forget what you learned." The reward model (explicit or implicit) is the formal expression of "what humans want." Every method navigates the tension between these two forces.
