From pre-training to preference alignment. Every derivation from scratch, every design decision justified. After this, you implement it.
Building a useful language model is not one step — it's a four-stage pipeline. Each stage narrows the gap between "a thing that predicts text" and "a thing that helps humans." The stages differ in data quantity, data quality, and training objective.
Stage 1, Pre-training. Data: trillions of tokens scraped from the internet: books, code, Wikipedia, forums, garbage. Objective: next-token prediction (autoregressive LM). Result: broad knowledge, zero alignment.
Stage 2, Mid-training. Data: targeted domains such as textbooks, scientific papers, and high-quality code. Objective: same next-token prediction, but on a curated mix. Result: deeper domain expertise.
Stage 3, Supervised finetuning (SFT). Data: small (10K–100K examples), highly curated instruction-response pairs. Objective: maximize likelihood of good responses. Result: follows explicit human intent.
Stage 4, RLHF. Data: human preference comparisons (which response is better?). Objective: maximize reward while staying close to the SFT model. Result: aligned with implicit human preferences such as tone, safety, and helpfulness.
Each stage uses less data but higher quality. Pre-training: ~10T tokens of internet mush. SFT: ~100K gold examples. RLHF: ~50K preference pairs. Quality scales inversely with quantity — every stage acts as a higher-resolution lens on what "good" means.
| Stage | Data Size | Data Quality | What It Teaches |
|---|---|---|---|
| Pre-training | ~10T tokens | Unfiltered web | Language, facts, patterns |
| Mid-training | ~100B tokens | Curated domains | Depth in key areas |
| SFT | ~100K examples | Expert-written | Instruction following |
| RLHF | ~50K comparisons | Human judgments | Implicit preferences |
A language model trained to predict the next word discovers far more than grammar. The pressure to minimize perplexity forces the model to build internal representations of the world. Let's catalog exactly what emerges.
Syntactic structure: Subject-verb agreement across clauses, long-range dependencies, correct pronoun binding. Not memorized — generalized to novel sentences.
Factual knowledge: To predict "The capital of France is ___", the model must store the fact. Billions of such facts get compressed into parameters. It becomes a noisy encyclopaedia.
Coreference resolution: "The trophy doesn't fit in the suitcase because it is too big." Resolving "it" requires physical reasoning.
Lexical semantics & sentiment: Word relationships, analogy patterns, valence of statements.
Basic arithmetic: Addition, subtraction of small numbers. Breaks down for large operands (commonly attributed to tokenization and positional encoding rather than a lack of "understanding").
Code synthesis: Patterns of programming: function definitions, loop structures, type annotations. Predicting next-token in code = informal type checking.
Andreas (2022) argues that language models trained on text about agents' beliefs, goals, and actions develop implicit world models. The argument:
To predict what a character says next in a story, you must model their beliefs (what they know), goals (what they want), and actions (what they'd do given beliefs and goals). Next-token prediction is world-modeling, just compressed into the language domain.
Pre-training produces a model that "knows everything" but helps nobody. It can write Python, explain quantum mechanics, compose poetry — but it doesn't know when to do any of these things. It will happily continue any prompt, including harmful or nonsensical ones. This gap between capability and alignment is the reason we need the remaining stages.
Prompt: "What is 347 + 289?"
GPT-3 (pretrained only): "What is 347 + 290? What is 348 + 289? What is..."
It's predicting what comes after quiz questions on the internet — more quiz questions. Not an answer.
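Here's the fundamental problem. Language modeling maximizes the likelihood of the next token in its training text: maxθ 𝔼[Σt log pθ(xt | x<t)].
But what we want the model to maximize is how useful its response is to the person asking, a quantity that appears nowhere in that objective.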
Language modeling ≠ assisting users. This mismatch is the alignment gap.
Ouyang et al. (2022) demonstrated this gap perfectly:
Prompt to GPT-3: "Explain the moon landing to a 6 year old in a way that is inspiring."
GPT-3 response: "Explain the theory of gravity to a 6 year old. Explain the theory of relativity to a 6 year old. Explain the Big Bang theory to a 6 year old..."
The model completes the pattern (a list of prompts), not the request (an explanation). On internet text, what follows a question is often... more questions.
The model isn't stupid. It literally doesn't know you want an answer. In its training data, this pattern → more patterns. Finetuning fixes this.
InstructGPT (the finetuned version, 1.3B params) was preferred by humans over the raw GPT-3 (175B params). A 100x smaller model with alignment beats a giant model without it. Alignment isn't polish — it's the difference between a tool and a toy.
The simplest fix: show the model examples of instructions paired with correct responses, then maximize likelihood on those. Same loss as pre-training, different data.
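A minimal sketch of that loss, assuming per-token log-probabilities have already been computed by the model. The only change from pre-training is the mask that restricts the average to response tokens. The name `sft_loss` and the 0/1 mask convention are illustrative assumptions, not taken from any particular library.

```python
def sft_loss(token_logps, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logps: log p(token_t | tokens_<t) at every position (hypothetical input).
    is_response: 1 for response tokens, 0 for prompt tokens.
    """
    picked = [lp for lp, m in zip(token_logps, is_response) if m]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)


# Prompt tokens (mask 0) do not contribute; only the three response tokens do.
logps = [-2.0, -1.5, -0.5, -0.25, -0.75]
mask = [0, 0, 1, 1, 1]
loss = sft_loss(logps, mask)
```

Masking the prompt is a common design choice: the model is not asked to predict the instruction, only to respond to it.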
Key insight (Flan, T0, SuperNaturalInstructions): finetune on many diverse tasks, evaluate on unseen tasks. Generalization emerges from diversity.
| Dataset | Tasks | Examples | Key Finding |
|---|---|---|---|
| FLAN (2021) | 62 | ~1M | Instruction-tuned model generalizes to new tasks |
| SuperNI (2022) | 1,616 | 3M+ | More tasks = better generalization |
| FLAN-T5 (2022) | 1,836 | 15M+ | Data + model scale both matter |
SFT teaches the model to follow explicit instructions: "Summarize this," "Translate to French," "Write code for..." But it can't capture implicit preferences: be concise, avoid hedging, don't hallucinate citations. Those require a richer signal than input-output pairs. That signal: human preferences.
You can only finetune on instructions for which you can write gold answers. But many preferences are:
• Contextual: "be concise" for simple questions, "be thorough" for complex ones.
• Comparative: humans can say "A is better than B" even when they can't write the perfect response.
• Implicit: users want certain tone, formatting, level of hedging — hard to specify.
This is where preference learning enters.
RLHF (Christiano et al. 2017; Ouyang et al. 2022) has three stages. Think of it as building a critic, then using the critic to improve the actor.
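The format is simple. Each comparison is a triple (x, yw, yl): a prompt x, the response yw the annotator preferred (the "winner"), and the response yl they rejected (the "loser").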
Humans compare responses. They don't need to write the perfect response — just judge which of two candidates is better. This is much easier: comparison is cheaper than generation.
A neural network rφ(x, y) → scalar. Takes a (prompt, response) pair and outputs a single number: "how good is this response?" Typically initialized from the SFT model with a linear head replacing the LM head.
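Use PPO to maximize 𝔼[rφ(x, y)] − β DKL(πθ || πref), with y sampled from the policy πθ(·|x): the learned reward minus a penalty for drifting away from the SFT reference.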
We'll derive each piece rigorously in the next sections.
The reward model turns human comparisons into a differentiable signal. The key question: given that a human preferred yw over yl, what loss function trains r(x,y) to reproduce this preference?
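Bradley and Terry (1952) proposed a model for pairwise comparisons. The probability that item A beats item B is determined by their "strengths": P(A beats B) = sA / (sA + sB) with sA, sB > 0. Writing each strength as s = exp(r) turns this into a sigmoid of the score difference, P(A beats B) = σ(rA − rB).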
Setup: Each response y has latent quality r(x,y). We want P(yw ≻ yl) to increase when r(yw) ≫ r(yl).
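Assumption: Humans perceive quality with Gumbel noise. If perceived quality = r(x,y) + ε where ε ~ Gumbel(0,1), independently for each response, then the probability that yw is perceived as better is P(yw ≻ yl) = σ(r(x,yw) − r(x,yl)).
Why sigmoid? The difference of two independent Gumbel variables with the same scale follows a logistic distribution. The CDF of the standard logistic distribution is the sigmoid function σ(z) = 1/(1 + e^(−z)).
Given a dataset of preferences D = {(x^(i), yw^(i), yl^(i))}, maximize the log-likelihood Σi log σ(r(x^(i), yw^(i)) − r(x^(i), yl^(i))), or equivalently minimize the loss L = −𝔼(x,yw,yl)~D[log σ(r(x,yw) − r(x,yl))].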
Setup: Reward model outputs r(x, yw) = 2.5 and r(x, yl) = 1.0.
Compute P(yw ≻ yl):
σ(2.5 − 1.0) = σ(1.5) = 1/(1 + e^(−1.5)) = 1/(1 + 0.223) = 1/1.223 = 0.818
→ 81.8% confidence the winner is better. Seems reasonable.
Loss contribution: −log(0.818) = −(−0.201) = 0.201
Gradient direction: Push r(x, yw) up, push r(x, yl) down, until σ → 1 and loss → 0.
What if scores were flipped? r(yw) = 1.0, r(yl) = 2.5: P = σ(−1.5) = 0.182. Loss = −log(0.182) = 1.70. Much higher loss → strong gradient signal to fix the ordering.
Notice: only the difference r(yw) − r(yl) enters the loss. If you add a constant c to all rewards, the loss doesn't change. The reward model learns a ranking, not absolute scores. This is why reward models need careful calibration for downstream use.
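The worked numbers can be reproduced in a few lines. This is a sketch for a single preference pair with scalar rewards; the function names `sigmoid` and `bt_loss` are illustrative. The last call adds a constant to both rewards to check the shift invariance.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def bt_loss(r_w, r_l):
    """Negative log-likelihood that the winner beats the loser (Bradley-Terry)."""
    return -math.log(sigmoid(r_w - r_l))


correct = bt_loss(2.5, 1.0)                # winner scored higher: small loss
flipped = bt_loss(1.0, 2.5)                # winner scored lower: large loss
shifted = bt_loss(2.5 + 10.0, 1.0 + 10.0)  # same gap, both rewards shifted
```

`correct` and `shifted` agree because only the reward difference enters the loss.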
Assume human perceived quality of response y is Q = r(x,y) + ε where ε ~ Gumbel(0, β=1). Show that P(Qw > Ql) = σ(rw − rl).
Step 1: Qw = rw + εw, Ql = rl + εl where ε ~ Gumbel(0,1).
Step 2: Define D = Qw − Ql = (rw − rl) + (εw − εl).
Step 3: Fact: difference of two independent Gumbel(0,1) variables follows Logistic(0,1). PDF: f(z) = e^(−z)/(1 + e^(−z))². CDF: F(z) = 1/(1 + e^(−z)) = σ(z).
Step 4: So D ~ Logistic(μ, 1) with μ = rw − rl, whose CDF is F_D(d) = σ(d − μ). We want P(D > 0):
P(D > 0) = 1 − F_D(0)
= 1 − σ(−μ) = σ(μ) = σ(rw − rl) ■
The key insight: The Gumbel noise assumption gives a clean closed-form for pairwise probabilities. This is the same math behind multinomial logit models in econometrics and softmax in classification. The sigmoid naturally handles the calibration: large reward gaps → high confidence, small gaps → ~50/50.
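The closed form can also be checked by simulation. The sketch below draws independent Gumbel(0,1) noise for both responses and counts how often the winner's perceived quality is higher. The sample size, seed, and function names are arbitrary choices for illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Standard Gumbel(0, 1) sample via the inverse CDF: -log(-log(U))."""
    u = rng.random()
    while u == 0.0:  # random() can return 0.0, which would give log(0)
        u = rng.random()
    return -math.log(-math.log(u))


def estimate_win_prob(r_w, r_l, n, seed=0):
    """Monte Carlo estimate of P(r_w + eps_w > r_l + eps_l)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        if r_w + sample_gumbel(rng) > r_l + sample_gumbel(rng):
            wins += 1
    return wins / n


estimate = estimate_win_prob(2.5, 1.0, n=200_000)
exact = 1.0 / (1.0 + math.exp(-1.5))  # sigma(r_w - r_l)
```

With this many samples the estimate should land within about a percentage point of σ(1.5).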
Now we have a reward model. The naive approach: just maximize reward.
Problem: the policy will exploit the reward model. It finds adversarial outputs: responses that score high on the learned reward but are gibberish or repetitive. This is reward hacking.
A reward model trained on preference data might give high scores to responses with lots of bullet points, bold text, and length — superficial features correlated with quality in training data. Unrestricted optimization produces responses that are 5000 tokens of formatted nonsense.
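Keep the policy close to the reference (SFT) model. The complete RLHF objective: maxθ 𝔼x~D, y~πθ(·|x)[r(x,y)] − β DKL(πθ(·|x) || πref(·|x)).
The KL divergence term measures how far πθ has drifted from πref. Expanding: DKL(πθ || πref) = 𝔼y~πθ(·|x)[log πθ(y|x) − log πref(y|x)], so the objective equals 𝔼[r(x,y) − β log(πθ(y|x)/πref(y|x))], a per-sample reward with a log-ratio penalty.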
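For a finite set of responses the objective can be evaluated directly. The sketch below assumes a toy two-response setting with made-up probabilities and rewards; `kl_divergence` and `rlhf_objective` are illustrative names.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def rlhf_objective(policy, ref, rewards, beta):
    """Expected reward under the policy minus beta times KL(policy || ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, ref)


ref = [0.6, 0.4]      # toy reference policy over two responses
rewards = [1.0, 3.0]  # toy rewards
stay = rlhf_objective(ref, ref, rewards, beta=2.0)           # no drift, no penalty
greedy = rlhf_objective([0.0, 1.0], ref, rewards, beta=2.0)  # all mass on response 2
```

In this toy case the greedy policy scores lower than simply staying at the reference: the extra reward does not pay for the KL cost at β = 2.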
What's the best possible policy under this objective? We can solve it in closed form.
Goal: Find π* that maximizes 𝔼π[r(x,y)] − β DKL(π || πref).
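Step 1: Expand the objective (for fixed x): J(π) = Σy π(y|x) r(x,y) − β Σy π(y|x) log(π(y|x)/πref(y|x)).
Step 2: Functional derivative (calculus of variations):
Treat π(y|x) as the function to optimize. Take the derivative with respect to π(y|x) and set to zero, subject to Σy π(y|x) = 1 (Lagrange multiplier λ): r(x,y) − β log(π(y|x)/πref(y|x)) − β − λ = 0.
Step 3: Solve for π*: π*(y|x) = πref(y|x) · exp(r(x,y)/β) · exp(−1 − λ/β).
Step 4: Normalize (the partition function Z(x)): the last factor does not depend on y, so the constraint fixes it to 1/Z(x) with Z(x) = Σy πref(y|x) exp(r(x,y)/β), giving π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y)/β).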
The optimal policy is the reference policy reweighted by exponentiated reward. High-reward responses get exponentially boosted. β is the temperature: low β → concentrate on highest-reward response (exploitation). High β → stay close to reference (exploration/safety).
Setup: Two possible responses. πref(y1) = 0.6, πref(y2) = 0.4. Rewards: r(y1) = 1, r(y2) = 3. β = 2.
Compute unnormalized weights:
w1 = 0.6 · exp(1/2) = 0.6 · 1.649 = 0.989
w2 = 0.4 · exp(3/2) = 0.4 · 4.482 = 1.793
Partition function: Z = 0.989 + 1.793 = 2.782
Optimal policy:
π*(y1) = 0.989 / 2.782 ≈ 0.356 (was 0.6, decreased)
π*(y2) = 1.793 / 2.782 ≈ 0.644 (was 0.4, increased)
The higher-reward response got boosted from 40% to about 64.4%, but not to 100%: the KL penalty keeps us partially anchored to the reference.
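The same computation as a small function, assuming a finite response set so the partition function is an ordinary sum. The name `optimal_policy` is illustrative.

```python
import math


def optimal_policy(ref, rewards, beta):
    """pi*(y) = ref(y) * exp(r(y) / beta) / Z for a finite response set."""
    weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)  # partition function Z
    return [w / z for w in weights]


pi_star = optimal_policy([0.6, 0.4], [1.0, 3.0], beta=2.0)
```

`pi_star` comes out at roughly 36% and 64%, matching the hand calculation up to rounding.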
Show that as β → 0, π* concentrates all mass on the single highest-reward response (argmax behavior). And as β → ∞, π* → πref (no change). Compute the entropy H(π*) as a function of β.
Limit β → 0 (greedy):
π*(y|x) ∝ πref(y|x) · exp(r(y)/β). As β → 0, the exponential term dominates. Let y* = argmax_y r(x,y) (assume it is unique and πref(y*|x) > 0). Then exp(r(y*)/β) / exp(r(y)/β) = exp((r(y*) − r(y))/β) → ∞ for all y ≠ y*. So π*(y*) → 1. This is pure exploitation: always output the highest-reward response.
Limit β → ∞ (conservative):
exp(r(y)/β) → exp(0) = 1 for all y. So π*(y) ∝ πref(y) · 1 = πref(y). The optimal policy equals the reference — no learning happens.
Entropy:
log π*(y) = log πref(y) + r(y)/β − log Z(x)
H(π*) = −𝔼π*[log π*(y)]
= −𝔼π*[log πref(y)] − (1/β)𝔼π*[r(y)] + log Z(x)
= H(π*, πref) − (1/β)𝔼π*[r(y)] + log Z(x)
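As β decreases, 𝔼π*[r] increases (mass concentrates on high-reward responses), and H(π*) → 0 as β → 0 when the best response is unique. As β → ∞, H(π*) → H(πref). In between, entropy need not be monotone: if πref puts most of its mass on low-reward responses, a moderate β first evens the distribution out before a small β concentrates it. The KL penalty is the mechanism that trades off reward against staying close to the reference.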
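Both limits are easy to check numerically, and the path between them is worth checking rather than assuming. The sketch reuses the two-response toy numbers from the earlier worked example; the specific β values are arbitrary, and the function names are illustrative.

```python
import math


def optimal_policy(ref, rewards, beta):
    """Boltzmann reweighting; subtracting the max reward keeps exp() from overflowing."""
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)


ref, rewards = [0.6, 0.4], [1.0, 3.0]
h_ref = entropy(ref)
h_small = entropy(optimal_policy(ref, rewards, beta=0.01))  # near-greedy
h_mid = entropy(optimal_policy(ref, rewards, beta=5.0))     # moderate tilt
h_large = entropy(optimal_policy(ref, rewards, beta=1e6))   # near reference
```

Here `h_small` is essentially zero and `h_large` matches `h_ref`, while `h_mid` sits slightly above `h_ref` because at β = 5 the tilt makes the two responses almost equally likely.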
The RLHF pipeline works, but it's complex: three models, PPO with reward shaping, careful hyperparameter tuning. Rafailov et al. (2023) asked: can we skip the reward model entirely?
The answer is yes. The key insight: since we know the closed-form optimal policy, we can reparameterize the reward in terms of the policy — then substitute directly into the preference model.
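From the Boltzmann optimal policy: π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y)/β).
Take log of both sides, solve for r: r(x,y) = β log(π*(y|x)/πref(y|x)) + β log Z(x). The reward is expressible purely in terms of policy ratios plus a prompt-dependent constant.
The reward model loss uses P(yw ≻ yl) = σ(r(x,yw) − r(x,yl)). Substitute our reparameterized reward: P(yw ≻ yl) = σ(β log(π*(yw|x)/πref(yw|x)) − β log(π*(yl|x)/πref(yl|x))).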
The partition function Z(x) cancels! This is the crucial step. Z(x) is intractable (sum over all possible responses), but it drops out because it appears in both terms.
Replace the optimal π* with our parameterized policy πθ (since we're optimizing πθ to be the optimal policy): L_DPO(θ) = −𝔼(x,yw,yl)~D[log σ(β log(πθ(yw|x)/πref(yw|x)) − β log(πθ(yl|x)/πref(yl|x)))].
DPO is supervised learning on preferences. No reward model. No RL. No PPO. You just need your policy πθ and a frozen reference πref, compute log-probability ratios, and backprop through a cross-entropy-like loss. The "reward model" is implicitly defined by the policy itself.
Define the implicit reward of response y under the current policy: r̂θ(x, y) = β log(πθ(y|x)/πref(y|x)).
Then DPO loss = −log σ(r̂θ(x, yw) − r̂θ(x, yl)). It pushes the implicit reward of winners above losers — exactly like training a reward model, except the "reward model" is the policy.
```python
import torch.nn.functional as F


def dpo_loss(pi_logps_w, pi_logps_l, ref_logps_w, ref_logps_l, beta):
    """Each input: log P(response | prompt) summed over tokens."""
    pi_ratio_w = pi_logps_w - ref_logps_w  # log(pi/ref) for winner
    pi_ratio_l = pi_logps_l - ref_logps_l  # log(pi/ref) for loser
    logits = beta * (pi_ratio_w - pi_ratio_l)
    return -F.logsigmoid(logits).mean()
```
That's it. Five lines. No reward model, no value function, no GAE, no PPO clipping. Just forward-pass your policy on both responses, forward-pass the reference on both, compute ratios, sigmoid, backprop.
Setup: β = 0.1. For a given prompt x:
log πθ(yw|x) = −15.2 log πref(yw|x) = −16.0
log πθ(yl|x) = −14.8 log πref(yl|x) = −14.5
Compute log-ratios:
log(πθ/πref) for winner = −15.2 − (−16.0) = 0.8
log(πθ/πref) for loser = −14.8 − (−14.5) = −0.3
DPO logit: β · (0.8 − (−0.3)) = 0.1 · 1.1 = 0.11
Loss: −log σ(0.11) = −log(0.527) = 0.640
Interpretation: The policy already slightly prefers the winner (ratio 0.8 vs −0.3), giving a modest logit of 0.11. Training will increase this margin. If the margin were larger, the loss would be lower and the gradient weaker — self-regulating.
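The example can be checked without a deep learning framework. This is a scalar sketch for one preference pair using only the standard library; `dpo_loss_single` is an illustrative name, and the second call shows the loss when the policy still equals the reference.

```python
import math


def dpo_loss_single(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta):
    """DPO loss for one preference pair, using summed sequence log-probs."""
    logit = beta * ((pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l))
    # -log(sigmoid(logit)) written as log(1 + exp(-logit))
    return math.log1p(math.exp(-logit))


loss = dpo_loss_single(-15.2, -14.8, -16.0, -14.5, beta=0.1)
at_init = dpo_loss_single(-16.0, -14.5, -16.0, -14.5, beta=0.1)  # policy == reference
```

At initialization the logit is zero and the loss is log 2; the worked example sits just below that, which reflects the small margin the policy has already built.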
1. Optimal policy: π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y)/β)
2. Isolate reward: Take log, solve for r: r(x,y) = β log(π*(y|x)/πref(y|x)) + β log Z(x)
3. Substitute into Bradley-Terry: P(yw≻yl) = σ(r(yw) − r(yl)) = σ(β log(π*(yw)/πref(yw)) − β log(π*(yl)/πref(yl))). The β log Z(x) terms cancel (both have the same x).
4. Final loss: Replace π* with πθ (what we optimize), take negative log-likelihood: L = −E[log σ(β(log(πθ(yw)/πref(yw)) − log(πθ(yl)/πref(yl))))]
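The cancellation in step 3 can be verified numerically on a toy problem. The sketch assumes two responses with made-up reference probabilities and rewards, builds π* from step 1, and compares the true reward gap with the gap in β log(π*/πref).

```python
import math

ref = [0.6, 0.4]      # toy reference policy
rewards = [1.0, 3.0]  # toy rewards
beta = 2.0

weights = [p * math.exp(r / beta) for p, r in zip(ref, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Reparameterized reward without the beta * log Z(x) term.
implicit = [beta * math.log(p / q) for p, q in zip(pi_star, ref)]

true_gap = rewards[1] - rewards[0]
implicit_gap = implicit[1] - implicit[0]
offset = rewards[0] - implicit[0]  # the same constant, beta * log Z, for every response
```

The two gaps agree, and the per-response offset is the same constant β log Z for both responses, which is why it drops out of any pairwise comparison.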
Both methods optimize the same objective (KL-constrained reward maximization). They differ in how they get there.
| Property | RLHF (PPO) | DPO |
|---|---|---|
| Models at training time | 3 (policy, reward, reference) | 2 (policy, reference) |
| GPU memory | Very high (3 LLMs + value head) | Moderate (2 LLMs) |
| Training stability | Fragile (PPO hyperparams) | Stable (supervised-like) |
| Hyperparameters | Many (PPO clip, GAE λ, KL coeff, ...) | Few (β, lr, epochs) |
| Online data | Yes (generates during training) | No (uses fixed dataset) |
| Reward hacking risk | Higher (explicit reward to exploit) | Lower (no explicit reward) |
| Scalability | Proven at scale (ChatGPT, Claude) | Rapidly catching up |
| Performance (empirical) | Slightly better on benchmarks | Competitive, sometimes wins |
| Implementation | ~500 lines of PPO code | ~5 lines of loss code |
Use DPO when: limited compute, small team, want stability, have a good static preference dataset.
Use RLHF when: can afford the complexity, want online learning (generate → rank → improve loop), need the reward model for other purposes (filtering, scoring).
Under infinite data and perfect optimization, DPO and RLHF converge to the same solution: the Boltzmann optimal policy. They differ in finite-sample behavior:
• RLHF benefits from online generation: the policy generates new responses during training, getting preference signal on its own outputs. This covers the distribution it actually operates on.
• DPO trains on fixed offline data. If the preference dataset was generated by a very different policy, DPO may learn slowly on regions its current policy visits. This is the distribution mismatch problem.
DPO learns "yw is better than yl" from the dataset. But if πθ would never generate anything like yw or yl, the gradient signal is weak. Online DPO (generate fresh pairs each round) mitigates this. More in Section 11.
DPO opened the floodgates. Researchers asked: can we do even better by changing the preference model, the reference constraint, or the loss function?
Azar et al. (2023) observed that DPO can overfit by driving the implicit reward gap to infinity on training pairs. IPO replaces the log-sigmoid with a squared loss around a finite target: L_IPO = 𝔼[(h − 1/(2β))²], where h = log(πθ(yw|x)/πref(yw|x)) − log(πθ(yl|x)/πref(yl|x)).
This targets a finite margin of 1/(2β) instead of pushing the margin to infinity, which prevents overfitting on easy pairs.
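A scalar sketch of a squared loss with that target, for one preference pair. It assumes the margin is measured as the difference of policy-to-reference log-ratios, as in DPO; `ipo_loss_single` is an illustrative name and the inputs are made up.

```python
def ipo_loss_single(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta):
    """Squared distance between the log-ratio margin and the target 1 / (2 * beta)."""
    margin = (pi_logp_w - ref_logp_w) - (pi_logp_l - ref_logp_l)
    return (margin - 1.0 / (2.0 * beta)) ** 2


# With beta = 0.1 the target margin is 5.
at_target = ipo_loss_single(-11.0, -14.5, -16.0, -14.5, beta=0.1)  # margin = 5
overshoot = ipo_loss_single(-6.0, -14.5, -16.0, -14.5, beta=0.1)   # margin = 10
```

Unlike the log-sigmoid, the squared loss penalizes overshooting the target, so there is no incentive to keep widening the margin on pairs the policy already separates.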
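Meng et al. (2024) asked: do we even need the reference model? SimPO uses length-normalized log-probabilities as the implicit reward, with no πref: r̂(x, y) = (β/|y|) log πθ(y|x), and the loss is −log σ(r̂(x, yw) − r̂(x, yl) − γ) with a target margin γ > 0.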
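A scalar sketch of a reference-free loss on length-normalized log-probabilities. The target margin `gamma`, the parameter values, and the function name are assumptions of this sketch rather than a reproduction of any released implementation.

```python
import math


def simpo_loss_single(pi_logp_w, len_w, pi_logp_l, len_l, beta, gamma):
    """Reference-free loss on length-normalized log-probs with target margin gamma."""
    reward_w = beta * pi_logp_w / len_w  # average log-prob per token, scaled
    reward_l = beta * pi_logp_l / len_l
    logit = reward_w - reward_l - gamma
    return math.log1p(math.exp(-logit))  # -log(sigmoid(logit))


# Same per-token log-prob (-1.0) at different lengths: the implicit rewards tie.
tie = simpo_loss_single(-10.0, 10, -40.0, 40, beta=2.0, gamma=0.0)
```

Dividing by length means a longer response is not penalized merely for having more tokens, which is the point of the normalization.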
Ethayarajh et al. (2024): what if you don't have paired preferences (yw, yl for same prompt)? KTO works with unpaired data: just "this response is good" or "this response is bad."
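Hong et al. (2024): combine SFT and alignment in a single loss. The odds ratio of generating yw vs yl directly penalizes the loser: L = L_SFT + λ · L_OR, with L_OR = −log σ(log(odds(yw|x)/odds(yl|x))) and odds(y|x) = P(y|x)/(1 − P(y|x)).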
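A scalar sketch of an odds-ratio loss of this kind. It assumes the likelihood P is the length-normalized (per-token average) probability and that the SFT term is the negative average log-probability of the winner; the weight `lam` and the function names are illustrative.

```python
import math


def log_odds(avg_logp):
    """log(P / (1 - P)) where P = exp(avg_logp); requires avg_logp < 0."""
    p = math.exp(avg_logp)
    return avg_logp - math.log1p(-p)


def orpo_loss_single(avg_logp_w, avg_logp_l, lam):
    """SFT term on the winner plus lam times the odds-ratio penalty."""
    sft = -avg_logp_w
    logit = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    odds_ratio_term = math.log1p(math.exp(-logit))  # -log(sigmoid(logit))
    return sft + lam * odds_ratio_term


loss = orpo_loss_single(math.log(0.5), math.log(0.2), lam=0.1)
```

No reference model appears anywhere: the SFT term keeps the policy anchored to good responses while the odds-ratio term pushes the loser down.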
| Method | Reference Model? | Paired Data? | Key Innovation |
|---|---|---|---|
| DPO | Yes | Yes | Eliminate reward model |
| IPO | Yes | Yes | Bounded margin, no overfitting |
| SimPO | No | Yes | Eliminate reference model too |
| KTO | Yes | No | Works with unpaired good/bad labels |
| ORPO | No | Yes | Combine SFT + alignment in one stage |
What's happening at the cutting edge? Three major trends:
Bai et al. (2022): replace human annotators with an AI judge. The AI Feedback loop:
• Write a constitution (set of principles: be helpful, be harmless, be honest).
• Generate response pairs.
• Ask a separate LLM to judge which response better follows the constitution.
• Train on the AI-generated preferences.
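The four steps can be sketched as a data-building loop. Everything model-related here is a hypothetical stand-in: `generate_pair` fakes sampling from a policy and `ai_judge` fakes an LLM judge with a trivial length rule, so the example shows only the shape of the pipeline and the resulting preference records.

```python
import random

CONSTITUTION = ["be helpful", "be harmless", "be honest"]


def generate_pair(prompt, rng):
    """Hypothetical stand-in for sampling two responses from the policy."""
    return (f"{prompt} -> draft {rng.randint(0, 99)}",
            f"{prompt} -> draft {rng.randint(100, 199)}")


def ai_judge(prompt, a, b, principles):
    """Hypothetical stand-in for an LLM judge; here it simply prefers the shorter text."""
    return (a, b) if len(a) <= len(b) else (b, a)


def build_preferences(prompts, seed=0):
    """Generate a pair per prompt, judge it, and record (prompt, chosen, rejected)."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        a, b = generate_pair(prompt, rng)
        winner, loser = ai_judge(prompt, a, b, CONSTITUTION)
        dataset.append({"prompt": prompt, "chosen": winner, "rejected": loser})
    return dataset


prefs = build_preferences(["Explain KL divergence", "Summarize this email"])
```

The output has the same (prompt, chosen, rejected) shape as human preference data, so the downstream reward-model or DPO training code does not need to change.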
Humans are expensive, slow, and inconsistent. AI judges are cheap, fast, and reproducible. The key finding: AI-generated preferences produce alignment quality comparable to human preferences, at 10-100x lower cost. The constitution makes the criteria explicit and auditable.
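Standard DPO trains on a fixed preference dataset. But preferences from an old policy may not cover what the new policy generates. Online DPO closes this gap: each round, sample fresh responses from the current policy, have a judge (human, reward model, or AI) rank them into pairs, take DPO steps on those pairs, and repeat.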
This gives DPO the online learning benefit of RLHF while keeping its implementation simplicity.
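A toy version of such a loop, assuming a softmax policy over a handful of fixed responses and using the true reward as the judge. For a softmax policy the normalizer cancels in the pairwise log-ratio, so the DPO gradient has a simple closed form. All names, the uniform reference, and the hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def online_dpo(rewards, rounds, pairs_per_round, beta=0.5, lr=1.0, seed=0):
    """Toy online DPO over a finite response set, with the true reward as judge."""
    rng = random.Random(seed)
    ref_logits = [0.0] * len(rewards)  # frozen reference (uniform)
    logits = list(ref_logits)          # trainable policy
    indices = list(range(len(rewards)))
    for _ in range(rounds):
        probs = softmax(logits)        # pairs are sampled from the CURRENT policy
        for _ in range(pairs_per_round):
            a, b = rng.choices(indices, weights=probs, k=2)
            if a == b or rewards[a] == rewards[b]:
                continue               # a tie carries no preference signal
            w, l = (a, b) if rewards[a] > rewards[b] else (b, a)
            u = beta * ((logits[w] - logits[l]) - (ref_logits[w] - ref_logits[l]))
            step = lr * beta / (1.0 + math.exp(u))  # lr * beta * sigmoid(-u)
            logits[w] += step          # gradient descent on -log(sigmoid(u))
            logits[l] -= step
    return softmax(logits)


final = online_dpo(rewards=[0.0, 1.0, 2.0], rounds=20, pairs_per_round=16)
```

Because each round samples from the updated policy, the training pairs track what the policy actually produces, which is exactly what a fixed offline dataset cannot do.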
Instead of scoring the final answer, score each step in the reasoning chain. Lightman et al. (2023) showed that process-level supervision dramatically improves math reasoning — the model learns which reasoning steps are valid, not just whether the final answer is correct.
• Llama 3: Iterative DPO with reward-model scoring, six rounds of generation → scoring → training (per the Llama 3 technical report).
• Claude: Constitutional AI + RLHF. Principles-based AI feedback for safety, human feedback for helpfulness.
• GPT-4/o1: Process reward models for chain-of-thought. Reward each reasoning step.
• DeepSeek-R1: Pure RL (GRPO) without human preferences for reasoning. Let the model explore freely with verifiable rewards (math answers, code tests).
| Stage | Input | Output | Loss |
|---|---|---|---|
| 1. Pretrain | Raw text | Base LLM | Next-token prediction |
| 2. SFT | Instruction pairs | Instruction-following LLM | Next-token on responses |
| 3a. Reward Model | Preference pairs | Reward scorer | −log σ(rw − rl) |
| 3b. RLHF | Prompts + reward model | Aligned LLM | maximize E[r] − β·DKL (loss is its negative) |
| 3'. DPO | Preference pairs | Aligned LLM | −log σ(β(log ratio_w − log ratio_l)) |
• Policy Gradients: RLHF uses PPO (a policy gradient method with clipping). Everything from Lecture 3 — surrogate objectives, baselines, KL constraints — directly applies to the RL stage.
• Reward Learning: The reward model is trained via maximum likelihood on pairwise comparisons. Same framework as learning reward functions from demonstrations in robotics.
• Offline RL: DPO can be viewed as a form of offline RL — learning from a fixed dataset of (state, action, preference) without environment interaction.
• KL-Regularized RL: The KL-constrained objective appears throughout RL: TRPO, PPO, SAC (entropy regularization). The Boltzmann policy is the optimal solution to all of these under different names.
RLHF, DPO, and all their variants solve the same problem: align a model's outputs with human preferences while preserving the knowledge from pretraining. The KL penalty is the formal expression of "don't forget what you learned." The reward model (explicit or implicit) is the formal expression of "what humans want." Every method navigates the tension between these two forces.