The Complete Beginner's Path

Understand Reward Models & Alignment

How we teach language models to be helpful, harmless, and honest — turning raw capability into something humans actually want.

Prerequisites: Basic LLM intuition + What fine-tuning means. That's it.
10 Chapters · 8+ Interactives · 0 PhD Required

Chapter 0: The Alignment Problem

A language model trained on internet text can write poetry, solve math problems, and generate working code. But it can also produce toxic content, confidently hallucinate, or help with dangerous requests. Capability is not alignment. The model learned to predict the next token — it didn't learn what humans actually want.

The alignment problem is the gap between what a model can do and what it should do. A base model completes text. An aligned model has a notion of helpful, harmless, and honest. Bridging that gap is the central challenge of modern AI safety.

The core tension: Pre-training gives capability (the model learns the distribution of all internet text), but it doesn't distinguish a helpful medical answer from a conspiracy theory. Both are "likely next tokens." Alignment steers the model toward responses humans actually prefer.
Capability vs Alignment

Watch how a base model (predicting likely text) diverges from what an aligned model should say. The red path is raw capability. The green path is aligned behavior.

Early approaches tried to fix this with careful prompting or filtering. But these are band-aids. The real solution requires changing the model itself — teaching it a reward signal that captures human preferences. This is the story of alignment.

Check: Why isn't a model trained on internet text automatically aligned?

Chapter 1: Reward Modeling

How do you teach a model what "good" means? You ask humans. Given a prompt, generate two candidate responses A and B. Show them to a human annotator who picks the better one. Collect thousands of these preference pairs and train a separate model — the reward model — to predict which response a human would prefer.

The reward model takes a (prompt, response) pair and outputs a scalar score. Higher score = more likely to be preferred. The key insight: we don't need humans to assign absolute scores (that's noisy and inconsistent). We only need relative judgments: "A is better than B." This is the Bradley-Terry model:

P(A > B) = σ(r(A) − r(B))

where σ is the sigmoid function and r(·) is the reward model's score. The loss function pushes the reward for the preferred response above the rejected one:

L = −log σ(r(chosen) − r(rejected))

Reward Model Architecture

The reward model is typically the same architecture as the LLM itself, but with the language modeling head replaced by a scalar head. The data flow:

Input: Concatenated [prompt + response], tokenized → [B, L] token IDs
LLM Backbone: Transformer layers → [B, L, d_model] hidden states
Pool: Take the last token's hidden state → [B, d_model]
Scalar Head: Linear(d_model, 1) → [B, 1] reward score

For a 7B reward model with d_model=4096: the entire backbone processes the prompt+response, the last token's representation is pooled, and a single linear layer projects 4096 dimensions down to 1 scalar. The backbone is usually initialized from the SFT checkpoint — it already understands language; it just needs to learn to evaluate quality.
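The data flow above can be sketched in a few lines of PyTorch. This is a toy illustration, not a production reward model: `TinyRewardModel`, its sizes, and the single-layer stand-in backbone are all assumptions made for the example; a real RM would load the SFT checkpoint as its backbone.

```python
import torch
import torch.nn as nn


class TinyRewardModel(nn.Module):
    """Toy reward model: backbone -> last-token pooling -> scalar head."""

    def __init__(self, vocab_size: int = 100, d_model: int = 32):
        super().__init__()
        # Stand-in backbone (assumption): embedding + one Transformer layer.
        self.embed = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True
        )
        # Replaces the language-modeling head: d_model -> 1 scalar.
        self.scalar_head = nn.Linear(d_model, 1)

    def forward(self, token_ids: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.size(1)
        # Causal mask so each position only attends to earlier tokens.
        causal_mask = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        hidden = self.backbone(self.embed(token_ids), src_mask=causal_mask)  # [B, L, d_model]
        # Pool the hidden state of the last real (non-padding) token.
        batch_idx = torch.arange(token_ids.size(0))
        last_hidden = hidden[batch_idx, lengths - 1]                         # [B, d_model]
        return self.scalar_head(last_hidden)                                 # [B, 1]


model = TinyRewardModel()
token_ids = torch.randint(0, 100, (2, 5))   # [B=2, L=5], right-padded
lengths = torch.tensor([5, 3])              # true lengths per sequence
scores = model(token_ids, lengths)          # one scalar reward per sequence
```

Pooling by `lengths - 1` rather than position `-1` matters once sequences in a batch are padded to a common length.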

Training the Reward Model

Each training example is a tuple: (prompt, chosen_response, rejected_response). Forward both responses through the RM to get r(chosen) and r(rejected). The Bradley-Terry loss pushes the gap apart:

python
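import torch.nn.functional as F

# reward_model maps a tokenized (prompt + response) sequence to a scalar score
r_chosen = reward_model(prompt + chosen)      # scalar
r_rejected = reward_model(prompt + rejected)  # scalar
# logsigmoid is numerically more stable than log(sigmoid(...)) for large gaps
loss = -F.logsigmoid(r_chosen - r_rejected)
loss.backward()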

Typical dataset sizes: 50K–500K preference pairs. Training: 1–3 epochs, learning rate ~1e-5, batch size 64. The RM converges quickly because it's initialized from a strong LLM, so most of the work is already done. Training mainly teaches the new scalar head (and lightly adapts the backbone) to map what the model already knows onto a notion of "quality."

Why pairs, not scores? Asking "rate this response 1-10" is unreliable — one annotator's 7 is another's 5. But asking "which is better?" gives consistent signal. Humans are much better at comparisons than absolute judgments.
Reward Model Training

Drag the slider to adjust the reward gap between chosen and rejected responses. Watch how the Bradley-Terry loss changes.

Example setting: r(chosen) = 1.5, r(rejected) = −0.5, so gap = 2.0 and Loss = −log σ(gap) ≈ 0.13
Check: What does the reward model learn to predict?
🔨 Derivation: Bradley-Terry Loss from Maximum Likelihood

Given: A reward model r(·) that assigns scalar scores to responses. Human annotators provide N preference pairs where response A is preferred over response B. The Bradley-Terry model says P(A > B) = σ(r(A) − r(B)).

Your task: Derive the training loss L = −log σ(r(chosen) − r(rejected)) from the maximum likelihood principle. Why is this equivalent to binary cross-entropy?

If human says A > B, the likelihood under our model is P(A > B) = σ(r(A) − r(B)). We want to maximize this probability.
Maximizing P is equivalent to maximizing log P (log is monotonic). For N pairs: max Σ_i log σ(r(chosen_i) − r(rejected_i)). Minimizing the negative gives us the loss.
Binary cross-entropy with label y=1 and predicted probability p is: −log(p). Here our "label" is always 1 (chosen IS preferred) and our predicted probability is σ(gap). So L = −log(σ(gap)) = binary CE with y=1.

Full derivation:

1. Model assumption: P(A > B | r) = σ(r(A) − r(B)) where σ(x) = 1/(1 + e^(−x))

2. Likelihood of N pairs: L(r) = Π_{i=1..N} σ(r(chosen_i) − r(rejected_i))

3. Log-likelihood: log L(r) = Σ_i log σ(r(chosen_i) − r(rejected_i))

4. Negative log-likelihood (loss): J(r) = −(1/N) Σ_i log σ(r(chosen_i) − r(rejected_i))

5. Per-sample loss: L = −log σ(r(chosen) − r(rejected))

This is exactly binary cross-entropy where the "true label" is always 1 (chosen is always the preferred one in our dataset) and the model's predicted probability is σ(gap).

The key insight: We never need absolute reward values. The loss only depends on the difference r(chosen) − r(rejected). This means the reward model is only calibrated up to an additive constant — which is fine, because we only ever use reward differences downstream.
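Both claims in the derivation are easy to check numerically. The sketch below uses plain Python; the helper names (`bt_loss`, `binary_cross_entropy`) and the example scores are chosen for illustration only.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Per-pair Bradley-Terry loss: -log sigma(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))


def binary_cross_entropy(p: float, y: int) -> float:
    """Standard BCE for predicted probability p and label y in {0, 1}."""
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


loss = bt_loss(1.5, -0.5)
# Same quantity written as BCE with the label fixed to 1.
bce = binary_cross_entropy(sigmoid(1.5 - (-0.5)), y=1)
# Adding the same constant to both rewards leaves the loss unchanged.
shifted = bt_loss(1.5 + 10.0, -0.5 + 10.0)

print(round(loss, 4), round(bce, 4), round(shifted, 4))  # 0.1269 0.1269 0.1269
```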

Checkpoint — Before you move on
Explain in your own words: why does the reward model only need to learn relative rankings, not absolute scores? What would go wrong if we tried to train it with absolute scores (e.g., "rate this response 1-10")?
Model Answer

Relative rankings work because: (1) Humans are inconsistent at absolute scales: one annotator's 7 is another's 5, creating noisy gradients. (2) The Bradley-Terry loss only depends on the difference r(A)−r(B), so adding the same constant to every score leaves the training signal unchanged; the absolute level carries no information. (3) PPO only needs to compare rewards between candidate responses for the same prompt, so the absolute scale is irrelevant.

Absolute scores fail because: Different annotators have different baselines and scales. Regression to an absolute target (MSE loss) would fight this noise. The model would waste capacity trying to calibrate to individual annotator biases rather than learning the underlying quality ordering. The Bradley-Terry formulation elegantly sidesteps this by only asking "which is better?" — a question humans answer consistently.

Chapter 2: The RLHF Pipeline

RLHF (Reinforcement Learning from Human Feedback) is a three-stage pipeline that transforms a base model into an aligned assistant. Each stage builds on the last:

Stage 1: SFT
Fine-tune on high-quality demonstrations. The model learns the format of helpful responses.
Stage 2: Reward Model
Train on human preference pairs. The RM learns to score response quality.
Stage 3: PPO
Use RL to maximize the reward model's score while staying close to the SFT model.

Stage 1 (SFT) gives the model the right "shape" — it learns to follow instructions, use a helpful tone, and structure answers clearly. But SFT alone is limited by the quality and diversity of demonstrations.

Stage 2 (Reward Model) captures nuanced preferences that are hard to demonstrate. It's easier to judge than to demonstrate: a human can quickly say "A is better than B" even if writing the perfect response from scratch would take much longer.

Stage 3 (PPO) is where the magic happens. The policy model generates responses, the reward model scores them, and PPO updates the policy to produce higher-scoring outputs. A KL penalty prevents the model from drifting too far from the SFT checkpoint.

What's Running During PPO

The PPO training loop is surprisingly GPU-hungry. At any given moment, you need four models in memory:

Model | Role | Updated?
Policy πθ | Generates responses. This is the model being trained. | Yes (via PPO)
Reference πref | Frozen SFT checkpoint. Computes KL penalty. | No (frozen)
Reward model rφ | Scores generated responses. | No (frozen)
Value head Vψ | Estimates expected future reward (for advantage computation). | Yes (via PPO)

For a 7B model, that's roughly 4 × 14 GB = 56 GB just for weights in fp16 — before any activations or optimizer states. This is why RLHF is expensive and why simpler alternatives like DPO are appealing.
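The arithmetic behind that estimate, as a small helper. `weight_memory_gb` is a hypothetical name, and the calculation deliberately counts weights only (2 bytes per parameter in fp16, 1 GB taken as 1e9 bytes), ignoring activations, gradients, and optimizer state.

```python
def weight_memory_gb(n_params_billion, bytes_per_param=2, n_models=1):
    """Weights-only memory in GB; ignores activations and optimizer state."""
    return n_params_billion * bytes_per_param * n_models


# PPO-based RLHF: policy, reference, reward model, value model.
ppo_gb = weight_memory_gb(7, n_models=4)
# DPO: policy plus frozen reference.
dpo_gb = weight_memory_gb(7, n_models=2)

print(ppo_gb, dpo_gb)  # 56 28
```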

Why three stages? SFT alone can't capture subtle preferences. The reward model alone can't generate text. PPO alone would collapse without a good starting point. Each stage solves a different piece of the puzzle.
RLHF Pipeline Flow

Interactive pipeline diagram. Click each stage to highlight its role and data flow.

Click a stage to learn more about its role in the pipeline.
Check: What is the correct order of the RLHF pipeline?

Chapter 3: PPO for LLMs

Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. The idea: generate a response, score it with the reward model, then nudge the policy to make high-reward responses more likely — but not too much per step.

PPO uses a clipped surrogate objective. The ratio r_t = π(a|s) / π_old(a|s) measures how much the policy has changed. PPO clips this ratio to [1−ε, 1+ε], preventing destructive updates:

L^CLIP = min(r_t A_t, clip(r_t, 1−ε, 1+ε) A_t)

But there's a critical addition for LLMs: a KL penalty that keeps the policy close to a reference model (usually the SFT checkpoint). Without it, the model would "overoptimize" — finding weird outputs that score high on the reward model but are gibberish to humans.

R_total = R_reward(x, y) − β KL(π || πref)
Why KL matters: The reward model is an imperfect proxy for human preferences. If you optimize it too aggressively, the policy finds "adversarial examples" that exploit the reward model's blind spots. The KL penalty is a leash: the policy can learn, but can't wander into territory the reward model wasn't trained on.

The β Tradeoff in Practice

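The total reward for a generated response is R_total = r_model(y) − β · KL(π || πref). The coefficient β is one of the most important hyperparameters in RLHF training: set it too low and the policy drifts into reward hacking; set it too high and the policy barely moves away from the reference.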

In practice, teams often anneal β during training — starting high for stability, then lowering it as the reward model proves reliable.

PPO Clipped Objective

Adjust the advantage (positive = good action, negative = bad action) and the clip range ε. The teal curve is the clipped objective.

Advantage A = 1.0
Clip ε = 0.20
KL penalty β = 0.10
Symbol | Meaning
r_t | Probability ratio (new policy / old policy)
A_t | Advantage: how much better than expected
ε | Clip range (typically 0.1–0.2)
β | KL penalty coefficient
πref | Reference policy (SFT model)
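To see what clipping does, here is a per-sample version of the clipped objective in plain Python. The function name `clipped_surrogate` and the example ratios are illustrative; a real implementation would operate on batched tensors and average over tokens.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


for ratio in (0.5, 1.0, 1.5):
    good = clipped_surrogate(ratio, advantage=1.0)   # action better than expected
    bad = clipped_surrogate(ratio, advantage=-1.0)   # action worse than expected
    print(ratio, good, bad)
```

With a positive advantage the objective stops growing once the ratio exceeds 1 + ε, so there is no incentive to push a good action's probability further in one step. With a negative advantage the `min` keeps the unclipped (more pessimistic) value when the ratio is large, so the policy is still penalized for having raised the probability of a bad action.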

One PPO Iteration, Step by Step

1. Generate: Sample prompt x from dataset. Policy πθ generates response y = [t_1, ..., t_L].
2. Score: Reward model scores (x, y) → scalar r. Reference model computes log πref(y|x). KL = log πθ(y|x) − log πref(y|x).
3. Advantages: R_total = r − β · KL. Value head estimates V(x). Advantage A = R_total − V(x). Apply GAE for variance reduction.
4. Update: Multiple PPO epochs on the batch: update policy (maximize clipped advantage) and value head (minimize V loss).
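Steps 2 and 3 reduce to a few lines of arithmetic for a single sampled response. The sketch below is a simplified illustration: the function names and numbers are made up, the KL term is the single-sample estimate from step 2, and GAE is omitted (the advantage is just R_total − V(x)).

```python
def kl_penalized_reward(rm_score, policy_token_logps, ref_token_logps, beta=0.1):
    """R_total = r - beta * KL, with KL estimated from the sampled response."""
    log_pi = sum(policy_token_logps)   # log pi_theta(y|x)
    log_ref = sum(ref_token_logps)     # log pi_ref(y|x)
    kl_estimate = log_pi - log_ref
    return rm_score - beta * kl_estimate


def simple_advantage(total_reward, value_estimate):
    """A = R_total - V(x); no GAE in this sketch."""
    return total_reward - value_estimate


# Per-token log-probs of one sampled response under each model (made-up values).
policy_logps = [-0.2, -0.5, -0.1]
ref_logps = [-0.4, -0.9, -0.3]

r_total = kl_penalized_reward(1.2, policy_logps, ref_logps, beta=0.1)
adv = simple_advantage(r_total, value_estimate=0.5)
```

Here the policy assigns the response more probability than the reference does (KL estimate of about 0.8), so the reward-model score of 1.2 is reduced to about 1.12 before the advantage is computed.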
Check: Why is the KL penalty necessary in RLHF?
🔨 Derivation: KL Penalty Coefficient β and the Optimal Policy

The RLHF objective is: max_π E_{x~D, y~π}[r(x,y)] − β KL(π || πref). This is a KL-regularized optimization problem: maximize reward while paying a penalty for moving away from the reference.

Your task: Show that the optimal policy under this objective takes the form π*(y|x) ∝ πref(y|x) · exp(r(x,y) / β). What role does β play as a "temperature"?

KL(π || πref) = E_{y~π}[log π(y|x) − log πref(y|x)]. The objective becomes: E_{y~π}[r(x,y) − β log π(y|x) + β log πref(y|x)]. Group the terms involving π.
Rearrange (dividing by β): E_{y~π}[r(x,y)/β + log πref(y|x) − log π(y|x)]. This is maximized when π(y|x) ∝ πref(y|x) · exp(r(x,y)/β), which is the Gibbs/Boltzmann distribution from statistical mechanics!
As β → 0: exp(r/β) concentrates all mass on the highest-reward response (greedy). As β → ∞: exp(r/β) → 1 for all y, so π* = πref (no learning). β literally controls the "temperature" of how sharply the policy concentrates on high-reward outputs.

Full derivation:

1. Objective: max_π E_{y~π(y|x)}[r(x,y)] − β KL(π || πref)

2. Expand KL: = E_{y~π}[r(x,y) − β(log π(y|x) − log πref(y|x))]

3. Rearrange: = E_{y~π}[r(x,y) + β log πref(y|x)] − β E_{y~π}[log π(y|x)]

4. Recognize: The second term is β · H(π) (entropy). So we maximize: E[r + β log πref] + β H(π)

5. Variational solution: For any "energy" function f(y), the distribution maximizing E_π[f(y)] + T·H(π) is π*(y) ∝ exp(f(y)/T). Here f = r + β log πref and T = β.

6. Result: π*(y|x) ∝ exp((r(x,y) + β log πref(y|x)) / β) = πref(y|x) · exp(r(x,y)/β)

7. Normalizing: π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x), where Z(x) = Σ_y πref(y|x) exp(r(x,y)/β)

The key insight: β is literally a temperature parameter from the Boltzmann distribution in statistical mechanics. Low β = "cold" = sharply peaked at highest-reward outputs. High β = "hot" = spread out, close to reference. This is why β too low causes reward hacking (greedy exploitation) and β too high causes no learning (staying at reference).
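The temperature behavior can be seen on a toy problem with three candidate responses. Everything here is illustrative: `optimal_policy` is a hypothetical helper, and the reference probabilities and rewards are made up. With real language models the sum over all responses in Z(x) is intractable, which is why this only works as a toy.

```python
import math


def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a small discrete set."""
    # Subtract the max reward before exponentiating for numerical stability;
    # the shift cancels in the normalization.
    r_max = max(rewards)
    weights = [p * math.exp((r - r_max) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]


ref_probs = [0.5, 0.3, 0.2]   # reference policy over three responses
rewards = [0.0, 1.0, 2.0]     # the least likely response has the highest reward

for beta in (0.05, 1.0, 100.0):
    probs = optimal_policy(ref_probs, rewards, beta)
    print(beta, [round(p, 3) for p in probs])
```

At β = 0.05 almost all probability mass lands on the highest-reward response; at β = 100 the result is nearly indistinguishable from the reference distribution.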

🔗 Pattern Recognition
PPO as Policy Gradient — Same Algorithm, Different Reward
This Lesson (RLHF)
R_total = r_RM(x,y) − β KL(π||πref).
Policy gradient with a learned reward model as the signal.
RL Lesson (Standard PPO)
R_total = Σ_t γ^t r(s_t, a_t).
Policy gradient with an environment reward as the signal. → RL Algorithms

The PPO algorithm is identical in both cases — clip ratios, compute advantages via GAE, update for multiple epochs. The only difference is where the reward comes from: in RL, it's the environment; in RLHF, it's a learned neural network. The KL penalty in RLHF plays the same role as reward shaping in RL: it keeps the policy in a region where the reward signal is trustworthy.

The next time you see any optimization with a "stay close to a reference" constraint, recognize it as the same pattern: exploration bounded by trust.

🔗 Pattern Recognition
KL Divergence as Information-Theoretic Distance
This Lesson (RLHF)
KL(π || πref) = E_{y~π}[log π(y|x) − log πref(y|x)].
Measures how far the policy has drifted from the reference.
VAE Lesson (Latent Variable Models)
KL(q(z|x) || p(z)) = E_{z~q}[log q(z|x) − log p(z)].
Measures how far the encoder has drifted from the prior. → VAE & VQ-VAE

Both use KL as a "leash" that prevents optimization from going too far. In RLHF, it prevents reward hacking. In VAEs, it keeps the encoder's posterior close to the prior so the latent space stays well-behaved for sampling (set the weight too high and you get posterior collapse, where the latent code carries no information about the input). The pattern: whenever you optimize a flexible model against an objective, add a KL penalty to keep it in a trusted region. The strength of the penalty (β) trades off "how much can we learn" vs "how likely are we to break."

Where else in deep learning do you see a KL penalty stabilizing optimization? (Hint: think about knowledge distillation and variational inference.)

Chapter 4: DPO — Direct Preference Optimization

What if we could skip the reward model entirely? DPO (Rafailov et al., 2023) makes a beautiful mathematical observation: the optimal policy under the RLHF objective has a closed-form solution in terms of the reward function. By rearranging terms, we can express the reward implicitly through the policy itself:

r(x, y) = β log(π(y|x) / πref(y|x)) + C(x)

Substituting this into the Bradley-Terry preference model gives the DPO loss directly in terms of the policy log-probabilities — no reward model needed:

L_DPO = −log σ(β [log(π(y_w|x)/πref(y_w|x)) − log(π(y_l|x)/πref(y_l|x))])

where y_w is the preferred (winning) response and y_l is the rejected (losing) response. DPO increases the relative log-probability of preferred responses while decreasing that of rejected ones, all while staying anchored to the reference model through the log-ratio terms.

DPO in Practice

Training DPO requires computing four forward passes per batch: π(y_w|x), π(y_l|x), πref(y_w|x), πref(y_l|x). The reference model is frozen (no gradients). Each forward pass returns the log-probability of the response: sum of log-probs per token. The loss is a single sigmoid cross-entropy on the gap between the two log-ratios. No sampling, no reward model, no RL loop. Just standard supervised training with a clever loss.
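The "sum of log-probs per token" step is where most DPO implementation bugs hide, so here is a minimal sketch of it. The function name `sequence_logprob` and the mask convention are assumptions for this example; it also assumes the logits have already been shifted so that position t predicts `labels[:, t]`.

```python
import torch
import torch.nn.functional as F


def sequence_logprob(logits, labels, response_mask):
    """Sum of per-token log-probs over response tokens only.

    logits:        [B, L, V] next-token logits, already aligned with labels
    labels:        [B, L]    target token ids
    response_mask: [B, L]    1.0 for response tokens, 0.0 for prompt/padding
    """
    log_probs = F.log_softmax(logits, dim=-1)                             # [B, L, V]
    token_logps = log_probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [B, L]
    return (token_logps * response_mask).sum(dim=-1)                      # [B]


# Toy check: uniform logits over a 4-token vocabulary give log(1/4) per token.
logits = torch.zeros(1, 4, 4)
labels = torch.tensor([[0, 1, 2, 3]])
mask = torch.tensor([[0.0, 1.0, 1.0, 1.0]])   # first position is the prompt
logp = sequence_logprob(logits, labels, mask)  # 3 * log(1/4)
```

Masking out the prompt matters: only the response tokens differ between the chosen and rejected sequences, and including prompt tokens just adds noise to the log-ratio.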

Typical hyperparameters: β = 0.1–0.5 (controls deviation from reference), learning rate 5e-7 to 1e-6 (very low, since you're fine-tuning a fine-tuned model), 1–3 epochs over the preference dataset. Memory: only 2 models (policy + frozen reference) instead of 4 for RLHF.

The Implicit Reward

Here's the key derivation that makes DPO work. The optimal policy under the KL-constrained RLHF objective is:

π*(y|x) = πref(y|x) · exp(r(x,y) / β) / Z(x)

Rearranging for r gives us: r(x,y) = β log(π(y|x)/πref(y|x)) + β log Z(x). The partition function Z(x) cancels when we take the difference r(y_w) − r(y_l) inside the Bradley-Terry model. This is why DPO works: it's implicitly computing a reward function where the reward of any response is just β times its log-probability ratio relative to the reference. Higher probability under π (relative to πref) = higher implicit reward.

The elegance: DPO collapses the last two stages (RM → PPO) into a single supervised learning problem that starts from the SFT model. No RL, no reward model training, no sampling during training. Just a clever loss function applied directly to preference pairs. In practice, DPO is simpler to implement, more stable to train, and often matches RLHF performance.
Interactive: DPO Loss Landscape

Adjust the log-probability ratios for winning and losing responses. The DPO loss pushes the gap apart.

log π/πref (win) = 1.0
log π/πref (lose) = −0.5
β (temperature) = 0.50
DPO Loss = −log σ(0.5 × 1.5) ≈ 0.39
Check: What is DPO's main advantage over RLHF?
🔨 Derivation: DPO Loss from the RLHF Objective

Given: The optimal RLHF policy is π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x). The Bradley-Terry preference model is P(y_w > y_l) = σ(r(y_w) − r(y_l)).

Your task: Derive the DPO loss by (1) solving for r(x,y) in terms of π and πref, then (2) substituting into Bradley-Terry to eliminate the reward function entirely.

Take log of both sides of π*(y|x) = πref(y|x) exp(r/β) / Z(x): log π*(y|x) = log πref(y|x) + r(x,y)/β − log Z(x). Rearrange: r(x,y) = β [log π*(y|x) − log πref(y|x)] + β log Z(x).
P(y_w > y_l) = σ(r(y_w) − r(y_l)) = σ(β[log(π(y_w|x)/πref(y_w|x)) − log(π(y_l|x)/πref(y_l|x))] + β log Z − β log Z). The partition function Z(x) cancels!
Since P(y_w > y_l) = σ(β[log(π/πref)(y_w) − log(π/πref)(y_l)]), the NLL loss is just −log of this expression. That's the DPO loss!

Full derivation:

1. Start with optimal policy: π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x)

2. Solve for reward: r(x,y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)

3. Substitute into Bradley-Terry:
P(y_w > y_l) = σ(r(y_w) − r(y_l))
= σ(β log(π*(y_w|x)/πref(y_w|x)) + β log Z − β log(π*(y_l|x)/πref(y_l|x)) − β log Z)
= σ(β[log(π*(y_w|x)/πref(y_w|x)) − log(π*(y_l|x)/πref(y_l|x))])

4. The Z(x) terms cancel! This is the crucial step. The intractable partition function vanishes when we take the difference.

5. DPO loss: L_DPO = −log σ(β[log(πθ(y_w|x)/πref(y_w|x)) − log(πθ(y_l|x)/πref(y_l|x))])

The key insight: The partition function Z(x) — which requires summing over ALL possible responses (intractable!) — cancels because we only need the difference in rewards. This is why DPO can bypass the reward model: the thing that made RL necessary (computing expectations over the policy) gets algebraically eliminated. Brilliant.

💻 Build It: Implement the DPO Loss from Scratch
You have log-probabilities from two models (policy and reference) for both the winning and losing responses. Implement the DPO loss that trains the policy to prefer the winning response.
signature
def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # [B] log P(y_w | x) under policy
    policy_rejected_logps: torch.Tensor,  # [B] log P(y_l | x) under policy
    ref_chosen_logps: torch.Tensor,       # [B] log P(y_w | x) under reference
    ref_rejected_logps: torch.Tensor,     # [B] log P(y_l | x) under reference
    beta: float = 0.1                     # temperature parameter
) -> torch.Tensor:
    """Returns scalar DPO loss (mean over batch)."""
Test case
policy_chosen_logps = torch.tensor([-1.0, -0.5])
policy_rejected_logps = torch.tensor([-2.0, -3.0])
ref_chosen_logps = torch.tensor([-1.2, -0.8])
ref_rejected_logps = torch.tensor([-1.8, -2.5])
beta = 0.1
Expected: loss ≈ 0.664 (verify: gap = beta * [((-1+1.2) - (-2+1.8)), ((-0.5+0.8) - (-3+2.5))] = 0.1*[(0.2-(-0.2)), (0.3-(-0.5))] = 0.1*[0.4, 0.8] = [0.04, 0.08]; loss = mean(-log(sigmoid([0.04, 0.08]))) = mean([0.673, 0.654]) ≈ 0.664)
log(π(y|x) / πref(y|x)) = log π(y|x) − log πref(y|x). So the chosen log-ratio = policy_chosen_logps − ref_chosen_logps. Same for rejected. Then scale the difference by beta.
python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1
) -> torch.Tensor:
    # Log-ratios: how much more likely under policy vs reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The "implicit reward gap" scaled by beta
    logits = beta * (chosen_logratios - rejected_logratios)

    # Negative log-sigmoid = binary cross-entropy with label=1
    loss = -F.logsigmoid(logits).mean()
    return loss
Bonus challenge: Add label smoothing (mix in a small probability that the "rejected" response is actually preferred). This helps with noisy human labels. How does this change the loss formula?
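One possible answer to the bonus challenge, offered as a sketch rather than a reference solution: treat each label as correct with probability 1 − ε and flipped with probability ε, so the loss becomes a mix of the usual term and its mirror image. The function name `dpo_loss_smoothed` and the `label_smoothing` parameter are names chosen for this example.

```python
import torch
import torch.nn.functional as F


def dpo_loss_smoothed(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
    label_smoothing: float = 0.0,
) -> torch.Tensor:
    logits = beta * (
        (policy_chosen_logps - ref_chosen_logps)
        - (policy_rejected_logps - ref_rejected_logps)
    )
    # With probability (1 - eps) the label is right: -log sigma(logits).
    # With probability eps it is flipped:            -log sigma(-logits).
    loss = (
        -(1.0 - label_smoothing) * F.logsigmoid(logits)
        - label_smoothing * F.logsigmoid(-logits)
    )
    return loss.mean()


args = (
    torch.tensor([-1.0, -0.5]),   # policy_chosen_logps
    torch.tensor([-2.0, -3.0]),   # policy_rejected_logps
    torch.tensor([-1.2, -0.8]),   # ref_chosen_logps
    torch.tensor([-1.8, -2.5]),   # ref_rejected_logps
)
plain = dpo_loss_smoothed(*args, beta=0.1, label_smoothing=0.0)
smoothed = dpo_loss_smoothed(*args, beta=0.1, label_smoothing=0.1)
```

With `label_smoothing=0.0` this reduces to the standard DPO loss on the test case. With smoothing, the loss no longer rewards pushing the gap to infinity, which limits how confidently the policy can fit a label that may be wrong.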
⚔ Adversarial: DPO trains for 3 epochs on a preference dataset. Loss decreases steadily. But when you evaluate, the model gives worse responses than the SFT baseline. What happened?
You're training Llama-7B with DPO on 50K preference pairs collected from GPT-4 (as the "chosen") vs the SFT model's own outputs (as "rejected"). Beta = 0.1, LR = 5e-7, 3 epochs.

Chapter 5: KTO & Simpler Methods

DPO still requires paired preferences (A vs B for the same prompt). But what if you only have binary feedback — thumbs up or thumbs down on individual responses? KTO (Kahneman-Tversky Optimization) works with unpaired data by leveraging prospect theory: humans feel losses more strongly than equivalent gains.

The KTO loss treats desirable and undesirable outputs asymmetrically. For a good response, it encourages the log-ratio to be high. For a bad response, it penalizes the log-ratio being high. The asymmetry mirrors human loss aversion:

L_KTO = λ · (1 − σ(β · gap))   for desirable responses
L_KTO = 1 − σ(−β · gap)        for undesirable responses
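The asymmetric loss can be sketched directly from the formula above. This is a simplified illustration, not the reference KTO implementation: it assumes `gap` is a precomputed per-example tensor (the policy/reference log-ratio measured against some reference point), and it applies the weight `lam` only to the desirable branch, as the formula does.

```python
import torch


def kto_loss(gap: torch.Tensor, is_desirable: torch.Tensor,
             beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    """Unpaired binary-feedback loss following the two-branch formula.

    gap:          [B] log-ratio relative to a reference point (assumed given)
    is_desirable: [B] bool, True for thumbs-up examples
    """
    desirable_loss = lam * (1.0 - torch.sigmoid(beta * gap))
    undesirable_loss = 1.0 - torch.sigmoid(-beta * gap)
    return torch.where(is_desirable, desirable_loss, undesirable_loss).mean()


gap = torch.tensor([2.0, 2.0])
labels = torch.tensor([True, False])
loss = kto_loss(gap, labels, beta=1.0)
```

A large positive gap lowers the loss on a desirable example and raises it on an undesirable one, so each example pushes the policy in the right direction without needing a paired counterpart.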

ORPO (Odds Ratio Preference Optimization) takes yet another approach: it combines the SFT loss with a preference signal in a single training objective, using the odds ratio of generating preferred vs rejected responses.

Method | Data Required | Stages | Key Idea
RLHF | Paired preferences | SFT → RM → PPO | Train reward model, then optimize via RL
DPO | Paired preferences | SFT → DPO | Direct loss on preference pairs
KTO | Binary (good/bad) | SFT → KTO | Loss-averse binary feedback
ORPO | Paired preferences | Single stage | Combine SFT + odds-ratio preference
Trend: The field is moving toward simpler methods. RLHF requires the most infrastructure (RL loop, reward model, reference model all in memory). DPO simplifies this considerably. KTO simplifies the data requirements. ORPO simplifies the pipeline to a single pass.

The Simplicity Gradient

Each method trades off against the others in a clear pattern:

Method Comparison

Adjust the complexity and data axes to see how each method trades off simplicity vs data requirements.

Check: What kind of data does KTO need (unlike DPO)?

Chapter 6: Constitutional AI

Human annotation is expensive and doesn't scale. Anthropic's Constitutional AI (CAI) asks: can the AI critique and revise itself using a set of written principles (a "constitution")? The answer is yes, and it works surprisingly well.

CAI has two phases. In the critique-revision phase, the model generates a response, then is asked to evaluate it against principles like "is this harmful?" or "is this honest?" and rewrite it. The revised responses become the SFT training data. In the RLAIF phase (RL from AI Feedback), the AI itself generates preference labels instead of human annotators.

Constitution: A set of principles: "Be helpful," "Avoid harm," "Be honest," etc.
Critique: Model reviews its own response against each principle.
Revision: Model rewrites response to address critiques.
RLAIF: AI-generated preferences used for RL training (instead of human labels).
Why this matters: Human feedback is the bottleneck. If we need millions of preference labels, we can't hire enough annotators. CAI provides a scalable alternative: the model's own judgment, guided by explicit principles. The constitution makes the values transparent and auditable.
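The critique-revision phase is essentially a loop over principles. The sketch below shows the control flow only: `generate` is a hypothetical callable standing in for any text-generation function, and the prompt wording is invented for illustration, not taken from the actual constitution.

```python
from typing import Callable, List


def critique_and_revise(generate: Callable[[str], str], prompt: str,
                        principles: List[str]) -> str:
    """Draft a response, then critique and rewrite it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Response: {response}\nCritique: {critique}"
        )
    return response  # the final revision becomes SFT training data


# Stub model for demonstration: records each call and returns a numbered draft.
calls = []

def fake_generate(text: str) -> str:
    calls.append(text)
    return f"draft-{len(calls)}"

final = critique_and_revise(fake_generate, "How do I pick a lock?",
                            ["Avoid harm", "Be honest"])
```

Each principle costs two model calls (one critique, one revision) on top of the initial draft, which is why the number of principles applied per example is usually kept small.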
Constitutional Critique-Revision

Watch how a response improves through rounds of self-critique. Each round applies a principle from the constitution.

Check: In Constitutional AI, who provides the preference labels for RL?
🏗 Design Challenge: You're the Architect: Alignment Pipeline for a Code Assistant
You're building a code-generation assistant (think Copilot/Cursor). The model must be helpful for coding tasks while refusing to generate malware, avoiding insecure patterns, and admitting when it's unsure. Design the full alignment pipeline from preference data collection through training.
Budget: $200K for annotation, 64 A100s for 2 weeks
Base model: 34B parameter, code-pretrained (SFT already done)
Safety req: Must refuse malware/exploits but still help with security research and pentesting
Quality req: Code must compile, pass unit tests, follow best practices
1. What preference data do you collect? Human-written pairs, or model-generated? How do you handle the security edge case (pentesting is legitimate, but the same code could be malware)?
2. Do you use RLHF, DPO, or RLAIF? Given 64 A100s and a 34B model, can you fit RLHF in memory? What are the tradeoffs?
3. How do you define "correctness" for code? A reward model trained on human preferences might prefer "looks right" over "actually works." Do you incorporate execution feedback?
4. How do you handle the tension between helpfulness and safety? A user asks "write a function that scans all open ports on a network" — is this pentesting (allowed) or reconnaissance for an attack (refused)?

Real-world approaches (composite of Codex, StarCoder, DeepSeek-Coder):

Data: Hybrid approach. (1) Generate N=8 completions per prompt, run unit tests, rank by test pass rate — this gives "execution-verified" preferences for free. (2) For safety, use constitutional approach: define security principles, have the model self-evaluate edge cases, collect human labels only for truly ambiguous cases (5-10% of budget). (3) For style/quality, use human annotators who are actual developers (NOT crowd workers who can't evaluate code).

Method: DPO is the dominant choice for code models at this scale. RLHF with a 34B model keeps 4 × 34B = 136B parameters resident, about 272 GB in fp16 for weights alone, before optimizer states and activations; on 64 A100s that is feasible only with aggressive sharding. DPO needs 2 × 34B = 68B (about 136 GB of weights), leaving far more headroom for activations. Some teams (e.g., DeepSeek) use iterative DPO: train, generate new responses, re-rank, train again.

Correctness: Multi-signal reward combining (1) execution pass rate (binary, high signal), (2) static analysis (linting, type checking), (3) human preference for readability. Weight execution at 3x human preference since "works" matters more than "looks nice."
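A minimal sketch of such a multi-signal reward, under stated assumptions: the function name and signature are hypothetical, each signal is scaled to [0, 1], execution is weighted 3x relative to human preference as described, and the weight on static analysis (not specified above) is set to 1 purely for illustration.

```python
def combined_reward(exec_pass_rate: float, static_ok: bool, human_pref: float,
                    w_exec: float = 3.0, w_static: float = 1.0,
                    w_human: float = 1.0) -> float:
    """Weighted average of three signals, each in [0, 1].

    w_static is an assumed value; only the 3:1 exec-to-human ratio is given.
    """
    total = (
        w_exec * exec_pass_rate
        + w_static * (1.0 if static_ok else 0.0)
        + w_human * human_pref
    )
    return total / (w_exec + w_static + w_human)


# Working but plain code vs. polished code that fails every test.
works_but_ugly = combined_reward(exec_pass_rate=1.0, static_ok=True, human_pref=0.2)
pretty_but_broken = combined_reward(exec_pass_rate=0.0, static_ok=True, human_pref=0.9)
```

With these weights the working solution scores 0.84 and the broken one 0.38, so "looks right" cannot outrank "actually works."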

Safety boundary: Context-dependent classification. The SAME code (port scanner) can be safe or unsafe depending on context. Approach: train a separate intent classifier, or use the constitutional method where principles distinguish "I'm a security researcher testing my own systems" from "I want to attack someone." Default to helpful (write the code) with a safety disclaimer. Over-refusal is a worse product failure than occasional edge-case compliance.

Chapter 7: Process Reward Models

Standard reward models (Outcome Reward Models, or ORMs) score the final answer. But for multi-step reasoning, the final answer might be right for the wrong reasons — or wrong because of a single bad step in an otherwise sound chain. Process Reward Models (PRMs) score each step of the reasoning process.

PRMs provide much denser supervision. Instead of one score for the whole response, you get a score per step. This helps in at least three ways: (1) better credit assignment (which step went wrong?), (2) more training signal per example, and (3) the ability to do tree search over reasoning paths at inference time.

ORM vs PRM: An ORM is like grading an exam by only looking at the final answer. A PRM is like grading each step of the work. The PRM catches errors earlier and provides more useful feedback for learning.
PRM vs ORM Scoring

A reasoning chain with 5 steps. The ORM scores only the final answer. The PRM scores each step. Click steps to toggle correctness and see how scores change.

Feature | ORM | PRM
Granularity | Final answer only | Each reasoning step
Credit assignment | Poor (reward shared across all steps) | Good (per-step feedback)
Annotation cost | Low (check final answer) | High (verify each step)
Search capability | Limited | Enables best-of-N and tree search
Best for | Short, single-step tasks | Multi-step reasoning (math, code)
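A small sketch of what per-step scores buy you. The step scores are made-up numbers, the function names are hypothetical, and aggregating a chain by its minimum step score is one common choice assumed here (a product of step scores is another).

```python
def first_bad_step(step_scores, threshold=0.5):
    """Index of the first step scored below threshold, or None if all pass."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None


def chain_score(step_scores):
    """Aggregate with min (assumption): a chain is as good as its weakest step."""
    return min(step_scores)


# PRM output for a 5-step reasoning chain: step 3 (index 2) went wrong.
prm_scores = [0.95, 0.90, 0.20, 0.85, 0.90]

bad_index = first_bad_step(prm_scores)   # 2
overall = chain_score(prm_scores)        # 0.2
```

An ORM would return a single number for this chain and could not say where the error occurred; the per-step scores localize it, which is what makes pruning a search tree at the faulty step possible.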
Check: What is the main advantage of a PRM over an ORM?

Chapter 8: Reward Hacking

Here's the dark side of optimization: the model will find loopholes. If the reward model gives higher scores to longer responses, the model learns to be verbose. If the reward model prefers confident-sounding text, the model learns to sound confident even when wrong. This is reward hacking — the policy exploits imperfections in the proxy reward.

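Common failure modes include length bias (verbosity), overconfident phrasing, and sycophancy (telling users what they want to hear).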

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is a proxy for human preferences, not the real thing. Push optimization pressure hard enough and the proxy breaks down.
Reward Hacking in Action

Watch the proxy reward climb while true quality degrades. The green line is proxy reward, the teal line is true quality. Adjust KL penalty to see the mitigation effect.

KL penalty β = 0.05
Training steps = 100

The Overoptimization Curve

There's a characteristic pattern in RLHF training. Plot proxy reward (from the RM) and true reward (from held-out human evaluations) against KL divergence from the reference. Early on, both curves rise together — the model genuinely improves. But past a critical KL threshold, proxy reward keeps climbing while true reward peaks and then drops. The model has found adversarial inputs that game the RM.

Gao et al. (2022) showed this empirically across multiple model sizes. How far the policy can move before true reward turns over depends on the sizes of the reward model and the policy, which is why β must be tuned per setup: there's no universal value.

Mitigation Strategies

Strategy | How It Helps
KL penalty | Limits how far the policy can drift from the reference
Reward model ensembles | Multiple RMs make it harder to find shared blind spots
Length normalization | Score per-token rather than per-response
Adversarial training | Deliberately look for hacking patterns and retrain
Regular RM refresh | Retrain the RM on the policy's new distribution
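
The first two rows of the table can be sketched in a few lines. This is a minimal illustration under stated assumptions: the KL term uses the common single-sample estimate log π(y|x) − log π_ref(y|x), and the ensemble is combined as mean minus k standard deviations, which is one conservative aggregation among several. All numbers and function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def kl_shaped_reward(
    rm_score: float, logp_policy: float, logp_ref: float, beta: float
) -> float:
    """Reward used for RL: RM score minus beta times a per-sample KL estimate."""
    kl_estimate = logp_policy - logp_ref  # log pi(y|x) - log pi_ref(y|x)
    return rm_score - beta * kl_estimate


def conservative_ensemble(scores: List[float], k: float = 1.0) -> float:
    """Mean minus k standard deviations: RM disagreement lowers the reward."""
    return mean(scores) - k * pstdev(scores)


# A response the policy now finds far more likely than the reference did.
drifted = dict(rm_score=4.2, logp_policy=-40.0, logp_ref=-90.0)
print("beta=0.02:", kl_shaped_reward(beta=0.02, **drifted))
print("beta=0.10:", kl_shaped_reward(beta=0.10, **drifted))

agree = [4.0, 4.1, 3.9]     # three RMs roughly agree
disagree = [4.8, 1.0, 4.6]  # one RM strongly objects
print("ensemble (agree):   ", round(conservative_ensemble(agree), 3))
print("ensemble (disagree):", round(conservative_ensemble(disagree), 3))
```

A larger β makes the same drifted response less attractive, and an output that only some reward models like is scored lower than one they all rate moderately well. Both effects shrink the region where a single RM's blind spot can be exploited.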
Check: Sycophancy is an example of reward hacking because...
💥 Break-It Lab: What Dies When You Remove RLHF Components?
A working RLHF training loop with KL penalty, reward model, and clipping. Toggle components off to see the specific failure mode each one prevents. The green curve is true quality; the orange curve is proxy reward.
Remove KL Penalty (β=0)
Failure mode: Without the KL leash, the policy drifts far from the reference into out-of-distribution territory, where the reward model's scores become meaningless. The policy finds "adversarial" outputs that score high on the RM but read as gibberish to humans. Proxy reward keeps climbing while true quality collapses. This is Goodhart's Law in action.
Remove PPO Clipping (ε=∞)
Failure mode: Without clipping, a single high-advantage sample can cause an enormous policy update. The policy "overshoots": probability ratios can jump to 10x or 100x in one step. On the next step the model generates from this wildly different policy, the advantage estimates become unreliable, and training turns unstable (oscillating loss, occasional mode collapse).
Set β Too High (β=5.0)
Failure mode: The KL penalty dominates the objective. Every gradient step that moves away from πref gets punished more than it gets rewarded. The policy barely changes — you've spent millions on compute for a model that's essentially the SFT checkpoint with minor cosmetic differences. Training "succeeds" (loss is low) but the model hasn't learned anything new.
Out-of-Distribution Reward Model
Failure mode: The reward model was trained on short, factual Q&A pairs but the policy generates long creative stories. The RM's scores on this unseen distribution are unreliable: sometimes very high, sometimes very low. The policy learns to exploit whichever OOD pattern accidentally gets high scores. This is why you must train the RM on the same distribution the policy will generate from.
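
The clipping failure mode above is easy to see numerically. Below is a minimal sketch of the standard PPO clipped surrogate for a single sample, written with plain floats. The default ε = 0.2 and the example ratio and advantage are illustrative assumptions, not values taken from the lab.

```python
import math


def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO objective for one sample: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    unclipped = ratio * advantage
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(unclipped, clipped_ratio * advantage)


# A high-advantage sample whose probability ratio has jumped to 10x.
ratio, advantage = 10.0, 2.0

with_clip = clipped_surrogate(ratio, advantage)                # capped at 1.2 * 2.0
without_clip = clipped_surrogate(ratio, advantage, eps=math.inf)  # full 10 * 2.0

print("objective with clipping:   ", with_clip)
print("objective without clipping:", without_clip)
```

Once the ratio passes 1 + ε on a positive-advantage sample, the clipped objective is flat in the ratio, so that sample stops pushing the policy further. With ε = ∞ the objective keeps growing with the ratio, which is the overshoot described above.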
⚔ Adversarial: Your RLHF model scores 4.2/5 on the reward model (up from 3.1 for the SFT baseline). But in blind human evaluation, raters prefer the SFT baseline 62% of the time. What happened and what do you do?
The reward model was trained 3 months ago on 100K preference pairs. Since then, you've run 50K PPO steps with β=0.02. The model's responses have become noticeably longer (avg 450 tokens vs 180 for SFT) and use more formatting (bullet points, headers).
Checkpoint — Before you move on
In your own words: why is reward hacking inevitable rather than merely possible? Connect this to Goodhart's Law and explain why no single mitigation (KL penalty, ensembles, etc.) can fully solve it.
Model Answer

It's inevitable because: The reward model is a finite neural network trained on a finite dataset. It has limited capacity and has only seen a tiny fraction of all possible outputs. Any such model necessarily has blind spots — regions of output space where its predictions are uncalibrated. An RL optimizer, given enough steps, will find these blind spots and exploit them. This is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure."

Why no single fix works: KL penalty slows exploitation but doesn't prevent it (the model still drifts, just more slowly). RM ensembles reduce shared blind spots but every finite ensemble has gaps. Length normalization fixes one exploit but the policy finds others. The fundamental issue is that we're using an imperfect proxy for an unspecifiable target ("what humans want"). The only true solution would be a perfect reward model, which would require capturing all of human values in a neural network. That's the alignment problem itself, recursively.

Chapter 9: Open Problems

Alignment research is far from solved. As models become more capable, the challenges deepen. Here are the frontier problems that keep researchers up at night:

Scalable Oversight

How do you supervise a model that can perform tasks beyond human ability? If the model writes code that's too complex for the annotator to evaluate, or reasons about problems the human can't verify, the entire preference-based framework breaks down. Proposals include recursive reward modeling (use AI to help humans evaluate AI), debate (two AIs argue, human judges), and market-based mechanisms.

Weak-to-Strong Generalization

Can a weaker model supervise a stronger one? If GPT-2-level labels are used to fine-tune a GPT-4-level model, can the strong model generalize beyond the quality of its supervisor? Early results from OpenAI suggest this partially works — the strong model can "figure out" what the weak supervisor meant, analogous to a smart student learning from an imperfect teacher. But it doesn't fully close the gap.
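
"Closing the gap" can be made quantitative. The weak-to-strong work reports performance gap recovered (PGR): the fraction of the gap between the weak supervisor and the strong model's ceiling (trained on ground-truth labels) that is recovered when the strong model is trained on weak labels. A small sketch follows; the accuracy numbers are hypothetical.

```python
def performance_gap_recovered(
    weak: float, weak_to_strong: float, strong_ceiling: float
) -> float:
    """PGR = (weak_to_strong - weak) / (strong_ceiling - weak).

    1.0 means the strong model fully matched its ceiling despite weak labels;
    0.0 means it did no better than its weak supervisor.
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies on some task.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80)
print(f"PGR = {pgr:.2f}")
```

A PGR strictly between 0 and 1 is what "partially works" looks like: the strong model generalizes beyond its supervisor but falls short of what ground-truth supervision would give.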

Interpretability for Alignment

Can we look inside the model and verify its "values" directly? If we could read the model's internal representations and confirm it's being helpful for the right reasons (not just gaming the reward), that would be far more robust than any behavioral test. Mechanistic interpretability aims to make this possible.

The big picture: Current alignment is behavioral — we judge models by their outputs. Future alignment may be mechanistic — we verify models by their internals. The shift from "does it behave well?" to "does it reason well?" is the frontier.
Alignment Research Landscape

The map of open alignment problems. Each node represents a research direction. Hover to see connections.

Problem | Core Question | Status
Scalable oversight | How to evaluate superhuman outputs? | Active research
Weak-to-strong | Can weak supervisors train strong models? | Promising early results
Interpretability | Can we verify alignment by reading internals? | Rapid progress
Robustness | Does alignment hold under distribution shift? | Largely unsolved
Multi-agent | How to align interacting agents? | Early stage
"The AI alignment problem is not about making AI obey us. It's about making AI understand what we actually mean."
— paraphrasing Stuart Russell

You now understand the full landscape of alignment: from RLHF to DPO, from reward hacking to the open frontier. The field is young, the problems are deep, and the stakes are high.

Check: What is the "scalable oversight" problem?