microAlignment — Reward Models & Alignment for LLMs

Chapter 1: Reward Modeling

How do you teach a model what "good" means? You ask humans. Given a prompt, generate two candidate responses A and B. Show them to a human annotator who picks the better one. Collect thousands of these preference pairs and train a separate model — the reward model — to predict which response a human would prefer.

The reward model takes a (prompt, response) pair and outputs a scalar score. Higher score = more likely to be preferred. The key insight: we don't need humans to assign absolute scores (that's noisy and inconsistent). We only need relative judgments: "A is better than B." This is the Bradley-Terry model:

P(A > B) = σ(r(A) − r(B))

where σ is the sigmoid function and r(·) is the reward model's score. The loss function pushes the reward for the preferred response above the rejected one:

L = −log σ(r(chosen) − r(rejected))
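To make the two formulas concrete, here is a minimal pure-Python sketch. The scores are made up for illustration, and the helper names (`preference_probability`, `pairwise_loss`) are not from any library.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic function: sigma(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def preference_probability(r_a: float, r_b: float) -> float:
    """Bradley-Terry probability that A is preferred over B."""
    return sigmoid(r_a - r_b)

def pairwise_loss(r_chosen: float, r_rejected: float) -> float:
    """Per-pair loss: -log sigma(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Illustrative scores (assumed, not real model outputs): gap = 2.0
p_a_wins = preference_probability(1.5, -0.5)
loss_if_a_chosen = pairwise_loss(1.5, -0.5)   # model agrees with the label
loss_if_b_chosen = pairwise_loss(-0.5, 1.5)   # model disagrees with the label

print(f"P(A > B) = {p_a_wins:.3f}")
print(f"loss if A is the chosen response = {loss_if_a_chosen:.3f}")
print(f"loss if B is the chosen response = {loss_if_b_chosen:.3f}")
```

The loss is small when the model already ranks the chosen response higher and grows roughly linearly with the gap when it ranks the pair the wrong way round.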

Reward Model Architecture

The reward model is typically the same architecture as the LLM itself, but with the language modeling head replaced by a scalar head. The data flow:

Input

Concatenated [prompt + response] tokenized → [B, L] token IDs

↓

LLM Backbone

Transformer layers → [B, L, d_model] hidden states

↓

Pool

Take last token's hidden state → [B, d_model]

↓

Scalar Head

Linear(d_model, 1) → [B, 1] reward score

For a 7B reward model with d_model=4096: the entire backbone processes the prompt+response, the last token's representation is pooled, and a single linear layer projects 4096 dimensions down to 1 scalar. The backbone is usually initialized from the SFT checkpoint — it already understands language; it just needs to learn to evaluate quality.
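The pooling and scalar-head steps can be sketched without any deep learning library. The version below uses nested lists in place of tensors and assumes right-padded batches with an attention mask; the function names and the toy numbers are illustrative only. In a real model the same logic would run on `[B, L, d_model]` tensors.

```python
from typing import List

def last_token_index(attention_mask: List[int]) -> int:
    """Index of the last real (non-padding) token, assuming right padding."""
    return sum(attention_mask) - 1

def pool_last_token(hidden: List[List[List[float]]],
                    mask: List[List[int]]) -> List[List[float]]:
    """[B, L, d_model] hidden states -> [B, d_model] pooled vectors."""
    return [seq[last_token_index(m)] for seq, m in zip(hidden, mask)]

def scalar_head(pooled: List[List[float]],
                weight: List[float],
                bias: float = 0.0) -> List[float]:
    """Linear(d_model, 1): one dot product per sequence -> one reward each."""
    return [sum(w * h for w, h in zip(weight, vec)) + bias for vec in pooled]

# Toy batch: B=2, L=3, d_model=2. The second sequence ends with a padding slot.
hidden = [
    [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
    [[1.0, 0.0], [0.0, 1.0], [9.9, 9.9]],  # last position is padding
]
mask = [[1, 1, 1], [1, 1, 0]]
weight = [1.0, -1.0]  # stand-in for the learned head weights

pooled = pool_last_token(hidden, mask)
rewards = scalar_head(pooled, weight)
print(pooled, rewards)
```

The detail worth noticing is the mask: taking position `L-1` blindly would pool a padding vector for the second sequence.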

Training the Reward Model

Each training example is a tuple: (prompt, chosen_response, rejected_response). Forward both responses through the RM to get r(chosen) and r(rejected). The Bradley-Terry loss pushes the gap apart:

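python
import torch.nn.functional as F

r_chosen   = reward_model(prompt + chosen)    # reward per example, shape [B]
r_rejected = reward_model(prompt + rejected)  # reward per example, shape [B]
# logsigmoid is numerically safer than log(sigmoid(x)); mean() averages over the batch
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()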

Typical dataset sizes: 50K–500K preference pairs. Training: 1–3 epochs, learning rate ~1e-5, batch size 64. The RM converges quickly because it's initialized from a strong LLM, so most of the work is already done. Training mainly fits the new scalar head and nudges the backbone toward judging "quality."

Why pairs, not scores? Asking "rate this response 1-10" is unreliable — one annotator's 7 is another's 5. But asking "which is better?" gives consistent signal. Humans are much better at comparisons than absolute judgments.

Reward Model Training

Drag the slider to adjust the reward gap between chosen and rejected responses. Watch how the Bradley-Terry loss changes.

Example state: r(chosen) = 1.5, r(rejected) = −0.5, so gap = 2.0 and Loss = −log σ(gap) ≈ 0.13

Check: What does the reward model learn to predict?

The next token in a sequence
Which response a human would prefer
The grammatical correctness of text

🔨 Derivation: Bradley-Terry Loss from Maximum Likelihood

Given: A reward model r(·) that assigns scalar scores to responses. Human annotators provide N preference pairs where response A is preferred over response B. The Bradley-Terry model says P(A > B) = σ(r(A) − r(B)).

Your task: Derive the training loss L = −log σ(r(chosen) − r(rejected)) from the maximum likelihood principle. Why is this equivalent to binary cross-entropy?

If human says A > B, the likelihood under our model is P(A > B) = σ(r(A) − r(B)). We want to maximize this probability.

Maximizing P is equivalent to maximizing log P (log is monotonic). For N pairs: max Σ log σ(r(chosen_i) − r(rejected_i)). Minimizing the negative gives us the loss.

Binary cross-entropy with label y=1 and predicted probability p is: −log(p). Here our "label" is always 1 (chosen IS preferred) and our predicted probability is σ(gap). So L = −log(σ(gap)) = binary CE with y=1.

Full derivation:

1. Model assumption: P(A > B | r) = σ(r(A) − r(B)) where σ(x) = 1/(1+e^−x)

2. Likelihood of N pairs: L(r) = Π_i=1..N σ(r(chosen_i) − r(rejected_i))

3. Log-likelihood: log L(r) = Σ_i log σ(r(chosen_i) − r(rejected_i))

4. Negative log-likelihood (loss): J(r) = −(1/N) Σ_i log σ(r(chosen_i) − r(rejected_i))

5. Per-sample loss: L = −log σ(r(chosen) − r(rejected))

This is exactly binary cross-entropy where the "true label" is always 1 (chosen is always the preferred one in our dataset) and the model's predicted probability is σ(gap).

The key insight: We never need absolute reward values. The loss only depends on the difference r(chosen) − r(rejected). This means the reward model is only calibrated up to an additive constant — which is fine, because we only ever use reward differences downstream.
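The additive-constant point is easy to check numerically. The sketch below computes the mean negative log-likelihood over a few made-up score pairs, then repeats it after shifting every score by +100 and after doubling every score. The pairs and helper names are assumptions for illustration.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log sigma(x)."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def bt_nll(pairs):
    """Mean negative log-likelihood over (r_chosen, r_rejected) pairs."""
    return -sum(log_sigmoid(rc - rr) for rc, rr in pairs) / len(pairs)

# Made-up reward scores; the third pair is ranked the wrong way round.
pairs = [(1.5, -0.5), (0.2, 0.1), (-1.0, 0.3)]
shifted = [(rc + 100.0, rr + 100.0) for rc, rr in pairs]  # add a constant
scaled = [(2.0 * rc, 2.0 * rr) for rc, rr in pairs]       # rescale instead

base_loss = bt_nll(pairs)
shifted_loss = bt_nll(shifted)
scaled_loss = bt_nll(scaled)
print(base_loss, shifted_loss, scaled_loss)
```

Shifting leaves the loss unchanged, while rescaling does not: the loss fixes the reward scale but not its zero point.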

Checkpoint — Before you move on

Explain in your own words: why does the reward model only need to learn relative rankings, not absolute scores? What would go wrong if we tried to train it with absolute scores (e.g., "rate this response 1-10")?


Model Answer

Relative rankings work because: (1) Humans are inconsistent at absolute scales: one annotator's 7 is another's 5, creating noisy gradients. (2) The Bradley-Terry loss only depends on the difference r(A)−r(B), so adding the same constant to every score leaves the loss unchanged; the absolute level carries no training signal. (3) During PPO, a constant offset in the reward is absorbed by the value baseline, so what matters is how a response scores relative to other responses to the same prompt.

Absolute scores fail because: Different annotators have different baselines and scales. Regression to an absolute target (MSE loss) would fight this noise. The model would waste capacity trying to calibrate to individual annotator biases rather than learning the underlying quality ordering. The Bradley-Terry formulation elegantly sidesteps this by only asking "which is better?" — a question humans answer consistently.

Chapter 2: The RLHF Pipeline

RLHF (Reinforcement Learning from Human Feedback) is a three-stage pipeline that transforms a base model into an aligned assistant. Each stage builds on the last:

Stage 1: SFT

Fine-tune on high-quality demonstrations. The model learns the format of helpful responses.

↓

Stage 2: Reward Model

Train on human preference pairs. The RM learns to score response quality.

↓

Stage 3: PPO

Use RL to maximize the reward model's score while staying close to the SFT model.

Stage 1 (SFT) gives the model the right "shape" — it learns to follow instructions, use a helpful tone, and structure answers clearly. But SFT alone is limited by the quality and diversity of demonstrations.

Stage 2 (Reward Model) captures nuanced preferences that are hard to demonstrate. It's easier to judge than to demonstrate: a human can quickly say "A is better than B" even if writing the perfect response from scratch would take much longer.

Stage 3 (PPO) is where the magic happens. The policy model generates responses, the reward model scores them, and PPO updates the policy to produce higher-scoring outputs. A KL penalty prevents the model from drifting too far from the SFT checkpoint.

What's Running During PPO

The PPO training loop is surprisingly GPU-hungry. At any given moment, you need four models in memory:

Model	Role	Updated?
Policy π_θ	Generates responses. This is the model being trained.	Yes (via PPO)
Reference π_ref	Frozen SFT checkpoint. Computes KL penalty.	No (frozen)
Reward model r_φ	Scores generated responses.	No (frozen)
Value head V_ψ	Estimates expected future reward (for advantage computation).	Yes (via PPO)

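For a 7B model, that's roughly 4 × 14 GB = 56 GB just for weights in fp16 (counting the value function as a separate full-size model rather than a small head on a shared backbone), before any activations, gradients, or optimizer states. This is why RLHF is expensive and why simpler alternatives like DPO are appealing.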
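The memory arithmetic can be written out as a small calculator. This is a back-of-the-envelope sketch: the 14 bytes per trainable parameter for gradients plus Adam state (2 bytes of fp16 gradient, 12 bytes of fp32 master weights and two moment buffers) is an assumed rule of thumb for mixed-precision training, not a figure from this lesson.

```python
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Weight memory in GB (10^9 bytes); fp16 uses 2 bytes per parameter."""
    return n_params * bytes_per_param / 1e9

def training_state_gb(n_params: float, bytes_per_param: int = 14) -> float:
    """Gradients + Adam state for one trainable model (assumed 14 bytes/param)."""
    return n_params * bytes_per_param / 1e9

N_PARAMS = 7e9  # 7B-parameter model

per_model = weights_gb(N_PARAMS)
ppo_weights = 4 * per_model                      # policy, reference, reward, value
ppo_train_state = 2 * training_state_gb(N_PARAMS)  # policy and value are trained
two_model_weights = 2 * per_model                # e.g. policy + frozen reference

print(f"one model, fp16 weights:   {per_model:.0f} GB")
print(f"PPO, four sets of weights: {ppo_weights:.0f} GB")
print(f"PPO, grads + optimizer:    {ppo_train_state:.0f} GB (assumed rule of thumb)")
print(f"two-model setup, weights:  {two_model_weights:.0f} GB")
```

Under these assumptions the optimizer state for the two trainable models is larger than all four sets of weights combined, which is why sharding matters even when the weights alone would fit.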

Why three stages? SFT alone can't capture subtle preferences. The reward model alone can't generate text. PPO alone would collapse without a good starting point. Each stage solves a different piece of the puzzle.

RLHF Pipeline Flow

Interactive pipeline diagram. Click each stage to highlight its role and data flow.


Check: What is the correct order of the RLHF pipeline?

PPO → SFT → Reward Model
SFT → Reward Model → PPO
Reward Model → PPO → SFT

Chapter 3: PPO for LLMs

Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. The idea: generate a response, score it with the reward model, then nudge the policy to make high-reward responses more likely — but not too much per step.

PPO uses a clipped surrogate objective. The ratio r_t = π(a|s) / π_old(a|s) measures how much the policy has changed. PPO clips this ratio to [1−ε, 1+ε], preventing destructive updates:

L^CLIP = min(r_t A_t, clip(r_t, 1−ε, 1+ε) A_t)
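A per-sample version of the clipped objective fits in a few lines. This is a minimal sketch of the formula above for a single (ratio, advantage) pair; a real implementation would average it over a batch of tokens and negate it to get a loss.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x into the interval [lo, hi]."""
    return max(lo, min(hi, x))

def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO objective for one sample: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    unclipped = ratio * advantage
    clipped = clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return min(unclipped, clipped)

# Good action (A > 0): gains are capped once the ratio exceeds 1 + eps.
good_pushed_up = clipped_surrogate(1.5, advantage=1.0)
good_pushed_down = clipped_surrogate(0.5, advantage=1.0)

# Bad action (A < 0): no extra credit for pushing the ratio below 1 - eps.
bad_pushed_down = clipped_surrogate(0.5, advantage=-1.0)
bad_pushed_up = clipped_surrogate(1.5, advantage=-1.0)

print(good_pushed_up, good_pushed_down, bad_pushed_down, bad_pushed_up)
```

The `min` makes the bound one-sided: the objective stops rewarding moves in the helpful direction past the clip range, but still fully penalizes moves in the harmful direction.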

But there's a critical addition for LLMs: a KL penalty that keeps the policy close to a reference model (usually the SFT checkpoint). Without it, the model would "overoptimize" — finding weird outputs that score high on the reward model but are gibberish to humans.

R_total = R_reward(x, y) − β KL(π || π_ref)

Why KL matters: The reward model is an imperfect proxy for human preferences. If you optimize it too aggressively, the policy finds "adversarial examples" that exploit the reward model's blind spots. The KL penalty is a leash: the policy can learn, but can't wander into territory the reward model wasn't trained on.

The β Tradeoff in Practice

The total reward for a generated response is R_total = R_reward(x, y) − β · KL(π || π_ref). The coefficient β is one of the most important hyperparameters in RLHF training:

β too high (e.g., 0.5): The KL penalty dominates. The model barely deviates from the SFT checkpoint. You pay all the compute of RL but learn almost nothing.
β too low (e.g., 0.001): The model is free to drift far from π_ref. It discovers reward hacking strategies — adversarial outputs that score high on the reward model but are gibberish or sycophantic to humans.
β just right (typically 0.01–0.1): The model improves meaningfully while staying in the "trustworthy" region where the reward model's scores are meaningful.

In practice, teams often anneal β during training — starting high for stability, then lowering it as the reward model proves reliable.

PPO Clipped Objective

Adjust the advantage (positive = good action, negative = bad action) and the clip range ε. The teal curve is the clipped objective.

Example settings: advantage A = 1.0, clip ε = 0.20, KL penalty β = 0.10.

Symbol	Meaning
r_t	Probability ratio (new policy / old policy)
A_t	Advantage: how much better than expected
ε	Clip range (typically 0.1–0.2)
β	KL penalty coefficient
π_ref	Reference policy (SFT model)

One PPO Iteration, Step by Step

1. Generate

Sample prompt x from dataset. Policy π_θ generates response y = [t₁, ..., t_L].

↓

2. Score

Reward model scores (x, y) → scalar r. Reference model computes log π_ref(y|x). Single-sample KL estimate: KL ≈ log π_θ(y|x) − log π_ref(y|x).

↓

3. Advantages

R_total = r − β · KL. Value head estimates V(x). Advantage A = R_total − V(x). Apply GAE for variance reduction.

↓

4. Update

Multiple PPO epochs on the batch: update policy (maximize clipped advantage) and value head (minimize V loss).
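Steps 2 and 3 can be sketched for a single response. The version below makes an assumption beyond the text: it applies the KL penalty per token and adds the reward model score at the last token, which is a common way to implement the same R_total. All numbers are made up, and GAE assumes the value after the final token is zero.

```python
def kl_penalized_rewards(logp, ref_logp, rm_score, beta):
    """Per-token rewards: -beta * (log pi - log pi_ref), RM score on the last token."""
    rewards = [-beta * (lp - rlp) for lp, rlp in zip(logp, ref_logp)]
    rewards[-1] += rm_score
    return rewards

def gae(rewards, values, gamma=1.0, lam=0.95):
    """Generalized Advantage Estimation over one response."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Hypothetical per-token log-probs for a 3-token response.
logp = [-1.0, -0.5, -2.0]       # under the policy
ref_logp = [-1.2, -0.7, -1.5]   # under the frozen reference
values = [0.5, 0.6, 0.8]        # value estimates per position
beta, rm_score = 0.1, 1.0

rewards = kl_penalized_rewards(logp, ref_logp, rm_score, beta)
kl_estimate = sum(lp - rlp for lp, rlp in zip(logp, ref_logp))
advantages = gae(rewards, values)
print(rewards, kl_estimate, advantages)
```

With `gamma = lam = 1` the first advantage reduces to the sequence-level form in step 3: the summed rewards (r − β · KL) minus the value estimate at the start.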

Check: Why is the KL penalty necessary in RLHF?

It prevents the model from exploiting the imperfect reward model
It makes training faster
It reduces memory usage

🔨 Derivation: KL Penalty Coefficient β and the Optimal Policy

The RLHF objective is: max_π E_{x~D, y~π}[r(x,y)] − β KL(π || π_ref). This is a KL-regularized optimization problem: the penalty form of "maximize reward subject to staying close to the reference."

Your task: Show that the optimal policy under this objective takes the form π*(y|x) ∝ π_ref(y|x) · exp(r(x,y) / β). What role does β play as a "temperature"?

KL(π || π_ref) = E_y~π[log π(y|x) − log π_ref(y|x)]. The objective becomes: E_y~π[r(x,y) − β log π(y|x) + β log π_ref(y|x)]. Group the terms involving π.

Rearrange: E_y~π[r(x,y)/β + log π_ref(y|x) − log π(y|x)]. This is maximized when π(y|x) ∝ π_ref(y|x) · exp(r(x,y)/β) — this is the Gibbs/Boltzmann distribution from statistical mechanics!

As β → 0: exp(r/β) concentrates all mass on the highest-reward response (greedy). As β → ∞: exp(r/β) → 1 for all y, so π* = π_ref (no learning). β literally controls the "temperature" of how sharply the policy concentrates on high-reward outputs.

Full derivation:

1. Objective: max_π E_y~π(y|x)[r(x,y)] − β KL(π || π_ref)

2. Expand KL: = E_y~π[r(x,y) − β(log π(y|x) − log π_ref(y|x))]

3. Rearrange: = E_y~π[r(x,y) + β log π_ref(y|x)] − β E_y~π[log π(y|x)]

4. Recognize: The second term is β · H(π) (entropy). So we maximize: E[r + β log π_ref] + β H(π)

5. Variational solution: For any "energy" function f(y), the distribution maximizing E_π[f(y)] + T·H(π) is π*(y) ∝ exp(f(y)/T). Here f = r + β log π_ref and T = β.

6. Result: π*(y|x) ∝ exp((r(x,y) + β log π_ref(y|x)) / β) = π_ref(y|x) · exp(r(x,y)/β)

7. Normalizing: π*(y|x) = π_ref(y|x) · exp(r(x,y)/β) / Z(x), where Z(x) = Σ_y π_ref(y|x) exp(r(x,y)/β)

The key insight: β is literally a temperature parameter from the Boltzmann distribution in statistical mechanics. Low β = "cold" = sharply peaked at highest-reward outputs. High β = "hot" = spread out, close to reference. This is why β too low causes reward hacking (greedy exploitation) and β too high causes no learning (staying at reference).
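On a finite set of candidate responses the closed form can be computed and checked directly. The sketch below uses three hypothetical responses with made-up reference probabilities and rewards, builds π* for several β values, and compares the objective at π* against two other policies.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a finite response set."""
    max_r = max(rewards)  # subtract the max for numerical stability
    weights = [p * math.exp((r - max_r) / beta) for p, r in zip(ref_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def objective(policy, ref_probs, rewards, beta):
    """E_pi[r] - beta * KL(pi || pi_ref)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

ref = [0.5, 0.3, 0.2]      # assumed reference probabilities
rewards = [0.0, 1.0, 2.0]  # assumed rewards

cold = optimal_policy(ref, rewards, beta=0.05)   # low beta
warm = optimal_policy(ref, rewards, beta=1.0)
hot = optimal_policy(ref, rewards, beta=100.0)   # high beta

for name, pi in (("beta=0.05", cold), ("beta=1", warm), ("beta=100", hot)):
    print(name, [round(p, 3) for p in pi])

uniform = [1 / 3, 1 / 3, 1 / 3]
print(objective(warm, ref, rewards, 1.0),
      objective(ref, ref, rewards, 1.0),
      objective(uniform, ref, rewards, 1.0))
```

At low β nearly all mass lands on the highest-reward response, at high β the policy is almost the reference, and at β = 1 the closed-form policy scores higher on the objective than either the reference or a uniform policy.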

🔗 Pattern Recognition

PPO as Policy Gradient — Same Algorithm, Different Reward

This Lesson (RLHF)

R_total = r_RM(x,y) − β KL(π||π_ref).
Policy gradient with a learned reward model as the signal.

RL Lesson (Standard PPO)

R_total = Σ γ^t r(s_t, a_t).
Policy gradient with an environment reward as the signal. → RL Algorithms

The PPO algorithm is identical in both cases — clip ratios, compute advantages via GAE, update for multiple epochs. The only difference is where the reward comes from: in RL, it's the environment; in RLHF, it's a learned neural network. The KL penalty in RLHF plays the same role as reward shaping in RL: it keeps the policy in a region where the reward signal is trustworthy.

The next time you see any optimization with a "stay close to a reference" constraint, recognize it as the same pattern: exploration bounded by trust.

🔗 Pattern Recognition

KL Divergence as Information-Theoretic Distance

This Lesson (RLHF)

KL(π || π_ref) = E_y~π[log π(y|x) − log π_ref(y|x)].
Measures how far the policy has drifted from the reference.

VAE Lesson (Latent Variable Models)

KL(q(z|x) || p(z)) = E_z~q[log q(z|x) − log p(z)].
Measures how far the encoder has drifted from the prior. → VAE & VQ-VAE

Both use KL as a "leash" that prevents optimization from going too far. In RLHF, it prevents reward hacking. In VAEs, it keeps the encoder's posterior close to the prior so the latent space stays smooth enough to sample from (too strong a penalty causes the opposite failure, posterior collapse, where the decoder ignores the latent code). The pattern: whenever you optimize a flexible model against an objective, a KL penalty can keep it in a trusted region. The strength of the penalty (β) trades off "how much can we learn" vs "how likely are we to break."

Where else in deep learning do you see a KL penalty stabilizing optimization? (Hint: think about knowledge distillation and variational inference.)

Chapter 4: DPO — Direct Preference Optimization

What if we could skip the reward model entirely? DPO (Rafailov et al., 2023) makes a beautiful mathematical observation: the optimal policy under the RLHF objective has a closed-form solution in terms of the reward function. By rearranging terms, we can express the reward implicitly through the policy itself:

r(x, y) = β log(π(y|x) / π_ref(y|x)) + C(x)

where C(x) depends only on the prompt, not on the response.

Substituting this into the Bradley-Terry preference model gives the DPO loss directly in terms of the policy log-probabilities — no reward model needed:

L_DPO = −log σ(β[log(π(y_w|x)/π_ref(y_w|x)) − log(π(y_l|x)/π_ref(y_l|x))])

where y_w is the preferred (winning) response and y_l is the rejected (losing) response. DPO increases the relative log-probability of preferred responses while decreasing that of rejected ones, all while staying anchored to the reference model through the log-ratio terms.

DPO in Practice

Training DPO requires computing four forward passes per batch: π(y_w|x), π(y_l|x), π_ref(y_w|x), π_ref(y_l|x). The reference model is frozen (no gradients). Each forward pass returns the log-probability of the response: sum of log-probs per token. The loss is a single sigmoid cross-entropy on the gap between the two log-ratios. No sampling, no reward model, no RL loop. Just standard supervised training with a clever loss.

Typical hyperparameters: β = 0.1–0.5 (controls deviation from reference), learning rate 5e-7 to 1e-6 (very low, since you're fine-tuning a fine-tuned model), 1–3 epochs over the preference dataset. Memory: only 2 models (policy + frozen reference) instead of 4 for RLHF.
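The "sum of log-probs per token" step has one practical wrinkle: only response tokens should count, not the prompt or padding. A minimal sketch, assuming per-token log-probs have already been gathered and a 0/1 response mask is available (all values here are hypothetical):

```python
def sequence_logp(token_logps, response_mask):
    """Sum per-token log-probs over response positions (mask == 1) only."""
    return sum(lp for lp, m in zip(token_logps, response_mask) if m == 1)

# Positions 0-1 are the prompt, positions 2-4 are the response.
mask = [0, 0, 1, 1, 1]
policy_chosen = [-0.9, -1.1, -0.2, -0.3, -0.5]
policy_rejected = [-0.9, -1.1, -0.8, -0.9, -0.3]
ref_chosen = [-0.9, -1.1, -0.4, -0.4, -0.4]
ref_rejected = [-0.9, -1.1, -0.7, -0.6, -0.5]

beta = 0.1
# Four sequence log-probs, one per "forward pass" described in the text.
chosen_ratio = sequence_logp(policy_chosen, mask) - sequence_logp(ref_chosen, mask)
rejected_ratio = sequence_logp(policy_rejected, mask) - sequence_logp(ref_rejected, mask)
dpo_logit = beta * (chosen_ratio - rejected_ratio)
print(chosen_ratio, rejected_ratio, dpo_logit)
```

Because the log-probs are summed rather than averaged, longer responses naturally get more negative totals; the log-ratio against the reference is what partially offsets this.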

The Implicit Reward

Here's the key derivation that makes DPO work. The optimal policy under the KL-constrained RLHF objective is:

π*(y|x) = π_ref(y|x) · exp(r(x,y) / β) / Z(x)

Rearranging for r gives us: r(x,y) = β log(π(y|x)/π_ref(y|x)) + β log Z(x). The partition function Z(x) cancels when we take the difference r(y_w) − r(y_l) inside the Bradley-Terry model. This is why DPO works: it's implicitly computing a reward function where the reward of any response is just β times its log-probability ratio relative to the reference. Higher probability under π (relative to π_ref) = higher implicit reward.
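The cancellation can be verified numerically on a toy problem. The sketch below assumes three candidate responses with made-up reference probabilities and rewards, builds π* from them, and then recovers reward differences from log-probability ratios alone.

```python
import math

beta = 0.5
ref = {"y1": 0.6, "y2": 0.3, "y3": 0.1}       # assumed pi_ref(y|x)
reward = {"y1": 0.2, "y2": 1.0, "y3": -0.5}   # assumed r(x, y)

# Optimal policy: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
z = sum(ref[y] * math.exp(reward[y] / beta) for y in ref)
pi_star = {y: ref[y] * math.exp(reward[y] / beta) / z for y in ref}

def implicit_reward(y: str) -> float:
    """beta * log(pi*(y) / pi_ref(y)), which equals r(y) - beta * log Z."""
    return beta * math.log(pi_star[y] / ref[y])

true_gap = reward["y2"] - reward["y1"]
implicit_gap = implicit_reward("y2") - implicit_reward("y1")
offset = reward["y1"] - implicit_reward("y1")  # the part that cancels

print(true_gap, implicit_gap, offset, beta * math.log(z))
```

Each implicit reward is off from the true reward by the same amount, β log Z(x), so any difference between two responses to the same prompt is exact.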

The elegance: DPO collapses the last two stages (RM → PPO) into a single supervised learning problem on top of the SFT model. No RL, no reward model training, no sampling during training. Just a clever loss function applied directly to preference pairs. In practice, DPO is simpler to implement, more stable to train, and often matches RLHF performance.

Interactive: DPO Loss Landscape

Adjust the log-probability ratios for winning and losing responses. The DPO loss pushes the gap apart.

Example settings: log π/π_ref (win) = 1.0, log π/π_ref (lose) = −0.5, β (temperature) = 0.50, so DPO Loss = −log σ(0.50 × 1.5) ≈ 0.39

Check: What is DPO's main advantage over RLHF?

It skips the reward model and uses a direct supervised loss on preferences
It uses more training data
It requires no human preferences

🔨 Derivation: DPO Loss from the RLHF Objective

Given: The optimal RLHF policy is π*(y|x) = π_ref(y|x) · exp(r(x,y)/β) / Z(x). The Bradley-Terry preference model is P(y_w > y_l) = σ(r(y_w) − r(y_l)).

Your task: Derive the DPO loss by (1) solving for r(x,y) in terms of π and π_ref, then (2) substituting into Bradley-Terry to eliminate the reward function entirely.

P(y_w > y_l) = σ(r(y_w) − r(y_l)) = σ(β[log π(y_w|x)/π_ref(y_w|x) − log π(y_l|x)/π_ref(y_l|x)] + β log Z − β log Z). The partition function Z(x) cancels!

Since P(y_w > y_l) = σ(β[log(π/π_ref)(y_w) − log(π/π_ref)(y_l)]), the NLL loss is just −log of this expression. That's the DPO loss!

Full derivation:

1. Start with optimal policy: π*(y|x) = π_ref(y|x) · exp(r(x,y)/β) / Z(x)

2. Solve for reward: r(x,y) = β log(π*(y|x) / π_ref(y|x)) + β log Z(x)

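3. Substitute into Bradley-Terry: P(y_w > y_l) = σ(r(x,y_w) − r(x,y_l)) = σ(β log(π*(y_w|x)/π_ref(y_w|x)) + β log Z(x) − β log(π*(y_l|x)/π_ref(y_l|x)) − β log Z(x))

4. The Z(x) terms cancel! This is the crucial step. The intractable partition function vanishes when we take the difference.

5. DPO loss: replace π* with the trainable policy π_θ and take the negative log-likelihood of the observed preference: L_DPO = −log σ(β[log(π_θ(y_w|x)/π_ref(y_w|x)) − log(π_θ(y_l|x)/π_ref(y_l|x))])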

The key insight: The partition function Z(x) — which requires summing over ALL possible responses (intractable!) — cancels because we only need the difference in rewards. This is why DPO can bypass the reward model: the thing that made RL necessary (computing expectations over the policy) gets algebraically eliminated. Brilliant.

💻 Build It: Implement the DPO Loss from Scratch

You have log-probabilities from two models (policy and reference) for both the winning and losing responses. Implement the DPO loss that trains the policy to prefer the winning response.

signature
def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # [B] log P(y_w | x) under policy
    policy_rejected_logps: torch.Tensor,  # [B] log P(y_l | x) under policy
    ref_chosen_logps: torch.Tensor,       # [B] log P(y_w | x) under reference
    ref_rejected_logps: torch.Tensor,     # [B] log P(y_l | x) under reference
    beta: float = 0.1                     # temperature parameter
) -> torch.Tensor:
    """Returns scalar DPO loss (mean over batch)."""

Test case

policy_chosen_logps = torch.tensor([-1.0, -0.5])
policy_rejected_logps = torch.tensor([-2.0, -3.0])
ref_chosen_logps = torch.tensor([-1.2, -0.8])
ref_rejected_logps = torch.tensor([-1.8, -2.5])
beta = 0.1
Expected: loss ≈ 0.664 (verify: gap = beta * [((-1+1.2) - (-2+1.8)), ((-0.5+0.8) - (-3+2.5))] = 0.1*[(0.2-(-0.2)), (0.3-(-0.5))] = 0.1*[0.4, 0.8] = [0.04, 0.08]; loss = mean(-log(sigmoid([0.04, 0.08]))) = mean([0.673, 0.654]) ≈ 0.664)

log(π(y|x) / π_ref(y|x)) = log π(y|x) − log π_ref(y|x). So the chosen log-ratio = policy_chosen_logps − ref_chosen_logps. Same for rejected. Then scale the difference by beta.

python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1
) -> torch.Tensor:
    # Log-ratios: how much more likely under policy vs reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # The "implicit reward gap" scaled by beta
    logits = beta * (chosen_logratios - rejected_logratios)

    # Negative log-sigmoid = binary cross-entropy with label=1
    loss = -F.logsigmoid(logits).mean()
    return loss

Bonus challenge: Add label smoothing (mix in a small probability that the "rejected" response is actually preferred). This helps with noisy human labels. How does this change the loss formula?
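One possible answer to the bonus challenge, written in pure Python so it runs without PyTorch. It uses a common formulation (sometimes called conservative DPO) in which a fraction ε of the labels is assumed to be flipped; the function name and the `label_smoothing` parameter are illustrative choices, not part of the exercise's required signature.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log sigma(x)."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss_smoothed(policy_chosen, policy_rejected, ref_chosen, ref_rejected,
                      beta=0.1, label_smoothing=0.0):
    """Mean DPO loss; label_smoothing is the assumed probability of a flipped label."""
    losses = []
    for pc, pr, rc, rr in zip(policy_chosen, policy_rejected, ref_chosen, ref_rejected):
        logit = beta * ((pc - rc) - (pr - rr))
        loss = (-(1.0 - label_smoothing) * log_sigmoid(logit)
                - label_smoothing * log_sigmoid(-logit))
        losses.append(loss)
    return sum(losses) / len(losses)

# Same numbers as the test case above.
args = ([-1.0, -0.5], [-2.0, -3.0], [-1.2, -0.8], [-1.8, -2.5])
plain = dpo_loss_smoothed(*args, beta=0.1, label_smoothing=0.0)
smoothed = dpo_loss_smoothed(*args, beta=0.1, label_smoothing=0.1)
print(f"plain = {plain:.3f}, smoothed = {smoothed:.3f}")
```

With `label_smoothing=0.0` this reduces to the standard DPO loss. With smoothing, the second term pulls back on pairs where the logit is already large, so the loss no longer rewards pushing the gap toward infinity.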

⚔ Adversarial: DPO trains for 3 epochs on a preference dataset. Loss decreases steadily. But when you evaluate, the model gives worse responses than the SFT baseline. What happened?

You're training Llama-7B with DPO on 50K preference pairs collected from GPT-4 (as the "chosen") vs the SFT model's own outputs (as "rejected"). Beta = 0.1, LR = 5e-7, 3 epochs.

The learning rate is too high and the model diverged
50K pairs is too few for a 7B model
The "rejected" responses are from the SFT model but the reference model is also SFT, so the log-ratios for rejected are near zero, giving a degenerate signal
DPO can't work with GPT-4 generated data

Chapter 5: KTO & Simpler Methods

DPO still requires paired preferences (A vs B for the same prompt). But what if you only have binary feedback — thumbs up or thumbs down on individual responses? KTO (Kahneman-Tversky Optimization) works with unpaired data by leveraging prospect theory: humans feel losses more strongly than equivalent gains.

The KTO loss treats desirable and undesirable outputs asymmetrically. For a good response, it encourages the log-ratio to be high. For a bad response, it penalizes the log-ratio being high. The asymmetry mirrors human loss aversion:

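L_KTO = λ_D · (1 − σ(β · gap)) for desirable responses
L_KTO = λ_U · (1 − σ(−β · gap)) for undesirable responses

Here gap is the log-ratio log(π(y|x)/π_ref(y|x)) minus a reference point, and the weights λ_D and λ_U set how strongly each kind of example counts.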
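A per-example sketch of this loss, under stated assumptions: `gap` is taken as the policy/reference log-ratio minus a reference point `z_ref` (fixed at 0 here for simplicity), and separate weights `lambda_d` and `lambda_u` are used for desirable and undesirable examples. The names and default values are illustrative.

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio: float, desirable: bool, beta: float = 0.1,
             z_ref: float = 0.0, lambda_d: float = 1.0, lambda_u: float = 1.0) -> float:
    """Loss for one unpaired example labelled thumbs-up or thumbs-down."""
    gap = log_ratio - z_ref
    if desirable:
        # Good response: loss falls as the log-ratio rises.
        return lambda_d * (1.0 - sigmoid(beta * gap))
    # Bad response: loss rises as the log-ratio rises.
    return lambda_u * (1.0 - sigmoid(-beta * gap))

good_low = kto_loss(-2.0, desirable=True)
good_high = kto_loss(2.0, desirable=True)
bad_low = kto_loss(-2.0, desirable=False)
bad_high = kto_loss(2.0, desirable=False)
print(good_low, good_high, bad_low, bad_high)
```

Setting `lambda_u` larger than `lambda_d` is one way to encode the loss-aversion asymmetry: a bad response that becomes more likely then costs more than an equally sized gain on a good response earns.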

ORPO (Odds Ratio Preference Optimization) takes yet another approach: it combines the SFT loss with a preference signal in a single training objective, using the odds ratio of generating preferred vs rejected responses.
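The odds-ratio idea can be sketched as follows. This assumes each response is summarized by its average per-token log-probability (so the sequence "probability" p = exp(avg_logp) is length-normalized and strictly below 1), and that the preference term is added to the SFT loss with a hypothetical weight `lam`. Treat it as an illustration of the mechanism rather than a reference implementation.

```python
import math

def log_sigmoid(x: float) -> float:
    """Numerically stable log sigma(x)."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) with p = exp(avg_logp); requires avg_logp < 0."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """SFT term on the chosen response plus a weighted odds-ratio preference term."""
    sft_term = -avg_logp_chosen
    ratio_term = -log_sigmoid(log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected))
    return sft_term + lam * ratio_term

# Same chosen log-prob; the rejected response is unlikely in one case, likely in the other.
well_separated = orpo_loss(-0.5, -1.5)
barely_separated = orpo_loss(-0.5, -0.6)
print(well_separated, barely_separated)
```

Note that no reference model appears anywhere: both terms depend only on the policy being trained, which is what allows a single training stage.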

Method	Data Required	Stages	Key Idea
RLHF	Paired preferences	SFT → RM → PPO	Train reward model, then optimize via RL
DPO	Paired preferences	SFT → DPO	Direct loss on preference pairs
KTO	Binary (good/bad)	SFT → KTO	Loss-averse binary feedback
ORPO	Paired preferences	Single stage	Combine SFT + odds-ratio preference

Trend: The field is moving toward simpler methods. RLHF requires the most infrastructure (RL loop, reward model, reference model all in memory). DPO simplifies this considerably. KTO simplifies the data requirements. ORPO simplifies the pipeline to a single pass.

The Simplicity Gradient

Each method trades off against the others in a clear pattern:

RLHF: Most flexible (any reward signal), but 4 models in memory + RL instability. Used by OpenAI for GPT-4, Anthropic for Claude.
DPO: Shares its optimum with KL-regularized RLHF under the Bradley-Terry assumption, but needs only 2 models + standard supervised training. Used by Meta for Llama 3.
KTO: Works with thumbs-up/down data (cheaper to collect than pairs). Results relative to DPO vary by setup, but data collection is much easier.
ORPO: Single-stage training combining SFT and preference in one loss. Simplest pipeline but least studied at scale.

Method Comparison

Adjust the complexity and data axes to see how each method trades off simplicity vs data requirements.

Check: What kind of data does KTO need (unlike DPO)?

Just binary thumbs-up/down on individual responses (no pairs needed)
More paired comparisons than DPO
Expert-written demonstrations only

Chapter 6: Constitutional AI

Human annotation is expensive and doesn't scale. Anthropic's Constitutional AI (CAI) asks: can the AI critique and revise itself using a set of written principles (a "constitution")? The answer is yes, and it works surprisingly well.

CAI has two phases. In the critique-revision phase, the model generates a response, then is asked to evaluate it against principles like "is this harmful?" or "is this honest?" and rewrite it. The revised responses become the SFT training data. In the RLAIF phase (RL from AI Feedback), the AI itself generates preference labels instead of human annotators.

Constitution

A set of principles: "Be helpful," "Avoid harm," "Be honest," etc.

↓

Critique

Model reviews its own response against each principle.

↓

Revision

Model rewrites response to address critiques.

↓

RLAIF

AI-generated preferences used for RL training (instead of human labels).

Why this matters: Human feedback is the bottleneck. If we need millions of preference labels, we can't hire enough annotators. CAI provides a scalable alternative: the model's own judgment, guided by explicit principles. The constitution makes the values transparent and auditable.
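The critique-revision phase is essentially a loop over principles. The sketch below shows the control flow only: `generate` stands for any text-in, text-out model call, the prompt wording is invented for illustration, and `toy_model` is a stand-in so the example runs offline.

```python
from typing import Callable, List

def critique_and_revise(prompt: str, principles: List[str],
                        generate: Callable[[str], str]) -> List[str]:
    """Return [initial response, revision after principle 1, revision after principle 2, ...]."""
    response = generate(prompt)
    history = [response]
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n"
            f"Prompt: {prompt}\nResponse: {response}"
        )
        response = generate(
            f"Rewrite the response to address the critique.\n"
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}"
        )
        history.append(response)
    return history

def toy_model(text: str) -> str:
    """Stand-in for an LLM call (hypothetical); returns canned strings."""
    if text.startswith("Critique"):
        return "Too terse."
    if text.startswith("Rewrite"):
        return "revised draft"
    return "first draft"

principles = ["Avoid harm", "Be honest"]
history = critique_and_revise("How do I pick a lock?", principles, toy_model)
sft_example = ("How do I pick a lock?", history[-1])  # (prompt, final revision)
print(history)
```

Only the final revision is kept as SFT data here; the intermediate critiques are scaffolding and are discarded.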

Constitutional Critique-Revision

Watch how a response improves through rounds of self-critique. Each round applies a principle from the constitution.


Check: In Constitutional AI, who provides the preference labels for RL?

Only human annotators
The AI model itself, guided by principles
A random process

🏗 Design Challenge. You're the Architect: Alignment Pipeline for a Code Assistant

You're building a code-generation assistant (think Copilot/Cursor). The model must be helpful for coding tasks while refusing to generate malware, avoiding insecure patterns, and admitting when it's unsure. Design the full alignment pipeline from preference data collection through training.

Budget

$200K for annotation, 64 A100s for 2 weeks

Base model

34B parameter, code-pretrained (SFT already done)

Safety req

Must refuse malware/exploits but still help with security research and pentesting

Quality req

Code must compile, pass unit tests, follow best practices

1. What preference data do you collect? Human-written pairs, or model-generated? How do you handle the security edge case (pentesting is legitimate, but the same code could be malware)?

2. Do you use RLHF, DPO, or RLAIF? Given 64 A100s and a 34B model, can you fit RLHF in memory? What are the tradeoffs?

3. How do you define "correctness" for code? A reward model trained on human preferences might prefer "looks right" over "actually works." Do you incorporate execution feedback?

4. How do you handle the tension between helpfulness and safety? A user asks "write a function that scans all open ports on a network" — is this pentesting (allowed) or reconnaissance for an attack (refused)?

Real-world approaches (composite of Codex, StarCoder, DeepSeek-Coder):

Data: Hybrid approach. (1) Generate N=8 completions per prompt, run unit tests, rank by test pass rate — this gives "execution-verified" preferences for free. (2) For safety, use constitutional approach: define security principles, have the model self-evaluate edge cases, collect human labels only for truly ambiguous cases (5-10% of budget). (3) For style/quality, use human annotators who are actual developers (NOT crowd workers who can't evaluate code).

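Method: DPO is a common choice for code models at this scale. RLHF with a 34B model keeps 4 × 34B = 136B parameters resident (about 272 GB of fp16 weights), and the two trainable models (policy and value) also need gradients and optimizer states, so it only fits on 64 A100s with aggressive sharding and a separate generation path. DPO needs 2 × 34B = 68B parameters with a single trainable model, which leaves far more room for activations. Some teams (e.g., DeepSeek) use iterative DPO: train, generate new responses, re-rank, train again.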

Correctness: Multi-signal reward combining (1) execution pass rate (binary, high signal), (2) static analysis (linting, type checking), (3) human preference for readability. Weight execution at 3x human preference since "works" matters more than "looks nice."

Safety boundary: Context-dependent classification. The SAME code (port scanner) can be safe or unsafe depending on context. Approach: train a separate intent classifier, or use the constitutional method where principles distinguish "I'm a security researcher testing my own systems" from "I want to attack someone." Default to helpful (write the code) with a safety disclaimer. Over-refusal is a worse product failure than occasional edge-case compliance.
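The "execution-verified preferences" and "multi-signal reward" ideas above can be sketched in a few lines. Assumptions: pass rates are already computed per completion, pairs with a margin below `min_margin` are discarded as ambiguous, and the static-analysis weight is a placeholder (the text only fixes execution at 3x human preference).

```python
from itertools import combinations

def build_preference_pairs(prompt, completions, pass_rates, min_margin=0.25):
    """Turn N scored completions into (prompt, chosen, rejected) tuples."""
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        margin = pass_rates[i] - pass_rates[j]
        if abs(margin) < min_margin:
            continue  # too close to call; skip ambiguous pairs
        chosen, rejected = (i, j) if margin > 0 else (j, i)
        pairs.append((prompt, completions[chosen], completions[rejected]))
    return pairs

def combined_reward(pass_rate, lint_ok, human_pref,
                    w_exec=3.0, w_lint=1.0, w_human=1.0):
    """Weighted average of the three signals; w_lint is an assumed value."""
    total = w_exec + w_lint + w_human
    return (w_exec * pass_rate + w_lint * float(lint_ok) + w_human * human_pref) / total

completions = ["solution_a", "solution_b", "solution_c"]  # placeholder code strings
pass_rates = [1.0, 0.5, 0.4]                              # fraction of unit tests passed
pairs = build_preference_pairs("reverse a linked list", completions, pass_rates)

works_but_ugly = combined_reward(1.0, lint_ok=False, human_pref=0.0)
pretty_but_broken = combined_reward(0.0, lint_ok=True, human_pref=1.0)
print(pairs, works_but_ugly, pretty_but_broken)
```

With execution weighted at 3x, code that passes its tests but reads poorly still outscores code that looks clean and fails.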

Chapter 7: Process Reward Models

Standard reward models (Outcome Reward Models, or ORMs) score the final answer. But for multi-step reasoning, the final answer might be right for the wrong reasons — or wrong because of a single bad step in an otherwise sound chain. Process Reward Models (PRMs) score each step of the reasoning process.

PRMs provide much denser supervision. Instead of one score for the whole response, you get a score per step. This helps in at least three ways: (1) better credit assignment (which step went wrong?), (2) more training signal per example, and (3) the ability to do tree search over reasoning paths at inference time.

ORM vs PRM: An ORM is like grading an exam by only looking at the final answer. A PRM is like grading each step of the work. The PRM catches errors earlier and provides more useful feedback for learning.

PRM vs ORM Scoring

A reasoning chain with 5 steps. The ORM scores only the final answer. The PRM scores each step. Click steps to toggle correctness and see how scores change.

Feature	ORM	PRM
Granularity	Final answer only	Each reasoning step
Credit assignment	Poor (reward shared across all steps)	Good (per-step feedback)
Annotation cost	Low (check final answer)	High (verify each step)
Search capability	Best-of-N reranking of complete answers	Step-level tree search in addition to best-of-N
Best for	Short, single-step tasks	Multi-step reasoning (math, code)
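A small sketch of how per-step scores get used. It assumes the PRM outputs one correctness probability per step and that a chain is summarized by its weakest step (`min`) or by the product of step scores; both are common choices, but the aggregation rule is a design decision rather than something this lesson specifies. The scores are made up.

```python
import math

def first_bad_step(step_scores, threshold=0.5):
    """Index of the first step scored below threshold, or None if all pass."""
    for i, score in enumerate(step_scores):
        if score < threshold:
            return i
    return None

def chain_score(step_scores, how="min"):
    """Collapse per-step scores into one number for ranking whole chains."""
    if how == "min":
        return min(step_scores)
    if how == "prod":
        return math.prod(step_scores)
    raise ValueError(f"unknown aggregation: {how}")

# Two candidate 5-step reasoning chains (hypothetical PRM outputs).
chains = {
    "A": [0.9, 0.95, 0.2, 0.9, 0.9],   # step 3 (index 2) looks wrong
    "B": [0.8, 0.85, 0.8, 0.9, 0.85],
}

best = max(chains, key=lambda name: chain_score(chains[name]))
print(first_bad_step(chains["A"]), first_bad_step(chains["B"]), best)
```

An ORM would see only the final answers of A and B; the per-step scores are what let a search procedure abandon chain A at the third step instead of finishing it.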

Check: What is the main advantage of a PRM over an ORM?

It's cheaper to train
It scores each reasoning step, enabling better credit assignment
It doesn't need any human labels

Chapter 8: Reward Hacking

Here's the dark side of optimization: the model will find loopholes. If the reward model gives higher scores to longer responses, the model learns to be verbose. If the reward model prefers confident-sounding text, the model learns to sound confident even when wrong. This is reward hacking — the policy exploits imperfections in the proxy reward.

Common failure modes include:

Sycophancy — agreeing with the user even when they're wrong, because agreement gets higher reward.
Length bias — generating unnecessarily long responses because the RM associates length with quality.
Hedging — giving vague, noncommittal answers that avoid being clearly wrong.
Format gaming — using bullet points, bold text, or other formatting tricks that correlate with high scores.

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is a proxy for human preferences, not the real thing. Push optimization pressure hard enough and the proxy breaks down.

Reward Hacking in Action

Watch the proxy reward climb while true quality degrades. The green line is proxy reward, the teal line is true quality. Adjust KL penalty to see the mitigation effect.

Example settings: KL penalty β = 0.05, training steps = 100.

The Overoptimization Curve

There's a characteristic pattern in RLHF training. Plot proxy reward (from the RM) and true reward (from held-out human evaluations) against KL divergence from the reference. Early on, both curves rise together and the model genuinely improves. But past a critical KL threshold, proxy reward keeps climbing while true reward peaks and then drops. The model has found adversarial outputs that game the RM.

Gao et al. (2022) showed this empirically across multiple model sizes. In their experiments the location and height of the true-reward peak shifted with reward model and policy size, and larger reward models held up better under optimization pressure. This is why β must be tuned per setup: there's no universal value.

Mitigation Strategies

Strategy	How It Helps
KL penalty	Limits how far the policy can drift from the reference
Reward model ensembles	Multiple RMs make it harder to find shared blind spots
Length normalization	Score per-token rather than per-response
Adversarial training	Deliberately look for hacking patterns and retrain
Regular RM refresh	Retrain the RM on policy's new distribution
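As a rough illustration of the ensemble row, one possible aggregation rule is to subtract a multiple of the ensemble's disagreement from its mean score, so that a response only scores well when the RMs agree. This is a sketch of one option, not the only way to combine an ensemble; the function name and the factor k are hypothetical.

```python
import statistics

def conservative_reward(scores, k=1.0):
    """Combine scores from several reward models into one pessimistic reward.

    scores: one scalar per reward model, all for the same (prompt, response).
    k: how strongly to penalize disagreement (hypothetical tuning knob).
    """
    mean = statistics.fmean(scores)
    spread = statistics.pstdev(scores)  # population std dev across the ensemble
    return mean - k * spread

# All RMs agree: the reward is just the shared score.
print(conservative_reward([4.0, 4.0, 4.0]))

# One RM is fooled (5.0) while the others are not: disagreement pulls the reward down.
print(conservative_reward([5.0, 1.0, 3.0]))
```

A response that exploits a blind spot in a single RM tends to produce high disagreement, so it earns less than its mean score would suggest.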

Check: Sycophancy is an example of reward hacking because...

The model learns that agreeing with users scores higher, even when wrong
The model runs out of training data
The model becomes too helpful

Break-It Lab: What Dies When You Remove RLHF Components?

A working RLHF training loop with KL penalty, reward model, and clipping. Toggle components off to see the specific failure mode each one prevents. The green curve is true quality; the orange curve is proxy reward.

Remove KL Penalty (β=0)

Failure mode: Without the KL leash, the policy drifts far from the reference into out-of-distribution territory. The reward model's scores become meaningless there: the policy finds "adversarial inputs" that score high on the RM but are gibberish to humans. Proxy reward keeps climbing while true quality collapses. This is Goodhart's Law in action.
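The KL leash can be sketched as a penalty subtracted from the reward model score. The version below assumes a simple sequence-level estimate of the KL term (the sum of per-token log-probability differences on the sampled response); the function name and the numbers are hypothetical.

```python
def kl_shaped_reward(rm_score, policy_logprobs, ref_logprobs, beta):
    """Reward the policy actually optimizes: RM score minus beta times a KL estimate.

    policy_logprobs / ref_logprobs: per-token log-probs of the sampled response
    under the current policy and the frozen reference model.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical two-token response that the policy likes far more than the reference does.
policy_lp = [-0.1, -0.2]
ref_lp = [-1.0, -1.5]

print(kl_shaped_reward(4.2, policy_lp, ref_lp, beta=0.0))   # no leash: raw RM score
print(kl_shaped_reward(4.2, policy_lp, ref_lp, beta=0.05))  # mild penalty for drifting
```

With beta=0.0 nothing discourages drift, so any response the RM overrates is pure profit for the policy.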

Remove PPO Clipping (ε=∞)

Failure mode: Without clipping, a single high-advantage sample can cause an enormous policy update. The policy "overshoots": probability ratios can jump to 10x or 100x in one step. On the next step the model generates from this wildly different policy, the advantage estimates no longer match it, and training becomes unstable (oscillating loss, occasional mode collapse).
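To see what clipping does numerically, here is a minimal single-sample sketch of the PPO clipped surrogate. It works on one scalar at a time for readability; a real implementation would operate on batched tensors, and the function name is hypothetical.

```python
import math

def ppo_clipped_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO surrogate: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# A sample whose probability became about 10x larger under the new policy.
logp_old = math.log(0.01)
logp_new = math.log(0.10)

print(ppo_clipped_objective(logp_new, logp_old, advantage=1.0, eps=0.2))       # capped near 1.2
print(ppo_clipped_objective(logp_new, logp_old, advantage=1.0, eps=math.inf))  # uncapped, near 10
```

Once the ratio leaves the [1 − ε, 1 + ε] band in the direction the advantage favors, the objective is flat, so that sample stops pushing the policy further.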

Set β Too High (β=5.0)

Failure mode: The KL penalty dominates the objective. Every gradient step that moves away from π_ref gets punished more than it gets rewarded. The policy barely changes — you've spent millions on compute for a model that's essentially the SFT checkpoint with minor cosmetic differences. Training "succeeds" (loss is low) but the model hasn't learned anything new.

Out-of-Distribution Reward Model

Failure mode: The reward model was trained on short, factual Q&A pairs but the policy generates long creative stories. The RM assigns essentially arbitrary scores to this unseen distribution, sometimes very high, sometimes very low. The policy learns to exploit whichever OOD pattern accidentally gets high scores. This is why you must train the RM on the same distribution the policy will generate from.

Adversarial: Your RLHF model scores 4.2/5 on the reward model (up from 3.1 for the SFT baseline). But in blind human evaluation, raters prefer the SFT baseline 62% of the time. What happened and what do you do?

The reward model was trained 3 months ago on 100K preference pairs. Since then, you've run 50K PPO steps with β=0.02. The model's responses have become noticeably longer (avg 450 tokens vs 180 for SFT) and use more formatting (bullet points, headers).

The evaluation methodology is wrong
The RM has been overoptimized: the policy found length/formatting heuristics that game the RM's spurious correlations
The SFT model was already perfect and didn't need RLHF
β=0.02 is too high and prevented learning

Checkpoint — Before you move on

In your own words: why is reward hacking inevitable rather than merely possible? Connect this to Goodhart's Law and explain why no single mitigation (KL penalty, ensembles, etc.) can fully solve it.

Model Answer

It's inevitable because: The reward model is a finite neural network trained on a finite dataset. It has limited capacity and has only seen a tiny fraction of all possible outputs. Any such model necessarily has blind spots — regions of output space where its predictions are uncalibrated. An RL optimizer, given enough steps, will find these blind spots and exploit them. This is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure."

Why no single fix works: KL penalty slows exploitation but doesn't prevent it (the model still drifts, just slower). RM ensembles reduce shared blind spots but every finite ensemble has gaps. Length normalization fixes ONE exploit but the policy finds others. The fundamental issue is that we're using an imperfect proxy for an unspecifiable target ("what humans want"). The only true solution would be a perfect reward model — which would require capturing all of human values in a neural network. That's the alignment problem itself, recursively.

Chapter 9: Open Problems

Alignment research is far from solved. As models become more capable, the challenges deepen. Here are the frontier problems that keep researchers up at night:

Scalable Oversight

How do you supervise a model that can perform tasks beyond human ability? If the model writes code that's too complex for the annotator to evaluate, or reasons about problems the human can't verify, the entire preference-based framework breaks down. Proposals include recursive reward modeling (use AI to help humans evaluate AI), debate (two AIs argue, human judges), and market-based mechanisms.

Weak-to-Strong Generalization

Can a weaker model supervise a stronger one? If GPT-2-level labels are used to fine-tune a GPT-4-level model, can the strong model generalize beyond the quality of its supervisor? Early results from OpenAI suggest this partially works — the strong model can "figure out" what the weak supervisor meant, analogous to a smart student learning from an imperfect teacher. But it doesn't fully close the gap.
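How much of the gap gets closed can be expressed as a single number, "performance gap recovered" (PGR), the metric used in OpenAI's weak-to-strong work: the fraction of the distance between the weak supervisor and the strong model's ceiling that the weakly supervised strong model covers. The accuracies below are made up for illustration.

```python
def performance_gap_recovered(weak_acc, weak_to_strong_acc, strong_ceiling_acc):
    """PGR = (weak-to-strong - weak) / (strong ceiling - weak).

    weak_acc: accuracy of the weak supervisor.
    weak_to_strong_acc: accuracy of the strong model trained on weak labels.
    strong_ceiling_acc: accuracy of the strong model trained on ground-truth labels.
    """
    gap = strong_ceiling_acc - weak_acc
    if gap <= 0:
        raise ValueError("strong ceiling must exceed weak supervisor accuracy")
    return (weak_to_strong_acc - weak_acc) / gap

# Hypothetical accuracies: the strong student beats its teacher but misses its own ceiling.
pgr = performance_gap_recovered(weak_acc=0.60, weak_to_strong_acc=0.72, strong_ceiling_acc=0.80)
print(f"PGR = {pgr:.2f}")
```

A PGR of 1.0 would mean the weak labels cost nothing; a PGR of 0.0 would mean the strong model merely imitates its supervisor.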

Interpretability for Alignment

Can we look inside the model and verify its "values" directly? If we could read the model's internal representations and confirm it's being helpful for the right reasons (not just gaming the reward), that would be far more robust than any behavioral test. Mechanistic interpretability aims to make this possible.

The big picture: Current alignment is behavioral — we judge models by their outputs. Future alignment may be mechanistic — we verify models by their internals. The shift from "does it behave well?" to "does it reason well?" is the frontier.

Alignment Research Landscape

The map of open alignment problems. Each node represents a research direction.

Problem	Core Question	Status
Scalable oversight	How to evaluate superhuman outputs?	Active research
Weak-to-strong	Can weak supervisors train strong models?	Promising early results
Interpretability	Can we verify alignment by reading internals?	Rapid progress
Robustness	Does alignment hold under distribution shift?	Largely unsolved
Multi-agent	How to align interacting agents?	Early stage

"The AI alignment problem is not about making AI obey us. It's about making AI understand what we actually mean."

— paraphrasing Stuart Russell

You now understand the full landscape of alignment: from RLHF to DPO, from reward hacking to the open frontier. The field is young, the problems are deep, and the stakes are high.

Check: What is the "scalable oversight" problem?

How to evaluate and supervise AI on tasks beyond human ability
How to train models faster
How to reduce training costs

Understand Reward Models & Alignment

Chapter 0: The Alignment Problem

Chapter 1: Reward Modeling

Reward Model Architecture

Training the Reward Model

Chapter 2: The RLHF Pipeline

What's Running During PPO

Chapter 3: PPO for LLMs

The β Tradeoff in Practice

One PPO Iteration, Step by Step

Chapter 4: DPO — Direct Preference Optimization

DPO in Practice

The Implicit Reward

Chapter 5: KTO & Simpler Methods

The Simplicity Gradient

Chapter 6: Constitutional AI

Chapter 7: Process Reward Models

Chapter 8: Reward Hacking

The Overoptimization Curve

Mitigation Strategies

Chapter 9: Open Problems

Scalable Oversight

Weak-to-Strong Generalization

Interpretability for Alignment
