How we teach language models to be helpful, harmless, and honest — turning raw capability into something humans actually want.
A language model trained on internet text can write poetry, solve math problems, and generate working code. But it can also produce toxic content, confidently hallucinate, or help with dangerous requests. Capability is not alignment. The model learned to predict the next token — it didn't learn what humans actually want.
The alignment problem is the gap between what a model can do and what it should do. A base model completes text. An aligned model has a notion of helpful, harmless, and honest. Bridging that gap is the central challenge of modern AI safety.
Watch how a base model (predicting likely text) diverges from what an aligned model should say. The red path is raw capability. The green path is aligned behavior.
Early approaches tried to fix this with careful prompting or filtering. But these are band-aids. The real solution requires changing the model itself — teaching it a reward signal that captures human preferences. This is the story of alignment.
How do you teach a model what "good" means? You ask humans. Given a prompt, generate two candidate responses A and B. Show them to a human annotator who picks the better one. Collect thousands of these preference pairs and train a separate model — the reward model — to predict which response a human would prefer.
The reward model takes a (prompt, response) pair and outputs a scalar score. Higher score = more likely to be preferred. The key insight: we don't need humans to assign absolute scores (that's noisy and inconsistent). We only need relative judgments: "A is better than B." This is the Bradley-Terry model:
$$P(A > B) = \sigma\big(r(A) - r(B)\big)$$
where σ is the sigmoid function and r(·) is the reward model's score. The loss function pushes the reward for the preferred response above the rejected one:
$$\mathcal{L} = -\log \sigma\big(r(\text{chosen}) - r(\text{rejected})\big)$$
The reward model is typically the same architecture as the LLM itself, but with the language modeling head replaced by a scalar head. The data flow:
[B, L] token IDs → [B, L, d_model] hidden states → [B, d_model] last-token representation → [B, 1] reward score
For a 7B reward model with d_model=4096: the entire backbone processes the prompt+response, the last token's representation is pooled, and a single linear layer projects 4096 dimensions down to 1 scalar. The backbone is usually initialized from the SFT checkpoint: it already understands language; it just needs to learn to evaluate quality.
Each training example is a tuple: (prompt, chosen_response, rejected_response). Forward both responses through the RM to get r(chosen) and r(rejected). The Bradley-Terry loss pushes the gap apart:
```python
r_chosen = reward_model(prompt + chosen)      # scalar
r_rejected = reward_model(prompt + rejected)  # scalar
loss = -torch.log(torch.sigmoid(r_chosen - r_rejected))
loss.backward()
```
Typical dataset sizes: 50K–500K preference pairs. Training: 1–3 epochs, learning rate ~1e-5, batch size 64. The RM converges quickly because it's initialized from a strong LLM, so most of the work is already done. Training mainly has to teach the new scalar head (and lightly adapt the backbone) what "quality" means.
Drag the slider to adjust the reward gap between chosen and rejected responses. Watch how the Bradley-Terry loss changes.
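To make the shape of the loss concrete without the slider, here is a minimal sketch in plain Python that evaluates the Bradley-Terry loss at a few reward gaps. The helper name `bt_loss` is ours (not from any library), and the gap values are arbitrary.

```python
import math

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry loss -log(sigmoid(gap)), computed stably as softplus(-gap)."""
    gap = r_chosen - r_rejected
    # softplus(-gap) = log(1 + exp(-gap)); branch so exp() never overflows
    if gap >= 0:
        return math.log1p(math.exp(-gap))
    return -gap + math.log1p(math.exp(gap))

# Sweep the gap: negative means the RM currently ranks the pair the wrong way round
for gap in (-2.0, 0.0, 2.0, 5.0):
    print(f"gap={gap:+.1f}  loss={bt_loss(gap, 0.0):.4f}")
```

At a gap of zero the loss is log 2 (about 0.693). It shrinks toward zero as the chosen response pulls ahead and grows roughly linearly when the ranking is wrong.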
Given: A reward model r(·) that assigns scalar scores to responses. Human annotators provide N preference pairs where response A is preferred over response B. The Bradley-Terry model says P(A > B) = σ(r(A) − r(B)).
Your task: Derive the training loss L = −log σ(r(chosen) − r(rejected)) from the maximum likelihood principle. Why is this equivalent to binary cross-entropy?
Full derivation:
1. Model assumption: P(A > B | r) = σ(r(A) − r(B)) where σ(x) = 1/(1+e−x)
2. Likelihood of N pairs: L(r) = Πi=1..N σ(r(choseni) − r(rejectedi))
3. Log-likelihood: log L(r) = Σi log σ(r(choseni) − r(rejectedi))
4. Negative log-likelihood (loss): J(r) = −(1/N) Σi log σ(r(choseni) − r(rejectedi))
5. Per-sample loss: L = −log σ(r(chosen) − r(rejected))
This is exactly binary cross-entropy where the "true label" is always 1 (chosen is always the preferred one in our dataset) and the model's predicted probability is σ(gap).
The key insight: We never need absolute reward values. The loss only depends on the difference r(chosen) − r(rejected). This means the reward model is only calibrated up to an additive constant — which is fine, because we only ever use reward differences downstream.
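A quick numerical check of that claim, as a sketch in plain Python (the function names and the reward values are illustrative assumptions): shifting both rewards by the same constant leaves the loss untouched, while rescaling them changes the gap and therefore the loss. The gradients on the two rewards are equal and opposite, which is another way of seeing that only the difference matters.

```python
import math

def sigmoid(x: float) -> float:
    # Stable logistic function for either sign of x
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def bt_loss_and_grads(r_chosen: float, r_rejected: float):
    """Return (loss, dL/dr_chosen, dL/dr_rejected) for L = -log sigmoid(gap)."""
    gap = r_chosen - r_rejected
    loss = -math.log(sigmoid(gap))
    # dL/dgap = -(1 - sigmoid(gap)); the rejected reward gets the opposite sign
    g = -(1.0 - sigmoid(gap))
    return loss, g, -g

base = bt_loss_and_grads(1.5, 0.5)
shifted = bt_loss_and_grads(1.5 + 100.0, 0.5 + 100.0)  # same constant added to both
scaled = bt_loss_and_grads(2 * 1.5, 2 * 0.5)           # rescaling doubles the gap

print(f"base loss    = {base[0]:.4f}")
print(f"shifted loss = {shifted[0]:.4f}  (unchanged)")
print(f"scaled loss  = {scaled[0]:.4f}  (different)")
```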
Relative rankings work because: (1) Humans are inconsistent at absolute scales: one annotator's 7 is another's 5, creating noisy gradients. (2) The Bradley-Terry loss only depends on the difference r(A)−r(B), so adding the same constant to every score leaves the training signal unchanged; the model never has to pin down an absolute zero point. (3) PPO mostly needs to know which responses to the same prompt score higher than others, so a constant offset in the reward has little effect once advantages are computed against a baseline.
Absolute scores fail because: Different annotators have different baselines and scales. Regression to an absolute target (MSE loss) would fight this noise. The model would waste capacity trying to calibrate to individual annotator biases rather than learning the underlying quality ordering. The Bradley-Terry formulation elegantly sidesteps this by only asking "which is better?" — a question humans answer consistently.
RLHF (Reinforcement Learning from Human Feedback) is a three-stage pipeline that transforms a base model into an aligned assistant. Each stage builds on the last:
Stage 1 (SFT) gives the model the right "shape" — it learns to follow instructions, use a helpful tone, and structure answers clearly. But SFT alone is limited by the quality and diversity of demonstrations.
Stage 2 (Reward Model) captures nuanced preferences that are hard to demonstrate. It's easier to judge than to demonstrate: a human can quickly say "A is better than B" even if writing the perfect response from scratch would take much longer.
Stage 3 (PPO) is where the magic happens. The policy model generates responses, the reward model scores them, and PPO updates the policy to produce higher-scoring outputs. A KL penalty prevents the model from drifting too far from the SFT checkpoint.
The PPO training loop is surprisingly GPU-hungry. At any given moment, you need four models in memory:
| Model | Role | Updated? |
|---|---|---|
| Policy πθ | Generates responses. This is the model being trained. | Yes (via PPO) |
| Reference πref | Frozen SFT checkpoint. Computes KL penalty. | No (frozen) |
| Reward model rφ | Scores generated responses. | No (frozen) |
| Value head Vψ | Estimates expected future reward (for advantage computation). | Yes (via PPO) |
For a 7B model, that's roughly 4 × 14 GB = 56 GB just for weights in fp16 — before any activations or optimizer states. This is why RLHF is expensive and why simpler alternatives like DPO are appealing.
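The arithmetic behind that estimate, as a small sketch (the helper name is ours; 1 GB is taken as 1e9 bytes, and only weights are counted):

```python
def weights_gb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bytes_per_param / 1e9

models = ["policy", "reference", "reward model", "value model"]
per_model = weights_gb(7e9)        # fp16 stores 2 bytes per parameter
total = per_model * len(models)
print(f"{per_model:.0f} GB per model, {total:.0f} GB for all four")
```

Gradients, optimizer states for the two trained models, and activations come on top of this figure.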
Interactive pipeline diagram. Click each stage to highlight its role and data flow.
Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. The idea: generate a response, score it with the reward model, then nudge the policy to make high-reward responses more likely — but not too much per step.
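PPO uses a clipped surrogate objective. The ratio rt = π(a|s) / πold(a|s) measures how much the policy has changed. PPO clips this ratio to [1−ε, 1+ε], preventing destructive updates:
$$L^{\text{CLIP}} = \mathbb{E}_t\left[\min\big(r_t A_t,\ \text{clip}(r_t,\ 1-\epsilon,\ 1+\epsilon)\, A_t\big)\right]$$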
But there's a critical addition for LLMs: a KL penalty that keeps the policy close to a reference model (usually the SFT checkpoint). Without it, the model would "overoptimize" — finding weird outputs that score high on the reward model but are gibberish to humans.
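The total reward for a generated response is Rtotal = rmodel(y) − β · KL(π || πref). The coefficient β is one of the most sensitive hyperparameters in RLHF training: set it too low and the policy over-optimizes the reward model (reward hacking); set it too high and the policy barely moves away from the reference.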
In practice, some teams adjust β during training, for example starting high for stability and lowering it once the reward model has proven reliable.
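Here is one way the penalized reward could be computed for a single sampled response, as a hedged sketch. It assumes the KL term is estimated from the sample itself, as the sum over response tokens of log π minus log πref; real implementations often apply the penalty per token instead. The function name and all numbers are made up for illustration.

```python
def kl_penalized_reward(rm_score, policy_logps, ref_logps, beta=0.1):
    """R_total = r_model(y) - beta * KL_estimate for one sampled response.

    KL_estimate is a single-sample estimate: the sum over tokens of
    log pi(token) - log pi_ref(token).
    """
    if len(policy_logps) != len(ref_logps):
        raise ValueError("per-token log-prob lists must have equal length")
    kl_estimate = sum(p - q for p, q in zip(policy_logps, ref_logps))
    return rm_score - beta * kl_estimate

policy_logps = [-0.2, -0.5, -0.1]  # log pi(y_t | x, y_<t), made-up values
ref_logps = [-0.4, -0.9, -0.3]     # log pi_ref(y_t | x, y_<t), made-up values
r_total = kl_penalized_reward(2.0, policy_logps, ref_logps, beta=0.1)
print(f"R_total = {r_total:.2f}")
```

Here the policy assigns the response more probability than the reference does, so the KL estimate is positive (0.8) and the reward model's score of 2.0 is reduced by β times that amount.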
Adjust the advantage (positive = good action, negative = bad action) and the clip range ε. The teal curve is the clipped objective.
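The clipped objective for a single action can be sketched in a few lines of plain Python. The function name is ours; ε = 0.2 is just one value from the typical range.

```python
def ppo_clip_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-action clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# Sweep the probability ratio for a good action (A = +1) and a bad one (A = -1)
for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
    pos = ppo_clip_objective(ratio, advantage=+1.0)
    neg = ppo_clip_objective(ratio, advantage=-1.0)
    print(f"ratio={ratio:.1f}  A=+1: {pos:+.2f}   A=-1: {neg:+.2f}")
```

For a positive advantage the objective stops growing once the ratio passes 1+ε, so there is no incentive to push a good action's probability further in one update. For a negative advantage it flattens below 1−ε, while ratios above 1 are still fully penalized.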
| Symbol | Meaning |
|---|---|
| rt | Probability ratio (new policy / old policy) |
| At | Advantage: how much better than expected |
| ε | Clip range (typically 0.1–0.2) |
| β | KL penalty coefficient |
| πref | Reference policy (SFT model) |
The RLHF objective is: maxπ Ex~D, y~π[r(x,y)] − β KL(π || πref). This is a KL-regularized optimization problem: maximize reward while paying a penalty, weighted by β, for moving away from the reference.
Your task: Show that the optimal policy under this objective takes the form π*(y|x) ∝ πref(y|x) · exp(r(x,y) / β). What role does β play as a "temperature"?
Full derivation:
1. Objective: maxπ Ey~π(y|x)[r(x,y)] − β KL(π || πref)
2. Expand KL: = Ey~π[r(x,y) − β(log π(y|x) − log πref(y|x))]
3. Rearrange: = Ey~π[r(x,y) + β log πref(y|x)] − β Ey~π[log π(y|x)]
4. Recognize: The second term is β · H(π) (entropy). So we maximize: E[r + β log πref] + β H(π)
5. Variational solution: For any "energy" function f(y), the distribution maximizing Eπ[f(y)] + T·H(π) is π*(y) ∝ exp(f(y)/T). Here f = r + β log πref and T = β.
6. Result: π*(y|x) ∝ exp((r(x,y) + β log πref(y|x)) / β) = πref(y|x) · exp(r(x,y)/β)
7. Normalizing: π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x), where Z(x) = Σy πref(y|x) exp(r(x,y)/β)
The key insight: β is literally a temperature parameter from the Boltzmann distribution in statistical mechanics. Low β = "cold" = sharply peaked at highest-reward outputs. High β = "hot" = spread out, close to reference. This is why β too low causes reward hacking (greedy exploitation) and β too high causes no learning (staying at reference).
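The temperature effect is easy to see on a toy problem. This sketch assumes a prompt with only three possible responses, so the normalizer Z can be computed exactly; the reference probabilities and rewards are invented.

```python
import math

def optimal_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a small discrete response set."""
    # Work in log space and subtract the max so exp() cannot overflow
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, rewards)]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    return [w / z for w in weights]

ref_probs = [0.5, 0.3, 0.2]  # toy reference policy over three responses
rewards = [0.0, 1.0, 2.0]    # toy reward for each response

for beta in (0.1, 1.0, 10.0):
    probs = optimal_policy(ref_probs, rewards, beta)
    print(f"beta={beta:>4}: " + ", ".join(f"{p:.3f}" for p in probs))
```

With a small β nearly all of the probability mass lands on the highest-reward response; with a large β the result stays close to the reference distribution.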
PPO itself is the same in classic RL and in RLHF: clip ratios, compute advantages via GAE, update for multiple epochs. The only difference is where the reward comes from: in RL, it's the environment; in RLHF, it's a learned neural network. The KL penalty in RLHF plays a role similar to reward shaping in RL: it keeps the policy in a region where the reward signal is trustworthy.
The next time you see any optimization with a "stay close to a reference" constraint, recognize it as the same pattern: exploration bounded by trust.
Where else in deep learning do you see a KL penalty stabilizing optimization? (Hint: think about knowledge distillation and variational inference.)
RLHF and VAEs both use KL as a "leash" that prevents optimization from going too far. In RLHF, it prevents reward hacking. In VAEs, it keeps the encoder's approximate posterior close to the prior, so the latent space stays usable for sampling (although too strong a penalty causes posterior collapse, where the decoder ignores the latent code). The pattern: whenever you optimize a flexible model against an objective, add a KL penalty to keep it in a trusted region. The strength of the penalty (β) trades off "how much can we learn" vs "how likely are we to break."
What if we could skip the reward model entirely? DPO (Rafailov et al., 2023) makes a beautiful mathematical observation: the optimal policy under the RLHF objective has a closed-form solution in terms of the reward function. By rearranging terms, we can express the reward implicitly through the policy itself:
$$r(x, y) = \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)$$
Substituting this into the Bradley-Terry preference model gives the DPO loss directly in terms of the policy log-probabilities, with no reward model needed:
$$\mathcal{L}_{\text{DPO}} = -\log \sigma\left(\beta \left[\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right]\right)$$
where yw is the preferred (winning) response and yl is the rejected (losing) response. DPO increases the relative log-probability of preferred responses while decreasing that of rejected ones, all while staying anchored to the reference model through the log-ratio terms.
Training DPO requires computing four forward passes per batch: π(yw|x), π(yl|x), πref(yw|x), πref(yl|x). The reference model is frozen (no gradients). Each forward pass returns the log-probability of the response: sum of log-probs per token. The loss is a single sigmoid cross-entropy on the gap between the two log-ratios. No sampling, no reward model, no RL loop. Just standard supervised training with a clever loss.
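The "sum of log-probs per token" step deserves one detail: only the response tokens should count, not the prompt. A minimal sketch, assuming you already have per-token log-probs for the full prompt+response sequence and a 0/1 mask marking response positions (both names are ours):

```python
def sequence_logp(token_logps, response_mask):
    """Sum per-token log-probs over response tokens only (mask = 1), skipping the prompt."""
    if len(token_logps) != len(response_mask):
        raise ValueError("token_logps and response_mask must have equal length")
    return sum(lp for lp, m in zip(token_logps, response_mask) if m)

# Toy example: 2 prompt tokens followed by 3 response tokens (values are made up)
token_logps = [-1.2, -0.8, -0.5, -0.25, -0.25]
response_mask = [0, 0, 1, 1, 1]
logp_response = sequence_logp(token_logps, response_mask)
print(logp_response)
```

The same function would be applied four times per preference pair: to the chosen and rejected responses, under both the policy and the frozen reference.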
Typical hyperparameters: β = 0.1–0.5 (controls deviation from reference), learning rate 5e-7 to 1e-6 (very low, because you're fine-tuning a fine-tuned model), 1–3 epochs over the preference dataset. Memory: only 2 models (policy + frozen reference) instead of 4 for RLHF.
Here's the key derivation that makes DPO work. The optimal policy under the KL-constrained RLHF objective is:
$$\pi^*(y|x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y|x)\, \exp\big(r(x,y)/\beta\big)$$
Rearranging for r gives us: r(x,y) = β log(π(y|x)/πref(y|x)) + β log Z(x). The partition function Z(x) cancels when we take the difference r(yw) − r(yl) inside the Bradley-Terry model. This is why DPO works: it's implicitly computing a reward function where the reward of any response is just β times its log-probability ratio relative to the reference. Higher probability under π (relative to πref) = higher implicit reward.
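You can verify the cancellation numerically on a toy problem where Z is small enough to compute. This sketch assumes three candidate responses with invented reference probabilities and rewards; it builds the optimal policy from the rewards, then recovers the reward gap from log-ratios alone.

```python
import math

beta = 0.5
ref_probs = [0.5, 0.3, 0.2]   # toy pi_ref over three candidate responses
rewards = [0.3, 1.1, -0.4]    # toy "true" rewards r(x, y)

# Optimal policy: pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z
unnorm = [p * math.exp(r / beta) for p, r in zip(ref_probs, rewards)]
Z = sum(unnorm)
pi_star = [u / Z for u in unnorm]

# Implicit reward beta * log(pi* / pi_ref) differs from r by the constant beta * log Z
implicit = [beta * math.log(ps / pr) for ps, pr in zip(pi_star, ref_probs)]
offsets = [r - i for r, i in zip(rewards, implicit)]

# Differences agree, so the Bradley-Terry probability never needs Z
true_gap = rewards[1] - rewards[2]
implicit_gap = implicit[1] - implicit[2]
print(f"offsets      = {[round(o, 6) for o in offsets]}")
print(f"true gap     = {true_gap:.6f}")
print(f"implicit gap = {implicit_gap:.6f}")
```

Every entry of `offsets` equals β log Z, and the two gaps match up to floating-point error.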
Adjust the log-probability ratios for winning and losing responses. The DPO loss pushes the gap apart.
Given: The optimal RLHF policy is π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x). The Bradley-Terry preference model is P(yw > yl) = σ(r(yw) − r(yl)).
Your task: Derive the DPO loss by (1) solving for r(x,y) in terms of π and πref, then (2) substituting into Bradley-Terry to eliminate the reward function entirely.
Full derivation:
1. Start with optimal policy: π*(y|x) = πref(y|x) · exp(r(x,y)/β) / Z(x)
2. Solve for reward: r(x,y) = β log(π*(y|x) / πref(y|x)) + β log Z(x)
3. Substitute into Bradley-Terry:
P(yw > yl) = σ(r(yw) − r(yl))
= σ(β[log π*(yw|x)/πref(yw|x)] + β log Z − β[log π*(yl|x)/πref(yl|x)] − β log Z)
= σ(β[log π*(yw|x)/πref(yw|x) − log π*(yl|x)/πref(yl|x)])
4. The Z(x) terms cancel! This is the crucial step. The intractable partition function vanishes when we take the difference.
5. DPO loss: LDPO = −log σ(β[log πθ(yw|x)/πref(yw|x) − log πθ(yl|x)/πref(yl|x)])
The key insight: The partition function Z(x), which requires summing over ALL possible responses (intractable!), cancels because we only need the difference in rewards. This is why DPO can skip both the explicit reward model and the RL loop: the reward is expressed through the policy's own log-ratios, and the one intractable term is eliminated algebraically.
```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    ref_chosen_logps: torch.Tensor,
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    # Log-ratios: how much more likely under policy vs reference
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # The "implicit reward gap" scaled by beta
    logits = beta * (chosen_logratios - rejected_logratios)
    # Negative log-sigmoid = binary cross-entropy with label=1
    loss = -F.logsigmoid(logits).mean()
    return loss
```
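For intuition about the numbers, here is the same computation for a single preference pair using plain floats instead of tensors. The function name and the sequence log-probs are invented for illustration.

```python
import math

def dpo_loss_single(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """DPO loss for one pair, given sequence log-probs as plain floats."""
    logits = beta * ((pol_w - ref_w) - (pol_l - ref_l))
    # -log sigmoid(z) = log(1 + exp(-z)); adequate for the moderate z used here
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: both log-ratios are 0
at_init = dpo_loss_single(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen response by 2 nats and lowered the rejected one by 3
improved = dpo_loss_single(-10.0, -18.0, -12.0, -15.0)

print(f"loss at initialization: {at_init:.4f}")
print(f"loss after improvement: {improved:.4f}")
```

At the start of training the policy equals the reference, so the loss is log 2 (about 0.693) regardless of the raw log-probs. Only movement relative to the reference changes it.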
DPO still requires paired preferences (A vs B for the same prompt). But what if you only have binary feedback — thumbs up or thumbs down on individual responses? KTO (Kahneman-Tversky Optimization) works with unpaired data by leveraging prospect theory: humans feel losses more strongly than equivalent gains.
The KTO loss treats desirable and undesirable outputs asymmetrically. For a good response, it encourages the log-ratio to be high. For a bad response, it penalizes the log-ratio being high. The asymmetry mirrors human loss aversion:
ORPO (Odds Ratio Preference Optimization) takes yet another approach: it combines the SFT loss with a preference signal in a single training objective, using the odds ratio of generating preferred vs rejected responses.
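A rough sketch of how such a combined objective can look, under assumptions that go beyond what is stated above: each response's probability p is taken as the exponential of its average per-token log-prob, the odds are p / (1 − p), and the preference term is −log σ of the log odds ratio, weighted by a coefficient `lam` (a placeholder value here). The function names are ours.

```python
import math

def log_odds(avg_logp: float) -> float:
    """log(p / (1 - p)) where p = exp(avg_logp), so avg_logp must be negative."""
    return avg_logp - math.log1p(-math.exp(avg_logp))

def orpo_loss(avg_logp_chosen: float, avg_logp_rejected: float, lam: float = 0.1) -> float:
    """SFT loss on the chosen response plus a weighted odds-ratio preference term."""
    sft = -avg_logp_chosen  # negative log-likelihood of the chosen response
    log_or = log_odds(avg_logp_chosen) - log_odds(avg_logp_rejected)
    odds_ratio_term = math.log1p(math.exp(-log_or))  # -log sigmoid(log_or)
    return sft + lam * odds_ratio_term

no_preference = orpo_loss(-0.5, -0.5)  # both responses equally likely
preferred = orpo_loss(-0.5, -2.0)      # chosen clearly more likely than rejected
print(f"{no_preference:.4f} vs {preferred:.4f}")
```

The SFT term is identical in both calls; only the preference term shrinks as the chosen response becomes relatively more likely, which is why no separate reference model is needed in this formulation.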
| Method | Data Required | Stages | Key Idea |
|---|---|---|---|
| RLHF | Paired preferences | SFT → RM → PPO | Train reward model, then optimize via RL |
| DPO | Paired preferences | SFT → DPO | Direct loss on preference pairs |
| KTO | Binary (good/bad) | SFT → KTO | Loss-averse binary feedback |
| ORPO | Paired preferences | Single stage | Combine SFT + odds-ratio preference |
Each method trades off against the others in a clear pattern:
Adjust the complexity and data axes to see how each method trades off simplicity vs data requirements.
Human annotation is expensive and doesn't scale. Anthropic's Constitutional AI (CAI) asks: can the AI critique and revise itself using a set of written principles (a "constitution")? The answer is yes, and it works surprisingly well.
CAI has two phases. In the critique-revision phase, the model generates a response, then is asked to evaluate it against principles like "is this harmful?" or "is this honest?" and rewrite it. The revised responses become the SFT training data. In the RLAIF phase (RL from AI Feedback), the AI itself generates preference labels instead of human annotators.
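The critique-revision phase is essentially a loop. Below is a schematic sketch, not Anthropic's implementation: `llm` stands for any text-in, text-out callable you supply, and the prompt templates are invented placeholders.

```python
from typing import Callable, List

def critique_and_revise(prompt: str, principles: List[str], llm: Callable[[str], str]) -> str:
    """Run one critique-revision pass per principle and return the final response."""
    response = llm(prompt)
    for principle in principles:
        # Step 1: ask the model to critique its own response against one principle
        critique = llm(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique the response against this principle: {principle}"
        )
        # Step 2: ask the model to rewrite the response in light of that critique
        response = llm(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```

Each (prompt, final response) pair would then go into the SFT dataset. The model is called once for the initial draft and twice per principle.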
Watch how a response improves through rounds of self-critique. Each round applies a principle from the constitution.
Real-world approaches (composite of Codex, StarCoder, DeepSeek-Coder):
Data: Hybrid approach. (1) Generate N=8 completions per prompt, run unit tests, rank by test pass rate — this gives "execution-verified" preferences for free. (2) For safety, use constitutional approach: define security principles, have the model self-evaluate edge cases, collect human labels only for truly ambiguous cases (5-10% of budget). (3) For style/quality, use human annotators who are actual developers (NOT crowd workers who can't evaluate code).
Method: DPO is a common choice for code models at this scale. RLHF with a 34B model keeps 4 × 34B = 136B parameters in memory (about 272 GB of fp16 weights before optimizer states and activations), which is hard to fit on 64 A100s without aggressive sharding. DPO only needs 2 × 34B = 68B, which leaves far more room for activations. Some teams (e.g., DeepSeek) use iterative DPO: train, generate new responses, re-rank, train again.
Correctness: Multi-signal reward combining (1) execution pass rate (binary, high signal), (2) static analysis (linting, type checking), (3) human preference for readability. Weight execution at 3x human preference since "works" matters more than "looks nice."
Safety boundary: Context-dependent classification. The SAME code (port scanner) can be safe or unsafe depending on context. Approach: train a separate intent classifier, or use the constitutional method where principles distinguish "I'm a security researcher testing my own systems" from "I want to attack someone." Default to helpful (write the code) with a safety disclaimer. Over-refusal is a worse product failure than occasional edge-case compliance.
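The execution-verified preference step described under "Data" can be sketched as follows. The sketch assumes the unit tests have already been run and you have a pass rate per completion; the `min_margin` threshold for dropping near-ties is a hypothetical knob, not something prescribed above.

```python
from itertools import combinations

def build_preference_pairs(prompt, completions, pass_rates, min_margin=0.25):
    """Turn per-completion test pass rates into (prompt, chosen, rejected) tuples."""
    pairs = []
    for i, j in combinations(range(len(completions)), 2):
        margin = pass_rates[i] - pass_rates[j]
        if abs(margin) < min_margin:
            continue  # too close to call; skip rather than add a noisy label
        chosen, rejected = (i, j) if margin > 0 else (j, i)
        pairs.append((prompt, completions[chosen], completions[rejected]))
    return pairs

completions = ["impl_a", "impl_b", "impl_c"]  # stand-ins for sampled code
pass_rates = [1.0, 0.5, 0.5]                  # fraction of unit tests passed
pairs = build_preference_pairs(
    "Write a function that reverses a list.", completions, pass_rates
)
for _, chosen, rejected in pairs:
    print(f"{chosen} > {rejected}")
```

The tied pair (impl_b, impl_c) is dropped, leaving two pairs in which impl_a is the chosen response.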
Standard reward models (Outcome Reward Models, or ORMs) score the final answer. But for multi-step reasoning, the final answer might be right for the wrong reasons — or wrong because of a single bad step in an otherwise sound chain. Process Reward Models (PRMs) score each step of the reasoning process.
PRMs provide much denser supervision. Instead of one score for the whole response, you get a score per step. This helps in at least three ways: (1) better credit assignment (which step went wrong?), (2) more training signal per example, and (3) the ability to do tree search over reasoning paths at inference time.
A reasoning chain with 5 steps. The ORM scores only the final answer. The PRM scores each step. Click steps to toggle correctness and see how scores change.
| Feature | ORM | PRM |
|---|---|---|
| Granularity | Final answer only | Each reasoning step |
| Credit assignment | Poor (reward shared across all steps) | Good (per-step feedback) |
| Annotation cost | Low (check final answer) | High (verify each step) |
| Search capability | Best-of-N reranking of complete answers | Step-level guidance: best-of-N and tree search |
| Best for | Short, single-step tasks | Multi-step reasoning (math, code) |
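To see how the two kinds of reward model can disagree, here is a small best-of-N sketch. It assumes a solution's PRM score is an aggregate of its per-step scores, using either the minimum or the product (two common choices, but still an assumption here). All scores are invented for illustration.

```python
import math

def prm_solution_score(step_scores, how="min"):
    """Collapse per-step correctness probabilities into one solution score."""
    if how == "min":
        return min(step_scores)        # a chain is only as good as its weakest step
    if how == "prod":
        return math.prod(step_scores)  # probability that every step is correct
    raise ValueError(f"unknown aggregation: {how}")

# Two candidate solutions; numbers are illustrative, not from a real model
candidates = {
    "A": {"orm": 0.90, "steps": [0.95, 0.20, 0.90, 0.95]},  # plausible answer, one bad step
    "B": {"orm": 0.80, "steps": [0.90, 0.85, 0.90, 0.90]},
}
best_by_orm = max(candidates, key=lambda k: candidates[k]["orm"])
best_by_prm = max(candidates, key=lambda k: prm_solution_score(candidates[k]["steps"]))
print(f"ORM picks {best_by_orm}, PRM picks {best_by_prm}")
```

The ORM prefers candidate A because its final answer looks good, while the PRM prefers B because A contains a step it scores as likely wrong.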
Here's the dark side of optimization: the model will find loopholes. If the reward model gives higher scores to longer responses, the model learns to be verbose. If the reward model prefers confident-sounding text, the model learns to sound confident even when wrong. This is reward hacking — the policy exploits imperfections in the proxy reward.
Common failure modes include:
Watch the proxy reward climb while true quality degrades. The green line is proxy reward, the teal line is true quality. Adjust KL penalty to see the mitigation effect.
There's a characteristic pattern in RLHF training. Plot proxy reward (from the RM) and true reward (from held-out human evaluations) against KL divergence from the reference. Early on, both curves rise together: the model genuinely improves. But past a critical KL threshold, proxy reward keeps climbing while true reward peaks and then drops. The policy has found outputs that game the RM.
Gao et al. (2022) showed this empirically across multiple model sizes. Where the peak falls depends on the setup: in their experiments, larger reward models were harder to over-optimize. This is why β must be tuned per setup; there's no universal value.
| Strategy | How It Helps |
|---|---|
| KL penalty | Limits how far the policy can drift from the reference |
| Reward model ensembles | Multiple RMs make it harder to find shared blind spots |
| Length normalization | Score per-token rather than per-response |
| Adversarial training | Deliberately look for hacking patterns and retrain |
| Regular RM refresh | Retrain the RM on policy's new distribution |
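As one example of how an ensemble could be used, the sketch below penalizes disagreement between reward models by subtracting a multiple of the standard deviation from the mean score. This particular aggregation and the weight `k` are assumptions for illustration; taking the minimum score is another option.

```python
import statistics

def conservative_reward(scores, k=1.0):
    """Mean of the ensemble's scores minus k population standard deviations."""
    return statistics.mean(scores) - k * statistics.pstdev(scores)

agree = [2.0, 2.0, 2.0]     # all reward models give the same score
disagree = [4.0, 2.0, 0.0]  # same mean, but the models disagree

print(f"agree:    {conservative_reward(agree):.3f}")
print(f"disagree: {conservative_reward(disagree):.3f}")
```

Both responses have the same average score, but the one the reward models disagree on receives a lower reward, which makes it less attractive for the policy to chase a blind spot of a single RM.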
It's inevitable because: The reward model is a finite neural network trained on a finite dataset. It has limited capacity and has only seen a tiny fraction of all possible outputs. Any such model necessarily has blind spots — regions of output space where its predictions are uncalibrated. An RL optimizer, given enough steps, will find these blind spots and exploit them. This is Goodhart's Law: "when a measure becomes a target, it ceases to be a good measure."
Why no single fix works: KL penalty slows exploitation but doesn't prevent it (the model still drifts, just slower). RM ensembles reduce shared blind spots but every finite ensemble has gaps. Length normalization fixes ONE exploit but the policy finds others. The fundamental issue is that we're using an imperfect proxy for an unspecifiable target ("what humans want"). The only true solution would be a perfect reward model — which would require capturing all of human values in a neural network. That's the alignment problem itself, recursively.
Alignment research is far from solved. As models become more capable, the challenges deepen. Here are the frontier problems that keep researchers up at night:
How do you supervise a model that can perform tasks beyond human ability? If the model writes code that's too complex for the annotator to evaluate, or reasons about problems the human can't verify, the entire preference-based framework breaks down. Proposals include recursive reward modeling (use AI to help humans evaluate AI), debate (two AIs argue, human judges), and market-based mechanisms.
Can a weaker model supervise a stronger one? If GPT-2-level labels are used to fine-tune a GPT-4-level model, can the strong model generalize beyond the quality of its supervisor? Early results from OpenAI suggest this partially works — the strong model can "figure out" what the weak supervisor meant, analogous to a smart student learning from an imperfect teacher. But it doesn't fully close the gap.
Can we look inside the model and verify its "values" directly? If we could read the model's internal representations and confirm it's being helpful for the right reasons (not just gaming the reward), that would be far more robust than any behavioral test. Mechanistic interpretability aims to make this possible.
The map of open alignment problems. Each node represents a research direction. Hover to see connections.
| Problem | Core Question | Status |
|---|---|---|
| Scalable oversight | How to evaluate superhuman outputs? | Active research |
| Weak-to-strong | Can weak supervisors train strong models? | Promising early results |
| Interpretability | Can we verify alignment by reading internals? | Rapid progress |
| Robustness | Does alignment hold under distribution shift? | Largely unsolved |
| Multi-agent | How to align interacting agents? | Early stage |
You now understand the full landscape of alignment: from RLHF to DPO, from reward hacking to the open frontier. The field is young, the problems are deep, and the stakes are high.