The Complete Beginner's Path

Understand Reward Models
& Alignment

How we teach language models to be helpful, harmless, and honest — turning raw capability into something humans actually want.

Prerequisites: Basic LLM intuition + What fine-tuning means. That's it.
10 Chapters · 8+ Interactives · 0 PhD Required

Chapter 0: The Alignment Problem

A language model trained on internet text can write poetry, solve math problems, and generate working code. But it can also produce toxic content, confidently hallucinate, or help with dangerous requests. Capability is not alignment. The model learned to predict the next token — it didn't learn what humans actually want.

The alignment problem is the gap between what a model can do and what it should do. A base model completes text. An aligned model has a notion of helpful, harmless, and honest. Bridging that gap is the central challenge of modern AI safety.

The core tension: Pre-training gives capability (the model learns the distribution of all internet text), but it doesn't distinguish a helpful medical answer from a conspiracy theory. Both are "likely next tokens." Alignment steers the model toward responses humans actually prefer.
Capability vs Alignment

Watch how a base model (predicting likely text) diverges from what an aligned model should say. The red path is raw capability. The green path is aligned behavior.

Early approaches tried to fix this with careful prompting or filtering. But these are band-aids. The real solution requires changing the model itself — teaching it a reward signal that captures human preferences. This is the story of alignment.

Check: Why isn't a model trained on internet text automatically aligned?

Chapter 1: Reward Modeling

How do you teach a model what "good" means? You ask humans. Given a prompt, generate two candidate responses A and B. Show them to a human annotator who picks the better one. Collect thousands of these preference pairs and train a separate model — the reward model — to predict which response a human would prefer.

The reward model takes a (prompt, response) pair and outputs a scalar score. Higher score = more likely to be preferred. The key insight: we don't need humans to assign absolute scores (that's noisy and inconsistent). We only need relative judgments: "A is better than B." This is the Bradley-Terry model:

P(A > B) = σ(r(A) − r(B))

where σ is the sigmoid function and r(·) is the reward model's score. The loss function pushes the reward for the preferred response above the rejected one:

L = −log σ(r(chosen) − r(rejected))
Why pairs, not scores? Asking "rate this response 1-10" is unreliable — one annotator's 7 is another's 5. But asking "which is better?" gives consistent signal. Humans are much better at comparisons than absolute judgments.
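The Bradley-Terry loss above is a one-liner in code. Here is a minimal sketch in plain Python (the function names are ours, not from any library):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Bradley-Terry preference loss: -log sigma(r_chosen - r_rejected)."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# A wider reward gap means the model is more confident the chosen
# response wins, so the loss shrinks toward zero.
print(round(bt_loss(1.5, -0.5), 2))  # gap = 2.0 -> 0.13
print(round(bt_loss(0.0, 0.0), 2))   # gap = 0.0 -> -log(0.5) = 0.69
```

A real reward model computes r(·) with a transformer and backpropagates this loss through it, but the loss itself is exactly this expression.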
Reward Model Training

Drag the slider to adjust the reward gap between chosen and rejected responses. Watch how the Bradley-Terry loss changes.

Example settings: r(chosen) = 1.5, r(rejected) = −0.5 → gap = 2.0, Loss = −log σ(2.0) ≈ 0.13
Check: What does the reward model learn to predict?

Chapter 2: The RLHF Pipeline

RLHF (Reinforcement Learning from Human Feedback) is a three-stage pipeline that transforms a base model into an aligned assistant. Each stage builds on the last:

Stage 1: SFT
Fine-tune on high-quality demonstrations. The model learns the format of helpful responses.
Stage 2: Reward Model
Train on human preference pairs. The RM learns to score response quality.
Stage 3: PPO
Use RL to maximize the reward model's score while staying close to the SFT model.

Stage 1 (SFT) gives the model the right "shape" — it learns to follow instructions, use a helpful tone, and structure answers clearly. But SFT alone is limited by the quality and diversity of demonstrations.

Stage 2 (Reward Model) captures nuanced preferences that are hard to demonstrate. It's easier to judge than to demonstrate: a human can quickly say "A is better than B" even if writing the perfect response from scratch would take much longer.

Stage 3 (PPO) is where the magic happens. The policy model generates responses, the reward model scores them, and PPO updates the policy to produce higher-scoring outputs. A KL penalty prevents the model from drifting too far from the SFT checkpoint.

Why three stages? SFT alone can't capture subtle preferences. The reward model alone can't generate text. PPO alone would collapse without a good starting point. Each stage solves a different piece of the puzzle.
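The three stages can be sketched as a data flow. Everything below is a toy stand-in (the dict "models" and stub training functions are illustrative, not a real training API); only the stage ordering and inputs reflect the actual pipeline:

```python
# Toy data-flow sketch of the RLHF pipeline. The "models" are dicts and the
# training functions are stubs; only the stage ordering and inputs are real.

def sft(base_model: dict, demonstrations: list) -> dict:
    # Stage 1: supervised fine-tuning on high-quality demonstrations.
    return {**base_model, "instruction_tuned": True}

def train_reward_model(preference_pairs: list):
    # Stage 2: in reality a Bradley-Terry fit; here a stub scorer.
    return lambda prompt, response: float(len(response))

def ppo(policy: dict, reward_model, reference: dict) -> dict:
    # Stage 3: optimize against the RM, anchored to the reference policy.
    return {**policy, "optimized": True}

base = {"name": "base-model"}
policy = sft(base, demonstrations=["Q: ...\nA: ..."])
rm = train_reward_model(preference_pairs=[("chosen", "rejected")])
aligned = ppo(policy, rm, reference=policy)
print(aligned)
```

The point of the sketch is the dependencies: PPO needs both the reward model (from stage 2) and the SFT checkpoint (from stage 1) as its reference.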
RLHF Pipeline Flow

Interactive pipeline diagram. Click each stage to highlight its role and data flow.

Check: What is the correct order of the RLHF pipeline?

Chapter 3: PPO for LLMs

Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. The idea: generate a response, score it with the reward model, then nudge the policy to make high-reward responses more likely — but not too much per step.

PPO uses a clipped surrogate objective. The ratio r_t = π(a|s) / π_old(a|s) measures how much the policy has changed. PPO clips this ratio to [1−ε, 1+ε], preventing destructive updates:

L_CLIP = min(r_t · A_t, clip(r_t, 1−ε, 1+ε) · A_t)

But there's a critical addition for LLMs: a KL penalty that keeps the policy close to a reference model (usually the SFT checkpoint). Without it, the model would "overoptimize" — finding weird outputs that score high on the reward model but are gibberish to humans.

R_total = R_reward(x, y) − β · KL(π || π_ref)
Why KL matters: The reward model is an imperfect proxy for human preferences. If you optimize it too aggressively, the policy finds "adversarial examples" that exploit the reward model's blind spots. The KL penalty is a leash: the policy can learn, but can't wander into territory the reward model wasn't trained on.
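As a sanity check on the clipping behavior, here is a minimal sketch in plain Python (function names are ours; the KL term uses the standard per-sample estimate log π − log π_ref):

```python
def clipped_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """PPO clipped surrogate for one action:
    min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

# With A > 0, any gain from pushing the ratio past 1 + eps is clipped away,
# so there is no incentive for a destructively large policy step.
print(clipped_objective(1.5, 1.0))  # clipped at 1 + eps = 1.2
print(clipped_objective(1.1, 1.0))  # inside the clip range: 1.1

def kl_shaped_reward(reward: float, logp: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward with a KL penalty toward the reference policy,
    using the per-sample KL estimate (log pi - log pi_ref)."""
    return reward - beta * (logp - logp_ref)
```

Note how the `min` makes clipping one-sided: improvements are capped, but the objective never hides a worsening of the policy.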
PPO Clipped Objective

Adjust the advantage (positive = good action, negative = bad action) and the clip range ε. The teal curve is the clipped objective.

Controls: advantage A (default 1.0), clip ε (default 0.20), KL penalty β (default 0.10)
Symbol | Meaning
r_t | Probability ratio (new policy / old policy)
A_t | Advantage: how much better than expected
ε | Clip range (typically 0.1–0.2)
β | KL penalty coefficient
π_ref | Reference policy (SFT model)
Check: Why is the KL penalty necessary in RLHF?

Chapter 4: DPO — Direct Preference Optimization

What if we could skip the reward model entirely? DPO (Rafailov et al., 2023) makes a beautiful mathematical observation: the optimal policy under the RLHF objective has a closed-form solution in terms of the reward function. By rearranging terms, we can express the reward implicitly through the policy itself:

r(x, y) = β log(π(y|x) / π_ref(y|x)) + C

Substituting this into the Bradley-Terry preference model gives the DPO loss directly in terms of the policy log-probabilities — no reward model needed:

L_DPO = −log σ(β [log(π(y_w|x) / π_ref(y_w|x)) − log(π(y_l|x) / π_ref(y_l|x))])

where y_w is the preferred (winning) response and y_l is the rejected (losing) response. DPO increases the relative log-probability of preferred responses while decreasing that of rejected ones, all while staying anchored to the reference model through the log-ratio terms.

The elegance: DPO collapses the reward-modeling and RL stages into a single supervised learning problem on top of SFT (the pipeline becomes SFT → DPO). No RL, no reward model training, no sampling during training. Just a clever loss function applied directly to preference pairs. In practice, DPO is simpler to implement, more stable to train, and often matches RLHF performance.
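The loss is simple enough to compute by hand. A minimal sketch in plain Python, taking the two log-ratios log(π/π_ref) as inputs (function name is ours):

```python
import math

def dpo_loss(logratio_win: float, logratio_lose: float, beta: float = 0.5) -> float:
    """DPO loss given the log-ratios log(pi/pi_ref) for winner and loser."""
    margin = beta * (logratio_win - logratio_lose)
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log sigmoid(margin)

# Widening the winner-loser gap drives the loss toward zero.
print(round(dpo_loss(1.0, -0.5), 2))  # margin = 0.75 -> ~0.39
print(round(dpo_loss(3.0, -3.0), 2))  # margin = 3.0  -> ~0.05
```

Structurally this is the Bradley-Terry loss from Chapter 1, with the implicit reward β·log(π/π_ref) substituted for the reward model's score.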
Interactive: DPO Loss Landscape

Adjust the log-probability ratios for winning and losing responses. The DPO loss pushes the gap apart.

Example settings: log(π/π_ref) for the winner = 1.0, for the loser = −0.5, β = 0.50 → margin = 0.5 · 1.5 = 0.75, DPO Loss = −log σ(0.75) ≈ 0.39
Check: What is DPO's main advantage over RLHF?

Chapter 5: KTO & Simpler Methods

DPO still requires paired preferences (A vs B for the same prompt). But what if you only have binary feedback — thumbs up or thumbs down on individual responses? KTO (Kahneman-Tversky Optimization) works with unpaired data by leveraging prospect theory: humans feel losses more strongly than equivalent gains.

The KTO loss treats desirable and undesirable outputs asymmetrically. For a good response, it encourages a high log-ratio (the policy assigning more probability than the reference); for a bad response, it penalizes a high log-ratio. The asymmetry mirrors human loss aversion:

L_KTO = λ · (1 − σ(β · gap))   [desirable]
L_KTO = λ · (1 − σ(−β · gap))   [undesirable]

where gap is the policy-vs-reference log-ratio, log(π(y|x) / π_ref(y|x)).
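This simplified form is easy to sketch (the full KTO loss also uses separate λ weights for desirable and undesirable examples and a reference-point term; this toy version keeps only the asymmetry):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_loss(log_ratio: float, desirable: bool,
             beta: float = 0.1, lam: float = 1.0) -> float:
    """Simplified KTO-style loss for one unpaired example.
    log_ratio = log(pi(y|x) / pi_ref(y|x))."""
    if desirable:
        return lam * (1.0 - sigmoid(beta * log_ratio))
    return lam * (1.0 - sigmoid(-beta * log_ratio))

# The same log-ratio is rewarded on a thumbs-up example and
# penalized on a thumbs-down one.
good = kto_loss(2.0, desirable=True)
bad = kto_loss(2.0, desirable=False)
print(good < bad)  # True
```

Because each example carries its own label, no pairing of responses to the same prompt is ever needed.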

ORPO (Odds Ratio Preference Optimization) takes yet another approach: it combines the SFT loss with a preference signal in a single training objective, using the odds ratio of generating preferred vs rejected responses.

Method | Data Required | Stages | Key Idea
RLHF | Paired preferences | SFT → RM → PPO | Train reward model, then optimize via RL
DPO | Paired preferences | SFT → DPO | Direct loss on preference pairs
KTO | Binary (good/bad) | SFT → KTO | Loss-averse binary feedback
ORPO | Paired preferences | Single stage | Combine SFT + odds-ratio preference
Trend: The field is moving toward simpler methods. RLHF requires the most infrastructure (RL loop, reward model, reference model all in memory). DPO simplifies this considerably. KTO simplifies the data requirements. ORPO simplifies the pipeline to a single pass.
Method Comparison

Adjust the complexity and data axes to see how each method trades off simplicity vs data requirements.

Check: What kind of data does KTO need (unlike DPO)?

Chapter 6: Constitutional AI

Human annotation is expensive and doesn't scale. Anthropic's Constitutional AI (CAI) asks: can the AI critique and revise itself using a set of written principles (a "constitution")? The answer is yes, and it works surprisingly well.

CAI has two phases. In the critique-revision phase, the model generates a response, then is asked to evaluate it against principles like "is this harmful?" or "is this honest?" and rewrite it. The revised responses become the SFT training data. In the RLAIF phase (RL from AI Feedback), the AI itself generates preference labels instead of human annotators.

Constitution
A set of principles: "Be helpful," "Avoid harm," "Be honest," etc.
Critique
Model reviews its own response against each principle.
Revision
Model rewrites response to address critiques.
RLAIF
AI-generated preferences used for RL training (instead of human labels).
Why this matters: Human feedback is the bottleneck. If we need millions of preference labels, we can't hire enough annotators. CAI provides a scalable alternative: the model's own judgment, guided by explicit principles. The constitution makes the values transparent and auditable.
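The critique-revision loop can be sketched mechanically. Everything below is a stand-in (the constitution entries, the string-matching "critique," and the revision rule are placeholders; real CAI prompts the model itself for each step):

```python
# Toy critique-revision loop. Real CAI asks the model to critique and
# rewrite its own text; here both steps are string-level stand-ins.

CONSTITUTION = {
    "avoid harm": lambda r: "dangerous" in r,
    "be honest": lambda r: "guaranteed" in r,
}

def revise(response: str, principle: str) -> str:
    # Stand-in revision: real CAI generates a rewritten response.
    return response + f" [revised under '{principle}']"

def critique_revision_round(response: str) -> str:
    for principle, violates in CONSTITUTION.items():
        if violates(response):
            response = revise(response, principle)
    return response

out = critique_revision_round("This trick is guaranteed to work.")
print(out)
```

The structural point survives the toy: each principle is checked in turn, and only violated principles trigger a revision, so the constitution directly shapes the training data.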
Constitutional Critique-Revision

Watch how a response improves through rounds of self-critique. Each round applies a principle from the constitution.

Check: In Constitutional AI, who provides the preference labels for RL?

Chapter 7: Process Reward Models

Standard reward models (Outcome Reward Models, or ORMs) score the final answer. But for multi-step reasoning, the final answer might be right for the wrong reasons — or wrong because of a single bad step in an otherwise sound chain. Process Reward Models (PRMs) score each step of the reasoning process.

PRMs provide much denser supervision. Instead of one score for the whole response, you get a score per step. This helps in at least three ways: (1) better credit assignment (which step went wrong?), (2) more training signal per example, and (3) the ability to do tree search over reasoning paths at inference time.

ORM vs PRM: An ORM is like grading an exam by only looking at the final answer. A PRM is like grading each step of the work. The PRM catches errors earlier and provides more useful feedback for learning.
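The grading analogy can be made concrete. In this sketch the PRM aggregates step scores by taking the minimum, which is one common choice (an assumption here; products and averages are also used in practice):

```python
def orm_score(final_answer_correct: bool) -> float:
    """Outcome reward: one score for the whole chain."""
    return 1.0 if final_answer_correct else 0.0

def prm_score(step_scores: list) -> float:
    """Process reward: aggregate per-step scores. Taking the minimum
    means one bad step sinks the chain."""
    return min(step_scores)

# Right answer, shaky reasoning: the ORM gives full credit,
# the PRM flags the weak step.
steps = [0.95, 0.90, 0.30, 0.92, 0.88]
print(orm_score(True))   # 1.0
print(prm_score(steps))  # 0.3
```

This is also what makes tree search possible: with per-step scores, a search procedure can prune a reasoning branch as soon as one step scores poorly, instead of generating the whole chain first.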
PRM vs ORM Scoring

A reasoning chain with 5 steps. The ORM scores only the final answer. The PRM scores each step. Click steps to toggle correctness and see how scores change.

Feature | ORM | PRM
Granularity | Final answer only | Each reasoning step
Credit assignment | Poor (reward shared across all steps) | Good (per-step feedback)
Annotation cost | Low (check final answer) | High (verify each step)
Search capability | Limited | Enables best-of-N and tree search
Best for | Short, single-step tasks | Multi-step reasoning (math, code)
Check: What is the main advantage of a PRM over an ORM?

Chapter 8: Reward Hacking

Here's the dark side of optimization: the model will find loopholes. If the reward model gives higher scores to longer responses, the model learns to be verbose. If the reward model prefers confident-sounding text, the model learns to sound confident even when wrong. This is reward hacking — the policy exploits imperfections in the proxy reward.

Common failure modes include:

- Length gaming: longer responses score higher, so the model pads every answer.
- Sycophancy: agreement reads as helpfulness, so the model flatters the user instead of correcting them.
- False confidence: assertive-sounding text scores well, so the model sounds certain even when it is wrong.

Goodhart's Law: "When a measure becomes a target, it ceases to be a good measure." The reward model is a proxy for human preferences, not the real thing. Push optimization pressure hard enough and the proxy breaks down.
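Goodhart's law is easy to demonstrate with a toy proxy. Suppose the proxy rewards raw length while true quality peaks at a moderate length and then degrades (all coefficients below are invented for illustration):

```python
# Toy Goodhart demo: optimizing the proxy and optimizing true quality
# pick very different responses. Coefficients are made up.

def proxy_reward(length: int) -> float:
    return 0.01 * length                       # longer always looks better

def true_quality(length: int) -> float:
    return 0.02 * length - 0.0001 * length**2  # peaks at length 100

best_for_proxy = max(range(1, 401), key=proxy_reward)
best_for_quality = max(range(1, 401), key=true_quality)
print(best_for_proxy)    # 400: the proxy wants maximum verbosity
print(best_for_quality)  # 100: real quality peaked long before
```

Mild optimization pressure (short lengths) improves both measures together; it is only when the optimizer pushes hard that the proxy and the true objective come apart. That divergence is exactly what the KL penalty is meant to limit.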
Reward Hacking in Action

Watch the proxy reward climb while true quality degrades. The green line is proxy reward, the teal line is true quality. Adjust KL penalty to see the mitigation effect.

Controls: KL penalty β (default 0.05), training steps (default 100)

Mitigation Strategies

Strategy | How It Helps
KL penalty | Limits how far the policy can drift from the reference
Reward model ensembles | Multiple RMs make it harder to find shared blind spots
Length normalization | Score per-token rather than per-response
Adversarial training | Deliberately look for hacking patterns and retrain
Regular RM refresh | Retrain the RM on the policy's new distribution
Check: Sycophancy is an example of reward hacking because...

Chapter 9: Open Problems

Alignment research is far from solved. As models become more capable, the challenges deepen. Here are the frontier problems that keep researchers up at night:

Scalable Oversight

How do you supervise a model that can perform tasks beyond human ability? If the model writes code that's too complex for the annotator to evaluate, or reasons about problems the human can't verify, the entire preference-based framework breaks down. Proposals include recursive reward modeling (use AI to help humans evaluate AI), debate (two AIs argue, human judges), and market-based mechanisms.

Weak-to-Strong Generalization

Can a weaker model supervise a stronger one? If GPT-2-level labels are used to fine-tune a GPT-4-level model, can the strong model generalize beyond the quality of its supervisor? Early results from OpenAI suggest this partially works — the strong model can "figure out" what the weak supervisor meant, analogous to a smart student learning from an imperfect teacher. But it doesn't fully close the gap.

Interpretability for Alignment

Can we look inside the model and verify its "values" directly? If we could read the model's internal representations and confirm it's being helpful for the right reasons (not just gaming the reward), that would be far more robust than any behavioral test. Mechanistic interpretability aims to make this possible.

The big picture: Current alignment is behavioral — we judge models by their outputs. Future alignment may be mechanistic — we verify models by their internals. The shift from "does it behave well?" to "does it reason well?" is the frontier.
Alignment Research Landscape

The map of open alignment problems. Each node represents a research direction. Hover to see connections.

Problem | Core Question | Status
Scalable oversight | How to evaluate superhuman outputs? | Active research
Weak-to-strong | Can weak supervisors train strong models? | Promising early results
Interpretability | Can we verify alignment by reading internals? | Rapid progress
Robustness | Does alignment hold under distribution shift? | Largely unsolved
Multi-agent | How to align interacting agents? | Early stage
"The AI alignment problem is not about making AI obey us. It's about making AI understand what we actually mean."
— paraphrasing Stuart Russell

You now understand the full landscape of alignment: from RLHF to DPO, from reward hacking to the open frontier. The field is young, the problems are deep, and the stakes are high.

Check: What is the "scalable oversight" problem?