How we teach language models to be helpful, harmless, and honest — turning raw capability into something humans actually want.
A language model trained on internet text can write poetry, solve math problems, and generate working code. But it can also produce toxic content, confidently hallucinate, or help with dangerous requests. Capability is not alignment. The model learned to predict the next token — it didn't learn what humans actually want.
The alignment problem is the gap between what a model can do and what it should do. A base model completes text. An aligned model has a notion of helpful, harmless, and honest. Bridging that gap is the central challenge of modern AI safety.
Watch how a base model (predicting likely text) diverges from what an aligned model should say. The red path is raw capability. The green path is aligned behavior.
Early approaches tried to fix this with careful prompting or filtering. But these are band-aids. The real solution requires changing the model itself — teaching it a reward signal that captures human preferences. This is the story of alignment.
How do you teach a model what "good" means? You ask humans. Given a prompt, generate two candidate responses A and B. Show them to a human annotator who picks the better one. Collect thousands of these preference pairs and train a separate model — the reward model — to predict which response a human would prefer.
The reward model takes a (prompt, response) pair and outputs a scalar score. Higher score = more likely to be preferred. The key insight: we don't need humans to assign absolute scores (that's noisy and inconsistent). We only need relative judgments: "A is better than B." This is the Bradley-Terry model:

P(A ≻ B) = σ(r(A) − r(B))

where σ is the sigmoid function and r(·) is the reward model's score. The loss function pushes the reward for the preferred response above the rejected one:

L = −log σ(r(x, y_chosen) − r(x, y_rejected))
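The preference probability and loss fit in a few lines. A minimal sketch in plain Python (the reward values are made up for illustration):

```python
import math

def bradley_terry_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood that the chosen response wins under Bradley-Terry."""
    gap = r_chosen - r_rejected
    # sigma(gap) is the predicted probability that the annotator prefers `chosen`
    prob_chosen = 1.0 / (1.0 + math.exp(-gap))
    return -math.log(prob_chosen)

# A zero gap means the model is indifferent: loss = -log(0.5) ~ 0.693.
# A larger gap in the right direction drives the loss toward zero.
print(bradley_terry_loss(0.0, 0.0))
print(bradley_terry_loss(2.0, 0.0))
```

This is exactly what the slider below visualizes: as the reward gap grows, the loss decays toward zero; a negative gap (rejected scored higher) blows it up.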
Drag the slider to adjust the reward gap between chosen and rejected responses. Watch how the Bradley-Terry loss changes.
RLHF (Reinforcement Learning from Human Feedback) is a three-stage pipeline that transforms a base model into an aligned assistant. Each stage builds on the last:
Stage 1 (SFT) gives the model the right "shape" — it learns to follow instructions, use a helpful tone, and structure answers clearly. But SFT alone is limited by the quality and diversity of demonstrations.
Stage 2 (Reward Model) captures nuanced preferences that are hard to demonstrate. It's easier to judge than to demonstrate: a human can quickly say "A is better than B" even if writing the perfect response from scratch would take much longer.
Stage 3 (PPO) is where the magic happens. The policy model generates responses, the reward model scores them, and PPO updates the policy to produce higher-scoring outputs. A KL penalty prevents the model from drifting too far from the SFT checkpoint.
Interactive pipeline diagram. Click each stage to highlight its role and data flow.
Proximal Policy Optimization (PPO) is the workhorse RL algorithm behind RLHF. The idea: generate a response, score it with the reward model, then nudge the policy to make high-reward responses more likely — but not too much per step.
PPO uses a clipped surrogate objective. The ratio r_t = π_θ(a|s) / π_old(a|s) measures how much the policy has changed. PPO clips this ratio to [1−ε, 1+ε], preventing destructive updates:

L_CLIP = E_t[ min(r_t · A_t, clip(r_t, 1−ε, 1+ε) · A_t) ]
But there's a critical addition for LLMs: a KL penalty that keeps the policy close to a reference model (usually the SFT checkpoint). Without it, the model would "overoptimize" — finding weird outputs that score high on the reward model but are gibberish to humans.
Adjust the advantage (positive = good action, negative = bad action) and the clip range ε. The teal curve is the clipped objective.
| Symbol | Meaning |
|---|---|
| r_t | Probability ratio (new policy / old policy) |
| A_t | Advantage: how much better than expected |
| ε | Clip range (typically 0.1–0.2) |
| β | KL penalty coefficient |
| πref | Reference policy (SFT model) |
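Putting the clipped objective and the KL penalty together: a minimal plain-Python sketch, using single scalar ratios and advantages rather than batched tensors, with made-up numbers for illustration:

```python
def ppo_objective(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Clipped surrogate: min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def kl_penalty(logp_policy: float, logp_ref: float, beta: float = 0.1) -> float:
    """Per-token penalty using the simple KL estimate log pi - log pi_ref."""
    return beta * (logp_policy - logp_ref)

# Positive advantage: the gain is capped once the ratio exceeds 1 + eps.
print(ppo_objective(1.5, advantage=1.0))   # clipped at 1.2
# Negative advantage: min() keeps the unclipped, more pessimistic value.
print(ppo_objective(1.5, advantage=-1.0))  # stays at -1.5
```

The asymmetry is the point: improvements are clipped, but penalties are not, so the policy can never "earn" a big update by overshooting in one step.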
What if we could skip the reward model entirely? DPO (Rafailov et al., 2023) makes a beautiful mathematical observation: the optimal policy under the RLHF objective has a closed-form solution in terms of the reward function. By rearranging terms, we can express the reward implicitly through the policy itself:

r(x, y) = β log( π(y|x) / π_ref(y|x) ) + β log Z(x)

Substituting this into the Bradley-Terry preference model gives the DPO loss directly in terms of the policy log-probabilities — no reward model needed (the intractable Z(x) cancels):

L_DPO = −log σ( β log(π(y_w|x) / π_ref(y_w|x)) − β log(π(y_l|x) / π_ref(y_l|x)) )

where y_w is the preferred (winning) response and y_l is the rejected (losing) response. DPO increases the relative log-probability of preferred responses while decreasing that of rejected ones, all while staying anchored to the reference model through the log-ratio terms.
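The DPO loss is simple enough to compute by hand. A sketch in plain Python, with made-up sequence log-probabilities and an illustrative β = 0.1:

```python
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    """DPO loss on one preference pair, from policy and reference log-probs."""
    # Implicit rewards: beta * log(pi / pi_ref) for each response
    r_w = beta * (logp_w - ref_logp_w)
    r_l = beta * (logp_l - ref_logp_l)
    # Bradley-Terry negative log-likelihood on the implicit reward gap
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l))))

# If the policy matches the reference exactly, both implicit rewards are
# zero and the loss is -log(0.5). Favoring the winner lowers the loss.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

Note that only the *gap* between the two log-ratios matters, which is why the loss stays anchored: raising both responses equally relative to the reference changes nothing.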
Adjust the log-probability ratios for winning and losing responses. The DPO loss pushes the gap apart.
DPO still requires paired preferences (A vs B for the same prompt). But what if you only have binary feedback — thumbs up or thumbs down on individual responses? KTO (Kahneman-Tversky Optimization) works with unpaired data by leveraging prospect theory: humans feel losses more strongly than equivalent gains.
The KTO loss treats desirable and undesirable outputs asymmetrically. For a good response, it encourages the log-ratio log π(y|x)/π_ref(y|x) to be high; for a bad response, it penalizes the log-ratio being high. The asymmetry mirrors human loss aversion.
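The asymmetric treatment can be sketched as follows. This is a simplified illustration, not the full KTO objective: it omits KTO's KL-based reference point, and the λ weights are made-up values, not the paper's defaults:

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def kto_value(logp: float, ref_logp: float, desirable: bool,
              beta: float = 0.1,
              lambda_d: float = 1.0, lambda_u: float = 1.33) -> float:
    """Simplified KTO-style value for one unpaired example.

    Higher value = lower loss. lambda_u > lambda_d expresses loss
    aversion: a bad output hurts more than a good one helps.
    (Illustrative weights; KTO also subtracts a KL reference point z_0.)
    """
    r = beta * (logp - ref_logp)  # implicit reward, as in DPO
    if desirable:
        return lambda_d * sigmoid(r)   # reward a high log-ratio
    return lambda_u * sigmoid(-r)      # penalize a high log-ratio
```

Because each example is scored on its own, no pairing is needed: a stream of thumbs-up/thumbs-down labels on individual responses is enough training signal.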
ORPO (Odds Ratio Preference Optimization) takes yet another approach: it combines the SFT loss with a preference signal in a single training objective, using the odds ratio of generating preferred vs rejected responses.
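A sketch of the ORPO-style odds-ratio term that gets added to the SFT loss. It uses per-token *average* log-probabilities (so they exponentiate to valid probabilities), and the weight λ is an illustrative value:

```python
import math

def orpo_preference_term(avg_logp_w: float, avg_logp_l: float,
                         lam: float = 0.1) -> float:
    """Odds-ratio preference term (sketch); added to the usual SFT loss.

    avg_logp_* are per-token average log-probs, so exp() of them is a
    probability in (0, 1) and the odds p / (1 - p) are well defined.
    """
    def log_odds(avg_logp: float) -> float:
        p = math.exp(avg_logp)
        return math.log(p / (1.0 - p))

    log_odds_ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    # -lam * log sigmoid(log odds ratio): small when the winner's odds dominate
    return -lam * math.log(1.0 / (1.0 + math.exp(-log_odds_ratio)))
```

The appeal is operational: one pass, one loss, no reference model and no separate preference-tuning stage.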
| Method | Data Required | Stages | Key Idea |
|---|---|---|---|
| RLHF | Paired preferences | SFT → RM → PPO | Train reward model, then optimize via RL |
| DPO | Paired preferences | SFT → DPO | Direct loss on preference pairs |
| KTO | Binary (good/bad) | SFT → KTO | Loss-averse binary feedback |
| ORPO | Paired preferences | Single stage | Combine SFT + odds-ratio preference |
Adjust the complexity and data axes to see how each method trades off simplicity vs data requirements.
Human annotation is expensive and doesn't scale. Anthropic's Constitutional AI (CAI) asks: can the AI critique and revise itself using a set of written principles (a "constitution")? The answer is yes, and it works surprisingly well.
CAI has two phases. In the critique-revision phase, the model generates a response, then is asked to evaluate it against principles like "is this harmful?" or "is this honest?" and rewrite it. The revised responses become the SFT training data. In the RLAIF phase (RL from AI Feedback), the AI itself generates preference labels instead of human annotators.
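The critique-revision phase can be sketched in a few lines. Here `generate()` is a stand-in for a hypothetical call to the model being aligned, and the principles are illustrative, not Anthropic's actual constitution:

```python
# Illustrative principles; a real constitution contains many more.
CONSTITUTION = [
    "Identify ways the response could be harmful, and remove them.",
    "Identify ways the response could be dishonest, and correct them.",
]

def critique_and_revise(generate, prompt: str, rounds: int = 2) -> str:
    """One pass of CAI's critique-revision phase (sketch).

    `generate` is any callable taking a prompt string and returning text.
    """
    response = generate(prompt)
    for principle in CONSTITUTION[:rounds]:
        critique = generate(
            f"Critique this response to '{prompt}' using the principle: "
            f"{principle}\n\nResponse: {response}")
        response = generate(
            f"Rewrite the response to address this critique.\n\n"
            f"Critique: {critique}\n\nOriginal: {response}")
    return response  # revised responses become SFT training data
```

Each round is just two extra model calls — a critique and a rewrite — which is why this scales so much better than human annotation.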
Watch how a response improves through rounds of self-critique. Each round applies a principle from the constitution.
Standard reward models (Outcome Reward Models, or ORMs) score the final answer. But for multi-step reasoning, the final answer might be right for the wrong reasons — or wrong because of a single bad step in an otherwise sound chain. Process Reward Models (PRMs) score each step of the reasoning process.
PRMs provide much denser supervision. Instead of one score for the whole response, you get a score per step. This helps in at least three ways: (1) better credit assignment (which step went wrong?), (2) more training signal per example, and (3) the ability to do tree search over reasoning paths at inference time.
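The search advantage can be illustrated with best-of-N selection over hypothetical per-step scores, using min-aggregation for the PRM (one common choice: a chain is only as strong as its weakest step):

```python
def orm_score(chain_scores: list[float]) -> float:
    """ORM view: only the final step's score matters."""
    return chain_scores[-1]

def prm_score(chain_scores: list[float]) -> float:
    """PRM view (min-aggregation): the weakest step bounds the chain."""
    return min(chain_scores)

def best_of_n(chains: list[list[float]], scorer) -> list[float]:
    """Pick the chain with the highest score under the given scorer."""
    return max(chains, key=scorer)

# Hypothetical per-step scores for three reasoning chains
chains = [
    [0.9, 0.2, 0.9, 0.9],  # one bad middle step, confident final answer
    [0.8, 0.8, 0.8, 0.8],  # uniformly sound
    [0.6, 0.9, 0.7, 0.5],  # weak finish
]
print(best_of_n(chains, orm_score))  # ORM is fooled by the confident ending
print(best_of_n(chains, prm_score))  # PRM prefers the uniformly sound chain
```

The same per-step scores also support tree search: prune a partial chain as soon as one step scores badly, instead of generating (and paying for) the rest of it.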
A reasoning chain with 5 steps. The ORM scores only the final answer. The PRM scores each step. Click steps to toggle correctness and see how scores change.
| Feature | ORM | PRM |
|---|---|---|
| Granularity | Final answer only | Each reasoning step |
| Credit assignment | Poor (reward shared across all steps) | Good (per-step feedback) |
| Annotation cost | Low (check final answer) | High (verify each step) |
| Search capability | Limited | Enables best-of-N and tree search |
| Best for | Short, single-step tasks | Multi-step reasoning (math, code) |
Here's the dark side of optimization: the model will find loopholes. If the reward model gives higher scores to longer responses, the model learns to be verbose. If the reward model prefers confident-sounding text, the model learns to sound confident even when wrong. This is reward hacking — the policy exploits imperfections in the proxy reward.
Common failure modes include:

- Length gaming: padding answers with filler because longer responses score higher.
- False confidence: asserting wrong answers in the authoritative tone the reward model rewards.
- Sycophancy: agreeing with the user's stated views instead of giving accurate answers.
Watch the proxy reward climb while true quality degrades. The green line is proxy reward, the teal line is true quality. Adjust KL penalty to see the mitigation effect.
| Strategy | How It Helps |
|---|---|
| KL penalty | Limits how far the policy can drift from the reference |
| Reward model ensembles | Multiple RMs make it harder to find shared blind spots |
| Length normalization | Score per-token rather than per-response |
| Adversarial training | Deliberately look for hacking patterns and retrain |
| Regular RM refresh | Retrain the RM on policy's new distribution |
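The KL-penalty row can be made concrete: the reward actually fed to PPO is the RM score minus a scaled KL estimate. A sketch with made-up numbers (β = 0.1 is illustrative):

```python
def shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                  beta: float = 0.1) -> float:
    """Reward fed to PPO: proxy reward minus a KL penalty term.

    The KL is approximated by log pi(y|x) - log pi_ref(y|x), summed
    over the response's tokens (scalars here for simplicity).
    """
    kl_estimate = logp_policy - logp_ref
    return rm_score - beta * kl_estimate

# A response the RM loves but the reference model finds bizarre
# (huge log-prob gap) ends up with a *negative* shaped reward:
print(shaped_reward(rm_score=5.0, logp_policy=-20.0, logp_ref=-80.0))  # -> -1.0
```

This is the chart's mitigation knob in equation form: raising β makes drifting away from the reference model expensive, which caps how far the policy can chase the proxy reward into hacked territory.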
Alignment research is far from solved. As models become more capable, the challenges deepen. Here are the frontier problems that keep researchers up at night:
How do you supervise a model that can perform tasks beyond human ability? If the model writes code that's too complex for the annotator to evaluate, or reasons about problems the human can't verify, the entire preference-based framework breaks down. Proposals include recursive reward modeling (use AI to help humans evaluate AI), debate (two AIs argue, human judges), and market-based mechanisms.
Can a weaker model supervise a stronger one? If GPT-2-level labels are used to fine-tune a GPT-4-level model, can the strong model generalize beyond the quality of its supervisor? Early results from OpenAI suggest this partially works — the strong model can "figure out" what the weak supervisor meant, analogous to a smart student learning from an imperfect teacher. But it doesn't fully close the gap.
Can we look inside the model and verify its "values" directly? If we could read the model's internal representations and confirm it's being helpful for the right reasons (not just gaming the reward), that would be far more robust than any behavioral test. Mechanistic interpretability aims to make this possible.
The map of open alignment problems. Each node represents a research direction. Hover to see connections.
| Problem | Core Question | Status |
|---|---|---|
| Scalable oversight | How to evaluate superhuman outputs? | Active research |
| Weak-to-strong | Can weak supervisors train strong models? | Promising early results |
| Interpretability | Can we verify alignment by reading internals? | Rapid progress |
| Robustness | Does alignment hold under distribution shift? | Largely unsolved |
| Multi-agent | How to align interacting agents? | Early stage |
You now understand the full landscape of alignment: from RLHF to DPO, from reward hacking to the open frontier. The field is young, the problems are deep, and the stakes are high.