What Is It?
Alignment is the set of methods for making AI behavior match human preferences. A pretrained language model can generate text — alignment determines which text it prefers to generate. It turns a capable model into a helpful, harmless, honest one.
The core problem: pretraining optimizes for next-token prediction, which produces a model that can write anything — poetry, code, toxicity, misinformation, all with equal facility. Alignment adds a second objective: generate text that humans would actually prefer.
The RLHF Pipeline
Reinforcement Learning from Human Feedback is the original alignment recipe, pioneered by Christiano et al. (2017) and scaled by OpenAI for InstructGPT (2022). It has three stages:
Stage 1 — SFT: Fine-tune the pretrained LLM on high-quality demonstrations (human-written ideal responses). This teaches the format and style of helpful answers.
Stage 2 — Reward Model: Collect preference pairs (human says response A > B), then train a classifier to predict which response humans prefer. This is the learned reward signal.
Stage 3 — PPO: Use the reward model as a scoring function and optimize the policy (the LLM) via Proximal Policy Optimization. A KL penalty keeps the policy close to the SFT baseline to prevent reward hacking.
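The Stage 3 objective can be sketched in a few lines. This is a toy scalar version (real implementations apply the penalty per token inside the PPO loop); the function name and the `beta` coefficient are illustrative, not from any specific codebase:

```python
def kl_penalized_reward(rm_score, logprob_policy, logprob_ref, beta=0.1):
    """RLHF reward signal: reward-model score minus a KL penalty that
    discourages the policy from drifting far from the SFT reference.
    beta is a hypothetical coefficient; real systems tune or adapt it."""
    # Single-sample Monte Carlo estimate of KL(policy || ref):
    kl_estimate = logprob_policy - logprob_ref
    return rm_score - beta * kl_estimate

# A response the reward model likes but that drifted far from the
# reference is penalized relative to one that stayed close:
print(kl_penalized_reward(rm_score=2.0, logprob_policy=-10.0, logprob_ref=-14.0))  # → 1.6
```

The larger the gap between the policy's and the reference's log-probability on the sampled response, the more of the reward-model score gets clawed back, which is exactly what blunts reward hacking.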
Core Methods
RLHF was the beginning, not the end. Researchers have developed increasingly elegant alternatives.
RLHF (RL-Based)
Train a reward model on preference pairs, then optimize the policy with PPO. The OG approach. Requires 4 models in memory (policy, reference, reward, value). High infrastructure cost, but proven at scale (InstructGPT, ChatGPT).
DPO (Direct)
Direct Preference Optimization. Skip the reward model entirely: reparameterize the reward as a function of the policy itself, which turns alignment into a simple classification loss on preference pairs:
-log σ(β(log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x)))
Dramatically simpler than the full RLHF loop.
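The loss above can be written as a minimal sketch for one preference pair, assuming sequence-level log-probabilities under the policy and the frozen reference are already computed (the function name and argument order are illustrative):

```python
import math

def dpo_loss(logp_w, logp_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss for a single preference pair.
    logp_*      : log-prob of chosen (w) / rejected (l) under the policy
    logp_ref_*  : same quantities under the frozen reference model
    beta        : strength of the implicit KL constraint"""
    margin = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    # -log sigmoid(margin), written stably:
    return math.log1p(math.exp(-margin))
```

When the policy raises the chosen response relative to the reference and lowers the rejected one, the margin grows and the loss shrinks; at equal log-ratios the loss is exactly log 2, the coin-flip baseline.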
KTO (Binary)
Kahneman-Tversky Optimization. Works with binary feedback (thumbs up/down) — no paired comparisons needed. Leverages prospect theory: losses loom larger than gains. Practical for production systems where paired data is expensive.
ORPO (Odds Ratio)
Odds Ratio Preference Optimization. Combines SFT and alignment into a single stage by adding an odds-ratio penalty. No reference model needed, no separate SFT step. The simplest pipeline of all.
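A sketch of the combined objective, assuming sequence-level log-probabilities (in practice ORPO uses length-averaged per-token log-probs; `lam` stands in for the paper's weighting hyperparameter and the names are illustrative):

```python
import math

def log_odds(logp):
    """log odds(p) = log(p / (1 - p)), computed from a log-probability."""
    p = math.exp(logp)
    return logp - math.log(1.0 - p)

def orpo_loss(nll_w, logp_w, logp_l, lam=0.1):
    """Single-stage ORPO objective sketch: the usual SFT negative
    log-likelihood on the chosen response, plus an odds-ratio penalty
    that pushes the chosen response's odds above the rejected one's."""
    margin = log_odds(logp_w) - log_odds(logp_l)
    or_term = math.log1p(math.exp(-margin))  # -log sigmoid(margin)
    return nll_w + lam * or_term
```

Because the penalty compares the policy to itself (odds of chosen vs. rejected), no frozen reference model is needed, which is what makes the pipeline a single stage.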
RLAIF (AI Feedback)
Reinforcement Learning from AI Feedback. Replace human annotators with a capable AI model that generates preference labels. Scales annotation without human bottleneck. Used in Anthropic's Constitutional AI pipeline and Google's research.
Constitutional AI (Self-Critique)
Define a set of principles (a “constitution”). The model critiques and revises its own outputs against those principles. Then train on the self-revised data. Reduces reliance on human labelers for safety-related feedback.
Reward Modeling
The reward model is the lynchpin of RLHF. It translates fuzzy human preferences into a scalar signal that RL can optimize. Here is how it works:
Preference Collection
Given a prompt, the model generates two responses. A human annotator picks the better one. This creates a preference pair: y_w (chosen) > y_l (rejected).
Bradley-Terry Model
The reward model is trained using the Bradley-Terry framework: the probability that response y_w is preferred over y_l is modeled as:
P(y_w > y_l | x) = σ(r(x, y_w) - r(x, y_l))
where r(x, y) is the scalar reward and σ is the sigmoid function.
Training minimizes the negative log-likelihood of the observed preferences.
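That negative log-likelihood is small enough to write out directly. A minimal sketch for a single preference pair, given the two scalar rewards (names are illustrative):

```python
import math

def bt_nll(r_w, r_l):
    """Bradley-Terry negative log-likelihood of one observed preference:
    -log sigmoid(r(x, y_w) - r(x, y_l)). Reward-model training minimizes
    the mean of this over all collected preference pairs."""
    return math.log1p(math.exp(-(r_w - r_l)))  # stable -log sigmoid

# The loss shrinks as the chosen response is scored higher than the
# rejected one; equal scores give the coin-flip loss log 2:
for r_w, r_l in [(1.2, 0.3), (0.5, 0.5), (-0.2, 0.9)]:
    print(round(bt_nll(r_w, r_l), 3))
```

Note this loss only constrains reward *differences*, which is why reward-model scores have no absolute scale, only relative meaning.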
Process vs. Outcome Reward Models
Outcome Reward Model (ORM): scores the final answer only. Simple but can reward correct answers reached by wrong reasoning.
Process Reward Model (PRM): scores each intermediate step. More supervision signal, catches errors earlier, but much more expensive to annotate.
Process Reward Models
Process reward models score each step of reasoning, not just the final answer. This is critical for math and coding, where a single wrong step invalidates everything downstream. OpenAI's PRM800K dataset provides step-level labels for 800K math reasoning steps.
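Turning per-step scores into a solution-level score requires an aggregation rule. Two common conventions are sketched below; this is illustrative, not any specific paper's recipe:

```python
def prm_solution_score(step_scores, how="min"):
    """Aggregate per-step PRM scores into one solution-level score.
    'min'  treats the weakest step as the bottleneck;
    'prod' multiplies per-step correctness probabilities."""
    if how == "min":
        return min(step_scores)
    prod = 1.0
    for s in step_scores:
        prod *= s
    return prod

# One bad step (0.1) tanks the whole solution under either convention,
# which is exactly the behavior you want for math and code:
print(prm_solution_score([0.9, 0.95, 0.1, 0.9]))          # min
print(prm_solution_score([0.9, 0.95, 0.1, 0.9], "prod"))  # product
```

Either rule captures the key property motivating PRMs: a single wrong step invalidates everything downstream, so the solution score should collapse with it.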
Reward Hacking
The model finds loopholes in the reward signal. It learns to maximize the reward model's score without actually being more helpful. This is Goodhart's Law applied to AI: “When a measure becomes a target, it ceases to be a good measure.”
Common Failure Modes
Verbosity Bias
Longer responses get higher reward scores, even when brevity would be better. The model learns: longer = better.
Sycophancy
The model agrees with whatever the user says, even if the user is wrong. Agreeable = higher reward.
Formatting Tricks
Excessive use of bullet points, bold text, and headers to appear thorough without adding substance.
Hedging
Overuse of qualifiers and disclaimers to avoid being rated as wrong, even when confident answers are appropriate.
Mitigation Strategies
KL Penalty
Penalize the policy for diverging too far from the SFT baseline. Prevents extreme exploitation.
RM Ensembles
Use multiple reward models and take the conservative estimate. Harder to hack all of them.
Iterative RLHF
Retrain the reward model on new policy outputs. The reward model co-evolves with the policy.
Length Penalty
Normalize reward by response length, or directly penalize verbose responses.
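The length penalty is the easiest of these to sketch. A toy version with a hypothetical per-token coefficient `alpha` (real systems may instead normalize the score by length or learn a length-debiased reward model):

```python
def length_adjusted_reward(rm_score, n_tokens, alpha=0.01):
    """Verbosity mitigation sketch: subtract a per-token penalty from
    the reward-model score. alpha is a hypothetical coefficient."""
    return rm_score - alpha * n_tokens

# A long response must now beat a short one by more than alpha per
# extra token to come out ahead, blunting the verbosity bias:
short = length_adjusted_reward(1.0, 50)    # concise answer
long_ = length_adjusted_reward(1.2, 400)   # padded answer
print(short, long_)
```

The trade-off: too large an `alpha` punishes genuinely thorough answers, so the coefficient needs tuning against held-out human judgments.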
Training Pipeline Visualization
The full alignment pipeline, from raw pretraining to deployed model: pretraining → SFT → reward-model training → RL optimization (PPO) → aligned, deployed model.
DPO vs. RLHF: Architecture Comparison
RLHF (PPO)
- Requires 4 models: policy, reference, reward, value
- Separate reward model training phase
- PPO optimization with clipped objective
- KL penalty to stay near reference
- Complex infrastructure, high memory
- Proven at massive scale (GPT-4, Claude)
- More flexible: reward model can be reused
DPO (Direct)
- Requires 2 models: policy and reference only
- No separate reward model needed
- Simple classification loss on preferences
- KL constraint is implicit in the loss
- Simple to implement, lower memory
- Proven on smaller models, scaling ongoing
- Must retrain for different preference data
Method Comparison: RLHF vs. DPO vs. KTO vs. ORPO
| Property | RLHF (PPO) | DPO | KTO | ORPO |
|---|---|---|---|---|
| Year | 2017/2022 | 2023 | 2024 | 2024 |
| Data Required | Preference pairs | Preference pairs | Binary feedback | Preference pairs |
| Reward Model | Required (separate) | Not needed | Not needed | Not needed |
| Reference Model | Yes (frozen) | Yes (frozen) | Yes (frozen) | Not needed |
| Models in Memory | 4 (policy, ref, RM, value) | 2 (policy, ref) | 2 (policy, ref) | 1 (policy only) |
| Separate SFT | Yes | Yes | Yes | No (combined) |
| Implementation | Complex (RL loop) | Simple (classification) | Simple (classification) | Simple (SFT + penalty) |
| KL Constraint | Explicit penalty | Implicit in loss | Implicit in loss | Via odds ratio |
| Scale | Proven at massive scale | Scaling ongoing | Early research | Early research |
| Key Paper | Christiano+ 2017, Ouyang+ 2022 | Rafailov+ 2023 | Ethayarajh+ 2024 | Hong+ 2024 |
One more note on DPO: its implicit reward is r(x, y) = β log π(y|x)/π_ref(y|x) + const. This reparameterization is what lets DPO skip the separate reward model entirely.