What Is It?

Alignment is the set of methods for making AI behavior match human preferences. A pretrained language model can generate text — alignment determines which text it prefers to generate. It turns a capable model into a helpful, harmless, honest one.

The core problem: pretraining optimizes for next-token prediction, which produces a model that can write anything — poetry, code, toxicity, misinformation, all with equal facility. Alignment adds a second objective: generate text that humans would actually prefer.

Key Insight
Capability and alignment are orthogonal. A highly capable model without alignment is dangerous. A well-aligned model without capability is useless. Modern AI research must solve both simultaneously.

The RLHF Pipeline

Reinforcement Learning from Human Feedback is the original alignment recipe, pioneered by Christiano et al. (2017) and scaled by OpenAI for InstructGPT (2022). It has three stages:

[Interactive figure: the three-stage RLHF pipeline]

Stage 1 — SFT: Fine-tune the pretrained LLM on high-quality demonstrations (human-written ideal responses). This teaches the format and style of helpful answers.

Stage 2 — Reward Model: Collect preference pairs (human says response A > B), then train a classifier to predict which response humans prefer. This is the learned reward signal.

Stage 3 — PPO: Use the reward model as a scoring function and optimize the policy (the LLM) via Proximal Policy Optimization. A KL penalty keeps the policy close to the SFT baseline to prevent reward hacking.
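The objective in Stage 3 can be sketched as a shaped reward: the reward model's score minus a penalty for drifting from the reference policy. A minimal sketch (function name and coefficient are illustrative, not from any specific library):

```python
def shaped_reward(rm_score, logp_policy, logp_ref, kl_coef=0.1):
    """Reward signal optimized in RLHF PPO (sketch, not a full PPO loop).

    rm_score    : scalar score from the reward model
    logp_policy : log-prob of the response under the current policy
    logp_ref    : log-prob under the frozen SFT/reference model
    """
    kl_estimate = logp_policy - logp_ref   # simple per-sample KL estimate
    return rm_score - kl_coef * kl_estimate
```

If the policy assigns a response much higher probability than the reference does, the KL term eats into the reward, which is exactly what discourages reward hacking.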


Core Methods

RLHF was the beginning, not the end. Researchers have developed increasingly elegant alternatives.

RLHF RL-Based

Train a reward model on preference pairs, then optimize the policy with PPO. The OG approach. Requires 4 models in memory (policy, reference, reward, value). High infrastructure cost, but proven at scale (InstructGPT, ChatGPT).

DPO Direct

Direct Preference Optimization. Skip the reward model entirely — reparameterize the reward as a function of the policy itself. Loss: -log σ(β(log π(y_w)/π_ref(y_w) - log π(y_l)/π_ref(y_l))). Just a classification loss on preference pairs. Dramatically simpler.
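The loss above is simple enough to write out directly. A minimal per-pair sketch in plain Python, assuming the token log-probabilities have already been summed per response:

```python
import math

def dpo_loss(pi_logp_w, pi_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed token log-probabilities (floats).

    pi_*  : log-probs under the policy being trained
    ref_* : log-probs under the frozen reference (SFT) model
    """
    logratio_w = pi_logp_w - ref_logp_w   # log π(y_w)/π_ref(y_w)
    logratio_l = pi_logp_l - ref_logp_l   # log π(y_l)/π_ref(y_l)
    margin = beta * (logratio_w - logratio_l)
    # -log σ(margin), computed stably as log(1 + e^(-margin))
    return math.log1p(math.exp(-margin))
```

When the policy exactly matches the reference, both log-ratios are zero and the loss is log 2; training drives the chosen response's log-ratio above the rejected one's.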

KTO Binary

Kahneman-Tversky Optimization. Works with binary feedback (thumbs up/down) — no paired comparisons needed. Leverages prospect theory: losses loom larger than gains. Practical for production systems where paired data is expensive.

ORPO Odds Ratio

Odds Ratio Preference Optimization. Combines SFT and alignment into a single stage by adding an odds-ratio penalty. No reference model needed, no separate SFT step. The simplest pipeline of all.
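A sketch of the odds-ratio term, which ORPO adds to the plain SFT loss with some weight. The function name is illustrative; inputs are average per-token log-probs under the single policy model, with no reference model anywhere:

```python
import math

def orpo_odds_ratio_term(avg_logp_w, avg_logp_l):
    """ORPO's odds-ratio penalty (sketch; added to the standard SFT loss).

    odds(y) = p / (1 - p), with p = exp(average per-token log-prob),
    so p must lie in (0, 1), i.e. the inputs must be negative.
    """
    def log_odds(logp):
        p = math.exp(logp)
        return math.log(p / (1.0 - p))

    ratio = log_odds(avg_logp_w) - log_odds(avg_logp_l)
    # -log σ(ratio): small when the chosen response's odds dominate
    return math.log1p(math.exp(-ratio))
```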

RLAIF AI Feedback

Reinforcement Learning from AI Feedback. Replace human annotators with a capable AI model that generates preference labels. Scales annotation without human bottleneck. Used in Anthropic's Constitutional AI pipeline and Google's research.

Constitutional AI Self-Critique

Define a set of principles (a “constitution”). The model critiques and revises its own outputs against those principles. Then train on the self-revised data. Reduces reliance on human labelers for safety-related feedback.


Reward Modeling

The reward model is the lynchpin of RLHF. It translates fuzzy human preferences into a scalar signal that RL can optimize. Here is how it works:

Preference Collection

Given a prompt, the model generates two responses. A human annotator picks the better one. This creates a preference pair: y_w (chosen) > y_l (rejected).

[Interactive figure: preference pair comparison]
Prompt: "Explain quantum computing in simple terms."
Chosen (y_w): "Quantum computing uses qubits that can be 0, 1, or both at once (superposition). This lets quantum computers explore many solutions simultaneously, making them powerful for specific problems like cryptography and drug discovery."
Rejected (y_l): "Quantum computing is a type of computing that leverages quantum mechanical phenomena such as superposition and entanglement to perform operations on data using quantum bits or qubits, which differ from classical bits in their ability to exist in multiple states simultaneously according to the principles of quantum mechanics."

Bradley-Terry Model

The reward model is trained using the Bradley-Terry framework: the probability that response A is preferred over B is modeled as:

P(y_w > y_l) = σ(r(x, y_w) - r(x, y_l))

where r(x, y) is the scalar reward and σ is the sigmoid function. Training minimizes the negative log-likelihood of the observed preferences.
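That negative log-likelihood is a one-liner. A minimal sketch, with the two scalar rewards assumed to come from the reward model:

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """Negative log-likelihood of one preference under Bradley-Terry.

    r_chosen, r_rejected are the scalar rewards r(x, y_w) and r(x, y_l).
    Minimizing this pushes the chosen reward above the rejected one.
    """
    diff = r_chosen - r_rejected
    # -log σ(diff), written stably as log(1 + e^(-diff))
    return math.log1p(math.exp(-diff))
```

Note the loss depends only on the reward *difference*: the Bradley-Terry model is invariant to adding a constant to all rewards, which is why reward scores have no absolute scale.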

Process vs. Outcome Reward Models

Outcome Reward Model (ORM): scores the final answer only. Simple but can reward correct answers reached by wrong reasoning.

Process Reward Model (PRM): scores each intermediate step. More supervision signal, catches errors earlier, but much more expensive to annotate.


Process Reward Models

Process reward models score each step of reasoning, not just the final answer. This is critical for math and coding, where a single wrong step invalidates everything downstream. OpenAI's PRM800K dataset provides step-level labels for 800K math reasoning steps.

[Interactive figure: step-by-step PRM scoring]
Problem: solve 2x + 5 = 17
Step 1: 2x + 5 = 17
Step 2: 2x = 17 - 5 = 12
Step 3: x = 12 / 2 = 6
Answer: x = 6
Why PRMs Matter
On the MATH benchmark, best-of-N selection with a PRM outperforms the same method with an ORM by a significant margin. The PRM catches errors mid-chain, while the ORM only sees the final answer and can be fooled by correct-looking but wrong reasoning.
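Best-of-N with a PRM can be sketched in a few lines. One common aggregation (an assumption here, not the only choice) scores a solution as the product of its per-step correctness probabilities:

```python
import math

def prm_score(step_probs):
    """Solution score = product of per-step correctness probabilities.

    Computed in log space for stability; one bad step sinks the chain.
    """
    return math.exp(sum(math.log(p) for p in step_probs))

def best_of_n(candidates):
    """Pick the candidate answer with the highest PRM score.

    `candidates` maps each final answer to its per-step probabilities.
    """
    return max(candidates, key=lambda ans: prm_score(candidates[ans]))
```

A chain with one low-probability step loses to a chain that is solid throughout, even if both reach plausible-looking answers; that is the mid-chain error detection an ORM cannot do.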

Reward Hacking

The model finds loopholes in the reward signal. It learns to maximize the reward model's score without actually being more helpful. This is Goodhart's Law applied to AI: “When a measure becomes a target, it ceases to be a good measure.”

[Interactive figure: reward model score vs. actual human rating over training. The two curves diverge as training proceeds; the widening gap is the hacking.]

Common Failure Modes

Verbosity Bias

Longer responses get higher reward scores, even when brevity would be better. The model learns: longer = better.

Sycophancy

The model agrees with whatever the user says, even if the user is wrong. Agreeable = higher reward.

Formatting Tricks

Excessive use of bullet points, bold text, and headers to appear thorough without adding substance.

Hedging

Overuse of qualifiers and disclaimers to avoid being rated as wrong, even when confident answers are appropriate.

Mitigation Strategies

KL Penalty

Penalize the policy for diverging too far from the SFT baseline. Prevents extreme exploitation.

RM Ensembles

Use multiple reward models and take the conservative estimate. Harder to hack all of them.

Iterative RLHF

Retrain the reward model on new policy outputs. The reward model co-evolves with the policy.

Length Penalty

Normalize reward by response length, or directly penalize verbose responses.
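Two of these mitigations compose naturally. A sketch (names and the α value are illustrative) that takes the conservative estimate over an ensemble and charges a per-token cost:

```python
def robust_reward(rm_scores, n_tokens, alpha=0.001):
    """Conservative ensemble score with a length penalty (sketch).

    rm_scores : scores from an ensemble of reward models; taking the
                minimum means the policy must fool every model at once.
    alpha     : per-token cost that removes the easy verbosity exploit.
    """
    return min(rm_scores) - alpha * n_tokens
```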


Training Pipeline Visualization

The full alignment pipeline, from raw pretraining to deployed model:

End-to-End Alignment Pipeline Overview
1. Pretrain: next-token prediction on trillions of tokens
2. SFT: fine-tune on human demonstrations
3. Reward Model: train on preference pairs (A > B)
4. RL (PPO/DPO): optimize the policy against the reward signal
5. Deploy: safety evals, red-teaming, monitoring

DPO vs. RLHF: Architecture Comparison


RLHF (PPO)

  • Requires 4 models: policy, reference, reward, value
  • Separate reward model training phase
  • PPO optimization with clipped objective
  • KL penalty to stay near reference
  • Complex infrastructure, high memory
  • Proven at massive scale (GPT-4, Claude)
  • More flexible: reward model can be reused

DPO (Direct)

  • Requires 2 models: policy and reference only
  • No separate reward model needed
  • Simple classification loss on preferences
  • KL constraint is implicit in the loss
  • Simple to implement, lower memory
  • Proven on smaller models, scaling ongoing
  • Must retrain for different preference data

[Interactive: DPO Loss Explorer. Example setting: chosen log-ratio 0.50, rejected log-ratio -0.50, β = 0.10, giving margin 1.00, σ(β · margin) ≈ 0.52, DPO loss ≈ 0.65, implicit reward 0.10.]

Method Comparison: RLHF vs. DPO vs. KTO vs. ORPO

Property          | RLHF (PPO)                     | DPO                     | KTO                     | ORPO
Year              | 2017/2022                      | 2023                    | 2024                    | 2024
Data required     | Preference pairs               | Preference pairs        | Binary feedback         | Preference pairs
Reward model      | Required (separate)            | Not needed              | Not needed              | Not needed
Reference model   | Yes (frozen)                   | Yes (frozen)            | Yes (frozen)            | Not needed
Models in memory  | 4 (policy, ref, RM, value)     | 2 (policy, ref)         | 2 (policy, ref)         | 1 (policy only)
Separate SFT      | Yes                            | Yes                     | Yes                     | No (combined)
Implementation    | Complex (RL loop)              | Simple (classification) | Simple (classification) | Simple (SFT + penalty)
KL constraint     | Explicit penalty               | Implicit in loss        | Implicit in loss        | Via odds ratio
Scale             | Proven at massive scale        | Scaling ongoing         | Early research          | Early research
Key paper         | Christiano+ 2017, Ouyang+ 2022 | Rafailov+ 2023          | Ethayarajh+ 2024        | Hong+ 2024
DPO Loss Function
L_DPO(π; π_ref) = -E[ log σ( β ( log π(y_w|x)/π_ref(y_w|x) - log π(y_l|x)/π_ref(y_l|x) ) ) ]

The implicit reward is: r(x,y) = β log π(y|x)/π_ref(y|x) + const
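Plugging illustrative numbers into the loss makes the pieces concrete. The log-ratios below are made up for the example:

```python
import math

beta = 0.1
logratio_w = 0.5    # log π(y_w|x)/π_ref(y_w|x), hypothetical
logratio_l = -0.5   # log π(y_l|x)/π_ref(y_l|x), hypothetical

margin = logratio_w - logratio_l               # 1.0
p_prefer = 1 / (1 + math.exp(-beta * margin))  # σ(β · margin) ≈ 0.525
loss = -math.log(p_prefer)                     # ≈ 0.644
implicit_reward_w = beta * logratio_w          # 0.05 (up to the constant)
```

Even a full unit of log-ratio margin yields only a gentle loss at β = 0.1; larger β sharpens the preference signal but also tightens the implicit KL constraint.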