Christiano, Leike, Brown, Martic, Legg, Amodei — 2017

Deep RL from Human Preferences

Learn a reward function from pairwise human comparisons, then optimize a policy against it. This is the paper that established the modern RLHF recipe later used to train ChatGPT.

Prerequisites: Policy gradients + Reward functions + Basic probability
10 chapters, 5+ simulations

Chapter 0: The Problem

Reinforcement learning has a dirty secret: it needs a reward function. And writing a good reward function is really hard.

Suppose you want to train a robot to clean a table. What reward do you assign? Distance to crumbs? But what about the ones behind the salt shaker? Percentage of surface area cleaned? But it might knock everything off the table to maximize area. Every simple metric you try either misses the point or gets "gamed" — the agent finds a loophole that scores high but does something you never intended.

This is the reward specification problem, and it shows up everywhere.

The core issue: humans can recognize good behavior when they see it, but they can't write down a mathematical function that captures what they mean. There's a gap between what we want and what we can formalize.

The fundamental tension: RL requires a scalar reward function r(s, a) → R. But real-world goals — "clean the table nicely," "drive safely," "be helpful and honest" — resist formalization. This paper proposes a radical solution: don't write the reward function at all. Instead, let humans express preferences by comparing pairs of behaviors, and learn the reward function from those comparisons.

[Interactive demo: The Reward Specification Problem. The intended behavior is to move right smoothly, but each hand-crafted reward function produces an unintended shortcut.]

Why can't we simply hand-engineer a reward function for complex tasks like "clean a table"?

Chapter 1: The Key Insight

Here's the insight that changes everything: humans can't write down reward functions, but they can watch two short video clips and say "I prefer this one."

Pairwise comparison is one of the simplest forms of human feedback. You don't need expertise. You don't need to assign numerical scores. You just watch two 1-2 second clips of an agent's behavior and pick the one that looks better. Even a non-expert can do this quickly and consistently.

The paper's approach has three components running simultaneously:

  1. The policy interacts with the environment, generating trajectories
  2. A human compares pairs of short clips and says which is better
  3. A reward model is trained on these comparisons, learning to predict what the human would prefer

The policy then optimizes the learned reward instead of a hand-crafted one. The reward model serves as a proxy for the human — an always-available, differentiable approximation of human judgment.

Why comparisons, not scores? Absolute scores (say, 1-10 ratings) are hard to use: different people use different scales, and the same person is inconsistent over time. The authors report that comparisons were easier for humans to provide in some domains, and that in their continuous control experiments fitting comparisons worked much better than regressing on scores. "Clip A is better than clip B" is a much easier question than "how good is clip A on a scale of 1 to 10?", and pairwise comparison has long been studied in psychology as a way to elicit preferences.

The key technical question is: how do you turn binary comparison labels ("A is better than B") into a continuous reward function r(s, a)? That's what the Bradley-Terry model does, and we'll derive it in the next chapter.

Why do the authors use pairwise comparisons rather than absolute numerical scores?

Chapter 2: The Preference Model

We need a mathematical model that connects a reward function to human preferences. Enter the Bradley-Terry model, introduced in 1952 for ranking items from the outcomes of paired comparisons.

The idea: every trajectory segment σ has a hidden "quality score" — the sum of rewards along the segment:

R(σ) = ∑t r̂(ot, at)

The probability that a human prefers segment σ1 over σ2 depends exponentially on the difference in total reward:

P(σ1 ≻ σ2) = exp(R(σ1)) / (exp(R(σ1)) + exp(R(σ2)))

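This is a softmax over the total rewards of the two segments. Let's unpack what this means: when the two totals are equal, the probability is exactly 0.5; as R(σ1) − R(σ2) grows, the probability approaches 1; and adding the same constant to both totals changes nothing, so only the difference in total reward matters.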
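
The preference formula above can be evaluated directly. The following is a minimal sketch, not code from the paper; the function names are made up for illustration, and the two-branch form is only there to avoid overflow when the reward gap is large.

```python
import math

def segment_return(rewards):
    """Total predicted reward R(sigma): the sum of per-step rewards in a segment."""
    return sum(rewards)

def preference_probability(r1, r2):
    """P(sigma1 preferred over sigma2) given the two total rewards.

    exp(r1) / (exp(r1) + exp(r2)) equals the logistic function of (r1 - r2);
    the two branches keep exp() from overflowing for large gaps.
    """
    diff = r1 - r2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

print(preference_probability(0.0, 0.0))   # equal totals: 0.5
print(preference_probability(1.0, -0.5))  # about 0.82
```

Because only r1 - r2 enters the computation, shifting both totals by the same amount leaves the probability unchanged.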

The Elo connection: This is closely analogous to how Elo ratings work in chess, where the difference in Elo points predicts the probability of one player beating another through the same kind of logistic curve. Here, the difference in total predicted reward predicts which trajectory segment a human will prefer. Reward = Elo rating for behavior.
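
To make the Elo analogy concrete, the two formulas can be written side by side. This is a worked comparison rather than something taken from the paper: the second line is the standard Elo expected-score formula (base 10, scale of 400 points), and R_A, R_B denote the two players' ratings.

```latex
\begin{align*}
P(\sigma^1 \succ \sigma^2)
  &= \frac{e^{R(\sigma^1)}}{e^{R(\sigma^1)} + e^{R(\sigma^2)}}
   = \frac{1}{1 + e^{-(R(\sigma^1) - R(\sigma^2))}} \\
P(A \text{ beats } B)
  &= \frac{1}{1 + 10^{-(R_A - R_B)/400}}
   = \frac{1}{1 + e^{-(R_A - R_B)\,\ln(10)/400}}
\end{align*}
```

Both are logistic functions of a score difference. Elo simply measures that difference in units where a 400-point gap corresponds to 10-to-1 odds.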

Why exponential?

The exponential form comes from the Luce choice axiom — a principle from mathematical psychology. It says that the probability of choosing an option should be proportional to some "value" function of that option. The softmax (which uses exp) is the canonical way to turn arbitrary real-valued scores into valid probabilities that satisfy this axiom.

[Interactive demo: Bradley-Terry Model. Sliders set R(σ1) and R(σ2), initially 1.0 and -0.5. When the rewards are equal, the preference is a coin flip; as the gap grows, confidence increases.]

In the Bradley-Terry model, what does P(σ1 ≻ σ2) = exp(R(σ1)) / (exp(R(σ1)) + exp(R(σ2))) predict?

Chapter 3: Reward Learning

Now we need to actually train a neural network to be the reward function r̂(o, a). The training signal comes from human comparisons stored in a database D of triples (σ1, σ2, μ), where μ encodes which segment the human preferred.

The loss function

Given the Bradley-Terry model, we train r̂ by minimizing cross-entropy between the model's predicted preferences and the actual human labels:

loss(r̂) = − ∑(σ1,σ2,μ)∈D [μ(1) log P̂(σ1 ≻ σ2) + μ(2) log P̂(σ2 ≻ σ1)]

Where μ(1) is 1 if the human preferred σ1, μ(2) is 1 if they preferred σ2, and both are 0.5 if they said "equally good."

This is standard cross-entropy — the same loss you'd use for binary classification. The key difference: the "logits" come from summing the reward model's outputs across all timesteps in each segment, not from a single forward pass.
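
The loss can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's implementation: `reward_fn` is a hypothetical stand-in for the reward network r̂(o, a), segments are lists of (observation, action) pairs, and μ is passed as a pair (μ(1), μ(2)).

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(dataset, reward_fn):
    """Cross-entropy loss over triples (segment1, segment2, mu).

    mu is (1, 0) if segment1 was preferred, (0, 1) if segment2 was,
    and (0.5, 0.5) if the human judged them equally good.
    """
    total = 0.0
    for seg1, seg2, (mu1, mu2) in dataset:
        # The "logits" are per-step rewards summed over each segment.
        r1 = sum(reward_fn(o, a) for o, a in seg1)
        r2 = sum(reward_fn(o, a) for o, a in seg2)
        # log P(seg1 > seg2) = log sigmoid(r1 - r2), and symmetrically for seg2.
        total -= mu1 * log_sigmoid(r1 - r2) + mu2 * log_sigmoid(r2 - r1)
    return total

# Toy usage: a "reward model" that just returns the observation.
toy_reward = lambda o, a: o
data = [([(1.0, 0)], [(0.0, 0)], (1, 0))]
print(preference_loss(data, toy_reward))
```

A tie label on two identical segments gives a loss of log 2, the same value a binary classifier incurs when it predicts 0.5.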

Practical modifications

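The paper found several modifications important for making this work, including training an ensemble of reward predictors and assuming that a fraction of human labels are random.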

The 10% error assumption: The paper assumes that 10% of the time the human responds uniformly at random. Without this, the loss keeps rewarding ever-larger reward differences on pairs the model already predicts correctly. With it, the less-preferred option always keeps a probability of at least 0.05 (half of 10%), so the model gains nothing from pushing reward differences toward infinity. The effect is similar to label smoothing, and it helps keep training stable.
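
The floor of 0.05 falls out of a one-line mixture. The sketch below assumes the adjustment is applied to the predicted probability before the cross-entropy is taken; the function name and the `random_rate` parameter are illustrative, not from the paper.

```python
import math

def adjusted_preference_probability(p_model, random_rate=0.10):
    """Mix the model's prediction with a uniformly random response.

    With probability random_rate the human is assumed to pick either
    segment with probability 0.5, regardless of the rewards.
    """
    return (1.0 - random_rate) * p_model + random_rate * 0.5

# Even if the model is certain the other segment wins, the label keeps
# probability 0.05, so the per-example loss is capped at -log(0.05).
worst_case = adjusted_preference_probability(0.0)
print(worst_case, -math.log(worst_case))
```

Without the mixture, a single label that contradicts a confident prediction would contribute an unbounded loss.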

The reward model architecture

For MuJoCo tasks: a fully-connected network that takes the observation and action as input and outputs a scalar reward. For Atari: a convolutional network that takes 4 stacked frames as input. The network outputs r̂(ot, at) for each timestep, and these are summed to get R(σ) for each segment.
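
As a rough picture of the MuJoCo-style predictor, here is a tiny fully-connected network written with the standard library only. It is a sketch under assumptions: the single tanh hidden layer, the layer sizes, and the initialization are placeholders chosen for brevity, not the paper's architecture.

```python
import math
import random

def init_reward_net(input_dim, hidden_dim, seed=0):
    """Randomly initialize a one-hidden-layer network (hypothetical sizes)."""
    rng = random.Random(seed)
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)]
    b1 = [0.0] * hidden_dim
    w2 = [rng.gauss(0.0, 0.1) for _ in range(hidden_dim)]
    return w1, b1, w2, 0.0

def predict_reward(params, observation, action):
    """Scalar reward for one timestep from the concatenated (observation, action)."""
    w1, b1, w2, b2 = params
    x = list(observation) + list(action)
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden)) + b2

def segment_return(params, segment):
    """R(sigma): per-step predictions summed over the segment."""
    return sum(predict_reward(params, o, a) for o, a in segment)

params = init_reward_net(input_dim=3, hidden_dim=4)
segment = [([0.1, 0.2], [0.3]), ([0.0, -0.1], [0.5])]
print(segment_return(params, segment))
```

The same `segment_return` pattern applies to an Atari-style convolutional predictor; only `predict_reward` would change.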

Why does the paper train an ensemble of reward predictors rather than a single model?

Chapter 4: Policy Optimization

Once we have a reward model r̂, we're back in familiar RL territory. The policy π interacts with the environment, but instead of receiving the environment's true reward, it receives r̂(ot, at) at each timestep. From the policy's perspective, this is just a standard RL problem.
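
One way to picture this substitution is as a relabeling step applied to collected experience before it reaches the RL algorithm. The helper below is hypothetical and assumes trajectories are stored as (observation, action, environment reward) tuples.

```python
def relabel_with_learned_reward(trajectory, reward_fn):
    """Swap the environment's reward for the learned reward at every step.

    trajectory: list of (observation, action, env_reward) tuples
    reward_fn:  hypothetical learned predictor r_hat(observation, action)
    """
    return [(obs, act, reward_fn(obs, act)) for obs, act, _env_reward in trajectory]

# Toy usage: the learned reward here is just observation + action.
trajectory = [(0.0, 1.0, 99.0), (1.0, 1.0, 99.0)]
print(relabel_with_learned_reward(trajectory, lambda o, a: o + a))
```

The RL algorithm downstream needs no changes, since it still consumes (observation, action, reward) steps.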

Choice of RL algorithm

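The paper uses two different RL algorithms depending on the domain: advantage actor-critic (A2C) for the Atari games and trust region policy optimization (TRPO) for the MuJoCo robotics tasks.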

The non-stationarity challenge

There's a subtle but important wrinkle: the reward function r̂ is non-stationary. It's being updated continuously as new human comparisons come in. This means the "ground truth" reward for any given state-action pair changes over time.

This is why the authors focus on policy gradient methods rather than value-based methods like DQN: they wanted algorithms that are robust to changes in the reward function. They also adjusted the entropy bonus for TRPO, because TRPO relies on its trust region for exploration, and that can be inadequate when the reward function keeps changing and previously explored regions become newly valuable.

Reward normalization: The learned rewards r̂ have arbitrary scale and offset — the Bradley-Terry model only depends on differences in total reward, not absolute values. So the paper normalizes rewards to have zero mean and constant standard deviation before passing them to the RL algorithm. Without this, the effective learning rate would drift as the reward model's scale evolves.
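
The normalization step can be sketched as follows. This version assumes the statistics are computed over a single batch of predicted rewards; `target_std` and `eps` are illustrative parameters, and the paper's exact normalization schedule is not reproduced here.

```python
import math

def normalize_rewards(rewards, target_std=1.0, eps=1e-8):
    """Shift rewards to zero mean and rescale them to a fixed standard deviation."""
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(variance)
    # eps guards against division by zero when all rewards are identical.
    return [target_std * (r - mean) / (std + eps) for r in rewards]

print(normalize_rewards([1.0, 2.0, 3.0, 4.0]))
```

Shifting and rescaling does not change which segments the reward model prefers when every segment has the same length, so the policy still pursues the same behavior.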

[Interactive demo: Learned vs True Reward. The learned reward (teal) approximates the true reward (gray) with increasing accuracy as more comparisons are gathered.]

Why does the paper prefer policy gradient methods (A2C, TRPO) over value-based methods (DQN) for optimizing the learned reward?

Chapter 5: The Full Pipeline

The complete system has three asynchronous processes running in parallel, forming a continuous loop:

1. Policy Rollouts
Policy π interacts with the environment, generating trajectories. Uses the current reward model r̂ to compute rewards. Updated via A2C/TRPO.
↓ trajectory clips flow down
2. Human Comparisons
Pairs of 1-2 second clips are selected and shown to the human. Human says which is better (or "equal" or "can't tell"). Stored in database D.
↓ preference labels flow down
3. Reward Model Training
Train ensemble of reward predictors on D via cross-entropy loss. Updated reward parameters flow back to the policy.
↑ updated r̂ flows back to step 1

The crucial insight: these three processes run asynchronously. The policy doesn't wait for human labels — it keeps collecting experience using the most recent reward model. The human labels don't need to arrive in order. The reward model trains continuously on all available data.
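
The loop can be imitated end to end on a toy problem. Everything in this sketch is a simplification and an assumption: the three processes are interleaved sequentially rather than run asynchronously, a synthetic oracle stands in for the human, the reward model is a single weight w with r̂(x) = w·x, and the "policy" is one drift parameter improved by hill climbing instead of A2C or TRPO.

```python
import math
import random

def rollout(drift, rng, length=10):
    """Process 1: a one-parameter toy policy produces a segment of positions."""
    x, segment = 0.0, []
    for _ in range(length):
        x += drift + rng.gauss(0.0, 0.5)
        segment.append(x)
    return segment

def synthetic_label(seg1, seg2):
    """Process 2: stand-in for the human; prefers the segment further to the right."""
    return 1.0 if sum(seg1) > sum(seg2) else 0.0

def train_reward(w, database, lr=0.01, epochs=5):
    """Process 3: fit r_hat(x) = w * x by gradient steps on the cross-entropy loss."""
    for _ in range(epochs):
        for seg1, seg2, label in database:
            delta = sum(seg1) - sum(seg2)
            logit = max(-30.0, min(30.0, w * delta))  # clamp to keep exp() safe
            p_first = 1.0 / (1.0 + math.exp(-logit))
            w -= lr * (p_first - label) * delta  # gradient of the loss w.r.t. w
    return w

def run_pipeline(iterations=30, seed=0):
    rng = random.Random(seed)
    w, drift, database = 0.0, 0.0, []
    for _ in range(iterations):
        lower, higher = drift - 0.2, drift + 0.2
        seg_low, seg_high = rollout(lower, rng), rollout(higher, rng)
        database.append((seg_low, seg_high, synthetic_label(seg_low, seg_high)))
        w = train_reward(w, database)
        # The policy only ever sees the learned reward w * x, never the oracle.
        drift = lower if w * sum(seg_low) > w * sum(seg_high) else higher
    return w, drift

w, drift = run_pipeline()
print(f"learned reward weight: {w:.3f}, policy drift: {drift:.2f}")
```

In the real system each of the three steps inside the loop runs as its own process, with the policy always using whichever reward model is most recent.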

Worked example, backflip training: The agent is a simulated MuJoCo Hopper that starts out moving randomly. The human sees two clips: one where the robot falls forward, another where it stumbles backward. The human picks the one that looks more like the start of a backflip. As an illustrative progression, after roughly 200 comparisons the reward model starts assigning high reward to upside-down orientations, and after roughly 500 it rewards the full rotation. The paper reports that with about 900 comparisons (less than an hour of human time), the agent performs a sequence of backflips, a behavior that would be extremely difficult to specify via a hand-crafted reward function.

[Interactive demo: The RLHF Pipeline. Stepping through the three asynchronous processes shows how reward quality improves as comparisons accumulate.]

What are the three asynchronous processes in the RLHF pipeline?

Chapter 6: Efficiency

A key selling point: the method needs surprisingly little human feedback. The agent collects a large amount of experience, but the human only needs to provide feedback on less than 1% of the agent's interactions with the environment.

The numbers

Compare this to direct human reward: if every timestep required a human score, you'd need hundreds of thousands of labels. The reward model amplifies a small amount of human feedback into dense reward for every single timestep.

Active query selection

Not all comparisons are equally informative. The paper uses the ensemble disagreement to select the most useful queries:

  1. Sample a large pool of candidate clip pairs
  2. Each reward predictor in the ensemble predicts which clip is better
  3. Select pairs where the ensemble members disagree most (highest variance in predictions)

Intuitively, if all ensemble members agree on a comparison, we don't learn much from asking the human. But if they disagree, the human's answer resolves genuine model uncertainty.
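
The three selection steps above can be sketched as a ranking by ensemble variance. This is a minimal illustration with assumed interfaces: each ensemble member is a function that maps a segment to a predicted total reward, and all names are hypothetical.

```python
import math

def preference_probability(r1, r2):
    """Bradley-Terry probability that the first segment is preferred."""
    diff = r1 - r2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def disagreement(pair, ensemble):
    """Variance of the predicted preference across ensemble members."""
    seg1, seg2 = pair
    probs = [preference_probability(model(seg1), model(seg2)) for model in ensemble]
    mean = sum(probs) / len(probs)
    return sum((p - mean) ** 2 for p in probs) / len(probs)

def select_queries(candidate_pairs, ensemble, num_queries):
    """Return the num_queries pairs on which the ensemble disagrees most."""
    ranked = sorted(candidate_pairs, key=lambda pair: disagreement(pair, ensemble),
                    reverse=True)
    return ranked[:num_queries]

# Toy ensemble of three "reward predictors" over segments of numbers.
ensemble = [lambda s: sum(s), lambda s: -sum(s), lambda s: 0.0]
pairs = [([1.0, 1.0], [1.0, 1.0]), ([3.0, 0.0], [0.0, 0.0])]
print(select_queries(pairs, ensemble, num_queries=1))
```

In the toy usage, the first pair gets identical predictions from every member (variance zero), so the second pair is the one sent to the human.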

When does active selection help? The ablation studies show mixed results. Active query selection helps on tasks where the reward landscape has complex structure (like Hopper), but can slightly hurt on simpler tasks. The authors note that a more sophisticated approach — like expected value of information — might work better, but even the simple variance-based heuristic is a reasonable default.
How does the paper select which trajectory pairs to show the human?

Chapter 7: Results

MuJoCo robotics

With just 700 human comparisons, the method nearly matches the performance of RL with the true reward function on all 8 MuJoCo tasks (HalfCheetah, Hopper, Walker, Ant, Swimmer, Reacher, Double Pendulum, Pendulum). Surprisingly, with 1,400 synthetic labels the method performs slightly better than true-reward RL, perhaps because the learned reward provides better-shaped feedback.

Atari games

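The paper tests 7 Atari games, using 5,500 human comparisons alongside synthetic labels, and reports substantial learning on most of them. With synthetic labels, the method matches or comes close to RL with true rewards on BeamRider and Pong. On Seaquest and Qbert, it eventually reaches near-RL performance but learns more slowly. On SpaceInvaders and Breakout, it improves substantially but never fully matches RL. Real human feedback generally performs similarly to, or slightly worse than, the same number of synthetic labels.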

Novel behaviors (the real payoff)

This is where the approach truly shines: learning behaviors that would be nearly impossible to specify with hand-crafted rewards, such as the backflip from Chapter 5.

RLHF vs True Reward on MuJoCo

Performance comparison across MuJoCo tasks. RLHF with 700 human labels nearly matches RL with the true reward function.

Better than true reward? In places, the learned reward did better than expected. On the Ant task, human feedback significantly outperformed synthetic labels generated from the true reward. The humans were asked to prefer trajectories where the robot was "standing upright," which proved to be useful reward shaping. The authors also suggest why a learned reward can be better shaped than the true one: reward learning assigns positive value to all behaviors that are typically followed by high reward, which gives the policy a more informative signal.
What novel behavior did the paper train from scratch using only ~900 human comparisons?

Chapter 8: Scalability

The paper introduces several design choices that make the system practical at scale.

Asynchronous feedback

The three processes (policy training, human labeling, reward model training) run independently. The policy never waits for a human — it uses the most recent reward model. This means the system can collect experience 24/7, even when no humans are available. New labels are incorporated as they arrive, continuously improving the reward model.

Online vs offline labels

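A critical finding: labels must be collected online (throughout training), not just at the beginning. The ablation studies show that offline labeling leads to bizarre behaviors. On Pong, for example, the paper reports agents that learn to avoid losing points but not to score them, which produces extremely long volleys.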

Why online feedback is crucial: The agent's state distribution shifts as it learns — a phenomenon called distributional shift. Early on, the agent mostly falls over. Later, it runs smoothly. A reward model trained only on "falling over" clips has no idea how to evaluate "running smoothly" clips. Online labeling ensures the reward model always has coverage over the states the policy actually visits.

Computational cost

The entire system is remarkably cheap. For the Atari experiments, the paper estimates that the cost of human labeling and the cost of compute are already comparable. That means further improvements in sample efficiency would hit diminishing returns, because compute would become the bottleneck.

Why does collecting human labels only at the beginning of training (offline) lead to poor performance?

Chapter 9: Connections

What this paper built on

Bradley-Terry model (1952): The pairwise comparison framework from statistics that underlies the entire reward learning approach. It was originally developed for ranking items from paired comparisons, and it is closely related to the Elo system used to rate chess players.

Inverse RL (Ng & Russell, 2000): Recovering a reward function from demonstrations. This paper takes a different tack — learning reward from preferences rather than demonstrations, which doesn't require expert demonstrations.

TAMER (Knox, 2012): Earlier work on learning from human reward signals, but limited to simple tasks. This paper scales the idea to deep RL with complex environments.

What this paper enabled

InstructGPT / ChatGPT (Ouyang et al., 2022): Applied this exact framework to language models. Human annotators compare pairs of model outputs, a reward model learns from these comparisons, and PPO optimizes the language model against the learned reward. This is the "RLHF" in ChatGPT's training pipeline.

Constitutional AI (Bai et al., 2022): Extends RLHF by replacing some human feedback with AI feedback — the AI critiques its own outputs using a set of principles ("constitution"). This addresses the scalability bottleneck of human labeling.

DPO (Rafailov et al., 2023): Direct Preference Optimization eliminates the explicit reward model. It shows that the cross-entropy loss on preferences can be rewritten as a loss directly on the policy, skipping the intermediate reward-model step. Simpler, but it builds on the same Bradley-Terry preference model used here.
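
For a single preference pair, the DPO objective has the same logistic shape as the reward-model loss, with policy log-probability ratios in place of summed rewards. The sketch below follows the published DPO loss; the argument names and the default `beta` are illustrative choices, and the log-probabilities are assumed to be precomputed for the preferred ("chosen") and less-preferred ("rejected") responses.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    Each log-ratio compares the policy being trained with a frozen reference policy.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# A policy identical to the reference has zero margin and loss log(2).
print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
```

Here beta times the log-ratio plays the role that the segment's total reward R(σ) plays in the Bradley-Terry model.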

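GRPO / DeepSeek-R1 (2024-25): Group Relative Policy Optimization simplifies the RL step by replacing the learned value function with group-relative advantages. When it is paired with a learned reward model, that model follows the same preference-learning paradigm as this paper, although DeepSeek-R1 also relies heavily on rule-based rewards.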

This paper's legacy: Before Christiano et al. 2017, learning from human preferences was a niche research direction largely limited to small problems. This paper demonstrated that preference learning could scale to complex deep RL tasks: Atari games and MuJoCo locomotion. Five years later, the same pipeline (with PPO and language models replacing A2C/TRPO and game environments) became a core ingredient of ChatGPT, Claude, and other major aligned language models. The Bradley-Terry preference model used in this paper is now one of the most widely used components in modern AI.

Cheat sheet

Core equation: P(σ1 ≻ σ2) = exp(R(σ1)) / (exp(R(σ1)) + exp(R(σ2))), where R(σ) = ∑t r̂(ot, at)
Loss: Cross-entropy between predicted and actual human preferences
Pipeline: Policy rollouts → human comparisons → reward model → policy optimization (async loop)
Efficiency: Less than 1% of agent experience needs human labels; 700 comparisons for MuJoCo
Impact: Foundation for InstructGPT, ChatGPT RLHF, DPO, Constitutional AI

How is this paper's contribution used in the training of ChatGPT?