RLHF — Veanors

Chapter 0: The Problem

Reinforcement learning has a dirty secret: it needs a reward function. And writing a good reward function is really hard.

Suppose you want to train a robot to clean a table. What reward do you assign? Distance to crumbs? But what about the ones behind the salt shaker? Percentage of surface area cleaned? But it might knock everything off the table to maximize area. Every simple metric you try either misses the point or gets "gamed" — the agent finds a loophole that scores high but does something you never intended.

This is the reward specification problem. It shows up everywhere:

A game agent finds a bug that gives infinite score instead of actually playing
A robot learns to look like it's cleaning instead of actually cleaning
A language model produces fluent-sounding nonsense that scores well on automated metrics

The core issue: humans can recognize good behavior when they see it, but they can't write down a mathematical function that captures what they mean. There's a gap between what we want and what we can formalize.

The fundamental tension: RL requires a scalar reward function r(s, a) → ℝ. But real-world goals ("clean the table nicely," "drive safely," "be helpful and honest") resist formalization. The paper covered here, "Deep Reinforcement Learning from Human Preferences" (Christiano et al., 2017), proposes a radical solution: don't write the reward function at all. Instead, let humans express preferences by comparing pairs of behaviors, and learn the reward function from those comparisons.

Interactive demo: The Reward Specification Problem

The intended behavior is to move right smoothly, but each hand-crafted reward function in the demo produces an unintended shortcut that the agent exploits.

Why can't we simply hand-engineer a reward function for complex tasks like "clean a table"?

Simple reward functions get exploited: the agent finds loopholes that score high without actually doing what we want
Reward functions are too expensive to compute
RL algorithms can't optimize scalar rewards

Chapter 1: The Key Insight

Here's the insight that changes everything: humans can't write down reward functions, but they can watch two short video clips and say "I prefer this one."

Pairwise comparison is one of the simplest forms of human feedback. You don't need expertise. You don't need to assign numerical scores. You just watch two 1-2 second clips of an agent's behavior and pick the one that looks better. Even a non-expert can do this quickly and consistently.

The paper's approach has three components running simultaneously:

The policy interacts with the environment, generating trajectories
A human compares pairs of short clips and says which is better
A reward model is trained on these comparisons, learning to predict what the human would prefer

The policy then optimizes the learned reward instead of a hand-crafted one. The reward model serves as a proxy for the human — an always-available, differentiable approximation of human judgment.

Why comparisons, not scores? The authors ask humans to compare clips rather than assign absolute numerical scores, and report that comparisons were easier for humans to provide in some domains while being equally useful for learning preferences. Absolute ratings are hard to calibrate: different people use different scales, and the same person can drift over time. "Clip A is better than clip B" is a much easier question than "how good is clip A on a scale of 1 to 10?"

The key technical question is: how do you turn binary comparison labels ("A is better than B") into a continuous reward function r(s, a)? That's what the Bradley-Terry model does, and we'll derive it in the next chapter.

Why do the authors use pairwise comparisons rather than absolute numerical scores?

Comparisons are much more consistent across raters and over time: "A is better than B" is easier and more reliable than assigning a score on a numerical scale
Numerical scores are too slow to collect
Neural networks can only process binary labels

Chapter 2: The Preference Model

We need a mathematical model that connects a reward function to human preferences. Enter the Bradley-Terry model, introduced in 1952 for analyzing paired-comparison experiments and closely related to the Elo system used to rate chess players.

The idea: every trajectory segment σ has a hidden "quality score" — the sum of rewards along the segment:

R(σ) = ∑_t r̂(o_t, a_t)

The probability that a human prefers segment σ¹ over σ² depends exponentially on the difference in total reward:

P(σ¹ ≻ σ²) = exp(R(σ¹)) / (exp(R(σ¹)) + exp(R(σ²)))

This is a softmax over the total rewards of the two segments. Let's unpack what this means:

If R(σ¹) > R(σ²), then P(σ¹ ≻ σ²) > 0.5: the model predicts the human will prefer the higher-reward segment
If R(σ¹) = R(σ²), then P = 0.5 — a coin flip, meaning the model is indifferent
As the gap R(σ¹) − R(σ²) grows, P approaches 1 — increasing confidence
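The three cases above can be checked numerically. The following is a minimal sketch, assuming per-step rewards are already available as plain lists; the names `segment_return` and `preference_probability` and the example reward values are illustrative, not taken from the paper.

```python
import math

def segment_return(rewards):
    """Total predicted reward R(sigma): the sum of per-step rewards."""
    return sum(rewards)

def preference_probability(r1, r2):
    """P(sigma1 preferred over sigma2) under Bradley-Terry.

    Written as a logistic function of the gap r1 - r2, which is algebraically
    equal to exp(r1) / (exp(r1) + exp(r2)) but avoids overflow for large inputs.
    """
    gap = r1 - r2
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    z = math.exp(gap)
    return z / (1.0 + z)

# Hypothetical per-step rewards for two short segments.
seg_a = [0.4, 0.3, 0.3]      # R is about 1.0
seg_b = [-0.2, -0.2, -0.1]   # R is about -0.5

p = preference_probability(segment_return(seg_a), segment_return(seg_b))
print(round(p, 3))                                   # gap of 1.5 -> 0.818
print(preference_probability(2.0, 2.0))              # equal rewards -> 0.5
print(round(preference_probability(10.0, 0.0), 5))   # large gap -> close to 1
```

Only the gap between the two totals matters: adding the same constant to both returns leaves the probability unchanged.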

The Elo connection: This is the same idea behind Elo ratings in chess, where the difference in rating points predicts the probability of one player beating another (Elo uses base 10 and a 400-point scale, but the logistic form is the same). Here, the difference in total predicted reward predicts which trajectory segment a human will prefer. Reward = Elo rating for behavior.
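To make the Elo analogy concrete, here is a small sketch comparing the standard Elo expected-score formula with the Bradley-Terry form. The ratings 1800 and 1600 are made-up inputs, and the function names are illustrative.

```python
import math

def elo_expected_score(rating_a, rating_b):
    """Standard Elo expected score for player A against player B."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))

def bradley_terry(r1, r2):
    """Bradley-Terry preference probability from two scores."""
    return math.exp(r1) / (math.exp(r1) + math.exp(r2))

# Rescaling Elo ratings by ln(10) / 400 turns them into Bradley-Terry scores.
SCALE = math.log(10.0) / 400.0

elo = elo_expected_score(1800, 1600)
bt = bradley_terry(1800 * SCALE, 1600 * SCALE)
print(round(elo, 4), round(bt, 4))  # the two values agree
```

The two formulas differ only in the base of the exponent and the scale of the scores, which is why a learned reward can be read as a rating for behavior.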

Why exponential?

The exponential form comes from the Luce choice axiom — a principle from mathematical psychology. It says that the probability of choosing an option should be proportional to some "value" function of that option. The softmax (which uses exp) is the canonical way to turn arbitrary real-valued scores into valid probabilities that satisfy this axiom.

Interactive demo: Bradley-Terry Model

Two sliders set the total rewards of the two segments (defaults: R(σ¹) = 1.0, R(σ²) = -0.5) and the demo shows the resulting preference probability. When rewards are equal, it's a coin flip. As the gap grows, confidence increases.

In the Bradley-Terry model, what does P(σ¹ ≻ σ²) = exp(R(σ¹)) / (exp(R(σ¹)) + exp(R(σ²))) predict?

The probability that a human will prefer trajectory segment σ¹ over σ², based on the difference in their total predicted rewards
The total reward of segment σ¹
The policy's probability of taking an action

Chapter 3: Reward Learning

Now we need to actually train a neural network to be the reward function r̂(o, a). The training signal comes from human comparisons stored in a database D of triples (σ¹, σ², μ), where μ encodes which segment the human preferred.

The loss function

Given the Bradley-Terry model, we train r̂ by minimizing cross-entropy between the model's predicted preferences and the actual human labels:

loss(r̂) = − ∑_{(σ¹,σ²,μ)∈D} [μ(1) log P̂(σ¹ ≻ σ²) + μ(2) log P̂(σ² ≻ σ¹)]

Where μ(1) is 1 if the human preferred σ¹, μ(2) is 1 if they preferred σ², and both are 0.5 if they said "equally good."

This is standard cross-entropy — the same loss you'd use for binary classification. The key difference: the "logits" come from summing the reward model's outputs across all timesteps in each segment, not from a single forward pass.
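As a rough illustration of the loss above, the sketch below evaluates it on two hand-written comparisons. It assumes each triple stores the per-step predicted rewards of both segments plus the label distribution μ as a pair (μ(1), μ(2)); that data layout and the function names are assumptions made for this example.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(comparisons):
    """Cross-entropy over triples (rewards1, rewards2, (mu1, mu2)).

    The logit for each pair is the gap between summed per-step rewards, so
    log P(sigma1 > sigma2) = log_sigmoid(gap) and
    log P(sigma2 > sigma1) = log_sigmoid(-gap).
    """
    total = 0.0
    for rewards1, rewards2, (mu1, mu2) in comparisons:
        gap = sum(rewards1) - sum(rewards2)
        total -= mu1 * log_sigmoid(gap) + mu2 * log_sigmoid(-gap)
    return total

# Hypothetical database D: one clear preference, one "equally good" label.
database = [
    ([1.0, 1.0], [0.0, 0.0], (1.0, 0.0)),
    ([0.5], [0.5], (0.5, 0.5)),
]
loss = preference_loss(database)
print(round(loss, 4))
```

In a real system the per-step rewards would come from the reward network, and the loss would be minimized with respect to its parameters rather than just evaluated.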

Practical modifications

The paper found several tricks essential for making this work:

Ensemble of predictors: Train multiple reward models on bootstrapped samples of D. The final reward is the average of the ensemble. This reduces overfitting and provides uncertainty estimates.
Regularization: L2 regularization tuned to keep validation loss between 1.1x and 1.5x training loss, plus optional dropout.
Error tolerance: Assume a 10% chance the human responds randomly. This prevents the model from being overconfident when two segments are vastly different in quality — real humans make mistakes.
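The ensemble idea from the list above can be sketched as follows. This is a simplified illustration: each member would be trained on its own bootstrap resample of D, and the combined reward is an average of member predictions. Normalizing each member before averaging is an assumption of this sketch (the learned rewards have arbitrary scale), and all function names are hypothetical.

```python
import random
import statistics

def bootstrap_sample(database, rng):
    """Draw len(database) triples from the database with replacement."""
    return [rng.choice(database) for _ in database]

def normalize(values):
    """Shift and scale one member's predictions to zero mean, unit std."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values) or 1.0  # guard against constant predictions
    return [(v - mean) / std for v in values]

def ensemble_reward(per_member_rewards):
    """Average member predictions after normalizing each member separately.

    per_member_rewards[k][t] is member k's predicted reward at timestep t.
    """
    normalized = [normalize(member) for member in per_member_rewards]
    return [statistics.mean(column) for column in zip(*normalized)]

rng = random.Random(0)
database = ["triple_%d" % i for i in range(10)]  # placeholders for (s1, s2, mu)
resample = bootstrap_sample(database, rng)       # training set for one member

# Two members that agree on ordering but use very different scales.
member_predictions = [[0.0, 1.0, 2.0], [10.0, 30.0, 50.0]]
combined = ensemble_reward(member_predictions)
print([round(r, 3) for r in combined])
```

Without the per-member normalization, the member with the larger output scale would dominate the average.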

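The 10% error assumption: Without this, the reward model could drive reward differences arbitrarily high on pairs where the human labels are fully consistent, since cross-entropy keeps rewarding more confident predictions. Mixing in a 10% chance of a uniformly random response puts a floor of 0.05 (half of 10%) on the predicted probability of the less-preferred option, which caps the loss on any single comparison and removes the incentive for extreme reward gaps. The effect is similar to label smoothing, and it helps keep training stable.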
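One way to express the error-tolerance assumption in code is to mix the Bradley-Terry prediction with a uniform coin flip. The sketch below assumes the 10% rate from the text; the function names are illustrative.

```python
import math

def bt_probability(gap):
    """Bradley-Terry probability from the reward gap R(sigma1) - R(sigma2)."""
    if gap >= 0:
        return 1.0 / (1.0 + math.exp(-gap))
    z = math.exp(gap)
    return z / (1.0 + z)

def noisy_preference_probability(gap, error_rate=0.10):
    """Assume the human answers uniformly at random with probability error_rate."""
    return (1.0 - error_rate) * bt_probability(gap) + error_rate * 0.5

for gap in (-50.0, 0.0, 2.0, 50.0):
    print(gap, round(noisy_preference_probability(gap), 4))

# The prediction never leaves [0.05, 0.95], so the per-comparison loss is
# capped at -log(0.05), however large the reward gap becomes.
worst_case_loss = -math.log(noisy_preference_probability(-1000.0))
print(round(worst_case_loss, 3))
```

With the floor in place, a single mislabeled comparison can no longer produce an unbounded loss.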

The reward model architecture

For MuJoCo tasks: a fully-connected network that takes the observation and action as input and outputs a scalar reward. For Atari: a convolutional network that takes 4 stacked frames as input. The network outputs r̂(o_t, a_t) for each timestep, and these are summed to get R(σ) for each segment.
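To show how per-step outputs, segment sums, and the preference loss fit together, here is a toy training loop. It is a deliberately simplified sketch: a linear reward over two hypothetical features stands in for the neural networks described above, and labels come from a hidden synthetic reward rather than a human. All names and hyperparameters are assumptions.

```python
import math
import random

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def segment_features(segment):
    """Sum feature vectors over timesteps. For a linear reward,
    R(sigma) = weights . segment_features(sigma)."""
    return [sum(column) for column in zip(*segment)]

def train_linear_reward(comparisons, dim, learning_rate=0.1, epochs=200):
    """Gradient descent on the Bradley-Terry cross-entropy loss."""
    diffs = [
        ([a - b for a, b in zip(segment_features(s1), segment_features(s2))], mu1)
        for s1, s2, mu1 in comparisons
    ]
    weights = [0.0] * dim
    for _ in range(epochs):
        grad = [0.0] * dim
        for diff, mu1 in diffs:
            # d(loss)/d(gap) = predicted P(sigma1 preferred) - label
            error = sigmoid(dot(weights, diff)) - mu1
            for i in range(dim):
                grad[i] += error * diff[i]
        weights = [w - learning_rate * g / len(diffs) for w, g in zip(weights, grad)]
    return weights

rng = random.Random(0)
true_weights = [1.0, -2.0]  # hidden reward used only to generate labels

def random_segment(length=5):
    return [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(length)]

comparisons = []
for _ in range(200):
    s1, s2 = random_segment(), random_segment()
    r1 = dot(true_weights, segment_features(s1))
    r2 = dot(true_weights, segment_features(s2))
    comparisons.append((s1, s2, 1.0 if r1 > r2 else 0.0))

weights = train_linear_reward(comparisons, dim=2)
correct = sum(
    (dot(weights, segment_features(s1)) > dot(weights, segment_features(s2))) == (mu1 == 1.0)
    for s1, s2, mu1 in comparisons
)
accuracy = correct / len(comparisons)
print([round(w, 2) for w in weights], accuracy)
```

The learned weights recover the direction of the hidden reward but not its scale, which is expected: comparisons only constrain reward differences.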

Why does the paper train an ensemble of reward predictors rather than a single model?

Ensembles reduce overfitting and provide uncertainty estimates, which are used to select the most informative queries to show to the human
A single model can't handle the input dimensionality
Ensembles train faster

Chapter 4: Policy Optimization

Once we have a reward model r̂, we're back in familiar RL territory. The policy π interacts with the environment, but instead of receiving the environment's true reward, it receives r̂(o_t, a_t) at each timestep. From the policy's perspective, this is just a standard RL problem.
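The reward swap can be pictured as a thin wrapper around the environment. The sketch below uses a hypothetical `reset`/`step` interface returning `(observation, reward, done)`; it is not the API of any particular RL library, and `CountingEnv` is a made-up toy environment.

```python
class LearnedRewardEnv:
    """Wraps an environment so step() returns the learned reward r_hat(o_t, a_t)
    in place of the environment's own reward."""

    def __init__(self, env, reward_model):
        self.env = env
        self.reward_model = reward_model  # callable: (observation, action) -> float
        self._last_obs = None

    def reset(self):
        self._last_obs = self.env.reset()
        return self._last_obs

    def step(self, action):
        next_obs, _true_reward, done = self.env.step(action)  # true reward is discarded
        learned_reward = self.reward_model(self._last_obs, action)
        self._last_obs = next_obs
        return next_obs, learned_reward, done


class CountingEnv:
    """Hypothetical toy environment: the observation is a step counter."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return self.t, 100.0, self.t >= self.horizon


def toy_reward_model(observation, action):
    """Stand-in for the trained reward network."""
    return float(action) - 0.1 * observation


env = LearnedRewardEnv(CountingEnv(), toy_reward_model)
env.reset()
obs1, reward1, done1 = env.step(1)
obs2, reward2, done2 = env.step(0)
print(reward1, reward2)
```

Because the policy only ever sees the wrapper, any standard RL algorithm can be run against it unchanged.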

Choice of RL algorithm

The paper uses two different RL algorithms depending on the domain:

A2C (Advantage Actor-Critic) for Atari games — a synchronous variant of A3C, suitable for discrete action spaces
TRPO (Trust Region Policy Optimization) for MuJoCo robotics — better for continuous control because the trust region prevents destructively large updates

The non-stationarity challenge

There's a subtle but important wrinkle: the reward function r̂ is non-stationary. It's being updated continuously as new human comparisons come in. This means the "ground truth" reward for any given state-action pair changes over time.

This led the authors to favor policy gradient methods over value-based methods like DQN, on the grounds that they are more robust to changes in the reward function. (A2C still learns a value estimate as its critic, but the policy is updated directly rather than derived from a learned Q-function.) They also added an entropy bonus to TRPO to ensure adequate exploration: TRPO normally relies on its trust region for exploration, which can be inadequate when the reward function keeps changing.

Reward normalization: The learned rewards r̂ have arbitrary scale and offset — the Bradley-Terry model only depends on differences in total reward, not absolute values. So the paper normalizes rewards to have zero mean and constant standard deviation before passing them to the RL algorithm. Without this, the effective learning rate would drift as the reward model's scale evolves.
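A minimal sketch of this normalization step is shown below. The target standard deviation of 1.0 and the small epsilon are assumptions for illustration; the text only specifies zero mean and a constant standard deviation.

```python
import statistics

def normalize_rewards(rewards, target_std=1.0, eps=1e-8):
    """Rescale a batch of predicted rewards to zero mean and a fixed std."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [target_std * (r - mean) / (std + eps) for r in rewards]

raw = [1.0, 2.0, 3.0, 4.0]
# The same predictions after the reward model's scale and offset have drifted.
drifted = [10.0 * r + 100.0 for r in raw]

normalized_raw = normalize_rewards(raw)
normalized_drifted = normalize_rewards(drifted)
print([round(r, 3) for r in normalized_raw])
print([round(r, 3) for r in normalized_drifted])
```

Both batches normalize to the same values, so the RL algorithm sees a consistent reward scale even as the reward model's raw outputs shift.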

Interactive demo: Learned vs True Reward

The learned reward (teal) approximates the true reward (gray) with increasing accuracy as more comparisons are gathered.

Why does the paper prefer policy gradient methods (A2C, TRPO) over value-based methods (DQN) for optimizing the learned reward?

The reward function is non-stationary (updated as new comparisons arrive), and the authors favor policy gradient methods as more robust to a changing reward than value-based methods
Policy gradient methods are always better than DQN
DQN can't handle continuous observations

Chapter 5: The Full Pipeline

The complete system has three asynchronous processes running in parallel, forming a continuous loop:

1. Policy Rollouts

Policy π interacts with the environment, generating trajectories. Uses the current reward model r̂ to compute rewards. Updated via A2C/TRPO.

↓ trajectory clips flow down

2. Human Comparisons

Pairs of 1-2 second clips are selected and shown to the human. Human says which is better (or "equal" or "can't tell"). Stored in database D.

↓ preference labels flow down

3. Reward Model Training

Train ensemble of reward predictors on D via cross-entropy loss. Updated reward parameters flow back to the policy.

↑ updated r̂ flows back to step 1

The crucial insight: these three processes run asynchronously. The policy doesn't wait for human labels — it keeps collecting experience using the most recent reward model. The human labels don't need to arrive in order. The reward model trains continuously on all available data.
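The asynchronous structure can be mimicked with three threads connected by queues. This is only a structural sketch: clips carry a made-up `quality` number, the "human" is a synthetic comparator, and reward training is reduced to a version counter. None of these details come from the paper.

```python
import queue
import random
import threading

DONE = object()  # sentinel that tells a downstream worker to stop

def run_pipeline(num_clips=20, seed=0):
    clip_queue = queue.Queue()    # policy -> labeler
    label_queue = queue.Queue()   # labeler -> reward trainer
    state = {"reward_version": 0, "labels_seen": 0}
    lock = threading.Lock()
    rng = random.Random(seed)

    def policy_worker():
        for _ in range(num_clips):
            with lock:
                version = state["reward_version"]  # whichever model is newest
            clip_queue.put({"quality": rng.random(), "reward_version": version})
        clip_queue.put(DONE)

    def labeler_worker():
        pending = None
        while True:
            clip = clip_queue.get()
            if clip is DONE:
                break
            if pending is None:
                pending = clip
                continue
            preferred = 1 if pending["quality"] >= clip["quality"] else 2
            label_queue.put((pending, clip, preferred))
            pending = None
        label_queue.put(DONE)

    def reward_trainer_worker():
        while True:
            item = label_queue.get()
            if item is DONE:
                break
            with lock:
                state["labels_seen"] += 1
                state["reward_version"] += 1  # stand-in for a training update

    workers = [
        threading.Thread(target=policy_worker),
        threading.Thread(target=labeler_worker),
        threading.Thread(target=reward_trainer_worker),
    ]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return state

final_state = run_pipeline()
print(final_state)
```

The policy thread never blocks on the labeler: it reads the current reward version and keeps producing clips, which is the property the text highlights.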

Worked example (backflip training): The agent starts with random MuJoCo locomotion. The human sees two clips: one where the robot falls forward, another where it stumbles backward. The human picks the one that looks more like the start of a backflip. Early comparisons teach the reward model to value upside-down orientations; later ones teach it to value the full rotation. By 900 comparisons (under an hour of human time), the agent performs consistent backflips, a behavior that would be extremely difficult to specify via a hand-crafted reward function.

Interactive demo: The RLHF Pipeline

The three asynchronous processes run simultaneously, and reward quality improves as comparisons accumulate.

What are the three asynchronous processes in the RLHF pipeline?

(1) Policy collects trajectories using learned reward, (2) human compares clip pairs, (3) reward model trains on comparison labels, all running in parallel
(1) Data collection, (2) model training, (3) evaluation, running sequentially
(1) Pretraining, (2) fine-tuning, (3) inference

Chapter 6: Efficiency

A key selling point: the method needs surprisingly little human feedback. The agent collects a large amount of experience, but the human only needs to provide feedback on less than 1% of the agent's interactions with the environment.

The numbers

MuJoCo robotics: 700 comparisons (about 30 minutes of human time) to nearly match RL with true reward
Atari games: 5,500 comparisons (about 5 hours of human time) for substantial learning
Novel behaviors: 900 comparisons (under an hour) to learn backflips from scratch

Compare this to direct human reward: if every timestep required a human score, you'd need hundreds of thousands of labels. The reward model amplifies a small amount of human feedback into dense reward for every single timestep.

Active query selection

Not all comparisons are equally informative. The paper uses the ensemble disagreement to select the most useful queries:

Sample a large pool of candidate clip pairs
Each reward predictor in the ensemble predicts which clip is better
Select pairs where the ensemble members disagree most (highest variance in predictions)

Intuitively, if all ensemble members agree on a comparison, we don't learn much from asking the human. But if they disagree, the human's answer resolves genuine model uncertainty.
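The selection rule above can be sketched in a few lines. In this illustration each segment is represented by a pair of summed features and each ensemble member is a hypothetical linear reward; only the ranking by prediction variance reflects the procedure described in the text.

```python
import math
import statistics

def sigmoid(x):
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def select_queries(candidate_pairs, ensemble, num_queries):
    """Return the pairs whose predicted preference varies most across members.

    candidate_pairs: list of (segment1, segment2)
    ensemble: list of callables mapping a segment to its total reward
    """
    scored = []
    for index, (seg1, seg2) in enumerate(candidate_pairs):
        predictions = [sigmoid(member(seg1) - member(seg2)) for member in ensemble]
        scored.append((statistics.pvariance(predictions), index))
    scored.sort(reverse=True)
    return [candidate_pairs[index] for _, index in scored[:num_queries]]

# Hypothetical ensemble: three linear reward models over (x, y) features.
ensemble = [
    lambda seg: seg[0] + seg[1],
    lambda seg: seg[0] - seg[1],
    lambda seg: seg[0],
]

pair_agree = ((1.0, 0.0), (0.0, 0.0))     # every member predicts the same gap
pair_disagree = ((0.0, 1.0), (0.0, 0.0))  # members disagree on the sign
pair_mild = ((2.0, 0.1), (0.0, 0.0))      # small disagreement

chosen = select_queries([pair_agree, pair_disagree, pair_mild], ensemble, num_queries=1)
print(chosen)
```

The pair on which the members disagree about the direction of the preference is selected first, since the human's answer there rules out the most hypotheses.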

When does active selection help? The ablation studies show mixed results. Active query selection helps on tasks where the reward landscape has complex structure (like Hopper), but can slightly hurt on simpler tasks. The authors note that a more sophisticated approach — like expected value of information — might work better, but even the simple variance-based heuristic is a reasonable default.

How does the paper select which trajectory pairs to show the human?

It selects pairs where the ensemble of reward predictors disagrees most, maximizing the information gained from each human comparison
It always picks the longest trajectories
Pairs are selected uniformly at random

Chapter 7: Results

MuJoCo robotics

With just 700 human comparisons, the method nearly matches the performance of RL with the true reward function on all 8 MuJoCo tasks (HalfCheetah, Hopper, Walker, Ant, Swimmer, Reacher, Pendulum, Double Pendulum). Surprisingly, with 1,400 synthetic labels the method sometimes slightly exceeds true-reward RL, perhaps because the learned reward is better shaped.

Atari games

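With 5,500 human comparisons, the method shows substantial learning on 7 Atari games, with human labels generally performing similarly to or slightly worse than the same number of synthetic labels. With synthetic labels, the method matches or comes close to RL with true rewards on BeamRider and Pong. On Seaquest and Qbert, it reaches near-RL performance but learns more slowly. On SpaceInvaders and Breakout, it improves substantially but never fully matches RL.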

Novel behaviors (the real payoff)

This is where the approach truly shines — learning behaviors that would be nearly impossible to specify with hand-crafted rewards:

Backflips: The Hopper robot learns to perform consistent backflips from ~900 comparisons in under an hour of human time
One-leg running: HalfCheetah learns to run forward while balancing on one leg (~800 comparisons)
Driving with traffic: In Enduro, the agent learns to stay alongside other cars rather than pass them (~1,300 comparisons)

RLHF vs True Reward on MuJoCo

Performance comparison across MuJoCo tasks. RLHF with 700 human labels nearly matches RL with the true reward function.

Better than true reward? On the Ant task, human feedback significantly outperformed synthetic feedback generated from the true reward function. The humans were told to prefer trajectories where the robot "stands upright," which provided better reward shaping than the hand-crafted upright bonus in the environment's reward. More generally, reward learning tends to assign positive value to behaviors that are typically followed by high reward, which can give the policy a more informative signal than the raw reward.

What novel behavior did the paper train from scratch using only ~900 human comparisons?

A simulated Hopper robot performing consistent backflips, a behavior that would be extremely difficult to specify via a hand-crafted reward function
Playing chess at grandmaster level
Natural language understanding

Chapter 8: Scalability

The paper introduces several design choices that make the system practical at scale.

Asynchronous feedback

The three processes (policy training, human labeling, reward model training) run independently. The policy never waits for a human — it uses the most recent reward model. This means the system can collect experience 24/7, even when no humans are available. New labels are incorporated as they arrive, continuously improving the reward model.

Online vs offline labels

A critical finding: labels must be collected online (throughout training), not just at the beginning. The ablation studies show that offline labeling leads to bizarre behaviors:

On Pong, offline training sometimes produced an agent that avoids losing points but never scores, resulting in extremely long volleys that repeat indefinitely
The reward model captures only the part of the reward landscape that's visible in early trajectories
As the policy improves and visits new states, the offline reward model has no information about these regions and can assign arbitrarily wrong rewards

Why online feedback is crucial: The agent's state distribution shifts as it learns — a phenomenon called distributional shift. Early on, the agent mostly falls over. Later, it runs smoothly. A reward model trained only on "falling over" clips has no idea how to evaluate "running smoothly" clips. Online labeling ensures the reward model always has coverage over the states the policy actually visits.

Computational cost

The entire system is remarkably cheap. For Atari experiments:

Compute: 16 CPUs + 1 GPU, ~1 day of training, ~$25
Human time: 5 hours at minimum wage = ~$36
Total cost: ~$61 to train an Atari agent from human preferences alone

The human cost and compute cost are already comparable, so further reductions in the number of labels would hit diminishing returns: compute would become the dominant cost.
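The cost figures above can be reproduced with simple arithmetic. The hourly wage of $7.25 is an assumption (the US federal minimum wage); the other numbers are taken from the list above.

```python
compute_cost = 25.0   # USD, from the compute estimate above
human_hours = 5       # hours of labeling
minimum_wage = 7.25   # USD per hour; assumed US federal minimum wage

human_cost = human_hours * minimum_wage
total_cost = compute_cost + human_cost
human_share = human_cost / total_cost

print(human_cost, total_cost, round(human_share, 2))
```

Under these assumptions human labeling accounts for a bit more than half of the total, which is why the two costs are described as comparable.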

Why does collecting human labels only at the beginning of training (offline) lead to poor performance?

Distributional shift: as the policy improves, it visits new states that the offline reward model has never seen, leading to arbitrarily wrong reward predictions in those regions
The labels expire after a fixed time
Offline labels are lower quality

Chapter 9: Connections

What this paper built on

Bradley-Terry model (1952): The pairwise comparison framework, from the statistics of paired comparisons, that underlies the entire reward learning approach. It is closely related to the Elo system used for rating chess players.

Inverse RL (Ng & Russell, 2000): Recovering a reward function from demonstrations. This paper takes a different tack — learning reward from preferences rather than demonstrations, which doesn't require expert demonstrations.

TAMER (Knox, 2012): Earlier work on learning from human reward signals, but limited to simple tasks. This paper scales the idea to deep RL with complex environments.

What this paper enabled

InstructGPT / ChatGPT (Ouyang et al., 2022): Applied the same basic framework to language models. Human annotators compare pairs of model outputs, a reward model learns from these comparisons, and PPO optimizes the language model against the learned reward. This is the "RLHF" in ChatGPT's training pipeline.

Constitutional AI (Bai et al., 2022): Extends RLHF by replacing some human feedback with AI feedback — the AI critiques its own outputs using a set of principles ("constitution"). This addresses the scalability bottleneck of human labeling.

DPO (Rafailov et al., 2023): Direct Preference Optimization eliminates the separate reward model. It shows that the cross-entropy loss on preferences can be rewritten as a loss directly on the policy, skipping the intermediate reward-model step. Simpler, but builds on the same Bradley-Terry preference model used here.
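For comparison with the reward-model loss from Chapter 3, here is a sketch of the per-example DPO loss. It assumes the sequence log-probabilities under the policy and a frozen reference model are already computed; the argument names and the value β = 0.1 are illustrative choices, not prescribed settings.

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Per-example DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio)).

    The scaled log-ratio beta * (log pi - log pi_ref) plays the role of the
    reward in the Bradley-Terry model.
    """
    chosen_log_ratio = policy_logp_chosen - ref_logp_chosen
    rejected_log_ratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_log_ratio - rejected_log_ratio)
    # Numerically stable -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference expressed yet.
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(round(initial, 4), round(improved, 4))
```

The structure mirrors the preference cross-entropy used for reward learning, with the implicit reward defined through the policy itself.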

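GRPO / DeepSeek-R1 (2024-25): Group Relative Policy Optimization simplifies the RL step by replacing the learned value function with group-relative baselines. When it is paired with a learned reward model, that model follows the same preference-learning paradigm as this paper; DeepSeek-R1's reasoning training relies largely on rule-based rewards instead.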

This paper's legacy: Before Christiano et al. 2017, learning from human preferences was a niche research direction largely limited to small-scale problems. This paper demonstrated that preference learning could scale to complex deep RL tasks: Atari games and MuJoCo locomotion. Five years later, the same pipeline (with PPO and language models replacing A2C and game environments) became a key ingredient behind ChatGPT, Claude, and many other aligned language models. The Bradley-Terry preference model used in this paper is now one of the most widely used components in modern AI.

Cheat sheet

Core equation

P(σ¹ ≻ σ²) = exp(R(σ¹)) / (exp(R(σ¹)) + exp(R(σ²))), where R(σ) = ∑_t r̂(o_t, a_t)

Loss

Cross-entropy between predicted and actual human preferences

Pipeline

Policy rollouts → human comparisons → reward model → policy optimization (async loop)

Efficiency

Less than 1% of agent experience needs human labels; 700 comparisons for MuJoCo

Impact

Foundation for InstructGPT, ChatGPT RLHF, DPO, Constitutional AI

How is this paper's contribution used in the training of ChatGPT?

Human annotators compare pairs of model outputs, a reward model is trained on these preferences using the Bradley-Terry model, and PPO optimizes the language model against the learned reward, following the RLHF pipeline from this paper
ChatGPT uses the reward function from this paper directly
ChatGPT only uses supervised learning, not RLHF

Deep RL from Human Preferences