Learn a reward function from pairwise human comparisons, then optimize a policy against it. This is the paper that established deep RLHF and laid the groundwork for systems like ChatGPT.
Reinforcement learning has a dirty secret: it needs a reward function. And writing a good reward function is really hard.
Suppose you want to train a robot to clean a table. What reward do you assign? Distance to crumbs? But what about the ones behind the salt shaker? Percentage of surface area cleaned? But it might knock everything off the table to maximize area. Every simple metric you try either misses the point or gets "gamed" — the agent finds a loophole that scores high but does something you never intended.
This is the reward specification problem. It shows up everywhere:
The core issue: humans can recognize good behavior when they see it, but they can't write down a mathematical function that captures what they mean. There's a gap between what we want and what we can formalize.
Here's the insight that changes everything: humans can't write down reward functions, but they can watch two short video clips and say "I prefer this one."
Pairwise comparison is one of the simplest forms of human feedback. You don't need expertise. You don't need to assign numerical scores. You just watch two 1-2 second clips of an agent's behavior and pick the one that looks better. Even a non-expert can do this quickly and consistently.
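The paper's approach has three components running simultaneously: a policy that interacts with the environment, a human who compares pairs of short clips of its behavior, and a reward model trained on those comparisons.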
The policy then optimizes the learned reward instead of a hand-crafted one. The reward model serves as a proxy for the human — an always-available, differentiable approximation of human judgment.
The key technical question is: how do you turn binary comparison labels ("A is better than B") into a continuous reward function r(s, a)? That's what the Bradley-Terry model does, and we'll derive it in the next chapter.
We need a mathematical model that connects a reward function to human preferences. Enter the Bradley-Terry model, introduced in 1952 for ranking items from pairwise comparisons. The Elo rating system used in chess rests on the same idea.
The idea: every trajectory segment σ has a hidden "quality score", the sum of rewards along the segment:
$$R(\sigma) = \sum_{t} r(o_t, a_t)$$
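The probability that a human prefers segment σ1 over σ2 depends only on the difference in total reward, through exponentials:
$$P[\sigma_1 \succ \sigma_2] = \frac{\exp R(\sigma_1)}{\exp R(\sigma_1) + \exp R(\sigma_2)}, \qquad R(\sigma) = \sum_{t} r(o_t, a_t)$$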
This is a softmax over the total rewards of the two segments. Let's unpack what this means:
The exponential form comes from the Luce choice axiom — a principle from mathematical psychology. It says that the probability of choosing an option should be proportional to some "value" function of that option. The softmax (which uses exp) is the canonical way to turn arbitrary real-valued scores into valid probabilities that satisfy this axiom.
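As a minimal sketch of the preference model described above, the two-way softmax can be computed as a logistic function of the difference in segment returns. The function names `segment_return` and `preference_probability` are illustrative, not taken from the paper, and each segment is assumed to be given as a plain list of per-step rewards.

```python
import math

def segment_return(rewards):
    """Total reward R(sigma) of one segment: the sum of its per-step rewards."""
    return sum(rewards)

def preference_probability(rewards_1, rewards_2):
    """P[segment 1 preferred over segment 2] under the Bradley-Terry model.

    Algebraically equal to exp(R1) / (exp(R1) + exp(R2)), but written as a
    logistic function of R1 - R2 so large returns do not overflow.
    """
    diff = segment_return(rewards_1) - segment_return(rewards_2)
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

# Equal returns give a coin flip; a larger gap gives a more confident prediction.
p_equal = preference_probability([1.0, 1.0], [2.0, 0.0])
p_gap = preference_probability([2.0], [0.0])
```

Only the difference in returns matters, so adding the same constant to every reward in both segments leaves the predicted preference unchanged.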
When the two segments have equal total reward, the predicted preference is a coin flip. As the gap grows, the prediction becomes more confident.
Now we need to actually train a neural network to be the reward function r̂(o, a). The training signal comes from human comparisons stored in a database D of triples (σ1, σ2, μ), where μ encodes which segment the human preferred.
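Given the Bradley-Terry model, we train r̂ by minimizing cross-entropy between the model's predicted preferences P̂ (the Bradley-Terry probabilities computed from r̂) and the actual human labels:
$$\mathrm{loss}(\hat{r}) = -\sum_{(\sigma_1, \sigma_2, \mu) \in D} \Big( \mu(1) \log \hat{P}[\sigma_1 \succ \sigma_2] + \mu(2) \log \hat{P}[\sigma_2 \succ \sigma_1] \Big)$$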
Where μ(1) is 1 if the human preferred σ1, μ(2) is 1 if they preferred σ2, and both are 0.5 if they said "equally good."
This is standard cross-entropy — the same loss you'd use for binary classification. The key difference: the "logits" come from summing the reward model's outputs across all timesteps in each segment, not from a single forward pass.
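The loss described above can be sketched in plain Python. This is an illustration under assumptions, not the paper's implementation: `reward_fn` stands in for the reward network r̂, each segment is a list of (observation, action) pairs, and μ is a pair of weights such as (1, 0), (0, 1), or (0.5, 0.5). A real implementation would use an autodiff framework so the loss can be minimized by gradient descent.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def preference_loss(reward_fn, database):
    """Cross-entropy between Bradley-Terry predictions and human labels.

    database: iterable of (segment_1, segment_2, (mu_1, mu_2)) triples.
    """
    total = 0.0
    for seg_1, seg_2, (mu_1, mu_2) in database:
        # The "logit" of each segment is the summed predicted reward.
        ret_1 = sum(reward_fn(o, a) for o, a in seg_1)
        ret_2 = sum(reward_fn(o, a) for o, a in seg_2)
        log_p1 = log_sigmoid(ret_1 - ret_2)  # log P[segment 1 preferred]
        log_p2 = log_sigmoid(ret_2 - ret_1)  # log P[segment 2 preferred]
        total -= mu_1 * log_p1 + mu_2 * log_p2
    return total

def toy_reward(observation, action):
    """Hypothetical reward function: reward equals the observation value."""
    return observation

good = [(1.0, 0)]   # one-step segment with reward 1 under toy_reward
bad = [(0.0, 0)]    # one-step segment with reward 0 under toy_reward
loss_agree = preference_loss(toy_reward, [(good, bad, (1.0, 0.0))])
loss_disagree = preference_loss(toy_reward, [(good, bad, (0.0, 1.0))])
```

The loss is lower when the human label agrees with the ordering the reward function already implies, which is the signal that pulls r̂ toward the human's judgment.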
The paper found several tricks essential for making this work:
For MuJoCo tasks: a fully-connected network that takes the observation and action as input and outputs a scalar reward. For Atari: a convolutional network that takes 4 stacked frames as input. The network outputs r̂(o_t, a_t) for each timestep, and these are summed to get R(σ) for each segment.
Once we have a reward model r̂, we're back in familiar RL territory. The policy π interacts with the environment, but instead of receiving the environment's true reward, it receives r̂(o_t, a_t) at each timestep. From the policy's perspective, this is just a standard RL problem.
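One way to picture the substitution is as a relabeling step applied to collected experience before it reaches the RL algorithm. The sketch below assumes a rollout stored as (observation, action, environment reward) tuples. `relabel_with_learned_reward` and `toy_reward_model` are hypothetical names used only for illustration.

```python
def relabel_with_learned_reward(trajectory, reward_model):
    """Replace the environment's reward with the reward model's prediction.

    trajectory: list of (observation, action, env_reward) tuples.
    The env_reward entry is discarded, so the policy update never sees it.
    """
    return [(obs, act, reward_model(obs, act)) for obs, act, _ in trajectory]

def toy_reward_model(observation, action):
    """Hypothetical stand-in for a trained reward network."""
    return observation + action

rollout = [(0.0, 1, 5.0), (1.0, 0, -3.0)]
relabeled = relabel_with_learned_reward(rollout, toy_reward_model)
```

Any standard RL algorithm can then consume `relabeled` exactly as it would consume environment-rewarded experience.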
The paper uses two different RL algorithms depending on the domain:
There's a subtle but important wrinkle: the reward function r̂ is non-stationary. It's being updated continuously as new human comparisons come in. This means the "ground truth" reward for any given state-action pair changes over time.
This is why the authors focus on policy gradient methods rather than value-based methods like DQN. Policy gradient methods tend to tolerate a shifting reward better, because the policy is updated directly from recently collected experience instead of being derived entirely from value estimates that go stale whenever the reward shifts (actor-critic methods do still learn a value function, but only as a critic for the policy update). They also added an entropy bonus to TRPO to ensure adequate exploration, since the changing reward landscape means previously explored regions might become newly valuable.
The learned reward approximates the true reward with increasing accuracy as more comparisons are gathered.
The complete system has three asynchronous processes running in parallel, forming a continuous loop:
The crucial insight: these three processes run asynchronously. The policy doesn't wait for human labels — it keeps collecting experience using the most recent reward model. The human labels don't need to arrive in order. The reward model trains continuously on all available data.
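The interleaving can be illustrated with a toy, single-threaded simulation. Everything here is a stand-in chosen for illustration: the "policy" emits random numbers in place of segments, the "human" labels by comparing them, and the "reward model" is just a version counter. The real system runs the three processes concurrently. The point of the sketch is only that the policy step never blocks on a label.

```python
import random
from collections import deque

def run_pipeline(num_rounds, label_every=3, seed=0):
    """Round-robin stand-in for the three asynchronous processes."""
    rng = random.Random(seed)
    pending_queries = deque()   # segment pairs waiting for a human
    labeled = []                # database D of (seg_1, seg_2, mu)
    reward_model_version = 0    # bumps each time the model is refit
    versions_used = []          # reward model version the policy saw per round

    for step in range(num_rounds):
        # Process 1: the policy acts with whatever reward model exists now.
        versions_used.append(reward_model_version)
        pending_queries.append((rng.random(), rng.random()))

        # Process 2: the human is slower and answers only occasionally.
        new_label = False
        if step % label_every == 0:
            s1, s2 = pending_queries.popleft()
            mu = (1.0, 0.0) if s1 > s2 else (0.0, 1.0)
            labeled.append((s1, s2, mu))
            new_label = True

        # Process 3: refit the reward model when new labels have arrived.
        if new_label:
            reward_model_version += 1

    return versions_used, len(labeled), len(pending_queries)

versions_used, num_labeled, num_pending = run_pipeline(7)
```

In this run the policy takes seven steps while only three labels arrive, and it simply keeps using the latest reward model version in between.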
A key selling point: the method needs surprisingly little human feedback. The agent collects a very large amount of experience, but the human only needs to give feedback on less than 1% of the agent's interactions with the environment.
Compare this to direct human reward: if every timestep required a human score, you'd need hundreds of thousands of labels. The reward model amplifies a small amount of human feedback into dense reward for every single timestep.
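The amplification can be made concrete with a little arithmetic. In the sketch below, only the figure of 700 comparisons comes from the text. The clip length of 30 steps and the total of 5,000,000 agent steps are hypothetical numbers chosen purely for illustration, and `labeled_fraction` is an illustrative helper name.

```python
def labeled_fraction(num_comparisons, segment_steps, total_steps):
    """Fraction of agent timesteps that appear in any human-labeled clip."""
    # Each comparison shows the human two clips of segment_steps steps each.
    labeled_steps = num_comparisons * 2 * segment_steps
    return labeled_steps / total_steps

# Hypothetical run: 700 comparisons, 30-step clips, 5 million agent steps.
fraction = labeled_fraction(700, 30, 5_000_000)
```

Under these assumed numbers, well under 1% of timesteps are ever seen by a human, while the reward model supplies a reward for every one of them.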
Not all comparisons are equally informative. The paper uses the ensemble disagreement to select the most useful queries:
Intuitively, if all ensemble members agree on a comparison, we don't learn much from asking the human. But if they disagree, the human's answer resolves genuine model uncertainty.
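A minimal sketch of disagreement-based query selection, assuming the ensemble is a list of reward functions and disagreement is measured as the variance of their predicted preference probabilities. The names `bt_prob` and `select_queries` and the toy ensemble are illustrative assumptions, not the paper's code.

```python
import math
from statistics import pvariance

def bt_prob(ret_1, ret_2):
    """Bradley-Terry probability that the first return is preferred."""
    diff = ret_1 - ret_2
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)

def select_queries(candidate_pairs, ensemble, num_queries):
    """Pick the pairs whose predicted preference varies most across the ensemble.

    candidate_pairs: list of (segment_1, segment_2), each segment a list of
    (observation, action) pairs. ensemble: list of reward functions.
    """
    def disagreement(pair):
        seg_1, seg_2 = pair
        probs = []
        for reward_fn in ensemble:
            ret_1 = sum(reward_fn(o, a) for o, a in seg_1)
            ret_2 = sum(reward_fn(o, a) for o, a in seg_2)
            probs.append(bt_prob(ret_1, ret_2))
        return pvariance(probs)

    ranked = sorted(candidate_pairs, key=disagreement, reverse=True)
    return ranked[:num_queries]

# Two hypothetical ensemble members with opposite opinions about observations.
def member_a(observation, action):
    return observation

def member_b(observation, action):
    return -observation

agreed_pair = ([(1.0, 0)], [(1.0, 0)])     # both members predict 0.5
disputed_pair = ([(2.0, 0)], [(0.0, 0)])   # members predict opposite winners
chosen = select_queries([agreed_pair, disputed_pair], [member_a, member_b], 1)
```

The disputed pair is selected because the members' predictions are far apart, which is where a human answer carries the most information.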
With just 700 human comparisons, the method nearly matches the performance of RL with the true reward function on all 8 MuJoCo tasks (HalfCheetah, Hopper, Walker, Ant, Swimmer, Reacher, Double Pendulum, Pendulum). Surprisingly, with 1,400 synthetic labels the method sometimes exceeds true-reward RL, possibly because the learned reward provides better-shaped feedback.
With 5,500 human comparisons, the method shows substantial learning on most of the 7 Atari games tested. The per-game pattern is reported most clearly for synthetic labels: on BeamRider and Pong, the method matches or comes close to RL with true rewards; on Seaquest and Qbert, it reaches near-RL performance but learns more slowly; on SpaceInvaders and Breakout, it improves substantially but never fully matches RL.
This is where the approach truly shines — learning behaviors that would be nearly impossible to specify with hand-crafted rewards:
Performance comparison across MuJoCo tasks. RLHF with 700 human labels nearly matches RL with the true reward function.
The paper introduces several design choices that make the system practical at scale.
The three processes (policy training, human labeling, reward model training) run independently. The policy never waits for a human — it uses the most recent reward model. This means the system can collect experience 24/7, even when no humans are available. New labels are incorporated as they arrive, continuously improving the reward model.
A critical finding: labels must be collected online (throughout training), not just at the beginning. The ablation studies show that offline labeling leads to bizarre behaviors:
The entire system is remarkably cheap. For Atari experiments:
The human cost and compute cost are already comparable — meaning further improvements in sample efficiency would hit diminishing returns, because compute would become the bottleneck.
Bradley-Terry model (1952): The statistical model of pairwise comparisons that underlies the entire reward learning approach. The Elo rating system used in chess is built on the same idea.
Inverse RL (Ng & Russell, 2000): Recovering a reward function from demonstrations. This paper takes a different tack: it learns the reward from preferences instead, so no expert demonstrator is needed.
TAMER (Knox, 2012): Earlier work on learning from human reward signals, but limited to simple tasks. This paper scales the idea to deep RL with complex environments.
InstructGPT / ChatGPT (Ouyang et al., 2022): Applied this exact framework to language models. Human annotators compare pairs of model outputs, a reward model learns from these comparisons, and PPO optimizes the language model against the learned reward. This is the "RLHF" in ChatGPT's training pipeline.
Constitutional AI (Bai et al., 2022): Extends RLHF by replacing some human feedback with AI feedback — the AI critiques its own outputs using a set of principles ("constitution"). This addresses the scalability bottleneck of human labeling.
DPO (Rafailov et al., 2023): Direct Preference Optimization eliminates the reward model entirely. It shows that the cross-entropy loss on preferences can be rewritten as a loss directly on the policy, skipping the intermediate reward-model step. Simpler, but it builds on the same Bradley-Terry preference model used here.
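GRPO / DeepSeek-R1 (2024-25): Group Relative Policy Optimization simplifies the RL step by replacing the learned value function with a baseline computed from a group of sampled outputs. Where a learned reward model is used, its training follows the same preference-learning paradigm from this paper.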