CS224R — Reward Learning: The Complete Guide

Roadmap

What You'll Master

01 The Reward Specification Problem
02 Task Specification Landscape
03 Goal Classifiers as Rewards
04 The Exploitation Problem
05 Adversarial Classifier Update
06 Human Preferences
07 Reward Learning Algorithm
08 RLHF for LLMs
09 RLAIF / Constitutional AI
10 Unsupervised RL & Self-Play
11 Summary & Comparison
12 Connections & Beyond

Chapter 01

The Reward Specification Problem

In Atari Breakout, the reward is trivial: your score goes up when a brick shatters. The environment gives you the number. You don't have to invent it. RL algorithms just maximize that number and you're done.

Now consider a robot folding a towel. What's the reward? Is it folded "well enough" when the corners align? When wrinkles are below some threshold? When it takes under 10 seconds? There's no number on screen. You have to design or learn one.

This is the reward specification problem: for most real-world tasks — dialog systems, autonomous driving, robotic manipulation, content generation — there is no built-in reward signal. The reward must be created by humans, and creating it well turns out to be incredibly hard.

Definition

Reward Specification Problem

The challenge of defining a mathematical reward function r(s, a) that faithfully captures human intent for a task where no natural reward signal exists. Imprecise reward functions lead to reward hacking — the agent optimizes exactly what you specified, not what you meant.

Consider the classic "walking robot" failure mode. You specify reward = forward velocity. The robot learns to fall forward at maximum speed — zero steps, face-first into the ground. Fix it by adding "stay upright." Now it stands perfectly still. Add "move feet." Now it shuffles in place. Each fix creates a new exploit.

Hand Calculation — Reward Hacking

Robot reward: r = 2.0 · v_forward + 1.0 · upright + 0.5 · foot_contacts

Falling forward: v_forward = 5.0, upright = 0.0, foot_contacts = 0 → r = 2(5) + 0 + 0 = 10.0

Standing still: v_forward = 0.0, upright = 1.0, foot_contacts = 2 → r = 0 + 1 + 1 = 2.0

Walking nicely: v_forward = 1.5, upright = 0.9, foot_contacts = 1 → r = 3 + 0.9 + 0.5 = 4.4

The agent picks "fall forward" because 10.0 > 4.4. The intended behavior has lower reward than the exploit!
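The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch: the function name `shaped_reward` and the `behaviours` dictionary are illustrative names, not part of the course material.

```python
def shaped_reward(v_forward, upright, foot_contacts):
    """Hand-designed reward from the example: r = 2.0*v + 1.0*upright + 0.5*contacts."""
    return 2.0 * v_forward + 1.0 * upright + 0.5 * foot_contacts

# (v_forward, upright, foot_contacts) for each behaviour in the hand calculation
behaviours = {
    "falling forward": (5.0, 0.0, 0),
    "standing still": (0.0, 1.0, 2),
    "walking nicely": (1.5, 0.9, 1),
}

rewards = {name: shaped_reward(*features) for name, features in behaviours.items()}
best = max(rewards, key=rewards.get)  # what a pure reward maximizer would choose

for name, r in rewards.items():
    print(f"{name}: r = {r:.1f}")
print("maximizer picks:", best)
```

Changing the weights only moves the problem around: any weighting in which some unintended behaviour scores highest will be found by the optimizer.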

Core Insight

The problem isn't that RL is broken. RL is working perfectly — it's maximizing exactly what you told it to maximize. The problem is that what you told it and what you meant are different things. The gap between specification and intention is where reward hacking lives.

Reward Availability Across Domains

Domains with built-in rewards (green) vs. domains requiring learned rewards (red). The further right, the harder reward specification becomes.

Where Do Rewards Actually Come From?

In practice, reward signals come from four main sources:

1. Built-in (game score). Atari, Go, Chess. The environment literally returns a number. RL works out of the box.

2. Hand-designed. A human writes r(s,a) = w₁f₁(s) + w₂f₂(s) + .... Fragile, exploitable, requires domain expertise. Months of tuning for complex tasks.

3. Learned from demonstrations. Given expert trajectories, infer what reward the expert was optimizing. This is inverse RL (IRL).

4. Learned from preferences. Ask humans "which trajectory is better?" and learn a reward from pairwise comparisons. This is reward learning from preferences — the foundation of RLHF.

Warning

Many RL papers assume the reward function is "given." In the real world, designing r(s,a) often takes 10x longer than training the policy. The reward IS the hard part. This lesson is about solutions to that problem.

Chapter 02

Task Specification Landscape

Before diving into algorithms, let's map the full landscape of how humans communicate goals to agents. Each approach trades off ease-of-specification against reliability.

Definition

Task Specification

Any mechanism by which a human communicates desired behavior to an RL agent. Includes reward functions, demonstrations, goal images, preference comparisons, natural language instructions, and more. Each mechanism has different annotation cost, scalability, and robustness properties.

Approach	Input from Human	Pros	Cons
Hand-designed reward	Mathematical function r(s,a)	Direct; no learning needed	Fragile; reward hacking; expensive to tune
Imitation learning	Expert demonstrations τ*	Rich signal; covers full trajectories	Bounded by expert; distribution shift
Goal examples	Images/states of success	Easy to collect (point camera)	Binary only; no "how" information
Human preferences	Pairwise rankings (τ_A ≻ τ_B)	Humans good at comparison; scales to complex tasks	Expensive annotation; noisy
Language instructions	"Pick up the red cup"	Natural; flexible	Ambiguous; grounding problem
AI feedback	LLM judges trajectories	Infinite scale; cheap	Inherits LLM biases; no ground truth

Specification Cost vs. Reliability

Each approach occupies a different point in the annotation-cost vs. reliability tradeoff space.

The Demonstration Spectrum

Even within "demonstrations," there's a spectrum of how much information you provide:

Full demonstrations — Expert executes entire trajectories. Maximum information but maximum cost. One demo can take 10+ minutes for a complex manipulation task.

Goal images — Show the agent what "done" looks like. A single photo of a folded towel. Cheap to collect but only provides the endpoint, not the path.

Preference rankings — "This trajectory is better than that one." Doesn't require expertise, just judgment. A human can compare two robot videos in seconds.

Key Insight — Comparison is Easier Than Evaluation

Asking "rate this robot trajectory from 1-10" is hard — what's a 6 vs a 7? But asking "which of these two attempts is better?" is easy. Humans are reliable comparison machines, even when they can't produce absolute scores. This insight drives the entire RLHF pipeline.

Worked Example — Data Requirements

Consider a robotic manipulation task (pick-and-place):

Hand-designed reward: 2 weeks of a PhD student's time tuning weights, 0 annotation data.

Demonstrations: 50 kinesthetic demos at 3 min each = 150 minutes = 2.5 hours.

Goal images: 20 success photos + 20 failure photos = 10 minutes of data collection.

Preferences: 500 pairwise comparisons at 10 sec each = 83 minutes.

Goal images are cheapest per-annotation but provide least signal. Preferences scale to complex tasks where "what is success" is hard to photograph.

Warning — No Free Lunch

Every approach has failure modes. Demonstrations suffer distribution shift. Goal classifiers get exploited by OOD states. Preferences require thousands of comparisons. Language instructions are ambiguous. The "right" choice depends on your domain, budget, and failure tolerance.

Chapter 03

Goal Classifiers as Rewards

The simplest idea for learning a reward: train a binary classifier to distinguish "success" from "failure," then use its output probability as the reward signal.

Definition

Goal Classifier Reward

Reward from Classifier

r(s) = P(success | s) = σ(f_φ(s))

Where f_φ is a neural network with parameters φ, trained on labeled success/failure states. σ is the sigmoid function mapping logits to probabilities.

Data Collection

You need two sets of states:

Positive examples (success states): Images or state vectors of the task being "done." For a cup-placing task: photos of the cup on the target location. For a drawer-closing task: states where the drawer is fully closed.

Negative examples (failure states): Everything else — random states, half-completed attempts, wrong configurations. Usually easy to collect in bulk.

Concrete Example — Robotic Pushing

Task: push a block to a target location.

Positive states: 50 images with block on target (different angles, lighting).

Negative states: 200 images of block anywhere else.

Classifier: ResNet-18, fine-tuned. Outputs P(block_on_target | image).

RL reward: r(s_t) = P(success | s_t). Agent maximizes this.

When block is near target: P ≈ 0.85 → high reward. Block far away: P ≈ 0.1 → low reward. The classifier creates a smooth "reward landscape" that the RL agent can follow.

Goal Classifier Decision Boundary

Green = high P(success), red = low. The classifier creates a reward landscape the agent climbs.

Why Probabilities, Not Logits?

Using raw logits f_φ(s) as the reward is problematic. Logits are unbounded: the agent can find states where the logit is +1000 (the classifier is extremely confident) even though the state is actually nonsense, and a single such state would dominate the return. The sigmoid bounds the output to (0, 1), providing a natural reward scale.

Intuition — Reward Shaping via Classifier Confidence

The classifier isn't just saying "yes/no." Its confidence creates a gradient. States that are almost at the goal get reward 0.7. States far from the goal get reward 0.1. This gradient is what makes RL optimization possible — without it, the agent would get zero signal until it randomly achieves the goal.

Training the Goal Classifier

L(φ) = − (1/N) Σ_{i=1}^{N} [ y_i log σ(f_φ(s_i)) + (1−y_i) log(1 − σ(f_φ(s_i))) ]

Standard binary cross-entropy. y_i = 1 for success states, 0 for failure states.

Symbol breakdown:

• φ — classifier parameters (neural network weights).

• f_φ(s_i) — raw logit output for state s_i.

• σ(·) — sigmoid: σ(z) = 1/(1+e^−z). Maps logit to probability [0,1].

• y_i — label. 1 = success, 0 = failure.

• N — total labeled examples.
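The loss above can be written out directly. The sketch below uses only the standard library and hypothetical logits; a real goal classifier would compute f_φ(s_i) with a neural network and a framework's built-in loss. The `softplus` rewrite is one common way to avoid log(0) for large logits, and is an implementation choice, not something the course specifies.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def bce_loss(logits, labels):
    """Mean binary cross-entropy over N (logit, label) pairs.

    Uses -log sigma(z) = softplus(-z) and -log(1 - sigma(z)) = softplus(z),
    so large |z| never produces log(0).
    """
    def softplus(x):
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    total = 0.0
    for z, y in zip(logits, labels):
        total += y * softplus(-z) + (1 - y) * softplus(z)
    return total / len(logits)

# Hypothetical logits f_phi(s_i): two success states, two failure states.
logits = [2.0, 0.5, -1.0, -3.0]
labels = [1, 1, 0, 0]

loss = bce_loss(logits, labels)
rewards = [sigmoid(z) for z in logits]  # r(s) = sigma(f_phi(s)), the reward handed to RL
print(f"loss = {loss:.4f}")
print("rewards:", [round(r, 3) for r in rewards])
```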

Warning — The Simplicity Trap

This approach seems perfect: collect 50 images, train a classifier, run RL. In practice, it breaks spectacularly due to the exploitation problem (next section). The classifier was trained on a small, well-behaved dataset. RL will push the agent into states the classifier has never seen.

Chapter 04

The Exploitation Problem

Here's what actually happens when you use a fixed classifier as reward: RL finds adversarial inputs to the classifier. States that look nothing like success but fool the classifier into outputting P(success) ≈ 1.0.

Definition

Classifier Exploitation

When an RL policy discovers states outside the classifier's training distribution (OOD states) that receive high classifier confidence despite not representing true task success. The classifier was never trained to say "I don't know" — it always outputs a probability, even for states it's never seen.

Think of it this way: your classifier was trained on normal photos of cups on tables. The RL agent discovers that if the robot arm occludes the camera in exactly the right way, the classifier outputs 0.99. Or if the robot flips the cup upside down at a specific angle, it triggers the same visual features. These are adversarial examples found by RL optimization.

Hand Calculation — Exploitation

Classifier training data: 100 states from distribution D_train (normal robot operation).

RL policy explores: visits 10,000 states, many from distribution D_RL ≠ D_train.

In-distribution state (block near target): f_φ(s) = 2.1 → σ(2.1) = 0.89. Legitimate high reward.

OOD state (robot arm jammed against wall, camera occluded): f_φ(s) = 4.7 → σ(4.7) = 0.99. Illegitimate reward — task not completed!

The agent prefers the OOD exploit (0.99) over legitimate completion (0.89). Policy converges to the exploit.
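A short check of the numbers above, and of why a reward maximizer ends up at the exploit. The logits are the ones from the hand calculation; the state descriptions used as dictionary keys are illustrative labels only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Logits taken from the hand calculation above.
classifier_logits = {
    "block near target (in-distribution)": 2.1,
    "arm jammed, camera occluded (OOD)": 4.7,
}

rewards = {state: sigmoid(z) for state, z in classifier_logits.items()}

# A reward maximizer ranks states by r(s) alone; it has no notion of "legitimate".
preferred = max(rewards, key=rewards.get)

for state, r in rewards.items():
    print(f"{state}: r = {r:.2f}")
print("policy converges to:", preferred)
```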

Classifier Exploitation Visualization

Blue region = training distribution. Red dots = OOD states RL discovers that fool the classifier.

Why Does This Happen?

Neural network classifiers are interpolators, not extrapolators. Within the training distribution, they generalize well. Outside it, their predictions are meaningless — but they still output confident probabilities. A sigmoid always returns a number between 0 and 1, even for garbage inputs.

RL is the world's most aggressive optimizer. Given millions of environment interactions, it WILL find these weak spots. It's essentially performing adversarial attacks on your classifier, just through trial-and-error instead of gradient-based perturbation.

The GAN Connection

This problem is structurally similar to the generator-discriminator game in GANs. The "generator" (policy) learns to fool the "discriminator" (classifier), and a policy that locks onto a single exploit resembles mode collapse. The fix is also the same: update the discriminator to account for the generator's current outputs. This leads directly to the adversarial approach in the next section.

Warning — Severity Scales with Optimization Pressure

With weak RL (few steps, high regularization), exploitation may not manifest. With strong RL (millions of steps, good exploration), it becomes almost inevitable. The better your RL algorithm, the worse the exploitation. This is a fundamental tension: better optimization + bad reward = worse outcomes.

The Exploitation Inequality

max_{s ∈ D_RL} P(success | s) > max_{s ∈ D_success} P(success | s)

RL finds states with higher classifier confidence than actual success states. The exploit is "better" than genuine completion.

Derivation Challenge: Why does exploitation get worse with more RL steps?

Show that the chance of finding a high-confidence exploit increases with the number of unique states visited. Model the classifier's confidence on each OOD state as an independent draw from Uniform[0,1]. If the agent visits n OOD states, what is the expected maximum confidence?

The maximum of n uniform [0,1] random variables has expected value n/(n+1). This is a standard order statistics result.

As n grows: n/(n+1) → 1. With 9 OOD visits, expected max is 0.9. With 99 OOD visits, expected max is 0.99. With 999 OOD visits, 0.999.

Model: classifier assigns random confidence c ~ Uniform[0,1] to each OOD state (since it has no valid information about these states).

After visiting n OOD states, the agent selects the state with maximum confidence:

E[max(c₁, ..., c_n)] = n / (n+1)

Proof: CDF of max is F(x) = xⁿ. PDF is f(x) = n·xⁿ⁻¹. Expected value = ∫₀¹ x · n·xⁿ⁻¹ dx = n ∫₀¹ xⁿ dx = n/(n+1).

So with n=99 states explored, expected best exploit has confidence 0.99. With n=999, it's 0.999. RL with millions of steps visits thousands of OOD states → near-certain to find a high-confidence exploit.
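The closed form n/(n+1) can be sanity-checked by simulation. This is a minimal Monte Carlo sketch under the same modelling assumption (independent Uniform[0,1] confidences on OOD states); the function name, trial count, and seed are arbitrary choices.

```python
import random

def expected_max_uniform(n, trials=5000, seed=0):
    """Monte Carlo estimate of E[max(c_1, ..., c_n)] with c_i ~ Uniform[0, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.random() for _ in range(n))
    return total / trials

for n in (9, 99):
    estimate = expected_max_uniform(n)
    print(f"n={n}: simulated {estimate:.3f}, closed form {n / (n + 1):.3f}")
```

The uniform model is deliberately crude. Real classifiers are not uniformly random off-distribution, but the qualitative point survives: the more OOD states the policy samples, the higher the best spurious confidence it is likely to see.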

Chapter 05

Adversarial Classifier Update

The fix: don't leave the classifier static. When RL finds exploits, add those exploits as negative examples and retrain. The classifier learns "that's not success either." Then RL finds new exploits. Update again. Iterate.

Definition

Adversarial Goal Classifier

A goal classifier that is iteratively updated by adding the current RL policy's visited states as negative examples. This creates an adversarial game: the policy tries to fool the classifier, and the classifier adapts to reject the policy's attempts. Converges when the policy can only achieve high classifier confidence by actually completing the task.

Algorithm: Adversarial Goal Classifier RL

1. Initialize: Collect success states D⁺ and initial failure states D⁻. Train classifier C_φ on D⁺ ∪ D⁻.
2. Run RL: Train policy π_θ using r(s) = C_φ(s) for K steps. Collect visited states S_RL.
3. Update negatives: Add RL-visited states to negatives: D⁻ ← D⁻ ∪ S_RL.
4. Retrain classifier: Retrain C_φ on D⁺ ∪ D⁻ (expanded negatives).
5. Repeat: Go to step 2. Iterate until policy converges to actual task completion.
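The loop above can be sketched on a toy problem. Everything here is a stand-in chosen for illustration: a ten-state discrete world, a lookup-table "classifier" whose confidence on unseen states is hard-coded in `OOD_GUESS`, and an "RL" step that simply returns the highest-reward state. None of these names or values come from the course; the point is only the control flow of collect, add negatives, retrain.

```python
STATES = list(range(10))   # toy discrete state space (assumption)
TRUE_SUCCESS = 7           # the only state that actually solves the task
# Arbitrary confidences the classifier "hallucinates" on states it never saw.
OOD_GUESS = {3: 0.99, 5: 0.95}

def train_classifier(positives, negatives):
    """Toy stand-in for (re)training C_phi on D+ and D-."""
    pos, neg = set(positives), set(negatives)

    def classifier(s):
        if s in pos:
            return 0.9     # labelled success stays high even if also visited by RL
        if s in neg:
            return 0.05    # labelled failure is rejected
        return OOD_GUESS.get(s, 0.2)  # unseen state: unreliable guess
    return classifier

def run_rl(classifier):
    """Toy stand-in for RL: return the reward-maximizing state and the states visited."""
    best = max(STATES, key=classifier)
    return best, [best]

def adversarial_goal_classifier_rl(positives, negatives, rounds):
    negatives = list(negatives)
    history = []
    for _ in range(rounds):
        classifier = train_classifier(positives, negatives)  # steps 1 and 4
        policy_state, visited = run_rl(classifier)            # step 2
        history.append(policy_state)
        negatives.extend(visited)                             # step 3
    return history

history = adversarial_goal_classifier_rl(positives=[TRUE_SUCCESS], negatives=[0, 1], rounds=4)
print("state chosen by the policy in each round:", history)
```

In this toy run the policy first chases the two hallucinated states, and only after both have been added to D⁻ does the true success state become the best available option.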

The GAN Parallel

This is exactly the GAN (Generative Adversarial Network) training procedure:

GAN Component	Reward Learning Analog
Generator	RL policy π_θ (produces states)
Discriminator	Goal classifier C_φ (judges success)
Generator loss	Negative RL reward (−r)
Discriminator loss	Binary cross-entropy on updated D⁺/D⁻
Mode collapse	Policy exploiting one OOD state
Training instability	Oscillation between exploits and corrections

Worked Example — Two Iterations

Iteration 1:

D⁺ = {50 images of cup on target}. D⁻ = {100 random robot images}.

Train classifier. Run RL for 10k steps. Policy learns to occlude camera (exploit). C(s_exploit) = 0.97.

Iteration 2:

Add 500 RL-visited states (including occlusion states) to D⁻.

Retrain classifier. Now C(s_occlude) = 0.15. Exploit fixed.

Run RL again. Policy finds new exploit: pressing cup against wall at odd angle. C(s_wall) = 0.92.

Iteration 3: Add wall-pressing states to D⁻. Retrain. Eventually, the ONLY way to get high reward is genuine task completion.

Adversarial Training Iterations

The classifier boundary evolves as RL exploits are added as negatives. Blue = classifier confidence. Red dots = RL-discovered exploits added each round.

Why Balanced Batches Matter

Critical implementation detail: when retraining the classifier, use balanced batches — equal numbers of positive and negative examples per batch.

Balanced Batch Property

In a balanced batch: P(y=1) = P(y=0) = 0.5
⇒ For any true success state s⁺: C(s⁺) ≥ 0.5

If the classifier can't distinguish s⁺ from the batch baseline, it defaults to 0.5 (the prior). True success states always get at least this baseline reward.

Without balanced batches: if negatives outnumber positives 10:1, the classifier learns a prior of P(success) = 1/11 ≈ 0.09. Even true success states may get low reward, starving the RL signal.
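One simple way to get balanced batches is to sample each half of the batch separately. The sketch below is an assumption about how this might be done, not the course's implementation: `balanced_batch` and `class_prior` are hypothetical helpers, and the state IDs are placeholders for real images or state vectors.

```python
import random

def balanced_batch(positives, negatives, batch_size, rng):
    """Draw batch_size examples, half labelled 1 and half labelled 0.

    Sampling is with replacement, so a small positive set can still fill its half.
    """
    half = batch_size // 2
    batch = [(s, 1) for s in rng.choices(positives, k=half)]
    batch += [(s, 0) for s in rng.choices(negatives, k=half)]
    rng.shuffle(batch)
    return batch

def class_prior(num_pos, num_neg):
    """Fraction of positives a uniform (unbalanced) sampler would see."""
    return num_pos / (num_pos + num_neg)

rng = random.Random(0)
positives = [f"success_{i}" for i in range(50)]    # placeholder state IDs
negatives = [f"failure_{i}" for i in range(500)]   # 10:1 imbalance, grows every round

batch = balanced_batch(positives, negatives, batch_size=64, rng=rng)
print("positives in batch:", sum(label for _, label in batch), "of", len(batch))
print(f"prior without balancing: {class_prior(len(positives), len(negatives)):.3f}")
```

Because D⁻ grows every round while D⁺ stays fixed, the imbalance gets worse over time, which is why the balancing has to happen at batch-construction time and not once up front.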

Convergence Intuition

Each iteration, the classifier's "holes" (OOD regions with high confidence) get patched. After enough iterations, the only remaining high-confidence region is the actual success manifold. The policy has no choice but to genuinely complete the task.

Warning — Stability

Like GANs, this procedure can be unstable. If you add too many negatives too fast, the classifier becomes conservative and gives low reward everywhere (reward collapse). If you don't update frequently enough, exploitation persists. The balance is delicate.

Scenario: Your adversarial classifier RL system has been running for 5 iterations. The policy's average reward keeps increasing, but task success rate (measured by ground truth) stays at 0%. What's happening?

Iteration 1: reward=0.3, success=0%. Iteration 2: reward=0.5, success=0%. Iteration 3: reward=0.6, success=0%. Iteration 4: reward=0.7, success=0%. Iteration 5: reward=0.8, success=0%.

Diagnose the failure mode and propose a fix:

Model Answer

The classifier is being updated but the exploit keeps shifting to new OOD regions faster than the classifier can cover them. The reward increases because each new exploit has higher confidence than the last (order statistics effect from Ch 4). Fix: (1) Increase the RL budget between classifier updates so the policy fully exploits the current classifier, making the exploit easier to identify. (2) Add ALL visited states as negatives, not just final states. (3) Use ensemble of classifiers — exploit must fool all of them.

Chapter 06

Human Preferences

Goal classifiers need labeled "success" and "failure" states. But for many tasks — dialog quality, driving smoothness, code correctness — there's no clear binary. What does "success" even mean for a conversation?

Solution: don't ask humans for absolute labels. Ask them for pairwise comparisons. "Which of these two trajectories is better?" This is easier, more reliable, and doesn't require the annotator to be an expert.

Definition

Preference-Based Reward Learning

Learning a reward function r_θ(s, a) from human pairwise preference labels. Given two trajectory segments σ¹ and σ², a human indicates which is better: σ¹ ≻ σ² or σ² ≻ σ¹. The reward function is trained to be consistent with these preferences.

The Bradley-Terry Model

How do we turn "trajectory A is better than trajectory B" into a loss function? We need a probabilistic model of preferences. The Bradley-Terry model (1952) provides exactly this:

Definition

Bradley-Terry Model

Preference Probability

P(τ_w ≻ τ_l) = σ(R(τ_w) − R(τ_l))

where R(τ) = Σ_t r_θ(s_t, a_t) (sum of learned rewards along trajectory)
and σ(x) = 1 / (1 + e^(−x)) (sigmoid function)

The probability that τ_w is preferred over τ_l depends only on the difference in total reward. Higher total reward → more likely to be preferred.

Symbol-by-symbol breakdown:

• τ_w — the "winner" trajectory (the one the human preferred).

• τ_l — the "loser" trajectory (the one the human rejected).

• R(τ) — total return: sum of per-step rewards along the trajectory.

• r_θ(s_t, a_t) — learned reward at each step. This is what we're training.

• σ — sigmoid. Maps reward difference (−∞, +∞) to probability (0, 1).

Derivation — Why This Form?

Assume each trajectory has "quality" proportional to exp(R(τ)). Then:

P(τ_w ≻ τ_l) = exp(R(τ_w)) / (exp(R(τ_w)) + exp(R(τ_l)))

Divide numerator and denominator by exp(R(τ_w)):

= 1 / (1 + exp(R(τ_l) − R(τ_w))) = σ(R(τ_w) − R(τ_l))

This is the softmax over two items, the same logistic form used in Elo ratings (chess) and in logistic pairwise classification. TrueSkill (Xbox) is a close relative that uses a Gaussian rather than a logistic link. It's the maximum-entropy model consistent with "higher reward = more preferred."

Hand Calculation — Bradley-Terry

Two robot trajectories for a reaching task:

τ_A: 5 steps, learned rewards = [0.1, 0.3, 0.5, 0.7, 0.9] → R(τ_A) = 2.5

τ_B: 5 steps, learned rewards = [0.2, 0.2, 0.3, 0.3, 0.4] → R(τ_B) = 1.4

P(τ_A ≻ τ_B) = σ(2.5 − 1.4) = σ(1.1) = 1/(1+e^−1.1) = 1/(1+0.333) = 0.75

If a human preferred τ_A, this 75% prediction is reasonable. The loss encourages pushing this probability higher (toward 1.0) by increasing rewards along τ_A relative to τ_B.
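The same calculation in code, as a minimal sketch. The per-step rewards are the ones from the hand calculation; the function names are illustrative.

```python
import math

def trajectory_return(step_rewards):
    """R(tau): sum of per-step learned rewards."""
    return sum(step_rewards)

def preference_probability(rewards_w, rewards_l):
    """Bradley-Terry: P(tau_w preferred over tau_l) = sigma(R(tau_w) - R(tau_l))."""
    diff = trajectory_return(rewards_w) - trajectory_return(rewards_l)
    return 1.0 / (1.0 + math.exp(-diff))

tau_a = [0.1, 0.3, 0.5, 0.7, 0.9]
tau_b = [0.2, 0.2, 0.3, 0.3, 0.4]

p_a_over_b = preference_probability(tau_a, tau_b)
p_b_over_a = preference_probability(tau_b, tau_a)
print(f"P(A > B) = {p_a_over_b:.2f}, P(B > A) = {p_b_over_a:.2f}")
```

The two probabilities sum to 1 by construction, since σ(x) + σ(−x) = 1.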

Bradley-Terry Preference Model

The sigmoid maps reward difference to preference probability; larger differences make the model more confident.

The Preference Loss Function

Reward Learning Loss

L(θ) = − Σ_{(w,l) ∈ D} log σ(R_θ(τ_w) − R_θ(τ_l))

Maximize log-probability of observed preferences. Standard cross-entropy with label = 1 for all pairs (since we define w as the winner).

This is identical to binary cross-entropy where every example has label 1 (because we always put the winner first). The gradient pushes R(τ_w) up and R(τ_l) down.

Key Insight — Only Differences Matter

Adding a constant c to all rewards doesn't change preferences: σ((R+c) − (R'+c)) = σ(R − R'). The reward function is identified only up to a shift. This means the absolute scale of learned rewards is arbitrary — only relative ordering matters.
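The shift invariance is easy to verify numerically. The sketch below evaluates the preference loss on hypothetical (winner return, loser return) pairs, then again after adding 100 to every return. The pair values and function names are made up for illustration.

```python
import math

def log_sigmoid(x):
    """Stable log(sigma(x)) = -softplus(-x)."""
    return -(max(-x, 0.0) + math.log1p(math.exp(-abs(x))))

def preference_loss(return_pairs):
    """Sum over (R_w, R_l) pairs of -log sigma(R_w - R_l)."""
    return -sum(log_sigmoid(r_w - r_l) for r_w, r_l in return_pairs)

# Hypothetical (winner return, loser return) pairs; the second is mispredicted, the third is a tie.
pairs = [(2.5, 1.4), (0.3, 0.9), (4.0, 4.0)]
shifted = [(r_w + 100.0, r_l + 100.0) for r_w, r_l in pairs]

loss = preference_loss(pairs)
loss_shifted = preference_loss(shifted)
print(f"loss = {loss:.6f}, loss after shifting all returns by 100 = {loss_shifted:.6f}")
```

A practical consequence is that learned rewards from two separately trained reward models cannot be compared by absolute value, only by the orderings they induce.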

Chapter 07

The Reward Learning Algorithm

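Now we assemble the full pipeline. The key insight: reward learning and policy optimization can run in a loop. Collect trajectories, get preferences, update reward, update policy. Repeat. A note on notation: from this chapter on, θ denotes the policy parameters and φ the reward-model parameters (Chapter 06 wrote the reward model as r_θ).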

Algorithm: Reward Learning from Human Preferences

1. Sample k trajectories from current policy π_θ: {τ₁, ..., τ_k}.
2. Present pairs to human: Show all (k choose 2) = k(k−1)/2 trajectory pairs. Human labels winner for each pair.
3. Update reward function: Minimize L(φ) = − Σ_(w,l) log σ(R_φ(τ_w) − R_φ(τ_l)).
4. Update policy: Run RL (e.g., PPO) using r_φ(s,a) as reward for M steps.
5. Repeat from step 1 until convergence or annotation budget exhausted.

Scaling with (k choose 2)

A crucial efficiency trick: a full ranking of k trajectories implies k(k−1)/2 pairwise labels. With k=10, one ranking yields 45 pairs. With k=20, it yields 190 pairs. The human does not have to judge each pair separately, because the ranking fixes the winner of every pair.

Hand Calculation — Annotation Efficiency

Budget: 100 human comparison queries per round.

Naive (present raw pairs): 100 comparisons yield 100 labeled pairs.

Smart (rank k trajectories): Ask the human to rank k=15 trajectories. A comparison-based sort needs at least ⌈log₂(15!)⌉ = 41 comparisons in the worst case, and the resulting ranking implies (15 choose 2) = 105 pairwise labels.

Efficiency gain: 105 / 41 ≈ 2.6x more pairwise labels per human comparison than with isolated pairs. The implied labels rely on the ranking being transitive, so they carry less independent information than 105 separately judged pairs.
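Expanding a ranking into training pairs is mechanical. The sketch below assumes the human's ranking is transitive, so every trajectory beats all trajectories ranked below it. The helper name and trajectory IDs are placeholders.

```python
from itertools import combinations
from math import comb

def ranking_to_pairs(ranking):
    """Expand a best-to-worst ranking into (winner, loser) pairs.

    combinations() preserves input order, so the earlier (better) item comes first.
    """
    return list(combinations(ranking, 2))

ranking = [f"tau_{i}" for i in range(1, 16)]  # tau_1 is best, tau_15 is worst
pairs = ranking_to_pairs(ranking)

print("pairs from one ranking of 15:", len(pairs))
print("first pair:", pairs[0], "last pair:", pairs[-1])
```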

Gradient of Reward Loss

∇_φ L = − Σ_(w,l) [1 − σ(R_φ(τ_w) − R_φ(τ_l))] · [∇_φ R_φ(τ_w) − ∇_φ R_φ(τ_l)]

When the model already correctly predicts the preference (high σ), the gradient is small. Focuses learning on "surprising" preferences.

Symbol breakdown of the gradient:

• [1 − σ(·)] — prediction error. Close to 0 when prediction is correct, close to 1 when wrong.

• ∇_φ R_φ(τ_w) — gradient of total reward of winner w.r.t. reward parameters. This gets pushed up.

• −∇_φ R_φ(τ_l) — gradient of total reward of loser. This gets pushed down.
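To see the gradient at work, the sketch below uses the simplest possible reward model: a single scalar parameter φ multiplying a summed trajectory feature F(τ), so R_φ(τ) = φ·F(τ) and ∇_φ R_φ(τ) = F(τ). This linear model, the data, and the learning rate are all assumptions made for illustration; a real reward model would be a neural network trained with automatic differentiation.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def loss_and_grad(phi, pairs):
    """pairs holds (F_w, F_l): summed feature of the winner and of the loser.

    With R_phi(tau) = phi * F(tau), the gradient of R_phi(tau) w.r.t. phi is F(tau).
    """
    loss, grad = 0.0, 0.0
    for f_w, f_l in pairs:
        p = sigmoid(phi * f_w - phi * f_l)   # model's P(winner preferred)
        loss += -math.log(p)
        grad += -(1.0 - p) * (f_w - f_l)     # [1 - sigma(.)] * [grad R_w - grad R_l]
    return loss, grad

# Hypothetical data in which the human always prefers the larger feature sum.
pairs = [(3.0, 1.0), (2.0, 0.5), (4.0, 3.5)]

phi, lr = 0.0, 0.1
losses = []
for _ in range(50):
    loss, grad = loss_and_grad(phi, pairs)
    losses.append(loss)
    phi -= lr * grad                          # gradient descent step

print(f"phi after training: {phi:.3f}")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

As φ grows, σ(·) approaches 1 on every pair and the [1 − σ(·)] factor shrinks the updates, which is the "focus on surprising preferences" behaviour described above.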

Online Reward Learning Loop

The reward function evolves as preferences accumulate. The policy improves as the reward becomes more accurate.

Warning — Reward Overoptimization

As the policy improves, it may overfit to imperfections in the learned reward. The reward function was trained on data from earlier policies. The current policy may visit states where the reward model is inaccurate. This is the preference-learning analog of classifier exploitation. Fix: keep collecting fresh preferences from the improving policy.

Why Online Matters

If you only collect preferences once and train reward to convergence, then train policy to convergence, the policy will overfit to reward model errors. The interleaved loop prevents this: the reward model always has fresh data from the current policy's distribution.

Implementation Checkpoint: Implement the Bradley-Terry loss

Implement the reward learning loss in PyTorch. The function takes a reward model, a batch of winner trajectories, and loser trajectories, and returns the scalar loss.

Python

    def preference_loss(reward_model, tau_w, tau_l):
        """
        Args:
            reward_model: nn.Module mapping states to scalar rewards
            tau_w: Tensor [batch, T, state_dim] - winner trajectories
            tau_l: Tensor [batch, T, state_dim] - loser trajectories
        Returns:
            loss: scalar Tensor
        """

Solution

    import torch.nn.functional as F

    def preference_loss(reward_model, tau_w, tau_l):
        # Per-step rewards; the model is assumed to output [batch, T, 1]
        r_w = reward_model(tau_w).squeeze(-1)  # [batch, T]
        r_l = reward_model(tau_l).squeeze(-1)  # [batch, T]
        # Sum over time to get total return
        R_w = r_w.sum(dim=1)  # [batch]
        R_l = r_l.sum(dim=1)  # [batch]
        # Bradley-Terry loss: -log sigma(R_w - R_l), averaged over the batch
        loss = -F.logsigmoid(R_w - R_l).mean()
        return loss

Chapter 08

RLHF for LLMs

Everything above applies to robotics, games, and continuous control. But the technique that made reward learning famous is RLHF for language models — the process that turned GPT-3 into ChatGPT.

The pipeline has three stages, each building on the previous:

Definition

RLHF (Reinforcement Learning from Human Feedback)

A three-stage pipeline for aligning language models with human preferences: (1) Pretraining on internet text, (2) Supervised Fine-Tuning (SFT) on curated demonstrations, (3) RL optimization against a learned reward model trained on human preference comparisons.

Stage 1: Pretraining

Train a large language model on internet text via next-token prediction. This gives a model that can write fluently but has no concept of "helpfulness" or "safety." It's like a person who can speak perfectly but has no goals.

Stage 2: Supervised Fine-Tuning (SFT)

Collect ~10k-100k demonstrations of ideal assistant behavior. Human annotators write high-quality responses to prompts. Fine-tune the pretrained model on these (prompt, response) pairs via standard cross-entropy loss. This gives a model that tries to be helpful but isn't consistent.

Stage 3: RLHF

This is where reward learning enters:

Algorithm: RLHF for Language Models

1. Sample prompt x from a dataset of diverse user queries.
2. Generate responses: Sample y₁, y₂ ~ π_SFT(y|x) from the SFT model.
3. Human comparison: Annotator picks the better response: y_w ≻ y_l.
4. Train reward model: r_φ(x, y) trained with Bradley-Terry loss on accumulated comparisons.
5. Fine-tune with PPO: Optimize π_θ to maximize r_φ(x, y) with KL penalty: max E[r_φ(x,y)] − β · KL(π_θ || π_SFT).

RLHF Objective

max_θ 𝔼_{x~D, y~π_θ}[ r_φ(x, y) ] − β · 𝔼_{x~D}[ KL(π_θ(·|x) || π_SFT(·|x)) ]

Maximize reward while staying close to SFT model. β controls the tradeoff.

Symbol breakdown:

• r_φ(x, y) — learned reward for response y given prompt x. A neural network (often the LLM itself with a value head).

• β — KL penalty coefficient. Higher β → stay closer to SFT model (more conservative).

• KL(π_θ || π_SFT) — Kullback-Leibler divergence measuring how much the policy has drifted from its starting point.
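For intuition, the objective can be evaluated exactly when the policy chooses among a handful of candidate responses. This is a deliberately tiny sketch: real LLM training estimates the KL term per token from sampled responses, and the three-response distributions and reward values below are invented for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalized_objective(expected_reward, policy_probs, sft_probs, beta):
    """E[r_phi] - beta * KL(pi_theta || pi_SFT) for a single prompt."""
    return expected_reward - beta * kl_divergence(policy_probs, sft_probs)

# Hypothetical distributions over three candidate responses to one prompt.
sft_probs = [0.5, 0.3, 0.2]
policy_probs = [0.2, 0.3, 0.5]   # policy has shifted mass toward the third response
rewards = [1.0, 2.0, 4.0]        # r_phi(x, y) for each candidate

expected_reward = sum(p * r for p, r in zip(policy_probs, rewards))
kl = kl_divergence(policy_probs, sft_probs)
objective = penalized_objective(expected_reward, policy_probs, sft_probs, beta=0.1)

print(f"E[r] = {expected_reward:.2f}, KL = {kl:.3f}, objective = {objective:.3f}")
```

Raising `beta` makes the same shift in probability mass more expensive, so the optimum moves back toward `sft_probs`.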

Why the KL Penalty?

Without it, the model would hack the reward model — finding degenerate outputs that get high reward but are nonsensical (the same exploitation problem from Ch 4!). The KL penalty says "you can't drift too far from a sensible model." It's a regularizer against reward overoptimization.

Worked Example — KL Tradeoff

Prompt: "Explain quantum mechanics simply."

β = 0 (no constraint): Model finds reward hack. Outputs repetitive superlatives: "Absolutely amazing incredible wonderful fantastic..." → r = 8.2 (reward model fooled by enthusiastic language).

β = 0.1 (moderate): Model gives a genuinely helpful explanation. r = 5.4, KL = 12.0. Net: 5.4 − 0.1(12) = 4.2.

β = 1.0 (heavy): Model barely moves from SFT. r = 3.8, KL = 0.5. Net: 3.8 − 0.5 = 3.3. Too conservative.

Sweet spot: β around 0.01–0.1 for most LLM training.

RLHF Three-Stage Pipeline

The complete pipeline from raw LLM to aligned assistant. Each stage builds on the previous.

Warning — Reward Model Capacity

The reward model must be powerful enough to represent complex preferences but not so powerful that it memorizes individual comparisons. In practice, the reward model is often the same size as the policy model (both are LLMs). Training a 7B reward model to judge a 7B policy's outputs is standard.

Progression Gate

Why does RLHF use the SFT model as initialization for RL, rather than the pretrained model? What would go wrong if you skipped Stage 2?

Model Answer

The pretrained model completes text, not instructions. It might continue a prompt with "..." or write it as a blog post instead of answering. SFT teaches the model the format of a helpful response (answer the question, appropriate length, conversational tone). RL then improves quality within that format. Without SFT, RL would have to simultaneously learn both format AND quality from the reward signal alone — a much harder optimization problem with a weaker starting point.

Chapter 09

RLAIF / Constitutional AI

RLHF's biggest bottleneck: human annotation is expensive. Each comparison costs $1–5 depending on task complexity. Training a frontier model requires 100k+ comparisons = $100k–500k just for labels.

Key observation: critique is easier than generation. You don't need a Nobel laureate to judge which of two physics explanations is better. Similarly, a language model can often judge which response is better, even if it can't generate the better response itself.

Definition

RLAIF (RL from AI Feedback)

Replacing human annotators with an AI model (typically a large LLM) that provides preference judgments. The AI judges which response is better based on specified criteria (helpfulness, harmlessness, honesty). The rest of the pipeline (Bradley-Terry reward model, PPO fine-tuning) remains identical to RLHF.

Constitutional AI (Bai et al., 2022)

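Anthropic's Constitutional AI formalizes RLAIF with an explicit constitution, a set of principles the AI judge must follow. The algorithm below is a simplified view: in Bai et al., the critique-and-revision outputs are used for a supervised fine-tuning stage, and the preference pairs for the reward model come from an AI judge comparing two sampled responses against the principles.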

Algorithm: Constitutional AI

1. Generate responses: Given prompt x, sample response y from the model.
2. Critique: Ask a "judge" LLM: "Does this response violate principle P?" (e.g., "Is it harmful? Is it dishonest?")
3. Revise: Ask the LLM to rewrite y to fix any violations → y'.
4. Preference pairs: Original y vs. revised y' form a preference pair (y' ≻ y).
5. Train reward model on accumulated (revised ≻ original) pairs.
6. RL fine-tune policy against learned reward model.
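The data-collection part of the steps above (generate, critique, revise, pair) can be sketched as a function that takes the three model calls as parameters. The callables `generate`, `critique`, and `revise` are hypothetical interfaces assumed here to wrap LLM calls; the stub versions below exist only so the sketch runs without a model.

```python
def build_preference_pairs(prompts, generate, critique, revise, principle):
    """Collect (prompt, preferred, rejected) triples from critique-and-revise.

    critique is assumed to return True when the response violates the principle.
    """
    pairs = []
    for x in prompts:
        y = generate(x)
        if critique(x, y, principle):
            y_revised = revise(x, y, principle)
            pairs.append((x, y_revised, y))  # revised response is preferred
    return pairs

# Stub callables standing in for LLM calls.
def fake_generate(x):
    return f"detailed answer to: {x}"

def fake_critique(x, y, principle):
    return "lock" in x

def fake_revise(x, y, principle):
    return f"high-level answer to: {x}"

pairs = build_preference_pairs(
    ["How do I pick a lock?", "What is a pin tumbler?"],
    fake_generate,
    fake_critique,
    fake_revise,
    principle="Responses should not provide detailed instructions for illegal activities.",
)
for prompt, preferred, rejected in pairs:
    print(f"{prompt!r}: {preferred!r} preferred over {rejected!r}")
```

The resulting triples have the same shape as human-labelled comparisons, so they can be fed to the same Bradley-Terry reward-model training.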

Worked Example — Constitutional Critique

Prompt: "How do I pick a lock?"

Model response y: "Here's a step-by-step guide to pick a pin tumbler lock: First, insert a tension wrench..."

Principle: "Responses should not provide detailed instructions for illegal activities."

Critique: "This response provides detailed lock-picking instructions that could be used for burglary. Violation: yes."

Revision y': "Lock-picking is a legitimate skill for locksmiths. I can explain the general principle (pin tumbler mechanisms) without providing step-by-step instructions that could enable illegal entry."

Preference: y' ≻ y. This pair trains the reward model.

The Helpfulness-Harmlessness Frontier

A fundamental tension in alignment: maximizing helpfulness often conflicts with minimizing harm. A model that refuses everything is maximally harmless but useless. A model that helps with anything is maximally helpful but dangerous.

Figure: Helpfulness-Harmlessness Pareto Frontier. Different training strategies achieve different points on the Pareto frontier. RLAIF with constitutional principles can push the frontier outward.
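To make the frontier concrete, here is a small sketch that picks out the non-dominated models from a set of (helpfulness, harmlessness) scores. The scores are invented for illustration, and `pareto_frontier` is a hypothetical helper, not part of any library.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (helpfulness, harmlessness), higher is better on both


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points that no other point matches or beats on both axes."""
    frontier = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1]
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return sorted(frontier)


# Made-up scores: a refuse-everything model, a help-with-anything model,
# and three models in between.
models = [(0.1, 1.0), (0.9, 0.2), (0.6, 0.7), (0.5, 0.5), (0.3, 0.6)]
frontier = pareto_frontier(models)
```

Here `(0.5, 0.5)` and `(0.3, 0.6)` are both dominated by `(0.6, 0.7)`, so only the other three models lie on the frontier. "Pushing the frontier outward" means finding a model that dominates one of the current frontier points.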

Why AI Feedback Works

Critique is easier than generation for the same reason that reviewing code is easier than writing it, or grading an essay is easier than writing one. The judge only needs to compare two options against clear criteria. The generator needs to create something good from scratch. This asymmetry means a model can meaningfully judge outputs that are at or near its own quality level.

Warning — AI Judge Biases

AI judges inherit biases: preference for longer responses, formal language, responses that agree with the prompt's premise, and sycophantic behavior. If your AI judge prefers verbose responses, your trained model will be verbose regardless of whether humans prefer concise answers. Calibration against human preferences is essential.
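One way to carry out the calibration check is to compare judge labels with a small human-labeled set and to measure how often the judge sides with the longer response. The sketch below assumes labels are encoded as 0 (first response wins) or 1 (second response wins); the function names and the example data are hypothetical.

```python
from typing import List, Tuple


def agreement_rate(judge_labels: List[int], human_labels: List[int]) -> float:
    """Fraction of comparisons where the AI judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def longer_response_win_rate(
    pairs: List[Tuple[str, str]], judge_labels: List[int]
) -> float:
    """Fraction of unequal-length pairs where the judge picked the longer response."""
    longer_wins = 0
    counted = 0
    for (a, b), label in zip(pairs, judge_labels):
        if len(a) == len(b):
            continue  # no length signal in this pair
        counted += 1
        longer_is_b = len(b) > len(a)
        if (label == 1) == longer_is_b:
            longer_wins += 1
    return longer_wins / counted if counted else 0.0


# Invented example: the judge always picks the longer response.
pairs = [
    ("short", "a much longer answer"),
    ("a long detailed reply", "ok"),
    ("yes", "yes, certainly so"),
    ("brief", "tiny"),
]
judge = [1, 0, 1, 0]
human = [1, 0, 0, 1]

agreement = agreement_rate(judge, human)
length_bias = longer_response_win_rate(pairs, judge)
```

A longer-response win rate far above the rate seen in the human labels is a hint of verbosity bias in the judge; what threshold counts as "far above" is a judgment call for the task at hand.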

RLAIF vs RLHF Comparison

RLHF: D_pref = {(τ_w, τ_l) : labeled by humans}
RLAIF: D_pref = {(τ_w, τ_l) : labeled by LLM judge}

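Empirical finding (Bai et al. and follow-up work): AI feedback can approach the quality of human feedback at a small fraction of the labeling cost. A rough rule of thumb is ~95% of RLHF quality at ~1% of the cost, though the exact figures depend on the task and the judge. The gap narrows as judge models improve.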

Chapter 10

Unsupervised RL & Self-Play

All methods so far require human input (demonstrations, preferences, or principles). But can agents propose their own goals? Can they learn useful skills without any human reward signal?

Definition

Unsupervised RL

Training RL agents without any externally specified reward. The agent generates its own reward signal through intrinsic motivation (curiosity, surprise, novelty) or self-play (competing against itself). The goal: pre-train diverse skills that transfer to downstream tasks.

Asymmetric Self-Play (Sukhbaatar et al., 2017)

Idea: two agents, Alice and Bob, play an asymmetric game:

Alice's job: Reach some state and challenge Bob: "Get to where I am." Alice gets reward if Bob fails.

Bob's job: Reach Alice's goal state. Bob gets reward if he succeeds.

The game naturally escalates: Alice proposes increasingly difficult goals (to stump Bob), and Bob develops increasingly capable skills (to achieve those goals). No human specifies what goals are interesting — the adversarial pressure generates a curriculum automatically.

Algorithm: Asymmetric Self-Play

Alice acts for T_A steps from start state s₀, reaching state s_goal.
Bob resets to s₀ and tries to reach s_goal within T_B steps.
Alice's reward: +1 if Bob fails, −1 if Bob succeeds. Alice wants to find hard goals.
Bob's reward: +1 if Bob reaches s_goal, −1 if Bob fails. Bob wants to be capable.
Update both policies with standard RL (e.g., policy gradients).
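A single episode of the game above can be sketched on a 1-D integer line. This uses the simplified ±1 rewards exactly as listed in the steps; the environment, the `self_play_episode` function, and the two fixed toy policies are assumptions for illustration, and the policy-update step is left out.

```python
from typing import Callable, Tuple

AlicePolicy = Callable[[int], int]      # state -> action in {-1, +1}
BobPolicy = Callable[[int, int], int]   # (state, goal) -> action in {-1, +1}


def self_play_episode(
    alice: AlicePolicy, bob: BobPolicy, s0: int, t_alice: int, t_bob: int
) -> Tuple[int, float, float]:
    """Run one Alice/Bob round; return (s_goal, alice_reward, bob_reward)."""
    # Step 1: Alice acts for T_A steps; her final state becomes the goal.
    s = s0
    for _ in range(t_alice):
        s += alice(s)
    s_goal = s

    # Step 2: Bob resets to s0 and has at most T_B steps to reach s_goal.
    s = s0
    bob_succeeded = s == s_goal
    for _ in range(t_bob):
        if bob_succeeded:
            break
        s += bob(s, s_goal)
        bob_succeeded = s == s_goal

    # Steps 3-4: zero-sum rewards as described in the list above.
    alice_reward = -1.0 if bob_succeeded else 1.0
    bob_reward = 1.0 if bob_succeeded else -1.0
    return s_goal, alice_reward, bob_reward


def alice_right(state: int) -> int:
    return 1  # toy Alice: always walk right


def bob_greedy(state: int, goal: int) -> int:
    return 1 if goal > state else -1  # toy Bob: walk toward the goal


easy = self_play_episode(alice_right, bob_greedy, s0=0, t_alice=3, t_bob=5)
hard = self_play_episode(alice_right, bob_greedy, s0=0, t_alice=5, t_bob=3)
```

In the `easy` round Bob reaches the goal and Alice is penalized; in the `hard` round Bob runs out of steps and Alice is rewarded. In a full implementation these rewards would feed a policy-gradient update for each agent.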

The Automatic Curriculum

Initially, Alice proposes easy goals (1 step away). Bob learns to solve them. Then Alice must propose harder goals (2 steps away). Bob learns those too. The difficulty ratchets up naturally — Alice can't propose impossible goals (she must reach them herself) and can't propose too-easy goals (Bob already solves those). This is a self-regulating curriculum without any human design.

Intrinsic Motivation

Alternative to self-play: reward the agent for novelty or surprise. Formalized as prediction error of a forward model:

Curiosity-Driven Reward

r_intrinsic(s_t, a_t) = || f(s_t, a_t) − s_{t+1} ||²

High reward when the agent encounters states its forward model can't predict. Encourages exploration of novel regions.
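As a numeric illustration of the prediction-error reward, the sketch below uses a hand-written forward model f(s, a) = s + a in place of a learned network. The model, the states, and the function names are assumptions chosen to keep the arithmetic easy to follow.

```python
import numpy as np


def intrinsic_reward(forward_model, s_t, a_t, s_next) -> float:
    """Squared L2 error between the predicted and the observed next state."""
    predicted = forward_model(s_t, a_t)
    return float(np.sum((predicted - s_next) ** 2))


def additive_forward_model(s, a):
    # Hypothetical stand-in for a learned model: next state = state + action.
    return s + a


s = np.array([0.0, 0.0])
a = np.array([1.0, 0.0])

# Transition the model predicts perfectly: no curiosity reward.
familiar = intrinsic_reward(additive_forward_model, s, a, np.array([1.0, 0.0]))

# Transition with an unexpected jump of 2 in the second coordinate: reward 2^2 = 4.
novel = intrinsic_reward(additive_forward_model, s, a, np.array([1.0, 2.0]))
```

In practice the forward model is trained on the agent's own experience, so the reward for a given transition shrinks as that transition becomes familiar.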

Problem: the agent may find "noise" rewarding (TV static is always unpredictable). Modern approaches use learned feature spaces instead of raw state predictions to filter out irrelevant stochasticity.

Warning — The Noisy TV Problem

An agent with curiosity reward placed in a room with a TV showing random static will stare at the TV forever — it's maximally unpredictable! Solutions: use learned representations (ignore pixel noise), limit prediction to controllable aspects of the environment, or use ensemble disagreement instead of raw error.
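The difference between raw prediction error and ensemble disagreement can be seen in a toy simulation of the TV. This sketch assumes the TV's next observation is pure Gaussian noise and uses deliberately trivial "forward models" (each one predicts the mean of its own bootstrap sample); it is an illustration of the idea, not a reproduction of any published method.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy-TV transition: the next observation is pure noise,
# independent of anything the agent does.
tv_next_obs = rng.normal(loc=0.0, scale=1.0, size=2000)

# Ensemble of K trivial forward models. Each member predicts the mean next
# observation, estimated from its own bootstrap resample of the data.
K = 5
ensemble_preds = np.array([
    rng.choice(tv_next_obs, size=tv_next_obs.size, replace=True).mean()
    for _ in range(K)
])

# Raw curiosity signal: squared error of one member on fresh TV samples.
# The noise is irreducible, so this stays near the noise variance (about 1).
fresh_obs = rng.normal(loc=0.0, scale=1.0, size=1000)
prediction_error = float(np.mean((ensemble_preds[0] - fresh_obs) ** 2))

# Disagreement signal: variance of the members' predictions. All members
# converge to nearly the same mean, so this shrinks toward zero with more data.
disagreement = float(np.var(ensemble_preds))
```

The prediction error never decays because no model can predict noise, while the disagreement vanishes once every member has seen enough data. An agent rewarded by disagreement therefore loses interest in the TV.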

Figure: Self-Play Goal Difficulty Escalation. Alice (gold) proposes goals and Bob (blue) tries to reach them; the difficulty increases as both improve.

Chapter 11

Summary & Comparison

We've covered the complete spectrum of reward specification — from hand-designed rewards to AI-generated preferences. Here's how they compare:

Method	Human Input	Scalability	Failure Mode	Best For
Hand-designed	Engineer writes r(s,a)	Zero annotation cost	Reward hacking	Simple tasks with clear metrics
Goal classifier	Success/failure images	~50-200 images	OOD exploitation	Robotics with visual goals
Adversarial classifier	Same + iterative updates	Moderate compute	Instability, reward collapse	Robotics with persistent exploits
Human preferences	Pairwise comparisons	1k-100k queries	Reward overoptimization	Complex tasks (dialog, driving)
RLHF (LLMs)	SFT demos + comparisons	100k+ queries	KL collapse or reward hacking	Language model alignment
RLAIF	Constitution (principles)	Infinite (AI judges)	Judge bias propagation	Scaling RLHF cheaply
Self-play	None	Infinite (self-generated)	Degenerate equilibria	Pre-training diverse skills

The Master Insight

Rewards can't be taken for granted. In the real world, defining "what good looks like" IS the hard problem. The RL algorithm is the easy part — it just maximizes whatever number you give it. The entire field of reward learning exists because writing down that number correctly is somewhere between difficult and impossible for most tasks that matter.

Decision Flowchart

Q: Can you write r(s,a) directly? (e.g., game score, physical distance to goal) → Use hand-designed reward. Done.

Q: Can you photograph "success"? (e.g., robotics with clear goal states) → Use goal classifier (with adversarial updates).

Q: Can humans compare outcomes? (e.g., dialog, content quality) → Use preference learning (RLHF).

Q: Is human annotation too expensive? → Use RLAIF with a judge model. Calibrate against a small human-labeled set.

Q: No supervision at all? → Use intrinsic motivation or self-play for skill pre-training, then fine-tune with task reward later.
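The flowchart above can be written as a short function. One assumption is made explicit here: the annotation-cost question is treated as a follow-up to "can humans compare outcomes?", since RLAIF replaces human comparisons. The function and parameter names are hypothetical.

```python
def choose_reward_method(
    can_write_reward: bool,
    can_photograph_success: bool,
    humans_can_compare: bool,
    annotation_affordable: bool,
) -> str:
    """Walk the decision flowchart from top to bottom and return a method name."""
    if can_write_reward:
        return "hand-designed reward"
    if can_photograph_success:
        return "goal classifier with adversarial updates"
    if humans_can_compare:
        if annotation_affordable:
            return "preference learning (RLHF)"
        return "RLAIF with a judge model, calibrated on a small human-labeled set"
    # No supervision available: pre-train skills, fine-tune on a task reward later.
    return "intrinsic motivation or self-play, then fine-tune with task reward"


atari = choose_reward_method(True, False, False, False)
towel_folding = choose_reward_method(False, True, True, True)
dialog_on_budget = choose_reward_method(False, False, True, False)
```

The order of the checks matters: cheaper forms of supervision are tried first, so a task that admits both a goal photo and human comparisons is routed to the goal classifier.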


🎨 Design Challenge: Design a reward system for autonomous driving

Design a complete reward specification approach for training a self-driving car. Consider: What makes a "good" drive? How would you collect training signal? What exploits might the agent find?

Safety requirement: Zero collisions in 1M miles
Comfort: Max jerk < 2 m/s³
Annotation budget: $50k
Timeline: 6 months

Chapter 12

Connections & Beyond

Reward learning doesn't exist in isolation. It connects to every part of the RL and AI alignment ecosystem:

🔗 Cross-Domain Bridge

Reward Learning ↔ Inverse RL

IRL infers a reward function from optimal demonstrations. Preference learning infers it from comparative judgments. Both solve the same core problem (what reward was the human optimizing?) but with different supervision signals.

IRL Input

Expert trajectories τ*. Assumes demos are near-optimal under unknown r*.

Preference Input

Pairwise comparisons. No optimality assumption — just "A is better than B."

When would you choose IRL over preference learning? (Hint: think about demonstration availability vs. comparison feasibility.)

🔗 Cross-Domain Bridge

RLHF ↔ Direct Preference Optimization (DPO)

DPO (Rafailov et al., 2023) shows you can skip the explicit reward model entirely. The optimal policy for the KL-regularized RLHF objective has a closed-form relationship to the reward, so the Bradley-Terry preference loss can be rewritten in terms of the policy itself. This allows direct fine-tuning from preferences without the RL step.

RLHF

Train r_φ, then PPO against r_φ. Two optimization stages, reward model as intermediary.

DPO

Directly optimize the policy with: L = −log σ( β log[π(y_w | x) / π_ref(y_w | x)] − β log[π(y_l | x) / π_ref(y_l | x)] ), averaged over preference pairs (x, y_w, y_l).
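The DPO loss can be computed directly from sequence log-probabilities under the policy and the reference model. The sketch below assumes those log-probabilities are already available as NumPy arrays (one entry per preference pair); `dpo_loss` and the numbers are illustrative, and β = 0.1 is an arbitrary choice.

```python
import numpy as np


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1) -> float:
    """Mean DPO loss over a batch of (winner, loser) pairs.

    logp_*     : log pi(y | x) under the policy being trained
    ref_logp_* : log pi_ref(y | x) under the frozen reference policy
    """
    # beta * [log-ratio of the winner minus log-ratio of the loser]
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log sigmoid(z) equals log(1 + exp(-z)); logaddexp keeps it numerically stable.
    return float(np.mean(np.logaddexp(0.0, -logits)))


ref_w = np.array([-12.0, -8.0])
ref_l = np.array([-11.0, -9.0])

# Policy identical to the reference: every logit is 0, so the loss is log 2.
loss_at_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Policy that raised the winners' and lowered the losers' log-probability.
loss_improved = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)
```

The loss starts at log 2 ≈ 0.693 when the policy equals the reference and falls as the policy shifts probability mass from the rejected response to the preferred one.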

DPO is simpler, yet RLHF with an explicit reward model remains widely used at frontier scale. Why? (Hint: think about data reuse and online learning.)

🔗 Cross-Domain Bridge

Reward Learning ↔ Offline RL

Offline RL trains policies from fixed datasets without environment interaction. Learned reward models can retrospectively label offline datasets, enabling preference-guided offline RL — no human reward function needed for the original data collection.

Problem

You have 1M robot trajectories but no reward labels. Labeling all is infeasible.

Solution

Label 1% with preferences, train reward model, label remaining 99% automatically.
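The label-1%-then-relabel recipe can be sketched on synthetic data. Everything here is an assumption made for illustration: each trajectory is summarized by a 4-dimensional feature vector, the annotator is simulated by a hidden linear return, and the reward model is a linear Bradley-Terry model fit by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic offline dataset: one feature vector per trajectory.
n_traj, dim = 10_000, 4
feats = rng.normal(size=(n_traj, dim))
true_w = np.array([1.0, -2.0, 0.5, 0.0])
true_return = feats @ true_w  # hidden; used only to simulate the annotator

# Label 1%: sample preference pairs among 100 randomly chosen trajectories.
labeled = rng.choice(n_traj, size=n_traj // 100, replace=False)
i = rng.choice(labeled, size=500)
j = rng.choice(labeled, size=500)
keep = i != j
i, j = i[keep], j[keep]
prefers_i = true_return[i] > true_return[j]

# Winner-minus-loser feature differences for the Bradley-Terry loss.
diff = np.where(prefers_i[:, None], feats[i] - feats[j], feats[j] - feats[i])

# Fit the reward weights by minimizing the mean of -log sigmoid(diff @ w).
w = np.zeros(dim)
for _ in range(300):
    margin = diff @ w
    grad = -(diff / (1.0 + np.exp(margin))[:, None]).mean(axis=0)
    w -= 0.5 * grad

# Relabel every trajectory, including the 99% no annotator ever saw.
predicted_return = feats @ w
corr = float(np.corrcoef(predicted_return, true_return)[0, 1])
```

The correlation `corr` between predicted and hidden returns indicates how well the learned reward generalizes to the unlabeled trajectories. With real data the annotator is noisy and the reward is not linear, so the fit would be weaker than in this idealized setup.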

What's Next

Active research frontiers in reward learning:

Process reward models — Reward each reasoning STEP, not just the final answer. Enables credit assignment in chain-of-thought.

Scalable oversight — How do you provide feedback on tasks too complex for any single human to evaluate? Decomposition, debate, and recursive reward modeling.

Reward model interpretability — What features does the reward model actually use? Can we audit it for biases before deploying?

Multi-objective RLHF — Instead of one scalar reward, learn separate models for helpfulness, harmlessness, honesty, and let users set the tradeoff at inference time.

Closing Thought

The reward function IS the specification of the problem. Getting it wrong doesn't just make your agent suboptimal — it makes your agent optimize for something you never intended. As AI systems become more capable, the gap between "what we specified" and "what we meant" becomes the most important safety problem in the field. Reward learning is not just a convenience — it's a necessity for building systems that do what we actually want.
