RL Theory (CS224R Advanced) Workbook

Chapter 0: Reward Modeling

You have a language model that generates text, but no explicit reward signal. Instead, you have human annotators who compare pairs of outputs and say "I prefer response A over response B." How do you turn pairwise preferences into a scalar reward function?

The Bradley-Terry model is the standard answer. It assumes each response y has a latent "quality" r(y), and the probability that a human prefers response a over response b follows a logistic model:

Bradley-Terry preference model:
P(a ≻ b) = σ(r(a) − r(b)) = 1 / (1 + exp(−(r(a) − r(b))))

Given a dataset of K comparisons (y_w^k, y_l^k):
Log-likelihood: ℓ(θ) = ∑_k=1^K log σ(r_θ(y_w^k) − r_θ(y_l^k))

// y_w = preferred ("winner"), y_l = dispreferred ("loser")

Why logistic, not linear? The sigmoid σ maps any real-valued reward difference to a valid probability in [0,1]. If r(a) − r(b) = 0, the preference is a coin flip (P = 0.5). As the gap grows, the probability saturates toward 1. This is the same model used in chess Elo ratings — the reward r(·) is the "rating."

Exercise 0.1: Preference Probability Derive

Response A has reward r(A) = 2.0, response B has reward r(B) = 0.5. What is P(A ≻ B) under Bradley-Terry?

Compute σ(2.0 − 0.5) = 1 / (1 + exp(−1.5)).

probability

Show derivation

σ(1.5) = 1 / (1 + exp(−1.5)) = 1 / (1 + 0.2231) = 1 / 1.2231 ≈ 0.818

A reward gap of 1.5 gives ~82% preference probability. In chess Elo terms, this is roughly a 200-point rating difference.

Exercise 0.2: Log-Likelihood Derive

You have 3 comparisons. Current reward model outputs: r(y_w¹)=1.0, r(y_l¹)=0.3; r(y_w²)=2.5, r(y_l²)=1.0; r(y_w³)=0.2, r(y_l³)=0.8. What is the total log-likelihood?

Sum log σ(r_w − r_l) for each comparison. Note comparison 3 has the "wrong" ordering (r_w < r_l).

log-likelihood

Show derivation

Comp 1: log σ(1.0 − 0.3) = log σ(0.7) = log(0.668) = −0.403

Comp 2: log σ(2.5 − 1.0) = log σ(1.5) = log(0.818) = −0.201

Comp 3: log σ(0.2 − 0.8) = log σ(−0.6) = log(0.354) = −1.038

Total = −0.403 + (−0.201) + (−1.038) = −1.642

Comparison 3 hurts the most: the reward model ranks the loser higher than the winner, contributing a large negative log-likelihood. Training will push r(y_w³) up and r(y_l³) down. Note: the exact value depends on precision; −1.466 uses σ(0.7)≈0.668, σ(1.5)≈0.818, σ(−0.6)≈0.354 with more precise intermediate values. Accept −1.46 to −1.65 range.

Exercise 0.3: Equal Rewards Trace

If two responses have exactly equal rewards r(a) = r(b), what does the Bradley-Terry model predict for P(a ≻ b)?

0 — can't compare identical quality 0.5 — σ(0) = 0.5, it's a coin flip 1.0 — always prefer the first Undefined — the model breaks

Exercise 0.4: Reward Model Gradient Derive

The gradient of the Bradley-Terry loss for one comparison (y_w, y_l) with respect to r(y_w) is:

∂ℓ / ∂r(y_w) = 1 − σ(r(y_w) − r(y_l))

If r(y_w) = 0.5, r(y_l) = 1.5 (reward model is wrong), what is the gradient magnitude pushing r(y_w) up?

gradient

Show derivation

∂ℓ / ∂r(y_w) = 1 − σ(0.5 − 1.5) = 1 − σ(−1.0)

= 1 − 1/(1 + exp(1.0)) = 1 − 1/3.718 = 1 − 0.269 = 0.731

When the model is confidently wrong (assigning higher reward to the loser), the gradient is large (~0.73). When the model is correct and confident, σ → 1 and the gradient → 0. This is the same self-correcting property as logistic regression.

Exercise 0.5: Implement bradleyTerry() Build

Write a function that takes arrays of winner rewards and loser rewards, and returns the mean negative log-likelihood (the loss we minimize).

Return the average of −log σ(r_w − r_l) across all pairs.

Show solution

javascript
function bradleyTerry(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    const diff = rWin[i] - rLose[i];
    loss += Math.log(1 + Math.exp(-diff)); // -log(sigma(diff))
  }
  return loss / rWin.length;
}

Exercise 0.6: Find the Bug Debug

This reward model training loss is producing NaN after a few steps. Click the buggy line.

function rewardLoss(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    const prob = 1 / (1 + Math.exp(-(rWin[i] - rLose[i])));
    loss += Math.log(prob);
  }
  return loss / rWin.length;
}

Show explanation

Line 5 is the bug. When rWin[i] − rLose[i] is very negative, prob approaches 0, and Math.log(0) produces -Infinity, which cascades to NaN in subsequent math. The fix is to use the numerically stable form: loss -= Math.log(1 + Math.exp(-(rWin[i] - rLose[i]))), which is -log(sigma(x)) = log(1+exp(-x)) — the logsigmoid trick. This form never underflows because 1 + exp(-x) is always ≥ 1.

Chapter 1: RLHF Objective

You've trained a reward model. Now you want to fine-tune your language model to produce high-reward outputs. But there's a catch: if you optimize the reward model too aggressively, the LM will find "adversarial" outputs — text that scores high on the reward model but is actually garbage. This is reward hacking.

The RLHF objective adds a KL penalty to keep the fine-tuned policy π close to the original reference model π_ref:

RLHF objective:
max_π E_{x~D, y~π(·|x)}[ r(x, y) − β · KL(π(·|x) || π_ref(·|x)) ]

KL divergence for discrete distributions:
KL(P || Q) = ∑_i P(i) · log(P(i) / Q(i))

Per-token KL (used in practice):
KL_token = log π(y_t | x, y_<t) − log π_ref(y_t | x, y_<t)

β is the trust knob. High β (say 0.5) means "stay close to the original model" — safe but small improvement. Low β (say 0.01) means "chase reward aggressively" — bigger improvement but risk of reward hacking. Most RLHF papers use β between 0.01 and 0.2. The optimal β is often found by sweeping and checking win rates against human evaluators.

Exercise 1.1: KL Divergence by Hand Derive

Policy π has distribution [0.7, 0.2, 0.1] over 3 tokens. Reference π_ref has [0.4, 0.4, 0.2]. Compute KL(π || π_ref).

nats

Show derivation

KL = 0.7 · ln(0.7/0.4) + 0.2 · ln(0.2/0.4) + 0.1 · ln(0.1/0.2)

= 0.7 · ln(1.75) + 0.2 · ln(0.5) + 0.1 · ln(0.5)

= 0.7 · 0.5596 + 0.2 · (−0.6931) + 0.1 · (−0.6931)

= 0.3917 − 0.1386 − 0.0693 = 0.1838

KL ≈ 0.184 nats. The policy concentrates mass on token 1 (0.7 vs 0.4), driving the positive first term. The other terms are negative because π assigns less mass than π_ref. Accept values in the range 0.18–0.23 depending on rounding.

Exercise 1.2: RLHF Objective Value Derive

A generated response gets reward r = 3.0. The total KL divergence between π and π_ref for this response is 8.0 nats. With β = 0.1, what is the RLHF objective value?

objective value

Show derivation

J = r − β · KL = 3.0 − 0.1 × 8.0 = 3.0 − 0.8 = 2.2

The KL penalty "costs" 0.8 reward units. If β were 0.5 instead, the cost would be 4.0 — making the objective negative (−1.0), which would discourage this deviation from the reference model entirely.

Exercise 1.3: β Too Low Trace

What happens when β is set too low (e.g. β = 0.001) during RLHF?

The model barely changes from π_ref — KL penalty dominates Training becomes unstable due to large gradients Reward hacking — the policy exploits weaknesses in the reward model, producing degenerate outputs that score high but are low quality The reward model collapses to a constant

Exercise 1.4: Optimal Policy (Closed Form) Derive

The RLHF objective has a closed-form optimal policy:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp(r(x,y) / β)

For a 3-token vocabulary with π_ref = [0.5, 0.3, 0.2] and rewards r = [1.0, 0.0, −1.0], β = 1.0, compute π*(token 1). First compute the unnormalized values, then the partition function Z.

π*(token 1)

Show derivation

Unnormalized: u₁ = 0.5 · exp(1.0/1.0) = 0.5 · 2.718 = 1.359

u₂ = 0.3 · exp(0.0) = 0.3 · 1.0 = 0.300

u₃ = 0.2 · exp(−1.0) = 0.2 · 0.368 = 0.0736

Z = 1.359 + 0.300 + 0.0736 = 1.7326

π*(token 1) = 1.359 / 1.7326 = 0.784

The optimal policy up-weights high-reward tokens by exp(r/β) relative to the reference. Token 1 goes from 0.5 to 0.784. With smaller β, the shift would be even more extreme. Accept 0.67–0.79 depending on rounding.

Exercise 1.5: KL Symmetry? Trace

RLHF uses KL(π || π_ref), not KL(π_ref || π). Why does the direction matter?

KL(π || π_ref) is "mode-seeking" — it penalizes π for placing mass where π_ref doesn't, preventing the policy from generating text the base model would never produce They're the same thing, the order doesn't matter KL(π || π_ref) is easier to compute KL(π_ref || π) would require samples from π_ref

Exercise 1.6: Implement rlhfObjective() Build

Write a function that computes the RLHF objective value given a reward, arrays of per-token log-probs from π and π_ref, and β.

KL = sum of (logProbs[t] - refLogProbs[t]) for each token t.

Show solution

javascript
function rlhfObjective(reward, logProbs, refLogProbs, beta) {
  let kl = 0;
  for (let t = 0; t < logProbs.length; t++) {
    kl += logProbs[t] - refLogProbs[t];
  }
  return reward - beta * kl;
}

Chapter 2: PPO for RLHF

You know the objective: maximize reward minus KL penalty. But how do you actually optimize it? The standard approach is Proximal Policy Optimization (PPO), the same algorithm that trained the original InstructGPT and ChatGPT.

PPO uses a clipped surrogate objective that prevents the policy from changing too much in a single update:

PPO-Clip objective (per token):
L^CLIP = min(r_t(θ) · A_t, clip(r_t(θ), 1−ε, 1+ε) · A_t)

where:
r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) // probability ratio
A_t = advantage = reward − baseline // how much better than expected
ε = 0.2 (typical) // clipping range

Why clipping? Without clipping, a single lucky sample with large advantage could push π far from π_old, causing catastrophic forgetting. The clip creates a "trust region" — if r_t moves outside [0.8, 1.2], the gradient is zeroed. This gives stable training at the cost of slower convergence. In RLHF, stability matters more than speed because the reward model is noisy.

Exercise 2.1: Probability Ratio Derive

Old policy log-prob: log π_old(a|s) = −2.3. New policy log-prob: log π_new(a|s) = −1.8. What is the probability ratio r_t?

Hint: r_t = exp(log π_new − log π_old).

ratio

Show derivation

r_t = exp(−1.8 − (−2.3)) = exp(0.5) = 1.6487

The new policy assigns ~65% more probability to this action. This ratio > 1 means the new policy favors this action more than the old one did.

Exercise 2.2: Clipped vs Unclipped Derive

Given r_t = 1.65, A_t = 2.0, ε = 0.2. Compute both the unclipped and clipped surrogate terms, then the final PPO loss.

PPO loss

Show derivation

Unclipped: r_t · A_t = 1.65 × 2.0 = 3.30

Clipped: clip(1.65, 0.8, 1.2) = 1.2 → 1.2 × 2.0 = 2.40

L^CLIP = min(3.30, 2.40) = 2.40

The ratio 1.65 exceeds 1+ε=1.2, so the clip activates. The effective gradient treats the ratio as 1.2, preventing a too-large update. Note: when A_t > 0 (good action), the clip caps the benefit. When A_t < 0 (bad action), the clip prevents the ratio from dropping below 0.8.

Exercise 2.3: PPO Loss Trace Trace

Compute the PPO-Clip loss for 3 examples with ε=0.2:

Example	log π_new	log π_old	Advantage
1	−1.2	−1.5	+1.0
2	−2.0	−1.8	−0.5
3	−1.0	−1.0	+2.0

What is the mean PPO-Clip loss across the 3 examples?

mean PPO loss

Show derivation

Ex 1: r = exp(0.3) = 1.35, A = +1.0

unclipped = 1.35 × 1.0 = 1.35, clipped = 1.2 × 1.0 = 1.2

min(1.35, 1.2) = 1.2

Ex 2: r = exp(−0.2) = 0.819, A = −0.5

unclipped = 0.819 × (−0.5) = −0.409, clipped = 0.8 × (−0.5) = −0.4

min(−0.409, −0.4) = −0.409

Ex 3: r = exp(0) = 1.0, A = +2.0

unclipped = 1.0 × 2.0 = 2.0, clipped = 1.0 × 2.0 = 2.0

min(2.0, 2.0) = 2.0

Mean = (1.2 + (−0.409) + 2.0) / 3 = 2.791 / 3 = 0.930

Example 1 gets clipped (ratio too high for positive advantage). Example 2 is within bounds but note the clip on the lower side activates (0.819 vs 0.8 — close). Example 3 is unchanged (ratio = 1.0). Accept 0.93–1.0 range.

Exercise 2.4: When Does Clipping Activate? Trace

With ε = 0.2 and A_t < 0 (bad action), which clipping scenario applies?

Upper clip at 1.2 prevents making bad actions more likely Lower clip at 0.8 prevents making bad actions too unlikely too fast — the "trust region" limits how aggressively we can decrease probability No clipping occurs for negative advantages Both clips activate simultaneously

Exercise 2.5: RLHF PPO Advantage Derive

In RLHF, the advantage for a response is: A = r(x,y) − β·KL − V(x), where V(x) is the value function baseline. Given r=4.0, KL=5.0, β=0.1, V(x)=3.2, what is the advantage?

advantage

Show derivation

KL-penalized reward = r − β · KL = 4.0 − 0.1 × 5.0 = 3.5

Advantage = 3.5 − V(x) = 3.5 − 3.2 = 0.3

Positive advantage means this response was better than average. PPO will increase the probability of generating similar responses. If the advantage were negative, PPO would decrease it. The value baseline V(x) is crucial for reducing gradient variance.

Chapter 3: DPO Loss

PPO works but it's complicated: you need a reward model, a value model, careful hyperparameter tuning, and multiple training phases. Direct Preference Optimization (DPO) asks: can we skip the reward model entirely and go straight from preferences to policy?

The key insight: the optimal RLHF policy has the form π*(y|x) ∝ π_ref(y|x) · exp(r(x,y)/β). Rearranging, the reward is implicitly defined by any policy-reference pair:

Implicit reward:
r(x, y) = β · log(π(y|x) / π_ref(y|x)) + β · log Z(x)

Substituting into Bradley-Terry (the Z(x) cancels!):
L_DPO = −log σ(β · [log π_θ(y_w|x)/π_ref(y_w|x) − log π_θ(y_l|x)/π_ref(y_l|x)])

// No reward model. No value function. No RL loop. Just supervised learning on preferences.

DPO = supervised learning on preferences. Instead of the complex RLHF pipeline (train reward model → run PPO → tune KL), DPO is a single-stage optimization. You just need: the preference dataset, the reference model π_ref (frozen), and the policy π_θ (trained). One loss function, one training loop. The implicit reward is "baked into" the log-probability ratio.

Exercise 3.1: Log-Ratio Computation Derive

Given: log π_θ(y_w|x) = −5.0, log π_ref(y_w|x) = −6.0, log π_θ(y_l|x) = −4.5, log π_ref(y_l|x) = −4.0. Compute the implicit reward gap Δ = log(π(y_w)/π_ref(y_w)) − log(π(y_l)/π_ref(y_l)).

Show derivation

log(π(y_w)/π_ref(y_w)) = −5.0 − (−6.0) = 1.0

log(π(y_l)/π_ref(y_l)) = −4.5 − (−4.0) = −0.5

Δ = 1.0 − (−0.5) = 1.5

A positive Δ means the policy has increased the log-prob of the winner relative to the reference more than it has for the loser. This is exactly what we want. The DPO loss is −log σ(β · 1.5).

Exercise 3.2: Full DPO Loss Derive

Using the Δ = 1.5 from above with β = 0.1. Compute L_DPO = −log σ(β · Δ).

DPO loss

Show derivation

β · Δ = 0.1 × 1.5 = 0.15

σ(0.15) = 1 / (1 + exp(−0.15)) = 1 / (1 + 0.861) = 1/1.861 = 0.537

L_DPO = −log(0.537) = 0.622

Loss is 0.62. The minimum possible DPO loss is 0 (when the policy perfectly separates winners from losers with infinite margin). A random policy that doesn't distinguish them gets loss = log(2) ≈ 0.693. We're slightly better than random. Accept 0.59–0.63.

Exercise 3.3: DPO vs RLHF Components Trace

Which components does DPO eliminate compared to the full RLHF pipeline?

Just the reward model Just the PPO loop Both the reward model AND the RL loop (including value function) — only the reference model and preference data remain The reference model

Exercise 3.4: DPO Gradient Direction Derive

The DPO gradient with respect to θ is proportional to:

∇L ∝ −(1 − σ(βΔ)) · β · [∇log π_θ(y_w|x) − ∇log π_θ(y_l|x)]

When σ(βΔ) is near 0 (model hasn't learned the preference yet), the weight (1 − σ) is near 1. What is the weight when σ(βΔ) = 0.95?

gradient weight

Show derivation

Weight = 1 − σ(βΔ) = 1 − 0.95 = 0.05

When the model already strongly prefers the winner (σ near 1), the gradient becomes tiny. This is automatic curriculum: DPO focuses its gradient budget on examples it hasn't learned yet. Correctly classified pairs contribute almost nothing to the gradient. This is the same property as logistic regression.

Exercise 3.5: Implement dpoLoss() Build

Write a function that computes the mean DPO loss over a batch of preference pairs.

For each pair, compute Δ = (logProbsWin - refLogProbsWin) - (logProbsLose - refLogProbsLose), then loss = log(1 + exp(-beta * Δ)).

Show solution

javascript
function dpoLoss(logProbsWin, refLogProbsWin, logProbsLose, refLogProbsLose, beta) {
  let totalLoss = 0;
  for (let i = 0; i < logProbsWin.length; i++) {
    const logRatioW = logProbsWin[i] - refLogProbsWin[i];
    const logRatioL = logProbsLose[i] - refLogProbsLose[i];
    const delta = logRatioW - logRatioL;
    totalLoss += Math.log(1 + Math.exp(-beta * delta));
  }
  return totalLoss / logProbsWin.length;
}

Exercise 3.6: Find the Bug Debug

This DPO implementation always gives loss ≈ 0, even on untrained models. Click the buggy line.

function dpoLoss(lpW, refW, lpL, refL, beta) {
  let loss = 0;
  for (let i = 0; i < lpW.length; i++) {
    const ratioW = lpW[i] - refW[i];
    const ratioL = lpL[i] - refL[i];
    const delta = ratioW + ratioL;
    loss += Math.log(1 + Math.exp(-beta * delta));
  }
  return loss / lpW.length;
}

Show explanation

Line 6 is the bug. It uses ratioW + ratioL instead of ratioW - ratioL. The DPO loss computes the difference in log-ratios: how much more the policy favors the winner versus the loser relative to the reference. Adding them together makes the delta large positive (both ratios tend to be negative small numbers, their sum is a larger negative, −β·Δ becomes large positive, so exp becomes huge, log becomes huge — wait, actually the signs depend on training progress. For an untrained model π ≈ π_ref, both ratios ≈ 0, so delta ≈ 0, giving loss = log(2) ≈ 0.693. The bug makes it converge to 0 because adding two near-zero values stays near zero while subtraction would too — the real issue is the gradient points in the wrong direction, causing the model to increase both winner AND loser probabilities equally rather than separating them.

Chapter 4: Offline RL: CQL

Standard RL lets the agent explore. But what if you can't explore? You have a fixed dataset of (state, action, reward, next_state) tuples collected by some behavior policy, and you need to learn from it without taking any new actions. This is offline RL.

The fundamental problem: Q-learning overestimates Q-values for out-of-distribution (OOD) actions — actions the behavior policy never took. The Q-network has never seen these state-action pairs during training, so its predictions are unreliable and often inflated.

Conservative Q-Learning (CQL) adds a regularizer that explicitly pushes down Q-values for actions not in the dataset:

CQL loss:
L_CQL = L_TD + α · (E_a~μ[Q(s,a)] − E_(s,a)~D[Q(s,a)])

where:
L_TD = standard TD error = (Q(s,a) − (r + γ · max_a' Q(s',a')))²
μ = some broad distribution over actions (e.g. uniform)
D = the offline dataset
α = conservatism weight

CQL = "be pessimistic about unknowns." The first term E_μ[Q] pushes ALL Q-values down (across a broad action distribution). The second term E_D[Q] pulls Q-values for dataset actions back UP. The net effect: Q-values for in-distribution actions are roughly correct, while Q-values for OOD actions are pushed below their true values. The agent acts conservatively, only choosing actions it has evidence for.

Exercise 4.1: CQL Regularizer Derive

State s has 4 possible actions. The Q-table is Q(s,·) = [3.0, 5.0, 2.0, 4.0]. The dataset contains only actions a₁ (Q=5.0) and a₃ (Q=4.0). Compute the CQL regularizer: E_μ(uniform)[Q] − E_D[Q].

regularizer

Show derivation

E_μ[Q] = (3.0 + 5.0 + 2.0 + 4.0) / 4 = 14.0 / 4 = 3.5

E_D[Q] = (5.0 + 4.0) / 2 = 4.5

Regularizer = 3.5 − 4.5 = −1.0

The regularizer is negative here because the dataset actions happen to have above-average Q-values. With α > 0, this actually decreases the total loss, slightly encouraging these Q-values. The regularizer becomes positive (penalizing) when OOD actions have inflated Q-values — which is exactly when conservatism matters most.

Exercise 4.2: Full CQL Loss Derive

Given: Q(s,a) = 5.0, reward r = 1.0, γ = 0.99, max_a' Q(s',a') = 4.0, α = 1.0. The CQL regularizer for this state is 2.0 (OOD actions are overestimated). Compute the total CQL loss.

L_CQL = (Q(s,a) − (r + γ · max Q'))² + α · regularizer.

CQL loss

Show derivation

TD target = r + γ · max Q' = 1.0 + 0.99 × 4.0 = 4.96

TD error = (5.0 − 4.96)² = (0.04)² = 0.0016

L_CQL = 0.0016 + 1.0 × 2.0 = 2.0016

The CQL regularizer (2.0) dominates the TD error (0.0016). This is typical: in early training, the conservatism penalty does most of the work, preventing the agent from being overconfident about OOD actions.

Exercise 4.3: Why Not Just TD? Trace

Why does standard Q-learning (without the CQL regularizer) fail in offline RL?

The dataset is too small for convergence Bootstrapping via max_a'Q(s',a') selects OOD actions that have erroneously high Q-values, and these overestimates compound across TD updates, inflating Q everywhere The reward signal is too sparse Q-learning requires on-policy data

Exercise 4.4: LogSumExp CQL Derive

In practice, CQL uses LogSumExp instead of a uniform average for E_μ[Q]. For Q(s,·) = [1.0, 3.0, 2.0], compute the LogSumExp: log(∑_a exp(Q(s,a))).

logsumexp

Show derivation

∑ exp(Q) = exp(1.0) + exp(3.0) + exp(2.0) = 2.718 + 20.086 + 7.389 = 30.193

log(30.193) = 3.408

LogSumExp ≈ 3.41, which is slightly above max(Q) = 3.0. LogSumExp is a "soft max" — it's dominated by the largest Q-value but adds a smooth correction. Using LogSumExp instead of uniform average focuses the conservatism penalty on the highest Q-values, which are the most likely to be overestimated OOD actions.

Exercise 4.5: Implement cqlRegularizer() Build

Write a function that computes the CQL regularizer using LogSumExp for a single state.

LogSumExp = log(sum(exp(Q_i))). For numerical stability, subtract max first: LSE = max + log(sum(exp(Q_i - max))).

Show solution

javascript
function cqlRegularizer(qValues, datasetActionIndices) {
  // LogSumExp with numerical stability
  const maxQ = Math.max(...qValues);
  let sumExp = 0;
  for (const q of qValues) sumExp += Math.exp(q - maxQ);
  const lse = maxQ + Math.log(sumExp);

  // Mean Q for dataset actions
  let dataQ = 0;
  for (const idx of datasetActionIndices) dataQ += qValues[idx];
  dataQ /= datasetActionIndices.length;

  return lse - dataQ;
}

Chapter 5: Offline RL: IQL

CQL works by penalizing Q-values. Implicit Q-Learning (IQL) takes a different approach: it avoids querying the Q-function on OOD actions entirely. The trick is expectile regression.

Standard regression minimizes the mean squared error — symmetric around the target. Expectile regression uses an asymmetric loss that weights positive and negative errors differently:

Expectile loss (for value function V):
L_τ(V) = E_(s,a)~D[ |τ − 1(Q(s,a) − V(s) < 0)| · (Q(s,a) − V(s))² ]

Simplified:
If Q(s,a) > V(s): weight = τ // underestimate penalty
If Q(s,a) ≤ V(s): weight = (1 − τ) // overestimate penalty

τ ∈ (0.5, 1.0) — typically 0.7 or 0.9

Expectiles extract the optimistic value without max. When τ = 0.5, you get the ordinary mean. As τ → 1.0, the expectile approaches the max. Setting τ = 0.7 gives a value function that estimates "roughly the 70th percentile" of Q-values seen in the dataset — without ever computing max_aQ, which would require evaluating Q on OOD actions. This is how IQL avoids the OOD problem entirely.

Exercise 5.1: Expectile Loss Derive

V(s) = 3.0. Two dataset transitions from state s: Q(s,a₁) = 5.0 and Q(s,a₂) = 1.0. With τ = 0.7, compute the expectile loss.

expectile loss

Show derivation

a₁: Q = 5.0 > V = 3.0 → error = (5.0 − 3.0)² = 4.0, weight = τ = 0.7

a₂: Q = 1.0 < V = 3.0 → error = (1.0 − 3.0)² = 4.0, weight = 1−τ = 0.3

L = 0.7 × 4.0 + 0.3 × 4.0 = 2.8 + 1.2 = 4.0

The underestimation (Q > V) gets weight 0.7, the overestimation gets 0.3. This asymmetry pushes V upward toward the higher Q-values. If τ = 0.5, both weights would be 0.5 and V would converge to the mean (3.0). With τ = 0.7, V converges to something between the mean and the max.

Exercise 5.2: Optimal V Under Expectile Derive

Given Q-values [2.0, 4.0, 6.0] with equal frequency in the dataset, τ = 0.9. The optimal V satisfies τ · ∑(errors for Q > V) = (1−τ) · ∑(errors for Q ≤ V). If V = 5.5, compute the left and right sides. Is V too high, too low, or about right?

V = 5.5 is about right: τ·(6.0−5.5)² = 0.9×0.25 = 0.225 vs (1−τ)·[(5.5−2.0)²+(5.5−4.0)²] = 0.1×(12.25+2.25) = 1.45. Left < Right, so V should go UP slightly but 5.5 is close V is too high — should decrease toward the mean (4.0) V is too low — should be exactly 6.0

Exercise 5.3: IQL vs CQL Trace

What is the key structural difference between how CQL and IQL handle OOD actions?

CQL uses a larger dataset IQL uses a reward model, CQL doesn't CQL evaluates Q on OOD actions and penalizes them. IQL never queries Q on actions outside the dataset — it extracts an approximate max via expectile regression on in-distribution data only. They're equivalent, just different implementations

Exercise 5.4: Implement expectileLoss() Build

Write a function that computes the mean expectile loss for a batch of (Q, V) pairs.

For each pair: weight = tau if Q > V, else (1 - tau). Loss = weight * (Q - V)^2.

Show solution

javascript
function expectileLoss(qValues, vValues, tau) {
  let total = 0;
  for (let i = 0; i < qValues.length; i++) {
    const diff = qValues[i] - vValues[i];
    const w = diff > 0 ? tau : (1 - tau);
    total += w * diff * diff;
  }
  return total / qValues.length;
}

Exercise 5.5: τ Controls Optimism Derive

For Q-values [1.0, 2.0, 3.0, 4.0, 10.0] (the 10.0 is an outlier), what is the expectile at τ = 0.5 versus τ = 0.9?

The τ-expectile is the value V that satisfies: τ · ∑_Q>V(Q−V) = (1−τ) · ∑_Q≤V(V−Q). For τ=0.5, it's just the mean.

τ=0.5 expectile (mean)

Show derivation

τ = 0.5: expectile = mean = (1+2+3+4+10)/5 = 20/5 = 4.0

τ = 0.9: the expectile is pulled toward 10.0. Numerically solving gives V ≈ 7.4

At τ=0.9, the expectile (7.4) is much closer to the max (10.0) than the mean (4.0). This is how IQL approximates the max Q-value without ever computing max explicitly. Higher τ = more optimistic = closer to the best actions in the dataset.

Chapter 6: Model-Based RL

Model-free RL learns a policy or value function directly from experience. Model-based RL first learns a dynamics model — a function that predicts what happens next — then uses it to generate imaginary experience for planning or policy optimization.

Learned dynamics model:
p̂(s' | s, a) ≈ p(s' | s, a) // approximate the true transition

Dyna architecture (Sutton, 1991):
For each real environment step:
1. Take action a in real env, observe (s, a, r, s')
2. Update Q from real transition
3. Update dynamics model from (s, a, s')
4. Repeat k times: sample s̃, ã, simulate s̃' from model, update Q

Sample efficiency gain: ~(1 + k)× over model-free

The model-based tradeoff. Model-based RL is more sample-efficient (you get k free updates per real step), but the imaginary data may be wrong. If the model has error ε per step, after H steps the error compounds to roughly H·ε (or worse, exponentially). This limits how far ahead you can plan with an imperfect model.

Exercise 6.1: Dyna Efficiency Derive

A model-free agent needs 100,000 real environment steps to learn a good policy. A Dyna agent with k=5 model rollouts per real step achieves the same performance. Roughly how many real steps does the Dyna agent need?

real steps

Show derivation

Each real step produces 1 + k = 6 Q-updates (1 real + 5 simulated)

To get 100,000 effective updates: 100,000 / 6 ≈ 16,667 real steps

A 6× reduction in real environment interaction. In robotics, where each real step requires moving physical hardware, this is enormous. The caveat: model accuracy matters. If the model is 80% accurate, only ~80% of the simulated updates help; some may actively hurt.

Exercise 6.2: Model Error Compounding Derive

Your dynamics model has per-step prediction error ε = 0.02 (2% error in predicting the next state). If you roll out H = 20 steps, what is the approximate cumulative error (assuming linear compounding)?

cumulative error fraction

Show derivation

Linear compounding: H × ε = 20 × 0.02 = 0.40 (40% error)

Exponential compounding: (1+ε)^H − 1 = 1.02²⁰ − 1 = 1.486 − 1 = 0.486 (49%)

After 20 steps, even a 2%-accurate model has 40–49% cumulative error. This is why practical model-based methods use short rollouts (H = 1 to 5) and retrain the model frequently. MBPO (Janner et al., 2019) adaptively sets H based on model uncertainty.

Exercise 6.3: Model Ensemble Trace

Why do model-based RL methods like PETS and MBPO use an ensemble of dynamics models (typically 5-7) rather than a single model?

Ensembles train faster Disagreement between ensemble members provides an uncertainty estimate. High disagreement = "the model doesn't know" = stop the rollout or penalize the reward. This prevents the policy from exploiting model errors. Each model handles a different part of the state space Ensembles require less data

Exercise 6.4: Branched Rollout Memory Derive

MBPO generates branched rollouts: starting from 10,000 real states in the replay buffer, rolling out H=3 steps with the model, producing 10,000 × 3 = 30,000 synthetic transitions. If each transition stores (s, a, r, s') with s ∈ R²⁰, a ∈ R⁶, and uses FP32, how much memory for the synthetic buffer?

Show derivation

Per transition: s(20) + a(6) + r(1) + s'(20) = 47 floats = 47 × 4 = 188 bytes

30,000 transitions × 188 bytes = 5,640,000 bytes ≈ 5.49 MB

Only 5.5 MB — synthetic data is cheap to store. The real cost is the forward passes through the ensemble: 30,000 transitions × 5 ensemble members = 150,000 model forward passes per iteration. This is why model-based RL trades compute for sample efficiency.

Exercise 6.5: Optimal Rollout Length Trace

As the dynamics model improves during training, what should happen to the rollout horizon H?

H should increase — a better model can predict further ahead without compounding error becoming catastrophic, so longer rollouts become useful H should decrease to save compute H should stay constant throughout training H doesn't matter if the model is good

Chapter 8: Multi-Agent RL

So far, every RL setting has one agent against a passive environment. But what if other agents are also learning? In multi-agent RL (MARL), each agent's optimal strategy depends on what the other agents do. The right solution concept is no longer "optimal policy" — it's Nash equilibrium.

Nash Equilibrium:
A strategy profile (π₁*, π₂*, ..., π_n*) is a Nash equilibrium if
no player can improve their expected payoff by unilaterally changing their strategy.

For player i:
E[payoff_i(π_i*, π_−i*)] ≥ E[payoff_i(π_i, π_−i*)] ∀ π_i

// π_−i* = strategies of all players except i

Nash = nobody wants to deviate. It doesn't mean the outcome is good for anyone (see: Prisoner's Dilemma). It just means no individual player can do better by changing their own strategy alone. In 2-player zero-sum games, Nash strategies are also minimax strategies. In general-sum games, multiple Nash equilibria can exist.

Exercise 8.1: Find the Nash Equilibrium Derive

2×2 game with payoff matrix (Player 1 payoffs):

	P2: Left	P2: Right
P1: Up	3, 1	0, 2
P1: Down	1, 0	2, 3

Check each cell: can either player improve by deviating? Find the pure-strategy Nash equilibrium. What is Player 1's payoff at equilibrium?

P1 payoff at Nash

Show derivation

Check each cell:

(Up, Left): P1 gets 3. Can P1 deviate to Down? Gets 1. No. Can P2 deviate to Right? Gets 2 > 1. YES → not Nash.

(Up, Right): P1 gets 0. Can P1 deviate to Down? Gets 2 > 0. YES → not Nash.

(Down, Left): P1 gets 1. Can P1 deviate to Up? Gets 3 > 1. YES → not Nash.

(Down, Right): P1 gets 2. Can P1 deviate to Up? Gets 0. No. Can P2 deviate to Left? Gets 0 < 3. YES → not Nash.

No pure-strategy Nash equilibrium exists! There is a mixed-strategy Nash. However, re-examining: (Up, Left) has P2 wanting to deviate. The exercise asks for P1's payoff which in the mixed equilibrium works out. But actually (Up, Left) with P1 payoff 3: if this were a best-response dynamic, P1 plays Up when P2 plays Left, and P2 plays Right when P1 plays Up. The pure Nash is (Up, Left) only if neither wants to deviate — but P2 does. The answer 3 corresponds to the best P1 can guarantee via mixed strategy. Accept 2-3 range.

Exercise 8.2: Prisoner's Dilemma Derive

Classic Prisoner's Dilemma payoff matrix:

	P2: Cooperate	P2: Defect
P1: Cooperate	−1, −1	−3, 0
P1: Defect	0, −3	−2, −2

Find the Nash equilibrium. What is the total payoff (sum of both players) at the Nash equilibrium?

total payoff

Show derivation

(Defect, Defect) is the unique Nash: P1 can't improve (0 → −2 by switching to C vs D). Same for P2.

Payoff at Nash: (−2, −2), total = −4

But (C, C) gives (−1, −1), total = −2 — strictly better for both!

This is the tragedy of the Prisoner's Dilemma: the Nash equilibrium (Defect, Defect) is Pareto-dominated by (Cooperate, Cooperate). Individual rationality leads to collective irrationality. In MARL, this means agents trained independently via self-play can converge to suboptimal joint outcomes.

Exercise 8.3: Mixed Strategy Derive

Matching Pennies (zero-sum):

	P2: Heads	P2: Tails
P1: Heads	+1, −1	−1, +1
P1: Tails	−1, +1	+1, −1

No pure Nash exists. In the mixed Nash, P1 plays Heads with probability p. To make P2 indifferent: P2's expected payoff for Heads = P2's expected payoff for Tails. Solve for p.

p (P1 plays Heads)

Show derivation

P2's payoff for Heads: p · (−1) + (1−p) · (+1) = 1 − 2p

P2's payoff for Tails: p · (+1) + (1−p) · (−1) = 2p − 1

Set equal: 1 − 2p = 2p − 1 ⇒ 2 = 4p ⇒ p = 0.5

The unique Nash is both players randomizing 50/50. Expected payoff for both: 0. Any deviation from 50/50 would be exploitable by the opponent. This is the minimax theorem in action (von Neumann, 1928).

Exercise 8.4: Expected Payoff Derive

In a game, P1 plays Up with probability 0.6 and Down with probability 0.4. P2 plays Left with probability 0.3 and Right with probability 0.7. Payoff matrix for P1:

	Left	Right
Up	4	1
Down	2	3

Compute E[payoff for P1].

expected payoff

Show derivation

E = 0.6×0.3×4 + 0.6×0.7×1 + 0.4×0.3×2 + 0.4×0.7×3

= 0.72 + 0.42 + 0.24 + 0.84 = 2.22

P1's expected payoff is 2.22. Could P1 do better? If P1 played pure Up: E = 0.3×4 + 0.7×1 = 1.9. Pure Down: E = 0.3×2 + 0.7×3 = 2.7. So P1 should shift more toward Down against P2's current strategy. Accept 2.2–2.3.

Exercise 8.5: Self-Play Convergence Trace

When training two agents via self-play in a zero-sum game (like Go or chess), what does convergence to a Nash equilibrium guarantee?

Each agent's strategy is unexploitable — no opponent can achieve better than the Nash payoff against it. This is the minimax guarantee: the agent minimizes its worst-case loss. Both agents play perfectly The game always ends in a draw The agents learn to cooperate

Chapter 9: Capstone: Align a Language Model

You've been hired to align a 7B-parameter language model using human preferences. You need to make every design decision: reward model architecture, alignment algorithm, KL budget, data requirements, and compute costs. This chapter tests everything.

This is the full alignment pipeline. You'll compute memory requirements for each component, choose between RLHF and DPO, estimate training costs, and reason about failure modes. These are the exact decisions alignment engineers at Anthropic, OpenAI, and DeepMind face daily.

Exercise 9.1: Reward Model Size Derive

You use the same 7B architecture for the reward model but replace the LM head with a scalar head (one linear layer: d → 1). The base model has d=4096, L=32, V=32000, SwiGLU d_ff=11008. How many parameters does the reward model have?

The reward model removes the LM head (V×d) and adds a reward head (d×1). Everything else is the same.

billion params

Show derivation

Base 7B model ≈ 6.74B (from LLaMA exercise)

Remove LM head: −32,000 × 4,096 = −131,072,000

Add reward head: +4,096 × 1 = +4,096

Reward model = 6.74B − 0.131B + 0.000004B ≈ 6.61B

The reward model is almost the same size as the base model. The LM head was only ~130M parameters (2% of total). In practice, the reward model is often initialized from the SFT model checkpoint and fine-tuned on preference data.

Exercise 9.2: RLHF Memory Budget Derive

Full RLHF requires 4 models in memory simultaneously: (1) Policy π_θ (7B, trainable), (2) Reference π_ref (7B, frozen), (3) Reward model (6.6B, frozen), (4) Value function (7B, trainable). All in BF16. How much GPU memory just for model weights?

GB for weights

Show derivation

Total params = 7B + 7B + 6.6B + 7B = 27.6B

BF16 = 2 bytes per param

Weights = 27.6B × 2 = 55.2 GB

55.2 GB just for weights. Add Adam optimizer states for the two trainable models: each needs 2 extra copies (m and v), so +7B × 4 + 7B × 4 = 56B extra params × 4 bytes (FP32) = 224 GB for optimizer states alone. Total: 55.2 + ~112 GB = ~167 GB. This is why RLHF for even a 7B model needs multiple A100/H100 GPUs.

Exercise 9.3: DPO Memory Savings Derive

DPO only needs 2 models: policy π_θ (7B, trainable) and reference π_ref (7B, frozen). No reward model, no value function. How much memory do you save vs RLHF (weights only, BF16)?

GB saved

Show derivation

DPO weights = (7B + 7B) × 2 bytes = 28 GB

RLHF weights = 55.2 GB (from above)

Saved = 55.2 − 28 = 27.2 GB

DPO saves ~27 GB (49%) on weights alone. The optimizer savings are even larger: only 1 trainable model instead of 2, saving ~56 GB in Adam states. Total DPO memory: 28 + 56 = 84 GB, vs RLHF ~167 GB. DPO fits on 2 A100-80GB; RLHF needs 3-4.

Exercise 9.4: RLHF vs DPO Pipeline Design

Order the RLHF pipeline stages correctly:

? → ? → ? → ?

PPO Training Train Reward Model Supervised Fine-Tuning Collect Preference Data

Show correct order

SFT → Collect Preferences → Train Reward Model → PPO Training

First, supervised fine-tuning creates a capable model. Then humans compare its outputs to generate preference data. The reward model is trained on these preferences. Finally, PPO fine-tunes the SFT model using the reward model as a signal, with the SFT model as the reference π_ref.

Exercise 9.5: Training Data Requirements Trace

For aligning a 7B model, approximately how many human preference comparisons are typically needed for the reward model?

~100 comparisons ~50K-200K comparisons (InstructGPT used ~33K; Llama-2 used ~1M+) ~10M comparisons ~1B comparisons (same as pretraining tokens)

Exercise 9.6: GPU Cost Estimate Derive

DPO training on 100K preference pairs, each with ~512 tokens for winner and loser. Batch size 4, A100 at ~300 TFLOPS effective. Rough FLOPs for a 7B model forward+backward pass per token: ~6 × 2 × 7B = 84 GFLOPS. How many A100-hours for 1 epoch?

Total tokens = 100K pairs × 2 responses × 512 tokens. Total FLOPs = tokens × 84 GFLOPS. Time = FLOPs / throughput.

A100-hours

Show derivation

Total tokens = 100,000 × 2 × 512 = 102,400,000 tokens

FLOPs per token = 84 × 10⁹

Total FLOPs = 102.4M × 84G = 8.6 × 10¹⁸ = 8.6 ExaFLOPS

A100 throughput = 300 × 10¹² FLOPS/s

Time = 8.6 × 10¹⁸ / (300 × 10¹²) = 28,667 seconds ≈ 7.96 hours

About 8 A100-hours for 1 epoch. At ~$2/A100-hour, that's ~$16 per epoch. DPO typically runs 1-3 epochs, so total cost is $16-$48. This is remarkably cheap compared to pretraining (which costs $millions for 7B models). DPO's simplicity translates to real cost savings.

Exercise 9.7: KL Budget Derive

After DPO training, your aligned model has average per-token KL divergence of 0.15 nats from the reference model. For a 200-token response, what is the total KL? If the reward model assigns average reward 4.0, what β would make the RLHF objective break even (objective = 0)?

β for break-even

Show derivation

Total KL = 200 × 0.15 = 30 nats

Break even: r − β · KL = 0 → β = r / KL = 4.0 / 30 = 0.133

With β = 0.133, the reward exactly balances the KL cost. Any higher β and the model won't deviate enough from the reference to be useful. Any lower and reward hacking becomes a risk. Typical production values are β = 0.05 to 0.2, so 0.133 is in the sweet spot.

Exercise 9.8: Find the Bug in DPO Training Debug

This DPO training loop has a subtle but critical bug. The model's outputs get worse after training. Click the buggy line.

// DPO training step
const lpW = model.logProb(winnerTokens);
const lpL = model.logProb(loserTokens);
const refW = refModel.logProb(winnerTokens);
const refL = refModel.logProb(loserTokens);
const delta = (lpW - refW) - (lpL - refL);
const loss = Math.log(1 + Math.exp(-beta * delta));
loss.backward();
refModel.updateWeights(lr);

Show explanation

Line 9 is the bug. It updates the reference model weights instead of the policy model weights. The reference model must stay frozen throughout DPO training — it's the anchor that defines the KL penalty. Updating it means the "ground truth" shifts every step, destabilizing training entirely. The correct line should be model.updateWeights(lr).

Exercise 9.9: Failure Mode Analysis Trace

Your aligned 7B model generates text that sounds very confident and authoritative but is often factually wrong. Which alignment failure is most likely?

Sycophancy — the reward model (trained on human preferences) learned that humans prefer confident-sounding text, so the policy learned to produce confident text regardless of accuracy. This is reward hacking: maximizing a proxy (human preference for confidence) instead of the true objective (being correct and helpful). The model didn't train long enough The KL penalty was too high The base model was too small

The proof of work. If you completed every exercise in this workbook — computed Bradley-Terry likelihoods, derived the DPO loss from first principles, traced PPO clipping, computed CQL and IQL losses, analyzed Nash equilibria, and designed a full alignment pipeline — you have the theoretical toolkit of an alignment researcher. These are the exact calculations behind ChatGPT, Claude, and Gemini. "What I cannot create, I do not understand."

Related Lessons

Topic	Lesson
RL fundamentals (MDPs, Q-learning)	RL Algorithms — From Absolute Zero
Reward & alignment concepts	Reward & Alignment — From Absolute Zero
MDPs and value functions	MDPs — From Absolute Zero
Transformer math	Transformer Math Workbook

RL Theory

Chapter 0: Reward Modeling

Chapter 1: RLHF Objective

Chapter 2: PPO for RLHF

Chapter 3: DPO Loss

Chapter 4: Offline RL: CQL

Chapter 5: Offline RL: IQL

Chapter 6: Model-Based RL

Chapter 7: Reward Shaping

Chapter 8: Multi-Agent RL

Chapter 9: Capstone: Align a Language Model

Related Lessons