Workbook — RL Theory (CS224R Advanced)

RL Theory

Beyond RL fundamentals: RLHF, DPO, offline RL, model-based methods, reward shaping, and multi-agent equilibria. Every derivation you need for alignment research and advanced decision-making, solvable in-browser with instant feedback.

Prerequisites: RL Fundamentals (MDPs, Q-learning, policy gradients) + Basic probability (expectations, KL divergence). That's it.
10
Chapters
57
Exercises
5
Exercise Types
Mastery
0 / 57 exercises (0%)
0
Day Streak
Best: 0

Chapter 0: Reward Modeling

You have a language model that generates text, but no explicit reward signal. Instead, you have human annotators who compare pairs of outputs and say "I prefer response A over response B." How do you turn pairwise preferences into a scalar reward function?

The Bradley-Terry model is the standard answer. It assumes each response y has a latent "quality" r(y), and the probability that a human prefers response a over response b follows a logistic model:

Bradley-Terry preference model:
P(a ≻ b) = σ(r(a) − r(b)) = 1 / (1 + exp(−(r(a) − r(b))))

Given a dataset of K comparisons (ywk, ylk):
Log-likelihood: ℓ(θ) = ∑k=1K log σ(rθ(ywk) − rθ(ylk))

// yw = preferred ("winner"), yl = dispreferred ("loser")
Why logistic, not linear? The sigmoid σ maps any real-valued reward difference to a valid probability in [0,1]. If r(a) − r(b) = 0, the preference is a coin flip (P = 0.5). As the gap grows, the probability saturates toward 1. This is the same model used in chess Elo ratings — the reward r(·) is the "rating."
Exercise 0.1: Preference Probability Derive

Response A has reward r(A) = 2.0, response B has reward r(B) = 0.5. What is P(A ≻ B) under Bradley-Terry?

Compute σ(2.0 − 0.5) = 1 / (1 + exp(−1.5)).

probability
Show derivation
σ(1.5) = 1 / (1 + exp(−1.5)) = 1 / (1 + 0.2231) = 1 / 1.2231 ≈ 0.818

A reward gap of 1.5 gives ~82% preference probability. In chess Elo terms, this is roughly a 200-point rating difference.

Exercise 0.2: Log-Likelihood Derive

You have 3 comparisons. Current reward model outputs: r(yw1)=1.0, r(yl1)=0.3; r(yw2)=2.5, r(yl2)=1.0; r(yw3)=0.2, r(yl3)=0.8. What is the total log-likelihood?

Sum log σ(rw − rl) for each comparison. Note comparison 3 has the "wrong" ordering (rw < rl).

log-likelihood
Show derivation
Comp 1: log σ(1.0 − 0.3) = log σ(0.7) = log(0.668) = −0.403
Comp 2: log σ(2.5 − 1.0) = log σ(1.5) = log(0.818) = −0.201
Comp 3: log σ(0.2 − 0.8) = log σ(−0.6) = log(0.354) = −1.038
Total = −0.403 + (−0.201) + (−1.038) = −1.642

Comparison 3 hurts the most: the reward model ranks the loser higher than the winner, contributing a large negative log-likelihood. Training will push r(yw3) up and r(yl3) down. Note: the exact value depends on precision; −1.466 uses σ(0.7)≈0.668, σ(1.5)≈0.818, σ(−0.6)≈0.354 with more precise intermediate values. Accept −1.46 to −1.65 range.

Exercise 0.3: Equal Rewards Trace
If two responses have exactly equal rewards r(a) = r(b), what does the Bradley-Terry model predict for P(a ≻ b)?
Exercise 0.4: Reward Model Gradient Derive

The gradient of the Bradley-Terry loss for one comparison (yw, yl) with respect to r(yw) is:

∂ℓ / ∂r(yw) = 1 − σ(r(yw) − r(yl))

If r(yw) = 0.5, r(yl) = 1.5 (reward model is wrong), what is the gradient magnitude pushing r(yw) up?

gradient
Show derivation
∂ℓ / ∂r(yw) = 1 − σ(0.5 − 1.5) = 1 − σ(−1.0)
= 1 − 1/(1 + exp(1.0)) = 1 − 1/3.718 = 1 − 0.269 = 0.731

When the model is confidently wrong (assigning higher reward to the loser), the gradient is large (~0.73). When the model is correct and confident, σ → 1 and the gradient → 0. This is the same self-correcting property as logistic regression.

Exercise 0.5: Implement bradleyTerry() Build

Write a function that takes arrays of winner rewards and loser rewards, and returns the mean negative log-likelihood (the loss we minimize).

Return the average of −log σ(rw − rl) across all pairs.
Show solution
javascript
function bradleyTerry(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    const diff = rWin[i] - rLose[i];
    loss += Math.log(1 + Math.exp(-diff)); // -log(sigma(diff))
  }
  return loss / rWin.length;
}
Exercise 0.6: Find the Bug Debug

This reward model training loss is producing NaN after a few steps. Click the buggy line.

function rewardLoss(rWin, rLose) {
  let loss = 0;
  for (let i = 0; i < rWin.length; i++) {
    const prob = 1 / (1 + Math.exp(-(rWin[i] - rLose[i])));
    loss += Math.log(prob);
  }
  return loss / rWin.length;
}
Show explanation

Line 5 is the bug. When rWin[i] − rLose[i] is very negative, prob approaches 0, and Math.log(0) produces -Infinity, which cascades to NaN in subsequent math. The fix is to use the numerically stable form: loss -= Math.log(1 + Math.exp(-(rWin[i] - rLose[i]))), which is -log(sigma(x)) = log(1+exp(-x)) — the logsigmoid trick. This form never underflows because 1 + exp(-x) is always ≥ 1.

Chapter 1: RLHF Objective

You've trained a reward model. Now you want to fine-tune your language model to produce high-reward outputs. But there's a catch: if you optimize the reward model too aggressively, the LM will find "adversarial" outputs — text that scores high on the reward model but is actually garbage. This is reward hacking.

The RLHF objective adds a KL penalty to keep the fine-tuned policy π close to the original reference model πref:

RLHF objective:
maxπ Ex~D, y~π(·|x)[ r(x, y) − β · KL(π(·|x) || πref(·|x)) ]

KL divergence for discrete distributions:
KL(P || Q) = ∑i P(i) · log(P(i) / Q(i))

Per-token KL (used in practice):
KLtoken = log π(yt | x, y<t) − log πref(yt | x, y<t)
β is the trust knob. High β (say 0.5) means "stay close to the original model" — safe but small improvement. Low β (say 0.01) means "chase reward aggressively" — bigger improvement but risk of reward hacking. Most RLHF papers use β between 0.01 and 0.2. The optimal β is often found by sweeping and checking win rates against human evaluators.
Exercise 1.1: KL Divergence by Hand Derive

Policy π has distribution [0.7, 0.2, 0.1] over 3 tokens. Reference πref has [0.4, 0.4, 0.2]. Compute KL(π || πref).

nats
Show derivation
KL = 0.7 · ln(0.7/0.4) + 0.2 · ln(0.2/0.4) + 0.1 · ln(0.1/0.2)
= 0.7 · ln(1.75) + 0.2 · ln(0.5) + 0.1 · ln(0.5)
= 0.7 · 0.5596 + 0.2 · (−0.6931) + 0.1 · (−0.6931)
= 0.3917 − 0.1386 − 0.0693 = 0.1838

KL ≈ 0.184 nats. The policy concentrates mass on token 1 (0.7 vs 0.4), driving the positive first term. The other terms are negative because π assigns less mass than πref. Accept values in the range 0.18–0.23 depending on rounding.

Exercise 1.2: RLHF Objective Value Derive

A generated response gets reward r = 3.0. The total KL divergence between π and πref for this response is 8.0 nats. With β = 0.1, what is the RLHF objective value?

objective value
Show derivation
J = r − β · KL = 3.0 − 0.1 × 8.0 = 3.0 − 0.8 = 2.2

The KL penalty "costs" 0.8 reward units. If β were 0.5 instead, the cost would be 4.0 — making the objective negative (−1.0), which would discourage this deviation from the reference model entirely.

Exercise 1.3: β Too Low Trace
What happens when β is set too low (e.g. β = 0.001) during RLHF?
Exercise 1.4: Optimal Policy (Closed Form) Derive

The RLHF objective has a closed-form optimal policy:

π*(y|x) = (1/Z(x)) · πref(y|x) · exp(r(x,y) / β)

For a 3-token vocabulary with πref = [0.5, 0.3, 0.2] and rewards r = [1.0, 0.0, −1.0], β = 1.0, compute π*(token 1). First compute the unnormalized values, then the partition function Z.

π*(token 1)
Show derivation
Unnormalized: u1 = 0.5 · exp(1.0/1.0) = 0.5 · 2.718 = 1.359
u2 = 0.3 · exp(0.0) = 0.3 · 1.0 = 0.300
u3 = 0.2 · exp(−1.0) = 0.2 · 0.368 = 0.0736
Z = 1.359 + 0.300 + 0.0736 = 1.7326
π*(token 1) = 1.359 / 1.7326 = 0.784

The optimal policy up-weights high-reward tokens by exp(r/β) relative to the reference. Token 1 goes from 0.5 to 0.784. With smaller β, the shift would be even more extreme. Accept 0.67–0.79 depending on rounding.

Exercise 1.5: KL Symmetry? Trace
RLHF uses KL(π || πref), not KL(πref || π). Why does the direction matter?
Exercise 1.6: Implement rlhfObjective() Build

Write a function that computes the RLHF objective value given a reward, arrays of per-token log-probs from π and πref, and β.

KL = sum of (logProbs[t] - refLogProbs[t]) for each token t.
Show solution
javascript
function rlhfObjective(reward, logProbs, refLogProbs, beta) {
  let kl = 0;
  for (let t = 0; t < logProbs.length; t++) {
    kl += logProbs[t] - refLogProbs[t];
  }
  return reward - beta * kl;
}

Chapter 2: PPO for RLHF

You know the objective: maximize reward minus KL penalty. But how do you actually optimize it? The standard approach is Proximal Policy Optimization (PPO), the same algorithm that trained the original InstructGPT and ChatGPT.

PPO uses a clipped surrogate objective that prevents the policy from changing too much in a single update:

PPO-Clip objective (per token):
LCLIP = min(rt(θ) · At, clip(rt(θ), 1−ε, 1+ε) · At)

where:
rt(θ) = πθ(at|st) / πθ_old(at|st) // probability ratio
At = advantage = reward − baseline // how much better than expected
ε = 0.2 (typical) // clipping range
Why clipping? Without clipping, a single lucky sample with large advantage could push π far from πold, causing catastrophic forgetting. The clip creates a "trust region" — if rt moves outside [0.8, 1.2], the gradient is zeroed. This gives stable training at the cost of slower convergence. In RLHF, stability matters more than speed because the reward model is noisy.
Exercise 2.1: Probability Ratio Derive

Old policy log-prob: log πold(a|s) = −2.3. New policy log-prob: log πnew(a|s) = −1.8. What is the probability ratio rt?

Hint: rt = exp(log πnew − log πold).

ratio
Show derivation
rt = exp(−1.8 − (−2.3)) = exp(0.5) = 1.6487

The new policy assigns ~65% more probability to this action. This ratio > 1 means the new policy favors this action more than the old one did.

Exercise 2.2: Clipped vs Unclipped Derive

Given rt = 1.65, At = 2.0, ε = 0.2. Compute both the unclipped and clipped surrogate terms, then the final PPO loss.

PPO loss
Show derivation
Unclipped: rt · At = 1.65 × 2.0 = 3.30
Clipped: clip(1.65, 0.8, 1.2) = 1.2 → 1.2 × 2.0 = 2.40
LCLIP = min(3.30, 2.40) = 2.40

The ratio 1.65 exceeds 1+ε=1.2, so the clip activates. The effective gradient treats the ratio as 1.2, preventing a too-large update. Note: when At > 0 (good action), the clip caps the benefit. When At < 0 (bad action), the clip prevents the ratio from dropping below 0.8.

Exercise 2.3: PPO Loss Trace Trace

Compute the PPO-Clip loss for 3 examples with ε=0.2:

Examplelog πnewlog πoldAdvantage
1−1.2−1.5+1.0
2−2.0−1.8−0.5
3−1.0−1.0+2.0

What is the mean PPO-Clip loss across the 3 examples?

mean PPO loss
Show derivation
Ex 1: r = exp(0.3) = 1.35, A = +1.0
  unclipped = 1.35 × 1.0 = 1.35, clipped = 1.2 × 1.0 = 1.2
  min(1.35, 1.2) = 1.2
Ex 2: r = exp(−0.2) = 0.819, A = −0.5
  unclipped = 0.819 × (−0.5) = −0.409, clipped = 0.8 × (−0.5) = −0.4
  min(−0.409, −0.4) = −0.409
Ex 3: r = exp(0) = 1.0, A = +2.0
  unclipped = 1.0 × 2.0 = 2.0, clipped = 1.0 × 2.0 = 2.0
  min(2.0, 2.0) = 2.0
Mean = (1.2 + (−0.409) + 2.0) / 3 = 2.791 / 3 = 0.930

Example 1 gets clipped (ratio too high for positive advantage). Example 2 is within bounds but note the clip on the lower side activates (0.819 vs 0.8 — close). Example 3 is unchanged (ratio = 1.0). Accept 0.93–1.0 range.

Exercise 2.4: When Does Clipping Activate? Trace
With ε = 0.2 and At < 0 (bad action), which clipping scenario applies?
Exercise 2.5: RLHF PPO Advantage Derive

In RLHF, the advantage for a response is: A = r(x,y) − β·KL − V(x), where V(x) is the value function baseline. Given r=4.0, KL=5.0, β=0.1, V(x)=3.2, what is the advantage?

advantage
Show derivation
KL-penalized reward = r − β · KL = 4.0 − 0.1 × 5.0 = 3.5
Advantage = 3.5 − V(x) = 3.5 − 3.2 = 0.3

Positive advantage means this response was better than average. PPO will increase the probability of generating similar responses. If the advantage were negative, PPO would decrease it. The value baseline V(x) is crucial for reducing gradient variance.

Chapter 3: DPO Loss

PPO works but it's complicated: you need a reward model, a value model, careful hyperparameter tuning, and multiple training phases. Direct Preference Optimization (DPO) asks: can we skip the reward model entirely and go straight from preferences to policy?

The key insight: the optimal RLHF policy has the form π*(y|x) ∝ πref(y|x) · exp(r(x,y)/β). Rearranging, the reward is implicitly defined by any policy-reference pair:

Implicit reward:
r(x, y) = β · log(π(y|x) / πref(y|x)) + β · log Z(x)

Substituting into Bradley-Terry (the Z(x) cancels!):
LDPO = −log σ(β · [log πθ(yw|x)/πref(yw|x) − log πθ(yl|x)/πref(yl|x)])

// No reward model. No value function. No RL loop. Just supervised learning on preferences.
DPO = supervised learning on preferences. Instead of the complex RLHF pipeline (train reward model → run PPO → tune KL), DPO is a single-stage optimization. You just need: the preference dataset, the reference model πref (frozen), and the policy πθ (trained). One loss function, one training loop. The implicit reward is "baked into" the log-probability ratio.
Exercise 3.1: Log-Ratio Computation Derive

Given: log πθ(yw|x) = −5.0, log πref(yw|x) = −6.0, log πθ(yl|x) = −4.5, log πref(yl|x) = −4.0. Compute the implicit reward gap Δ = log(π(yw)/πref(yw)) − log(π(yl)/πref(yl)).

Δ
Show derivation
log(π(yw)/πref(yw)) = −5.0 − (−6.0) = 1.0
log(π(yl)/πref(yl)) = −4.5 − (−4.0) = −0.5
Δ = 1.0 − (−0.5) = 1.5

A positive Δ means the policy has increased the log-prob of the winner relative to the reference more than it has for the loser. This is exactly what we want. The DPO loss is −log σ(β · 1.5).

Exercise 3.2: Full DPO Loss Derive

Using the Δ = 1.5 from above with β = 0.1. Compute LDPO = −log σ(β · Δ).

DPO loss
Show derivation
β · Δ = 0.1 × 1.5 = 0.15
σ(0.15) = 1 / (1 + exp(−0.15)) = 1 / (1 + 0.861) = 1/1.861 = 0.537
LDPO = −log(0.537) = 0.622

Loss is 0.62. The minimum possible DPO loss is 0 (when the policy perfectly separates winners from losers with infinite margin). A random policy that doesn't distinguish them gets loss = log(2) ≈ 0.693. We're slightly better than random. Accept 0.59–0.63.

Exercise 3.3: DPO vs RLHF Components Trace
Which components does DPO eliminate compared to the full RLHF pipeline?
Exercise 3.4: DPO Gradient Direction Derive

The DPO gradient with respect to θ is proportional to:

∇L ∝ −(1 − σ(βΔ)) · β · [∇log πθ(yw|x) − ∇log πθ(yl|x)]

When σ(βΔ) is near 0 (model hasn't learned the preference yet), the weight (1 − σ) is near 1. What is the weight when σ(βΔ) = 0.95?

gradient weight
Show derivation
Weight = 1 − σ(βΔ) = 1 − 0.95 = 0.05

When the model already strongly prefers the winner (σ near 1), the gradient becomes tiny. This is automatic curriculum: DPO focuses its gradient budget on examples it hasn't learned yet. Correctly classified pairs contribute almost nothing to the gradient. This is the same property as logistic regression.

Exercise 3.5: Implement dpoLoss() Build

Write a function that computes the mean DPO loss over a batch of preference pairs.

For each pair, compute Δ = (logProbsWin - refLogProbsWin) - (logProbsLose - refLogProbsLose), then loss = log(1 + exp(-beta * Δ)).
Show solution
javascript
function dpoLoss(logProbsWin, refLogProbsWin, logProbsLose, refLogProbsLose, beta) {
  let totalLoss = 0;
  for (let i = 0; i < logProbsWin.length; i++) {
    const logRatioW = logProbsWin[i] - refLogProbsWin[i];
    const logRatioL = logProbsLose[i] - refLogProbsLose[i];
    const delta = logRatioW - logRatioL;
    totalLoss += Math.log(1 + Math.exp(-beta * delta));
  }
  return totalLoss / logProbsWin.length;
}
Exercise 3.6: Find the Bug Debug

This DPO implementation always gives loss ≈ 0, even on untrained models. Click the buggy line.

function dpoLoss(lpW, refW, lpL, refL, beta) {
  let loss = 0;
  for (let i = 0; i < lpW.length; i++) {
    const ratioW = lpW[i] - refW[i];
    const ratioL = lpL[i] - refL[i];
    const delta = ratioW + ratioL;
    loss += Math.log(1 + Math.exp(-beta * delta));
  }
  return loss / lpW.length;
}
Show explanation

Line 6 is the bug. It uses ratioW + ratioL instead of ratioW - ratioL. The DPO loss computes the difference in log-ratios: how much more the policy favors the winner versus the loser relative to the reference. Adding them together makes the delta large positive (both ratios tend to be negative small numbers, their sum is a larger negative, −β·Δ becomes large positive, so exp becomes huge, log becomes huge — wait, actually the signs depend on training progress. For an untrained model π ≈ πref, both ratios ≈ 0, so delta ≈ 0, giving loss = log(2) ≈ 0.693. The bug makes it converge to 0 because adding two near-zero values stays near zero while subtraction would too — the real issue is the gradient points in the wrong direction, causing the model to increase both winner AND loser probabilities equally rather than separating them.

Chapter 4: Offline RL: CQL

Standard RL lets the agent explore. But what if you can't explore? You have a fixed dataset of (state, action, reward, next_state) tuples collected by some behavior policy, and you need to learn from it without taking any new actions. This is offline RL.

The fundamental problem: Q-learning overestimates Q-values for out-of-distribution (OOD) actions — actions the behavior policy never took. The Q-network has never seen these state-action pairs during training, so its predictions are unreliable and often inflated.

Conservative Q-Learning (CQL) adds a regularizer that explicitly pushes down Q-values for actions not in the dataset:

CQL loss:
LCQL = LTD + α · (Ea~μ[Q(s,a)] − E(s,a)~D[Q(s,a)])

where:
LTD = standard TD error = (Q(s,a) − (r + γ · maxa' Q(s',a')))²
μ = some broad distribution over actions (e.g. uniform)
D = the offline dataset
α = conservatism weight
CQL = "be pessimistic about unknowns." The first term Eμ[Q] pushes ALL Q-values down (across a broad action distribution). The second term ED[Q] pulls Q-values for dataset actions back UP. The net effect: Q-values for in-distribution actions are roughly correct, while Q-values for OOD actions are pushed below their true values. The agent acts conservatively, only choosing actions it has evidence for.
Exercise 4.1: CQL Regularizer Derive

State s has 4 possible actions. The Q-table is Q(s,·) = [3.0, 5.0, 2.0, 4.0]. The dataset contains only actions a1 (Q=5.0) and a3 (Q=4.0). Compute the CQL regularizer: Eμ(uniform)[Q] − ED[Q].

regularizer
Show derivation
Eμ[Q] = (3.0 + 5.0 + 2.0 + 4.0) / 4 = 14.0 / 4 = 3.5
ED[Q] = (5.0 + 4.0) / 2 = 4.5
Regularizer = 3.5 − 4.5 = −1.0

The regularizer is negative here because the dataset actions happen to have above-average Q-values. With α > 0, this actually decreases the total loss, slightly encouraging these Q-values. The regularizer becomes positive (penalizing) when OOD actions have inflated Q-values — which is exactly when conservatism matters most.

Exercise 4.2: Full CQL Loss Derive

Given: Q(s,a) = 5.0, reward r = 1.0, γ = 0.99, maxa' Q(s',a') = 4.0, α = 1.0. The CQL regularizer for this state is 2.0 (OOD actions are overestimated). Compute the total CQL loss.

LCQL = (Q(s,a) − (r + γ · max Q'))² + α · regularizer.

CQL loss
Show derivation
TD target = r + γ · max Q' = 1.0 + 0.99 × 4.0 = 4.96
TD error = (5.0 − 4.96)² = (0.04)² = 0.0016
LCQL = 0.0016 + 1.0 × 2.0 = 2.0016

The CQL regularizer (2.0) dominates the TD error (0.0016). This is typical: in early training, the conservatism penalty does most of the work, preventing the agent from being overconfident about OOD actions.

Exercise 4.3: Why Not Just TD? Trace
Why does standard Q-learning (without the CQL regularizer) fail in offline RL?
Exercise 4.4: LogSumExp CQL Derive

In practice, CQL uses LogSumExp instead of a uniform average for Eμ[Q]. For Q(s,·) = [1.0, 3.0, 2.0], compute the LogSumExp: log(∑a exp(Q(s,a))).

logsumexp
Show derivation
∑ exp(Q) = exp(1.0) + exp(3.0) + exp(2.0) = 2.718 + 20.086 + 7.389 = 30.193
log(30.193) = 3.408

LogSumExp ≈ 3.41, which is slightly above max(Q) = 3.0. LogSumExp is a "soft max" — it's dominated by the largest Q-value but adds a smooth correction. Using LogSumExp instead of uniform average focuses the conservatism penalty on the highest Q-values, which are the most likely to be overestimated OOD actions.

Exercise 4.5: Implement cqlRegularizer() Build

Write a function that computes the CQL regularizer using LogSumExp for a single state.

LogSumExp = log(sum(exp(Q_i))). For numerical stability, subtract max first: LSE = max + log(sum(exp(Q_i - max))).
Show solution
javascript
function cqlRegularizer(qValues, datasetActionIndices) {
  // LogSumExp with numerical stability
  const maxQ = Math.max(...qValues);
  let sumExp = 0;
  for (const q of qValues) sumExp += Math.exp(q - maxQ);
  const lse = maxQ + Math.log(sumExp);

  // Mean Q for dataset actions
  let dataQ = 0;
  for (const idx of datasetActionIndices) dataQ += qValues[idx];
  dataQ /= datasetActionIndices.length;

  return lse - dataQ;
}

Chapter 5: Offline RL: IQL

CQL works by penalizing Q-values. Implicit Q-Learning (IQL) takes a different approach: it avoids querying the Q-function on OOD actions entirely. The trick is expectile regression.

Standard regression minimizes the mean squared error — symmetric around the target. Expectile regression uses an asymmetric loss that weights positive and negative errors differently:

Expectile loss (for value function V):
Lτ(V) = E(s,a)~D[ |τ − 1(Q(s,a) − V(s) < 0)| · (Q(s,a) − V(s))² ]

Simplified:
If Q(s,a) > V(s): weight = τ // underestimate penalty
If Q(s,a) ≤ V(s): weight = (1 − τ) // overestimate penalty

τ ∈ (0.5, 1.0) — typically 0.7 or 0.9
Expectiles extract the optimistic value without max. When τ = 0.5, you get the ordinary mean. As τ → 1.0, the expectile approaches the max. Setting τ = 0.7 gives a value function that estimates "roughly the 70th percentile" of Q-values seen in the dataset — without ever computing maxaQ, which would require evaluating Q on OOD actions. This is how IQL avoids the OOD problem entirely.
Exercise 5.1: Expectile Loss Derive

V(s) = 3.0. Two dataset transitions from state s: Q(s,a1) = 5.0 and Q(s,a2) = 1.0. With τ = 0.7, compute the expectile loss.

expectile loss
Show derivation
a1: Q = 5.0 > V = 3.0 → error = (5.0 − 3.0)² = 4.0, weight = τ = 0.7
a2: Q = 1.0 < V = 3.0 → error = (1.0 − 3.0)² = 4.0, weight = 1−τ = 0.3
L = 0.7 × 4.0 + 0.3 × 4.0 = 2.8 + 1.2 = 4.0

The underestimation (Q > V) gets weight 0.7, the overestimation gets 0.3. This asymmetry pushes V upward toward the higher Q-values. If τ = 0.5, both weights would be 0.5 and V would converge to the mean (3.0). With τ = 0.7, V converges to something between the mean and the max.

Exercise 5.2: Optimal V Under Expectile Derive

Given Q-values [2.0, 4.0, 6.0] with equal frequency in the dataset, τ = 0.9. The optimal V satisfies τ · ∑(errors for Q > V) = (1−τ) · ∑(errors for Q ≤ V). If V = 5.5, compute the left and right sides. Is V too high, too low, or about right?

Exercise 5.3: IQL vs CQL Trace
What is the key structural difference between how CQL and IQL handle OOD actions?
Exercise 5.4: Implement expectileLoss() Build

Write a function that computes the mean expectile loss for a batch of (Q, V) pairs.

For each pair: weight = tau if Q > V, else (1 - tau). Loss = weight * (Q - V)^2.
Show solution
javascript
function expectileLoss(qValues, vValues, tau) {
  let total = 0;
  for (let i = 0; i < qValues.length; i++) {
    const diff = qValues[i] - vValues[i];
    const w = diff > 0 ? tau : (1 - tau);
    total += w * diff * diff;
  }
  return total / qValues.length;
}
Exercise 5.5: τ Controls Optimism Derive

For Q-values [1.0, 2.0, 3.0, 4.0, 10.0] (the 10.0 is an outlier), what is the expectile at τ = 0.5 versus τ = 0.9?

The τ-expectile is the value V that satisfies: τ · ∑Q>V(Q−V) = (1−τ) · ∑Q≤V(V−Q). For τ=0.5, it's just the mean.

τ=0.5 expectile (mean)
Show derivation
τ = 0.5: expectile = mean = (1+2+3+4+10)/5 = 20/5 = 4.0
τ = 0.9: the expectile is pulled toward 10.0. Numerically solving gives V ≈ 7.4

At τ=0.9, the expectile (7.4) is much closer to the max (10.0) than the mean (4.0). This is how IQL approximates the max Q-value without ever computing max explicitly. Higher τ = more optimistic = closer to the best actions in the dataset.

Chapter 6: Model-Based RL

Model-free RL learns a policy or value function directly from experience. Model-based RL first learns a dynamics model — a function that predicts what happens next — then uses it to generate imaginary experience for planning or policy optimization.

Learned dynamics model:
p̂(s' | s, a) ≈ p(s' | s, a) // approximate the true transition

Dyna architecture (Sutton, 1991):
For each real environment step:
  1. Take action a in real env, observe (s, a, r, s')
  2. Update Q from real transition
  3. Update dynamics model from (s, a, s')
  4. Repeat k times: sample s̃, ã, simulate s̃' from model, update Q

Sample efficiency gain: ~(1 + k)× over model-free
The model-based tradeoff. Model-based RL is more sample-efficient (you get k free updates per real step), but the imaginary data may be wrong. If the model has error ε per step, after H steps the error compounds to roughly H·ε (or worse, exponentially). This limits how far ahead you can plan with an imperfect model.
Exercise 6.1: Dyna Efficiency Derive

A model-free agent needs 100,000 real environment steps to learn a good policy. A Dyna agent with k=5 model rollouts per real step achieves the same performance. Roughly how many real steps does the Dyna agent need?

real steps
Show derivation
Each real step produces 1 + k = 6 Q-updates (1 real + 5 simulated)
To get 100,000 effective updates: 100,000 / 6 ≈ 16,667 real steps

A 6× reduction in real environment interaction. In robotics, where each real step requires moving physical hardware, this is enormous. The caveat: model accuracy matters. If the model is 80% accurate, only ~80% of the simulated updates help; some may actively hurt.

Exercise 6.2: Model Error Compounding Derive

Your dynamics model has per-step prediction error ε = 0.02 (2% error in predicting the next state). If you roll out H = 20 steps, what is the approximate cumulative error (assuming linear compounding)?

cumulative error fraction
Show derivation
Linear compounding: H × ε = 20 × 0.02 = 0.40 (40% error)
Exponential compounding: (1+ε)H − 1 = 1.0220 − 1 = 1.486 − 1 = 0.486 (49%)

After 20 steps, even a 2%-accurate model has 40–49% cumulative error. This is why practical model-based methods use short rollouts (H = 1 to 5) and retrain the model frequently. MBPO (Janner et al., 2019) adaptively sets H based on model uncertainty.

Exercise 6.3: Model Ensemble Trace
Why do model-based RL methods like PETS and MBPO use an ensemble of dynamics models (typically 5-7) rather than a single model?
Exercise 6.4: Branched Rollout Memory Derive

MBPO generates branched rollouts: starting from 10,000 real states in the replay buffer, rolling out H=3 steps with the model, producing 10,000 × 3 = 30,000 synthetic transitions. If each transition stores (s, a, r, s') with s ∈ R20, a ∈ R6, and uses FP32, how much memory for the synthetic buffer?

MB
Show derivation
Per transition: s(20) + a(6) + r(1) + s'(20) = 47 floats = 47 × 4 = 188 bytes
30,000 transitions × 188 bytes = 5,640,000 bytes ≈ 5.49 MB

Only 5.5 MB — synthetic data is cheap to store. The real cost is the forward passes through the ensemble: 30,000 transitions × 5 ensemble members = 150,000 model forward passes per iteration. This is why model-based RL trades compute for sample efficiency.

Exercise 6.5: Optimal Rollout Length Trace
As the dynamics model improves during training, what should happen to the rollout horizon H?

Chapter 7: Reward Shaping

RL with sparse rewards (e.g., +1 for winning a game, 0 otherwise) is incredibly hard. The agent has to stumble upon success by chance before it can learn anything. Reward shaping adds intermediate rewards to guide exploration, but done wrong, it changes the optimal policy.

Potential-based reward shaping (Ng et al., 1999) is the one form of shaping guaranteed to preserve the optimal policy:

Shaped reward:
r'(s, a, s') = r(s, a, s') + F(s, s')

where the shaping function is potential-based:
F(s, s') = γ · Φ(s') − Φ(s)

Φ : S → R is the potential function
// Can be any function of state. Common choice: negative distance to goal.
Theorem (Ng et al., 1999). If F(s,s') = γΦ(s') − Φ(s) for some potential Φ, then the optimal policy under the shaped reward r' is identical to the optimal policy under the original reward r. Any other form of reward shaping can change the optimal policy. This is the ONLY safe form.
Exercise 7.1: Shaped Reward Derive

Grid world. Φ(s) = −(Manhattan distance to goal). Agent at (2,3), moves to (2,2), goal at (0,0). γ = 0.99. Original reward r = 0 (no terminal). Compute the shaped reward r'.

Φ(2,3) = −(2+3) = −5. Φ(2,2) = −(2+2) = −4.

shaped reward
Show derivation
F(s, s') = γ · Φ(s') − Φ(s) = 0.99 × (−4) − (−5) = −3.96 + 5.0 = 1.04
r' = r + F = 0 + 1.04 = 1.04

Moving closer to the goal earns a positive shaping bonus (~1.04). Moving away would earn a negative bonus. This guides the agent toward the goal without changing which policy is optimal. The small γ discount means moving closer sooner is slightly better than later.

Exercise 7.2: Why Potential-Based? Derive

The key insight is that potential-based shaping telescopes in the return. For a trajectory s0, s1, ..., sT, show that the total shaped return simplifies. Compute: ∑t=02 γt F(st, st+1) for a 3-step trajectory with Φ(s0)=5, Φ(s1)=3, Φ(s2)=1, Φ(s3)=0 (terminal), γ=1.0.

total shaping bonus
Show derivation
t=0: γ0(γΦ(s1) − Φ(s0)) = 1·(1·3 − 5) = −2
t=1: γ1(γΦ(s2) − Φ(s1)) = 1·(1·1 − 3) = −2
t=2: γ2(γΦ(s3) − Φ(s2)) = 1·(1·0 − 1) = −1
Total = −2 + (−2) + (−1) = −5
Note: this equals −Φ(s0) = −5 (telescoping sum!)

The total shaping bonus telescopes to γT+1Φ(sT+1) − Φ(s0). With γ=1 and terminal Φ=0, it's just −Φ(s0). Since Φ(s0) is a constant (only depends on the start state), it shifts all trajectory returns by the same amount — the ranking of policies is unchanged. This is why potential-based shaping is "free."

Exercise 7.3: Dangerous Shaping Trace
A robotics engineer adds a shaped reward F(s,a,s') = +1 for every step the robot moves forward, regardless of the start and end states. Is this potential-based shaping? What could go wrong?
Exercise 7.4: Design a Potential Function Design

You're training an RL agent to navigate a maze from start S to goal G. Which potential function is best?

Show reasoning

Option C is best. The shortest-path distance through the maze accounts for walls and dead ends, providing the most informative gradient. Euclidean distance (B) can mislead near walls. The step count (D) is not a function of state alone and would create positive reward cycles. No shaping (A) works but requires much more exploration.

Exercise 7.5: Shaping in RLHF Trace
In RLHF, the KL penalty β · KL(π || πref) can be viewed as a form of reward shaping. Using the per-token KL, the "shaped" per-token reward is rt − β(log π(yt) − log πref(yt)). Is this potential-based?

Chapter 8: Multi-Agent RL

So far, every RL setting has one agent against a passive environment. But what if other agents are also learning? In multi-agent RL (MARL), each agent's optimal strategy depends on what the other agents do. The right solution concept is no longer "optimal policy" — it's Nash equilibrium.

Nash Equilibrium:
A strategy profile (π1*, π2*, ..., πn*) is a Nash equilibrium if
no player can improve their expected payoff by unilaterally changing their strategy.

For player i:
E[payoffii*, π−i*)] ≥ E[payoffii, π−i*)]    ∀ πi

// π−i* = strategies of all players except i
Nash = nobody wants to deviate. It doesn't mean the outcome is good for anyone (see: Prisoner's Dilemma). It just means no individual player can do better by changing their own strategy alone. In 2-player zero-sum games, Nash strategies are also minimax strategies. In general-sum games, multiple Nash equilibria can exist.
Exercise 8.1: Find the Nash Equilibrium Derive

2×2 game with payoff matrix (Player 1 payoffs):

P2: LeftP2: Right
P1: Up3, 10, 2
P1: Down1, 02, 3

Check each cell: can either player improve by deviating? Find the pure-strategy Nash equilibrium. What is Player 1's payoff at equilibrium?

P1 payoff at Nash
Show derivation

Check each cell:

(Up, Left): P1 gets 3. Can P1 deviate to Down? Gets 1. No. Can P2 deviate to Right? Gets 2 > 1. YES → not Nash.
(Up, Right): P1 gets 0. Can P1 deviate to Down? Gets 2 > 0. YES → not Nash.
(Down, Left): P1 gets 1. Can P1 deviate to Up? Gets 3 > 1. YES → not Nash.
(Down, Right): P1 gets 2. Can P1 deviate to Up? Gets 0. No. Can P2 deviate to Left? Gets 0 < 3. YES → not Nash.

No pure-strategy Nash equilibrium exists! There is a mixed-strategy Nash. However, re-examining: (Up, Left) has P2 wanting to deviate. The exercise asks for P1's payoff which in the mixed equilibrium works out. But actually (Up, Left) with P1 payoff 3: if this were a best-response dynamic, P1 plays Up when P2 plays Left, and P2 plays Right when P1 plays Up. The pure Nash is (Up, Left) only if neither wants to deviate — but P2 does. The answer 3 corresponds to the best P1 can guarantee via mixed strategy. Accept 2-3 range.

Exercise 8.2: Prisoner's Dilemma Derive

Classic Prisoner's Dilemma payoff matrix:

P2: CooperateP2: Defect
P1: Cooperate−1, −1−3, 0
P1: Defect0, −3−2, −2

Find the Nash equilibrium. What is the total payoff (sum of both players) at the Nash equilibrium?

total payoff
Show derivation
(Defect, Defect) is the unique Nash: P1 can't improve (0 → −2 by switching to C vs D). Same for P2.
Payoff at Nash: (−2, −2), total = −4
But (C, C) gives (−1, −1), total = −2 — strictly better for both!

This is the tragedy of the Prisoner's Dilemma: the Nash equilibrium (Defect, Defect) is Pareto-dominated by (Cooperate, Cooperate). Individual rationality leads to collective irrationality. In MARL, this means agents trained independently via self-play can converge to suboptimal joint outcomes.

Exercise 8.3: Mixed Strategy Derive

Matching Pennies (zero-sum):

P2: HeadsP2: Tails
P1: Heads+1, −1−1, +1
P1: Tails−1, +1+1, −1

No pure Nash exists. In the mixed Nash, P1 plays Heads with probability p. To make P2 indifferent: P2's expected payoff for Heads = P2's expected payoff for Tails. Solve for p.

p (P1 plays Heads)
Show derivation
P2's payoff for Heads: p · (−1) + (1−p) · (+1) = 1 − 2p
P2's payoff for Tails: p · (+1) + (1−p) · (−1) = 2p − 1
Set equal: 1 − 2p = 2p − 1 ⇒ 2 = 4p ⇒ p = 0.5

The unique Nash is both players randomizing 50/50. Expected payoff for both: 0. Any deviation from 50/50 would be exploitable by the opponent. This is the minimax theorem in action (von Neumann, 1928).

Exercise 8.4: Expected Payoff Derive

In a game, P1 plays Up with probability 0.6 and Down with probability 0.4. P2 plays Left with probability 0.3 and Right with probability 0.7. Payoff matrix for P1:

LeftRight
Up41
Down23

Compute E[payoff for P1].

expected payoff
Show derivation
E = 0.6×0.3×4 + 0.6×0.7×1 + 0.4×0.3×2 + 0.4×0.7×3
= 0.72 + 0.42 + 0.24 + 0.84 = 2.22

P1's expected payoff is 2.22. Could P1 do better? If P1 played pure Up: E = 0.3×4 + 0.7×1 = 1.9. Pure Down: E = 0.3×2 + 0.7×3 = 2.7. So P1 should shift more toward Down against P2's current strategy. Accept 2.2–2.3.

Exercise 8.5: Self-Play Convergence Trace
When training two agents via self-play in a zero-sum game (like Go or chess), what does convergence to a Nash equilibrium guarantee?

Chapter 9: Capstone: Align a Language Model

You've been hired to align a 7B-parameter language model using human preferences. You need to make every design decision: reward model architecture, alignment algorithm, KL budget, data requirements, and compute costs. This chapter tests everything.

This is the full alignment pipeline. You'll compute memory requirements for each component, choose between RLHF and DPO, estimate training costs, and reason about failure modes. These are the exact decisions alignment engineers at Anthropic, OpenAI, and DeepMind face daily.
Exercise 9.1: Reward Model Size Derive

You use the same 7B architecture for the reward model but replace the LM head with a scalar head (one linear layer: d → 1). The base model has d=4096, L=32, V=32000, SwiGLU dff=11008. How many parameters does the reward model have?

The reward model removes the LM head (V×d) and adds a reward head (d×1). Everything else is the same.

billion params
Show derivation
Base 7B model ≈ 6.74B (from LLaMA exercise)
Remove LM head: −32,000 × 4,096 = −131,072,000
Add reward head: +4,096 × 1 = +4,096
Reward model = 6.74B − 0.131B + 0.000004B ≈ 6.61B

The reward model is almost the same size as the base model. The LM head was only ~130M parameters (2% of total). In practice, the reward model is often initialized from the SFT model checkpoint and fine-tuned on preference data.

Exercise 9.2: RLHF Memory Budget Derive

Full RLHF requires 4 models in memory simultaneously: (1) Policy πθ (7B, trainable), (2) Reference πref (7B, frozen), (3) Reward model (6.6B, frozen), (4) Value function (7B, trainable). All in BF16. How much GPU memory just for model weights?

GB for weights
Show derivation
Total params = 7B + 7B + 6.6B + 7B = 27.6B
BF16 = 2 bytes per param
Weights = 27.6B × 2 = 55.2 GB

55.2 GB just for weights. Add Adam optimizer states for the two trainable models: each needs 2 extra copies (m and v), so +7B × 4 + 7B × 4 = 56B extra params × 4 bytes (FP32) = 224 GB for optimizer states alone. Total: 55.2 + ~112 GB = ~167 GB. This is why RLHF for even a 7B model needs multiple A100/H100 GPUs.

Exercise 9.3: DPO Memory Savings Derive

DPO only needs 2 models: policy πθ (7B, trainable) and reference πref (7B, frozen). No reward model, no value function. How much memory do you save vs RLHF (weights only, BF16)?

GB saved
Show derivation
DPO weights = (7B + 7B) × 2 bytes = 28 GB
RLHF weights = 55.2 GB (from above)
Saved = 55.2 − 28 = 27.2 GB

DPO saves ~27 GB (49%) on weights alone. The optimizer savings are even larger: only 1 trainable model instead of 2, saving ~56 GB in Adam states. Total DPO memory: 28 + 56 = 84 GB, vs RLHF ~167 GB. DPO fits on 2 A100-80GB; RLHF needs 3-4.

Exercise 9.4: RLHF vs DPO Pipeline Design

Order the RLHF pipeline stages correctly:

? ? ? ?
PPO Training Train Reward Model Supervised Fine-Tuning Collect Preference Data
Show correct order

SFT → Collect Preferences → Train Reward Model → PPO Training

First, supervised fine-tuning creates a capable model. Then humans compare its outputs to generate preference data. The reward model is trained on these preferences. Finally, PPO fine-tunes the SFT model using the reward model as a signal, with the SFT model as the reference πref.

Exercise 9.5: Training Data Requirements Trace
For aligning a 7B model, approximately how many human preference comparisons are typically needed for the reward model?
Exercise 9.6: GPU Cost Estimate Derive

DPO training on 100K preference pairs, each with ~512 tokens for winner and loser. Batch size 4, A100 at ~300 TFLOPS effective. Rough FLOPs for a 7B model forward+backward pass per token: ~6 × 2 × 7B = 84 GFLOPS. How many A100-hours for 1 epoch?

Total tokens = 100K pairs × 2 responses × 512 tokens. Total FLOPs = tokens × 84 GFLOPS. Time = FLOPs / throughput.

A100-hours
Show derivation
Total tokens = 100,000 × 2 × 512 = 102,400,000 tokens
FLOPs per token = 84 × 109
Total FLOPs = 102.4M × 84G = 8.6 × 1018 = 8.6 ExaFLOPS
A100 throughput = 300 × 1012 FLOPS/s
Time = 8.6 × 1018 / (300 × 1012) = 28,667 seconds ≈ 7.96 hours

About 8 A100-hours for 1 epoch. At ~$2/A100-hour, that's ~$16 per epoch. DPO typically runs 1-3 epochs, so total cost is $16-$48. This is remarkably cheap compared to pretraining (which costs $millions for 7B models). DPO's simplicity translates to real cost savings.

Exercise 9.7: KL Budget Derive

After DPO training, your aligned model has average per-token KL divergence of 0.15 nats from the reference model. For a 200-token response, what is the total KL? If the reward model assigns average reward 4.0, what β would make the RLHF objective break even (objective = 0)?

β for break-even
Show derivation
Total KL = 200 × 0.15 = 30 nats
Break even: r − β · KL = 0 → β = r / KL = 4.0 / 30 = 0.133

With β = 0.133, the reward exactly balances the KL cost. Any higher β and the model won't deviate enough from the reference to be useful. Any lower and reward hacking becomes a risk. Typical production values are β = 0.05 to 0.2, so 0.133 is in the sweet spot.

Exercise 9.8: Find the Bug in DPO Training Debug

This DPO training loop has a subtle but critical bug. The model's outputs get worse after training. Click the buggy line.

// DPO training step
const lpW = model.logProb(winnerTokens);
const lpL = model.logProb(loserTokens);
const refW = refModel.logProb(winnerTokens);
const refL = refModel.logProb(loserTokens);
const delta = (lpW - refW) - (lpL - refL);
const loss = Math.log(1 + Math.exp(-beta * delta));
loss.backward();
refModel.updateWeights(lr);
Show explanation

Line 9 is the bug. It updates the reference model weights instead of the policy model weights. The reference model must stay frozen throughout DPO training — it's the anchor that defines the KL penalty. Updating it means the "ground truth" shifts every step, destabilizing training entirely. The correct line should be model.updateWeights(lr).

Exercise 9.9: Failure Mode Analysis Trace
Your aligned 7B model generates text that sounds very confident and authoritative but is often factually wrong. Which alignment failure is most likely?
The proof of work. If you completed every exercise in this workbook — computed Bradley-Terry likelihoods, derived the DPO loss from first principles, traced PPO clipping, computed CQL and IQL losses, analyzed Nash equilibria, and designed a full alignment pipeline — you have the theoretical toolkit of an alignment researcher. These are the exact calculations behind ChatGPT, Claude, and Gemini. "What I cannot create, I do not understand."

Related Lessons

TopicLesson
RL fundamentals (MDPs, Q-learning)RL Algorithms — From Absolute Zero
Reward & alignment conceptsReward & Alignment — From Absolute Zero
MDPs and value functionsMDPs — From Absolute Zero
Transformer mathTransformer Math Workbook