Stanford CS 224R — Spring 2026

Deep RL Exam Review

Every algorithm, every equation, every exam trap. Interactive practice problems for the closed-book midterm.

Prerequisites: Attended CS 224R lectures + Done the homeworks
12
Chapters
36+
Practice Qs
9
Topics

Chapter 0: The Map — RL Algorithm Taxonomy

You have 80 minutes. The exam is closed-book. There are roughly 26 slides worth of material spanning imitation learning, policy gradients, actor-critic, off-policy, Q-learning, offline RL, reward learning, model-based RL, and goal-conditioned RL. That's a lot of ground.

Before diving into any single topic, you need the map — a mental taxonomy that tells you where every algorithm lives, what data it needs, and when you'd reach for it. If someone names an algorithm on the exam, you should instantly know: is it on-policy or off-policy? Does it need a simulator? Can it work from a fixed dataset? Does it learn a value function, a policy, or both?

Let's build that map from first principles.

The single most important exam skill: when you see an algorithm name, immediately classify it. "DQN → off-policy, value-based, model-free, needs replay buffer." "Behavior Cloning → offline, imitation learning, no reward signal needed." This classification tells you what equations apply, what failure modes to expect, and what comparisons are valid. Get the taxonomy right and the rest follows.

The Two Big Splits

Every RL algorithm lives somewhere in a 2×2 grid defined by two questions:

Question 1: Where does the data come from?

If you can interact with the environment during training — take actions, observe outcomes, collect new data — you're in the online regime. If you're stuck with a fixed dataset collected by someone else (or a previous version of your policy), you're in the offline regime.

Question 2: Can you reuse old data?

If your algorithm requires fresh data from the current policy after every update, it's on-policy. If it can learn from data collected by any policy (stored in a replay buffer), it's off-policy. This distinction only applies within the online regime.

RegimeCategoryAlgorithmsData Requirement
OnlineOn-policyREINFORCE, vanilla PG, A2CFresh rollouts from current πθ
Off-policyDQN, SAC, TD3, DDPGReplay buffer of past transitions
OfflineImitation LearningBehavior Cloning, DAgger, Flow ILExpert demonstrations (s, a) pairs
Offline RLAWR, IQL, CQLFixed dataset of (s, a, r, s') tuples

Where does PPO fit? This is a classic exam trick question. PPO uses importance sampling to reuse data from a slightly old policy — but it clips the ratio so tightly that it's effectively on-policy. Most courses classify PPO as on-policy with a small off-policy correction. On an exam, call it "on-policy with clipping."

Where does SAC fit? SAC maintains a replay buffer, samples uniformly from past transitions, and trains a Q-function using off-policy Bellman backups. It's firmly off-policy. The maximum-entropy objective (adding an entropy bonus) is orthogonal to the on/off-policy distinction.

The Third Dimension: What Do You Learn?

Algorithms also differ in what function they learn:

TypeLearnsHow it ActsExamples
Value-basedQ(s, a) or V(s)π(s) = argmaxa Q(s, a)DQN, Double DQN
Policy-basedπθ(a|s)Sample from πθREINFORCE, vanilla PG
Actor-CriticBoth πθ and VφActor samples, critic evaluatesA2C, PPO, SAC, TD3
Model-basedDynamics p(s'|s,a)Plan using the modelMBPO, Dreamer, MPC
Actor-Critic is the dominant paradigm. Nearly every modern algorithm you'll see on the exam (PPO, SAC, TD3, A2C) is actor-critic. The actor provides the policy; the critic provides low-variance gradient estimates. REINFORCE (no critic) is the teaching tool; actor-critic is what people actually use.

Interactive Taxonomy

RL Algorithm Taxonomy — Click Any Algorithm

Click on an algorithm node to see its key equation, data needs, and typical use case. Orange = offline, Teal = on-policy, Blue = off-policy, Purple = model-based.

Click any algorithm above to see its details.

Decision Flowchart: Choosing an Algorithm

Start
Can you interact with the environment?
↓ Yes → Online
Online
Is the action space discrete or continuous?
↓ Discrete → DQN family  |  Continuous → SAC/PPO/TD3
Continuous
Need sample efficiency? → SAC (off-policy). Need simplicity? → PPO (on-policy).
↑ Back to Start
Start → No
Do you have expert demonstrations?
↓ Yes → Imitation Learning  |  No (but have rewards) → Offline RL
IL
Unimodal actions? → Behavior Cloning (MSE). Multimodal? → Flow Matching / Diffusion Policy.
Exam strategy: when the exam describes a scenario ("you have 1000 demonstrations but no simulator"), your first job is to place it on this map. The map eliminates 80% of the algorithm options immediately. Then you pick within the remaining category based on the specific constraints (action space, compute budget, distribution shift concerns).
Your robot can only learn from a fixed dataset of 1000 demonstrations collected by a human teleoperator. No reward labels are available. Which category of algorithms applies?
SAC uses a replay buffer to store and reuse past transitions. Is SAC on-policy or off-policy?
PPO uses importance sampling ratios to reuse data from a slightly older policy. Why is it still classified as "on-policy"?

Chapter 1: Imitation Learning

Forget rewards. Forget value functions. Forget Bellman equations. Imitation learning (IL) asks the simplest possible question: "I have recordings of an expert doing the task. Can I just copy them?"

The answer is yes — with caveats that will show up on your exam. IL turns RL into supervised learning: given state s, predict the action a the expert would take. No reward signal, no environment interaction during training (at least for behavior cloning), no temporal credit assignment. Just input-output pairs.

But the devil is in two details: how you parameterize the policy (regression vs. generative) and how you handle distribution shift (vanilla BC vs. DAgger).

Approach 1: Regression (Deterministic Policy)

The simplest approach: treat actions as real-valued outputs and minimize mean squared error (MSE):

L(θ) = (1/N) ∑i || πθ(si) − ai ||2

Where πθ(s) is a neural network that takes a state and outputs a single deterministic action. This is literally supervised regression — the same loss you'd use to predict house prices from features.

When does this work? When the expert's behavior is unimodal — for any given state, there's essentially one correct action. A self-driving car on a straight highway: the expert always goes straight. MSE loss works perfectly.

When does this break? When the expert's behavior is multimodal. This is the critical failure mode you need to know for the exam.

The Multimodal Action Problem (Exam Favorite). Imagine playing Flappy Bird. An obstacle appears. The expert sometimes goes over it (action = +1) and sometimes goes under it (action = −1). Both are valid. What does MSE-optimal regression predict? The average: action = 0. That means fly straight into the obstacle. The model learns the one action the expert never takes. Averaging two good modes produces a bad mode.

Approach 2: Generative (Stochastic Policy)

Instead of predicting a single action, model the full conditional distribution pθ(a | s) and maximize log-likelihood:

L(θ) = −(1/N) ∑i log pθ(ai | si)

This is the key difference. Log-likelihood doesn't average modes — it assigns probability mass to each mode independently. A mixture of Gaussians, a diffusion model, or a flow matching model can all represent multimodal distributions. The Flappy Bird policy would learn two peaks: one at +1 (go over) and one at −1 (go under), with near-zero probability at 0 (crash).

Common generative architectures for IL:

ModelHow It Represents MultimodalityInference
Gaussian MixtureK explicit Gaussian componentsSample from mixture
Diffusion PolicyIterative denoising from noise → actionN denoising steps
Flow MatchingLearn velocity field from noise → actionEuler integration

Flow Matching for IL

Flow matching is the generative approach emphasized in CS224R. The idea: learn a velocity field that transports random noise into expert actions. Here's how it works step by step.

Step 1: Define the interpolation. Given a noise sample a0 ~ N(0, I) and an expert action a1, create a path between them:

aτ = τ · a1 + (1 − τ) · a0,    τ ∈ [0, 1]

At τ = 0, we have pure noise. At τ = 1, we have the expert action. The path is a straight line in action space.

Step 2: Compute the target velocity. The velocity along this path is just the derivative with respect to τ:

u(aτ, τ) = a1 − a0

This is the "ground truth" direction: from noise toward the expert action.

Step 3: Train a velocity network. Learn vθ(s, aτ, τ) to predict this velocity:

L(θ) = Eτ, a0, a1 || vθ(s, aτ, τ) − (a1 − a0) ||2

The network takes the state, the noisy action, and the timestep, and predicts which direction to push.

Step 4: Inference by Euler integration. Start from noise a0 ~ N(0, I), then take K small steps:

aτ + Δτ = aτ + Δτ · vθ(s, aτ, τ)

After K steps (e.g., K = 10, Δτ = 0.1), you arrive at a clean expert-like action. The velocity field has learned to route noise toward different modes, so multimodal distributions emerge naturally.

Why flow matching over diffusion? Flow matching uses straight-line paths (simpler, faster) while diffusion uses curved paths defined by a noise schedule. Both can represent multimodal distributions, but flow matching typically needs fewer integration steps at inference time. For an exam: flow matching = straight paths, diffusion = curved paths, both solve the multimodal problem.

DAgger: Fixing Distribution Shift

Even with a perfect generative model, behavior cloning has a fundamental problem: compounding errors.

During training, the policy sees states from the expert's distribution — the states an expert would visit. At test time, the policy makes a small mistake at step 1, which puts it in a slightly different state at step 2. It's never seen this state before, so it makes a bigger mistake. By step 100, the policy is in completely unfamiliar territory and crashes.

The error grows quadratically with the horizon T. If the per-step error is ε, the total compounded error after T steps is O(εT2), not O(εT). This quadratic blowup is what makes vanilla BC fragile for long-horizon tasks.

DAgger (Dataset Aggregation) fixes this by iteratively collecting corrective labels:

Step 1
Train πθ on expert dataset D
Step 2
Roll out πθ in the environment, collecting states s1, s2, ...
Step 3
Ask the expert to label these states: "What would you do here?"
Step 4
Add (si, aexpert) to D. Go to Step 1.
↻ repeat until convergence

The key insight: DAgger trains on the policy's own distribution of states, not just the expert's. After enough iterations, the dataset covers both the expert's trajectory and all the "mistake states" the policy visits, so the policy learns to recover from its own errors.

DAgger reduces the error bound from O(εT2) to O(εT) — linear in the horizon, not quadratic. The cost: you need an expert available to label on-demand during training.

Worked Problem: Regression vs. Generative

Exam problem: A dataset D contains 100 demonstrations of a 1D task. At state s = 3.0, the expert took action a = +2 in 60 demonstrations and a = −2 in 40 demonstrations. Compute: (a) the MSE-optimal prediction, and (b) the log-likelihood-optimal prediction under a 2-component Gaussian mixture.

(a) MSE-optimal prediction:

MSE minimizes E[(a − π(s))2]. The minimizer is the conditional mean:

π*(s=3) = E[a | s=3] = 0.6 × (+2) + 0.4 × (−2) = 1.2 − 0.8 = +0.4

The model predicts +0.4 — an action the expert never took. If +2 means "go left" and −2 means "go right," then +0.4 means "drift vaguely leftward" — probably the worst possible action.

(b) Log-likelihood-optimal GMM:

A 2-component Gaussian mixture with small variance would learn:

p(a | s=3) = 0.6 · N(+2, σ2) + 0.4 · N(−2, σ2)

At inference, sampling from this distribution gives +2 about 60% of the time and −2 about 40% — matching the expert's actual behavior. No probability mass near zero.

Regression vs. Generative on Bimodal Data

The expert's action distribution has two modes. Orange line = MSE prediction (the mean). Teal curves = generative model (GMM). Drag the mode separation slider to see when regression becomes catastrophic.

Mode Separation 2.0
Mode Balance (% left) 60%
Common exam mistakes — Imitation Learning:
• Forgetting that MSE predicts the mean, not the mode. When actions are bimodal, the mean lies between modes where no expert data exists.
• Saying DAgger "collects more expert demonstrations." Wrong — DAgger runs the learner's policy and asks the expert to label the learner's states. The key is covering the learner's distribution.
• Confusing the flow matching loss (predict velocity a1 − a0) with the diffusion loss (predict noise ε). Both are regression losses, but the target is different.
• Saying BC error grows linearly with T. It grows quadratically — O(εT2). DAgger reduces it to O(εT).
In behavior cloning with MSE loss, the expert's action distribution at a particular state is bimodal with peaks at a = +3 and a = −3 (equal probability). What does the trained policy predict?
In DAgger, whose distribution of states is used to collect new training data after the first round?
In flow matching for IL, the training loss minimizes || vθ(s, aτ, τ) − ??? ||2. What is the regression target?

Chapter 2: Policy Gradients

Imitation learning sidesteps reward entirely. But most RL problems don't come with expert demonstrations — they come with a reward function. Policy gradient methods directly optimize the expected cumulative reward by taking gradients of the policy parameters.

The core idea: if a trajectory gets high reward, increase the probability of the actions in that trajectory. If it gets low reward, decrease them. No value function needed (yet). No model of the environment. Just rollouts and gradient updates.

The REINFORCE Gradient

We want to maximize the expected return:

J(θ) = Eτ ~ πθ [R(τ)] = Eτ ~ πθ [∑t=0T r(st, at)]

The problem: we can't backpropagate through the environment (it's a black box). The solution is the log-derivative trick (also called the REINFORCE trick or score function estimator):

θ J(θ) = Eτ ~ πθ [(∑t=0Tθ log πθ(at | st)) · R(τ)]

In practice, we estimate this with N sampled trajectories:

θ J ≈ (1/N) ∑i=1N [(∑t=0Tθ log πθ(ati | sti)) · R(τi)]

Let's unpack each symbol:

SymbolMeaning
θ log πθ(a|s)The score function — which direction in parameter space makes action a more likely at state s
R(τ)Total return of trajectory τ = ∑ r(st, at)
NNumber of sampled trajectories (batch size)
TTrajectory length (horizon)

Intuition:θ log πθ(a|s) points toward parameter values that make action a more likely. Multiplying by R(τ) scales this — high-reward trajectories get large positive updates (make these actions more probable), low-reward trajectories get small or negative updates (make these actions less probable). It's a form of trial-and-error weighted by success.

Improvement 1: Causality (Reward-to-Go)

The vanilla REINFORCE formula weights each action's gradient by the total trajectory reward R(τ). But action at at time t can't affect rewards that already happened at times 0, 1, ..., t−1. Including past rewards adds noise without information.

The reward-to-go fix: replace R(τ) with the sum of future rewards from time t onward:

θ J ≈ (1/N) ∑i=1Nt=0Tθ log πθ(ati | sti) · (∑t'=tT r(st'i, at'i))

The inner sum ∑t'=tT r(st', at') is called the reward-to-go or return-to-go. It's the total reward the agent receives from time t onward. This has the same expected value as using R(τ) but strictly lower variance.

Improvement 2: Baselines

Even with reward-to-go, the gradient estimate has high variance. Consider: if all trajectories get positive reward (say, rewards between +10 and +20), then REINFORCE increases the probability of every action — even the bad ones. It just increases them less. Convergence is slow because the signal-to-noise ratio is poor.

The fix: subtract a baseline b from the reward:

θ J ≈ (1/N) ∑itθ log πθ(ati | sti) · (∑t'=tT rt' − b)

The crucial mathematical fact: subtracting any constant baseline b does not change the expected value of the gradient. This is because E[∇θ log π · b] = b · E[∇θ log π] = b · 0 = 0. The score function has zero mean.

But it does change the variance. The optimal baseline is b* = E[R(τ)], the average return. Now trajectories better than average get positive weight (reinforced) and trajectories worse than average get negative weight (discouraged). Much cleaner signal.

This is an exam classic: "Does subtracting a baseline introduce bias into the policy gradient?" The answer is NO. The expected gradient is unchanged. The baseline only reduces variance. If someone asks you to prove it, show that E[∇ log π · b] = 0 because ∑a ∇π(a|s) = ∇(∑a π(a|s)) = ∇(1) = 0.

Worked Problem: Computing the Policy Gradient

Exam problem: You collect N = 3 trajectories from a policy πθ. Each trajectory has T = 2 timesteps. The rewards are:
• Trajectory 1: r0 = +3, r1 = +2 → R(τ1) = 5
• Trajectory 2: r0 = −1, r1 = −1 → R(τ2) = −2
• Trajectory 3: r0 = +1, r1 = 0 → R(τ3) = 1
Assume ∇θ log π(a01|s01) = [+1, 0]. Compute the gradient contribution from action a0 of trajectory 1: (a) without baseline, and (b) with baseline b = mean return.

(a) Without baseline:

Using reward-to-go from t=0: Rto-go = r0 + r1 = 5

contribution = ∇θ log π(a01|s01) · Rto-go = [+1, 0] · 5 = [+5, 0]

(b) With baseline:

Baseline b = (5 + (−2) + 1) / 3 = 4/3 ≈ 1.33

contribution = [+1, 0] · (5 − 1.33) = [+1, 0] · 3.67 = [+3.67, 0]

The direction is the same (trajectory 1 is above average, so it's reinforced), but the magnitude is smaller. For trajectory 2 (return = −2, below baseline), the contribution would be negative — discouraging those actions. That's exactly what we want.

Without a baseline, even trajectory 3 (return = +1, mediocre) gets a positive gradient. With a baseline, trajectory 3 gets a slightly negative gradient (1 − 1.33 = −0.33), correctly signaling that it's below average.

Policy Gradient: Trajectories, Causality & Baselines

Five trajectories are shown with their rewards. Toggle causality (reward-to-go) and baseline subtraction to see how the gradient direction changes. Arrow length = gradient magnitude for each trajectory.

Common exam mistakes — Policy Gradients:
• "Subtracting a baseline changes the expected gradient." Wrong. It only reduces variance. E[∇ log π · b] = 0 for any constant b.
• Forgetting causality: action at should only be weighted by rewards from t onward, not the whole trajectory return. Using the whole return is valid but has higher variance.
• Confusing "high variance" with "biased." REINFORCE is unbiased but high variance. That's why we need many trajectories (large N) for accurate estimates.
• Thinking the baseline must be the mean return. Any state-dependent function b(s) works. The optimal baseline is b*(s) = E[R | s], but even a simple constant helps enormously.
In the REINFORCE algorithm, you collect 100 trajectories and all of them have positive reward (between +5 and +15). Without a baseline, what happens to the policy gradient update?
You're given the policy gradient: ∇J = (1/N) ∑ (∑ ∇ log π(a|s)) · (R(τ) − b). A colleague claims subtracting b = 100 biases the gradient. Is the colleague correct?
Which modification converts the "total trajectory reward" version of REINFORCE into the "reward-to-go" version?
🔨 Derivation Derive the Policy Gradient Theorem from Scratch ✓ ATTEMPTED

The policy gradient theorem states: ∇θ J(θ) = Eτ~πθ[∑t=0Tθ log πθ(at|st) · Qπ(st, at)]

Your task: (1) Start from J(θ) = Eτ~πθ[R(τ)] and derive this result using the log-derivative trick. (2) Show that Q can be replaced with advantage A = Q - V without changing the gradient's expectation. (3) Prove that subtracting ANY function b(s) from Q doesn't bias the gradient (b is called a "baseline").

θ pθ(τ) = pθ(τ) · ∇θ log pθ(τ). This converts ∇ ∫ p(τ)R(τ)dτ into Eτ[∇ log p(τ) · R(τ)]. Then factor log p(τ) = ∑ log π(at|st) + const (dynamics terms cancel).
R(τ) = ∑t'=0T rt'. By causality, actions at time t only affect future rewards. So ∇ log π(at|st) is multiplied only by ∑t'=tT γt'-trt' = Qπ(st, at) (the reward-to-go = Q-function under the current policy).
Ea~π[∇ log π(a|s) · b(s)] = b(s) · Ea[∇ log π(a|s)] = b(s) · ∇ ∑a π(a|s) = b(s) · ∇(1) = 0. Any b(s) that doesn't depend on a vanishes from the gradient's expectation. V(s) is the variance-minimizing choice.

Step 1: J(θ) = ∫ pθ(τ) R(τ) dτ. Take gradient: ∇J = ∫ ∇pθ(τ) R(τ) dτ = ∫ pθ(τ) ∇ log pθ(τ) R(τ) dτ = Eτ[∇ log pθ(τ) · R(τ)].

Step 2: log pθ(τ) = log p(s0) + ∑ log p(st+1|st,at) + ∑ log πθ(at|st). Only the last sum depends on θ: ∇ log pθ(τ) = ∑ ∇ log πθ(at|st).

Step 3 (Causality): ∇J = E[∑t ∇ log π(at|st) · (∑t'=tT γt'-trt')]. The inner sum = Qπ(st, at). ■

Step 4 (Baseline): Replacing Q with A = Q - V: the V(st) term contributes Ea[∇ log π · V(s)] = V(s) · 0 = 0. So A and Q give the same expected gradient, but A has lower variance (V(s) acts as a control variate). ■

Exam insight: The PG theorem appears in 3 forms: (1) with R(τ), (2) with Q(s,a), (3) with A(s,a). They're all the same gradient in expectation. Form (3) is what's implemented (lowest variance). Know all three and how to go between them.

Chapter 3: Actor-Critic Methods

REINFORCE works but it's inefficient. Each gradient estimate requires rolling out entire trajectories, and the variance is high even with baselines. The fundamental issue: we're using sampled returns to evaluate how good an action is. What if we could replace those noisy sampled returns with a learned estimate?

That's the actor-critic idea. Train two networks: an actor (the policy πθ) and a critic (a value function that estimates expected returns). The critic provides low-variance gradient signals to the actor. The actor provides trajectories for the critic to learn from.

The Three Value Functions

Before anything else, you need to have these three functions and their relationships burned into memory. They will appear on every exam in some form.

Vπ(s) = Eπ [∑t=0 γt rt | s0 = s]

Vπ(s) is the state value function. It answers: "How much total discounted reward do I expect to get, starting from state s and following policy π forever?" It's the "how good is this state?" number.

Qπ(s, a) = Eπ [∑t=0 γt rt | s0 = s, a0 = a]

Qπ(s, a) is the action value function. It answers: "How much total reward do I expect if I start in state s, take action a (possibly different from what π would choose), and then follow π from the next state onward?" It's "how good is this action in this state?"

Aπ(s, a) = Qπ(s, a) − Vπ(s)

Aπ(s, a) is the advantage function. It answers: "How much better is action a compared to the average action at state s?" Positive advantage means the action is above average; negative means below average. The advantage is the signal the actor uses to update — reinforce above-average actions, discourage below-average ones.

The relationship you must know: Vπ(s) = Ea ~ π[Qπ(s, a)]. The state value is the expected action value under the policy. And by definition, Ea ~ π[Aπ(s, a)] = 0 — the average advantage is always zero. Advantages are relative: some actions are above average, some below, but they average to zero.

MC vs. TD vs. n-Step Targets

The critic needs a training target — what should V(s) equal? There are three approaches, and understanding their tradeoffs is essential for the exam.

MethodTarget for V(s)BiasVarianceData Needed
Monte Carlo (MC)t'=tT γt'−t rt' (actual return)UnbiasedHighFull trajectory
TD(0)rt + γ V(st+1) (bootstrap)Biased (if V wrong)LowSingle transition
n-stepk=0n−1 γk rt+k + γn V(st+n)Less biasedMediumn transitions

Monte Carlo: Use the actual observed return from the trajectory. No approximation, so unbiased. But if the trajectory is long (T = 1000), the return includes 1000 random variables, so variance is huge.

yMC = rt + γ rt+1 + γ2 rt+2 + ... + γT−t rT

TD(0) (Temporal Difference): Use only the immediate reward plus the critic's own estimate of the next state. This bootstraps — it uses V(s') to estimate V(s). If V is wrong, the target is biased. But it only depends on one random variable (rt), so variance is low.

yTD = rt + γ V(st+1)

n-step: A compromise. Use n actual rewards, then bootstrap:

yn-step = rt + γ rt+1 + ... + γn−1 rt+n−1 + γn V(st+n)

At n = 1, this is TD(0). At n = T, this is MC. You can slide between bias and variance by choosing n.

The Actor-Critic Policy Gradient

With a learned critic, the policy gradient becomes:

θ J ≈ (1/N) ∑itθ log πθ(at | st) · Aπ(st, at)

Where the advantage is estimated using the TD error:

Â(st, at) = r(st, at) + γ Vφ(st+1) − Vφ(st)

This is the TD advantage: the actual reward plus the discounted value of the next state minus the value of the current state. If this is positive, the action led to a better-than-expected next state — reinforce it. If negative, the action was worse than expected — discourage it.

Think of  as a "surprise" signal. V(s) is what we expected to get from state s. r + γV(s') is what we actually got (one-step look-ahead). The advantage = actual − expected. Positive surprise → good action. Negative surprise → bad action. Zero → action was exactly as good as average.

Worked Problem: Computing TD Target, MC Return, and Advantage

Exam problem: Given: V(s1) = 5, V(s2) = 8, V(s3) = 3, discount γ = 0.9. The agent transitions s1 → s2 → s3 with rewards r1 = 3, r2 = −1. Compute: (a) TD target for V(s1), (b) MC return from s1, (c) TD advantage A(s1, a1).

(a) TD target for V(s1):

yTD(s1) = r1 + γ V(s2) = 3 + 0.9 × 8 = 3 + 7.2 = 10.2

(b) MC return from s1 (assuming the trajectory ends at s3):

G(s1) = r1 + γ r2 + γ2 V(s3) = 3 + 0.9 × (−1) + 0.81 × 3 = 3 − 0.9 + 2.43 = 4.53

(This is actually a 2-step return, not a pure MC return. A pure MC would go until the episode ends. But for a finite trajectory, this is the standard calculation.)

(c) TD advantage A(s1, a1):

Â(s1, a1) = r1 + γ V(s2) − V(s1) = 10.2 − 5 = +5.2

The advantage is strongly positive: taking action a1 in state s1 led to state s2 (value 8), which is much better than the expected value of state s1 (value 5). The actor should reinforce this action.

Notice the TD target (10.2) is much higher than V(s1) = 5, meaning our critic underestimates s1. The critic loss would push V(s1) upward toward 10.2.

MC vs. TD: Bias-Variance Tradeoff

A 10-state chain with random rewards. The orange line = true values (computed analytically). Teal dots = estimated values using n-step returns. Drag the n slider: n=1 is TD(0) (low variance, may be biased), n=10 is MC (unbiased, high variance). Watch the estimates wobble with MC and stabilize (but shift) with TD.

n (bootstrap depth) 1
Critic Accuracy 50%
Common exam mistakes — Actor-Critic:
• Confusing V(s) and Q(s,a). V averages over actions; Q conditions on a specific action. V = Ea[Q(s,a)].
• Saying "MC is better than TD because it's unbiased." Not necessarily — if your value function is well-trained, TD's bias is tiny and its low variance leads to faster convergence. If your value function is poorly trained, TD's bias is large and MC is safer. The right choice depends on critic quality.
• Forgetting that A(s,a) = Q(s,a) − V(s), so the expected advantage is always zero. If every advantage on your exam calculation is positive, something is wrong.
• Computing the TD target as r + V(s) instead of r + γV(s'). Don't forget the discount factor γ and that you use the next state s', not the current state s.
V(sA) = 10, V(sB) = 6, r = 2, γ = 0.95. What is the TD advantage A(sA, a)?
MC estimates are unbiased but have high variance. TD estimates have lower variance but may be biased. You're training a brand-new critic from random initialization. Which target should you use?
What is the expected advantage Ea ~ π[Aπ(s, a)] for any state s?

Chapter 4: Off-Policy Methods & PPO

Here's the problem with on-policy methods like vanilla policy gradient and A2C: the moment you update the policy parameters θ → θ', all your collected data becomes stale. The data was collected under πθ, but now you need data from πθ'. You must throw away your data and collect new trajectories. Every single gradient step requires fresh rollouts.

This is astronomically wasteful. If each rollout takes 10 seconds and you want to do 100,000 gradient steps, that's a million seconds of data collection — even though most of that data is nearly identical (the policy only changed slightly).

Off-policy methods fix this by reusing old data. The key mathematical tool is importance sampling.

Importance Sampling: Reweighting Old Data

Suppose you collected a transition (s, a, r, s') using an old policy πold. You want to use it to estimate the gradient for a new policy πnew. The importance sampling correction is:

Ea ~ πnew[f(a)] = Ea ~ πold[(πnew(a|s) / πold(a|s)) · f(a)]

The ratio ρ = πnew(a|s) / πold(a|s) reweights each sample. If the new policy thinks action a is twice as likely as the old policy did, the sample counts double. If the new policy thinks a is half as likely, the sample counts half.

For policy gradients, the off-policy objective becomes:

JIS(θ') = E(s,a) ~ πθ [(πθ'(a|s) / πθ(a|s)) · Aπθ(s, a)]

The Problem: Exploding Ratios

Importance sampling is mathematically correct but practically dangerous. The ratio ρ = πθ'(a|s) / πθ(a|s) can become arbitrarily large. If πold(a|s) = 0.01 and πnew(a|s) = 0.5, the ratio is 50. A single sample gets multiplied by 50, dominating the entire gradient estimate. The variance explodes.

Even worse: for trajectories, the ratios multiply across timesteps. A trajectory of length T has a combined ratio of ∏t=0Tnew(at|st) / πold(at|st)). If each ratio is 2, the product is 2T. For T = 100, that's 2100 ≈ 1030. Numerical catastrophe.

Solution 1: KL Constraint (TRPO)

Trust Region Policy Optimization (TRPO) constrains how far the new policy can move from the old one:

maxθ' Eπθ [(πθ'(a|s) / πθ(a|s)) · A(s, a)]
subject to: Es[DKLθ(·|s) || πθ'(·|s))] ≤ ε

The KL divergence constraint ensures the ratio stays near 1 (since if the distributions are similar, the ratio is close to 1). This works great in theory but requires computing the KL constraint at every step, which involves second-order optimization (the Fisher information matrix). Expensive and hard to implement.

Solution 2: PPO Clipping (The Practical Choice)

Proximal Policy Optimization (PPO) replaces the KL constraint with a simple clipping trick:

LCLIP(θ') = Et [min(ρt At, clip(ρt, 1−ε, 1+ε) · At)]

Where ρt = πθ'(at|st) / πθ(at|st) and ε is typically 0.2.

Let's unpack the min. There are two cases:

Case 1: Advantage is positive (A > 0). This action is good — we want to increase its probability. The ratio ρ grows as we increase πnew(a|s). But clip(ρ, 0.8, 1.2) caps the benefit at ρ = 1.2. Beyond that, the gradient is zero. We can't over-exploit one good action.

Case 2: Advantage is negative (A < 0). This action is bad — we want to decrease its probability. The ratio ρ shrinks. But clip caps it at ρ = 0.8. We can't over-penalize one bad action.

Exam intuition for PPO clipping: the clip creates a "trust region" around the old policy. The ratio ρ is allowed to move between [1 − ε, 1 + ε], so the new policy can't deviate too far from the old one in any single update. This achieves the same goal as TRPO's KL constraint, but with a single line of code instead of a second-order optimizer. That's why PPO is the most widely used algorithm in practice.

SAC: Maximum Entropy RL

Soft Actor-Critic (SAC) is the other major off-policy algorithm, but with a twist: it adds an entropy bonus to the reward:

π* = argmaxπ Eπ [∑t r(st, at) + α H(π(·|st))]

Where H(π) = −E[log π(a|s)] is the entropy of the policy and α controls the entropy weight.

Why add entropy? Two reasons:

1. Exploration. Maximizing entropy prevents the policy from collapsing to a deterministic action too early. It keeps exploring diverse actions, which helps avoid local optima.

2. Robustness. A high-entropy policy is more robust to perturbations. If one action path gets blocked, the policy has mass on alternative paths.

SAC uses a replay buffer (off-policy), learns Q-functions (two of them, to reduce overestimation), and outputs a stochastic policy (squashed Gaussian). It's the go-to algorithm for continuous control with sample efficiency.

Worked Problem: Importance Sampling Ratio

Exam problem: At state s, the old policy has πold(a|s) = 0.3 and the new policy has πnew(a|s) = 0.9. The advantage is A(s, a) = +2.0 and ε = 0.2.
(a) Compute the IS ratio ρ.
(b) Compute the unclipped objective ρ · A.
(c) Compute the PPO clipped objective.
(d) Is this ratio dangerously large?

(a) IS ratio:

ρ = πnew(a|s) / πold(a|s) = 0.9 / 0.3 = 3.0

(b) Unclipped objective:

ρ · A = 3.0 × 2.0 = 6.0

(c) PPO clipped objective:

Since A > 0, PPO takes min(ρ · A, clip(ρ, 0.8, 1.2) · A):

clip(3.0, 0.8, 1.2) = 1.2
clipped = 1.2 × 2.0 = 2.4
LCLIP = min(6.0, 2.4) = 2.4

PPO caps the objective at 2.4 instead of 6.0. The gradient through the clipped term is zero with respect to θ' beyond the clip point, preventing the policy from jumping too far.

(d) Is ρ = 3.0 dangerously large? Yes. The new policy is 3× more likely to take this action than the old policy. For a single transition, this is manageable. But for a trajectory of length T, the ratios multiply: 3T. Even at T = 20, that's 320 ≈ 3.5 × 109. That's why PPO clips — and why practitioners only reuse data for a few (3-5) gradient steps before recollecting.

Importance Sampling Ratio & PPO Clipping

Set πold and πnew for a single action. Orange bar = unclipped ρ · A. Teal bar = clipped PPO objective. The gray region shows the clipping bounds [1−ε, 1+ε].

πold(a|s) 0.30
πnew(a|s) 0.90
Advantage A +2.0
Clip ε 0.20
Common exam mistakes — Off-Policy & PPO:
• "PPO is off-policy because it uses importance sampling." Wrong. PPO uses IS ratios to reuse data for a few gradient steps, but it recollects data frequently and clips ratios tightly. It's effectively on-policy with a short reuse window.
• Forgetting to check both cases of the PPO clipped objective. When A > 0, the min prevents over-reinforcing. When A < 0, the min prevents over-penalizing. You must handle both cases on the exam.
• Confusing TRPO's KL constraint with PPO's clipping. TRPO constrains the KL divergence (requires second-order optimization). PPO clips the ratio (first-order, much simpler). Both achieve the "trust region" effect.
• Thinking SAC is on-policy. SAC uses a replay buffer and is firmly off-policy. The entropy bonus is about exploration, not about data freshness.
πold(a|s) = 0.1, πnew(a|s) = 0.4. What is the importance sampling ratio?
In PPO with ε = 0.2, the IS ratio ρ = 1.5 and the advantage A = +3.0. What is the clipped objective LCLIP?
Why does PPO clip the IS ratio instead of using a KL divergence constraint like TRPO?

Chapter 5: Q-Learning & DQN

Every method so far learns a policy πθ(a|s) explicitly — a network that directly outputs actions. Q-learning takes a fundamentally different approach: learn the optimal value of every action, then just pick the best one. No policy network at all.

The key insight: if you knew Q*(s, a) — the optimal action-value function — you wouldn't need a policy network. You'd just act greedily:

π*(s) = argmaxa Q*(s, a)

"At every state, pick the action with the highest Q-value." The policy is implicit in the Q-function. This is the value-based approach to RL.

The Bellman Optimality Equation

Q* satisfies a recursive relationship called the Bellman optimality equation:

Q*(s, a) = r(s, a) + γ maxa' Q*(s', a')

In words: the optimal value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state. This is a fixed-point equation — Q* is the unique function that satisfies it.

Let's unpack each piece:

TermMeaning
Q*(s, a)Optimal value of taking action a in state s (what we're learning)
r(s, a)Immediate reward for taking action a in state s
γDiscount factor (0 to 1) — how much we care about future rewards
maxa' Q*(s', a')Value of the best action in the next state (bootstrapping)
Why is Q-learning off-policy? The Bellman equation uses max over actions at s'. It doesn't matter how we got to state s' — we always evaluate the best possible action. Even if the data was collected by a random policy, Q-learning still backs up the value of the optimal action. This is different from SARSA (which uses the action the policy actually took, making it on-policy).

Tabular Q-Learning

For small state-action spaces, we can store Q as a table. The update rule for each observed transition (s, a, r, s'):

Q(s, a) ← Q(s, a) + α [r + γ maxa' Q(s', a') − Q(s, a)]

Where α is the learning rate. The term in brackets is the TD error: the difference between the Bellman target (r + γ max Q(s', a')) and the current estimate Q(s, a). We nudge Q toward the target by a fraction α.

With enough exploration (visiting every state-action pair infinitely often) and a decaying learning rate, tabular Q-learning provably converges to Q*.

DQN: Q-Learning with Neural Networks

For large state spaces (images, continuous states), we replace the table with a neural network Qφ(s, a). The loss function:

L(φ) = E(s,a,r,s') ~ D [(Qφ(s, a) − (r + γ maxa' Qφ'(s', a')))2]

This looks like simple regression: predict Q(s,a), use the Bellman target as the label. But there are three problems that make naive DQN unstable, and three tricks that fix them:

Stability Trick 1: Replay Buffer

Problem: consecutive transitions are correlated (s1 → s2 → s3 are temporally adjacent). Neural networks trained on correlated data overfit to recent patterns and forget old ones.

Fix: store all transitions in a large replay buffer D. Sample mini-batches uniformly from D. This breaks temporal correlations — your mini-batch might contain transitions from 10 different episodes at different points in training.

Stability Trick 2: Target Network

Problem: the Bellman target r + γ max Qφ(s', a') uses the same network Qφ that we're training. When we update φ, the target changes too — we're chasing a moving target. This causes oscillations and divergence.

Fix: use a separate target network Qφ' with frozen parameters. Update φ' only periodically (copy φ to φ' every C steps) or slowly (Polyak averaging: φ' ← τφ + (1−τ)φ'). The target is stable between updates.

Stability Trick 3: Double Q-Learning

Problem: the max operator in maxa' Q(s', a') systematically overestimates Q-values. If Q has noise (estimation error), max picks the action with the highest noise realization, biasing the estimate upward. This is the "max of noisy estimates > true max" phenomenon.

Fix: Double DQN decouples action selection from action evaluation. Use the online network to select the action, but the target network to evaluate it:

a* = argmaxa' Qφ(s', a')    (select with online)
target = r + γ Qφ'(s', a*)    (evaluate with target)

Since the selection and evaluation use different networks (with different noise patterns), the overestimation bias is significantly reduced.

Summary of DQN's three tricks:
1. Replay buffer → breaks correlations between consecutive transitions
2. Target network → stabilizes the Bellman target (stop chasing a moving target)
3. Double Q → reduces overestimation from the max operator
All three address different sources of instability in approximate dynamic programming with neural networks. Memorize what each one fixes.

Worked Problem: Q-Table Bellman Update

Exam problem: You have a Q-table with 3 states (s1, s2, s3) and 2 actions (Left, Right). Current Q-values and a transition to process:

Q-table:
Q(s1, L) = 2, Q(s1, R) = 4
Q(s2, L) = 6, Q(s2, R) = 3
Q(s3, L) = 1, Q(s3, R) = 5

Transition: (s1, R, r=2, s2). Learning rate α = 0.5, γ = 0.9.
Perform one Q-learning update.

Step 1: Compute the Bellman target.

target = r + γ maxa' Q(s2, a') = 2 + 0.9 × max(6, 3) = 2 + 0.9 × 6 = 2 + 5.4 = 7.4

Step 2: Compute the TD error.

δ = target − Q(s1, R) = 7.4 − 4 = 3.4

Step 3: Apply the update.

Q(s1, R) ← Q(s1, R) + α · δ = 4 + 0.5 × 3.4 = 4 + 1.7 = 5.7

The Q-value for (s1, Right) increased from 4.0 to 5.7. Makes sense: taking Right in s1 gives reward 2 and transitions to s2, whose best action (Left) has value 6. The discounted future value 0.9 × 6 = 5.4 plus the immediate reward 2 = 7.4, which is higher than the old estimate of 4. So Q moves upward.

Notice: only Q(s1, R) changed. All other Q-values remain the same. Q-learning does pointwise updates — one (s, a) pair per transition.

Interactive Q-Table: Watch Values Converge

A 3×3 grid world with a goal (top-right, reward +10) and a pit (bottom-left, reward −5). Click Step to perform one Q-learning update from a random transition. Click Run 100 to watch the Q-values converge. The arrows show the greedy policy (argmax Q). Warmer colors = higher max Q.

Learning Rate α 0.50
Discount γ 0.90
Exploration ε 0.10
Common exam mistakes — Q-Learning:
• "Q-learning is on-policy." Wrong. Q-learning uses maxa' Q(s', a') — it evaluates the greedy action regardless of what the behavior policy actually did. That makes it off-policy. (SARSA, which uses Q(s', a'actual), is on-policy.)
• Forgetting the discount γ in the Bellman target. target = r + γ max Q, not r + max Q.
• Computing max Q(s, a') instead of max Q(s', a'). The max is over the next state s', not the current state s.
• Confusing what Double DQN does. It doesn't use two replay buffers or two loss functions. It uses the online network to select the best action and the target network to evaluate that action. Decoupling selection from evaluation reduces overestimation.

Q-Learning vs. SARSA: The Key Distinction

Q-LearningSARSA
Update targetr + γ maxa' Q(s', a')r + γ Q(s', a'actual)
On/off-policyOff-policyOn-policy
EvaluatesGreedy (best) actionThe action the policy actually takes
ConsequenceLearns optimal Q* regardless of behaviorLearns Qπ for the current policy
SafetyMay learn unsafe greedy pathsRespects exploration policy (safer)
Q(s1, a1) = 3, Q(s1, a2) = 7. Q(s2, a1) = 5, Q(s2, a2) = 2. The agent transitions from s1 with action a2, receives r = 1, and lands in s2. With γ = 0.9 and α = 1.0, what is Q(s1, a2) after the update?
Why does DQN use a target network with frozen parameters φ' instead of using the same network φ for both the prediction and the Bellman target?
Q-learning uses maxa' Q(s', a') in its update. SARSA uses Q(s', a'actual). Which statement is correct?
🔨 Derivation Prove the Bellman Operator is a Contraction (Q-Learning Convergence) ✓ ATTEMPTED

The Bellman optimality operator is: (TQ)(s,a) = r(s,a) + γ maxa' Q(s', a'). Q-learning converges because T is a contraction in the infinity norm.

Your task: (1) Prove ||TQ1 - TQ2|| ≤ γ ||Q1 - Q2||. (2) Use the Banach fixed-point theorem to conclude Q-learning converges to a unique Q*. (3) What is the convergence rate? After k iterations, how close is Qk to Q*?

|maxa f(a) - maxa g(a)| ≤ maxa |f(a) - g(a)|. The "max" doesn't amplify differences — it can only reduce them. Apply this to maxa' Q1(s',a') - maxa' Q2(s',a').
|(TQ1)(s,a) - (TQ2)(s,a)| = |γ maxa' Q1(s',a') - γ maxa' Q2(s',a')| ≤ γ maxa' |Q1(s',a') - Q2(s',a')| ≤ γ ||Q1 - Q2||. Since this holds for all (s,a), take sup over (s,a): ||TQ1 - TQ2|| ≤ γ ||Q1 - Q2||. ■
Banach theorem: any contraction on a complete metric space has a unique fixed point, and iterates converge to it. Rate: ||Qk - Q*|| ≤ γk ||Q0 - Q*||. With γ=0.99, after 460 iterations the error shrinks by a factor of 100. With γ=0.9, only 46 iterations needed. This is why low γ converges faster but plans shorter horizons.

Proof of contraction:

For any (s,a): |(TQ1 - TQ2)(s,a)| = γ|maxa' Q1(s',a') - maxa' Q2(s',a')|

≤ γ maxa' |Q1(s',a') - Q2(s',a')| (triangle inequality for max)

≤ γ sup(s,a) |Q1(s,a) - Q2(s,a)| = γ ||Q1 - Q2||

Taking sup over (s,a) on the left: ||TQ1 - TQ2|| ≤ γ ||Q1 - Q2||. ■

Convergence: Since γ < 1, T is a γ-contraction. By Banach: (1) unique fixed point Q* exists, (2) Qk = TkQ0 → Q*, (3) rate: ||Qk - Q*|| ≤ γk/(1-γ) ||TQ0 - Q0||.

Exam insight: This proof is a classic exam question. The key steps: (1) property of max, (2) apply it, (3) invoke Banach. Also know: this only works for tabular Q-learning. With function approximation (DQN), the contraction property breaks because the projection step (fitting a neural network) is not a non-expansion. This is the "deadly triad" problem.

Chapter 6: Offline RL

Here's a scenario that comes up constantly in the real world: you have a massive dataset of past decisions — hospital treatment records, autonomous driving logs, recommendation system clickstreams — but you cannot deploy a new policy to collect more data. The patients have already been treated. The cars have already driven. You need to learn the best policy you can from this fixed dataset, without any additional interaction.

This is offline RL (also called batch RL). You have a dataset D = {(s, a, r, s')} collected by some behavior policy πβ, and your job is to extract the best possible policy π from this data alone. Sounds like supervised learning, right? Just imitate the best actions in the dataset?

Not quite. The behavior policy might be mediocre — a mix of novice and expert drivers, doctors following outdated guidelines. We want to do better than the data collector. But this is where offline RL breaks in a spectacular and specific way.

The Distribution Shift Problem

Consider a standard Q-learning update. We pick the action that maximizes Q(s, a) at the next state. But what if that maximizing action was never taken in the dataset? The Q-network has never seen a training signal for (s', amax), so its Q-value there is essentially a random guess — and random neural network outputs tend to be overestimates.

Now it gets worse. The Bellman backup propagates these overestimates. Q(s, a) gets a target that includes maxa' Q(s', a'), which is inflated. Next iteration, that inflated Q(s, a) propagates to other states. The errors compound exponentially. Your Q-function develops "hallucinated" high-value regions in parts of the state-action space the dataset never visited.

The offline RL death spiral: Standard Q-learning uses maxa Q(s', a) in its backup. If the maximizing action a* is out-of-distribution (never seen in data), Q(s', a*) is an unreliable overestimate. This overestimate propagates backward through Bellman backups, creating fictitious high-value "phantom" policies that look great on paper but fail catastrophically when deployed. The policy chases ghosts.

Solution 1: Constrained Policy (AWR)

Advantage Weighted Regression (AWR) sidesteps the problem entirely by never evaluating out-of-distribution actions. Instead of learning Q-values and picking argmax, AWR treats the problem as weighted behavior cloning:

π* = argmaxπ E(s,a)~D [ exp(A(s,a) / β) · log π(a|s) ]

Where A(s,a) = Q(s,a) - V(s) is the advantage, and β is a temperature parameter. Translation: clone the behavior policy, but weight each action by how much better it was than average. Good actions (positive advantage) get upweighted exponentially. Bad actions get downweighted. The policy stays close to the data distribution because it's literally doing weighted imitation.

Solution 2: Avoid OOD Evaluation (IQL)

Implicit Q-Learning (IQL) takes a different, more elegant approach. The core insight: the reason Q-learning fails offline is that the Bellman target maxa Q(s', a) requires evaluating Q at actions that might be OOD. What if we could compute a value function V(s) that approximates the max without ever querying Q at unseen actions?

IQL does this in two steps:

Step 1: Fit V via Expectile Regression
Train V(s) using asymmetric Lτ loss against Q(s,a) for in-data actions only. With τ < 0.5, V estimates a conservative quantile of Q.
Step 2: Extract Policy via AWR
Use A(s,a) = Q(s,a) - V(s) as advantage. Weight BC by exp(A/β). Only uses in-data (s,a) pairs.

The key equation is the expectile loss:

Lτ(u) = |τ - 1(u < 0)| · u²

Where u = Q(s,a) - V(s). When τ = 0.5, this is ordinary MSE — V learns the mean of Q. When τ < 0.5, the loss penalizes overestimates more than underestimates. So V(s) is pulled below the mean Q — it becomes a conservative lower bound. When τ > 0.5, V chases the high end — approaching maxa Q(s,a), which is what we'd want in online RL but is dangerous offline.

IQL's genius: By fitting V with τ < 0.5, we get a conservative estimate of the value of a state using only in-data actions. No max over unseen actions needed. Then we extract the policy by weighting in-data actions by their advantage over this conservative baseline. Both steps only touch (s, a) pairs from the dataset.

Worked Exam Problem

Problem: You have Q(s, a1) = 5, Q(s, a2) = 8, V(s) = 6. Action a1 appears in the dataset. Action a2 is OOD (never in dataset).

(a) Why is Q(s, a2) = 8 dangerous?

(b) Compute IQL's expectile loss Lτ for V(s) = 6 with τ = 0.3, using only the in-data action a1.

(c) What happens to V(s) as training proceeds with τ = 0.3?

Solution (a): Q(s, a2) = 8 was never trained on real (s, a2, r, s') transitions. It's a random extrapolation by the neural network. Standard Q-learning would set its target as r + γ max(Q(s', ·)), and if that max happens at an OOD action in s', the overestimate propagates. A policy using argmaxa Q(s, a) would choose a2 — an action that might not even make physical sense — because of this hallucinated high value.

Solution (b): We only use in-data actions, so u = Q(s, a1) - V(s) = 5 - 6 = -1. Since u < 0, the indicator 1(u < 0) = 1, so the weight is |τ - 1| = |0.3 - 1| = 0.7.

L0.3(-1) = 0.7 · (-1)² = 0.7

Solution (c): Since u = -1 < 0, the weight is 0.7 (high penalty for overestimating). The gradient pushes V(s) down toward Q(s, a1) = 5. With τ = 0.3, V will converge below the mean of in-data Q-values — a conservative estimate. If there were multiple in-data actions, V would settle at roughly the 30th percentile of Q-values.

Conversely, if u were positive (V too low), the weight would be |0.3 - 0| = 0.3 — a weaker push upward. This asymmetry is exactly what makes IQL conservative.

IQL Expectile Loss Explorer

Adjust τ to see how the expectile loss changes. With τ < 0.5, overestimates (V > Q) are penalized more heavily, pushing V to be conservative. With τ > 0.5, underestimates are penalized more, pushing V toward the max. The orange bars show Q-values for in-data actions; the teal line shows V(s).

τ (expectile) 0.30
V(s) 6.0

Comparison Table: Offline RL Approaches

MethodCore IdeaOOD HandlingKey Hyperparameter
BCQOnly consider actions similar to dataGenerative model filters actionsSimilarity threshold
CQLPenalize Q-values for OOD actionsRegularizer pushes Q down on unseen (s,a)Regularization strength α
AWRWeighted behavior cloningNever evaluates OOD — just reweights dataTemperature β
IQLExpectile regression + AWR extractionV fitted conservatively on in-data Q onlyExpectile τ, temperature β
Common exam mistakes:
• Saying "offline RL fails because the dataset is too small." No — even large datasets cause failure if the policy visits states not in the data. The problem is distributional, not statistical.
• Confusing τ direction in IQL: τ < 0.5 is conservative (what we want offline). τ > 0.5 is optimistic (dangerous offline).
• Thinking AWR learns a Q-function from scratch — it doesn't. It uses a pre-computed advantage to reweight existing behavior. The advantage comes from a separate value estimation step.
• Forgetting that IQL's key property is that it never queries Q at actions outside the dataset, not even during the Bellman backup.
Why can't we just apply SAC (an off-policy algorithm) directly to an offline dataset?
IQL uses τ = 0.3 in its expectile loss. What would happen if we set τ = 0.5?
In AWR, what role does the temperature β play?
🔨 Derivation Prove DAgger's Linear Regret Bound (vs BC's Quadratic) ✓ ATTEMPTED

Behavioral Cloning suffers O(εT2) regret due to compounding errors. DAgger achieves O(εT). This is the key theoretical advantage.

Your task: (1) State the no-regret property formally: after N rounds of DAgger, the policy πN has cost at most minπ∈Π J(π) + O(1/√N) under the LEARNER's distribution. (2) Prove this by showing DAgger is an instance of Follow-the-Leader in an online learning game. (3) Explain why this breaks the quadratic compounding: what assumption about the training distribution does BC violate that DAgger satisfies?

In round i, DAgger: (1) rolls out πi, collecting states from dπi. (2) Asks expert to label those states. (3) Trains πi+1 on ALL data so far (D1 ∪ ... ∪ Di). This is "Follow the Leader" in the online learning framework: at each round, play the best policy for all past data. FTL has O(1/N) average regret for convex losses.
BC trains on dπ* (expert's distribution) but is evaluated on dπθ (learner's distribution). This mismatch causes compounding. DAgger trains on a mixture: (1/N)∑i dπi. After N rounds, this mixture is close to dπN (the current policy's distribution). So the training distribution matches the test distribution — no distribution shift, no compounding.
At round N, the policy πN achieves per-step error εN under dπN (because it's trained on that distribution). Total cost = T · εN = O(εT). No T2 because there's no distribution shift to compound. The 1/√N term comes from the statistical error of fitting π with N rounds of data (standard generalization bound).

Formal statement (Ross, Gordon, Bagnell 2011): Let CN = (1/N)∑i=1N Es~dπi[ℓ(πN, π*, s)] be the average cost of πN over all visited distributions. Then:

CN ≤ minπ∈Π (1/N)∑i Edπi[ℓ(π, π*, s)] + O(1/√N)

Why O(εT) not O(εT2): At convergence (large N), πN achieves error ε under its OWN state distribution. Each step incurs ε error independently (no compounding because training matches deployment). Total: T·ε. In BC, step t incurs error proportional to t·ε (accumulated drift), giving ∑t tε = εT2/2.

The key assumption DAgger satisfies: The training data covers the learner's actual state distribution. BC violates this (trains on expert states, evaluated on learner states). This "covariate shift" is the root cause of quadratic regret.

Exam insight: "Why does BC compound errors quadratically?" → Distribution shift: policy visits states not in training data. "How does DAgger fix this?" → Iteratively collects data under the learner's distribution. "What's the tradeoff?" → DAgger requires online expert access (can't use a fixed dataset).

Checkpoint — Before you move on
Offline RL seems like it should be easy: just run Q-learning on a fixed dataset. But it catastrophically fails. Explain in one sentence WHY standard Q-learning fails offline, and in one sentence how CQL/IQL fix it.
✓ Gate cleared
Model Answer

The problem: The Bellman backup uses maxa' Q(s', a'), which evaluates Q at actions the dataset never contained — these out-of-distribution Q-values are wildly overestimated (the network hallucinating high value for unseen actions), causing the policy to chase phantom rewards.

CQL fix: Add a regularizer that pushes Q-values DOWN for actions not in the dataset: minQ Ea~π[Q(s,a)] - Ea~data[Q(s,a)] + standard Bellman loss. This makes out-of-distribution actions have pessimistically low Q-values.

IQL fix: Never evaluate Q at out-of-distribution actions AT ALL. Fit V(s) using expectile regression on Q-values from the dataset, then extract a policy via advantage-weighted regression: π ∝ πβ · exp(A/β). The advantage A = Q - V is computed only at dataset actions.

Chapter 7: Reward Learning

Every RL algorithm assumes a reward function r(s, a) exists. But who writes that function? For Atari games, the reward is the score — easy. For "make a robot set a table nicely" or "write a helpful email," defining reward mathematically is a nightmare. You know a good table setting when you see one, but translating that into r(s, a) is an open research problem.

Worse, hand-designed rewards get hacked. A classic example: a boat racing agent rewarded for collecting speed boosts learned to drive in circles collecting boosts, never finishing the race. The reward was technically maximized, but the behavior was absurd. This is reward misspecification — the optimizer finds a way to score high that violates the designer's intent.

The solution: learn the reward function from human input. There are two main approaches, and they use fundamentally different types of supervision.

From Demonstrations: Inverse RL

Inverse Reinforcement Learning (IRL) starts from expert demonstrations and asks: "What reward function would make this expert behavior optimal?" Given a dataset of expert trajectories Dexpert = {τ1, τ2, ...}, IRL finds r such that the expert policy scores higher than any other policy under r.

The problem is ill-defined — many reward functions explain the same behavior (including r = 0 everywhere). Modern approaches like adversarial IRL frame it as a GAN-like game: a discriminator tries to distinguish expert trajectories from agent trajectories, and the reward is derived from the discriminator's confidence.

From Preferences: The Bradley-Terry Model

Instead of requiring full demonstrations, we can learn from pairwise preferences. Show a human two trajectory segments τA and τB, and ask: "Which one is better?" This is much easier for humans — you don't need to be an expert to say "that robot arm motion looks smoother than that one."

The Bradley-Terry model converts these preferences into a reward learning objective. It models the probability that trajectory τw is preferred over τl as:

P(τw ≻ τl) = σ(r(τw) - r(τl))

Where σ is the sigmoid function, and r(τ) = ∑t r(st, at) is the total reward along the trajectory. The loss for training the reward model is the negative log-likelihood of observed preferences:

L(r) = -Ew, τl) [ log σ(r(τw) - r(τl)) ]

This is just binary cross-entropy — the same loss used in logistic regression. The reward model is trained to assign higher total reward to the trajectory the human preferred.

Why sigmoid? The Bradley-Terry model assumes humans make noisy comparisons. When the reward difference is large, the human almost always picks the better one (probability near 1). When rewards are close, it's nearly a coin flip (probability near 0.5). The sigmoid naturally captures this — steep in the middle, flat at the extremes.

The RLHF Pipeline

Reinforcement Learning from Human Feedback (RLHF) chains these ideas together in a three-phase pipeline:

Phase 1: Sample
Generate k trajectory pairs from the current policy. Present to human annotators.
Phase 2: Fit Reward
Train reward model rθ using Bradley-Terry loss on human preferences.
Phase 3: Optimize Policy
Run PPO (or similar) using learned rθ as the reward. KL-constrain against base policy.
↻ repeat

The KL constraint in Phase 3 is crucial: without it, the policy could exploit weaknesses in the learned reward model (reward hacking again, but now hacking the learned reward). The objective becomes:

maxπ Eπ[rθ(s,a)] - β · KL(π || πref)

Where πref is the original pre-trained policy and β controls how far the optimized policy can drift.

Worked Exam Problem

Problem: Two trajectories have learned rewards r(τ1) = 3, r(τ2) = 5. A human says τ2 ≻ τ1 (trajectory 2 is better).

(a) Compute P(τ2 ≻ τ1) under the Bradley-Terry model.

(b) Compute the loss for this single comparison.

(c) Now suppose r(τ1) = 3, r(τ2) = 3.1. How does the loss change, and what does this mean for training?

Solution (a): P(τ2 ≻ τ1) = σ(r(τ2) - r(τ1)) = σ(5 - 3) = σ(2).

σ(2) = 1 / (1 + e-2) = 1 / (1 + 0.135) = 1 / 1.135 ≈ 0.881.

The model is 88.1% confident that τ2 is better — which matches the human's judgment. Good.

Solution (b): L = -log σ(2) = -log(0.881) ≈ 0.127.

The loss is low because the model's prediction aligns with the human label. The gradient here is small — not much to learn.

Solution (c): Now P(τ2 ≻ τ1) = σ(3.1 - 3) = σ(0.1) = 1 / (1 + e-0.1) ≈ 1 / 1.905 ≈ 0.525.

L = -log(0.525) ≈ 0.644. The loss is 5x higher. The model is barely more confident than a coin flip that τ2 is better, but the human clearly said it was. The gradient here is large — the reward model needs to push r(τ2) higher and r(τ1) lower to separate them.

Exam insight: When two trajectories have similar rewards but the human has a clear preference, the Bradley-Terry loss is high and the gradients are strong. This is where the reward model learns the most — from "close calls" where the current model can't distinguish quality.
Bradley-Terry Preference Model

Adjust the reward scores for two trajectories. The sigmoid curve shows how reward difference maps to preference probability. Watch the loss and gradient magnitude change — close scores mean high loss and strong learning signal.

r(τ1) 3.0
r(τ2) [preferred] 5.0

Demos vs Preferences: A Comparison

From Demonstrations (IRL)From Preferences (RLHF)
Human inputFull expert trajectoriesPairwise "which is better?" labels
Requires expert?Yes — demonstrator must be skilledNo — any human can compare two options
ScalabilityLow — demonstrations are expensiveHigher — comparisons are cheap and parallelizable
Ill-posednessSevere — many rewards explain same behaviorMilder — preferences provide relative signal
Reward hacking riskModerateManaged via KL constraint to reference policy
Common exam mistakes:
• Confusing the reward model with the policy. The reward model rθ is a separate neural network trained on human preferences. The policy π is then trained using rθ as its reward.
• Forgetting the KL penalty in RLHF. Without it, the policy over-optimizes the reward model and "hacks" it — finding inputs where rθ gives high scores but actual human ratings would be low.
• Saying "RLHF learns from demonstrations." No — RLHF learns from preferences (pairwise comparisons). Learning from demonstrations is IRL/BC.
• Computing Bradley-Terry with σ(r(τl) - r(τw)) instead of σ(r(τw) - r(τl)). The winner's reward comes first.
In RLHF, why do we learn a reward model instead of directly optimizing the policy from human preferences?
What's the key difference between learning rewards from demonstrations (IRL) versus from preferences (RLHF)?
⚔ Adversarial: Reward Hacking in Learned Rewards
You train a reward model from 500 human preference comparisons. Then you run PPO against this reward model. After 1M steps, the reward model gives a score of 9.8/10, but when you show the resulting behavior to humans, they rate it 2/10. What went wrong?

Chapter 8: Model-Based RL

Every RL algorithm we've discussed so far is model-free: the agent interacts with the environment, collects rewards, and learns a policy or value function without ever trying to understand how the world works. That's like learning to drive by trial and error with zero understanding of physics. It works eventually, but you'll crash a lot.

Model-based RL (MBRL) flips the script. First, learn a dynamics model — a function pθ(s'|s, a) that predicts what state comes next given the current state and action. Then use this model to either generate synthetic training data or plan ahead by simulating future trajectories in your head.

The payoff is massive: typically 10-100x fewer real environment interactions. The cost: you need to manage model error, because wrong predictions compound catastrophically.

Dyna: Learn and Imagine

The simplest MBRL framework is Dyna, introduced by Sutton. The idea is beautifully straightforward:

Step 1: Interact
Collect real transition (s, a, r, s') from the environment. Add to replay buffer D.
Step 2: Train Model
Fit dynamics model pθ(s'|s, a) and reward model rφ(s, a) on D.
Step 3: Imagine
Generate K synthetic rollouts using the learned model. Add synthetic transitions to D.
Step 4: Update Policy
Train policy/value function on augmented D (real + imagined data).
↻ repeat from Step 1

For every 1 real environment step, Dyna does K imagined steps. If K = 10, the policy sees 11x as much data but the robot only moved once. That's the data efficiency win.

The Compounding Error Problem

Here's the fundamental tension. Suppose your model has 5% prediction error per step — pretty good for a neural network. After 1 step, your predicted state is 5% off. After 2 steps, the error compounds: you're predicting the next state from an already-wrong state. After H steps:

Error(H) ≈ ε · H    (linear, best case)
Error(H) ≈ ε · (1 + ε)H    (multiplicative, worst case)

For ε = 0.05 and H = 20 steps: linear gives 100% error (useless); multiplicative gives 0.05 × 2.65 ≈ 13.3% error. Either way, long rollouts in a learned model produce garbage. The policy trains on this garbage and learns fictional strategies that fail in the real world.

The model accuracy trap: A model can be 95% accurate per step and still be completely useless over 20-step rollouts. This is why naive Dyna with long rollouts fails catastrophically. You must either (a) keep rollouts short, (b) use ensembles to detect uncertainty, or (c) do both (MBPO).

MBPO: Short Rollouts + Ensembles

Model-Based Policy Optimization (MBPO) solves the compounding error problem with two key ideas:

1. Short branched rollouts. Instead of rolling out from a random initial state for 100 steps, MBPO starts from a real state sampled from the replay buffer and rolls out for only k steps (typically k = 1 to 5). This limits how far the imagined trajectory drifts from reality.

2. Ensemble of models. Train N models (typically 5-7) on the same data. When models disagree on a prediction, that state-action pair is high-uncertainty. MBPO uses the ensemble for rollouts and can detect when imagined data becomes unreliable.

MPC: Plan Online, Don't Learn a Policy

Model Predictive Control (MPC) takes a completely different approach: instead of using the model to train a policy, use it to plan at decision time. At each timestep:

1. Sample Action Sequences
Generate N random action sequences of length H: {a0, a1, ..., aH-1}i for i = 1..N
2. Simulate in Model
Roll out each sequence in the learned model: st+1 = fθ(st, at). Compute total reward for each.
3. Pick Best First Action
Execute only a0 from the highest-scoring sequence. Observe real s'. Discard the rest.
↻ replan from new s'

The receding horizon trick is essential: we only execute the first action and then replan from the new real state. Why? Because the model gets less accurate further into the future. By replanning every step, we always use the model's most accurate near-term predictions.

Planning with a Terminal Value Function

Pure MPC struggles with long-horizon tasks because the planning horizon H is limited by model accuracy. A clever fix: after rolling out H steps in the model, add a terminal value V̂(sH) that estimates the value of the final state. The total score for a sequence becomes:

Score = ∑t=0H-1 r(st, at) + V̂(sH)

Where V̂ is a learned value function (trained model-free on real data). This lets MPC reason about long-term consequences without needing a model accurate enough for long rollouts. Short model rollout + learned long-term value = best of both worlds.

Worked Exam Problem

Problem: A learned dynamics model has 5% error per step (additive). True dynamics: st+1 = st + at. Model predicts: ŝt+1 = st + at + 0.1 (systematic bias).

(a) Starting from s0 = 0 with at = 1 for all t, what is the true state after 10 steps?

(b) What does the model predict after 10 steps?

(c) What's the accumulated error, and why does this motivate short rollouts?

Solution (a): True dynamics: st+1 = st + 1. After 10 steps: s10 = 0 + 10 × 1 = 10.

Solution (b): Model prediction: ŝt+1 = ŝt + 1 + 0.1. The bias accumulates linearly. After 10 steps: ŝ10 = 0 + 10 × (1 + 0.1) = 10 + 10 × 0.1 = 11.

Solution (c): Error = |11 - 10| = 1. That's 10% of the true state — from a model with only 0.1 bias per step. After 100 steps, it would be 10. This linear accumulation is the best case — multiplicative errors compound even faster. MBPO limits rollouts to k ≤ 5 steps precisely to keep this compounding manageable.

Model Error Compounding

Adjust the per-step model error and rollout length. The teal line shows the true trajectory; the orange line shows the model's prediction. Watch how quickly they diverge with longer rollouts.

Per-step error 5%
Rollout length 10

Decision Table: When to Use What

ApproachUses Model ForStrengthWeakness
DynaSynthetic data generationSimple, works with any policy learnerLong rollouts → compounding error
MBPOShort branched rollouts + SACState-of-art data efficiencyNeeds ensemble overhead, short horizon
MPCOnline planning at each stepNo policy network needed, replans constantlySlow at test time (N rollouts per step)
MPC + V̂Short planning + learned valueLong-horizon reasoning with short rolloutsNeeds both model and value function
Common exam mistakes:
• Saying "MBPO uses long rollouts with an ensemble for error correction." No — ensembles detect uncertainty but don't fix it. MBPO uses short rollouts (k=1-5) to limit error.
• Explaining MPC without the receding horizon. The whole point is that you execute only the FIRST action and replan. If you execute the entire sequence, you lose the error-correction benefit of replanning.
• Confusing Dyna (uses model for data augmentation, then trains a policy) with MPC (uses model for planning, no explicit policy).
• Saying model error is a fixed percentage of the true state. Model error compounds — a 5% per-step error after 20 steps is not 5% total error.
MBPO limits model rollouts to k = 1-5 steps and starts from real states in the buffer. Why not roll out for 50 steps from random initial states?
In MPC, why do we only execute the FIRST action of the best sequence and then replan?
💻 Build It Implement GAE (Generalized Advantage Estimation) ✓ ATTEMPTED
GAE is the standard advantage estimator used in PPO. It interpolates between bias (TD, λ=0) and variance (MC, λ=1). The formula: AtGAE = ∑l=0 (γλ)l δt+l, where δt = rt + γV(st+1) - V(st) is the TD error. Implement this efficiently using the backward sweep.
python def compute_gae( rewards: np.ndarray, # (T,) rewards values: np.ndarray, # (T+1,) value estimates (includes V(s_T+1)) dones: np.ndarray, # (T,) episode termination flags gamma: float = 0.99, lam: float = 0.95 ) -> np.ndarray: """ Compute GAE advantages for a trajectory. Returns: advantages (T,) Key: when done[t]=True, V(s_{t+1}) should be 0 (episode ended, no future rewards). """ ...
Test case
rewards = np.array([1, 0, 0, 1, 0])
values = np.array([0.5, 0.4, 0.3, 0.2, 0.1, 0.0]) # T+1 values
dones = np.array([0, 0, 0, 0, 1])
advantages = compute_gae(rewards, values, dones, gamma=0.99, lam=0.95)
# advantages[4] should be: r4 + 0 - V(s4) = 0 + 0 - 0.1 = -0.1
# advantages[3] should be: delta3 + gamma*lam*adv4
Instead of computing the infinite sum directly (expensive), use the recursion: At = δt + γλ · At+1. Start from AT-1 = δT-1 and work backward. This is O(T) instead of O(T2). Critical: when dones[t]=True, set At = δt (no future terms, because the episode ended). Also: when dones[t]=True, δt = rt + 0 - V(st) (next state value is 0).
python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0

    for t in reversed(range(T)):
        # If episode ended at t, next value is 0
        next_non_terminal = 1.0 - dones[t]
        next_value = values[t + 1] * next_non_terminal

        # TD error: r + gamma*V(s') - V(s)
        delta = rewards[t] + gamma * next_value - values[t]

        # GAE recursion (zeroed across episode boundaries)
        last_gae = delta + gamma * lam * next_non_terminal * last_gae
        advantages[t] = last_gae

    return advantages

# Returns = advantages + values (for value function target)
# returns = advantages + values[:T]
Bonus challenge: Vectorize this for a batch of parallel environments (shape: (num_envs, T)). The done masking must be per-environment.
Checkpoint — Before you move on
Model-based RL learns a dynamics model and plans through it. Goal-conditioned RL conditions the policy on a desired goal. Explain: could you combine them? What would a "goal-conditioned model-based" algorithm look like, and what advantage would it have?
✓ Gate cleared
Model Answer

A goal-conditioned model-based algorithm: (1) Learn a dynamics model s' = f(s, a). (2) Given a goal g, plan through the model to find action sequence that reaches g. (3) Execute the first action, re-plan from the new state.

This is literally MPC (Model Predictive Control) with a goal-reaching objective. The advantage: you can reach ANY goal without retraining — just change the planning objective. This is more flexible than a goal-conditioned policy (which needs to be trained on goals) and more sample-efficient than model-free GCRL.

Real example: Visual Foresight (Finn & Levine 2017) learns a video prediction model, then plans actions that make the predicted video match a goal image. Modern version: Dreamer with HER-style goal relabeling. The model provides the "imagination" for planning; the goal provides the "what to plan toward."

Chapter 9: Goal-Conditioned & Multi-Task RL

Standard RL trains a policy for one task with one reward function. But real agents need to handle many tasks. A household robot needs to set the table, wash dishes, and fetch objects. Training a separate policy for each is wasteful — the physics of grasping is the same whether you're picking up a mug or a plate.

Multi-task RL trains a single policy conditioned on a task identifier zi. The policy becomes π(a | s, zi), and the state effectively becomes the pair (s, zi). Different task identifiers activate different behaviors while sharing the underlying representations.

Goal-Conditioned RL: A Special Case

Goal-conditioned RL (GCRL) is the most natural version of multi-task RL. The task identifier is simply a goal state sg — the state you want to reach. The policy is π(a | s, sg), and the reward is typically:

r(s, a, sg) = -d(s, sg)    (dense, distance-based)
r(s, a, sg) = 1(||s - sg|| < ε)    (sparse, binary success)

Dense rewards give the agent a gradient to follow at every step ("you're getting warmer"). Sparse rewards are more natural but harder to learn from ("you either reached the goal or you didn't"). Most real-world goals are sparse — the mug is either on the coaster or it isn't.

The sparse reward trap: With sparse rewards and random exploration, the probability of accidentally reaching a distant goal is exponentially small in the distance. An agent trying to reach (5, 5) from (0, 0) on a 10×10 grid with random actions might never get there. No success = no positive reward signal = no learning. This is the curse of sparse rewards, and it's the problem HER was designed to solve.

Hindsight Experience Replay (HER)

HER is one of the cleverest tricks in RL. The insight: even when the agent fails to reach its intended goal, it did reach some state. Why not pretend that state was the goal all along?

Concretely: the agent tries to reach goal g = (5, 5) but ends up at (3, 2). With the original goal, the entire episode has reward 0 (failure). But if we relabel the goal to g' = (3, 2), then the final transition gets reward 1 (success!). The agent learns "here's how to reach (3, 2)" — and that knowledge transfers to reaching other nearby goals.

1. Collect Episode
Try to reach goal g. Trajectory: s0 → s1 → ... → sT. Failed (sT ≠ g).
2. Store Original
Store transitions with original goal g and reward 0. Policy failed, but keep the data.
3. Relabel Goals
Pick achieved states (e.g. sT, sT/2) as new goals g'. Recompute rewards. Store as additional transitions.
4. Train
Off-policy algorithm (DQN, DDPG, SAC) trains on both original and relabeled transitions.
↻ repeat

HER has three requirements that are easy to forget on an exam:

RequirementWhy
Same dynamics across goalsRelabeling changes the goal, not the physics. If different goals had different transition dynamics, the relabeled transition would be invalid.
Evaluatable rewardWe need to recompute r(s, a, g') for the new goal g'. This requires knowing the reward function — we can't relabel if reward is a black box.
Off-policy algorithmRelabeled data was collected under a policy pursuing a different goal. Only off-policy methods can learn from data generated by a different behavioral distribution.

Worked Exam Problem

Problem: An agent in a 2D grid aims for goal g = (5, 5). It executes a 4-step episode: (0,0) → (1,1) → (2,1) → (3,2) → (3,2). Reward = -||s - g|| at each step, plus a bonus of +10 if s = g.

(a) What's the total reward with the original goal?

(b) We relabel the goal to g' = (3, 2). What's the reward at the final step?

(c) Using the "final" HER strategy (relabel with sT), how many new training transitions does HER create from this one episode?

Solution (a): Distances at each step:
t=0: ||((0,0) - (5,5)|| = √(50) ≈ 7.07 → reward = -7.07
t=1: ||(1,1) - (5,5)|| = √(32) ≈ 5.66 → reward = -5.66
t=2: ||(2,1) - (5,5)|| = √(25) = 5.0 → reward = -5.0
t=3: ||(3,2) - (5,5)|| = √(13) ≈ 3.61 → reward = -3.61
No bonus (never reached g). Total ≈ -21.34. All negative, all failure.

Solution (b): With g' = (3, 2): at t=3, ||(3,2) - (3,2)|| = 0 → reward = 0 + 10 = 10. Success!

Solution (c): The "final" strategy relabels with sT = (3, 2). Each of the 4 transitions in the episode gets a copy with g' = (3, 2) and recomputed rewards. So HER creates 4 additional transitions (one per timestep, each paired with the new goal). Combined with the 4 original transitions, we now have 8 total — and 4 of them end in success instead of 0.

Why this matters: Before HER, the agent saw 4 transitions with total reward -21.34 and learned nothing about how to succeed. After HER, it sees 4 additional transitions that include a +10 success reward. From one failed episode, we manufactured positive training signal. Multiply this across thousands of episodes and the agent rapidly learns to reach diverse goals.
HER Goal Relabeling

The agent (circle) tried to reach the orange goal but ended up at its final position. Click Relabel to see HER create a new teal goal at the achieved position, turning failure into success. Click New Episode to generate a random trajectory.

Common exam mistakes:
• Saying HER works with on-policy algorithms. It doesn't — relabeled data was collected pursuing a different goal, making it off-policy by definition. Only off-policy methods (DQN, DDPG, SAC) can use this data.
• Thinking HER changes the environment dynamics. HER only changes the goal and recomputes the reward. The transitions (s, a, s') are unchanged.
• Forgetting that HER requires a known, computable reward function. If reward is a black box (e.g., a human rating), you can't recompute it for a new goal.
• Claiming HER can work when different tasks have different dynamics. If task 1 has friction=0.1 and task 2 has friction=0.5, a trajectory from task 1 can't be relabeled as task 2 — the state transitions would be different.
Why does HER only work with off-policy algorithms like DDPG or SAC?
Can you apply HER if different goals correspond to different physical dynamics (e.g., different friction coefficients)?
🏗 Design Challenge You're the Architect: Choosing the Right RL Algorithm ✓ ATTEMPTED
Given a new RL problem, you must select the right algorithm class. Build a decision tree: for each constraint listed below, determine which algorithm family applies and WHY.
Scenario A
Fixed dataset, no environment access
Scenario B
Expensive env, ~1000 episodes budget
Scenario C
Fast sim, continuous actions, 10M steps
Scenario D
Expert available, multi-modal actions
Scenario E
Sparse reward, goal-conditioned tasks
Scenario F
Safety-critical, must stay near known behavior
1. For each scenario, name the best algorithm class (offline RL, model-based, SAC/PPO, imitation learning, goal-conditioned, conservative) and the specific algorithm.
2. What's the first thing that would fail if you used the WRONG algorithm in each scenario?
3. Draw a decision tree: Data? → Online or Offline? → Continuous or discrete? → Model available? → ...

A (fixed dataset, no env): Offline RL. IQL for simplicity + stability, CQL if you need guaranteed pessimism. NEVER use SAC/DQN online — they require env interaction. BC works if dataset is expert-quality; offline RL is better for mixed-quality datasets.

B (expensive env, 1000 episodes): Model-based RL (MBPO, Dreamer). Learn a world model from real data, generate synthetic rollouts for policy training. 10-100x more sample efficient than model-free. Alternatively: a few demos + fine-tuning (warm-start with BC, then improve with RL).

C (fast sim, continuous, 10M steps): SAC (off-policy, entropy regularization, continuous actions) or PPO (on-policy, simpler, more robust). SAC is more sample efficient. PPO is more stable and easier to tune. For robotics with Isaac Lab: PPO with 4096 parallel envs.

D (expert available, multi-modal): DAgger (iterative) or Diffusion Policy (handles multi-modality). BC fails for multi-modal because MSE averages modes. Diffusion Policy represents the full action distribution. IBC (implicit BC) also handles multi-modality.

E (sparse reward, goal-conditioned): HER (Hindsight Experience Replay) + goal-conditioned SAC. HER retroactively relabels failed episodes as "successes for a different goal," dramatically improving sample efficiency under sparse rewards.

F (safety-critical, stay near known): Offline RL with strong conservatism (CQL with high α), or constrained policy optimization (CPO). The key requirement: never try actions far from the dataset. Conservative methods provide this guarantee by construction.

⚔ Adversarial: The Deadly Triad
You're running DQN (off-policy, function approximation, bootstrapping) and notice Q-values diverging to infinity. You can remove exactly ONE element of the deadly triad. Which removal is most practical while still being useful?
⚔ Adversarial: When Exploration Hurts
You're training SAC on a robot reaching task. The entropy bonus encourages exploration. After 500K steps, the policy reaches the goal 95% of the time. But you notice it takes highly variable paths (sometimes looping). Is this a problem? When would you want to REDUCE entropy?
🔗 Pattern Recognition
The Same Three Tensions Run Through All of RL
Exploration vs Exploitation
ε-greedy (DQN), entropy bonus (SAC), UCB (bandits), curiosity (RND).
Same tension everywhere: try new things vs use what works.
Bias vs Variance
MC (unbiased, high variance) vs TD (biased, low variance).
GAE(λ) interpolates: λ=0 is TD, λ=1 is MC.
Same tradeoff in every estimator.

A third tension runs beneath both: on-policy vs off-policy. On-policy (PPO) is stable but sample-wasteful. Off-policy (SAC, DQN) reuses data but risks divergence. Offline RL takes this to the extreme: ALL data is off-policy. Every algorithm in this course is a point in this 3D tradeoff space. Understanding WHERE each algorithm sits (and WHY) is the deepest understanding you can have of the field.

Place these algorithms in the 3D space (exploration strategy, bias-variance point, on/off-policy): PPO, SAC, DQN, CQL, REINFORCE, Dreamer. Which two are closest? Which two are furthest apart?

Chapter 10: Interactive Exam Simulator

Time to test yourself. This is a randomized exam with questions drawn from all nine topics covered in this review. Each question is the kind you'd see on a real CS224R final — not trivia, but questions that test whether you truly understand the concepts.

The questions are intentionally hard. Some require computation. Some have trick answers that test whether you understand subtle distinctions (like on-policy vs off-policy, or what baselines actually do to the gradient). Some require you to trace through an algorithm step by step. This is the level you need to be at.

How it works: Click Start Exam to begin. You'll see one question at a time with four choices. Select your answer and click Submit to see if you're right — every answer comes with a detailed explanation of why it's correct and why the alternatives are wrong. Click Next Question for another random question. Use the topic filters to drill your weak areas. Aim for 80%+ before the real exam.
Study strategy: Do a first pass through all topics. Note which topics you score below 70% on. Use the topic filter buttons to drill those specific areas. Then do a final full pass. The questions are randomized each time, so you'll see different ones on each attempt.
CS224R Exam Simulator
Score: 0 / 0 correct (0%)

36 exam-quality questions across 9 topics.
Randomized each time. Detailed explanations for every answer.

The score bar above tracks your progress. Green segments = correct, red = incorrect. Use topic filters to drill weak areas. Questions repeat if you cycle through all of a topic's questions. Your score resets when you change the filter.

Exam Strategy Notes

Time management on the real exam:
• Computation questions (Bellman update, Bradley-Terry, expectile loss): budget 3-4 minutes. Write the formula first, then plug in numbers.
• Conceptual "why" questions: budget 2 minutes. These test understanding, not calculation. If you're writing more than 3 sentences, you're overcomplicating it.
• "What happens if..." questions: budget 2-3 minutes. Trace through the algorithm step by step with the changed parameter. Don't guess — derive.
• If stuck on a multiple choice, eliminate wrong answers first. "Always" and "never" answers are usually wrong (except in RL where some things genuinely never work, like HER with on-policy algorithms).
The three questions that classify every algorithm:
1. Does it need a reward function? (BC/DAgger: No. Everything else: Yes, either given or learned.)
2. Does it need environment interaction? (Offline RL: No. Everything else: Yes, or at least for data collection.)
3. On-policy or off-policy? (REINFORCE/A2C/PPO: On. DQN/SAC/AWR/IQL/HER: Off.)

These three axes uniquely identify every algorithm in the course. Master this classification and you can answer half the exam from first principles.

Chapter 11: Cheat Sheet & Connections

This is your one-page reference for the entire course. Print it, memorize it, take it into the exam in your head. Every algorithm, every key equation, every common trap — all in one place.

Master Equation Table

Algorithm Key Equation Data On/Off Key Property
BC min E[ℓ(πθ(s), aexpert)] Expert demos N/A Supervised, no reward needed
DAgger BC + query expert on visited states Expert demos + online queries N/A Fixes distribution shift in BC
REINFORCE ∇J = E[∑t ∇logπ(at|st) · R(τ)] On-policy rollouts On High variance, unbiased
A2C ∇J = E[∇logπ(a|s) · A(s,a)] On-policy rollouts On Baseline reduces variance
PPO L = min(rtAt, clip(rt, 1±ε)At) On-policy (few reuses via IS) On Clipped IS for stability
SAC max E[∑ r + αH(π)] Replay buffer Off Entropy-regularized, explores well
DQN L = (r + γ maxa' Qtarget(s',a') - Q(s,a))² Replay buffer Off Target network stabilization
AWR π = argmax E[exp(A/β) logπ(a|s)] Fixed offline dataset Off Weighted BC — never evaluates OOD
IQL Lτ = |τ - 1(u<0)| · u² Fixed offline dataset Off Conservative V via expectile regression
Dyna Train model pθ(s'|s,a), generate synthetic data Real + imagined Either Data efficiency via imagination
MPC a* = argmaxa0:H ∑ r(st,at) Model-based planning N/A No explicit policy, replans each step
HER Relabel g → g' = sachieved Replay buffer + relabeled Off Turns failures into successes

Decision Flowchart: Which Algorithm?

Start here → Do you have expert demonstrations?

• Yes, demos available:
  → Do you have a reward function? NoBC (simplest) or DAgger (if expert is queryable)
  → Do you have a reward function? YesInverse RL to learn reward, then RL

• No demos, but have reward function:
  → Can you interact with the environment?
    • Yes, many interactions OKPPO (on-policy, stable) or SAC (off-policy, sample efficient)
    • Yes, but interactions expensiveMBPO (model-based, 10-100x fewer interactions) or MPC (planning)
    • No interaction at all (fixed dataset)IQL or AWR (offline RL)

• No demos, no reward, but have human preferences:
  → RLHF (Bradley-Terry reward learning + PPO)

• Goal-conditioned tasks with sparse rewards:
  → HER + off-policy algorithm (DDPG, SAC)

Common Exam Traps

MisconceptionTruth
"Baselines change the expected gradient in policy gradient methods" Baselines change variance only. E[∇logπ · b] = 0 for any state-dependent baseline b(s). The expected gradient is unchanged.
"Q-learning is on-policy" Off-policy. The Bellman target uses maxa' Q(s', a') — the max over all actions, regardless of which action the behavior policy took.
"PPO is off-policy because it reuses data" On-policy. PPO collects fresh rollouts each iteration. It reuses each batch for a few gradient steps via importance sampling, but the data is still on-policy. The clipping ensures updates stay close to on-policy.
"DAgger requires a reward function" No reward needed. DAgger only needs an expert that can label states with the correct action. It's pure imitation learning with interactive expert queries.
"BC suffers from compounding error because of bad training data" BC fails because of distribution shift: the learner visits states the expert never demonstrated, not because the expert data is bad. DAgger fixes this by querying the expert on the learner's visited states.
"Offline RL fails because datasets are too small" Even huge datasets cause failure if the learned policy visits OOD state-action pairs. The problem is distributional shift, not sample size.
"Model-based RL is always more sample efficient" Only if the model is accurate enough. A bad model generates misleading synthetic data that can hurt worse than no data at all.
"HER works with any RL algorithm" HER requires off-policy algorithms only. Relabeled data was collected under a policy pursuing a different goal — it's off-policy by construction.
"In RLHF, the policy is trained directly on human preferences" Two-stage: first train a reward model from preferences (Bradley-Terry), then optimize the policy using that reward model with PPO + KL constraint.
"Entropy bonus in SAC encourages exploitation" Entropy bonus encourages exploration by penalizing deterministic policies. max H(π) = max randomness.
"IQL uses τ > 0.5 for conservative estimates" τ < 0.5 is conservative (penalizes overestimates more). τ > 0.5 would be optimistic — chasing the max, which is dangerous offline.
"The discount factor γ only affects how far ahead the agent plans" γ also affects the effective horizon and the variance of returns. Lower γ reduces variance but makes the agent myopic. Higher γ captures long-term rewards but increases variance.

Detailed Lessons

Each topic covered in this review has a full standalone lesson with deeper derivations, more simulations, and extended worked examples:

Exam TopicFull LessonKey Simulation
Imitation LearningImitation LearningDAgger vs BC on distribution shift
Policy GradientsPolicy GradientsREINFORCE variance reduction
Q-Learning / DQNRL AlgorithmsQ-table grid world with live updates
Actor-Critic / PPO / SACDeep RL TheoryPPO clipping visualization
Offline RL(in review)IQL expectile loss explorer
Reward Learning / RLHFReward LearningBradley-Terry preference model
Model-Based RLModel-Based RLError compounding + MPC planner
Goal-Conditioned RL / HERMulti-Task & Goal-Conditioned RLHER relabeling visualization

Formula Quick Reference

Policy Gradient Family

∇J = E[∑t ∇logπ(at|st) · Φt]

Φt = R(τ) for REINFORCE
Φt = A(st,at) for A2C
Φt = clipped IS · A for PPO

Bellman Equation

Q*(s,a) = r + γ maxa' Q*(s', a')

IQL Expectile Loss

Lτ(u) = |τ - 1(u<0)| · u²

Bradley-Terry

P(τw ≻ τl) = σ(r(τw) - r(τl))

RLHF Objective

max E[rθ(s,a)] - β KL(π||πref)

PPO Clip

L = min(rtAt, clip(rt, 1±ε)At)

SAC Objective

max E[∑ r(s,a) + α H(π(·|s))]
Exam strategy: For every question, ask yourself three things: (1) Is this on-policy or off-policy? (2) Does it need a reward function? (3) Does it need environment interaction? These three axes separate every algorithm in the course. When in doubt, draw the flowchart above.

"What I cannot create, I do not understand." — Richard Feynman