Stanford CS 224R — Spring 2026

Deep RL Exam Review

Every algorithm, every equation, every exam trap. Interactive practice problems for the closed-book midterm.

Prerequisites: Attended CS 224R lectures + Done the homeworks

Chapters

36+

Practice Qs

Topics

Chapter 0: The Map — RL Algorithm Taxonomy

You have 80 minutes. The exam is closed-book. There are roughly 26 slides worth of material spanning imitation learning, policy gradients, actor-critic, off-policy, Q-learning, offline RL, reward learning, model-based RL, and goal-conditioned RL. That's a lot of ground.

Before diving into any single topic, you need the map — a mental taxonomy that tells you where every algorithm lives, what data it needs, and when you'd reach for it. If someone names an algorithm on the exam, you should instantly know: is it on-policy or off-policy? Does it need a simulator? Can it work from a fixed dataset? Does it learn a value function, a policy, or both?

Let's build that map from first principles.

The single most important exam skill: when you see an algorithm name, immediately classify it. "DQN → off-policy, value-based, model-free, needs replay buffer." "Behavior Cloning → offline, imitation learning, no reward signal needed." This classification tells you what equations apply, what failure modes to expect, and what comparisons are valid. Get the taxonomy right and the rest follows.

The Two Big Splits

Every RL algorithm lives somewhere in a 2×2 grid defined by two questions:

Question 1: Where does the data come from?

If you can interact with the environment during training — take actions, observe outcomes, collect new data — you're in the online regime. If you're stuck with a fixed dataset collected by someone else (or a previous version of your policy), you're in the offline regime.

Question 2: Can you reuse old data?

If your algorithm requires fresh data from the current policy after every update, it's on-policy. If it can learn from data collected by any policy (stored in a replay buffer), it's off-policy. This distinction only applies within the online regime.

Regime	Category	Algorithms	Data Requirement
Online	On-policy	REINFORCE, vanilla PG, A2C	Fresh rollouts from current π_θ
Online	Off-policy	DQN, SAC, TD3, DDPG	Replay buffer of past transitions
Offline	Imitation Learning	Behavior Cloning, DAgger, Flow IL	Expert demonstrations (s, a) pairs
Offline	Offline RL	AWR, IQL, CQL	Fixed dataset of (s, a, r, s') tuples

Where does PPO fit? This is a classic exam trick question. PPO uses importance sampling to reuse data from a slightly old policy — but it clips the ratio so tightly that it's effectively on-policy. Most courses classify PPO as on-policy with a small off-policy correction. On an exam, call it "on-policy with clipping."

Where does SAC fit? SAC maintains a replay buffer, samples uniformly from past transitions, and trains a Q-function using off-policy Bellman backups. It's firmly off-policy. The maximum-entropy objective (adding an entropy bonus) is orthogonal to the on/off-policy distinction.

The Third Dimension: What Do You Learn?

Algorithms also differ in what function they learn:

Type	Learns	How it Acts	Examples
Value-based	Q(s, a) or V(s)	π(s) = argmax_a Q(s, a)	DQN, Double DQN
Policy-based	π_θ(a\|s)	Sample from π_θ	REINFORCE, vanilla PG
Actor-Critic	Both π_θ and V_φ	Actor samples, critic evaluates	A2C, PPO, SAC, TD3
Model-based	Dynamics p(s'\|s,a)	Plan using the model	MBPO, Dreamer, MPC

Actor-Critic is the dominant paradigm. Nearly every modern algorithm you'll see on the exam (PPO, SAC, TD3, A2C) is actor-critic. The actor provides the policy; the critic provides low-variance gradient estimates. REINFORCE (no critic) is the teaching tool; actor-critic is what people actually use.

Interactive Taxonomy

RL Algorithm Taxonomy — Click Any Algorithm

Click on an algorithm node to see its key equation, data needs, and typical use case. Orange = offline, Teal = on-policy, Blue = off-policy, Purple = model-based.

Click any algorithm above to see its details.

Decision Flowchart: Choosing an Algorithm

Start

Can you interact with the environment?

↓ Yes → Online

Online

Is the action space discrete or continuous?

↓ Discrete → DQN family | Continuous → SAC/PPO/TD3

Continuous

Need sample efficiency? → SAC (off-policy). Need simplicity? → PPO (on-policy).

↑ Back to Start

Start → No

Do you have expert demonstrations?

↓ Yes → Imitation Learning | No (but have rewards) → Offline RL

Unimodal actions? → Behavior Cloning (MSE). Multimodal? → Flow Matching / Diffusion Policy.

Exam strategy: when the exam describes a scenario ("you have 1000 demonstrations but no simulator"), your first job is to place it on this map. The map eliminates 80% of the algorithm options immediately. Then you pick within the remaining category based on the specific constraints (action space, compute budget, distribution shift concerns).

Your robot can only learn from a fixed dataset of 1000 demonstrations collected by a human teleoperator. No reward labels are available. Which category of algorithms applies?

On-policy RL (REINFORCE, A2C) Off-policy RL (DQN, SAC) Imitation Learning (Behavior Cloning, DAgger) Offline RL (CQL, IQL)

SAC uses a replay buffer to store and reuse past transitions. Is SAC on-policy or off-policy?

On-policy — it updates the current policy Off-policy — it learns from transitions collected by older policies via the replay buffer Neither — SAC is an imitation learning algorithm

PPO uses importance sampling ratios to reuse data from a slightly older policy. Why is it still classified as "on-policy"?

Because the clipping constraint forces the new policy to stay so close to the old one that the data is effectively "current" — it can't reuse data from many updates ago Because it doesn't use a replay buffer Because importance sampling makes any algorithm on-policy

Chapter 1: Imitation Learning

Forget rewards. Forget value functions. Forget Bellman equations. Imitation learning (IL) asks the simplest possible question: "I have recordings of an expert doing the task. Can I just copy them?"

The answer is yes — with caveats that will show up on your exam. IL turns RL into supervised learning: given state s, predict the action a the expert would take. No reward signal, no environment interaction during training (at least for behavior cloning), no temporal credit assignment. Just input-output pairs.

But the devil is in two details: how you parameterize the policy (regression vs. generative) and how you handle distribution shift (vanilla BC vs. DAgger).

Approach 1: Regression (Deterministic Policy)

The simplest approach: treat actions as real-valued outputs and minimize mean squared error (MSE):

L(θ) = (1/N) ∑_i || π_θ(s_i) − a_i ||²

Where π_θ(s) is a neural network that takes a state and outputs a single deterministic action. This is literally supervised regression — the same loss you'd use to predict house prices from features.

When does this work? When the expert's behavior is unimodal — for any given state, there's essentially one correct action. A self-driving car on a straight highway: the expert always goes straight. MSE loss works perfectly.

When does this break? When the expert's behavior is multimodal. This is the critical failure mode you need to know for the exam.

The Multimodal Action Problem (Exam Favorite). Imagine playing Flappy Bird. An obstacle appears. The expert sometimes goes over it (action = +1) and sometimes goes under it (action = −1). Both are valid. What does MSE-optimal regression predict? The average: action = 0. That means fly straight into the obstacle. The model learns the one action the expert never takes. Averaging two good modes produces a bad mode.

Approach 2: Generative (Stochastic Policy)

Instead of predicting a single action, model the full conditional distribution p_θ(a | s) and maximize log-likelihood:

L(θ) = −(1/N) ∑_i log p_θ(a_i | s_i)

This is the key difference. Log-likelihood doesn't average modes — it assigns probability mass to each mode independently. A mixture of Gaussians, a diffusion model, or a flow matching model can all represent multimodal distributions. The Flappy Bird policy would learn two peaks: one at +1 (go over) and one at −1 (go under), with near-zero probability at 0 (crash).

Common generative architectures for IL:

Model	How It Represents Multimodality	Inference
Gaussian Mixture	K explicit Gaussian components	Sample from mixture
Diffusion Policy	Iterative denoising from noise → action	N denoising steps
Flow Matching	Learn velocity field from noise → action	Euler integration

Flow Matching for IL

Flow matching is the generative approach emphasized in CS224R. The idea: learn a velocity field that transports random noise into expert actions. Here's how it works step by step.

Step 1: Define the interpolation. Given a noise sample a₀ ~ N(0, I) and an expert action a₁, create a path between them:

a_τ = τ · a₁ + (1 − τ) · a₀, τ ∈ [0, 1]

At τ = 0, we have pure noise. At τ = 1, we have the expert action. The path is a straight line in action space.

Step 2: Compute the target velocity. The velocity along this path is just the derivative with respect to τ:

u(a_τ, τ) = a₁ − a₀

This is the "ground truth" direction: from noise toward the expert action.

Step 3: Train a velocity network. Learn v_θ(s, a_τ, τ) to predict this velocity:

L(θ) = E_{τ, a₀, a₁} || v_θ(s, a_τ, τ) − (a₁ − a₀) ||²

The network takes the state, the noisy action, and the timestep, and predicts which direction to push.

Step 4: Inference by Euler integration. Start from noise a₀ ~ N(0, I), then take K small steps:

a_{τ + Δτ} = a_τ + Δτ · v_θ(s, a_τ, τ)

After K steps (e.g., K = 10, Δτ = 0.1), you arrive at a clean expert-like action. The velocity field has learned to route noise toward different modes, so multimodal distributions emerge naturally.

Why flow matching over diffusion? Flow matching uses straight-line paths (simpler, faster) while diffusion uses curved paths defined by a noise schedule. Both can represent multimodal distributions, but flow matching typically needs fewer integration steps at inference time. For an exam: flow matching = straight paths, diffusion = curved paths, both solve the multimodal problem.

DAgger: Fixing Distribution Shift

Even with a perfect generative model, behavior cloning has a fundamental problem: compounding errors.

During training, the policy sees states from the expert's distribution — the states an expert would visit. At test time, the policy makes a small mistake at step 1, which puts it in a slightly different state at step 2. It's never seen this state before, so it makes a bigger mistake. By step 100, the policy is in completely unfamiliar territory and crashes.

The error grows quadratically with the horizon T. If the per-step error is ε, the total compounded error after T steps is O(εT²), not O(εT). This quadratic blowup is what makes vanilla BC fragile for long-horizon tasks.

DAgger (Dataset Aggregation) fixes this by iteratively collecting corrective labels:

Step 1

Train π_θ on expert dataset D

↓

Step 2

Roll out π_θ in the environment, collecting states s₁, s₂, ...

↓

Step 3

Ask the expert to label these states: "What would you do here?"

↓

Step 4

Add (s_i, a_expert) to D. Go to Step 1.

↻ repeat until convergence

The key insight: DAgger trains on the policy's own distribution of states, not just the expert's. After enough iterations, the dataset covers both the expert's trajectory and all the "mistake states" the policy visits, so the policy learns to recover from its own errors.

DAgger reduces the error bound from O(εT²) to O(εT) — linear in the horizon, not quadratic. The cost: you need an expert available to label on-demand during training.

Worked Problem: Regression vs. Generative

Exam problem: A dataset D contains 100 demonstrations of a 1D task. At state s = 3.0, the expert took action a = +2 in 60 demonstrations and a = −2 in 40 demonstrations. Compute: (a) the MSE-optimal prediction, and (b) the log-likelihood-optimal prediction under a 2-component Gaussian mixture.

(a) MSE-optimal prediction:

MSE minimizes E[(a − π(s))²]. The minimizer is the conditional mean:

π*(s=3) = E[a | s=3] = 0.6 × (+2) + 0.4 × (−2) = 1.2 − 0.8 = +0.4

The model predicts +0.4 — an action the expert never took. If +2 means "go left" and −2 means "go right," then +0.4 means "drift vaguely leftward" — probably the worst possible action.

(b) Log-likelihood-optimal GMM:

A 2-component Gaussian mixture with small variance would learn:

p(a | s=3) = 0.6 · N(+2, σ²) + 0.4 · N(−2, σ²)

At inference, sampling from this distribution gives +2 about 60% of the time and −2 about 40% — matching the expert's actual behavior. No probability mass near zero.

Regression vs. Generative on Bimodal Data

The expert's action distribution has two modes. Orange line = MSE prediction (the mean). Teal curves = generative model (GMM). Drag the mode separation slider to see when regression becomes catastrophic.

Mode Separation 2.0

Mode Balance (% left) 60%

Common exam mistakes — Imitation Learning:
• Forgetting that MSE predicts the mean, not the mode. When actions are bimodal, the mean lies between modes where no expert data exists.
• Saying DAgger "collects more expert demonstrations." Wrong — DAgger runs the learner's policy and asks the expert to label the learner's states. The key is covering the learner's distribution.
• Confusing the flow matching loss (predict velocity a₁ − a₀) with the diffusion loss (predict noise ε). Both are regression losses, but the target is different.
• Saying BC error grows linearly with T. It grows quadratically — O(εT²). DAgger reduces it to O(εT).

In behavior cloning with MSE loss, the expert's action distribution at a particular state is bimodal with peaks at a = +3 and a = −3 (equal probability). What does the trained policy predict?

Either +3 or −3 randomly 0 (the mean of the two modes) +3 (the more common mode) It depends on the network architecture

In DAgger, whose distribution of states is used to collect new training data after the first round?

The expert's distribution The learner's (current policy's) distribution A uniform random distribution The replay buffer distribution

In flow matching for IL, the training loss minimizes || v_θ(s, a_τ, τ) − ??? ||². What is the regression target?

The noisy action a_τ The expert action a₁ The velocity (a₁ − a₀), the direction from noise to expert action The noise sample a₀

Chapter 2: Policy Gradients

Imitation learning sidesteps reward entirely. But most RL problems don't come with expert demonstrations — they come with a reward function. Policy gradient methods directly optimize the expected cumulative reward by taking gradients of the policy parameters.

The core idea: if a trajectory gets high reward, increase the probability of the actions in that trajectory. If it gets low reward, decrease them. No value function needed (yet). No model of the environment. Just rollouts and gradient updates.

The REINFORCE Gradient

We want to maximize the expected return:

J(θ) = E_{τ ~ π_θ} [R(τ)] = E_{τ ~ π_θ} [∑_t=0^T r(s_t, a_t)]

The problem: we can't backpropagate through the environment (it's a black box). The solution is the log-derivative trick (also called the REINFORCE trick or score function estimator):

∇_θ J(θ) = E_{τ ~ π_θ} [(∑_t=0^T ∇_θ log π_θ(a_t | s_t)) · R(τ)]

In practice, we estimate this with N sampled trajectories:

∇_θ J ≈ (1/N) ∑_i=1^N [(∑_t=0^T ∇_θ log π_θ(a_tⁱ | s_tⁱ)) · R(τ_i)]

Let's unpack each symbol:

Symbol	Meaning
∇_θ log π_θ(a\|s)	The score function — which direction in parameter space makes action a more likely at state s
R(τ)	Total return of trajectory τ = ∑ r(s_t, a_t)
N	Number of sampled trajectories (batch size)
T	Trajectory length (horizon)

Intuition: ∇_θ log π_θ(a|s) points toward parameter values that make action a more likely. Multiplying by R(τ) scales this — high-reward trajectories get large positive updates (make these actions more probable), low-reward trajectories get small or negative updates (make these actions less probable). It's a form of trial-and-error weighted by success.

Improvement 1: Causality (Reward-to-Go)

The vanilla REINFORCE formula weights each action's gradient by the total trajectory reward R(τ). But action a_t at time t can't affect rewards that already happened at times 0, 1, ..., t−1. Including past rewards adds noise without information.

The reward-to-go fix: replace R(τ) with the sum of future rewards from time t onward:

∇_θ J ≈ (1/N) ∑_i=1^N ∑_t=0^T ∇_θ log π_θ(a_tⁱ | s_tⁱ) · (∑_t'=t^T r(s_t'ⁱ, a_t'ⁱ))

The inner sum ∑_t'=t^T r(s_t', a_t') is called the reward-to-go or return-to-go. It's the total reward the agent receives from time t onward. This has the same expected value as using R(τ) but strictly lower variance.

Improvement 2: Baselines

Even with reward-to-go, the gradient estimate has high variance. Consider: if all trajectories get positive reward (say, rewards between +10 and +20), then REINFORCE increases the probability of every action — even the bad ones. It just increases them less. Convergence is slow because the signal-to-noise ratio is poor.

The fix: subtract a baseline b from the reward:

∇_θ J ≈ (1/N) ∑_i ∑_t ∇_θ log π_θ(a_tⁱ | s_tⁱ) · (∑_t'=t^T r_t' − b)

The crucial mathematical fact: subtracting any constant baseline b does not change the expected value of the gradient. This is because E[∇_θ log π · b] = b · E[∇_θ log π] = b · 0 = 0. The score function has zero mean.

But it does change the variance. The optimal baseline is b* = E[R(τ)], the average return. Now trajectories better than average get positive weight (reinforced) and trajectories worse than average get negative weight (discouraged). Much cleaner signal.

This is an exam classic: "Does subtracting a baseline introduce bias into the policy gradient?" The answer is NO. The expected gradient is unchanged. The baseline only reduces variance. If someone asks you to prove it, show that E[∇ log π · b] = 0 because ∑_a ∇π(a|s) = ∇(∑_a π(a|s)) = ∇(1) = 0.

Worked Problem: Computing the Policy Gradient

Exam problem: You collect N = 3 trajectories from a policy π_θ. Each trajectory has T = 2 timesteps. The rewards are:
• Trajectory 1: r₀ = +3, r₁ = +2 → R(τ₁) = 5
• Trajectory 2: r₀ = −1, r₁ = −1 → R(τ₂) = −2
• Trajectory 3: r₀ = +1, r₁ = 0 → R(τ₃) = 1
Assume ∇_θ log π(a₀¹|s₀¹) = [+1, 0]. Compute the gradient contribution from action a₀ of trajectory 1: (a) without baseline, and (b) with baseline b = mean return.

(a) Without baseline:

Using reward-to-go from t=0: R_to-go = r₀ + r₁ = 5

contribution = ∇_θ log π(a₀¹|s₀¹) · R_to-go = [+1, 0] · 5 = [+5, 0]

(b) With baseline:

Baseline b = (5 + (−2) + 1) / 3 = 4/3 ≈ 1.33

contribution = [+1, 0] · (5 − 1.33) = [+1, 0] · 3.67 = [+3.67, 0]

The direction is the same (trajectory 1 is above average, so it's reinforced), but the magnitude is smaller. For trajectory 2 (return = −2, below baseline), the contribution would be negative — discouraging those actions. That's exactly what we want.

Without a baseline, even trajectory 3 (return = +1, mediocre) gets a positive gradient. With a baseline, trajectory 3 gets a slightly negative gradient (1 − 1.33 = −0.33), correctly signaling that it's below average.

Policy Gradient: Trajectories, Causality & Baselines

Five trajectories are shown with their rewards. Toggle causality (reward-to-go) and baseline subtraction to see how the gradient direction changes. Arrow length = gradient magnitude for each trajectory.

Common exam mistakes — Policy Gradients:
• "Subtracting a baseline changes the expected gradient." Wrong. It only reduces variance. E[∇ log π · b] = 0 for any constant b.
• Forgetting causality: action a_t should only be weighted by rewards from t onward, not the whole trajectory return. Using the whole return is valid but has higher variance.
• Confusing "high variance" with "biased." REINFORCE is unbiased but high variance. That's why we need many trajectories (large N) for accurate estimates.
• Thinking the baseline must be the mean return. Any state-dependent function b(s) works. The optimal baseline is b*(s) = E[R | s], but even a simple constant helps enormously.

In the REINFORCE algorithm, you collect 100 trajectories and all of them have positive reward (between +5 and +15). Without a baseline, what happens to the policy gradient update?

All actions get reinforced (probability increased), even bad ones — the gradient just reinforces good actions MORE than bad ones, leading to slow convergence Only the best actions get reinforced The gradient is zero because all rewards are positive The algorithm diverges

You're given the policy gradient: ∇J = (1/N) ∑ (∑ ∇ log π(a|s)) · (R(τ) − b). A colleague claims subtracting b = 100 biases the gradient. Is the colleague correct?

Yes — subtracting 100 from all rewards changes the gradient direction No — subtracting any constant baseline leaves the expected gradient unchanged (E[∇ log π · b] = 0) Only if b equals the mean return

Which modification converts the "total trajectory reward" version of REINFORCE into the "reward-to-go" version?

Replace R(τ) with a learned value function V(s) Replace R(τ) with ∑_t'=t^T r_t' (only future rewards from time t onward) Replace R(τ) with r_t (only the immediate reward) Replace R(τ) with the discounted return γ^t R(τ)

🔨 Derivation Derive the Policy Gradient Theorem from Scratch ▶ ✓ ATTEMPTED

The policy gradient theorem states: ∇_θ J(θ) = E_{τ~π_θ}[∑_t=0^T ∇_θ log π_θ(a_t|s_t) · Q^π(s_t, a_t)]

Your task: (1) Start from J(θ) = E_{τ~π_θ}[R(τ)] and derive this result using the log-derivative trick. (2) Show that Q can be replaced with advantage A = Q - V without changing the gradient's expectation. (3) Prove that subtracting ANY function b(s) from Q doesn't bias the gradient (b is called a "baseline").

∇_θ p_θ(τ) = p_θ(τ) · ∇_θ log p_θ(τ). This converts ∇ ∫ p(τ)R(τ)dτ into E_τ[∇ log p(τ) · R(τ)]. Then factor log p(τ) = ∑ log π(a_t|s_t) + const (dynamics terms cancel).

R(τ) = ∑_t'=0^T r_t'. By causality, actions at time t only affect future rewards. So ∇ log π(a_t|s_t) is multiplied only by ∑_t'=t^T γ^t'-tr_t' = Q^π(s_t, a_t) (the reward-to-go = Q-function under the current policy).

E_a~π[∇ log π(a|s) · b(s)] = b(s) · E_a[∇ log π(a|s)] = b(s) · ∇ ∑_a π(a|s) = b(s) · ∇(1) = 0. Any b(s) that doesn't depend on a vanishes from the gradient's expectation. V(s) is the variance-minimizing choice.

Step 1: J(θ) = ∫ p_θ(τ) R(τ) dτ. Take gradient: ∇J = ∫ ∇p_θ(τ) R(τ) dτ = ∫ p_θ(τ) ∇ log p_θ(τ) R(τ) dτ = E_τ[∇ log p_θ(τ) · R(τ)].

Step 2: log p_θ(τ) = log p(s₀) + ∑ log p(s_t+1|s_t,a_t) + ∑ log π_θ(a_t|s_t). Only the last sum depends on θ: ∇ log p_θ(τ) = ∑ ∇ log π_θ(a_t|s_t).

Step 3 (Causality): ∇J = E[∑_t ∇ log π(a_t|s_t) · (∑_t'=t^T γ^t'-tr_t')]. The inner sum = Q^π(s_t, a_t). ■

Step 4 (Baseline): Replacing Q with A = Q - V: the V(s_t) term contributes E_a[∇ log π · V(s)] = V(s) · 0 = 0. So A and Q give the same expected gradient, but A has lower variance (V(s) acts as a control variate). ■

Exam insight: The PG theorem appears in 3 forms: (1) with R(τ), (2) with Q(s,a), (3) with A(s,a). They're all the same gradient in expectation. Form (3) is what's implemented (lowest variance). Know all three and how to go between them.

Chapter 3: Actor-Critic Methods

REINFORCE works but it's inefficient. Each gradient estimate requires rolling out entire trajectories, and the variance is high even with baselines. The fundamental issue: we're using sampled returns to evaluate how good an action is. What if we could replace those noisy sampled returns with a learned estimate?

That's the actor-critic idea. Train two networks: an actor (the policy π_θ) and a critic (a value function that estimates expected returns). The critic provides low-variance gradient signals to the actor. The actor provides trajectories for the critic to learn from.

The Three Value Functions

Before anything else, you need to have these three functions and their relationships burned into memory. They will appear on every exam in some form.

V^π(s) = E_π [∑_t=0^∞ γ^t r_t | s₀ = s]

V^π(s) is the state value function. It answers: "How much total discounted reward do I expect to get, starting from state s and following policy π forever?" It's the "how good is this state?" number.

Q^π(s, a) = E_π [∑_t=0^∞ γ^t r_t | s₀ = s, a₀ = a]

Q^π(s, a) is the action value function. It answers: "How much total reward do I expect if I start in state s, take action a (possibly different from what π would choose), and then follow π from the next state onward?" It's "how good is this action in this state?"

A^π(s, a) = Q^π(s, a) − V^π(s)

A^π(s, a) is the advantage function. It answers: "How much better is action a compared to the average action at state s?" Positive advantage means the action is above average; negative means below average. The advantage is the signal the actor uses to update — reinforce above-average actions, discourage below-average ones.

The relationship you must know: V^π(s) = E_{a ~ π}[Q^π(s, a)]. The state value is the expected action value under the policy. And by definition, E_{a ~ π}[A^π(s, a)] = 0 — the average advantage is always zero. Advantages are relative: some actions are above average, some below, but they average to zero.

MC vs. TD vs. n-Step Targets

The critic needs a training target — what should V(s) equal? There are three approaches, and understanding their tradeoffs is essential for the exam.

Method	Target for V(s)	Bias	Variance	Data Needed
Monte Carlo (MC)	∑_t'=t^T γ^t'−t r_t' (actual return)	Unbiased	High	Full trajectory
TD(0)	r_t + γ V(s_t+1) (bootstrap)	Biased (if V wrong)	Low	Single transition
n-step	∑_k=0ⁿ⁻¹ γ^k r_t+k + γⁿ V(s_t+n)	Less biased	Medium	n transitions

Monte Carlo: Use the actual observed return from the trajectory. No approximation, so unbiased. But if the trajectory is long (T = 1000), the return includes 1000 random variables, so variance is huge.

y_MC = r_t + γ r_t+1 + γ² r_t+2 + ... + γ^T−t r_T

TD(0) (Temporal Difference): Use only the immediate reward plus the critic's own estimate of the next state. This bootstraps — it uses V(s') to estimate V(s). If V is wrong, the target is biased. But it only depends on one random variable (r_t), so variance is low.

y_TD = r_t + γ V(s_t+1)

n-step: A compromise. Use n actual rewards, then bootstrap:

y_n-step = r_t + γ r_t+1 + ... + γⁿ⁻¹ r_t+n−1 + γⁿ V(s_t+n)

At n = 1, this is TD(0). At n = T, this is MC. You can slide between bias and variance by choosing n.

The Actor-Critic Policy Gradient

With a learned critic, the policy gradient becomes:

∇_θ J ≈ (1/N) ∑_i ∑_t ∇_θ log π_θ(a_t | s_t) · A^π(s_t, a_t)

Where the advantage is estimated using the TD error:

Â(s_t, a_t) = r(s_t, a_t) + γ V_φ(s_t+1) − V_φ(s_t)

This is the TD advantage: the actual reward plus the discounted value of the next state minus the value of the current state. If this is positive, the action led to a better-than-expected next state — reinforce it. If negative, the action was worse than expected — discourage it.

Think of Â as a "surprise" signal. V(s) is what we expected to get from state s. r + γV(s') is what we actually got (one-step look-ahead). The advantage = actual − expected. Positive surprise → good action. Negative surprise → bad action. Zero → action was exactly as good as average.

Worked Problem: Computing TD Target, MC Return, and Advantage

Exam problem: Given: V(s₁) = 5, V(s₂) = 8, V(s₃) = 3, discount γ = 0.9. The agent transitions s₁ → s₂ → s₃ with rewards r₁ = 3, r₂ = −1. Compute: (a) TD target for V(s₁), (b) MC return from s₁, (c) TD advantage A(s₁, a₁).

(a) TD target for V(s₁):

y_TD(s₁) = r₁ + γ V(s₂) = 3 + 0.9 × 8 = 3 + 7.2 = 10.2

(b) MC return from s₁ (assuming the trajectory ends at s₃):

G(s₁) = r₁ + γ r₂ + γ² V(s₃) = 3 + 0.9 × (−1) + 0.81 × 3 = 3 − 0.9 + 2.43 = 4.53

(This is actually a 2-step return, not a pure MC return. A pure MC would go until the episode ends. But for a finite trajectory, this is the standard calculation.)

(c) TD advantage A(s₁, a₁):

Â(s₁, a₁) = r₁ + γ V(s₂) − V(s₁) = 10.2 − 5 = +5.2

The advantage is strongly positive: taking action a₁ in state s₁ led to state s₂ (value 8), which is much better than the expected value of state s₁ (value 5). The actor should reinforce this action.

Notice the TD target (10.2) is much higher than V(s₁) = 5, meaning our critic underestimates s₁. The critic loss would push V(s₁) upward toward 10.2.

MC vs. TD: Bias-Variance Tradeoff

A 10-state chain with random rewards. The orange line = true values (computed analytically). Teal dots = estimated values using n-step returns. Drag the n slider: n=1 is TD(0) (low variance, may be biased), n=10 is MC (unbiased, high variance). Watch the estimates wobble with MC and stabilize (but shift) with TD.

n (bootstrap depth) 1

Critic Accuracy 50%

Common exam mistakes — Actor-Critic:
• Confusing V(s) and Q(s,a). V averages over actions; Q conditions on a specific action. V = E_a[Q(s,a)].
• Saying "MC is better than TD because it's unbiased." Not necessarily — if your value function is well-trained, TD's bias is tiny and its low variance leads to faster convergence. If your value function is poorly trained, TD's bias is large and MC is safer. The right choice depends on critic quality.
• Forgetting that A(s,a) = Q(s,a) − V(s), so the expected advantage is always zero. If every advantage on your exam calculation is positive, something is wrong.
• Computing the TD target as r + V(s) instead of r + γV(s'). Don't forget the discount factor γ and that you use the next state s', not the current state s.

V(s_A) = 10, V(s_B) = 6, r = 2, γ = 0.95. What is the TD advantage A(s_A, a)?

2 + 0.95 × 6 = 7.7 2 + 6 − 10 = −2 2 + 0.95 × 6 − 10 = −2.3 10 − 6 = 4

MC estimates are unbiased but have high variance. TD estimates have lower variance but may be biased. You're training a brand-new critic from random initialization. Which target should you use?

MC — because a random critic provides terrible bootstrap targets, making TD heavily biased. MC uses actual returns and avoids this bootstrapping error. TD(0) — because low variance is always better It doesn't matter — both converge to the same answer

What is the expected advantage E_{a ~ π}[A^π(s, a)] for any state s?

It depends on the state Always 0 — by definition, A = Q − V, and V = E[Q], so E[A] = E[Q] − V = 0 Always positive (we only reinforce good actions) Equal to V(s)

Chapter 5: Q-Learning & DQN

Every method so far learns a policy π_θ(a|s) explicitly — a network that directly outputs actions. Q-learning takes a fundamentally different approach: learn the optimal value of every action, then just pick the best one. No policy network at all.

The key insight: if you knew Q*(s, a) — the optimal action-value function — you wouldn't need a policy network. You'd just act greedily:

π*(s) = argmax_a Q*(s, a)

"At every state, pick the action with the highest Q-value." The policy is implicit in the Q-function. This is the value-based approach to RL.

The Bellman Optimality Equation

Q* satisfies a recursive relationship called the Bellman optimality equation:

Q*(s, a) = r(s, a) + γ max_a' Q*(s', a')

In words: the optimal value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state. This is a fixed-point equation — Q* is the unique function that satisfies it.

Let's unpack each piece:

Term	Meaning
Q*(s, a)	Optimal value of taking action a in state s (what we're learning)
r(s, a)	Immediate reward for taking action a in state s
γ	Discount factor (0 to 1) — how much we care about future rewards
max_a' Q*(s', a')	Value of the best action in the next state (bootstrapping)

Why is Q-learning off-policy? The Bellman equation uses max over actions at s'. It doesn't matter how we got to state s' — we always evaluate the best possible action. Even if the data was collected by a random policy, Q-learning still backs up the value of the optimal action. This is different from SARSA (which uses the action the policy actually took, making it on-policy).

Tabular Q-Learning

For small state-action spaces, we can store Q as a table. The update rule for each observed transition (s, a, r, s'):

Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]

Where α is the learning rate. The term in brackets is the TD error: the difference between the Bellman target (r + γ max Q(s', a')) and the current estimate Q(s, a). We nudge Q toward the target by a fraction α.

With enough exploration (visiting every state-action pair infinitely often) and a decaying learning rate, tabular Q-learning provably converges to Q*.

DQN: Q-Learning with Neural Networks

For large state spaces (images, continuous states), we replace the table with a neural network Q_φ(s, a). The loss function:

L(φ) = E_{(s,a,r,s') ~ D} [(Q_φ(s, a) − (r + γ max_a' Q_φ'(s', a')))²]

This looks like simple regression: predict Q(s,a), use the Bellman target as the label. But there are three problems that make naive DQN unstable, and three tricks that fix them:

Stability Trick 1: Replay Buffer

Problem: consecutive transitions are correlated (s₁ → s₂ → s₃ are temporally adjacent). Neural networks trained on correlated data overfit to recent patterns and forget old ones.

Fix: store all transitions in a large replay buffer D. Sample mini-batches uniformly from D. This breaks temporal correlations — your mini-batch might contain transitions from 10 different episodes at different points in training.

Stability Trick 2: Target Network

Problem: the Bellman target r + γ max Q_φ(s', a') uses the same network Q_φ that we're training. When we update φ, the target changes too — we're chasing a moving target. This causes oscillations and divergence.

Fix: use a separate target network Q_φ' with frozen parameters. Update φ' only periodically (copy φ to φ' every C steps) or slowly (Polyak averaging: φ' ← τφ + (1−τ)φ'). The target is stable between updates.

Stability Trick 3: Double Q-Learning

Problem: the max operator in max_a' Q(s', a') systematically overestimates Q-values. If Q has noise (estimation error), max picks the action with the highest noise realization, biasing the estimate upward. This is the "max of noisy estimates > true max" phenomenon.

Fix: Double DQN decouples action selection from action evaluation. Use the online network to select the action, but the target network to evaluate it:

a* = argmax_a' Q_φ(s', a') (select with online)
target = r + γ Q_φ'(s', a*) (evaluate with target)

Since the selection and evaluation use different networks (with different noise patterns), the overestimation bias is significantly reduced.

Summary of DQN's three tricks:
1. Replay buffer → breaks correlations between consecutive transitions
2. Target network → stabilizes the Bellman target (stop chasing a moving target)
3. Double Q → reduces overestimation from the max operator
All three address different sources of instability in approximate dynamic programming with neural networks. Memorize what each one fixes.

Worked Problem: Q-Table Bellman Update

Exam problem: You have a Q-table with 3 states (s₁, s₂, s₃) and 2 actions (Left, Right). Current Q-values and a transition to process:

Q-table:
Q(s₁, L) = 2, Q(s₁, R) = 4
Q(s₂, L) = 6, Q(s₂, R) = 3
Q(s₃, L) = 1, Q(s₃, R) = 5

Transition: (s₁, R, r=2, s₂). Learning rate α = 0.5, γ = 0.9.
Perform one Q-learning update.

Step 1: Compute the Bellman target.

target = r + γ max_a' Q(s₂, a') = 2 + 0.9 × max(6, 3) = 2 + 0.9 × 6 = 2 + 5.4 = 7.4

Step 2: Compute the TD error.

δ = target − Q(s₁, R) = 7.4 − 4 = 3.4

Step 3: Apply the update.

Q(s₁, R) ← Q(s₁, R) + α · δ = 4 + 0.5 × 3.4 = 4 + 1.7 = 5.7

The Q-value for (s₁, Right) increased from 4.0 to 5.7. Makes sense: taking Right in s₁ gives reward 2 and transitions to s₂, whose best action (Left) has value 6. The discounted future value 0.9 × 6 = 5.4 plus the immediate reward 2 = 7.4, which is higher than the old estimate of 4. So Q moves upward.

Notice: only Q(s₁, R) changed. All other Q-values remain the same. Q-learning does pointwise updates — one (s, a) pair per transition.

Interactive Q-Table: Watch Values Converge

A 3×3 grid world with a goal (top-right, reward +10) and a pit (bottom-left, reward −5). Click Step to perform one Q-learning update from a random transition. Click Run 100 to watch the Q-values converge. The arrows show the greedy policy (argmax Q). Warmer colors = higher max Q.

Learning Rate α 0.50

Discount γ 0.90

Exploration ε 0.10

Common exam mistakes — Q-Learning:
• "Q-learning is on-policy." Wrong. Q-learning uses max_a' Q(s', a') — it evaluates the greedy action regardless of what the behavior policy actually did. That makes it off-policy. (SARSA, which uses Q(s', a'_actual), is on-policy.)
• Forgetting the discount γ in the Bellman target. target = r + γ max Q, not r + max Q.
• Computing max Q(s, a') instead of max Q(s', a'). The max is over the next state s', not the current state s.
• Confusing what Double DQN does. It doesn't use two replay buffers or two loss functions. It uses the online network to select the best action and the target network to evaluate that action. Decoupling selection from evaluation reduces overestimation.

Q-Learning vs. SARSA: The Key Distinction

	Q-Learning	SARSA
Update target	r + γ max_a' Q(s', a')	r + γ Q(s', a'_actual)
On/off-policy	Off-policy	On-policy
Evaluates	Greedy (best) action	The action the policy actually takes
Consequence	Learns optimal Q* regardless of behavior	Learns Q^π for the current policy
Safety	May learn unsafe greedy paths	Respects exploration policy (safer)

Q(s₁, a₁) = 3, Q(s₁, a₂) = 7. Q(s₂, a₁) = 5, Q(s₂, a₂) = 2. The agent transitions from s₁ with action a₂, receives r = 1, and lands in s₂. With γ = 0.9 and α = 1.0, what is Q(s₁, a₂) after the update?

7.0 (no change) 5.5 — target = 1 + 0.9 × max(5, 2) = 1 + 4.5 = 5.5; with α=1.0, Q ← 5.5 6.3 — target = 1 + 0.9 × 7 = 7.3; update Q = 7 + 1.0(7.3 − 7) = 7.3 2.8 — target = 1 + 0.9 × 2 = 2.8

Why does DQN use a target network with frozen parameters φ' instead of using the same network φ for both the prediction and the Bellman target?

Because updating φ changes the Bellman target simultaneously, creating a "moving target" that causes oscillations and divergence. Freezing φ' keeps the target stable between updates. Because it reduces the memory needed for the replay buffer Because two networks learn faster than one Because the target network handles continuous actions while the online network handles discrete actions

Q-learning uses max_a' Q(s', a') in its update. SARSA uses Q(s', a'_actual). Which statement is correct?

Q-learning is on-policy because it uses max SARSA is off-policy because it uses actual actions Q-learning is off-policy (evaluates best action regardless of behavior policy); SARSA is on-policy (evaluates the action the current policy actually takes) They're both off-policy with different variance levels

🔨 Derivation Prove the Bellman Operator is a Contraction (Q-Learning Convergence) ▶ ✓ ATTEMPTED

The Bellman optimality operator is: (TQ)(s,a) = r(s,a) + γ max_a' Q(s', a'). Q-learning converges because T is a contraction in the infinity norm.

Your task: (1) Prove ||TQ₁ - TQ₂||_∞ ≤ γ ||Q₁ - Q₂||_∞. (2) Use the Banach fixed-point theorem to conclude Q-learning converges to a unique Q*. (3) What is the convergence rate? After k iterations, how close is Q_k to Q*?

|max_a f(a) - max_a g(a)| ≤ max_a |f(a) - g(a)|. The "max" doesn't amplify differences — it can only reduce them. Apply this to max_a' Q₁(s',a') - max_a' Q₂(s',a').

|(TQ₁)(s,a) - (TQ₂)(s,a)| = |γ max_a' Q₁(s',a') - γ max_a' Q₂(s',a')| ≤ γ max_a' |Q₁(s',a') - Q₂(s',a')| ≤ γ ||Q₁ - Q₂||_∞. Since this holds for all (s,a), take sup over (s,a): ||TQ₁ - TQ₂||_∞ ≤ γ ||Q₁ - Q₂||_∞. ■

Banach theorem: any contraction on a complete metric space has a unique fixed point, and iterates converge to it. Rate: ||Q_k - Q*||_∞ ≤ γ^k ||Q₀ - Q*||_∞. With γ=0.99, after 460 iterations the error shrinks by a factor of 100. With γ=0.9, only 46 iterations needed. This is why low γ converges faster but plans shorter horizons.

Proof of contraction:

For any (s,a): |(TQ₁ - TQ₂)(s,a)| = γ|max_a' Q₁(s',a') - max_a' Q₂(s',a')|

≤ γ max_a' |Q₁(s',a') - Q₂(s',a')| (triangle inequality for max)

≤ γ sup_(s,a) |Q₁(s,a) - Q₂(s,a)| = γ ||Q₁ - Q₂||_∞

Taking sup over (s,a) on the left: ||TQ₁ - TQ₂||_∞ ≤ γ ||Q₁ - Q₂||_∞. ■

Convergence: Since γ < 1, T is a γ-contraction. By Banach: (1) unique fixed point Q* exists, (2) Q_k = T^kQ₀ → Q*, (3) rate: ||Q_k - Q*|| ≤ γ^k/(1-γ) ||TQ₀ - Q₀||.

Exam insight: This proof is a classic exam question. The key steps: (1) property of max, (2) apply it, (3) invoke Banach. Also know: this only works for tabular Q-learning. With function approximation (DQN), the contraction property breaks because the projection step (fitting a neural network) is not a non-expansion. This is the "deadly triad" problem.

Chapter 6: Offline RL

Here's a scenario that comes up constantly in the real world: you have a massive dataset of past decisions — hospital treatment records, autonomous driving logs, recommendation system clickstreams — but you cannot deploy a new policy to collect more data. The patients have already been treated. The cars have already driven. You need to learn the best policy you can from this fixed dataset, without any additional interaction.

This is offline RL (also called batch RL). You have a dataset D = {(s, a, r, s')} collected by some behavior policy π_β, and your job is to extract the best possible policy π from this data alone. Sounds like supervised learning, right? Just imitate the best actions in the dataset?

Not quite. The behavior policy might be mediocre — a mix of novice and expert drivers, doctors following outdated guidelines. We want to do better than the data collector. But this is where offline RL breaks in a spectacular and specific way.

The Distribution Shift Problem

Consider a standard Q-learning update. We pick the action that maximizes Q(s, a) at the next state. But what if that maximizing action was never taken in the dataset? The Q-network has never seen a training signal for (s', a_max), so its Q-value there is essentially a random guess — and random neural network outputs tend to be overestimates.

Now it gets worse. The Bellman backup propagates these overestimates. Q(s, a) gets a target that includes max_a' Q(s', a'), which is inflated. Next iteration, that inflated Q(s, a) propagates to other states. The errors compound exponentially. Your Q-function develops "hallucinated" high-value regions in parts of the state-action space the dataset never visited.

The offline RL death spiral: Standard Q-learning uses max_a Q(s', a) in its backup. If the maximizing action a* is out-of-distribution (never seen in data), Q(s', a*) is an unreliable overestimate. This overestimate propagates backward through Bellman backups, creating fictitious high-value "phantom" policies that look great on paper but fail catastrophically when deployed. The policy chases ghosts.

Solution 1: Constrained Policy (AWR)

Advantage Weighted Regression (AWR) sidesteps the problem entirely by never evaluating out-of-distribution actions. Instead of learning Q-values and picking argmax, AWR treats the problem as weighted behavior cloning:

π* = argmax_π E_(s,a)~D [ exp(A(s,a) / β) · log π(a|s) ]

Where A(s,a) = Q(s,a) - V(s) is the advantage, and β is a temperature parameter. Translation: clone the behavior policy, but weight each action by how much better it was than average. Good actions (positive advantage) get upweighted exponentially. Bad actions get downweighted. The policy stays close to the data distribution because it's literally doing weighted imitation.

Solution 2: Avoid OOD Evaluation (IQL)

Implicit Q-Learning (IQL) takes a different, more elegant approach. The core insight: the reason Q-learning fails offline is that the Bellman target max_a Q(s', a) requires evaluating Q at actions that might be OOD. What if we could compute a value function V(s) that approximates the max without ever querying Q at unseen actions?

IQL does this in two steps:

Step 1: Fit V via Expectile Regression

Train V(s) using asymmetric L_τ loss against Q(s,a) for in-data actions only. With τ < 0.5, V estimates a conservative quantile of Q.

↓

Step 2: Extract Policy via AWR

Use A(s,a) = Q(s,a) - V(s) as advantage. Weight BC by exp(A/β). Only uses in-data (s,a) pairs.

The key equation is the expectile loss:

L_τ(u) = |τ - 1(u < 0)| · u²

Where u = Q(s,a) - V(s). When τ = 0.5, this is ordinary MSE — V learns the mean of Q. When τ < 0.5, the loss penalizes overestimates more than underestimates. So V(s) is pulled below the mean Q — it becomes a conservative lower bound. When τ > 0.5, V chases the high end — approaching max_a Q(s,a), which is what we'd want in online RL but is dangerous offline.

IQL's genius: By fitting V with τ < 0.5, we get a conservative estimate of the value of a state using only in-data actions. No max over unseen actions needed. Then we extract the policy by weighting in-data actions by their advantage over this conservative baseline. Both steps only touch (s, a) pairs from the dataset.

Worked Exam Problem

Problem: You have Q(s, a₁) = 5, Q(s, a₂) = 8, V(s) = 6. Action a₁ appears in the dataset. Action a₂ is OOD (never in dataset).

(a) Why is Q(s, a₂) = 8 dangerous?

(b) Compute IQL's expectile loss L_τ for V(s) = 6 with τ = 0.3, using only the in-data action a₁.

(c) What happens to V(s) as training proceeds with τ = 0.3?

Solution (a): Q(s, a₂) = 8 was never trained on real (s, a₂, r, s') transitions. It's a random extrapolation by the neural network. Standard Q-learning would set its target as r + γ max(Q(s', ·)), and if that max happens at an OOD action in s', the overestimate propagates. A policy using argmax_a Q(s, a) would choose a₂ — an action that might not even make physical sense — because of this hallucinated high value.

Solution (b): We only use in-data actions, so u = Q(s, a₁) - V(s) = 5 - 6 = -1. Since u < 0, the indicator 1(u < 0) = 1, so the weight is |τ - 1| = |0.3 - 1| = 0.7.

L_0.3(-1) = 0.7 · (-1)² = 0.7

Solution (c): Since u = -1 < 0, the weight is 0.7 (high penalty for overestimating). The gradient pushes V(s) down toward Q(s, a₁) = 5. With τ = 0.3, V will converge below the mean of in-data Q-values — a conservative estimate. If there were multiple in-data actions, V would settle at roughly the 30th percentile of Q-values.

Conversely, if u were positive (V too low), the weight would be |0.3 - 0| = 0.3 — a weaker push upward. This asymmetry is exactly what makes IQL conservative.

IQL Expectile Loss Explorer

Adjust τ to see how the expectile loss changes. With τ < 0.5, overestimates (V > Q) are penalized more heavily, pushing V to be conservative. With τ > 0.5, underestimates are penalized more, pushing V toward the max. The orange bars show Q-values for in-data actions; the teal line shows V(s).

τ (expectile) 0.30

V(s) 6.0

Comparison Table: Offline RL Approaches

Method	Core Idea	OOD Handling	Key Hyperparameter
BCQ	Only consider actions similar to data	Generative model filters actions	Similarity threshold
CQL	Penalize Q-values for OOD actions	Regularizer pushes Q down on unseen (s,a)	Regularization strength α
AWR	Weighted behavior cloning	Never evaluates OOD — just reweights data	Temperature β
IQL	Expectile regression + AWR extraction	V fitted conservatively on in-data Q only	Expectile τ, temperature β

Common exam mistakes:
• Saying "offline RL fails because the dataset is too small." No — even large datasets cause failure if the policy visits states not in the data. The problem is distributional, not statistical.
• Confusing τ direction in IQL: τ < 0.5 is conservative (what we want offline). τ > 0.5 is optimistic (dangerous offline).
• Thinking AWR learns a Q-function from scratch — it doesn't. It uses a pre-computed advantage to reweight existing behavior. The advantage comes from a separate value estimation step.
• Forgetting that IQL's key property is that it never queries Q at actions outside the dataset, not even during the Bellman backup.

Why can't we just apply SAC (an off-policy algorithm) directly to an offline dataset?

SAC requires on-policy data — it can't use a replay buffer The dataset might be too small for SAC's sample complexity SAC's policy improvement step evaluates Q at actions the dataset never covered, leading to overestimated Q-values that compound through Bellman backups SAC uses entropy regularization, which doesn't work without environment interaction

IQL uses τ = 0.3 in its expectile loss. What would happen if we set τ = 0.5?

The algorithm would crash due to division by zero V(s) would learn the mean of Q-values (no conservatism), losing offline safety — equivalent to standard value fitting V(s) would become even more conservative than τ = 0.3 The policy extraction step would fail because advantages would all be zero

In AWR, what role does the temperature β play?

Small β makes the policy greedily focus on the highest-advantage actions; large β makes the weighting more uniform (closer to plain BC) β controls the learning rate of the policy network β determines the discount factor for future rewards β sets the entropy bonus in the SAC objective

🔨 Derivation Prove DAgger's Linear Regret Bound (vs BC's Quadratic) ▶ ✓ ATTEMPTED

Behavioral Cloning suffers O(εT²) regret due to compounding errors. DAgger achieves O(εT). This is the key theoretical advantage.

Your task: (1) State the no-regret property formally: after N rounds of DAgger, the policy π_N has cost at most min_π∈Π J(π) + O(1/√N) under the LEARNER's distribution. (2) Prove this by showing DAgger is an instance of Follow-the-Leader in an online learning game. (3) Explain why this breaks the quadratic compounding: what assumption about the training distribution does BC violate that DAgger satisfies?

In round i, DAgger: (1) rolls out π_i, collecting states from d_πi. (2) Asks expert to label those states. (3) Trains π_i+1 on ALL data so far (D₁ ∪ ... ∪ D_i). This is "Follow the Leader" in the online learning framework: at each round, play the best policy for all past data. FTL has O(1/N) average regret for convex losses.

BC trains on d_π* (expert's distribution) but is evaluated on d_πθ (learner's distribution). This mismatch causes compounding. DAgger trains on a mixture: (1/N)∑_i d_πi. After N rounds, this mixture is close to d_πN (the current policy's distribution). So the training distribution matches the test distribution — no distribution shift, no compounding.

At round N, the policy π_N achieves per-step error ε_N under d_πN (because it's trained on that distribution). Total cost = T · ε_N = O(εT). No T² because there's no distribution shift to compound. The 1/√N term comes from the statistical error of fitting π with N rounds of data (standard generalization bound).

Formal statement (Ross, Gordon, Bagnell 2011): Let C_N = (1/N)∑_i=1^N E_{s~d_πi}[ℓ(π_N, π*, s)] be the average cost of π_N over all visited distributions. Then:

C_N ≤ min_π∈Π (1/N)∑_i E_{d_πi}[ℓ(π, π*, s)] + O(1/√N)

Why O(εT) not O(εT²): At convergence (large N), π_N achieves error ε under its OWN state distribution. Each step incurs ε error independently (no compounding because training matches deployment). Total: T·ε. In BC, step t incurs error proportional to t·ε (accumulated drift), giving ∑_t tε = εT²/2.

The key assumption DAgger satisfies: The training data covers the learner's actual state distribution. BC violates this (trains on expert states, evaluated on learner states). This "covariate shift" is the root cause of quadratic regret.

Exam insight: "Why does BC compound errors quadratically?" → Distribution shift: policy visits states not in training data. "How does DAgger fix this?" → Iteratively collects data under the learner's distribution. "What's the tradeoff?" → DAgger requires online expert access (can't use a fixed dataset).

Checkpoint — Before you move on

Offline RL seems like it should be easy: just run Q-learning on a fixed dataset. But it catastrophically fails. Explain in one sentence WHY standard Q-learning fails offline, and in one sentence how CQL/IQL fix it.

✓ Gate cleared

Model Answer

The problem: The Bellman backup uses max_a' Q(s', a'), which evaluates Q at actions the dataset never contained — these out-of-distribution Q-values are wildly overestimated (the network hallucinating high value for unseen actions), causing the policy to chase phantom rewards.

CQL fix: Add a regularizer that pushes Q-values DOWN for actions not in the dataset: min_Q E_a~π[Q(s,a)] - E_a~data[Q(s,a)] + standard Bellman loss. This makes out-of-distribution actions have pessimistically low Q-values.

IQL fix: Never evaluate Q at out-of-distribution actions AT ALL. Fit V(s) using expectile regression on Q-values from the dataset, then extract a policy via advantage-weighted regression: π ∝ π_β · exp(A/β). The advantage A = Q - V is computed only at dataset actions.

Chapter 7: Reward Learning

Every RL algorithm assumes a reward function r(s, a) exists. But who writes that function? For Atari games, the reward is the score — easy. For "make a robot set a table nicely" or "write a helpful email," defining reward mathematically is a nightmare. You know a good table setting when you see one, but translating that into r(s, a) is an open research problem.

Worse, hand-designed rewards get hacked. A classic example: a boat racing agent rewarded for collecting speed boosts learned to drive in circles collecting boosts, never finishing the race. The reward was technically maximized, but the behavior was absurd. This is reward misspecification — the optimizer finds a way to score high that violates the designer's intent.

The solution: learn the reward function from human input. There are two main approaches, and they use fundamentally different types of supervision.

From Demonstrations: Inverse RL

Inverse Reinforcement Learning (IRL) starts from expert demonstrations and asks: "What reward function would make this expert behavior optimal?" Given a dataset of expert trajectories D_expert = {τ₁, τ₂, ...}, IRL finds r such that the expert policy scores higher than any other policy under r.

The problem is ill-defined — many reward functions explain the same behavior (including r = 0 everywhere). Modern approaches like adversarial IRL frame it as a GAN-like game: a discriminator tries to distinguish expert trajectories from agent trajectories, and the reward is derived from the discriminator's confidence.

From Preferences: The Bradley-Terry Model

Instead of requiring full demonstrations, we can learn from pairwise preferences. Show a human two trajectory segments τ_A and τ_B, and ask: "Which one is better?" This is much easier for humans — you don't need to be an expert to say "that robot arm motion looks smoother than that one."

The Bradley-Terry model converts these preferences into a reward learning objective. It models the probability that trajectory τ_w is preferred over τ_l as:

P(τ_w ≻ τ_l) = σ(r(τ_w) - r(τ_l))

Where σ is the sigmoid function, and r(τ) = ∑_t r(s_t, a_t) is the total reward along the trajectory. The loss for training the reward model is the negative log-likelihood of observed preferences:

L(r) = -E_{(τ_w, τ_l)} [ log σ(r(τ_w) - r(τ_l)) ]

This is just binary cross-entropy — the same loss used in logistic regression. The reward model is trained to assign higher total reward to the trajectory the human preferred.

Why sigmoid? The Bradley-Terry model assumes humans make noisy comparisons. When the reward difference is large, the human almost always picks the better one (probability near 1). When rewards are close, it's nearly a coin flip (probability near 0.5). The sigmoid naturally captures this — steep in the middle, flat at the extremes.

The RLHF Pipeline

Reinforcement Learning from Human Feedback (RLHF) chains these ideas together in a three-phase pipeline:

Phase 1: Sample

Generate k trajectory pairs from the current policy. Present to human annotators.

↓

Phase 2: Fit Reward

Train reward model r_θ using Bradley-Terry loss on human preferences.

↓

Phase 3: Optimize Policy

Run PPO (or similar) using learned r_θ as the reward. KL-constrain against base policy.

↻ repeat

The KL constraint in Phase 3 is crucial: without it, the policy could exploit weaknesses in the learned reward model (reward hacking again, but now hacking the learned reward). The objective becomes:

max_π E_π[r_θ(s,a)] - β · KL(π || π_ref)

Where π_ref is the original pre-trained policy and β controls how far the optimized policy can drift.

Worked Exam Problem

Problem: Two trajectories have learned rewards r(τ₁) = 3, r(τ₂) = 5. A human says τ₂ ≻ τ₁ (trajectory 2 is better).

(a) Compute P(τ₂ ≻ τ₁) under the Bradley-Terry model.

(b) Compute the loss for this single comparison.

(c) Now suppose r(τ₁) = 3, r(τ₂) = 3.1. How does the loss change, and what does this mean for training?

Solution (a): P(τ₂ ≻ τ₁) = σ(r(τ₂) - r(τ₁)) = σ(5 - 3) = σ(2).

σ(2) = 1 / (1 + e^-2) = 1 / (1 + 0.135) = 1 / 1.135 ≈ 0.881.

The model is 88.1% confident that τ₂ is better — which matches the human's judgment. Good.

Solution (b): L = -log σ(2) = -log(0.881) ≈ 0.127.

The loss is low because the model's prediction aligns with the human label. The gradient here is small — not much to learn.

Solution (c): Now P(τ₂ ≻ τ₁) = σ(3.1 - 3) = σ(0.1) = 1 / (1 + e^-0.1) ≈ 1 / 1.905 ≈ 0.525.

L = -log(0.525) ≈ 0.644. The loss is 5x higher. The model is barely more confident than a coin flip that τ₂ is better, but the human clearly said it was. The gradient here is large — the reward model needs to push r(τ₂) higher and r(τ₁) lower to separate them.

Exam insight: When two trajectories have similar rewards but the human has a clear preference, the Bradley-Terry loss is high and the gradients are strong. This is where the reward model learns the most — from "close calls" where the current model can't distinguish quality.

Bradley-Terry Preference Model

Adjust the reward scores for two trajectories. The sigmoid curve shows how reward difference maps to preference probability. Watch the loss and gradient magnitude change — close scores mean high loss and strong learning signal.

r(τ₁) 3.0

r(τ₂) [preferred] 5.0

Demos vs Preferences: A Comparison

	From Demonstrations (IRL)	From Preferences (RLHF)
Human input	Full expert trajectories	Pairwise "which is better?" labels
Requires expert?	Yes — demonstrator must be skilled	No — any human can compare two options
Scalability	Low — demonstrations are expensive	Higher — comparisons are cheap and parallelizable
Ill-posedness	Severe — many rewards explain same behavior	Milder — preferences provide relative signal
Reward hacking risk	Moderate	Managed via KL constraint to reference policy

Common exam mistakes:
• Confusing the reward model with the policy. The reward model r_θ is a separate neural network trained on human preferences. The policy π is then trained using r_θ as its reward.
• Forgetting the KL penalty in RLHF. Without it, the policy over-optimizes the reward model and "hacks" it — finding inputs where r_θ gives high scores but actual human ratings would be low.
• Saying "RLHF learns from demonstrations." No — RLHF learns from preferences (pairwise comparisons). Learning from demonstrations is IRL/BC.
• Computing Bradley-Terry with σ(r(τ_l) - r(τ_w)) instead of σ(r(τ_w) - r(τ_l)). The winner's reward comes first.

In RLHF, why do we learn a reward model instead of directly optimizing the policy from human preferences?

Because neural networks can't handle pairwise labels directly Human feedback is too expensive to collect at every policy update step — the reward model generalizes from limited comparisons to provide dense reward for RL training Because the Bradley-Terry model requires a reward function to exist before training can begin Directly optimizing from preferences would require on-policy data collection

What's the key difference between learning rewards from demonstrations (IRL) versus from preferences (RLHF)?

IRL is model-based; RLHF is model-free IRL works only for discrete actions; RLHF works for continuous IRL needs an expert who can perform the task well; RLHF only needs someone who can judge which of two outcomes is better — a much weaker and more scalable requirement IRL uses neural networks; RLHF uses linear models

⚔ Adversarial: Reward Hacking in Learned Rewards

You train a reward model from 500 human preference comparisons. Then you run PPO against this reward model. After 1M steps, the reward model gives a score of 9.8/10, but when you show the resulting behavior to humans, they rate it 2/10. What went wrong?

The policy exploited OOD inputs to the reward model — it found states that the reward model was never trained on and that produce spuriously high reward predictions (reward model overconfident on unseen inputs) The reward model was underfitting the human preferences PPO has a bug — it's not optimizing correctly

Chapter 8: Model-Based RL

Every RL algorithm we've discussed so far is model-free: the agent interacts with the environment, collects rewards, and learns a policy or value function without ever trying to understand how the world works. That's like learning to drive by trial and error with zero understanding of physics. It works eventually, but you'll crash a lot.

Model-based RL (MBRL) flips the script. First, learn a dynamics model — a function p_θ(s'|s, a) that predicts what state comes next given the current state and action. Then use this model to either generate synthetic training data or plan ahead by simulating future trajectories in your head.

The payoff is massive: typically 10-100x fewer real environment interactions. The cost: you need to manage model error, because wrong predictions compound catastrophically.

Dyna: Learn and Imagine

The simplest MBRL framework is Dyna, introduced by Sutton. The idea is beautifully straightforward:

Step 1: Interact

Collect real transition (s, a, r, s') from the environment. Add to replay buffer D.

↓

Step 2: Train Model

Fit dynamics model p_θ(s'|s, a) and reward model r_φ(s, a) on D.

↓

Step 3: Imagine

Generate K synthetic rollouts using the learned model. Add synthetic transitions to D.

↓

Step 4: Update Policy

Train policy/value function on augmented D (real + imagined data).

↻ repeat from Step 1

For every 1 real environment step, Dyna does K imagined steps. If K = 10, the policy sees 11x as much data but the robot only moved once. That's the data efficiency win.

The Compounding Error Problem

Here's the fundamental tension. Suppose your model has 5% prediction error per step — pretty good for a neural network. After 1 step, your predicted state is 5% off. After 2 steps, the error compounds: you're predicting the next state from an already-wrong state. After H steps:

Error(H) ≈ ε · H (linear, best case)

Error(H) ≈ ε · (1 + ε)^H (multiplicative, worst case)

For ε = 0.05 and H = 20 steps: linear gives 100% error (useless); multiplicative gives 0.05 × 2.65 ≈ 13.3% error. Either way, long rollouts in a learned model produce garbage. The policy trains on this garbage and learns fictional strategies that fail in the real world.

The model accuracy trap: A model can be 95% accurate per step and still be completely useless over 20-step rollouts. This is why naive Dyna with long rollouts fails catastrophically. You must either (a) keep rollouts short, (b) use ensembles to detect uncertainty, or (c) do both (MBPO).

MBPO: Short Rollouts + Ensembles

Model-Based Policy Optimization (MBPO) solves the compounding error problem with two key ideas:

1. Short branched rollouts. Instead of rolling out from a random initial state for 100 steps, MBPO starts from a real state sampled from the replay buffer and rolls out for only k steps (typically k = 1 to 5). This limits how far the imagined trajectory drifts from reality.

2. Ensemble of models. Train N models (typically 5-7) on the same data. When models disagree on a prediction, that state-action pair is high-uncertainty. MBPO uses the ensemble for rollouts and can detect when imagined data becomes unreliable.

MPC: Plan Online, Don't Learn a Policy

Model Predictive Control (MPC) takes a completely different approach: instead of using the model to train a policy, use it to plan at decision time. At each timestep:

1. Sample Action Sequences

Generate N random action sequences of length H: {a₀, a₁, ..., a_H-1}_i for i = 1..N

↓

2. Simulate in Model

Roll out each sequence in the learned model: s_t+1 = f_θ(s_t, a_t). Compute total reward for each.

↓

3. Pick Best First Action

Execute only a₀ from the highest-scoring sequence. Observe real s'. Discard the rest.

↻ replan from new s'

The receding horizon trick is essential: we only execute the first action and then replan from the new real state. Why? Because the model gets less accurate further into the future. By replanning every step, we always use the model's most accurate near-term predictions.

Planning with a Terminal Value Function

Pure MPC struggles with long-horizon tasks because the planning horizon H is limited by model accuracy. A clever fix: after rolling out H steps in the model, add a terminal value V̂(s_H) that estimates the value of the final state. The total score for a sequence becomes:

Score = ∑_t=0^H-1 r(s_t, a_t) + V̂(s_H)

Where V̂ is a learned value function (trained model-free on real data). This lets MPC reason about long-term consequences without needing a model accurate enough for long rollouts. Short model rollout + learned long-term value = best of both worlds.

Worked Exam Problem

Problem: A learned dynamics model has 5% error per step (additive). True dynamics: s_t+1 = s_t + a_t. Model predicts: ŝ_t+1 = s_t + a_t + 0.1 (systematic bias).

(a) Starting from s₀ = 0 with a_t = 1 for all t, what is the true state after 10 steps?

(b) What does the model predict after 10 steps?

(c) What's the accumulated error, and why does this motivate short rollouts?

Solution (a): True dynamics: s_t+1 = s_t + 1. After 10 steps: s₁₀ = 0 + 10 × 1 = 10.

Solution (b): Model prediction: ŝ_t+1 = ŝ_t + 1 + 0.1. The bias accumulates linearly. After 10 steps: ŝ₁₀ = 0 + 10 × (1 + 0.1) = 10 + 10 × 0.1 = 11.

Solution (c): Error = |11 - 10| = 1. That's 10% of the true state — from a model with only 0.1 bias per step. After 100 steps, it would be 10. This linear accumulation is the best case — multiplicative errors compound even faster. MBPO limits rollouts to k ≤ 5 steps precisely to keep this compounding manageable.

Model Error Compounding

Adjust the per-step model error and rollout length. The teal line shows the true trajectory; the orange line shows the model's prediction. Watch how quickly they diverge with longer rollouts.

Per-step error 5%

Rollout length 10

Decision Table: When to Use What

Approach	Uses Model For	Strength	Weakness
Dyna	Synthetic data generation	Simple, works with any policy learner	Long rollouts → compounding error
MBPO	Short branched rollouts + SAC	State-of-art data efficiency	Needs ensemble overhead, short horizon
MPC	Online planning at each step	No policy network needed, replans constantly	Slow at test time (N rollouts per step)
MPC + V̂	Short planning + learned value	Long-horizon reasoning with short rollouts	Needs both model and value function

Common exam mistakes:
• Saying "MBPO uses long rollouts with an ensemble for error correction." No — ensembles detect uncertainty but don't fix it. MBPO uses short rollouts (k=1-5) to limit error.
• Explaining MPC without the receding horizon. The whole point is that you execute only the FIRST action and replan. If you execute the entire sequence, you lose the error-correction benefit of replanning.
• Confusing Dyna (uses model for data augmentation, then trains a policy) with MPC (uses model for planning, no explicit policy).
• Saying model error is a fixed percentage of the true state. Model error compounds — a 5% per-step error after 20 steps is not 5% total error.

MBPO limits model rollouts to k = 1-5 steps and starts from real states in the buffer. Why not roll out for 50 steps from random initial states?

Model prediction errors compound over long horizons, making synthetic data unreliable. Starting from real states ensures the first few predictions are in a region the model was actually trained on. Because MBPO uses on-policy data and long rollouts would make it off-policy The ensemble can only handle 5 steps of prediction before running out of memory Long rollouts are computationally expensive and don't improve sample efficiency

In MPC, why do we only execute the FIRST action of the best sequence and then replan?

Because the action sequences are generated randomly and most are suboptimal To save computation — executing more actions would require more model evaluations The model is most accurate for the immediate next step. After observing the real next state, replanning from that real state uses fresh accurate information instead of compounded model error. MPC requires real-time feedback from the environment to handle stochastic transitions

💻 Build It Implement GAE (Generalized Advantage Estimation) ▶ ✓ ATTEMPTED

GAE is the standard advantage estimator used in PPO. It interpolates between bias (TD, λ=0) and variance (MC, λ=1). The formula: A_t^GAE = ∑_l=0^∞ (γλ)^l δ_t+l, where δ_t = r_t + γV(s_t+1) - V(s_t) is the TD error. Implement this efficiently using the backward sweep.

python def compute_gae( rewards: np.ndarray, # (T,) rewards values: np.ndarray, # (T+1,) value estimates (includes V(s_T+1)) dones: np.ndarray, # (T,) episode termination flags gamma: float = 0.99, lam: float = 0.95 ) -> np.ndarray: """ Compute GAE advantages for a trajectory. Returns: advantages (T,) Key: when done[t]=True, V(s_{t+1}) should be 0 (episode ended, no future rewards). """ ...

Test case

rewards = np.array([1, 0, 0, 1, 0])
values = np.array([0.5, 0.4, 0.3, 0.2, 0.1, 0.0]) # T+1 values
dones = np.array([0, 0, 0, 0, 1])
advantages = compute_gae(rewards, values, dones, gamma=0.99, lam=0.95)
# advantages[4] should be: r4 + 0 - V(s4) = 0 + 0 - 0.1 = -0.1
# advantages[3] should be: delta3 + gamma*lam*adv4

Instead of computing the infinite sum directly (expensive), use the recursion: A_t = δ_t + γλ · A_t+1. Start from A_T-1 = δ_T-1 and work backward. This is O(T) instead of O(T²). Critical: when dones[t]=True, set A_t = δ_t (no future terms, because the episode ended). Also: when dones[t]=True, δ_t = r_t + 0 - V(s_t) (next state value is 0).

python
import numpy as np

def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages = np.zeros(T)
    last_gae = 0.0

    for t in reversed(range(T)):
        # If episode ended at t, next value is 0
        next_non_terminal = 1.0 - dones[t]
        next_value = values[t + 1] * next_non_terminal

        # TD error: r + gamma*V(s') - V(s)
        delta = rewards[t] + gamma * next_value - values[t]

        # GAE recursion (zeroed across episode boundaries)
        last_gae = delta + gamma * lam * next_non_terminal * last_gae
        advantages[t] = last_gae

    return advantages

# Returns = advantages + values (for value function target)
# returns = advantages + values[:T]

Bonus challenge: Vectorize this for a batch of parallel environments (shape: (num_envs, T)). The done masking must be per-environment.

Checkpoint — Before you move on

Model-based RL learns a dynamics model and plans through it. Goal-conditioned RL conditions the policy on a desired goal. Explain: could you combine them? What would a "goal-conditioned model-based" algorithm look like, and what advantage would it have?

✓ Gate cleared

Model Answer

A goal-conditioned model-based algorithm: (1) Learn a dynamics model s' = f(s, a). (2) Given a goal g, plan through the model to find action sequence that reaches g. (3) Execute the first action, re-plan from the new state.

This is literally MPC (Model Predictive Control) with a goal-reaching objective. The advantage: you can reach ANY goal without retraining — just change the planning objective. This is more flexible than a goal-conditioned policy (which needs to be trained on goals) and more sample-efficient than model-free GCRL.

Real example: Visual Foresight (Finn & Levine 2017) learns a video prediction model, then plans actions that make the predicted video match a goal image. Modern version: Dreamer with HER-style goal relabeling. The model provides the "imagination" for planning; the goal provides the "what to plan toward."

Chapter 9: Goal-Conditioned & Multi-Task RL

Standard RL trains a policy for one task with one reward function. But real agents need to handle many tasks. A household robot needs to set the table, wash dishes, and fetch objects. Training a separate policy for each is wasteful — the physics of grasping is the same whether you're picking up a mug or a plate.

Multi-task RL trains a single policy conditioned on a task identifier z_i. The policy becomes π(a | s, z_i), and the state effectively becomes the pair (s, z_i). Different task identifiers activate different behaviors while sharing the underlying representations.

Goal-Conditioned RL: A Special Case

Goal-conditioned RL (GCRL) is the most natural version of multi-task RL. The task identifier is simply a goal state s_g — the state you want to reach. The policy is π(a | s, s_g), and the reward is typically:

r(s, a, s_g) = -d(s, s_g) (dense, distance-based)

r(s, a, s_g) = 1(||s - s_g|| < ε) (sparse, binary success)

Dense rewards give the agent a gradient to follow at every step ("you're getting warmer"). Sparse rewards are more natural but harder to learn from ("you either reached the goal or you didn't"). Most real-world goals are sparse — the mug is either on the coaster or it isn't.

The sparse reward trap: With sparse rewards and random exploration, the probability of accidentally reaching a distant goal is exponentially small in the distance. An agent trying to reach (5, 5) from (0, 0) on a 10×10 grid with random actions might never get there. No success = no positive reward signal = no learning. This is the curse of sparse rewards, and it's the problem HER was designed to solve.

Hindsight Experience Replay (HER)

HER is one of the cleverest tricks in RL. The insight: even when the agent fails to reach its intended goal, it did reach some state. Why not pretend that state was the goal all along?

Concretely: the agent tries to reach goal g = (5, 5) but ends up at (3, 2). With the original goal, the entire episode has reward 0 (failure). But if we relabel the goal to g' = (3, 2), then the final transition gets reward 1 (success!). The agent learns "here's how to reach (3, 2)" — and that knowledge transfers to reaching other nearby goals.

1. Collect Episode

Try to reach goal g. Trajectory: s₀ → s₁ → ... → s_T. Failed (s_T ≠ g).

↓

2. Store Original

Store transitions with original goal g and reward 0. Policy failed, but keep the data.

↓

3. Relabel Goals

Pick achieved states (e.g. s_T, s_T/2) as new goals g'. Recompute rewards. Store as additional transitions.

↓

4. Train

Off-policy algorithm (DQN, DDPG, SAC) trains on both original and relabeled transitions.

↻ repeat

HER has three requirements that are easy to forget on an exam:

Requirement	Why
Same dynamics across goals	Relabeling changes the goal, not the physics. If different goals had different transition dynamics, the relabeled transition would be invalid.
Evaluatable reward	We need to recompute r(s, a, g') for the new goal g'. This requires knowing the reward function — we can't relabel if reward is a black box.
Off-policy algorithm	Relabeled data was collected under a policy pursuing a different goal. Only off-policy methods can learn from data generated by a different behavioral distribution.

Worked Exam Problem

Problem: An agent in a 2D grid aims for goal g = (5, 5). It executes a 4-step episode: (0,0) → (1,1) → (2,1) → (3,2) → (3,2). Reward = -||s - g|| at each step, plus a bonus of +10 if s = g.

(a) What's the total reward with the original goal?

(b) We relabel the goal to g' = (3, 2). What's the reward at the final step?

(c) Using the "final" HER strategy (relabel with s_T), how many new training transitions does HER create from this one episode?

Solution (a): Distances at each step:
t=0: ||((0,0) - (5,5)|| = √(50) ≈ 7.07 → reward = -7.07
t=1: ||(1,1) - (5,5)|| = √(32) ≈ 5.66 → reward = -5.66
t=2: ||(2,1) - (5,5)|| = √(25) = 5.0 → reward = -5.0
t=3: ||(3,2) - (5,5)|| = √(13) ≈ 3.61 → reward = -3.61
No bonus (never reached g). Total ≈ -21.34. All negative, all failure.

Solution (b): With g' = (3, 2): at t=3, ||(3,2) - (3,2)|| = 0 → reward = 0 + 10 = 10. Success!

Solution (c): The "final" strategy relabels with s_T = (3, 2). Each of the 4 transitions in the episode gets a copy with g' = (3, 2) and recomputed rewards. So HER creates 4 additional transitions (one per timestep, each paired with the new goal). Combined with the 4 original transitions, we now have 8 total — and 4 of them end in success instead of 0.

Why this matters: Before HER, the agent saw 4 transitions with total reward -21.34 and learned nothing about how to succeed. After HER, it sees 4 additional transitions that include a +10 success reward. From one failed episode, we manufactured positive training signal. Multiply this across thousands of episodes and the agent rapidly learns to reach diverse goals.

HER Goal Relabeling

The agent (circle) tried to reach the orange goal but ended up at its final position. Click Relabel to see HER create a new teal goal at the achieved position, turning failure into success. Click New Episode to generate a random trajectory.

Common exam mistakes:
• Saying HER works with on-policy algorithms. It doesn't — relabeled data was collected pursuing a different goal, making it off-policy by definition. Only off-policy methods (DQN, DDPG, SAC) can use this data.
• Thinking HER changes the environment dynamics. HER only changes the goal and recomputes the reward. The transitions (s, a, s') are unchanged.
• Forgetting that HER requires a known, computable reward function. If reward is a black box (e.g., a human rating), you can't recompute it for a new goal.
• Claiming HER can work when different tasks have different dynamics. If task 1 has friction=0.1 and task 2 has friction=0.5, a trajectory from task 1 can't be relabeled as task 2 — the state transitions would be different.

Why does HER only work with off-policy algorithms like DDPG or SAC?

Because on-policy algorithms can't use replay buffers at all Relabeled transitions were collected while pursuing a different goal than the relabeled one — they come from a different behavioral distribution. Only off-policy methods can learn from data generated by a different policy. On-policy algorithms require continuous action spaces, but HER only works with discrete actions HER changes the reward function, which invalidates on-policy importance sampling corrections

Can you apply HER if different goals correspond to different physical dynamics (e.g., different friction coefficients)?

No — HER relabels the goal but reuses the same (s, a, s') transitions. If different goals had different dynamics, the transition would be invalid under the new goal because s' would have been different. Yes — HER automatically adjusts the transitions for the new goal's dynamics Yes — as long as the reward function is computable, HER works regardless of dynamics No — but only because different dynamics require different network architectures

🏗 Design Challenge You're the Architect: Choosing the Right RL Algorithm ▶ ✓ ATTEMPTED

Given a new RL problem, you must select the right algorithm class. Build a decision tree: for each constraint listed below, determine which algorithm family applies and WHY.

Scenario A

Fixed dataset, no environment access

Scenario B

Expensive env, ~1000 episodes budget

Scenario C

Fast sim, continuous actions, 10M steps

Scenario D

Expert available, multi-modal actions

Scenario E

Sparse reward, goal-conditioned tasks

Scenario F

Safety-critical, must stay near known behavior

1. For each scenario, name the best algorithm class (offline RL, model-based, SAC/PPO, imitation learning, goal-conditioned, conservative) and the specific algorithm.

2. What's the first thing that would fail if you used the WRONG algorithm in each scenario?

3. Draw a decision tree: Data? → Online or Offline? → Continuous or discrete? → Model available? → ...

A (fixed dataset, no env): Offline RL. IQL for simplicity + stability, CQL if you need guaranteed pessimism. NEVER use SAC/DQN online — they require env interaction. BC works if dataset is expert-quality; offline RL is better for mixed-quality datasets.

B (expensive env, 1000 episodes): Model-based RL (MBPO, Dreamer). Learn a world model from real data, generate synthetic rollouts for policy training. 10-100x more sample efficient than model-free. Alternatively: a few demos + fine-tuning (warm-start with BC, then improve with RL).

C (fast sim, continuous, 10M steps): SAC (off-policy, entropy regularization, continuous actions) or PPO (on-policy, simpler, more robust). SAC is more sample efficient. PPO is more stable and easier to tune. For robotics with Isaac Lab: PPO with 4096 parallel envs.

D (expert available, multi-modal): DAgger (iterative) or Diffusion Policy (handles multi-modality). BC fails for multi-modal because MSE averages modes. Diffusion Policy represents the full action distribution. IBC (implicit BC) also handles multi-modality.

E (sparse reward, goal-conditioned): HER (Hindsight Experience Replay) + goal-conditioned SAC. HER retroactively relabels failed episodes as "successes for a different goal," dramatically improving sample efficiency under sparse rewards.

F (safety-critical, stay near known): Offline RL with strong conservatism (CQL with high α), or constrained policy optimization (CPO). The key requirement: never try actions far from the dataset. Conservative methods provide this guarantee by construction.

⚔ Adversarial: The Deadly Triad

You're running DQN (off-policy, function approximation, bootstrapping) and notice Q-values diverging to infinity. You can remove exactly ONE element of the deadly triad. Which removal is most practical while still being useful?

Remove off-policy: only use data from the current policy (makes it SARSA/on-policy) Remove function approximation: use a tabular Q-table Remove bootstrapping: use Monte Carlo returns instead of TD targets

⚔ Adversarial: When Exploration Hurts

You're training SAC on a robot reaching task. The entropy bonus encourages exploration. After 500K steps, the policy reaches the goal 95% of the time. But you notice it takes highly variable paths (sometimes looping). Is this a problem? When would you want to REDUCE entropy?

Yes, reduce entropy (α) for deployment: variable paths waste energy, increase wear, and make the system unpredictable. Use a high α for training (exploration) and anneal to low α for deployment (exploitation). No, high entropy is always good because it prevents local optima This indicates the training hasn't converged yet; just train longer

🔗 Pattern Recognition

The Same Three Tensions Run Through All of RL

Exploration vs Exploitation

ε-greedy (DQN), entropy bonus (SAC), UCB (bandits), curiosity (RND).
Same tension everywhere: try new things vs use what works.

Bias vs Variance

MC (unbiased, high variance) vs TD (biased, low variance).
GAE(λ) interpolates: λ=0 is TD, λ=1 is MC.
Same tradeoff in every estimator.

A third tension runs beneath both: on-policy vs off-policy. On-policy (PPO) is stable but sample-wasteful. Off-policy (SAC, DQN) reuses data but risks divergence. Offline RL takes this to the extreme: ALL data is off-policy. Every algorithm in this course is a point in this 3D tradeoff space. Understanding WHERE each algorithm sits (and WHY) is the deepest understanding you can have of the field.

Place these algorithms in the 3D space (exploration strategy, bias-variance point, on/off-policy): PPO, SAC, DQN, CQL, REINFORCE, Dreamer. Which two are closest? Which two are furthest apart?

Chapter 10: Interactive Exam Simulator

Time to test yourself. This is a randomized exam with questions drawn from all nine topics covered in this review. Each question is the kind you'd see on a real CS224R final — not trivia, but questions that test whether you truly understand the concepts.

The questions are intentionally hard. Some require computation. Some have trick answers that test whether you understand subtle distinctions (like on-policy vs off-policy, or what baselines actually do to the gradient). Some require you to trace through an algorithm step by step. This is the level you need to be at.

How it works: Click Start Exam to begin. You'll see one question at a time with four choices. Select your answer and click Submit to see if you're right — every answer comes with a detailed explanation of why it's correct and why the alternatives are wrong. Click Next Question for another random question. Use the topic filters to drill your weak areas. Aim for 80%+ before the real exam.

Study strategy: Do a first pass through all topics. Note which topics you score below 70% on. Use the topic filter buttons to drill those specific areas. Then do a final full pass. The questions are randomized each time, so you'll see different ones on each attempt.

CS224R Exam Simulator

Score: 0 / 0 correct (0%)

36 exam-quality questions across 9 topics.
Randomized each time. Detailed explanations for every answer.

The score bar above tracks your progress. Green segments = correct, red = incorrect. Use topic filters to drill weak areas. Questions repeat if you cycle through all of a topic's questions. Your score resets when you change the filter.

Exam Strategy Notes

Time management on the real exam:
• Computation questions (Bellman update, Bradley-Terry, expectile loss): budget 3-4 minutes. Write the formula first, then plug in numbers.
• Conceptual "why" questions: budget 2 minutes. These test understanding, not calculation. If you're writing more than 3 sentences, you're overcomplicating it.
• "What happens if..." questions: budget 2-3 minutes. Trace through the algorithm step by step with the changed parameter. Don't guess — derive.
• If stuck on a multiple choice, eliminate wrong answers first. "Always" and "never" answers are usually wrong (except in RL where some things genuinely never work, like HER with on-policy algorithms).

The three questions that classify every algorithm:
1. Does it need a reward function? (BC/DAgger: No. Everything else: Yes, either given or learned.)
2. Does it need environment interaction? (Offline RL: No. Everything else: Yes, or at least for data collection.)
3. On-policy or off-policy? (REINFORCE/A2C/PPO: On. DQN/SAC/AWR/IQL/HER: Off.)

These three axes uniquely identify every algorithm in the course. Master this classification and you can answer half the exam from first principles.

Chapter 11: Cheat Sheet & Connections

This is your one-page reference for the entire course. Print it, memorize it, take it into the exam in your head. Every algorithm, every key equation, every common trap — all in one place.

Master Equation Table

Algorithm	Key Equation	Data	On/Off	Key Property
BC	min E[ℓ(π_θ(s), a_expert)]	Expert demos	N/A	Supervised, no reward needed
DAgger	BC + query expert on visited states	Expert demos + online queries	N/A	Fixes distribution shift in BC
REINFORCE	∇J = E[∑_t ∇logπ(a_t\|s_t) · R(τ)]	On-policy rollouts	On	High variance, unbiased
A2C	∇J = E[∇logπ(a\|s) · A(s,a)]	On-policy rollouts	On	Baseline reduces variance
PPO	L = min(r_tA_t, clip(r_t, 1±ε)A_t)	On-policy (few reuses via IS)	On	Clipped IS for stability
SAC	max E[∑ r + αH(π)]	Replay buffer	Off	Entropy-regularized, explores well
DQN	L = (r + γ max_a' Q_target(s',a') - Q(s,a))²	Replay buffer	Off	Target network stabilization
AWR	π = argmax E[exp(A/β) logπ(a\|s)]	Fixed offline dataset	Off	Weighted BC — never evaluates OOD
IQL	L_τ = \|τ - 1(u<0)\| · u²	Fixed offline dataset	Off	Conservative V via expectile regression
Dyna	Train model p_θ(s'\|s,a), generate synthetic data	Real + imagined	Either	Data efficiency via imagination
MPC	a* = argmax_{a_0:H} ∑ r(s_t,a_t)	Model-based planning	N/A	No explicit policy, replans each step
HER	Relabel g → g' = s_achieved	Replay buffer + relabeled	Off	Turns failures into successes

Decision Flowchart: Which Algorithm?

Start here → Do you have expert demonstrations?

• Yes, demos available:
  → Do you have a reward function? No → BC (simplest) or DAgger (if expert is queryable)
  → Do you have a reward function? Yes → Inverse RL to learn reward, then RL

• No demos, but have reward function:
  → Can you interact with the environment?
    • Yes, many interactions OK → PPO (on-policy, stable) or SAC (off-policy, sample efficient)
    • Yes, but interactions expensive → MBPO (model-based, 10-100x fewer interactions) or MPC (planning)
    • No interaction at all (fixed dataset) → IQL or AWR (offline RL)

• No demos, no reward, but have human preferences:
  → RLHF (Bradley-Terry reward learning + PPO)

• Goal-conditioned tasks with sparse rewards:
  → HER + off-policy algorithm (DDPG, SAC)

Common Exam Traps

Misconception	Truth
"Baselines change the expected gradient in policy gradient methods"	Baselines change variance only. E[∇logπ · b] = 0 for any state-dependent baseline b(s). The expected gradient is unchanged.
"Q-learning is on-policy"	Off-policy. The Bellman target uses max_a' Q(s', a') — the max over all actions, regardless of which action the behavior policy took.
"PPO is off-policy because it reuses data"	On-policy. PPO collects fresh rollouts each iteration. It reuses each batch for a few gradient steps via importance sampling, but the data is still on-policy. The clipping ensures updates stay close to on-policy.
"DAgger requires a reward function"	No reward needed. DAgger only needs an expert that can label states with the correct action. It's pure imitation learning with interactive expert queries.
"BC suffers from compounding error because of bad training data"	BC fails because of distribution shift: the learner visits states the expert never demonstrated, not because the expert data is bad. DAgger fixes this by querying the expert on the learner's visited states.
"Offline RL fails because datasets are too small"	Even huge datasets cause failure if the learned policy visits OOD state-action pairs. The problem is distributional shift, not sample size.
"Model-based RL is always more sample efficient"	Only if the model is accurate enough. A bad model generates misleading synthetic data that can hurt worse than no data at all.
"HER works with any RL algorithm"	HER requires off-policy algorithms only. Relabeled data was collected under a policy pursuing a different goal — it's off-policy by construction.
"In RLHF, the policy is trained directly on human preferences"	Two-stage: first train a reward model from preferences (Bradley-Terry), then optimize the policy using that reward model with PPO + KL constraint.
"Entropy bonus in SAC encourages exploitation"	Entropy bonus encourages exploration by penalizing deterministic policies. max H(π) = max randomness.
"IQL uses τ > 0.5 for conservative estimates"	τ < 0.5 is conservative (penalizes overestimates more). τ > 0.5 would be optimistic — chasing the max, which is dangerous offline.
"The discount factor γ only affects how far ahead the agent plans"	γ also affects the effective horizon and the variance of returns. Lower γ reduces variance but makes the agent myopic. Higher γ captures long-term rewards but increases variance.

Detailed Lessons

Each topic covered in this review has a full standalone lesson with deeper derivations, more simulations, and extended worked examples:

Exam Topic	Full Lesson	Key Simulation
Imitation Learning	Imitation Learning	DAgger vs BC on distribution shift
Policy Gradients	Policy Gradients	REINFORCE variance reduction
Q-Learning / DQN	RL Algorithms	Q-table grid world with live updates
Actor-Critic / PPO / SAC	Deep RL Theory	PPO clipping visualization
Offline RL	(in review)	IQL expectile loss explorer
Reward Learning / RLHF	Reward Learning	Bradley-Terry preference model
Model-Based RL	Model-Based RL	Error compounding + MPC planner
Goal-Conditioned RL / HER	Multi-Task & Goal-Conditioned RL	HER relabeling visualization

Formula Quick Reference

Policy Gradient Family

∇J = E[∑_t ∇logπ(a_t|s_t) · Φ_t]

Φ_t = R(τ) for REINFORCE
Φ_t = A(s_t,a_t) for A2C
Φ_t = clipped IS · A for PPO

Bellman Equation

Q*(s,a) = r + γ max_a' Q*(s', a')

IQL Expectile Loss

L_τ(u) = |τ - 1(u<0)| · u²

Bradley-Terry

P(τ_w ≻ τ_l) = σ(r(τ_w) - r(τ_l))

RLHF Objective

max E[r_θ(s,a)] - β KL(π||π_ref)

PPO Clip

L = min(r_tA_t, clip(r_t, 1±ε)A_t)

SAC Objective

max E[∑ r(s,a) + α H(π(·|s))]

Exam strategy: For every question, ask yourself three things: (1) Is this on-policy or off-policy? (2) Does it need a reward function? (3) Does it need environment interaction? These three axes separate every algorithm in the course. When in doubt, draw the flowchart above.

"What I cannot create, I do not understand." — Richard Feynman

Deep RL Exam Review

Chapter 0: The Map — RL Algorithm Taxonomy

The Two Big Splits

The Third Dimension: What Do You Learn?

Interactive Taxonomy

Decision Flowchart: Choosing an Algorithm

Chapter 1: Imitation Learning

Approach 1: Regression (Deterministic Policy)

Approach 2: Generative (Stochastic Policy)

Flow Matching for IL

DAgger: Fixing Distribution Shift

Worked Problem: Regression vs. Generative

Chapter 2: Policy Gradients

The REINFORCE Gradient

Improvement 1: Causality (Reward-to-Go)

Improvement 2: Baselines

Worked Problem: Computing the Policy Gradient

Chapter 3: Actor-Critic Methods

The Three Value Functions

MC vs. TD vs. n-Step Targets

The Actor-Critic Policy Gradient

Worked Problem: Computing TD Target, MC Return, and Advantage

Chapter 4: Off-Policy Methods & PPO

Importance Sampling: Reweighting Old Data

The Problem: Exploding Ratios

Solution 1: KL Constraint (TRPO)

Solution 2: PPO Clipping (The Practical Choice)

SAC: Maximum Entropy RL

Worked Problem: Importance Sampling Ratio

Chapter 5: Q-Learning & DQN

The Bellman Optimality Equation

Tabular Q-Learning

DQN: Q-Learning with Neural Networks

Stability Trick 1: Replay Buffer

Stability Trick 2: Target Network

Stability Trick 3: Double Q-Learning

Worked Problem: Q-Table Bellman Update

Q-Learning vs. SARSA: The Key Distinction

Chapter 6: Offline RL

The Distribution Shift Problem

Solution 1: Constrained Policy (AWR)

Solution 2: Avoid OOD Evaluation (IQL)

Worked Exam Problem

Comparison Table: Offline RL Approaches

Chapter 7: Reward Learning

From Demonstrations: Inverse RL

From Preferences: The Bradley-Terry Model

The RLHF Pipeline

Worked Exam Problem

Demos vs Preferences: A Comparison

Chapter 8: Model-Based RL

Dyna: Learn and Imagine

The Compounding Error Problem

MBPO: Short Rollouts + Ensembles

MPC: Plan Online, Don't Learn a Policy

Planning with a Terminal Value Function

Worked Exam Problem

Decision Table: When to Use What

Chapter 9: Goal-Conditioned & Multi-Task RL

Goal-Conditioned RL: A Special Case

Hindsight Experience Replay (HER)

Worked Exam Problem

Chapter 10: Interactive Exam Simulator

Exam Strategy Notes

Chapter 11: Cheat Sheet & Connections

Master Equation Table

Decision Flowchart: Which Algorithm?

Common Exam Traps

Detailed Lessons

Formula Quick Reference