Every algorithm, every equation, every exam trap. Interactive practice problems for the closed-book midterm.
You have 80 minutes. The exam is closed-book. There are roughly 26 slides worth of material spanning imitation learning, policy gradients, actor-critic, off-policy, Q-learning, offline RL, reward learning, model-based RL, and goal-conditioned RL. That's a lot of ground.
Before diving into any single topic, you need the map — a mental taxonomy that tells you where every algorithm lives, what data it needs, and when you'd reach for it. If someone names an algorithm on the exam, you should instantly know: is it on-policy or off-policy? Does it need a simulator? Can it work from a fixed dataset? Does it learn a value function, a policy, or both?
Let's build that map from first principles.
Every RL algorithm lives somewhere in a 2×2 grid defined by two questions:
Question 1: Where does the data come from?
If you can interact with the environment during training — take actions, observe outcomes, collect new data — you're in the online regime. If you're stuck with a fixed dataset collected by someone else (or a previous version of your policy), you're in the offline regime.
Question 2: Can you reuse old data?
If your algorithm requires fresh data from the current policy after every update, it's on-policy. If it can learn from data collected by any policy (stored in a replay buffer), it's off-policy. This distinction only applies within the online regime.
| Regime | Category | Algorithms | Data Requirement |
|---|---|---|---|
| Online | On-policy | REINFORCE, vanilla PG, A2C | Fresh rollouts from current πθ |
| Off-policy | DQN, SAC, TD3, DDPG | Replay buffer of past transitions | |
| Offline | Imitation Learning | Behavior Cloning, DAgger, Flow IL | Expert demonstrations (s, a) pairs |
| Offline RL | AWR, IQL, CQL | Fixed dataset of (s, a, r, s') tuples |
Where does PPO fit? This is a classic exam trick question. PPO uses importance sampling to reuse data from a slightly old policy — but it clips the ratio so tightly that it's effectively on-policy. Most courses classify PPO as on-policy with a small off-policy correction. On an exam, call it "on-policy with clipping."
Where does SAC fit? SAC maintains a replay buffer, samples uniformly from past transitions, and trains a Q-function using off-policy Bellman backups. It's firmly off-policy. The maximum-entropy objective (adding an entropy bonus) is orthogonal to the on/off-policy distinction.
Algorithms also differ in what function they learn:
| Type | Learns | How it Acts | Examples |
|---|---|---|---|
| Value-based | Q(s, a) or V(s) | π(s) = argmaxa Q(s, a) | DQN, Double DQN |
| Policy-based | πθ(a|s) | Sample from πθ | REINFORCE, vanilla PG |
| Actor-Critic | Both πθ and Vφ | Actor samples, critic evaluates | A2C, PPO, SAC, TD3 |
| Model-based | Dynamics p(s'|s,a) | Plan using the model | MBPO, Dreamer, MPC |
Click on an algorithm node to see its key equation, data needs, and typical use case. Orange = offline, Teal = on-policy, Blue = off-policy, Purple = model-based.
Forget rewards. Forget value functions. Forget Bellman equations. Imitation learning (IL) asks the simplest possible question: "I have recordings of an expert doing the task. Can I just copy them?"
The answer is yes — with caveats that will show up on your exam. IL turns RL into supervised learning: given state s, predict the action a the expert would take. No reward signal, no environment interaction during training (at least for behavior cloning), no temporal credit assignment. Just input-output pairs.
But the devil is in two details: how you parameterize the policy (regression vs. generative) and how you handle distribution shift (vanilla BC vs. DAgger).
The simplest approach: treat actions as real-valued outputs and minimize mean squared error (MSE):
Where πθ(s) is a neural network that takes a state and outputs a single deterministic action. This is literally supervised regression — the same loss you'd use to predict house prices from features.
When does this work? When the expert's behavior is unimodal — for any given state, there's essentially one correct action. A self-driving car on a straight highway: the expert always goes straight. MSE loss works perfectly.
When does this break? When the expert's behavior is multimodal. This is the critical failure mode you need to know for the exam.
Instead of predicting a single action, model the full conditional distribution pθ(a | s) and maximize log-likelihood:
This is the key difference. Log-likelihood doesn't average modes — it assigns probability mass to each mode independently. A mixture of Gaussians, a diffusion model, or a flow matching model can all represent multimodal distributions. The Flappy Bird policy would learn two peaks: one at +1 (go over) and one at −1 (go under), with near-zero probability at 0 (crash).
Common generative architectures for IL:
| Model | How It Represents Multimodality | Inference |
|---|---|---|
| Gaussian Mixture | K explicit Gaussian components | Sample from mixture |
| Diffusion Policy | Iterative denoising from noise → action | N denoising steps |
| Flow Matching | Learn velocity field from noise → action | Euler integration |
Flow matching is the generative approach emphasized in CS224R. The idea: learn a velocity field that transports random noise into expert actions. Here's how it works step by step.
Step 1: Define the interpolation. Given a noise sample a0 ~ N(0, I) and an expert action a1, create a path between them:
At τ = 0, we have pure noise. At τ = 1, we have the expert action. The path is a straight line in action space.
Step 2: Compute the target velocity. The velocity along this path is just the derivative with respect to τ:
This is the "ground truth" direction: from noise toward the expert action.
Step 3: Train a velocity network. Learn vθ(s, aτ, τ) to predict this velocity:
The network takes the state, the noisy action, and the timestep, and predicts which direction to push.
Step 4: Inference by Euler integration. Start from noise a0 ~ N(0, I), then take K small steps:
After K steps (e.g., K = 10, Δτ = 0.1), you arrive at a clean expert-like action. The velocity field has learned to route noise toward different modes, so multimodal distributions emerge naturally.
Even with a perfect generative model, behavior cloning has a fundamental problem: compounding errors.
During training, the policy sees states from the expert's distribution — the states an expert would visit. At test time, the policy makes a small mistake at step 1, which puts it in a slightly different state at step 2. It's never seen this state before, so it makes a bigger mistake. By step 100, the policy is in completely unfamiliar territory and crashes.
The error grows quadratically with the horizon T. If the per-step error is ε, the total compounded error after T steps is O(εT2), not O(εT). This quadratic blowup is what makes vanilla BC fragile for long-horizon tasks.
DAgger (Dataset Aggregation) fixes this by iteratively collecting corrective labels:
The key insight: DAgger trains on the policy's own distribution of states, not just the expert's. After enough iterations, the dataset covers both the expert's trajectory and all the "mistake states" the policy visits, so the policy learns to recover from its own errors.
DAgger reduces the error bound from O(εT2) to O(εT) — linear in the horizon, not quadratic. The cost: you need an expert available to label on-demand during training.
(a) MSE-optimal prediction:
MSE minimizes E[(a − π(s))2]. The minimizer is the conditional mean:
The model predicts +0.4 — an action the expert never took. If +2 means "go left" and −2 means "go right," then +0.4 means "drift vaguely leftward" — probably the worst possible action.
(b) Log-likelihood-optimal GMM:
A 2-component Gaussian mixture with small variance would learn:
At inference, sampling from this distribution gives +2 about 60% of the time and −2 about 40% — matching the expert's actual behavior. No probability mass near zero.
The expert's action distribution has two modes. Orange line = MSE prediction (the mean). Teal curves = generative model (GMM). Drag the mode separation slider to see when regression becomes catastrophic.
Imitation learning sidesteps reward entirely. But most RL problems don't come with expert demonstrations — they come with a reward function. Policy gradient methods directly optimize the expected cumulative reward by taking gradients of the policy parameters.
The core idea: if a trajectory gets high reward, increase the probability of the actions in that trajectory. If it gets low reward, decrease them. No value function needed (yet). No model of the environment. Just rollouts and gradient updates.
We want to maximize the expected return:
The problem: we can't backpropagate through the environment (it's a black box). The solution is the log-derivative trick (also called the REINFORCE trick or score function estimator):
In practice, we estimate this with N sampled trajectories:
Let's unpack each symbol:
| Symbol | Meaning |
|---|---|
| ∇θ log πθ(a|s) | The score function — which direction in parameter space makes action a more likely at state s |
| R(τ) | Total return of trajectory τ = ∑ r(st, at) |
| N | Number of sampled trajectories (batch size) |
| T | Trajectory length (horizon) |
Intuition: ∇θ log πθ(a|s) points toward parameter values that make action a more likely. Multiplying by R(τ) scales this — high-reward trajectories get large positive updates (make these actions more probable), low-reward trajectories get small or negative updates (make these actions less probable). It's a form of trial-and-error weighted by success.
The vanilla REINFORCE formula weights each action's gradient by the total trajectory reward R(τ). But action at at time t can't affect rewards that already happened at times 0, 1, ..., t−1. Including past rewards adds noise without information.
The reward-to-go fix: replace R(τ) with the sum of future rewards from time t onward:
The inner sum ∑t'=tT r(st', at') is called the reward-to-go or return-to-go. It's the total reward the agent receives from time t onward. This has the same expected value as using R(τ) but strictly lower variance.
Even with reward-to-go, the gradient estimate has high variance. Consider: if all trajectories get positive reward (say, rewards between +10 and +20), then REINFORCE increases the probability of every action — even the bad ones. It just increases them less. Convergence is slow because the signal-to-noise ratio is poor.
The fix: subtract a baseline b from the reward:
The crucial mathematical fact: subtracting any constant baseline b does not change the expected value of the gradient. This is because E[∇θ log π · b] = b · E[∇θ log π] = b · 0 = 0. The score function has zero mean.
But it does change the variance. The optimal baseline is b* = E[R(τ)], the average return. Now trajectories better than average get positive weight (reinforced) and trajectories worse than average get negative weight (discouraged). Much cleaner signal.
(a) Without baseline:
Using reward-to-go from t=0: Rto-go = r0 + r1 = 5
(b) With baseline:
Baseline b = (5 + (−2) + 1) / 3 = 4/3 ≈ 1.33
The direction is the same (trajectory 1 is above average, so it's reinforced), but the magnitude is smaller. For trajectory 2 (return = −2, below baseline), the contribution would be negative — discouraging those actions. That's exactly what we want.
Without a baseline, even trajectory 3 (return = +1, mediocre) gets a positive gradient. With a baseline, trajectory 3 gets a slightly negative gradient (1 − 1.33 = −0.33), correctly signaling that it's below average.
Five trajectories are shown with their rewards. Toggle causality (reward-to-go) and baseline subtraction to see how the gradient direction changes. Arrow length = gradient magnitude for each trajectory.
The policy gradient theorem states: ∇θ J(θ) = Eτ~πθ[∑t=0T ∇θ log πθ(at|st) · Qπ(st, at)]
Your task: (1) Start from J(θ) = Eτ~πθ[R(τ)] and derive this result using the log-derivative trick. (2) Show that Q can be replaced with advantage A = Q - V without changing the gradient's expectation. (3) Prove that subtracting ANY function b(s) from Q doesn't bias the gradient (b is called a "baseline").
Step 1: J(θ) = ∫ pθ(τ) R(τ) dτ. Take gradient: ∇J = ∫ ∇pθ(τ) R(τ) dτ = ∫ pθ(τ) ∇ log pθ(τ) R(τ) dτ = Eτ[∇ log pθ(τ) · R(τ)].
Step 2: log pθ(τ) = log p(s0) + ∑ log p(st+1|st,at) + ∑ log πθ(at|st). Only the last sum depends on θ: ∇ log pθ(τ) = ∑ ∇ log πθ(at|st).
Step 3 (Causality): ∇J = E[∑t ∇ log π(at|st) · (∑t'=tT γt'-trt')]. The inner sum = Qπ(st, at). ■
Step 4 (Baseline): Replacing Q with A = Q - V: the V(st) term contributes Ea[∇ log π · V(s)] = V(s) · 0 = 0. So A and Q give the same expected gradient, but A has lower variance (V(s) acts as a control variate). ■
Exam insight: The PG theorem appears in 3 forms: (1) with R(τ), (2) with Q(s,a), (3) with A(s,a). They're all the same gradient in expectation. Form (3) is what's implemented (lowest variance). Know all three and how to go between them.
REINFORCE works but it's inefficient. Each gradient estimate requires rolling out entire trajectories, and the variance is high even with baselines. The fundamental issue: we're using sampled returns to evaluate how good an action is. What if we could replace those noisy sampled returns with a learned estimate?
That's the actor-critic idea. Train two networks: an actor (the policy πθ) and a critic (a value function that estimates expected returns). The critic provides low-variance gradient signals to the actor. The actor provides trajectories for the critic to learn from.
Before anything else, you need to have these three functions and their relationships burned into memory. They will appear on every exam in some form.
Vπ(s) is the state value function. It answers: "How much total discounted reward do I expect to get, starting from state s and following policy π forever?" It's the "how good is this state?" number.
Qπ(s, a) is the action value function. It answers: "How much total reward do I expect if I start in state s, take action a (possibly different from what π would choose), and then follow π from the next state onward?" It's "how good is this action in this state?"
Aπ(s, a) is the advantage function. It answers: "How much better is action a compared to the average action at state s?" Positive advantage means the action is above average; negative means below average. The advantage is the signal the actor uses to update — reinforce above-average actions, discourage below-average ones.
The critic needs a training target — what should V(s) equal? There are three approaches, and understanding their tradeoffs is essential for the exam.
| Method | Target for V(s) | Bias | Variance | Data Needed |
|---|---|---|---|---|
| Monte Carlo (MC) | ∑t'=tT γt'−t rt' (actual return) | Unbiased | High | Full trajectory |
| TD(0) | rt + γ V(st+1) (bootstrap) | Biased (if V wrong) | Low | Single transition |
| n-step | ∑k=0n−1 γk rt+k + γn V(st+n) | Less biased | Medium | n transitions |
Monte Carlo: Use the actual observed return from the trajectory. No approximation, so unbiased. But if the trajectory is long (T = 1000), the return includes 1000 random variables, so variance is huge.
TD(0) (Temporal Difference): Use only the immediate reward plus the critic's own estimate of the next state. This bootstraps — it uses V(s') to estimate V(s). If V is wrong, the target is biased. But it only depends on one random variable (rt), so variance is low.
n-step: A compromise. Use n actual rewards, then bootstrap:
At n = 1, this is TD(0). At n = T, this is MC. You can slide between bias and variance by choosing n.
With a learned critic, the policy gradient becomes:
Where the advantage is estimated using the TD error:
This is the TD advantage: the actual reward plus the discounted value of the next state minus the value of the current state. If this is positive, the action led to a better-than-expected next state — reinforce it. If negative, the action was worse than expected — discourage it.
(a) TD target for V(s1):
(b) MC return from s1 (assuming the trajectory ends at s3):
(This is actually a 2-step return, not a pure MC return. A pure MC would go until the episode ends. But for a finite trajectory, this is the standard calculation.)
(c) TD advantage A(s1, a1):
The advantage is strongly positive: taking action a1 in state s1 led to state s2 (value 8), which is much better than the expected value of state s1 (value 5). The actor should reinforce this action.
Notice the TD target (10.2) is much higher than V(s1) = 5, meaning our critic underestimates s1. The critic loss would push V(s1) upward toward 10.2.
A 10-state chain with random rewards. The orange line = true values (computed analytically). Teal dots = estimated values using n-step returns. Drag the n slider: n=1 is TD(0) (low variance, may be biased), n=10 is MC (unbiased, high variance). Watch the estimates wobble with MC and stabilize (but shift) with TD.
Here's the problem with on-policy methods like vanilla policy gradient and A2C: the moment you update the policy parameters θ → θ', all your collected data becomes stale. The data was collected under πθ, but now you need data from πθ'. You must throw away your data and collect new trajectories. Every single gradient step requires fresh rollouts.
This is astronomically wasteful. If each rollout takes 10 seconds and you want to do 100,000 gradient steps, that's a million seconds of data collection — even though most of that data is nearly identical (the policy only changed slightly).
Off-policy methods fix this by reusing old data. The key mathematical tool is importance sampling.
Suppose you collected a transition (s, a, r, s') using an old policy πold. You want to use it to estimate the gradient for a new policy πnew. The importance sampling correction is:
The ratio ρ = πnew(a|s) / πold(a|s) reweights each sample. If the new policy thinks action a is twice as likely as the old policy did, the sample counts double. If the new policy thinks a is half as likely, the sample counts half.
For policy gradients, the off-policy objective becomes:
Importance sampling is mathematically correct but practically dangerous. The ratio ρ = πθ'(a|s) / πθ(a|s) can become arbitrarily large. If πold(a|s) = 0.01 and πnew(a|s) = 0.5, the ratio is 50. A single sample gets multiplied by 50, dominating the entire gradient estimate. The variance explodes.
Even worse: for trajectories, the ratios multiply across timesteps. A trajectory of length T has a combined ratio of ∏t=0T (πnew(at|st) / πold(at|st)). If each ratio is 2, the product is 2T. For T = 100, that's 2100 ≈ 1030. Numerical catastrophe.
Trust Region Policy Optimization (TRPO) constrains how far the new policy can move from the old one:
The KL divergence constraint ensures the ratio stays near 1 (since if the distributions are similar, the ratio is close to 1). This works great in theory but requires computing the KL constraint at every step, which involves second-order optimization (the Fisher information matrix). Expensive and hard to implement.
Proximal Policy Optimization (PPO) replaces the KL constraint with a simple clipping trick:
Where ρt = πθ'(at|st) / πθ(at|st) and ε is typically 0.2.
Let's unpack the min. There are two cases:
Case 1: Advantage is positive (A > 0). This action is good — we want to increase its probability. The ratio ρ grows as we increase πnew(a|s). But clip(ρ, 0.8, 1.2) caps the benefit at ρ = 1.2. Beyond that, the gradient is zero. We can't over-exploit one good action.
Case 2: Advantage is negative (A < 0). This action is bad — we want to decrease its probability. The ratio ρ shrinks. But clip caps it at ρ = 0.8. We can't over-penalize one bad action.
Soft Actor-Critic (SAC) is the other major off-policy algorithm, but with a twist: it adds an entropy bonus to the reward:
Where H(π) = −E[log π(a|s)] is the entropy of the policy and α controls the entropy weight.
Why add entropy? Two reasons:
1. Exploration. Maximizing entropy prevents the policy from collapsing to a deterministic action too early. It keeps exploring diverse actions, which helps avoid local optima.
2. Robustness. A high-entropy policy is more robust to perturbations. If one action path gets blocked, the policy has mass on alternative paths.
SAC uses a replay buffer (off-policy), learns Q-functions (two of them, to reduce overestimation), and outputs a stochastic policy (squashed Gaussian). It's the go-to algorithm for continuous control with sample efficiency.
(a) IS ratio:
(b) Unclipped objective:
(c) PPO clipped objective:
Since A > 0, PPO takes min(ρ · A, clip(ρ, 0.8, 1.2) · A):
PPO caps the objective at 2.4 instead of 6.0. The gradient through the clipped term is zero with respect to θ' beyond the clip point, preventing the policy from jumping too far.
(d) Is ρ = 3.0 dangerously large? Yes. The new policy is 3× more likely to take this action than the old policy. For a single transition, this is manageable. But for a trajectory of length T, the ratios multiply: 3T. Even at T = 20, that's 320 ≈ 3.5 × 109. That's why PPO clips — and why practitioners only reuse data for a few (3-5) gradient steps before recollecting.
Set πold and πnew for a single action. Orange bar = unclipped ρ · A. Teal bar = clipped PPO objective. The gray region shows the clipping bounds [1−ε, 1+ε].
Every method so far learns a policy πθ(a|s) explicitly — a network that directly outputs actions. Q-learning takes a fundamentally different approach: learn the optimal value of every action, then just pick the best one. No policy network at all.
The key insight: if you knew Q*(s, a) — the optimal action-value function — you wouldn't need a policy network. You'd just act greedily:
"At every state, pick the action with the highest Q-value." The policy is implicit in the Q-function. This is the value-based approach to RL.
Q* satisfies a recursive relationship called the Bellman optimality equation:
In words: the optimal value of taking action a in state s equals the immediate reward plus the discounted value of the best action in the next state. This is a fixed-point equation — Q* is the unique function that satisfies it.
Let's unpack each piece:
| Term | Meaning |
|---|---|
| Q*(s, a) | Optimal value of taking action a in state s (what we're learning) |
| r(s, a) | Immediate reward for taking action a in state s |
| γ | Discount factor (0 to 1) — how much we care about future rewards |
| maxa' Q*(s', a') | Value of the best action in the next state (bootstrapping) |
For small state-action spaces, we can store Q as a table. The update rule for each observed transition (s, a, r, s'):
Where α is the learning rate. The term in brackets is the TD error: the difference between the Bellman target (r + γ max Q(s', a')) and the current estimate Q(s, a). We nudge Q toward the target by a fraction α.
With enough exploration (visiting every state-action pair infinitely often) and a decaying learning rate, tabular Q-learning provably converges to Q*.
For large state spaces (images, continuous states), we replace the table with a neural network Qφ(s, a). The loss function:
This looks like simple regression: predict Q(s,a), use the Bellman target as the label. But there are three problems that make naive DQN unstable, and three tricks that fix them:
Problem: consecutive transitions are correlated (s1 → s2 → s3 are temporally adjacent). Neural networks trained on correlated data overfit to recent patterns and forget old ones.
Fix: store all transitions in a large replay buffer D. Sample mini-batches uniformly from D. This breaks temporal correlations — your mini-batch might contain transitions from 10 different episodes at different points in training.
Problem: the Bellman target r + γ max Qφ(s', a') uses the same network Qφ that we're training. When we update φ, the target changes too — we're chasing a moving target. This causes oscillations and divergence.
Fix: use a separate target network Qφ' with frozen parameters. Update φ' only periodically (copy φ to φ' every C steps) or slowly (Polyak averaging: φ' ← τφ + (1−τ)φ'). The target is stable between updates.
Problem: the max operator in maxa' Q(s', a') systematically overestimates Q-values. If Q has noise (estimation error), max picks the action with the highest noise realization, biasing the estimate upward. This is the "max of noisy estimates > true max" phenomenon.
Fix: Double DQN decouples action selection from action evaluation. Use the online network to select the action, but the target network to evaluate it:
Since the selection and evaluation use different networks (with different noise patterns), the overestimation bias is significantly reduced.
Step 1: Compute the Bellman target.
Step 2: Compute the TD error.
Step 3: Apply the update.
The Q-value for (s1, Right) increased from 4.0 to 5.7. Makes sense: taking Right in s1 gives reward 2 and transitions to s2, whose best action (Left) has value 6. The discounted future value 0.9 × 6 = 5.4 plus the immediate reward 2 = 7.4, which is higher than the old estimate of 4. So Q moves upward.
Notice: only Q(s1, R) changed. All other Q-values remain the same. Q-learning does pointwise updates — one (s, a) pair per transition.
A 3×3 grid world with a goal (top-right, reward +10) and a pit (bottom-left, reward −5). Click Step to perform one Q-learning update from a random transition. Click Run 100 to watch the Q-values converge. The arrows show the greedy policy (argmax Q). Warmer colors = higher max Q.
| Q-Learning | SARSA | |
|---|---|---|
| Update target | r + γ maxa' Q(s', a') | r + γ Q(s', a'actual) |
| On/off-policy | Off-policy | On-policy |
| Evaluates | Greedy (best) action | The action the policy actually takes |
| Consequence | Learns optimal Q* regardless of behavior | Learns Qπ for the current policy |
| Safety | May learn unsafe greedy paths | Respects exploration policy (safer) |
The Bellman optimality operator is: (TQ)(s,a) = r(s,a) + γ maxa' Q(s', a'). Q-learning converges because T is a contraction in the infinity norm.
Your task: (1) Prove ||TQ1 - TQ2||∞ ≤ γ ||Q1 - Q2||∞. (2) Use the Banach fixed-point theorem to conclude Q-learning converges to a unique Q*. (3) What is the convergence rate? After k iterations, how close is Qk to Q*?
Proof of contraction:
For any (s,a): |(TQ1 - TQ2)(s,a)| = γ|maxa' Q1(s',a') - maxa' Q2(s',a')|
≤ γ maxa' |Q1(s',a') - Q2(s',a')| (triangle inequality for max)
≤ γ sup(s,a) |Q1(s,a) - Q2(s,a)| = γ ||Q1 - Q2||∞
Taking sup over (s,a) on the left: ||TQ1 - TQ2||∞ ≤ γ ||Q1 - Q2||∞. ■
Convergence: Since γ < 1, T is a γ-contraction. By Banach: (1) unique fixed point Q* exists, (2) Qk = TkQ0 → Q*, (3) rate: ||Qk - Q*|| ≤ γk/(1-γ) ||TQ0 - Q0||.
Exam insight: This proof is a classic exam question. The key steps: (1) property of max, (2) apply it, (3) invoke Banach. Also know: this only works for tabular Q-learning. With function approximation (DQN), the contraction property breaks because the projection step (fitting a neural network) is not a non-expansion. This is the "deadly triad" problem.
Here's a scenario that comes up constantly in the real world: you have a massive dataset of past decisions — hospital treatment records, autonomous driving logs, recommendation system clickstreams — but you cannot deploy a new policy to collect more data. The patients have already been treated. The cars have already driven. You need to learn the best policy you can from this fixed dataset, without any additional interaction.
This is offline RL (also called batch RL). You have a dataset D = {(s, a, r, s')} collected by some behavior policy πβ, and your job is to extract the best possible policy π from this data alone. Sounds like supervised learning, right? Just imitate the best actions in the dataset?
Not quite. The behavior policy might be mediocre — a mix of novice and expert drivers, doctors following outdated guidelines. We want to do better than the data collector. But this is where offline RL breaks in a spectacular and specific way.
Consider a standard Q-learning update. We pick the action that maximizes Q(s, a) at the next state. But what if that maximizing action was never taken in the dataset? The Q-network has never seen a training signal for (s', amax), so its Q-value there is essentially a random guess — and random neural network outputs tend to be overestimates.
Now it gets worse. The Bellman backup propagates these overestimates. Q(s, a) gets a target that includes maxa' Q(s', a'), which is inflated. Next iteration, that inflated Q(s, a) propagates to other states. The errors compound exponentially. Your Q-function develops "hallucinated" high-value regions in parts of the state-action space the dataset never visited.
Advantage Weighted Regression (AWR) sidesteps the problem entirely by never evaluating out-of-distribution actions. Instead of learning Q-values and picking argmax, AWR treats the problem as weighted behavior cloning:
Where A(s,a) = Q(s,a) - V(s) is the advantage, and β is a temperature parameter. Translation: clone the behavior policy, but weight each action by how much better it was than average. Good actions (positive advantage) get upweighted exponentially. Bad actions get downweighted. The policy stays close to the data distribution because it's literally doing weighted imitation.
Implicit Q-Learning (IQL) takes a different, more elegant approach. The core insight: the reason Q-learning fails offline is that the Bellman target maxa Q(s', a) requires evaluating Q at actions that might be OOD. What if we could compute a value function V(s) that approximates the max without ever querying Q at unseen actions?
IQL does this in two steps:
The key equation is the expectile loss:
Where u = Q(s,a) - V(s). When τ = 0.5, this is ordinary MSE — V learns the mean of Q. When τ < 0.5, the loss penalizes overestimates more than underestimates. So V(s) is pulled below the mean Q — it becomes a conservative lower bound. When τ > 0.5, V chases the high end — approaching maxa Q(s,a), which is what we'd want in online RL but is dangerous offline.
Solution (a): Q(s, a2) = 8 was never trained on real (s, a2, r, s') transitions. It's a random extrapolation by the neural network. Standard Q-learning would set its target as r + γ max(Q(s', ·)), and if that max happens at an OOD action in s', the overestimate propagates. A policy using argmaxa Q(s, a) would choose a2 — an action that might not even make physical sense — because of this hallucinated high value.
Solution (b): We only use in-data actions, so u = Q(s, a1) - V(s) = 5 - 6 = -1. Since u < 0, the indicator 1(u < 0) = 1, so the weight is |τ - 1| = |0.3 - 1| = 0.7.
Solution (c): Since u = -1 < 0, the weight is 0.7 (high penalty for overestimating). The gradient pushes V(s) down toward Q(s, a1) = 5. With τ = 0.3, V will converge below the mean of in-data Q-values — a conservative estimate. If there were multiple in-data actions, V would settle at roughly the 30th percentile of Q-values.
Conversely, if u were positive (V too low), the weight would be |0.3 - 0| = 0.3 — a weaker push upward. This asymmetry is exactly what makes IQL conservative.
Adjust τ to see how the expectile loss changes. With τ < 0.5, overestimates (V > Q) are penalized more heavily, pushing V to be conservative. With τ > 0.5, underestimates are penalized more, pushing V toward the max. The orange bars show Q-values for in-data actions; the teal line shows V(s).
| Method | Core Idea | OOD Handling | Key Hyperparameter |
|---|---|---|---|
| BCQ | Only consider actions similar to data | Generative model filters actions | Similarity threshold |
| CQL | Penalize Q-values for OOD actions | Regularizer pushes Q down on unseen (s,a) | Regularization strength α |
| AWR | Weighted behavior cloning | Never evaluates OOD — just reweights data | Temperature β |
| IQL | Expectile regression + AWR extraction | V fitted conservatively on in-data Q only | Expectile τ, temperature β |
Behavioral Cloning suffers O(εT2) regret due to compounding errors. DAgger achieves O(εT). This is the key theoretical advantage.
Your task: (1) State the no-regret property formally: after N rounds of DAgger, the policy πN has cost at most minπ∈Π J(π) + O(1/√N) under the LEARNER's distribution. (2) Prove this by showing DAgger is an instance of Follow-the-Leader in an online learning game. (3) Explain why this breaks the quadratic compounding: what assumption about the training distribution does BC violate that DAgger satisfies?
Formal statement (Ross, Gordon, Bagnell 2011): Let CN = (1/N)∑i=1N Es~dπi[ℓ(πN, π*, s)] be the average cost of πN over all visited distributions. Then:
CN ≤ minπ∈Π (1/N)∑i Edπi[ℓ(π, π*, s)] + O(1/√N)
Why O(εT) not O(εT2): At convergence (large N), πN achieves error ε under its OWN state distribution. Each step incurs ε error independently (no compounding because training matches deployment). Total: T·ε. In BC, step t incurs error proportional to t·ε (accumulated drift), giving ∑t tε = εT2/2.
The key assumption DAgger satisfies: The training data covers the learner's actual state distribution. BC violates this (trains on expert states, evaluated on learner states). This "covariate shift" is the root cause of quadratic regret.
Exam insight: "Why does BC compound errors quadratically?" → Distribution shift: policy visits states not in training data. "How does DAgger fix this?" → Iteratively collects data under the learner's distribution. "What's the tradeoff?" → DAgger requires online expert access (can't use a fixed dataset).
The problem: The Bellman backup uses maxa' Q(s', a'), which evaluates Q at actions the dataset never contained — these out-of-distribution Q-values are wildly overestimated (the network hallucinating high value for unseen actions), causing the policy to chase phantom rewards.
CQL fix: Add a regularizer that pushes Q-values DOWN for actions not in the dataset: minQ Ea~π[Q(s,a)] - Ea~data[Q(s,a)] + standard Bellman loss. This makes out-of-distribution actions have pessimistically low Q-values.
IQL fix: Never evaluate Q at out-of-distribution actions AT ALL. Fit V(s) using expectile regression on Q-values from the dataset, then extract a policy via advantage-weighted regression: π ∝ πβ · exp(A/β). The advantage A = Q - V is computed only at dataset actions.
Every RL algorithm assumes a reward function r(s, a) exists. But who writes that function? For Atari games, the reward is the score — easy. For "make a robot set a table nicely" or "write a helpful email," defining reward mathematically is a nightmare. You know a good table setting when you see one, but translating that into r(s, a) is an open research problem.
Worse, hand-designed rewards get hacked. A classic example: a boat racing agent rewarded for collecting speed boosts learned to drive in circles collecting boosts, never finishing the race. The reward was technically maximized, but the behavior was absurd. This is reward misspecification — the optimizer finds a way to score high that violates the designer's intent.
The solution: learn the reward function from human input. There are two main approaches, and they use fundamentally different types of supervision.
Inverse Reinforcement Learning (IRL) starts from expert demonstrations and asks: "What reward function would make this expert behavior optimal?" Given a dataset of expert trajectories Dexpert = {τ1, τ2, ...}, IRL finds r such that the expert policy scores higher than any other policy under r.
The problem is ill-defined — many reward functions explain the same behavior (including r = 0 everywhere). Modern approaches like adversarial IRL frame it as a GAN-like game: a discriminator tries to distinguish expert trajectories from agent trajectories, and the reward is derived from the discriminator's confidence.
Instead of requiring full demonstrations, we can learn from pairwise preferences. Show a human two trajectory segments τA and τB, and ask: "Which one is better?" This is much easier for humans — you don't need to be an expert to say "that robot arm motion looks smoother than that one."
The Bradley-Terry model converts these preferences into a reward learning objective. It models the probability that trajectory τw is preferred over τl as:
Where σ is the sigmoid function, and r(τ) = ∑t r(st, at) is the total reward along the trajectory. The loss for training the reward model is the negative log-likelihood of observed preferences:
This is just binary cross-entropy — the same loss used in logistic regression. The reward model is trained to assign higher total reward to the trajectory the human preferred.
Reinforcement Learning from Human Feedback (RLHF) chains these ideas together in a three-phase pipeline:
The KL constraint in Phase 3 is crucial: without it, the policy could exploit weaknesses in the learned reward model (reward hacking again, but now hacking the learned reward). The objective becomes:
Where πref is the original pre-trained policy and β controls how far the optimized policy can drift.
Solution (a): P(τ2 ≻ τ1) = σ(r(τ2) - r(τ1)) = σ(5 - 3) = σ(2).
σ(2) = 1 / (1 + e-2) = 1 / (1 + 0.135) = 1 / 1.135 ≈ 0.881.
The model is 88.1% confident that τ2 is better — which matches the human's judgment. Good.
Solution (b): L = -log σ(2) = -log(0.881) ≈ 0.127.
The loss is low because the model's prediction aligns with the human label. The gradient here is small — not much to learn.
Solution (c): Now P(τ2 ≻ τ1) = σ(3.1 - 3) = σ(0.1) = 1 / (1 + e-0.1) ≈ 1 / 1.905 ≈ 0.525.
L = -log(0.525) ≈ 0.644. The loss is 5x higher. The model is barely more confident than a coin flip that τ2 is better, but the human clearly said it was. The gradient here is large — the reward model needs to push r(τ2) higher and r(τ1) lower to separate them.
Adjust the reward scores for two trajectories. The sigmoid curve shows how reward difference maps to preference probability. Watch the loss and gradient magnitude change — close scores mean high loss and strong learning signal.
| From Demonstrations (IRL) | From Preferences (RLHF) | |
|---|---|---|
| Human input | Full expert trajectories | Pairwise "which is better?" labels |
| Requires expert? | Yes — demonstrator must be skilled | No — any human can compare two options |
| Scalability | Low — demonstrations are expensive | Higher — comparisons are cheap and parallelizable |
| Ill-posedness | Severe — many rewards explain same behavior | Milder — preferences provide relative signal |
| Reward hacking risk | Moderate | Managed via KL constraint to reference policy |
Every RL algorithm we've discussed so far is model-free: the agent interacts with the environment, collects rewards, and learns a policy or value function without ever trying to understand how the world works. That's like learning to drive by trial and error with zero understanding of physics. It works eventually, but you'll crash a lot.
Model-based RL (MBRL) flips the script. First, learn a dynamics model — a function pθ(s'|s, a) that predicts what state comes next given the current state and action. Then use this model to either generate synthetic training data or plan ahead by simulating future trajectories in your head.
The payoff is massive: typically 10-100x fewer real environment interactions. The cost: you need to manage model error, because wrong predictions compound catastrophically.
The simplest MBRL framework is Dyna, introduced by Sutton. The idea is beautifully straightforward:
For every 1 real environment step, Dyna does K imagined steps. If K = 10, the policy sees 11x as much data but the robot only moved once. That's the data efficiency win.
Here's the fundamental tension. Suppose your model has 5% prediction error per step — pretty good for a neural network. After 1 step, your predicted state is 5% off. After 2 steps, the error compounds: you're predicting the next state from an already-wrong state. After H steps:
For ε = 0.05 and H = 20 steps: linear gives 100% error (useless); multiplicative gives 0.05 × 2.65 ≈ 13.3% error. Either way, long rollouts in a learned model produce garbage. The policy trains on this garbage and learns fictional strategies that fail in the real world.
Model-Based Policy Optimization (MBPO) solves the compounding error problem with two key ideas:
1. Short branched rollouts. Instead of rolling out from a random initial state for 100 steps, MBPO starts from a real state sampled from the replay buffer and rolls out for only k steps (typically k = 1 to 5). This limits how far the imagined trajectory drifts from reality.
2. Ensemble of models. Train N models (typically 5-7) on the same data. When models disagree on a prediction, that state-action pair is high-uncertainty. MBPO uses the ensemble for rollouts and can detect when imagined data becomes unreliable.
Model Predictive Control (MPC) takes a completely different approach: instead of using the model to train a policy, use it to plan at decision time. At each timestep:
The receding horizon trick is essential: we only execute the first action and then replan from the new real state. Why? Because the model gets less accurate further into the future. By replanning every step, we always use the model's most accurate near-term predictions.
Pure MPC struggles with long-horizon tasks because the planning horizon H is limited by model accuracy. A clever fix: after rolling out H steps in the model, add a terminal value V̂(sH) that estimates the value of the final state. The total score for a sequence becomes:
Where V̂ is a learned value function (trained model-free on real data). This lets MPC reason about long-term consequences without needing a model accurate enough for long rollouts. Short model rollout + learned long-term value = best of both worlds.
Solution (a): True dynamics: st+1 = st + 1. After 10 steps: s10 = 0 + 10 × 1 = 10.
Solution (b): Model prediction: ŝt+1 = ŝt + 1 + 0.1. The bias accumulates linearly. After 10 steps: ŝ10 = 0 + 10 × (1 + 0.1) = 10 + 10 × 0.1 = 11.
Solution (c): Error = |11 - 10| = 1. That's 10% of the true state — from a model with only 0.1 bias per step. After 100 steps, it would be 10. This linear accumulation is the best case — multiplicative errors compound even faster. MBPO limits rollouts to k ≤ 5 steps precisely to keep this compounding manageable.
Adjust the per-step model error and rollout length. The teal line shows the true trajectory; the orange line shows the model's prediction. Watch how quickly they diverge with longer rollouts.
| Approach | Uses Model For | Strength | Weakness |
|---|---|---|---|
| Dyna | Synthetic data generation | Simple, works with any policy learner | Long rollouts → compounding error |
| MBPO | Short branched rollouts + SAC | State-of-art data efficiency | Needs ensemble overhead, short horizon |
| MPC | Online planning at each step | No policy network needed, replans constantly | Slow at test time (N rollouts per step) |
| MPC + V̂ | Short planning + learned value | Long-horizon reasoning with short rollouts | Needs both model and value function |
python import numpy as np def compute_gae(rewards, values, dones, gamma=0.99, lam=0.95): T = len(rewards) advantages = np.zeros(T) last_gae = 0.0 for t in reversed(range(T)): # If episode ended at t, next value is 0 next_non_terminal = 1.0 - dones[t] next_value = values[t + 1] * next_non_terminal # TD error: r + gamma*V(s') - V(s) delta = rewards[t] + gamma * next_value - values[t] # GAE recursion (zeroed across episode boundaries) last_gae = delta + gamma * lam * next_non_terminal * last_gae advantages[t] = last_gae return advantages # Returns = advantages + values (for value function target) # returns = advantages + values[:T]
A goal-conditioned model-based algorithm: (1) Learn a dynamics model s' = f(s, a). (2) Given a goal g, plan through the model to find action sequence that reaches g. (3) Execute the first action, re-plan from the new state.
This is literally MPC (Model Predictive Control) with a goal-reaching objective. The advantage: you can reach ANY goal without retraining — just change the planning objective. This is more flexible than a goal-conditioned policy (which needs to be trained on goals) and more sample-efficient than model-free GCRL.
Real example: Visual Foresight (Finn & Levine 2017) learns a video prediction model, then plans actions that make the predicted video match a goal image. Modern version: Dreamer with HER-style goal relabeling. The model provides the "imagination" for planning; the goal provides the "what to plan toward."
Standard RL trains a policy for one task with one reward function. But real agents need to handle many tasks. A household robot needs to set the table, wash dishes, and fetch objects. Training a separate policy for each is wasteful — the physics of grasping is the same whether you're picking up a mug or a plate.
Multi-task RL trains a single policy conditioned on a task identifier zi. The policy becomes π(a | s, zi), and the state effectively becomes the pair (s, zi). Different task identifiers activate different behaviors while sharing the underlying representations.
Goal-conditioned RL (GCRL) is the most natural version of multi-task RL. The task identifier is simply a goal state sg — the state you want to reach. The policy is π(a | s, sg), and the reward is typically:
Dense rewards give the agent a gradient to follow at every step ("you're getting warmer"). Sparse rewards are more natural but harder to learn from ("you either reached the goal or you didn't"). Most real-world goals are sparse — the mug is either on the coaster or it isn't.
HER is one of the cleverest tricks in RL. The insight: even when the agent fails to reach its intended goal, it did reach some state. Why not pretend that state was the goal all along?
Concretely: the agent tries to reach goal g = (5, 5) but ends up at (3, 2). With the original goal, the entire episode has reward 0 (failure). But if we relabel the goal to g' = (3, 2), then the final transition gets reward 1 (success!). The agent learns "here's how to reach (3, 2)" — and that knowledge transfers to reaching other nearby goals.
HER has three requirements that are easy to forget on an exam:
| Requirement | Why |
|---|---|
| Same dynamics across goals | Relabeling changes the goal, not the physics. If different goals had different transition dynamics, the relabeled transition would be invalid. |
| Evaluatable reward | We need to recompute r(s, a, g') for the new goal g'. This requires knowing the reward function — we can't relabel if reward is a black box. |
| Off-policy algorithm | Relabeled data was collected under a policy pursuing a different goal. Only off-policy methods can learn from data generated by a different behavioral distribution. |
Solution (a): Distances at each step:
t=0: ||((0,0) - (5,5)|| = √(50) ≈ 7.07 → reward = -7.07
t=1: ||(1,1) - (5,5)|| = √(32) ≈ 5.66 → reward = -5.66
t=2: ||(2,1) - (5,5)|| = √(25) = 5.0 → reward = -5.0
t=3: ||(3,2) - (5,5)|| = √(13) ≈ 3.61 → reward = -3.61
No bonus (never reached g). Total ≈ -21.34. All negative, all failure.
Solution (b): With g' = (3, 2): at t=3, ||(3,2) - (3,2)|| = 0 → reward = 0 + 10 = 10. Success!
Solution (c): The "final" strategy relabels with sT = (3, 2). Each of the 4 transitions in the episode gets a copy with g' = (3, 2) and recomputed rewards. So HER creates 4 additional transitions (one per timestep, each paired with the new goal). Combined with the 4 original transitions, we now have 8 total — and 4 of them end in success instead of 0.
The agent (circle) tried to reach the orange goal but ended up at its final position. Click Relabel to see HER create a new teal goal at the achieved position, turning failure into success. Click New Episode to generate a random trajectory.
A (fixed dataset, no env): Offline RL. IQL for simplicity + stability, CQL if you need guaranteed pessimism. NEVER use SAC/DQN online — they require env interaction. BC works if dataset is expert-quality; offline RL is better for mixed-quality datasets.
B (expensive env, 1000 episodes): Model-based RL (MBPO, Dreamer). Learn a world model from real data, generate synthetic rollouts for policy training. 10-100x more sample efficient than model-free. Alternatively: a few demos + fine-tuning (warm-start with BC, then improve with RL).
C (fast sim, continuous, 10M steps): SAC (off-policy, entropy regularization, continuous actions) or PPO (on-policy, simpler, more robust). SAC is more sample efficient. PPO is more stable and easier to tune. For robotics with Isaac Lab: PPO with 4096 parallel envs.
D (expert available, multi-modal): DAgger (iterative) or Diffusion Policy (handles multi-modality). BC fails for multi-modal because MSE averages modes. Diffusion Policy represents the full action distribution. IBC (implicit BC) also handles multi-modality.
E (sparse reward, goal-conditioned): HER (Hindsight Experience Replay) + goal-conditioned SAC. HER retroactively relabels failed episodes as "successes for a different goal," dramatically improving sample efficiency under sparse rewards.
F (safety-critical, stay near known): Offline RL with strong conservatism (CQL with high α), or constrained policy optimization (CPO). The key requirement: never try actions far from the dataset. Conservative methods provide this guarantee by construction.
A third tension runs beneath both: on-policy vs off-policy. On-policy (PPO) is stable but sample-wasteful. Off-policy (SAC, DQN) reuses data but risks divergence. Offline RL takes this to the extreme: ALL data is off-policy. Every algorithm in this course is a point in this 3D tradeoff space. Understanding WHERE each algorithm sits (and WHY) is the deepest understanding you can have of the field.
Place these algorithms in the 3D space (exploration strategy, bias-variance point, on/off-policy): PPO, SAC, DQN, CQL, REINFORCE, Dreamer. Which two are closest? Which two are furthest apart?
Time to test yourself. This is a randomized exam with questions drawn from all nine topics covered in this review. Each question is the kind you'd see on a real CS224R final — not trivia, but questions that test whether you truly understand the concepts.
The questions are intentionally hard. Some require computation. Some have trick answers that test whether you understand subtle distinctions (like on-policy vs off-policy, or what baselines actually do to the gradient). Some require you to trace through an algorithm step by step. This is the level you need to be at.
36 exam-quality questions across 9 topics.
Randomized each time. Detailed explanations for every answer.
The score bar above tracks your progress. Green segments = correct, red = incorrect. Use topic filters to drill weak areas. Questions repeat if you cycle through all of a topic's questions. Your score resets when you change the filter.
This is your one-page reference for the entire course. Print it, memorize it, take it into the exam in your head. Every algorithm, every key equation, every common trap — all in one place.
| Algorithm | Key Equation | Data | On/Off | Key Property |
|---|---|---|---|---|
| BC | min E[ℓ(πθ(s), aexpert)] | Expert demos | N/A | Supervised, no reward needed |
| DAgger | BC + query expert on visited states | Expert demos + online queries | N/A | Fixes distribution shift in BC |
| REINFORCE | ∇J = E[∑t ∇logπ(at|st) · R(τ)] | On-policy rollouts | On | High variance, unbiased |
| A2C | ∇J = E[∇logπ(a|s) · A(s,a)] | On-policy rollouts | On | Baseline reduces variance |
| PPO | L = min(rtAt, clip(rt, 1±ε)At) | On-policy (few reuses via IS) | On | Clipped IS for stability |
| SAC | max E[∑ r + αH(π)] | Replay buffer | Off | Entropy-regularized, explores well |
| DQN | L = (r + γ maxa' Qtarget(s',a') - Q(s,a))² | Replay buffer | Off | Target network stabilization |
| AWR | π = argmax E[exp(A/β) logπ(a|s)] | Fixed offline dataset | Off | Weighted BC — never evaluates OOD |
| IQL | Lτ = |τ - 1(u<0)| · u² | Fixed offline dataset | Off | Conservative V via expectile regression |
| Dyna | Train model pθ(s'|s,a), generate synthetic data | Real + imagined | Either | Data efficiency via imagination |
| MPC | a* = argmaxa0:H ∑ r(st,at) | Model-based planning | N/A | No explicit policy, replans each step |
| HER | Relabel g → g' = sachieved | Replay buffer + relabeled | Off | Turns failures into successes |
| Misconception | Truth |
|---|---|
| "Baselines change the expected gradient in policy gradient methods" | Baselines change variance only. E[∇logπ · b] = 0 for any state-dependent baseline b(s). The expected gradient is unchanged. |
| "Q-learning is on-policy" | Off-policy. The Bellman target uses maxa' Q(s', a') — the max over all actions, regardless of which action the behavior policy took. |
| "PPO is off-policy because it reuses data" | On-policy. PPO collects fresh rollouts each iteration. It reuses each batch for a few gradient steps via importance sampling, but the data is still on-policy. The clipping ensures updates stay close to on-policy. |
| "DAgger requires a reward function" | No reward needed. DAgger only needs an expert that can label states with the correct action. It's pure imitation learning with interactive expert queries. |
| "BC suffers from compounding error because of bad training data" | BC fails because of distribution shift: the learner visits states the expert never demonstrated, not because the expert data is bad. DAgger fixes this by querying the expert on the learner's visited states. |
| "Offline RL fails because datasets are too small" | Even huge datasets cause failure if the learned policy visits OOD state-action pairs. The problem is distributional shift, not sample size. |
| "Model-based RL is always more sample efficient" | Only if the model is accurate enough. A bad model generates misleading synthetic data that can hurt worse than no data at all. |
| "HER works with any RL algorithm" | HER requires off-policy algorithms only. Relabeled data was collected under a policy pursuing a different goal — it's off-policy by construction. |
| "In RLHF, the policy is trained directly on human preferences" | Two-stage: first train a reward model from preferences (Bradley-Terry), then optimize the policy using that reward model with PPO + KL constraint. |
| "Entropy bonus in SAC encourages exploitation" | Entropy bonus encourages exploration by penalizing deterministic policies. max H(π) = max randomness. |
| "IQL uses τ > 0.5 for conservative estimates" | τ < 0.5 is conservative (penalizes overestimates more). τ > 0.5 would be optimistic — chasing the max, which is dangerous offline. |
| "The discount factor γ only affects how far ahead the agent plans" | γ also affects the effective horizon and the variance of returns. Lower γ reduces variance but makes the agent myopic. Higher γ captures long-term rewards but increases variance. |
Each topic covered in this review has a full standalone lesson with deeper derivations, more simulations, and extended worked examples:
| Exam Topic | Full Lesson | Key Simulation |
|---|---|---|
| Imitation Learning | Imitation Learning | DAgger vs BC on distribution shift |
| Policy Gradients | Policy Gradients | REINFORCE variance reduction |
| Q-Learning / DQN | RL Algorithms | Q-table grid world with live updates |
| Actor-Critic / PPO / SAC | Deep RL Theory | PPO clipping visualization |
| Offline RL | (in review) | IQL expectile loss explorer |
| Reward Learning / RLHF | Reward Learning | Bradley-Terry preference model |
| Model-Based RL | Model-Based RL | Error compounding + MPC planner |
| Goal-Conditioned RL / HER | Multi-Task & Goal-Conditioned RL | HER relabeling visualization |
Policy Gradient Family
Φt = R(τ) for REINFORCE
Φt = A(st,at) for A2C
Φt = clipped IS · A for PPO
Bellman Equation
IQL Expectile Loss
Bradley-Terry
RLHF Objective
PPO Clip
SAC Objective
"What I cannot create, I do not understand." — Richard Feynman