Deep RL Exam Prep Workbook — Engineermaxxing

Chapter 0: Imitation Learning — Mode Collapse

You're training a self-driving race car. At a critical state s₁, expert drivers use two distinct safe racing lines around a barrier. The action space is continuous: a = (v, ω), where v ∈ [0, 5] is velocity and ω ∈ [−10, 10] is steering. The expert data has two equally likely modes:

Mode 1: v = 2.0, ω = 2.4 (Fast right turn)
Mode 2: v = 0.5, ω = −2.4 (Slow left turn)

Safety constraints: any action with |ω| ≤ 2.3 or |ω| ≥ 2.5 crashes, so the only safe steering magnitudes satisfy 2.3 < |ω| < 2.5. A "fast left turn" (v > 0.6, ω < 0) causes a rollover.

The mode-averaging trap. When a unimodal distribution (like a Gaussian) fits multimodal data, its mean falls between the modes — often in a region that neither expert would ever visit. This is the fundamental limitation of behavioral cloning with unimodal policies.

Exercise 0.1: Gaussian MLE Mean (velocity) Derive

Fit π(a | s₁) = N(μ, Σ) to the two expert modes via MLE. What is μ_v?

Derivation:

μ_v = (2.0 + 0.5) / 2 = 1.25

The MLE mean of a Gaussian is the arithmetic mean of the data points.

Exercise 0.2: Gaussian MLE Mean (steering) Derive

What is μ_ω for the same Gaussian MLE fit?

Derivation:

μ_ω = (2.4 + (−2.4)) / 2 = 0

The two steering values cancel perfectly. The mean action μ = (1.25, 0) has |ω| = 0 ≤ 2.3, causing a crash. The Gaussian collapses two safe modes into one unsafe average.
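The safety rules can be checked mechanically. A minimal sketch, assuming a hypothetical helper `classifyAction` that encodes only the two constraints stated in the problem (the crash band on |ω| and the rollover rule); the helper name and return labels are illustrative, not part of the workbook:

```javascript
// Hypothetical helper: encodes the two safety rules from the problem statement.
function classifyAction([v, w]) {
  const absW = Math.abs(w);
  if (absW <= 2.3 || absW >= 2.5) return "crash"; // steering outside the safe band
  if (v > 0.6 && w < 0) return "rollover";        // fast left turn
  return "safe";
}

const expertModes = [[2.0, 2.4], [0.5, -2.4]];

// Gaussian MLE mean = per-component arithmetic mean of the expert actions.
const mleMean = [0, 1].map(
  (i) => expertModes.reduce((sum, a) => sum + a[i], 0) / expertModes.length
);

console.log(expertModes.map(classifyAction));  // both expert modes are safe
console.log(mleMean, classifyAction(mleMean)); // the averaged action crashes
```

Both expert actions pass the check, while their average fails it, which is the mode-averaging trap in miniature.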

Exercise 0.3: Which Fixes Work? Trace

Which approaches avoid the mode-averaging crash? (A) Collect more data for each mode. (B) Use autoregressive discretized policy with bin centers {−10, −8, ..., 8, 10}. (C) Use flow-matching policy. (D) Train on only one maneuver family.

(A) and (B): more data + discretization
(B) and (C): autoregressive + flow matching
(C) and (D): flow matching can represent multimodality; single-mode data makes the problem unimodal
All four work

Reasoning:

(A) is wrong: more equal-weight data doesn't change the MLE mean of a Gaussian — the problem is the model family, not data quantity.

(B) is wrong: the closest bin centers to ω = ±2.4 are ω = ±2, and |ω| = 2 ≤ 2.3 still crashes. Discretization granularity is too coarse.

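(C) is correct: a flow-matching policy can represent a multimodal action distribution, so it can place probability mass on both expert modes instead of on their average.

(D) is correct: if you only keep one mode, MLE recovers that safe mode.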
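The argument against option (B) can be verified by snapping each expert steering value to its nearest bin center. A minimal sketch, assuming nearest-center rounding is how the discretized policy would represent the expert actions; `nearestBin` is an illustrative helper, not part of the workbook:

```javascript
// Bin centers {-10, -8, ..., 8, 10} from option (B).
const binCenters = Array.from({ length: 11 }, (_, i) => -10 + 2 * i);

// Snap a steering value to its nearest bin center (illustrative helper).
function nearestBin(w) {
  return binCenters.reduce((best, c) =>
    Math.abs(c - w) < Math.abs(best - w) ? c : best
  );
}

for (const w of [2.4, -2.4]) {
  const snapped = nearestBin(w);
  const crashes = Math.abs(snapped) <= 2.3 || Math.abs(snapped) >= 2.5;
  console.log(w, "->", snapped, crashes ? "crash" : "safe");
}
```

With a bin width of 2, no bin center falls inside the safe band 2.3 < |ω| < 2.5, so every discretized steering value crashes.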

Exercise 0.4: MLE With 3 Modes Derive

New scenario: 3 equally likely modes at s₂: (v, ω) = {(1.0, 3.0), (2.0, −1.0), (3.0, 3.0)}. What is μ_v for a Gaussian MLE?

Derivation:

μ_v = (1.0 + 2.0 + 3.0) / 3 = 2.0

Exercise 0.5: Policy Expressivity Bug Debug

This behavioral cloning code fits a Gaussian to bimodal expert data. Which line causes the unsafe mode-averaging behavior?

expert_actions = [(2.0, 2.4), (0.5, -2.4)]
mu = np.mean(expert_actions, axis=0)  # MLE mean
sigma = np.cov(expert_actions, rowvar=False)
policy = MultivariateNormal(mu, sigma)
action = policy.sample()  # deploy

Exercise 0.6: Build MLE Build

Implement gaussianMLE that takes an array of 2D action vectors and returns the MLE mean [μ_v, μ_ω].

Solution:

javascript
function gaussianMLE(actions) {
  const n = actions.length;
  let sv = 0, sw = 0;
  for (const [v, w] of actions) { sv += v; sw += w; }
  return [sv / n, sw / n];
}

The formula. Gaussian MLE mean = arithmetic mean of data. If the data is multimodal, this average can land in an unsafe region that no expert ever visited. Flow matching and mixture models avoid this.

Chapter 1: Flow Matching Arithmetic

Your flow-matching policy learned a vector field v_θ(s_t, a_{t,τ}, τ) that denoises actions. During inference, you start from Gaussian noise and integrate the ODE using Euler steps. Can you trace through the arithmetic?

Flow matching ODE: da_{t,τ}/dτ = v_θ(s_t, a_{t,τ}, τ)
Euler integration: a_{t,τ+1/n} = a_{t,τ} + (1/n) · v_θ(s_t, a_{t,τ}, τ)

Exercise 1.1: Euler Step 1 Derive

Start from noise a_{t,0} = (1.5, −0.4). Step size Δτ = 0.5. Network outputs v_θ(s_t, a_{t,0}, 0) = (−1.0, −2.0). Compute the velocity component of a_{t,0.5}.

Hint: a_{t,0.5} = a_{t,0} + 0.5 · v_θ; take the v-component (first element).

Derivation:

a_{t,0.5}[v] = 1.5 + 0.5 × (−1.0) = 1.5 − 0.5 = 1.0

Exercise 1.2: Euler Step 1 (steering) Derive

Same setup. Compute the ω-component of a_{t,0.5}.

Derivation:

a_{t,0.5}[ω] = −0.4 + 0.5 × (−2.0) = −0.4 − 1.0 = −1.4

Exercise 1.3: Euler Step 2 → Final Action Derive

From a_{t,0.5} = (1.0, −1.4), the network outputs v_θ(s_t, a_{t,0.5}, 0.5) = (−1.1, −1.9). Compute a_{t,1}[v] (the final velocity).

Derivation:

a_{t,1}[v] = 1.0 + 0.5 × (−1.1) = 1.0 − 0.55 = 0.45

Exercise 1.4: Final Steering Derive

Compute a_{t,1}[ω] (the final steering value).

Derivation:

a_{t,1}[ω] = −1.4 + 0.5 × (−1.9) = −1.4 − 0.95 = −2.35

Exercise 1.5: Safety Check Trace

The final action is (0.45, −2.35). Given constraints: |ω| ≤ 2.3 or |ω| ≥ 2.5 → crash; v > 0.6 with ω < 0 → rollover. What happens?

Crash: |ω| is in the danger zone
Rollover: fast left turn
Safe navigation: |ω| = 2.35 is between 2.3 and 2.5 (safe), and v = 0.45 ≤ 0.6 (no rollover)

Exercise 1.6: Build Euler Integration Build

Implement eulerIntegrate that takes a starting noise vector, a step size, and an array of velocity vectors (one per step), and returns the final action.

Solution:

javascript
function eulerIntegrate(a0, stepSize, velocities) {
  let a = [...a0];
  for (const v of velocities) {
    a[0] += stepSize * v[0];
    a[1] += stepSize * v[1];
  }
  return a;
}

The formula. Euler integration: a_{τ+Δτ} = a_τ + Δτ · v_θ(s, a_τ, τ). Each step moves the noisy action toward a clean sample from the learned distribution.
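The same update rule can be driven by a vector-field function instead of a precomputed list of velocities. A minimal sketch, assuming a hypothetical stand-in field `toyField` in place of the learned network v_θ: it points from the current action straight at a fixed target, scaled by 1/(1 − τ), which is the velocity of a straight-line path that reaches the target at τ = 1. The name `eulerSample`, the target value, and the step count are illustrative assumptions.

```javascript
// Euler integration of da/dτ = field(a, τ) from τ = 0 to τ = 1 in n equal steps.
function eulerSample(a0, n, field) {
  let a = [...a0];
  const dt = 1 / n;
  for (let k = 0; k < n; k++) {
    const tau = k * dt;          // current integration time, always < 1
    const vel = field(a, tau);   // stand-in for v_θ(s, a_τ, τ)
    a = a.map((x, i) => x + dt * vel[i]);
  }
  return a;
}

// Hypothetical stand-in for v_θ: straight-line velocity toward a fixed target.
const target = [0.5, -2.4];
const toyField = (a, tau) => a.map((x, i) => (target[i] - x) / (1 - tau));

// Lands on the target up to floating-point rounding.
console.log(eulerSample([1.5, -0.4], 4, toyField));
```

A trained policy would replace `toyField` with a network conditioned on the state, and different noise draws would end at different modes; the integration loop itself stays the same.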

Chapter 3: Actor-Critic & N-Step Returns

Vanilla REINFORCE waits until the episode ends to assign credit. Actor-critic methods bootstrap from a learned value function to get faster, lower-variance updates. But bootstrapping introduces bias whenever V_φ is imperfect.

1-step advantage: Â^π(s_t, a_t) ≈ r(s_t, a_t) + γV̂^π(s_{t+1}) − V̂^π(s_t)
n-step advantage: Â^π_n(s_t, a_t) = ∑_{t'=t}^{t+n−1} γ^{t'−t} r(s_{t'}, a_{t'}) + γⁿV̂^π(s_{t+n}) − V̂^π(s_t)
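The n-step formula translates directly into a short loop. A minimal sketch, assuming the caller passes the n observed rewards in order plus the two value estimates; the function name `advantageNStep` and the numbers in the usage lines are illustrative, not taken from the exercises:

```javascript
// n-step advantage: discounted sum of n real rewards, plus the bootstrapped
// value at s_{t+n}, minus the baseline V̂(s_t). Here n = rewards.length.
function advantageNStep(rewards, gamma, vBootstrap, vCurr) {
  let ret = 0;
  let discount = 1;
  for (const r of rewards) {
    ret += discount * r;
    discount *= gamma; // after the loop, discount = γ^n
  }
  return ret + discount * vBootstrap - vCurr;
}

// n = 1: 1 + 0.5 * 4 - 2
console.log(advantageNStep([1], 0.5, 4, 2));    // 1
// n = 2: 1 + 0.5 * 2 + 0.25 * 8 - 2
console.log(advantageNStep([1, 2], 0.5, 8, 2)); // 2
```

Passing a single reward recovers the 1-step estimate; passing every reward up to the end of the episode with a bootstrap value of 0 recovers the Monte Carlo reward-to-go minus the baseline.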

Exercise 3.1: 1-Step Estimate in Go Trace

In sparse-reward Go (r=0 at non-terminal states), what does the 1-step advantage estimate reduce to at a non-terminal position? (Use γ = 1.)

V̂(s_{t+1}) − V̂(s_t): purely the value function's opinion about whether the board got better or worse
0: the advantage is always zero at non-terminal states
r(s_t, a_t): just the immediate reward

Exercise 3.2: N-Step Bias-Variance Trace

For the n-step estimator with γ=1, which are TRUE? (A) 1-step relies entirely on V̂ for non-terminal Go. (B) Increasing n moves closer to MC reward-to-go. (C) Inaccurate V̂ introduces substantial bias in 1-step. (D) Increasing n always lowers both bias and variance.

(A), (B), and (C)
(A) and (D)
(B), (C), and (D)
All four

Reasoning:

(A) TRUE: at non-terminal Go positions, r=0 so the 1-step estimate is V̂(s_{t+1}) − V̂(s_t).

(B) TRUE: larger n uses more real rewards before bootstrapping.

(C) TRUE: because the 1-step estimate depends only on V̂ here, any error in V̂ passes directly into the advantage as bias.

(D) FALSE: increasing n lowers bias but increases variance (more real trajectory randomness).

Exercise 3.3: 1-Step Advantage Computation Derive

A robot walking task. r(s_t, a_t) = 3, γ = 0.9, V̂(s_t) = 10, V̂(s_{t+1}) = 12. Compute the 1-step advantage estimate.

Derivation:

Â^π = r + γV̂(s_{t+1}) − V̂(s_t) = 3 + 0.9 × 12 − 10 = 3 + 10.8 − 10 = 3.8

Positive advantage: this action performed better than expected from state s_t.

Exercise 3.4: Pipeline Order Design

Put the actor-critic training loop steps in order.

Collect rollouts with π_θ
Compute advantage Â
Update θ with policy gradient
Fit V_φ to observed returns
Repeat from collection

Exercise 3.5: Build 1-Step Advantage Build

Implement advantage1Step.

Solution:

javascript
function advantage1Step(r, gamma, v_next, v_curr) {
  return r + gamma * v_next - v_curr;
}

The tradeoff. 1-step: low variance, high bias (relies on V̂). n-step: more real rewards, less bias, more variance. GAE blends all n-step estimates with exponential weighting controlled by λ.
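The exponential blending can be computed with a single backward pass over 1-step TD errors. A minimal sketch of the standard GAE recursion Â_t = δ_t + γλ·Â_{t+1}, assuming `values` holds one more entry than `rewards` (the bootstrap value for the state after the last reward); the function name `gae` and the sample numbers are illustrative:

```javascript
// Generalized advantage estimation over one trajectory segment.
// rewards[t] = r_t for t = 0..T-1; values[t] = V̂(s_t) for t = 0..T.
function gae(rewards, values, gamma, lambda) {
  const T = rewards.length;
  const adv = new Array(T).fill(0);
  let running = 0;
  for (let t = T - 1; t >= 0; t--) {
    // 1-step TD error at time t.
    const delta = rewards[t] + gamma * values[t + 1] - values[t];
    running = delta + gamma * lambda * running;
    adv[t] = running;
  }
  return adv;
}

const rewards = [1, 2];
const values = [2, 4, 8];
console.log(gae(rewards, values, 0.5, 0)); // [1, 2]: λ = 0 gives the 1-step estimates
console.log(gae(rewards, values, 0.5, 1)); // [2, 2]: λ = 1 gives the longest n-step estimates
```

Values of λ between 0 and 1 interpolate between these two extremes, trading bias for variance.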

Chapter 4: Q-Learning — TD Targets

A hallway grid world with states {−2, −1, 0, 1, 2}. Actions: Left, Right. Start at 0. Discount γ = 0.5.

Reward table r(s, a):
s	−2	−1	0	1	2
Left	—	10	4	4	4
Right	5	5	5	−10	—

Current Q̂(s,a) predictions:
s	−2	−1	0	1	2
Left	—	10	8	10	6
Right	3	20	12	20	—

Exercise 4.1: 1-Step TD Target Derive

Transition: s=0, a=Right, r=5, s'=1. Compute the 1-step Q-learning target y = r + γ max_a' Q̂(s', a').

Derivation:

y = r(0, Right) + γ max{Q̂(1, Left), Q̂(1, Right)}

y = 5 + 0.5 × max{10, 20} = 5 + 0.5 × 20 = 5 + 10 = 15

Exercise 4.2: 1-Step TD Loss Derive

The current prediction is Q̂(0, Right) = 12 and the target is y = 15. Compute the squared loss (y − Q̂)².

Derivation:

(y − Q̂)² = (15 − 12)² = 3² = 9

Exercise 4.3: 2-Step TD Target (Window 1) Derive

Behavior policy alternates Right, Left, Right, Left from s=0. Window 1: 0 --(Right, r=5)--> 1 --(Left, r=4)--> 0. Compute the 2-step target for (s=0, a=Right).

y⁽²⁾ = r₀ + γr₁ + γ² max_a' Q̂(s₂, a'). The agent ends back at s=0.

Derivation:

y⁽²⁾ = 5 + 0.5(4) + 0.5² max{Q̂(0,L), Q̂(0,R)}

= 5 + 2 + 0.25 × max{8, 12} = 5 + 2 + 3 = 10

Exercise 4.4: 2-Step Loss (Window 1) Derive

Q̂(0, Right) = 12, target = 10. Compute the squared loss.

Derivation:

(10 − 12)² = (−2)² = 4

Exercise 4.5: 2-Step Target (Window 2) Derive

Window 2: 1 --(Left, r=4)--> 0 --(Right, r=5)--> 1. Compute the 2-step target for (s=1, a=Left).

Derivation:

y⁽²⁾ = 4 + 0.5(5) + 0.5² max{Q̂(1,L), Q̂(1,R)}

= 4 + 2.5 + 0.25 × max{10, 20} = 4 + 2.5 + 5 = 11.5

Exercise 4.6: Average 2-Step Loss Derive

Window 1 loss = 4, window 2: Q̂(1,Left) = 10, target = 11.5, so loss = (11.5−10)² = 2.25. Compute the average loss over both windows.

Derivation:

L = (4 + 2.25) / 2 = 6.25 / 2 = 3.125

The formula. 1-step: y = r + γ max_a' Q̂(s', a'). N-step: y⁽ⁿ⁾ = ∑_{k=0}^{n−1} γ^k r_{t+k} + γⁿ max_a' Q̂(s_{t+n}, a'). More steps = less bootstrap bias, more variance.
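The targets in this chapter can be reproduced with a small lookup over the Q̂ table. A minimal sketch, assuming the table is stored as a plain object with `null` for the entries shown as unavailable; the names `qHat`, `maxQ`, and `nStepTarget` are illustrative:

```javascript
// Q̂ table from the hallway example; null marks entries with no value.
const qHat = {
  Left:  { "-2": null, "-1": 10, "0": 8,  "1": 10, "2": 6 },
  Right: { "-2": 3,    "-1": 20, "0": 12, "1": 20, "2": null },
};

// max over a' of Q̂(s, a'), restricted to the actions that have a value in state s.
function maxQ(s) {
  const vals = Object.values(qHat)
    .map((row) => row[s])
    .filter((q) => q !== null);
  return Math.max(...vals);
}

// n-step target: discounted rewards along the window plus γ^n times
// the max Q̂ at the final state of the window.
function nStepTarget(rewards, gamma, finalState) {
  let y = 0;
  let discount = 1;
  for (const r of rewards) {
    y += discount * r;
    discount *= gamma;
  }
  return y + discount * maxQ(finalState);
}

console.log(nStepTarget([5], 0.5, 1));    // 15   (Exercise 4.1)
console.log(nStepTarget([5, 4], 0.5, 0)); // 10   (Exercise 4.3)
console.log(nStepTarget([4, 5], 0.5, 1)); // 11.5 (Exercise 4.5)
```

The squared loss for any window is then `(target - qHat[action][state]) ** 2`, for example `(15 - 12) ** 2 = 9` for the 1-step case.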

Chapter 6: Offline RL — Distribution Shift

You have a fixed dataset collected by some behavior policy. No more environment interaction. Naively running SAC or DQN on this data fails catastrophically. Why?

Exercise 6.1: Why Offline RL Fails Trace

Why does naively running an off-policy actor-critic (SAC) on a static offline dataset typically fail?

Policy gradient becomes biased due to different behavior policy
The actor exploits over-optimistic Q-values for OOD actions unseen in the dataset
TD targets always diverge without ε-greedy exploration
Replay buffer is too small

Exercise 6.2: Extrapolation Error Bug Debug

This offline RL code has the core issue that causes failure. Which line?

q_pred = Q_net(state, action)  # current Q
next_action = actor(next_state)  # actor proposes
q_next = Q_net(next_state, next_action)  # Q of OOD action!
target = reward + gamma * q_next
loss = (q_pred - target)**2

Exercise 6.3: AWR Weight Derive

Advantage-weighted regression uses weight exp(Â/α). If Â(s,a) = 2 and α = 0.5, what is the weight?

Derivation:

exp(2 / 0.5) = exp(4) = e⁴ ≈ 54.60

High-advantage actions get exponentially more weight. Small α = sharper weighting = more selective.
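The effect of the temperature α is easy to see by computing weights for a few advantages at two settings. A minimal sketch, assuming unnormalized weights exactly as in the formula exp(Â/α); the function name `awrWeights` and the sample advantages are illustrative:

```javascript
// AWR weight exp(Â / α) for each advantage; α is the temperature.
function awrWeights(advantages, alpha) {
  return advantages.map((a) => Math.exp(a / alpha));
}

const sampleAdvantages = [-1, 0, 2];
console.log(awrWeights(sampleAdvantages, 0.5)); // sharp: the best action dominates
console.log(awrWeights(sampleAdvantages, 2.0)); // flatter: weights are closer together
```

With α = 0.5 the ratio between the largest and smallest weight is exp(6); with α = 2.0 it shrinks to exp(1.5), so the regression treats the three actions far more evenly.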

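The core issue. Offline RL fails because the actor queries Q for out-of-distribution actions, where Q̂ has no data to correct over-optimistic estimates. IQL avoids the problem by never evaluating Q on actions outside the dataset, while CQL penalizes Q-values for out-of-distribution actions.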
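A toy lookup shows why a single inflated estimate is enough to mislead the actor. A minimal sketch with made-up numbers and hypothetical action names: two actions appear in the dataset, a third never does, and its Q̂ is an uncorrected extrapolation. Restricting the argmax to dataset actions is only an illustration of staying in-distribution, not an implementation of IQL or CQL.

```javascript
// Made-up Q̂ estimates for one state. "brake" and "coast" appear in the
// dataset; "jump" never does, so its estimate is pure extrapolation.
const qEstimates = { brake: 1.0, coast: 1.5, jump: 9.0 };
const datasetActions = ["brake", "coast"];

// Pick the action with the highest Q̂ among the given candidates.
const argmaxQ = (actions) =>
  actions.reduce((best, a) => (qEstimates[a] > qEstimates[best] ? a : best));

console.log(argmaxQ(Object.keys(qEstimates))); // "jump": the unsupported action wins
console.log(argmaxQ(datasetActions));          // "coast": best action with data behind it
```

Because the bootstrapped target reuses whichever action the actor picks, the inflated value for the unseen action would also propagate into the Q̂ estimates of earlier states.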

Chapter 7: IQL — Expectile Loss Math

IQL avoids querying Q for OOD actions by fitting V(s) with an asymmetric expectile loss that pushes V toward the upper end of the Q distribution — a proxy for the best in-dataset action value.

Expectile loss: ℓ₂^λ(x) = |λ − 1(x < 0)| · x²
= (1−λ)x² if x < 0, λx² if x ≥ 0
where x = V̂(s) − Q̂(s, a)

Exercise 7.1: Expectile Loss (positive residual) Derive

λ = 1/4. V̂(s) = 4, Q̂(s, a₁) = 2. Compute x and the loss for this transition.

Derivation:

x = V̂(s) − Q̂(s, a₁) = 4 − 2 = 2 ≥ 0

loss = λ × x² = 0.25 × 4 = 1

Exercise 7.2: Expectile Loss (negative residual) Derive

Same setup. V̂(s) = 4, Q̂(s, a₂) = 6. Compute the loss for this transition.

Derivation:

x = 4 − 6 = −2 < 0

loss = (1 − λ) × x² = 0.75 × 4 = 3

Exercise 7.3: Average Expectile Loss Derive

Average the losses from 7.1 and 7.2 (equal-weight dataset). What is the mean loss?

Derivation:

(1 + 3) / 2 = 2

Exercise 7.4: Why Asymmetric? Trace

Why use asymmetric expectile loss for V(s) instead of standard MSE?

Makes the value objective convex
Avoids bootstrapping by using MC returns
Pushes V(s) toward high Q̂ values (best in-dataset actions) without maximizing over unsupported actions
Guarantees Q̂ ≤ Q* pointwise

Exercise 7.5: Why NOT for Q? Trace

Why would using expectile loss for the Q-function (instead of standard Bellman MSE) be a bad idea?

It makes optimization non-convex
Expectile loss is non-differentiable at zero
It breaks the stationarity of the target distribution
It biases Q̂ toward optimistic targets, causing overestimation via bootstrapping

Exercise 7.6: Build Expectile Loss Build

Implement expectileLoss.

Solution:

javascript
function expectileLoss(vHat, qHat, lambda) {
  const x = vHat - qHat;
  const w = x >= 0 ? lambda : (1 - lambda);
  return w * x * x;
}

The asymmetry. λ < 0.5 penalizes V > Q less than V < Q. This pushes V toward the upper expectile of Q — approximating the value of the best in-dataset action, without ever querying Q for unsupported actions.
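The pull toward high Q̂ values can be seen by solving for the V that minimizes the summed expectile loss over a handful of Q̂ values. A minimal sketch, assuming one simple solver (repeatedly recomputing V as a weighted mean with the current asymmetric weights); the function name `expectileV`, the iteration count, and the sample Q̂ values are illustrative, and a real IQL implementation would instead take gradient steps on a value network:

```javascript
// Fixed-point solver for the V that minimizes sum_i w_i * (V - Q_i)^2,
// where w_i = lambda if V - Q_i >= 0 and (1 - lambda) otherwise.
function expectileV(qValues, lambda, iters = 50) {
  // Start at the plain mean, which is the minimizer for lambda = 0.5.
  let v = qValues.reduce((s, q) => s + q, 0) / qValues.length;
  for (let k = 0; k < iters; k++) {
    const w = qValues.map((q) => (v - q >= 0 ? lambda : 1 - lambda));
    const num = qValues.reduce((s, q, i) => s + w[i] * q, 0);
    const den = w.reduce((s, wi) => s + wi, 0);
    v = num / den; // weighted mean under the current weights
  }
  return v;
}

console.log(expectileV([3, 5, 10], 0.5)); // 6, the plain mean
console.log(expectileV([3, 5, 10], 0.1)); // about 8.91, pulled toward the largest Q̂
```

As λ shrinks toward 0, the minimizer moves closer to the maximum Q̂ in the set, while still using only values for actions that appear in the data.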

Chapter 8: Sparse Rewards & Goal-Conditioned RL

A Mars rover learns to deposit rocks into a bin. Reward = 1 only when the rock is placed successfully, 0 otherwise. Almost all trajectories fail. How do we learn from near-zero signal?

Exercise 8.1: Sparse Reward Challenges Trace

With sparse reward (1 on success, 0 otherwise) and mostly failed trajectories, which is TRUE? (A) Transition model learning is harder. (B) Value estimation is hard because few successful episodes provide positive signal. (C) Q-learning converges to accurate values because rewards are low-variance. (D) Small γ helps propagate rare success signals.

(A) only
(B) only: with almost no successes, value backups receive near-zero positive signal
(B) and (D)
(C) only

Exercise 8.2: Hindsight Relabeling Trace

In goal-conditioned RL with π(a | s, g), relabeling the final state s_T as the goal g creates useful signal. Why?

It removes the need for dynamics/value learning
Conditioning on g_bin alone is enough to overcome sparse reward
Every failed trajectory becomes a successful demonstration for reaching s_T
The critic can reuse Q(s,a) without goal conditioning

Exercise 8.3: Goal-Conditioned Reward Derive

Using distance-based reward r(s,a,g) = −||s − g||. If s = (3, 4) and g = (0, 0), what is the reward?

Derivation:

||s − g|| = √(3² + 4²) = √(9 + 16) = √25 = 5

r = −5

Exercise 8.4: GCRL Pipeline Design

Order the Hindsight Experience Replay (HER) steps.

Collect trajectory with original goal
Relabel with s_T as new goal
Compute reward for relabeled goal
Train Q(s,a,g) on relabeled data

The trick. Hindsight relabeling turns failed trajectories into successes by asking "what goal did this trajectory actually achieve?" This creates dense training signal from sparse-reward data.
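Both the distance-based reward and the relabeling step fit in a few lines. A minimal sketch, assuming a hypothetical data layout where each transition is an object `{ s, a, sNext }` with array-valued states, and a sparse reward of 1 only when the next state equals the relabeled goal; the function names and the sample trajectory are illustrative:

```javascript
// Distance-based goal-conditioned reward r(s, a, g) = -||s - g||.
function distanceReward(s, g) {
  return -Math.hypot(...s.map((x, i) => x - g[i]));
}

// Hindsight relabeling: reuse a trajectory with its final state as the goal.
// The sparse reward is 1 only when sNext equals the relabeled goal.
function relabelWithFinalState(trajectory) {
  const goal = trajectory[trajectory.length - 1].sNext;
  const reached = (s) => s.every((x, i) => x === goal[i]);
  return trajectory.map((t) => ({ ...t, g: goal, r: reached(t.sNext) ? 1 : 0 }));
}

// A trajectory that failed its original goal still reaches its own final state.
const failedTrajectory = [
  { s: [0, 0], a: "east", sNext: [1, 0] },
  { s: [1, 0], a: "north", sNext: [1, 1] },
];

console.log(relabelWithFinalState(failedTrajectory).map((t) => t.r)); // [0, 1]
console.log(distanceReward([3, 4], [0, 0])); // -5
```

Every relabeled trajectory ends with a reward of 1 by construction, which is why relabeling turns a dataset of failures into usable training signal for Q(s,a,g).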

Chapter 9: Capstone — Full Exam Walkthrough

This capstone combines all prior chapters into a single multi-step problem that mirrors the exam format. Work through it end-to-end.

Exercise 9.1: Gaussian Mode Collapse Derive

Expert data at state s₁: modes (v, ω) = {(3.0, 4.0), (1.0, −4.0)}. Compute the Gaussian MLE mean μ_ω.

Derivation:

μ_ω = (4.0 + (−4.0)) / 2 = 0

Exercise 9.2: 3-Step Euler Integration Derive

Flow matching with 3 Euler steps (Δτ = 1/3). Start a₀ = (2.0, 1.0). Velocities: v(τ=0) = (−3, −6), v(τ=1/3) = (−3, −6), v(τ=2/3) = (−3, −6). Compute a₁[v] (first component of final action).

Derivation:

a_{1/3}[v] = 2.0 + (1/3)(−3) = 2.0 − 1.0 = 1.0

a_{2/3}[v] = 1.0 + (1/3)(−3) = 1.0 − 1.0 = 0.0

a₁[v] = 0.0 + (1/3)(−3) = 0.0 − 1.0 = −1.0

Exercise 9.3: TD Target Chain Derive

Grid world: s=−1, a=Right, r=5, s'=0. γ = 0.5. Q̂(0,Left) = 8, Q̂(0,Right) = 12. Compute 1-step target and loss if Q̂(−1, Right) = 20.

Derivation:

y = r(−1, Right) + γ max{Q̂(0,L), Q̂(0,R)} = 5 + 0.5 × max{8, 12} = 5 + 6 = 11

Q̂(−1, Right) = 20

loss = (y − Q̂)² = (11 − 20)² = 81

Exercise 9.4: IQL Full Pipeline Derive

λ = 0.1. V̂(s) = 5. Q̂ values for 3 dataset actions: {3, 5, 10}. Compute the average expectile loss.

Derivation:

x₁ = V̂ − Q̂₁ = 5 − 3 = 2 ≥ 0 → λ × x² = 0.1 × 4 = 0.4

x₂ = 5 − 5 = 0 ≥ 0 → 0.1 × 0 = 0

x₃ = 5 − 10 = −5 < 0 → (1−λ) × x² = 0.9 × 25 = 22.5

Average = (0.4 + 0 + 22.5) / 3 = 22.9 / 3 ≈ 7.63

Notice how the loss heavily penalizes V̂ < Q̂ (x < 0). The (1−λ) = 0.9 weight on negative residuals pushes V̂ upward toward the best Q̂ values.

Exercise 9.5: Hindsight Relabeling Count Derive

You collect 100 trajectories. 3 reach g_bin (success). With HER, you relabel each trajectory with its final state as the goal. How many total "successful" trajectories (reaching their relabeled goal) do you now have?

Reasoning:

Every trajectory reaches its own final state s_T by definition. When you relabel g = s_T, every trajectory is a "success" for that relabeled goal. All 100 trajectories become successful demonstrations.

Related Lessons

Topic	Lesson
Policy Gradients (full derivation)	Policy Gradients
Hierarchical Policies	Hierarchy in IL & RL
RL Fundamentals	RL Fundamentals Workbook
Advanced RL Theory	RL Theory Workbook

Exam readiness check. If you can solve every exercise above without peeking at solutions, you own the core problem types in Deep RL exams. The key formulas: Euler integration for flow matching, REINFORCE and its variants, 1-step and n-step TD targets, expectile loss for IQL, and hindsight relabeling for GCRL.
