Workbook — Deep RL Exam Prep

Deep RL Exam

Every exam-style problem in Deep RL: policy expressivity, flow matching arithmetic, REINFORCE credit assignment, Q-learning TD targets, offline RL, and goal-conditioned RL. Compute by hand, check instantly.

Prerequisites: Deep RL fundamentals (imitation learning through goal-conditioned RL). That's it.
10 Chapters · 50 Exercises · 5 Exercise Types

Chapter 0: Imitation Learning — Mode Collapse

You're training a self-driving race car. At a critical state s1, expert drivers use two distinct safe racing lines around a barrier. The action space is continuous: a = (v, ω), where v ∈ [0, 5] is velocity and ω ∈ [−10, 10] is steering. The expert data has two equally likely modes:

Mode 1: v = 2.0, ω = 2.4  (Fast right turn)
Mode 2: v = 0.5, ω = −2.4  (Slow left turn)

Safety constraints: any |ω| ≤ 2.3 or |ω| ≥ 2.5 crashes. A "fast left turn" (v > 0.6, ω < 0) causes a rollover.

The mode-averaging trap. When a unimodal distribution (like a Gaussian) fits multimodal data, its mean falls between the modes — often in a region that neither expert would ever visit. This is the fundamental limitation of behavioral cloning with unimodal policies.
Exercise 0.1: Gaussian MLE Mean (velocity) Derive

Fit π(a | s1) = N(μ, Σ) to the two expert modes via MLE. What is μv?

μv
Show derivation
μv = (2.0 + 0.5) / 2 = 1.25

The MLE mean of a Gaussian is the arithmetic mean of the data points.

Exercise 0.2: Gaussian MLE Mean (steering) Derive

What is μω for the same Gaussian MLE fit?

μω
Show derivation
μω = (2.4 + (−2.4)) / 2 = 0

The two steering values cancel perfectly. The mean action μ = (1.25, 0) has |ω| = 0 ≤ 2.3, causing a crash. The Gaussian collapses two safe modes into one unsafe average.

Exercise 0.3: Which Fixes Work? Trace
Which approaches avoid the mode-averaging crash? (A) Collect more data for each mode. (B) Use autoregressive discretized policy with bin centers {−10, −8, ..., 8, 10}. (C) Use flow-matching policy. (D) Train on only one maneuver family.
Show reasoning

(A) is wrong: more equal-weight data doesn't change the MLE mean of a Gaussian — the problem is the model family, not data quantity.

(B) is wrong: the closest bin centers to ω = ±2.4 are ω = ±2, and |ω| = 2 ≤ 2.3 still crashes. Discretization granularity is too coarse.

(C) is correct: flow-matching policies can represent multimodal distributions by denoising to different modes.

(D) is correct: if you only keep one mode, MLE recovers that safe mode.
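
To see why option (B) fails, here is a minimal sketch that snaps each expert steering value to its nearest bin center and re-checks the crash rule. The helper names `nearestBin` and `crashes` are illustrative assumptions, not part of the workbook.

```javascript
// Bin centers {-10, -8, ..., 8, 10} from option (B).
const binCenters = [];
for (let c = -10; c <= 10; c += 2) binCenters.push(c);

// Hypothetical helper: return the bin center closest to x.
function nearestBin(x, centers) {
  return centers.reduce((best, c) =>
    Math.abs(c - x) < Math.abs(best - x) ? c : best);
}

// Crash rule from the chapter: |ω| ≤ 2.3 or |ω| ≥ 2.5.
function crashes(omega) {
  const m = Math.abs(omega);
  return m <= 2.3 || m >= 2.5;
}

for (const omega of [2.4, -2.4]) {
  const snapped = nearestBin(omega, binCenters);
  console.log(omega, "->", snapped, crashes(snapped) ? "crash" : "safe");
}
```

Both expert values snap to ±2, which falls inside the crash band, so the discretized policy cannot represent either safe mode at this bin width.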

Exercise 0.4: MLE With 3 Modes Derive

New scenario: 3 equally likely modes at s2: (v, ω) = {(1.0, 3.0), (2.0, −1.0), (3.0, 3.0)}. What is μv for a Gaussian MLE?

μv
Show derivation
μv = (1.0 + 2.0 + 3.0) / 3 = 2.0
Exercise 0.5: Policy Expressivity Bug Debug

This behavioral cloning code fits a Gaussian to bimodal expert data. Which line causes the unsafe mode-averaging behavior?

expert_actions = [(2.0, 2.4), (0.5, -2.4)]
mu = np.mean(expert_actions, axis=0)  # MLE mean
sigma = np.cov(expert_actions, rowvar=False)
policy = MultivariateNormal(mu, sigma)
action = policy.sample()  # deploy
Exercise 0.6: Build MLE Build

Implement gaussianMLE that takes an array of 2D action vectors and returns the MLE mean [μv, μω].

Show solution
javascript
function gaussianMLE(actions) {
  const n = actions.length;
  let sv = 0, sw = 0;
  for (const [v, w] of actions) { sv += v; sw += w; }
  return [sv / n, sw / n];
}
The formula. Gaussian MLE mean = arithmetic mean of data. If the data is multimodal, this average can land in an unsafe region that no expert ever visited. Flow matching and mixture models avoid this.
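
As a quick check of the formula above, the sketch below averages the two expert modes and tests each action against the chapter's safety rules. The function names (`crashes`, `rollsOver`, `isSafe`, `meanAction`) are assumptions made for this illustration.

```javascript
// Safety rules from the chapter (function names are illustrative).
function crashes(omega) {
  const m = Math.abs(omega);
  return m <= 2.3 || m >= 2.5;
}
function rollsOver(v, omega) {
  return v > 0.6 && omega < 0; // "fast left turn"
}
function isSafe([v, omega]) {
  return !crashes(omega) && !rollsOver(v, omega);
}

// Gaussian MLE mean = component-wise arithmetic mean of the data.
function meanAction(actions) {
  const n = actions.length;
  return actions[0].map((_, d) => actions.reduce((s, a) => s + a[d], 0) / n);
}

const modes = [[2.0, 2.4], [0.5, -2.4]];
const mu = meanAction(modes);
console.log(modes.map(isSafe)); // both expert modes pass the safety rules
console.log(mu, isSafe(mu));    // the averaged action fails the crash rule
```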

Chapter 1: Flow Matching Arithmetic

Your flow-matching policy learned a vector field $v_\theta(s_t, a_{t,\tau}, \tau)$ that denoises actions. During inference, you start from Gaussian noise and integrate the ODE using Euler steps. Can you trace through the arithmetic?

Flow matching ODE: $\frac{d a_{t,\tau}}{d\tau} = v_\theta(s_t, a_{t,\tau}, \tau)$
Euler integration: $a_{t,\tau + 1/n} = a_{t,\tau} + \frac{1}{n} \cdot v_\theta(s_t, a_{t,\tau}, \tau)$
Exercise 1.1: Euler Step 1 Derive

Start from noise $a_{t,0} = (1.5, -0.4)$. Step size $\Delta\tau = 0.5$. Network outputs $v_\theta(s_t, a_{t,0}, 0) = (-1.0, -2.0)$. Compute the velocity component of $a_{t,0.5}$.

$a_{t,0.5} = a_{t,0} + 0.5 \cdot v_\theta$. Compute the v-component (first element).

v-component of $a_{t,0.5}$
Show derivation
$a_{t,0.5}[v] = 1.5 + 0.5 \times (-1.0) = 1.5 - 0.5 = 1.0$
Exercise 1.2: Euler Step 1 (steering) Derive

Same setup. Compute the ω-component of $a_{t,0.5}$.

ω-component of $a_{t,0.5}$
Show derivation
$a_{t,0.5}[\omega] = -0.4 + 0.5 \times (-2.0) = -0.4 - 1.0 = -1.4$
Exercise 1.3: Euler Step 2 → Final Action Derive

From $a_{t,0.5} = (1.0, -1.4)$, the network outputs $v_\theta(s_t, a_{t,0.5}, 0.5) = (-1.1, -1.9)$. Compute $a_{t,1}[v]$ (the final velocity).

v-component of $a_{t,1}$
Show derivation
$a_{t,1}[v] = 1.0 + 0.5 \times (-1.1) = 1.0 - 0.55 = 0.45$
Exercise 1.4: Final Steering Derive

Compute $a_{t,1}[\omega]$ (the final steering value).

ω-component of $a_{t,1}$
Show derivation
$a_{t,1}[\omega] = -1.4 + 0.5 \times (-1.9) = -1.4 - 0.95 = -2.35$
Exercise 1.5: Safety Check Trace
The final action is (0.45, −2.35). Given constraints: |ω| ≤ 2.3 or |ω| ≥ 2.5 → crash; v > 0.6 with ω < 0 → rollover. What happens?
Exercise 1.6: Build Euler Integration Build

Implement eulerIntegrate that takes a starting noise vector, a step size, and an array of velocity vectors (one per step), and returns the final action.

Show solution
javascript
function eulerIntegrate(a0, stepSize, velocities) {
  let a = [...a0];
  for (const v of velocities) {
    a[0] += stepSize * v[0];
    a[1] += stepSize * v[1];
  }
  return a;
}
The formula. Euler integration: $a_{\tau+\Delta\tau} = a_\tau + \Delta\tau \cdot v_\theta(s, a_\tau, \tau)$. Each step moves the noisy action toward a clean sample from the learned distribution.
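
The update rule can also be sketched with the vector field passed in as a callback, which is closer to how inference queries a network at each step. Here `eulerFlow` and the `field` callback are hypothetical names, and the stand-in field simply replays the two network outputs given in Exercises 1.1 to 1.4.

```javascript
// Euler integration of da/dτ = field(a, τ) from τ = 0 to τ = 1 in n steps.
// `field` stands in for the learned network v_θ(s, a, τ); the state s is
// assumed to be captured inside the callback.
function eulerFlow(a0, n, field) {
  const dt = 1 / n;
  let a = [...a0];
  const path = [a];
  for (let k = 0; k < n; k++) {
    const v = field(a, k * dt);
    a = a.map((ai, d) => ai + dt * v[d]);
    path.push(a);
  }
  return path; // path[k] is the action at τ = k / n
}

// Stand-in field: looks up the network output quoted in the exercises by τ.
const outputs = new Map([[0, [-1.0, -2.0]], [0.5, [-1.1, -1.9]]]);
const path = eulerFlow([1.5, -0.4], 2, (a, tau) => outputs.get(tau));
console.log(path); // start, midpoint, and final action (up to floating-point rounding)
```

Returning the whole path rather than only the final action makes it easy to compare each intermediate value against a hand computation.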

Chapter 2: REINFORCE & Credit Assignment

You're training a Go agent with policy gradients. Go is a long-horizon sparse-reward game: r(st, at) = +1 at game end if win, −1 if loss, 0 otherwise. A trajectory is one full game.

REINFORCE estimator:
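$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_i \left( \sum_t \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \right) \left( \sum_t r(s_{i,t}, a_{i,t}) \right)$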
Exercise 2.1: REINFORCE Properties Trace
The REINFORCE estimator above uses the full trajectory return to weight every action. Which are TRUE? (A) Unbiased estimator. (B) Reliable per-action signal. (C) Biased because single-trajectory samples. (D) High variance because terminal outcome credits all actions.
Show reasoning

(A) TRUE: REINFORCE is an unbiased estimator of the policy gradient by the log-derivative trick.

(B) FALSE: weighting every action by the same terminal outcome is exactly what makes credit assignment hard. Winning trajectories can contain bad moves.

(C) FALSE: single-trajectory sampling causes variance, not bias. The estimator is unbiased in expectation.

(D) TRUE: the single terminal +1/−1 is spread across all actions in the trajectory, creating enormous variance.

Exercise 2.2: Reward Shifting Trace
Suppose we change the reward from +1/−1 (win/loss) to +3/+1. Now every trajectory has positive reward. Which is TRUE?
Show reasoning

$R_{+3/+1} = R_{\pm 1} + 2$. This is a constant shift. Since $\mathbb{E}[\nabla_\theta \log p_\theta(\tau)] = 0$, the constant cancels in expectation. The expected gradient is unchanged. (B) is correct. (A) is false: baselines reduce variance but are NOT required for correctness. (C) is false: even though losing trajectories get positive reward (+1), wins get +3, so wins are still upweighted more.
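
The cancellation can be written out in two lines. This is the standard score-function identity, sketched here with $c$ as the constant shift ($c = 2$ in this exercise) and assuming the gradient and the integral can be exchanged:

```latex
\begin{aligned}
\mathbb{E}_{\tau \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\,\bigl(R(\tau) + c\bigr)\right]
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\,R(\tau)\right]
   + c\,\mathbb{E}_{\tau \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\right], \\
\mathbb{E}_{\tau \sim p_\theta}\!\left[\nabla_\theta \log p_\theta(\tau)\right]
  &= \int p_\theta(\tau)\,\frac{\nabla_\theta p_\theta(\tau)}{p_\theta(\tau)}\,d\tau
   = \nabla_\theta \int p_\theta(\tau)\,d\tau
   = \nabla_\theta 1 = 0 .
\end{aligned}
```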

Exercise 2.3: Reward-to-Go vs Full Return Trace
In Go with sparse terminal reward, changing from full return $\sum_{t=1}^{T} r(s_t, a_t)$ to reward-to-go $\sum_{t'=t}^{T} r(s_{t'}, a_{t'})$ is especially helpful for variance reduction. True or false?
Show reasoning

False. In this sparse Go setting, the only nonzero reward is at termination. For any non-terminal timestep t, both $\sum_{t=1}^{T} r$ and $\sum_{t'=t}^{T} r$ contain the same single terminal reward. The causality trick doesn't help here because there are no past nonzero rewards to remove.
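
A small sketch makes the point concrete. The reward sequences below are made up for illustration, and `rewardToGo` is a hypothetical helper using γ = 1.

```javascript
// Reward-to-go: rtg[t] = sum of rewards from step t to the end (γ = 1).
function rewardToGo(rewards) {
  const rtg = new Array(rewards.length).fill(0);
  let running = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    running += rewards[t];
    rtg[t] = running;
  }
  return rtg;
}

// Hypothetical 5-move game that ends in a win: reward only at the last step.
const sparse = [0, 0, 0, 0, 1];
const fullReturn = sparse.reduce((s, r) => s + r, 0);
console.log(rewardToGo(sparse), fullReturn); // every entry equals the full return

// Contrast: with dense rewards, reward-to-go drops the rewards already earned.
console.log(rewardToGo([2, 1, 3])); // [6, 4, 3]
```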

Exercise 2.4: Value Baseline Properties Trace
When we subtract a state-dependent value baseline $V^\pi_\phi(s_t)$, which are TRUE? (A) Solves credit assignment. (B) Reduces variance. (C) Causality trick helps a lot for sparse Go. (D) Turns return into an advantage-like quantity.
Exercise 2.5: MC Value Baseline Trace
You estimate $V^\pi(s_t)$ by averaging Monte Carlo rollouts from $s_t$. What best describes this?
The key fact. Subtracting a state-only baseline does NOT change the expected gradient — only reduces variance. Baselines compare "how well did I do vs. what I expected," turning raw returns into relative advantages.
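
The key fact can be checked exactly on a toy problem. The sketch below is an assumption-laden illustration, not Go itself: a single state, two actions, a sigmoid policy, and the +3/+1 returns from Exercise 2.2. With one state, a state-only baseline is just a constant.

```javascript
// Toy one-step problem: action 1 is chosen with probability p = sigmoid(θ)
// and earns return r1; action 0 earns r0.
// Score functions: d/dθ log π(1) = 1 - p,  d/dθ log π(0) = -p.
function gradStats(p, r1, r0, baseline) {
  const g1 = (1 - p) * (r1 - baseline); // single-sample estimate if action 1 is drawn
  const g0 = -p * (r0 - baseline);      // single-sample estimate if action 0 is drawn
  const mean = p * g1 + (1 - p) * g0;
  const variance = p * (g1 - mean) ** 2 + (1 - p) * (g0 - mean) ** 2;
  return { mean, variance };
}

console.log(gradStats(0.5, 3, 1, 0)); // no baseline
console.log(gradStats(0.5, 3, 1, 2)); // baseline = expected return under p = 0.5
```

In this toy case the mean is the same with and without the baseline, while the variance drops to zero when the baseline equals the expected return.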

Chapter 3: Actor-Critic & N-Step Returns

Vanilla REINFORCE waits until the episode ends to assign credit. Actor-critic methods bootstrap from a learned value function to get faster, lower-variance updates. But bootstrapping introduces bias whenever Vφ is imperfect.

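1-step advantage: $\hat{A}^\pi(s_t, a_t) \approx r(s_t, a_t) + \gamma \hat{V}^\pi(s_{t+1}) - \hat{V}^\pi(s_t)$
n-step advantage: $\hat{A}^\pi_n(s_t, a_t) = \sum_{t'=t}^{t+n-1} \gamma^{t'-t} r(s_{t'}, a_{t'}) + \gamma^n \hat{V}^\pi(s_{t+n}) - \hat{V}^\pi(s_t)$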
Exercise 3.1: 1-Step Estimate in Go Trace
In sparse-reward Go (r=0 at non-terminal states), what does the 1-step advantage estimate reduce to at a non-terminal position? (Use γ = 1.)
Exercise 3.2: N-Step Bias-Variance Trace
For the n-step estimator with γ=1, which are TRUE? (A) 1-step relies entirely on V̂ for non-terminal Go. (B) Increasing n moves closer to MC reward-to-go. (C) Inaccurate V̂ introduces substantial bias in 1-step. (D) Increasing n always lowers both bias and variance.
Show reasoning

(A) TRUE: at non-terminal Go positions, r=0 so the 1-step estimate is $\hat{V}(s_{t+1}) - \hat{V}(s_t)$.

(B) TRUE: larger n uses more real rewards before bootstrapping.

(C) TRUE: if V̂ is wrong, the 1-step estimate is dominated by bootstrap error.

(D) FALSE: increasing n lowers bias but increases variance (more real trajectory randomness).

Exercise 3.3: 1-Step Advantage Computation Derive

A robot walking task. $r(s_t, a_t) = 3$, $\gamma = 0.9$, $\hat{V}(s_t) = 10$, $\hat{V}(s_{t+1}) = 12$. Compute the 1-step advantage estimate.

$\hat{A}^\pi$
Show derivation
$\hat{A}^\pi = r + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t) = 3 + 0.9 \times 12 - 10 = 3 + 10.8 - 10 = 3.8$

Positive advantage: this action performed better than expected from state $s_t$.

Exercise 3.4: Pipeline Order Design

Put the actor-critic training loop steps in order.

Collect rollouts with πθ
Compute advantage
Update θ with policy gradient
Fit Vφ to observed returns
Repeat from collection
Exercise 3.5: Build 1-Step Advantage Build

Implement advantage1Step.

Show solution
javascript
function advantage1Step(r, gamma, v_next, v_curr) {
  return r + gamma * v_next - v_curr;
}
The tradeoff. 1-step: low variance, high bias (relies on V̂). n-step: more real rewards, less bias, more variance. GAE blends all n-step estimates with exponential weighting controlled by λ.
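
The n-step estimator from the top of this chapter generalizes the 1-step function. A minimal sketch, where `advantageNStep` and its argument names are assumptions and n is taken to be the length of the reward array:

```javascript
// n-step advantage: discounted sum of n real rewards, plus a bootstrapped
// value at step t+n, minus the value at step t. Here n = rewards.length.
function advantageNStep(rewards, gamma, vBoot, vCurr) {
  let ret = 0;
  let discount = 1;
  for (const r of rewards) {
    ret += discount * r;
    discount *= gamma;
  }
  return ret + discount * vBoot - vCurr;
}

// n = 1 reproduces Exercise 3.3: 3 + 0.9 * 12 - 10.
console.log(advantageNStep([3], 0.9, 12, 10));

// Sparse Go with γ = 1: the 1-step estimate is a difference of value estimates.
// The values 0.2 and 0.1 are made up for illustration.
console.log(advantageNStep([0], 1, 0.2, 0.1));
```

Passing a longer reward array moves the estimate toward Monte Carlo reward-to-go, since more of the target comes from observed rewards and less from the bootstrapped value.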

Chapter 4: Q-Learning — TD Targets

A hallway grid world with states {−2, −1, 0, 1, 2}. Actions: Left, Right. Start at 0. Discount γ = 0.5.

Reward table r(s, a):
s:       −2   −1    0    1    2
Left:     —   10    4    4    4
Right:    5    5    5  −10    —

Current Q̂(s,a) predictions:
s:       −2   −1    0    1    2
Left:     —   10    8   10    6
Right:    3   20   12   20    —
Exercise 4.1: 1-Step TD Target Derive

Transition: s=0, a=Right, r=5, s'=1. Compute the 1-step Q-learning target $y = r + \gamma \max_{a'} \hat{Q}(s', a')$.

TD target y
Show derivation
$y = r(0, \text{Right}) + \gamma \max\{\hat{Q}(1, \text{Left}), \hat{Q}(1, \text{Right})\}$
$y = 5 + 0.5 \times \max\{10, 20\} = 5 + 0.5 \times 20 = 5 + 10 = 15$
Exercise 4.2: 1-Step TD Loss Derive

The current prediction is Q̂(0, Right) = 12 and the target is y = 15. Compute the squared loss (y − Q̂)².

loss
Show derivation
(y − Q̂)² = (15 − 12)² = 3² = 9
Exercise 4.3: 2-Step TD Target (Window 1) Derive

Behavior policy alternates Right, Left, Right, Left from s=0. Window 1: $0 \xrightarrow{\text{Right},\ r=5} 1 \xrightarrow{\text{Left},\ r=4} 0$. Compute the 2-step target for (s=0, a=Right).

$y^{(2)} = r_0 + \gamma r_1 + \gamma^2 \max_{a'} \hat{Q}(s_2, a')$. The agent ends back at s=0.

2-step target
Show derivation
$y^{(2)} = 5 + 0.5(4) + 0.5^2 \max\{\hat{Q}(0,L), \hat{Q}(0,R)\}$
$= 5 + 2 + 0.25 \times \max\{8, 12\} = 5 + 2 + 3 = 10$
Exercise 4.4: 2-Step Loss (Window 1) Derive

Q̂(0, Right) = 12, target = 10. Compute the squared loss.

loss
Show derivation
(10 − 12)² = (−2)² = 4
Exercise 4.5: 2-Step Target (Window 2) Derive

Window 2: $1 \xrightarrow{\text{Left},\ r=4} 0 \xrightarrow{\text{Right},\ r=5} 1$. Compute the 2-step target for (s=1, a=Left).

2-step target
Show derivation
$y^{(2)} = 4 + 0.5(5) + 0.5^2 \max\{\hat{Q}(1,L), \hat{Q}(1,R)\}$
$= 4 + 2.5 + 0.25 \times \max\{10, 20\} = 4 + 2.5 + 5 = 11.5$
Exercise 4.6: Average 2-Step Loss Derive

Window 1 loss = 4, window 2: Q̂(1,Left) = 10, target = 11.5, so loss = (11.5−10)² = 2.25. Compute the average loss over both windows.

avg loss
Show derivation
L = (4 + 2.25) / 2 = 6.25 / 2 = 3.125
The formula. 1-step: $y = r + \gamma \max_{a'} \hat{Q}(s', a')$. N-step: $y^{(n)} = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} \hat{Q}(s_{t+n}, a')$. More steps = less bootstrap bias, more variance.
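
The n-step formula can be sketched as a short function and checked against the two windows from Exercises 4.3 to 4.6. The name `nStepTarget` and the object layout of `qHat` are assumptions made for this example.

```javascript
// n-step Q-learning target: n discounted rewards, then bootstrap from the
// max over Q̂ at the state reached after n steps.
function nStepTarget(rewards, gamma, qBootValues) {
  let y = 0;
  let discount = 1;
  for (const r of rewards) {
    y += discount * r;
    discount *= gamma;
  }
  return y + discount * Math.max(...qBootValues);
}

// Q̂ rows for the two hallway states used in this chapter's windows.
const qHat = { 0: { Left: 8, Right: 12 }, 1: { Left: 10, Right: 20 } };
const gamma = 0.5;

const y1 = nStepTarget([5, 4], gamma, Object.values(qHat[0])); // window 1, ends at s = 0
const y2 = nStepTarget([4, 5], gamma, Object.values(qHat[1])); // window 2, ends at s = 1
const avgLoss = ((y1 - qHat[0].Right) ** 2 + (y2 - qHat[1].Left) ** 2) / 2;
console.log(y1, y2, avgLoss); // 10 11.5 3.125
```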

Chapter 5: Off-Policy V(s) & Overestimation

We want to evaluate one policy (π*) using data collected by a different policy (random). This is the off-policy evaluation problem. It's subtle: the data doesn't come from the policy we're evaluating.

Exercise 5.1: Off-Policy Vπ Targets Trace
We have the optimal $Q^*(s,a)$. Data comes from a random policy. $\pi(s) = \arg\max_a Q^*(s,a)$. We want $\hat{V}(s) \to V^\pi(s)$. Which target $y$ for $\hat{V}(s_t)$ is valid? (A) $r_t + \gamma \hat{V}(s_{t+1})$. (B) $r_t + \gamma \max_{a'} Q^*(s_{t+1}, a')$. (C) $r_t + \gamma Q^*(s_{t+1}, \pi(s_{t+1}))$. (D) $\sum_{t'} \gamma^{t'-t} r_{t'}$.
Show reasoning

None work correctly. In all options, the reward $r_t$ and next state $s_{t+1}$ come from the random data-collection action $a_t$, not from $\pi(s_t)$. The first transition's reward and dynamics don't match $V^\pi$. These targets fit $Q^*(s_t, a_t)$ or the random policy's value, not $V^\pi(s_t)$.

Exercise 5.2: Q-Learning Overestimation Trace
Why does $\max_{a'} \hat{Q}(s', a')$ tend to overestimate the true optimal value?
Exercise 5.3: Double Q-Learning Trace
How does Double Q-learning reduce overestimation?
Exercise 5.4: Build Q-Learning Target Build

Implement qTarget that computes the 1-step Q-learning TD target.

Show solution
javascript
function qTarget(r, gamma, qNextValues) {
  return r + gamma * Math.max(...qNextValues);
}
The key. max over noisy Q estimates → systematic overestimation. Double Q decouples selection from evaluation. Target networks stabilize regression targets.
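
The decoupling can be sketched by comparing the two targets side by side. The function names and the Q̂ numbers below are made up for illustration; `qSelect` and `qEval` stand for two separately trained sets of estimates for the next state.

```javascript
// Standard target: the same estimates both pick and score the next action.
function qTargetMax(r, gamma, qNext) {
  return r + gamma * Math.max(...qNext);
}

// Double Q target: one set of estimates (qSelect) picks the action,
// a second set (qEval) scores it.
function doubleQTarget(r, gamma, qSelect, qEval) {
  const best = qSelect.indexOf(Math.max(...qSelect));
  return r + gamma * qEval[best];
}

// Made-up numbers: the selecting estimates are inflated on action 1.
const qA = [10, 20];
const qB = [11, 14];
console.log(qTargetMax(5, 0.5, qA));        // 15
console.log(doubleQTarget(5, 0.5, qA, qB)); // 12
```

Because the second set of estimates does not share the first set's noise, an action that was selected only because its estimate was inflated tends to be scored lower at evaluation time.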

Chapter 6: Offline RL — Distribution Shift

You have a fixed dataset collected by some behavior policy. No more environment interaction. Naively running SAC or DQN on this data fails catastrophically. Why?

Exercise 6.1: Why Offline RL Fails Trace
Why does naively running an off-policy actor-critic (SAC) on a static offline dataset typically fail?
Exercise 6.2: Extrapolation Error Bug Debug

This offline RL code has the core issue that causes failure. Which line?

q_pred = Q_net(state, action)  # current Q
next_action = actor(next_state)  # actor proposes
q_next = Q_net(next_state, next_action)  # Q of OOD action!
target = reward + gamma * q_next
loss = (q_pred - target)**2
Exercise 6.3: AWR Weight Derive

Advantage-weighted regression uses weight exp(Â/α). If Â(s,a) = 2 and α = 0.5, what is the weight?

weight
Show derivation
$\exp(2 / 0.5) = \exp(4) = e^4 \approx 54.60$

High-advantage actions get exponentially more weight. Small α = sharper weighting = more selective.
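
To see the effect of α, here is a minimal sketch that evaluates the weight at the exercise's temperature and at a larger one. The helper name `awrWeight` and the comparison value α = 2.0 are assumptions for illustration.

```javascript
// AWR weight exp(Â / α) for one (state, action) pair.
function awrWeight(advantage, alpha) {
  return Math.exp(advantage / alpha);
}

// The exercise's numbers, then a larger temperature for contrast.
console.log(awrWeight(2, 0.5).toFixed(2)); // "54.60"
console.log(awrWeight(2, 2.0).toFixed(2)); // "2.72"

// Relative weight of Â = 2 versus Â = 0 under each temperature.
for (const alpha of [0.5, 2.0]) {
  console.log(alpha, awrWeight(2, alpha) / awrWeight(0, alpha));
}
```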

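The core issue. Offline RL fails because the actor queries Q for out-of-distribution actions. IQL avoids evaluating Q on actions outside the dataset, and CQL penalizes Q-values for out-of-distribution actions, so both keep learning anchored to the dataset's support.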

Chapter 7: IQL — Expectile Loss Math

IQL avoids querying Q for OOD actions by fitting V(s) with an asymmetric expectile loss that pushes V toward the upper end of the Q distribution — a proxy for the best in-dataset action value.

Expectile loss: $L_2^\lambda(x) = |\lambda - \mathbb{1}(x < 0)| \cdot x^2$
$= (1-\lambda)\,x^2$ if $x < 0$,   $\lambda\,x^2$ if $x \ge 0$
where $x = \hat{V}(s) - \hat{Q}(s, a)$
Exercise 7.1: Expectile Loss (positive residual) Derive

λ = 1/4. V̂(s) = 4, Q̂(s, a1) = 2. Compute x and the loss for this transition.

loss
Show derivation
x = V̂(s) − Q̂(s, a1) = 4 − 2 = 2 ≥ 0
loss = λ × x² = 0.25 × 4 = 1
Exercise 7.2: Expectile Loss (negative residual) Derive

Same setup. V̂(s) = 4, Q̂(s, a2) = 6. Compute the loss for this transition.

loss
Show derivation
x = 4 − 6 = −2 < 0
loss = (1 − λ) × x² = 0.75 × 4 = 3
Exercise 7.3: Average Expectile Loss Derive

Average the losses from 7.1 and 7.2 (equal-weight dataset). What is the mean loss?

avg loss
Show derivation
(1 + 3) / 2 = 2
Exercise 7.4: Why Asymmetric? Trace
Why use asymmetric expectile loss for V(s) instead of standard MSE?
Exercise 7.5: Why NOT for Q? Trace
Why would using expectile loss for the Q-function (instead of standard Bellman MSE) be a bad idea?
Exercise 7.6: Build Expectile Loss Build

Implement expectileLoss.

Show solution
javascript
function expectileLoss(vHat, qHat, lambda) {
  const x = vHat - qHat;
  const w = x >= 0 ? lambda : (1 - lambda);
  return w * x * x;
}
The asymmetry. λ < 0.5 penalizes V > Q less than V < Q. This pushes V toward the upper expectile of Q — approximating the value of the best in-dataset action, without ever querying Q for unsupported actions.
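
The upward push can be seen by searching for the V̂ that minimizes the average loss. The sketch below uses a plain grid search purely for illustration (in IQL, V̂ is a network trained by gradient descent on this loss); `avgExpectileLoss` and `bestV` are hypothetical helper names.

```javascript
// Average expectile loss of a candidate V̂ against a set of Q̂ values,
// using this chapter's convention x = V̂ - Q̂.
function avgExpectileLoss(vHat, qValues, lambda) {
  let total = 0;
  for (const q of qValues) {
    const x = vHat - q;
    total += (x >= 0 ? lambda : 1 - lambda) * x * x;
  }
  return total / qValues.length;
}

// Grid search for the minimizing V̂ between the smallest and largest Q̂.
function bestV(qValues, lambda, steps = 400) {
  const lo = Math.min(...qValues);
  const hi = Math.max(...qValues);
  let best = lo;
  let bestLoss = avgExpectileLoss(lo, qValues, lambda);
  for (let i = 1; i <= steps; i++) {
    const v = lo + ((hi - lo) * i) / steps;
    const loss = avgExpectileLoss(v, qValues, lambda);
    if (loss < bestLoss) {
      best = v;
      bestLoss = loss;
    }
  }
  return best;
}

const qs = [2, 6]; // the two Q̂ values from Exercises 7.1 and 7.2
console.log(bestV(qs, 0.5));  // symmetric weights recover the mean
console.log(bestV(qs, 0.25)); // λ < 0.5 moves the minimizer toward the larger Q̂
```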

Chapter 8: Sparse Rewards & Goal-Conditioned RL

A Mars rover learns to deposit rocks into a bin. Reward = 1 only when the rock is placed successfully, 0 otherwise. Almost all trajectories fail. How do we learn from near-zero signal?

Exercise 8.1: Sparse Reward Challenges Trace
With sparse reward (1 on success, 0 otherwise) and mostly failed trajectories, which is TRUE? (A) Transition model learning is harder. (B) Value estimation is hard because few successful episodes provide positive signal. (C) Q-learning converges to accurate values because rewards are low-variance. (D) Small γ helps propagate rare success signals.
Exercise 8.2: Hindsight Relabeling Trace
In goal-conditioned RL with π(a | s, g), relabeling the final state sT as the goal g creates useful signal. Why?
Exercise 8.3: Goal-Conditioned Reward Derive

Using distance-based reward r(s,a,g) = −||s − g||. If s = (3, 4) and g = (0, 0), what is the reward?

reward
Show derivation
||s − g|| = √(3² + 4²) = √(9 + 16) = √25 = 5
r = −5
Exercise 8.4: GCRL Pipeline Design

Order the Hindsight Experience Replay (HER) steps.

Collect trajectory with original goal
Relabel with sT as new goal
Compute reward for relabeled goal
Train Q(s,a,g) on relabeled data
The trick. Hindsight relabeling turns failed trajectories into successes by asking "what goal did this trajectory actually achieve?" This creates dense training signal from sparse-reward data.
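
A minimal relabeling sketch, under several assumptions: the trajectory is stored as a state sequence plus an action sequence, the reward is the distance-based reward from Exercise 8.3, and the reward is evaluated on the state reached after each action. The names `goalReward` and `relabelWithFinalState` are hypothetical.

```javascript
// Distance-based goal reward from Exercise 8.3: r = -||s - g||.
function goalReward(state, goal) {
  return -Math.hypot(...state.map((si, d) => si - goal[d]));
}

// Hindsight relabeling sketch. `states` is the visited sequence s_0..s_T and
// `actions` holds a_0..a_{T-1}; this data layout is an assumption.
// Each transition is re-scored against the achieved goal g = s_T, with the
// reward evaluated on the next state (also an assumption).
function relabelWithFinalState(states, actions) {
  const goal = states[states.length - 1];
  return actions.map((action, t) => ({
    state: states[t],
    action,
    nextState: states[t + 1],
    goal,
    reward: goalReward(states[t + 1], goal),
  }));
}

// Made-up 2D trajectory that missed its original goal but ended at (0, 0).
const states = [[3, 4], [1, 1], [0, 0]];
const relabeled = relabelWithFinalState(states, ["a0", "a1"]);
console.log(relabeled.map((tr) => tr.reward));
```

The last relabeled transition reaches the relabeled goal, so its distance to the goal is zero even though the trajectory failed under its original goal.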

Chapter 9: Capstone — Full Exam Walkthrough

This capstone combines all prior chapters into a single multi-step problem that mirrors the exam format. Work through it end-to-end.

Exercise 9.1: Gaussian Mode Collapse Derive

Expert data at state s1: modes (v, ω) = {(3.0, 4.0), (1.0, −4.0)}. Compute the Gaussian MLE mean μω.

μω
Exercise 9.2: 3-Step Euler Integration Derive

Flow matching with 3 Euler steps (Δτ = 1/3). Start $a_0 = (2.0, 1.0)$. Velocities: v(τ=0) = (−3, −6), v(τ=1/3) = (−3, −6), v(τ=2/3) = (−3, −6). Compute $a_1[v]$ (first component of final action).

v-component
Show derivation
$a_{1/3}[v] = 2.0 + (1/3)(-3) = 2.0 - 1.0 = 1.0$
$a_{2/3}[v] = 1.0 + (1/3)(-3) = 1.0 - 1.0 = 0.0$
$a_{1}[v] = 0.0 + (1/3)(-3) = 0.0 - 1.0 = -1.0$
Exercise 9.3: TD Target Chain Derive

Grid world: s=−1, a=Right, r=5, s'=0. γ = 0.5. Q̂(0,Left) = 8, Q̂(0,Right) = 12. Compute 1-step target and loss if Q̂(−1, Right) = 20.

loss
Show derivation
y = r(−1, Right) + γ max{Q̂(0,L), Q̂(0,R)} = 5 + 0.5 × max{8, 12} = 5 + 6 = 11
Q̂(−1, Right) = 20
loss = (y − Q̂)² = (11 − 20)² = 81
Exercise 9.4: IQL Full Pipeline Derive

λ = 0.1. V̂(s) = 5. Q̂ values for 3 dataset actions: {3, 5, 10}. Compute the average expectile loss.

avg loss
Show derivation
x1 = V̂ − Q̂1 = 5 − 3 = 2 ≥ 0 → λ × x² = 0.1 × 4 = 0.4
x2 = 5 − 5 = 0 ≥ 0 → 0.1 × 0 = 0
x3 = 5 − 10 = −5 < 0 → (1−λ) × x² = 0.9 × 25 = 22.5
Average = (0.4 + 0 + 22.5) / 3 = 22.9 / 3 ≈ 7.63

Notice how the loss heavily penalizes V̂ < Q̂ (x < 0). The (1−λ) = 0.9 weight on negative residuals pushes V̂ upward toward the best Q̂ values.

Exercise 9.5: Hindsight Relabeling Count Derive

You collect 100 trajectories. 3 reach $g_{\text{bin}}$ (success). With HER, you relabel each trajectory with its final state as the goal. How many total "successful" trajectories (reaching their relabeled goal) do you now have?

trajectories
Show reasoning

Every trajectory reaches its own final state sT by definition. When you relabel g = sT, every trajectory is a "success" for that relabeled goal. All 100 trajectories become successful demonstrations.

Related Lessons

Topic | Lesson
Policy Gradients (full derivation) | Policy Gradients
Hierarchical Policies | Hierarchy in IL & RL
RL Fundamentals | RL Fundamentals Workbook
Advanced RL Theory | RL Theory Workbook
Exam readiness check. If you can solve every exercise above without peeking at solutions, you own the core problem types in Deep RL exams. The key formulas: Euler integration for flow matching, REINFORCE and its variants, 1-step and n-step TD targets, expectile loss for IQL, and hindsight relabeling for GCRL.