Every exam-style problem in Deep RL: policy expressivity, flow matching arithmetic, REINFORCE credit assignment, Q-learning TD targets, offline RL, and goal-conditioned RL. Compute by hand, check instantly.
You're training a self-driving race car. At a critical state $s_1$, expert drivers use two distinct safe racing lines around a barrier. The action space is continuous: a = (v, ω), where v ∈ [0, 5] is velocity and ω ∈ [−10, 10] is steering. The expert data has two equally likely modes: (v, ω) = (2.0, 2.4) and (v, ω) = (0.5, −2.4).
Safety constraints: any action with |ω| ≤ 2.3 or |ω| ≥ 2.5 crashes. A "fast left turn" (v > 0.6, ω < 0) causes a rollover.
Fit $\pi(a \mid s_1) = \mathcal{N}(\mu, \Sigma)$ to the two expert modes via MLE. What is $\mu_v$?
The MLE mean of a Gaussian is the arithmetic mean of the data points.
What is $\mu_\omega$ for the same Gaussian MLE fit?
The two steering values cancel perfectly. The mean action μ = (1.25, 0) has |ω| = 0 ≤ 2.3, causing a crash. The Gaussian collapses two safe modes into one unsafe average.
(A) is wrong: more equal-weight data doesn't change the MLE mean of a Gaussian — the problem is the model family, not data quantity.
(B) is wrong: the closest bin centers to ω = ±2.4 are ω = ±2, and |ω| = 2 ≤ 2.3 still crashes. Discretization granularity is too coarse.
(C) is correct: flow-matching policies can represent multimodal distributions by denoising to different modes.
(D) is correct: if you only keep one mode, MLE recovers that safe mode.
New scenario: 3 equally likely modes at $s_2$: (v, ω) = {(1.0, 3.0), (2.0, −1.0), (3.0, 3.0)}. What is $\mu_v$ for a Gaussian MLE?
This behavioral cloning code fits a Gaussian to bimodal expert data. Which line causes the unsafe mode-averaging behavior?
```python
expert_actions = [(2.0, 2.4), (0.5, -2.4)]
mu = np.mean(expert_actions, axis=0)            # MLE mean
sigma = np.cov(expert_actions, rowvar=False)
policy = MultivariateNormal(mu, sigma)
action = policy.sample()                        # deploy
```
Implement gaussianMLE that takes an array of 2D action vectors and returns the MLE mean [$\mu_v$, $\mu_\omega$].
```javascript
function gaussianMLE(actions) {
  const n = actions.length;
  let sv = 0, sw = 0;
  for (const [v, w] of actions) {
    sv += v;
    sw += w;
  }
  return [sv / n, sw / n];
}
```
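To see the mode-averaging failure numerically, the sketch below averages the two expert modes and checks each action against the scenario's safety constraints. The helper names `isSafe` and `meanAction` are hypothetical, and the mode values are the ones used in the behavioral cloning snippet above.

```javascript
// Hypothetical helper encoding the scenario's safety constraints:
// safe steering requires 2.3 < |w| < 2.5, and a fast left turn
// (v > 0.6 with w < 0) causes a rollover.
function isSafe([v, w]) {
  const steeringOk = Math.abs(w) > 2.3 && Math.abs(w) < 2.5;
  const rollover = v > 0.6 && w < 0;
  return steeringOk && !rollover;
}

// MLE mean of a Gaussian: the per-dimension arithmetic mean.
function meanAction(actions) {
  const n = actions.length;
  return actions[0].map((_, i) => actions.reduce((s, a) => s + a[i], 0) / n);
}

const modes = [[2.0, 2.4], [0.5, -2.4]];
const mu = meanAction(modes);

console.log(modes.map(isSafe)); // both expert modes pass the check
console.log(mu, isSafe(mu));    // the averaged action does not
```

Both modes are individually safe, yet their average lands at ω = 0, well inside the crash region.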
Your flow-matching policy learned a vector field $v_\theta(s_t, a_{t,\tau}, \tau)$ that denoises actions. During inference, you start from Gaussian noise and integrate the ODE using Euler steps. Can you trace through the arithmetic?
Start from noise $a_{t,0} = (1.5, -0.4)$. Step size $\Delta\tau = 0.5$. Network outputs $v_\theta(s_t, a_{t,0}, 0) = (-1.0, -2.0)$. Compute the velocity component of $a_{t,0.5}$.
$a_{t,0.5} = a_{t,0} + 0.5 \cdot v_\theta(s_t, a_{t,0}, 0)$. Compute the v-component (first element).
Same setup. Compute the ω-component of $a_{t,0.5}$.
From $a_{t,0.5} = (1.0, -1.4)$, the network outputs $v_\theta(s_t, a_{t,0.5}, 0.5) = (-1.1, -1.9)$. Compute $a_{t,1}[v]$ (the final velocity).
Compute $a_{t,1}[\omega]$ (the final steering value).
Implement eulerIntegrate that takes a starting noise vector, a step size, and an array of velocity vectors (one per step), and returns the final action.
```javascript
function eulerIntegrate(a0, stepSize, velocities) {
  let a = [...a0];
  for (const v of velocities) {
    a[0] += stepSize * v[0];
    a[1] += stepSize * v[1];
  }
  return a;
}
```
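At inference time the velocities are not known in advance; each one is produced by querying the network at the current point. The following sketch shows that loop under the assumption that the network is available as a callback. `sampleAction` and `velocityField` are hypothetical names, and `tableField` simply replays the two network outputs from the worked problem.

```javascript
// Euler integration of da/dtau = v(a, tau) from tau = 0 to tau = 1.
// `velocityField` is a hypothetical stand-in for the trained network.
function sampleAction(a0, numSteps, velocityField) {
  const dt = 1 / numSteps;
  let a = [...a0];
  for (let k = 0; k < numSteps; k++) {
    const tau = k * dt;
    const v = velocityField(a, tau);      // query the field at the current point
    a = a.map((ai, i) => ai + dt * v[i]); // one Euler step
  }
  return a;
}

// Lookup table reproducing the two network outputs from the problem.
const tableField = (a, tau) => (tau < 0.5 ? [-1.0, -2.0] : [-1.1, -1.9]);

const action = sampleAction([1.5, -0.4], 2, tableField);
console.log(action); // approximately [0.45, -2.35]
```

The final steering magnitude of about 2.35 falls inside the safe band between 2.3 and 2.5, which the Gaussian mean could not reach.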
You're training a Go agent with policy gradients. Go is a long-horizon sparse-reward game: $r(s_t, a_t) = +1$ at game end if win, $-1$ if loss, 0 otherwise. A trajectory is one full game.
(A) TRUE: REINFORCE is an unbiased estimator of the policy gradient by the log-derivative trick.
(B) FALSE: weighting every action by the same terminal outcome is exactly what makes credit assignment hard. Winning trajectories can contain bad moves.
(C) FALSE: single-trajectory sampling causes variance, not bias. The estimator is unbiased in expectation.
(D) TRUE: the single terminal +1/−1 is spread across all actions in the trajectory, creating enormous variance.
$R_{3/1} = R_{\pm 1} + 2$. This is a constant shift. Since $\mathbb{E}[\nabla_\theta \log p_\theta(\tau)] = 0$, the constant cancels in expectation. The expected gradient is unchanged. (B) is correct. (A) is false: baselines reduce variance but are NOT required for correctness. (C) is false: even though losing trajectories get positive reward (+1), wins get +3, so wins are still upweighted more.
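The cancellation can be checked exactly on a toy problem. The sketch below is an assumption made for illustration, not the Go setting itself: a one-step game with two actions (win or lose) and a sigmoid policy over a single logit θ, so the expectation over trajectories is a two-term sum. `expectedGradient` is a hypothetical name.

```javascript
// Toy one-step game: action "win" has probability p = sigmoid(theta),
// action "lose" has probability 1 - p.
function expectedGradient(theta, rewardWin, rewardLoss) {
  const p = 1 / (1 + Math.exp(-theta));
  // d/dtheta log pi(win) = 1 - p,  d/dtheta log pi(lose) = -p
  const gradLogWin = 1 - p;
  const gradLogLose = -p;
  // Exact expectation over the two possible trajectories.
  return p * gradLogWin * rewardWin + (1 - p) * gradLogLose * rewardLoss;
}

const theta = 0.3;
const gPlusMinus = expectedGradient(theta, +1, -1); // win +1, loss -1
const gShifted = expectedGradient(theta, +3, +1);   // same rewards shifted by +2
console.log(gPlusMinus, gShifted); // equal up to floating-point rounding
```

Both reward schemes give the same expected gradient, 2p(1 − p), because the shift multiplies a term whose expectation is zero.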
In this sparse Go setting, the only nonzero reward is at termination. For any non-terminal timestep t, both $\sum_{t'=1}^{T} r(s_{t'}, a_{t'})$ and $\sum_{t'=t}^{T} r(s_{t'}, a_{t'})$ contain the same single terminal reward. The causality trick doesn't help here because there are no past nonzero rewards to remove.
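A short sketch makes the point concrete. It computes the undiscounted reward-to-go at every timestep, first for a sparse terminal-reward game and then for a made-up dense reward sequence used only as a contrast. `rewardToGo` is a hypothetical helper name.

```javascript
// Undiscounted reward-to-go: out[t] = sum of rewards from step t to the end.
function rewardToGo(rewards) {
  const out = new Array(rewards.length);
  let acc = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    acc += rewards[t];
    out[t] = acc;
  }
  return out;
}

const sparseGame = [0, 0, 0, 0, 1];  // a win: only the terminal reward is nonzero
const denseTask = [2, 0, 1, 0, 1];   // hypothetical dense rewards for contrast

console.log(rewardToGo(sparseGame)); // [1, 1, 1, 1, 1]
console.log(rewardToGo(denseTask));  // [4, 2, 2, 1, 1]
```

In the sparse case every timestep receives the same weight as the full return, so reward-to-go removes nothing. In the dense case the later timesteps are no longer credited with earlier rewards.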
Vanilla REINFORCE waits until the episode ends to assign credit. Actor-critic methods bootstrap from a learned value function to get faster, lower-variance updates. But bootstrapping introduces bias whenever the learned $V_\phi$ is imperfect.
(A) TRUE: at non-terminal Go positions, r = 0, so the 1-step estimate is $\hat{V}(s_{t+1}) - \hat{V}(s_t)$.
(B) TRUE: larger n uses more real rewards before bootstrapping.
(C) TRUE: if V̂ is wrong, the 1-step estimate is dominated by bootstrap error.
(D) FALSE: increasing n lowers bias but increases variance (more real trajectory randomness).
A robot walking task. $r(s_t, a_t) = 3$, $\gamma = 0.9$, $\hat{V}(s_t) = 10$, $\hat{V}(s_{t+1}) = 12$. Compute the 1-step advantage estimate.
Positive advantage: this action performed better than expected from state $s_t$.
Put the actor-critic training loop steps in order.
Implement advantage1Step.
```javascript
function advantage1Step(r, gamma, v_next, v_curr) {
  return r + gamma * v_next - v_curr;
}
```
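The bias-variance discussion above concerns n-step estimates, of which the 1-step case is the smallest instance. A minimal sketch of the general form follows, assuming the caller supplies the n observed rewards and the value estimate at the state reached after them. `nStepAdvantage` is a hypothetical name.

```javascript
// n-step advantage estimate:
//   sum_{k<n} gamma^k * r_k + gamma^n * V(s_{t+n}) - V(s_t)
// `rewards` holds the n real rewards; `vBootstrap` is the value estimate
// at the state reached after those n steps.
function nStepAdvantage(rewards, gamma, vBootstrap, vCurr) {
  let ret = 0;
  let discount = 1;
  for (const r of rewards) {
    ret += discount * r;
    discount *= gamma;
  }
  return ret + discount * vBootstrap - vCurr;
}

// With a single reward this reduces to the 1-step estimate
// from the robot walking problem.
console.log(nStepAdvantage([3], 0.9, 12, 10)); // about 3.8
```

Passing more rewards shrinks the weight on the bootstrap term by a further factor of γ per step, which is where the lower bias and higher variance come from.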
A hallway grid world with states {−2, −1, 0, 1, 2}. Actions: Left, Right. Start at 0. Discount γ = 0.5.
Transition: s = 0, a = Right, r = 5, s' = 1. Compute the 1-step Q-learning target $y = r + \gamma \max_{a'} \hat{Q}(s', a')$.
The current prediction is Q̂(0, Right) = 12 and the target is y = 15. Compute the squared loss (y − Q̂)².
Behavior policy alternates Right, Left, Right, Left from s = 0. Window 1: $0 \xrightarrow{\text{Right},\ r=5} 1 \xrightarrow{\text{Left},\ r=4} 0$. Compute the 2-step target for (s = 0, a = Right).
$y^{(2)} = r_0 + \gamma r_1 + \gamma^2 \max_{a'} \hat{Q}(s_2, a')$. The agent ends back at s = 0.
Q̂(0, Right) = 12, target = 10. Compute the squared loss.
Window 2: $1 \xrightarrow{\text{Left},\ r=4} 0 \xrightarrow{\text{Right},\ r=5} 1$. Compute the 2-step target for (s = 1, a = Left).
Window 1 has loss 4. For window 2, Q̂(1, Left) = 10 and the target is 11.5, so the loss is (11.5 − 10)² = 2.25. Compute the average loss over both windows.
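The two windows can be reproduced with a small n-step target helper. This is a sketch under stated assumptions: `nStepTarget` and `squaredLoss` are hypothetical names, the values Q̂(0, Left) = 8 and Q̂(0, Right) = 12 are borrowed from the capstone grid world, and max Q̂(1, ·) = 20 is inferred from the stated targets rather than given in the text.

```javascript
// n-step Q-learning target:
//   sum_{k<n} gamma^k * r_k + gamma^n * max_a' Q(s_n, a')
function nStepTarget(rewards, gamma, qBootstrapValues) {
  let y = 0;
  let discount = 1;
  for (const r of rewards) {
    y += discount * r;
    discount *= gamma;
  }
  return y + discount * Math.max(...qBootstrapValues);
}

const squaredLoss = (target, pred) => (target - pred) ** 2;

const gamma = 0.5;
const qAtState0 = [8, 12]; // assumed: Q(0, Left), Q(0, Right)
const qAtState1 = [20];    // assumed: only the max over actions at s = 1 matters

const y1 = nStepTarget([5, 4], gamma, qAtState0); // window 1, ends at s = 0
const y2 = nStepTarget([4, 5], gamma, qAtState1); // window 2, ends at s = 1
const loss1 = squaredLoss(y1, 12); // Q(0, Right) = 12
const loss2 = squaredLoss(y2, 10); // Q(1, Left) = 10

console.log(y1, y2);               // 10 11.5
console.log((loss1 + loss2) / 2);  // 3.125
```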
We want to evaluate one policy (π*) using data collected by a different policy (random). This is the off-policy evaluation problem. It's subtle: the data doesn't come from the policy we're evaluating.
None work correctly. In all options, the reward $r_t$ and next state $s_{t+1}$ come from the random data-collection action $a_t$, not from $\pi(s_t)$. The first transition's reward and dynamics don't match $V^\pi$. These targets fit $Q^*(s_t, a_t)$ or the random policy's value, not $V^\pi(s_t)$.
Implement qTarget that computes the 1-step Q-learning TD target.
```javascript
function qTarget(r, gamma, qNextValues) {
  return r + gamma * Math.max(...qNextValues);
}
```
You have a fixed dataset collected by some behavior policy. No more environment interaction. Naively running SAC or DQN on this data fails catastrophically. Why?
This offline RL code has the core issue that causes failure. Which line?
```python
q_pred = Q_net(state, action)               # current Q
next_action = actor(next_state)             # actor proposes
q_next = Q_net(next_state, next_action)     # Q of OOD action!
target = reward + gamma * q_next
loss = (q_pred - target)**2
```
Advantage-weighted regression uses weight exp(Â/α). If Â(s,a) = 2 and α = 0.5, what is the weight?
High-advantage actions get exponentially more weight. Small α = sharper weighting = more selective.
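As a quick numerical check, the sketch below evaluates the weight for the stated values and for a larger temperature. `awrWeight` is a hypothetical helper name, and the second call with α = 2 is an extra illustration that is not part of the problem.

```javascript
// Advantage-weighted regression weight: exp(A / alpha).
function awrWeight(advantage, alpha) {
  return Math.exp(advantage / alpha);
}

console.log(awrWeight(2, 0.5)); // e^4, about 54.6
console.log(awrWeight(2, 2));   // e^1, about 2.72: larger alpha flattens the weighting
```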
IQL avoids querying Q for OOD actions by fitting V(s) with an asymmetric expectile loss that pushes V toward the upper end of the Q distribution — a proxy for the best in-dataset action value.
λ = 1/4. $\hat{V}(s) = 4$, $\hat{Q}(s, a_1) = 2$. Compute the residual $x = \hat{V}(s) - \hat{Q}(s, a_1)$ and the loss for this transition.
Same setup. $\hat{V}(s) = 4$, $\hat{Q}(s, a_2) = 6$. Compute the loss for this transition.
Average the losses from the two transitions above (equal-weight dataset). What is the mean loss?
Implement expectileLoss.
```javascript
function expectileLoss(vHat, qHat, lambda) {
  const x = vHat - qHat;
  const w = x >= 0 ? lambda : (1 - lambda);
  return w * x * x;
}
```
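The per-transition loss extends naturally to a dataset average. The sketch below repeats the loss so the block is self-contained and applies it to the two-transition problem with λ = 1/4. `meanExpectileLoss` is a hypothetical name.

```javascript
// Per-transition expectile loss with residual x = V - Q:
// weight lambda when x >= 0, weight (1 - lambda) when x < 0.
function expectileLoss(vHat, qHat, lambda) {
  const x = vHat - qHat;
  const w = x >= 0 ? lambda : 1 - lambda;
  return w * x * x;
}

// Equal-weight average over all dataset Q-values at the same state.
function meanExpectileLoss(vHat, qValues, lambda) {
  const total = qValues.reduce((s, q) => s + expectileLoss(vHat, q, lambda), 0);
  return total / qValues.length;
}

console.log(expectileLoss(4, 2, 0.25));          // 1  (x = 2, weight 0.25)
console.log(expectileLoss(4, 6, 0.25));          // 3  (x = -2, weight 0.75)
console.log(meanExpectileLoss(4, [2, 6], 0.25)); // 2
```

The two residuals have the same magnitude, but the transition where V̂ underestimates Q̂ costs three times as much.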
A Mars rover learns to deposit rocks into a bin. Reward = 1 only when the rock is placed successfully, 0 otherwise. Almost all trajectories fail. How do we learn from near-zero signal?
Use the distance-based reward r(s, a, g) = −||s − g||. If s = (3, 4) and g = (0, 0), what is the reward?
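A minimal sketch of this reward, assuming states and goals are plain coordinate arrays of equal length and that the norm is Euclidean. `goalReward` is a hypothetical name.

```javascript
// Distance-based goal-conditioned reward: negative Euclidean distance
// between the state and the goal. The action does not enter the formula.
function goalReward(state, goal) {
  const diffs = state.map((si, i) => si - goal[i]);
  return -Math.hypot(...diffs);
}

console.log(goalReward([3, 4], [0, 0])); // -5
```

Unlike the 0/1 success reward, this signal changes at every step, so failed trajectories still carry gradient information.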
Order the Hindsight Experience Replay (HER) steps.
This capstone combines all prior chapters into a single multi-step problem that mirrors the exam format. Work through it end-to-end.
Expert data at state $s_1$: modes (v, ω) = {(3.0, 4.0), (1.0, −4.0)}. Compute the Gaussian MLE mean $\mu_\omega$.
Flow matching with 3 Euler steps ($\Delta\tau = 1/3$). Start $a_0 = (2.0, 1.0)$. Velocities: v(τ=0) = (−3, −6), v(τ=1/3) = (−3, −6), v(τ=2/3) = (−3, −6). Compute $a_1[v]$ (first component of final action).
Grid world: s=−1, a=Right, r=5, s'=0. γ = 0.5. Q̂(0,Left) = 8, Q̂(0,Right) = 12. Compute 1-step target and loss if Q̂(−1, Right) = 20.
λ = 0.1. V̂(s) = 5. Q̂ values for 3 dataset actions: {3, 5, 10}. Compute the average expectile loss.
Notice how the loss heavily penalizes V̂ < Q̂ (x < 0). The (1−λ) = 0.9 weight on negative residuals pushes V̂ upward toward the best Q̂ values.
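The upward push can be made visible by finding the V̂ that minimizes the mean loss. The sketch below uses a plain grid search over [0, 12], which is an illustration choice rather than how IQL is trained, on the three Q̂ values from the problem above. `argminV` and `meanExpectileLoss` are hypothetical names.

```javascript
function expectileLoss(vHat, qHat, lambda) {
  const x = vHat - qHat;
  const w = x >= 0 ? lambda : 1 - lambda;
  return w * x * x;
}

function meanExpectileLoss(vHat, qValues, lambda) {
  const total = qValues.reduce((s, q) => s + expectileLoss(vHat, q, lambda), 0);
  return total / qValues.length;
}

// Grid search for the V that minimizes the mean expectile loss.
function argminV(qValues, lambda, lo, hi, numPoints) {
  let bestV = lo;
  let bestLoss = Infinity;
  for (let i = 0; i < numPoints; i++) {
    const v = lo + ((hi - lo) * i) / (numPoints - 1);
    const loss = meanExpectileLoss(v, qValues, lambda);
    if (loss < bestLoss) {
      bestLoss = loss;
      bestV = v;
    }
  }
  return bestV;
}

const qValues = [3, 5, 10];
console.log(meanExpectileLoss(5, qValues, 0.1)); // about 7.63
console.log(argminV(qValues, 0.5, 0, 12, 1201)); // close to the plain mean, 6
console.log(argminV(qValues, 0.1, 0, 12, 1201)); // noticeably higher, near 8.9
```

With λ = 0.5 the loss is symmetric and the minimizer is the ordinary mean. With λ = 0.1 the minimizer moves most of the way toward the largest Q̂ value, 10.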
You collect 100 trajectories. 3 reach $g_{\text{bin}}$ (success). With HER, you relabel each trajectory with its final state as the goal. How many total "successful" trajectories (reaching their relabeled goal) do you now have?
Every trajectory reaches its own final state $s_T$ by definition. When you relabel $g = s_T$, every trajectory is a "success" for that relabeled goal. All 100 trajectories become successful demonstrations.
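The counting argument can be sketched in a few lines. Everything about the data layout here is an assumption for illustration: trajectories are objects of the form `{ states, goal }`, states are integers, and `G_BIN` is a made-up label for the bin goal.

```javascript
// Hypothetical trajectory format: { states: [...], goal }.
const lastState = (traj) => traj.states[traj.states.length - 1];

// A trajectory succeeds if its final state equals its goal.
const reachedGoal = (traj) => lastState(traj) === traj.goal;

// HER-style relabeling with the final-state strategy: g = s_T.
const relabelWithFinalState = (traj) => ({ ...traj, goal: lastState(traj) });

// Build 100 toy trajectories; only the first 3 end at the bin.
const G_BIN = 7;
const trajectories = [];
for (let i = 0; i < 100; i++) {
  const finalState = i < 3 ? G_BIN : 100 + i;
  trajectories.push({ states: [0, finalState], goal: G_BIN });
}

const before = trajectories.filter(reachedGoal).length;
const after = trajectories.map(relabelWithFinalState).filter(reachedGoal).length;
console.log(before, after); // 3 100
```

The relabeled set still contains the 3 original successes, now alongside 97 trajectories that succeed at goals other than the bin.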
| Topic | Lesson |
|---|---|
| Policy Gradients (full derivation) | Policy Gradients |
| Hierarchical Policies | Hierarchy in IL & RL |
| RL Fundamentals | RL Fundamentals Workbook |
| Advanced RL Theory | RL Theory Workbook |