← Gleams
Stanford CS 224R · Lecture 2 · Deep RL

The Complete Guide to Imitation Learning

Why showing beats telling. From behavioral cloning to diffusion policies to DAgger — every failure mode explained.

Behavioral cloning The averaging trap Diffusion policies Compounding errors DAgger & HG-DAgger
Roadmap

What You'll Master

Chapter 01

The Problem: Teaching by Showing

Imagine you're teaching a child to tie their shoes. You don't write down a 47-step motor-control program specifying joint torques for each finger. You show them. You tie the shoe, they watch, they try, they fumble, you show them again. Eventually they get it.

Imitation learning brings this idea to robots and AI agents. Instead of designing a reward function and hoping the agent discovers good behavior through trial and error, you demonstrate the desired behavior and let the agent learn by copying you.

Definition
Imitation Learning

Learning a policy πθ(a|s) from expert demonstrations, without access to a reward function. The demonstrations are the entire training signal.

Quick terminology check from the MDP/RL framework:

SymbolMeaningExample (Self-Driving)
stState at time tCar position, velocity, all objects
otObservation at time tCamera image (partial view of state)
atAction at time tSteering angle, acceleration
τTrajectory (s₁,a₁,...,sT,aT)One full driving episode
πθ(a|s)Policy (learned strategy)Neural net: image → steering command
Why not just use RL?

RL requires a reward function. For many tasks — folding laundry, cooking an egg, suturing tissue — specifying a reward function is harder than just doing the task yourself. Imitation learning sidesteps this entirely: you show the robot what "success" looks like, and it learns to replicate it.

The dataset for imitation learning is a collection of expert demonstrations:

Expert Dataset 𝒟 = {(s₁, a₁, s₂, a₂, …, sT, aT)₁, …, (s₁, a₁, …, sT, aT)N}

N expert trajectories. Each one: a recording of states visited and actions taken by a skilled demonstrator.

The question this entire lecture answers: how do we turn these demonstrations into a policy that works?

The Spectrum of Approaches

At one extreme: hand-code every behavior (classical robotics). At the other: pure RL from scratch (agent learns everything by trial and error). Imitation learning sits in the sweet spot — leverage human expertise without the brittleness of hand-coding or the sample complexity of pure RL.

Chapter 02

Behavioral Cloning

The simplest possible approach. You have (state, action) pairs from the expert. You want a neural network that maps states to actions. This is… supervised learning. Literally just regression.

Definition
Behavioral Cloning (BC)

Train a policy by supervised learning on expert demonstrations. Given a state, predict the expert's action. That's it.

Behavioral Cloning Objective minθ Σ(s,a) ∈ 𝒟 ‖ a − πθ(s) ‖²

Minimize the squared difference between expert actions and predicted actions. Standard L2 regression.

Unpack each piece:

πθ(s) — your neural network takes state s as input, outputs a predicted action â.

a — the expert's actual action at that state. This is your ground truth label.

‖·‖² — squared Euclidean distance. Penalizes any deviation from the expert.

minθ — adjust network weights via gradient descent to minimize the total prediction error across all demos.

If you've ever done supervised regression in machine learning, you already know how to do behavioral cloning. The state is the input feature, the action is the label, and the loss is MSE. That's literally it.

Worked Example — Self-Driving BC

State: camera image showing a road curving left. Expert steers at −12° (left). Your network predicts −10°.

Loss = (−12 − (−10))² = (−2)² = 4.

Gradient pushes θ to make the prediction closer to −12°. After thousands of examples and gradient steps, the network learns to imitate the expert's steering across many road conditions.

BC = Offline & Simple

No simulator needed. No reward function needed. No interaction with the environment. You collect demos once, train overnight, and deploy. This is why BC is by far the most common starting point in real robotics — it works well enough to get a first prototype running.

The Hidden Assumption

BC assumes there's a single correct action for every state. For many tasks (following a trajectory, stabilizing a drone), this is approximately true. But for tasks with multiple valid strategies, this assumption will betray you spectacularly.

But there is a devastating failure mode hiding in that innocent L2 loss…

Chapter 03

The Averaging Trap

You're training a self-driving car. There's a large obstacle in the middle of the road. In your dataset, some demonstrators swerve left. Others swerve right. Both are perfectly valid.

What does L2 regression predict? The mean. Go straight. Into the obstacle.

The Averaging Problem

When the data is multimodal — multiple valid actions for the same state — minimizing squared error produces the average of those actions. The average of two good actions can be a catastrophically bad action.

Why L2 Produces the Mean

Minimizing E[‖a − â‖²] over â has a closed-form solution: â* = E[a]. The optimal prediction under L2 loss is always the conditional mean of the data. This is a fact from basic statistics, not a choice.

Worked Example — The Numbers

State: obstacle ahead. 5 expert demos at this state:

• Demo 1: steer −20° (left) • Demo 2: steer −18° (left) • Demo 3: steer +22° (right) • Demo 4: steer +19° (right) • Demo 5: steer −21° (left)

L2-optimal prediction: (−20 − 18 + 22 + 19 − 21) / 5 = −3.6°

Nearly straight ahead. None of the experts ever steered at −3.6° here. The "optimal" prediction is a disaster — you crash into the obstacle.

This isn't a rare edge case. It happens all the time:

• Multiple demonstrators with slightly different styles

• Tasks where multiple strategies are valid (reaching around an object from either side)

• Any situation where the action distribution is not unimodal

The Core Insight

The problem is not the neural network being too small or the training being insufficient. A perfect neural network optimized under L2 loss will still predict the mean. The problem is the loss function forcing a single-point prediction for an inherently multi-valued answer.

When Does BC Work Despite This?

When the data is unimodal at every state — there's essentially one right answer for each situation. A single expert with consistent behavior is ideal. Multiple experts with very different styles is dangerous.

Formal Statement

If the conditional distribution p(a|s) is unimodal for every s in the dataset, then L2 regression recovers the correct action. The averaging problem only appears when p(a|s) has multiple separated modes. The severity scales with the number of modes and the distance between them.

Chapter 04

Expressive Policy Distributions

The fix: instead of predicting a single action, model a distribution over actions. The policy should be able to say "there's a 60% chance we should go left and a 40% chance we should go right" — and never output "go straight."

Option 1: Discrete Actions

If actions are discrete (left, right, up, down), the solution is trivial. Output a categorical distribution via softmax:

Discrete Policy πθ(a|s) = softmax(fθ(s))
e.g., p(left) = 0.6, p(right) = 0.4, p(straight) = 0.0

This is maximally expressive — it can represent any distribution over the discrete action set. No averaging problem. The policy can assign zero probability to "go straight" while putting mass on both left and right.

Option 2: Gaussian (Continuous)

For continuous actions (like steering angle, joint torques, or end-effector velocities), the naive approach is a Gaussian: the network outputs a mean μ and a standard deviation σ.

Gaussian Policy πθ(a|s) = 𝒩(a; μθ(s), σθ(s)²)
Still Unimodal!

A Gaussian has one peak. It can be wide or narrow, but it always has a single mode. For the obstacle scenario: the Gaussian centers at the mean (−3.6°) with large variance. It assigns highest probability to going straight — exactly what we wanted to avoid.

Option 3: Mixture of Gaussians

A natural extension: output multiple Gaussians and mix them with learned weights:

Mixture of Gaussians πθ(a|s) = Σk=1K wk · 𝒩(a; μk, σk²)

Network outputs: μ₁,σ₁,w₁, μ₂,σ₂,w₂, … μKK,wK

Now we can have two modes: one peaked around −20° (left) and one around +20° (right). The mix weight w controls how much probability goes to each mode.

Option 4: Discretize + Autoregressive

For a multi-dimensional continuous action vector (a1, a2, a3…), we can discretize each dimension into bins and model them sequentially, one dimension at a time:

Autoregressive Factorization πθ(a|s) = p(a1|s) · p(a2|s, â1) · p(a3|s, â1, â2) …

Each factor is a categorical over discretized bins. â = sampled value from previous step.

This is how language models work: p(next word | previous words). Same idea, applied to action dimensions. Each conditional is a simple categorical distribution (maximally expressive for that dimension), and the chain rule ensures the joint distribution can be arbitrarily complex.

Network Expressivity ≠ Distribution Expressivity

A critical distinction. Making the neural network bigger (more layers, more parameters) does NOT fix the averaging problem if the output head is still a single Gaussian. A 100-billion-parameter network with a Gaussian output is still unimodal. The bottleneck is the output representation, not the network capacity.

Concrete Comparison

Obstacle scenario. Expert data: 60% steer left (−20°), 40% steer right (+20°).

Gaussian: peak at −4°. Probability of ±20° is low. Fails.

MoG (K=2): mode at −20° (w=0.6), mode at +20° (w=0.4). Works!

Autoregressive: discretize to 1° bins, models exact histogram. Works!

Diffusion: samples cluster at −20° and +20° naturally. Works!

Option 5: Diffusion Models

The most powerful option. Model the action distribution as a denoising diffusion process. Start from noise, iteratively refine into a clean action. Can represent any distribution, including complex multimodal ones. This is so important it gets its own chapter.

Chapter 05

Diffusion Policies

You know how image diffusion models (Stable Diffusion, DALL-E) generate images by starting from pure noise and iteratively denoising? Same idea, but instead of generating pixels, we generate actions.

Definition
Diffusion Policy

A policy that generates actions through iterative denoising. Starting from random Gaussian noise aN, the network predicts and removes noise over N steps to produce a clean action a0. The conditioning input is the current state/observation.

How It Works

The key connection: in image diffusion, you model p(image | text). In diffusion policy, you model p(action | observation). Same math, different domain. The observation (camera image, robot joint angles) plays the role of the text prompt.

Training: Take an expert action a0. Add noise to get an = a0 + ε (where n indexes the noise level). Train the network to predict the noise: ε̂θ(an, s, n) ≈ ε.

Diffusion Training Loss ℒ(θ) = 𝔼n,ε[ ‖ ε − ε̂θ(an, s, n) ‖² ]

Predict the noise that was added at noise level n, conditioned on state s.

Inference: Start with random noise aN ~ 𝒩(0, I). Iteratively denoise: at each step, predict the noise and subtract a fraction of it. After N steps, you have a clean action.

Obstacle Avoidance with Diffusion

State: obstacle ahead. 100 denoising runs produce:

• ~60 runs converge to −20° (left cluster)

• ~40 runs converge to +20° (right cluster)

• ~0 runs converge to 0° (straight)

The diffusion process naturally discovers modes without being told how many there are. Different random starting noises converge to different modes. The valley between modes (going straight) is avoided because no training data lives there.

Why Diffusion Beats Everything Else

Mixture of Gaussians requires choosing K (number of modes) in advance. Autoregressive models accumulate discretization error. Diffusion models can represent arbitrary distributions with arbitrary numbers of modes, automatically. The denoising process is a universal distribution approximator. This is why "Diffusion Policy" (Chi et al., 2023) became state-of-the-art for robotic manipulation.

The Cost: Inference Speed

Diffusion policies require N denoising steps at inference time (typically 10–100). Each step is a forward pass through the network. This is 10–100x slower than a single-pass policy. For a robot running at 10 Hz control, you need fast denoising — techniques like DDIM or consistency models help reduce N to as few as 1–4 steps.

Flow Matching: The Modern View

The noise-prediction formulation above (DDPM-style) works, but modern systems like π0 (Physical Intelligence, 2024) use a cleaner framework called flow matching. The idea: instead of learning to predict noise, learn a velocity field that smoothly transforms Gaussian noise into expert actions.

Start with two things: a data point x₁ (an expert action) and a noise sample x₀ ~ 𝒩(0, I). Define a straight-line path between them:

Linear Interpolation Path xt = t · x₁ + (1 − t) · x₀    t ∈ [0, 1]. At t=0, pure noise. At t=1, clean action.

The velocity along this path is simply x₁ − x₀. We train a network vθ(xt, s, t) to predict this velocity:

Flow Matching Training Loss ℒ(θ) = 𝔼t~U[0,1], x₁~data, x₀~𝒩[ ‖ vθ(xt, s, t) − (x₁ − x₀) ‖² ]

At test time: start from pure noise x₀ ~ 𝒩(0, I). Integrate the learned velocity field with small Euler steps: xt+Δ = xt + vθ(xt, s, t) · Δ. After ~10–20 steps, you arrive at a clean action.

Why flow matching is cleaner

DDPM requires a carefully designed noise schedule (βt), a specific forward noising process, and a reverse SDE. Flow matching replaces all of that with a single straight-line interpolation. The training loss is simpler (no noise schedule), the sampling is an ODE (not SDE — deterministic and faster), and the framework generalizes to non-Gaussian sources. This is why π0, π0.7, and other modern robot foundation models use flow matching.

Systems in the Wild

Diffusion and flow-matching policies power nearly all state-of-the-art IL systems deployed today:

SystemYearPolicy TypeDomain
Diffusion Policy2023DDPM denoisingTabletop manipulation
π0 (Physical Intelligence)2024Flow matchingGeneral-purpose robotics
Octo / OpenVLA2024Autoregressive (VLM)Open-vocab robot control
GR00T (NVIDIA)2024Flow matchingHumanoid robots
Figure Helix2025Flow matchingHumanoid manipulation
Waymo EMMA2024Autoregressive (LLM)Autonomous driving
The pattern

For robotics (continuous, multimodal actions): flow matching / diffusion dominates. For driving and language-conditioned control: autoregressive VLMs are competitive. Both are forms of behavioral cloning with expressive output distributions — the core lesson of this chapter.

Chapter 06

Action Chunking

A robot running at 50 Hz makes a new decision every 20 milliseconds. But does it need to? Most actions unfold over 200–500 ms. A grasping motion doesn't change its mind every 20 ms — it commits to a trajectory and follows through. What if instead of predicting one action at a time, the policy predicted an entire chunk of future actions?

Definition
Action Chunking

Instead of predicting a single action at from state st, the policy predicts a chunk of k future actions: πθ(st) → (at, at+1, …, at+k−1). The robot executes the entire chunk before observing a new state. Typical chunk sizes: k = 4–16 steps.

Why It Works

Reason 1: Temporal consistency. Single-step predictions can jitter — at time t the policy says "go left" and at t+1 it says "go right." Chunking forces the policy to commit to a coherent motion over multiple timesteps. The result is smoother, more physically plausible actions.

Reason 2: More compute time. Diffusion and flow-matching policies need 10–20 denoising steps per prediction. If you predict 1 action per control step at 50 Hz, each denoising step must complete in 1 ms. If you predict a chunk of 8 actions, you have 160 ms total — 8 ms per denoising step. This is comfortably achievable on current hardware.

Chunk Execution in Practice

Robot running at 50 Hz with chunk size k = 8:

• t = 0: Observe state s₀. Run diffusion (takes ~100 ms). Get 8 actions.

• t = 0–7: Execute a₀, a₁, …, a₇ open-loop (160 ms).

• t = 8: Observe new state s₈. Run diffusion again. Get next 8 actions.

The policy only runs inference every 160 ms instead of every 20 ms — 8× fewer forward passes.

The trade-off

During open-loop execution, the robot doesn't react to disturbances. If something unexpected happens at t = 3, the robot won't notice until t = 8. Short chunks (k = 4) react faster but give less compute time. Long chunks (k = 16) are smoother but less reactive. Most systems use k = 8–16 with temporal ensembling: predict overlapping chunks and average them, so each action is informed by multiple observations.

Action chunking + diffusion = the modern IL recipe

Diffusion Policy (Chi et al., 2023) and ALOHA (Zhao et al., 2023) both introduced action chunking independently. π0, GR00T, and virtually every modern IL system uses it. The combination of flow-matching policies with action chunking is the current state of the art for robotic imitation learning.

Chapter 07

Compounding Errors

Even with the perfect output distribution, behavioral cloning has a deeper, more fundamental failure mode. Suppose your policy makes a tiny mistake at step 5 — it's 0.1° off from the expert's steering angle. No big deal, right?

Wrong. That tiny error puts the car in a state it never saw in training. The expert never drove 2 cm to the left of the lane center at that point. The policy has no idea what to do from this unfamiliar state — it was only trained on the expert's trajectory, not on states slightly off that trajectory.

So it makes a bigger mistake. Which puts it in an even more unfamiliar state. Which causes an even bigger mistake. The errors cascade, each one feeding the next, until the policy is so far from the training distribution that its outputs are essentially random.

The Compounding Error Problem

Small per-step errors accumulate over time. The policy drifts into states not covered by the training data, where it has no basis for making good decisions. Errors cascade. This is the fundamental limitation of behavioral cloning.

How Fast Do Errors Compound?

Let ε be the per-step error probability (chance the policy makes a mistake at any given step). Let T be the episode length.

Error Bound for Behavioral Cloning Total error ∼ O(ε · T²)

Quadratic in horizon! Double the episode length → 4x the total error.

Why T² and not T? Because each mistake doesn't just cost one step of error — it pushes you off the expert's trajectory, and every future step is now also wrong because you're in unfamiliar territory.

Worked Example — Compounding Numbers

Per-step error rate: ε = 0.01 (1% chance of mistake per step).

• T = 10 steps: total error ∼ 0.01 × 100 = 1.0 (manageable)

• T = 50 steps: total error ∼ 0.01 × 2500 = 25 (disaster)

• T = 100 steps: total error ∼ 0.01 × 10000 = 100 (complete failure)

Even with 99% per-step accuracy, a 100-step task is essentially guaranteed to fail catastrophically. This is why BC works for short tasks but collapses on long-horizon ones.

The Distribution Shift

Formally, the problem is a distribution mismatch. The policy is trained on states from the expert's distribution pexpert(st). At test time, the policy's own mistakes cause it to visit states from a different distribution pπ(st). The bigger the gap between these distributions, the worse the policy performs. BC has no mechanism to close this gap.

The Driving Analogy

Think of a student driver who only practiced on a driving simulator with scripted scenarios. On the real road, the first time another car cuts them off, they freeze — they've never seen this state. They swerve too hard, ending up in the oncoming lane. Now they've really never seen this state. BC is that student driver: good in familiar territory, helpless the moment anything goes slightly off-script.

We need a way to train the policy on states it actually visits, not just states the expert visited. Enter DAgger.

🔨 Derivation Compounding Error Bound: Why T² and Not T ✓ ATTEMPTED

BC achieves per-step error ε (probability of making a mistake at any timestep). Intuitively, with T steps, total error should be ε·T. But the actual bound is O(ε·T²).

Your task: Derive the T² bound rigorously. Show that at step t, the probability of being off-distribution is O(ε·t), and since the policy has no recovery data for off-distribution states, the expected cost from step t onward is O(T-t). Sum over all t to get T².

At each step, with probability ε, the policy makes a mistake. After t steps, the probability of EVER having made a mistake is bounded by t·ε (union bound). So P(off-distribution at step t) ≤ t·ε. Once off-distribution, the policy stays off because it has no recovery data.
Once the policy is off the expert's trajectory, every subsequent action is effectively random (no training data for these states). The expected cost for the remaining (T-t) steps is O(T-t). Each step incurs ~1 unit of error because the policy has no basis for correct actions.
Total expected error = Σt=1T P(first mistake at step t) · (cost from step t to T). Approximately: Σt=1T ε · (T-t) = ε · Σk=0T-1 k = ε · T(T-1)/2 = O(εT²).

Setup: Let ε = P(policy makes wrong action at step t | on expert's trajectory). Let Ct = cost incurred at step t.

Step 1: Define "off-trajectory" as having deviated at any prior step. P(off-trajectory at t) ≤ tε (union bound over t independent error events).

Step 2: When on-trajectory: E[Ct] = 0 (correct action). When off-trajectory: E[Ct] = c (some constant cost per step, since the policy is in unknown territory).

Step 3: Total expected cost:

E[Σt Ct] = Σt=1T P(off at t) · c ≤ c · Σt=1T tε = cε · T(T+1)/2 = O(εT²)

Alternative (tighter) derivation via Ross et al. 2011:

Let dtπ be the state distribution at step t under π. The performance difference lemma gives:

J(π*) - J(π) = Σt Es~dtπ[l(s)] where l(s) is the loss at state s.

For BC: l(s) = 0 if s ∈ supp(dtπ*), but l(s) = O(1) otherwise. Since dtπ drifts from dtπ* with rate ε per step, the TV distance ||dtπ - dtπ*|| ≤ tε. So E[l(s)] ≤ tε, and summing: J(π*) - J(π) ≤ Σt tε = O(εT²).

The key insight: It's T² (not T) because of the CASCADING nature: a mistake at step 5 doesn't just cost 1 step — it corrupts all future steps too. DAgger fixes this by training on the policy's own distribution, making the per-step error independent of position: E[Ct] = ε regardless of t, giving ΣCt = εT (linear).

Chapter 08

DAgger: Dataset Aggregation

The insight is elegant: if the problem is that BC trains on expert states but the policy visits different states, then train on the states the policy actually visits. But we still need expert actions — so we ask the expert to label those states.

Definition
DAgger (Dataset Aggregation)

An iterative algorithm that fixes the distribution mismatch of BC. In each round: (1) run the current policy, (2) record the states it visits, (3) ask the expert what action they would take in those states, (4) add these new (state, action) pairs to the training dataset, (5) retrain.

Algorithm: DAgger
  1. Initialize: Collect expert demonstrations 𝒟 = {(s, a)expert}
  2. Train policy πθ on 𝒟 via supervised learning
  3. Deploy πθ in the environment, recording visited states {s₁, s₂, …}
  4. Query expert: For each visited state s, get expert action a* = πexpert(s)
  5. Aggregate: 𝒟 ← 𝒟 ∪ {(s, a*)new}
  6. Go to step 2

Why This Fixes Compounding Errors

After several rounds, the training dataset contains expert actions for the states the learned policy actually visits — including the "mistake" states that BC never saw. The policy learns what to do when things go slightly wrong: how to recover.

Worked Example — DAgger for Driving

Round 1: Train BC on 100 expert driving demos. Policy drifts left on turns.

Round 2: Run the policy. It ends up 10 cm left of center on a curve. Expert labels: "steer right to correct." Add 50 new (state, correction) pairs. Retrain.

Round 3: Policy is better on curves but wobbles on straights. Expert labels: "steady." Add 50 more pairs. Retrain.

Round 5: Dataset now has expert actions for the full range of states the policy encounters — including recovery states. Error bound improves to O(ε · T) instead of O(ε · T²).

DAgger Error Bound Total error with DAgger ∼ O(ε · T)

Linear in horizon, not quadratic. Matching the expert's distribution eliminates the compounding.
The Catch: Expert Availability

DAgger requires the expert to label states on demand. The policy visits some weird state at 3 AM — who labels it? For simulated environments, you can query an oracle. For real robots with human experts, this is a serious logistical challenge. HG-DAgger addresses this.

The Key Shift

BC trains on pexpert(s). DAgger trains on pπ(s). This is the entire difference. By closing the distribution gap, DAgger eliminates quadratic error growth. It's the same supervised learning, just on the right data.

🔨 Derivation DAgger Regret Bound: O(εT) from Online Learning Theory ✓ ATTEMPTED

DAgger reduces BC's O(εT²) to O(εT). The proof uses online learning theory: DAgger is a no-regret algorithm over the sequence of data distributions d1π, d2π, ..., dNπ.

Your task: Show that after N rounds of DAgger, the average policy π̄ has expected cost at most ε + O(1/√N) per step, which gives total cost O(εT) as N → ∞. Why does training on the mixture distribution fix the cascading problem?

BC trains on dπ* (expert's state distribution). DAgger trains on dπ (policy's own state distribution). After N rounds, the aggregated dataset DN is drawn from (1/N)Σi dπi — a mixture of all past policy distributions. This converges to dπ̄ (the mixture policy's distribution).
DAgger is an instance of Follow the Leader (FTL) in online learning. The regret over N rounds is: Σi l(πi, dπi) - minπ Σi l(π, dπi) ≤ O(√N). Dividing by N: the average policy's loss on the average distribution is at most optimal + O(1/√N).

Step 1 (Setup): At each DAgger round i, we have policy πi and its induced state distribution dπi. We collect data from dπi and retrain on all accumulated data.

Step 2 (Performance decomposition): For the mixture policy π̄ = (1/N)Σiπi:

J(π̄) ≤ (1/N) Σi J(πi) (by Jensen's inequality on costs)

J(πi) = T · Es~dπi[l(πi, s)] where l is per-step loss.

Step 3 (No-regret): Since πi+1 minimizes loss on ALL data so far (best response to past distributions):

(1/N)Σi Edπi[l(πi)] ≤ minπ (1/N)Σi Edπi[l(π)] + O(1/√N)

Step 4 (Best in class): The best fixed policy in the class achieves Ed[l(π*)] = ε (classification error on the right distribution). So:

(1/N)Σi E[l(πi)] ≤ ε + O(1/√N)

Step 5 (Total cost): After enough rounds, J(π̄) ≤ T · (ε + O(1/√N)) ≈ εT

The key insight: DAgger converts the sequential decision problem into an i.i.d. supervised learning problem. By training on states the policy actually visits, the per-step error is ε REGARDLESS of what happened at previous steps. No cascading. The price: you need access to the expert for N rounds of labeling. The reduction from T² to T is the fundamental reason DAgger works.

Checkpoint — Before you move on
A colleague says "DAgger isn't needed — just collect more expert demonstrations and BC will work for long horizons." Explain why this is wrong, using the T² bound. How much data would BC need to match DAgger at T=100 steps?
✓ Gate cleared
Model Answer

More data reduces ε (per-step error) through better generalization, but it does NOT change the T² dependence. BC's bound is O(εT²), DAgger's is O(εT). For T=100: BC error ~ ε·10,000 while DAgger error ~ ε·100. To match DAgger with BC, you'd need to reduce ε by a factor of 100 (i.e., εBC = εDAgger/100). Since supervised learning error scales roughly as ε ~ 1/√N, you'd need 10,000x more demonstrations! And this gets worse for longer horizons. More data helps BC, but it can never overcome the quadratic penalty. The fundamental issue is that BC trains on the WRONG distribution (expert states), not that it has too little data on that distribution. DAgger fixes the distribution, which no amount of expert data can do.

Chapter 09

HG-DAgger: The Practical Version

Vanilla DAgger has a usability problem. Running the policy, recording all states, then asking an expert to retroactively label each one? Imagine watching 10,000 frames of a robot flailing and labeling the correct action for each. Exhausting and impractical.

HG-DAgger (Human-Gated DAgger) flips the interface: the expert watches the policy run in real time and intervenes only when needed.

Definition
HG-DAgger (Human-Gated DAgger)

A variant of DAgger where the expert observes the policy executing and takes over control only when the policy is about to make a serious mistake. The intervention trajectories are added to the training set.

Algorithm: HG-DAgger
  1. Train initial policy πθ on expert demonstrations
  2. Deploy πθ with expert watching
  3. Expert intervenes when the policy is about to fail — takes over control, demonstrates the correct behavior from the intervention point
  4. Aggregate: Add the intervention demonstrations to 𝒟
  5. Retrain πθ on augmented 𝒟
  6. Go to step 2

Why This Is Better in Practice

The expert only acts when it matters. If the policy is doing fine on a straight road, the expert sits back. When the policy starts drifting toward a curb, the expert grabs the wheel and demonstrates the recovery. The expert's time is spent on the hard cases — exactly the states where the policy needs the most help.

Worked Example — HG-DAgger for Robot Manipulation

Task: Pick up a cup. Initial BC policy succeeds 60% of the time.

Session 1: Expert watches 20 attempts. Policy fails on off-center cups. Expert intervenes 8 times, demonstrating approach corrections. 8 new demos added. Retrain → 75% success.

Session 2: Expert watches 20 attempts. Policy now fails on tilted cups. Expert intervenes 5 times. Retrain → 85% success.

Session 3: 3 interventions for edge cases. Retrain → 92% success.

Total expert time: ~1 hour across 3 sessions, compared to the 5+ hours of continuous demonstration that vanilla DAgger would require.

The Design Principle

HG-DAgger turns imitation learning into an iterative refinement process. The expert is a teacher who lets the student try, watches for mistakes, and corrects only when necessary. Each round of corrections targets the policy's current failure modes, making the data maximally informative.

Expert Quality Matters

The expert must be good at monitoring AND intervening. Some experts are great demonstrators but poor monitors — they intervene too late or too early. Too-late interventions miss the critical recovery states. Too-early interventions waste expert time on states the policy would have handled.

DAgger vs. HG-DAgger: When to Use Which

Vanilla DAgger is better when you have a scripted expert (a simulator oracle that can label any state instantly) or when you need theoretical guarantees about convergence.

HG-DAgger is better in the real world with human experts, where labeling time is expensive and the expert's attention is the bottleneck. Most real robotics experiments use some variant of HG-DAgger because it respects the human's time and cognitive load.

Chapter 10

Collecting Demonstrations

We've assumed demonstrations exist. But where do they actually come from? The answer depends heavily on the domain, and some methods are dramatically harder than others.

Natural Data Sources

Some domains generate demonstrations as a byproduct of normal activity:

Self-driving: Every human driver generates (camera image, steering angle) pairs all day long. Tesla collects billions of frames this way.

Language: The entire internet is a demonstration of text generation. Every sentence is an expert demonstration of "what to write next."

Game playing: Replay databases (StarCraft, Go, Chess) contain millions of expert games.

Scale Advantage

When demonstrations come for free, you can collect millions. This is why behavioral cloning works so well for self-driving and language — the sheer data volume overwhelms the compounding error problem for many practical scenarios.

Robotics: The Hard Case

For robotic manipulation, demonstrations don't come naturally. Someone has to actively teach the robot. Three main approaches:

MethodHow It WorksProsCons
Kinesthetic teachingPhysically guide the robot arm by handIntuitive, directAwkward for complex motions, limited to compliant robots
TeleoperationControl robot remotely via joystick/VR controllerNatural interface, works for any robotLatency, requires practice to master
PuppeteeringUse a second identical robot arm as input device1:1 mapping, very preciseRequires double the hardware cost
Real Numbers

The Diffusion Policy paper (Chi et al., 2023) collected about 100–200 demonstrations per task via teleoperation. Each demo took ~30 seconds. Total data collection: about 1–2 hours per task. This is typical for state-of-the-art robotic IL — it's not millions of demos, it's hundreds.

Videos of Humans and Animals

YouTube is full of humans doing tasks. Can we learn from watching humans?

The Embodiment Gap

A human hand has 27 degrees of freedom with soft, deformable fingers and tactile sensing. A robot gripper has 1–7 DOF with rigid pincers and maybe a force sensor. Watching a human fold a towel doesn't directly tell a robot how to fold a towel — the bodies are too different. This is the embodiment gap. Bridging it is an active research frontier.

Current approaches extract high-level intentions from human videos (what to grasp, where to place) while leaving the low-level motor control to RL or a separate controller. Full "watch a YouTube video and learn the task" remains an open problem.

The Data Quality Hierarchy

Not all demonstrations are equal. From best to worst: (1) same robot, same task, expert operator; (2) same robot, similar task, decent operator; (3) different robot, same task; (4) human video of the task. Each step down requires more sophisticated transfer methods. Most practical systems operate at level 1 or 2.

Chapter 11

Imitation + RL

Imitation learning and reinforcement learning aren't rivals — they're complements. Each has strengths the other lacks:

PropertyImitation LearningReinforcement Learning
Needs reward function?NoYes
Needs demonstrations?YesNo
Can surpass demonstrator?No (BC) / Maybe (DAgger)Yes
Self-improvement?No framework for practiceYes, by definition
Data efficiency?Good (supervised learning)Poor (exploration needed)
Long-horizon tasks?Compounding errorsReward signal guides recovery
The Winning Combination

Use imitation learning to initialize the policy (start from a reasonable baseline), then use RL to improve beyond the demonstrator. IL gives you a warm start; RL gives you self-improvement. This is the recipe behind many successful real-world systems.

Common Patterns

Pattern 1: IL then RL. Pre-train via BC on expert demos, then fine-tune with PPO or SAC using a reward function. The BC initialization means RL starts from sensible behavior instead of random flailing — dramatically faster convergence. Think of it as teaching someone the basics before sending them to practice on their own.

Pattern 2: IL + RL reward shaping. Use the distance from expert behavior as an auxiliary reward signal. The agent gets reward both for task success (RL objective) and for staying close to expert demonstrations (IL regularizer). This keeps the policy grounded while still allowing improvement beyond the demonstrator.

Pattern 3: RLHF (RL from Human Feedback). The recipe behind ChatGPT and Claude. Pre-train a language model on text (BC on internet data), then fine-tune with RL using a reward model trained on human preferences. Imitation learning provides the foundation; RL aligns the behavior.

The ChatGPT Recipe

Step 1 (IL): Pre-train on internet text. This is behavioral cloning on the largest dataset ever assembled.

Step 2 (IL): Fine-tune on curated (prompt, response) pairs. Still BC, but higher quality demos.

Step 3 (RL): PPO with a reward model trained on human rankings. The model goes beyond imitating — it optimizes for human preference.

Without step 1, RL from scratch on language would be intractable. Without step 3, you get a talented mimic but not an aligned assistant.

IL Alone Has No Mechanism for Practice

A pianist who only watches performances will never improve beyond mimicry. They must practice — try, fail, get feedback, adjust. BC has no equivalent. DAgger gets closer (the expert provides ongoing feedback), but only RL provides the full loop: try things the expert never did, discover something better, and keep it.

🔨 Derivation GAIL = Moment Matching via Adversarial Training ✓ ATTEMPTED

Generative Adversarial Imitation Learning (GAIL, Ho & Ermon 2016) avoids the reward-engineering problem by training a discriminator to distinguish expert from policy state-action pairs, then using the discriminator's output as a reward signal for RL.

Your task: Show that minimizing the Jensen-Shannon divergence between the occupancy measures ρπ and ρπ* (via a discriminator) is equivalent to moment matching — making the policy's state-action distribution indistinguishable from the expert's.

ρπ(s,a) = (1-γ) Σt=0 γt P(st=s, at=a | π). It's the discounted frequency of visiting (s,a) under policy π. Two policies with the same occupancy measure have the same expected return for ANY reward function. So matching occupancy measures = matching behavior.
The GAIL objective is: minπ maxD Eρπ[log D(s,a)] + Eρπ*[log(1-D(s,a))]. This is exactly the GAN objective with ρπ as the generator distribution and ρπ* as the real data distribution. At optimality, D*(s,a) = ρπ* / (ρπ + ρπ*), and the overall min over π minimizes JS(ρπ || ρπ*).

Step 1: GAIL trains: D(s,a) = discriminator, π = policy (generator).

Discriminator objective: maxD E(s,a)~ρπ*[log D(s,a)] + E(s,a)~ρπ[log(1-D(s,a))]

Step 2: Optimal discriminator: D*(s,a) = ρπ*(s,a) / (ρπ(s,a) + ρπ*(s,a))

Step 3: With D = D*, the policy optimization becomes:

minπ JS(ρπ || ρπ*) = (1/2)KL(ρπ||m) + (1/2)KL(ρπ*||m)

where m = (ρπ + ρπ*)/2.

Step 4: The reward for RL is r(s,a) = -log(1 - D(s,a)) (or log D(s,a)). The policy is trained with any RL algorithm (TRPO, PPO) using this learned reward. When D can't distinguish expert from policy (D ≈ 0.5), the policy has matched the expert's occupancy measure.

Step 5 (Moment matching): ρπ = ρπ* implies Eρπ[f(s,a)] = Eρπ*[f(s,a)] for ALL functions f. This is moment matching — every statistic of the policy's behavior matches the expert. This is stronger than BC (which only matches the conditional π(a|s) pointwise).

The key insight: GAIL converts IL into an adversarial game. The discriminator automatically discovers which features of behavior matter (the "reward function" emerges from training). BC matches actions at expert states; GAIL matches the entire state-action distribution. This makes it robust to compounding error because the policy's own states are part of the optimization.

💥 Break-It Lab What Dies When You Break IL Components? ✓ ATTEMPTED
Imitation learning has multiple failure modes that activate under different conditions. Toggle each pathology to understand when and why IL breaks.
Distribution shift (BC on long horizon, T=200) NORMAL
Failure mode: The policy makes a small mistake at step 10, entering a state never seen in training. From there, every subsequent action is nearly random. By step 50, the robot is in a completely unfamiliar configuration. By step 100, it has crashed or frozen. Symptom: performance degrades quadratically with episode length. Short tasks (T<20) work fine; long tasks (T>50) fail catastrophically. Fix: DAgger (train on policy-visited states) or keep T short with action chunking.
Wrong loss function (MSE on discrete/multimodal actions) NORMAL
Failure mode: The policy outputs the MEAN of multiple valid actions, which is itself an invalid action. For a self-driving task: left and right around an obstacle both have probability 0.5 in the data, but the MSE-optimal prediction is "straight ahead" — into the obstacle. The policy looks confident (low loss on training set) but produces catastrophic actions in the multimodal states. Symptom: low training loss but terrible real-world performance. Fix: use cross-entropy (discrete), mixture models, or diffusion policies.
Covariate shift with no correction (deploy in new environment) NORMAL
Failure mode: Policy trained on demonstrations in environment A (well-lit, clean floor) is deployed in environment B (dim lighting, textured floor). The observations are outside the training distribution. The policy has no mechanism to detect it's seeing OOD inputs and produces nonsensical actions with high confidence. Symptom: works perfectly in training environment, fails immediately in any different setting. Fix: domain randomization during data collection, or train an OOD detector that triggers a safe fallback behavior.
🏗 Design Challenge You're the Architect: IL for Autonomous Driving ✓ ATTEMPTED
You're building an autonomous driving system using imitation learning. You have 1000 hours of human driving data (cameras, LiDAR, steering/throttle/brake at 10 Hz). The system must handle edge cases: construction zones, accidents, aggressive drivers, animals on road. You cannot do online testing until you pass simulation benchmarks.
Data
1000 hours (36M frames), 95% highway, 5% edge cases
Observation
8 cameras + LiDAR point cloud + HD map
Action
Steering angle, acceleration, lane change (mixed continuous/discrete)
Horizon
Planning: 5s ahead, episode: hours
Safety
Zero tolerance for collision in 99.99% of scenarios
Edge Cases
Must handle construction, accidents, jaywalkers, road debris
1. 95% of data is boring highway driving. Only 5% contains the hard cases. How do you prevent the policy from being overwhelmed by the easy majority? (Hint: sampling strategies, data augmentation.)
2. Compounding errors over hours of driving would be catastrophic. How do you handle the long horizon? (Hint: hierarchical policies, action chunking, closed-loop vs open-loop.)
3. Some edge cases are SO rare (animal on highway) that you might have 0 examples in 1000 hours. How do you handle these?
4. The policy must be multimodal (lane change left vs. right are both valid). What output distribution do you use?
5. What's your safety layer? The IL policy WILL make mistakes. How do you prevent them from being fatal?

Real-world approaches (public knowledge from Tesla AI Day, Waymo papers):

1. Data imbalance: Mine hard examples aggressively. Tesla's "shadow mode" runs the policy in background on fleet vehicles, flags disagreements between policy and human, and collects those scenarios preferentially. Result: 100x oversampling of edge cases. Also: data augmentation (synthetic rain, night, occlusion) to inflate rare conditions.

2. Long horizon: Hierarchical decomposition: (a) Route planner (minutes ahead, graph search), (b) Behavior planner (seconds ahead, learned IL policy decides lane changes/speed), (c) Motion controller (milliseconds, PID/MPC tracks the behavior plan). IL operates at layer (b) with 2-5 second horizon. Action chunking: predict 1-2 seconds of waypoints, replan at 10 Hz. This keeps the effective horizon short (~20 steps).

3. Ultra-rare events: Simulation. Build a photorealistic simulator and procedurally generate edge cases: animal crossings, debris, construction. Train a separate "emergency" policy on simulated data. At runtime: a classifier detects "unusual scenario" and switches to the emergency policy. Also: RL fine-tuning in simulation for collision avoidance.

4. Multimodality: Tesla uses a learned trajectory generation model (similar to diffusion policy) that outputs multiple candidate trajectories. A separate scoring model picks the best one considering safety, comfort, and traffic rules. This avoids the averaging trap entirely.

5. Safety layer: Multiple levels: (a) Rule-based collision checker (physics-based time-to-collision), (b) Learned safety critic that predicts P(collision | action), (c) Hardware emergency brake (AEB) as final fallback. The IL policy's output is always filtered through safety checks before execution. If ANY check fails, the system executes a minimal-risk condition (slow to stop, stay in lane).

Key lesson: No production AV system relies on pure IL. It's always IL + safety systems + hierarchical planning + simulation for edge cases + fleet learning for data. The IL component handles the "normal driving" 99% efficiently; everything else handles the 1% safely.

⚔ Adversarial: The Subtle Failure of Cross-Entropy on Continuous Actions
Your colleague discretizes the steering angle into 100 bins and uses cross-entropy loss for BC. They report 95% top-1 accuracy on the validation set. But the robot drifts off-road within 30 seconds. Why?
🔗 Pattern Recognition
IL + Reward = Offline RL
Imitation Learning (This Lesson)
π ← argmin ED[l(π(s), aexpert)]
No reward function. Clone expert.
Cannot surpass demonstrator.
Offline RL
π ← argmax ED[Q(s,a) - constraint]
Has reward. Improves beyond data.
Can surpass demonstrator. → Offline RL lesson

BC is offline RL with the constraint turned to maximum (no improvement allowed). Filtered BC is halfway — it imitates only the good data. AWR/IQL complete the spectrum: imitate the data but weight by quality. The entire progression from BC to offline RL is one dial: how much do you trust the advantage estimates to allow deviation from the data?

If you gave BC access to a reward function but no environment interaction, what method would you get? (Answer: filtered BC or AWR — rank demonstrations by return, weight accordingly.)

🔗 Pattern Recognition
Modern VLAs = Imitation Learning at Scale
Classical IL (This Lesson)
Small dataset (100 demos). Single task.
BC with MLP/CNN policy.
Compounding errors limit horizon.
Vision-Language-Action Models
Massive dataset (1M+ episodes, 100+ tasks).
BC with transformer + diffusion policy.
Foundation model generalization fights compounding. → VLA lesson

RT-2, Octo, π0, OpenVLA, GR00T — these are all imitation learning systems. The core algorithm is still BC (minimize action prediction error on demonstrations). What changed: (1) scale — millions of demonstrations across hundreds of tasks, (2) architecture — pre-trained vision-language backbones provide generalizable features, (3) output — diffusion/flow-matching policies handle multimodality. The compounding error problem is mitigated (not solved) by having such broad training distribution that the policy rarely encounters truly OOD states.

What would it take for a VLA to truly solve compounding errors? (Hint: online fine-tuning = DAgger at scale. This is the RLHF equivalent for robotics.)

💻 Build It Implement the DAgger Training Loop ✓ ATTEMPTED
Implement DAgger's core loop: roll out the current policy, collect expert labels on visited states, aggregate data, retrain. The key insight: the training distribution changes each round.
signature def dagger(env, expert_policy, policy, n_rounds=10, n_episodes=5, n_epochs=50): """ Run DAgger for n_rounds. Args: env: gym-like environment with reset() and step() expert_policy: function(state) -> action (oracle) policy: learnable policy with .predict(s) and .train(dataset) n_rounds: number of DAgger iterations n_episodes: episodes to collect per round n_epochs: training epochs per round Returns: dataset: aggregated (states, actions) dataset """
Test case
After 5 rounds on CartPole with a perfect expert: policy success rate should increase from ~20% (BC) to ~90%+ (DAgger). Dataset grows linearly: round 1 has N states, round 5 has 5N states.
The policy executes actions, but the EXPERT labels the states. So you roll out with policy.predict(s), but store (s, expert_policy(s)). This means the training data has states from the policy's distribution but actions from the expert. That's the distribution-matching magic.
def dagger(env, expert_policy, policy, n_rounds=10, n_episodes=5, n_epochs=50):
    # Step 0: Seed with expert demonstrations
    dataset = {'states': [], 'actions': []}
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = expert_policy(s)
            dataset['states'].append(s)
            dataset['actions'].append(a)
            s, _, done, _ = env.step(a)

    # DAgger rounds
    for round_i in range(n_rounds):
        # Step 1: Train policy on ALL accumulated data
        policy.train(dataset, epochs=n_epochs)

        # Step 2: Roll out CURRENT policy
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a_policy = policy.predict(s)  # policy's action

                # Step 3: Expert labels this state
                a_expert = expert_policy(s)   # what expert WOULD do

                # Step 4: Add to dataset
                dataset['states'].append(s)
                dataset['actions'].append(a_expert)  # expert action!

                # Execute policy's action (not expert's)
                s, _, done, _ = env.step(a_policy)

    return dataset
Bonus: Implement a β-scheduled variant where round i uses a mixed policy: πmix = βi πexpert + (1-βi) πlearned for rollouts, with β decaying from 1 to 0. This is the "safe DAgger" variant that avoids catastrophic early rollouts when the learned policy is still terrible.
Chapter 12

Full Summary & Cheat Sheet

The Methods at a Glance

MethodTypeKey IdeaError Scaling
Behavioral CloningOfflineSupervised regression on demosO(εT²)
BC + Expressive PolicyOfflineDiffusion/MoG output avoids averagingO(εT²)
DAggerOnlineTrain on policy-visited statesO(εT)
HG-DAggerOnlineExpert intervenes only when neededO(εT)
IL + RLBothIL initializes, RL improvesCan surpass expert

Policy Expressivity Cheat Sheet

Output TypeMultimodal?Continuous?Expressivity
Categorical (softmax)YesNo (discrete only)Maximal for discrete
Gaussian (μ, σ)No (unimodal)YesLow
Mixture of GaussiansYes (K modes)YesMedium (must choose K)
AutoregressiveYesDiscretizedHigh
DiffusionYes (arbitrary)YesHighest

Decision Flowchart

Do you have demonstrations?

• No → You need RL. Imitation learning requires demos by definition.

• Yes → Is the task short-horizon (<50 steps)?

  • Yes → BC with diffusion policy is likely sufficient.

  • No → Can you interact with the environment?

    • Yes + expert available → DAgger / HG-DAgger.

    • Yes + reward available → BC pre-train, then RL fine-tune.

    • No → BC with massive data and expressive policy. Hope for the best.

Key Equations

Behavioral Cloningminθ Σ(s,a) ∈ 𝒟 ‖ a − πθ(s) ‖²
BC Error Boundεtotal ∼ O(ε · T²)    (quadratic — compounding errors)
DAgger Error Boundεtotal ∼ O(ε · T)    (linear — distribution matching)
Diffusion Policy Lossℒ(θ) = 𝔼[ ‖ ε − ε̂θ(an, s, n) ‖² ]

Common Failure Modes

FailureSymptomFix
AveragingPolicy outputs mean of multimodal dataExpressive policy (diffusion, MoG)
CompoundingWorks for 10 steps, diverges by 50DAgger / HG-DAgger
Distribution shiftPolicy visits unseen statesTrain on policy-visited states
Embodiment gapCan't transfer from human videoTask-level (not motion-level) transfer
Insufficient dataPolicy memorizes, doesn't generalizeMore demos, data augmentation

The Big Picture

The One Sentence

Imitation learning = supervised learning on expert demonstrations, where the main enemies are multimodal actions (solved by expressive policies) and compounding errors (solved by DAgger). Combine with RL to go beyond imitation.

What Comes Next

Actor-Critic methods combine a learned policy (actor) with a learned value function (critic), enabling efficient RL fine-tuning on top of IL pre-training. PPO, the algorithm behind RLHF, builds directly on the policy gradient + KL constraint ideas from the previous lecture, applied after an imitation learning warm-start.

Looking Ahead
The Modern Recipe

Pre-train with imitation (massive data, supervised learning). Fine-tune with RL (reward signal, self-improvement). Constrain with KL (don't drift too far from pre-training). This pattern repeats across language models, robot policies, and game agents. Master imitation learning, and you've mastered the foundation.