CS224R — Imitation Learning: The Complete Guide

Roadmap

What You'll Master

01The Problem 02Behavioral Cloning 03The Averaging Trap 04Expressive Distributions 05Diffusion Policies 06Action Chunking 07Compounding Errors 08DAgger 09HG-DAgger 10Collecting Demonstrations 11Imitation + RL 12Summary & Cheat Sheet

Chapter 01

The Problem: Teaching by Showing

Imagine you're teaching a child to tie their shoes. You don't write down a 47-step motor-control program specifying joint torques for each finger. You show them. You tie the shoe, they watch, they try, they fumble, you show them again. Eventually they get it.

Imitation learning brings this idea to robots and AI agents. Instead of designing a reward function and hoping the agent discovers good behavior through trial and error, you demonstrate the desired behavior and let the agent learn by copying you.

Definition

Imitation Learning

Learning a policy π_θ(a|s) from expert demonstrations, without access to a reward function. The demonstrations are the entire training signal.

Quick terminology check from the MDP/RL framework:

Symbol	Meaning	Example (Self-Driving)
s_t	State at time t	Car position, velocity, all objects
o_t	Observation at time t	Camera image (partial view of state)
a_t	Action at time t	Steering angle, acceleration
τ	Trajectory (s₁,a₁,...,s_T,a_T)	One full driving episode
π_θ(a\|s)	Policy (learned strategy)	Neural net: image → steering command

Why not just use RL?

RL requires a reward function. For many tasks — folding laundry, cooking an egg, suturing tissue — specifying a reward function is harder than just doing the task yourself. Imitation learning sidesteps this entirely: you show the robot what "success" looks like, and it learns to replicate it.

The dataset for imitation learning is a collection of expert demonstrations:

Expert Dataset 𝒟 = {(s₁, a₁, s₂, a₂, …, s_T, a_T)₁, …, (s₁, a₁, …, s_T, a_T)_N}

N expert trajectories. Each one: a recording of states visited and actions taken by a skilled demonstrator.

The question this entire lecture answers: how do we turn these demonstrations into a policy that works?

The Spectrum of Approaches

At one extreme: hand-code every behavior (classical robotics). At the other: pure RL from scratch (agent learns everything by trial and error). Imitation learning sits in the sweet spot — leverage human expertise without the brittleness of hand-coding or the sample complexity of pure RL.

Chapter 02

Behavioral Cloning

The simplest possible approach. You have (state, action) pairs from the expert. You want a neural network that maps states to actions. This is… supervised learning. Literally just regression.

Definition

Behavioral Cloning (BC)

Train a policy by supervised learning on expert demonstrations. Given a state, predict the expert's action. That's it.

Behavioral Cloning Objective min_θ Σ_{(s,a) ∈ 𝒟} ‖ a − π_θ(s) ‖²

Minimize the squared difference between expert actions and predicted actions. Standard L2 regression.

Unpack each piece:

• π_θ(s) — your neural network takes state s as input, outputs a predicted action â.

• a — the expert's actual action at that state. This is your ground truth label.

• ‖·‖² — squared Euclidean distance. Penalizes any deviation from the expert.

• min_θ — adjust network weights via gradient descent to minimize the total prediction error across all demos.

If you've ever done supervised regression in machine learning, you already know how to do behavioral cloning. The state is the input feature, the action is the label, and the loss is MSE. That's literally it.

Worked Example — Self-Driving BC

State: camera image showing a road curving left. Expert steers at −12° (left). Your network predicts −10°.

Loss = (−12 − (−10))² = (−2)² = 4.

Gradient pushes θ to make the prediction closer to −12°. After thousands of examples and gradient steps, the network learns to imitate the expert's steering across many road conditions.

BC = Offline & Simple

No simulator needed. No reward function needed. No interaction with the environment. You collect demos once, train overnight, and deploy. This is why BC is by far the most common starting point in real robotics — it works well enough to get a first prototype running.

The Hidden Assumption

BC assumes there's a single correct action for every state. For many tasks (following a trajectory, stabilizing a drone), this is approximately true. But for tasks with multiple valid strategies, this assumption will betray you spectacularly.

But there is a devastating failure mode hiding in that innocent L2 loss…

Chapter 03

The Averaging Trap

You're training a self-driving car. There's a large obstacle in the middle of the road. In your dataset, some demonstrators swerve left. Others swerve right. Both are perfectly valid.

What does L2 regression predict? The mean. Go straight. Into the obstacle.

The Averaging Problem

When the data is multimodal — multiple valid actions for the same state — minimizing squared error produces the average of those actions. The average of two good actions can be a catastrophically bad action.

Why L2 Produces the Mean

Minimizing E[‖a − â‖²] over â has a closed-form solution: â* = E[a]. The optimal prediction under L2 loss is always the conditional mean of the data. This is a fact from basic statistics, not a choice.

Worked Example — The Numbers

State: obstacle ahead. 5 expert demos at this state:

• Demo 1: steer −20° (left) • Demo 2: steer −18° (left) • Demo 3: steer +22° (right) • Demo 4: steer +19° (right) • Demo 5: steer −21° (left)

L2-optimal prediction: (−20 − 18 + 22 + 19 − 21) / 5 = −3.6°

Nearly straight ahead. None of the experts ever steered at −3.6° here. The "optimal" prediction is a disaster — you crash into the obstacle.

This isn't a rare edge case. It happens all the time:

• Multiple demonstrators with slightly different styles

• Tasks where multiple strategies are valid (reaching around an object from either side)

• Any situation where the action distribution is not unimodal

The Core Insight

The problem is not the neural network being too small or the training being insufficient. A perfect neural network optimized under L2 loss will still predict the mean. The problem is the loss function forcing a single-point prediction for an inherently multi-valued answer.

When Does BC Work Despite This?

When the data is unimodal at every state — there's essentially one right answer for each situation. A single expert with consistent behavior is ideal. Multiple experts with very different styles is dangerous.

Formal Statement

If the conditional distribution p(a|s) is unimodal for every s in the dataset, then L2 regression recovers the correct action. The averaging problem only appears when p(a|s) has multiple separated modes. The severity scales with the number of modes and the distance between them.

Chapter 04

Expressive Policy Distributions

The fix: instead of predicting a single action, model a distribution over actions. The policy should be able to say "there's a 60% chance we should go left and a 40% chance we should go right" — and never output "go straight."

Option 1: Discrete Actions

If actions are discrete (left, right, up, down), the solution is trivial. Output a categorical distribution via softmax:

Discrete Policy π_θ(a|s) = softmax(f_θ(s))
e.g., p(left) = 0.6, p(right) = 0.4, p(straight) = 0.0

This is maximally expressive — it can represent any distribution over the discrete action set. No averaging problem. The policy can assign zero probability to "go straight" while putting mass on both left and right.

Option 2: Gaussian (Continuous)

For continuous actions (like steering angle, joint torques, or end-effector velocities), the naive approach is a Gaussian: the network outputs a mean μ and a standard deviation σ.

Gaussian Policy π_θ(a|s) = 𝒩(a; μ_θ(s), σ_θ(s)²)

Still Unimodal!

A Gaussian has one peak. It can be wide or narrow, but it always has a single mode. For the obstacle scenario: the Gaussian centers at the mean (−3.6°) with large variance. It assigns highest probability to going straight — exactly what we wanted to avoid.

Option 3: Mixture of Gaussians

A natural extension: output multiple Gaussians and mix them with learned weights:

Mixture of Gaussians π_θ(a|s) = Σ_k=1^K w_k · 𝒩(a; μ_k, σ_k²)

Network outputs: μ₁,σ₁,w₁, μ₂,σ₂,w₂, … μ_K,σ_K,w_K

Now we can have two modes: one peaked around −20° (left) and one around +20° (right). The mix weight w controls how much probability goes to each mode.

Option 4: Discretize + Autoregressive

For a multi-dimensional continuous action vector (a₁, a₂, a₃…), we can discretize each dimension into bins and model them sequentially, one dimension at a time:

Autoregressive Factorization π_θ(a|s) = p(a₁|s) · p(a₂|s, â₁) · p(a₃|s, â₁, â₂) …

Each factor is a categorical over discretized bins. â = sampled value from previous step.

This is how language models work: p(next word | previous words). Same idea, applied to action dimensions. Each conditional is a simple categorical distribution (maximally expressive for that dimension), and the chain rule ensures the joint distribution can be arbitrarily complex.

Network Expressivity ≠ Distribution Expressivity

A critical distinction. Making the neural network bigger (more layers, more parameters) does NOT fix the averaging problem if the output head is still a single Gaussian. A 100-billion-parameter network with a Gaussian output is still unimodal. The bottleneck is the output representation, not the network capacity.

Concrete Comparison

Obstacle scenario. Expert data: 60% steer left (−20°), 40% steer right (+20°).

• Gaussian: peak at −4°. Probability of ±20° is low. Fails.

• MoG (K=2): mode at −20° (w=0.6), mode at +20° (w=0.4). Works!

• Autoregressive: discretize to 1° bins, models exact histogram. Works!

• Diffusion: samples cluster at −20° and +20° naturally. Works!

Option 5: Diffusion Models

The most powerful option. Model the action distribution as a denoising diffusion process. Start from noise, iteratively refine into a clean action. Can represent any distribution, including complex multimodal ones. This is so important it gets its own chapter.

Chapter 05

Diffusion Policies

You know how image diffusion models (Stable Diffusion, DALL-E) generate images by starting from pure noise and iteratively denoising? Same idea, but instead of generating pixels, we generate actions.

Definition

Diffusion Policy

A policy that generates actions through iterative denoising. Starting from random Gaussian noise a_N, the network predicts and removes noise over N steps to produce a clean action a₀. The conditioning input is the current state/observation.

How It Works

The key connection: in image diffusion, you model p(image | text). In diffusion policy, you model p(action | observation). Same math, different domain. The observation (camera image, robot joint angles) plays the role of the text prompt.

Training: Take an expert action a₀. Add noise to get a_n = a₀ + ε (where n indexes the noise level). Train the network to predict the noise: ε̂_θ(a_n, s, n) ≈ ε.

Diffusion Training Loss ℒ(θ) = 𝔼_n,ε[ ‖ ε − ε̂_θ(a_n, s, n) ‖² ]

Predict the noise that was added at noise level n, conditioned on state s.

Inference: Start with random noise a_N ~ 𝒩(0, I). Iteratively denoise: at each step, predict the noise and subtract a fraction of it. After N steps, you have a clean action.

Obstacle Avoidance with Diffusion

State: obstacle ahead. 100 denoising runs produce:

• ~60 runs converge to −20° (left cluster)

• ~40 runs converge to +20° (right cluster)

• ~0 runs converge to 0° (straight)

The diffusion process naturally discovers modes without being told how many there are. Different random starting noises converge to different modes. The valley between modes (going straight) is avoided because no training data lives there.

Why Diffusion Beats Everything Else

Mixture of Gaussians requires choosing K (number of modes) in advance. Autoregressive models accumulate discretization error. Diffusion models can represent arbitrary distributions with arbitrary numbers of modes, automatically. The denoising process is a universal distribution approximator. This is why "Diffusion Policy" (Chi et al., 2023) became state-of-the-art for robotic manipulation.

The Cost: Inference Speed

Diffusion policies require N denoising steps at inference time (typically 10–100). Each step is a forward pass through the network. This is 10–100x slower than a single-pass policy. For a robot running at 10 Hz control, you need fast denoising — techniques like DDIM or consistency models help reduce N to as few as 1–4 steps.

Flow Matching: The Modern View

The noise-prediction formulation above (DDPM-style) works, but modern systems like π0 (Physical Intelligence, 2024) use a cleaner framework called flow matching. The idea: instead of learning to predict noise, learn a velocity field that smoothly transforms Gaussian noise into expert actions.

Start with two things: a data point x₁ (an expert action) and a noise sample x₀ ~ 𝒩(0, I). Define a straight-line path between them:

Linear Interpolation Path x_t = t · x₁ + (1 − t) · x₀ t ∈ [0, 1]. At t=0, pure noise. At t=1, clean action.

The velocity along this path is simply x₁ − x₀. We train a network v_θ(x_t, s, t) to predict this velocity:

Flow Matching Training Loss ℒ(θ) = 𝔼_{t~U[0,1], x₁~data, x₀~𝒩}[ ‖ v_θ(x_t, s, t) − (x₁ − x₀) ‖² ]

At test time: start from pure noise x₀ ~ 𝒩(0, I). Integrate the learned velocity field with small Euler steps: x_t+Δ = x_t + v_θ(x_t, s, t) · Δ. After ~10–20 steps, you arrive at a clean action.

Why flow matching is cleaner

DDPM requires a carefully designed noise schedule (β_t), a specific forward noising process, and a reverse SDE. Flow matching replaces all of that with a single straight-line interpolation. The training loss is simpler (no noise schedule), the sampling is an ODE (not SDE — deterministic and faster), and the framework generalizes to non-Gaussian sources. This is why π0, π0.7, and other modern robot foundation models use flow matching.

Systems in the Wild

Diffusion and flow-matching policies power nearly all state-of-the-art IL systems deployed today:

System	Year	Policy Type	Domain
Diffusion Policy	2023	DDPM denoising	Tabletop manipulation
π0 (Physical Intelligence)	2024	Flow matching	General-purpose robotics
Octo / OpenVLA	2024	Autoregressive (VLM)	Open-vocab robot control
GR00T (NVIDIA)	2024	Flow matching	Humanoid robots
Figure Helix	2025	Flow matching	Humanoid manipulation
Waymo EMMA	2024	Autoregressive (LLM)	Autonomous driving

The pattern

For robotics (continuous, multimodal actions): flow matching / diffusion dominates. For driving and language-conditioned control: autoregressive VLMs are competitive. Both are forms of behavioral cloning with expressive output distributions — the core lesson of this chapter.

Chapter 06

Action Chunking

A robot running at 50 Hz makes a new decision every 20 milliseconds. But does it need to? Most actions unfold over 200–500 ms. A grasping motion doesn't change its mind every 20 ms — it commits to a trajectory and follows through. What if instead of predicting one action at a time, the policy predicted an entire chunk of future actions?

Definition

Action Chunking

Instead of predicting a single action a_t from state s_t, the policy predicts a chunk of k future actions: π_θ(s_t) → (a_t, a_t+1, …, a_t+k−1). The robot executes the entire chunk before observing a new state. Typical chunk sizes: k = 4–16 steps.

Why It Works

Reason 1: Temporal consistency. Single-step predictions can jitter — at time t the policy says "go left" and at t+1 it says "go right." Chunking forces the policy to commit to a coherent motion over multiple timesteps. The result is smoother, more physically plausible actions.

Reason 2: More compute time. Diffusion and flow-matching policies need 10–20 denoising steps per prediction. If you predict 1 action per control step at 50 Hz, each denoising step must complete in 1 ms. If you predict a chunk of 8 actions, you have 160 ms total — 8 ms per denoising step. This is comfortably achievable on current hardware.

Chunk Execution in Practice

Robot running at 50 Hz with chunk size k = 8:

• t = 0: Observe state s₀. Run diffusion (takes ~100 ms). Get 8 actions.

• t = 0–7: Execute a₀, a₁, …, a₇ open-loop (160 ms).

• t = 8: Observe new state s₈. Run diffusion again. Get next 8 actions.

The policy only runs inference every 160 ms instead of every 20 ms — 8× fewer forward passes.

The trade-off

During open-loop execution, the robot doesn't react to disturbances. If something unexpected happens at t = 3, the robot won't notice until t = 8. Short chunks (k = 4) react faster but give less compute time. Long chunks (k = 16) are smoother but less reactive. Most systems use k = 8–16 with temporal ensembling: predict overlapping chunks and average them, so each action is informed by multiple observations.

Action chunking + diffusion = the modern IL recipe

Diffusion Policy (Chi et al., 2023) and ALOHA (Zhao et al., 2023) both introduced action chunking independently. π0, GR00T, and virtually every modern IL system uses it. The combination of flow-matching policies with action chunking is the current state of the art for robotic imitation learning.

Chapter 07

Compounding Errors

Even with the perfect output distribution, behavioral cloning has a deeper, more fundamental failure mode. Suppose your policy makes a tiny mistake at step 5 — it's 0.1° off from the expert's steering angle. No big deal, right?

Wrong. That tiny error puts the car in a state it never saw in training. The expert never drove 2 cm to the left of the lane center at that point. The policy has no idea what to do from this unfamiliar state — it was only trained on the expert's trajectory, not on states slightly off that trajectory.

So it makes a bigger mistake. Which puts it in an even more unfamiliar state. Which causes an even bigger mistake. The errors cascade, each one feeding the next, until the policy is so far from the training distribution that its outputs are essentially random.

The Compounding Error Problem

Small per-step errors accumulate over time. The policy drifts into states not covered by the training data, where it has no basis for making good decisions. Errors cascade. This is the fundamental limitation of behavioral cloning.

How Fast Do Errors Compound?

Let ε be the per-step error probability (chance the policy makes a mistake at any given step). Let T be the episode length.

Error Bound for Behavioral Cloning Total error ∼ O(ε · T²)

Quadratic in horizon! Double the episode length → 4x the total error.

Why T² and not T? Because each mistake doesn't just cost one step of error — it pushes you off the expert's trajectory, and every future step is now also wrong because you're in unfamiliar territory.

Worked Example — Compounding Numbers

Per-step error rate: ε = 0.01 (1% chance of mistake per step).

• T = 10 steps: total error ∼ 0.01 × 100 = 1.0 (manageable)

• T = 50 steps: total error ∼ 0.01 × 2500 = 25 (disaster)

• T = 100 steps: total error ∼ 0.01 × 10000 = 100 (complete failure)

Even with 99% per-step accuracy, a 100-step task is essentially guaranteed to fail catastrophically. This is why BC works for short tasks but collapses on long-horizon ones.

The Distribution Shift

Formally, the problem is a distribution mismatch. The policy is trained on states from the expert's distribution p_expert(s_t). At test time, the policy's own mistakes cause it to visit states from a different distribution p_π(s_t). The bigger the gap between these distributions, the worse the policy performs. BC has no mechanism to close this gap.

The Driving Analogy

Think of a student driver who only practiced on a driving simulator with scripted scenarios. On the real road, the first time another car cuts them off, they freeze — they've never seen this state. They swerve too hard, ending up in the oncoming lane. Now they've really never seen this state. BC is that student driver: good in familiar territory, helpless the moment anything goes slightly off-script.

We need a way to train the policy on states it actually visits, not just states the expert visited. Enter DAgger.

🔨 Derivation Compounding Error Bound: Why T² and Not T ▶ ✓ ATTEMPTED

BC achieves per-step error ε (probability of making a mistake at any timestep). Intuitively, with T steps, total error should be ε·T. But the actual bound is O(ε·T²).

Your task: Derive the T² bound rigorously. Show that at step t, the probability of being off-distribution is O(ε·t), and since the policy has no recovery data for off-distribution states, the expected cost from step t onward is O(T-t). Sum over all t to get T².

At each step, with probability ε, the policy makes a mistake. After t steps, the probability of EVER having made a mistake is bounded by t·ε (union bound). So P(off-distribution at step t) ≤ t·ε. Once off-distribution, the policy stays off because it has no recovery data.

Once the policy is off the expert's trajectory, every subsequent action is effectively random (no training data for these states). The expected cost for the remaining (T-t) steps is O(T-t). Each step incurs ~1 unit of error because the policy has no basis for correct actions.

Total expected error = Σ_t=1^T P(first mistake at step t) · (cost from step t to T). Approximately: Σ_t=1^T ε · (T-t) = ε · Σ_k=0^T-1 k = ε · T(T-1)/2 = O(εT²).

Setup: Let ε = P(policy makes wrong action at step t | on expert's trajectory). Let C_t = cost incurred at step t.

Step 1: Define "off-trajectory" as having deviated at any prior step. P(off-trajectory at t) ≤ tε (union bound over t independent error events).

Step 2: When on-trajectory: E[C_t] = 0 (correct action). When off-trajectory: E[C_t] = c (some constant cost per step, since the policy is in unknown territory).

Step 3: Total expected cost:

E[Σ_t C_t] = Σ_t=1^T P(off at t) · c ≤ c · Σ_t=1^T tε = cε · T(T+1)/2 = O(εT²)

Alternative (tighter) derivation via Ross et al. 2011:

Let d_t^π be the state distribution at step t under π. The performance difference lemma gives:

J(π*) - J(π) = Σ_t E_{s~d_t^π}[l(s)] where l(s) is the loss at state s.

For BC: l(s) = 0 if s ∈ supp(d_t^π*), but l(s) = O(1) otherwise. Since d_t^π drifts from d_t^π* with rate ε per step, the TV distance ||d_t^π - d_t^π*|| ≤ tε. So E[l(s)] ≤ tε, and summing: J(π*) - J(π) ≤ Σ_t tε = O(εT²).

The key insight: It's T² (not T) because of the CASCADING nature: a mistake at step 5 doesn't just cost 1 step — it corrupts all future steps too. DAgger fixes this by training on the policy's own distribution, making the per-step error independent of position: E[C_t] = ε regardless of t, giving ΣC_t = εT (linear).

Chapter 08

DAgger: Dataset Aggregation

The insight is elegant: if the problem is that BC trains on expert states but the policy visits different states, then train on the states the policy actually visits. But we still need expert actions — so we ask the expert to label those states.

Definition

DAgger (Dataset Aggregation)

An iterative algorithm that fixes the distribution mismatch of BC. In each round: (1) run the current policy, (2) record the states it visits, (3) ask the expert what action they would take in those states, (4) add these new (state, action) pairs to the training dataset, (5) retrain.

Algorithm: DAgger

Initialize: Collect expert demonstrations 𝒟 = {(s, a)_expert}
Train policy π_θ on 𝒟 via supervised learning
Deploy π_θ in the environment, recording visited states {s₁, s₂, …}
Query expert: For each visited state s, get expert action a* = π_expert(s)
Aggregate: 𝒟 ← 𝒟 ∪ {(s, a*)_new}
Go to step 2

Why This Fixes Compounding Errors

After several rounds, the training dataset contains expert actions for the states the learned policy actually visits — including the "mistake" states that BC never saw. The policy learns what to do when things go slightly wrong: how to recover.

Worked Example — DAgger for Driving

Round 1: Train BC on 100 expert driving demos. Policy drifts left on turns.

Round 2: Run the policy. It ends up 10 cm left of center on a curve. Expert labels: "steer right to correct." Add 50 new (state, correction) pairs. Retrain.

Round 3: Policy is better on curves but wobbles on straights. Expert labels: "steady." Add 50 more pairs. Retrain.

Round 5: Dataset now has expert actions for the full range of states the policy encounters — including recovery states. Error bound improves to O(ε · T) instead of O(ε · T²).

DAgger Error Bound Total error with DAgger ∼ O(ε · T)

Linear in horizon, not quadratic. Matching the expert's distribution eliminates the compounding.

The Catch: Expert Availability

DAgger requires the expert to label states on demand. The policy visits some weird state at 3 AM — who labels it? For simulated environments, you can query an oracle. For real robots with human experts, this is a serious logistical challenge. HG-DAgger addresses this.

The Key Shift

BC trains on p_expert(s). DAgger trains on p_π(s). This is the entire difference. By closing the distribution gap, DAgger eliminates quadratic error growth. It's the same supervised learning, just on the right data.

🔨 Derivation DAgger Regret Bound: O(εT) from Online Learning Theory ▶ ✓ ATTEMPTED

DAgger reduces BC's O(εT²) to O(εT). The proof uses online learning theory: DAgger is a no-regret algorithm over the sequence of data distributions d₁^π, d₂^π, ..., d_N^π.

Your task: Show that after N rounds of DAgger, the average policy π̄ has expected cost at most ε + O(1/√N) per step, which gives total cost O(εT) as N → ∞. Why does training on the mixture distribution fix the cascading problem?

BC trains on d^π* (expert's state distribution). DAgger trains on d^π (policy's own state distribution). After N rounds, the aggregated dataset D_N is drawn from (1/N)Σ_i d^π_i — a mixture of all past policy distributions. This converges to d^π̄ (the mixture policy's distribution).

DAgger is an instance of Follow the Leader (FTL) in online learning. The regret over N rounds is: Σ_i l(π_i, d^π_i) - min_π Σ_i l(π, d^π_i) ≤ O(√N). Dividing by N: the average policy's loss on the average distribution is at most optimal + O(1/√N).

Step 1 (Setup): At each DAgger round i, we have policy π_i and its induced state distribution d^π_i. We collect data from d^π_i and retrain on all accumulated data.

Step 2 (Performance decomposition): For the mixture policy π̄ = (1/N)Σ_iπ_i:

J(π̄) ≤ (1/N) Σ_i J(π_i) (by Jensen's inequality on costs)

J(π_i) = T · E_{s~d^π_i}[l(π_i, s)] where l is per-step loss.

Step 3 (No-regret): Since π_i+1 minimizes loss on ALL data so far (best response to past distributions):

(1/N)Σ_i E_{d^π_i}[l(π_i)] ≤ min_π (1/N)Σ_i E_{d^π_i}[l(π)] + O(1/√N)

Step 4 (Best in class): The best fixed policy in the class achieves E_d[l(π*)] = ε (classification error on the right distribution). So:

(1/N)Σ_i E[l(π_i)] ≤ ε + O(1/√N)

Step 5 (Total cost): After enough rounds, J(π̄) ≤ T · (ε + O(1/√N)) ≈ εT

The key insight: DAgger converts the sequential decision problem into an i.i.d. supervised learning problem. By training on states the policy actually visits, the per-step error is ε REGARDLESS of what happened at previous steps. No cascading. The price: you need access to the expert for N rounds of labeling. The reduction from T² to T is the fundamental reason DAgger works.

Checkpoint — Before you move on

A colleague says "DAgger isn't needed — just collect more expert demonstrations and BC will work for long horizons." Explain why this is wrong, using the T² bound. How much data would BC need to match DAgger at T=100 steps?

✓ Gate cleared

Model Answer

More data reduces ε (per-step error) through better generalization, but it does NOT change the T² dependence. BC's bound is O(εT²), DAgger's is O(εT). For T=100: BC error ~ ε·10,000 while DAgger error ~ ε·100. To match DAgger with BC, you'd need to reduce ε by a factor of 100 (i.e., ε_BC = ε_DAgger/100). Since supervised learning error scales roughly as ε ~ 1/√N, you'd need 10,000x more demonstrations! And this gets worse for longer horizons. More data helps BC, but it can never overcome the quadratic penalty. The fundamental issue is that BC trains on the WRONG distribution (expert states), not that it has too little data on that distribution. DAgger fixes the distribution, which no amount of expert data can do.

Chapter 09

HG-DAgger: The Practical Version

Vanilla DAgger has a usability problem. Running the policy, recording all states, then asking an expert to retroactively label each one? Imagine watching 10,000 frames of a robot flailing and labeling the correct action for each. Exhausting and impractical.

HG-DAgger (Human-Gated DAgger) flips the interface: the expert watches the policy run in real time and intervenes only when needed.

Definition

HG-DAgger (Human-Gated DAgger)

A variant of DAgger where the expert observes the policy executing and takes over control only when the policy is about to make a serious mistake. The intervention trajectories are added to the training set.

Algorithm: HG-DAgger

Train initial policy π_θ on expert demonstrations
Deploy π_θ with expert watching
Expert intervenes when the policy is about to fail — takes over control, demonstrates the correct behavior from the intervention point
Aggregate: Add the intervention demonstrations to 𝒟
Retrain π_θ on augmented 𝒟
Go to step 2

Why This Is Better in Practice

The expert only acts when it matters. If the policy is doing fine on a straight road, the expert sits back. When the policy starts drifting toward a curb, the expert grabs the wheel and demonstrates the recovery. The expert's time is spent on the hard cases — exactly the states where the policy needs the most help.

Worked Example — HG-DAgger for Robot Manipulation

Task: Pick up a cup. Initial BC policy succeeds 60% of the time.

Session 1: Expert watches 20 attempts. Policy fails on off-center cups. Expert intervenes 8 times, demonstrating approach corrections. 8 new demos added. Retrain → 75% success.

Session 2: Expert watches 20 attempts. Policy now fails on tilted cups. Expert intervenes 5 times. Retrain → 85% success.

Session 3: 3 interventions for edge cases. Retrain → 92% success.

Total expert time: ~1 hour across 3 sessions, compared to the 5+ hours of continuous demonstration that vanilla DAgger would require.

The Design Principle

HG-DAgger turns imitation learning into an iterative refinement process. The expert is a teacher who lets the student try, watches for mistakes, and corrects only when necessary. Each round of corrections targets the policy's current failure modes, making the data maximally informative.

Expert Quality Matters

The expert must be good at monitoring AND intervening. Some experts are great demonstrators but poor monitors — they intervene too late or too early. Too-late interventions miss the critical recovery states. Too-early interventions waste expert time on states the policy would have handled.

DAgger vs. HG-DAgger: When to Use Which

Vanilla DAgger is better when you have a scripted expert (a simulator oracle that can label any state instantly) or when you need theoretical guarantees about convergence.

HG-DAgger is better in the real world with human experts, where labeling time is expensive and the expert's attention is the bottleneck. Most real robotics experiments use some variant of HG-DAgger because it respects the human's time and cognitive load.

Chapter 10

Collecting Demonstrations

We've assumed demonstrations exist. But where do they actually come from? The answer depends heavily on the domain, and some methods are dramatically harder than others.

Natural Data Sources

Some domains generate demonstrations as a byproduct of normal activity:

• Self-driving: Every human driver generates (camera image, steering angle) pairs all day long. Tesla collects billions of frames this way.

• Language: The entire internet is a demonstration of text generation. Every sentence is an expert demonstration of "what to write next."

• Game playing: Replay databases (StarCraft, Go, Chess) contain millions of expert games.

Scale Advantage

When demonstrations come for free, you can collect millions. This is why behavioral cloning works so well for self-driving and language — the sheer data volume overwhelms the compounding error problem for many practical scenarios.

Robotics: The Hard Case

For robotic manipulation, demonstrations don't come naturally. Someone has to actively teach the robot. Three main approaches:

Method	How It Works	Pros	Cons
Kinesthetic teaching	Physically guide the robot arm by hand	Intuitive, direct	Awkward for complex motions, limited to compliant robots
Teleoperation	Control robot remotely via joystick/VR controller	Natural interface, works for any robot	Latency, requires practice to master
Puppeteering	Use a second identical robot arm as input device	1:1 mapping, very precise	Requires double the hardware cost

Real Numbers

The Diffusion Policy paper (Chi et al., 2023) collected about 100–200 demonstrations per task via teleoperation. Each demo took ~30 seconds. Total data collection: about 1–2 hours per task. This is typical for state-of-the-art robotic IL — it's not millions of demos, it's hundreds.

Videos of Humans and Animals

YouTube is full of humans doing tasks. Can we learn from watching humans?

The Embodiment Gap

A human hand has 27 degrees of freedom with soft, deformable fingers and tactile sensing. A robot gripper has 1–7 DOF with rigid pincers and maybe a force sensor. Watching a human fold a towel doesn't directly tell a robot how to fold a towel — the bodies are too different. This is the embodiment gap. Bridging it is an active research frontier.

Current approaches extract high-level intentions from human videos (what to grasp, where to place) while leaving the low-level motor control to RL or a separate controller. Full "watch a YouTube video and learn the task" remains an open problem.

The Data Quality Hierarchy

Not all demonstrations are equal. From best to worst: (1) same robot, same task, expert operator; (2) same robot, similar task, decent operator; (3) different robot, same task; (4) human video of the task. Each step down requires more sophisticated transfer methods. Most practical systems operate at level 1 or 2.

Chapter 11

Imitation + RL

Imitation learning and reinforcement learning aren't rivals — they're complements. Each has strengths the other lacks:

Property	Imitation Learning	Reinforcement Learning
Needs reward function?	No	Yes
Needs demonstrations?	Yes	No
Can surpass demonstrator?	No (BC) / Maybe (DAgger)	Yes
Self-improvement?	No framework for practice	Yes, by definition
Data efficiency?	Good (supervised learning)	Poor (exploration needed)
Long-horizon tasks?	Compounding errors	Reward signal guides recovery

The Winning Combination

Use imitation learning to initialize the policy (start from a reasonable baseline), then use RL to improve beyond the demonstrator. IL gives you a warm start; RL gives you self-improvement. This is the recipe behind many successful real-world systems.

Common Patterns

Pattern 1: IL then RL. Pre-train via BC on expert demos, then fine-tune with PPO or SAC using a reward function. The BC initialization means RL starts from sensible behavior instead of random flailing — dramatically faster convergence. Think of it as teaching someone the basics before sending them to practice on their own.

Pattern 2: IL + RL reward shaping. Use the distance from expert behavior as an auxiliary reward signal. The agent gets reward both for task success (RL objective) and for staying close to expert demonstrations (IL regularizer). This keeps the policy grounded while still allowing improvement beyond the demonstrator.

Pattern 3: RLHF (RL from Human Feedback). The recipe behind ChatGPT and Claude. Pre-train a language model on text (BC on internet data), then fine-tune with RL using a reward model trained on human preferences. Imitation learning provides the foundation; RL aligns the behavior.

The ChatGPT Recipe

Step 1 (IL): Pre-train on internet text. This is behavioral cloning on the largest dataset ever assembled.

Step 2 (IL): Fine-tune on curated (prompt, response) pairs. Still BC, but higher quality demos.

Step 3 (RL): PPO with a reward model trained on human rankings. The model goes beyond imitating — it optimizes for human preference.

Without step 1, RL from scratch on language would be intractable. Without step 3, you get a talented mimic but not an aligned assistant.

IL Alone Has No Mechanism for Practice

A pianist who only watches performances will never improve beyond mimicry. They must practice — try, fail, get feedback, adjust. BC has no equivalent. DAgger gets closer (the expert provides ongoing feedback), but only RL provides the full loop: try things the expert never did, discover something better, and keep it.

🔨 Derivation GAIL = Moment Matching via Adversarial Training ▶ ✓ ATTEMPTED

Generative Adversarial Imitation Learning (GAIL, Ho & Ermon 2016) avoids the reward-engineering problem by training a discriminator to distinguish expert from policy state-action pairs, then using the discriminator's output as a reward signal for RL.

Your task: Show that minimizing the Jensen-Shannon divergence between the occupancy measures ρ_π and ρ_π* (via a discriminator) is equivalent to moment matching — making the policy's state-action distribution indistinguishable from the expert's.

ρ_π(s,a) = (1-γ) Σ_t=0^∞ γ^t P(s_t=s, a_t=a | π). It's the discounted frequency of visiting (s,a) under policy π. Two policies with the same occupancy measure have the same expected return for ANY reward function. So matching occupancy measures = matching behavior.

The GAIL objective is: min_π max_D E_{ρ_π}[log D(s,a)] + E_{ρ_π*}[log(1-D(s,a))]. This is exactly the GAN objective with ρ_π as the generator distribution and ρ_π* as the real data distribution. At optimality, D*(s,a) = ρ_π* / (ρ_π + ρ_π*), and the overall min over π minimizes JS(ρ_π || ρ_π*).

Step 1: GAIL trains: D(s,a) = discriminator, π = policy (generator).

Discriminator objective: max_D E_{(s,a)~ρ_π*}[log D(s,a)] + E_{(s,a)~ρ_π}[log(1-D(s,a))]

Step 2: Optimal discriminator: D*(s,a) = ρ_π*(s,a) / (ρ_π(s,a) + ρ_π*(s,a))

Step 3: With D = D*, the policy optimization becomes:

min_π JS(ρ_π || ρ_π*) = (1/2)KL(ρ_π||m) + (1/2)KL(ρ_π*||m)

where m = (ρ_π + ρ_π*)/2.

Step 4: The reward for RL is r(s,a) = -log(1 - D(s,a)) (or log D(s,a)). The policy is trained with any RL algorithm (TRPO, PPO) using this learned reward. When D can't distinguish expert from policy (D ≈ 0.5), the policy has matched the expert's occupancy measure.

Step 5 (Moment matching): ρ_π = ρ_π* implies E_{ρ_π}[f(s,a)] = E_{ρ_π*}[f(s,a)] for ALL functions f. This is moment matching — every statistic of the policy's behavior matches the expert. This is stronger than BC (which only matches the conditional π(a|s) pointwise).

The key insight: GAIL converts IL into an adversarial game. The discriminator automatically discovers which features of behavior matter (the "reward function" emerges from training). BC matches actions at expert states; GAIL matches the entire state-action distribution. This makes it robust to compounding error because the policy's own states are part of the optimization.

💥 Break-It Lab What Dies When You Break IL Components? ▶ ✓ ATTEMPTED

Imitation learning has multiple failure modes that activate under different conditions. Toggle each pathology to understand when and why IL breaks.

Distribution shift (BC on long horizon, T=200) NORMAL

Failure mode: The policy makes a small mistake at step 10, entering a state never seen in training. From there, every subsequent action is nearly random. By step 50, the robot is in a completely unfamiliar configuration. By step 100, it has crashed or frozen. Symptom: performance degrades quadratically with episode length. Short tasks (T<20) work fine; long tasks (T>50) fail catastrophically. Fix: DAgger (train on policy-visited states) or keep T short with action chunking.

Wrong loss function (MSE on discrete/multimodal actions) NORMAL

Failure mode: The policy outputs the MEAN of multiple valid actions, which is itself an invalid action. For a self-driving task: left and right around an obstacle both have probability 0.5 in the data, but the MSE-optimal prediction is "straight ahead" — into the obstacle. The policy looks confident (low loss on training set) but produces catastrophic actions in the multimodal states. Symptom: low training loss but terrible real-world performance. Fix: use cross-entropy (discrete), mixture models, or diffusion policies.

Covariate shift with no correction (deploy in new environment) NORMAL

Failure mode: Policy trained on demonstrations in environment A (well-lit, clean floor) is deployed in environment B (dim lighting, textured floor). The observations are outside the training distribution. The policy has no mechanism to detect it's seeing OOD inputs and produces nonsensical actions with high confidence. Symptom: works perfectly in training environment, fails immediately in any different setting. Fix: domain randomization during data collection, or train an OOD detector that triggers a safe fallback behavior.

🏗 Design Challenge You're the Architect: IL for Autonomous Driving ▶ ✓ ATTEMPTED

You're building an autonomous driving system using imitation learning. You have 1000 hours of human driving data (cameras, LiDAR, steering/throttle/brake at 10 Hz). The system must handle edge cases: construction zones, accidents, aggressive drivers, animals on road. You cannot do online testing until you pass simulation benchmarks.

Data

1000 hours (36M frames), 95% highway, 5% edge cases

Observation

8 cameras + LiDAR point cloud + HD map

Action

Steering angle, acceleration, lane change (mixed continuous/discrete)

Horizon

Planning: 5s ahead, episode: hours

Safety

Zero tolerance for collision in 99.99% of scenarios

Edge Cases

Must handle construction, accidents, jaywalkers, road debris

1. 95% of data is boring highway driving. Only 5% contains the hard cases. How do you prevent the policy from being overwhelmed by the easy majority? (Hint: sampling strategies, data augmentation.)

2. Compounding errors over hours of driving would be catastrophic. How do you handle the long horizon? (Hint: hierarchical policies, action chunking, closed-loop vs open-loop.)

3. Some edge cases are SO rare (animal on highway) that you might have 0 examples in 1000 hours. How do you handle these?

4. The policy must be multimodal (lane change left vs. right are both valid). What output distribution do you use?

5. What's your safety layer? The IL policy WILL make mistakes. How do you prevent them from being fatal?

Real-world approaches (public knowledge from Tesla AI Day, Waymo papers):

1. Data imbalance: Mine hard examples aggressively. Tesla's "shadow mode" runs the policy in background on fleet vehicles, flags disagreements between policy and human, and collects those scenarios preferentially. Result: 100x oversampling of edge cases. Also: data augmentation (synthetic rain, night, occlusion) to inflate rare conditions.

2. Long horizon: Hierarchical decomposition: (a) Route planner (minutes ahead, graph search), (b) Behavior planner (seconds ahead, learned IL policy decides lane changes/speed), (c) Motion controller (milliseconds, PID/MPC tracks the behavior plan). IL operates at layer (b) with 2-5 second horizon. Action chunking: predict 1-2 seconds of waypoints, replan at 10 Hz. This keeps the effective horizon short (~20 steps).

3. Ultra-rare events: Simulation. Build a photorealistic simulator and procedurally generate edge cases: animal crossings, debris, construction. Train a separate "emergency" policy on simulated data. At runtime: a classifier detects "unusual scenario" and switches to the emergency policy. Also: RL fine-tuning in simulation for collision avoidance.

4. Multimodality: Tesla uses a learned trajectory generation model (similar to diffusion policy) that outputs multiple candidate trajectories. A separate scoring model picks the best one considering safety, comfort, and traffic rules. This avoids the averaging trap entirely.

5. Safety layer: Multiple levels: (a) Rule-based collision checker (physics-based time-to-collision), (b) Learned safety critic that predicts P(collision | action), (c) Hardware emergency brake (AEB) as final fallback. The IL policy's output is always filtered through safety checks before execution. If ANY check fails, the system executes a minimal-risk condition (slow to stop, stay in lane).

Key lesson: No production AV system relies on pure IL. It's always IL + safety systems + hierarchical planning + simulation for edge cases + fleet learning for data. The IL component handles the "normal driving" 99% efficiently; everything else handles the 1% safely.

⚔ Adversarial: The Subtle Failure of Cross-Entropy on Continuous Actions

Your colleague discretizes the steering angle into 100 bins and uses cross-entropy loss for BC. They report 95% top-1 accuracy on the validation set. But the robot drifts off-road within 30 seconds. Why?

Cross-entropy can't learn from continuous actions 100 bins isn't enough resolution 95% accuracy means 5% error rate per step, and with T=1500 steps (30s at 50Hz), compounding gives O(0.05 * 1500^2) = catastrophic drift The discretization introduces aliasing artifacts

🔗 Pattern Recognition

IL + Reward = Offline RL

Imitation Learning (This Lesson)

π ← argmin E_D[l(π(s), a_expert)]
No reward function. Clone expert.
Cannot surpass demonstrator.

Offline RL

π ← argmax E_D[Q(s,a) - constraint]
Has reward. Improves beyond data.
Can surpass demonstrator. → Offline RL lesson

BC is offline RL with the constraint turned to maximum (no improvement allowed). Filtered BC is halfway — it imitates only the good data. AWR/IQL complete the spectrum: imitate the data but weight by quality. The entire progression from BC to offline RL is one dial: how much do you trust the advantage estimates to allow deviation from the data?

If you gave BC access to a reward function but no environment interaction, what method would you get? (Answer: filtered BC or AWR — rank demonstrations by return, weight accordingly.)

🔗 Pattern Recognition

Modern VLAs = Imitation Learning at Scale

Classical IL (This Lesson)

Small dataset (100 demos). Single task.
BC with MLP/CNN policy.
Compounding errors limit horizon.

Vision-Language-Action Models

Massive dataset (1M+ episodes, 100+ tasks).
BC with transformer + diffusion policy.
Foundation model generalization fights compounding. → VLA lesson

RT-2, Octo, π0, OpenVLA, GR00T — these are all imitation learning systems. The core algorithm is still BC (minimize action prediction error on demonstrations). What changed: (1) scale — millions of demonstrations across hundreds of tasks, (2) architecture — pre-trained vision-language backbones provide generalizable features, (3) output — diffusion/flow-matching policies handle multimodality. The compounding error problem is mitigated (not solved) by having such broad training distribution that the policy rarely encounters truly OOD states.

What would it take for a VLA to truly solve compounding errors? (Hint: online fine-tuning = DAgger at scale. This is the RLHF equivalent for robotics.)

💻 Build It Implement the DAgger Training Loop ▶ ✓ ATTEMPTED

Implement DAgger's core loop: roll out the current policy, collect expert labels on visited states, aggregate data, retrain. The key insight: the training distribution changes each round.

signature def dagger(env, expert_policy, policy, n_rounds=10, n_episodes=5, n_epochs=50): """ Run DAgger for n_rounds. Args: env: gym-like environment with reset() and step() expert_policy: function(state) -> action (oracle) policy: learnable policy with .predict(s) and .train(dataset) n_rounds: number of DAgger iterations n_episodes: episodes to collect per round n_epochs: training epochs per round Returns: dataset: aggregated (states, actions) dataset """

Test case

After 5 rounds on CartPole with a perfect expert: policy success rate should increase from ~20% (BC) to ~90%+ (DAgger). Dataset grows linearly: round 1 has N states, round 5 has 5N states.

The policy executes actions, but the EXPERT labels the states. So you roll out with policy.predict(s), but store (s, expert_policy(s)). This means the training data has states from the policy's distribution but actions from the expert. That's the distribution-matching magic.

def dagger(env, expert_policy, policy, n_rounds=10, n_episodes=5, n_epochs=50):
    # Step 0: Seed with expert demonstrations
    dataset = {'states': [], 'actions': []}
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = expert_policy(s)
            dataset['states'].append(s)
            dataset['actions'].append(a)
            s, _, done, _ = env.step(a)

    # DAgger rounds
    for round_i in range(n_rounds):
        # Step 1: Train policy on ALL accumulated data
        policy.train(dataset, epochs=n_epochs)

        # Step 2: Roll out CURRENT policy
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a_policy = policy.predict(s)  # policy's action

                # Step 3: Expert labels this state
                a_expert = expert_policy(s)   # what expert WOULD do

                # Step 4: Add to dataset
                dataset['states'].append(s)
                dataset['actions'].append(a_expert)  # expert action!

                # Execute policy's action (not expert's)
                s, _, done, _ = env.step(a_policy)

    return dataset

Bonus: Implement a β-scheduled variant where round i uses a mixed policy: π_mix = β_i π_expert + (1-β_i) π_learned for rollouts, with β decaying from 1 to 0. This is the "safe DAgger" variant that avoids catastrophic early rollouts when the learned policy is still terrible.

Chapter 12

Full Summary & Cheat Sheet

The Methods at a Glance

Method	Type	Key Idea	Error Scaling
Behavioral Cloning	Offline	Supervised regression on demos	O(εT²)
BC + Expressive Policy	Offline	Diffusion/MoG output avoids averaging	O(εT²)
DAgger	Online	Train on policy-visited states	O(εT)
HG-DAgger	Online	Expert intervenes only when needed	O(εT)
IL + RL	Both	IL initializes, RL improves	Can surpass expert

Policy Expressivity Cheat Sheet

Output Type	Multimodal?	Continuous?	Expressivity
Categorical (softmax)	Yes	No (discrete only)	Maximal for discrete
Gaussian (μ, σ)	No (unimodal)	Yes	Low
Mixture of Gaussians	Yes (K modes)	Yes	Medium (must choose K)
Autoregressive	Yes	Discretized	High
Diffusion	Yes (arbitrary)	Yes	Highest

Decision Flowchart

Do you have demonstrations?

• No → You need RL. Imitation learning requires demos by definition.

• Yes → Is the task short-horizon (<50 steps)?

• Yes → BC with diffusion policy is likely sufficient.

• No → Can you interact with the environment?

• Yes + expert available → DAgger / HG-DAgger.

• Yes + reward available → BC pre-train, then RL fine-tune.

• No → BC with massive data and expressive policy. Hope for the best.

Key Equations

Behavioral Cloningmin_θ Σ_{(s,a) ∈ 𝒟} ‖ a − π_θ(s) ‖²

BC Error Boundε_total ∼ O(ε · T²) (quadratic — compounding errors)

DAgger Error Boundε_total ∼ O(ε · T) (linear — distribution matching)

Diffusion Policy Lossℒ(θ) = 𝔼[ ‖ ε − ε̂_θ(a_n, s, n) ‖² ]

Common Failure Modes

Failure	Symptom	Fix
Averaging	Policy outputs mean of multimodal data	Expressive policy (diffusion, MoG)
Compounding	Works for 10 steps, diverges by 50	DAgger / HG-DAgger
Distribution shift	Policy visits unseen states	Train on policy-visited states
Embodiment gap	Can't transfer from human video	Task-level (not motion-level) transfer
Insufficient data	Policy memorizes, doesn't generalize	More demos, data augmentation

The Big Picture

The One Sentence

Imitation learning = supervised learning on expert demonstrations, where the main enemies are multimodal actions (solved by expressive policies) and compounding errors (solved by DAgger). Combine with RL to go beyond imitation.

What Comes Next

Actor-Critic methods combine a learned policy (actor) with a learned value function (critic), enabling efficient RL fine-tuning on top of IL pre-training. PPO, the algorithm behind RLHF, builds directly on the policy gradient + KL constraint ideas from the previous lecture, applied after an imitation learning warm-start.

Looking Ahead

The Modern Recipe

Pre-train with imitation (massive data, supervised learning). Fine-tune with RL (reward signal, self-improvement). Constrain with KL (don't drift too far from pre-training). This pattern repeats across language models, robot policies, and game agents. Master imitation learning, and you've mastered the foundation.

The Complete Guide to Imitation Learning

What You'll Master

The Problem: Teaching by Showing

Behavioral Cloning

The Averaging Trap

Why L2 Produces the Mean

When Does BC Work Despite This?

Expressive Policy Distributions

Option 1: Discrete Actions

Option 2: Gaussian (Continuous)

Option 3: Mixture of Gaussians

Option 4: Discretize + Autoregressive

Option 5: Diffusion Models

Diffusion Policies

How It Works

Flow Matching: The Modern View

Systems in the Wild

Action Chunking

Why It Works

Compounding Errors

How Fast Do Errors Compound?

DAgger: Dataset Aggregation

Why This Fixes Compounding Errors

HG-DAgger: The Practical Version

Why This Is Better in Practice

DAgger vs. HG-DAgger: When to Use Which

Collecting Demonstrations

Natural Data Sources

Robotics: The Hard Case

Videos of Humans and Animals

Imitation + RL

Common Patterns

Full Summary & Cheat Sheet

The Methods at a Glance

Policy Expressivity Cheat Sheet

Decision Flowchart

Key Equations

Common Failure Modes

The Big Picture

What Comes Next