Why showing beats telling. From behavioral cloning to diffusion policies to DAgger — every failure mode explained.
Imagine you're teaching a child to tie their shoes. You don't write down a 47-step motor-control program specifying joint torques for each finger. You show them. You tie the shoe, they watch, they try, they fumble, you show them again. Eventually they get it.
Imitation learning brings this idea to robots and AI agents. Instead of designing a reward function and hoping the agent discovers good behavior through trial and error, you demonstrate the desired behavior and let the agent learn by copying you.
Imitation learning: learning a policy πθ(a|s) from expert demonstrations, without access to a reward function. The demonstrations are the entire training signal.
Quick terminology check from the MDP/RL framework:
| Symbol | Meaning | Example (Self-Driving) |
|---|---|---|
| sₜ | State at time t | Car position, velocity, all objects |
| oₜ | Observation at time t | Camera image (partial view of state) |
| aₜ | Action at time t | Steering angle, acceleration |
| τ | Trajectory (s₁, a₁, …, s_T, a_T) | One full driving episode |
| πθ(a \| s) | Policy (learned strategy) | Neural net: image → steering command |
RL requires a reward function. For many tasks — folding laundry, cooking an egg, suturing tissue — specifying a reward function is harder than just doing the task yourself. Imitation learning sidesteps this entirely: you show the robot what "success" looks like, and it learns to replicate it.
The dataset for imitation learning is a collection of expert demonstrations:
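One common way to write this down, assuming N demonstrations of length T each (the index i, the count N, and the symbol 𝒟 are notation introduced here for illustration):

$$
\mathcal{D} = \{\tau_1, \dots, \tau_N\}, \qquad \tau_i = \big(s_1^{(i)}, a_1^{(i)}, \dots, s_T^{(i)}, a_T^{(i)}\big)
$$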
The question this entire lecture answers: how do we turn these demonstrations into a policy that works?
At one extreme: hand-code every behavior (classical robotics). At the other: pure RL from scratch (agent learns everything by trial and error). Imitation learning sits in the sweet spot — leverage human expertise without the brittleness of hand-coding or the sample complexity of pure RL.
The simplest possible approach. You have (state, action) pairs from the expert. You want a neural network that maps states to actions. This is… supervised learning. Literally just regression.
Train a policy by supervised learning on expert demonstrations. Given a state, predict the expert's action. That's it.
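Written as an objective, and assuming the demonstrations are pooled into a set 𝒟 of (s, a) pairs (the symbol 𝒟 is an assumption made here), behavioral cloning with a squared-error loss reads:

$$
\min_{\theta} \sum_{(s,\,a) \in \mathcal{D}} \big\lVert \pi_\theta(s) - a \big\rVert^2
$$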
Unpack each piece:
• πθ(s) — your neural network takes state s as input, outputs a predicted action â.
• a — the expert's actual action at that state. This is your ground truth label.
• ‖·‖² — squared Euclidean distance. Penalizes any deviation from the expert.
• minθ — adjust network weights via gradient descent to minimize the total prediction error across all demos.
If you've ever done supervised regression in machine learning, you already know how to do behavioral cloning. The state is the input feature, the action is the label, and the loss is MSE. That's literally it.
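To make the "it is just regression" point concrete, here is a minimal sketch of behavioral cloning with a linear policy in NumPy. The synthetic expert (`true_W`), the state dimension, and the learning rate are all made up for illustration; a real system would use a neural network and real demonstrations.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "expert": action is a fixed linear function of the state.
# true_W is a hypothetical stand-in for whatever the expert actually does.
true_W = np.array([[0.5], [-1.0], [2.0]])   # 3-D state -> 1-D action
states = rng.normal(size=(1000, 3))         # expert-visited states s
actions = states @ true_W                   # expert actions a (the labels)

# Linear policy pi_theta(s) = s @ W, trained by gradient descent on MSE.
W = np.zeros((3, 1))
lr = 0.1
for _ in range(500):
    pred = states @ W                                        # predicted actions
    grad = 2.0 * states.T @ (pred - actions) / len(states)   # d(MSE)/dW
    W -= lr * grad

final_loss = float(np.mean((states @ W - actions) ** 2))
print("learned weights:", W.ravel(), "final MSE:", final_loss)
```

Nothing in the loop is specific to control: swap the arrays for image features and steering angles and the same code is a (very small) BC trainer.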
State: camera image showing a road curving left. Expert steers at −12° (left). Your network predicts −10°.
Loss = (−12 − (−10))² = (−2)² = 4.
Gradient pushes θ to make the prediction closer to −12°. After thousands of examples and gradient steps, the network learns to imitate the expert's steering across many road conditions.
No simulator needed. No reward function needed. No interaction with the environment. You collect demos once, train overnight, and deploy. This is why BC is by far the most common starting point in real robotics — it works well enough to get a first prototype running.
BC assumes there's a single correct action for every state. For many tasks (following a trajectory, stabilizing a drone), this is approximately true. But for tasks with multiple valid strategies, this assumption will betray you spectacularly.
But there is a devastating failure mode hiding in that innocent L2 loss…
You're training a self-driving car. There's a large obstacle in the middle of the road. In your dataset, some demonstrators swerve left. Others swerve right. Both are perfectly valid.
What does L2 regression predict? The mean. Go straight. Into the obstacle.
When the data is multimodal — multiple valid actions for the same state — minimizing squared error produces the average of those actions. The average of two good actions can be a catastrophically bad action.
Minimizing E[‖a − â‖²] over â has a closed-form solution: â* = E[a]. The optimal prediction under L2 loss is always the conditional mean of the data. This is a fact from basic statistics, not a choice.
State: obstacle ahead. 5 expert demos at this state:
• Demo 1: steer −20° (left) • Demo 2: steer −18° (left) • Demo 3: steer +22° (right) • Demo 4: steer +19° (right) • Demo 5: steer −21° (left)
L2-optimal prediction: (−20 − 18 + 22 + 19 − 21) / 5 = −3.6°
Nearly straight ahead. None of the experts ever steered at −3.6° here. The "optimal" prediction is a disaster — you crash into the obstacle.
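The arithmetic above can be checked directly. The sketch below uses the five steering angles from the example and scans a grid of candidate predictions to confirm that the squared-error minimizer is the mean, even though the mean is an action no expert ever took. The grid range and resolution are arbitrary choices.

```python
import numpy as np

# The five expert steering angles from the example, in degrees.
demos = np.array([-20.0, -18.0, 22.0, 19.0, -21.0])

def l2_loss(pred):
    """Mean squared error of a single constant prediction."""
    return float(np.mean((demos - pred) ** 2))

candidates = np.linspace(-30.0, 30.0, 601)   # 0.1-degree grid
best = candidates[np.argmin([l2_loss(c) for c in candidates])]
mean = demos.mean()

print("mean of demos:", mean)
print("grid minimizer of L2 loss:", best)
print("loss at mean / at -20 / at +20:",
      l2_loss(mean), l2_loss(-20.0), l2_loss(20.0))
```

Steering at −20° or +20° would avoid the obstacle, yet both score a higher L2 loss than the near-straight prediction.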
This isn't a rare edge case. It happens all the time:
• Multiple demonstrators with slightly different styles
• Tasks where multiple strategies are valid (reaching around an object from either side)
• Any situation where the action distribution is not unimodal
The problem is not the neural network being too small or the training being insufficient. A perfect neural network optimized under L2 loss will still predict the mean. The problem is the loss function forcing a single-point prediction for an inherently multi-valued answer.
BC with an L2 loss works when the data is unimodal at every state, that is, when there is essentially one right answer for each situation. A single expert with consistent behavior is ideal. Multiple experts with very different styles are dangerous.
If the conditional distribution p(a|s) is unimodal for every s in the dataset, then L2 regression recovers the correct action. The averaging problem only appears when p(a|s) has multiple separated modes. The severity scales with the number of modes and the distance between them.
The fix: instead of predicting a single action, model a distribution over actions. The policy should be able to say "there's a 60% chance we should go left and a 40% chance we should go right" — and never output "go straight."
If actions are discrete (left, right, up, down), the solution is trivial. Output a categorical distribution via softmax:
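In symbols, assuming the network produces one score z_k(s) per discrete action (the name z_k for these logits is notation introduced here), the softmax policy is:

$$
\pi_\theta(a = k \mid s) = \frac{\exp z_k(s)}{\sum_{j} \exp z_j(s)}
$$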
This is maximally expressive — it can represent any distribution over the discrete action set. No averaging problem. The policy can assign zero probability to "go straight" while putting mass on both left and right.
For continuous actions (like steering angle, joint torques, or end-effector velocities), the naive approach is a Gaussian: the network outputs a mean μ and a standard deviation σ.
A Gaussian has one peak. It can be wide or narrow, but it always has a single mode. For the obstacle scenario: the Gaussian centers at the mean (−3.6°) with large variance. It assigns highest probability to going straight — exactly what we wanted to avoid.
A natural extension: output multiple Gaussians and mix them with learned weights:
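A standard way to write such a mixture, assuming K components whose weights, means, and variances are all network outputs (the symbols w_k, μ_k, σ_k follow common convention rather than anything fixed by the text):

$$
\pi_\theta(a \mid s) = \sum_{k=1}^{K} w_k(s)\, \mathcal{N}\!\big(a;\ \mu_k(s),\ \sigma_k^2(s)\big), \qquad w_k(s) \ge 0, \quad \sum_{k=1}^{K} w_k(s) = 1
$$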
Now we can have two modes: one peaked around −20° (left) and one around +20° (right). The mix weight w controls how much probability goes to each mode.
For a multi-dimensional continuous action vector (a1, a2, a3…), we can discretize each dimension into bins and model them sequentially, one dimension at a time:
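This is the chain rule of probability applied to the action vector. Assuming d action dimensions (the symbol d is introduced here), the factorization is:

$$
p(a_1, \dots, a_d \mid s) = \prod_{i=1}^{d} p\big(a_i \mid s,\ a_1, \dots, a_{i-1}\big)
$$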
This is how language models work: p(next word | previous words). Same idea, applied to action dimensions. Each conditional is a simple categorical distribution (maximally expressive for that dimension), and the chain rule ensures the joint distribution can be arbitrarily complex.
A critical distinction. Making the neural network bigger (more layers, more parameters) does NOT fix the averaging problem if the output head is still a single Gaussian. A 100-billion-parameter network with a Gaussian output is still unimodal. The bottleneck is the output representation, not the network capacity.
Obstacle scenario. Expert data: 60% steer left (−20°), 40% steer right (+20°).
• Gaussian: peak at −4°. Probability of ±20° is low. Fails.
• MoG (K=2): mode at −20° (w=0.6), mode at +20° (w=0.4). Works!
• Autoregressive: discretize to 1° bins, models exact histogram. Works!
• Diffusion: samples cluster at −20° and +20° naturally. Works!
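The first two rows of this comparison can be reproduced numerically. The sketch below draws data matching the scenario, fits a single Gaussian by maximum likelihood, and compares its density with a two-component mixture. The 1° spread inside each mode and the sample counts are assumptions made for illustration.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Expert data from the scenario: 60% steer to -20 deg, 40% to +20 deg.
# The 1-degree spread within each mode is an illustrative assumption.
rng = np.random.default_rng(0)
left = rng.normal(-20.0, 1.0, size=600)
right = rng.normal(20.0, 1.0, size=400)
data = np.concatenate([left, right])

# Single Gaussian: maximum-likelihood fit is the sample mean and std.
mu, sigma = data.mean(), data.std()

def gaussian_density(x):
    return normal_pdf(x, mu, sigma)

def mog_density(x):
    # Two-component mixture using the scenario's weights and modes.
    return 0.6 * normal_pdf(x, -20.0, 1.0) + 0.4 * normal_pdf(x, 20.0, 1.0)

for angle in (-20.0, 0.0, 20.0):
    print(f"angle {angle:+6.1f}:  gaussian={gaussian_density(angle):.4f}  "
          f"mixture={mog_density(angle):.4f}")
```

The single Gaussian spreads its mass across the gap and gives "straight ahead" more density than the mixture does, while the mixture concentrates almost all of its mass near the two modes.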
The most powerful option. Model the action distribution as a denoising diffusion process. Start from noise, iteratively refine into a clean action. Can represent any distribution, including complex multimodal ones. This is so important it gets its own chapter.
You know how image diffusion models (Stable Diffusion, DALL-E) generate images by starting from pure noise and iteratively denoising? Same idea, but instead of generating pixels, we generate actions.
A policy that generates actions through iterative denoising. Starting from random Gaussian noise a_N, the network predicts and removes noise over N steps to produce a clean action a₀. The conditioning input is the current state/observation.
The key connection: in image diffusion, you model p(image | text). In diffusion policy, you model p(action | observation). Same math, different domain. The observation (camera image, robot joint angles) plays the role of the text prompt.
Training: Take an expert action a₀. Add noise to get aₙ = a₀ + ε (where n indexes the noise level; practical noise schedules also scale a₀ down as n grows). Train the network to predict the noise: ε̂θ(aₙ, s, n) ≈ ε.
Inference: Start with random noise a_N ~ 𝒩(0, I). Iteratively denoise: at each step, predict the noise and subtract a fraction of it. After N steps, you have a clean action.
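For reference, the standard DDPM parameterization writes the noising step and the training loss as follows. The schedule symbol ᾱ_n (a number between 0 and 1 that shrinks as n grows) comes from the DDPM literature, not from the text above:

$$
a_n = \sqrt{\bar{\alpha}_n}\, a_0 + \sqrt{1 - \bar{\alpha}_n}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}(\theta) = \mathbb{E}_{a_0,\, s,\, n,\, \epsilon} \Big[ \big\lVert \epsilon - \hat{\epsilon}_\theta(a_n, s, n) \big\rVert^2 \Big]
$$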
State: obstacle ahead. 100 denoising runs produce:
• ~60 runs converge to −20° (left cluster)
• ~40 runs converge to +20° (right cluster)
• ~0 runs converge to 0° (straight)
The diffusion process naturally discovers modes without being told how many there are. Different random starting noises converge to different modes. The valley between modes (going straight) is avoided because no training data lives there.
Mixture of Gaussians requires choosing K (number of modes) in advance. Autoregressive models introduce discretization error in every binned dimension. Diffusion models can, in principle, represent distributions with any number of modes without being told how many there are, given enough network capacity and data. This flexibility is a large part of why "Diffusion Policy" (Chi et al., 2023) became state-of-the-art for robotic manipulation.
Diffusion policies require N denoising steps at inference time (typically 10–100). Each step is a forward pass through the network. This is 10–100x slower than a single-pass policy. For a robot running at 10 Hz control, you need fast denoising — techniques like DDIM or consistency models help reduce N to as few as 1–4 steps.
The noise-prediction formulation above (DDPM-style) works, but modern systems like π0 (Physical Intelligence, 2024) use a cleaner framework called flow matching. The idea: instead of learning to predict noise, learn a velocity field that smoothly transforms Gaussian noise into expert actions.
Start with two things: a data point x₁ (an expert action) and a noise sample x₀ ~ 𝒩(0, I). Define a straight-line path between them:
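With t running from 0 (pure noise) to 1 (expert action), the straight-line path is the linear interpolation:

$$
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1]
$$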
The velocity along this path is simply x₁ − x₀. We train a network vθ(x_t, s, t) to predict this velocity:
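A common form of this regression loss, assuming t is drawn uniformly from [0, 1] during training (the sampling distribution for t is an assumption here):

$$
\mathcal{L}(\theta) = \mathbb{E}_{x_0,\, x_1,\, t} \Big[ \big\lVert v_\theta(x_t, s, t) - (x_1 - x_0) \big\rVert^2 \Big]
$$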
At test time: start from pure noise x₀ ~ 𝒩(0, I). Integrate the learned velocity field with small Euler steps of size Δ: x_{t+Δ} = x_t + vθ(x_t, s, t) · Δ. After ~10–20 steps, you arrive at a clean action.
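The Euler sampling loop can be tried without training anything. In the sketch below the learned network vθ is replaced by the closed-form ideal velocity field for a toy target with two point masses (−20° with weight 0.6, +20° with weight 0.4). That substitution, the toy target, and the step count are all assumptions made for illustration; a real policy would call a trained network conditioned on the state s.

```python
import numpy as np

MODES = np.array([-20.0, 20.0])    # toy expert actions (degrees)
WEIGHTS = np.array([0.6, 0.4])     # how often the expert picks each one

def ideal_velocity(x, t):
    """E[x1 - x0 | x_t = x] for the straight-line path, in closed form.

    Stands in for the trained network v_theta(x_t, s, t).
    """
    # Given x1 = m_k, x_t = (1-t) x0 + t m_k is Gaussian: N(t m_k, (1-t)^2).
    log_w = np.log(WEIGHTS) - (x[:, None] - t * MODES) ** 2 / (2 * (1 - t) ** 2)
    log_w -= log_w.max(axis=1, keepdims=True)      # numerical stability
    w = np.exp(log_w)
    w /= w.sum(axis=1, keepdims=True)              # posterior over the modes
    # On a path that ends at m_k, the velocity at x is (m_k - x) / (1 - t).
    return (w * (MODES - x[:, None])).sum(axis=1) / (1 - t)

def sample_actions(n_samples, n_steps=50, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n_samples)                 # x0 ~ N(0, 1)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        x = x + ideal_velocity(x, i * dt) * dt     # Euler step
    return x

actions = sample_actions(2000)
frac_left = float(np.mean(actions < 0))
near_mode = float(np.mean(np.abs(np.abs(actions) - 20.0) < 1.0))
print("fraction steering left:", frac_left)
print("fraction within 1 degree of a mode:", near_mode)
```

Different noise draws flow to different modes, and samples should end up near −20° or +20° rather than near the mean, which is the behavior the averaging discussion asked for.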
DDPM requires a carefully designed noise schedule (βt), a specific forward noising process, and a reverse SDE. Flow matching replaces all of that with a single straight-line interpolation. The training loss is simpler (no noise schedule), the sampling is an ODE (not SDE — deterministic and faster), and the framework generalizes to non-Gaussian sources. This is why π0, π0.7, and other modern robot foundation models use flow matching.
Diffusion and flow-matching policies power nearly all state-of-the-art IL systems deployed today:
| System | Year | Policy Type | Domain |
|---|---|---|---|
| Diffusion Policy | 2023 | DDPM denoising | Tabletop manipulation |
| π0 (Physical Intelligence) | 2024 | Flow matching | General-purpose robotics |
| Octo | 2024 | Diffusion action head | Generalist robot control |
| OpenVLA | 2024 | Autoregressive (VLM) | Open-vocab robot control |
| GR00T (NVIDIA) | 2024 | Flow matching | Humanoid robots |
| Figure Helix | 2025 | Flow matching | Humanoid manipulation |
| Waymo EMMA | 2024 | Autoregressive (LLM) | Autonomous driving |
For robotics (continuous, multimodal actions): flow matching / diffusion dominates. For driving and language-conditioned control: autoregressive VLMs are competitive. Both are forms of behavioral cloning with expressive output distributions — the core lesson of this chapter.
A robot running at 50 Hz makes a new decision every 20 milliseconds. But does it need to? Most actions unfold over 200–500 ms. A grasping motion doesn't change its mind every 20 ms — it commits to a trajectory and follows through. What if instead of predicting one action at a time, the policy predicted an entire chunk of future actions?
Instead of predicting a single action aₜ from state sₜ, the policy predicts a chunk of k future actions: πθ(sₜ) → (aₜ, aₜ₊₁, …, aₜ₊ₖ₋₁). The robot executes the entire chunk before observing a new state. Typical chunk sizes: k = 4–16 steps.
Reason 1: Temporal consistency. Single-step predictions can jitter — at time t the policy says "go left" and at t+1 it says "go right." Chunking forces the policy to commit to a coherent motion over multiple timesteps. The result is smoother, more physically plausible actions.
Reason 2: More compute time. Diffusion and flow-matching policies need 10–20 denoising steps per prediction. If you predict 1 action per control step at 50 Hz, all of those steps must fit in 20 ms, which leaves 1–2 ms per denoising step. If you predict a chunk of 8 actions, you have 160 ms in total, or 8–16 ms per denoising step. This is comfortably achievable on current hardware.
Robot running at 50 Hz with chunk size k = 8:
• t = 0: Observe state s₀. Run diffusion (takes ~100 ms). Get 8 actions.
• t = 0–7: Execute a₀, a₁, …, a₇ open-loop (160 ms).
• t = 8: Observe new state s₈. Run diffusion again. Get next 8 actions.
The policy only runs inference every 160 ms instead of every 20 ms — 8× fewer forward passes.
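The timing argument is simple enough to put in a few lines. The helper names below are hypothetical, and the budget assumes that the next chunk is computed while the current one is executing, which is an idealization.

```python
def denoise_budget_ms(control_hz, chunk_size, n_denoise_steps):
    """Milliseconds available per denoising step, assuming inference only
    has to finish within the time it takes to execute one chunk."""
    step_ms = 1000.0 / control_hz
    return chunk_size * step_ms / n_denoise_steps

def inference_calls(total_steps, chunk_size):
    """Number of policy calls needed to cover total_steps control steps."""
    return -(-total_steps // chunk_size)   # ceiling division

single = denoise_budget_ms(50, 1, 20)    # one action per call, 20 denoise steps
chunked = denoise_budget_ms(50, 8, 20)   # chunk of 8 actions per call

print("per-step budget, k=1:", single, "ms")
print("per-step budget, k=8:", chunked, "ms")
print("policy calls over 400 control steps:",
      inference_calls(400, 1), "vs", inference_calls(400, 8))
```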
During open-loop execution, the robot doesn't react to disturbances. If something unexpected happens at t = 3, the robot won't notice until t = 8. Short chunks (k = 4) react faster but give less compute time. Long chunks (k = 16) are smoother but less reactive. Most systems use k = 8–16 with temporal ensembling: predict overlapping chunks and average them, so each action is informed by multiple observations.
Diffusion Policy (Chi et al., 2023) and ALOHA (Zhao et al., 2023) both introduced action chunking independently. π0, GR00T, and virtually every modern IL system use it. The combination of flow-matching policies with action chunking is the current state of the art for robotic imitation learning.
Even with the perfect output distribution, behavioral cloning has a deeper, more fundamental failure mode. Suppose your policy makes a tiny mistake at step 5 — it's 0.1° off from the expert's steering angle. No big deal, right?
Wrong. That tiny error puts the car in a state it never saw in training. The expert never drove 2 cm to the left of the lane center at that point. The policy has no idea what to do from this unfamiliar state — it was only trained on the expert's trajectory, not on states slightly off that trajectory.
So it makes a bigger mistake. Which puts it in an even more unfamiliar state. Which causes an even bigger mistake. The errors cascade, each one feeding the next, until the policy is so far from the training distribution that its outputs are essentially random.
Small per-step errors accumulate over time. The policy drifts into states not covered by the training data, where it has no basis for making good decisions. Errors cascade. This is the fundamental limitation of behavioral cloning.
Let ε be the per-step error probability (chance the policy makes a mistake at any given step). Let T be the episode length.
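Under these definitions, the worst-case guarantee for behavioral cloning, the same bound derived in the exercise later in this section, can be written as:

$$
\mathbb{E}[\text{total error}] = O(\varepsilon\, T^2)
$$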
Why T² and not T? Because each mistake doesn't just cost one step of error — it pushes you off the expert's trajectory, and every future step is now also wrong because you're in unfamiliar territory.
Per-step error rate: ε = 0.01 (1% chance of mistake per step).
• T = 10 steps: total error ∼ 0.01 × 100 = 1.0 (manageable)
• T = 50 steps: total error ∼ 0.01 × 2500 = 25 (disaster)
• T = 100 steps: total error ∼ 0.01 × 10000 = 100 (complete failure)
Even with 99% per-step accuracy, the worst-case bound for a 100-step task is far larger than anything you could tolerate. The bound is an upper limit, not a prediction, but it matches practice: BC works for short tasks and tends to collapse on long-horizon ones.
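The numbers in this example, next to the linear scaling you would get if errors did not compound, can be tabulated with a short script. The function names are made up, and both quantities are bound scalings with constants dropped, not measured error rates.

```python
def bc_bound(eps, T):
    """Worst-case scaling for behavioral cloning: eps * T^2."""
    return eps * T ** 2

def linear_bound(eps, T):
    """Scaling when per-step errors do not compound: eps * T."""
    return eps * T

rows = [(T, bc_bound(0.01, T), linear_bound(0.01, T)) for T in (10, 50, 100)]
for T, quad, lin in rows:
    print(f"T={T:4d}   eps*T^2={quad:7.1f}   eps*T={lin:5.1f}")
```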
Formally, the problem is a distribution mismatch. The policy is trained on states from the expert's distribution p_expert(sₜ). At test time, the policy's own mistakes cause it to visit states from a different distribution p_π(sₜ). The bigger the gap between these distributions, the worse the policy performs. BC has no mechanism to close this gap.
Think of a student driver who only practiced on a driving simulator with scripted scenarios. On the real road, the first time another car cuts them off, they freeze — they've never seen this state. They swerve too hard, ending up in the oncoming lane. Now they've really never seen this state. BC is that student driver: good in familiar territory, helpless the moment anything goes slightly off-script.
We need a way to train the policy on states it actually visits, not just states the expert visited. Enter DAgger.
BC achieves per-step error ε (probability of making a mistake at any timestep). Intuitively, with T steps, total error should be ε·T. But the actual bound is O(ε·T²).
Your task: Derive the T² bound rigorously. Show that at step t, the probability of being off-distribution is O(ε·t), and since the policy has no recovery data for off-distribution states, the expected cost from step t onward is O(T-t). Sum over all t to get T².
Setup: Let ε = P(policy takes a wrong action at step t | on the expert's trajectory). Let C_t be the cost incurred at step t.
Step 1: Define "off-trajectory" as having deviated at any prior step. P(off-trajectory at t) ≤ tε, by a union bound over the error events at the earlier steps (the union bound does not require those events to be independent).
Step 2: When on-trajectory: E[C_t] = 0 (correct action). When off-trajectory: E[C_t] = c (some constant cost per step, since the policy is in unknown territory).
Step 3: Total expected cost:
$$
\mathbb{E}\Big[\sum_{t=1}^{T} C_t\Big] = \sum_{t=1}^{T} P(\text{off at } t) \cdot c \;\le\; c \sum_{t=1}^{T} t\varepsilon = c\,\varepsilon\,\frac{T(T+1)}{2} = O(\varepsilon T^2)
$$
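Alternative derivation, following Ross et al. (2011):
Let $d_t^{\pi}$ be the state distribution at step t under π, and treat J as expected total cost (lower is better). The performance difference lemma gives:
$J(\pi) - J(\pi^*) = \sum_t \mathbb{E}_{s \sim d_t^{\pi}}[l(s)]$, where l(s) is the loss at state s.
For BC: l(s) = 0 if $s \in \mathrm{supp}(d_t^{\pi^*})$, but l(s) = O(1) otherwise. Since $d_t^{\pi}$ drifts from $d_t^{\pi^*}$ at rate ε per step, the TV distance satisfies $\lVert d_t^{\pi} - d_t^{\pi^*} \rVert \le t\varepsilon$. So E[l(s)] ≤ tε up to a constant, and summing: $J(\pi) - J(\pi^*) \le \sum_t t\varepsilon = O(\varepsilon T^2)$.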
The key insight: the bound is T² (not T) because errors cascade. A mistake at step 5 does not cost just one step; it corrupts all future steps too. DAgger fixes this by training on the policy's own distribution, making the per-step error independent of position: E[C_t] = ε regardless of t, giving Σ_t E[C_t] = εT (linear).
The insight is elegant: if the problem is that BC trains on expert states but the policy visits different states, then train on the states the policy actually visits. But we still need expert actions — so we ask the expert to label those states.
An iterative algorithm that fixes the distribution mismatch of BC. In each round: (1) run the current policy, (2) record the states it visits, (3) ask the expert what action they would take in those states, (4) add these new (state, action) pairs to the training dataset, (5) retrain.
After several rounds, the training dataset contains expert actions for the states the learned policy actually visits — including the "mistake" states that BC never saw. The policy learns what to do when things go slightly wrong: how to recover.
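The five-step loop can be sketched on a toy 1-D lane-keeping problem. Everything here is hypothetical: the expert rule, the noise level, and the lookup-table "policy" that averages expert labels per state bin and outputs 0 for bins it has never seen. It is meant to show the data flow of DAgger, not to reproduce any published setup.

```python
import numpy as np

rng = np.random.default_rng(0)
BIN_WIDTH, T = 0.5, 50

def expert_action(x):
    # Hypothetical expert: steer back toward the lane center (x = 0).
    return -0.8 * x

def to_bin(x):
    return np.floor(np.asarray(x) / BIN_WIDTH).astype(int)

def fit_policy(states, actions):
    # "Training": average the expert labels that fall in each state bin.
    bins = to_bin(states)
    return {int(b): float(np.mean(actions[bins == b])) for b in np.unique(bins)}

def policy_action(table, x):
    # A bin never seen in training gives the policy nothing to go on.
    return table.get(int(to_bin(x)), 0.0)

def rollout(act_fn):
    # Lateral offset x drifts with noise; the action is a correction.
    x, visited = 0.0, []
    for _ in range(T):
        visited.append(x)
        x = x + act_fn(x) + rng.normal(0.0, 0.3)
    return np.array(visited)

# Round 0: plain behavioral cloning on expert rollouts.
states = np.concatenate([rollout(expert_action) for _ in range(5)])
actions = expert_action(states)
table = fit_policy(states, actions)
bc_bins = set(table)

# DAgger rounds: (1) run policy, (2) record states, (3) expert labels,
# (4) aggregate, (5) retrain.
for _ in range(5):
    visited = rollout(lambda x: policy_action(table, x))
    states = np.concatenate([states, visited])
    actions = np.concatenate([actions, expert_action(visited)])
    table = fit_policy(states, actions)
dagger_bins = set(table)

print("state bins with expert labels after BC:    ", len(bc_bins))
print("state bins with expert labels after DAgger:", len(dagger_bins))
```

Because the dataset is aggregated rather than replaced, the set of states with expert labels can only grow from round to round; any newly covered bins are ones the learned policy wandered into on its own.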
Round 1: Train BC on 100 expert driving demos. Policy drifts left on turns.
Round 2: Run the policy. It ends up 10 cm left of center on a curve. Expert labels: "steer right to correct." Add 50 new (state, correction) pairs. Retrain.
Round 3: Policy is better on curves but wobbles on straights. Expert labels: "steady." Add 50 more pairs. Retrain.
Round 5: Dataset now has expert actions for the full range of states the policy encounters — including recovery states. Error bound improves to O(ε · T) instead of O(ε · T²).
DAgger requires the expert to label states on demand. The policy visits some weird state at 3 AM — who labels it? For simulated environments, you can query an oracle. For real robots with human experts, this is a serious logistical challenge. HG-DAgger addresses this.
BC trains on p_expert(s). DAgger trains on p_π(s). This is the entire difference. By closing the distribution gap, DAgger eliminates quadratic error growth. It's the same supervised learning, just on the right data.
DAgger reduces BC's O(εT²) to O(εT). The proof uses online learning theory: DAgger is a no-regret algorithm over the sequence of data distributions $d_{\pi_1}, d_{\pi_2}, \dots, d_{\pi_N}$.
Your task: Show that after N rounds of DAgger, the average policy π̄ has expected cost at most ε + O(1/√N) per step, which gives total cost O(εT) as N → ∞. Why does training on the mixture distribution fix the cascading problem?
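Step 1 (Setup): At each DAgger round i, we have policy $\pi_i$ and its induced state distribution $d_{\pi_i}$. We collect data from $d_{\pi_i}$ and retrain on all accumulated data.
Step 2 (Performance decomposition): For the mixture policy $\bar{\pi} = \frac{1}{N}\sum_i \pi_i$:
$J(\bar{\pi}) \le \frac{1}{N} \sum_i J(\pi_i)$ (by linearity of expectation, this holds with equality when one $\pi_i$ is drawn uniformly at random for each episode)
$J(\pi_i) = T \cdot \mathbb{E}_{s \sim d_{\pi_i}}[l(\pi_i, s)]$, where l is the per-step loss.
Step 3 (No-regret): Since $\pi_{i+1}$ minimizes loss on ALL data so far (best response to past distributions):
$\frac{1}{N}\sum_i \mathbb{E}_{d_{\pi_i}}[l(\pi_i)] \le \min_{\pi} \frac{1}{N}\sum_i \mathbb{E}_{d_{\pi_i}}[l(\pi)] + O(1/\sqrt{N})$
Step 4 (Best in class): The best fixed policy in the class achieves $\mathbb{E}_d[l(\pi^*)] = \varepsilon$ (classification error on the right distribution). So:
$\frac{1}{N}\sum_i \mathbb{E}[l(\pi_i)] \le \varepsilon + O(1/\sqrt{N})$
Step 5 (Total cost): After enough rounds, $J(\bar{\pi}) \le T \cdot (\varepsilon + O(1/\sqrt{N})) \approx \varepsilon T$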
The key insight: DAgger converts the sequential decision problem into an i.i.d. supervised learning problem. By training on states the policy actually visits, the per-step error is ε REGARDLESS of what happened at previous steps. No cascading. The price: you need access to the expert for N rounds of labeling. The reduction from T² to T is the fundamental reason DAgger works.
More data reduces ε (per-step error) through better generalization, but it does NOT change the T² dependence. BC's bound is O(εT²), DAgger's is O(εT). For T=100: BC error ~ ε·10,000 while DAgger error ~ ε·100. To match DAgger with BC, you'd need to reduce ε by a factor of 100 (i.e., ε_BC = ε_DAgger/100). Since supervised learning error scales roughly as ε ~ 1/√N, you'd need 10,000x more demonstrations! And this gets worse for longer horizons. More data helps BC, but it can never overcome the quadratic penalty. The fundamental issue is that BC trains on the WRONG distribution (expert states), not that it has too little data on that distribution. DAgger fixes the distribution, which no amount of expert data can do.
Vanilla DAgger has a usability problem. Running the policy, recording all states, then asking an expert to retroactively label each one? Imagine watching 10,000 frames of a robot flailing and labeling the correct action for each. Exhausting and impractical.
HG-DAgger (Human-Gated DAgger) flips the interface: the expert watches the policy run in real time and intervenes only when needed.
A variant of DAgger where the expert observes the policy executing and takes over control only when the policy is about to make a serious mistake. The intervention trajectories are added to the training set.
The expert only acts when it matters. If the policy is doing fine on a straight road, the expert sits back. When the policy starts drifting toward a curb, the expert grabs the wheel and demonstrates the recovery. The expert's time is spent on the hard cases — exactly the states where the policy needs the most help.
Task: Pick up a cup. Initial BC policy succeeds 60% of the time.
Session 1: Expert watches 20 attempts. Policy fails on off-center cups. Expert intervenes 8 times, demonstrating approach corrections. 8 new demos added. Retrain → 75% success.
Session 2: Expert watches 20 attempts. Policy now fails on tilted cups. Expert intervenes 5 times. Retrain → 85% success.
Session 3: 3 interventions for edge cases. Retrain → 92% success.
Total expert time: ~1 hour across 3 sessions, compared to the 5+ hours of continuous demonstration that vanilla DAgger would require.
HG-DAgger turns imitation learning into an iterative refinement process. The expert is a teacher who lets the student try, watches for mistakes, and corrects only when necessary. Each round of corrections targets the policy's current failure modes, making the data maximally informative.
The expert must be good at monitoring AND intervening. Some experts are great demonstrators but poor monitors — they intervene too late or too early. Too-late interventions miss the critical recovery states. Too-early interventions waste expert time on states the policy would have handled.
Vanilla DAgger is better when you have a scripted expert (a simulator oracle that can label any state instantly) or when you need theoretical guarantees about convergence.
HG-DAgger is better in the real world with human experts, where labeling time is expensive and the expert's attention is the bottleneck. Most real robotics experiments use some variant of HG-DAgger because it respects the human's time and cognitive load.
We've assumed demonstrations exist. But where do they actually come from? The answer depends heavily on the domain, and some methods are dramatically harder than others.
Some domains generate demonstrations as a byproduct of normal activity:
• Self-driving: Every human driver generates (camera image, steering angle) pairs all day long. Tesla collects billions of frames this way.
• Language: The entire internet is a demonstration of text generation. Every sentence is an expert demonstration of "what to write next."
• Game playing: Replay databases (StarCraft, Go, Chess) contain millions of expert games.
When demonstrations come for free, you can collect millions. This is a large part of why behavioral cloning works so well for self-driving and language: the sheer data volume covers enough of the state space to soften the compounding error problem in many practical scenarios, even though it does not remove the quadratic worst case.
For robotic manipulation, demonstrations don't come naturally. Someone has to actively teach the robot. Three main approaches:
| Method | How It Works | Pros | Cons |
|---|---|---|---|
| Kinesthetic teaching | Physically guide the robot arm by hand | Intuitive, direct | Awkward for complex motions, limited to compliant robots |
| Teleoperation | Control robot remotely via joystick/VR controller | Natural interface, works for any robot | Latency, requires practice to master |
| Puppeteering | Use a second identical robot arm as input device | 1:1 mapping, very precise | Requires double the hardware cost |
The Diffusion Policy paper (Chi et al., 2023) collected about 100–200 demonstrations per task via teleoperation. Each demo took ~30 seconds. Total data collection: about 1–2 hours per task. This is typical for state-of-the-art robotic IL — it's not millions of demos, it's hundreds.
YouTube is full of humans doing tasks. Can we learn from watching humans?
A human hand has 27 degrees of freedom with soft, deformable fingers and tactile sensing. A robot gripper has 1–7 DOF with rigid pincers and maybe a force sensor. Watching a human fold a towel doesn't directly tell a robot how to fold a towel — the bodies are too different. This is the embodiment gap. Bridging it is an active research frontier.
Current approaches extract high-level intentions from human videos (what to grasp, where to place) while leaving the low-level motor control to RL or a separate controller. Full "watch a YouTube video and learn the task" remains an open problem.
Not all demonstrations are equal. From best to worst: (1) same robot, same task, expert operator; (2) same robot, similar task, decent operator; (3) different robot, same task; (4) human video of the task. Each step down requires more sophisticated transfer methods. Most practical systems operate at level 1 or 2.
Imitation learning and reinforcement learning aren't rivals — they're complements. Each has strengths the other lacks:
| Property | Imitation Learning | Reinforcement Learning |
|---|---|---|
| Needs reward function? | No | Yes |
| Needs demonstrations? | Yes | No |
| Can surpass demonstrator? | No (BC) / Maybe (DAgger) | Yes |
| Self-improvement? | No framework for practice | Yes, by definition |
| Data efficiency? | Good (supervised learning) | Poor (exploration needed) |
| Long-horizon tasks? | Compounding errors | Reward signal guides recovery |
Use imitation learning to initialize the policy (start from a reasonable baseline), then use RL to improve beyond the demonstrator. IL gives you a warm start; RL gives you self-improvement. This is the recipe behind many successful real-world systems.
Pattern 1: IL then RL. Pre-train via BC on expert demos, then fine-tune with PPO or SAC using a reward function. The BC initialization means RL starts from sensible behavior instead of random flailing — dramatically faster convergence. Think of it as teaching someone the basics before sending them to practice on their own.
Pattern 2: IL + RL reward shaping. Use the distance from expert behavior as an auxiliary reward signal. The agent gets reward both for task success (RL objective) and for staying close to expert demonstrations (IL regularizer). This keeps the policy grounded while still allowing improvement beyond the demonstrator.
Pattern 3: RLHF (RL from Human Feedback). The recipe behind ChatGPT and Claude. Pre-train a language model on text (BC on internet data), then fine-tune with RL using a reward model trained on human preferences. Imitation learning provides the foundation; RL aligns the behavior.
Step 1 (IL): Pre-train on internet text. This is behavioral cloning on the largest dataset ever assembled.
Step 2 (IL): Fine-tune on curated (prompt, response) pairs. Still BC, but higher quality demos.
Step 3 (RL): PPO with a reward model trained on human rankings. The model goes beyond imitating — it optimizes for human preference.
Without step 1, RL from scratch on language would be intractable. Without step 3, you get a talented mimic but not an aligned assistant.
A pianist who only watches performances will never improve beyond mimicry. They must practice — try, fail, get feedback, adjust. BC has no equivalent. DAgger gets closer (the expert provides ongoing feedback), but only RL provides the full loop: try things the expert never did, discover something better, and keep it.
Generative Adversarial Imitation Learning (GAIL, Ho & Ermon 2016) avoids the reward-engineering problem by training a discriminator to distinguish expert from policy state-action pairs, then using the discriminator's output as a reward signal for RL.
Your task: Show that minimizing the Jensen-Shannon divergence between the occupancy measures ρπ and ρπ* (via a discriminator) is equivalent to moment matching — making the policy's state-action distribution indistinguishable from the expert's.
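Step 1: GAIL trains two models: a discriminator D(s,a) and a policy π (the generator).
Discriminator objective: $\max_D \; \mathbb{E}_{(s,a) \sim \rho_{\pi^*}}[\log D(s,a)] + \mathbb{E}_{(s,a) \sim \rho_{\pi}}[\log(1 - D(s,a))]$
Step 2: Optimal discriminator: $D^*(s,a) = \rho_{\pi^*}(s,a) \,/\, (\rho_{\pi}(s,a) + \rho_{\pi^*}(s,a))$
Step 3: With D = D*, the policy optimization becomes, up to a positive scale factor and an additive constant:
$\min_{\pi} \; \mathrm{JS}(\rho_{\pi} \,\|\, \rho_{\pi^*}) = \tfrac{1}{2}\mathrm{KL}(\rho_{\pi} \,\|\, m) + \tfrac{1}{2}\mathrm{KL}(\rho_{\pi^*} \,\|\, m)$
where $m = (\rho_{\pi} + \rho_{\pi^*})/2$.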
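Steps 2 and 3 can be checked numerically on a small discrete example. The two occupancy measures below are made-up numbers over four (state, action) pairs. Plugging the optimal discriminator into the discriminator objective gives exactly 2·JS − log 4, which is why minimizing over π amounts to minimizing the JS divergence.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

# Two made-up occupancy measures over four (state, action) pairs.
rho_expert = np.array([0.4, 0.3, 0.2, 0.1])
rho_policy = np.array([0.1, 0.2, 0.3, 0.4])

# Step 2: optimal discriminator for the objective above.
d_star = rho_expert / (rho_expert + rho_policy)

# Discriminator objective evaluated at D = D*.
objective = float(np.sum(rho_expert * np.log(d_star))
                  + np.sum(rho_policy * np.log(1.0 - d_star)))

# Step 3: Jensen-Shannon divergence via the midpoint distribution m.
m = 0.5 * (rho_expert + rho_policy)
js = 0.5 * kl(rho_policy, m) + 0.5 * kl(rho_expert, m)

print("objective at D*:", objective)
print("2 * JS - log 4 :", 2.0 * js - np.log(4.0))
```

If the two occupancy measures were identical, `d_star` would be 0.5 everywhere and the JS divergence would be zero.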
Step 4: The reward for RL is r(s,a) = -log(1 - D(s,a)) (or log D(s,a)). The policy is trained with any RL algorithm (TRPO, PPO) using this learned reward. When D can't distinguish expert from policy (D ≈ 0.5), the policy has matched the expert's occupancy measure.
Step 5 (Moment matching): ρπ = ρπ* implies E_{ρπ}[f(s,a)] = E_{ρπ*}[f(s,a)] for ALL functions f. This is moment matching: every statistic of the policy's behavior matches the expert's. This is stronger than BC, which only matches the conditional π(a|s) at states the expert visited.
The key insight: GAIL converts IL into an adversarial game. The discriminator automatically discovers which features of behavior matter (the "reward function" emerges from training). BC matches actions at expert states; GAIL matches the entire state-action distribution. This makes it more robust to compounding error, because the policy is trained on the states it visits itself, not only on the expert's states.
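The derivation above can be checked numerically on a toy problem. This is a minimal sketch assuming a world with three state-action pairs and made-up occupancy measures `rho_expert` and `rho_policy` (both hypothetical); it confirms that plugging D* into the discriminator objective gives 2·JS − log 4, and shows the reward −log(1 − D) that the policy would receive.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence with mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def optimal_discriminator(rho_expert, rho_policy):
    """D*(s,a) = rho_expert / (rho_policy + rho_expert), as in Step 2."""
    return [e / (e + p) for e, p in zip(rho_expert, rho_policy)]


def discriminator_objective(d, rho_expert, rho_policy):
    """E_expert[log D] + E_policy[log(1 - D)], as in Step 1."""
    return sum(e * math.log(di) + p * math.log(1 - di)
               for di, e, p in zip(d, rho_expert, rho_policy))


# Hypothetical occupancy measures over three (state, action) pairs.
rho_expert = [0.7, 0.2, 0.1]
rho_policy = [0.2, 0.3, 0.5]

d_star = optimal_discriminator(rho_expert, rho_policy)
value = discriminator_objective(d_star, rho_expert, rho_policy)
rewards = [-math.log(1 - d) for d in d_star]  # Step 4 reward for the policy

print("objective at D*:", value)
print("2*JS - log 4   :", 2 * js(rho_policy, rho_expert) - math.log(4))
print("rewards        :", rewards)
```

The reward is largest on the pair the expert visits most and the policy visits least, which is exactly where the policy needs to shift probability mass.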
Real-world approaches (public knowledge from Tesla AI Day, Waymo papers):
1. Data imbalance: Mine hard examples aggressively. Tesla's "shadow mode" runs the policy in the background on fleet vehicles, flags disagreements between policy and human, and collects those scenarios preferentially. Result: 100x oversampling of edge cases. Also: data augmentation (synthetic rain, night, occlusion) to inflate rare conditions.
2. Long horizon: Hierarchical decomposition: (a) Route planner (minutes ahead, graph search), (b) Behavior planner (seconds ahead, learned IL policy decides lane changes/speed), (c) Motion controller (milliseconds, PID/MPC tracks the behavior plan). IL operates at layer (b) with a 2-5 second horizon. Action chunking: predict 1-2 seconds of waypoints and replan at 10 Hz. This keeps the effective horizon short (~20 steps).
3. Ultra-rare events: Simulation. Build a photorealistic simulator and procedurally generate edge cases: animal crossings, debris, construction. Train a separate "emergency" policy on simulated data. At runtime: a classifier detects "unusual scenario" and switches to the emergency policy. Also: RL fine-tuning in simulation for collision avoidance.
4. Multimodality: Tesla uses a learned trajectory generation model (similar to diffusion policy) that outputs multiple candidate trajectories. A separate scoring model picks the best one considering safety, comfort, and traffic rules. This avoids the averaging trap entirely.
5. Safety layer: Multiple levels: (a) Rule-based collision checker (physics-based time-to-collision), (b) Learned safety critic that predicts P(collision | action), (c) Hardware emergency brake (AEB) as final fallback. The IL policy's output is always filtered through safety checks before execution. If ANY check fails, the system executes a minimal-risk maneuver (slow to a stop, stay in lane).
Key lesson: No production AV system relies on pure IL. It's always IL + safety systems + hierarchical planning + simulation for edge cases + fleet learning for data. The IL component handles the "normal driving" 99% efficiently; everything else handles the 1% safely.
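The safety layer in item 5 can be pictured as a filter wrapped around the IL policy's output. The sketch below is an illustration only, not any vendor's implementation: the `Action` type, the thresholds `ttc_min_s` and `p_max`, and the fallback `MINIMAL_RISK` are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Action:
    accel: float  # m/s^2, negative means braking
    steer: float  # radians


# Hypothetical fallback: brake gently and hold the lane.
MINIMAL_RISK = Action(accel=-3.0, steer=0.0)


def time_to_collision(gap_m: float, closing_speed_mps: float) -> float:
    """Physics-based TTC; infinite when the gap is not shrinking."""
    if closing_speed_mps <= 0:
        return float("inf")
    return gap_m / closing_speed_mps


def safety_filter(proposed: Action, gap_m: float, closing_speed_mps: float,
                  p_collision: float, ttc_min_s: float = 2.0,
                  p_max: float = 0.01) -> Action:
    """Pass the IL action through only if every check succeeds."""
    checks = [
        time_to_collision(gap_m, closing_speed_mps) >= ttc_min_s,  # rule-based
        p_collision <= p_max,                                      # learned critic
    ]
    return proposed if all(checks) else MINIMAL_RISK


il_action = Action(accel=1.0, steer=0.02)
print(safety_filter(il_action, gap_m=50.0, closing_speed_mps=5.0, p_collision=0.001))
print(safety_filter(il_action, gap_m=5.0, closing_speed_mps=5.0, p_collision=0.001))
```

The design point is that the learned policy never has the final word: any single failed check overrides it.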
BC is offline RL with the constraint turned to maximum (no improvement allowed). Filtered BC is halfway — it imitates only the good data. AWR/IQL complete the spectrum: imitate the data but weight by quality. The entire progression from BC to offline RL is one dial: how much do you trust the advantage estimates to allow deviation from the data?
If you gave BC access to a reward function but no environment interaction, what method would you get? (Answer: filtered BC or AWR — rank demonstrations by return, weight accordingly.)
RT-2, Octo, π0, OpenVLA, GR00T — these are all imitation learning systems. The core algorithm is still BC (minimize action prediction error on demonstrations). What changed: (1) scale — millions of demonstrations across hundreds of tasks, (2) architecture — pre-trained vision-language backbones provide generalizable features, (3) output — diffusion/flow-matching policies handle multimodality. The compounding error problem is mitigated (not solved) by having such a broad training distribution that the policy rarely encounters truly OOD states.
What would it take for a VLA to truly solve compounding errors? (Hint: online fine-tuning = DAgger at scale. This is the RLHF equivalent for robotics.)
def dagger(env, expert_policy, policy, n_rounds=10, n_episodes=5, n_epochs=50):
    """DAgger: aggregate expert labels on the states the learner itself visits.

    Assumes the classic Gym API (reset() -> state, step(a) -> (state, reward,
    done, info)) and a policy object with train(dataset, epochs=...) and
    predict(state). The policy is updated in place.
    """
    # Step 0: Seed with expert demonstrations
    dataset = {'states': [], 'actions': []}
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            a = expert_policy(s)
            dataset['states'].append(s)
            dataset['actions'].append(a)
            s, _, done, _ = env.step(a)

    # DAgger rounds
    for round_i in range(n_rounds):
        # Step 1: Train policy on ALL accumulated data
        policy.train(dataset, epochs=n_epochs)
        # Step 2: Roll out CURRENT policy
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                a_policy = policy.predict(s)   # policy's action
                # Step 3: Expert labels this state
                a_expert = expert_policy(s)    # what expert WOULD do
                # Step 4: Add to dataset
                dataset['states'].append(s)
                dataset['actions'].append(a_expert)  # expert action!
                # Execute policy's action (not expert's)
                s, _, done, _ = env.step(a_policy)

    # Final fit so the labels gathered in the last round are also used
    policy.train(dataset, epochs=n_epochs)
    return dataset
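To try the function above, you need an environment, an expert, and a policy that match the interface it expects. The following is a toy setup, entirely hypothetical: `LineEnv` is a made-up 1-D task, the expert is a hand-written rule, and `LinearPolicy` is a one-parameter model fitted by closed-form least squares (so it ignores `epochs`).

```python
import random


class LineEnv:
    """Hypothetical 1-D task: drive position x toward 0 within a fixed horizon."""

    def __init__(self, horizon=20, seed=0):
        self.horizon = horizon
        self.rng = random.Random(seed)
        self.t = 0
        self.x = 0.0

    def reset(self):
        self.t = 0
        self.x = self.rng.uniform(-1.0, 1.0)
        return self.x

    def step(self, a):
        self.x += a
        self.t += 1
        done = self.t >= self.horizon
        # Classic Gym-style 4-tuple: (state, reward, done, info)
        return self.x, -abs(self.x), done, {}


def expert_policy(s):
    """Expert moves halfway toward the origin on every step."""
    return -0.5 * s


class LinearPolicy:
    """Scalar policy a = w * s, fitted by least squares on (state, action) pairs."""

    def __init__(self):
        self.w = 0.0

    def train(self, dataset, epochs=1):
        # Closed-form fit; `epochs` is accepted only to match the interface.
        states, actions = dataset['states'], dataset['actions']
        denom = sum(s * s for s in states)
        if denom > 0:
            self.w = sum(s * a for s, a in zip(states, actions)) / denom

    def predict(self, s):
        return self.w * s
```

With these in scope, a call such as `dagger(LineEnv(), expert_policy, LinearPolicy(), n_rounds=3)` runs end to end, and the policy's weight `w` should settle at the expert's gain of -0.5.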
| Method | Type | Key Idea | Error Scaling |
|---|---|---|---|
| Behavioral Cloning | Offline | Supervised regression on demos | O(εT²) |
| BC + Expressive Policy | Offline | Diffusion/MoG output avoids averaging | O(εT²) |
| DAgger | Online | Train on policy-visited states | O(εT) |
| HG-DAgger | Online | Expert intervenes only when needed | O(εT) |
| IL + RL | Both | IL initializes, RL improves | Can surpass expert |
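The gap between O(εT²) and O(εT) is easy to underestimate. The sketch below uses a simplified "tightrope" model, which is an assumption of this example: the learner errs with probability ε per step; under BC a single error leaves it off-distribution (cost 1 per step) for the rest of the episode, while a DAgger-trained learner recovers and pays only for the erroneous step.

```python
def bc_expected_cost(eps: float, T: int) -> float:
    """Expected off-distribution steps when one mistake is never recovered from.

    P(already fallen at step t) = 1 - (1 - eps)**t, summed over the episode.
    Bounded above by eps * T * (T + 1) / 2, i.e. O(eps * T^2).
    """
    return sum(1 - (1 - eps) ** t for t in range(1, T + 1))


def dagger_expected_cost(eps: float, T: int) -> float:
    """Expected mistakes when the learner recovers after each one: O(eps * T)."""
    return eps * T


eps = 0.01
for T in (10, 50, 100, 500):
    print(T, round(bc_expected_cost(eps, T), 2), round(dagger_expected_cost(eps, T), 2))
```

Even with a 1% per-step error rate, the unrecoverable case accumulates cost far faster than the recoverable one as the horizon grows, which is the practical meaning of the two rows in the table.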
| Output Type | Multimodal? | Continuous? | Expressivity |
|---|---|---|---|
| Categorical (softmax) | Yes | No (discrete only) | Maximal for discrete |
| Gaussian (μ, σ) | No (unimodal) | Yes | Low |
| Mixture of Gaussians | Yes (K modes) | Yes | Medium (must choose K) |
| Autoregressive | Yes | Discretized | High |
| Diffusion | Yes (arbitrary) | Yes | Highest |
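The "Multimodal?" column matters because a unimodal regressor averages incompatible expert choices. A minimal sketch with synthetic data, assuming an expert who steers left (−1) or right (+1) around an obstacle with equal probability; the sign-based split into two modes is a stand-in for a real mixture fit, used only to keep the example short.

```python
import random


def make_demos(n=1000, seed=0):
    """Synthetic expert actions: swerve left (-1) or right (+1), small noise."""
    rng = random.Random(seed)
    return [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.05) for _ in range(n)]


def fit_unimodal(actions):
    """MSE-optimal single prediction: the mean of the data."""
    return sum(actions) / len(actions)


def fit_two_modes(actions):
    """Crude 2-component fit: split by sign, return (left_mean, right_mean, p_left)."""
    left = [a for a in actions if a < 0]
    right = [a for a in actions if a >= 0]
    return sum(left) / len(left), sum(right) / len(right), len(left) / len(actions)


def sample_two_modes(params, rng):
    """Sample a mode center according to its empirical frequency."""
    left_mean, right_mean, p_left = params
    return left_mean if rng.random() < p_left else right_mean


demos = make_demos()
print("unimodal prediction:", fit_unimodal(demos))   # near 0: straight into the obstacle
params = fit_two_modes(demos)
print("two-mode fit       :", params)
print("sampled action     :", sample_two_modes(params, random.Random(1)))
```

The unimodal prediction lands between the two modes, an action the expert never took, while the two-mode model always commits to one side.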
Do you have demonstrations?
• No → You need RL. Imitation learning requires demos by definition.
• Yes → Is the task short-horizon (<50 steps)?
    • Yes → BC with diffusion policy is likely sufficient.
    • No → Can you interact with the environment?
        • Yes + expert available → DAgger / HG-DAgger.
        • Yes + reward available → BC pre-train, then RL fine-tune.
        • No → BC with massive data and expressive policy. Hope for the best.
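The decision tree above can be written as a small function. Two details are assumptions of this sketch, since the tree does not cover them: when both an expert and a reward are available the expert branch is checked first, and when interaction is possible but neither is available the function falls back to the offline recommendation.

```python
def choose_method(has_demos: bool, short_horizon: bool = False,
                  can_interact: bool = False, expert_available: bool = False,
                  reward_available: bool = False) -> str:
    """Return the recommended approach for a given problem setup."""
    if not has_demos:
        return "RL"
    if short_horizon:  # fewer than ~50 steps
        return "BC with diffusion policy"
    if can_interact:
        if expert_available:
            return "DAgger / HG-DAgger"
        if reward_available:
            return "BC pre-train, then RL fine-tune"
    # No interaction (or nothing to learn from online): stay offline.
    return "BC with massive data and expressive policy"


print(choose_method(has_demos=True, short_horizon=True))
print(choose_method(has_demos=True, can_interact=True, expert_available=True))
print(choose_method(has_demos=True, can_interact=True, reward_available=True))
print(choose_method(has_demos=True))
```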
| Failure | Symptom | Fix |
|---|---|---|
| Averaging | Policy outputs mean of multimodal data | Expressive policy (diffusion, MoG) |
| Compounding | Works for 10 steps, diverges by 50 | DAgger / HG-DAgger |
| Distribution shift | Policy visits unseen states | Train on policy-visited states |
| Embodiment gap | Can't transfer from human video | Task-level (not motion-level) transfer |
| Insufficient data | Policy memorizes, doesn't generalize | More demos, data augmentation |
Imitation learning = supervised learning on expert demonstrations, where the main enemies are multimodal actions (handled by expressive policies) and compounding errors (reduced from O(εT²) to O(εT) by DAgger). Combine with RL to go beyond imitation.
Actor-Critic methods combine a learned policy (actor) with a learned value function (critic), enabling efficient RL fine-tuning on top of IL pre-training. PPO, the algorithm behind RLHF, builds directly on the policy gradient + KL constraint ideas from the previous lecture, applied after an imitation learning warm-start.
Pre-train with imitation (massive data, supervised learning). Fine-tune with RL (reward signal, self-improvement). Constrain with KL (don't drift too far from pre-training). This pattern repeats across language models, robot policies, and game agents. Master imitation learning, and you've mastered the foundation.