Capuano et al., Chapter 4

Imitation Learning

From behavioral cloning to diffusion policies — teaching robots by showing, not telling.

Prerequisites: Chapter 3 (Reinforcement Learning). RL optimized rewards; now we optimize closeness to demonstrations.
12 chapters · 5+ simulations · 12 quizzes

Chapter 0: Why Imitate?

In Chapter 3 we learned RL: define a reward, explore for millions of steps, converge on a policy. For a shirt-folding task, that means a robot randomly flailing fabric for days until it stumbles onto something that looks like a fold. Absurd.

A human can fold a shirt in 30 seconds. Why not just record that and train a neural network to copy it? That's behavioral cloning (BC), the simplest form of imitation learning. Collect a dataset of demonstrations D = {(o_i, a_i)}, where o_i is an observation (image, joint state) and a_i is the expert's action, then minimize:

L(θ) = E_{(o, a) ~ D}[ || f_θ(o) - a ||^2 ]

This is just supervised regression. The network fθ maps observations to actions. MSE loss. Backprop. Done. No reward function, no environment resets, no safety concerns during training.

BC is typically far more sample-efficient than RL. With around 50 demonstrations (minutes of human effort for a short task), BC can learn a reasonable policy, where RL might need tens of thousands of episodes for the same task. The tradeoff: BC can't improve beyond the demonstrator and struggles with distribution shift.

But there's a subtle, devastating problem with the MSE objective. Consider a task where you can reach around an obstacle from either the left or the right. Both are valid. The demonstrations contain both modes. What does MSE do? It averages them — predicting an action that goes straight through the obstacle.

Multimodal Demonstration Problem

Two groups of demonstrations (orange dots) go around an obstacle from different sides. MSE regression (blue x) averages them, producing an invalid action.

The mode-averaging catastrophe. MSE loss finds the mean of all demonstrated actions. When demonstrations are multimodal (multiple valid strategies), the mean can be a strategy that is never valid. This is not a training bug — it's a fundamental limitation of unimodal regression. Fixing this requires generative models (Chapters 2-6).
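
The averaging effect is easy to reproduce numerically. The sketch below assumes a hypothetical 1D "lateral offset" action where left-side demonstrations cluster near -1 and right-side ones near +1 (these numbers are illustrative, not from the chapter). For a fixed observation, the MSE-optimal prediction is the sample mean, which lands near 0, far from every demonstrated action.

```python
import random

random.seed(0)

# Hypothetical 1D actions: half the demos pass the obstacle on the left
# (about -1), half on the right (about +1).
left = [random.gauss(-1.0, 0.05) for _ in range(100)]
right = [random.gauss(1.0, 0.05) for _ in range(100)]
demos = left + right

# For one fixed observation, the constant that minimizes mean squared
# error is the sample mean of the demonstrated actions.
mse_action = sum(demos) / len(demos)

# Distance from the MSE prediction to the closest demonstrated action.
gap = min(abs(a - mse_action) for a in demos)

print(f"MSE-optimal action: {mse_action:+.3f}")
print(f"closest demonstration is {gap:.3f} away")
```

No demonstration is anywhere near the predicted action, which is exactly the "straight through the obstacle" failure described above.
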
Why does MSE loss fail when demonstrations contain multiple valid strategies?

Chapter 1: Compounding Errors

Even if the demonstrations are unimodal and MSE works perfectly, BC has a second fundamental problem: covariate shift.

During training, the network sees states from the expert's trajectory. During deployment, it sees states from its own trajectory. The network makes a small error at step 1. That error shifts the state slightly. At step 2, the state is now slightly out of distribution. The error grows. By step 50, the robot is in a state the expert never visited, and the policy outputs nonsense.

Think of it like steering. An expert drives straight down a road. A cloned driver drifts 1 cm right on the first step. Now it's at a state the expert never demonstrated from. It drifts further. By step 100, it's in the ditch. The expert's data only covered the center of the road — not recovery from drift.

Formally, in the worst case the error compounds quadratically in the horizon. If the per-step error is ε, the total error after T steps is:

E_total = O(T^2 · ε)

This is because each step's error shifts the state, causing additional error at subsequent steps. For a 100-step trajectory with ε = 0.01, the total error is on the order of T^2 · ε = 100, rather than the T · ε = 1 you would get if errors simply added up.
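
To get a feel for the gap between linear and quadratic growth, here is a small sketch under an assumed toy model (not taken from the chapter): the policy slips with probability ε at each on-distribution step, and once it has slipped it never recovers, so every later step counts as an error. The expected number of bad steps is then the sum over t of 1 - (1 - ε)^t, which grows roughly like ε · T^2 / 2 while ε · T is small.

```python
def expected_cost_with_drift(T: int, eps: float) -> float:
    """Expected number of off-distribution steps when a slip is never recovered."""
    # Probability of having slipped at least once by step t is 1 - (1 - eps)^t.
    return sum(1.0 - (1.0 - eps) ** t for t in range(1, T + 1))


def expected_cost_independent(T: int, eps: float) -> float:
    """Expected number of errors if each step failed independently (no drift)."""
    return T * eps


T, eps = 100, 0.01
drift = expected_cost_with_drift(T, eps)
indep = expected_cost_independent(T, eps)

print(f"independent errors: {indep:.2f}")
print(f"with drift:         {drift:.2f}")
print(f"T^2 * eps bound:    {T * T * eps:.2f}")
```

Under this toy model the drifting policy accumulates tens of bad steps where independent errors would give about one, and it stays below the T^2 · ε worst-case bound.
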

Compounding Error Visualization

The green line is the expert trajectory. The orange line is the cloned policy.

Mitigation: DAgger

Dataset Aggregation (DAgger) addresses covariate shift by iteratively collecting new data. Run the learned policy, visit novel states, query the expert for the correct action at those states, add to the dataset, retrain. Each round covers more of the state space the learned policy actually visits.
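
The loop described above can be sketched on a toy problem. Everything here is an illustrative assumption: a 1D "lane offset" state, a linear expert, a linear learner fit by least squares, and noisy dynamics. Because the toy expert is linear and noise-free, even the first fit is already accurate; the point of the sketch is the run, relabel, aggregate, retrain structure.

```python
import random

random.seed(0)


def expert(s: float) -> float:
    """Hypothetical expert: steer back toward the lane center at s = 0."""
    return -0.5 * s


def fit_gain(data):
    """Least-squares fit of a linear policy a = w * s (no bias term)."""
    num = sum(s * a for s, a in data)
    den = sum(s * s for s, _ in data)
    return num / den


def rollout(w, steps=20):
    """Run the learned policy under actuation noise; return visited states."""
    s, visited = random.uniform(-1.0, 1.0), []
    for _ in range(steps):
        visited.append(s)
        s = s + w * s + random.gauss(0.0, 0.1)
    return visited


# Round 0: plain behavioral cloning on states near the expert's path.
dataset = [(s, expert(s)) for s in (random.uniform(-0.1, 0.1) for _ in range(20))]
w = fit_gain(dataset)

for _ in range(3):
    states = rollout(w)                          # 1. run the learner
    dataset += [(s, expert(s)) for s in states]  # 2. expert relabels, 3. aggregate
    w = fit_gain(dataset)                        # 4. retrain on everything

print(f"dataset size: {len(dataset)}, learned gain: {w:.3f}")
```
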

But DAgger requires an expert available during training — a human on standby to label states on the fly. For real robots, this is expensive. Modern approaches solve the problem differently: instead of fixing the data distribution, they fix the policy architecture to avoid error accumulation. That's where action chunking comes in (Chapter 7).

Why do behavioral cloning errors compound over time?

Chapter 2: Generative Models for Actions

The mode-averaging problem from Chapter 0 has a clean solution: instead of predicting a single action fθ(o), model the full conditional distribution p(a|o). Then at inference time, sample from this distribution. Each sample will be a valid action — either going left or going right around the obstacle — never the invalid average.

The paradigm shift: BC with MSE = discriminative model (predicts one answer). Generative BC = generative model (captures the full distribution of valid answers, then samples one).

How do we model p(a|o)? With latent variable models. Introduce a latent variable z that captures the "which mode are we in?" decision:

p(a | o) = ∫ p(a | o, z) · p(z) dz

The latent z acts as a "strategy selector." When z is sampled from one region of latent space, the model outputs a go-left action. From another region, go-right. The integral marginalizes over all strategies, giving a proper multimodal distribution.
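
A minimal sketch of the "strategy selector" idea, using a hand-written decoder rather than a learned one (the decoder, the mode centers at -1 and +1, and the noise level are all assumptions for illustration). Sampling z from the prior and decoding yields actions from both modes, and none of them falls in the invalid middle.

```python
import random

random.seed(0)


def decode(z: float) -> float:
    """Toy stand-in for p(a | o, z): the sign of z picks the mode."""
    center = -1.0 if z < 0 else 1.0          # go-left vs. go-right
    return center + random.gauss(0.0, 0.05)  # small within-mode variation


# Ancestral sampling: z ~ p(z) = N(0, 1), then a ~ p(a | o, z).
samples = [decode(random.gauss(0.0, 1.0)) for _ in range(1000)]
n_left = sum(a < 0 for a in samples)

print(f"left: {n_left}, right: {len(samples) - n_left}")
print(f"smallest |a|: {min(abs(a) for a in samples):.3f}")
```
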

Three families of generative models are used for robot action prediction, each implementing this idea differently:

VAE (Ch 3)
Learn a latent z with encoder-decoder. Sample z ~ N(0, I), decode to action.
Diffusion (Ch 4-5)
Start from noise zT ~ N(0, I), iteratively denoise to a valid action.
Flow Matching (Ch 6)
Learn a velocity field that transports noise to actions in one continuous flow.

All three can model multimodal distributions. They differ in training stability, inference speed, and sample quality. Let's derive each one.

What is the role of the latent variable z in generative action models?

Chapter 3: VAE Derivation

The Variational Autoencoder gives us a principled way to learn latent variable models. We want to maximize log p(a|o) but that integral over z is intractable. The VAE sidesteps this with a variational lower bound (ELBO).

The ELBO

Introduce an approximate posterior qφ(z | o, a) — an encoder that guesses which latent z produced a given (observation, action) pair. Then:

log p(a | o) ≥ E_{z ~ q_φ}[ log p_θ(a | o, z) ] - KL( q_φ(z | o, a) || p(z) )

This is the Evidence Lower Bound (ELBO). Two terms:

Reconstruction term: E_{z ~ q_φ}[ log p_θ(a | o, z) ]. "Given a latent z sampled from the encoder, how well does the decoder reconstruct the original action?" In practice: sample z from q, decode to a predicted action, and compute MSE (which equals the negative log-likelihood, up to constants, for a Gaussian decoder with fixed variance). Maximizing this term makes the model faithful to the data.
KL regularization term: -KL( q_φ(z | o, a) || p(z) ). "How close is the encoder's posterior to the prior N(0, I)?" This keeps the latent space organized: smooth, continuous, without "holes." At inference time we sample z from the prior p(z) = N(0, I), so latents drawn from the prior must decode to valid actions.

Worked Example: KL Between Two 1D Gaussians

Suppose q = N(μ_q, σ_q^2) and p = N(μ_p, σ_p^2). The KL divergence has a closed form:

KL(q || p) = log(σ_p / σ_q) + (σ_q^2 + (μ_q - μ_p)^2) / (2 σ_p^2) - 1/2
Worked example. Let q = N(2, 0.5^2) and p = N(0, 1), so μ_p = 0 and σ_p = 1.

KL = log(1/0.5) + (0.25 + 4)/(2) - 0.5
  = log(2) + 4.25/2 - 0.5
  = 0.693 + 2.125 - 0.5
  = 2.318

This is fairly large — the encoder's posterior (mean 2, std 0.5) is far from the prior (mean 0, std 1). The KL term in the ELBO would penalize this, pushing the encoder to produce latents closer to N(0, 1).
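
The arithmetic above can be checked in a few lines. The function name is illustrative; the second expression is the log-variance form commonly used in VAE training code, evaluated for the same q, to show the two agree.

```python
import math


def kl_gauss_1d(mu_q, sigma_q, mu_p=0.0, sigma_p=1.0):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for 1D Gaussians."""
    return (
        math.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p) ** 2) / (2 * sigma_p**2)
        - 0.5
    )


kl = kl_gauss_1d(2.0, 0.5)

# Same quantity in the log-variance form, with p = N(0, 1).
mu, log_var = 2.0, math.log(0.5**2)
kl_logvar = -0.5 * (1 + log_var - mu**2 - math.exp(log_var))

print(f"closed form:       {kl:.3f}")
print(f"log-variance form: {kl_logvar:.3f}")
```
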

VAE Training Loop

# VAE training step (PyTorch-style pseudocode)
import torch

def vae_loss(obs, action, encoder, decoder, beta=1.0):
    """Negative ELBO for a conditional VAE. beta weights the KL term (default is an assumption)."""
    # Encode: observation + action → latent distribution
    mu, log_var = encoder(obs, action)  # shapes: [B, d_z]

    # Reparameterize: sample z without breaking gradients
    std = torch.exp(0.5 * log_var)       # [B, d_z]
    eps = torch.randn_like(std)           # [B, d_z]
    z = mu + std * eps                    # [B, d_z]

    # Decode: observation + z → predicted action
    a_pred = decoder(obs, z)              # [B, d_a]

    # Reconstruction loss (MSE), summed over action dims, averaged over batch
    recon = ((a_pred - action) ** 2).sum(dim=-1).mean()

    # KL divergence to N(0, I) (closed form for Gaussians)
    kl = -0.5 * (1 + log_var - mu**2 - log_var.exp()).sum(dim=-1).mean()

    return recon + beta * kl  # beta controls regularization strength
The reparameterization trick is essential. We can't backpropagate through a sampling operation (z ~ N(μ, σ^2)). Instead, sample ε ~ N(0, I) and compute z = μ + σ · ε. Now z is a deterministic function of μ, σ, and ε, and gradients flow through μ and σ normally.
In the VAE's ELBO, what does the KL divergence term enforce?

Chapter 4: Diffusion Models

VAEs work, but their latent space is a single bottleneck — it must compress all variation into one vector z. For highly multimodal action distributions, this can be limiting. Diffusion models take a different approach: instead of encoding then decoding, they learn to gradually denoise random noise into valid data.

The Forward Process (Adding Noise)

Start with a clean data point z0 (an action sequence from the expert). Over T steps, gradually add Gaussian noise until the signal is destroyed:

q(z_t | z_{t-1}) = N(z_t; √(1 - β_t) · z_{t-1}, β_t I)

At each step t, the sample z_t is a slightly noisier version of z_{t-1}. The noise schedule β_1, …, β_T controls how fast the signal degrades. After T steps (typically T = 1000), z_T is distributed approximately as N(0, I): pure noise.

A key insight: we can jump directly from z_0 to any z_t without computing intermediate steps. Define α_t = 1 - β_t and ᾱ_t = ∏_{i=1}^{t} α_i:

q(z_t | z_0) = N(z_t; √ᾱ_t · z_0, (1 - ᾱ_t) I)
This means: z_t = √ᾱ_t · z_0 + √(1 - ᾱ_t) · ε, where ε ~ N(0, I). At any step t, the noisy sample is a weighted combination of the original signal and random noise. As t increases, the signal weight √ᾱ_t shrinks toward 0 and the noise weight √(1 - ᾱ_t) grows toward 1.
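
The closed-form jump can be verified by tracking the mean coefficient and variance of z_t given z_0 through the step-by-step recursion and comparing against √ᾱ_t and 1 - ᾱ_t. The linear schedule from 1e-4 to 0.02 below is an assumption (a common DDPM choice), not something specified in this chapter.

```python
import math

T = 1000
# Assumed linear beta schedule from 1e-4 to 0.02.
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]

# Cumulative product: alpha_bar[t] = prod of (1 - beta_i) up to step t.
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)

# Step-by-step: each forward step scales the mean by sqrt(1 - beta)
# and maps the variance v to (1 - beta) * v + beta.
mean_coef, var = 1.0, 0.0
for beta in betas:
    mean_coef *= math.sqrt(1.0 - beta)
    var = (1.0 - beta) * var + beta

print(f"signal weight after T steps: {mean_coef:.5f}")
print(f"noise variance after T steps: {var:.5f}")
```

After all T steps the signal weight is close to 0 and the variance is close to 1, which is the "pure noise" endpoint.
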

The Reverse Process (Denoising)

Training = learn to reverse the noise process. Starting from pure noise zT, we want to iteratively "denoise" back to z0. The reverse process is also Gaussian:

p_θ(z_{t-1} | z_t) = N(z_{t-1}; μ_θ(z_t, t), σ_t^2 I)

The neural network μθ predicts the mean of the "cleaned up" sample at each step. After T denoising steps, we arrive at z0 — a sample from the learned distribution.

Forward & Reverse Diffusion

Left = clean action, right = pure noise. The forward process adds noise; the reverse process (learned) removes it.
In the forward diffusion process, what happens as t increases from 0 to T?

Chapter 5: Noise Prediction

The reverse process requires predicting μθ(zt, t) — the denoised mean at each step. But Ho et al. (DDPM) showed that it's simpler and more stable to predict the noise ε instead. Here's the connection:

Since z_t = √ᾱ_t · z_0 + √(1 - ᾱ_t) · ε, we can rearrange to recover z_0:

z_0 = (z_t - √(1 - ᾱ_t) · ε) / √ᾱ_t

So if a network ε_θ(z_t, t) predicts the noise that was added, we can reconstruct z_0. The DDPM simplified loss is:

L = E_{z_0, t, ε}[ || ε - ε_θ(z_t, t) ||^2 ]

That's the entire training objective. Sample a clean action z0 from the dataset. Sample a random time step t. Sample noise ε ~ N(0, I). Compute zt. Train the network to predict ε from zt and t. Beautiful in its simplicity.

From denoising score matching to noise prediction. The score of the noised conditional, ∇_{z_t} log q(z_t | z_0), equals -ε / √(1 - ᾱ_t). Predicting the noise therefore amounts to learning a scaled version of the score. This connection is why diffusion models are sometimes called "score-based generative models."
Worked example: Compute zt from z0.

Given: z_0 = [1.5, -0.3] (a 2D action), t = 50, ᾱ_50 = 0.36, ε = [0.8, -1.2]

z_50 = √0.36 · [1.5, -0.3] + √(1 - 0.36) · [0.8, -1.2]
     = 0.6 · [1.5, -0.3] + 0.8 · [0.8, -1.2]
     = [0.9, -0.18] + [0.64, -0.96]
     = [1.54, -1.14]

The noisy sample z_50 still has some signal from z_0 (the 0.6 factor) but significant noise contamination (the 0.8 factor). The neural network's job: given z_50 and t=50, predict ε = [0.8, -1.2].
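
The same numbers can be pushed through both directions: noising the clean action, then recovering it from the noisy sample and the true noise via the rearranged formula. This is a plain-Python check of the worked example; variable names are illustrative.

```python
import math

alpha_bar_t = 0.36
z0 = [1.5, -0.3]
eps = [0.8, -1.2]

signal = math.sqrt(alpha_bar_t)        # 0.6
noise = math.sqrt(1.0 - alpha_bar_t)   # 0.8

# Forward jump: z_t = sqrt(alpha_bar) * z_0 + sqrt(1 - alpha_bar) * eps
zt = [signal * z + noise * e for z, e in zip(z0, eps)]

# Inverse: z_0 = (z_t - sqrt(1 - alpha_bar) * eps) / sqrt(alpha_bar)
z0_rec = [(z - noise * e) / signal for z, e in zip(zt, eps)]

print("z_t:", [round(v, 2) for v in zt])
print("recovered z_0:", [round(v, 2) for v in z0_rec])
```

With a perfect noise prediction the clean action is recovered exactly, which is why predicting ε is equivalent to predicting z_0.
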

Training Algorithm

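# DDPM training loop (simplified)
# Assumes model, optimizer, dataloader, the step count T, and a precomputed
# tensor alpha_bar of shape [T] (on the same device as the data) exist.
import torch

for batch in dataloader:
    z0 = batch["actions"]                     # [B, T_a, d_a] action chunks
    obs = batch["obs"]                        # conditioning observation (key name assumed)
    B = z0.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)  # random timesteps
    eps = torch.randn_like(z0)                # noise sample

    # Forward process: compute z_t directly
    alpha_bar_t = alpha_bar[t].view(B, 1, 1)  # reshape to broadcast over [T_a, d_a]
    z_t = alpha_bar_t.sqrt() * z0 + (1 - alpha_bar_t).sqrt() * eps

    # Predict noise
    eps_pred = model(z_t, t, obs)             # conditioned on observation

    # Simple MSE loss on noise
    loss = ((eps - eps_pred) ** 2).mean()
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss.backward()
    optimizer.step()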
In DDPM, what does the neural network learn to predict?

Chapter 6: Flow Matching

Diffusion models work by iterating through many small denoising steps (100-1000). This is slow. Flow matching reformulates the problem: instead of a discrete Markov chain, define a continuous-time ODE that transforms noise into data in one smooth flow.

The Velocity Field

Define a path from noise x_0 ~ N(0, I) to data x_1. The simplest path is a straight line:

x_t = (1 - t) · x_0 + t · x_1,   t ∈ [0, 1]

The velocity along this path is the time derivative:

u_t(x | x_1) = x_1 - x_0

This is the conditional optimal transport velocity: the straight-line direction from noise to data. A neural network v_θ(x_t, t) is trained to predict this velocity:

L = E_{t, x_0, x_1}[ || v_θ(x_t, t) - u_t(x | x_1) ||^2 ]
Why flow matching is simpler than diffusion: No noise schedule to design. No forward/reverse Markov chain. No score function connection to worry about. Just: "learn a velocity field that pushes noise toward data along straight lines." The loss is a simple regression on the velocity.

Inference

At inference time, sample x0 ~ N(0, I) and integrate the learned velocity field:

x_{t+Δt} = x_t + Δt · v_θ(x_t, t)

This is an Euler step of the ODE dx/dt = vθ(x, t). Because the paths are approximately straight, even a few Euler steps (5-10) give high-quality samples. Compare this to diffusion's 100-1000 denoising steps.
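
Here is a sketch of Euler integration with a hand-written velocity field standing in for a trained network. The stand-in is an assumption chosen so the answer is known: when all data sits at a single point x_1, the exact marginal field for straight-line paths is (x_1 - x) / (1 - t). With it, five Euler steps carry any noise sample to x_1.

```python
import random

random.seed(0)


def velocity(x, t, x1):
    """Stand-in for v_theta: exact field when the data is the single point x1."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, x1)]


def euler_sample(x0, x1, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-size Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt                      # t stays strictly below 1
        v = velocity(x, t, x1)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


x1 = [2.0, -1.0]                                   # hypothetical target action
x0 = [random.gauss(0.0, 1.0) for _ in x1]          # noise sample
sample = euler_sample(x0, x1, n_steps=5)

print("start:", [round(v, 3) for v in x0])
print("end:  ", [round(v, 3) for v in sample])
```

A learned field is only approximately straight, so a real policy trades off step count against sample quality rather than landing exactly on target.
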

Vector Field Visualization

The arrows show the learned velocity field. Blue dots (noise) are transported to green dots (data) along the flow.
Flow matching for robotics: Diffusion Policy (Chapter 8) uses DDPM by default, but you can swap in flow matching as the generative backbone. The typical result is comparable action quality with roughly 5-10x fewer network evaluations per action chunk. For real-time robot control at 10+ Hz, this speedup matters.
What does the flow matching network learn to predict?

Chapter 7: ACT — Action Chunking with Transformers

Now we combine the pieces: a generative model (VAE) with a key architectural innovation called action chunking. Instead of predicting one action at a time, predict an entire chunk of future actions: a_{t:t+k} = (a_t, a_{t+1}, …, a_{t+k-1}).

Why Chunks Help

Remember the compounding error problem from Chapter 1? If we predict one action at a time, errors accumulate at every step. With chunks of k = 100 actions, the policy can be queried as rarely as once every 100 steps. Errors then compound only across chunk boundaries, not within a chunk, which reduces the effective number of decision points from T to T/k.

Action chunking = temporal consistency. A single-step policy might jitter: push left, then right, then left. A chunk policy commits to a coherent motion plan for the next k steps. The chunk acts as a "commitment" — the robot follows through on its plan instead of second-guessing at every time step.

ACT Architecture

ACT uses a conditional VAE (CVAE) with a Transformer backbone:

Encoder (training only)
(observation, action_chunk) → μ, σ → z ~ N(μ, σ^2)
↓ z
Decoder (training + inference)
(observation, z) → predicted action_chunk [B, k, d_a]

During training, the encoder sees both the observation and the ground-truth action chunk, producing a latent z that captures the "style" of the demonstration. The decoder takes the observation and z, and reconstructs the action chunk.

During inference, there is no ground-truth action chunk. So we sample z ~ N(0, I) from the prior. The decoder generates a chunk conditioned on just the observation and this random z. Different z samples produce different valid action sequences.

LeRobot ACT Training

# LeRobot ACT training configuration
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

config = ACTConfig(
    chunk_size=100,           # predict 100 future actions
    n_obs_steps=1,            # condition on 1 observation
    dim_model=512,            # Transformer hidden dim
    n_heads=8,                # attention heads
    n_encoder_layers=4,       # Transformer encoder layers
    n_decoder_layers=7,       # Transformer decoder layers
    latent_dim=32,            # z dimensionality
    kl_weight=10.0,           # beta for KL term
)

policy = ACTPolicy(config)
# Input shapes:
#   obs["images"]: [B, 1, 3, 480, 640]  (1 camera view)
#   obs["state"]:  [B, 1, 14]             (7 joints + 7 velocities)
#   actions:       [B, 100, 14]            (100-step action chunk)
# Output: predicted_actions [B, 100, 14]
chunk_size = 100 means the robot plans 100 future actions at once. At 50 Hz control, that's 2 seconds of motion. The robot re-plans every few steps (not every 100), with temporal ensembling (Chapter 9) smoothing overlapping chunks.
How does action chunking reduce compounding errors?

Chapter 8: Diffusion Policy

Take the DDPM framework from Chapters 4-5 and apply it to action prediction. That's Diffusion Policy — one of the most effective imitation learning methods for robotics.

Observation-Conditioned Denoising

The noise prediction network is conditioned on the current observation o:

ε_θ(a_t^k, k, o) → predicted noise

where a_t^k is the action chunk at diffusion step k (note: k indexes diffusion steps, t indexes robot time). At inference, we start with random noise a_t^K ~ N(0, I) and iteratively denoise:

a_t^{k-1} = (1/√α_k) · (a_t^k - ((1 - α_k)/√(1 - ᾱ_k)) · ε_θ(a_t^k, k, o)) + σ_k · z,   z ~ N(0, I)

DDIM: Faster Inference

DDPM needs 100-1000 denoising steps. At 10 Hz control, 1000 forward passes per action is far too slow. DDIM (Denoising Diffusion Implicit Models) makes the process deterministic and allows skipping steps. With DDIM, you can go from 100 denoising steps to 10 with minimal quality loss.

The DDIM trick: Instead of the stochastic DDPM reverse step, DDIM uses a deterministic update that can skip intermediate steps. You preselect a subsequence of 10 timesteps from the full 100 and only denoise at those points. Nearly the same quality, with 10x fewer network evaluations.
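
The step-skipping idea can be sketched with a 1D toy. Two assumptions are made for illustration: a linear beta schedule with made-up endpoints, and an oracle noise predictor that knows the single true action (standing in for the trained ε_θ). The deterministic update first estimates the clean sample, then re-noises it to the next selected noise level using the same predicted noise.

```python
import math

T = 100
# Assumed linear beta schedule (illustrative values, not from the text).
betas = [1e-4 + (0.1 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bar, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bar.append(running)


def oracle_eps(x_t, abar_t, x0):
    """Stand-in for eps_theta: exact noise when the data is the single point x0."""
    return (x_t - math.sqrt(abar_t) * x0) / math.sqrt(1.0 - abar_t)


def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """Deterministic DDIM update (eta = 0) from noise level abar_t to abar_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_pred) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_pred


x0_true = 0.7                                  # hypothetical 1D action
steps = list(range(T - 1, -1, -10))            # 10 of 100 timesteps: 99, 89, ..., 9
# Noisy starting point at the last timestep, built with noise value 1.3.
x = math.sqrt(alpha_bar[-1]) * x0_true + math.sqrt(1.0 - alpha_bar[-1]) * 1.3

for i, t in enumerate(steps):
    # Next noise level in the subsequence; the final step targets clean data.
    abar_prev = alpha_bar[steps[i + 1]] if i + 1 < len(steps) else 1.0
    eps_pred = oracle_eps(x, alpha_bar[t], x0_true)
    x = ddim_step(x, eps_pred, alpha_bar[t], abar_prev)

print(f"denoising steps used: {len(steps)}, result: {x:.4f}")
```

With a learned network the prediction is imperfect, so fewer steps cost some accuracy; the oracle only shows that the update itself is consistent across skipped steps.
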
Action Trajectory Denoising

A 2D action trajectory emerges from noise; each frame is one denoising step.

Architecture Details

Diffusion Policy typically uses a 1D temporal U-Net as the noise prediction network. The observation is injected via FiLM conditioning (feature-wise linear modulation). The diffusion timestep k is encoded via sinusoidal positional embeddings, same as in standard DDPM.

Diffusion Policy vs. ACT: ACT uses a VAE and predicts actions in one forward pass. Diffusion Policy uses DDPM and needs 10-100 forward passes. But diffusion handles multimodality better — it doesn't need a learned encoder/decoder, and it avoids the "posterior collapse" problem that plagues VAEs. In benchmarks, Diffusion Policy slightly outperforms ACT on tasks with high action diversity.
How does DDIM speed up Diffusion Policy inference?

Chapter 9: Temporal Ensembling

Both ACT and Diffusion Policy predict action chunks — say, k = 100 future actions. But the robot re-plans before the chunk expires, typically every 1-10 steps. This means at any given time step, we have multiple overlapping predictions for what the robot should do.

Suppose chunk_size = 4 and we re-plan every step. At time step t = 3, the chunks predicted at steps 0, 1, 2, and 3 each contain a prediction for the action at step 3.

Which one do we use? Temporal ensembling averages them with exponentially decaying weights:

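a_t = ∑_i w_i · a_t^(t-i) / ∑_i w_i,   w_i = exp(-m · i)

where a_t^(t-i) is the prediction for time step t made i steps ago (from the chunk planned at step t-i) and m is a decay constant. Newer predictions (small i) get higher weight because they were made with more recent observations.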

Worked example: temporal ensembling with k=4.

Let m = 0.5. Four overlapping predictions for time step t = 3:
a_3^(3) = 1.2 (predicted this step, i=0)
a_3^(2) = 1.0 (predicted 1 step ago, i=1)
a_3^(1) = 0.8 (predicted 2 steps ago, i=2)
a_3^(0) = 1.1 (predicted 3 steps ago, i=3)

Weights: w_0 = exp(0) = 1.000, w_1 = exp(-0.5) = 0.607, w_2 = exp(-1.0) = 0.368, w_3 = exp(-1.5) = 0.223

Sum of weights: 1.000 + 0.607 + 0.368 + 0.223 = 2.198

a_3 = (1.000 · 1.2 + 0.607 · 1.0 + 0.368 · 0.8 + 0.223 · 1.1) / 2.198
    = (1.200 + 0.607 + 0.294 + 0.245) / 2.198
    = 2.346 / 2.198
    = 1.067

The result is dominated by the most recent prediction (1.2) but smoothed by older ones.
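
The worked example translates directly into a small helper. The function name and argument order (newest prediction first) are choices made for this sketch, following the weighting convention used in this chapter.

```python
import math


def temporal_ensemble(preds_newest_first, m=0.5):
    """Weighted average of overlapping predictions with w_i = exp(-m * i).

    preds_newest_first[i] is the prediction made i steps ago.
    """
    weights = [math.exp(-m * i) for i in range(len(preds_newest_first))]
    total = sum(w * a for w, a in zip(weights, preds_newest_first))
    return total / sum(weights)


a3 = temporal_ensemble([1.2, 1.0, 0.8, 1.1], m=0.5)
print(f"ensembled action: {a3:.3f}")
```

Setting m = 0 recovers a plain average, and a large m approaches "use only the latest prediction".
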
Why not just use the latest prediction? Because each prediction is noisy. The generative model (VAE or diffusion) samples stochastically, so consecutive predictions for the same time step will differ slightly. Averaging reduces this variance, producing smoother, more consistent robot motion. Without ensembling, the robot's movements can appear jittery.
In temporal ensembling, why do we weight recent predictions more heavily?

Chapter 10: Async Inference Pipeline

Here's a practical problem: a Diffusion Policy forward pass takes 50-100ms (10 DDIM steps through a U-Net). A robot arm needs a new action every 20ms (50 Hz control). The model is too slow for synchronous execution.

The solution: decouple planning (slow, runs the model) from execution (fast, serves precomputed actions).

Two-Thread Architecture

Planning Thread (slow)
Runs policy forward pass every 50-100ms. Produces an action chunk [k, d_a]. Pushes to action queue.
↓ action queue
Execution Thread (fast)
Pops actions from queue at 50 Hz. Sends motor commands. Applies temporal ensembling.

The planning thread runs as fast as the GPU allows. It grabs the latest observation, runs the diffusion/VAE model, and pushes the resulting action chunk into a shared queue. The execution thread pops actions from this queue at a fixed rate (50 Hz), interpolating if needed.

The key insight: The robot doesn't need a new plan at every control step. An action chunk predicted 50ms ago is still valid — the robot state hasn't changed much. By the time the current chunk is half-executed, a new chunk from the planning thread is already in the queue.

LeRobot Async Inference

# Simplified async inference pipeline
# (policy.predict and obs_buffer.get_latest are illustrative interfaces)
import queue
import threading
import time

action_queue = queue.Queue(maxsize=2)  # small buffer

def planning_thread(policy, obs_buffer):
    """Slow thread: runs model, produces action chunks."""
    while True:
        obs = obs_buffer.get_latest()          # latest camera + state
        chunk = policy.predict(obs)            # [k, d_a], takes 50-100ms
        action_queue.put(chunk, block=True)    # blocks if queue full

def execution_thread(robot, dt=0.02):
    """Fast thread: sends actions to robot at 50Hz."""
    chunk = None
    step_in_chunk = 0
    while True:
        # Grab a new chunk if one is ready; block only while waiting for the first
        if chunk is None or not action_queue.empty():
            chunk = action_queue.get()
            step_in_chunk = 0

        # Send current action to robot
        action = chunk[step_in_chunk]
        robot.send_action(action)              # motor command
        # Advance; once the chunk is exhausted, keep repeating its last action
        step_in_chunk = min(step_in_chunk + 1, len(chunk) - 1)

        time.sleep(dt)                         # approximately 50Hz (ignores send time)

# Launch both threads (policy, obs_buffer, and robot are assumed to exist)
t1 = threading.Thread(target=planning_thread, args=(policy, obs_buffer))
t2 = threading.Thread(target=execution_thread, args=(robot,))
t1.start(); t2.start()
maxsize=2 is intentional. If the planning thread produces chunks faster than the execution thread consumes them, an unbounded queue would let old plans pile up. With a small queue, put blocks until there is room, so the planner waits instead of running ahead and at most two plans are ever pending. If the planning thread is slower (more common), the execution thread keeps executing the current chunk and, once it runs out, repeats its last action.
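
If the goal is for the executor to always see the freshest plan, one alternative to a blocking put is a put that evicts the oldest entry when the queue is full. The helper below is a sketch of that design using only the standard library; the name put_latest is hypothetical and it is not part of the pipeline above.

```python
import queue


def put_latest(q: queue.Queue, item) -> None:
    """Insert item, evicting the oldest entry if the queue is full."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()      # drop the stalest chunk
            except queue.Empty:
                pass                # a consumer emptied it first; retry the put


q = queue.Queue(maxsize=2)
for chunk_id in range(5):           # planner produces 5 chunks, nobody consumes
    put_latest(q, chunk_id)

remaining = [q.get_nowait(), q.get_nowait()]
print("chunks left in queue:", remaining)
```

The tradeoff is that the planner never waits, so it keeps burning compute on plans that may be thrown away; the blocking variant throttles the planner instead.
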
Why is the inference pipeline split into two threads?

Chapter 11: Connections

We've covered the full pipeline of modern robot imitation learning: from the simplest behavioral cloning (min MSE) to generative models (VAE, diffusion, flow matching) that handle multimodality, to architectural choices (ACT, Diffusion Policy) that handle temporal consistency, to deployment engineering (temporal ensembling, async inference).

The key takeaway: Imitation learning replaces RL's "learn from reward" with "learn from demonstrations." This is far more sample-efficient — but the policy can never exceed the demonstrator's skill. The best real-world systems (Chapter 3's HIL-SERL) combine both: BC to bootstrap, RL to surpass.

From Single-Task to Multi-Task

Everything in this chapter trains a policy for one task: "fold this specific shirt" or "insert this specific peg." But real-world utility requires robots that handle many tasks. How do you tell the robot which task to perform?

The answer: condition the policy on a language instruction. "Pick up the red cup." "Fold the towel." "Open the drawer." This turns the policy from π(a|o) to π(a|o, l), where l is a language embedding from a vision-language model.

Chapter 4 (this chapter)
Single-task BC: one policy, one task. Generative models handle multimodality.
Chapter 5 (next)
Vision-Language-Action models (VLAs): condition on language + vision for multi-task generalization.
Beyond
Foundation models for robotics: pretrained VLAs that transfer across robots, tasks, and environments.

Method Comparison

Method | Generative Model | Multimodal? | Inference Steps | Strengths
BC (MSE) | None | No | 1 | Simple, fast
ACT | CVAE | Yes | 1 | Fast, action chunking
Diffusion Policy | DDPM | Yes | 10-100 | Best multimodal, robust
Flow Policy | Flow Matching | Yes | 5-10 | Fast + multimodal
What is the next step beyond single-task imitation learning?