From behavioral cloning to diffusion policies — teaching robots by showing, not telling.
In Chapter 3 we learned RL: define a reward, explore for millions of steps, converge on a policy. For a shirt-folding task, that means a robot randomly flailing fabric for days until it stumbles onto something that looks like a fold. Absurd.
A human can fold a shirt in 30 seconds. Why not just record that and train a neural network to copy it? That's behavioral cloning (BC), the simplest form of imitation learning. Collect a dataset of demonstrations $D = \{(o_i, a_i)\}$, where $o_i$ is an observation (image, joint state) and $a_i$ is the expert's action, then minimize:
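Assuming the standard mean-squared-error form that the next paragraph describes, and writing N for the number of demonstration pairs (a symbol introduced here), the objective can be written as:

```latex
\[
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| f_\theta(o_i) - a_i \right\|^2
\]
```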
This is just supervised regression. The network fθ maps observations to actions. MSE loss. Backprop. Done. No reward function, no environment resets, no safety concerns during training.
But there's a subtle, devastating problem with the MSE objective. Consider a task where you can reach around an obstacle from either the left or the right. Both are valid. The demonstrations contain both modes. What does MSE do? It averages them — predicting an action that goes straight through the obstacle.
Figure: Two groups of demonstrations (orange dots) go around an obstacle from different sides. MSE regression (blue x) averages them, producing an invalid action.
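The averaging effect is easy to reproduce numerically. The following is a minimal sketch, assuming a hypothetical 1-D lateral action where -1 means "pass on the left" and +1 means "pass on the right"; the helper names are illustrative only.

```python
import random


def mse_optimal_constant(actions):
    """The constant prediction that minimizes MSE is the sample mean."""
    return sum(actions) / len(actions)


def mse(prediction, actions):
    """Mean squared error of one constant prediction against all actions."""
    return sum((prediction - a) ** 2 for a in actions) / len(actions)


rng = random.Random(0)
# Hypothetical demonstrations: half pass left (-1), half pass right (+1).
left = [-1.0 + rng.gauss(0.0, 0.05) for _ in range(100)]
right = [1.0 + rng.gauss(0.0, 0.05) for _ in range(100)]
demos = left + right

averaged = mse_optimal_constant(demos)  # lands near 0, i.e. inside the obstacle
sampled = rng.choice(demos)             # a draw from the data stays on one side

print(f"MSE-optimal action: {averaged:+.3f}")
print(f"sampled action:     {sampled:+.3f}")
```

The averaged action has the lowest MSE of any single prediction, yet it is the one action no demonstrator ever took. Sampling from the data distribution avoids this, which is the motivation for the generative models discussed later.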
Even if the demonstrations are unimodal and MSE works perfectly, BC has a second fundamental problem: covariate shift.
During training, the network sees states from the expert's trajectory. During deployment, it sees states from its own trajectory. The network makes a small error at step 1. That error shifts the state slightly. At step 2, the state is now slightly out of distribution. The error grows. By step 50, the robot is in a state the expert never visited, and the policy outputs nonsense.
Formally, the error compounds quadratically. If the per-step error is ε, the total error after T steps is:
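In the standard analysis of behavioral cloning, stated here as a worst-case scaling and not as an exact equality, the result is:

```latex
\[
\text{total error after } T \text{ steps} = O(\epsilon T^2)
\]
```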
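The growth is quadratic because each step's error shifts the state, causing additional error at subsequent steps. For a 100-step trajectory with ε = 0.01, the worst-case total error scales as εT² = 0.01 × 100² = 100, compared with εT = 1 if errors did not feed back into the state.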
Figure: The green line is the expert trajectory and the orange line is the cloned policy. Small per-step errors of magnitude ε accumulate along the rollout.
Dataset Aggregation (DAgger) addresses covariate shift by iteratively collecting new data. Run the learned policy, visit novel states, query the expert for the correct action at those states, add to the dataset, retrain. Each round covers more of the state space the learned policy actually visits.
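The loop described above can be sketched on a toy problem. This is an illustration under stated assumptions, not a reference implementation: the state is one number, the "expert" is a hypothetical rule a = -0.5 x, the policy is a linear fit, and the initial demonstrations are replaced by uniformly sampled states. Because the toy is exactly realizable, it shows the structure of the loop, not the benefit over plain BC.

```python
import random


def expert_action(x):
    """Hypothetical expert: push the state halfway back toward 0."""
    return -0.5 * x


def fit_linear_policy(dataset):
    """Least-squares fit of a = w * x through the origin."""
    num = sum(x * a for x, a in dataset)
    den = sum(x * x for x, _ in dataset)
    return num / den


def rollout(w, x0, steps, rng):
    """Run the learned policy a = w * x and return the states it visits."""
    states, x = [], x0
    for _ in range(steps):
        states.append(x)
        x = x + w * x + rng.gauss(0.0, 0.1)  # toy dynamics with noise
    return states


def dagger(n_rounds=3, steps=20, seed=0):
    rng = random.Random(seed)
    # Round 0: plain behavioral cloning on a stand-in for expert demonstrations.
    dataset = [(x, expert_action(x)) for x in (rng.uniform(-1, 1) for _ in range(steps))]
    w = fit_linear_policy(dataset)
    for _ in range(n_rounds):
        # 1) Run the learned policy so the data covers the states it visits.
        visited = rollout(w, x0=rng.uniform(-2, 2), steps=steps, rng=rng)
        # 2) Query the expert at those states and aggregate.
        dataset += [(x, expert_action(x)) for x in visited]
        # 3) Retrain on the aggregated dataset.
        w = fit_linear_policy(dataset)
    return w, len(dataset)


w, n = dagger()
print(w, n)
```

The key design point is step 2: labels always come from the expert, but the states come from the learner, so each round shifts the training distribution toward what the policy encounters at deployment.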
But DAgger requires an expert available during training — a human on standby to label states on the fly. For real robots, this is expensive. Modern approaches solve the problem differently: instead of fixing the data distribution, they fix the policy architecture to avoid error accumulation. That's where action chunking comes in (Chapter 7).
The mode-averaging problem from Chapter 0 has a clean solution: instead of predicting a single action fθ(o), model the full conditional distribution p(a|o). Then at inference time, sample from this distribution. Each sample will be a valid action — either going left or going right around the obstacle — never the invalid average.
How do we model p(a|o)? With latent variable models. Introduce a latent variable z that captures the "which mode are we in?" decision:
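A common way to write this, assuming a prior p(z) that does not depend on the observation (some formulations condition the prior on o instead):

```latex
\[
p(a \mid o) = \int p_\theta(a \mid o, z)\, p(z)\, dz
\]
```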
The latent z acts as a "strategy selector." When z is sampled from one region of latent space, the model outputs a go-left action. From another region, go-right. The integral marginalizes over all strategies, giving a proper multimodal distribution.
Three families of generative models are used for robot action prediction, each implementing this idea differently: variational autoencoders (VAEs), diffusion models, and flow matching.
All three can model multimodal distributions. They differ in training stability, inference speed, and sample quality. Let's derive each one.
The Variational Autoencoder gives us a principled way to learn latent variable models. We want to maximize log p(a|o) but that integral over z is intractable. The VAE sidesteps this with a variational lower bound (ELBO).
Introduce an approximate posterior qφ(z | o, a) — an encoder that guesses which latent z produced a given (observation, action) pair. Then:
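With the same observation-independent prior p(z) assumed, the standard bound reads:

```latex
\[
\log p(a \mid o) \;\ge\;
\mathbb{E}_{q_\phi(z \mid o, a)}\!\left[ \log p_\theta(a \mid o, z) \right]
- D_{\mathrm{KL}}\!\left( q_\phi(z \mid o, a) \,\|\, p(z) \right)
\]
```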
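This is the Evidence Lower Bound (ELBO). It has two terms: a reconstruction term, which rewards the decoder for reproducing the expert action from z, and a KL term, which keeps the approximate posterior close to the prior.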
Suppose $q = \mathcal{N}(\mu_q, \sigma_q^2)$ and $p = \mathcal{N}(0, 1)$. The KL divergence has a closed form:
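For a single latent dimension, the standard result is shown below. For a diagonal Gaussian it is summed over dimensions, which is what the `kl` line in the code that follows computes with `log_var` standing for $\log \sigma_q^2$.

```latex
\[
D_{\mathrm{KL}}(q \,\|\, p) = \frac{1}{2}\left( \mu_q^2 + \sigma_q^2 - \log \sigma_q^2 - 1 \right)
\]
```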
```python
# VAE training step (PyTorch-style sketch)
import torch


def vae_loss(obs, action, encoder, decoder, beta=1.0):
    # Encode: observation + action → latent distribution
    mu, log_var = encoder(obs, action)   # shapes: [B, d_z]

    # Reparameterize: sample z without breaking gradients
    std = torch.exp(0.5 * log_var)       # [B, d_z]
    eps = torch.randn_like(std)          # [B, d_z]
    z = mu + std * eps                   # [B, d_z]

    # Decode: observation + z → predicted action
    a_pred = decoder(obs, z)             # [B, d_a]

    # Reconstruction loss (squared error, summed over action dims)
    recon = ((a_pred - action) ** 2).sum(dim=-1).mean()

    # KL divergence to N(0, I) (closed form for Gaussians)
    kl = -0.5 * (1 + log_var - mu**2 - log_var.exp()).sum(dim=-1).mean()

    # beta controls regularization strength (the default of 1.0 is a placeholder)
    return recon + beta * kl
```
VAEs work, but their latent space is a single bottleneck — it must compress all variation into one vector z. For highly multimodal action distributions, this can be limiting. Diffusion models take a different approach: instead of encoding then decoding, they learn to gradually denoise random noise into valid data.
Start with a clean data point $z_0$ (an action sequence from the expert). Over T steps, gradually add Gaussian noise until the signal is destroyed:
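In the DDPM parameterization that the following paragraphs use, with $\beta_t$ the noise added at step t, one forward step is:

```latex
\[
q(z_t \mid z_{t-1}) = \mathcal{N}\!\left( z_t;\ \sqrt{1 - \beta_t}\, z_{t-1},\ \beta_t I \right)
\]
```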
At each step t, the sample $z_t$ is a slightly noisier version of $z_{t-1}$. The noise schedule $\beta_1, \dots, \beta_T$ controls how fast the signal degrades. After T steps (typically T = 1000), $z_T$ is approximately distributed as $\mathcal{N}(0, I)$, that is, pure noise.
A key insight: we can jump directly from $z_0$ to any $z_t$ without computing intermediate steps. Define $\alpha_t = 1 - \beta_t$ and $\bar\alpha_t = \prod_{i=1}^{t} \alpha_i$:
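Composing the Gaussian steps gives the closed-form marginal, written here in the same symbols and consistent with the expression for $z_t$ used later in the noise-prediction derivation:

```latex
\[
q(z_t \mid z_0) = \mathcal{N}\!\left( z_t;\ \sqrt{\bar\alpha_t}\, z_0,\ (1 - \bar\alpha_t) I \right)
\quad\Longleftrightarrow\quad
z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon,
\quad \epsilon \sim \mathcal{N}(0, I)
\]
```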
Training = learn to reverse the noise process. Starting from pure noise $z_T$, we want to iteratively "denoise" back to $z_0$. The reverse process is also Gaussian:
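Assuming a fixed, non-learned variance $\sigma_t^2$ as in the original DDPM setup, the learned reverse step can be written as:

```latex
\[
p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\!\left( z_{t-1};\ \mu_\theta(z_t, t),\ \sigma_t^2 I \right)
\]
```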
The neural network $\mu_\theta$ predicts the mean of the "cleaned up" sample at each step. After T denoising steps, we arrive at $z_0$, a sample from the learned distribution.
Figure: The diffusion process, from clean action (left) to pure noise (right). The forward process adds noise; the learned reverse process removes it.
The reverse process requires predicting $\mu_\theta(z_t, t)$, the denoised mean at each step. But Ho et al. (DDPM) showed that it's simpler and more stable to predict the noise $\epsilon$ instead. Here's the connection:
Since $z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon$, we can rearrange to recover $z_0$:
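Solving the forward-process expression for the clean sample gives:

```latex
\[
z_0 = \frac{z_t - \sqrt{1 - \bar\alpha_t}\, \epsilon}{\sqrt{\bar\alpha_t}}
\]
```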
So if a network $\epsilon_\theta(z_t, t)$ predicts the noise that was added, we can reconstruct $z_0$. The DDPM simplified loss is:
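In the notation used so far, and matching the sampling procedure described in the next paragraph, the loss can be written as:

```latex
\[
\mathcal{L}_{\text{simple}}(\theta) =
\mathbb{E}_{z_0,\ t,\ \epsilon \sim \mathcal{N}(0, I)}
\left[ \left\| \epsilon - \epsilon_\theta(z_t, t) \right\|^2 \right],
\qquad z_t = \sqrt{\bar\alpha_t}\, z_0 + \sqrt{1 - \bar\alpha_t}\, \epsilon
\]
```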
That's the entire training objective. Sample a clean action $z_0$ from the dataset. Sample a random time step t. Sample noise $\epsilon \sim \mathcal{N}(0, I)$. Compute $z_t$. Train the network to predict $\epsilon$ from $z_t$ and t. Beautiful in its simplicity.
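```python
# DDPM training loop (simplified)
import torch

# Assumes model, optimizer, dataloader, T (number of diffusion steps) and
# alpha_bar (precomputed schedule, tensor of shape [T]) are defined elsewhere.
for batch in dataloader:
    z0 = batch["actions"]                     # [B, T_a, d_a] action chunks
    obs = batch["obs"]                        # observations (key name is an assumption)
    B = z0.shape[0]
    t = torch.randint(0, T, (B,), device=z0.device)  # random timesteps
    eps = torch.randn_like(z0)                # noise sample

    # Forward process: compute z_t directly
    a_bar = alpha_bar[t].view(B, 1, 1)        # broadcast over [T_a, d_a]
    z_t = a_bar.sqrt() * z0 + (1 - a_bar).sqrt() * eps

    # Predict noise
    eps_pred = model(z_t, t, obs)             # conditioned on observation

    # Simple MSE loss on noise
    loss = ((eps - eps_pred) ** 2).mean()
    optimizer.zero_grad()                     # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```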
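The training loop relies on a precomputed schedule. Below is a minimal standard-library sketch of how such a schedule and the direct jump to $z_t$ might be computed. The linear range from 1e-4 to 0.02 follows the defaults reported in the DDPM paper and is an assumption here, as are the helper names `make_alpha_bar` and `q_sample`.

```python
import math
import random


def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of alpha_t = 1 - beta_t for a linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        beta = beta_start + (beta_end - beta_start) * i / (T - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def q_sample(z0, t, alpha_bar, rng):
    """Jump from z0 straight to z_t; also return eps, the regression target."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    a = alpha_bar[t]
    z_t = [math.sqrt(a) * z + math.sqrt(1.0 - a) * e for z, e in zip(z0, eps)]
    return z_t, eps


alpha_bar = make_alpha_bar()
rng = random.Random(0)
z0 = [0.2, -0.4, 0.9]                      # hypothetical flattened action chunk
z_t, eps = q_sample(z0, t=500, alpha_bar=alpha_bar, rng=rng)

# Sanity check: knowing eps, the clean sample can be recovered exactly.
a = alpha_bar[500]
z0_rec = [(zt - math.sqrt(1.0 - a) * e) / math.sqrt(a) for zt, e in zip(z_t, eps)]
print(z0_rec)
```

The recovery check at the end is the same rearrangement the noise-prediction derivation uses: a network that predicts `eps` well implicitly predicts the clean action.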
Diffusion models work by iterating through many small denoising steps (100-1000). This is slow. Flow matching reformulates the problem: instead of a discrete Markov chain, define a continuous-time ODE that transforms noise into data in one smooth flow.
Define a path from noise $x_0 \sim \mathcal{N}(0, I)$ to data $x_1$. The simplest path is a straight line:
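With t running from 0 (noise) to 1 (data), linear interpolation between the two endpoints gives:

```latex
\[
x_t = (1 - t)\, x_0 + t\, x_1, \qquad t \in [0, 1]
\]
```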
The velocity along this path is the time derivative:
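For the straight-line path $x_t = (1 - t)\, x_0 + t\, x_1$, differentiating with respect to t gives a velocity that is constant along the path:

```latex
\[
\frac{d x_t}{d t} = x_1 - x_0
\]
```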
This is the conditional optimal transport velocity: the straight-line direction from noise to data. A neural network $v_\theta(x_t, t)$ is trained to predict this velocity:
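Assuming t is drawn uniformly from [0, 1] (a common choice, not stated in the text), the regression objective can be written as:

```latex
\[
\mathcal{L}_{\text{FM}}(\theta) =
\mathbb{E}_{t,\ x_0 \sim \mathcal{N}(0, I),\ x_1 \sim \text{data}}
\left[ \left\| v_\theta(x_t, t) - (x_1 - x_0) \right\|^2 \right],
\qquad x_t = (1 - t)\, x_0 + t\, x_1
\]
```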
At inference time, sample $x_0 \sim \mathcal{N}(0, I)$ and integrate the learned velocity field:
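With a fixed step size $\Delta t$ (for example $\Delta t = 1/n$ for n steps, a symbol introduced here), one update is:

```latex
\[
x_{t + \Delta t} = x_t + \Delta t \cdot v_\theta(x_t, t)
\]
```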
This is an Euler step of the ODE dx/dt = vθ(x, t). Because the paths are approximately straight, even a few Euler steps (5-10) give high-quality samples. Compare this to diffusion's 100-1000 denoising steps.
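The sampler is short enough to sketch in full. The version below is a standard-library illustration, not a trained model: `oracle_velocity` is a stand-in for the network, using the fact that when the dataset is a single point the ideal velocity field is (target - x) / (1 - t). The function names are hypothetical.

```python
import random


def fm_training_pair(x0, x1, t):
    """Point on the straight path and the velocity target the network regresses to."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v


def euler_sample(velocity_fn, x0, n_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(x0), 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


target = [0.3, -0.7]  # hypothetical single data point


def oracle_velocity(x, t):
    """Stand-in for a trained v_theta on a one-point dataset."""
    return [(c - xi) / (1.0 - t) for c, xi in zip(target, x)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in target]
sample = euler_sample(oracle_velocity, noise, n_steps=5)
print(sample)
```

With a perfectly straight flow, five Euler steps already land on the data point. A learned field is only approximately straight, which is why a handful of steps is typical in practice.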
Figure: The arrows show the learned velocity field. Blue dots (noise) are transported to green dots (data) along the flow.
Now we combine the pieces: a generative model (VAE) with a key architectural innovation called action chunking. Instead of predicting one action at a time, predict an entire chunk of future actions: $a_{t:t+k} = (a_t, a_{t+1}, \dots, a_{t+k-1})$.
Remember the compounding error problem from Chapter 1? If we predict one action at a time, errors accumulate at every step. With chunks of k = 100 actions, the policy is queried only every 100 steps. Errors only compound between chunk boundaries, not within them. This reduces the effective number of decision points from T to T/k.
ACT uses a conditional VAE (CVAE) with a Transformer backbone:
During training, the encoder sees both the observation and the ground-truth action chunk, producing a latent z that captures the "style" of the demonstration. The decoder takes the observation and z, and reconstructs the action chunk.
During inference, there is no ground-truth action chunk. So we sample z ~ N(0, I) from the prior. The decoder generates a chunk conditioned on just the observation and this random z. Different z samples produce different valid action sequences.
```python
# LeRobot ACT training configuration
from lerobot.common.policies.act.configuration_act import ACTConfig
from lerobot.common.policies.act.modeling_act import ACTPolicy

config = ACTConfig(
    chunk_size=100,        # predict 100 future actions
    n_obs_steps=1,         # condition on 1 observation
    dim_model=512,         # Transformer hidden dim
    n_heads=8,             # attention heads
    n_encoder_layers=4,    # Transformer encoder layers (policy side)
    n_decoder_layers=7,    # Transformer decoder layers (policy side)
    latent_dim=32,         # z dimensionality
    kl_weight=10.0,        # beta for KL term
)
policy = ACTPolicy(config)

# Input shapes:
#   obs["images"]: [B, 1, 3, 480, 640]  (1 camera view)
#   obs["state"]:  [B, 1, 14]           (7 joints + 7 velocities)
#   actions:       [B, 100, 14]         (100-step action chunk)
# Output: predicted_actions [B, 100, 14]
```
Take the DDPM framework from Chapters 4-5 and apply it to action prediction. That's Diffusion Policy — one of the most effective imitation learning methods for robotics.
The noise prediction network is conditioned on the current observation o:
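Keeping the argument order of the training code earlier in the chapter (noisy sample, diffusion step, observation), the conditioned network can be written as:

```latex
\[
\hat\epsilon = \epsilon_\theta\!\left( a_t^k,\ k,\ o \right)
\]
```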
where $a_t^k$ is the action chunk at diffusion step k (note: k indexes diffusion steps, t indexes robot time). At inference, we start with random noise $a_t^K \sim \mathcal{N}(0, I)$ and iteratively denoise:
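Assuming the standard DDPM sampling rule, with $\alpha_k$ and $\bar\alpha_k$ defined as in the forward process and a fixed noise scale $\sigma_k$ (the symbol $\xi$ for the fresh noise is introduced here), one denoising step is:

```latex
\[
a_t^{k-1} = \frac{1}{\sqrt{\alpha_k}}
\left( a_t^k - \frac{1 - \alpha_k}{\sqrt{1 - \bar\alpha_k}}\, \epsilon_\theta\!\left( a_t^k, k, o \right) \right)
+ \sigma_k\, \xi,
\qquad \xi \sim \mathcal{N}(0, I)
\]
```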
DDPM needs 100-1000 denoising steps. At 10 Hz control, 1000 forward passes per action is far too slow. DDIM (Denoising Diffusion Implicit Models) makes the process deterministic and allows skipping steps. With DDIM, you can go from 100 denoising steps to 10 with minimal quality loss.
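The step-skipping idea can be sketched with scalars and the standard library. This is an illustration under assumptions: it uses the deterministic DDIM update (no added noise), an assumed linear beta schedule, and `oracle_eps` as a stand-in for a trained network on a dataset containing a single action. All function names are hypothetical.

```python
import math


def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (an assumed choice) turned into cumulative alpha_bar."""
    alpha_bar, running = [], 1.0
    for i in range(T):
        running *= 1.0 - (beta_start + (beta_end - beta_start) * i / (T - 1))
        alpha_bar.append(running)
    return alpha_bar


def ddim_step(x_t, eps_pred, a_bar_t, a_bar_prev):
    """Deterministic DDIM update from noise level a_bar_t to a_bar_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - a_bar_t) * eps_pred) / math.sqrt(a_bar_t)
    return math.sqrt(a_bar_prev) * x0_pred + math.sqrt(1.0 - a_bar_prev) * eps_pred


def ddim_sample(eps_fn, x_T, alpha_bar, n_steps=10):
    """Denoise with only n_steps (>= 2) of the T training timesteps."""
    T = len(alpha_bar)
    # Evenly spaced subset of timesteps, visited from noisiest to cleanest.
    timesteps = [round(i * (T - 1) / (n_steps - 1)) for i in range(n_steps)][::-1]
    x = x_T
    for j, t in enumerate(timesteps):
        # After the last step the target noise level is "no noise" (alpha_bar = 1).
        a_bar_prev = alpha_bar[timesteps[j + 1]] if j + 1 < len(timesteps) else 1.0
        x = ddim_step(x, eps_fn(x, t), alpha_bar[t], a_bar_prev)
    return x


alpha_bar = make_alpha_bar()
clean_action = 0.42  # hypothetical one-point "dataset"


def oracle_eps(x, t):
    """Stand-in for a trained network: the exact noise when the data is one point."""
    a = alpha_bar[t]
    return (x - math.sqrt(a) * clean_action) / math.sqrt(1.0 - a)


sample = ddim_sample(oracle_eps, x_T=1.3, alpha_bar=alpha_bar, n_steps=10)
print(sample)
```

Each step first estimates the clean action, then re-noises it to the next (lower) noise level. Because the update never assumes adjacent timesteps, the schedule can be subsampled freely, which is what makes 10 steps viable.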
Figure: A 2D action trajectory emerging from noise, one denoising step per frame.
Diffusion Policy typically uses a 1D temporal U-Net as the noise prediction network. The observation is injected via FiLM conditioning (feature-wise linear modulation). The diffusion timestep k is encoded via sinusoidal positional embeddings, same as in standard DDPM.
Both ACT and Diffusion Policy predict action chunks — say, k = 100 future actions. But the robot re-plans before the chunk expires, typically every 1-10 steps. This means at any given time step, we have multiple overlapping predictions for what the robot should do.
Suppose chunk_size = 4 and we re-plan every step. At time step t = 3, four chunks contain a prediction for the current action: the ones planned at t = 0, 1, 2, and 3.
Which one do we use? Temporal ensembling averages them with exponentially decaying weights:
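Taking i as the age of a prediction in steps, so that i = 0 is the newest chunk, and writing $\hat a_t^{(i)}$ (a symbol introduced here) for the action that chunk proposes for time t, the weights and the executed action can be written as:

```latex
\[
w_i = \exp(-m\, i),
\qquad
a_t = \frac{\sum_i w_i\, \hat a_t^{(i)}}{\sum_i w_i}
\]
```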
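where m is a decay constant and i is the age of a prediction in steps, so i = 0 is the newest chunk. Under this convention, newer predictions (small i) get higher weight because they were made with more recent observations. The original ACT paper indexes the other way, with w_0 assigned to the oldest prediction, so check which convention a given implementation uses.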
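A minimal sketch of the weighted average, using the chunk_size = 4 example and the convention in which index 0 is the newest prediction. The prediction values and the choice m = 0.1 are made up for illustration, and `temporal_ensemble` is a hypothetical helper name.

```python
import math


def temporal_ensemble(predictions, m=0.1):
    """Weighted average of overlapping predictions for one time step.

    predictions[i] is the action proposed by the chunk planned i steps ago,
    so index 0 is the newest chunk and receives the largest weight.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    dim = len(predictions[0])
    return [sum(w * p[d] for w, p in zip(weights, predictions)) / total for d in range(dim)]


# chunk_size = 4, re-planning every step: at t = 3 the chunks planned at
# t = 3, 2, 1, 0 each contain a prediction for the current action.
chunk_size, t_now = 4, 3
planned_at = [t for t in range(t_now, -1, -1) if t_now - t < chunk_size]

# Hypothetical 1-D predictions for the action at t = 3, newest first.
preds = [[1.00], [1.10], [0.90], [1.20]]
action = temporal_ensemble(preds, m=0.1)
print(planned_at, action)
```

Setting m = 0 reduces this to a plain mean, while a large m effectively uses only the newest prediction. The decay constant therefore trades smoothness against responsiveness to new observations.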
Here's a practical problem: a Diffusion Policy forward pass takes 50-100ms (10 DDIM steps through a U-Net). A robot arm needs a new action every 20ms (50 Hz control). The model is too slow for synchronous execution.
The solution: decouple planning (slow, runs the model) from execution (fast, serves precomputed actions).
The planning thread runs as fast as the GPU allows. It grabs the latest observation, runs the diffusion/VAE model, and pushes the resulting action chunk into a shared queue. The execution thread pops actions from this queue at a fixed rate (50 Hz), interpolating if needed.
```python
# Simplified async inference pipeline
import queue
import threading
import time

action_queue = queue.Queue(maxsize=2)  # small buffer


def planning_thread(policy, obs_buffer):
    """Slow thread: runs model, produces action chunks."""
    while True:
        obs = obs_buffer.get_latest()         # latest camera + state
        chunk = policy.predict(obs)           # [k, d_a], takes 50-100ms
        action_queue.put(chunk, block=True)   # blocks if queue full


def execution_thread(robot, dt=0.02):
    """Fast thread: sends actions to robot at 50Hz."""
    chunk = None
    step_in_chunk = 0
    while True:
        # Grab a new chunk if one is ready; block only before the first chunk
        if chunk is None or not action_queue.empty():
            chunk = action_queue.get()
            step_in_chunk = 0

        # Send current action to robot
        action = chunk[step_in_chunk]
        robot.send_action(action)             # motor command
        # Hold the last action if the planner falls behind
        step_in_chunk = min(step_in_chunk + 1, len(chunk) - 1)
        time.sleep(dt)                        # approximate 50Hz rate


# Launch both threads (policy, obs_buffer and robot are defined elsewhere)
t1 = threading.Thread(target=planning_thread, args=(policy, obs_buffer))
t2 = threading.Thread(target=execution_thread, args=(robot,))
t1.start()
t2.start()
```
We've covered the full pipeline of modern robot imitation learning: from the simplest behavioral cloning (min MSE) to generative models (VAE, diffusion, flow matching) that handle multimodality, to architectural choices (ACT, Diffusion Policy) that handle temporal consistency, to deployment engineering (temporal ensembling, async inference).
Everything in this chapter trains a policy for one task: "fold this specific shirt" or "insert this specific peg." But real-world utility requires robots that handle many tasks. How do you tell the robot which task to perform?
The answer: condition the policy on a language instruction. "Pick up the red cup." "Fold the towel." "Open the drawer." This turns the policy from π(a|o) to π(a|o, l), where l is a language embedding from a vision-language model.
| Method | Generative Model | Multimodal? | Inference Steps | Strengths |
|---|---|---|---|---|
| BC (MSE) | None | No | 1 | Simple, fast |
| ACT | CVAE | Yes | 1 | Fast, action chunking |
| Diffusion Policy | DDPM | Yes | 10-100 | Best multimodal, robust |
| Flow Policy | Flow Matching | Yes | 5-10 | Fast + multimodal |