The same denoising that paints images can drive a robot arm. Instead of regressing one action, a diffusion policy generates a sequence of future actions from noise — which is exactly how it handles tasks with many right answers.
You teach a robot by behavior cloning: record an expert doing a task, then train a network to copy the action given the observation. Standard recipe: the network outputs one action, and you minimize the squared error to the expert’s action. Simple, and it often works. Until the task has more than one right answer — and then it fails in a specific, deadly way.
Picture a robot approaching an obstacle. The expert sometimes goes left around it, sometimes right; both are perfectly valid. Train a squared-error policy on this data and it learns the action that minimizes the average squared error to both demonstrations, which is their mean: straight down the middle. Into the obstacle. Averaging two good options produces a terrible one. This is the multimodality problem, and it’s why naive behavior cloning is brittle on real tasks where humans are inconsistent.
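To make the averaging concrete, here is a toy sketch. It assumes the action is a single lateral offset (negative for left, positive for right) and invents an obstacle half-width of 0.5; the names `best_constant_action` and `collides` are illustrative, not from any library.

```python
# Toy illustration: a squared-error policy collapses two valid actions into their mean.
# Assumptions (all made up): the action is one lateral offset, -1.0 = pass left,
# +1.0 = pass right, and the obstacle occupies offsets in (-0.5, 0.5).

def mse(action, demos):
    """Mean squared error between one candidate action and all expert actions."""
    return sum((action - d) ** 2 for d in demos) / len(demos)

def best_constant_action(demos):
    """The action minimizing mean squared error is the mean of the demonstrations."""
    return sum(demos) / len(demos)

def collides(action, half_width=0.5):
    """True if the lateral offset falls inside the hypothetical obstacle."""
    return abs(action) < half_width

demos = [-1.0, 1.0, -1.0, 1.0]          # half the experts go left, half go right
learned = best_constant_action(demos)   # what an MSE-trained policy converges toward

# Brute-force check over a grid of candidate actions in [-1.5, 1.5].
grid = [i / 100 for i in range(-150, 151)]
grid_best = min(grid, key=lambda a: mse(a, demos))
```

Every demonstration clears the obstacle, yet the error-minimizing action (`learned`, confirmed by `grid_best`) lands at offset 0.0, inside it.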
Experts go left OR right around the obstacle (teal paths). A squared-error policy outputs their average (orange), which heads straight into it. Slide the mix of left/right demos and watch the average stay stuck between the two routes.
The fix starts with a mindset shift: a policy shouldn’t output an action, it should model the distribution over good actions. For the obstacle, that distribution is bimodal — a bump of probability for “go left” and another for “go right,” with a valley in between (the collision). A good policy should sample from one bump or the other, never the valley.
Squared-error regression can only ever produce a single point estimate, the mean, and the mean sits in the valley. What we need is a model that can represent and sample from a full, multi-bump distribution. That’s exactly what generative models do, and diffusion is among the most stable and expressive generative models we have for continuous data. So the plan is: model the action distribution with diffusion, and sample an action from it, landing cleanly in one mode.
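The “valley” can be put in numbers with a small sketch. It assumes the action distribution is a mixture of two Gaussians centered at -1 and +1 with a made-up standard deviation of 0.2; real action distributions are not this tidy.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def action_density(x, p_left=0.5, std=0.2):
    """Two-bump density over a 1-D lateral action: modes at -1 (left) and +1 (right)."""
    return p_left * gaussian_pdf(x, -1.0, std) + (1 - p_left) * gaussian_pdf(x, 1.0, std)

density_left = action_density(-1.0)    # top of the "go left" bump
density_right = action_density(1.0)    # top of the "go right" bump
density_valley = action_density(0.0)   # where the squared-error mean sits
```

Under these assumptions the density at the mean is several orders of magnitude below the density at either mode, so the regression output is an action the expert essentially never takes.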
The action distribution has two bumps (left/right). The squared-error mean (orange) sits in the dead valley between them. Diffusion samples (teal dots) land in the bumps — valid actions. Drag to shift the balance.
Here’s the beautiful part: generating actions works exactly like generating images — only the data is different. In image diffusion you denoise a grid of pixels. In Diffusion Policy you denoise a sequence of actions (a little trajectory of, say, joint angles or gripper positions over the next several timesteps). Same math, same training, different payload.
Training: take an expert action sequence, add noise, and train a network to predict the noise (or equivalently, the denoising direction). Generation: start from a pure-noise action sequence and denoise it step by step into a clean, executable trajectory. Because you’re sampling — starting from random noise — you naturally land in one mode or another: one run denoises toward “go left,” another toward “go right.” The randomness in the starting noise is what picks a mode, and the learned denoiser is what makes the result a valid, coherent action.
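Both halves of that recipe fit in a small toy sketch. It shrinks the action chunk to a single number with two expert modes (-1 and +1), assumes a linear noise schedule with 50 steps, and replaces the trained network with an analytic “oracle” denoiser that only exists because the data has exactly two points. None of these choices come from a real system; a real chunk would be an array over time and action dimensions, with the same arithmetic applied elementwise.

```python
import math
import random

# Assumed schedule: T = 50 steps, betas rising linearly from 1e-4 to 0.2.
T = 50
betas = [1e-4 + (0.2 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars = []
running = 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)   # cumulative product of alphas up to step t

def make_training_pair(x0, t, rng):
    """Forward process: noise a clean action to step t; the regression target is the noise."""
    eps = rng.gauss(0.0, 1.0)
    x_t = math.sqrt(alpha_bars[t]) * x0 + math.sqrt(1.0 - alpha_bars[t]) * eps
    return x_t, eps   # a real denoiser would be trained so that net(x_t, t) approximates eps

def oracle_eps(x_t, t):
    """Stand-in for the trained network: the optimal noise prediction when x0 is
    -1 or +1 with equal probability (E[x0 | x_t] is then a tanh)."""
    ab = alpha_bars[t]
    x0_hat = math.tanh(math.sqrt(ab) * x_t / (1.0 - ab))
    return (x_t - math.sqrt(ab) * x0_hat) / math.sqrt(1.0 - ab)

def sample_action(rng):
    """Reverse process: start from pure noise and denoise step by step (DDPM-style)."""
    x = rng.gauss(0.0, 1.0)
    for t in reversed(range(T)):
        eps_hat = oracle_eps(x, t)
        x = (x - betas[t] / math.sqrt(1.0 - alpha_bars[t]) * eps_hat) / math.sqrt(alphas[t])
        if t > 0:
            x += math.sqrt(betas[t]) * rng.gauss(0.0, 1.0)   # fresh noise except at the final step
    return x

rng = random.Random(0)
samples = [sample_action(rng) for _ in range(200)]
n_left = sum(1 for s in samples if s < 0)
n_right = len(samples) - n_left
```

Each call to `sample_action` starts from different noise and ends at one mode or the other, never at their average; `n_left` and `n_right` count how the runs split.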
A random, jagged action sequence (left) is denoised into a smooth, sensible trajectory (right). Drag the denoising progress. Re-roll the starting noise and a different valid trajectory emerges — that’s how it samples modes.
A key design choice: Diffusion Policy doesn’t predict one action. It predicts a chunk of future actions at once (a horizon of, say, 16 steps). Then it executes only the first few (say 8), throws away the rest, observes again, and re-predicts a fresh chunk. Predicting several actions at once is often called action chunking; executing a prefix and then replanning is receding-horizon control, borrowed from control theory.
Why predict a chunk instead of one step? Three reasons. Temporal consistency: a committed sequence of actions is smooth and coherent, where independent per-step predictions jitter. Commitment to a mode: predicting the whole chunk means the policy commits to “go left” as a plan, instead of possibly flip-flopping left/right between steps and crashing in the middle. Robustness to idle actions: when demonstrations contain pauses, a one-step policy can learn that “wait” follows “wait” and freeze, whereas a chunk spans the pause and carries the policy through it. Executing only part of the chunk before replanning keeps the policy reactive to new observations.
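The mode-commitment argument can be caricatured in a few lines. This sketch assumes each action is just a steering sign (-1 for left, +1 for right) and compares sampling the mode at every step against sampling it once per chunk; the function names are made up for illustration.

```python
import random

def per_step_actions(n_steps, rng):
    """Sample a mode independently at every step: -1 = steer left, +1 = steer right."""
    return [rng.choice([-1, 1]) for _ in range(n_steps)]

def chunked_actions(n_steps, horizon, rng):
    """Sample a mode once per chunk and hold it for the whole chunk."""
    actions = []
    while len(actions) < n_steps:
        mode = rng.choice([-1, 1])
        actions.extend([mode] * horizon)
    return actions[:n_steps]

def count_flips(actions):
    """Number of times consecutive actions switch between left and right."""
    return sum(1 for a, b in zip(actions, actions[1:]) if a != b)

rng = random.Random(0)
step_plan = per_step_actions(32, rng)
chunk_plan = chunked_actions(32, horizon=16, rng=rng)
```

With a horizon of 16 over 32 steps, `chunk_plan` can change direction at most once, at the chunk boundary, while `step_plan` is free to flip at every step.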
The policy predicts a horizon (faint), executes the first part (solid), then re-observes and predicts again. Drag the execution-before-replan amount: small = very reactive but more compute; large = smoother but slower to adapt.
A policy must act on observations, so the denoiser is conditioned on what the robot sees and feels: camera images (passed through a vision encoder like a ResNet or ViT) and proprioceptive state (joint angles, gripper). These observations are turned into a conditioning vector that steers the denoising — the action chunk is denoised toward a trajectory appropriate for the current scene.
This is the same idea as text-conditioning an image diffusion model, but the “prompt” is the robot’s observation. Two common ways to inject it: FiLM (the conditioning vector scales and shifts the denoiser’s features — used with the 1-D conv denoiser) or cross-attention (the action tokens attend to observation tokens — used with the transformer denoiser). Either way, the result is a conditional action distribution: given this scene, here are the good action sequences.
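As a rough illustration of the FiLM route, the sketch below maps a conditioning vector to one scale and one shift per feature channel and applies them across time. The weights, shapes, and helper names are invented for the example; a real denoiser would learn these maps and apply them inside each convolutional block.

```python
def linear(weights, bias, vec):
    """Plain matrix-vector product plus bias: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]

def film(features, cond, w_gamma, b_gamma, w_beta, b_beta):
    """Feature-wise linear modulation.

    features: list of channels, each a list over time steps.
    cond: conditioning vector built from the observation.
    Channel c becomes gamma[c] * features[c][t] + beta[c] at every time step t.
    """
    gamma = linear(w_gamma, b_gamma, cond)   # per-channel scale
    beta = linear(w_beta, b_beta, cond)      # per-channel shift
    return [[g * x + b for x in channel] for channel, g, b in zip(features, gamma, beta)]

# Hypothetical numbers: 2 feature channels over 3 time steps, a 2-D conditioning vector.
features = [[1.0, 2.0, 3.0],
            [4.0, 6.0, 8.0]]
cond = [1.0, 0.0]
w_gamma, b_gamma = [[2.0, 0.0], [0.5, 0.0]], [0.0, 0.0]   # gamma = [2.0, 0.5] for this cond
w_beta, b_beta = [[1.0, 0.0], [-1.0, 0.0]], [0.0, 0.0]    # beta  = [1.0, -1.0] for this cond
modulated = film(features, cond, w_gamma, b_gamma, w_beta, b_beta)
```

Feeding a different `cond` through the same weights yields different scales and shifts, which is how one set of denoiser weights can produce different trajectories for different scenes.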
Move the obstacle (drag the slider). The conditioning changes, so the denoised action distribution shifts — the policy routes around wherever the obstacle now is. Same model, different scene, different actions.
What network does the denoising? Two choices are common, and the original Diffusion Policy work describes both. The first is a 1-D convolutional U-Net over the time axis of the action sequence: treat the chunk of actions as a 1-D signal (time steps × action dimensions) and denoise it like a short waveform, with observation conditioning injected by FiLM. The second is a transformer over the action tokens, with observations entering by cross-attention, which can work better in some high-dimensional or long-horizon settings.
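To show what “convolving over the time axis” means for an action chunk, here is a deliberately tiny sketch: a single fixed smoothing kernel slid along time, applied to each action dimension separately. This is an assumption-laden simplification, not a U-Net; a real layer learns its kernels, mixes channels, and is stacked with down- and up-sampling.

```python
def temporal_conv(chunk, kernel):
    """Slide a kernel along the time axis of an action chunk, keeping the length unchanged.

    chunk: list of time steps, each a list of action dimensions (time x action_dim).
    kernel: odd-length list of weights, shared across action dimensions for brevity.
    Positions outside the chunk are treated as zeros.
    """
    half = len(kernel) // 2
    n_steps, n_dims = len(chunk), len(chunk[0])
    out = [[0.0] * n_dims for _ in range(n_steps)]
    for t in range(n_steps):
        for k, w in enumerate(kernel):
            src = t + k - half
            if 0 <= src < n_steps:          # zero padding at the chunk edges
                for d in range(n_dims):
                    out[t][d] += w * chunk[src][d]
    return out

# A jagged 4-step chunk with 2 action dimensions (made-up values).
jagged = [[0.0, 0.0], [4.0, 8.0], [0.0, 0.0], [4.0, 8.0]]
smoothed = temporal_conv(jagged, [0.25, 0.5, 0.25])
```

The output keeps the time × action-dimension layout of the input, and each output step depends only on its temporal neighbors, which is the locality a 1-D convolutional denoiser exploits.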
Either way, the denoiser’s job is identical to an image denoiser’s: take the noisy action chunk plus the timestep plus the conditioning, and predict the noise to remove. Run it for several denoising steps and the chunk resolves from static into a clean trajectory. Note how little is “robotics-specific” here — it’s a generic denoiser pointed at action data. That generality is why Diffusion Policy slotted so easily into the broader diffusion toolkit, and why modern robot foundation models (VLAs like π-0) put a diffusion or flow-matching action head on top of a big vision-language backbone.
The action chunk is a 1-D signal (time across, action dims stacked). The U-Net denoises it, with the observation conditioning (FiLM) scaling/shifting features. Step the denoising and watch the jagged chunk smooth into a trajectory.
Put it together into the control loop that runs on the real robot, many times per second:
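A minimal sketch of that loop, assuming a toy 1-D world where actions are position deltas. `sample_action_chunk` is a hypothetical stand-in for the conditioned denoiser (it simply walks toward the goal), and the horizon of 16 with 8 executed steps mirrors the example numbers used earlier.

```python
def sample_action_chunk(obs, horizon, max_step=0.5):
    """Hypothetical stand-in for the conditioned denoiser: returns `horizon` position
    deltas that walk from the observed position toward the observed goal."""
    position, goal = obs
    chunk = []
    for _ in range(horizon):
        delta = max(-max_step, min(max_step, goal - position))
        chunk.append(delta)
        position += delta
    return chunk

def run_control_loop(start, goal_at, total_steps, horizon=16, n_exec=8):
    """Observe, predict a chunk, execute the first n_exec actions, discard the rest, repeat.

    goal_at(step) returns the goal at a given time step, so the goal can move mid-run.
    """
    position, step, n_plans = start, 0, 0
    trajectory = [position]
    while step < total_steps:                        # each pass is one replan
        obs = (position, goal_at(step))              # 1. observe
        chunk = sample_action_chunk(obs, horizon)    # 2. generate a fresh plan (stubbed)
        n_plans += 1
        for action in chunk[:n_exec]:                # 3. execute only a slice of it
            if step >= total_steps:
                break
            position += action
            trajectory.append(position)
            step += 1
    return trajectory, n_plans

# The goal jumps from 5.0 to -2.0 at step 12; the next replan, at step 16, picks it up.
trajectory, n_plans = run_control_loop(0.0, lambda s: 5.0 if s < 12 else -2.0, total_steps=40)
```

In this run the agent only notices the moved goal at the first replan after the change, which is the reactivity cost of executing more steps per chunk.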
Observe, denoise a fresh plan, execute a slice, repeat. The replanning keeps it reactive — if you nudge the object, the next observation reflects it and the next denoised chunk adapts. And because each chunk is internally coherent and committed to a mode, the motion stays smooth and decisive instead of dithering. This closed loop is what turns a generative model of trajectories into a working controller.
Press Run: the agent observes, denoises a chunk (faint), executes part (solid), then replans. Move the goal mid-run — the next plan adapts. This is the policy actually controlling.
Watch an agent reach a goal past an obstacle that has two valid routes. A squared-error policy averages the routes and drives into the obstacle. The Diffusion Policy samples a chunk, commits to one route, executes it smoothly, and replans as it goes. Re-roll to see it sometimes pick left, sometimes right — both succeed.
Press Go. The teal agent (Diffusion Policy) commits to one side and reaches the goal; the orange agent (squared-error BC) heads for the average — the obstacle. Re-roll to see the diffusion agent pick a different valid route. Move the obstacle to test reactivity.
The teal agent succeeds because it treats “which way to go” as a sample from a distribution and commits; the orange agent fails because it treats it as a number to average. Every idea in this lesson — modeling the distribution, denoising a chunk, conditioning on the scene, committing to a mode, replanning — is what separates the two paths.
Diffusion Policy reframed robot learning: action generation is just generative modeling. That idea now anchors the frontier. Vision-Language-Action models (VLAs) like π-0 and its successors put a diffusion or flow-matching action head on a large pretrained vision-language backbone — the backbone understands the scene and instruction, the diffusion head generates the motion. The denoising-actions idea you just learned is the beating heart of modern robot foundation models.
More denoising steps = better samples but slower. Standard diffusion (orange) needs many; flow matching / consistency (teal) needs few — bringing real-time control within reach. Drag the step budget.
| Approach | Multimodal? | Training | Issue |
|---|---|---|---|
| MSE behavior cloning | no (averages) | stable | collides in the valley |
| Gaussian head | no (unimodal) | stable | same valley |
| GAN policy | yes | unstable | mode collapse |
| Diffusion Policy | yes | stable (noise regression) | sampling cost (reducible with few-step samplers) |
→ Imitation Learning — behavior cloning and its failure modes
→ Diffusion — the denoising engine, in full
→ Flow Matching — the few-step speedup for real-time control
→ Vision-Language-Action Models — diffusion action heads on VLM backbones
→ Actor-Critic / RL — going beyond the expert