AI Architectures

Diffusion Policy

The same denoising that paints images can drive a robot arm. Instead of regressing one action, a diffusion policy generates a sequence of future actions from noise — which is exactly how it handles tasks with many right answers.

Prerequisites: diffusion (denoising noise into data) and behavior cloning (copying expert actions). That’s it.


Chapter 0: Averaging Kills

You teach a robot by behavior cloning: record an expert doing a task, then train a network to copy the action given the observation. Standard recipe: the network outputs one action, and you minimize the squared error to the expert’s action. Simple, and it often works. Until the task has more than one right answer — and then it fails in a specific, deadly way.

Picture a robot approaching an obstacle. The expert sometimes goes left around it, sometimes right — both perfectly valid. Train a squared-error policy on this data and it learns the action that minimizes average error to both demonstrations: straight down the middle. Into the obstacle. Averaging two good options produces a terrible one. This is the multimodality problem, and it’s why naive behavior cloning is brittle on real tasks where humans are inconsistent.

The trap: “just regress the expert action.” Squared-error regression assumes there’s one correct output and finds its mean. When the truth is a distribution with several modes, the mean can land in a valley no expert ever chose. Diffusion Policy (Chi et al., 2023) fixes this by generating actions from a learned distribution — the same trick that lets diffusion paint sharp images instead of blurry averages.
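
To see the averaging failure in numbers, here is a minimal sketch. The 1-D steering encoding (-1 for left, +1 for right) and the demo counts are illustrative assumptions, not values from the paper. It brute-forces the single action with the lowest squared error and shows that it coincides with the mean of the demos, far from anything an expert did.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D steering demos: -1.0 = swerve left, +1.0 = swerve right.
frac_left = 0.5
n_demos = 1000
went_left = rng.random(n_demos) < frac_left
expert_actions = np.where(went_left, -1.0, 1.0) + 0.05 * rng.standard_normal(n_demos)

# Brute-force the single constant action with the lowest mean squared error.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((expert_actions[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best_action = candidates[np.argmin(mse)]

# Distance from the MSE-optimal action to the nearest thing an expert actually did.
gap_to_nearest_demo = np.abs(expert_actions - best_action).min()

print(f"MSE-optimal action:      {best_action:+.2f}")
print(f"mean of the demos:       {expert_actions.mean():+.2f}")
print(f"gap to the nearest demo: {gap_to_nearest_demo:.2f}")
```

Changing `frac_left` moves the optimum along the line between the two modes, but it only reaches a mode when every demo agrees.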

The deadly average

Experts go left OR right around the obstacle (teal paths). A squared-error policy outputs their average (orange), which at an even mix heads straight into it. Slide the mix of left/right demos and watch the average drift between the two paths without committing to either until the demos all agree.

fraction of demos going left: 0.50

Why does squared-error behavior cloning fail when a task has multiple valid actions?

It overfits to the first demo
It outputs the mean of the valid actions, which can be an invalid action (e.g. straight into the obstacle)
It needs more layers

Chapter 1: Actions Are a Distribution

The fix starts with a mindset shift: a policy shouldn’t output an action, it should model the distribution over good actions. For the obstacle, that distribution is bimodal — a bump of probability for “go left” and another for “go right,” with a valley in between (the collision). A good policy should sample from one bump or the other, never the valley.

Squared-error regression can only ever produce the single mean, and that mean sits in the valley. What we need is a model that can represent and sample from a full, multi-bump distribution. That’s exactly what generative models do. And one of the most stable, expressive families of generative models for continuous data is diffusion. So the plan is: model the action distribution with diffusion, and sample an action from it, landing cleanly in one mode.

Why not a Gaussian head or a GAN? A single Gaussian is unimodal: same valley problem. A mixture of Gaussians needs you to pick the number of modes up front and can be finicky to train. GANs can model multimodal data but suffer from mode collapse and unstable training. Diffusion captures arbitrary multimodal distributions and trains with a simple, stable regression loss. That combination of expressiveness and stability is why it has become a go-to choice for action generation.
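
The single-Gaussian failure is easy to check numerically. The sketch below uses made-up bimodal actions clustered at -1 and +1 and treats |a| < 0.5 as the collision valley (both are assumptions for illustration). It fits a Gaussian by maximum likelihood, then counts how often each source of actions lands in the valley.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal expert actions: tight clusters around -1 (left) and +1 (right).
n = 10_000
modes = rng.choice([-1.0, 1.0], size=n)
expert_actions = modes + 0.1 * rng.standard_normal(n)

# Maximum-likelihood fit of a single Gaussian head: the sample mean and std.
mu, sigma = expert_actions.mean(), expert_actions.std()
gaussian_samples = mu + sigma * rng.standard_normal(n)

def valley_fraction(actions, half_width=0.5):
    """Fraction of actions inside the assumed collision valley |a| < half_width."""
    return float(np.mean(np.abs(actions) < half_width))

expert_valley = valley_fraction(expert_actions)
gaussian_valley = valley_fraction(gaussian_samples)

print(f"experts in the valley:          {expert_valley:.1%}")
print(f"Gaussian samples in the valley: {gaussian_valley:.1%}")
```

The fitted Gaussian is centered on the valley, so a large share of its samples are actions no expert would take.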

Bimodal actions: mean vs. samples

The action distribution has two bumps (left/right). The squared-error mean (orange) sits in the dead valley between them. Diffusion samples (teal dots) land in the bumps — valid actions. Drag to shift the balance.

left/right balance: 0.50

Why is diffusion preferred over a single-Gaussian head or a GAN for modeling actions?

It is the fastest to run
It captures arbitrary multimodal distributions while training with a simple, stable regression loss (no mode collapse, no mode-count to pick)
It needs no training data

Chapter 2: Denoising Actions

Here’s the beautiful part: generating actions works exactly like generating images — only the data is different. In image diffusion you denoise a grid of pixels. In Diffusion Policy you denoise a sequence of actions (a little trajectory of, say, joint angles or gripper positions over the next several timesteps). Same math, same training, different payload.

Training: take an expert action sequence, add noise, and train a network to predict the noise (or equivalently, the denoising direction). Generation: start from a pure-noise action sequence and denoise it step by step into a clean, executable trajectory. Because you’re sampling — starting from random noise — you naturally land in one mode or another: one run denoises toward “go left,” another toward “go right.” The randomness in the starting noise is what picks a mode, and the learned denoiser is what makes the result a valid, coherent action.
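
Here is a minimal sketch of the generation half, under heavy assumptions. The expert data are exactly two mirror-image 16-step chunks, and the noise schedule is made up. In place of a trained network, `predict_noise` uses the closed-form optimal noise estimate for that two-chunk dataset, which is a stand-in and not how a real policy works. The loop itself is standard DDPM ancestral sampling.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical expert chunks: lateral offset over a 16-step horizon.
horizon = 16
right = np.sin(np.pi * np.linspace(0.0, 1.0, horizon))  # bulge to the right
left = -right                                            # mirror image

# Noise schedule (assumed values, not taken from the paper).
num_steps = 200
betas = np.linspace(1e-4, 0.05, num_steps)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(noisy, k):
    """Stand-in for the trained network: the exact expected noise when the
    data are the two chunks above, each with probability 1/2."""
    a_bar = alpha_bars[k]
    # Posterior mean of the clean chunk given the noisy one.
    logit = np.sqrt(a_bar) * (noisy @ right) / (1.0 - a_bar)
    clean_estimate = np.tanh(logit)[:, None] * right
    return (noisy - np.sqrt(a_bar) * clean_estimate) / np.sqrt(1.0 - a_bar)

def sample_chunks(n):
    x = rng.standard_normal((n, horizon))  # start from pure-noise action sequences
    for k in reversed(range(num_steps)):
        eps_hat = predict_noise(x, k)
        x = (x - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
        if k > 0:  # no fresh noise on the final step
            x = x + np.sqrt(betas[k]) * rng.standard_normal(x.shape)
    return x

chunks = sample_chunks(1000)
went_right = chunks @ right > 0
nearest_mode = np.where(went_right[:, None], right, left)
max_error = np.abs(chunks - nearest_mode).max()

print(f"fraction that went right:         {went_right.mean():.2f}")
print(f"largest deviation from its mode:  {max_error:.2e}")
```

Every sample ends on one of the two expert chunks, and which one is decided only by the random draws. A learned denoiser would be approximate, so real samples land near a mode rather than exactly on it.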

Noise → action trajectory

A random, jagged action sequence (left) is denoised into a smooth, sensible trajectory (right). Drag the denoising progress. Re-roll the starting noise and a different valid trajectory emerges — that’s how it samples modes.

denoising progress: 0.00

In Diffusion Policy, what is being denoised?

The camera image
A sequence of future actions (a short trajectory), starting from pure noise
The neural network weights

Chapter 3: Action Chunks & Receding Horizon

A key design choice: Diffusion Policy doesn’t predict one action. It predicts a chunk of future actions at once (a horizon of, say, 16 steps). Then it executes only the first few (say 8), throws away the rest, observes again, and re-predicts a fresh chunk. Predicting several actions at once is often called action chunking; executing only part of the plan before replanning is receding-horizon control, borrowed from control theory (it is the core of model predictive control).
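
The bookkeeping is simple enough to sketch. Below, `predict_chunk` is a hypothetical placeholder that just labels each planned action with the timestep it targets, so you can see which predictions are executed and which are thrown away. The 16 and 8 are the example values from the text.

```python
import numpy as np

PREDICT_HORIZON = 16  # actions predicted per chunk
EXECUTE_STEPS = 8     # actions actually run before replanning

def predict_chunk(step):
    """Hypothetical stand-in for the policy: labels each planned action
    with the timestep it is meant for."""
    return np.arange(step, step + PREDICT_HORIZON)

executed = []   # timesteps whose action was actually run
discarded = 0   # planned actions that were never run
replans = 0
step = 0
while step < 32:
    chunk = predict_chunk(step)
    replans += 1
    executed.extend(chunk[:EXECUTE_STEPS].tolist())
    discarded += PREDICT_HORIZON - EXECUTE_STEPS
    step += EXECUTE_STEPS

print(f"replans: {replans}, executed: {len(executed)}, discarded: {discarded}")
```

Every timestep gets exactly one executed action, and the tail of each chunk serves only as lookahead.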

Why predict a chunk instead of one step? Three reasons. Temporal consistency: a committed sequence of actions is smooth and coherent, where independent per-step predictions jitter. Commitment to a mode: predicting the whole chunk means the policy commits to “go left” as a plan, instead of possibly flip-flopping left/right between steps and crashing in the middle. And robustness to idle/pauses: chunks help the policy power through moments where the right action is “wait” without freezing. Executing only part of the chunk before replanning keeps it reactive to new observations.

The mode-commitment insight: the obstacle problem isn’t just “pick left or right” once — it’s “keep picking left.” A per-step policy might sample left, then right, then left, and weave into the obstacle. Generating a whole chunk forces a single coherent decision across time — the chunk is the commitment.
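
A toy simulation makes the flip-flopping concrete. It assumes steering commands of -1 or +1 that add up to a lateral offset, and a hypothetical clearance of 8 units needed to get around the obstacle on either side. None of these numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
TOTAL_STEPS = 16
CLEARANCE = 8.0  # hypothetical lateral offset needed to clear the obstacle

def per_step_policy():
    # Re-sample left (-1) or right (+1) independently at every timestep.
    return rng.choice([-1.0, 1.0], size=TOTAL_STEPS)

def chunk_policy():
    # Sample the mode once, then hold it for the whole chunk.
    return np.full(TOTAL_STEPS, rng.choice([-1.0, 1.0]))

def clears_obstacle(steering):
    # Lateral offset is the running sum of steering; either side counts.
    return abs(steering.sum()) >= CLEARANCE

trials = 2000
per_step_success = np.mean([clears_obstacle(per_step_policy()) for _ in range(trials)])
chunk_success = np.mean([clears_obstacle(chunk_policy()) for _ in range(trials)])

print(f"per-step sampling clears the obstacle: {per_step_success:.1%}")
print(f"chunk sampling clears the obstacle:    {chunk_success:.1%}")
```

Both policies draw from the same two valid modes. The only difference is whether the draw is held across time.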

Predict a chunk, execute part, replan

The policy predicts a horizon (faint), executes the first part (solid), then re-observes and predicts again. Drag the execution-before-replan amount: small = very reactive but more compute; large = smoother but slower to adapt.

steps executed before replan: 6

Why predict a chunk of future actions instead of one action at a time?

To save memory
For temporal consistency and to commit to one mode across time, avoiding step-to-step flip-flopping into failure
Because single actions can’t be denoised

Chapter 4: Conditioning on What It Sees

A policy must act on observations, so the denoiser is conditioned on what the robot sees and feels: camera images (passed through a vision encoder like a ResNet or ViT) and proprioceptive state (joint angles, gripper). These observations are turned into a conditioning vector that steers the denoising — the action chunk is denoised toward a trajectory appropriate for the current scene.

This is the same idea as text-conditioning an image diffusion model, but the “prompt” is the robot’s observation. Two common ways to inject it: FiLM (the conditioning vector scales and shifts the denoiser’s features — used with the 1-D conv denoiser) or cross-attention (the action tokens attend to observation tokens — used with the transformer denoiser). Either way, the result is a conditional action distribution: given this scene, here are the good action sequences.
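
The two injection mechanisms look like this in a stripped-down NumPy sketch. All shapes and the random weights are illustrative assumptions, and a real denoiser would learn these matrices and stack many such layers.

```python
import numpy as np

rng = np.random.default_rng(0)
horizon, channels, cond_dim, n_obs_tokens = 16, 32, 64, 4

features = rng.standard_normal((horizon, channels))  # denoiser features, one row per action step
cond = rng.standard_normal(cond_dim)                 # encoded observation vector

# FiLM: a linear layer maps the conditioning vector to a per-channel scale and shift.
W_film = 0.1 * rng.standard_normal((cond_dim, 2 * channels))
scale, shift = np.split(cond @ W_film, 2)
film_out = scale * features + shift                  # same scale/shift at every timestep

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)          # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Cross-attention: action tokens (queries) attend to observation tokens (keys/values).
obs_tokens = rng.standard_normal((n_obs_tokens, channels))
W_q, W_k, W_v = (rng.standard_normal((channels, channels)) / np.sqrt(channels) for _ in range(3))
queries = features @ W_q
keys, values = obs_tokens @ W_k, obs_tokens @ W_v
attn = softmax(queries @ keys.T / np.sqrt(channels))  # (horizon, n_obs_tokens)
cross_out = features + attn @ values                  # residual connection

print(film_out.shape, cross_out.shape, attn.shape)
```

FiLM compresses the observation into one vector that modulates every timestep the same way, while cross-attention lets each action step weight the observation tokens differently.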

camera + state

images → vision encoder; joints → state vector

↓ conditioning vector

denoiser

conditioned via FiLM or cross-attention

↓

action chunk

trajectory appropriate to this observation

Observation steers the denoising

Move the obstacle (drag the slider). The conditioning changes, so the denoised action distribution shifts — the policy routes around wherever the obstacle now is. Same model, different scene, different actions.

obstacle position: 0.50

How does a Diffusion Policy know which way to act for the current scene?

It ignores observations and acts randomly
The denoiser is conditioned on encoded observations (vision + state) via FiLM or cross-attention, giving a conditional action distribution
It memorizes a fixed trajectory

Chapter 5: The Denoiser

What network does the denoising? Two popular choices, both standard and both presented in the original Diffusion Policy paper. The first is a 1-D convolutional U-Net over the time axis of the action sequence: treat the chunk of actions as a 1-D signal (time × action-dimensions) and denoise it like a short waveform, with observation conditioning injected by FiLM. The second is a transformer over the action tokens, with observations entering by cross-attention, which can work better in some high-dimensional or long-horizon settings.

Either way, the denoiser’s job is identical to an image denoiser’s: take the noisy action chunk plus the timestep plus the conditioning, and predict the noise to remove. Run it for several denoising steps and the chunk resolves from static into a clean trajectory. Note how little is “robotics-specific” here — it’s a generic denoiser pointed at action data. That generality is why Diffusion Policy slotted so easily into the broader diffusion toolkit, and why modern robot foundation models (VLAs like π-0) put a diffusion or flow-matching action head on top of a big vision-language backbone.
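
To make that interface concrete, here is a toy denoiser with the same inputs and output as described: noisy chunk, diffusion step, and conditioning in, predicted noise out. It is a two-layer 1-D temporal conv net with FiLM and random weights, far smaller than a real U-Net, and every size and schedule value is an assumption. The last lines compute the noise-prediction loss for one training example.

```python
import numpy as np

rng = np.random.default_rng(0)
horizon, action_dim, cond_dim, hidden = 16, 2, 8, 32
num_steps = 100
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.05, num_steps))  # assumed schedule

def timestep_embedding(k, dim=cond_dim):
    """Sinusoidal embedding of the diffusion step index."""
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    return np.concatenate([np.sin(k * freqs), np.cos(k * freqs)])

# Randomly initialised toy weights; a real denoiser would be trained.
params = {
    "conv_in": 0.1 * rng.standard_normal((3, action_dim, hidden)),   # kernel size 3 over time
    "film": 0.1 * rng.standard_normal((2 * cond_dim, 2 * hidden)),
    "conv_out": 0.1 * rng.standard_normal((3, hidden, action_dim)),
}

def conv1d_time(x, kernel):
    """Same-padded 1-D convolution along time. x: (T, C_in), kernel: (K, C_in, C_out)."""
    pad = kernel.shape[0] // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return sum(xp[i:i + x.shape[0]] @ kernel[i] for i in range(kernel.shape[0]))

def denoiser(noisy_chunk, k, obs_embedding):
    """Predict the noise in noisy_chunk given the step k and the observation."""
    cond = np.concatenate([obs_embedding, timestep_embedding(k)])
    h = conv1d_time(noisy_chunk, params["conv_in"])
    scale, shift = np.split(cond @ params["film"], 2)  # FiLM conditioning
    h = np.tanh(scale * h + shift)
    return conv1d_time(h, params["conv_out"])

# One training example: noise an expert chunk, then score the noise prediction.
clean_chunk = rng.standard_normal((horizon, action_dim))  # placeholder for an expert chunk
obs_embedding = rng.standard_normal(cond_dim)             # placeholder for encoded observations
k = int(rng.integers(num_steps))
noise = rng.standard_normal(clean_chunk.shape)
noisy_chunk = np.sqrt(alpha_bars[k]) * clean_chunk + np.sqrt(1.0 - alpha_bars[k]) * noise
predicted = denoiser(noisy_chunk, k, obs_embedding)
loss = float(np.mean((predicted - noise) ** 2))

print(predicted.shape, f"loss = {loss:.3f}")
```

Nothing in the function signature mentions robots. Swap the action chunk for pixels and the observation for a text embedding and the same skeleton describes an image denoiser.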

1-D temporal denoiser over the action chunk

The action chunk is a 1-D signal (time across, action dims stacked). The U-Net denoises it, with the observation conditioning (FiLM) scaling/shifting features. Step the denoising and watch the jagged chunk smooth into a trajectory.

denoise step: 0

What kind of network typically does the denoising in Diffusion Policy?

A decision tree
A standard denoiser (1-D conv U-Net or transformer) predicting the noise on the action chunk, conditioned on observations
A lookup table of trajectories

Chapter 6: The Closed Loop

Put it together into the control loop that runs on the real robot, many times per second:

1. observe

grab camera images + robot state

↓ encode → condition

2. denoise

noise → action chunk, conditioned on the observation

↓

3. execute part

run the first few actions on the robot

↻ back to 1 (replan)

Observe, denoise a fresh plan, execute a slice, repeat. The replanning keeps it reactive — if you nudge the object, the next observation reflects it and the next denoised chunk adapts. And because each chunk is internally coherent and committed to a mode, the motion stays smooth and decisive instead of dithering. This closed loop is what turns a generative model of trajectories into a working controller.
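
The loop can be sketched with a toy 1-D point robot. Here `plan_chunk` is a hypothetical stand-in for "encode the observation, then denoise a chunk": it simply plans bounded-speed steps toward the observed goal. The speeds, horizons, and goal positions are all made up. The goal is moved partway through an execution window to show when the plan reacts.

```python
import numpy as np

PREDICT_HORIZON, EXECUTE_STEPS = 16, 4
MAX_SPEED = 0.05

def observe(position, goal):
    # 1. observe: package what the robot currently sees.
    return np.array([position, goal])

def plan_chunk(obs):
    """2. denoise (stand-in): plan a chunk of velocities toward the observed goal."""
    position, goal = obs
    chunk = np.empty(PREDICT_HORIZON)
    for i in range(PREDICT_HORIZON):
        velocity = np.clip(goal - position, -MAX_SPEED, MAX_SPEED)
        chunk[i] = velocity
        position = position + velocity
    return chunk

position, goal = 0.0, 0.85
trajectory = [position]
for tick in range(60):
    if tick == 22:
        goal = 0.30  # the goal is moved mid-run, between two replans
    if tick % EXECUTE_STEPS == 0:
        # Replan from a fresh observation and keep only the first few actions.
        pending = list(plan_chunk(observe(position, goal))[:EXECUTE_STEPS])
    position += pending.pop(0)  # 3. execute one action of the current slice
    trajectory.append(position)

print(f"position just before the next replan: {trajectory[24]:.2f}")
print(f"final position:                       {position:.2f}")
```

The robot keeps following the stale slice for two ticks after the goal moves, then the next replan picks up the change. That delay is the price of executing more than one step per plan.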

Common misconception: “diffusion is too slow for real-time control.” It can be — running many denoising steps at every control tick is expensive. But because you only replan every few steps (not every step), and because faster samplers and flow-matching cut the step count dramatically, Diffusion Policies run at real robot control rates in practice. Speed was a real concern that engineering largely solved.
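
A back-of-the-envelope check shows why replanning less often and using fewer denoising steps both matter. Every number below is an illustrative assumption, not a measurement, and the check is crude: it only asks whether denoising one chunk fits inside the time it takes to execute the previous slice.

```python
def control_budget(denoise_steps, ms_per_network_call, execute_steps, control_hz):
    """Compare the time to denoise one chunk with the time its executed slice lasts."""
    inference_ms = denoise_steps * ms_per_network_call
    window_ms = execute_steps * 1000.0 / control_hz
    return inference_ms, window_ms, inference_ms <= window_ms

# Hypothetical robot: 10 Hz control, 10 ms per denoiser forward pass.
every_tick = control_budget(denoise_steps=100, ms_per_network_call=10.0,
                            execute_steps=1, control_hz=10)
chunked_fast = control_budget(denoise_steps=10, ms_per_network_call=10.0,
                              execute_steps=8, control_hz=10)

for name, (inference_ms, window_ms, fits) in [("replan every tick, 100 steps", every_tick),
                                              ("replan every 8 ticks, 10 steps", chunked_fast)]:
    print(f"{name}: {inference_ms:.0f} ms to plan, {window_ms:.0f} ms available, fits={fits}")
```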

The observe–denoise–execute loop

Press Run: the agent observes, denoises a chunk (faint), executes part (solid), then replans. Move the goal mid-run — the next plan adapts. This is the policy actually controlling.

goal position: 0.85

What keeps a Diffusion Policy reactive to changes in the environment?

It never changes its plan
It re-observes and re-denoises a fresh chunk every few steps (receding horizon), so new observations update the plan
It executes the entire chunk no matter what

Chapter 7: A Diffusion Policy in Action (showcase)

Watch an agent reach a goal past an obstacle that has two valid routes. A squared-error policy averages the routes and drives into the obstacle. The Diffusion Policy samples a chunk, commits to one route, executes it smoothly, and replans as it goes. Re-roll to see it sometimes pick left, sometimes right — both succeed.

Diffusion Policy vs. behavior cloning

Press Go. The teal agent (Diffusion Policy) commits to one side and reaches the goal; the orange agent (squared-error BC) heads for the average — the obstacle. Re-roll to see the diffusion agent pick a different valid route. Move the obstacle to test reactivity.

obstacle position: 0.50

The teal agent succeeds because it treats “which way to go” as a sample from a distribution and commits; the orange agent fails because it treats it as a number to average. Every idea in this lesson — modeling the distribution, denoising a chunk, conditioning on the scene, committing to a mode, replanning — is what separates the two paths.

Chapter 8: Trade-offs & the Bigger Picture

Strengths: handles multimodal tasks gracefully, trains stably (simple regression loss, no GAN instability or mode collapse), produces smooth coherent motion, excels at fine manipulation (assembly, pouring, cloth).
Cost: sampling needs multiple denoising steps per chunk — more compute than a one-shot policy. Mitigated by replanning only every few steps, faster samplers (DDIM), and especially flow matching / consistency models that cut denoising to a few or one step.
Data hungry & imitation-bound: it clones demonstrations, so it inherits their quality and can’t exceed the expert without added reinforcement learning.
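
As a sketch of the faster-sampler idea, here is deterministic DDIM (eta = 0) visiting 10 of 200 training steps. The setup is a toy: actions are scalars that are -1 or +1 with equal probability, the schedule is assumed, and `predict_noise` is the closed-form optimal noise estimate for that data rather than a trained network.

```python
import numpy as np

rng = np.random.default_rng(0)
num_train_steps = 200
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.05, num_train_steps))  # assumed schedule

def predict_noise(x, k):
    """Stand-in for a trained denoiser: exact for actions that are -1 or +1
    with equal probability."""
    a_bar = alpha_bars[k]
    clean_estimate = np.tanh(np.sqrt(a_bar) * x / (1.0 - a_bar))
    return (x - np.sqrt(a_bar) * clean_estimate) / np.sqrt(1.0 - a_bar)

def ddim_sample(n, num_sample_steps):
    # Visit an evenly spaced subset of the training steps, noisiest first.
    steps = np.linspace(num_train_steps - 1, 0, num_sample_steps).round().astype(int)
    x = rng.standard_normal(n)
    calls = 0
    for i, k in enumerate(steps):
        eps_hat = predict_noise(x, k)
        calls += 1
        clean_estimate = (x - np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alpha_bars[k])
        a_bar_prev = alpha_bars[steps[i + 1]] if i + 1 < len(steps) else 1.0
        # Deterministic DDIM update: jump to the next visited step, no fresh noise.
        x = np.sqrt(a_bar_prev) * clean_estimate + np.sqrt(1.0 - a_bar_prev) * eps_hat
    return x, calls

actions, network_calls = ddim_sample(n=1000, num_sample_steps=10)
mode_error = np.abs(np.abs(actions) - 1.0).max()

print(f"network calls per sample batch: {network_calls}")
print(f"fraction that went right:       {(actions > 0).mean():.2f}")
print(f"largest distance from a mode:   {mode_error:.2e}")
```

With this idealized denoiser, 10 calls still land on a mode every time. With a learned network, cutting steps usually costs some sample quality, which is the trade-off flow matching and consistency models try to improve.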

Diffusion Policy reframed robot learning: action generation is just generative modeling. That idea now anchors the frontier. Vision-Language-Action models (VLAs) like π-0 and its successors put a diffusion or flow-matching action head on a large pretrained vision-language backbone — the backbone understands the scene and instruction, the diffusion head generates the motion. The denoising-actions idea you just learned is the beating heart of modern robot foundation models.

Denoising steps vs. control rate

More denoising steps = better samples but slower. Standard diffusion (orange) needs many; flow matching / consistency (teal) needs few — bringing real-time control within reach. Drag the step budget.

denoising steps: 20

How do modern robot foundation models (VLAs like π-0) use this idea?

They abandon diffusion entirely
They put a diffusion/flow-matching action head on a large vision-language backbone: the backbone understands, the head generates motion
They only work in simulation

Chapter 9: Cheat Sheet & Connections

problem

squared-error BC averages multimodal actions → invalid (collision)

↓ model the distribution

diffusion over actions

denoise noise → action chunk; sampling lands in a mode

↓ predict a horizon

action chunking

predict many steps, execute a few, replan (commit to a mode)

↓ condition + loop

conditioned closed loop

obs (vision+state) via FiLM/cross-attn → observe-denoise-execute

Approach	Multimodal?	Training	Issue
MSE behavior cloning	no (averages)	stable	collides in the valley
Gaussian head	no (unimodal)	stable	same valley
GAN policy	yes	unstable	mode collapse
Diffusion Policy	yes	stable (regression)	sampling cost (fixable)

Keep exploring

→ Imitation Learning — behavior cloning and its failure modes
→ Diffusion — the denoising engine, in full
→ Flow Matching — the few-step speedup for real-time control
→ Vision-Language-Action Models — diffusion action heads on VLM backbones
→ Actor-Critic / RL — going beyond the expert

“What I cannot create, I do not understand.” You just rebuilt Diffusion Policy from one failure — averaging two good actions gives a bad one — and one fix: model the action distribution and sample it by denoising. Predict a chunk, commit to a mode, condition on the scene, and replan. The same denoising that paints images now drives robots.