Chi, Xu, Feng, Cousineau, Du, Burchfiel, Tedrake, Song — 2023

Diffusion Policy

Visuomotor policy learning via action diffusion — representing robot behavior as a conditional denoising diffusion process that iteratively refines noise into precise, multimodal action sequences.

Prerequisites: Diffusion models basics + Behavior cloning
10 chapters, 6+ simulations

Chapter 0: The Problem

You're teleoperating a robot arm to push a T-shaped block into a target zone. You show it 200 demonstrations. Then you step back and let the learned policy take over.

The robot freezes. Or it jitters in place. Or it confidently pushes the block the wrong way. What went wrong?

The core issue is that behavior cloning — learning to map observations to actions via supervised regression — has a hidden enemy: multimodal action distributions.

When you showed the robot how to push the T-block, sometimes you went around it from the left. Other times from the right. Both are valid. But a standard regression policy averages these two modes, producing an action that goes straight ahead — right into the block, achieving neither strategy.

The averaging catastrophe: When a human demonstrates two valid strategies (go left OR go right), a regression model learns the mean (go straight). The mean of two good strategies is often a terrible strategy. This is the fundamental failure mode of naive behavior cloning.
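The averaging effect can be reproduced in a few lines. The sketch below is a toy illustration, not the paper's setup: it assumes a hypothetical 1D steering action where half the demonstrations cluster near -1 (go left) and half near +1 (go right), all for the same observation. For a fixed observation, the constant prediction that minimizes mean squared error is the sample mean, so that is what an MSE regressor converges to.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1D "steering" actions for one fixed observation:
# half the demonstrations go left (near -1), half go right (near +1).
left = rng.normal(loc=-1.0, scale=0.05, size=100)
right = rng.normal(loc=+1.0, scale=0.05, size=100)
demos = np.concatenate([left, right])

# For a fixed observation, the constant that minimizes MSE is the sample mean.
mse_optimal_action = demos.mean()

# How far is that prediction from anything a demonstrator actually did?
gap_to_nearest_demo = np.abs(demos - mse_optimal_action).min()

print(f"MSE-optimal action: {mse_optimal_action:+.3f}")
print(f"Gap to nearest demonstration: {gap_to_nearest_demo:.3f}")
```

The MSE-optimal action lands near 0 (straight ahead), far from every demonstrated action.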

Previous approaches tried to fix this in different ways:

Explicit policies (Gaussian Mixture Models, categorical distributions) attempt to represent multiple modes directly. But GMMs require you to guess the number of modes upfront, and discretization scales exponentially with action dimensions.

Implicit policies (Energy-Based Models) learn an energy landscape over actions and find low-energy actions at inference. They can express arbitrary distributions — but they require negative sampling during training, which causes wild instability. Training loss goes down while actual performance oscillates unpredictably.

The Multimodal Problem

A robot can push the T-block from the left or right. Click "Sample Actions" to see what different policy types produce. The regression policy averages the two modes into a useless straight-ahead action.

A robot is shown two valid strategies: pushing a block left-then-up, and pushing it right-then-up. What will a standard MSE regression policy learn to do?

Chapter 1: The Key Insight

What if we didn't try to output actions directly at all? What if instead, we started with random noise and iteratively sculpted it into actions?

This is exactly what Diffusion Policy does. It borrows the denoising diffusion process from image generation — but instead of generating pixels, it generates robot actions.

The process works like this: at inference time, we sample a random action sequence from Gaussian noise. Then we run K denoising steps. Each step looks at the current noisy actions and the robot's visual observations, and nudges the actions toward a clean, valid trajectory. After K steps, noise becomes a precise action plan.

Think of it this way: Imagine a sculptor starting with a rough block of clay (noise). At each step, they look at a reference photo (the observation) and chip away a little more. They don't need to decide the final shape upfront — the form emerges gradually through iterative refinement. Different starting blocks lead to different sculptures — this is how multimodality arises naturally.

Why does this solve multimodality? Because the denoising process has two sources of randomness. First, the initial noise sample — different random starts land in different "basins" of the action distribution, naturally selecting one mode or another. Second, the stochastic perturbations added at each denoising step let samples explore and settle into nearby modes.

And crucially, diffusion models never need to compute the normalizing constant Z that makes Energy-Based Models unstable. The noise prediction network learns the gradient of the energy landscape — the direction to push actions toward validity — without ever needing to know the total volume of the distribution.

Start: sample an action sequence A^K from Gaussian noise N(0, I).
Denoise: for k = K down to 1, predict the noise ε_θ(O, A^k, k) and subtract it from A^k.
Result: A^0, a clean action sequence conditioned on the observation O.
How does Diffusion Policy naturally handle multimodal action distributions without explicitly modeling the number of modes?

Chapter 2: DDPM Primer

Before we apply diffusion to robot actions, we need to understand the machinery. A Denoising Diffusion Probabilistic Model (DDPM) has two processes: a forward process that gradually adds noise to data, and a reverse process that learns to undo the noise.

The forward process: destroying structure

Take a clean data sample x0 (in image generation, a real image; for us, a real action sequence). Over K steps, add progressively more Gaussian noise until xK is indistinguishable from pure noise.

The reverse process: creating structure

The reverse process is where learning happens. At each step k, a neural network εθ looks at the current noisy sample xk and the step number k, and predicts the noise that was added. We then subtract this predicted noise:

x^{k−1} = α (x^k − γ ε_θ(x^k, k) + N(0, σ² I))

Where α is a scaling factor (slightly less than 1 for stability), γ is the step size, and the N(0, σ²I) term adds a small amount of fresh noise — this is the "stochastic" part of stochastic Langevin dynamics.

Gradient descent in disguise: The denoising equation is really a noisy gradient step: x′ = x − γ∇E(x). The noise prediction network εθ is learning the gradient field ∇E(x) of an implicit energy function. Each denoising step walks downhill on the energy landscape toward clean data.
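To make the "noisy gradient step" picture concrete, here is a minimal sketch that runs the update rule on a toy problem. Everything specific in it is an assumption made for illustration: the double-well energy E(x) = (x² − 1)², the analytic gradient standing in for a learned ε_θ, and the values of α, γ, σ and the step count. It is not the paper's noise schedule.

```python
import numpy as np

def grad_energy(x):
    # Gradient of the toy double-well energy E(x) = (x^2 - 1)^2,
    # which has two minima ("modes") at x = -1 and x = +1.
    return 4.0 * x * (x**2 - 1.0)

def langevin_sample(n_samples, n_steps, alpha=1.0, gamma=0.01, sigma=0.05, seed=0):
    """Iterate x <- alpha * (x - gamma * grad E(x) + N(0, sigma^2)) from Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n_samples)  # start from pure noise
    for _ in range(n_steps):
        fresh_noise = sigma * rng.standard_normal(n_samples)
        x = alpha * (x - gamma * grad_energy(x) + fresh_noise)
    return x

samples = langevin_sample(n_samples=1000, n_steps=500)
frac_right = np.mean(samples > 0)
print(f"fraction in the +1 mode: {frac_right:.2f}")
print(f"mean |x|: {np.abs(samples).mean():.2f}")
```

Roughly half the samples settle near each minimum: the same update rule, started from different noise, recovers both modes instead of their average.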

Training: learning to predict noise

Training is elegantly simple. For each training example x0:

  1. Pick a random step k ∈ {1, ..., K}
  2. Sample random noise ε^k with the appropriate variance for step k
  3. Add that noise to get a noisy version: x^0 + ε^k
  4. Train ε_θ to predict the noise: minimize MSE(ε^k, ε_θ(x^0 + ε^k, k))

L = MSE(ε^k, ε_θ(x^0 + ε^k, k))

That's the entire training loss. No adversarial training, no ELBO tricks, no negative sampling. Just "predict what noise was added to this data."
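The four training steps can be sketched as a single loss function. This follows the simplified additive form used in the text (noisy sample = x^0 + ε^k); the standard DDPM forward process also rescales x^0 at each step, which is omitted here. The function name `ddpm_loss`, the linear `sigmas` schedule and the two stand-in "models" are assumptions for illustration only.

```python
import numpy as np

def ddpm_loss(eps_model, x0, sigmas, rng):
    """One Monte Carlo estimate of L = MSE(eps_k, eps_model(x0 + eps_k, k)).

    x0:     (batch, dim) clean samples
    sigmas: (K,) noise standard deviation for steps k = 1..K (assumed schedule)
    """
    batch = x0.shape[0]
    n_steps = len(sigmas)
    k = rng.integers(1, n_steps + 1, size=batch)      # 1. random step per example
    sigma_k = sigmas[k - 1][:, None]                  #    noise scale for that step
    eps_k = sigma_k * rng.standard_normal(x0.shape)   # 2. noise with step-k variance
    noisy = x0 + eps_k                                # 3. corrupt the clean sample
    pred = eps_model(noisy, k)                        # 4. predict the added noise
    return np.mean((eps_k - pred) ** 2)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((32, 4))
sigmas = np.linspace(0.01, 1.0, 100)

# A model that always predicts zero noise, and an "oracle" that cheats by knowing x0.
zero_loss = ddpm_loss(lambda noisy, k: np.zeros_like(noisy), x0, sigmas, rng)
oracle_loss = ddpm_loss(lambda noisy, k: noisy - x0, x0, sigmas, rng)
print(zero_loss, oracle_loss)
```

The oracle drives the loss to (numerically) zero, which is the behavior a trained ε_θ tries to approximate without access to x^0.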

The Denoising Process

Watch noise get refined into a clean 1D action trajectory. Click Play to run K=20 denoising steps. The network predicts noise (red arrows), and we subtract it to get cleaner actions.

What does the DDPM noise prediction network εθ learn to predict?

Chapter 3: From Images to Actions

Standard DDPMs generate images. Diffusion Policy generates robot actions. This requires two crucial modifications to the formulation.

Modification 1: Actions as the output

Instead of x being a 256×256 image, x is now At, a sequence of Tp future actions. If one action is a 7-dimensional vector (for example, a 6-DOF end-effector pose plus a gripper command) and we predict Tp = 16 steps ahead, the output is a 7×16 = 112-dimensional vector. That is much smaller than an image, but high-dimensional enough that simpler methods like GMMs struggle.

Modification 2: Conditioning on observations

Here's where the paper makes a critical design choice. Previous work (Diffuser, by Janner et al.) modeled the joint distribution p(At, Ot) — generating both actions and observations together. Diffusion Policy instead models the conditional distribution p(At | Ot).

Why does this matter? Because the observation Ot only needs to be encoded once, not re-encoded at every denoising step. With K=100 denoising iterations, that's 100× less vision computation.

A_t^{k−1} = α (A_t^k − γ ε_θ(O_t, A_t^k, k) + N(0, σ² I))

The training loss becomes:

L = MSE(ε^k, ε_θ(O_t, A_t^0 + ε^k, k))
Joint vs conditional — why it matters: Modeling p(A, O) jointly means the diffusion process denoises both future actions and future observations. That's much harder and much slower — you're trying to predict what the world looks like AND what to do. Conditioning on O lets you focus entirely on "given what I see, what should I do?" The vision encoder runs once, and K denoising iterations refine only the actions.
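The "encode once, denoise K times" structure can be sketched as follows. The encoder and noise model here are hypothetical stand-ins (a mean over observation frames and the gradient of a quadratic energy), chosen only so the loop is runnable; fresh noise is switched off (sigma = 0) so the result is deterministic and easy to check.

```python
import numpy as np

def sample_actions(obs, encode, eps_model, n_steps, action_shape, rng,
                   alpha=1.0, gamma=0.1, sigma=0.0):
    """Conditional denoising: the observation is encoded once, outside the loop."""
    cond = encode(obs)                                # runs once per inference
    actions = rng.standard_normal(action_shape)       # A^K ~ N(0, I)
    for k in range(n_steps, 0, -1):                   # k = K, ..., 1
        fresh_noise = sigma * rng.standard_normal(action_shape)
        actions = alpha * (actions - gamma * eps_model(cond, actions, k) + fresh_noise)
    return actions

encode_calls = {"count": 0}

def toy_encode(obs):
    # Hypothetical stand-in for the vision encoder: average the To frames.
    encode_calls["count"] += 1
    return obs.mean(axis=0)

def toy_eps_model(cond, actions, k):
    # Hypothetical stand-in for eps_theta: points away from the conditioned target.
    return actions - cond

rng = np.random.default_rng(0)
obs = np.array([[0.2, 0.4], [0.4, 0.8]])              # To = 2 frames, 2 features each
plan = sample_actions(obs, toy_encode, toy_eps_model,
                      n_steps=100, action_shape=(16, 2), rng=rng)
print(encode_calls["count"], plan[0])
```

With 100 denoising steps the encoder is still called exactly once, and the 16-step plan converges to the target implied by the observation.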

The observation encoding

The observation Ot isn't just the current frame — it's the last To frames of visual and proprioceptive data. Different camera views use separate ResNet-18 encoders (with spatial softmax pooling instead of global average pooling, and GroupNorm instead of BatchNorm). The visual embeddings are concatenated with proprioceptive state to form Ot.

Why GroupNorm? Because DDPMs use Exponential Moving Average (EMA) of model weights during training. BatchNorm's running statistics clash with EMA — GroupNorm doesn't have running statistics, so it plays nicely with the EMA schedule.

Why does Diffusion Policy model p(A|O) rather than the joint p(A, O)?

Chapter 4: Architecture

The noise prediction network εθ is the heart of Diffusion Policy. The paper proposes two architectures, each with different strengths.

Option A: CNN-based (1D Temporal Convolution)

This architecture treats the action sequence as a 1D signal and applies temporal convolutions. The key mechanism for injecting observation information is Feature-wise Linear Modulation (FiLM):

FiLM(x) = γ(Ot) ⊙ x + β(Ot)

The observation features Ot produce per-channel scale (γ) and shift (β) parameters that modulate every convolutional layer. (This γ is the FiLM scale, not the step size γ from the denoising update.) The denoising step k is also injected via FiLM conditioning.
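A minimal sketch of FiLM on a 1D feature map, assuming the scale and shift come from plain linear maps of the conditioning vector. The weight names and shapes are illustrative and not taken from the paper's implementation.

```python
import numpy as np

def film(x, cond, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise linear modulation of a (channels, time) feature map.

    cond: (cond_dim,) conditioning vector (observation features, step embedding).
    Each channel gets one scale and one shift, shared across all time steps.
    """
    gamma = W_gamma @ cond + b_gamma      # (channels,) per-channel scale
    beta = W_beta @ cond + b_beta         # (channels,) per-channel shift
    return gamma[:, None] * x + beta[:, None]

x = np.arange(12.0).reshape(3, 4)         # 3 channels, 4 time steps
cond = np.array([1.0, -1.0])

# With zero weights, scale 1 and shift 0, FiLM is the identity.
identity = film(x, cond, np.zeros((3, 2)), np.ones(3), np.zeros((3, 2)), np.zeros(3))

# A conditioning-dependent scale: channel 0 kept, channel 1 negated, channel 2 zeroed.
W_gamma = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
out = film(x, cond, W_gamma, np.zeros(3), np.zeros((3, 2)), np.ones(3))
print(out)
```

The conditioning never touches the time axis directly: it only decides how strongly each channel is expressed.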

The CNN backbone works well out-of-the-box on most tasks with minimal hyperparameter tuning. But it has a weakness: temporal convolutions have an inductive bias toward smooth, low-frequency signals. If your task requires sharp, rapid action changes (like velocity control), the CNN smears them out.

Option B: Time-Series Diffusion Transformer

To handle high-frequency action changes, the paper introduces a transformer-based architecture. Each noisy action in the sequence Atk becomes an input token. The denoising step k is prepended as a special token (sinusoidal embedding). The observation Ot is projected into an embedding sequence and injected via cross-attention in each transformer decoder block.

Causal attention ensures each action token only attends to itself and previous actions — preserving the temporal ordering. The gradient εθ is predicted by each corresponding output token.
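The causal constraint amounts to masking out future positions before the softmax. The sketch below shows only that masking step on a raw score matrix; the function name and the random scores are assumptions, and a real transformer block would add projections, multiple heads and cross-attention to the observation embeddings.

```python
import numpy as np

def causal_attention_weights(scores):
    """Row-wise softmax over a (T, T) score matrix with future positions masked.

    Row i holds the attention weights of action token i over tokens 0..i.
    """
    T = scores.shape[-1]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)    # True above the diagonal
    masked = np.where(future, -np.inf, scores)             # future tokens get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(masked)                                # exp(-inf) == 0
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
weights = causal_attention_weights(rng.standard_normal((5, 5)))
print(np.round(weights, 2))
```

The first token can only attend to itself, and every weight above the diagonal is exactly zero.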

CNN vs Transformer — practical guidance: Start with the CNN. It's robust, fast to train, and works well on most tasks. If performance is low because the task demands rapid action changes (velocity control, high-frequency manipulation), switch to the Transformer — it captures sharp transitions better but requires more careful hyperparameter tuning.
Architecture Comparison

Two architectures for the noise prediction network. Toggle to see how each processes noisy actions conditioned on observations.

Why does the CNN backbone struggle with velocity-control action spaces?

Chapter 5: Action Chunking

This is where Diffusion Policy's design gets truly clever. Instead of predicting a single action at each timestep, it predicts a whole chunk of future actions — and only executes a portion of them before re-planning.

Three horizons

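The system uses three horizon parameters:

  1. To, the observation horizon: how many of the most recent observation steps the policy conditions on (To = 2 here).
  2. Tp, the prediction horizon: how many future actions each denoising run generates (Tp = 16).
  3. Ta, the action horizon: how many of those predicted actions are executed before re-planning (Ta = 8).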

At time t, the policy sees the last To observations, generates Tp future actions, executes Ta of them, then re-plans with fresh observations. This is receding horizon control — a classic idea from control theory, now powered by diffusion.

The critical tradeoff: Ta controls the balance between temporal consistency (high Ta = commit to the plan longer) and responsiveness (low Ta = react to changes faster). Too high and the robot can't adapt; too low and it jitters between modes. The paper found Ta = 8 works best for most tasks.

Why predict more than you execute?

If Ta = 8 and Tp = 16, we predict 16 actions but only use the first 8. Why waste compute on actions we'll throw away?

Because predicting further into the future gives the model context. To generate good actions for the next 8 steps, the model needs to "think ahead" about where the trajectory is going. The unused tail actions are a form of lookahead planning that improves the quality of the executed actions.

There's also a warm-starting trick: when re-planning, the new denoising process is initialized with the previous prediction (shifted by Ta steps) rather than pure noise. This further smooths transitions between consecutive plans.
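The predict-Tp, execute-Ta loop can be sketched as below. The policy and environment are hypothetical toys (a 1D point that moves by the commanded delta), and the sketch re-plans from scratch each time rather than warm-starting; only the control flow is the point.

```python
import numpy as np

def run_receding_horizon(policy, env_step, obs0, total_steps, To=2, Tp=16, Ta=8):
    """Predict Tp actions, execute the first Ta, then re-plan from fresh observations."""
    obs_history = [obs0] * To                         # pad history with the first frame
    executed, n_plans = [], 0
    while len(executed) < total_steps:
        plan = policy(np.stack(obs_history[-To:]))    # (Tp, action_dim)
        assert plan.shape[0] == Tp
        n_plans += 1
        for action in plan[:Ta]:                      # the tail plan[Ta:] is discarded
            if len(executed) == total_steps:
                break
            obs_history.append(env_step(action))
            executed.append(action)
    return np.stack(executed), n_plans

GOAL = np.array([1.0])
state = {"x": np.zeros(1)}

def toy_env_step(action):
    # Hypothetical environment: the action is a position delta.
    state["x"] = state["x"] + action
    return state["x"].copy()

def toy_policy(obs_window):
    # Hypothetical policy: spread the remaining distance evenly over 16 steps.
    delta = (GOAL - obs_window[-1]) / 16
    return np.tile(delta, (16, 1))

actions, n_plans = run_receding_horizon(toy_policy, toy_env_step, np.zeros(1), total_steps=20)
print(n_plans, state["x"])
```

With 20 steps and Ta = 8 the policy is queried three times; lowering Ta increases the number of re-plans (more responsive), raising it commits to each plan for longer.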

Receding Horizon Control

Watch how the policy predicts a chunk of actions, executes a subset, then re-plans. Drag the Ta slider to see the consistency-responsiveness tradeoff. Blue = executed actions, gray = predicted-but-discarded lookahead.


Solving idle actions

During teleoperation, demonstrators sometimes pause — producing sequences of identical positions or near-zero velocities. Single-step policies overfit to these pauses and get permanently stuck. But with action chunking, the model sees the pause as part of a longer trajectory that eventually continues. The surrounding context prevents overfitting to the idle segment.

In Diffusion Policy's receding horizon scheme, why does the model predict Tp = 16 actions when only Ta = 8 are executed?

Chapter 6: Why Diffusion Wins

Diffusion Policy doesn't just match alternatives — it outperforms them by an average of 46.9% across 15 tasks. Let's understand why through three key advantages.

Advantage 1: Training stability

Implicit policies (IBC) represent actions using Energy-Based Models. To train an EBM, you need to estimate the intractable normalizing constant Z(o, θ) via negative sampling:

pθ(a|o) = e−Eθ(o,a) / Z(o, θ)

The InfoNCE loss uses negative samples to approximate Z, but inaccurate negative sampling causes training instability. Energy goes down, but actual policy performance oscillates wildly — making checkpoint selection a nightmare.

Diffusion Policy sidesteps Z entirely. The noise prediction network approximates the score function — the gradient of log p(a|o):

∇_a log p(a|o) = −∇_a E_θ(o, a) − ∇_a log Z(o, θ)

The key: ∇_a log Z(o, θ) = 0 because Z doesn't depend on the action a. So the score function, and therefore the noise prediction, is independent of Z. No negative sampling is needed, which removes the main source of IBC's training instability.

Sanity check: Why is ∇a log Z = 0? Because Z(o, θ) = ∫ e−E(o,a′) da′ integrates over ALL actions a′. It's a constant with respect to any particular action a. Taking the gradient of a constant is zero.
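The same sanity check can be run numerically. The sketch below assumes a hypothetical 1D double-well energy for one fixed observation, computes Z by numerical integration on a grid, and compares the finite-difference gradient of log p with the negative gradient of the energy.

```python
import numpy as np

# Grid over a 1D action space for one fixed observation o.
a = np.linspace(-3.0, 3.0, 2001)
da = a[1] - a[0]

energy = (a**2 - 1.0) ** 2                 # hypothetical energy E(o, a)
Z = np.sum(np.exp(-energy)) * da           # numerical normalizing constant
log_p = -energy - np.log(Z)                # log p(a|o) = -E(o, a) - log Z

score = np.gradient(log_p, a)              # finite-difference grad_a log p(a|o)
neg_grad_energy = -np.gradient(energy, a)  # -grad_a E(o, a), computed without Z

max_diff = np.abs(score - neg_grad_energy).max()
print(f"Z = {Z:.4f}, max |score - (-grad E)| = {max_diff:.2e}")
```

Z is far from trivial to compute in high dimensions, yet it drops out of the score entirely: the two gradients agree to floating-point precision.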

Advantage 2: Synergy with position control

A surprising finding: Diffusion Policy + position control consistently beats Diffusion Policy + velocity control, even though most prior work uses velocity control. Two reasons:

  1. Position control has more pronounced multimodality (different positions for left vs right strategy). Diffusion Policy handles multimodality better than alternatives, so it benefits MORE from position control's expressiveness.
  2. Position control suffers less from compounding errors. A small velocity error accumulates over time; a small position error stays small.
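The second point can be illustrated with a small worked example. It assumes a deliberately simple setting: a constant per-step prediction error, unit time steps, and a low-level controller that reaches each commanded position exactly. Real errors are neither constant nor perfectly tracked, so this shows only the trend.

```python
import numpy as np

T = 50                                     # number of control steps
err = 0.01                                 # assumed constant per-step prediction error

target_velocity = np.ones(T)
target_position = np.cumsum(target_velocity)   # unit time step

# Velocity control: the error is integrated along with the command.
vel_positions = np.cumsum(target_velocity + err)
vel_error = np.abs(vel_positions - target_position)

# Position control: each command is off by err, but nothing accumulates.
pos_positions = target_position + err
pos_error = np.abs(pos_positions - target_position)

print(f"velocity-control error after {T} steps: {vel_error[-1]:.2f}")
print(f"position-control error after {T} steps: {pos_error[-1]:.2f}")
```

Under these assumptions the velocity-control error grows linearly with the number of steps (T × err), while the position-control error stays at err.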

Advantage 3: Latency robustness

Because Diffusion Policy predicts a sequence of future actions, it naturally handles the latency between observation capture and action execution. Even with 4 timesteps of latency, performance barely drops — the predicted action sequence already accounts for the near future.

Training Stability: Diffusion Policy vs IBC

Compare training curves. IBC (orange) shows decreasing loss but oscillating evaluation performance — you can't tell which checkpoint is best. Diffusion Policy (teal) converges smoothly and stays stable.

Why is Diffusion Policy's training more stable than IBC's?

Chapter 7: Experiments

The paper evaluates Diffusion Policy across 15 tasks from 4 benchmarks. The breadth is impressive: simulated and real, 2-DOF to 6-DOF, single and multi-arm, rigid and fluid objects, single-user and multi-user demonstrations.

Benchmark tasks

Robomimic (5 tasks): Lift, Can, Square, Transport, ToolHang — progressing from simple pick-place to complex bimanual manipulation and precision tool hanging. Each has proficient-human (PH) and mixed-human (MH) demonstration variants.

Push-T: Push a T-shaped block into a target zone with a circular end-effector. Requires exploiting contact dynamics and handling multimodal approach strategies.

Block Push: Push two blocks into two target squares in any order. Tests long-horizon multimodality — which block first?

Franka Kitchen: Interact with 7 kitchen objects (burners, microwave, etc.) in arbitrary order. Tests both short-horizon and long-horizon multimodality.

Results: consistent dominance

Across every task and variant, Diffusion Policy matches or exceeds all baselines. The improvements are largest on the hardest tasks.

Simulation Results

Success rates across key simulation tasks. Diffusion Policy (both CNN and Transformer variants) consistently outperforms LSTM-GMM, IBC, and BET.

The multimodality gap: The improvement is most dramatic on tasks with strong multimodality. On Block Push (long-horizon, which-block-first decisions), Diffusion-T achieves 94% vs BET's 71% on p2. On Kitchen (7-object, arbitrary-order interaction), Diffusion-T gets 96% on p4 vs BET's 44%. Diffusion Policy doesn't just handle multimodality — it thrives on it.

Key ablation findings

Action horizon Ta: Optimal at 8 for most tasks. Ta = 1 gives reactive but jittery behavior; Ta = 16+ is smooth but unresponsive.

Position vs velocity: Switching from velocity to position control improves Diffusion Policy while hurting baselines. The combination of action chunking + position control + diffusion creates a compounding advantage.

Vision encoder: End-to-end training beats frozen pretrained encoders. Fine-tuning pretrained CLIP ViT-B/16 with 10× lower learning rate gives the best results (98% on Square) but the gap is small for simpler tasks.

On which type of task does Diffusion Policy show the LARGEST improvement over baselines?

Chapter 8: Real-World Results

Simulation numbers are encouraging, but the real test is hardware. The paper deploys Diffusion Policy on 7 real-world tasks across 2 hardware setups, including 3 challenging bimanual tasks.

Push-T (Real): 95% success, near-human

The real-world Push-T is significantly harder than simulation. It's multi-stage: (1) push the T-block into the target, then (2) move the end-effector to an end-zone to avoid occluding the camera. The transition between stages is highly multimodal, because the precise moment to stop pushing and start retreating varies.

Diffusion Policy achieves 95% success rate and 0.80 IoU (human: 100%, 0.84 IoU). LSTM-GMM gets 20%, IBC gets 0%. The failure mode of baselines is revealing: LSTM-GMM gets stuck near the T-block (overfitting to idle actions during fine adjustment), while IBC prematurely stops pushing.

Mug Flipping (6-DOF): 90% success

The robot must pick up a randomly placed mug, orient it lip-down, then rotate it so the handle points left. The demonstration data is wildly multimodal: forehand vs backhand grasps, direct placement vs push-to-rotate, varying grasp adjustments. Despite never being demonstrated, the policy can even sequence multiple pushes or re-grasp a dropped mug.

Sauce tasks: fluid manipulation

Two tasks test manipulation of non-rigid, fluid objects. Sauce pouring: scoop sauce with a ladle, pour it centered on pizza dough (0.74 IoU vs 0.79 human). Sauce spreading: spread sauce in a spiral pattern (0.77 coverage vs 0.79 human). Both require long idle periods (waiting for viscous sauce to fill the ladle) and periodic motions (spiral spreading) — known failure modes for standard behavior cloning.

Robustness to perturbation: During real Push-T experiments, the authors tested three perturbations: (1) blocking the camera for 3 seconds — the policy jittered but stayed on course; (2) shifting the T-block during pushing — the policy immediately re-planned to push from the opposite direction; (3) moving the T-block after "completion" — the policy abandoned its retreat and returned to re-position the block. Perturbation (3) was never demonstrated — the policy synthesized novel recovery behavior.

Bimanual tasks: the frontier

Three bimanual tasks push the limits further:

Critically, Diffusion Policy worked out of the box on all bimanual tasks with the same hyperparameters as single-arm tasks. No tuning required.

Real-World Task Gallery

Success rates across 7 real-world tasks. Diffusion Policy approaches human performance on most tasks.

What was the most surprising behavior observed in real-world Push-T experiments?

Chapter 9: Connections

What Diffusion Policy built on

DDPM (Ho et al., 2020): The foundational denoising diffusion model that Diffusion Policy adapts from image generation to action generation.

Diffuser (Janner et al., 2022): An early use of diffusion for robot planning that models the joint p(A, O) rather than the conditional p(A|O). It therefore has to denoise future observations as well as actions, which makes inference slower.

IBC (Florence et al., 2021): Implicit behavioral cloning via energy-based models. Showed the promise of non-explicit policies but suffered from training instability. Diffusion Policy achieves IBC's expressiveness without its instability.

BET (Shafiullah et al., 2022): Behavior Transformers using k-means action clustering. Handles multimodality but fails to maintain temporal consistency across steps.

What Diffusion Policy enabled

pi-0 (Physical Intelligence, 2024): A robot foundation model that combines a VLM with flow matching (a continuous-time relative of diffusion) for actions. Diffusion Policy showed that diffusion-based policy representations work; pi-0 scaled the idea to 7 robot types and 68 tasks.

3D Diffusion Policy (Ze et al., 2024): Extends the approach to 3D visual representations using point clouds instead of RGB images.

Consistency Policy (Prasad et al., 2024): Applies consistency distillation to speed up inference from ~10 denoising steps to a single step while maintaining performance.

The connection to control theory

For simple linear systems where the expert uses a linear feedback policy a_t = −K s_t (here K is the feedback gain matrix, not the number of denoising steps), Diffusion Policy provably recovers the exact policy. The optimal denoiser becomes ε_θ(s, a, k) = (a + K s) / σ_k, and DDIM sampling converges to a = −K s. For multi-step prediction, it implicitly learns the dynamics model: a_{t+t′} = −K (A − BK)^{t′} s_t.
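The multi-step formula can be checked against a direct rollout. The sketch assumes noise-free dynamics s_{t+1} = A s_t + B a_t and uses hypothetical values for A, B, the gain K and the initial state; it verifies only the closed-form action sequence, not the denoiser itself.

```python
import numpy as np

# Hypothetical double-integrator-like system and feedback gain.
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
K = np.array([[1.0, 1.5]])
s_t = np.array([1.0, -0.5])

# Roll the closed loop forward: a = -K s, then s <- A s + B a.
s = s_t.copy()
rollout_actions = []
for _ in range(8):
    a = -K @ s
    rollout_actions.append(a)
    s = A @ s + B @ a

# Closed form from the text: a_{t+t'} = -K (A - BK)^{t'} s_t.
closed_loop = A - B @ K
closed_form = [-K @ np.linalg.matrix_power(closed_loop, tp) @ s_t for tp in range(8)]

print(np.stack(rollout_actions).ravel())
print(np.stack(closed_form).ravel())
```

Both sequences coincide: predicting a chunk of future expert actions from the current state alone requires knowing the closed-loop matrix A − BK, which is why multi-step prediction forces the model to absorb the dynamics.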

The big picture: Diffusion Policy shifted the robotics community's approach to behavior cloning. Before this paper, the debate was "explicit vs implicit policies." After it, the question became "how do we scale diffusion-based policies?" That question helped set the stage for the generalist robot policies and VLAs that followed (Octo, OpenVLA, pi-0, pi-0.5), several of which generate actions with diffusion or flow matching.

Cheat sheet

Core equation
A^{k−1} = α (A^k − γ ε_θ(O, A^k, k) + N(0, σ² I))
Training loss
L = MSE(ε^k, ε_θ(O, A^0 + ε^k, k))
Horizons
To = 2 (observe), Tp = 16 (predict), Ta = 8 (execute)
Key finding
Position control + action chunking + diffusion = 46.9% avg improvement
Architecture
CNN (FiLM) for most tasks; Transformer for high-frequency actions
What is the key advantage of Diffusion Policy over Diffuser (Janner et al., 2022)?