Diffusion Policy

Chapter 0: The Problem

You're teleoperating a robot arm to push a T-shaped block into a target zone. You show it 200 demonstrations. Then you step back and let the learned policy take over.

The robot freezes. Or it jitters in place. Or it confidently pushes the block the wrong way. What went wrong?

The core issue is that behavior cloning — learning to map observations to actions via supervised regression — has a hidden enemy: multimodal action distributions.

When you showed the robot how to push the T-block, sometimes you went around it from the left. Other times from the right. Both are valid. But a standard regression policy averages these two modes, producing an action that goes straight ahead — right into the block, achieving neither strategy.

The averaging catastrophe: When a human demonstrates two valid strategies (go left OR go right), a regression model learns the mean (go straight). The mean of two good strategies is often a terrible strategy. This is the fundamental failure mode of naive behavior cloning.
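A minimal numeric sketch of the averaging effect, assuming a made-up one-dimensional steering action where -1 means "go left" and +1 means "go right". The values and names are illustrative and do not come from the paper.

```python
import numpy as np

# Hypothetical 1D demonstrations: half steer left (-1), half steer right (+1).
demo_actions = np.array([-1.0] * 100 + [1.0] * 100)

def mse(prediction, targets):
    """Mean squared error of one constant prediction against all targets."""
    return float(np.mean((targets - prediction) ** 2))

# For a fixed observation, search for the constant action with the lowest MSE.
candidates = np.linspace(-1.5, 1.5, 301)
losses = np.array([mse(c, demo_actions) for c in candidates])
best_action = float(candidates[np.argmin(losses)])

loss_at_mean = mse(0.0, demo_actions)   # the "go straight" action
loss_at_left = mse(-1.0, demo_actions)  # one of the demonstrated modes
```

The search settles on the midpoint between the two modes: regression prefers an action nobody demonstrated over either demonstrated mode.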

Previous approaches tried to fix this in different ways:

Explicit policies (Gaussian Mixture Models, categorical distributions) attempt to represent multiple modes directly. But GMMs require you to guess the number of modes upfront, and discretization scales exponentially with action dimensions.

Implicit policies (Energy-Based Models) learn an energy landscape over actions and find low-energy actions at inference. They can express arbitrary distributions — but they require negative sampling during training, which causes wild instability. Training loss goes down while actual performance oscillates unpredictably.

The Multimodal Problem

[Interactive figure: actions sampled from different policy types when the T-block can be pushed from the left or from the right. The regression policy averages the two modes into a useless straight-ahead action.]

A robot is demonstrated two valid strategies: pushing a block left-then-up, and pushing it right-then-up. A standard MSE regression policy will learn to:

Push straight up (the average of left and right), which crashes into the block
Randomly choose left or right each episode
Alternate between left and right on consecutive timesteps

Chapter 1: The Key Insight

What if we didn't try to output actions directly at all? What if instead, we started with random noise and iteratively sculpted it into actions?

This is exactly what Diffusion Policy does. It borrows the denoising diffusion process from image generation — but instead of generating pixels, it generates robot actions.

The process works like this: at inference time, we sample a random action sequence from Gaussian noise. Then we run K denoising steps. Each step looks at the current noisy actions and the robot's visual observations, and nudges the actions toward a clean, valid trajectory. After K steps, noise becomes a precise action plan.

Think of it this way: Imagine a sculptor starting with a rough block of clay (noise). At each step, they look at a reference photo (the observation) and chip away a little more. They don't need to decide the final shape upfront — the form emerges gradually through iterative refinement. Different starting blocks lead to different sculptures — this is how multimodality arises naturally.

Why does this solve multimodality? Because the denoising process has two sources of randomness. First, the initial noise sample — different random starts land in different "basins" of the action distribution, naturally selecting one mode or another. Second, the stochastic perturbations added at each denoising step let samples explore and settle into nearby modes.

And crucially, diffusion models never need to compute the normalizing constant Z that makes Energy-Based Models unstable. The noise prediction network learns the gradient of the energy landscape — the direction to push actions toward validity — without ever needing to know the total volume of the distribution.

Start: Sample action sequence A^K from Gaussian noise N(0, I)

Denoise: For k = K to 1, predict noise ε_θ(O, A^k, k) and subtract it from A^k

Result: A⁰ = clean action sequence conditioned on observation O

How does Diffusion Policy naturally handle multimodal action distributions without explicitly modeling the number of modes?

It discretizes the action space into enough bins to cover all modes
Different random noise initializations land in different basins of attraction, naturally selecting different modes
It uses a GMM to model the modes explicitly

Chapter 2: DDPM Primer

Before we apply diffusion to robot actions, we need to understand the machinery. A Denoising Diffusion Probabilistic Model (DDPM) has two processes: a forward process that gradually adds noise to data, and a reverse process that learns to undo the noise.

The forward process: destroying structure

Take a clean data sample x⁰ (in image generation, a real image; for us, a real action sequence). Over K steps, add progressively more Gaussian noise until x^K is indistinguishable from pure noise.

The reverse process: creating structure

The reverse process is where learning happens. At each step k, a neural network ε_θ looks at the current noisy sample x^k and the step number k, and predicts the noise that was added. We then subtract this predicted noise:

x^{k−1} = α(x^k − γ ε_θ(x^k, k) + N(0, σ²I))

Where α is a scaling factor (slightly less than 1 for stability), γ is the step size, and the N(0, σ²I) term adds a small amount of fresh noise — this is the "stochastic" part of stochastic Langevin dynamics.

Gradient descent in disguise: The denoising equation is really a noisy gradient step: x′ = x − γ∇E(x). The noise prediction network ε_θ is learning the gradient field ∇E(x) of an implicit energy function. Each denoising step walks downhill on the energy landscape toward clean data.
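The noisy gradient step can be tried on a toy problem. This sketch assumes a hand-written double-well energy E(x) = (x² − 1)² whose known gradient stands in for the learned ε_θ. It sets α = 1 and uses a made-up noise schedule; none of these choices come from the paper.

```python
import numpy as np

def energy_grad(x):
    """Gradient of the toy energy E(x) = (x^2 - 1)^2, with minima at -1 and +1."""
    return 4.0 * x * (x ** 2 - 1.0)

def langevin_sample(num_samples=1000, num_steps=200, step_size=0.01, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(num_samples)       # start from pure Gaussian noise
    for k in range(num_steps, 0, -1):          # k = K, ..., 1
        noise_scale = 0.05 * k / num_steps     # hypothetical schedule: less noise near the end
        fresh_noise = noise_scale * rng.standard_normal(num_samples)
        x = x - step_size * energy_grad(x) + fresh_noise
    return x

samples = langevin_sample()
frac_right = float(np.mean(samples > 0))
frac_near_mode = float(np.mean(np.abs(np.abs(samples) - 1.0) < 0.2))
```

Each sample ends near one of the two minima, and which one depends on the random start and the injected noise. The number of modes is never specified anywhere in the sampler.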

Training: learning to predict noise

Training is elegantly simple. For each training example x⁰:

Pick a random step k ∈ {1, ..., K}
Sample random noise ε^k with the appropriate variance for step k
Add that noise to get a noisy version: x⁰ + ε^k
Train ε_θ to predict the noise: minimize MSE(ε^k, ε_θ(x⁰ + ε^k, k))

L = MSE(ε^k, ε_θ(x⁰ + ε^k, k))

That's the entire training loss. No adversarial training, no ELBO tricks, no negative sampling. Just "predict what noise was added to this data."
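The four training steps above can be sketched as follows. The linear noise schedule `sigmas`, the value of K, and the function names are assumptions for illustration. Full DDPM implementations also rescale x⁰ before adding noise, which this sketch omits to match the simplified form used here.

```python
import numpy as np

K = 100                                # number of diffusion steps (illustrative)
sigmas = np.linspace(0.01, 1.0, K)     # hypothetical noise scale for steps 1..K

def make_training_example(x0, rng):
    """Return (noisy sample, step k, noise) for one clean sample x0."""
    k = int(rng.integers(1, K + 1))                      # random step in {1, ..., K}
    eps = sigmas[k - 1] * rng.standard_normal(x0.shape)  # noise with step-k variance
    return x0 + eps, k, eps

def noise_prediction_loss(eps_pred, eps):
    """MSE between the predicted noise and the noise that was actually added."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)
x0 = np.array([0.5, -0.2, 0.1])        # a made-up clean sample
noisy, k, eps = make_training_example(x0, rng)

# A perfect predictor returns eps exactly; always predicting zero does worse.
perfect_loss = noise_prediction_loss(eps, eps)
zero_loss = noise_prediction_loss(np.zeros_like(eps), eps)
```

In a real training loop, `eps_pred` would be the output of the network ε_θ given `noisy` and `k`, and the loss would be minimized with a gradient-based optimizer.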

The Denoising Process

[Interactive figure: K = 20 denoising steps refine pure noise into a clean 1D action trajectory. The network predicts the noise (red arrows), which is subtracted to get cleaner actions.]

What does the DDPM noise prediction network ε_θ learn to predict?

The noise that was added to the clean data at step k, so we can subtract it to recover cleaner data
The clean data directly
The probability of each possible action

Chapter 3: From Images to Actions

Standard DDPMs generate images. Diffusion Policy generates robot actions. This requires two crucial modifications to the formulation.

Modification 1: Actions as the output

Instead of x being a 256×256 image, x is now A_t, a sequence of T_p future actions. For example, for an arm commanded through a 6-DOF end-effector pose plus a gripper command, one action is a 7-dimensional vector. If we predict T_p = 16 steps ahead, the output is a 7×16 = 112-dimensional vector. Much smaller than an image, but high-dimensional enough that simpler methods like GMMs struggle.

Modification 2: Conditioning on observations

Here's where the paper makes a critical design choice. Previous work (Diffuser, by Janner et al.) modeled the joint distribution p(A_t, O_t) — generating both actions and observations together. Diffusion Policy instead models the conditional distribution p(A_t | O_t).

Why does this matter? Because the observation O_t only needs to be encoded once, not re-encoded at every denoising step. With K=100 denoising iterations, that's 100× less vision computation.

A_t^{k−1} = α(A_t^k − γ ε_θ(O_t, A_t^k, k) + N(0, σ²I))

The training loss becomes:

L = MSE(ε^k, ε_θ(O_t, A_t⁰ + ε^k, k))

Joint vs conditional — why it matters: Modeling p(A, O) jointly means the diffusion process denoises both future actions and future observations. That's much harder and much slower — you're trying to predict what the world looks like AND what to do. Conditioning on O lets you focus entirely on "given what I see, what should I do?" The vision encoder runs once, and K denoising iterations refine only the actions.
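A sketch of why conditioning keeps inference cheap: the observation is encoded before the denoising loop and the resulting features are reused at every step. `CountingEncoder`, `predict_noise`, and `sample_actions` are hypothetical stand-ins, not the paper's networks. The update also omits α and the fresh-noise term for brevity.

```python
import numpy as np

class CountingEncoder:
    """Stand-in for the vision encoder that counts how often it runs."""
    def __init__(self):
        self.calls = 0

    def __call__(self, obs):
        self.calls += 1
        return obs.mean(axis=0)                 # toy feature vector

def predict_noise(obs_feat, actions, k):
    """Stand-in for eps_theta(O, A^k, k): points away from a target set by the observation."""
    target = obs_feat[: actions.shape[1]]
    return actions - target

def sample_actions(obs, encoder, K=100, T_p=16, action_dim=2, step_size=0.1, seed=0):
    rng = np.random.default_rng(seed)
    obs_feat = encoder(obs)                     # encoded once, outside the loop
    actions = rng.standard_normal((T_p, action_dim))
    for k in range(K, 0, -1):                   # K denoising steps reuse obs_feat
        actions = actions - step_size * predict_noise(obs_feat, actions, k)
    return actions

encoder = CountingEncoder()
obs = np.ones((2, 8))                           # T_o = 2 frames of 8 made-up features
plan = sample_actions(obs, encoder)
```

After 100 denoising steps the encoder has run exactly once. A joint model would have to process observation-sized variables inside the loop at every step.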

The observation encoding

The observation O_t isn't just the current frame — it's the last T_o frames of visual and proprioceptive data. Different camera views use separate ResNet-18 encoders (with spatial softmax pooling instead of global average pooling, and GroupNorm instead of BatchNorm). The visual embeddings are concatenated with proprioceptive state to form O_t.
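Spatial softmax pooling, mentioned above, can be sketched in a few lines. This is a generic NumPy version under the usual definition (a softmax over pixel locations followed by expected coordinates), not the paper's code. The [-1, 1] coordinate range is an assumption.

```python
import numpy as np

def spatial_softmax(feature_map):
    """Map a (C, H, W) feature map to (C, 2) expected (x, y) keypoints in [-1, 1]."""
    C, H, W = feature_map.shape
    flat = feature_map.reshape(C, H * W)
    flat = flat - flat.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(flat)
    weights /= weights.sum(axis=1, keepdims=True)        # softmax over all pixels
    weights = weights.reshape(C, H, W)
    xs = np.linspace(-1.0, 1.0, W)
    ys = np.linspace(-1.0, 1.0, H)
    expected_x = (weights.sum(axis=1) * xs).sum(axis=1)  # marginal over rows, then E[x]
    expected_y = (weights.sum(axis=2) * ys).sum(axis=1)  # marginal over columns, then E[y]
    return np.stack([expected_x, expected_y], axis=1)

fmap = np.zeros((1, 5, 5))
fmap[0, 0, 4] = 50.0                                     # strong activation at row 0, column 4
keypoints = spatial_softmax(fmap)
```

Unlike global average pooling, the output keeps where each channel fires, which is the kind of information a manipulation policy needs.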

Why GroupNorm? Because DDPMs use Exponential Moving Average (EMA) of model weights during training. BatchNorm's running statistics clash with EMA — GroupNorm doesn't have running statistics, so it plays nicely with the EMA schedule.
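To make the "no running statistics" point concrete, here is a minimal GroupNorm sketch without learnable affine parameters. The (N, C, L) tensor layout and the function name are assumptions for illustration.

```python
import numpy as np

def group_norm(x, num_groups, eps=1e-5):
    """Normalize x of shape (N, C, L) per sample and per channel group."""
    N, C, L = x.shape
    grouped = x.reshape(N, num_groups, (C // num_groups) * L)
    mean = grouped.mean(axis=2, keepdims=True)
    var = grouped.var(axis=2, keepdims=True)
    return ((grouped - mean) / np.sqrt(var + eps)).reshape(N, C, L)

rng = np.random.default_rng(0)
batch = rng.standard_normal((4, 8, 16))

# Statistics come from each sample alone, so the rest of the batch is irrelevant
# and there is no running average to keep in sync with EMA weights.
alone = group_norm(batch[:1], num_groups=4)
in_batch = group_norm(batch, num_groups=4)[:1]
```

The first sample is normalized identically whether it is processed alone or inside a batch, so the layer behaves the same at training and inference time.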

Why does Diffusion Policy model p(A|O) rather than the joint p(A, O)?

The visual encoder runs only once regardless of K denoising steps, making inference fast enough for real-time control
The joint distribution is mathematically intractable
It simplifies the training loss function

Chapter 4: Architecture

The noise prediction network ε_θ is the heart of Diffusion Policy. The paper proposes two architectures, each with different strengths.

Option A: CNN-based (1D Temporal Convolution)

This architecture treats the action sequence as a 1D signal and applies temporal convolutions. The key mechanism for injecting observation information is Feature-wise Linear Modulation (FiLM):

FiLM(x) = γ(O_t) ⊙ x + β(O_t)

The observation features O_t produce per-channel scale (γ) and shift (β) parameters that modulate every convolutional layer. The denoising step k is also injected via FiLM conditioning.
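A minimal FiLM sketch, assuming the scale and shift come from plain linear projections of a conditioning vector. The projection weights, shapes, and names are illustrative, and the actual network may compute them differently.

```python
import numpy as np

def film(x, cond, w_scale, b_scale, w_shift, b_shift):
    """Apply FiLM to features x of shape (C, T) using a conditioning vector cond (D,)."""
    scale = w_scale @ cond + b_scale             # per-channel scale, shape (C,)
    shift = w_shift @ cond + b_shift             # per-channel shift, shape (C,)
    return scale[:, None] * x + shift[:, None]   # broadcast over the time axis

rng = np.random.default_rng(0)
C, T, D = 4, 16, 6                               # channels, time steps, conditioning size
x = rng.standard_normal((C, T))
cond = rng.standard_normal(D)

# With zero weights, unit scale bias, and zero shift bias, FiLM is the identity.
identity_out = film(x, cond, np.zeros((C, D)), np.ones(C), np.zeros((C, D)), np.zeros(C))

# Random projections give a modulation that depends on the conditioning vector.
out = film(x, cond, rng.standard_normal((C, D)), np.ones(C),
           rng.standard_normal((C, D)), np.zeros(C))
```

Note that the FiLM scale γ is unrelated to the step size γ in the denoising update; the two symbols only share a letter.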

The CNN backbone works well out-of-the-box on most tasks with minimal hyperparameter tuning. But it has a weakness: temporal convolutions have an inductive bias toward smooth, low-frequency signals. If your task requires sharp, rapid action changes (like velocity control), the CNN smears them out.

Option B: Time-Series Diffusion Transformer

To handle high-frequency action changes, the paper introduces a transformer-based architecture. Each noisy action in the sequence A_t^k becomes an input token. The denoising step k is prepended as a special token (sinusoidal embedding). The observation O_t is projected into an embedding sequence and injected via cross-attention in each transformer decoder block.

Causal attention ensures each action token only attends to itself and previous actions — preserving the temporal ordering. The gradient ε_θ is predicted by each corresponding output token.
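The causal constraint amounts to a lower-triangular mask on the attention scores. The sketch below is a generic single-head self-attention weight computation in NumPy with no learned projections, written only to show the mask.

```python
import numpy as np

def causal_attention_weights(queries, keys):
    """Softmax attention weights where token i attends only to tokens j <= i."""
    T, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)
    mask = np.tril(np.ones((T, T), dtype=bool))    # lower triangle = allowed positions
    scores = np.where(mask, scores, -np.inf)       # block attention to future tokens
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
tokens = rng.standard_normal((5, 8))               # 5 action tokens, 8-dim embeddings
attn = causal_attention_weights(tokens, tokens)
```

Every entry above the diagonal is exactly zero, so the first action token can only attend to itself.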

CNN vs Transformer — practical guidance: Start with the CNN. It's robust, fast to train, and works well on most tasks. If performance is low because the task demands rapid action changes (velocity control, high-frequency manipulation), switch to the Transformer — it captures sharp transitions better but requires more careful hyperparameter tuning.

Architecture Comparison

[Interactive figure: the two architectures for the noise prediction network, showing how each processes noisy actions conditioned on observations.]

Why does the CNN backbone struggle with velocity-control action spaces?

CNN is too slow for real-time control
Temporal convolutions have an inductive bias toward smooth, low-frequency signals, so they smear out sharp action changes
CNN can't handle high-dimensional inputs

Chapter 5: Action Chunking

This is where Diffusion Policy's design gets truly clever. Instead of predicting a single action at each timestep, it predicts a whole chunk of future actions — and only executes a portion of them before re-planning.

Three horizons

The system uses three carefully tuned horizon parameters:

T_o (observation horizon): how many past frames the policy sees (e.g., 2-3 frames)
T_p (prediction horizon): how many future actions the model generates (e.g., 16 steps)
T_a (action horizon): how many of those predicted actions we actually execute before re-planning (e.g., 8 steps)

At time t, the policy sees the last T_o observations, generates T_p future actions, executes T_a of them, then re-plans with fresh observations. This is receding horizon control — a classic idea from control theory, now powered by diffusion.
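The observe, predict, execute, re-plan loop can be sketched with a toy one-dimensional system. `toy_policy` is a hypothetical stand-in for the diffusion sampler, and the dynamics are made up. Only the horizon bookkeeping is the point.

```python
import numpy as np
from collections import deque

T_o, T_p, T_a = 2, 16, 8

def toy_policy(obs_history):
    """Stand-in for the diffusion sampler: returns T_p actions stepping toward 0."""
    latest = obs_history[-1]
    return np.full(T_p, -0.1 * np.sign(latest))

def run_episode(num_steps=32, start=2.0):
    state = start
    obs_history = deque([state] * T_o, maxlen=T_o)   # last T_o observations
    policy_calls = 0
    executed = []
    while len(executed) < num_steps:
        plan = toy_policy(list(obs_history))         # predict T_p actions
        policy_calls += 1
        for action in plan[:T_a]:                    # execute only the first T_a
            state = state + action                   # toy 1D dynamics
            obs_history.append(state)
            executed.append(action)
    return state, policy_calls, executed

final_state, policy_calls, executed = run_episode()
```

With T_a = 8, a 32-step episode needs only 4 policy calls. Lowering T_a increases the number of calls and lets the policy react sooner to what it observes.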

The critical tradeoff: T_a controls the balance between temporal consistency (high T_a = commit to the plan longer) and responsiveness (low T_a = react to changes faster). Too high and the robot can't adapt; too low and it jitters between modes. The paper found T_a = 8 works best for most tasks.

Why predict more than you execute?

If T_a = 8 and T_p = 16, we predict 16 actions but only use the first 8. Why waste compute on actions we'll throw away?

Because predicting further into the future gives the model context. To generate good actions for the next 8 steps, the model needs to "think ahead" about where the trajectory is going. The unused tail actions are a form of lookahead planning that improves the quality of the executed actions.

There's also a warm-starting trick: when re-planning, the new denoising process is initialized with the previous prediction (shifted by T_a steps) rather than pure noise. This further smooths transitions between consecutive plans.

Receding Horizon Control

[Interactive figure: the policy predicts a chunk of actions, executes the first T_a of them, then re-plans. A slider for T_a (default 8) shows the consistency-responsiveness tradeoff. Blue = executed actions, gray = predicted-but-discarded lookahead.]

Solving idle actions

During teleoperation, demonstrators sometimes pause — producing sequences of identical positions or near-zero velocities. Single-step policies overfit to these pauses and get permanently stuck. But with action chunking, the model sees the pause as part of a longer trajectory that eventually continues. The surrounding context prevents overfitting to the idle segment.

In Diffusion Policy's receding horizon scheme, why does the model predict T_p = 16 actions when only T_a = 8 are executed?

Predicting further ahead provides lookahead context that improves the quality of the executed actions
The extra actions are cached for later use
It's a buffer in case some actions fail

Chapter 6: Why Diffusion Wins

Diffusion Policy doesn't just match alternatives — it outperforms them by an average of 46.9% across 15 tasks. Let's understand why through three key advantages.

Advantage 1: Training stability

Implicit policies (IBC) represent actions using Energy-Based Models. To train an EBM, you need to estimate the intractable normalizing constant Z(o, θ) via negative sampling:

p_θ(a|o) = e^{−E_θ(o,a)} / Z(o, θ)

The InfoNCE loss uses negative samples to approximate Z, but inaccurate negative sampling causes training instability. Energy goes down, but actual policy performance oscillates wildly — making checkpoint selection a nightmare.

Diffusion Policy sidesteps Z entirely. The noise prediction network approximates the score function — the gradient of log p(a|o):

∇_a log p(a|o) = −∇_a E_θ(o,a) − ∇_a log Z(o,θ)

The key: ∇_a log Z(o, θ) = 0 because Z doesn't depend on the action a. So the score function, and therefore the noise prediction, is independent of Z. No negative sampling is needed, which removes the source of instability that affects EBM training.

Sanity check: Why is ∇_a log Z = 0? Because Z(o, θ) = ∫ e^{−E_θ(o,a′)} da′ integrates over ALL actions a′. It's a constant with respect to any particular action a, and the gradient of a constant is zero.
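The same sanity check can be run numerically. This sketch assumes a toy one-dimensional energy on a grid, with the observation dropped for simplicity, and compares the finite-difference gradient of log p with the negative energy gradient computed without Z.

```python
import numpy as np

a = np.linspace(-3.0, 3.0, 601)
da = a[1] - a[0]
energy = (a ** 2 - 1.0) ** 2                 # toy energy, not a learned E_theta

Z = np.sum(np.exp(-energy)) * da             # normalizer (Riemann-sum approximation)
log_p = -energy - np.log(Z)                  # log p(a) = -E(a) - log Z

score = np.gradient(log_p, da)               # d/da log p(a), by finite differences
neg_energy_grad = -np.gradient(energy, da)   # -dE/da, computed without Z

max_gap = float(np.max(np.abs(score - neg_energy_grad)))
```

The two curves agree up to floating-point rounding, because subtracting the constant log Z shifts log p without changing its slope.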

Advantage 2: Synergy with position control

A surprising finding: Diffusion Policy + position control consistently beats Diffusion Policy + velocity control, even though most prior work uses velocity control. Two reasons:

Position control has more pronounced multimodality (different positions for left vs right strategy). Diffusion Policy handles multimodality better than alternatives, so it benefits MORE from position control's expressiveness.
Position control suffers less from compounding errors. A small velocity error accumulates over time; a small position error stays small.
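A toy illustration of the second reason, assuming a constant per-command error and no feedback correction. Both are simplifications, and the numbers are made up.

```python
import numpy as np

num_steps, dt = 100, 0.1
bias = 0.01                                  # hypothetical constant error in each command

# Velocity control: commands are integrated over time, so the error accumulates.
drift_velocity_ctrl = np.cumsum(np.full(num_steps, bias * dt))

# Position control: each command is an absolute target, so the error does not accumulate.
drift_position_ctrl = np.full(num_steps, bias)
```

Under these assumptions the position error grows linearly with the number of steps under velocity control, while it stays at the size of a single command error under position control.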

Advantage 3: Latency robustness

Because Diffusion Policy predicts a sequence of future actions, it naturally handles the latency between observation capture and action execution. Even with 4 timesteps of latency, performance barely drops — the predicted action sequence already accounts for the near future.

Training Stability: Diffusion Policy vs IBC

[Figure: training curves. IBC (orange) shows decreasing loss but oscillating evaluation performance, so the loss does not reveal which checkpoint is best. Diffusion Policy (teal) converges smoothly and stays stable.]

Why is Diffusion Policy's training more stable than IBC's?

Diffusion Policy uses a simpler network architecture
The score function is independent of the intractable normalizing constant Z, so no negative sampling is needed
Diffusion Policy uses more training data

Chapter 7: Experiments

The paper evaluates Diffusion Policy across 15 tasks from 4 benchmarks. The breadth is impressive: simulated and real, 2-DOF to 6-DOF, single and multi-arm, rigid and fluid objects, single-user and multi-user demonstrations.

Benchmark tasks

Robomimic (5 tasks): Lift, Can, Square, Transport, ToolHang — progressing from simple pick-place to complex bimanual manipulation and precision tool hanging. Each has proficient-human (PH) and mixed-human (MH) demonstration variants.

Push-T: Push a T-shaped block into a target zone with a circular end-effector. Requires exploiting contact dynamics and handling multimodal approach strategies.

Block Push: Push two blocks into two target squares in any order. Tests long-horizon multimodality — which block first?

Franka Kitchen: Interact with 7 kitchen objects (burners, microwave, etc.) in arbitrary order. Tests both short-horizon and long-horizon multimodality.

Results: consistent dominance

Across every task and variant, Diffusion Policy matches or exceeds all baselines. The improvements are largest on the hardest tasks:

Simulation Results

[Figure: success rates across key simulation tasks. Diffusion Policy (both CNN and Transformer variants) consistently outperforms LSTM-GMM, IBC, and BET.]

The multimodality gap: The improvement is most dramatic on tasks with strong multimodality. On Block Push (long-horizon, which-block-first decisions), Diffusion-T achieves 94% vs BET's 71% on p2, the fraction of episodes in which both blocks reach their targets. On Kitchen (7-object, arbitrary-order interaction), Diffusion-T gets 96% vs BET's 44% on p4, the fraction of episodes with at least four completed object interactions. Diffusion Policy doesn't just handle multimodality; it thrives on it.

Key ablation findings

Action horizon T_a: Optimal at 8 for most tasks. T_a = 1 gives reactive but jittery behavior; T_a = 16+ is smooth but unresponsive.

Position vs velocity: Switching from velocity to position control improves Diffusion Policy while hurting baselines. The combination of action chunking + position control + diffusion creates a compounding advantage.

Vision encoder: End-to-end training beats frozen pretrained encoders. Fine-tuning pretrained CLIP ViT-B/16 with 10× lower learning rate gives the best results (98% on Square) but the gap is small for simpler tasks.

On which type of task does Diffusion Policy show the LARGEST improvement over baselines?

Simple single-object pick-and-place tasks
Tasks with strong multimodality (multiple valid strategies, arbitrary subtask ordering)
Tasks with high-dimensional observation spaces

Chapter 8: Real-World Results

Simulation numbers are encouraging, but the real test is hardware. The paper deploys Diffusion Policy on 7 real-world tasks across 2 hardware setups, including 3 challenging bimanual tasks.

Push-T (Real): 95% success, near-human

The real-world Push-T is significantly harder than simulation. It's multi-stage: (1) push the T-block into the target, then (2) move the end-effector to an end-zone to avoid occluding the camera. The transition between stages is highly multimodal: the precise moment to stop pushing and start retreating varies.

Diffusion Policy achieves 95% success rate and 0.80 IoU (human: 100%, 0.84 IoU). LSTM-GMM gets 20%, IBC gets 0%. The failure mode of baselines is revealing: LSTM-GMM gets stuck near the T-block (overfitting to idle actions during fine adjustment), while IBC prematurely stops pushing.

Mug Flipping (6-DOF): 90% success

The robot must pick up a randomly placed mug, orient it lip-down, then rotate it so the handle points left. The demonstration data is wildly multimodal: forehand vs backhand grasps, direct placement vs push-to-rotate, varying grasp adjustments. The policy can even sequence multiple pushes or re-grasp a dropped mug, although neither behavior was demonstrated.

Sauce tasks: fluid manipulation

Two tasks test manipulation of non-rigid, fluid objects. Sauce pouring: scoop sauce with a ladle, pour it centered on pizza dough (0.74 IoU vs 0.79 human). Sauce spreading: spread sauce in a spiral pattern (0.77 coverage vs 0.79 human). Both require long idle periods (waiting for viscous sauce to fill the ladle) and periodic motions (spiral spreading) — known failure modes for standard behavior cloning.

Robustness to perturbation: During real Push-T experiments, the authors tested three perturbations: (1) blocking the camera for 3 seconds — the policy jittered but stayed on course; (2) shifting the T-block during pushing — the policy immediately re-planned to push from the opposite direction; (3) moving the T-block after "completion" — the policy abandoned its retreat and returned to re-position the block. Perturbation (3) was never demonstrated — the policy synthesized novel recovery behavior.

Bimanual tasks: the frontier

Three bimanual tasks push the limits further:

Egg beater (55% success, 210 demos): One arm holds a bowl, the other cranks an egg beater. Required haptic teleoperation — without force feedback, the demonstrator couldn't even complete the task.
Mat unrolling (75% success, 162 demos): Both arms coordinate to unroll, lift, center, and place a dog mat. Ambidextrous: the policy can unroll to the left or to the right.
Shirt folding (75% success, 284 demos): A 9-step sequence of sleeve folds, body folds, and smoothing. The longest-horizon task tested.

Critically, Diffusion Policy worked out of the box on all bimanual tasks with the same hyperparameters as single-arm tasks. No tuning required.

Real-World Task Gallery

[Figure: success rates across 7 real-world tasks. Diffusion Policy approaches human performance on most tasks.]

What was the most surprising behavior observed in real-world Push-T experiments?

The policy achieved 100% success rate
The policy was faster than human demonstration
The policy synthesized novel recovery behavior (returning to fix a moved block) that was never demonstrated

Chapter 9: Connections

What Diffusion Policy built on

DDPM (Ho et al., 2020): The foundational denoising diffusion model that Diffusion Policy adapts from image generation to action generation.

Diffuser (Janner et al., 2022): The first use of diffusion for robot planning, but modeled the joint p(A, O) rather than the conditional p(A|O). Slower and less accurate.

IBC (Florence et al., 2021): Implicit behavioral cloning via energy-based models. Showed the promise of non-explicit policies but suffered from training instability. Diffusion Policy achieves IBC's expressiveness without its instability.

BET (Shafiullah et al., 2022): Behavior Transformers using k-means action clustering. Handles multimodality but fails to maintain temporal consistency across steps.

What Diffusion Policy enabled

pi-0 (Physical Intelligence, 2024): A robot foundation model that combines a VLM with flow matching (a continuous-time variant of diffusion) for actions. Diffusion Policy proved that diffusion-based policy representations work; pi-0 scaled the idea to 7 robot types and 68 tasks.

3D Diffusion Policy (Ze et al., 2024): Extends the approach to 3D visual representations using point clouds instead of RGB images.

Consistency Policy (Prasad et al., 2024): Applies consistency distillation to speed up inference from ~10 denoising steps to a single step while maintaining performance.

The connection to control theory

For simple linear systems with dynamics s_{t+1} = A s_t + B a_t, where the expert uses a linear feedback policy a_t = −K s_t, Diffusion Policy provably recovers the exact policy. The optimal denoiser becomes ε_θ(s, a, k) = (a + Ks) / σ_k, and DDIM sampling converges to a = −Ks. For multi-step prediction, it implicitly learns the dynamics model: a_{t+t′} = −K(A − BK)^{t′} s_t.
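Both formulas can be checked numerically on a small example. The matrices A, B, and K below are made up. The single subtraction a − σ·ε is a simplified stand-in for the full DDIM update, used only to show where the optimal denoiser points.

```python
import numpy as np

A = np.array([[1.0, 0.1], [0.0, 1.0]])       # made-up double-integrator dynamics
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 1.5]])                   # made-up feedback gain
s_t = np.array([1.0, 0.0])

# Subtracting sigma * eps with the optimal denoiser lands exactly on -K s.
sigma, a_noisy = 0.7, np.array([3.0])
eps = (a_noisy + K @ s_t) / sigma
a_denoised = a_noisy - sigma * eps

# Roll the closed loop forward and compare with the closed-form future actions.
state, rollout_actions = s_t.copy(), []
for _ in range(5):
    action = -K @ state
    rollout_actions.append(action)
    state = A @ state + B @ action
closed_form = [-K @ np.linalg.matrix_power(A - B @ K, t) @ s_t for t in range(5)]
```

The rollout and the closed form agree, which is why predicting a multi-step action sequence requires the policy to encode the closed-loop dynamics A − BK implicitly.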

The big picture: Diffusion Policy shifted the robotics community's approach to behavior cloning. Before this paper, the debate was "explicit vs implicit policies." After it, the question became "how do we scale diffusion-based policies?" — leading directly to the VLA revolution (pi-0, pi-0.5, Octo, OpenVLA) that defines modern robot learning.

Cheat sheet

Core equation

A^{k−1} = α(A^k − γ ε_θ(O, A^k, k) + N(0, σ²I))

Training loss

L = MSE(ε^k, ε_θ(O, A⁰ + ε^k, k))

Horizons

T_o = 2 (observe), T_p = 16 (predict), T_a = 8 (execute)

Key finding

Position control + action chunking + diffusion = 46.9% avg improvement

Architecture

CNN (FiLM) for most tasks; Transformer for high-frequency actions

What is the key advantage of Diffusion Policy over Diffuser (Janner et al., 2022)?

Diffusion Policy conditions on observations instead of modeling the joint distribution, making inference faster and action prediction more accurate
Diffusion Policy uses a larger model
Diffusion Policy uses more training data