Visuomotor policy learning via action diffusion — representing robot behavior as a conditional denoising diffusion process that iteratively refines noise into precise, multimodal action sequences.
You're teleoperating a robot arm to push a T-shaped block into a target zone. You show it 200 demonstrations. Then you step back and let the learned policy take over.
The robot freezes. Or it jitters in place. Or it confidently pushes the block the wrong way. What went wrong?
The core issue is that behavior cloning — learning to map observations to actions via supervised regression — has a hidden enemy: multimodal action distributions.
When you showed the robot how to push the T-block, sometimes you went around it from the left. Other times from the right. Both are valid. But a standard regression policy averages these two modes, producing an action that goes straight ahead — right into the block, achieving neither strategy.
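The averaging failure can be reproduced in a few lines. This is a minimal sketch with made-up data: the demonstrations are hypothetical 1D lateral offsets clustered at -1 (left) and +1 (right), and the "policy" is the single action that minimizes mean squared error for one fixed observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical demonstrations: lateral offset of the approach path.
# Half the demos pass the block on the left (-1), half on the right (+1).
left = rng.normal(-1.0, 0.05, size=100)
right = rng.normal(+1.0, 0.05, size=100)
demo_actions = np.concatenate([left, right])

# For one fixed observation, search for the action with the lowest MSE.
candidates = np.linspace(-1.5, 1.5, 301)
mse = ((demo_actions[None, :] - candidates[:, None]) ** 2).mean(axis=1)
best_action = candidates[np.argmin(mse)]

# Distance from the regression output to the nearest demonstrated action.
gap_to_nearest_demo = np.abs(demo_actions - best_action).min()

print(f"MSE-optimal action: {best_action:+.2f}")
print(f"Gap to nearest demo: {gap_to_nearest_demo:.2f}")
```

The MSE-optimal action lands near zero, far from anything the demonstrator actually did.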
Previous approaches tried to fix this in different ways:
Explicit policies (Gaussian Mixture Models, categorical distributions) attempt to represent multiple modes directly. But GMMs require you to guess the number of modes upfront, and discretization scales exponentially with action dimensions.
Implicit policies (Energy-Based Models) learn an energy landscape over actions and find low-energy actions at inference. They can express arbitrary distributions — but they require negative sampling during training, which causes wild instability. Training loss goes down while actual performance oscillates unpredictably.
Figure: A robot can push the T-block from the left or the right. Sampling actions from each policy type shows that the regression policy averages the two modes into a useless straight-ahead action.
What if we didn't try to output actions directly at all? What if instead, we started with random noise and iteratively sculpted it into actions?
This is exactly what Diffusion Policy does. It borrows the denoising diffusion process from image generation — but instead of generating pixels, it generates robot actions.
The process works like this: at inference time, we sample a random action sequence from Gaussian noise. Then we run K denoising steps. Each step looks at the current noisy actions and the robot's visual observations, and nudges the actions toward a clean, valid trajectory. After K steps, noise becomes a precise action plan.
Why does this solve multimodality? Because the denoising process has two sources of randomness. First, the initial noise sample — different random starts land in different "basins" of the action distribution, naturally selecting one mode or another. Second, the stochastic perturbations added at each denoising step let samples explore and settle into nearby modes.
And crucially, diffusion models never need to compute the normalizing constant Z that makes Energy-Based Models unstable. The noise prediction network learns the gradient of the energy landscape — the direction to push actions toward validity — without ever needing to know the total volume of the distribution.
Before we apply diffusion to robot actions, we need to understand the machinery. A Denoising Diffusion Probabilistic Model (DDPM) has two processes: a forward process that gradually adds noise to data, and a reverse process that learns to undo the noise.
Take a clean data sample x0 (in image generation, a real image; for us, a real action sequence). Over K steps, add progressively more Gaussian noise until xK is indistinguishable from pure noise.
The reverse process is where learning happens. At each step k, a neural network εθ looks at the current noisy sample x^k and the step number k, and predicts the noise that was added. We then subtract this predicted noise:
$$x^{k-1} = \alpha \left( x^{k} - \gamma\, \epsilon_\theta(x^{k}, k) + \mathcal{N}(0, \sigma^{2} I) \right)$$
Where α is a scaling factor (slightly less than 1 for stability), γ is the step size, and the N(0, σ²I) term adds a small amount of fresh noise: this is the "stochastic" part of stochastic Langevin dynamics.
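To make the reverse process concrete, here is a minimal sketch of the update x^{k-1} = α(x^k − γ εθ(x^k, k) + noise) on a 1D toy problem. Everything beyond that update is an assumption for illustration: `toy_noise_pred` is a hand-written stand-in for a trained network, the noise schedule is linear, and α is set to exactly 1 so the toy modes stay at ±1.

```python
import numpy as np

def toy_noise_pred(x, k):
    """Hypothetical stand-in for a trained eps_theta.

    Points away from the nearest of two modes at -1 and +1, so that
    subtracting it moves x toward that mode. The step index k is unused here.
    """
    nearest_mode = np.where(x >= 0.0, 1.0, -1.0)
    return x - nearest_mode

def denoise_step(x, k, noise_pred, alpha, gamma, sigma, rng):
    """One update: x_{k-1} = alpha * (x_k - gamma * eps(x_k, k) + N(0, sigma^2))."""
    fresh_noise = sigma * rng.standard_normal(x.shape)
    return alpha * (x - gamma * noise_pred(x, k) + fresh_noise)

def sample(num_samples, K, rng, alpha=1.0, gamma=0.2, sigma_max=0.05):
    x = rng.standard_normal(num_samples)   # x_K: pure Gaussian noise
    for k in range(K, 0, -1):
        sigma = sigma_max * k / K          # assumed schedule: noise shrinks toward 0
        x = denoise_step(x, k, toy_noise_pred, alpha, gamma, sigma, rng)
    return x

samples = sample(num_samples=1000, K=50, rng=np.random.default_rng(0))
frac_right = float((samples > 0).mean())
print(f"fraction near +1: {frac_right:.2f}, near -1: {1 - frac_right:.2f}")
```

Different initial noise draws end up at different modes, and no sample settles at the average of the two.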
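Training is elegantly simple. For each training example x^0, pick a random denoising step k, sample Gaussian noise ε^k with the variance assigned to step k, add it to the data, and train the network to predict that noise:
$$\mathcal{L} = \mathrm{MSE}\left(\epsilon^{k},\ \epsilon_\theta(x^{0} + \epsilon^{k}, k)\right)$$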
That's the entire training loss. No adversarial training, no ELBO tricks, no negative sampling. Just "predict what noise was added to this data."
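The "predict what noise was added" objective can be sketched as follows. This is an illustration under stated assumptions, not the paper's training code: the linear `sigmas` schedule is hypothetical, the two predictors are hand-written stand-ins for a network, and the simplified form x^0 + ε^k omits the data rescaling that practical DDPM schedules apply.

```python
import numpy as np

def noise_prediction_loss(x0, noise_pred, sigmas, rng):
    """MSE between the sampled noise and the predictor's estimate of it."""
    k = rng.integers(1, len(sigmas))                  # random step in [1, K]
    eps = sigmas[k] * rng.standard_normal(x0.shape)   # noise with step-k scale
    x_noisy = x0 + eps
    return float(np.mean((eps - noise_pred(x_noisy, k)) ** 2))

K = 100
sigmas = np.linspace(0.0, 1.0, K + 1)       # assumed linear noise scale per step
x0 = np.sin(np.linspace(0.0, np.pi, 16))    # a clean 16-step toy trajectory

oracle = lambda x, k: x - x0                # knows x0, so it recovers the noise
lazy = lambda x, k: np.zeros_like(x)        # always predicts "no noise"

oracle_loss = noise_prediction_loss(x0, oracle, sigmas, np.random.default_rng(0))
lazy_loss = noise_prediction_loss(x0, lazy, sigmas, np.random.default_rng(0))
print(f"oracle: {oracle_loss:.2e}, lazy: {lazy_loss:.2e}")
```

A real network has no access to x^0, so it can only drive this loss down by learning what clean data looks like.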
Figure: Noise is refined into a clean 1D action trajectory over K=20 denoising steps. At each step the network predicts the noise (red arrows), which is subtracted to get cleaner actions.
Standard DDPMs generate images. Diffusion Policy generates robot actions. This requires two crucial modifications to the formulation.
Instead of x being a 256×256 image, x is now At, a sequence of Tp future actions. For a robot arm with a 7-dimensional action space (for example, a 6-DOF end-effector command plus a gripper), one action is a 7-dimensional vector. If we predict Tp = 16 steps ahead, the output is a 7×16 = 112-dimensional vector. Much smaller than an image, but high-dimensional enough that simpler methods like GMMs struggle.
Here's where the paper makes a critical design choice. Previous work (Diffuser, by Janner et al.) modeled the joint distribution p(At, Ot) — generating both actions and observations together. Diffusion Policy instead models the conditional distribution p(At | Ot).
Why does this matter? Because the observation is only a conditioning input, not part of what is being denoised, Ot needs to be encoded once per action prediction rather than at every denoising step. With K=100 denoising iterations, the vision encoder runs once instead of 100 times.
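The saving is easy to see by counting calls. In this sketch, `encode_obs` and `predict_noise` are hypothetical placeholders (a mean-pool and a fixed shrinkage) standing in for the vision encoder and the noise prediction network; only the call pattern is the point.

```python
import numpy as np

class CountingFn:
    """Wraps a function and counts how many times it is called."""
    def __init__(self, fn):
        self.fn, self.calls = fn, 0
    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)

# Hypothetical stand-ins for the vision encoder and the noise predictor.
encode_obs = CountingFn(lambda images: images.mean(axis=(1, 2)).ravel())
predict_noise = CountingFn(lambda obs_emb, actions, k: 0.1 * actions)

def sample_actions(images, K, Tp, action_dim, rng):
    obs_emb = encode_obs(images)                     # expensive step, run once
    actions = rng.standard_normal((Tp, action_dim))  # start from Gaussian noise
    for k in range(K, 0, -1):
        # Each denoising step reuses the cached observation embedding.
        actions = actions - predict_noise(obs_emb, actions, k)
    return actions

images = np.zeros((2, 96, 96, 3))                    # To = 2 frames, assumed size
plan = sample_actions(images, K=100, Tp=16, action_dim=7,
                      rng=np.random.default_rng(0))
print(encode_obs.calls, predict_noise.calls, plan.shape)
```

A joint model over actions and observations would instead have to process the observation inside the loop.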
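The training loss becomes:
$$\mathcal{L} = \mathrm{MSE}\left(\epsilon^{k},\ \epsilon_\theta(O_t,\ A_t^{0} + \epsilon^{k},\ k)\right)$$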
The observation Ot isn't just the current frame — it's the last To frames of visual and proprioceptive data. Different camera views use separate ResNet-18 encoders (with spatial softmax pooling instead of global average pooling, and GroupNorm instead of BatchNorm). The visual embeddings are concatenated with proprioceptive state to form Ot.
Why GroupNorm? Because DDPMs use Exponential Moving Average (EMA) of model weights during training. BatchNorm's running statistics clash with EMA — GroupNorm doesn't have running statistics, so it plays nicely with the EMA schedule.
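For readers unfamiliar with weight EMA, the update is a running blend of the averaged weights toward the current ones. The sketch below assumes a constant decay and a plain dictionary of arrays; real training code may vary the decay over time.

```python
import numpy as np

def ema_update(ema_weights, weights, decay):
    """Blend the running average toward the current weights."""
    return {name: decay * ema_weights[name] + (1.0 - decay) * weights[name]
            for name in weights}

weights = {"w": np.ones(3)}
ema = {"w": np.zeros(3)}
for _ in range(3):            # pretend the optimizer left the weights at 1.0
    ema = ema_update(ema, weights, decay=0.9)
print(ema["w"])
```

The averaged weights are the ones used for evaluation, so any statistic that is tracked outside the weights, as BatchNorm's running mean and variance are, does not correspond to them.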
The noise prediction network εθ is the heart of Diffusion Policy. The paper proposes two architectures, each with different strengths.
This architecture treats the action sequence as a 1D signal and applies temporal convolutions. The key mechanism for injecting observation information is Feature-wise Linear Modulation (FiLM):
$$\mathrm{FiLM}(h) = \gamma \odot h + \beta$$
The observation features Ot produce per-channel scale (γ, unrelated to the step size γ in the denoising update) and shift (β) parameters that modulate the activations h of every convolutional layer. The denoising step k is also injected via FiLM conditioning.
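A minimal sketch of FiLM conditioning on one layer's activations follows. The linear projections (`W_scale`, `W_shift` and their biases) and all shapes are assumptions for illustration; the source only specifies that the conditioning produces a per-channel scale and shift.

```python
import numpy as np

def film(h, cond, W_scale, b_scale, W_shift, b_shift):
    """Apply per-channel scale and shift predicted from the conditioning vector.

    h:    (channels, time) activations of one temporal conv layer
    cond: (cond_dim,) conditioning vector (observation features + step embedding)
    """
    scale = W_scale @ cond + b_scale      # (channels,)
    shift = W_shift @ cond + b_shift      # (channels,)
    return scale[:, None] * h + shift[:, None]

rng = np.random.default_rng(0)
channels, time_steps, cond_dim = 4, 16, 8
h = rng.standard_normal((channels, time_steps))
cond = rng.standard_normal(cond_dim)

# With zero projection weights, scale = 1 and shift = 0, so FiLM is the identity.
zeros = np.zeros((channels, cond_dim))
identity_out = film(h, cond, zeros, np.ones(channels), zeros, np.zeros(channels))

# With random projections, each channel gets its own scale and shift.
W_s, W_b = rng.standard_normal((2, channels, cond_dim))
modulated = film(h, cond, W_s, np.zeros(channels), W_b, np.zeros(channels))
```

Because the scale and shift are constant along the time axis, the observation changes how each channel behaves without touching the temporal structure of the action sequence.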
The CNN backbone works well out-of-the-box on most tasks with minimal hyperparameter tuning. But it has a weakness: temporal convolutions have an inductive bias toward smooth, low-frequency signals. If your task requires sharp, rapid action changes (like velocity control), the CNN smears them out.
To handle high-frequency action changes, the paper introduces a transformer-based architecture. Each noisy action in the sequence A_t^k becomes an input token. The denoising step k is prepended as a special token (sinusoidal embedding). The observation Ot is projected into an embedding sequence and injected via cross-attention in each transformer decoder block.
Causal attention ensures each action token only attends to itself and previous actions — preserving the temporal ordering. The gradient εθ is predicted by each corresponding output token.
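The causal constraint amounts to masking out the upper triangle of the attention score matrix. This is a single-head sketch without learned projections, intended only to show the mask; it is not the paper's transformer.

```python
import numpy as np

def causal_self_attention(queries, keys, values):
    """Single-head attention where token i attends only to tokens 0..i."""
    T, d = queries.shape
    scores = queries @ keys.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ values, weights

rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))   # 16 action tokens, 8-dim embeddings
out, weights = causal_self_attention(tokens, tokens, tokens)
print(np.round(weights[:3, :4], 2))     # first rows: zeros to the right of the diagonal
```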
Figure: The two architectures for the noise prediction network (CNN-based and transformer-based), showing how each processes noisy actions conditioned on observations.
This is where Diffusion Policy's design gets truly clever. Instead of predicting a single action at each timestep, it predicts a whole chunk of future actions — and only executes a portion of them before re-planning.
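The system uses three carefully tuned horizon parameters: the observation horizon To (how many past observation steps the policy sees), the prediction horizon Tp (how many future actions it generates), and the action execution horizon Ta (how many of those actions it executes before re-planning).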
At time t, the policy sees the last To observations, generates Tp future actions, executes Ta of them, then re-plans with fresh observations. This is receding horizon control — a classic idea from control theory, now powered by diffusion.
If Ta = 8 and Tp = 16, we predict 16 actions but only use the first 8. Why waste compute on actions we'll throw away?
Because predicting further into the future gives the model context. To generate good actions for the next 8 steps, the model needs to "think ahead" about where the trajectory is going. The unused tail actions are a form of lookahead planning that improves the quality of the executed actions.
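The predict, execute, re-plan loop can be sketched as below. The policy and environment are hypothetical toys (a point mass steered toward the origin), and padding the initial observation history by repetition is an assumption; only the control-flow pattern reflects the description above.

```python
import numpy as np

def run_receding_horizon(policy, env_step, obs, total_steps, To, Tp, Ta):
    """Predict Tp actions, execute the first Ta, then re-plan."""
    obs_history = [obs] * To              # pad history with the first observation
    executed, num_plans = [], 0
    while len(executed) < total_steps:
        plan = policy(np.stack(obs_history[-To:]), Tp)   # (Tp, action_dim)
        num_plans += 1
        for action in plan[:Ta]:          # plan[Ta:] is lookahead, never executed
            obs = env_step(obs, action)
            obs_history.append(obs)
            executed.append(action)
            if len(executed) == total_steps:
                break
    return np.array(executed), num_plans

# Hypothetical stand-ins: a policy that heads toward the origin and a
# point-mass environment where the action is a position delta.
def toy_policy(obs_window, Tp):
    current = obs_window[-1]
    return np.tile(-0.1 * current, (Tp, 1))

def toy_env_step(obs, action):
    return obs + action

actions, num_plans = run_receding_horizon(
    toy_policy, toy_env_step, obs=np.array([1.0, -2.0]),
    total_steps=32, To=2, Tp=16, Ta=8)
print(actions.shape, num_plans)
```

With Ta = 8, a 32-step episode needs only 4 policy calls, each of which sees fresh observations.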
There's also a warm-starting trick: when re-planning, the new denoising process is initialized with the previous prediction (shifted by Ta steps) rather than pure noise. This further smooths transitions between consecutive plans.
Figure: The policy predicts a chunk of actions, executes a subset, then re-plans. Varying Ta shows the consistency-responsiveness tradeoff. Blue = executed actions, gray = predicted-but-discarded lookahead.
During teleoperation, demonstrators sometimes pause — producing sequences of identical positions or near-zero velocities. Single-step policies overfit to these pauses and get permanently stuck. But with action chunking, the model sees the pause as part of a longer trajectory that eventually continues. The surrounding context prevents overfitting to the idle segment.
Diffusion Policy doesn't just match alternatives: the paper reports an average improvement of 46.9% over them across 15 tasks. Let's understand why through three key advantages.
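Implicit policies (IBC) represent actions using Energy-Based Models, which define the action distribution through an energy function Eθ:
$$p_\theta(a \mid o) = \frac{e^{-E_\theta(o, a)}}{Z(o, \theta)}$$
To train an EBM, you need to estimate the intractable normalizing constant Z(o, θ) via negative sampling.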
The InfoNCE loss uses negative samples to approximate Z, but inaccurate negative sampling causes training instability. Energy goes down, but actual policy performance oscillates wildly — making checkpoint selection a nightmare.
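To see where the negatives enter, here is a sketch of an InfoNCE-style loss in its usual form, -log(exp(-E_pos) / (exp(-E_pos) + Σ_j exp(-E_neg_j))). The quadratic energy and the uniform negative sampler are hypothetical; the point is that the same energy function yields a different loss for every draw of negatives.

```python
import numpy as np

def info_nce_loss(energy_pos, energy_negs):
    """-log( exp(-E_pos) / (exp(-E_pos) + sum_j exp(-E_neg_j)) )."""
    logits = -np.concatenate([[energy_pos], energy_negs])
    logits = logits - logits.max()                  # numerical stability
    log_denominator = np.log(np.exp(logits).sum())
    return float(log_denominator - logits[0])

tie = info_nce_loss(0.0, np.array([0.0]))               # one equally likely negative
confident = info_nce_loss(0.0, np.array([10.0, 12.0]))  # negatives have high energy

# The same energy function scored against two different negative sets:
energy = lambda a: (a - 0.3) ** 2                       # hypothetical E(o, a), fixed o
rng = np.random.default_rng(0)
loss_a = info_nce_loss(energy(0.3), energy(rng.uniform(-1, 1, size=8)))
loss_b = info_nce_loss(energy(0.3), energy(rng.uniform(-1, 1, size=8)))
print(f"tie={tie:.3f} confident={confident:.5f} loss_a={loss_a:.3f} loss_b={loss_b:.3f}")
```

The sum over negatives is a sampled stand-in for Z, so the training signal is only as good as the negatives happen to be.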
Diffusion Policy sidesteps Z entirely. The noise prediction network approximates (up to sign) the score function, the gradient of log p(a|o):
$$\nabla_a \log p(a \mid o) = -\nabla_a E_\theta(o, a) - \nabla_a \log Z(o, \theta) \approx -\epsilon_\theta(a, o)$$
The key: ∇a log Z(o, θ) = 0 because Z doesn't depend on the action a. So the score function, and therefore the noise prediction, is independent of Z. No negative sampling is needed, which is why training is far more stable in practice.
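The claim that Z drops out of the gradient can be checked numerically. The sketch below uses a hypothetical two-mode energy over a 1D action, computes Z by brute-force integration on a grid, and compares the finite-difference gradient of log p with the negative energy gradient.

```python
import numpy as np

def energy(a):
    """Hypothetical two-mode energy over a 1D action, with minima at -1 and +1."""
    return (a ** 2 - 1.0) ** 2

def energy_grad(a):
    return 4.0 * a * (a ** 2 - 1.0)

grid = np.linspace(-3.0, 3.0, 6001)
da = grid[1] - grid[0]

# The normalizing constant requires integrating over every action...
Z = np.sum(np.exp(-energy(grid))) * da
log_p = -energy(grid) - np.log(Z)

# ...but it is a constant in a, so it vanishes from the gradient.
score_numeric = np.gradient(log_p, da)
score_from_energy = -energy_grad(grid)

interior = slice(1, -1)   # np.gradient is less accurate at the two endpoints
max_error = np.abs(score_numeric[interior] - score_from_energy[interior]).max()
print(f"Z = {Z:.4f}, max |score difference| = {max_error:.2e}")
```

In one dimension Z is cheap to integrate; in a 112-dimensional action space it is not, which is why a method that never needs it has a practical advantage.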
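A surprising finding: Diffusion Policy + position control consistently beats Diffusion Policy + velocity control, even though most prior work uses velocity control. The paper speculates on two reasons. First, action multimodality is more pronounced under position control, and Diffusion Policy expresses multimodality better than the baselines do. Second, position control suffers less from compounding error than velocity control, which makes it better suited to predicting action sequences.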
Because Diffusion Policy predicts a sequence of future actions, it naturally handles the latency between observation capture and action execution. Even with 4 timesteps of latency, performance barely drops — the predicted action sequence already accounts for the near future.
Figure: Training curves. IBC (orange) shows decreasing loss but oscillating evaluation performance, so you can't tell which checkpoint is best. Diffusion Policy (teal) converges smoothly and stays stable.
The paper evaluates Diffusion Policy across 15 tasks from 4 benchmarks. The breadth is impressive: simulated and real, 2-DOF to 6-DOF, single and multi-arm, rigid and fluid objects, single-user and multi-user demonstrations.
Robomimic (5 tasks): Lift, Can, Square, Transport, ToolHang, progressing from simple pick-and-place to complex bimanual manipulation and precision tool hanging. Lift, Can, Square, and Transport each have proficient-human (PH) and mixed-human (MH) demonstration variants; ToolHang has PH only.
Push-T: Push a T-shaped block into a target zone with a circular end-effector. Requires exploiting contact dynamics and handling multimodal approach strategies.
Block Push: Push two blocks into two target squares in any order. Tests long-horizon multimodality — which block first?
Franka Kitchen: Interact with 7 kitchen objects (burners, microwave, etc.) in arbitrary order. Tests both short-horizon and long-horizon multimodality.
Across every task and variant, Diffusion Policy matches or exceeds all baselines. The improvements are largest on the hardest tasks.
Figure: Success rates across key simulation tasks. Diffusion Policy (both CNN and Transformer variants) consistently outperforms LSTM-GMM, IBC, and BET.
Action horizon Ta: Optimal at 8 for most tasks. Ta = 1 gives reactive but jittery behavior; Ta = 16+ is smooth but unresponsive.
Position vs velocity: Switching from velocity to position control improves Diffusion Policy while hurting baselines. The combination of action chunking + position control + diffusion creates a compounding advantage.
Vision encoder: End-to-end training beats frozen pretrained encoders. Fine-tuning pretrained CLIP ViT-B/16 with 10× lower learning rate gives the best results (98% on Square) but the gap is small for simpler tasks.
Simulation numbers are encouraging, but the real test is hardware. The paper deploys Diffusion Policy on 7 real-world tasks across 2 hardware setups, including 3 challenging bimanual tasks.
The real-world Push-T is significantly harder than simulation. It's multi-stage: (1) push the T-block into the target, then (2) move the end-effector to an end-zone to avoid occluding the camera. The transition between stages is highly multimodal: the precise moment to stop pushing and start retreating varies.
Diffusion Policy achieves 95% success rate and 0.80 IoU (human: 100%, 0.84 IoU). LSTM-GMM gets 20%, IBC gets 0%. The failure mode of baselines is revealing: LSTM-GMM gets stuck near the T-block (overfitting to idle actions during fine adjustment), while IBC prematurely stops pushing.
The robot must pick up a randomly placed mug, orient it lip-down, then rotate it so the handle points left. The demonstration data is wildly multimodal: forehand vs backhand grasps, direct placement vs push-to-rotate, varying grasp adjustments. Although these behaviors were never demonstrated, the policy can even sequence multiple pushes or re-grasp a dropped mug.
Two tasks test manipulation of non-rigid, fluid objects. Sauce pouring: scoop sauce with a ladle, pour it centered on pizza dough (0.74 IoU vs 0.79 human). Sauce spreading: spread sauce in a spiral pattern (0.77 coverage vs 0.79 human). Both require long idle periods (waiting for viscous sauce to fill the ladle) and periodic motions (spiral spreading) — known failure modes for standard behavior cloning.
Three bimanual tasks push the limits further: using an egg beater, unrolling a mat, and folding a shirt.
Critically, Diffusion Policy worked out of the box on all bimanual tasks with the same hyperparameters as single-arm tasks. No tuning required.
Figure: Success rates across 7 real-world tasks. Diffusion Policy approaches human performance on most tasks.
DDPM (Ho et al., 2020): The foundational denoising diffusion model that Diffusion Policy adapts from image generation to action generation.
Diffuser (Janner et al., 2022): The first use of diffusion for robot planning, but modeled the joint p(A, O) rather than the conditional p(A|O), which the Diffusion Policy paper argues is slower and less accurate for policy learning.
IBC (Florence et al., 2021): Implicit behavioral cloning via energy-based models. Showed the promise of non-explicit policies but suffered from training instability. Diffusion Policy achieves IBC's expressiveness without its instability.
BET (Shafiullah et al., 2022): Behavior Transformers using k-means action clustering. Handles multimodality but fails to maintain temporal consistency across steps.
pi-0 (Physical Intelligence, 2024): A robot foundation model that combines a VLM with flow matching (a continuous-time relative of diffusion) for actions. Diffusion Policy proved that diffusion-based policy representations work; pi-0 scaled the idea to 7 robot types and 68 tasks.
3D Diffusion Policy (Ze et al., 2024): Extends the approach to 3D visual representations using point clouds instead of RGB images.
Consistency Policy (Prasad et al., 2024): Applies consistency distillation to speed up inference from ~10 denoising steps to a single step while maintaining performance.
For simple linear systems with dynamics s_{t+1} = A s_t + B a_t, where the expert uses a linear feedback policy a_t = −K s_t, Diffusion Policy provably recovers the exact policy. The optimal denoiser becomes εθ(s, a, k) = (a + Ks) / σ_k, and DDIM sampling converges to a = −Ks. For multi-step prediction, it implicitly learns the dynamics model: a_{t+t′} = −K(A − BK)^{t′} s_t.
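The multi-step formula can be verified by rolling out the closed loop. The sketch assumes linear dynamics s_{t+1} = A s_t + B a_t with arbitrary illustrative matrices A, B and gain K; it checks only the closed-form expression for future actions, not the denoiser itself.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # assumed dynamics: s_{t+1} = A s_t + B a_t
B = np.array([[0.0], [0.1]])
K = np.array([[1.0, 1.5]])               # assumed feedback gain: a_t = -K s_t
s_t = rng.standard_normal(2)

# Roll the closed loop forward and record the expert's future actions.
horizon = 8
state, rollout_actions = s_t.copy(), []
for _ in range(horizon):
    action = -K @ state
    rollout_actions.append(action)
    state = A @ state + B @ action

# Closed form: a_{t+t'} = -K (A - BK)^{t'} s_t
closed_loop = A - B @ K
closed_form = [-K @ np.linalg.matrix_power(closed_loop, tp) @ s_t
               for tp in range(horizon)]

max_gap = np.abs(np.array(rollout_actions) - np.array(closed_form)).max()
print(f"max gap between rollout and closed form: {max_gap:.2e}")
```

To predict an action t′ steps ahead from s_t alone, a policy must therefore encode (A − BK)^{t′}, which is the sense in which it implicitly learns the dynamics.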