Introduction
The original DDPM paper established a remarkable fact: you can generate photorealistic images by learning to reverse a diffusion process. The catch? It required 1000 sequential denoising steps to produce a single image. Each step calls the same neural network once — meaning one image costs 1000 full forward passes through a U-Net. At ~50 ms per pass on a modern GPU, that is nearly a minute per image.
Compare this with GANs, which produce an image in a single forward pass (~20 ms), or VAEs, which decode in one shot. Diffusion models won the quality race but were losing the speed race by three orders of magnitude. The entire field of sampler acceleration is about closing this gap: achieving the same sample quality with 10× to 1000× fewer network evaluations.
The solutions fall into two broad families. Better numerical methods — DDIM, higher-order ODE solvers, Karras scheduling — keep the original model but trace its trajectory more efficiently, reducing steps from 1000 to 20–50. Distillation and reparameterization — progressive distillation, consistency models, adversarial approaches — train a new model that shortcuts the trajectory entirely, reaching 1–4 steps.
This article traces the full arc: from Song et al.'s realization that DDPM hides a deterministic ODE, through the numerical solver revolution, to the modern single-step generators that power real-time image synthesis.
DDIM: Denoising Diffusion Implicit Models
Song, Meng & Ermon (2021) observed something profound about DDPMs: the DDPM training objective (predicting ε from xt) does not uniquely determine a single generative process. DDPMs train by denoising, but the reverse process used at sampling time can be generalized to a much larger family of non-Markovian processes — all of which share the same marginal distributions q(xt|x0) and therefore the same trained model.
The key insight: DDPM's reverse step adds noise (it is stochastic), but this noise is not mandatory. You can dial it down to zero and get a completely deterministic mapping from noise to image. This is DDIM — a deterministic sampler for models trained with the DDPM objective.
The deterministic shortcut
DDIM generalizes the DDPM reverse step, and both are built from the same ingredient: given the predicted noise εθ(xt, t), the predicted clean image is:
x̂₀ = (x_t − √(1−ᾱ_t) · ε_θ(x_t, t)) / √ᾱ_t
The DDIM update with noise parameter σt:
x_{t-1} = √ᾱ_{t-1} · x̂₀ + √(1 − ᾱ_{t-1} − σ_t²) · ε_θ(x_t, t) + σ_t · z
When σt = 0, the noise term vanishes entirely. The process becomes deterministic: the same initial noise xT always produces the same image. When σt equals the DDPM posterior standard deviation, you recover exactly the original DDPM reverse process.
This deterministic property enables two things that stochastic DDPM cannot do. First, meaningful latent codes: since the mapping is deterministic and (approximately) invertible, you can encode real images into the noise space and manipulate them there — enabling interpolation, editing, and reconstruction. Second, and more importantly for this article, step skipping.
Step skipping
DDPM's Markov chain couples adjacent timesteps: you must go t → t−1 → t−2, visiting all 1000 steps. DDIM's non-Markovian formulation breaks this coupling. Instead of stepping through every timestep, you can choose a subsequence — say τ = {1000, 800, 600, 400, 200, 1} — and jump directly between these. You still use the same pretrained model; you simply evaluate it at fewer points.
The quality degrades gracefully: 100 steps is nearly indistinguishable from 1000; 50 steps shows minor degradation; even 20 steps produces coherent images. This was the first major speedup — a free 10–50× acceleration requiring no retraining.
Why does skipping work? Because the DDIM update at each step is effectively solving an ODE one step at a time. Taking larger steps introduces truncation error, but the trajectory is smooth enough to tolerate moderate step sizes.
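Putting the σt = 0 update and the subsequence trick together gives a very short sampler. A minimal NumPy sketch, where `eps_model` stands in for the trained noise-prediction network and `abar` for its ᾱ schedule (both are hypothetical placeholders, not a real model):

```python
import numpy as np

def ddim_step(x_t, eps, abar_t, abar_prev):
    """One deterministic DDIM step (sigma_t = 0)."""
    # Predict x0 from the noise estimate, then re-diffuse to the earlier timestep.
    x0_hat = (x_t - np.sqrt(1.0 - abar_t) * eps) / np.sqrt(abar_t)
    return np.sqrt(abar_prev) * x0_hat + np.sqrt(1.0 - abar_prev) * eps

def ddim_sample(eps_model, x_T, abar, tau):
    """Sample along a subsequence tau of timesteps, e.g. [999, 799, ..., 0].

    The same pretrained model is evaluated only at the timesteps in tau.
    """
    x = x_T
    for i in range(len(tau) - 1):
        t, t_prev = tau[i], tau[i + 1]
        eps = eps_model(x, t)
        x = ddim_step(x, eps, abar[t], abar[t_prev])
    return x
```

Because the update is deterministic, calling `ddim_sample` twice with the same `x_T` yields the same image, which is exactly the property that makes latent-space editing possible.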
Sampler Comparison
Three samplers trace from noise xT to data x0 along the probability flow ODE. DDPM (orange) uses many stochastic steps; DDIM (cyan) uses fewer deterministic steps; DPM-Solver (purple) uses even fewer with higher-order corrections. Drag the slider to change the DDIM step count.
The ODE Perspective
The connection between DDIM and ordinary differential equations was made explicit by Song et al. (2021) in the score-SDE framework. Every diffusion process has an associated probability flow ODE — a deterministic ODE whose trajectories follow the same marginal densities as the stochastic process:
dx = [f(x,t) − ½ g(t)² ∇_x log p_t(x)] dt
DDIM is simply the Euler discretization of this ODE. Each DDIM step is one Euler step with a particular step size. This realization was transformative because it opened the door to the entire toolkit of numerical ODE solvers — methods developed over centuries for exactly this kind of problem.
The Euler method is the simplest (and least accurate) ODE solver. It approximates the solution by following the tangent line at each point. For smooth ODEs, the global error is O(h) — linear in the step size h. This means halving the step count roughly doubles the error. But higher-order methods can achieve O(h²), O(h⁴), or better, meaning far fewer steps for the same accuracy.
The conceptual shift is subtle but profound. DDPM frames generation as a stochastic Markov chain: each step adds random noise, and you sample a different image each time. DDIM and the probability flow ODE frame generation as a deterministic trajectory in data space: the initial noise uniquely determines the final image, and the trajectory is a smooth curve you can trace with numerical methods. Same trained model, fundamentally different perspective — and the ODE view is what unlocks acceleration.
Higher-Order Solvers
Once we recognize sampling as ODE integration, the question becomes: which solver? Euler works but wastes most of its function evaluations on redundant linear extrapolation. Higher-order solvers extract more information per evaluation, dramatically improving the step-to-quality tradeoff.
Euler & Heun
Euler's method (order 1): evaluate the derivative at the current point, step forward. Global error O(h). For diffusion, each "derivative evaluation" is one neural network forward pass, so 20 Euler steps = 20 NFEs (neural function evaluations).
Heun's method (order 2): evaluate the derivative at the current point, take a tentative Euler step, evaluate the derivative at the new point, and average the two derivatives for the actual step. This "predictor-corrector" approach doubles the cost per step (2 NFEs) but reduces the global error to O(h²). For the same total NFE budget, Heun typically outperforms Euler significantly — 10 Heun steps (20 NFEs) beats 20 Euler steps (20 NFEs).
RK4 (Runge-Kutta order 4): the classic four-stage method with O(h⁴) error. 4 NFEs per step. Often overkill for diffusion because the ODE is not smooth enough to benefit from 4th-order accuracy everywhere, but it can help in specific regimes.
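The order gap is easy to reproduce on a toy scalar ODE (the same one the interactive demo below uses, dx/dt = −sin(x)·t, which happens to be separable and therefore has a closed-form solution to compare against). A self-contained sketch:

```python
import numpy as np

def f(x, t):
    # Toy scalar ODE: dx/dt = -sin(x) * t
    return -np.sin(x) * t

def euler(x0, t0, t1, n):
    """Order-1: follow the tangent line. 1 evaluation per step."""
    x, h = x0, (t1 - t0) / n
    for i in range(n):
        x = x + h * f(x, t0 + i * h)
    return x

def heun(x0, t0, t1, n):
    """Order-2 predictor-corrector. 2 evaluations per step."""
    x, h = x0, (t1 - t0) / n
    for i in range(n):
        t = t0 + i * h
        k1 = f(x, t)               # slope at the current point
        k2 = f(x + h * k1, t + h)  # slope at the tentative Euler point
        x = x + h * 0.5 * (k1 + k2)
    return x

def exact(x0, t0, t1):
    # Separable ODE: ln tan(x/2) = ln tan(x0/2) - (t^2 - t0^2)/2
    return 2.0 * np.arctan(np.tan(x0 / 2.0) * np.exp(-(t1**2 - t0**2) / 2.0))
```

At the same step size, Heun's O(h²) error is far below Euler's O(h) error; in a diffusion sampler the `f` evaluations are network forward passes, so the fair comparison is at equal NFE rather than equal step count.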
DPM-Solver & UniPC
Lu et al. (2022) developed DPM-Solver, a solver specifically designed for the structure of diffusion ODEs. The key observation: the diffusion ODE can be decomposed into a linear part (known analytically) and a nonlinear part (the neural network output). By solving the linear part exactly and only approximating the nonlinear part, DPM-Solver achieves much better accuracy than generic methods.
DPM-Solver++ (2nd and 3rd order) further exploits the specific structure of noise-prediction and data-prediction parameterizations. It became the default sampler in many Stable Diffusion pipelines, enabling high-quality generation in 20–25 steps.
UniPC (Zhao et al., 2023) unifies predictor-corrector methods into a single framework, automatically selecting the optimal combination for a given step budget. It matches or exceeds DPM-Solver++ quality with the same NFE count.
ODE Solver Accuracy
A 1D ODE (dx/dt = −sin(x)·t) solved with Euler (green) vs Heun (orange) vs exact solution (white). Drag the slider to change step size — fewer, larger steps reveal how Heun stays accurate while Euler drifts.
The following table summarizes the practical tradeoffs among common solvers:
| Solver | Order | NFE / Step | Sweet Spot (Steps) | Notes |
|---|---|---|---|---|
| DDPM | 1 (stochastic) | 1 | 1000 | Original; too slow for practical use |
| DDIM | 1 (deterministic) | 1 | 50–100 | First practical speedup; Euler on probability flow ODE |
| Heun | 2 | 2 | 25–50 | Simple predictor-corrector; good baseline |
| DPM-Solver++ | 2–3 | 1 | 15–25 | Exploits diffusion ODE structure; de facto standard |
| UniPC | 2–3 | 1 | 10–20 | Unified predictor-corrector; slight edge over DPM-Solver++ |
| RK4 | 4 | 4 | 10–20 | Classic; usually not worth 4× NFE for diffusion |
Karras (EDM): Elucidating the Design Space
Karras et al. (2022) — the "EDM" paper — took a step back from inventing new models and instead systematically analyzed every design choice in diffusion: the noise schedule, the network preconditioning, the loss weighting, and the sampler. Their insight was that many of these choices interact, and jointly optimizing them yields large improvements with no architectural changes.
Preconditioning. Instead of training the raw network on noisy inputs, EDM reparameterizes the network with input and output scaling that depends on the noise level σ. The network Fθ receives a normalized input and produces a normalized output, making the target magnitude roughly constant across noise levels. This simple change significantly improves training stability and sample quality.
Noise schedule design. EDM parameterizes the noise schedule directly in terms of σ(t) rather than β(t). They show that a simple schedule σ(t) = t (with t going from σmax to σmin) works well, and that the spacing of discretization points along this schedule matters enormously. Their recommended schedule places more steps at low noise levels (where detail emerges) than at high noise levels (where only coarse structure matters).
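The recommended discretization is a one-liner: interpolate linearly in σ^(1/ρ) space and raise back to the power ρ, which concentrates points at low noise. The defaults below (σmin = 0.002, σmax = 80, ρ = 7) are the values suggested in the EDM paper:

```python
import numpy as np

def karras_sigmas(n, sigma_min=0.002, sigma_max=80.0, rho=7.0):
    """EDM noise-level discretization.

    Interpolates between sigma_max and sigma_min in sigma^(1/rho) space;
    larger rho packs more of the n steps into the low-noise regime
    where fine detail emerges.
    """
    ramp = np.linspace(0.0, 1.0, n)
    max_inv = sigma_max ** (1.0 / rho)
    min_inv = sigma_min ** (1.0 / rho)
    return (max_inv + ramp * (min_inv - max_inv)) ** rho

sigmas = karras_sigmas(18)  # 18 noise levels, high to low
```

The resulting spacing is strongly non-uniform: the first few gaps span most of the σ range, while the final gaps are tiny.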
Stochastic churn. While pure ODE sampling is deterministic and consistent, EDM found that injecting a small amount of noise at each step (controlled by parameters Schurn, Snoise) and then denoising it away actually improves sample quality. This "stochastic churn" acts as a form of regularization — it prevents the trajectory from getting stuck in local artifacts of the learned score function. The optimal churn amount is small (just enough to correct errors) and only applied at intermediate noise levels.
The EDM sampler with Heun's method and optimized schedule became one of the strongest baselines, matching or beating specialized solvers at 35–50 steps. It showed that careful engineering of existing components can be as powerful as novel algorithms.
Progressive Distillation
Better solvers push the boundary down to ~15–20 steps, but there is a floor: the probability flow ODE has finite curvature, and tracing it with fewer than ~10 steps introduces visible artifacts regardless of solver sophistication. Breaking through this floor requires changing the model itself.
Salimans & Ho (2022) proposed progressive distillation: train a student model to match the output of a teacher model in half the steps. The teacher takes two DDIM steps to go from xt to xt−2; the student learns to reach the same xt−2 in a single step from xt. Once trained, the student becomes the new teacher, and the process repeats.
Starting from a 1024-step teacher:
- Round 1: Student learns to match teacher in 512 steps
- Round 2: 512 → 256 steps
- Round 3: 256 → 128
- Round 4: 128 → 64
- ...
- Round 8: 8 → 4 steps
Each round halves the step count while keeping quality close to the teacher. The loss is straightforward: given the same starting point xt, the student's single-step output should match the teacher's two-step output in terms of the predicted x0 (or equivalently, ε). After ~8 rounds of distillation, you have a model that generates in 4 steps with quality close to the original 1024-step model.
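The halving schedule itself is simple arithmetic; a sketch of the outer loop (the actual per-round training, regressing the student's one-step output onto the teacher's two-step DDIM output, is elided as a comment):

```python
def distillation_rounds(teacher_steps, target_steps=4):
    """Progressive distillation schedule: each round halves the step count.

    Per round (not implemented here): for sampled x_t, train the student's
    1-step prediction to match the teacher's 2-step DDIM prediction of x0,
    then promote the student to teacher.
    """
    rounds = []
    steps = teacher_steps
    while steps > target_steps:
        steps //= 2
        rounds.append(steps)
    return rounds

rounds = distillation_rounds(1024)  # [512, 256, 128, 64, 32, 16, 8, 4]
```

Eight rounds take a 1024-step teacher down to a 4-step student, matching the schedule listed above.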
Progressive Distillation
Watch a teacher model (blue, 8 steps) get distilled into a student (orange, 4 steps), then the student becomes the teacher for the next round. Each round halves the trajectory while preserving the endpoint.
The limitation of progressive distillation is diminishing returns: each halving discards some information, and the quality gap accumulates. Getting below 4 steps with this approach requires either accepting noticeable quality loss or combining with other techniques.
Consistency Models
Song et al. (2023) proposed a radically different approach. Instead of tracing the ODE trajectory step by step, what if a single neural network could map any point on the trajectory directly to the trajectory's endpoint (the clean image)?
A consistency model fθ(xt, t) is trained to satisfy a single property: self-consistency. For any two points xt and xs on the same ODE trajectory (i.e., they would evolve into the same clean image), the model should produce the same output:
f_θ(x_t, t) = f_θ(x_s, s) for all t, s on the same trajectory
This is enforced with a boundary condition: at t = 0 (or ε → 0), the model must return its input: fθ(x0, 0) = x0. Combined with self-consistency, this means fθ(xt, t) = x0 for all t along the trajectory. The model learns to "jump" from any noise level directly to the clean data.
Think of the ODE trajectory as a river flowing from noise to data. A consistency model learns to teleport from any point in the river directly to its mouth (the clean image), without needing to follow the current. Every point along the same river maps to the same destination — that is the consistency property. This collapses the entire multi-step sampling process into a single function evaluation.
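The boundary condition is typically enforced by construction rather than by a loss term. The consistency models paper wraps a free network Fθ in noise-dependent skip and output scalings; below, `F` is a stand-in for the network, and σdata = 0.5, ε = 0.002 are the paper's defaults:

```python
import numpy as np

SIGMA_DATA, EPS = 0.5, 0.002  # defaults from the consistency models paper

def c_skip(t):
    # Equals 1 at t = EPS, decays toward 0 at high noise.
    return SIGMA_DATA**2 / ((t - EPS)**2 + SIGMA_DATA**2)

def c_out(t):
    # Equals 0 at t = EPS, so the network output cannot perturb the boundary.
    return SIGMA_DATA * (t - EPS) / np.sqrt(SIGMA_DATA**2 + t**2)

def consistency_model(F, x, t):
    """f(x, t) = c_skip(t) * x + c_out(t) * F(x, t).

    At t = EPS the scalings force f(x, EPS) = x exactly, regardless of F.
    """
    return c_skip(t) * x + c_out(t) * F(x, t)
```

With this parameterization, training only needs to enforce self-consistency between pairs of points; the identity boundary comes for free.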
There are two training approaches:
Consistency distillation (CD): Given a pretrained diffusion model, use its ODE solver to find pairs of points (xt+1, xt) on the same trajectory. Train fθ so that fθ(xt+1, t+1) matches fθ−(xt, t), where θ− is an exponential moving average of θ (similar to the target network in reinforcement learning).
Consistency training (CT): Train from scratch without a pretrained diffusion model. This uses a different loss formulation based on pairs of adjacent noise levels, avoiding the need for a teacher model entirely. CT is harder to train but removes the dependency on an existing diffusion model.
Consistency Model Mapping
Multiple points along the same ODE trajectory (dashed white curve) all map to the same clean output x₀ (gold star). Drag the slider to move a sample point along the trajectory — the consistency model output stays fixed at x₀.
In practice, consistency models achieve reasonable quality in a single step and near-diffusion quality in 2–4 steps (using the model's own output as a warm start for iterative refinement). The improved training techniques of Song & Dhariwal (2023) closed much of the remaining gap to full diffusion, making single-step generation practical for high-resolution images.
Adversarial & Hybrid Approaches
The final frontier of acceleration combines diffusion's training stability with adversarial losses that directly optimize perceptual quality at low step counts.
ADD (Adversarial Diffusion Distillation): Sauer et al. (2023) distill a diffusion model into a 1–4 step generator using a combination of diffusion loss and adversarial loss. The discriminator evaluates whether the student's output (at each step) looks like a real image, providing a gradient signal that the diffusion loss alone cannot. This produced SDXL-Turbo — real-time 512×512 generation in 1–4 steps.
Latent Consistency Models (LCM): Luo et al. (2023) apply the consistency model framework specifically in the latent space of Stable Diffusion. By operating on the compressed latent representation (64×64×4 for a 512×512 image) rather than the full pixel space, and using classifier-free guidance as part of the distillation target, LCM achieves high-quality generation in 2–4 steps. LCM-LoRA further makes this accessible by training only a small adapter, enabling any fine-tuned Stable Diffusion checkpoint to be accelerated.
Rectified flow distillation: Building on the flow matching framework (Article 06 in this series), rectified flows learn straighter trajectories that require fewer integration steps. InstaFlow (Liu et al., 2023) combines rectified flows with distillation to achieve single-step generation. Stability AI's SD3-Turbo and FLUX-Schnell use similar ideas: start with a flow matching model that already has relatively straight trajectories, then distill for 1–4 step generation.
DMD2 (Distribution Matching Distillation 2): Yin et al. (2024) use a regression loss combined with a GAN-like distribution matching objective, achieving state-of-the-art single-step generation. The key is using the teacher model itself to provide training signal — no external discriminator needed.
The trend is clear: the boundary between "diffusion model" and "single-step generator" is dissolving. Modern systems train with diffusion objectives for stability and coverage, then distill with adversarial or consistency objectives for speed. The result is models that are trained like diffusion models but sample like GANs.
Choosing a Sampler: A Practical Guide
With so many options, how do you choose? The answer depends almost entirely on your step budget — how many neural function evaluations (NFEs) you can afford.
| Budget (NFEs) | Recommended Approach | Quality | Notes |
|---|---|---|---|
| 1 | Consistency model / ADD / DMD2 | Good (FID 2–4) | Requires distillation; real-time applications |
| 2–4 | LCM / SDXL-Turbo / consistency + refinement | Very good (FID 1.5–3) | Best quality-speed tradeoff for interactive use |
| 5–15 | DPM-Solver++ / UniPC | Excellent (FID ~2) | No distillation needed; works with any model |
| 15–30 | DPM-Solver++ or Karras (EDM) | Near-optimal | Diminishing returns beyond 25 steps for most models |
| 30–50 | Karras (EDM) with stochastic churn | Optimal | Maximum quality; stochastic churn helps at this budget |
| 50+ | Any solver (diminishing returns) | Plateau | Rarely needed; spend the budget on larger batch instead |
Additional considerations:
- Classifier-free guidance doubles the NFE count (two forward passes per step) unless using a distilled model that bakes in the guidance.
- Stochastic vs. deterministic: Deterministic samplers (DDIM, pure ODE solvers) give reproducible results and enable latent interpolation. Stochastic samplers (DDPM, Karras with churn, SDE solvers) often produce slightly higher quality due to error correction but sacrifice reproducibility.
- Model-specific tuning: Different models respond differently to samplers. SDXL works well with DPM-Solver++ at 20–30 steps. FLUX benefits from its native flow matching sampler. Always check the model card for recommended settings.
- Resolution matters less than you might expect: the step count is a property of the ODE trajectory, not of the image size, so a 1024×1024 generation typically needs about the same number of steps as a 512×512 one; each step simply costs more compute per evaluation.
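The guidance point above is worth making concrete: a guided step combines a conditional and an unconditional prediction, so it costs two forward passes unless guidance has been distilled into the model. A sketch with placeholder predictions standing in for the two network calls:

```python
import numpy as np

def guided_eps(eps_cond, eps_uncond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one. w = 1 recovers eps_cond;
    w > 1 strengthens conditioning. Producing eps_cond and eps_uncond
    requires two network forward passes per sampler step."""
    return eps_uncond + w * (eps_cond - eps_uncond)
```

This is why, for example, "20 steps with guidance" is really 40 NFEs, while a guidance-distilled model gets the same effect in 20.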
The arc of sampler development bends toward one step. We started at 1000 steps, carved it down to 20 with better solvers, then to 4 with distillation, and now to 1 with consistency and adversarial methods. The gap between one-step and multi-step quality narrows with each generation of techniques. Soon, the question may not be "how many steps?" but rather "how many refinement iterations after the first step?" — a subtly different problem, and one where diffusion models are converging with other generative architectures.
References
Seminal papers and key works referenced in this article.
- Lu et al. "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling." NeurIPS, 2022. arXiv
- Song et al. "Consistency Models." ICML, 2023. arXiv
- Salimans & Ho. "Progressive Distillation for Fast Sampling of Diffusion Models." ICLR, 2022. arXiv
- Meng et al. "On Distillation of Guided Diffusion Models." CVPR, 2023. arXiv