Introduction

By 2020, two parallel research threads had achieved remarkable results in generative modeling. Denoising Diffusion Probabilistic Models (DDPMs) — developed by Sohl-Dickstein et al. (2015) and dramatically improved by Ho, Jain & Abbeel (2020) — used a discrete Markov chain of T noising steps and learned to reverse each step. Noise Conditional Score Networks (NCSNs) — developed by Song & Ermon (2019) — trained a network to estimate the score function ∇x log p(x) at multiple noise scales and generated samples via annealed Langevin dynamics.

Both approaches worked well. Both destroyed data with noise and learned to undo the destruction. But the mathematical machinery looked quite different — one spoke of Markov transition kernels and variational bounds, the other of score functions and Langevin MCMC. Were they fundamentally different, or two views of the same underlying process?

In a landmark 2021 paper, Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole answered decisively: both are discretizations of a continuous-time stochastic differential equation (SDE). This SDE framework — "Score-Based Generative Modeling through Stochastic Differential Equations" — unified the two approaches and unlocked powerful new capabilities: exact likelihood computation, flexible sampling strategies, and a clean mathematical foundation for everything that followed.

ℹ Why continuous time?

Moving from discrete steps to a continuous SDE is not just mathematical elegance for its own sake. It gives us: (1) a single framework that encompasses DDPMs, NCSNs, and new SDE types; (2) a deterministic probability flow ODE that enables exact likelihood computation; (3) the ability to use any black-box ODE/SDE solver for sampling, with adaptive step sizes; and (4) a direct bridge to the flow matching methods of Article 05.

SDE Primer

Before diving into the diffusion framework, we need a minimal primer on stochastic differential equations. The key insight: an SDE is like an ordinary differential equation (ODE), but with a random "kick" at every instant.

Brownian motion

Brownian motion (or a Wiener process) w(t) is the fundamental building block of continuous-time stochastic processes. It has three defining properties:

(1) w(0) = 0. (2) Increments are Gaussian: w(t) − w(s) ~ 𝒩(0, (t−s)I) for t > s. (3) Increments over non-overlapping intervals are independent.

Brownian motion is continuous everywhere but differentiable nowhere — it's infinitely jagged. In an infinitesimal time interval dt, the increment dw has magnitude proportional to √dt, not dt. This √dt scaling is the source of much that is subtle about stochastic calculus.
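The √dt scaling is easy to verify numerically. The following sketch (plain NumPy; the path count and step size are arbitrary choices, not from any paper) builds Brownian paths from independent Gaussian increments and checks that the variance of an increment equals the elapsed time, not its square:

```python
import numpy as np

rng = np.random.default_rng(0)

def brownian_paths(n_paths, n_steps, dt):
    """Simulate standard Brownian motion: w(0) = 0, each increment ~ N(0, dt)."""
    dw = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    return np.concatenate([np.zeros((n_paths, 1)), np.cumsum(dw, axis=1)], axis=1)

dt = 0.01
w = brownian_paths(n_paths=50_000, n_steps=100, dt=dt)   # t in [0, 1]

# Increment over [0.25, 0.75]: variance should be t - s = 0.5, mean 0.
inc = w[:, 75] - w[:, 25]
var_inc = inc.var()
```

A step of duration 0.5 has standard deviation √0.5 ≈ 0.71 — far larger than a drift term of size 0.5 would suggest, which is exactly the √dt effect.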

Drift & diffusion coefficients

A general Itô SDE takes the form:

▭ The Itô SDE

dx = f(x, t) dt + g(t) dw

f(x, t) is the drift coefficient — the deterministic force pulling the state in a particular direction. g(t) is the diffusion coefficient — the magnitude of the random noise injected at each instant. When g(t) = 0, this reduces to an ordinary differential equation. When g(t) > 0, the trajectory is stochastic: running the same SDE twice from the same initial condition produces different paths.

The drift f(x, t) can depend on the current state x (making the process nonlinear), while in diffusion models the diffusion coefficient g(t) typically depends only on time (not on x). This special case is called an SDE with additive noise and simplifies much of the analysis.
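In code, an additive-noise SDE can be simulated with the Euler–Maruyama scheme: an ordinary Euler step for the drift plus a Gaussian kick of standard deviation g(t)√dt. A minimal sketch, using the Ornstein–Uhlenbeck process as an assumed example (not a diffusion-model schedule, but its known stationary variance σ²/2θ gives us something to check):

```python
import numpy as np

rng = np.random.default_rng(1)

def euler_maruyama(f, g, x0, t0, t1, n_steps):
    """Integrate dx = f(x, t) dt + g(t) dw with a fixed step size."""
    dt = (t1 - t0) / n_steps
    x, t = np.array(x0, dtype=float), t0
    for _ in range(n_steps):
        # drift step + Gaussian kick of std g(t) * sqrt(dt)
        x = x + f(x, t) * dt + g(t) * np.sqrt(dt) * rng.normal(size=x.shape)
        t += dt
    return x

# Ornstein-Uhlenbeck process: drift -theta * x, constant diffusion sigma.
theta, sigma = 1.0, 0.5
x_final = euler_maruyama(lambda x, t: -theta * x, lambda t: sigma,
                         x0=np.zeros(20_000), t0=0.0, t1=8.0, n_steps=800)
var_est = x_final.var()   # stationary variance is sigma^2 / (2 * theta) = 0.125
```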

One crucial result from stochastic calculus: given an SDE, there exists an associated Fokker-Planck equation (also called the Kolmogorov forward equation) that describes how the probability density p(x, t) evolves over time:

∂p/∂t = −∇ ⋅ [f(x,t) p] + ½ g(t)² ∇² p

This connects the trajectory-level view (individual particles following the SDE) to the density-level view (how the cloud of probability mass flows and spreads). Both perspectives will be essential.

The Forward SDE

In discrete DDPMs, we defined a forward process with T steps. In the continuous-time framework, we let the number of steps go to infinity and index the process by continuous time t ∈ [0, T]. At t = 0 we have clean data x(0) ~ pdata; at t = T we have (approximately) pure noise x(T) ~ 𝒩(0, σ²I). The forward process is an SDE that continuously injects noise while (optionally) shrinking the signal.

Song et al. (2021) identified two canonical forward SDEs corresponding to the two previously known discrete approaches.

Variance Preserving SDE (VP-SDE)

The VP-SDE corresponds to the DDPM forward process in continuous time:

▭ VP-SDE (Variance Preserving)

dx = −½ β(t) x dt + √β(t) dw

The drift f(x, t) = −½ β(t) x shrinks the data toward zero. The diffusion coefficient g(t) = √β(t) injects noise. The two effects are balanced so that if x(0) is standardized, the total variance remains approximately 1 throughout — hence "variance preserving." The noise schedule β(t) controls the speed of corruption.

The transition kernel from time 0 to time t has a clean closed form: p(x(t) | x(0)) = 𝒩(x(t); α(t) x(0), (1 − α(t)²) I), where α(t) = exp(−½ ∫₀ᵗ β(s) ds). This is exactly the continuous analog of the discrete DDPM coefficients: α(t)² plays the role of ᾱt.
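This closed form can be checked against a direct simulation. The sketch below assumes a constant schedule β(t) = 2 (a toy choice, not a practical schedule) and compares the simulated mean and variance at t = 1 with α(t) x(0) and 1 − α(t)²:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 2.0                       # toy constant schedule beta(t) = 2
x0, t_end, n_steps = 3.0, 1.0, 2000
dt = t_end / n_steps

x = np.full(50_000, x0)
for _ in range(n_steps):         # Euler-Maruyama on dx = -0.5 beta x dt + sqrt(beta) dw
    x = x - 0.5 * beta * x * dt + np.sqrt(beta * dt) * rng.normal(size=x.shape)

alpha = np.exp(-0.5 * beta * t_end)        # exp(-0.5 * integral of beta)
mean_theory, var_theory = alpha * x0, 1.0 - alpha ** 2
mean_sim, var_sim = x.mean(), x.var()      # should match the closed-form kernel
```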

Variance Exploding SDE (VE-SDE)

The VE-SDE corresponds to the NCSN (Noise Conditional Score Network) framework:

▭ VE-SDE (Variance Exploding)

dx = √(dσ²/dt) dw

There is no drift — the data is not shrunk. The diffusion coefficient g(t) = √(dσ²/dt) simply adds more and more noise over time. The variance "explodes" to infinity: p(x(t) | x(0)) = 𝒩(x(t); x(0), σ(t)² I). In the NCSN framework the noise levels form a geometric sequence; its continuous limit is the geometric schedule σ(t) = σmin (σmax/σmin)ᵗ.
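A short simulation makes the "exploding" behavior concrete. The sketch below assumes the geometric schedule σ(t) = σmin (σmax/σmin)ᵗ and injects at each step exactly the noise increment g(t)² dt = d[σ(t)²]; the mean stays at x(0) while the variance climbs to roughly σ(T)²:

```python
import numpy as np

rng = np.random.default_rng(6)
sigma_min, sigma_max = 0.01, 10.0
T, n_steps = 1.0, 1000
dt = T / n_steps

def sigma(t):
    """Assumed geometric schedule: sigma_min * (sigma_max / sigma_min)^t."""
    return sigma_min * (sigma_max / sigma_min) ** t

x0 = 3.0
x = np.full(50_000, x0)
t = 0.0
for _ in range(n_steps):
    # g(t)^2 dt = d(sigma^2): inject a Gaussian kick with that exact variance.
    dvar = sigma(t + dt) ** 2 - sigma(t) ** 2
    x = x + np.sqrt(dvar) * rng.normal(size=x.shape)
    t += dt

mean_sim, var_sim = x.mean(), x.var()   # theory: mean x0 = 3, variance ~ sigma(T)^2 = 100
```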

The key difference: VP-SDE shrinks signal while adding noise (bounded variance), VE-SDE only adds noise (growing variance). In the experiments of Song et al. (2021), VE-SDE models tended to produce the best sample quality, while VP and especially sub-VP SDEs achieved the better likelihoods.

Comparing SDE types

Property           | VP-SDE                 | VE-SDE          | sub-VP SDE
Drift f(x, t)      | −½ β(t) x              | 0               | −½ β(t) x
Diffusion g(t)     | √β(t)                  | √(dσ²/dt)       | √(β(t)(1 − e^(−2∫₀ᵗ β(s) ds)))
Signal at time t   | Shrinks exponentially  | Unchanged       | Shrinks exponentially
Noise variance     | Bounded (≈ 1)          | Grows unbounded | Bounded (strictly < VP)
Discrete analog    | DDPM                   | NCSN / SMLD     | —
Likelihood         | Good                   | Good            | Best (provably)

The sub-VP SDE is a variant introduced by Song et al. (2021) that provably achieves the best likelihood bounds. It uses the same drift as VP-SDE but a smaller diffusion coefficient, resulting in variance that is always strictly less than the VP case.

Interactive: Forward SDE Simulator

Particles undergoing the forward SDE. Toggle between VP and VE to see how signals degrade differently.

The Reverse-Time SDE

The forward SDE takes clean data to noise. To generate samples, we need to run time backward — starting from noise at t = T and arriving at clean data at t = 0. This is the reverse-time SDE.

A remarkable result due to Anderson (1982) shows that the reverse of an Itô SDE is itself an SDE, running backward in time:

▭ The Reverse-Time SDE (Anderson, 1982)

dx = [f(x, t) − g(t)² ∇x log pt(x)] dt + g(t) dw̄

Here dt is an infinitesimal negative time step (time runs backward from T to 0), and dw̄ is a reverse-time Brownian motion. The drift has two terms: the original forward drift f(x, t), and a score correction −g(t)² ∇x log pt(x) that steers the trajectory back toward data.

This equation is exact — not an approximation. If we knew the score function ∇x log pt(x) at every point and every time, we could perfectly reverse the forward diffusion and generate exact samples from the data distribution.

Of course, we don't know the score function analytically — pt(x) is intractable for real data distributions. But we can learn it. A neural network sθ(x, t) ≈ ∇x log pt(x) trained with the denoising score matching objective from Article 03 provides exactly this estimate. Plugging the learned score into the reverse SDE gives us an approximate generative process.

The score function plays a dual role: it tells you which direction leads to higher probability density (the "uphill" direction), and the magnitude tells you how steep the climb is. In regions far from any data, the score points strongly toward data modes. Near the data manifold, the score becomes more nuanced, capturing local structure.

💡 The only unknown is the score

Look at the reverse SDE again: the forward drift f(x, t) is known by design (we chose it). The diffusion coefficient g(t) is known by design. The only thing we need to learn is ∇x log pt(x) — the score of the noisy data distribution at each noise level. This is exactly what score matching (Article 03) trains us to estimate. The entire generative process reduces to one learned function.
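For a toy distribution the score is available in closed form, so the reverse SDE can be run without training anything. The sketch below assumes 1-D Gaussian data and a constant β (both toy choices); the analytic score of pt stands in for the learned network sθ:

```python
import numpy as np

rng = np.random.default_rng(3)
beta, T, n_steps = 2.0, 1.0, 2000
dt = T / n_steps
m0, s0 = 2.0, 0.5                       # data distribution N(2, 0.25)

def marginal(t):
    """Mean and variance of p_t under the VP-SDE (constant beta)."""
    a = np.exp(-0.5 * beta * t)
    return a * m0, (a * s0) ** 2 + 1.0 - a ** 2

def score(x, t):
    """Exact score of the Gaussian marginal -- stands in for s_theta."""
    mu, v = marginal(t)
    return -(x - mu) / v

mu_T, v_T = marginal(T)
x = rng.normal(mu_T, np.sqrt(v_T), size=50_000)   # start from p_T
t = T
for _ in range(n_steps):                # Euler-Maruyama, backward in time
    drift = -0.5 * beta * x - beta * score(x, t)   # f - g^2 * score
    x = x - drift * dt + np.sqrt(beta * dt) * rng.normal(size=x.shape)
    t -= dt

mean_sim, std_sim = x.mean(), x.std()   # should recover N(2, 0.5)
```

Swapping the analytic `score` for a trained sθ(x, t) turns this loop into the actual generative sampler.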

Interactive: Reverse SDE vs Probability Flow ODE

Left: stochastic reverse SDE (different each run). Right: deterministic ODE (same trajectory each run). Both produce valid samples.

The Probability Flow ODE

The reverse SDE generates samples stochastically — each run produces a different trajectory, even from the same starting noise. But Song et al. (2021) proved a striking result: there exists a deterministic ODE whose trajectories have the same marginal distributions pt(x) at every time t.

▭ The Probability Flow ODE

dx/dt = f(x, t) − ½ g(t)² ∇x log pt(x)

Compare this to the reverse SDE: the drift is modified (the score term is halved) and there is no noise term. This ODE defines a deterministic flow that transforms the noise distribution at t = T into the data distribution at t = 0, passing through the same sequence of marginal densities as the stochastic reverse SDE.

This is sometimes called the probability flow ODE or the neural ODE associated with the diffusion process. It's remarkable: the same score network sθ(x, t) that drives the stochastic sampler also defines a deterministic flow.

The probability flow ODE offers several practical advantages:

Deterministic sampling. Given the same noise vector x(T), the ODE always produces the same output x(0). This enables latent-space interpolation: linearly interpolating between two noise vectors and solving the ODE produces a smooth morphing between the corresponding images.

Exact likelihood computation. Because the probability flow ODE is a continuous normalizing flow, we can compute exact log-likelihoods using the instantaneous change of variables formula (next section). This was the first time diffusion models could compute exact (not lower-bound) likelihoods.

Adaptive step-size solvers. Any black-box ODE solver (RK45, Dormand-Prince, etc.) can be used to solve the probability flow ODE, with adaptive step sizes that concentrate computation where the trajectory changes rapidly. This can be much more efficient than the fixed step-size schemes used for the stochastic SDE.
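The same Gaussian toy from the reverse-SDE discussion can be pushed through the probability flow ODE instead. The sketch below (constant β, analytic score, and a hand-rolled fixed-step RK4 in place of an adaptive black-box solver) integrates the deterministic field from t = T back to t = 0 and recovers the data statistics without injecting any noise:

```python
import numpy as np

rng = np.random.default_rng(4)
beta, T, n_steps = 2.0, 1.0, 200
dt = T / n_steps
m0, s0 = 2.0, 0.5                       # data distribution N(2, 0.25)

def marginal(t):
    """Mean and variance of p_t under the VP-SDE (constant beta)."""
    a = np.exp(-0.5 * beta * t)
    return a * m0, (a * s0) ** 2 + 1.0 - a ** 2

def velocity(x, t):
    """Probability-flow field: f(x, t) - 0.5 g(t)^2 * score(x, t)."""
    mu, v = marginal(t)
    return -0.5 * beta * x + 0.5 * beta * (x - mu) / v

mu_T, v_T = marginal(T)
x = rng.normal(mu_T, np.sqrt(v_T), size=50_000)   # start from p_T
t = T
for _ in range(n_steps):                # classic RK4, integrating t: T -> 0
    k1 = velocity(x, t)
    k2 = velocity(x - 0.5 * dt * k1, t - 0.5 * dt)
    k3 = velocity(x - 0.5 * dt * k2, t - 0.5 * dt)
    k4 = velocity(x - dt * k3, t - dt)
    x = x - (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    t -= dt

mean_ode, std_ode = x.mean(), x.std()   # same marginals as the reverse SDE
```

Note that with the same seed this produces the same output every run — the stochasticity lives entirely in the choice of x(T).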

💡 DDIM is a discretization of the probability flow ODE

The DDIM sampler (Song, Meng & Ermon, 2021) — which produces deterministic samples from a trained DDPM — is precisely a first-order Euler discretization of the VP probability flow ODE. Setting the DDIM "stochasticity" parameter η = 0 recovers the ODE; η = 1 recovers the full stochastic reverse SDE. This connection was not initially obvious but falls out naturally from the SDE framework.

Interactive: VP vs VE Noise Schedules

Signal and noise magnitudes over time for VP-SDE, VE-SDE, and sub-VP SDE. Notice how VP preserves total variance while VE lets it grow.

Exact Likelihood via the Probability Flow ODE

Instantaneous change of variables

The probability flow ODE defines a continuous normalizing flow (CNF) — a smooth, invertible mapping from noise space to data space. For any such flow, the instantaneous change of variables formula gives us the exact log-likelihood:

log p0(x(0)) = log pT(x(T)) + ∫₀ᵀ ∇ ⋅ f̃(x(t), t) dt

where f̃(x, t) = f(x, t) − ½ g(t)² sθ(x, t) is the probability flow ODE vector field, and ∇ ⋅ f̃ is its divergence (the trace of the Jacobian). The integral tracks how the flow compresses or expands volume as it transforms noise into data.

In practice, we compute this by solving the ODE forward (data to noise), since we know x(0) (the data point) and want to find x(T) and the accumulated log-density change. This gives the exact log-likelihood of any data point under the model — not a lower bound as in VAEs, but the true value (up to numerical ODE solver error).

Hutchinson's trace estimator

The divergence ∇ ⋅ f̃ requires computing the trace of the Jacobian ∂f̃/∂x, which naively costs O(d) backward passes for data dimension d — far too expensive for images. Hutchinson's trace estimator provides a stochastic shortcut:

tr(A) = 𝔼[εᵀ A ε]

where ε is a random vector with 𝔼[εεᵀ] = I (e.g., Rademacher or standard Gaussian). This requires only a single vector-Jacobian product εᵀ (∂f̃/∂x), computable with one backward pass. The estimate is unbiased, and in practice 1–2 random vectors suffice per evaluation.
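The estimator is easy to demonstrate on an explicit matrix. (In the likelihood computation, A would be the Jacobian ∂f̃/∂x, probed via vector-Jacobian products rather than materialized; the dimensions and probe count below are arbitrary choices.)

```python
import numpy as np

rng = np.random.default_rng(5)
d = 20
A = rng.normal(size=(d, d))             # stands in for the Jacobian of f-tilde

def hutchinson_trace(A, n_probes, rng):
    """Average eps^T A eps over Rademacher probe vectors (E[eps eps^T] = I)."""
    eps = rng.choice([-1.0, 1.0], size=(n_probes, A.shape[0]))
    # per-probe quadratic forms eps^T A eps, then their mean
    return np.einsum('pi,ij,pj->p', eps, A, eps).mean()

exact = np.trace(A)
estimate = hutchinson_trace(A, n_probes=20_000, rng=rng)
```

One probe costs one matrix-vector (or vector-Jacobian) product, so the cost is independent of d — the whole point of the trick.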

Combining Hutchinson's estimator with an adaptive ODE solver, Song et al. computed likelihoods that were competitive with — and sometimes better than — state-of-the-art autoregressive models on standard benchmarks, while simultaneously producing high-quality samples. This was a major advance: previous diffusion models could only compute likelihood lower bounds.

Interactive: Probability Flow ODE — Likelihood Tracking

A single trajectory under the probability flow ODE with a running divergence integral. The accumulated divergence gives the log-likelihood change from noise to data.

Connections to Discrete Models

The SDE framework beautifully unifies previously disparate methods. Here are the key correspondences:

DDPM = discretized VP-SDE. Take the VP-SDE dx = −½β(t)x dt + √β(t) dw and discretize it with step size Δt = 1/T using the Euler-Maruyama method. With the discrete schedule βt = β(t/T)/T, you recover (in the T → ∞ limit) the DDPM forward transitions q(xt | xt−1) = 𝒩(√(1 − βt) xt−1, βt I). The DDPM reverse process is likewise an Euler-Maruyama discretization of the reverse-time VP-SDE.
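The correspondence can be seen in one line of arithmetic: the DDPM mean scale √(1 − βt) and the Euler-Maruyama mean scale 1 − ½β(t)Δt agree up to O(Δt²), and both inject noise of variance βt. A minimal check (constant β(t) assumed for illustration):

```python
import numpy as np

T_steps = 1000
beta_cont = 2.0                 # assumed constant continuous schedule beta(t) = 2
dt = 1.0 / T_steps
beta_i = beta_cont * dt         # discrete beta_i = beta(i/T) * (1/T)

ddpm_scale = np.sqrt(1.0 - beta_i)        # DDPM mean scale sqrt(1 - beta_i)
em_scale = 1.0 - 0.5 * beta_cont * dt     # Euler-Maruyama mean scale
gap = abs(ddpm_scale - em_scale)          # O(dt^2): the schemes agree to first order
```

With T = 1000 the gap is on the order of 10⁻⁷ — the two descriptions are numerically indistinguishable at DDPM's native step count.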

NCSN / SMLD = discretized VE-SDE. The NCSN method of Song & Ermon (2019) uses a geometric sequence of noise levels σ1 < ... < σN and samples via annealed Langevin dynamics. This is precisely a discretization of the VE-SDE reverse process. Each noise level corresponds to a time slice, and the Langevin steps approximate the reverse SDE at that noise level.

DDIM ≈ Probability flow ODE. The deterministic DDIM sampler (Song, Meng & Ermon, 2021) is a first-order discretization of the VP probability flow ODE. Setting the DDIM stochasticity parameter η = 0 gives the pure ODE discretization; increasing η blends in stochastic noise from the reverse SDE. At η = 1, DDIM reduces to DDPM.

ℹ Why unification matters

Before the SDE framework, DDPM and NCSN were developed and analyzed with different tools (variational bounds vs score matching). The continuous-time view reveals that they're the same algorithm with different noise schedules. This means insights from one approach (e.g., the simplified training loss from DDPM) immediately transfer to the other. It also opened the door to designing entirely new SDEs with desirable properties — the sub-VP SDE is one example.

The SDE framework also clarified the role of the number of sampling steps. In discrete models, T was a fixed architectural choice (T = 1000 in the original DDPM). In the SDE framework, T is a continuous endpoint, and the number of discretization steps is a solver choice made at sampling time. This immediately suggested using fewer, larger steps or adaptive solvers — ideas that blossomed into the fast sampling methods of Article 07.

Bridge to Flow Matching

The probability flow ODE offers a tantalizing perspective. It defines a deterministic transport from the noise distribution pT ≈ 𝒩(0, I) to the data distribution p0 = pdata. The velocity field of this transport is v(x, t) = f(x, t) − ½ g(t)² ∇x log pt(x).

But notice: we had to go through a roundabout process to get this velocity field. First we designed a forward SDE. Then we derived the reverse SDE using the score. Then we extracted the deterministic ODE. What if we could learn the velocity field directly?

This is exactly the insight behind flow matching (Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023). Instead of defining a stochastic forward process and deriving an ODE from it, flow matching directly parameterizes and trains a velocity field vθ(x, t) that transports noise to data. The training objective is simpler: regress the velocity at each (x, t) to the target velocity of a conditional transport path.

The connection is deep: the optimal flow matching velocity field, when the conditional paths are chosen to match the diffusion forward process, exactly recovers the probability flow ODE velocity. Flow matching is not a departure from diffusion — it's a more direct route to the same destination.

💡 From SDEs to straight lines

The probability flow ODE trajectories in diffusion models are typically curved — the velocity field changes character at different noise levels. Flow matching noticed that you can choose any transport path, not just the one inherited from a diffusion SDE. The simplest choice — straight lines from noise to data — turns out to work remarkably well, and produces straighter trajectories that are easier to integrate with few steps. This is the "rectified flow" idea that we explore in Article 05.

The SDE framework was a watershed moment for generative modeling. It unified two major approaches, revealed deep mathematical structure, and — through the probability flow ODE — pointed the way toward the flow-based methods that would dominate the next generation of research. With the continuous-time perspective firmly in place, we're ready to explore flow matching and its surprisingly simple training procedure.

References

Seminal papers and key works referenced in this article.

  1. Song et al. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR, 2021. arXiv
  2. Karras et al. "Elucidating the Design Space of Diffusion-Based Generative Models." NeurIPS, 2022. arXiv
  3. Anderson. "Reverse-time diffusion equation models." Stochastic Processes and their Applications, 1982.