Introduction
In 2015, Sohl-Dickstein et al. proposed a provocative idea borrowed from non-equilibrium statistical mechanics: what if we could generate data by learning to reverse a diffusion process? The idea was theoretically intriguing but produced mediocre samples. Five years later, Ho, Jain & Abbeel (2020) revisited the approach with modern neural networks and a crucial simplification of the training objective. The result — Denoising Diffusion Probabilistic Models (DDPMs) — produced image quality rivaling GANs while being far more stable to train.
DDPMs work in two phases. The forward process takes a clean data sample and progressively corrupts it by adding Gaussian noise over T timesteps, until only pure noise remains. The reverse process starts from pure noise and learns to iteratively denoise, one step at a time, recovering a clean data sample. The neural network only needs to solve a modest task at each step — predict and remove a small amount of noise — rather than generating an entire image from scratch.
This article derives the DDPM framework from first principles: the forward Markov chain, the reverse posterior, the variational bound, and the elegant noise-prediction simplification that makes training as simple as a single line of code.
This article builds on the probability foundations from Article 01 — particularly Gaussian distributions, the reparameterization trick, KL divergence, and the ELBO. If concepts like DKL(q ‖ p) or the evidence lower bound are unfamiliar, start there.
The Forward Process
The forward process is a Markov chain that adds Gaussian noise to data over
T timesteps. Starting from a clean data sample x0 ~ q(x0),
each step slightly corrupts the previous state:

q(xt | xt-1) = 𝒩(xt; √(1 - βt) · xt-1, βt I)
Here β1, ..., βT is the noise schedule —
a sequence of small positive values (typically 10⁻⁴ to 0.02) that control how much noise
is added at each step. The scaling factor √(1 - βt) slightly
shrinks the signal while βt controls the noise variance. Together they
ensure the total variance doesn't blow up.
At each step, we can use the reparameterization trick to write:

xt = √(1 - βt) · xt-1 + √βt · ε,  where ε ~ 𝒩(0, I)
After enough steps (typically T = 1000), the signal is completely drowned by noise:
q(xT | x0) ≈ 𝒩(0, I). The data has been destroyed.
Closed-form sampling at any timestep
A crucial mathematical property: we don't need to run all T steps sequentially to get
xt. Define αt = 1 - βt
and ᾱt = α1 · α2 · ... · αt
(the cumulative product of all signal-retention factors up to step t). Then, by recursively applying the single-step formula and using the fact that sums of independent Gaussians are Gaussian:

q(xt | x0) = 𝒩(xt; √ᾱt · x0, (1 - ᾱt) I)

Equivalently, via reparameterization:

xt = √ᾱt · x0 + √(1 - ᾱt) · ε,  where ε ~ 𝒩(0, I)
The coefficient √ᾱt controls how much of the original signal remains, and √(1 - ᾱt) controls the noise magnitude. As t → T, ᾱt → 0 and we get pure noise.
This closed-form is essential for training efficiency. Instead of sequentially noising an image through
all 1000 steps, we can jump directly to any timestep t by sampling ε and computing
xt in one operation. During training, we sample a random t for each example
in the batch — no sequential processing needed.
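As a minimal NumPy sketch of this one-shot jump (the function name `q_sample` and the toy data are ours; the linear schedule values follow the paper):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal retention, ᾱ_t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly, without running t sequential steps.
    t is 1-indexed, matching the article's notation."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = np.ones(8)                      # a toy "clean" sample
xt, eps = q_sample(x0, t=500, rng=rng)
```

Note that `alpha_bar[-1]` is nearly zero under this schedule, which is exactly the "signal completely drowned by noise" condition at t = T.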
[Interactive demo: watch data points get progressively corrupted by noise; drag the timestep slider to see any point in the forward process.]
The Reverse Process
The forward process destroys data. The reverse process learns to undo the destruction. Starting from
pure noise xT ~ 𝒩(0, I), we want to iteratively recover
xT-1, xT-2, ..., x0 until we arrive at a clean
data sample.
The reverse process is parameterized as:

pθ(xt-1 | xt) = 𝒩(xt-1; μθ(xt, t), Σθ(xt, t))
At each step, the neural network takes the current noisy state xt and the
timestep t, and predicts the parameters of a Gaussian distribution over the previous
(less noisy) state. The key insight, due to Feller (1949) and later formalized in the diffusion-model
context, is that when βt is small enough, the reverse of a Gaussian diffusion step is
also approximately Gaussian. So a Gaussian parameterization for each reverse step is well-justified.
The reverse posterior q(xt-1 | xt, x0)
There is a remarkable mathematical gift in the DDPM framework. While the true reverse distribution
q(xt-1 | xt) is intractable (it requires integrating over all
possible x0), the reverse posterior conditioned on x0
is tractable and Gaussian. Using Bayes' theorem, q(xt-1 | xt, x0) ∝ q(xt | xt-1) · q(xt-1 | x0). Since both factors are Gaussian, the product is also Gaussian:

q(xt-1 | xt, x0) = 𝒩(xt-1; μ̃t(xt, x0), β̃t I)

where:

μ̃t(xt, x0) = (√ᾱt-1 · βt / (1 - ᾱt)) · x0 + (√αt · (1 - ᾱt-1) / (1 - ᾱt)) · xt
β̃t = ((1 - ᾱt-1) / (1 - ᾱt)) · βt
This is a weighted average of x0 and xt. The variance β̃t depends only on the noise schedule — it requires no learning.
This tractable posterior is the key that unlocks DDPM training. We know the true distribution we want each reverse step to approximate (given that we could peek at x0). We can measure how well our learned reverse step matches this target using KL divergence — and the KL between two Gaussians has a closed-form expression.
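The posterior parameters can be computed directly from the schedule. A small NumPy sketch (the helper name `posterior_mean_var` is ours); note that the variance β̃t is always strictly smaller than βt, since ᾱt is decreasing:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_mean_var(x0, xt, t):
    """Parameters of q(x_{t-1} | x_t, x_0) for 1-indexed t >= 2."""
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0 + coef_xt * xt          # weighted average of x0 and xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t - 1]   # requires no learning
    return mean, var

x0, xt = np.zeros(4), np.ones(4)
mean, var = posterior_mean_var(x0, xt, t=500)
```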
The Training Objective
The variational lower bound
Following the ELBO framework from Article 01, we can decompose the variational lower bound on log p(x0) into a sum of terms — one per timestep:

L_VLB = LT + LT-1 + ... + L1 + L0

Here LT = DKL(q(xT | x0) ‖ p(xT)) measures how close the fully noised data is to the standard-normal prior, and L0 = -log pθ(x0 | x1) is a reconstruction term. Each Lt-1 (for t = 2, ..., T) is the KL divergence between the true reverse posterior and our learned approximation:

Lt-1 = DKL(q(xt-1 | xt, x0) ‖ pθ(xt-1 | xt))
Since both distributions are Gaussian, this KL has a closed-form expression that reduces to comparing their means (the variances can be fixed). The entire training objective becomes: make the predicted mean μθ(xt, t) match the true posterior mean μ̃t.
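Concretely, when both Gaussians share the same fixed isotropic variance σ², the KL collapses to a scaled squared distance between the means, KL = ‖μ1 - μ2‖² / (2σ²). A quick numeric sketch (helper name ours):

```python
import numpy as np

def kl_same_var(mu1, mu2, sigma2):
    """KL( N(mu1, sigma2*I) || N(mu2, sigma2*I) ) for a shared variance sigma2.
    The log-determinant and trace terms cancel, leaving only the mean gap."""
    return np.sum((mu1 - mu2) ** 2) / (2.0 * sigma2)

kl = kl_same_var(np.array([1.0, 0.0]), np.array([0.0, 0.0]), sigma2=0.5)
```

This is why matching μθ to μ̃t (a regression problem) is equivalent to minimizing the per-step KL.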
The noise prediction reparameterization
Ho et al. (2020) made a pivotal choice. Instead of directly predicting the mean μθ, they reparameterized the network to predict the noise ε that was added to create xt from x0.
Recall that xt = √ᾱt · x0 + √(1 - ᾱt) · ε.
If we know ε, we can recover x0:

x0 = (xt - √(1 - ᾱt) · ε) / √ᾱt

And then express the mean in terms of the predicted noise:

μθ(xt, t) = (1/√αt) · (xt - (βt / √(1 - ᾱt)) · εθ(xt, t))
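Both identities are easy to check numerically. A NumPy sketch (toy values ours): given the true noise ε, inverting the forward formula recovers x0 exactly, and the noise-parameterized mean agrees with the posterior mean μ̃t evaluated at that x0:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

rng = np.random.default_rng(1)
x0 = rng.standard_normal(4)
t = 300                                  # 1-indexed timestep
eps = rng.standard_normal(4)
xt = np.sqrt(alpha_bar[t-1]) * x0 + np.sqrt(1 - alpha_bar[t-1]) * eps

# Invert the forward formula: with the true eps, x0 is recovered exactly.
x0_rec = (xt - np.sqrt(1 - alpha_bar[t-1]) * eps) / np.sqrt(alpha_bar[t-1])

# Reverse-step mean expressed through the noise (here the true eps
# stands in for the network prediction eps_theta).
mean = (xt - betas[t-1] / np.sqrt(1 - alpha_bar[t-1]) * eps) / np.sqrt(alphas[t-1])
```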
The simplified loss
After plugging the noise parameterization into the VLB and dropping the weighting terms that depend on the schedule (which Ho et al. found empirically harmful), we arrive at the beautifully simple training objective:

L_simple = E over t, x0, ε of ‖ε - εθ(xt, t)‖²,  with xt = √ᾱt · x0 + √(1 - ᾱt) · ε
The training objective is simply: add noise to data, then train a neural network to predict what noise was added. That's it. No adversarial training, no complex loss balancing, no mode collapse. Sample x0 from data, sample t uniformly, sample ε ~ 𝒩(0, I), compute xt, predict ε, take gradient of MSE loss. One line of pseudocode.
The training algorithm in pseudocode:
while training:
    x0 = sample_from_data()                  # clean data
    t = randint(1, T)                        # random timestep
    eps = torch.randn_like(x0)               # random noise
    xt = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps  # noisy data
    loss = mse(model(xt, t), eps)            # predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This simplicity is not an approximation — it's a reweighted version of the full variational bound. Ho et al. found that uniform weighting across timesteps produces better sample quality than the theoretically derived weights, even though the latter give a tighter bound on the log-likelihood.
[Interactive demo: compare how ᾱt (signal retained) decays under linear and cosine schedules.]
Noise Schedules
The noise schedule β1, ..., βT controls the rate of
noise addition. This choice significantly affects sample quality.
Linear schedule (Ho et al., 2020): βt increases linearly from β1 = 10⁻⁴ to βT = 0.02 over T = 1000 steps. Simple and effective, but the signal-to-noise ratio drops too quickly in early steps, wasting capacity on nearly-destroyed images.
Cosine schedule (Nichol & Dhariwal, 2021): designed so that ᾱt follows a cosine curve. This produces a more gradual noise increase, with more timesteps spent at intermediate noise levels where the denoising task is neither trivial nor hopeless. The cosine schedule consistently improves FID scores, especially at higher resolutions.
| Schedule | Formula | Behavior | Used by |
|---|---|---|---|
| Linear | βt = β1 + (t-1)/(T-1) · (βT - β1) | Rapid signal destruction in early steps | DDPM (original) |
| Cosine | ᾱt = cos²((t/T + s)/(1+s) · π/2) | Gradual, more uniform SNR change | Improved DDPM, many modern models |
| Sigmoid | βt via sigmoid mapping | Slow start, fast middle, slow end | Some latent diffusion variants |
The signal-to-noise ratio (SNR) at timestep t is
SNR(t) = ᾱt / (1 - ᾱt). A good noise schedule
distributes the SNR decrease evenly in log-space across timesteps, giving the network a smooth
curriculum from easy denoising (high SNR) to hard denoising (low SNR).
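The two schedules are easy to compare numerically. A NumPy sketch (variable names ours; the cosine offset s = 0.008 and the clipping are the choices suggested by Nichol & Dhariwal, used here as assumptions):

```python
import numpy as np

T = 1000

# Linear schedule (Ho et al., 2020): specify betas, derive alpha_bar.
betas_lin = np.linspace(1e-4, 0.02, T)
ab_lin = np.cumprod(1.0 - betas_lin)

# Cosine schedule (Nichol & Dhariwal, 2021): specify alpha_bar directly.
s = 0.008
steps = np.arange(T + 1)
f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
ab_cos = np.clip(f[1:] / f[0], 1e-9, 1.0)   # normalize so ᾱ_0 = 1; clip extremes

# Log-SNR curves: a good schedule spreads this decrease evenly.
log_snr_lin = np.log(ab_lin / (1 - ab_lin))
log_snr_cos = np.log(ab_cos / (1 - ab_cos))
```

Plotting the two `log_snr` arrays shows the effect described above: the cosine schedule holds more signal at intermediate timesteps, while the linear schedule destroys it early.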
[Interactive demo: watch denoising recover data from noise; each step removes a small amount of noise.]
Sampling (The Reverse Pass)
Once trained, generating new data is straightforward:
- Sample xT ~ 𝒩(0, I) — start from pure noise.
- For t = T, T-1, ..., 1, compute one reverse step:
  xt-1 = (1/√αt) · (xt - (βt/√(1-ᾱt)) · εθ(xt, t)) + σt · z
  where z ~ 𝒩(0, I) for t > 1, and z = 0 for t = 1.
- Return x0 — a generated sample.
The variance σt² can be set to βt (Ho et al.)
or β̃t (both work, giving slightly different sample characteristics). The
stochastic noise z at each step introduces diversity — different z sequences produce different samples
from the same starting noise xT.
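The whole loop fits in a few lines. A NumPy sketch with σt² = βt; `eps_model` is a stand-in for the trained network (here it just returns zeros, so the output is not a meaningful sample — the point is the loop structure):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    """Placeholder for the trained eps_theta network."""
    return np.zeros_like(xt)

def ddpm_sample(shape, rng):
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in range(T, 0, -1):                   # t = T, T-1, ..., 1
        eps = eps_model(x, t)
        mean = (x - betas[t-1] / np.sqrt(1 - alpha_bar[t-1]) * eps) / np.sqrt(alphas[t-1])
        z = rng.standard_normal(shape) if t > 1 else 0.0   # no noise at t = 1
        x = mean + np.sqrt(betas[t-1]) * z      # sigma_t^2 = beta_t choice
    return x

sample = ddpm_sample((4,), np.random.default_rng(0))
```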
The obvious limitation: sampling requires T forward passes through the neural network. At T = 1000 and ~0.1s per pass on a GPU, a single image takes ~100 seconds. This motivated the entire field of fast samplers covered in Article 07.
[Interactive demo: the training loop: sample data, add noise at a random t, predict ε, minimize the MSE loss.]
Putting It Together
Let's zoom out and see the complete DDPM picture:
Training: repeat until convergence — sample clean data x0, sample random t ∈ {1,...,T}, sample noise ε ~ 𝒩(0,I), compute noisy xt, predict εθ(xt, t), take gradient step on ‖ε - εθ‖².
Sampling: sample xT ~ 𝒩(0,I). For t = T down to 1: predict noise, compute mean, add stochastic noise (if t > 1). Return x0.
The mathematical elegance is remarkable. The forward process is fixed (no parameters). The reverse process is a single neural network εθ shared across all timesteps, conditioned on t. The training loss is plain MSE. The sampling is a simple iterative loop.
The same neural network can be interpreted as learning three equivalent things: (1) the noise ε added at step t (noise prediction — the DDPM framing), (2) the clean data x0 given noisy xt (data prediction — useful for some samplers), (3) the score function ∇x log pt(x) (score prediction — the score-based framing of Article 03). These are related by simple linear transformations involving the schedule constants.
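These conversions can be sketched numerically. One caveat on (3): for a single known x0, the quantity below is the score of the conditional q(xt | x0); the network's prediction approximates the marginal score, related by the same formula. A NumPy sketch (toy values ours):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(2)
x0 = rng.standard_normal(3)
t = 400
eps = rng.standard_normal(3)
ab = alpha_bar[t - 1]
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

# (1) -> (2): data prediction from noise prediction
x0_pred = (xt - np.sqrt(1 - ab) * eps) / np.sqrt(ab)

# (1) -> (3): score of q(x_t | x_0), a linear rescaling of the noise
score = -eps / np.sqrt(1 - ab)
```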
DDPMs opened the floodgates. Within two years, diffusion models overtook GANs on every major image generation benchmark. But understanding why they work so well requires the score function perspective — which is where we're headed next.
References
Seminal papers and key works referenced in this article.
- Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. arXiv
- Sohl-Dickstein et al. "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML, 2015. arXiv
- Song et al. "Denoising Diffusion Implicit Models." ICLR, 2021. arXiv
- Nichol & Dhariwal. "Improved Denoising Diffusion Probabilistic Models." ICML, 2021. arXiv