Introduction
In 2015, Sohl-Dickstein et al. proposed a provocative idea borrowed from non-equilibrium statistical mechanics: what if we could generate data by learning to reverse a diffusion process? The idea was theoretically intriguing but produced mediocre samples. Five years later, Ho, Jain & Abbeel (2020) revisited the approach with modern neural networks and a crucial simplification of the training objective. The result — Denoising Diffusion Probabilistic Models (DDPMs) — produced image quality rivaling GANs while being far more stable to train.
DDPMs work in two phases. The forward process takes a clean data sample and progressively corrupts it by adding Gaussian noise over T timesteps, until only pure noise remains. The reverse process starts from pure noise and learns to iteratively denoise, one step at a time, recovering a clean data sample. The neural network only needs to solve a modest task at each step — predict and remove a small amount of noise — rather than generating an entire image from scratch.
This article derives the DDPM framework from first principles: the forward Markov chain, the reverse posterior, the variational bound, and the elegant noise-prediction simplification that makes training as simple as a single line of code.
This article builds on the probability foundations from Article 01 — particularly Gaussian distributions, the reparameterization trick, KL divergence, and the ELBO. If concepts like DKL(q ‖ p) or the evidence lower bound are unfamiliar, start there.
The Forward Process
The forward process is a Markov chain that adds Gaussian noise to data over
T timesteps. Starting from a clean data sample x0 ~ q(x0),
each step slightly corrupts the previous state:

q(xt | xt-1) = 𝒩(xt; √(1 - βt) · xt-1, βt I)
Here β1, ..., βT is the noise schedule —
a sequence of small positive values (typically 10⁻⁴ to 0.02) that control how much noise
is added at each step. The scaling factor √(1 - βt) slightly
shrinks the signal while βt controls the noise variance. Together they
ensure the total variance doesn't blow up.
At each step, we can use the reparameterization trick to write:

xt = √(1 - βt) · xt-1 + √βt · ε,  where ε ~ 𝒩(0, I)
After enough steps (typically T = 1000), the signal is completely drowned by noise:
q(xT | x0) ≈ 𝒩(0, I). The data has been destroyed.
Closed-form sampling at any timestep
A crucial mathematical property: we don't need to run all T steps sequentially to get
xt. Define αt = 1 - βt
and ᾱt = α1 · α2 · ... · αt
(the cumulative product of all signal-retention factors up to step t). Then, by recursively applying the single-step formula and using the fact that sums of independent Gaussians are Gaussian:

q(xt | x0) = 𝒩(xt; √ᾱt · x0, (1 - ᾱt) I)

Equivalently, via reparameterization:

xt = √ᾱt · x0 + √(1 - ᾱt) · ε,  where ε ~ 𝒩(0, I)
The coefficient √ᾱt controls how much of the original signal remains, and √(1 - ᾱt) controls the noise magnitude. As t → T, ᾱt → 0 and we get pure noise.
This closed-form is essential for training efficiency. Instead of sequentially noising an image through
all 1000 steps, we can jump directly to any timestep t by sampling ε and computing
xt in one operation. During training, we sample a random t for each example
in the batch — no sequential processing needed.
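As a minimal NumPy sketch of this one-shot jump (the function name `q_sample` and the toy data are ours; the linear schedule values follow the paper):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # linear noise schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # cumulative signal retention, ᾱ_t

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) directly, without running t sequential steps.
    t is 1-indexed, matching the article's notation."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t - 1]) * x0 + np.sqrt(1.0 - alpha_bar[t - 1]) * eps
    return xt, eps

rng = np.random.default_rng(0)
x0 = np.ones(8)                      # a toy "clean" sample
xt, eps = q_sample(x0, t=500, rng=rng)
```

Note that `alpha_bar[-1]` is nearly zero under this schedule, which is exactly the "signal completely drowned by noise" condition at t = T.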
[Interactive demo: watch data points get progressively corrupted by noise; drag the timestep slider to see any point in the forward process.]
The Reverse Process
The forward process destroys data. The reverse process learns to undo the destruction. Starting from
pure noise xT ~ 𝒩(0, I), we want to iteratively recover
xT-1, xT-2, ..., x0 until we arrive at a clean
data sample.
The reverse process is parameterized as:

pθ(xt-1 | xt) = 𝒩(xt-1; μθ(xt, t), Σθ(xt, t))
At each step, the neural network takes the current noisy state xt and the
timestep t, and predicts the parameters of a Gaussian distribution over the previous
(less noisy) state. The key insight, due to Feller (1949) and later formalized in the diffusion-model
context, is that when βt is small enough, the reverse of a Gaussian diffusion step is
also approximately Gaussian. So a Gaussian parameterization for each reverse step is well-justified.
The reverse posterior q(xt-1 | xt, x0)
There is a remarkable mathematical gift in the DDPM framework. While the true reverse distribution
q(xt-1 | xt) is intractable (it requires integrating over all
possible x0), the reverse posterior conditioned on x0
is tractable and Gaussian. Using Bayes' theorem, q(xt-1 | xt, x0) ∝ q(xt | xt-1) · q(xt-1 | x0). Since both factors are Gaussian, the product is also Gaussian:

q(xt-1 | xt, x0) = 𝒩(xt-1; μ̃t(xt, x0), β̃t I)

where:

μ̃t(xt, x0) = (√ᾱt-1 · βt / (1 - ᾱt)) · x0 + (√αt · (1 - ᾱt-1) / (1 - ᾱt)) · xt
β̃t = ((1 - ᾱt-1) / (1 - ᾱt)) · βt
This is a weighted average of x0 and xt. The variance β̃t depends only on the noise schedule — it requires no learning.
This tractable posterior is the key that unlocks DDPM training. We know the true distribution we want each reverse step to approximate (given that we could peek at x0). We can measure how well our learned reverse step matches this target using KL divergence — and the KL between two Gaussians has a closed-form expression.
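The posterior parameters can be computed directly from the schedule. A small NumPy sketch (the helper name `posterior_mean_var` is ours); note that the variance β̃t is always strictly smaller than βt, since ᾱt is decreasing:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_mean_var(x0, xt, t):
    """Parameters of q(x_{t-1} | x_t, x_0) for 1-indexed t >= 2."""
    ab_t, ab_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(ab_prev) * betas[t - 1] / (1.0 - ab_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1.0 - ab_prev) / (1.0 - ab_t)
    mean = coef_x0 * x0 + coef_xt * xt          # weighted average of x0 and xt
    var = (1.0 - ab_prev) / (1.0 - ab_t) * betas[t - 1]   # requires no learning
    return mean, var

x0, xt = np.zeros(4), np.ones(4)
mean, var = posterior_mean_var(x0, xt, t=500)
```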
The Training Objective
The variational lower bound
Following the ELBO framework from Article 01, we can decompose the variational lower bound on log p(x0) into a sum of terms — one per timestep:

L_VLB = LT + LT-1 + ... + L1 + L0

Here LT = DKL(q(xT | x0) ‖ p(xT)) measures how close the fully noised data is to the standard-normal prior, and L0 = -log pθ(x0 | x1) is a reconstruction term. Each Lt-1 (for t = 2, ..., T) is the KL divergence between the true reverse posterior and our learned approximation:

Lt-1 = DKL(q(xt-1 | xt, x0) ‖ pθ(xt-1 | xt))
Since both distributions are Gaussian, this KL has a closed-form expression that reduces to comparing their means (the variances can be fixed). The entire training objective becomes: make the predicted mean μθ(xt, t) match the true posterior mean μ̃t.
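Concretely, when both Gaussians share the same fixed isotropic variance σ², the KL collapses to a scaled squared distance between the means, KL = ‖μ1 - μ2‖² / (2σ²). A quick numeric sketch (helper name ours):

```python
import numpy as np

def kl_same_var(mu1, mu2, sigma2):
    """KL( N(mu1, sigma2*I) || N(mu2, sigma2*I) ) for a shared variance sigma2.
    The log-determinant and trace terms cancel, leaving only the mean gap."""
    return np.sum((mu1 - mu2) ** 2) / (2.0 * sigma2)

kl = kl_same_var(np.array([1.0, 0.0]), np.array([0.0, 0.0]), sigma2=0.5)
```

This is why matching μθ to μ̃t (a regression problem) is equivalent to minimizing the per-step KL.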
The noise prediction reparameterization
Ho et al. (2020) made a pivotal choice. Instead of directly predicting the mean μθ, they reparameterized the network to predict the noise ε that was added to create xt from x0.
Recall that xt = √ᾱt · x0 + √(1 - ᾱt) · ε.
If we know ε, we can recover x0:

x0 = (xt - √(1 - ᾱt) · ε) / √ᾱt

And then express the mean in terms of the predicted noise:

μθ(xt, t) = (1/√αt) · (xt - (βt / √(1 - ᾱt)) · εθ(xt, t))
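Both identities are easy to check numerically. A NumPy sketch (toy values ours): given the true noise ε, inverting the forward formula recovers x0 exactly, and the noise-parameterized mean agrees with the posterior mean μ̃t evaluated at that x0:

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

rng = np.random.default_rng(1)
x0 = rng.standard_normal(4)
t = 300                                  # 1-indexed timestep
eps = rng.standard_normal(4)
xt = np.sqrt(alpha_bar[t-1]) * x0 + np.sqrt(1 - alpha_bar[t-1]) * eps

# Invert the forward formula: with the true eps, x0 is recovered exactly.
x0_rec = (xt - np.sqrt(1 - alpha_bar[t-1]) * eps) / np.sqrt(alpha_bar[t-1])

# Reverse-step mean expressed through the noise (here the true eps
# stands in for the network prediction eps_theta).
mean = (xt - betas[t-1] / np.sqrt(1 - alpha_bar[t-1]) * eps) / np.sqrt(alphas[t-1])
```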
The simplified loss
After plugging the noise parameterization into the VLB and dropping the weighting terms that depend on the schedule (which Ho et al. found empirically harmful), we arrive at the beautifully simple training objective:

L_simple = E over t, x0, ε of ‖ε - εθ(xt, t)‖²,  with xt = √ᾱt · x0 + √(1 - ᾱt) · ε
The training objective is simply: add noise to data, then train a neural network to predict what noise was added. That's it. No adversarial training, no complex loss balancing, no mode collapse. Sample x0 from data, sample t uniformly, sample ε ~ 𝒩(0, I), compute xt, predict ε, take gradient of MSE loss. One line of pseudocode.
The training algorithm in pseudocode:
while training:
    x0 = sample_from_data()                  # clean data
    t = randint(1, T)                        # random timestep
    eps = torch.randn_like(x0)               # random noise
    xt = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * eps  # noisy data
    loss = mse(model(xt, t), eps)            # predict the noise
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
This simplicity is not an approximation — it's a reweighted version of the full variational bound. Ho et al. found that uniform weighting across timesteps produces better sample quality than the theoretically derived weights, even though the latter give a tighter bound on the log-likelihood.
[Interactive demo: compare how ᾱt (signal retained) decays under linear and cosine schedules.]
Noise Schedules
The noise schedule β1, ..., βT controls the rate of
noise addition. This choice significantly affects sample quality.
Linear schedule (Ho et al., 2020): βt increases linearly from β1 = 10⁻⁴ to βT = 0.02 over T = 1000 steps. Simple and effective, but the signal-to-noise ratio drops too quickly in early steps, wasting capacity on nearly-destroyed images.
Cosine schedule (Nichol & Dhariwal, 2021): designed so that ᾱt follows a cosine curve. This produces a more gradual noise increase, with more timesteps spent at intermediate noise levels where the denoising task is neither trivial nor hopeless. The cosine schedule consistently improves FID scores, especially at higher resolutions.
| Schedule | Formula | Behavior | Used by |
|---|---|---|---|
| Linear | βt = β1 + (t-1)/(T-1) · (βT - β1) | Rapid signal destruction in early steps | DDPM (original) |
| Cosine | ᾱt = cos²((t/T + s)/(1+s) · π/2) | Gradual, more uniform SNR change | Improved DDPM, many modern models |
| Sigmoid | βt via sigmoid mapping | Slow start, fast middle, slow end | Some latent diffusion variants |
The signal-to-noise ratio (SNR) at timestep t is
SNR(t) = ᾱt / (1 - ᾱt). A good noise schedule
distributes the SNR decrease evenly in log-space across timesteps, giving the network a smooth
curriculum from easy denoising (high SNR) to hard denoising (low SNR).
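The two schedules are easy to compare numerically. A NumPy sketch (variable names ours; the cosine offset s = 0.008 and the clipping are the choices suggested by Nichol & Dhariwal, used here as assumptions):

```python
import numpy as np

T = 1000

# Linear schedule (Ho et al., 2020): specify betas, derive alpha_bar.
betas_lin = np.linspace(1e-4, 0.02, T)
ab_lin = np.cumprod(1.0 - betas_lin)

# Cosine schedule (Nichol & Dhariwal, 2021): specify alpha_bar directly.
s = 0.008
steps = np.arange(T + 1)
f = np.cos((steps / T + s) / (1 + s) * np.pi / 2) ** 2
ab_cos = np.clip(f[1:] / f[0], 1e-9, 1.0)   # normalize so ᾱ_0 = 1; clip extremes

# Log-SNR curves: a good schedule spreads this decrease evenly.
log_snr_lin = np.log(ab_lin / (1 - ab_lin))
log_snr_cos = np.log(ab_cos / (1 - ab_cos))
```

Plotting the two `log_snr` arrays shows the effect described above: the cosine schedule holds more signal at intermediate timesteps, while the linear schedule destroys it early.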
[Interactive demo: watch denoising recover data from noise; each step removes a small amount of noise.]
Sampling (The Reverse Pass)
Once trained, generating new data is straightforward:
- Sample xT ~ 𝒩(0, I) — start from pure noise.
- For t = T, T-1, ..., 1, compute one reverse step:
  xt-1 = (1/√αt) · (xt - (βt/√(1-ᾱt)) · εθ(xt, t)) + σt · z
  where z ~ 𝒩(0, I) for t > 1, and z = 0 for t = 1.
- Return x0 — a generated sample.
The variance σt² can be set to βt (Ho et al.)
or β̃t (both work, giving slightly different sample characteristics). The
stochastic noise z at each step introduces diversity — different z sequences produce different samples
from the same starting noise xT.
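The whole loop fits in a few lines. A NumPy sketch with σt² = βt; `eps_model` is a stand-in for the trained network (here it just returns zeros, so the output is not a meaningful sample — the point is the loop structure):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def eps_model(xt, t):
    """Placeholder for the trained eps_theta network."""
    return np.zeros_like(xt)

def ddpm_sample(shape, rng):
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in range(T, 0, -1):                   # t = T, T-1, ..., 1
        eps = eps_model(x, t)
        mean = (x - betas[t-1] / np.sqrt(1 - alpha_bar[t-1]) * eps) / np.sqrt(alphas[t-1])
        z = rng.standard_normal(shape) if t > 1 else 0.0   # no noise at t = 1
        x = mean + np.sqrt(betas[t-1]) * z      # sigma_t^2 = beta_t choice
    return x

sample = ddpm_sample((4,), np.random.default_rng(0))
```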
The obvious limitation: sampling requires T forward passes through the neural network. At T = 1000 and ~0.1s per pass on a GPU, a single image takes ~100 seconds. This motivated the entire field of fast samplers covered in Article 07.
[Interactive demo: the training loop: sample data, add noise at a random t, predict ε, minimize the MSE loss.]
Putting It Together
Let's zoom out and see the complete DDPM picture:
Training: repeat until convergence — sample clean data x0, sample random t ∈ {1,...,T}, sample noise ε ~ 𝒩(0,I), compute noisy xt, predict εθ(xt, t), take gradient step on ‖ε - εθ‖².
Sampling: sample xT ~ 𝒩(0,I). For t = T down to 1: predict noise, compute mean, add stochastic noise (if t > 1). Return x0.
The mathematical elegance is remarkable. The forward process is fixed (no parameters). The reverse process is a single neural network εθ shared across all timesteps, conditioned on t. The training loss is plain MSE. The sampling is a simple iterative loop.
The same neural network can be interpreted as learning three equivalent things: (1) the noise ε added at step t (noise prediction — the DDPM framing), (2) the clean data x0 given noisy xt (data prediction — useful for some samplers), (3) the score function ∇x log pt(x) (score prediction — the score-based framing of Article 03). These are related by simple linear transformations involving the schedule constants.
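These conversions can be sketched numerically. One caveat on (3): for a single known x0, the quantity below is the score of the conditional q(xt | x0); the network's prediction approximates the marginal score, related by the same formula. A NumPy sketch (toy values ours):

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(2)
x0 = rng.standard_normal(3)
t = 400
eps = rng.standard_normal(3)
ab = alpha_bar[t - 1]
xt = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps

# (1) -> (2): data prediction from noise prediction
x0_pred = (xt - np.sqrt(1 - ab) * eps) / np.sqrt(ab)

# (1) -> (3): score of q(x_t | x_0), a linear rescaling of the noise
score = -eps / np.sqrt(1 - ab)
```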
DDPMs opened the floodgates. Within two years, diffusion models overtook GANs on every major image generation benchmark. But understanding why they work so well requires the score function perspective — which is where we're headed next.
References
Seminal papers and key works referenced in this article.
- Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020. arXiv
- Sohl-Dickstein et al. "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML, 2015. arXiv
- Song et al. "Denoising Diffusion Implicit Models." ICLR, 2021. arXiv
- Nichol & Dhariwal. "Improved Denoising Diffusion Probabilistic Models." ICML, 2021. arXiv