Introduction
Generative models learn to create. Given a collection of images, they learn to synthesize new images that look like they belong to the same collection. Given a dataset of molecules, they learn to propose new molecules with similar properties. Given recordings of speech, they learn to generate new utterances in the same voice.
This is a fundamentally different task from discriminative modeling — classifying an image as a cat or dog, predicting whether a molecule will bind to a receptor, transcribing speech to text. Discriminative models learn boundaries between categories. Generative models learn the full distribution over the data itself: every pixel correlation, every structural regularity, every subtle statistical pattern that makes a face look like a face and not like random noise.
Over the past decade, a succession of generative modeling frameworks has emerged — Variational Autoencoders (Kingma & Welling, 2014), Generative Adversarial Networks (Goodfellow et al., 2014), autoregressive models (van den Oord et al., 2016), normalizing flows (Rezende & Mohamed, 2015), and energy-based models. Each brought unique insights and tradeoffs. But starting around 2020, a new family of methods began to dominate: diffusion models and their close relatives, score-based generative models and flow matching.
The core idea is breathtakingly simple: systematically destroy data by adding noise, then learn to undo the destruction. If you can learn to reverse each tiny step of corruption, you can start from pure random noise and iteratively sculpt it into a photorealistic image, a protein structure, or a symphony. Creation through the reversal of chaos.
This 8-part series takes you from probability foundations to state-of-the-art generation. We start here with the mathematical vocabulary — distributions, divergences, the ELBO, and the landscape of generative models. Then we build through DDPMs, score matching, SDEs, flow matching, architectures, fast sampling, and finally applications from text-to-image to protein design.
The Generative Modeling Problem
The formal setup is deceptively clean. We have a dataset of observations x^(1), x^(2), …, x^(N) drawn independently from an unknown data distribution p_data(x). Our goal is to learn a parameterized model p_θ(x) that approximates p_data(x) well enough that samples from p_θ are indistinguishable from real data.
Simple to state, staggering in practice. A 256×256 RGB image lives in ℝ^196,608 — a space with nearly 200,000 dimensions. The set of "natural images" occupies a vanishingly thin manifold within this space. A randomly sampled point in pixel space looks like television static, not a photograph. The generative model must learn the precise geometry of this manifold — every correlation between nearby pixels, every structural regularity of objects, every statistical pattern that distinguishes signal from noise.
This is the curse of dimensionality at its most acute. In high dimensions, volume concentrates in shells, distances between random points converge, and any finite dataset is hopelessly sparse. A million training images cover essentially zero percent of the space of all possible 256×256 images.
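The concentration of distances is easy to verify numerically. The sketch below (dimensions and sample counts chosen arbitrarily for illustration) measures the relative spread of pairwise distances between random Gaussian points as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Distance concentration: as dimension d grows, pairwise distances between
# random points cluster ever more tightly around their mean.
ratios = []
for d in (2, 200, 20_000):
    pts = rng.standard_normal((100, d))
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b.
    sq = (pts**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    dists = np.sqrt(np.maximum(d2[np.triu_indices(100, k=1)], 0.0))
    ratios.append(dists.std() / dists.mean())

print(ratios)  # relative spread shrinks roughly like 1/sqrt(2d)
```

For standard Gaussian points the mean pairwise distance grows like √(2d) while its standard deviation stays roughly constant, so the relative spread collapses toward zero — "distances between random points converge."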
Explicit vs implicit density models
Generative models split into two philosophical camps based on how they represent p_θ(x):
Explicit density models define a normalized probability distribution and can (at least in principle) evaluate p_θ(x) for any input. This enables maximum likelihood training: adjust θ to maximize the probability the model assigns to observed data. Autoregressive models, normalizing flows, and VAEs (via a lower bound) fall in this category.
Implicit density models can generate samples from p_θ but cannot evaluate the density directly. GANs are the canonical example — the generator maps noise to data, but there is no tractable expression for the density of generated samples. Training uses a surrogate objective (the discriminator's feedback) rather than likelihood.
Diffusion models occupy a fascinating middle ground. They are trained via a variational bound on the log-likelihood, making them explicit in principle. But their real power comes from an implicit iterative sampling procedure — start with noise, denoise step by step — that produces samples of extraordinary quality. The probability flow ODE formulation even allows exact likelihood computation when needed (Article 04).
Probability Distributions for Generation
Before we can measure how well a model approximates data, we need the language of probability distributions. Three distributions appear constantly in diffusion models: the Gaussian, the data distribution, and learned conditional distributions. Let's build the vocabulary.
The Gaussian distribution
The Gaussian (normal) distribution is the default noise source in diffusion models, and for good reason. The central limit theorem guarantees that sums of independent random variables converge to Gaussians. Gaussian noise is analytically tractable — sums, conditionals, and marginals of Gaussians remain Gaussian. And the KL divergence between two Gaussians has a closed-form expression.
A d-dimensional Gaussian is parameterized by a mean vector μ ∈ ℝ^d and a covariance matrix Σ ∈ ℝ^{d×d}:

𝒩(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½ (x − μ)ᵀ Σ^{−1} (x − μ))

In diffusion models, we almost always use isotropic Gaussians where Σ = σ²I — each dimension has the same variance and dimensions are independent. The standard normal 𝒩(0, I) serves as the "pure noise" endpoint of the diffusion process.
The reparameterization trick
A critical technique throughout diffusion models: instead of sampling x ~ 𝒩(μ, σ²I) directly, we write:

x = μ + σ ε,   where ε ~ 𝒩(0, I)
This reparameterization trick (Kingma & Welling, 2014) separates the randomness (ε) from the parameters (μ, σ). The immediate benefit: gradients can flow through μ and σ because the sampling operation is now a deterministic function of differentiable parameters plus fixed noise. Every diffusion model uses this trick — the forward process adds noise via reparameterization, and the training objective is expressed in terms of the noise ε.
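In code, the trick is one line. A minimal NumPy sketch (μ and σ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.5

# Reparameterized sampling: all the randomness lives in eps, and the sample
# is a deterministic function of the parameters (mu, sigma) plus fixed noise.
eps = rng.standard_normal(100_000)
x = mu + sigma * eps

# The samples follow N(mu, sigma^2), exactly as direct sampling would give.
print(x.mean(), x.std())  # ≈ 3.0, ≈ 0.5

# Because x = mu + sigma * eps is deterministic given eps, dx/dmu = 1 and
# dx/dsigma = eps — an autodiff framework can backpropagate through sampling.
```

The same pattern appears in every diffusion model's forward process: sample ε once, then form the noisy state as a deterministic function of the clean data and ε.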
Bayes' theorem and posterior computation
Bayes' theorem connects priors, likelihoods, and posteriors:

p(z | x) = p(x | z) p(z) / p(x)
In generative modeling, z represents latent variables (hidden causes of the data) and
x represents observations. The posterior p(z|x) tells us
what latent state likely produced an observation. The evidence
p(x) = ∫ p(x|z) p(z) dz is the probability of the data averaged over all possible
latent states — typically intractable to compute.
In diffusion models, the "latent variables" are the intermediate noisy states x_1, …, x_T. The forward process defines q(x_t | x_{t−1}), and the reverse posterior q(x_{t−1} | x_t, x_0) turns out to be a tractable Gaussian — one of the mathematical gifts that makes DDPMs work (Article 02).
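To see Bayes' theorem in action where every term is tractable, here is a hypothetical 1D conjugate model — prior p(z) = 𝒩(0, 1), likelihood p(x|z) = 𝒩(z, 1) — with the evidence integral done by numerical quadrature:

```python
import numpy as np

def normal_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 1.5
z = np.linspace(-8.0, 8.0, 4001)
dz = z[1] - z[0]

prior = normal_pdf(z, 0.0, 1.0)        # p(z) = N(0, 1)
likelihood = normal_pdf(x, z, 1.0)     # p(x|z) = N(z, 1)

# Evidence p(x) = ∫ p(x|z) p(z) dz, approximated on the grid.
evidence = np.sum(likelihood * prior) * dz
posterior = likelihood * prior / evidence  # Bayes' theorem

# Conjugacy gives closed forms: p(x) = N(0, 2) and p(z|x) = N(x/2, 1/2).
post_mean = np.sum(z * posterior) * dz
print(evidence, normal_pdf(x, 0.0, 2.0))  # agree
print(post_mean)                          # ≈ x / 2
```

In one dimension the evidence integral is a cheap sum; with a thousand image-sized latents, as in a diffusion model, no such quadrature is possible — hence the variational machinery of the ELBO below.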
Interactive figure: adjust the mean and variance of a 2D Gaussian. Dots are samples; contours show the density.
Measuring Distance Between Distributions
To train a generative model, we need a way to measure how different the model distribution p_θ is from the data distribution p_data.
If we can quantify this gap, we can minimize it. The choice of distance measure profoundly shapes
the training dynamics and failure modes of every generative model.
KL divergence
The Kullback–Leibler divergence is the most fundamental divergence in generative modeling. It measures the expected number of "extra bits" needed to encode samples from p using a code optimized for q:

D_KL(p ‖ q) = 𝔼_{x~p}[log p(x) − log q(x)] = ∫ p(x) log (p(x) / q(x)) dx
Key properties of KL divergence:
- Non-negative: D_KL(p ‖ q) ≥ 0, with equality iff p = q (Gibbs' inequality).
- Asymmetric: D_KL(p ‖ q) ≠ D_KL(q ‖ p) in general. This asymmetry matters enormously.
- Not a true metric: it violates symmetry and the triangle inequality.
The forward KL D_KL(p_data ‖ p_θ) penalizes the model for assigning low probability where data actually exists. It is mean-seeking: the model tries to cover all modes of the data, even at the cost of placing probability mass in low-density regions between modes.

The reverse KL D_KL(p_θ ‖ p_data) penalizes the model for placing probability where no data exists. It is mode-seeking: the model concentrates on a single mode of the data, producing sharp but potentially incomplete samples.
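The asymmetry can be seen directly by fitting a single Gaussian q to a hypothetical bimodal p under each divergence — here by brute-force grid search with numerical integration, a sketch rather than how any real model is trained:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Bimodal "data" distribution: two well-separated modes.
p = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)

def kl(a, b):
    # D_KL(a || b) by numerical integration; tiny epsilon avoids log(0).
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Fit a single Gaussian q under each divergence by grid search.
best_fwd, best_rev = None, None
for mu in np.linspace(-3.0, 3.0, 61):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = gauss(x, mu, sigma)
        f, r = kl(p, q), kl(q, p)
        if best_fwd is None or f < best_fwd[0]:
            best_fwd = (f, mu, sigma)
        if best_rev is None or r < best_rev[0]:
            best_rev = (r, mu, sigma)

print("forward KL optimum:", best_fwd[1:])  # mu ≈ 0, sigma large (covers both modes)
print("reverse KL optimum:", best_rev[1:])  # mu ≈ ±2, sigma small (picks one mode)
```

The forward-KL fit straddles both modes (mean-seeking); the reverse-KL fit locks onto a single mode (mode-seeking) — exactly the behaviors described above.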
Maximum likelihood is KL minimization
Maximizing the log-likelihood over data is equivalent to minimizing the forward KL divergence:

argmax_θ 𝔼_{p_data}[log p_θ(x)] = argmin_θ D_KL(p_data ‖ p_θ)

Proof: D_KL(p_data ‖ p_θ) = 𝔼_{p_data}[log p_data(x)] − 𝔼_{p_data}[log p_θ(x)]. The first term (the negative entropy of the data) is constant w.r.t. θ. Minimizing the KL is therefore equivalent to maximizing the second term — the expected log-likelihood under the data distribution. In practice, we approximate this with the empirical mean over training samples: (1/N) Σ_i log p_θ(x^(i)).
This connection is why maximum likelihood is the default training objective for explicit density models. When you train a normalizing flow or an autoregressive model by maximizing log-likelihood, you are implicitly minimizing the forward KL divergence from data to model.
Other divergences
The Jensen–Shannon divergence symmetrizes KL:

D_JS(p, q) = ½ D_KL(p ‖ m) + ½ D_KL(q ‖ m),   where m = (p + q) / 2.

The original GAN objective minimizes a quantity related to JSD.
The Wasserstein distance (earth mover's distance) measures the minimum cost of "transporting mass" from one distribution to another. Unlike KL, it is well-defined even when distributions have non-overlapping support — a property that helps with training stability. Wasserstein distance will reappear when we discuss optimal transport in flow matching (Article 05).
The Fisher divergence compares gradients of log-densities:

D_F(p ‖ q) = 𝔼_p[‖∇_x log p(x) − ∇_x log q(x)‖²]

This is the divergence minimized by score matching — the foundation of score-based generative models (Article 03).
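A quick numerical check of both the score ∇_x log p(x) and the Fisher divergence, using 1D Gaussians where both have closed forms (the particular means and variances are arbitrary):

```python
import numpy as np

# Score of a 1D Gaussian: d/dx log N(x; mu, sigma^2) = -(x - mu) / sigma^2.
def score(x, mu, sigma):
    return -(x - mu) / sigma**2

def logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Verify the analytic score against a central finite difference.
x0, h = 0.7, 1e-5
fd = (logpdf(x0 + h, 1.0, 2.0) - logpdf(x0 - h, 1.0, 2.0)) / (2 * h)
print(fd, score(x0, 1.0, 2.0))  # both ≈ 0.075

# Fisher divergence D_F(p || q) = E_p[(score_p - score_q)^2].
# For p = N(mu1, s^2) and q = N(mu2, s^2) it equals (mu1 - mu2)^2 / s^4.
rng = np.random.default_rng(0)
xs = 1.0 + 2.0 * rng.standard_normal(1_000_000)  # samples from p = N(1, 4)
mc = np.mean((score(xs, 1.0, 2.0) - score(xs, 3.0, 2.0)) ** 2)
print(mc, (1.0 - 3.0) ** 2 / 2.0**4)  # both ≈ 0.25
```

Note that D_F compares gradients only, so it is blind to normalization constants — the key property that makes score matching practical for unnormalized models.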
Interactive figure: compare two 1D Gaussians. Adjust q's parameters and watch D_KL(p ‖ q) update in real time.
The Evidence Lower Bound
Many of the most powerful generative models — VAEs, diffusion models, hierarchical models — involve latent variables: hidden quantities that are never directly observed but help explain the structure of the data. The mathematics of training such models leads inevitably to one of the most important objects in generative modeling: the Evidence Lower Bound, or ELBO.
Latent variable models
A latent variable model posits that each data point x was generated by first sampling a latent code z ~ p(z), then generating the observation x ~ p(x|z). The marginal likelihood (also called the evidence) is:

p(x) = ∫ p(x | z) p(z) dz
This integral is almost always intractable. For a diffusion model with T = 1000 timesteps, the "latent variables" are x_1, …, x_1000, and the integral is over a space of dimension 1000 × d. No quadrature rule or Monte Carlo estimate can handle this directly.
The solution: instead of computing log p(x) exactly, we derive a lower bound
that we can compute and optimize.
Deriving the ELBO
We introduce an approximate posterior q(z|x) and apply Jensen's inequality:

log p(x) = log 𝔼_{q(z|x)}[p(x, z) / q(z|x)] ≥ 𝔼_{q(z|x)}[log p(x, z) − log q(z|x)] = 𝔼_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z))

The right-hand side is the ELBO.
The first term is the reconstruction term: how well can we reconstruct x from a latent code sampled from q? The second term is the regularization term: how close is our approximate posterior to the prior?
The gap between log p(x) and the ELBO is exactly D_KL(q(z|x) ‖ p(z|x)) — the KL divergence between the approximate and true posterior. A better approximate posterior tightens the bound.
In a VAE, q(z|x) is an encoder network that maps data to latent distributions, and p(x|z) is a decoder network that maps latent codes to data distributions. Training maximizes the ELBO, which simultaneously trains the encoder and decoder.
In a diffusion model, the situation is structurally identical but the roles are fixed: q(x_{1:T} | x_0) is the forward process (a fixed sequence of noise additions, no learnable parameters), and p_θ(x_{0:T}) is the reverse process (a learned sequence of denoising steps). The ELBO decomposes into a sum of KL divergences between Gaussians at each timestep — each of which has a closed-form expression. This decomposition is the mathematical foundation of DDPM training (Article 02).
The DDPM training objective — predicting the noise ε added at each step — is a simplified reweighted version of the ELBO. Score matching objectives can also be derived from the ELBO in continuous time. Even flow matching has connections to variational inference. Understanding the ELBO is understanding the mathematical bedrock of the entire field.
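The bound and its gap can be checked numerically in a toy model where everything is tractable. The sketch below assumes a hypothetical model with p(z) = 𝒩(0, 1) and p(x|z) = 𝒩(z, 1), for which the evidence is p(x) = 𝒩(0, 2) and the true posterior is 𝒩(x/2, 1/2):

```python
import numpy as np

x = 1.0  # a single observation

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

# Exact evidence: p(x) = N(0, 2) by Gaussian conjugacy.
log_evidence = log_normal(x, 0.0, 2.0)

def elbo(m, s2):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)) for Gaussian q = N(m, s2).
    # E_q[log p(x|z)] with p(x|z) = N(z, 1): uses E_q[(x - z)^2] = (x-m)^2 + s2.
    recon = -0.5 * np.log(2 * np.pi) - ((x - m) ** 2 + s2) / 2.0
    # KL(N(m, s2) || N(0, 1)) in closed form.
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    return recon - kl

# A mediocre q leaves a gap; the true posterior N(x/2, 1/2) closes it exactly.
print(log_evidence)        # the evidence
print(elbo(0.0, 1.0))      # strictly below log p(x)
print(elbo(x / 2, 0.5))    # equals log p(x): the bound is tight
```

The gap between `log_evidence` and `elbo(0.0, 1.0)` is exactly D_KL(q ‖ p(z|x)) for that suboptimal q — the same identity stated above.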
Interactive figure: adjust the quality of the approximate posterior q(z|x). As q improves, the KL gap shrinks and the ELBO tightens toward log p(x).
The Landscape of Generative Models
Before diving deep into diffusion, it helps to see the full landscape of generative approaches. Each family makes different tradeoffs between sample quality, training stability, likelihood evaluation, and sampling speed.
| Family | Density | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| Autoregressive | Exact (factorized) | MLE (teacher forcing) | Exact likelihood, simple training | Sequential sampling (slow) |
| VAE | Lower bound (ELBO) | ELBO maximization | Fast sampling, latent space | Blurry samples, posterior collapse |
| Normalizing Flows | Exact (change of vars) | MLE | Exact likelihood, invertible | Architecture constraints, limited expressiveness |
| GAN | Implicit | Adversarial (min-max) | Sharp samples, fast generation | Mode collapse, training instability |
| Energy-Based | Unnormalized | Contrastive / score matching | Flexible, composable | Intractable Z, slow MCMC sampling |
| Diffusion / Score | ELBO / implicit | Denoising objective | Best quality, stable training, mode coverage | Slow sampling (many steps) |
| Flow Matching | Implicit (via ODE) | Velocity regression | Simple loss, straight paths, fast | Newer, less understood theoretically |
Autoregressive models like GPT and PixelCNN factor the joint distribution as a product of conditionals: p(x) = ∏_i p(x_i | x_{<i}). This yields exact, tractable log-likelihoods and simple teacher-forced training. The cost is sequential sampling — each dimension must be generated one at a time, which is manageable for text but slow for images.
VAEs learn an encoder-decoder pair optimized via the ELBO. Sampling is fast (decode a random latent), but the variational bound introduces slack, and the Gaussian decoder assumption leads to blurry samples. The latent space, however, is useful for interpolation and manipulation.
Normalizing flows define an invertible mapping from a simple distribution (Gaussian) to the data distribution. The change-of-variables formula gives exact likelihoods. But the invertibility constraint limits architectural choices — you can't use arbitrary neural networks.
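To make the change-of-variables formula concrete, here is a minimal 1D "flow" — a fixed invertible map rather than a neural network, purely for illustration:

```python
import numpy as np

# Push a standard normal base variable x through the invertible map y = exp(x).
# The change-of-variables formula gives the exact density of y:
#   log p_y(y) = log p_x(log y) + log |d(log y)/dy| = log p_x(log y) - log y
def base_logpdf(x):
    return -0.5 * np.log(2 * np.pi) - 0.5 * x**2

def flow_logpdf(y):
    return base_logpdf(np.log(y)) - np.log(y)

# The resulting density (the log-normal distribution) is exact and normalized:
# it integrates to 1 — a true likelihood, no bound required.
y = np.linspace(1e-4, 50.0, 500_001)
dy = y[1] - y[0]
total = np.sum(np.exp(flow_logpdf(y))) * dy
print(total)  # ≈ 1
```

A real flow stacks many such invertible layers with learnable parameters; the invertibility requirement is exactly the architectural constraint mentioned above.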
GANs use a generator-discriminator game. The generator produces spectacularly sharp samples, but training is notoriously unstable (mode collapse, oscillation, failure to converge), and there is no likelihood for evaluation or comparison. GANs dominated image generation from 2014 to 2020.
Diffusion models and flow matching combine the best of several worlds: training stability comparable to likelihood-based methods, sample quality that exceeds GANs, full mode coverage without collapse, and (with modern samplers) competitive generation speed. This combination explains their rapid dominance across image, video, audio, and scientific applications.
Interactive figure: hover over each model family to see details. X-axis: sample quality. Y-axis: sampling speed.
Why Noise? The Core Insight Behind Diffusion
Of all the approaches in the landscape above, why has diffusion emerged as the dominant paradigm? The answer lies in an insight so simple it's almost philosophical: destruction is easy; creation is hard; but if you learn to reverse destruction, you get creation for free.
Consider the problem directly: given a random vector sampled from 𝒩(0, I), produce
a photorealistic image. This is a mapping from a simple distribution to an extraordinarily complex one.
Learning this mapping in one shot — as a GAN generator tries to do — requires the network to perform a
massive, discontinuous transformation. Small changes in the input noise can produce wildly different
outputs. The optimization landscape is treacherous.
Now consider the inverse problem: given a photorealistic image, add a tiny amount of Gaussian noise. This is trivial — literally one line of code. The key insight is that if we do this gradually, in many small steps, the data is smoothly transformed into pure noise. And crucially, the reverse of each tiny step is also a tiny step — a small, learnable denoising operation.
The forward process (data → noise) requires no learning. It's a fixed sequence of noise additions. The reverse process (noise → data) is what we learn — but each step is a modest denoising task, not a wild generative leap. A neural network that can remove a small amount of noise from a slightly corrupted image is much easier to train than one that must generate an entire image from scratch.
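The one-line noising step can be sketched as follows — a variance-preserving toy with a constant, hypothetical β (real schedules vary β over time, as Article 02 covers):

```python
import numpy as np

rng = np.random.default_rng(0)

beta, T = 0.02, 500
x = np.full(10_000, 5.0)  # "data": a batch of identical scalars

# The forward process: each step shrinks the signal slightly and adds a
# small amount of Gaussian noise. No learning anywhere.
for t in range(T):
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# After many small steps the signal is scaled by (1 - beta)^(T/2) ≈ 0.007,
# and the batch is statistically indistinguishable from N(0, 1).
print(x.mean(), x.std())  # ≈ 0, ≈ 1
```

Each individual step barely changes x — which is precisely why its reversal is a "modest denoising task" that a network can learn, rather than a single wild leap from noise to data.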
The name "diffusion" comes from physics. Adding noise to data is like heat diffusion — a crystal (structured data) gradually dissolves into thermal equilibrium (Gaussian noise). The second law of thermodynamics says this happens spontaneously. But if you know the exact microscopic dynamics, you can run the process backward — reconstructing the crystal from the thermal bath. This is precisely what diffusion models do, with a neural network standing in for knowledge of the microscopic dynamics.
There's a second, more technical reason noise is so powerful: it smooths the data distribution. The true data distribution is concentrated on a thin manifold — a spiky, discontinuous mess in high dimensions. Adding noise inflates this manifold into a full-dimensional cloud, filling in the gaps and making the distribution smooth and well-behaved. A smooth distribution has well-defined gradients (scores) everywhere, which makes it much easier to learn and sample from.
At high noise levels, the distribution is nearly Gaussian — simple and easy to model. At low noise levels, it's close to the data — complex but locally smooth. By learning to denoise at every scale from pure noise down to pristine data, the model builds a complete, multi-scale understanding of the data distribution.
From Theory to Architecture
We now have the mathematical vocabulary to understand diffusion models and flow matching. We know what generative models aim to do (approximate p_data), how to measure success (KL divergence, likelihood), how to handle latent variables (the ELBO), and why noise is the secret ingredient (smooth gradients, multi-scale learning, easy destruction paired with learnable reconstruction).
The next seven articles build the full stack:
- Article 02: DDPMs — the discrete forward process, reverse denoising, the noise prediction objective, and training
- Article 03: Score functions and Langevin dynamics — the gradient perspective on diffusion
- Article 04: SDEs — the continuous-time unification of DDPMs and score models
- Article 05: Flow matching — learning velocity fields instead of scores, with straighter paths
- Article 06: Architectures — U-Nets, DiTs, classifier-free guidance, latent diffusion
- Article 07: Fast sampling — DDIM, DPM-Solver, consistency models, and one-step generation
- Article 08: Applications — from text-to-image to protein design and beyond
Each concept builds on the previous one. The math gets richer, the models get more powerful, and the applications get more extraordinary. Let's begin with the model that started the revolution: the Denoising Diffusion Probabilistic Model.
References
Seminal papers and key works referenced in this article.