Introduction
Generative models learn to create. Given a collection of images, they learn to synthesize new images that look like they belong to the same collection. Given a dataset of molecules, they learn to propose new molecules with similar properties. Given recordings of speech, they learn to generate new utterances in the same voice.
This is a fundamentally different task from discriminative modeling — classifying an image as a cat or dog, predicting whether a molecule will bind to a receptor, transcribing speech to text. Discriminative models learn boundaries between categories. Generative models learn the full distribution over the data itself: every pixel correlation, every structural regularity, every subtle statistical pattern that makes a face look like a face and not like random noise.
Over the past decade, a succession of generative modeling frameworks has emerged — Variational Autoencoders (Kingma & Welling, 2014), Generative Adversarial Networks (Goodfellow et al., 2014), autoregressive models (van den Oord et al., 2016), normalizing flows (Rezende & Mohamed, 2015), and energy-based models. Each brought unique insights and tradeoffs. But starting around 2020, a new family of methods began to dominate: diffusion models and their close relatives, score-based generative models and flow matching.
The core idea is breathtakingly simple: systematically destroy data by adding noise, then learn to undo the destruction. If you can learn to reverse each tiny step of corruption, you can start from pure random noise and iteratively sculpt it into a photorealistic image, a protein structure, or a symphony. Creation through the reversal of chaos.
This 8-part series takes you from probability foundations to state-of-the-art generation. We start here with the mathematical vocabulary — distributions, divergences, the ELBO, and the landscape of generative models. Then we build through DDPMs, score matching, SDEs, flow matching, architectures, fast sampling, and finally applications from text-to-image to protein design.
The Generative Modeling Problem
The formal setup is deceptively clean. We have a dataset of observations x^(1), x^(2), …, x^(N) drawn independently from an unknown data distribution p_data(x). Our goal is to learn a parameterized model p_θ(x) that approximates p_data(x) well enough that samples from p_θ are indistinguishable from real data.
Simple to state, staggering in practice. A 256×256 RGB image lives in ℝ^196,608 — a space with nearly 200,000 dimensions. The set of "natural images" occupies a vanishingly thin manifold within this space. A randomly sampled point in pixel space looks like television static, not a photograph. The generative model must learn the precise geometry of this manifold — every correlation between nearby pixels, every structural regularity of objects, every statistical pattern that distinguishes signal from noise.
This is the curse of dimensionality at its most acute. In high dimensions, volume concentrates in shells, distances between random points converge, and any finite dataset is hopelessly sparse. A million training images cover essentially zero percent of the space of all possible 256×256 images.
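The concentration of distances is easy to verify numerically. The sketch below (dimensions and sample counts chosen arbitrarily for illustration) measures the relative spread of pairwise distances between random Gaussian points as the dimension grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Distance concentration: as dimension d grows, pairwise distances between
# random points cluster ever more tightly around their mean.
ratios = []
for d in (2, 200, 20_000):
    pts = rng.standard_normal((100, d))
    # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a·b.
    sq = (pts**2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * pts @ pts.T
    dists = np.sqrt(np.maximum(d2[np.triu_indices(100, k=1)], 0.0))
    ratios.append(dists.std() / dists.mean())

print(ratios)  # relative spread shrinks roughly like 1/sqrt(2d)
```

For standard Gaussian points the mean pairwise distance grows like √(2d) while its standard deviation stays roughly constant, so the relative spread collapses toward zero — "distances between random points converge."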
Explicit vs implicit density models
Generative models split into two philosophical camps based on how they represent p_θ(x):
Explicit density models define a normalized probability distribution and can (at least in principle) evaluate p_θ(x) for any input. This enables maximum likelihood training: adjust θ to maximize the probability the model assigns to observed data. Autoregressive models, normalizing flows, and VAEs (via a lower bound) fall in this category.
Implicit density models can generate samples from p_θ but cannot evaluate the density directly. GANs are the canonical example — the generator maps noise to data, but there is no tractable expression for the density of generated samples. Training uses a surrogate objective (the discriminator's feedback) rather than likelihood.
Diffusion models occupy a fascinating middle ground. They are trained via a variational bound on the log-likelihood, making them explicit in principle. But their real power comes from an implicit iterative sampling procedure — start with noise, denoise step by step — that produces samples of extraordinary quality. The probability flow ODE formulation even allows exact likelihood computation when needed (Article 04).
Probability Distributions for Generation
Before we can measure how well a model approximates data, we need the language of probability distributions. Three distributions appear constantly in diffusion models: the Gaussian, the data distribution, and learned conditional distributions. Let's build the vocabulary.
The Gaussian distribution
The Gaussian (normal) distribution is the default noise source in diffusion models, and for good reason. The central limit theorem guarantees that sums of independent random variables converge to Gaussians. Gaussian noise is analytically tractable — sums, conditionals, and marginals of Gaussians remain Gaussian. And the KL divergence between two Gaussians has a closed-form expression.
A d-dimensional Gaussian is parameterized by a mean vector μ ∈ ℝ^d and a covariance matrix Σ ∈ ℝ^{d×d}:

𝒩(x; μ, Σ) = (2π)^{−d/2} |Σ|^{−1/2} exp(−½ (x − μ)ᵀ Σ^{−1} (x − μ))

In diffusion models, we almost always use isotropic Gaussians where Σ = σ²I — each dimension has the same variance and dimensions are independent. The standard normal 𝒩(0, I) serves as the "pure noise" endpoint of the diffusion process.
The reparameterization trick
A critical technique throughout diffusion models: instead of sampling x ~ 𝒩(μ, σ²I) directly, we write:

x = μ + σ ε,   where ε ~ 𝒩(0, I)
This reparameterization trick (Kingma & Welling, 2014) separates the randomness (ε) from the parameters (μ, σ). The immediate benefit: gradients can flow through μ and σ because the sampling operation is now a deterministic function of differentiable parameters plus fixed noise. Every diffusion model uses this trick — the forward process adds noise via reparameterization, and the training objective is expressed in terms of the noise ε.
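In code, the trick is one line. A minimal NumPy sketch (μ and σ chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 3.0, 0.5

# Reparameterized sampling: all the randomness lives in eps, and the sample
# is a deterministic function of the parameters (mu, sigma) plus fixed noise.
eps = rng.standard_normal(100_000)
x = mu + sigma * eps

# The samples follow N(mu, sigma^2), exactly as direct sampling would give.
print(x.mean(), x.std())  # ≈ 3.0, ≈ 0.5

# Because x = mu + sigma * eps is deterministic given eps, dx/dmu = 1 and
# dx/dsigma = eps — an autodiff framework can backpropagate through sampling.
```

The same pattern appears in every diffusion model's forward process: sample ε once, then form the noisy state as a deterministic function of the clean data and ε.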
Bayes' theorem and posterior computation
Bayes' theorem connects priors, likelihoods, and posteriors:

p(z | x) = p(x | z) p(z) / p(x)
In generative modeling, z represents latent variables (hidden causes of the data) and
x represents observations. The posterior p(z|x) tells us
what latent state likely produced an observation. The evidence
p(x) = ∫ p(x|z) p(z) dz is the probability of the data averaged over all possible
latent states — typically intractable to compute.
In diffusion models, the "latent variables" are the intermediate noisy states x_1, …, x_T. The forward process defines q(x_t | x_{t−1}), and the reverse posterior q(x_{t−1} | x_t, x_0) turns out to be a tractable Gaussian — one of the mathematical gifts that makes DDPMs work (Article 02).
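To see Bayes' theorem in action where every term is tractable, here is a hypothetical 1D conjugate model — prior p(z) = 𝒩(0, 1), likelihood p(x|z) = 𝒩(z, 1) — with the evidence integral done by numerical quadrature:

```python
import numpy as np

def normal_pdf(v, mean, var):
    return np.exp(-(v - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

x = 1.5
z = np.linspace(-8.0, 8.0, 4001)
dz = z[1] - z[0]

prior = normal_pdf(z, 0.0, 1.0)        # p(z) = N(0, 1)
likelihood = normal_pdf(x, z, 1.0)     # p(x|z) = N(z, 1)

# Evidence p(x) = ∫ p(x|z) p(z) dz, approximated on the grid.
evidence = np.sum(likelihood * prior) * dz
posterior = likelihood * prior / evidence  # Bayes' theorem

# Conjugacy gives closed forms: p(x) = N(0, 2) and p(z|x) = N(x/2, 1/2).
post_mean = np.sum(z * posterior) * dz
print(evidence, normal_pdf(x, 0.0, 2.0))  # agree
print(post_mean)                          # ≈ x / 2
```

In one dimension the evidence integral is a cheap sum; with a thousand image-sized latents, as in a diffusion model, no such quadrature is possible — hence the variational machinery of the ELBO below.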
Interactive figure: adjust the mean and variance of a 2D Gaussian. Dots are samples; contours show the density.
Measuring Distance Between Distributions
To train a generative model, we need a way to measure how different the model distribution p_θ is from the data distribution p_data.
If we can quantify this gap, we can minimize it. The choice of distance measure profoundly shapes
the training dynamics and failure modes of every generative model.
KL divergence
The Kullback–Leibler divergence is the most fundamental divergence in generative modeling. It measures the expected number of "extra bits" needed to encode samples from p using a code optimized for q:

D_KL(p ‖ q) = 𝔼_{x~p}[log p(x) − log q(x)] = ∫ p(x) log (p(x) / q(x)) dx
Key properties of KL divergence:
- Non-negative: D_KL(p ‖ q) ≥ 0, with equality iff p = q (Gibbs' inequality).
- Asymmetric: D_KL(p ‖ q) ≠ D_KL(q ‖ p) in general. This asymmetry matters enormously.
- Not a true metric: it violates symmetry and the triangle inequality.
The forward KL D_KL(p_data ‖ p_θ) penalizes the model for assigning low probability where data actually exists. It is mean-seeking: the model tries to cover all modes of the data, even at the cost of placing probability mass in low-density regions between modes.

The reverse KL D_KL(p_θ ‖ p_data) penalizes the model for placing probability where no data exists. It is mode-seeking: the model concentrates on a single mode of the data, producing sharp but potentially incomplete samples.
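The asymmetry can be seen directly by fitting a single Gaussian q to a hypothetical bimodal p under each divergence — here by brute-force grid search with numerical integration, a sketch rather than how any real model is trained:

```python
import numpy as np

x = np.linspace(-8.0, 8.0, 2001)
dx = x[1] - x[0]

def gauss(x, mu, sigma):
    return np.exp(-(x - mu) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))

# Bimodal "data" distribution: two well-separated modes.
p = 0.5 * gauss(x, -2.0, 0.5) + 0.5 * gauss(x, 2.0, 0.5)

def kl(a, b):
    # D_KL(a || b) by numerical integration; tiny epsilon avoids log(0).
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps))) * dx

# Fit a single Gaussian q under each divergence by grid search.
best_fwd, best_rev = None, None
for mu in np.linspace(-3.0, 3.0, 61):
    for sigma in np.linspace(0.3, 3.0, 28):
        q = gauss(x, mu, sigma)
        f, r = kl(p, q), kl(q, p)
        if best_fwd is None or f < best_fwd[0]:
            best_fwd = (f, mu, sigma)
        if best_rev is None or r < best_rev[0]:
            best_rev = (r, mu, sigma)

print("forward KL optimum:", best_fwd[1:])  # mu ≈ 0, sigma large (covers both modes)
print("reverse KL optimum:", best_rev[1:])  # mu ≈ ±2, sigma small (picks one mode)
```

The forward-KL fit straddles both modes (mean-seeking); the reverse-KL fit locks onto a single mode (mode-seeking) — exactly the behaviors described above.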
Maximum likelihood is KL minimization
Maximizing the log-likelihood over data is equivalent to minimizing the forward KL divergence:

argmax_θ 𝔼_{p_data}[log p_θ(x)] = argmin_θ D_KL(p_data ‖ p_θ)

Proof: D_KL(p_data ‖ p_θ) = 𝔼_{p_data}[log p_data(x)] − 𝔼_{p_data}[log p_θ(x)]. The first term (the negative entropy of the data) is constant w.r.t. θ. Minimizing the KL is therefore equivalent to maximizing the second term — the expected log-likelihood under the data distribution. In practice, we approximate this with the empirical mean over training samples: (1/N) Σ_i log p_θ(x^(i)).
This connection is why maximum likelihood is the default training objective for explicit density models. When you train a normalizing flow or an autoregressive model by maximizing log-likelihood, you are implicitly minimizing the forward KL divergence from data to model.
Other divergences
The Jensen–Shannon divergence symmetrizes KL:

D_JS(p, q) = ½ D_KL(p ‖ m) + ½ D_KL(q ‖ m),   where m = (p + q) / 2.

The original GAN objective minimizes a quantity related to JSD.
The Wasserstein distance (earth mover's distance) measures the minimum cost of "transporting mass" from one distribution to another. Unlike KL, it is well-defined even when distributions have non-overlapping support — a property that helps with training stability. Wasserstein distance will reappear when we discuss optimal transport in flow matching (Article 05).
The Fisher divergence compares gradients of log-densities:

D_F(p ‖ q) = 𝔼_p[‖∇_x log p(x) − ∇_x log q(x)‖²]

This is the divergence minimized by score matching — the foundation of score-based generative models (Article 03).
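A quick numerical check of both the score ∇_x log p(x) and the Fisher divergence, using 1D Gaussians where both have closed forms (the particular means and variances are arbitrary):

```python
import numpy as np

# Score of a 1D Gaussian: d/dx log N(x; mu, sigma^2) = -(x - mu) / sigma^2.
def score(x, mu, sigma):
    return -(x - mu) / sigma**2

def logpdf(x, mu, sigma):
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

# Verify the analytic score against a central finite difference.
x0, h = 0.7, 1e-5
fd = (logpdf(x0 + h, 1.0, 2.0) - logpdf(x0 - h, 1.0, 2.0)) / (2 * h)
print(fd, score(x0, 1.0, 2.0))  # both ≈ 0.075

# Fisher divergence D_F(p || q) = E_p[(score_p - score_q)^2].
# For p = N(mu1, s^2) and q = N(mu2, s^2) it equals (mu1 - mu2)^2 / s^4.
rng = np.random.default_rng(0)
xs = 1.0 + 2.0 * rng.standard_normal(1_000_000)  # samples from p = N(1, 4)
mc = np.mean((score(xs, 1.0, 2.0) - score(xs, 3.0, 2.0)) ** 2)
print(mc, (1.0 - 3.0) ** 2 / 2.0**4)  # both ≈ 0.25
```

Note that D_F compares gradients only, so it is blind to normalization constants — the key property that makes score matching practical for unnormalized models.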
Interactive figure: compare two 1D Gaussians. Adjust q's parameters and watch D_KL(p ‖ q) update in real time.
The Evidence Lower Bound
Many of the most powerful generative models — VAEs, diffusion models, hierarchical models — involve latent variables: hidden quantities that are never directly observed but help explain the structure of the data. The mathematics of training such models leads inevitably to one of the most important objects in generative modeling: the Evidence Lower Bound, or ELBO.
Latent variable models
A latent variable model posits that each data point x was generated by first sampling a latent code z ~ p(z), then generating the observation x ~ p(x|z). The marginal likelihood (also called the evidence) is:

p(x) = ∫ p(x | z) p(z) dz
This integral is almost always intractable. For a diffusion model with T = 1000 timesteps, the "latent variables" are x_1, …, x_1000, and the integral is over a space of dimension 1000 × d. No quadrature rule or Monte Carlo estimate can handle this directly.
The solution: instead of computing log p(x) exactly, we derive a lower bound
that we can compute and optimize.
Deriving the ELBO
We introduce an approximate posterior q(z|x) and apply Jensen's inequality:

log p(x) = log 𝔼_{q(z|x)}[p(x, z) / q(z|x)] ≥ 𝔼_{q(z|x)}[log p(x, z) − log q(z|x)] = 𝔼_{q(z|x)}[log p(x|z)] − D_KL(q(z|x) ‖ p(z))

The right-hand side is the ELBO.
The first term is the reconstruction term: how well can we reconstruct x from a latent code sampled from q? The second term is the regularization term: how close is our approximate posterior to the prior?
The gap between log p(x) and the ELBO is exactly D_KL(q(z|x) ‖ p(z|x)) — the KL divergence between the approximate and true posterior. A better approximate posterior tightens the bound.
In a VAE, q(z|x) is an encoder network that maps data to latent distributions, and p(x|z) is a decoder network that maps latent codes to data distributions. Training maximizes the ELBO, which simultaneously trains the encoder and decoder.
In a diffusion model, the situation is structurally identical but the roles are fixed: q(x_{1:T} | x_0) is the forward process (a fixed sequence of noise additions, no learnable parameters), and p_θ(x_{0:T}) is the reverse process (a learned sequence of denoising steps). The ELBO decomposes into a sum of KL divergences between Gaussians at each timestep — each of which has a closed-form expression. This decomposition is the mathematical foundation of DDPM training (Article 02).
The DDPM training objective — predicting the noise ε added at each step — is a simplified reweighted version of the ELBO. Score matching objectives can also be derived from the ELBO in continuous time. Even flow matching has connections to variational inference. Understanding the ELBO is understanding the mathematical bedrock of the entire field.
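The bound and its gap can be checked numerically in a toy model where everything is tractable. The sketch below assumes a hypothetical model with p(z) = 𝒩(0, 1) and p(x|z) = 𝒩(z, 1), for which the evidence is p(x) = 𝒩(0, 2) and the true posterior is 𝒩(x/2, 1/2):

```python
import numpy as np

x = 1.0  # a single observation

def log_normal(v, mean, var):
    return -0.5 * np.log(2 * np.pi * var) - (v - mean) ** 2 / (2 * var)

# Exact evidence: p(x) = N(0, 2) by Gaussian conjugacy.
log_evidence = log_normal(x, 0.0, 2.0)

def elbo(m, s2):
    # ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z)) for Gaussian q = N(m, s2).
    # E_q[log p(x|z)] with p(x|z) = N(z, 1): uses E_q[(x - z)^2] = (x-m)^2 + s2.
    recon = -0.5 * np.log(2 * np.pi) - ((x - m) ** 2 + s2) / 2.0
    # KL(N(m, s2) || N(0, 1)) in closed form.
    kl = 0.5 * (s2 + m**2 - 1.0 - np.log(s2))
    return recon - kl

# A mediocre q leaves a gap; the true posterior N(x/2, 1/2) closes it exactly.
print(log_evidence)        # the evidence
print(elbo(0.0, 1.0))      # strictly below log p(x)
print(elbo(x / 2, 0.5))    # equals log p(x): the bound is tight
```

The gap between `log_evidence` and `elbo(0.0, 1.0)` is exactly D_KL(q ‖ p(z|x)) for that suboptimal q — the same identity stated above.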
Interactive figure: adjust the quality of the approximate posterior q(z|x). As q improves, the KL gap shrinks and the ELBO tightens toward log p(x).
The Landscape of Generative Models
Before diving deep into diffusion, it helps to see the full landscape of generative approaches. Each family makes different tradeoffs between sample quality, training stability, likelihood evaluation, and sampling speed.
| Family | Density | Training | Strengths | Weaknesses |
|---|---|---|---|---|
| Autoregressive | Exact (factorized) | MLE (teacher forcing) | Exact likelihood, simple training | Sequential sampling (slow) |
| VAE | Lower bound (ELBO) | ELBO maximization | Fast sampling, latent space | Blurry samples, posterior collapse |
| Normalizing Flows | Exact (change of vars) | MLE | Exact likelihood, invertible | Architecture constraints, limited expressiveness |
| GAN | Implicit | Adversarial (min-max) | Sharp samples, fast generation | Mode collapse, training instability |
| Energy-Based | Unnormalized | Contrastive / score matching | Flexible, composable | Intractable Z, slow MCMC sampling |
| Diffusion / Score | ELBO / implicit | Denoising objective | Best quality, stable training, mode coverage | Slow sampling (many steps) |
| Flow Matching | Implicit (via ODE) | Velocity regression | Simple loss, straight paths, fast | Newer, less understood theoretically |
Autoregressive models like GPT and PixelCNN factor the joint distribution as a product of conditionals: p(x) = ∏_i p(x_i | x_{<i}). This yields exact, tractable log-likelihoods and simple teacher-forced training. The cost is sequential sampling — each dimension must be generated one at a time, which is manageable for text but slow for images.
VAEs learn an encoder-decoder pair optimized via the ELBO. Sampling is fast (decode a random latent), but the variational bound introduces slack, and the Gaussian decoder assumption leads to blurry samples. The latent space, however, is useful for interpolation and manipulation.
Normalizing flows define an invertible mapping from a simple distribution (Gaussian) to the data distribution. The change-of-variables formula gives exact likelihoods. But the invertibility constraint limits architectural choices — you can't use arbitrary neural networks.
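To make the change-of-variables formula concrete, here is a minimal 1D "flow" — a fixed invertible map rather than a neural network, purely for illustration:

```python
import numpy as np

# Push a standard normal base variable x through the invertible map y = exp(x).
# The change-of-variables formula gives the exact density of y:
#   log p_y(y) = log p_x(log y) + log |d(log y)/dy| = log p_x(log y) - log y
def base_logpdf(x):
    return -0.5 * np.log(2 * np.pi) - 0.5 * x**2

def flow_logpdf(y):
    return base_logpdf(np.log(y)) - np.log(y)

# The resulting density (the log-normal distribution) is exact and normalized:
# it integrates to 1 — a true likelihood, no bound required.
y = np.linspace(1e-4, 50.0, 500_001)
dy = y[1] - y[0]
total = np.sum(np.exp(flow_logpdf(y))) * dy
print(total)  # ≈ 1
```

A real flow stacks many such invertible layers with learnable parameters; the invertibility requirement is exactly the architectural constraint mentioned above.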
GANs use a generator-discriminator game. The generator produces spectacularly sharp samples, but training is notoriously unstable (mode collapse, oscillation, failure to converge), and there is no likelihood for evaluation or comparison. GANs dominated image generation from 2014 to 2020.
Diffusion models and flow matching combine the best of several worlds: training stability comparable to likelihood-based methods, sample quality that exceeds GANs, full mode coverage without collapse, and (with modern samplers) competitive generation speed. This combination explains their rapid dominance across image, video, audio, and scientific applications.
Interactive figure: hover over each model family to see details. X-axis: sample quality. Y-axis: sampling speed.
Why Noise? The Core Insight Behind Diffusion
Of all the approaches in the landscape above, why has diffusion emerged as the dominant paradigm? The answer lies in an insight so simple it's almost philosophical: destruction is easy; creation is hard; but if you learn to reverse destruction, you get creation for free.
Consider the problem directly: given a random vector sampled from 𝒩(0, I), produce
a photorealistic image. This is a mapping from a simple distribution to an extraordinarily complex one.
Learning this mapping in one shot — as a GAN generator tries to do — requires the network to perform a
massive, discontinuous transformation. Small changes in the input noise can produce wildly different
outputs. The optimization landscape is treacherous.
Now consider the inverse problem: given a photorealistic image, add a tiny amount of Gaussian noise. This is trivial — literally one line of code. The key insight is that if we do this gradually, in many small steps, the data is smoothly transformed into pure noise. And crucially, the reverse of each tiny step is also a tiny step — a small, learnable denoising operation.
The forward process (data → noise) requires no learning. It's a fixed sequence of noise additions. The reverse process (noise → data) is what we learn — but each step is a modest denoising task, not a wild generative leap. A neural network that can remove a small amount of noise from a slightly corrupted image is much easier to train than one that must generate an entire image from scratch.
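The one-line noising step can be sketched as follows — a variance-preserving toy with a constant, hypothetical β (real schedules vary β over time, as Article 02 covers):

```python
import numpy as np

rng = np.random.default_rng(0)

beta, T = 0.02, 500
x = np.full(10_000, 5.0)  # "data": a batch of identical scalars

# The forward process: each step shrinks the signal slightly and adds a
# small amount of Gaussian noise. No learning anywhere.
for t in range(T):
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.shape)

# After many small steps the signal is scaled by (1 - beta)^(T/2) ≈ 0.007,
# and the batch is statistically indistinguishable from N(0, 1).
print(x.mean(), x.std())  # ≈ 0, ≈ 1
```

Each individual step barely changes x — which is precisely why its reversal is a "modest denoising task" that a network can learn, rather than a single wild leap from noise to data.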
The name "diffusion" comes from physics. Adding noise to data is like heat diffusion — a crystal (structured data) gradually dissolves into thermal equilibrium (Gaussian noise). The second law of thermodynamics says this happens spontaneously. But if you know the exact microscopic dynamics, you can run the process backward — reconstructing the crystal from the thermal bath. This is precisely what diffusion models do, with a neural network standing in for knowledge of the microscopic dynamics.
There's a second, more technical reason noise is so powerful: it smooths the data distribution. The true data distribution is concentrated on a thin manifold — a spiky, discontinuous mess in high dimensions. Adding noise inflates this manifold into a full-dimensional cloud, filling in the gaps and making the distribution smooth and well-behaved. A smooth distribution has well-defined gradients (scores) everywhere, which makes it much easier to learn and sample from.
At high noise levels, the distribution is nearly Gaussian — simple and easy to model. At low noise levels, it's close to the data — complex but locally smooth. By learning to denoise at every scale from pure noise down to pristine data, the model builds a complete, multi-scale understanding of the data distribution.
From Theory to Architecture
We now have the mathematical vocabulary to understand diffusion models and flow matching. We know what generative models aim to do (approximate p_data), how to measure success (KL divergence, likelihood), how to handle latent variables (the ELBO), and why noise is the secret ingredient (smooth gradients, multi-scale learning, easy destruction paired with learnable reconstruction).
The next seven articles build the full stack:
- Article 02: DDPMs — the discrete forward process, reverse denoising, the noise prediction objective, and training
- Article 03: Score functions and Langevin dynamics — the gradient perspective on diffusion
- Article 04: SDEs — the continuous-time unification of DDPMs and score models
- Article 05: Flow matching — learning velocity fields instead of scores, with straighter paths
- Article 06: Architectures — U-Nets, DiTs, classifier-free guidance, latent diffusion
- Article 07: Fast sampling — DDIM, DPM-Solver, consistency models, and one-step generation
- Article 08: Applications — from text-to-image to protein design and beyond
Each concept builds on the previous one. The math gets richer, the models get more powerful, and the applications get more extraordinary. Let's begin with the model that started the revolution: the Denoising Diffusion Probabilistic Model.
References
Seminal papers and key works referenced in this article.