The mathematical bridge from flow models to diffusion models: scores, Fokker-Planck, and DDPM.
In Chapter 3, we trained a flow model (ODE) using flow matching. This gives us deterministic sampling: same noise → same output. But diffusion models (SDEs) add noise during sampling, which can improve sample quality and enable new capabilities like guidance. To use SDEs for sampling, we need a new mathematical object: the score function.
The score function ∇ log pt(x) tells you the direction of steepest ascent in log-probability. It points toward regions of higher density. If pt(x) is the current marginal distribution of noisy data at time t, the score tells you: "which direction should x move to become more likely under pt?"
Here is the remarkable connection: for Gaussian probability paths, the score function and the vector field are linear reparameterizations of each other. Once you have learned one, you have the other for free. This means a flow model trained with flow matching can be used for SDE sampling without any additional training.
Let q(x) be any probability distribution with density q(x) > 0. The score function is defined as:
This is the gradient of the log-probability with respect to x (not with respect to any parameters). It is a vector field: at every point x, it gives a direction in Rd.
Worked example — 1D Gaussian. Let q(x) = N(x; μ, σ2). The density is q(x) = (1/√(2πσ2)) exp(−(x−μ)2/(2σ2)).
The log-density is log q(x) = −(x−μ)2/(2σ2) + const.
The score is:
Numerical example. μ = 3, σ = 1. At x = 5: score = −(5−3)/1 = −2 (points left, toward the mean). At x = 1: score = −(1−3)/1 = +2 (points right, toward the mean). At x = 3: score = 0 (at the peak, no gradient).
The score always points toward the mean for a Gaussian. Further from the mean = larger score magnitude = stronger pull back.
Worked example — 2D Gaussian mixture. Let q(x) = 0.5 N(x; μ1, σ2I) + 0.5 N(x; μ2, σ2I) with μ1 = (−2, 0), μ2 = (2, 0), σ = 0.7. The score is:
where wi = N(x;μi,σ2I)/q(x) are the posterior weights (how likely x came from component i).
At x = (0, 0) (between the two modes): w1 = w2 = 0.5 by symmetry. The two score contributions cancel out, giving ∇ log q(0,0) ≈ (0, 0). The midpoint is a saddle — the score is zero there.
At x = (3, 0) (near μ2): w2 ≫ w1, so the score is dominated by −(x−μ2)/σ2 = −(1, 0)/0.49 ≈ (−2.04, 0). It points back toward μ2.
This illustrates a crucial property: the score of a mixture is a posterior-weighted average of the component scores. This is analogous to how the marginal vector field is a posterior-weighted average of conditional vector fields.
The heatmap shows the density q(x). Arrows show the score ∇ log q(x) — the direction of steepest ascent. Notice how arrows point toward the peaks of the distribution.
In the context of generative modeling, we apply the score function to the marginal probability path pt(x). The marginal score is ∇ log pt(x), and the conditional score (for a specific data point z) is ∇ log pt(x|z). Like vector fields, the marginal score can be expressed via conditional scores:
For the Gaussian probability path pt(x|z) = N(αtz, βt2I), the conditional score has a simple closed form:
Derivation. The Gaussian density is pt(x|z) ∝ exp(−||x − αtz||2 / (2βt2)).
Taking log: log pt(x|z) = −||x − αtz||2 / (2βt2) + const.
Taking gradient with respect to x: ∇x log pt(x|z) = −(x − αtz) / βt2. Let's verify this for d = 1:
For d > 1, the gradient with respect to the vector x gives the same formula component-wise, yielding a vector in Rd.
Since x = αtz + βtε (where ε ~ N(0,I)), we can substitute x − αtz = βtε:
Score magnitude varies with noise level. The score is −ε/βt. For fixed ε:
| t (CondOT) | βt = 1−t | Score = −ε/βt (for ε=1) | Interpretation |
|---|---|---|---|
| 0.1 (very noisy) | 0.9 | −1.11 | Small correction — lots of noise, uncertain |
| 0.5 (medium) | 0.5 | −2.0 | Moderate correction |
| 0.9 (nearly clean) | 0.1 | −10.0 | Large correction — precise denoising |
| 0.99 | 0.01 | −100.0 | Extremely large — numerical instability! |
This explosion as βt → 0 is why directly training a score network can be numerically problematic. The DDPM reparameterization (Chapter 8) addresses this by predicting ε instead of −ε/βt.
Geometric interpretation in 2D. Consider a 2D Gaussian pt(x|z) = N(αtz, βt2I). The score at any point x is a vector pointing from x toward the Gaussian center αtz, with magnitude inversely proportional to βt2. The level curves of the density are circles (because the covariance is isotropic), and the score vectors are perpendicular to these circles, pointing inward.
For a mixture of Gaussians (the marginal distribution), the score at each point is a weighted average of the scores from each component. Near the center of one component, that component's score dominates. Between two components, the scores compete — the score may be small or point in an ambiguous direction. This "indecisive" region between modes is where the neural network must make the most nuanced predictions.
python # Computing the conditional score for a Gaussian path def conditional_score(x, z, alpha_t, beta_t): """Score of p_t(x|z) = N(alpha_t * z, beta_t^2 * I).""" return -(x - alpha_t * z) / (beta_t ** 2) # Example z = torch.tensor([3.0, -1.0]) x = torch.tensor([2.5, 0.0]) alpha, beta = 0.7, 0.3 score = conditional_score(x, z, alpha, beta) # score = -(x - 0.7*z) / 0.09 # = -([2.5, 0] - [2.1, -0.7]) / 0.09 # = -[0.4, 0.7] / 0.09 = [-4.44, -7.78] # Points from x toward alpha*z = [2.1, -0.7]
Worked example for CondOT (αt=t, βt=1−t). Let z = 5, ε = −1.2, t = 0.6.
Then x = 0.6 · 5 + 0.4 · (−1.2) = 3.0 − 0.48 = 2.52.
Score: −ε/βt = −(−1.2)/0.4 = 3.0. The score points in the positive direction, toward the clean data z = 5.
Vector field: ut(x|z) = z − ε = 5 − (−1.2) = 6.2.
These are different numbers but encode the same information — they are linear reparameterizations of each other.
For Gaussian probability paths, the conditional vector field and the conditional score are both linear functions of x and z. This means we can convert between them algebraically.
Derivation for CondOT (αt=t, βt=1−t).
α̇t = 1, β̇t = −1. Then:
at = (1−t)2 · (1/t) − (−1)(1−t) = (1−t)2/t + (1−t) = (1−t)[(1−t)/t + 1] = (1−t)/t
bt = 1/t
Let's verify: ut(x|z) = at ∇ log pt(x|z) + bt x = ((1−t)/t)(−ε/(1−t)) + x/t = −ε/t + x/t = (x−ε(1−t)−... wait, let me use the direct formula.
Actually, the cleanest way to see this: the vector field ut(x|z) = z − ε. The score ∇ log pt(x|z) = −ε/(1−t). Both encode the same underlying information (the noise ε and data z), just packaged differently.
This also means we can define a denoiser Dt(x) that predicts the clean data from noisy data:
The denoiser, score, and vector field are all equivalent representations. Learning any one of them is sufficient.
Worked example — converting between representations. For CondOT (αt = t, βt = 1−t) with z = 5, ε = −1, t = 0.6:
x = 0.6(5) + 0.4(−1) = 2.6
Velocity: u = z − ε = 5 − (−1) = 6.0
Score: s = −ε/β = −(−1)/0.4 = 2.5
Noise prediction: ε̂ = ε = −1.0
Denoiser: D = z = 5.0
All four numbers encode the same information. Given any one, you can compute the others:
(The exact conversion uses the at, bt coefficients from Proposition 1.)
python # All four parameterizations, interconverted # For CondOT: alpha_t = t, beta_t = 1 - t def velocity_to_eps(v, z, eps): """v = z - eps => eps = z - v""" return z - v # only if you know z def score_to_eps(s, beta_t): """s = -eps/beta => eps = -beta * s""" return -beta_t * s def eps_to_z(eps_pred, x, alpha_t, beta_t): """x = alpha*z + beta*eps => z = (x - beta*eps) / alpha""" return (x - beta_t * eps_pred) / alpha_t
| Parameterization | Network predicts | Target |
|---|---|---|
| Velocity (flow matching) | utθ(x) | z − ε (CondOT) |
| Score | stθ(x) | −ε/βt |
| Noise prediction | εtθ(x) | ε |
| Denoiser | Dtθ(x) | E[z|x] |
We trained a flow model using flow matching. We can sample via ODE. But can we also sample via SDE (adding noise during sampling)? Yes — and it is free.
The SDE has three terms:
1. uttarget(Xt) dt — the ODE drift (moves toward data, same as flow model).
2. (σt2/2) ∇ log pt(Xt) dt — the score correction (compensates for the noise being added).
3. σt dWt — the Brownian motion noise.
For Gaussian paths, we can express everything in terms of the learned vector field (using Proposition 1):
Same trained model, different σ. At σ=0 you get ODE (straight paths). Increase σ to see SDE sampling (noisy paths). Both reach the same target distribution.
Worked example — SDE sampling step by step. Consider a trained model with the CondOT path. At step i with t = 0.3, Xt = (1.2, −0.5), σ = 0.5, h = 0.02:
1. Network predicts velocity: uθ(Xt, 0.3) = (2.1, 1.3).
2. Convert to score using Proposition 1: at = 0.7/0.3 = 2.33, bt = 1/0.3 = 3.33.
score = (u − btx) / at = ((2.1, 1.3) − 3.33(1.2, −0.5)) / 2.33 = ((2.1−4.0, 1.3+1.67)) / 2.33 = (−0.82, 1.27).
3. Compute SDE drift: u + (σ2/2) · score = (2.1, 1.3) + 0.125(−0.82, 1.27) = (2.0, 1.46).
4. Sample noise: ε = (0.3, −0.8).
5. Update: Xt+h = Xt + h · drift + σ√h · ε = (1.2, −0.5) + 0.02(2.0, 1.46) + 0.5√0.02(0.3, −0.8) = (1.2+0.04+0.021, −0.5+0.029−0.057) = (1.261, −0.528).
python # SDE sampling with a trained flow matching model def sde_sample(u_theta, sigma_fn, n_steps=50, d=2): x = torch.randn(d) h = 1.0 / n_steps for i in range(n_steps): t = i * h sigma = sigma_fn(t) # Convert velocity to score using Prop 1 v = u_theta(x, t) a_t = (1-t) / max(t, 1e-5) # CondOT b_t = 1.0 / max(t, 1e-5) score = (v - b_t * x) / a_t # SDE step drift = v + 0.5 * sigma**2 * score noise = sigma * (h**0.5) * torch.randn_like(x) x = x + h * drift + noise return x
How do we prove the SDE Extension Trick? The key mathematical tool is the Fokker-Planck equation, which tells us what distribution an SDE produces over time.
Recall the continuity equation for ODEs: if dX/dt = ut(Xt) and Xt ~ pt, then:
The Fokker-Planck equation extends this to SDEs. For the SDE dXt = ut(Xt) dt + σt dWt:
where Δ is the Laplacian operator: Δf = ∑i ∂2f/∂xi2.
Proof sketch of the SDE Extension Trick (Theorem 17). We need to show the SDE with drift uttarget + (σ2/2)∇ log pt satisfies Fokker-Planck for pt:
Start with the known continuity equation: ∂t pt = −div(pt uttarget).
Add and subtract (σ2/2)Δpt:
Use the identity Δpt = div(∇pt) = div(pt ∇ log pt):
This is exactly the Fokker-Planck equation for the SDE with drift uttarget + (σ2/2)∇ log pt and diffusion σt. QED.
Why does this identity work? Let's verify it step by step. We know ∇ log p = ∇p / p. Therefore:
The p's cancel. This is why the Fokker-Planck diffusion term (σ2/2)Δp can be rewritten as div(p · (σ2/2)∇ log p) — and that is exactly the divergence of the SDE's score-correction drift term times the density. The proof works because adding a drift of (σ2/2)∇ log p exactly compensates for the Brownian diffusion.
Worked example — Fokker-Planck for OU process. The Ornstein-Uhlenbeck process has u(x) = −θx and constant σ. The stationary distribution satisfies ∂tp = 0:
The solution is p(x) = N(0, σ2/(2θ)). You can verify this by substituting the Gaussian density and checking that both terms cancel. This confirms what we stated in Chapter 2: the OU process converges to a Gaussian with variance σ2/(2θ).
A beautiful special case of the SDE Extension Trick arises when the probability path is constant: pt = p for all t. Setting uttarget = 0, the SDE becomes:
This is Langevin dynamics. It has a remarkable property: if you run it long enough, the distribution of Xt converges to p, regardless of where you started. It is a way to sample from any distribution using only its score function.
Worked example. Target p(x) = mixture of Gaussians at −3 and +3 (equal weight, σ = 0.5). Start at X0 = 0, step size h = 0.01.
Near x = 0, the score is small (between two modes). Near x = 3, the score points toward x = 3. The particle bounces around, eventually spending time near both modes proportional to their probability.
50 particles evolve under Langevin dynamics toward the target distribution (5-mode Gaussian mixture). Watch them converge from random initialization to the target.
Langevin dynamics is foundational in molecular dynamics (simulating protein folding), Bayesian statistics (sampling from posterior distributions), and generative modeling (the Ornstein-Uhlenbeck process is Langevin dynamics for a Gaussian target).
Worked example — Langevin for a Gaussian. Target: p = N(3, 1). Score: ∇ log p(x) = −(x − 3). Step size h = 0.01. Starting at X0 = −2:
| Step | Xt | Score = −(X−3) | Drift = h/2 · score | ε | Noise = √h · ε |
|---|---|---|---|---|---|
| 1 | −2.000 | 5.000 | 0.025 | 0.3 | 0.030 |
| 2 | −1.945 | 4.945 | 0.025 | −1.2 | −0.120 |
| 3 | −2.040 | 5.040 | 0.025 | 0.8 | 0.080 |
Over hundreds of steps, the particle drifts toward x = 3 (the mean) and then fluctuates around it. After convergence, the distribution of Xt matches N(3, 1).
python # Langevin dynamics sampler def langevin_sample(score_fn, n_steps=1000, h=0.01, d=2): """Sample from distribution using only its score function.""" x = torch.randn(d) # arbitrary initialization for _ in range(n_steps): score = score_fn(x) x = x + (h / 2) * score + (h ** 0.5) * torch.randn_like(x) return x # approximately distributed as p
Why Langevin dynamics converges. The drift (σ2/2)∇ log p(x) points toward high-density regions. The noise σ dWt provides exploration. The balance between the two ensures that:
1. Particles are attracted to modes (high-density peaks) by the drift.
2. Particles can escape local modes and explore other modes via the noise.
3. The stationary distribution is exactly p, because the Fokker-Planck equation with drift = (σ2/2)∇ log p and diffusion = σ has stationary solution p.
The convergence rate depends on σ and the structure of p. For multimodal distributions, larger σ helps explore between modes but increases the total number of steps needed. There is a "Goldilocks" σ: enough noise to explore, not so much that convergence is slow.
Applications beyond generative modeling:
| Field | How Langevin is used |
|---|---|
| Bayesian statistics | Sample from posterior distributions for uncertainty quantification |
| Molecular dynamics | Simulate protein folding at finite temperature |
| MCMC methods | Metropolis-Adjusted Langevin Algorithm (MALA) |
| Optimization | Stochastic gradient Langevin dynamics (SGLD) for escaping local minima |
Instead of training a vector field utθ via flow matching, we can directly train a score network stθ(x) to approximate the marginal score ∇ log pt(x). The score matching loss mirrors the flow matching loss exactly:
This is intractable (we cannot compute ∇ log pt(x)). But the denoising score matching loss is tractable:
And by the same proof as Theorem 12 (since the formula for ∇ log pt via conditionals has the same structure as for uttarget):
For the Gaussian path, the conditional score is ∇ log pt(x|z) = −ε/βt, so:
Worked example — denoising score matching step. Data z = (4.0, 2.0), αt = 0.7, βt = 0.3, noise ε = (0.5, −1.2).
Noisy input: x = 0.7(4.0, 2.0) + 0.3(0.5, −1.2) = (2.8, 1.4) + (0.15, −0.36) = (2.95, 1.04).
Target score: −ε/βt = −(0.5, −1.2)/0.3 = (−1.67, 4.0).
If the network outputs stθ = (−1.5, 3.5), the loss is ||(−1.5+1.67, 3.5−4.0)||2 = ||(0.17, −0.5)||2 = 0.029 + 0.25 = 0.279.
Why "denoising"? The conditional score ∇ log pt(x|z) = −(x − αtz)/βt2 points from x back toward αtz, the noiseless version of the data. Following the score = undoing the noise = denoising. When we train the network to predict this score, we are training a denoiser. This is exactly what happens inside every diffusion model: the neural network is fundamentally a denoiser.
The proof follows the same structure as Theorem 12. The conditional and marginal score functions are related by the same posterior-weighted integral as the vector fields (Equation 38 in the book). Therefore, the same "expand ||a−b||2, swap integrals" proof applies. The key identity is:
for any function f. This follows from the posterior-weighted representation of the marginal score.
Comparing the three training losses side by side for CondOT.
python # All three losses for the CondOT path, side by side z = sample_data(B) # [B, d] t = torch.rand(B, 1) # [B, 1] eps = torch.randn_like(z) # [B, d] x = t * z + (1 - t) * eps # noisy sample # Flow Matching (velocity prediction) v_target = z - eps L_cfm = (v_model(x, t) - v_target).pow(2).mean() # Score Matching (score prediction) s_target = -eps / (1 - t) L_csm = (s_model(x, t) - s_target).pow(2).mean() # DDPM (noise prediction) e_target = eps L_ddpm = (e_model(x, t) - e_target).pow(2).mean() # All three optimize the same objective (up to constants)! # At convergence: v_model ≈ z-eps, s_model ≈ -eps/(1-t), e_model ≈ eps
The "denoising" interpretation unifies everything. In all three parameterizations, the network sees a noisy version of the data x = αtz + βtε and must recover information about the clean data z and/or the noise ε. This is fundamentally a denoising task at different noise levels. The different parameterizations just package the denoising information differently:
| What the network predicts | How it relates to denoising |
|---|---|
| Velocity z − ε | The direction and speed from noise to clean data |
| Score −ε/βt | The direction to increase likelihood (undo noise) |
| Noise ε | The exact noise that was added (subtract to denoise) |
| Clean data z | The denoised output directly |
Which loss should you use in practice? Here is a practical decision tree:
1. If using CondOT path (α = t, β = 1−t): velocity prediction (flow matching). The target z − ε is well-conditioned at all t values.
2. If using VP noise schedule (DDPM-style): noise prediction (ε-prediction). The DDPM loss avoids the 1/β2 explosion of score matching.
3. If you need SDE sampling with guidance: train with any loss, then convert to score at inference using Proposition 1.
4. If you need very few sampling steps: consider v-prediction (a parameterization related to velocity prediction) combined with progressive distillation.
In modern practice, most new models use flow matching (velocity prediction) with the CondOT path because it produces the straightest trajectories and the most well-conditioned loss. The era of DDPM-style noise prediction, while historically important, is gradually giving way to flow matching.
python # Decision tree in code if noise_schedule == 'condot': # Use flow matching (velocity prediction) x = t * z + (1 - t) * eps target = z - eps loss = mse(model(x, t), target) elif noise_schedule == 'vp': # Use DDPM (noise prediction) alpha_bar = cosine_schedule(t) x = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps loss = mse(model(x, t), eps) # Both are equivalent mathematically, but CondOT is preferred
Common architectures for score/noise prediction. Historically, diffusion models used U-Net architectures (like the one in the original DDPM paper). Modern models use Transformers (DiT). The architecture choice is independent of the training objective — you can use velocity, score, or noise prediction with any architecture.
Multi-scale denoising. The neural network must handle a wide range of noise levels. At low noise (high t), the task is easy — the input already looks like data. At high noise (low t), the task is hard — the input is almost pure noise. To handle this range, the network typically uses:
1. Sinusoidal time embeddings at multiple frequencies, so the network can precisely distinguish between close noise levels.
2. Multi-scale feature processing (in U-Nets: encoder-decoder with skip connections; in DiTs: attention at multiple patch sizes).
3. Adaptive normalization that modulates features based on the noise level.
The denoiser perspective. There is a fourth parameterization worth mentioning: the denoiser Dt(x) = E[z | x at time t]. The denoiser directly predicts the clean data from the noisy observation. For the CondOT path:
The denoiser is useful for progressive denoising: at each step, peek at what the "clean" image would look like, then continue. This is used in visualization tools that show intermediate denoising steps.
Why does the denoiser produce blurry images at low t? At low t (high noise), many different clean images z could have produced the same noisy observation x. The denoiser returns the average over all plausible z values, which is blurry. As t increases, fewer z values are plausible, and the denoiser output sharpens. At t = 1, only one z is plausible, and the output is sharp. This "blurry to sharp" progression is visible in diffusion model sampling visualizations.
The score as a Bayesian posterior mean. The marginal score ∇ log pt(x) can be interpreted as a Bayesian quantity: it is the gradient of the log-posterior averaged over all plausible data points. In regions where many data points contribute (low t, high noise), the score points toward the center of mass of plausible data. In regions where one data point dominates (high t, low noise), the score points precisely toward that data point. This Bayesian interpretation connects score matching to classical statistics and provides intuition for why the method works.
Looking ahead. Chapters 5-7 of the book build on the foundation we have established:
Chapter 5 (Guidance): Uses score functions to condition generation on text prompts via classifier-free guidance. This is what makes text-to-image models work.
Chapter 6 (Architecture): Discusses U-Nets, DiTs, and latent space design. How to build real image and video generators at scale.
Chapter 7 (Discrete Diffusion): Extends the principles to discrete data (text), enabling diffusion-based language models.
Recommended further reading on score matching:
| Paper | Authors | Key contribution |
|---|---|---|
| Generative Modeling by Estimating Gradients of Data Distribution | Song & Ermon, 2019 | NCSN: multi-scale score matching |
| Denoising Diffusion Probabilistic Models | Ho, Jain & Abbeel, 2020 | DDPM: practical noise prediction training |
| Score-Based Generative Modeling through SDEs | Song et al., 2020 | Unified SDE framework |
| Classifier-Free Diffusion Guidance | Ho & Salimans, 2022 | CFG: the key to text-to-image |
| Progressive Distillation for Fast Sampling | Salimans & Ho, 2022 | Reducing steps via distillation |
The full training and sampling pipeline in one place:
Key equations to remember from this chapter:
These equations, combined with those from Chapter 3, form the complete mathematical toolkit for modern generative modeling. The score function is the bridge between flow models (ODE) and diffusion models (SDE). Once you understand both perspectives, you have mastery over the entire framework.
Final thought from the book: "Creating noise from data is easy; creating data from noise is generative modeling." — Song et al. We have now seen exactly how this is done: define a probability path from noise to data, derive the vector field (or score) that follows this path, train a neural network to approximate it, and simulate the ODE/SDE at inference. The mathematical elegance of this framework — where the intractable marginal loss equals the tractable conditional loss, where scores and velocities are interchangeable, where SDEs can be added for free — is what makes it both practical and beautiful.
python # Denoising Score Matching training step def dsm_loss(score_net, z_batch, alpha_fn, beta_fn): B, d = z_batch.shape t = torch.rand(B, 1) alpha = alpha_fn(t) # [B, 1] beta = beta_fn(t) # [B, 1] eps = torch.randn_like(z_batch) # [B, d] x = alpha * z_batch + beta * eps # noisy sample target_score = -eps / beta # -eps/beta_t pred = score_net(x, t) # s_theta(x, t) return (pred - target_score).pow(2).mean()
The denoising score matching loss has a numerical problem: when βt ≈ 0 (near t = 1), the target −ε/βt explodes. The seminal DDPM paper (Ho et al., 2020) proposed a fix: reparameterize the score network as a noise predictor.
Define εtθ(x) = −βt stθ(x), so the network predicts the noise ε directly instead of the score. The loss becomes:
Worked example — one DDPM training step.
Data: z = [3.0, −1.0] (a 2D point).
Noise schedule: αt = t, βt = 1−t.
Sample t = 0.4, ε = [0.8, −0.5].
Noisy input: x = 0.4[3, −1] + 0.6[0.8, −0.5] = [1.2, −0.4] + [0.48, −0.3] = [1.68, −0.7].
Target: ε = [0.8, −0.5].
Network predicts: ε0.4θ([1.68, −0.7]) = [0.6, −0.3] (hypothetical).
Loss: ||(0.6−0.8, −0.3−(−0.5))||2 = ||(−0.2, 0.2)||2 = 0.04 + 0.04 = 0.08.
Why DDPM drops the 1/βt2 weight. The score matching loss has an implicit weight 1/βt2 which blows up as βt → 0 (near t = 1, when the data is nearly clean). The DDPM loss εtθ absorbs this by predicting −βt stθ = ε. This reparameterization is more numerically stable and gives equal weight to all noise levels. In practice, additional weighting schemes (like the "v-prediction" or "min-SNR" weighting) can further improve training stability.
From noise prediction to denoising. If the network predicts ε (the noise added), we can recover the clean data estimate:
This is the denoiser — the expected clean data given the noisy observation. For the CondOT path: ẑ = (x − (1−t) εtθ(x)) / t. At t close to 1 (nearly clean), this is approximately ẑ ≈ x. At t close to 0 (very noisy), the network must do heavy lifting to predict z from mostly noise.
python # Converting between parameterizations def eps_to_velocity(eps_pred, z_pred, alpha_dot, beta_dot): """Convert noise prediction to velocity prediction.""" return alpha_dot * z_pred + beta_dot * eps_pred def eps_to_score(eps_pred, beta): """Convert noise prediction to score.""" return -eps_pred / beta def eps_to_denoiser(eps_pred, x, alpha, beta): """Convert noise prediction to denoised data.""" return (x - beta * eps_pred) / alpha
Visualize the DDPM training process. At each time t, the network sees noisy data and must predict the noise that was added. Drag the slider to see different noise levels.
python # Algorithm 4: DDPM Training (Noise Prediction) def ddpm_loss(eps_net, z_batch, alpha_fn, beta_fn): B, d = z_batch.shape t = torch.rand(B, 1) eps = torch.randn_like(z_batch) x = alpha_fn(t) * z_batch + beta_fn(t) * eps pred_eps = eps_net(x, t) return (pred_eps - eps).pow(2).mean()
This chapter unified flow matching and score matching into a single framework. Here is the complete picture:
| Topic | Key Result |
|---|---|
| Score function | ∇ log pt(x) = direction of steepest ascent in log-probability |
| Gaussian score | ∇ log pt(x|z) = −ε/βt (proportional to the noise) |
| Score-velocity conversion | ut = at ∇ log pt + bt x (linear reparameterization) |
| SDE Extension Trick | Add (σ2/2)∇ log pt dt + σ dW to ODE to get SDE |
| Fokker-Planck equation | ∂tp = −div(pu) + (σ2/2)Δp |
| Langevin dynamics | Sample from p using only ∇ log p |
| Denoising score matching | ||sθ(x) + ε/βt||2 (equivalent to score matching) |
| DDPM loss | ||εθ(x) − ε||2 (reparameterized for stability) |
Historical perspective. The ideas in this chapter evolved over several years:
| Year | Paper | Key Contribution |
|---|---|---|
| 2019 | Song & Ermon (NCSN) | Score matching with multiple noise levels |
| 2020 | Ho et al. (DDPM) | Noise prediction loss, high-quality images |
| 2020 | Song et al. (Score SDE) | Unified ODE/SDE framework, continuous time |
| 2022 | Lipman et al. (Flow Matching) | CondOT path, simulation-free training |
| 2023 | Albergo & Vanden-Eijnden | Stochastic interpolants (generalization) |
The key realization of this chapter is that DDPM (2020) and flow matching (2022), which looked like different algorithms, are actually the same algorithm in different coordinates. This unification, enabled by the score-velocity conversion (Proposition 1), is one of the most elegant results in modern generative modeling.
When to use each parameterization:
| Parameterization | Best For | Used By |
|---|---|---|
| Velocity (v-prediction) | CondOT paths, straight trajectories | Stable Diffusion 3, FLUX |
| Noise (ε-prediction) | VP noise schedules, DDPM-style | Original Stable Diffusion, DALL-E 2 |
| Score (s-prediction) | Theoretical analysis, SDE sampling | Score SDE papers |
| Denoiser (x0-prediction) | High-SNR regime, progressive distillation | Consistency models |
In practice, velocity prediction (flow matching) has become the dominant choice for new models because CondOT paths create the straightest trajectories, requiring the fewest sampling steps. Noise prediction remains popular in legacy codebases and for VP noise schedules.
Practical sampling recipe for a trained model. After training with any parameterization, the generation procedure is:
The score function enables guidance. One of the most important applications of score functions is classifier-free guidance (covered in Chapter 5 of the book). The key idea: at inference time, modify the score to amplify the conditioning signal:
where w > 1 is the guidance weight. Larger w = stronger adherence to the prompt y. This is why Stable Diffusion has a "CFG scale" slider: it controls w. This technique requires score functions (or their equivalent parameterizations), which is why understanding scores is essential even if you primarily use flow matching for training.
Worked example — guidance in 1D. Suppose p(x) is a Gaussian mixture with modes at −3 and +3 (unconditional). Conditioning on y = "right mode" gives p(x|y) = N(3, 1). At x = 1:
Unconditional score: ∇ log p(1) ≈ +0.5 (slight pull toward +3 mode, both modes have influence).
Conditional score: ∇ log p(1|y) = −(1−3)/1 = +2 (strong pull toward +3).
Guided score (w = 7): s = 0.5 + 7(2 − 0.5) = 0.5 + 10.5 = 11.0. Very strong pull toward +3. The guidance amplifies the conditioning signal by a factor of 7.
This is exactly what happens in Stable Diffusion when you increase the "CFG scale" slider: higher values produce images that more faithfully match the prompt but may look more extreme or less natural.
Training for guidance. To enable classifier-free guidance, the model is trained with random prompt dropout: with probability pdrop (typically 10%), the text prompt is replaced with an empty string during training. This teaches the model both conditional and unconditional generation. At inference, both scores are computed in a single forward pass with two inputs (with and without prompt).
python # Training with prompt dropout for CFG def train_step_with_cfg(model, z, prompt, p_drop=0.1): t = torch.rand(z.shape[0], 1) eps = torch.randn_like(z) x = t * z + (1 - t) * eps # Randomly drop prompts during training mask = torch.rand(z.shape[0]) > p_drop prompt_input = [p if m else "" for p, m in zip(prompt, mask)] target = z - eps return (model(x, t, prompt_input) - target).pow(2).mean()
SDE vs ODE sampling quality. For a perfectly trained model, ODE and SDE sampling produce exactly the same distribution (by the SDE Extension Trick). But for imperfectly trained models (which is always the case in practice), SDE sampling can sometimes produce higher quality because the added noise has a "corrective" effect — it averages over small errors in the learned vector field. This is analogous to how ensemble methods in machine learning are more robust than single models.
The optimal σ depends on the model quality and dataset. For well-trained models, σ = 0 (ODE) is often best because it avoids the additional discretization error from the noise. For weaker models or complex datasets, σ ≈ 0.3-1.0 can improve FID scores.
Stochastic vs. deterministic sampling comparison:
| Property | ODE (σ=0) | SDE (σ>0) |
|---|---|---|
| Deterministic? | Yes (same seed = same output) | No |
| Typical steps | 20-50 | 50-200 |
| Speed | Faster | Slower |
| Mode coverage | Good | Potentially better |
| Error correction | None (errors accumulate) | Noise helps average over errors |
| Guidance support | Limited | Full (classifier guidance) |
| Likelihood computation | Yes (via change of variables) | Not directly |
python # Classifier-free guidance (preview of Chapter 5) def guided_velocity(model, x, t, prompt, cfg_scale=7.5): """Classifier-free guidance for conditional generation.""" v_uncond = model(x, t, prompt=None) # unconditional v_cond = model(x, t, prompt=prompt) # conditional # Guided velocity: amplify the conditioning signal return v_uncond + cfg_scale * (v_cond - v_uncond)
python # The complete unified pipeline import torch # === TRAINING (same for all three parameterizations) === def unified_train_step(model, z_batch, param='velocity'): t = torch.rand(z_batch.shape[0], 1) eps = torch.randn_like(z_batch) x = t * z_batch + (1 - t) * eps if param == 'velocity': target = z_batch - eps # flow matching elif param == 'noise': target = eps # DDPM elif param == 'score': target = -eps / (1 - t) # denoising score matching return (model(x, t) - target).pow(2).mean() # === SAMPLING (ODE or SDE, after any training) === def unified_sample(model, sigma=0, n=50, d=2): x = torch.randn(1, d) h = 1.0 / n for i in range(n): t = i * h v = model(x, t) if sigma > 0: # Convert velocity to score for SDE term a_t = (1-t) / max(t, 1e-5) b_t = 1.0 / max(t, 1e-5) score = (v - b_t * x) / a_t drift = v + 0.5 * sigma**2 * score x = x + h * drift + sigma * h**0.5 * torch.randn_like(x) else: x = x + h * v return x