Holderrieth & Erives, Chapter 4

Score Functions and Score Matching

The mathematical bridge from flow models to diffusion models: scores, Fokker-Planck, and DDPM.

Prerequisites: Ch 3 (flow matching, conditional vector fields). That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: Why Scores?

In Chapter 3, we trained a flow model (ODE) using flow matching. This gives us deterministic sampling: same noise → same output. But diffusion models (SDEs) add noise during sampling, which can improve sample quality and enable new capabilities like guidance. To use SDEs for sampling, we need a new mathematical object: the score function.

The score function ∇ log pt(x) tells you the direction of steepest ascent in log-probability. It points toward regions of higher density. If pt(x) is the current marginal distribution of noisy data at time t, the score tells you: "which direction should x move to become more likely under pt?"

Here is the remarkable connection: for Gaussian probability paths, the score function and the vector field are linear reparameterizations of each other. Once you have learned one, you have the other for free. This means a flow model trained with flow matching can be used for SDE sampling without any additional training.

The big picture of this chapter:
1. Define score functions ∇ log pt(x)
2. Show how to convert between scores and vector fields (Proposition 1)
3. SDE Extension Trick: add noise to ODE sampling using scores (Theorem 17)
4. Denoising score matching: train a score network directly (the DDPM loss)
5. Langevin dynamics: sample from a distribution using only its score
Why do we need score functions if we already have flow matching?

Chapter 1: The Score Function

Let q(x) be any probability distribution with density q(x) > 0. The score function is defined as:

∇ log q(x) = ∇q(x) / q(x)

This is the gradient of the log-probability with respect to x (not with respect to any parameters). It is a vector field: at every point x, it gives a direction in Rd.

Intuition: The score ∇ log q(x) points toward regions of higher probability. If you are standing on a probability "hill," the score points uphill. If you follow the score, you climb toward the peak (the mode of q). The magnitude tells you how steep the climb is.

Worked example — 1D Gaussian. Let q(x) = N(x; μ, σ2). The density is q(x) = (1/√(2πσ2)) exp(−(x−μ)2/(2σ2)).

The log-density is log q(x) = −(x−μ)2/(2σ2) + const.

The score is:

∇ log q(x) = d/dx [−(x−μ)2/(2σ2)] = −(x−μ)/σ2

Numerical example. μ = 3, σ = 1. At x = 5: score = −(5−3)/1 = −2 (points left, toward the mean). At x = 1: score = −(1−3)/1 = +2 (points right, toward the mean). At x = 3: score = 0 (at the peak, no gradient).

The score always points toward the mean for a Gaussian. Further from the mean = larger score magnitude = stronger pull back.

Worked example — 2D Gaussian mixture. Let q(x) = 0.5 N(x; μ1, σ2I) + 0.5 N(x; μ2, σ2I) with μ1 = (−2, 0), μ2 = (2, 0), σ = 0.7. The score is:

∇ log q(x) = ∇q(x)/q(x) = [w1 · (−(x−μ1)/σ2) + w2 · (−(x−μ2)/σ2)]

where wi = N(x;μi2I)/q(x) are the posterior weights (how likely x came from component i).

At x = (0, 0) (between the two modes): w1 = w2 = 0.5 by symmetry. The two score contributions cancel out, giving ∇ log q(0,0) ≈ (0, 0). The midpoint is a saddle — the score is zero there.

At x = (3, 0) (near μ2): w2 ≫ w1, so the score is dominated by −(x−μ2)/σ2 = −(1, 0)/0.49 ≈ (−2.04, 0). It points back toward μ2.

This illustrates a crucial property: the score of a mixture is a posterior-weighted average of the component scores. This is analogous to how the marginal vector field is a posterior-weighted average of conditional vector fields.

Score Field Visualization

The heatmap shows the density q(x). Arrows show the score ∇ log q(x) — the direction of steepest ascent. Notice how arrows point toward the peaks of the distribution.

Distribution

In the context of generative modeling, we apply the score function to the marginal probability path pt(x). The marginal score is ∇ log pt(x), and the conditional score (for a specific data point z) is ∇ log pt(x|z). Like vector fields, the marginal score can be expressed via conditional scores:

∇ log pt(x) = ∫ ∇ log pt(x|z) · pt(x|z)pdata(z)/pt(x) · dz
For q = N(x; 5, 4) (mean 5, variance 4), what is the score at x = 9?

Chapter 2: Gaussian Score

For the Gaussian probability path pt(x|z) = N(αtz, βt2I), the conditional score has a simple closed form:

∇ log pt(x|z) = −(x − αtz) / βt2

Derivation. The Gaussian density is pt(x|z) ∝ exp(−||x − αtz||2 / (2βt2)).

Taking log: log pt(x|z) = −||x − αtz||2 / (2βt2) + const.

Taking gradient with respect to x: ∇x log pt(x|z) = −(x − αtz) / βt2. Let's verify this for d = 1:

∂/∂x [−(x − αtz)2 / (2βt2)] = −2(x − αtz) / (2βt2) = −(x − αtz) / βt2

For d > 1, the gradient with respect to the vector x gives the same formula component-wise, yielding a vector in Rd.

Since x = αtz + βtε (where ε ~ N(0,I)), we can substitute x − αtz = βtε:

∇ log pt(x|z) = −βtε / βt2 = −ε / βt
The score is proportional to the noise! The conditional score ∇ log pt(x|z) = −ε/βt. It points in the direction opposite to the noise that was added. This makes perfect sense: to increase the likelihood of x under pt(·|z), you should move x in the direction that undoes the noise, bringing it closer to the clean data αtz.

Score magnitude varies with noise level. The score is −ε/βt. For fixed ε:

t (CondOT)βt = 1−tScore = −ε/βt (for ε=1)Interpretation
0.1 (very noisy)0.9−1.11Small correction — lots of noise, uncertain
0.5 (medium)0.5−2.0Moderate correction
0.9 (nearly clean)0.1−10.0Large correction — precise denoising
0.990.01−100.0Extremely large — numerical instability!

This explosion as βt → 0 is why directly training a score network can be numerically problematic. The DDPM reparameterization (Chapter 8) addresses this by predicting ε instead of −ε/βt.

Geometric interpretation in 2D. Consider a 2D Gaussian pt(x|z) = N(αtz, βt2I). The score at any point x is a vector pointing from x toward the Gaussian center αtz, with magnitude inversely proportional to βt2. The level curves of the density are circles (because the covariance is isotropic), and the score vectors are perpendicular to these circles, pointing inward.

For a mixture of Gaussians (the marginal distribution), the score at each point is a weighted average of the scores from each component. Near the center of one component, that component's score dominates. Between two components, the scores compete — the score may be small or point in an ambiguous direction. This "indecisive" region between modes is where the neural network must make the most nuanced predictions.

python
# Computing the conditional score for a Gaussian path
def conditional_score(x, z, alpha_t, beta_t):
    """Score of p_t(x|z) = N(alpha_t * z, beta_t^2 * I)."""
    return -(x - alpha_t * z) / (beta_t ** 2)

# Example
z = torch.tensor([3.0, -1.0])
x = torch.tensor([2.5, 0.0])
alpha, beta = 0.7, 0.3
score = conditional_score(x, z, alpha, beta)
# score = -(x - 0.7*z) / 0.09
# = -([2.5, 0] - [2.1, -0.7]) / 0.09
# = -[0.4, 0.7] / 0.09 = [-4.44, -7.78]
# Points from x toward alpha*z = [2.1, -0.7]

Worked example for CondOT (αt=t, βt=1−t). Let z = 5, ε = −1.2, t = 0.6.

Then x = 0.6 · 5 + 0.4 · (−1.2) = 3.0 − 0.48 = 2.52.

Score: −ε/βt = −(−1.2)/0.4 = 3.0. The score points in the positive direction, toward the clean data z = 5.

Vector field: ut(x|z) = z − ε = 5 − (−1.2) = 6.2.

These are different numbers but encode the same information — they are linear reparameterizations of each other.

The conditional score −ε/βt points in the direction that...

Chapter 3: Score-Velocity Conversion

For Gaussian probability paths, the conditional vector field and the conditional score are both linear functions of x and z. This means we can convert between them algebraically.

Proposition 1 (Conversion Formula): For the Gaussian path pt(x|z) = N(αtz, βt2I):

uttarget(x|z) = at ∇ log pt(x|z) + bt x

where at = βt2 α̇tt − β̇tβt,   bt = α̇tt

The same formula holds for marginals: uttarget(x) = at ∇ log pt(x) + bt x

Derivation for CondOT (αt=t, βt=1−t).

α̇t = 1, β̇t = −1. Then:

at = (1−t)2 · (1/t) − (−1)(1−t) = (1−t)2/t + (1−t) = (1−t)[(1−t)/t + 1] = (1−t)/t

bt = 1/t

Let's verify: ut(x|z) = at ∇ log pt(x|z) + bt x = ((1−t)/t)(−ε/(1−t)) + x/t = −ε/t + x/t = (x−ε(1−t)−... wait, let me use the direct formula.

Actually, the cleanest way to see this: the vector field ut(x|z) = z − ε. The score ∇ log pt(x|z) = −ε/(1−t). Both encode the same underlying information (the noise ε and data z), just packaged differently.

The profound implication: Once you train a flow model (learn utθ), you automatically have a score model. And once you train a score model (learn stθ), you automatically have a flow model. They are the same model in different coordinates. This is why flow matching and score matching are unified: one training procedure, two sampling methods (ODE and SDE).

This also means we can define a denoiser Dt(x) that predicts the clean data from noisy data:

Dt(x) = E[z | x at time t]

The denoiser, score, and vector field are all equivalent representations. Learning any one of them is sufficient.

Worked example — converting between representations. For CondOT (αt = t, βt = 1−t) with z = 5, ε = −1, t = 0.6:

x = 0.6(5) + 0.4(−1) = 2.6

Velocity: u = z − ε = 5 − (−1) = 6.0

Score: s = −ε/β = −(−1)/0.4 = 2.5

Noise prediction: ε̂ = ε = −1.0

Denoiser: D = z = 5.0

All four numbers encode the same information. Given any one, you can compute the others:

velocity = z − ε = 6.0
score = −ε/β = −(−1)/0.4 = 2.5
velocity = β2 score / t + x/t = 0.16(2.5)/0.6 + 2.6/0.6 = 0.667 + 4.333 = 5.0 ...

(The exact conversion uses the at, bt coefficients from Proposition 1.)

python
# All four parameterizations, interconverted
# For CondOT: alpha_t = t, beta_t = 1 - t

def velocity_to_eps(v, z, eps):
    """v = z - eps => eps = z - v"""
    return z - v  # only if you know z

def score_to_eps(s, beta_t):
    """s = -eps/beta => eps = -beta * s"""
    return -beta_t * s

def eps_to_z(eps_pred, x, alpha_t, beta_t):
    """x = alpha*z + beta*eps => z = (x - beta*eps) / alpha"""
    return (x - beta_t * eps_pred) / alpha_t
ParameterizationNetwork predictsTarget
Velocity (flow matching)utθ(x)z − ε (CondOT)
Scorestθ(x)−ε/βt
Noise predictionεtθ(x)ε
DenoiserDtθ(x)E[z|x]
If you train a flow model with flow matching, do you also get a score model?

Chapter 4: The SDE Extension Trick

We trained a flow model using flow matching. We can sample via ODE. But can we also sample via SDE (adding noise during sampling)? Yes — and it is free.

Theorem 17 (SDE Extension Trick): Given a trained vector field uttarget and score ∇ log pt, for any diffusion coefficient σt ≥ 0, the following SDE generates samples from pt:

dXt = [uttarget(Xt) + (σt2/2) ∇ log pt(Xt)] dt + σt dWt

In particular, X1 ~ pdata.

The SDE has three terms:

1. uttarget(Xt) dt — the ODE drift (moves toward data, same as flow model).

2. t2/2) ∇ log pt(Xt) dt — the score correction (compensates for the noise being added).

3. σt dWt — the Brownian motion noise.

For Gaussian paths, we can express everything in terms of the learned vector field (using Proposition 1):

dXt = [(1 + σt2/(2at)) utθ(Xt) − σt2 bt/(2at) Xt] dt + σt dWt
SDE Extension: Adjusting σ

Same trained model, different σ. At σ=0 you get ODE (straight paths). Increase σ to see SDE sampling (noisy paths). Both reach the same target distribution.

σ 0.00
Why this matters: You train once (with flow matching). Then at inference time, you can choose σt freely. σ = 0 gives ODE sampling (fast, deterministic). σ > 0 gives SDE sampling (potentially higher quality, stochastic). The optimal σ is found empirically.

Worked example — SDE sampling step by step. Consider a trained model with the CondOT path. At step i with t = 0.3, Xt = (1.2, −0.5), σ = 0.5, h = 0.02:

1. Network predicts velocity: uθ(Xt, 0.3) = (2.1, 1.3).

2. Convert to score using Proposition 1: at = 0.7/0.3 = 2.33, bt = 1/0.3 = 3.33.

   score = (u − btx) / at = ((2.1, 1.3) − 3.33(1.2, −0.5)) / 2.33 = ((2.1−4.0, 1.3+1.67)) / 2.33 = (−0.82, 1.27).

3. Compute SDE drift: u + (σ2/2) · score = (2.1, 1.3) + 0.125(−0.82, 1.27) = (2.0, 1.46).

4. Sample noise: ε = (0.3, −0.8).

5. Update: Xt+h = Xt + h · drift + σ√h · ε = (1.2, −0.5) + 0.02(2.0, 1.46) + 0.5√0.02(0.3, −0.8) = (1.2+0.04+0.021, −0.5+0.029−0.057) = (1.261, −0.528).

python
# SDE sampling with a trained flow matching model
def sde_sample(u_theta, sigma_fn, n_steps=50, d=2):
    x = torch.randn(d)
    h = 1.0 / n_steps
    for i in range(n_steps):
        t = i * h
        sigma = sigma_fn(t)
        # Convert velocity to score using Prop 1
        v = u_theta(x, t)
        a_t = (1-t) / max(t, 1e-5)  # CondOT
        b_t = 1.0 / max(t, 1e-5)
        score = (v - b_t * x) / a_t
        # SDE step
        drift = v + 0.5 * sigma**2 * score
        noise = sigma * (h**0.5) * torch.randn_like(x)
        x = x + h * drift + noise
    return x
The SDE Extension Trick lets you choose σt...

Chapter 5: The Fokker-Planck Equation

How do we prove the SDE Extension Trick? The key mathematical tool is the Fokker-Planck equation, which tells us what distribution an SDE produces over time.

Recall the continuity equation for ODEs: if dX/dt = ut(Xt) and Xt ~ pt, then:

t pt(x) = −div(pt ut)(x)

The Fokker-Planck equation extends this to SDEs. For the SDE dXt = ut(Xt) dt + σt dWt:

t pt(x) = −div(pt ut)(x) + (σt2/2) Δpt(x)

where Δ is the Laplacian operator: Δf = ∑i2f/∂xi2.

Physical interpretation: The continuity equation says probability flows along the vector field (like water in pipes). The Fokker-Planck equation adds a diffusion term — probability also spreads out due to Brownian noise, like ink diffusing in water. The Laplacian Δp measures how "concentrated" the probability is at each point and pushes it to spread.

Proof sketch of the SDE Extension Trick (Theorem 17). We need to show the SDE with drift uttarget + (σ2/2)∇ log pt satisfies Fokker-Planck for pt:

Start with the known continuity equation: ∂t pt = −div(pt uttarget).

Add and subtract (σ2/2)Δpt:

t pt = −div(pt uttarget) − (σ2/2)Δpt + (σ2/2)Δpt

Use the identity Δpt = div(∇pt) = div(pt ∇ log pt):

= −div(pt uttarget) − div(pt2/2) ∇ log pt) + (σ2/2)Δpt
= −div(pt [uttarget + (σ2/2) ∇ log pt]) + (σ2/2)Δpt

This is exactly the Fokker-Planck equation for the SDE with drift uttarget + (σ2/2)∇ log pt and diffusion σt. QED.

The key identity used: Δp = div(∇p) = div(p · ∇log p). This converts between the Laplacian (which is hard to interpret) and the divergence of p times the score (which relates to our SDE drift). The entire proof is adding zero in a clever way.

Why does this identity work? Let's verify it step by step. We know ∇ log p = ∇p / p. Therefore:

div(p · ∇ log p) = div(p · ∇p/p) = div(∇p) = Δp

The p's cancel. This is why the Fokker-Planck diffusion term (σ2/2)Δp can be rewritten as div(p · (σ2/2)∇ log p) — and that is exactly the divergence of the SDE's score-correction drift term times the density. The proof works because adding a drift of (σ2/2)∇ log p exactly compensates for the Brownian diffusion.

Worked example — Fokker-Planck for OU process. The Ornstein-Uhlenbeck process has u(x) = −θx and constant σ. The stationary distribution satisfies ∂tp = 0:

0 = −div(p(−θx)) + (σ2/2)Δp
0 = θ div(px) + (σ2/2)Δp

The solution is p(x) = N(0, σ2/(2θ)). You can verify this by substituting the Gaussian density and checking that both terms cancel. This confirms what we stated in Chapter 2: the OU process converges to a Gaussian with variance σ2/(2θ).

The Fokker-Planck equation extends the continuity equation by adding which term?

Chapter 6: Langevin Dynamics

A beautiful special case of the SDE Extension Trick arises when the probability path is constant: pt = p for all t. Setting uttarget = 0, the SDE becomes:

dXt = (σt2/2) ∇ log p(Xt) dt + σt dWt

This is Langevin dynamics. It has a remarkable property: if you run it long enough, the distribution of Xt converges to p, regardless of where you started. It is a way to sample from any distribution using only its score function.

Langevin dynamics as a sampling algorithm: Given access to ∇ log p(x) (the score of a target distribution p), initialize X0 anywhere and simulate:

Xt+h = Xt + (h/2) ∇ log p(Xt) + √h · εt

After many steps, Xt ≈ sample from p. The score ∇ log p pulls particles toward high-density regions (drift), while the noise prevents them from getting stuck at a single mode (exploration).

Worked example. Target p(x) = mixture of Gaussians at −3 and +3 (equal weight, σ = 0.5). Start at X0 = 0, step size h = 0.01.

Near x = 0, the score is small (between two modes). Near x = 3, the score points toward x = 3. The particle bounces around, eventually spending time near both modes proportional to their probability.

Langevin Dynamics Particles

50 particles evolve under Langevin dynamics toward the target distribution (5-mode Gaussian mixture). Watch them converge from random initialization to the target.

Langevin dynamics is foundational in molecular dynamics (simulating protein folding), Bayesian statistics (sampling from posterior distributions), and generative modeling (the Ornstein-Uhlenbeck process is Langevin dynamics for a Gaussian target).

Worked example — Langevin for a Gaussian. Target: p = N(3, 1). Score: ∇ log p(x) = −(x − 3). Step size h = 0.01. Starting at X0 = −2:

StepXtScore = −(X−3)Drift = h/2 · scoreεNoise = √h · ε
1−2.0005.0000.0250.30.030
2−1.9454.9450.025−1.2−0.120
3−2.0405.0400.0250.80.080

Over hundreds of steps, the particle drifts toward x = 3 (the mean) and then fluctuates around it. After convergence, the distribution of Xt matches N(3, 1).

python
# Langevin dynamics sampler
def langevin_sample(score_fn, n_steps=1000, h=0.01, d=2):
    """Sample from distribution using only its score function."""
    x = torch.randn(d)  # arbitrary initialization
    for _ in range(n_steps):
        score = score_fn(x)
        x = x + (h / 2) * score + (h ** 0.5) * torch.randn_like(x)
    return x  # approximately distributed as p
Connection to generative modeling: In the SDE Extension Trick, the added terms (σ2/2)∇ log pt dt + σ dWt are exactly Langevin dynamics applied to the time-varying distribution pt. At each instant, the SDE performs a tiny Langevin step that preserves pt, while the ODE drift uttarget does the actual transportation from noise to data.

Why Langevin dynamics converges. The drift (σ2/2)∇ log p(x) points toward high-density regions. The noise σ dWt provides exploration. The balance between the two ensures that:

1. Particles are attracted to modes (high-density peaks) by the drift.

2. Particles can escape local modes and explore other modes via the noise.

3. The stationary distribution is exactly p, because the Fokker-Planck equation with drift = (σ2/2)∇ log p and diffusion = σ has stationary solution p.

The convergence rate depends on σ and the structure of p. For multimodal distributions, larger σ helps explore between modes but increases the total number of steps needed. There is a "Goldilocks" σ: enough noise to explore, not so much that convergence is slow.

Applications beyond generative modeling:

FieldHow Langevin is used
Bayesian statisticsSample from posterior distributions for uncertainty quantification
Molecular dynamicsSimulate protein folding at finite temperature
MCMC methodsMetropolis-Adjusted Langevin Algorithm (MALA)
OptimizationStochastic gradient Langevin dynamics (SGLD) for escaping local minima
Langevin dynamics requires which piece of information about the target distribution p?

Chapter 7: The Score Matching Loss

Instead of training a vector field utθ via flow matching, we can directly train a score network stθ(x) to approximate the marginal score ∇ log pt(x). The score matching loss mirrors the flow matching loss exactly:

LSM(θ) = Et, z, x~pt(·|z)[ ||stθ(x) − ∇ log pt(x)||2 ]

This is intractable (we cannot compute ∇ log pt(x)). But the denoising score matching loss is tractable:

LCSM(θ) = Et, z, x~pt(·|z)[ ||stθ(x) − ∇ log pt(x|z)||2 ]

And by the same proof as Theorem 12 (since the formula for ∇ log pt via conditionals has the same structure as for uttarget):

LSM(θ) = LCSM(θ) + C

For the Gaussian path, the conditional score is ∇ log pt(x|z) = −ε/βt, so:

LCSM = Et, z, ε[ ||stθtz + βtε) + ε/βt||2 ]
The network learns to denoise. The score network stθ(x) learns to predict −ε/βt given noisy data x = αtz + βtε. It essentially identifies and removes the noise that was added. This is why the method is called denoising score matching.

Worked example — denoising score matching step. Data z = (4.0, 2.0), αt = 0.7, βt = 0.3, noise ε = (0.5, −1.2).

Noisy input: x = 0.7(4.0, 2.0) + 0.3(0.5, −1.2) = (2.8, 1.4) + (0.15, −0.36) = (2.95, 1.04).

Target score: −ε/βt = −(0.5, −1.2)/0.3 = (−1.67, 4.0).

If the network outputs stθ = (−1.5, 3.5), the loss is ||(−1.5+1.67, 3.5−4.0)||2 = ||(0.17, −0.5)||2 = 0.029 + 0.25 = 0.279.

Why "denoising"? The conditional score ∇ log pt(x|z) = −(x − αtz)/βt2 points from x back toward αtz, the noiseless version of the data. Following the score = undoing the noise = denoising. When we train the network to predict this score, we are training a denoiser. This is exactly what happens inside every diffusion model: the neural network is fundamentally a denoiser.

The proof follows the same structure as Theorem 12. The conditional and marginal score functions are related by the same posterior-weighted integral as the vector fields (Equation 38 in the book). Therefore, the same "expand ||a−b||2, swap integrals" proof applies. The key identity is:

Ex~pt[f(x)T ∇ log pt(x)] = Ez~pdata, x~pt(·|z)[f(x)T ∇ log pt(x|z)]

for any function f. This follows from the posterior-weighted representation of the marginal score.

Comparing the three training losses side by side for CondOT.

python
# All three losses for the CondOT path, side by side
z = sample_data(B)                    # [B, d]
t = torch.rand(B, 1)                   # [B, 1]
eps = torch.randn_like(z)               # [B, d]
x = t * z + (1 - t) * eps              # noisy sample

# Flow Matching (velocity prediction)
v_target = z - eps
L_cfm = (v_model(x, t) - v_target).pow(2).mean()

# Score Matching (score prediction)
s_target = -eps / (1 - t)
L_csm = (s_model(x, t) - s_target).pow(2).mean()

# DDPM (noise prediction)
e_target = eps
L_ddpm = (e_model(x, t) - e_target).pow(2).mean()

# All three optimize the same objective (up to constants)!
# At convergence: v_model ≈ z-eps, s_model ≈ -eps/(1-t), e_model ≈ eps

The "denoising" interpretation unifies everything. In all three parameterizations, the network sees a noisy version of the data x = αtz + βtε and must recover information about the clean data z and/or the noise ε. This is fundamentally a denoising task at different noise levels. The different parameterizations just package the denoising information differently:

What the network predictsHow it relates to denoising
Velocity z − εThe direction and speed from noise to clean data
Score −ε/βtThe direction to increase likelihood (undo noise)
Noise εThe exact noise that was added (subtract to denoise)
Clean data zThe denoised output directly

Which loss should you use in practice? Here is a practical decision tree:

1. If using CondOT path (α = t, β = 1−t): velocity prediction (flow matching). The target z − ε is well-conditioned at all t values.

2. If using VP noise schedule (DDPM-style): noise prediction (ε-prediction). The DDPM loss avoids the 1/β2 explosion of score matching.

3. If you need SDE sampling with guidance: train with any loss, then convert to score at inference using Proposition 1.

4. If you need very few sampling steps: consider v-prediction (a parameterization related to velocity prediction) combined with progressive distillation.

In modern practice, most new models use flow matching (velocity prediction) with the CondOT path because it produces the straightest trajectories and the most well-conditioned loss. The era of DDPM-style noise prediction, while historically important, is gradually giving way to flow matching.

python
# Decision tree in code
if noise_schedule == 'condot':
    # Use flow matching (velocity prediction)
    x = t * z + (1 - t) * eps
    target = z - eps
    loss = mse(model(x, t), target)
elif noise_schedule == 'vp':
    # Use DDPM (noise prediction)
    alpha_bar = cosine_schedule(t)
    x = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps
    loss = mse(model(x, t), eps)
# Both are equivalent mathematically, but CondOT is preferred

Common architectures for score/noise prediction. Historically, diffusion models used U-Net architectures (like the one in the original DDPM paper). Modern models use Transformers (DiT). The architecture choice is independent of the training objective — you can use velocity, score, or noise prediction with any architecture.

Multi-scale denoising. The neural network must handle a wide range of noise levels. At low noise (high t), the task is easy — the input already looks like data. At high noise (low t), the task is hard — the input is almost pure noise. To handle this range, the network typically uses:

1. Sinusoidal time embeddings at multiple frequencies, so the network can precisely distinguish between close noise levels.

2. Multi-scale feature processing (in U-Nets: encoder-decoder with skip connections; in DiTs: attention at multiple patch sizes).

3. Adaptive normalization that modulates features based on the noise level.

Chapter 4 summary: Score functions provide an alternative lens for understanding generative models. For Gaussian paths, scores and velocities are interchangeable. The SDE Extension Trick lets us use both ODE and SDE sampling from a single trained model. Denoising score matching and the DDPM loss are equivalent to flow matching up to reparameterization. The three perspectives — velocity, score, noise — are different facets of the same diamond. Understanding all three gives you the flexibility to choose the best approach for each application.

The denoiser perspective. There is a fourth parameterization worth mentioning: the denoiser Dt(x) = E[z | x at time t]. The denoiser directly predicts the clean data from the noisy observation. For the CondOT path:

Dt(x) = (x − (1−t) εtθ(x)) / t

The denoiser is useful for progressive denoising: at each step, peek at what the "clean" image would look like, then continue. This is used in visualization tools that show intermediate denoising steps.

Why does the denoiser produce blurry images at low t? At low t (high noise), many different clean images z could have produced the same noisy observation x. The denoiser returns the average over all plausible z values, which is blurry. As t increases, fewer z values are plausible, and the denoiser output sharpens. At t = 1, only one z is plausible, and the output is sharp. This "blurry to sharp" progression is visible in diffusion model sampling visualizations.

The score as a Bayesian posterior mean. The marginal score ∇ log pt(x) can be interpreted as a Bayesian quantity: it is the gradient of the log-posterior averaged over all plausible data points. In regions where many data points contribute (low t, high noise), the score points toward the center of mass of plausible data. In regions where one data point dominates (high t, low noise), the score points precisely toward that data point. This Bayesian interpretation connects score matching to classical statistics and provides intuition for why the method works.

Looking ahead. Chapters 5-7 of the book build on the foundation we have established:

Chapter 5 (Guidance): Uses score functions to condition generation on text prompts via classifier-free guidance. This is what makes text-to-image models work.

Chapter 6 (Architecture): Discusses U-Nets, DiTs, and latent space design. How to build real image and video generators at scale.

Chapter 7 (Discrete Diffusion): Extends the principles to discrete data (text), enabling diffusion-based language models.

Recommended further reading on score matching:

PaperAuthorsKey contribution
Generative Modeling by Estimating Gradients of Data DistributionSong & Ermon, 2019NCSN: multi-scale score matching
Denoising Diffusion Probabilistic ModelsHo, Jain & Abbeel, 2020DDPM: practical noise prediction training
Score-Based Generative Modeling through SDEsSong et al., 2020Unified SDE framework
Classifier-Free Diffusion GuidanceHo & Salimans, 2022CFG: the key to text-to-image
Progressive Distillation for Fast SamplingSalimans & Ho, 2022Reducing steps via distillation

The full training and sampling pipeline in one place:

Training
Sample (z, t, ε), compute x = αz + βε, minimize ||prediction − target||2
↓ after training
ODE Sampling
X0 ~ N(0,I), iterate Xt+h = Xt + h · uθ(Xt, t)
SDE Sampling
Same model + score correction + Brownian noise at each step
Guided Sampling
Blend conditional + unconditional predictions with CFG scale w

Key equations to remember from this chapter:

Score:   ∇ log q(x) = ∇q(x)/q(x)   (direction of steepest ascent in log-prob)
Gaussian score:   ∇ log pt(x|z) = −(x − αtz)/βt2 = −ε/βt
Conversion:   ut(x) = at ∇ log pt(x) + bt x
SDE Extension:   dX = [ut + (σ2/2)∇ log pt] dt + σ dW
Fokker-Planck:   ∂tp = −div(pu) + (σ2/2)Δp
Langevin:   dX = (σ2/2)∇ log p(X) dt + σ dW
DDPM loss:   L = E[||εθ(x) − ε||2]

These equations, combined with those from Chapter 3, form the complete mathematical toolkit for modern generative modeling. The score function is the bridge between flow models (ODE) and diffusion models (SDE). Once you understand both perspectives, you have mastery over the entire framework.

Final thought from the book: "Creating noise from data is easy; creating data from noise is generative modeling." — Song et al. We have now seen exactly how this is done: define a probability path from noise to data, derive the vector field (or score) that follows this path, train a neural network to approximate it, and simulate the ODE/SDE at inference. The mathematical elegance of this framework — where the intractable marginal loss equals the tractable conditional loss, where scores and velocities are interchangeable, where SDEs can be added for free — is what makes it both practical and beautiful.

python
# Denoising Score Matching training step
def dsm_loss(score_net, z_batch, alpha_fn, beta_fn):
    B, d = z_batch.shape
    t = torch.rand(B, 1)
    alpha = alpha_fn(t)                       # [B, 1]
    beta = beta_fn(t)                         # [B, 1]
    eps = torch.randn_like(z_batch)            # [B, d]
    x = alpha * z_batch + beta * eps          # noisy sample
    target_score = -eps / beta                # -eps/beta_t
    pred = score_net(x, t)                    # s_theta(x, t)
    return (pred - target_score).pow(2).mean()
Denoising score matching trains the network to predict...

Chapter 8: The DDPM Loss

The denoising score matching loss has a numerical problem: when βt ≈ 0 (near t = 1), the target −ε/βt explodes. The seminal DDPM paper (Ho et al., 2020) proposed a fix: reparameterize the score network as a noise predictor.

Define εtθ(x) = −βt stθ(x), so the network predicts the noise ε directly instead of the score. The loss becomes:

LDDPM(θ) = Et, z, ε[ ||εtθtz + βtε) − ε||2 ]
Compare the three training objectives side by side:

Flow matching (CFM): ||utθ(x) − (z − ε)||2    (predict velocity)

Score matching (CSM): ||stθ(x) + ε/βt||2    (predict score)

DDPM: ||εtθ(x) − ε||2    (predict noise)

All three are equivalent up to constants and reparameterization. They all learn the same underlying information: the relationship between noisy data and clean data.

Worked example — one DDPM training step.

Data: z = [3.0, −1.0] (a 2D point).

Noise schedule: αt = t, βt = 1−t.

Sample t = 0.4, ε = [0.8, −0.5].

Noisy input: x = 0.4[3, −1] + 0.6[0.8, −0.5] = [1.2, −0.4] + [0.48, −0.3] = [1.68, −0.7].

Target: ε = [0.8, −0.5].

Network predicts: ε0.4θ([1.68, −0.7]) = [0.6, −0.3] (hypothetical).

Loss: ||(0.6−0.8, −0.3−(−0.5))||2 = ||(−0.2, 0.2)||2 = 0.04 + 0.04 = 0.08.

Why DDPM drops the 1/βt2 weight. The score matching loss has an implicit weight 1/βt2 which blows up as βt → 0 (near t = 1, when the data is nearly clean). The DDPM loss εtθ absorbs this by predicting −βt stθ = ε. This reparameterization is more numerically stable and gives equal weight to all noise levels. In practice, additional weighting schemes (like the "v-prediction" or "min-SNR" weighting) can further improve training stability.

From noise prediction to denoising. If the network predicts ε (the noise added), we can recover the clean data estimate:

ẑ = (x − βt εtθ(x)) / αt

This is the denoiser — the expected clean data given the noisy observation. For the CondOT path: ẑ = (x − (1−t) εtθ(x)) / t. At t close to 1 (nearly clean), this is approximately ẑ ≈ x. At t close to 0 (very noisy), the network must do heavy lifting to predict z from mostly noise.

python
# Converting between parameterizations
def eps_to_velocity(eps_pred, z_pred, alpha_dot, beta_dot):
    """Convert noise prediction to velocity prediction."""
    return alpha_dot * z_pred + beta_dot * eps_pred

def eps_to_score(eps_pred, beta):
    """Convert noise prediction to score."""
    return -eps_pred / beta

def eps_to_denoiser(eps_pred, x, alpha, beta):
    """Convert noise prediction to denoised data."""
    return (x - beta * eps_pred) / alpha
DDPM Noise Prediction

Visualize the DDPM training process. At each time t, the network sees noisy data and must predict the noise that was added. Drag the slider to see different noise levels.

t (noise level) 0.30
python
# Algorithm 4: DDPM Training (Noise Prediction)
def ddpm_loss(eps_net, z_batch, alpha_fn, beta_fn):
    B, d = z_batch.shape
    t = torch.rand(B, 1)
    eps = torch.randn_like(z_batch)
    x = alpha_fn(t) * z_batch + beta_fn(t) * eps
    pred_eps = eps_net(x, t)
    return (pred_eps - eps).pow(2).mean()
The DDPM loss trains the network to predict...

Chapter 9: Connections

This chapter unified flow matching and score matching into a single framework. Here is the complete picture:

TopicKey Result
Score function∇ log pt(x) = direction of steepest ascent in log-probability
Gaussian score∇ log pt(x|z) = −ε/βt (proportional to the noise)
Score-velocity conversionut = at ∇ log pt + bt x (linear reparameterization)
SDE Extension TrickAdd (σ2/2)∇ log pt dt + σ dW to ODE to get SDE
Fokker-Planck equationtp = −div(pu) + (σ2/2)Δp
Langevin dynamicsSample from p using only ∇ log p
Denoising score matching||sθ(x) + ε/βt||2 (equivalent to score matching)
DDPM loss||εθ(x) − ε||2 (reparameterized for stability)
The unified view: Flow matching (velocity prediction), score matching (score prediction), and DDPM (noise prediction) are three equivalent parameterizations of the same model. Train with any loss, sample with either ODE or SDE. The entire framework is summarized in two equations:

Training: Regress against conditional target (velocity, score, or noise).
Sampling: Simulate ODE (σ=0) or SDE (σ>0) from noise to data.

Historical perspective. The ideas in this chapter evolved over several years:

YearPaperKey Contribution
2019Song & Ermon (NCSN)Score matching with multiple noise levels
2020Ho et al. (DDPM)Noise prediction loss, high-quality images
2020Song et al. (Score SDE)Unified ODE/SDE framework, continuous time
2022Lipman et al. (Flow Matching)CondOT path, simulation-free training
2023Albergo & Vanden-EijndenStochastic interpolants (generalization)

The key realization of this chapter is that DDPM (2020) and flow matching (2022), which looked like different algorithms, are actually the same algorithm in different coordinates. This unification, enabled by the score-velocity conversion (Proposition 1), is one of the most elegant results in modern generative modeling.

When to use each parameterization:

ParameterizationBest ForUsed By
Velocity (v-prediction)CondOT paths, straight trajectoriesStable Diffusion 3, FLUX
Noise (ε-prediction)VP noise schedules, DDPM-styleOriginal Stable Diffusion, DALL-E 2
Score (s-prediction)Theoretical analysis, SDE samplingScore SDE papers
Denoiser (x0-prediction)High-SNR regime, progressive distillationConsistency models

In practice, velocity prediction (flow matching) has become the dominant choice for new models because CondOT paths create the straightest trajectories, requiring the fewest sampling steps. Noise prediction remains popular in legacy codebases and for VP noise schedules.

Practical sampling recipe for a trained model. After training with any parameterization, the generation procedure is:

1. Initialize
X0 = torch.randn(batch_size, d) — Gaussian noise
2. Choose sampler
ODE (σ=0): Euler or Heun. SDE (σ>0): Euler-Maruyama.
3. Choose σ schedule
σ=0 (fast, deterministic) or σ=0.5-1.0 (more diverse, slower)
4. Choose n steps
20-50 for ODE, 100+ for SDE. More steps = better quality.
5. Simulate
Loop: Xt+h = Xt + h · drift + noise. Return X1.

The score function enables guidance. One of the most important applications of score functions is classifier-free guidance (covered in Chapter 5 of the book). The key idea: at inference time, modify the score to amplify the conditioning signal:

sguided(x, t) = suncond(x, t) + w · (scond(x, t, y) − suncond(x, t))

where w > 1 is the guidance weight. Larger w = stronger adherence to the prompt y. This is why Stable Diffusion has a "CFG scale" slider: it controls w. This technique requires score functions (or their equivalent parameterizations), which is why understanding scores is essential even if you primarily use flow matching for training.

Worked example — guidance in 1D. Suppose p(x) is a Gaussian mixture with modes at −3 and +3 (unconditional). Conditioning on y = "right mode" gives p(x|y) = N(3, 1). At x = 1:

Unconditional score: ∇ log p(1) ≈ +0.5 (slight pull toward +3 mode, both modes have influence).

Conditional score: ∇ log p(1|y) = −(1−3)/1 = +2 (strong pull toward +3).

Guided score (w = 7): s = 0.5 + 7(2 − 0.5) = 0.5 + 10.5 = 11.0. Very strong pull toward +3. The guidance amplifies the conditioning signal by a factor of 7.

This is exactly what happens in Stable Diffusion when you increase the "CFG scale" slider: higher values produce images that more faithfully match the prompt but may look more extreme or less natural.

Training for guidance. To enable classifier-free guidance, the model is trained with random prompt dropout: with probability pdrop (typically 10%), the text prompt is replaced with an empty string during training. This teaches the model both conditional and unconditional generation. At inference, both scores are computed in a single forward pass with two inputs (with and without prompt).

python
# Training with prompt dropout for CFG
def train_step_with_cfg(model, z, prompt, p_drop=0.1):
    t = torch.rand(z.shape[0], 1)
    eps = torch.randn_like(z)
    x = t * z + (1 - t) * eps

    # Randomly drop prompts during training
    mask = torch.rand(z.shape[0]) > p_drop
    prompt_input = [p if m else "" for p, m in zip(prompt, mask)]

    target = z - eps
    return (model(x, t, prompt_input) - target).pow(2).mean()

SDE vs ODE sampling quality. For a perfectly trained model, ODE and SDE sampling produce exactly the same distribution (by the SDE Extension Trick). But for imperfectly trained models (which is always the case in practice), SDE sampling can sometimes produce higher quality because the added noise has a "corrective" effect — it averages over small errors in the learned vector field. This is analogous to how ensemble methods in machine learning are more robust than single models.

The optimal σ depends on the model quality and dataset. For well-trained models, σ = 0 (ODE) is often best because it avoids the additional discretization error from the noise. For weaker models or complex datasets, σ ≈ 0.3-1.0 can improve FID scores.

Stochastic vs. deterministic sampling comparison:

PropertyODE (σ=0)SDE (σ>0)
Deterministic?Yes (same seed = same output)No
Typical steps20-5050-200
SpeedFasterSlower
Mode coverageGoodPotentially better
Error correctionNone (errors accumulate)Noise helps average over errors
Guidance supportLimitedFull (classifier guidance)
Likelihood computationYes (via change of variables)Not directly
python
# Classifier-free guidance (preview of Chapter 5)
def guided_velocity(model, x, t, prompt, cfg_scale=7.5):
    """Classifier-free guidance for conditional generation."""
    v_uncond = model(x, t, prompt=None)   # unconditional
    v_cond = model(x, t, prompt=prompt)  # conditional
    # Guided velocity: amplify the conditioning signal
    return v_uncond + cfg_scale * (v_cond - v_uncond)
python
# The complete unified pipeline
import torch

# === TRAINING (same for all three parameterizations) ===
def unified_train_step(model, z_batch, param='velocity'):
    t = torch.rand(z_batch.shape[0], 1)
    eps = torch.randn_like(z_batch)
    x = t * z_batch + (1 - t) * eps

    if param == 'velocity':
        target = z_batch - eps          # flow matching
    elif param == 'noise':
        target = eps                     # DDPM
    elif param == 'score':
        target = -eps / (1 - t)          # denoising score matching

    return (model(x, t) - target).pow(2).mean()

# === SAMPLING (ODE or SDE, after any training) ===
def unified_sample(model, sigma=0, n=50, d=2):
    x = torch.randn(1, d)
    h = 1.0 / n
    for i in range(n):
        t = i * h
        v = model(x, t)
        if sigma > 0:
            # Convert velocity to score for SDE term
            a_t = (1-t) / max(t, 1e-5)
            b_t = 1.0 / max(t, 1e-5)
            score = (v - b_t * x) / a_t
            drift = v + 0.5 * sigma**2 * score
            x = x + h * drift + sigma * h**0.5 * torch.randn_like(x)
        else:
            x = x + h * v
    return x
Chapters 1-4 Complete
The full theory of flow matching and diffusion models
↓ next
Ch 5: Guidance
Classifier-free guidance: condition on text prompts
Ch 6: Architecture
U-Nets, DiTs, latent spaces — building real image generators
Which statement best summarizes the relationship between flow matching, score matching, and DDPM?