Introduction
In Article 02 we learned that DDPMs train a neural network to predict the noise ε added to data. That framing is clean, practical, and led to stunning image generation results. But it hides a deeper mathematical story — one that began years before DDPMs and unifies an entire family of generative models under a single elegant concept: the score function.
The score function is the gradient of the log-probability density: s(x) = ∇x log p(x).
It tells you the direction in which probability increases fastest — a vector field that points every
location in space toward regions of higher density. If you can learn this vector field accurately, you can
generate samples by simply following it. No adversarial training. No decoder. No normalization constant.
This idea was crystallized by Song & Ermon (2019) in their landmark paper on score-based generative models, building on classical results from Hyvärinen (2005) and Vincent (2011). Their framework — Noise Conditional Score Networks (NCSN) — approached diffusion models from a fundamentally different angle than the DDPM perspective, yet arrived at remarkably similar algorithms. The deep connection between the two was later formalized by Song et al. (2021), revealing that DDPMs and NCSNs are two views of the same mathematical framework.
This article traces the complete arc: from the score function definition, through score matching objectives that avoid the need for the intractable data density, to Langevin dynamics as a sampling engine, and finally to the multi-scale noise trick that makes it all work in practice.
This article assumes familiarity with the DDPM framework from Article 02 — particularly the noise prediction parameterization εθ(xt, t), the noise schedule constants ᾱt, and the simplified training loss. We'll show how ε-prediction and score-prediction are two sides of the same coin.
The Score Function
Let p(x) be a probability density over x ∈ ℝd. The
score function (also called the Stein score) is defined as the gradient
of the log-density with respect to the data:

s(x) = ∇x log p(x)
This is a vector field: at every point x in data space, it produces a d-dimensional vector pointing in the direction of steepest ascent of log p(x). Where the density is high, the score vectors are small (you're already near a mode). Where density is increasing rapidly — on the flanks of a mode — the score vectors are large and point toward the peak.
Think of log p(x) as a landscape of rolling hills. The score function is the slope at every
point. A ball placed anywhere on this landscape and pushed in the direction of the score would roll uphill
toward the nearest peak — the nearest high-density region. This is exactly the intuition behind Langevin
dynamics sampling, which we'll reach in Section 5.
Score of a Gaussian
The simplest instructive example: for a multivariate Gaussian
p(x) = 𝒩(x; μ, Σ), the score is:

s(x) = ∇x log p(x) = -Σ⁻¹(x - μ)
The score points from x directly toward the mean μ, scaled by the inverse covariance. For an
isotropic Gaussian 𝒩(0, σ2I), this simplifies to
s(x) = -x/σ2 — every point is pulled toward the origin with a force
inversely proportional to the variance. High variance means a gentle pull (the distribution is spread out);
low variance means a strong pull (the distribution is concentrated).
Freedom from the normalization constant
Here is the crucial property that makes score-based methods so powerful. Any density can be written as:

p(x) = exp(f(x)) / Z
where f(x) is an unnormalized log-density and
Z = ∫ exp(f(x)) dx is the normalization constant (partition function). Computing Z
requires integrating over all of ℝd — which is intractable for any
interesting high-dimensional distribution. This is the fundamental bottleneck of energy-based models,
undirected graphical models, and many other frameworks.
Taking the gradient of the log-density:

∇x log p(x) = ∇x f(x) - ∇x log Z = ∇x f(x)
Since Z is a constant with respect to x, its gradient is zero: ∇x log Z = 0.
The score function depends only on the unnormalized density. We never need to compute Z.
This is not a trick or an approximation — it is an exact mathematical identity. Any model that parameterizes an unnormalized density can be trained via its score without ever computing the partition function.
This property is what makes the score function so attractive for generative modeling. We can train a neural
network sθ(x) to approximate the data score
∇x log pdata(x) without ever needing to know
pdata(x) itself — we just need samples from it, which is exactly what our
training dataset provides.
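A one-dimensional sketch makes the point concrete: differentiating the unnormalized log-density f(x) and the normalized log p(x) = f(x) - log Z gives the same score. (Toy NumPy example; the Gaussian and its known Z are chosen only so the two can be compared.)

```python
import numpy as np

# The score never sees the normalizer: grad log p = grad f when
# p(x) = exp(f(x)) / Z. One-dimensional Gaussian sketch; Z is known
# here only so we can build the normalized log p for comparison.
sigma = 1.5
Z = np.sqrt(2 * np.pi) * sigma

f = lambda x: -x**2 / (2 * sigma**2)    # unnormalized log-density
log_p = lambda x: f(x) - np.log(Z)      # normalized log-density

x, eps = 0.7, 1e-6
grad_f = (f(x + eps) - f(x - eps)) / (2 * eps)
grad_log_p = (log_p(x + eps) - log_p(x - eps)) / (2 * eps)
print(grad_f, grad_log_p)  # equal up to floating-point noise: grad log Z = 0
```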
A mixture of two Gaussians shown as a density heatmap. Arrows show the score field ∇x log p(x) — they point toward high-density regions. Increase σ to see how noise smooths the landscape.
Score Matching
We want to train a neural network sθ(x) to approximate the true
data score ∇x log pdata(x). The most natural objective
is to minimize the expected squared error between them:

J(θ) = ½ 𝔼x~pdata[ ‖sθ(x) - ∇x log pdata(x)‖² ]
But there's an immediate problem: we don't know ∇x log pdata(x).
We have samples from pdata (our training set), but not the density itself, much less
its gradient. This seems like a chicken-and-egg problem — we need the score to train the score.
The breakthrough came from Hyvärinen (2005), who showed that this objective can be rewritten
into a form that requires only the model score sθ and samples
from pdata — the true score drops out entirely. This is the score matching
identity.
Implicit score matching (the integration-by-parts trick)
Start by expanding the squared norm:

J(θ) = ½ 𝔼p[‖sθ(x)‖²] - 𝔼p[sθ(x) · ∇x log p(x)] + ½ 𝔼p[‖∇x log p(x)‖²]
The last term is a constant with respect to θ — it depends only on the true data distribution,
not our model. So we can drop it for optimization. The tricky term is the middle one:
𝔼p[sθ(x) · ∇x log p(x)].
This is where the magic happens.
Using integration by parts (or equivalently, Stein's identity), and assuming the density vanishes at infinity (a mild regularity condition), the cross term can be rewritten:

𝔼p[sθ(x) · ∇x log p(x)] = -𝔼p[tr(∇x sθ(x))]
where tr(∇x sθ) is the trace of the Jacobian of the score
model — the sum of the diagonal entries Σi ∂sθ,i / ∂xi.
This depends only on our model, not on the true score.
Putting it together, we can minimize the equivalent objective:

JISM(θ) = 𝔼p[ tr(∇x sθ(x)) + ½‖sθ(x)‖² ]
This is remarkable: the true score has completely disappeared. We need only (1) evaluate
sθ(x) at data samples, (2) compute the trace of its Jacobian,
and (3) minimize. No adversarial training, no normalization constant, no explicit density.
The catch is computational: the Jacobian trace requires d backpropagation passes (one
per input dimension), which is prohibitive for high-dimensional data like images. This motivated
two practical alternatives: sliced score matching (Song et al., 2020), which uses
random projections to estimate the trace stochastically, and denoising score matching
(Vincent, 2011), which sidesteps the Jacobian entirely.
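To make the sliced idea concrete, here is a minimal sketch of the Hutchinson-style trace estimate tr(J) = 𝔼v[vᵀ J v] with v ~ 𝒩(0, I), applied to a linear "score model" s(x) = Ax whose Jacobian is simply A. (Sizes and the model are illustrative; this is the trace estimator, not the full sliced score matching objective.)

```python
import numpy as np

# Hutchinson-style slice estimate of a Jacobian trace: tr(J) = E_v[v^T J v]
# for v ~ N(0, I). A linear score model s(x) = A x has Jacobian A everywhere,
# so the exact trace is available for comparison.
rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
exact = np.trace(A)

n_slices = 100_000
v = rng.normal(size=(n_slices, d))              # random projection directions
est = np.einsum('ni,ij,nj->n', v, A, v).mean()  # average of v^T A v
print(exact, est)  # the estimate concentrates around the exact trace
```

The payoff: each slice needs only one Jacobian-vector product, not d backpropagation passes.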
Denoising Score Matching
Vincent (2011) discovered a beautifully practical alternative to implicit score matching. Instead of
matching the score of the clean data distribution pdata(x), we match the score of
a noise-perturbed distribution. The key insight: if we know the perturbation kernel, we know
the perturbed score in closed form — no Jacobians required.
Given a clean data point x drawn from pdata, we create a noisy
version x̃ by adding Gaussian noise:

x̃ = x + σε,  ε ~ 𝒩(0, I),  i.e.  qσ(x̃ | x) = 𝒩(x̃; x, σ²I)
The score of this perturbation kernel is:

∇x̃ log qσ(x̃ | x) = -(x̃ - x)/σ²
This is just the direction from the noisy point back to the clean point, scaled by the noise variance. It's the "denoising direction" — the vector that would undo the noise if followed exactly.
The denoising score matching (DSM) objective trains the model to predict this known score:

JDSM(θ) = 𝔼x~pdata 𝔼x̃~qσ(·|x)[ ‖sθ(x̃) - ∇x̃ log qσ(x̃ | x)‖² ]
Substituting the known score:

JDSM(θ) = 𝔼x, ε[ ‖sθ(x + σε) + ε/σ‖² ]
where ε ~ 𝒩(0, I). Vincent (2011) proved that minimizing this DSM objective
is equivalent (up to a constant) to minimizing the original score matching objective with respect to
the noise-perturbed distribution qσ(x̃) = ∫ qσ(x̃|x) pdata(x) dx.
As σ → 0, we recover the true data score.
The DSM objective is literally: corrupt data with noise, then train a network to point back toward the clean data. The predicted score vector sθ(x̃) should equal -(x̃ - x)/σ2, which is the direction from noisy to clean, divided by σ2. Score estimation is denoising. This is the same intuition behind DDPMs, but from the score perspective.
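A toy experiment shows DSM recovering the perturbed score. With data x ~ 𝒩(0, 1), the perturbed distribution is qσ = 𝒩(0, 1 + σ²), whose score is -x̃/(1 + σ²). Fitting a linear model sθ(x̃) = a·x̃ against the DSM target should recover that slope. (NumPy sketch with illustrative values.)

```python
import numpy as np

# Toy DSM: data x ~ N(0,1) perturbed at noise level sigma gives
# q_sigma = N(0, 1 + sigma^2), whose score is s(x~) = -x~ / (1 + sigma^2).
# Fit a linear model s_theta(x~) = a * x~ to the DSM target -eps/sigma.
rng = np.random.default_rng(1)
sigma = 0.5
n = 200_000
x = rng.normal(size=n)        # clean data
eps = rng.normal(size=n)      # the injected noise
x_tilde = x + sigma * eps     # noisy data
target = -eps / sigma         # known score of the perturbation kernel

# Closed-form least-squares minimizer of E[(a * x_tilde - target)^2]
a = np.dot(x_tilde, target) / np.dot(x_tilde, x_tilde)
print(a, -1 / (1 + sigma**2))  # both close to -0.8
```

Even though each individual target -ε/σ is just noise pointing back to one clean sample, the minimizer averages over all of them and lands on the true score of qσ, exactly as Vincent's theorem promises.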
The connection to DDPM noise prediction
The connection between denoising score matching and DDPM's noise prediction is not just conceptual — it is an exact mathematical identity. Recall that in DDPM, the noise prediction network εθ is trained to predict the noise ε that was added. In the score framework, sθ predicts the score -(x̃ - x)/σ² = -ε/σ. Therefore:

εθ(xt, t) = -√(1 - ᾱt) · sθ(xt, t)
Or equivalently: sθ(xt, t) = -εθ(xt, t) / √(1 - ᾱt).
The noise predictor and the score predictor are the same network, up to a timestep-dependent scaling factor.
Training one is exactly equivalent to training the other.
This identity is profound. It means that every DDPM ever trained is already a score-based model. The ε-prediction framing and the score-prediction framing are not competing paradigms — they are notational variants of the same underlying mathematics.
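The identity can be checked numerically on a toy Gaussian case where the optimal predictors are linear: with x0 ~ 𝒩(0, 1), the marginal of xt is exactly 𝒩(0, 1), whose true score is -xt. Fitting the ε-predictor's slope by least squares and rescaling by -1/√(1 - ᾱt) should give that score. (NumPy sketch; ᾱ is an arbitrary illustrative value.)

```python
import numpy as np

# Check s(x_t) = -E[eps | x_t] / sqrt(1 - abar_t) on a fully Gaussian case.
# x_0 ~ N(0,1) makes the marginal of x_t exactly N(0,1), with score -x_t.
# The optimal eps-predictor is linear here, so fit its slope by least squares.
rng = np.random.default_rng(2)
abar = 0.6
n = 500_000
x0 = rng.normal(size=n)
eps = rng.normal(size=n)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

slope = np.dot(xt, eps) / np.dot(xt, xt)   # approx sqrt(1 - abar)
score_slope = -slope / np.sqrt(1 - abar)   # approx -1, the true score slope
print(slope, score_slope)
```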
Clean data points (green) are corrupted by noise to produce noisy versions (blue). Score arrows point from noisy back toward clean — this is what the network learns.
Langevin Dynamics
We now have a way to estimate the score function from data (via denoising score matching). But how do we actually generate samples using a learned score? Enter Langevin dynamics — a Markov chain Monte Carlo (MCMC) algorithm that uses only the score function to produce samples from the target distribution.
The algorithm is strikingly simple. Starting from any initial point
x0 ~ π0 (typically random noise), we iterate:

xt+1 = xt + (η/2) ∇x log p(xt) + √η zt,   zt ~ 𝒩(0, I)
Each step has two components: a gradient ascent term (η/2)∇x log p(xt)
that pushes the sample toward higher-density regions, and a noise injection term
√η zt that adds random exploration. The step size η controls both.
The gradient term alone would just find the nearest mode (gradient ascent on log p). The noise term
is crucial: it prevents the chain from collapsing to a single point and ensures the samples explore
the full distribution. Too little noise and the chain gets stuck at modes. Too much noise and it
never converges. The precise √η scaling is what makes it work — it comes
from the physics of Brownian motion.
Convergence guarantees
Under mild regularity conditions (log-concavity, or more generally, a Poincaré inequality),
the distribution of xt converges to p(x) as t → ∞
and η → 0. In practice, we use a finite step size and finite iterations, which
introduces some bias — but for sufficiently small η and large t, the samples are excellent
approximations.
The critical observation for generative modeling: Langevin dynamics requires only the score
function to sample. If we have a good estimate sθ(x) ≈ ∇x log p(x),
we can replace the true score in the update rule:

xt+1 = xt + (η/2) sθ(xt) + √η zt
This gives us a complete generative pipeline: (1) train sθ on data via
denoising score matching, (2) sample by running Langevin dynamics with the learned score. No decoder,
no adversarial game, no explicit density evaluation — just a score network and an iterative update rule.
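The whole sampler fits in a few lines. The sketch below runs Langevin dynamics with the exact score of 𝒩(0, 1), s(x) = -x, starting many chains far from the mode; after burn-in, the empirical mean and variance match the target. (Step size and chain counts are illustrative.)

```python
import numpy as np

# Langevin dynamics with the exact score of N(0,1): s(x) = -x.
# Many chains start far from the mode; after burn-in their empirical
# mean and variance match the target distribution.
rng = np.random.default_rng(3)
eta = 0.01
x = np.full(5000, 10.0)          # 5000 chains, all initialized at x = 10
for _ in range(3000):
    z = rng.normal(size=x.shape)
    x = x + (eta / 2) * (-x) + np.sqrt(eta) * z

print(x.mean(), x.var())         # approx 0 and approx 1
```

Drop the noise term and every chain collapses deterministically onto the single mode at 0 with zero variance, which is exactly the failure mode described above.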
Particles start from random positions and perform Langevin MCMC on a mixture-of-Gaussians target. Watch them converge to the high-density regions.
The Low-Density Region Challenge
The score matching + Langevin dynamics pipeline sounds clean in theory. In practice, it has a devastating failure mode that becomes worse in high dimensions: the score estimate is unreliable in low-density regions.
The score matching objective 𝔼pdata[...] averages over the
data distribution. This means the model is trained to be accurate where the data lives —
near the high-density regions. In the vast low-density regions between modes, the model receives
almost no training signal. Its score estimates there can be arbitrary.
Why does this matter? Because Langevin dynamics starts from random noise — which is almost certainly in a low-density region. Before the chain can reach a high-density mode, it must traverse a desert of space where the score estimates are unreliable. In high dimensions, this problem is catastrophic: the volume of space grows exponentially, and even nearby modes can be separated by enormous low-density regions.
Consider a simple mixture of two Gaussians in 1000 dimensions. The probability of a random point falling near either mode is astronomically small. The Langevin chain, starting from a random location, must navigate through thousands of dimensions using a score estimate that was never trained for this region of space. It wanders aimlessly, mixing slowly (or never), producing poor samples.
In d dimensions, a Gaussian with variance σ2 concentrates on a thin shell at radius ≈ σ√d. Most of the volume of ℝd is far from any data. For d = 784 (a 28×28 image), even points that look "close" to the data manifold in terms of pixel distance can be astronomically far in terms of probability. The score model has never seen such points during training and has no reason to point in a useful direction.
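The shell-concentration claim is easy to check empirically (NumPy sketch with d = 784):

```python
import numpy as np

# Concentration of measure: draws from N(0, I) in d = 784 dimensions have
# norms tightly clustered around sqrt(d) = 28 — a thin shell far from the origin.
rng = np.random.default_rng(4)
d = 784
x = rng.normal(size=(10_000, d))
radii = np.linalg.norm(x, axis=1)
print(radii.mean())   # approx 28 = sqrt(784)
print(radii.std())    # under 1: the shell is thin relative to its radius
```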
This is not just a theoretical concern — it was the primary practical limitation of the original score matching approach. Samples from naive Langevin dynamics with a learned score model were noisy, mode-collapsed, or simply wrong. Something more was needed.
Noise Conditional Score Networks (NCSN)
Song & Ermon (2019) proposed an elegant solution to the low-density problem: add noise at multiple scales and train the score model to handle all of them. This is the Noise Conditional Score Network (NCSN) framework.
The idea is to define a sequence of noise levels:

σ1 > σ2 > … > σL > 0
where σ1 is large (comparable to the diameter of the data distribution)
and σL is small (negligible perturbation). At each noise level, we
define a perturbed distribution:

qσi(x̃) = ∫ 𝒩(x̃; x, σi²I) pdata(x) dx
A single neural network sθ(x, σ) is trained to estimate the score
at all noise levels simultaneously:

L(θ) = (1/L) Σi=1…L λ(σi) 𝔼x, x̃[ ‖sθ(x̃, σi) + (x̃ - x)/σi²‖² ]
The weighting λ(σi) = σi² is chosen so that the loss magnitude is approximately equal across noise levels. The target score (x̃ - x)/σi² has magnitude ∝ 1/σi, so each unweighted squared-error term scales as 1/σi² — without the weighting, the large-σ terms (small scores, small losses) would be dwarfed by the small-σ terms.
Why does this solve the low-density problem? At the largest noise level σ1,
the perturbed distribution qσ1 has substantial density
everywhere in the relevant region of space. The heavy noise "fills in" the gaps between modes,
ensuring the score model receives training signal across the entire space. As we decrease σ,
the modes sharpen and separate, but by then the Langevin chain has already found the right
neighborhood.
Annealed Langevin dynamics
Sampling from a NCSN uses annealed Langevin dynamics: run Langevin dynamics at each noise level in sequence, from largest to smallest, using the previous level's result as the starting point for the next:
import math
import torch

def annealed_langevin(score_net, sigmas, d, T_per_level, eps=2e-5):
    x = torch.randn(d)                           # start from pure noise
    for sigma in sigmas:                         # noise levels, largest to smallest
        eta = eps * (sigma / sigmas[-1]) ** 2    # step size scales with sigma^2 (Song & Ermon, 2019)
        for _ in range(T_per_level):
            z = torch.randn_like(x)
            x = x + (eta / 2) * score_net(x, sigma) + math.sqrt(eta) * z
    return x
At the highest noise level (σ1), the distribution is nearly Gaussian and easy to sample — the score is accurate everywhere. The Langevin chain quickly finds the general region of the data. At each subsequent level, the noise decreases, the modes sharpen, and the chain refines the sample with increasing precision. By the final level (σL), the chain is performing fine-grained corrections on an already good sample.
This coarse-to-fine strategy is strikingly similar to the reverse process in DDPMs — and it is not a coincidence, as we'll see in Section 8.
| Method | Training Objective | Sampling | Key Innovation |
|---|---|---|---|
| Score Matching (Hyvärinen 2005) | tr(∇ sθ) + ½‖sθ‖2 | Langevin dynamics | No true score needed; integration by parts |
| Denoising SM (Vincent 2011) | ‖sθ(x̃) + (x̃-x)/σ2‖2 | Langevin dynamics | No Jacobian trace; noise as supervision |
| NCSN (Song & Ermon 2019) | Multi-scale DSM across σ1,...,σL | Annealed Langevin | Multi-scale noise covers low-density regions |
| DDPM (Ho et al. 2020) | ‖ε - εθ(xt, t)‖2 | Ancestral sampling | Noise prediction; simplified VLB |
Multi-scale sampling: particles start from noise, then refine through decreasing noise levels (σ1 → σL). Watch coarse-to-fine convergence.
The Unified View
For two years, DDPMs and NCSNs developed as parallel lines of research with different vocabularies, different training objectives, and different sampling algorithms. DDPMs spoke of noise schedules, ᾱt, and ε-prediction. NCSNs spoke of score functions, noise levels σi, and Langevin dynamics. Yet practitioners noticed a suspicious coincidence: the two approaches produced remarkably similar results with remarkably similar algorithms.
Song et al. (2021) — in the seminal "Score-Based Generative Modeling through Stochastic Differential Equations" paper — proved that the coincidence was no accident. DDPMs and NCSNs are discretizations of the same continuous-time stochastic process. They are literally the same model, viewed through different mathematical lenses.
The unifying identity is the relationship between the noise predictor and the score:

sθ(xt, t) = -εθ(xt, t) / √(1 - ᾱt)
This is not a loose analogy — it is an exact algebraic equivalence. At every timestep t, the DDPM
noise prediction εθ is exactly the score sθ scaled by
-√(1 - ᾱt). The DDPM training loss
‖ε - εθ‖2 is a reweighted version
of the denoising score matching loss at noise level σt = √(1 - ᾱt).
Similarly, the DDPM noise schedule induces a sequence of effective noise levels. Setting
σt2 = (1 - ᾱt) / ᾱt
(the inverse SNR) maps the DDPM schedule to the NCSN noise sequence. The DDPM reverse step is a
discretized Langevin step with a specific step size tied to the schedule.
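This mapping is easy to compute. The sketch below builds ᾱt from a linear β schedule (endpoints as in Ho et al. (2020), used here for illustration) and derives the induced NCSN-style noise levels:

```python
import numpy as np

# Map a DDPM linear beta schedule onto NCSN-style noise levels via
# sigma_t^2 = (1 - abar_t) / abar_t (the inverse signal-to-noise ratio).
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule, beta_1 to beta_T
abars = np.cumprod(1.0 - betas)       # abar_t = prod_{s<=t} (1 - beta_s)
sigmas = np.sqrt((1.0 - abars) / abars)

print(sigmas[0], sigmas[-1])          # tiny noise at t=1, large noise at t=T
print(bool(np.all(np.diff(sigmas) > 0)))  # True: noise grows monotonically with t
```

The result is exactly an NCSN noise sequence run in reverse: small σ at the data end, large σ at the noise end.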
A single neural network can be viewed as predicting any of three equivalent quantities, related by the schedule constants at timestep t:

εθ(xt, t) = -√(1 - ᾱt) sθ(xt, t),   x̂0 = (xt - √(1 - ᾱt) εθ(xt, t)) / √ᾱt

where x̂0 is the predicted clean data. These are:
- ε-prediction: the noise added by the forward process (the DDPM parameterization);
- score prediction: sθ(xt, t) ≈ ∇x log pt(xt) (the NCSN parameterization);
- x0-prediction: the predicted clean data x̂0.
Different papers and codebases prefer different parameterizations. Stable Diffusion uses ε-prediction. Some recent architectures (like Imagen) use v-prediction, a hybrid. But they all train the same underlying model.
The continuous-time perspective of Song et al. (2021) goes even further, describing the forward process as a stochastic differential equation (SDE) and the reverse process as the time-reversed SDE — a framework that encompasses DDPMs, NCSNs, and more. We explore this SDE viewpoint fully in Article 04.
For now, the essential takeaway: score-based models and denoising diffusion models are the same thing. The score function is the conceptual backbone; noise prediction is the practical implementation. Learning to denoise is learning the score. Following the score is reversing diffusion. The mathematical universe of generative modeling through noise has a single, coherent structure — and the score function is its spine.
References
Seminal papers and key works referenced in this article.
- Song & Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS, 2019.
- Song & Ermon. "Improved Techniques for Training Score-Based Generative Models." NeurIPS, 2020.
- Hyvärinen. "Estimation of Non-Normalized Statistical Models by Score Matching." JMLR, 2005.
- Vincent. "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation, 2011.
- Ho, Jain & Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR, 2021.