Introduction

In Article 02 we learned that DDPMs train a neural network to predict the noise ε added to data. That framing is clean, practical, and led to stunning image generation results. But it hides a deeper mathematical story — one that began years before DDPMs and unifies an entire family of generative models under a single elegant concept: the score function.

The score function is the gradient of the log-probability density: s(x) = ∇x log p(x). It tells you the direction in which probability increases fastest — a vector field that points every location in space toward regions of higher density. If you can learn this vector field accurately, you can generate samples by simply following it. No adversarial training. No decoder. No normalization constant.

This idea was crystallized by Song & Ermon (2019) in their landmark paper on score-based generative models, building on classical results from Hyvärinen (2005) and Vincent (2011). Their framework — Noise Conditional Score Networks (NCSN) — approached diffusion models from a fundamentally different angle than the DDPM perspective, yet arrived at remarkably similar algorithms. The deep connection between the two was later formalized by Song et al. (2021), revealing that DDPMs and NCSNs are two views of the same mathematical framework.

This article traces the complete arc: from the score function definition, through score matching objectives that avoid the need for the intractable data density, to Langevin dynamics as a sampling engine, and finally to the multi-scale noise trick that makes it all work in practice.

ℹ Prerequisites

This article assumes familiarity with the DDPM framework from Article 02 — particularly the noise prediction parameterization εθ(xt, t), the noise schedule constants ᾱt, and the simplified training loss. We'll show how ε-prediction and score-prediction are two sides of the same coin.

The Score Function

Let p(x) be a probability density over x ∈ ℝᵈ. The score function (also called the Stein score) is defined as the gradient of the log-density with respect to the data:

s(x) = ∇x log p(x)

This is a vector field: at every point x in data space, it produces a d-dimensional vector pointing in the direction of steepest ascent of log p(x). Where the density is high, the score vectors are small (you're already near a mode). Where density is increasing rapidly — on the flanks of a mode — the score vectors are large and point toward the peak.

Think of log p(x) as a landscape of rolling hills. The score function is the slope at every point. A ball placed anywhere on this landscape and pushed in the direction of the score would roll uphill toward the nearest peak — the nearest high-density region. This is exactly the intuition behind Langevin dynamics sampling, which we'll reach in Section 5.

Score of a Gaussian

The simplest instructive example: for a multivariate Gaussian p(x) = 𝒩(x; μ, Σ), the score is:

s(x) = ∇x log p(x) = -Σ⁻¹(x - μ)

The score points from x directly toward the mean μ, scaled by the inverse covariance. For an isotropic Gaussian 𝒩(0, σ²I), this simplifies to s(x) = -x/σ² — every point is pulled toward the origin with a force inversely proportional to the variance. High variance means a gentle pull (the distribution is spread out); low variance means a strong pull (the distribution is concentrated).
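This analytic score is easy to sanity-check numerically: the gradient of the unnormalized log-density, estimated by finite differences, should match -x/σ² exactly. A minimal sketch in NumPy (the function names here are illustrative, not from any library):

```python
import numpy as np

def log_density_isotropic(x, sigma):
    # log N(x; 0, sigma^2 I), up to the normalization constant
    return -0.5 * np.sum(x**2) / sigma**2

def numerical_score(logp, x, eps=1e-5):
    # central finite differences of log p at x, one coordinate at a time
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (logp(x + e) - logp(x - e)) / (2 * eps)
    return g

sigma = 2.0
x = np.array([1.5, -0.5])
analytic = -x / sigma**2   # s(x) = -x / sigma^2
numeric = numerical_score(lambda v: log_density_isotropic(v, sigma), x)
print(np.allclose(analytic, numeric, atol=1e-4))   # True
```

Note that the normalization constant was never needed — a preview of the next subsection.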

Freedom from the normalization constant

Here is the crucial property that makes score-based methods so powerful. Any density can be written as:

p(x) = exp(f(x)) / Z

where f(x) is an unnormalized log-density and Z = ∫ exp(f(x)) dx is the normalization constant (partition function). Computing Z requires integrating over all of ℝᵈ — which is intractable for any interesting high-dimensional distribution. This is the fundamental bottleneck of energy-based models, undirected graphical models, and many other frameworks.

∑ The normalization constant vanishes

Taking the gradient of the log-density:

∇x log p(x) = ∇x [f(x) - log Z] = ∇x f(x)

Since Z is a constant with respect to x, its gradient is zero: ∇x log Z = 0. The score function depends only on the unnormalized density. We never need to compute Z.

This is not a trick or an approximation — it is an exact mathematical identity. Any model that parameterizes an unnormalized density can be trained via its score without ever computing the partition function.

This property is what makes the score function so attractive for generative modeling. We can train a neural network sθ(x) to approximate the data score ∇x log pdata(x) without ever needing to know pdata(x) itself — we just need samples from it, which is exactly what our training dataset provides.
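The Z-independence is easy to see in one dimension: two log-densities that differ by an additive constant (i.e. densities differing by a multiplicative factor, hence by Z) have identical scores. A tiny sketch:

```python
import numpy as np

def score_fd(logp, x, eps=1e-5):
    # one-dimensional finite-difference score
    return (logp(x + eps) - logp(x - eps)) / (2 * eps)

f = lambda x: -0.5 * x**2   # unnormalized log-density of a standard Gaussian
g = lambda x: f(x) + 7.3    # same density scaled by e^7.3, i.e. a different Z

x = 0.8
print(score_fd(f, x), score_fd(g, x))   # identical: the constant drops out
```

The constant 7.3 is arbitrary — any shift cancels in the gradient, which is the whole point.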

[Interactive: Score Field Visualizer — a mixture of two Gaussians shown as a density heatmap. Arrows show the score field ∇x log p(x), pointing toward high-density regions. Increasing σ shows how noise smooths the landscape.]

Score Matching

We want to train a neural network sθ(x) to approximate the true data score ∇x log pdata(x). The most natural objective is to minimize the expected squared error between them:

J(θ) = ½ 𝔼pdata [‖sθ(x) - ∇x log pdata(x)‖²]

But there's an immediate problem: we don't know ∇x log pdata(x). We have samples from pdata (our training set), but not the density itself, much less its gradient. This seems like a chicken-and-egg problem — we need the score to train the score.

The breakthrough came from Hyvärinen (2005), who showed that this objective can be rewritten into a form that requires only the model score sθ and samples from pdata — the true score drops out entirely. This is the score matching identity.

Implicit score matching (the integration-by-parts trick)

Start by expanding the squared norm:

J(θ) = ½ 𝔼p [‖sθ(x)‖² - 2 sθ(x) · ∇x log p(x) + ‖∇x log p(x)‖²]

The last term is a constant with respect to θ — it depends only on the true data distribution, not our model. So we can drop it for optimization. The tricky term is the middle one: 𝔼p[sθ(x) · ∇x log p(x)]. This is where the magic happens.

Using integration by parts (or equivalently, Stein's identity), and assuming the density vanishes at infinity (a mild regularity condition), the cross term can be rewritten:

𝔼p[sθ(x) · ∇x log p(x)] = -𝔼p[∇x · sθ(x)] = -𝔼p[tr(∇x sθ(x))]

where tr(∇x sθ) is the trace of the Jacobian of the score model — the sum of the diagonal entries Σi ∂sθ,i / ∂xi. This depends only on our model, not on the true score.

∑ Hyvärinen's implicit score matching objective

Putting it together, we can minimize the equivalent objective:

JISM(θ) = 𝔼pdata [tr(∇x sθ(x)) + ½ ‖sθ(x)‖²]

This is remarkable: the true score has completely disappeared. We need only (1) evaluate sθ(x) at data samples, (2) compute the trace of its Jacobian, and (3) minimize. No adversarial training, no normalization constant, no explicit density.

The catch is computational: the Jacobian trace requires d backpropagation passes (one per input dimension), which is prohibitive for high-dimensional data like images. This motivated two practical alternatives: sliced score matching (Song et al., 2019), which uses random projections to estimate the trace stochastically, and denoising score matching (Vincent, 2011), which sidesteps the Jacobian entirely.
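The random-projection idea rests on Hutchinson's trace estimator: for any matrix J and random vectors v with 𝔼[vvᵀ] = I, we have 𝔼[vᵀJv] = tr(J). A toy sketch with a linear score model s(x) = Ax (so the Jacobian is just A and its trace is known exactly; the linear model is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d)) / np.sqrt(d)   # toy linear score model: s(x) = A @ x
exact_trace = np.trace(A)

# Hutchinson estimator: E[v^T (J v)] = tr(J) when E[v v^T] = I.
# For s(x) = A x, the Jacobian-vector product J v is simply A @ v,
# which a score network would compute with one backward pass per probe.
n_probes = 20000
vs = rng.choice([-1.0, 1.0], size=(n_probes, d))   # Rademacher probe vectors
estimates = np.einsum('ij,ij->i', vs, vs @ A.T)    # v^T (A v) for each probe
print(exact_trace, estimates.mean())                # the two agree closely
```

With a neural score model, each vᵀ(∇sθ)v costs one Jacobian-vector product instead of d backward passes — that is the efficiency gain sliced score matching exploits.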

Denoising Score Matching

Vincent (2011) discovered a beautifully practical alternative to implicit score matching. Instead of matching the score of the clean data distribution pdata(x), we match the score of a noise-perturbed distribution. The key insight: if we know the perturbation kernel, we know the perturbed score in closed form — no Jacobians required.

Given a clean data point x drawn from pdata, we create a noisy version by adding Gaussian noise:

qσ(x̃ | x) = 𝒩(x̃; x, σ²I)

The score of this perturbation kernel is:

∇x̃ log qσ(x̃ | x) = -(x̃ - x) / σ²

This is just the direction from the noisy point back to the clean point, divided by the noise variance. It's the "denoising direction" — the vector that would undo the noise if followed exactly.

The denoising score matching (DSM) objective trains the model to predict this known score:

JDSM(θ) = ½ 𝔼pdata(x) 𝔼qσ(x̃|x) [‖sθ(x̃) - ∇ log qσ(x̃ | x)‖²]

Substituting the known score:

JDSM(θ) = ½ 𝔼x, ε [‖sθ(x + σε) + ε/σ‖²]

where ε ~ 𝒩(0, I). Vincent (2011) proved that minimizing this DSM objective is equivalent (up to a constant) to minimizing the original score matching objective with respect to the noise-perturbed distribution qσ(x̃) = ∫ qσ(x̃|x) pdata(x) dx. As σ → 0, we recover the true data score.
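Vincent's equivalence can be verified on a toy problem where everything is known in closed form. For 1-D Gaussian data 𝒩(0, s²) perturbed with noise of variance σ², the perturbed marginal is 𝒩(0, s² + σ²), whose score is -x/(s² + σ²). If we restrict the model to sθ(x) = a·x, the DSM loss is quadratic in a and its minimizer can be computed directly from samples — and it lands on the true perturbed score. A sketch (the toy values s_data and sigma are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
s_data, sigma = 1.0, 0.5            # data std and perturbation noise (toy values)
n = 200_000
x = rng.normal(0, s_data, n)         # 1-D "dataset"
eps = rng.normal(0, 1, n)
x_tilde = x + sigma * eps            # noisy samples

# Linear score model s_theta(x) = a * x. The DSM loss
#   E[(a * x_tilde + eps / sigma)^2]
# is quadratic in a, with closed-form minimizer:
a_star = -np.mean(x_tilde * eps / sigma) / np.mean(x_tilde**2)

# True score of the perturbed marginal N(0, s^2 + sigma^2) is -x / (s^2 + sigma^2)
print(a_star, -1 / (s_data**2 + sigma**2))   # both ≈ -0.8
```

Minimizing a purely supervised denoising loss recovered the score of qσ — exactly Vincent's theorem, in miniature.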

💡 Denoising score matching = denoising

The DSM objective is literally: corrupt data with noise, then train a network to point back toward the clean data. The predicted score vector sθ(x̃) should equal -(x̃ - x)/σ², which is the direction from noisy to clean, divided by σ². Score estimation is denoising. This is the same intuition behind DDPMs, but from the score perspective.

The connection to DDPM noise prediction

The connection between denoising score matching and DDPM's noise prediction is not just conceptual — it is an exact mathematical identity. Recall that in DDPM, the noise prediction network εθ is trained to predict the noise ε that was added. In the score framework, sθ predicts the score -(x̃ - x)/σ² = -ε/σ. Therefore:

εθ(xt, t) = -√(1 - ᾱt) · sθ(xt, t)

Or equivalently: sθ(xt, t) = -εθ(xt, t) / √(1 - ᾱt). The noise predictor and the score predictor are the same network, up to a timestep-dependent scaling factor. Training one is exactly equivalent to training the other.
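The identity is a one-line conversion in either direction. Two tiny helpers (the names are illustrative, not from any library):

```python
import numpy as np

def score_from_eps(eps_pred, alpha_bar_t):
    # s_theta(x_t, t) = -eps_theta(x_t, t) / sqrt(1 - alpha_bar_t)
    return -eps_pred / np.sqrt(1.0 - alpha_bar_t)

def eps_from_score(score_pred, alpha_bar_t):
    # eps_theta(x_t, t) = -sqrt(1 - alpha_bar_t) * s_theta(x_t, t)
    return -np.sqrt(1.0 - alpha_bar_t) * score_pred

eps = np.array([0.3, -1.2])
abar = 0.91
roundtrip = eps_from_score(score_from_eps(eps, abar), abar)
print(np.allclose(roundtrip, eps))   # True
```

In practice this is how pretrained ε-prediction networks are plugged into score-based samplers: divide by -√(1 - ᾱt) and treat the output as a score.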

This identity is profound. It means that every DDPM ever trained is already a score-based model. The ε-prediction framing and the score-prediction framing are not competing paradigms — they are notational variants of the same underlying mathematics.

[Interactive: Denoising Score Matching — clean data points (green) are corrupted by noise to produce noisy versions (blue). Score arrows point from noisy back toward clean — this is what the network learns. Score ∝ -(x̃ - x)/σ².]

Langevin Dynamics

We now have a way to estimate the score function from data (via denoising score matching). But how do we actually generate samples using a learned score? Enter Langevin dynamics — a Markov chain Monte Carlo (MCMC) algorithm that uses only the score function to produce samples from the target distribution.

The algorithm is strikingly simple. Starting from any initial point x0 ~ π0 (typically random noise), we iterate:

xt+1 = xt + (η/2) ∇x log p(xt) + √η · zt,    zt ~ 𝒩(0, I)

Each step has two components: a gradient ascent term (η/2)∇x log p(xt) that pushes the sample toward higher-density regions, and a noise injection term √η zt that adds random exploration. The step size η controls both.

The gradient term alone would just find the nearest mode (gradient ascent on log p). The noise term is crucial: it prevents the chain from collapsing to a single point and ensures the samples explore the full distribution. Too little noise and the chain gets stuck at modes. Too much noise and it never converges. The precise √η scaling is what makes it work — it comes from the physics of Brownian motion.
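Since the update rule needs nothing but the score, a few lines suffice to sample a distribution whose score we know analytically. A minimal NumPy sketch targeting a standard normal, whose score is s(x) = -x (step size and chain length are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def langevin(score, x0, eta=0.05, steps=1000):
    # Unadjusted Langevin dynamics: x <- x + (eta/2) * score(x) + sqrt(eta) * z
    x = x0.copy()
    for _ in range(steps):
        z = rng.normal(size=x.shape)
        x = x + 0.5 * eta * score(x) + np.sqrt(eta) * z
    return x

# Target: standard normal N(0, 1), score s(x) = -x. Run 5000 chains in parallel,
# deliberately initialized far too spread out (std 3).
samples = langevin(lambda x: -x, x0=rng.normal(size=5000) * 3.0)
print(samples.mean(), samples.std())   # ≈ 0 and ≈ 1
```

The finite step size introduces the small bias discussed below: the stationary variance is slightly above 1 for η > 0, shrinking as η → 0.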

Convergence guarantees

Under mild regularity conditions (log-concavity, or more generally, a Poincaré inequality), the distribution of xt converges to p(x) as t → ∞ and η → 0. In practice, we use a finite step size and finite iterations, which introduces some bias — but for sufficiently small η and large t, the samples are excellent approximations.

The critical observation for generative modeling: Langevin dynamics requires only the score function to sample. If we have a good estimate sθ(x) ≈ ∇x log p(x), we can replace the true score in the update rule:

xt+1 = xt + (η/2) sθ(xt) + √η · zt

This gives us a complete generative pipeline: (1) train sθ on data via denoising score matching, (2) sample by running Langevin dynamics with the learned score. No decoder, no adversarial game, no explicit density evaluation — just a score network and an iterative update rule.

[Interactive: Langevin Dynamics Sampler — particles start from random positions and perform Langevin MCMC on a mixture-of-Gaussians target, converging to the high-density regions.]

The Low-Density Region Challenge

The score matching + Langevin dynamics pipeline sounds clean in theory. In practice, it has a devastating failure mode that becomes worse in high dimensions: the score estimate is unreliable in low-density regions.

The score matching objective 𝔼pdata[...] averages over the data distribution. This means the model is trained to be accurate where the data lives — near the high-density regions. In the vast low-density regions between modes, the model receives almost no training signal. Its score estimates there can be arbitrary.

Why does this matter? Because Langevin dynamics starts from random noise — which is almost certainly in a low-density region. Before the chain can reach a high-density mode, it must traverse a desert of space where the score estimates are unreliable. In high dimensions, this problem is catastrophic: the volume of space grows exponentially, and even nearby modes can be separated by enormous low-density regions.

Consider a simple mixture of two Gaussians in 1000 dimensions. The probability of a random point falling near either mode is astronomically small. The Langevin chain, starting from a random location, must navigate through thousands of dimensions using a score estimate that was never trained for this region of space. It wanders aimlessly, mixing slowly (or never), producing poor samples.

ℹ The curse of dimensionality for score estimation

In d dimensions, a Gaussian with variance σ² concentrates on a thin shell at radius ≈ σ√d. Most of the volume of ℝᵈ is far from any data. For d = 784 (a 28×28 image), even points that look "close" to the data manifold in terms of pixel distance can be astronomically far in terms of probability. The score model has never seen such points during training and has no reason to point in a useful direction.
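The shell concentration is easy to observe directly: sampled radii cluster tightly around σ√d, with a spread that stays roughly constant as d grows. A quick NumPy check:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0
for d in [2, 100, 784]:
    x = rng.normal(0, sigma, size=(10_000, d))
    radii = np.linalg.norm(x, axis=1)
    # radii concentrate near sigma * sqrt(d); their spread stays ~constant in d
    print(d, radii.mean(), sigma * np.sqrt(d), radii.std())
```

At d = 784 essentially every sample sits within a few units of radius 28 — almost none are near the origin, even though that is where the density is pointwise highest.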

This is not just a theoretical concern — it was the primary practical limitation of the original score matching approach. Samples from naive Langevin dynamics with a learned score model were noisy, mode-collapsed, or simply wrong. Something more was needed.

Noise Conditional Score Networks (NCSN)

Song & Ermon (2019) proposed an elegant solution to the low-density problem: add noise at multiple scales and train the score model to handle all of them. This is the Noise Conditional Score Network (NCSN) framework.

The idea is to define a sequence of noise levels:

σ1 > σ2 > ... > σL

where σ1 is large (comparable to the diameter of the data distribution) and σL is small (negligible perturbation). At each noise level, we define a perturbed distribution:

qσ(x̃) = ∫ 𝒩(x̃; x, σ²I) pdata(x) dx

A single neural network sθ(x, σ) is trained to estimate the score at all noise levels simultaneously:

JNCSN(θ) = ½ Σi=1..L λ(σi) 𝔼pdata(x) 𝔼x̃ ~ 𝒩(x, σi²I) [‖sθ(x̃, σi) + (x̃ - x)/σi²‖²]

The weighting λ(σi) = σi² is chosen so that the loss magnitude is approximately equal across noise levels. The target score at level σi has magnitude on the order of 1/σi, so without reweighting the small-σ terms would dominate; with λ(σi) = σi², each term becomes ‖σi sθ + ε‖², which is O(1) at every level.

Why does this solve the low-density problem? At the largest noise level σ1, the perturbed distribution qσ1 has substantial density everywhere in the relevant region of space. The heavy noise "fills in" the gaps between modes, ensuring the score model receives training signal across the entire space. As we decrease σ, the modes sharpen and separate, but by then the Langevin chain has already found the right neighborhood.
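The "filling in" effect can be checked numerically on a toy 1-D mixture: compare the log-density at the midpoint between two modes with the log-density at a mode, for increasing perturbation levels (the mode locations and widths below are assumed toy values):

```python
import numpy as np

def mixture_logdensity(x, sigma, modes=(-3.0, 3.0), s=0.1):
    # equal mixture of N(mode, s^2); perturbing with N(0, sigma^2 I) gives
    # a mixture of N(mode, s^2 + sigma^2) in closed form
    var = s**2 + sigma**2
    comps = [np.exp(-0.5 * (x - m)**2 / var) / np.sqrt(2 * np.pi * var)
             for m in modes]
    return np.log(0.5 * sum(comps))

# log-density at the midpoint between modes, relative to a mode
gaps = {sigma: mixture_logdensity(0.0, sigma) - mixture_logdensity(3.0, sigma)
        for sigma in [0.1, 1.0, 3.0]}
print(gaps)   # the midpoint's log-density deficit shrinks dramatically as sigma grows
```

At σ = 0.1 the midpoint is hundreds of nats below the modes — no training signal ever lands there. At σ = 3.0 the perturbed landscape is nearly flat across the gap, so the score model sees data everywhere between the modes.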

Annealed Langevin dynamics

Sampling from an NCSN uses annealed Langevin dynamics: run Langevin dynamics at each noise level in sequence, from largest to smallest, using the previous level's result as the starting point for the next:

import torch

def annealed_langevin(score_net, sigmas, d, T_per_level=100, eps=2e-5):
    x = torch.randn(d)                         # start from pure noise
    for sigma in sigmas:                       # noise levels, largest to smallest
        eta = eps * (sigma / sigmas[-1])**2    # step size scales with sigma^2 (NCSN choice)
        for _ in range(T_per_level):
            z = torch.randn_like(x)
            x = x + (eta / 2) * score_net(x, sigma) + eta**0.5 * z
    return x

At the highest noise level (σ1), the distribution is nearly Gaussian and easy to sample — the score is accurate everywhere. The Langevin chain quickly finds the general region of the data. At each subsequent level, the noise decreases, the modes sharpen, and the chain refines the sample with increasing precision. By the final level (σL), the chain is performing fine-grained corrections on an already good sample.

This coarse-to-fine strategy is strikingly similar to the reverse process in DDPMs — and it is not a coincidence, as we'll see in Section 8.

Method | Training Objective | Sampling | Key Innovation
Score Matching (Hyvärinen 2005) | tr(∇x sθ) + ½‖sθ‖² | Langevin dynamics | No true score needed; integration by parts
Denoising SM (Vincent 2011) | ‖sθ(x̃) + (x̃ - x)/σ²‖² | Langevin dynamics | No Jacobian trace; noise as supervision
NCSN (Song & Ermon 2019) | Multi-scale DSM across σ1, ..., σL | Annealed Langevin | Multi-scale noise covers low-density regions
DDPM (Ho et al. 2020) | ‖ε - εθ(xt, t)‖² | Ancestral sampling | Noise prediction; simplified VLB
[Interactive: Annealed Langevin Dynamics — multi-scale sampling with 5 noise levels: particles start from noise, then refine through decreasing noise levels (σ1 → σL), showing coarse-to-fine convergence.]

The Unified View

For two years, DDPMs and NCSNs developed as parallel lines of research with different vocabularies, different training objectives, and different sampling algorithms. DDPMs spoke of noise schedules, ᾱt, and ε-prediction. NCSNs spoke of score functions, noise levels σi, and Langevin dynamics. Yet practitioners noticed a suspicious coincidence: the two approaches produced remarkably similar results with remarkably similar algorithms.

Song et al. (2021) — in the seminal "Score-Based Generative Modeling through Stochastic Differential Equations" paper — proved that the coincidence was no accident. DDPMs and NCSNs are discretizations of the same continuous-time stochastic process. They are literally the same model, viewed through different mathematical lenses.

The unifying identity is the relationship between the noise predictor and the score:

εθ(xt, t) = -√(1 - ᾱt) · sθ(xt, t)

This is not a loose analogy — it is an exact algebraic equivalence. At every timestep t, the DDPM noise prediction εθ is exactly the score sθ scaled by -√(1 - ᾱt). The DDPM training loss ‖ε - εθ‖² is a reweighted version of the denoising score matching loss at noise level σt = √(1 - ᾱt).

Similarly, the DDPM noise schedule induces a sequence of effective noise levels. Setting σt² = (1 - ᾱt) / ᾱt (the inverse SNR) maps the DDPM schedule to the NCSN noise sequence. The DDPM reverse step is a discretized Langevin step with a specific step size tied to the schedule.
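The schedule mapping is a few lines of NumPy. Using the linear β schedule from Ho et al. (2020) as a concrete example:

```python
import numpy as np

# Linear DDPM beta schedule (the Ho et al. 2020 values), T = 1000
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Effective NCSN-style noise levels: sigma_t^2 = (1 - alpha_bar_t) / alpha_bar_t
sigmas = np.sqrt((1.0 - alpha_bar) / alpha_bar)
print(sigmas[0], sigmas[-1])   # tiny at t=1, very large at t=T
```

The result is a monotonically increasing noise ladder spanning several orders of magnitude — exactly the σ1 > ... > σL sequence an NCSN would use, read in reverse.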

∑ Three equivalent parameterizations

A single neural network can be viewed as predicting any of three equivalent quantities, related by the schedule constants at timestep t:

εθ = -√(1 - ᾱt) · sθ = (xt - √ᾱt · x̂0) / √(1 - ᾱt)

Where x̂0 = (xt - √(1 - ᾱt) · εθ) / √ᾱt is the predicted clean data. These are:

Noise prediction: εθ(xt, t)   |   Score prediction: sθ(xt, t)   |   Data prediction: x̂0(xt, t)

Different papers and codebases prefer different parameterizations. Stable Diffusion uses ε-prediction; other models, such as Imagen Video, use v-prediction, a hybrid of the two. But they all train the same underlying model.

The continuous-time perspective of Song et al. (2021) goes even further, describing the forward process as a stochastic differential equation (SDE) and the reverse process as the time-reversed SDE — a framework that encompasses DDPMs, NCSNs, and more. We explore this SDE viewpoint fully in Article 04.

For now, the essential takeaway: score-based models and denoising diffusion models are the same thing. The score function is the conceptual backbone; noise prediction is the practical implementation. Learning to denoise is learning the score. Following the score is reversing diffusion. The mathematical universe of generative modeling through noise has a single, coherent structure — and the score function is its spine.

References

Seminal papers and key works referenced in this article.

  1. Song & Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS, 2019.
  2. Song & Ermon. "Improved Techniques for Training Score-Based Generative Models." NeurIPS, 2020.
  3. Hyvärinen. "Estimation of Non-Normalized Statistical Models by Score Matching." JMLR, 2005.
  4. Vincent. "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation, 2011.
  5. Song, Garg, Shi & Ermon. "Sliced Score Matching: A Scalable Approach to Density and Likelihood Estimation." UAI, 2019.
  6. Ho, Jain & Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
  7. Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR, 2021.