Introduction
In Article 02 we learned that DDPMs train a neural network to predict the noise ε added to data. That framing is clean, practical, and led to stunning image generation results. But it hides a deeper mathematical story — one that began years before DDPMs and unifies an entire family of generative models under a single elegant concept: the score function.
The score function is the gradient of the log-probability density: s(x) = ∇x log p(x).
It tells you the direction in which probability increases fastest — a vector field that points every
location in space toward regions of higher density. If you can learn this vector field accurately, you can
generate samples by simply following it. No adversarial training. No decoder. No normalization constant.
This idea was crystallized by Song & Ermon (2019) in their landmark paper on score-based generative models, building on classical results from Hyvärinen (2005) and Vincent (2011). Their framework — Noise Conditional Score Networks (NCSN) — approached diffusion models from a fundamentally different angle than the DDPM perspective, yet arrived at remarkably similar algorithms. The deep connection between the two was later formalized by Song et al. (2021), revealing that DDPMs and NCSNs are two views of the same mathematical framework.
This article traces the complete arc: from the score function definition, through score matching objectives that avoid the need for the intractable data density, to Langevin dynamics as a sampling engine, and finally to the multi-scale noise trick that makes it all work in practice.
This article assumes familiarity with the DDPM framework from Article 02 — particularly the noise prediction parameterization εθ(xt, t), the noise schedule constants ᾱt, and the simplified training loss. We'll show how ε-prediction and score-prediction are two sides of the same coin.
The Score Function
Let p(x) be a probability density over x ∈ ℝd. The
score function (also called the Stein score) is defined as the gradient
of the log-density with respect to the data:

s(x) = ∇x log p(x)
This is a vector field: at every point x in data space, it produces a d-dimensional vector pointing in the direction of steepest ascent of log p(x). Where the density is high, the score vectors are small (you're already near a mode). Where density is increasing rapidly — on the flanks of a mode — the score vectors are large and point toward the peak.
Think of log p(x) as a landscape of rolling hills. The score function is the slope at every
point. A ball placed anywhere on this landscape and pushed in the direction of the score would roll uphill
toward the nearest peak — the nearest high-density region. This is exactly the intuition behind Langevin
dynamics sampling, which we'll reach in Section 5.
Score of a Gaussian
The simplest instructive example: for a multivariate Gaussian
p(x) = 𝒩(x; μ, Σ), the score is:

s(x) = ∇x log p(x) = -Σ⁻¹(x - μ)
The score points from x directly toward the mean μ, scaled by the inverse covariance. For an
isotropic Gaussian 𝒩(0, σ2I), this simplifies to
s(x) = -x/σ2 — every point is pulled toward the origin with a force
inversely proportional to the variance. High variance means a gentle pull (the distribution is spread out);
low variance means a strong pull (the distribution is concentrated).
Freedom from the normalization constant
Here is the crucial property that makes score-based methods so powerful. Any density can be written as:

p(x) = exp(f(x)) / Z
where f(x) is an unnormalized log-density and
Z = ∫ exp(f(x)) dx is the normalization constant (partition function). Computing Z
requires integrating over all of ℝd — which is intractable for any
interesting high-dimensional distribution. This is the fundamental bottleneck of energy-based models,
undirected graphical models, and many other frameworks.
Taking the gradient of the log-density:

∇x log p(x) = ∇x f(x) - ∇x log Z = ∇x f(x)
Since Z is a constant with respect to x, its gradient is zero: ∇x log Z = 0.
The score function depends only on the unnormalized density. We never need to compute Z.
This is not a trick or an approximation — it is an exact mathematical identity. Any model that parameterizes an unnormalized density can be trained via its score without ever computing the partition function.
This property is what makes the score function so attractive for generative modeling. We can train a neural
network sθ(x) to approximate the data score
∇x log pdata(x) without ever needing to know
pdata(x) itself — we just need samples from it, which is exactly what our
training dataset provides.
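A one-dimensional sketch makes the point concrete: differentiating the unnormalized log-density f(x) and the normalized log p(x) = f(x) - log Z gives the same score. (Toy NumPy example; the Gaussian and its known Z are chosen only so the two can be compared.)

```python
import numpy as np

# The score never sees the normalizer: grad log p = grad f when
# p(x) = exp(f(x)) / Z. One-dimensional Gaussian sketch; Z is known
# here only so we can build the normalized log p for comparison.
sigma = 1.5
Z = np.sqrt(2 * np.pi) * sigma

f = lambda x: -x**2 / (2 * sigma**2)    # unnormalized log-density
log_p = lambda x: f(x) - np.log(Z)      # normalized log-density

x, eps = 0.7, 1e-6
grad_f = (f(x + eps) - f(x - eps)) / (2 * eps)
grad_log_p = (log_p(x + eps) - log_p(x - eps)) / (2 * eps)
print(grad_f, grad_log_p)  # equal up to floating-point noise: grad log Z = 0
```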
A mixture of two Gaussians shown as a density heatmap. Arrows show the score field ∇x log p(x) — they point toward high-density regions. Increase σ to see how noise smooths the landscape.
Score Matching
We want to train a neural network sθ(x) to approximate the true
data score ∇x log pdata(x). The most natural objective
is to minimize the expected squared error between them:

J(θ) = ½ 𝔼x~pdata[ ‖sθ(x) - ∇x log pdata(x)‖² ]
But there's an immediate problem: we don't know ∇x log pdata(x).
We have samples from pdata (our training set), but not the density itself, much less
its gradient. This seems like a chicken-and-egg problem — we need the score to train the score.
The breakthrough came from Hyvärinen (2005), who showed that this objective can be rewritten
into a form that requires only the model score sθ and samples
from pdata — the true score drops out entirely. This is the score matching
identity.
Implicit score matching (the integration-by-parts trick)
Start by expanding the squared norm:

J(θ) = ½ 𝔼p[‖sθ(x)‖²] - 𝔼p[sθ(x) · ∇x log p(x)] + ½ 𝔼p[‖∇x log p(x)‖²]
The last term is a constant with respect to θ — it depends only on the true data distribution,
not our model. So we can drop it for optimization. The tricky term is the middle one:
𝔼p[sθ(x) · ∇x log p(x)].
This is where the magic happens.
Using integration by parts (or equivalently, Stein's identity), and assuming the density vanishes at infinity (a mild regularity condition), the cross term can be rewritten:

𝔼p[sθ(x) · ∇x log p(x)] = -𝔼p[tr(∇x sθ(x))]
where tr(∇x sθ) is the trace of the Jacobian of the score
model — the sum of the diagonal entries Σi ∂sθ,i / ∂xi.
This depends only on our model, not on the true score.
Putting it together, we can minimize the equivalent objective:

JISM(θ) = 𝔼p[ tr(∇x sθ(x)) + ½‖sθ(x)‖² ]
This is remarkable: the true score has completely disappeared. We need only (1) evaluate
sθ(x) at data samples, (2) compute the trace of its Jacobian,
and (3) minimize. No adversarial training, no normalization constant, no explicit density.
The catch is computational: the Jacobian trace requires d backpropagation passes (one
per input dimension), which is prohibitive for high-dimensional data like images. This motivated
two practical alternatives: sliced score matching (Song et al., 2020), which uses
random projections to estimate the trace stochastically, and denoising score matching
(Vincent, 2011), which sidesteps the Jacobian entirely.
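To make the sliced idea concrete, here is a minimal sketch of the Hutchinson-style trace estimate tr(J) = 𝔼v[vᵀ J v] with v ~ 𝒩(0, I), applied to a linear "score model" s(x) = Ax whose Jacobian is simply A. (Sizes and the model are illustrative; this is the trace estimator, not the full sliced score matching objective.)

```python
import numpy as np

# Hutchinson-style slice estimate of a Jacobian trace: tr(J) = E_v[v^T J v]
# for v ~ N(0, I). A linear score model s(x) = A x has Jacobian A everywhere,
# so the exact trace is available for comparison.
rng = np.random.default_rng(0)
d = 50
A = rng.normal(size=(d, d))
exact = np.trace(A)

n_slices = 100_000
v = rng.normal(size=(n_slices, d))              # random projection directions
est = np.einsum('ni,ij,nj->n', v, A, v).mean()  # average of v^T A v
print(exact, est)  # the estimate concentrates around the exact trace
```

The payoff: each slice needs only one Jacobian-vector product, not d backpropagation passes.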
Denoising Score Matching
Vincent (2011) discovered a beautifully practical alternative to implicit score matching. Instead of
matching the score of the clean data distribution pdata(x), we match the score of
a noise-perturbed distribution. The key insight: if we know the perturbation kernel, we know
the perturbed score in closed form — no Jacobians required.
Given a clean data point x drawn from pdata, we create a noisy
version x̃ by adding Gaussian noise:

x̃ = x + σε,  ε ~ 𝒩(0, I),  i.e.  qσ(x̃ | x) = 𝒩(x̃; x, σ²I)
The score of this perturbation kernel is:

∇x̃ log qσ(x̃ | x) = -(x̃ - x)/σ²
This is just the direction from the noisy point back to the clean point, scaled by the noise variance. It's the "denoising direction" — the vector that would undo the noise if followed exactly.
The denoising score matching (DSM) objective trains the model to predict this known score:

JDSM(θ) = 𝔼x~pdata 𝔼x̃~qσ(·|x)[ ‖sθ(x̃) - ∇x̃ log qσ(x̃ | x)‖² ]
Substituting the known score:

JDSM(θ) = 𝔼x, ε[ ‖sθ(x + σε) + ε/σ‖² ]
where ε ~ 𝒩(0, I). Vincent (2011) proved that minimizing this DSM objective
is equivalent (up to a constant) to minimizing the original score matching objective with respect to
the noise-perturbed distribution qσ(x̃) = ∫ qσ(x̃|x) pdata(x) dx.
As σ → 0, we recover the true data score.
The DSM objective is literally: corrupt data with noise, then train a network to point back toward the clean data. The predicted score vector sθ(x̃) should equal -(x̃ - x)/σ2, which is the direction from noisy to clean, divided by σ2. Score estimation is denoising. This is the same intuition behind DDPMs, but from the score perspective.
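A toy experiment shows DSM recovering the perturbed score. With data x ~ 𝒩(0, 1), the perturbed distribution is qσ = 𝒩(0, 1 + σ²), whose score is -x̃/(1 + σ²). Fitting a linear model sθ(x̃) = a·x̃ against the DSM target should recover that slope. (NumPy sketch with illustrative values.)

```python
import numpy as np

# Toy DSM: data x ~ N(0,1) perturbed at noise level sigma gives
# q_sigma = N(0, 1 + sigma^2), whose score is s(x~) = -x~ / (1 + sigma^2).
# Fit a linear model s_theta(x~) = a * x~ to the DSM target -eps/sigma.
rng = np.random.default_rng(1)
sigma = 0.5
n = 200_000
x = rng.normal(size=n)        # clean data
eps = rng.normal(size=n)      # the injected noise
x_tilde = x + sigma * eps     # noisy data
target = -eps / sigma         # known score of the perturbation kernel

# Closed-form least-squares minimizer of E[(a * x_tilde - target)^2]
a = np.dot(x_tilde, target) / np.dot(x_tilde, x_tilde)
print(a, -1 / (1 + sigma**2))  # both close to -0.8
```

Even though each individual target -ε/σ is just noise pointing back to one clean sample, the minimizer averages over all of them and lands on the true score of qσ, exactly as Vincent's theorem promises.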
The connection to DDPM noise prediction
The connection between denoising score matching and DDPM's noise prediction is not just conceptual — it is an exact mathematical identity. Recall that in DDPM, the noise prediction network εθ is trained to predict the noise ε that was added. In the score framework, sθ predicts the score -(x̃ - x)/σ² = -ε/σ. Therefore:

εθ(xt, t) = -√(1 - ᾱt) · sθ(xt, t)
Or equivalently: sθ(xt, t) = -εθ(xt, t) / √(1 - ᾱt).
The noise predictor and the score predictor are the same network, up to a timestep-dependent scaling factor.
Training one is exactly equivalent to training the other.
This identity is profound. It means that every DDPM ever trained is already a score-based model. The ε-prediction framing and the score-prediction framing are not competing paradigms — they are notational variants of the same underlying mathematics.
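The identity can be checked numerically on a toy Gaussian case where the optimal predictors are linear: with x0 ~ 𝒩(0, 1), the marginal of xt is exactly 𝒩(0, 1), whose true score is -xt. Fitting the ε-predictor's slope by least squares and rescaling by -1/√(1 - ᾱt) should give that score. (NumPy sketch; ᾱ is an arbitrary illustrative value.)

```python
import numpy as np

# Check s(x_t) = -E[eps | x_t] / sqrt(1 - abar_t) on a fully Gaussian case.
# x_0 ~ N(0,1) makes the marginal of x_t exactly N(0,1), with score -x_t.
# The optimal eps-predictor is linear here, so fit its slope by least squares.
rng = np.random.default_rng(2)
abar = 0.6
n = 500_000
x0 = rng.normal(size=n)
eps = rng.normal(size=n)
xt = np.sqrt(abar) * x0 + np.sqrt(1 - abar) * eps

slope = np.dot(xt, eps) / np.dot(xt, xt)   # approx sqrt(1 - abar)
score_slope = -slope / np.sqrt(1 - abar)   # approx -1, the true score slope
print(slope, score_slope)
```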
Clean data points (green) are corrupted by noise to produce noisy versions (blue). Score arrows point from noisy back toward clean — this is what the network learns.
Langevin Dynamics
We now have a way to estimate the score function from data (via denoising score matching). But how do we actually generate samples using a learned score? Enter Langevin dynamics — a Markov chain Monte Carlo (MCMC) algorithm that uses only the score function to produce samples from the target distribution.
The algorithm is strikingly simple. Starting from any initial point
x0 ~ π0 (typically random noise), we iterate:

xt+1 = xt + (η/2) ∇x log p(xt) + √η zt,   zt ~ 𝒩(0, I)
Each step has two components: a gradient ascent term (η/2)∇x log p(xt)
that pushes the sample toward higher-density regions, and a noise injection term
√η zt that adds random exploration. The step size η controls both.
The gradient term alone would just find the nearest mode (gradient ascent on log p). The noise term
is crucial: it prevents the chain from collapsing to a single point and ensures the samples explore
the full distribution. Too little noise and the chain gets stuck at modes. Too much noise and it
never converges. The precise √η scaling is what makes it work — it comes
from the physics of Brownian motion.
Convergence guarantees
Under mild regularity conditions (log-concavity, or more generally, a Poincaré inequality),
the distribution of xt converges to p(x) as t → ∞
and η → 0. In practice, we use a finite step size and finite iterations, which
introduces some bias — but for sufficiently small η and large t, the samples are excellent
approximations.
The critical observation for generative modeling: Langevin dynamics requires only the score
function to sample. If we have a good estimate sθ(x) ≈ ∇x log p(x),
we can replace the true score in the update rule:

xt+1 = xt + (η/2) sθ(xt) + √η zt
This gives us a complete generative pipeline: (1) train sθ on data via
denoising score matching, (2) sample by running Langevin dynamics with the learned score. No decoder,
no adversarial game, no explicit density evaluation — just a score network and an iterative update rule.
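The whole sampler fits in a few lines. The sketch below runs Langevin dynamics with the exact score of 𝒩(0, 1), s(x) = -x, starting many chains far from the mode; after burn-in, the empirical mean and variance match the target. (Step size and chain counts are illustrative.)

```python
import numpy as np

# Langevin dynamics with the exact score of N(0,1): s(x) = -x.
# Many chains start far from the mode; after burn-in their empirical
# mean and variance match the target distribution.
rng = np.random.default_rng(3)
eta = 0.01
x = np.full(5000, 10.0)          # 5000 chains, all initialized at x = 10
for _ in range(3000):
    z = rng.normal(size=x.shape)
    x = x + (eta / 2) * (-x) + np.sqrt(eta) * z

print(x.mean(), x.var())         # approx 0 and approx 1
```

Drop the noise term and every chain collapses deterministically onto the single mode at 0 with zero variance, which is exactly the failure mode described above.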
Particles start from random positions and perform Langevin MCMC on a mixture-of-Gaussians target. Watch them converge to the high-density regions.
The Low-Density Region Challenge
The score matching + Langevin dynamics pipeline sounds clean in theory. In practice, it has a devastating failure mode that becomes worse in high dimensions: the score estimate is unreliable in low-density regions.
The score matching objective 𝔼pdata[...] averages over the
data distribution. This means the model is trained to be accurate where the data lives —
near the high-density regions. In the vast low-density regions between modes, the model receives
almost no training signal. Its score estimates there can be arbitrary.
Why does this matter? Because Langevin dynamics starts from random noise — which is almost certainly in a low-density region. Before the chain can reach a high-density mode, it must traverse a desert of space where the score estimates are unreliable. In high dimensions, this problem is catastrophic: the volume of space grows exponentially, and even nearby modes can be separated by enormous low-density regions.
Consider a simple mixture of two Gaussians in 1000 dimensions. The probability of a random point falling near either mode is astronomically small. The Langevin chain, starting from a random location, must navigate through thousands of dimensions using a score estimate that was never trained for this region of space. It wanders aimlessly, mixing slowly (or never), producing poor samples.
In d dimensions, a Gaussian with variance σ2 concentrates on a thin shell at radius ≈ σ√d. Most of the volume of ℝd is far from any data. For d = 784 (a 28×28 image), even points that look "close" to the data manifold in terms of pixel distance can be astronomically far in terms of probability. The score model has never seen such points during training and has no reason to point in a useful direction.
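The shell-concentration claim is easy to check empirically (NumPy sketch with d = 784):

```python
import numpy as np

# Concentration of measure: draws from N(0, I) in d = 784 dimensions have
# norms tightly clustered around sqrt(d) = 28 — a thin shell far from the origin.
rng = np.random.default_rng(4)
d = 784
x = rng.normal(size=(10_000, d))
radii = np.linalg.norm(x, axis=1)
print(radii.mean())   # approx 28 = sqrt(784)
print(radii.std())    # under 1: the shell is thin relative to its radius
```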
This is not just a theoretical concern — it was the primary practical limitation of the original score matching approach. Samples from naive Langevin dynamics with a learned score model were noisy, mode-collapsed, or simply wrong. Something more was needed.
Noise Conditional Score Networks (NCSN)
Song & Ermon (2019) proposed an elegant solution to the low-density problem: add noise at multiple scales and train the score model to handle all of them. This is the Noise Conditional Score Network (NCSN) framework.
The idea is to define a sequence of noise levels:

σ1 > σ2 > … > σL > 0
where σ1 is large (comparable to the diameter of the data distribution)
and σL is small (negligible perturbation). At each noise level, we
define a perturbed distribution:

qσi(x̃) = ∫ 𝒩(x̃; x, σi²I) pdata(x) dx
A single neural network sθ(x, σ) is trained to estimate the score
at all noise levels simultaneously:

L(θ) = (1/L) Σi=1…L λ(σi) 𝔼x, x̃[ ‖sθ(x̃, σi) + (x̃ - x)/σi²‖² ]
The weighting λ(σi) = σi² is chosen so that the loss magnitude is approximately equal across noise levels. The target score (x̃ - x)/σi² has magnitude ∝ 1/σi, so each unweighted squared-error term scales as 1/σi² — without the weighting, the large-σ terms (small scores, small losses) would be dwarfed by the small-σ terms.
Why does this solve the low-density problem? At the largest noise level σ1,
the perturbed distribution qσ1 has substantial density
everywhere in the relevant region of space. The heavy noise "fills in" the gaps between modes,
ensuring the score model receives training signal across the entire space. As we decrease σ,
the modes sharpen and separate, but by then the Langevin chain has already found the right
neighborhood.
Annealed Langevin dynamics
Sampling from a NCSN uses annealed Langevin dynamics: run Langevin dynamics at each noise level in sequence, from largest to smallest, using the previous level's result as the starting point for the next:
import math
import torch

def annealed_langevin(score_net, sigmas, d, T_per_level, eps=2e-5):
    x = torch.randn(d)                           # start from pure noise
    for sigma in sigmas:                         # noise levels, largest to smallest
        eta = eps * (sigma / sigmas[-1]) ** 2    # step size scales with sigma^2 (Song & Ermon, 2019)
        for _ in range(T_per_level):
            z = torch.randn_like(x)
            x = x + (eta / 2) * score_net(x, sigma) + math.sqrt(eta) * z
    return x
At the highest noise level (σ1), the distribution is nearly Gaussian and easy to sample — the score is accurate everywhere. The Langevin chain quickly finds the general region of the data. At each subsequent level, the noise decreases, the modes sharpen, and the chain refines the sample with increasing precision. By the final level (σL), the chain is performing fine-grained corrections on an already good sample.
This coarse-to-fine strategy is strikingly similar to the reverse process in DDPMs — and it is not a coincidence, as we'll see in Section 8.
| Method | Training Objective | Sampling | Key Innovation |
|---|---|---|---|
| Score Matching (Hyvärinen 2005) | tr(∇ sθ) + ½‖sθ‖2 | Langevin dynamics | No true score needed; integration by parts |
| Denoising SM (Vincent 2011) | ‖sθ(x̃) + (x̃-x)/σ2‖2 | Langevin dynamics | No Jacobian trace; noise as supervision |
| NCSN (Song & Ermon 2019) | Multi-scale DSM across σ1,...,σL | Annealed Langevin | Multi-scale noise covers low-density regions |
| DDPM (Ho et al. 2020) | ‖ε - εθ(xt, t)‖2 | Ancestral sampling | Noise prediction; simplified VLB |
Multi-scale sampling: particles start from noise, then refine through decreasing noise levels (σ1 → σL). Watch coarse-to-fine convergence.
The Unified View
For two years, DDPMs and NCSNs developed as parallel lines of research with different vocabularies, different training objectives, and different sampling algorithms. DDPMs spoke of noise schedules, ᾱt, and ε-prediction. NCSNs spoke of score functions, noise levels σi, and Langevin dynamics. Yet practitioners noticed a suspicious coincidence: the two approaches produced remarkably similar results with remarkably similar algorithms.
Song et al. (2021) — in the seminal "Score-Based Generative Modeling through Stochastic Differential Equations" paper — proved that the coincidence was no accident. DDPMs and NCSNs are discretizations of the same continuous-time stochastic process. They are literally the same model, viewed through different mathematical lenses.
The unifying identity is the relationship between the noise predictor and the score:

sθ(xt, t) = -εθ(xt, t) / √(1 - ᾱt)
This is not a loose analogy — it is an exact algebraic equivalence. At every timestep t, the DDPM
noise prediction εθ is exactly the score sθ scaled by
-√(1 - ᾱt). The DDPM training loss
‖ε - εθ‖2 is a reweighted version
of the denoising score matching loss at noise level σt = √(1 - ᾱt).
Similarly, the DDPM noise schedule induces a sequence of effective noise levels. Setting
σt2 = (1 - ᾱt) / ᾱt
(the inverse SNR) maps the DDPM schedule to the NCSN noise sequence. The DDPM reverse step is a
discretized Langevin step with a specific step size tied to the schedule.
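This mapping is easy to compute. The sketch below builds ᾱt from a linear β schedule (endpoints as in Ho et al. (2020), used here for illustration) and derives the induced NCSN-style noise levels:

```python
import numpy as np

# Map a DDPM linear beta schedule onto NCSN-style noise levels via
# sigma_t^2 = (1 - abar_t) / abar_t (the inverse signal-to-noise ratio).
T = 1000
betas = np.linspace(1e-4, 0.02, T)    # linear schedule, beta_1 to beta_T
abars = np.cumprod(1.0 - betas)       # abar_t = prod_{s<=t} (1 - beta_s)
sigmas = np.sqrt((1.0 - abars) / abars)

print(sigmas[0], sigmas[-1])          # tiny noise at t=1, large noise at t=T
print(bool(np.all(np.diff(sigmas) > 0)))  # True: noise grows monotonically with t
```

The result is exactly an NCSN noise sequence run in reverse: small σ at the data end, large σ at the noise end.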
A single neural network can be viewed as predicting any of three equivalent quantities, related by the schedule constants at timestep t:

εθ(xt, t) = -√(1 - ᾱt) sθ(xt, t),   x̂0 = (xt - √(1 - ᾱt) εθ(xt, t)) / √ᾱt

where x̂0 is the predicted clean data. These are:
- ε-prediction: the noise added by the forward process (the DDPM parameterization);
- score prediction: sθ(xt, t) ≈ ∇x log pt(xt) (the NCSN parameterization);
- x0-prediction: the predicted clean data x̂0.
Different papers and codebases prefer different parameterizations. Stable Diffusion uses ε-prediction. Some recent architectures (like Imagen) use v-prediction, a hybrid. But they all train the same underlying model.
The continuous-time perspective of Song et al. (2021) goes even further, describing the forward process as a stochastic differential equation (SDE) and the reverse process as the time-reversed SDE — a framework that encompasses DDPMs, NCSNs, and more. We explore this SDE viewpoint fully in Article 04.
For now, the essential takeaway: score-based models and denoising diffusion models are the same thing. The score function is the conceptual backbone; noise prediction is the practical implementation. Learning to denoise is learning the score. Following the score is reversing diffusion. The mathematical universe of generative modeling through noise has a single, coherent structure — and the score function is its spine.
References
Seminal papers and key works referenced in this article.
- Song & Ermon. "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS, 2019.
- Song & Ermon. "Improved Techniques for Training Score-Based Generative Models." NeurIPS, 2020.
- Hyvärinen. "Estimation of Non-Normalized Statistical Models by Score Matching." JMLR, 2005.
- Vincent. "A Connection Between Score Matching and Denoising Autoencoders." Neural Computation, 2011.
- Ho, Jain & Abbeel. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
- Song, Sohl-Dickstein, Kingma, Kumar, Ermon & Poole. "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR, 2021.