EE269 Lecture 27 — Mert Pilanci, Stanford

Diffusion Models

Destroy data with noise, then learn to reverse the destruction — and generate new data from pure static.

Prerequisites: Normal distributions + Neural networks (backprop). That's it.
8 chapters · 8+ simulations · 0 assumed knowledge

Chapter 0: Why Diffusion?

Imagine you have a beautiful photograph and you start adding static — a bit of random noise at each step. After a thousand steps, the image is pure noise. Unrecognizable. All structure gone.

Now imagine you could reverse this process. Start from pure noise and, step by step, sculpt it into a photograph. Not the original one — a new one that looks just as real. That's a diffusion model.

Before diffusion models, the dominant generative approaches were GANs (adversarial training — hard to stabilize, prone to mode collapse) and VAEs (variational autoencoders — stable but blurry outputs). Diffusion models achieve the best of both: stable training and high-quality outputs. They power DALL-E 2, Stable Diffusion, and modern audio synthesis.

The key insight: Destroying data is easy — just add noise. Reversing destruction is hard — but if you add the noise in tiny, controlled steps, each reversal step is a small, learnable denoising operation. A neural network can learn "given data at noise level t, what does it look like at noise level t−1?" Repeat enough times and you go from noise to data.

[Interactive simulation: "Noise Destroys, Denoising Creates". A 1D signal is progressively destroyed by noise (left to right). A slider sets the noise step t: at t = 0 it is the clean signal, at t = T it is pure static.]

What makes diffusion models different from GANs?

Chapter 1: The Forward Process

The forward process (also called the diffusion process or noising process) is a Markov chain that gradually adds Gaussian noise to data over T steps. At each step, we mix the current signal with a small amount of random noise.

One Step of the Forward Process

Starting from data x0, the transition from step t−1 to step t is:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) · x_{t−1},   β_t I)

Let's unpack this. At each step:

| Term | What it does | Analogy |
| --- | --- | --- |
| √(1 − β_t) | Shrinks the signal slightly | Turning down the volume a tiny bit |
| β_t | Controls how much noise to add | The "noise knob" at step t |
| I | Identity matrix: noise is isotropic | Noise is equally random in all directions |

In code, one forward step is:

python
import numpy as np

def forward_step(x_prev, beta_t):
    """Sample x_t from q(x_t | x_{t-1}) for one forward step."""
    noise = np.random.randn(*x_prev.shape)    # sample ε ~ N(0, I)
    scale = np.sqrt(1 - beta_t)               # shrink factor
    x_t = scale * x_prev + np.sqrt(beta_t) * noise
    return x_t

The Noise Schedule

The values β1, β2, ..., βT form the noise schedule. Typical choices:

| Schedule | Formula | Properties |
| --- | --- | --- |
| Linear | β_t = β_1 + (t − 1)/(T − 1) · (β_T − β_1) | Simple, original DDPM paper |
| Cosine | β_t derived from a cosine-based ᾱ_t | Gentler early noise, better quality |

Usually β_t ranges from β_1 ≈ 10⁻⁴ (tiny noise at start) to β_T ≈ 0.02 (more noise at end). With T = 1000 steps, the cumulative effect turns any data distribution into approximately N(0, I).
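The two schedules in the table can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, and the cosine form (with offset `s = 0.008` and a clip at 0.999) is a commonly used variant that the text above does not spell out, so treat those constants as assumptions.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced betas; index t-1 holds beta_t."""
    return np.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T=1000, s=0.008, max_beta=0.999):
    """Betas derived from a cosine-shaped alpha_bar curve (assumed form)."""
    steps = np.arange(T + 1) / T                        # t/T for t = 0..T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                                # normalized so alpha_bar_0 = 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]          # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.clip(betas, 0.0, max_beta)

betas_linear = linear_beta_schedule()
betas_cosine = cosine_beta_schedule()

# Cumulative "keep" fraction after t steps: product of (1 - beta_s) for s <= t
alpha_bar_linear = np.cumprod(1 - betas_linear)
alpha_bar_cosine = np.cumprod(1 - betas_cosine)

print("linear: fraction of signal variance left at t = T:", alpha_bar_linear[-1])
print("halfway (t = 500): linear", alpha_bar_linear[499], "cosine", alpha_bar_cosine[499])
```

With these settings the linear schedule leaves almost none of the original signal by t = T, and the cosine schedule retains more signal at the midpoint, which is one way to read "gentler early noise".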

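Why shrink before adding noise? The factor √(1 − β_t) keeps the variance of x_t bounded. Without the shrink, each step would add β_t to the variance, which would keep growing as steps accumulate. With it, Var(x_t) = (1 − β_t) · Var(x_{t−1}) + β_t: unit-variance data stays at variance 1, and any other starting variance is pulled toward 1, so x_t converges to a standard normal. This is essential: the reverse process needs to start from a known distribution (pure noise).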
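The variance argument can be checked with a short recursion. The sketch below assumes scalar data with a chosen starting variance and the linear schedule from the table; `variance_trajectory` is a hypothetical helper name.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule, index t-1 holds beta_t

def variance_trajectory(var_0, betas, shrink=True):
    """Track Var(x_t) over the forward process for data with Var(x_0) = var_0."""
    var = var_0
    history = [var]
    for beta in betas:
        if shrink:
            var = (1 - beta) * var + beta   # scale by sqrt(1 - beta), then add noise
        else:
            var = var + beta                # add noise without shrinking
        history.append(var)
    return np.array(history)

with_shrink = variance_trajectory(4.0, betas, shrink=True)
without_shrink = variance_trajectory(4.0, betas, shrink=False)

print("final variance with shrink:   ", with_shrink[-1])
print("final variance without shrink:", without_shrink[-1])
```

With the shrink, a starting variance of 4 is pulled to about 1. Without it, the variance simply accumulates the sum of all β_t on top of the starting value.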

[Interactive simulation: "Forward Process Step by Step". A 1D distribution evolves as noise is added. The orange histogram shows the data distribution at each time step; by t = T it is a Gaussian. A slider sets the maximum β (default 0.020).]

In the forward process q(x_t | x_{t−1}), what does β_t control?

Chapter 2: Closed-Form Noising

A magical property of Gaussian noise: we don't need to run all t steps sequentially to get xt from x0. There's a closed-form shortcut.

The Reparameterization

Define αt = 1 − βt (the "keep" fraction at step t). Define the cumulative product:

ᾱ_t = α_1 · α_2 · ... · α_t = ∏_{s=1}^{t} (1 − β_s)

Then we can jump directly from x0 to xt:

q(x_t | x_0) = N(x_t; √ᾱ_t · x_0,   (1 − ᾱ_t) I)

In code:

python
import numpy as np

def noise_at_t(x_0, alpha_bar_t):
    """Jump directly to noise level t."""
    eps = np.random.randn(*x_0.shape)  # ε ~ N(0, I)
    x_t = np.sqrt(alpha_bar_t) * x_0 + np.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps  # return both; we'll need ε for training

Why this is a big deal: During training, we need to create noisy versions of data at random time steps t. Without the closed form, we'd need to simulate all t steps of the Markov chain every time. With it, we sample t uniformly from {1, ..., T}, compute ᾱ_t, and add noise in one step. This makes training fast.
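A quick empirical check that the shortcut matches the step-by-step chain. The setup is an assumption chosen for illustration: many copies of a single scalar data point (x_0 = 2.0), the linear schedule, and t = 200.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)        # linear schedule; index t-1 holds beta_t
alpha_bar = np.cumprod(1 - betas)         # index t-1 holds alpha_bar_t

x_0 = np.full(100_000, 2.0)               # many copies of the same data point
t = 200

# Route 1: run the Markov chain for t steps
x_iter = x_0.copy()
for beta in betas[:t]:
    x_iter = np.sqrt(1 - beta) * x_iter + np.sqrt(beta) * rng.standard_normal(x_iter.shape)

# Route 2: jump straight to step t with the closed form
eps = rng.standard_normal(x_0.shape)
x_jump = np.sqrt(alpha_bar[t - 1]) * x_0 + np.sqrt(1 - alpha_bar[t - 1]) * eps

print("iterated:    mean %.3f, var %.3f" % (x_iter.mean(), x_iter.var()))
print("closed form: mean %.3f, var %.3f" % (x_jump.mean(), x_jump.var()))
print("predicted:   mean %.3f, var %.3f" % (np.sqrt(alpha_bar[t - 1]) * 2.0, 1 - alpha_bar[t - 1]))
```

The two routes draw different random numbers, so individual samples differ, but the sample mean and variance should agree with each other and with the predicted values up to sampling error.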

What ᾱt Tells Us

| ᾱ_t | Meaning | x_t looks like |
| --- | --- | --- |
| ≈ 1 (early) | Almost no cumulative noise | Nearly identical to x_0 |
| ≈ 0.5 (middle) | Equal parts signal and noise | Blurry, ghostly version of x_0 |
| ≈ 0 (late) | Almost all noise | Indistinguishable from N(0, I) |

[Interactive simulation: "ᾱ_t Schedule and Direct Noising". Top: the ᾱ_t curve drops from 1 to 0. Bottom: the signal at the selected time step t, computed via the closed-form formula.]

The closed-form formula q(x_t | x_0) lets us:

Chapter 3: The Reverse Process

The forward process is fixed — no learned parameters. It just adds noise. The reverse process is where all the learning happens: given noisy data xt, estimate the slightly less noisy version xt−1.

The Reverse Transition

We parameterize the reverse process as:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t),   σ_t² I)

The neural network, with parameters θ, takes x_t and the time step t as input, and outputs the predicted mean μ_θ. The variance σ_t² is typically set to β_t (fixed, not learned).

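Why is the reverse also Gaussian? Conditioned on the clean data x_0, the posterior q(x_{t−1} | x_t, x_0) is exactly Gaussian, because it follows from Bayes' rule applied to Gaussian factors. The true reverse q(x_{t−1} | x_t), which averages over every possible x_0, is not Gaussian in general, but it is close to Gaussian when β_t is small. Our learned model approximates this reverse with a Gaussian. The smaller the steps, the better the approximation.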
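For reference, the Gaussian posterior q(x_{t−1} | x_t, x_0) has a known closed form. The sketch below writes it out; the formulas are the standard result for this forward process rather than something stated in the text above, and the function name and the indexing convention (`betas[t-1]` holds β_t) are assumptions made for this example.

```python
import numpy as np

def posterior_mean_var(x_t, x_0, t, betas):
    """Mean and variance of q(x_{t-1} | x_t, x_0) for t in {1, ..., T}."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    alpha_bar_t = alpha_bar[t - 1]
    alpha_bar_prev = alpha_bar[t - 2] if t > 1 else 1.0   # alpha_bar_0 = 1 by convention
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]

    # The mean is a weighted combination of the clean data and the current noisy sample
    coef_x0 = np.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    mean = coef_x0 * x_0 + coef_xt * x_t

    # The posterior variance is never larger than beta_t
    var = (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t) * beta_t
    return mean, var

betas = np.linspace(1e-4, 0.02, 1000)
mean, var = posterior_mean_var(x_t=np.array([0.3]), x_0=np.array([1.0]), t=500, betas=betas)
print("posterior mean:", mean, "posterior variance:", var, "beta_t:", betas[499])
```

The posterior variance comes out slightly below β_t, which is why fixing σ_t² = β_t is a reasonable simplification.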

Denoising = Reverse Diffusion

Generation starts from pure noise xT ~ N(0, I) and iterates:

x_T ~ N(0, I)   (pure noise)
  ↓ p_θ(x_{T−1} | x_T)
x_{T−1}   (slightly less noisy)
  ↓ p_θ(x_{T−2} | x_{T−1})
x_{T−2}   (even less noisy)
  ↓ ... repeat T times
x_0   (generated clean data)

Each step, the network removes a little noise. After T steps, we have a clean sample from the data distribution. The neural network has learned the inverse of the noising process.
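The generation loop can be sketched as follows. Since no trained network is available here, the example uses a stand-in: for 1D data x_0 ~ N(m, s²), the ideal reverse mean E[x_{t−1} | x_t] is available in closed form by Gaussian conditioning. The names `ddpm_sample` and `exact_mean_model`, the toy data distribution, and the choice to skip the fresh noise on the very last step are all assumptions made for illustration.

```python
import numpy as np

def ddpm_sample(mean_model, betas, shape, rng):
    """Start at x_T ~ N(0, I) and apply T reverse steps with sigma_t^2 = beta_t."""
    T = len(betas)
    x = rng.standard_normal(shape)                       # x_T: pure noise
    for t in range(T, 0, -1):                            # t = T, T-1, ..., 1
        mean = mean_model(x, t)                          # predicted mean of x_{t-1}
        # Assumption: no fresh noise on the final step, so x_0 is the predicted mean
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[t - 1]) * z
    return x

# Toy data distribution (assumption): x_0 ~ N(m, s^2)
m, s = 2.0, 0.5
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.concatenate([[1.0], np.cumprod(1.0 - betas)])   # alpha_bar[t], alpha_bar[0] = 1

def exact_mean_model(x_t, t):
    """Stand-in for a trained network: E[x_{t-1} | x_t] for Gaussian data."""
    mu_t = np.sqrt(alpha_bar[t]) * m
    mu_prev = np.sqrt(alpha_bar[t - 1]) * m
    var_t = alpha_bar[t] * s**2 + 1.0 - alpha_bar[t]
    var_prev = alpha_bar[t - 1] * s**2 + 1.0 - alpha_bar[t - 1]
    gain = np.sqrt(1.0 - betas[t - 1]) * var_prev / var_t      # Cov(x_{t-1}, x_t) / Var(x_t)
    return mu_prev + gain * (x_t - mu_t)

samples = ddpm_sample(exact_mean_model, betas, shape=(20_000,), rng=np.random.default_rng(0))
print("sample mean %.3f (target %.1f), sample std %.3f (target %.1f)"
      % (samples.mean(), m, samples.std(), s))
```

With a real model, `mean_model` would be the trained network. The point of the toy version is that when the per-step means are right, iterating the small Gaussian steps recovers the data distribution from pure noise.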

[Interactive simulation: "Reverse Process: Noise to Signal". Starting from pure noise (right), a slider steps backwards through the reverse process so that structure emerges. A New Sample button generates a different trajectory.]

The reverse process p_θ(x_{t−1} | x_t) is modeled as a Gaussian. What does the neural network predict?

Chapter 4: Training: Predict the Noise

We could train the network to directly predict μ_θ. But Ho et al. (2020) found something elegant: it works better to train the network to predict the noise ε that was added, rather than the mean itself.

The Reparameterization Trick

Recall the closed-form noising: x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε. If we know ε, we can recover x_0:

x_0 = (x_t − √(1 − ᾱ_t) · ε) / √ᾱ_t

And from x0, we can compute the exact reverse mean. So predicting ε is equivalent to predicting the mean, just reparameterized.
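The algebra behind this equivalence can be written out. The posterior mean on the first line is the standard Gaussian result for q(x_{t−1} | x_t, x_0); it is not stated in the text above and is included here as a known identity. Substituting the expression for x_0 and using β_t + α_t − ᾱ_t = 1 − ᾱ_t collapses it to the second line.

```latex
\tilde{\mu}_t(x_t, x_0)
  = \frac{\sqrt{\bar\alpha_{t-1}}\,\beta_t}{1-\bar\alpha_t}\,x_0
  + \frac{\sqrt{\alpha_t}\,(1-\bar\alpha_{t-1})}{1-\bar\alpha_t}\,x_t

\mu_\theta(x_t, t)
  = \frac{1}{\sqrt{\alpha_t}}
    \left( x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_\theta(x_t, t) \right)
```

So a network that outputs ε_θ determines the reverse mean with no further learned components.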

The Training Algorithm (DDPM)

1. Sample x_0 from data: pick a training example.
2. Sample t ~ Uniform({1, ..., T}): random noise level.
3. Sample ε ~ N(0, I): random noise.
4. Compute x_t = √ᾱ_t · x_0 + √(1 − ᾱ_t) · ε: closed-form noising.
5. Predict ε̂ = ε_θ(x_t, t): the network guesses which noise was added.
6. Loss = ||ε − ε̂||²: simple MSE between true and predicted noise.

python
# DDPM training loop (simplified)
# Assumes model, optimizer, dataloader, and T are defined elsewhere, and that
# alpha_bar_schedule is a tensor of shape (T,) whose entry t-1 holds ᾱ_t.
import torch
import torch.nn.functional as F

for x_0 in dataloader:
    batch_size = x_0.shape[0]
    t = torch.randint(1, T + 1, (batch_size,))     # random timestep in {1, ..., T}
    eps = torch.randn_like(x_0)                    # true noise
    # ᾱ_t per example, reshaped to (B, 1, ..., 1) so it broadcasts over x_0
    alpha_bar = alpha_bar_schedule[t - 1].view(batch_size, *([1] * (x_0.dim() - 1)))
    x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * eps
    eps_pred = model(x_t, t)                       # network predicts noise
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()                          # clear gradients from the previous batch
    loss.backward()
    optimizer.step()

The beauty of noise prediction. The loss function is just MSE. No adversarial training, no KL divergence tuning, no mode collapse. The network sees noisy data at random noise levels and learns to identify "this is the noise component." Training is stable, scales well, and converges reliably. This simplicity is why diffusion models took over.

[Interactive simulation: "Noise Prediction Training". The network sees noisy data x_t (orange) and must predict the noise ε that was added (purple). Green shows the network's prediction improving over training steps. Sliders set the training step and the noise level t.]

In DDPM training, the network learns to predict:

Chapter 5: Score Matching

There's an alternative (and mathematically deeper) way to think about diffusion models: through score functions. This perspective connects diffusion to a rich literature in statistics and physics.

What Is the Score?

The score function of a probability distribution p(x) is the gradient of its log-density:

s(x) = ∇_x log p(x)

The score points in the direction of increasing probability. If you're at a low-density point, the score tells you "move this way to reach higher density." It's a vector field that flows toward the modes (peaks) of the distribution.

Score = gradient field of log-probability. For a Gaussian N(μ, σ²), the score is s(x) = −(x − μ)/σ². It points toward the mean from everywhere. For a complex multi-modal distribution, the score field has multiple "basins of attraction" pulling toward different modes.
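To make the "basins of attraction" picture concrete, here is a small sketch of the score of a 1D Gaussian mixture, checked against a finite-difference derivative of the log-density. The mixture parameters (modes at −2 and 2, standard deviation 0.5, equal weights) and the function names are illustrative assumptions.

```python
import numpy as np

def gaussian_score(x, mu, sigma):
    """Score of N(mu, sigma^2): s(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma**2

def component_densities(x, mus, sigmas, weights):
    """Weighted density of each mixture component at each x; shape (len(x), K)."""
    x = np.asarray(x, dtype=float)[:, None]
    return weights * np.exp(-0.5 * ((x - mus) / sigmas) ** 2) / (sigmas * np.sqrt(2 * np.pi))

def mixture_log_density(x, mus, sigmas, weights):
    return np.log(component_densities(x, mus, sigmas, weights).sum(axis=1))

def mixture_score(x, mus, sigmas, weights):
    """Mixture score: component scores averaged by each component's responsibility."""
    comp = component_densities(x, mus, sigmas, weights)
    resp = comp / comp.sum(axis=1, keepdims=True)
    x_col = np.asarray(x, dtype=float)[:, None]
    return (resp * gaussian_score(x_col, mus, sigmas)).sum(axis=1)

mus, sigmas, weights = np.array([-2.0, 2.0]), np.array([0.5, 0.5]), np.array([0.5, 0.5])
xs = np.array([-3.0, -2.0, -0.5, 0.5, 2.0, 3.0])

h = 1e-5   # step for the central finite difference
finite_diff = (mixture_log_density(xs + h, mus, sigmas, weights)
               - mixture_log_density(xs - h, mus, sigmas, weights)) / (2 * h)

print("analytic score:   ", np.round(mixture_score(xs, mus, sigmas, weights), 3))
print("finite difference:", np.round(finite_diff, 3))
```

The sign of the score flips around each mode: left of a mode it is positive (move right), right of it negative (move left), which is the 1D version of arrows pointing toward the peaks.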

Score Matching = Noise Prediction

Here's the punchline. For the noisy distribution at time t, conditioned on the clean data x_0, the score is:

∇_{x_t} log q(x_t | x_0) = −ε / √(1 − ᾱ_t)

The noise ε that was added is (up to a scale factor) the negative of the score! So when our network predicts ε, it's implicitly learning the score function. This is why the two perspectives — noise prediction and score matching — are equivalent.
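The identity can be checked numerically for the conditional distribution q(x_t | x_0) = N(√ᾱ_t · x_0, (1 − ᾱ_t) I), using the Gaussian score formula from earlier in this chapter. The value of ᾱ_t below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha_bar_t = 0.6                      # illustrative value, not tied to a particular schedule
x_0 = rng.standard_normal(5)
eps = rng.standard_normal(5)
x_t = np.sqrt(alpha_bar_t) * x_0 + np.sqrt(1 - alpha_bar_t) * eps

# Gaussian score -(x - mean) / variance, with mean sqrt(alpha_bar_t) * x_0
# and variance (1 - alpha_bar_t), evaluated at x_t
score = -(x_t - np.sqrt(alpha_bar_t) * x_0) / (1 - alpha_bar_t)

# The same quantity expressed through the noise that was added
score_from_noise = -eps / np.sqrt(1 - alpha_bar_t)

print("max abs difference:", np.max(np.abs(score - score_from_noise)))
```

The difference is at the level of floating-point rounding, so a network that predicts ε gives the score after dividing by −√(1 − ᾱ_t).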

Langevin Dynamics

Once you have the score, you can generate samples using Langevin dynamics: a physics-inspired sampling method that follows the score field with added noise:

x_{i+1} = x_i + (η/2) ∇_x log p(x_i) + √η · z,    z ~ N(0, I)

This is like a ball rolling uphill in the probability landscape (the score term), with random jiggling (the noise term). Over many steps, the samples converge to the target distribution.
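A minimal sketch of the update rule above, using a target whose score is known exactly: a single Gaussian N(μ, σ²) with s(x) = −(x − μ)/σ². The target parameters, step size, number of steps, and the function name `langevin_sample` are assumptions chosen for illustration.

```python
import numpy as np

def langevin_sample(score_fn, x_init, step_size, n_steps, rng):
    """Iterate x <- x + (eta / 2) * score(x) + sqrt(eta) * z with z ~ N(0, I)."""
    x = np.array(x_init, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

mu, sigma = 1.0, 0.5                                  # illustrative target N(mu, sigma^2)

def score_fn(x):
    return -(x - mu) / sigma**2                       # Gaussian score

rng = np.random.default_rng(0)
x_init = rng.uniform(-5.0, 5.0, size=20_000)          # start spread out, far from the target
samples = langevin_sample(score_fn, x_init, step_size=0.01, n_steps=1000, rng=rng)

print("sample mean %.3f (target %.1f), sample std %.3f (target %.1f)"
      % (samples.mean(), mu, samples.std(), sigma))
```

Because the step size is finite, the sampled distribution is slightly wider than the target; shrinking η reduces this bias at the cost of more steps.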

[Interactive simulation: "Score Field Visualization". A 1D mixture of Gaussians (orange curve) and its score function (teal arrows). The arrows point toward the modes. Clicking places a sample, and Langevin dynamics evolves it toward high-density regions.]

The score function ∇_x log p(x) tells you:

Chapter 6: Waveform Generation

Diffusion models aren't just for images. WaveGrad (Chen et al., 2020) applies the same diffusion framework to generate raw audio waveforms — the 1D signal that represents sound pressure over time.

Why Waveforms Are Hard

Audio waveforms are sampled at 16,000–48,000 Hz, so a 1-second clip has at least 16,000 samples. Each sample must be predicted with high precision, since even small errors create audible artifacts (clicks, pops, distortion). Previous approaches such as WaveNet used autoregressive generation: predict one sample at a time. At 24 kHz, that's 24,000 sequential neural network calls per second of audio, far too slow for real-time synthesis.

WaveGrad Architecture

WaveGrad conditions the denoising on a mel spectrogram (a compact time-frequency representation) and refines the waveform over a small number of steps:

Mel spectrogram   (compact representation of desired audio)
  ↓ condition
x_T ~ N(0, I)   (start from noise, full waveform length)
  ↓ 6–50 reverse steps (not 1000!)
x_0   (clean waveform, 24 kHz)

Key innovation: few steps. WaveGrad conditions the network on a continuous noise level rather than a discrete step index, so one trained model can be sampled with different noise schedules. With a schedule tuned for few iterations, it generates high-quality audio in as few as 6 reverse steps (vs. 1000 for vanilla DDPM). This makes real-time audio synthesis feasible. The network is a 1D U-Net conditioned on the mel spectrogram and the current noise level.
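The difference in sequential work is simple arithmetic. The sketch below counts network calls only and assumes a 1-second clip at 24 kHz; it says nothing about the cost of each call, which differs between the two model families. The function name is hypothetical.

```python
def sequential_call_ratio(sample_rate_hz, clip_seconds, reverse_steps):
    """Autoregressive calls (one per sample) divided by diffusion calls (one per reverse step)."""
    autoregressive_calls = sample_rate_hz * clip_seconds
    return autoregressive_calls / reverse_steps

for steps in (6, 50, 1000):
    ratio = sequential_call_ratio(sample_rate_hz=24_000, clip_seconds=1, reverse_steps=steps)
    print(f"{steps:>4} reverse steps: {ratio:.0f}x fewer sequential calls than sample-by-sample generation")
```

Each diffusion call processes the whole waveform at once, so the number of sequential steps stays fixed as the clip gets longer, while the autoregressive count grows with every added sample.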

Conditioning on Mel Spectrograms

The neural network receives two inputs: (1) the current noisy waveform xt, and (2) the target mel spectrogram. The mel spectrogram tells the network what the audio should sound like (frequency content over time). The denoising process figures out how to realize that as a specific waveform.

| Model | Steps | Real-time Factor | Quality (MOS) |
| --- | --- | --- | --- |
| WaveNet (AR) | 24,000/sec | 0.01x | 4.5 |
| WaveGrad (50 steps) | 50 | 3.7x | 4.35 |
| WaveGrad (6 steps) | 6 | 30x | 3.9 |

[Interactive simulation: "Waveform Denoising Simulation". Simulated WaveGrad-style denoising of a 1D audio waveform. A slider steps through the reverse process while the clean waveform emerges from noise. The "mel conditioning" constrains the frequency content.]

WaveGrad's key advantage over WaveNet for audio synthesis is:

Chapter 7: Denoising Showcase

This is the payoff. Below, you'll watch a full diffusion reverse process unfold on a 1D signal. Start from pure noise. Step through the denoising. Watch structure emerge from chaos. Adjust the noise schedule, number of steps, and signal type to see how the process behaves.

What's happening: The animation simulates T denoising steps. At each step, the denoiser estimates the noise component and subtracts a fraction of it, then adds a small amount of fresh noise (for stochasticity). Early steps establish the coarse structure (low frequencies). Later steps refine the details (high frequencies). This coarse-to-fine behavior is a hallmark of diffusion models.

[Interactive simulation: "Full Reverse Diffusion". The main animation, with Play and manual scrubbing over steps T→0 (T = 200 by default). Orange shows the denoised signal at each step; faint gray shows the true target.]

[Plot: "Noise Level & Loss Over Time". The top plot shows the remaining noise energy at each step. The bottom shows the MSE between the current denoised signal and the target. Both should decrease as we step from T to 0.]

[Control: "Signal Type". Choose different target signals to see how the diffusion process handles different structures.]