EE269 Lecture 27 — Diffusion Models

Chapter 0: Why Diffusion?

Imagine you have a beautiful photograph and you start adding static — a bit of random noise at each step. After a thousand steps, the image is pure noise. Unrecognizable. All structure gone.

Now imagine you could reverse this process. Start from pure noise and, step by step, sculpt it into a photograph. Not the original one — a new one that looks just as real. That's a diffusion model.

Before diffusion models, the dominant generative approaches were GANs (adversarial training — hard to stabilize, prone to mode collapse) and VAEs (variational autoencoders — stable but blurry outputs). Diffusion models achieve the best of both: stable training and high-quality outputs. They power DALL-E 2, Stable Diffusion, and modern audio synthesis.

The key insight: Destroying data is easy — just add noise. Reversing destruction is hard — but if you add the noise in tiny, controlled steps, each reversal step is a small, learnable denoising operation. A neural network can learn "given data at noise level t, what does it look like at noise level t−1?" Repeat enough times and you go from noise to data.

Noise Destroys, Denoising Creates

A 1D signal being progressively destroyed by noise (left to right). Drag the slider to control the noise level. At t=0, it's the clean signal. At t=T, it's pure static.


What makes diffusion models different from GANs?

(a) Diffusion models don't use neural networks
(b) Diffusion models generate data in one step, GANs use many steps
(c) Diffusion models iteratively denoise from pure noise, avoiding adversarial training instability

Chapter 1: The Forward Process

The forward process (also called the diffusion process or noising process) is a Markov chain that gradually adds Gaussian noise to data over T steps. At each step, we mix the current signal with a small amount of random noise.

One Step of the Forward Process

Starting from data x₀, the transition from step t−1 to step t is:

q(x_t | x_{t−1}) = N(x_t; √(1 − β_t) · x_{t−1}, β_t I)

Let's unpack this. At each step:

Term	What it does	Analogy
√(1 − β_t)	Shrinks the signal slightly	Turning down the volume a tiny bit
β_t	Controls how much noise to add	The "noise knob" at step t
I	Identity matrix — noise is isotropic	Noise is equally random in all directions

In code, one forward step is:

python
import numpy as np

def forward_step(x_prev, beta_t):
    """Apply one forward (noising) step: x_{t-1} -> x_t."""
    noise = np.random.randn(*x_prev.shape)    # sample ε ~ N(0, I)
    scale = np.sqrt(1 - beta_t)               # shrink factor
    x_t = scale * x_prev + np.sqrt(beta_t) * noise
    return x_t

The Noise Schedule

The values β₁, β₂, ..., β_T form the noise schedule. Typical choices:

Schedule	Formula	Properties
Linear	β_t = β₁ + (t−1)/(T−1) · (β_T − β₁)	Simple, original DDPM paper
Cosine	β_t = 1 − ᾱ_t/ᾱ_{t−1}, with ᾱ_t following a squared-cosine curve in t/T	Gentler early noise, better quality

Usually β_t ranges from β₁ ≈ 10⁻⁴ (tiny noise at start) to β_T ≈ 0.02 (more noise at end). With T = 1000 steps, the cumulative effect turns any data distribution into approximately N(0, I).
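Both schedules from the table can be sketched in a few lines. The linear one follows the formula above directly. The cosine one follows the form proposed by Nichol and Dhariwal (2021); the offset s = 0.008 and the clip at 0.999 are that paper's defaults and are assumptions here, since the text does not state them. The function names are illustrative.

```python
import numpy as np

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced betas from beta_start (step 1) to beta_end (step T)."""
    return np.linspace(beta_start, beta_end, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve (s and max_beta are assumed defaults)."""
    steps = np.arange(T + 1)
    f = np.cos(((steps / T) + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                          # alpha_bar at step 0 is 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]    # beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}
    return np.clip(betas, 0.0, max_beta)          # avoid beta_T = 1 exactly

T = 1000
betas_lin = linear_beta_schedule(T)
betas_cos = cosine_beta_schedule(T)
alpha_bar_lin = np.cumprod(1 - betas_lin)         # cumulative signal fraction, linear
alpha_bar_cos = np.cumprod(1 - betas_cos)         # cumulative signal fraction, cosine
print(alpha_bar_lin[-1], alpha_bar_cos[-1])       # both are close to 0 at t = T
```

Comparing `alpha_bar_lin` and `alpha_bar_cos` at the same step shows the cosine schedule keeping more of the signal through the first half of the chain.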

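Why shrink before adding noise? The factor √(1 − β_t) keeps the variance of x_t bounded. Without the shrink, every step would simply add β_t to the variance, so it would keep growing as steps accumulate. With it, the variance follows Var(x_t) = (1 − β_t) · Var(x_{t−1}) + β_t, which is pulled toward 1 at every step, so x_t converges to a standard normal whatever the scale of the data. This is essential: the reverse process needs to start from a known distribution (pure noise).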
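The effect of the shrink factor can be checked by tracking the variance alone, with no sampling. This is a minimal sketch: the starting variance of 4.0 is an arbitrary illustrative value, and the schedule is the linear one described above.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # linear schedule, T = 1000

var_shrink = 4.0       # assumed variance of the data (illustrative)
var_no_shrink = 4.0
for beta in betas:
    # With shrink: x_t = sqrt(1 - beta) * x_{t-1} + sqrt(beta) * eps
    var_shrink = (1 - beta) * var_shrink + beta
    # Without shrink: x_t = x_{t-1} + sqrt(beta) * eps
    var_no_shrink = var_no_shrink + beta

print(var_shrink, var_no_shrink)
```

With the shrink, the variance ends within about 10⁻⁴ of 1. Without it, the variance ends at 4 + Σ β_t ≈ 14.05 and would keep rising with more steps.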

Forward Process Step by Step

Watch a 1D distribution evolve as noise is added. The orange histogram shows the data distribution at each time step. By t = T, it's a Gaussian.


In the forward process q(x_t | x_{t−1}), what does β_t control?

(a) How much noise is added at step t: larger β_t means more noise
(b) The number of diffusion steps
(c) The learning rate of the neural network

Chapter 2: Closed-Form Noising

A magical property of Gaussian noise: we don't need to run all t steps sequentially to get x_t from x₀. There's a closed-form shortcut.

The Reparameterization

Define α_t = 1 − β_t (the "keep" fraction at step t). Define the cumulative product:

ᾱ_t = α₁ · α₂ · ... · α_t = ∏_{s=1}^{t} (1 − β_s)

Then we can jump directly from x₀ to x_t:

q(x_t | x₀) = N(x_t; √ᾱ_t · x₀, (1 − ᾱ_t) I)
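The jump formula follows from unrolling the single-step update and merging the independent Gaussian noise terms, whose variances add. A short derivation, where ε_t, ε_{t−1} and ε̄ denote standard normal draws (notation introduced here for illustration):

```latex
\begin{aligned}
x_t &= \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\epsilon_{t-1}
       + \sqrt{1-\alpha_t}\,\epsilon_t \\
    &= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2}
       + \sqrt{1-\alpha_t\alpha_{t-1}}\,\bar\epsilon
       \qquad \text{since } \alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1} \\
    &\;\;\vdots \\
    &= \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon,
       \qquad \epsilon \sim \mathcal{N}(0, I)
\end{aligned}
```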

In code:

python
import numpy as np

def noise_at_t(x_0, alpha_bar_t):
    """Jump directly to noise level t."""
    eps = np.random.randn(*x_0.shape)  # ε ~ N(0, I)
    x_t = np.sqrt(alpha_bar_t) * x_0 + np.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps  # return both; we'll need ε for training!

Why this is a big deal: During training, we need to create noisy versions of data at random time steps t. Without the closed form, we'd need to simulate all t steps of the Markov chain every time. With it, we sample t uniformly from {1, ..., T}, compute ᾱ_t, and add noise in one step. This makes training fast.
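A quick empirical check that the shortcut matches the step-by-step chain: the sketch below noises many copies of one scalar data point both ways and compares the resulting mean and variance. The data value 2.0, the step t = 300, and the sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1 - betas)          # alpha_bar[t - 1] holds the value for step t

t = 300                                    # target step (1-indexed)
x0 = np.full(50_000, 2.0)                  # many copies of the same data point

# Sequential: apply t single forward steps
x_seq = x0.copy()
for beta in betas[:t]:
    x_seq = np.sqrt(1 - beta) * x_seq + np.sqrt(beta) * rng.standard_normal(x_seq.shape)

# Closed form: one draw using alpha_bar_t
a_bar = alpha_bar[t - 1]
x_direct = np.sqrt(a_bar) * x0 + np.sqrt(1 - a_bar) * rng.standard_normal(x0.shape)

# Means should both be near sqrt(a_bar) * 2.0, variances near 1 - a_bar
print(x_seq.mean(), x_direct.mean(), np.sqrt(a_bar) * 2.0)
print(x_seq.var(), x_direct.var(), 1 - a_bar)
```

The two routes agree up to sampling error, but the closed form needs one random draw per sample instead of t.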

What ᾱ_t Tells Us

ᾱ_t	Meaning	x_t looks like
≈ 1 (early)	Almost no cumulative noise	Nearly identical to x₀
≈ 0.5 (middle)	Equal parts signal and noise	Blurry, ghostly version of x₀
≈ 0 (late)	Almost all noise	Indistinguishable from N(0, I)
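To connect the table to a concrete schedule, the sketch below finds where ᾱ_t first drops below 0.5 for the linear schedule and reports the signal-to-noise ratio ᾱ_t / (1 − ᾱ_t). The SNR is a derived quantity added here for illustration, not something defined in the text.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1 - betas)

# First step (1-indexed) at which less than half of the signal variance remains
t_half = int(np.argmax(alpha_bar < 0.5)) + 1

# Signal-to-noise ratio in variance terms
snr = alpha_bar / (1 - alpha_bar)
print(t_half, snr[0], snr[t_half - 1], snr[-1])
```

For this schedule the "equal parts" point lands roughly a quarter of the way through the chain rather than at the literal midpoint, so "early", "middle" and "late" in the table refer to the value of ᾱ_t, not to t/T.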

ᾱ_t Schedule and Direct Noising

Top: the ᾱ_t curve drops from 1 to 0. Bottom: the signal at the selected time step t, computed via the closed-form formula.


The closed-form formula q(x_t | x₀) lets us:

(a) Skip the reverse process entirely
(b) Jump to any noise level t directly from clean data x₀, without simulating all intermediate steps
(c) Reduce the number of parameters in the model

Chapter 3: The Reverse Process

The forward process is fixed: it has no learned parameters and just adds noise. The reverse process is where all the learning happens: given noisy data x_t, estimate the slightly less noisy version x_{t−1}.

The Reverse Transition

We parameterize the reverse process as:

p_θ(x_{t−1} | x_t) = N(x_{t−1}; μ_θ(x_t, t), σ_t² I)

The neural network, with parameters θ, takes x_t and the time step t as input and outputs the predicted mean μ_θ. The variance σ_t² is typically fixed rather than learned; a common choice is σ_t² = β_t.

Why is the reverse also Gaussian? Conditioned on the clean data, the reverse step q(x_{t−1} | x_t, x₀) is exactly Gaussian, because every forward step is Gaussian. The true reverse q(x_{t−1} | x_t), without x₀, is generally not Gaussian, but it becomes very close to one when β_t is small. Our learned Gaussian p_θ approximates it, and the smaller the steps, the better the approximation.

Denoising = Reverse Diffusion

Generation starts from pure noise x_T ~ N(0, I) and iterates:

x_T ~ N(0, I) (pure noise)
↓ p_θ(x_{T−1} | x_T)
x_{T−1} (slightly less noisy)
↓ p_θ(x_{T−2} | x_{T−1})
x_{T−2} (even less noisy)
↓ ... repeat until T steps have been taken
x₀ (generated clean data)

Each step, the network removes a little noise. After T steps, we have a clean sample that approximately follows the data distribution. The neural network has learned an approximate inverse of the noising process.
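The generation loop can be sketched as follows. This is a minimal sketch under stated assumptions: `eps_model` is a hypothetical callable that returns a noise estimate, the mean uses the noise-prediction form from Ho et al. (2020), μ = (x_t − β_t/√(1 − ᾱ_t) · ε̂) / √α_t, and σ_t² = β_t. To keep it runnable without a trained network, the toy check uses data that is always the constant c, for which the ideal noise predictor is known in closed form.

```python
import numpy as np

def ddpm_sample(eps_model, shape, betas, rng):
    """Ancestral sampling: start at x_T ~ N(0, I) and apply T reverse steps."""
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(shape)                     # x_T
    for t in range(len(betas), 0, -1):                 # t = T, ..., 1
        i = t - 1                                      # array index for step t
        eps_hat = eps_model(x, t)
        # Reverse mean written in terms of the predicted noise
        mean = (x - betas[i] / np.sqrt(1 - alpha_bar[i]) * eps_hat) / np.sqrt(alphas[i])
        # Fresh noise at every step except the last one
        z = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z
    return x

# Toy check: all data equals c, so the exact noise in x_t is recoverable.
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
c = 0.7

def oracle_eps(x_t, t):
    """Hypothetical stand-in for a trained network (exact for constant data)."""
    a = alpha_bar[t - 1]
    return (x_t - np.sqrt(a) * c) / np.sqrt(1 - a)

samples = ddpm_sample(oracle_eps, (5,), betas, np.random.default_rng(0))
print(samples)   # every entry should be very close to c
```

In practice `oracle_eps` would be replaced by the trained network, and the samples would be spread over the data distribution instead of collapsing to a single value.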

Reverse Process: Noise to Signal

Starting from pure noise (right), step backwards through the reverse process. Watch structure emerge. Click New Sample to generate a different trajectory.


The reverse process p_θ(x_{t−1} | x_t) is modeled as a Gaussian. What does the neural network predict?

(a) The mean μ_θ(x_t, t) of the reverse Gaussian at each step
(b) The exact clean data x₀ in one shot
(c) The forward noise schedule β_t

Chapter 4: Training: Predict the Noise

We could train the network to directly predict μ_θ. But Ho et al. (2020) found something elegant: training works better when the network predicts the noise ε that was added, rather than the mean itself.

The Reparameterization Trick

Recall the closed-form noising: x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε. If we know ε, we can recover x₀:

x₀ = (x_t − √(1 − ᾱ_t) · ε) / √ᾱ_t

And from x₀, we can compute the exact reverse mean. So predicting ε is equivalent to predicting the mean, just reparameterized.
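The equivalence can be verified numerically. The sketch below uses the standard Gaussian posterior mean of q(x_{t−1} | x_t, x₀), which the text refers to but does not write out, so its coefficients are stated here as an assumption from Ho et al. (2020). It compares that mean with the same quantity rewritten in terms of ε.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_mean(x_t, x_0, t):
    """Mean of q(x_{t-1} | x_t, x_0) for 2 <= t <= T (t is 1-indexed)."""
    a_bar_t, a_bar_prev = alpha_bar[t - 1], alpha_bar[t - 2]
    coef_x0 = np.sqrt(a_bar_prev) * betas[t - 1] / (1 - a_bar_t)
    coef_xt = np.sqrt(alphas[t - 1]) * (1 - a_bar_prev) / (1 - a_bar_t)
    return coef_x0 * x_0 + coef_xt * x_t

def mean_from_eps(x_t, eps, t):
    """The same mean, rewritten in terms of the noise eps."""
    return (x_t - betas[t - 1] / np.sqrt(1 - alpha_bar[t - 1]) * eps) / np.sqrt(alphas[t - 1])

rng = np.random.default_rng(0)
t = 400
x_0 = rng.standard_normal(8)
eps = rng.standard_normal(8)
x_t = np.sqrt(alpha_bar[t - 1]) * x_0 + np.sqrt(1 - alpha_bar[t - 1]) * eps

# Recover x_0 from (x_t, eps), then compare the two expressions for the mean
x_0_rec = (x_t - np.sqrt(1 - alpha_bar[t - 1]) * eps) / np.sqrt(alpha_bar[t - 1])
mean_a = posterior_mean(x_t, x_0_rec, t)
mean_b = mean_from_eps(x_t, eps, t)
print(np.max(np.abs(x_0_rec - x_0)), np.max(np.abs(mean_a - mean_b)))
```

Both differences are at floating-point level, so a network that outputs ε carries exactly the information needed for the reverse mean.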

The Training Algorithm (DDPM)

1. Sample x₀ from the data: pick a training example.
2. Sample t ~ Uniform({1, ..., T}): a random noise level.
3. Sample ε ~ N(0, I): random noise.
4. Compute x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε: closed-form noising.
5. Predict ε̂ = ε_θ(x_t, t): the network guesses which noise was added.
6. Compute the loss ‖ε − ε̂‖²: simple MSE between true and predicted noise.

python
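# DDPM training loop (simplified)
# Assumes model, optimizer, dataloader, T, and alpha_bar_schedule (a 1D tensor
# holding ᾱ_1..ᾱ_T, on the same device as the data) are already defined.
import torch
import torch.nn.functional as F

for x_0 in dataloader:
    batch_size = x_0.shape[0]
    t = torch.randint(1, T + 1, (batch_size,), device=x_0.device)  # random timestep in {1, ..., T}
    eps = torch.randn_like(x_0)                                    # true noise
    # ᾱ_t is stored at index t−1; reshape so it broadcasts over the data dimensions
    alpha_bar = alpha_bar_schedule[t - 1].view(-1, *([1] * (x_0.dim() - 1)))
    x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * eps
    eps_pred = model(x_t, t)                                       # network predicts noise
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()                                          # clear gradients from the previous step
    loss.backward()
    optimizer.step()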

The beauty of noise prediction. The loss function is just MSE. No adversarial training, no KL divergence tuning, no mode collapse. The network sees noisy data at random noise levels and learns to identify "this is the noise component." Training is stable, scales well, and converges reliably. This simplicity is why diffusion models took over.

Noise Prediction Training

The network sees noisy data x_t (orange) and must predict the noise ε that was added (purple). Green shows the network's prediction improving over training steps.


In DDPM training, the network learns to predict:

(a) The clean data x₀ directly
(b) The noise ε that was added to x₀ to produce x_t
(c) The noise schedule β_t

Chapter 5: Score Matching

There's an alternative (and mathematically deeper) way to think about diffusion models: through score functions. This perspective connects diffusion to a rich literature in statistics and physics.

What Is the Score?

The score function of a probability distribution p(x) is the gradient of its log-density:

s(x) = ∇_x log p(x)

The score points in the direction of increasing probability. If you're at a low-density point, the score tells you "move this way to reach higher density." It's a vector field that flows toward the modes (peaks) of the distribution.

Score = gradient field of log-probability. For a Gaussian N(μ, σ²), the score is s(x) = −(x − μ)/σ². It points toward the mean from everywhere. For a complex multi-modal distribution, the score field has multiple "basins of attraction" pulling toward different modes.
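The Gaussian score formula is easy to confirm numerically: a finite difference of the log-density should match −(x − μ)/σ². The values of μ, σ and the evaluation grid below are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(x, mu, sigma):
    """Analytic score of N(mu, sigma^2): -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

x = np.linspace(-3.0, 5.0, 9)
mu, sigma, h = 1.0, 0.8, 1e-5

# Central finite difference of the log-density approximates the score
numeric = (gaussian_log_pdf(x + h, mu, sigma) - gaussian_log_pdf(x - h, mu, sigma)) / (2 * h)
analytic = gaussian_score(x, mu, sigma)
print(np.max(np.abs(numeric - analytic)))
```

The score is positive to the left of μ and negative to the right, which is the "points toward the mean" behavior described above.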

Score Matching = Noise Prediction

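Here's the punchline. Write the noisy sample as x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε. The score of the noising distribution, conditioned on x₀, is:

∇_{x_t} log q(x_t | x₀) = −ε / √(1 − ᾱ_t)

The noise ε that was added is (up to a scale factor) the negative of this score! The score of the marginal q(x_t) is the average of this quantity over every x₀ that could have produced x_t, and the MSE-optimal noise prediction is the matching average of ε. So when our network predicts ε, it's implicitly learning the score function: ∇_{x_t} log q(x_t) ≈ −ε_θ(x_t, t) / √(1 − ᾱ_t). This is why the two perspectives, noise prediction and score matching, are equivalent.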
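The scale factor can be checked in two lines, using only the closed-form noising formula x_t = √ᾱ_t · x₀ + √(1 − ᾱ_t) · ε with x₀ held fixed:

```latex
\begin{aligned}
\log q(x_t \mid x_0)
  &= -\frac{\lVert x_t - \sqrt{\bar\alpha_t}\,x_0 \rVert^2}{2\,(1-\bar\alpha_t)} + \text{const} \\
\nabla_{x_t} \log q(x_t \mid x_0)
  &= -\frac{x_t - \sqrt{\bar\alpha_t}\,x_0}{1-\bar\alpha_t}
   = -\frac{\sqrt{1-\bar\alpha_t}\,\epsilon}{1-\bar\alpha_t}
   = -\frac{\epsilon}{\sqrt{1-\bar\alpha_t}}
\end{aligned}
```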

Langevin Dynamics

Once you have the score, you can generate samples using Langevin dynamics: a physics-inspired sampling method that follows the score field with added noise:

x_{i+1} = x_i + (η/2) ∇_x log p(x_i) + √η · z, z ~ N(0, I)

This is like a ball rolling uphill in the probability landscape (the score term), with random jiggling (the noise term). Over many steps, the samples converge to the target distribution.
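A minimal sketch of this update on a 1D mixture of two Gaussians, using the analytic score so that no network is needed. The mixture parameters, step size η = 0.01, and step count are illustrative assumptions.

```python
import numpy as np

MEANS = np.array([-2.0, 2.0])     # assumed mixture parameters (illustrative)
SIGMAS = np.array([0.5, 0.5])
WEIGHTS = np.array([0.5, 0.5])

def mixture_score(x):
    """Score of the 1D Gaussian mixture, evaluated at each entry of x."""
    x = x[:, None]                                             # shape (N, 1) against (K,)
    log_w = np.log(WEIGHTS) - np.log(SIGMAS) - 0.5 * ((x - MEANS) / SIGMAS) ** 2
    log_w -= log_w.max(axis=1, keepdims=True)                  # stabilize the softmax
    resp = np.exp(log_w)
    resp /= resp.sum(axis=1, keepdims=True)                    # component responsibilities
    return (resp * (-(x - MEANS) / SIGMAS ** 2)).sum(axis=1)   # weighted component scores

def langevin(x, score_fn, eta, n_steps, rng):
    """Iterate x <- x + (eta / 2) * score(x) + sqrt(eta) * z."""
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eta * score_fn(x) + np.sqrt(eta) * z
    return x

rng = np.random.default_rng(0)
x_init = rng.standard_normal(5000)                             # start far from the target shape
x_final = langevin(x_init, mixture_score, eta=0.01, n_steps=1000, rng=rng)
print(np.mean(np.abs(x_final)), np.std(np.abs(x_final)))       # roughly 2 and 0.5
```

The particles split between the two modes and settle with about the right spread. A finite step size introduces a small bias, which shrinks as η is reduced.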

Score Field Visualization

A 1D mixture of Gaussians (orange curve) and its score function (teal arrows). The arrows point toward the modes. Click to place a sample and watch Langevin dynamics evolve it toward high-density regions.

The score function ∇_x log p(x) tells you:

(a) The direction of steepest increase in log-probability at point x
(b) The probability of x
(c) The optimal noise schedule

Chapter 6: Waveform Generation

Diffusion models aren't just for images. WaveGrad (Chen et al., 2020) applies the same diffusion framework to generate raw audio waveforms — the 1D signal that represents sound pressure over time.

Why Waveforms Are Hard

Audio waveforms are sampled at 16,000–48,000 Hz, so a 1-second clip has at least 16,000 samples. Each sample must be predicted with high precision: even small errors create audible artifacts (clicks, pops, distortion). Previous approaches (WaveNet) used autoregressive generation: predict one sample at a time. At 24 kHz, that's 24,000 sequential neural network calls per second of audio. Way too slow for real-time.

WaveGrad Architecture

WaveGrad conditions the denoising on a mel spectrogram (a compact time-frequency representation) and refines the waveform over a small number of steps:

Mel spectrogram (compact representation of the desired audio)
↓ condition
x_T ~ N(0, I) (start from noise, full waveform length)
↓ 6–50 reverse steps (not 1000!)
x₀ (clean waveform, 24 kHz)

Key innovation: few steps. WaveGrad conditions the network on a continuous noise level rather than a discrete step index, so the number of reverse steps can be chosen at inference time. With a noise schedule tuned for short chains, it generates high-quality audio in as few as 6 reverse steps (vs. 1000 for vanilla DDPM). This makes real-time audio synthesis feasible. The network is a 1D U-Net conditioned on the mel spectrogram and the current noise level.
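One way a continuous noise level could be drawn during training is sketched below: pick a segment between two adjacent discrete levels √ᾱ and sample uniformly inside it. The interval-sampling scheme, the schedule, and the test tone are all assumptions for illustration and are not taken from the text; `sample_noise_level` is a hypothetical helper.

```python
import numpy as np

def sample_noise_level(betas, rng):
    """Draw a continuous noise level sqrt(alpha_bar) between two adjacent discrete levels."""
    # levels[0] = 1 (clean), then sqrt(alpha_bar_1) > ... > sqrt(alpha_bar_S)
    levels = np.concatenate([[1.0], np.sqrt(np.cumprod(1.0 - betas))])
    s = rng.integers(1, len(levels))              # segment index in {1, ..., S}
    return rng.uniform(levels[s], levels[s - 1])  # uniform inside that segment

betas_train = np.linspace(1e-4, 0.02, 1000)       # assumed training schedule (illustrative)
rng = np.random.default_rng(0)
level = sample_noise_level(betas_train, rng)

# 0.1 s of a 440 Hz tone at 24 kHz stands in for a clean waveform
x0 = np.sin(2 * np.pi * 440 * np.arange(2400) / 24_000)
eps = rng.standard_normal(x0.shape)
x_noisy = level * x0 + np.sqrt(1 - level ** 2) * eps   # the network would be conditioned on `level`
print(level, x_noisy.shape)
```

Because the network sees the noise level itself rather than a step index, the same trained model can be run with a 6-step or a 50-step schedule.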

Conditioning on Mel Spectrograms

The neural network receives two inputs: (1) the current noisy waveform x_t, and (2) the target mel spectrogram. The mel spectrogram tells the network what the audio should sound like (frequency content over time). The denoising process figures out how to realize that as a specific waveform.

Model	Steps	Real-time Factor	Quality (MOS)
WaveNet (AR)	24,000/sec	0.01x	4.5
WaveGrad (50 steps)	50	3.7x	4.35
WaveGrad (6 steps)	6	30x	3.9

Waveform Denoising Simulation

Simulated WaveGrad-style denoising of a 1D audio waveform. Step through the reverse process and watch the clean waveform emerge from noise. The "mel conditioning" constrains the frequency content.


WaveGrad's key advantage over WaveNet for audio synthesis is:

(a) WaveGrad uses a larger neural network
(b) WaveGrad generates all samples in parallel via iterative denoising, needing far fewer steps than sample-by-sample autoregression
(c) WaveGrad doesn't need any conditioning signal

Chapter 7: Denoising Showcase

This is the payoff. Below, you'll watch a full diffusion reverse process unfold on a 1D signal. Start from pure noise. Step through the denoising. Watch structure emerge from chaos. Adjust the noise schedule, number of steps, and signal type to see how the process behaves.

What's happening: The animation simulates T denoising steps. At each step, the denoiser estimates the noise component and subtracts a fraction of it, then adds a small amount of fresh noise (for stochasticity). Early steps establish the coarse structure (low frequencies). Later steps refine the details (high frequencies). This coarse-to-fine behavior is a hallmark of diffusion models.

Full Reverse Diffusion

The main animation. Use Play for the full animation, or scrub manually. The orange shows the denoised signal at each step; the faint gray shows the true target.


Noise Level & Loss Over Time

The top plot shows the remaining noise energy at each step. The bottom shows the MSE between the current denoised signal and the target. Both should decrease as we step from T to 0.

Signal Type

Choose different target signals to see how the diffusion process handles different structures.
