Destroy data with noise, then learn to reverse the destruction — and generate new data from pure static.
Imagine you have a beautiful photograph and you start adding static — a bit of random noise at each step. After a thousand steps, the image is pure noise. Unrecognizable. All structure gone.
Now imagine you could reverse this process. Start from pure noise and, step by step, sculpt it into a photograph. Not the original one — a new one that looks just as real. That's a diffusion model.
Before diffusion models, the dominant generative approaches were GANs (adversarial training — hard to stabilize, prone to mode collapse) and VAEs (variational autoencoders — stable but blurry outputs). Diffusion models achieve the best of both: stable training and high-quality outputs. They power DALL-E 2, Stable Diffusion, and modern audio synthesis.
A 1D signal being progressively destroyed by noise (left to right). Drag the slider to control the noise level. At t=0, it's the clean signal. At t=T, it's pure static.
The forward process (also called the diffusion process or noising process) is a Markov chain that gradually adds Gaussian noise to data over T steps. At each step, we mix the current signal with a small amount of random noise.
Starting from data x0, the transition from step t−1 to step t is:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)$$
Let's unpack this. At each step:
| Term | What it does | Analogy |
|---|---|---|
| √(1 − βt) | Shrinks the signal slightly | Turning down the volume a tiny bit |
| βt | Controls how much noise to add | The "noise knob" at step t |
| I | Identity matrix — noise is isotropic | Noise is equally random in all directions |
In code, one forward step is:
```python
import numpy as np

def forward_step(x_prev, beta_t):
    noise = np.random.randn(*x_prev.shape)   # sample ε ~ N(0, I)
    scale = np.sqrt(1 - beta_t)              # shrink factor
    x_t = scale * x_prev + np.sqrt(beta_t) * noise
    return x_t
```
The values β1, β2, ..., βT form the noise schedule. Typical choices:
| Schedule | Formula | Properties |
|---|---|---|
| Linear | βt = β1 + (t−1)/(T−1) · (βT − β1) | Simple, original DDPM paper |
| Cosine | βt = 1 − ᾱt/ᾱt−1, with ᾱt following a squared-cosine curve | Gentler early noise, better quality |
Usually βt ranges from β1 ≈ 10⁻⁴ (tiny noise at start) to βT ≈ 0.02 (more noise at end). With T = 1000 steps, the cumulative effect turns any data distribution into approximately N(0, I).
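The two schedules can be sketched in NumPy as follows. This is a minimal sketch, not a reference implementation: the function names are made up for illustration, and the cosine variant assumes the squared-cosine form with offset s = 0.008 and clipping at 0.999 from Nichol and Dhariwal (2021), which the table above does not spell out.

```python
import numpy as np

def linear_beta_schedule(T, beta_1=1e-4, beta_T=0.02):
    """beta_t = beta_1 + (t-1)/(T-1) * (beta_T - beta_1) for t = 1..T."""
    return np.linspace(beta_1, beta_T, T)

def cosine_beta_schedule(T, s=0.008, max_beta=0.999):
    """Betas derived from a squared-cosine alpha_bar curve (assumed form)."""
    steps = np.arange(T + 1) / T                       # t/T for t = 0..T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    alpha_bar = f / f[0]                               # normalize so alpha_bar_0 = 1
    betas = 1 - alpha_bar[1:] / alpha_bar[:-1]         # beta_t = 1 - abar_t / abar_{t-1}
    return np.clip(betas, 0.0, max_beta)

def alpha_bar_from_betas(betas):
    """Cumulative product of alpha_t = 1 - beta_t."""
    return np.cumprod(1.0 - betas)

betas = linear_beta_schedule(1000)
alpha_bar = alpha_bar_from_betas(betas)
print(alpha_bar[0], alpha_bar[-1])   # close to 1 at t = 1, close to 0 at t = T
```

With the linear defaults, almost no signal survives by the last step, which is why the endpoint is well approximated by N(0, I).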
Watch a 1D distribution evolve as noise is added. The orange histogram shows the data distribution at each time step. By t = T, it's a Gaussian.
A magical property of Gaussian noise: we don't need to run all t steps sequentially to get xt from x0. There's a closed-form shortcut.
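Define αt = 1 − βt (the "keep" fraction at step t). Define the cumulative product:

$$\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$$

Then we can jump directly from x0 to xt:

$$q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t) I\big), \qquad \text{i.e.}\quad x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\varepsilon,\quad \varepsilon \sim \mathcal{N}(0, I)$$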
In code:
```python
import numpy as np

def noise_at_t(x_0, alpha_bar_t):
    """Jump directly to noise level t."""
    eps = np.random.randn(*x_0.shape)   # ε ~ N(0, I)
    x_t = np.sqrt(alpha_bar_t) * x_0 + np.sqrt(1 - alpha_bar_t) * eps
    return x_t, eps                     # return both; we'll need ε for training
```
| ᾱt | Meaning | xt looks like |
|---|---|---|
| ≈ 1 (early) | Almost no cumulative noise | Nearly identical to x0 |
| ≈ 0.5 (middle) | Equal parts signal and noise | Blurry, ghostly version of x0 |
| ≈ 0 (late) | Almost all noise | Indistinguishable from N(0, I) |
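As a sanity check on the shortcut, the sketch below pushes many copies of one scalar through the step-by-step chain and compares the resulting mean and variance with what the closed form predicts. The linear schedule, the value x0 = 2.0, the step count, and the sample count are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 200
betas = np.linspace(1e-4, 0.02, T)        # linear schedule (illustrative)
alpha_bar = np.cumprod(1.0 - betas)

x0 = 2.0
x = np.full(100_000, x0)                  # many independent copies of x0

# Run the Markov chain one step at a time.
for beta_t in betas:
    x = np.sqrt(1.0 - beta_t) * x + np.sqrt(beta_t) * rng.standard_normal(x.shape)

# The closed form predicts x_T ~ N(sqrt(abar_T) * x0, 1 - abar_T).
predicted_mean = np.sqrt(alpha_bar[-1]) * x0
predicted_var = 1.0 - alpha_bar[-1]
print(x.mean(), predicted_mean)
print(x.var(), predicted_var)
```

The empirical and predicted values should agree up to sampling error, which shrinks as the number of copies grows.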
Top: the ᾱt curve drops from 1 to 0. Bottom: the signal at the selected time step t, computed via the closed-form formula.
The forward process is fixed — no learned parameters. It just adds noise. The reverse process is where all the learning happens: given noisy data xt, estimate the slightly less noisy version xt−1.
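We parameterize the reverse process as:

$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\big(x_{t-1};\ \mu_\theta(x_t, t),\ \sigma_t^2 I\big)$$

The neural network, with parameters θ, takes xt and the time step t as input and outputs the predicted mean μθ. The variance σt² is typically fixed to βt rather than learned.
Generation starts from pure noise xT ~ N(0, I) and iterates:

$$x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I), \qquad t = T, T-1, \dots, 1$$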
Each step, the network removes a little noise. After T steps, we have a clean sample from the data distribution. The neural network has learned the inverse of the noising process.
Starting from pure noise (right), step backwards through the reverse process. Watch structure emerge. Click New Sample to generate a different trajectory.
We could train the network to predict μθ directly. But Ho et al. (2020) discovered something elegant: it's easier to train the network to predict the noise ε that was added, rather than the mean itself.
Recall the closed-form noising: xt = √ᾱt · x0 + √(1 − ᾱt) · ε. If we know ε, we can recover x0:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\,\varepsilon}{\sqrt{\bar{\alpha}_t}}$$
And from x0, we can compute the exact reverse mean. So predicting ε is equivalent to predicting the mean, just reparameterized.
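To make the reparameterization concrete, here is a sketch of recovering x0 from a noise estimate and of one full sampling pass. It uses the noise-prediction form of the mean from Ho et al. (2020), μθ = (xt − βt/√(1 − ᾱt) · εθ)/√αt, with σt² = βt. The names `predict_x0`, `reverse_mean`, `sample`, and `eps_model` are hypothetical. In place of a trained network, the usage example plugs in an oracle for a toy dataset in which every data point equals the same constant c, since the ideal noise prediction is known exactly in that case.

```python
import numpy as np

def predict_x0(x_t, eps, alpha_bar_t):
    """Invert x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps for x0."""
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def reverse_mean(x_t, eps, beta_t, alpha_bar_t):
    """Mean of p(x_{t-1} | x_t) written in terms of the predicted noise."""
    return (x_t - beta_t / np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(1.0 - beta_t)

def sample(eps_model, betas, shape, rng):
    """Ancestral sampling from x_T ~ N(0, I) down to x_0."""
    alpha_bar = np.cumprod(1.0 - betas)
    x = rng.standard_normal(shape)
    for i in reversed(range(len(betas))):          # i = t - 1 (0-based index)
        eps = eps_model(x, i)
        mean = reverse_mean(x, eps, betas[i], alpha_bar[i])
        # sigma_t^2 = beta_t; no noise is added on the final step
        z = rng.standard_normal(shape) if i > 0 else np.zeros(shape)
        x = mean + np.sqrt(betas[i]) * z
    return x

# Toy oracle: all data equal c, so the true noise is recoverable from x_t.
c = 1.5
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)
oracle = lambda x, i: (x - np.sqrt(alpha_bar[i]) * c) / np.sqrt(1.0 - alpha_bar[i])

samples = sample(oracle, betas, shape=(5,), rng=np.random.default_rng(0))
print(samples)   # every entry should be very close to c
```

A real model would replace `oracle` with a trained network; the loop itself would stay the same under these assumptions.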
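```python
import torch
import torch.nn.functional as F

# DDPM training loop (simplified). Assumes model, optimizer, dataloader, T,
# and alpha_bar_schedule (1D tensor holding ᾱ_1..ᾱ_T) are defined elsewhere.
for x_0 in dataloader:
    batch_size = x_0.shape[0]
    t = torch.randint(1, T + 1, (batch_size,))       # random timestep in 1..T
    eps = torch.randn_like(x_0)                       # true noise
    # ᾱ_t lives at index t - 1; reshape so it broadcasts over the data dims
    alpha_bar = alpha_bar_schedule[t - 1].view(-1, *([1] * (x_0.dim() - 1)))
    x_t = torch.sqrt(alpha_bar) * x_0 + torch.sqrt(1 - alpha_bar) * eps
    eps_pred = model(x_t, t)                          # network predicts noise
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()                             # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```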
The network sees noisy data xt (orange) and must predict the noise ε that was added (purple). Green shows the network's prediction improving over training steps.
There's an alternative (and mathematically deeper) way to think about diffusion models: through score functions. This perspective connects diffusion to a rich literature in statistics and physics.
The score function of a probability distribution p(x) is the gradient of its log-density:

$$s(x) = \nabla_x \log p(x)$$
The score points in the direction of increasing probability. If you're at a low-density point, the score tells you "move this way to reach higher density." It's a vector field that flows toward the modes (peaks) of the distribution.
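For a 1D Gaussian N(μ, σ²) the score has the closed form −(x − μ)/σ², which makes the "points toward the mode" picture easy to check numerically. The sketch below compares that formula with a finite-difference derivative of the log-density; the function names and test values are illustrative.

```python
import numpy as np

def gaussian_log_density(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

def gaussian_score(x, mu, sigma):
    """Analytic score: d/dx log N(x; mu, sigma^2)."""
    return -(x - mu) / sigma**2

x = np.array([-1.0, 0.0, 3.0])
mu, sigma, h = 1.0, 2.0, 1e-5

# Central finite difference of the log-density.
numeric = (gaussian_log_density(x + h, mu, sigma)
           - gaussian_log_density(x - h, mu, sigma)) / (2 * h)

print(gaussian_score(x, mu, sigma))   # positive left of mu, negative right of it
print(numeric)
```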
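Here's the punchline. For the noisy distribution at time t (conditioned on x0), the score is:

$$\nabla_{x_t} \log q(x_t \mid x_0) = -\frac{x_t - \sqrt{\bar{\alpha}_t}\,x_0}{1-\bar{\alpha}_t} = -\frac{\varepsilon}{\sqrt{1-\bar{\alpha}_t}}$$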
The noise ε that was added is (up to a scale factor) the negative of the score! So when our network predicts ε, it's implicitly learning the score function. This is why the two perspectives — noise prediction and score matching — are equivalent.
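The relation can be checked directly for the conditional distribution q(xt | x0) = N(√ᾱt · x0, (1 − ᾱt) I), whose score is available in closed form. In this sketch `score_from_eps` is a hypothetical helper, and the values of x0 and ᾱt are arbitrary.

```python
import numpy as np

def score_from_eps(eps, alpha_bar_t):
    """Convert a noise value into a score: score = -eps / sqrt(1 - abar_t)."""
    return -eps / np.sqrt(1.0 - alpha_bar_t)

rng = np.random.default_rng(0)
x0 = np.array([0.5, -1.0, 2.0])
alpha_bar_t = 0.3

eps = rng.standard_normal(x0.shape)
x_t = np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

# Score of the Gaussian q(x_t | x0): -(x_t - mean) / variance.
analytic = -(x_t - np.sqrt(alpha_bar_t) * x0) / (1.0 - alpha_bar_t)

print(np.allclose(score_from_eps(eps, alpha_bar_t), analytic))
```

The same conversion applies to a network's noise prediction: dividing it by −√(1 − ᾱt) gives a score estimate.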
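Once you have the score, you can generate samples using Langevin dynamics: a physics-inspired sampling method that follows the score field with added noise:

$$x_{k+1} = x_k + \frac{\eta}{2}\,\nabla_x \log p(x_k) + \sqrt{\eta}\,z_k, \qquad z_k \sim \mathcal{N}(0, I)$$

Here η is a small step size.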
This is like a ball rolling uphill in the probability landscape (the score term), with random jiggling (the noise term). Over many steps, the samples converge to the target distribution.
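A minimal sketch of this sampler on a 1D mixture of two Gaussians, using the update x ← x + (η/2) · score(x) + √η · z. The mixture parameters, step size, and step count are arbitrary illustrative choices, and `mixture_score` assumes all components share one standard deviation.

```python
import numpy as np

def mixture_score(x, means, sigma, weights):
    """Score of sum_k w_k N(x; mu_k, sigma^2) for 1D x (shared sigma)."""
    diff = means[None, :] - x[:, None]                      # (n, k): mu_k - x
    log_w = np.log(weights)[None, :] - 0.5 * (diff / sigma) ** 2
    log_w -= log_w.max(axis=1, keepdims=True)               # stabilize the softmax
    resp = np.exp(log_w)
    resp /= resp.sum(axis=1, keepdims=True)                 # responsibilities
    return (resp * diff).sum(axis=1) / sigma**2

def langevin(x, score_fn, step_size, n_steps, rng):
    """Unadjusted Langevin dynamics: x <- x + (eta/2) * score + sqrt(eta) * z."""
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * step_size * score_fn(x) + np.sqrt(step_size) * z
    return x

rng = np.random.default_rng(0)
means, sigma, weights = np.array([-2.0, 2.0]), 0.5, np.array([0.5, 0.5])

x_init = rng.standard_normal(2000)                          # arbitrary starting points
x_final = langevin(x_init, lambda x: mixture_score(x, means, sigma, weights),
                   step_size=0.01, n_steps=1000, rng=rng)
print(np.abs(x_final).mean())                               # should sit near 2, the mode distance
```

Particles settle around the two modes quickly, but hopping between well-separated modes is slow, so the share that ends up in each mode depends on where the particles start.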
A 1D mixture of Gaussians (orange curve) and its score function (teal arrows). The arrows point toward the modes. Click to place a sample and watch Langevin dynamics evolve it toward high-density regions.
Diffusion models aren't just for images. WaveGrad (Chen et al., 2020) applies the same diffusion framework to generate raw audio waveforms — the 1D signal that represents sound pressure over time.
Audio waveforms are sampled at 16,000–48,000 Hz. A 1-second clip has 16,000+ samples. Each sample must be predicted with high precision, because even small errors create audible artifacts (clicks, pops, distortion). Previous approaches (WaveNet) used autoregressive generation: predict one sample at a time. At 24 kHz, that's 24,000 sequential neural network calls per second of audio. Way too slow for real-time.
WaveGrad conditions the denoising on a mel spectrogram (a compact time-frequency representation) and refines the waveform over a small number of steps.
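The arithmetic behind that bottleneck is easy to sketch. The helper names are illustrative, and the comparison counts only sequential network evaluations; it ignores how expensive each evaluation is.

```python
def autoregressive_calls(sample_rate_hz, seconds):
    """One network call per output sample, each depending on the previous one."""
    return int(sample_rate_hz * seconds)

def diffusion_calls(num_steps):
    """One network call per denoising step; each call refines the whole clip at once."""
    return num_steps

ar = autoregressive_calls(24_000, 1.0)
for steps in (50, 6):
    print(steps, ar / diffusion_calls(steps))   # ratio of sequential calls
```

The sequential depth of the diffusion approach does not grow with clip length, whereas the autoregressive count scales with every extra second of audio.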
The neural network receives two inputs: (1) the current noisy waveform xt, and (2) the target mel spectrogram. The mel spectrogram tells the network what the audio should sound like (frequency content over time). The denoising process figures out how to realize that as a specific waveform.
| Model | Steps | Real-time Factor | Quality (MOS) |
|---|---|---|---|
| WaveNet (AR) | 24,000/sec | 0.01x | 4.5 |
| WaveGrad (50 steps) | 50 | 3.7x | 4.35 |
| WaveGrad (6 steps) | 6 | 30x | 3.9 |
Simulated WaveGrad-style denoising of a 1D audio waveform. Step through the reverse process and watch the clean waveform emerge from noise. The "mel conditioning" constrains the frequency content.
This is the payoff. Below, you'll watch a full diffusion reverse process unfold on a 1D signal. Start from pure noise. Step through the denoising. Watch structure emerge from chaos. Adjust the noise schedule, number of steps, and signal type to see how the process behaves.
The main animation. Use Play for the full animation, or scrub manually. The orange shows the denoised signal at each step; the faint gray shows the true target.
The top plot shows the remaining noise energy at each step. The bottom shows the MSE between the current denoised signal and the target. Both should decrease as we step from T to 0.
Choose different target signals to see how the diffusion process handles different structures.