The engine behind Stable Diffusion, DALL-E, and modern image generation. Learn how neural networks learn to create by learning to denoise.
Generative modeling has a deceptively simple goal: given a dataset of images (faces, landscapes, cats), learn the underlying probability distribution p(x) and then draw new samples from it. A perfect generative model would produce images indistinguishable from real photographs.
Why is this hard? Because an image is a point in an astronomically high-dimensional space. A 512×512 RGB image lives in a space with 786,432 dimensions. The "real image" manifold is a tiny, twisted surface in that vast void. Random points are just static.
Each cell is a random pixel grid. Pure noise has no structure. Generation means learning to place every pixel in just the right spot.
The key insight of diffusion models: destruction is easy, creation is hard. The forward process takes a real image and gradually adds Gaussian noise over T steps (typically T=1000). At each step, the image gets a little noisier, until at step T it's indistinguishable from pure static.
Mathematically, at each step t we mix the current image with a little noise: $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$. The noise schedule $\beta_t$ controls how fast the signal is destroyed.
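As a rough illustration of how a schedule destroys signal, the sketch below builds a linear βt schedule and tracks the running product of (1 − βt), the quantity the rest of this article writes as ᾱt. The endpoints 0.0001 and 0.02 are the linear DDPM values quoted later in the text; the function names are hypothetical and plain Python is used so the snippet runs without PyTorch.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (list index 0 holds beta_1)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bar(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s), for t = 1..T."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


betas = linear_betas()
alpha_bar = cumulative_alpha_bar(betas)

# Fraction of signal variance left after t noising steps.
for t in (1, 250, 500, 750, 1000):
    print(t, round(alpha_bar[t - 1], 4))
```

The product shrinks at every step because each factor (1 − βt) is below 1, which is the "gradual destruction" described above.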
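In code, the forward process is a single line. Given a clean image x0 and noise ε ~ N(0, I), both tensors of shape [batch, C, H, W], the noisy image at step t is $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$, where $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$ is the fraction of signal variance that survives to step t.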
Let's plug in real numbers. With the cosine schedule ᾱt = cos²(πt/2T) and T=1000:
| Timestep t | ᾱt | √ᾱt (signal coeff) | √(1-ᾱt) (noise coeff) | Signal % (ᾱt) |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 0.00 | 100% |
| 250 | 0.85 | 0.92 | 0.38 | 85% |
| 500 | 0.50 | 0.71 | 0.71 | 50% |
| 750 | 0.15 | 0.38 | 0.92 | 15% |
| 1000 | 0.00 | 0.00 | 1.00 | 0% |
```python
import torch

# The entire forward process in PyTorch
def forward_process(x_0, t, alpha_bar):
    """x_0: [B, C, H, W], t: [B], alpha_bar: [T]"""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)  # [B] -> [B, 1, 1, 1] so it broadcasts
    eps = torch.randn_like(x_0)             # noise ~ N(0, I)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps
    return x_t, eps
```
Interactive figure: a slider adds noise to a signal. At t=0 you see the original signal. At t=1000 it's pure static.
The forward process defines $q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\right)$. This means each step depends on the previous one. But we claimed we can jump directly to any timestep t from $x_0$.
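Your task: Starting from the single-step definition, derive $q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$ where $\bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s)$.
Full derivation:
Step 1: Define $\alpha_t = 1-\beta_t$. Reparameterize: $x_t = \sqrt{\alpha_t}\,x_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon_t$.
Step 2: Expand one more step: $x_t = \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\,x_{t-2} + \sqrt{1-\alpha_{t-1}}\,\varepsilon_{t-1}\right) + \sqrt{1-\alpha_t}\,\varepsilon_t$
$= \sqrt{\alpha_t\alpha_{t-1}}\,x_{t-2} + \sqrt{\alpha_t(1-\alpha_{t-1})}\,\varepsilon_{t-1} + \sqrt{1-\alpha_t}\,\varepsilon_t$
Step 3: The two noise terms are independent Gaussians. Their sum has variance: $\alpha_t(1-\alpha_{t-1}) + (1-\alpha_t) = 1-\alpha_t\alpha_{t-1}$.
Step 4: By induction, after t steps: $x_t = \sqrt{\alpha_1\alpha_2\cdots\alpha_t}\,x_0 + \sqrt{1-\alpha_1\alpha_2\cdots\alpha_t}\,\varepsilon$.
Step 5: Define $\bar\alpha_t = \prod_{s=1}^{t}\alpha_s$. Then $q(x_t \mid x_0) = \mathcal{N}\left(x_t;\ \sqrt{\bar\alpha_t}\,x_0,\ (1-\bar\alpha_t) I\right)$. ■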
The key insight: The sum of independent Gaussians is Gaussian. Because each step is linear in xt-1 plus independent noise, the entire chain telescopes into a single Gaussian centered at a scaled version of x0.
```python
import torch

def forward_diffusion(x_0, t, alpha_bar):
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)  # [B] -> [B,1,1,1]
    eps = torch.randn_like(x_0)             # sample noise
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps
    return x_t, eps
```
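The derivation can also be checked numerically. The sketch below walks the chain one step at a time, tracking only the coefficient on x0 and the accumulated noise variance, and compares the result with the closed form. The linear schedule values and the function names are illustrative assumptions, not part of the derivation.

```python
import math

T = 1000
# Illustrative linear schedule (beta from 1e-4 to 0.02).
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]


def iterate_chain(betas, t):
    """Apply t single steps, writing x_t = coeff * x_0 + noise of variance var."""
    coeff, var = 1.0, 0.0
    for beta in betas[:t]:
        alpha = 1.0 - beta
        coeff *= math.sqrt(alpha)   # the mean is scaled by sqrt(alpha_s)
        var = alpha * var + beta    # old noise shrinks by alpha_s, new noise adds beta_s
    return coeff, var


def closed_form(betas, t):
    """Coefficient and variance predicted by q(x_t | x_0)."""
    alpha_bar = math.prod(1.0 - b for b in betas[:t])
    return math.sqrt(alpha_bar), 1.0 - alpha_bar


for t in (1, 10, 500, 1000):
    print(t, iterate_chain(betas, t), closed_form(betas, t))
```

The two columns agree up to floating-point rounding, which is the telescoping argument of Steps 3 to 5 in numeric form.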
The coefficients are chosen so that the total variance is preserved. If x0 has unit variance and ε has unit variance, then xt = a·x0 + b·ε has variance a² + b². Setting a = √ᾱt and b = √(1-ᾱt) gives a² + b² = ᾱt + (1-ᾱt) = 1. The variance stays constant at every timestep.
If both coefficients were 0.5 instead, the variance would be 0.25 + 0.25 = 0.5, so the total variance would shrink. At the final step, xT would then have variance 0.5 instead of matching N(0, I). The reverse process assumes we start from standard normal noise, so this mismatch would cause the generated images to have wrong brightness/contrast. The variance-preserving property is what makes q(xT) ≈ N(0, I) regardless of the data distribution.
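A quick Monte Carlo check makes the same point empirically. This is a minimal sketch that assumes scalar x0 and ε drawn independently from N(0, 1); the value 0.3 for ᾱt is arbitrary and the helper name is hypothetical.

```python
import random
import statistics


def noisy_variance(a, b, n=20000, seed=0):
    """Sample variance of a * x0 + b * eps with x0, eps ~ N(0, 1) independent."""
    rng = random.Random(seed)
    samples = [a * rng.gauss(0, 1) + b * rng.gauss(0, 1) for _ in range(n)]
    return statistics.pvariance(samples)


alpha_bar = 0.3  # arbitrary illustrative value
print(noisy_variance(alpha_bar ** 0.5, (1 - alpha_bar) ** 0.5))  # close to 1
print(noisy_variance(0.5, 0.5))                                   # close to 0.5
```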
If we could reverse the forward process — undo each noise step — we'd have a generative model! Start from pure noise xT ~ N(0, I) and iteratively denoise to get a clean image x0. The problem: the exact reverse q(xt-1|xt) requires knowing the full data distribution, which is what we're trying to learn.
Solution: train a neural network εθ(xt, t) to approximate the reverse step. This network takes in a noisy image and the timestep, and predicts the noise that was added. Given the predicted noise, we can estimate the slightly-less-noisy image.
The denoiser is typically a U-Net: an encoder-decoder with skip connections. Let's be precise about the data flow:
| Input | Shape | Description |
|---|---|---|
| Noisy image xt | [B, C, H, W] | e.g. [1, 3, 64, 64] for pixel-space, [1, 4, 64, 64] for latent |
| Timestep t | [B] | Integer in [0, T]. Embedded via sinusoidal encoding → [B, demb] |
| Output | Shape | Description |
|---|---|---|
| Predicted noise ε̂ | [B, C, H, W] | Same shape as input. The network predicts what noise was added. |
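The sinusoidal timestep encoding in the input table can be sketched in a few lines. This is one common variant (sines followed by cosines, with a maximum period of 10000); actual implementations differ in ordering and scaling, and they usually pass the result through a small MLP. The function name and defaults are assumptions for illustration.

```python
import math


def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of an integer timestep t into a list of length dim (dim even)."""
    half = dim // 2
    # Frequencies fall geometrically from 1 toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb = timestep_embedding(t=250, dim=8)
print(len(emb), [round(v, 3) for v in emb])
```

Nearby timesteps get similar vectors while distant ones differ, which gives the network a smooth notion of "how noisy is this input".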
Training is surprisingly simple. For each training step: (1) pick a random image x0 from the dataset, (2) pick a random timestep t, (3) sample noise ε ~ N(0,I), (4) create the noisy image xt = √ᾱt x0 + √(1-ᾱt) ε, and (5) train the network to predict ε from xt and t.
That's it. Plain MSE loss between the true noise and the predicted noise. No adversarial training, and in practice far fewer problems with mode collapse and training instability than GANs. This simplicity is a huge reason diffusion models won.
Here's the full training loop. Read it carefully — there are no hidden steps:
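```python
import torch
import torch.nn.functional as F

# Assumes unet, optimizer, dataloader, T and alpha_bar ([T], on the same device as x_0) exist.
for x_0 in dataloader:                                 # 1. Sample x_0 from data
    B = x_0.shape[0]
    t = torch.randint(0, T, (B,), device=x_0.device)   # 2. Sample t ~ Uniform{0..T-1}
    eps = torch.randn_like(x_0)                        # 3. Sample epsilon ~ N(0, I)

    # 4. Compute x_t (the noisy version)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    x_t = torch.sqrt(a) * x_0 + torch.sqrt(1 - a) * eps

    # 5. Predict noise, compute loss
    eps_hat = unet(x_t, t)                             # [B, C, H, W] -> [B, C, H, W]
    loss = F.mse_loss(eps_hat, eps)                    # That's it. MSE on noise.

    optimizer.zero_grad()                              # clear gradients from the previous step
    loss.backward()
    optimizer.step()
```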
The theoretical foundation of diffusion models rests on three pillars. You don't need them to use diffusion models, but understanding them reveals why the simple training objective actually works.
We want to maximize log p(x0), but it's intractable. Instead we maximize a lower bound (the Evidence Lower Bound). The ELBO decomposes into a sum of KL divergences — one per timestep — each comparing the true reverse step to our learned approximation.
KL divergence measures how different two distributions are. Since both q and pθ are Gaussian, the KL has a closed form. It reduces to comparing means — which becomes the simple MSE loss.
The score is ∇x log p(x), a vector pointing toward higher-density regions. Denoising is equivalent to estimating the score of the noised distribution: εθ(xt, t) ≈ -√(1-ᾱt) ∇x log pt(xt), so the predicted noise points away from the data, scaled by the noise level. This connection to score matching is why diffusion models are sometimes called "score-based generative models."
Arrows show ∇ log p(x), pointing toward the data distribution (two clusters). The score field guides sampling.
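The link between noise prediction and the score can be checked on a toy case where everything is known in closed form. The sketch assumes scalar data with x0 ~ N(0, 1), so xt is also N(0, 1) and its true score is -xt; the ideal noise predictor for this case is E[ε | xt] = √(1-ᾱt)·xt. Both function names are hypothetical.

```python
import math


def score_from_eps(eps_pred, alpha_bar_t):
    """Convert a noise prediction into a score estimate: -eps / sqrt(1 - alpha_bar_t)."""
    return -eps_pred / math.sqrt(1.0 - alpha_bar_t)


def optimal_eps_unit_gaussian(x_t, alpha_bar_t):
    """E[eps | x_t] when x_0 ~ N(0, 1): the ideal noise predictor for this toy case."""
    return math.sqrt(1.0 - alpha_bar_t) * x_t


x_t, alpha_bar_t = 1.7, 0.4  # arbitrary illustrative values
eps_star = optimal_eps_unit_gaussian(x_t, alpha_bar_t)

# Both numbers are the score of N(0, 1) evaluated at x_t.
print(score_from_eps(eps_star, alpha_bar_t), -x_t)
```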
We stated that the ELBO decomposes into per-timestep KL terms, and that these KL terms reduce to the simple MSE loss on noise. This is the most important derivation in diffusion models — it justifies why such a simple training objective actually maximizes log-likelihood.
Your task: Show that minimizing $\mathrm{KL}\left(q(x_{t-1} \mid x_t, x_0)\ \|\ p_\theta(x_{t-1} \mid x_t)\right)$ reduces to minimizing $\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2$.
Full derivation:
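Step 1: The posterior $q(x_{t-1} \mid x_t, x_0)$ is Gaussian with mean $\tilde\mu_t(x_t, x_0)$ and variance $\tilde\sigma_t^2$ (both computable in closed form).
Step 2: We parameterize $p_\theta(x_{t-1} \mid x_t)$ as Gaussian with the same variance $\tilde\sigma_t^2$ but learned mean $\mu_\theta(x_t, t)$.
Step 3: KL between same-variance Gaussians: $\mathrm{KL} = \frac{1}{2\tilde\sigma_t^2}\,\|\tilde\mu_t - \mu_\theta\|^2$.
Step 4: Rewrite the true mean using the reparameterization $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon$: $\tilde\mu_t = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon\right)$
Step 5: Parameterize the model mean the same way: $\mu_\theta = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\varepsilon_\theta(x_t, t)\right)$
Step 6: Subtract: $\|\tilde\mu_t - \mu_\theta\|^2 = \frac{\beta_t^2}{\alpha_t(1-\bar\alpha_t)}\,\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2$
Step 7: Drop the t-dependent weighting (DDPM paper shows this works better empirically): $L_{\text{simple}} = \mathbb{E}\,\|\varepsilon - \varepsilon_\theta(x_t, t)\|^2$. ■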
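Step 4 is the least obvious move, so here is a small numeric check of it. The derivation only says the posterior mean is "computable in closed form"; the expression used in `posterior_mean` below is the standard DDPM one and is brought in here as an assumption. All values are arbitrary scalars chosen for illustration.

```python
import math


def posterior_mean(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev):
    """Mean of q(x_{t-1} | x_t, x_0), written in terms of x_0 and x_t."""
    beta_t = 1.0 - alpha_t
    coef_x0 = math.sqrt(alpha_bar_prev) * beta_t / (1.0 - alpha_bar_t)
    coef_xt = math.sqrt(alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    return coef_x0 * x_0 + coef_xt * x_t


def posterior_mean_from_eps(x_t, eps, alpha_t, alpha_bar_t):
    """The same mean after substituting x_0 = (x_t - sqrt(1 - abar_t) * eps) / sqrt(abar_t)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_t)


# alpha_bar_t must equal alpha_t * alpha_bar_prev for the values to be consistent.
alpha_t, alpha_bar_prev = 0.98, 0.5
alpha_bar_t = alpha_t * alpha_bar_prev
x_0, eps = 0.8, -1.3
x_t = math.sqrt(alpha_bar_t) * x_0 + math.sqrt(1.0 - alpha_bar_t) * eps

mean_a = posterior_mean(x_t, x_0, alpha_t, alpha_bar_t, alpha_bar_prev)
mean_b = posterior_mean_from_eps(x_t, eps, alpha_t, alpha_bar_t)
print(mean_a, mean_b)
```

The two forms agree to rounding error, so predicting ε is just another way of predicting the posterior mean.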
The key insight: The theoretically justified loss has t-dependent weights, and for DDPM's schedule those weights are largest at small t, where very little noise has been added. Ho et al. (2020) found that dropping them gives better samples. The "simple" loss weights all timesteps equally, which down-weights the easy low-noise terms relative to the ELBO and lets the network focus on the harder denoising tasks at larger t.
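To see how the dropped weight actually behaves, it can be tabulated directly. The sketch combines Step 3 and Step 6 into the factor multiplying ||ε - εθ||², assuming the linear schedule (0.0001 to 0.02, T=1000) and the standard closed-form posterior variance σ̃t² = βt(1-ᾱt-1)/(1-ᾱt), which the derivation references but does not spell out. It starts at t=2 because ᾱ0 = 1 makes the posterior variance zero at t=1.

```python
T = 1000
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]

alpha_bars, running = [], 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)


def elbo_weight(t):
    """Weight on ||eps - eps_theta||^2 at timestep t (2 <= t <= T)."""
    beta_t = betas[t - 1]
    alpha_t = 1.0 - beta_t
    abar_t, abar_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    posterior_var = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)  # sigma-tilde_t squared
    return beta_t ** 2 / (2.0 * posterior_var * alpha_t * (1.0 - abar_t))


for t in (2, 10, 100, 500, 1000):
    print(t, elbo_weight(t))
```

Under these assumptions the weight is by far the largest at the smallest timesteps and much smaller over most of the chain, so replacing it with a constant shifts emphasis toward noisier inputs.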
The original DDPM used a linear schedule: βt goes from 0.0001 to 0.02 linearly. The improved DDPM paper switched to a cosine schedule. Both destroy signal by t=T, but they distribute the destruction differently across timesteps.
Your task: Given that the cosine schedule defines ᾱt = cos²(πt/2T), show that this produces a more uniform signal-to-noise ratio distribution across timesteps compared to linear, and explain why this helps training.
Why cosine wins:
Linear schedule problem: βt increases linearly, so ∑βs grows roughly quadratically and ᾱt = ∏(1-βs) ≈ exp(-∑βs) falls off very fast. With βt from 0.0001 to 0.02 and T=1000, ᾱt is already about 0.08 at t=500 and about 0.003 at t=750. The later part of the chain is close to pure noise and teaches the model little, so training time there is spent on uninformative gradients.
Cosine schedule solution: By directly specifying ᾱt = cos²(πt/2T), we ensure: (1) ᾱ0 = 1 (pure signal), (2) ᾱT = 0 (pure noise), (3) the decay is gradual and symmetric around the midpoint, where ᾱt = 0.5. ᾱt changes slowly near t=0 and t=T and falls almost linearly in between, so intermediate noise levels get far more timesteps than under the linear schedule.
Training impact: Uniform timestep sampling (t ~ Uniform{0,T}) now gives much more even coverage of noise levels. With linear schedule + uniform sampling, roughly the last half of the timesteps have ᾱt below 0.1, so a large share of samples fall in the "already destroyed" regime and provide little learning signal.
The key insight: The noise schedule is not just "how fast to add noise"; it also decides which noise levels the model practices on, like a curriculum. The cosine schedule is the better curriculum here because far fewer training samples are spent on near-pure noise.
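The difference between the two schedules is easy to tabulate. The sketch below uses the simplified cosine form from the task statement (the published version adds a small offset, omitted here) and the linear schedule with β from 0.0001 to 0.02; the function names are hypothetical.

```python
import math

T = 1000


def cosine_alpha_bar(t, T=T):
    """alpha_bar_t = cos^2(pi * t / (2T)), the simplified form used in the text."""
    return math.cos(math.pi * t / (2 * T)) ** 2


def linear_alpha_bar(t, T=T, beta_start=1e-4, beta_end=0.02):
    """alpha_bar_t as the product of (1 - beta_s) over the first t linearly spaced betas."""
    step = (beta_end - beta_start) / (T - 1)
    return math.prod(1.0 - (beta_start + s * step) for s in range(t))


for t in (0, 250, 500, 750, 1000):
    print(f"t={t:4d}  linear={linear_alpha_bar(t):.4f}  cosine={cosine_alpha_bar(t):.4f}")
```

At the midpoint the cosine schedule still keeps half of the signal variance, while the linear schedule has already dropped below a tenth.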
Once trained, we generate images by starting from noise and iteratively denoising. The original DDPM sampler uses all T=1000 steps, which is faithful to the theory but slow: 1000 sequential network evaluations per image, on the order of a minute depending on hardware and model size.
DDIM (Denoising Diffusion Implicit Models) showed that the same trained network can be sampled with a deterministic, non-Markovian reverse process, which allows you to skip steps. With just 50 steps, quality is often close to the full 1000-step sampler. DPM-Solver treats sampling as solving an ODE and uses higher-order methods (like Runge-Kutta) to achieve great quality in 10-25 steps.
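DDPM sampling computes xt-1 from xt using the predicted noise ε̂:
$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\,\hat\varepsilon\right) + \sigma_t z$
where z ~ N(0, I) is fresh noise added at each step. The update is derived for one small step of the chain, and each step injects new randomness, so the sampler cannot simply skip ahead. That is what makes DDPM slow.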
DDIM removes this stochastic term (sets σt = 0), making the process deterministic. A deterministic ODE can be evaluated at any subset of timesteps. Instead of [1000, 999, 998, ...], DDIM evaluates at [1000, 980, 960, ...] — 50 evenly spaced steps instead of 1000.
```python
import torch

# Assumes unet, reverse_step, ddim_step, sigma, T and an initial x_t ~ N(0, I) are defined elsewhere.

# DDPM: 1000 sequential steps (slow)
for t in range(T, 0, -1):
    eps_hat = unet(x_t, t)
    x_t = reverse_step(x_t, eps_hat, t) + sigma[t] * torch.randn_like(x_t)

# DDIM: skip timesteps (fast, deterministic)
steps = list(range(1000, -1, -20))  # [1000, 980, 960, ..., 20, 0]: 50 steps
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps_hat = unet(x_t, t)
    x_t = ddim_step(x_t, eps_hat, t, t_prev)  # no stochastic term
```
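The helpers in the loops above are not defined in the text. One possible form of the two updates is sketched below on plain floats; with tensors the same arithmetic would apply elementwise. Unlike the loop code, these hypothetical functions take the schedule values ᾱt and ᾱt_prev directly instead of timestep indices, and the names `predict_x0`, `ddim_update` and `ddpm_mean` are assumptions.

```python
import math


def predict_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps for x_0."""
    return (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)


def ddim_update(x_t, eps_hat, abar_t, abar_prev):
    """Deterministic DDIM update (sigma = 0) from noise level abar_t to abar_prev."""
    x0_hat = predict_x0(x_t, eps_hat, abar_t)
    return math.sqrt(abar_prev) * x0_hat + math.sqrt(1.0 - abar_prev) * eps_hat


def ddpm_mean(x_t, eps_hat, alpha_t, abar_t):
    """Mean of the DDPM reverse step; the sampler then adds sigma_t * z."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(alpha_t)


# Toy check with a perfect noise prediction (illustrative values).
x_0, eps = 0.6, -0.9
abar_t, abar_prev = 0.2, 0.7
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps
x_prev = ddim_update(x_t, eps, abar_t, abar_prev)
expected = math.sqrt(abar_prev) * x_0 + math.sqrt(1.0 - abar_prev) * eps
print(x_prev, expected)
```

Because the DDIM update only needs the two noise levels, nothing ties it to adjacent timesteps, which is why it can jump from t=1000 to t=980 in one move.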
| Sampler | Steps | Speed | Quality |
|---|---|---|---|
| DDPM | 1000 | Slow | Excellent |
| DDIM | 50 | Fast | Very good |
| DPM-Solver | 15-25 | Very fast | Excellent |
| DPM-Solver++ | 10-20 | Very fast | Excellent |
Watch a 1D distribution emerge from noise. More steps = smoother convergence. Fewer steps = faster but noisier.
Diffusing directly in pixel space is expensive: a 512×512 image has ~786K dimensions. Latent Diffusion Models (LDMs) first encode the image into a compact latent space using a pretrained VAE (Variational Autoencoder), then run diffusion there.
The VAE encoder compresses the image by 8x in each spatial dimension: 512×512 → 64×64 latent. The diffusion model learns to denoise in this 64×64 space (much cheaper!), then the VAE decoder reconstructs the final image. This is exactly what Stable Diffusion does.
Let's count dimensions to see why this is such a huge win:
| Space | Shape | Dimensions | Ratio |
|---|---|---|---|
| Pixel (512×512×3) | [B, 3, 512, 512] | 786,432 | 1× |
| Latent (64×64×4) | [B, 4, 64, 64] | 16,384 | 48× smaller |
Every U-Net forward pass operates on a 16K-dimensional tensor instead of a 786K-dimensional one. Since the U-Net runs once per sampling step (and there are 20-50 steps), each step processes 48× fewer values. The actual speedup depends on the architecture, but the saving applies at every step. Multiply that across all steps and training iterations, and you understand why latent diffusion made Stable Diffusion practical.
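The dimension count in the table is simple enough to reproduce. A minimal sketch, assuming the 8× per-side downsampling and 4 latent channels stated above; the helper name is hypothetical.

```python
def num_elements(channels, height, width):
    """Number of values in one [C, H, W] tensor."""
    return channels * height * width


pixel = num_elements(3, 512, 512)
latent = num_elements(4, 512 // 8, 512 // 8)  # 8x downsampling per side, 4 latent channels
print(pixel, latent, pixel / latent)
```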
Compare the computational cost. Each block represents a unit of work. Latent diffusion is dramatically cheaper.
Standalone VAEs and latent diffusion use the same architectural pattern: force information through a narrow channel so only the essential structure survives. In standalone VAEs, this enables generation by sampling the latent. In Latent Diffusion, it enables efficient denoising by reducing dimensionality 48x. The VAE in Stable Diffusion is trained separately and frozen during diffusion training, so it acts as a compression utility, not a generative model.
Spot this pattern: Where else in ML do we force data through a bottleneck to separate "essential" from "noise"? (Think: autoencoders, attention's low-rank Q/K, PCA...)
Real-world solutions (as of 2024-25):
SDXL Turbo / SD Turbo: Adversarial distillation reduces to 1-4 steps. 1024px in ~1s. Uses latent space (128×128×4 for SDXL). Quality is good but not quite full-step baseline.
LCM (Latent Consistency Model): 4-8 steps with consistency distillation. 512px in ~0.5s, 1024px in ~1.5s. Matches full-step quality closely.
The math: 1024px → 8x compression = 128×128×4 latent. Assuming ~40ms/step on an A100, a 2000ms budget with 50ms reserved for the VAE decode leaves (2000ms - 50ms) / 40ms ≈ 48 steps at most. But with LCM distillation, 8 steps suffices (8 × 40ms = 320ms for diffusion + 50ms VAE decode = 370ms total). Once the step count is this low, the VAE decoder becomes a growing share of the latency at high resolution.
DiT vs U-Net: DiT (used in SD3, Flux) scales better with compute but each step is slightly slower (~60ms at this scale). The quality/step ratio is higher though, so 20 DiT steps ≈ 50 U-Net steps quality-wise.
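The latency arithmetic above can be wrapped in two small helpers. The timings (40 ms per U-Net step, 60 ms per DiT step, 50 ms VAE decode, 2000 ms budget) are the figures quoted in the text, not measurements, and the function names are hypothetical.

```python
def max_steps(budget_ms, step_ms, vae_ms):
    """Largest whole number of denoising steps that fits in the budget."""
    return int((budget_ms - vae_ms) // step_ms)


def total_latency(steps, step_ms, vae_ms):
    """Denoising time plus one VAE decode, in milliseconds."""
    return steps * step_ms + vae_ms


print(max_steps(2000, 40, 50))    # U-Net steps that fit in the budget
print(total_latency(8, 40, 50))   # 8-step LCM
print(max_steps(2000, 60, 50))    # same budget with the slower DiT step
```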
Unconditional generation is impressive but of limited practical use. We want to say "a cat wearing a top hat" and get that image. Conditioning injects a text prompt into the denoising process.
The pipeline: (1) A text encoder (typically CLIP) converts the prompt into an embedding vector. (2) This embedding is injected into the U-Net via cross-attention layers: the noisy image attends to the text tokens. The network learns to denoise differently depending on the text.
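Cross-attention itself is a small computation. The sketch below shows the core of it on made-up 2-dimensional features: each image position acts as a query and takes a softmax-weighted average over the text tokens. Real models first apply learned linear projections to produce queries, keys and values and use many heads; those are omitted here, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(image_queries, text_keys, text_values):
    """Each image position (query) takes a weighted average of text-token values."""
    d = len(text_keys[0])
    outputs, all_weights = [], []
    for q in image_queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in text_keys]
        weights = softmax(scores)
        mixed = [sum(w * v[j] for w, v in zip(weights, text_values))
                 for j in range(len(text_values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Two image positions attending over three text tokens.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, attn = cross_attention(queries, keys, values)
print(attn[0])
```

Each image position ends up dominated by the token whose key matches its query, which is how different regions of the image can follow different words of the prompt.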
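During training, the text condition is randomly dropped (replaced with empty text) some percentage of the time. At inference, we compute both the conditional and unconditional noise predictions, then amplify the difference: $\hat\varepsilon = \varepsilon_\theta(x_t, t, \varnothing) + w\,\left(\varepsilon_\theta(x_t, t, c) - \varepsilon_\theta(x_t, t, \varnothing)\right)$, where c is the text embedding and ∅ is the empty prompt.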
The guidance scale w (typically 7-12) controls how strongly the model follows the prompt. Higher w = more adherence to text but less diversity and potential artifacts.
During training, the text condition is randomly replaced with an empty string ∅ about 10-20% of the time. This teaches the network to predict noise both with and without the prompt. At inference, every denoising step runs the U-Net twice:
```python
# Every sampling step runs TWO forward passes
eps_uncond = unet(x_t, t, text="")      # unconditional prediction
eps_cond = unet(x_t, t, text=prompt)    # conditional prediction

# Amplify the difference
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)
# w=1.0: just use conditional (no guidance)
# w=7.5: standard for text-to-image (SD 1.5, SDXL)
# w=15+: very strong adherence, but artifacts/saturation
```
See how classifier-free guidance amplifies the conditional signal. Low w = generic. High w = strongly steered (but may overshoot).
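The effect of the guidance scale is easiest to see on a few made-up numbers. This sketch applies the guidance formula elementwise to two short lists standing in for the unconditional and conditional noise predictions; the values and the function name are illustrative only.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.20, 0.30]  # made-up predictions for three pixels
eps_cond = [0.30, -0.10, 0.10]

print(cfg_combine(eps_uncond, eps_cond, 0.0))  # unconditional prediction
print(cfg_combine(eps_uncond, eps_cond, 1.0))  # conditional prediction
print(cfg_combine(eps_uncond, eps_cond, 7.5))  # extrapolates past the conditional one
```

For w above 1 the result lies outside the range spanned by the two predictions, which is why very high scales can overshoot into saturated or artifact-prone images.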
Text alone is often insufficient. You might want to specify a precise pose, edge map, or depth layout. ControlNet adds a parallel encoder that takes a spatial condition (like a Canny edge image) and injects it into the U-Net's skip connections.
The genius: the original Stable Diffusion weights are frozen. ControlNet trains a copy of the encoder that learns to translate spatial signals. This preserves the base model's quality while adding precise spatial control.
| Control Type | Input | What It Controls |
|---|---|---|
| Canny | Edge map | Outline / structure |
| Depth | Depth map | 3D layout, foreground/background |
| OpenPose | Skeleton keypoints | Human pose |
| Segmentation | Semantic map | Region content types |
| IP-Adapter (a separate adapter, not a ControlNet variant) | Reference image | Style and subject transfer |
Diffusion models have evolved rapidly. Here's a map of the landscape as of mid-2025:
| Model | Year | Key Innovation |
|---|---|---|
| DDPM | 2020 | Showed diffusion can match GANs |
| DALL-E 2 | 2022 | CLIP-guided diffusion prior |
| Stable Diffusion 1.5 | 2022 | Open-source latent diffusion |
| SDXL | 2023 | Larger U-Net, dual text encoders, 1024px |
| DALL-E 3 | 2023 | Better text understanding via recaptioning |
| SD3 / SD3.5 | 2024 | MMDiT (Transformer replaces U-Net) + flow matching |
| Flux | 2024 | Rectified flow, DiT architecture, open weights |
A radical departure: instead of iterating T steps, learn to jump directly from any noisy xt to x0 in a single step. Consistency models (Song et al., 2023) enforce that all points on the same denoising trajectory map to the same output. The result: 1-2 step generation with quality approaching multi-step diffusion.
Major milestones in diffusion model development.
Flow matching is a closely related continuous-time framework. Instead of T discrete denoising steps, the model learns a velocity field, and sampling solves an ODE that transports noise to data; rectified flow chooses straight-line paths between the two. The "noise schedule" becomes implicit in the choice of path. SD3 and Flux already use this framework: the DDPM noise schedule is replaced by rectified flow, and sampling becomes numerical ODE integration, often with a small number of steps.
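The straight-path idea can be sketched on scalars. The convention below (data at t=0, noise at t=1, velocity equal to noise minus data) is one common choice and is an assumption here; the "oracle" velocity stands in for a trained network, and all names are hypothetical.

```python
def interpolate(x_0, noise, t):
    """Straight-line path: data at t = 0, noise at t = 1."""
    return (1.0 - t) * x_0 + t * noise


def target_velocity(x_0, noise):
    """Constant velocity along that line, d x_t / dt."""
    return noise - x_0


def euler_sample(noise, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t = 1 (noise) back to t = 0 (data) with Euler steps."""
    x, dt = noise, 1.0 / steps
    for i in range(steps):
        t = 1.0 - i * dt
        x = x - dt * velocity_fn(x, t)
    return x


x_0, noise = 0.25, -1.5


def oracle(x, t):
    """Stands in for a trained network that predicts the true velocity."""
    return target_velocity(x_0, noise)


result = euler_sample(noise, oracle, steps=4)
print(result)  # recovers x_0 up to rounding
```

When the path is exactly straight, even a coarse Euler integration lands on the data point, which is the intuition behind few-step sampling in this framework. A learned velocity field is only approximately straight, so real samplers still use more than one step.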
The same "iterative refinement" pattern appears in: Kalman filter (predict/update cycles), policy gradient (improve policy each epoch), EM algorithm (E-step/M-step). What makes diffusion's version unique? (Hint: what's being refined, and what's the "ground truth"?)
You now understand how diffusion models create. From noise to structure, one step at a time.