The Complete Beginner's Path

Understand Diffusion Models

The engine behind Stable Diffusion, DALL-E, and modern image generation. See how neural networks learn to create by learning to denoise.

Prerequisites: basic probability and familiarity with neural networks. No measure theory required.
10 Chapters · 6+ Simulations · 0 Pages of Proofs

Chapter 0: What Is Generation?

Generative modeling has a deceptively simple goal: given a dataset of images (faces, landscapes, cats), learn the underlying probability distribution p(x) and then draw new samples from it. A perfect generative model would produce images indistinguishable from real photographs.

Why is this hard? Because an image is a point in an astronomically high-dimensional space. A 512×512 RGB image lives in a space with 786,432 dimensions. The "real image" manifold is a tiny, twisted surface in that vast void. Random points are just static.

The core problem: We need to transform simple noise (which we can sample) into complex data (which we want to sample). Diffusion models do this by learning to gradually remove noise, one small step at a time.
Random Noise vs Structure

Each cell is a random pixel grid. Pure noise has no structure. Generation means learning to place every pixel in just the right spot.

Check: Why can't we just sample random pixel values to generate images?

Chapter 1: The Forward Process

The key insight of diffusion models: destruction is easy, creation is hard. The forward process takes a real image and gradually adds Gaussian noise over T steps (typically T=1000). At each step, the image gets a little noisier, until at step T it's indistinguishable from pure static.

Mathematically, at each step t we mix the current image with a little noise: q(xt | xt-1) = N(xt; √(1-βt) xt-1, βt I). The noise schedule βt controls how fast the signal is destroyed.

q(xt | x0) = N(xt; √ᾱt x0, (1 - ᾱt) I)
Nice property: we can jump directly to any timestep t without computing all the intermediate steps. Here ᾱt = ∏s≤t αs is the cumulative product of αs = 1 - βs. This makes training efficient.
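The closed-form jump can be sketched in a few lines of NumPy. The linear βt schedule and T = 1000 follow common DDPM defaults; the "image" is just a random vector standing in for pixel data.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # noise schedule beta_t (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative product alpha-bar_t

def q_sample(x0, t, rng):
    """Jump directly to timestep t: x_t = sqrt(ab_t) x0 + sqrt(1 - ab_t) eps."""
    eps = rng.standard_normal(x0.shape)
    ab = alpha_bars[t]
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)            # a toy 64-dim "image"
x_early = q_sample(x0, 10, rng)         # mostly signal: alpha-bar still near 1
x_late = q_sample(x0, 999, rng)         # essentially pure noise: alpha-bar near 0
```

Note how ᾱt does all the work: one lookup replaces t sequential noising steps.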
Interactive: Watch an Image Dissolve

Drag the slider to add noise. At t=0 you see the original signal. At t=1000 it's pure static.

Check: What does the forward process do to an image?

Chapter 2: The Reverse Process

If we could reverse the forward process — undo each noise step — we'd have a generative model! Start from pure noise xT ~ N(0, I) and iteratively denoise to get a clean image x0. The problem: the exact reverse q(xt-1|xt) requires knowing the full data distribution, which is what we're trying to learn.

Solution: train a neural network εθ(xt, t) to approximate the reverse step. This network takes in a noisy image and the timestep, and predicts the noise that was added. Given the predicted noise, we can estimate the slightly-less-noisy image.

Pure Noise
xT ~ N(0, I)
↓ denoise
Slightly Less Noisy
xT-1 = f(xT, εθ)
↓ denoise
...
Repeat T times
↓ denoise
Clean Image
x0
The denoiser network is typically a U-Net: an encoder-decoder with skip connections. It takes in the noisy image concatenated with a timestep embedding, and outputs a noise prediction of the same shape. Modern variants use Transformers (DiT).
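One reverse (ancestral sampling) step can be sketched as below, using the same linear schedule as the forward process. The real network is a U-Net or DiT; here eps_model is a placeholder that only shows the data flow.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)      # same assumed schedule as before
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def reverse_step(x_t, t, eps_model, rng):
    """Estimate x_{t-1} from x_t using the predicted noise (DDPM ancestral step)."""
    eps_hat = eps_model(x_t, t)
    mean = (x_t - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean                      # final step: no fresh noise added
    z = rng.standard_normal(x_t.shape)
    return mean + np.sqrt(betas[t]) * z  # sigma_t^2 = beta_t variant

rng = np.random.default_rng(0)
x = rng.standard_normal(64)              # x_T ~ N(0, I): start from pure noise
dummy_model = lambda x, t: np.zeros_like(x)   # stand-in for eps_theta
for t in reversed(range(T)):             # repeat T times, as in the diagram
    x = reverse_step(x, t, dummy_model, rng)
```

With a trained eps_theta in place of the dummy, this loop is the entire generative model.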
Check: What does the denoiser network predict?

Chapter 3: Training the Denoiser

Training is surprisingly simple. For each training step: (1) pick a random image x0 from the dataset, (2) pick a random timestep t, (3) sample noise ε ~ N(0,I), (4) create the noisy image xt = √ᾱt x0 + √(1-ᾱt) ε, and (5) train the network to predict ε from xt and t.

L = Et,x0 [ || ε - εθ(xt, t) ||² ]

That's it. Plain MSE loss between the true noise and the predicted noise. No adversarial training, no mode collapse, no training instability. This simplicity is a huge reason diffusion models won.

1. Sample
Pick random x0, t, ε
2. Corrupt
xt = √ᾱt x0 + √(1-ᾱt) ε
3. Predict
ε̂ = εθ(xt, t)
4. Loss
L = || ε - ε̂ ||²
Why MSE? It can be shown that minimizing this simple noise-prediction MSE is equivalent to optimizing a variational bound on the data log-likelihood. The theory is deep, but the practice is dead simple.
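The five steps above fit into one function. The network argument eps_theta is hypothetical; a dummy is plugged in just to show the flow of a single training step.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def training_step(x0, eps_theta, rng):
    t = rng.integers(0, T)                       # (2) random timestep
    eps = rng.standard_normal(x0.shape)          # (3) sample noise
    x_t = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1 - alpha_bars[t]) * eps  # (4) corrupt
    eps_hat = eps_theta(x_t, t)                  # (5) predict the noise
    return np.mean((eps - eps_hat) ** 2)         # plain MSE loss

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)                     # (1) a toy training "image"
loss = training_step(x0, lambda x, t: np.zeros_like(x), rng)
# Against a zero prediction, the MSE is roughly the variance of eps, i.e. about 1.
```

In a real trainer the loss would be backpropagated through eps_theta; everything else stays this simple.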
Check: What loss function is used to train a diffusion model?

Chapter 4: The Math

The theoretical foundation of diffusion models rests on three pillars. You don't need them to use diffusion models, but understanding them reveals why the simple training objective actually works.

Pillar 1: The ELBO

We want to maximize log p(x0), but it's intractable. Instead we maximize a lower bound (the Evidence Lower Bound). The ELBO decomposes into a sum of KL divergences — one per timestep — each comparing the true reverse step to our learned approximation.

log p(x0) ≥ Eq[ log pθ(x0|x1) ] - ∑ KL( q(xt-1|xt,x0) || pθ(xt-1|xt) )

(A third term comparing q(xT|x0) to the prior N(0, I) is usually dropped: it has no trainable parameters.)

Pillar 2: KL Divergence

KL divergence measures how different two distributions are. Since both q and pθ are Gaussian, the KL has a closed form. It reduces to comparing means — which becomes the simple MSE loss.

Pillar 3: Score Function

The score is ∇x log p(x) — a vector pointing toward higher-density regions. Denoising is equivalent to estimating the score: εθ(xt, t) ∝ -∇x log p(xt). This connection to score matching is why diffusion models are sometimes called "score-based generative models."

The punchline: ELBO → sum of KL terms → Gaussian KL → MSE on means → noise prediction MSE. Five steps of math justify the simplest possible training loop.
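The score of a simple two-cluster density, like the one the visualization below describes, can be computed exactly. The centers at (±2, 0) and unit variances are assumptions chosen for illustration.

```python
import numpy as np

mus = np.array([[-2.0, 0.0], [2.0, 0.0]])   # assumed cluster centers

def score(x):
    """grad_x log p(x) for an equal-weight, unit-variance 2-Gaussian mixture."""
    diffs = mus - x                          # direction toward each center
    logps = -0.5 * np.sum((x - mus) ** 2, axis=1)
    w = np.exp(logps - logps.max())
    w = w / w.sum()                          # posterior weight of each cluster
    return (w[:, None] * diffs).sum(axis=0)  # weighted pull toward the clusters

s = score(np.array([1.0, 0.0]))              # a point near the right cluster
# s points in the +x direction, toward the nearer center at (2, 0)
```

Evaluating score on a grid of points reproduces exactly the arrow field described below.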
Score Field Visualization

Arrows show ∇ log p(x), pointing toward the data distribution (two clusters). The score field guides sampling.

Check: What is the score function?

Chapter 5: Sampling

Once trained, we generate images by starting from noise and iteratively denoising. The original DDPM sampler uses all T=1000 steps — faithful to the theory but painfully slow (~1 minute per image).

DDIM (Denoising Diffusion Implicit Models) noticed that the forward process can be made deterministic, allowing you to skip steps. With just 50 steps, quality is nearly identical. DPM-Solver treats sampling as solving an ODE and uses higher-order methods (like Runge-Kutta) to achieve great quality in 10-25 steps.

Sampler        Steps    Speed       Quality
DDPM           1000     Slow        Excellent
DDIM           50       Fast        Very good
DPM-Solver     15-25    Very fast   Excellent
DPM-Solver++   10-20    Very fast   Excellent
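A deterministic DDIM update (η = 0) makes the step-skipping concrete: t and t_prev need not be adjacent, so 50 evenly spaced timesteps can stand in for all 1000. Same assumed schedule as earlier; eps_model is again a placeholder for the trained network.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)               # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)

def ddim_step(x_t, t, t_prev, eps_model):
    """Deterministic DDIM update: estimate x0, then re-noise to level t_prev."""
    eps_hat = eps_model(x_t, t)
    x0_hat = (x_t - np.sqrt(1 - alpha_bars[t]) * eps_hat) / np.sqrt(alpha_bars[t])
    ab_prev = alpha_bars[t_prev] if t_prev >= 0 else 1.0
    return np.sqrt(ab_prev) * x0_hat + np.sqrt(1 - ab_prev) * eps_hat

timesteps = np.linspace(T - 1, 0, 50).astype(int)   # 50 steps instead of 1000
rng = np.random.default_rng(0)
x = rng.standard_normal(64)                      # start from pure noise
dummy_model = lambda x, t: np.zeros_like(x)      # stand-in for eps_theta
for i, t in enumerate(timesteps):
    t_prev = int(timesteps[i + 1]) if i + 1 < len(timesteps) else -1
    x = ddim_step(x, t, t_prev, dummy_model)
```

Because no fresh noise is injected, the same xT always maps to the same image, regardless of how many steps the schedule uses.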
Interactive: Step Count vs Quality

Watch a 1D distribution emerge from noise. More steps = smoother convergence. Fewer steps = faster but noisier.

Key tradeoff: Steps ↔ quality ↔ speed. Modern samplers achieve near-perfect quality in ~20 steps, making real-time generation possible. The race is to push this even lower.
Check: Why is DDIM faster than DDPM?

Chapter 6: Latent Diffusion

Diffusing directly in pixel space is expensive: a 512×512 image has ~786K dimensions. Latent Diffusion Models (LDMs) first encode the image into a compact latent space using a pretrained VAE (Variational Autoencoder), then run diffusion there.

The VAE encoder compresses the image by 8x in each spatial dimension: 512×512 → 64×64 latent. The diffusion model learns to denoise in this 64×64 space (much cheaper!), then the VAE decoder reconstructs the final image. This is exactly what Stable Diffusion does.

Image (512×512)
786,432 dimensions
↓ VAE Encoder
Latent (64×64×4)
16,384 dimensions (~48x smaller)
↓ Diffusion here!
Denoised Latent
Still 64×64×4
↓ VAE Decoder
Generated Image
Back to 512×512
Why latent space? The VAE discards perceptually irrelevant detail (exact pixel noise). The latent space captures semantic content: shapes, colors, composition. Diffusion in latent space = faster training, faster sampling, same quality.
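A shape-level sketch of the pipeline makes the savings concrete. The encoder and decoder here are zero-filled stand-ins for a pretrained VAE; only the tensor shapes match Stable Diffusion's.

```python
import numpy as np

def vae_encode(image):
    """Stand-in for the VAE encoder: (512, 512, 3) -> (64, 64, 4), 8x per axis."""
    return np.zeros((64, 64, 4))

def vae_decode(latent):
    """Stand-in for the VAE decoder: (64, 64, 4) -> (512, 512, 3)."""
    return np.zeros((512, 512, 3))

image = np.zeros((512, 512, 3))
latent = vae_encode(image)
# ... the entire denoising loop runs here, in 64x64x4 latent space ...
out = vae_decode(latent)

ratio = image.size // latent.size       # 786432 / 16384 = 48x fewer dimensions
```

Every denoising step operates on ~48x fewer numbers, which is where the training and sampling speedups come from.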
Pixel vs Latent Dimensions

Compare the computational cost. Each block represents a unit of work. Latent diffusion is dramatically cheaper.

Check: What does the VAE encoder do in Stable Diffusion?

Chapter 7: Conditioning

Unconditional generation is impressive but not useful. We want to say "a cat wearing a top hat" and get that image. Conditioning injects a text prompt into the denoising process.

The pipeline: (1) A text encoder (typically CLIP) converts the prompt into an embedding vector. (2) This embedding is injected into the U-Net via cross-attention layers: the noisy image attends to the text tokens. The network learns to denoise differently depending on the text.

Text Prompt
"a cat wearing a top hat"
↓ CLIP encoder
Text Embedding
77 tokens × 768 dims
↓ cross-attention
U-Net Denoiser
Noisy latent + text → predicted noise

Classifier-Free Guidance (CFG)

During training, the text condition is randomly dropped (replaced with empty text) some percentage of the time. At inference, we compute both the conditional and unconditional noise predictions, then amplify the difference:

ε̂ = εuncond + w · (εcond - εuncond)

The guidance scale w (typically 7-12) controls how strongly the model follows the prompt. Higher w = more adherence to text but less diversity and potential artifacts.
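The CFG formula above translates directly to code: the same network is run twice, once with the text embedding and once with a null embedding, and the difference is amplified. eps_model is a stand-in for the trained denoiser.

```python
import numpy as np

def guided_eps(eps_model, x_t, t, text_emb, null_emb, w=7.5):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    eps_cond = eps_model(x_t, t, text_emb)
    eps_uncond = eps_model(x_t, t, null_emb)
    return eps_uncond + w * (eps_cond - eps_uncond)

# Toy check: if the "model" just echoes its conditioning vector,
# guidance scales the conditional direction by w.
toy_model = lambda x, t, c: c
out = guided_eps(toy_model, None, None, np.array([1.0]), np.array([0.0]), w=7.5)
# out == [7.5]: the conditional signal, amplified 7.5x past the unconditional one
```

At w = 1 the formula reduces to the plain conditional prediction; w > 1 pushes the sample further in the direction the text asks for.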

Interactive: CFG Scale

See how classifier-free guidance amplifies the conditional signal. Low w = generic. High w = strongly steered (but may overshoot).

Check: What happens when you increase the CFG scale?

Chapter 8: ControlNet & Adapters

Text alone is often insufficient. You might want to specify a precise pose, edge map, or depth layout. ControlNet adds a parallel encoder that takes a spatial condition (like a Canny edge image) and injects it into the U-Net's skip connections.

The genius: the original Stable Diffusion weights are frozen. ControlNet trains a copy of the encoder that learns to translate spatial signals. This preserves the base model's quality while adding precise spatial control.

Control Type    Input                What It Controls
Canny           Edge map             Outline / structure
Depth           Depth map            3D layout, foreground/background
OpenPose        Skeleton keypoints   Human pose
Segmentation    Semantic map         Region content types
IP-Adapter      Reference image      Style and subject transfer
Spatial Condition
Edge map, depth, pose, etc.
↓ ControlNet encoder (trainable copy)
Skip Connection Residuals
Added to frozen U-Net features
↓ Combined with text conditioning
Controlled Output
Follows both text and spatial layout
Other adapters: LoRA (Low-Rank Adaptation) finetunes the model with tiny weight matrices for style or subject. T2I-Adapter is a lightweight alternative to ControlNet. These can be composed — multiple LoRAs + ControlNet + text prompt — for fine-grained control.
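The ControlNet wiring above can be sketched at the vector level. Real features are 4D tensors and the trainable copy is a full encoder; the zero-initialized output layers ("zero convolutions") mean the residuals start as no-ops, so training begins from the unmodified base model.

```python
import numpy as np

def controlnet(condition, trained=False):
    """Stand-in for the trainable encoder copy feeding residuals to the U-Net."""
    feats = condition * 0.5              # placeholder for learned features
    if not trained:
        return np.zeros_like(feats)      # zero-init output: no-op at start
    return feats

base_skip = np.ones(8)                   # frozen U-Net skip-connection features
combined_start = base_skip + controlnet(np.ones(8))            # additive injection
combined_later = base_skip + controlnet(np.ones(8), trained=True)
# At initialization the output equals the frozen base model's features exactly;
# training gradually shapes the residuals without ever touching the base weights.
```

This additive, zero-initialized design is why the base model's quality is preserved: the worst ControlNet can do at the start is nothing.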
Check: Why does ControlNet freeze the original Stable Diffusion weights?

Chapter 9: The Diffusion Ecosystem

Diffusion models have evolved rapidly. Here's a map of the landscape as of mid-2025:

Model                  Year   Key Innovation
DDPM                   2020   Showed diffusion can match GANs
DALL-E 2               2022   CLIP-guided diffusion prior
Stable Diffusion 1.5   2022   Open-source latent diffusion
SDXL                   2023   Larger U-Net, dual text encoders, 1024px
DALL-E 3               2023   Better text understanding via recaptioning
SD3 / SD3.5            2024   MMDiT (Transformer replaces U-Net) + flow matching
Flux                   2024   Rectified flow, DiT architecture, open weights

Consistency Models

A radical departure: instead of iterating T steps, learn to jump directly from any noisy xt to x0 in a single step. Consistency models (Song et al., 2023) enforce that all points on the same denoising trajectory map to the same output. The result: 1-2 step generation with quality approaching multi-step diffusion.

The trend: Fewer steps, bigger Transformers, better text understanding, more control. The U-Net is giving way to DiT (Diffusion Transformer). Flow matching (next lesson!) is replacing the DDPM noise schedule. The field is converging on a cleaner, simpler framework.
Evolution Timeline

Major milestones in diffusion model development.

"What I cannot create, I do not understand."
— Richard Feynman

You now understand how diffusion models create. From noise to structure, one step at a time.