The Complete Beginner's Path

Understand Diffusion Models

The engine behind Stable Diffusion, DALL-E, and modern image generation. Learn how neural networks learn to create by learning to denoise.

Prerequisites: Basic probability + Familiarity with neural networks. No measure theory required.
10 Chapters · 6+ Simulations · 0 Pages of Proofs

Chapter 0: What Is Generation?

Generative modeling has a deceptively simple goal: given a dataset of images (faces, landscapes, cats), learn the underlying probability distribution p(x) and then draw new samples from it. A perfect generative model would produce images indistinguishable from real photographs.

Why is this hard? Because an image is a point in an astronomically high-dimensional space. A 512×512 RGB image lives in a space with 786,432 dimensions. The "real image" manifold is a tiny, twisted surface in that vast void. Random points are just static.

The core problem: We need to transform simple noise (which we can sample) into complex data (which we want to sample). Diffusion models do this by learning to gradually remove noise, one small step at a time.
Random Noise vs Structure

Each cell is a random pixel grid. Pure noise has no structure. Generation means learning to place every pixel in just the right spot.

Check: Why can't we just sample random pixel values to generate images?

Chapter 1: The Forward Process

The key insight of diffusion models: destruction is easy, creation is hard. The forward process takes a real image and gradually adds Gaussian noise over T steps (typically T=1000). At each step, the image gets a little noisier, until at step T it's indistinguishable from pure static.

Mathematically, at each step t we mix the current image with a little noise: q(xt | xt-1) = N(xt; √(1-βt) xt-1, βt I). The noise schedule βt controls how fast the signal is destroyed.

q(xt | x0) = N(xt; √ᾱt x0, (1 - ᾱt) I)
Nice property: We can jump directly to any timestep t without computing all intermediate steps! ᾱt = ∏αs is the cumulative product of (1-βs). This makes training efficient.
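As a minimal sketch of that cumulative product, the snippet below builds ᾱ for a linear β schedule in plain Python. The function names and the choice of the DDPM-style range (β from 0.0001 to 0.02) are assumptions made for illustration, not part of the lesson's own code.

```python
def linear_betas(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced beta_1..beta_T (assumed DDPM-style range)."""
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]


def cumulative_alpha_bar(betas):
    """alpha_bar[i] = product of (1 - beta) over the first i + 1 steps."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


alpha_bar = cumulative_alpha_bar(linear_betas())
# Index i holds the value for timestep t = i + 1.
for t in (1, 250, 500, 750, 1000):
    print(f"t={t:4d}  alpha_bar={alpha_bar[t - 1]:.4f}")
```

Because ᾱ is precomputed once, a training step only needs a table lookup to noise an image to any timestep.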

The Exact Tensor Operation

In code, the forward process is a single line. Given a clean image x0 and noise ε ~ N(0, I), both tensors of shape [batch, C, H, W]:

xt = √ᾱt · x0 + √(1 - ᾱt) · ε

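Let's plug in real numbers. The values below are approximate, for the linear DDPM schedule (βt from 0.0001 to 0.02) with T=1000:

| Timestep t | ᾱt | √ᾱt (signal coeff) | √(1-ᾱt) (noise coeff) | Signal % |
|---|---|---|---|---|
| 0 | 1.00 | 1.00 | 0.00 | 100% |
| 250 | 0.52 | 0.72 | 0.69 | 52% |
| 500 | 0.08 | 0.28 | 0.96 | 8% |
| 750 | 0.003 | 0.06 | 1.00 | 0.3% |
| 1000 | ≈0 | ≈0 | 1.00 | 0% |

Read this table carefully. At t=500, x500 ≈ 0.28 · image + 0.96 · noise. That's mostly noise; the image is barely a ghost. By t=750, the signal coefficient is about 0.06 and the image is effectively gone. This is why the denoiser's job is so hard at large t: it must recover signal from near-pure noise. (A cosine schedule decays more slowly: with ᾱt = cos²(πt/2T), ᾱ500 = 0.5.)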
python
# The entire forward process in PyTorch
import torch

def forward_process(x_0, t, alpha_bar):
    """x_0: [B, C, H, W], t: [B] (integer indices), alpha_bar: [T]"""
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)  # [B] -> [B, 1, 1, 1] so it broadcasts
    eps = torch.randn_like(x_0)            # noise ~ N(0, I)
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps
    return x_t, eps
Interactive: Watch an Image Dissolve

Drag the slider to add noise. At t=0 you see the original signal. At t=1000 it's pure static.

Check: What does the forward process do to an image?
🔨 Derivation Derive the closed-form q(x_t | x_0)

The forward process defines q(xt | xt-1) = N(xt; √(1-βt) xt-1, βt I). This means each step depends on the previous one. But we claimed we can jump directly to any timestep t from x0.

Your task: Starting from the single-step definition, derive q(xt | x0) = N(xt; √ᾱt x0, (1 - ᾱt) I) where ᾱt = ∏s=1t (1 - βs).

If X ~ N(μ1, σ1²) and Y ~ N(μ2, σ2²) are independent, then aX + bY ~ N(aμ1 + bμ2, a²σ1² + b²σ2²). Use the reparameterization: xt = √(1-βt) xt-1 + √βt εt.
Write αt = 1 - βt. Substitute xt-1 = √αt-1 xt-2 + √(1-αt-1) εt-1 into xt = √αt xt-1 + √(1-αt) εt. Collect the xt-2 coefficient (√(αtαt-1)) and the noise variance (1 - αtαt-1). The pattern emerges!
After two steps, variance = 1 - αtαt-1. This works because independent Gaussians with variances σ1² and σ2² sum to a Gaussian with variance σ1² + σ2². Verify: αt(1-αt-1) + (1-αt) = 1 - αtαt-1.

Full derivation:

Step 1: Define αt = 1 - βt. Reparameterize: xt = √αt xt-1 + √(1-αt) εt.

Step 2: Expand one more step: xt = √αt(√αt-1 xt-2 + √(1-αt-1) εt-1) + √(1-αt) εt

= √(αtαt-1) xt-2 + √(αt(1-αt-1)) εt-1 + √(1-αt) εt

Step 3: The two noise terms are independent Gaussians. Their sum has variance: αt(1-αt-1) + (1-αt) = 1 - αtαt-1.

Step 4: By induction, after t steps: xt = √(α1α2...αt) x0 + √(1 - α1α2...αt) ε.

Step 5: Define ᾱt = ∏s=1t αs. Then q(xt | x0) = N(xt; √ᾱt x0, (1 - ᾱt) I). ■

The key insight: The sum of independent Gaussians is Gaussian. Because each step is linear in xt-1 plus independent noise, the entire chain telescopes into a single Gaussian centered at a scaled version of x0.
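The induction step can be checked numerically. The sketch below (plain Python, with an arbitrary illustrative β list and a hypothetical function name) tracks the x0 coefficient and the accumulated noise variance one step at a time, then compares them with the closed-form expressions √ᾱt and 1 - ᾱt.

```python
import math


def check_closed_form(betas):
    """Track the x_0 coefficient and noise variance step by step,
    alongside the running product alpha_bar."""
    signal_coeff, noise_var, alpha_bar = 1.0, 0.0, 1.0
    for beta in betas:
        alpha = 1.0 - beta
        # x_t = sqrt(alpha) * x_{t-1} + sqrt(1 - alpha) * eps_t
        signal_coeff *= math.sqrt(alpha)
        # Old noise is scaled by sqrt(alpha); independent variances add.
        noise_var = alpha * noise_var + (1.0 - alpha)
        alpha_bar *= alpha
    return signal_coeff, noise_var, alpha_bar


betas = [0.01, 0.02, 0.05, 0.1]  # arbitrary illustrative schedule
signal_coeff, noise_var, alpha_bar = check_closed_form(betas)
print(signal_coeff, math.sqrt(alpha_bar))  # step-by-step vs closed form
print(noise_var, 1.0 - alpha_bar)          # step-by-step vs closed form
```

The two numbers on each printed line agree up to floating-point rounding, which is the telescoping argument in executable form.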

💻 Build It Implement Forward Diffusion from Scratch
You now know the closed-form forward process. Implement a function that takes a clean image, a noise schedule, and a timestep, and returns the noisy image plus the noise that was added (needed for training).
signature
def forward_diffusion(x_0: Tensor, t: Tensor, alpha_bar: Tensor) -> Tuple[Tensor, Tensor]:
    """
    Args:
        x_0: Clean images, shape [B, C, H, W]
        t: Timesteps for each image in batch, shape [B], values in [0, T-1]
        alpha_bar: Cumulative product schedule, shape [T]
    Returns:
        x_t: Noisy images at timestep t, shape [B, C, H, W]
        eps: The noise that was added, shape [B, C, H, W]
    """
Test case
x_0 = torch.ones(1, 3, 4, 4) # all-ones image
alpha_bar = torch.linspace(0.9999, 0.001, 1000)
t = torch.tensor([500])
x_t, eps = forward_diffusion(x_0, t, alpha_bar)
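# For this toy linspace schedule alpha_bar[500] ≈ 0.50, so x_t has expected mean
# sqrt(alpha_bar[500]) ≈ 0.71 (the sample mean over 48 values will fluctuate around it)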
# x_t.shape == (1, 3, 4, 4), eps.shape == (1, 3, 4, 4)
alpha_bar[t] has shape [B]. To multiply element-wise with a [B, C, H, W] tensor, reshape it to [B, 1, 1, 1] using .view(-1, 1, 1, 1). Then the standard formula applies directly.
python
def forward_diffusion(x_0, t, alpha_bar):
    a_bar = alpha_bar[t].view(-1, 1, 1, 1)   # [B] -> [B,1,1,1]
    eps = torch.randn_like(x_0)              # sample noise
    x_t = torch.sqrt(a_bar) * x_0 + torch.sqrt(1 - a_bar) * eps
    return x_t, eps
Bonus challenge: Modify this to also return the predicted x0 given xt and epsilon (the reverse formula). This is used during sampling to estimate the clean image at any step.
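One way to approach the bonus challenge is to solve the forward formula for x0. The scalar sketch below assumes a hypothetical helper name, predict_x0; the same arithmetic applies to tensors once ᾱt is reshaped for broadcasting.

```python
import math


def predict_x0(x_t, eps, a_bar):
    """Invert x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps for x_0."""
    return (x_t - math.sqrt(1.0 - a_bar) * eps) / math.sqrt(a_bar)


# Round trip: noise a known x_0, then recover it from x_t and eps.
x_0, eps, a_bar = 0.8, -1.3, 0.5
x_t = math.sqrt(a_bar) * x_0 + math.sqrt(1.0 - a_bar) * eps
x_0_hat = predict_x0(x_t, eps, a_bar)
print(x_0, x_0_hat)
```

During sampling the true eps is unknown, so the network's prediction takes its place and the result is only an estimate of x0. The division by √ᾱt also amplifies errors when ᾱt is close to zero.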
Checkpoint — Before you move on
Explain in your own words: why does √ᾱt multiply the signal and √(1 - ᾱt) multiply the noise? What would break if we used different coefficients (say, both equal to 0.5)?
Model Answer

The coefficients are chosen so that the total variance is preserved. If x0 has unit variance and ε has unit variance, then xt = a·x0 + b·ε has variance a² + b². Setting a = √ᾱt and b = √(1-ᾱt) gives a² + b² = ᾱt + (1-ᾱt) = 1. The variance stays constant at every timestep.

If both coefficients were 0.5, the variance would be 0.25 + 0.25 = 0.5, and the signal would never fade: xT = 0.5·x0 + 0.5·ε still contains half of the image and has variance 0.5, so it is not N(0, I). The reverse process assumes we start from standard normal noise, so this mismatch would cause the generated images to have wrong brightness/contrast. The variance-preserving property, together with ᾱT ≈ 0, is what makes q(xT) ≈ N(0, I) regardless of the data distribution.
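The variance argument can be checked empirically. This is a rough Monte Carlo sketch in plain Python: it assumes unit-variance scalar "data" and noise, and the function name and sample count are illustrative choices.

```python
import random


def mixed_variance(a, b, n=20000, seed=0):
    """Empirical variance of a * x_0 + b * eps with x_0, eps ~ N(0, 1)."""
    rng = random.Random(seed)
    samples = [a * rng.gauss(0, 1) + b * rng.gauss(0, 1) for _ in range(n)]
    mean = sum(samples) / n
    return sum((s - mean) ** 2 for s in samples) / n


a_bar = 0.3
var_preserving = mixed_variance(a_bar ** 0.5, (1 - a_bar) ** 0.5)
var_half = mixed_variance(0.5, 0.5)
print(round(var_preserving, 2), round(var_half, 2))
```

The first estimate should land near 1 and the second near 0.5, matching a² + b² in each case.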

Chapter 2: The Reverse Process

If we could reverse the forward process — undo each noise step — we'd have a generative model! Start from pure noise xT ~ N(0, I) and iteratively denoise to get a clean image x0. The problem: the exact reverse q(xt-1|xt) requires knowing the full data distribution, which is what we're trying to learn.

Solution: train a neural network εθ(xt, t) to approximate the reverse step. This network takes in a noisy image and the timestep, and predicts the noise that was added. Given the predicted noise, we can estimate the slightly-less-noisy image.

Pure Noise
xT ~ N(0, I)
↓ denoise
Slightly Less Noisy
xT-1 = f(xT, εθ)
↓ denoise
...
Repeat T times
↓ denoise
Clean Image
x0

What Goes In, What Comes Out

The denoiser is typically a U-Net: an encoder-decoder with skip connections. Let's be precise about the data flow:

| Input | Shape | Description |
|---|---|---|
| Noisy image xt | [B, C, H, W] | e.g. [1, 3, 64, 64] for pixel-space, [1, 4, 64, 64] for latent |
| Timestep t | [B] | Integer in [0, T-1]. Embedded via sinusoidal encoding → [B, demb] |

| Output | Shape | Description |
|---|---|---|
| Predicted noise ε̂ | [B, C, H, W] | Same shape as input. The network predicts what noise was added. |

Why predict noise, not the clean image? Both are mathematically valid (you can derive x0 from ε and vice versa). The noise target is always drawn from N(0, I), so it has the same scale at every timestep and for every dataset. Predicting x0 directly is hardest at large t, where xt contains almost no signal, and Ho et al. (2020) reported that x0 prediction gave worse sample quality in their early experiments. This is a key engineering insight from the DDPM paper.
The U-Net architecture uses skip connections from encoder to decoder at each resolution, preserving fine spatial detail. The timestep embedding is injected via addition or FiLM conditioning at each block. Modern variants like DiT (Diffusion Transformer) replace the U-Net with a Transformer operating on image patches.
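To make the sinusoidal timestep encoding concrete, here is a minimal plain-Python sketch. The embedding size, the max_period constant, and the sine-then-cosine ordering are assumptions; real implementations differ in these details and usually pass the result through a small MLP before injecting it into each block.

```python
import math


def sinusoidal_embedding(t, dim=8, max_period=10000.0):
    """Map an integer timestep to a dim-sized vector of sines and cosines."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down toward 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


emb_0 = sinusoidal_embedding(0)
emb_500 = sinusoidal_embedding(500)
print(emb_0)
print([round(v, 3) for v in emb_500])
```

Each timestep gets a distinct, bounded vector, which gives the network a smooth signal for how noisy its input is.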
Check: What does the denoiser network predict?

Chapter 3: Training the Denoiser

Training is surprisingly simple. For each training step: (1) pick a random image x0 from the dataset, (2) pick a random timestep t, (3) sample noise ε ~ N(0,I), (4) create the noisy image xt = √ᾱt x0 + √(1-ᾱt) ε, and (5) train the network to predict ε from xt and t.

L = Et,x0,ε [ || ε - εθ(xt, t) ||² ]

That's it. Plain MSE loss between the true noise and the predicted noise. No adversarial training, and far less exposure to the mode collapse and training instability that GANs are known for. This simplicity is a huge reason diffusion models won.

The Complete Training Loop

Here's the full training loop. Read it carefully — there are no hidden steps:

python
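# Assumes unet, optimizer, dataloader, alpha_bar, and T are already defined.
import torch
import torch.nn.functional as F

for x_0 in dataloader:                     # 1. Sample x_0 from data
    B = x_0.shape[0]                       # batch size
    t = torch.randint(0, T, (B,), device=x_0.device)  # 2. Sample t ~ Uniform{0..T-1}
    eps = torch.randn_like(x_0)            # 3. Sample epsilon ~ N(0, I)

    # 4. Compute x_t (the noisy version); alpha_bar must live on x_0's device
    a = alpha_bar[t].view(-1,1,1,1)
    x_t = torch.sqrt(a) * x_0 + torch.sqrt(1-a) * eps

    # 5. Predict noise, compute loss
    eps_hat = unet(x_t, t)                 # [B, C, H, W] -> [B, C, H, W]
    loss = F.mse_loss(eps_hat, eps)         # That's it. MSE on noise.

    optimizer.zero_grad()                  # clear gradients from the previous step
    loss.backward()
    optimizer.step()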
This is the ENTIRE training loop. No discriminator, no ELBO computation, no reconstruction loss, no KL term. Sample, corrupt, predict, MSE. The theoretical justification is deep (it optimizes a variational bound on log-likelihood), but the implementation is dead simple. This is why diffusion models scaled to production so quickly.
1. Sample
Pick random x0, t, ε
2. Corrupt
xt = √ᾱt x0 + √(1-ᾱt) ε
3. Predict
ε̂ = εθ(xt, t)
4. Loss
L = || ε - ε̂ ||²
Why MSE? It can be shown that minimizing this simple noise-prediction MSE is equivalent to optimizing a variational bound on the data log-likelihood. The theory is deep, but the practice is dead simple.
Check: What loss function is used to train a diffusion model?
💥 Break-It Lab What Dies When You Remove Training Components?
A working diffusion model trains with: (1) a cosine noise schedule, (2) U-Net skip connections, and (3) uniform timestep sampling. Toggle each one off to see what fails and why.
Remove noise schedule (use constant β)
Failure mode: With constant β (no schedule), the signal-to-noise ratio drops too fast in early steps and too slowly in late steps. The model sees mostly pure noise during training (wasting capacity on easy "predict random" tasks) and rarely sees the hard mid-noise levels where structure emerges. Loss plateaus at a high value because the model never learns the critical transition region.
Remove U-Net skip connections
Failure mode: Skip connections pass high-frequency spatial detail from encoder to decoder. Without them, the bottleneck must encode ALL spatial information, which is an impossible task at high resolution. The model outputs blurry reconstructions (loss decreases but saturates early). Fine edges, textures, and small features are lost. As in ResNets, skips also give gradients a short path through the network, but the main loss here is spatial detail.
Bias timestep sampling (only high-noise t)
Failure mode: If training only samples large t (high noise), the model learns to denoise from near-pure noise but never learns the final refinement steps (small t). Generated images have correct global structure but are blurry/noisy at fine detail. The loss appears low (predicting noise at high t is easy) but sample quality is terrible — a deceptive metric.

Chapter 4: The Math

The theoretical foundation of diffusion models rests on three pillars. You don't need them to use diffusion models, but understanding them reveals why the simple training objective actually works.

Pillar 1: The ELBO

We want to maximize log p(x0), but it's intractable. Instead we maximize a lower bound (the Evidence Lower Bound). The ELBO decomposes into a sum of KL divergences — one per timestep — each comparing the true reverse step to our learned approximation.

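log pθ(x0) ≥ Eq[ log pθ(x0|x1) ] - KL( q(xT|x0) || p(xT) ) - ∑t=2..T Eq[ KL( q(xt-1|xt,x0) || pθ(xt-1|xt) ) ]

The middle term compares q(xT|x0) with the fixed N(0, I) prior and has no learnable parameters, so training only affects the first term and the sum.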

Pillar 2: KL Divergence

KL divergence measures how different two distributions are. Since both q and pθ are Gaussian, the KL has a closed form. It reduces to comparing means — which becomes the simple MSE loss.

Pillar 3: Score Function

The score is ∇x log p(x), a vector pointing toward higher-density regions. Denoising is equivalent to estimating the score of the noisy marginal pt: εθ(xt, t) ≈ -√(1-ᾱt) · ∇x log pt(xt), so the predicted noise is a rescaled, sign-flipped score. This connection to score matching is why diffusion models are sometimes called "score-based generative models."
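The link between noise prediction and the score can be verified in a toy case where both are available in closed form. The sketch below assumes one-dimensional data x0 ~ N(mean, std²), so the noisy marginal is Gaussian with variance ᾱt·std² + (1 - ᾱt). The function names and numbers are illustrative.

```python
import math


def marginal_score(x_t, a_bar, mean=0.0, std=1.0):
    """Score of p_t(x_t) when x_0 ~ N(mean, std^2) in one dimension."""
    var_t = a_bar * std ** 2 + (1.0 - a_bar)
    return -(x_t - math.sqrt(a_bar) * mean) / var_t


def optimal_eps(x_t, a_bar, mean=0.0, std=1.0):
    """E[eps | x_t] for the same toy data, from joint-Gaussian conditioning."""
    var_t = a_bar * std ** 2 + (1.0 - a_bar)
    return math.sqrt(1.0 - a_bar) * (x_t - math.sqrt(a_bar) * mean) / var_t


x_t, a_bar = 1.7, 0.4
lhs = optimal_eps(x_t, a_bar, mean=2.0, std=0.5)
rhs = -math.sqrt(1.0 - a_bar) * marginal_score(x_t, a_bar, mean=2.0, std=0.5)
print(lhs, rhs)
```

In this toy setting the best possible noise prediction equals the score multiplied by -√(1 - ᾱt), which is the proportionality the text refers to.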

The punchline: ELBO → sum of KL terms → Gaussian KL → MSE on means → noise prediction MSE. Five steps of math justify the simplest possible training loop.
Score Field Visualization

Arrows show ∇ log p(x), pointing toward the data distribution (two clusters). The score field guides sampling.

🔨 Derivation From ELBO to Simple MSE Loss

We stated that the ELBO decomposes into per-timestep KL terms, and that these KL terms reduce to the simple MSE loss on noise. This is the most important derivation in diffusion models — it justifies why such a simple training objective actually maximizes log-likelihood.

Your task: Show that minimizing KL(q(xt-1|xt,x0) || pθ(xt-1|xt)) reduces to minimizing || ε - εθ(xt, t) ||².

By Bayes' rule: q(xt-1|xt,x0) ∝ q(xt|xt-1) · q(xt-1|x0). Both factors are Gaussian, so the product is Gaussian. Its mean is μ̃t = (√αt(1-ᾱt-1)xt + √ᾱt-1βtx0) / (1-ᾱt).
When both distributions have the same (fixed) variance σ̃t², the KL simplifies to: KL = (1/2σ̃t²) || μ̃t - μθ ||². So we just need to match the means! The model pθ predicts μθ(xt, t).
Express μ̃t in terms of xt and ε. Then parameterize μθ the same way but with εθ replacing the true ε. The difference || μ̃t - μθ ||² becomes proportional to || ε - εθ ||².

Full derivation:

Step 1: The posterior q(xt-1|xt,x0) is Gaussian with mean μ̃t(xt, x0) and variance σ̃t² (both computable in closed form).

Step 2: We parameterize pθ(xt-1|xt) as Gaussian with the same variance σ̃t² but learned mean μθ(xt, t).

Step 3: KL between same-variance Gaussians: KL = (1/2σ̃t²) || μ̃t - μθ ||².

Step 4: Rewrite the true mean using the reparameterization xt = √ᾱt x0 + √(1-ᾱt) ε: μ̃t = (1/√αt)(xt - βt/√(1-ᾱt) · ε)

Step 5: Parameterize the model mean the same way: μθ = (1/√αt)(xt - βt/√(1-ᾱt) · εθ(xt, t))

Step 6: Subtract: || μ̃t - μθ ||² = (βt² / (αt(1-ᾱt))) · || ε - εθ(xt, t) ||²

Step 7: Drop the t-dependent weighting (DDPM paper shows this works better empirically): Lsimple = E || ε - εθ(xt, t) ||². ■

The key insight: The theoretically justified loss has t-dependent weights, and under the usual schedules those weights are largest at small t, where very little noise has been added. Ho et al. 2020 found that dropping the weights gives better samples. Relative to the weighted bound, the "simple" loss down-weights these easy low-noise terms, so the network can spend more of its capacity on the harder denoising tasks at larger t.
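To see how the dropped weights behave, the sketch below evaluates the factor from Steps 3 and 6, βt² / (2σ̃t² αt(1 - ᾱt)), for an assumed linear β schedule. It uses the standard posterior variance σ̃t² = βt(1 - ᾱt-1) / (1 - ᾱt). The function name is hypothetical, and t = 1 is skipped because the posterior variance is zero there.

```python
def elbo_weights(betas):
    """Weight on ||eps - eps_theta||^2 for t = 2..T (keys are 1-indexed t)."""
    weights, abar_prev = {}, 1.0
    for t, beta in enumerate(betas, start=1):
        alpha = 1.0 - beta
        abar = abar_prev * alpha
        if t >= 2:  # at t = 1 the posterior variance is zero
            sigma2 = beta * (1.0 - abar_prev) / (1.0 - abar)
            weights[t] = beta ** 2 / (2.0 * sigma2 * alpha * (1.0 - abar))
        abar_prev = abar
    return weights


T = 1000
# Assumed linear schedule: beta from 1e-4 to 0.02.
betas = [1e-4 + i * (0.02 - 1e-4) / (T - 1) for i in range(T)]
w = elbo_weights(betas)
for t in (2, 10, 100, 500, 1000):
    print(t, f"{w[t]:.4f}")
```

Under these assumptions the weight is far larger at the first few timesteps than anywhere else, so replacing it with a constant shifts emphasis away from the lowest-noise steps.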

🔨 Derivation Why Cosine Schedule Beats Linear

The original DDPM used a linear schedule: βt goes from 0.0001 to 0.02 linearly. The improved DDPM paper switched to a cosine schedule. Both destroy signal by t=T, but they distribute the destruction differently across timesteps.

Your task: Given that the cosine schedule defines ᾱt = cos²(πt/2T) (a simplified form; the published schedule adds a small offset so that βt is not too small near t=0), show that this produces a more uniform signal-to-noise ratio distribution across timesteps compared to linear, and explain why this helps training.

Signal-to-noise ratio = ᾱt / (1 - ᾱt). For cosine: SNR(t) = cos²(πt/2T) / sin²(πt/2T) = cot²(πt/2T). For linear: ᾱt = ∏(1-βs) ≈ exp(-∑βs), and because ∑βs grows roughly quadratically in t, ᾱt collapses quickly after the first few hundred steps.
With linear β, most of the signal is gone by t=500 (ᾱ500 ≈ 0.08). With cosine, ᾱ500 = cos²(π/4) = 0.5: exactly half signal, half noise at the midpoint. The cosine schedule is calibrated so the midpoint is actually the midpoint of destruction.

Why cosine wins:

Linear schedule problem: βt increases linearly, but ᾱt = ∏(1-βs) ≈ exp(-∑βs) decays much faster than linearly. The signal is destroyed too quickly: by roughly t ≈ 550, ᾱt is already below 0.05, so a large share of the later timesteps are "basically pure noise" and teach the model little. Training time is wasted on uninformative gradients.

Cosine schedule solution: By directly specifying ᾱt = cos²(πt/2T), we ensure: (1) ᾱ0 = 1 (pure signal), (2) ᾱT = 0 (pure noise), (3) the decay is gradual and symmetric around the midpoint. ᾱt falls almost linearly through the middle of the process and changes slowly near t=0 and t=T, so the noise levels are spread far more evenly across timesteps.

Training impact: Uniform timestep sampling (t ~ Uniform{0..T-1}) now gives much more even coverage of SNR levels. With linear schedule + uniform sampling, roughly a third to a half of the samples fall in the "already destroyed" regime (ᾱt below a few percent) and provide little learning signal.

The key insight: The noise schedule is not just "how fast to add noise"; it's a curriculum. The cosine schedule is a better curriculum because far fewer training samples are spent on uninformative noise levels.
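The comparison can be tabulated directly. This sketch uses the simplified cosine formula from the text and an assumed linear schedule (β from 0.0001 to 0.02, T=1000); the function names are illustrative.

```python
import math


def cosine_alpha_bar(t, T=1000):
    """Simplified cosine schedule from the text (no offset term)."""
    return math.cos(math.pi * t / (2 * T)) ** 2


def linear_alpha_bar(t, T=1000, beta_start=1e-4, beta_end=0.02):
    """alpha_bar after t steps of a linear beta schedule."""
    a_bar = 1.0
    for i in range(t):
        a_bar *= 1.0 - (beta_start + i * (beta_end - beta_start) / (T - 1))
    return a_bar


def snr(a_bar):
    """Signal-to-noise ratio alpha_bar / (1 - alpha_bar)."""
    return a_bar / (1.0 - a_bar)


for t in (100, 250, 500, 750, 900):
    lin, cos_ = linear_alpha_bar(t), cosine_alpha_bar(t)
    print(f"t={t:4d}  linear abar={lin:.4f} snr={snr(lin):.4f}  "
          f"cosine abar={cos_:.4f} snr={snr(cos_):.4f}")
```

Reading down the two ᾱ columns shows the linear schedule's signal collapsing well before the midpoint, while the cosine schedule still has half of it left at t=500.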

Check: What is the score function?

Chapter 5: Sampling

Once trained, we generate images by starting from noise and iteratively denoising. The original DDPM sampler uses all T=1000 steps — faithful to the theory but painfully slow (~1 minute per image).

DDIM (Denoising Diffusion Implicit Models) noticed that the reverse (sampling) process can be made deterministic without retraining the model, allowing you to skip steps. With just 50 steps, quality is close to the full 1000-step sampler. DPM-Solver treats sampling as solving an ODE and uses higher-order methods (like Runge-Kutta) to achieve great quality in 10-25 steps.

The Reverse Step in Detail

DDPM sampling computes xt-1 from xt using the predicted noise ε̂:

xt-1 = (1/√αt) · (xt - (1 - αt)/√(1 - ᾱt) · ε̂) + σt · z

where z ~ N(0, I) is fresh noise added at each step. This stochastic term is what makes DDPM slow — each step adds randomness, so you can't skip ahead.
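As a sanity check on the reverse-step formula, the scalar sketch below computes its deterministic part (everything except σt · z) and compares it with the posterior mean of q(xt-1 | xt, x0) from the Bayes-rule hint in Chapter 4. It assumes an oracle that knows the true noise; the function names and numbers are illustrative.

```python
import math


def reverse_mean(x_t, eps_hat, alpha_t, abar_t):
    """(x_t - beta_t / sqrt(1 - abar_t) * eps_hat) / sqrt(alpha_t)."""
    beta_t = 1.0 - alpha_t
    return (x_t - beta_t / math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(alpha_t)


def posterior_mean(x_t, x_0, alpha_t, abar_t, abar_prev):
    """Mean of q(x_{t-1} | x_t, x_0) from the Bayes-rule formula."""
    beta_t = 1.0 - alpha_t
    return (math.sqrt(alpha_t) * (1.0 - abar_prev) * x_t
            + math.sqrt(abar_prev) * beta_t * x_0) / (1.0 - abar_t)


alpha_t, abar_prev = 0.98, 0.6  # illustrative values
abar_t = alpha_t * abar_prev
x_0, eps = 0.5, -0.7
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps
m_eps = reverse_mean(x_t, eps, alpha_t, abar_t)
m_post = posterior_mean(x_t, x_0, alpha_t, abar_t, abar_prev)
print(m_eps, m_post)
```

With the true noise plugged in, the two means coincide, so any error in a real sampler's step comes from the network's noise estimate.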

DDIM removes this stochastic term (sets σt = 0), making the process deterministic. A deterministic ODE can be evaluated at any subset of timesteps. Instead of [1000, 999, 998, ...], DDIM evaluates at [1000, 980, 960, ...] — 50 evenly spaced steps instead of 1000.
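A deterministic DDIM update can be sketched in a few lines: estimate x0 from the current sample and predicted noise, then re-noise that estimate to the target level with the same noise. The scalar version below is an illustration under the assumption of an oracle noise prediction, not a full sampler.

```python
import math


def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """Deterministic DDIM update (sigma = 0) from noise level abar_t to abar_prev."""
    x0_pred = (x_t - math.sqrt(1.0 - abar_t) * eps_hat) / math.sqrt(abar_t)
    return math.sqrt(abar_prev) * x0_pred + math.sqrt(1.0 - abar_prev) * eps_hat


# Jump across a large gap in noise level in a single update.
x_0, eps = 0.9, 0.4
abar_t, abar_prev = 0.05, 0.60
x_t = math.sqrt(abar_t) * x_0 + math.sqrt(1.0 - abar_t) * eps
x_prev = ddim_step(x_t, eps, abar_t, abar_prev)
expected = math.sqrt(abar_prev) * x_0 + math.sqrt(1.0 - abar_prev) * eps
print(x_prev, expected)
```

The update only refers to the two noise levels, not to the steps between them, which is why a subset of timesteps can be used. With a learned network the prediction is imperfect, so larger jumps accumulate more error.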

python
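# reverse_step, ddim_step, unet, and sigma are placeholders for the pieces described above.
# DDPM: 1000 sequential steps (slow)
for t in range(T, 0, -1):
    eps_hat = unet(x_t, t)
    z = torch.randn_like(x_t) if t > 1 else torch.zeros_like(x_t)  # no fresh noise on the last step
    x_t = reverse_step(x_t, eps_hat, t) + sigma[t] * z

# DDIM: skip timesteps (fast, deterministic)
steps = list(range(1000, -1, -20))  # [1000, 980, 960, ..., 20, 0]: 50 steps
for t, t_prev in zip(steps[:-1], steps[1:]):
    eps_hat = unet(x_t, t)
    x_t = ddim_step(x_t, eps_hat, t, t_prev)  # no stochastic term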

| Sampler | Steps | Speed | Quality |
|---|---|---|---|
| DDPM | 1000 | Slow | Excellent |
| DDIM | 50 | Fast | Very good |
| DPM-Solver | 15-25 | Very fast | Excellent |
| DPM-Solver++ | 10-20 | Very fast | Excellent |

Interactive: Step Count vs Quality

Watch a 1D distribution emerge from noise. More steps = smoother convergence. Fewer steps = faster but noisier.

Key tradeoff: Steps ↔ quality ↔ speed. Modern samplers achieve near-perfect quality in ~20 steps, making real-time generation possible. The race is to push this even lower.
Check: Why is DDIM faster than DDPM?
⚔ Adversarial: You train a diffusion model with a uniform noise schedule (βt = 0.01 for all t) instead of cosine. Training loss converges to a low value. What do your generated samples look like?
The model trains for the same number of iterations and the final MSE loss is similar to the cosine-schedule baseline. You run inference with DDIM (50 steps) and inspect the outputs.

Chapter 6: Latent Diffusion

Diffusing directly in pixel space is expensive: a 512×512 image has ~786K dimensions. Latent Diffusion Models (LDMs) first encode the image into a compact latent space using a pretrained VAE (Variational Autoencoder), then run diffusion there.

The VAE encoder compresses the image by 8x in each spatial dimension: 512×512 → 64×64 latent. The diffusion model learns to denoise in this 64×64 space (much cheaper!), then the VAE decoder reconstructs the final image. This is exactly what Stable Diffusion does.

The Numbers That Matter

Let's count dimensions to see why this is such a huge win:

| Space | Shape | Dimensions | Ratio |
|---|---|---|---|
| Pixel (512×512×3) | [B, 3, 512, 512] | 786,432 | baseline |
| Latent (64×64×4) | [B, 4, 64, 64] | 16,384 | 48× smaller |

Every U-Net forward pass operates on a 16K-dimensional tensor instead of a 786K-dimensional one. Since the U-Net runs once per sampling step (and there are 20-50 steps), each step processes 48× fewer values. The actual speedup depends on the architecture (attention cost grows faster than linearly with the number of spatial positions), but the saving is large. Multiply that across all steps and training iterations, and you understand why latent diffusion made Stable Diffusion practical.
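The dimension count is easy to reproduce. The helper names below are hypothetical, and the 8× spatial factor and 4 latent channels are the values quoted in the text.

```python
def num_values(channels, height, width):
    """Number of scalar values in one [C, H, W] tensor."""
    return channels * height * width


def latent_shape(height, width, factor=8, latent_channels=4):
    """Latent shape assuming an 8x spatial downsample and 4 channels."""
    return latent_channels, height // factor, width // factor


pixel = num_values(3, 512, 512)
latent = num_values(*latent_shape(512, 512))
print(pixel, latent, pixel // latent)
```

Swapping in 1024 for the height and width gives the 128×128×4 latent used in the design challenge later in this chapter.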

What does the VAE discard? Imperceptible detail. The VAE is trained to reconstruct images with minimal perceptual loss. The latent space captures semantic content — shapes, colors, composition — while discarding sub-pixel noise that humans can't see. Diffusion in latent space means faster training, faster sampling, same visual quality.
Image (512×512)
786,432 dimensions
↓ VAE Encoder
Latent (64×64×4)
16,384 dimensions (~48x smaller)
↓ Diffusion here!
Denoised Latent
Still 64×64×4
↓ VAE Decoder
Generated Image
Back to 512×512
Pixel vs Latent Dimensions

Compare the computational cost. Each block represents a unit of work. Latent diffusion is dramatically cheaper.

Check: What does the VAE encoder do in Stable Diffusion?
🔗 Pattern Recognition
Compression Bottleneck — Same Pattern, Different Purpose
Latent Diffusion
VAE encoder compresses 512×512×3 → 64×64×4. Diffusion operates in this compressed space. Decoder reconstructs the image.
VAE (standalone)
Encoder q(z|x) compresses data → latent. Decoder p(x|z) reconstructs. KL term regularizes the latent to be smooth/Gaussian. → VAE & VQ-VAE lesson

Both use the same architectural pattern: force information through a narrow channel so only the essential structure survives. In standalone VAEs, this enables generation by sampling the latent. In Latent Diffusion, it enables efficient denoising by reducing dimensionality 48x. The VAE in Stable Diffusion is trained separately (frozen during diffusion training) — it's a compression utility, not a generative model.

Spot this pattern: Where else in ML do we force data through a bottleneck to separate "essential" from "noise"? (Think: autoencoders, attention's low-rank Q/K, PCA...)

🏗 Design Challenge You're the Architect: 1024×1024 in 2 Seconds
A startup asks you to build a text-to-image system that generates 1024×1024 images in under 2 seconds on a single A100 GPU (80GB VRAM). The images must be photorealistic quality. You can use any published technique.
Resolution
1024 × 1024 px
Latency
< 2 seconds end-to-end
Hardware
1× A100 80GB
Quality
Photorealistic (FID < 10)
1. Pixel-space or latent-space diffusion? If latent, what spatial compression factor?
2. How many denoising steps can you afford? (Estimate: one U-Net pass at 128×128×4 latent ≈ 40ms on A100)
3. U-Net or DiT (Diffusion Transformer)? What are the compute/quality tradeoffs at this scale?
4. Would you use distillation (fewer steps, same quality) or consistency models (1-2 steps)?

Real-world solutions (as of 2024-25):

SDXL Turbo / SD Turbo: Adversarial distillation reduces to 1-4 steps. 1024px in ~1s. Uses latent space (128×128×4 for SDXL). Quality is good but not quite full-step baseline.

LCM (Latent Consistency Model): 4-8 steps with consistency distillation. 512px in ~0.5s, 1024px in ~1.5s. Matches full-step quality closely.

The math: 1024px → 8x compression = 128×128×4 latent. At ~40ms/step on A100 with ~50ms reserved for the VAE decode, the budget is (2000ms - 50ms) / 40ms ≈ 48 steps max. But with LCM distillation, 8 steps suffice: 8 × 40ms = 320ms for diffusion + 50ms VAE decode = 370ms total. As the step count keeps shrinking, the fixed VAE decode becomes a growing share of the latency at high resolution.

DiT vs U-Net: DiT (used in SD3, Flux) scales better with compute but each step is slightly slower (~60ms at this scale). The quality/step ratio is higher though, so 20 DiT steps ≈ 50 U-Net steps quality-wise.
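The latency arithmetic above can be wrapped in two small helpers for trying other budgets. The per-step and VAE timings are the rough estimates quoted in this challenge, not measurements, and the function names are hypothetical.

```python
def max_steps(budget_ms, step_ms, fixed_ms):
    """Largest whole number of denoising steps that fits the latency budget."""
    return int((budget_ms - fixed_ms) // step_ms)


def total_latency(steps, step_ms, fixed_ms):
    """Total time for the given number of steps plus the fixed decode cost."""
    return steps * step_ms + fixed_ms


budget_steps = max_steps(2000, 40, 50)   # U-Net estimate: ~40 ms per step
lcm_latency = total_latency(8, 40, 50)   # 8 distilled steps
dit_steps = max_steps(2000, 60, 50)      # DiT estimate: ~60 ms per step
print(budget_steps, lcm_latency, dit_steps)
```

If classifier-free guidance is used and the two passes are not batched together, the per-step cost roughly doubles and the step budget halves accordingly.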

Chapter 7: Conditioning

Unconditional generation is impressive but not useful. We want to say "a cat wearing a top hat" and get that image. Conditioning injects a text prompt into the denoising process.

The pipeline: (1) A text encoder (typically CLIP) converts the prompt into a sequence of token embeddings. (2) These embeddings are injected into the U-Net via cross-attention layers: the noisy image features attend to the text tokens. The network learns to denoise differently depending on the text.

Text Prompt
"a cat wearing a top hat"
↓ CLIP encoder
Text Embedding
77 tokens × 768 dims
↓ cross-attention
U-Net Denoiser
Noisy latent + text → predicted noise
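The cross-attention step in this pipeline can be sketched without any framework. The version below is single-head and omits the learned query/key/value projections that real layers apply, so it only shows the data flow: each image position forms a weighted average of the text token vectors. All names and numbers are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """queries: image tokens [N, d]; keys/values: text tokens [M, d].
    Returns attended features [N, d] and attention weights [N, M]."""
    d = len(queries[0])
    weights = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights.append(softmax(scores))
    out = [[sum(w * v[j] for w, v in zip(row, values)) for j in range(d)]
           for row in weights]
    return out, weights


image_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 spatial positions
text_tokens = [[2.0, 0.0], [0.0, 2.0]]               # 2 prompt tokens
out, attn = cross_attention(image_tokens, text_tokens, text_tokens)
print(attn)
```

Each row of attn sums to 1 and says how much one image position draws from each prompt token. The output keeps the image's token count, so it can be added back into the U-Net features.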

Classifier-Free Guidance (CFG)

During training, the text condition is randomly dropped (replaced with empty text) some percentage of the time. At inference, we compute both the conditional and unconditional noise predictions, then amplify the difference:

ε̂ = εuncond + w · (εcond - εuncond)

The guidance scale w (typically 7-12) controls how strongly the model follows the prompt. Higher w = more adherence to text but less diversity and potential artifacts.

How CFG Actually Works

During training, the text condition is randomly replaced with an empty string ∅ about 10-20% of the time. This teaches the network to predict noise both with and without the prompt. At inference, every denoising step runs the U-Net twice:

python
# Every sampling step runs TWO forward passes
eps_uncond = unet(x_t, t, text="")      # unconditional prediction
eps_cond   = unet(x_t, t, text=prompt)   # conditional prediction

# Amplify the difference
eps_guided = eps_uncond + w * (eps_cond - eps_uncond)

# w=1.0: just use conditional (no guidance)
# w=7.5: standard for text-to-image (SD 1.5, SDXL)
# w=15+: very strong adherence, but artifacts/saturation
Why two forward passes? The difference (eps_cond - eps_uncond) isolates the effect of the text prompt. Multiplying by w amplifies that effect. Think of it as "take what the prompt contributes and crank it up." This is why CFG doubles the compute cost per step — but the quality improvement is worth it.
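To get a feel for the guidance formula, the sketch below applies it to small hand-picked vectors standing in for the two U-Net outputs. The numbers are made up for illustration and the function name is hypothetical.

```python
def cfg_combine(eps_uncond, eps_cond, w):
    """eps_uncond + w * (eps_cond - eps_uncond), applied elementwise."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.10, -0.20, 0.30]  # stand-in for the unconditional prediction
eps_cond = [0.30, -0.10, 0.00]    # stand-in for the conditional prediction
for w in (0.0, 1.0, 7.5):
    print(w, cfg_combine(eps_uncond, eps_cond, w))
```

At w = 0 the result is the unconditional prediction, at w = 1 it is the conditional one, and larger w extrapolates past the conditional prediction along the same direction.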
Interactive: CFG Scale

See how classifier-free guidance amplifies the conditional signal. Low w = generic. High w = strongly steered (but may overshoot).

Check: What happens when you increase the CFG scale?
⚔ Adversarial: You set CFG scale w = 0 (not 1, not 7.5, but exactly zero). What happens to the guided prediction ε̂ = εuncond + 0 · (εcond - εuncond)?
You're generating with Stable Diffusion and accidentally set the guidance scale to 0 in your config. The model runs without errors. You inspect the output.

Chapter 8: ControlNet & Adapters

Text alone is often insufficient. You might want to specify a precise pose, edge map, or depth layout. ControlNet adds a parallel encoder that takes a spatial condition (like a Canny edge image) and injects it into the U-Net's skip connections.

The genius: the original Stable Diffusion weights are frozen. ControlNet trains a copy of the encoder that learns to translate spatial signals. This preserves the base model's quality while adding precise spatial control.

| Control Type | Input | What It Controls |
|---|---|---|
| Canny | Edge map | Outline / structure |
| Depth | Depth map | 3D layout, foreground/background |
| OpenPose | Skeleton keypoints | Human pose |
| Segmentation | Semantic map | Region content types |
| IP-Adapter | Reference image | Style and subject transfer |

Spatial Condition
Edge map, depth, pose, etc.
↓ ControlNet encoder (trainable copy)
Skip Connection Residuals
Added to frozen U-Net features
↓ Combined with text conditioning
Controlled Output
Follows both text and spatial layout
Other adapters: LoRA (Low-Rank Adaptation) finetunes the model with tiny weight matrices for style or subject. T2I-Adapter is a lightweight alternative to ControlNet. These can be composed — multiple LoRAs + ControlNet + text prompt — for fine-grained control.
Check: Why does ControlNet freeze the original Stable Diffusion weights?

Chapter 9: The Diffusion Ecosystem

Diffusion models have evolved rapidly. Here's a map of the landscape as of mid-2025:

| Model | Year | Key Innovation |
|---|---|---|
| DDPM | 2020 | Showed diffusion can match GANs |
| DALL-E 2 | 2022 | CLIP-guided diffusion prior |
| Stable Diffusion 1.5 | 2022 | Open-source latent diffusion |
| SDXL | 2023 | Larger U-Net, dual text encoders, 1024px |
| DALL-E 3 | 2023 | Better text understanding via recaptioning |
| SD3 / SD3.5 | 2024 | MMDiT (Transformer replaces U-Net) + flow matching |
| Flux | 2024 | Rectified flow, DiT architecture, open weights |

Consistency Models

A radical departure: instead of iterating T steps, learn to jump directly from any noisy xt to x0 in a single step. Consistency models (Song et al., 2023) enforce that all points on the same denoising trajectory map to the same output. The result: 1-2 step generation with quality approaching multi-step diffusion.

The trend: Fewer steps, bigger Transformers, better text understanding, more control. The U-Net is giving way to DiT (Diffusion Transformer). Flow matching (next lesson!) is replacing the DDPM noise schedule. The field is converging on a cleaner, simpler framework.
Evolution Timeline

Major milestones in diffusion model development.

🔗 Pattern Recognition
Iterative Refinement — Diffusion as Discretized Flow
Diffusion (DDPM)
Discrete steps: xT → xT-1 → ... → x0. Each step removes a small amount of noise. The variance schedule βt controls step sizes. T=1000 small steps.
Flow Matching
Continuous ODE: dx/dt = vθ(x, t). Straight-line interpolation from noise to data. No variance schedule needed — just predict the velocity. → Flow Matching lesson

Flow matching is a continuous-time relative of diffusion. Instead of T discrete denoise steps, we solve an ODE that transports noise to data, along straight paths in the rectified-flow formulation. The "noise schedule" becomes implicit in the velocity field. SD3 and Flux already use this framework: the DDPM noise schedule is replaced by rectified flow, and sampling becomes solving an ODE with a numerical solver.
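The straight-path idea can be sketched in a few lines. The code below assumes the convention that t = 0 is noise and t = 1 is data (some papers use the reverse), and it uses an oracle velocity in place of a trained vθ, so it illustrates the geometry rather than a learned model.

```python
def interpolate(noise, data, t):
    """Point on the straight path: t = 0 is noise, t = 1 is data (assumed convention)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def target_velocity(noise, data):
    """Constant velocity along the straight path; the regression target for v_theta."""
    return [d - n for n, d in zip(noise, data)]


def euler_sample(noise, velocity_fn, steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x, dt = list(noise), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise, data = [0.5, -1.2], [2.0, 1.0]
oracle = lambda x, t: target_velocity(noise, data)  # stands in for a trained v_theta
one_step = euler_sample(noise, oracle, steps=1)
four_steps = euler_sample(noise, oracle, steps=4)
print(one_step, four_steps)
```

Because the oracle velocity is constant along the path, one Euler step and four Euler steps land on the same point. A learned velocity field is only approximately straight, which is why real samplers still take several steps.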

The same "iterative refinement" pattern appears in: Kalman filter (predict/update cycles), policy gradient (improve policy each epoch), EM algorithm (E-step/M-step). What makes diffusion's version unique? (Hint: what's being refined, and what's the "ground truth"?)

"What I cannot create, I do not understand."
— Richard Feynman

You now understand how diffusion models create. From noise to structure, one step at a time.