The engine behind Stable Diffusion, DALL-E, and modern image generation. Learn how neural networks learn to create by learning to denoise.
Generative modeling has a deceptively simple goal: given a dataset of images (faces, landscapes, cats), learn the underlying probability distribution p(x) and then draw new samples from it. A perfect generative model would produce images indistinguishable from real photographs.
Why is this hard? Because an image is a point in an astronomically high-dimensional space. A 512×512 RGB image lives in a space with 786,432 dimensions. The "real image" manifold is a tiny, twisted surface in that vast void. Random points are just static.
Each cell is a random pixel grid. Pure noise has no structure. Generation means learning to place every pixel in just the right spot.
The key insight of diffusion models: destruction is easy, creation is hard. The forward process takes a real image and gradually adds Gaussian noise over T steps (typically T=1000). At each step, the image gets a little noisier, until at step T it's indistinguishable from pure static.
Mathematically, at each step t we mix the current image with a little noise: q(x_t | x_{t-1}) = N(x_t; √(1-β_t) x_{t-1}, β_t I). The noise schedule β_t controls how fast the signal is destroyed.
Drag the slider to add noise. At t=0 you see the original signal. At t=1000 it's pure static.
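The forward process is short enough to write out directly. A minimal NumPy sketch, assuming the linear β schedule from the original DDPM paper (`forward_step` is an illustrative name, not a library API):

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    # q(x_t | x_{t-1}) = N(x_t; sqrt(1 - beta_t) * x_{t-1}, beta_t * I)
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule (the DDPM default)
x0 = rng.standard_normal(1024)          # stand-in for a flattened image
x = x0
for beta in betas:
    x = forward_step(x, beta, rng)
# after all 1000 steps, x is statistically indistinguishable from unit Gaussian noise
```

Note the fixed point: each step keeps the variance near 1 while shrinking the signal, so by step T almost nothing of x0 survives.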
If we could reverse the forward process — undo each noise step — we'd have a generative model! Start from pure noise x_T ~ N(0, I) and iteratively denoise to get a clean image x_0. The problem: the exact reverse q(x_{t-1} | x_t) requires knowing the full data distribution, which is what we're trying to learn.
Solution: train a neural network ε_θ(x_t, t) to approximate the reverse step. This network takes in a noisy image and the timestep, and predicts the noise that was added. Given the predicted noise, we can estimate the slightly-less-noisy image.
Training is surprisingly simple. For each training step: (1) pick a random image x_0 from the dataset, (2) pick a random timestep t, (3) sample noise ε ~ N(0,I), (4) create the noisy image x_t = √ᾱ_t x_0 + √(1-ᾱ_t) ε, and (5) train the network to predict ε from x_t and t.
That's it. Plain MSE loss between the true noise and the predicted noise. No adversarial training, no mode collapse, no training instability. This simplicity is a huge reason diffusion models won.
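The five steps translate almost line-for-line into code. A minimal NumPy sketch, assuming a linear β schedule and using a placeholder in place of a real network (`training_step` and `predict_noise` are illustrative names):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)      # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)     # cumulative products give the closed form

def training_step(x0, predict_noise):
    t = int(rng.integers(T))                                  # (2) random timestep
    eps = rng.standard_normal(x0.shape)                       # (3) sample noise
    x_t = np.sqrt(alpha_bar[t]) * x0 \
        + np.sqrt(1 - alpha_bar[t]) * eps                     # (4) noisy image
    eps_hat = predict_noise(x_t, t)                           # (5) network's guess
    return float(np.mean((eps - eps_hat) ** 2))               # plain MSE loss

x0 = rng.standard_normal(64)   # stand-in for one training image
# a placeholder "network" that always predicts zero noise
loss = training_step(x0, predict_noise=lambda x_t, t: np.zeros_like(x_t))
```

In a real system `predict_noise` would be a U-Net or Transformer, and the loss would be backpropagated; everything else stays this simple.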
The theoretical foundation of diffusion models rests on three pillars. You don't need them to use diffusion models, but understanding them reveals why the simple training objective actually works.
We want to maximize log p(x0), but it's intractable. Instead we maximize a lower bound (the Evidence Lower Bound). The ELBO decomposes into a sum of KL divergences — one per timestep — each comparing the true reverse step to our learned approximation.
KL divergence measures how different two distributions are. Since both q and pθ are Gaussian, the KL has a closed form. It reduces to comparing means — which becomes the simple MSE loss.
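Because both Gaussians share the same fixed variance, the closed-form KL really is just a scaled squared distance between means. A small numerical illustration (`kl_gaussians` is a hypothetical helper and the means are made up):

```python
import numpy as np

def kl_gaussians(mu1, mu2, sigma):
    # KL( N(mu1, sigma^2 I) || N(mu2, sigma^2 I) ) for equal-variance
    # diagonal Gaussians collapses to a scaled squared error between means
    return float(np.sum((mu1 - mu2) ** 2) / (2.0 * sigma ** 2))

mu_true = np.array([0.5, -1.0, 2.0])   # mean of the true reverse step
mu_pred = np.array([0.4, -0.9, 2.1])   # mean predicted by the network
kl = kl_gaussians(mu_true, mu_pred, sigma=0.1)
# up to the 1/(2σ²) factor, this is exactly the MSE between the two means
```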
The score is ∇_x log p(x) — a vector pointing toward higher-density regions. Denoising is equivalent to estimating the score: ε_θ(x_t, t) ∝ -∇_x log p(x_t). This connection to score matching is why diffusion models are sometimes called "score-based generative models."
Arrows show ∇ log p(x), pointing toward the data distribution (two clusters). The score field guides sampling.
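To make the score concrete, here is the single-Gaussian special case, where it has a closed form, checked against a finite-difference gradient of log p (a toy sketch; μ and σ are arbitrary):

```python
import numpy as np

MU, SIGMA = 3.0, 1.0

def log_p(x):
    # log density of a 1D Gaussian N(MU, SIGMA^2)
    return -0.5 * ((x - MU) / SIGMA) ** 2 - np.log(SIGMA * np.sqrt(2 * np.pi))

def score(x):
    # closed-form score: grad_x log p(x) = (MU - x) / SIGMA^2
    return (MU - x) / SIGMA ** 2

x = 0.0                                             # a point far below the mean
s = score(x)                                        # positive: points toward MU = 3
fd = (log_p(x + 1e-5) - log_p(x - 1e-5)) / 2e-5     # finite-difference check
```

Following the score uphill moves a sample toward the data; the denoising network learns this vector field for every noise level at once.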
Once trained, we generate images by starting from noise and iteratively denoising. The original DDPM sampler uses all T=1000 steps — faithful to the theory but painfully slow (~1 minute per image).
DDIM (Denoising Diffusion Implicit Models) showed that the same trained network supports a non-Markovian formulation whose reverse process is deterministic, which lets you skip most of the steps. With just 50 steps, quality is nearly identical. DPM-Solver treats sampling as solving an ODE and uses higher-order methods (like Runge-Kutta) to achieve great quality in 10-25 steps.
| Sampler | Steps | Speed | Quality |
|---|---|---|---|
| DDPM | 1000 | Slow | Excellent |
| DDIM | 50 | Fast | Very good |
| DPM-Solver | 15-25 | Very fast | Excellent |
| DPM-Solver++ | 10-20 | Very fast | Excellent |
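A sketch of the deterministic DDIM update (η = 0), run with 50 of the 1000 timesteps. To keep it self-contained, the trained network is replaced by an oracle that knows a toy, single-point data distribution — `ddim_step` and `oracle_eps` are illustrative names, not a library API:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))

def ddim_step(x_t, eps_hat, t, t_prev):
    # deterministic DDIM update (eta = 0): estimate x_0, then re-noise to t_prev
    x0_pred = (x_t - np.sqrt(1 - alpha_bar[t]) * eps_hat) / np.sqrt(alpha_bar[t])
    return np.sqrt(alpha_bar[t_prev]) * x0_pred \
         + np.sqrt(1 - alpha_bar[t_prev]) * eps_hat

# toy setup: the "dataset" is a single point, so the ideal noise prediction is known
x0_true = np.full(16, 2.0)
def oracle_eps(x_t, t):
    return (x_t - np.sqrt(alpha_bar[t]) * x0_true) / np.sqrt(1 - alpha_bar[t])

x = rng.standard_normal(16)                          # start from pure noise
timesteps = np.linspace(T - 1, 0, 50).astype(int)    # 50 steps instead of 1000
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, oracle_eps(x, t), t, t_prev)
# with the oracle predictor, the 50-step trajectory lands on x0_true
```

The key point: because the update is deterministic, skipping from t to a much earlier t_prev is legal, which is exactly what buys the 20× speedup in the table above.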
Watch a 1D distribution emerge from noise. More steps = smoother convergence. Fewer steps = faster but noisier.
Diffusing directly in pixel space is expensive: a 512×512 image has ~786K dimensions. Latent Diffusion Models (LDMs) first encode the image into a compact latent space using a pretrained VAE (Variational Autoencoder), then run diffusion there.
The VAE encoder compresses the image by 8x in each spatial dimension: 512×512 → 64×64 latent. The diffusion model learns to denoise in this 64×64 space (much cheaper!), then the VAE decoder reconstructs the final image. This is exactly what Stable Diffusion does.
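The savings are easy to quantify. Assuming Stable Diffusion's 4-channel latent at 8× spatial downsampling:

```python
pixel_dims = 512 * 512 * 3        # diffusing in RGB pixel space
latent_dims = 64 * 64 * 4         # SD-style latent: 8x downsampled, 4 channels
ratio = pixel_dims / latent_dims  # how much smaller the diffusion space is
print(pixel_dims, latent_dims, ratio)   # 786432 16384 48.0
```

A 48× reduction in dimensionality, before even counting the quadratic cost of attention over spatial positions.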
Compare the computational cost. Each block represents a unit of work. Latent diffusion is dramatically cheaper.
Unconditional generation is impressive but rarely what we want. We want to say "a cat wearing a top hat" and get that image. Conditioning injects a text prompt into the denoising process.
The pipeline: (1) A text encoder (typically CLIP) converts the prompt into a sequence of token embeddings. (2) These embeddings are injected into the U-Net via cross-attention layers: the noisy image features attend to the text tokens. The network learns to denoise differently depending on the text.
During training, the text condition is randomly dropped (replaced with empty text) some percentage of the time (typically around 10%). At inference, we compute both the conditional and unconditional noise predictions, then amplify the difference: ε̂ = ε_θ(x_t, ∅) + w · (ε_θ(x_t, c) - ε_θ(x_t, ∅)).
The guidance scale w (typically 7-12) controls how strongly the model follows the prompt. Higher w = more adherence to text but less diversity and potential artifacts.
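The guidance computation itself is one line. A sketch with made-up 2D noise predictions (`cfg` is an illustrative name):

```python
import numpy as np

def cfg(eps_cond, eps_uncond, w):
    # classifier-free guidance: start from the unconditional prediction and
    # push w times along the direction the text condition suggests
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_uncond = np.array([0.0, 0.0])   # prediction with the prompt dropped
eps_cond = np.array([1.0, -1.0])    # prediction with the prompt present
guided = cfg(eps_cond, eps_uncond, w=7.5)
# w = 1 recovers the plain conditional prediction; w > 1 amplifies the text signal
```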
See how classifier-free guidance amplifies the conditional signal. Low w = generic. High w = strongly steered (but may overshoot).
Text alone is often insufficient. You might want to specify a precise pose, edge map, or depth layout. ControlNet adds a parallel encoder that takes a spatial condition (like a Canny edge image) and injects it into the U-Net's skip connections.
The genius: the original Stable Diffusion weights are frozen. ControlNet trains a copy of the encoder that learns to translate spatial signals. This preserves the base model's quality while adding precise spatial control.
| Control Type | Input | What It Controls |
|---|---|---|
| Canny | Edge map | Outline / structure |
| Depth | Depth map | 3D layout, foreground/background |
| OpenPose | Skeleton keypoints | Human pose |
| Segmentation | Semantic map | Region content types |
| IP-Adapter | Reference image | Style and subject transfer |
Diffusion models have evolved rapidly. Here's a map of the landscape as of mid-2025:
| Model | Year | Key Innovation |
|---|---|---|
| DDPM | 2020 | Showed diffusion can match GANs |
| DALL-E 2 | 2022 | Diffusion prior over CLIP embeddings |
| Stable Diffusion 1.5 | 2022 | Open-source latent diffusion |
| SDXL | 2023 | Larger U-Net, dual text encoders, 1024px |
| DALL-E 3 | 2023 | Better text understanding via recaptioning |
| SD3 / SD3.5 | 2024 | MMDiT (Transformer replaces U-Net) + flow matching |
| Flux | 2024 | Rectified flow, DiT architecture, open weights |
A radical departure: instead of iterating T steps, learn to jump directly from any noisy x_t to x_0 in a single step. Consistency models (Song et al., 2023) enforce that all points on the same denoising trajectory map to the same output. The result: 1-2 step generation with quality approaching multi-step diffusion.
Major milestones in diffusion model development.
You now understand how diffusion models create. From noise to structure, one step at a time.