Workbook — Diffusion & Flow Matching

Diffusion & Flow Math

Every equation behind modern image generation — noise schedules, loss derivations, score functions, sampling algorithms, guidance, latent spaces, and flow matching — all solvable in-browser with instant feedback.

Prerequisites: Gaussian distributions (mean, variance, sampling) + Basic calculus (derivatives, chain rule). That's it.
10 Chapters · 55 Exercises · 5 Exercise Types

Chapter 0: Forward Process

You have a clean image x0. The diffusion forward process gradually adds Gaussian noise over T timesteps until nothing recognizable remains. The magic: you can jump to any timestep t directly without iterating through all previous steps.

At each step, a small amount of noise is added according to a schedule β1, β2, ..., βT:

Single step: q(xt | xt-1) = N(xt; √(1-βt) · xt-1, βt I)

Direct jump (closed form):
αt = 1 - βt
āt = ∏s=1..t αs = α1 · α2 · ... · αt

q(xt | x0) = N(xt; √āt · x0, (1 - āt) I)

Reparameterization: xt = √āt · x0 + √(1-āt) · ε,   ε ~ N(0, I)
The signal-to-noise interpretation. √āt scales the original signal and √(1 - āt) scales the noise. At t=0, ā0 ≈ 1 so the image is almost clean. At t=T, āT ≈ 0 so the image is pure noise. The forward process is a smooth interpolation from signal to noise.
Exercise 0.1: Compute αt Derive

Linear schedule: βt = 0.0001 + (0.02 - 0.0001) × (t-1)/(T-1), with T=1000. Compute α1 = 1 - β1.

β1 corresponds to t=1, so β1 = 0.0001 + 0.0199 × 0/999 = 0.0001.

α1
Show derivation
β1 = 0.0001 + 0.0199 × (1-1)/(1000-1) = 0.0001
α1 = 1 - 0.0001 = 0.9999

At the first timestep, almost no noise is added. The signal is preserved at 99.99%.

Exercise 0.2: āt at t=500 Derive

For the linear schedule above, βt = 0.0001 + 0.0199 × (t-1)/999. The mean of β across all 1000 steps is (0.0001 + 0.02)/2 = 0.01005. Use the approximation: log āt ≈ ∑s=1..t log(1 - βs) ≈ -∑ βs. For the first 500 steps, the average β is approximately 0.0001 + 0.0199 × 249.5/999 ≈ 0.00507. So ∑s=1..500 βs ≈ 500 × 0.00507 = 2.535. What is ā500?

ā500
Show derivation
Average β for t=1..500: β̄ = 0.0001 + 0.0199 × ((0 + 499)/2) / 999
= 0.0001 + 0.0199 × 249.5/999 = 0.0001 + 0.00497 = 0.00507
∑s=1..500 βs ≈ 500 × 0.00507 = 2.535
ā500 ≈ e^(-2.535) = 0.0793

Only about 7.9% of the original signal power remains at the midpoint. The signal coefficient √ā500 ≈ 0.282, meaning the original image is attenuated to ~28% of its intensity.

Exercise 0.3: Signal Fraction at t=500 Derive

Using ā500 ≈ 0.0793 from above, what fraction of the total variance in x500 comes from the original signal x0 (as opposed to noise)?

Hint: xt = √āt · x0 + √(1-āt) · ε. If x0 has unit variance, the signal variance is āt and noise variance is (1-āt). Signal fraction = āt / (āt + (1-āt)).

fraction (0 to 1)
Show derivation
Signal variance = āt = 0.0793
Noise variance = 1 - āt = 0.9207
Signal fraction = 0.0793 / (0.0793 + 0.9207) = 0.0793 / 1 = 0.0793

The signal fraction IS āt (since the total variance is 1). Only 7.93% of the information in x500 comes from the original image. The denoiser has to recover the image from a signal buried under ~12× more noise.

Exercise 0.4: āT at t=1000 Derive

Now compute ā1000 using the same approximation. The average β over all 1000 steps is 0.01005.

ā1000
Show derivation
∑s=1..1000 βs ≈ 1000 × 0.01005 = 10.05
ā1000 ≈ e^(-10.05) = 4.31 × 10^(-5) ≈ 0.0000431

The exact value computed numerically is ~0.0000404, slightly below the approximation because log(1-β) < -β for every step. At t=1000, √āT ≈ 0.0064, so the original signal is attenuated to about 0.64% of its intensity. xT is essentially pure Gaussian noise, which is exactly what we want: the reverse process starts from N(0, I).
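The exp(-∑β) approximations in this chapter can be checked against the exact cumulative product. The following is a minimal sketch, assuming the linear schedule from Exercise 0.1; the function names `linearBeta` and `linearAlphaBars` are illustrative and not part of the workbook.

```javascript
// Linear beta schedule from Exercise 0.1 (t is 1-indexed).
function linearBeta(t, T, betaMin = 0.0001, betaMax = 0.02) {
  return betaMin + (betaMax - betaMin) * (t - 1) / (T - 1);
}

// Exact cumulative product: alphaBars[t] = prod_{s=1..t} (1 - beta_s).
// alphaBars[0] = 1 is the empty product (the clean image).
function linearAlphaBars(T) {
  const alphaBars = [1];
  for (let t = 1; t <= T; t++) {
    alphaBars.push(alphaBars[t - 1] * (1 - linearBeta(t, T)));
  }
  return alphaBars;
}

const alphaBars = linearAlphaBars(1000);
// Compare the exact values with the exp(-sum of betas) estimates.
console.log(alphaBars[500], Math.exp(-2.535));
console.log(alphaBars[1000], Math.exp(-10.05));
```

Because log(1-β) is always slightly below -β, the exact product should come out a little smaller than each exponential estimate.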

Exercise 0.5: SNR at t=500 Trace
The signal-to-noise ratio (SNR) at timestep t is defined as SNR(t) = āt / (1 - āt). Using ā500 ≈ 0.0793, what is log10(SNR(500))?
Show derivation
SNR(500) = ā500 / (1 - ā500) = 0.0793 / 0.9207 = 0.0861
log10(0.0861) = -1.065

At t=500, the noise power is ~11.6× the signal power. The SNR is well below 1, meaning the denoiser is working with mostly noise. The log-SNR is a key quantity in diffusion theory — uniform spacing in log-SNR space corresponds to uniform difficulty for the denoiser.
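A small helper makes it easy to look at the log-SNR at any timestep. This is only a sketch of the definition SNR(t) = āt / (1 - āt) given above; the names `snr` and `log10Snr` are made up for illustration.

```javascript
// Signal-to-noise ratio for a given cumulative alpha-bar value.
function snr(alphaBar) {
  return alphaBar / (1 - alphaBar);
}

// Base-10 log of the SNR; negative when noise power exceeds signal power.
function log10Snr(alphaBar) {
  return Math.log10(snr(alphaBar));
}

// Value used in Exercise 0.5.
console.log(log10Snr(0.0793));
```

The log-SNR crosses zero exactly where āt = 0.5, which is a convenient way to locate the "half signal, half noise" point of any schedule.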

Exercise 0.6: Implement forwardProcess() Build

Write a function that takes x0 (a number), alphabar_t, and epsilon (a noise sample), then returns x_t using the reparameterization trick.

Return a single number: the noised sample x_t.
Show solution
javascript
function forwardProcess(x0, alphabar_t, epsilon) {
  return Math.sqrt(alphabar_t) * x0 + Math.sqrt(1 - alphabar_t) * epsilon;
}

Chapter 1: Noise Schedule Math

The noise schedule determines how quickly you destroy the image. Too fast and the model can't learn — the jump between adjacent timesteps is too large. Too slow and you waste compute on timesteps where nothing interesting happens. Two schedules dominate: linear and cosine.

Linear schedule:
βt = βmin + (βmax - βmin) × (t-1)/(T-1)
āt = ∏s=1..t (1 - βs)

Cosine schedule (Nichol & Dhariwal 2021):
āt = f(t) / f(0),   where f(t) = cos²( (t/T + s) / (1+s) × π/2 )
s = 0.008 (small offset to prevent βt from being too small near t = 0)

Then: βt = 1 - āt / āt-1, clipped to [0, 0.999]
Why cosine wins. The linear schedule destroys information too quickly: by the midpoint (t/T = 0.5) only about 8% of the signal power remains, and the final stretch of timesteps is almost pure noise. The cosine schedule provides a more uniform distribution of SNR across timesteps, giving the denoiser a smoother curriculum, with more time on "medium noise" where learning is most productive.
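The step from āt to per-step βt values, including the clipping, can be sketched as follows. This assumes the cosine formulas above with s = 0.008 and the [0, 0.999] clip; `cosineBetas` is an illustrative name, not a library function.

```javascript
// Cosine alpha-bar: f(t) / f(0) with f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2).
function cosineAlphaBar(t, T, s = 0.008) {
  const f = (x) => Math.cos(((x / T + s) / (1 + s)) * (Math.PI / 2)) ** 2;
  return f(t) / f(0);
}

// Per-step betas: beta_t = 1 - alphaBar_t / alphaBar_{t-1}, clipped to [0, maxBeta].
function cosineBetas(T, s = 0.008, maxBeta = 0.999) {
  const betas = [];
  for (let t = 1; t <= T; t++) {
    const beta = 1 - cosineAlphaBar(t, T, s) / cosineAlphaBar(t - 1, T, s);
    betas.push(Math.min(Math.max(beta, 0), maxBeta));
  }
  return betas;
}

const betas = cosineBetas(1000);
console.log(betas[0], betas[499], betas[999]);
```

The clip matters only at the very end of the schedule, where āT approaches zero and the unclipped βT would approach 1.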
Exercise 1.1: Linear ā250 Derive

Linear schedule: βmin=0.0001, βmax=0.02, T=1000. Compute ā250 using the approximation āt ≈ exp(-∑βs).

Average β for t=1..250: β̄ = 0.0001 + 0.0199 × ((0+249)/2)/999.

ā250
Show derivation
β̄ = 0.0001 + 0.0199 × 124.5/999 = 0.0001 + 0.00248 = 0.00258
∑s=1..250 βs ≈ 250 × 0.00258 = 0.645
ā250 ≈ e^(-0.645) = 0.525

Exact numerical computation gives ~0.524. At 25% of the way through the schedule, about half the signal remains. The approximation log(1-β) ≈ -β is accurate because each β is small (< 0.02).

Exercise 1.2: Linear ā750 Derive

Same linear schedule. Compute ā750.

ā750
Show derivation
β̄ (t=1..750) = 0.0001 + 0.0199 × ((0+749)/2)/999 = 0.0001 + 0.00746 = 0.00756
∑s=1..750 βs ≈ 750 × 0.00756 = 5.67
ā750 ≈ e^(-5.67) = 0.00345

Exact numerical value is ~0.00335. Our approximation slightly overshoots because log(1-β) ≈ -β drops the second-order term -β²/2, which accumulates as β grows (near t=750, β ≈ 0.015). The point stands: at 75% through the schedule, the signal is essentially destroyed, with only about 0.3% remaining.

Exercise 1.3: Cosine ā250 Derive

Cosine schedule with s=0.008, T=1000. Compute ā250 = f(250)/f(0), where f(t) = cos²((t/T + s)/(1+s) × π/2).

f(250) = cos²((250/1000 + 0.008)/1.008 × π/2) = cos²(0.2559 × π/2) = cos²(0.4020). f(0) = cos²((0.008/1.008) × π/2) = cos²(0.01247).

ā250
Show derivation
f(250) = cos²((0.25 + 0.008)/1.008 × π/2) = cos²(0.2559 × 1.5708) = cos²(0.4020)
cos(0.4020) = 0.9203,   f(250) = 0.9203² = 0.8469
f(0) = cos²(0.008/1.008 × π/2) = cos²(0.01247) = cos(0.01247)² ≈ 0.9999² = 0.9998
ā250 = 0.8469 / 0.9998 ≈ 0.847

Compare with the linear schedule: ā250 = 0.524 (linear) vs 0.847 (cosine). At t=250, the cosine schedule preserves 85% of the signal vs only 52% for linear. The cosine schedule is much gentler early on.

Exercise 1.4: Cosine ā750 Derive

Same cosine schedule. Compute ā750.

ā750
Show derivation
f(750) = cos²((0.75 + 0.008)/1.008 × π/2) = cos²(0.7520 × 1.5708) = cos²(1.1812)
cos(1.1812) = 0.3798,   f(750) = 0.3798² = 0.1442
ā750 = 0.1442 / 0.9998 = 0.144

Cosine: 14.4% signal at t=750. Linear: about 0.3% signal at t=750. The cosine schedule still has significant signal at 75% of the way through, which gives the model useful gradients even in the later stages. The linear schedule has already destroyed nearly everything by this point.

Exercise 1.5: Why Cosine? Trace
Given the āt values for the two schedules (Linear: (0.524, 0.079, 0.003) at t=(250, 500, 750) vs Cosine: (0.847, 0.494, 0.144) at the same points), which statement best explains why cosine produces better images?
Show explanation

Both schedules end at āT ≈ 0 (pure noise). The difference is in the distribution of difficulty. The linear schedule drops below 10% signal by the midpoint, so roughly half of its timesteps fall in the "almost noise" regime and the denoiser gets too little practice on the hardest middle range. The cosine schedule's ā500 ≈ 0.5 means roughly half the timesteps are above 50% signal and half below, a well-balanced curriculum.

Exercise 1.6: Implement cosineAlphaBar() Build

Write a function that returns āt for the cosine schedule given t, T, and s=0.008.

Return a single number between 0 and 1.
Show solution
javascript
function cosineAlphaBar(t, T) {
  const s = 0.008;
  const f = x => Math.cos((x / T + s) / (1 + s) * Math.PI / 2) ** 2;
  return f(t) / f(0);
}

Chapter 2: DDPM Loss

You're training a diffusion model and need to understand what the loss actually measures. DDPM (Ho et al. 2020) simplifies the variational bound into a stunningly simple objective: predict the noise that was added.

The simplified DDPM loss:
Lsimple = Et, x0, ε [ || ε - εθ(xt, t) ||² ]

Training procedure:
1. Sample x0 from data, t ~ Uniform{1,...,T}, ε ~ N(0, I)
2. Compute xt = √āt · x0 + √(1-āt) · ε
3. Train εθ to predict ε from xt and t
4. Loss = || ε - εθ(xt, t) ||²
Why predict noise instead of x0? Since xt = √āt · x0 + √(1-āt) · ε, predicting ε is mathematically equivalent to predicting x0 — you can recover one from the other: x0 = (xt - √(1-āt) · εθ) / √āt. Empirically, predicting noise gives more stable gradients because the noise target ε ~ N(0,1) has bounded, well-scaled values regardless of timestep.
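The equivalence between predicting ε and predicting x0 can be sketched in a few lines. This is a scalar illustration under the formulas above; `addNoise`, `recoverX0`, and `mseLoss` are hypothetical helper names.

```javascript
// Forward noising: x_t = sqrt(alphaBar) * x0 + sqrt(1 - alphaBar) * eps.
function addNoise(x0, alphaBar, eps) {
  return Math.sqrt(alphaBar) * x0 + Math.sqrt(1 - alphaBar) * eps;
}

// Invert the forward formula using a noise prediction.
function recoverX0(xt, alphaBar, epsPred) {
  return (xt - Math.sqrt(1 - alphaBar) * epsPred) / Math.sqrt(alphaBar);
}

// Mean squared error between true and predicted noise arrays.
function mseLoss(eps, epsPred) {
  let sum = 0;
  for (let i = 0; i < eps.length; i++) {
    sum += (eps[i] - epsPred[i]) ** 2;
  }
  return sum / eps.length;
}

// With a perfect noise prediction, the original x0 comes back.
const xt = addNoise(2.0, 0.25, -0.7);
console.log(recoverX0(xt, 0.25, -0.7));
```

Any error in the noise prediction is divided by √āt when mapped back to x0, which is why small ε errors become large x0 errors at high noise levels.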
Exercise 2.1: Single-Pixel Loss Derive

For a single pixel, the true noise is εtrue = 0.5 and your model predicts εpred = 0.3. What is the MSE loss contribution from this pixel?

loss
Show derivation
L = (εtrue - εpred)² = (0.5 - 0.3)² = 0.2² = 0.04
Exercise 2.2: Full Image Loss Derive

A 32×32×3 image has 3072 values. If the average per-element squared error is 0.04 (as above), what is the total (summed) loss? What about the mean loss?

Most implementations use mean over all elements, so the loss is independent of image size.

total loss (sum)
Show derivation
Total elements = 32 × 32 × 3 = 3072
Total loss (sum) = 3072 × 0.04 = 122.88
Mean loss = 122.88 / 3072 = 0.04

In practice, the mean loss is used for optimization (typically around 0.01-0.1 during training). The sum loss grows with image size — a 256×256×3 image would have 196,608 elements and sum loss of 7,864 for the same per-element error.

Exercise 2.3: Recover x0 from Noise Prediction Derive

Given xt = 1.5, āt = 0.25, and εθ = 0.8 (model's noise prediction), recover the predicted x0.

Use: x̂0 = (xt - √(1-āt) · εθ) / √āt

predicted x0
Show derivation
√āt = √0.25 = 0.5
√(1-āt) = √0.75 = 0.866
x̂0 = (1.5 - 0.866 × 0.8) / 0.5 = (1.5 - 0.693) / 0.5 = 0.807 / 0.5 = 1.614
Exercise 2.4: Noise-Prediction vs x0-Prediction Equivalence Trace
If we define x0-prediction loss as Lx0 = ||x0 - x̂0||², and substitute x̂0 = (xt - √(1-āt) · εθ) / √āt, what is the relationship between Lx0 and Lε = ||ε - εθ||²?
Show derivation
x̂0 = (xt - √(1-āt) εθ) / √āt
x0 = (xt - √(1-āt) ε) / √āt
x0 - x̂0 = √(1-āt) / √āt × (εθ - ε)
Lx0 = ||x0 - x̂0||² = (1-āt)/āt × ||ε - εθ||² = (1-āt)/āt × Lε

This is the SNR reweighting. At high noise (āt small), the x0 loss amplifies errors massively because 1/āt is huge. Noise prediction with Lsimple implicitly downweights high-noise timesteps by a factor of āt/(1-āt), which is why it produces more stable training.
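The identity Lx0 = (1-āt)/āt × Lε can be checked numerically for a single scalar. This is a sketch only; `x0LossWeight` and `compareLosses` are illustrative names.

```javascript
// Weight relating the two losses: L_x0 = weight * L_eps, with weight = 1 / SNR.
function x0LossWeight(alphaBar) {
  return (1 - alphaBar) / alphaBar;
}

// Build x_t from (x0, eps), form x0-hat from a noise prediction,
// and return both squared errors for comparison.
function compareLosses(x0, eps, epsPred, alphaBar) {
  const a = Math.sqrt(alphaBar);
  const b = Math.sqrt(1 - alphaBar);
  const xt = a * x0 + b * eps;
  const x0Hat = (xt - b * epsPred) / a;
  return {
    lossX0: (x0 - x0Hat) ** 2,
    lossEps: (eps - epsPred) ** 2,
  };
}

const { lossX0, lossEps } = compareLosses(1.0, 0.5, 0.3, 0.25);
console.log(lossX0, x0LossWeight(0.25) * lossEps);
```

Both printed numbers should agree up to floating-point rounding, for any choice of inputs with 0 < āt < 1.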

Exercise 2.5: Training Step Bug Debug

This DDPM training step has a bug. The model trains but produces blurry images. Click the buggy line.

def train_step(model, x0, t):
    noise = torch.randn_like(x0)
    alphabar_t = get_alphabar(t)
    x_t = sqrt(alphabar_t) * x0 + sqrt(alphabar_t) * noise
    pred_noise = model(x_t, t)
    loss = F.mse_loss(pred_noise, noise)
    return loss
Show explanation

Line 4 is the bug. The noise coefficient should be sqrt(1 - alphabar_t), not sqrt(alphabar_t). The correct formula is xt = √āt · x0 + √(1-āt) · ε. Using √āt for both terms means at early timesteps (āt ≈ 1) you add too much noise, and at late timesteps (āt ≈ 0) you add too little. The model learns a corrupted noise distribution, producing blurry reconstructions.

Exercise 2.6: v-Prediction Trace
Stable Diffusion v2 uses v-prediction where v = √āt · ε - √(1-āt) · x0. At t=0 (āt=1), what does v reduce to? At t=T (āt=0)?
Show derivation
At t=0: ā0 = 1 ⇒ v = √1 · ε - √0 · x0 = ε
At t=T: āT = 0 ⇒ v = √0 · ε - √1 · x0 = -x0

v-prediction smoothly interpolates: at low noise it predicts ε (like DDPM), at high noise it predicts -x0. This gives a numerically stable target at all timesteps — unlike ε-prediction which has high variance at low noise, or x0-prediction which has high variance at high noise.
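Because √āt and √(1-āt) satisfy a² + b² = 1, both x0 and ε can be recovered from xt and v by simple algebra on the definition above. The sketch below illustrates this for scalars; the function names are made up for this example.

```javascript
// v target as defined above: v = sqrt(alphaBar) * eps - sqrt(1 - alphaBar) * x0.
function vTarget(x0, eps, alphaBar) {
  return Math.sqrt(alphaBar) * eps - Math.sqrt(1 - alphaBar) * x0;
}

// Follows from x_t = a*x0 + b*eps and v = a*eps - b*x0 with a^2 + b^2 = 1:
//   x0  = a * x_t - b * v
//   eps = b * x_t + a * v
function x0FromV(xt, v, alphaBar) {
  return Math.sqrt(alphaBar) * xt - Math.sqrt(1 - alphaBar) * v;
}

function epsFromV(xt, v, alphaBar) {
  return Math.sqrt(1 - alphaBar) * xt + Math.sqrt(alphaBar) * v;
}

const alphaBar = 0.36;
const x0 = 1.2;
const eps = -0.5;
const xt = Math.sqrt(alphaBar) * x0 + Math.sqrt(1 - alphaBar) * eps;
const v = vTarget(x0, eps, alphaBar);
console.log(x0FromV(xt, v, alphaBar), epsFromV(xt, v, alphaBar));
```

Neither recovery formula divides by √āt or √(1-āt), so both stay well-conditioned at the extremes of the schedule.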

Chapter 3: Score Function

The score function is the gradient of the log-probability: ∇x log p(x). It points in the direction of increasing probability — toward the data manifold. Score-based generative models learn this gradient field and follow it to generate samples.

Score of a Gaussian:
If p(x) = N(μ, σ²), then log p(x) = -½(x-μ)²/σ² + const
∇x log p(x) = -(x - μ) / σ²

Connection to diffusion:
The noisy distribution at time t has score: ∇x log q(xt)
εθ(xt, t) ≈ -√(1-āt) × ∇xt log q(xt)

Equivalently: score ≈ -εθ / √(1-āt)
Score = noise prediction, rescaled. A diffusion model that predicts noise IS a score model in disguise. The noise predictor εθ and the score differ only by a known, t-dependent factor of -√(1-āt). This means all the theory of score-based models (Langevin dynamics, probability flow ODEs) applies directly to DDPM.
Exercise 3.1: Gaussian Score Derive

Compute the score ∇x log p(x) at x=2 for p(x) = N(0, 1).

score
Show derivation
score = -(x - μ) / σ² = -(2 - 0) / 1² = -2

The score at x=2 is -2, pointing toward the mean at x=0. The score is always a "pull" toward high-density regions. At the mean itself, the score is zero.

Exercise 3.2: Score with Larger Variance Derive

Compute the score at x=2 for p(x) = N(0, 4) (note: σ²=4, so σ=2).

score
Show derivation
score = -(x - μ) / σ² = -(2 - 0) / 4 = -0.5

The score is smaller (magnitude 0.5 vs 2). A wider distribution has a gentler gradient — being at x=2 is less "unusual" when σ=2, so the pull back toward the mean is weaker.

Exercise 3.3: Score Direction Trace
For a mixture of two Gaussians centered at x=-3 and x=+3, what is the score at x=0?
Show explanation

For a symmetric mixture p(x) = ½ N(-3, σ²) + ½ N(+3, σ²), the density p(x) is symmetric around 0, so its log is also symmetric, and the derivative at x=0 is exactly 0. The leftward pull from the left mode and rightward pull from the right mode cancel perfectly at the midpoint.
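The cancellation at the midpoint can be verified by computing the mixture score directly as p'(x) / p(x). This is a minimal sketch assuming equal weights and a shared σ; `gaussianPdf` and `mixtureScore` are illustrative names.

```javascript
// Density of a 1D Gaussian N(mu, sigma^2) at x.
function gaussianPdf(x, mu, sigma) {
  const z = (x - mu) / sigma;
  return Math.exp(-0.5 * z * z) / (sigma * Math.sqrt(2 * Math.PI));
}

// Score of p(x) = 0.5 * N(-mu, sigma^2) + 0.5 * N(+mu, sigma^2).
// Uses d/dx log p = p'(x) / p(x), where each component contributes
// its density times its own Gaussian score.
function mixtureScore(x, mu = 3, sigma = 1) {
  const left = 0.5 * gaussianPdf(x, -mu, sigma);
  const right = 0.5 * gaussianPdf(x, mu, sigma);
  const var_ = sigma * sigma;
  const dp = left * (-(x + mu) / var_) + right * (-(x - mu) / var_);
  return dp / (left + right);
}

console.log(mixtureScore(0), mixtureScore(1), mixtureScore(-1));
```

Away from the midpoint the nearer mode dominates, so the score points toward +3 for x > 0 and toward -3 for x < 0.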

Exercise 3.4: Convert ε to Score Derive

Your model predicts εθ = 1.2 at timestep t where āt = 0.36. What is the estimated score ∇x log q(xt)?

score
Show derivation
score = -εθ / √(1-āt) = -1.2 / √(1-0.36) = -1.2 / √0.64 = -1.2 / 0.8 = -1.5
Exercise 3.5: Implement gaussianScore() Build

Write a function that returns the score of a 1D Gaussian at a given point.

Return a single number.
Show solution
javascript
function gaussianScore(x, mu, sigmaSquared) {
  return -(x - mu) / sigmaSquared;
}

Chapter 4: Sampling (Reverse Process)

Training teaches the model to predict noise. Sampling runs the process backwards: start from pure noise xT ~ N(0, I) and iteratively denoise to get a clean image x0.

DDPM reverse step:
xt-1 = (1/√αt) × (xt - βt/√(1-āt) × εθ(xt, t)) + σt z
where z ~ N(0, I) and σt² = βt (or σt² = β̃t = βt(1-āt-1)/(1-āt))

DDIM (deterministic):
xt-1 = √āt-1 · x̂0 + √(1-āt-1) · εθ(xt, t)
where x̂0 = (xt - √(1-āt) · εθ) / √āt

Key difference: DDIM has no added noise z, so it's deterministic — same xT always gives same x0.
Cost = T forward passes. DDPM sampling requires T=1000 neural network evaluations, one per timestep. Each evaluation is a full forward pass of the network (the same forward cost as in a training step, but without the backward pass). This is why diffusion models are slow: generating one image costs 1000× the compute of a single forward pass. DDIM's key contribution is allowing stride > 1 in the timestep schedule.
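The two update rules above translate almost directly into code. The following is a scalar sketch, assuming σt = √βt for DDPM and the fully deterministic DDIM variant; `ddpmStep` and `ddimStep` are hypothetical names, and the noise prediction is passed in rather than computed by a model.

```javascript
// One DDPM reverse step with sigma_t = sqrt(beta_t).
// z is a standard normal sample supplied by the caller.
function ddpmStep(xt, alphaT, alphaBarT, epsPred, z) {
  const betaT = 1 - alphaT;
  const mean =
    (xt - (betaT / Math.sqrt(1 - alphaBarT)) * epsPred) / Math.sqrt(alphaT);
  return mean + Math.sqrt(betaT) * z;
}

// One deterministic DDIM step from alphaBarT to alphaBarPrev.
function ddimStep(xt, alphaBarT, alphaBarPrev, epsPred) {
  // Predicted clean sample implied by the noise prediction.
  const x0Hat =
    (xt - Math.sqrt(1 - alphaBarT) * epsPred) / Math.sqrt(alphaBarT);
  // Re-noise x0Hat to the previous level using the same predicted noise.
  return Math.sqrt(alphaBarPrev) * x0Hat + Math.sqrt(1 - alphaBarPrev) * epsPred;
}

// Inputs from Exercises 4.1 and 4.4.
console.log(ddpmStep(0.8, 0.99, 0.5, 0.6, 0.3));
console.log(ddimStep(1.0, 0.5, 0.6, 0.4));
```

Note that `ddimStep` takes two arbitrary āt values, which is what lets the sampler skip timesteps.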
Exercise 4.1: One DDPM Reverse Step Derive

Given: xt = 0.8, αt = 0.99, βt = 0.01, āt = 0.5, εθ = 0.6, z = 0.3. Compute xt-1 using σt = √βt.

xt-1
Show derivation
mean = (1/√0.99) × (0.8 - 0.01/√0.5 × 0.6)
= 1.00504 × (0.8 - 0.01414 × 0.6)
= 1.00504 × (0.8 - 0.00849) = 1.00504 × 0.7915 = 0.7955
σt = √0.01 = 0.1
xt-1 = 0.7955 + 0.1 × 0.3 = 0.7955 + 0.03 = 0.826
Exercise 4.2: Forward Passes for DDPM Trace
For T=1000 DDPM sampling to generate one 512×512 image, how many neural network forward passes are required?
Show explanation

Each reverse step t → t-1 requires one forward pass of εθ(xt, t) to predict the noise. With T=1000 steps (from t=1000 down to t=1), that's exactly 1000 forward passes. If each pass takes ~50ms on a GPU, generation takes ~50 seconds — far slower than GANs (one pass) or VAEs (one pass).

Exercise 4.3: DDIM Stride Derive

DDIM allows skipping timesteps. For T=1000 with 50 DDIM steps, what is the stride (step size in the timestep schedule)? List the first 4 and last 2 timesteps in the schedule.

stride
Show derivation
stride = T / steps = 1000 / 50 = 20
Schedule: [1000, 980, 960, 940, ..., 40, 20]

DDIM with 50 steps visits timesteps {1000, 980, 960, ...20}. Each DDIM step jumps 20 timesteps at once. This gives a 20× speedup over DDPM (50 forward passes vs 1000) with only moderate quality loss — typically imperceptible for 50+ steps.
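Building the strided timestep list is a short loop. This sketch assumes the number of steps divides T evenly, as in the exercise; `ddimTimesteps` is an illustrative name, and real schedulers offer other spacing conventions.

```javascript
// Evenly strided DDIM schedule, counting down from T.
// Assumes `steps` divides T evenly (an assumption of this sketch).
function ddimTimesteps(T, steps) {
  const stride = Math.floor(T / steps);
  const timesteps = [];
  for (let t = T; t >= stride && timesteps.length < steps; t -= stride) {
    timesteps.push(t);
  }
  return timesteps;
}

const schedule = ddimTimesteps(1000, 50);
console.log(schedule.slice(0, 4), schedule.slice(-2));
```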

Exercise 4.4: One DDIM Step Derive

Given: xt = 1.0, āt = 0.5, āt-1 = 0.6, εθ = 0.4. Compute xt-1 using DDIM (deterministic).

Step 1: x̂0 = (xt - √(1-āt) · εθ) / √āt. Step 2: xt-1 = √āt-1 · x̂0 + √(1-āt-1) · εθ.

xt-1
Show derivation
x̂0 = (1.0 - √0.5 × 0.4) / √0.5 = (1.0 - 0.2828) / 0.7071 = 0.7172 / 0.7071 = 1.0143
xt-1 = √0.6 × 1.0143 + √0.4 × 0.4 = 0.7746 × 1.0143 + 0.6325 × 0.4
= 0.7857 + 0.2530 = 1.039

Exact computation: x̂0 = (1.0 - 0.70711×0.4)/0.70711 = 0.71716/0.70711 = 1.01421. xt-1 = 0.77460×1.01421 + 0.63246×0.4 = 0.78560 + 0.25298 = 1.039. Note: DDIM with no noise is deterministic — run it again with the same xT and you get the exact same x0.

Exercise 4.5: Sampling Speedup Derive

A U-Net forward pass takes 45ms on an A100. How long does it take to generate one image with: (a) DDPM T=1000, (b) DDIM 50 steps, (c) DDIM 20 steps? Express in seconds.

DDIM-50 time (seconds)
Show derivation
(a) DDPM: 1000 × 45ms = 45,000ms = 45.0 seconds
(b) DDIM-50: 50 × 45ms = 2,250ms = 2.25 seconds
(c) DDIM-20: 20 × 45ms = 900ms = 0.90 seconds

DDIM-50 is 20× faster than DDPM with minimal quality loss. DDIM-20 pushes under 1 second but quality drops noticeably. Modern schedulers (DPM-Solver, UniPC) can match DDIM-50 quality with only 15-20 steps by using higher-order ODE solvers.

Chapter 5: Classifier-Free Guidance

You prompt a model with "a photo of a golden retriever" and the output is vaguely dog-shaped but blurry and generic. Classifier-Free Guidance (CFG) amplifies the model's response to the text prompt, trading diversity for fidelity. It's the single most impactful inference-time trick in text-to-image generation.

CFG formula:
ε̃ = εuncond + w × (εcond - εuncond)
= w × εcond + (1 - w) × εuncond

Training: Randomly drop the text condition (replace with ∅) with probability pdrop ≈ 0.1
Inference: Run the model twice per step — once with and once without the text condition
w = 0: no guidance (ε̃ = εuncond)    w = 1: standard conditioning    w = 7.5: typical SD
CFG doubles inference cost. Every sampling step requires two forward passes, one conditioned and one unconditional, which doubles the latency. Distillation techniques such as guidance distillation train a student to mimic the guided output in a single pass; distilled models like SDXL Turbo are run without CFG.
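Applied element-wise, the guidance formula is a one-liner. This sketch uses the convention above, where w = 0 gives the unconditional prediction and w = 1 the conditional one; `applyCfg` is a made-up name, and the two prediction arrays stand in for model outputs.

```javascript
// Classifier-free guidance: epsUncond + w * (epsCond - epsUncond), per element.
function applyCfg(epsCond, epsUncond, w) {
  return epsCond.map((c, i) => epsUncond[i] + w * (c - epsUncond[i]));
}

// Values from Exercise 5.1.
console.log(applyCfg([0.3, 0.7], [0.1, 0.1], 7.5));
console.log(applyCfg([0.3, 0.7], [0.1, 0.1], 0));
```

Some papers and libraries use a shifted convention in which the scale multiplies the difference added to εcond instead, so a given numeric w is not always comparable across implementations.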
Exercise 5.1: CFG Computation Derive

Given εcond = [0.3, 0.7] and εuncond = [0.1, 0.1], compute ε̃ at guidance scale w=7.5.

For each element: ε̃i = εuncond,i + 7.5 × (εcond,i - εuncond,i)

ε̃[0] (first element)
Show derivation
ε̃[0] = 0.1 + 7.5 × (0.3 - 0.1) = 0.1 + 7.5 × 0.2 = 0.1 + 1.5 = 1.6
ε̃[1] = 0.1 + 7.5 × (0.7 - 0.1) = 0.1 + 7.5 × 0.6 = 0.1 + 4.5 = 4.6

Notice how the guided noise prediction (1.6, 4.6) is much larger than the conditional prediction (0.3, 0.7). CFG amplifies the difference between conditional and unconditional, pushing the sample harder toward what the prompt describes. At w=7.5, the second element goes from 0.7 to 4.6 — a 6.6× amplification.

Exercise 5.2: w=0 Baseline Trace
What does w=0 reduce the CFG formula to?
Show derivation
ε̃ = εuncond + 0 × (εcond - εuncond) = εuncond

At w=0, you get the unconditional model — it generates random images with no regard for the text prompt. At w=1, ε̃ = εuncond + εcond - εuncond = εcond, which is just the standard conditional output. Values w > 1 push beyond the conditional prediction, amplifying the prompt's influence.

Exercise 5.3: ε̃[1] at w=7.5 Derive

Using the same εcond = [0.3, 0.7], εuncond = [0.1, 0.1], compute ε̃[1] (the second element) at w=7.5.

ε̃[1]
Show derivation
ε̃[1] = 0.1 + 7.5 × (0.7 - 0.1) = 0.1 + 4.5 = 4.6
Exercise 5.4: CFG Inference Cost Derive

DDIM with 50 steps + CFG at w=7.5. Each forward pass takes 45ms. How many total forward passes and how much time for one image?

total forward passes
Show derivation
Forward passes = 50 steps × 2 (cond + uncond) = 100
Time = 100 × 45ms = 4500ms = 4.5 seconds

CFG exactly doubles the cost compared to unguided DDIM-50 (2.25s). In practice, the two predictions can be batched (concatenate the cond and uncond inputs), so wall-clock time is less than 2× if the GPU has headroom. Still, eliminating one of the two passes via distillation is worth a lot.

Exercise 5.5: High Guidance Tradeoff Trace
As guidance scale w increases from 1 to 20, what happens to the generated images?
Show explanation

Higher w amplifies the conditional signal more aggressively. At moderate w (5-10), images closely match the prompt with good quality. At very high w (15+), the amplification pushes pixel values to extremes — colors oversaturate, edges become unnaturally sharp, and artifacts appear. It's like turning up the volume on a speaker: at first it's clearer, then it distorts. The sweet spot for Stable Diffusion is w=7-9.

Chapter 6: Latent Diffusion

Running diffusion directly on 512×512 pixels is brutally expensive — the U-Net processes 262,144 spatial tokens per step. Latent Diffusion (Rombach et al. 2022) runs diffusion in a compressed latent space instead. An autoencoder maps images to a 64×64 latent, diffusion operates there, and the decoder reconstructs the final image.

Encoder: z = E(x),   x: [512, 512, 3] → z: [64, 64, 4]
Decoder: x̂ = D(z),   z: [64, 64, 4] → x̂: [512, 512, 3]

Spatial compression: 8× in each dimension (512 → 64)
Channel change: 3 → 4 (latent has 4 channels)
Pixel compression ratio: (512×512×3) / (64×64×4) = 786432 / 16384 = 48×
48× fewer values to denoise. The U-Net now operates on 4,096 spatial positions (16,384 values) instead of 262,144 positions. Since self-attention is O(n²) in the number of positions, full-resolution attention cost drops by (262144/4096)² = 64² = 4096×, an enormous saving. This is what made Stable Diffusion practical on consumer GPUs.
Exercise 6.1: Compression Ratio Derive

Compute the pixel compression ratio for a latent diffusion model that maps 512×512×3 images to 64×64×4 latents.

× compression
Show derivation
Pixel values = 512 × 512 × 3 = 786,432
Latent values = 64 × 64 × 4 = 16,384
Ratio = 786,432 / 16,384 = 48×
Exercise 6.2: Attention FLOPs Savings Derive

Self-attention is O(n²) where n is the number of spatial tokens. How many times fewer FLOPs does attention cost in 64×64 latent space vs 512×512 pixel space?

× fewer FLOPs
Show derivation
Pixel tokens = 512 × 512 = 262,144
Latent tokens = 64 × 64 = 4,096
FLOPs ratio = (262,144)² / (4,096)² = (262,144/4,096)² = 64² = 4,096×

The full attention FLOPs are O(n² d), where d is the channel dimension; with d held fixed, the ratio depends only on the token counts, giving 64² = 4096. In practice a U-Net does not apply attention at every resolution: pixel-space models typically restrict it to heavily downsampled feature maps (such as 32×32, 16×16, 8×8), so the realized saving depends on where attention is placed. The key benefit is that the latent model can afford attention at higher resolutions (64×64 = 4096 tokens is feasible, 512×512 = 262K is not).

Exercise 6.3: Memory for the Attention Matrix Derive

For a single self-attention layer with d=320 channels at the full spatial resolution: compute the attention matrix memory in FP16 for (a) latent space 64×64 and (b) pixel space 512×512. Assume 1 head for simplicity.

MB (latent space)
Show derivation
Latent: n = 64 × 64 = 4,096 tokens
Attention matrix: 4,096 × 4,096 = 16,777,216 elements
FP16: 16,777,216 × 2 bytes = 33,554,432 bytes = 32 MB
Pixel: n = 512 × 512 = 262,144 tokens
Attention matrix: 262,144² = 68,719,476,736 elements
FP16: 68,719,476,736 × 2 = 137,438,953,472 bytes = 128 GB

32 MB vs 128 GB — that's a 4096× difference. Full-resolution self-attention in pixel space at 512×512 requires 128 GB for a single attention matrix of a single head of a single layer. This is why pixel-space diffusion either avoids attention or restricts it to tiny resolutions.
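The same arithmetic can be wrapped in a helper for trying other resolutions. This is a sketch for one head of one layer, counting only the n × n attention matrix; `attentionMatrixBytes` is an illustrative name.

```javascript
// Bytes needed for one n x n attention matrix, where n = height * width tokens.
// bytesPerElement defaults to 2 (FP16).
function attentionMatrixBytes(height, width, bytesPerElement = 2) {
  const n = height * width;
  return n * n * bytesPerElement;
}

const latentBytes = attentionMatrixBytes(64, 64);
const pixelBytes = attentionMatrixBytes(512, 512);
console.log(latentBytes / 2 ** 20, "MiB");
console.log(pixelBytes / 2 ** 30, "GiB");
console.log(pixelBytes / latentBytes, "x larger");
```

Memory-efficient attention kernels avoid materializing this matrix, so the figure is best read as the cost of the naive implementation.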

Exercise 6.4: 1024×1024 Latents Derive

SDXL generates 1024×1024 images. With the same 8× spatial compression, what is the latent resolution and total latent values (4 channels)?

total latent values
Show derivation
Latent resolution = 1024/8 = 128×128
Total values = 128 × 128 × 4 = 65,536

The SDXL latent at 128×128 has 4× the spatial tokens of SD 1.5's 64×64 latent. This 4× increase in sequence length makes self-attention 16× more expensive, which is why SDXL uses a larger U-Net with more efficient attention patterns and requires substantially more VRAM.

Exercise 6.5: Where Diffusion Runs Trace
In Stable Diffusion, the diffusion process (noise addition, denoising, sampling) operates on:
Show explanation

The autoencoder is used at two points: during training, the encoder converts training images to latents; during generation, the decoder runs once at the very end to convert the final latent z0 to pixels. The entire iterative denoising process (all 50+ sampling steps) runs in the 64×64×4 latent space. This is why it's called latent diffusion.

Chapter 7: Flow Matching

Flow matching (Lipman et al. 2023, Liu et al. 2023) offers a simpler, more elegant framework than DDPM. Instead of defining a noise schedule and deriving a reverse process, it directly learns a velocity field that transports noise to data along straight paths.

Straight path interpolation:
ψt(x) = (1-t) · x0 + t · x1,   t ∈ [0, 1]
where x0 ~ pdata (clean image), x1 ~ N(0, I) (noise)

Velocity (target):
ut(x) = x1 - x0   (constant along the straight path)

Conditional flow matching loss:
L = Et, x0, x1 [ || vθ(ψt(x), t) - ut ||² ]
= Et, x0, x1 [ || vθ(ψt(x), t) - (x1 - x0) ||² ]
Flow matching is simpler than DDPM. No noise schedule to design, no αt or āt to track, no reverse process derivation. Just learn a velocity field vθ that moves points from data to noise along straight lines. The loss is a simple MSE between predicted and target velocities. Sampling is just integrating an ODE: dx/dt = vθ(x, t), run backward from t=1 (noise) to t=0 (data) under this convention.
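One training example under this formulation can be sketched as follows, using the convention above (t=0 is data, t=1 is noise). The helper names are hypothetical, and `vPred` stands in for the output of a model that is not shown.

```javascript
// Straight-line interpolation: psi_t = (1 - t) * x0 + t * x1, per element.
function interpolate(x0, x1, t) {
  return x0.map((v, i) => (1 - t) * v + t * x1[i]);
}

// Target velocity along the straight path: u = x1 - x0 (independent of t).
function targetVelocity(x0, x1) {
  return x1.map((v, i) => v - x0[i]);
}

// Mean squared error between a predicted velocity and the target.
function flowMatchingLoss(vPred, x0, x1) {
  const target = targetVelocity(x0, x1);
  let sum = 0;
  for (let i = 0; i < target.length; i++) {
    sum += (vPred[i] - target[i]) ** 2;
  }
  return sum / target.length;
}

const x0 = [1, 2]; // data sample
const x1 = [5, 8]; // noise sample
console.log(interpolate(x0, x1, 0.3), targetVelocity(x0, x1));
console.log(flowMatchingLoss([4, 6], x0, x1));
```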
Exercise 7.1: Interpolation Derive

Compute ψt at t=0.3 for x0 = [1, 2] (data) and x1 = [5, 8] (noise).

Apply element-wise: ψt[i] = (1-t) × x0[i] + t × x1[i]

ψ0.3[0] (first element)
Show derivation
ψ0.3[0] = (1-0.3) × 1 + 0.3 × 5 = 0.7 + 1.5 = 2.2
ψ0.3[1] = (1-0.3) × 2 + 0.3 × 8 = 1.4 + 2.4 = 3.8

At t=0 we're at x0=[1,2] (data), at t=1 we're at x1=[5,8] (noise), and at t=0.3 we're 30% of the way along the straight line between them. Note: in some flow matching papers, the convention is reversed (t=0 is noise, t=1 is data). Always check the convention!

Exercise 7.2: ψ0.3[1] Derive

Second element of ψ0.3 for x0=[1,2], x1=[5,8].

ψ0.3[1]
Show derivation
ψ0.3[1] = 0.7 × 2 + 0.3 × 8 = 1.4 + 2.4 = 3.8
Exercise 7.3: Target Velocity Derive

For the same x0=[1,2] and x1=[5,8], what is the target velocity ut?

ut[0] (first element)
Show derivation
ut = x1 - x0 = [5-1, 8-2] = [4, 6]

The target velocity is constant — it doesn't depend on t at all. This is a key simplification of flow matching: the velocity along a straight path is just the displacement vector from data to noise. The model learns to approximate this constant velocity at each point in space-time.

Exercise 7.4: Flow Matching vs DDPM Trace
What is the fundamental difference between flow matching and DDPM at training time?
Show explanation

Both methods train a neural network with MSE loss on paired inputs and targets. DDPM: input is noised image xt = √ātx0 + √(1-āt)ε, target is ε. Flow matching: input is interpolated point ψt = (1-t)x0 + tx1, target is velocity x1-x0. The architecture, dataset, and optimizer can be identical — only the parameterization of the interpolation path and prediction target differ.

Exercise 7.5: Implement flowMatchStep() Build

Write a function that computes one Euler integration step for flow matching sampling: xnew = xold + v × dt.

Return a single number: the new position after one Euler step.
Show solution
javascript
function flowMatchStep(x, v, dt) {
  return x + v * dt;
}
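Chaining Euler steps gives a complete sampler. The sketch below assumes this workbook's convention (t=1 is noise, t=0 is data), so it integrates backward in time. `sampleFlowEuler` and `toyVelocity` are hypothetical names, and the toy velocity field stands in for a trained model: it is the exact field for a dataset containing a single point.

```javascript
// Integrate dx/dt = velocity(x, t) backward from t = 1 to t = 0 with Euler steps.
function sampleFlowEuler(x1, velocity, steps) {
  let x = x1;
  const dt = 1 / steps;
  for (let k = 0; k < steps; k++) {
    const t = (steps - k) / steps; // current time, from 1 down to dt
    x = x - velocity(x, t) * dt;   // move one step toward t = 0
  }
  return x;
}

// Toy stand-in for a trained model (an assumption of this sketch):
// if the dataset is one point d, then x = (1 - t) * d + t * x1,
// so the velocity x1 - d equals (x - d) / t.
const toyVelocity = (dataPoint) => (x, t) => (x - dataPoint) / t;

console.log(sampleFlowEuler(5.0, toyVelocity(1.0), 10));
```

With this toy field the path is exactly straight, so Euler integration lands on the data point for any number of steps; a learned field over real data curves, which is where step count starts to matter.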
Exercise 7.6: Flow Matching Training Pipeline Design

Put these flow matching training steps in the correct order.

Steps to order:
Sample x0 from data
Sample x1 ~ N(0,I)
Compute ψt = (1-t)x0 + tx1
vθ(ψt, t) forward pass
MSE loss vs (x1 - x0)
Show answer

Correct order: (1) Sample x0 from data, (2) Sample x1 ~ N(0,I), (3) Compute interpolation ψt, (4) Forward pass vθ(ψt, t), (5) MSE loss vs target velocity (x1 - x0). The time t is drawn uniformly from [0, 1] before step 3. This is the entire training loop, and it is simpler than DDPM because there's no noise schedule (α, ā) to manage.

Chapter 8: ODE vs SDE Sampling

Once you have a trained model (whether it predicts noise, score, or velocity), there are two ways to sample: integrate an ODE (deterministic) or simulate an SDE (stochastic). Both generate valid samples, but they have different tradeoffs.

ODE (Probability Flow):
dx = v(x, t) dt   (deterministic, no randomness)
Same xT → same x0 every time

SDE (Langevin-type):
dx = f(x, t) dt + g(t) dW   (stochastic, adds noise at each step)
Same xT → different x0 each time

Euler method (for ODE):
xt+Δt = xt + v(xt, t) × Δt
ODE is faster, SDE is higher quality. ODE sampling is deterministic and works well with few steps (DDIM is an ODE solver). SDE sampling adds noise that acts as a correction mechanism — it can fix errors from earlier steps but requires more steps to converge. In practice: use ODE for speed (15-50 steps), SDE for maximum quality (100+ steps).
Exercise 8.1: Three Euler Steps Derive

Starting at x0 = 3.0, integrate dx/dt = v(x, t) = -x using Euler's method with Δt = 0.1 for 3 steps. Compute x1, x2, x3.

x3 (after 3 steps)
Show derivation
x1 = x0 + v(x0) × 0.1 = 3.0 + (-3.0) × 0.1 = 3.0 - 0.3 = 2.7
x2 = x1 + v(x1) × 0.1 = 2.7 + (-2.7) × 0.1 = 2.7 - 0.27 = 2.43
x3 = x2 + v(x2) × 0.1 = 2.43 + (-2.43) × 0.1 = 2.43 - 0.243 = 2.187

The true solution of dx/dt = -x is x(t) = 3e^(-t), so x(0.3) = 3e^(-0.3) = 2.222. Our Euler approximation gives 2.187, an error of ~1.6%. The Euler method multiplies by (1-Δt) at each step, so xn = 3(0.9)^n. This converges to the true exponential as Δt → 0.

Exercise 8.2: Euler Error Derive

For the same ODE (v = -x, x0 = 3.0), the exact solution at t=0.3 is x = 3e^(-0.3) = 2.2224. Our 3-step Euler gave 2.187. What is the relative error as a percentage?

% error
Show derivation
Error = |2.187 - 2.2224| / 2.2224 × 100% = 0.0354 / 2.2224 × 100% = 1.59%

With 30 steps (Δt = 0.01) the error would drop to ~0.15%. With 3 steps (Δt = 0.1) it's 1.6%. This is why diffusion sampling quality degrades with fewer steps — each step introduces discretization error, and it compounds. Higher-order solvers (Heun, RK4, DPM-Solver) reduce error per step, allowing fewer steps for the same quality.
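The step-count comparison is easy to reproduce. This is a minimal sketch of fixed-step Euler integration for the test ODE dx/dt = -x; `eulerIntegrate` and `relativeErrorPercent` are illustrative names.

```javascript
// Fixed-step Euler integration of dx/dt = velocity(x, t), starting at t = 0.
function eulerIntegrate(xStart, velocity, dt, steps) {
  let x = xStart;
  let t = 0;
  for (let i = 0; i < steps; i++) {
    x += velocity(x, t) * dt;
    t += dt;
  }
  return x;
}

// Relative error of an approximation, as a percentage of the exact value.
function relativeErrorPercent(approx, exact) {
  return (Math.abs(approx - exact) / Math.abs(exact)) * 100;
}

const decay = (x) => -x;           // the test ODE from Exercise 8.1
const exact = 3 * Math.exp(-0.3);  // closed-form solution at t = 0.3

const coarse = eulerIntegrate(3.0, decay, 0.1, 3);
const fine = eulerIntegrate(3.0, decay, 0.01, 30);
console.log(relativeErrorPercent(coarse, exact));
console.log(relativeErrorPercent(fine, exact));
```

Cutting the step size by 10× cuts the error by roughly 10×, the first-order convergence expected of Euler's method.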

Exercise 8.3: Steps vs Quality Trace
A user runs Stable Diffusion with 5 Euler steps and gets blurry, low-quality images. With 50 steps, the quality is excellent. Why?
Show explanation

With 5 steps over a unit time interval, Δt = 0.2 per step. The Euler method approximates curves with straight line segments, so larger steps mean more error wherever the velocity field curves. The accumulated error means the sample ends up in a low-probability region, producing blurry averages of multiple modes rather than sharp samples. This is why modern schedulers (DPM-Solver++, UniPC) use higher-order methods that better follow the ODE trajectory.
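
To see what a higher-order method buys, here is a hedged sketch comparing Euler with Heun's method (a second-order scheme) on the toy ODE dx/dt = -x from Exercise 8.1. It is not DPM-Solver++ or UniPC; the function `integrate` and its `method` argument are illustrative names of ours.

```python
import math

def integrate(v, x, dt, n, method="euler"):
    """Integrate dx/dt = v(x, t) for n steps of size dt starting at t = 0."""
    t = 0.0
    for _ in range(n):
        k1 = v(x, t)
        if method == "euler":
            x = x + dt * k1
        else:
            # Heun: average the slope at the start and at the Euler-predicted end
            k2 = v(x + dt * k1, t + dt)
            x = x + dt * 0.5 * (k1 + k2)
        t += dt
    return x

v = lambda x, t: -x
exact = 3.0 * math.exp(-0.3)

def rel_err(x):
    return abs(x - exact) / exact * 100.0

# Equal budget of 6 velocity evaluations: 6 Euler steps vs 3 Heun steps
err_euler = rel_err(integrate(v, 3.0, 0.05, 6, "euler"))  # roughly 0.8%
err_heun = rel_err(integrate(v, 3.0, 0.1, 3, "heun"))     # roughly 0.05%
print(err_euler, err_heun)
```

Heun spends two model evaluations per step, so the fair comparison is at equal evaluations rather than equal steps; on this toy problem it still comes out well ahead.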

Exercise 8.4: ODE vs SDE Properties Trace
You generate two images from the exact same noise xT using ODE sampling and SDE sampling respectively. What should you expect?
Show explanation

ODE sampling is deterministic: no randomness is injected, so the same xT always maps to the same x0. SDE sampling adds Gaussian noise at each step (the dW term), so even with the same starting point, each run follows a different trajectory and produces a different image. Both are valid samples from the learned distribution, but ODE gives you a reproducible mapping while SDE gives you stochastic diversity.
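
The reproducibility difference shows up even in one dimension. The sketch below uses a toy drift of -x as a stand-in for a trained model, with Euler for the ODE and Euler-Maruyama for the SDE. The noise scale `g` and both function names are assumptions for illustration, and this toy pair is not claimed to share the same marginal distributions.

```python
import math
import random

def sample_ode(x_T, n_steps=50):
    """Euler on the toy ODE dx/dt = -x; no randomness is injected."""
    x, dt = x_T, 1.0 / n_steps
    for _ in range(n_steps):
        x += -x * dt
    return x

def sample_sde(x_T, n_steps=50, g=0.5, rng=None):
    """Euler-Maruyama on the toy SDE dx = -x dt + g dW."""
    rng = rng or random.Random()
    x, dt = x_T, 1.0 / n_steps
    for _ in range(n_steps):
        # dW over a step of length dt is Gaussian with standard deviation sqrt(dt)
        x += -x * dt + g * math.sqrt(dt) * rng.gauss(0.0, 1.0)
    return x

x_T = 3.0
ode_a, ode_b = sample_ode(x_T), sample_ode(x_T)
sde_a = sample_sde(x_T, rng=random.Random(1))
sde_b = sample_sde(x_T, rng=random.Random(2))
print(ode_a == ode_b)  # True: same start, same result
print(sde_a == sde_b)  # False: same start, different noise draws
```

Fixing the random seed makes an SDE run repeatable too, but then the seed is part of the input, not just xT.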

Exercise 8.5: Adaptive Step Size Trace
Adaptive ODE solvers vary Δt based on local error estimates. At which timesteps should Δt be smallest (most steps concentrated)?
Show explanation

Near t=T, the velocity field is relatively smooth (noisy samples all look similar). Near t=0, the model is resolving fine spatial details, and small errors show up as visible artifacts. Adaptive solvers take smaller steps where the local error estimate is large (typically near t=0, where the trajectory curves most) and larger steps where it is small (near t=T). This error-driven step allocation is why adaptive solvers can reach a given accuracy with fewer model evaluations than a uniform fixed-step schedule.
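
A small step-size controller makes the allocation visible. This is a sketch under stated assumptions: the field `toy_field` is hand-made so that the trajectory curves sharply near t=0, the error estimate is the gap between an Euler step and a Heun step, and the controller constants (0.9, 0.2, 2.0) are conventional choices of ours, not values from any particular diffusion library.

```python
def adaptive_heun(f, x, t_start, t_end, tol=1e-3, h=0.1):
    """Integrate dx/dt = f(x, t) from t_start to t_end with error-controlled steps.

    Returns the final x and a list of (t, step_size) for every accepted step.
    """
    direction = 1.0 if t_end > t_start else -1.0
    t, accepted = t_start, []
    while abs(t_end - t) > 1e-12:
        h = min(h, abs(t_end - t))            # never step past t_end
        dt = direction * h
        k1 = f(x, t)
        x_euler = x + dt * k1                 # first-order estimate
        k2 = f(x_euler, t + dt)
        x_heun = x + 0.5 * dt * (k1 + k2)     # second-order estimate
        err = abs(x_heun - x_euler)           # local error estimate
        if err <= tol:                        # accept the step
            accepted.append((t, h))
            x, t = x_heun, t + dt
        # shrink after a large error, grow after a small one
        h *= min(2.0, max(0.2, 0.9 * (tol / max(err, 1e-15)) ** 0.5))
    return x, accepted

def toy_field(x, t):
    # Hypothetical field whose solution x(t) = 1.05 / (t + 0.05) steepens near t = 0
    return -x / (t + 0.05)

# Integrate "backward in time" from t = 1 (noise end) to t = 0 (data end)
x_end, steps = adaptive_heun(toy_field, x=1.0, t_start=1.0, t_end=0.0)
early = [h for t, h in steps if t > 0.5]
late = [h for t, h in steps if t < 0.1]
print(len(steps), min(early), min(late))
print(x_end)  # the exact value at t = 0 is 1.05 / 0.05 = 21
```

On this toy problem the accepted steps near t=0 are far smaller than those near t=1, which mirrors the allocation described above.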

Chapter 9: Capstone: Design a Diffusion Pipeline

You're an ML engineer tasked with building an image generation system. You need to make concrete decisions about every component — architecture, noise schedule, sampling, guidance — and estimate the compute cost. This capstone ties together everything from the workbook.

The design brief. Generate 1024×1024 images conditioned on text prompts. Target: high quality, reasonable latency (<5 seconds on an A100), and the ability to serve 1000+ images per hour per GPU.
Exercise 9.1: Latent vs Pixel Trace
For 1024×1024 generation, should you use latent diffusion (with 8× compression to 128×128×4) or pixel-space diffusion? Consider that a single self-attention layer at 1024×1024 pixel resolution would need to attend over 1,048,576 tokens.
Show reasoning

At 1024×1024, pixel-space self-attention is impractical: (1M)² ≈ 10^12 attention scores per head per layer. Even without attention, the U-Net would process 1M spatial positions at the highest resolution. Latent diffusion compresses to 128×128 = 16,384 (16K) tokens, making self-attention feasible (16,384² ≈ 268M scores, large but manageable). Every modern high-resolution model (SDXL, FLUX, Imagen 3) uses latent or cascaded approaches.
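
The token and score counts in this comparison are easy to check directly. A minimal sketch (the helper name `attention_scores` is ours) that counts one score per query-key pair, per head and per layer:

```python
def attention_scores(height, width):
    """Return (tokens, pairwise attention scores) for a height x width grid."""
    tokens = height * width
    return tokens, tokens * tokens

pixel_tokens, pixel_scores = attention_scores(1024, 1024)
latent_tokens, latent_scores = attention_scores(128, 128)

print(pixel_tokens, pixel_scores)      # 1048576 1099511627776
print(latent_tokens, latent_scores)    # 16384 268435456
print(pixel_scores // latent_scores)   # 4096
```

An 8× reduction per side cuts the token count by 64× and the score count by 64² = 4096×. The latent figure is 256 × 2^20, which is 256M in binary units and about 268M in decimal.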

Exercise 9.2: U-Net Parameter Count Derive

A U-Net with base channels C=320 and channel multipliers [1, 2, 4, 8] has 4 resolution levels. At each level, there are 2 ResNet blocks. Each ResNet block has roughly 2 × Clevel² parameters (two convolutions counted at Clevel² each; this simplification drops the 3×3 kernel factor of 9 and the biases). Compute the total ResNet parameters across all levels.

Channels per level: 320, 640, 1280, 2560. Remember the U-Net has encoder + decoder (double the blocks, minus the bottleneck). For simplicity, count encoder = 4 levels × 2 blocks, decoder = 4 levels × 2 blocks, bottleneck = 1 block at the deepest level.

billion ResNet params (approximate)
Show derivation
Level 0 (C=320): 2 × 320² = 204,800 per block, × 4 blocks (enc+dec) = 819,200
Level 1 (C=640): 2 × 640² = 819,200 per block, × 4 = 3,276,800
Level 2 (C=1280): 2 × 1280² = 3,276,800 per block, × 4 = 13,107,200
Level 3 (C=2560): 2 × 2560² = 13,107,200 per block, × 4 = 52,428,800
Bottleneck (C=2560): 2 × 2560² = 13,107,200
Subtotal = 819,200 + 3,276,800 + 13,107,200 + 52,428,800 + 13,107,200 = 82,739,200 ≈ 82.7M

82.7M counts only the ResNet convolutions under the simplified 2 × C² rule. A real U-Net also has: self-attention layers (adding ~4×C² per attention block), cross-attention for text conditioning, time embedding MLPs, wider decoder convolutions fed by skip connections, and group norms. The full SD 1.5 U-Net is ~860M parameters; SDXL's is ~2.6B. Our rough ResNet estimate captures the scaling pattern: the deepest levels (highest C) dominate the parameter count. Level 3's encoder and decoder blocks alone are 63% of the total, and 79% once the bottleneck is included.
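
The tally can be scripted so that other channel widths or multipliers are easy to try. This sketch follows the workbook's simplified 2 × C² per-block rule; the variable and function names are ours.

```python
base, multipliers = 320, [1, 2, 4, 8]
blocks_per_level = 4  # 2 encoder blocks + 2 decoder blocks

def resnet_block_params(c):
    """Workbook simplification: two convolutions counted at c*c weights each."""
    return 2 * c * c

per_level = [blocks_per_level * resnet_block_params(base * m) for m in multipliers]
bottleneck = resnet_block_params(base * multipliers[-1])  # 1 block at the deepest width
total = sum(per_level) + bottleneck

print(per_level)                             # [819200, 3276800, 13107200, 52428800]
print(total)                                 # 82739200
print(round(100 * per_level[-1] / total))    # 63
```

Each doubling of the channel width quadruples the per-block count, which is why the last level dominates.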

Exercise 9.3: Inference FLOPs Derive

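A U-Net with ~2.6B parameters (SDXL scale) is often costed with the "2N" rule: about 2 FLOPs per parameter each time a weight is applied. For a single token or spatial position that is 2 × 2.6B = 5.2 GFLOPs, but convolution and attention weights are applied at many latent positions per forward pass. Assume an effective reuse of about 1,000 positions per weight (a simplifying assumption for this exercise), giving ≈ 5.2 TFLOPs per forward pass. With 50 DDIM steps + CFG, how many total FLOPs per image?

TFLOPs
Show derivation
FLOPs per forward pass ≈ 2 × 2.6B × 1,000 = 5.2 × 10^12 = 5.2 TFLOPs
Forward passes = 50 steps × 2 (CFG) = 100
Total = 100 × 5.2 = 520 TFLOPs per image

520 TFLOPs for one image. For comparison, a dense language model with ~1.8T parameters (a size rumored, not confirmed, for GPT-4) would need about 2 × 1.8T = 3.6 TFLOPs per generated token. So generating one SDXL image costs roughly as much as generating 520 / 3.6 ≈ 144 such tokens, about a paragraph of text.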

Exercise 9.4: Time per Image on A100 Derive

An A100 achieves ~312 TFLOPS in FP16. Assuming 40% utilization (realistic for U-Net inference), how long per image at 520 TFLOPs/image?

seconds
Show derivation
Effective throughput = 312 × 0.40 = 124.8 TFLOPS
Time = 520 / 124.8 = 4.17 seconds

~4.2 seconds per image meets our <5 second target. In practice, SDXL on an A100 with optimized inference (TensorRT, FlashAttention) generates 1024×1024 images in 3-5 seconds, matching our estimate well.

Exercise 9.5: Hourly Throughput Derive

At 4.17 seconds per image, how many images per hour can one A100 generate? Does this meet the 1000+ images/hour requirement?

images/hour
Show derivation
Images/hour = 3600 / 4.17 = 863

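863 images/hour falls short of the 1000+ target. Options: (1) reduce steps from 50 to 40 (gives 1079/hr), (2) use a distilled model that needs fewer steps, (3) batch 2 images per pass if VRAM allows, (4) drop to 25 steps with a higher-order solver (DPM-Solver++) for comparable quality. Option 3 is often the easiest to try, but the FLOPs per image are unchanged, so it only helps to the extent that batching lifts utilization above the assumed 40%; the upper bound at batch=2 is about 1700/hr. Option 4 halves the compute per image and reaches about 1700/hr at the same utilization.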
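
The latency and throughput chain from Exercises 9.3 to 9.5 can be written as one small cost model, which makes the step-count options easy to compare. This is a sketch that takes the workbook's working figures as defaults (5.2 TFLOPs per forward pass, 2 passes per step for CFG, 312 TFLOPS peak, 40% utilization); the function name `cost_model` is ours.

```python
def cost_model(steps, tflops_per_pass=5.2, cfg_passes=2,
               peak_tflops=312.0, utilization=0.40):
    """Return (total TFLOPs per image, seconds per image, images per hour)."""
    total_tflops = steps * cfg_passes * tflops_per_pass
    seconds = total_tflops / (peak_tflops * utilization)
    return total_tflops, seconds, 3600.0 / seconds

for steps in (50, 40, 25):
    total, sec, per_hour = cost_model(steps)
    print(steps, round(total), round(sec, 2), round(per_hour))
```

Without intermediate rounding this gives about 864/hr at 50 steps and 1080/hr at 40; the workbook's 863 and 1079 come from rounding the time to 4.17 s first. The model is purely FLOP-bound, so it does not capture memory bandwidth, VAE decoding, or text encoding.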

Exercise 9.6: Full Pipeline Design Design

Put the complete image generation pipeline steps in order.

Steps to order:
Encode text prompt (CLIP/T5)
Sample zT ~ N(0,I) in latent space
Iterative denoising (50 steps + CFG)
VAE decode z0 → pixels
Post-process (clip, scale to 0-255)
Save/return image
Show answer

Correct order: (1) Encode text prompt, (2) Sample noise in latent space, (3) Iterative denoising loop, (4) VAE decode to pixels, (5) Post-process, (6) Save/return. The text encoding happens first (and only once per prompt). The denoising loop is 98% of the compute. The VAE decoder runs once at the end. Post-processing (clamping values, converting to uint8) is negligible.
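
The ordering can be sketched as a toy pipeline. Every function below is a hypothetical stand-in (no real text encoder, U-Net, or VAE), the sizes are tiny so it runs instantly, and the latent update is a placeholder, not a real scheduler. Only the order of the stages and the two model calls per step for CFG are meant to carry over.

```python
import random

def encode_text(prompt):
    # Hypothetical stand-in for a CLIP/T5 encoder: one number per character
    return [ord(c) / 255.0 for c in prompt]

def toy_denoiser(z, t, cond):
    # Hypothetical stand-in for the U-Net noise prediction
    shift = 0.0 if not cond else sum(cond) / len(cond)
    return [0.1 * v + 0.01 * shift for v in z]

def vae_decode(z, factor=8):
    # Hypothetical stand-in for the VAE decoder: repeat each latent value
    return [v for v in z for _ in range(factor)]

def generate(prompt, steps=50, guidance=7.5, latent_size=16, seed=0):
    cond = encode_text(prompt)                                  # (1) once per prompt
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_size)]       # (2) z_T ~ N(0, I)
    for i in range(steps):                                      # (3) denoising loop
        t = 1.0 - i / steps
        eps_u = toy_denoiser(z, t, None)                        # unconditional pass
        eps_c = toy_denoiser(z, t, cond)                        # conditional pass
        eps = [u + guidance * (c - u) for u, c in zip(eps_u, eps_c)]  # CFG mix
        z = [v - e / steps for v, e in zip(z, eps)]             # placeholder update
    pixels = vae_decode(z)                                      # (4) decode once
    pixels = [min(1.0, max(0.0, (p + 1.0) / 2.0)) for p in pixels]  # (5) clip to [0, 1]
    return [round(255 * p) for p in pixels]                     # (5) scale, (6) return

image = generate("a cat")
print(len(image), min(image), max(image))
```

The loop body calls the denoiser twice per step, which is where the 50 × 2 = 100 forward passes in the cost estimate come from.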

The proof of work. If you completed every exercise in this workbook from scratch — computed noise schedules, derived losses, traced sampling algorithms, implemented forward processes and flow matching steps, and designed a full generation pipeline — you understand the mathematical machinery behind every major diffusion and flow matching system. From DDPM to Stable Diffusion to FLUX, it's all built on these fundamentals. "What I cannot create, I do not understand."

Related Lessons

Topic | Lesson
Diffusion fundamentals | Diffusion — From Absolute Zero
Flow matching | Flow Matching — From Absolute Zero
VAE / VQ-VAE | VAE & VQ-VAE — From Absolute Zero
Contrastive learning (CLIP) | Contrastive Learning & CLIP — From Absolute Zero
Transformer math | Transformer Math Workbook