Diffusion & Flow Matching Workbook

Chapter 0: Forward Process

You have a clean image x₀. The diffusion forward process gradually adds Gaussian noise over T timesteps until nothing recognizable remains. The magic: you can jump to any timestep t directly without iterating through all previous steps.

At each step, a small amount of noise is added according to a schedule β₁, β₂, ..., β_T:

Single step: q(x_t | x_t-1) = N(x_t; √(1-β_t) · x_t-1, β_t I)

Direct jump (closed form):
α_t = 1 - β_t
ā_t = ∏_s=1..t α_s = α₁ · α₂ · ... · α_t

q(x_t | x₀) = N(x_t; √ā_t · x₀, (1 - ā_t) I)

Reparameterization: x_t = √ā_t · x₀ + √(1-ā_t) · ε, ε ~ N(0, I)

The signal-to-noise interpretation. √ā_t scales the original signal and √(1 - ā_t) scales the noise. At t=0, ā₀ = 1 (the empty product), so the image is exactly clean. At t=T, ā_T ≈ 0, so the image is essentially pure noise. In between, the forward process interpolates smoothly from signal to noise.

Exercise 0.1: Compute α_t Derive

Linear schedule: β_t = 0.0001 + (0.02 - 0.0001) × (t-1)/(T-1), with T=1000. Compute α₁ = 1 - β₁.

β₁ corresponds to t=1, so β₁ = 0.0001 + 0.0199 × 0/999 = 0.0001.

α₁

Show derivation

β₁ = 0.0001 + 0.0199 × (1-1)/(1000-1) = 0.0001

α₁ = 1 - 0.0001 = 0.9999

At the first timestep, almost no noise is added. The signal is preserved at 99.99%.

Exercise 0.2: ā_t at t=500 Derive

For the linear schedule above, β_t = 0.0001 + 0.0199 × (t-1)/999. The mean of β across all 1000 steps is (0.0001 + 0.02)/2 = 0.01005. Use the approximation: log ā_t ≈ ∑_s=1..t log(1 - β_s) ≈ -∑ β_s. For the first 500 steps, the average β is approximately 0.0001 + 0.0199 × 249.5/999 ≈ 0.00507. So ∑_s=1..500 β_s ≈ 500 × 0.00507 = 2.535. What is ā₅₀₀?

ā₅₀₀

Show derivation

Average β for t=1..500: β̄ = 0.0001 + 0.0199 × ((0 + 499)/2) / 999

= 0.0001 + 0.0199 × 249.5/999 = 0.0001 + 0.00497 = 0.00507

∑_s=1..500 β_s ≈ 500 × 0.00507 = 2.535

ā₅₀₀ ≈ e^-2.535 = 0.0793

Only about 7.9% of the original signal power remains at the midpoint. The signal coefficient √ā₅₀₀ ≈ 0.282, meaning the original image is attenuated to ~28% of its intensity.

Exercise 0.3: Signal Fraction at t=500 Derive

Using ā₅₀₀ ≈ 0.0793 from above, what fraction of the total variance in x₅₀₀ comes from the original signal x₀ (as opposed to noise)?

Hint: x_t = √ā_t · x₀ + √(1-ā_t) · ε. If x₀ has unit variance, the signal variance is ā_t and noise variance is (1-ā_t). Signal fraction = ā_t / (ā_t + (1-ā_t)).

fraction (0 to 1)

Show derivation

Signal variance = ā_t = 0.0793

Noise variance = 1 - ā_t = 0.9207

Signal fraction = 0.0793 / (0.0793 + 0.9207) = 0.0793 / 1 = 0.0793

The signal fraction IS ā_t (since the total variance is 1). Only 7.93% of the variance in x₅₀₀ comes from the original image. The denoiser has to recover the image from a signal buried under ~12× more noise power.

Exercise 0.4: ā_T at t=1000 Derive

Now compute ā₁₀₀₀ using the same approximation. The average β over all 1000 steps is 0.01005.

ā₁₀₀₀

Show derivation

∑_s=1..1000 β_s ≈ 1000 × 0.01005 = 10.05

ā₁₀₀₀ ≈ e^-10.05 = 4.32 × 10^-5 ≈ 0.0000432

The exact product, computed numerically, is ~0.0000404. It is slightly below the estimate because log(1 - β) < -β for every β > 0. At t=1000, √ā_T ≈ 0.0064, so the original signal is attenuated to about 0.64% of its intensity. x_T is essentially pure Gaussian noise, which is exactly what we want: the reverse process starts from N(0, I).
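The exact product and the exp(-∑β) shortcut can be compared directly. The following is a minimal sketch, assuming the linear schedule from Exercise 0.1; the function names `linearBeta`, `exactAlphaBar`, and `approxAlphaBar` are illustrative and not part of the workbook.

```javascript
// Linear schedule from Exercise 0.1: beta_1 = betaMin, beta_T = betaMax.
function linearBeta(t, T, betaMin = 0.0001, betaMax = 0.02) {
  return betaMin + (betaMax - betaMin) * (t - 1) / (T - 1);
}

// Exact value: alphabar_t = prod_{s=1..t} (1 - beta_s).
function exactAlphaBar(t, T) {
  let alphaBar = 1; // empty product, i.e. alphabar_0 = 1
  for (let s = 1; s <= t; s++) alphaBar *= 1 - linearBeta(s, T);
  return alphaBar;
}

// Workbook shortcut: alphabar_t ~ exp(-sum of beta_s), using log(1 - beta) ~ -beta.
function approxAlphaBar(t, T) {
  let betaSum = 0;
  for (let s = 1; s <= t; s++) betaSum += linearBeta(s, T);
  return Math.exp(-betaSum);
}

// Print exact and approximate values side by side.
for (const t of [500, 1000]) {
  console.log(t, exactAlphaBar(t, 1000).toExponential(3), approxAlphaBar(t, 1000).toExponential(3));
}
```

Because log(1 - β) is always below -β, the shortcut lands slightly above the exact product, and the gap widens as more (and larger) β values accumulate.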

Exercise 0.5: SNR at t=500 Trace

The signal-to-noise ratio (SNR) at timestep t is defined as SNR(t) = ā_t / (1 - ā_t). Using ā₅₀₀ ≈ 0.0793, what is log₁₀(SNR(500))?

(a) -2.07 (SNR ≈ 0.0085)
(b) -1.065 (SNR ≈ 0.086)
(c) 0 (SNR = 1)
(d) -0.5 (SNR ≈ 0.316)

Show derivation

SNR(500) = ā₅₀₀ / (1 - ā₅₀₀) = 0.0793 / 0.9207 = 0.0861

log₁₀(0.0861) = -1.065

At t=500, the noise power is ~11.6× the signal power. The SNR is well below 1, meaning the denoiser is working with mostly noise. The log-SNR is a key quantity in diffusion theory: timesteps spaced uniformly in log-SNR are often treated as roughly evenly spaced in denoising difficulty.
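As a quick check of the arithmetic above, here is a small sketch of the SNR definition. The helper names `snr` and `log10Snr` are illustrative.

```javascript
// SNR(t) = alphabar_t / (1 - alphabar_t), as defined in Exercise 0.5.
function snr(alphaBar) {
  return alphaBar / (1 - alphaBar);
}

// Base-10 log-SNR; zero exactly when signal and noise power are equal.
function log10Snr(alphaBar) {
  return Math.log10(snr(alphaBar));
}

console.log(snr(0.0793).toFixed(4));      // "0.0861"
console.log(log10Snr(0.0793).toFixed(3)); // "-1.065"
console.log(log10Snr(0.5));               // 0 (equal signal and noise power)
```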

Exercise 0.6: Implement forwardProcess() Build

Write a function that takes x0 (a number), alphabar_t, and epsilon (a noise sample), then returns x_t using the reparameterization trick.

Return a single number: the noised sample x_t.

Show solution

javascript
function forwardProcess(x0, alphabar_t, epsilon) {
  return Math.sqrt(alphabar_t) * x0 + Math.sqrt(1 - alphabar_t) * epsilon;
}

Chapter 1: Noise Schedule Math

The noise schedule determines how quickly you destroy the image. Too fast and the model can't learn — the jump between adjacent timesteps is too large. Too slow and you waste compute on timesteps where nothing interesting happens. Two schedules dominate: linear and cosine.

Linear schedule:
β_t = β_min + (β_max - β_min) × (t-1)/(T-1)
ā_t = ∏_s=1..t (1 - β_s)

Cosine schedule (Nichol & Dhariwal 2021):
ā_t = f(t) / f(0), where f(t) = cos²( (t/T + s) / (1+s) × π/2 )
s = 0.008 (small offset to prevent β_t from being too small near t = 0)

Then: β_t = 1 - ā_t / ā_t-1, clipped to [0, 0.999]

Why cosine wins. The linear schedule destroys information too quickly: by the midpoint (t/T = 0.5) less than 10% of the signal power remains, so the second half of the schedule is spent on images that are already mostly noise. The cosine schedule provides a more uniform distribution of SNR across timesteps, giving the denoiser a smoother curriculum, with more time on "medium noise" where learning is most productive.

Exercise 1.1: Linear ā₂₅₀ Derive

Linear schedule: β_min=0.0001, β_max=0.02, T=1000. Compute ā₂₅₀ using the approximation ā_t ≈ exp(-∑β_s).

Average β for t=1..250: β̄ = 0.0001 + 0.0199 × ((0+249)/2)/999.

ā₂₅₀

Show derivation

β̄ = 0.0001 + 0.0199 × 124.5/999 = 0.0001 + 0.00248 = 0.00258

∑_1..250 β_s ≈ 250 × 0.00258 = 0.645

ā₂₅₀ ≈ e^-0.645 = 0.525

Exact numerical computation gives ~0.524. At 25% of the way through the schedule, about half the signal remains. The approximation log(1-β) ≈ -β is accurate here because each β is small (at most ~0.005 for t ≤ 250).

Exercise 1.2: Linear ā₇₅₀ Derive

Same linear schedule. Compute ā₇₅₀.

ā₇₅₀

Show derivation

β̄ (t=1..750) = 0.0001 + 0.0199 × ((0+749)/2)/999 = 0.0001 + 0.00746 = 0.00756

∑_1..750 β_s ≈ 750 × 0.00756 = 5.67

ā₇₅₀ ≈ e^-5.67 = 0.00345

Exact numerical value is ~0.00335. Our approximation overshoots by about 3% because log(1-β) ≈ -β drops the -β²/2 term, which matters more as β grows (near t=750, β ≈ 0.015) and accumulates over 750 steps. The point stands: at 75% through the schedule, the signal is essentially destroyed, with only about 0.3% remaining.

Exercise 1.3: Cosine ā₂₅₀ Derive

Cosine schedule with s=0.008, T=1000. Compute ā₂₅₀ = f(250)/f(0), where f(t) = cos²((t/T + s)/(1+s) × π/2).

f(250) = cos²((250/1000 + 0.008)/1.008 × π/2) = cos²(0.2559 × π/2) = cos²(0.4020). f(0) = cos²((0.008/1.008) × π/2) = cos²(0.01247).

ā₂₅₀

Show derivation

f(250) = cos²((0.25 + 0.008)/1.008 × π/2) = cos²(0.2559 × 1.5708) = cos²(0.4020)

cos(0.4020) = 0.9203, f(250) = 0.9203² = 0.8469

f(0) = cos²(0.008/1.008 × π/2) = cos²(0.01247) ≈ 0.99992² = 0.9998

ā₂₅₀ = 0.8469 / 0.9998 ≈ 0.847

Compare with the linear schedule: ā₂₅₀ ≈ 0.524 (linear) vs 0.847 (cosine). At t=250, the cosine schedule preserves 85% of the signal vs only 52% for linear. The cosine schedule is much gentler early on.

Exercise 1.4: Cosine ā₇₅₀ Derive

Same cosine schedule. Compute ā₇₅₀.

ā₇₅₀

Show derivation

f(750) = cos²((0.75 + 0.008)/1.008 × π/2) = cos²(0.7520 × 1.5708) = cos²(1.1812)

cos(1.1812) = 0.3798, f(750) = 0.3798² = 0.1442

ā₇₅₀ = 0.1442 / 0.9998 = 0.144

Cosine: 14.4% signal at t=750. Linear: about 0.3% signal at t=750. The cosine schedule still has significant signal at 75% of the way through, which gives the model useful gradients even in the later stages. The linear schedule has already destroyed nearly everything by this point.

Exercise 1.5: Why Cosine? Trace

Given the ā_t values for the two schedules at t=(250, 500, 750), Linear: (0.524, 0.079, 0.003) vs Cosine: (0.847, 0.494, 0.144), which statement best explains why cosine produces better images?

(a) Cosine adds less total noise overall
(b) Cosine uses fewer timesteps
(c) Cosine distributes the noise more evenly, so the denoiser has a balanced learning curriculum across all timesteps
(d) Cosine makes the math simpler for backpropagation

Show explanation

Both schedules end at ā_T ≈ 0 (pure noise). The difference is in the distribution of difficulty. The linear schedule falls below 10% signal by the midpoint, so roughly half of its timesteps sit in the "almost noise" regime and the denoiser gets comparatively little practice on the hard middle range. The cosine schedule's ā₅₀₀ ≈ 0.5 means roughly half the timesteps are above 50% signal and half below, a far more balanced curriculum.

Exercise 1.6: Implement cosineAlphaBar() Build

Write a function that returns ā_t for the cosine schedule given t, T, and s=0.008.

Return a single number between 0 and 1.

Show solution

javascript
function cosineAlphaBar(t, T) {
  const s = 0.008;
  const f = x => Math.cos((x / T + s) / (1 + s) * Math.PI / 2) ** 2;
  return f(t) / f(0);
}
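To see the two schedules side by side, here is a minimal sketch. `cosineAlphaBar` repeats the solution above so the snippet runs on its own; `linearAlphaBar` and `cosineBeta` are illustrative names, and the β range 0.0001 to 0.02 is the one used in the exercises.

```javascript
// Exact alphabar_t for the linear schedule: product of (1 - beta_s).
function linearAlphaBar(t, T, betaMin = 0.0001, betaMax = 0.02) {
  let alphaBar = 1;
  for (let s = 1; s <= t; s++) {
    alphaBar *= 1 - (betaMin + (betaMax - betaMin) * (s - 1) / (T - 1));
  }
  return alphaBar;
}

// alphabar_t for the cosine schedule: f(t) / f(0).
function cosineAlphaBar(t, T) {
  const s = 0.008;
  const f = x => Math.cos((x / T + s) / (1 + s) * Math.PI / 2) ** 2;
  return f(t) / f(0);
}

// Per-step beta implied by the cosine schedule, clipped to [0, 0.999].
function cosineBeta(t, T) {
  const beta = 1 - cosineAlphaBar(t, T) / cosineAlphaBar(t - 1, T);
  return Math.min(0.999, Math.max(0, beta));
}

// One row per timestep: t, linear alphabar, cosine alphabar.
for (const t of [250, 500, 750]) {
  console.log(t, linearAlphaBar(t, 1000).toFixed(3), cosineAlphaBar(t, 1000).toFixed(3));
}
```

The clip matters only at the very end of the schedule, where ā_T approaches 0 and the implied β_T would otherwise approach 1.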

Chapter 2: DDPM Loss

You're training a diffusion model and need to understand what the loss actually measures. DDPM (Ho et al. 2020) simplifies the variational bound into a stunningly simple objective: predict the noise that was added.

The simplified DDPM loss:
L_simple = E_{t, x₀, ε} [ || ε - ε_θ(x_t, t) ||² ]

Training procedure:
1. Sample x₀ from data, t ~ Uniform{1,...,T}, ε ~ N(0, I)
2. Compute x_t = √ā_t · x₀ + √(1-ā_t) · ε
3. Train ε_θ to predict ε from x_t and t
4. Loss = || ε - ε_θ(x_t, t) ||²

Why predict noise instead of x₀? Since x_t = √ā_t · x₀ + √(1-ā_t) · ε, predicting ε is mathematically equivalent to predicting x₀, because you can recover one from the other: x₀ = (x_t - √(1-ā_t) · ε_θ) / √ā_t. Empirically, predicting noise gives more stable gradients because the noise target ε ~ N(0, I) has zero mean and unit variance at every timestep.
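The four training steps and the x₀ recovery formula can be sketched for a single scalar x₀. This is an illustration under stated assumptions, not a training loop: `predictNoise(xt, t)` is a hypothetical stand-in for the network ε_θ, `alphaBarAt(t)` is a hypothetical stand-in for the schedule, and the Gaussian sampler uses the Box-Muller transform.

```javascript
// Standard normal sample via the Box-Muller transform.
function sampleStandardNormal() {
  const u1 = 1 - Math.random(); // in (0, 1], avoids log(0)
  const u2 = Math.random();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Steps 1-4 of the training procedure for one scalar x0.
// alphaBarAt and predictNoise are placeholders supplied by the caller.
function ddpmTrainingLoss(x0, T, alphaBarAt, predictNoise) {
  const t = 1 + Math.floor(Math.random() * T); // t ~ Uniform{1, ..., T}
  const epsilon = sampleStandardNormal();      // eps ~ N(0, 1)
  const alphaBar = alphaBarAt(t);
  const xt = Math.sqrt(alphaBar) * x0 + Math.sqrt(1 - alphaBar) * epsilon;
  const error = epsilon - predictNoise(xt, t);
  return error * error;                        // squared error for this sample
}

// Invert the forward process: recover x0 from x_t and a noise estimate.
function predictX0(xt, alphaBar, epsHat) {
  return (xt - Math.sqrt(1 - alphaBar) * epsHat) / Math.sqrt(alphaBar);
}

// A predictor that always outputs 0 has expected loss E[eps^2] = 1.
console.log(ddpmTrainingLoss(2.0, 1000, () => 0.5, () => 0));
```

A perfect noise predictor drives the loss to zero, and feeding its output to `predictX0` returns the original x₀ exactly, which is the equivalence the paragraph above describes.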

Exercise 2.1: Single-Pixel Loss Derive

For a single pixel, the true noise is ε_true = 0.5 and your model predicts ε_pred = 0.3. What is the MSE loss contribution from this pixel?

loss

Show derivation

L = (ε_true - ε_pred)² = (0.5 - 0.3)² = 0.2² = 0.04

Exercise 2.2: Full Image Loss Derive

A 32×32×3 image has 3072 values. If the average per-element squared error is 0.04 (as above), what is the total (summed) loss? What about the mean loss?

Most implementations use mean over all elements, so the loss is independent of image size.

total loss (sum)

Show derivation

Total elements = 32 × 32 × 3 = 3072

Total loss (sum) = 3072 × 0.04 = 122.88

Mean loss = 122.88 / 3072 = 0.04

In practice, the mean loss is used for optimization (typically around 0.01-0.1 during training). The sum loss grows with image size — a 256×256×3 image would have 196,608 elements and sum loss of 7,864 for the same per-element error.

Exercise 2.3: Recover x₀ from Noise Prediction Derive

Given x_t = 1.5, ā_t = 0.25, and ε_θ = 0.8 (model's noise prediction), recover the predicted x₀.

Use: x̂₀ = (x_t - √(1-ā_t) · ε_θ) / √ā_t

predicted x₀

Show derivation

√ā_t = √0.25 = 0.5

√(1-ā_t) = √0.75 = 0.866

x̂₀ = (1.5 - 0.866 × 0.8) / 0.5 = (1.5 - 0.693) / 0.5 = 0.807 / 0.5 = 1.614

Exercise 2.4: Noise-Prediction vs x₀-Prediction Equivalence Trace

If we define x₀-prediction loss as L_x0 = ||x₀ - x̂₀||², and substitute x̂₀ = (x_t - √(1-ā_t)ε_θ)/√ā_t, what is the relationship between L_x0 and L_ε = ||ε - ε_θ||²?

(a) L_x0 = L_ε (they are identical)
(b) L_x0 = (1-ā_t)/ā_t × L_ε (they differ by a t-dependent scaling factor)
(c) L_x0 = ā_t × L_ε
(d) They are completely unrelated

Show derivation

x̂₀ = (x_t - √(1-ā_t) ε_θ) / √ā_t

x₀ = (x_t - √(1-ā_t) ε) / √ā_t

x₀ - x̂₀ = √(1-ā_t) / √ā_t × (ε_θ - ε)

L_x0 = ||x₀ - x̂₀||² = (1-ā_t)/ā_t × ||ε - ε_θ||² = (1-ā_t)/ā_t × L_ε

This is the SNR reweighting. At high noise (ā_t small), the x₀ loss amplifies errors massively because 1/ā_t is huge. Noise prediction with L_simple implicitly downweights high-noise timesteps by a factor of ā_t/(1-ā_t), which is why it produces more stable training.
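The scaling factor can be checked numerically. The sketch below converts a true and a predicted noise value into their implied x₀ values and compares the two squared errors; `lossRatio` is an illustrative name.

```javascript
// Ratio L_x0 / L_eps for scalar inputs; should equal (1 - alphaBar) / alphaBar.
function lossRatio(xt, alphaBar, eps, epsHat) {
  // Map a noise value to the x0 it implies for this x_t.
  const toX0 = e => (xt - Math.sqrt(1 - alphaBar) * e) / Math.sqrt(alphaBar);
  const lossX0 = (toX0(eps) - toX0(epsHat)) ** 2;
  const lossEps = (eps - epsHat) ** 2;
  return lossX0 / lossEps;
}

console.log(lossRatio(1.5, 0.25, 0.5, 0.3));   // about 3    = 0.75 / 0.25
console.log(lossRatio(1.5, 0.0793, 0.5, 0.3)); // about 11.6 = 0.9207 / 0.0793
```

The ratio does not depend on x_t or on the particular noise values, only on ā_t, which is why it acts as a pure per-timestep loss weight.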

Exercise 2.5: Training Step Bug Debug

This DDPM training step has a bug. The model trains but produces blurry images. Identify the buggy line.

def train_step(model, x0, t):
    noise = torch.randn_like(x0)
    alphabar_t = get_alphabar(t)
    x_t = sqrt(alphabar_t) * x0 + sqrt(alphabar_t) * noise
    pred_noise = model(x_t, t)
    loss = F.mse_loss(pred_noise, noise)
    return loss

Show explanation

Line 4 is the bug. The noise coefficient should be sqrt(1 - alphabar_t), not sqrt(alphabar_t). The correct formula is x_t = √ā_t · x₀ + √(1-ā_t) · ε. Using √ā_t for both terms means at early timesteps (ā_t ≈ 1) you add too much noise, and at late timesteps (ā_t ≈ 0) you add too little. The model learns a corrupted noise distribution, producing blurry reconstructions.

Exercise 2.6: v-Prediction Trace

Stable Diffusion v2 uses v-prediction where v = √ā_t · ε - √(1-ā_t) · x₀. At t=0 (ā_t=1), what does v reduce to? At t=T (ā_t=0)?

(a) v(t=0) = x₀, v(t=T) = ε
(b) v(t=0) = 0, v(t=T) = 0
(c) v(t=0) = ε, v(t=T) = -x₀ (it smoothly interpolates the prediction target)
(d) v(t=0) = ε + x₀, v(t=T) = ε - x₀

Show derivation

At t=0: ā₀ = 1 ⇒ v = √1 · ε - √0 · x₀ = ε

At t=T: ā_T = 0 ⇒ v = √0 · ε - √1 · x₀ = -x₀

v-prediction smoothly interpolates: at low noise it predicts ε (like DDPM), at high noise it predicts -x₀. This gives a numerically stable target at all timesteps. By contrast, ε-prediction becomes ill-conditioned at high noise (recovering x₀ divides by √ā_t ≈ 0), and x₀-prediction becomes ill-conditioned at low noise (recovering ε divides by √(1-ā_t) ≈ 0).
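Because (x_t, v) is a rotation of (x₀, ε), both quantities can be recovered from v without any division. A minimal sketch for scalars follows; the function names are illustrative.

```javascript
// v-prediction target: v = sqrt(alphaBar) * eps - sqrt(1 - alphaBar) * x0.
function vTarget(x0, epsilon, alphaBar) {
  return Math.sqrt(alphaBar) * epsilon - Math.sqrt(1 - alphaBar) * x0;
}

// Recover x0 from x_t and v (no division by sqrt(alphaBar)).
function x0FromV(xt, v, alphaBar) {
  return Math.sqrt(alphaBar) * xt - Math.sqrt(1 - alphaBar) * v;
}

// Recover eps from x_t and v (no division by sqrt(1 - alphaBar)).
function epsFromV(xt, v, alphaBar) {
  return Math.sqrt(1 - alphaBar) * xt + Math.sqrt(alphaBar) * v;
}

// Round trip with x0 = 2, eps = -0.5, alphaBar = 0.36.
const vDemoAlphaBar = 0.36;
const vDemoXt = Math.sqrt(vDemoAlphaBar) * 2 + Math.sqrt(1 - vDemoAlphaBar) * -0.5;
const vDemo = vTarget(2, -0.5, vDemoAlphaBar);
console.log(x0FromV(vDemoXt, vDemo, vDemoAlphaBar));  // about 2
console.log(epsFromV(vDemoXt, vDemo, vDemoAlphaBar)); // about -0.5
```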

Chapter 3: Score Function

The score function is the gradient of the log-probability: ∇_x log p(x). It points in the direction of increasing probability — toward the data manifold. Score-based generative models learn this gradient field and follow it to generate samples.

Score of a Gaussian:
If p(x) = N(μ, σ²), then log p(x) = -½(x-μ)²/σ² + const
∇_x log p(x) = -(x - μ) / σ²

Connection to diffusion:
The noisy distribution at time t has score: ∇_x log q(x_t)
ε_θ(x_t, t) ≈ -√(1-ā_t) × ∇_{x_t} log q(x_t)

Equivalently: score ≈ -ε_θ / √(1-ā_t)

Score = noise prediction, rescaled. A diffusion model that predicts noise IS a score model in disguise. The noise predictor ε_θ and the score differ only by a known, t-dependent factor of -√(1-ā_t). This means all the theory of score-based models (Langevin dynamics, probability flow ODEs) applies directly to DDPM.

Exercise 3.1: Gaussian Score Derive

Compute the score ∇_x log p(x) at x=2 for p(x) = N(0, 1).

score

Show derivation

score = -(x - μ) / σ² = -(2 - 0) / 1² = -2

The score at x=2 is -2, pointing toward the mean at x=0. The score is always a "pull" toward high-density regions. At the mean itself, the score is zero.

Exercise 3.2: Score with Larger Variance Derive

Compute the score at x=2 for p(x) = N(0, 4) (note: σ²=4, so σ=2).

score

Show derivation

score = -(x - μ) / σ² = -(2 - 0) / 4 = -0.5

The score is smaller (magnitude 0.5 vs 2). A wider distribution has a gentler gradient — being at x=2 is less "unusual" when σ=2, so the pull back toward the mean is weaker.

Exercise 3.3: Score Direction Trace

For an equal-weight mixture of two Gaussians with the same variance, centered at x=-3 and x=+3, what is the score at x=0?

(a) 0 (by symmetry, the two modes pull equally in opposite directions)
(b) Positive (pointing right toward the larger mode)
(c) Negative (pointing left)
(d) Undefined (the score doesn't exist between modes)

Show explanation

For a symmetric mixture p(x) = ½ N(-3, σ²) + ½ N(+3, σ²), the density p(x) is symmetric around 0, so its log is also symmetric, and the derivative at x=0 is exactly 0. The leftward pull from the left mode and rightward pull from the right mode cancel perfectly at the midpoint.

Exercise 3.4: Convert ε to Score Derive

Your model predicts ε_θ = 1.2 at timestep t where ā_t = 0.36. What is the estimated score ∇_x log q(x_t)?

score

Show derivation

score = -ε_θ / √(1-ā_t) = -1.2 / √(1-0.36) = -1.2 / √0.64 = -1.2 / 0.8 = -1.5

Exercise 3.5: Implement gaussianScore() Build

Write a function that returns the score of a 1D Gaussian at a given point.

Return a single number.

Show solution

javascript
function gaussianScore(x, mu, sigmaSquared) {
  return -(x - mu) / sigmaSquared;
}
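Building on the solution above, the following sketch connects three ideas from this chapter: the Gaussian score, the score of a symmetric two-mode mixture, and the ε-to-score conversion. `gaussianScore` is repeated so the snippet runs on its own; `mixtureScore` and `epsToScore` are illustrative names, and the point-mass data distribution in the last check is an assumption chosen so the true score is known in closed form.

```javascript
function gaussianScore(x, mu, sigmaSquared) {
  return -(x - mu) / sigmaSquared;
}

// Score of the equal-weight mixture 0.5 N(-m, s2) + 0.5 N(+m, s2):
// a density-weighted average of the two component scores.
function mixtureScore(x, m, sigmaSquared) {
  const wLeft = Math.exp(-((x + m) ** 2) / (2 * sigmaSquared));
  const wRight = Math.exp(-((x - m) ** 2) / (2 * sigmaSquared));
  const weighted = wLeft * gaussianScore(x, -m, sigmaSquared)
                 + wRight * gaussianScore(x, m, sigmaSquared);
  return weighted / (wLeft + wRight);
}

// Convert a noise prediction into a score estimate.
function epsToScore(epsHat, alphaBar) {
  return -epsHat / Math.sqrt(1 - alphaBar);
}

console.log(mixtureScore(0, 3, 1)); // 0: the two pulls cancel at the midpoint
console.log(mixtureScore(1, 3, 1)); // positive: the nearer mode at +3 dominates

// If the data is a single point x0, then q(x_t) = N(sqrt(alphaBar) * x0, 1 - alphaBar),
// and the true noise converts exactly into the Gaussian score.
const scoreX0 = 2, scoreAlphaBar = 0.36, scoreXt = 1.0;
const trueEps = (scoreXt - Math.sqrt(scoreAlphaBar) * scoreX0) / Math.sqrt(1 - scoreAlphaBar);
console.log(epsToScore(trueEps, scoreAlphaBar));
console.log(gaussianScore(scoreXt, Math.sqrt(scoreAlphaBar) * scoreX0, 1 - scoreAlphaBar));
```

The last two printed values agree, which is the "score = noise prediction, rescaled" statement in a case where the exact answer is available.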

Chapter 4: Sampling (Reverse Process)

Training teaches the model to predict noise. Sampling runs the process backwards: start from pure noise x_T ~ N(0, I) and iteratively denoise to get a clean image x₀.

DDPM reverse step:
x_t-1 = (1/√α_t) × (x_t - β_t/√(1-ā_t) × ε_θ(x_t, t)) + σ_t z
where z ~ N(0, I) and σ_t² = β_t (or σ_t² = β̃_t = β_t(1-ā_t-1)/(1-ā_t))

DDIM (deterministic):
x_t-1 = √ā_t-1 · x̂₀ + √(1-ā_t-1) · ε_θ(x_t, t)
where x̂₀ = (x_t - √(1-ā_t) · ε_θ) / √ā_t

Key difference: DDIM has no added noise z, so it's deterministic — same x_T always gives same x₀.

Cost = T forward passes. DDPM sampling requires T=1000 neural network evaluations, one per timestep. Each evaluation is a full forward pass of the network (the same forward cost as a training step, but without the backward pass). This is why diffusion models are slow: generating one image costs 1000× the compute of a single forward pass. DDIM's key contribution is allowing stride > 1 in the timestep schedule.

Exercise 4.1: One DDPM Reverse Step Derive

Given: x_t = 0.8, α_t = 0.99, β_t = 0.01, ā_t = 0.5, ε_θ = 0.6, z = 0.3. Compute x_t-1 using σ_t = √β_t.

x_t-1

Show derivation

mean = (1/√0.99) × (0.8 - 0.01/√0.5 × 0.6)

= 1.00504 × (0.8 - 0.01414 × 0.6)

= 1.00504 × (0.8 - 0.00849) = 1.00504 × 0.7915 = 0.7955

σ_t = √0.01 = 0.1

x_t-1 = 0.7955 + 0.1 × 0.3 = 0.7955 + 0.03 = 0.826

Exercise 4.2: Forward Passes for DDPM Trace

For T=1000 DDPM sampling to generate one 512×512 image, how many neural network forward passes are required?

(a) 1 (the model runs once and produces the image)
(b) 1000 (one per denoising step from t=T down to t=1)
(c) 500 (only half the steps need the network)
(d) 262,144 (one per pixel)

Show explanation

Each reverse step t → t-1 requires one forward pass of ε_θ(x_t, t) to predict the noise. With T=1000 steps (from t=1000 down to t=1), that's exactly 1000 forward passes. If each pass takes ~50ms on a GPU, generation takes ~50 seconds — far slower than GANs (one pass) or VAEs (one pass).

Exercise 4.3: DDIM Stride Derive

DDIM allows skipping timesteps. For T=1000 with 50 DDIM steps, what is the stride (step size in the timestep schedule)? List the first 4 and last 2 timesteps in the schedule.

stride

Show derivation

stride = T / steps = 1000 / 50 = 20

Schedule: [1000, 980, 960, 940, ..., 40, 20]

DDIM with 50 steps visits timesteps {1000, 980, 960, ...20}. Each DDIM step jumps 20 timesteps at once. This gives a 20× speedup over DDPM (50 forward passes vs 1000) with only moderate quality loss — typically imperceptible for 50+ steps.
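The strided schedule can be generated with a short loop. This is a sketch that assumes T is divisible by the number of steps, as in the exercise; `ddimTimesteps` is an illustrative name, and real samplers may space or offset their timesteps differently.

```javascript
// Evenly strided DDIM timesteps from T down to the stride.
// Assumes T is a multiple of numSteps.
function ddimTimesteps(T, numSteps) {
  const stride = T / numSteps;
  const timesteps = [];
  for (let t = T; t >= stride; t -= stride) {
    timesteps.push(t);
  }
  return timesteps;
}

const ddimSchedule = ddimTimesteps(1000, 50);
console.log(ddimSchedule.length);      // 50
console.log(ddimSchedule.slice(0, 4)); // [ 1000, 980, 960, 940 ]
console.log(ddimSchedule.slice(-2));   // [ 40, 20 ]
```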

Exercise 4.4: One DDIM Step Derive

Given: x_t = 1.0, ā_t = 0.5, ā_t-1 = 0.6, ε_θ = 0.4. Compute x_t-1 using DDIM (deterministic).

Step 1: x̂₀ = (x_t - √(1-ā_t)ε_θ)/√ā_t. Step 2: x_t-1 = √ā_t-1 · x̂₀ + √(1-ā_t-1) · ε_θ.

x_t-1

Show derivation

x̂₀ = (1.0 - √0.5 × 0.4) / √0.5 = (1.0 - 0.2828) / 0.7071 = 0.7172 / 0.7071 = 1.0143

x_t-1 = √0.6 × 1.0143 + √0.4 × 0.4 = 0.7746 × 1.0143 + 0.6325 × 0.4

= 0.7857 + 0.2530 = 1.039

Exact computation: x̂₀ = (1.0 - 0.70711×0.4)/0.70711 = 0.71716/0.70711 = 1.01421. x_t-1 = 0.77460×1.01421 + 0.63246×0.4 = 0.78560 + 0.25298 = 1.039. Note: DDIM with no noise is deterministic — run it again with the same x_T and you get the exact same x₀.
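The two update rules used in Exercises 4.1 and 4.4 can be sketched as scalar functions. The names `ddpmReverseStep` and `ddimStep` are illustrative, and the noise prediction is passed in as a number rather than computed by a network.

```javascript
// One DDPM reverse step: posterior mean plus sigma * z.
function ddpmReverseStep(xt, alpha, alphaBar, epsHat, z, sigma) {
  const beta = 1 - alpha;
  const mean = (xt - beta / Math.sqrt(1 - alphaBar) * epsHat) / Math.sqrt(alpha);
  return mean + sigma * z;
}

// One deterministic DDIM step: estimate x0, then re-noise to the previous level
// using the same predicted noise.
function ddimStep(xt, alphaBar, alphaBarPrev, epsHat) {
  const x0Hat = (xt - Math.sqrt(1 - alphaBar) * epsHat) / Math.sqrt(alphaBar);
  return Math.sqrt(alphaBarPrev) * x0Hat + Math.sqrt(1 - alphaBarPrev) * epsHat;
}

// Exercise 4.1 inputs, with sigma = sqrt(beta) = 0.1.
console.log(ddpmReverseStep(0.8, 0.99, 0.5, 0.6, 0.3, 0.1).toFixed(3)); // "0.826"
// Exercise 4.4 inputs.
console.log(ddimStep(1.0, 0.5, 0.6, 0.4).toFixed(3));                   // "1.039"
```

Calling `ddimStep` twice with the same inputs returns the same value, whereas `ddpmReverseStep` changes with every new draw of z.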

Exercise 4.5: Sampling Speedup Derive

A U-Net forward pass takes 45ms on an A100. How long does it take to generate one image with: (a) DDPM T=1000, (b) DDIM 50 steps, (c) DDIM 20 steps? Express in seconds.

DDIM-50 time (seconds)

Show derivation

(a) DDPM: 1000 × 45ms = 45,000ms = 45.0 seconds

(b) DDIM-50: 50 × 45ms = 2,250ms = 2.25 seconds

(c) DDIM-20: 20 × 45ms = 900ms = 0.9 seconds

DDIM-50 is 20× faster than DDPM with minimal quality loss. DDIM-20 pushes under 1 second but quality drops noticeably. Modern schedulers (DPM-Solver, UniPC) can match DDIM-50 quality with only 15-20 steps by using higher-order ODE solvers.

Chapter 5: Classifier-Free Guidance

You prompt a model with "a photo of a golden retriever" and the output is vaguely dog-shaped but blurry and generic. Classifier-Free Guidance (CFG) amplifies the model's response to the text prompt, trading diversity for fidelity. It's the single most impactful inference-time trick in text-to-image generation.

CFG formula:
ε̃ = ε_uncond + w × (ε_cond - ε_uncond)
= w × ε_cond + (1 - w) × ε_uncond

Training: Randomly drop the text condition (replace with ∅) with probability p_drop ≈ 0.1
Inference: Run the model twice per step — once with and once without the text condition
w = 0: no guidance (ε̃ = ε_uncond)
w = 1: standard conditioning (ε̃ = ε_cond)
w = 7.5: typical for Stable Diffusion

CFG doubles inference cost. Every sampling step requires TWO forward passes: one conditioned, one unconditional. This doubles the compute per step. Distillation techniques (guidance distillation, or few-step distilled models such as SDXL Turbo) train a student to produce guided-quality output in a single pass.

Exercise 5.1: CFG Computation Derive

Given ε_cond = [0.3, 0.7] and ε_uncond = [0.1, 0.1], compute ε̃ at guidance scale w=7.5.

For each element: ε̃_i = ε_uncond,i + 7.5 × (ε_cond,i - ε_uncond,i)

ε̃[0] (first element)

Show derivation

ε̃[0] = 0.1 + 7.5 × (0.3 - 0.1) = 0.1 + 7.5 × 0.2 = 0.1 + 1.5 = 1.6

ε̃[1] = 0.1 + 7.5 × (0.7 - 0.1) = 0.1 + 7.5 × 0.6 = 0.1 + 4.5 = 4.6

Notice how the guided noise prediction (1.6, 4.6) is much larger than the conditional prediction (0.3, 0.7). CFG amplifies the difference between conditional and unconditional, pushing the sample harder toward what the prompt describes. At w=7.5, the second element goes from 0.7 to 4.6 — a 6.6× amplification.

Exercise 5.2: w=0 Baseline Trace

What does w=0 reduce the CFG formula to?

(a) ε̃ = ε_uncond (the model ignores the prompt entirely)
(b) ε̃ = ε_cond (standard conditioned output)
(c) ε̃ = 0 (no noise prediction)
(d) ε̃ = (ε_cond + ε_uncond)/2 (average)

Show derivation

ε̃ = ε_uncond + 0 × (ε_cond - ε_uncond) = ε_uncond

At w=0, you get the unconditional model — it generates random images with no regard for the text prompt. At w=1, ε̃ = ε_uncond + ε_cond - ε_uncond = ε_cond, which is just the standard conditional output. Values w > 1 push beyond the conditional prediction, amplifying the prompt's influence.
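The guidance formula applies element by element, so it is a one-line map over the two predictions. A minimal sketch follows; `applyGuidance` is an illustrative name, and the inputs are plain arrays rather than tensors.

```javascript
// Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond), per element.
function applyGuidance(epsCond, epsUncond, w) {
  return epsCond.map((cond, i) => epsUncond[i] + w * (cond - epsUncond[i]));
}

const cfgCond = [0.3, 0.7];
const cfgUncond = [0.1, 0.1];
console.log(applyGuidance(cfgCond, cfgUncond, 0));   // the unconditional prediction
console.log(applyGuidance(cfgCond, cfgUncond, 1));   // the conditional prediction (up to rounding)
console.log(applyGuidance(cfgCond, cfgUncond, 7.5)); // approximately [1.6, 4.6]
```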

Exercise 5.3: ε̃[1] at w=7.5 Derive

Using the same ε_cond = [0.3, 0.7], ε_uncond = [0.1, 0.1], compute ε̃[1] (the second element) at w=7.5.

ε̃[1]

Show derivation

ε̃[1] = 0.1 + 7.5 × (0.7 - 0.1) = 0.1 + 4.5 = 4.6

Exercise 5.4: CFG Inference Cost Derive

DDIM with 50 steps + CFG at w=7.5. Each forward pass takes 45ms. How many total forward passes and how much time for one image?

total forward passes

Show derivation

Forward passes = 50 steps × 2 (cond + uncond) = 100

Time = 100 × 45ms = 4500ms = 4.5 seconds

CFG exactly doubles the cost compared to unguided DDIM-50 (2.25s). In practice, the two predictions can be batched (concatenate the cond and uncond inputs), so wall-clock time is less than 2× if the GPU has headroom. Still, eliminating one of the two passes via distillation is worth a lot.

Exercise 5.5: High Guidance Tradeoff Trace

As guidance scale w increases from 1 to 20, what happens to the generated images?

(a) Images become more diverse and creative
(b) Images become blurrier and lower quality
(c) Images become more prompt-adherent but oversaturated and less diverse, eventually with artifacts
(d) No visible change (w only affects training)

Show explanation

Higher w amplifies the conditional signal more aggressively. At moderate w (5-10), images closely match the prompt with good quality. At very high w (15+), the amplification pushes pixel values to extremes — colors oversaturate, edges become unnaturally sharp, and artifacts appear. It's like turning up the volume on a speaker: at first it's clearer, then it distorts. The sweet spot for Stable Diffusion is w=7-9.

Chapter 6: Latent Diffusion

Running diffusion directly on 512×512 pixels is brutally expensive — the U-Net processes 262,144 spatial tokens per step. Latent Diffusion (Rombach et al. 2022) runs diffusion in a compressed latent space instead. An autoencoder maps images to a 64×64 latent, diffusion operates there, and the decoder reconstructs the final image.

Encoder: z = E(x), x: [512, 512, 3] → z: [64, 64, 4]
Decoder: x̂ = D(z), z: [64, 64, 4] → x̂: [512, 512, 3]

Spatial compression: 8× in each dimension (512 → 64)
Channel change: 3 → 4 (latent has 4 channels)
Pixel compression ratio: (512×512×3) / (64×64×4) = 786432 / 16384 = 48×

48× fewer values to denoise. The U-Net now operates on 64×64 = 4,096 spatial positions (16,384 values across 4 channels) instead of 262,144. Since self-attention is O(n²) in the number of tokens, full-resolution attention cost drops by (262144/4096)² = 64² = 4,096×, an enormous saving. This is what made Stable Diffusion practical on consumer GPUs.

Exercise 6.1: Compression Ratio Derive

Compute the pixel compression ratio for a latent diffusion model that maps 512×512×3 images to 64×64×4 latents.

× compression

Show derivation

Pixel values = 512 × 512 × 3 = 786,432

Latent values = 64 × 64 × 4 = 16,384

Ratio = 786,432 / 16,384 = 48×

Exercise 6.2: Attention FLOPs Savings Derive

Self-attention is O(n²) where n is the number of spatial tokens. How many times fewer FLOPs does attention cost in 64×64 latent space vs 512×512 pixel space?

× fewer FLOPs

Show derivation

Pixel tokens = 512 × 512 = 262,144

Latent tokens = 64 × 64 = 4,096

FLOPs ratio = (262,144)² / (4,096)² = (262,144/4,096)² = 64² = 4,096

The full attention cost is O(n² d), where d is the channel dimension, so at equal d the ratio is set by the token counts alone: 4,096× fewer FLOPs at full resolution. In practice a U-Net does not apply attention at every resolution; it typically restricts attention to downsampled feature maps (for example 32×32, 16×16, 8×8), so the realized saving depends on the architecture. The key point is that a latent model can afford attention at a much higher relative resolution: 64×64 = 4,096 tokens is feasible, 512×512 = 262K tokens is not.

Exercise 6.3: Attention Matrix Memory Derive

For a single self-attention layer with d=320 channels at the full spatial resolution: compute the attention matrix memory in FP16 for (a) latent space 64×64 and (b) pixel space 512×512. Assume 1 head for simplicity.

MB (latent space)

Show derivation

Latent: n = 64 × 64 = 4,096 tokens

Attention matrix: 4,096 × 4,096 = 16,777,216 elements

FP16: 16,777,216 × 2 bytes = 33,554,432 bytes = 32 MB

Pixel: n = 512 × 512 = 262,144 tokens

Attention matrix: 262,144² = 68,719,476,736 elements

FP16: 68,719,476,736 × 2 = 137,438,953,472 bytes = 128 GB

32 MB vs 128 GB — that's a 4096× difference. Full-resolution self-attention in pixel space at 512×512 requires 128 GB for a single attention matrix of a single head of a single layer. This is why pixel-space diffusion either avoids attention or restricts it to tiny resolutions.
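The counts from Exercises 6.1 to 6.3 follow from a few multiplications, sketched below. `latentBudget` and its field names are illustrative; the defaults (8× downsampling, 4 latent channels, 2 bytes per FP16 score, one head) are the assumptions used in this chapter.

```javascript
// Size bookkeeping for a square RGB image and its latent.
function latentBudget(imageSize, downsample = 8, latentChannels = 4, bytesPerScore = 2) {
  const latentSize = imageSize / downsample;
  const pixelTokens = imageSize * imageSize;
  const latentTokens = latentSize * latentSize;
  return {
    latentSize,
    latentValues: latentTokens * latentChannels,
    compressionRatio: (pixelTokens * 3) / (latentTokens * latentChannels),
    attentionFlopsRatio: (pixelTokens / latentTokens) ** 2,    // O(n^2) scaling
    latentAttentionBytes: latentTokens ** 2 * bytesPerScore,   // one n x n matrix
    pixelAttentionBytes: pixelTokens ** 2 * bytesPerScore,
  };
}

const sdBudget = latentBudget(512);
console.log(sdBudget.compressionRatio);               // 48
console.log(sdBudget.attentionFlopsRatio);            // 4096
console.log(sdBudget.latentAttentionBytes / 2 ** 20); // 32  (the "32 MB" above, in MiB)
console.log(sdBudget.pixelAttentionBytes / 2 ** 30);  // 128 (the "128 GB" above, in GiB)
```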

Exercise 6.4: 1024×1024 Latents Derive

SDXL generates 1024×1024 images. With the same 8× spatial compression, what is the latent resolution and total latent values (4 channels)?

total latent values

Show derivation

Latent resolution = 1024/8 = 128×128

Total values = 128 × 128 × 4 = 65,536

The SDXL latent at 128×128 has 4× the spatial tokens of SD 1.5's 64×64 latent. This 4× increase in sequence length makes self-attention 16× more expensive, which is why SDXL uses a larger U-Net with more efficient attention patterns and requires substantially more VRAM.

Exercise 6.5: Where Diffusion Runs Trace

In Stable Diffusion, the diffusion process (noise addition, denoising, sampling) operates on:

(a) The 512×512×3 RGB pixel images directly
(b) The 64×64×4 latent representations (the encoder/decoder handle pixel-to-latent conversion once each)
(c) Both pixel and latent space simultaneously
(d) The text embedding space (77×768)

Show explanation

Each half of the autoencoder runs only once per image: the encoder once during training (to convert a training image to a latent) and the decoder once at the very end of generation (to convert the final latent z₀ to pixels). The entire iterative denoising process, all 50+ sampling steps, runs in the 64×64×4 latent space. This is why it's called latent diffusion.

Chapter 7: Flow Matching

Flow matching (Lipman et al. 2023, Liu et al. 2023) offers a simpler, more elegant framework than DDPM. Instead of defining a noise schedule and deriving a reverse process, it directly learns a velocity field that transports noise to data along straight paths.

Straight path interpolation:
ψ_t(x) = (1-t) · x₀ + t · x₁, t ∈ [0, 1]
where x₀ ~ p_data (clean image), x₁ ~ N(0, I) (noise)

Velocity (target):
u_t(x) = x₁ - x₀ (constant along the straight path)

Conditional flow matching loss:
L = E_{t, x₀, x₁} [ || v_θ(ψ_t(x), t) - u_t ||² ]
= E_{t, x₀, x₁} [ || v_θ(ψ_t(x), t) - (x₁ - x₀) ||² ]

Flow matching is simpler than DDPM. No noise schedule to design, no α_t/ā_t to track, no reverse process derivation. Just learn a velocity field v_θ; with the convention above (t=0 is data, t=1 is noise), the target velocity x₁ - x₀ points from data to noise along a straight line. The loss is a simple MSE between predicted and target velocities. Sampling integrates the ODE dx/dt = v_θ(x, t) backward in time, from noise at t=1 to data at t=0.

Exercise 7.1: Interpolation Derive

Compute ψ_t at t=0.3 for x₀ = [1, 2] (data) and x₁ = [5, 8] (noise).

Apply element-wise: ψ_t[i] = (1-t) × x₀[i] + t × x₁[i]

ψ_0.3[0] (first element)

Show derivation

ψ_0.3[0] = (1-0.3) × 1 + 0.3 × 5 = 0.7 + 1.5 = 2.2

ψ_0.3[1] = (1-0.3) × 2 + 0.3 × 8 = 1.4 + 2.4 = 3.8

At t=0 we're at x₀=[1,2] (data), at t=1 we're at x₁=[5,8] (noise), and at t=0.3 we're 30% of the way along the straight line between them. Note: in some flow matching papers, the convention is reversed (t=0 is noise, t=1 is data). Always check the convention!

Exercise 7.2: ψ_0.3[1] Derive

Second element of ψ_0.3 for x₀=[1,2], x₁=[5,8].

ψ_0.3[1]

Show derivation

ψ_0.3[1] = 0.7 × 2 + 0.3 × 8 = 1.4 + 2.4 = 3.8

Exercise 7.3: Target Velocity Derive

For the same x₀=[1,2] and x₁=[5,8], what is the target velocity u_t?

u_t[0] (first element)

Show derivation

u_t = x₁ - x₀ = [5-1, 8-2] = [4, 6]

The target velocity is constant — it doesn't depend on t at all. This is a key simplification of flow matching: the velocity along a straight path is just the displacement vector from data to noise. The model learns to approximate this constant velocity at each point in space-time.
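The interpolation, the target velocity, and the loss for one training pair can be sketched as follows. The function names are illustrative, `predictVelocity(psi, t)` is a hypothetical stand-in for the network v_θ, and the loss averages over elements, which is an assumption (the formula above uses the squared norm, which differs only by a constant factor).

```javascript
// Straight-path interpolation: psi_t = (1 - t) * x0 + t * x1, per element.
function interpolate(x0, x1, t) {
  return x0.map((value, i) => (1 - t) * value + t * x1[i]);
}

// Target velocity along the straight path: x1 - x0 (independent of t).
function targetVelocity(x0, x1) {
  return x1.map((value, i) => value - x0[i]);
}

// Mean squared error between the predicted and target velocity for one pair.
function flowMatchingLoss(x0, x1, t, predictVelocity) {
  const psi = interpolate(x0, x1, t);
  const target = targetVelocity(x0, x1);
  const predicted = predictVelocity(psi, t);
  const squaredErrors = predicted.map((v, i) => (v - target[i]) ** 2);
  return squaredErrors.reduce((sum, e) => sum + e, 0) / squaredErrors.length;
}

console.log(interpolate([1, 2], [5, 8], 0.3)); // approximately [2.2, 3.8]
console.log(targetVelocity([1, 2], [5, 8]));   // [ 4, 6 ]
// A predictor that always outputs zero velocity: loss = (4^2 + 6^2) / 2 = 26.
console.log(flowMatchingLoss([1, 2], [5, 8], 0.3, () => [0, 0]));
```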

Exercise 7.4: Flow Matching vs DDPM Trace

What is the fundamental difference between flow matching and DDPM at training time?

(a) Flow matching uses a larger model
(b) DDPM uses a different architecture
(c) DDPM predicts noise ε added to x₀; flow matching predicts velocity v = x₁-x₀ at an interpolated point
(d) Flow matching doesn't need training data

Show explanation

Both methods train a neural network with MSE loss on paired inputs and targets. DDPM: input is noised image x_t = √ā_tx₀ + √(1-ā_t)ε, target is ε. Flow matching: input is interpolated point ψ_t = (1-t)x₀ + tx₁, target is velocity x₁-x₀. The architecture, dataset, and optimizer can be identical — only the parameterization of the interpolation path and prediction target differ.

Exercise 7.5: Implement flowMatchStep() Build

Write a function that computes one Euler integration step for flow matching sampling: x_new = x_old + v × dt.

Return a single number: the new position after one Euler step.

Show solution

javascript
function flowMatchStep(x, v, dt) {
  return x + v * dt;
}
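Chaining Euler steps gives a complete sampler. The sketch below assumes this chapter's convention (t=1 is noise, t=0 is data), so the step size dt is negative. The velocity field is a toy stand-in for a trained v_θ: if the data distribution is a single point a, the velocity at (x, t) is (x - a)/t. The names `eulerSample`, `toyDataPoint`, and `toyVelocity` are illustrative.

```javascript
// Integrate dx/dt = v(x, t) from t = 1 (noise) down to t = 0 (data) with Euler steps.
function eulerSample(xNoise, velocityAt, numSteps) {
  let x = xNoise;
  const dt = -1 / numSteps; // negative: moving backward in time
  for (let k = 0; k < numSteps; k++) {
    const t = 1 - k / numSteps; // current time, from 1 down to 1/numSteps
    x = x + velocityAt(x, t) * dt;
  }
  return x;
}

// Toy velocity field for a data distribution concentrated at one point.
const toyDataPoint = 2;
const toyVelocity = (x, t) => (x - toyDataPoint) / t;

console.log(eulerSample(-1.3, toyVelocity, 10)); // ends at approximately 2
console.log(eulerSample(0.4, toyVelocity, 10));  // also approximately 2
```

For this toy field the paths are exactly straight, so Euler integration reaches the data point for any step count; a learned field is only approximately straight, which is why more steps usually help.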

Exercise 7.6: Flow Matching Training Pipeline Design

Put these flow matching training steps in the correct order.

Sample x₀ from data
Sample x₁ ~ N(0,I)
Compute ψ_t = (1-t)x₀ + tx₁
v_θ(ψ_t, t) forward pass
MSE loss vs (x₁ - x₀)

Show answer

Correct order: (1) Sample x₀ from data, (2) Sample x₁ ~ N(0,I), (3) Compute interpolation ψ_t (after drawing t ~ Uniform[0, 1]), (4) Forward pass v_θ(ψ_t, t), (5) MSE loss vs target velocity (x₁ - x₀). This is the entire training loop. It is simpler than DDPM because there's no noise schedule (α, ā) to manage.

Chapter 9: Capstone: Design a Diffusion Pipeline

You're an ML engineer tasked with building an image generation system. You need to make concrete decisions about every component — architecture, noise schedule, sampling, guidance — and estimate the compute cost. This capstone ties together everything from the workbook.

The design brief. Generate 1024×1024 images conditioned on text prompts. Target: high quality, reasonable latency (<5 seconds on an A100), and the ability to serve 1000+ images per hour per GPU.

Exercise 9.1: Latent vs Pixel Trace

For 1024×1024 generation, should you use latent diffusion (with 8× compression to 128×128×4) or pixel-space diffusion? Consider that a single self-attention layer at 1024×1024 pixel resolution would need to attend over 1,048,576 tokens.

Pixel space: higher quality since no information is lost to compression
Latent space: 128×128 = 16,384 tokens is feasible for attention; 1M+ tokens in pixel space is not
Either works equally well at this resolution

Show reasoning

At 1024×1024, pixel-space attention is impractical: 1,048,576² ≈ 1.1 × 10¹² attention scores per head per layer. Even without attention, the U-Net would process 1M spatial positions at the highest resolution. Latent diffusion compresses to 128×128 = 16,384 tokens, making self-attention feasible (16,384² ≈ 268M scores, large but manageable). Modern high-resolution models such as SDXL, FLUX, and Imagen 3 use latent or cascaded approaches.
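The token and score counts can be reproduced with a few lines. The function name `attentionCost` is illustrative, and the sketch counts one score per query-key pair for a single head in a single layer.

```javascript
// Number of tokens and pairwise attention scores for an H x W grid.
function attentionCost(height, width) {
  const tokens = height * width;
  return { tokens, scores: tokens * tokens };
}

const pixel = attentionCost(1024, 1024);
const latent = attentionCost(128, 128);

console.log(pixel.tokens, pixel.scores);   // 1048576 1099511627776
console.log(latent.tokens, latent.scores); // 16384 268435456
console.log(pixel.scores / latent.scores); // 4096
```

An 8× reduction per side cuts the token count by 64× and the attention score count by 64² = 4096×, which is why the latent route is the feasible one.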

Exercise 9.2: U-Net Parameter Count Derive

A U-Net with base channels C=320 and channel multipliers [1, 2, 4, 8] has 4 resolution levels. At each level, there are 2 ResNet blocks. Model each ResNet block as roughly 2 × C_level² parameters: two convolutions, counting only the C × C channel-mixing weights and ignoring the 3×3 kernel factor of 9 and the biases. Compute the total ResNet parameters across all levels under this simplification.

Channels per level: 320, 640, 1280, 2560. Remember the U-Net has encoder + decoder (double the blocks, minus the bottleneck). For simplicity, count encoder = 4 levels × 2 blocks, decoder = 4 levels × 2 blocks, bottleneck = 1 block at the deepest level.

million ResNet params (approximate)

Show derivation

Level 0 (C=320): 2 × 320² = 204,800 per block, × 4 blocks (enc+dec) = 819,200

Level 1 (C=640): 2 × 640² = 819,200 per block, × 4 = 3,276,800

Level 2 (C=1280): 2 × 1280² = 3,276,800 per block, × 4 = 13,107,200

Level 3 (C=2560): 2 × 2560² = 13,107,200 per block, × 4 = 52,428,800

Bottleneck (C=2560): 2 × 2560² = 13,107,200

Subtotal = 819,200 + 3,276,800 + 13,107,200 + 52,428,800 + 13,107,200 = 82,739,200 ≈ 82.7M

82.7M counts only the ResNet convolutions under the simplified 2 × C² rule. A real U-Net also has: self-attention layers at each level (adding ~4×C² per attention block), cross-attention for text conditioning, time embedding MLPs, and group norms, and its skip connections widen the decoder's input channels. The full SD 1.5 U-Net is ~860M parameters; SDXL's is ~2.6B. Our rough ResNet estimate captures the scaling pattern: the deepest levels (highest C) dominate the parameter count. Level 3 alone is 63% of the total.
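The derivation above can be checked with a short script. It is a sketch of the workbook's simplified count only (2 × C² per block, 4 blocks per level, one bottleneck block); the variable names are illustrative.

```javascript
const baseChannels = 320;
const multipliers = [1, 2, 4, 8];
const blocksPerLevel = 4; // 2 encoder blocks + 2 decoder blocks

// Simplified per-block count used in this exercise: 2 * C^2.
function resnetBlockParams(channels) {
  return 2 * channels * channels;
}

const perLevel = multipliers.map(
  (m) => blocksPerLevel * resnetBlockParams(baseChannels * m)
);
const bottleneck = resnetBlockParams(baseChannels * 8); // 1 block at the deepest level
const total = perLevel.reduce((sum, p) => sum + p, 0) + bottleneck;

console.log(perLevel);   // [819200, 3276800, 13107200, 52428800]
console.log(total);      // 82739200
console.log((perLevel[3] / total).toFixed(2)); // "0.63"
```

Each level has 4× the parameters of the level above it, because doubling the channel count quadruples C².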

Exercise 9.3: Inference FLOPs Derive

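A U-Net with ~2.6B parameters (SDXL scale) costs roughly 2 × params FLOPs per token or spatial position (the standard "2N" rule: one multiply and one add per parameter). Convolution and attention weights are reused across the spatial positions of the 128×128 latent, so a full forward pass costs far more than 2N. For this exercise, assume an effective reuse of ~1000 applications per parameter, which gives ~5.2 TFLOPs per forward pass. With 50 DDIM steps + CFG, how many total FLOPs per image?

TFLOPs

Show derivation

FLOPs per forward pass ≈ 2 × 2.6B × 1000 (assumed reuse) = 5.2 × 10¹² = 5.2 TFLOPs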

Forward passes = 50 steps × 2 (CFG) = 100

Total = 100 × 5.2 = 520 TFLOPs per image

520 TFLOPs for one image. For comparison, a dense language model with ~1.8T parameters (a size sometimes rumored, but not officially confirmed, for GPT-4) would need ~3.6 TFLOPs per generated token at 2 FLOPs/param. So generating one SDXL image costs roughly as much as generating 520 / 3.6 ≈ 144 such tokens, about a paragraph of text.
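The per-image total is a product of three factors, which a small helper makes explicit. This sketch takes the workbook's 5.2 TFLOPs per forward pass as a given input. The function name `totalTflopsPerImage` is illustrative.

```javascript
// Total inference cost = cost per forward pass x denoising steps x passes per step.
function totalTflopsPerImage(tflopsPerPass, steps, passesPerStep) {
  return tflopsPerPass * steps * passesPerStep;
}

// 50 DDIM steps, 2 passes per step for CFG (conditional + unconditional).
const withCfg = totalTflopsPerImage(5.2, 50, 2);
// Same sampler without guidance, for comparison.
const withoutCfg = totalTflopsPerImage(5.2, 50, 1);

console.log(withCfg.toFixed(0), withoutCfg.toFixed(0)); // "520" "260"
```

CFG doubles the cost because every step runs the network twice, so halving the step count or dropping guidance has the same effect on the total.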

Exercise 9.4: Time per Image on A100 Derive

An A100 has a peak of ~312 TFLOPS in FP16 (dense tensor-core throughput). Assuming 40% utilization (realistic for U-Net inference), how long per image at 520 TFLOPs/image?

seconds

Show derivation

Effective throughput = 312 × 0.40 = 124.8 TFLOPS

Time = 520 / 124.8 = 4.17 seconds

~4.2 seconds per image meets our <5 second target. In practice, SDXL on an A100 with optimized inference (TensorRT, FlashAttention) generates 1024×1024 images in 3-5 seconds, matching our estimate well.
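The latency estimate is sensitive to the utilization assumption, which the following sketch makes easy to vary. The function name `secondsPerImage` is illustrative, and the 30% case is a hypothetical comparison point rather than a measured figure.

```javascript
// Seconds per image = work per image / effective throughput.
function secondsPerImage(tflopsPerImage, peakTflops, utilization) {
  const effectiveTflops = peakTflops * utilization;
  return tflopsPerImage / effectiveTflops;
}

const atForty = secondsPerImage(520, 312, 0.4);
const atThirty = secondsPerImage(520, 312, 0.3); // hypothetical lower utilization

console.log(atForty.toFixed(2));  // "4.17"
console.log(atThirty.toFixed(2)); // "5.56"
```

Under these assumptions, the <5 second target holds at 40% utilization but would be missed at 30%, so the utilization figure deserves as much scrutiny as the FLOP count.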

Exercise 9.5: Hourly Throughput Derive

At 4.17 seconds per image, how many images per hour can one A100 generate? Does this meet the 1000+ images/hour requirement?

images/hour

Show derivation

Images/hour = 3600 / 4.17 = 863

863 images/hour falls short of the 1000+ target. Options: (1) reduce steps from 50 to 40 (gives 1079/hr), (2) use a distilled model that needs fewer steps, (3) batch 2 images per pass if VRAM allows, (4) drop to 25 steps with a higher-order solver (DPM-Solver++) for comparable quality. Option 3 is often the most practical, but batching does not reduce FLOPs per image; it helps only by raising GPU utilization. Reaching ~1700/hr would require utilization to rise from 40% to roughly 80%.
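The throughput figures can be reproduced with a short sketch. It assumes that time per image scales linearly with the number of denoising steps and that everything else stays fixed. The helper names are illustrative.

```javascript
function imagesPerHour(secondsPerImage) {
  return 3600 / secondsPerImage;
}

// Assumed: latency scales linearly with the number of denoising steps.
function scaledSeconds(baseSeconds, baseSteps, newSteps) {
  return (baseSeconds * newSteps) / baseSteps;
}

const baseline = imagesPerHour(4.17);
const fortySteps = imagesPerHour(scaledSeconds(4.17, 50, 40));

console.log(Math.floor(baseline));   // 863
console.log(Math.floor(fortySteps)); // 1079
```

Passing a different step count to `scaledSeconds` shows how far each option moves throughput relative to the 1000 images/hour target.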

Exercise 9.6: Full Pipeline Design Design

Put the complete image generation pipeline steps in order.

Encode text prompt (CLIP/T5)
Sample z_T ~ N(0,I) in latent space
Iterative denoising (50 steps + CFG)
VAE decode z₀ → pixels
Post-process (clip, scale to 0-255)
Save/return image

Show answer

Correct order: (1) Encode text prompt, (2) Sample noise in latent space, (3) Iterative denoising loop, (4) VAE decode to pixels, (5) Post-process, (6) Save/return. The text encoding happens first (and only once per prompt). The denoising loop accounts for the vast majority of the compute (on the order of 98%). The VAE decoder runs once at the end. Post-processing (clamping values, converting to uint8) is negligible.

Topic	Lesson
Diffusion fundamentals	Diffusion — From Absolute Zero
Flow matching	Flow Matching — From Absolute Zero
VAE / VQ-VAE	VAE & VQ-VAE — From Absolute Zero
Contrastive learning (CLIP)	Contrastive Learning & CLIP — From Absolute Zero
Transformer math	Transformer Math Workbook
