Every equation behind modern image generation — noise schedules, loss derivations, score functions, sampling algorithms, guidance, latent spaces, and flow matching — all solvable in-browser with instant feedback.
You have a clean image x0. The diffusion forward process gradually adds Gaussian noise over T timesteps until nothing recognizable remains. The magic: you can jump to any timestep t directly without iterating through all previous steps.
At each step, a small amount of noise is added according to a schedule β1, β2, ..., βT:
Linear schedule: βt = 0.0001 + (0.02 - 0.0001) × (t-1)/(T-1), with T=1000. Compute α1 = 1 - β1.
β1 corresponds to t=1, so β1 = 0.0001 + 0.0199 × 0/999 = 0.0001.
At the first timestep, almost no noise is added. The signal is preserved at 99.99%.
For the linear schedule above, βt = 0.0001 + 0.0199 × (t-1)/999. The mean of β across all 1000 steps is (0.0001 + 0.02)/2 = 0.01005. Use the approximation: log āt ≈ ∑s=1..t log(1 - βs) ≈ -∑ βs. For the first 500 steps, the average β is approximately 0.0001 + 0.0199 × 249.5/999 ≈ 0.00507. So ∑s=1..500 βs ≈ 500 × 0.00507 = 2.535. What is ā500?
Only about 7.9% of the original signal power remains at the midpoint. The signal coefficient √ā500 ≈ 0.282, meaning the original image is attenuated to ~28% of its intensity.
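To see how close the log(1 - β) ≈ -β shortcut comes, a short sketch can compute āt both ways for the linear schedule above. The function names (`linearBeta`, `alphaBarExact`, `alphaBarApprox`) are illustrative and not part of the workbook.

```javascript
// Linear schedule from the text: beta_t runs from 0.0001 (t=1) to 0.02 (t=T).
function linearBeta(t, T = 1000, betaMin = 0.0001, betaMax = 0.02) {
  return betaMin + (betaMax - betaMin) * (t - 1) / (T - 1);
}

// Exact cumulative product: alphabar_t = prod_{s=1..t} (1 - beta_s).
// Accumulated in log space so the product stays numerically stable.
function alphaBarExact(t, T = 1000) {
  let logProd = 0;
  for (let s = 1; s <= t; s++) logProd += Math.log(1 - linearBeta(s, T));
  return Math.exp(logProd);
}

// First-order approximation: alphabar_t ≈ exp(-sum_{s=1..t} beta_s).
function alphaBarApprox(t, T = 1000) {
  let sum = 0;
  for (let s = 1; s <= t; s++) sum += linearBeta(s, T);
  return Math.exp(-sum);
}

for (const t of [250, 500, 750, 1000]) {
  console.log(t, alphaBarExact(t).toExponential(3), alphaBarApprox(t).toExponential(3));
}
```

Because log(1 - β) < -β for every β > 0, the approximation always sits slightly above the exact product, and the gap widens as β grows.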
Using ā500 ≈ 0.0793 from above, what fraction of the total variance in x500 comes from the original signal x0 (as opposed to noise)?
Hint: xt = √āt · x0 + √(1-āt) · ε. If x0 has unit variance, the signal variance is āt and noise variance is (1-āt). Signal fraction = āt / (āt + (1-āt)).
The signal fraction is exactly āt (since the total variance is 1). Only 7.93% of the variance in x500 comes from the original image. The denoiser has to recover the image from a signal buried under ~12× more noise power.
Now compute ā1000 using the same approximation. The average β over all 1000 steps is 0.01005.
The approximation gives exp(-10.05) ≈ 0.0000432; the exact value computed numerically is slightly lower, ~0.0000404. At t=1000, √āT ≈ 0.0064, so the original signal is attenuated to about 0.64% of its intensity. xT is essentially pure Gaussian noise, which is exactly what we want: the reverse process starts from N(0, I).
At t=500, the noise power is ~11.6× the signal power. The SNR is well below 1, meaning the denoiser is working with mostly noise. The log-SNR is a key quantity in diffusion theory — uniform spacing in log-SNR space corresponds to uniform difficulty for the denoiser.
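The SNR figures above follow directly from āt. A minimal sketch, assuming x0 has unit variance as in the earlier hint (the helper names `snr` and `logSnr` are illustrative):

```javascript
// Signal power is alphabar_t and noise power is 1 - alphabar_t,
// so SNR(t) = alphabar_t / (1 - alphabar_t).
function snr(alphaBar) {
  return alphaBar / (1 - alphaBar);
}

// Natural log of the SNR; zero means signal and noise power are equal.
function logSnr(alphaBar) {
  return Math.log(snr(alphaBar));
}

const aBarMid = 0.0793; // approximate alphabar_500 for the linear schedule
console.log("SNR:", snr(aBarMid));
console.log("noise/signal power:", 1 / snr(aBarMid)); // about 11.6
console.log("log-SNR:", logSnr(aBarMid));
```

A log-SNR of zero corresponds to āt = 0.5, the point where signal and noise contribute equally.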
Write a function that takes x0 (a number), alphabar_t, and epsilon (a noise sample), then returns x_t using the reparameterization trick.
```javascript
function forwardProcess(x0, alphabar_t, epsilon) {
  // Reparameterization: x_t = sqrt(alphabar_t) * x0 + sqrt(1 - alphabar_t) * epsilon
  return Math.sqrt(alphabar_t) * x0 + Math.sqrt(1 - alphabar_t) * epsilon;
}
```
The noise schedule determines how quickly you destroy the image. Too fast and the model can't learn — the jump between adjacent timesteps is too large. Too slow and you waste compute on timesteps where nothing interesting happens. Two schedules dominate: linear and cosine.
Linear schedule: βmin=0.0001, βmax=0.02, T=1000. Compute ā250 using the approximation āt ≈ exp(-∑βs).
Average β for t=1..250: β̄ = 0.0001 + 0.0199 × ((0+249)/2)/999.
The approximation gives exp(-0.645) ≈ 0.525; exact numerical computation gives ~0.524. At 25% of the way through the schedule, about half the signal remains. The approximation log(1-β) ≈ -β is accurate because each β is small (< 0.02 everywhere, and only about 0.005 at t=250).
Same linear schedule. Compute ā750.
The approximation gives exp(-5.67) ≈ 0.00345; the exact numerical value is ~0.00335. The approximation overshoots slightly because log(1-β) ≈ -β drops a -β²/2 term that grows with β (near t=750, β ≈ 0.015). The point stands: at 75% through the schedule, the signal is essentially destroyed, with only about 0.3% remaining.
Cosine schedule with s=0.008, T=1000. Compute ā250 = f(250)/f(0), where f(t) = cos²((t/T + s)/(1+s) × π/2).
f(250) = cos²((250/1000 + 0.008)/1.008 × π/2) = cos²(0.2560 × π/2) = cos²(0.4020) ≈ 0.8469. f(0) = cos²((0.008/1.008) × π/2) = cos²(0.01247) ≈ 0.9998.
Compare with the linear schedule: ā250 ≈ 0.524 (linear) vs 0.847 (cosine). At t=250, the cosine schedule preserves 85% of the signal vs only 52% for linear. The cosine schedule is much gentler early on.
Same cosine schedule. Compute ā750.
Cosine: about 14.4% signal at t=750. Linear: about 0.3% signal at t=750. The cosine schedule still has significant signal at 75% of the way through, which gives the model useful gradients even in the later stages. The linear schedule has already destroyed nearly everything by this point.
Both schedules end at āT ≈ 0 (pure noise). The difference is in the distribution of difficulty. The linear schedule spends most timesteps in either "almost clean" or "almost noise" regimes, so the denoiser doesn't get enough practice on the hardest middle range. The cosine schedule's ā500 ≈ 0.49 means roughly half the timesteps are above 50% signal and half below, a far more balanced curriculum.
Write a function that returns āt for the cosine schedule given t, T, and s=0.008.
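```javascript
function cosineAlphaBar(t, T) {
  const s = 0.008; // small offset so beta_t is not vanishingly small near t=0
  // f(x) = cos^2(((x/T + s) / (1 + s)) * pi/2)
  const f = x => Math.cos((x / T + s) / (1 + s) * Math.PI / 2) ** 2;
  return f(t) / f(0);
}
```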
You're training a diffusion model and need to understand what the loss actually measures. DDPM (Ho et al. 2020) simplifies the variational bound into a stunningly simple objective: predict the noise that was added.
For a single pixel, the true noise is εtrue = 0.5 and your model predicts εpred = 0.3. What is the MSE loss contribution from this pixel?
A 32×32×3 image has 3072 values. If the average per-element squared error is 0.04 (as above), what is the total (summed) loss? What about the mean loss?
Most implementations use mean over all elements, so the loss is independent of image size.
In practice, the mean loss is used for optimization (typically around 0.01-0.1 during training). The sum loss grows with image size — a 256×256×3 image would have 196,608 elements and sum loss of 7,864 for the same per-element error.
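The difference between the two reductions can be sketched directly. The helper below is illustrative (a hypothetical `mseLoss`, not a library call) and reproduces the 32×32×3 example with a constant per-element error.

```javascript
// Squared error between predicted and true noise, reduced by "mean" or "sum".
function mseLoss(pred, target, reduction = "mean") {
  let total = 0;
  for (let i = 0; i < pred.length; i++) total += (pred[i] - target[i]) ** 2;
  return reduction === "sum" ? total : total / pred.length;
}

const numElements = 32 * 32 * 3; // 3072 values
const trueNoise = new Array(numElements).fill(0.5);
const predNoise = new Array(numElements).fill(0.3);

console.log("mean:", mseLoss(predNoise, trueNoise, "mean")); // about 0.04
console.log("sum:", mseLoss(predNoise, trueNoise, "sum"));   // about 122.88
```

The mean stays at about 0.04 for any image size, while the sum scales linearly with the element count.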
Given xt = 1.5, āt = 0.25, and εθ = 0.8 (model's noise prediction), recover the predicted x0.
Use: x̂0 = (xt - √(1-āt) · εθ) / √āt
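The inversion formula above can be checked numerically. This is a small sketch with an illustrative function name (`predictX0`); it just rearranges the forward process.

```javascript
// Invert x_t = sqrt(abar) * x0 + sqrt(1 - abar) * eps for x0,
// using the model's noise prediction in place of the true eps.
function predictX0(xt, alphaBarT, epsTheta) {
  return (xt - Math.sqrt(1 - alphaBarT) * epsTheta) / Math.sqrt(alphaBarT);
}

const x0Hat = predictX0(1.5, 0.25, 0.8);
console.log(x0Hat); // about 1.614
```

Dividing by √āt is what makes this estimate sensitive at high noise: with āt = 0.25 every unit of error in εθ already shifts x̂0 by √0.75 / 0.5 ≈ 1.73.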
This is the SNR reweighting. At high noise (āt small), the x0 loss amplifies errors massively because 1/āt is huge. Noise prediction with Lsimple implicitly downweights high-noise timesteps by a factor of āt/(1-āt), which is why it produces more stable training.
This DDPM training step has a bug. The model trains but produces blurry images. Click the buggy line.
```python
def train_step(model, x0, t):
    noise = torch.randn_like(x0)
    alphabar_t = get_alphabar(t)
    x_t = sqrt(alphabar_t) * x0 + sqrt(alphabar_t) * noise
    pred_noise = model(x_t, t)
    loss = F.mse_loss(pred_noise, noise)
    return loss
```
Line 4 is the bug. The noise coefficient should be sqrt(1 - alphabar_t), not sqrt(alphabar_t). The correct formula is xt = √āt · x0 + √(1-āt) · ε. Using √āt for both terms means at early timesteps (āt ≈ 1) you add too much noise, and at late timesteps (āt ≈ 0) you add too little. The model learns a corrupted noise distribution, producing blurry reconstructions.
v-prediction smoothly interpolates: at low noise it predicts ε (like DDPM), at high noise it predicts -x0. This gives a numerically stable target at all timesteps — unlike ε-prediction which has high variance at low noise, or x0-prediction which has high variance at high noise.
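The interpolation is easy to see in code. The sketch below assumes the commonly used definition v = √āt · ε - √(1-āt) · x0, which the text above does not spell out; `vTarget` is an illustrative name.

```javascript
// Assumed v-prediction target: v = sqrt(abar) * eps - sqrt(1 - abar) * x0.
function vTarget(x0, eps, alphaBarT) {
  return Math.sqrt(alphaBarT) * eps - Math.sqrt(1 - alphaBarT) * x0;
}

const x0Sample = 2.0;
const epsSample = 0.5;
for (const aBar of [1.0, 0.5, 0.0]) {
  console.log(`abar=${aBar}: v=${vTarget(x0Sample, epsSample, aBar)}`);
}
```

At āt = 1 the target equals ε, at āt = 0 it equals -x0, and in between it is a weighted mix of the two.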
The score function is the gradient of the log-probability: ∇x log p(x). It points in the direction of increasing probability — toward the data manifold. Score-based generative models learn this gradient field and follow it to generate samples.
Compute the score ∇x log p(x) at x=2 for p(x) = N(0, 1).
The score at x=2 is -2, pointing toward the mean at x=0. The score is always a "pull" toward high-density regions. At the mean itself, the score is zero.
Compute the score at x=2 for p(x) = N(0, 4) (note: σ²=4, so σ=2).
The score is smaller (magnitude 0.5 vs 2). A wider distribution has a gentler gradient — being at x=2 is less "unusual" when σ=2, so the pull back toward the mean is weaker.
For a symmetric mixture p(x) = ½ N(-3, σ²) + ½ N(+3, σ²), the density p(x) is symmetric around 0, so its log is also symmetric, and the derivative at x=0 is exactly 0. The leftward pull from the left mode and rightward pull from the right mode cancel perfectly at the midpoint.
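The cancellation at the midpoint can be verified numerically. This sketch computes the score of an equal-weight Gaussian mixture as p'(x)/p(x); the function names are illustrative.

```javascript
// Density of a 1D Gaussian N(mu, sigma2).
function gaussianPdf(x, mu, sigma2) {
  return Math.exp(-((x - mu) ** 2) / (2 * sigma2)) / Math.sqrt(2 * Math.PI * sigma2);
}

// Score of an equal-weight mixture: d/dx log p(x) = p'(x) / p(x).
function mixtureScore(x, means, sigma2) {
  let density = 0;
  let gradient = 0;
  for (const mu of means) {
    const p = gaussianPdf(x, mu, sigma2) / means.length;
    density += p;
    gradient += p * (-(x - mu) / sigma2); // derivative of each component
  }
  return gradient / density;
}

for (const x of [-3, -1, 0, 1, 3]) {
  console.log(x, mixtureScore(x, [-3, 3], 1));
}
```

Away from x = 0 the nearer mode dominates, so the score points toward it; exactly at the midpoint the two contributions cancel.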
Your model predicts εθ = 1.2 at timestep t where āt = 0.36. What is the estimated score ∇x log q(xt)?
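A noise prediction converts to a score estimate by rescaling. The sketch below assumes the standard relation score ≈ -εθ / √(1-āt); `scoreFromEps` is an illustrative name.

```javascript
// Assumed relation between noise prediction and score:
// grad_x log q(x_t) ≈ -eps_theta / sqrt(1 - alphabar_t).
function scoreFromEps(epsTheta, alphaBarT) {
  return -epsTheta / Math.sqrt(1 - alphaBarT);
}

console.log(scoreFromEps(1.2, 0.36)); // about -1.5
```

With āt = 0.36 the noise standard deviation is √0.64 = 0.8, so the predicted noise of 1.2 maps to a score of about -1.5.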
Write a function that returns the score of a 1D Gaussian at a given point.
```javascript
function gaussianScore(x, mu, sigmaSquared) {
  // d/dx log N(x; mu, sigma^2) = -(x - mu) / sigma^2
  return -(x - mu) / sigmaSquared;
}
```
Training teaches the model to predict noise. Sampling runs the process backwards: start from pure noise xT ~ N(0, I) and iteratively denoise to get a clean image x0.
Given: xt = 0.8, αt = 0.99, βt = 0.01, āt = 0.5, εθ = 0.6, z = 0.3. Compute xt-1 using σt = √βt.
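One way to work through this step is the sketch below. It assumes the standard DDPM sampling update from Ho et al. 2020, xt-1 = (xt - βt/√(1-āt) · εθ)/√αt + σt · z, and the name `ddpmStep` is illustrative.

```javascript
// One DDPM reverse step with sigma_t = sqrt(beta_t) (assumed update rule).
function ddpmStep({ xt, alphaT, betaT, alphaBarT, epsTheta, z }) {
  // Posterior mean: remove the predicted noise, then rescale by 1/sqrt(alpha_t).
  const mean = (xt - (betaT / Math.sqrt(1 - alphaBarT)) * epsTheta) / Math.sqrt(alphaT);
  const sigmaT = Math.sqrt(betaT);
  return mean + sigmaT * z; // add fresh noise scaled by sigma_t
}

const xPrev = ddpmStep({
  xt: 0.8, alphaT: 0.99, betaT: 0.01, alphaBarT: 0.5, epsTheta: 0.6, z: 0.3,
});
console.log(xPrev); // about 0.8255
```

The mean term contributes about 0.7955 and the noise term σt · z = 0.1 × 0.3 = 0.03.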
Each reverse step t → t-1 requires one forward pass of εθ(xt, t) to predict the noise. With T=1000 steps (from t=1000 down to t=1), that's exactly 1000 forward passes. If each pass takes ~50ms on a GPU, generation takes ~50 seconds — far slower than GANs (one pass) or VAEs (one pass).
DDIM allows skipping timesteps. For T=1000 with 50 DDIM steps, what is the stride (step size in the timestep schedule)? List the first 4 and last 2 timesteps in the schedule.
DDIM with 50 steps visits timesteps {1000, 980, 960, ..., 20}. Each DDIM step jumps 20 timesteps at once. This gives a 20× speedup over DDPM (50 forward passes vs 1000) with only moderate quality loss, typically imperceptible at 50+ steps.
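The schedule can be generated with a few lines. This is a minimal sketch that assumes T is divisible by the number of steps; `ddimTimesteps` is an illustrative name, and real schedulers offer other spacing options.

```javascript
// Evenly strided DDIM schedule, descending from T down to the stride.
function ddimTimesteps(T, numSteps) {
  const stride = Math.floor(T / numSteps);
  const steps = [];
  for (let t = T; t >= stride; t -= stride) steps.push(t);
  return steps;
}

const schedule = ddimTimesteps(1000, 50);
console.log(schedule.slice(0, 4)); // first four timesteps
console.log(schedule.slice(-2));   // last two timesteps
console.log(schedule.length);      // number of forward passes
```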
Given: xt = 1.0, āt = 0.5, āt-1 = 0.6, εθ = 0.4. Compute xt-1 using DDIM (deterministic).
Step 1: x̂0 = (xt - √(1-āt)εθ)/√āt. Step 2: xt-1 = √āt-1 · x̂0 + √(1-āt-1) · εθ.
Exact computation: x̂0 = (1.0 - 0.70711×0.4)/0.70711 = 0.71716/0.70711 = 1.01421. xt-1 = 0.77460×1.01421 + 0.63246×0.4 = 0.78560 + 0.25298 = 1.039. Note: DDIM with no noise is deterministic — run it again with the same xT and you get the exact same x0.
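The two steps above translate directly into code. A minimal sketch of the deterministic update (`ddimStep` is an illustrative name):

```javascript
// Deterministic DDIM step: estimate x0, then re-noise to the previous level
// using the same predicted noise (no fresh randomness).
function ddimStep(xt, alphaBarT, alphaBarPrev, epsTheta) {
  const x0Hat = (xt - Math.sqrt(1 - alphaBarT) * epsTheta) / Math.sqrt(alphaBarT);
  return Math.sqrt(alphaBarPrev) * x0Hat + Math.sqrt(1 - alphaBarPrev) * epsTheta;
}

console.log(ddimStep(1.0, 0.5, 0.6, 0.4)); // about 1.039
```

Calling it twice with the same arguments returns the same value, which is the determinism the note above describes.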
A U-Net forward pass takes 45ms on an A100. How long does it take to generate one image with: (a) DDPM T=1000, (b) DDIM 50 steps, (c) DDIM 20 steps? Express in seconds.
DDIM-50 is 20× faster than DDPM with minimal quality loss. DDIM-20 pushes under 1 second but quality drops noticeably. Modern schedulers (DPM-Solver, UniPC) can match DDIM-50 quality with only 15-20 steps by using higher-order ODE solvers.
You prompt a model with "a photo of a golden retriever" and the output is vaguely dog-shaped but blurry and generic. Classifier-Free Guidance (CFG) amplifies the model's response to the text prompt, trading diversity for fidelity. It's the single most impactful inference-time trick in text-to-image generation.
Given εcond = [0.3, 0.7] and εuncond = [0.1, 0.1], compute ε̃ at guidance scale w=7.5.
For each element: ε̃i = εuncond,i + 7.5 × (εcond,i - εuncond,i)
Notice how the guided noise prediction (1.6, 4.6) is much larger than the conditional prediction (0.3, 0.7). CFG amplifies the difference between conditional and unconditional, pushing the sample harder toward what the prompt describes. At w=7.5, the second element goes from 0.7 to 4.6 — a 6.6× amplification.
At w=0, you get the unconditional model — it generates random images with no regard for the text prompt. At w=1, ε̃ = εuncond + εcond - εuncond = εcond, which is just the standard conditional output. Values w > 1 push beyond the conditional prediction, amplifying the prompt's influence.
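The guidance rule is a one-line element-wise combination. A small sketch, with `applyCfg` as an illustrative name:

```javascript
// Classifier-free guidance: eps_tilde = eps_uncond + w * (eps_cond - eps_uncond).
function applyCfg(epsCond, epsUncond, w) {
  return epsCond.map((c, i) => epsUncond[i] + w * (c - epsUncond[i]));
}

const epsCond = [0.3, 0.7];
const epsUncond = [0.1, 0.1];
for (const w of [0, 1, 7.5]) {
  console.log(`w=${w}:`, applyCfg(epsCond, epsUncond, w));
}
```

At w = 0 the output equals the unconditional prediction, at w = 1 the conditional one, and at w = 7.5 it is approximately [1.6, 4.6].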
Using the same εcond = [0.3, 0.7], εuncond = [0.1, 0.1], compute ε̃[1] (the second element) at w=7.5.
DDIM with 50 steps + CFG at w=7.5. Each forward pass takes 45ms. How many total forward passes and how much time for one image?
CFG exactly doubles the cost compared to unguided DDIM-50 (2.25s). In practice, the two predictions can be batched (concatenate the cond and uncond inputs), so wall-clock time is less than 2× if the GPU has headroom. Still, eliminating one of the two passes via distillation is worth a lot.
Higher w amplifies the conditional signal more aggressively. At moderate w (5-10), images closely match the prompt with good quality. At very high w (15+), the amplification pushes pixel values to extremes — colors oversaturate, edges become unnaturally sharp, and artifacts appear. It's like turning up the volume on a speaker: at first it's clearer, then it distorts. The sweet spot for Stable Diffusion is w=7-9.
Running diffusion directly on 512×512 pixels is brutally expensive — the U-Net processes 262,144 spatial tokens per step. Latent Diffusion (Rombach et al. 2022) runs diffusion in a compressed latent space instead. An autoencoder maps images to a 64×64 latent, diffusion operates there, and the decoder reconstructs the final image.
Compute the pixel compression ratio for a latent diffusion model that maps 512×512×3 images to 64×64×4 latents.
Self-attention is O(n²) where n is the number of spatial tokens. How many times fewer FLOPs does attention cost in 64×64 latent space vs 512×512 pixel space?
Strictly, full attention costs O(n² d) FLOPs, where d is the channel dimension. Pixel space has n = 262,144 tokens and latent space has n = 4,096, so the ratio of the n² terms is (262144/4096)² = 64² = 4096. But the U-Net doesn't apply attention at every resolution: both pixel and latent models typically restrict it to lower resolutions (32×32, 16×16, 8×8). The 256× figure comes from comparing sequence lengths at the resolutions where attention is actually applied rather than at full resolution. At a matched feature-map size the cost is identical: a latent model reaches 32×32 = 1,024 tokens after a single 2× downsample (64/2), while a pixel model needs 16× downsampling (512/16) to reach the same 1,024 tokens. The key saving is that the latent model can afford attention at higher resolutions: 64×64 = 4,096 tokens is feasible, 512×512 = 262K is not.
For a single self-attention layer with d=320 channels at the full spatial resolution: compute the attention matrix memory in FP16 for (a) latent space 64×64 and (b) pixel space 512×512. Assume 1 head for simplicity.
32 MB vs 128 GB — that's a 4096× difference. Full-resolution self-attention in pixel space at 512×512 requires 128 GB for a single attention matrix of a single head of a single layer. This is why pixel-space diffusion either avoids attention or restricts it to tiny resolutions.
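The memory figures come from squaring the token count. A quick sketch (illustrative helper name, binary units, one head, FP16):

```javascript
// Bytes for one n x n attention matrix, where n = height * width tokens.
function attentionMatrixBytes(height, width, bytesPerElement = 2) {
  const tokens = height * width;
  return tokens * tokens * bytesPerElement;
}

const MIB = 2 ** 20;
const GIB = 2 ** 30;
const latentBytes = attentionMatrixBytes(64, 64);
const pixelBytes = attentionMatrixBytes(512, 512);

console.log("latent:", latentBytes / MIB, "MiB");
console.log("pixel:", pixelBytes / GIB, "GiB");
console.log("ratio:", pixelBytes / latentBytes);
```

The channel dimension d does not appear because the attention matrix itself is n × n; d only affects the Q, K, V projections and the FLOP count.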
SDXL generates 1024×1024 images. With the same 8× spatial compression, what is the latent resolution and total latent values (4 channels)?
The SDXL latent at 128×128 has 4× the spatial tokens of SD 1.5's 64×64 latent. This 4× increase in sequence length makes self-attention 16× more expensive, which is why SDXL uses a larger U-Net with more efficient attention patterns and requires substantially more VRAM.
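The latent sizes for both models follow from the same arithmetic. A small sketch, assuming 8× spatial downsampling, 4 latent channels, and RGB input (`latentStats` is an illustrative name):

```javascript
// Latent shape and compression for a square RGB image.
function latentStats(imageSize, downsample = 8, latentChannels = 4) {
  const side = imageSize / downsample;
  const tokens = side * side;
  const latentValues = tokens * latentChannels;
  return {
    side,
    tokens,
    latentValues,
    compression: (imageSize * imageSize * 3) / latentValues,
  };
}

const sd15 = latentStats(512);
const sdxl = latentStats(1024);
const tokenRatio = sdxl.tokens / sd15.tokens;
console.log(sd15, sdxl);
console.log("token ratio:", tokenRatio, "attention cost ratio:", tokenRatio ** 2);
```

The compression ratio is the same for both image sizes because it depends only on the downsampling factor and channel counts.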
Each half of the autoencoder runs only once per image: the encoder once during training (converting a training image to a latent) and the decoder once at the very end of generation (converting the final latent z0 to pixels). The entire iterative denoising process, all 50+ sampling steps, runs in the 64×64×4 latent space. This is why it's called latent diffusion.
Flow matching (Lipman et al. 2023, Liu et al. 2023) offers a simpler, more elegant framework than DDPM. Instead of defining a noise schedule and deriving a reverse process, it directly learns a velocity field that transports noise to data along straight paths.
Compute ψt at t=0.3 for x0 = [1, 2] (data) and x1 = [5, 8] (noise).
Apply element-wise: ψt[i] = (1-t) × x0[i] + t × x1[i]
At t=0 we're at x0=[1,2] (data), at t=1 we're at x1=[5,8] (noise), and at t=0.3 we're 30% of the way along the straight line between them. Note: in some flow matching papers, the convention is reversed (t=0 is noise, t=1 is data). Always check the convention!
Second element of ψ0.3 for x0=[1,2], x1=[5,8].
For the same x0=[1,2] and x1=[5,8], what is the target velocity ut?
The target velocity is constant — it doesn't depend on t at all. This is a key simplification of flow matching: the velocity along a straight path is just the displacement vector from data to noise. The model learns to approximate this constant velocity at each point in space-time.
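Both quantities are simple element-wise operations. A minimal sketch using this workbook's convention (t=0 is data, t=1 is noise); the function names are illustrative.

```javascript
// Point on the straight path: psi_t = (1 - t) * x0 + t * x1.
function interpolate(x0, x1, t) {
  return x0.map((v, i) => (1 - t) * v + t * x1[i]);
}

// Target velocity along that path: u_t = x1 - x0 (independent of t).
function targetVelocity(x0, x1) {
  return x1.map((v, i) => v - x0[i]);
}

const dataPoint = [1, 2];
const noisePoint = [5, 8];
console.log(interpolate(dataPoint, noisePoint, 0.3)); // about [2.2, 3.8]
console.log(targetVelocity(dataPoint, noisePoint));   // [4, 6]
```

Differentiating ψt with respect to t gives x1 - x0, which is why the target does not change along the path.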
Both methods train a neural network with MSE loss on paired inputs and targets. DDPM: input is noised image xt = √ātx0 + √(1-āt)ε, target is ε. Flow matching: input is interpolated point ψt = (1-t)x0 + tx1, target is velocity x1-x0. The architecture, dataset, and optimizer can be identical — only the parameterization of the interpolation path and prediction target differ.
Write a function that computes one Euler integration step for flow matching sampling: xnew = xold + v × dt.
```javascript
function flowMatchStep(x, v, dt) {
  // One explicit Euler step along the velocity field
  return x + v * dt;
}
```
Put these flow matching training steps in the correct order.
Correct order: (1) Sample x0 from data, (2) Sample x1 ~ N(0,I), (3) Compute interpolation ψt (at a time t drawn from [0, 1]), (4) Forward pass vθ(ψt, t), (5) MSE loss vs target velocity (x1 - x0). This is the entire training loop. It is simpler than DDPM because there's no noise schedule (α, ā) to manage.
Once you have a trained model (whether it predicts noise, score, or velocity), there are two ways to sample: integrate an ODE (deterministic) or simulate an SDE (stochastic). Both generate valid samples, but they have different tradeoffs.
Starting at x0 = 3.0, integrate dx/dt = v(x, t) = -x using Euler's method with Δt = 0.1 for 3 steps. Compute x1, x2, x3.
The true solution of dx/dt = -x is x(t) = 3e^(-t), so x(0.3) = 3e^(-0.3) ≈ 2.222. Our Euler approximation gives 2.187, an error of ~1.6%. The Euler method multiplies by (1-Δt) at each step, so x_n = 3(0.9)^n. This converges to the true exponential as Δt → 0.
For the same ODE (v = -x, x0 = 3.0), the exact solution at t=0.3 is x = 3e^(-0.3) ≈ 2.2225. Our 3-step Euler gave 2.187. What is the relative error as a percentage?
With 30 steps (Δt = 0.01) the error would drop to ~0.15%. With 3 steps (Δt = 0.1) it's 1.6%. This is why diffusion sampling quality degrades with fewer steps — each step introduces discretization error, and it compounds. Higher-order solvers (Heun, RK4, DPM-Solver) reduce error per step, allowing fewer steps for the same quality.
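The step-count comparison can be reproduced with a generic Euler integrator. A minimal sketch (`eulerIntegrate` and `relativeErrorPercent` are illustrative names):

```javascript
// Fixed-step explicit Euler integration of dx/dt = velocity(x, t) from t=0 to tEnd.
function eulerIntegrate(velocity, xStart, tEnd, numSteps) {
  const dt = tEnd / numSteps;
  let x = xStart;
  let t = 0;
  for (let i = 0; i < numSteps; i++) {
    x += velocity(x, t) * dt;
    t += dt;
  }
  return x;
}

function relativeErrorPercent(approx, exact) {
  return (Math.abs(approx - exact) / Math.abs(exact)) * 100;
}

const decay = (x, t) => -x;          // the ODE from the exercise
const exactValue = 3 * Math.exp(-0.3);
for (const n of [3, 30, 300]) {
  const approx = eulerIntegrate(decay, 3.0, 0.3, n);
  console.log(n, approx.toFixed(5), relativeErrorPercent(approx, exactValue).toFixed(3) + "%");
}
```

Euler is a first-order method, so cutting the step size by 10× cuts the error by roughly 10×.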
With 5 steps, Δt = 0.2 per step. The Euler method approximates curves with straight line segments — larger steps mean more error where the velocity field curves. The accumulated error means the sample ends up in a low-probability region, producing blurry averages of multiple modes rather than sharp samples. This is why modern schedulers (DPM-Solver++, UniPC) use higher-order methods that better follow the ODE trajectory.
ODE sampling is deterministic: no randomness is injected, so the same xT always maps to the same x0. SDE sampling adds Gaussian noise at each step (the dW term), so even with the same starting point, each run follows a different trajectory and produces a different image. Both are valid samples from the learned distribution, but ODE gives you a reproducible mapping while SDE gives you stochastic diversity.
Near t=T, the velocity field is relatively smooth (noisy samples all look similar). Near t=0, the model is resolving fine spatial details — small errors get amplified into visible artifacts. Adaptive solvers automatically take smaller steps where the velocity field has high curvature (near t=0) and larger steps where it's smooth (near t=T). This optimal step allocation is why adaptive solvers outperform fixed-step schedules.
You're an ML engineer tasked with building an image generation system. You need to make concrete decisions about every component — architecture, noise schedule, sampling, guidance — and estimate the compute cost. This capstone ties together everything from the workbook.
At 1024×1024, full-resolution pixel-space attention is infeasible: 1M² ≈ 10^12 attention scores per head per layer. Even without attention, the U-Net would process ~1M spatial positions at the highest resolution. Latent diffusion compresses to 128×128 ≈ 16K tokens, making self-attention feasible (16,384² ≈ 268M scores, large but manageable). Every modern high-resolution model (SDXL, FLUX, Imagen 3) uses latent or cascaded approaches.
A U-Net with base channels C=320 and channel multipliers [1, 2, 4, 8] has 4 resolution levels. At each level, there are 2 ResNet blocks. Each ResNet block has roughly 2 × Clevel² parameters (two 3×3 convolutions). Compute the total ResNet parameters across all levels.
Channels per level: 320, 640, 1280, 2560. Remember the U-Net has encoder + decoder (double the blocks, minus the bottleneck). For simplicity, count encoder = 4 levels × 2 blocks, decoder = 4 levels × 2 blocks, bottleneck = 1 block at the deepest level.
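82.7M counts only the ResNet convolution parameters under the simplified 2 × C² estimate (a full 3×3 convolution holds 9 × C² weights, so the true convolution count is several times larger). A real U-Net also has: self-attention layers at each level (adding ~4×C² per attention block), cross-attention for text conditioning, time embedding MLPs, skip connections, and group norms. The full SD 1.5 U-Net is ~860M parameters; SDXL's is ~2.6B. Our rough ResNet estimate captures the scaling pattern: the deepest levels (highest C) dominate the parameter count. The encoder and decoder blocks at the deepest level alone are 63% of the total, or 79% once the bottleneck block is included.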
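The 82.7M figure can be reproduced with the workbook's own simplifications: 2 × C² parameters per ResNet block, two blocks per level in both encoder and decoder, and one bottleneck block at the deepest level. The function name below is illustrative.

```javascript
// Rough ResNet parameter count under the workbook's simplified estimate.
function resnetParamEstimate(baseChannels, multipliers, blocksPerLevel = 2) {
  const perBlock = c => 2 * c * c; // simplified: kernel-size factor ignored
  const channels = multipliers.map(m => baseChannels * m);
  let total = 0;
  for (const c of channels) {
    total += 2 * blocksPerLevel * perBlock(c); // encoder + decoder blocks
  }
  total += perBlock(channels[channels.length - 1]); // single bottleneck block
  return total;
}

const totalParams = resnetParamEstimate(320, [1, 2, 4, 8]);
console.log(totalParams, (totalParams / 1e6).toFixed(1) + "M");
```

Since the count scales with C², doubling the channel width at a level quadruples that level's contribution.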
A U-Net with ~2.6B parameters (SDXL scale) takes approximately 2 × params FLOPs per forward pass (the standard "2N" rule for FLOPs of a neural network with N params). With 50 DDIM steps + CFG, how many total FLOPs per image?
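520 TFLOPs for one image, taking ~5.2 TFLOPs per forward pass × 100 passes (50 steps × 2 for CFG). Note that 2 × 2.6B parameters is only 5.2 GFLOPs if each weight is used once per pass; convolution and attention weights are reused at every spatial position, so the real per-pass cost is far above 2N, and the ~5.2 TFLOPs used here should be read as a rough working estimate rather than a direct result of the 2N rule. For comparison, a ~1.8T-parameter language model (the size assumed here for GPT-4) at 2 FLOPs/param needs ~3.6 TFLOPs per generated token. So generating one SDXL image costs roughly as much as generating ~144 such tokens, about a paragraph of text.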
An A100 achieves ~312 TFLOPS in FP16. Assuming 40% utilization (realistic for U-Net inference), how long per image at 520 TFLOPs/image?
~4.2 seconds per image meets our <5 second target. In practice, SDXL on an A100 with optimized inference (TensorRT, FlashAttention) generates 1024×1024 images in 3-5 seconds, matching our estimate well.
At 4.17 seconds per image, how many images per hour can one A100 generate? Does this meet the 1000+ images/hour requirement?
863 images/hour falls short of the 1000+ target. Options: (1) reduce steps from 50 to 40 (gives 1079/hr), (2) use a distilled model that needs fewer steps, (3) batch 2 images per pass if VRAM allows, (4) use DDIM-25 with a higher-order solver (DPM-Solver++) for comparable quality. Option 3 is the most practical: batch=2 nearly doubles throughput to ~1700/hr.
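The latency and throughput estimates can be recomputed for different step counts. This sketch takes the workbook's 520 TFLOPs per image (at 50 steps) as given and scales it linearly with step count; the function names are illustrative.

```javascript
// Seconds per image given total work and sustained throughput.
function secondsPerImage(totalTflop, peakTflops, utilization) {
  return totalTflop / (peakTflops * utilization);
}

function imagesPerHour(seconds) {
  return 3600 / seconds;
}

const TFLOP_AT_50_STEPS = 520; // workbook's estimate, taken as an assumption
for (const steps of [50, 40, 25]) {
  const work = TFLOP_AT_50_STEPS * (steps / 50);
  const sec = secondsPerImage(work, 312, 0.4);
  console.log(steps, sec.toFixed(2) + " s", Math.round(imagesPerHour(sec)) + " images/hr");
}
```

The unrounded figures differ slightly from those in the text, which start from 4.17 s rounded to two decimals.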
Put the complete image generation pipeline steps in order.
Correct order: (1) Encode text prompt, (2) Sample noise in latent space, (3) Iterative denoising loop, (4) VAE decode to pixels, (5) Post-process, (6) Save/return. The text encoding happens first (and only once per prompt). The denoising loop is 98% of the compute. The VAE decoder runs once at the end. Post-processing (clamping values, converting to uint8) is negligible.
| Topic | Lesson |
|---|---|
| Diffusion fundamentals | Diffusion — From Absolute Zero |
| Flow matching | Flow Matching — From Absolute Zero |
| VAE / VQ-VAE | VAE & VQ-VAE — From Absolute Zero |
| Contrastive learning (CLIP) | Contrastive Learning & CLIP — From Absolute Zero |
| Transformer math | Transformer Math Workbook |