CS 231n — Generative Models (Part 2): GANs, Diffusion & Beyond

Roadmap

What You'll Master

01 The Generative Model Landscape
02 Generative Adversarial Networks
03 Diffusion: The Core Intuition
04 Rectified Flow
05 Classifier-Free Guidance
06 U-Net to DiT
07 Latent Diffusion Models
08 Text-to-Image & Video
09 Generalized Diffusion
10 Summary & Connections

Chapter 01

The Generative Model Landscape

Every generative model answers one question: given some training images, how do we produce new images that look like they came from the same distribution? The approaches differ in how explicitly they model the data distribution p(x).

The Taxonomy

We can split generative models along a single axis: does the model give you an explicit density function, or does it just give you samples?

Category	Density	Examples
Explicit (tractable)	You can compute p_θ(x) exactly	Autoregressive (PixelCNN, GPT), Normalizing Flows
Explicit (approximate)	You optimize a bound on p_θ(x)	VAEs (ELBO)
Implicit (direct)	No density; just a sampler	GANs
Implicit (indirect)	Learns a score/velocity field	Diffusion, Flow Matching

Quick Review from Part 1

Autoregressive models factor the joint distribution into a product of conditionals. PixelCNN generates images one pixel at a time, left-to-right, top-to-bottom. GPT generates text one token at a time. The key equation:

Autoregressive Factorization p_θ(x) = ∏_{i=1}^{n} p_θ(x_i | x_1, ..., x_{i−1})

Strengths: exact likelihood, stable training, no mode collapse. Weakness: sequential generation is painfully slow for images (one pixel at a time).
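The factorization can be sketched on a toy problem. The block below is a minimal illustration, not PixelCNN or GPT: the conditional `conditional_prob_one` is a made-up rule over binary sequences, chosen only so that every conditional is a valid probability. It shows that the log-likelihood is a sum of conditional log-probabilities and that sampling proceeds one element at a time.

```python
import itertools
import math
import random

def conditional_prob_one(prefix):
    """Toy conditional p(x_i = 1 | x_1, ..., x_{i-1}); a hypothetical rule for illustration."""
    return (1 + sum(prefix)) / (2 + len(prefix))

def log_likelihood(x):
    """log p(x) = sum_i log p(x_i | x_<i), the chain-rule factorization."""
    total = 0.0
    for i, xi in enumerate(x):
        p_one = conditional_prob_one(x[:i])
        total += math.log(p_one if xi == 1 else 1.0 - p_one)
    return total

def sample(n, rng):
    """Generate sequentially: each draw conditions on everything generated so far."""
    x = []
    for _ in range(n):
        x.append(1 if rng.random() < conditional_prob_one(x) else 0)
    return x

# Every conditional is normalized, so the joint sums to 1 over all 2^4 sequences.
total_mass = sum(math.exp(log_likelihood(list(bits)))
                 for bits in itertools.product([0, 1], repeat=4))
generated = sample(8, random.Random(0))
```

The sequential loop in `sample` is the source of the slowness noted above: generating n elements takes n dependent model evaluations.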

Variational Autoencoders (VAEs) introduce a latent variable z and optimize a lower bound on log p(x):

ELBO (Evidence Lower Bound) log p(x) ≥ E_{q_φ(z|x)}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))
Reconstruction term − Regularization term

Strengths: fast sampling (just decode a random z), smooth latent space. Weakness: blurry outputs because the decoder must hedge across all plausible images consistent with z.
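As a rough sketch of how the two ELBO terms are computed in practice, the block below assumes a diagonal Gaussian encoder q_φ(z|x) = N(μ, diag(exp(log_var))), a standard normal prior, and a decoder with fixed unit-variance Gaussian noise. These modeling choices are common but are assumptions here, not something the notes specify; the function names are illustrative.

```python
import math

def kl_diag_gaussian_to_standard_normal(mu, log_var):
    """Closed-form D_KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def gaussian_recon_log_lik(x, x_hat, sigma=1.0):
    """log p_theta(x | z) when the decoder outputs mean x_hat with fixed std sigma."""
    sq_err = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return -0.5 * sq_err / sigma ** 2 - 0.5 * len(x) * math.log(2 * math.pi * sigma ** 2)

def elbo_single_sample(x, x_hat, mu, log_var):
    """One-sample ELBO estimate: reconstruction term minus the KL regularizer."""
    return gaussian_recon_log_lik(x, x_hat) - kl_diag_gaussian_to_standard_normal(mu, log_var)
```

With a fixed-variance Gaussian decoder the reconstruction term is a negative squared error up to a constant, which is one way to see why averaging over plausible images produces blur.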

What's New Today

Part 1 gave us density-based models (autoregressive, VAE). Today we cover the rest: GANs (implicit, adversarial), Diffusion (score-based, iterative denoising), Latent Diffusion (the modern hybrid), and the text-to-image/video pipelines that combine everything.

Chapter 02

Generative Adversarial Networks

Here's a completely different approach. Forget about modeling p(x) explicitly. Instead, train a neural network to generate samples that are so realistic a second neural network can't tell them from real data. This is the GAN idea (Goodfellow et al., 2014).

The Setup

Start with a simple distribution we can easily sample from — a standard Gaussian z ~ N(0, I). The generator G is a neural network that maps z → x: it takes random noise and produces an image. The discriminator D is a second neural network that takes an image and outputs a probability: "is this image real or fake?"

Definition

Generator G(z)

A neural network that maps latent noise z ~ p(z) to a synthetic image. Its goal: produce outputs that the discriminator classifies as real. G defines an implicit distribution p_G(x) — the distribution of images it generates.

Definition

Discriminator D(x)

A neural network that outputs D(x) ∈ [0, 1], the probability that x is a real image. D is trained to output 1 for real data and 0 for generated (fake) data.

The Minimax Objective

Training is a two-player game. D wants to correctly classify real vs. fake. G wants to fool D. Formally:

GAN Minimax Objective min_G max_D V(G, D) = E_{x~p_data}[log D(x)] + E_{z~p(z)}[log(1 − D(G(z)))]

Read the two terms. Term 1: for real images, D wants D(x) ≈ 1, so log D(x) ≈ 0, its largest possible value. Term 2: for fake images G(z), D wants D(G(z)) ≈ 0, so log(1 − D(G(z))) ≈ 0, again its largest possible value. Meanwhile, G wants D(G(z)) ≈ 1 (fool the discriminator), which drives log(1 − D(G(z))) → −∞. D maximizes V; G minimizes V.
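To make the two terms concrete, here is a small sketch that estimates V(G, D) from discriminator outputs on a minibatch. The numbers fed in are invented for illustration; `gan_value` is a hypothetical helper, not part of any library.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(G, D) from discriminator outputs in (0, 1).

    d_real holds D(x_i) on real samples; d_fake holds D(G(z_i)) on generated samples.
    """
    term_real = sum(math.log(d) for d in d_real) / len(d_real)
    term_fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return term_real + term_fake

# A confident, correct discriminator pushes V toward its maximum of 0.
v_strong_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.01, 0.02])
# A fooled discriminator (D(G(z)) near 1) makes V strongly negative, which is what G wants.
v_fooled_d = gan_value(d_real=[0.99, 0.98], d_fake=[0.99, 0.98])
# A discriminator that outputs 1/2 everywhere gives V = -2 log 2.
v_confused = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
```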

Training: Alternating Gradient Steps

GAN Training (One Iteration)

Update D (ascend on V): Sample minibatch {x₁,...,x_m} from data, sample {z₁,...,z_m} from p(z). Compute gradient ∇_{θ_D} (1/m) ∑ [log D(x_i) + log(1 − D(G(z_i)))]. Step θ_D upward.
Update G (descend on V): Sample {z₁,...,z_m} from p(z). Compute gradient ∇_{θ_G} (1/m) ∑ log(1 − D(G(z_i))). Step θ_G downward.
Repeat.

The Gradient Problem & the Non-Saturating Fix

Early in training, G produces garbage. D easily classifies everything correctly: D(G(z)) ≈ 0. Write D = σ(a), where a is the discriminator's logit. Then ∂/∂a log(1 − σ(a)) = −σ(a) = −D(G(z)), which is essentially zero when D(G(z)) ≈ 0: the sigmoid has saturated, so almost no gradient flows back through D to G's parameters. G receives almost no learning signal.

Vanishing Generator Gradient

When D is confident that G's outputs are fake, ∂/∂θ_G log(1 − D(G(z))) ≈ 0. The generator is stuck: it knows it's bad, but gets no gradient to improve. This is called the saturating loss.

The fix: instead of minimizing log(1 − D(G(z))), train G to maximize log D(G(z)). Same fixed points, but the gradient is large when D(G(z)) ≈ 0. This is the non-saturating GAN loss used in practice:

Non-Saturating Generator Loss L_G = −E_{z~p(z)}[log D(G(z))]
G wants to maximize D(G(z)) — "fool the discriminator"
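The difference between the two generator objectives is easiest to see in the gradient with respect to the discriminator's logit a, where D = sigmoid(a). The sketch below assumes a sigmoid output layer on D, which is the usual setup but is not stated explicitly in these notes.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), the generator term in the minimax objective."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), the non-saturating generator loss."""
    return -(1.0 - sigmoid(a))

# Logit on a fake sample early in training: D(G(z)) = sigmoid(-6), roughly 0.0025.
a = -6.0
g_sat = grad_saturating(a)      # tiny magnitude: almost no learning signal
g_ns = grad_non_saturating(a)   # magnitude close to 1: strong learning signal
```

Both gradients push the logit in the same direction; only their magnitudes differ when the discriminator is confident.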

The Optimal Discriminator

For a fixed G, what discriminator maximizes V(G, D)? We can solve this analytically.

Derivation — Optimal Discriminator

For each x, V contains the integrand: p_data(x) log D(x) + p_G(x) log(1 − D(x)). Take the derivative with respect to D(x) and set to zero:

p_data(x) / D(x) − p_G(x) / (1 − D(x)) = 0

Solving: D*(x) = p_data(x) / (p_data(x) + p_G(x))

When p_G = p_data, D*(x) = 1/2 everywhere. The discriminator is maximally confused — it can't tell real from fake. This is the equilibrium.
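A quick numerical check of the closed form: at a single point x, treat p_data(x) and p_G(x) as two given numbers and search over D(x) on a grid. The densities used below are arbitrary illustrative values.

```python
import math

def integrand(d, p_data, p_g):
    """Pointwise contribution to V at one x: p_data log D + p_G log(1 - D)."""
    return p_data * math.log(d) + p_g * math.log(1.0 - d)

def optimal_discriminator(p_data, p_g):
    """Closed-form maximizer D*(x) = p_data / (p_data + p_G)."""
    return p_data / (p_data + p_g)

def grid_argmax(p_data, p_g, n=9999):
    """Brute-force maximizer of the integrand over an evenly spaced grid in (0, 1)."""
    grid = [(k + 1) / (n + 1) for k in range(n)]
    return max(grid, key=lambda d: integrand(d, p_data, p_g))

d_star = optimal_discriminator(0.3, 0.1)   # data is 3x as likely as fakes here
d_grid = grid_argmax(0.3, 0.1)
```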

GANs Minimize Jensen–Shannon Divergence

Plug D* back into V(G, D*) and simplify:

Derivation — V(G, D*) = 2 · JSD − 2 log 2

V(G, D*) = E_{x~p_data}[log(p_data(x) / (p_data(x) + p_G(x)))] + E_{x~p_G}[log(p_G(x) / (p_data(x) + p_G(x)))]

Let M = (p_data + p_G)/2, so that p_data + p_G = 2M. Then:

V(G, D*) = E_{p_data}[log(p_data / (2M))] + E_{p_G}[log(p_G / (2M))]

= D_KL(p_data || M) + D_KL(p_G || M) − 2 log 2

= 2 · JSD(p_data || p_G) − 2 log 2

Since JSD ≥ 0 with equality iff p_data = p_G, the global minimum of V(G, D*) = −2 log 2, achieved when the generator perfectly matches the data distribution.
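The identity can be verified numerically on discrete distributions, where expectations become finite sums. The two distributions below are arbitrary examples with strictly positive entries.

```python
import math

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence: average KL to the mixture M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_G(x)) plugged in."""
    v = 0.0
    for pd, pg in zip(p_data, p_g):
        d_star = pd / (pd + pg)
        v += pd * math.log(d_star) + pg * math.log(1.0 - d_star)
    return v

p_data = [0.5, 0.3, 0.2]
p_g = [0.2, 0.2, 0.6]
lhs = value_at_optimal_d(p_data, p_g)
rhs = 2 * jsd(p_data, p_g) - 2 * math.log(2)
```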

No Loss Curve to Monitor

Because D and G are adversaries, neither loss decreases monotonically. D's loss goes up when G improves; G's loss goes up when D improves. There is no single number you can watch to know if training is succeeding. You have to look at generated samples. This makes GAN training notoriously tricky.

DC-GAN & StyleGAN

DC-GAN (Radford et al., 2015) established the convolutional architecture recipe: fractional-strided convolutions in G (upsampling), strided convolutions in D (downsampling), batch normalization in both networks (except G's output layer and D's input layer), ReLU in G (with tanh at the output), LeakyReLU in D. No fully connected layers except at the bottleneck.

StyleGAN (Karras et al., 2019) was the pinnacle of GAN image quality. Three ingredients: (1) a mapping network that transforms z into an intermediate latent w, giving a more disentangled space. (2) Adaptive Instance Normalization (AdaIN) injecting w at each resolution layer, which controls "style" at different scales. (3) Progressive growing (train at 4×4, then 8×8, gradually increasing resolution), a training scheme inherited from Progressive GAN (Karras et al., 2018) rather than introduced by StyleGAN.

Latent space interpolation: Walk smoothly between two z vectors and the generated images morph smoothly — a man gains a smile, sunglasses appear gradually. This smooth latent space is a hallmark of well-trained GANs.

Interactive: GAN Training in 1D

The generator (gold) tries to match the real data distribution (blue). The discriminator output D(x) is shown in green. Watch alternating training steps reshape both.

GAN Summary

Pros	Cons
Beautiful, sharp samples	No density estimation (can't compute p(x))
Fast single-pass generation	Training instability (mode collapse, oscillation)
Smooth, interpolable latent space	No loss curve to monitor convergence
Conceptually elegant (game theory)	Hyperparameter-sensitive, hard to scale

The Rise and Fall

GANs dominated image generation from 2014–2021. StyleGAN produced the most photorealistic faces ever seen. But GANs never solved their core problems: mode collapse, training instability, and the inability to scale to diverse, multi-modal datasets. Diffusion models largely avoided all three, and by 2022 GANs were largely superseded for general-purpose image generation.

Chapter 03

Diffusion Models — The Core Intuition

Forget everything about adversarial training. Diffusion models take a completely different approach, and it starts with a beautifully simple idea.

The Two Processes

Pick any noise distribution p_noise — a standard Gaussian works perfectly. Now imagine two processes:

Forward process (corruption): Take a real image x and gradually add noise to it over "time" t ∈ [0, 1]. At t = 0, it's the clean image. At t = 1, it's pure noise. At any t in between, it's a noisy version of the image — call it x_t.

Reverse process (denoising): Start from pure noise x₁ ~ p_noise and gradually remove the noise, stepping backward from t = 1 to t = 0. If you can do this perfectly, you've turned random noise into a realistic image.

The Key Insight

Corrupting data is trivial — just add noise. The hard part is removing noise. But here's the trick: if you have pairs of (noisy image, clean image), you can train a neural network to predict how to denoise. That's it. Train a denoiser, then run it iteratively to generate new images from random noise.

Why This Works: Score Functions

There's a deep theoretical justification. The score function of a distribution p(x) is defined as:

Score Function s(x) = ∇_x log p(x)

This is a vector field: at every point x, the score points in the direction of increasing probability density. If you're at a low-density region, the score tells you which direction to walk to reach higher-density areas — toward the data.

The remarkable connection: the optimal denoiser is intimately related to the score function. When you train a network to remove noise from corrupted data, you're implicitly learning the score ∇_{x_t} log p(x_t) at each noise level t. This is called denoising score matching (Vincent 2011), a variant of score matching (Hyvärinen 2005) that Song & Ermon (2019) turned into a generative model.

Why Denoising = Score Estimation

Consider data corrupted by Gaussian noise: x_t = x + σ_tε where ε ~ N(0, I). The optimal denoiser satisfies: E[x | x_t] = x_t + σ_t² ∇_{x_t} log p(x_t). So predicting the noise ε is equivalent to estimating the score ∇ log p(x_t), up to scaling. Tweedie's formula makes this precise.
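Tweedie's formula can be checked in closed form for the simplest case. The sketch below assumes 1D data x ~ N(0, s2), which is an illustrative choice made here because both the posterior mean and the score of the noisy marginal are then known exactly.

```python
def posterior_mean(x_t, s2, sigma2):
    """E[x | x_t] for x ~ N(0, s2) and x_t = x + sigma * eps (Gaussian conditioning)."""
    return x_t * s2 / (s2 + sigma2)

def score_noisy_marginal(x_t, s2, sigma2):
    """Score of the noisy marginal p(x_t) = N(0, s2 + sigma2)."""
    return -x_t / (s2 + sigma2)

def tweedie_denoise(x_t, s2, sigma2):
    """Tweedie: E[x | x_t] = x_t + sigma^2 * score(x_t)."""
    return x_t + sigma2 * score_noisy_marginal(x_t, s2, sigma2)

def expected_noise(x_t, s2, sigma2):
    """E[eps | x_t] = (x_t - E[x | x_t]) / sigma, which equals -sigma * score(x_t)."""
    sigma = sigma2 ** 0.5
    return (x_t - posterior_mean(x_t, s2, sigma2)) / sigma
```

The last function makes the "up to scaling" remark explicit: the expected noise is the score multiplied by −σ_t.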

The Training-Inference Split

Training: Sample a clean image x from the dataset. Sample a noise level t ~ Uniform(0, 1). Create the noisy version x_t. Train the network f_θ(x_t, t) to predict how to clean it up (predict the noise, predict the clean image, or predict a velocity — more on this in Chapter 4).

Inference: Sample x₁ ~ N(0, I). Run f_θ iteratively: x_0.99 = denoise(x₁, t=1), x_0.98 = denoise(x_0.99, t=0.99), ... until you reach x₀. Each step removes a little noise. After enough steps, you have a clean, realistic image.

Cost of Iteration

Unlike GANs (one forward pass to generate), diffusion models require many forward passes (50–1000 steps). Each step is a full neural network evaluation. This makes generation slower. The entire field of diffusion acceleration (distillation, consistency models, few-step methods) exists to solve this.

Chapter 04

Rectified Flow — Clean Modern Diffusion

The original diffusion formulation (DDPM, Ho et al. 2020) uses a complex noise schedule and variance-preserving SDE. Modern practice uses a much cleaner formulation called Rectified Flow (Liu et al. 2022) or Flow Matching (Lipman et al. 2022). The core training loop is just a few lines of code.

The Straight-Line Interpolation

The idea: connect each data point x to a noise point z with a straight line. At time t ∈ [0, 1], the interpolated point is:

Linear Interpolation x_t = (1 − t) · x + t · z
At t=0: clean data x. At t=1: pure noise z.

The velocity along this straight line is constant:

Target Velocity v = dx_t/dt = z − x
Direction from data to noise (or noise to data, by flipping sign)

Training

Train a neural network f_θ(x_t, t) to predict this velocity:

Rectified Flow Training

Sample x ~ p_data (a training image)
Sample z ~ N(0, I) (random noise)
Sample t ~ Uniform(0, 1)
Interpolate: x_t = (1 − t) · x + t · z
Target: v = z − x
Loss: L = ||f_θ(x_t, t) − v||²
Gradient step on θ to minimize L
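The steps above can be sketched with NumPy. The gradient step is omitted because it depends on the framework; `toy_model` is a hypothetical stand-in for f_θ that always predicts zero velocity, and the "dataset" is eight random 2D points.

```python
import numpy as np

def rectified_flow_batch(x, rng):
    """Build one training batch: noisy inputs x_t, times t, and velocity targets v."""
    z = rng.standard_normal(x.shape)                  # z ~ N(0, I)
    t = rng.uniform(0.0, 1.0, size=(x.shape[0], 1))   # one t per example
    x_t = (1.0 - t) * x + t * z                       # straight-line interpolation
    v = z - x                                         # target velocity
    return x_t, t, v

def rectified_flow_loss(model, x, rng):
    """Mean over the batch of ||model(x_t, t) - v||^2."""
    x_t, t, v = rectified_flow_batch(x, rng)
    pred = model(x_t, t)
    return float(np.mean(np.sum((pred - v) ** 2, axis=1)))

def toy_model(x_t, t):
    """Hypothetical stand-in for f_theta: predicts zero velocity everywhere."""
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 2)) + 3.0   # stand-in data: 8 points in 2D
loss = rectified_flow_loss(toy_model, x, rng)
```

A useful sanity check on any implementation: x_t − t · v must equal the clean sample x, since the target velocity points along the same straight line used for interpolation.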

Astounding Simplicity

This is the entire training algorithm. No complex noise schedules. No forward/reverse SDE. No KL divergence. Just: mix data with noise, predict the velocity, regress with MSE. The power of diffusion models comes from scale (big networks, big datasets), not algorithmic complexity.

Sampling: Euler Integration

At inference, we want to go from noise (t = 1) to data (t = 0). We follow the learned velocity field backward:

Rectified Flow Sampling (Euler Method)

Sample x₁ ~ N(0, I)
Choose T steps (e.g., T = 50). Set Δt = 1/T.
For t = 1, 1−Δt, 1−2Δt, ..., Δt:
    Compute v_t = f_θ(x_t, t)
    Step: x_{t−Δt} = x_t − v_t · Δt
Return x₀ (the generated image)

This is just Euler integration of the ODE dx/dt = f_θ(x, t), run backward in time from t = 1 to t = 0 (hence the minus sign in the update). More steps = more accurate integration = better samples. Fewer steps = faster but less accurate.
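Here is a minimal sketch of the sampler. Instead of a trained network it uses an exact velocity field for a hypothetical dataset containing a single value, where the field is known in closed form: x_t = (1 − t)·x₀ + t·z implies v = z − x₀ = (x_t − x₀)/t.

```python
import random

def euler_sample(velocity, x1, num_steps):
    """Integrate dx/dt = velocity(x, t) backward from t = 1 to t = 0 with Euler steps."""
    dt = 1.0 / num_steps
    x = x1
    for k in range(num_steps):
        t = 1.0 - k * dt          # t = 1, 1 - dt, ..., dt
        v = velocity(x, t)
        x = x - v * dt            # x_{t - dt} = x_t - v_t * dt
    return x

DATA_POINT = 2.5  # hypothetical dataset consisting of a single value

def oracle_velocity(x, t):
    """Exact velocity field when all data sits at DATA_POINT."""
    return (x - DATA_POINT) / t

rng = random.Random(0)
x1 = rng.gauss(0.0, 1.0)                          # x_1 ~ N(0, 1)
x0 = euler_sample(oracle_velocity, x1, num_steps=50)
```

Because the path in this toy case is exactly straight, even `num_steps=1` lands on the data point; with a learned field over a real dataset the paths are only approximately straight, so the step count matters.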

Why Straight Flows Are Better

Earlier diffusion models (DDPM, VP-SDE) use curved paths between data and noise. Curved paths require more integration steps to follow accurately: small errors in each Euler step accumulate along curves. Rectified Flow trains on straight lines between each (data, noise) pair, so the learned flow is closer to straight and the Euler method can follow it with fewer steps. Lipman et al. (2022) showed that flow matching with optimal-transport (straight-line) conditional paths is easier to integrate than diffusion paths, enabling 10–20 step generation with minimal quality loss.

Distillation: Fewer Steps

Even 50 steps is slow for real-time applications. Distillation trains a student model to jump directly from noise to data in fewer steps (4, 2, or even 1 step). The teacher model runs the full sampling trajectory; the student learns to shortcut it. Consistency models (Song et al. 2023) learn to map any point on the trajectory directly to x₀, enabling one-step generation.

Interactive: Rectified Flow Sampling

Noise points (right, red) flow along straight lines toward data points (left, gold). Euler integration with adjustable steps. Fewer steps = cruder paths.

Chapter 05

Classifier-Free Guidance

So far, our diffusion model generates random images from the training distribution. But we want conditional generation: "generate a photo of a golden retriever" or "generate a sunset over mountains." How do we steer the diffusion process toward a specific condition y?

Conditional Rectified Flow

The simplest approach: make the network condition-aware. Instead of f_θ(x_t, t), train f_θ(x_t, y, t) where y is a class label, text embedding, or any conditioning signal. The training loop is identical; you just feed y as additional input.

This works, but the samples are "okay" — diverse but not strongly aligned with the condition. We want sharper conditioning: images that clearly match the prompt, even at the cost of some diversity.

The Trick: Random Dropout of Conditioning

During training, randomly replace y with a null token y_∅ (empty conditioning) some fraction of the time (e.g., 10%). This means the same network learns both:

• Unconditional generation: f_θ(x_t, y_∅, t) — "generate any image"

• Conditional generation: f_θ(x_t, y, t) — "generate an image matching y"
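The dropout itself is a one-liner applied to each batch before it reaches the network. In this sketch `NULL_TOKEN` is a hypothetical stand-in for y_∅; real systems typically use a learned embedding or an empty-prompt encoding instead.

```python
import random

NULL_TOKEN = None  # hypothetical stand-in for the null condition y_∅

def drop_condition(labels, p_drop, rng):
    """Replace each label with NULL_TOKEN independently with probability p_drop."""
    return [NULL_TOKEN if rng.random() < p_drop else y for y in labels]

rng = random.Random(0)
labels = ["cat", "dog", "car"] * 1000
dropped = drop_condition(labels, p_drop=0.1, rng=rng)
drop_rate = sum(y is NULL_TOKEN for y in dropped) / len(dropped)
```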

Guidance at Inference

At sampling time, compute both velocities and combine them:

Classifier-Free Guidance v_∅ = f_θ(x_t, y_∅, t) (unconditional velocity)
v_y = f_θ(x_t, y, t) (conditional velocity)
v_cfg = (1 + w) · v_y − w · v_∅
w = guidance weight. w=0 gives plain conditional; w>0 amplifies conditioning.

Why This Works

The Geometry of Guidance

Think of v_∅ as pointing toward "any plausible image" and v_y as pointing toward "images matching condition y." The difference (v_y − v_∅) is the direction that makes the image more y-like. CFG adds an extra push in that direction:

v_cfg = v_y + w · (v_y − v_∅)

When w = 0, you get plain conditional sampling. When w > 0, you amplify whatever makes the image specifically match y, pushing samples toward higher p(y|x). In score-function terms: v_cfg ∝ ∇_x log p(x_t) + (1+w) · ∇_x log p(y|x_t).
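The two ways of writing the guided velocity are algebraically identical, which a few lines of code confirm. Velocities are plain lists of floats here to keep the sketch dependency-free; the example vectors are made up.

```python
def cfg_velocity(v_uncond, v_cond, w):
    """Guided velocity in 'extra push' form: v_y + w * (v_y - v_∅)."""
    return [vc + w * (vc - vu) for vu, vc in zip(v_uncond, v_cond)]

def cfg_velocity_weighted(v_uncond, v_cond, w):
    """Same quantity in weighted form: (1 + w) * v_y - w * v_∅."""
    return [(1.0 + w) * vc - w * vu for vu, vc in zip(v_uncond, v_cond)]

v_uncond = [0.5, -1.0]
v_cond = [1.0, 2.0]
guided = cfg_velocity(v_uncond, v_cond, w=3.0)
```

Note that for w > 0 the guided velocity lies outside the segment between v_∅ and v_y: guidance extrapolates past the conditional prediction rather than interpolating.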

Worked Example — Effect of Guidance Weight

Prompt: "a photo of a cat." At w = 0: diverse images, some clearly cats, some ambiguous. At w = 3: all images are unambiguously cats, sharp and detailed, but less variety (similar poses/angles). At w = 10: even more cat-like but oversaturated colors, artifacts appear. At w = 20: extreme artifacts, the image is "too much cat."

Typical production values: w = 3–7 for text-to-image. There's a quality-diversity tradeoff: higher w = more aligned with the prompt but less diverse and potentially lower quality.

Why "Classifier-Free"?

Earlier work (Dhariwal & Nichol, 2021) used a separately-trained classifier p(y|x) to guide diffusion. This required training an extra model and running it at every denoising step. CFG eliminates the classifier entirely — the single diffusion model serves as both the unconditional and conditional generator. The name "classifier-free" distinguishes it from the earlier "classifier-guided" approach.

CFG Doubles Cost

Every sampling step requires two forward passes: one unconditional (v_∅) and one conditional (v_y). This doubles the compute per step. Some methods batch the two passes together to use GPU parallelism, but the FLOPs still double.

Interactive: Classifier-Free Guidance Vectors

From a point x_t, see how the unconditional (gray), conditional (blue), and guided (gold) velocity vectors change with guidance weight w.

Chapter 06

The Network: U-Net to DiT

The velocity predictor f_θ(x_t, y, t) needs a neural network architecture. Two families have dominated: the U-Net (inherited from image segmentation) and the Diffusion Transformer (DiT) (adapted from the vision transformer).

U-Net Architecture

The U-Net has an encoder-decoder structure with skip connections:

Encoder: Stack of downsampling convolutional blocks. Resolution halves at each level (64→32→16→8). The network sees the image at multiple scales.

Decoder: Stack of upsampling blocks. Resolution doubles at each level (8→16→32→64). Skip connections concatenate encoder features at matching resolutions, preserving fine details.

Timestep injection: The timestep t is embedded via sinusoidal positional encoding (as in transformers), then injected into each block via scale-and-shift (also called FiLM conditioning): given embedding e(t), compute scale γ and shift β, then transform features as γ · h + β. When h is the output of a normalization layer, this is adaptive normalization: adaptive group norm in convolutional U-Nets, Adaptive Layer Normalization (AdaLN) in transformers.
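A sinusoidal timestep embedding can be sketched as follows. The `max_period` of 10000 follows the original Transformer positional encoding and is an assumption here; the ordering of cosine and sine halves also varies between implementations.

```python
import math

def timestep_embedding(t, dim, max_period=10000.0):
    """Sinusoidal embedding of a scalar timestep t into a vector of even length dim."""
    half = dim // 2
    # Geometrically spaced frequencies from 1 down to roughly 1 / max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(t * f) for f in freqs] + [math.sin(t * f) for f in freqs]

e = timestep_embedding(0.37, dim=16)
```

The resulting vector is what gets passed to the small network that produces γ and β.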

Text conditioning: Text embeddings from CLIP or T5 are injected via cross-attention layers. At each resolution, the image features attend to the text token sequence, allowing each spatial location to "look at" relevant words in the prompt.

Definition

AdaLN (Adaptive Layer Normalization)

Instead of fixed scale/shift in layer norm, predict them from conditioning: γ, β = MLP(e(t)). Apply as: AdaLN(h) = γ · LayerNorm(h) + β. This modulates the network's behavior based on the current timestep (and optionally class label).
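The definition translates almost directly into NumPy. For brevity this sketch replaces the MLP with a single linear map per output (`w_gamma`, `w_beta`), which is a simplification; shapes are (batch, tokens, dim) for features and (batch, emb_dim) for the conditioning embedding.

```python
import numpy as np

def layer_norm(h, eps=1e-6):
    """Normalize each token's feature vector to zero mean and unit variance."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def ada_ln(h, cond_emb, w_gamma, w_beta):
    """AdaLN(h) = gamma * LayerNorm(h) + beta, with gamma and beta predicted from cond_emb.

    h: (batch, tokens, dim); cond_emb: (batch, emb_dim); w_gamma, w_beta: (emb_dim, dim).
    """
    gamma = cond_emb @ w_gamma            # (batch, dim)
    beta = cond_emb @ w_beta              # (batch, dim)
    # Broadcast the per-sample scale and shift across all tokens.
    return gamma[:, None, :] * layer_norm(h) + beta[:, None, :]
```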

Diffusion Transformer (DiT)

Peebles & Xie (2023) replaced the U-Net with a standard Vision Transformer. The key insight: transformers scale better with compute than convolutions. As you increase model size, DiT quality improves more smoothly and predictably.

Patchification: Split the image (or latent) into non-overlapping patches, flatten each patch into a vector, add positional embeddings. These patch tokens become the input sequence for the transformer — exactly like ViT for classification, but now for generation.
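Patchification is a pure reshape. The sketch below assumes a channels-last (H, W, C) array and omits the linear projection and positional embeddings that follow in a real model; `unpatchify` is the inverse used at the output to turn tokens back into a grid.

```python
import numpy as np

def patchify(img, p):
    """(H, W, C) array -> (num_patches, p*p*C) token matrix, patches in row-major order."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0, "patch size must divide height and width"
    x = img.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)

def unpatchify(tokens, H, W, C, p):
    """Inverse of patchify: (num_patches, p*p*C) -> (H, W, C)."""
    x = tokens.reshape(H // p, W // p, p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, p, W/p, p, C)
    return x.reshape(H, W, C)

latent = np.arange(32 * 32 * 4, dtype=np.float64).reshape(32, 32, 4)
tokens = patchify(latent, p=2)                     # 256 tokens of length 16
```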

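Conditioning injection: Timestep (and class label) via AdaLN-Zero: like AdaLN, but each block also predicts a gate on its residual branch, and the modulation layer is zero-initialized so every block starts as the identity function. The original DiT is class-conditional; text-to-image variants add text via cross-attention layers interleaved with self-attention.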

Why Transformers Win

U-Nets have inductive biases (locality, translation equivariance) that help at small scale but limit scaling. Transformers have no spatial inductive bias — they learn everything from data. This is worse at small scale (need more data) but better at large scale. Since modern image generation uses billions of training images and billions of parameters, the transformer's superior scaling wins.

MM-DiT: Joint Attention for Multi-Modal Inputs

Modern text-to-image models (FLUX, SD3) use MM-DiT (Multi-Modal DiT). Instead of separate self-attention for image tokens and cross-attention for text, MM-DiT concatenates all tokens (image patches + text tokens) into a single sequence and runs joint self-attention over everything. Each image patch can attend to every text token and vice versa. This is simpler and empirically better than the cross-attention design.
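The core of joint attention is a concatenation followed by ordinary self-attention. This single-head sketch shares one set of projection weights across both modalities for brevity, which is a simplifying assumption; it is not a description of how any particular model parameterizes its projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def joint_self_attention(img_tokens, txt_tokens, w_q, w_k, w_v):
    """Single-head self-attention over the concatenated image + text sequence.

    img_tokens: (N_img, d); txt_tokens: (N_txt, d); w_q, w_k, w_v: (d, d).
    Returns updated image tokens, updated text tokens, and the attention matrix.
    """
    seq = np.concatenate([img_tokens, txt_tokens], axis=0)   # (N_img + N_txt, d)
    q, k, v = seq @ w_q, seq @ w_k, seq @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))           # every token attends to every token
    out = attn @ v
    n_img = img_tokens.shape[0]
    return out[:n_img], out[n_img:], attn
```

The off-diagonal blocks of `attn` are where the modalities interact: `attn[:n_img, n_img:]` is image-to-text attention and `attn[n_img:, :n_img]` is text-to-image attention.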

Architecture Comparison

U-Net (LDM / Stable Diffusion 1.x): ~860M params. Conv encoder-decoder + self-attention at low resolutions + cross-attention for text. Pretrained at 256×256, then trained further at 512×512.

DiT-XL/2: ~675M params. 28 transformer blocks, patch size 2. Trained on ImageNet 256×256. Achieved FID 2.27 (SOTA at the time).

FLUX.1: ~12B params. MM-DiT with joint attention. T5-XXL + CLIP text encoders. Trained on billions of text-image pairs.

Chapter 07

Latent Diffusion Models

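Running diffusion in pixel space is expensive. A 512×512×3 image has 786,432 dimensions, and every denoising step processes all of them. Over 50 steps the network reads and writes about 39 million pixel values, and each step is a full forward pass that costs far more than one operation per value. Can we work in a smaller space?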

The Two-Stage Strategy

Stage 1: Compression. Train a VAE (variational autoencoder) to compress images into a low-dimensional latent space. The encoder maps a 256×256×3 image to a 32×32×16 latent. That's 64× fewer spatial positions (8× per side) and 12× fewer values overall (196,608 → 16,384). The decoder maps the latent back to the image.

Stage 2: Diffusion on latents. Train the diffusion model (U-Net or DiT) on the 32×32×16 latents, not the 256×256×3 images. Everything is identical (add noise to latents, predict velocity, denoise), but now each step processes only 16,384 dimensions instead of 196,608.

Common Settings (Stable Diffusion) Downsampling factor: D = 8 (256/8 = 32)
Latent channels: C = 4 (SD 1.x/2.x, SDXL) or C = 16 (SD3, FLUX)
Image: 256×256×3 → Latent: 32×32×C
Compression: ~48× fewer values (C=4) or ~12× (C=16)
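The compression figures follow from simple counting, sketched here with hypothetical helper names:

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Latent tensor shape for an image of the given size."""
    return (height // downsample, width // downsample, channels)

def num_values(shape):
    n = 1
    for s in shape:
        n *= s
    return n

def compression_ratio(height, width, downsample=8, channels=4):
    """How many RGB pixel values there are per latent value."""
    pixels = num_values((height, width, 3))
    latents = num_values(latent_shape(height, width, downsample, channels))
    return pixels / latents

ratio_c4 = compression_ratio(256, 256, channels=4)     # 196,608 / 4,096
ratio_c16 = compression_ratio(256, 256, channels=16)   # 196,608 / 16,384
```

More latent channels means less compression but better reconstruction fidelity, which is the tradeoff behind the move from C = 4 to C = 16.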

The VAE + GAN Decoder

The autoencoder isn't a plain VAE — its decoder is trained with a GAN discriminator (perceptual + adversarial loss). This is critical: pure MSE reconstruction produces blurry images; the discriminator forces the decoder to produce sharp, detailed outputs. The encoder has a mild KL penalty to keep the latent space smooth.

The Modern LDM = VAE + GAN + Diffusion

It's a three-part system: (1) A VAE encoder compresses images to latents. (2) A diffusion model generates in latent space. (3) A GAN-enhanced VAE decoder maps latents back to sharp images. Three of the four generative model families from Chapter 1, working together.

Why Latent Space Works: Two Types of Compression

Perceptual compression removes imperceptible high-frequency details (like JPEG). The VAE's first job: collapse the pixel space by removing information humans can't see. Semantic compression captures the meaningful content: objects, layout, style. The diffusion model operates on semantically-rich latents, not redundant pixels. This separation is why LDMs are both fast and high-quality.

The Full Pipeline

Latent Diffusion: Training

Stage 1 (one-time): Train autoencoder. Encoder E: image → latent. Decoder D: latent → image. Loss = reconstruction + KL + adversarial (GAN disc).
Stage 2: Freeze the autoencoder. Encode all training images: z = E(x).
Train diffusion on latents z using rectified flow (or DDPM). The denoiser f_θ(z_t, y, t) predicts velocity in latent space.

Latent Diffusion: Sampling

Sample z₁ ~ N(0, I) in latent space (32×32×C).
Denoise: Run T steps of Euler integration with CFG to get clean latent z₀.
Decode: x = D(z₀). Run the frozen VAE decoder to get the final image.

Worked Example — Stable Diffusion XL

Input: text prompt "a cyberpunk city at night, neon lights, rain." Text encoders: CLIP-ViT-L + OpenCLIP-ViT-bigG (dual encoders for richer text understanding). Latent size: 128×128×4 (for 1024×1024 output). U-Net denoiser: ~2.6B params. 50 Euler steps with CFG w=7.5. The entire generation: ~4 seconds on an A100 GPU.

Interactive: Latent Diffusion Pipeline

Watch the full LDM pipeline: Image → Encoder → Latent → Add Noise → Iterative Denoising → Clean Latent → Decoder → Image. Dimensions shown at each stage.

Chapter 08

Text-to-Image & Text-to-Video

Latent diffusion + CFG + DiT = the complete text-to-image stack. Let's trace the full pipeline from text prompt to pixels, then extend to video.

The Text-to-Image Pipeline

Step 1: Text encoding. The text prompt goes through one or more frozen text encoders. Common choices: CLIP (contrastive vision-language model, used here for a single pooled embedding) and T5-XXL (an encoder-decoder language model of which only the encoder is used, producing a sequence of token embeddings). Using both captures complementary information: CLIP gives global semantics, T5 gives fine-grained token-level detail.

Step 2: Diffusion generation. The text embeddings condition a DiT via cross-attention (or joint attention in MM-DiT). Starting from random noise in latent space, T denoising steps with CFG produce a clean latent.

Step 3: Decoding. The VAE decoder maps the latent to pixels.

FLUX.1 Architecture

Text encoders: T5-XXL (4.7B params, produces 256 token embeddings) + CLIP-ViT-L (produces 1 global embedding). DiT: 12B params, MM-DiT with joint attention, patch size 2, 8×8 latent downsampling. VAE: 16 latent channels, 8× spatial downsampling. Trained on billions of captioned images.

From Images to Video

The conceptual leap from image to video generation is smaller than you'd think. An image is a 2D grid of latents (h × w × c). A video is a 3D grid of latents (t × h × w × c), where t is the time dimension. The diffusion model operates on this 3D tensor.

Definition

Video Latent Space

A video VAE compresses both spatially and temporally. A 16-frame 256×256 video might compress to a 4×32×32×16 latent (4× temporal, 8× spatial downsampling). The diffusion model generates in this compressed 3D space, and the video decoder inflates it back to frames.

Temporal Attention: Factorized Design

Full 3D self-attention over (t × h × w) tokens is quadratic in the total count — prohibitively expensive for long videos. The standard solution: factorized attention.

Spatial attention: Within each frame, all (h × w) patches attend to each other. This captures spatial relationships (objects, layout).

Temporal attention: For each spatial position, the tokens across all T frames attend to each other. This captures motion and temporal coherence.

These alternate in the transformer blocks: spatial-attention → temporal-attention → spatial → temporal → ...
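The savings can be estimated by counting query-key pairs, which is what dominates attention cost. The sketch assumes one token per latent position and uses the 4×32×32 video latent from the definition above as the example.

```python
def full_attention_pairs(t, h, w):
    """Query-key pairs for full 3D attention over all t*h*w tokens."""
    n = t * h * w
    return n * n

def factorized_attention_pairs(t, h, w):
    """Pairs for one spatial layer plus one temporal layer."""
    spatial = t * (h * w) ** 2      # each of t frames attends within itself
    temporal = (h * w) * t ** 2     # each spatial position attends across frames
    return spatial + temporal

full = full_attention_pairs(4, 32, 32)              # 4096^2
factorized = factorized_attention_pairs(4, 32, 32)  # 4 * 1024^2 + 1024 * 4^2
```

With only 4 latent frames the saving is about 4×; it grows roughly linearly with the number of frames, which is why factorization matters most for long videos.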

Meta MovieGen (2024)

Scale: 30B parameter DiT. Latent space: 8×8×8 downsampling (spatial 8×, temporal 8×). A 16-second 768×768 video at 16fps = 256 frames → 32×96×96 latent (~295K latent positions) → ~73K tokens per sample after 2×2 spatial patchification. Training data: Hundreds of millions of video clips. Text conditioning: UL2 + ByT5 + Long-prompt MetaCLIP. Result: Photorealistic, temporally coherent videos from text prompts.

The Video Generation Timeline

Year	Model	Key Innovation
2024 Feb	Sora (OpenAI)	Spacetime patches, long coherent videos
2024 Jun	Gen-3 (Runway)	Commercial text-to-video
2024 Oct	MovieGen (Meta)	30B DiT, ~73K-token context, audio generation
2024 Dec	HunyuanVideo (Tencent)	Open-source, dual-stream DiT
2025	Cosmos (NVIDIA) / Wan (Alibaba)	World models, open weights

Current Limitations

Video models still struggle with: physics (objects passing through each other, gravity violations), counting (wrong number of fingers, objects), long-term coherence (character appearance drifting over time), and text rendering (garbled text on signs). These are active research areas.

Chapter 09

Generalized Diffusion & Connections

Rectified Flow is one specific choice. The general framework parameterizes the interpolation and prediction target with four functions:

Generalized Interpolation x_t = a(t) · x + b(t) · z
y_gt = c(t) · x + d(t) · z
x = data, z = noise, y_gt = prediction target

Different choices of a, b, c, d recover all the famous diffusion formulations:

Method	a(t)	b(t)	c(t)	d(t)	Predicts
Rectified Flow	1−t	t	−1	1	Velocity v = z−x
VP (DDPM)	√α(t)	√(1−α(t))	0	1	Noise ε
VE (Score SDE)	1	σ(t)	0	1	Noise ε
x-prediction	a(t)	b(t)	1	0	Clean data x
v-prediction	a(t)	b(t)	−b(t)	a(t)	Velocity v
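The table can be read as a family of 2×2 linear maps from (x, z) to (x_t, y_gt). The sketch below builds the training pair for a few rows and inverts the map, which shows why the prediction targets are interchangeable once x_t is known. Scalars stand in for images, and the function names are illustrative.

```python
import math

def make_example(a, b, c, d, x, z):
    """Return (x_t, y_gt) given coefficient values a, b, c, d already evaluated at some t."""
    return a * x + b * z, c * x + d * z

def rectified_flow_coeffs(t):
    return 1.0 - t, t, -1.0, 1.0

def vp_noise_coeffs(alpha):
    """VP (DDPM) with noise prediction; alpha is the value of α(t), in (0, 1)."""
    return math.sqrt(alpha), math.sqrt(1.0 - alpha), 0.0, 1.0

def v_prediction_coeffs(a, b):
    return a, b, -b, a

def recover_data_and_noise(x_t, y_gt, a, b, c, d):
    """Invert the linear system [x_t, y_gt] = [[a, b], [c, d]] [x, z]."""
    det = a * d - b * c
    x = (d * x_t - b * y_gt) / det
    z = (a * y_gt - c * x_t) / det
    return x, z

x, z, t = 2.0, -0.5, 0.3
coeffs = rectified_flow_coeffs(t)
x_t, v = make_example(*coeffs, x, z)            # v equals z - x
x_rec, z_rec = recover_data_and_noise(x_t, v, *coeffs)
```

The inversion requires a nonzero determinant; for x-prediction, for example, it fails where b(t) = 0, since at that point x_t carries no information about the noise.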

All Roads Lead to the Same Place

Despite the different parameterizations, all these methods learn to transport the noise distribution to the data distribution along some path. The differences are in: (1) the shape of the path (straight vs. curved), (2) what the network predicts (noise, data, velocity), (3) the resulting noise schedule. Given x_t, each prediction target is a linear recombination of the others, so with a perfectly trained network and enough sampling steps they all recover the same data distribution. In practice, rectified flow (straight paths, velocity prediction) tends to work best with fewer steps.

Diffusion as a Latent Variable Model

There's a beautiful connection to the VAE framework. Think of the entire forward trajectory {x₀, x_0.01, x_0.02, ..., x₁} as a hierarchy of latent variables. In discrete-time notation with T steps, the forward process is the "encoder" q(x_{1:T} | x₀) and the reverse process is the "decoder" p_θ(x_{0:T−1} | x_T). The ELBO decomposes into a sum of per-step denoising terms, each of which reduces to a weighted MSE regression loss of the same form as the noise or velocity prediction loss.

Autoregressive + Discrete Latents

An alternative to continuous diffusion: use a VQ-VAE (Vector-Quantized VAE) to encode images into a grid of discrete tokens, then train an autoregressive transformer to generate the tokens left-to-right, top-to-bottom. This is the approach behind DALL-E 1 (Ramesh et al. 2021) and recent models like LlamaGen.

VQ-VAE + Autoregressive Image → Encoder → Quantize to codebook tokens → [4, 12, 7, 31, ...]
Train transformer: p(token_i | token_1, ..., token_{i−1})
Generate: sample tokens autoregressively → Decoder → Image
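The quantization step in the middle of that pipeline is a nearest-neighbor lookup. This sketch covers only that step, with a tiny hand-written codebook for illustration; the encoder, decoder, and codebook training are omitted.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry.

    latents: (N, d) encoder outputs; codebook: (K, d) learned code vectors.
    Returns integer tokens of shape (N,) and the quantized vectors of shape (N, d).
    """
    # Squared Euclidean distance between every latent and every code: (N, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    tokens = d2.argmin(axis=1)
    return tokens, codebook[tokens]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
latents = np.array([[0.1, -0.1], [3.5, 4.2], [0.9, 1.2]])
tokens, quantized = quantize(latents, codebook)
```

The integer `tokens` are what the autoregressive transformer models; at generation time, sampled tokens are looked up in the codebook and passed to the decoder.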

Distillation: From 50 Steps to 1

The remaining weakness of diffusion: sampling speed. Several approaches compress the multi-step process:

Method	Steps	Key Idea
DDIM	10–50	Deterministic sampling, skip timesteps
Progressive Distillation	4–8	Student learns to combine 2 teacher steps into 1
Consistency Models	1–2	Map any point on trajectory directly to x₀
Adversarial Distillation	1–4	GAN loss forces student to produce sharp images in few steps

Consistency Models (Song et al. 2023)

Key idea: learn a function f_θ such that f_θ(x_t, t) = x₀ for all t along the same trajectory. In other words, no matter where you are on the noisy-to-clean path, the model predicts the endpoint directly. Self-consistency condition: f_θ(x_t, t) = f_θ(x_t', t') for any t, t' on the same trajectory. Training enforces this via a consistency loss between adjacent timesteps. Result: one-step generation with quality approaching 50-step diffusion.

Chapter 10

Summary & Connections

The Four Families Compared

Property	Autoregressive	VAE	GAN	Diffusion/Flow
Density	Exact	Lower bound	None	Score-based
Training	Stable (MLE)	Stable (ELBO)	Unstable (minimax)	Stable (MSE)
Sample quality	Good (text), Fair (images)	Blurry	Sharp	Best
Diversity	High	High	Low (mode collapse)	High
Speed	Slow (sequential)	Fast (one pass)	Fast (one pass)	Slow (multi-step)
Scalability	Excellent (GPT-4)	Limited	Poor	Excellent (FLUX, Sora)
Era	2016–present	2013–2018	2014–2021	2020–present

The Modern Generative Stack

Everything Combines

A modern text-to-image model like Stable Diffusion 3 is not one technique — it's five, stacked together:

(1) VAE encoder compresses images to latents. (2) Diffusion/Flow Matching generates in latent space. (3) DiT architecture parameterizes the denoiser. (4) CFG sharpens conditional generation. (5) Distillation reduces the number of sampling steps. Each piece was invented separately; together they're greater than the sum.

The Historical Arc

Era	Dominant Method	Milestone Models
2014–2017	GANs + VAEs	GAN, DCGAN, VAE
2017–2021	GANs	StyleGAN, BigGAN, StyleGAN2
2020–2022	Diffusion emergence	DDPM, Score SDE, ADM
2022–2023	Latent Diffusion	Stable Diffusion, DALL-E 2, Imagen
2023–present	DiT + Flow Matching	DiT, SD3, FLUX, Sora, MovieGen

Key Equations

GAN Minimax min_G max_D E[log D(x)] + E[log(1 − D(G(z)))]

Rectified Flow Loss L = ||f_θ((1−t)x + tz, t) − (z − x)||²

Classifier-Free Guidance v_cfg = (1 + w) · f_θ(x_t, y, t) − w · f_θ(x_t, ∅, t)

References

#	Paper
1	Goodfellow et al. "Generative Adversarial Nets." NeurIPS, 2014.
2	Radford et al. "Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks." ICLR, 2016.
3	Karras et al. "A Style-Based Generator Architecture for Generative Adversarial Networks." CVPR, 2019.
4	Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
5	Song et al. "Score-Based Generative Modeling through SDEs." ICLR, 2021.
6	Dhariwal & Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS, 2021.
7	Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022.
8	Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023.
9	Liu et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR, 2023.
10	Peebles & Xie. "Scalable Diffusion Models with Transformers." ICCV, 2023.
11	Song et al. "Consistency Models." ICML, 2023.
12	Ho & Salimans. "Classifier-Free Diffusion Guidance." NeurIPS Workshop, 2022.
13	Ramesh et al. "Zero-Shot Text-to-Image Generation." ICML, 2021.
14	Polyak et al. "Movie Gen: A Cast of Media Foundation Models." Meta, 2024.
15	Esser et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024.
16	Black Forest Labs. "FLUX.1: Open-Weight Flow Matching Models." 2024.

The One Sentence

Modern image and video generation = compress (VAE), denoise (diffusion), condition (CFG), and scale (transformers). Everything else is optimization.
