Stanford CS 231n · Lecture 14 · Generative Models (Part 2)

From GANs to Diffusion and Beyond

Part 1 gave us autoregressive models and VAEs. Now the heavyweights: adversarial training, denoising diffusion, and the modern latent diffusion stack that powers DALL-E, Stable Diffusion, and Sora.

GANs & StyleGAN · Diffusion & Flow Matching · Classifier-Free Guidance · Latent Diffusion & DiT · Text-to-Image/Video

Chapter 01

The Generative Model Landscape

Every generative model answers one question: given some training images, how do we produce new images that look like they came from the same distribution? The approaches differ in how explicitly they model the data distribution p(x).

The Taxonomy

We can split generative models along a single axis: does the model give you an explicit density function, or does it just give you samples?

| Category | Density | Examples |
|---|---|---|
| Explicit, tractable | You can compute $p_\theta(x)$ exactly | Autoregressive (PixelCNN, GPT), Normalizing Flows |
| Explicit, approximate | You optimize a bound on $p_\theta(x)$ | VAEs (ELBO) |
| Implicit, direct | No density; just a sampler | GANs |
| Implicit, indirect | Learns a score/velocity field | Diffusion, Flow Matching |

Quick Review from Part 1

Autoregressive models factor the joint distribution into a product of conditionals. PixelCNN generates images one pixel at a time, left-to-right, top-to-bottom. GPT generates text one token at a time. The key equation:

Autoregressive Factorization
$$p_\theta(x) = \prod_{i=1}^{n} p_\theta(x_i \mid x_1, \ldots, x_{i-1})$$

Strengths: exact likelihood, stable training, no mode collapse. Weakness: sequential generation is painfully slow for images (one pixel at a time).
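To make the factorization concrete, here is a minimal sketch on binary sequences. The conditional `cond_prob_one` is a hypothetical toy model (a logistic function of the prefix sum), not PixelCNN or GPT; the point is only that the chain rule yields an exact, normalized density and that sampling is inherently sequential.

```python
import itertools
import math
import random

def cond_prob_one(prefix, w=0.8, b=-0.3):
    """Hypothetical toy conditional p(x_i = 1 | x_1..x_{i-1})."""
    return 1.0 / (1.0 + math.exp(-(w * sum(prefix) + b)))

def log_prob(x):
    """Exact log p(x): the sum of the log conditionals (chain rule)."""
    total = 0.0
    for i, xi in enumerate(x):
        p1 = cond_prob_one(x[:i])
        total += math.log(p1 if xi == 1 else 1.0 - p1)
    return total

def sample(n, rng):
    """Sequential generation: each element needs all previous ones."""
    x = []
    for _ in range(n):
        x.append(1 if rng.random() < cond_prob_one(x) else 0)
    return x

# The density is exactly normalized: probabilities of all 2^4 sequences sum to 1.
total_mass = sum(math.exp(log_prob(x)) for x in itertools.product([0, 1], repeat=4))
one_sample = sample(4, random.Random(0))
```

The loop in `sample` is the reason image generation is slow here: a 256×256 image would need 65,536 sequential conditional evaluations per channel.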

Variational Autoencoders (VAEs) introduce a latent variable z and optimize a lower bound on log p(x):

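ELBO (Evidence Lower Bound)
$$\log p(x) \ge \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p(z)\right)$$
First term: reconstruction. Second term: regularization.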

Strengths: fast sampling (just decode a random z), smooth latent space. Weakness: blurry outputs because the decoder must hedge across all plausible images consistent with z.
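As a small illustration of the regularization term, the sketch below assumes the common choice of a diagonal-Gaussian encoder and a standard-normal prior, for which the KL divergence has a closed form. The function names are hypothetical, and the reconstruction term is omitted because it depends on the decoder.

```python
import numpy as np

def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Draw z ~ q(z|x) as mu + sigma * eps so gradients can flow through mu and sigma."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.0])
log_var = np.zeros(2)
kl = gaussian_kl_to_standard_normal(mu, log_var)   # only the shifted mean contributes
z = reparameterize(mu, log_var, rng)
```

The KL term is zero exactly when the encoder outputs the prior itself, which is why it pulls the latent space toward a smooth, sampleable shape.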

What's New Today

Part 1 gave us density-based models (autoregressive, VAE). Today we cover the rest: GANs (implicit, adversarial), Diffusion (score-based, iterative denoising), Latent Diffusion (the modern hybrid), and the text-to-image/video pipelines that combine everything.

Chapter 02

Generative Adversarial Networks

Here's a completely different approach. Forget about modeling p(x) explicitly. Instead, train a neural network to generate samples that are so realistic a second neural network can't tell them from real data. This is the GAN idea (Goodfellow et al., 2014).

The Setup

Start with a simple distribution we can easily sample from — a standard Gaussian z ~ N(0, I). The generator G is a neural network that maps z → x: it takes random noise and produces an image. The discriminator D is a second neural network that takes an image and outputs a probability: "is this image real or fake?"

Definition
Generator G(z)

A neural network that maps latent noise z ~ p(z) to a synthetic image. Its goal: produce outputs that the discriminator classifies as real. G defines an implicit distribution pG(x) — the distribution of images it generates.

Definition
Discriminator D(x)

A neural network that outputs D(x) ∈ [0, 1], the probability that x is a real image. D is trained to output 1 for real data and 0 for generated (fake) data.

The Minimax Objective

Training is a two-player game. D wants to correctly classify real vs. fake. G wants to fool D. Formally:

GAN Minimax Objective
$$\min_G \max_D V(G, D) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log D(x)\right] + \mathbb{E}_{z \sim p(z)}\left[\log(1 - D(G(z)))\right]$$

Read the two terms. Term 1: for real images, D wants D(x) ≈ 1, so log D(x) ≈ 0, its largest possible value (log D(x) ≤ 0 always). Term 2: for fake images G(z), D wants D(G(z)) ≈ 0, so log(1 − D(G(z))) ≈ 0, again its maximum. Meanwhile, G wants D(G(z)) ≈ 1 (fool the discriminator), which drives log(1 − D(G(z))) → −∞. D maximizes V; G minimizes V.

Training: Alternating Gradient Steps

GAN Training (One Iteration)
  1. Update D (ascend on V): Sample a minibatch {x_1, ..., x_m} from data and {z_1, ..., z_m} from p(z). Compute the gradient $\nabla_{\theta_D} \frac{1}{m} \sum_{i=1}^{m} \left[\log D(x_i) + \log(1 - D(G(z_i)))\right]$. Step $\theta_D$ upward.
  2. Update G (descend on V): Sample {z_1, ..., z_m} from p(z). Compute the gradient $\nabla_{\theta_G} \frac{1}{m} \sum_{i=1}^{m} \log(1 - D(G(z_i)))$. Step $\theta_G$ downward.
  3. Repeat.
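The alternating steps above can be sketched on a 1-D toy problem. Everything here is an assumption chosen so the gradients can be written by hand: the generator is affine, G(z) = a·z + b, the discriminator is logistic, D(x) = sigmoid(w·x + c), and the real data is N(2, 1). It shows the update pattern only and makes no claim about convergence.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def train_toy_gan(steps=500, m=128, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    w, c = 0.1, 0.0   # discriminator parameters: D(x) = sigmoid(w*x + c)
    a, b = 1.0, 0.0   # generator parameters:     G(z) = a*z + b
    for _ in range(steps):
        # 1. Discriminator: gradient ASCENT on V.
        x_real = rng.normal(2.0, 1.0, size=m)
        x_fake = a * rng.standard_normal(m) + b
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        # d/ds log(sigmoid(s)) = 1 - sigmoid(s);  d/ds log(1 - sigmoid(s)) = -sigmoid(s)
        grad_w = np.mean((1.0 - d_real) * x_real) - np.mean(d_fake * x_fake)
        grad_c = np.mean(1.0 - d_real) - np.mean(d_fake)
        w += lr * grad_w
        c += lr * grad_c

        # 2. Generator: gradient DESCENT on mean log(1 - D(G(z))).
        z = rng.standard_normal(m)
        d_fake = sigmoid(w * (a * z + b) + c)
        dv_dx = -d_fake * w                 # derivative w.r.t. each fake sample
        a -= lr * np.mean(dv_dx * z)
        b -= lr * np.mean(dv_dx)
    return a, b, w, c

a, b, w, c = train_toy_gan()
```

Because this discriminator is linear in x, it can only react to a shift in the mean, so the sketch is far simpler than a real image GAN; it is meant to show the two-step update, nothing more.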

The Gradient Problem & the Non-Saturating Fix

Early in training, G produces garbage. D easily classifies everything correctly: D(G(z)) ≈ 0. Write D = σ(a), where a is the discriminator's output logit. The derivative of log(1 − σ(a)) with respect to a is −σ(a) = −D(G(z)), which is essentially zero when D(G(z)) ≈ 0. Every gradient that reaches G's parameters passes through this factor, so G receives almost no learning signal.

Vanishing Generator Gradient

When D is confident that G's outputs are fake, $\partial/\partial\theta_G \, \log(1 - D(G(z))) \approx 0$. The generator is stuck: its samples are bad, but it gets almost no gradient to improve them. This objective is called the saturating loss.

The fix: instead of minimizing log(1 − D(G(z))), train G to maximize log D(G(z)). Same fixed points, but the gradient is large when D(G(z)) ≈ 0. This is the non-saturating GAN loss used in practice:

Non-Saturating Generator Loss
$$L_G = -\mathbb{E}_{z \sim p(z)}\left[\log D(G(z))\right]$$
G wants to maximize D(G(z)): "fool the discriminator"
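A quick numerical check of the difference between the two generator losses, assuming the discriminator ends in a sigmoid so that D = sigmoid(a) for a logit a. The helper name is hypothetical; the derivatives are the standard closed forms.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def generator_grads_wrt_logit(a):
    """Gradients of both generator losses with respect to the discriminator logit a."""
    d = sigmoid(a)
    saturating = -d               # d/da  log(1 - sigmoid(a))
    non_saturating = -(1.0 - d)   # d/da  [-log sigmoid(a)]
    return saturating, non_saturating

# Early training: D is confident the sample is fake, e.g. logit a = -6 (D ~ 0.0025).
sat, non_sat = generator_grads_wrt_logit(-6.0)
```

At a = −6 the saturating gradient has magnitude of about 0.0025 while the non-saturating one is close to 1, so the generator gets a usable signal exactly when it is doing worst.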

The Optimal Discriminator

For a fixed G, what discriminator maximizes V(G, D)? We can solve this analytically.

Derivation — Optimal Discriminator

For each x, V contains the integrand $p_{\text{data}}(x) \log D(x) + p_G(x) \log(1 - D(x))$. Take the derivative with respect to D(x) and set it to zero:

$$\frac{p_{\text{data}}(x)}{D(x)} - \frac{p_G(x)}{1 - D(x)} = 0$$

Solving: $$D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}$$

When pG = pdata, D*(x) = 1/2 everywhere. The discriminator is maximally confused — it can't tell real from fake. This is the equilibrium.

GANs Minimize Jensen–Shannon Divergence

Plug D* back into V(G, D*) and simplify:

Derivation — V(G, D*) = 2 · JSD − 2 log 2

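$$V(G, D^*) = \mathbb{E}_{x \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_G(x)}\right] + \mathbb{E}_{x \sim p_G}\left[\log \frac{p_G(x)}{p_{\text{data}}(x) + p_G(x)}\right]$$

Let $M = (p_{\text{data}} + p_G)/2$. Then:

$$V(G, D^*) = \mathbb{E}_{p_{\text{data}}}\left[\log \frac{p_{\text{data}}}{2M}\right] + \mathbb{E}_{p_G}\left[\log \frac{p_G}{2M}\right]$$

$$= D_{\mathrm{KL}}(p_{\text{data}} \,\|\, M) + D_{\mathrm{KL}}(p_G \,\|\, M) - 2 \log 2$$

$$= 2 \cdot \mathrm{JSD}(p_{\text{data}} \,\|\, p_G) - 2 \log 2$$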

Since JSD ≥ 0 with equality iff pdata = pG, the global minimum of V(G, D*) = −2 log 2, achieved when the generator perfectly matches the data distribution.
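The identity can be checked numerically on small discrete distributions. This is a sanity-check sketch with made-up probability vectors (all entries strictly positive so the logarithms are defined); the function names are illustrative.

```python
import numpy as np

def kl(p, q):
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def value_at_optimal_d(p_data, p_g):
    """V(G, D*) with D*(x) = p_data(x) / (p_data(x) + p_g(x))."""
    d_star = p_data / (p_data + p_g)
    return float(np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star)))

p_data = np.array([0.1, 0.6, 0.3])
p_g = np.array([0.3, 0.3, 0.4])

lhs = value_at_optimal_d(p_data, p_g)
rhs = 2.0 * jsd(p_data, p_g) - 2.0 * np.log(2.0)
at_equilibrium = value_at_optimal_d(p_data, p_data)   # generator matches the data
```

`lhs` and `rhs` agree to floating-point precision, and `at_equilibrium` equals −2 log 2, the global minimum stated above.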

No Loss Curve to Monitor

Because D and G are adversaries, neither loss decreases monotonically. D's loss goes up when G improves; G's loss goes up when D improves. There is no single number you can watch to know if training is succeeding. You have to look at generated samples. This makes GAN training notoriously tricky.

DC-GAN & StyleGAN

DC-GAN (Radford et al., 2015) established the convolutional architecture recipe: fractional-strided convolutions in G (upsampling), strided convolutions in D (downsampling), batch normalization in both networks (except at the generator's output layer and the discriminator's input layer), ReLU in G with a tanh output, LeakyReLU in D. No fully connected hidden layers beyond the initial projection of z.

StyleGAN (Karras et al., 2019) was the pinnacle of GAN image quality. Three key ingredients: (1) a mapping network that transforms z into an intermediate latent w, giving a more disentangled space; (2) Adaptive Instance Normalization (AdaIN), which injects w at each resolution layer and so controls "style" at different scales; (3) progressive growing, inherited from Progressive GAN (Karras et al., 2018): train at 4×4, then 8×8, gradually increasing resolution.

Latent space interpolation: Walk smoothly between two z vectors and the generated images morph smoothly — a man gains a smile, sunglasses appear gradually. This smooth latent space is a hallmark of well-trained GANs.

Interactive: GAN Training in 1D
The generator (gold) tries to match the real data distribution (blue). The discriminator output D(x) is shown in green. Watch alternating training steps reshape both.

GAN Summary

| Pros | Cons |
|---|---|
| Beautiful, sharp samples | No density estimation (can't compute p(x)) |
| Fast single-pass generation | Training instability (mode collapse, oscillation) |
| Smooth, interpolable latent space | No loss curve to monitor convergence |
| Conceptually elegant (game theory) | Hyperparameter-sensitive, hard to scale |

The Rise and Fall

GANs dominated image generation from 2014 to 2021. StyleGAN produced some of the most photorealistic faces of its era. But GANs never fully solved their core problems: mode collapse, training instability, and difficulty scaling to diverse, multi-modal datasets. Diffusion models largely sidestep all three, and by 2022 GANs had been mostly superseded for general-purpose image generation.

Chapter 03

Diffusion Models — The Core Intuition

Forget everything about adversarial training. Diffusion models take a completely different approach, and it starts with a beautifully simple idea.

The Two Processes

Pick any noise distribution pnoise — a standard Gaussian works perfectly. Now imagine two processes:

Forward process (corruption): Take a real image x and gradually add noise to it over "time" t ∈ [0, 1]. At t = 0, it's the clean image. At t = 1, it's pure noise. At any t in between, it's a noisy version of the image — call it xt.

Reverse process (denoising): Start from pure noise x1 ~ pnoise and gradually remove the noise, stepping backward from t = 1 to t = 0. If you can do this perfectly, you've turned random noise into a realistic image.

The Key Insight

Corrupting data is trivial — just add noise. The hard part is removing noise. But here's the trick: if you have pairs of (noisy image, clean image), you can train a neural network to predict how to denoise. That's it. Train a denoiser, then run it iteratively to generate new images from random noise.

Why This Works: Score Functions

There's a deep theoretical justification. The score function of a distribution p(x) is defined as:

Score Function s(x) = ∇x log p(x)

This is a vector field: at every point x, the score points in the direction of increasing probability density. If you're at a low-density region, the score tells you which direction to walk to reach higher-density areas — toward the data.

The remarkable connection: the optimal denoiser is intimately related to the score function. When you train a network to remove noise from corrupted data, you're implicitly learning the score ∇x log p(xt) at each noise level t. This is called score matching (Hyvarinen 2005, Song & Ermon 2019).

Why Denoising = Score Estimation

Consider data corrupted by Gaussian noise: $x_t = x + \sigma_t \varepsilon$ where $\varepsilon \sim N(0, I)$. The optimal denoiser satisfies $\mathbb{E}[x \mid x_t] = x_t + \sigma_t^2 \nabla_{x_t} \log p(x_t)$. So predicting the noise ε is equivalent to estimating the score $\nabla_{x_t} \log p(x_t)$, up to scaling. Tweedie's formula makes this precise.
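Tweedie's formula can be verified in closed form when the data itself is Gaussian. The sketch below assumes 1-D data x ~ N(mu, s²), so the noisy marginal is N(mu, s² + σ_t²) and its score is known exactly; the posterior mean computed through the score should match ordinary Gaussian conditioning. Names and parameter values are illustrative.

```python
import numpy as np

def posterior_mean_via_tweedie(x_t, mu, s, sigma_t):
    """E[x | x_t] = x_t + sigma_t^2 * score(x_t), using the exact score of N(mu, s^2 + sigma_t^2)."""
    score = -(x_t - mu) / (s ** 2 + sigma_t ** 2)
    return x_t + sigma_t ** 2 * score

def posterior_mean_direct(x_t, mu, s, sigma_t):
    """E[x | x_t] from standard Gaussian conditioning (precision-weighted average)."""
    return (s ** 2 * x_t + sigma_t ** 2 * mu) / (s ** 2 + sigma_t ** 2)

x_t = np.linspace(-4.0, 8.0, 25)
via_score = posterior_mean_via_tweedie(x_t, mu=3.0, s=0.5, sigma_t=1.2)
direct = posterior_mean_direct(x_t, mu=3.0, s=0.5, sigma_t=1.2)
```

The two arrays coincide, which is the Gaussian special case of "a perfect denoiser knows the score."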

The Training-Inference Split

Training: Sample a clean image x from the dataset. Sample a noise level t ~ Uniform(0, 1). Create the noisy version xt. Train the network fθ(xt, t) to predict how to clean it up (predict the noise, predict the clean image, or predict a velocity — more on this in Chapter 4).

Inference: Sample $x_1 \sim N(0, I)$. Run $f_\theta$ iteratively: x_{0.99} = denoise(x_1, t=1), x_{0.98} = denoise(x_{0.99}, t=0.99), ... until you reach x_0. Each step removes a little noise. After enough steps, you have a clean, realistic image.

Cost of Iteration

Unlike GANs (one forward pass to generate), diffusion models require many forward passes (50–1000 steps). Each step is a full neural network evaluation. This makes generation slower. The entire field of diffusion acceleration (distillation, consistency models, few-step methods) exists to solve this.

Chapter 04

Rectified Flow — Clean Modern Diffusion

The original diffusion formulation (DDPM, Ho et al. 2020) uses a complex noise schedule and variance-preserving SDE. Modern practice uses a much cleaner formulation called Rectified Flow (Liu et al. 2022) or Flow Matching (Lipman et al. 2022). The core training loop is just a few lines of code.

The Straight-Line Interpolation

The idea: connect each data point x to a noise point z with a straight line. At time t ∈ [0, 1], the interpolated point is:

Linear Interpolation
$$x_t = (1 - t)\, x + t\, z$$
At t=0: clean data x. At t=1: pure noise z.

The velocity along this straight line is constant:

Target Velocity
$$v = \frac{dx_t}{dt} = z - x$$
Direction from data to noise (flip the sign for noise to data)

Training

Train a neural network fθ(xt, t) to predict this velocity:

Rectified Flow Training
  1. Sample x ~ pdata (a training image)
  2. Sample z ~ N(0, I) (random noise)
  3. Sample t ~ Uniform(0, 1)
  4. Interpolate: $x_t = (1 - t)\, x + t\, z$
  5. Target: $v = z - x$
  6. Loss: $L = \| f_\theta(x_t, t) - v \|^2$
  7. Gradient step on θ to minimize L
Astounding Simplicity

This is the entire training algorithm. No complex noise schedules. No forward/reverse SDE. No KL divergence. Just: mix data with noise, predict the velocity, regress with MSE. The power of diffusion models comes from scale (big networks, big datasets), not algorithmic complexity.
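The seven training steps can be sketched end to end in NumPy. To keep gradients hand-written, the "network" here is an assumption: a linear model over four hand-picked features of (x_t, t), trained on hypothetical 1-D data drawn from N(3, 0.5²). A real model would be a U-Net or transformer trained with an autodiff framework.

```python
import numpy as np

def features(x_t, t):
    # Hand-made feature map standing in for a neural network (an assumption).
    return np.concatenate([x_t, t, x_t * t, np.ones_like(t)], axis=1)

def train_toy_rectified_flow(steps=2000, batch=256, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    W = np.zeros((4, 1))                              # parameters theta
    losses = []
    for _ in range(steps):
        x = rng.normal(3.0, 0.5, size=(batch, 1))     # 1. x ~ p_data (toy data)
        z = rng.standard_normal((batch, 1))           # 2. z ~ N(0, I)
        t = rng.uniform(size=(batch, 1))              # 3. t ~ Uniform(0, 1)
        x_t = (1.0 - t) * x + t * z                   # 4. interpolate
        v = z - x                                     # 5. target velocity
        phi = features(x_t, t)
        err = phi @ W - v                             #    f_theta(x_t, t) - v
        losses.append(float(np.mean(err ** 2)))       # 6. MSE loss
        grad = 2.0 * (phi.T @ err) / batch
        W -= lr * grad                                # 7. gradient step
    return W, losses

W, losses = train_toy_rectified_flow()
```

Steps 1 to 5 are identical for any architecture; only the last three lines of the loop change when `features` and `W` are replaced by a real network and optimizer.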

Sampling: Euler Integration

At inference, we want to go from noise (t = 1) to data (t = 0). We follow the learned velocity field backward:

Rectified Flow Sampling (Euler Method)
  1. Sample $x_1 \sim N(0, I)$
  2. Choose T steps (e.g., T = 50). Set Δt = 1/T.
  3. For t = 1, 1−Δt, 1−2Δt, ..., Δt:
      Compute $v_t = f_\theta(x_t, t)$
      Step: $x_{t - \Delta t} = x_t - v_t \cdot \Delta t$
  4. Return $x_0$ (the generated image)

This is just Euler integration of the ODE dx/dt = f_θ(x, t), run backward in time from t = 1 to t = 0 (hence the minus sign in the update). More steps means more accurate integration and better samples; fewer steps is faster but accumulates more discretization error.
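A minimal Euler sampler, written against any velocity function. To make it runnable without a trained network, the sketch plugs in the exact velocity field for a hypothetical 1-D Gaussian dataset N(MU, S²), where E[z − x | x_t] has a closed form from Gaussian conditioning. That closed form is an assumption of this toy setup, not something available for images.

```python
import numpy as np

MU, S = 3.0, 0.5   # hypothetical toy data distribution N(MU, S^2)

def exact_velocity(x_t, t):
    """E[z - x | x_t] for x ~ N(MU, S^2), z ~ N(0, 1), x_t = (1 - t) x + t z."""
    var_t = (1.0 - t) ** 2 * S ** 2 + t ** 2   # Var(x_t)
    cov = t - (1.0 - t) * S ** 2               # Cov(z - x, x_t)
    return -MU + cov / var_t * (x_t - (1.0 - t) * MU)

def euler_sample(velocity_fn, n_samples, num_steps, rng):
    x = rng.standard_normal(n_samples)         # x_1 ~ N(0, 1)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = 1.0 - k * dt                       # t = 1, 1 - dt, ..., dt
        x = x - velocity_fn(x, t) * dt         # step from t to t - dt
    return x                                   # approximate samples of x_0

rng = np.random.default_rng(0)
samples = euler_sample(exact_velocity, n_samples=20000, num_steps=100, rng=rng)
```

With the exact field, the samples end up with mean close to MU and standard deviation close to S; replacing `exact_velocity` with a learned f_θ gives the sampler described in the steps above.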

Why Straight Flows Are Better

Earlier diffusion models (DDPM, VP-SDE) use curved paths between data and noise. Curved paths require more integration steps to follow accurately — small errors in the Euler step accumulate along curves. Rectified Flow uses straight lines, which the Euler method can follow with fewer steps. Lipman et al. (2022) showed that flow matching with optimal transport couplings produces even straighter paths, enabling 10–20 step generation with minimal quality loss.

Distillation: Fewer Steps

Even 50 steps is slow for real-time applications. Distillation trains a student model to jump directly from noise to data in fewer steps (4, 2, or even 1 step). The teacher model runs the full sampling trajectory; the student learns to shortcut it. Consistency models (Song et al. 2023) learn to map any point on the trajectory directly to x0, enabling one-step generation.

Interactive: Rectified Flow Sampling
Noise points (right, red) flow along straight lines toward data points (left, gold). Euler integration with adjustable steps. Fewer steps = cruder paths.
Chapter 05

Classifier-Free Guidance

So far, our diffusion model generates random images from the training distribution. But we want conditional generation: "generate a photo of a golden retriever" or "generate a sunset over mountains." How do we steer the diffusion process toward a specific condition y?

Conditional Rectified Flow

The simplest approach: make the network condition-aware. Instead of fθ(xt, t), train fθ(xt, y, t) where y is a class label, text embedding, or any conditioning signal. The training loop is identical; you just feed y as additional input.

This works, but the samples are "okay" — diverse but not strongly aligned with the condition. We want sharper conditioning: images that clearly match the prompt, even at the cost of some diversity.

The Trick: Random Dropout of Conditioning

During training, randomly replace y with a null token ∅ (empty conditioning) some fraction of the time (e.g., 10%). This means the same network learns both:

Unconditional generation: f_θ(x_t, ∅, t), "generate any image"

Conditional generation: f_θ(x_t, y, t), "generate an image matching y"

Guidance at Inference

At sampling time, compute both velocities and combine them:

Classifier-Free Guidance
$v = f_\theta(x_t, \varnothing, t)$   (unconditional velocity)
$v_y = f_\theta(x_t, y, t)$   (conditional velocity)
$v_{\text{cfg}} = (1 + w) \cdot v_y - w \cdot v$
w = guidance weight. w = 0 gives plain conditional sampling; w > 0 amplifies the conditioning.
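The guidance rule is a one-line combination of two network outputs. The sketch below uses made-up 2-D velocity vectors in place of real model outputs and checks that the formula matches its "extra push" form v_y + w·(v_y − v). The function name is illustrative.

```python
import numpy as np

def cfg_velocity(v_cond, v_uncond, w):
    """Classifier-free guidance: (1 + w) * v_y - w * v."""
    return (1.0 + w) * v_cond - w * v_uncond

# Stand-ins for f_theta(x_t, y, t) and f_theta(x_t, null, t) at one point x_t.
v_cond = np.array([1.0, 0.5])
v_uncond = np.array([0.6, 0.7])

plain = cfg_velocity(v_cond, v_uncond, w=0.0)      # equals v_cond
guided = cfg_velocity(v_cond, v_uncond, w=3.0)
push_form = v_cond + 3.0 * (v_cond - v_uncond)     # same vector, written as an extra push
```

In a sampler, `guided` simply replaces the raw network output in each Euler step, at the cost of evaluating the network twice per step.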

Why This Works

The Geometry of Guidance

Think of v as pointing toward "any plausible image" and vy as pointing toward "images matching condition y." The difference (vy − v) is the direction that makes the image more y-like. CFG adds an extra push in that direction:

vcfg = vy + w · (vy − v)

When w = 0, you get plain conditional sampling. When w > 0, you amplify whatever makes the image specifically match y, pushing samples toward higher p(y|x). In score-function terms, the guided model follows ∇_x log p(x_t) + (1+w) · ∇_x log p(y|x_t): the ordinary score plus an up-weighted classifier term.

Worked Example — Effect of Guidance Weight

Prompt: "a photo of a cat." At w = 0: diverse images, some clearly cats, some ambiguous. At w = 3: all images are unambiguously cats, sharp and detailed, but less variety (similar poses/angles). At w = 10: even more cat-like but oversaturated colors, artifacts appear. At w = 20: extreme artifacts, the image is "too much cat."

Typical production values: w = 3–7 for text-to-image. There's a quality-diversity tradeoff: higher w = more aligned with the prompt but less diverse and potentially lower quality.

Why "Classifier-Free"?

Earlier work (Dhariwal & Nichol, 2021) used a separately-trained classifier p(y|x) to guide diffusion. This required training an extra model and running it at every denoising step. CFG eliminates the classifier entirely — the single diffusion model serves as both the unconditional and conditional generator. The name "classifier-free" distinguishes it from the earlier "classifier-guided" approach.

CFG Doubles Cost

Every sampling step requires two forward passes: one unconditional (v) and one conditional (vy). This doubles the compute per step. Some methods batch the two passes together to use GPU parallelism, but the FLOPs still double.

Interactive: Classifier-Free Guidance Vectors
From a point xt, see how the unconditional (gray), conditional (blue), and guided (gold) velocity vectors change with guidance weight w.
Chapter 06

The Network: U-Net to DiT

The velocity predictor fθ(xt, y, t) needs a neural network architecture. Two families have dominated: the U-Net (inherited from image segmentation) and the Diffusion Transformer (DiT) (adapted from the vision transformer).

U-Net Architecture

The U-Net has an encoder-decoder structure with skip connections:

Encoder: Stack of downsampling convolutional blocks. Resolution halves at each level (64→32→16→8). The network sees the image at multiple scales.

Decoder: Stack of upsampling blocks. Resolution doubles at each level (8→16→32→64). Skip connections concatenate encoder features at matching resolutions, preserving fine details.

Timestep injection: The timestep t is embedded via sinusoidal positional encoding (like transformers), then injected into each block via scale-and-shift (also called FiLM conditioning): given embedding e(t), compute scale γ and shift β, then transform features as γ · h + β. This is called Adaptive Layer Normalization (AdaLN).

Text conditioning: Text embeddings from CLIP or T5 are injected via cross-attention layers. At each resolution, the image features attend to the text token sequence, allowing each spatial location to "look at" relevant words in the prompt.

Definition
AdaLN (Adaptive Layer Normalization)

Instead of fixed scale/shift in layer norm, predict them from conditioning: γ, β = MLP(e(t)). Apply as: AdaLN(h) = γ · LayerNorm(h) + β. This modulates the network's behavior based on the current timestep (and optionally class label).
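A minimal NumPy sketch of the AdaLN definition. The text's MLP is replaced here by a single linear layer (weights `W`, bias `b`), which is an assumption made for brevity, and all shapes and names are illustrative.

```python
import numpy as np

def layer_norm(h, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance (no learned affine)."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)

def ada_ln(h, cond_emb, W, b):
    """h: (tokens, d). cond_emb: (d_c,). W: (d_c, 2d), b: (2d,) map the embedding to [gamma, beta]."""
    gamma, beta = np.split(cond_emb @ W + b, 2)
    return gamma * layer_norm(h) + beta

rng = np.random.default_rng(0)
d, d_c = 4, 3
h = rng.standard_normal((5, d))
e_t = rng.standard_normal(d_c)                       # stand-in for the timestep embedding e(t)

# With W = 0 and b = [1,...,1, 0,...,0], gamma = 1 and beta = 0: plain LayerNorm.
W0 = np.zeros((d_c, 2 * d))
b0 = np.concatenate([np.ones(d), np.zeros(d)])
out_identity = ada_ln(h, e_t, W0, b0)

W = 0.1 * rng.standard_normal((d_c, 2 * d))
out_modulated = ada_ln(h, e_t, W, b0)                # scale/shift now depend on e_t
```

The same feature map `h` produces different outputs for different `e_t`, which is how one set of weights behaves differently at each noise level.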

Diffusion Transformer (DiT)

Peebles & Xie (2023) replaced the U-Net with a standard Vision Transformer. The key insight: transformers scale better with compute than convolutions. As you increase model size, DiT quality improves more smoothly and predictably.

Patchification: Split the image (or latent) into non-overlapping patches, flatten each patch into a vector, add positional embeddings. These patch tokens become the input sequence for the transformer — exactly like ViT for classification, but now for generation.
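Patchification is a pure reshape. The sketch below assumes a channels-last latent of shape (H, W, C) and a square patch size p; positional embeddings and the linear projection to the model width are left out.

```python
import numpy as np

def patchify(x, p):
    """Split x of shape (H, W, C) into non-overlapping p x p patches.

    Returns an array of shape (H/p * W/p, p * p * C): one flattened vector per patch,
    ordered row by row across the image.
    """
    H, W, C = x.shape
    assert H % p == 0 and W % p == 0, "patch size must divide height and width"
    x = x.reshape(H // p, p, W // p, p, C)
    x = x.transpose(0, 2, 1, 3, 4)                 # (H/p, W/p, p, p, C)
    return x.reshape((H // p) * (W // p), p * p * C)

latent = np.arange(32 * 32 * 4, dtype=np.float64).reshape(32, 32, 4)
tokens = patchify(latent, p=2)                     # 16 x 16 = 256 tokens of length 16
```

For a 32×32×4 latent with patch size 2 this gives 256 tokens, which is the sequence length a DiT-XL/2-style model would see for a 256×256 image under 8× VAE downsampling.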

Conditioning injection: Timestep via AdaLN-Zero (like AdaLN, plus a learned gate on each residual branch that is initialized to zero, so every transformer block starts out as the identity function). Text via cross-attention layers interleaved with self-attention.

Why Transformers Win

U-Nets have inductive biases (locality, translation equivariance) that help at small scale but limit scaling. Transformers have no spatial inductive bias — they learn everything from data. This is worse at small scale (need more data) but better at large scale. Since modern image generation uses billions of training images and billions of parameters, the transformer's superior scaling wins.

MM-DiT: Joint Attention for Multi-Modal Inputs

Modern text-to-image models (FLUX, SD3) use MM-DiT (Multi-Modal DiT). Instead of separate self-attention for image tokens and cross-attention for text, MM-DiT concatenates all tokens (image patches + text tokens) into a single sequence and runs joint self-attention over everything. Each image patch can attend to every text token and vice versa. This is simpler and empirically better than the cross-attention design.

Architecture Comparison

U-Net (LDM / Stable Diffusion 1.x): ~860M params. Conv encoder-decoder + self-attention at low resolutions + cross-attention for text. Trained on latents of 256×256 and then 512×512 images.

DiT-XL/2: ~675M params. 28 transformer blocks, patch size 2. Trained on ImageNet 256×256. Achieved FID 2.27 (SOTA at the time).

FLUX.1: ~12B params. MM-DiT with joint attention. T5-XXL + CLIP text encoders. Trained on billions of text-image pairs.

Chapter 07

Latent Diffusion Models

Running diffusion in pixel space is expensive. A 512×512×3 image has 786,432 dimensions. Every denoising step processes all of them. At 50 steps, that's about 39 million pixel values pushed through the full network for a single image. Can we work in a smaller space?

The Two-Stage Strategy

Stage 1: Compression. Train a VAE (variational autoencoder) to compress images into a low-dimensional latent space. The encoder maps a 256×256×3 image (196,608 values) to a 32×32×16 latent (16,384 values). That's 64× fewer spatial positions and 12× fewer values overall. The decoder maps the latent back to the image.

Stage 2: Diffusion on latents. Train the diffusion model (U-Net or DiT) on the 32×32×16 latents, not the 256×256×3 images. Everything is identical (add noise to latents, predict velocity, denoise), but now each step processes only 16,384 dimensions instead of 196,608.

Common Settings (Stable Diffusion)
Downsampling factor: D = 8 (256/8 = 32)
Latent channels: C = 4 (SD 1.x/2.x, SDXL) or C = 16 (SD3, FLUX)
Image: 256×256×3 → Latent: 32×32×C
Compression: ~48× fewer values (C=4) or ~12× (C=16)
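The compression figures follow from simple arithmetic, sketched here with two small helper functions (hypothetical names) so other resolutions and channel counts can be checked the same way.

```python
def latent_shape(height, width, downsample=8, channels=4):
    """Shape of the VAE latent for an image of the given size."""
    return (height // downsample, width // downsample, channels)

def compression_ratio(height, width, downsample=8, channels=4, image_channels=3):
    """How many times fewer values the latent holds than the RGB image."""
    h, w, c = latent_shape(height, width, downsample, channels)
    return (height * width * image_channels) / (h * w * c)

shape_c4 = latent_shape(256, 256, channels=4)           # (32, 32, 4)
ratio_c4 = compression_ratio(256, 256, channels=4)      # 196608 / 4096
ratio_c16 = compression_ratio(256, 256, channels=16)    # 196608 / 16384
shape_1024 = latent_shape(1024, 1024, channels=4)       # latent for a 1024x1024 output
```

Note that the ratio does not depend on the image size, only on the downsampling factor and channel count: 3·D²/C.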

The VAE + GAN Decoder

The autoencoder isn't a plain VAE — its decoder is trained with a GAN discriminator (perceptual + adversarial loss). This is critical: pure MSE reconstruction produces blurry images; the discriminator forces the decoder to produce sharp, detailed outputs. The encoder has a mild KL penalty to keep the latent space smooth.

The Modern LDM = VAE + GAN + Diffusion

It's a three-part system: (1) A VAE encoder compresses images to latents. (2) A diffusion model generates in latent space. (3) A GAN-enhanced VAE decoder maps latents back to sharp images. Three of the four generative model families from Chapter 1, working together.

Why Latent Space Works: Two Types of Compression

Perceptual compression removes imperceptible high-frequency details (like JPEG). The VAE's first job: collapse the pixel space by removing information humans can't see. Semantic compression captures the meaningful content: objects, layout, style. The diffusion model operates on semantically-rich latents, not redundant pixels. This separation is why LDMs are both fast and high-quality.

The Full Pipeline

Latent Diffusion: Training
  1. Stage 1 (one-time): Train autoencoder. Encoder E: image → latent. Decoder D: latent → image. Loss = reconstruction + KL + adversarial (GAN disc).
  2. Stage 2: Freeze the autoencoder. Encode all training images: z = E(x).
  3. Train diffusion on latents z using rectified flow (or DDPM). The denoiser fθ(zt, y, t) predicts velocity in latent space.
Latent Diffusion: Sampling
  1. Sample z1 ~ N(0, I) in latent space (32×32×C).
  2. Denoise: Run T steps of Euler integration with CFG to get clean latent z0.
  3. Decode: x = D(z0). Run the frozen VAE decoder to get the final image.
Worked Example — Stable Diffusion XL

Input: text prompt "a cyberpunk city at night, neon lights, rain." Text encoders: CLIP-ViT-L + OpenCLIP-ViT-bigG (dual encoders for richer text understanding). Latent size: 128×128×4 (for 1024×1024 output). U-Net denoiser: ~2.6B params. 50 Euler steps with CFG w=7.5. The entire generation takes roughly 4 seconds on an A100 GPU.

Interactive: Latent Diffusion Pipeline
Watch the full LDM pipeline: Image → Encoder → Latent → Add Noise → Iterative Denoising → Clean Latent → Decoder → Image. Dimensions shown at each stage.
Chapter 08

Text-to-Image & Text-to-Video

Latent diffusion + CFG + DiT = the complete text-to-image stack. Let's trace the full pipeline from text prompt to pixels, then extend to video.

The Text-to-Image Pipeline

Step 1: Text encoding. The text prompt goes through one or more frozen text encoders. Common choices: CLIP (contrastive vision-language model, produces a single embedding) and T5-XXL (encoder-decoder language model, produces a sequence of token embeddings). Using both captures complementary information — CLIP gives global semantics, T5 gives fine-grained token-level detail.

Step 2: Diffusion generation. The text embeddings condition a DiT via cross-attention (or joint attention in MM-DiT). Starting from random noise in latent space, T denoising steps with CFG produce a clean latent.

Step 3: Decoding. The VAE decoder maps the latent to pixels.

FLUX.1 Architecture

Text encoders: T5-XXL (4.7B params, produces 256 token embeddings) + CLIP-ViT-L (produces 1 global embedding). DiT: 12B params, MM-DiT with joint attention, patch size 2, 8×8 latent downsampling. VAE: 16 latent channels, 8× spatial downsampling. Trained on billions of captioned images.

From Images to Video

The conceptual leap from image to video generation is smaller than you'd think. An image is a 2D grid of latents (h × w × c). A video is a 3D grid of latents (t × h × w × c), where t is the time dimension. The diffusion model operates on this 3D tensor.

Definition
Video Latent Space

A video VAE compresses both spatially and temporally. A 16-frame 256×256 video might compress to a 4×32×32×16 latent (4× temporal, 8× spatial downsampling). The diffusion model generates in this compressed 3D space, and the video decoder inflates it back to frames.

Temporal Attention: Factorized Design

Full 3D self-attention over (t × h × w) tokens is quadratic in the total count — prohibitively expensive for long videos. The standard solution: factorized attention.

Spatial attention: Within each frame, all (h × w) patches attend to each other. This captures spatial relationships (objects, layout).

Temporal attention: For each spatial position, the tokens across all T frames attend to each other. This captures motion and temporal coherence.

These alternate in the transformer blocks: spatial-attention → temporal-attention → spatial → temporal → ...
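The factorized pattern is mostly a matter of which axis is treated as the batch. The sketch below uses a bare single-head attention with no learned projections (an assumption, to keep it short) on a token grid of shape (T, N, d), with T frames and N = h·w patches per frame, and also counts attention pairs for both designs.

```python
import numpy as np

def self_attention(x):
    """Single-head self-attention without learned projections. x: (batch, seq, d)."""
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

def factorized_block(tokens):
    """tokens: (T, N, d). Spatial attention within frames, then temporal across frames."""
    tokens = tokens + self_attention(tokens)        # spatial: frames act as the batch
    per_pos = tokens.transpose(1, 0, 2)             # (N, T, d)
    per_pos = per_pos + self_attention(per_pos)     # temporal: positions act as the batch
    return per_pos.transpose(1, 0, 2)               # back to (T, N, d)

def attention_pairs(T, N):
    """Number of query-key pairs for full 3D attention vs. the factorized design."""
    full = (T * N) ** 2
    factorized = T * N ** 2 + N * T ** 2
    return full, factorized

rng = np.random.default_rng(0)
out = factorized_block(rng.standard_normal((4, 16, 8)))
full, factorized = attention_pairs(T=4, N=1024)
```

For 4 latent frames of 1,024 patches each, full attention touches about 16.8M pairs while the factorized version touches about 4.2M, and the gap widens as the number of frames grows.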

Meta MovieGen (2024)

Scale: 30B parameter DiT. Latent space: 8×8×8 downsampling (spatial 8×, temporal 8×). A 16-second 768×768 video at 16fps = 256 frames → 32×96×96 latent (~295K latent positions) → ~73K tokens after 2×2 spatial patchification. Training data: Hundreds of millions of video clips. Text conditioning: UL2, ByT5, and MetaCLIP text encoders. Result: Photorealistic, temporally coherent videos from text prompts.

The Video Generation Timeline

| Year | Model | Key Innovation |
|---|---|---|
| 2024 Feb | Sora (OpenAI) | Spacetime patches, long coherent videos |
| 2024 Jun | Gen-3 (Runway) | Commercial text-to-video |
| 2024 Oct | MovieGen (Meta) | 30B DiT, 73K tokens, audio generation |
| 2024 Dec | HunyuanVideo (Tencent) | Open-source, dual-stream DiT |
| 2025 | Cosmos (NVIDIA) / Wan (Alibaba) | World models, open weights |

Current Limitations

Video models still struggle with: physics (objects passing through each other, gravity violations), counting (wrong number of fingers, objects), long-term coherence (character appearance drifting over time), and text rendering (garbled text on signs). These are active research areas.

Chapter 09

Generalized Diffusion & Connections

Rectified Flow is one specific choice. The general framework parameterizes the interpolation and prediction target with four functions:

Generalized Interpolation
$$x_t = a(t) \cdot x + b(t) \cdot z$$
$$y_{\text{gt}} = c(t) \cdot x + d(t) \cdot z$$
x = data, z = noise, $y_{\text{gt}}$ = prediction target

Different choices of a, b, c, d recover all the famous diffusion formulations:

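| Method | a(t) | b(t) | c(t) | d(t) | Predicts |
|---|---|---|---|---|---|
| Rectified Flow | 1−t | t | −1 | 1 | Velocity v = z−x |
| VP (DDPM) | √α(t) | √(1−α(t)) | 0 | 1 | Noise ε |
| VE (Score SDE) | 1 | σ(t) | 0 | 1 | Noise ε |
| x-prediction | a(t) | b(t) | 1 | 0 | Clean data x |
| v-prediction | a(t) | b(t) | −b(t) | a(t) | Velocity v |
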
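The four-function framework translates directly into code: a scheme is just a choice of a, b, c, d. In this sketch the rectified-flow entry comes straight from the table, while the v-prediction entry assumes a cosine schedule a(t) = cos(πt/2), b(t) = sin(πt/2), which is one common choice and not something the table specifies.

```python
import math
import numpy as np

SCHEMES = {
    "rectified_flow": dict(
        a=lambda t: 1.0 - t, b=lambda t: t,
        c=lambda t: -1.0, d=lambda t: 1.0,
    ),
    # Assumed cosine schedule for illustration; c = -b and d = a as in the table.
    "v_prediction_cosine": dict(
        a=lambda t: math.cos(0.5 * math.pi * t), b=lambda t: math.sin(0.5 * math.pi * t),
        c=lambda t: -math.sin(0.5 * math.pi * t), d=lambda t: math.cos(0.5 * math.pi * t),
    ),
}

def make_training_pair(x, z, t, scheme):
    """Return the noisy input x_t and the regression target y_gt for one scheme."""
    s = SCHEMES[scheme]
    x_t = s["a"](t) * x + s["b"](t) * z
    y_gt = s["c"](t) * x + s["d"](t) * z
    return x_t, y_gt

x = np.array([2.0])
z = np.array([-1.0])
x_t_rf, target_rf = make_training_pair(x, z, 0.25, "rectified_flow")
x_t_v, target_v = make_training_pair(x, z, 0.0, "v_prediction_cosine")
```

Swapping the scheme changes only the data preparation; the network, loss, and optimizer stay the same.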
All Roads Lead to the Same Place

Despite the different parameterizations, all these methods learn to transport the noise distribution to the data distribution along some path. The differences are in: (1) the shape of the path (straight vs. curved), (2) what the network predicts (noise, data, velocity), (3) the resulting noise schedule. In the infinite-step limit, they're all equivalent. In practice, rectified flow (straight paths, velocity prediction) tends to work best with fewer steps.

Diffusion as a Latent Variable Model

There's a beautiful connection to the VAE framework. Think of the entire forward trajectory {x_0, x_{0.01}, x_{0.02}, ..., x_1} as a hierarchy of latent variables. The forward process is the "encoder" q(x_{1:T} | x_0). The reverse process is the "decoder" p_θ(x_{0:T−1} | x_T). The ELBO decomposes into a sum of per-step denoising terms, each of which reduces to a weighted MSE between the network's prediction and its target (noise, clean data, or velocity, depending on the parameterization).

Autoregressive + Discrete Latents

An alternative to continuous diffusion: use a VQ-VAE (Vector-Quantized VAE) to encode images into a grid of discrete tokens, then train an autoregressive transformer to generate the tokens left-to-right, top-to-bottom. This is the approach behind DALL-E 1 (Ramesh et al. 2021) and recent models like LlamaGen.

VQ-VAE + Autoregressive Image → Encoder → Quantize to codebook tokens → [4, 12, 7, 31, ...]
Train transformer: p(tokeni | token1, ..., tokeni-1)
Generate: sample tokens autoregressively → Decoder → Image
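The "quantize to codebook tokens" step is a nearest-neighbor lookup. The sketch below uses a tiny made-up codebook and made-up encoder outputs; a real VQ-VAE learns the codebook jointly with the encoder and decoder.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each latent vector to its nearest codebook entry (squared Euclidean distance).

    latents: (n, d), codebook: (K, d). Returns (token ids of shape (n,), quantized vectors (n, d)).
    """
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)   # (n, K)
    ids = dists.argmin(axis=1)
    return ids, codebook[ids]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])      # K = 3 toy entries
latents = np.array([[0.9, 1.2], [4.0, 6.0], [-0.1, 0.2]])      # stand-in encoder outputs
token_ids, quantized = quantize(latents, codebook)
```

The resulting integer ids are what the autoregressive transformer models, and the decoder only ever sees rows of the codebook.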

Distillation: From 50 Steps to 1

The remaining weakness of diffusion: sampling speed. Several approaches compress the multi-step process:

| Method | Steps | Key Idea |
|---|---|---|
| DDIM | 10–50 | Deterministic sampling, skip timesteps |
| Progressive Distillation | 4–8 | Student learns to combine 2 teacher steps into 1 |
| Consistency Models | 1–2 | Map any point on trajectory directly to x_0 |
| Adversarial Distillation | 1–4 | GAN loss forces student to produce sharp images in few steps |

Consistency Models (Song et al. 2023)

Key idea: learn a function fθ such that fθ(xt, t) = x0 for all t along the same trajectory. In other words, no matter where you are on the noisy-to-clean path, the model predicts the endpoint directly. Self-consistency condition: fθ(xt, t) = fθ(xt', t') for any t, t' on the same trajectory. Training enforces this via a consistency loss between adjacent timesteps. Result: one-step generation with quality approaching 50-step diffusion.

Chapter 10

Summary & Connections

The Four Families Compared

| Property | Autoregressive | VAE | GAN | Diffusion/Flow |
|---|---|---|---|---|
| Density | Exact | Lower bound | None | Score-based |
| Training | Stable (MLE) | Stable (ELBO) | Unstable (minimax) | Stable (MSE) |
| Sample quality | Good (text), Fair (images) | Blurry | Sharp | Best |
| Diversity | High | High | Low (mode collapse) | High |
| Speed | Slow (sequential) | Fast (one pass) | Fast (one pass) | Slow (multi-step) |
| Scalability | Excellent (GPT-4) | Limited | Poor | Excellent (FLUX, Sora) |
| Era | 2016–present | 2013–2018 | 2014–2021 | 2020–present |

The Modern Generative Stack

Everything Combines

A modern text-to-image model like Stable Diffusion 3 is not one technique — it's five, stacked together:

(1) VAE encoder compresses images to latents. (2) Diffusion/Flow Matching generates in latent space. (3) DiT architecture parameterizes the denoiser. (4) CFG sharpens conditional generation. (5) Distillation reduces the number of sampling steps. Each piece was invented separately; together they're greater than the sum.

The Historical Arc

| Era | Dominant Method | Milestone Models |
|---|---|---|
| 2014–2017 | GANs + VAEs | GAN, DCGAN, VAE |
| 2017–2021 | GANs | StyleGAN, BigGAN, StyleGAN2 |
| 2020–2022 | Diffusion emergence | DDPM, Score SDE, ADM |
| 2022–2023 | Latent Diffusion | Stable Diffusion, DALL-E 2, Imagen |
| 2023–present | DiT + Flow Matching | DiT, SD3, FLUX, Sora, MovieGen |

Key Equations

GAN Minimax: $\min_G \max_D \; \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]$
Rectified Flow Loss: $L = \| f_\theta((1 - t)\,x + t\,z,\; t) - (z - x) \|^2$
Classifier-Free Guidance: $v_{\text{cfg}} = (1 + w) \cdot f_\theta(x_t, y, t) - w \cdot f_\theta(x_t, \varnothing, t)$

References

1. Goodfellow et al. "Generative Adversarial Nets." NeurIPS, 2014.
2. Radford et al. "Unsupervised Representation Learning with DCGANs." ICLR, 2016.
3. Karras et al. "A Style-Based Generator Architecture for GANs." CVPR, 2019.
4. Ho et al. "Denoising Diffusion Probabilistic Models." NeurIPS, 2020.
5. Song et al. "Score-Based Generative Modeling through SDEs." ICLR, 2021.
6. Dhariwal & Nichol. "Diffusion Models Beat GANs on Image Synthesis." NeurIPS, 2021.
7. Rombach et al. "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR, 2022.
8. Lipman et al. "Flow Matching for Generative Modeling." ICLR, 2023.
9. Liu et al. "Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow." ICLR, 2023.
10. Peebles & Xie. "Scalable Diffusion Models with Transformers." ICCV, 2023.
11. Song et al. "Consistency Models." ICML, 2023.
12. Ho & Salimans. "Classifier-Free Diffusion Guidance." NeurIPS Workshop, 2022.
13. Ramesh et al. "Zero-Shot Text-to-Image Generation." ICML, 2021.
14. Polyak et al. "Movie Gen: A Cast of Media Foundation Models." Meta, 2024.
15. Esser et al. "Scaling Rectified Flow Transformers for High-Resolution Image Synthesis." ICML, 2024.
16. Black Forest Labs. "FLUX.1: Open-Weight Flow Matching Models." 2024.

The One Sentence

Modern image and video generation = compress (VAE), denoise (diffusion), condition (CFG), and scale (transformers). Everything else is optimization.