← Gleams
Stanford CS 231n · Lecture 13 · Generative Models (Part 1)

Learning to Generate

What if your neural network could not just classify images, but create entirely new ones? That requires learning the full data distribution — a fundamentally harder problem.

Density estimation Autoregressive models Variational autoencoders The ELBO
Roadmap

What You'll Master

Chapter 01

Discriminative vs Generative

You've spent the entire course building discriminative models: given an image x, predict a label y. A classifier learns p(y|x) — "what's the probability this image is a cat, given its pixels?" That's incredibly useful. But it sidesteps a deeper question.

What does a "cat image" actually look like? Not "is this a cat?" but "what images are cat-like at all?" That question is about the data distribution p(x) — the probability density over the entire space of possible images. A model that learns p(x) is called a generative model.

What Is a Density Function?

A density function p(x) assigns a positive number to every possible data point x. Higher values mean "more likely under this distribution." There's one constraint: the density must integrate to 1 over all possible x.

Density Normalization ∫ p(x) dx = 1
p(x) ≥ 0 for all x, and the total area under the curve equals 1.

For images, x is a high-dimensional vector (e.g., 256 × 256 × 3 = 196,608 dimensions). The density p(x) lives in this enormous space. Most of that space is "noise" — random pixel values that look nothing like real images. Natural images occupy a tiny, thin manifold within it. The generative model's job is to find that manifold.

Three Flavors of Learning

TypeLearnsCan DoExample
Discriminativep(y|x)Classify, detect, segmentResNet, BERT
Generative (unconditional)p(x)Sample new data, detect outliersPixelCNN, VAE
Conditional generativep(x|y)Generate data for a specific class"Generate a cat image"
The Bayes Connection

Discriminative and generative models are two sides of Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x). A discriminative model learns the left side directly. A generative model learns p(x|y) and p(x), from which you can derive p(y|x). In principle, a perfect generative model gives you a perfect classifier for free — plus the ability to generate new data.

Why Learn to Generate?

Beyond creating pretty pictures, generative models enable:

Outlier detection: If p(x) is very low for an input, it's unusual. Flag it.

Data augmentation: Generate training samples for rare classes.

Unsupervised representation learning: The features a model discovers while learning p(x) are useful for downstream tasks — no labels needed.

Scientific simulation: Model the distribution of molecules, proteins, weather patterns. Generate realistic samples for testing.

Density Function Explorer
This shows 1D data samples and a learned density p(x). Toggle between discriminative and generative views. Drag the slider to see different density fits.
Worked Example — Discriminative vs Generative

Dataset: 100 images, 60 cats, 40 dogs.

Discriminative model learns: "If the image has pointy ears and whiskers, P(cat|x) = 0.95." It can classify but can't draw a cat.

Generative model learns: "Cat images tend to have these pixel patterns, in these proportions, with these textures." It can sample entirely new cat images — and also classify (via Bayes), though usually less accurately than a dedicated discriminative model.

Definition
Maximum Likelihood Principle

Given training data {x(1), ..., x(N)}, find model parameters that make the observed data as likely as possible: maximize Πi p(x(i); W). This is how we'll train every generative model in this lecture.

Chapter 02

The Taxonomy of Generative Models

Not all generative models are created equal. The central question is: can your model actually compute p(x)? Some models give you an explicit density you can evaluate. Others can only sample — they generate realistic data, but you can't ask "how likely is this particular image?"

Explicit Density Models

These models define a parametric form for p(x) and let you compute it directly for any input x.

Tractable density: You can compute p(x) exactly, in closed form. Autoregressive models achieve this by factoring p(x) into a product of conditionals using the chain rule. The computation is exact — no approximation.

Approximate density: The model defines p(x) implicitly through a latent variable z, but computing p(x) = ∫ p(x|z)p(z)dz requires an intractable integral. Variational Autoencoders (VAEs) handle this by optimizing a lower bound instead.

Implicit Density Models

These models don't compute p(x) at all. They learn to transform noise into realistic samples without ever defining a density function.

Direct sampling: Generative Adversarial Networks (GANs) learn a generator that maps random noise z → x. You can sample, but you can't evaluate p(x) for a given x.

Iterative sampling: Diffusion models start from noise and iteratively denoise. Each step moves toward the data distribution, producing high-quality samples through many small refinements.

The Fundamental Trade-off

Tractable density models (autoregressive) give you exact likelihoods but generate slowly (one element at a time). Implicit density models (GANs, diffusion) generate fast or high-quality samples but can't tell you p(x). VAEs sit in the middle: they give you a lower bound on p(x) and can both generate and evaluate, but the samples tend to be blurrier. No free lunch.

Model FamilyDensitySample QualitySpeedExample
AutoregressiveExact p(x)GoodSlow (sequential)PixelCNN, GPT
VAELower boundBlurryFast (one pass)VAE, VQ-VAE
GANNoneSharpFast (one pass)StyleGAN
DiffusionLower boundExcellentSlow (iterative)DDPM, Stable Diffusion

Historical Context

Autoregressive models came first — language models have been autoregressive since the n-gram days. VAEs (Kingma & Welling, 2013) showed that variational inference could scale with deep nets. GANs (Goodfellow et al., 2014) stunned the field with sharp samples but were notoriously unstable to train. Diffusion models (Ho et al., 2020) emerged later and combined the training stability of VAEs with sample quality surpassing GANs. Each approach was motivated by a different answer to the question "how do you handle the intractable density?"

Definition
Latent Variable

An unobserved random variable z that captures hidden structure in the data. You never see z directly — you only observe x. The model assumes x was generated from z: first sample z ~ p(z), then sample x ~ p(x|z). The latent z might represent pose, lighting, identity, or other factors of variation.

Don't Confuse These

"Generative model" doesn't mean "model that generates images." A generative model is any model that learns a distribution p(x) or p(x,y). An autoregressive language model is a generative model. A Gaussian mixture model is a generative model. Image generation is just the most visually impressive application.

Chapter 03

Maximum Likelihood Estimation

Before we build any specific generative model, we need a training objective. Every model in this lecture will use the same principle: maximum likelihood estimation (MLE). The idea is simple — find the parameters that make the training data most probable.

The Setup

You have training data {x(1), x(2), ..., x(N)} drawn i.i.d. from some unknown true distribution pdata(x). You have a parametric model p(x; W) with learnable parameters W. Goal: find W* that makes p(x; W) as close to pdata(x) as possible.

Maximum Likelihood Objective W* = argmaxW Πi=1N p(x(i); W)

Find parameters W that maximize the probability of all training examples simultaneously.

The Log Trick

Products of N probabilities are numerically terrible. Each p(x(i)) is between 0 and 1, so multiplying millions of them gives a number astronomically close to zero — underflow. The fix: take the logarithm. Since log is monotonically increasing, maximizing the log of the product gives the same W*.

Log-Likelihood W* = argmaxWi=1N log p(x(i); W)

Product → sum. Same optimum, numerically stable.

This sum of log-probabilities is called the log-likelihood. In practice, we minimize the negative log-likelihood (NLL), because gradient descent minimizes losses:

NLL Loss L(W) = − (1/N) ∑i=1N log p(x(i); W)
Connection to KL Divergence

Minimizing the NLL is equivalent to minimizing the KL divergence between the data distribution and the model:

DKL(pdata || pmodel) = 𝔼x~pdata[log pdata(x) − log pmodel(x; W)]

The first term (log pdata) is a constant with respect to W. So minimizing DKL is equivalent to maximizing 𝔼x~pdata[log pmodel(x; W)] — which is exactly the log-likelihood, estimated by averaging over training samples.

Connection to Cross-Entropy

The cross-entropy between pdata and pmodel is:

H(pdata, pmodel) = −𝔼x~pdata[log pmodel(x; W)] = NLL

You already know cross-entropy from classification! When you train a softmax classifier with cross-entropy loss, you're doing maximum likelihood on p(y|x). Now we're doing the same thing on p(x).

Worked Example — MLE for a Gaussian

Model: p(x; μ, σ) = (1/σ√(2π)) exp(−(x−μ)2 / 2σ2). Data: {2, 3, 5}.

Log-likelihood:i log p(xi) = ∑i [−log(σ√(2π)) − (xi−μ)2 / 2σ2]

Take derivative w.r.t. μ, set to 0: μ* = (2+3+5)/3 = 10/3 ≈ 3.33. The MLE mean is the sample mean — exactly what intuition says.

Take derivative w.r.t. σ: σ*2 = (1/3)∑(xi−μ*)2 ≈ 1.56. The MLE variance is the sample variance.

MLE recovers the obvious statistics because the Gaussian is a member of the exponential family. For neural networks, there's no closed-form solution — we use gradient descent on the NLL.

The Universal Training Signal

Maximum likelihood is remarkably general. It doesn't care about the model architecture. Autoregressive model? Maximize the sum of conditional log-probs. VAE? Maximize a lower bound on the log-likelihood. Diffusion model? Also a lower bound. The architectures differ; the principle is the same: make the training data likely.

Chapter 04

Autoregressive Models

Here's the most elegant trick in generative modeling. We want p(x) where x = (x1, x2, ..., xT) is a sequence of values. The chain rule of probability says we can always decompose any joint distribution as:

Chain Rule of Probability p(x) = p(x1) · p(x2|x1) · p(x3|x1,x2) · ... · p(xT|x1,...,xT−1)
     = Πt=1T p(xt | x1, ..., xt−1)

This is EXACT. No approximation. It's a mathematical identity.

The key insight: this is not an approximation. It's an identity that holds for any distribution. If you can model each conditional p(xt|x<t) accurately, you have the exact density p(x). No tricks, no bounds, no approximations.

Language Models Are Autoregressive

You've already seen this. An RNN or Transformer language model predicts the next token given all previous tokens. That's exactly p(xt|x1,...,xt−1). GPT, LLaMA, every LLM — they're all autoregressive generative models over text.

To generate: sample x1 ~ p(x1), then x2 ~ p(x2|x1), then x3 ~ p(x3|x1,x2), and so on. Each token is sampled conditioned on all previous tokens.

Autoregressive Images: PixelRNN & PixelCNN

Can we do the same for images? Treat the image as a sequence of pixel values in scanline order — left to right, top to bottom. Each pixel has R, G, B subpixels, each taking values 0–255. Model:

PixelRNN / PixelCNN p(x) = Πi=1 p(Ri|x<i) · p(Gi|x<i, Ri) · p(Bi|x<i, Ri, Gi)

Each subpixel is a 256-way softmax conditioned on all previous subpixels.

PixelRNN (van den Oord et al., 2016) models each conditional with an LSTM. Problem: generation is sequential — each pixel waits for the previous one. A 256×256 image has ~196K subpixels. Extremely slow.

PixelCNN uses masked convolutions instead. The convolution kernel is masked so that each pixel only sees pixels above and to the left — preserving the autoregressive ordering. Training is parallel (all pixels computed at once), but generation is still sequential.

Definition
Masked Convolution

A convolution filter where weights corresponding to "future" pixels (below and to the right in scanline order) are set to zero. This ensures each output pixel only depends on previously generated pixels, maintaining the autoregressive property. During training, all positions can be computed in parallel because the input is known.

How Masked Convolutions Work

Consider a 3×3 convolution kernel. In a normal convolution, all 9 weights are used. In a masked convolution for autoregressive generation:

Mask type A (first layer): zero out the center pixel and everything below/right. Only the top row and the left cell of the middle row have nonzero weights. The center pixel cannot see itself.

Mask type B (subsequent layers): same as A, but the center pixel IS included. This is safe because in deeper layers, the center pixel's feature already excludes "future" information (it was masked in layer 1).

Result: a 5-layer masked CNN has a receptive field covering many previous pixels, but zero information leakage from future pixels. Training runs at the speed of a normal CNN — all positions are computed in one forward pass.

Worked Example — Inference Speed Comparison

Generating a 256×256 RGB image (196,608 subpixels):

PixelRNN: 196,608 sequential LSTM steps. At ~500 steps/sec on a GPU: ~6.5 minutes per image.

PixelCNN: 196,608 sequential forward passes (each is fast due to convolutions): ~2 minutes per image.

VQ-VAE + Transformer: Encode to 32×32 = 1,024 tokens. Generate 1,024 tokens autoregressively: ~0.5 seconds. Then decode tokens to pixels in one pass. 780x speedup over PixelRNN.

Autoregressive Pixel Generation
Watch an image being generated pixel-by-pixel in scanline order. Each pixel is sampled from p(x_t | x_1,...,x_{t-1}). The image gradually emerges from top-left to bottom-right.
The Scalability Problem

A 1024×1024 image has 3 × 1024 × 1024 ≈ 3.1 million subpixels. Even at 1000 subpixels per second, that's 52 minutes per image. This is why modern autoregressive image models work on discrete latent tokens instead of raw pixels: VQ-VAE encodes the image into a much smaller grid of tokens (e.g., 32×32), and an autoregressive transformer generates those tokens. 1024 tokens instead of 3M subpixels.

WaveNet: Autoregressive Audio

The same idea applies to audio. WaveNet (van den Oord et al., 2016) models raw audio waveforms autoregressively at 16kHz — 16,000 samples per second. It uses dilated causal convolutions with exponentially increasing dilation rates (1, 2, 4, 8, ..., 512) to capture long-range dependencies without prohibitive computation. The result: the most natural-sounding text-to-speech of its era.

Worked Example — Chain Rule on a 2×2 Image

Image x has 4 pixels: x1, x2, x3, x4 (top-left, top-right, bottom-left, bottom-right). Each pixel is 0 or 1 (binary for simplicity).

Step 1: Model p(x1). Say p(x1=1) = 0.3. Sample: x1 = 0.

Step 2: Model p(x2|x1=0). Say p(x2=1|x1=0) = 0.7. Sample: x2 = 1.

Step 3: Model p(x3|x1=0, x2=1). Sample: x3 = 1.

Step 4: Model p(x4|x1=0, x2=1, x3=1). Sample: x4 = 0.

Total likelihood: p(x) = p(x1=0) × p(x2=1|x1=0) × p(x3=1|0,1) × p(x4=0|0,1,1).

We can compute this exactly. That's the power of autoregressive models.

Training Is Fast, Generation Is Slow

During training, we know all the pixels (they're in the training set). So we can compute all the conditionals in parallel using masked convolutions — one forward pass gives us the loss for all T positions simultaneously. During generation, we must sample sequentially: each pixel depends on the previous ones. This asymmetry is fundamental to autoregressive models.

Chapter 05

Autoencoders

Before we get to VAEs, we need to understand their non-variational ancestor. An autoencoder learns useful features from data without labels. The idea is beautifully simple: force the network to compress data through a bottleneck, then reconstruct it.

Architecture

An autoencoder has two parts:

Encoder fθ: maps input x to a low-dimensional representation z = fθ(x). This z is called the latent code or bottleneck.

Decoder gφ: maps z back to a reconstruction x̂ = gφ(z). The goal is x̂ ≈ x.

Autoencoder z = fθ(x)        ← Encoder: compress
x̂ = gφ(z)        ← Decoder: reconstruct

L = ||x̂ − x||2     ← L2 reconstruction loss

If z has fewer dimensions than x (say, z ∈ ℝ32 while x ∈ ℝ784 for MNIST), the encoder must discover the most important features — it can't memorize every pixel. The bottleneck forces compression, and compression forces understanding.

What Are the Features Good For?

After training, throw away the decoder. The encoder maps any input to a compact feature vector z. Attach a small classifier on top of z for downstream tasks. This is unsupervised pretraining — learn features from unlabeled data, then fine-tune with a few labels.

Worked Example — MNIST Autoencoder

Input: 28×28 = 784 pixel values. Encoder: 784 → 256 → 32 (z has 32 dimensions). Decoder: 32 → 256 → 784.

After training, z captures the "essence" of each digit: stroke thickness, slant, roundness. Two 7s that look different have similar z vectors because they encode the same digit concept. The autoencoder discovered digit-relevant features without ever seeing a label.

Can We Generate with an Autoencoder?

Tempting idea: sample a random z, feed it to the decoder, get a new image. This doesn't work well.

The Latent Space Problem

The encoder maps training images to z vectors, but there's no constraint on where those vectors end up. The 7s might cluster around z = (3.2, −1.7, ...) and the 4s around z = (−5.1, 8.3, ...), with vast empty regions in between. If you sample a random z from those empty regions, the decoder has never seen anything like it — it produces garbage. The latent space is unstructured.

The fix: force z to come from a known, structured distribution. That's the core idea behind the Variational Autoencoder.

Autoencoder ≠ Generative Model

A vanilla autoencoder learns a deterministic mapping x → z → x̂. It's a feature extractor, not a density estimator. It doesn't define p(x), it can't compute likelihoods, and sampling from its latent space is unreliable. The VAE adds probabilistic structure to fix all three problems.

Variants and Regularization

Several autoencoder variants address different goals:

Denoising autoencoder: Corrupt the input (add noise, mask pixels) and train the network to reconstruct the clean original. This forces the encoder to learn robust features that capture the signal, not the noise. The denoising objective is closely related to score matching — a connection that inspired diffusion models years later.

Sparse autoencoder: Add an L1 penalty on z to encourage most latent dimensions to be zero. Each image activates only a few features, producing sparse, interpretable representations.

Contractive autoencoder: Penalize the Frobenius norm of the Jacobian ∂z/∂x. This makes the encoder insensitive to small input perturbations, learning features that capture meaningful variation and ignore noise.

Worked Example — Why Sampling Fails

Train an autoencoder on MNIST. The encoder maps each "7" to z-values near [2.1, −3.4], and each "0" to z-values near [−5.2, 1.8]. The region around [0, 0] has no training data nearby.

If you sample z = [0, 0] and decode, you get an unrecognizable blob. The decoder was never trained on inputs from this region. There's no "interpolation guarantee" — the decoder only works near the training data manifold in z-space.

VAEs fix this by regularizing ALL encoded distributions toward N(0, I), ensuring the latent space is dense — every region near the origin has training data mapped to it.

Chapter 06

Variational Autoencoders — The Setup

The Variational Autoencoder (Kingma & Welling, 2013) takes the autoencoder skeleton and makes it probabilistic. Instead of deterministic z = f(x), both the encoder and decoder output distributions.

The Generative Story

Assume each data point x(i) was generated by a two-step process:

Assumed Data Generation Process
  1. Sample latent: z ~ p(z) = N(0, I)   ← simple Gaussian prior
  2. Sample data: x ~ pθ(x|z)   ← decoder neural network

The latent z captures unobserved factors of variation — identity, pose, lighting, expression. Each z produces a different image through the decoder. The prior p(z) = N(0, I) says: before seeing any data, all z values are equally likely (centered, unit variance, independent dimensions).

What We Want: Maximum Likelihood

We want to find θ that maximizes pθ(x) for our training data. To compute pθ(x), marginalize out z:

Marginal Likelihood pθ(x) = ∫ pθ(x|z) p(z) dz

Integrate over ALL possible z values. This is intractable.
Problem 1: Intractable Integral

The integral ∫ pθ(x|z) p(z) dz sums over every possible z — an infinite-dimensional integral when z is continuous. We can't compute it analytically (pθ(x|z) is a neural network, not a conjugate distribution). We can't estimate it by Monte Carlo either: most z values produce x values nothing like our target, so random sampling would need astronomically many samples to get a useful estimate.

The Posterior Problem

We could try Bayes' rule: pθ(z|x) = pθ(x|z) p(z) / pθ(x). But computing this requires pθ(x), which is the intractable integral we're trying to avoid. Circular.

Problem 2: Intractable Posterior

Given a data point x, which z values could have produced it? That's the posterior pθ(z|x). We need it for training (to know where to "look" in z-space), but computing it requires the intractable marginal pθ(x). We're stuck.

The Solution: Approximate the Posterior

Since we can't compute pθ(z|x), we'll learn an approximation. Introduce an encoder network qφ(z|x) that takes a data point x and outputs a distribution over z. We'll train qφ to be as close to the true posterior pθ(z|x) as possible.

Encoder and Decoder Outputs Encoder: qφ(z|x) = N(μφ(x), diag(σφ2(x)))

Decoder: pθ(x|z) = N(μθ(z), σ2I)

Both output Gaussian distributions. The encoder outputs mean and variance vectors.
Why Gaussian?

Two reasons. First, the KL divergence between two Gaussians has a closed-form solution — no sampling needed to compute it. Second, we can use the reparameterization trick (Chapter 8) to backpropagate through Gaussian sampling. Other distribution families lack one or both of these properties. The Gaussian choice is pragmatic, not fundamental.

Definition
Amortized Inference

In classical variational inference, you optimize a separate set of variational parameters for each data point. The VAE amortizes this: one encoder network qφ(z|x) handles all data points. Feed in any x, get an approximate posterior. This makes inference scale to millions of data points.

Worked Example — Encoder Output

Input x is a 28×28 MNIST digit "3". Encoder processes x through conv layers and outputs two vectors:

μφ(x) = [1.2, −0.5, 0.8, ...]   (32 values, the mean)

log σφ2(x) = [−1.1, −0.3, −2.0, ...]   (32 values, log-variance for numerical stability)

The approximate posterior is a 32-dimensional Gaussian centered at μ with diagonal covariance σ2. To get a z, sample from this Gaussian. Different samples give slightly different z values, all near the mean — capturing the "3-ness" of the input with some uncertainty.

Chapter 07

The ELBO Derivation

This is the mathematical heart of the VAE. We can't compute log pθ(x) directly, so we derive a lower bound that we can compute and optimize. This bound is called the Evidence Lower Bound (ELBO).

Step 1: Start from Bayes' Rule

For any z, Bayes' rule gives us:

Step 1 — Bayes' Rule pθ(x) = pθ(x|z) · p(z) / pθ(z|x)

Take the log of both sides:

Step 2 — Take Logarithm log pθ(x) = log pθ(x|z) + log p(z) − log pθ(z|x)

Step 2: Introduce the Approximate Posterior

The right side has pθ(z|x) which we can't compute. Introduce qφ(z|x) by adding and subtracting log qφ(z|x):

Step 3 — Add and Subtract log qφ(z|x) log pθ(x) = log pθ(x|z) + log p(z) − log qφ(z|x) + log qφ(z|x) − log pθ(z|x)

Rearrange by grouping terms:

Step 4 — Rearrange log pθ(x) = log pθ(x|z) − [log qφ(z|x) − log p(z)] + [log qφ(z|x) − log pθ(z|x)]

= log pθ(x|z) − log [qφ(z|x) / p(z)] + log [qφ(z|x) / pθ(z|x)]

Step 3: Take Expectations

The left side, log pθ(x), doesn't depend on z. So if we take the expectation of both sides over z ~ qφ(z|x), the left side is unchanged:

Step 5 — Expectation over qφ(z|x) log pθ(x) = 𝔼z~qφ[log pθ(x|z)] − 𝔼z~qφ[log qφ(z|x) − log p(z)] + 𝔼z~qφ[log qφ(z|x) − log pθ(z|x)]

Step 4: Recognize the KL Divergences

The second term is a KL divergence (definition: DKL(q||p) = 𝔼q[log q/p]):

Step 6 — Identify KL Terms log pθ(x) = 𝔼z~qφ[log pθ(x|z)] − DKL(qφ(z|x) || p(z)) + DKL(qφ(z|x) || pθ(z|x))

Step 5: Drop the Intractable Term

The third term, DKL(qφ(z|x) || pθ(z|x)), measures how far our approximate posterior is from the true posterior. We can't compute it (it involves the intractable pθ(z|x)). But we know that KL divergence is always ≥ 0. So:

Step 7 — The ELBO log pθ(x) ≥ 𝔼z~qφ[log pθ(x|z)] − DKL(qφ(z|x) || p(z))

This lower bound is the ELBO (Evidence Lower Bound).
Why DKL ≥ 0 (Jensen's Inequality)

DKL(q||p) = 𝔼q[log(q/p)] = −𝔼q[log(p/q)] ≥ −log(𝔼q[p/q]) = −log(∫ q · p/q dz) = −log(∫ p dz) = −log(1) = 0.

The inequality is Jensen's: for concave log, 𝔼[log X] ≤ log 𝔼[X]. Equality only when p = q everywhere.

The ELBO Has Two Intuitive Terms

ELBO Decomposition ELBO = 𝔼z~qφ[log pθ(x|z)]DKL(qφ(z|x) || p(z))

Reconstruction term           Regularization term

Reconstruction: Sample z from the encoder, decode it, measure how well the decoder reconstructs the original x. This is the familiar autoencoder loss — make x̂ look like x.

Regularization: Push the encoder's output distribution qφ(z|x) toward the prior N(0, I). This prevents the encoder from encoding each data point at an arbitrary location in z-space. It forces the latent space to be smooth and structured — nearby z values should decode to similar images.

ELBO Components During Training
Watch the reconstruction term and KL term evolve as training progresses. The ELBO (their sum) is a lower bound on log p(x). The gap is the intractable D_KL(q||p_true).
The Reconstruction-Regularization Trade-off

If we maximize only the reconstruction term, we get a vanilla autoencoder — each data point gets its own corner of z-space, and the latent space has no structure for generation. If we maximize only the regularization term, the encoder ignores x entirely and outputs N(0, I) for everything — perfect prior match but zero information. The ELBO forces a balance: encode enough information to reconstruct, but keep the encoding close to the prior.

Closed-Form KL for Gaussians

When q = N(μ, σ2I) and p = N(0, I) (both diagonal), the KL divergence has a closed form:

DKL = −(1/2) ∑j=1J [1 + log(σj2) − μj2 − σj2]

where J is the latent dimension. Each latent dimension contributes independently. If μj = 0 and σj = 1, that dimension contributes 0 to the KL — it already matches the prior. Deviations in mean or variance increase the KL penalty.

Worked Example — KL for One Latent Dimension

Encoder outputs μ = 2.0, σ = 0.5 for one dimension.

DKL = −0.5 × [1 + log(0.25) − 4.0 − 0.25]

= −0.5 × [1 + (−1.386) − 4.0 − 0.25]

= −0.5 × [−4.636]

= 2.318

The mean is far from 0 (μ = 2) and the variance is too small (σ = 0.5 vs 1.0), so the KL penalty is high. The ELBO loss will push μ toward 0 and σ toward 1.

Why Maximizing ELBO Works

Recall: log pθ(x) = ELBO + DKL(qφ || pθ(z|x)). Maximizing the ELBO simultaneously (1) increases log pθ(x) — making the model assign higher probability to training data, and (2) decreases DKL(qφ || pθ(z|x)) — making the approximate posterior more accurate. One objective, two benefits.

Chapter 08

The Reparameterization Trick

We have the ELBO. We know what to optimize. But there's a computational snag: the reconstruction term 𝔼z~qφ[log pθ(x|z)] requires sampling z from qφ(z|x), and sampling is not differentiable.

The Problem

To estimate the reconstruction term, we sample z ~ N(μφ(x), σφ2(x)). But "sample from a distribution" is a stochastic operation — there's no gradient of "pick a random number" with respect to the distribution's parameters. We can't backpropagate through the sampling step.

Why Sampling Blocks Gradients

The gradient ∂z/∂μ is undefined when z is drawn randomly from N(μ, σ2). Changing μ changes the distribution z comes from, but any particular sample z doesn't have a smooth dependence on μ. It's like asking "how does the mean of a die change which face comes up?" The question doesn't make sense — the randomness "breaks" the chain of differentiable operations.

The Fix: Externalize the Randomness

Instead of sampling z directly from N(μ, σ2), decompose it as:

Reparameterization Trick ε ~ N(0, I)                 ← sample noise (fixed, no gradients needed)
z = μφ(x) + σφ(x) ⊙ ε   ← deterministic function of μ, σ, and ε

Now z is a deterministic function of μ, σ, and the external noise ε. The randomness comes from ε, which doesn't depend on φ. The dependence of z on φ flows through the deterministic path μ + σ ⊙ ε, which is differentiable.

The Intuition

Think of it this way. "Sample z from a Gaussian with learnable mean μ" is like saying "throw a dart at a target, where the target position is μ." You can't differentiate through the dart throw. But "take a fixed random offset ε and add it to μ" is z = μ + ε — and now dz/dμ = 1. The randomness is the same, but the gradient can flow.

Worked Example — Forward Pass with Reparameterization

Input x is a digit "7". Encoder outputs μ = [1.5, −0.3] and log σ2 = [−0.5, −1.0].

Step 1: Compute σ = exp(0.5 × log σ2) = [exp(−0.25), exp(−0.5)] = [0.779, 0.607].

Step 2: Sample ε ~ N(0, I). Say ε = [0.42, −1.10].

Step 3: z = μ + σ ⊙ ε = [1.5 + 0.779 × 0.42, −0.3 + 0.607 × (−1.10)] = [1.827, −0.968].

Step 4: Decode z to get x̂. Compute reconstruction loss ||x̂ − x||2.

Step 5: Compute KL term from μ and σ (closed form).

Step 6: Backprop through everything. Gradients flow through z = μ + σ ⊙ ε to update μ and σ (and hence φ).

The Full VAE Training Algorithm

VAE Training (One Iteration)
  1. Sample a minibatch {x(1), ..., x(M)} from the training set.
  2. Encode: For each x(i), compute μi = μφ(x(i)), σi = σφ(x(i)).
  3. Sample noise: εi ~ N(0, I) for each sample.
  4. Reparameterize: zi = μi + σi ⊙ εi.
  5. Decode: Compute x̂i = decoder(zi).
  6. Compute ELBO: L = (1/M) ∑i [||x̂i − x(i)||2 + DKL(qφ(z|x(i)) || p(z))].
  7. Backpropagate through L. Update θ and φ via gradient descent.
Why Monte Carlo Works Here

The reconstruction term 𝔼z~q[log pθ(x|z)] is estimated by sampling a single z per data point: ≈ log pθ(x|z1) where z1 = μ + σ ⊙ ε1. One sample! In practice, the minibatch average over M data points provides enough variance reduction. More samples per data point barely helps — better to process more data points.

Gradient Verification

Consider the loss L = (z − c)2 where z = μ + σε and c is a target. Then:

∂L/∂μ = 2(z − c) · ∂z/∂μ = 2(z − c) · 1 = 2(μ + σε − c)

∂L/∂σ = 2(z − c) · ∂z/∂σ = 2(z − c) · ε = 2(μ + σε − c) · ε

Both gradients are well-defined and computable. The chain rule flows cleanly through z = μ + σε because ε is treated as a constant (like input data).

Chapter 09

VAE in Action

Time to see everything come together. This interactive demo simulates a 2D VAE: the encoder maps data points to a 2D latent space, and the decoder maps latent points back to data. You can watch the latent space organize during training, sample new points, and explore the reconstruction-regularization trade-off.

VAE Latent Space Explorer
Left: 2D data distribution (colored clusters). Right: latent space z. During training, the encoder maps data points to z-space. Click anywhere in the latent space to sample — the decoder generates a new point. Adjust β to control the KL weight.
What β Controls

The β-VAE (Higgins et al., 2017) scales the KL term: ELBO = reconstruction − β × DKL. When β = 0: pure autoencoder, no regularization, latent space is unstructured. When β = 1: standard VAE, balanced trade-off. When β > 1: extra pressure toward the prior, more disentangled latent representations but blurrier reconstructions. Try dragging the slider to see the effect.

What to Look For

β = 0 (autoencoder): Clusters are tight and separated. Each class occupies its own island. If you click between islands, the decoder produces garbage — the latent space is empty there.

β = 1 (standard VAE): Clusters overlap more, filling the space. The Gaussian prior pushes everything toward the center. Clicking between clusters produces smooth interpolations. The latent space is continuous.

β > 1 (over-regularized): Everything collapses toward N(0, I). The classes are mixed together. Sampling produces "average" outputs — blurry, generic, lacking distinctiveness.

Worked Example — Latent Interpolation

Encode a "3" to zA = (1.2, 0.8). Encode an "8" to zB = (−0.4, −1.1). Interpolate: zmid = 0.5 × zA + 0.5 × zB = (0.4, −0.15). Decode zmid: you get something that looks partway between a 3 and an 8. This smooth interpolation only works because the KL regularization filled in the gaps between clusters.

Why VAE Outputs Are Blurry

The decoder minimizes ||x̂ − x||2. For a given z, multiple x values could be valid (a "3" can be drawn many ways). The L2-optimal output is the pixel-wise mean of all valid x values — which is blurry. It's averaging over possibilities rather than committing to one. GANs fix this by using a discriminator instead of L2 loss. Diffusion models fix it by generating iteratively.

Chapter 10

Summary & Connections

The Generative Models Landscape

ModelDensityTrainingGenerationQuality
AutoregressiveExact p(x)Parallel (teacher forcing)Sequential (slow)High fidelity
VAEELBO lower boundSingle forward + backwardOne pass (fast)Blurry (L2)
GANNone (implicit)Adversarial (unstable)One pass (fast)Sharp
DiffusionELBO lower boundSimple (denoise)Iterative (slow)Excellent

What We Covered

Chapter 1: Discriminative models learn p(y|x). Generative models learn p(x). The Bayes connection links them.

Chapter 2: Generative models split into explicit density (autoregressive, VAE) and implicit density (GAN, diffusion).

Chapter 3: Maximum likelihood — the universal training objective. Equivalent to minimizing KL divergence and cross-entropy.

Chapter 4: Autoregressive models factor p(x) via the chain rule. Exact density, but slow generation. PixelCNN, WaveNet, GPT.

Chapter 5: Autoencoders learn features via compression. Not generative — latent space is unstructured.

Chapters 6–8: VAEs make the autoencoder probabilistic. The ELBO = reconstruction + KL regularization. Reparameterization trick enables backprop through sampling.

Key Equations

The Four Essential Equations 1. Chain Rule: p(x) = Πt p(xt|x<t)

2. ELBO: log p(x) ≥ 𝔼z~q[log p(x|z)] − DKL(q(z|x)||p(z))

3. Reparam: z = μ + σ ⊙ ε,   ε ~ N(0, I)

4. KL (Gaussian): −(1/2) ∑j[1 + log σj2 − μj2 − σj2]

VAE Limitations

Blurry outputs: L2 reconstruction averages over modes. GANs and diffusion models produce much sharper samples.

Posterior collapse: With powerful decoders (e.g., autoregressive), the model learns to ignore z entirely. The KL term goes to zero (q matches the prior), and z carries no information. The decoder generates everything from scratch, defeating the purpose of the latent space.

Limited expressiveness: Diagonal Gaussian posteriors qφ(z|x) can't capture complex, multi-modal posteriors. The true p(z|x) might have multiple peaks, but our approximation is unimodal.

Where These Ideas Lead

VQ-VAE: Replace the continuous Gaussian latent with discrete tokens. No KL collapse, no blurriness. Pair with an autoregressive model over tokens for state-of-the-art generation. See the VAE & VQ-VAE lesson.

Diffusion Models: Instead of compressing to a bottleneck, gradually add noise until data becomes pure Gaussian, then learn to reverse each step. Combines the stable training of VAEs with exceptional sample quality. See the Diffusion Models and Flow Matching lessons.

GANs: Coming in Part 2 of this lecture. Instead of maximizing a likelihood bound, train a generator and discriminator in a minimax game. Sharp samples, but training instability. See the GAN lesson.

Autoregressive vs VAE: Head to Head

PropertyAutoregressiveVAE
DensityExact p(x) via chain ruleLower bound (ELBO) only
Latent spaceNone (implicit via hidden states)Explicit z with semantic meaning
Generation speedSlow: O(T) sequential stepsFast: one forward pass through decoder
TrainingTeacher forcing, parallelizableELBO maximization, single pass
InterpolationNot straightforwardSmooth interpolation in z-space
Sample qualityHigh fidelity (no averaging)Blurry (L2 averages modes)
ControllabilityDifficult (condition via prompts)Easy (manipulate z dimensions)

References

YearPaperContribution
2013Kingma & Welling, "Auto-Encoding Variational Bayes"VAE framework, reparameterization trick
2014Rezende et al., "Stochastic Backpropagation"Independent VAE derivation
2016van den Oord et al., "Pixel Recurrent Neural Networks"PixelRNN, PixelCNN
2016van den Oord et al., "WaveNet"Autoregressive audio with dilated convolutions
2017Higgins et al., "β-VAE"Disentangled representations via KL scaling
2017van den Oord et al., "VQ-VAE"Discrete latents, no posterior collapse
The One Sentence

Generative models learn the data distribution p(x), and the VAE does it by learning a lower bound: compress data into a Gaussian latent space, reconstruct it, and balance fidelity against structure. Everything else is architecture.