CS 231n — Generative Models (Part 1): From Density Estimation to VAEs

Roadmap

What You'll Master

01Discriminative vs Generative 02The Taxonomy of Generative Models 03Maximum Likelihood Estimation 04Autoregressive Models 05Autoencoders 06VAE — The Setup 07The ELBO Derivation 08Reparameterization Trick 09VAE in Action (Interactive) 10Summary & Connections

Chapter 01

Discriminative vs Generative

You've spent the entire course building discriminative models: given an image x, predict a label y. A classifier learns p(y|x) — "what's the probability this image is a cat, given its pixels?" That's incredibly useful. But it sidesteps a deeper question.

What does a "cat image" actually look like? Not "is this a cat?" but "what images are cat-like at all?" That question is about the data distribution p(x) — the probability density over the entire space of possible images. A model that learns p(x) is called a generative model.

What Is a Density Function?

A density function p(x) assigns a positive number to every possible data point x. Higher values mean "more likely under this distribution." There's one constraint: the density must integrate to 1 over all possible x.

Density Normalization ∫ p(x) dx = 1
p(x) ≥ 0 for all x, and the total area under the curve equals 1.

For images, x is a high-dimensional vector (e.g., 256 × 256 × 3 = 196,608 dimensions). The density p(x) lives in this enormous space. Most of that space is "noise" — random pixel values that look nothing like real images. Natural images occupy a tiny, thin manifold within it. The generative model's job is to find that manifold.

Three Flavors of Learning

Type	Learns	Can Do	Example
Discriminative	p(y\|x)	Classify, detect, segment	ResNet, BERT
Generative (unconditional)	p(x)	Sample new data, detect outliers	PixelCNN, VAE
Conditional generative	p(x\|y)	Generate data for a specific class	"Generate a cat image"

The Bayes Connection

Discriminative and generative models are two sides of Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x). A discriminative model learns the left side directly. A generative model learns p(x|y) and p(x), from which you can derive p(y|x). In principle, a perfect generative model gives you a perfect classifier for free — plus the ability to generate new data.

Why Learn to Generate?

Beyond creating pretty pictures, generative models enable:

Outlier detection: If p(x) is very low for an input, it's unusual. Flag it.

Data augmentation: Generate training samples for rare classes.

Unsupervised representation learning: The features a model discovers while learning p(x) are useful for downstream tasks — no labels needed.

Scientific simulation: Model the distribution of molecules, proteins, weather patterns. Generate realistic samples for testing.

Density Function Explorer

This shows 1D data samples and a learned density p(x). Toggle between discriminative and generative views. Drag the slider to see different density fits.

Fit quality: 50

Worked Example — Discriminative vs Generative

Dataset: 100 images, 60 cats, 40 dogs.

Discriminative model learns: "If the image has pointy ears and whiskers, P(cat|x) = 0.95." It can classify but can't draw a cat.

Generative model learns: "Cat images tend to have these pixel patterns, in these proportions, with these textures." It can sample entirely new cat images — and also classify (via Bayes), though usually less accurately than a dedicated discriminative model.

Definition

Maximum Likelihood Principle

Given training data {x⁽¹⁾, ..., x^(N)}, find model parameters that make the observed data as likely as possible: maximize Π_i p(x⁽ⁱ⁾; W). This is how we'll train every generative model in this lecture.

Chapter 02

The Taxonomy of Generative Models

Not all generative models are created equal. The central question is: can your model actually compute p(x)? Some models give you an explicit density you can evaluate. Others can only sample — they generate realistic data, but you can't ask "how likely is this particular image?"

Explicit Density Models

These models define a parametric form for p(x) and let you compute it directly for any input x.

Tractable density: You can compute p(x) exactly, in closed form. Autoregressive models achieve this by factoring p(x) into a product of conditionals using the chain rule. The computation is exact — no approximation.

Approximate density: The model defines p(x) implicitly through a latent variable z, but computing p(x) = ∫ p(x|z)p(z)dz requires an intractable integral. Variational Autoencoders (VAEs) handle this by optimizing a lower bound instead.

Implicit Density Models

These models don't compute p(x) at all. They learn to transform noise into realistic samples without ever defining a density function.

Direct sampling: Generative Adversarial Networks (GANs) learn a generator that maps random noise z → x. You can sample, but you can't evaluate p(x) for a given x.

Iterative sampling: Diffusion models start from noise and iteratively denoise. Each step moves toward the data distribution, producing high-quality samples through many small refinements.

The Fundamental Trade-off

Tractable density models (autoregressive) give you exact likelihoods but generate slowly (one element at a time). Implicit density models (GANs, diffusion) generate fast or high-quality samples but can't tell you p(x). VAEs sit in the middle: they give you a lower bound on p(x) and can both generate and evaluate, but the samples tend to be blurrier. No free lunch.

Model Family	Density	Sample Quality	Speed	Example
Autoregressive	Exact p(x)	Good	Slow (sequential)	PixelCNN, GPT
VAE	Lower bound	Blurry	Fast (one pass)	VAE, VQ-VAE
GAN	None	Sharp	Fast (one pass)	StyleGAN
Diffusion	Lower bound	Excellent	Slow (iterative)	DDPM, Stable Diffusion

Historical Context

Autoregressive models came first — language models have been autoregressive since the n-gram days. VAEs (Kingma & Welling, 2013) showed that variational inference could scale with deep nets. GANs (Goodfellow et al., 2014) stunned the field with sharp samples but were notoriously unstable to train. Diffusion models (Ho et al., 2020) emerged later and combined the training stability of VAEs with sample quality surpassing GANs. Each approach was motivated by a different answer to the question "how do you handle the intractable density?"

Definition

Latent Variable

An unobserved random variable z that captures hidden structure in the data. You never see z directly — you only observe x. The model assumes x was generated from z: first sample z ~ p(z), then sample x ~ p(x|z). The latent z might represent pose, lighting, identity, or other factors of variation.

Don't Confuse These

"Generative model" doesn't mean "model that generates images." A generative model is any model that learns a distribution p(x) or p(x,y). An autoregressive language model is a generative model. A Gaussian mixture model is a generative model. Image generation is just the most visually impressive application.

Chapter 03

Maximum Likelihood Estimation

Before we build any specific generative model, we need a training objective. Every model in this lecture will use the same principle: maximum likelihood estimation (MLE). The idea is simple — find the parameters that make the training data most probable.

The Setup

You have training data {x⁽¹⁾, x⁽²⁾, ..., x^(N)} drawn i.i.d. from some unknown true distribution p_data(x). You have a parametric model p(x; W) with learnable parameters W. Goal: find W* that makes p(x; W) as close to p_data(x) as possible.

Maximum Likelihood Objective W* = argmax_W Π_i=1^N p(x⁽ⁱ⁾; W)

Find parameters W that maximize the probability of all training examples simultaneously.

The Log Trick

Products of N probabilities are numerically terrible. Each p(x⁽ⁱ⁾) is between 0 and 1, so multiplying millions of them gives a number astronomically close to zero — underflow. The fix: take the logarithm. Since log is monotonically increasing, maximizing the log of the product gives the same W*.

Log-Likelihood W* = argmax_W ∑_i=1^N log p(x⁽ⁱ⁾; W)

Product → sum. Same optimum, numerically stable.

This sum of log-probabilities is called the log-likelihood. In practice, we minimize the negative log-likelihood (NLL), because gradient descent minimizes losses:

NLL Loss L(W) = − (1/N) ∑_i=1^N log p(x⁽ⁱ⁾; W)

Connection to KL Divergence

Minimizing the NLL is equivalent to minimizing the KL divergence between the data distribution and the model:

D_KL(p_data || p_model) = 𝔼_{x~p_data}[log p_data(x) − log p_model(x; W)]

The first term (log p_data) is a constant with respect to W. So minimizing D_KL is equivalent to maximizing 𝔼_{x~p_data}[log p_model(x; W)] — which is exactly the log-likelihood, estimated by averaging over training samples.

Connection to Cross-Entropy

The cross-entropy between p_data and p_model is:

H(p_data, p_model) = −𝔼_{x~p_data}[log p_model(x; W)] = NLL

You already know cross-entropy from classification! When you train a softmax classifier with cross-entropy loss, you're doing maximum likelihood on p(y|x). Now we're doing the same thing on p(x).

Worked Example — MLE for a Gaussian

Model: p(x; μ, σ) = (1/σ√(2π)) exp(−(x−μ)² / 2σ²). Data: {2, 3, 5}.

Log-likelihood: ∑_i log p(x_i) = ∑_i [−log(σ√(2π)) − (x_i−μ)² / 2σ²]

Take derivative w.r.t. μ, set to 0: μ* = (2+3+5)/3 = 10/3 ≈ 3.33. The MLE mean is the sample mean — exactly what intuition says.

Take derivative w.r.t. σ: σ*² = (1/3)∑(x_i−μ*)² ≈ 1.56. The MLE variance is the sample variance.

MLE recovers the obvious statistics because the Gaussian is a member of the exponential family. For neural networks, there's no closed-form solution — we use gradient descent on the NLL.

The Universal Training Signal

Maximum likelihood is remarkably general. It doesn't care about the model architecture. Autoregressive model? Maximize the sum of conditional log-probs. VAE? Maximize a lower bound on the log-likelihood. Diffusion model? Also a lower bound. The architectures differ; the principle is the same: make the training data likely.

Chapter 04

Autoregressive Models

Here's the most elegant trick in generative modeling. We want p(x) where x = (x₁, x₂, ..., x_T) is a sequence of values. The chain rule of probability says we can always decompose any joint distribution as:

Chain Rule of Probability p(x) = p(x₁) · p(x₂|x₁) · p(x₃|x₁,x₂) · ... · p(x_T|x₁,...,x_T−1)
= Π_t=1^T p(x_t | x₁, ..., x_t−1)

This is EXACT. No approximation. It's a mathematical identity.

The key insight: this is not an approximation. It's an identity that holds for any distribution. If you can model each conditional p(x_t|x_<t) accurately, you have the exact density p(x). No tricks, no bounds, no approximations.

Language Models Are Autoregressive

You've already seen this. An RNN or Transformer language model predicts the next token given all previous tokens. That's exactly p(x_t|x₁,...,x_t−1). GPT, LLaMA, every LLM — they're all autoregressive generative models over text.

To generate: sample x₁ ~ p(x₁), then x₂ ~ p(x₂|x₁), then x₃ ~ p(x₃|x₁,x₂), and so on. Each token is sampled conditioned on all previous tokens.

Autoregressive Images: PixelRNN & PixelCNN

Can we do the same for images? Treat the image as a sequence of pixel values in scanline order — left to right, top to bottom. Each pixel has R, G, B subpixels, each taking values 0–255. Model:

PixelRNN / PixelCNN p(x) = Π_i=1^n² p(R_i|x_<i) · p(G_i|x_<i, R_i) · p(B_i|x_<i, R_i, G_i)

Each subpixel is a 256-way softmax conditioned on all previous subpixels.

PixelRNN (van den Oord et al., 2016) models each conditional with an LSTM. Problem: generation is sequential — each pixel waits for the previous one. A 256×256 image has ~196K subpixels. Extremely slow.

PixelCNN uses masked convolutions instead. The convolution kernel is masked so that each pixel only sees pixels above and to the left — preserving the autoregressive ordering. Training is parallel (all pixels computed at once), but generation is still sequential.

Definition

Masked Convolution

A convolution filter where weights corresponding to "future" pixels (below and to the right in scanline order) are set to zero. This ensures each output pixel only depends on previously generated pixels, maintaining the autoregressive property. During training, all positions can be computed in parallel because the input is known.

How Masked Convolutions Work

Consider a 3×3 convolution kernel. In a normal convolution, all 9 weights are used. In a masked convolution for autoregressive generation:

Mask type A (first layer): zero out the center pixel and everything below/right. Only the top row and the left cell of the middle row have nonzero weights. The center pixel cannot see itself.

Mask type B (subsequent layers): same as A, but the center pixel IS included. This is safe because in deeper layers, the center pixel's feature already excludes "future" information (it was masked in layer 1).

Result: a 5-layer masked CNN has a receptive field covering many previous pixels, but zero information leakage from future pixels. Training runs at the speed of a normal CNN — all positions are computed in one forward pass.

Worked Example — Inference Speed Comparison

Generating a 256×256 RGB image (196,608 subpixels):

PixelRNN: 196,608 sequential LSTM steps. At ~500 steps/sec on a GPU: ~6.5 minutes per image.

PixelCNN: 196,608 sequential forward passes (each is fast due to convolutions): ~2 minutes per image.

VQ-VAE + Transformer: Encode to 32×32 = 1,024 tokens. Generate 1,024 tokens autoregressively: ~0.5 seconds. Then decode tokens to pixels in one pass. 780x speedup over PixelRNN.

Autoregressive Pixel Generation

Watch an image being generated pixel-by-pixel in scanline order. Each pixel is sampled from p(x_t | x_1,...,x_{t-1}). The image gradually emerges from top-left to bottom-right.

Speed: 10

The Scalability Problem

A 1024×1024 image has 3 × 1024 × 1024 ≈ 3.1 million subpixels. Even at 1000 subpixels per second, that's 52 minutes per image. This is why modern autoregressive image models work on discrete latent tokens instead of raw pixels: VQ-VAE encodes the image into a much smaller grid of tokens (e.g., 32×32), and an autoregressive transformer generates those tokens. 1024 tokens instead of 3M subpixels.

WaveNet: Autoregressive Audio

The same idea applies to audio. WaveNet (van den Oord et al., 2016) models raw audio waveforms autoregressively at 16kHz — 16,000 samples per second. It uses dilated causal convolutions with exponentially increasing dilation rates (1, 2, 4, 8, ..., 512) to capture long-range dependencies without prohibitive computation. The result: the most natural-sounding text-to-speech of its era.

Worked Example — Chain Rule on a 2×2 Image

Image x has 4 pixels: x₁, x₂, x₃, x₄ (top-left, top-right, bottom-left, bottom-right). Each pixel is 0 or 1 (binary for simplicity).

Step 1: Model p(x₁). Say p(x₁=1) = 0.3. Sample: x₁ = 0.

Step 2: Model p(x₂|x₁=0). Say p(x₂=1|x₁=0) = 0.7. Sample: x₂ = 1.

Step 3: Model p(x₃|x₁=0, x₂=1). Sample: x₃ = 1.

Step 4: Model p(x₄|x₁=0, x₂=1, x₃=1). Sample: x₄ = 0.

Total likelihood: p(x) = p(x₁=0) × p(x₂=1|x₁=0) × p(x₃=1|0,1) × p(x₄=0|0,1,1).

We can compute this exactly. That's the power of autoregressive models.

Training Is Fast, Generation Is Slow

During training, we know all the pixels (they're in the training set). So we can compute all the conditionals in parallel using masked convolutions — one forward pass gives us the loss for all T positions simultaneously. During generation, we must sample sequentially: each pixel depends on the previous ones. This asymmetry is fundamental to autoregressive models.

Chapter 05

Autoencoders

Before we get to VAEs, we need to understand their non-variational ancestor. An autoencoder learns useful features from data without labels. The idea is beautifully simple: force the network to compress data through a bottleneck, then reconstruct it.

Architecture

An autoencoder has two parts:

Encoder f_θ: maps input x to a low-dimensional representation z = f_θ(x). This z is called the latent code or bottleneck.

Decoder g_φ: maps z back to a reconstruction x̂ = g_φ(z). The goal is x̂ ≈ x.

Autoencoder z = f_θ(x)        ← Encoder: compress
x̂ = g_φ(z)        ← Decoder: reconstruct

L = ||x̂ − x||²     ← L2 reconstruction loss

If z has fewer dimensions than x (say, z ∈ ℝ³² while x ∈ ℝ⁷⁸⁴ for MNIST), the encoder must discover the most important features — it can't memorize every pixel. The bottleneck forces compression, and compression forces understanding.

What Are the Features Good For?

After training, throw away the decoder. The encoder maps any input to a compact feature vector z. Attach a small classifier on top of z for downstream tasks. This is unsupervised pretraining — learn features from unlabeled data, then fine-tune with a few labels.

Worked Example — MNIST Autoencoder

Input: 28×28 = 784 pixel values. Encoder: 784 → 256 → 32 (z has 32 dimensions). Decoder: 32 → 256 → 784.

After training, z captures the "essence" of each digit: stroke thickness, slant, roundness. Two 7s that look different have similar z vectors because they encode the same digit concept. The autoencoder discovered digit-relevant features without ever seeing a label.

Can We Generate with an Autoencoder?

Tempting idea: sample a random z, feed it to the decoder, get a new image. This doesn't work well.

The Latent Space Problem

The encoder maps training images to z vectors, but there's no constraint on where those vectors end up. The 7s might cluster around z = (3.2, −1.7, ...) and the 4s around z = (−5.1, 8.3, ...), with vast empty regions in between. If you sample a random z from those empty regions, the decoder has never seen anything like it — it produces garbage. The latent space is unstructured.

The fix: force z to come from a known, structured distribution. That's the core idea behind the Variational Autoencoder.

Autoencoder ≠ Generative Model

A vanilla autoencoder learns a deterministic mapping x → z → x̂. It's a feature extractor, not a density estimator. It doesn't define p(x), it can't compute likelihoods, and sampling from its latent space is unreliable. The VAE adds probabilistic structure to fix all three problems.

Variants and Regularization

Several autoencoder variants address different goals:

Denoising autoencoder: Corrupt the input (add noise, mask pixels) and train the network to reconstruct the clean original. This forces the encoder to learn robust features that capture the signal, not the noise. The denoising objective is closely related to score matching — a connection that inspired diffusion models years later.

Sparse autoencoder: Add an L1 penalty on z to encourage most latent dimensions to be zero. Each image activates only a few features, producing sparse, interpretable representations.

Contractive autoencoder: Penalize the Frobenius norm of the Jacobian ∂z/∂x. This makes the encoder insensitive to small input perturbations, learning features that capture meaningful variation and ignore noise.

Worked Example — Why Sampling Fails

Train an autoencoder on MNIST. The encoder maps each "7" to z-values near [2.1, −3.4], and each "0" to z-values near [−5.2, 1.8]. The region around [0, 0] has no training data nearby.

If you sample z = [0, 0] and decode, you get an unrecognizable blob. The decoder was never trained on inputs from this region. There's no "interpolation guarantee" — the decoder only works near the training data manifold in z-space.

VAEs fix this by regularizing ALL encoded distributions toward N(0, I), ensuring the latent space is dense — every region near the origin has training data mapped to it.

Chapter 06

Variational Autoencoders — The Setup

The Variational Autoencoder (Kingma & Welling, 2013) takes the autoencoder skeleton and makes it probabilistic. Instead of deterministic z = f(x), both the encoder and decoder output distributions.

The Generative Story

Assume each data point x⁽ⁱ⁾ was generated by a two-step process:

Assumed Data Generation Process

Sample latent: z ~ p(z) = N(0, I) ← simple Gaussian prior
Sample data: x ~ p_θ(x|z) ← decoder neural network

The latent z captures unobserved factors of variation — identity, pose, lighting, expression. Each z produces a different image through the decoder. The prior p(z) = N(0, I) says: before seeing any data, all z values are equally likely (centered, unit variance, independent dimensions).

What We Want: Maximum Likelihood

We want to find θ that maximizes p_θ(x) for our training data. To compute p_θ(x), marginalize out z:

Marginal Likelihood p_θ(x) = ∫ p_θ(x|z) p(z) dz

Integrate over ALL possible z values. This is intractable.

Problem 1: Intractable Integral

The integral ∫ p_θ(x|z) p(z) dz sums over every possible z — an infinite-dimensional integral when z is continuous. We can't compute it analytically (p_θ(x|z) is a neural network, not a conjugate distribution). We can't estimate it by Monte Carlo either: most z values produce x values nothing like our target, so random sampling would need astronomically many samples to get a useful estimate.

The Posterior Problem

We could try Bayes' rule: p_θ(z|x) = p_θ(x|z) p(z) / p_θ(x). But computing this requires p_θ(x), which is the intractable integral we're trying to avoid. Circular.

Problem 2: Intractable Posterior

Given a data point x, which z values could have produced it? That's the posterior p_θ(z|x). We need it for training (to know where to "look" in z-space), but computing it requires the intractable marginal p_θ(x). We're stuck.

The Solution: Approximate the Posterior

Since we can't compute p_θ(z|x), we'll learn an approximation. Introduce an encoder network q_φ(z|x) that takes a data point x and outputs a distribution over z. We'll train q_φ to be as close to the true posterior p_θ(z|x) as possible.

Encoder and Decoder Outputs Encoder: q_φ(z|x) = N(μ_φ(x), diag(σ_φ²(x)))

Decoder: p_θ(x|z) = N(μ_θ(z), σ²I)

Both output Gaussian distributions. The encoder outputs mean and variance vectors.

Why Gaussian?

Two reasons. First, the KL divergence between two Gaussians has a closed-form solution — no sampling needed to compute it. Second, we can use the reparameterization trick (Chapter 8) to backpropagate through Gaussian sampling. Other distribution families lack one or both of these properties. The Gaussian choice is pragmatic, not fundamental.

Definition

Amortized Inference

In classical variational inference, you optimize a separate set of variational parameters for each data point. The VAE amortizes this: one encoder network q_φ(z|x) handles all data points. Feed in any x, get an approximate posterior. This makes inference scale to millions of data points.

Worked Example — Encoder Output

Input x is a 28×28 MNIST digit "3". Encoder processes x through conv layers and outputs two vectors:

μ_φ(x) = [1.2, −0.5, 0.8, ...] (32 values, the mean)

log σ_φ²(x) = [−1.1, −0.3, −2.0, ...] (32 values, log-variance for numerical stability)

The approximate posterior is a 32-dimensional Gaussian centered at μ with diagonal covariance σ². To get a z, sample from this Gaussian. Different samples give slightly different z values, all near the mean — capturing the "3-ness" of the input with some uncertainty.

Chapter 07

The ELBO Derivation

This is the mathematical heart of the VAE. We can't compute log p_θ(x) directly, so we derive a lower bound that we can compute and optimize. This bound is called the Evidence Lower Bound (ELBO).

Step 1: Start from Bayes' Rule

For any z, Bayes' rule gives us:

Step 1 — Bayes' Rule p_θ(x) = p_θ(x|z) · p(z) / p_θ(z|x)

Take the log of both sides:

Step 2 — Take Logarithm log p_θ(x) = log p_θ(x|z) + log p(z) − log p_θ(z|x)

Step 2: Introduce the Approximate Posterior

The right side has p_θ(z|x) which we can't compute. Introduce q_φ(z|x) by adding and subtracting log q_φ(z|x):

Rearrange by grouping terms:

Step 3: Take Expectations

The left side, log p_θ(x), doesn't depend on z. So if we take the expectation of both sides over z ~ q_φ(z|x), the left side is unchanged:

Step 5 — Expectation over q_φ(z|x) log p_θ(x) = 𝔼_{z~q_φ}[log p_θ(x|z)] − 𝔼_{z~q_φ}[log q_φ(z|x) − log p(z)] + 𝔼_{z~q_φ}[log q_φ(z|x) − log p_θ(z|x)]

Step 4: Recognize the KL Divergences

The second term is a KL divergence (definition: D_KL(q||p) = 𝔼_q[log q/p]):

Step 6 — Identify KL Terms log p_θ(x) = 𝔼_{z~q_φ}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z)) + D_KL(q_φ(z|x) || p_θ(z|x))

Step 5: Drop the Intractable Term

The third term, D_KL(q_φ(z|x) || p_θ(z|x)), measures how far our approximate posterior is from the true posterior. We can't compute it (it involves the intractable p_θ(z|x)). But we know that KL divergence is always ≥ 0. So:

Step 7 — The ELBO log p_θ(x) ≥ 𝔼_{z~q_φ}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))

This lower bound is the ELBO (Evidence Lower Bound).

Why D_KL ≥ 0 (Jensen's Inequality)

D_KL(q||p) = 𝔼_q[log(q/p)] = −𝔼_q[log(p/q)] ≥ −log(𝔼_q[p/q]) = −log(∫ q · p/q dz) = −log(∫ p dz) = −log(1) = 0.

The inequality is Jensen's: for concave log, 𝔼[log X] ≤ log 𝔼[X]. Equality only when p = q everywhere.

The ELBO Has Two Intuitive Terms

ELBO Decomposition ELBO = 𝔼_{z~q_φ}[log p_θ(x|z)] − D_KL(q_φ(z|x) || p(z))

Reconstruction term Regularization term

Reconstruction: Sample z from the encoder, decode it, measure how well the decoder reconstructs the original x. This is the familiar autoencoder loss — make x̂ look like x.

Regularization: Push the encoder's output distribution q_φ(z|x) toward the prior N(0, I). This prevents the encoder from encoding each data point at an arbitrary location in z-space. It forces the latent space to be smooth and structured — nearby z values should decode to similar images.

ELBO Components During Training

Watch the reconstruction term and KL term evolve as training progresses. The ELBO (their sum) is a lower bound on log p(x). The gap is the intractable D_KL(q||p_true).

Speed: 5

The Reconstruction-Regularization Trade-off

If we maximize only the reconstruction term, we get a vanilla autoencoder — each data point gets its own corner of z-space, and the latent space has no structure for generation. If we maximize only the regularization term, the encoder ignores x entirely and outputs N(0, I) for everything — perfect prior match but zero information. The ELBO forces a balance: encode enough information to reconstruct, but keep the encoding close to the prior.

Closed-Form KL for Gaussians

When q = N(μ, σ²I) and p = N(0, I) (both diagonal), the KL divergence has a closed form:

D_KL = −(1/2) ∑_j=1^J [1 + log(σ_j²) − μ_j² − σ_j²]

where J is the latent dimension. Each latent dimension contributes independently. If μ_j = 0 and σ_j = 1, that dimension contributes 0 to the KL — it already matches the prior. Deviations in mean or variance increase the KL penalty.

Worked Example — KL for One Latent Dimension

Encoder outputs μ = 2.0, σ = 0.5 for one dimension.

D_KL = −0.5 × [1 + log(0.25) − 4.0 − 0.25]

= −0.5 × [1 + (−1.386) − 4.0 − 0.25]

= −0.5 × [−4.636]

= 2.318

The mean is far from 0 (μ = 2) and the variance is too small (σ = 0.5 vs 1.0), so the KL penalty is high. The ELBO loss will push μ toward 0 and σ toward 1.

Why Maximizing ELBO Works

Recall: log p_θ(x) = ELBO + D_KL(q_φ || p_θ(z|x)). Maximizing the ELBO simultaneously (1) increases log p_θ(x) — making the model assign higher probability to training data, and (2) decreases D_KL(q_φ || p_θ(z|x)) — making the approximate posterior more accurate. One objective, two benefits.

Chapter 08

The Reparameterization Trick

We have the ELBO. We know what to optimize. But there's a computational snag: the reconstruction term 𝔼_{z~q_φ}[log p_θ(x|z)] requires sampling z from q_φ(z|x), and sampling is not differentiable.

The Problem

To estimate the reconstruction term, we sample z ~ N(μ_φ(x), σ_φ²(x)). But "sample from a distribution" is a stochastic operation — there's no gradient of "pick a random number" with respect to the distribution's parameters. We can't backpropagate through the sampling step.

Why Sampling Blocks Gradients

The gradient ∂z/∂μ is undefined when z is drawn randomly from N(μ, σ²). Changing μ changes the distribution z comes from, but any particular sample z doesn't have a smooth dependence on μ. It's like asking "how does the mean of a die change which face comes up?" The question doesn't make sense — the randomness "breaks" the chain of differentiable operations.

The Fix: Externalize the Randomness

Instead of sampling z directly from N(μ, σ²), decompose it as:

Reparameterization Trick ε ~ N(0, I) ← sample noise (fixed, no gradients needed)
z = μ_φ(x) + σ_φ(x) ⊙ ε ← deterministic function of μ, σ, and ε

Now z is a deterministic function of μ, σ, and the external noise ε. The randomness comes from ε, which doesn't depend on φ. The dependence of z on φ flows through the deterministic path μ + σ ⊙ ε, which is differentiable.

The Intuition

Think of it this way. "Sample z from a Gaussian with learnable mean μ" is like saying "throw a dart at a target, where the target position is μ." You can't differentiate through the dart throw. But "take a fixed random offset ε and add it to μ" is z = μ + ε — and now dz/dμ = 1. The randomness is the same, but the gradient can flow.

Worked Example — Forward Pass with Reparameterization

Input x is a digit "7". Encoder outputs μ = [1.5, −0.3] and log σ² = [−0.5, −1.0].

Step 1: Compute σ = exp(0.5 × log σ²) = [exp(−0.25), exp(−0.5)] = [0.779, 0.607].

Step 2: Sample ε ~ N(0, I). Say ε = [0.42, −1.10].

Step 3: z = μ + σ ⊙ ε = [1.5 + 0.779 × 0.42, −0.3 + 0.607 × (−1.10)] = [1.827, −0.968].

Step 4: Decode z to get x̂. Compute reconstruction loss ||x̂ − x||².

Step 5: Compute KL term from μ and σ (closed form).

Step 6: Backprop through everything. Gradients flow through z = μ + σ ⊙ ε to update μ and σ (and hence φ).

The Full VAE Training Algorithm

VAE Training (One Iteration)

Sample a minibatch {x⁽¹⁾, ..., x^(M)} from the training set.
Encode: For each x⁽ⁱ⁾, compute μ_i = μ_φ(x⁽ⁱ⁾), σ_i = σ_φ(x⁽ⁱ⁾).
Sample noise: ε_i ~ N(0, I) for each sample.
Reparameterize: z_i = μ_i + σ_i ⊙ ε_i.
Decode: Compute x̂_i = decoder(z_i).
Compute ELBO: L = (1/M) ∑_i [||x̂_i − x⁽ⁱ⁾||² + D_KL(q_φ(z|x⁽ⁱ⁾) || p(z))].
Backpropagate through L. Update θ and φ via gradient descent.

Why Monte Carlo Works Here

The reconstruction term 𝔼_z~q[log p_θ(x|z)] is estimated by sampling a single z per data point: ≈ log p_θ(x|z₁) where z₁ = μ + σ ⊙ ε₁. One sample! In practice, the minibatch average over M data points provides enough variance reduction. More samples per data point barely helps — better to process more data points.

Gradient Verification

Consider the loss L = (z − c)² where z = μ + σε and c is a target. Then:

∂L/∂μ = 2(z − c) · ∂z/∂μ = 2(z − c) · 1 = 2(μ + σε − c)

∂L/∂σ = 2(z − c) · ∂z/∂σ = 2(z − c) · ε = 2(μ + σε − c) · ε

Both gradients are well-defined and computable. The chain rule flows cleanly through z = μ + σε because ε is treated as a constant (like input data).

Chapter 09

VAE in Action

Time to see everything come together. This interactive demo simulates a 2D VAE: the encoder maps data points to a 2D latent space, and the decoder maps latent points back to data. You can watch the latent space organize during training, sample new points, and explore the reconstruction-regularization trade-off.

VAE Latent Space Explorer

Left: 2D data distribution (colored clusters). Right: latent space z. During training, the encoder maps data points to z-space. Click anywhere in the latent space to sample — the decoder generates a new point. Adjust β to control the KL weight.

β (KL weight): 1.00

What β Controls

The β-VAE (Higgins et al., 2017) scales the KL term: ELBO = reconstruction − β × D_KL. When β = 0: pure autoencoder, no regularization, latent space is unstructured. When β = 1: standard VAE, balanced trade-off. When β > 1: extra pressure toward the prior, more disentangled latent representations but blurrier reconstructions. Try dragging the slider to see the effect.

What to Look For

β = 0 (autoencoder): Clusters are tight and separated. Each class occupies its own island. If you click between islands, the decoder produces garbage — the latent space is empty there.

β = 1 (standard VAE): Clusters overlap more, filling the space. The Gaussian prior pushes everything toward the center. Clicking between clusters produces smooth interpolations. The latent space is continuous.

β > 1 (over-regularized): Everything collapses toward N(0, I). The classes are mixed together. Sampling produces "average" outputs — blurry, generic, lacking distinctiveness.

Worked Example — Latent Interpolation

Encode a "3" to z_A = (1.2, 0.8). Encode an "8" to z_B = (−0.4, −1.1). Interpolate: z_mid = 0.5 × z_A + 0.5 × z_B = (0.4, −0.15). Decode z_mid: you get something that looks partway between a 3 and an 8. This smooth interpolation only works because the KL regularization filled in the gaps between clusters.

Why VAE Outputs Are Blurry

The decoder minimizes ||x̂ − x||². For a given z, multiple x values could be valid (a "3" can be drawn many ways). The L2-optimal output is the pixel-wise mean of all valid x values — which is blurry. It's averaging over possibilities rather than committing to one. GANs fix this by using a discriminator instead of L2 loss. Diffusion models fix it by generating iteratively.

Chapter 10

Summary & Connections

The Generative Models Landscape

Model	Density	Training	Generation	Quality
Autoregressive	Exact p(x)	Parallel (teacher forcing)	Sequential (slow)	High fidelity
VAE	ELBO lower bound	Single forward + backward	One pass (fast)	Blurry (L2)
GAN	None (implicit)	Adversarial (unstable)	One pass (fast)	Sharp
Diffusion	ELBO lower bound	Simple (denoise)	Iterative (slow)	Excellent

What We Covered

Chapter 1: Discriminative models learn p(y|x). Generative models learn p(x). The Bayes connection links them.

Chapter 2: Generative models split into explicit density (autoregressive, VAE) and implicit density (GAN, diffusion).

Chapter 3: Maximum likelihood — the universal training objective. Equivalent to minimizing KL divergence and cross-entropy.

Chapter 4: Autoregressive models factor p(x) via the chain rule. Exact density, but slow generation. PixelCNN, WaveNet, GPT.

Chapter 5: Autoencoders learn features via compression. Not generative — latent space is unstructured.

Chapters 6–8: VAEs make the autoencoder probabilistic. The ELBO = reconstruction + KL regularization. Reparameterization trick enables backprop through sampling.

Key Equations

The Four Essential Equations 1. Chain Rule: p(x) = Π_t p(x_t|x_<t)

2. ELBO: log p(x) ≥ 𝔼_z~q[log p(x|z)] − D_KL(q(z|x)||p(z))

3. Reparam: z = μ + σ ⊙ ε, ε ~ N(0, I)

4. KL (Gaussian): −(1/2) ∑_j[1 + log σ_j² − μ_j² − σ_j²]

VAE Limitations

Blurry outputs: L2 reconstruction averages over modes. GANs and diffusion models produce much sharper samples.

Posterior collapse: With powerful decoders (e.g., autoregressive), the model learns to ignore z entirely. The KL term goes to zero (q matches the prior), and z carries no information. The decoder generates everything from scratch, defeating the purpose of the latent space.

Limited expressiveness: Diagonal Gaussian posteriors q_φ(z|x) can't capture complex, multi-modal posteriors. The true p(z|x) might have multiple peaks, but our approximation is unimodal.

Where These Ideas Lead

VQ-VAE: Replace the continuous Gaussian latent with discrete tokens. No KL collapse, no blurriness. Pair with an autoregressive model over tokens for state-of-the-art generation. See the VAE & VQ-VAE lesson.

Diffusion Models: Instead of compressing to a bottleneck, gradually add noise until data becomes pure Gaussian, then learn to reverse each step. Combines the stable training of VAEs with exceptional sample quality. See the Diffusion Models and Flow Matching lessons.

GANs: Coming in Part 2 of this lecture. Instead of maximizing a likelihood bound, train a generator and discriminator in a minimax game. Sharp samples, but training instability. See the GAN lesson.

Autoregressive vs VAE: Head to Head

Property	Autoregressive	VAE
Density	Exact p(x) via chain rule	Lower bound (ELBO) only
Latent space	None (implicit via hidden states)	Explicit z with semantic meaning
Generation speed	Slow: O(T) sequential steps	Fast: one forward pass through decoder
Training	Teacher forcing, parallelizable	ELBO maximization, single pass
Interpolation	Not straightforward	Smooth interpolation in z-space
Sample quality	High fidelity (no averaging)	Blurry (L2 averages modes)
Controllability	Difficult (condition via prompts)	Easy (manipulate z dimensions)

References

Year	Paper	Contribution
2013	Kingma & Welling, "Auto-Encoding Variational Bayes"	VAE framework, reparameterization trick
2014	Rezende et al., "Stochastic Backpropagation"	Independent VAE derivation
2016	van den Oord et al., "Pixel Recurrent Neural Networks"	PixelRNN, PixelCNN
2016	van den Oord et al., "WaveNet"	Autoregressive audio with dilated convolutions
2017	Higgins et al., "β-VAE"	Disentangled representations via KL scaling
2017	van den Oord et al., "VQ-VAE"	Discrete latents, no posterior collapse

The One Sentence

Generative models learn the data distribution p(x), and the VAE does it by learning a lower bound: compress data into a Gaussian latent space, reconstruct it, and balance fidelity against structure. Everything else is architecture.

Learning to Generate

What You'll Master

Discriminative vs Generative

What Is a Density Function?

Three Flavors of Learning

Why Learn to Generate?

The Taxonomy of Generative Models

Explicit Density Models

Implicit Density Models

Historical Context

Maximum Likelihood Estimation

The Setup

The Log Trick

Autoregressive Models

Language Models Are Autoregressive

Autoregressive Images: PixelRNN & PixelCNN

WaveNet: Autoregressive Audio

Autoencoders

Architecture

What Are the Features Good For?

Can We Generate with an Autoencoder?

Variants and Regularization

Variational Autoencoders — The Setup

The Generative Story

What We Want: Maximum Likelihood

The Posterior Problem

The Solution: Approximate the Posterior

The ELBO Derivation

Step 1: Start from Bayes' Rule

Step 2: Introduce the Approximate Posterior

Step 3: Take Expectations

Step 4: Recognize the KL Divergences

Step 5: Drop the Intractable Term

The ELBO Has Two Intuitive Terms

The Reparameterization Trick

The Problem

The Fix: Externalize the Randomness

The Full VAE Training Algorithm

VAE in Action

What to Look For

Summary & Connections

The Generative Models Landscape

What We Covered

Key Equations

VAE Limitations

Where These Ideas Lead

Autoregressive vs VAE: Head to Head

References