What if your neural network could not just classify images, but create entirely new ones? That requires learning the full data distribution — a fundamentally harder problem.
You've spent the entire course building discriminative models: given an image x, predict a label y. A classifier learns p(y|x) — "what's the probability this image is a cat, given its pixels?" That's incredibly useful. But it sidesteps a deeper question.
What does a "cat image" actually look like? Not "is this a cat?" but "what images are cat-like at all?" That question is about the data distribution p(x) — the probability density over the entire space of possible images. A model that learns p(x) is called a generative model.
A density function p(x) assigns a positive number to every possible data point x. Higher values mean "more likely under this distribution." There's one constraint: the density must integrate to 1 over all possible x.
For images, x is a high-dimensional vector (e.g., 256 × 256 × 3 = 196,608 dimensions). The density p(x) lives in this enormous space. Most of that space is "noise" — random pixel values that look nothing like real images. Natural images occupy a tiny, thin manifold within it. The generative model's job is to find that manifold.
| Type | Learns | Can Do | Example |
|---|---|---|---|
| Discriminative | p(y|x) | Classify, detect, segment | ResNet, BERT |
| Generative (unconditional) | p(x) | Sample new data, detect outliers | PixelCNN, VAE |
| Conditional generative | p(x|y) | Generate data for a specific class | "Generate a cat image" |
Discriminative and generative models are two sides of Bayes' theorem: p(y|x) = p(x|y) p(y) / p(x). A discriminative model learns the left side directly. A generative model learns p(x|y) and p(x), from which you can derive p(y|x). In principle, a perfect generative model gives you a perfect classifier for free — plus the ability to generate new data.
Beyond creating pretty pictures, generative models enable:
Outlier detection: If p(x) is very low for an input, it's unusual. Flag it.
Data augmentation: Generate training samples for rare classes.
Unsupervised representation learning: The features a model discovers while learning p(x) are useful for downstream tasks — no labels needed.
Scientific simulation: Model the distribution of molecules, proteins, weather patterns. Generate realistic samples for testing.
Dataset: 100 images, 60 cats, 40 dogs.
Discriminative model learns: "If the image has pointy ears and whiskers, P(cat|x) = 0.95." It can classify but can't draw a cat.
Generative model learns: "Cat images tend to have these pixel patterns, in these proportions, with these textures." It can sample entirely new cat images — and also classify (via Bayes), though usually less accurately than a dedicated discriminative model.
Given training data {x(1), ..., x(N)}, find model parameters that make the observed data as likely as possible: maximize Πi p(x(i); W). This is how we'll train every generative model in this lecture.
Not all generative models are created equal. The central question is: can your model actually compute p(x)? Some models give you an explicit density you can evaluate. Others can only sample — they generate realistic data, but you can't ask "how likely is this particular image?"
These models define a parametric form for p(x) and let you compute it directly for any input x.
Tractable density: You can compute p(x) exactly, in closed form. Autoregressive models achieve this by factoring p(x) into a product of conditionals using the chain rule. The computation is exact — no approximation.
Approximate density: The model defines p(x) implicitly through a latent variable z, but computing p(x) = ∫ p(x|z)p(z)dz requires an intractable integral. Variational Autoencoders (VAEs) handle this by optimizing a lower bound instead.
These models don't compute p(x) at all. They learn to transform noise into realistic samples without ever defining a density function.
Direct sampling: Generative Adversarial Networks (GANs) learn a generator that maps random noise z → x. You can sample, but you can't evaluate p(x) for a given x.
Iterative sampling: Diffusion models start from noise and iteratively denoise. Each step moves toward the data distribution, producing high-quality samples through many small refinements.
Tractable density models (autoregressive) give you exact likelihoods but generate slowly (one element at a time). Implicit density models (GANs, diffusion) generate fast or high-quality samples but can't tell you p(x). VAEs sit in the middle: they give you a lower bound on p(x) and can both generate and evaluate, but the samples tend to be blurrier. No free lunch.
| Model Family | Density | Sample Quality | Speed | Example |
|---|---|---|---|---|
| Autoregressive | Exact p(x) | Good | Slow (sequential) | PixelCNN, GPT |
| VAE | Lower bound | Blurry | Fast (one pass) | VAE, VQ-VAE |
| GAN | None | Sharp | Fast (one pass) | StyleGAN |
| Diffusion | Lower bound | Excellent | Slow (iterative) | DDPM, Stable Diffusion |
Autoregressive models came first — language models have been autoregressive since the n-gram days. VAEs (Kingma & Welling, 2013) showed that variational inference could scale with deep nets. GANs (Goodfellow et al., 2014) stunned the field with sharp samples but were notoriously unstable to train. Diffusion models (Ho et al., 2020) emerged later and combined the training stability of VAEs with sample quality surpassing GANs. Each approach was motivated by a different answer to the question "how do you handle the intractable density?"
An unobserved random variable z that captures hidden structure in the data. You never see z directly — you only observe x. The model assumes x was generated from z: first sample z ~ p(z), then sample x ~ p(x|z). The latent z might represent pose, lighting, identity, or other factors of variation.
"Generative model" doesn't mean "model that generates images." A generative model is any model that learns a distribution p(x) or p(x,y). An autoregressive language model is a generative model. A Gaussian mixture model is a generative model. Image generation is just the most visually impressive application.
Before we build any specific generative model, we need a training objective. Every model in this lecture will use the same principle: maximum likelihood estimation (MLE). The idea is simple — find the parameters that make the training data most probable.
You have training data {x(1), x(2), ..., x(N)} drawn i.i.d. from some unknown true distribution pdata(x). You have a parametric model p(x; W) with learnable parameters W. Goal: find W* that makes p(x; W) as close to pdata(x) as possible.
Products of N probabilities are numerically terrible. Each p(x(i)) is between 0 and 1, so multiplying millions of them gives a number astronomically close to zero — underflow. The fix: take the logarithm. Since log is monotonically increasing, maximizing the log of the product gives the same W*.
This sum of log-probabilities is called the log-likelihood. In practice, we minimize the negative log-likelihood (NLL), because gradient descent minimizes losses:
Minimizing the NLL is equivalent to minimizing the KL divergence between the data distribution and the model:
DKL(pdata || pmodel) = 𝔼x~pdata[log pdata(x) − log pmodel(x; W)]
The first term (log pdata) is a constant with respect to W. So minimizing DKL is equivalent to maximizing 𝔼x~pdata[log pmodel(x; W)] — which is exactly the log-likelihood, estimated by averaging over training samples.
The cross-entropy between pdata and pmodel is:
H(pdata, pmodel) = −𝔼x~pdata[log pmodel(x; W)] = NLL
You already know cross-entropy from classification! When you train a softmax classifier with cross-entropy loss, you're doing maximum likelihood on p(y|x). Now we're doing the same thing on p(x).
Model: p(x; μ, σ) = (1/σ√(2π)) exp(−(x−μ)2 / 2σ2). Data: {2, 3, 5}.
Log-likelihood: ∑i log p(xi) = ∑i [−log(σ√(2π)) − (xi−μ)2 / 2σ2]
Take derivative w.r.t. μ, set to 0: μ* = (2+3+5)/3 = 10/3 ≈ 3.33. The MLE mean is the sample mean — exactly what intuition says.
Take derivative w.r.t. σ: σ*2 = (1/3)∑(xi−μ*)2 ≈ 1.56. The MLE variance is the sample variance.
MLE recovers the obvious statistics because the Gaussian is a member of the exponential family. For neural networks, there's no closed-form solution — we use gradient descent on the NLL.
Maximum likelihood is remarkably general. It doesn't care about the model architecture. Autoregressive model? Maximize the sum of conditional log-probs. VAE? Maximize a lower bound on the log-likelihood. Diffusion model? Also a lower bound. The architectures differ; the principle is the same: make the training data likely.
Here's the most elegant trick in generative modeling. We want p(x) where x = (x1, x2, ..., xT) is a sequence of values. The chain rule of probability says we can always decompose any joint distribution as:
The key insight: this is not an approximation. It's an identity that holds for any distribution. If you can model each conditional p(xt|x<t) accurately, you have the exact density p(x). No tricks, no bounds, no approximations.
You've already seen this. An RNN or Transformer language model predicts the next token given all previous tokens. That's exactly p(xt|x1,...,xt−1). GPT, LLaMA, every LLM — they're all autoregressive generative models over text.
To generate: sample x1 ~ p(x1), then x2 ~ p(x2|x1), then x3 ~ p(x3|x1,x2), and so on. Each token is sampled conditioned on all previous tokens.
Can we do the same for images? Treat the image as a sequence of pixel values in scanline order — left to right, top to bottom. Each pixel has R, G, B subpixels, each taking values 0–255. Model:
PixelRNN (van den Oord et al., 2016) models each conditional with an LSTM. Problem: generation is sequential — each pixel waits for the previous one. A 256×256 image has ~196K subpixels. Extremely slow.
PixelCNN uses masked convolutions instead. The convolution kernel is masked so that each pixel only sees pixels above and to the left — preserving the autoregressive ordering. Training is parallel (all pixels computed at once), but generation is still sequential.
A convolution filter where weights corresponding to "future" pixels (below and to the right in scanline order) are set to zero. This ensures each output pixel only depends on previously generated pixels, maintaining the autoregressive property. During training, all positions can be computed in parallel because the input is known.
Consider a 3×3 convolution kernel. In a normal convolution, all 9 weights are used. In a masked convolution for autoregressive generation:
Mask type A (first layer): zero out the center pixel and everything below/right. Only the top row and the left cell of the middle row have nonzero weights. The center pixel cannot see itself.
Mask type B (subsequent layers): same as A, but the center pixel IS included. This is safe because in deeper layers, the center pixel's feature already excludes "future" information (it was masked in layer 1).
Result: a 5-layer masked CNN has a receptive field covering many previous pixels, but zero information leakage from future pixels. Training runs at the speed of a normal CNN — all positions are computed in one forward pass.
Generating a 256×256 RGB image (196,608 subpixels):
PixelRNN: 196,608 sequential LSTM steps. At ~500 steps/sec on a GPU: ~6.5 minutes per image.
PixelCNN: 196,608 sequential forward passes (each is fast due to convolutions): ~2 minutes per image.
VQ-VAE + Transformer: Encode to 32×32 = 1,024 tokens. Generate 1,024 tokens autoregressively: ~0.5 seconds. Then decode tokens to pixels in one pass. 780x speedup over PixelRNN.
A 1024×1024 image has 3 × 1024 × 1024 ≈ 3.1 million subpixels. Even at 1000 subpixels per second, that's 52 minutes per image. This is why modern autoregressive image models work on discrete latent tokens instead of raw pixels: VQ-VAE encodes the image into a much smaller grid of tokens (e.g., 32×32), and an autoregressive transformer generates those tokens. 1024 tokens instead of 3M subpixels.
The same idea applies to audio. WaveNet (van den Oord et al., 2016) models raw audio waveforms autoregressively at 16kHz — 16,000 samples per second. It uses dilated causal convolutions with exponentially increasing dilation rates (1, 2, 4, 8, ..., 512) to capture long-range dependencies without prohibitive computation. The result: the most natural-sounding text-to-speech of its era.
Image x has 4 pixels: x1, x2, x3, x4 (top-left, top-right, bottom-left, bottom-right). Each pixel is 0 or 1 (binary for simplicity).
Step 1: Model p(x1). Say p(x1=1) = 0.3. Sample: x1 = 0.
Step 2: Model p(x2|x1=0). Say p(x2=1|x1=0) = 0.7. Sample: x2 = 1.
Step 3: Model p(x3|x1=0, x2=1). Sample: x3 = 1.
Step 4: Model p(x4|x1=0, x2=1, x3=1). Sample: x4 = 0.
Total likelihood: p(x) = p(x1=0) × p(x2=1|x1=0) × p(x3=1|0,1) × p(x4=0|0,1,1).
We can compute this exactly. That's the power of autoregressive models.
During training, we know all the pixels (they're in the training set). So we can compute all the conditionals in parallel using masked convolutions — one forward pass gives us the loss for all T positions simultaneously. During generation, we must sample sequentially: each pixel depends on the previous ones. This asymmetry is fundamental to autoregressive models.
Before we get to VAEs, we need to understand their non-variational ancestor. An autoencoder learns useful features from data without labels. The idea is beautifully simple: force the network to compress data through a bottleneck, then reconstruct it.
An autoencoder has two parts:
Encoder fθ: maps input x to a low-dimensional representation z = fθ(x). This z is called the latent code or bottleneck.
Decoder gφ: maps z back to a reconstruction x̂ = gφ(z). The goal is x̂ ≈ x.
If z has fewer dimensions than x (say, z ∈ ℝ32 while x ∈ ℝ784 for MNIST), the encoder must discover the most important features — it can't memorize every pixel. The bottleneck forces compression, and compression forces understanding.
After training, throw away the decoder. The encoder maps any input to a compact feature vector z. Attach a small classifier on top of z for downstream tasks. This is unsupervised pretraining — learn features from unlabeled data, then fine-tune with a few labels.
Input: 28×28 = 784 pixel values. Encoder: 784 → 256 → 32 (z has 32 dimensions). Decoder: 32 → 256 → 784.
After training, z captures the "essence" of each digit: stroke thickness, slant, roundness. Two 7s that look different have similar z vectors because they encode the same digit concept. The autoencoder discovered digit-relevant features without ever seeing a label.
Tempting idea: sample a random z, feed it to the decoder, get a new image. This doesn't work well.
The encoder maps training images to z vectors, but there's no constraint on where those vectors end up. The 7s might cluster around z = (3.2, −1.7, ...) and the 4s around z = (−5.1, 8.3, ...), with vast empty regions in between. If you sample a random z from those empty regions, the decoder has never seen anything like it — it produces garbage. The latent space is unstructured.
The fix: force z to come from a known, structured distribution. That's the core idea behind the Variational Autoencoder.
A vanilla autoencoder learns a deterministic mapping x → z → x̂. It's a feature extractor, not a density estimator. It doesn't define p(x), it can't compute likelihoods, and sampling from its latent space is unreliable. The VAE adds probabilistic structure to fix all three problems.
Several autoencoder variants address different goals:
Denoising autoencoder: Corrupt the input (add noise, mask pixels) and train the network to reconstruct the clean original. This forces the encoder to learn robust features that capture the signal, not the noise. The denoising objective is closely related to score matching — a connection that inspired diffusion models years later.
Sparse autoencoder: Add an L1 penalty on z to encourage most latent dimensions to be zero. Each image activates only a few features, producing sparse, interpretable representations.
Contractive autoencoder: Penalize the Frobenius norm of the Jacobian ∂z/∂x. This makes the encoder insensitive to small input perturbations, learning features that capture meaningful variation and ignore noise.
Train an autoencoder on MNIST. The encoder maps each "7" to z-values near [2.1, −3.4], and each "0" to z-values near [−5.2, 1.8]. The region around [0, 0] has no training data nearby.
If you sample z = [0, 0] and decode, you get an unrecognizable blob. The decoder was never trained on inputs from this region. There's no "interpolation guarantee" — the decoder only works near the training data manifold in z-space.
VAEs fix this by regularizing ALL encoded distributions toward N(0, I), ensuring the latent space is dense — every region near the origin has training data mapped to it.
The Variational Autoencoder (Kingma & Welling, 2013) takes the autoencoder skeleton and makes it probabilistic. Instead of deterministic z = f(x), both the encoder and decoder output distributions.
Assume each data point x(i) was generated by a two-step process:
The latent z captures unobserved factors of variation — identity, pose, lighting, expression. Each z produces a different image through the decoder. The prior p(z) = N(0, I) says: before seeing any data, all z values are equally likely (centered, unit variance, independent dimensions).
We want to find θ that maximizes pθ(x) for our training data. To compute pθ(x), marginalize out z:
The integral ∫ pθ(x|z) p(z) dz sums over every possible z — an infinite-dimensional integral when z is continuous. We can't compute it analytically (pθ(x|z) is a neural network, not a conjugate distribution). We can't estimate it by Monte Carlo either: most z values produce x values nothing like our target, so random sampling would need astronomically many samples to get a useful estimate.
We could try Bayes' rule: pθ(z|x) = pθ(x|z) p(z) / pθ(x). But computing this requires pθ(x), which is the intractable integral we're trying to avoid. Circular.
Given a data point x, which z values could have produced it? That's the posterior pθ(z|x). We need it for training (to know where to "look" in z-space), but computing it requires the intractable marginal pθ(x). We're stuck.
Since we can't compute pθ(z|x), we'll learn an approximation. Introduce an encoder network qφ(z|x) that takes a data point x and outputs a distribution over z. We'll train qφ to be as close to the true posterior pθ(z|x) as possible.
Two reasons. First, the KL divergence between two Gaussians has a closed-form solution — no sampling needed to compute it. Second, we can use the reparameterization trick (Chapter 8) to backpropagate through Gaussian sampling. Other distribution families lack one or both of these properties. The Gaussian choice is pragmatic, not fundamental.
In classical variational inference, you optimize a separate set of variational parameters for each data point. The VAE amortizes this: one encoder network qφ(z|x) handles all data points. Feed in any x, get an approximate posterior. This makes inference scale to millions of data points.
Input x is a 28×28 MNIST digit "3". Encoder processes x through conv layers and outputs two vectors:
μφ(x) = [1.2, −0.5, 0.8, ...] (32 values, the mean)
log σφ2(x) = [−1.1, −0.3, −2.0, ...] (32 values, log-variance for numerical stability)
The approximate posterior is a 32-dimensional Gaussian centered at μ with diagonal covariance σ2. To get a z, sample from this Gaussian. Different samples give slightly different z values, all near the mean — capturing the "3-ness" of the input with some uncertainty.
This is the mathematical heart of the VAE. We can't compute log pθ(x) directly, so we derive a lower bound that we can compute and optimize. This bound is called the Evidence Lower Bound (ELBO).
For any z, Bayes' rule gives us:
Take the log of both sides:
The right side has pθ(z|x) which we can't compute. Introduce qφ(z|x) by adding and subtracting log qφ(z|x):
Rearrange by grouping terms:
The left side, log pθ(x), doesn't depend on z. So if we take the expectation of both sides over z ~ qφ(z|x), the left side is unchanged:
The second term is a KL divergence (definition: DKL(q||p) = 𝔼q[log q/p]):
The third term, DKL(qφ(z|x) || pθ(z|x)), measures how far our approximate posterior is from the true posterior. We can't compute it (it involves the intractable pθ(z|x)). But we know that KL divergence is always ≥ 0. So:
DKL(q||p) = 𝔼q[log(q/p)] = −𝔼q[log(p/q)] ≥ −log(𝔼q[p/q]) = −log(∫ q · p/q dz) = −log(∫ p dz) = −log(1) = 0.
The inequality is Jensen's: for concave log, 𝔼[log X] ≤ log 𝔼[X]. Equality only when p = q everywhere.
Reconstruction: Sample z from the encoder, decode it, measure how well the decoder reconstructs the original x. This is the familiar autoencoder loss — make x̂ look like x.
Regularization: Push the encoder's output distribution qφ(z|x) toward the prior N(0, I). This prevents the encoder from encoding each data point at an arbitrary location in z-space. It forces the latent space to be smooth and structured — nearby z values should decode to similar images.
If we maximize only the reconstruction term, we get a vanilla autoencoder — each data point gets its own corner of z-space, and the latent space has no structure for generation. If we maximize only the regularization term, the encoder ignores x entirely and outputs N(0, I) for everything — perfect prior match but zero information. The ELBO forces a balance: encode enough information to reconstruct, but keep the encoding close to the prior.
When q = N(μ, σ2I) and p = N(0, I) (both diagonal), the KL divergence has a closed form:
DKL = −(1/2) ∑j=1J [1 + log(σj2) − μj2 − σj2]
where J is the latent dimension. Each latent dimension contributes independently. If μj = 0 and σj = 1, that dimension contributes 0 to the KL — it already matches the prior. Deviations in mean or variance increase the KL penalty.
Encoder outputs μ = 2.0, σ = 0.5 for one dimension.
DKL = −0.5 × [1 + log(0.25) − 4.0 − 0.25]
= −0.5 × [1 + (−1.386) − 4.0 − 0.25]
= −0.5 × [−4.636]
= 2.318
The mean is far from 0 (μ = 2) and the variance is too small (σ = 0.5 vs 1.0), so the KL penalty is high. The ELBO loss will push μ toward 0 and σ toward 1.
Recall: log pθ(x) = ELBO + DKL(qφ || pθ(z|x)). Maximizing the ELBO simultaneously (1) increases log pθ(x) — making the model assign higher probability to training data, and (2) decreases DKL(qφ || pθ(z|x)) — making the approximate posterior more accurate. One objective, two benefits.
We have the ELBO. We know what to optimize. But there's a computational snag: the reconstruction term 𝔼z~qφ[log pθ(x|z)] requires sampling z from qφ(z|x), and sampling is not differentiable.
To estimate the reconstruction term, we sample z ~ N(μφ(x), σφ2(x)). But "sample from a distribution" is a stochastic operation — there's no gradient of "pick a random number" with respect to the distribution's parameters. We can't backpropagate through the sampling step.
The gradient ∂z/∂μ is undefined when z is drawn randomly from N(μ, σ2). Changing μ changes the distribution z comes from, but any particular sample z doesn't have a smooth dependence on μ. It's like asking "how does the mean of a die change which face comes up?" The question doesn't make sense — the randomness "breaks" the chain of differentiable operations.
Instead of sampling z directly from N(μ, σ2), decompose it as:
Now z is a deterministic function of μ, σ, and the external noise ε. The randomness comes from ε, which doesn't depend on φ. The dependence of z on φ flows through the deterministic path μ + σ ⊙ ε, which is differentiable.
Think of it this way. "Sample z from a Gaussian with learnable mean μ" is like saying "throw a dart at a target, where the target position is μ." You can't differentiate through the dart throw. But "take a fixed random offset ε and add it to μ" is z = μ + ε — and now dz/dμ = 1. The randomness is the same, but the gradient can flow.
Input x is a digit "7". Encoder outputs μ = [1.5, −0.3] and log σ2 = [−0.5, −1.0].
Step 1: Compute σ = exp(0.5 × log σ2) = [exp(−0.25), exp(−0.5)] = [0.779, 0.607].
Step 2: Sample ε ~ N(0, I). Say ε = [0.42, −1.10].
Step 3: z = μ + σ ⊙ ε = [1.5 + 0.779 × 0.42, −0.3 + 0.607 × (−1.10)] = [1.827, −0.968].
Step 4: Decode z to get x̂. Compute reconstruction loss ||x̂ − x||2.
Step 5: Compute KL term from μ and σ (closed form).
Step 6: Backprop through everything. Gradients flow through z = μ + σ ⊙ ε to update μ and σ (and hence φ).
The reconstruction term 𝔼z~q[log pθ(x|z)] is estimated by sampling a single z per data point: ≈ log pθ(x|z1) where z1 = μ + σ ⊙ ε1. One sample! In practice, the minibatch average over M data points provides enough variance reduction. More samples per data point barely helps — better to process more data points.
Consider the loss L = (z − c)2 where z = μ + σε and c is a target. Then:
∂L/∂μ = 2(z − c) · ∂z/∂μ = 2(z − c) · 1 = 2(μ + σε − c)
∂L/∂σ = 2(z − c) · ∂z/∂σ = 2(z − c) · ε = 2(μ + σε − c) · ε
Both gradients are well-defined and computable. The chain rule flows cleanly through z = μ + σε because ε is treated as a constant (like input data).
Time to see everything come together. This interactive demo simulates a 2D VAE: the encoder maps data points to a 2D latent space, and the decoder maps latent points back to data. You can watch the latent space organize during training, sample new points, and explore the reconstruction-regularization trade-off.
The β-VAE (Higgins et al., 2017) scales the KL term: ELBO = reconstruction − β × DKL. When β = 0: pure autoencoder, no regularization, latent space is unstructured. When β = 1: standard VAE, balanced trade-off. When β > 1: extra pressure toward the prior, more disentangled latent representations but blurrier reconstructions. Try dragging the slider to see the effect.
β = 0 (autoencoder): Clusters are tight and separated. Each class occupies its own island. If you click between islands, the decoder produces garbage — the latent space is empty there.
β = 1 (standard VAE): Clusters overlap more, filling the space. The Gaussian prior pushes everything toward the center. Clicking between clusters produces smooth interpolations. The latent space is continuous.
β > 1 (over-regularized): Everything collapses toward N(0, I). The classes are mixed together. Sampling produces "average" outputs — blurry, generic, lacking distinctiveness.
Encode a "3" to zA = (1.2, 0.8). Encode an "8" to zB = (−0.4, −1.1). Interpolate: zmid = 0.5 × zA + 0.5 × zB = (0.4, −0.15). Decode zmid: you get something that looks partway between a 3 and an 8. This smooth interpolation only works because the KL regularization filled in the gaps between clusters.
The decoder minimizes ||x̂ − x||2. For a given z, multiple x values could be valid (a "3" can be drawn many ways). The L2-optimal output is the pixel-wise mean of all valid x values — which is blurry. It's averaging over possibilities rather than committing to one. GANs fix this by using a discriminator instead of L2 loss. Diffusion models fix it by generating iteratively.
| Model | Density | Training | Generation | Quality |
|---|---|---|---|---|
| Autoregressive | Exact p(x) | Parallel (teacher forcing) | Sequential (slow) | High fidelity |
| VAE | ELBO lower bound | Single forward + backward | One pass (fast) | Blurry (L2) |
| GAN | None (implicit) | Adversarial (unstable) | One pass (fast) | Sharp |
| Diffusion | ELBO lower bound | Simple (denoise) | Iterative (slow) | Excellent |
Chapter 1: Discriminative models learn p(y|x). Generative models learn p(x). The Bayes connection links them.
Chapter 2: Generative models split into explicit density (autoregressive, VAE) and implicit density (GAN, diffusion).
Chapter 3: Maximum likelihood — the universal training objective. Equivalent to minimizing KL divergence and cross-entropy.
Chapter 4: Autoregressive models factor p(x) via the chain rule. Exact density, but slow generation. PixelCNN, WaveNet, GPT.
Chapter 5: Autoencoders learn features via compression. Not generative — latent space is unstructured.
Chapters 6–8: VAEs make the autoencoder probabilistic. The ELBO = reconstruction + KL regularization. Reparameterization trick enables backprop through sampling.
Blurry outputs: L2 reconstruction averages over modes. GANs and diffusion models produce much sharper samples.
Posterior collapse: With powerful decoders (e.g., autoregressive), the model learns to ignore z entirely. The KL term goes to zero (q matches the prior), and z carries no information. The decoder generates everything from scratch, defeating the purpose of the latent space.
Limited expressiveness: Diagonal Gaussian posteriors qφ(z|x) can't capture complex, multi-modal posteriors. The true p(z|x) might have multiple peaks, but our approximation is unimodal.
VQ-VAE: Replace the continuous Gaussian latent with discrete tokens. No KL collapse, no blurriness. Pair with an autoregressive model over tokens for state-of-the-art generation. See the VAE & VQ-VAE lesson.
Diffusion Models: Instead of compressing to a bottleneck, gradually add noise until data becomes pure Gaussian, then learn to reverse each step. Combines the stable training of VAEs with exceptional sample quality. See the Diffusion Models and Flow Matching lessons.
GANs: Coming in Part 2 of this lecture. Instead of maximizing a likelihood bound, train a generator and discriminator in a minimax game. Sharp samples, but training instability. See the GAN lesson.
| Property | Autoregressive | VAE |
|---|---|---|
| Density | Exact p(x) via chain rule | Lower bound (ELBO) only |
| Latent space | None (implicit via hidden states) | Explicit z with semantic meaning |
| Generation speed | Slow: O(T) sequential steps | Fast: one forward pass through decoder |
| Training | Teacher forcing, parallelizable | ELBO maximization, single pass |
| Interpolation | Not straightforward | Smooth interpolation in z-space |
| Sample quality | High fidelity (no averaging) | Blurry (L2 averages modes) |
| Controllability | Difficult (condition via prompts) | Easy (manipulate z dimensions) |
| Year | Paper | Contribution |
|---|---|---|
| 2013 | Kingma & Welling, "Auto-Encoding Variational Bayes" | VAE framework, reparameterization trick |
| 2014 | Rezende et al., "Stochastic Backpropagation" | Independent VAE derivation |
| 2016 | van den Oord et al., "Pixel Recurrent Neural Networks" | PixelRNN, PixelCNN |
| 2016 | van den Oord et al., "WaveNet" | Autoregressive audio with dilated convolutions |
| 2017 | Higgins et al., "β-VAE" | Disentangled representations via KL scaling |
| 2017 | van den Oord et al., "VQ-VAE" | Discrete latents, no posterior collapse |
Generative models learn the data distribution p(x), and the VAE does it by learning a lower bound: compress data into a Gaussian latent space, reconstruct it, and balance fidelity against structure. Everything else is architecture.