The Complete Beginner's Path

Understand VAE / VQ-VAE

The compression engines behind image generation. Learn how neural nets learn to encode, quantize, and reconstruct the visual world.

Prerequisites: Neural network basics + Probability intuition. That's it.
10 chapters · 8+ interactives · 0 assumed knowledge

Chapter 0: Why Latent Spaces?

A 256×256 colour image has 196,608 numbers. But most of those numbers are redundant: smooth patches, repeated textures, predictable edges. The real information in an image — its content, composition, style — can be captured by far fewer numbers. A latent space is where those compressed essentials live.

The encoder-decoder idea is simple: an encoder squeezes the input into a small latent vector z, and a decoder tries to reconstruct the original from z alone. If the reconstruction is good, z must contain the essence of the data.

Input x
256×256×3 = 196,608 dims
↓ Encoder
Latent z
64 dims (compressed essence)
↓ Decoder
Reconstruction x̂
196,608 dims (recovered)
The core idea: High-dimensional data lives on a low-dimensional manifold. An image of a face can be described by a few dozen numbers: skin tone, eye shape, hair length, pose. The latent space learns to discover these factors automatically.
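The arithmetic behind this compression is easy to check directly (the 64-dim latent is the example value from the diagram above):

```python
# Sizes from the example above: a 256x256 RGB image
# compressed to a 64-dimensional latent vector.
pixels = 256 * 256 * 3           # 196,608 numbers in the raw image
latent_dims = 64                 # numbers in the compressed code z

ratio = pixels / latent_dims     # how many raw values each latent number stands in for
print(pixels, latent_dims, ratio)  # 196608 64 3072.0
```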
Compression Ratio

Drag the latent dimension to see how much compression we achieve. The bar shows what fraction of the original information we keep.

Latent dims: 64
Check: Why do we compress data into a latent space?

Chapter 1: Autoencoders — The Bottleneck

A plain autoencoder trains two neural networks end-to-end: the encoder f and decoder g. The loss is simply the reconstruction error: how different is g(f(x)) from x? The bottleneck — a narrow latent layer — forces compression.

It works. But there's a problem: the latent space is messy. Points are scattered unpredictably. If you pick a random point in latent space and decode it, you get garbage. The autoencoder memorizes efficient codes but doesn't organize them.

L = ||x − Decoder(Encoder(x))||²
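A minimal NumPy sketch of the bottleneck, using untrained random linear maps (the dimensions are illustrative, not from any real model); the point is the shapes and the loss, not reconstruction quality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear autoencoder with a narrow bottleneck.
d_in, d_latent = 784, 16          # e.g. a flattened 28x28 image -> 16-dim latent
W_enc = rng.normal(0, 0.01, (d_latent, d_in))   # encoder f
W_dec = rng.normal(0, 0.01, (d_in, d_latent))   # decoder g

x = rng.normal(size=d_in)
z = W_enc @ x                     # compress: 784 -> 16
x_hat = W_dec @ z                 # reconstruct: 16 -> 784

loss = np.sum((x - x_hat) ** 2)   # L = ||x - g(f(x))||^2
print(z.shape, x_hat.shape, round(loss, 2))
```

Training would minimize this loss over a dataset; the bottleneck width `d_latent` is what forces the compression.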
Latent Space: Organized vs Messy

Left: messy autoencoder latent space — clusters with gaps. Right: organized VAE latent space — smooth and continuous. Click to regenerate.

Key problem: Autoencoders are good at compression but bad at generation. You can't sample new images because you don't know which latent codes are "valid." The VAE fixes this by forcing structure on the latent space.
Check: What's the main limitation of a plain autoencoder?

Chapter 2: The Variational Trick

The key insight of the VAE: instead of encoding x to a single point z, encode it to a distribution — specifically, a Gaussian with mean μ and variance σ². Then sample z from that distribution. This forces the latent space to be smooth.

But wait: sampling is not differentiable. How do we backpropagate through a random number? The reparameterization trick: instead of z ~ N(μ, σ²), write z = μ + σ · ε where ε ~ N(0,1). Now the randomness (ε) is external, and gradients flow through μ and σ.

z = μ + σ · ε      ε ~ N(0, 1)
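The trick is a one-liner in code. A NumPy sketch (the μ = 2.3, σ = 0.5 values echo the diagram below):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, sigma, rng):
    """z = mu + sigma * eps, with eps ~ N(0, 1) supplying all the randomness."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

# Draw many samples from N(2.3, 0.5^2) via the trick.
mu, sigma = 2.3, 0.5
z = reparameterize(np.full(100_000, mu), sigma, rng)

# The samples match the target distribution.
print(round(z.mean(), 2), round(z.std(), 2))  # ≈ 2.3 and ≈ 0.5
```

Because `mu` and `sigma` enter through ordinary arithmetic, an autograd framework can differentiate straight through them.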
Encoder Output
μ = 2.3, σ = 0.5
↓ sample ε ~ N(0,1)
Reparameterize
z = 2.3 + 0.5 × ε
Decoder
Reconstruct from z
Reparameterization Trick

Adjust μ and σ. Each frame samples a new ε and shows the resulting z. The orange curve is the distribution; teal dots are samples.

Mean μ: 1.0
Std dev σ: 0.8
Why it matters: The reparameterization trick is what makes VAEs trainable. Without it, you can't backpropagate through the sampling step. It's one of the cleverest tricks in modern deep learning.
Check: What does the reparameterization trick achieve?

Chapter 3: ELBO — The Training Objective

We want to maximize the likelihood of our data: p(x). But computing p(x) directly requires integrating over all possible latent codes z, which is intractable. Instead, we optimize a lower bound on log p(x): the Evidence Lower Bound (ELBO).

The ELBO has two terms: a reconstruction term (how well can we decode z back to x?) and a KL divergence term (how close is our learned distribution q(z|x) to the prior p(z) = N(0,1)?). The reconstruction term wants good decoding; the KL term wants an organized latent space.

ELBO = E[log p(x|z)] − KL(q(z|x) || p(z))
Reconstruction term: "Decode z and get back x." Encourages faithful reconstruction. Under a Gaussian or Bernoulli decoder, this reduces to MSE or BCE loss.
KL term: "Keep q(z|x) close to N(0,1)." Prevents the encoder from cheating by using a tiny region of latent space.
ELBO Decomposition

Adjust the balance between reconstruction quality and KL penalty to see the tradeoff. A heavier KL weight gives an organized latent space but blurrier reconstructions; a heavier reconstruction weight gives sharp images but a messier latent space.

Recon weight: 5.0
KL weight: 5.0
Intuition: The ELBO is a tug-of-war. The reconstruction loss pulls the encoder to memorize every detail. The KL loss pulls it toward a smooth, standard Gaussian. The balance between them determines the character of the latent space.
Check: What are the two components of the ELBO?

Chapter 4: Training a VAE

In practice, the KL term has a closed-form solution for Gaussians. For each latent dimension j, the KL divergence is: ½(μⱼ² + σⱼ² − log σⱼ² − 1). This is cheap to compute and differentiable.

The big practical knob is β: a weight on the KL term. With β=1, you get the standard VAE (ELBO). With β>1, you get β-VAE, which forces more disentanglement at the cost of blurrier reconstructions. With β<1, you get sharper images but messier latent space.

L = ||x − x̂||² + β · Σⱼ ½(μⱼ² + σⱼ² − log σⱼ² − 1)
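Both terms fit in a few lines of NumPy. This is a sketch of the β-weighted objective, parameterized by log σ² as is common in practice (an implementation choice, not stated above):

```python
import numpy as np

def kl_closed_form(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over latent dimensions.

    Per dimension: 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1)."""
    return 0.5 * np.sum(mu**2 + np.exp(log_var) - log_var - 1)

def beta_vae_loss(x, x_hat, mu, log_var, beta=1.0):
    recon = np.sum((x - x_hat) ** 2)          # reconstruction term
    return recon + beta * kl_closed_form(mu, log_var)

# Sanity check: if q(z|x) already equals the prior N(0, 1), the KL is exactly 0.
print(kl_closed_form(np.zeros(4), np.zeros(4)))  # 0.0
```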
Beta Slider: Sharpness vs Structure

Low β: sharp but unstructured. High β: blurry but well-organized latent space. Watch how the "reconstruction" and "organization" bars respond.

Beta β: 1.00
β value | Reconstruction | Latent structure | Use case
β < 1 | Sharp | Messy | When quality matters most
β = 1 | Balanced | Good | Standard VAE (ELBO)
β > 1 | Blurry | Disentangled | β-VAE for interpretable factors
KL annealing: A common trick: start with β=0 (a pure autoencoder) and slowly raise it during training. This lets the model learn useful codes before the KL term can pull the posterior into the prior (posterior collapse).
Check: What does increasing β do?

Chapter 5: VQ-VAE — Discrete Codes

What if latent codes were discrete instead of continuous? VQ-VAE replaces the Gaussian latent space with a codebook: a dictionary of K learned vectors. The encoder outputs a continuous vector, then it's snapped to the nearest codebook entry. This is vector quantization.

The decoder only ever sees codebook entries, not the raw encoder output. The result: a finite set of "visual words" that the decoder can reconstruct from. This is how images get turned into token sequences — the key to using transformers for vision.

Encoder
x → ze (continuous)
↓ nearest-neighbor lookup
Codebook
zq = eₖ, where k = argminⱼ ||ze − eⱼ||²
Decoder
zq → x̂
Vector Quantization

Blue dots are codebook entries. Drag the orange point (encoder output) and watch it snap to the nearest codebook entry. The green line shows the assignment.

L = ||x − x̂||² + ||sg[ze] − e||² + β ||ze − sg[e]||²

The loss has three parts: reconstruction, codebook loss (move codebook entries toward encoder outputs, with stop-gradient sg), and commitment loss (keep encoder outputs near codebook entries).

Stop-gradient trick: Nearest-neighbor lookup is not differentiable. The solution: copy gradients from decoder input straight to encoder output, bypassing the quantization. Called the "straight-through estimator."
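A NumPy sketch of the nearest-neighbor lookup (the codebook and batch sizes are made up for illustration); the straight-through part appears only as a comment, since plain NumPy has no autograd:

```python
import numpy as np

rng = np.random.default_rng(0)

K, d = 8, 2                         # toy sizes: 8 codebook entries, 2-dim codes
codebook = rng.normal(size=(K, d))  # e_1 ... e_K (learned in a real model)

def quantize(z_e, codebook):
    """Snap each encoder output to its nearest codebook entry."""
    # dists[i, k] = ||z_e[i] - e_k||^2, via broadcasting
    dists = np.sum((z_e[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = np.argmin(dists, axis=1)
    return codebook[idx], idx

z_e = rng.normal(size=(5, d))       # a batch of encoder outputs
z_q, idx = quantize(z_e, codebook)

# Straight-through estimator, in an autograd framework:
#   z_q = z_e + stop_gradient(z_q - z_e)
# The forward pass uses z_q; gradients flow to z_e as if quantization were identity.
print(idx, z_q.shape)
```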
Check: In VQ-VAE, what replaces the Gaussian latent space?

Chapter 6: Codebook Learning

The codebook is only useful if all entries are active. A common failure: codebook collapse — the encoder only uses a handful of entries while the rest go "dead." This wastes representational capacity.

Exponential Moving Average (EMA) updates are a popular fix. Instead of updating codebook entries with gradient descent, track the running average of all encoder outputs assigned to each entry. This is faster and more stable.

eₖ ← γ · eₖ + (1 − γ) · mean(assigned ze)
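A simplified NumPy sketch of this update rule; real implementations also track smoothed usage counts per entry, which this version omits for clarity:

```python
import numpy as np

def ema_update(codebook, z_e, assignments, gamma=0.99):
    """Move each used codebook entry toward the mean of its assigned encoder outputs.

    e_k <- gamma * e_k + (1 - gamma) * mean(assigned z_e); no gradients needed."""
    new_codebook = codebook.copy()
    for k in range(len(codebook)):
        assigned = z_e[assignments == k]
        if len(assigned):                   # skip entries unused in this batch
            new_codebook[k] = gamma * codebook[k] + (1 - gamma) * assigned.mean(axis=0)
    return new_codebook

# Toy usage: all four encoder outputs map to entry 0; entry 1 is dead this batch.
codebook = np.zeros((2, 2))
z_e = np.ones((4, 2))
assignments = np.array([0, 0, 0, 0])
updated = ema_update(codebook, z_e, assignments, gamma=0.9)
print(updated[0], updated[1])  # entry 0 drifts toward the ones; entry 1 stays put
```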
Codebook Utilization

Each bar is a codebook entry. Height = usage count. Red entries are dead (unused). Watch how dead code revival redistributes them.

Strategy | How it works
EMA updates | Running average of assigned vectors; no gradient needed for codebook
Dead code revival | Replace unused entries with randomly sampled encoder outputs
Codebook reset | Periodically re-initialize low-usage entries from data (k-means style)
Larger codebook | More entries = finer granularity, but harder to keep all alive
Rule of thumb: Codebook utilization above 90% is healthy. Below 50% means half your representational capacity is wasted. Monitor this during training.
Check: What is "codebook collapse"?

Chapter 7: FSQ — Finite Scalar Quantization

VQ-VAE's codebook is elegant but fragile: you need to manage dead codes, tune commitment loss, and balance EMA rates. FSQ takes a radically simpler approach: instead of learning a codebook, just round each scalar to a small set of levels.

If each of d dimensions has L levels, you get Lᵈ possible codes — an implicit codebook. With d=6 and L=5, that's 5⁶ = 15,625 codes. No codebook parameters. No collapse. No commitment loss. Just rounding.

ẑᵢ = round(⌊Lᵢ/2⌋ · tanh(zᵢ))
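The whole quantizer is a couple of lines. This sketch uses the ⌊L/2⌋·tanh bounding from the FSQ paper with an odd L; the real method also applies a straight-through gradient through the rounding:

```python
import numpy as np

def fsq(z, levels=5):
    """Finite scalar quantization: bound each scalar with tanh, then round.

    For odd `levels` L, outputs are integers in [-(L//2), L//2]: L values per dim."""
    half = levels // 2
    return np.round(half * np.tanh(z))

z = np.array([-10.0, -0.5, 0.0, 0.3, 10.0])
print(fsq(z))       # values land in {-2, -1, 0, 1, 2} for levels=5

# Implicit codebook size: levels ** dims, with zero learned parameters.
print(5 ** 4)       # 625, matching the interactive below
```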
FSQ: Scalar Rounding

The continuous encoder output (left axis) is rounded to discrete levels (right axis). Adjust levels per dimension to see how granularity changes. More levels = finer representation.

Levels L: 5
Dimensions d: 4
Implicit codebook size: 5⁴ = 625
VQ-VAE: Learned codebook. Flexible but fragile. Needs EMA, dead code revival, commitment loss.
FSQ: Implicit codebook via rounding. Simple, stable, no collapse. Slightly less flexible.
Check: How does FSQ avoid codebook collapse?

Chapter 8: Image / Video Tokenizers

Modern generative models don't operate on raw pixels. They first tokenize images into discrete codes using a VQ-VAE (or FSQ), then model the distribution of codes using a transformer or diffusion model. This is the architecture behind Stable Diffusion, DALL-E, and MAGVIT.

The tokenizer compresses a 256×256 image to, say, a 32×32 grid of codebook indices. That's 1,024 tokens instead of 196,608 pixel values — a 192× compression. A transformer can then model these tokens autoregressively, just like words.
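The grid-to-sequence step is just a reshape. A NumPy sketch with made-up codebook indices (a real grid would come from the tokenizer's encoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical tokenizer output: a 32x32 grid of codebook indices (K = 1024 entries).
K = 1024
token_grid = rng.integers(0, K, size=(32, 32))

# Flatten row-major into the 1D sequence a transformer would model, like words.
tokens = token_grid.reshape(-1)
print(tokens.shape[0])                      # 1024 tokens

# Compression relative to raw 256x256x3 pixels:
print((256 * 256 * 3) // tokens.shape[0])   # 192x fewer values
```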

Image 256×256
196,608 pixel values
↓ VQ-VAE Encoder
Token Grid 32×32
1,024 codebook indices
↓ Transformer / Diffusion
Generate New Tokens
Model the token distribution
↓ VQ-VAE Decoder
New Image
Decode tokens back to pixels
System | Tokenizer | Generator | Year
DALL-E 1 | dVAE (discrete VAE) | Autoregressive transformer | 2021
Stable Diffusion | KL-regularized AE | Latent diffusion model | 2022
MAGVIT-2 | LFQ (lookup-free quantization) | Masked transformer | 2023
Cosmos | Causal VQ-VAE | Autoregressive + diffusion | 2024
Key insight: Tokenizer quality is the ceiling for generation quality. If the tokenizer can't reconstruct fine details, no amount of transformer magic can bring them back. This is why teams invest heavily in tokenizer design.
Check: Why do generative models use tokenizers instead of raw pixels?

Chapter 9: VAEs in the Wild

VAEs and VQ-VAEs aren't just academic exercises — they're load-bearing infrastructure in the biggest generative AI systems. Stable Diffusion's latent space? A VAE. DALL-E's image tokens? A VQ-VAE. And nearly every modern video generator builds on some flavor of temporal VAE or VQ-VAE.

Diffusion models operate in VAE latent space. The VAE compresses 512×512×3 to 64×64×4, making diffusion computationally tractable.
Flow matching models (Stable Diffusion 3, Flux) also use the VAE latent space. Same tokenizer, different generative backbone.
VAE Family Tree

How the ideas connect. The original autoencoder begat a family of models that now power every major generative AI system.

Application | VAE Variant | Role
Stable Diffusion | KL-VAE | Compress images to/from latent space
DALL-E 1 | dVAE | Convert images to discrete tokens
Sora | Spatiotemporal VAE | Tokenize video frames + motion
AudioLM | SoundStream (VQ-VAE) | Tokenize audio waveforms
Drug discovery | Molecular VAE | Smooth latent space for molecule optimization
"The art of compression is the art of understanding."
— paraphrase of Kolmogorov

You now understand latent spaces, variational inference, vector quantization, and how they power modern AI. Every generated image you see started as a latent code.