The compression engines behind image generation. Learn how neural nets learn to encode, quantize, and reconstruct the visual world.
A 256×256 colour image has 196,608 numbers. But most of those numbers are redundant: smooth patches, repeated textures, predictable edges. The real information in an image — its content, composition, style — can be captured by far fewer numbers. A latent space is where those compressed essentials live.
The encoder-decoder idea is simple: an encoder squeezes the input into a small latent vector z, and a decoder tries to reconstruct the original from z alone. If the reconstruction is good, z must contain the essence of the data.
Drag the latent dimension to see how much compression we achieve. The bar shows what fraction of the original information we keep.
A plain autoencoder trains two neural networks end-to-end: the encoder f and decoder g. The loss is simply the reconstruction error: how different is g(f(x)) from x? The bottleneck — a narrow latent layer — forces compression.
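To make the encoder/decoder/bottleneck idea concrete, here is a minimal sketch of a *linear* autoencoder trained on a single vector. All dimensions, weights, and the learning rate are made up for illustration; real autoencoders use nonlinear networks and batches of images.

```python
import numpy as np

# Linear autoencoder sketch: encoder f(x) = We @ x, decoder g(z) = Wd @ z,
# loss = ||g(f(x)) - x||^2. The 64 -> 8 bottleneck forces compression.
rng = np.random.default_rng(0)
d_in, d_latent = 64, 8
We = rng.normal(scale=0.1, size=(d_latent, d_in))  # encoder weights
Wd = rng.normal(scale=0.1, size=(d_in, d_latent))  # decoder weights
x = rng.normal(size=(d_in,))

lr = 1e-2
for step in range(500):
    z = We @ x                    # encode: squeeze into the bottleneck
    x_hat = Wd @ z                # decode: reconstruct from z alone
    err = x_hat - x               # reconstruction error
    # Gradients of the squared error (constant factor 2 folded into lr).
    Wd -= lr * np.outer(err, z)
    We -= lr * np.outer(Wd.T @ err, x)

loss = float(np.sum((Wd @ (We @ x) - x) ** 2))  # far below the initial error
```

Even this toy version shows the core mechanics: the only training signal is reconstruction error, and everything about x must survive the trip through the 8-number bottleneck.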
It works. But there's a problem: the latent space is messy. Points are scattered unpredictably. If you pick a random point in latent space and decode it, you get garbage. The autoencoder memorizes efficient codes but doesn't organize them.
Left: messy autoencoder latent space — clusters with gaps. Right: organized VAE latent space — smooth and continuous. Click to regenerate.
The key insight of the VAE: instead of encoding x to a single point z, encode it to a distribution — specifically, a Gaussian with mean μ and variance σ². Then sample z from that distribution. This forces the latent space to be smooth.
But wait: sampling is not differentiable. How do we backpropagate through a random number? The reparameterization trick: instead of z ~ N(μ, σ²), write z = μ + σ · ε where ε ~ N(0,1). Now the randomness (ε) is external, and gradients flow through μ and σ.
Adjust μ and σ. Each frame samples a new ε and shows the resulting z. The orange curve is the distribution; teal dots are samples.
We want to maximize the likelihood of our data: p(x). But computing p(x) directly requires integrating over all possible latent codes z, which is intractable. Instead, we optimize a lower bound on log p(x): the Evidence Lower Bound (ELBO).
The ELBO has two terms: a reconstruction term (how well can we decode z back to x?) and a KL divergence term (how close is our learned distribution q(z|x) to the prior p(z) = N(0,1)?). The reconstruction term wants good decoding; the KL term wants an organized latent space.
Adjust the balance between reconstruction quality and KL penalty to see the tradeoff. A heavy KL penalty gives an organized latent space but blurrier reconstructions; a heavy reconstruction weight gives sharp images but a messier latent space.
In practice, the KL term has a closed-form solution for Gaussians. For each latent dimension j, the KL divergence is: ½(μ_j² + σ_j² − log σ_j² − 1). This is cheap to compute and differentiable.
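The closed form is a few lines of numpy. A quick sanity check: a dimension that already matches the prior (μ = 0, σ = 1) contributes exactly zero.

```python
import numpy as np

# Closed-form KL divergence between N(mu, sigma^2) and the prior N(0, 1),
# summed over latent dimensions: 0.5 * (mu^2 + sigma^2 - log sigma^2 - 1).
def kl_to_standard_normal(mu, sigma):
    return 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)

mu = np.array([0.0, 1.0])
sigma = np.array([1.0, 1.0])
print(kl_to_standard_normal(mu, sigma))  # 0.5: only the mu=1 dimension contributes
```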
The big practical knob is β: a weight on the KL term. With β=1, you get the standard VAE (ELBO). With β>1, you get β-VAE, which forces more disentanglement at the cost of blurrier reconstructions. With β<1, you get sharper images but messier latent space.
Low β: sharp but unstructured. High β: blurry but well-organized latent space. Watch how the "reconstruction" and "organization" bars respond.
| β value | Reconstruction | Latent structure | Use case |
|---|---|---|---|
| β < 1 | Sharp | Messy | When quality matters most |
| β = 1 | Balanced | Good | Standard VAE (ELBO) |
| β > 1 | Blurry | Disentangled | β-VAE for interpretable factors |
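In the objective itself, β is nothing more than a multiplier on the KL term. This hypothetical loss sketch uses mean squared error for reconstruction (a common but not universal choice) and the Gaussian closed-form KL from above:

```python
import numpy as np

# beta-VAE objective sketch (negative ELBO with a weighted KL term).
# beta = 1 recovers the standard VAE; beta > 1 trades reconstruction
# quality for a more organized, disentangled latent space.
def beta_vae_loss(x, x_hat, mu, sigma, beta=1.0):
    recon = np.mean((x - x_hat) ** 2)                              # decode quality
    kl = 0.5 * np.sum(mu**2 + sigma**2 - np.log(sigma**2) - 1.0)  # latent structure
    return recon + beta * kl
```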
What if latent codes were discrete instead of continuous? VQ-VAE replaces the Gaussian latent space with a codebook: a dictionary of K learned vectors. The encoder outputs a continuous vector, then it's snapped to the nearest codebook entry. This is vector quantization.
The decoder only ever sees codebook entries, not the raw encoder output. The result: a finite set of "visual words" that the decoder can reconstruct from. This is how images get turned into token sequences — the key to using transformers for vision.
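The "snap to nearest entry" step is a nearest-neighbor lookup. This sketch uses a random stand-in codebook (in a trained VQ-VAE the entries are learned):

```python
import numpy as np

# Vector quantization sketch: snap an encoder output to its nearest
# codebook entry by Euclidean distance. Codebook values here are made up.
rng = np.random.default_rng(7)
K, d = 8, 2
codebook = rng.normal(size=(K, d))        # K learned "visual words"

def quantize(z_e):
    dists = np.sum((codebook - z_e) ** 2, axis=1)  # distance to each entry
    idx = int(np.argmin(dists))           # this index IS the discrete token
    return idx, codebook[idx]             # decoder only ever sees codebook[idx]

idx, z_q = quantize(np.array([0.1, -0.3]))
```

The returned index is what gets stored or fed to a transformer; the continuous encoder output is discarded after the snap.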
Blue dots are codebook entries. Drag the orange point (encoder output) and watch it snap to the nearest codebook entry. The green line shows the assignment.
The loss has three parts: reconstruction, codebook loss (move codebook entries toward encoder outputs, with stop-gradient sg), and commitment loss (keep encoder outputs near codebook entries).
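Numerically the codebook and commitment terms are the same squared distance; the stop-gradient sg(·) only changes which side receives gradients during backprop. This sketch (with the commonly used commitment weight β = 0.25 as an assumed default) computes the three terms' values; in a real framework sg would be a detach:

```python
import numpy as np

# VQ-VAE loss sketch. sg(.) is a stop-gradient (detach) in a real
# framework; here the two quantization terms are numerically equal and
# differ only in which parameters they would train.
def vq_vae_loss(x, x_hat, z_e, z_q, beta=0.25):
    recon = np.mean((x - x_hat) ** 2)              # train encoder + decoder
    codebook_loss = np.mean((z_e - z_q) ** 2)      # ||sg(z_e) - e||^2: moves codebook
    commitment = beta * np.mean((z_e - z_q) ** 2)  # ||z_e - sg(e)||^2: commits encoder
    return recon + codebook_loss + commitment
```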
The codebook is only useful if all entries are active. A common failure: codebook collapse — the encoder only uses a handful of entries while the rest go "dead." This wastes representational capacity.
Exponential Moving Average (EMA) updates are a popular fix. Instead of updating codebook entries with gradient descent, track the running average of all encoder outputs assigned to each entry. This is faster and more stable.
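A minimal EMA update can be sketched as follows (shapes and the decay rate are assumptions for illustration): track a decayed count and vector sum per entry, and set each active entry to its running mean.

```python
import numpy as np

# EMA codebook update sketch. For each entry k, keep a running count N_k
# and running sum m_k of encoder outputs assigned to it; the entry becomes
# m_k / N_k. No gradients ever touch the codebook.
def ema_update(codebook, counts, sums, z_e, assignments, decay=0.99):
    K = codebook.shape[0]
    batch_counts = np.bincount(assignments, minlength=K)
    batch_sums = np.zeros_like(sums)
    np.add.at(batch_sums, assignments, z_e)        # accumulate vectors per entry
    counts[:] = decay * counts + (1 - decay) * batch_counts
    sums[:] = decay * sums + (1 - decay) * batch_sums
    active = counts > 1e-5                         # leave dead entries untouched
    codebook[active] = sums[active] / counts[active, None]
    return codebook, counts, sums
```

With decay = 0 each entry jumps straight to the mean of the vectors assigned to it in the current batch; decay near 1 gives the slow, stable tracking used in practice.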
Each bar is a codebook entry. Height = usage count. Red entries are dead (unused). Watch how dead code revival redistributes them.
| Strategy | How it works |
|---|---|
| EMA updates | Running average of assigned vectors; no gradient needed for codebook |
| Dead code revival | Replace unused entries with randomly sampled encoder outputs |
| Codebook reset | Periodically re-initialize low-usage entries from data (k-means style) |
| Larger codebook | More entries = finer granularity, but harder to keep all alive |
VQ-VAE's codebook is elegant but fragile: you need to manage dead codes, tune the commitment loss, and balance EMA rates. FSQ (finite scalar quantization) takes a radically simpler approach: instead of learning a codebook, just round each scalar to a small set of levels.
If each of d dimensions has L levels, you get L^d possible codes — an implicit codebook. With d=6 and L=5, that's 5^6 = 15,625 codes. No codebook parameters. No collapse. No commitment loss. Just rounding.
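One common way to implement this (a sketch; real FSQ implementations also use a straight-through gradient for the rounding step): bound each dimension with tanh, then round to evenly spaced levels.

```python
import numpy as np

# FSQ sketch: squash each dimension into (-1, 1) with tanh, then round to
# one of L evenly spaced levels. There are no codebook parameters at all.
def fsq_quantize(z, levels=5):
    half = (levels - 1) / 2                  # levels=5 -> {-1, -0.5, 0, 0.5, 1}
    bounded = np.tanh(z)                     # bound to (-1, 1)
    return np.round(bounded * half) / half   # snap to the nearest level

z = np.array([0.2, -1.5, 3.0])
z_q = fsq_quantize(z)                        # every entry lands on one of 5 levels
n_codes = 5 ** len(z)                        # implicit codebook of L^d codes
```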
The continuous encoder output (left axis) is rounded to discrete levels (right axis). Adjust levels per dimension to see how granularity changes. More levels = finer representation.
Modern generative models don't operate on raw pixels. They first compress images into a latent representation (discrete codes via a VQ-VAE or FSQ, or continuous latents via a KL-regularized VAE), then model the distribution of those latents with a transformer or diffusion model. This is the architecture behind Stable Diffusion, DALL-E, and MAGVIT.
The tokenizer compresses a 256×256 image to, say, a 32×32 grid of codebook indices. That's 1,024 tokens instead of 196,608 pixel values — a 192× compression. A transformer can then model these tokens autoregressively, just like words.
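The arithmetic in the paragraph above is worth checking by hand:

```python
# Back-of-the-envelope token math for a 256x256 RGB image.
pixels = 256 * 256 * 3     # raw numbers in the image
tokens = 32 * 32           # codebook indices after tokenization
ratio = pixels // tokens   # how many raw numbers each token replaces
print(pixels, tokens, ratio)  # 196608 1024 192
```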
| System | Tokenizer | Generator | Year |
|---|---|---|---|
| DALL-E 1 | dVAE (discrete VAE) | Autoregressive transformer | 2021 |
| Stable Diffusion | KL-regularized AE | Latent diffusion model | 2022 |
| MAGVIT-2 | LFQ (lookup-free quantization) | Masked transformer | 2023 |
| Cosmos | Causal VQ-VAE | Autoregressive + diffusion | 2024 |
VAEs and VQ-VAEs aren't just academic exercises — they're load-bearing infrastructure in the biggest generative AI systems. Stable Diffusion's latent space? A VAE. DALL-E's image tokens? A VQ-VAE. Nearly every video model you've seen? Some flavor of spatiotemporal autoencoder.
How the ideas connect. The original autoencoder begat a family of models that now power every major generative AI system.
| Application | VAE Variant | Role |
|---|---|---|
| Stable Diffusion | KL-VAE | Compress images to/from latent space |
| DALL-E 1 | dVAE | Convert images to discrete tokens |
| Sora | Spatial-temporal VAE | Tokenize video frames + motion |
| AudioLM | SoundStream (VQ-VAE) | Tokenize audio waveforms |
| Drug discovery | Molecular VAE | Smooth latent space for molecule optimization |
You now understand latent spaces, variational inference, vector quantization, and how they power modern AI. Every generated image you see started as a latent code.