VAE / VQ-VAE
Tokenization
The secret plumbing behind every generative system
What Is It?
An encoder compresses high-dimensional data (images, audio, video) into a compact latent space. A decoder reconstructs the original data from that compressed representation. That is the shared skeleton. The two families diverge in how the latent space is structured.
VAE (Variational Autoencoder): the encoder outputs the parameters of a continuous probability distribution — typically a diagonal Gaussian q(z|x) = N(mu, diag(sigma^2)). A KL-divergence term regularizes this distribution toward a standard normal prior N(0, I), ensuring the latent space is smooth, interpolable, and generative.
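As a sketch, the reparameterization trick that makes sampling from q(z|x) differentiable can be written in a few lines of NumPy; the mu and log_var values here are illustrative placeholders, not real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# The encoder outputs distribution parameters, not a single point
# (placeholder values standing in for a real encoder network):
mu = np.array([0.5, -1.0, 0.0, 2.0])
log_var = np.array([0.0, -2.0, 0.1, -0.5])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow through mu and log_var during training.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps
```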
VQ-VAE (Vector Quantized VAE): the encoder produces a continuous feature map that is then snapped to the nearest entry in a learnable codebook of discrete tokens. The result is a grid of integer indices — a tokenized representation that can be modeled by autoregressive transformers, just like text.
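The nearest-neighbor snap at the heart of VQ can be sketched in NumPy; the shapes (a 512-entry codebook of 64-dimensional codes, an 8x8 feature grid) follow the VQ-VAE defaults listed in the Model Zoo below, and the random inputs stand in for real encoder features:

```python
import numpy as np

def quantize(z_e, codebook):
    """Snap each continuous encoder vector to its nearest codebook entry.

    z_e:      (H*W, D) continuous feature vectors from the encoder
    codebook: (K, D)   learnable embedding table
    Returns integer token indices and the quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    idx = d.argmin(axis=1)   # the token grid: one integer per position
    z_q = codebook[idx]      # look up the chosen code vectors
    return idx, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # K=512 codes of dimension 64
z_e = rng.normal(size=(8 * 8, 64))     # an 8x8 feature map, flattened
idx, z_q = quantize(z_e, codebook)
```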
Why It Matters
Stable Diffusion, Flux, DALL-E, Sora — none of these run diffusion in pixel space. The VAE encoder is the first step: compress a 512x512x3 image down to a 64x64x4 latent, then denoise in that space. This 48x reduction in values (8x along each spatial dimension) is why latent diffusion is fast enough to be practical.
VQ-VAE tokenizes images, video, and audio into discrete tokens that autoregressive transformers can model — exactly like text. DALL-E 1, MAGVIT, SoundStream, and Encodec all rely on VQ tokenization. It bridges the continuous world of pixels and waveforms with the discrete world of language models.
This is the "secret" component enabling most modern generative AI. It is rarely the headline — the diffusion model or the transformer gets the credit — but without a high-quality encoder/decoder pair, nothing downstream works. It is the plumbing of generative AI.
Training & Inference
The training objective is the Evidence Lower Bound (ELBO):
- Reconstruction term: how well can the decoder reproduce x from z? Often an L2 loss, perceptual loss, or adversarial loss.
- KL term: how far is the encoder's posterior from the prior N(0,I)? Keeps the latent space well-structured.
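A minimal sketch of the negative ELBO for a diagonal Gaussian posterior, assuming an L2 reconstruction term and the closed-form KL against N(0, I):

```python
import numpy as np

def elbo_loss(x, x_hat, mu, log_var):
    """Negative ELBO: L2 reconstruction plus closed-form Gaussian KL.

    KL( N(mu, sigma^2) || N(0, I) ) per latent dimension is
        0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    recon = ((x - x_hat) ** 2).sum()                          # reconstruction term
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1 - log_var).sum()  # KL term
    return recon + kl
```

With mu = 0 and log_var = 0 the posterior equals the prior, so the KL term vanishes and only the reconstruction error remains.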
For VQ-VAE, the KL term is replaced by:
- Codebook loss: ||sg[z_e] - z_q||^2, which pulls each codebook vector toward the encoder outputs assigned to it (sg = stop-gradient)
- Commitment loss: beta * ||z_e - sg[z_q]||^2, which keeps the encoder's outputs close to the codes they commit to
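The forward values of the two terms can be sketched as below. Note that sg[] only affects backpropagation, so in the forward pass both terms evaluate to the same squared distance; they differ only in which network the gradient updates (beta is the commitment weight from the VQ-VAE paper):

```python
import numpy as np

def vq_losses(z_e, z_q, beta=0.25):
    """Forward value of the VQ-VAE codebook and commitment terms.

    Stop-gradient changes backprop, not the forward value, so both
    terms share the same squared distance; the sg placement decides
    whether gradients move the codebook or the encoder.
    """
    sq_dist = ((z_e - z_q) ** 2).sum()
    codebook_loss = sq_dist           # sg on z_e: updates the codebook
    commitment_loss = beta * sq_dist  # sg on z_q: updates the encoder
    return codebook_loss + commitment_loss
```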
VQ-GAN adds a PatchGAN discriminator loss and LPIPS perceptual loss for sharper outputs.
VAE generation:
- Sample z ~ N(0, I) from the prior
- Pass z through the decoder to produce data
- Higher-dimensional z yields more expressive generation
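These two steps can be sketched as follows; the single affine-plus-sigmoid layer is a stand-in for a real deep decoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z, W, b):
    # Stand-in decoder: one affine layer plus sigmoid. A real decoder
    # would be a deep (de)convolutional network.
    return 1 / (1 + np.exp(-(z @ W + b)))

latent_dim, data_dim = 16, 784
W = rng.normal(scale=0.1, size=(latent_dim, data_dim))
b = np.zeros(data_dim)

z = rng.standard_normal(latent_dim)  # step 1: sample z ~ N(0, I)
x = decoder(z, W, b)                 # step 2: decode to data space
```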
VQ-VAE generation (two-stage):
- Train a prior model (autoregressive transformer or diffusion model) over the discrete token sequences
- Sample a sequence of token indices from the prior
- Look up codebook vectors, reshape to spatial grid
- Decode to pixels / audio / video
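A shape-level sketch of the two-stage pipeline, with a uniform categorical standing in for the learned prior and the final decoder omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, H, W = 512, 64, 8, 8
codebook = rng.normal(size=(K, D))

# Steps 1-2: a real prior is an autoregressive transformer (or
# diffusion model) over token sequences; a uniform categorical
# stands in here.
tokens = rng.integers(0, K, size=H * W)

# Step 3: look up code vectors and reshape to the spatial grid.
z_q = codebook[tokens].reshape(H, W, D)

# Step 4: feed z_q to the decoder (omitted) to produce pixels.
```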
As encoder only (Stable Diffusion):
- Encode image to latent space
- Run diffusion in latent space
- Decode the denoised latent back to pixels
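The compression arithmetic behind this pipeline can be checked directly; the arrays below are random stand-ins with the shapes quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pixel-space image and its latent, with the shapes quoted above.
# (Random stand-ins: a real encoder is a convnet, and diffusion
# would iteratively denoise the latent before decoding.)
img = rng.random((512, 512, 3))            # 786,432 values
latent = rng.standard_normal((64, 64, 4))  # 16,384 values

# Diffusion operates on the latent, so every denoising step touches
# 48x fewer values than it would in pixel space.
ratio = img.size / latent.size  # 48.0
```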
Model Zoo
| Model | Type | Codebook | Latent Dim | Used In | Year |
|---|---|---|---|---|---|
| VAE | Continuous | — | Varies | Original framework | 2013 |
| VQ-VAE | Discrete | 512 | 64 | WaveNet, PixelCNN prior | 2017 |
| VQ-VAE-2 | Hierarchical Discrete | 512 x 2 levels | 64 | High-res image synthesis | 2019 |
| dVAE (DALL-E) | Discrete (Gumbel) | 8192 | 256 | DALL-E 1 | 2021 |
| VQ-GAN | Discrete + GAN | 1024–16384 | 256 | Stable Diffusion, Parti | 2021 |
| SD VAE (KL-f8) | Continuous (KL-reg) | — | 4 channels, 8x down | Stable Diffusion 1–2 | 2022 |
| SoundStream | RVQ (Residual VQ) | 1024 x N levels | Multi-scale | AudioLM, MusicLM | 2021 |
| Encodec | RVQ | 1024 x 8 levels | 128 | MusicGen, Voicebox | 2022 |
| MAGVIT-v2 | LFQ | 2^18 (implicit) | 18-bit codes | VideoPoet | 2023 |
| SDXL VAE | Continuous (KL-reg) | — | 4 channels, 8x down | SDXL, SD3, Flux | 2023 |