Architecture Atlas — 05

VAE / VQ-VAE
Tokenization

The secret plumbing behind every generative system

Year: 2013 (VAE) / 2017 (VQ-VAE)
Creators: Kingma & Welling (VAE) / van den Oord et al. (VQ-VAE)
Category: Generative / Compression
Concept

What Is It?

An encoder compresses high-dimensional data (images, audio, video) into a compact latent space. A decoder reconstructs the original data from that compressed representation. That is the shared skeleton. The two families diverge in how the latent space is structured.

VAE (Variational Autoencoder): the encoder outputs the parameters of a continuous probability distribution — typically a diagonal Gaussian q(z|x) = N(mu, diag(sigma^2)). A KL-divergence term regularizes this distribution toward a standard normal prior, keeping the latent space smooth, interpolable, and usable for generation.

VQ-VAE (Vector Quantized VAE): the encoder produces a continuous feature map that is then snapped to the nearest entry in a learnable codebook of discrete tokens. The result is a grid of integer indices — a tokenized representation that can be modeled by autoregressive transformers, just like text.

Architecture

How It Works

Four views into the VAE / VQ-VAE pipeline:

  • Encoder-Decoder Architecture: the shared compress/reconstruct skeleton
  • Latent Space Visualization: sampling points in the learned latent space
  • Codebook Lookup (VQ-VAE): encoder output → nearest codebook vector
  • ELBO Decomposition: ELBO = Reconstruction - KL
Core Mechanisms

Key Ideas

ELBO Objective
The Evidence Lower BOund decomposes into a reconstruction term E[log p(x|z)] that rewards fidelity, minus a KL term that keeps the posterior close to the prior. Maximizing ELBO approximates maximizing the intractable log-likelihood.
Reparameterization Trick
Sampling z ~ q(z|x) is non-differentiable. The trick: compute z = mu + sigma * epsilon where epsilon ~ N(0,1). Gradients now flow through mu and sigma while randomness is externalized.
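A minimal numpy sketch of the trick (illustrative names; a real VAE would run this inside an autodiff framework so gradients actually reach mu and log_var):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I).

    In an autodiff framework, gradients flow through mu and log_var;
    the randomness is externalized into eps.
    """
    sigma = np.exp(0.5 * log_var)        # log-variance -> standard deviation
    eps = rng.standard_normal(mu.shape)  # noise, independent of the parameters
    return mu + sigma * eps

z = reparameterize(mu=np.zeros(4), log_var=np.zeros(4))  # sigma = 1
print(z.shape)  # (4,)
```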
Vector Quantization (VQ)
Replace each continuous encoder vector with its nearest neighbor in a learnable codebook of K entries. The result: discrete tokens. Gradients bypass the argmin via straight-through estimation.
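The lookup itself is a nearest-neighbor search, sketched here in numpy (illustrative names; the straight-through gradient path is noted in comments, since numpy has no autodiff):

```python
import numpy as np

def quantize(z_e, codebook):
    """Snap each encoder vector to its nearest codebook entry.

    z_e:      (n, d) continuous encoder outputs
    codebook: (K, d) learnable embedding table
    Returns (indices, z_q).
    """
    # Pairwise squared distances (n, K), then argmin over the codebook.
    d2 = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    indices = d2.argmin(axis=1)
    z_q = codebook[indices]
    # Straight-through estimator, conceptually:
    #   z_q = z_e + stop_gradient(z_q - z_e)
    # so decoder gradients pass to the encoder unchanged, bypassing the argmin.
    return indices, z_q

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
z_e = np.array([[0.1, -0.2], [0.9, 1.2]])
idx, z_q = quantize(z_e, codebook)
print(idx)  # [0 1]
```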
FSQ (Finite Scalar Quantization)
Instead of a learned codebook, each scalar dimension is rounded to one of L fixed levels. Implicit codebook of size L^d. No codebook collapse, no EMA updates, simpler training.
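A sketch of the per-dimension rounding, assuming the encoder output is already bounded to [-1, 1] (e.g. by a tanh), which is how FSQ is usually set up:

```python
import numpy as np

def fsq(z, levels):
    """Finite Scalar Quantization: round each dimension to one of
    `levels` fixed grid points in [-1, 1]. Implicit codebook size
    is levels ** d for a d-dimensional code.
    """
    half = (levels - 1) / 2.0
    return np.round(z * half) / half  # nearest grid point

z = np.array([0.34, -0.92, 0.05])
# With levels=5 the grid is {-1, -0.5, 0, 0.5, 1}.
print(fsq(z, levels=5))
```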
Codebook Learning
Codebook vectors are updated via exponential moving average (EMA) of the encoder outputs that map to them, or via a codebook loss that pulls each vector toward its assigned encoder outputs.
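One EMA step can be sketched as follows (numpy, illustrative names; real implementations also handle dead-code reinitialization and distributed reduction):

```python
import numpy as np

def ema_update(codebook, counts, sums, z_e, indices, decay=0.99, eps=1e-5):
    """One EMA step: move each codebook vector toward the running mean
    of the encoder outputs assigned to it.

    counts, sums: per-entry EMA statistics, updated in place.
    """
    K, _ = codebook.shape
    one_hot = np.eye(K)[indices]                  # (n, K) hard assignments
    counts[:] = decay * counts + (1 - decay) * one_hot.sum(0)
    sums[:] = decay * sums + (1 - decay) * (one_hot.T @ z_e)
    codebook[:] = sums / (counts[:, None] + eps)  # new centroids
    return codebook

codebook = np.zeros((2, 3))
counts, sums = np.ones(2), codebook.copy()
z_e = np.ones((4, 3))
ema_update(codebook, counts, sums, z_e, indices=np.array([0, 0, 1, 1]))
print(codebook.shape)  # (2, 3)
```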
Commitment Loss
An auxiliary loss that penalizes the encoder output for drifting away from its chosen codebook entry: ||z_e - sg[z_q]||^2, where sg is stop-gradient. Keeps the encoder "committed" to the codebook.
Impact

Why It Matters

Stable Diffusion, Flux, DALL-E, Sora — none of these run diffusion in pixel space. The VAE encoder is the first step: compress a 512x512x3 image down to a 64x64x4 latent, then denoise in that space. This 48x compression (786,432 values down to 16,384) is why latent diffusion is fast enough to be practical.

VQ-VAE tokenizes images, video, and audio into discrete tokens that autoregressive transformers can model — exactly like text. DALL-E 1, MAGVIT, SoundStream, and Encodec all rely on VQ tokenization. It bridges the continuous world of pixels and waveforms with the discrete world of language models.

This is the "secret" component enabling most modern generative AI. It is rarely the headline — the diffusion model or the transformer gets the credit — but without a high-quality encoder/decoder pair, nothing downstream works. It is the plumbing of generative AI.

Variants

Notable Variants

Beta-VAE
Higgins et al., 2017
Scale the KL term by beta > 1 to encourage disentangled latent representations where each dimension controls a single factor of variation (pose, color, size). Trade-off: lower reconstruction quality for better latent structure.
VQ-GAN
Esser et al., 2021
Replace the pixel-level reconstruction loss with an adversarial discriminator and perceptual (LPIPS) loss. Dramatically sharper reconstructions. The standard tokenizer for most image generation pipelines, including the Stable Diffusion VAE.
DALL-E dVAE
Ramesh et al., 2021
A discrete VAE with 8192 codebook entries and Gumbel-Softmax relaxation for differentiable discrete sampling. Tokenizes 256x256 images into 32x32 grids of tokens for autoregressive generation by a 12B parameter transformer.
MAGVIT-v2
Yu et al., 2024
Lookup-Free Quantization (LFQ) with a massive implicit codebook of 2^18 entries. Unified image and video tokenization. Achieves state-of-the-art reconstruction and enables language-model-style video generation.
Pipeline

Training & Inference

Training: Reconstruct + Regularize

The training objective is the Evidence Lower Bound (ELBO):

ELBO = E_{q(z|x)}[log p(x|z)] - KL(q(z|x) || p(z))
  • Reconstruction term: how well can the decoder reproduce x from z? Often an L2 loss, perceptual loss, or adversarial loss.
  • KL term: how far is the encoder's posterior from the prior N(0,I)? Keeps the latent space well-structured.
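For a diagonal Gaussian posterior and a standard normal prior, the KL term has a closed form, which a short numpy sketch makes concrete (L2 reconstruction is one common choice for the first term; names are illustrative):

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def neg_elbo(x, x_hat, mu, log_var):
    """Reconstruction term (L2 here) plus the KL penalty; minimized in training."""
    recon = np.sum((x - x_hat) ** 2)  # stands in for -E[log p(x|z)]
    return recon + gaussian_kl(mu, log_var)

# KL vanishes exactly when the posterior equals the prior N(0, I):
print(gaussian_kl(np.zeros(4), np.zeros(4)))  # 0.0
```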

For VQ-VAE, the KL term is replaced by:

  • Codebook loss: ||sg[z_e] - z_q||^2 (pulls the codebook toward the encoder outputs)
  • Commitment loss: ||z_e - sg[z_q]||^2 (keeps the encoder close to its chosen codes)
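Numerically the two terms are the same squared distance; they differ only in which side the stop-gradient detaches. A numpy sketch (no autodiff here, so the comments mark what sg would detach in a real framework; beta = 0.25 is the value used in the VQ-VAE paper):

```python
import numpy as np

def vq_losses(z_e, z_q, beta=0.25):
    """Codebook + commitment terms from the VQ-VAE objective.

    In an autodiff framework:
      codebook term   -> sg on z_e: gradients train only the codebook
      commitment term -> sg on z_q: gradients train only the encoder
    """
    codebook_loss = np.sum((z_e - z_q) ** 2)
    commitment_loss = np.sum((z_e - z_q) ** 2)
    return codebook_loss + beta * commitment_loss

print(vq_losses(np.ones(2), np.zeros(2)))  # 2 + 0.25 * 2 = 2.5
```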

VQ-GAN adds a PatchGAN discriminator loss and LPIPS perceptual loss for sharper outputs.

Inference: Sample + Decode

VAE generation:

  • Sample z ~ N(0, I) from the prior
  • Pass z through the decoder to produce data
  • A higher-dimensional z gives the decoder more capacity, at the cost of a posterior that is harder to match to the prior

VQ-VAE generation (two-stage):

  • Train a prior model (autoregressive transformer or diffusion model) over the discrete token sequences
  • Sample a sequence of token indices from the prior
  • Look up codebook vectors, reshape to spatial grid
  • Decode to pixels / audio / video
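The steps above can be checked end-to-end at the shape level with a toy sketch, where a uniform distribution stands in for the trained prior (real systems sample token-by-token from a transformer; all sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

K, d, grid = 16, 8, (4, 4)             # codebook size, code dim, token grid
codebook = rng.standard_normal((K, d))

indices = rng.integers(0, K, size=grid)  # stage 2: sample token indices
latents = codebook[indices]              # codebook lookup -> spatial grid
print(latents.shape)                     # (4, 4, 8)
# A decoder network would now map this grid to pixels / audio / video.
```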

As encoder only (Stable Diffusion):

  • Encode image to latent space
  • Run diffusion in latent space
  • Decode the denoised latent back to pixels
Reference

Model Zoo

Model          | Type                  | Codebook        | Latent Dim          | Used In                  | Year
VAE            | Continuous            | —               | Varies              | Original framework       | 2013
VQ-VAE         | Discrete              | 512             | 64                  | WaveNet, PixelCNN prior  | 2017
VQ-VAE-2       | Hierarchical discrete | 512 x 2 levels  | 64                  | High-res image synthesis | 2019
dVAE (DALL-E)  | Discrete (Gumbel)     | 8192            | 256                 | DALL-E 1                 | 2021
VQ-GAN         | Discrete + GAN        | 1024–16384      | 256                 | Stable Diffusion, Parti  | 2021
SoundStream    | RVQ (Residual VQ)     | 1024 x N levels | Multi-scale         | AudioLM, MusicLM         | 2021
Encodec        | RVQ                   | 1024 x 8 levels | 128                 | MusicGen, Voicebox       | 2022
SD VAE (KL-f8) | Continuous (KL-reg)   | —               | 4 channels, 8x down | Stable Diffusion 1–2     | 2022
SDXL VAE       | Continuous (KL-reg)   | —               | 4 channels, 8x down | SDXL, SD3, Flux          | 2023
MAGVIT-v2      | LFQ                   | 2^18 (implicit) | 18-bit codes        | VideoPoet                | 2024