VAE / VQ-VAE
Tokenization
The secret plumbing behind every generative system
What Is It?
An encoder compresses high-dimensional data (images, audio, video) into a compact latent space. A decoder reconstructs the original data from that compressed representation. That is the shared skeleton. The two families diverge in how the latent space is structured.
VAE (Variational Autoencoder): the encoder outputs the parameters of a continuous probability distribution — typically a diagonal Gaussian q(z|x) = N(mu, diag(sigma^2)). A KL-divergence term regularizes this distribution toward a standard normal prior N(0, I), ensuring the latent space is smooth, interpolable, and generative.
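As a sketch, the reparameterization trick that makes sampling from q(z|x) differentiable can be written in a few lines of NumPy; the mu and log_var values here are illustrative placeholders, not real encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

# The encoder outputs distribution parameters, not a single point
# (placeholder values standing in for a real encoder network):
mu = np.array([0.5, -1.0, 0.0, 2.0])
log_var = np.array([0.0, -2.0, 0.1, -0.5])

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I),
# so gradients can flow through mu and log_var during training.
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps
```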
VQ-VAE (Vector Quantized VAE): the encoder produces a continuous feature map that is then snapped to the nearest entry in a learnable codebook of discrete tokens. The result is a grid of integer indices — a tokenized representation that can be modeled by autoregressive transformers, just like text.
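The nearest-neighbor snap at the heart of VQ can be sketched in NumPy; the shapes (a 512-entry codebook of 64-dimensional codes, an 8x8 feature grid) follow the VQ-VAE defaults listed in the Model Zoo below, and the random inputs stand in for real encoder features:

```python
import numpy as np

def quantize(z_e, codebook):
    """Snap each continuous encoder vector to its nearest codebook entry.

    z_e:      (H*W, D) continuous feature vectors from the encoder
    codebook: (K, D)   learnable embedding table
    Returns integer token indices and the quantized vectors.
    """
    # Squared Euclidean distance between every feature and every code
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (H*W, K)
    idx = d.argmin(axis=1)   # the token grid: one integer per position
    z_q = codebook[idx]      # look up the chosen code vectors
    return idx, z_q

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 64))  # K=512 codes of dimension 64
z_e = rng.normal(size=(8 * 8, 64))     # an 8x8 feature map, flattened
idx, z_q = quantize(z_e, codebook)
```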
Why It Matters
Stable Diffusion, Flux, DALL-E, Sora — none of these run diffusion in pixel space. The VAE encoder is the first step: compress a 512x512x3 image down to a 64x64x4 latent, then denoise in that space. This 48x reduction in values (8x along each spatial dimension) is why latent diffusion is fast enough to be practical.
VQ-VAE tokenizes images, video, and audio into discrete tokens that autoregressive transformers can model — exactly like text. DALL-E 1, MAGVIT, SoundStream, and Encodec all rely on VQ tokenization. It bridges the continuous world of pixels and waveforms with the discrete world of language models.
This is the "secret" component enabling most modern generative AI. It is rarely the headline — the diffusion model or the transformer gets the credit — but without a high-quality encoder/decoder pair, nothing downstream works. It is the plumbing of generative AI.
Training & Inference
The training objective is the Evidence Lower Bound (ELBO):
- Reconstruction term: how well can the decoder reproduce x from z? Often an L2 loss, perceptual loss, or adversarial loss.
- KL term: how far is the encoder's posterior from the prior N(0,I)? Keeps the latent space well-structured.
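A minimal sketch of the negative ELBO for a diagonal Gaussian posterior, assuming an L2 reconstruction term and the closed-form KL against N(0, I):

```python
import numpy as np

def elbo_loss(x, x_hat, mu, log_var):
    """Negative ELBO: L2 reconstruction plus closed-form Gaussian KL.

    KL( N(mu, sigma^2) || N(0, I) ) per latent dimension is
        0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    recon = ((x - x_hat) ** 2).sum()                          # reconstruction term
    kl = 0.5 * (np.exp(log_var) + mu**2 - 1 - log_var).sum()  # KL term
    return recon + kl
```

With mu = 0 and log_var = 0 the posterior equals the prior, so the KL term vanishes and only the reconstruction error remains.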
For VQ-VAE, the KL term is replaced by:
- Codebook loss: ||sg[z_e] - z_q||^2, which pulls each codebook vector toward the encoder outputs assigned to it (sg = stop-gradient)
- Commitment loss: beta * ||z_e - sg[z_q]||^2, which keeps the encoder's outputs close to the codes they commit to
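The forward values of the two terms can be sketched as below. Note that sg[] only affects backpropagation, so in the forward pass both terms evaluate to the same squared distance; they differ only in which network the gradient updates (beta is the commitment weight from the VQ-VAE paper):

```python
import numpy as np

def vq_losses(z_e, z_q, beta=0.25):
    """Forward value of the VQ-VAE codebook and commitment terms.

    Stop-gradient changes backprop, not the forward value, so both
    terms share the same squared distance; the sg placement decides
    whether gradients move the codebook or the encoder.
    """
    sq_dist = ((z_e - z_q) ** 2).sum()
    codebook_loss = sq_dist           # sg on z_e: updates the codebook
    commitment_loss = beta * sq_dist  # sg on z_q: updates the encoder
    return codebook_loss + commitment_loss
```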
VQ-GAN adds a PatchGAN discriminator loss and LPIPS perceptual loss for sharper outputs.
VAE generation:
- Sample z ~ N(0, I) from the prior
- Pass z through the decoder to produce data
- Higher-dimensional z yields more expressive generation
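These two steps can be sketched as follows; the single affine-plus-sigmoid layer is a stand-in for a real deep decoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

def decoder(z, W, b):
    # Stand-in decoder: one affine layer plus sigmoid. A real decoder
    # would be a deep (de)convolutional network.
    return 1 / (1 + np.exp(-(z @ W + b)))

latent_dim, data_dim = 16, 784
W = rng.normal(scale=0.1, size=(latent_dim, data_dim))
b = np.zeros(data_dim)

z = rng.standard_normal(latent_dim)  # step 1: sample z ~ N(0, I)
x = decoder(z, W, b)                 # step 2: decode to data space
```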
VQ-VAE generation (two-stage):
- Train a prior model (autoregressive transformer or diffusion model) over the discrete token sequences
- Sample a sequence of token indices from the prior
- Look up codebook vectors, reshape to spatial grid
- Decode to pixels / audio / video
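A shape-level sketch of the two-stage pipeline, with a uniform categorical standing in for the learned prior and the final decoder omitted:

```python
import numpy as np

rng = np.random.default_rng(0)
K, D, H, W = 512, 64, 8, 8
codebook = rng.normal(size=(K, D))

# Steps 1-2: a real prior is an autoregressive transformer (or
# diffusion model) over token sequences; a uniform categorical
# stands in here.
tokens = rng.integers(0, K, size=H * W)

# Step 3: look up code vectors and reshape to the spatial grid.
z_q = codebook[tokens].reshape(H, W, D)

# Step 4: feed z_q to the decoder (omitted) to produce pixels.
```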
As encoder only (Stable Diffusion):
- Encode image to latent space
- Run diffusion in latent space
- Decode the denoised latent back to pixels
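The compression arithmetic behind this pipeline can be checked directly; the arrays below are random stand-ins with the shapes quoted above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pixel-space image and its latent, with the shapes quoted above.
# (Random stand-ins: a real encoder is a convnet, and diffusion
# would iteratively denoise the latent before decoding.)
img = rng.random((512, 512, 3))            # 786,432 values
latent = rng.standard_normal((64, 64, 4))  # 16,384 values

# Diffusion operates on the latent, so every denoising step touches
# 48x fewer values than it would in pixel space.
ratio = img.size / latent.size  # 48.0
```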
Model Zoo
| Model | Type | Codebook | Latent Dim | Used In | Year |
|---|---|---|---|---|---|
| VAE | Continuous | — | Varies | Original framework | 2013 |
| VQ-VAE | Discrete | 512 | 64 | WaveNet, PixelCNN prior | 2017 |
| VQ-VAE-2 | Hierarchical Discrete | 512 x 2 levels | 64 | High-res image synthesis | 2019 |
| dVAE (DALL-E) | Discrete (Gumbel) | 8192 | 256 | DALL-E 1 | 2021 |
| VQ-GAN | Discrete + GAN | 1024–16384 | 256 | Stable Diffusion, Parti | 2021 |
| SD VAE (KL-f8) | Continuous (KL-reg) | — | 4 channels, 8x down | Stable Diffusion 1–2 | 2022 |
| SoundStream | RVQ (Residual VQ) | 1024 x N levels | Multi-scale | AudioLM, MusicLM | 2021 |
| Encodec | RVQ | 1024 x 8 levels | 128 | MusicGen, Voicebox | 2022 |
| MAGVIT-v2 | LFQ | 2^18 (implicit) | 18-bit codes | VideoPoet | 2023 |
| SDXL VAE | Continuous (KL-reg) | — | 4 channels, 8x down | SDXL, SD3, Flux | 2023 |