Tian, Jiang, Yuan, Peng, Wang — NeurIPS 2024 Best Paper

Visual AutoRegressive Modeling

Scalable image generation via next-scale prediction — a coarse-to-fine autoregressive paradigm that makes GPT-style models surpass diffusion transformers for the first time.

Prerequisites: VQVAE / VQ-GAN basics + Autoregressive models + Transformers

Chapter 0: The Problem

Autoregressive (AR) models dominate language. GPT predicts the next token, left to right, and scales beautifully. Naturally, researchers asked: can we do the same for images?

The standard approach (VQGAN, DALL-E) works like this: encode an image into a grid of discrete tokens (say 16×16 = 256 tokens), flatten the grid into a 1D sequence using raster-scan order (left to right, top to bottom, like reading a book), and train a GPT-style transformer to predict the next token in this sequence.

It works. But it works poorly. By 2024, the best raster-scan AR models on ImageNet 256×256 achieved an FID of around 3-5 (with rejection sampling). Diffusion Transformers like DiT hit 2.27. The AR approach seemed fundamentally limited.

Why raster scan is the wrong inductive bias

The problem is structural. Raster scan imposes a left-to-right, top-to-bottom ordering on image tokens. But images are not text. They have four critical differences:

  1. Bidirectional dependencies. A VQVAE encoder produces tokens that are mutually dependent across the entire image: token (3,7) depends on token (8,2) and vice versa. Autoregressive factorization assumes each token depends only on the tokens before it, so forcing a unidirectional raster ordering violates that premise.
  2. Broken spatial locality. Flattening a 2D grid into 1D destroys spatial relationships. On a 16×16 grid, token (i, j) and its horizontal neighbor (i, j+1) are one position apart, but (i, j) and its vertical neighbor (i+1, j) are 16 positions apart. The model must learn to bridge this gap.
  3. No bidirectional zero-shot tasks. A raster-scan model can predict the bottom of an image given the top (it reads top-to-bottom). But it cannot predict the top given the bottom, or the left given the right. Inpainting a hole in the middle? It can only use tokens above and to the left.
  4. Brutal inefficiency. Generating a 16×16 token grid requires 256 sequential autoregressive steps, and each step attends to all previous tokens. For an n×n token map, that is O(n²) steps and O(n⁶) total compute.
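
The locality and step-count points in the list above can be checked with a few lines of arithmetic. This is a minimal sketch that assumes row-major (raster) flattening of a 16×16 grid; the helper names `raster_index` and `sequence_distance` are illustrative and not taken from any VAR codebase.

```python
# Raster-scan flattening of a 16x16 token grid (row-major order).
GRID = 16

def raster_index(i, j, width=GRID):
    """Position of grid cell (row i, column j) in the flattened 1D sequence."""
    return i * width + j

def sequence_distance(a, b, width=GRID):
    """How far apart two grid cells end up after flattening."""
    return abs(raster_index(a[0], a[1], width) - raster_index(b[0], b[1], width))

horizontal = sequence_distance((3, 7), (3, 8))  # right-hand neighbor
vertical = sequence_distance((3, 7), (4, 7))    # neighbor directly below
total_steps = GRID * GRID                       # one autoregressive step per token

print(horizontal, vertical, total_steps)  # 1 16 256
```

Two cells that touch vertically in the image sit a full row apart in the sequence, which is the gap the model has to learn to bridge.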
The fundamental mismatch: Language is inherently sequential — words flow left to right (or right to left). Images are inherently hierarchical — you perceive the global structure first, then progressively finer details. Raster scan forces a sequential order onto a hierarchical signal. VAR asks: what if we autoregress over scales instead of positions?
Raster Scan vs. Next-Scale Prediction

Left: Raster-scan AR generates tokens one at a time in reading order. Right: VAR generates entire token maps, scale by scale, coarse to fine.

Why does raster-scan autoregression violate the mathematical premise of AR models when applied to images?

Chapter 1: The Key Insight

How do humans perceive images? Not pixel by pixel from the top-left corner. We grasp the global structure first ("it's a dog on grass") and then fill in details: the breed, the fur texture, the individual blades of grass. Painters work the same way: rough sketch first, then progressively finer strokes.

VAR redefines the autoregressive unit. Instead of predicting the next token in a raster sequence, it predicts the next scale in a resolution hierarchy.

From next-token to next-scale

Standard AR factors the image likelihood as a product over individual tokens:

p(x_1, x_2, ..., x_T) = ∏_{t=1}^{T} p(x_t | x_1, ..., x_{t-1})

VAR factors it as a product over scale maps:

p(r_1, r_2, ..., r_K) = ∏_{k=1}^{K} p(r_k | r_1, ..., r_{k-1})

Here each r_k is an entire token map at resolution h_k × w_k. The sequence of scales goes:

1×1 → 2×2 → 3×3 → 4×4 → 5×5 → 6×6 → 8×8 → 10×10 → 13×13 → 16×16

That is K = 10 scales, from a single token capturing the global gist to a 16×16 map with full detail.
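
As a quick check on the schedule above, the following sketch counts the scales and the tokens they contain. The list of side lengths is copied from the text; the variable names are illustrative only.

```python
# Side lengths of the K token maps, copied from the schedule above.
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

tokens_per_scale = [s * s for s in SCALES]   # each map is s x s
num_scales = len(SCALES)                     # autoregressive steps in VAR
total_tokens = sum(tokens_per_scale)         # tokens across all maps
raster_steps = SCALES[-1] ** 2               # steps for raster scan at 16x16

print(num_scales, total_tokens, raster_steps)  # 10 680 256
```

The multi-scale representation holds more tokens in total (680 versus 256), but they are produced in 10 sequential steps rather than 256.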

What changes

Within each scale, all tokens are generated in parallel; no ordering is imposed among tokens at the same resolution. The autoregression happens only across scales: scale k depends on scales 1 through k−1, but the h_k × w_k tokens within scale k are conditionally independent given the prefix.

The paradigm shift: Instead of 256 sequential steps (one per token), VAR uses just 10 autoregressive steps (one per scale). At each step, it predicts an entire token map in parallel. This is why VAR is 20× faster than raster-scan AR — and why it preserves spatial locality (no flattening).

This coarse-to-fine structure also naturally handles the bidirectional dependency problem. At each scale, every token can see the entire image at all coarser resolutions. The 1×1 token captures the global structure. The 2×2 map refines it. By the time we reach 16×16, the model has seen the full image context at every coarser level — effectively giving it bidirectional awareness.

In VAR, what is the autoregressive unit — and how many sequential steps does it take to generate a 16×16 token map?

Chapter 2: Multi-Scale VQVAE

VAR's next-scale prediction requires ground-truth token maps at multiple resolutions. A standard VQVAE gives you one token map (e.g., 16×16). VAR needs a multi-scale VQVAE that produces K token maps at increasing resolutions.

How it works

The architecture reuses the standard VQVAE encoder and decoder — same CNN as VQGAN. The only modification is a multi-scale quantization layer that replaces the single-scale quantizer. Here is the encoding procedure:

  1. Encode. Feed the image through the VQVAE encoder to get a continuous feature map f at resolution h×w (e.g., 16×16).
  2. For each scale k = 1, ..., K:
    • Downsample f to resolution h_k × w_k (via interpolation)
    • Quantize: map each vector to its nearest codebook entry → token map r_k
    • Look up the codebook vectors for r_k, upsample back to h×w, pass through a learned convolution φ_k
    • Subtract this from f (residual design): f ← f − φ_k(upsample(lookup(Z, r_k)))

The residual subtraction is key: after encoding scale k, we remove the information that scale k captured from f. Scale k+1 then encodes the residual — the detail that coarser scales missed. This ensures each scale adds new information rather than redundantly re-encoding the same content.

Reconstruction (decoding)

To reconstruct, we reverse the process: start with f̂ = 0, then for each scale k, look up the codebook vectors for r_k, upsample to full resolution, pass through φ_k, and add to f̂. After all K scales, decode f̂ with the standard VQVAE decoder.

f̂ = ∑_{k=1}^{K} φ_k(interpolate(lookup(Z, r_k), h, w))
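
The encoding loop and the reconstruction sum described above can be sketched together in NumPy. This is a toy illustration under loud assumptions, not the paper's implementation: φ_k is taken to be the identity instead of a learned convolution, nearest-neighbor resizing stands in for the interpolation, the codebook is random with a hypothetical size of 32 entries of dimension 4, and the CNN encoder and decoder are omitted entirely (f is just a random array).

```python
import numpy as np

def resize_nearest(x, size):
    """Nearest-neighbor resize of an (H, W, C) array to (size, size, C)."""
    h, w = x.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]

def quantize(x, codebook):
    """Index of the nearest codebook vector at every spatial position."""
    flat = x.reshape(-1, x.shape[-1])                               # (P, C)
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # (P, V)
    return d2.argmin(axis=1).reshape(x.shape[:2])

def encode(f, codebook, scales):
    """Residual multi-scale encoding; phi_k is assumed to be the identity."""
    full = f.shape[0]
    residual = f.copy()
    token_maps = []
    for s in scales:
        r_k = quantize(resize_nearest(residual, s), codebook)   # token map r_k
        token_maps.append(r_k)
        # Remove what this scale captured, so the next scale sees the rest.
        residual = residual - resize_nearest(codebook[r_k], full)
    return token_maps, residual

def decode_features(token_maps, codebook, full):
    """Sum the upsampled codebook vectors of all scales to rebuild f_hat."""
    f_hat = np.zeros((full, full, codebook.shape[1]))
    for r_k in token_maps:
        f_hat += resize_nearest(codebook[r_k], full)
    return f_hat

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 4))      # toy codebook: V=32, C=4 (assumed)
f = rng.normal(size=(16, 16, 4))         # stand-in for the encoder output
scales = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

token_maps, residual = encode(f, codebook, scales)
f_hat = decode_features(token_maps, codebook, 16)
print([m.shape for m in token_maps])
```

By construction, `f_hat` plus the leftover `residual` equals the original `f`, so whatever remains in `residual` after the last scale is exactly the quantization error of the whole pyramid.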

Shared codebook

A single codebook Z with V = 4096 entries is shared across all scales. This means every scale's tokens come from the same vocabulary — critical for the transformer, which uses a single embedding layer.

Why residual, not independent? If each scale independently quantized a downsampled version of f, coarser scales would waste capacity encoding the same low-frequency content that finer scales also capture. The residual design ensures each scale encodes only the new detail at its resolution level — like a Laplacian pyramid in classical image processing.
Multi-Scale VQVAE Encoding

Watch the residual encoding process. At each scale, the encoder quantizes the current feature map, then subtracts the quantized representation. The residual shrinks as more scales capture detail.

Why does the multi-scale VQVAE use a residual design (subtracting each scale's representation from f before encoding the next scale)?

Chapter 3: The VAR Transformer

With multi-scale token maps in hand, we need a model that can autoregress over scales. VAR uses a surprisingly standard architecture: a GPT-2-style decoder-only transformer.

Input representation

The input sequence is constructed by concatenating all scale maps in order:

[s], r_1, r_2, ..., r_{K-1}

Where [s] is a start token carrying class conditioning information, and each r_k is the flattened h_k × w_k token map at scale k. The total sequence length is:

L = 1 + ∑_{k=1}^{K} h_k × w_k = 1 + (1 + 4 + 9 + 16 + 25 + 36 + 64 + 100 + 169 + 256) = 681

Each token gets a learned embedding plus a scale-specific positional embedding map.

Block-wise causal attention mask

This is the key architectural detail. Under standard causal masking, each token attends only to the tokens before it. VAR instead uses a block-wise causal mask: a token at scale k attends to every token at scales 1 through k, including all tokens of its own scale, and to nothing at finer scales.

This is what enables parallel generation within each scale. Since all tokens at scale k see the same context (all previous scales), they can be generated independently and simultaneously.

Parallel within, sequential across: The block-wise mask creates a hybrid: autoregressive across scales (each scale depends on all coarser ones), but fully parallel within each scale (tokens at the same resolution are conditionally independent given the prefix). This gives you the best of both worlds — the coherence of autoregression and the speed of parallel decoding.
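
A block-wise causal mask of this kind can be built by labeling each sequence position with its scale and comparing labels. The sketch below is a minimal pure-Python version; the function name `block_causal_mask` is illustrative, and start-token handling is left out for brevity.

```python
def block_causal_mask(scales):
    """mask[q][k] is True when query position q may attend to key position k."""
    # Label every sequence position with the index of the scale it belongs to.
    scale_of = [idx for idx, s in enumerate(scales) for _ in range(s * s)]
    n = len(scale_of)
    # A query sees keys from its own scale and from every coarser scale.
    return [[scale_of[q] >= scale_of[k] for k in range(n)] for q in range(n)]

mask = block_causal_mask([1, 2, 3])   # 1 + 4 + 9 = 14 positions

for row in mask:
    print("".join("#" if allowed else "." for allowed in row))
```

Printing the mask shows full square blocks on the diagonal (attention inside a scale) with everything below them filled in and everything above them empty.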

Model design

VAR uses Adaptive Layer Normalization (AdaLN) conditioned on the class label — the same technique used in DiT. The model follows a simple scaling rule:

width = 64d,   heads = d,   dropout = 0.1 · d/24

Where d is the depth (number of transformer layers). Under this rule the total parameter count is N(d) = 73,728 d³, so it grows cubically with depth.
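
To see what the scaling rule implies for concrete depths, the formulas above can be evaluated directly. This sketch only applies the stated rule and N(d) = 73,728 d³; the chosen depths are examples, and the function name `var_config` is hypothetical.

```python
def var_config(depth):
    """Hyperparameters implied by the scaling rule for a given depth d."""
    return {
        "depth": depth,
        "width": 64 * depth,
        "heads": depth,
        "dropout": 0.1 * depth / 24,
        "params": 73_728 * depth ** 3,
    }

for d in (12, 16, 20, 24, 30):
    cfg = var_config(d)
    print(f"d={d}: width={cfg['width']}, heads={cfg['heads']}, "
          f"dropout={cfg['dropout']:.3f}, params={cfg['params'] / 1e9:.2f}B")
```

For d = 30 the formula gives about 1.99 billion parameters, and for d = 12 about 127 million, in line with the rounded sizes quoted in the scaling-law chapter.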

Block-Wise Causal Attention Mask

The attention mask for 4 scales. Green = can attend, dark = masked. Notice: within each scale block, attention is bidirectional (full green square on the diagonal). Across scales, it is strictly causal (green below diagonal blocks only).

Inference with KV-caching

During inference, the model generates scale by scale. At step k, it feeds in the tokens from scale k−1 (or the start token for k = 1), attends to all cached KV pairs from previous scales, and outputs logits for all h_k × w_k token positions simultaneously. Standard KV-caching works because the attention is causal across scales: previously computed keys and values never change.
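
The control flow of this scale-by-scale loop can be sketched without a real model. In the toy version below, `predict_scale` is a hypothetical stand-in for one transformer forward pass (it draws random token ids instead of computing logits), and a plain list stands in for the KV cache; only the loop structure is meant to reflect the description above.

```python
import random

SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
VOCAB = 4096

def predict_scale(cache, num_positions, rng):
    """Hypothetical stand-in for one transformer forward pass.

    A real model would attend to `cache` and return logits; here we draw
    one random token id per position so the loop is runnable.
    """
    return [rng.randrange(VOCAB) for _ in range(num_positions)]

def generate(scales, seed=0):
    rng = random.Random(seed)
    cache = []            # stands in for cached keys/values of earlier scales
    token_maps = []
    forward_passes = 0
    for s in scales:
        tokens = predict_scale(cache, s * s, rng)   # all s*s tokens at once
        forward_passes += 1
        token_maps.append(tokens)
        cache.extend(tokens)                        # earlier entries never change
    return token_maps, forward_passes

token_maps, forward_passes = generate(SCALES)
print(forward_passes, sum(len(m) for m in token_maps))  # 10 680
```

The cache is only ever appended to, which is the property that makes standard KV-caching applicable.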

Why can all tokens within a single scale be generated in parallel during inference?

Chapter 4: Training

Training VAR is refreshingly simple. It mirrors standard language model training — predict the next token given the context — but the "tokens" are organized by scale rather than by raster position.

Loss function

Standard cross-entropy loss, summed across all scales and all token positions within each scale:

L = ∑_{k=1}^{K} ∑_{(i,j) ∈ r_k} −log p_θ(r_k^{(i,j)} | r_1, ..., r_{k-1})

Each token is classified into one of V = 4096 codebook entries. The loss is identical to language model training — just cross-entropy over a vocabulary.
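
The summed cross-entropy above can be written out in a few lines. This is a minimal pure-Python sketch with a toy vocabulary of 8 entries and two scales (1×1 and 2×2), chosen only to keep the example small; the names `log_softmax` and `var_loss` are illustrative, and a real training loop would use a tensor library and V = 4096.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one token position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def var_loss(logits_per_scale, targets_per_scale):
    """Summed cross-entropy over every token of every scale.

    logits_per_scale[k][p] is a list of V logits for position p of scale k;
    targets_per_scale[k][p] is the ground-truth codebook index there.
    """
    total = 0.0
    for logits_k, targets_k in zip(logits_per_scale, targets_per_scale):
        for logits, target in zip(logits_k, targets_k):
            total -= log_softmax(logits)[target]
    return total

V = 8                                    # toy vocabulary size (assumed)
scales = [1, 2]                          # 1 + 4 = 5 tokens in total
logits = [[[0.0] * V for _ in range(s * s)] for s in scales]   # uniform guess
targets = [[(3 * p) % V for p in range(s * s)] for s in scales]

loss = var_loss(logits, targets)
print(loss, 5 * math.log(V))
```

With uniform logits every token costs ln V, so the loss equals the token count times ln V, a useful sanity value for an untrained model.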

Two-stage training

VAR training has two completely independent stages:

Stage 1
Train the multi-scale VQVAE on OpenImages. This produces the tokenizer that encodes images into K-scale token maps. Train once, freeze, reuse.
Stage 2
Train the VAR transformer on ImageNet. Tokenize all images with the frozen VQVAE, then train the transformer to predict next-scale token maps via cross-entropy. 200-350 epochs depending on model size.

Training details

No bells and whistles: VAR deliberately avoids advanced LLM techniques like RoPE, SwiGLU, or RMSNorm. The entire architectural novelty is in the multi-scale formulation and block-wise causal mask. The transformer itself is vanilla GPT-2. This makes the paper's results a clean measurement of the next-scale paradigm — not confounded by architectural tricks.
What loss function does VAR use, and how does it compare to standard language model training?

Chapter 5: Scaling Laws

This is where VAR gets truly exciting. One of the most important properties of LLMs is their adherence to power-law scaling laws: as you increase model size, data, or compute, the test loss decreases smoothly and predictably. This lets you forecast the performance of a 100B model from a 1B model, guiding resource allocation before you commit the compute.

Prior visual AR models (VQGAN, ViT-VQGAN, RQ-Transformer) showed no clear scaling behavior. Making them bigger didn't reliably make them better. This was a fundamental blocker for the "scale up and win" strategy that powered GPT-3 and GPT-4.

VAR scales like an LLM

The paper trains VAR models from d=12 (130M params) to d=30 (2B params) and measures test cross-entropy loss. The results are striking:

L = (2.5 / N)^{0.20}   (correlation R = −0.998)

Where N is the number of parameters in billions. This is a near-perfect power law — the correlation coefficient of −0.998 means the log-log line fits almost exactly. For comparison, the Chinchilla scaling law for LLMs has a similar correlation.
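
A power law is a straight line on a log-log plot, so both the exponent and the correlation coefficient can be recovered with ordinary least squares on (log N, log L). The sketch below shows that procedure on synthetic, noise-free points; the data and the prefactor 1.2 are made up for illustration and are not the paper's measurements, and `fit_power_law` is a hypothetical helper name.

```python
import math

def fit_power_law(ns, losses):
    """Least-squares line through (log N, log L); returns slope, intercept, r."""
    xs = [math.log(n) for n in ns]
    ys = [math.log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / math.sqrt(sxx * syy)

# Synthetic points on L = 1.2 * N**(-0.20), with N in billions (illustrative).
ns = [0.13, 0.31, 0.6, 1.0, 2.0]
losses = [1.2 * n ** -0.20 for n in ns]

slope, intercept, r = fit_power_law(ns, losses)
print(f"exponent={slope:.3f}, prefactor={math.exp(intercept):.3f}, r={r:.3f}")
```

On exact power-law data the correlation is −1; measured losses scatter around the line, and a value such as −0.998 indicates how small that scatter is.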

Token error rate also scales

Beyond cross-entropy loss, VAR also measures the token error rate (fraction of tokens predicted incorrectly). This too follows a power law:

Err = (5 × 10² · N)^{−0.02}   (correlation R = −0.994)

Both the "last scale" error rate and the "all scales" error rate show power-law scaling, confirming that the improvement is not just a loss artifact — the model genuinely gets better at predicting visual tokens as it grows.

Why raster-scan AR doesn't scale: The paper argues that raster scan's violated mathematical premise (bidirectional dependencies forced into a unidirectional model) creates a fundamental bottleneck. Throwing more parameters at a structurally mismatched formulation doesn't help. VAR removes this mismatch, and scaling behavior emerges naturally — just as it does in language, where left-to-right prediction matches the inherent structure of text.
VAR Scaling Law

Log-log plot of test loss vs model parameters. VAR (teal) follows a near-perfect power law. For comparison, raster-scan AR (gray) shows no clear scaling trend.

What is the correlation coefficient for VAR's scaling law (test loss vs model parameters on a log-log plot)?

Chapter 6: Results

VAR is benchmarked on ImageNet 256×256 and 512×512 class-conditional generation. The results represent a milestone: the first time a GPT-style autoregressive model surpasses diffusion transformers.

ImageNet 256×256

The headline numbers for VAR-d30 with rejection sampling:

For context, the ImageNet 256×256 validation set itself has an FID of 1.78. VAR's 1.73 is actually below the reference — meaning the generated images are, by the FID metric, more "ImageNet-like" than the validation set itself.

Comparison across model families

VAR consistently outperforms every model family:

Data efficiency: VAR-d30 trains for only 350 epochs on ImageNet. DiT-XL/2 requires 1,400 epochs — 4× more training. Yet VAR achieves better FID. This suggests the next-scale inductive bias is not just faster at inference — it is a fundamentally more efficient way to learn visual distributions.
FID Comparison Across Model Families

ImageNet 256×256 class-conditional generation. Lower FID is better. VAR (teal) achieves state-of-the-art, surpassing all diffusion, GAN, and prior AR models.

What FID does VAR-d30 achieve on ImageNet 256×256, and how does it compare to DiT-XL/2?

Chapter 7: Generation Speed

Speed is one of VAR's most compelling advantages. The reason is simple arithmetic.

Raster-scan AR: O(n²) steps, O(n⁶) compute

A raster-scan model generating an n×n token map needs n² sequential autoregressive steps. If step k only attends from the newest token to the k tokens so far, attention costs ∑_{k=1}^{n²} k = O(n⁴) in total. The O(n⁶) figure corresponds to the costlier accounting in which step k computes attention over the whole k-token prefix at O(k²) cost, giving ∑_{k=1}^{n²} k² = O(n⁶).

For a 16×16 map: 256 sequential steps. For 32×32: 1,024 sequential steps.

VAR: O(log n) steps, O(n⁴) compute

VAR uses K = 10 scales for 16×16 (and ~13 for 32×32). At each step, it processes the entire scale map in one parallel forward pass. The total sequence length is 681 tokens — but there are only 10 sequential steps, not 256.

The total attention cost is O(L²) where L = ∑ h_k w_k ≈ n². So total compute is O(n⁴), a factor of n² improvement over raster scan.
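
The asymptotic comparison can be made concrete for the 16×16 case. The sketch below counts sequential steps and a rough attention-work figure for both schemes. It assumes, as a simplification, that a step whose visible prefix has length m costs m² units; constants, MLP cost, and caching details are ignored, so only the relative sizes are meaningful. The function names are illustrative.

```python
SCALES = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]

def raster_cost(n):
    """(sequential steps, sum of i^2) for an n x n map, one token per step."""
    steps = n * n
    return steps, sum(i * i for i in range(1, steps + 1))

def var_cost(scales):
    """(sequential steps, sum of prefix_length^2), one scale per step."""
    prefix, work = 0, 0
    for s in scales:
        prefix += s * s            # tokens visible once this scale is included
        work += prefix * prefix    # assumed cost of attention over the prefix
    return len(scales), work

raster_steps, raster_work = raster_cost(16)
var_steps, var_work = var_cost(SCALES)

print(raster_steps, raster_work)   # 256 5625216
print(var_steps, var_work)         # 10 743654
```

Under this accounting VAR needs roughly 7.6× less attention work at 16×16, but the larger practical gain is the drop from 256 sequential steps to 10, since sequential steps cannot be parallelized on a GPU.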

Wall-clock numbers

The paper reports relative wall-clock times for generating a single 256×256 image:

VAR achieves nearly GAN-level speed (StyleGAN-XL is ~0.3×) while producing better images than any diffusion model.

10 steps vs 250 steps: DiT needs 250 denoising steps to generate one image. VAR needs 10 autoregressive scale steps. Even accounting for VAR's longer sequence per step (681 tokens vs DiT's fixed-size input), the 25× reduction in sequential steps dominates. VAR is around 20× faster in wall-clock time — and both models benefit from the same GPU parallelism.
Sequential Steps Comparison

Number of sequential model forward passes required to generate one 256×256 image. Fewer steps = faster generation.

Why is VAR approximately 20× faster than DiT at generating images?

Chapter 8: Zero-Shot Capabilities

One of the most appealing properties of LLMs is zero-shot generalization — performing tasks the model was never explicitly trained on. GPT-3 can translate languages, answer questions, and write code without task-specific fine-tuning. Can a visual AR model do the same?

Image inpainting

Given an image with a masked region, fill in the missing content. VAR handles this naturally: encode the unmasked regions at all scales, then let the transformer predict the masked tokens conditioned on the unmasked context. Because VAR preserves spatial structure (no flattening), it can inpaint any region — center, edges, or arbitrary shapes.

Compare this to raster-scan AR: it can only "inpaint" tokens that come after the masked region in raster order. It cannot fill in a missing top-left corner if the bottom-right is given.

Image outpainting

Extend an image beyond its borders. VAR encodes the existing image at all scales (with the extension region masked), then generates the missing border tokens. The coarse-to-fine structure ensures global coherence — the 1×1 scale captures the overall scene context, preventing the outpainted region from being semantically inconsistent.

Image editing

Change a specific attribute of an image (e.g., change a dog's breed, add sunglasses). VAR can condition on the coarse scales of the original image (preserving global structure) while regenerating fine scales with a different class label or text prompt. The coarse scales act as a structural anchor — the edited image keeps the same pose, composition, and background while changing the target attribute.

Why zero-shot works in VAR: The multi-scale structure creates a natural separation between global structure (coarse scales) and local detail (fine scales). By conditioning on some scales and regenerating others, VAR can perform editing, style transfer, and completion tasks without any task-specific training. This mirrors how LLMs use in-context prompting — the "prompt" is the provided scales, and the "completion" is the generated scales.
Why can VAR perform zero-shot inpainting of ANY image region, while raster-scan AR models cannot?

Chapter 9: Connections

What VAR built on

VQVAE / VQGAN (van den Oord et al. 2017, Esser et al. 2021): The foundation — learning discrete codebooks for image tokenization. VAR reuses the VQGAN architecture but adds multi-scale quantization.

RQ-Transformer (Lee et al. 2022): Also uses residual quantization with multiple "scales" of codes, but applies them at each spatial position independently, then autoregresses in raster order. VAR's key departure is making the scales correspond to different spatial resolutions, not different refinement levels at the same position.

GPT-2 (Radford et al. 2019): VAR adopts the GPT-2 decoder-only transformer architecture nearly verbatim. The innovation is entirely in the data representation (multi-scale token maps) and attention mask (block-wise causal), not in the transformer itself.

DiT (Peebles & Xie 2023): The diffusion transformer that VAR surpasses. DiT showed transformers work for diffusion; VAR shows they work even better for next-scale autoregression.

What VAR influenced

Unified multimodal AR: VAR's proof that visual AR can scale like LLMs opened the door to truly unified vision-language models that use next-token/next-scale prediction for both modalities.

MaskGIT and parallel decoding: MaskGIT also generates tokens in parallel, but uses masked prediction (BERT-style) rather than autoregression. VAR shows that autoregressive models can achieve parallel generation too — by changing the unit of autoregression from tokens to scales.

Scaling law research: VAR is one of the first works to demonstrate clear power-law scaling in visual generation. This has spurred further investigation into scaling laws for image and video models.

VAR's legacy: The core insight — that the right autoregressive ordering for images is coarse-to-fine, not raster-scan — is likely to influence visual generation for years. It resolves the fundamental tension between autoregressive modeling (which needs an ordering) and 2D images (which don't have one). By defining the ordering over resolution scales rather than spatial positions, VAR makes GPT-style models competitive with diffusion for the first time, while exhibiting the scaling laws and zero-shot generalization that made LLMs transformative.

Cheat sheet

Core idea
Autoregress over resolution scales (1×1 → 2×2 → ... → 16×16), not over individual tokens in raster order
Tokenizer
Multi-scale VQVAE with residual encoding, shared codebook (V=4096), K=10 scales
Architecture
GPT-2 decoder-only transformer with block-wise causal mask and AdaLN
Results
FID 1.73 on ImageNet 256×256, 20× faster than DiT, power-law scaling (R=−0.998)
Impact
First GPT-style AR to beat diffusion. Shows visual AR can have LLM-like scaling and zero-shot generalization
What is the fundamental difference between VAR and RQ-Transformer, which also uses residual quantization?