Scalable image generation via next-scale prediction — a coarse-to-fine autoregressive paradigm that makes GPT-style models surpass diffusion transformers for the first time.
Autoregressive (AR) models dominate language. GPT predicts the next token, left to right, and scales beautifully. Naturally, researchers asked: can we do the same for images?
The standard approach (VQGAN, DALL-E) works like this: encode an image into a grid of discrete tokens (say 16×16 = 256 tokens), flatten the grid into a 1D sequence using raster-scan order (left to right, top to bottom, like reading a book), and train a GPT-style transformer to predict the next token in this sequence.
It works. But it works poorly. By 2024, the best raster-scan AR models on ImageNet 256×256 achieved an FID of around 3-5 (with rejection sampling). Diffusion Transformers like DiT hit 2.27. The AR approach seemed fundamentally limited.
The problem is structural. Raster scan imposes a left-to-right, top-to-bottom ordering on image tokens. But images are not text. They have four critical differences:
Left: Raster-scan AR generates tokens one at a time in reading order. Right: VAR generates entire token maps, scale by scale, coarse to fine.
How do humans perceive images? Not pixel by pixel from the top-left corner. We grasp the global structure first ("it's a dog on grass") and then fill in details: the breed, the fur texture, the individual blades of grass. Painters work the same way: rough sketch first, then progressively finer strokes.
VAR redefines the autoregressive unit. Instead of predicting the next token in a raster sequence, it predicts the next scale in a resolution hierarchy.
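Standard AR factors the image likelihood as a product over individual tokens:
$$p(x_1, x_2, \dots, x_T) = \prod_{t=1}^{T} p(x_t \mid x_1, x_2, \dots, x_{t-1})$$
VAR factors it as a product over scale maps:
$$p(r_1, r_2, \dots, r_K) = \prod_{k=1}^{K} p(r_k \mid r_1, r_2, \dots, r_{k-1})$$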
Where each r_k is an entire token map at resolution h_k × w_k. The sequence of scales goes: 1×1, 2×2, 3×3, 4×4, 5×5, 6×6, 8×8, 10×10, 13×13, 16×16.
That is K = 10 scales, from a single token capturing the global gist to a 16×16 map with full detail.
Within each scale, all tokens are generated in parallel; there is no ordering imposed among tokens at the same resolution. The autoregression happens only across scales: scale k depends on scales 1 through k-1, but the h_k × w_k tokens within scale k are conditionally independent given the prefix.
This coarse-to-fine structure also naturally handles the bidirectional dependency problem. At each scale, every token can see the entire image at all coarser resolutions. The 1×1 token captures the global structure. The 2×2 map refines it. By the time we reach 16×16, the model has seen the full image context at every coarser level — effectively giving it bidirectional awareness.
VAR's next-scale prediction requires ground-truth token maps at multiple resolutions. A standard VQVAE gives you one token map (e.g., 16×16). VAR needs a multi-scale VQVAE that produces K token maps at increasing resolutions.
The architecture reuses the standard VQVAE encoder and decoder — same CNN as VQGAN. The only modification is a multi-scale quantization layer that replaces the single-scale quantizer. Here is the encoding procedure:
The residual subtraction is key: after encoding scale k, we remove the information that scale k captured from f. Scale k+1 then encodes the residual — the detail that coarser scales missed. This ensures each scale adds new information rather than redundantly re-encoding the same content.
To reconstruct, we reverse the process: start with f̂ = 0, then for each scale k, look up the codebook vectors for r_k, upsample to full resolution, pass through φ_k, and add to f̂. After all K scales, decode f̂ with the standard VQVAE decoder.
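The encode and reconstruct loops described above can be sketched in NumPy. This is a toy illustration under stated assumptions, not the paper's implementation: resizing is nearest-neighbour, the per-scale layers φ_k are omitted, the codebook is random rather than learned, the scale schedule is assumed, and all function names are hypothetical.

```python
import numpy as np

def resize_nearest(x, size):
    """Nearest-neighbour resize of an (H, W, C) array to (size, size, C)."""
    rows = np.arange(size) * x.shape[0] // size
    cols = np.arange(size) * x.shape[1] // size
    return x[rows][:, cols]

def quantize(x, codebook):
    """Nearest codebook index for every position of an (h, w, C) map."""
    dists = ((x[..., None, :] - codebook) ** 2).sum(axis=-1)  # (h, w, V)
    return dists.argmin(axis=-1)

def encode_multiscale(f, codebook, scales):
    """Residual quantization: each scale encodes what coarser scales missed."""
    full = f.shape[0]
    residual = f.copy()
    f_hat = np.zeros_like(f)
    token_maps = []
    for s in scales:
        r_k = quantize(resize_nearest(residual, s), codebook)  # (s, s) indices
        z_k = resize_nearest(codebook[r_k], full)              # back to full size
        f_hat += z_k
        residual -= z_k  # remove what this scale captured
        token_maps.append(r_k)
    return token_maps, f_hat

def reconstruct_features(token_maps, codebook, full):
    """Reverse pass: start from zero and add each scale's upsampled codes."""
    f_hat = np.zeros((full, full, codebook.shape[1]))
    for r_k in token_maps:
        f_hat += resize_nearest(codebook[r_k], full)
    return f_hat

rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 4))         # toy codebook: 32 entries, 4 channels
f = rng.normal(size=(16, 16, 4))            # toy encoder feature map
scales = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)  # assumed scale schedule
token_maps, f_hat = encode_multiscale(f, codebook, scales)
```

With a random codebook the residual is not guaranteed to shrink at every scale; the point of the sketch is the data flow, in which the token maps alone are enough to rebuild f̂.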
A single codebook Z with V = 4096 entries is shared across all scales. This means every scale's tokens come from the same vocabulary — critical for the transformer, which uses a single embedding layer.
Watch the residual encoding process. At each scale, the encoder quantizes the current feature map, then subtracts the quantized representation. The residual shrinks as more scales capture detail.
With multi-scale token maps in hand, we need a model that can autoregress over scales. VAR uses a surprisingly standard architecture: a GPT-2-style decoder-only transformer.
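The input sequence is constructed by concatenating all scale maps in order: [s], r_1, r_2, ..., r_K.
Where [s] is a start token carrying class conditioning information, and each r_k is the flattened h_k × w_k token map at scale k. The total sequence length is the start token plus every scale's tokens, 1 + ∑ h_k w_k summed over k = 1 to K, which comes to 681 for the 10-scale, 16×16 configuration.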
Each token gets a learned embedding plus a scale-specific positional embedding map.
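As a quick check of the sequence-length arithmetic, the sketch below sums the tokens in each scale map. The side lengths 1, 2, 3, 4, 5, 6, 8, 10, 13, 16 are assumed here (a 10-scale schedule ending at 16×16), and the names `SCALES`, `tokens_per_scale`, and `sequence_length` are illustrative.

```python
# Assumed 10-scale schedule of side lengths, ending at a 16x16 token map.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

def tokens_per_scale(scales):
    """Number of tokens h_k * w_k in each (square) scale map."""
    return [s * s for s in scales]

def sequence_length(scales, start_tokens=1):
    """Length of [s], r_1, ..., r_K: the start token plus every scale's tokens."""
    return start_tokens + sum(tokens_per_scale(scales))

print(tokens_per_scale(SCALES))  # tokens contributed by each scale
print(sequence_length(SCALES))   # 681
```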
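This is the key architectural detail. Unlike standard causal masking (each token attends to all previous tokens), VAR uses a block-wise causal mask: a token at scale k attends to every token at scales 1 through k, including the other tokens in its own scale, and to no token at a finer scale.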
This is what enables parallel generation within each scale. Since all tokens at scale k see the same context (all previous scales), they can be generated independently and simultaneously.
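A block-wise causal mask of this kind can be built by tagging each sequence position with its scale index and comparing tags. The sketch below is a minimal NumPy version under assumptions: the four-scale schedule is a toy example, the start token is left out for brevity, and `blockwise_causal_mask` is a hypothetical name.

```python
import numpy as np

def blockwise_causal_mask(scales):
    """Boolean (L, L) mask; entry [q, k] is True when query q may attend to key k."""
    sizes = [s * s for s in scales]
    # Scale index of every position in the flattened sequence.
    scale_of = np.repeat(np.arange(len(sizes)), sizes)
    # A query sees keys from its own scale and from all coarser scales.
    return scale_of[:, None] >= scale_of[None, :]

mask = blockwise_causal_mask((1, 2, 3, 4))  # 1 + 4 + 9 + 16 = 30 positions
print(mask.astype(int))
```

Positions 1 through 4 form the 2×2 block: they attend to one another and to position 0, but not to position 5, which belongs to the 3×3 scale.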
VAR uses Adaptive Layer Normalization (AdaLN) conditioned on the class label — the same technique used in DiT. The model follows a simple scaling rule:
Where d is the depth (number of transformer layers). This gives model sizes:
The total parameter count scales as N(d) = 73,728 d³ — cubically with depth.
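The cubic parameter formula is easy to evaluate directly. The sketch below tabulates N(d) for a few depths; the width rule of 64·d is an assumption made here for illustration (with roughly 18·width² parameters per layer it reproduces the 73,728·d³ figure), and the function names are hypothetical.

```python
def var_width(depth):
    """Assumed width rule: embedding width grows as 64 * depth."""
    return 64 * depth

def var_param_count(depth):
    """Parameter count N(d) = 73,728 * d^3, as stated in the text."""
    return 73_728 * depth ** 3

for d in (12, 16, 20, 24, 30):
    # Consistency check: about 18 * width^2 parameters per layer under the assumed rule.
    assert 18 * d * var_width(d) ** 2 == var_param_count(d)
    print(f"d={d:2d}  width={var_width(d):4d}  params={var_param_count(d) / 1e9:.2f}B")
```

Depth 12 comes out at about 0.13B parameters and depth 30 at about 1.99B.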
The attention mask for 4 scales. Green = can attend, dark = masked. Notice: within each scale block, attention is bidirectional (full green square on the diagonal). Across scales, it is strictly causal (green below diagonal blocks only).
During inference, the model generates scale by scale. At step k, it feeds in the tokens from scale k-1 (or the start token for k=1), attends to all cached KV pairs from previous scales, and outputs logits for all h_k × w_k token positions simultaneously. Standard KV-caching works because the attention is causal across scales: previously computed keys and values never change.
Training VAR is refreshingly simple. It mirrors standard language model training — predict the next token given the context — but the "tokens" are organized by scale rather than by raster position.
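Standard cross-entropy loss, summed across all scales and all token positions within each scale:
$$\mathcal{L} = -\sum_{k=1}^{K} \sum_{i=1}^{h_k w_k} \log p\big(r_k^{(i)} \mid r_1, r_2, \dots, r_{k-1}\big)$$
Here r_k^{(i)} denotes the i-th token of scale map r_k.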
Each token is classified into one of V = 4096 codebook entries. The loss is identical to language model training — just cross-entropy over a vocabulary.
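To make the loss concrete, here is a minimal NumPy sketch of summed cross-entropy over a flattened multi-scale sequence. The logits are all zeros as a stand-in for a model's output, the length of 680 tokens is an assumption (ten scale maps ending at 16×16), and `cross_entropy_sum` is a hypothetical helper.

```python
import numpy as np

def cross_entropy_sum(logits, targets):
    """Summed cross-entropy for (L, V) logits and (L,) integer targets."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

V = 4096          # codebook size from the text
L = 680           # assumed number of tokens across all scale maps
rng = np.random.default_rng(0)
targets = rng.integers(0, V, size=L)  # stand-in ground-truth token ids
logits = np.zeros((L, V))             # an untrained model that predicts uniformly

loss = cross_entropy_sum(logits, targets)
print(loss / L)  # per-token loss; equals log(V) for uniform predictions
```

A uniform prediction gives a per-token loss of log 4096 ≈ 8.32, which is the natural baseline a trained model has to beat.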
VAR training has two completely independent stages:
This is where VAR gets truly exciting. One of the most important properties of LLMs is their adherence to power-law scaling laws: as you increase model size, data, or compute, the test loss decreases smoothly and predictably. This lets you forecast the performance of a 100B model from a 1B model, guiding resource allocation before you commit the compute.
Prior visual AR models (VQGAN, ViT-VQGAN, RQ-Transformer) showed no clear scaling behavior. Making them bigger didn't reliably make them better. This was a fundamental blocker for the "scale up and win" strategy that powered GPT-3 and GPT-4.
The paper trains VAR models from d=12 (130M params) to d=30 (2B params) and measures test cross-entropy loss. The results are striking:
Where N is the number of parameters in billions. This is a near-perfect power law — the correlation coefficient of −0.998 means the log-log line fits almost exactly. For comparison, the Chinchilla scaling law for LLMs has a similar correlation.
Beyond cross-entropy loss, VAR also measures the token error rate (fraction of tokens predicted incorrectly). This too follows a power law:
Both the "last scale" error rate and the "all scales" error rate show power-law scaling, confirming that the improvement is not just a loss artifact — the model genuinely gets better at predicting visual tokens as it grows.
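For readers unfamiliar with how such a fit is made, the sketch below fits a straight line in log-log space and reports the correlation coefficient. The data are synthetic and generated from a made-up power law purely for illustration; they are not the paper's measurements, and the exponent and prefactor are hypothetical.

```python
import numpy as np

# Synthetic, illustrative data only: losses generated from a made-up power law.
n_params = np.array([0.13, 0.3, 0.6, 1.0, 2.0])  # hypothetical model sizes, in billions
loss = 2.0 * n_params ** -0.2                    # hypothetical law, not the paper's fit

log_n, log_loss = np.log(n_params), np.log(loss)
slope, intercept = np.polyfit(log_n, log_loss, deg=1)  # straight line in log-log space
r = np.corrcoef(log_n, log_loss)[0, 1]                 # Pearson correlation of the logs

print(f"exponent={slope:.3f}  prefactor={np.exp(intercept):.3f}  r={r:.4f}")
```

A correlation close to −1 between log N and log loss is what "near-perfect power law" means here: the points sit almost exactly on a downward-sloping line.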
Log-log plot of test loss vs model parameters. VAR (teal) follows a near-perfect power law. For comparison, raster-scan AR (gray) shows no clear scaling trend.
VAR is benchmarked on ImageNet 256×256 and 512×512 class-conditional generation. The results represent a milestone: the first time a GPT-style autoregressive model surpasses diffusion transformers.
The headline numbers for VAR-d30 with rejection sampling:
For context, the ImageNet 256×256 validation set itself has an FID of 1.78. VAR's 1.73 is actually below the reference — meaning the generated images are, by the FID metric, more "ImageNet-like" than the validation set itself.
VAR consistently outperforms every model family:
ImageNet 256×256 class-conditional generation. Lower FID is better. VAR (teal) achieves state-of-the-art, surpassing all diffusion, GAN, and prior AR models.
Speed is one of VAR's most compelling advantages. The reason is simple arithmetic.
A raster-scan model generating an n×n token map needs n² sequential autoregressive steps. Step i attends over the i tokens produced so far, so with a KV cache the total attention cost is ∑_{i=1}^{n²} i = O(n⁴). Without cache reuse, step i recomputes full self-attention over its length-i prefix at O(i²) cost, and the total becomes ∑_{i=1}^{n²} i² = O(n⁶). The O(n⁶) figure is the one used for raster scan in the comparison here.
For a 16×16 map: 256 sequential steps. For 32×32: 1,024 sequential steps.
VAR uses K = 10 scales for 16×16 (and ~13 for 32×32). At each step, it processes the entire scale map in one parallel forward pass. The total sequence length is 681 tokens — but there are only 10 sequential steps, not 256.
The total attention cost is O(L²) where L = ∑ h_k w_k, which is O(n²) when the scale sizes grow roughly geometrically. So total compute is O(n⁴), a factor of n² improvement over raster scan.
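The arithmetic can be checked for the 16×16 case. The sketch below counts sequential steps and attention pairs for both schemes; the raster count uses full self-attention over the prefix at every step (the accounting behind O(n⁶)), the VAR count uses the block-wise mask, and the scale schedule and function names are assumptions made for illustration.

```python
# Assumed 10-scale schedule of side lengths for a 16x16 final token map.
SCALES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

def raster_steps(n):
    """One sequential step per token."""
    return n * n

def raster_attention_cost(n):
    """Sum of i^2 for i = 1..n^2: full self-attention recomputed at each step."""
    t = n * n
    return t * (t + 1) * (2 * t + 1) // 6

def var_steps(scales):
    """One sequential step per scale."""
    return len(scales)

def var_attention_cost(scales):
    """Each token of scale k attends to every token of scales 1..k."""
    total, seen = 0, 0
    for s in scales:
        seen += s * s
        total += s * s * seen
    return total

print(raster_steps(16), raster_attention_cost(16))
print(var_steps(SCALES), var_attention_cost(SCALES))
```

Under these counting assumptions, raster scan needs 256 steps and about 5.6 million attention pairs, while VAR needs 10 steps and about 0.29 million.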
The paper reports relative wall-clock times for generating a single 256×256 image:
VAR achieves nearly GAN-level speed (StyleGAN-XL is ~0.3×) while producing better images than any diffusion model.
Number of sequential model forward passes required to generate one 256×256 image. Fewer steps = faster generation.
One of the most appealing properties of LLMs is zero-shot generalization — performing tasks the model was never explicitly trained on. GPT-3 can translate languages, answer questions, and write code without task-specific fine-tuning. Can a visual AR model do the same?
Given an image with a masked region, fill in the missing content. VAR handles this naturally: encode the unmasked regions at all scales, then let the transformer predict the masked tokens conditioned on the unmasked context. Because VAR preserves spatial structure (no flattening), it can inpaint any region — center, edges, or arbitrary shapes.
Compare this to raster-scan AR: it can only condition on tokens that come before the masked region in raster order, so it can only fill in content that follows the known context. It cannot fill in a missing top-left corner if the bottom-right is given.
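One way to picture the inpainting step at a single scale: resize the pixel-space mask to that scale's resolution, keep the known tokens outside it, and take the model's predictions inside it. The sketch below is a toy illustration under assumptions; the constant arrays stand in for real token maps, nearest-neighbour mask resizing is a simplification, and the helper names are hypothetical.

```python
import numpy as np

def resize_mask(mask, size):
    """Nearest-neighbour resize of a square boolean mask to (size, size)."""
    rows = np.arange(size) * mask.shape[0] // size
    cols = np.arange(size) * mask.shape[1] // size
    return mask[rows][:, cols]

def merge_tokens(predicted, known, mask_full):
    """Keep known tokens outside the mask; use predicted tokens inside it."""
    mask_k = resize_mask(mask_full, predicted.shape[0])
    return np.where(mask_k, predicted, known)

mask_full = np.zeros((16, 16), dtype=bool)
mask_full[4:12, 4:12] = True             # region to inpaint (centre square)

known = np.zeros((4, 4), dtype=int)      # stand-in tokens from the original image
predicted = np.ones((4, 4), dtype=int)   # stand-in tokens from the transformer
merged = merge_tokens(predicted, known, mask_full)
print(merged)
```

Repeating this merge at every scale keeps the unmasked context fixed from the coarsest map to the finest, so each later scale is predicted with the original content already in place.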
Extend an image beyond its borders. VAR encodes the existing image at all scales (with the extension region masked), then generates the missing border tokens. The coarse-to-fine structure ensures global coherence — the 1×1 scale captures the overall scene context, preventing the outpainted region from being semantically inconsistent.
Change a specific attribute of an image (e.g., change a dog's breed, add sunglasses). VAR can condition on the coarse scales of the original image (preserving global structure) while regenerating fine scales with a different class label or text prompt. The coarse scales act as a structural anchor — the edited image keeps the same pose, composition, and background while changing the target attribute.
VQVAE / VQGAN (van den Oord et al. 2017, Esser et al. 2021): The foundation — learning discrete codebooks for image tokenization. VAR reuses the VQGAN architecture but adds multi-scale quantization.
RQ-Transformer (Lee et al. 2022): Also uses residual quantization with multiple "scales" of codes, but applies them at each spatial position independently, then autoregresses in raster order. VAR's key departure is making the scales correspond to different spatial resolutions, not different refinement levels at the same position.
GPT-2 (Radford et al. 2019): VAR adopts the GPT-2 decoder-only transformer architecture nearly verbatim. The innovation is entirely in the data representation (multi-scale token maps) and attention mask (block-wise causal), not in the transformer itself.
DiT (Peebles & Xie 2023): The diffusion transformer that VAR surpasses. DiT showed transformers work for diffusion; VAR shows they work even better for next-scale autoregression.
Unified multimodal AR: VAR's proof that visual AR can scale like LLMs opened the door to truly unified vision-language models that use next-token/next-scale prediction for both modalities.
MaskGIT and parallel decoding: MaskGIT also generates tokens in parallel, but uses masked prediction (BERT-style) rather than autoregression. VAR shows that autoregressive models can achieve parallel generation too — by changing the unit of autoregression from tokens to scales.
Scaling law research: VAR is one of the first works to demonstrate clear power-law scaling in visual generation. This has spurred further investigation into scaling laws for image and video models.