Yu, Liang, Shi et al. (Meta) — 2023

CM3Leon: Autoregressive Multimodal

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning. A retrieval-augmented, instruction-tuned autoregressive model that rivals diffusion on image generation with far less training compute.

Prerequisites: Transformers + Image tokenization + Retrieval augmentation basics. That's it.

Chapter 0: The Efficiency Gap

In 2023, there was a puzzling asymmetry in generative AI. For text, autoregressive models (GPT-4, LLaMA) dominated. For images, diffusion models (DALL-E 2, Stable Diffusion) dominated. Autoregressive image generation existed (DALL-E 1), but it was far less efficient: reaching the quality of diffusion models took substantially more training compute.

Why the gap? Two reasons:

| Problem | Why Diffusion Won | What AR Lacked |
| --- | --- | --- |
| Training efficiency | Diffusion uses classifier-free guidance (CFG) for a free quality boost | AR models had no equivalent of CFG |
| Data efficiency | Diffusion leverages massive image-text datasets with contrastive pretraining (CLIP) | AR multimodal models were trained from scratch with no retrieval |

CM3Leon closes this gap by borrowing the best ideas from both worlds: a retrieval-augmented training pipeline that dramatically improves data efficiency, plus novel decoding strategies (including a form of classifier-free guidance for autoregressive models) that boost generation quality.

CM3Leon's headline result: It achieves state-of-the-art image generation (FID 4.88 on MS-COCO) with 5× LESS compute than comparable autoregressive models, and it's competitive with diffusion models like DALL-E 2. It also handles text generation, image captioning, and visual QA within the same model.

What two innovations allow CM3Leon to close the efficiency gap between autoregressive and diffusion image generation?

Chapter 1: CM3 Foundation

CM3Leon builds on the CM3 objective — Causally Masked Multimodal Modeling. This is a training objective that combines autoregressive generation with infilling (masked span prediction), enabling the model to both generate and complete multimodal content.

The CM3 objective

Standard autoregressive models predict left-to-right: given all previous tokens, predict the next one. CM3 adds infilling: randomly mask spans of tokens, move them to the end, and predict them. This trains the model to generate content conditioned on context from BOTH sides.

Original Document
The [IMAGE_TOKENS] cat sat on the mat.
↓ mask a span
Input (with mask)
The <mask> sat on the mat.
Target (infill)
[IMAGE_TOKENS] cat

This is crucial for image generation: given surrounding text ("A sunset over the ocean [IMAGE] Beautiful colors"), the model can generate the image tokens by infilling the masked span. Without infilling, the model could only generate images at the end of a sequence.
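The "move the span to the end" step can be made concrete with a small sketch. It assumes a single masked span and a hypothetical sentinel string `<mask:0>`; the helper name and the sentinel spelling are illustrative and do not come from the paper.

```python
def cm3_sequence(tokens, start, length, sentinel="<mask:0>"):
    """Build one causally masked training sequence from a token list.

    The span tokens[start:start + length] is cut out, a sentinel marks the
    gap, and the span is appended after a second copy of the sentinel. The
    model then predicts the span left to right while seeing the text on
    both sides of the gap.
    """
    if length <= 0 or start < 0 or start + length > len(tokens):
        raise ValueError("span must lie inside the token list")
    span = tokens[start:start + length]
    context = tokens[:start] + [sentinel] + tokens[start + length:]
    return context + [sentinel] + span


doc = ["The", "[IMAGE_TOKENS]", "cat", "sat", "on", "the", "mat."]
seq = cm3_sequence(doc, start=1, length=2)
print(" ".join(seq))
# The <mask:0> sat on the mat. <mask:0> [IMAGE_TOKENS] cat
```

Because the span ends up at the end of the sequence, ordinary next-token training is enough: no bidirectional attention is needed to condition on the text after the gap.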

Architecture

| Component | CM3Leon-7B |
| --- | --- |
| Architecture | Decoder-only transformer |
| Parameters | 7 billion |
| Text tokenizer | BPE (56,320 tokens) |
| Image tokenizer | VQ image tokenizer (8,192 codes, 1,024 tokens per 256×256 image) |
| Total vocab | 64,512 |
| Context length | 4,096 |

CM3 vs pure autoregressive: The infilling capability is what separates CM3 from models like DALL-E. DALL-E can only generate images given a text prefix. CM3 can generate images conditioned on text from both before and after the image position, enabling more natural interleaved generation like "Step 1: [IMAGE] Mix the flour. Step 2: [IMAGE] Add water."
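The vocabulary sizes in the architecture table add up to a single shared token space. The sketch below shows one way text and image tokens could share it; the layout (text ids first, image codes offset after them) and the helper names are assumptions for illustration, not details confirmed by the paper.

```python
TEXT_VOCAB = 56_320    # BPE text tokens (from the table)
IMAGE_VOCAB = 8_192    # image tokenizer codes (from the table)
TOTAL_VOCAB = TEXT_VOCAB + IMAGE_VOCAB


def image_code_to_token_id(code):
    """Map an image codebook index to an id in the shared vocabulary.

    Assumed layout: text ids occupy [0, TEXT_VOCAB), image ids follow.
    """
    if not 0 <= code < IMAGE_VOCAB:
        raise ValueError("image code out of range")
    return TEXT_VOCAB + code


def is_image_token(token_id):
    """True if the id falls in the image part of the shared vocabulary."""
    return TEXT_VOCAB <= token_id < TOTAL_VOCAB


print(TOTAL_VOCAB)  # 64512
```

With one vocabulary, the transformer needs a single embedding table and a single output softmax, and an image is just another run of tokens in the sequence.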
python
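# CM3 masking strategy (simplified: one contiguous span per document)
import random

MASK_TOKEN = "<mask>"  # sentinel marking where the span was removed

def cm3_mask(tokens, mask_ratio=0.15):
    # Choose one contiguous span covering about mask_ratio of the tokens
    n_mask = int(len(tokens) * mask_ratio)
    mask_start = random.randint(0, len(tokens) - n_mask)
    masked_span = tokens[mask_start:mask_start + n_mask]

    # Replace the span with a single sentinel token
    input_tokens = tokens[:mask_start] + [MASK_TOKEN] + tokens[mask_start + n_mask:]

    # Target: the removed span. In CM3 it is appended after the input,
    # so it is still predicted left to right.
    target_tokens = masked_span
    return input_tokens, target_tokens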

What does the CM3 objective add over standard autoregressive training?

Chapter 2: Retrieval Augmentation

CM3Leon's main efficiency lever is retrieval-augmented pretraining, an idea it inherits from RA-CM3 and scales up. During training, each document is augmented with retrieved similar documents from a large corpus. This dramatically improves data efficiency.

How retrieval works in training

Training Document
"A golden retriever playing in the park [IMAGE]"
↓ retrieve similar docs
Dense Retriever (CLIP)
Find 2 similar documents from a 1.4B image-text corpus using CLIP embeddings. Both text-to-text and image-to-image retrieval.
Augmented Input
[Retrieved doc 1] [Retrieved doc 2] [Original doc]. Model trains on the combined sequence with standard CM3 objective.

The retriever uses CLIP embeddings for both text and images. For each training document, CM3Leon retrieves the top-2 most similar documents from a 1.4 billion example corpus. The retrieved documents are prepended to the training input, providing the model with relevant examples it can attend to.
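The top-2 lookup described above amounts to a nearest-neighbor search over embeddings. Here is a minimal sketch using cosine similarity on toy 3-dimensional vectors; the vectors stand in for CLIP embeddings, the function names are hypothetical, and a real 1.4B-document corpus would need an approximate nearest-neighbor index rather than this exhaustive scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_neighbors(query, corpus, k=2):
    """Return indices of the k corpus embeddings most similar to the query."""
    ranked = sorted(range(len(corpus)),
                    key=lambda i: cosine(query, corpus[i]),
                    reverse=True)
    return ranked[:k]


# Toy corpus: rows 0 and 1 point in nearly the same direction as the query.
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [1.0, 0.05, 0.0]
neighbors = top_k_neighbors(query, corpus, k=2)
print(neighbors)  # [0, 1]
```

The two returned documents are the ones that get prepended to the training input.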

Why retrieval helps so much

| Benefit | Mechanism | Impact |
| --- | --- | --- |
| In-context learning | Retrieved examples serve as "demonstrations" of the text-image relationship | Model learns patterns from examples, not just from parameters |
| Reduced memorization | Factual knowledge comes from retrieved docs, not weights | Smaller model needed for same quality |
| Data diversity | Each training step sees the original doc + 2 related docs | Effectively 3x training data exposure per step |

The key insight: Retrieval augmentation during PRETRAINING (not just inference) is a large part of CM3Leon's 5x efficiency advantage. The model learns to leverage retrieved examples from the very first training step. Every training document is seen alongside its two nearest neighbors from the corpus, which builds much richer multimodal representations than training on isolated documents.
python
# CM3Leon retrieval-augmented training (sketch). encoder, retriever and
# tokenize are placeholders for a CLIP encoder, a dense index over the
# retrieval corpus, and the shared text/image tokenizer.
def create_training_example(doc, encoder, retriever, tokenize):
    # Encode document with CLIP
    doc_embedding = encoder.encode(doc)  # text + image features

    # Retrieve top-2 similar documents
    retrieved = retriever.search(doc_embedding, k=2)

    # Prepend retrieved docs to training input
    input_seq = (
        tokenize(retrieved[0]) +   # retrieved doc 1
        tokenize(retrieved[1]) +   # retrieved doc 2
        tokenize(doc)              # original document
    )

    # Apply CM3 masking (cm3_mask from Chapter 1) and predict
    masked_input, target = cm3_mask(input_seq)
    return masked_input, target

Why does CM3Leon apply retrieval augmentation during pretraining, not just at inference time?

Chapter 3: Training Recipe

CM3Leon's training follows the scaling laws from the previous paper in this series (Aghajanyan et al. 2023). The mix ratio and training details are carefully chosen based on those findings.

Pretraining data

| Data Source | Type | Size |
| --- | --- | --- |
| Licensed Shutterstock | Image-text pairs | ~3 billion pairs |
| Text corpus | Text only | Standard web text |
| Retrieval corpus | Mixed | ~1.4 billion image-text docs |

CM3Leon uses licensed data from Shutterstock, not scraped web data. This is a notable ethical choice — the model is trained on data where creators were compensated. The quality of licensed data is also generally higher than web-scraped data.

Training details

python
# CM3Leon training configuration
config = {
    "model": "decoder-only transformer",
    "params": "7B",
    "optimizer": "AdamW",
    "lr": 1e-4,
    "warmup": 1500,
    "schedule": "cosine",
    "tokens": 2_400_000_000_000,  # 2.4T total tokens seen
    "context": 4096,
    "batch_size": 8192,  # sequences
    "retrieval_k": 2,   # 2 retrieved docs per training example
    "image_tokens": 1024,  # tokens per image
    "data_mix": "~70% text, ~30% image (following scaling laws)",
}

Note the training token count: 2.4T tokens for a 7B model. A common reading of the Chinchilla result is about 20 tokens per parameter, or ~140B tokens for 7B, so CM3Leon trains on roughly 17x the Chinchilla-optimal count. Training this far past the compute-optimal point spends extra training compute to get a stronger model at a fixed size. One plausible reading is that retrieval augmentation makes each sequence more informative, so the model keeps improving on the additional data instead of overfitting.
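The ~140B figure can be checked with a few lines of arithmetic. The sketch assumes the common rule of thumb of about 20 training tokens per parameter as a stand-in for the Chinchilla result; the constant and the function names are illustrative.

```python
TOKENS_PER_PARAM = 20  # rule-of-thumb reading of Chinchilla (assumption)


def chinchilla_optimal_tokens(n_params):
    """Approximate compute-optimal token count for a model of n_params."""
    return TOKENS_PER_PARAM * n_params


def overtraining_factor(n_params, trained_tokens):
    """How many times past the compute-optimal count a model was trained."""
    return trained_tokens / chinchilla_optimal_tokens(n_params)


optimal = chinchilla_optimal_tokens(7e9)
print(f"{optimal:.2e}")  # 1.40e+11
```

Passing the token budget from the configuration to `overtraining_factor(7e9, ...)` gives the multiple discussed in the note.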

Why does CM3Leon train on far more than the Chinchilla-optimal token count?

Chapter 4: Decoding Strategies

CM3Leon's second key ingredient is classifier-free guidance (CFG) applied to autoregressive decoding. CFG is best known from diffusion models, where it dramatically improves sample quality at the cost of diversity. CM3Leon shows that it is just as effective for autoregressive image generation.

What is classifier-free guidance?

The idea: during generation, compute two predictions — one conditioned on the prompt (conditional) and one without the prompt (unconditional). Then amplify the difference:

$$\text{logits}_{\text{guided}} = \text{logits}_{\text{uncond}} + \alpha \cdot \left(\text{logits}_{\text{cond}} - \text{logits}_{\text{uncond}}\right)$$

Where α is the guidance scale. At α = 1, you get standard conditional generation. At α > 1, the model is pushed to generate outputs more strongly aligned with the prompt. At α = 0, you get unconditional generation.
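The three regimes of α are easy to verify on toy numbers. This is a minimal sketch of the guidance formula on made-up three-token logits; `apply_cfg` is a hypothetical helper name.

```python
def apply_cfg(logits_cond, logits_uncond, alpha):
    """Blend conditional and unconditional logits with guidance scale alpha."""
    return [u + alpha * (c - u) for c, u in zip(logits_cond, logits_uncond)]


cond = [2.0, 1.0, 0.0]     # toy logits with the prompt
uncond = [1.0, 1.0, 1.0]   # toy logits without the prompt

for alpha in (0.0, 1.0, 3.0):
    print(alpha, apply_cfg(cond, uncond, alpha))
# 0.0 [1.0, 1.0, 1.0]   (unconditional)
# 1.0 [2.0, 1.0, 0.0]   (plain conditional)
# 3.0 [4.0, 1.0, -2.0]  (difference amplified)
```

At α = 3 the token the prompt favors gains logit mass and the token it disfavors loses it, while the token on which both passes agree is unchanged.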

CFG for autoregressive models

CFG needs a model that can generate both with and without the prompt. Diffusion models usually get this by dropping the conditioning for a fraction of training examples. CM3Leon gets it from the CM3 objective instead: at inference time the unconditional pass replaces the text prompt with the <mask> token, so no CFG-specific fine-tuning is required. Generation then runs two forward passes per token and combines them:

python
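# Classifier-free guidance for autoregressive generation
import numpy as np

def generate_with_cfg(model, prompt, n_tokens, alpha=3.0, temperature=1.0,
                      uncond_prompt=(), rng=None):
    # model(token_ids) is assumed to return next-token logits as a 1-D array
    rng = rng or np.random.default_rng()
    generated = []
    for _ in range(n_tokens):
        # Conditional: model sees the text prompt
        logits_cond = model(list(prompt) + generated)

        # Unconditional: prompt swapped for uncond_prompt
        # (empty, or a single mask token)
        logits_uncond = model(list(uncond_prompt) + generated)

        # Guide: amplify the conditional signal
        logits = logits_uncond + alpha * (logits_cond - logits_uncond)

        # Sample next token from the temperature-scaled softmax
        scaled = (logits - logits.max()) / temperature
        probs = np.exp(scaled) / np.exp(scaled).sum()
        generated.append(int(rng.choice(len(probs), p=probs)))
    return generated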

TopP + TopK filtering

CM3Leon also uses sophisticated decoding filters. The paper finds that combining TopP (nucleus sampling) with TopK gives the best results:

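| Strategy | TopP | TopK | CFG α | FID ↓ |
| --- | --- | --- | --- | --- |
| Greedy | - | - | 1.0 | 12.4 |
| TopP only | 0.9 | - | 1.0 | 8.7 |
| TopK only | - | 256 | 1.0 | 9.1 |
| TopP + CFG | 0.9 | - | 3.0 | 5.2 |
| TopP + TopK + CFG | 0.9 | 256 | 3.0 | 4.88 |

CFG is a game-changer for AR models: With TopP sampling alone (α=1), FID is 8.7. Adding CFG (α=3) brings it down to 5.2, a 40% reduction, and adding TopK on top reaches 4.88, which is 44% below the TopP-only baseline. This single trick closes most of the quality gap with diffusion models, which use CFG routinely.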
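The TopP and TopK filters in the table can be sketched on a plain probability vector. This is the textbook form of nucleus and top-k filtering, written as an assumption about what the filters do; the decoding variant used in the paper may differ in detail, and the function name is hypothetical.

```python
def filter_top_k_top_p(probs, top_k=None, top_p=None):
    """Zero out low-probability tokens, then renormalize.

    top_k keeps the k most likely tokens. top_p keeps the smallest set of
    most likely tokens whose cumulative probability reaches top_p. When
    both are set, a token must survive both filters.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    keep = set(order)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


probs = [0.5, 0.25, 0.125, 0.125]
filtered = filter_top_k_top_p(probs, top_k=3, top_p=0.7)
print(filtered)
```

In this toy case top_k=3 drops the last token and top_p=0.7 then drops the third, so sampling is restricted to the first two tokens with their probabilities rescaled to sum to 1. During generation the filter would be applied to the softmax of the guided logits.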

How does classifier-free guidance (CFG) improve autoregressive image generation?

Chapter 5: Instruction Tuning

After pretraining, CM3Leon undergoes supervised fine-tuning (SFT) on a diverse set of multimodal tasks. This is the same idea as instruction tuning for text LLMs (InstructGPT, FLAN), but extended to mixed-modal tasks.

Task diversity

CM3Leon is fine-tuned on tasks that cover all four modality combinations:

| Task | Input | Output | Examples |
| --- | --- | --- | --- |
| Text→Image | Text description | Generated image | Text-to-image generation |
| Image→Text | Image | Text description | Captioning, VQA, OCR |
| Image+Text→Text | Image + question | Answer text | Visual QA, image analysis |
| Image→Image | Image + instruction | Modified image | Editing, style transfer |

Instruction format

Each fine-tuning example is formatted as a natural language instruction:

python
# Instruction tuning examples

# Text→Image
{"instruction": "Generate an image of a sunset over mountains",
 "output": "[IMAGE_TOKENS]"}

# Image→Text
{"instruction": "Describe this image in detail: [IMAGE_TOKENS]",
 "output": "A golden retriever playing fetch in a park..."}

# Visual QA
{"instruction": "[IMAGE_TOKENS] How many people are in this image?",
 "output": "Three"}

# Image Editing
{"instruction": "[IMAGE_TOKENS] Make this image look like it was painted by Monet",
 "output": "[EDITED_IMAGE_TOKENS]"}
Why instruction tuning matters: The pretrained model can generate text and images, but it doesn't know how to follow specific instructions. After SFT, the model understands task formats: "generate an image of X" produces an image, "describe this image" produces a caption, "answer: how many X?" produces a count. This is the same unlock that turned GPT-3 into ChatGPT.
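One way to turn an instruction/output pair into a training example is to concatenate the two and compute the loss only on the output tokens. The sketch below illustrates that common SFT convention; whether CM3Leon masks the instruction this way is an assumption, the token ids are made up, and `-100` is simply the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by the loss (assumed convention)


def build_sft_example(instruction_ids, output_ids):
    """Concatenate instruction and output ids; supervise only the output."""
    input_ids = list(instruction_ids) + list(output_ids)
    labels = [IGNORE_INDEX] * len(instruction_ids) + list(output_ids)
    return input_ids, labels


# Toy ids: three instruction tokens followed by two output tokens.
input_ids, labels = build_sft_example([11, 12, 13], [901, 902])
print(input_ids)  # [11, 12, 13, 901, 902]
print(labels)     # [-100, -100, -100, 901, 902]
```

The same builder covers every task in the table, because image tokens and text tokens share one vocabulary: only the contents of the two id lists change.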

What is the purpose of instruction tuning CM3Leon after pretraining?

Chapter 6: Results & Showcase

CM3Leon achieves state-of-the-art results for autoregressive image generation and competitive results across a wide range of multimodal tasks — all from a single 7B model.

Image generation

| Model | Type | Params | MS-COCO FID ↓ | Training Compute |
| --- | --- | --- | --- | --- |
| DALL-E | Autoregressive | 12B | ~28 | Very high |
| Parti | Autoregressive | 20B | 7.23 | Very high |
| DALL-E 2 | Diffusion | 6.5B | 10.39 | High |
| Stable Diffusion | Diffusion | 0.9B | 12.63 | Moderate |
| CM3Leon-7B | Autoregressive | 7B | 4.88 | 5x less than Parti |

The headline number: CM3Leon achieves a zero-shot MS-COCO FID of 4.88, the best in this table, with 5x less training compute than comparable autoregressive models. It even beats DALL-E 2 (FID 10.39), which uses diffusion. This shows that autoregressive image generation, when done right (retrieval + CFG), can match or exceed diffusion.

Multimodal understanding

After instruction tuning, CM3Leon also performs well on vision-language tasks:

| Task | CM3Leon-7B | Flamingo-9B |
| --- | --- | --- |
| VQAv2 | 47.6 | 51.8 |
| OK-VQA | 37.6 | 44.7 |
| TextVQA | 40.1 | 31.8 |
| COCO Captioning | 61.6 | 84.3 |

CM3Leon is competitive on understanding tasks despite being primarily optimized for generation. It trails Flamingo on some benchmarks but outperforms it on TextVQA (which requires reading text in images).


What is CM3Leon's most impressive achievement?

Chapter 7: Connections

CM3Leon sits between RA-CM3 (which introduced retrieval for multimodal models) and Chameleon (which took the early-fusion approach to scale). Its techniques directly influenced both Chameleon and Transfusion.

| Model | Year | Key Innovation | From CM3Leon? |
| --- | --- | --- | --- |
| RA-CM3 | 2022 | Retrieval for multimodal | CM3Leon's precursor |
| CM3Leon | 2023 | Retrieval + CFG + SFT | This paper |
| Chameleon | 2024 | Early fusion at scale | Training recipe from CM3Leon's scaling laws |
| Transfusion | 2024 | AR text + diffusion images | Dual-objective training philosophy |

Lesson 1: Retrieval is underrated for pretraining. Most retrieval work focuses on inference-time RAG. CM3Leon shows that retrieval during training is equally valuable, enabling 5x compute savings.

Lesson 2: CFG generalizes beyond diffusion. Classifier-free guidance isn't specific to the diffusion framework. Any generative model that can generate both conditionally and unconditionally can use it. This opened the door for applying CFG to many other autoregressive domains.

Lesson 3: Licensed data matters. CM3Leon demonstrated that training on licensed, curated data (Shutterstock) can produce results competitive with or better than models trained on massive web scrapes. Quality over quantity.

What is CM3Leon's most lasting contribution to the field?