CM3Leon (Yu et al. 2023)

Chapter 0: The Efficiency Gap

In 2023, there was a puzzling asymmetry in generative AI. For text, autoregressive models (GPT-4, LLaMA) dominated. For images, diffusion models (DALL-E 2, Stable Diffusion) dominated. Autoregressive image generation existed (DALL-E 1), but it was far less efficient: it required 5× more compute to reach the same quality as diffusion models.

Why the gap? Two reasons:

Problem	Why Diffusion Won	What AR Lacked
Generation quality	Diffusion uses classifier-free guidance (CFG) for an inference-time quality boost	AR models had no equivalent of CFG
Data efficiency	Diffusion leverages massive image-text datasets with contrastive pretraining (CLIP)	AR multimodal models were trained from scratch with no retrieval

CM3Leon closes this gap by borrowing the best ideas from both worlds: a retrieval-augmented training pipeline that dramatically improves data efficiency, plus novel decoding strategies (including a form of classifier-free guidance for autoregressive models) that boost generation quality.

CM3Leon's headline result: It achieves state-of-the-art image generation (FID 4.88 on MS-COCO) with 5× LESS compute than comparable autoregressive models, and it's competitive with diffusion models like DALL-E 2. It also handles text generation, image captioning, and visual QA within the same model.

Compute Efficiency Comparison

Compare the compute needed by different models to achieve similar image generation quality. CM3Leon achieves competitive results with dramatically less compute.

What two innovations allow CM3Leon to close the efficiency gap between autoregressive and diffusion image generation?

Retrieval-augmented training (improving data efficiency by conditioning on retrieved examples) and classifier-free guidance adapted for autoregressive models (boosting generation quality at inference time)
A larger model and more training data
A new loss function and optimizer

Chapter 1: CM3 Foundation

CM3Leon builds on the CM3 objective — Causally Masked Multimodal Modeling. This is a training objective that combines autoregressive generation with infilling (masked span prediction), enabling the model to both generate and complete multimodal content.

The CM3 objective

Standard autoregressive models predict left-to-right: given all previous tokens, predict the next one. CM3 adds infilling: randomly mask spans of tokens, move them to the end, and predict them. This trains the model to generate content conditioned on context from BOTH sides.

Original Document

The [IMAGE_TOKENS] cat sat on the mat.

↓ mask a span

Input (with mask)

The <mask> sat on the mat.

↓

Target (infill)

[IMAGE_TOKENS] cat

This is crucial for image generation: given surrounding text ("A sunset over the ocean [IMAGE] Beautiful colors"), the model can generate the image tokens by infilling the masked span. Without infilling, a left-to-right model could condition the image only on the text before it ("A sunset over the ocean") and would ignore the text that follows.

Architecture

Component	CM3Leon-7B
Architecture	Decoder-only transformer
Parameters	7 billion
Text tokenizer	BPE (56,320 tokens)
Image tokenizer	VQ-VAE (8,192 codes, 256 tokens per image)
Total vocab	64,512
Context length	4,096
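
The sizes in the table can be sanity-checked with a few lines of arithmetic. The sketch below assumes image codes are placed directly after the text tokens in one shared ID space; that offset layout and the helper name `image_code_to_token_id` are illustrative assumptions, not something the table states.

```python
# Sizes taken from the architecture table above.
TEXT_VOCAB = 56_320
IMAGE_CODES = 8_192
TOKENS_PER_IMAGE = 256
CONTEXT_LENGTH = 4_096


def image_code_to_token_id(code):
    """Map a VQ-VAE code index to a token ID in the shared vocabulary.

    Assumption: image codes occupy the ID range right after the text tokens.
    """
    if not 0 <= code < IMAGE_CODES:
        raise ValueError(f"code must be in [0, {IMAGE_CODES}), got {code}")
    return TEXT_VOCAB + code


total_vocab = TEXT_VOCAB + IMAGE_CODES
max_images_in_context = CONTEXT_LENGTH // TOKENS_PER_IMAGE

print("total vocab:", total_vocab)
print("first / last image token ID:",
      image_code_to_token_id(0), image_code_to_token_id(IMAGE_CODES - 1))
print("images that fit in one context (no text):", max_images_in_context)
```

The total matches the 64,512 in the table, and a 4,096-token context has room for at most 16 images before any text is added.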

CM3 vs pure autoregressive: The infilling capability is what separates CM3 from models like DALL-E. DALL-E can only generate images given a text prefix. CM3 can generate images conditioned on text from both before and after the image position, enabling more natural interleaved generation like "Step 1: [IMAGE] Mix the flour. Step 2: [IMAGE] Add water."

python
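# CM3 masking strategy (simplified: one contiguous span per document)
import random

MASK_TOKEN = "<mask>"  # sentinel; the exact token string is an assumption

def cm3_mask(tokens, mask_ratio=0.15):
    if not tokens:
        return [], []

    # Pick one contiguous span covering ~mask_ratio of the sequence (at least 1 token)
    n_mask = max(1, int(len(tokens) * mask_ratio))
    mask_start = random.randint(0, len(tokens) - n_mask)
    masked_span = tokens[mask_start:mask_start + n_mask]

    # Replace the whole span with a single sentinel
    input_tokens = tokens[:mask_start] + [MASK_TOKEN] + tokens[mask_start + n_mask:]

    # Target: predict the masked span
    target_tokens = masked_span
    return input_tokens, target_tokens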
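
The masking function above returns the input and the target as two separate lists, while the text describes moving the masked span to the end of the same sequence. A minimal sketch of that single-sequence layout follows; the function name `cm3_sequence`, the `<mask>` sentinel string, and the choice to repeat the sentinel before the appended span are assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical sentinel string


def cm3_sequence(tokens, span_len, rng=random):
    """Cut one span out, leave a sentinel in its place, and append
    sentinel + span at the end so a left-to-right model can predict it."""
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randint(0, len(tokens) - span_len)
    span = tokens[start:start + span_len]
    return tokens[:start] + [MASK] + tokens[start + span_len:] + [MASK] + span


doc = ["The", "[IMAGE_TOKENS]", "cat", "sat", "on", "the", "mat."]
seq = cm3_sequence(doc, span_len=2, rng=random.Random(0))
print(seq)
```

Because the span is predicted last, the model has already read the tokens on both sides of the gap by the time it generates the missing content, even though it still decodes strictly left to right.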

CM3 Objective Visualizer

Shows how the CM3 objective masks and predicts spans in a multimodal document. The model learns to infill both text and image tokens.

What does the CM3 objective add over standard autoregressive training?

Infilling (masked span prediction): the model learns to generate tokens conditioned on context from both sides, enabling image generation at any position in a document rather than only at the end
A different optimizer
A larger vocabulary

Chapter 2: Retrieval Augmentation

CM3Leon's biggest innovation over previous autoregressive multimodal models is retrieval-augmented pretraining. During training, each document is augmented with retrieved similar documents from a large corpus. This dramatically improves data efficiency.

How retrieval works in training

Training Document

"A golden retriever playing in the park [IMAGE]"

↓ retrieve similar docs

Dense Retriever (CLIP)

Find 2 similar documents from a 1.4B image-text corpus using CLIP embeddings. Both text-to-text and image-to-image retrieval.

↓

Augmented Input

[Retrieved doc 1] [Retrieved doc 2] [Original doc]. Model trains on the combined sequence with standard CM3 objective.

The retriever uses CLIP embeddings for both text and images. For each training document, CM3Leon retrieves the top-2 most similar documents from a 1.4 billion example corpus. The retrieved documents are prepended to the training input, providing the model with relevant examples it can attend to.

Why retrieval helps so much

Benefit	Mechanism	Impact
In-context learning	Retrieved examples serve as "demonstrations" of the text-image relationship	Model learns patterns from examples, not just from parameters
Reduced memorization	Factual knowledge comes from retrieved docs, not weights	Smaller model needed for same quality
Data diversity	Each training step sees the original doc + 2 related docs	Effectively 3x training data exposure per step

The key insight: Retrieval augmentation during PRETRAINING (not just inference) is what gives CM3Leon its 5x efficiency advantage. The model learns to leverage retrieved examples from the very first training step. Over the course of training, every document is seen alongside its two nearest neighbours from the retrieval corpus, which builds richer multimodal representations than training on isolated documents.

python
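# CM3Leon retrieval-augmented training (sketch)
# `encode`, `retriever`, and `tokenize` are passed in as placeholders:
#   encode(doc)             -> embedding (e.g. CLIP text + image features)
#   retriever.search(e, k)  -> list of the k most similar documents
#   tokenize(doc)           -> list of tokens
# `cm3_mask` is the masking function from Chapter 1.
def create_training_example(doc, retriever, encode, tokenize, k=2):
    # Encode the document
    doc_embedding = encode(doc)

    # Retrieve the top-k similar documents (k=2 in CM3Leon)
    retrieved = retriever.search(doc_embedding, k=k)

    # Prepend retrieved docs to the original document
    input_seq = []
    for retrieved_doc in retrieved:
        input_seq += tokenize(retrieved_doc)
    input_seq += tokenize(doc)

    # Apply CM3 masking to the combined sequence
    masked_input, target = cm3_mask(input_seq)
    return masked_input, target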
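
The `retriever.search` call above is left abstract. A minimal sketch of what dense top-k retrieval amounts to is shown below, using brute-force cosine similarity over a tiny hand-written corpus. The three-dimensional vectors stand in for CLIP embeddings, and the function names are illustrative; a 1.4-billion-document corpus would need an approximate nearest-neighbour index rather than a full scan.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query, corpus_embeddings, k=2):
    """Return indices of the k corpus vectors most similar to the query."""
    ranked = sorted(
        range(len(corpus_embeddings)),
        key=lambda i: cosine(query, corpus_embeddings[i]),
        reverse=True,
    )
    return ranked[:k]


# Toy stand-ins for CLIP embeddings of four corpus documents.
corpus = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
query = [1.0, 0.0, 0.1]

neighbours = top_k(query, corpus, k=2)
print(neighbours)
```

The two returned indices are the documents that would be prepended to the training example.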

Retrieval-Augmented Training Pipeline

Watch how each training document is augmented with retrieved examples before being fed to the model. The model learns to attend to similar examples for richer multimodal understanding.

Why does CM3Leon apply retrieval augmentation during pretraining, not just at inference time?

Because training with retrieved examples from the start teaches the model to leverage them: it learns in-context from similar documents, reduces the need for parameter memorization, and effectively sees 3x more data per step, achieving 5x compute efficiency over non-retrieval autoregressive models
Because retrieval at inference is too slow
Because the retriever needs to be trained jointly

Chapter 3: Training Recipe

CM3Leon's training follows the scaling laws from the previous paper in this series (Aghajanyan et al. 2023). The mix ratio and training details are carefully chosen based on those findings.

Pretraining data

Data Source	Type	Size
Licensed Shutterstock	Image-text pairs	~3 billion pairs
Text corpus	Text only	Standard web text
Retrieval corpus	Mixed	~1.4 billion image-text docs

CM3Leon uses licensed data from Shutterstock, not scraped web data. This is a notable ethical choice — the model is trained on data where creators were compensated. The quality of licensed data is also generally higher than web-scraped data.

Training details

python
# CM3Leon training configuration
config = {
    "model": "decoder-only transformer",
    "params": "7B",
    "optimizer": "AdamW",
    "lr": 1e-4,
    "warmup": 1500,
    "schedule": "cosine",
    "tokens": 1_400_000_000_000,  # 1.4T total tokens seen
    "context": 4096,
    "batch_size": 8192,  # sequences
    "retrieval_k": 2,   # 2 retrieved docs per training example
    "image_tokens": 256,
    "data_mix": "~70% text, ~30% image (following scaling laws)",
}
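
The `warmup` and `schedule` entries in the config describe a common learning-rate shape: a linear ramp followed by cosine decay. The sketch below shows one way to compute it. The linear form of the warmup, the decay floor `min_lr=0.0`, and the `total_steps` value are assumptions, since the config above does not specify them.

```python
import math


def lr_at(step, total_steps, peak_lr=1e-4, warmup=1500, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup:
        return peak_lr * (step + 1) / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 10_000  # placeholder value for illustration only

for step in (0, 749, 1499, 5750, 10_000):
    print(step, f"{lr_at(step, TOTAL_STEPS):.2e}")
```

The rate reaches its peak of 1e-4 at the end of warmup and falls smoothly to the floor by the final step.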

Note the training token count: 1.4T tokens for a 7B model. Compare to Chinchilla-optimal: ~140B for 7B. CM3Leon is trained on 10x the Chinchilla-optimal token count. This is intentional — the retrieval augmentation makes each token more informative, so the model can productively absorb more data without overfitting.
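
The arithmetic behind that comparison is short enough to write out. The sketch assumes the usual rule of thumb of roughly 20 training tokens per parameter for a Chinchilla-optimal budget, and it takes batch size and context length from the config above to estimate the number of optimizer steps (assuming every sequence is packed to the full context, which the config does not state).

```python
PARAMS = 7_000_000_000
TOKENS_SEEN = 1_400_000_000_000
TOKENS_PER_PARAM = 20  # rule-of-thumb Chinchilla ratio (assumption)

BATCH_SEQUENCES = 8_192
CONTEXT_LENGTH = 4_096

chinchilla_tokens = TOKENS_PER_PARAM * PARAMS
overtrain_ratio = TOKENS_SEEN / chinchilla_tokens

tokens_per_step = BATCH_SEQUENCES * CONTEXT_LENGTH
approx_steps = TOKENS_SEEN // tokens_per_step

print(f"Chinchilla-optimal tokens: {chinchilla_tokens:,}")
print(f"Ratio to Chinchilla-optimal: {overtrain_ratio:.1f}x")
print(f"Tokens per step: {tokens_per_step:,}")
print(f"Approximate optimizer steps: {approx_steps:,}")
```

Under these assumptions the budget works out to 140B tokens, the 1.4T actually used is 10x that, and training takes on the order of forty thousand steps.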

Training Configuration Explorer

Explore CM3Leon's training setup. Each component contributes to the final model's capabilities.

Why does CM3Leon train on 10x the Chinchilla-optimal token count?

Because retrieval augmentation makes each token more informative: the model sees each document alongside 2 retrieved similar examples, effectively getting richer learning signal per step, so it can productively absorb far more data than a non-retrieval model
Because the model is undertrained at Chinchilla-optimal
Because bigger datasets are always better

Chapter 4: Decoding Strategies

CM3Leon's second key innovation is bringing classifier-free guidance (CFG) to autoregressive generation. CFG was previously exclusive to diffusion models, where it dramatically improves sample quality at the cost of diversity. CM3Leon shows it works for autoregressive models too.

What is classifier-free guidance?

The idea: during generation, compute two predictions — one conditioned on the prompt (conditional) and one without the prompt (unconditional). Then amplify the difference:

logits_guided = logits_uncond + α · (logits_cond − logits_uncond)

Where α is the guidance scale. At α = 1, you get standard conditional generation. At α > 1, the model is pushed to generate outputs more strongly aligned with the prompt. At α = 0, you get unconditional generation.
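
A small numeric example makes the effect of α concrete. The logits below are made up for illustration (three candidate tokens, with the prompt favouring the first), and the helper names `guide` and `softmax` are not from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def guide(logits_uncond, logits_cond, alpha):
    """logits_uncond + alpha * (logits_cond - logits_uncond), element-wise."""
    return [u + alpha * (c - u) for u, c in zip(logits_uncond, logits_cond)]


# Toy logits: the unconditional model is indifferent,
# the conditional model prefers token 0 and dislikes token 2.
logits_uncond = [1.0, 1.0, 1.0]
logits_cond = [2.0, 1.0, 0.0]

for alpha in (0.0, 1.0, 3.0):
    guided = guide(logits_uncond, logits_cond, alpha)
    probs = softmax(guided)
    print(alpha, guided, [round(p, 3) for p in probs])
```

At α = 0 the guided logits equal the unconditional ones, at α = 1 they equal the conditional ones, and at α = 3 they become [4.0, 1.0, -2.0], so the probability of the prompt-favoured token rises from about 0.67 to about 0.95.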

CFG for autoregressive models

To enable CFG, CM3Leon occasionally drops the conditioning information during training (sets the prompt to empty with 10% probability). This trains the model to generate both conditionally and unconditionally. At inference time, it runs two forward passes per token and combines them:

python
# Classifier-free guidance for autoregressive generation
# Assumes model(tokens) takes a list of token IDs and returns
# next-token logits as a 1-D NumPy array.
import numpy as np

def generate_with_cfg(model, prompt, n_tokens, alpha=3.0, temperature=1.0, rng=None):
    rng = rng or np.random.default_rng()
    generated = []
    for _ in range(n_tokens):
        # Conditional: model sees the text prompt
        logits_cond = model(prompt + generated)

        # Unconditional: prompt dropped (empty), as during training
        logits_uncond = model(generated)

        # Guide: amplify the conditional signal
        logits = logits_uncond + alpha * (logits_cond - logits_uncond)

        # Sample next token from the temperature-scaled softmax
        z = (logits - logits.max()) / temperature
        probs = np.exp(z) / np.exp(z).sum()
        generated.append(int(rng.choice(len(probs), p=probs)))
    return generated
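
The training-side half of the recipe, dropping the prompt with 10% probability, can be sketched in a few lines. The function name `maybe_drop_prompt` and the token strings are hypothetical; the sketch follows the description in the text (empty prompt) and does not claim to match the exact preprocessing used for CM3Leon.

```python
import random


def maybe_drop_prompt(prompt_tokens, target_tokens, drop_prob=0.1, rng=random):
    """With probability drop_prob, train on the target alone (unconditional);
    otherwise train on prompt + target (conditional)."""
    if rng.random() < drop_prob:
        return list(target_tokens)
    return list(prompt_tokens) + list(target_tokens)


rng = random.Random(0)
prompt = ["a", "sunset", "over", "mountains"]
image = ["<img_17>", "<img_402>"]  # made-up image token names

examples = [maybe_drop_prompt(prompt, image, rng=rng) for _ in range(10_000)]
dropped = sum(len(e) == len(image) for e in examples)
print(f"prompt dropped in {dropped / len(examples):.1%} of examples")
```

Because the same network sees both kinds of example, one set of weights can produce the conditional and the unconditional logits needed at inference time.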

TopP + TopK filtering

CM3Leon also uses sophisticated decoding filters. The paper finds that combining TopP (nucleus sampling) with TopK gives the best results:

Strategy	TopP	TopK	CFG α	FID ↓
Greedy	-	-	1.0	12.4
TopP only	0.9	-	1.0	8.7
TopK only	-	256	1.0	9.1
TopP + CFG	0.9	-	3.0	5.2
TopP + TopK + CFG	0.9	256	3.0	4.88
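
For readers unfamiliar with the two filters, here is a minimal sketch of how TopK and TopP can be combined on a single probability vector. Applying TopK first and measuring the nucleus mass on the original (not renormalised) probabilities are design choices made for this illustration; the source does not say which order or convention CM3Leon uses.

```python
def top_k_top_p_filter(probs, top_k=None, top_p=None):
    """Keep the top_k most likely tokens, then the smallest prefix of those
    whose cumulative probability reaches top_p; renormalise what is left."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k is not None:
        order = order[:top_k]
    if top_p is not None:
        kept, mass = [], 0.0
        for i in order:
            kept.append(i)
            mass += probs[i]
            if mass >= top_p:
                break
        order = kept
    total = sum(probs[i] for i in order)
    filtered = [0.0] * len(probs)
    for i in order:
        filtered[i] = probs[i] / total
    return filtered


probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.7)
print(filtered)
```

In this toy case TopK removes the least likely token, TopP then trims the list to the two tokens whose mass reaches 0.7, and sampling proceeds from the renormalised remainder.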

CFG is a game-changer for AR models: With TopP sampling alone (α=1), FID is 8.7. Adding CFG (α=3) brings it down to 5.2, and adding TopK filtering on top reaches 4.88, a 44% improvement overall. Most of that gain comes from CFG, and it closes most of the quality gap with diffusion models, which have long relied on the same trick.

Classifier-Free Guidance Slider

Adjust the guidance scale α to see how it affects generation quality. Higher α produces more prompt-aligned but less diverse outputs.

How does classifier-free guidance (CFG) improve autoregressive image generation?

By computing both conditional and unconditional predictions at each step and amplifying the difference: this pushes generation toward more prompt-aligned outputs, helping reduce FID from 8.7 to 4.88 (a 44% improvement, together with TopK filtering) and closing the quality gap with diffusion models
By using a different loss function during training
By generating images faster

Chapter 5: Instruction Tuning

After pretraining, CM3Leon undergoes supervised fine-tuning (SFT) on a diverse set of multimodal tasks. This is the same idea as instruction tuning for text LLMs (InstructGPT, FLAN), but extended to mixed-modal tasks.

Task diversity

CM3Leon is fine-tuned on tasks that cover all four modality combinations:

Task	Input	Output	Examples
Text→Image	Text description	Generated image	Text-to-image generation
Image→Text	Image	Text description	Captioning, VQA, OCR
Image+Text→Text	Image + question	Answer text	Visual QA, image analysis
Image→Image	Image + instruction	Modified image	Editing, style transfer

Instruction format

Each fine-tuning example is formatted as a natural language instruction:

python
# Instruction tuning examples

# Text→Image
{"instruction": "Generate an image of a sunset over mountains",
 "output": "[IMAGE_TOKENS]"}

# Image→Text
{"instruction": "Describe this image in detail: [IMAGE_TOKENS]",
 "output": "A golden retriever playing fetch in a park..."}

# Visual QA
{"instruction": "[IMAGE_TOKENS] How many people are in this image?",
 "output": "Three"}

# Image Editing
{"instruction": "[IMAGE_TOKENS] Make this image look like it was painted by Monet",
 "output": "[EDITED_IMAGE_TOKENS]"}

Why instruction tuning matters: The pretrained model can generate text and images, but it doesn't know how to follow specific instructions. After SFT, the model understands task formats: "generate an image of X" produces an image, "describe this image" produces a caption, "answer: how many X?" produces a count. This is the same unlock that turned GPT-3 into ChatGPT.
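
The instruction examples above still have to be turned into a single token sequence before fine-tuning. The sketch below shows one common way to do that: concatenate instruction and output, and compute the loss only on the output tokens. The separator token, the loss mask, and the function name `build_sft_sequence` are assumptions borrowed from standard SFT practice for text LLMs, not details confirmed for CM3Leon.

```python
def build_sft_sequence(example, tokenize, sep="<sep>"):
    """Concatenate instruction and output into one sequence.

    Returns (tokens, loss_mask) where loss_mask is 1 for output tokens
    and 0 for instruction tokens and the separator.
    """
    prompt = tokenize(example["instruction"]) + [sep]
    target = tokenize(example["output"])
    tokens = prompt + target
    loss_mask = [0] * len(prompt) + [1] * len(target)
    return tokens, loss_mask


example = {
    "instruction": "[IMAGE_TOKENS] How many people are in this image?",
    "output": "Three",
}

# Whitespace splitting stands in for the real text and image tokenizers.
tokens, loss_mask = build_sft_sequence(example, str.split)
print(tokens)
print(loss_mask)
```

The same function handles every row of the task table, because images on either side are just more tokens in the sequence.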

Instruction Tuning Tasks

The different tasks CM3Leon can perform after instruction tuning, with the input/output format of each.

What is the purpose of instruction tuning CM3Leon after pretraining?

To teach the model to follow specific task formats ("generate an image of X" produces an image, "describe this" produces a caption), turning a general-purpose generative model into one that responds to natural language instructions across all modality combinations
To increase the model's parameter count
To replace the pretraining knowledge

Chapter 6: Results & Showcase

CM3Leon achieves state-of-the-art results for autoregressive image generation and competitive results across a wide range of multimodal tasks — all from a single 7B model.

Image generation

Model	Type	Params	MS-COCO FID ↓	Training Compute
DALL-E	Autoregressive	12B	17.89	Very high
Parti	Autoregressive	20B	7.23	Very high
DALL-E 2	Diffusion	6.5B	10.39	High
Stable Diffusion	Diffusion	0.9B	12.63	Moderate
CM3Leon-7B	Autoregressive	7B	4.88	5x less than Parti

The headline number: CM3Leon achieves a zero-shot MS-COCO FID of 4.88, the best in the table above, with 5x less training compute than comparable autoregressive models. It even beats DALL-E 2 (FID 10.39), which uses diffusion. This shows that autoregressive image generation, when done right (retrieval + CFG), can match or exceed diffusion on this benchmark.

Multimodal understanding

After instruction tuning, CM3Leon also performs well on vision-language tasks:

Task	CM3Leon-7B	Flamingo-9B
VQAv2	47.6	51.8
OK-VQA	37.6	44.7
TextVQA	40.1	31.8
COCO Captioning	61.6	84.3

CM3Leon is competitive on understanding tasks despite being primarily optimized for generation. It trails Flamingo-9B on VQAv2, OK-VQA, and COCO captioning but outperforms it on TextVQA (which requires reading text in images).

CM3Leon Performance Dashboard

Compare CM3Leon against other models across generation and understanding tasks.

What is CM3Leon's most impressive achievement?

Achieving state-of-the-art image generation (FID 4.88) with 5x less training compute than comparable autoregressive models, proving that retrieval augmentation + classifier-free guidance makes autoregressive generation competitive with or better than diffusion
Having the most parameters
Being the fastest at inference

Chapter 7: Connections

CM3Leon sits between RA-CM3 (which introduced retrieval for multimodal models) and Chameleon (which took the early-fusion approach to scale). Its techniques directly influenced both Chameleon and Transfusion.

Model	Year	Key Innovation	From CM3Leon?
RA-CM3	2022	Retrieval for multimodal	CM3Leon's precursor
CM3Leon	2023	Retrieval + CFG + SFT	This paper
Chameleon	2024	Early fusion at scale	Training recipe from CM3Leon's scaling laws
Transfusion	2024	AR text + diffusion images	Dual-objective training philosophy

Lesson 1: Retrieval is underrated for pretraining. Most retrieval work focuses on inference-time RAG. CM3Leon shows that retrieval during training is equally valuable, enabling 5x compute savings.

Lesson 2: CFG generalizes beyond diffusion. Classifier-free guidance isn't specific to the diffusion framework. Any generative model that can generate both conditionally and unconditionally can use it. This opened the door for applying CFG to many other autoregressive domains.

Lesson 3: Licensed data matters. CM3Leon demonstrated that training on licensed, curated data (Shutterstock) can produce results competitive with or better than models trained on massive web scrapes. Quality over quantity.

CM3Leon's Lineage

Trace the evolution from RA-CM3 through CM3Leon to Chameleon and beyond.

What is CM3Leon's most lasting contribution to the field?

Proving that autoregressive multimodal models can match diffusion quality when augmented with retrieval during pretraining and classifier-free guidance at inference, establishing a recipe (retrieval + CFG + instruction tuning) that influenced later mixed-modal models such as Chameleon and Transfusion
The specific model architecture
The image tokenizer design
