Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning — the first autoregressive model competitive with diffusion on image generation, using retrieval augmentation and instruction tuning.
In 2023, there was a puzzling asymmetry in generative AI. For text, autoregressive models (GPT-4, LLaMA) dominated. For images, diffusion models (DALL-E 2, Stable Diffusion) dominated. Autoregressive image generation existed (DALL-E 1), but it was far less efficient: it required 5× more compute to reach the same quality as diffusion models.
Why the gap? Two reasons:
| Problem | Why Diffusion Won | What AR Lacked |
|---|---|---|
| Training efficiency | Diffusion uses classifier-free guidance (CFG) for a free quality boost | AR models had no equivalent of CFG |
| Data efficiency | Diffusion leverages massive image-text datasets with contrastive pretraining (CLIP) | AR multimodal models were trained from scratch with no retrieval |
CM3Leon closes this gap by borrowing the best ideas from both worlds: a retrieval-augmented training pipeline that dramatically improves data efficiency, plus novel decoding strategies (including a form of classifier-free guidance for autoregressive models) that boost generation quality.
CM3Leon achieves competitive results with dramatically less compute.
CM3Leon builds on the CM3 objective — Causally Masked Multimodal Modeling. This is a training objective that combines autoregressive generation with infilling (masked span prediction), enabling the model to both generate and complete multimodal content.
Standard autoregressive models predict left-to-right: given all previous tokens, predict the next one. CM3 adds infilling: randomly mask spans of tokens, move them to the end, and predict them. This trains the model to generate content conditioned on context from BOTH sides.
This is crucial for image generation: given surrounding text ("A sunset over the ocean [IMAGE] Beautiful colors"), the model can generate the image tokens by infilling the masked span. Without infilling, the model could only generate images at the end of a sequence.
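The rearrangement described above (mask a span, move it to the end, predict it there) can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `cm3_rearrange`, the sentinel string `<mask:0>`, and the single-span simplification are all assumptions made for the example.

```python
import random

MASK = "<mask:0>"  # hypothetical sentinel token, chosen for illustration

def cm3_rearrange(tokens, span_len, rng=random):
    """Move one random contiguous span to the end of the sequence."""
    if not 0 < span_len <= len(tokens):
        raise ValueError("span_len must be between 1 and len(tokens)")
    start = rng.randint(0, len(tokens) - span_len)  # randint bounds are inclusive
    span = tokens[start:start + span_len]
    # The first sentinel marks the hole; the second introduces the moved span,
    # so a left-to-right model sees context from both sides before predicting it.
    return tokens[:start] + [MASK] + tokens[start + span_len:] + [MASK] + span

doc = ["A", "sunset", "over", "the", "ocean", "<img_1>", "<img_2>", "Beautiful", "colors"]
out = cm3_rearrange(doc, span_len=2, rng=random.Random(0))
print(out)
```

Because the moved span sits at the end, ordinary next-token prediction over the rearranged sequence trains infilling without changing the decoder-only architecture.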
| Component | CM3Leon-7B |
|---|---|
| Architecture | Decoder-only transformer |
| Parameters | 7 billion |
| Text tokenizer | BPE (56,320 tokens) |
| Image tokenizer | VQ-VAE (8,192 codes, 256 tokens per image) |
| Total vocab | 64,512 |
| Context length | 4,096 |
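The numbers in the table fit together arithmetically, which a short sketch makes explicit. The sizes come from the table; the layout in which image codes are placed directly after the text vocabulary, and the helper names, are assumptions for illustration only.

```python
TEXT_VOCAB = 56_320        # BPE text tokens (from the table)
IMAGE_CODES = 8_192        # VQ-VAE codebook entries (from the table)
TOTAL_VOCAB = TEXT_VOCAB + IMAGE_CODES
TOKENS_PER_IMAGE = 256
CONTEXT_LENGTH = 4_096

def image_code_to_token_id(code):
    """Map a codebook index to a shared-vocabulary id (assumed offset layout)."""
    if not 0 <= code < IMAGE_CODES:
        raise ValueError("code out of range")
    return TEXT_VOCAB + code

def is_image_token(token_id):
    """True if the id falls in the assumed image range of the shared vocabulary."""
    return TEXT_VOCAB <= token_id < TOTAL_VOCAB

# Upper bound on whole images per context window, ignoring any text tokens
max_images_per_context = CONTEXT_LENGTH // TOKENS_PER_IMAGE

print(TOTAL_VOCAB, max_images_per_context)
```

With retrieval adding two extra documents to every input, the 256-token image budget is what keeps three image-text documents comfortably inside a 4,096-token context.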
```python
import random

MASK_TOKEN = "<mask>"  # placeholder sentinel marking where the span was removed

# CM3 masking strategy (simplified to a single span)
def cm3_mask(tokens, mask_ratio=0.15):
    # Choose one contiguous span covering about mask_ratio of the tokens
    n_mask = int(len(tokens) * mask_ratio)
    mask_start = random.randint(0, len(tokens) - n_mask)
    masked_span = tokens[mask_start:mask_start + n_mask]
    # Replace the masked span with a single sentinel
    input_tokens = tokens[:mask_start] + [MASK_TOKEN] + tokens[mask_start + n_mask:]
    # Target: predict the masked span
    target_tokens = masked_span
    return input_tokens, target_tokens
```
CM3Leon's biggest innovation over previous autoregressive multimodal models is retrieval-augmented pretraining. During training, each document is augmented with retrieved similar documents from a large corpus. This dramatically improves data efficiency.
The retriever uses CLIP embeddings for both text and images. For each training document, CM3Leon retrieves the top-2 most similar documents from a 1.4 billion example corpus. The retrieved documents are prepended to the training input, providing the model with relevant examples it can attend to.
| Benefit | Mechanism | Impact |
|---|---|---|
| In-context learning | Retrieved examples serve as "demonstrations" of the text-image relationship | Model learns patterns from examples, not just from parameters |
| Reduced memorization | Factual knowledge comes from retrieved docs, not weights | Smaller model needed for same quality |
| Data diversity | Each training step sees the original doc + 2 related docs | Effectively 3x training data exposure per step |
```python
# CM3Leon retrieval-augmented training (sketch).
# clip, retriever, and tokenize are placeholders for a CLIP encoder,
# a nearest-neighbor index over the corpus, and the joint text/image tokenizer.
def create_training_example(doc, retriever, clip, tokenize):
    # Encode the document with CLIP (text + image features)
    doc_embedding = clip.encode(doc)
    # Retrieve the top-2 most similar documents
    retrieved = retriever.search(doc_embedding, k=2)
    # Prepend the retrieved documents to the training input
    input_seq = (
        tokenize(retrieved[0])    # retrieved doc 1
        + tokenize(retrieved[1])  # retrieved doc 2
        + tokenize(doc)           # original document
    )
    # Apply CM3 masking and predict the masked span
    masked_input, target = cm3_mask(input_seq)
    return masked_input, target
```
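The `retriever.search` call is left abstract in the snippet above. As a rough illustration of what a top-k lookup over embeddings does, here is a brute-force cosine-similarity search in plain Python. The function names and toy vectors are hypothetical; a corpus of 1.4 billion documents would need an approximate nearest-neighbor index rather than a linear scan.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_similar(query, corpus, k=2):
    """Return the indices of the k corpus vectors most similar to the query."""
    scored = sorted(
        ((cosine(query, vec), i) for i, vec in enumerate(corpus)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

# Toy 2-D "embeddings" standing in for CLIP features
corpus = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
neighbors = top_k_similar([1.0, 0.0], corpus, k=2)
print(neighbors)
```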
CM3Leon's training follows the scaling laws from the previous paper in this series (Aghajanyan et al. 2023). The mix ratio and training details are carefully chosen based on those findings.
| Data Source | Type | Size |
|---|---|---|
| Licensed Shutterstock | Image-text pairs | ~3 billion pairs |
| Text corpus | Text only | Standard web text |
| Retrieval corpus | Mixed | ~1.4 billion image-text docs |
CM3Leon uses licensed data from Shutterstock, not scraped web data. This is a notable ethical choice — the model is trained on data where creators were compensated. The quality of licensed data is also generally higher than web-scraped data.
```python
# CM3Leon training configuration
config = {
    "model": "decoder-only transformer",
    "params": "7B",
    "optimizer": "AdamW",
    "lr": 1e-4,
    "warmup": 1500,
    "schedule": "cosine",
    "tokens": 1_400_000_000_000,  # 1.4T total tokens seen
    "context": 4096,
    "batch_size": 8192,           # sequences
    "retrieval_k": 2,             # 2 retrieved docs per training example
    "image_tokens": 256,
    "data_mix": "~70% text, ~30% image (following scaling laws)",
}
```
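The configuration values imply a step count and a learning-rate curve, which the sketch below works out. It assumes every sequence fills the full 4,096-token context, that warmup is linear, and that the cosine schedule decays to zero; none of these details are stated in the configuration, and `lr_at` is a hypothetical helper name.

```python
import math

PEAK_LR = 1e-4
WARMUP_STEPS = 1500
TOTAL_TOKENS = 1_400_000_000_000
TOKENS_PER_STEP = 8192 * 4096                   # batch_size * context
TOTAL_STEPS = TOTAL_TOKENS // TOKENS_PER_STEP   # about 41.7k optimizer steps

def lr_at(step):
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1 + math.cos(math.pi * min(1.0, progress)))

print(TOKENS_PER_STEP, TOTAL_STEPS)
for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, lr_at(step))
```

Under these assumptions each optimizer step consumes roughly 33.5 million tokens, so the 1,500 warmup steps cover under 4% of training.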
CM3Leon's second key innovation is bringing classifier-free guidance (CFG) to autoregressive generation. CFG was previously exclusive to diffusion models, where it dramatically improves sample quality at the cost of diversity. CM3Leon shows it works for autoregressive models too.
The idea: during generation, compute two predictions, one conditioned on the prompt (conditional) and one without the prompt (unconditional). Then amplify the difference:
$$\text{logits}_{\text{guided}} = \text{logits}_{\text{uncond}} + \alpha \, (\text{logits}_{\text{cond}} - \text{logits}_{\text{uncond}})$$
where α is the guidance scale. At α = 1, the unconditional terms cancel and you get standard conditional generation. At α > 1, the model is pushed to generate outputs more strongly aligned with the prompt. At α = 0, you get unconditional generation.
To enable CFG, CM3Leon occasionally drops the conditioning information during training (sets the prompt to empty with 10% probability). This trains the model to generate both conditionally and unconditionally. At inference time, it runs two forward passes per token and combines them:
```python
import math
import random

# Classifier-free guidance for autoregressive generation.
# model(tokens) is assumed to return a list of next-token logits;
# prompt is a list of token ids.
def generate_with_cfg(model, prompt, n_tokens=256, alpha=3.0, temperature=1.0):
    generated = []
    for _ in range(n_tokens):
        # Conditional: model sees the text prompt
        logits_cond = model(prompt + generated)
        # Unconditional: model sees an empty prompt
        logits_uncond = model(generated)
        # Guide: amplify the conditional signal
        logits = [u + alpha * (c - u) for c, u in zip(logits_cond, logits_uncond)]
        # Sample the next token from the temperature-scaled softmax
        peak = max(logits)
        weights = [math.exp((x - peak) / temperature) for x in logits]
        next_token = random.choices(range(len(weights)), weights=weights)[0]
        generated.append(next_token)
    return generated
```
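Guidance at inference only works if the model has also learned an unconditional distribution, which the text attributes to dropping the prompt 10% of the time during training. A minimal sketch of that data-side step follows; the function name `maybe_drop_prompt` and the list-of-tokens representation are assumptions for illustration.

```python
import random

def maybe_drop_prompt(prompt_tokens, target_tokens, drop_prob=0.1, rng=random):
    """Build one training sequence, emptying the prompt with probability drop_prob."""
    if rng.random() < drop_prob:
        # Unconditional example: the model sees only the target tokens
        return list(target_tokens)
    # Conditional example: prompt followed by target
    return list(prompt_tokens) + list(target_tokens)

rng = random.Random(0)
prompt, target = ["a", "red", "car"], ["<img_1>", "<img_2>"]
sequences = [maybe_drop_prompt(prompt, target, rng=rng) for _ in range(10_000)]
dropped = sum(1 for seq in sequences if len(seq) == len(target))
print(dropped / len(sequences))
```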
CM3Leon also uses sophisticated decoding filters. The paper finds that combining TopP (nucleus sampling) with TopK gives the best results:
| Strategy | TopP | TopK | CFG α | FID ↓ |
|---|---|---|---|---|
| Greedy | - | - | 1.0 | 12.4 |
| TopP only | 0.9 | - | 1.0 | 8.7 |
| TopK only | - | 256 | 1.0 | 9.1 |
| TopP + CFG | 0.9 | - | 3.0 | 5.2 |
| TopP + TopK + CFG | 0.9 | 256 | 3.0 | 4.88 |
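To make the combined TopP + TopK row concrete, here is one way such a filter could be written. It is a sketch under stated assumptions: TopK is applied first, the TopP mass is measured on the original softmax probabilities, and the function name `filter_top_k_top_p` is hypothetical. In a guided setup it would be applied to the CFG-combined logits before sampling.

```python
import math

def filter_top_k_top_p(logits, top_k=256, top_p=0.9):
    """Return renormalized probabilities after TopK, then TopP, filtering."""
    # Numerically stable softmax
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # TopK: keep only the k most likely token ids, most likely first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # TopP: keep the shortest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Zero out everything else and renormalize the survivors
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / mass
    return filtered

toy_logits = [math.log(p) for p in (0.5, 0.3, 0.15, 0.05)]
filtered = filter_top_k_top_p(toy_logits, top_k=3, top_p=0.7)
print(filtered)
```

In the toy call, TopK removes the least likely token and TopP then keeps only the first two, whose probabilities are rescaled to sum to one.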
Higher α produces more prompt-aligned but less diverse outputs.
After pretraining, CM3Leon undergoes supervised fine-tuning (SFT) on a diverse set of multimodal tasks. This is the same idea as instruction tuning for text LLMs (InstructGPT, FLAN), but extended to mixed-modal tasks.
CM3Leon is fine-tuned on tasks that cover all four modality combinations:
| Task | Input | Output | Examples |
|---|---|---|---|
| Text→Image | Text description | Generated image | Text-to-image generation |
| Image→Text | Image | Text description | Captioning, VQA, OCR |
| Image+Text→Text | Image + question | Answer text | Visual QA, image analysis |
| Image→Image | Image + instruction | Modified image | Editing, style transfer |
Each fine-tuning example is formatted as a natural language instruction:
```python
# Instruction tuning examples
examples = [
    # Text→Image
    {"instruction": "Generate an image of a sunset over mountains",
     "output": "[IMAGE_TOKENS]"},
    # Image→Text
    {"instruction": "Describe this image in detail: [IMAGE_TOKENS]",
     "output": "A golden retriever playing fetch in a park..."},
    # Visual QA
    {"instruction": "[IMAGE_TOKENS] How many people are in this image?",
     "output": "Three"},
    # Image Editing
    {"instruction": "[IMAGE_TOKENS] Make this image look like it was painted by Monet",
     "output": "[EDITED_IMAGE_TOKENS]"},
]
```
CM3Leon achieves state-of-the-art results for autoregressive image generation and competitive results across a wide range of multimodal tasks — all from a single 7B model.
| Model | Type | Params | MS-COCO FID ↓ | Training Compute |
|---|---|---|---|---|
| DALL-E | Autoregressive | 12B | 17.89 | Very high |
| Parti | Autoregressive | 20B | 7.23 | Very high |
| DALL-E 2 | Diffusion | 6.5B | 10.39 | High |
| Stable Diffusion | Diffusion | 0.9B | 12.63 | Moderate |
| CM3Leon-7B | Autoregressive | 7B | 4.88 | 5x less than Parti |
After instruction tuning, CM3Leon also performs well on vision-language tasks:
| Task | CM3Leon-7B | Flamingo-9B |
|---|---|---|
| VQAv2 | 47.6 | 51.8 |
| OK-VQA | 37.6 | 44.7 |
| TextVQA | 40.1 | 31.8 |
| COCO Captioning | 61.6 | 84.3 |
CM3Leon is competitive on understanding tasks despite being primarily optimized for generation. It trails Flamingo on some benchmarks but outperforms it on TextVQA (which requires reading text in images).
CM3Leon sits between RA-CM3 (which introduced retrieval for multimodal models) and Chameleon (which took the early-fusion approach to scale). Its techniques directly influenced both Chameleon and Transfusion.
| Model | Year | Key Innovation | From CM3Leon? |
|---|---|---|---|
| RA-CM3 | 2022 | Retrieval for multimodal | CM3Leon's precursor |
| CM3Leon | 2023 | Retrieval + CFG + SFT | This paper |
| Chameleon | 2024 | Early fusion at scale | Training recipe from CM3Leon's scaling laws |
| Transfusion | 2024 | AR text + diffusion images | Dual-objective training philosophy |