Retrieval-Augmented Multimodal Language Modeling: combine retrieval with a multimodal generative model so it can look things up instead of memorizing everything.
RAG (Retrieval-Augmented Generation) revolutionized text models: instead of memorizing every fact, the model retrieves relevant documents at inference time. As of 2022, though, the idea had not been extended to models that generate both text and images. Such a model had to memorize everything in its weights: visual patterns, text knowledge, all of it.
The inefficiency is significant. A text+image model must store representations of millions of visual concepts (what a golden retriever looks like, what a sunset looks like, and so on) in its parameters. Even a model with billions of parameters lacks the capacity to hold all of them at high fidelity.
This is the paper that introduced retrieval-augmented pretraining for multimodal models — the technique that CM3Leon later scaled to achieve state-of-the-art results.
Text-only RAG retrieves text documents. RA-CM3 needs to retrieve multimodal documents: pages that contain both text and images. The retrieval query itself can be text, an image, or both.
RA-CM3 uses CLIP as its retrieval encoder. CLIP maps both text and images into a shared embedding space, making cross-modal retrieval natural: a text query can retrieve relevant images, and an image query can retrieve relevant text.
| Mode | Query | Returns | Use Case |
|---|---|---|---|
| Text→Multi | Text from input doc | Docs with related text+images | Generate image given text description |
| Image→Multi | Image from input doc | Docs with similar images+text | Caption a novel image |
| Multi→Multi | Combined text+image | Most similar multimodal docs | General multimodal generation |
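```python
# RA-CM3-style retrieval with CLIP (illustrative sketch).
# Assumes `corpus` is a list of docs with a `.text` string and a `.image` PIL image.
import clip  # OpenAI CLIP package
import faiss
import numpy as np
import torch

model, preprocess = clip.load("ViT-B/32", device="cpu")  # 512-dim embeddings
model.eval()

def embed(doc):
    """Average of the normalized CLIP text and image embeddings, shape [512]."""
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize([doc.text], truncate=True))  # [1, 512]
        img_emb = model.encode_image(preprocess(doc.image).unsqueeze(0))        # [1, 512]
    # Normalize so that inner product equals cosine similarity
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    combined = (text_emb + img_emb) / 2  # average
    combined = combined / combined.norm(dim=-1, keepdim=True)
    return combined.squeeze(0).numpy().astype("float32")

# Pre-compute corpus embeddings (offline)
corpus_embeddings = np.stack([embed(doc) for doc in corpus])

# Build FAISS index (exact inner-product search)
index = faiss.IndexFlatIP(512)
index.add(corpus_embeddings)

# At training/inference: retrieve for each input document
def retrieve(query_doc, k=2):
    q_emb = embed(query_doc)[None, :]  # [1, 512]
    scores, ids = index.search(q_emb, k)
    return [corpus[i] for i in ids[0]]
```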
Cross-modal retrieval works in both directions: a text query retrieves relevant images and an image query retrieves relevant text, because CLIP maps both into the same embedding space.
RA-CM3 uses the CM3 architecture (the same one that CM3Leon later scaled up) augmented with retrieved documents. The key architectural choice: retrieved documents are simply prepended to the input sequence.
Each retrieved document contains interleaved text and image tokens, just like the original document. The model processes the entire concatenated sequence with standard causal attention. Retrieved docs provide context that helps the model generate the original doc.
| Component | Details |
|---|---|
| Base model | CM3 (decoder-only transformer) |
| Parameters | Up to 2.7B |
| Image tokenizer | VQ-VAE, 256 tokens per image |
| Retrieval k | 2 documents |
| Retriever | CLIP (frozen) |
| Context length | 4096 tokens |
```python
# RA-CM3 forward pass (sketch; `tokenize`, `SEP`, and `retriever` are assumed helpers)
import torch
import torch.nn.functional as F

def ra_cm3_forward(model, doc, retriever):
    # Step 1: Retrieve similar documents
    retrieved = retriever.search(doc, k=2)

    # Step 2: Tokenize everything (each call returns a list of token ids)
    ret_tokens = [tokenize(r) for r in retrieved]
    doc_tokens = tokenize(doc)

    # Step 3: Concatenate: [ret1] [SEP] [ret2] [SEP] [doc]
    prefix = ret_tokens[0] + [SEP] + ret_tokens[1] + [SEP]
    input_seq = torch.tensor(prefix + doc_tokens)

    # Step 4: Standard CM3 forward pass with causal attention
    logits = model(input_seq)  # [L, vocab_size]

    # Loss only on original doc tokens (not retrieved).
    # The logit at position i predicts token i + 1, so shift by one.
    start = len(prefix)
    loss = F.cross_entropy(logits[start - 1:-1], input_seq[start:])
    return loss
```
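The configuration table implies a simple token budget for the concatenated sequence. The sketch below checks whether two retrieved documents plus the main document fit in the 4096-token context. It assumes one image per document and made-up text lengths; `doc_length`, `sequence_length`, and the value 64 are illustrative and not from the paper.

```python
# Token-budget check for the sequence [ret1] [SEP] [ret2] [SEP] [doc].
# From the table: 256 image tokens per image, 4096-token context.
# Text lengths and "one image per document" are illustrative assumptions.
IMAGE_TOKENS = 256
CONTEXT_LENGTH = 4096

def doc_length(text_tokens, num_images=1):
    """Token count of one interleaved text+image document."""
    return text_tokens + num_images * IMAGE_TOKENS

def sequence_length(retrieved_text_tokens, doc_text_tokens):
    """Length of the retrieved docs (one [SEP] after each) plus the main doc."""
    retrieved = sum(doc_length(t) + 1 for t in retrieved_text_tokens)  # +1 per [SEP]
    return retrieved + doc_length(doc_text_tokens)

total = sequence_length(retrieved_text_tokens=[64, 64], doc_text_tokens=64)
print(total, total <= CONTEXT_LENGTH)  # 962 True
```

With these assumed lengths the sequence uses well under a quarter of the context, which suggests the budget only becomes a constraint for documents with several images or long text.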
RA-CM3's training involves three stages, each building on the previous one.
The CLIP retriever is never fine-tuned. This is important for three reasons:
| Reason | Explanation |
|---|---|
| Stability | If the retriever changes during training, the retrieved documents change, making training non-stationary and unstable |
| Efficiency | Re-encoding and re-indexing the entire corpus every time the retriever updates would be prohibitively expensive |
| Quality | CLIP's pre-trained representations are already excellent for cross-modal retrieval; fine-tuning risks degradation |
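The stability and efficiency rows have a practical consequence: with a frozen encoder, corpus embeddings and nearest neighbors can be computed once and reused for every epoch. The sketch below illustrates this with a stand-in encoder. `toy_encode` and `NeighborCache` are hypothetical names, and the letter-count encoder is only a placeholder for CLIP.

```python
import math

def toy_encode(text):
    """Stand-in for a frozen encoder: normalized letter-count vector (NOT CLIP)."""
    counts = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            counts[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

class NeighborCache:
    """Embeds the corpus once; valid only because the encoder never changes."""

    def __init__(self, corpus, encode):
        self.corpus = corpus
        self.encode = encode
        self.embeddings = [encode(doc) for doc in corpus]  # computed once, offline
        self.encode_calls = len(corpus)
        self._cache = {}

    def neighbors(self, query, k=2):
        if query not in self._cache:  # first lookup: embed the query and search
            q = self.encode(query)
            self.encode_calls += 1
            scores = [sum(a * b for a, b in zip(q, e)) for e in self.embeddings]
            ranked = sorted(range(len(self.corpus)), key=lambda i: -scores[i])
            self._cache[query] = ranked[:k]
        return [self.corpus[i] for i in self._cache[query]]

corpus = ["golden retriever puppy", "sunset over the sea", "retriever dog fetching"]
cache = NeighborCache(corpus, toy_encode)
first = cache.neighbors("golden retriever")
second = cache.neighbors("golden retriever")  # served from the cache, no re-encoding
```

If the encoder were updated during training, both `embeddings` and `_cache` would go stale and the whole corpus would need re-encoding, which is the cost the table describes.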
Not all retrieval is created equal. RA-CM3 explores different strategies for what to encode as the query and what to retrieve, finding that the choice significantly affects downstream performance.
The paper experiments with three query strategies:
| Strategy | Query Encoding | Best For |
|---|---|---|
| Text-only query | Encode only the text portion of the input with CLIP text encoder | Text-to-image generation |
| Image-only query | Encode only the image portion with CLIP image encoder | Image captioning, similar image retrieval |
| Combined query | Average of text and image CLIP embeddings | General multimodal tasks |
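The three strategies in the table differ only in how the query vector is built. Here is a minimal sketch, assuming the text and image embeddings have already been computed by CLIP. `build_query` and `l2_normalize` are hypothetical helpers, and normalizing before and after averaging is an assumption rather than something the source specifies.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def build_query(strategy, text_emb=None, image_emb=None):
    """Return a unit-length query vector for one of the three strategies."""
    if strategy == "text":
        return l2_normalize(text_emb)
    if strategy == "image":
        return l2_normalize(image_emb)
    if strategy == "combined":
        t, i = l2_normalize(text_emb), l2_normalize(image_emb)
        return l2_normalize([(a + b) / 2 for a, b in zip(t, i)])
    raise ValueError(f"unknown strategy: {strategy}")

# Toy 2-d embeddings standing in for CLIP outputs (illustrative only)
text_emb, image_emb = [1.0, 0.0], [0.0, 1.0]
q_text = build_query("text", text_emb, image_emb)
q_combined = build_query("combined", text_emb, image_emb)
```

In this toy case the combined query lies halfway between the two modalities, so it scores equally against documents that match only the text or only the image.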
The corpus contains ~150 million multimodal web documents, each consisting of interleaved text and images from web pages. Documents are filtered for quality: minimum text length, image resolution requirements, and deduplication.
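The three quality filters named above can be sketched as one pass over the corpus. The thresholds, the `WebDoc` fields, and the use of exact-hash deduplication are all assumptions for illustration; the source does not give the actual values or method.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class WebDoc:
    text: str
    image_width: int
    image_height: int
    image_bytes: bytes

# Hypothetical thresholds; the real values are not stated in the source.
MIN_TEXT_CHARS = 20
MIN_IMAGE_SIDE = 64

def filter_corpus(docs):
    """Keep docs passing the length and resolution checks, dropping exact duplicates."""
    seen = set()
    kept = []
    for doc in docs:
        text = doc.text.strip()
        if len(text) < MIN_TEXT_CHARS:
            continue  # text too short
        if min(doc.image_width, doc.image_height) < MIN_IMAGE_SIDE:
            continue  # image too small
        key = hashlib.sha256(text.lower().encode("utf-8") + doc.image_bytes).hexdigest()
        if key in seen:
            continue  # exact duplicate of an earlier doc
        seen.add(key)
        kept.append(doc)
    return kept

docs = [
    WebDoc("A golden retriever playing in the park", 256, 256, b"img-a"),
    WebDoc("too short", 256, 256, b"img-b"),
    WebDoc("A sunset over the ocean at dusk", 32, 256, b"img-c"),
    WebDoc("A golden retriever playing in the park", 256, 256, b"img-a"),
]
kept = filter_corpus(docs)
```

Exact hashing only catches byte-identical duplicates; a web-scale pipeline would more likely need near-duplicate detection, which is outside this sketch.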
How does retrieval improve generation quality? The paper provides detailed analysis showing that retrieved documents serve as "visual priors" that dramatically improve image generation fidelity.
For text-to-image generation, retrieved images provide the model with visual examples of what the output should look like. Instead of synthesizing a golden retriever from scratch (relying entirely on patterns memorized in weights), the model can reference actual golden retriever images and compose a new one.
| Metric | CM3 (no retrieval) | RA-CM3 (with retrieval) | Improvement |
|---|---|---|---|
| FID ↓ | 18.1 | 10.4 | 42% better |
| CLIP Score ↑ | 25.3 | 28.7 | 13% better |
| Text Perplexity ↓ | 12.8 | 11.2 | 12% better |
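The "Improvement" column is a relative change against the no-retrieval baseline, with the sign flipped for metrics where lower is better. The short check below recomputes it from the table values; `relative_improvement` is a helper name introduced here.

```python
def relative_improvement(baseline, new, lower_is_better):
    """Percent improvement of `new` over `baseline`."""
    delta = baseline - new if lower_is_better else new - baseline
    return 100.0 * delta / baseline

# (baseline, with retrieval, lower_is_better), copied from the table above
rows = {
    "FID": (18.1, 10.4, True),
    "CLIP Score": (25.3, 28.7, False),
    "Text Perplexity": (12.8, 11.2, True),
}
for name, (base, new, lower) in rows.items():
    print(f"{name}: {relative_improvement(base, new, lower):.1f}%")
```

This prints 42.5%, 13.4%, and 12.5%, matching the rounded figures in the table.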
One might think retrieval is just a crutch for small models. But the paper shows that even its largest model (2.7B) benefits significantly from retrieval. This makes sense: no model can memorize the long tail of visual concepts. A retrieval corpus extends the model's "visual vocabulary" to whatever the corpus covers, without increasing parameter count.
Retrieval improves all three metrics in the table, with the largest relative gain in image quality (FID).
RA-CM3 is evaluated on both image generation and text generation tasks, showing consistent improvements from retrieval augmentation across the board.
| Model | Type | FID ↓ | Retrieval? |
|---|---|---|---|
| DALL-E | AR | 17.89 | No |
| CM3 (baseline) | AR | 18.1 | No |
| RA-CM3 | AR + Retrieval | 10.4 | Yes |
| DALL-E 2 | Diffusion | 10.39 | No |
RA-CM3 introduced the recipe that the entire CM3 lineage built on: retrieval-augmented multimodal pretraining. Its ideas directly enabled CM3Leon, which in turn informed Chameleon and Transfusion.
| Innovation | RA-CM3 (2022) | Where It Went |
|---|---|---|
| Retrieval pretraining | Introduced for multimodal | CM3Leon scaled it with CFG |
| Frozen CLIP retriever | Simple but effective | Reused by CM3Leon |
| Cross-modal retrieval | Text queries retrieve images | Influenced RAG for VLMs |
| Prepend-and-attend | No architecture changes needed | Kept by CM3Leon; contrasts with the earlier RETRO (chunked cross-attention) and Atlas (fusion-in-decoder) |