Yasunaga, Aghajanyan, Shi, James, Leskovec, Liang et al. (Meta) — 2022

RA-CM3: Retrieval-Augmented Multimodal

Retrieval Augmented Multimodal Language Modeling — combine retrieval with a multimodal generative model so it can look things up instead of memorizing everything.

Prerequisites: Transformers + RAG basics + Image tokenization. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: Retrieval Meets Vision

RAG (Retrieval-Augmented Generation) changed how text models handle knowledge: instead of memorizing every fact, they retrieve relevant documents at inference time. But in 2022, this idea had not yet been extended to a single model that both retrieves and generates text and images. A model that could generate both modalities had to store everything in its weights: visual patterns, text knowledge, all of it.

This is inefficient. A text+image model needs to store representations of millions of visual concepts (what a golden retriever looks like, what a sunset looks like, etc.) in its parameters. Even a model with billions of parameters does not have the capacity to hold all of that at high fidelity, especially for the long tail of rare concepts.

RA-CM3's solution: Give the multimodal model access to a retrieval memory. When asked to generate an image of a golden retriever, the model retrieves actual images of golden retrievers from a large corpus and uses them as reference. When asked a factual question, it retrieves relevant text. The model becomes a reader and composer, not a memorizer.

This is the paper that introduced retrieval-augmented pretraining for multimodal models — the technique that CM3Leon later scaled to achieve state-of-the-art results.

Memorization vs Retrieval

Compare how a memorization-based model (left) and a retrieval-augmented model (right) handle the same generation task. The retrieval model has access to a corpus of reference documents.

What fundamental limitation does RA-CM3 address in multimodal models?

Chapter 1: Multi-Modal Retrieval

Text-only RAG retrieves text documents. RA-CM3 needs to retrieve multimodal documents: documents that contain both text and images. The retrieval query itself can be text, an image, or both.

The retrieval encoder: CLIP

RA-CM3 uses CLIP as its retrieval encoder. CLIP maps both text and images into a shared embedding space, making cross-modal retrieval natural: a text query can retrieve relevant images, and an image query can retrieve relevant text.

Query: Text or Image
Input document (or its text/image component) is encoded by CLIP into a vector q ∈ ℝ^512.
↓ nearest neighbor search
Corpus: Multimodal Documents
Each document in the corpus is pre-encoded by CLIP. Find top-k nearest neighbors by cosine similarity.
Retrieved Documents
Return k multimodal documents (text + images) that are most similar to the query. These become part of the model's input context.
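The nearest-neighbor step above can be sketched without CLIP or FAISS. This is a minimal illustration, not the paper's code: the function name `top_k_cosine` and the tiny 4-d vectors are made up for the example and stand in for real 512-d CLIP embeddings.

```python
import numpy as np

def top_k_cosine(query, corpus_embs, k=2):
    """Return indices and scores of the k corpus rows most similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    scores = c @ q                     # cosine similarity, one score per document
    order = np.argsort(-scores)[:k]    # highest similarity first
    return order, scores[order]

# Toy 4-d vectors (hypothetical) standing in for 512-d CLIP embeddings
corpus_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
query = np.array([1.0, 0.05, 0.0, 0.0])

ids, scores = top_k_cosine(query, corpus_embs, k=2)
print(ids.tolist())  # [0, 1]
```

A brute-force scan like this is fine for a handful of vectors; at corpus scale the same scoring is usually delegated to an index library.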

Three retrieval modes

| Mode | Query | Returns | Use Case |
| --- | --- | --- | --- |
| Text→Multi | Text from input doc | Docs with related text+images | Generate image given text description |
| Image→Multi | Image from input doc | Docs with similar images+text | Caption a novel image |
| Multi→Multi | Combined text+image | Most similar multimodal docs | General multimodal generation |

Why CLIP is perfect for this: CLIP was trained to align text and image representations in a shared space. "A photo of a cat" and an actual photo of a cat have similar CLIP embeddings. This means RA-CM3 can retrieve reference images using text queries (or vice versa) with no additional training. The retriever is frozen — it's used as-is.
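```python
# RA-CM3-style retrieval with CLIP (sketch)
# Assumes: `corpus` is a list of docs with .text (str) and .image (PIL image);
# the ViT-B/32 variant is an assumption, chosen because it yields 512-d vectors.
import clip, faiss, torch
import numpy as np

model, preprocess = clip.load("ViT-B/32", device="cpu")

@torch.no_grad()
def encode_doc(doc):
    # Encode text and image components, normalize each, then average
    text_emb = model.encode_text(clip.tokenize([doc.text], truncate=True))[0]  # [512]
    img_emb = model.encode_image(preprocess(doc.image).unsqueeze(0))[0]        # [512]
    combined = (text_emb / text_emb.norm() + img_emb / img_emb.norm()) / 2     # average
    combined = combined / combined.norm()  # unit norm, so inner product = cosine
    return combined.float().numpy()

# Pre-compute corpus embeddings (offline)
corpus_embeddings = np.stack([encode_doc(doc) for doc in corpus]).astype("float32")

# Build FAISS index (inner product over unit vectors)
index = faiss.IndexFlatIP(512)
index.add(corpus_embeddings)

# At training/inference: retrieve for each input document
def retrieve(query_doc, k=2):
    q_emb = encode_doc(query_doc)[None, :]  # [1, 512]
    scores, ids = index.search(q_emb, k)
    return [corpus[i] for i in ids[0]]
```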
Cross-Modal Retrieval

See how CLIP enables cross-modal retrieval. A text query retrieves relevant images, an image query retrieves relevant text. Both work because CLIP maps them to the same embedding space.

Why does RA-CM3 use CLIP as its retrieval encoder?

Chapter 2: The Architecture

RA-CM3 uses the CM3 architecture (the same one that CM3Leon later scaled up) augmented with retrieved documents. The key architectural choice: retrieved documents are simply prepended to the input sequence.

Input format

[Retrieved doc 1] [SEP] [Retrieved doc 2] [SEP] [Original doc]

Each retrieved document contains interleaved text and image tokens, just like the original document. The model processes the entire concatenated sequence with standard causal attention. Retrieved docs provide context that helps the model generate the original doc.

Model specifications

| Component | Details |
| --- | --- |
| Base model | CM3 (decoder-only transformer) |
| Parameters | Up to 2.7B |
| Image tokenizer | VQ-VAE, 256 tokens per image |
| Retrieval k | 2 documents |
| Retriever | CLIP (frozen) |
| Context length | 4096 tokens |

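To get a feel for what these numbers imply, here is a rough token-budget calculation. It is a sketch under stated assumptions: one image per document and one token per [SEP] are assumptions made for illustration, and `text_token_budget` is a hypothetical helper name.

```python
CONTEXT_LENGTH = 4096     # from the specifications table
TOKENS_PER_IMAGE = 256    # from the specifications table
NUM_RETRIEVED = 2         # from the specifications table

def text_token_budget(images_per_doc=1, num_sep=NUM_RETRIEVED):
    """Tokens left for text once image tokens and separators are counted.

    Assumes every document carries `images_per_doc` images and each
    separator costs one token (both are illustrative assumptions).
    """
    num_docs = NUM_RETRIEVED + 1  # retrieved docs plus the original doc
    image_tokens = num_docs * images_per_doc * TOKENS_PER_IMAGE
    return CONTEXT_LENGTH - image_tokens - num_sep

budget = text_token_budget()
print(budget)  # 4096 - 3*256 - 2 = 3326
```

Under these assumptions, three single-image documents leave roughly 3,300 tokens of text to share, which is one reason the number of retrieved documents is kept small.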
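```python
# RA-CM3 forward pass (sketch: model, retriever, tokenize, SEP and
# cross_entropy are placeholders supplied by the surrounding code)
def ra_cm3_forward(model, doc, retriever, corpus):
    # Step 1: Retrieve similar documents
    retrieved = retriever.search(doc, k=2)

    # Step 2: Tokenize everything
    ret_tokens = [tokenize(r) for r in retrieved]
    doc_tokens = tokenize(doc)

    # Step 3: Concatenate: [ret1] [SEP] [ret2] [SEP] [doc]
    context = ret_tokens[0] + [SEP] + ret_tokens[1] + [SEP]
    input_seq = context + doc_tokens
    len_retrieved = len(context)

    # Step 4: Standard CM3 forward pass
    logits = model(input_seq)  # [L, vocab_size]

    # Loss only on original doc tokens (not retrieved).
    # logits[t] predicts token t + 1, so the first doc token is scored by
    # logits[len_retrieved - 1] and the final logit has no target.
    loss = cross_entropy(logits[len_retrieved - 1 : -1], doc_tokens)
    return loss
```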
No architectural modifications needed! RA-CM3 doesn't add any new layers or attention mechanisms. Retrieved documents are simply prepended to the input, and the standard transformer processes them. The model learns to attend to relevant parts of retrieved docs through regular self-attention training. This simplicity is key to the approach's success.
RA-CM3 Architecture

Watch how retrieved documents are prepended to the input and processed by the CM3 transformer.

How does RA-CM3 incorporate retrieved documents into the model?

Chapter 3: Training Pipeline

RA-CM3's pipeline has three stages (one offline, one training, one inference), each building on the previous one.

Stage 1: Index Building
Encode all documents in the corpus with CLIP. Build a FAISS index for fast nearest-neighbor search. This is done once, offline.
Stage 2: Retrieval-Augmented Pretraining
For each training batch: retrieve 2 similar docs per example, prepend them, train with CM3 objective. Retriever is frozen; only the CM3 model learns.
Stage 3: Evaluation
At inference: retrieve relevant docs, prepend, generate. The model can generate text, images, or both.

Critical design choice: frozen retriever

The CLIP retriever is never fine-tuned. This is important for three reasons:

| Reason | Explanation |
| --- | --- |
| Stability | If the retriever changes during training, the retrieved documents change, making training non-stationary and unstable |
| Efficiency | Re-encoding and re-indexing the entire corpus every time the retriever updates would be prohibitively expensive |
| Quality | CLIP's pre-trained representations are already excellent for cross-modal retrieval; fine-tuning risks degradation |

The loss masking trick: Loss is computed only on the original document tokens, not on the retrieved document tokens. This prevents the model from learning to simply copy retrieved content — instead, it learns to USE retrieved content to better predict the original document.
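The masking described here can be written out in a few lines of NumPy. This is a sketch rather than the paper's training code: `masked_next_token_loss` is a hypothetical name, and the random tokens and logits are only there to make the example runnable.

```python
import numpy as np

def masked_next_token_loss(logits, tokens, n_context):
    """Mean cross-entropy over original-document tokens only.

    logits:    [L, V] array; row t scores the token at position t + 1
    tokens:    [L] int array: retrieved context followed by the original doc
    n_context: number of leading tokens belonging to retrieved docs and separators
    """
    pred = logits[:-1]     # predictions for positions 1 .. L-1
    target = tokens[1:]    # the tokens actually at positions 1 .. L-1
    # Numerically stable log-softmax over the vocabulary
    shifted = pred - pred.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]
    # Keep only targets that lie inside the original document
    mask = np.arange(1, len(tokens)) >= n_context
    return nll[mask].mean()

# Toy data: 8 context tokens followed by 4 original-document tokens
rng = np.random.default_rng(0)
tokens = rng.integers(0, 10, size=12)
logits = rng.normal(size=(12, 10))
loss = masked_next_token_loss(logits, tokens, n_context=8)
```

Because the mask drops every target inside the retrieved context, changing the model's predictions for those positions leaves the loss unchanged. The retrieved tokens still influence the loss indirectly, since the original-document predictions attend to them.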
Training Pipeline Visualizer

Step through the three stages of RA-CM3's training pipeline.

Why is the CLIP retriever kept frozen during RA-CM3 training?

Chapter 4: Retrieval Strategies

Not all retrieval is created equal. RA-CM3 explores different strategies for what to encode as the query and what to retrieve, finding that the choice significantly affects downstream performance.

Query formation

The paper experiments with three query strategies:

| Strategy | Query Encoding | Best For |
| --- | --- | --- |
| Text-only query | Encode only the text portion of the input with CLIP text encoder | Text-to-image generation |
| Image-only query | Encode only the image portion with CLIP image encoder | Image captioning, similar image retrieval |
| Combined query | Average of text and image CLIP embeddings | General multimodal tasks |

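The three strategies differ only in how the query vector is assembled. A minimal sketch, assuming the text and image embeddings have already been produced by CLIP: `make_query` and its `mode` strings are hypothetical names, and the 2-d vectors are toy stand-ins.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def make_query(text_emb=None, image_emb=None, mode="combined"):
    """Build a unit-norm query vector from precomputed CLIP-style embeddings."""
    if mode == "text":
        return l2_normalize(text_emb)
    if mode == "image":
        return l2_normalize(image_emb)
    if mode == "combined":
        # Averaging two unit vectors and re-normalizing equals normalizing their sum
        return l2_normalize(l2_normalize(text_emb) + l2_normalize(image_emb))
    raise ValueError(f"unknown mode: {mode}")

# Toy 2-d embeddings (hypothetical values)
text_emb = np.array([2.0, 0.0])
image_emb = np.array([0.0, 3.0])

q_text = make_query(text_emb=text_emb, mode="text")
q_combined = make_query(text_emb, image_emb, mode="combined")
```

Normalizing each modality before averaging keeps one embedding from dominating the query just because it has a larger norm.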

Retrieval corpus composition

The corpus contains ~150 million multimodal web documents, each consisting of interleaved text and images from web pages. Documents are filtered for quality: minimum text length, image resolution requirements, and deduplication.

Finding: Dense retrieval beats sparse. The paper compares CLIP-based dense retrieval against BM25 (sparse, keyword-based) retrieval. Dense retrieval produces significantly better results because it understands semantic similarity, not just keyword overlap. "A photo of a dog playing" retrieves images of dogs, while BM25 retrieves pages containing the words "photo," "dog," and "playing" — which might not actually show dogs.
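A toy example makes the contrast concrete. Everything here is made up for illustration: the word-overlap scorer is a crude stand-in for BM25 (not BM25 itself), and the 3-d "embeddings" are hand-assigned rather than produced by CLIP.

```python
import re
import numpy as np

def words(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def keyword_overlap(query, text):
    """Count shared words: a simplified stand-in for sparse retrieval."""
    return len(words(query) & words(text))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hand-assigned vectors; the axes loosely mean (dog, cat, camera gear)
docs = {
    "A": ("puppy fetching a ball in the park", np.array([0.9, 0.1, 0.0])),
    "B": ("photo tips: playing with dog-proof camera straps", np.array([0.2, 0.0, 0.9])),
}
query_text = "a photo of a dog playing"
query_vec = np.array([1.0, 0.0, 0.1])

keyword_best = max(docs, key=lambda d: keyword_overlap(query_text, docs[d][0]))
dense_best = max(docs, key=lambda d: cosine(query_vec, docs[d][1]))
print(keyword_best, dense_best)  # B A
```

The keyword scorer prefers document B because it shares "photo", "dog", and "playing" with the query, while the dense scorer prefers document A, whose vector points in the same direction as the query even though the two share almost no words.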
Retrieval Strategy Comparison

Toggle between retrieval strategies to see how different query types retrieve different documents.

Why does RA-CM3 use CLIP-based dense retrieval instead of keyword-based BM25?

Chapter 5: Generation Quality

How does retrieval improve generation quality? The paper provides detailed analysis showing that retrieved documents serve as "visual priors" that dramatically improve image generation fidelity.

Image generation improvement

For text-to-image generation, retrieved images provide the model with visual examples of what the output should look like. Instead of synthesizing a golden retriever from scratch (relying entirely on patterns memorized in weights), the model can reference actual golden retriever images and compose a new one.

| Metric | CM3 (no retrieval) | RA-CM3 (with retrieval) | Improvement |
| --- | --- | --- | --- |
| FID ↓ | 18.1 | 10.4 | 42% better |
| CLIP Score ↑ | 25.3 | 28.7 | 13% better |
| Text Perplexity ↓ | 12.8 | 11.2 | 12% better |

Retrieval improves BOTH modalities. Not just images (42% FID improvement) but also text (12% perplexity improvement). Retrieved text provides factual context; retrieved images provide visual references. The model learns to leverage both for better generation across all modalities.
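The percentages follow directly from the table values. A short check, where `relative_improvement` is a hypothetical helper and the direction flag handles metrics where lower is better:

```python
def relative_improvement(baseline, new, lower_is_better):
    """Percent improvement of `new` over `baseline`."""
    change = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * change / baseline

fid = relative_improvement(18.1, 10.4, lower_is_better=True)
clip_score = relative_improvement(25.3, 28.7, lower_is_better=False)
ppl = relative_improvement(12.8, 11.2, lower_is_better=True)
print(f"{fid:.1f}% {clip_score:.1f}% {ppl:.1f}%")  # 42.5% 13.4% 12.5%
```

The table reports these figures truncated to whole percentages.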

Why retrieval helps even at large scale

One might think retrieval is just a crutch for small models. But the paper shows that even their largest model (2.7B) benefits significantly from retrieval. This makes sense: no model can memorize the long tail of visual concepts. A retrieval corpus gives the model access to a much larger, easily extended "visual vocabulary" without increasing parameter count.

Retrieval Impact on Generation

Toggle retrieval on/off and see how it affects generation quality metrics. Retrieval consistently improves all metrics, especially image quality.

Why does retrieval improve image generation even for large models?

Chapter 6: Results & Showcase

RA-CM3 is evaluated on both image generation and text generation tasks, showing consistent improvements from retrieval augmentation across the board.

Zero-shot image generation (MS-COCO)

| Model | Type | FID ↓ | Retrieval? |
| --- | --- | --- | --- |
| DALL-E | AR | 17.89 | No |
| CM3 (baseline) | AR | 18.1 | No |
| RA-CM3 | AR + Retrieval | 10.4 | Yes |
| DALL-E 2 | Diffusion | 10.39 | No |

Key result: RA-CM3 matches DALL-E 2 (FID 10.4 vs 10.39) despite being a much smaller autoregressive model. Retrieval gives a 42% improvement over the non-retrieval baseline. This proved that retrieval is a viable path to closing the AR-diffusion quality gap — a finding that CM3Leon later amplified with scale.

Capabilities

Text→Image
Generate images from text descriptions, using retrieved reference images for fidelity.
Image→Text
Caption images by retrieving similar image-caption pairs as context.
Interleaved
Generate mixed text+image documents with retrieval support at each step.
RA-CM3 vs Baselines

Compare RA-CM3 against models with and without retrieval across multiple metrics.

What does RA-CM3's FID of 10.4 demonstrate about retrieval-augmented generation?

Chapter 7: Connections

RA-CM3 introduced the recipe that later models in the CM3 lineage built on: retrieval-augmented multimodal pretraining. Its ideas directly enabled CM3Leon, which in turn informed Chameleon and Transfusion.

| Innovation | RA-CM3 (2022) | Where It Went |
| --- | --- | --- |
| Retrieval pretraining | Introduced for multimodal | CM3Leon scaled it with CFG |
| Frozen CLIP retriever | Simple but effective | CLIP-based retrieval reused by CM3Leon |
| Cross-modal retrieval | Text queries retrieve images | Influenced RAG for VLMs |
| Prepend-and-attend | No architecture changes needed | Contrasts with earlier designs such as RETRO (chunked cross-attention) and Atlas (fusion-in-decoder) |

Lesson 1: Retrieval is not just for inference. The biggest impact comes from retrieval during pretraining, not just at inference time. The model learns how to use retrieved information from its first training step.
Lesson 2: Simplicity wins. RA-CM3's approach (prepend retrieved docs, use standard attention) is simpler than alternatives like cross-attention or gated fusion. Yet it works just as well or better. Sometimes the simplest approach is the right one.
Lesson 3: Retrieval complements scale. Even large models benefit from retrieval. This suggests that the future of multimodal AI is not just bigger models, but bigger models with bigger retrieval corpora.
RA-CM3 Legacy

Trace how RA-CM3's innovations influenced subsequent models.

What is RA-CM3's most lasting contribution?