RA-CM3 (Yasunaga et al. 2022)

Chapter 0: Retrieval Meets Vision

Retrieval-Augmented Generation (RAG) changed how text models handle knowledge: instead of memorizing every fact, the model retrieves relevant documents at inference time. In 2022, however, this idea had not yet been extended to models that generate both text and images. Such a model had to memorize everything in its weights: visual patterns, text knowledge, all of it.

This is inefficient. A text+image model has to store representations of millions of visual concepts (what a golden retriever looks like, what a sunset looks like, and so on) in its parameters. A model with a few billion parameters has limited capacity for high-fidelity visual memory, especially for rare concepts.

RA-CM3's solution: Give the multimodal model access to a retrieval memory. When asked to generate an image of a golden retriever, the model retrieves actual images of golden retrievers from a large corpus and uses them as reference. When asked a factual question, it retrieves relevant text. The model becomes a reader and composer, not a memorizer.

This is the paper that introduced retrieval-augmented pretraining for multimodal models — the technique that CM3Leon later scaled to achieve state-of-the-art results.

Memorization vs Retrieval

Compare how a memorization-based model (left) and a retrieval-augmented model (right) handle the same generation task. The retrieval model has access to a corpus of reference documents.

What fundamental limitation does RA-CM3 address in multimodal models?

Models must memorize all visual and textual knowledge in their parameters, which is inefficient; RA-CM3 adds a retrieval memory so the model can look up reference images and text instead of storing everything in weights
Models are too slow at inference
Models can't process images at all

Chapter 1: Multi-Modal Retrieval

Text-only RAG retrieves text documents. RA-CM3 needs to retrieve multimodal documents: pages that contain both text and images. The retrieval query itself can be text, an image, or both.

The retrieval encoder: CLIP

RA-CM3 uses CLIP as its retrieval encoder. CLIP maps both text and images into a shared embedding space, making cross-modal retrieval natural: a text query can retrieve relevant images, and an image query can retrieve relevant text.

Query: Text or Image

Input document (or its text/image component) is encoded by CLIP into a vector q ∈ R⁵¹².

↓ nearest neighbor search

Corpus: Multimodal Documents

Each document in the corpus is pre-encoded by CLIP. Find top-k nearest neighbors by cosine similarity.

↓

Retrieved Documents

Return k multimodal documents (text + images) that are most similar to the query. These become part of the model's input context.
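
The nearest-neighbor step in this flow can be sketched without any retrieval library. The example below is a minimal illustration, assuming random NumPy vectors as stand-ins for real CLIP embeddings; the function name `top_k_cosine` and the toy corpus size are choices made for this sketch, not details from the paper.

```python
import numpy as np

def top_k_cosine(query, corpus, k=2):
    """Return (indices, scores) of the k corpus rows most similar to query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    c = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    scores = c @ q                      # [N] cosine similarities
    order = np.argsort(-scores)[:k]     # indices of the k largest scores
    return order, scores[order]

rng = np.random.default_rng(0)
corpus = rng.normal(size=(1000, 512))   # stand-in for pre-encoded documents
# A query that is a slightly perturbed copy of document 42.
query = corpus[42] + 0.1 * rng.normal(size=512)

ids, sims = top_k_cosine(query, corpus, k=2)
print(ids, sims)
```

Because the query is a noisy copy of document 42, that document comes back first. An exhaustive scan like this is fine for a toy corpus; at the scale of a real corpus an index library is the usual choice.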

Three retrieval modes

Mode	Query	Returns	Use Case
Text→Multi	Text from input doc	Docs with related text+images	Generate image given text description
Image→Multi	Image from input doc	Docs with similar images+text	Caption a novel image
Multi→Multi	Combined text+image	Most similar multimodal docs	General multimodal generation

Why CLIP is perfect for this: CLIP was trained to align text and image representations in a shared space. "A photo of a cat" and an actual photo of a cat have similar CLIP embeddings. This means RA-CM3 can retrieve reference images using text queries (or vice versa) with no additional training. The retriever is frozen — it's used as-is.

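python
# RA-CM3-style retrieval with CLIP (sketch using the OpenAI `clip` package and FAISS)
# Assumes `corpus` is a list of docs with .text (str) and .image (PIL image).
import clip, faiss, torch
import numpy as np

# ViT-B/32 is an assumption, chosen because it yields 512-dim embeddings
model, preprocess = clip.load("ViT-B/32", device="cpu")

@torch.no_grad()
def embed(doc):
    # Encode text and image components, L2-normalize each, then average
    text = model.encode_text(clip.tokenize([doc.text], truncate=True))  # [1, 512]
    image = model.encode_image(preprocess(doc.image).unsqueeze(0))      # [1, 512]
    text = text / text.norm(dim=-1, keepdim=True)
    image = image / image.norm(dim=-1, keepdim=True)
    combined = (text + image) / 2
    # Unit norm, so that inner product equals cosine similarity
    combined = combined / combined.norm(dim=-1, keepdim=True)
    return combined[0].float().numpy()

# Pre-compute corpus embeddings (offline)
corpus_embeddings = np.stack([embed(doc) for doc in corpus]).astype("float32")

# Build FAISS index (inner product over unit vectors = cosine similarity)
index = faiss.IndexFlatIP(512)
index.add(corpus_embeddings)

# At training/inference: retrieve for each input document
def retrieve(query_doc, k=2):
    q_emb = embed(query_doc)[None, :].astype("float32")  # [1, 512]
    scores, ids = index.search(q_emb, k)
    return [corpus[i] for i in ids[0]]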

Cross-Modal Retrieval

See how CLIP enables cross-modal retrieval. A text query retrieves relevant images, an image query retrieves relevant text. Both work because CLIP maps them to the same embedding space.

Why does RA-CM3 use CLIP as its retrieval encoder?

Because CLIP maps text and images into a shared embedding space, so a text query like "golden retriever" naturally retrieves images of golden retrievers, enabling cross-modal retrieval with no additional training needed
Because CLIP is the fastest encoder
Because CLIP was the only available option

Chapter 2: The Architecture

RA-CM3 uses the CM3 architecture (the same one that CM3Leon later scaled up) augmented with retrieved documents. The key architectural choice: retrieved documents are simply prepended to the input sequence.

Input format

[Retrieved doc 1] [SEP] [Retrieved doc 2] [SEP] [Original doc]

Each retrieved document contains interleaved text and image tokens, just like the original document. The model processes the entire concatenated sequence with standard causal attention. Retrieved docs provide context that helps the model generate the original doc.

Model specifications

Component	Details
Base model	CM3 (decoder-only transformer)
Parameters	Up to 2.7B
Image tokenizer	VQ-VAE, 256 tokens per image
Retrieval k	2 documents
Retriever	CLIP (frozen)
Context length	4096 tokens
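
The input format and the context limit together imply a token budget: the two retrieved documents, their separators, and the original document must all fit in 4096 tokens. The sketch below shows one way to pack the sequence. The truncation policy (keep the whole original document and split the remaining budget evenly among retrieved documents), the `SEP` token id, and the toy document sizes are all assumptions for illustration; the source does not describe how overlong inputs are handled.

```python
CONTEXT_LENGTH = 4096   # from the specification table
SEP = -1                # hypothetical separator token id, for illustration only

def pack_sequence(retrieved, doc, context_length=CONTEXT_LENGTH):
    """Build [ret1] [SEP] [ret2] [SEP] [doc], truncating retrieved docs if needed.

    Assumed policy: the original doc is kept whole and the remaining budget
    is split evenly among the retrieved docs.
    """
    budget = context_length - len(doc) - len(retrieved)  # one SEP per retrieved doc
    if budget < 0:
        raise ValueError("original document alone exceeds the context length")
    per_doc = budget // max(len(retrieved), 1)
    seq = []
    for r in retrieved:
        seq.extend(r[:per_doc])
        seq.append(SEP)
    seq.extend(doc)
    return seq

# Toy token lists: each document is 256 image tokens plus some text tokens.
ret1 = list(range(256 + 2000))
ret2 = list(range(256 + 100))
doc = list(range(256 + 300))

seq = pack_sequence([ret1, ret2], doc)
print(len(seq), "tokens, limit", CONTEXT_LENGTH)
```

Here the first retrieved document is cut to its share of the budget while the second fits untouched. A real implementation would also avoid cutting through the middle of an image's token block.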

python
# RA-CM3 forward pass (pseudocode: tokenize, SEP and cross_entropy are assumed helpers)
def ra_cm3_forward(model, doc, retriever, corpus):
    # Step 1: Retrieve similar documents
    retrieved = retriever.search(doc, k=2)

    # Step 2: Tokenize everything
    ret_tokens = [tokenize(r) for r in retrieved]
    doc_tokens = tokenize(doc)

    # Step 3: Concatenate: [ret1] [SEP] [ret2] [SEP] [doc]
    prefix = ret_tokens[0] + [SEP] + ret_tokens[1] + [SEP]
    input_seq = prefix + doc_tokens

    # Step 4: Standard CM3 forward pass
    logits = model(input_seq)  # [L, vocab_size]

    # Loss only on original doc tokens (not retrieved).
    # The logit at position t predicts token t + 1, so shift by one:
    # positions n-1 .. L-2 predict doc_tokens[0] .. doc_tokens[-1].
    n = len(prefix)
    loss = cross_entropy(logits[n - 1:-1], doc_tokens)
    return loss

No architectural modifications needed! RA-CM3 doesn't add any new layers or attention mechanisms. Retrieved documents are simply prepended to the input, and the standard transformer processes them. The model learns to attend to relevant parts of retrieved docs through regular self-attention training. This simplicity is key to the approach's success.

RA-CM3 Architecture

Watch how retrieved documents are prepended to the input and processed by the CM3 transformer.

How does RA-CM3 incorporate retrieved documents into the model?

By simply prepending them to the input sequence, with no architectural modifications needed. The standard transformer learns to attend to relevant parts of retrieved docs through regular self-attention, and loss is computed only on the original document tokens
By adding special cross-attention layers
By fine-tuning the retriever jointly with the model

Chapter 3: Training Pipeline

RA-CM3's pipeline has three stages, each building on the previous one: an offline indexing step, retrieval-augmented pretraining, and evaluation.

Stage 1: Index Building

Encode all documents in the corpus with CLIP. Build a FAISS index for fast nearest-neighbor search. This is done once, offline.

↓

Stage 2: Retrieval-Augmented Pretraining

For each training batch: retrieve 2 similar docs per example, prepend them, train with CM3 objective. Retriever is frozen; only the CM3 model learns.

↓

Stage 3: Evaluation

At inference: retrieve relevant docs, prepend, generate. The model can generate text, images, or both.

Critical design choice: frozen retriever

The CLIP retriever is never fine-tuned. This is important for three reasons:

Reason	Explanation
Stability	If the retriever changes during training, the retrieved documents change, making training non-stationary and unstable
Efficiency	Re-encoding and re-indexing the entire corpus every time the retriever updates would be prohibitively expensive
Quality	CLIP's pre-trained representations are already excellent for cross-modal retrieval; fine-tuning risks degradation
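
The stability and efficiency arguments can be illustrated with a toy retriever. In the sketch below a random linear projection stands in for CLIP, and random feature vectors stand in for documents and queries; all names and sizes are hypothetical. With frozen weights the index is built once and every lookup is repeatable. Once the weights change, the whole corpus has to be re-encoded and the neighbors a query receives may shift.

```python
import numpy as np

rng = np.random.default_rng(0)
raw_docs = rng.normal(size=(500, 64))      # toy document features
raw_queries = rng.normal(size=(20, 64))    # toy query features

def encode(x, W):
    """Toy encoder: linear projection followed by L2 normalization."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def top1(queries, index):
    """Index of the most similar document for each query."""
    return np.argmax(queries @ index.T, axis=1)

W = rng.normal(size=(64, 32))              # "pretrained" encoder weights
index = encode(raw_docs, W)                # built once, offline

# Frozen retriever: the same query returns the same neighbor at every step.
step_1 = top1(encode(raw_queries, W), index)
step_1000 = top1(encode(raw_queries, W), index)

# Updated retriever: the weights drift, so every document must be re-encoded
# before the index is usable again, and neighbors may change.
W_updated = W + rng.normal(size=W.shape)
reindexed = encode(raw_docs, W_updated)
after_update = top1(encode(raw_queries, W_updated), reindexed)
changed = int((step_1 != after_update).sum())

print("frozen: neighbors identical ->", np.array_equal(step_1, step_1000))
print("updated: top-1 neighbor changed for", changed, "of", len(raw_queries), "queries")
```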

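The loss masking trick: Loss is computed only on the original document tokens, not on the retrieved document tokens. The retrieved tokens still sit in the context, so the model can attend to them, but it is never rewarded for reproducing them. The training signal therefore pushes the model to USE retrieved content to better predict the original document, rather than to model the retrieved documents themselves.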
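
As a concrete illustration of masking the loss to the original document, here is a small NumPy sketch. It assumes the usual next-token setup in which the logit at position t predicts token t + 1; the function name, the random logits, and the toy sizes are made up for the example.

```python
import numpy as np

def masked_next_token_loss(logits, tokens, loss_mask):
    """Mean cross-entropy over positions where loss_mask is True.

    logits:    [L, V] scores; logits[t] predicts tokens[t + 1]
    tokens:    [L] token ids of the full concatenated sequence
    loss_mask: [L] True for tokens that belong to the original document
    """
    pred, target, mask = logits[:-1], tokens[1:], loss_mask[1:]
    # Numerically stable log-softmax.
    shifted = pred - pred.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(target)), target]   # [L - 1]
    return (nll * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
vocab, n_retrieved, n_doc = 50, 12, 5
tokens = rng.integers(0, vocab, size=n_retrieved + n_doc)
logits = rng.normal(size=(len(tokens), vocab))
loss_mask = np.arange(len(tokens)) >= n_retrieved     # only original-doc tokens

loss = masked_next_token_loss(logits, tokens, loss_mask)
print(float(loss))
```

Predictions made for retrieved tokens contribute nothing to this loss, so changing the logits at those positions leaves the value unchanged.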

Training Pipeline Visualizer

Step through the three stages of RA-CM3's training pipeline.

Why is the CLIP retriever kept frozen during RA-CM3 training?

Because updating the retriever would change which documents are retrieved, making training non-stationary; it would require re-indexing the entire corpus; and CLIP's pre-trained representations are already excellent for cross-modal retrieval
Because CLIP can't be fine-tuned
Because it would make the model too large

Chapter 4: Retrieval Strategies

Not all retrieval is created equal. RA-CM3 explores different strategies for what to encode as the query and what to retrieve, finding that the choice significantly affects downstream performance.

Query formation

The paper experiments with three query strategies:

Strategy	Query Encoding	Best For
Text-only query	Encode only the text portion of the input with CLIP text encoder	Text-to-image generation
Image-only query	Encode only the image portion with CLIP image encoder	Image captioning, similar image retrieval
Combined query	Average of text and image CLIP embeddings	General multimodal tasks
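
The three strategies in the table differ only in how the query vector is formed. A minimal sketch follows, assuming the text and image embeddings have already been produced by the CLIP encoders; random vectors stand in for them here, and `make_query` is a hypothetical helper name.

```python
import numpy as np

def l2_normalize(v):
    return v / np.linalg.norm(v)

def make_query(text_emb=None, image_emb=None, strategy="combined"):
    """Form a unit-norm query vector from CLIP-style embeddings.

    strategy is one of "text", "image", "combined".
    """
    if strategy == "text":
        if text_emb is None:
            raise ValueError("text strategy needs a text embedding")
        return l2_normalize(text_emb)
    if strategy == "image":
        if image_emb is None:
            raise ValueError("image strategy needs an image embedding")
        return l2_normalize(image_emb)
    if strategy == "combined":
        if text_emb is None or image_emb is None:
            raise ValueError("combined strategy needs both embeddings")
        # Normalize each part first so neither modality dominates the average.
        return l2_normalize((l2_normalize(text_emb) + l2_normalize(image_emb)) / 2)
    raise ValueError(f"unknown strategy: {strategy}")

rng = np.random.default_rng(0)
text_emb = rng.normal(size=512)    # stand-in for a CLIP text embedding
image_emb = rng.normal(size=512)   # stand-in for a CLIP image embedding

q_text = make_query(text_emb=text_emb, strategy="text")
q_image = make_query(image_emb=image_emb, strategy="image")
q_combined = make_query(text_emb, image_emb, strategy="combined")
print(q_text.shape, q_image.shape, q_combined.shape)
```

All three queries live in the same 512-dimensional space, so the same index can serve every strategy.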

Retrieval corpus composition

The corpus contains ~150 million multimodal web documents, each consisting of interleaved text and images from web pages. Documents are filtered for quality: minimum text length, image resolution requirements, and deduplication.
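
The three filters named above can be sketched as a single pass over the corpus. The thresholds, the `Doc` fields, and the hash-based exact-duplicate check below are hypothetical; the source does not state the actual values or the deduplication method.

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    image_width: int
    image_height: int
    image_bytes: bytes

# Hypothetical thresholds, for illustration only.
MIN_TEXT_CHARS = 20
MIN_IMAGE_SIDE = 64

def filter_corpus(docs):
    """Keep docs that pass the length and resolution checks, dropping exact duplicates."""
    seen = set()
    kept = []
    for d in docs:
        if len(d.text.strip()) < MIN_TEXT_CHARS:
            continue                      # text too short
        if min(d.image_width, d.image_height) < MIN_IMAGE_SIDE:
            continue                      # image too small
        key = hashlib.sha256(d.text.strip().lower().encode() + d.image_bytes).hexdigest()
        if key in seen:
            continue                      # exact duplicate of a kept doc
        seen.add(key)
        kept.append(d)
    return kept

docs = [
    Doc("A golden retriever playing fetch in the park.", 512, 384, b"img-a"),
    Doc("A golden retriever playing fetch in the park.", 512, 384, b"img-a"),  # duplicate
    Doc("dog", 512, 384, b"img-b"),                                            # short text
    Doc("A sunset over the ocean with orange clouds.", 32, 32, b"img-c"),      # small image
]
kept = filter_corpus(docs)
print(len(kept), "of", len(docs), "documents kept")
```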

Finding: Dense retrieval beats sparse. The paper compares CLIP-based dense retrieval against BM25 (sparse, keyword-based) retrieval. Dense retrieval produces significantly better results because it understands semantic similarity, not just keyword overlap. "A photo of a dog playing" retrieves images of dogs, while BM25 retrieves pages containing the words "photo," "dog," and "playing" — which might not actually show dogs.
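
To make the keyword-overlap failure mode concrete, here is a small pure-Python Okapi BM25 scorer run on three made-up documents. The documents, the query, and the parameter values `k1=1.5` and `b=0.75` (common defaults) are assumptions for this sketch and do not come from the paper's experiments.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized document for a tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "buy a photo frame for your dog playing cards poster".split(),  # right words, wrong content
    "a puppy chases a ball in the park".split(),                    # right content, no shared words
    "stock market report for the third quarter".split(),            # unrelated
]
query = "photo dog playing".split()

scores = bm25_scores(query, docs)
print(scores)
```

The first document wins because it contains all three query words, while the puppy document scores zero because it shares none of them. Ranking the puppy document highly requires a representation that captures meaning, which is the role the dense CLIP embeddings play.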

Retrieval Strategy Comparison

Toggle between retrieval strategies to see how different query types retrieve different documents.

Why does RA-CM3 use CLIP-based dense retrieval instead of keyword-based BM25?

Because dense retrieval understands semantic similarity (a text query about dogs retrieves images of dogs), while BM25 only matches keywords and may return documents that contain the right words but wrong content
Because BM25 is too slow
Because BM25 can't handle images

Chapter 5: Generation Quality

How does retrieval improve generation quality? The paper provides detailed analysis showing that retrieved documents serve as "visual priors" that dramatically improve image generation fidelity.

Image generation improvement

For text-to-image generation, retrieved images provide the model with visual examples of what the output should look like. Instead of synthesizing a golden retriever from scratch (relying entirely on patterns memorized in weights), the model can reference actual golden retriever images and compose a new one.

Metric	CM3 (no retrieval)	RA-CM3 (with retrieval)	Improvement
FID ↓	18.1	10.4	42% better
CLIP Score ↑	25.3	28.7	13% better
Text Perplexity ↓	12.8	11.2	12% better
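
The improvement column is the relative change with respect to the no-retrieval baseline, with the sign flipped for metrics where lower is better. The short check below recomputes it from the numbers in the table; the helper name `relative_improvement` is just for this example.

```python
def relative_improvement(baseline, new, lower_is_better):
    """Percentage improvement of new over baseline."""
    change = (baseline - new) if lower_is_better else (new - baseline)
    return 100.0 * change / baseline

# (baseline, with retrieval, lower_is_better), taken from the table above
rows = {
    "FID": (18.1, 10.4, True),
    "CLIP Score": (25.3, 28.7, False),
    "Text Perplexity": (12.8, 11.2, True),
}

improvements = {name: relative_improvement(*vals) for name, vals in rows.items()}
for name, pct in improvements.items():
    print(f"{name}: {pct:.1f}%")
```

The results come to roughly 42.5%, 13.4%, and 12.5%, consistent with the rounded figures in the table.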

Retrieval improves BOTH modalities. Not just images (42% FID improvement) but also text (12% perplexity improvement). Retrieved text provides factual context; retrieved images provide visual references. The model learns to leverage both for better generation across all modalities.

Why retrieval helps even at large scale

One might think retrieval is just a crutch for small models. But the paper shows that even their largest model (2.7B) benefits significantly from retrieval. This makes sense: no model can memorize the long tail of visual concepts. A retrieval corpus effectively gives the model access to unlimited "visual vocabulary" without increasing parameter count.

Retrieval Impact on Generation

Toggle retrieval on/off and see how it affects generation quality metrics. Retrieval consistently improves all metrics, especially image quality.

Why does retrieval improve image generation even for large models?

Because no model can memorize the long tail of visual concepts; retrieval provides reference images for rare or specific visual content, effectively giving the model unlimited "visual vocabulary" without increasing parameter count
Because larger models have worse memory
Because retrieval replaces the model's own generation

Chapter 6: Results & Showcase

RA-CM3 is evaluated on both image generation and text generation tasks, showing consistent improvements from retrieval augmentation across the board.

Zero-shot image generation (MS-COCO)

Model	Type	FID ↓	Retrieval?
DALL-E	AR	17.89	No
CM3 (baseline)	AR	18.1	No
RA-CM3	AR + Retrieval	10.4	Yes
DALL-E 2	Diffusion	10.39	No

Key result: RA-CM3 matches DALL-E 2 (FID 10.4 vs 10.39) despite being a much smaller autoregressive model. Retrieval gives a 42% improvement over the non-retrieval baseline. This proved that retrieval is a viable path to closing the AR-diffusion quality gap — a finding that CM3Leon later amplified with scale.

Capabilities

Text→Image

Generate images from text descriptions, using retrieved reference images for fidelity.

↓

Image→Text

Caption images by retrieving similar image-caption pairs as context.

↓

Interleaved

Generate mixed text+image documents with retrieval support at each step.

RA-CM3 vs Baselines

Compare RA-CM3 against models with and without retrieval across multiple metrics.

What does RA-CM3's FID of 10.4 demonstrate about retrieval-augmented generation?

That retrieval augmentation can bring a small autoregressive model to match DALL-E 2 quality (FID 10.39), proving retrieval is a viable alternative to massive scale or diffusion for achieving high image generation quality
That autoregressive models are always better than diffusion
That CLIP is the best image model

Chapter 7: Connections

RA-CM3 introduced the recipe that later models in the CM3 lineage built on: retrieval-augmented multimodal pretraining. Its ideas directly enabled CM3Leon, which in turn informed Chameleon and Transfusion.

Innovation	RA-CM3 (2022)	Where It Went
Retrieval pretraining	Introduced for multimodal	CM3Leon scaled it with CFG
Frozen CLIP retriever	Simple but effective	Reused in CM3Leon's retrieval-augmented pretraining
Cross-modal retrieval	Text queries retrieve images	Influenced RAG for VLMs
Prepend-and-attend	No architecture changes needed	Reused by CM3Leon; contrasts with RETRO's chunked cross-attention

Lesson 1: Retrieval is not just for inference. The biggest impact comes from retrieval during pretraining, not just at inference time. The model learns how to use retrieved information from its first training step.

Lesson 2: Simplicity wins. RA-CM3's approach (prepend retrieved docs, use standard attention) is simpler than alternatives like cross-attention or gated fusion. Yet it works just as well or better. Sometimes the simplest approach is the right one.

Lesson 3: Retrieval complements scale. Even large models benefit from retrieval. This suggests that the future of multimodal AI is not just bigger models, but bigger models with bigger retrieval corpora.

RA-CM3 Legacy

Trace how RA-CM3's innovations influenced subsequent models.

What is RA-CM3's most lasting contribution?

Establishing that retrieval-augmented pretraining (not just inference-time retrieval) dramatically improves multimodal models: a simple technique (prepend retrieved docs, use standard attention) that requires no architecture changes and benefits models at every scale
The specific model architecture
The CLIP retriever design
