Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. (Facebook AI Research) — NeurIPS 2020

RAG: Retrieval-Augmented Generation

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks — instead of memorizing the world inside model weights, retrieve relevant documents at inference time and condition generation on them.

Prerequisites: Seq2seq models (encoder-decoder) + Dense embeddings + Basic probability. That's it.
8 Chapters · 8+ Simulations · 0 Assumed Knowledge

Chapter 0: The Knowledge Problem

Imagine you're building a question-answering system. Someone asks: "What is the capital of Burkina Faso?" Your language model was trained on internet text and it might know the answer — Ouagadougou — because that fact appeared in its training data. But what if someone asks: "Who won the 2024 Nobel Prize in Physics?" Your model was trained in 2020. It literally cannot know.

This is the knowledge problem in language models. Every fact the model "knows" must be baked into its parameters during training. This creates three fundamental failures:

| Failure Mode | Why It Happens | Example |
| --- | --- | --- |
| Staleness | Training data has a cutoff date. The world changes; the model doesn't. | "Who is the current president of X?" is wrong after an election. |
| Hallucination | The model generates fluent but factually incorrect text when it doesn't know the answer. | Confidently inventing a fake paper citation with plausible-sounding authors. |
| Opacity | You can't trace which training document a fact came from. No provenance, no citations. | Model says "The drug interacts with X" but you can't verify the source. |

Before RAG, the standard approach to knowledge-intensive tasks was to make the model bigger. More parameters means more capacity to memorize facts. GPT-3 demonstrated this: at 175 billion parameters, it could answer trivia questions with impressive accuracy — because the answers were literally stored in its weights.

But this approach is absurdly inefficient. To store one more fact, you need to retrain a 175B-parameter model. To update a stale fact, you need to retrain again. The knowledge is entangled with the model's language abilities in ways we can't control or inspect.

The core insight of RAG: Separate knowledge from reasoning. Let the language model focus on understanding and generating language. For factual knowledge, give it access to a searchable document store that can be updated at any time without retraining. At inference time, retrieve relevant documents, paste them into the context, and let the model generate an answer conditioned on those documents. The model becomes a reader, not a memorizer.

Think of it like the difference between a closed-book exam and an open-book exam. In a closed-book exam (standard LM), you must memorize every possible fact. In an open-book exam (RAG), you just need to know how to find the right page and read it.

Closed-Book vs Open-Book QA

Click "Ask Question" to see how a closed-book model (left) and an open-book RAG model (right) handle the same question. The closed-book model must rely on memorized parameters. The RAG model retrieves relevant passages first.

What is the fundamental problem RAG aims to solve?

Chapter 1: Dense Passage Retrieval

Before we can build RAG, we need a way to find relevant documents. Given a question like "What causes aurora borealis?", we need to search through millions of Wikipedia passages and return the handful that contain the answer. This is the job of the retriever.

Traditional search engines use sparse retrieval — methods like TF-IDF or BM25 that match exact keywords. If your question says "aurora" and a passage says "aurora," it's a match. This works well for simple queries but fails badly when the question and answer use different words. "What causes the Northern Lights?" should match a passage about "solar wind interacting with Earth's magnetosphere" — but there are no shared keywords.
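To make the vocabulary-mismatch failure concrete, here is a minimal sketch of keyword matching. It is not BM25: it only counts shared content words, and the stopword list is a hand-picked assumption for this example.

```python
import re

# Hand-picked stopword list (an assumption for this sketch, not a standard list).
STOPWORDS = {"what", "the", "with", "s", "a", "of", "is", "by", "are", "in"}

def content_words(text):
    """Lowercase the text, split it into alphabetic tokens, drop stopwords."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def keyword_overlap(query, passage):
    """Number of content words that appear in both the query and the passage."""
    return len(content_words(query) & content_words(passage))

same_words = keyword_overlap(
    "What causes aurora borealis?",
    "The aurora borealis is caused by solar wind",
)
different_words = keyword_overlap(
    "What causes the Northern Lights?",
    "Solar wind interacting with Earth's magnetosphere",
)
print(same_words, different_words)  # 2 0
```

The second pair is about the same phenomenon, but a pure keyword scorer gives it no credit because no content word is shared.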

RAG uses Dense Passage Retrieval (DPR), published by Karpukhin et al. just months before the RAG paper. Instead of matching keywords, DPR encodes both the question and every passage into dense vector embeddings, then finds passages whose vectors are closest to the question vector.

How DPR works

DPR uses two separate BERT encoders:

Question Encoder: $\text{BERT}_Q$
Takes a question string, outputs a 768-dim vector. Input: "What causes aurora borealis?" → Output: $q \in \mathbb{R}^{768}$
Passage Encoder: $\text{BERT}_P$
Takes a passage string, outputs a 768-dim vector. Input: "Solar wind particles..." → Output: $p \in \mathbb{R}^{768}$
Similarity: dot product
$\text{score}(q, p) = q^\top p$. Higher dot product = more relevant passage. Retrieve top-k by score.

The two encoders are trained on pairs of (question, relevant passage) using a contrastive loss: push the question vector close to the relevant passage vector and far from irrelevant passage vectors.

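$$L = -\log \frac{e^{q^\top p^+}}{e^{q^\top p^+} + \sum_i e^{q^\top p_i}}$$

Where $p^+$ is the correct passage and the $p_i$ are negative passages (wrong answers). This is the negative log-likelihood of the positive passage under a softmax over dot products, an InfoNCE-style loss from the same family of contrastive objectives used in CLIP and SimCLR. It says: make the correct (question, passage) pair have a higher dot product than any incorrect pair.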
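As a small illustration of this loss, the sketch below evaluates it on hand-written 2-dimensional vectors. The vectors and the function name `info_nce_loss` are assumptions for this example, not part of DPR's code.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def info_nce_loss(q, pos, negs):
    """Negative log softmax probability of the positive passage.

    The positive logit goes first; log-sum-exp is computed stably by
    subtracting the maximum logit before exponentiating.
    """
    logits = [dot(q, pos)] + [dot(q, n) for n in negs]
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in logits))
    return log_denominator - logits[0]

q = [1.0, 0.0]
# Positive aligned with the question: low loss.
aligned = info_nce_loss(q, pos=[2.0, 0.0], negs=[[0.0, 1.0], [-1.0, 0.0]])
# Positive pointing away while a negative is aligned: high loss.
misaligned = info_nce_loss(q, pos=[-1.0, 0.0], negs=[[0.0, 1.0], [2.0, 0.0]])
print(aligned < misaligned)  # True
```

Minimizing this quantity raises the positive dot product relative to every negative, which is the behavior the formula describes.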

The MIPS trick: searching millions of passages in milliseconds

At inference time, we need to find the top-k passages out of ~21 million Wikipedia passages. A brute-force search computes all 21 million dot products for every query, which is exact but costly at this scale. DPR performs Maximum Inner Product Search (MIPS) with Facebook's FAISS library, which offers both exact (flat) indexes and approximate nearest-neighbor indexes that trade a little recall for millisecond-scale search.

python
# Building a DPR index with FAISS
# (encode_all_passages and question_encoder are placeholder helpers, not FAISS APIs)
import faiss
import numpy as np

# Pre-compute passage embeddings (done ONCE, offline)
passage_embeddings = encode_all_passages(wikipedia)  # shape: [21M, 768]
passage_embeddings = np.ascontiguousarray(passage_embeddings, dtype=np.float32)  # FAISS expects float32

# Build a FAISS inner-product index.
# IndexFlatIP performs EXACT (brute-force) search; for approximate search at
# this scale, swap in an approximate index type such as HNSW or IVF.
index = faiss.IndexFlatIP(768)
index.add(passage_embeddings)              # add all 21M vectors

# At inference: encode question, search index
q_vec = question_encoder("What causes aurora?")  # [1, 768], float32
scores, indices = index.search(q_vec, 5)      # top-5 passages
# scores: [1, 5] similarity scores
# indices: [1, 5] passage IDs to look up text

Pre-compute once, search forever. The key efficiency trick: all 21 million passage embeddings are computed ONCE offline and stored in the FAISS index. At inference time, only the question needs to be encoded (one BERT forward pass). With an approximate FAISS index, the search over 21M vectors takes on the order of milliseconds. This asymmetry (expensive offline, cheap online) is what makes dense retrieval practical.
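For reference, this is what exact MIPS computes, written as a dependency-free sketch on toy 2-dimensional vectors. The function name `exact_mips` is an assumption for this example; approximate indexes aim to return nearly the same result without scoring every passage.

```python
import heapq

def exact_mips(query, passages, k):
    """Return the k (score, index) pairs with the largest inner product.

    Cost grows linearly with the number of passages, which is why
    large corpora usually rely on an approximate index instead.
    """
    scored = (
        (sum(a * b for a, b in zip(query, p)), i)
        for i, p in enumerate(passages)
    )
    return heapq.nlargest(k, scored)

passages = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [-0.3, 0.4]]
top = exact_mips([1.0, 0.0], passages, k=2)
top_ids = [i for _, i in top]
print(top_ids)  # [1, 2]
```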

Hard negatives matter

The quality of DPR depends heavily on how you choose negative examples during training. Random negatives (random passages from the corpus) are too easy — the model learns to distinguish by topic rather than by relevance. Hard negatives are passages that are topically related but don't answer the question. Karpukhin et al. found that using BM25-retrieved passages as hard negatives (high keyword overlap but not the answer) dramatically improved retrieval quality.

python
# DPR training with hard negatives
# (corpus, bm25, encode_q, encode_p and nce_loss are placeholder helpers)
import random

for question, positive_passage in training_data:
    # Easy negative: random passage from corpus
    easy_neg = random.choice(corpus)

    # Hard negative: top BM25 hit that does not answer the question
    # (approximated here as "is not the positive passage")
    bm25_results = bm25.search(question, k=100)
    hard_neg = next(p for p in bm25_results if p != positive_passage)

    # Contrastive loss: push q closer to pos, farther from negs
    loss = nce_loss(
        q=encode_q(question),
        pos=encode_p(positive_passage),
        negs=[encode_p(easy_neg), encode_p(hard_neg)]
    )

This detail matters for RAG because the retriever quality directly determines the generator's input quality. Garbage in, garbage out — if the retriever returns irrelevant passages, even a perfect generator can't produce correct answers.

Dense vs Sparse Retrieval

Type a query (or click examples below) to see how sparse retrieval (BM25, keyword matching) and dense retrieval (DPR, semantic matching) rank passages differently. Notice how dense retrieval finds semantically relevant passages even without keyword overlap.

Why does RAG use dense retrieval (DPR) instead of sparse keyword matching (BM25)?

Chapter 2: The RAG Architecture

Now we have the two pieces: a retriever that finds relevant documents and a generator that produces text. RAG combines them into a single end-to-end model. The architecture is elegant: query in, answer out, with document retrieval happening in the middle.

The two components

Retriever: $p_\eta(z \mid x)$. Given an input query x, the retriever returns a distribution over documents z. In practice, it retrieves the top-k documents (k=5 or k=10) from a FAISS index using DPR. The parameter $\eta$ refers to the $\text{BERT}_Q$ question encoder weights (the passage encoder is frozen and pre-computed).

Generator: $p_\theta(y_i \mid x, z, y_{1:i-1})$. Given the input x, a retrieved document z, and previously generated tokens $y_{1:i-1}$, the generator predicts the next token $y_i$. This is BART, a pre-trained seq2seq model. The encoder processes [x; z] (query concatenated with retrieved document), and the decoder generates the answer autoregressively.

Input Query x
"What year was the Eiffel Tower completed?"
↓ $\text{BERT}_Q$ encodes to $q \in \mathbb{R}^{768}$
MIPS Retrieval
Search FAISS index of 21M passages. Return top-k $(z_1, \ldots, z_k)$ with scores $p_\eta(z_i \mid x)$.
↓ for each $z_i$
BART Generator
Encode $[x ; z_i]$, decode answer y. Each retrieved doc gives a different generation probability.
↓ marginalize over docs
Final Answer y
"1889", produced by marginalizing over all k retrieved documents.

The marginalization step

This is the mathematical heart of RAG. We don't just pick the best document and generate from it. Instead, we treat the retrieved documents as latent variables and marginalize them out. The probability of generating output y given input x is:

$$p(y \mid x) \approx \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$

Where $p_\eta(z \mid x)$ is the retriever's score for document z (how relevant it is to the query) and $p_\theta(y \mid x, z)$ is the generator's probability of producing answer y given document z.

In plain English: run the generator k times, once per retrieved document. Each run produces a probability for the answer. Weight each probability by how relevant the document was. Sum them up. Documents that are more relevant contribute more to the final answer.

Why marginalize instead of just using the top-1 document? Because the retriever isn't perfect. The most relevant passage might be ranked second or third. By marginalizing over multiple documents, RAG hedges its bets — if document z1 contains part of the answer and z3 contains another part, both contribute. This is fundamentally more robust than picking a single document and hoping it's right.
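A worked example with made-up numbers shows how marginalizing can overrule the top-ranked document. All probabilities below are hypothetical and chosen only for illustration.

```python
# Hypothetical retriever priors p(z|x) for three retrieved documents (sum to 1).
doc_priors = [0.40, 0.35, 0.25]

# Hypothetical generator probabilities p(y|x,z) for two candidate answers,
# listed per document in the same order as doc_priors.
gen_probs = {
    "1887": [0.60, 0.05, 0.05],  # only the top-ranked document favors this
    "1889": [0.30, 0.90, 0.85],  # the other two documents strongly favor this
}

def marginal(answer):
    """p(y|x) = sum over documents of p(z|x) * p(y|x,z)."""
    return sum(p_z * p_y for p_z, p_y in zip(doc_priors, gen_probs[answer]))

# Using only the top-1 document picks the answer that document prefers.
top1_choice = max(gen_probs, key=lambda a: gen_probs[a][0])
# Marginalizing lets the lower-ranked documents outvote it.
marginal_choice = max(gen_probs, key=marginal)

print(top1_choice, marginal_choice)  # 1887 1889
```

Here the marginal probabilities are 0.27 for "1887" and 0.6475 for "1889", so the two lower-ranked documents together outweigh the top-ranked one.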

Tensor shapes through the pipeline

python
# RAG forward pass with tensor shapes (per-token marginalization).
# tokenizer, question_encoder, faiss_index, passage_db and bart are placeholder objects.
import torch

query = "What year was the Eiffel Tower completed?"
x = tokenizer(query)
# x.input_ids: [1, seq_len]  (batch=1, sequence length)

# Step 1: Encode query
q = question_encoder(x.input_ids)     # [1, 768]

# Step 2: Retrieve top-k documents
scores, doc_ids = faiss_index.search(q, 5)
# scores: [1, 5]   doc_ids: [1, 5]
doc_texts = [passage_db[i] for i in doc_ids[0]]

# Step 3: Concatenate query with each document
inputs = [query + " [SEP] " + doc for doc in doc_texts]
enc_ids = tokenizer(inputs, padding=True)
# enc_ids.input_ids: [5, max_input_len]  (5 query+doc inputs, padded)

# Step 4: Run BART encoder + decoder for each document
gen_scores = bart.generate(enc_ids, return_scores=True)   # illustrative call
# gen_scores: [5, output_len, vocab_size] (unnormalized logits)
gen_probs = torch.softmax(gen_scores, dim=-1)             # per-token distributions

# Step 5: Marginalize over documents
doc_priors = torch.softmax(torch.as_tensor(scores), dim=-1)   # [1, 5] retrieval weights
final_probs = torch.einsum("bk,ktv->btv", doc_priors, gen_probs)
# final_probs: [1, output_len, vocab_size], weighted sum over the 5 docs

RAG Pipeline Visualizer

Watch data flow through the RAG pipeline. Click "Step" to advance one stage at a time, or "Run All" to see the full pipeline. Each retrieved document contributes to the final answer weighted by its relevance score.

In RAG, why are retrieved documents treated as latent variables and marginalized out, rather than just using the single most relevant document?

Chapter 3: RAG-Token vs RAG-Sequence

The formula we saw in Chapter 2 — marginalizing over documents — has a subtle but critical ambiguity. When do we marginalize? This choice creates two distinct variants of RAG, each with different properties.

RAG-Sequence: marginalize per complete sequence

In RAG-Sequence, we generate the entire output sequence conditioned on each document separately, then marginalize over documents once at the end:

$$p_{\text{RAG-Seq}}(y \mid x) \approx \sum_{z \in \text{top-}k} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})$$

In plain English: for each document z, generate the complete answer using only that document. Then weight each complete answer by how relevant z was. This means each answer candidate is "faithful" to a single document — the model never mixes information across documents within a single generation.

RAG-Token: marginalize per token

In RAG-Token, we marginalize at each token position separately. Every token in the output can attend to a different document:

$$p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$$

Notice the difference: in RAG-Sequence, the sum over z is outside the product over tokens. In RAG-Token, the sum over z is inside the product. This means each output token can draw from a different document.

When does the difference matter? For short, factoid answers ("1889", "Ouagadougou"), both variants produce similar results because the answer comes from a single fact in a single document. The difference emerges for longer, compositional answers where different parts of the answer might come from different documents. RAG-Token can say "The Eiffel Tower [from doc 1] was designed by Gustave Eiffel [from doc 3] and completed in 1889 [from doc 2]." RAG-Sequence cannot: each candidate answer draws from only one document.

| Property | RAG-Sequence | RAG-Token |
| --- | --- | --- |
| Marginalization | Over full sequences | Per token |
| Cross-doc info | No: each answer uses one doc | Yes: each token can use a different doc |
| Best for | Short factoid answers (QA) | Longer generated text (fact verification, Jeopardy) |
| Decoding | Beam search per doc, then rerank | Standard beam search on marginalized probs |
| Computation | k separate beam searches | One beam search, k forward passes per step |
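The two formulas can be compared numerically on a toy case. The sketch below scores one fixed two-token output under two documents; the priors and per-token probabilities are invented for illustration, and the per-token values are treated as already conditioned on the correct prefix.

```python
import math

def rag_sequence(priors, token_probs):
    """Sum over documents of p(z|x) times the product of that doc's token probs."""
    return sum(p * math.prod(probs) for p, probs in zip(priors, token_probs))

def rag_token(priors, token_probs):
    """Product over token positions of the document-weighted token probability."""
    n_tokens = len(token_probs[0])
    return math.prod(
        sum(p * probs[i] for p, probs in zip(priors, token_probs))
        for i in range(n_tokens)
    )

priors = [0.5, 0.5]
# token_probs[z][i] = p(y_i | x, z, y_1:i-1) for the target sequence.
# Document 0 supports only the first token, document 1 only the second.
token_probs = [[0.9, 0.1], [0.1, 0.9]]

seq_score = rag_sequence(priors, token_probs)
tok_score = rag_token(priors, token_probs)
print(round(seq_score, 4), round(tok_score, 4))  # 0.09 0.25
```

No single document explains both tokens, so RAG-Sequence assigns the output a low probability, while RAG-Token can take each token from the document that supports it.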

Decoding details

For RAG-Token, decoding is straightforward. At each step, compute the per-token probability from each document, take the weighted sum, and feed it into standard beam search.

For RAG-Sequence, it's trickier. We run beam search separately for each of the k documents, generating k sets of candidate sequences. Then we need to score each unique candidate across all documents using the full marginalization formula. Some candidates might only appear in one document's beam, in which case we need to run an additional forward pass with the other documents to get their probabilities (or approximate them as zero).

python
# RAG-Sequence decoding (simplified; bart.beam_search is a placeholder helper)
from math import exp

# retrieval_probs: p(z|x) for each retrieved doc (retrieval scores after softmax)
candidates = {}
for z_i, p_z in zip(top_k_docs, retrieval_probs):
    # Run beam search conditioned on this single document
    beams = bart.beam_search(query, z_i, num_beams=5)
    for seq, log_prob in beams:
        # seq must be hashable (e.g. a string or a tuple of token IDs)
        candidates.setdefault(seq, []).append((p_z, log_prob))

# Marginalize: for each unique candidate, sum p(z|x) * p(y|x,z).
# A candidate missing from a document's beam contributes zero for that document.
final_scores = {}
for seq, contributions in candidates.items():
    final_scores[seq] = sum(p_z * exp(lp) for p_z, lp in contributions)

best_answer = max(final_scores, key=final_scores.get)

RAG-Token vs RAG-Sequence

Toggle between the two variants to see how they combine information from multiple documents. In RAG-Sequence, each answer is generated from a single document and then ranked. In RAG-Token, each output token can draw from any document, allowing cross-document synthesis.

What is the key difference between RAG-Token and RAG-Sequence?

Chapter 4: Training RAG

Training RAG is tricky because we have two components — a retriever and a generator — and they need to learn together. The retriever needs to learn what documents are useful for the generator, and the generator needs to learn how to use retrieved documents to produce correct answers.

What gets trained

RAG has three sets of parameters:

| Component | Parameters | Trained? | Why? |
| --- | --- | --- | --- |
| Question encoder ($\text{BERT}_Q$) | $\eta$ | Yes | Needs to learn to produce queries that retrieve useful documents for BART. |
| BART generator | $\theta$ | Yes | Needs to learn to read retrieved documents and extract/generate answers. |
| Passage encoder ($\text{BERT}_P$) | (not updated) | No (frozen) | Re-encoding 21M passages at every training step is prohibitively expensive. Kept frozen from DPR pre-training. |

Why freeze the passage encoder? If we updated $\text{BERT}_P$ during training, every gradient step would invalidate all 21 million pre-computed passage embeddings, and we would need to re-encode the whole of Wikipedia and rebuild the index each time. That is prohibitively expensive during training. So we freeze $\text{BERT}_P$ and only update $\text{BERT}_Q$. The question encoder learns to produce queries that work well with the fixed passage embeddings. This is an engineering compromise, not a theoretical one: the passage representations stay at their DPR-trained values.
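A back-of-the-envelope calculation gives a feel for the scale involved. The passage count and dimension come from the text; float32 storage and the encoding throughput are assumptions made only for this estimate.

```python
num_passages = 21_000_000   # from the text
dim = 768                   # from the text
bytes_per_float = 4         # assumption: embeddings stored as float32

# Raw size of the uncompressed embedding matrix.
index_bytes = num_passages * dim * bytes_per_float
print(f"Uncompressed embeddings: {index_bytes / 1e9:.1f} GB")  # 64.5 GB

# Hypothetical throughput, NOT a measured number: passages encoded per second.
passages_per_second = 1_000
hours_per_full_pass = num_passages / passages_per_second / 3600
print(f"One full re-encode at that rate: {hours_per_full_pass:.1f} hours")  # 5.8 hours
```

Even under an optimistic throughput assumption, repeating a full pass after every gradient step is out of reach, which is the motivation for freezing the passage encoder.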

The training objective

We train to maximize the marginal log-likelihood of the correct answer:

$$L = -\sum_{(x,y) \in D} \log p(y \mid x) = -\sum_{(x,y) \in D} \log \sum_{z \in \text{top-}k} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$

Where D is the training set of (question, answer) pairs. We don't have supervision for which documents should be retrieved — the documents are latent. The model must learn which documents are useful entirely from the end-to-end signal of producing correct answers.
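In practice this objective is usually evaluated in log space, because the probability of a long answer is tiny. The sketch below computes the per-example loss from raw retrieval scores and generator log-probabilities; the function names and the numbers are assumptions for illustration.

```python
import math

def log_softmax(scores):
    """Stable log of softmax(scores)."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def marginal_nll(retriever_scores, gen_log_probs):
    """-log sum_z p(z|x) * p(y|x,z), computed with log-sum-exp."""
    terms = [
        log_prior + log_gen
        for log_prior, log_gen in zip(log_softmax(retriever_scores), gen_log_probs)
    ]
    m = max(terms)
    return -(m + math.log(sum(math.exp(t - m) for t in terms)))

# Hypothetical values: three documents, a long answer with very small probability.
nll = marginal_nll([2.0, 1.0, 0.5], [-800.0, -805.0, -790.0])
print(math.isfinite(nll))        # True
print(math.exp(-800.0) == 0.0)   # True: the naive route underflows to zero
```

Exponentiating the generator log-probabilities first would give zero for every document and an undefined logarithm, so the log-sum-exp form is the safer way to implement the same formula.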

Gradient flow

How do gradients flow through retrieval? The retriever scores $p_\eta(z \mid x)$ are differentiable with respect to $\eta$ (the question encoder weights). The FAISS lookup itself is non-differentiable (it's a discrete nearest-neighbor search), but we approximate the gradient by fixing the set of retrieved documents during each forward pass and only backpropagating through the scores.

python
# RAG training loop (simplified; concat, bart.log_prob and passage_texts are placeholder helpers)
import torch

for x, y in dataloader:
    optimizer.zero_grad()

    # 1. Encode question (DIFFERENTIABLE w.r.t. eta)
    q = question_encoder(x)                # [B, 768]

    # 2. Retrieve top-k documents (NOT differentiable)
    _, doc_ids = faiss_index.search(q.detach().cpu().numpy(), 5)   # doc_ids: [B, 5]

    # 3. Re-score with differentiable dot product
    doc_embeds = passage_embeds[torch.as_tensor(doc_ids)]     # [B, 5, 768], frozen
    retriever_scores = torch.einsum("bd,bkd->bk", q, doc_embeds)
    # [B, 5]: these ARE differentiable w.r.t. q (and thus eta)

    log_priors = torch.log_softmax(retriever_scores, dim=-1)  # [B, 5]

    # 4. Score the gold answer with BART for each document
    gen_log_probs = []
    for i in range(5):
        docs_i = [passage_texts[j] for j in doc_ids[:, i]]    # i-th doc of each example
        bart_input = concat(x, docs_i)
        lp = bart.log_prob(bart_input, y)  # [B]: log p(y | x, z_i)
        gen_log_probs.append(lp)
    gen_log_probs = torch.stack(gen_log_probs, dim=1)         # [B, 5]

    # 5. Marginalize in log space (exp of long-sequence log-probs would underflow)
    marginal_log_prob = torch.logsumexp(log_priors + gen_log_probs, dim=1)  # [B]
    loss = -marginal_log_prob.mean()

    loss.backward()   # gradients flow to BOTH eta and theta
    optimizer.step()

The retriever learns without retrieval labels. This is the most elegant aspect of RAG training. Nobody tells the model which documents to retrieve. The only supervision is (question, answer) pairs. If retrieving document $z_3$ leads to a higher probability of the correct answer, then $p_\eta(z_3 \mid x)$ gets a positive gradient. The retriever learns to find useful documents purely from the end-task signal. This is a form of latent variable learning: the documents are hidden variables that are never directly observed.

The REINFORCE-like gradient

Let's unpack the gradient math one more step. The loss for a single example is:

$$L = -\log \sum_z p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$

For the generator parameters $\theta$, the gradient is straightforward: it flows through $p_\theta(y \mid x, z)$ for each document. For the retriever parameters $\eta$, the gradient has an interesting structure:

$$\nabla_\eta L = -\sum_z w(z)\, \nabla_\eta \log p_\eta(z \mid x)$$

Where $w(z) = \frac{p_\eta(z \mid x)\, p_\theta(y \mid x, z)}{\sum_{z'} p_\eta(z' \mid x)\, p_\theta(y \mid x, z')}$ is the posterior weight of document z: its share of the marginal likelihood of the correct answer. This looks like a REINFORCE gradient, in which the retriever gets a "reward signal" for retrieving documents that led to correct answers. Documents that helped the generator are reinforced; irrelevant documents are downweighted.

Think of it as a credit assignment problem. When the generator produces the correct answer "1889", which of the 5 retrieved documents deserves credit? The gradient automatically assigns credit: documents where $p_\theta(y \mid x, z)$ was high (the answer was likely given that document) get a stronger positive gradient. Documents where the answer was unlikely get weaker signals. The retriever learns to find the documents that the generator finds most useful.
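The credit assignment can be checked numerically. The sketch below assumes the retriever prior is a softmax over raw retrieval scores and takes w(z) to be the posterior weight p(z|x)·p(y|x,z) divided by the marginal; under those assumptions the derivative of the loss with respect to score j works out to p(z_j|x) − w(z_j). All numbers are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss(scores, gen_probs):
    """-log sum_z softmax(scores)_z * p(y|x,z)."""
    return -math.log(sum(p * g for p, g in zip(softmax(scores), gen_probs)))

def analytic_grad(scores, gen_probs):
    """dL/dscore_j = prior_j - posterior_j."""
    priors = softmax(scores)
    joint = [p * g for p, g in zip(priors, gen_probs)]
    total = sum(joint)
    return [p - j / total for p, j in zip(priors, joint)]

def numeric_grad(scores, gen_probs, eps=1e-6):
    """Central finite differences, used only to check the analytic form."""
    grads = []
    for j in range(len(scores)):
        up, down = scores[:], scores[:]
        up[j] += eps
        down[j] -= eps
        grads.append((loss(up, gen_probs) - loss(down, gen_probs)) / (2 * eps))
    return grads

scores = [1.0, 0.5, 0.0]        # hypothetical retrieval scores
gen_probs = [0.05, 0.80, 0.10]  # hypothetical p(y|x,z); document 1 explains the answer
grad = analytic_grad(scores, gen_probs)
check = numeric_grad(scores, gen_probs)
print([round(g, 3) for g in grad])
```

A negative entry means gradient descent raises that document's score. Only the document that explains the answer gets a negative gradient here, while the top-ranked but unhelpful document is pushed down.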

Training hyperparameters

| Setting | Value |
| --- | --- |
| Pre-trained retriever | DPR (trained on Natural Questions) |
| Pre-trained generator | BART-large (400M params) |
| Document index | Wikipedia dump (Dec 2018), 21M 100-word passages |
| Top-k documents | k = 5 or k = 10 |
| Learning rate | 1e-5 (both retriever and generator) |
| Optimizer | Adam |
| Batch size | Not specified (limited by GPU memory due to k forward passes) |
| Training data | Task-specific (question, answer) pairs only, no retrieval labels |

A critical detail: the FAISS index is not updated during training. This means the set of retrievable documents stays fixed at whatever was indexed from the DPR passage encoder. The question encoder shifts to produce better queries for the fixed index, but the index itself doesn't improve. REALM (a concurrent paper by Guu et al.) tackled this by periodically re-encoding the entire index — asynchronously, in the background — but this added significant engineering complexity.

Training Gradient Flow

Watch how gradients flow through RAG during training. The correct answer signal flows back through BART (generator gradient) and through the retrieval scores (retriever gradient). The passage encoder stays frozen. Click "Train Step" to see one step.

Why is the passage encoder ($\text{BERT}_P$) frozen during RAG training while the question encoder ($\text{BERT}_Q$) is trained?

Chapter 5: Open-Domain QA Results

The headline evaluation for RAG is open-domain question answering — answering factual questions using only a corpus of documents (Wikipedia) as the knowledge source, with no task-specific architecture or pre-defined answer set.

Lewis et al. evaluated on four major QA benchmarks:

| Benchmark | Previous SOTA | RAG-Token | RAG-Sequence | Improvement |
| --- | --- | --- | --- | --- |
| Natural Questions | 44.5 (DPR) | 44.5 | 44.5 | Matches SOTA |
| TriviaQA | 57.9 (DPR) | 56.8 | 55.9 | Near SOTA |
| CuratedTrec | 46.0 (DPR) | 50.0 | 52.2 | +6.2 pts |
| WebQuestions | 42.4 (DPR) | 45.5 | 45.2 | +3.1 pts |

RAG matches or beats DPR's extractive reader on three of the four benchmarks and trails it slightly on TriviaQA, and it does so generatively. DPR extracts a span from a retrieved passage (pointing to exact text). RAG generates the answer token by token. This matters because generative models can synthesize information, rephrase, and handle questions where the answer isn't a literal span in any document.

Beyond QA: knowledge-intensive generation

RAG's real advantage shows on tasks that require generating text, not just extracting spans:

Jeopardy Question Generation: Given an entity (e.g., "Eiffel Tower"), generate a Jeopardy-style clue. RAG produced more factual, specific, and diverse clues than BART alone. Human evaluators preferred RAG's outputs on factuality and specificity.

FEVER Fact Verification: Given a claim, classify it as SUPPORTED, REFUTED, or NOT ENOUGH INFO using retrieved evidence. RAG achieved 72.5% accuracy on this 3-way task, within 4.3 points of state-of-the-art pipeline systems with much more complex architectures.

MS-MARCO (abstractive QA): Generate free-form answers to real Bing search queries. RAG produced more specific and factual answers than BART, though this benchmark wasn't the primary focus.

Why RAG outperforms "retrieve then read"

The standard pipeline approach before RAG was: (1) retrieve relevant passages, (2) feed them to a reader model, (3) extract an answer span. RAG improves on this in three ways:

End-to-end training: Retriever and generator are trained jointly. The retriever learns what the generator needs, not just what looks relevant in isolation.

Generative answers: Can synthesize, rephrase, and combine information. Not limited to extracting literal spans from passages.

Marginalization over documents: Robust to retrieval errors. If the best passage is ranked third, it still contributes to the answer through marginalization.

Benchmark Performance Comparison

Compare RAG against baselines across benchmarks. Use the buttons to switch benchmarks and see how RAG-Token and RAG-Sequence compare to extractive (DPR) and closed-book (T5) approaches.

What is a key advantage of RAG's generative approach over extractive QA systems like DPR?

Chapter 6: Knowledge Updates

Here is perhaps the most practically important property of RAG: you can update what the model "knows" without retraining it. Just swap the document index.

Consider a traditional language model trained in January 2020. By March 2020, COVID-19 has changed the world, but the model still thinks it's a minor outbreak in Wuhan. To update the model, you'd need to retrain it on new data — an expensive process requiring massive compute.

With RAG, you simply update the Wikipedia dump in the FAISS index. The model's weights stay the same. The next time someone asks about COVID-19, the retriever finds the new passages, and the generator produces an up-to-date answer.

Hot-swappable knowledge. This is the fundamental advantage of separating knowledge (document index) from reasoning (model weights). The document index is just a data structure — you can add new documents, remove outdated ones, or replace the entire index. The model doesn't need to be retrained, fine-tuned, or even restarted. This makes RAG uniquely suited for domains where facts change rapidly: news, medicine, law, company policies.

What updating the index looks like

python
# Updating RAG's knowledge — no model retraining needed

# 1. Get new documents
new_articles = fetch_latest_wikipedia_dump()  # or any corpus

# 2. Chunk into 100-word passages
new_passages = chunk_documents(new_articles, chunk_size=100)

# 3. Encode with frozen passage encoder (BERT_P)
new_embeddings = passage_encoder.encode_batch(new_passages)
# shape: [num_new_passages, 768]

# 4. Replace FAISS index
new_index = faiss.IndexFlatIP(768)
new_index.add(new_embeddings)

# 5. Swap in the new index — zero downtime
rag_model.retriever.index = new_index

# Done! Model now answers with updated knowledge.
# No gradient updates. No retraining. No GPU hours.
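The chunking step in the snippet above relies on an undefined helper. One plausible version is sketched here as a plain word-based splitter; the function name `chunk_words` and the optional `overlap` parameter are assumptions for this sketch, not the paper's preprocessing code.

```python
def chunk_words(text, chunk_size=100, overlap=0):
    """Split text into passages of at most chunk_size words.

    Consecutive passages share `overlap` words; overlap=0 gives the
    disjoint fixed-size chunks described in the text.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

article = " ".join(f"w{i}" for i in range(250))  # synthetic 250-word article
plain = chunk_words(article)
overlapped = chunk_words(article, chunk_size=100, overlap=20)
print([len(c.split()) for c in plain])       # [100, 100, 50]
print([len(c.split()) for c in overlapped])  # [100, 100, 90]
```

Each returned string would then be passed to the frozen passage encoder, so the chunk size used at update time should match the one used when the encoder was trained.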

Limitations of index swapping

Knowledge update via index swapping isn't a silver bullet. There are real limitations:

| Limitation | Why It Matters |
| --- | --- |
| Encoder drift | The question encoder was trained with the original index. New documents encoded by the frozen passage encoder might not align perfectly with what the question encoder expects. Performance can degrade over time. |
| No unlearning | Removing a document from the index doesn't make the generator forget what it learned from similar documents during training. The model weights still contain latent knowledge from training. |
| Passage encoder quality | The frozen $\text{BERT}_P$ determines how well new documents are embedded. If new documents have a very different style than the original training corpus, embedding quality may suffer. |
| Re-encoding cost | For a full Wikipedia swap (~21M passages), re-encoding takes hours on GPUs even though it's a one-time cost. Incremental updates (adding/removing individual passages) are cheaper. |

Parametric vs non-parametric knowledge

RAG formalized a distinction that has become fundamental to how we think about LLM systems:

| Property | Parametric Knowledge (model weights) | Non-Parametric Knowledge (document index) |
| --- | --- | --- |
| Location | Encoded in 400M BART parameters | Stored in 21M passage embeddings + text |
| Update cost | Full retraining (days, $$$$) | Re-encode new documents (hours, $) |
| Provenance | Opaque: can't trace which training example taught a fact | Transparent: can cite exact retrieved passage |
| Capacity | Fixed by model size: 400M params store limited facts | Unlimited: add more documents to the index |
| Strengths | Language understanding, reasoning, generation fluency | Factual accuracy, recency, verifiability |

This separation is the single most important conceptual contribution of the RAG paper. The specific architecture (DPR + BART) matters less than the pattern: let the model handle language, let the index handle knowledge.

RAG as a design pattern. The 2020 RAG paper was among the first to formalize the retrieve-then-generate paradigm end-to-end for general-purpose sequence generation. But the core insight, separating parametric knowledge (model weights) from non-parametric knowledge (document index), has become perhaps the most widely adopted pattern in production LLM systems. Many major LLM-powered applications today use some form of RAG: ChatGPT with browsing, Bing Chat, Perplexity, enterprise knowledge bases. The 2020 paper planted the seed; the pattern grew into an industry standard.

Knowledge Update Simulator

Simulate updating RAG's knowledge index. The timeline shows the original index (2018) and lets you add newer documents. Ask questions and see how answers change as the index is updated — all without retraining.

How does RAG allow knowledge updates without retraining the model?

Chapter 7: Connections

RAG sits at a critical junction in the history of NLP — it formalized the retrieve-then-generate pattern that became the backbone of modern LLM applications. Let's map where it came from and where it led.

Lineage

| Model / Paper | Year | Relationship to RAG |
| --- | --- | --- |
| DrQA (Chen et al.) | 2017 | First large-scale open-domain QA with retrieval + reader. Used TF-IDF (sparse), not dense retrieval. RAG's direct ancestor. |
| ORQA (Lee et al.) | 2019 | First end-to-end trained retriever + reader. Used Inverse Cloze Task for pre-training. Showed joint training beats pipeline. |
| REALM (Guu et al.) | 2020 | Concurrent with RAG. Also marginalizes over retrieved docs. Key difference: REALM periodically re-encodes the entire index during training (asynchronous index refresh). RAG freezes the passage encoder instead. |
| DPR (Karpukhin et al.) | 2020 | RAG's retriever. Showed that simple dense embeddings beat BM25 for open-domain QA. RAG inherits DPR's bi-encoder architecture. |
| BART (Lewis et al.) | 2020 | RAG's generator. Denoising seq2seq pre-training. BART's first author, Mike Lewis, is a co-author of RAG; RAG's first author is Patrick Lewis. |
| FiD (Izacard & Grave) | 2021 | Fusion-in-Decoder: encode each document independently, concatenate in decoder. Simpler than marginalization, often stronger. Became the dominant approach post-RAG. |
| RETRO (Borgeaud et al.) | 2022 | Retrieval-enhanced Transformer: retrieval baked into the architecture with chunked cross-attention. Scales retrieval to 2T tokens. |
| ChatGPT + Retrieval | 2023+ | Production RAG: GPT-4 with web browsing, Bing Chat, Perplexity. RAG's conceptual descendants, though architecturally different (long-context models with retrieval as tool use). |

What RAG got right

Separation of concerns. The most lasting contribution: parametric knowledge (model weights) vs non-parametric knowledge (document index) should be separate systems. This insight survived every subsequent architectural change.

End-to-end training. Training the retriever and generator jointly, with the retriever learning from the end-task signal without retrieval labels. This is cleaner than pipeline approaches where the retriever is trained on a proxy task.

Two marginalization variants. RAG-Token and RAG-Sequence provided a principled framework for how to combine information from multiple documents. This distinction influenced subsequent work on multi-document reasoning.

What changed since RAG

Long-context models. GPT-4 with 128K context, Gemini with 1M+ tokens. When you can paste 50 documents into the context, the marginalization math becomes less important — you just concatenate everything and let attention handle it.

Retrieval as tool use. Modern systems treat retrieval as a tool the model can choose to invoke, not a mandatory step. The model decides when it needs external knowledge.

Chunking strategies. RAG used fixed 100-word chunks. Modern systems use semantic chunking, recursive splitting, and hierarchical retrieval. Chunk quality matters enormously.

Re-ranking. Modern RAG pipelines add a cross-encoder re-ranker between retrieval and generation: retrieve 100 candidates with a fast bi-encoder, re-rank to top-10 with a slower but more accurate cross-encoder.
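The two-stage pattern can be sketched without any model at all. In the example below both scoring functions are toy stand-ins (word overlap for the fast bi-encoder, overlap density for the slower cross-encoder); they are assumptions chosen to keep the sketch self-contained, and only the control flow reflects the pattern described above.

```python
def word_overlap(query, passage):
    """Toy 'fast' scorer: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def overlap_density(query, passage):
    """Toy 'slow' scorer: shared words divided by passage length."""
    return word_overlap(query, passage) / max(len(passage.split()), 1)

def retrieve_then_rerank(query, passages, fast_score, slow_score,
                         n_candidates=100, top_k=10):
    """Stage 1 keeps n_candidates by the cheap score; stage 2 re-orders them."""
    candidates = sorted(
        passages, key=lambda p: fast_score(query, p), reverse=True
    )[:n_candidates]
    return sorted(
        candidates, key=lambda p: slow_score(query, p), reverse=True
    )[:top_k]

passages = [
    "the eiffel tower is in paris",
    "the tower of london is old",
    "eiffel tower was completed in 1889",
    "paris is the capital of france",
]
results = retrieve_then_rerank(
    "eiffel tower completed", passages,
    word_overlap, overlap_density, n_candidates=3, top_k=2,
)
print(results[0])  # eiffel tower was completed in 1889
```

The expensive scorer only ever sees `n_candidates` passages, which is what makes a slow but accurate re-ranker affordable.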

RAG in 2020 vs RAG in 2025: the paper's specific architecture (DPR + BART + marginalization) is rarely used directly today. But the pattern — retrieve relevant context, condition generation on it — is everywhere. The 2020 paper gave the pattern a name, a mathematical framework, and empirical validation. That's why it matters.

The production RAG stack (2024)

For context, here's what a modern production RAG system looks like compared to the 2020 paper:

Ingest & Chunk: 2020 RAG: fixed 100-word chunks. Modern: semantic chunking with overlap, recursive splitting, hierarchical chunks, parent-child relationships.

Embed & Index: 2020: DPR bi-encoder + FAISS. Modern: instruction-tuned embedders (E5, BGE), hybrid sparse+dense, metadata filtering.

Retrieve & Re-rank: 2020: top-k from FAISS. Modern: retrieve 100 candidates, cross-encoder re-rank to top-10, diversity filtering.

Generate: 2020: BART-large (400M). Modern: much larger LLMs such as GPT-4, Claude, and Gemini. Long context (128K+) means more docs fit.

RAG Evolution Timeline

Explore the evolution of retrieval-augmented models. Drag the slider through time to see how the pattern evolved from sparse retrieval (2017) through RAG (2020) to modern production systems (2024+).

What is the most lasting conceptual contribution of the RAG paper?