RAG (Lewis 2020)

Chapter 0: The Knowledge Problem

Imagine you're building a question-answering system. Someone asks: "What is the capital of Burkina Faso?" Your language model was trained on internet text and it might know the answer — Ouagadougou — because that fact appeared in its training data. But what if someone asks: "Who won the 2024 Nobel Prize in Physics?" Your model was trained in 2020. It literally cannot know.

This is the knowledge problem in language models. Every fact the model "knows" must be baked into its parameters during training. This creates three fundamental failures:

Failure Mode	Why It Happens	Example
Staleness	Training data has a cutoff date. The world changes; the model doesn't.	"Who is the current president of X?" is wrong after an election.
Hallucination	The model generates fluent but factually incorrect text when it doesn't know the answer.	Confidently inventing a fake paper citation with plausible-sounding authors.
Opacity	You can't trace which training document a fact came from. No provenance, no citations.	Model says "The drug interacts with X" but you can't verify the source.

The standard alternative for knowledge-intensive tasks was to make the model bigger. More parameters mean more capacity to memorize facts. GPT-3, released within days of the RAG paper, demonstrated this: at 175 billion parameters, it could answer trivia questions with impressive accuracy because the answers were effectively stored in its weights.

But this approach is absurdly inefficient. To store one more fact, you need to retrain a 175B-parameter model. To update a stale fact, you need to retrain again. The knowledge is entangled with the model's language abilities in ways we can't control or inspect.

The core insight of RAG: Separate knowledge from reasoning. Let the language model focus on understanding and generating language. For factual knowledge, give it access to a searchable document store that can be updated at any time without retraining. At inference time, retrieve relevant documents, paste them into the context, and let the model generate an answer conditioned on those documents. The model becomes a reader, not a memorizer.

Think of it like the difference between a closed-book exam and an open-book exam. In a closed-book exam (standard LM), you must memorize every possible fact. In an open-book exam (RAG), you just need to know how to find the right page and read it.

[Interactive demo: Closed-Book vs Open-Book QA. The closed-book model (left) must rely on memorized parameters; the open-book RAG model (right) retrieves relevant passages first.]

What is the fundamental problem RAG aims to solve?

A. Language models store all knowledge in their parameters, making it stale, unverifiable, and expensive to update. RAG separates knowledge storage (document index) from language generation (model) so facts can be retrieved at inference time
B. Language models are too slow at inference time
C. Language models can't generate text longer than their context window

Chapter 1: Dense Passage Retrieval

Before we can build RAG, we need a way to find relevant documents. Given a question like "What causes aurora borealis?", we need to search through millions of Wikipedia passages and return the handful that contain the answer. This is the job of the retriever.

Traditional search engines use sparse retrieval — methods like TF-IDF or BM25 that match exact keywords. If your question says "aurora" and a passage says "aurora," it's a match. This works well for simple queries but fails badly when the question and answer use different words. "What causes the Northern Lights?" should match a passage about "solar wind interacting with Earth's magnetosphere" — but there are no shared keywords.
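
The vocabulary-mismatch failure is easy to reproduce. The sketch below is a minimal BM25 scorer written from scratch (Okapi form, with k1 = 1.5 and b = 0.75 as illustrative defaults). The three-passage corpus, the stopword list, and the tokenizer are assumptions made for this example and are not taken from any retrieval library.

```python
import math
from collections import Counter

# Illustrative stopword list (an assumption for this toy example).
STOPWORDS = {"the", "a", "an", "of", "is", "in", "at", "with", "what", "why", "do"}

def tokenize(text):
    # Lowercase, strip simple punctuation, drop stopwords.
    words = (w.strip(".,?!\"'").lower() for w in text.split())
    return [w for w in words if w and w not in STOPWORDS]

def bm25_scores(query, passages, k1=1.5, b=0.75):
    docs = [tokenize(p) for p in passages]
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of passages containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no keyword overlap, no contribution
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

passages = [
    "Solar wind particles interacting with Earth's magnetosphere produce glowing curtains in the polar sky.",
    "The aurora borealis is visible at high latitudes during winter.",
    "Ouagadougou is the capital of Burkina Faso.",
]

keyword_scores = bm25_scores("What causes aurora borealis?", passages)
paraphrase_scores = bm25_scores("What causes the Northern Lights?", passages)
```

On this toy corpus, the keyword query scores only the passage that literally contains "aurora" and "borealis", while the paraphrased query scores zero on every passage, including the one that actually explains the cause.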

RAG uses Dense Passage Retrieval (DPR), published by Karpukhin et al. only weeks before the RAG paper. Instead of matching keywords, DPR encodes both the question and every passage into dense vector embeddings, then finds passages whose vectors are closest to the question vector.

How DPR works

DPR uses two separate BERT encoders:

Question Encoder: BERT_Q

Takes a question string, outputs a 768-dim vector. Input: "What causes aurora borealis?" → Output: q ∈ R⁷⁶⁸

↓

Passage Encoder: BERT_P

Takes a passage string, outputs a 768-dim vector. Input: "Solar wind particles..." → Output: p ∈ R⁷⁶⁸

↓

Similarity: dot product

score(q, p) = q^Tp. Higher dot product = more relevant passage. Retrieve top-k by score.
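
The scoring step can be sketched with plain NumPy. This is an exact, brute-force version of "retrieve top-k by dot product"; the random vectors stand in for real BERT_Q and BERT_P outputs, and the function name is hypothetical.

```python
import numpy as np

def top_k_by_inner_product(q, passages, k):
    """Return (scores, indices) of the k passages with the largest q . p."""
    scores = passages @ q                      # [num_passages]
    top = np.argpartition(-scores, k - 1)[:k]  # unordered top-k in O(n)
    top = top[np.argsort(-scores[top])]        # sort those k by score, descending
    return scores[top], top

rng = np.random.default_rng(0)
dim = 768
# Stand-in for pre-computed passage embeddings.
passage_vecs = rng.standard_normal((1000, dim)).astype(np.float32)
# Build a query that is deliberately close to passage 42.
q_vec = passage_vecs[42] + 0.1 * rng.standard_normal(dim).astype(np.float32)

top_scores, top_ids = top_k_by_inner_product(q_vec, passage_vecs, k=5)
```

Because the query was constructed near passage 42, that passage comes back first. A real system replaces the full matrix product with an index lookup, but the ranking criterion is the same.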

The two encoders are trained on pairs of (question, relevant passage) using a contrastive loss: push the question vector close to the relevant passage vector and far from irrelevant passage vectors.

L = −log [ e^{q^T p⁺} / ( e^{q^T p⁺} + ∑_i e^{q^T p_i⁻} ) ]

Where p⁺ is the correct passage and p_i⁻ are negative passages (passages that do not answer the question). This is a softmax-style contrastive loss from the InfoNCE family, closely related to the losses used in CLIP and SimCLR. It says: make the correct (question, passage) pair have a higher dot product than any incorrect pair.
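
A small numeric sketch of this loss may help. The two-dimensional vectors below are made up for illustration; the function computes the negative log-softmax of the positive passage's score among all candidates, which is the formula above.

```python
import numpy as np

def dpr_contrastive_loss(q, p_pos, p_negs):
    """-log softmax probability of the positive passage among all candidates."""
    logits = np.concatenate(([q @ p_pos], p_negs @ q))  # positive first
    logits = logits - logits.max()                      # stabilise the exponentials
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

q = np.array([1.0, 0.0])
p_pos = np.array([2.0, 0.0])            # well aligned with q
p_negs = np.array([[0.0, 1.0],          # orthogonal to q
                   [-1.0, 0.0]])        # opposite to q

loss_good = dpr_contrastive_loss(q, p_pos, p_negs)
# Swap roles: treat the orthogonal passage as if it were the positive one.
loss_bad = dpr_contrastive_loss(q, p_negs[0], np.stack([p_pos, p_negs[1]]))
```

When the positive passage already has the highest dot product, the loss is small; when a negative outscores it, the loss is large, which is the signal that pulls the encoders into alignment.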

The MIPS trick: searching millions of passages in milliseconds

At inference time, we need to find the top-k passages out of ~21 million Wikipedia passages. A brute-force search would compute 21 million dot products — too slow. DPR uses Maximum Inner Product Search (MIPS) with Facebook's FAISS library, which builds a compressed index that enables approximate nearest-neighbor search in milliseconds.

python
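# Building a DPR index with FAISS
# (encode_all_passages and question_encoder are placeholders for the DPR encoders)
import faiss
import numpy as np

# Pre-compute passage embeddings (done ONCE, offline)
passage_embeddings = encode_all_passages(wikipedia)  # shape: [21M, 768]
passage_embeddings = np.ascontiguousarray(passage_embeddings, dtype=np.float32)

# IndexFlatIP is an exact (brute-force) inner-product index: simple, but it
# scans every vector. For approximate search at this scale, swap in a graph
# index such as faiss.IndexHNSWFlat (the RAG paper uses an HNSW index).
index = faiss.IndexFlatIP(768)
index.add(passage_embeddings)              # add all 21M vectors

# At inference: encode question, search index
q_vec = np.asarray(question_encoder("What causes aurora?"), dtype=np.float32)  # [1, 768]
scores, indices = index.search(q_vec, 5)      # top-5 passages
# scores: [1, 5], similarity scores
# indices: [1, 5], passage IDs to look up text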

Pre-compute once, search forever. The key efficiency trick: all 21 million passage embeddings are computed ONCE offline and stored in the FAISS index. At inference time, only the question needs to be encoded (one BERT forward pass). The search over 21M vectors takes ~10ms with FAISS. This asymmetry — expensive offline, cheap online — is what makes dense retrieval practical.

Hard negatives matter

The quality of DPR depends heavily on how you choose negative examples during training. Random negatives (random passages from the corpus) are too easy — the model learns to distinguish by topic rather than by relevance. Hard negatives are passages that are topically related but don't answer the question. Karpukhin et al. found that using BM25-retrieved passages as hard negatives (high keyword overlap but not the answer) dramatically improved retrieval quality.

python
# DPR training with hard negatives
# (sketch: training_data, corpus, bm25, encode_q, encode_p and nce_loss
# are placeholders for your own components)
import random

for question, positive_passage in training_data:
    # Easy negative: random passage from corpus
    easy_neg = random.choice(corpus)

    # Hard negative: top BM25 hit that is NOT the positive passage
    bm25_results = bm25.search(question, k=100)
    hard_neg = next(p for p in bm25_results if p != positive_passage)

    # Contrastive loss: push q closer to pos, farther from negs
    loss = nce_loss(
        q=encode_q(question),
        pos=encode_p(positive_passage),
        negs=[encode_p(easy_neg), encode_p(hard_neg)]
    )
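
Selecting a hard negative safely takes one more check than the loop above shows: a highly ranked passage that happens to contain the answer is not a true negative. The helper below is a hypothetical sketch of that filter, using a simple case-insensitive substring test as the "contains the answer" criterion (an assumption; real pipelines may normalize text more carefully).

```python
def pick_hard_negative(ranked_passages, answer, positive_passage):
    """Return the best-ranked passage that is neither the positive nor contains the answer."""
    for passage in ranked_passages:
        if passage == positive_passage:
            continue
        if answer.lower() in passage.lower():
            continue  # it may actually answer the question, so it is not a safe negative
        return passage
    return None  # no suitable hard negative among the candidates

# Hypothetical keyword-ranked results for "What year was the Eiffel Tower completed?"
ranked = [
    "The Eiffel Tower was completed in 1889 for the World's Fair.",
    "Gustave Eiffel's company built the tower, finishing in 1889.",
    "The Eiffel Tower is repainted every seven years.",
]
hard_neg = pick_hard_negative(ranked, answer="1889", positive_passage=ranked[0])
```

Here the second passage is skipped even though it is not the labelled positive, because it contains the answer string; the third passage is topically close but unhelpful, which is what makes it a useful hard negative.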

This detail matters for RAG because the retriever quality directly determines the generator's input quality. Garbage in, garbage out — if the retriever returns irrelevant passages, even a perfect generator can't produce correct answers.

[Interactive demo: Dense vs Sparse Retrieval. Sparse retrieval (BM25, keyword matching) and dense retrieval (DPR, semantic matching) rank the same passages differently; dense retrieval finds semantically relevant passages even without keyword overlap.]

Why does RAG use dense retrieval (DPR) instead of sparse keyword matching (BM25)?

A. BM25 is too slow for large document collections
B. Dense retrieval encodes semantic meaning into vectors, so it can match questions to passages that are relevant but use different words: "Northern Lights" matches "solar wind and magnetosphere" even without shared keywords
C. Dense retrieval requires less storage than BM25

Chapter 2: The RAG Architecture

Now we have the two pieces: a retriever that finds relevant documents and a generator that produces text. RAG combines them into a single end-to-end model. The architecture is elegant: query in, answer out, with document retrieval happening in the middle.

The two components

Retriever: p_η(z | x) — Given an input query x, the retriever returns a distribution over documents z. In practice, it retrieves the top-k documents (k=5 or k=10) from a FAISS index using DPR. The parameter η refers to the BERT_Q question encoder weights (the passage encoder is frozen and pre-computed).

Generator: p_θ(y_i | x, z, y_1:i-1) — Given the input x, a retrieved document z, and previously generated tokens y_1:i-1, the generator predicts the next token y_i. This is BART — a pre-trained seq2seq model. The encoder processes [x; z] (query concatenated with retrieved document), and the decoder generates the answer autoregressively.

Input Query x

"What year was the Eiffel Tower completed?"

↓ BERT_Q encodes to q ∈ R⁷⁶⁸

MIPS Retrieval

Search FAISS index of 21M passages. Return top-k (z₁, ..., z_k) with scores p_η(z_i | x).

↓ for each z_i

BART Generator

Encode [x ; z_i], decode answer y. Each retrieved doc gives a different generation probability.

↓ marginalize over docs

Final Answer y

"1889" — produced by marginalizing over all k retrieved documents.

The marginalization step

This is the mathematical heart of RAG. We don't just pick the best document and generate from it. Instead, we treat the retrieved documents as latent variables and marginalize them out. The probability of generating output y given input x is:

p(y | x) ≈ ∑_{z ∈ top-k} p_η(z | x) · p_θ(y | x, z)

Where p_η(z | x) is the retriever's score for document z (how relevant it is to the query) and p_θ(y | x, z) is the generator's probability of producing answer y given document z.

In plain English: run the generator k times, once per retrieved document. Each run produces a probability for the answer. Weight each probability by how relevant the document was. Sum them up. Documents that are more relevant contribute more to the final answer.

Why marginalize instead of just using the top-1 document? Because the retriever isn't perfect. The most relevant passage might be ranked second or third. By marginalizing over multiple documents, RAG hedges its bets — if document z₁ contains part of the answer and z₃ contains another part, both contribute. This is fundamentally more robust than picking a single document and hoping it's right.
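
A worked example with made-up numbers shows the hedging effect. The retrieval scores and per-document answer probabilities below are hypothetical; the only technique used is the marginalization formula above, with p_η(z | x) obtained by a softmax over the top-k scores.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

retrieval_scores = np.array([82.0, 80.5, 79.0])     # hypothetical dot products for the top-3 docs
p_doc = softmax(retrieval_scores)                   # p_eta(z | x), sums to 1 over the top-k
p_answer_given_doc = np.array([0.05, 0.90, 0.60])   # hypothetical p_theta("1889" | x, z)

# Marginal probability of the answer: sum_z p(z | x) * p(y | x, z)
p_answer = float(np.sum(p_doc * p_answer_given_doc))

# What we would get by trusting only the top-ranked document.
top1_only = float(p_answer_given_doc[np.argmax(p_doc)])
```

In this setup the top-ranked document barely supports the answer (0.05), but the second and third documents do, so the marginal probability comes out at about 0.22, more than four times the top-1 value.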

Tensor shapes through the pipeline

python
# RAG forward pass: tensor shapes (pseudocode; helper names are illustrative)
x_text = "What year was the Eiffel Tower completed?"
x = tokenizer(x_text)
# x.input_ids: [1, seq_len]  (batch=1, sequence length)

# Step 1: Encode query
q = question_encoder(x.input_ids)     # [1, 768]

# Step 2: Retrieve top-k documents
scores, doc_ids = faiss_index.search(q, k=5)
# scores: [1, 5]   doc_ids: [1, 5]
doc_texts = [passage_db[i] for i in doc_ids[0]]

# Step 3: Concatenate query with each document
inputs = [x_text + " [SEP] " + doc for doc in doc_texts]
enc_ids = tokenizer(inputs, padding=True)
# enc_ids.input_ids: [5, max_input_len]: 5 (query + doc) inputs, padded

# Step 4: Run BART encoder + decoder once per document
gen_probs = bart.next_token_probs(enc_ids)   # hypothetical helper
# gen_probs: [5, output_len, vocab_size], each row a distribution over the vocab

# Step 5: Marginalize over documents (per token, as in RAG-Token)
doc_priors = softmax(scores, dim=-1)      # [1, 5]: retrieval weights, sum to 1
final_probs = einsum("d,dlv->lv", doc_priors[0], gen_probs)
# final_probs: [output_len, vocab_size]: weighted sum over the 5 docs

[Interactive demo: RAG Pipeline Visualizer. Data flows through the pipeline one stage at a time; each retrieved document contributes to the final answer weighted by its relevance score.]

In RAG, why are retrieved documents treated as latent variables and marginalized out, rather than just using the single most relevant document?

A. Because the retriever isn't perfect: by weighting and summing over multiple documents, RAG hedges against retrieval errors and allows partial information from different documents to contribute to the answer
B. Because using multiple documents is always faster than using one
C. Because BART can only process multiple inputs at once

Chapter 3: RAG-Token vs RAG-Sequence

The formula we saw in Chapter 2 — marginalizing over documents — has a subtle but critical ambiguity. When do we marginalize? This choice creates two distinct variants of RAG, each with different properties.

RAG-Sequence: marginalize per complete sequence

In RAG-Sequence, we generate the entire output sequence conditioned on each document separately, then marginalize over documents once at the end:

p_RAG-Seq(y | x) ≈ ∑_{z ∈ top-k} p_η(z | x) · ∏_{i=1}^{N} p_θ(y_i | x, z, y_{1:i-1})

In plain English: for each document z, generate the complete answer using only that document. Then weight each complete answer by how relevant z was. This means each answer candidate is "faithful" to a single document — the model never mixes information across documents within a single generation.

RAG-Token: marginalize per token

In RAG-Token, we marginalize at each token position separately. Every token in the output can attend to a different document:

p_RAG-Token(y | x) ≈ ∏_{i=1}^{N} ∑_{z ∈ top-k} p_η(z | x) · p_θ(y_i | x, z, y_{1:i-1})

Notice the difference: in RAG-Sequence, the sum over z is outside the product over tokens. In RAG-Token, the sum over z is inside the product. This means each output token can draw from a different document.

When does the difference matter? For short, factoid answers ("1889", "Ouagadougou"), both variants produce similar results because the answer comes from a single fact in a single document. The difference emerges for longer, compositional answers where different parts of the answer might come from different documents. RAG-Token can say "The Eiffel Tower [from doc 1] was designed by Gustave Eiffel [from doc 3] and completed in 1889 [from doc 2]." RAG-Sequence cannot — each candidate answer draws from only one document.
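
The sum-outside versus sum-inside distinction can be checked numerically. The sketch below uses a hypothetical two-token target and two documents, where each document supports a different token; all probabilities are invented for illustration.

```python
import numpy as np

# p_doc[z] = p_eta(z | x) for k = 2 retrieved documents
p_doc = np.array([0.5, 0.5])

# token_probs[z, i] = p_theta(y_i | x, z, y_1:i-1) for a fixed 2-token target y.
# Document 0 supports the first token, document 1 supports the second.
token_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

# RAG-Sequence: product over tokens inside, sum over documents outside.
p_rag_sequence = float(np.sum(p_doc * np.prod(token_probs, axis=1)))

# RAG-Token: sum over documents inside, product over tokens outside.
p_rag_token = float(np.prod(p_doc @ token_probs))
```

RAG-Sequence assigns this target probability 0.09, because neither document alone explains both tokens. RAG-Token assigns it 0.25, because each token can lean on the document that supports it.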

Property	RAG-Sequence	RAG-Token
Marginalization	Over full sequences	Per token
Cross-doc info	No — each answer uses one doc	Yes — each token can use a different doc
Best for	Short factoid answers (open-domain QA)	Longer generated text (Jeopardy question generation)
Decoding	Beam search per doc, then rerank	Standard beam search on marginalized probs
Computation	k separate beam searches	One beam search, k forward passes per step

Decoding details

For RAG-Token, decoding is straightforward. At each step, compute the per-token probability from each document, take the weighted sum, and feed it into standard beam search.

For RAG-Sequence, it's trickier. We run beam search separately for each of the k documents, generating k sets of candidate sequences. Then we need to score each unique candidate across all documents using the full marginalization formula. Some candidates might only appear in one document's beam. The paper offers two options: run an additional forward pass for each document whose beam did not produce the candidate to get its probability ("Thorough Decoding"), or approximate those missing probabilities as zero ("Fast Decoding").

python
# RAG-Sequence decoding (simplified; candidates missing from a document's
# beam contribute zero for that document)
import math

candidates = {}
# retrieval_probs holds p(z | x): retrieval scores after a softmax over the top-k
for z_i, p_z in zip(top_k_docs, retrieval_probs):
    # Run beam search conditioned on this single document
    beams = bart.beam_search(query, z_i, num_beams=5)
    for seq, log_prob in beams:   # seq must be hashable, e.g. a string or a tuple of token IDs
        candidates.setdefault(seq, []).append((p_z, log_prob))

# Marginalize: for each unique candidate, sum p(z|x) * p(y|x,z)
final_scores = {}
for seq, contributions in candidates.items():
    final_scores[seq] = sum(p_z * math.exp(lp) for p_z, lp in contributions)

best_answer = max(final_scores, key=final_scores.get)
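
Scoring every candidate under every document, rather than treating missing entries as zero, can change which answer wins. The toy sketch below illustrates this under stated assumptions: the probability table and the `log_prob_fn` callback are hypothetical stand-ins for extra generator forward passes.

```python
import math

def thorough_decode(candidates_per_doc, doc_priors, log_prob_fn):
    """Score every unique candidate under every document, then marginalize.

    candidates_per_doc: list (one entry per document) of candidate strings from beam search
    doc_priors:         list of p(z | x), one per document
    log_prob_fn(y, z):  returns log p(y | x, z); stands in for a generator forward pass
    """
    unique = {y for beam in candidates_per_doc for y in beam}
    scores = {
        y: sum(p_z * math.exp(log_prob_fn(y, z)) for z, p_z in enumerate(doc_priors))
        for y in unique
    }
    return max(scores, key=scores.get), scores

# Hypothetical generator probabilities p(y | x, z) for two documents.
table = {
    ("1889", 0): 0.60, ("1889", 1): 0.50,
    ("1887", 0): 0.05, ("1887", 1): 0.55,
}
log_prob_fn = lambda y, z: math.log(table[(y, z)])

beams = [["1889"], ["1887"]]   # with beam size 1, each document surfaced a different answer
priors = [0.4, 0.6]

best, scores = thorough_decode(beams, priors, log_prob_fn)

# Zero-for-missing approximation: only count documents whose beam contained y.
fast_scores = {
    y: sum(p_z * table[(y, z)] for z, p_z in enumerate(priors) if y in beams[z])
    for y in ("1889", "1887")
}
fast_best = max(fast_scores, key=fast_scores.get)
```

With these numbers the zero-for-missing shortcut picks "1887", while full scoring picks "1889", because document 1 also gives "1889" substantial probability even though it was not that document's top beam. The price is one extra forward pass per (candidate, document) pair that beam search did not already cover.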

[Interactive demo: RAG-Token vs RAG-Sequence. In RAG-Sequence, each answer is generated from a single document and then ranked. In RAG-Token, each output token can draw from any document, allowing cross-document synthesis.]

What is the key difference between RAG-Token and RAG-Sequence?

A. RAG-Token uses more documents than RAG-Sequence
B. RAG-Token is faster because it uses fewer forward passes
C. RAG-Token marginalizes over documents at each token position, allowing different tokens to draw from different documents, while RAG-Sequence generates complete answers from each document separately and then combines scores

Chapter 4: Training RAG

Training RAG is tricky because we have two components — a retriever and a generator — and they need to learn together. The retriever needs to learn what documents are useful for the generator, and the generator needs to learn how to use retrieved documents to produce correct answers.

What gets trained

RAG has three sets of parameters:

Component	Parameters	Trained?	Why?
Question encoder (BERT_Q)	η	Yes	Needs to learn to produce queries that retrieve useful documents for BART.
BART generator	θ	Yes	Needs to learn to read retrieved documents and extract/generate answers.
Passage encoder (BERT_P)	—	No (frozen)	Re-encoding 21M passages at every training step is prohibitively expensive. Kept frozen from DPR pre-training.

Why freeze the passage encoder? If we updated BERT_P during training, every gradient step would invalidate all 21 million pre-computed passage embeddings, and we would have to re-encode all of Wikipedia and rebuild the index to keep retrieval consistent. Doing that at every step is prohibitively expensive. So we freeze BERT_P and only update BERT_Q. The question encoder learns to produce queries that work well with the fixed passage embeddings. This is an engineering compromise, not a theoretical one: the passage representations stay at their DPR-trained values.

The training objective

We train to maximize the marginal log-likelihood of the correct answer:

L = − ∑_{(x,y) ∈ D} log p(y | x) = − ∑_{(x,y) ∈ D} log ∑_{z ∈ top-k} p_η(z | x) · p_θ(y | x, z)

Where D is the training set of (question, answer) pairs. We don't have supervision for which documents should be retrieved — the documents are latent. The model must learn which documents are useful entirely from the end-to-end signal of producing correct answers.

Gradient flow

How do gradients flow through retrieval? The retriever scores p_η(z | x) are differentiable with respect to η (the question encoder weights). The FAISS lookup itself is non-differentiable (it's a discrete nearest-neighbor search), but we approximate the gradient by fixing the set of retrieved documents during each forward pass and only backpropagating through the scores.

python
# RAG training loop (simplified sketch; helpers such as concat, doc_texts
# and bart.log_prob are illustrative placeholders)
import torch

for x, y in dataloader:
    optimizer.zero_grad()

    # 1. Encode question (DIFFERENTIABLE w.r.t. eta)
    q = question_encoder(x)                # [B, 768]

    # 2. Retrieve top-k documents (NOT differentiable); FAISS takes float32 NumPy arrays
    _, doc_ids = faiss_index.search(q.detach().cpu().numpy(), 5)   # doc_ids: [B, 5]

    # 3. Re-score with differentiable dot product
    doc_embeds = passage_embeds[doc_ids]     # [B, 5, 768], frozen
    retriever_scores = torch.einsum("bd,bkd->bk", q, doc_embeds)
    # [B, 5]: differentiable w.r.t. q (and thus eta)

    doc_log_priors = torch.log_softmax(retriever_scores, dim=-1)  # [B, 5]

    # 4. Score the target with BART for each document
    gen_log_probs = []
    for i in range(5):
        bart_input = concat(x, doc_texts(doc_ids[:, i]))  # query + i-th retrieved passage
        lp = bart.log_prob(bart_input, y)  # log p(y | x, z_i), shape [B]
        gen_log_probs.append(lp)

    # 5. Marginalize in log space (avoids underflow when exponentiating sequence log-probs)
    gen_log_probs = torch.stack(gen_log_probs, dim=1)                       # [B, 5]
    log_marginal = torch.logsumexp(doc_log_priors + gen_log_probs, dim=1)   # [B]
    loss = -log_marginal.mean()

    loss.backward()   # gradients flow to BOTH eta and theta
    optimizer.step()

The retriever learns without retrieval labels. This is the most elegant aspect of RAG training. Nobody tells the model which documents to retrieve. The only supervision is (question, answer) pairs. If retrieving document z₃ leads to a higher probability of the correct answer, then p_η(z₃ | x) gets a positive gradient. The retriever learns to find useful documents purely from the end-task signal. This is a form of latent variable learning — the documents are hidden variables that are never directly observed.

The REINFORCE-like gradient

Let's unpack the gradient math one more step. The loss for a single example is:

L = −log ∑_z p_η(z | x) · p_θ(y | x, z)

For the generator parameters θ, the gradient is straightforward — it flows through p_θ(y | x, z) for each document. For the retriever parameters η, the gradient has an interesting structure:

∇_η L = − ∑_z w(z) · ∇_η log p_η(z | x)

Where w(z) is a weight proportional to how much document z helped produce the correct answer. This looks like a REINFORCE gradient — the retriever gets a "reward signal" for retrieving documents that led to correct answers. Documents that helped the generator are reinforced; irrelevant documents are downweighted.
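
The weight w(z) can be made explicit. A short derivation, using only the definitions above, treating the top-k set as fixed, and applying the identity ∇p = p · ∇ log p, suggests that w(z) is the posterior probability of document z given both the question and the correct answer:

```latex
\begin{aligned}
\nabla_\eta L
  &= -\nabla_\eta \log \sum_{z} p_\eta(z \mid x)\, p_\theta(y \mid x, z) \\
  &= -\frac{\sum_{z} p_\theta(y \mid x, z)\, \nabla_\eta p_\eta(z \mid x)}
           {\sum_{z'} p_\eta(z' \mid x)\, p_\theta(y \mid x, z')} \\
  &= -\sum_{z} \underbrace{\frac{p_\eta(z \mid x)\, p_\theta(y \mid x, z)}
           {\sum_{z'} p_\eta(z' \mid x)\, p_\theta(y \mid x, z')}}_{w(z)}
     \,\nabla_\eta \log p_\eta(z \mid x)
\end{aligned}
```

The weights w(z) sum to one over the retrieved documents, so the update redistributes retrieval probability toward the documents that best explain the answer.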

Think of it as a credit assignment problem. When the generator produces the correct answer "1889", which of the 5 retrieved documents deserves credit? The gradient automatically assigns credit: documents where p_θ(y | x, z) was high (the answer was likely given that document) get a stronger positive gradient. Documents where the answer was unlikely get weaker signals. The retriever learns to find the documents that the generator finds most useful.
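
The credit-assignment behaviour can be verified numerically. The sketch below assumes p_η(z | x) is a softmax over the top-k retrieval scores s, in which case the gradient of the loss with respect to each score works out to prior minus posterior. The scores and generator probabilities are hypothetical, and a finite-difference estimate serves as an independent check.

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def loss(s, gen_probs):
    # L = -log sum_z p_eta(z | x) * p_theta(y | x, z), with p_eta = softmax(s)
    return -np.log(np.sum(softmax(s) * gen_probs))

s = np.array([2.0, 1.0, 0.5])             # hypothetical retriever scores for 3 documents
gen_probs = np.array([0.05, 0.80, 0.30])  # hypothetical p_theta(y | x, z) per document

prior = softmax(s)                                         # p_eta(z | x)
posterior = prior * gen_probs / np.sum(prior * gen_probs)  # w(z)
analytic_grad = prior - posterior                          # dL/ds under the softmax assumption

# Central finite differences as an independent check of the formula.
eps = 1e-6
basis = np.eye(3)
numeric_grad = np.array([
    (loss(s + eps * basis[j], gen_probs) - loss(s - eps * basis[j], gen_probs)) / (2 * eps)
    for j in range(3)
])
```

Document 1, which makes the answer most likely, gets a negative gradient, so a gradient-descent step raises its retrieval score. Document 0, which was ranked first but barely supports the answer, gets a positive gradient and is pushed down.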

Training hyperparameters

Setting	Value
Pre-trained retriever	DPR (trained on Natural Questions)
Pre-trained generator	BART-large (400M params)
Document index	Wikipedia dump (Dec 2018), 21M 100-word passages
Top-k documents	k = 5 or k = 10
Learning rate	1e-5 (both retriever and generator)
Optimizer	Adam
Batch size	Not specified (limited by GPU memory due to k forward passes)
Training data	Task-specific (question, answer) pairs only — no retrieval labels

A critical detail: the FAISS index is not updated during training. This means the set of retrievable documents stays fixed at whatever was indexed from the DPR passage encoder. The question encoder shifts to produce better queries for the fixed index, but the index itself doesn't improve. REALM (a concurrent paper by Guu et al.) tackled this by periodically re-encoding the entire index — asynchronously, in the background — but this added significant engineering complexity.

[Interactive demo: Training Gradient Flow. The correct-answer signal flows back through BART (generator gradient) and through the retrieval scores (retriever gradient); the passage encoder stays frozen.]

Why is the passage encoder (BERT_P) frozen during RAG training while the question encoder (BERT_Q) is trained?

A. Because updating the passage encoder would require re-encoding all 21 million passages in the FAISS index at every training step, which is computationally prohibitive. Instead, only the question encoder is trained to produce queries that work well with the fixed passage embeddings.
B. Because the passage encoder is already perfect from DPR pre-training
C. Because the passage encoder has fewer parameters

Chapter 5: Open-Domain QA Results

The headline evaluation for RAG is open-domain question answering — answering factual questions using only a corpus of documents (Wikipedia) as the knowledge source, with no task-specific architecture or pre-defined answer set.

Lewis et al. evaluated on four major QA benchmarks:

Benchmark (Exact Match)	DPR (extractive)	RAG-Token	RAG-Sequence	Best RAG vs DPR
Natural Questions	41.5	44.1	44.5	+3.0 pts
TriviaQA	57.9	55.2	56.8	−1.1 pts
CuratedTrec	50.6	50.0	52.2	+1.6 pts
WebQuestions	41.1	45.5	45.2	+4.4 pts

RAG beats DPR's extractive reader on three of the four benchmarks and comes within about a point on TriviaQA, and it does so generatively. DPR extracts a span from a retrieved passage (pointing to exact text). RAG generates the answer token by token. This matters because generative models can synthesize information, rephrase, and handle questions where the answer isn't a literal span in any document.

Beyond QA: knowledge-intensive generation

RAG's real advantage shows on tasks that require generating text, not just extracting spans:

Jeopardy Question Generation: Given an entity (e.g., "Eiffel Tower"), generate a Jeopardy-style clue. RAG produced more factual, specific, and diverse clues than BART alone. Human evaluators preferred RAG's outputs on factuality and specificity.

FEVER Fact Verification: Given a claim, classify it as SUPPORTED, REFUTED, or NOT ENOUGH INFO using retrieved evidence. RAG achieved 72.5% accuracy on this 3-way task, within 4.3 points of state-of-the-art pipeline systems that use much more complex architectures and intermediate retrieval supervision.

MS-MARCO (abstractive QA): Generate free-form answers to real Bing search queries. RAG produced more specific and factual answers than BART, though this benchmark wasn't the primary focus.

Why RAG outperforms "retrieve then read"

The standard pipeline approach before RAG was: (1) retrieve relevant passages, (2) feed them to a reader model, (3) extract an answer span. RAG improves on this in three ways:

End-to-end training

Retriever and generator are trained jointly. The retriever learns what the generator needs, not just what looks relevant in isolation.

↓

Generative answers

Can synthesize, rephrase, and combine information. Not limited to extracting literal spans from passages.

↓

Marginalization over documents

Robust to retrieval errors. If the best passage is ranked third, it still contributes to the answer through marginalization.

[Interactive demo: Benchmark Performance Comparison. RAG-Token and RAG-Sequence compared against extractive (DPR) and closed-book (T5) baselines on each benchmark.]

What is a key advantage of RAG's generative approach over extractive QA systems like DPR?

A. RAG generates answers token by token, allowing it to synthesize, rephrase, and combine information from multiple passages, not just extract literal spans from a single document
B. RAG is always faster than extractive systems
C. RAG requires fewer retrieved documents

Chapter 6: Knowledge Updates

Here is perhaps the most practically important property of RAG: you can update what the model "knows" without retraining it. Just swap the document index.

Consider a traditional language model trained in January 2020. By March 2020, COVID-19 has changed the world, but the model still thinks it's a minor outbreak in Wuhan. To update the model, you'd need to retrain it on new data — an expensive process requiring massive compute.

With RAG, you simply update the Wikipedia dump in the FAISS index. The model's weights stay the same. The next time someone asks about COVID-19, the retriever finds the new passages, and the generator produces an up-to-date answer.

Hot-swappable knowledge. This is the fundamental advantage of separating knowledge (document index) from reasoning (model weights). The document index is just a data structure — you can add new documents, remove outdated ones, or replace the entire index. The model doesn't need to be retrained, fine-tuned, or even restarted. This makes RAG uniquely suited for domains where facts change rapidly: news, medicine, law, company policies.

What updating the index looks like

python
# Updating RAG's knowledge — no model retraining needed

# 1. Get new documents
new_articles = fetch_latest_wikipedia_dump()  # or any corpus

# 2. Chunk into 100-word passages
new_passages = chunk_documents(new_articles, chunk_size=100)

# 3. Encode with frozen passage encoder (BERT_P)
new_embeddings = passage_encoder.encode_batch(new_passages)
# shape: [num_new_passages, 768]

# 4. Replace FAISS index
new_index = faiss.IndexFlatIP(768)
new_index.add(new_embeddings)

# 5. Swap in the new index — zero downtime
rag_model.retriever.index = new_index

# Done! Model now answers with updated knowledge.
# No gradient updates. No retraining. No GPU hours.
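
The chunking step is simple enough to spell out. The function below is one possible reading of "chunk into 100-word passages": consecutive, non-overlapping windows split on whitespace. The name `chunk_words` is hypothetical, and real pipelines may also respect article or sentence boundaries.

```python
def chunk_words(text, chunk_size=100):
    """Split text into consecutive, non-overlapping chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), chunk_size)]

article = " ".join(f"w{i}" for i in range(250))   # a stand-in 250-word article
chunks = chunk_words(article, chunk_size=100)
chunk_lengths = [len(c.split()) for c in chunks]
```

A 250-word article yields chunks of 100, 100, and 50 words, and joining the chunks reproduces the original text, so no content is lost or duplicated.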

Limitations of index swapping

Knowledge update via index swapping isn't a silver bullet. There are real limitations:

Limitation	Why It Matters
Encoder drift	The question encoder was trained with the original index. New documents encoded by the frozen passage encoder might not align perfectly with what the question encoder expects. Performance can degrade over time.
No unlearning	Removing a document from the index doesn't make the generator forget what it learned from similar documents during training. The model weights still contain latent knowledge from training.
Passage encoder quality	The frozen BERT_P determines how well new documents are embedded. If new documents have a very different style than the original training corpus, embedding quality may suffer.
Re-encoding cost	For a full Wikipedia swap (~21M passages), re-encoding takes hours on GPUs even though it's a one-time cost. Incremental updates (adding/removing individual passages) are cheaper.
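
Incremental updates can be sketched with a tiny in-memory index. The class below is a hypothetical stand-in for a real vector index: it does exact inner-product search with NumPy and supports adding and removing individual passages by ID. The three-dimensional vectors are invented; in practice they would come from the frozen passage encoder.

```python
import numpy as np

class TinyInnerProductIndex:
    """Exact inner-product index kept in memory; a stand-in for a real vector index."""

    def __init__(self, dim):
        self.dim = dim
        self.ids = []                                   # external passage IDs
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, passage_id, vector):
        row = np.asarray(vector, dtype=np.float32)[None, :]
        self.ids.append(passage_id)
        self.vectors = np.vstack([self.vectors, row])

    def remove(self, passage_id):
        row = self.ids.index(passage_id)
        del self.ids[row]
        self.vectors = np.delete(self.vectors, row, axis=0)

    def search(self, query, k):
        scores = self.vectors @ np.asarray(query, dtype=np.float32)
        order = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in order]

index = TinyInnerProductIndex(dim=3)
index.add("old-fact", [1.0, 0.0, 0.0])
index.add("unrelated", [0.0, 1.0, 0.0])

query = [1.0, 0.1, 0.0]
before = index.search(query, k=1)[0][0]

# Incremental update: retire the stale passage and add a fresher one.
index.remove("old-fact")
index.add("new-fact", [0.9, 0.0, 0.1])
after = index.search(query, k=1)[0][0]
```

The same query returns the stale passage before the update and the fresh one afterwards, with only one new embedding computed. The rest of the index is untouched, which is why incremental updates are much cheaper than a full re-encode.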

Parametric vs non-parametric knowledge

RAG formalized a distinction that has become fundamental to how we think about LLM systems:

Property	Parametric Knowledge (model weights)	Non-Parametric Knowledge (document index)
Location	Encoded in 400M BART parameters	Stored in 21M passage embeddings + text
Update cost	Full retraining (days, $$$$)	Re-encode new documents (hours, $)
Provenance	Opaque — can't trace which training example taught a fact	Transparent — can cite exact retrieved passage
Capacity	Fixed by model size — 400M params store limited facts	Unlimited — add more documents to the index
Strengths	Language understanding, reasoning, generation fluency	Factual accuracy, recency, verifiability

This separation is the single most important conceptual contribution of the RAG paper. The specific architecture (DPR + BART) matters less than the pattern: let the model handle language, let the index handle knowledge.

RAG as a design pattern. The 2020 RAG paper gave the retrieve-then-generate paradigm its name and a general-purpose, end-to-end formulation for sequence-to-sequence tasks. But the core insight, separating parametric knowledge (model weights) from non-parametric knowledge (document index), has become perhaps the most widely adopted pattern in production LLM systems. Many LLM-powered applications today use some form of RAG: ChatGPT with browsing, Bing Chat, Perplexity, enterprise knowledge bases. The 2020 paper planted the seed; the pattern grew into an industry standard.

[Interactive demo: Knowledge Update Simulator. Starting from the original 2018 index, newer documents are added and answers change as the index is updated, all without retraining.]

How does RAG allow knowledge updates without retraining the model?

A. By fine-tuning only the last layer of the generator
B. By replacing or updating the document index: the passage encoder re-encodes new documents and the FAISS index is swapped, so the model retrieves from updated knowledge at inference time without any gradient updates to its weights
C. By increasing the model's context window to fit more information

Chapter 7: Connections

RAG sits at a critical junction in the history of NLP — it formalized the retrieve-then-generate pattern that became the backbone of modern LLM applications. Let's map where it came from and where it led.

Lineage

Model / Paper	Year	Relationship to RAG
DrQA (Chen et al.)	2017	First large-scale open-domain QA with retrieval + reader. Used TF-IDF (sparse), not dense retrieval. RAG's direct ancestor.
ORQA (Lee et al.)	2019	First end-to-end trained retriever + reader. Used Inverse Cloze Task for pre-training. Showed joint training beats pipeline.
REALM (Guu et al.)	2020	Concurrent with RAG. Also marginalizes over retrieved docs. Key difference: REALM periodically re-encodes the entire index during training (asynchronous index refresh). RAG freezes the passage encoder instead.
DPR (Karpukhin et al.)	2020	RAG's retriever. Showed that simple dense embeddings beat BM25 for open-domain QA. RAG inherits DPR's bi-encoder architecture.
BART (Lewis et al.)	2020	RAG's generator. Denoising seq2seq pre-training. BART's first author, Mike Lewis, is a co-author of RAG; RAG's first author is Patrick Lewis.
FiD (Izacard & Grave)	2021	Fusion-in-Decoder: encode each retrieved passage independently (together with the question), then let the decoder attend over the concatenated encoder outputs. Simpler than marginalization, often stronger. Became the dominant approach post-RAG.
RETRO (Borgeaud et al.)	2022	Retrieval-enhanced Transformer: retrieval baked into the architecture with chunked cross-attention. Scales retrieval to 2T tokens.
ChatGPT + Retrieval	2023+	Production RAG: GPT-4 with web browsing, Bing Chat, Perplexity. RAG's conceptual descendants, though architecturally different (long-context models with retrieval as tool use).

What RAG got right

Separation of concerns. The most lasting contribution: parametric knowledge (model weights) vs non-parametric knowledge (document index) should be separate systems. This insight survived every subsequent architectural change.

End-to-end training. RAG trains the retriever and generator jointly: the retriever (in practice its query encoder, since the passage encoder stays frozen) learns from the end-task signal without retrieval labels. This is cleaner than pipeline approaches where the retriever is trained on a proxy task.

Two marginalization variants. RAG-Token and RAG-Sequence provided a principled framework for how to combine information from multiple documents. This distinction influenced subsequent work on multi-document reasoning.
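
The difference between the two variants is easiest to see numerically. RAG-Sequence computes p(y|x) ≈ Σ_z p(z|x) Π_i p(y_i | x, z, y_<i), using one document for the whole sequence, while RAG-Token computes p(y|x) ≈ Π_i Σ_z p(z|x) p(y_i | x, z, y_<i), marginalizing at every token. The sketch below evaluates both on made-up numbers; the array names and values are assumptions for illustration, not outputs of a real model.

```python
import numpy as np

def rag_sequence_prob(doc_prior, token_probs):
    """Sum over documents of p(z|x) times the full-sequence likelihood under z."""
    per_doc_sequence = np.prod(token_probs, axis=1)   # shape (K,)
    return float(np.dot(doc_prior, per_doc_sequence))

def rag_token_prob(doc_prior, token_probs):
    """Product over tokens of the document-marginalised token probability."""
    per_token = doc_prior @ token_probs               # shape (T,)
    return float(np.prod(per_token))

# Toy setup: K=2 retrieved documents, T=2 target tokens.
# token_probs[z, i] = p(y_i | x, z, y_<i); the values are invented.
doc_prior = np.array([0.5, 0.5])
token_probs = np.array([[0.9, 0.1],
                        [0.1, 0.9]])

p_seq = rag_sequence_prob(doc_prior, token_probs)
p_tok = rag_token_prob(doc_prior, token_probs)
print(p_seq, p_tok)
```

In this toy case each document supports a different token, so RAG-Token, which can draw on a different document per token, assigns about 0.25 while RAG-Sequence assigns about 0.09.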

What changed since RAG

Long-context models. GPT-4 Turbo with a 128K-token context, Gemini 1.5 with 1M+ tokens. When you can paste 50 documents into the context, the marginalization math becomes less important: you just concatenate everything and let attention handle it.

Retrieval as tool use. Modern systems treat retrieval as a tool the model can choose to invoke, not a mandatory step. The model decides when it needs external knowledge.

Chunking strategies. RAG used fixed 100-word chunks. Modern systems use semantic chunking, recursive splitting, and hierarchical retrieval. Chunk quality matters enormously.
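
As a rough illustration of the simplest end of this spectrum, the sketch below splits text into fixed-size word chunks, with an optional overlap between neighbours. The function name `chunk_words` and its parameters are hypothetical; `size=100, overlap=0` mirrors the fixed 100-word chunks described above, and a non-zero overlap is one simple refinement. Semantic or hierarchical chunking needs more machinery than this.

```python
def chunk_words(text, size=100, overlap=0):
    """Split text into chunks of `size` words; neighbours share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Synthetic 250-word text, for illustration only.
text = " ".join(f"w{i}" for i in range(250))
fixed = chunk_words(text, size=100, overlap=0)
overlapping = chunk_words(text, size=100, overlap=20)
print(len(fixed), len(overlapping))
```

With overlap, a sentence that straddles a chunk boundary appears whole in at least one chunk, at the cost of indexing some words twice.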

Re-ranking. Modern RAG pipelines add a cross-encoder re-ranker between retrieval and generation: retrieve 100 candidates with a fast bi-encoder, re-rank to top-10 with a slower but more accurate cross-encoder.
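
The two-stage shape can be sketched as follows. This is an illustration under stated assumptions: the document vectors are hand-picked rather than produced by a real bi-encoder, and `overlap_score` is a word-overlap stand-in for a cross-encoder, which in practice would be a model that reads the query and document together.

```python
import numpy as np

def retrieve_then_rerank(query, query_vec, docs, doc_vecs, cross_score,
                         n_candidates=100, top_k=10):
    """Stage 1: cheap inner-product search. Stage 2: re-score the candidates jointly."""
    scores = doc_vecs @ query_vec
    candidate_ids = np.argsort(-scores)[:n_candidates]
    reranked = sorted(candidate_ids,
                      key=lambda i: cross_score(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:top_k]]

def overlap_score(query, doc):
    """Hypothetical cross-encoder stand-in: number of shared words."""
    return len(set(query.split()) & set(doc.split()))

docs = ["rag freezes the passage encoder",
        "realm refreshes the index during training",
        "bart is a seq2seq generator"]
# Hand-picked vectors (assumption): stage 1 slightly prefers the REALM document.
doc_vecs = np.array([[0.9, 0.1], [1.0, 0.0], [0.0, 1.0]])
query = "which model freezes the passage encoder"
query_vec = np.array([1.0, 0.0])

result = retrieve_then_rerank(query, query_vec, docs, doc_vecs, overlap_score,
                              n_candidates=2, top_k=1)
print(result)
```

Here the first stage ranks the REALM document first, and the second stage promotes the RAG document because it scores the query and document together rather than comparing independent vectors.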

RAG in 2020 vs RAG in 2025: the paper's specific architecture (DPR + BART + marginalization) is rarely used directly today. But the pattern — retrieve relevant context, condition generation on it — is everywhere. The 2020 paper gave the pattern a name, a mathematical framework, and empirical validation. That's why it matters.

The production RAG stack (2024)

For context, here's what a modern production RAG system looks like compared to the 2020 paper:

Ingest & Chunk

Semantic chunking with overlap. 2020 RAG: fixed 100-word chunks. Modern: recursive splitting, hierarchical chunks, parent-child relationships.

↓

Embed & Index

2020: DPR bi-encoder + FAISS. Modern: instruction-tuned embedders (E5, BGE), hybrid sparse+dense, metadata filtering.

↓

Retrieve & Re-rank

2020: top-k from FAISS. Modern: retrieve 100 candidates, cross-encoder re-rank to top-10, diversity filtering.

↓

Generate

2020: BART-large (400M). Modern: GPT-4, Claude, Gemini (much larger; parameter counts undisclosed). Long context (128K+) means more docs fit.
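
The "hybrid sparse+dense" entry in the Embed & Index step does not say how the two result lists are merged. One common choice, used here as an assumption and not taken from the source, is reciprocal rank fusion (RRF): each document scores the sum of 1 / (k + rank) over the lists it appears in. The document IDs below are placeholders, and k = 60 is a conventional default.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc IDs; higher fused score means better."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)

# Placeholder rankings from a sparse (e.g. BM25) and a dense retriever.
sparse_ranking = ["d1", "d2", "d3"]
dense_ranking = ["d3", "d1", "d4"]

fused = reciprocal_rank_fusion([sparse_ranking, dense_ranking])
print(fused)
```

RRF uses only ranks, so it avoids having to calibrate sparse scores against dense similarity scores, which live on different scales.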

RAG Evolution Timeline

The timeline traces the evolution of retrieval-augmented models, from sparse retrieval (2017) through RAG (2020) to modern production systems (2024+).

What is the most lasting conceptual contribution of the RAG paper?

The specific DPR + BART architecture
The separation of parametric knowledge (model weights) from non-parametric knowledge (document index), allowing facts to be updated, verified, and cited without retraining, a design pattern now used in virtually every production LLM system
The use of FAISS for fast vector search
