Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Instead of memorizing the world inside model weights, retrieve relevant documents at inference time and condition generation on them.
Imagine you're building a question-answering system. Someone asks: "What is the capital of Burkina Faso?" Your language model was trained on internet text and it might know the answer — Ouagadougou — because that fact appeared in its training data. But what if someone asks: "Who won the 2024 Nobel Prize in Physics?" Your model was trained in 2020. It literally cannot know.
This is the knowledge problem in language models. Every fact the model "knows" must be baked into its parameters during training. This creates three fundamental failures:
| Failure Mode | Why It Happens | Example |
|---|---|---|
| Staleness | Training data has a cutoff date. The world changes; the model doesn't. | "Who is the current president of X?" is wrong after an election. |
| Hallucination | The model generates fluent but factually incorrect text when it doesn't know the answer. | Confidently inventing a fake paper citation with plausible-sounding authors. |
| Opacity | You can't trace which training document a fact came from. No provenance, no citations. | Model says "The drug interacts with X" but you can't verify the source. |
Before RAG, the standard approach to knowledge-intensive tasks was to make the model bigger. More parameters means more capacity to memorize facts. GPT-3 demonstrated this: at 175 billion parameters, it could answer trivia questions with impressive accuracy — because the answers were literally stored in its weights.
But this approach is absurdly inefficient. To store one more fact, you need to retrain a 175B-parameter model. To update a stale fact, you need to retrain again. The knowledge is entangled with the model's language abilities in ways we can't control or inspect.
Think of it like the difference between a closed-book exam and an open-book exam. In a closed-book exam (standard LM), you must memorize every possible fact. In an open-book exam (RAG), you just need to know how to find the right page and read it.
Before we can build RAG, we need a way to find relevant documents. Given a question like "What causes aurora borealis?", we need to search through millions of Wikipedia passages and return the handful that contain the answer. This is the job of the retriever.
Traditional search engines use sparse retrieval — methods like TF-IDF or BM25 that match exact keywords. If your question says "aurora" and a passage says "aurora," it's a match. This works well for simple queries but fails badly when the question and answer use different words. "What causes the Northern Lights?" should match a passage about "solar wind interacting with Earth's magnetosphere" — but there are no shared keywords.
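The vocabulary-mismatch failure described above can be reproduced with a few lines of code. The following is a minimal sketch of a basic BM25 scorer, not the implementation used by any particular search engine; the function name `bm25_scores`, the parameter defaults, and the two toy passages are assumptions made for illustration.

```python
import math
from collections import Counter

def bm25_scores(query, passages, k1=1.5, b=0.75):
    """Score each passage against the query with a basic BM25 formula."""
    docs = [p.lower().split() for p in passages]
    avg_len = sum(len(d) for d in docs) / len(docs)
    n_docs = len(docs)
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no exact keyword match, so no contribution
            df = sum(1 for d in docs if term in d)
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

passages = [
    # Relevant, but shares no word with the query
    "solar wind particles hit the magnetosphere and produce the aurora",
    # Irrelevant, but shares the keyword "northern"
    "the northern line is a london underground route",
]
scores = bm25_scores("what causes northern lights", passages)
```

In this toy corpus the relevant passage receives a score of zero because it shares no token with the query, while the irrelevant passage is rewarded for the word "northern". Dense retrieval is meant to close exactly this gap.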
RAG uses Dense Passage Retrieval (DPR), published by Karpukhin et al. just months before the RAG paper. Instead of matching keywords, DPR encodes both the question and every passage into dense vector embeddings, then finds passages whose vectors are closest to the question vector.
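DPR uses two separate BERT encoders: a question encoder (BERTQ) that maps a question to a 768-dimensional vector, and a passage encoder (BERTP) that maps each passage to a vector in the same space. The relevance of a passage to a question is the dot product of the two vectors.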
The two encoders are trained on pairs of (question, relevant passage) using a contrastive loss: push the question vector close to the relevant passage vector and far from irrelevant passage vectors.
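$$L(q, p^+, p_1^-, \dots, p_n^-) = -\log \frac{\exp\big(\mathrm{sim}(q, p^+)\big)}{\exp\big(\mathrm{sim}(q, p^+)\big) + \sum_{i=1}^{n} \exp\big(\mathrm{sim}(q, p_i^-)\big)}$$

Here sim(q, p) is the dot product of the question and passage vectors, p+ is the correct passage, and pi− are negative passages (passages that do not answer the question). This is an InfoNCE-style loss, from the same family of contrastive losses used in CLIP and SimCLR. It says: make the correct (question, passage) pair have a higher dot product than any incorrect pair.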
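To make the contrastive objective concrete, here is a minimal NumPy sketch for a single question, written as a softmax cross-entropy over dot products with the positive passage in slot 0. The function name `dpr_contrastive_loss` and the random toy vectors are assumptions for illustration; real DPR training batches many questions and backpropagates into both encoders.

```python
import numpy as np

def dpr_contrastive_loss(q, pos, negs):
    """Negative log-probability of the positive passage under a softmax over dot products.

    q: [d] question vector, pos: [d] positive passage vector, negs: [n, d] negatives.
    """
    candidates = np.vstack([pos[None, :], negs])   # [n + 1, d], positive at row 0
    logits = candidates @ q                        # [n + 1] dot-product similarities
    logits = logits - logits.max()                 # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[0]

rng = np.random.default_rng(0)
q = rng.normal(size=8)
negs = rng.normal(size=(4, 8))

# Positive passage aligned with the question: high dot product, low loss.
loss_aligned = dpr_contrastive_loss(q, pos=q, negs=negs)
# Positive passage pointing the opposite way: low dot product, high loss.
loss_opposed = dpr_contrastive_loss(q, pos=-q, negs=negs)
```

The loss falls as the positive pair's dot product rises relative to the negatives, which is the behavior the training objective rewards.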
At inference time, we need to find the top-k passages out of ~21 million Wikipedia passages. A brute-force search would compute 21 million dot products — too slow. DPR uses Maximum Inner Product Search (MIPS) with Facebook's FAISS library, which builds a compressed index that enables approximate nearest-neighbor search in milliseconds.
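```python
# Building a DPR index with FAISS
import faiss
import numpy as np

# Pre-compute passage embeddings (done ONCE, offline).
# encode_all_passages, wikipedia and question_encoder are placeholders
# for your own encoder and corpus.
passage_embeddings = encode_all_passages(wikipedia)        # shape: [21M, 768]
passage_embeddings = np.ascontiguousarray(passage_embeddings, dtype=np.float32)

# IndexFlatIP is an exact (brute-force) inner-product index, which keeps
# this example simple. At 21M passages you would typically swap in an
# approximate index such as faiss.IndexHNSWFlat.
index = faiss.IndexFlatIP(768)
index.add(passage_embeddings)                              # add all 21M vectors

# At inference: encode the question, then search the index
q_vec = question_encoder("What causes aurora?")            # float32, shape [1, 768]
scores, indices = index.search(q_vec, 5)                   # top-5 passages
# scores:  [1, 5], similarity scores
# indices: [1, 5], passage IDs used to look up the text
```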
The quality of DPR depends heavily on how you choose negative examples during training. Random negatives (random passages from the corpus) are too easy: the model learns to distinguish by topic rather than by relevance. Hard negatives are passages that are topically related but don't answer the question. Karpukhin et al. found that adding a BM25-retrieved hard negative (high keyword overlap, but not containing the answer) for each question, on top of in-batch negatives, improved retrieval quality.
```python
# DPR training with hard negatives
# (training_data, corpus, bm25, encode_q, encode_p and nce_loss are placeholders)
import random

for question, positive_passage in training_data:
    # Easy negative: random passage from the corpus
    easy_neg = random.choice(corpus)

    # Hard negative: top BM25 hit that is NOT the gold passage
    bm25_results = bm25.search(question, k=100)
    hard_neg = next(p for p in bm25_results if p != positive_passage)

    # Contrastive loss: pull q toward pos, push it away from negs
    loss = nce_loss(
        q=encode_q(question),
        pos=encode_p(positive_passage),
        negs=[encode_p(easy_neg), encode_p(hard_neg)],
    )
```
This detail matters for RAG because the retriever quality directly determines the generator's input quality. Garbage in, garbage out — if the retriever returns irrelevant passages, even a perfect generator can't produce correct answers.
Now we have the two pieces: a retriever that finds relevant documents and a generator that produces text. RAG combines them into a single end-to-end model. The architecture is elegant: query in, answer out, with document retrieval happening in the middle.
Retriever: pη(z | x). Given an input query x, the retriever returns a distribution over documents z. In practice, it retrieves the top-k documents (k=5 or k=10) from a FAISS index using DPR. The parameter η refers to the BERTQ question encoder weights; the passage encoder is frozen, so passage embeddings are computed once and stored in the index.
Generator: pθ(yi | x, z, y1:i-1) — Given the input x, a retrieved document z, and previously generated tokens y1:i-1, the generator predicts the next token yi. This is BART — a pre-trained seq2seq model. The encoder processes [x; z] (query concatenated with retrieved document), and the decoder generates the answer autoregressively.
This is the mathematical heart of RAG. We don't just pick the best document and generate from it. Instead, we treat the retrieved documents as latent variables and marginalize them out. The probability of generating output y given input x is:

$$p(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$
Where pη(z | x) is the retriever's score for document z (how relevant it is to the query) and pθ(y | x, z) is the generator's probability of producing answer y given document z.
In plain English: run the generator k times, once per retrieved document. Each run produces a probability for the answer. Weight each probability by how relevant the document was. Sum them up. Documents that are more relevant contribute more to the final answer.
```python
# RAG forward pass with tensor shapes (schematic: tokenizer, question_encoder,
# faiss_index, passage_db and bart stand in for real components)
import torch

x_text = "What year was the Eiffel Tower completed?"
x = tokenizer(x_text, return_tensors="pt")
# x.input_ids: [1, seq_len]  (batch=1, sequence length)

# Step 1: Encode query
q = question_encoder(x.input_ids)                          # [1, 768]

# Step 2: Retrieve top-k documents
scores, doc_ids = faiss_index.search(q.detach().numpy(), 5)
# scores: [1, 5]   doc_ids: [1, 5]
doc_texts = [passage_db[i] for i in doc_ids[0]]

# Step 3: Concatenate query with each document
inputs = [x_text + " [SEP] " + doc for doc in doc_texts]
enc = tokenizer(inputs, padding=True, return_tensors="pt")
# enc.input_ids: [5, max_len], one row per document, padded

# Step 4: Run BART once per document to get per-token distributions
# (token_probs is a placeholder for a decoder pass plus a softmax over the vocabulary)
gen_probs = bart.token_probs(enc)                          # [5, output_len, vocab_size]

# Step 5: Marginalize over documents
doc_priors = torch.softmax(torch.from_numpy(scores), dim=-1)       # [1, 5], retrieval weights
final_probs = torch.einsum("bk,klv->blv", doc_priors, gen_probs)   # [1, output_len, vocab_size]
```
The formula we saw in Chapter 2 — marginalizing over documents — has a subtle but critical ambiguity. When do we marginalize? This choice creates two distinct variants of RAG, each with different properties.
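In RAG-Sequence, we generate the entire output sequence conditioned on each document separately, then marginalize over documents once at the end:

$$p_{\text{RAG-Sequence}}(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x) \prod_{i=1}^{N} p_\theta(y_i \mid x, z, y_{1:i-1})$$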
In plain English: for each document z, generate the complete answer using only that document. Then weight each complete answer by how relevant z was. This means each answer candidate is "faithful" to a single document — the model never mixes information across documents within a single generation.
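In RAG-Token, we marginalize at each token position separately. Every token in the output can draw on a different document:

$$p_{\text{RAG-Token}}(y \mid x) \approx \prod_{i=1}^{N} \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y_i \mid x, z, y_{1:i-1})$$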
Notice the difference: in RAG-Sequence, the sum over z is outside the product over tokens. In RAG-Token, the sum over z is inside the product. This means each output token can draw from a different document.
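The effect of moving the sum inside the product is easiest to see with numbers. The sketch below uses a made-up toy case (two documents, a two-token target, invented probabilities) in which each document supports only one of the two target tokens; none of these values come from the paper.

```python
import numpy as np

# Toy setup: 2 retrieved documents, a 2-token target sequence.
doc_priors = np.array([0.6, 0.4])          # p(z | x), sums to 1
# token_probs[z, i] = p(y_i | x, z, y_{1:i-1}) for the target tokens
token_probs = np.array([
    [0.9, 0.1],   # doc 0 supports token 1 but not token 2
    [0.1, 0.9],   # doc 1 supports token 2 but not token 1
])

# RAG-Sequence: sum over documents OUTSIDE the product over tokens.
p_sequence = float(np.sum(doc_priors * np.prod(token_probs, axis=1)))

# RAG-Token: sum over documents INSIDE the product over tokens.
p_token = float(np.prod(doc_priors @ token_probs))
```

Here RAG-Sequence assigns the target a probability of 0.09 because no single document explains both tokens, while RAG-Token assigns about 0.24 because each token is credited to whichever document supports it.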
| Property | RAG-Sequence | RAG-Token |
|---|---|---|
| Marginalization | Over full sequences | Per token |
| Cross-doc info | No — each answer uses one doc | Yes — each token can use a different doc |
| Best for | Short factoid answers (QA) | Longer generated text that combines facts from several documents (e.g., Jeopardy question generation) |
| Decoding | Beam search per doc, then rerank | Standard beam search on marginalized probs |
| Computation | k separate beam searches | One beam search, k forward passes per step |
For RAG-Token, decoding is straightforward. At each step, compute the per-token probability from each document, take the weighted sum, and feed it into standard beam search.
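A single RAG-Token decoding step can be sketched as follows. The vocabulary, priors, and per-document distributions are invented for illustration, and greedy argmax stands in for beam search, which would keep the top few tokens instead of one.

```python
import numpy as np

def rag_token_next_distribution(doc_priors, per_doc_next_probs):
    """Marginal next-token distribution: sum_z p(z|x) * p(y_i | x, z, y_<i).

    doc_priors: [k], per_doc_next_probs: [k, vocab] (each row sums to 1).
    """
    return doc_priors @ per_doc_next_probs   # [vocab]

vocab = ["1887", "1889", "Paris"]
doc_priors = np.array([0.5, 0.3, 0.2])
per_doc_next_probs = np.array([
    [0.1, 0.8, 0.1],   # next-token distribution given document 0
    [0.2, 0.7, 0.1],   # given document 1
    [0.6, 0.2, 0.2],   # given document 2 (disagrees with the others)
])

marginal = rag_token_next_distribution(doc_priors, per_doc_next_probs)
next_token = vocab[int(np.argmax(marginal))]   # greedy choice for this step
```

Because the marginal is itself a valid distribution over the vocabulary, any standard decoder can consume it without knowing that retrieval was involved.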
For RAG-Sequence, it's trickier. We run beam search separately for each of the k documents, generating k sets of candidate sequences. Then we need to score each unique candidate across all documents using the full marginalization formula. Some candidates might only appear in one document's beam. In that case we either run an additional forward pass with the other documents to get their probabilities (the paper calls this Thorough Decoding) or approximate the missing probabilities as zero (Fast Decoding).
```python
# RAG-Sequence decoding (simplified; bart.beam_search is a placeholder)
import math
from collections import defaultdict

candidates = defaultdict(list)
# doc_priors = softmax over the retrieval scores, i.e. p(z|x)
for z_i, prior_i in zip(top_k_docs, doc_priors):
    # Run beam search conditioned on this single document
    beams = bart.beam_search(query, z_i, num_beams=5)
    for seq, log_prob in beams:          # seq must be hashable (str or tuple of ids)
        candidates[seq].append((prior_i, log_prob))

# Marginalize: for each unique candidate, sum p(z|x) * p(y|x,z).
# Documents whose beam did not contain seq contribute 0 here.
final_scores = {
    seq: sum(prior * math.exp(lp) for prior, lp in contributions)
    for seq, contributions in candidates.items()
}
best_answer = max(final_scores, key=final_scores.get)
```
Training RAG is tricky because we have two components — a retriever and a generator — and they need to learn together. The retriever needs to learn what documents are useful for the generator, and the generator needs to learn how to use retrieved documents to produce correct answers.
RAG has three sets of parameters:
| Component | Parameters | Trained? | Why? |
|---|---|---|---|
| Question encoder (BERTQ) | η | Yes | Needs to learn to produce queries that retrieve useful documents for BART. |
| BART generator | θ | Yes | Needs to learn to read retrieved documents and extract/generate answers. |
| Passage encoder (BERTP) | — | No (frozen) | Re-encoding 21M passages at every training step is prohibitively expensive. Kept frozen from DPR pre-training. |
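We train to maximize the marginal log-likelihood of the correct answer:

$$\max_{\eta, \theta} \sum_{(x, y) \in D} \log p(y \mid x), \qquad p(y \mid x) \approx \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$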
Where D is the training set of (question, answer) pairs. We don't have supervision for which documents should be retrieved — the documents are latent. The model must learn which documents are useful entirely from the end-to-end signal of producing correct answers.
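In practice this objective is usually evaluated in log space, because the probability of a multi-token answer is far too small to exponentiate safely. The sketch below shows one way to do that with NumPy; the function name `marginal_nll` and the toy scores are assumptions, and the retrieval prior is assumed to be a softmax over the top-k dot products.

```python
import numpy as np

def marginal_nll(retrieval_scores, gen_log_probs):
    """-log sum_z p(z|x) p(y|x,z), computed in log space.

    retrieval_scores: [k] raw dot products; gen_log_probs: [k] values of log p(y | x, z).
    """
    # log-softmax over the k retrieved documents
    log_priors = retrieval_scores - np.logaddexp.reduce(retrieval_scores)
    # log-sum-exp of (log prior + log likelihood) is the log marginal
    return -np.logaddexp.reduce(log_priors + gen_log_probs)

scores = np.array([82.0, 80.5, 79.0])
gen_log_probs = np.array([-800.0, -805.0, -790.0])   # long answers have very negative log-probs

stable = marginal_nll(scores, gen_log_probs)

# Naive version: exponentiating first underflows to 0, so the loss becomes inf.
priors = np.exp(scores - scores.max())
priors /= priors.sum()
with np.errstate(divide="ignore"):
    naive = -np.log(np.sum(priors * np.exp(gen_log_probs)))
```

The log-space version returns a finite loss that can be backpropagated, whereas the naive version collapses to infinity once the generator log-probabilities fall below the floating-point range.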
How do gradients flow through retrieval? The retriever scores pη(z | x) are differentiable with respect to η (the question encoder weights). The FAISS lookup itself is non-differentiable (it's a discrete nearest-neighbor search), so the set of retrieved documents is treated as fixed within each forward pass and gradients flow only through the scores of those k documents.
```python
# RAG training loop (simplified; question_encoder, faiss_index, passage_embeds,
# passage_texts, concat and bart.log_prob are placeholders)
import torch
import torch.nn.functional as F

K = 5
for x, y in dataloader:
    optimizer.zero_grad()

    # 1. Encode question (DIFFERENTIABLE w.r.t. eta)
    q = question_encoder(x)                                    # [B, 768]

    # 2. Retrieve top-k documents (NOT differentiable)
    _, doc_ids = faiss_index.search(q.detach().cpu().numpy(), K)
    doc_ids = torch.from_numpy(doc_ids)                        # [B, K]

    # 3. Re-score with a differentiable dot product
    doc_embeds = passage_embeds[doc_ids]                       # [B, K, 768], frozen
    retriever_scores = torch.einsum("bd,bkd->bk", q, doc_embeds)   # [B, K]
    # these ARE differentiable w.r.t. q (and thus eta)
    log_priors = F.log_softmax(retriever_scores, dim=1)        # [B, K]

    # 4. Score the gold answer with BART once per document
    gen_log_probs = []
    for i in range(K):
        docs_i = [passage_texts[j] for j in doc_ids[:, i].tolist()]
        bart_input = concat(x, docs_i)
        gen_log_probs.append(bart.log_prob(bart_input, y))     # log p(y | x, z_i), shape [B]
    gen_log_probs = torch.stack(gen_log_probs, dim=1)          # [B, K]

    # 5. Marginalize in log space (avoids underflow from .exp()) and compute loss
    loss = -torch.logsumexp(log_priors + gen_log_probs, dim=1).mean()
    loss.backward()      # gradients flow to BOTH eta and theta
    optimizer.step()
```
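Let's unpack the gradient math one more step. The loss for a single example is:

$$\mathcal{L}(x, y) = -\log \sum_{z \in \text{top-}k(p_\eta(\cdot \mid x))} p_\eta(z \mid x)\, p_\theta(y \mid x, z)$$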
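For the generator parameters θ, the gradient is straightforward: it flows through pθ(y | x, z) for each document. For the retriever parameters η, the gradient of the loss $\mathcal{L} = -\log p(y \mid x)$ has an interesting structure:

$$\nabla_\eta \mathcal{L} = -\sum_{z} w(z)\, \nabla_\eta \log p_\eta(z \mid x), \qquad w(z) = \frac{p_\eta(z \mid x)\, p_\theta(y \mid x, z)}{\sum_{z'} p_\eta(z' \mid x)\, p_\theta(y \mid x, z')}$$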
Where w(z) is a weight proportional to how much document z helped produce the correct answer. This looks like a REINFORCE gradient — the retriever gets a "reward signal" for retrieving documents that led to correct answers. Documents that helped the generator are reinforced; irrelevant documents are downweighted.
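The reward-like behavior can be checked numerically. The sketch below assumes pη(z | x) is a softmax over retrieval scores and takes w(z) to be the normalized product pη(z | x) pθ(y | x, z); under those assumptions the gradient of log p(y | x) with respect to each retrieval score works out to w(z) minus the prior. The toy scores and generator probabilities are invented.

```python
import numpy as np

def log_marginal(scores, gen_probs):
    """log sum_z softmax(scores)_z * gen_probs_z."""
    priors = np.exp(scores - scores.max())
    priors /= priors.sum()
    return np.log(np.sum(priors * gen_probs))

scores = np.array([2.0, 1.0, 0.5])        # retrieval scores for 3 documents
gen_probs = np.array([0.05, 0.60, 0.01])  # p(y | x, z): document 1 helps most

priors = np.exp(scores - scores.max())
priors /= priors.sum()
posterior = priors * gen_probs / np.sum(priors * gen_probs)   # w(z)

# Analytic gradient of log p(y|x) with respect to the scores.
analytic_grad = posterior - priors

# Central finite differences as an independent check.
eps = 1e-6
numeric_grad = np.array([
    (log_marginal(scores + eps * np.eye(3)[j], gen_probs)
     - log_marginal(scores - eps * np.eye(3)[j], gen_probs)) / (2 * eps)
    for j in range(3)
])
```

Document 1, which the retriever ranked second but which explains the answer best, gets a positive gradient on its score. Document 0, ranked first but unhelpful, gets a negative one.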
| Setting | Value |
|---|---|
| Pre-trained retriever | DPR (trained on Natural Questions and TriviaQA) |
| Pre-trained generator | BART-large (400M params) |
| Document index | Wikipedia dump (Dec 2018), 21M 100-word passages |
| Top-k documents | k = 5 or k = 10 |
| Learning rate | 1e-5 (both retriever and generator) |
| Optimizer | Adam |
| Batch size | Not specified (limited by GPU memory due to k forward passes) |
| Training data | Task-specific (question, answer) pairs only — no retrieval labels |
A critical detail: the FAISS index is not updated during training. This means the set of retrievable documents stays fixed at whatever was indexed from the DPR passage encoder. The question encoder shifts to produce better queries for the fixed index, but the index itself doesn't improve. REALM (a concurrent paper by Guu et al.) tackled this by periodically re-encoding the entire index — asynchronously, in the background — but this added significant engineering complexity.
The headline evaluation for RAG is open-domain question answering — answering factual questions using only a corpus of documents (Wikipedia) as the knowledge source, with no task-specific architecture or pre-defined answer set.
Lewis et al. evaluated on four major QA benchmarks:
| Benchmark | DPR (extractive baseline) | RAG-Token | RAG-Sequence | Best RAG vs. DPR |
|---|---|---|---|---|
| Natural Questions | 41.5 | 44.1 | 44.5 | +3.0 pts |
| TriviaQA | 57.9 | 55.2 | 56.8 | -1.1 pts |
| CuratedTrec | 50.6 | 50.0 | 52.2 | +1.6 pts |
| WebQuestions | 41.1 | 45.5 | 45.2 | +4.4 pts |
RAG's real advantage shows on tasks that require generating text, not just extracting spans:
Jeopardy Question Generation: Given an entity (e.g., "Eiffel Tower"), generate a Jeopardy-style clue. RAG produced more factual, specific, and diverse clues than BART alone. Human evaluators preferred RAG's outputs on factuality and specificity.
FEVER Fact Verification: Given a claim, classify it as SUPPORTED, REFUTED, or NOT ENOUGH INFO using retrieved evidence. RAG achieved 72.5% accuracy on this 3-way task, within 4.3 points of state-of-the-art pipeline systems that use much more complex architectures and intermediate retrieval supervision.
MS-MARCO (abstractive QA): Generate free-form answers to real Bing search queries. RAG produced more specific and factual answers than BART, though this benchmark wasn't the primary focus.
The standard pipeline approach before RAG was: (1) retrieve relevant passages, (2) feed them to a reader model, (3) extract an answer span. RAG improves on this in three ways:
Here is perhaps the most practically important property of RAG: you can update what the model "knows" without retraining it. Just swap the document index.
Consider a traditional language model trained in January 2020. By March 2020, COVID-19 has changed the world, but the model still thinks it's a minor outbreak in Wuhan. To update the model, you'd need to retrain it on new data — an expensive process requiring massive compute.
With RAG, you re-encode the new Wikipedia dump with the frozen passage encoder and rebuild the FAISS index. The model's weights stay the same. The next time someone asks about COVID-19, the retriever finds the new passages, and the generator produces an up-to-date answer.
```python
# Updating RAG's knowledge without retraining the model
# (fetch_latest_wikipedia_dump, chunk_documents, passage_encoder and rag_model
# are placeholders for your own pipeline)
import faiss
import numpy as np

# 1. Get new documents
new_articles = fetch_latest_wikipedia_dump()                  # or any corpus

# 2. Chunk into 100-word passages
new_passages = chunk_documents(new_articles, chunk_size=100)

# 3. Encode with the frozen passage encoder (BERT_P)
new_embeddings = passage_encoder.encode_batch(new_passages)   # [num_new_passages, 768]
new_embeddings = np.ascontiguousarray(new_embeddings, dtype=np.float32)

# 4. Build a replacement FAISS index
new_index = faiss.IndexFlatIP(768)
new_index.add(new_embeddings)

# 5. Swap in the new index (keep the passage text store in sync with it)
rag_model.retriever.index = new_index

# The model now answers from the updated corpus.
# No gradient updates and no retraining; only the encoding pass in step 3 needs compute.
```
Knowledge update via index swapping isn't a silver bullet. There are real limitations:
| Limitation | Why It Matters |
|---|---|
| Encoder drift | The question encoder was trained with the original index. New documents encoded by the frozen passage encoder might not align perfectly with what the question encoder expects. Performance can degrade over time. |
| No unlearning | Removing a document from the index doesn't make the generator forget what it learned from similar documents during training. The model weights still contain latent knowledge from training. |
| Passage encoder quality | The frozen BERTP determines how well new documents are embedded. If new documents have a very different style than the original training corpus, embedding quality may suffer. |
| Re-encoding cost | For a full Wikipedia swap (~21M passages), re-encoding takes hours on GPUs even though it's a one-time cost. Incremental updates (adding/removing individual passages) are cheaper. |
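The incremental updates mentioned in the last row can be illustrated with a toy in-memory index. This is a sketch only: `TinyInnerProductIndex` is a hypothetical class written for this example, it does exact search over a Python dict, and the three-dimensional vectors stand in for real passage embeddings.

```python
import numpy as np

class TinyInnerProductIndex:
    """Exact inner-product index over a dict of passage_id -> vector (toy scale only)."""

    def __init__(self):
        self.vectors = {}

    def add(self, passage_id, vector):
        self.vectors[passage_id] = np.asarray(vector, dtype=np.float32)

    def remove(self, passage_id):
        self.vectors.pop(passage_id, None)

    def search(self, query, k):
        ids = list(self.vectors)
        matrix = np.stack([self.vectors[i] for i in ids])       # [n, dim]
        scores = matrix @ np.asarray(query, dtype=np.float32)   # [n]
        top = np.argsort(-scores)[:k]
        return [(ids[i], float(scores[i])) for i in top]

index = TinyInnerProductIndex()
index.add("old_passage", [1.0, 0.0, 0.0])
index.add("unrelated", [0.0, 1.0, 0.0])
query = [0.9, 0.1, 0.4]   # the question vector never changes below

before = index.search(query, k=1)[0][0]
index.add("new_passage", [0.9, 0.0, 0.5])   # incremental add; old passages untouched
after = index.search(query, k=1)[0][0]
index.remove("new_passage")                 # removal restores the earlier result
restored = index.search(query, k=1)[0][0]
```

The same query vector returns a different top passage after the add and the original one after the remove, with no change to any encoder. That is the whole mechanism behind index-based knowledge updates.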
RAG formalized a distinction that has become fundamental to how we think about LLM systems:
| Property | Parametric Knowledge (model weights) | Non-Parametric Knowledge (document index) |
|---|---|---|
| Location | Encoded in 400M BART parameters | Stored in 21M passage embeddings + text |
| Update cost | Full retraining (days, $$$$) | Re-encode new documents (hours, $) |
| Provenance | Opaque — can't trace which training example taught a fact | Transparent — can cite exact retrieved passage |
| Capacity | Fixed by model size — 400M params store limited facts | Unlimited — add more documents to the index |
| Strengths | Language understanding, reasoning, generation fluency | Factual accuracy, recency, verifiability |
This separation is the single most important conceptual contribution of the RAG paper. The specific architecture (DPR + BART) matters less than the pattern: let the model handle language, let the index handle knowledge.
RAG sits at a critical junction in the history of NLP — it formalized the retrieve-then-generate pattern that became the backbone of modern LLM applications. Let's map where it came from and where it led.
| Model / Paper | Year | Relationship to RAG |
|---|---|---|
| DrQA (Chen et al.) | 2017 | First large-scale open-domain QA with retrieval + reader. Used TF-IDF (sparse), not dense retrieval. RAG's direct ancestor. |
| ORQA (Lee et al.) | 2019 | First end-to-end trained retriever + reader. Used Inverse Cloze Task for pre-training. Showed joint training beats pipeline. |
| REALM (Guu et al.) | 2020 | Concurrent with RAG. Also marginalizes over retrieved docs. Key difference: REALM periodically re-encodes the entire index during training (asynchronous index refresh). RAG freezes the passage encoder instead. |
| DPR (Karpukhin et al.) | 2020 | RAG's retriever. Showed that simple dense embeddings beat BM25 for open-domain QA. RAG inherits DPR's bi-encoder architecture. |
| BART (Lewis et al.) | 2020 | RAG's generator. Denoising seq2seq pre-training. Its first author, Mike Lewis, is also a co-author of RAG (whose first author is Patrick Lewis, a different person). |
| FiD (Izacard & Grave) | 2021 | Fusion-in-Decoder: encode each document independently, concatenate in decoder. Simpler than marginalization, often stronger. Became the dominant approach post-RAG. |
| RETRO (Borgeaud et al.) | 2022 | Retrieval-enhanced Transformer: retrieval baked into the architecture with chunked cross-attention. Scales retrieval to 2T tokens. |
| ChatGPT + Retrieval | 2023+ | Production RAG: GPT-4 with web browsing, Bing Chat, Perplexity. RAG's conceptual descendants, though architecturally different (long-context models with retrieval as tool use). |
Separation of concerns. The most lasting contribution: parametric knowledge (model weights) vs non-parametric knowledge (document index) should be separate systems. This insight survived every subsequent architectural change.
End-to-end training. Training the retriever and generator jointly, with the retriever learning from the end-task signal without retrieval labels. This is cleaner than pipeline approaches where the retriever is trained on a proxy task.
Two marginalization variants. RAG-Token and RAG-Sequence provided a principled framework for how to combine information from multiple documents. This distinction influenced subsequent work on multi-document reasoning.
Long-context models. GPT-4 Turbo with a 128K context window, Gemini 1.5 with 1M+ tokens. When you can paste 50 documents into the context, the marginalization math becomes less important: you just concatenate everything and let attention handle it.
Retrieval as tool use. Modern systems treat retrieval as a tool the model can choose to invoke, not a mandatory step. The model decides when it needs external knowledge.
Chunking strategies. RAG used fixed 100-word chunks. Modern systems use semantic chunking, recursive splitting, and hierarchical retrieval. Chunk quality matters enormously.
Re-ranking. Modern RAG pipelines add a cross-encoder re-ranker between retrieval and generation: retrieve 100 candidates with a fast bi-encoder, re-rank to top-10 with a slower but more accurate cross-encoder.
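The chunking and re-ranking items above can be sketched together in plain Python. Everything here is illustrative: `chunk_words` mirrors the fixed 100-word splitting, and the two scoring functions are deliberately crude stand-ins (word overlap for the fast bi-encoder, ordered bigram overlap for the slower cross-encoder), not real models.

```python
def chunk_words(text, chunk_size=100):
    """Split text into consecutive, non-overlapping chunks of at most chunk_size words."""
    words = text.split()
    return [
        " ".join(words[start:start + chunk_size])
        for start in range(0, len(words), chunk_size)
    ]

def retrieve_then_rerank(query, passages, fast_score, slow_score, n_candidates, top_k):
    """Two-stage ranking: cheap scorer over everything, expensive scorer over survivors."""
    candidates = sorted(passages, key=lambda p: fast_score(query, p), reverse=True)[:n_candidates]
    return sorted(candidates, key=lambda p: slow_score(query, p), reverse=True)[:top_k]

# Stand-in for a bi-encoder: counts shared words, ignores word order.
def overlap_score(query, passage):
    return len(set(query.lower().split()) & set(passage.lower().split()))

# Stand-in for a cross-encoder: counts shared adjacent word pairs, so order matters.
def ordered_bigram_score(query, passage):
    q = query.lower().split()
    p = passage.lower().split()
    return len(set(zip(q, q[1:])) & set(zip(p, p[1:])))

article = " ".join(f"w{i}" for i in range(250))
chunk_lengths = [len(c.split()) for c in chunk_words(article, chunk_size=100)]

passages = [
    "tower eiffel the completed was",            # right words, scrambled order
    "the eiffel tower was completed in 1889",
    "bananas are rich in potassium",
]
result = retrieve_then_rerank(
    "eiffel tower was completed", passages,
    overlap_score, ordered_bigram_score, n_candidates=2, top_k=1,
)
```

The first stage cannot tell the scrambled passage from the real one because both share the same words. The second stage looks at fewer candidates more carefully and promotes the passage that actually reads like an answer.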
For context, here's what a modern production RAG system looks like compared to the 2020 paper: