RAG — Retrieval-Augmented Generation From Absolute Zero

Chapter 0: Why RAG?

You deploy a chatbot on your company's 10,000-page documentation site. A user asks: "What's the refund policy for enterprise licenses purchased after Q3 2024?" Your LLM confidently answers — incorrectly. It has never seen your docs. It's making things up. This is hallucination: the model generates plausible-sounding text that has no grounding in actual fact.

The root cause is architectural. An LLM is a function that maps text to text, parameterized by weights frozen at training time. It has no way to look things up. It can only recall what was baked into its weights during pretraining — which didn't include your internal wiki, your latest product specs, or yesterday's earnings call.

The problem in one sentence: LLMs are brilliant at language but deaf to your data. Retrieval-Augmented Generation fixes this by fetching relevant documents at query time and including them in the prompt as context — a cheat sheet the model can read before answering.

The key insight: you don't need the model to memorize everything. You just need it to read the right passage at the right moment. A student who can look things up during an exam beats one who memorized the wrong chapters.

Hallucination vs. RAG — Side by Side

Click Ask Both to see a bare LLM vs. a RAG system handling a question about a fictional internal policy. Watch which one invents answers.

RAG was formalized by Lewis et al. (2020) at Facebook AI Research, but the intuition is older: any information retrieval system that feeds retrieved snippets to a generation model is doing RAG. What changed is the quality of both retrieval (via dense vector search) and generation (via LLMs) — together they produce a system that is both accurate and fluent.

Why does an LLM hallucinate facts about your private documents?

The model is not intelligent enough to handle specialized topics
The model's context window is too small to hold all documents
The model's weights were frozen at training time and never saw your private data

Chapter 1: The Pipeline

RAG is not one algorithm — it's a pipeline of six steps. Each step has its own literature, trade-offs, and failure modes. Understanding the pipeline as a whole is the prerequisite for fixing any single part of it.

1. Chunk

Split documents into passages (256–512 tokens). Overlap by ~20% to preserve context across boundaries.

↓

2. Embed

Run each chunk through an embedding model. Each chunk becomes a dense vector in semantic space (e.g., 768 or 1536 dimensions).

↓

3. Store

Load vectors into a vector database (Pinecone, Weaviate, Chroma, pgvector). Index for fast approximate nearest-neighbor lookup.

↓ — query time —

4. Retrieve

Embed the user query. Find the top-K most similar chunk vectors via ANN search. Return the raw text of those chunks.

↓

5. Rerank

Score each retrieved chunk against the query using a cross-encoder (slower but more accurate). Drop chunks below threshold.

↓

6. Generate

Inject top chunks into the LLM prompt as context. The model reads them and generates a grounded, citeable answer.

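Two phases, two time budgets: Steps 1–3 are offline; you do them once when ingesting documents and again when documents change. Steps 4–6 are online; they happen at every query. Offline can be slow (hours). Online must be fast: retrieval and reranking should finish within hundreds of milliseconds, because generation adds its own latency on top. Design accordingly.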

Pipeline Data Flow

Click each stage to trace what data flows through it. Each box shows the input and output format.

Which RAG steps happen offline (once during ingestion) vs. online (every query)?

All steps happen online at query time
Chunk + Embed + Store are offline; Retrieve + Rerank + Generate are online
Only the Generate step is online; everything else is precomputed

Chapter 2: Indexing

Before you can retrieve anything, you need to turn raw documents into searchable chunks with semantic vector representations. This is the indexing pipeline: load → chunk → embed → store.

1. Document Loading

Documents come in many formats: PDF, DOCX, HTML, Markdown, CSV, database rows. Each needs a parser that extracts clean text while preserving structure cues (headings, table relationships). Libraries like LangChain's document loaders and LlamaIndex's readers abstract this, but the output quality depends entirely on parser quality. A PDF with scanned images needs OCR; a DOCX with embedded tables needs special handling.

2. Chunking

You can't embed a 100-page manual as one unit: the vector would average over everything and represent nothing specific. Chunking splits documents into semantically coherent passages, each independently embeddable. The three main strategies are:

Strategy	How	Best For
Fixed-size	Split every N tokens; overlap last M tokens	Uniform docs (logs, articles)
Semantic / recursive	Split on paragraphs, then sentences; keep natural boundaries	Structured docs (manuals, code)
Hierarchical	Keep parent-child: section → paragraph → sentence	RAPTOR, multi-granularity retrieval

Overlap is not wasteful: A 20% overlap between consecutive chunks means the same sentence may appear in two chunks. This is intentional — if the answer spans a chunk boundary, at least one chunk will contain it fully.
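To make the overlap idea concrete, here is a minimal sketch of a fixed-size chunker with overlap. The function name `chunk_with_overlap` and the whitespace "tokenizer" are illustrative assumptions; a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_with_overlap(tokens: list[str], chunk_size: int, overlap: int) -> list[list[str]]:
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Toy document: whitespace-separated words stand in for real tokens.
tokens = "refunds for enterprise licenses are accepted within 30 days of purchase".split()
chunks = chunk_with_overlap(tokens, chunk_size=5, overlap=1)  # 20% overlap
for chunk in chunks:
    print(" ".join(chunk))
```

With these toy settings the first chunk ends on "are" and the second starts on it, so the phrase "within 30 days" lands intact inside the second chunk instead of being cut at a boundary.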

3. Embedding

An embedding model maps text to a dense vector where semantically similar texts are close in cosine distance. Common choices: OpenAI text-embedding-3-small (1536-dim, API), all-MiniLM-L6-v2 (384-dim, local, fast), bge-large-en (BAAI, strong retrieval-focused model).

python
# Offline indexing pipeline
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# raw_docs: list[Document] produced earlier by a document loader (step 1)

# 1. Chunk: 512 tokens with a 100-token (~20%) overlap.
# The plain constructor counts characters, so use the tiktoken-based
# constructor to measure chunk_size and chunk_overlap in tokens.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512, chunk_overlap=100, separators=["\n\n", "\n", " "]
)
chunks = splitter.split_documents(raw_docs)  # list[Document]

# 2. Embed + 3. Store: one call with Chroma
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)
# Each stored chunk is now: { text, metadata, embedding_vector }

Chunk Size Trade-off

Drag the slider to change chunk size. Watch how specificity (smaller = more precise) trades off against context coverage (larger = more context, noisier vector).

Why do we use overlapping chunks instead of non-overlapping chunks of the same size?

Overlapping uses more storage, which improves retrieval accuracy
Overlap reduces the total number of embedding API calls
Answers that span chunk boundaries can still be found in at least one chunk

Chapter 3: Retrieval

At query time, you have a user question and a vector store with millions of embedded chunks. Retrieval means: find the K chunks most semantically relevant to this query in milliseconds. This is harder than it sounds.

Query Embedding

The user's query is embedded with the same model used to embed the documents. This is critical — if you indexed with text-embedding-3-small, you must retrieve with text-embedding-3-small. Different models produce incompatible vector spaces.

q⃗ = Embed("What is the refund policy?") ∈ ℝ¹⁵³⁶
top-K = the K chunks i with the highest cosine_sim(q⃗, d⃗_i), where cosine_sim(q⃗, d⃗) = (q⃗ · d⃗) / (‖q⃗‖ ‖d⃗‖)
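As a rough illustration of the scoring rule above, the following sketch ranks a handful of toy vectors by cosine similarity with brute force. The 3-dimensional vectors and the helper names `cosine_sim` and `top_k` are made up for the example; real embeddings have hundreds of dimensions and come from the embedding model.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int) -> list[int]:
    """Return the indices of the k document vectors most similar to the query."""
    scored = [(cosine_sim(query_vec, vec), i) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [i for _, i in scored[:k]]


# Toy 3-D "embeddings" (illustrative values only).
doc_vecs = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.2, 0.0]
print(top_k(query_vec, doc_vecs, k=2))  # indices of the two closest documents
```

This is the exact search that ANN indexes approximate: every document is compared against the query, which is what becomes too slow at millions of vectors.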

Approximate Nearest Neighbor (ANN)

Brute-force cosine similarity over 10 million vectors takes seconds. ANN algorithms (HNSW, IVF, ScaNN) trade a small accuracy loss for a 100-1000x speedup by comparing the query against only a small fraction of the index. IVF and ScaNN partition the vector space and search only the nearby partitions. HNSW (Hierarchical Navigable Small World) instead builds a multi-layer graph in which each node links to its nearest neighbors, with sparse upper layers for long hops and a dense bottom layer for fine-grained search. HNSW is the default in many vector stores.
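The partitioning idea can be sketched with a toy inverted-file (IVF) index. This is a simplified illustration, not how any particular vector store implements it: the centroids are hand-picked here (real systems learn them, typically with k-means), and the names `build_ivf`, `ivf_search`, and `nprobe` are assumptions for the example.

```python
import math


def cosine_sim(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_ivf(doc_vecs, centroids):
    """Offline: put each vector in the list of its most similar centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for i, vec in enumerate(doc_vecs):
        nearest = max(range(len(centroids)), key=lambda c: cosine_sim(vec, centroids[c]))
        lists[nearest].append(i)
    return lists


def ivf_search(query, doc_vecs, centroids, lists, k, nprobe=1):
    """Online: scan only the nprobe partitions closest to the query."""
    probe = sorted(range(len(centroids)),
                   key=lambda c: cosine_sim(query, centroids[c]),
                   reverse=True)[:nprobe]
    candidates = [i for c in probe for i in lists[c]]
    candidates.sort(key=lambda i: cosine_sim(query, doc_vecs[i]), reverse=True)
    return candidates[:k], len(candidates)


# Hand-picked 2-D toy data (illustrative only).
doc_vecs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [-1.0, 0.1], [-0.9, -0.1]]
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
lists = build_ivf(doc_vecs, centroids)
ids, scanned = ivf_search([1.0, 0.15], doc_vecs, centroids, lists, k=2)
print(ids, f"scanned {scanned} of {len(doc_vecs)} vectors")
```

Raising `nprobe` scans more partitions, which recovers neighbors that sit near a partition border at the cost of more comparisons. That knob is the accuracy-versus-speed trade described above.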

Dense vs. Sparse vs. Hybrid

Method	How	Strength	Weakness
Dense (bi-encoder)	Embed both query + doc; cosine similarity	Semantic: finds paraphrases, synonyms	Misses exact keyword matches
Sparse (BM25)	TF-IDF weighted token overlap	Exact terms: product codes, names	No semantic understanding
Hybrid	Merge dense + sparse scores (RRF)	Best of both worlds	Two indexes to maintain

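Reciprocal Rank Fusion (RRF): To merge a dense ranking and a sparse ranking, assign each chunk d a fused score: RRF(d) = ∑_m 1/(k + rank_m(d)), where the sum runs over the retrieval methods m and rank_m(d) is the position of d in method m's ranking (1 = best). Because it uses ranks rather than raw scores, RRF is robust to score scale differences between methods. k=60 is the commonly used constant, and the fused ranking typically outperforms either method alone.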
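A small sketch of the fusion formula, assuming each retriever returns an ordered list of chunk IDs. The IDs and the function name `reciprocal_rank_fusion` are hypothetical; a chunk that is absent from one ranking simply contributes nothing for that method.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of IDs; returns (id, score) pairs, best first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # rank 1 = best
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk IDs from a dense and a sparse retriever.
dense_ranking = ["c3", "c1", "c7"]
sparse_ranking = ["c1", "c9", "c3"]
fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
for doc_id, score in fused:
    print(doc_id, round(score, 5))
```

Here `c1` ends up first even though the dense retriever preferred `c3`, because `c1` sits near the top of both lists.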

python
# Hybrid retrieval: dense + sparse, fused with RRF
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

bm25 = BM25Retriever.from_documents(chunks, k=10)
dense = vectorstore.as_retriever(search_kwargs={"k": 10})

# EnsembleRetriever applies RRF internally
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6]  # tune based on your domain
)
results = hybrid.invoke("What is the refund policy?")
# returns: list[Document] sorted by fused relevance

Dense vs. Sparse Retrieval

Toggle query type to see which retrieval method wins. Semantic queries favor dense; exact-term queries favor sparse.

A user searches for "SKU-44892-B return window". Which retrieval method is most reliable?

Dense only: embeddings capture the full meaning
Dense only: it handles exact strings better than BM25
Hybrid: the product code needs sparse matching; the semantic meaning benefits from dense

Chapter 4: Reranking

You retrieved the top-20 chunks with fast ANN search. Fast is good, but the vectors being searched come from a bi-encoder: the query and each document are encoded independently, so the model never sees them side by side. It's like scoring job candidates by their LinkedIn headline alone.

A cross-encoder reranker does something slower but smarter: it takes the query and a single document concatenated together — "[CLS] query [SEP] document [SEP]" — and outputs a single relevance score. The model can attend across both texts simultaneously. It's like reading the candidate's full application together with the job description.

The trade-off: a bi-encoder needs only one model forward pass per query, because documents are pre-encoded offline and the rest is a vector lookup. A cross-encoder needs K forward passes per query, one for each retrieved doc. You can't cross-encode 1M docs, but you can cross-encode 20 retrieved docs in roughly 100ms on suitable hardware. That's the architecture: bi-encoder for a cheap first pass, cross-encoder for an accurate second pass.

Bi-Encoder vs Cross-Encoder Scoring

Click Run Retrieval to fetch 8 chunks with a bi-encoder, then Apply Reranker to see how the cross-encoder reorders them. Watch which chunks move up or down.

python
# Cross-encoder reranking with sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# query: the user's question (str)
# retrieved: list[Document] from the bi-encoder step (top-20)
pairs = [(query, doc.page_content) for doc in retrieved]
scores = reranker.predict(pairs)  # shape: (20,), one relevance score per pair

# Sort by score descending, keep top-5 for context.
# key= avoids comparing Document objects when two scores tie.
ranked = sorted(zip(scores, retrieved), key=lambda pair: pair[0], reverse=True)
top_docs = [doc for _, doc in ranked[:5]]

Cohere Rerank API: If you don't want to run a local cross-encoder, Cohere's rerank endpoint does the same thing. Pass your query + list of retrieved texts; get back relevance scores in one API call. Also useful: mixedbread-ai/mxbai-rerank-large-v1 (local, strong, Apache-2.0 license).

Why can't we use a cross-encoder to score ALL 1 million documents in the index at query time?

Cross-encoders can only handle short texts, not full documents
Cross-encoders run inference on every (query, doc) pair at query time; 1M pairs would take minutes
Cross-encoders require the documents to be pre-indexed in a special format

Chapter 5: Generation

You have the top-K retrieved and reranked chunks. Now you need to turn them into a grounded answer. This is where the LLM does what it's actually good at: reading, synthesizing, and writing.

Prompt Construction

The prompt has three parts: a system instruction (your persona, tone, and citation rules), the retrieved context (the chunks, each labeled with a source), and the user query. The context goes between the system and the query so the model reads it before the question.

python
def build_prompt(query: str, docs: list[str], sources: list[str]) -> str:
    context_block = ""
    for i, (doc, src) in enumerate(zip(docs, sources)):
        context_block += f"[{i+1}] ({src})\n{doc}\n\n"

    return f"""You are a helpful assistant. Answer ONLY using the provided context.
If the context does not contain the answer, say "I don't have enough information to answer that."
Cite sources as [1], [2], etc.

CONTEXT:
{context_block}
QUESTION: {query}

ANSWER:"""

Context Window Budget

Every token in the context costs: latency, money, and attention. The LLM has a fixed context window (4K to 128K tokens). You must fit: the system prompt (~200 tokens), retrieved chunks (~300 tokens each × K), the query (~50 tokens), and leave room for the answer (~500 tokens). With a 4K window, you can fit roughly 10 chunks of 300 tokens each.
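The budget arithmetic above can be written down as a tiny helper. The default token counts are the rough figures from the paragraph, and the function name `max_chunks` is just for illustration.

```python
def max_chunks(context_window: int, system_tokens: int = 200, query_tokens: int = 50,
               answer_tokens: int = 500, chunk_tokens: int = 300) -> int:
    """How many whole chunks fit once the fixed parts of the prompt are reserved."""
    available = context_window - system_tokens - query_tokens - answer_tokens
    return max(available // chunk_tokens, 0)


print(max_chunks(4096))  # 4K window: 3346 tokens left for chunks -> 11
print(max_chunks(8192))  # 8K window: 7442 tokens left for chunks -> 24
```

Fitting a chunk is not the same as using it well, so in practice the reranker usually trims the list far below this ceiling.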

Lost-in-the-middle problem: LLMs attend better to information at the beginning and end of their context. If you have 20 chunks, put the most relevant ones first and last — not buried in the middle. This is a real effect measured empirically by Liu et al. (2023).
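One simple way to act on this is to reorder the chunks before building the prompt, so the strongest ones sit at the edges and the weakest in the middle. The sketch below is one possible ordering scheme, not the method from the paper, and `reorder_for_long_context` is a hypothetical name.

```python
def reorder_for_long_context(docs_by_relevance: list[str]) -> list[str]:
    """Input is sorted most-relevant first; output puts the best items at both ends."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        # Alternate: ranks 1, 3, 5, ... go to the front, ranks 2, 4, ... to the back.
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]


ranked = ["A", "B", "C", "D", "E"]  # A is the most relevant chunk
print(reorder_for_long_context(ranked))  # ['A', 'C', 'E', 'D', 'B']
```

The most relevant chunk comes first, the second-most relevant comes last, and the least relevant one ends up in the middle of the context.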

"I Don't Know" is an Answer

A RAG system that makes up an answer when retrieval fails is worse than no system at all. Explicitly instruct the model: if the retrieved context does not contain sufficient information, say so. You can reinforce this with a threshold: if no chunk scores above 0.5 relevance (on a 0 to 1 scale; calibrate the cutoff for your reranker, since some return unbounded scores), decline to answer and return a "not found" response rather than hallucinating.
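A minimal sketch of such a gate, assuming relevance scores are already normalized to the 0 to 1 range. The `generate` callable is a hypothetical stand-in for the LLM call, and `select_context` and `answer` are illustrative names.

```python
from typing import Callable

NOT_FOUND = "I don't have enough information to answer that."


def select_context(scored_chunks: list[tuple[float, str]],
                   threshold: float = 0.5, max_chunks: int = 5) -> list[str]:
    """Keep chunks scoring above the threshold, best first."""
    kept = [(score, text) for score, text in scored_chunks if score > threshold]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in kept[:max_chunks]]


def answer(query: str, scored_chunks: list[tuple[float, str]],
           generate: Callable[[str, list[str]], str]) -> str:
    context = select_context(scored_chunks)
    if not context:
        return NOT_FOUND  # decline without calling the LLM at all
    return generate(query, context)


# Stub generator for demonstration; a real one would call the LLM.
echo = lambda query, context: f"Answer based on {len(context)} chunk(s)"
print(answer("refund window?", [(0.3, "unrelated"), (0.2, "also unrelated")], echo))
print(answer("refund window?", [(0.3, "unrelated"), (0.8, "Enterprise: 30 days")], echo))
```

Declining before generation also saves the cost and latency of an LLM call that could only have produced an ungrounded answer.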

Context Window Budget Visualizer

Adjust top-K and chunk size to see how much of your context window is consumed. Red = over budget.

A RAG system has no retrieved chunk with relevance score above 0.3. What should it do?

Send all retrieved chunks to the LLM anyway; it might still infer the answer
Decline to answer and tell the user the information isn't in the knowledge base
Increase top-K and retry retrieval until scores are higher

Chapter 6: Evaluation

How do you know if your RAG system is good? "It seems to work" is not an answer. You need metrics that measure the three things that can go wrong independently: retrieval might fetch the wrong chunks, the model might ignore the right chunks, and the answer might not address the question.

The Three RAGAS Metrics

RAGAS (Retrieval Augmented Generation Assessment) is a widely used evaluation framework. Among its metrics, three map onto the failure points above. Each is computed with the help of an LLM judge (answer relevance also uses embeddings):

Metric	Measures	Formula intuition
Faithfulness	Does the answer come from the context? (no hallucination)	# claims in answer supported by context / total claims
Answer Relevance	Does the answer address the question?	Mean cosine sim between the original question and questions an LLM generates back from the answer
Context Recall	Did retrieval find all needed information?	# ground-truth sentences entailed by retrieved context / total

Why all three matter independently: You can have perfect faithfulness (answer comes from context) but low relevance (context was about something else). You can have high relevance but low faithfulness (model added facts beyond the context). Only when all three are high does your system actually work.
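The ratio-style formulas in the table can be illustrated with a toy calculation. The verdict lists below are invented for the example; in a real evaluation an LLM judge produces them, and `supported_fraction` is a hypothetical helper, not part of the RAGAS API.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Share of items a judge marked as supported; 0.0 for an empty list."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


# Hypothetical judge verdicts for one question.
answer_claims = [True, True, False]  # third claim is not backed by the context
truth_sentences = [True, True]       # every ground-truth sentence was retrieved

faithfulness_score = supported_fraction(answer_claims)
context_recall_score = supported_fraction(truth_sentences)
print(round(faithfulness_score, 2), context_recall_score)  # 0.67 1.0
```

This toy case shows the independence of the scores: retrieval found everything that was needed, yet the answer still contains one unsupported claim.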

python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

# Build evaluation dataset
eval_data = {
    "question": ["What is the refund window?"],
    "answer": ["Enterprise licenses have a 30-day refund window."],
    "contexts": [["Refunds: Enterprise: 30 days. Consumer: 14 days."]],
    "ground_truth": ["30 days for enterprise licenses"]
}
ds = Dataset.from_dict(eval_data)

result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall])
# result maps each metric name to a score in [0, 1]; exact values depend on
# the judge LLM and embedding model (evaluate() calls them, so an API key is needed)

RAG Score Dashboard

Drag the sliders to simulate different system configurations. See how each failure mode reduces a specific metric.

A RAG system has faithfulness=0.3 but context_recall=1.0. What does this indicate?

Retrieval is broken: it's fetching irrelevant documents
The user's questions are too ambiguous to answer
Retrieval found the right chunks, but the LLM is ignoring them and hallucinating

Chapter 7: Failure Modes

RAG fails in ways that are specific, diagnosable, and fixable — but only if you know what to look for. Here are the five most common failure modes, each with a distinct signature in your RAGAS scores.

Failure Mode 1 — Wrong Retrieval: The retrieved chunks are not relevant to the query. Caused by: poor embedding model for your domain, query and document in different linguistic registers (user asks casually, docs are formal), or chunk boundaries that split key information.
Fix: hybrid retrieval, domain-fine-tuned embeddings, smaller chunk size.

Failure Mode 2 — Right Retrieval, Wrong Answer: The model has the answer in context but generates something different. Common cause: context is long and the answer is in the middle (lost-in-middle). Also: model's pretraining knowledge overrides retrieved context ("I already know the answer").
Fix: shorter context, explicit instruction to use ONLY context, put key chunks at start/end.

Failure Mode 3 — Context Overflow: Too many chunks consume the entire context window, leaving no room for the answer, or degrading attention quality across a long context.
Fix: aggressive reranking to trim to top 3-5, smaller chunks, use a model with larger context window.

Failure Mode 4 — Stale Data: Your index was built last month. The answer changed last week. The model confidently answers from outdated chunks.
Fix: incremental indexing (re-embed on document update), metadata filtering (filter by last-modified date), include timestamps in chunk metadata so the model can caveat.

Failure Mode 5 — Lost in the Middle: Empirically documented: LLMs attend much better to tokens at the beginning and end of their context than the middle. A relevant chunk buried at position 12 of 20 may be ignored.
Fix: always put the most relevant chunk first, second-most relevant last.
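As a rough triage aid, the clearest score signatures can be turned into a small decision function. The 0.5 cutoff and the labels are assumptions for illustration. Failure modes such as stale data or context overflow usually need logs and metadata to confirm, so this sketch only separates retrieval problems from generation problems.

```python
def diagnose(faithfulness: float, answer_relevancy: float, context_recall: float,
             low: float = 0.5) -> str:
    """Map a set of scores to the most likely broad failure category."""
    if context_recall < low:
        return "wrong retrieval"  # the needed facts never reached the prompt
    if faithfulness < low:
        return "right retrieval, wrong answer"  # facts were there, model strayed
    if answer_relevancy < low:
        return "off-topic answer"  # grounded, but not addressing the question
    return "no dominant failure"


print(diagnose(faithfulness=0.4, answer_relevancy=0.9, context_recall=0.95))
print(diagnose(faithfulness=0.9, answer_relevancy=0.9, context_recall=0.2))
```

Retrieval is checked first because a generation fix cannot help when the answer never made it into the context.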

Failure Mode Diagnosis

Click a failure mode to see which RAGAS metrics drop and by how much. Use this as a diagnostic guide.

A RAG system has context_recall=0.95 but faithfulness=0.4. Which failure mode is most likely?

Wrong Retrieval: the chunks don't contain the answer
Right Retrieval Wrong Answer: retrieval works, but the model hallucinates beyond the context
Context Overflow: too many chunks confuse the model

Chapter 8: Advanced RAG

Once you've got vanilla RAG working, there are four research-backed extensions that each solve a specific limitation. None of them are mandatory — add them when the vanilla system fails in the specific way they fix.

HyDE: Hypothetical Document Embeddings

Problem: query embeddings and document embeddings live in slightly different parts of vector space (queries are short, documents are long). Solution: use the LLM to generate a hypothetical answer to the query, then embed that hypothetical answer for retrieval. Hypothetical answers are document-shaped — so they match real documents better.

python
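# HyDE: embed a hallucinated answer, not the query
# Assumes llm is a LangChain chat model and embeddings is the same
# embedding model used at indexing time (e.g., OpenAIEmbeddings).
hypo_answer = llm.invoke(f"Write a short passage that answers: {query}").content
query_vector = embeddings.embed_query(hypo_answer)  # embed the hypothesis, not the query
results = vectorstore.similarity_search_by_vector(query_vector, k=10)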

RAPTOR — Recursive Abstractive Processing for Tree-Organized Retrieval

Problem: individual chunks lack global context. A chunk about a product feature doesn't know it's part of a larger enterprise offering. Solution: cluster similar chunks, generate LLM summaries of each cluster, embed the summaries, and add them to the index. Build this tree recursively. Now you can retrieve at multiple granularities — fine-grained chunk OR high-level summary depending on query type.

Agentic RAG

Problem: one retrieval step is often insufficient. A question may require looking up three separate facts, combining them, then verifying. Solution: give an LLM agent the vector store as a tool. The agent decides when to retrieve, what query to use, and can iterate (retrieve → read → decide to retrieve again with a refined query). This is RAG inside a ReAct loop.
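The retrieve, read, decide loop can be sketched with stub functions. Everything here is a toy stand-in: `retrieve` plays the role of the vector store tool, `next_query` plays the role of the LLM deciding what to look up next, and the knowledge base entries are invented for the example.

```python
from typing import Callable, Optional


def agentic_retrieve(question: str,
                     retrieve: Callable[[str], list[str]],
                     next_query: Callable[[str, list[str]], Optional[str]],
                     max_steps: int = 3) -> list[str]:
    """Collect notes over several retrieval rounds until the agent stops asking."""
    notes: list[str] = []
    query: Optional[str] = question
    for _ in range(max_steps):
        notes.extend(retrieve(query))
        query = next_query(question, notes)  # None means "I have enough"
        if query is None:
            break
    return notes


# Toy knowledge base and toy decision policy (hypothetical data).
KB = {
    "acme pro tier": ["Acme Pro is sold under the enterprise license."],
    "enterprise refund window": ["Enterprise licenses have a 30-day refund window."],
}


def toy_retrieve(query: str) -> list[str]:
    return KB.get(query.lower(), [])


def toy_next_query(question: str, notes: list[str]) -> Optional[str]:
    if not notes:
        return "acme pro tier"  # first hop: find out what the product is
    if not any("refund" in note for note in notes):
        return "enterprise refund window"  # second hop: look up the policy
    return None


notes = agentic_retrieve("What is the refund window for Acme Pro?", toy_retrieve, toy_next_query)
print(notes)
```

The `max_steps` cap matters in practice: without it, an agent that never judges its notes sufficient would keep retrieving and keep spending tokens.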

Self-RAG

Self-RAG trains the LLM itself to decide when retrieval is necessary, evaluate the retrieved chunks' relevance, and critique its own generated answer for faithfulness. It does this with special reflection tokens (such as [Retrieve], [Relevant], [Supported]) that the model learns to emit during fine-tuning. The model becomes its own RAG controller.

Method	Solves	Cost
HyDE	Query-document mismatch	1 extra LLM call per query
RAPTOR	Missing global context	Expensive offline indexing
Agentic RAG	Single-hop limitations	Multiple retrieval calls
Self-RAG	Hallucination + irrelevant retrieval	Fine-tuning required

HyDE improves retrieval by embedding what instead of the raw query?

A keyword-extracted version of the query
A longer, reformulated version of the same query
A hypothetical answer to the query, generated by an LLM

Chapter 9: Interactive RAG Pipeline

This is the full system in action. Type a query, watch it flow through retrieval, reranking, and generation. Adjust top-K and toggle the reranker to feel the difference each component makes.

Live RAG Pipeline Simulator

Choose a query, configure top-K and reranking, then click Run Pipeline to trace the full flow step by step.

Chapter 10: Connections

RAG is not a terminal architecture — it's the retrieval layer in a larger system. Here's how it connects to adjacent ideas you'll encounter as you go deeper.

RAG → Agents

In a ReAct agent, the vector store is just another tool: search_knowledge_base(query). The agent decides when to call it, can call it multiple times with refined queries, and can combine retrieved facts with other tool outputs (web search, code execution). This is agentic RAG — the retrieval is no longer a fixed pipeline step but a dynamic capability.

See: Agents & Tool Use

RAG → Multimodal

Text is not the only modality you might want to retrieve. Multimodal RAG indexes images, PDFs with figures, audio transcripts, and video frames. Embedding models like CLIP embed images and text into the same space — enabling "retrieve images relevant to this text query." The pipeline is identical; only the embedding model and the chunk definition change.

RAG → Enterprise Production

Deploying RAG at enterprise scale introduces new concerns: access control (who can see which documents?), incremental indexing (re-embed changed docs without re-indexing everything), multi-tenant isolation (company A cannot retrieve company B's data), and observability (log every retrieval + generation for debugging and compliance). These are engineering problems, not ML problems — but they're what makes or breaks a production system.
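Incremental indexing, for instance, can be driven by content hashes: only documents whose text changed get re-embedded. The sketch below shows the planning step under the assumption that the index stores one hash per document ID; `plan_reindex` and the sample documents are hypothetical.

```python
import hashlib


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(documents: dict[str, str],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Compare current docs with stored hashes; return (ids to embed, ids to delete)."""
    to_embed = [doc_id for doc_id, text in documents.items()
                if indexed_hashes.get(doc_id) != content_hash(text)]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in documents]
    return to_embed, to_delete


# Hypothetical state: "a" unchanged, "b" edited, "c" new, "d" removed at the source.
indexed = {"a": content_hash("intro"), "b": content_hash("old policy"), "d": content_hash("retired")}
current = {"a": "intro", "b": "new policy", "c": "fresh page"}
print(plan_reindex(current, indexed))  # (['b', 'c'], ['d'])
```

Deleting vectors for removed documents matters as much as adding new ones; otherwise the stale-data failure mode creeps back in through chunks that no longer exist at the source.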

What	When to add it
Hybrid retrieval	Queries contain product codes, names, or exact strings
Cross-encoder reranker	Retrieval precision is low; top-1 chunk is often wrong
HyDE	Query-doc register mismatch (casual queries, formal docs)
RAPTOR hierarchical index	Questions require synthesizing across multiple sections
Agentic RAG	Multi-hop questions that require 2+ retrieval steps
Self-RAG	High-stakes domains where faithfulness is critical; willing to fine-tune

"What I cannot create, I do not understand." — Build a minimal RAG system today: LangChain + Chroma + any OpenAI key. Load 10 PDFs. Ask it questions. Find where it breaks. That breakdown is your next optimization target. The architecture becomes real the moment you debug it.

In a production RAG system, a new document is added to the knowledge base. What must happen to make it retrievable?

Nothing: the LLM will automatically incorporate new documents
The LLM must be fine-tuned on the new document
The document must be chunked, embedded, and added to the vector store index
