LLMs hallucinate on your private data. RAG gives them a cheat sheet — fetch the right facts first, then generate. This is how AI products actually work.
You deploy a chatbot on your company's 10,000-page documentation site. A user asks: "What's the refund policy for enterprise licenses purchased after Q3 2024?" Your LLM confidently answers — incorrectly. It has never seen your docs. It's making things up. This is hallucination: the model generates plausible-sounding text that has no grounding in actual fact.
The root cause is architectural. An LLM is a function that maps text to text, parameterized by weights frozen at training time. It has no way to look things up. It can only recall what was baked into its weights during pretraining — which didn't include your internal wiki, your latest product specs, or yesterday's earnings call.
The key insight: you don't need the model to memorize everything. You just need it to read the right passage at the right moment. A student who can look things up during an exam beats one who memorized the wrong chapters.
Retrieval-Augmented Generation (RAG) was formalized by Lewis et al. (2020) at Facebook AI Research, but the intuition is older: any system that feeds retrieved snippets to a generation model is doing a form of RAG. What changed is the quality of both retrieval (via dense vector search) and generation (via LLMs). Together they can produce answers that are both grounded and fluent.
RAG is not one algorithm — it's a pipeline of six steps. Each step has its own literature, trade-offs, and failure modes. Understanding the pipeline as a whole is the prerequisite for fixing any single part of it.
Before you can retrieve anything, you need to turn raw documents into searchable chunks with semantic vector representations. This is the indexing pipeline: load → chunk → embed → store.
Documents come in many formats: PDF, DOCX, HTML, Markdown, CSV, database rows. Each needs a parser that extracts clean text while preserving structure cues (headings, table relationships). Libraries like LangChain's document loaders and LlamaIndex's readers abstract this, but the output quality depends entirely on parser quality. A PDF with scanned images needs OCR; a DOCX with embedded tables needs special handling.
You can't embed a 100-page manual as one unit: the vector would average over everything and represent nothing specific. Chunking splits documents into semantically coherent passages, each independently embeddable. The three main strategies are:
| Strategy | How | Best For |
|---|---|---|
| Fixed-size | Split every N tokens; overlap last M tokens | Uniform docs (logs, articles) |
| Semantic / recursive | Split on paragraphs, then sentences; keep natural boundaries | Structured docs (manuals, code) |
| Hierarchical | Keep parent-child: section → paragraph → sentence | RAPTOR, multi-granularity retrieval |
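The fixed-size row of the table can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: `chunk_fixed` is a hypothetical helper, and it treats whitespace-separated words as "tokens", whereas a real pipeline would count tokens with the embedding model's tokenizer.

```python
def chunk_fixed(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`; each window repeats the last
    `overlap` tokens of the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

# Assumption for illustration: whitespace-separated words stand in for tokens.
words = "the quick brown fox jumps over the lazy dog".split()
chunks = chunk_fixed(words, size=4, overlap=1)
for chunk in chunks:
    print(" ".join(chunk))
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.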
An embedding model maps text to a dense vector where semantically similar texts are close in cosine distance. Common choices: OpenAI text-embedding-3-small (1536-dim, API), all-MiniLM-L6-v2 (384-dim, local, fast), bge-large-en (BAAI, strong retrieval-focused model).
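To make "close in cosine distance" concrete, here is a small sketch of cosine similarity in plain Python. The three-dimensional vectors are hand-made toy values chosen for illustration, not the output of any real embedding model.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

# Toy vectors (assumed values): two related texts and one unrelated text.
refund_policy = [0.9, 0.1, 0.0]
money_back = [0.8, 0.2, 0.1]
weather_report = [0.0, 0.1, 0.9]

print(cosine_similarity(refund_policy, money_back))      # high: similar meaning
print(cosine_similarity(refund_policy, weather_report))  # low: unrelated
```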
```python
# Offline indexing pipeline
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# 1. Chunk. Note: chunk_size and chunk_overlap count characters by default,
#    not tokens. Use RecursiveCharacterTextSplitter.from_tiktoken_encoder(...)
#    if you want token-based sizes.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", " "]
)
chunks = splitter.split_documents(raw_docs)  # raw_docs: list[Document] from your loader

# 2. Embed + 3. Store, all in one call with Chroma
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)
# Each chunk is now: { text, metadata, embedding_vector }
```
Chunk size is a trade-off: smaller chunks give more specific, precise vectors, while larger chunks carry more context but produce noisier vectors.
At query time, you have a user question and a vector store with millions of embedded chunks. Retrieval means: find the K chunks most semantically relevant to this query in milliseconds. This is harder than it sounds.
The user's query is embedded with the same model used to embed the documents. This is critical — if you indexed with text-embedding-3-small, you must retrieve with text-embedding-3-small. Different models produce incompatible vector spaces.
Brute-force cosine similarity over 10 million vectors can take seconds per query. Approximate nearest neighbor (ANN) algorithms (HNSW, IVF, ScaNN) trade a small accuracy loss for a speedup of roughly 100-1000x by organizing the vectors so that a query only examines a small neighborhood of the index. HNSW (Hierarchical Navigable Small World) is the default in many vector stores: it builds a layered graph in which each node links to its nearest neighbors, with sparse upper layers for coarse navigation and a dense bottom layer for fine-grained search.
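For reference, this is roughly what the exact (brute-force) search that ANN approximates looks like. It is a sketch with toy data: `exact_top_k` and `normalize` are hypothetical helpers, and the vectors are made up. Every query scores every stored vector, which is why the cost grows linearly with index size.

```python
import heapq
import math

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def exact_top_k(query: list[float], index: dict[str, list[float]], k: int):
    """Score every stored unit vector against the query; return the k best
    as (doc_id, score) pairs, highest score first."""
    q = normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, vec)), doc_id)
        for doc_id, vec in index.items()
    )
    return [(doc_id, score) for score, doc_id in heapq.nlargest(k, scored)]

# Toy index (assumed values); a real one would hold millions of vectors.
index = {
    "refunds": normalize([0.9, 0.1, 0.0]),
    "billing": normalize([0.7, 0.3, 0.1]),
    "weather": normalize([0.0, 0.1, 0.9]),
}
hits = exact_top_k([1.0, 0.0, 0.0], index, k=2)
print(hits)
```

An ANN index returns (almost always) the same top results while visiting only a small fraction of the stored vectors.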
| Method | How | Strength | Weakness |
|---|---|---|---|
| Dense (bi-encoder) | Embed both query + doc; cosine similarity | Semantic: finds paraphrases, synonyms | Misses exact keyword matches |
| Sparse (BM25) | TF-IDF weighted token overlap | Exact terms: product codes, names | No semantic understanding |
| Hybrid | Fuse dense + sparse rankings (e.g. RRF) | Best of both worlds | Two indexes to maintain |
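Reciprocal Rank Fusion (RRF), the fusion step named in the hybrid row, works on ranks rather than raw scores, so dense and sparse results can be combined without putting their scores on a common scale. A minimal sketch follows; the function name and document IDs are illustrative, and `k=60` is a commonly used default for the smoothing constant.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc IDs. Each list contributes
    1 / (k + rank) to a document's score, with rank starting at 1."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results from the two retrievers for the same query.
dense_ranking = ["doc_a", "doc_b", "doc_c"]
sparse_ranking = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([dense_ranking, sparse_ranking])
print(fused)  # ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Documents that appear near the top of both lists (`doc_a`, `doc_c`) outrank documents that only one retriever found.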
```python
# Hybrid retrieval: dense + sparse, fused with RRF
from langchain_community.retrievers import BM25Retriever
from langchain.retrievers import EnsembleRetriever

# chunks and vectorstore come from the indexing step
bm25 = BM25Retriever.from_documents(chunks, k=10)
dense = vectorstore.as_retriever(search_kwargs={"k": 10})

# EnsembleRetriever applies RRF internally
hybrid = EnsembleRetriever(
    retrievers=[bm25, dense],
    weights=[0.4, 0.6]  # tune based on your domain
)
results = hybrid.invoke("What is the refund policy?")
# returns: list[Document] sorted by fused relevance
```
Which retrieval method wins depends on the query type: semantic queries favor dense retrieval, while exact-term queries favor sparse retrieval.
You retrieved the top-20 chunks with fast ANN search. Fast is good, but ANN search runs over bi-encoder embeddings: the query and each document are encoded independently. That means the model never sees them side-by-side. It's like scoring job candidates by their LinkedIn headline alone.
A cross-encoder reranker does something slower but smarter: it takes the query and a single document concatenated together — "[CLS] query [SEP] document [SEP]" — and outputs a single relevance score. The model can attend across both texts simultaneously. It's like reading the candidate's full application together with the job description.
```python
# Cross-encoder reranking with sentence-transformers
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

# query: the user's question (str)
# retrieved: list[Document] from the bi-encoder step (top-20)
pairs = [(query, doc.page_content) for doc in retrieved]
scores = reranker.predict(pairs)  # shape: (20,), one relevance score per pair

# Sort by score descending, keep top-5 for context.
# key= avoids comparing Document objects when two scores tie.
ranked = sorted(zip(scores, retrieved), key=lambda pair: pair[0], reverse=True)
top_docs = [doc for _, doc in ranked[:5]]
```
A hosted rerank endpoint does the same thing. Pass your query + list of retrieved texts; get back relevance scores in one API call. Also useful: mixedbread-ai/mxbai-rerank-large-v1 (local, strong, Apache 2.0 license).

You have the top-K retrieved and reranked chunks. Now you need to turn them into a grounded answer. This is where the LLM does what it's actually good at: reading, synthesizing, and writing.
The prompt has three parts: a system instruction (your persona, tone, and citation rules), the retrieved context (the chunks, each labeled with a source), and the user query. The context goes between the system and the query so the model reads it before the question.
```python
def build_prompt(query: str, docs: list[str], sources: list[str]) -> str:
    context_block = ""
    for i, (doc, src) in enumerate(zip(docs, sources)):
        context_block += f"[{i+1}] ({src})\n{doc}\n\n"
    return f"""You are a helpful assistant. Answer ONLY using the provided context.
If the context does not contain the answer, say "I don't have enough information to answer that."
Cite sources as [1], [2], etc.

CONTEXT:
{context_block}
QUESTION: {query}

ANSWER:"""
```
Every token in the context costs: latency, money, and attention. The LLM has a fixed context window (4K to 128K tokens). You must fit: the system prompt (~200 tokens), retrieved chunks (~300 tokens each × K), the query (~50 tokens), and leave room for the answer (~500 tokens). With a 4K window, you can fit roughly 10 chunks of 300 tokens each.
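The budget arithmetic above is easy to put in a small helper. This sketch uses the paragraph's own rough estimates as defaults; `max_chunks` is a hypothetical function, and real token counts should come from your model's tokenizer.

```python
def max_chunks(window: int, system: int = 200, query: int = 50,
               answer: int = 500, chunk: int = 300) -> int:
    """How many retrieved chunks fit once the fixed costs are reserved."""
    available = window - system - query - answer
    return max(available // chunk, 0)

# 4096 - 200 - 50 - 500 = 3346 tokens left; 3346 // 300 = 11 chunks
print(max_chunks(4096))     # 11
print(max_chunks(128_000))  # 424
```

With a nominal 4K window this gives 11 chunks, in line with the rough figure of 10. Fitting that many is an upper bound, not a target: fewer, better chunks usually serve the model better than a full window.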
A RAG system that makes up an answer when retrieval fails is worse than no system at all. Explicitly instruct the model: if the retrieved context does not contain sufficient information, say so. You can reinforce this with a threshold: if no chunk scores above a cutoff (for example 0.5 on a relevance score normalized to 0-1; calibrate it for your scorer), decline to answer and return a "not found" response rather than hallucinating.
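A threshold gate of this kind might look like the sketch below. Everything here is illustrative: `answer_or_decline` is a hypothetical helper, `generate` stands in for your LLM call, and the scores are assumed to be normalized to the 0-1 range (raw cross-encoder outputs often are not).

```python
from typing import Callable

NOT_FOUND = "I don't have enough information to answer that."

def answer_or_decline(
    scored_chunks: list[tuple[float, str]],
    generate: Callable[[list[str]], str],
    threshold: float = 0.5,
    top_k: int = 5,
) -> str:
    """Call the generator only if at least one chunk clears the threshold."""
    relevant = [(score, text) for score, text in scored_chunks if score > threshold]
    if not relevant:
        return NOT_FOUND  # skip the LLM call entirely
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    context = [text for _, text in relevant[:top_k]]
    return generate(context)

# Stand-in generator for illustration: a real one would prompt an LLM.
echo = lambda context: " | ".join(context)

print(answer_or_decline([(0.2, "office hours"), (0.4, "parking")], echo))
print(answer_or_decline([(0.9, "Enterprise refunds: 30 days"), (0.3, "parking")], echo))
```

Declining before generation also saves the cost of an LLM call on queries the index cannot answer.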
How do you know if your RAG system is good? "It seems to work" is not an answer. You need metrics that measure the three things that can go wrong independently: retrieval might fetch the wrong chunks, the model might ignore the right chunks, and the answer might not address the question.
RAGAS (Retrieval Augmented Generation Assessment) is a widely used evaluation framework. The three metrics below each isolate one failure point, and each relies on an LLM judge:
| Metric | Measures | Formula intuition |
|---|---|---|
| Faithfulness | Does the answer come from the context? (no hallucination) | # claims in answer supported by context / total claims |
| Answer Relevance | Does the answer address the question? | Mean cosine sim between the original question and questions an LLM generates from the answer |
| Context Recall | Did retrieval find all needed information? | # ground-truth sentences entailed by retrieved context / total |
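The ratio-style metrics in the table reduce to simple arithmetic once a judge has labeled each claim or sentence. The sketch below shows only that final step; the boolean verdicts are made-up judge outputs, and `supported_fraction` is a hypothetical helper, not part of the RAGAS API.

```python
def supported_fraction(verdicts: list[bool]) -> float:
    """Share of items the judge marked as supported (0.0 if there are none)."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0

# Faithfulness: 3 claims in the answer, the judge finds 2 supported by context.
faithfulness_score = supported_fraction([True, True, False])

# Context recall: 2 ground-truth sentences, 1 is entailed by retrieved context.
context_recall_score = supported_fraction([True, False])

print(round(faithfulness_score, 2))  # 0.67
print(context_recall_score)          # 0.5
```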
```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_recall
from datasets import Dataset

# Build evaluation dataset.
# Column names follow the older ragas schema; newer releases may use
# different names, so check the docs for your installed version.
eval_data = {
    "question": ["What is the refund window?"],
    "answer": ["Enterprise licenses have a 30-day refund window."],
    "contexts": [["Refunds: Enterprise: 30 days. Consumer: 14 days."]],
    "ground_truth": ["30 days for enterprise licenses"]
}
ds = Dataset.from_dict(eval_data)

# Requires a configured judge LLM and embedding model
result = evaluate(ds, metrics=[faithfulness, answer_relevancy, context_recall])
# Illustrative output; exact scores vary with the judge model:
# { faithfulness: 1.0, answer_relevancy: 0.97, context_recall: 1.0 }
```
RAG fails in ways that are specific, diagnosable, and fixable — but only if you know what to look for. Here are the five most common failure modes, each with a distinct signature in your RAGAS scores.
Once you've got vanilla RAG working, there are four research-backed extensions that each solve a specific limitation. None of them are mandatory — add them when the vanilla system fails in the specific way they fix.
HyDE (Hypothetical Document Embeddings). Problem: query embeddings and document embeddings live in slightly different parts of vector space (queries are short, documents are long). Solution: use the LLM to generate a hypothetical answer to the query, then embed that hypothetical answer for retrieval. Hypothetical answers are document-shaped, so they match real documents better even when their facts are wrong.
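```python
# HyDE: embed a hallucinated answer, not the query.
# Assumes llm is a LangChain chat model and embeddings is the same
# embedding model used at indexing time (e.g. OpenAIEmbeddings).
hypo_answer = llm.invoke(f"Write a short passage that answers: {query}").content
query_vector = embeddings.embed_query(hypo_answer)  # embed the hypothesis, not the query
results = vectorstore.similarity_search_by_vector(query_vector, k=10)
```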
RAPTOR. Problem: individual chunks lack global context. A chunk about a product feature doesn't know it's part of a larger enterprise offering. Solution: cluster similar chunks, generate LLM summaries of each cluster, embed the summaries, and add them to the index. Build this tree recursively. Now you can retrieve at multiple granularities: a fine-grained chunk or a high-level summary, depending on query type.
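The recursive "summarize, then summarize the summaries" structure can be sketched as follows. This is a deliberately simplified stand-in, not the published method: it groups adjacent chunks in fixed-size batches instead of clustering them by embedding similarity, and `summarize` is a hypothetical callback where a real system would call an LLM.

```python
from typing import Callable

def build_summary_tree(
    chunks: list[str],
    summarize: Callable[[list[str]], str],
    group_size: int = 2,
) -> list[list[str]]:
    """Return levels of the tree: level 0 is the raw chunks, each higher
    level holds one summary per group of nodes from the level below."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2 so levels shrink")
    levels = [list(chunks)]
    while len(levels[-1]) > 1:
        current = levels[-1]
        # Simplification: adjacent grouping stands in for embedding clustering.
        groups = [current[i:i + group_size] for i in range(0, len(current), group_size)]
        levels.append([summarize(group) for group in groups])
    return levels

# Stand-in summarizer for illustration only.
def fake_summarize(group: list[str]) -> str:
    return "summary(" + " + ".join(group) + ")"

levels = build_summary_tree(["c1", "c2", "c3"], fake_summarize)
# Every node at every level would be embedded and added to the same index.
index_entries = [node for level in levels for node in level]
print(levels[-1][0])
```

Because leaf chunks and summaries share one index, a broad question can match a summary node while a narrow one still matches a leaf.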
Agentic RAG. Problem: one retrieval step is often insufficient. A question may require looking up three separate facts, combining them, then verifying. Solution: give an LLM agent the vector store as a tool. The agent decides when to retrieve, what query to use, and can iterate (retrieve → read → decide to retrieve again with a refined query). This is RAG inside a ReAct loop.
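The retrieve → read → refine loop can be sketched without any framework. In this illustration `search` and `next_query` are hypothetical callbacks: in a real agent, `search` would query the vector store and `next_query` would be an LLM call that either proposes a refined query or signals that it has enough evidence. The tiny knowledge base is made up.

```python
from typing import Callable, Optional

def agentic_retrieve(
    question: str,
    search: Callable[[str], list[str]],
    next_query: Callable[[str, list[str]], Optional[str]],
    max_steps: int = 3,
) -> list[str]:
    """Retrieve iteratively until next_query returns None or steps run out."""
    notes: list[str] = []
    query: Optional[str] = question
    for _ in range(max_steps):
        notes.extend(search(query))
        query = next_query(question, notes)
        if query is None:  # the "agent" decided it has enough evidence
            break
    return notes

# Toy knowledge base and rule-based stand-ins for the LLM decisions.
KB = {
    "who owns the billing service": ["The billing service is owned by Team Orion."],
    "Team Orion on-call contact": ["Team Orion's on-call alias is orion-oncall."],
}

def search(query: str) -> list[str]:
    return KB.get(query, [])

def next_query(question: str, notes: list[str]) -> Optional[str]:
    joined = " ".join(notes)
    if "Team Orion" in joined and "on-call alias" not in joined:
        return "Team Orion on-call contact"  # second hop
    return None

notes = agentic_retrieve("who owns the billing service", search, next_query)
print(notes)
```

The `max_steps` cap matters in practice: without it, an agent that never feels confident keeps retrieving and keeps spending.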
Self-RAG trains the LLM itself to decide when retrieval is necessary, evaluate the retrieved chunks' relevance, and critique its own generated answer for faithfulness — all using special [Retrieve], [Relevant], [Supported] tokens inserted during fine-tuning. The model becomes its own RAG controller.
| Method | Solves | Cost |
|---|---|---|
| HyDE | Query-document mismatch | 1 extra LLM call per query |
| RAPTOR | Missing global context | Expensive offline indexing |
| Agentic RAG | Single-hop limitations | Multiple retrieval calls |
| Self-RAG | Hallucination + irrelevant retrieval | Fine-tuning required |
RAG is not a terminal architecture — it's the retrieval layer in a larger system. Here's how it connects to adjacent ideas you'll encounter as you go deeper.
In a ReAct agent, the vector store is just another tool: search_knowledge_base(query). The agent decides when to call it, can call it multiple times with refined queries, and can combine retrieved facts with other tool outputs (web search, code execution). This is agentic RAG — the retrieval is no longer a fixed pipeline step but a dynamic capability.
See: Agents & Tool Use
Text is not the only modality you might want to retrieve. Multimodal RAG indexes images, PDFs with figures, audio transcripts, and video frames. Embedding models like CLIP embed images and text into the same space — enabling "retrieve images relevant to this text query." The pipeline is identical; only the embedding model and the chunk definition change.
Deploying RAG at enterprise scale introduces new concerns: access control (who can see which documents?), incremental indexing (re-embed changed docs without re-indexing everything), multi-tenant isolation (company A cannot retrieve company B's data), and observability (log every retrieval + generation for debugging and compliance). These are engineering problems, not ML problems — but they're what makes or breaks a production system.
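Access control and tenant isolation are usually enforced as a metadata filter applied before (or inside) the similarity search, so that forbidden chunks never reach the prompt. A minimal sketch, assuming each chunk carries `tenant` and `allowed_roles` metadata; the `Chunk` class and field names are illustrative, not a vector store API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    text: str
    tenant: str
    allowed_roles: frozenset

def visible_chunks(chunks: list[Chunk], tenant: str, roles: set) -> list[Chunk]:
    """Keep only chunks from the caller's tenant that share a role with the caller."""
    return [
        c for c in chunks
        if c.tenant == tenant and c.allowed_roles & roles
    ]

# Made-up data for two tenants.
store = [
    Chunk("Acme refund policy", "acme", frozenset({"support", "finance"})),
    Chunk("Acme salary bands", "acme", frozenset({"hr"})),
    Chunk("Globex refund policy", "globex", frozenset({"support"})),
]

candidates = visible_chunks(store, tenant="acme", roles={"support"})
print([c.text for c in candidates])  # ['Acme refund policy']
```

Filtering before ranking, rather than after, also keeps restricted chunks from crowding permitted ones out of the top-K.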
| What | When to add it |
|---|---|
| Hybrid retrieval | Queries contain product codes, names, or exact strings |
| Cross-encoder reranker | Retrieval precision is low; top-1 chunk is often wrong |
| HyDE | Query-doc register mismatch (casual queries, formal docs) |
| RAPTOR hierarchical index | Questions require synthesizing across multiple sections |
| Agentic RAG | Multi-hop questions that require 2+ retrieval steps |
| Self-RAG | High-stakes domains where faithfulness is critical; willing to fine-tune |