Retrieval-Augmented Generation

Multimodal RAG
Beyond Plain Text

Real documents have charts, tables, diagrams, and screenshots. Here's how to build RAG systems that actually read them.

Prerequisites: Familiarity with basic RAG (see our RAG lesson). That's it.
9 chapters, 8+ simulations, 1 prerequisite

Chapter 0: Your RAG System Is Half Blind

You've built a RAG pipeline. You've chunked PDFs, embedded the text, stuffed everything into a vector database. It works great — until someone asks about the chart on page 7.

The chart shows quarterly revenue growing 40% year-over-year. The text on that page says "see Figure 3 for details." Your system retrieves the text, finds the reference, and then... returns nothing useful. The actual data lived inside the image. Your pipeline never saw it.

The 40% problem: Studies of enterprise documents consistently find that 30–50% of the information in technical reports, financial filings, scientific papers, and product manuals lives in non-text form: figures, charts, tables, diagrams, screenshots. A text-only RAG system is blind to all of it.

This isn't a niche edge case. It's the default state of real documents. Think about the last PDF you read that actually mattered:

[Interactive demo: Information Loss Visualizer. Select a document type to compare what a text-only RAG system retrieves with what the document actually contains.]

The solution is multimodal RAG: a retrieval-augmented generation system that can ingest, index, retrieve, and reason over text, images, tables, and diagrams together. By the end of this lesson, you'll know how to build one.

One-sentence preview: Multimodal RAG extends classic RAG by (1) parsing documents into text + image + table chunks, (2) embedding each modality with the right encoder, (3) querying all indexes and fusing the results, and (4) passing retrieved visual content directly to a vision-capable LLM.
Quick check: Why does text-only RAG fail on many real documents?

Chapter 1: The Information Loss Problem

Text-only RAG has one job: turn text into embeddings, retrieve similar text. The moment you hit a PDF with real structure, three distinct failure modes appear.

Failure Mode 1: Images Are Invisible

When you extract text from a PDF with a tool like PyPDF2, images are simply skipped. The page with the architecture diagram becomes a blank string or a fragment like "Figure 2 shows the proposed method." Your vector database gets a useless placeholder. Queries about what's in Figure 2 return nothing, or worse, return adjacent text that talks about the figure rather than describing its content.

Invisible by default: Standard PDF text extraction (pdfplumber, pypdf) yields zero bytes for every image in the document. The image bytes are in the PDF, but they're in a compressed binary stream that text parsers don't touch.

Failure Mode 2: OCR Loses Layout

You might think: "just OCR the whole page." OCR (Optical Character Recognition) converts image pixels to text, but naive OCR discards spatial layout. A multi-column research paper becomes interleaved garbage: the engine reads straight across the page, so each line of column one is merged with the line beside it in column two, mixing unrelated sentences. A table becomes a flat list of numbers with no row/column structure. The semantic relationships encoded by position are gone.
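To make the interleaving concrete, here is a minimal sketch using invented fragment positions. The `fragments` list and the fixed `column_split_x` boundary are illustrative assumptions, not output from a real OCR engine; real layout models infer column regions rather than using a hand-picked threshold.

```python
# Toy OCR output: (x, y, text) for line fragments on a two-column page.
# Coordinates are invented for illustration; y grows downward.
fragments = [
    (50, 100, "Transformers process"),
    (350, 100, "Our dataset contains"),
    (50, 120, "tokens in parallel."),
    (350, 120, "10,000 annotated pages."),
]

def naive_order(frags):
    """Read strictly top-to-bottom, then left-to-right (ignores columns)."""
    return [text for _, _, text in sorted(frags, key=lambda f: (f[1], f[0]))]

def column_aware_order(frags, column_split_x=300):
    """Read the left column fully, then the right column.

    column_split_x is a hypothetical, hand-picked boundary.
    """
    left = sorted((f for f in frags if f[0] < column_split_x), key=lambda f: f[1])
    right = sorted((f for f in frags if f[0] >= column_split_x), key=lambda f: f[1])
    return [text for _, _, text in left + right]

naive_text = " ".join(naive_order(fragments))
fixed_text = " ".join(column_aware_order(fragments))
print(naive_text)   # sentences from the two columns are interleaved
print(fixed_text)   # each column reads as coherent prose
```

The naive ordering splices the two columns together mid-sentence, which is exactly the text a downstream embedder would receive.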

[Interactive demo: Layout Destruction Demo. Toggles between a structured document layout and what naive OCR produces, showing how position encodes meaning.]

Failure Mode 3: Tables Become Gibberish

A table in a PDF is usually rendered as positioned text — each cell is an independent text object at a specific (x, y) coordinate on the page. Extraction tools read these in left-to-right, top-to-bottom order, but the row/column relationships are implicit in the coordinates, not in the text stream. A simple 3×3 table like:

| Model | Accuracy | Latency |
|---|---|---|
| GPT-4V | 91.2% | 2.1s |
| Claude 3 | 89.7% | 1.8s |

becomes flat text: "Model Accuracy Latency GPT-4V 91.2% 2.1s Claude 3 89.7% 1.8s" — a string that loses all relational meaning. Embedding this string and retrieving it will match queries about "accuracy" or "latency" only by accident, not because the embedding understands row/column structure.
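The difference between the flat string and the recoverable structure can be sketched in a few lines. The cell coordinates below are invented, and `rebuild_rows` with its `y_tolerance` banding is a deliberately naive, hypothetical helper; production parsers use trained table-structure models instead.

```python
from collections import defaultdict

# Positioned text objects as a PDF might store them: (x, y, text).
# The coordinates are invented for illustration.
cells = [
    (40, 10, "Model"), (140, 10, "Accuracy"), (240, 10, "Latency"),
    (40, 30, "GPT-4V"), (140, 30, "91.2%"), (240, 30, "2.1s"),
    (40, 50, "Claude 3"), (140, 50, "89.7%"), (240, 50, "1.8s"),
]

# What naive extraction produces: one flat string in reading order.
flat = " ".join(text for _, _, text in sorted(cells, key=lambda c: (c[1], c[0])))

def rebuild_rows(cells, y_tolerance=5):
    """Group cells into rows by banding their y coordinate, then sort each row by x."""
    rows = defaultdict(list)
    for x, y, text in cells:
        rows[round(y / y_tolerance)].append((x, text))
    return [[text for _, text in sorted(row)] for _, row in sorted(rows.items())]

header, *body = rebuild_rows(cells)
records = [dict(zip(header, row)) for row in body]

print(flat)
print(records)
```

Once the rows are rebuilt, a question like "what is Claude 3's latency?" maps to a single lookup (`records[1]["Latency"]`), whereas the flat string offers no such handle.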

The root cause: Text embeddings encode semantic meaning of prose. They were trained on sentences and paragraphs. They have no concept of spatial position, table rows, or visual salience. Feeding them image-derived text or table fragments is like asking a historian to read a circuit diagram.
Quick check: Why does OCR alone fail to preserve table structure?

Chapter 2: Document Parsing

Before you can retrieve anything, you need to extract structured content from raw documents. Three tools dominate this space, each with different strengths.

PyMuPDF (fitz) — Fast and Low-Level

PyMuPDF is a Python binding to the MuPDF rendering engine. It gives you raw access to PDF internals: every text block with its bounding box, every embedded image as raw bytes (which load directly into a PIL Image), and every vector drawing as path data. It's fast, it's precise, and it requires you to assemble structure yourself.

python
import fitz  # pip install pymupdf
from PIL import Image
import io

doc = fitz.open("report.pdf")

for page_num, page in enumerate(doc):
    # --- Extract text blocks with bounding boxes ---
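    blocks = page.get_text("blocks")  # list of (x0,y0,x1,y1,text,block_no,block_type)
    text_blocks = [b for b in blocks if b[6] == 0]   # type 0 = text
    # type 1 = image; this list may be empty under the default extraction flags,
    # so rely on page.get_images() below for the actual image data
    image_blocks = [b for b in blocks if b[6] == 1]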

    # --- Extract actual image bytes ---
    for img_info in page.get_images():
        xref = img_info[0]                          # cross-reference number
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]             # raw bytes (PNG/JPEG)
        img = Image.open(io.BytesIO(img_bytes))     # → PIL Image
        # Now you can: caption it with a VLM, embed with CLIP, run OCR

    # --- Render whole page as image (for ColPali) ---
    mat = fitz.Matrix(2.0, 2.0)                   # 2x scale = 144 DPI
    pix = page.get_pixmap(matrix=mat)
    page_image = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

Unstructured — Structure-Aware Extraction

Unstructured is a library that attempts to infer document structure automatically. It classifies each element as Title, NarrativeText, Table, Image, ListItem, etc., using layout models and heuristics. This is more useful for downstream RAG because you know what kind of thing each chunk is.

python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    extract_images_in_pdf=True,        # saves images to disk
    infer_table_structure=True,         # reconstructs table HTML
    strategy="hi_res",                  # uses layout detection model (slower but better)
    extract_image_block_output_dir="./images",
)

for el in elements:
    print(el.category)        # "Title" | "NarrativeText" | "Table" | "Image"
    print(el.text)            # extracted text content
    print(el.metadata)        # page_number, coordinates, filename, etc.
    if el.category == "Table":
        print(el.metadata.text_as_html)   # table reconstructed as HTML string

Docling — IBM's Modern Parser

Docling (IBM Research, 2024) is purpose-built for document AI. It uses a suite of models: a layout detection model trained on the DocLayNet dataset, table structure recognition (TableFormer), and OCR. Output is a rich DoclingDocument object with a full document hierarchy. It often handles complex layouts that Unstructured struggles with.

python
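from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import TableItem

# Picture crops are only kept if the pipeline is told to generate them
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True
pipeline_options.images_scale = 2.0   # render at 2x for sharper crops

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("report.pdf")
doc = result.document

# Iterate over all elements in reading order
for item, level in doc.iterate_items():
    print(item.label)     # SectionHeader | Text | Table | Picture | etc.
    if isinstance(item, TableItem):
        df = item.export_to_dataframe()  # → pandas DataFrame with proper rows/cols

# Export images: Docling renders each page and crops the figures
for pic in doc.pictures:
    image = pic.get_image(doc)   # → PIL Image (None if no image was generated)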
[Interactive demo: Parser Comparison. Select a parser to see what it does well and where it struggles.]

| Tool | Speed | Table Quality | Image Extraction | Best For |
|---|---|---|---|---|
| PyMuPDF | Very fast | Raw coords only | Excellent (raw bytes) | Custom pipelines, page rendering |
| Unstructured | Medium | Good (HTML) | Good (saves to disk) | Quick prototyping, mixed docs |
| Docling | Slow | Best (TableFormer) | Best (cropped) | Production, complex layouts |

Quick check: You need to extract tables from a complex financial report with merged cells and multi-row headers. Which parser is most likely to handle this correctly?

Chapter 3: Vision Embeddings for Retrieval

You've extracted images from your documents. Now you need to embed them — turn each image into a dense vector so you can retrieve images by query. Two approaches dominate, with very different trade-offs.

CLIP: Image + Text in One Space

CLIP (Contrastive Language-Image Pretraining, OpenAI 2021) trains a vision encoder and a text encoder jointly so that matching image-text pairs end up close in the same vector space. This is exactly what you want for retrieval: embed a query as text, embed images as vectors, find nearest neighbors.

python
from PIL import Image
import torch
from transformers import CLIPProcessor, CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed an image
img = Image.open("figure_3.png")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**inputs)  # shape: [1, 768]
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)  # L2 normalize

# Embed the query
inputs = processor(text="quarterly revenue bar chart", return_tensors="pt")
with torch.no_grad():
    txt_emb = model.get_text_features(**inputs)  # shape: [1, 768]
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Similarity (cosine, since both are L2-normalized)
similarity = (txt_emb @ img_emb.T).item()   # scalar, [-1, 1]

CLIP works well for natural images, photographs, and screenshots of interfaces. It struggles with dense technical figures — a circuit schematic and a genome map look similar to CLIP because it was trained on photo-caption pairs from the web, not scientific diagrams.
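The snippet above scores one query against one image. Retrieval is the same computation repeated over an index, followed by a sort. The sketch below uses invented 3-dimensional vectors as stand-ins for 768-dimensional CLIP embeddings, and `top_k_by_cosine` is a hypothetical helper; a vector database does this with an approximate nearest-neighbor index rather than a full scan.

```python
import math

def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def top_k_by_cosine(query_vec, image_vecs, k=2):
    """Return the k (image_id, similarity) pairs most similar to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for image_id, vec in image_vecs.items():
        v = l2_normalize(vec)
        # Dot product of unit vectors equals cosine similarity
        scored.append((image_id, sum(a * b for a, b in zip(q, v))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors, invented for illustration only.
image_index = {
    "figure_3.png": [0.9, 0.1, 0.0],
    "team_photo.png": [0.0, 0.2, 0.9],
    "figure_5.png": [0.7, 0.6, 0.1],
}
query = [1.0, 0.2, 0.0]

results = top_k_by_cosine(query, image_index, k=2)
for image_id, sim in results:
    print(image_id, round(sim, 3))
```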

ColPali: Embed the Whole Page

ColPali (Faysse et al., 2024) takes a radically different approach. Instead of extracting text and images separately, it renders the entire PDF page as an image and embeds it with a vision-language model (PaliGemma). The key idea is late interaction, borrowed from ColBERT: the query produces one embedding per token, the page image produces a grid of patch embeddings, and similarity is computed with MaxSim. For each query token, take its best-matching page patch, then sum those maxima over all query tokens.

$$\text{score}(\text{query}, \text{page}) = \sum_{q \,\in\, \text{query\_tokens}} \; \max_{p \,\in\, \text{page\_patches}} (q \cdot p)$$
python
import fitz  # pip install pymupdf
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained(
    "vidore/colpali-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

def render_page(doc, page_index, zoom=2.0):
    """Render one PDF page to a PIL image (zoom 2.0 = 144 DPI)."""
    pix = doc[page_index].get_pixmap(matrix=fitz.Matrix(zoom, zoom))
    return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)

# Render pages as images (using PyMuPDF)
doc = fitz.open("report.pdf")
page_images = [render_page(doc, i) for i in range(len(doc))]

# Embed pages: each page becomes a grid of patch embeddings
batch = processor.process_images(page_images).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch)
# page_embeddings: [num_pages, num_tokens, dim], roughly [10, 1030, 128]
#   (1024 image patches plus a few prompt tokens per page)

# Embed query
query_batch = processor.process_queries(["What is the revenue trend?"]).to(model.device)
with torch.no_grad():
    query_embedding = model(**query_batch)
# query_embedding: [1, num_query_tokens, dim]

# Score = MaxSim over patches for each query token, then sum
scores = processor.score_multi_vector(query_embedding, page_embeddings)  # [num_queries, num_pages]
best_page = scores[0].argmax().item()
Why ColPali works: However you render the page, the processor resizes it to the model's fixed input resolution (448×448 pixels for the PaliGemma checkpoint behind ColPali), and the vision encoder slices that image into a 32×32 grid of 1,024 patches. Each patch covers a 14×14 pixel region of the resized image, about 1/32 of the page's width and height: small enough to localize a specific chart, table cell, or diagram element. The MaxSim operation finds the page patch that best matches each query token, so a page scores highly when most query tokens find a strongly matching region somewhere on it.
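The MaxSim formula is small enough to write out by hand. This is a pure-Python sketch with invented 2-dimensional vectors (real ColPali embeddings are 128-dimensional tensors scored in batches); `maxsim_score` is an illustrative helper, not part of `colpali_engine`.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_tokens, page_patches):
    """Late-interaction score: for each query token take the best-matching
    patch, then sum those maxima over all query tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)

# Toy embeddings, invented for illustration.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.1, 0.8], [0.0, 0.0]]   # has a good patch for each token
page_b = [[0.9, 0.1], [0.8, 0.0], [0.7, 0.1]]   # only matches the first token

score_a = maxsim_score(query_tokens, page_a)
score_b = maxsim_score(query_tokens, page_b)
print(score_a, score_b)
```

Page A wins because each query token finds its own strong patch, while page B's three near-duplicate patches can only satisfy the first token: extra patches matching the same token add nothing to the sum.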
[Interactive demo: CLIP vs ColPali Comparison. A query-type slider (default: natural image) shows how each approach scores different page types.]

Quick check: ColPali retrieves by rendering full PDF pages as images. What is the main advantage of this approach over extract-then-embed?

Chapter 4: Hybrid Retrieval

A document has three kinds of content: prose, images, and tables. You've built separate embeddings for each. Now you need to query all three indexes at once and combine the results. This is hybrid retrieval.

The Three-Index Architecture

Each modality gets its own index, each optimized for its content type:

Text Index: text-embedding-3-large or BGE-M3. Chunks of 400 tokens from prose sections.
Image Index: CLIP-L/14 embeddings (for natural images) or ColPali page embeddings (for technical docs).
Table Index: description embeddings. Each table is summarized into a prose description, then embedded as text.

All three indexes are queried in parallel:
python
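import asyncio
from qdrant_client import AsyncQdrantClient

# The async client is required here: calls on the synchronous QdrantClient cannot be awaited.
client = AsyncQdrantClient(":memory:")

# embed_text and embed_image_query are placeholder async helpers you supply;
# each should return a flat list of floats.
async def hybrid_retrieve(query: str, top_k: int = 5):
    # Embed query for each modality in parallel
    text_emb, img_emb = await asyncio.gather(
        embed_text(query),         # e.g. 3072 dims for text-embedding-3-large
        embed_image_query(query),  # e.g. 768 dims from the CLIP text encoder
    )

    # Query all three indexes in parallel.
    # query_points is the current query API; older client versions expose
    # the same operation as search(query_vector=...).
    text_res, img_res, table_res = await asyncio.gather(
        client.query_points(collection_name="text_chunks", query=text_emb, limit=top_k),
        client.query_points(collection_name="images",      query=img_emb,  limit=top_k),
        client.query_points(collection_name="tables",      query=text_emb, limit=top_k),
    )
    # Each response wraps its hits in .points (objects with id, score, payload)
    return text_res.points, img_res.points, table_res.points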

Reciprocal Rank Fusion

Each index returns a ranked list of results. How do you merge three ranked lists into one? The standard answer is Reciprocal Rank Fusion (RRF). The intuition: an item that ranks #1 in one list and #1 in another list should score higher than an item that ranks #1 in one list but doesn't appear in the others.

$$\text{RRF\_score}(d) = \sum_{r \,\in\, \text{rankers}} \frac{1}{k + \text{rank}_r(d)}$$

Where k = 60 is a constant that prevents top-ranked items from dominating too strongly, and rank(d) is the item's rank in that ranker's list (1-indexed). Items not present in a ranker's list contribute 0 from that ranker.

python
def reciprocal_rank_fusion(ranked_lists: list[list], k: int = 60) -> list:
    """
    ranked_lists: each is a list of (doc_id, score) tuples, best-first.
    Returns merged list of (doc_id, rrf_score) sorted best-first.
    """
    scores = {}
    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Worked example:
# text_hits:  [(t1, .91), (i3, .88), (t2, .85)]   ← image appeared in text index too
# img_hits:   [(i3, .93), (i1, .80), (i2, .75)]
# table_hits: [(tb1, .89), (t1, .82), (i3, .71)]
#
# RRF scores (k=60), contributions listed as text + image + table:
# i3:  1/62 + 1/61 + 1/63 = 0.0161 + 0.0164 + 0.0159 = 0.0484  ← appears in all 3
# t1:  1/61 + 0    + 1/62 = 0.0164 + 0      + 0.0161 = 0.0325
# tb1: 0    + 0    + 1/61 = 0      + 0      + 0.0164 = 0.0164
[Interactive demo: RRF Fusion Visualizer. Sliders set an item's rank in each list (defaults: text #2, image #1, table #3) to show how the ranks combine into one fused score.]
Query routing: Not every query needs all three indexes. "What does the architecture look like?" → prioritize images. "What were the revenue figures?" → prioritize tables. A simple keyword classifier or a small LLM call can route queries to the relevant index, reducing retrieval latency and noise.
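A keyword router of the kind described can be sketched in a few lines. The keyword sets in `ROUTES` are hypothetical and would need tuning on real queries; the index names reuse the collection names from the retrieval code above. Always keeping the text index in the result is a design assumption of this sketch, not a requirement.

```python
import re

# Hypothetical keyword lists; tune them for your own corpus.
ROUTES = {
    "images": {"diagram", "architecture", "figure", "chart", "screenshot", "look"},
    "tables": {"revenue", "figures", "table", "accuracy", "latency", "compare"},
}
ALL_INDEXES = ["text_chunks", "images", "tables"]

def route_query(query: str) -> list[str]:
    """Return the index names to search, always including the text index."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    selected = [name for name, keywords in ROUTES.items() if words & keywords]
    if not selected:
        return ALL_INDEXES          # no signal: fall back to searching everything
    return ["text_chunks"] + selected

print(route_query("What does the architecture look like?"))
print(route_query("What were the revenue figures?"))
print(route_query("Who wrote this report?"))
```

Falling back to all indexes when no keyword matches keeps the router from silently dropping a modality on queries it does not recognize.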
Quick check: In Reciprocal Rank Fusion, why does an item appearing in all three ranked lists score higher than one appearing in only one?

Chapter 5: Multimodal Generation

You've retrieved the right pages, images, and table chunks. Now you need to pass them to a language model that can actually see them. This is where vision-language models (VLMs) enter the picture.

Passing Images to a VLM

Modern frontier APIs — Claude 3.5, GPT-4o, Gemini 1.5 — accept interleaved image and text inputs. The model processes both modalities in a single forward pass. From the API's perspective, you provide a list of content blocks: some are text strings, some are base64-encoded images.

python
import anthropic
import base64
from pathlib import Path

client = anthropic.Anthropic()

def image_to_b64(path: str) -> str:
    return base64.b64encode(Path(path).read_bytes()).decode()

def multimodal_rag_answer(query: str, retrieved_items: list) -> str:
    content = []

    # Build context from retrieved items (text + images interleaved)
    content.append({"type": "text", "text": "Here are the relevant document sections:\n"})

    for i, item in enumerate(retrieved_items):
        if item["type"] == "text":
            content.append({
                "type": "text",
                "text": f"[Section {i+1}, p.{item['page']}]\n{item['text']}\n"
            })
        elif item["type"] == "image":
            content.append({
                "type": "text",
                "text": f"[Figure {i+1}, p.{item['page']}]"
            })
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_to_b64(item["path"]),
                }
            })

    # The actual question
    content.append({"type": "text", "text": f"\nQuestion: {query}\nAnswer with specific references to figures and page numbers."})

    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": content}]
    )
    return resp.content[0].text
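The function above hard-codes `"image/png"`, which only holds if every retrieved figure was saved as PNG. One way to relax that, sketched here with the standard library, is to derive the media type from the file extension. `guess_media_type` is a hypothetical helper, and the accepted set lists the four image formats the Anthropic API documents for base64 images.

```python
import mimetypes

# Image formats accepted for base64 image blocks.
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/gif", "image/webp"}

def guess_media_type(path: str) -> str:
    """Map a file path to a supported media type, or raise if it is not supported."""
    media_type, _ = mimetypes.guess_type(path)
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported image type for {path!r}: {media_type}")
    return media_type

print(guess_media_type("figure_3.png"))
print(guess_media_type("scan_p7.jpg"))
```

In `multimodal_rag_answer`, the result would replace the literal: `"media_type": guess_media_type(item["path"])`. Unsupported formats such as TIFF need converting before upload.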

Prompt Design for Visual Grounding

The quality of multimodal answers depends heavily on how you prompt. Four techniques matter most:

| Technique | Why it matters | Example instruction |
|---|---|---|
| Label images explicitly | Model can cite specific figures in its answer | "[Figure 3, p.12]" before each image |
| Demand visual grounding | Forces the model to look at images, not guess from text | "Base your answer only on the provided figures and text" |
| Request uncertainty signals | Reduces hallucination when the chart is ambiguous | "If a chart is unclear, say what you can and cannot determine" |
| Describe chart type first | Primes the model's visual understanding | "This is a bar chart showing..." in your system prompt |

The grounding problem: Without explicit figure labels, VLMs often answer from training data knowledge rather than the retrieved images. "What does Figure 3 show?" with a labeled image forces the model to look. "What does the chart show?" with an unlabeled image invites confabulation from pre-training.
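These techniques can be bundled into a reusable system prompt. The sketch below is one possible wording, not a tested prompt: `build_grounding_instructions` is a hypothetical helper, and the chart-type line asks the model to name the chart type itself, which is a variation on supplying that description by hand.

```python
def build_grounding_instructions(figure_labels: list[str]) -> str:
    """Assemble a system prompt from the grounding techniques above."""
    labels = ", ".join(figure_labels) if figure_labels else "none"
    return "\n".join([
        f"You will receive document excerpts and these labeled figures: {labels}.",
        "Base your answer only on the provided figures and text.",
        "Cite the figure label and page number for every claim drawn from an image.",
        "For each figure you use, first state what type of chart or diagram it is.",
        "If a chart is unclear, say what you can and cannot determine.",
    ])

system_prompt = build_grounding_instructions(["[Figure 1, p.7]", "[Figure 2, p.12]"])
print(system_prompt)
```

The string would be passed as the `system` argument of `client.messages.create`, alongside the labeled content blocks built earlier.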
[Interactive demo: Prompt Quality Demo. Compares a poorly grounded and a well-grounded prompt for visual RAG, showing what information each approach risks losing.]

Quick check: Why should you add a label like "[Figure 3, p.12]" before each image in your VLM prompt?

Chapter 6: Table Understanding

Tables are a special case. They carry structured, relational data — the kind of information that's hardest to retrieve correctly. A flat text embedding can't capture "row 3, column 2 = 91.2%" as a queryable fact. You need a deliberate strategy for each table.

Four Serialization Strategies

A serialization strategy decides how you convert a table into something embeddable. Each strategy has very different retrieval properties.

Markdown: Compact. Preserves structure. Best for small tables.

| Model | Acc | Latency |
|---|---|---|
| GPT-4V | 91.2% | 2.1s |

HTML: Verbose. Perfect row/col semantics. Best for complex merged cells.

<table><tr><th>Model</th>...</tr></table>

NL Description: Most embeddable. Best for semantic retrieval. Loses exact values.

"A comparison table of 3 models. GPT-4V achieves highest accuracy at 91.2% with 2.1s latency..."

Hybrid (Embed Description + Store Raw): Index the NL description for retrieval, but store Markdown or HTML in the payload for generation context. Best of both worlds.
python
import uuid

import pandas as pd
from qdrant_client.models import PointStruct

def table_to_nl_description(df: pd.DataFrame, table_caption: str = "") -> str:
    """
    Generate a natural language description of a table for embedding.
    The description is what gets indexed; the markdown is what gets retrieved.
    """
    rows, cols = df.shape
    col_names = ", ".join(str(c) for c in df.columns)

    desc = f"A table with {rows} rows and {cols} columns: {col_names}."
    if table_caption:
        desc += f" Caption: {table_caption}."

    # Describe min/max of numeric columns
    for col in df.select_dtypes(include="number").columns:
        mn, mx = df[col].min(), df[col].max()
        desc += f" {col} ranges from {mn} to {mx}."

    # Include first few rows verbatim (the leading space keeps sentences separated)
    for _, row in df.head(3).iterrows():
        desc += " " + " | ".join(f"{k}: {v}" for k, v in row.items()) + "."

    return desc

def index_table(df: pd.DataFrame, caption: str, qdrant_client):
    description = table_to_nl_description(df, caption)
    embedding = embed_text(description)    # index this (embed_text: your own synchronous embedder)
    markdown = df.to_markdown(index=False) # store this as payload (requires the tabulate package)

    qdrant_client.upsert(
        collection_name="tables",
        points=[PointStruct(
            # Deterministic UUID: Python's hash() varies between runs and can be
            # negative, and Qdrant IDs must be unsigned integers or UUIDs.
            id=str(uuid.uuid5(uuid.NAMESPACE_URL, caption)),
            vector=embedding,
            payload={
                "type": "table",
                "description": description,
                "markdown": markdown,  # ← passed to LLM at generation time
                "caption": caption,
            },
        )],
    )
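The indexing code above leans on pandas for the Markdown payload. As a dependency-free sketch of the first two serialization strategies, the helpers below turn a header and rows into Markdown and HTML strings. `to_markdown` and `to_html` are illustrative names, and the HTML version handles only simple grids (no merged cells).

```python
from html import escape

def to_markdown(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a simple table as a pipe-delimited Markdown table."""
    lines = ["| " + " | ".join(header) + " |",
             "|" + "---|" * len(header)]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)

def to_html(header: list[str], rows: list[list[str]]) -> str:
    """Serialize a simple table as HTML, escaping cell text."""
    head = "".join(f"<th>{escape(h)}</th>" for h in header)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(c)}</td>" for c in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"

header = ["Model", "Acc", "Latency"]
rows = [["GPT-4V", "91.2%", "2.1s"], ["Claude 3", "89.7%", "1.8s"]]

markdown_table = to_markdown(header, rows)
html_table = to_html(header, rows)
print(markdown_table)
print(html_table)
```

Either string can be stored in the point payload and handed to the LLM at generation time; the HTML form is several times longer for the same content, which is the verbosity trade-off noted above.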

When to Embed the Image vs. the Description

Not all tables should be text-embedded. A table with complex visual structure — color-coded cells, sparklines, merged headers — loses too much when serialized to text. For those, treat the table as an image: render it, embed with CLIP or ColPali, and pass the image to the VLM at generation time. A simple threshold: if the table has more than 2 header levels or contains visual elements, treat it as an image.
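The rule of thumb translates directly into a small decision function. `choose_table_strategy` and its two inputs are hypothetical: in practice `header_levels` and `has_visual_elements` would come from your parser's table metadata, which varies by tool.

```python
def choose_table_strategy(header_levels: int, has_visual_elements: bool) -> str:
    """Apply the rule of thumb: complex or visually encoded tables take the image path."""
    if header_levels > 2 or has_visual_elements:
        return "image"        # render, embed with CLIP or ColPali, show to the VLM
    return "description"      # embed an NL description, store Markdown/HTML as payload

print(choose_table_strategy(header_levels=1, has_visual_elements=False))
print(choose_table_strategy(header_levels=3, has_visual_elements=False))
print(choose_table_strategy(header_levels=1, has_visual_elements=True))
```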

[Interactive demo: Table Strategy Selector. A short questionnaire that suggests which table strategy fits your document type.]

Quick check: You have a table with complex merged headers and color-coded cells showing risk levels. What is the best strategy for indexing it?

Chapter 7: Interactive Multimodal RAG Pipeline

Here it all comes together. Below is a simulated end-to-end multimodal RAG pipeline. A document arrives with text sections, images, and tables. Watch each stage process it — then query the indexed document and see which modalities get retrieved. Toggle Vision Mode off to see what text-only RAG would miss.

[Interactive demo: Multimodal RAG Pipeline Simulator. Step through the pipeline or run it automatically; colored chunks show what each modality contributes to the final answer. Controls: a Vision Mode toggle and a query input.]
What to notice: With Vision ON, the chart image and table are both retrieved and contribute to the answer. With Vision OFF (text-only mode), the chart disappears from retrieval results and the table becomes a flat text string — the answer degrades noticeably. The quality gap is largest for queries that ask about trends, comparisons, and visual structure.

Chapter 8: Connections & What's Next

Multimodal RAG is not a standalone technique — it sits at the intersection of several larger fields. Understanding where it connects helps you see where to invest next.

The Technology Map

| Field | What it adds | Where multimodal RAG fits |
|---|---|---|
| Document AI | Layout detection, OCR, table recognition models | Document AI is the parsing layer: Docling, TableFormer, DocLayNet all come from here |
| Vision-Language Models | Models that jointly understand images and text | VLMs are both the embedding backbone (CLIP, ColPali) and the generation model (Claude, GPT-4o) |
| Enterprise Search | Multi-source retrieval over structured + unstructured data | Multimodal RAG is the foundation layer for modern enterprise search over document repositories |
| Knowledge Management | Making organizational knowledge queryable | RAG over enterprise document stores (Confluence, SharePoint, Notion) is the primary use case driving commercial adoption |
| Agentic Systems | AI systems that take actions | Agents use multimodal RAG as their memory/context layer: they retrieve relevant documents before acting |

The Remaining Hard Problems

Multimodal RAG is still maturing. The unsolved problems are where research is happening right now:

Cross-modal grounding: When an image and the surrounding text disagree (e.g., the chart shows a decline but the caption says "growth"), which does the LLM believe? Current models don't have reliable mechanisms for resolving such conflicts.
Long-document retrieval: ColPali embeds whole pages, so it also retrieves at the page level. A 500-page document gives you 500 page embeddings. Finer-grained precision within a page ("which panel of this multi-panel figure is relevant?") is still an open problem.
Evaluation: How do you measure whether a multimodal RAG system answered correctly? Standard benchmarks (MMQA, DocVQA, MMLongBench-Doc) exist but don't cover domain-specific documents. Building eval datasets for your specific document type is usually necessary for production systems.
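A domain-specific eval set can start small: a list of questions, each paired with the chunk IDs a correct answer needs. The sketch below computes two standard retrieval metrics over such a set. The `eval_set` contents are invented, and these metrics cover only the retrieval stage; judging the generated answer needs a separate check.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical eval set: IDs prefixed t = text, i = image, tb = table.
eval_set = [
    {"retrieved": ["t1", "i3", "tb1"], "relevant": {"i3"}},
    {"retrieved": ["t4", "t2", "i1"], "relevant": {"tb2", "i1"}},
]

mean_recall = sum(recall_at_k(e["retrieved"], e["relevant"], k=3) for e in eval_set) / len(eval_set)
mrr = sum(reciprocal_rank(e["retrieved"], e["relevant"]) for e in eval_set) / len(eval_set)
print(mean_recall, mrr)
```

Reporting these numbers separately for questions whose answers live in images, tables, and text shows which modality the pipeline is weakest on.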

What to Read Next

Related Gleams Lessons

How CLIP, BLIP, and PaLI-X work — the embedding models that power multimodal RAG.
The training objective behind CLIP's joint image-text embedding space.
The backbone architecture for both text embedders and vision encoders.
The Feynman test for this lesson: Can you explain to a colleague why their text-only RAG pipeline is missing information, and sketch on a whiteboard the four components they'd need to add (parser → vision embedder → hybrid retrieval with RRF → VLM generation)? If yes, you understand multimodal RAG.
"What I cannot create, I do not understand. What I cannot retrieve, I cannot use."