Real documents have charts, tables, diagrams, and screenshots. Here's how to build RAG systems that actually read them.
You've built a RAG pipeline. You've chunked PDFs, embedded the text, stuffed everything into a vector database. It works great — until someone asks about the chart on page 7.
The chart shows quarterly revenue growing 40% year-over-year. The text on that page says "see Figure 3 for details." Your system retrieves the text, finds the reference, and then... returns nothing useful. The actual data lived inside the image. Your pipeline never saw it.
This isn't a niche edge case. It's the default state of real documents. Think about the last PDF you read that actually mattered: chances are its key content sat in a chart, a table, a diagram, or a screenshot.
The solution is multimodal RAG: a retrieval-augmented generation system that can ingest, index, retrieve, and reason over text, images, tables, and diagrams together. By the end of this lesson, you'll know how to build one.
Text-only RAG has one job: turn text into embeddings, retrieve similar text. The moment you hit a PDF with real structure, three distinct failure modes appear.
When you extract text from a PDF with a tool like PyPDF2, images are simply skipped. The page with the architecture diagram becomes a blank string or a fragment like "Figure 2 shows the proposed method." Your vector database gets a useless placeholder. Queries about what's in Figure 2 return nothing, or worse, return adjacent text that talks about the figure rather than describing its content.
A text-only extraction pass (pdfplumber, pypdf) yields zero bytes for every image in the document. The image bytes are in the PDF, but they sit in a compressed binary stream that text parsers don't touch.
You might think: "just OCR the whole page." OCR (Optical Character Recognition) converts image pixels to text, but naive OCR destroys spatial layout. A multi-column research paper becomes interleaved garbage: lines from column one and column two are merged straight across the page, mixing unrelated sentences. A table becomes a flat list of numbers with no row/column structure. The semantic relationships encoded by position are gone.
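To make the column-interleaving failure concrete, here is a toy sketch in plain Python. The two-column page and both reader functions are hypothetical illustrations, not the behavior of any particular OCR engine.

```python
# Toy model of a two-column page: each column is a list of text lines.
# The page content and both functions are illustrative assumptions.
left_column = ["Transformers use", "self-attention to", "mix token states."]
right_column = ["Retrieval adds", "external documents", "to the prompt."]


def naive_line_scan(left, right):
    """Read each physical line straight across the page, ignoring the column gap."""
    return " ".join(f"{l} {r}" for l, r in zip(left, right))


def column_aware_scan(left, right):
    """Read the whole left column first, then the whole right column."""
    return " ".join(left) + " " + " ".join(right)


naive_text = naive_line_scan(left_column, right_column)
ordered_text = column_aware_scan(left_column, right_column)

print(naive_text)    # sentences from both columns are spliced together
print(ordered_text)  # each column's sentence stays intact
```

Both strings contain exactly the same words; only the order differs, which is why the damage is invisible to a word-count check but fatal to an embedding.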
A table in a PDF is usually rendered as positioned text — each cell is an independent text object at a specific (x, y) coordinate on the page. Extraction tools read these in left-to-right, top-to-bottom order, but the row/column relationships are implicit in the coordinates, not in the text stream. A simple 3×3 table like:
| Model | Accuracy | Latency |
|---|---|---|
| GPT-4V | 91.2% | 2.1s |
| Claude 3 | 89.7% | 1.8s |
becomes flat text: "Model Accuracy Latency GPT-4V 91.2% 2.1s Claude 3 89.7% 1.8s" — a string that loses all relational meaning. Embedding this string and retrieving it will match queries about "accuracy" or "latency" only by accident, not because the embedding understands row/column structure.
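One way to see what the flat string throws away is to keep the coordinates. The sketch below assumes each cell arrives as a hypothetical `(x, y, text)` tuple, roughly what a coordinate-aware extractor could give you, and regroups cells into rows by their y position. The coordinates, the tolerance value, and the helper names are all made up for illustration.

```python
# Hypothetical extractor output: one (x, y, text) tuple per table cell.
cells = [
    (50, 100, "Model"), (150, 100, "Accuracy"), (250, 100, "Latency"),
    (50, 120, "GPT-4V"), (150, 120, "91.2%"), (250, 120, "2.1s"),
    (50, 140, "Claude 3"), (150, 140, "89.7%"), (250, 140, "1.8s"),
]


def flatten(cells):
    """What a plain text stream gives you: cell text in reading order."""
    ordered = sorted(cells, key=lambda c: (c[1], c[0]))
    return " ".join(text for _, _, text in ordered)


def rebuild_rows(cells, y_tolerance=5):
    """Group cells whose y coordinates lie within y_tolerance into one row."""
    rows = []
    for x, y, text in sorted(cells, key=lambda c: c[1]):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells left to right by x.
    return [[text for _, text in sorted(row["cells"])] for row in rows]


def rows_to_records(rows):
    """Pair every body cell with its column header."""
    header, *body = rows
    return [dict(zip(header, row)) for row in body]


flat = flatten(cells)
records = rows_to_records(rebuild_rows(cells))

print(flat)
print(records)
```

The records keep the "which value belongs to which model and column" relationship that the flat string loses.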
Before you can retrieve anything, you need to extract structured content from raw documents. Three tools dominate this space, each with different strengths.
PyMuPDF is a Python binding to the MuPDF rendering engine. It gives you raw access to PDF internals: every text block with its bounding box, every embedded image as raw bytes you can load into PIL, every vector drawing as path data. It's fast, it's precise, and it requires you to assemble structure yourself.
```python
import io

import fitz  # pip install pymupdf
from PIL import Image

doc = fitz.open("report.pdf")

for page_num, page in enumerate(doc):
    # --- Extract text blocks with bounding boxes ---
    # Each block is (x0, y0, x1, y1, text, block_no, block_type)
    blocks = page.get_text("blocks")
    text_blocks = [b for b in blocks if b[6] == 0]   # type 0 = text
    image_blocks = [b for b in blocks if b[6] == 1]  # type 1 = image

    # --- Extract actual image bytes ---
    for img_info in page.get_images():
        xref = img_info[0]                       # cross-reference number
        base_image = doc.extract_image(xref)
        img_bytes = base_image["image"]          # raw bytes (PNG/JPEG)
        img = Image.open(io.BytesIO(img_bytes))  # → PIL Image
        # Now you can: caption it with a VLM, embed with CLIP, run OCR

    # --- Render whole page as image (for ColPali) ---
    mat = fitz.Matrix(2.0, 2.0)  # 2x scale = 144 DPI
    pix = page.get_pixmap(matrix=mat)
    page_image = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
```
Unstructured is a library that attempts to infer document structure automatically. It classifies each element as Title, NarrativeText, Table, Image, ListItem, etc., using layout models and heuristics. This is more useful for downstream RAG because you know what kind of thing each chunk is.
```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="report.pdf",
    extract_images_in_pdf=True,    # saves images to disk
    infer_table_structure=True,    # reconstructs table HTML
    strategy="hi_res",             # uses layout detection model (slower but better)
    extract_image_block_output_dir="./images",
)

for el in elements:
    print(el.category)   # "Title" | "NarrativeText" | "Table" | "Image"
    print(el.text)       # extracted text content
    print(el.metadata)   # page_number, coordinates, filename, etc.
    if el.category == "Table":
        print(el.metadata.text_as_html)  # table reconstructed as HTML string
```
Docling (IBM Research, 2024) is purpose-built for document AI. It uses a suite of models: layout detection (trained on the DocLayNet dataset), table structure recognition (TableFormer), and OCR. Output is a rich DoclingDocument object with a full document hierarchy. It handles complex layouts that Unstructured struggles with.
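```python
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling_core.types.doc import TableItem

# Ask the PDF pipeline to keep cropped figure images
# (option names may differ between Docling versions)
pipeline_options = PdfPipelineOptions()
pipeline_options.generate_picture_images = True

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("report.pdf")
doc = result.document

# Iterate over all elements in reading order
for item, level in doc.iterate_items():
    print(item.label)  # SectionHeader | Text | Table | Picture | etc.
    if isinstance(item, TableItem):
        df = item.export_to_dataframe()  # → pandas DataFrame with proper rows/cols

# Export images: Docling renders each page and crops the figures
for pic in doc.pictures:
    image = pic.get_image(doc)  # → PIL Image, properly cropped
```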
Each parser has different trade-offs:
| Tool | Speed | Table Quality | Image Extraction | Best For |
|---|---|---|---|---|
| PyMuPDF | Very fast | Raw coords only | Excellent (raw bytes) | Custom pipelines, page rendering |
| Unstructured | Medium | Good (HTML) | Good (saves to disk) | Quick prototyping, mixed docs |
| Docling | Slow | Best (TableFormer) | Best (cropped) | Production, complex layouts |
You've extracted images from your documents. Now you need to embed them — turn each image into a dense vector so you can retrieve images by query. Two approaches dominate, with very different trade-offs.
CLIP (Contrastive Language-Image Pretraining, OpenAI 2021) trains a vision encoder and a text encoder jointly so that matching image-text pairs end up close in the same vector space. This is exactly what you want for retrieval: embed a query as text, embed images as vectors, find nearest neighbors.
```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

# Embed an image
img = Image.open("figure_3.png")
inputs = processor(images=img, return_tensors="pt")
with torch.no_grad():
    img_emb = model.get_image_features(**inputs)        # shape: [1, 768]
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)  # L2 normalize

# Embed the query
inputs = processor(text="quarterly revenue bar chart", return_tensors="pt", padding=True)
with torch.no_grad():
    txt_emb = model.get_text_features(**inputs)         # shape: [1, 768]
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)

# Similarity (cosine, since both are L2-normalized)
similarity = (txt_emb @ img_emb.T).item()  # scalar, [-1, 1]
```
CLIP works well for natural images, photographs, and screenshots of interfaces. It struggles with dense technical figures — a circuit schematic and a genome map look similar to CLIP because it was trained on photo-caption pairs from the web, not scientific diagrams.
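Once queries and images share a vector space, retrieval reduces to a nearest-neighbor search. The sketch below shows that step in plain Python, using tiny made-up 3-dimensional vectors in place of real CLIP embeddings; the file names, values, and function names are placeholders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_top_k(query_vec, image_vecs, k=2):
    """Return the k (image_id, similarity) pairs most similar to the query."""
    q = l2_normalize(query_vec)
    scored = []
    for image_id, vec in image_vecs.items():
        v = l2_normalize(vec)
        scored.append((image_id, sum(a * b for a, b in zip(q, v))))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Made-up embeddings standing in for CLIP image vectors.
image_index = {
    "revenue_chart.png": [0.9, 0.1, 0.0],
    "team_photo.png": [0.0, 0.2, 0.9],
    "architecture.png": [0.4, 0.8, 0.1],
}
query_embedding = [1.0, 0.2, 0.0]  # stand-in for the text-encoder output

top_hits = cosine_top_k(query_embedding, image_index, k=2)
print(top_hits)
```

A vector database performs the same comparison with an approximate index, so it does not have to score every image on every query.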
ColPali (Faysse et al., 2024) takes a radically different approach. Instead of extracting text and images separately, it renders the entire PDF page as an image and embeds it with a vision-language model (PaliGemma). The key innovation is late interaction: the query produces one embedding per token, the page image produces a grid of patch embeddings, and the score is computed with MaxSim. Each query token is matched to its most similar patch, and those per-token maxima are summed.
```python
import fitz  # PyMuPDF, used here only to render pages
import torch
from PIL import Image
from colpali_engine.models import ColPali, ColPaliProcessor

model = ColPali.from_pretrained(
    "vidore/colpali-v1.2",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")


def render_page(doc, page_index, scale=2.0):
    """Render one PDF page to a PIL image (local helper, not part of colpali_engine)."""
    pix = doc[page_index].get_pixmap(matrix=fitz.Matrix(scale, scale))
    return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)


# Render pages as images (using PyMuPDF)
doc = fitz.open("report.pdf")
page_images = [render_page(doc, i) for i in range(len(doc))]

# Embed pages: each page becomes a grid of patch embeddings
batch = processor.process_images(page_images).to(model.device)
with torch.no_grad():
    page_embeddings = model(**batch)
# page_embeddings: [num_pages, num_patches, dim], e.g. roughly [10, 1024, 128]

# Embed query
query_batch = processor.process_queries(["What is the revenue trend?"]).to(model.device)
with torch.no_grad():
    query_embedding = model(**query_batch)
# query_embedding: [1, num_query_tokens, dim]

# Score = MaxSim over patches for each query token, then sum
scores = processor.score_multi_vector(query_embedding, page_embeddings)  # [num_queries, num_pages]
best_page = scores[0].argmax().item()
```
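The late-interaction score is easier to follow on toy numbers. This sketch re-implements MaxSim in plain Python on hand-made 2-dimensional vectors. It illustrates the scoring rule only, it is not the library's implementation, and every vector in it is invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_tokens, page_patches):
    """For each query token take its best-matching patch, then sum over tokens."""
    return sum(max(dot(q, p) for p in page_patches) for q in query_tokens)


# Two query-token embeddings and two pages with different numbers of patches.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[0.9, 0.1], [0.1, 0.8], [0.5, 0.5]]
page_b = [[0.6, 0.6], [0.2, 0.1]]

scores = {
    "page_a": maxsim_score(query_tokens, page_a),
    "page_b": maxsim_score(query_tokens, page_b),
}
best_page = max(scores, key=scores.get)
print(scores, best_page)
```

Page A wins because each query token finds a dedicated, strongly matching patch (0.9 and 0.8), while page B relies on one middling patch for both tokens (0.6 and 0.6). That is the behavior a single pooled vector per page cannot express.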
A document has three kinds of content: prose, images, and tables. You've built separate embeddings for each. Now you need to query all three indexes at once and combine the results. This is hybrid retrieval.
Each modality gets its own index, each optimized for its content type:
```python
import asyncio

from qdrant_client import AsyncQdrantClient

# The async client is required because the search calls below are awaited
client = AsyncQdrantClient(":memory:")


async def hybrid_retrieve(query: str, top_k: int = 5):
    # embed_text and embed_image_query are your own async embedding helpers;
    # each is assumed to return a flat list of floats
    text_emb, img_emb = await asyncio.gather(
        embed_text(query),          # → 1536-dim vector
        embed_image_query(query),   # → 768-dim vector (CLIP text encoder)
    )

    # Query all three indexes in parallel
    # (newer qdrant-client releases prefer query_points over search)
    text_hits, img_hits, table_hits = await asyncio.gather(
        client.search(collection_name="text_chunks", query_vector=text_emb, limit=top_k),
        client.search(collection_name="images", query_vector=img_emb, limit=top_k),
        client.search(collection_name="tables", query_vector=text_emb, limit=top_k),
    )
    return text_hits, img_hits, table_hits
```
Each index returns a ranked list of results. How do you merge three ranked lists into one? The standard answer is Reciprocal Rank Fusion (RRF). The intuition: an item that ranks #1 in one list and #1 in another list should score higher than an item that ranks #1 in one list but doesn't appear in the others.
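Each item d gets a fused score:
$$\mathrm{RRF}(d) = \sum_{r \in R} \frac{1}{k + \mathrm{rank}_r(d)}$$
Here R is the set of rankers (one per index), k = 60 is a constant that prevents top-ranked items from dominating too strongly, and rank_r(d) is the item's rank in ranker r's list (1-indexed). Items not present in a ranker's list contribute 0 from that ranker.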
```python
def reciprocal_rank_fusion(ranked_lists: list[list], k: int = 60) -> list:
    """
    ranked_lists: each is a list of (doc_id, score) tuples, best-first.
    Returns merged list of (doc_id, rrf_score) sorted best-first.
    """
    scores = {}
    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# Worked example:
# text_hits:  [(t1, .91), (i3, .88), (t2, .85)]   ← image appeared in text index too
# img_hits:   [(i3, .93), (i1, .80), (i2, .75)]
# table_hits: [(tb1, .89), (t1, .82), (i3, .71)]
#
# RRF scores (k=60):
# i3:  1/62 + 1/61 + 1/63 = 0.0161 + 0.0164 + 0.0159 = 0.0484   ← appears in all 3
# t1:  1/61 + 0    + 1/62 = 0.0164 + 0      + 0.0161 = 0.0325
# tb1: 0    + 0    + 1/61 =                   0.0164 = 0.0164
```
You've retrieved the right pages, images, and table chunks. Now you need to pass them to a language model that can actually see them. This is where vision-language models (VLMs) enter the picture.
Modern frontier APIs — Claude 3.5, GPT-4o, Gemini 1.5 — accept interleaved image and text inputs. The model processes both modalities in a single forward pass. From the API's perspective, you provide a list of content blocks: some are text strings, some are base64-encoded images.
```python
import base64
from pathlib import Path

import anthropic

client = anthropic.Anthropic()


def image_to_b64(path: str) -> str:
    return base64.b64encode(Path(path).read_bytes()).decode()


def multimodal_rag_answer(query: str, retrieved_items: list) -> str:
    content = []

    # Build context from retrieved items (text + images interleaved)
    content.append({"type": "text", "text": "Here are the relevant document sections:\n"})
    for i, item in enumerate(retrieved_items):
        if item["type"] == "text":
            content.append({
                "type": "text",
                "text": f"[Section {i+1}, p.{item['page']}]\n{item['text']}\n",
            })
        elif item["type"] == "image":
            content.append({"type": "text", "text": f"[Figure {i+1}, p.{item['page']}]"})
            content.append({
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/png",
                    "data": image_to_b64(item["path"]),
                },
            })

    # The actual question
    content.append({
        "type": "text",
        "text": f"\nQuestion: {query}\nAnswer with specific references to figures and page numbers.",
    })

    resp = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": content}],
    )
    return resp.content[0].text
```
The quality of multimodal answers depends heavily on how you prompt. Four techniques matter most:
| Technique | Why it matters | Example instruction |
|---|---|---|
| Label images explicitly | Model can cite specific figures in its answer | "[Figure 3, p.12]" before each image |
| Demand visual grounding | Forces the model to look at images, not guess from text | "Base your answer only on the provided figures and text" |
| Request uncertainty signals | Reduces hallucination when the chart is ambiguous | "If a chart is unclear, say what you can and cannot determine" |
| Describe chart type first | Primes the model's visual understanding | "This is a bar chart showing..." in your system prompt |
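As a rough sketch, the labeling, grounding, and uncertainty techniques from the table could be combined when assembling a request like this. The function name, the item schema, and the simplified image block are assumptions for illustration; a real request would use the provider's own image block format and real base64 data in place of the placeholder string.

```python
GROUNDING_INSTRUCTIONS = (
    "Base your answer only on the provided figures and text. "
    "If a chart is unclear, say what you can and cannot determine."
)


def build_grounded_content(query, items):
    """Return content blocks with an explicit label placed before each image."""
    content = []
    for item in items:
        if item["type"] == "image":
            label = f"[Figure {item['figure_no']}, p.{item['page']}]"
            content.append({"type": "text", "text": label})
            # Simplified image block; "data" would hold base64 bytes in practice.
            content.append({"type": "image", "data": item["data"]})
        else:
            content.append({"type": "text", "text": f"[p.{item['page']}] {item['text']}"})
    content.append({"type": "text", "text": f"Question: {query}\n{GROUNDING_INSTRUCTIONS}"})
    return content


example_items = [
    {"type": "text", "page": 12, "text": "See Figure 3 for details."},
    {"type": "image", "page": 12, "figure_no": 3, "data": "<base64 placeholder>"},
]
blocks = build_grounded_content("How did revenue change?", example_items)
for block in blocks:
    print(block["type"], block.get("text", "")[:60])
```

Putting the label in its own text block immediately before the image gives the model a stable handle ("Figure 3, p.12") to cite in its answer.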
Tables are a special case. They carry structured, relational data — the kind of information that's hardest to retrieve correctly. A flat text embedding can't capture "row 3, column 2 = 91.2%" as a queryable fact. You need a deliberate strategy for each table.
A serialization strategy decides how you convert a table into something embeddable. Each strategy has very different retrieval properties.
```python
import uuid

import pandas as pd
from qdrant_client.models import PointStruct


def table_to_nl_description(df: pd.DataFrame, table_caption: str = "") -> str:
    """
    Generate a natural language description of a table for embedding.
    The description is what gets indexed; the markdown is what gets retrieved.
    """
    rows, cols = df.shape
    col_names = ", ".join(df.columns.tolist())
    desc = f"A table with {rows} rows and {cols} columns: {col_names}."
    if table_caption:
        desc += f" Caption: {table_caption}."

    # Describe min/max of numeric columns
    for col in df.select_dtypes(include="number").columns:
        mn, mx = df[col].min(), df[col].max()
        desc += f" {col} ranges from {mn} to {mx}."

    # Include first few rows verbatim
    for _, row in df.head(3).iterrows():
        desc += " " + " | ".join(f"{k}: {v}" for k, v in row.items()) + "."
    return desc


def index_table(df: pd.DataFrame, caption: str, qdrant_client):
    description = table_to_nl_description(df, caption)
    embedding = embed_text(description)      # index this (embed_text: your own embedding helper)
    markdown = df.to_markdown(index=False)   # store this as payload (needs the tabulate package)

    # hash() is salted per process and can be negative, so derive a stable UUID instead
    point_id = str(uuid.uuid5(uuid.NAMESPACE_URL, caption))

    qdrant_client.upsert(
        collection_name="tables",
        points=[
            PointStruct(
                id=point_id,
                vector=embedding,
                payload={
                    "type": "table",
                    "description": description,
                    "markdown": markdown,  # ← passed to LLM at generation time
                    "caption": caption,
                },
            )
        ],
    )
```
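For comparison, here are two other serializations you might weigh against the natural-language description: a markdown rendering and a row-wise key-value form. Both helpers are hypothetical and work on plain lists of dicts so the sketch has no dependencies; the sample rows reuse the small model table from earlier.

```python
rows = [
    {"Model": "GPT-4V", "Accuracy": "91.2%", "Latency": "2.1s"},
    {"Model": "Claude 3", "Accuracy": "89.7%", "Latency": "1.8s"},
]


def to_markdown(rows):
    """Whole table as one markdown string: compact, keeps the grid visible."""
    headers = list(rows[0])
    lines = ["| " + " | ".join(headers) + " |", "|" + "---|" * len(headers)]
    lines += ["| " + " | ".join(r[h] for h in headers) + " |" for r in rows]
    return "\n".join(lines)


def to_row_sentences(rows, key_column):
    """One self-contained string per row, so each row can be embedded on its own."""
    sentences = []
    for r in rows:
        facts = ", ".join(f"{k} is {v}" for k, v in r.items() if k != key_column)
        sentences.append(f"{r[key_column]}: {facts}.")
    return sentences


markdown_table = to_markdown(rows)
row_sentences = to_row_sentences(rows, key_column="Model")
print(markdown_table)
print(row_sentences)
```

The row-wise form repeats the column name next to every value, so a query such as "Claude 3 latency" can match a single short chunk. The cost is one embedding per row and the loss of cross-row context.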
Not all tables should be text-embedded. A table with complex visual structure — color-coded cells, sparklines, merged headers — loses too much when serialized to text. For those, treat the table as an image: render it, embed with CLIP or ColPali, and pass the image to the VLM at generation time. A simple threshold: if the table has more than 2 header levels or contains visual elements, treat it as an image.
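The threshold above can be written as a small routing function. This is a minimal sketch: the `TableProfile` fields and the function name are assumptions, and how you detect header levels or visual elements depends on your parser.

```python
from dataclasses import dataclass


@dataclass
class TableProfile:
    """Hypothetical summary of a parsed table, filled in by your own parsing step."""
    header_levels: int
    has_visual_elements: bool  # color-coded cells, sparklines, and similar


def choose_table_strategy(profile: TableProfile, max_header_levels: int = 2) -> str:
    """Return "image" for visually complex tables, "text" otherwise."""
    if profile.header_levels > max_header_levels or profile.has_visual_elements:
        return "image"
    return "text"


simple = TableProfile(header_levels=1, has_visual_elements=False)
nested = TableProfile(header_levels=3, has_visual_elements=False)
heatmap = TableProfile(header_levels=1, has_visual_elements=True)

print(choose_table_strategy(simple), choose_table_strategy(nested), choose_table_strategy(heatmap))
```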
Here it all comes together. In an end-to-end multimodal RAG pipeline, a document arrives with text sections, images, and tables, and each stage processes it in turn. A query against the indexed document then retrieves from every modality. Remove the vision components and you see exactly what text-only RAG would miss: the images and the visually structured tables.
Multimodal RAG is not a standalone technique — it sits at the intersection of several larger fields. Understanding where it connects helps you see where to invest next.
| Field | What it adds | Where multimodal RAG fits |
|---|---|---|
| Document AI | Layout detection, OCR, table recognition models | Document AI is the parsing layer — Docling, TableFormer, DocLayNet all come from here |
| Vision-Language Models | Models that jointly understand images and text | VLMs are both the embedding backbone (CLIP, ColPali) and the generation model (Claude, GPT-4o) |
| Enterprise Search | Multi-source retrieval over structured + unstructured data | Multimodal RAG is the foundation layer for modern enterprise search over document repositories |
| Knowledge Management | Making organizational knowledge queryable | RAG over enterprise document stores (Confluence, SharePoint, Notion) is the primary use case driving commercial adoption |
| Agentic Systems | AI systems that take actions | Agents use multimodal RAG as their memory/context layer — retrieve relevant documents before acting |
Multimodal RAG is still maturing, and its unsolved problems are where research is happening right now.
"What I cannot create, I do not understand. What I cannot retrieve, I cannot use."