How to slice documents so retrieval actually works.
You have a 50-page PDF — a technical manual, a research paper, a company handbook. A user asks: "What's the warranty policy for the XR-7 motor?" You want the AI to answer from the document, not hallucinate.
The naive approach: feed the whole PDF into the language model. GPT-4 Turbo has a context window of roughly 128k tokens, and a 50-page PDF is roughly 25,000 tokens, so a single document fits. But your system needs to handle 500 PDFs. That is about 12.5 million tokens, far beyond any context window. And even if it fit, the model's attention would diffuse over millions of tokens and the relevant sentence would drown in noise.
So you need to embed the documents — turn them into vectors in a high-dimensional space where "similar meaning" means "nearby in space." Then when a user asks a question, you embed the question and find the nearest document chunks. This is Retrieval-Augmented Generation (RAG).
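A minimal sketch of that embed-then-find-nearest step. To stay dependency-free it uses a bag-of-words count vector as a stand-in for a real embedding model, which is an assumption made purely for illustration. The names `embed`, `cosine`, and `retrieve` are hypothetical, and the sample chunks are invented text.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Invented sample chunks, for illustration only
chunks = [
    "The XR-7 motor is covered by a two-year warranty.",
    "Install the mounting bracket before wiring the controller.",
    "Store the unit in a dry place.",
]
top = retrieve("What is the warranty for the XR-7 motor?", chunks, k=1)
```

A real system would swap `embed` for a learned embedding model, which captures meaning rather than word overlap; the ranking logic stays the same.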
You could embed each word separately. But a word alone has no context. The word "bank" means nothing without "river bank" vs "savings bank." You need enough surrounding text to disambiguate.
The solution is to split the document into chunks — segments of text that each capture one coherent idea, with enough context to be understood on their own. Each chunk gets one embedding vector. At query time, you retrieve the most relevant chunks and feed them to the model.
See why a single vector for a whole document loses information. Each colored dot is a distinct topic from the document.
The simplest strategy: count characters (or tokens) and cut every N of them. Like tearing a book into equal-width strips. Fast, deterministic, requires no understanding of the content.
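Consider this paragraph from a climate science paper:

"The Greenland ice sheet has been losing mass at an accelerating rate since the 1990s. Between 2002 and 2020, it shed approximately 280 billion tons of ice per year. This loss is driven primarily by surface melt and glacial discharge, with surface melt accounting for roughly 60% of total mass loss."

With a chunk size of roughly 100 characters and no overlap, you get:

chunks (100-char, no overlap)

```
# Chunk 1
"The Greenland ice sheet has been losing mass at an accelerating rate since the 1990s. Between 2"
# Chunk 2
"002 and 2020, it shed approximately 280 billion tons of ice per year. This loss is driven primar"
# Chunk 3
"ily by surface melt and glacial discharge, with surface melt accounting for roughly 60% of total"
# Chunk 4
" mass loss."
```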
Chunk 1 ends mid-number: "2002" is split. Chunk 2 starts with "002", which is meaningless without context. If someone queries "how much ice does Greenland lose per year?", the figure "280 billion tons" sits in chunk 2, but its subject (the Greenland ice sheet) is stranded in chunk 1 and the date range is cut in half. Neither chunk alone has the full fact.
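One way to make this failure measurable is to check which phrases you care about end up straddling a cut. The sketch below is illustrative only; `fixed_chunks` and `broken_phrases` are hypothetical helper names, and the check is a plain substring test.

```python
def fixed_chunks(text: str, size: int) -> list[str]:
    """Cut text every `size` characters, with no overlap."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def broken_phrases(text: str, size: int, phrases: list[str]) -> list[str]:
    """Return the phrases that occur in the text but in no single chunk."""
    chunks = fixed_chunks(text, size)
    return [p for p in phrases
            if p in text and not any(p in c for c in chunks)]

text = (
    "The Greenland ice sheet has been losing mass at an accelerating rate "
    "since the 1990s. Between 2002 and 2020, it shed approximately 280 billion "
    "tons of ice per year. This loss is driven primarily by surface melt and "
    "glacial discharge, with surface melt accounting for roughly 60% of total "
    "mass loss."
)
lost = broken_phrases(text, 100, ["Between 2002 and 2020", "280 billion tons"])
# lost == ["Between 2002 and 2020"]
```

The figure itself survives inside one chunk, but the date range that qualifies it does not, which is the kind of silent damage a fixed-width cut causes.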
Overlap is the band-aid: repeat the last K characters from chunk N at the start of chunk N+1. This reduces mid-fact splits but inflates your index and creates duplicate information in adjacent chunks.
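python: fixed-size chunking with overlap

```python
def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Cut text into windows of `size` characters that overlap by `overlap`."""
    if not 0 <= overlap < size:
        # overlap >= size would make the loop below stall or run backwards
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        start += size - overlap  # advance by less than `size` so neighbours share `overlap` chars
    return chunks

# Example
text = "The Greenland ice sheet..."
chunks = fixed_chunk(text, size=500, overlap=50)
# Result: chunks with 50-char "buffer zones" at boundaries
```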
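To put a number on how much overlap inflates the index, here is a rough sketch for a chunker that advances by `size - overlap` characters per step, as the loop above does. The helper name `chunk_count` is hypothetical.

```python
import math

def chunk_count(text_len: int, size: int, overlap: int) -> int:
    """Chunks produced when the window advances by `size - overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    return math.ceil(text_len / (size - overlap))

without = chunk_count(100_000, size=500, overlap=0)    # 200 chunks
with_10 = chunk_count(100_000, size=500, overlap=50)   # 223 chunks
with_50 = chunk_count(100_000, size=500, overlap=250)  # 400 chunks
```

A 10% overlap costs roughly 11% more vectors to embed and store; a 50% overlap doubles the index.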
Adjust chunk size and overlap to see where cuts land in a sample sentence.
One obvious fix: don't cut mid-sentence. Split on sentence boundaries — periods, exclamation marks, question marks followed by whitespace. Every chunk is at least one complete sentence.
python: sentence splitting with spaCy

```python
import spacy

# Requires the model to be installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Group consecutive sentences into chunks of `sentences_per_chunk`."""
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i : i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks
```
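If you want to avoid the spaCy dependency, the rule described above (a period, exclamation mark, or question mark followed by whitespace) can be approximated with a regular expression. This is a rough sketch with hypothetical helper names, and it is deliberately naive: abbreviations such as "Dr." will be split incorrectly.

```python
import re

# Split at whitespace that directly follows ., ! or ?
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter based on end punctuation plus whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]

def sentence_chunks_regex(text: str, sentences_per_chunk: int = 3) -> list[str]:
    """Group consecutive sentences into chunks of `sentences_per_chunk`."""
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + sentences_per_chunk])
            for i in range(0, len(sentences), sentences_per_chunk)]

good = split_sentences("It rained. Did it stop? Yes! Good.")
# good == ["It rained.", "Did it stop?", "Yes!", "Good."]
bad = split_sentences("Dr. Smith arrived.")
# bad == ["Dr.", "Smith arrived."]  (the abbreviation is treated as a sentence end)
```

The second example is the reason a trained sentence segmenter is usually worth the extra dependency.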
This is better than character splitting. But consider this passage about the water cycle:
With 2 sentences per chunk:
A concept often spans multiple sentences. "Those droplets form clouds" is meaningless without the preceding "cooling causes condensation into tiny droplets." Sentence splitting preserves grammar but not semantics.
One improvement: group sentences by adding a sliding window — each chunk overlaps the previous by N sentences. This creates redundancy but ensures related sentences often appear together in at least one chunk.
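python: sentence chunks with sliding window overlap

```python
def sliding_sentence_chunks(sentences: list[str], window: int = 3, step: int = 2) -> list[str]:
    # window=3: 3 sentences per chunk
    # step=2: advance by 2, so 1 sentence overlaps
    if not 0 < step <= window:
        raise ValueError("step must satisfy 0 < step <= window")
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + window]))
        if i + window >= len(sentences):
            break  # this window reached the end, so no trailing sentence is dropped
    return chunks
```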
LangChain popularized recursive splitting. The idea: try to split on the best delimiter first. A paragraph break is better than a sentence break, which is better than a word break, which is better than a character cut. Work down the hierarchy until chunks are small enough.
This is a greedy top-down algorithm: always prefer the largest semantic unit that fits within the target chunk size. A paragraph break is semantically meaningful — it signals a topic shift. You only descend to finer splits when you must.
python: recursive character splitting (LangChain-style)

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
# Older LangChain releases: from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,      # target characters per chunk
    chunk_overlap=50,    # overlap between adjacent chunks
    separators=[         # try these in order
        "\n\n",          # paragraph break (best)
        "\n",            # line break
        ". ",            # sentence end
        " ",             # word break
        "",              # character (worst case)
    ],
)

chunks = splitter.split_text(document)  # `document` is the full text as a str
# Returns a list of strings, each at most 500 chars
# Each one cut at the most natural boundary possible
```
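To make the greedy top-down idea concrete without the library, here is a simplified from-scratch sketch. It is not LangChain's implementation: it has no overlap, and the separator at each cut point is dropped. The function name `recursive_split` is hypothetical.

```python
def recursive_split(text: str, max_len: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ", "")) -> list[str]:
    """Split on the coarsest separator first; descend only for pieces that are still too long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if len(text) <= max_len:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard character cut
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for part in text.split(sep):
        candidate = current + sep + part if current else part
        if len(candidate) <= max_len:
            current = candidate          # merge small neighbours back together
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # This piece alone is too long: retry it with finer separators
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is a bit longer than the first one."
by_paragraph = recursive_split(sample, 50)
# by_paragraph == ["Para one is short.", "Para two is a bit longer than the first one."]
by_word = recursive_split(sample, 20)
```

With a 50-character budget the paragraph break is enough. With a 20-character budget the second paragraph is too long, so only that paragraph is re-split at word boundaries.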
You can customize the separator list for your content type. Code files should split on function boundaries before line breaks. Legal documents should split on sections before paragraphs. The hierarchy should match the document structure.
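For Python source specifically, one way to split on function boundaries is to let the parser find them rather than guessing with string separators. The sketch below uses the standard-library `ast` module and assumes Python 3.8+ for `end_lineno`. The helper name `split_on_definitions` is hypothetical, and this simple version keeps only top-level functions and classes, so imports and other module-level statements are left out.

```python
import ast

def split_on_definitions(source: str) -> list[str]:
    """Return one chunk per top-level function or class definition."""
    lines = source.splitlines()
    chunks = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Decorators sit above the def/class line, so start from the earliest one
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks

source = (
    "import os\n"
    "\n"
    "def a():\n"
    "    return 1\n"
    "\n"
    "class B:\n"
    "    def m(self):\n"
    "        return 2\n"
)
code_chunks = split_on_definitions(source)
# code_chunks[0] == "def a():\n    return 1"
```

A class stays in one chunk together with its methods, which is usually what you want for retrieval over code.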
Watch how a text segment is recursively subdivided. Click a segment to split it further.
Every strategy so far splits on structure: characters, sentences, paragraphs. None of them look at meaning. Semantic chunking does: it uses embeddings to detect when the topic shifts, and splits there.
The algorithm: embed each sentence individually. For each adjacent pair of sentences, compute the cosine distance between their embeddings. A high distance means "these two sentences are about different things" — a good split point. A low distance means they're about the same thing — keep them together.
The breakpoint threshold controls granularity. Set it low (e.g., 80th percentile of all distances) and you split frequently into small, focused chunks. Set it high (e.g., 95th percentile) and you only split on major topic changes.
python: semantic chunking with sentence-transformers

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], percentile: float = 90) -> list[str]:
    # Fewer than two sentences means there is no adjacent pair to compare
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # Step 1: embed all sentences
    embeddings = model.encode(sentences)

    # Step 2: cosine distance between consecutive sentences
    def cos_dist(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    distances = [cos_dist(embeddings[i], embeddings[i + 1])
                 for i in range(len(embeddings) - 1)]

    # Step 3: threshold at Nth percentile
    threshold = np.percentile(distances, percentile)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]

    # Step 4: split at breakpoints
    chunks, prev = [], 0
    for bp in breakpoints:
        chunks.append(" ".join(sentences[prev:bp + 1]))
        prev = bp + 1
    chunks.append(" ".join(sentences[prev:]))
    return chunks
```
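The effect of the percentile setting is easier to see on a fixed list of distances. The sketch below is dependency-free and uses made-up distance values purely for illustration. `percentile` is a hand-rolled linear-interpolation percentile (the same method NumPy uses by default), and `breakpoints_at` is a hypothetical helper.

```python
def percentile(values: list[float], p: float) -> float:
    """Linear-interpolation percentile of a non-empty list, for 0 <= p <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    s = sorted(values)
    k = (len(s) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def breakpoints_at(distances: list[float], p: float) -> list[int]:
    """Indices i where the gap between sentence i and i+1 exceeds the p-th percentile."""
    threshold = percentile(distances, p)
    return [i for i, d in enumerate(distances) if d > threshold]

# Made-up distances between 12 consecutive sentences (11 gaps)
distances = [0.1, 0.2, 0.15, 0.8, 0.12, 0.18, 0.9, 0.11, 0.14, 0.16, 0.7]

fine = breakpoints_at(distances, 50)    # 5 breakpoints, so 6 chunks
medium = breakpoints_at(distances, 80)  # [3, 6], so 3 chunks
coarse = breakpoints_at(distances, 90)  # [6], so 2 chunks
```

Because the threshold is relative to the document's own distance distribution, the same percentile yields different absolute cutoffs on different documents.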
Simulated cosine distances between consecutive sentences. The red line is the breakpoint threshold. Drag the threshold to see how it changes chunk count.
The best split points in a document are usually already marked — by the author. Markdown headers, HTML tags, PDF section titles, bullet list items, code blocks. These are explicit signals that a new topic is beginning. Use them.
Consider a Markdown API documentation file. It has headers at multiple levels. Under each header: prose, then a code example, then parameter tables. The right chunks are: one per section, not crossing header boundaries, with code blocks kept intact.
python: markdown-aware chunking

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter
# Older LangChain releases: from langchain.text_splitter import MarkdownHeaderTextSplitter

headers = [
    ("#", "h1"),    # top-level header
    ("##", "h2"),   # section header
    ("###", "h3"),  # subsection header
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)  # `markdown_doc` is the Markdown source as a str

# Each chunk includes metadata:
# chunk.metadata = {"h1": "API Reference", "h2": "Authentication", "h3": "OAuth Flow"}
# chunk.page_content = "The OAuth flow requires three steps..."
```
The metadata is the killer feature: every chunk knows which section it came from. When you retrieve a chunk, you can display "from: API Reference > Authentication > OAuth Flow" — much more useful than "from: document, chunk 47".
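To show where that breadcrumb comes from, here is a small dependency-free sketch of header-aware splitting. It is a simplified illustration rather than LangChain's implementation: it tracks the current header at each level, emits one section per header, and ignores `#` lines inside code fences. The names `markdown_sections` and `breadcrumb` are hypothetical.

```python
import re

FENCE = "`" * 3                                # code-fence marker
HEADER = re.compile(r"^(#{1,3})\s+(.*)$")      # h1 to h3

def markdown_sections(md: str) -> list[dict]:
    """Split Markdown at headers, recording the header path for each section."""
    sections, body, path, in_fence = [], [], {}, False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"path": [path[lvl] for lvl in sorted(path)], "content": text})
        body.clear()

    for line in md.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence            # '#' inside a fence is a comment, not a header
        match = None if in_fence else HEADER.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header replaces its own level and discards deeper ones
            path = {lvl: title for lvl, title in path.items() if lvl < level}
            path[level] = match.group(2).strip()
        else:
            body.append(line)
    flush()
    return sections

def breadcrumb(section: dict) -> str:
    return " > ".join(section["path"])

md = "\n".join([
    "# API Reference", "## Authentication", "### OAuth Flow",
    "The OAuth flow requires three steps.",
    FENCE, "# not a header", FENCE,
    "## Errors", "Errors return JSON.",
])
sections = markdown_sections(md)
# breadcrumb(sections[0]) == "API Reference > Authentication > OAuth Flow"
```

Storing the path alongside the text also lets you filter retrieval by section, or prepend the breadcrumb to the chunk before embedding it.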
| Format | Signals |
|---|---|
| Markdown | #, ##, ###, --- |
| HTML | h1-h6, <section>, <article> |
| PDF | font-size changes, bold lines |
| Code | function/class definitions |
| Legal | "Article N", "Section N" |
For PDFs, structure is implicit. You need to extract it from visual cues: font size, bold text, indentation, whitespace. Libraries like pdfminer and pymupdf expose these properties so you can reconstruct the logical structure.
python: PDF structure extraction with pymupdf

```python
import fitz  # pymupdf

def extract_sections(pdf_path: str, heading_size: float = 14.0) -> list[dict]:
    """Group text into sections, treating spans larger than `heading_size` pt as headings."""
    doc = fitz.open(pdf_path)
    sections = []
    current_section = {"title": "", "content": ""}
    for page in doc:
        blocks = page.get_text("dict")["blocks"]
        for block in blocks:
            if block["type"] != 0:  # 0 is a text block; skip images
                continue
            for line in block["lines"]:
                for span in line["spans"]:
                    text = span["text"].strip()
                    if not text:
                        continue
                    if span["size"] > heading_size:  # likely a heading
                        if current_section["content"]:
                            sections.append(current_section)
                            current_section = {"title": text, "content": ""}
                        else:
                            # Heading split across several spans: extend the title
                            current_section["title"] = (current_section["title"] + " " + text).strip()
                    else:
                        current_section["content"] += text + " "
    if current_section["content"]:
        sections.append(current_section)
    return sections
```
No chunking strategy is complete without choosing how big to make the chunks. This is not a free parameter — it has measurable consequences on retrieval precision and recall.
The sweet spot depends on four factors:
| Factor | Effect on optimal size |
|---|---|
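| Embedding model context window | text-embedding-3-small accepts up to 8,191 tokens per input; many open-source models accept far fewer (all-MiniLM-L6-v2 truncates at 256 word pieces by default). Over-length input is often truncated silently, so the tail of the chunk is ignored. |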
| Query type | Specific factual queries ("what year was X founded?") prefer small chunks. Thematic queries ("explain the philosophy of X") prefer large chunks. |
| Document density | Dense technical text (papers) needs smaller chunks — each sentence is load-bearing. Narrative prose can use larger chunks. |
| LLM context window | You'll retrieve k chunks and feed them to the LLM. If k=5 and your LLM context is 4k tokens, each chunk can be at most 800 tokens, and less once you reserve room for the prompt and the answer. |
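The last row is simple arithmetic, sketched below. The helper name `max_chunk_tokens` and the `reserved` parameter (tokens set aside for the system prompt, the question, and the answer) are assumptions added for illustration.

```python
def max_chunk_tokens(context_window: int, k: int, reserved: int = 0) -> int:
    """Largest chunk size, in tokens, that lets k retrieved chunks fit in the context window."""
    if k <= 0:
        raise ValueError("k must be positive")
    available = context_window - reserved
    if available <= 0:
        raise ValueError("reserved tokens leave no room for chunks")
    return available // k

upper_bound = max_chunk_tokens(4000, k=5)               # 800
realistic = max_chunk_tokens(4000, k=5, reserved=1000)  # 600
```

Treat the result as a ceiling, not a target: the other three factors usually push the best size well below it.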
One advanced strategy: hierarchical chunking. Index the same document at two granularities — small chunks (128 tokens) for precise retrieval, large chunks (1024 tokens) for context. Retrieve using small chunks, but return the parent large chunk to the LLM. You get precision AND context.
python: hierarchical chunk indexing

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_hierarchical_index(text: str):
    # Note: chunk_size counts characters by default, not tokens
    large_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
    small_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=20)

    # Large chunks for LLM context
    large_chunks = large_splitter.split_text(text)

    # Small chunks for retrieval precision, cut from each parent so that
    # every small chunk lies inside exactly one large chunk
    small_chunks = []
    chunk_map = []  # chunk_map[i] is the index of the parent of small_chunks[i]
    for parent_idx, large in enumerate(large_chunks):
        for small in small_splitter.split_text(large):
            small_chunks.append(small)
            chunk_map.append(parent_idx)

    # At query time: retrieve small chunks, return their parent large chunks
    return small_chunks, large_chunks, chunk_map
```
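The query-time half of this strategy can be sketched as follows. It assumes you already have the retrieved small chunks in rank order and a `chunk_map` that maps each of them to a parent index. The key can be a chunk's text (for a dict) or its position (for a list). The helper name `parents_for_hits` is hypothetical.

```python
def parents_for_hits(hit_keys, chunk_map, large_chunks: list[str]) -> list[str]:
    """Map retrieved small chunks to their parent chunks, deduplicated, in rank order."""
    seen = set()
    parents = []
    for key in hit_keys:
        parent_idx = chunk_map[key]
        if parent_idx not in seen:       # several small hits often share one parent
            seen.add(parent_idx)
            parents.append(large_chunks[parent_idx])
    return parents

large_chunks = ["L0", "L1", "L2"]

# chunk_map as a list: position of the small chunk gives the parent index
by_position = parents_for_hits([3, 0, 1, 4], [0, 0, 1, 2, 2], large_chunks)
# by_position == ["L2", "L0"]

# chunk_map as a dict: small chunk text gives the parent index
by_text = parents_for_hits(["b", "a", "b"], {"a": 0, "b": 1}, large_chunks)
# by_text == ["L1", "L0"]
```

Deduplicating matters because the top small hits frequently come from the same parent, and sending that parent twice wastes context.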
Simulated precision and recall curves as chunk size varies. The sweet spot maximizes F1 (harmonic mean of precision and recall).
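For reference, F1 is computed as 2PR / (P + R). The sketch below shows that formula and how you might pick a chunk size from your own measurements. `best_chunk_size` is a hypothetical helper that expects a dict mapping chunk size to a (precision, recall) pair you have measured on your own evaluation set.

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def best_chunk_size(measurements: dict[int, tuple[float, float]]) -> int:
    """Return the chunk size whose (precision, recall) pair has the highest F1."""
    if not measurements:
        raise ValueError("measurements must not be empty")
    return max(measurements, key=lambda size: f1(*measurements[size]))
```

The harmonic mean punishes imbalance: a configuration with very high precision but poor recall scores lower than one that is moderately good at both.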
See all four strategies side by side on real text. Adjust parameters and watch the chunks change in real time. Enter a sample query to see which strategy retrieves the most relevant chunk.
Chunking is not the goal — it's the foundation. Every downstream component in a RAG system depends on chunk quality. Let's trace the full pipeline and see where chunking quality propagates.
There's a useful analogy: chunking is like writing a good index for a textbook. A bad index (wrong entries, too broad, too fine-grained) makes the book useless even if the content is brilliant. A good index makes every concept findable in seconds.
| Strategy | Speed | Quality | Metadata | Best for |
|---|---|---|---|---|
| Fixed-size | Fastest | Lowest | None | Prototyping baselines |
| Sentence | Fast | Low-medium | None | Simple prose, no structure |
| Recursive | Fast | Medium | None | General-purpose default |
| Semantic | Slow | High | None | Dense, thematic text |
| Structure-aware | Medium | High | Rich | Markdown, HTML, PDFs with structure |
| Hierarchical | Slowest | Highest | Some | Production RAG systems |
"The devil is in the details, and with RAG, the detail is chunking."
— Practical wisdom from production deployments