RAG FOUNDATIONS

Text Chunking

How to slice documents so retrieval actually works.

Prerequisites: Can read Python code. That's it.
9 Chapters | 6 Simulations | 0 Assumed Knowledge

Chapter 0: Why Chunk?

You have a 50-page PDF — a technical manual, a research paper, a company handbook. A user asks: "What's the warranty policy for the XR-7 motor?" You want the AI to answer from the document, not hallucinate.

The naive approach: feed the whole PDF into the language model. GPT-4 Turbo has a context window of roughly 128k tokens, and a 50-page PDF is roughly 25,000 tokens, so one document fits. But your system needs to handle 500 PDFs. That's about 12.5 million tokens, far beyond that window. And even if it all fit, the model's attention would be spread over millions of tokens and the relevant sentence would drown in noise.

So you need to embed the documents — turn them into vectors in a high-dimensional space where "similar meaning" means "nearby in space." Then when a user asks a question, you embed the question and find the nearest document chunks. This is Retrieval-Augmented Generation (RAG).

The core problem: you can't embed a 50-page PDF as one vector. One vector can only hold one "meaning." A PDF has thousands of distinct ideas. If you compress them all into a single point, the point becomes meaningless — everything averages out.
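
A toy illustration of that averaging effect. The three-dimensional "topic" vectors below are made up for clarity (real embeddings have hundreds of dimensions and are never this clean), so treat this as a sketch of the geometry rather than a measurement.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical topic vectors, one per idea in the document.
topics = {
    "warranty": [1.0, 0.0, 0.0],
    "installation": [0.0, 1.0, 0.0],
    "troubleshooting": [0.0, 0.0, 1.0],
}

# One vector for the whole document: the mean of its topic vectors.
doc_vector = [sum(col) / len(topics) for col in zip(*topics.values())]

query = [1.0, 0.0, 0.0]  # a question purely about the warranty

sim_to_doc = cosine(query, doc_vector)
sim_to_chunk = cosine(query, topics["warranty"])
print(f"whole-document vector: {sim_to_doc:.2f}")
print(f"warranty chunk vector: {sim_to_chunk:.2f}")
```

The averaged vector sits equally far from every topic, so it matches the warranty question no better than it would match an installation question. The per-topic vector matches it exactly.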

You could embed each word separately. But a word alone has no context. The word "bank" means nothing without "river bank" vs "savings bank." You need enough surrounding text to disambiguate.

The solution is to split the document into chunks — segments of text that each capture one coherent idea, with enough context to be understood on their own. Each chunk gets one embedding vector. At query time, you retrieve the most relevant chunks and feed them to the model.
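
The retrieve-by-nearest-chunk idea can be sketched without any model at all. Here `embed` is a stand-in that counts words (an assumption made so the example runs with the standard library only; a real system would call an embedding model), and the three sample chunks are invented text.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (not a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9-]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

# Invented sample chunks, one idea each.
chunks = [
    "The XR-7 motor is covered by a two-year warranty.",
    "Mount the XR-7 motor on a level surface before wiring.",
    "Contact support if the controller displays error E4.",
]

top = retrieve("What's the warranty policy for the XR-7 motor?", chunks, k=1)
print(top[0])
```

Word counts only match on shared vocabulary, which is exactly the limitation real embeddings remove. The shape of the pipeline (one vector per chunk, rank by similarity, keep the top k) stays the same.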

The chunking problem: what is "one coherent idea"? A sentence? A paragraph? A section? The answer is: it depends on your content, your embedding model, and your query patterns. The rest of this lesson maps out every strategy for answering that question.
Embedding Space: One Vector vs. Many Chunks

See why a single vector for a whole document loses information. Each colored dot is a distinct topic from the document.

Why can't you embed a whole 500-document corpus as a single vector and use it for retrieval?

Chapter 1: Fixed-Size Chunking

The simplest strategy: count characters (or tokens) and cut every N of them. Like tearing a book into equal-width strips. Fast, deterministic, requires no understanding of the content.

Consider this paragraph from a climate science paper:

"The Greenland ice sheet has been losing mass at an accelerating rate since the 1990s. Between 2002 and 2020, it shed approximately 280 billion tons of ice per year. This loss is driven primarily by surface melt and glacial discharge, with surface melt accounting for roughly 60% of total mass loss."

With a chunk size of roughly 100 characters and no overlap, the cuts land like this:

chunks (~100-char, no overlap)
# Chunk 1
"The Greenland ice sheet has been losing mass at an accelerating rate since the 1990s. Between 2"

# Chunk 2
"002 and 2020, it shed approximately 280 billion tons of ice per year. This loss is driven primar"

# Chunk 3
"ily by surface melt and glacial discharge, with surface melt accounting for roughly 60% of total"

# Chunk 4
" mass loss."

Chunk 1 ends mid-number ("2002" is split). Chunk 2 starts with "002", which is meaningless without context. If someone queries "how much ice does Greenland lose per year?", the figure "280 billion tons" sits in Chunk 2, but its subject (the Greenland ice sheet) and the start of its date range are in Chunk 1. Neither chunk alone has the full fact.

Overlap is the band-aid: repeat the last K characters from chunk N at the start of chunk N+1. This reduces mid-fact splits but inflates your index and creates duplicate information in adjacent chunks.

python — fixed-size chunking with overlap
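def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `size` characters that share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # the last window already reached the end of the text
        start += size - overlap  # advance by less than `size` so windows overlap
    return chunks

# Example
text = "The Greenland ice sheet..."
chunks = fixed_chunk(text, size=500, overlap=50)
# Result: adjacent chunks share 50 characters at each boundary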
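
To put a number on what overlap costs, here is a small self-contained sketch that chunks the Greenland paragraph at 100 characters with a few overlap settings and reports how much text ends up stored. The helper is redefined so the block runs on its own, and the overlap values are arbitrary picks for illustration.

```python
def fixed_chunk(text: str, size: int, overlap: int = 0) -> list[str]:
    """Cut text into windows of `size` characters sharing `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks

paragraph = (
    "The Greenland ice sheet has been losing mass at an accelerating rate "
    "since the 1990s. Between 2002 and 2020, it shed approximately 280 billion "
    "tons of ice per year. This loss is driven primarily by surface melt and "
    "glacial discharge, with surface melt accounting for roughly 60% of total "
    "mass loss."
)

for overlap in (0, 20, 50):
    chunks = fixed_chunk(paragraph, size=100, overlap=overlap)
    stored = sum(len(c) for c in chunks)
    print(f"overlap={overlap:>2}: {len(chunks)} chunks, "
          f"{stored} chars stored ({stored / len(paragraph):.2f}x the original)")
```

With no overlap the chunks reassemble into the original text exactly. Every increase in overlap buys more shared context at the boundaries and pays for it in extra vectors to embed and store.
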
When does fixed-size chunking work? Almost never in production. It's useful for prototyping because it requires zero analysis. But it routinely splits sentences, facts, and concepts. Use it only when you need a quick baseline to benchmark against better methods.
Fixed-Size Chunking — Live Preview

Adjust chunk size and overlap to see where cuts land in a sample sentence.

Why does overlap help with fixed-size chunking but not fully solve the problem?

Chapter 2: Sentence-Based Chunking

One obvious fix: don't cut mid-sentence. Split on sentence boundaries — periods, exclamation marks, question marks followed by whitespace. Every chunk is at least one complete sentence.

python — sentence splitting with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm

def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    doc = nlp(text)
    # doc.sents yields one span per detected sentence
    sentences = [sent.text.strip() for sent in doc.sents]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i : i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks

This is better than character splitting. But consider this passage about the water cycle:

"Water evaporates from oceans when solar radiation heats the surface. The vapor rises into the atmosphere. It cools at altitude. This cooling causes condensation into tiny droplets. Those droplets form clouds. Eventually precipitation returns water to the surface."

With 2 sentences per chunk:

Chunk 1: "Water evaporates from oceans when solar radiation heats the surface. The vapor rises into the atmosphere."
Chunk 2: "It cools at altitude. This cooling causes condensation into tiny droplets."
Chunk 3: "Those droplets form clouds. Eventually precipitation returns water to the surface."

Chunk 3 starts "Those droplets" — which droplets? The reference is to Chunk 2. Context is broken across chunk boundaries.

A concept often spans multiple sentences. "Those droplets form clouds" is meaningless without the preceding "cooling causes condensation into tiny droplets." Sentence splitting preserves grammar but not semantics.

Key insight: sentence boundaries are syntactic, not semantic. The concept of cloud formation spans four sentences in this example. Any chunking strategy that ignores meaning will sometimes split coherent ideas — and it will never know it happened.
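
A chunker can't detect a broken idea, but a rough lint pass can flag the most obvious symptom: a chunk that opens with a word pointing back at earlier text. The word list below is an assumption and the check is only a heuristic (it will miss many broken references and occasionally flag a harmless one), shown here on the water-cycle passage.

```python
import re

# Assumed list of words that usually point back at earlier text.
BACK_REFERENCES = {"it", "this", "that", "these", "those", "they", "such"}

def group_sentences(sentences: list[str], per_chunk: int) -> list[str]:
    """Join every `per_chunk` consecutive sentences into one chunk."""
    return [" ".join(sentences[i:i + per_chunk])
            for i in range(0, len(sentences), per_chunk)]

def opens_with_back_reference(chunk: str) -> bool:
    """True if the chunk's first word is in BACK_REFERENCES."""
    first_word = re.match(r"[A-Za-z]+", chunk)
    return bool(first_word) and first_word.group().lower() in BACK_REFERENCES

sentences = [
    "Water evaporates from oceans when solar radiation heats the surface.",
    "The vapor rises into the atmosphere.",
    "It cools at altitude.",
    "This cooling causes condensation into tiny droplets.",
    "Those droplets form clouds.",
    "Eventually precipitation returns water to the surface.",
]

chunks = group_sentences(sentences, per_chunk=2)
flagged = [c for c in chunks if opens_with_back_reference(c)]
for c in flagged:
    print("dangling opener:", c)
```

Here the second and third chunks are flagged ("It cools..." and "Those droplets..."), which matches the problem described above. A count of flagged chunks is a cheap signal for comparing two chunking configurations on the same corpus.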

One improvement: group sentences by adding a sliding window — each chunk overlaps the previous by N sentences. This creates redundancy but ensures related sentences often appear together in at least one chunk.

python — sentence chunks with sliding window overlap
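def sliding_sentence_chunks(sentences: list[str], window: int = 3, step: int = 2) -> list[str]:
    # window=3: 3 sentences per chunk
    # step=2: advance by 2, so window - step = 1 sentence overlaps
    if step < 1:
        raise ValueError("step must be at least 1")
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + window]))
        if i + window >= len(sentences):
            break  # this window already covers the final sentence
    return chunks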
Sentence-based chunking is better than fixed-size because it never cuts mid-sentence. But what problem remains?

Chapter 3: Recursive Character Splitting

LangChain popularized this. The idea: try to split on the best delimiter first — a paragraph break is better than a sentence break, which is better than a word break, which is better than a character cut. Work down the hierarchy until chunks are small enough.

Try: "\n\n" (paragraph break)
If chunk fits → done
↓ chunk still too large
Try: "\n" (line break)
If chunk fits → done
↓ chunk still too large
Try: ". " (sentence end)
If chunk fits → done
↓ chunk still too large
Try: " " (word break)
If chunk fits → done
↓ chunk still too large
Cut at character boundary
Last resort

This is a greedy top-down algorithm: always prefer the largest semantic unit that fits within the target chunk size. A paragraph break is semantically meaningful — it signals a topic shift. You only descend to finer splits when you must.
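
The descent through the separator hierarchy can be sketched in plain Python. This is a simplified illustration and not LangChain's implementation: it has no overlap, and separators that fall on a chunk boundary are dropped instead of kept. The function name and sample text are made up for the example.

```python
def recursive_split(text: str, max_len: int, separators: list[str]) -> list[str]:
    """Split `text` into pieces of at most `max_len` characters,
    preferring the earliest (coarsest) separator that works."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # Last resort: hard cut at character boundaries.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_len, finer)

    chunks: list[str] = []
    current = ""
    for part in text.split(sep):
        candidate = part if not current else current + sep + part
        if len(candidate) <= max_len:
            current = candidate  # greedily pack parts into one chunk
            continue
        if current:
            chunks.append(current)
        if len(part) <= max_len:
            current = part
        else:
            # This part alone is too large: descend to the next separator.
            chunks.extend(recursive_split(part, max_len, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks

SEPARATORS = ["\n\n", "\n", ". ", " "]

document = (
    "Chunking splits documents into retrievable pieces. Each piece gets one vector.\n\n"
    "Paragraph breaks are the preferred cut. Sentence ends come next. "
    "Word breaks are used only when a sentence is still too long."
)

pieces = recursive_split(document, max_len=100, separators=SEPARATORS)
for p in pieces:
    print(len(p), repr(p))
```

The first paragraph fits within the limit and survives whole. The second is too long, so only that paragraph is re-split at sentence ends. Finer separators are consulted only for the parts that need them.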

python — recursive character splitting (LangChain-style)
# Older LangChain releases expose this class as langchain.text_splitter
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # target characters per chunk
    chunk_overlap=50,     # overlap between adjacent chunks
    separators=[          # try these in order
        "\n\n",           # paragraph break (best)
        "\n",             # line break
        ". ",             # sentence end
        " ",              # word break
        ""                # character (worst case)
    ]
)

chunks = splitter.split_text(document)  # document: a str holding the text to split
# Returns a list of strings, each at most 500 characters
# Each one cut at the most natural boundary possible
Why hierarchy matters: it's not just about cleaner cuts. A paragraph break signals intentional topic separation by the author. A sentence break signals grammatical completion. A word break at least preserves morphology. Character breaks are pure noise. By trying the most meaningful separator first, you preserve authorial intent as much as possible.

You can customize the separator list for your content type. Code files should split on function boundaries before line breaks. Legal documents should split on sections before paragraphs. The hierarchy should match the document structure.
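
For code, "split on function boundaries first" can be as simple as cutting before each top-level definition. A minimal standard-library sketch for Python source follows. The function name is made up, and decorators, `async def`, and nested definitions are deliberately not handled.

```python
import re

def split_on_definitions(source: str) -> list[str]:
    """Cut Python source before each top-level `def` or `class` line."""
    # The zero-width lookahead keeps the keyword with the block it introduces.
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", source)
    return [p for p in parts if p.strip()]

source = (
    "import math\n"
    "\n"
    "def area(r):\n"
    "    return math.pi * r * r\n"
    "\n"
    "class Circle:\n"
    "    def __init__(self, r):\n"
    "        self.r = r\n"
)

blocks = split_on_definitions(source)
for b in blocks:
    print(repr(b))
```

The indented `def __init__` stays inside its class because only definitions at the start of a line trigger a cut. Blocks that are still too large can then be handed to a finer separator list, as in the hierarchy above.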

Recursive Splitting — The Decision Tree

Watch how a text segment is recursively subdivided. Click a segment to split it further.

In recursive character splitting, why is "\n\n" tried before ". "?

Chapter 4: Semantic Chunking

Every strategy so far splits on structure — characters, sentences, paragraphs. None of them look at meaning. Semantic chunking does: it uses embeddings to detect when the topic shifts, and splits there.

The algorithm: embed each sentence individually. For each adjacent pair of sentences, compute the cosine distance between their embeddings. A high distance means "these two sentences are about different things" — a good split point. A low distance means they're about the same thing — keep them together.

1. Split into sentences: get S₁, S₂, ..., Sₙ
2. Embed each sentence: get vectors e₁, e₂, ..., eₙ
3. Compute consecutive distances: d(i) = cosine_distance(eᵢ, eᵢ₊₁)
4. Find breakpoints: split where d(i) > threshold

The breakpoint threshold controls granularity. Set it low (e.g., 80th percentile of all distances) and you split frequently into small, focused chunks. Set it high (e.g., 95th percentile) and you only split on major topic changes.
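
The effect of the percentile on granularity is easy to see with numbers. The distances below are made up, and `percentile` is a small standard-library helper written for this sketch (it uses linear interpolation, which is also the default method of `numpy.percentile`).

```python
def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def breakpoints(distances: list[float], pct: float) -> list[int]:
    """Indices i where the gap between sentence i and i+1 exceeds the threshold."""
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]

# Made-up distances between 11 consecutive sentences (10 gaps).
distances = [0.12, 0.15, 0.71, 0.10, 0.18, 0.22, 0.64, 0.09, 0.14, 0.55]

for pct in (50, 80, 95):
    cuts = breakpoints(distances, pct)
    print(f"{pct}th percentile -> {len(cuts)} breakpoints, {len(cuts) + 1} chunks")
```

Because the threshold is a percentile of the document's own distances, roughly the top (100 - N)% of gaps become cuts whatever the absolute values are. A document with no real topic shift still gets split at its largest gaps.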

python — semantic chunking with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], percentile: float = 90) -> list[str]:
    if len(sentences) < 2:
        return list(sentences)  # nothing to compare, so nothing to split

    # Step 1: embed all sentences (encode() batches internally)
    embeddings = model.encode(sentences)

    # Step 2: cosine distance between consecutive sentences
    def cos_dist(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    distances = [cos_dist(embeddings[i], embeddings[i+1])
                 for i in range(len(embeddings) - 1)]

    # Step 3: threshold at Nth percentile
    threshold = np.percentile(distances, percentile)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]

    # Step 4: split at breakpoints (bp is the last sentence of its chunk)
    chunks, prev = [], 0
    for bp in breakpoints:
        chunks.append(" ".join(sentences[prev:bp+1]))
        prev = bp + 1
    chunks.append(" ".join(sentences[prev:]))
    return chunks
Cost: semantic chunking requires embedding every sentence at index time, so the work grows linearly with the number of sentences. For a 50-page PDF with ~500 sentences, that's 500 sentences to push through an embedding API or a local model, usually in batches rather than one call each. For most applications this is acceptable, but for millions of documents it adds up.
Semantic Distance Plot — Finding Breakpoints

Simulated cosine distances between consecutive sentences. The red line is the breakpoint threshold. Drag the threshold to see how it changes chunk count.

In semantic chunking, a high cosine distance between two consecutive sentence embeddings means:

Chapter 5: Document-Structure-Aware Chunking

The best split points in a document are usually already marked — by the author. Markdown headers, HTML tags, PDF section titles, bullet list items, code blocks. These are explicit signals that a new topic is beginning. Use them.

Consider a Markdown API documentation file. It has headers at multiple levels. Under each header: prose, then a code example, then parameter tables. The right chunks are: one per section, not crossing header boundaries, with code blocks kept intact.

python — markdown-aware chunking
# Older LangChain releases expose this class as langchain.text_splitter
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [
    ("#",   "h1"),   # top-level header
    ("##",  "h2"),   # section header
    ("###", "h3"),   # subsection header
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)  # markdown_doc: a str of Markdown

# Each chunk in chunks is a Document that carries metadata, for example:
# chunk.metadata = {"h1": "API Reference", "h2": "Authentication", "h3": "OAuth Flow"}
# chunk.page_content = "The OAuth flow requires three steps..."

The metadata is the killer feature: every chunk knows which section it came from. When you retrieve a chunk, you can display "from: API Reference > Authentication > OAuth Flow" — much more useful than "from: document, chunk 47".
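
To show where that metadata comes from, here is a minimal standard-library sketch of a header-aware splitter plus the breadcrumb formatting. It is not LangChain's implementation: it ignores fenced code blocks (a `#` comment at the start of a line inside one would be misread as a header), and the sample document is invented.

```python
import re

HEADER = re.compile(r"^(#{1,3})\s+(.*)$")

def split_by_headers(markdown: str) -> list[dict]:
    """Return one chunk per header section, each tagged with its header path."""
    chunks, path, lines = [], {}, []

    def flush():
        body = "\n".join(lines).strip()
        if body:
            chunks.append({"metadata": dict(path), "text": body})
        lines.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if not match:
            lines.append(line)
            continue
        flush()
        level = len(match.group(1))
        # Entering a header drops any same-level or deeper headers from the path.
        for key in [k for k in path if int(k[1:]) >= level]:
            del path[key]
        path[f"h{level}"] = match.group(2).strip()
    flush()
    return chunks

def breadcrumb(metadata: dict, levels=("h1", "h2", "h3")) -> str:
    """Join the header levels present in `metadata` into a display path."""
    return " > ".join(metadata[level] for level in levels if level in metadata)

doc = (
    "# API Reference\n\n"
    "## Authentication\n\n"
    "### OAuth Flow\n"
    "The OAuth flow requires three steps.\n\n"
    "## Rate Limits\n"
    "Requests are throttled per key.\n"
)

sections = split_by_headers(doc)
for s in sections:
    print("from:", breadcrumb(s["metadata"]), "|", s["text"])
```

The header path resets correctly when the document moves from a subsection back up to a new section, so the "Rate Limits" chunk does not inherit "OAuth Flow".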

Common structure signals

Format | Signals
Markdown | #, ##, ###, ---
HTML | h1-h6, <section>, <article>
PDF | font-size changes, bold lines
Code | function/class definitions
Legal | "Article N", "Section N"

Special content types

Code blocks should never be split mid-function. A function is the atomic unit of code — split on function boundaries, or keep entire files as one chunk if small enough.
Tables should stay intact. A row is meaningless without its header. Extract tables as structured data, not raw text.
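
One way to keep a row attached to its header is to parse the table and emit each row with its column names. A minimal sketch for simple pipe-delimited Markdown tables follows. The sample table is invented, and escaped pipes or multi-line cells are not handled.

```python
def table_rows(markdown_table: str) -> list[dict[str, str]]:
    """Parse a simple pipe-delimited Markdown table into one dict per row."""
    def cells(line: str) -> list[str]:
        return [c.strip() for c in line.strip().strip("|").split("|")]

    lines = [ln for ln in markdown_table.strip().splitlines() if ln.strip()]
    header = cells(lines[0])
    rows = []
    for line in lines[2:]:  # lines[1] is the |---|---| divider
        rows.append(dict(zip(header, cells(line))))
    return rows

def row_as_text(row: dict[str, str]) -> str:
    """Render a row so that every value carries its column name."""
    return "; ".join(f"{k}: {v}" for k, v in row.items())

table = """
| Parameter | Type | Default |
|-----------|------|---------|
| timeout   | int  | 30      |
| retries   | int  | 3       |
"""

rows = table_rows(table)
for r in rows:
    print(row_as_text(r))
```

Each rendered row is self-describing, so it can be embedded on its own or stored as structured metadata next to a chunk that holds the whole table.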

For PDFs, structure is implicit. You need to extract it from visual cues: font size, bold text, indentation, whitespace. Libraries like pdfminer and pymupdf expose these properties so you can reconstruct the logical structure.

python — PDF structure extraction with pymupdf
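import fitz  # PyMuPDF (pip install pymupdf)

def extract_sections(pdf_path, heading_size=14):
    """Group text into sections, treating spans larger than heading_size pt as headings."""
    sections = []
    current_section = {"title": "", "content": ""}

    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("dict")["blocks"]
            for block in blocks:
                if block["type"] != 0:  # 0 is a text block; skip images
                    continue
                for line in block["lines"]:
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if not text:
                            continue
                        if span["size"] > heading_size:  # likely a heading; tune per document
                            if current_section["content"]:
                                sections.append(current_section)
                                current_section = {"title": text, "content": ""}
                            else:
                                # heading continues across spans: extend the title
                                current_section["title"] = (current_section["title"] + " " + text).strip()
                        else:
                            current_section["content"] += text + " "

    if current_section["content"]:
        sections.append(current_section)
    return sections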
What advantage does document-structure-aware chunking offer that semantic chunking doesn't?

Chapter 6: Chunk Size vs. Retrieval Quality

No chunking strategy is complete without choosing how big to make the chunks. This choice is not free: it has measurable consequences for retrieval precision and recall.

Too small: each chunk has too little context. A 50-character chunk might say "280 billion tons per year" — but per year of what? Which substance? Which measurement? The embedding of this fragment is ambiguous. Retrieval will find it for the wrong queries and miss it for the right ones.
Too large: each chunk covers multiple topics. Its embedding vector is a compromise — a blurry average of several ideas. A query about "glacial discharge" will retrieve a 2000-character chunk about the entire water cycle. The answer is buried in noise. The model has to read more irrelevant text to find the needle.

The sweet spot depends on four factors:

Factor | Effect on optimal size
Embedding model context window | text-embedding-3-small accepts up to 8,191 tokens per input. Many local models accept far fewer (all-MiniLM-L6-v2 truncates at 256 word pieces by default). Text beyond the limit is either rejected or silently dropped, so the tail of an oversized chunk never reaches the vector.
Query type | Specific factual queries ("what year was X founded?") prefer small chunks. Thematic queries ("explain the philosophy of X") prefer large chunks.
Document density | Dense technical text (papers) needs smaller chunks because each sentence is load-bearing. Narrative prose can use larger chunks.
LLM context window | You'll retrieve k chunks and feed them to the LLM. If k=5 and your LLM context is 4k tokens, each chunk can be at most 800 tokens, and less in practice once you reserve room for the prompt and the answer.
Empirical rule of thumb: 256–512 tokens per chunk works for most general-purpose RAG systems. Start here, then measure: run your real queries against a test set, check if the correct chunk is in the top-k retrieved, and adjust. There is no substitute for this measurement — theory can only narrow the search space.
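
The measurement itself is small. A minimal sketch of hit rate at k, assuming you have a hand-labelled test set that maps each query to the id of the chunk containing its answer; the queries, chunk ids, and rankings below are all hypothetical.

```python
def hit_rate_at_k(results: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    hits = sum(1 for query, chunk_id in expected.items()
               if chunk_id in results.get(query, [])[:k])
    return hits / len(expected)

# Hypothetical test set: query -> id of the chunk that contains the answer.
expected = {
    "How much ice does Greenland lose per year?": "c2",
    "What drives the mass loss?": "c3",
    "When did the loss start accelerating?": "c1",
}

# Hypothetical retriever output: query -> chunk ids, best match first.
results = {
    "How much ice does Greenland lose per year?": ["c2", "c1", "c3"],
    "What drives the mass loss?": ["c1", "c3", "c2"],
    "When did the loss start accelerating?": ["c3", "c2", "c1"],
}

for k in (1, 2, 3):
    print(f"hit rate @ {k}: {hit_rate_at_k(results, expected, k):.2f}")
```

Rerun the same test set after every change to chunk size or strategy. Chunk ids change when you re-chunk, so in practice the labels are usually tied to an answer span or source passage and mapped to chunk ids per run.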

One advanced strategy: hierarchical chunking. Index the same document at two granularities — small chunks (128 tokens) for precise retrieval, large chunks (1024 tokens) for context. Retrieve using small chunks, but return the parent large chunk to the LLM. You get precision AND context.

python — hierarchical chunk indexing
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_hierarchical_index(text):
    # Note: chunk_size counts characters by default; pass a token-counting
    # length_function to size chunks in tokens instead.

    # Large chunks for LLM context
    large_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
    large_chunks = large_splitter.split_text(text)

    # Small chunks for retrieval precision, cut from each large chunk so that
    # every small chunk has exactly one known parent
    small_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=20)
    small_chunks = []
    chunk_map = {}  # index into small_chunks -> index into large_chunks
    for i, large in enumerate(large_chunks):
        for small in small_splitter.split_text(large):
            chunk_map[len(small_chunks)] = i
            small_chunks.append(small)

    # At query time: retrieve small chunks, return their parent large chunks
    return small_chunks, large_chunks, chunk_map
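
The query-time half of this strategy is a lookup plus de-duplication. A minimal sketch, assuming the retriever returns positions of small chunks (best match first) and that a mapping from small-chunk position to parent position is available. The names `parents_for` and `parent_of` and the sample data are illustrative.

```python
def parents_for(hit_ids: list[int], parent_of: dict[int, int], large_chunks: list[str]) -> list[str]:
    """Map retrieved small-chunk ids to their parent chunks, best hit first, no repeats."""
    seen: set[int] = set()
    parents = []
    for small_id in hit_ids:
        parent_id = parent_of[small_id]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(large_chunks[parent_id])
    return parents

# Illustrative data: four small chunks, two per large chunk.
large_chunks = ["<large chunk A>", "<large chunk B>"]
parent_of = {0: 0, 1: 0, 2: 1, 3: 1}

# Suppose retrieval ranked small chunks 1, 0 and 3 highest.
context = parents_for([1, 0, 3], parent_of, large_chunks)
print(context)
```

De-duplicating matters because neighbouring small chunks often score well together. Without it the LLM would receive the same parent text more than once and waste context budget.
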
Chunk Size vs. Retrieval Performance

Simulated precision and recall curves as chunk size varies. The sweet spot maximizes F1 (harmonic mean of precision and recall).

Why does a very large chunk size hurt retrieval precision?

Chapter 7: Interactive Chunking Lab

See all four strategies side by side on real text. Adjust parameters and watch the chunks change in real time. Enter a sample query to see which strategy retrieves the most relevant chunk.

Chunking Strategy Comparison Lab

Chapter 8: Connections

Chunking is not the goal — it's the foundation. Every downstream component in a RAG system depends on chunk quality. Let's trace the full pipeline and see where chunking quality propagates.

Documents
Raw PDFs, HTML, Markdown, code
↓ chunking (this lesson)
Chunks
Text segments, each ~one coherent idea
↓ embedding model (e.g. text-embedding-3-small)
Vectors
High-dim float arrays, one per chunk
↓ vector database (Pinecone, Chroma, Weaviate)
Index
ANN index for fast approximate nearest neighbor search
↓ retrieval (top-k cosine similarity)
Retrieved chunks
The k most relevant segments for a query
↓ LLM (GPT-4, Claude, Llama)
Answer
Grounded response citing retrieved context
Chunk quality is the #1 determinant of RAG quality. If a chunk splits a fact across two segments, the two halves may never be retrieved together. If chunks are too large, the embedding becomes too general. If chunks are too small, they lack context. The embedding model and vector database cannot fully compensate for bad chunking: garbage in, garbage out.

There's a useful analogy: chunking is like writing a good index for a textbook. A bad index (wrong entries, too broad, too fine-grained) makes the book useless even if the content is brilliant. A good index makes every concept findable in seconds.

Strategy Comparison

Strategy | Speed | Quality | Metadata | Best for
Fixed-size | Fastest | Lowest | None | Prototyping baselines
Sentence | Fast | Low-medium | None | Simple prose, no structure
Recursive | Fast | Medium | None | General-purpose default
Semantic | Slow | High | None | Dense, thematic text
Structure-aware | Medium | High | Rich | Markdown, HTML, PDFs with structure
Hierarchical | Slowest | Highest | Some | Production RAG systems

Where to go next

Embedding models — how sentences become vectors. The quality of your embeddings determines how well semantic meaning is captured. A better embedding model can compensate partially for imperfect chunking, but not fully.
Vector databases — how vectors are indexed for fast retrieval. ANN algorithms (HNSW, IVF) allow searching millions of vectors in milliseconds.
RAG evaluation — how to measure whether your chunking works. RAGAS, TruLens, and similar frameworks measure context relevance, faithfulness, and answer correctness automatically.
Reranking — after retrieval, use a cross-encoder to rerank the top-k chunks. Slower than embedding similarity but more accurate for deciding which chunks to include in the LLM context.
"What I cannot create, I do not understand." — Build a minimal RAG system from scratch: pick a PDF, chunk it with recursive splitting, embed with all-MiniLM-L6-v2 (free, local), store in ChromaDB (local), and retrieve the top-3 chunks for a query. You'll feel exactly where chunking quality matters.

"The devil is in the details, and with RAG, the detail is chunking."
— Practical wisdom from production deployments