Text Chunking Strategies — From Absolute Zero to Mastery

Chapter 0: Why Chunk?

You have a 50-page PDF — a technical manual, a research paper, a company handbook. A user asks: "What's the warranty policy for the XR-7 motor?" You want the AI to answer from the document, not hallucinate.

The naive approach: feed the whole PDF into the language model. GPT-4 Turbo has a 128k-token context window, and a 50-page PDF is roughly 25,000 tokens, so it fits. But your system needs to handle 500 PDFs. That's about 12.5 million tokens, far beyond that window. And even if it fit, the model's attention would diffuse over millions of tokens and the relevant sentence would drown in noise.

So you need to embed the documents — turn them into vectors in a high-dimensional space where "similar meaning" means "nearby in space." Then when a user asks a question, you embed the question and find the nearest document chunks. This is Retrieval-Augmented Generation (RAG).

The core problem: you can't embed a 50-page PDF as one vector. One vector can only hold one "meaning." A PDF has thousands of distinct ideas. If you compress them all into a single point, the point becomes meaningless — everything averages out.
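To make the averaging effect concrete, here is a toy sketch. The three-dimensional vectors and topic names are invented stand-ins for real embeddings (an assumption: real models use hundreds of dimensions, and topics are never perfectly orthogonal), but the arithmetic is the same.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical topic vectors: one axis per topic.
warranty = [1.0, 0.0, 0.0]
installation = [0.0, 1.0, 0.0]
safety = [0.0, 0.0, 1.0]

# One vector for the whole document: the average of its topics.
topics = [warranty, installation, safety]
document = [sum(col) / len(topics) for col in zip(*topics)]

query = warranty  # a question about the warranty policy

print(round(cosine(query, warranty), 3))  # per-topic vector: 1.0
print(round(cosine(query, document), 3))  # averaged vector: 0.577
print(round(cosine(safety, document), 3)) # same 0.577 for an unrelated topic
```

The averaged vector scores the same against every topic, so it cannot tell a warranty question from a safety question.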

You could embed each word separately. But a word alone has no context. The word "bank" means nothing without "river bank" vs "savings bank." You need enough surrounding text to disambiguate.

The solution is to split the document into chunks — segments of text that each capture one coherent idea, with enough context to be understood on their own. Each chunk gets one embedding vector. At query time, you retrieve the most relevant chunks and feed them to the model.

The chunking problem: what is "one coherent idea"? A sentence? A paragraph? A section? The answer is: it depends on your content, your embedding model, and your query patterns. The rest of this lesson maps out every strategy for answering that question.

[Interactive figure: Embedding Space, One Vector vs. Many Chunks. It shows why a single vector for a whole document loses information; each colored dot is a distinct topic from the document.]

Why can't you embed a whole 500-document corpus as a single vector and use it for retrieval?

A. The vector would be too large to store in memory
B. Embedding models only accept short text
C. Thousands of distinct ideas average into a meaningless single point, so no specific topic can be retrieved
D. Vector databases can't store more than one embedding per document

Chapter 1: Fixed-Size Chunking

The simplest strategy: count characters (or tokens) and cut every N of them. Like tearing a book into equal-width strips. Fast, deterministic, requires no understanding of the content.

Consider this paragraph from a climate science paper:

"The Greenland ice sheet has been losing mass at an accelerating rate since the 1990s. Between 2002 and 2020, it shed approximately 280 billion tons of ice per year. This loss is driven primarily by surface melt and glacial discharge, with surface melt accounting for roughly 60% of total mass loss."

With a chunk size of 100 characters and no overlap, you get:

chunks (100-char, no overlap)
# Chunk 1
"The Greenland ice sheet has been losing mass at an accelerating rate since the 1990s. Between 2002 a"

# Chunk 2
"nd 2020, it shed approximately 280 billion tons of ice per year. This loss is driven primarily by su"

# Chunk 3
"rface melt and glacial discharge, with surface melt accounting for roughly 60% of total mass loss."

Chunk 1 ends mid-word: "and" is split, and with it the date range "2002 and 2020". Chunk 2 starts with "nd 2020", which is meaningless without context, and ends by cutting "surface" in half. If someone queries "how much ice does Greenland lose per year?", the figure "280 billion tons" sits in Chunk 2, but the subject (the Greenland ice sheet) and the start of the date range are stranded in Chunk 1. Neither chunk alone has the full fact.

Overlap is the band-aid: repeat the last K characters from chunk N at the start of chunk N+1. This reduces mid-fact splits but inflates your index and creates duplicate information in adjacent chunks.

python — fixed-size chunking with overlap
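def fixed_chunk(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of `size` characters; neighbors share `overlap` characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + size
        chunks.append(text[start:end])
        if end >= len(text):
            break  # text fully covered; avoid a trailing chunk that is pure overlap
        start += size - overlap  # advance by size minus overlap
    return chunks

# Example
text = "The Greenland ice sheet..."
chunks = fixed_chunk(text, size=500, overlap=50)
# Result: each chunk starts with the last 50 characters of the previous one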
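
To put a number on how much overlap inflates the index, the following sketch counts chunks and stored characters for a text of a given length. The function names and the 10,000-character document are illustrative assumptions, not part of any library.

```python
def chunk_spans(n_chars, size, overlap):
    """Return (start, end) offsets for fixed-size chunks over n_chars characters."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    spans, start = [], 0
    while start < n_chars:
        end = min(start + size, n_chars)
        spans.append((start, end))
        if end == n_chars:
            break  # the text is fully covered
        start += size - overlap
    return spans

def index_inflation(n_chars, size, overlap):
    """Characters stored in the index divided by characters in the source."""
    stored = sum(end - start for start, end in chunk_spans(n_chars, size, overlap))
    return stored / n_chars

print(len(chunk_spans(10_000, 500, 0)))             # 20 chunks
print(len(chunk_spans(10_000, 500, 50)))            # 23 chunks
print(round(index_inflation(10_000, 500, 50), 2))   # 1.11
print(round(index_inflation(10_000, 500, 250), 2))  # 1.95
```

A 10% overlap costs about 11% more storage and embedding work; a 50% overlap nearly doubles it.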

When does fixed-size chunking work? Almost never in production. It's useful for prototyping because it requires zero analysis. But it routinely splits sentences, facts, and concepts. Use it only when you need a quick baseline to benchmark against better methods.

[Interactive demo: Fixed-Size Chunking, Live Preview. Adjust chunk size and overlap to see where cuts land in a sample sentence.]

Why does overlap help with fixed-size chunking but not fully solve the problem?

A. Overlap makes chunks too large for the embedding model
B. Overlap reduces boundary splits but still cuts at arbitrary character positions, not semantic boundaries, and it duplicates data
C. Overlap only works with token-based chunking, not character-based
D. Overlap causes retrieval to return the same chunk twice

Chapter 2: Sentence-Based Chunking

One obvious fix: don't cut mid-sentence. Split on sentence boundaries — periods, exclamation marks, question marks followed by whitespace. Every chunk is at least one complete sentence.

python — sentence splitting with spaCy
import spacy

nlp = spacy.load("en_core_web_sm")

def sentence_chunks(text: str, sentences_per_chunk: int = 3) -> list[str]:
    doc = nlp(text)
    sentences = [sent.text.strip() for sent in doc.sents]
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = " ".join(sentences[i : i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks
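
If you don't want a model dependency, the rule described above (split after a period, exclamation mark, or question mark followed by whitespace) can be sketched with the standard library. This is a naive illustration, not a replacement for spaCy's sentence segmentation; the function name is made up for this example.

```python
import re

# Split after '.', '!' or '?' when followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def split_sentences(text):
    """Naive rule-based sentence splitter (no model required)."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]

passage = ("Water evaporates from oceans when solar radiation heats the surface. "
           "The vapor rises into the atmosphere. It cools at altitude.")
print(split_sentences(passage))

# The rule is purely typographic, so abbreviations trip it up:
print(split_sentences("Dr. Smith measured the flow. It was low."))
# ['Dr.', 'Smith measured the flow.', 'It was low.']
```

The second call shows why real pipelines usually rely on a trained segmenter: "Dr." is treated as a sentence of its own.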

This is better than character splitting. But consider this passage about the water cycle:

"Water evaporates from oceans when solar radiation heats the surface. The vapor rises into the atmosphere. It cools at altitude. This cooling causes condensation into tiny droplets. Those droplets form clouds. Eventually precipitation returns water to the surface."

With 2 sentences per chunk:

Chunk 1: "Water evaporates from oceans when solar radiation heats the surface. The vapor rises into the atmosphere."

Chunk 2: "It cools at altitude. This cooling causes condensation into tiny droplets."

Chunk 3: "Those droplets form clouds. Eventually precipitation returns water to the surface."

Chunk 3 starts "Those droplets" — which droplets? The reference is to Chunk 2. Context is broken across chunk boundaries.

A concept often spans multiple sentences. "Those droplets form clouds" is meaningless without the preceding "cooling causes condensation into tiny droplets." Sentence splitting preserves grammar but not semantics.

Key insight: sentence boundaries are syntactic, not semantic. The concept of cloud formation spans four sentences in this example. Any chunking strategy that ignores meaning will sometimes split coherent ideas — and it will never know it happened.
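
The splitter itself can't see the damage, but a cheap after-the-fact check can flag suspicious chunks. The sketch below marks chunks whose first word is a pronoun or demonstrative, which often means the referent was left behind in the previous chunk. The word list is an illustrative assumption and far from complete.

```python
import re

# Words that often point back to something said earlier (illustrative, incomplete).
BACK_REFERENCES = {"it", "this", "that", "these", "those", "they", "such"}

def starts_with_back_reference(chunk):
    """True if the chunk's first word is a pronoun or demonstrative."""
    match = re.match(r"[A-Za-z']+", chunk.strip())
    return bool(match) and match.group(0).lower() in BACK_REFERENCES

chunks = [
    "Water evaporates from oceans when solar radiation heats the surface. "
    "The vapor rises into the atmosphere.",
    "It cools at altitude. This cooling causes condensation into tiny droplets.",
    "Those droplets form clouds. Eventually precipitation returns water to the surface.",
]
for i, chunk in enumerate(chunks, start=1):
    print(i, starts_with_back_reference(chunk))
# 1 False
# 2 True
# 3 True
```

A flagged chunk is only a hint, not proof of a bad split, but a high share of flagged chunks suggests the boundaries are cutting through connected ideas.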

One improvement: group sentences by adding a sliding window — each chunk overlaps the previous by N sentences. This creates redundancy but ensures related sentences often appear together in at least one chunk.

python — sentence chunks with sliding window overlap
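def sliding_sentence_chunks(sentences, window=3, step=2):
    # window=3: 3 sentences per chunk
    # step=2: advance by 2, so 1 sentence overlaps
    chunks = []
    for i in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[i:i + window]))
        if i + window >= len(sentences):
            break  # last window reached the end; no trailing sentence is dropped
    return chunks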

Sentence-based chunking is better than fixed-size because it never cuts mid-sentence. But what problem remains?

A. It requires a language model to detect sentence boundaries, which is slow
B. It produces chunks that are all the same length, which wastes embedding space
C. A concept often spans multiple sentences, so the chunk boundary can still split a coherent idea mid-thought
D. Sentence splitting only works in English

Chapter 3: Recursive Character Splitting

LangChain popularized this. The idea: try to split on the best delimiter first — a paragraph break is better than a sentence break, which is better than a word break, which is better than a character cut. Work down the hierarchy until chunks are small enough.

1. Try "\n\n" (paragraph break). If the chunk fits, done.
2. Chunk still too large: try "\n" (line break). If it fits, done.
3. Still too large: try ". " (sentence end). If it fits, done.
4. Still too large: try " " (word break). If it fits, done.
5. Still too large: cut at a character boundary (last resort).

This is a greedy top-down algorithm: always prefer the largest semantic unit that fits within the target chunk size. A paragraph break is semantically meaningful — it signals a topic shift. You only descend to finer splits when you must.

python — recursive character splitting (LangChain-style)
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,        # target characters per chunk
    chunk_overlap=50,     # overlap between adjacent chunks
    separators=[          # try these in order
        "\n\n",           # paragraph break (best)
        "\n",             # line break
        ". ",             # sentence end
        " ",              # word break
        ""                # character (worst case)
    ]
)

chunks = splitter.split_text(document)
# Returns a list of strings, each under 500 chars
# Each one cut at the most natural boundary possible
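
To see what happens under the hood, here is a simplified from-scratch sketch of the same idea. It is not LangChain's actual implementation: it has no overlap, keeps each separator attached to the piece before it, and strips whitespace at the end. The function names are made up for this example.

```python
def _split(text, chunk_size, separators):
    """Recursively split text, keeping each separator attached to the piece before it."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut at character boundaries.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        return _split(text, chunk_size, finer)  # separator absent: try the next one

    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    chunks, current = [], ""
    for piece in pieces:
        if not piece:
            continue
        if len(current) + len(piece) <= chunk_size:
            current += piece  # greedily pack pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too big: descend to a finer separator.
            chunks.extend(_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into stripped chunks of at most chunk_size characters."""
    return [c.strip() for c in _split(text, chunk_size, separators) if c.strip()]


text = ("Para one is short.\n\n"
        "Para two is a bit longer and has two sentences. Here is the second one.\n\n"
        "Tiny.")
for chunk in recursive_split(text, chunk_size=50):
    print(repr(chunk))
# 'Para one is short.'
# 'Para two is a bit longer and has two sentences.'
# 'Here is the second one.'
# 'Tiny.'
```

The short paragraphs survive whole. Only the middle paragraph, which is too long for one chunk, is broken down further, and it is cut at the sentence boundary rather than mid-word.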

Why hierarchy matters: it's not just about cleaner cuts. A paragraph break signals intentional topic separation by the author. A sentence break signals grammatical completion. A word break at least preserves morphology. Character breaks are pure noise. By trying the most meaningful separator first, you preserve authorial intent as much as possible.

You can customize the separator list for your content type. Code files should split on function boundaries before line breaks. Legal documents should split on sections before paragraphs. The hierarchy should match the document structure.

[Interactive demo: Recursive Splitting, The Decision Tree. It shows how a text segment is recursively subdivided.]

In recursive character splitting, why is "\n\n" tried before ". "?

A. Because double newlines produce smaller chunks, which are easier to embed
B. Because a paragraph break signals a larger semantic unit (the author deliberately separated those paragraphs), while a period only signals the end of one sentence
C. Because "\n\n" is a more common character in English text
D. Because periods are ambiguous: they can appear in abbreviations like "Dr." or "U.S.A."

Chapter 4: Semantic Chunking

Every strategy so far splits on structure — characters, sentences, paragraphs. None of them look at meaning. Semantic chunking does: it uses embeddings to detect when the topic shifts, and splits there.

The algorithm: embed each sentence individually. For each adjacent pair of sentences, compute the cosine distance between their embeddings. A high distance means "these two sentences are about different things" — a good split point. A low distance means they're about the same thing — keep them together.

1. Split into sentences: get S₁, S₂, ..., Sₙ
2. Embed each sentence: get vectors e₁, e₂, ..., eₙ
3. Compute consecutive distances: d(i) = cosine_distance(eᵢ, eᵢ₊₁)
4. Find breakpoints: split where d(i) > threshold

The breakpoint threshold controls granularity. Set it low (e.g., 80th percentile of all distances) and you split frequently into small, focused chunks. Set it high (e.g., 95th percentile) and you only split on major topic changes.
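
A small worked example shows how the percentile changes the number of chunks. The ten distances below are invented for illustration, and the percentile helper is a hand-rolled sketch using linear interpolation.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)

def breakpoints(distances, pct):
    """Indices i where the gap between sentence i and i+1 exceeds the threshold."""
    threshold = percentile(distances, pct)
    return [i for i, d in enumerate(distances) if d > threshold]

# Made-up distances between 11 consecutive sentences.
distances = [0.10, 0.12, 0.55, 0.08, 0.15, 0.70, 0.11, 0.09, 0.40, 0.13]

print(breakpoints(distances, 80))  # [2, 5] -> 3 chunks
print(breakpoints(distances, 95))  # [5]    -> 2 chunks
```

At the 80th percentile the threshold is about 0.43, so both large jumps become split points; at the 95th it rises to about 0.63 and only the largest jump survives.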

python — semantic chunking with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

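def semantic_chunks(sentences: list[str], percentile: float = 90) -> list[str]:
    # With fewer than two sentences there are no distances to threshold
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []

    # Step 1: embed all sentences
    embeddings = model.encode(sentences)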

    # Step 2: cosine distance between consecutive sentences
    def cos_dist(a, b):
        return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    distances = [cos_dist(embeddings[i], embeddings[i+1])
                 for i in range(len(embeddings) - 1)]

    # Step 3: threshold at Nth percentile
    threshold = np.percentile(distances, percentile)
    breakpoints = [i for i, d in enumerate(distances) if d > threshold]

    # Step 4: split at breakpoints
    chunks, prev = [], 0
    for bp in breakpoints:
        chunks.append(" ".join(sentences[prev:bp+1]))
        prev = bp + 1
    chunks.append(" ".join(sentences[prev:]))
    return chunks

Cost: semantic chunking requires embedding every sentence at index time, which is O(n) in the number of sentences and not free. For a 50-page PDF with ~500 sentences, that's 500 sentences to embed. Most embedding APIs and local models accept batches, so this need not mean 500 separate calls, but the compute still scales with sentence count. For most applications this is acceptable, but for millions of documents it adds up.

[Interactive plot: Semantic Distance Plot, Finding Breakpoints. Simulated cosine distances between consecutive sentences, with a red line marking the breakpoint threshold percentile.]

In semantic chunking, a high cosine distance between two consecutive sentence embeddings means:

A. The two sentences use many of the same words
B. The second sentence is much longer than the first
C. The two sentences are about different topics: their embedding vectors point in different directions in semantic space
D. The embedding model failed to process one of the sentences

Chapter 5: Document-Structure-Aware Chunking

The best split points in a document are usually already marked — by the author. Markdown headers, HTML tags, PDF section titles, bullet list items, code blocks. These are explicit signals that a new topic is beginning. Use them.

Consider a Markdown API documentation file. It has headers at multiple levels. Under each header: prose, then a code example, then parameter tables. The right chunks are: one per section, not crossing header boundaries, with code blocks kept intact.

python — markdown-aware chunking
# Recent LangChain releases ship the splitters in the langchain-text-splitters package
from langchain_text_splitters import MarkdownHeaderTextSplitter

headers = [
    ("#",   "h1"),   # top-level header
    ("##",  "h2"),   # section header
    ("###", "h3"),   # subsection header
]

splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers)
chunks = splitter.split_text(markdown_doc)

# Each chunk includes metadata:
# chunk.metadata = {"h1": "API Reference", "h2": "Authentication", "h3": "OAuth Flow"}
# chunk.page_content = "The OAuth flow requires three steps..."

The metadata is the killer feature: every chunk knows which section it came from. When you retrieve a chunk, you can display "from: API Reference > Authentication > OAuth Flow" — much more useful than "from: document, chunk 47".
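
Turning that metadata into a display path takes only a few lines. The helper below is a hypothetical sketch that assumes the metadata keys are the "h1", "h2", "h3" names chosen in the splitter configuration above.

```python
def breadcrumb(metadata, levels=("h1", "h2", "h3")):
    """Join the header levels present in a chunk's metadata into a display path."""
    return " > ".join(metadata[level] for level in levels if level in metadata)

metadata = {"h1": "API Reference", "h2": "Authentication", "h3": "OAuth Flow"}
print("from: " + breadcrumb(metadata))
# from: API Reference > Authentication > OAuth Flow

# A chunk that sits directly under an h2 simply gets a shorter path.
print(breadcrumb({"h1": "API Reference", "h2": "Rate Limits"}))
# API Reference > Rate Limits
```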

Common structure signals

Format	Signals
Markdown	#, ##, ###, ---
HTML	h1-h6, <section>, <article>
PDF	font-size changes, bold lines
Code	function/class definitions
Legal	"Article N", "Section N"

Special content types

Code blocks should never be split mid-function. A function is the atomic unit of code — split on function boundaries, or keep entire files as one chunk if small enough.
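
For Python source, the standard library's ast module already knows where functions and classes begin and end. The following is a minimal sketch that keeps only top-level definitions; the function name is made up, and a real pipeline would also need to decide what to do with imports and other module-level statements.

```python
import ast

def split_python_source(source):
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # get_source_segment returns the source text of the node (Python 3.8+).
            chunks.append(ast.get_source_segment(source, node))
    return chunks

source = '''
import math

def area(r):
    return math.pi * r ** 2

class Circle:
    def __init__(self, r):
        self.r = r
'''
for chunk in split_python_source(source):
    print(chunk.splitlines()[0])
# def area(r):
# class Circle:
```

Each chunk is a complete definition, so no function body is ever cut in half. Module-level statements such as the import are dropped in this sketch.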

Tables should stay intact. A row is meaningless without its header. Extract tables as structured data, not raw text.
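
One simple way to keep rows meaningful is to repeat the column names in every row before embedding. The sketch below shows the idea on two rows of a small table; the function name and output format are illustrative choices, not a standard.

```python
def table_to_records(header, rows):
    """Turn each row into a self-describing string that repeats the column names."""
    return [
        "; ".join(f"{name}: {value}" for name, value in zip(header, row))
        for row in rows
    ]

header = ["Format", "Signals"]
rows = [
    ["Markdown", "#, ##, ###, ---"],
    ["Legal", '"Article N", "Section N"'],
]
for record in table_to_records(header, rows):
    print(record)
# Format: Markdown; Signals: #, ##, ###, ---
# Format: Legal; Signals: "Article N", "Section N"
```

Each record can now be embedded or retrieved on its own without losing the meaning of its columns.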

For PDFs, structure is implicit. You need to extract it from visual cues: font size, bold text, indentation, whitespace. Libraries like pdfminer and pymupdf expose these properties so you can reconstruct the logical structure.

python — PDF structure extraction with pymupdf
import fitz  # pymupdf

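def extract_sections(pdf_path, heading_size=14):
    """Group PDF text into sections, treating large-font spans as headings."""
    sections = []
    current_section = {"title": "", "content": ""}

    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks = page.get_text("dict")["blocks"]
            for block in blocks:
                if block["type"] != 0:  # skip non-text (image) blocks
                    continue
                for line in block["lines"]:
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if not text:
                            continue
                        if span["size"] > heading_size:  # likely a heading
                            if current_section["content"]:
                                sections.append(current_section)
                                current_section = {"title": text, "content": ""}
                            else:
                                # heading continues across spans: extend the title
                                current_section["title"] = (current_section["title"] + " " + text).strip()
                        else:
                            current_section["content"] += text + " "

    # keep the last section even if it has a title but no body text
    if current_section["title"] or current_section["content"]:
        sections.append(current_section)
    return sections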

What advantage does document-structure-aware chunking offer that semantic chunking doesn't?

A. It's faster because it doesn't require embedding models
B. It produces smaller chunks, which are cheaper to store
C. It produces metadata (section titles, hierarchy) alongside each chunk, enabling source attribution and hierarchical retrieval
D. It works for all languages without modification

Chapter 6: Chunk Size vs. Retrieval Quality

No chunking strategy is complete without choosing how big to make the chunks. This is not a free parameter — it has measurable consequences on retrieval precision and recall.

Too small: each chunk has too little context. A 50-character chunk might say "280 billion tons per year" — but per year of what? Which substance? Which measurement? The embedding of this fragment is ambiguous. Retrieval will find it for the wrong queries and miss it for the right ones.

Too large: each chunk covers multiple topics. Its embedding vector is a compromise — a blurry average of several ideas. A query about "glacial discharge" will retrieve a 2000-character chunk about the entire water cycle. The answer is buried in noise. The model has to read more irrelevant text to find the needle.

The sweet spot depends on four factors:

Factor	Effect on optimal size
Embedding model context window	text-embedding-3-small accepts up to 8,191 tokens per input. Smaller local models such as all-MiniLM-L6-v2 truncate at 256 tokens by default, and the tail is silently ignored.
Query type	Specific factual queries ("what year was X founded?") prefer small chunks. Thematic queries ("explain the philosophy of X") prefer large chunks.
Document density	Dense technical text (papers) needs smaller chunks, because each sentence is load-bearing. Narrative prose can use larger chunks.
LLM context window	You'll retrieve k chunks and feed them to the LLM. If k=5 and your LLM context is 4k tokens, chunks can be at most 800 tokens each, and less once you reserve room for the prompt and the answer.
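
The context-window arithmetic in the last row is easy to wrap in a helper. This is a minimal sketch; the reserved budget of 1,000 tokens for the prompt and the answer is an assumed figure for illustration.

```python
def max_chunk_tokens(context_window, k, reserved=0):
    """Largest chunk size (in tokens) such that k chunks fit alongside reserved tokens."""
    available = context_window - reserved
    if available <= 0:
        raise ValueError("reserved tokens exceed the context window")
    return available // k

print(max_chunk_tokens(4000, k=5))                 # 800, the figure from the table
print(max_chunk_tokens(4000, k=5, reserved=1000))  # 600 once prompt and answer get room
```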

Empirical rule of thumb: 256–512 tokens per chunk works for most general-purpose RAG systems. Start here, then measure: run your real queries against a test set, check if the correct chunk is in the top-k retrieved, and adjust. There is no substitute for this measurement — theory can only narrow the search space.
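
The measurement itself can be as simple as a hit rate: for each test query, is the known-correct chunk among the top k results? The sketch below uses a toy keyword-overlap retriever as a stand-in for a real embedding search, and the chunks and queries are invented for illustration.

```python
import re

def words(text):
    return set(re.findall(r"[a-z0-9%]+", text.lower()))

def make_keyword_retriever(chunks):
    """Toy retriever: rank chunk ids by word overlap with the query."""
    def retrieve(query, k):
        q = words(query)
        ranked = sorted(chunks, key=lambda cid: (-len(q & words(chunks[cid])), cid))
        return ranked[:k]
    return retrieve

def hit_rate_at_k(test_set, retrieve, k):
    """Fraction of queries whose known-relevant chunk appears in the top k."""
    hits = sum(1 for query, relevant_id in test_set if relevant_id in retrieve(query, k))
    return hits / len(test_set)

chunks = {
    "c1": "The Greenland ice sheet loses about 280 billion tons of ice per year.",
    "c2": "Surface melt accounts for roughly 60% of total mass loss.",
    "c3": "Water evaporates from oceans when solar radiation heats the surface.",
}
test_set = [
    ("how much ice does greenland lose per year", "c1"),
    ("what share of mass loss is surface melt", "c2"),
    ("why does water evaporate from oceans", "c3"),
    ("annual melt figure", "c1"),
]

retrieve = make_keyword_retriever(chunks)
print(hit_rate_at_k(test_set, retrieve, k=1))  # 0.75
print(hit_rate_at_k(test_set, retrieve, k=2))  # 1.0
```

Swap in your real retriever and rerun the same test set after every change to chunk size or strategy; the number that moves is the one that matters.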

One advanced strategy: hierarchical chunking. Index the same document at two granularities — small chunks (128 tokens) for precise retrieval, large chunks (1024 tokens) for context. Retrieve using small chunks, but return the parent large chunk to the LLM. You get precision AND context.

python — hierarchical chunk indexing
from langchain_text_splitters import RecursiveCharacterTextSplitter

def build_hierarchical_index(text):
    # Note: chunk_size counts characters here, not tokens

    # Large chunks for LLM context
    large_splitter = RecursiveCharacterTextSplitter(chunk_size=1024, chunk_overlap=100)
    large_chunks = large_splitter.split_text(text)

    # Small chunks for retrieval precision, split from each parent so that
    # every small chunk lies entirely inside exactly one large chunk
    small_splitter = RecursiveCharacterTextSplitter(chunk_size=128, chunk_overlap=20)
    small_chunks = []
    parent_of = []  # parent_of[i] = index of the large chunk containing small_chunks[i]
    for parent_idx, large in enumerate(large_chunks):
        for small in small_splitter.split_text(large):
            small_chunks.append(small)
            parent_of.append(parent_idx)

    # At query time: retrieve small chunk i, return large_chunks[parent_of[i]]
    return small_chunks, large_chunks, parent_of

[Interactive plot: Chunk Size vs. Retrieval Performance. Simulated precision and recall curves as chunk size varies, by query type. The sweet spot maximizes F1 (harmonic mean of precision and recall).]

Why does a very large chunk size hurt retrieval precision?

A. Large chunks exceed the embedding model's token limit and get truncated
B. A large chunk covers multiple topics, so its embedding is a blurry average; it will match many queries partially, making it hard to retrieve the right chunk for any specific question
C. Large chunks take longer to compute cosine similarity against
D. Vector databases impose a maximum chunk size of 512 tokens

Chapter 7: Interactive Chunking Lab

See all four strategies side by side on real text. Adjust parameters and watch the chunks change in real time. Enter a sample query to see which strategy retrieves the most relevant chunk.

[Interactive lab: Chunking Strategy Comparison Lab, with editable sample text and controls for chunk size (chars), overlap (chars), and semantic threshold percentile.]

Chapter 8: Connections

Chunking is not the goal — it's the foundation. Every downstream component in a RAG system depends on chunk quality. Let's trace the full pipeline and see where chunking quality propagates.

Documents: raw PDFs, HTML, Markdown, code
↓ chunking (this lesson)
Chunks: text segments, each ~one coherent idea
↓ embedding model (e.g. text-embedding-3-small)
Vectors: high-dim float arrays, one per chunk
↓ vector database (Pinecone, Chroma, Weaviate)
Index: ANN index for fast approximate nearest neighbor search
↓ retrieval (top-k cosine similarity)
Retrieved chunks: the k most relevant segments for a query
↓ LLM (GPT-4, Claude, Llama)
Answer: grounded response citing retrieved context

Chunk quality is one of the strongest determinants of RAG quality. If a chunk splits a fact across two segments, the relevant information will never be retrieved together. If chunks are too large, the embedding becomes too general. If chunks are too small, they lack context. A better embedding model or vector database can only partly compensate for bad chunking: garbage in, garbage out.
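
The whole pipeline fits in a few dozen lines if you replace each component with a toy stand-in. In the sketch below, a bag-of-words counter plays the embedding model, a plain list plays the vector database, and the sample document is made up; none of this is production code, but the data flow is the same.

```python
import math
import re
from collections import Counter

def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def build_index(chunks):
    """Index = list of (chunk, vector) pairs; a vector database replaces this in practice."""
    return [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(index, query, k=1):
    """Return the k chunks whose vectors are closest to the query vector."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query, retrieved):
    """Assemble the text that would be sent to the LLM."""
    context = "\n".join(f"- {c}" for c in retrieved)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Made-up sample document, chunked one sentence per chunk.
document = ("The XR-7 motor carries a two-year warranty. "
            "Install the motor on a level surface. "
            "Disconnect power before servicing.")
chunks = re.split(r"(?<=[.!?])\s+", document)

index = build_index(chunks)
query = "What is the warranty on the XR-7 motor?"
top = retrieve(index, query, k=1)
print(build_prompt(query, top))
```

Every stage downstream of chunking only ever sees the chunks. If the warranty sentence had been cut in half, no amount of work in embed, retrieve, or build_prompt could put it back together.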

There's a useful analogy: chunking is like writing a good index for a textbook. A bad index (wrong entries, too broad, too fine-grained) makes the book useless even if the content is brilliant. A good index makes every concept findable in seconds.

Strategy Comparison

Strategy	Speed	Quality	Metadata	Best for
Fixed-size	Fastest	Lowest	None	Prototyping baselines
Sentence	Fast	Low-medium	None	Simple prose, no structure
Recursive	Fast	Medium	None	General-purpose default
Semantic	Slow	High	None	Dense, thematic text
Structure-aware	Medium	High	Rich	Markdown, HTML, PDFs with structure
Hierarchical	Slowest	Highest	Some	Production RAG systems

Where to go next

Embedding models — how sentences become vectors. The quality of your embeddings determines how well semantic meaning is captured. A better embedding model can compensate partially for imperfect chunking, but not fully.

Vector databases — how vectors are indexed for fast retrieval. ANN algorithms (HNSW, IVF) allow searching millions of vectors in milliseconds.

RAG evaluation — how to measure whether your chunking works. RAGAS, TruLens, and similar frameworks measure context relevance, faithfulness, and answer correctness automatically.

Reranking — after retrieval, use a cross-encoder to rerank the top-k chunks. Slower than embedding similarity but more accurate for deciding which chunks to include in the LLM context.

"What I cannot create, I do not understand." — Build a minimal RAG system from scratch: pick a PDF, chunk it with recursive splitting, embed with all-MiniLM-L6-v2 (free, local), store in ChromaDB (local), and retrieve the top-3 chunks for a query. You'll feel exactly where chunking quality matters.

"The devil is in the details, and with RAG, the detail is chunking."
— Practical wisdom from production deployments