Introduction to Algorithms (CLRS) — Chapter 18

B-Trees

Disk-optimized search trees — how databases store billions of keys with just 2-3 disk reads.

Prerequisites: BSTs + Disk I/O intuition. That's it.
10
Chapters
9+
Simulations
5
Interview Dimensions

Chapter 0: The Problem

You are building a database that stores 1 billion user records. Each record has a unique ID, and users search, insert, and delete records constantly. The data does not fit in RAM — it lives on disk (an SSD or, historically, a spinning hard drive). How do you organize the data so that finding any record is fast?

Your first instinct: a binary search tree. With 1 billion keys, a balanced BST has height log2(109) ≈ 30. That means 30 comparisons to find any key. In RAM, 30 comparisons take microseconds. But here is the catch: each comparison requires reading a different node, and each node lives on a different disk page.

A single disk read (called a "seek" on a spinning disk, or a "page fetch" on an SSD) takes about 10 milliseconds on a hard drive. That is 10,000,000 nanoseconds. A CPU instruction takes about 1 nanosecond. So one disk read costs the same as 10 million CPU operations. Thirty disk reads means 300 milliseconds — a third of a second — just to find one record.

The core insight. When data lives on disk, the bottleneck is not comparisons — it is the number of disk reads. Each node visit is a disk read. A BST with height 30 needs 30 disk reads. We need a tree with much lower height, even if each node is bigger and each visit does more work. Reduce height by increasing branching factor.

A B-tree solves this by packing hundreds or thousands of keys into each node. Instead of 2 children per node, a B-tree node can have 1000 children. The height drops from log2(n) to log1000(n). For 1 billion keys:

BST height: log2(109) ≈ 30 disk reads B-tree height: log1000(109) = 3 disk reads That is a 10x reduction in disk reads. 300ms → 30ms. And the root is usually cached in RAM, so it is really 2 disk reads.

Each B-tree node is sized to fit exactly one disk page (typically 4 KB or 16 KB). When the OS reads a page from disk, it gets the entire node — all its keys and child pointers — in a single I/O operation. The extra CPU work of scanning through hundreds of keys within a node is trivial compared to the cost of one more disk read.

The simulation below lets you feel the difference. It compares the number of "disk reads" (node visits) for a BST versus a B-tree as the number of keys grows. Watch how the BST's height climbs relentlessly while the B-tree stays flat.

Disk Reads: BST vs B-Tree

Drag the slider to change the number of keys. The chart shows how many disk reads (node visits) each structure needs to find a key in the worst case.

Keys (n) 1015
B-tree branching (t) 100

Even with a modest branching factor of 100, the B-tree needs only 4-5 disk reads for billions of keys. Increase the branching factor to 500 (common in real databases) and you need 2-3 reads for practically any dataset on Earth.

This is why every major database — MySQL, PostgreSQL, SQLite, Oracle, MongoDB — uses B-trees (or the closely related B+ tree) as their primary index structure. It is the single most important data structure in storage systems.

The Memory Hierarchy

To understand why B-trees matter, you need to understand the memory hierarchy. Modern computers have several levels of storage, each faster but smaller than the next:

LevelSizeAccess TimeRelative Speed
CPU Registers~1 KB~0.3 ns1x
L1 Cache~64 KB~1 ns3x slower
L2 Cache~256 KB~4 ns13x slower
L3 Cache~8 MB~12 ns40x slower
RAM~16 GB~100 ns333x slower
SSD~1 TB~100 μs333,000x slower
HDD~4 TB~10 ms33,000,000x slower

The jump from RAM to SSD is 1,000x. The jump from RAM to HDD is 100,000x. A data structure that minimizes disk accesses — even at the cost of more CPU work per access — wins by orders of magnitude. That is the B-tree's entire design philosophy.

Why "B"? The name "B-tree" was introduced by Rudolf Bayer and Edward McCreight at Boeing Scientific Research Labs in 1972. They never explained what the "B" stands for. It might be Boeing, Bayer, balanced, broad, or bushy. Nobody knows for certain. What we do know: B-trees were designed specifically for disk-based storage systems, and they remain the dominant index structure 50+ years later.

The Disk I/O Model

Textbook algorithms assume all data is in RAM and count comparisons. The disk I/O model (also called the external memory model) counts disk transfers instead. In this model:

In this model, scanning n items costs O(n/B) I/Os (read n/B blocks). Sorting n items costs O((n/B) · logM/B(n/B)) I/Os — much better than the O(n log n) comparison sort. And searching in a B-tree costs O(logB n) I/Os, because each node holds B items.

This is why B-trees use branching factor proportional to B (the block/page size): it makes each "level" of the tree map to exactly one disk read, giving the optimal O(logB n) search cost in the external memory model. B-trees are I/O optimal for search — no other comparison-based data structure can do better.

SSDs changed the game — but not the answer. SSDs eliminated the 10ms seek penalty of HDDs, reducing random reads to ~100 μs. But SSDs still have a page-level granularity (4 KB reads), and sequential access is still 10x faster than random. B-trees remain optimal on SSDs — the branching factor just needs to be recalibrated for the new page sizes and latencies. Some SSD-optimized databases use smaller pages (2 KB instead of 16 KB) to reduce write amplification.
Concept check: A balanced BST with 1 billion keys has height ~30. Why is this a problem when the data lives on disk?

Chapter 1: B-Tree Properties

A B-tree is defined by a single parameter called the minimum degree, written t (where t ≥ 2). This parameter controls how wide and short the tree is. Every rule about a B-tree flows from t.

Here are the five properties that define a B-tree. Memorize these — they are the entire specification.

The five B-tree rules (for minimum degree t):
1. Every node stores between t-1 and 2t-1 keys (except the root, which can have as few as 1 key).
2. Every internal node with k keys has exactly k+1 children.
3. Keys within each node are stored in sorted order.
4. For an internal node with keys k1, k2, ..., kn: all keys in childi fall between ki-1 and ki. (The BST property, but generalized to multiple keys.)
5. All leaves appear at the same depth. The tree is perfectly balanced by construction.

Let us make this concrete with t = 3 (minimum degree 3). Each node has between 2 and 5 keys (t-1 = 2, 2t-1 = 5). Each internal node has between 3 and 6 children (t = 3, 2t = 6). The root can have as few as 1 key and 2 children.

Think of each node as a sorted array with signposts. If a node contains keys [10, 20, 30], it has four child pointers: one for keys < 10, one for keys between 10 and 20, one for keys between 20 and 30, and one for keys > 30. Each child pointer leads to an entire subtree of keys in that range.

Node: | 10 | 20 | 30 | / | | \ <10 10-20 20-30 >30 3 keys = 4 child pointers. The keys act as "dividers" that route searches to the correct child.

The Routing Property in Detail

Property 4 — the multi-key routing rule — deserves careful examination. In a BST, each node has one key that divides the search space into two halves: left (≤ key) and right (≥ key). In a B-tree, each node has k keys that divide the search space into k+1 ranges:

Node with keys [k₁, k₂, k₃]: child₀ child₁ child₂ child₃ < k₁ k₁ ≤ x < k₂ k₂ ≤ x < k₃ ≥ k₃ Every key in child₀ is strictly less than k₁. Every key in child₁ is ≥ k₁ and < k₂. Every key in child₂ is ≥ k₂ and < k₃. Every key in child₃ is ≥ k₃.

This is exactly like a multi-way decision tree. At each node, you scan the keys to find which range your search key falls into, then descend into the corresponding child. With k keys per node, you make one (k+1)-way decision instead of k binary decisions — but you do it in a single disk read.

Think of it like a phone book directory. Each page of the directory has tab markers: A-C, D-F, G-I, ... Each tab tells you which section to flip to. The tabs are the keys; the sections are the children. A BST phone book would have one letter per page — 26 binary decisions to find any name. A B-tree phone book has all 26 letters on the first page — one decision, then flip directly to the right section.

Why the minimum fill constraint?

The rule that every non-root node must have at least t-1 keys is crucial. It guarantees that the tree stays balanced and short. If nodes could be arbitrarily empty, the tree could degenerate into a tall, skinny structure — just like an unbalanced BST. The minimum fill ensures every node is at least half full, which bounds the height.

Height bound

A B-tree with n keys and minimum degree t has height at most:

h ≤ logt((n + 1) / 2)

For t = 1001 and n = 109: h ≤ log1001(500,000,000) ≈ 2.9. That is, a B-tree with minimum degree 1001 stores a billion keys in a tree of height 3. The root plus 2 more levels.

Choosing t in Practice

In theory, t can be any integer ≥ 2. In practice, t is chosen so that one node fills exactly one disk page. Here is the calculation:

Given: page_size = 4096 bytes (or 16384 for InnoDB) key_size = 8 bytes (int64) pointer_size = 8 bytes (page_id or pointer) A node with k keys has k+1 child pointers: node_size = k · key_size + (k+1) · pointer_size + overhead 4096 = k · 8 + (k+1) · 8 + 16 (16 bytes for header) 4096 = 16k + 8 + 16 4072 = 16k k = 254 So max keys = 254, which means 2t-1 = 254, so t = 128. Branching factor = 255.

With variable-length keys (like strings), the calculation is more complex. Databases typically store the maximum number of keys that fit in a page given the average key size, with overflow pages for very long keys.

The interactive visualization below lets you explore a B-tree with t = 3. Click the keys and child pointers to see the structure. Notice how every leaf is at the same depth, and every non-root node has at least 2 keys.

B-Tree Node Structure (t = 3)

A B-tree with minimum degree t=3. Each node holds 2-5 keys. All leaves are at the same depth. Hover over nodes to see key counts.

Propertyt = 2 (2-3-4 tree)t = 3t = 1001 (database)
Min keys/node121000
Max keys/node352001
Min children231001
Max children462002
Height for 106 keys~20~9~2
Height for 109 keys~30~13~3

The Node Data Structure

Every B-tree node stores three things: a list of keys (sorted), a list of child pointers, and a flag indicating whether it is a leaf. Here is the concrete implementation:

python
class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []        # sorted list of keys
        self.children = []    # list of child node pointers
        self.leaf = leaf      # True if this is a leaf node
        self.n = 0            # number of keys currently stored

class BTree:
    def __init__(self, t):
        self.root = BTreeNode()
        self.t = t            # minimum degree
        # Every node has at most 2t-1 keys and 2t children
        # Every non-root node has at least t-1 keys

In a real database, each "node" is a disk page. The children list does not contain Python pointers — it contains page numbers. To access a child, you call disk_read(page_number), which loads the page from disk into RAM. This is the expensive operation — the one that B-trees minimize.

In-memory representation: On-disk representation: node.children[2] = <object> node.children[2] = page_id 4807 ↑ must call disk_read(4807) In textbooks, disk_read() and disk_write() are explicit. In real databases, the buffer pool abstracts this — pages are read into a shared memory pool, and "pointers" become page_id lookups.

Deriving the Height Bound

Let us prove the height bound h ≤ logt((n+1)/2). A B-tree of height h has:

Level 0 (root): 1 node, at least 1 key Level 1: at least 2 nodes (root has ≥ 2 children) Level 2: at least 2t nodes (each L1 node has ≥ t children) Level 3: at least 2t² nodes ... Level h: at least 2th-1 nodes Total keys ≥ 1 + (t-1) · [2 + 2t + 2t² + ... + 2th-1] = 1 + 2(t-1) · (th - 1)/(t - 1) = 2th - 1 So n ≥ 2th - 1, which gives h ≤ logt((n+1)/2). □

For t = 1001, n = 109: h ≤ log1001(500,000,000) = log(500,000,000) / log(1001) ≈ 8.7 / 3.0 ≈ 2.9. So the height is at most 3. With the root always in memory, you need just 2 disk reads to search a billion keys.

Why t = 2 is special. A B-tree with t = 2 is called a 2-3-4 tree: each node has 1-3 keys and 2-4 children. This is structurally equivalent to a red-black tree (CLRS Ch 13). Every red-black tree can be converted to a 2-3-4 tree and vice versa. So red-black trees are B-trees in disguise — optimized for RAM (binary branching) rather than disk (high branching). The red-black tree's rotation operations correspond to B-tree splits and merges. This is why CLRS places the B-tree chapter right after red-black trees — they are the same idea at different scales.
Concept check: In a B-tree with minimum degree t = 4, what is the maximum number of keys a single node can hold?

Chapter 2: Search

Searching a B-tree is a natural generalization of BST search. At each node, instead of comparing with one key and going left or right, you scan through multiple keys to find the right child pointer. The search descends level by level until it finds the key or reaches a leaf.

Here is the algorithm, step by step:

Start at root
Read the root node from disk into memory.
Scan keys
In the current node, find the smallest index i such that key ≥ node.keys[i]. This can be linear scan or binary search.
Found?
If key == node.keys[i], return (node, i). Done.
↓ not found
Leaf check
If the current node is a leaf, the key is not in the tree. Return NOT FOUND.
↓ internal node
Descend
Read child[i] from disk (one disk read). Go back to "Scan keys."
↻ repeat

The Code

python
def b_tree_search(node, key):
    """Search for key in B-tree rooted at node.
    Returns (node, index) if found, None otherwise."""
    i = 0
    # Find the first key >= target
    while i < node.n and key > node.keys[i]:
        i += 1
    # Check if we found it
    if i < node.n and key == node.keys[i]:
        return (node, i)        # Found!
    # If leaf, key is not in tree
    if node.leaf:
        return None
    # Otherwise, read child from disk and recurse
    disk_read(node.children[i])  # ← THIS is the expensive operation
    return b_tree_search(node.children[i], key)

Cost Analysis

At each level, we do two things: scan through the keys in the node (CPU work), and read one child from disk (I/O work). The costs are:

ResourcePer levelTotal (h levels)
Disk reads1O(h) = O(logt n)
CPU comparisonsO(t) linear scan
or O(log t) binary search
O(t · logt n)
or O(log2 n)

The CPU cost is irrelevant — even O(t · logt n) with t = 1000 is millions of times cheaper than the disk I/O. The number of disk reads, O(logt n), is what matters. And the root node is almost always cached in RAM (it is accessed on every operation), so in practice you save one more disk read.

Binary search within nodes. For large t (say t = 1000), scanning 2000 keys linearly is wasteful even for CPU. Real implementations use binary search within each node, bringing the per-node comparison cost from O(t) to O(log t). This does not affect disk reads — it just makes the CPU work faster.

Amortized I/O: Why Caching Matters

In practice, the "O(logt n) disk reads per search" is an overestimate. The root node is accessed on every search, so it stays permanently in the buffer pool (cache). Level-1 nodes are also accessed very frequently. With a modest buffer pool, the top 2-3 levels of the tree are always cached.

Let us calculate the memory needed to cache the top levels:

B+ tree with t=500, n=1 billion keys, page_size=16 KB: Level 0 (root): 1 page = 16 KB Level 1: ~1000 pages = 16 MB Level 2: ~1,000,000 pages = 16 GB Level 3 (leaves): ~1,000,000,000 / 500 = 2,000,000 pages = 32 GB Caching levels 0-1 costs only 16 MB. Every search is then 1-2 disk reads. Caching levels 0-2 costs 16 GB. Every search is then 0-1 disk reads. The "right" amount of cache depends on your RAM budget and dataset size.

This is why database tuning guides always emphasize buffer pool sizing. With enough RAM to cache the internal levels of your hottest indexes, read performance becomes nearly optimal — close to 1 disk read per query regardless of dataset size.

What About Writes?

Search only reads. Insertion and deletion also write. Each modified page must eventually be flushed to disk. But writes can be deferred: mark the page "dirty" in the buffer pool and let a background process flush it later. Multiple modifications to the same page (common for hot nodes) are batched into a single write. This further amortizes the I/O cost.

The write-ahead log (WAL) ensures durability even when dirty pages have not been flushed. As long as the log entry is on disk, the modification is safe — even if the system crashes before the actual page is written. We will discuss WAL in detail in Chapter 7.

Worked Search Example

Consider searching for key 25 in this B-tree (t=3):

[20, 40] / | \ [5, 10, 15] [25, 30, 35] [45, 50, 55, 60] Step 1: Read root [20, 40]. Scan: 25 > 20, 25 < 40. Descend to child[1]. Disk reads so far: 1 Step 2: Read child [25, 30, 35]. Scan: 25 == 25. FOUND at index 0! Disk reads so far: 2 Total: 2 disk reads. If root is cached, just 1 disk read.

Now search for key 17 (not in tree):

Step 1: Read root [20, 40]. Scan: 17 < 20. Descend to child[0]. Disk reads: 1 Step 2: Read child [5, 10, 15]. Scan: 17 > 5, 17 > 10, 17 > 15. Reached end of keys. This is a leaf → NOT FOUND. Disk reads: 2 Total: 2 disk reads to confirm the key does not exist.

The simulation below lets you search for a key in a B-tree. Watch how the search descends level by level, scanning keys at each node before choosing a child.

B-Tree Search Step-Through

Enter a key and click "Search" to watch the algorithm descend through the tree. Each highlighted node is one disk read.

Concept check: In a B-tree with t = 500 and 1 billion keys, approximately how many disk reads does a search require (assuming the root is cached in RAM)?

Chapter 3: Insertion

Inserting into a B-tree is more complex than BST insertion. You cannot just create a new leaf anywhere — that would violate the "all leaves at the same depth" property. Instead, every new key goes into an existing leaf. But what if the leaf is already full (has 2t-1 keys)?

The answer: split the full node before inserting. Splitting takes a full node with 2t-1 keys, pulls out the median key, and creates two nodes of t-1 keys each. The median key moves up to the parent.

Before split (t=3, full node with 5 keys): Parent: | ... | P | ... | | | 10 | 20 | 30 | 40 | 50 | ← FULL (2t-1 = 5 keys) After split (median = 30 goes up): Parent: | ... | P | 30 | ... | ← parent gains one key / \ | 10 | 20 | | 40 | 50 | ← two nodes with t-1 = 2 keys each

Proactive Splitting: The CLRS Approach

Here is the clever part. CLRS describes a single-pass, top-down insertion. As you walk down the tree looking for the correct leaf, you split every full node you encounter on the way down — before you need to. This guarantees that when you arrive at the leaf, it is not full, so you can insert directly.

Why proactive? If you split only when needed (bottom-up), a split might push a key up to a parent that is also full, causing a cascade of splits back up the tree. With proactive top-down splitting, you never need to go back up — one pass down, done.

Special case: root is full
Create a new empty root, make the old root its only child, then split the old root. Tree height increases by 1.
Walk down
At each internal node, find which child to descend into.
Child full?
If the child has 2t-1 keys, split it NOW (before descending). Median goes up to current node.
Descend
Move into the (now non-full) child. Repeat until you reach a leaf.
Insert into leaf
The leaf is guaranteed non-full. Insert key in sorted position. Done.

The Code

python
def b_tree_insert(tree, key):
    r = tree.root
    if r.n == 2 * tree.t - 1:  # root is full
        s = BTreeNode(leaf=False)
        tree.root = s
        s.children.append(r)
        split_child(s, 0)           # split old root
        insert_nonfull(s, key)
    else:
        insert_nonfull(r, key)

def insert_nonfull(node, key):
    i = node.n - 1
    if node.leaf:
        # Insert key into sorted position in leaf
        node.keys.append(None)
        while i >= 0 and key < node.keys[i]:
            node.keys[i + 1] = node.keys[i]
            i -= 1
        node.keys[i + 1] = key
        node.n += 1
    else:
        # Find which child to descend into
        while i >= 0 and key < node.keys[i]:
            i -= 1
        i += 1
        # If child is full, split it first
        if node.children[i].n == 2 * t - 1:
            split_child(node, i)
            if key > node.keys[i]:
                i += 1
        insert_nonfull(node.children[i], key)

def split_child(parent, i):
    """Split parent.children[i] which must be full (2t-1 keys)."""
    t = tree.t
    full_node = parent.children[i]
    new_node = BTreeNode(leaf=full_node.leaf)
    # new_node gets the last t-1 keys
    new_node.keys = full_node.keys[t:]        # keys after median
    new_node.n = t - 1
    if not full_node.leaf:
        new_node.children = full_node.children[t:]  # children after median
    # Median key goes up to parent
    median = full_node.keys[t - 1]
    full_node.keys = full_node.keys[:t - 1]  # keep first t-1 keys
    full_node.n = t - 1
    if not full_node.leaf:
        full_node.children = full_node.children[:t]
    # Insert median and new child into parent
    parent.keys.insert(i, median)
    parent.children.insert(i + 1, new_node)
    parent.n += 1

Worked Example

Let us insert the keys 10, 20, 30, 40, 50, 60, 70, 80, 5, 15 into an initially empty B-tree with t = 3 (max 5 keys per node). Follow along with the simulation below.

B-Tree Insertion (t = 3)

Click "Insert Next" to add keys one at a time. Watch splits happen when nodes reach 5 keys. The sequence is: 10, 20, 30, 40, 50, 60, 70, 80, 5, 15.

The key moments in this sequence:

After insertingWhat happens
10, 20, 30, 40, 50Root is full with [10,20,30,40,50]. No split yet.
60Root is full. Split: median 30 becomes the new root. Left child [10,20], right child [40,50]. Then 60 goes into right child: [40,50,60].
70, 80Right child becomes [40,50,60,70,80] — full but no split until the next insert needs it.
5Goes into left child: [5,10,20].
15Goes into left child: [5,10,15,20]. Still not full.
The only way the tree grows taller is when the root splits. Every other split just redistributes keys at the same level. This is why all leaves stay at the same depth — the tree grows from the top, not the bottom. This is fundamentally different from BSTs, which grow from the bottom (new leaves). B-tree growth from the top is what guarantees perfect balance.

Split Cost Analysis

How often do splits happen? Each split produces two nodes of t-1 keys from one node of 2t-1 keys. Before the next split can happen at the same node, t more keys must be inserted there. So on average, one split occurs for every t insertions. With t = 1000, you get one split per 1000 inserts. Splits are rare events.

When a split does happen, it requires:

Total: 4 disk I/Os for a split. Amortized over t insertions, that is 4/t extra I/Os per insert — negligible for large t.

Insertion Order and Tree Shape

Unlike BSTs, B-trees always produce balanced trees regardless of insertion order. Insert 1, 2, 3, ..., n in sorted order into a BST and you get a linked list (height n). Insert the same sequence into a B-tree and you get a perfectly balanced tree (height logt n). The proactive splitting mechanism redistributes keys as the tree grows, maintaining balance automatically.

However, insertion order does affect space utilization. Sequential insertions into a B-tree produce nodes that are about half full (each split creates two minimum-fill nodes). Random insertions produce nodes that are about 69% full. Some databases (like PostgreSQL) detect sequential insert patterns and use a special "fast path" that fills pages more aggressively.

The Bulk Loading Optimization

If you have a large dataset and want to build a B-tree from scratch, inserting keys one at a time is wasteful. Each insert does a full root-to-leaf traversal. For n keys, that is O(n · logt n) total disk I/Os.

The better approach is bulk loading: sort the keys first, then build the tree bottom-up by filling leaf pages to capacity, then creating internal pages over them. This requires only O(n/B) I/Os (one pass through the sorted data, writing full pages sequentially).

One-by-one insertion of 10 million keys (t=500): 10M × 3 disk reads = 30M disk I/Os At 10,000 IOPS (HDD): ~50 minutes Bulk load: Sort: O(n/B · logM/B(n/B)) I/Os = ~30,000 I/Os Build: O(n/B) I/Os = ~20,000 I/Os Total: ~50,000 I/Os At 10,000 IOPS: ~5 seconds Bulk loading is ~600x faster for large datasets.

This is why database import tools (like MySQL's LOAD DATA INFILE or PostgreSQL's COPY) are dramatically faster than individual INSERT statements. They sort the data and bulk-load the B-tree indexes.

Fill factor. When bulk loading, databases let you control the fill factor — how full to make each page. A fill factor of 100% uses minimum space but leaves no room for future inserts (every insert triggers a split). A fill factor of 70% wastes 30% of space but allows many inserts before splits are needed. InnoDB defaults to ~15/16 ≈ 94% fill for bulk loads. The right choice depends on your insert-to-read ratio.
Concept check: Why does CLRS split full nodes on the way DOWN (proactively) instead of splitting on the way back UP?

Chapter 4: Deletion

Deletion is the hardest operation in B-trees. The problem: after removing a key, a node might have fewer than t-1 keys, violating the B-tree property. We must fix this, and the fix depends on where the key is and what the neighboring nodes look like.

There are three cases, and each case has sub-cases. Let us work through all of them.

Case 1: Key is in a leaf

This is the simplest case. If the leaf has more than t-1 keys, just remove the key. Done. The node still satisfies the minimum-fill constraint.

Before: leaf = | 10 | 20 | 30 | (3 keys, t=3, min is 2) Delete 20: | 10 | 30 | (2 keys, still valid)

Case 2: Key is in an internal node

We cannot just remove a key from an internal node — it would leave the node with one fewer key than child pointers, breaking the structure. Instead, we replace the key with its predecessor or successor and delete that instead.

Case 2a: If the child y that precedes the key has at least t keys, find the predecessor (the largest key in y's subtree), replace the key with the predecessor, and recursively delete the predecessor from y.

Case 2b: If the child z that follows the key has at least t keys, find the successor (the smallest key in z's subtree), replace the key with the successor, and recursively delete the successor from z.

Case 2c: If both y and z have only t-1 keys, merge: combine y, the key, and z into a single node of 2t-1 keys. Then recursively delete the key from the merged node.
Case 2a — replace with predecessor: | ... | 30 | ... | / | 10 | 20 | 25 | ← child y has ≥ t keys Delete 30: replace with 25 (predecessor), then delete 25 from child | ... | 25 | ... | / | 10 | 20 |

Case 3: Key is not yet found, descending

We are still searching for the key and about to descend into child[i]. But child[i] has only t-1 keys — the minimum. If we delete from it, it might underflow. So before descending, we ensure the child has at least t keys.

Case 3a — steal from sibling: If an immediate sibling of child[i] has at least t keys, rotate a key through the parent. Move a key from the parent down into child[i], and move a key from the sibling up into the parent. This gives child[i] one more key without violating any constraints.

Case 3b — merge with sibling: If both immediate siblings have only t-1 keys, merge child[i] with one sibling, pulling down the separating key from the parent. The merged node has (t-1) + 1 + (t-1) = 2t-1 keys.
Case 3a — steal from right sibling: | ... | 30 | ... | / \ | 10 | 20 | | 40 | 50 | 60 | ← right sibling has t=3 keys (≥ t) Rotate: 30 goes down, 40 goes up: | ... | 40 | ... | / \ | 10 | 20 | 30 | | 50 | 60 | Now child[i] has 3 keys — safe to descend and delete.
Case 3b — merge (both siblings have t-1 keys): | ... | 30 | ... | / \ | 10 | 20 | | 40 | 50 | ← both have exactly t-1 = 2 keys Merge: pull 30 down, combine into one node: | ... | | ... | | | 10 | 20 | 30 | 40 | 50 | ← merged node has 2t-1 = 5 keys Parent lost a key. If parent was the root with 1 key, it becomes empty and the merged node becomes the new root (tree shrinks by 1 level).

Just like insertion uses proactive splitting, deletion uses proactive fixing on the way down. Before descending into any child, ensure it has at least t keys (not just t-1). This guarantees we never need to backtrack.

The Code

python
def b_tree_delete(node, key):
    i = 0
    while i < node.n and key > node.keys[i]:
        i += 1

    if i < node.n and key == node.keys[i]:
        if node.leaf:
            # Case 1: key in leaf — just remove
            node.keys.pop(i)
            node.n -= 1
        else:
            # Case 2: key in internal node
            delete_internal(node, i)
    else:
        if node.leaf:
            return  # key not in tree
        # Case 3: key not found yet, must descend
        # Ensure child[i] has ≥ t keys before descending
        if node.children[i].n == t - 1:
            fix_child(node, i)
        # Recurse (index may have changed after merge)
        b_tree_delete(node.children[i], key)

def delete_internal(node, i):
    key = node.keys[i]
    if node.children[i].n >= t:       # Case 2a
        pred = get_predecessor(node, i)
        node.keys[i] = pred
        b_tree_delete(node.children[i], pred)
    elif node.children[i+1].n >= t:  # Case 2b
        succ = get_successor(node, i)
        node.keys[i] = succ
        b_tree_delete(node.children[i+1], succ)
    else:                              # Case 2c
        merge_children(node, i)
        b_tree_delete(node.children[i], key)

Summary of All Deletion Cases

CaseConditionActionI/O Cost
1Key in leaf, leaf has > t-1 keysRemove key directly1 read + 1 write
2aKey in internal, left child has ≥ t keysReplace with predecessor, recurseO(h) reads + writes
2bKey in internal, right child has ≥ t keysReplace with successor, recurseO(h) reads + writes
2cKey in internal, both children have t-1 keysMerge children around key, recurseO(h) reads + writes
3aKey not found yet, child has t-1, sibling has ≥ tRotate: steal key through parent3 reads + 3 writes
3bKey not found yet, child has t-1, all siblings at t-1Merge child with sibling3 reads + 2 writes

The recurring theme: never operate on a node that is at the minimum. Before you do anything, make sure the target node has room. For insertion, "room" means not full (split if full). For deletion, "room" means above minimum (steal or merge if at minimum).

Symmetry between insertion and deletion. Insertion uses proactive splitting (fix full nodes on the way down). Deletion uses proactive filling (fix minimum nodes on the way down). Both ensure a single downward pass with no backtracking. Both maintain the invariant that all leaves are at the same depth. Splits grow the tree from the top; merges shrink it from the top.

Worked Deletion Example

Let us walk through a concrete deletion sequence on a B-tree with t=3. Start with this tree containing keys [3, 5, 7, 10, 12, 15, 18, 20, 22, 25, 28, 30, 33, 35, 38, 40]:

[20] / \ [10, 15] [30, 35] / | \ / | \ [3,5,7] [12] [18] [22,25,28] [33] [38,40]

Delete 7 (Case 1 — key in leaf with enough keys): The leaf [3,5,7] has 3 keys (> t-1 = 2). Just remove 7. Result: [3,5].

Delete 15 (Case 2a — key in internal, left child has enough keys): 15 is in the internal node [10,15]. The left child's subtree has predecessor 12. Replace 15 with 12, then delete 12 from its leaf. But wait — the leaf [12] has only 1 key (= t-1). We need Case 3. Before descending, check siblings. The left sibling [3,5] has 2 keys (= t-1) and the right sibling [18] has 1 key (= t-1). Both at minimum, so merge: combine [12] + 15 + [18] = [12,15,18]. Now delete 15 from this merged leaf. Result: internal becomes [10], children are [3,5] and [12,18].

Delete 20 (Case 2c — key in internal, both children have t-1 keys): 20 is the root key. Left child [10] and right child [30,35] — left has 1 key (t-1), but right has 2 keys. So Case 2b applies: take the successor (22), replace 20 with 22, delete 22 from the right subtree.

Each deletion case builds on the same principle: ensure you have enough keys to work with before you operate. This is the B-tree's version of "measure twice, cut once."

B-Tree Deletion — All Cases

Select a case and click "Step" to see the deletion animation. Watch keys rotate, merge, or simply disappear.

Concept check: You want to delete a key from an internal node but both its left child and right child have exactly t-1 keys. What do you do?

Chapter 5: B-Tree Showcase

This is the payoff. The simulation below is a fully interactive B-tree. You choose the minimum degree t. You insert and delete any key. You watch every split and merge in real time. The tree shows node utilization, current height, and a running count of disk reads.

Play with this. Try inserting keys 1 through 30 in order — watch the tree split and grow. Then delete keys from the middle and watch merges shrink it. Try random insertions. Try t=2 (the tightest tree) versus t=5 (much wider nodes). Feel how the branching factor changes the tree's shape and height.
Interactive B-Tree Simulator

Insert/delete keys. Adjust t to change the branching factor. Watch splits and merges live.

Min degree t 3
Height: 0 Keys: 0 Nodes: 0 Disk reads (last op): 0

Things to try:

Understanding Node Utilization

After many random insertions, most nodes are between half-full and completely full. This is not an accident — the split mechanism guarantees it. When a node with 2t-1 keys splits, it produces two nodes with t-1 keys each. Those nodes are exactly at the minimum. Subsequent insertions fill them back up. The average node utilization in a random B-tree is about ln(2) ≈ 69% — a result from Knuth's analysis.

This means a B-tree wastes about 31% of its allocated space on average. In practice, this is a good tradeoff: the wasted space buys us room for future insertions without immediate splits, and 69% utilization is much better than the 50% worst case.

Disk Reads Per Operation

Watch the "Disk reads (last op)" counter as you insert and delete. For a tree of height h:

OperationDisk ReadsDisk WritesExplanation
SearchO(h)0Read one node per level
Insert (no split)O(h)O(1)Read path + write leaf
Insert (with splits)O(h)O(h)Read path + write split nodes
Delete (simple)O(h)O(1)Read path + write modified node
Delete (with merges)O(h)O(h)Read path + write merged nodes

The key insight: even in the worst case, every operation does at most O(h) = O(logt n) disk I/Os. With t = 1000 and n = 109, that is 3 reads and 3 writes. The actual CPU work per node (scanning keys, shifting elements) is trivial in comparison.

Write amplification. When you modify a single key in a B-tree, you must rewrite the entire node (the entire 4-16 KB disk page). Inserting an 8-byte key causes a 16 KB write. This is called write amplification — the ratio of data written to disk vs data actually changed. For B-trees, write amplification = page_size / change_size. For an 8-byte key on a 16 KB page, that is 2048x. LSM-trees have even higher write amplification (due to compaction), but their writes are sequential, which is faster on both HDDs and SSDs.
Concept check: In a B-tree with t=3, a node with 5 keys splits. How many keys does each resulting node have, and where does the extra key go?

Chapter 6: B+ Trees

B-trees as described by CLRS store data (or pointers to data) at every node — both internal and leaf. This is fine in theory, but real databases almost universally use a variant called the B+ tree. The difference is simple but profound.

B+ tree rule: All actual data lives in the leaves only. Internal nodes store only keys and child pointers — they are pure routing nodes. Additionally, all leaf nodes are linked together in a doubly-linked list for fast sequential access.

Why B+ trees?

Three reasons, all practical:

1. Internal nodes are smaller. Since internal nodes do not carry data payloads, they carry only keys and pointers. More keys fit in a single disk page, so the branching factor is higher, so the tree is shorter. If each record is 100 bytes but each key is 8 bytes, a B-tree internal node wastes 92% of its space on data you do not need during search.

2. Range queries are fast. Want all records with keys between 50 and 200? In a B+ tree, find 50 (follow the tree down to the leaf), then walk the leaf linked list until you pass 200. You never need to go back up to the internal nodes. In a standard B-tree, range queries require an inorder traversal that bounces between internal and leaf nodes.

3. Predictable I/O. Every search touches exactly the same number of internal nodes plus one leaf. There is no "lucky" case where the key happens to be in an internal node at level 2 instead of a leaf. This makes query planning and caching more predictable.

B-tree: B+ tree: [30] [30] ← internal: keys only / \ / \ [10,20] [40,50] [10,20] [40,50] ← internal: keys only | | [10,20] ↔ [30,40,50] ← leaves: keys + data + linked (leaves store ALL keys)

Notice that in the B+ tree, the key 30 appears in both an internal node and a leaf. Internal copies are just signposts — the leaf copy is where the actual data record lives.

Counting the Advantage

Let us do the math to see why B+ trees have a higher branching factor. Suppose each key is 8 bytes, each child pointer is 8 bytes, and each data record is 200 bytes. Page size is 4 KB (4096 bytes).

B-tree internal node: Each entry = key (8 bytes) + data pointer (8 bytes) + child pointer (8 bytes) = 24 bytes Keys per node = 4096 / 24 ≈ 170 Branching factor = 171 B+ tree internal node: Each entry = key (8 bytes) + child pointer (8 bytes) = 16 bytes Keys per node = 4096 / 16 ≈ 256 Branching factor = 257 B+ tree has 50% higher branching factor! This translates to shorter trees. B-tree leaf node: Each entry = key (8 bytes) + data (200 bytes) = 208 bytes Records per leaf = 4096 / 208 ≈ 19 B+ tree leaf node: Same: 19 records per leaf (leaves store data in both) But also stores next/prev leaf pointers (16 bytes overhead) Records per leaf = (4096 - 16) / 208 ≈ 19 (negligible overhead)

For a dataset of 100 million records:

B-tree (branching 171): Height = log171(100M) ≈ 3.6 → height 4 Point query: 4 disk reads B+ tree (branching 257): Height = log257(100M / 19) ≈ 2.8 → height 3 Point query: 3 disk reads (+ 1 leaf read = still 3-4 total, but internal levels = 3) One fewer level means ~171 times fewer internal nodes to cache. The entire internal tree fits in RAM more easily.

Range Query Walkthrough

Suppose we want all keys in the range [15, 45] from a B+ tree. The algorithm:

Find 15
Search the tree for 15 (or the first key ≥ 15). Descend from root to leaf. Cost: O(logt n) disk reads.
Scan leaves
Follow the leaf linked list, collecting every key ≤ 45. Each leaf is one disk read, and leaves are stored sequentially on disk.
Stop at 45
When you see a key > 45, stop. Total cost: O(logt n + k/B) where k is the number of results and B is keys per leaf.
B+ Tree vs B-Tree: Range Queries

Enter a range [lo, hi] and compare how many nodes each structure visits. The B+ tree follows the leaf chain; the B-tree must traverse the tree.

Range: to
FeatureB-TreeB+ Tree
Data locationEvery nodeLeaves only
Leaf linkingNoDoubly-linked list
Duplicate keys in internal nodesNoYes (as signposts)
Point query costO(logt n) — might stop earlyO(logt n) — always to leaf
Range query costO(logt n + k) via inorderO(logt n + k/B) via leaf scan
Branching factorLower (data in nodes)Higher (keys only in internal)
Used in practiceSQLite (some tables)MySQL InnoDB, PostgreSQL, most RDBMS

B+ Tree Insertion and Deletion

B+ tree insertion works like B-tree insertion with one twist: when a leaf splits, the median key is copied up to the parent (not moved). In a B-tree, the median leaves its original node. In a B+ tree, the median stays in the right leaf (because all keys must be in leaves) and a copy goes to the parent as a routing signpost.

B-tree split (median MOVES up): | 10 | 20 | 30 | 40 | 50 | → | 10 | 20 | 30 | 40 | 50 | ↑ goes to parent, removed from here B+ tree split (median COPIED up): | 10 | 20 | 30 | 40 | 50 | → | 10 | 20 | 30 | 30 | 40 | 50 | ↑ copy to parent, original stays in right leaf

Deletion in a B+ tree is also slightly different. If you delete a key that appears as a signpost in an internal node, the internal copy might become stale. Some implementations leave stale signposts (they still correctly route searches). Others update the signpost to the new minimum of the right subtree. Both approaches are valid — stale signposts just waste a tiny amount of space.

Real-World Page Layout

Here is what a B+ tree page looks like in a real database (InnoDB-style, 16 KB page):

+-----------------------------------------------------------+ | Page Header (38 bytes) | | page_id, checksum, LSN, page_type, ... | +-----------------------------------------------------------+ | Index Header (36 bytes) | | num_records, heap_top, free_list_ptr, ... | +-----------------------------------------------------------+ | Record Directory (grows backward from end of page) | | slot[0], slot[1], ..., slot[n] (2 bytes each) | +-----------------------------------------------------------+ | Free Space | | (new records are allocated here) | +-----------------------------------------------------------+ | Records (grow forward) | | [hdr|key|data] [hdr|key|data] [hdr|key|data] ... | +-----------------------------------------------------------+ | Page Trailer (8 bytes) | | checksum, LSN | +-----------------------------------------------------------+ For internal pages: [key|child_page_id] instead of [key|data] For leaf pages: next_page_ptr and prev_page_ptr in the header

With a 16 KB page, 8-byte keys, and 6-byte child pointers, an internal node holds about 16,000 / 14 ≈ 1,143 keys — so the branching factor is about 1,144. Two levels of internal nodes can index 1,1442 ≈ 1.3 million leaf pages. If each leaf holds 100 rows, that is 130 million rows reachable in 3 disk reads.

Concept check: Why are range queries faster in a B+ tree than in a standard B-tree?

Chapter 7: B-Trees in Databases

Every time you run a SQL query with a WHERE clause on an indexed column, a B-tree (or B+ tree) is doing the work. Let us look at how the three most popular open-source databases use them.

MySQL InnoDB: Clustered B+ Tree Index

InnoDB stores the entire table as a B+ tree, ordered by the primary key. This is called a clustered index — the leaf nodes of the B+ tree contain the actual row data, not just pointers to rows stored elsewhere.

Clustered index. The table IS the B+ tree. When you do SELECT * FROM users WHERE id = 42, InnoDB searches the B+ tree for key 42 and finds the entire row at the leaf. There is no separate "heap" of row data — the index and the data are the same structure.

Secondary indexes in InnoDB are also B+ trees, but their leaves store the primary key (not the row data). A secondary index lookup requires two B+ tree searches: one to find the primary key in the secondary index, then another to find the row in the clustered index. This is called a double lookup or "bookmark lookup."

Primary key lookup (1 B+ tree traversal): B+ tree (id) → leaf contains row data → done Secondary index lookup (2 B+ tree traversals): B+ tree (email) → leaf contains primary key (id=42) B+ tree (id) → leaf contains row data for id=42

PostgreSQL: Heap + B-Tree Indexes

PostgreSQL takes a different approach. The table data lives in a heap — an unordered pile of row pages. Indexes are separate B-tree structures whose leaves contain tuple IDs (TIDs), which are (page number, offset) pointers into the heap.

Every index lookup in PostgreSQL requires a B-tree search followed by a heap page fetch. This seems slower than InnoDB's clustered index, but it has an advantage: secondary indexes point directly to the heap (one B-tree search), whereas InnoDB's secondary indexes require a double lookup.

SQLite: B-Tree for Tables, B+ Tree for Indexes

SQLite uses plain B-trees (not B+) for tables (which it calls "table B-trees") and B+ trees for indexes. Each database is a single file divided into fixed-size pages (default 4096 bytes). Each page is one B-tree node. The page size is the node size — this is literally what we meant by "size nodes to fit one disk page."

The Buffer Pool

Databases do not read from disk on every B-tree node access. They maintain a buffer pool — a cache of recently-accessed disk pages in RAM. When a B-tree search needs a node, it first checks the buffer pool. If the page is there (a "cache hit"), no disk I/O happens. Only on a "cache miss" does the database actually read from disk.

The root node of every frequently-used index is virtually always in the buffer pool (it is accessed on every query). Internal nodes near the root are also likely cached. In practice, only the leaf node access triggers a disk read. For a B+ tree with height 4, that means 3 cache hits and 1 disk read — blazing fast.

Buffer Pool Simulation

Search for keys in a B+ tree with a buffer pool. Watch which nodes cause cache hits (green) vs cache misses (red/disk read). Recently accessed nodes stay cached.

Cache hits: 0 Cache misses: 0 Hit rate: - Buffer pool: 0/8 pages
DatabaseTable StoragePrimary IndexSecondary IndexPage Size
MySQL InnoDBClustered B+ treeB+ tree (data in leaves)B+ tree (stores PK)16 KB
PostgreSQLHeap (unordered)B-tree (stores TID)B-tree (stores TID)8 KB
SQLiteB-treeTable B-treeB+ tree (stores rowid)4 KB

Concurrency: B-Tree Locking

In a multi-threaded database, many transactions access the B-tree simultaneously. Without coordination, two threads might try to split the same node, corrupting the tree. Databases use latches (lightweight locks) to protect B-tree pages.

The naive approach — lock the entire tree for every operation — destroys concurrency. The standard technique is latch crabbing (also called "latch coupling"):

Lock parent
Acquire a latch on the current node.
Lock child
Acquire a latch on the child you are descending into.
Safe?
If the child is "safe" (not full for inserts, or not at minimum for deletes), release the parent's latch.
Continue
Descend to child. Repeat: lock next child, check if safe, release parent if so.

A node is "safe" if the current operation cannot cause a structural change (split or merge) at that node. For search, every node is safe (no modifications). For insert, a node is safe if it has fewer than 2t-1 keys (room for one more without splitting). For delete, a node is safe if it has more than t-1 keys (room to lose one without merging).

Latch crabbing ensures that at most two levels are locked at any time. This allows operations on different parts of the tree to proceed in parallel — a reader in the left subtree does not block a writer in the right subtree.

Optimistic locking. Modern databases (like PostgreSQL) often use an optimistic approach: descend the tree without write locks, and only lock when you reach the leaf and are ready to modify. If the leaf needs a split (which is rare), restart the operation with pessimistic locking. Since splits are infrequent (once per t inserts), the optimistic path succeeds almost every time, giving near-lock-free read performance.

Write-Ahead Logging and B-Trees

What happens if the database crashes while splitting a B-tree node? A split modifies three pages: the full node, the new sibling, and the parent. If the system crashes after writing the sibling but before updating the parent, the tree is corrupted — the new sibling is an orphan.

Databases solve this with Write-Ahead Logging (WAL). Before modifying any B-tree page, the database writes a log entry describing the change. If a crash occurs, the recovery process replays the log to complete or undo partial operations. The log is append-only and sequential, so writing it is fast.

1. Write log entry
"About to split page 42. New sibling = page 107. Median key = 30."
2. Modify pages
Write page 42, page 107, parent page.
3. Checkpoint
Periodically flush dirty pages to disk and mark log entries as applied.

An alternative approach is copy-on-write (COW), used by LMDB and btrfs. Instead of modifying pages in place, write new pages and atomically update the root pointer. This eliminates the need for WAL but increases write amplification (every modification creates new pages all the way up to the root).

WAL approach (InnoDB, PostgreSQL): 1. Append log record (sequential write, fast) 2. Modify page in buffer pool (in RAM, instant) 3. Later: flush dirty page to disk (background, batched) Crash recovery: replay log from last checkpoint COW approach (LMDB, btrfs): 1. Write new leaf page with modification 2. Write new parent page pointing to new leaf 3. ... all the way up to root 4. Atomically update root pointer (single write) Crash recovery: nothing! Old root still points to old (consistent) pages. Free snapshots: keep old root pointers → instant read-only snapshots.

COW trades write amplification for simplicity and free snapshots. LMDB is single-writer (no write concurrency) but achieves phenomenal read performance because readers never need locks — they just follow the root pointer they started with, which always points to a consistent snapshot.

Compaction and Fragmentation

Over time, B-tree pages become fragmented. Deletions leave gaps. Splits create half-full pages. Pages that were once adjacent on disk may become scattered. This causes two problems:

Internal fragmentation: Pages are not fully utilized. A page with 50% utilization wastes half its space. Databases periodically run OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) to rebuild the B-tree from scratch, compacting all pages.

External fragmentation: Logically adjacent leaf pages (e.g., keys 1-100 and keys 101-200) may be physically scattered across the disk. Sequential range scans become random I/O. SSDs mitigate this (random reads are nearly as fast as sequential), but HDDs suffer badly.

Choosing the Right Index

Not every column needs an index, and not every index should be a B-tree. Here is a decision guide:

WorkloadBest Index TypeWhy
Primary key lookupsB+ tree (clustered)Data in leaves, no second lookup
Range queries (BETWEEN, >, <)B+ treeLeaf chain enables sequential scan
Exact match only (=)Hash indexO(1) average, no wasted space on ordering
Full-text searchInverted index (GIN)B-trees cannot search for words within text
Geospatial (lat/lng)R-tree (GiST)B-trees only handle 1D ordering
High-cardinality flag columnsBitmap indexCompact representation for low-cardinality data
Write-heavy, rarely readConsider no indexEvery index adds write overhead

Each secondary index adds overhead to every INSERT, UPDATE, and DELETE on the table (the index must be updated too). A table with 5 indexes effectively writes 6 times for every row modification: once to the table plus once to each index. This is why DBAs are cautious about over-indexing.

Why not hash indexes for everything? Hash indexes give O(1) point queries — faster than B-trees for exact lookups. But they cannot do range queries, ordered scans, or prefix matching. Since most databases need all of these, B-trees win as the default index type. PostgreSQL and MySQL do offer hash indexes for specific use cases. Additionally, hash indexes do not support ORDER BY, GROUP BY, or BETWEEN clauses — all of which rely on key ordering that B-trees provide for free.
Concept check: In MySQL InnoDB, a secondary index (e.g., on email column) lookup requires how many B+ tree searches to find the full row data?

Chapter 8: Interview Arsenal

Cheat Sheet

OperationDisk ReadsCPU TimeNotes
SearchO(logt n)O(t · logt n)Binary search within nodes: O(log n)
InsertO(logt n)O(t · logt n)At most O(logt n) splits
DeleteO(logt n)O(t · logt n)At most O(logt n) merges
Range [a, b]O(logt n + k/B)O(logt n + k)B+ tree; k results, B keys/leaf
Min / MaxO(logt n)O(logt n)Leftmost / rightmost leaf
SpaceO(n) — each key stored exactly once (B-tree) or duplicated in internals (B+)

Design Questions

Q: "Design a database index for a table with 100 million rows."

A: Use a B+ tree with page size = disk page (4-16 KB). With 8-byte keys and 8-byte pointers, a 4 KB page holds ~250 key-pointer pairs. Branching factor ~250. Height = log250(108) ≈ 3.3, so height 4. With the root and level-1 nodes cached (fits in ~64 KB), every point query needs at most 2 disk reads. Use a clustered index on the primary key (rows stored in PK order) and B+ tree secondary indexes for frequently queried columns.
Q: "Why not use a hash index for range queries?"

A: Hash functions destroy ordering. hash(50) and hash(51) map to completely unrelated buckets. To answer "find all keys between 50 and 200," a hash index must scan every bucket — full table scan. A B+ tree finds 50 in O(log n), then follows the leaf chain to 200. Range queries are O(log n + k) in a B+ tree and O(n) in a hash index. This is the reason B-trees, not hash tables, are the default index structure in every major database.
Q: "What is the difference between a clustered and non-clustered index?"

A: A clustered index stores the actual data rows in the leaf nodes, sorted by the index key. There can be only one (the data has one physical order). A non-clustered (secondary) index stores pointers to the data (primary keys or row IDs) in its leaves. There can be many. In InnoDB, the primary key is always a clustered B+ tree; secondary indexes point to the PK. In PostgreSQL, all indexes are non-clustered (the data lives in an unordered heap).

Coding Drills

Drill 1: Implement B-tree search. Given a B-tree node class with keys[], children[], n (key count), and leaf (boolean), write the search function. Key insight: find the right child index using a linear scan or binary search on keys, then recurse if not found and not a leaf.
Drill 2: Design a file-system index. You are building an index for a file system that stores 10 million files. Each file has a path (variable-length string, avg 100 bytes) and metadata (200 bytes). Design the B+ tree. Key considerations: use a hash of the path as the key (fixed 8 bytes) or use the path directly (variable-length keys, more complex node format). Page size 4 KB = ~13 entries with direct paths or ~250 entries with hash keys.
Drill 3: B-tree vs LSM-tree. Modern write-heavy systems (Cassandra, RocksDB, LevelDB) use LSM-trees instead of B-trees. The tradeoff: B-trees do in-place updates (fast reads, slower writes — random I/O). LSM-trees batch writes sequentially (fast writes, slower reads — requires periodic compaction). Interview follow-up: "When would you choose an LSM-tree over a B-tree?"

Complexity Comparison

StructureSearchInsertDeleteRangeDisk-Optimal?
Sorted ArrayO(log n)O(n)O(n)O(log n + k)No (O(n) insert)
BST (balanced)O(log n)O(log n)O(log n)O(log n + k)No (height = log2 n)
Hash TableO(1)O(1)O(1)O(n)No (no ordering)
B-TreeO(logt n)O(logt n)O(logt n)O(logt n + k)Yes
LSM-TreeO(log n · L)O(1) amortizedO(1) amortizedO(log n + k)Yes (write-optimized)

Worked Design Problem: "Design a Key-Value Store"

This is a common system design interview question. Here is a staff-engineer-grade walkthrough using B-trees.

Problem: Design a persistent key-value store that supports GET(key), PUT(key, value), DELETE(key), and SCAN(start_key, end_key). Keys are strings (avg 32 bytes). Values are blobs (avg 256 bytes). Dataset: 100 GB. Must survive crashes. Target: <1ms reads, <5ms writes.

Step 1: Choose the data structure. We need point lookups AND range scans, so hash tables are out. B+ tree is the standard choice. LSM-tree is an alternative if writes dominate reads.

Step 2: Page size. Use 16 KB pages (InnoDB default). With 32-byte keys and 8-byte child pointers, an internal page holds ~400 keys. Branching factor ≈ 400.

Step 3: Estimate tree dimensions.

Total records: 100 GB / (32 + 256 bytes) = ~350 million records Leaf pages: each leaf holds ~55 records (16 KB / 288 bytes per record) Total leaves: 350M / 55 = ~6.4 million leaf pages Internal pages: 6.4M / 400 = 16,000 level-1 pages 16,000 / 400 = 40 level-2 pages 40 / 400 = 1 root page Height = 4 (root + 3 levels). Every GET requires at most 4 disk reads. With root + level-2 in cache (41 pages = 656 KB), GET needs 2 disk reads. At 0.1ms per SSD read, that is 0.2ms per GET. ✓ Under 1ms target.

Step 4: Write path. PUT = search (4 reads) + write leaf page (1 write) + WAL entry (1 sequential write). The WAL write is sequential (fast). The page write may be deferred (dirty page in buffer pool, flushed later). Amortized write latency: ~0.5ms on SSD. Under 5ms target.

Step 5: Buffer pool sizing. Cache the top 2 levels (41 pages = 656 KB) plus hot leaf pages. With 1 GB of RAM for the buffer pool, you can cache ~65,000 pages = 1% of all pages. After warm-up, the hot set gives >90% hit rate.

Step 6: Crash recovery. WAL with checkpointing. On crash, replay WAL from last checkpoint. Recovery time proportional to WAL length since last checkpoint (seconds to minutes).

Complete B-Tree Implementation

Here is a complete, working Python B-tree that you can copy and run. Every function maps directly to the pseudocode in CLRS. Study the split_child and delete functions — these are the two that interviewers care about most.

python
class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []
        self.children = []
        self.leaf = leaf

class BTree:
    def __init__(self, t):
        self.root = BTreeNode()
        self.t = t  # minimum degree (must be ≥ 2)

    def search(self, key, node=None):
        """Return (node, index) or None."""
        if node is None:
            node = self.root
        i = 0
        while i < len(node.keys) and key > node.keys[i]:
            i += 1
        if i < len(node.keys) and key == node.keys[i]:
            return (node, i)
        if node.leaf:
            return None
        return self.search(key, node.children[i])

    def insert(self, key):
        root = self.root
        if len(root.keys) == 2 * self.t - 1:
            # Root is full — grow the tree
            new_root = BTreeNode(leaf=False)
            new_root.children.append(self.root)
            self.root = new_root
            self._split_child(new_root, 0)
            self._insert_nonfull(new_root, key)
        else:
            self._insert_nonfull(root, key)

    def _split_child(self, parent, i):
        """Split parent.children[i], which must be full."""
        t = self.t
        full = parent.children[i]
        new = BTreeNode(leaf=full.leaf)
        # Median key
        median = full.keys[t - 1]
        # Right half goes to new node
        new.keys = full.keys[t:]
        full.keys = full.keys[:t - 1]
        if not full.leaf:
            new.children = full.children[t:]
            full.children = full.children[:t]
        # Insert median into parent
        parent.keys.insert(i, median)
        parent.children.insert(i + 1, new)

    def _insert_nonfull(self, node, key):
        i = len(node.keys) - 1
        if node.leaf:
            node.keys.append(None)
            while i >= 0 and key < node.keys[i]:
                node.keys[i + 1] = node.keys[i]
                i -= 1
            node.keys[i + 1] = key
        else:
            while i >= 0 and key < node.keys[i]:
                i -= 1
            i += 1
            if len(node.children[i].keys) == 2 * self.t - 1:
                self._split_child(node, i)
                if key > node.keys[i]:
                    i += 1
            self._insert_nonfull(node.children[i], key)

    def inorder(self, node=None):
        """Return all keys in sorted order."""
        if node is None:
            node = self.root
        result = []
        for i in range(len(node.keys)):
            if not node.leaf:
                result.extend(self.inorder(node.children[i]))
            result.append(node.keys[i])
        if not node.leaf:
            result.extend(self.inorder(node.children[-1]))
        return result

# Usage:
tree = BTree(t=3)
for key in [10, 20, 30, 40, 50, 60, 70, 80, 5, 15]:
    tree.insert(key)
print(tree.inorder())
# Output: [5, 10, 15, 20, 30, 40, 50, 60, 70, 80]

print(tree.search(30))   # (<node>, 0) — found!
print(tree.search(99))   # None — not found

Notice the key design decision in _insert_nonfull: we check if the child is full before descending. If it is, we split it immediately and then decide which of the two resulting children to descend into. This is the proactive splitting that guarantees a single downward pass.

Implementation exercise. Add a delete(key) method to this class. Follow the three cases from Chapter 4. Test it by inserting 20 random keys, deleting them in random order, and verifying that inorder() is always sorted and the tree properties hold. This is a 60-90 minute coding exercise — well within the scope of a systems coding interview.

Verification: How to Check B-Tree Invariants

When implementing or debugging a B-tree, you need a function that verifies all five properties. Here is a verification checklist you can turn into code:

python
def verify_btree(tree):
    """Verify all B-tree properties. Raises AssertionError if violated."""
    t = tree.t

    def check(node, min_key, max_key, depth, leaf_depth):
        # Property 3: keys are sorted
        for i in range(len(node.keys) - 1):
            assert node.keys[i] < node.keys[i+1], f"Keys not sorted: {node.keys}"

        # Property 1: key count bounds
        if node != tree.root:
            assert len(node.keys) >= t - 1, f"Underflow: {len(node.keys)} < {t-1}"
        assert len(node.keys) <= 2 * t - 1, f"Overflow: {len(node.keys)} > {2*t-1}"

        # Property 4: keys within bounds from ancestors
        for k in node.keys:
            assert min_key < k < max_key, f"Key {k} outside [{min_key},{max_key}]"

        if node.leaf:
            # Property 5: all leaves at same depth
            if leaf_depth[0] is None:
                leaf_depth[0] = depth
            assert depth == leaf_depth[0], f"Leaf at depth {depth}, expected {leaf_depth[0]}"
        else:
            # Property 2: k keys = k+1 children
            assert len(node.children) == len(node.keys) + 1
            # Recurse into children with tightened bounds
            bounds = [min_key] + node.keys + [max_key]
            for i, child in enumerate(node.children):
                check(child, bounds[i], bounds[i+1], depth + 1, leaf_depth)

    check(tree.root, float('-inf'), float('inf'), 0, [None])
    print("All B-tree properties verified! ✓")

Run this after every insert and delete during development. It catches bugs immediately. The function checks all five properties in O(n) time by visiting every node once.

Five Dimensions of B-Tree Mastery

DimensionWhat Interviewers TestKey Knowledge
CONCEPTCan you explain WHY B-trees exist?Disk I/O bottleneck, branching factor trades CPU for fewer reads
DESIGNCan you design a system using B-trees?Page sizing, height estimation, buffer pool, WAL, clustered vs secondary
CODECan you implement B-tree search and insert?Split logic, proactive splitting, median promotion
DEBUGCan you diagnose B-tree performance issues?Fragmentation, lock contention, buffer pool misses, sequential inserts causing right-heavy splits
FRONTIERDo you know what comes after B-trees?LSM-trees, Bw-trees (lock-free), learned indexes, adaptive radix trees
Concept check: A developer proposes replacing all B-tree indexes with hash indexes "because hash lookups are O(1)." What is the most important reason this is a bad idea for a general-purpose database?

Chapter 9: Connections

B-trees sit at the intersection of algorithms and systems. Understanding them connects you to a web of related ideas.

Where B-Trees Came From

ConceptRelationshipLink
BSTs (CLRS Ch 12)B-trees generalize BSTs: instead of 1 key and 2 children, each node has up to 2t-1 keys and 2t children. The BST property generalizes to the multi-key routing rule.BST lesson
Red-Black Trees (CLRS Ch 13)A red-black tree is isomorphic to a B-tree with t=2 (2-3-4 tree). Red nodes are "absorbed" into their black parent to form a B-tree node. This is why red-black trees are balanced — they are B-trees in disguise.Red-Black Tree lesson
Disk I/O ModelB-trees are designed for the external memory model where computation is free and I/O (disk reads/writes) is expensive. This model also motivates external sorting (merge sort) and cache-oblivious algorithms.

Where B-Trees Lead

ConceptRelationship
B+ TreesThe practical variant used in real databases. Data in leaves only, leaves linked for range scans. We covered this in Chapter 6.
LSM-TreesThe write-optimized alternative. Used in Cassandra, RocksDB, LevelDB. Trades read speed for write speed. B-trees do in-place updates; LSM-trees do sequential appends + compaction.
Buffer TreesAttach write buffers to internal B-tree nodes. Writes are batched and lazily pushed down. Combines B-tree read performance with better write throughput.
Fractal TreesUsed in TokuDB/PerconaFT. Like buffer trees but with better theoretical bounds. O(logB n / Bε) amortized inserts.
COW B-TreesCopy-on-write B-trees never modify pages in place — they write new pages and update pointers. Used in btrfs, LMDB. Provides free snapshots and crash consistency.

Modern Research: Beyond Traditional B-Trees

B-trees have been the dominant index structure for 50 years, but researchers continue to push the boundaries:

Learned Indexes (Kraska et al., 2018): Instead of a tree of comparisons, use a machine learning model to predict where a key is. A neural network trained on the data distribution can predict the position of any key in a sorted array with high accuracy. The prediction replaces the tree traversal. On read-only workloads with static data, learned indexes can be 1.5-3x faster than B-trees while using 10-100x less memory. The challenge: handling inserts, deletes, and distribution shifts. Active research area.

Bw-Trees (Levandoski et al., 2013): Lock-free B-trees for multi-core systems. Instead of latching individual pages, Bw-trees use compare-and-swap (CAS) operations and delta chains. Each modification creates a small delta record that is prepended to a page's chain. Periodically, deltas are consolidated into a new base page. Used in SQL Server Hekaton and Azure Cosmos DB.

ALEX (Ding et al., 2020): An "adaptive learned index" that handles inserts by combining learned models with gapped arrays. It self-tunes its structure based on the workload, gradually reshaping itself as the data distribution evolves. Achieves near-learned-index read performance with practical insert support.

Write-Optimized Indexes: Bε-trees (Brodal & Fagerberg, 2003) add write buffers to each node, batching writes and flushing them down periodically. They achieve O(logB n / Bε) amortized insert cost — exponentially better than B-trees for write-heavy workloads. TokuDB (now PerconaFT) implements this as "Fractal Tree Indexes."

The 50-year lesson. B-trees have survived through the transition from magnetic drums to spinning disks to SSDs to NVMe. They will likely survive the transition to persistent memory (Intel Optane) and storage-class memory (CXL). The principle — match the data structure to the memory hierarchy — is timeless even as the specific hardware changes.

The Bigger Picture

B-Tree vs LSM-Tree: The Great Debate

The most important alternative to B-trees is the Log-Structured Merge-Tree (LSM-tree), invented by Patrick O'Neil in 1996. It powers Cassandra, RocksDB, LevelDB, HBase, and many modern "NoSQL" databases. The tradeoff is fundamental:

PropertyB-TreeLSM-Tree
Write patternRandom I/O (update in place)Sequential I/O (append-only)
Read pattern1 tree traversalCheck memtable + multiple sorted runs
Write throughputGoodExcellent (10-100x better on HDD)
Read latencyPredictable, lowVariable (may check many levels)
Space amplification~1.5x (half-full pages)~1.1x (compacted) to 2x+ (during compaction)
Write amplificationModerate (full page rewrite)High (compaction rewrites data multiple times)
ConcurrencyComplex (latch crabbing)Simple (immutable sorted runs)
Range queriesFast (B+ tree leaf chain)Requires merge across all levels

When to choose B-trees: Read-heavy workloads, range queries, predictable latency requirements, traditional OLTP (banking, e-commerce). This is most workloads.

When to choose LSM-trees: Write-heavy workloads (logging, time-series, event streams), write-once-read-rarely patterns, workloads where sequential I/O matters (HDDs, distributed storage). Also when you need simple crash recovery (immutable files are easier to recover than in-place-modified pages).

Modern systems increasingly blur this line. RocksDB is an LSM-tree that adds Bloom filters and prefix seeking to improve read performance. WiredTiger (MongoDB's engine) supports both B-tree and LSM modes. Research systems like Dostoevsky and Monkey allow tuning the LSM structure along a continuum between pure LSM and pure B-tree behavior.

The Landscape of Tree-Based Indexes

StructureBranchingOptimized ForUsed In
BST / AVL / Red-Black2RAM, O(log₂ n) comparisonsstd::map, TreeMap, in-memory DBs
B-Tree (t=2, aka 2-3-4)2-4Equivalent to Red-Black treeTheoretical equivalence
B-Tree (large t)100-2000Disk I/O, minimize readsSQLite tables, ext4 directory indexes
B+ Tree100-2000Range queries + disk I/OMySQL, PostgreSQL, Oracle, SQL Server
B* Tree100-2000Higher node utilization (2/3 full)Some file systems (HFS+)
LSM-TreeN/A (log-structured)Write-heavy workloadsCassandra, RocksDB, LevelDB
Bw-Tree100-2000Lock-free, multi-coreSQL Server Hekaton, Azure Cosmos
Fractal Tree100-2000Reduced write amplificationTokuDB (PerconaFT)
B-trees teach a fundamental systems lesson: algorithms must respect the hardware they run on. A BST is optimal for comparisons, but comparisons are not the bottleneck on disk. B-trees are "worse" in terms of comparisons (O(t · logt n) vs O(log2 n)) but vastly better in terms of I/O. The same principle appears everywhere: cache-oblivious algorithms, GPU-friendly data structures, SIMD-optimized sorting. The best algorithm depends on the machine.

What We Covered

In this lesson you learned:

The one thing to remember. If someone asks "how does a database find a record among billions?" — the answer is a B+ tree. Height 3-4, each node is one disk page, the root is cached, so any record is 2-3 disk reads away. That is the single most important idea in this entire lesson.

Recommended Reading

"The art of programming is the art of organizing complexity, of mastering multitude and avoiding its bastard chaos as effectively as possible." — Edsger Dijkstra

Final check: A red-black tree is structurally equivalent to which B-tree variant?