Disk-optimized search trees — how databases store billions of keys with just 2-3 disk reads.
You are building a database that stores 1 billion user records. Each record has a unique ID, and users search, insert, and delete records constantly. The data does not fit in RAM — it lives on disk (an SSD or, historically, a spinning hard drive). How do you organize the data so that finding any record is fast?
Your first instinct: a binary search tree. With 1 billion keys, a balanced BST has height log2(109) ≈ 30. That means 30 comparisons to find any key. In RAM, 30 comparisons take microseconds. But here is the catch: each comparison requires reading a different node, and each node lives on a different disk page.
A single disk read (called a "seek" on a spinning disk, or a "page fetch" on an SSD) takes about 10 milliseconds on a hard drive. That is 10,000,000 nanoseconds. A CPU instruction takes about 1 nanosecond. So one disk read costs the same as 10 million CPU operations. Thirty disk reads means 300 milliseconds — a third of a second — just to find one record.
A B-tree solves this by packing hundreds or thousands of keys into each node. Instead of 2 children per node, a B-tree node can have 1000 children. The height drops from log2(n) to log1000(n). For 1 billion keys:
Each B-tree node is sized to fit exactly one disk page (typically 4 KB or 16 KB). When the OS reads a page from disk, it gets the entire node — all its keys and child pointers — in a single I/O operation. The extra CPU work of scanning through hundreds of keys within a node is trivial compared to the cost of one more disk read.
The simulation below lets you feel the difference. It compares the number of "disk reads" (node visits) for a BST versus a B-tree as the number of keys grows. Watch how the BST's height climbs relentlessly while the B-tree stays flat.
Drag the slider to change the number of keys. The chart shows how many disk reads (node visits) each structure needs to find a key in the worst case.
Even with a modest branching factor of 100, the B-tree needs only 4-5 disk reads for billions of keys. Increase the branching factor to 500 (common in real databases) and you need 2-3 reads for practically any dataset on Earth.
This is why every major database — MySQL, PostgreSQL, SQLite, Oracle, MongoDB — uses B-trees (or the closely related B+ tree) as their primary index structure. It is the single most important data structure in storage systems.
To understand why B-trees matter, you need to understand the memory hierarchy. Modern computers have several levels of storage, each faster but smaller than the next:
| Level | Size | Access Time | Relative Speed |
|---|---|---|---|
| CPU Registers | ~1 KB | ~0.3 ns | 1x |
| L1 Cache | ~64 KB | ~1 ns | 3x slower |
| L2 Cache | ~256 KB | ~4 ns | 13x slower |
| L3 Cache | ~8 MB | ~12 ns | 40x slower |
| RAM | ~16 GB | ~100 ns | 333x slower |
| SSD | ~1 TB | ~100 μs | 333,000x slower |
| HDD | ~4 TB | ~10 ms | 33,000,000x slower |
The jump from RAM to SSD is 1,000x. The jump from RAM to HDD is 100,000x. A data structure that minimizes disk accesses — even at the cost of more CPU work per access — wins by orders of magnitude. That is the B-tree's entire design philosophy.
Textbook algorithms assume all data is in RAM and count comparisons. The disk I/O model (also called the external memory model) counts disk transfers instead. In this model:
In this model, scanning n items costs O(n/B) I/Os (read n/B blocks). Sorting n items costs O((n/B) · logM/B(n/B)) I/Os — much better than the O(n log n) comparison sort. And searching in a B-tree costs O(logB n) I/Os, because each node holds B items.
This is why B-trees use branching factor proportional to B (the block/page size): it makes each "level" of the tree map to exactly one disk read, giving the optimal O(logB n) search cost in the external memory model. B-trees are I/O optimal for search — no other comparison-based data structure can do better.
A B-tree is defined by a single parameter called the minimum degree, written t (where t ≥ 2). This parameter controls how wide and short the tree is. Every rule about a B-tree flows from t.
Here are the five properties that define a B-tree. Memorize these — they are the entire specification.
Let us make this concrete with t = 3 (minimum degree 3). Each node has between 2 and 5 keys (t-1 = 2, 2t-1 = 5). Each internal node has between 3 and 6 children (t = 3, 2t = 6). The root can have as few as 1 key and 2 children.
Think of each node as a sorted array with signposts. If a node contains keys [10, 20, 30], it has four child pointers: one for keys < 10, one for keys between 10 and 20, one for keys between 20 and 30, and one for keys > 30. Each child pointer leads to an entire subtree of keys in that range.
Property 4 — the multi-key routing rule — deserves careful examination. In a BST, each node has one key that divides the search space into two halves: left (≤ key) and right (≥ key). In a B-tree, each node has k keys that divide the search space into k+1 ranges:
This is exactly like a multi-way decision tree. At each node, you scan the keys to find which range your search key falls into, then descend into the corresponding child. With k keys per node, you make one (k+1)-way decision instead of k binary decisions — but you do it in a single disk read.
Think of it like a phone book directory. Each page of the directory has tab markers: A-C, D-F, G-I, ... Each tab tells you which section to flip to. The tabs are the keys; the sections are the children. A BST phone book would have one letter per page — 26 binary decisions to find any name. A B-tree phone book has all 26 letters on the first page — one decision, then flip directly to the right section.
The rule that every non-root node must have at least t-1 keys is crucial. It guarantees that the tree stays balanced and short. If nodes could be arbitrarily empty, the tree could degenerate into a tall, skinny structure — just like an unbalanced BST. The minimum fill ensures every node is at least half full, which bounds the height.
A B-tree with n keys and minimum degree t has height at most:
For t = 1001 and n = 109: h ≤ log1001(500,000,000) ≈ 2.9. That is, a B-tree with minimum degree 1001 stores a billion keys in a tree of height 3. The root plus 2 more levels.
In theory, t can be any integer ≥ 2. In practice, t is chosen so that one node fills exactly one disk page. Here is the calculation:
With variable-length keys (like strings), the calculation is more complex. Databases typically store the maximum number of keys that fit in a page given the average key size, with overflow pages for very long keys.
The interactive visualization below lets you explore a B-tree with t = 3. Click the keys and child pointers to see the structure. Notice how every leaf is at the same depth, and every non-root node has at least 2 keys.
A B-tree with minimum degree t=3. Each node holds 2-5 keys. All leaves are at the same depth. Hover over nodes to see key counts.
| Property | t = 2 (2-3-4 tree) | t = 3 | t = 1001 (database) |
|---|---|---|---|
| Min keys/node | 1 | 2 | 1000 |
| Max keys/node | 3 | 5 | 2001 |
| Min children | 2 | 3 | 1001 |
| Max children | 4 | 6 | 2002 |
| Height for 106 keys | ~20 | ~9 | ~2 |
| Height for 109 keys | ~30 | ~13 | ~3 |
Every B-tree node stores three things: a list of keys (sorted), a list of child pointers, and a flag indicating whether it is a leaf. Here is the concrete implementation:
python class BTreeNode: def __init__(self, leaf=True): self.keys = [] # sorted list of keys self.children = [] # list of child node pointers self.leaf = leaf # True if this is a leaf node self.n = 0 # number of keys currently stored class BTree: def __init__(self, t): self.root = BTreeNode() self.t = t # minimum degree # Every node has at most 2t-1 keys and 2t children # Every non-root node has at least t-1 keys
In a real database, each "node" is a disk page. The children list does not contain Python pointers — it contains page numbers. To access a child, you call disk_read(page_number), which loads the page from disk into RAM. This is the expensive operation — the one that B-trees minimize.
Let us prove the height bound h ≤ logt((n+1)/2). A B-tree of height h has:
For t = 1001, n = 109: h ≤ log1001(500,000,000) = log(500,000,000) / log(1001) ≈ 8.7 / 3.0 ≈ 2.9. So the height is at most 3. With the root always in memory, you need just 2 disk reads to search a billion keys.
Searching a B-tree is a natural generalization of BST search. At each node, instead of comparing with one key and going left or right, you scan through multiple keys to find the right child pointer. The search descends level by level until it finds the key or reaches a leaf.
Here is the algorithm, step by step:
python def b_tree_search(node, key): """Search for key in B-tree rooted at node. Returns (node, index) if found, None otherwise.""" i = 0 # Find the first key >= target while i < node.n and key > node.keys[i]: i += 1 # Check if we found it if i < node.n and key == node.keys[i]: return (node, i) # Found! # If leaf, key is not in tree if node.leaf: return None # Otherwise, read child from disk and recurse disk_read(node.children[i]) # ← THIS is the expensive operation return b_tree_search(node.children[i], key)
At each level, we do two things: scan through the keys in the node (CPU work), and read one child from disk (I/O work). The costs are:
| Resource | Per level | Total (h levels) |
|---|---|---|
| Disk reads | 1 | O(h) = O(logt n) |
| CPU comparisons | O(t) linear scan or O(log t) binary search | O(t · logt n) or O(log2 n) |
The CPU cost is irrelevant — even O(t · logt n) with t = 1000 is millions of times cheaper than the disk I/O. The number of disk reads, O(logt n), is what matters. And the root node is almost always cached in RAM (it is accessed on every operation), so in practice you save one more disk read.
In practice, the "O(logt n) disk reads per search" is an overestimate. The root node is accessed on every search, so it stays permanently in the buffer pool (cache). Level-1 nodes are also accessed very frequently. With a modest buffer pool, the top 2-3 levels of the tree are always cached.
Let us calculate the memory needed to cache the top levels:
This is why database tuning guides always emphasize buffer pool sizing. With enough RAM to cache the internal levels of your hottest indexes, read performance becomes nearly optimal — close to 1 disk read per query regardless of dataset size.
Search only reads. Insertion and deletion also write. Each modified page must eventually be flushed to disk. But writes can be deferred: mark the page "dirty" in the buffer pool and let a background process flush it later. Multiple modifications to the same page (common for hot nodes) are batched into a single write. This further amortizes the I/O cost.
The write-ahead log (WAL) ensures durability even when dirty pages have not been flushed. As long as the log entry is on disk, the modification is safe — even if the system crashes before the actual page is written. We will discuss WAL in detail in Chapter 7.
Consider searching for key 25 in this B-tree (t=3):
Now search for key 17 (not in tree):
The simulation below lets you search for a key in a B-tree. Watch how the search descends level by level, scanning keys at each node before choosing a child.
Enter a key and click "Search" to watch the algorithm descend through the tree. Each highlighted node is one disk read.
Inserting into a B-tree is more complex than BST insertion. You cannot just create a new leaf anywhere — that would violate the "all leaves at the same depth" property. Instead, every new key goes into an existing leaf. But what if the leaf is already full (has 2t-1 keys)?
The answer: split the full node before inserting. Splitting takes a full node with 2t-1 keys, pulls out the median key, and creates two nodes of t-1 keys each. The median key moves up to the parent.
Here is the clever part. CLRS describes a single-pass, top-down insertion. As you walk down the tree looking for the correct leaf, you split every full node you encounter on the way down — before you need to. This guarantees that when you arrive at the leaf, it is not full, so you can insert directly.
Why proactive? If you split only when needed (bottom-up), a split might push a key up to a parent that is also full, causing a cascade of splits back up the tree. With proactive top-down splitting, you never need to go back up — one pass down, done.
python def b_tree_insert(tree, key): r = tree.root if r.n == 2 * tree.t - 1: # root is full s = BTreeNode(leaf=False) tree.root = s s.children.append(r) split_child(s, 0) # split old root insert_nonfull(s, key) else: insert_nonfull(r, key) def insert_nonfull(node, key): i = node.n - 1 if node.leaf: # Insert key into sorted position in leaf node.keys.append(None) while i >= 0 and key < node.keys[i]: node.keys[i + 1] = node.keys[i] i -= 1 node.keys[i + 1] = key node.n += 1 else: # Find which child to descend into while i >= 0 and key < node.keys[i]: i -= 1 i += 1 # If child is full, split it first if node.children[i].n == 2 * t - 1: split_child(node, i) if key > node.keys[i]: i += 1 insert_nonfull(node.children[i], key) def split_child(parent, i): """Split parent.children[i] which must be full (2t-1 keys).""" t = tree.t full_node = parent.children[i] new_node = BTreeNode(leaf=full_node.leaf) # new_node gets the last t-1 keys new_node.keys = full_node.keys[t:] # keys after median new_node.n = t - 1 if not full_node.leaf: new_node.children = full_node.children[t:] # children after median # Median key goes up to parent median = full_node.keys[t - 1] full_node.keys = full_node.keys[:t - 1] # keep first t-1 keys full_node.n = t - 1 if not full_node.leaf: full_node.children = full_node.children[:t] # Insert median and new child into parent parent.keys.insert(i, median) parent.children.insert(i + 1, new_node) parent.n += 1
Let us insert the keys 10, 20, 30, 40, 50, 60, 70, 80, 5, 15 into an initially empty B-tree with t = 3 (max 5 keys per node). Follow along with the simulation below.
Click "Insert Next" to add keys one at a time. Watch splits happen when nodes reach 5 keys. The sequence is: 10, 20, 30, 40, 50, 60, 70, 80, 5, 15.
The key moments in this sequence:
| After inserting | What happens |
|---|---|
| 10, 20, 30, 40, 50 | Root is full with [10,20,30,40,50]. No split yet. |
| 60 | Root is full. Split: median 30 becomes the new root. Left child [10,20], right child [40,50]. Then 60 goes into right child: [40,50,60]. |
| 70, 80 | Right child becomes [40,50,60,70,80] — full but no split until the next insert needs it. |
| 5 | Goes into left child: [5,10,20]. |
| 15 | Goes into left child: [5,10,15,20]. Still not full. |
How often do splits happen? Each split produces two nodes of t-1 keys from one node of 2t-1 keys. Before the next split can happen at the same node, t more keys must be inserted there. So on average, one split occurs for every t insertions. With t = 1000, you get one split per 1000 inserts. Splits are rare events.
When a split does happen, it requires:
Total: 4 disk I/Os for a split. Amortized over t insertions, that is 4/t extra I/Os per insert — negligible for large t.
Unlike BSTs, B-trees always produce balanced trees regardless of insertion order. Insert 1, 2, 3, ..., n in sorted order into a BST and you get a linked list (height n). Insert the same sequence into a B-tree and you get a perfectly balanced tree (height logt n). The proactive splitting mechanism redistributes keys as the tree grows, maintaining balance automatically.
However, insertion order does affect space utilization. Sequential insertions into a B-tree produce nodes that are about half full (each split creates two minimum-fill nodes). Random insertions produce nodes that are about 69% full. Some databases (like PostgreSQL) detect sequential insert patterns and use a special "fast path" that fills pages more aggressively.
If you have a large dataset and want to build a B-tree from scratch, inserting keys one at a time is wasteful. Each insert does a full root-to-leaf traversal. For n keys, that is O(n · logt n) total disk I/Os.
The better approach is bulk loading: sort the keys first, then build the tree bottom-up by filling leaf pages to capacity, then creating internal pages over them. This requires only O(n/B) I/Os (one pass through the sorted data, writing full pages sequentially).
This is why database import tools (like MySQL's LOAD DATA INFILE or PostgreSQL's COPY) are dramatically faster than individual INSERT statements. They sort the data and bulk-load the B-tree indexes.
Deletion is the hardest operation in B-trees. The problem: after removing a key, a node might have fewer than t-1 keys, violating the B-tree property. We must fix this, and the fix depends on where the key is and what the neighboring nodes look like.
There are three cases, and each case has sub-cases. Let us work through all of them.
This is the simplest case. If the leaf has more than t-1 keys, just remove the key. Done. The node still satisfies the minimum-fill constraint.
We cannot just remove a key from an internal node — it would leave the node with one fewer key than child pointers, breaking the structure. Instead, we replace the key with its predecessor or successor and delete that instead.
We are still searching for the key and about to descend into child[i]. But child[i] has only t-1 keys — the minimum. If we delete from it, it might underflow. So before descending, we ensure the child has at least t keys.
Just like insertion uses proactive splitting, deletion uses proactive fixing on the way down. Before descending into any child, ensure it has at least t keys (not just t-1). This guarantees we never need to backtrack.
python def b_tree_delete(node, key): i = 0 while i < node.n and key > node.keys[i]: i += 1 if i < node.n and key == node.keys[i]: if node.leaf: # Case 1: key in leaf — just remove node.keys.pop(i) node.n -= 1 else: # Case 2: key in internal node delete_internal(node, i) else: if node.leaf: return # key not in tree # Case 3: key not found yet, must descend # Ensure child[i] has ≥ t keys before descending if node.children[i].n == t - 1: fix_child(node, i) # Recurse (index may have changed after merge) b_tree_delete(node.children[i], key) def delete_internal(node, i): key = node.keys[i] if node.children[i].n >= t: # Case 2a pred = get_predecessor(node, i) node.keys[i] = pred b_tree_delete(node.children[i], pred) elif node.children[i+1].n >= t: # Case 2b succ = get_successor(node, i) node.keys[i] = succ b_tree_delete(node.children[i+1], succ) else: # Case 2c merge_children(node, i) b_tree_delete(node.children[i], key)
| Case | Condition | Action | I/O Cost |
|---|---|---|---|
| 1 | Key in leaf, leaf has > t-1 keys | Remove key directly | 1 read + 1 write |
| 2a | Key in internal, left child has ≥ t keys | Replace with predecessor, recurse | O(h) reads + writes |
| 2b | Key in internal, right child has ≥ t keys | Replace with successor, recurse | O(h) reads + writes |
| 2c | Key in internal, both children have t-1 keys | Merge children around key, recurse | O(h) reads + writes |
| 3a | Key not found yet, child has t-1, sibling has ≥ t | Rotate: steal key through parent | 3 reads + 3 writes |
| 3b | Key not found yet, child has t-1, all siblings at t-1 | Merge child with sibling | 3 reads + 2 writes |
The recurring theme: never operate on a node that is at the minimum. Before you do anything, make sure the target node has room. For insertion, "room" means not full (split if full). For deletion, "room" means above minimum (steal or merge if at minimum).
Let us walk through a concrete deletion sequence on a B-tree with t=3. Start with this tree containing keys [3, 5, 7, 10, 12, 15, 18, 20, 22, 25, 28, 30, 33, 35, 38, 40]:
Delete 7 (Case 1 — key in leaf with enough keys): The leaf [3,5,7] has 3 keys (> t-1 = 2). Just remove 7. Result: [3,5].
Delete 15 (Case 2a — key in internal, left child has enough keys): 15 is in the internal node [10,15]. The left child's subtree has predecessor 12. Replace 15 with 12, then delete 12 from its leaf. But wait — the leaf [12] has only 1 key (= t-1). We need Case 3. Before descending, check siblings. The left sibling [3,5] has 2 keys (= t-1) and the right sibling [18] has 1 key (= t-1). Both at minimum, so merge: combine [12] + 15 + [18] = [12,15,18]. Now delete 15 from this merged leaf. Result: internal becomes [10], children are [3,5] and [12,18].
Delete 20 (Case 2c — key in internal, both children have t-1 keys): 20 is the root key. Left child [10] and right child [30,35] — left has 1 key (t-1), but right has 2 keys. So Case 2b applies: take the successor (22), replace 20 with 22, delete 22 from the right subtree.
Each deletion case builds on the same principle: ensure you have enough keys to work with before you operate. This is the B-tree's version of "measure twice, cut once."
Select a case and click "Step" to see the deletion animation. Watch keys rotate, merge, or simply disappear.
This is the payoff. The simulation below is a fully interactive B-tree. You choose the minimum degree t. You insert and delete any key. You watch every split and merge in real time. The tree shows node utilization, current height, and a running count of disk reads.
Insert/delete keys. Adjust t to change the branching factor. Watch splits and merges live.
Things to try:
After many random insertions, most nodes are between half-full and completely full. This is not an accident — the split mechanism guarantees it. When a node with 2t-1 keys splits, it produces two nodes with t-1 keys each. Those nodes are exactly at the minimum. Subsequent insertions fill them back up. The average node utilization in a random B-tree is about ln(2) ≈ 69% — a result from Knuth's analysis.
This means a B-tree wastes about 31% of its allocated space on average. In practice, this is a good tradeoff: the wasted space buys us room for future insertions without immediate splits, and 69% utilization is much better than the 50% worst case.
Watch the "Disk reads (last op)" counter as you insert and delete. For a tree of height h:
| Operation | Disk Reads | Disk Writes | Explanation |
|---|---|---|---|
| Search | O(h) | 0 | Read one node per level |
| Insert (no split) | O(h) | O(1) | Read path + write leaf |
| Insert (with splits) | O(h) | O(h) | Read path + write split nodes |
| Delete (simple) | O(h) | O(1) | Read path + write modified node |
| Delete (with merges) | O(h) | O(h) | Read path + write merged nodes |
The key insight: even in the worst case, every operation does at most O(h) = O(logt n) disk I/Os. With t = 1000 and n = 109, that is 3 reads and 3 writes. The actual CPU work per node (scanning keys, shifting elements) is trivial in comparison.
B-trees as described by CLRS store data (or pointers to data) at every node — both internal and leaf. This is fine in theory, but real databases almost universally use a variant called the B+ tree. The difference is simple but profound.
Three reasons, all practical:
1. Internal nodes are smaller. Since internal nodes do not carry data payloads, they carry only keys and pointers. More keys fit in a single disk page, so the branching factor is higher, so the tree is shorter. If each record is 100 bytes but each key is 8 bytes, a B-tree internal node wastes 92% of its space on data you do not need during search.
2. Range queries are fast. Want all records with keys between 50 and 200? In a B+ tree, find 50 (follow the tree down to the leaf), then walk the leaf linked list until you pass 200. You never need to go back up to the internal nodes. In a standard B-tree, range queries require an inorder traversal that bounces between internal and leaf nodes.
3. Predictable I/O. Every search touches exactly the same number of internal nodes plus one leaf. There is no "lucky" case where the key happens to be in an internal node at level 2 instead of a leaf. This makes query planning and caching more predictable.
Notice that in the B+ tree, the key 30 appears in both an internal node and a leaf. Internal copies are just signposts — the leaf copy is where the actual data record lives.
Let us do the math to see why B+ trees have a higher branching factor. Suppose each key is 8 bytes, each child pointer is 8 bytes, and each data record is 200 bytes. Page size is 4 KB (4096 bytes).
For a dataset of 100 million records:
Suppose we want all keys in the range [15, 45] from a B+ tree. The algorithm:
Enter a range [lo, hi] and compare how many nodes each structure visits. The B+ tree follows the leaf chain; the B-tree must traverse the tree.
| Feature | B-Tree | B+ Tree |
|---|---|---|
| Data location | Every node | Leaves only |
| Leaf linking | No | Doubly-linked list |
| Duplicate keys in internal nodes | No | Yes (as signposts) |
| Point query cost | O(logt n) — might stop early | O(logt n) — always to leaf |
| Range query cost | O(logt n + k) via inorder | O(logt n + k/B) via leaf scan |
| Branching factor | Lower (data in nodes) | Higher (keys only in internal) |
| Used in practice | SQLite (some tables) | MySQL InnoDB, PostgreSQL, most RDBMS |
B+ tree insertion works like B-tree insertion with one twist: when a leaf splits, the median key is copied up to the parent (not moved). In a B-tree, the median leaves its original node. In a B+ tree, the median stays in the right leaf (because all keys must be in leaves) and a copy goes to the parent as a routing signpost.
Deletion in a B+ tree is also slightly different. If you delete a key that appears as a signpost in an internal node, the internal copy might become stale. Some implementations leave stale signposts (they still correctly route searches). Others update the signpost to the new minimum of the right subtree. Both approaches are valid — stale signposts just waste a tiny amount of space.
Here is what a B+ tree page looks like in a real database (InnoDB-style, 16 KB page):
With a 16 KB page, 8-byte keys, and 6-byte child pointers, an internal node holds about 16,000 / 14 ≈ 1,143 keys — so the branching factor is about 1,144. Two levels of internal nodes can index 1,1442 ≈ 1.3 million leaf pages. If each leaf holds 100 rows, that is 130 million rows reachable in 3 disk reads.
Every time you run a SQL query with a WHERE clause on an indexed column, a B-tree (or B+ tree) is doing the work. Let us look at how the three most popular open-source databases use them.
InnoDB stores the entire table as a B+ tree, ordered by the primary key. This is called a clustered index — the leaf nodes of the B+ tree contain the actual row data, not just pointers to rows stored elsewhere.
SELECT * FROM users WHERE id = 42, InnoDB searches the B+ tree for key 42 and finds the entire row at the leaf. There is no separate "heap" of row data — the index and the data are the same structure.Secondary indexes in InnoDB are also B+ trees, but their leaves store the primary key (not the row data). A secondary index lookup requires two B+ tree searches: one to find the primary key in the secondary index, then another to find the row in the clustered index. This is called a double lookup or "bookmark lookup."
PostgreSQL takes a different approach. The table data lives in a heap — an unordered pile of row pages. Indexes are separate B-tree structures whose leaves contain tuple IDs (TIDs), which are (page number, offset) pointers into the heap.
Every index lookup in PostgreSQL requires a B-tree search followed by a heap page fetch. This seems slower than InnoDB's clustered index, but it has an advantage: secondary indexes point directly to the heap (one B-tree search), whereas InnoDB's secondary indexes require a double lookup.
SQLite uses plain B-trees (not B+) for tables (which it calls "table B-trees") and B+ trees for indexes. Each database is a single file divided into fixed-size pages (default 4096 bytes). Each page is one B-tree node. The page size is the node size — this is literally what we meant by "size nodes to fit one disk page."
Databases do not read from disk on every B-tree node access. They maintain a buffer pool — a cache of recently-accessed disk pages in RAM. When a B-tree search needs a node, it first checks the buffer pool. If the page is there (a "cache hit"), no disk I/O happens. Only on a "cache miss" does the database actually read from disk.
The root node of every frequently-used index is virtually always in the buffer pool (it is accessed on every query). Internal nodes near the root are also likely cached. In practice, only the leaf node access triggers a disk read. For a B+ tree with height 4, that means 3 cache hits and 1 disk read — blazing fast.
Search for keys in a B+ tree with a buffer pool. Watch which nodes cause cache hits (green) vs cache misses (red/disk read). Recently accessed nodes stay cached.
| Database | Table Storage | Primary Index | Secondary Index | Page Size |
|---|---|---|---|---|
| MySQL InnoDB | Clustered B+ tree | B+ tree (data in leaves) | B+ tree (stores PK) | 16 KB |
| PostgreSQL | Heap (unordered) | B-tree (stores TID) | B-tree (stores TID) | 8 KB |
| SQLite | B-tree | Table B-tree | B+ tree (stores rowid) | 4 KB |
In a multi-threaded database, many transactions access the B-tree simultaneously. Without coordination, two threads might try to split the same node, corrupting the tree. Databases use latches (lightweight locks) to protect B-tree pages.
The naive approach — lock the entire tree for every operation — destroys concurrency. The standard technique is latch crabbing (also called "latch coupling"):
A node is "safe" if the current operation cannot cause a structural change (split or merge) at that node. For search, every node is safe (no modifications). For insert, a node is safe if it has fewer than 2t-1 keys (room for one more without splitting). For delete, a node is safe if it has more than t-1 keys (room to lose one without merging).
Latch crabbing ensures that at most two levels are locked at any time. This allows operations on different parts of the tree to proceed in parallel — a reader in the left subtree does not block a writer in the right subtree.
What happens if the database crashes while splitting a B-tree node? A split modifies three pages: the full node, the new sibling, and the parent. If the system crashes after writing the sibling but before updating the parent, the tree is corrupted — the new sibling is an orphan.
Databases solve this with Write-Ahead Logging (WAL). Before modifying any B-tree page, the database writes a log entry describing the change. If a crash occurs, the recovery process replays the log to complete or undo partial operations. The log is append-only and sequential, so writing it is fast.
An alternative approach is copy-on-write (COW), used by LMDB and btrfs. Instead of modifying pages in place, write new pages and atomically update the root pointer. This eliminates the need for WAL but increases write amplification (every modification creates new pages all the way up to the root).
COW trades write amplification for simplicity and free snapshots. LMDB is single-writer (no write concurrency) but achieves phenomenal read performance because readers never need locks — they just follow the root pointer they started with, which always points to a consistent snapshot.
Over time, B-tree pages become fragmented. Deletions leave gaps. Splits create half-full pages. Pages that were once adjacent on disk may become scattered. This causes two problems:
Internal fragmentation: Pages are not fully utilized. A page with 50% utilization wastes half its space. Databases periodically run OPTIMIZE TABLE (MySQL) or REINDEX (PostgreSQL) to rebuild the B-tree from scratch, compacting all pages.
External fragmentation: Logically adjacent leaf pages (e.g., keys 1-100 and keys 101-200) may be physically scattered across the disk. Sequential range scans become random I/O. SSDs mitigate this (random reads are nearly as fast as sequential), but HDDs suffer badly.
Not every column needs an index, and not every index should be a B-tree. Here is a decision guide:
| Workload | Best Index Type | Why |
|---|---|---|
| Primary key lookups | B+ tree (clustered) | Data in leaves, no second lookup |
| Range queries (BETWEEN, >, <) | B+ tree | Leaf chain enables sequential scan |
| Exact match only (=) | Hash index | O(1) average, no wasted space on ordering |
| Full-text search | Inverted index (GIN) | B-trees cannot search for words within text |
| Geospatial (lat/lng) | R-tree (GiST) | B-trees only handle 1D ordering |
| High-cardinality flag columns | Bitmap index | Compact representation for low-cardinality data |
| Write-heavy, rarely read | Consider no index | Every index adds write overhead |
Each secondary index adds overhead to every INSERT, UPDATE, and DELETE on the table (the index must be updated too). A table with 5 indexes effectively writes 6 times for every row modification: once to the table plus once to each index. This is why DBAs are cautious about over-indexing.
| Operation | Disk Reads | CPU Time | Notes |
|---|---|---|---|
| Search | O(logt n) | O(t · logt n) | Binary search within nodes: O(log n) |
| Insert | O(logt n) | O(t · logt n) | At most O(logt n) splits |
| Delete | O(logt n) | O(t · logt n) | At most O(logt n) merges |
| Range [a, b] | O(logt n + k/B) | O(logt n + k) | B+ tree; k results, B keys/leaf |
| Min / Max | O(logt n) | O(logt n) | Leftmost / rightmost leaf |
| Space | O(n) — each key stored exactly once (B-tree) or duplicated in internals (B+) | ||
keys[], children[], n (key count), and leaf (boolean), write the search function. Key insight: find the right child index using a linear scan or binary search on keys, then recurse if not found and not a leaf.| Structure | Search | Insert | Delete | Range | Disk-Optimal? |
|---|---|---|---|---|---|
| Sorted Array | O(log n) | O(n) | O(n) | O(log n + k) | No (O(n) insert) |
| BST (balanced) | O(log n) | O(log n) | O(log n) | O(log n + k) | No (height = log2 n) |
| Hash Table | O(1) | O(1) | O(1) | O(n) | No (no ordering) |
| B-Tree | O(logt n) | O(logt n) | O(logt n) | O(logt n + k) | Yes |
| LSM-Tree | O(log n · L) | O(1) amortized | O(1) amortized | O(log n + k) | Yes (write-optimized) |
This is a common system design interview question. Here is a staff-engineer-grade walkthrough using B-trees.
Step 1: Choose the data structure. We need point lookups AND range scans, so hash tables are out. B+ tree is the standard choice. LSM-tree is an alternative if writes dominate reads.
Step 2: Page size. Use 16 KB pages (InnoDB default). With 32-byte keys and 8-byte child pointers, an internal page holds ~400 keys. Branching factor ≈ 400.
Step 3: Estimate tree dimensions.
Step 4: Write path. PUT = search (4 reads) + write leaf page (1 write) + WAL entry (1 sequential write). The WAL write is sequential (fast). The page write may be deferred (dirty page in buffer pool, flushed later). Amortized write latency: ~0.5ms on SSD. Under 5ms target.
Step 5: Buffer pool sizing. Cache the top 2 levels (41 pages = 656 KB) plus hot leaf pages. With 1 GB of RAM for the buffer pool, you can cache ~65,000 pages = 1% of all pages. After warm-up, the hot set gives >90% hit rate.
Step 6: Crash recovery. WAL with checkpointing. On crash, replay WAL from last checkpoint. Recovery time proportional to WAL length since last checkpoint (seconds to minutes).
Here is a complete, working Python B-tree that you can copy and run. Every function maps directly to the pseudocode in CLRS. Study the split_child and delete functions — these are the two that interviewers care about most.
python class BTreeNode: def __init__(self, leaf=True): self.keys = [] self.children = [] self.leaf = leaf class BTree: def __init__(self, t): self.root = BTreeNode() self.t = t # minimum degree (must be ≥ 2) def search(self, key, node=None): """Return (node, index) or None.""" if node is None: node = self.root i = 0 while i < len(node.keys) and key > node.keys[i]: i += 1 if i < len(node.keys) and key == node.keys[i]: return (node, i) if node.leaf: return None return self.search(key, node.children[i]) def insert(self, key): root = self.root if len(root.keys) == 2 * self.t - 1: # Root is full — grow the tree new_root = BTreeNode(leaf=False) new_root.children.append(self.root) self.root = new_root self._split_child(new_root, 0) self._insert_nonfull(new_root, key) else: self._insert_nonfull(root, key) def _split_child(self, parent, i): """Split parent.children[i], which must be full.""" t = self.t full = parent.children[i] new = BTreeNode(leaf=full.leaf) # Median key median = full.keys[t - 1] # Right half goes to new node new.keys = full.keys[t:] full.keys = full.keys[:t - 1] if not full.leaf: new.children = full.children[t:] full.children = full.children[:t] # Insert median into parent parent.keys.insert(i, median) parent.children.insert(i + 1, new) def _insert_nonfull(self, node, key): i = len(node.keys) - 1 if node.leaf: node.keys.append(None) while i >= 0 and key < node.keys[i]: node.keys[i + 1] = node.keys[i] i -= 1 node.keys[i + 1] = key else: while i >= 0 and key < node.keys[i]: i -= 1 i += 1 if len(node.children[i].keys) == 2 * self.t - 1: self._split_child(node, i) if key > node.keys[i]: i += 1 self._insert_nonfull(node.children[i], key) def inorder(self, node=None): """Return all keys in sorted order.""" if node is None: node = self.root result = [] for i in range(len(node.keys)): if not node.leaf: result.extend(self.inorder(node.children[i])) result.append(node.keys[i]) if not node.leaf: result.extend(self.inorder(node.children[-1])) return result # Usage: tree = BTree(t=3) for key in [10, 20, 30, 40, 50, 60, 70, 80, 5, 15]: tree.insert(key) print(tree.inorder()) # Output: [5, 10, 15, 20, 30, 40, 50, 60, 70, 80] print(tree.search(30)) # (<node>, 0) — found! print(tree.search(99)) # None — not found
Notice the key design decision in _insert_nonfull: we check if the child is full before descending. If it is, we split it immediately and then decide which of the two resulting children to descend into. This is the proactive splitting that guarantees a single downward pass.
delete(key) method to this class. Follow the three cases from Chapter 4. Test it by inserting 20 random keys, deleting them in random order, and verifying that inorder() is always sorted and the tree properties hold. This is a 60-90 minute coding exercise — well within the scope of a systems coding interview.When implementing or debugging a B-tree, you need a function that verifies all five properties. Here is a verification checklist you can turn into code:
python def verify_btree(tree): """Verify all B-tree properties. Raises AssertionError if violated.""" t = tree.t def check(node, min_key, max_key, depth, leaf_depth): # Property 3: keys are sorted for i in range(len(node.keys) - 1): assert node.keys[i] < node.keys[i+1], f"Keys not sorted: {node.keys}" # Property 1: key count bounds if node != tree.root: assert len(node.keys) >= t - 1, f"Underflow: {len(node.keys)} < {t-1}" assert len(node.keys) <= 2 * t - 1, f"Overflow: {len(node.keys)} > {2*t-1}" # Property 4: keys within bounds from ancestors for k in node.keys: assert min_key < k < max_key, f"Key {k} outside [{min_key},{max_key}]" if node.leaf: # Property 5: all leaves at same depth if leaf_depth[0] is None: leaf_depth[0] = depth assert depth == leaf_depth[0], f"Leaf at depth {depth}, expected {leaf_depth[0]}" else: # Property 2: k keys = k+1 children assert len(node.children) == len(node.keys) + 1 # Recurse into children with tightened bounds bounds = [min_key] + node.keys + [max_key] for i, child in enumerate(node.children): check(child, bounds[i], bounds[i+1], depth + 1, leaf_depth) check(tree.root, float('-inf'), float('inf'), 0, [None]) print("All B-tree properties verified! ✓")
Run this after every insert and delete during development. It catches bugs immediately. The function checks all five properties in O(n) time by visiting every node once.
| Dimension | What Interviewers Test | Key Knowledge |
|---|---|---|
| CONCEPT | Can you explain WHY B-trees exist? | Disk I/O bottleneck, branching factor trades CPU for fewer reads |
| DESIGN | Can you design a system using B-trees? | Page sizing, height estimation, buffer pool, WAL, clustered vs secondary |
| CODE | Can you implement B-tree search and insert? | Split logic, proactive splitting, median promotion |
| DEBUG | Can you diagnose B-tree performance issues? | Fragmentation, lock contention, buffer pool misses, sequential inserts causing right-heavy splits |
| FRONTIER | Do you know what comes after B-trees? | LSM-trees, Bw-trees (lock-free), learned indexes, adaptive radix trees |
B-trees sit at the intersection of algorithms and systems. Understanding them connects you to a web of related ideas.
| Concept | Relationship | Link |
|---|---|---|
| BSTs (CLRS Ch 12) | B-trees generalize BSTs: instead of 1 key and 2 children, each node has up to 2t-1 keys and 2t children. The BST property generalizes to the multi-key routing rule. | BST lesson |
| Red-Black Trees (CLRS Ch 13) | A red-black tree is isomorphic to a B-tree with t=2 (2-3-4 tree). Red nodes are "absorbed" into their black parent to form a B-tree node. This is why red-black trees are balanced — they are B-trees in disguise. | Red-Black Tree lesson |
| Disk I/O Model | B-trees are designed for the external memory model where computation is free and I/O (disk reads/writes) is expensive. This model also motivates external sorting (merge sort) and cache-oblivious algorithms. |
| Concept | Relationship |
|---|---|
| B+ Trees | The practical variant used in real databases. Data in leaves only, leaves linked for range scans. We covered this in Chapter 6. |
| LSM-Trees | The write-optimized alternative. Used in Cassandra, RocksDB, LevelDB. Trades read speed for write speed. B-trees do in-place updates; LSM-trees do sequential appends + compaction. |
| Buffer Trees | Attach write buffers to internal B-tree nodes. Writes are batched and lazily pushed down. Combines B-tree read performance with better write throughput. |
| Fractal Trees | Used in TokuDB/PerconaFT. Like buffer trees but with better theoretical bounds. O(logB n / Bε) amortized inserts. |
| COW B-Trees | Copy-on-write B-trees never modify pages in place — they write new pages and update pointers. Used in btrfs, LMDB. Provides free snapshots and crash consistency. |
B-trees have been the dominant index structure for 50 years, but researchers continue to push the boundaries:
Learned Indexes (Kraska et al., 2018): Instead of a tree of comparisons, use a machine learning model to predict where a key is. A neural network trained on the data distribution can predict the position of any key in a sorted array with high accuracy. The prediction replaces the tree traversal. On read-only workloads with static data, learned indexes can be 1.5-3x faster than B-trees while using 10-100x less memory. The challenge: handling inserts, deletes, and distribution shifts. Active research area.
Bw-Trees (Levandoski et al., 2013): Lock-free B-trees for multi-core systems. Instead of latching individual pages, Bw-trees use compare-and-swap (CAS) operations and delta chains. Each modification creates a small delta record that is prepended to a page's chain. Periodically, deltas are consolidated into a new base page. Used in SQL Server Hekaton and Azure Cosmos DB.
ALEX (Ding et al., 2020): An "adaptive learned index" that handles inserts by combining learned models with gapped arrays. It self-tunes its structure based on the workload, gradually reshaping itself as the data distribution evolves. Achieves near-learned-index read performance with practical insert support.
Write-Optimized Indexes: Bε-trees (Brodal & Fagerberg, 2003) add write buffers to each node, batching writes and flushing them down periodically. They achieve O(logB n / Bε) amortized insert cost — exponentially better than B-trees for write-heavy workloads. TokuDB (now PerconaFT) implements this as "Fractal Tree Indexes."
The most important alternative to B-trees is the Log-Structured Merge-Tree (LSM-tree), invented by Patrick O'Neil in 1996. It powers Cassandra, RocksDB, LevelDB, HBase, and many modern "NoSQL" databases. The tradeoff is fundamental:
| Property | B-Tree | LSM-Tree |
|---|---|---|
| Write pattern | Random I/O (update in place) | Sequential I/O (append-only) |
| Read pattern | 1 tree traversal | Check memtable + multiple sorted runs |
| Write throughput | Good | Excellent (10-100x better on HDD) |
| Read latency | Predictable, low | Variable (may check many levels) |
| Space amplification | ~1.5x (half-full pages) | ~1.1x (compacted) to 2x+ (during compaction) |
| Write amplification | Moderate (full page rewrite) | High (compaction rewrites data multiple times) |
| Concurrency | Complex (latch crabbing) | Simple (immutable sorted runs) |
| Range queries | Fast (B+ tree leaf chain) | Requires merge across all levels |
When to choose B-trees: Read-heavy workloads, range queries, predictable latency requirements, traditional OLTP (banking, e-commerce). This is most workloads.
When to choose LSM-trees: Write-heavy workloads (logging, time-series, event streams), write-once-read-rarely patterns, workloads where sequential I/O matters (HDDs, distributed storage). Also when you need simple crash recovery (immutable files are easier to recover than in-place-modified pages).
Modern systems increasingly blur this line. RocksDB is an LSM-tree that adds Bloom filters and prefix seeking to improve read performance. WiredTiger (MongoDB's engine) supports both B-tree and LSM modes. Research systems like Dostoevsky and Monkey allow tuning the LSM structure along a continuum between pure LSM and pure B-tree behavior.
| Structure | Branching | Optimized For | Used In |
|---|---|---|---|
| BST / AVL / Red-Black | 2 | RAM, O(log₂ n) comparisons | std::map, TreeMap, in-memory DBs |
| B-Tree (t=2, aka 2-3-4) | 2-4 | Equivalent to Red-Black tree | Theoretical equivalence |
| B-Tree (large t) | 100-2000 | Disk I/O, minimize reads | SQLite tables, ext4 directory indexes |
| B+ Tree | 100-2000 | Range queries + disk I/O | MySQL, PostgreSQL, Oracle, SQL Server |
| B* Tree | 100-2000 | Higher node utilization (2/3 full) | Some file systems (HFS+) |
| LSM-Tree | N/A (log-structured) | Write-heavy workloads | Cassandra, RocksDB, LevelDB |
| Bw-Tree | 100-2000 | Lock-free, multi-core | SQL Server Hekaton, Azure Cosmos |
| Fractal Tree | 100-2000 | Reduced write amplification | TokuDB (PerconaFT) |
In this lesson you learned:
"The art of programming is the art of organizing complexity, of mastering multitude and avoiding its bastard chaos as effectively as possible." — Edsger Dijkstra