B-trees, LSM-trees, column stores — how databases actually store and find your data.
Imagine you need to build a database. Not PostgreSQL. Not anything fancy. Just a key-value store: you give it a key (like "user:42") and a value (like "Alice"), and later you ask for that key back and it returns "Alice". How hard can it be?
Here is the world's simplest database. Two bash functions:
bash db_set() { echo "$1,$2" >> database # append key,value to a file } db_get() { grep "^$1," database | sed "s/^$1,//" | tail -n 1 # find last match }
db_set appends a line to a file. That is an O(1) operation — the operating system just adds bytes to the end. It does not touch any existing data. This is spectacularly fast. Modern SSDs can sustain hundreds of megabytes per second of sequential writes.
db_get scans the entire file looking for the key, then takes the last matching line (because later writes override earlier ones). This is an O(n) operation — it must read every line in the file. With 10 records, no one notices. With a million records, it takes seconds. With a billion records, it takes hours.
The simulation below shows our append-only database in action. Each "write" appends to the log. Each "read" must scan from the beginning. Watch how read time grows linearly while write time stays flat. Click "Add 50 Records" repeatedly and watch the read bar grow.
Click to add records. Watch read latency grow linearly while write stays constant.
After a few clicks, the read bar is ten or twenty times the write bar. With a million records, you would be staring at a spinning cursor. This is unacceptable for any real application.
Let us implement our toy database in Python so we can measure the actual performance difference between writes and reads:
python import time, os class WorldsWorstDB: """Append-only log with O(1) write and O(n) read.""" def __init__(self, path="database.txt"): self.path = path self.f = open(path, 'a+') def set(self, key, value): """O(1) write — append to file.""" self.f.write(f"{key},{value}\n") self.f.flush() # ensure durability def get(self, key): """O(n) read — scan entire file.""" self.f.seek(0) result = None for line in self.f: k, v = line.strip().split(',', 1) if k == key: result = v # keep going — last match wins return result # Benchmark: write 100K records, then read one db = WorldsWorstDB() t0 = time.time() for i in range(100000): db.set(f"key{i}", f"value{i}") write_time = time.time() - t0 # ~0.5 seconds for 100K writes (fast!) t0 = time.time() val = db.get("key99999") # find the last key read_time = time.time() - t0 # ~0.2 seconds for ONE read (terrible!) print(f"100K writes: {write_time:.3f}s ({100000/write_time:.0f} writes/sec)") print(f"1 read: {read_time:.3f}s") # Output: 100K writes: 0.45s (222,222 writes/sec) # 1 read: 0.18s ← scanning 100K records for ONE lookup
The solution? Keep the fast append-only write, but build an index — a separate data structure that tells you where in the file a key lives, so you can jump straight to it instead of scanning.
Before we dive into specific index structures, let us understand the layers of a real database engine. Every database — from SQLite to PostgreSQL to Cassandra — has roughly the same architecture:
This chapter focuses entirely on the Storage Engine layer — the most architecturally interesting part. Everything above it (query optimization, SQL parsing) and below it (filesystem, disk hardware) is important but orthogonal.
To understand why storage engines exist, you need to understand the memory hierarchy. Modern computers have multiple layers of storage, each with wildly different speed and cost:
| Layer | Access Time | Size | Cost per GB | Relative Speed |
|---|---|---|---|---|
| CPU L1 Cache | ~1 ns | 64 KB | — | 1x |
| CPU L2 Cache | ~4 ns | 256 KB | — | 4x slower |
| CPU L3 Cache | ~12 ns | 8-64 MB | — | 12x slower |
| DRAM (RAM) | ~100 ns | 16-512 GB | ~$5/GB | 100x slower |
| NVMe SSD | ~100 μs | 1-16 TB | ~$0.10/GB | 100,000x slower |
| HDD | ~10 ms | 4-20 TB | ~$0.02/GB | 10,000,000x slower |
| Network (S3) | ~50 ms | Unlimited | ~$0.02/GB | 50,000,000x slower |
RAM is 100,000x faster than SSD. SSD is 100x faster than HDD. Every storage engine is fighting this brutal reality: the fast storage is small and expensive, the large storage is slow and cheap. The entire field of database engineering is about bridging this gap — keeping hot data in fast storage and cold data in slow storage, while making it all look seamless to the application.
The rest of this lesson explores three families of indexes, each making different trade-offs:
| Index Type | Read | Write | Used By |
|---|---|---|---|
| Hash Index | O(1) | O(1) | Bitcask (Riak) |
| LSM-Tree / SSTable | O(log n) | O(1) amortized | LevelDB, RocksDB, Cassandra |
| B-Tree | O(log n) | O(log n) | PostgreSQL, MySQL, virtually all RDBMS |
Let us put real numbers to this. Suppose each record is 100 bytes (a key, a value, a delimiter). We have an SSD that reads at 500 MB/s sequential throughput:
This is why databases exist. Not for durability (files give you that). Not for fancy SQL syntax. Databases exist because raw files have O(n) reads, and O(n) is unacceptable when n is large.
An index is any additional data structure that you derive from the primary data. It does not change the data itself — it just provides a faster path to find what you are looking for, like an index at the back of a textbook.
But there is a fundamental trade-off: every index must be updated whenever you write new data. If you have 5 indexes on a table, every insert must update all 5. This is why databases do not index everything by default — each index speeds up reads but slows down writes. The database administrator's job is to choose the right indexes for the application's access pattern.
The simplest possible index is the one you already know from any programming language: a hash map (dictionary, hash table). The idea: keep the append-only log file on disk, but maintain an in-memory hash map that maps every key to the byte offset where that key's most recent value starts in the file.
When you write a key, you append to the file (just like before), then update the hash map to point to the new byte offset. When you read a key, you look it up in the hash map (O(1)), seek to that byte offset on disk (O(1)), and read the value. Both operations are now fast.
This is not a toy idea. This is exactly how Bitcask, the default storage engine of Riak, works. Bitcask offers very fast reads and writes as long as all keys fit in memory.
Let us trace through a concrete example. We will insert five key-value pairs and watch both the log file and the hash map evolve:
Notice that the old value for "cat" (Whiskers at offset 0) is still in the file. It just wastes space. This is where compaction comes in.
The log file grows forever because we never delete old entries. To reclaim space, we perform compaction: read the entire segment, discard all but the latest value for each key, and write a new, smaller segment file.
In practice, you do this in the background while the current segment continues accepting writes. Once the new compacted segment is ready, you atomically switch reads to it and delete the old one. This is the same pattern as a garbage collector — never block the live path.
You can also break the log into fixed-size segments (e.g., when a file reaches 1 GB, close it and start a new one). Then compaction can merge multiple segments into one, discarding duplicate keys.
The simulation below lets you insert keys into a Bitcask-style hash index. Watch the log file grow on the left and the hash map update on the right. Try inserting the same key twice to see the pointer update. Click "Compact" to merge duplicate entries.
Type a key and value, then click Set. Insert duplicates to see the hash map update. Click Compact to remove stale entries.
Several engineering details make hash indexes production-ready:
Crash recovery. If the process crashes, the in-memory hash map is lost. You must rebuild it by scanning the entire data file on startup, which is slow for large files. Bitcask speeds this up by storing a "hint file" alongside each segment — a stripped-down version of the segment containing only keys and offsets (not values). The hint file is much smaller and can rebuild the hash map quickly.
Deletion. You cannot remove a record from an append-only file. Instead, append a special tombstone record (a marker that says "this key is deleted"). During compaction, the tombstone causes the compaction process to discard all prior values for that key. The key is removed from the hash map immediately.
Partially written records. The database might crash mid-write, leaving a corrupted partial record at the end of the file. Bitcask includes a CRC checksum with each record so that damaged records can be detected and discarded on recovery.
Concurrency. Since the file is append-only, writes are strictly sequential — only one writer thread. But multiple reader threads can read concurrently without locks, because the data file segments are immutable once written (only the active segment is being appended to, and the hash map protects reads from seeing partial writes via the pointer update being atomic).
python # Tombstone pattern for deletions TOMBSTONE = "__DELETED__" def delete(db, key): """Mark a key as deleted. Compaction removes it later.""" db.set(key, TOMBSTONE) # append tombstone to log del db.index[key] # remove from in-memory hash map def compact(old_segment, new_segment): """Remove tombstoned and duplicate keys.""" seen = {} for key, value in read_segment(old_segment): if value == TOMBSTONE: seen.pop(key, None) # remove if present else: seen[key] = value # keep latest value for key, value in seen.items(): write_to_segment(new_segment, key, value)
Hash indexes are fast and simple, but they have two critical limitations:
1. All keys must fit in RAM. The hash map is entirely in memory. If you have a billion unique keys, each requiring ~64 bytes for the key plus 8 bytes for the offset, that is 72 GB of RAM just for the index. This is workable for some use cases (Bitcask was designed for relatively few keys with frequently updated values, like URL visit counters) but not for general-purpose databases.
2. Range queries are terrible. Want to find all keys between "user:1000" and "user:2000"? A hash map cannot help you. You would have to scan every key in the map and check if it falls in the range. There is no ordering. This is O(n) for range queries regardless of the result set size.
Let us calculate exactly how much RAM a hash index needs for a realistic workload:
A production Bitcask-style log format is more structured than our toy "key,value\n" approach. Each record includes a header with metadata:
python import struct, zlib, time # On-disk record format (Bitcask-style): # ┌──────────┬───────────┬─────────┬──────────┬───────┬────────┐ # │ CRC (4B) │ tstamp(4B)│ ksz(2B) │ vsz (4B) │ key │ value │ # └──────────┴───────────┴─────────┴──────────┴───────┴────────┘ HEADER_FMT = "!IIhI" # CRC, timestamp, key_size, value_size HEADER_SIZE = struct.calcsize(HEADER_FMT) # = 14 bytes def encode_record(key, value): """Encode a key-value pair into the on-disk format.""" k = key.encode() if isinstance(key, str) else key v = value.encode() if isinstance(value, str) else value ts = int(time.time()) payload = struct.pack(HEADER_FMT[1:], ts, len(k), len(v)) + k + v crc = zlib.crc32(payload) & 0xFFFFFFFF return struct.pack("!I", crc) + payload def decode_record(data, offset): """Decode a record starting at the given offset.""" crc, ts, ksz, vsz = struct.unpack(HEADER_FMT, data[offset:offset+HEADER_SIZE]) k = data[offset+HEADER_SIZE : offset+HEADER_SIZE+ksz] v = data[offset+HEADER_SIZE+ksz : offset+HEADER_SIZE+ksz+vsz] # Verify CRC to detect corruption payload = data[offset+4 : offset+HEADER_SIZE+ksz+vsz] assert zlib.crc32(payload) & 0xFFFFFFFF == crc, "Corrupted record!" return k.decode(), v.decode(), HEADER_SIZE + ksz + vsz
In a real Bitcask deployment, the append-only log is divided into segments. When a segment reaches a size threshold (e.g., 1 GB), it is closed (becomes immutable) and a new segment is opened for writes. Here is the full lifecycle:
The merge process is crucial because without it, disk usage grows without bound. Even if you only have 1 million unique keys, writing each key 1000 times creates 1 billion records across many segments. Compaction reduces this back to 1 million records.
What if we take the append-only log from Chapter 1 but add one constraint: the keys in each segment file must be sorted? This seemingly simple change — called a Sorted String Table (SSTable) — gives us three enormous advantages over the hash index approach.
Advantage 1: Efficient merging. When you need to compact multiple segment files into one, you can merge them like merge sort — read from all input segments simultaneously, always picking the lowest key. If the same key appears in multiple segments, keep only the value from the most recent segment. This runs in O(n) time and O(1) memory regardless of file sizes, because each segment is already sorted.
Advantage 2: Sparse index. You no longer need every key in the in-memory index. If you know that key "handbag" is at offset 12000 and key "handsaw" is at offset 13200, then "handler" must be somewhere between those offsets. You jump to offset 12000 and scan forward until you find it. With one index entry per few kilobytes of data, you can cover terabytes of on-disk data with a tiny in-memory index.
Advantage 3: Range queries. Because keys are sorted, finding all keys between "user:1000" and "user:2000" means seeking to the first key ≥ "user:1000" and scanning forward until you pass "user:2000". This is efficient — you only read the data you need.
You cannot sort a file as you append to it — that would require random I/O, which defeats the whole purpose. The trick is to separate the write path from the on-disk format:
This three-level architecture is called a Log-Structured Merge-Tree (LSM-Tree). It was originally described by Patrick O'Neil in 1996 and is the engine behind LevelDB, RocksDB, Cassandra, HBase, and many modern databases.
Reading from an LSM-tree requires checking multiple places, in order:
The worst case is looking up a key that does not exist. You must check the memtable and every SSTable on disk before you can conclude it is not there. This is where Bloom filters save the day.
A Bloom filter is a memory-efficient probabilistic data structure that can tell you "this key is definitely not in this SSTable" or "this key might be in this SSTable." It never gives false negatives, but it can give false positives (at a controllable rate, typically 1-2%).
How it works: a Bloom filter is a bit array of m bits, initially all zeros. To add a key, hash it with k different hash functions, each producing a position in the bit array. Set those k bits to 1. To check if a key exists, hash it with the same k functions and check if all k bits are set. If any bit is 0, the key is definitely not in the set. If all are 1, the key probably is (but might be a false positive due to hash collisions).
Each SSTable has its own Bloom filter, stored alongside it. Before seeking into an SSTable file, you check its Bloom filter. If the filter says "no," you skip that file entirely. This turns most non-existent-key lookups from O(levels × log n) into O(levels), checking only a few bytes of Bloom filter per level.
Let us trace a write and a read through a realistic LSM-tree with 3 levels:
You cannot simply remove a key from the memtable and call it deleted. The key might still exist in older SSTables on disk. If someone reads the key, they would find the old value in an SSTable and think it still exists.
The solution: write a special marker called a tombstone. The tombstone is a record that says "this key has been deleted." When the read path encounters a tombstone (in the memtable or any SSTable), it returns "key not found" without checking deeper levels. During compaction, tombstones cancel out the key-value pairs they cover, and both are discarded once the tombstone reaches the deepest level.
Tombstones are one of the trickiest parts of LSM-tree engineering. If you delete many keys but compaction has not run yet, the tombstones pile up and make reads slower (you still have to check all the SSTables, only to find tombstones). This is why some databases (like Cassandra) let you set a gc_grace_seconds parameter that controls how long tombstones linger before compaction is forced.
There are two main approaches to compaction, and they produce very different behaviors:
| Strategy | How it works | Write amplification | Read amplification | Space amplification | Used by |
|---|---|---|---|---|---|
| Size-tiered | Merge similarly-sized SSTables into one larger one. Newer, smaller SSTables are merged first. | Lower | Higher (more files to check) | Higher (old data lingers) | Cassandra, HBase |
| Leveled | Organize SSTables into levels. Each level is 10x the size of the previous. SSTables within a level have non-overlapping key ranges. | Higher (more rewriting) | Lower (at most one file per level) | Lower (less redundancy) | LevelDB, RocksDB |
In size-tiered compaction, SSTables are grouped by size. When you accumulate enough similarly-sized SSTables (typically 4), they are merged into one larger SSTable. Think of it like a tournament bracket — small files merge into medium files, medium into large.
In leveled compaction (used by LevelDB and RocksDB), SSTables are organized into levels. Each level is 10x the total size of the previous. Within a level (except L0), SSTable key ranges do not overlap — this is the key invariant that makes reads fast.
The simulation below shows an LSM-tree in action. Writes go to the memtable. When it fills, it flushes to Level 0. Background compaction merges L0 into L1. Watch the data flow through levels.
Click "Write" to add a key. Watch memtable fill and flush to disk. "Compact" merges L0 into L1.
Let us calculate the write amplification of leveled compaction with a size ratio of 10:
Production LSM-tree engines like RocksDB expose dozens of tuning parameters. The ones that matter most:
| Parameter | Default (RocksDB) | Effect of Increasing | Trade-off |
|---|---|---|---|
| write_buffer_size (memtable size) | 64 MB | Fewer flushes, larger SSTables, higher write throughput | More memory usage, longer recovery after crash (more WAL to replay) |
| max_write_buffer_number | 2 | More memtables can exist before stall, smoother writes | More memory usage |
| level0_file_num_compaction_trigger | 4 | Fewer compactions triggered, higher write throughput | More L0 files to check per read, higher read amplification |
| max_bytes_for_level_base (L1 size) | 256 MB | More data per level, fewer levels for same data size | Each compaction moves more data, longer compaction pauses |
| max_background_compactions | 1 | Parallel compaction, keep up with high write rates | More CPU and disk I/O consumed by background work |
| bloom_bits_per_key | 10 | Lower false positive rate, fewer wasted disk reads | More memory per SSTable |
While LSM-trees are the new hotness, the B-tree has been the dominant index structure since 1970. Virtually every relational database — PostgreSQL, MySQL, Oracle, SQL Server — uses B-trees as their primary index structure. If you have used a database, you have used a B-tree.
Like SSTables, B-trees keep keys sorted, which enables efficient range queries. But the design philosophy is completely different. Where LSM-trees buffer writes in memory and periodically flush sorted files, B-trees update data in place on disk.
A B-tree organizes data into fixed-size pages (also called blocks), typically 4 KB — the same size as a disk sector or virtual memory page. This is not a coincidence. It means reading one B-tree node costs exactly one disk I/O.
There are two types of pages:
Internal pages contain keys and pointers to child pages. If an internal page contains keys [10, 20, 30], it has four child pointers: one for keys < 10, one for keys 10-19, one for keys 20-29, and one for keys ≥ 30. The number of children per page is called the branching factor.
Leaf pages contain the actual key-value pairs (or keys plus pointers to the rows in a separate heap file).
The magic of B-trees is in the branching factor. A typical 4 KB page can hold hundreds of key-pointer pairs. Let us work through the math:
To find a key in a B-tree, start at the root page. Binary-search within the page to find the right child pointer. Follow it. Repeat until you reach a leaf. Each level costs one disk read (or one cache hit if the page is already in the buffer pool).
To insert a new key, follow the lookup path to the correct leaf page. If the leaf has room, insert the key in sorted order within the page. If the leaf is full, split it into two half-full pages and promote the middle key to the parent. If the parent is also full, split it too. This can cascade up to the root, which is the only way a B-tree grows taller.
Let us trace through a page split step by step:
The average fill factor of a B-tree after many random insertions is about 69% (ln 2). This means roughly one-third of the space in each page is wasted. Some databases let you set a lower initial fill factor (e.g., PostgreSQL's fillfactor parameter) to leave room for future updates without splitting, at the cost of a deeper tree.
B-trees update pages in place. If the system crashes mid-write (say, while splitting a page), the tree could be left in a corrupted state — one page has been written but the parent pointer has not been updated yet, leaving an orphan page.
The solution: before modifying any page, write the intended change to an append-only write-ahead log (WAL, also called redo log). If the system crashes, replay the WAL to bring the tree back to a consistent state. This means every B-tree write actually hits the disk twice: once for the WAL, once for the actual page update.
The simulation below lets you insert keys into a B-tree with a small branching factor (4, so you can see splits happen). Watch pages fill and split as you add keys.
Click "Insert" to add a random key. Watch pages fill, then split when full. The tree grows taller only when the root splits.
When multiple threads read and write the B-tree simultaneously, you need concurrency control. B-trees use latches (lightweight mutexes) on individual pages, not full database-level locks. A common strategy: acquire a read latch on the root, descend to the child, acquire a latch on the child, release the parent's latch. This "latch crabbing" protocol minimizes contention.
An alternative approach used by some modern databases (LMDB, BoltDB) is copy-on-write: instead of modifying a page in place, create a new version of the page and atomically update the parent pointer. This eliminates the need for latches and the WAL, but uses more disk space.
Almost all databases actually use a B+ tree, not a classic B-tree. The difference:
| Feature | B-tree | B+ tree (what databases use) |
|---|---|---|
| Values in internal nodes | Yes — each key has its value in the node | No — internal nodes only store keys and child pointers |
| Values in leaf nodes | Values at every level | All values are in leaf nodes only |
| Leaf node links | No sibling pointers | Leaves linked in a doubly-linked list |
| Range queries | Requires tree traversal | Find first key, then follow leaf pointers |
| Branching factor | Lower (values take space in internal nodes) | Higher (internal nodes are pure routing) |
The B+ tree's leaf-linked-list makes range queries extremely efficient: find the starting key (O(log n) tree traversal), then follow the leaf pointers sequentially. No need to go back up and down the tree. This is why PostgreSQL, MySQL, and SQLite all use B+ trees.
Production B-tree implementations use several tricks beyond the basic algorithm:
Prefix compression. Keys within a page often share a common prefix (e.g., "user:1001", "user:1002", "user:1003"). Store the common prefix once and only the differing suffixes per key. This increases the branching factor — more keys per page = shallower tree.
Bulk loading. Building a B-tree by inserting keys one at a time is O(n log n) and creates a tree with ~69% fill factor. Bulk loading sorts all keys first, then builds the tree bottom-up, filling each page to 100%. This creates a more compact tree and is much faster. PostgreSQL's CREATE INDEX CONCURRENTLY uses this approach.
Fractal trees. A hybrid of B-trees and LSM-trees. Like a B-tree, but each internal node has a small buffer for pending insertions. When the buffer fills, the insertions are pushed down to children in a batch. This gives near-LSM write performance with B-tree read performance. TokuDB (now part of Percona) used fractal trees.
Buffer pool management. The database maintains a fixed-size buffer pool of B-tree pages in RAM. Popular pages (the root, heavily-accessed leaves) stay resident. Eviction uses LRU or clock algorithms. PostgreSQL's shared_buffers parameter controls this — typically set to 25% of available RAM.
Deletion in B-trees is the inverse of insertion. Remove the key from the leaf page. If the leaf now has fewer keys than the minimum (typically half the maximum), it is underful. The tree must rebalance by either borrowing a key from a sibling page (called rotation) or merging the underful page with a sibling (the inverse of a split). Merged pages propagate up the tree, just like splits.
In practice, many databases do not merge pages immediately on deletion. PostgreSQL marks deleted entries as "dead" and relies on VACUUM to reclaim space later. This is simpler and avoids the complexity of page merges, but it means deleted rows leave behind dead space (called bloat). If a table has 50% bloat, it is taking twice the disk space it needs, and scans read twice as many pages.
Let us derive the exact I/O cost for B-tree operations:
This is THE comparison question in systems design interviews. "When would you use an LSM-tree versus a B-tree?" If you can answer this fluently, you demonstrate a deep understanding of storage engine trade-offs.
LSM-trees batch writes into memory (the memtable) and flush them as large sequential writes. B-trees perform random I/O — seeking to the correct page and updating it in place. On any storage medium (HDD, SSD, or cloud block storage), sequential I/O is faster than random I/O:
| Operation | HDD | SSD | Cloud (EBS gp3) |
|---|---|---|---|
| Sequential write | ~150 MB/s | ~500 MB/s | ~250 MB/s |
| Random write (4KB) | ~1 MB/s (100 IOPS) | ~100 MB/s (25K IOPS) | ~50 MB/s (12K IOPS) |
| Sequential/Random ratio | 150x | 5x | 5x |
On spinning disks, the advantage is massive (150x). On SSDs, it is smaller but still significant (5x). This is why LSM-trees can sustain higher write throughput despite higher write amplification.
B-trees have a clear read advantage. A point lookup touches at most depth pages (typically 2-3 disk reads, often 1 due to caching). An LSM-tree must check the memtable, then potentially every level of SSTables — even with Bloom filters, this can mean 2-5 disk reads for a cold key.
Latency predictability is another B-tree strength. B-tree read latency is consistent: always depth disk reads. LSM-tree read latency varies — it might be in the memtable (instant) or buried in the oldest SSTable (multiple disk reads). If you need predictable p99 latency (e.g., for a financial trading system), B-trees win.
LSM-trees must perform background compaction to merge SSTables. This compaction competes with foreground operations for disk bandwidth and CPU. If the write rate exceeds the compaction throughput, uncompacted SSTables pile up, reads slow down (more files to check), and eventually the database must throttle writes. This is called compaction debt.
B-trees have no compaction. Their write overhead is paid immediately (WAL + page write), not deferred. This makes their performance more predictable under sustained load.
SSDs have a finite number of write cycles per cell (typically 1,000-10,000 for TLC NAND). Write amplification directly reduces SSD lifetime. If your LSM-tree has 30x write amplification and you are writing 1 GB/s of user data, the SSD sees 30 GB/s of actual writes. An SSD rated for 1 petabyte of total writes would last:
B-trees naturally support transactions because each key lives in exactly one place (one page). You can lock individual keys, implement MVCC by storing multiple versions in the page, and ensure serializability with page-level latches.
LSM-trees are harder to transact on. A key might exist in the memtable AND in multiple SSTables simultaneously (with different versions). Implementing snapshot isolation requires careful coordination between the memtable and all SSTable levels. This is doable (RocksDB supports transactions via a pessimistic or optimistic concurrency control layer on top of the LSM), but it adds complexity and overhead that B-trees handle naturally.
LSM-trees achieve much better compression than B-trees, for two reasons:
1. No fragmentation. B-tree pages are rarely 100% full. After many random insertions and deletions, the average page fill factor is about 69%. That means 31% of your on-disk index is wasted space. LSM-tree SSTables are written sequentially and compacted, so they have no internal fragmentation.
2. Compression-friendly layout. SSTable files contain sorted keys, which means adjacent entries often share prefixes and have similar values. This is ideal for dictionary encoding, prefix compression, and LZ4/Snappy/Zstd compression. B-tree pages are too small (4 KB) for compression to be as effective, and compressing/decompressing individual pages adds latency to every read.
Some modern storage engines try to get the best of both worlds:
Bw-Tree (Buzzword-Tree), developed by Microsoft Research, is a lock-free B-tree that uses a mapping table to indirect all page references. Updates create "delta records" that chain to the base page (like an LSM within each page). Periodic consolidation merges deltas into the base page. This gives B-tree read performance with near-LSM write performance. Used in SQL Server's Hekaton in-memory engine.
Fractal Tree (TokuDB, now Percona): each internal B-tree node has an attached write buffer. Inserts go into the buffer instead of being pushed all the way down to a leaf. When the buffer fills, its contents are flushed to children in a batch. This converts random writes into sequential batch writes at each level — effectively an LSM approach applied within a B-tree structure. Write amplification is O(logB(N) / B) instead of O(B × logB(N)) for standard B-tree.
The most feared production issue with LSM-trees is the write stall. Here is how it happens:
The mitigation strategy has three layers:
1. Prevent: Size your compaction throughput for 2x your peak write rate. More compaction threads, faster disks (NVMe over SATA), larger memtable (fewer L0 flushes).
2. Soften: Use rate limiting (delayed_write_rate) to gradually slow writers before a hard stall. Writers get slower instead of completely blocked.
3. Recover: If a stall happens, temporarily reduce write rate (back-pressure from the database to the application), let compaction catch up, then resume normal operation.
Theoretical analysis is useful, but let us look at real benchmark data. The following numbers come from published benchmarks comparing RocksDB (LSM) vs InnoDB (B-tree) on the same hardware:
| Benchmark | RocksDB (LSM) | InnoDB (B-tree) | Winner |
|---|---|---|---|
| Random write (ops/sec) | ~80,000 | ~15,000 | RocksDB (5.3x) |
| Sequential write (ops/sec) | ~200,000 | ~100,000 | RocksDB (2x) |
| Random read (ops/sec) | ~40,000 | ~65,000 | InnoDB (1.6x) |
| Range scan (rows/sec) | ~500,000 | ~800,000 | InnoDB (1.6x) |
| p99 read latency | ~2ms | ~0.5ms | InnoDB (4x) |
| p99 write latency | ~0.3ms | ~5ms | RocksDB (17x) |
| Disk space (1B keys) | ~45 GB | ~80 GB | RocksDB (1.8x) |
Notice the pattern: RocksDB dominates writes (especially p99 write latency — 17x better) and disk space efficiency. InnoDB dominates reads (1.6x better throughput, 4x better p99 read latency). The LSM compaction spikes are what cause the higher read p99.
Adjust the read/write ratio and see which engine performs better. High writes favor LSM. High reads favor B-tree.
| Factor | Choose LSM-Tree | Choose B-Tree |
|---|---|---|
| Write pattern | High write throughput, batch inserts, time-series | Moderate writes, frequent updates of same key |
| Read pattern | Few point lookups, mostly range scans | Many point lookups, predictable latency needed |
| Transactions | Not needed (or eventual consistency OK) | ACID transactions, serializable isolation |
| Disk usage | Compression-friendly, less fragmentation | More fragmentation, but no compaction overhead |
| Latency SLA | p50 matters more than p99 | Tight p99 requirements |
| Examples | Cassandra, RocksDB, InfluxDB, ScyllaDB | PostgreSQL, MySQL InnoDB, Oracle, SQL Server |
Here is how some well-known companies picked their storage engines, and why:
| Company | Use Case | Engine | Why |
|---|---|---|---|
| Social graph (friends, likes) | RocksDB (LSM) | Massive write volume from billions of social actions. Write throughput is king. | |
| Discord | Message storage | ScyllaDB (LSM, Cassandra-compatible) | Write-once messages, time-ordered range queries, massive scale (trillions of messages). |
| Uber | Trip data, pricing | PostgreSQL (B-tree) + Cassandra (LSM) | PostgreSQL for ACID transactions (pricing must be exact). Cassandra for high-volume trip events. |
| Netflix | User profiles, viewing history | Cassandra (LSM) | Global distribution across regions. Eventual consistency is acceptable for viewing history. |
| Shopify | E-commerce | MySQL/InnoDB (B-tree) | ACID transactions for orders and payments. Predictable latency for checkout flow. |
| Cloudflare | KV store at edge | Custom LSM | Write-heavy with eventual consistency. Replicated to 300+ locations worldwide. |
So far we have talked about indexing a single key. But real databases need much more: secondary indexes (looking up by a non-primary column), multi-column indexes (querying by two fields at once), full-text search, and even geospatial queries. Each requires a different indexing approach.
A primary index maps a unique key (like user_id) to the row's location. A secondary index maps a non-unique column (like city) to the set of rows that match. The B-tree or LSM-tree structure is the same — the difference is that secondary index entries can point to multiple rows.
There are two ways to handle this:
Heap file + pointers. The table data lives in a separate heap file in insertion order. Both the primary index and all secondary indexes contain pointers (row IDs) into this heap file. A secondary index lookup returns a list of row IDs, and you chase each pointer to the heap file to retrieve the actual row data. This pointer-chasing is called a non-clustered index scan.
Clustered index. Store the actual row data directly inside the primary index's leaf pages, sorted by the primary key. Secondary indexes still use pointers, but now they point into the primary index rather than a heap file. This is how InnoDB (MySQL) works — the primary key IS the table, and secondary indexes point to primary key values.
A concatenated index combines multiple columns into a single key. An index on (last_name, first_name) stores keys like "Smith|Alice". This supports queries that filter by last_name alone or by (last_name AND first_name), but NOT by first_name alone — the index is sorted by last_name first.
A covering index (or index-only scan) includes additional column values directly in the index entries, so the database can answer a query without touching the heap file at all. For example, an index on (city) that also INCLUDEs (population) can answer "SELECT population FROM cities WHERE city = 'Tokyo'" entirely from the index.
A covering index is one of the most powerful optimization techniques in database tuning. The idea: include the values the query needs directly in the index, so the database never has to visit the heap file or primary index at all.
python # Standard secondary index on email (PostgreSQL): # CREATE INDEX idx_email ON users(email); # SELECT name FROM users WHERE email = 'alice@example.com'; # → index lookup (email → ctid) → heap fetch (ctid → row) → return name # TWO lookups! # Covering index (INCLUDE the columns we need): # CREATE INDEX idx_email_covering ON users(email) INCLUDE (name); # SELECT name FROM users WHERE email = 'alice@example.com'; # → index lookup (email → name) → return name # ONE lookup! The "name" value is stored in the index leaf page. # EXPLAIN output would show "Index Only Scan" instead of "Index Scan" # This can be 2-5x faster for queries that hit many rows.
INCLUDE in CREATE INDEX. In MySQL, secondary indexes on InnoDB automatically "cover" the primary key (since secondary indexes store the primary key value).How do you find "all restaurants within 2 km of my location"? This requires searching on two dimensions (latitude and longitude) simultaneously. A standard B-tree can sort by one dimension, but not two.
R-trees solve this by grouping nearby objects into rectangular bounding boxes, organized hierarchically. PostGIS (the geospatial extension for PostgreSQL) uses R-trees (via GiST indexes) for spatial queries.
Another approach: space-filling curves like the Z-order curve or Hilbert curve map 2D coordinates to a 1D value, preserving spatial locality. Then you can use a standard B-tree on the 1D value. This is how DynamoDB's geospatial indexes work.
Full-text search indexes need to handle things that B-trees and LSM-trees cannot: word stemming (running, runs, ran → run), synonym matching, relevance ranking, and fuzzy matching (tolerate typos). These use inverted indexes — a map from each word to the list of document IDs containing that word. Lucene (the engine behind Elasticsearch and Solr) uses an SSTable-like structure for its term dictionary and posting lists.
For fuzzy matching (finding "restaurant" when the user typed "resturant"), Lucene uses a finite automaton built from a Levenshtein distance threshold. The automaton can efficiently traverse the sorted term dictionary and find all terms within edit distance 1 or 2 of the query term. This is how Elasticsearch's fuzzy queries work.
A standard B-tree can efficiently search one dimension. But "find all restaurants within 2 km" requires searching two dimensions (latitude and longitude) simultaneously. An R-tree extends the B-tree idea to multiple dimensions using minimum bounding rectangles (MBRs).
PostGIS uses GiST indexes (Generalized Search Trees), which are a generalization of R-trees that support any decomposable search predicate. This is how Uber finds nearby drivers, how Airbnb searches for listings in a map viewport, and how DoorDash finds restaurants near you.
Choosing which indexes to create is one of the most impactful decisions a database administrator makes. Too few indexes → slow reads. Too many indexes → slow writes and wasted disk space. The decision framework:
| Question | If Yes | If No |
|---|---|---|
| Is this column in WHERE clauses of frequent queries? | Index it | Don't index |
| Does the table have >100K rows? | Index is worthwhile | Full scan might be fine |
| Is the column high cardinality (many unique values)? | B-tree index | Consider bitmap index (for OLAP) |
| Does the query need range scans on this column? | B-tree (sorted) | Hash index might suffice |
| Is the table write-heavy (>80% writes)? | Minimize indexes | Add more indexes |
| Does the query SELECT only indexed columns? | Use covering index (INCLUDE) | Standard index |
python # PostgreSQL: check if your indexes are being used # This query shows which indexes are never used (candidates for removal) query = """ SELECT schemaname, tablename, indexname, idx_scan as times_used, pg_size_pretty(pg_relation_size(indexrelid)) as index_size FROM pg_stat_user_indexes WHERE idx_scan = 0 -- NEVER used! ORDER BY pg_relation_size(indexrelid) DESC; """ # Each unused index costs: # 1. Disk space (visible in index_size above) # 2. Write overhead (every INSERT/UPDATE/DELETE updates this index) # 3. VACUUM overhead (must clean dead entries in index too) # Drop unused indexes for immediate write performance improvement.
The order of columns in a concatenated index matters enormously. Consider an index on (last_name, first_name, age):
Click a city name in the secondary index to trace the pointer chain to the heap file rows.
We have spent five chapters agonizing over disk I/O. Pages, segments, sequential vs random writes. What if we just... keep everything in RAM?
RAM prices have fallen dramatically. In 2024, a cloud instance with 256 GB of RAM costs roughly $3/hour. Many datasets fit comfortably in that budget. This opens the door to in-memory databases — systems that keep the entire dataset in RAM, with disk used only for durability (crash recovery).
Keeping data only in RAM would mean losing everything on a crash or power failure. In-memory databases solve this with several strategies:
Write-ahead log on disk. Every write is appended to a persistent log on disk before being applied in memory. On crash, replay the log. This is how Redis (with AOF mode) and VoltDB work. The log is append-only and sequential, so it does not bottleneck writes.
Periodic snapshots. Serialize the entire in-memory state to disk at intervals (e.g., every 5 minutes). On crash, restore from the latest snapshot, then replay the WAL entries since the snapshot. Redis RDB mode does this.
Replication. Keep copies on other machines. If one machine dies, another has the data. This is how Memcached clusters survive node failures (though Memcached itself has no persistence — it relies on the application to re-populate from the primary database).
To make the "data structures" advantage concrete, here are operations that Redis handles in microseconds but are nightmarish on disk:
| Data Structure | Redis Operation | Time | Disk-based equivalent | Disk time |
|---|---|---|---|---|
| Sorted Set | ZADD (insert with score) | O(log n), ~5μs | B-tree insert + heap update | ~500μs |
| Sorted Set | ZRANK (rank of element) | O(log n), ~5μs | COUNT(*) WHERE score > X | ~50ms for 10M rows |
| HyperLogLog | PFADD + PFCOUNT | O(1), ~2μs | COUNT(DISTINCT ...) with full scan | Seconds to minutes |
| Bitmap | BITOP AND/OR | O(n/64) bits, ~10μs | JOIN two tables + filter | Seconds |
| Stream | XADD + XRANGE | O(1) + O(k), ~3μs | INSERT + range scan | ~1ms |
The ZRANK operation is particularly illustrative. In Redis, a sorted set is implemented as a skip list with augmented size counts at each level. Finding the rank of an element means traversing the skip list and summing the span counts — O(log n). In PostgreSQL, finding the rank requires counting all rows with a higher score, which is a full index scan — O(n). For a 10-million-entry leaderboard, that is the difference between 5 microseconds and 50 milliseconds — a 10,000x gap.
Anti-caching (used by VoltDB, MemSQL/SingleStore) flips the traditional approach. Instead of caching hot data in memory from a disk-based store, you keep everything in memory and evict cold data to disk when RAM pressure increases. It is like a user-space page cache — the database knows which tuples are cold (based on access patterns), evicts them efficiently, and fetches them back on demand.
This gives you the programming model of an in-memory database (all data structures, no page management) with the capacity of a disk-based system.
Non-Volatile Memory (NVM), most notably Intel Optane Persistent Memory (now discontinued, but the concept lives on), sits between DRAM and SSD in both speed and cost. It offers byte-addressable persistent storage with latency around 300ns — 3x slower than DRAM but 300x faster than SSD. This blurs the line between in-memory and on-disk databases.
With NVM, a database can keep its data structures (B-trees, hash maps) directly in persistent memory. No WAL needed — every write is instantly durable. No buffer pool needed — the data is already in addressable memory. No crash recovery needed — the data survives power loss. This eliminates the three biggest sources of complexity in storage engine design.
However, NVM introduces its own challenges: cache-line-level atomicity (writes smaller than a cache line are atomic, but larger writes are not), memory ordering (CPU may reorder writes, requiring explicit flush instructions), and wear leveling (like SSD, NVM cells have finite write endurance). These require new data structure designs — "persistent data structures" that are correct even if the system crashes mid-operation.
| System | Model | Durability | Best For |
|---|---|---|---|
| Redis | Key-value + data structures | AOF / RDB snapshots | Caching, sessions, leaderboards, pub/sub |
| Memcached | Key-value (simple) | None (cache only) | Simple caching layer in front of a DB |
| VoltDB | Relational (SQL) | WAL + replication | High-throughput OLTP, financial systems |
| SAP HANA | Relational + columnar | WAL + savepoints | Enterprise OLAP + OLTP (HTAP) |
| Dragonfly | Redis-compatible | Snapshots + replication | Drop-in Redis replacement with better memory efficiency |
Compare typical latencies for various operations. The gap is largest for complex data structures.
Both are in-memory, but they serve different use cases:
| Feature | Redis | Memcached |
|---|---|---|
| Data structures | Strings, lists, sets, sorted sets, hashes, streams, bitmaps, HyperLogLog | Strings only (key → value) |
| Persistence | RDB snapshots + AOF log | None (pure cache) |
| Replication | Built-in primary-replica | None (client-side sharding) |
| Memory efficiency | Higher overhead per key (~70 bytes) | Lower overhead (~50 bytes) |
| Threading | Single-threaded (multi-threaded I/O in Redis 7+) | Multi-threaded from day one |
| Pub/Sub | Built-in | No |
| Lua scripting | Built-in (atomic transactions) | No |
| Best for | Sessions, leaderboards, rate limiting, real-time features | Simple read-through cache layer in front of MySQL/PostgreSQL |
Redis offers two persistence mechanisms that can be used independently or together:
RDB (Redis Database) snapshots. Periodically fork the process and dump the entire dataset to a point-in-time snapshot file. The fork uses copy-on-write (COW), so the parent continues serving requests while the child writes to disk. Configurable: "save after 900 seconds if at least 1 key changed." Recovery: load the snapshot on startup. Data loss: up to one snapshot interval (e.g., 15 minutes of data).
AOF (Append Only File). Log every write operation to an append-only file on disk. Three durability levels: (1) always — fsync every write (slowest, no data loss), (2) everysec — fsync every second (good compromise, up to 1 second of data loss), (3) no — let the OS decide when to flush (fastest, up to 30 seconds of data loss). The AOF file is periodically rewritten (compacted) in the background to prevent unbounded growth.
So far we have been thinking about databases that serve user-facing applications: a web server reads a user record, updates a shopping cart, inserts an order. These queries touch a small number of rows, but there are thousands of them per second. This is Online Transaction Processing (OLTP).
But there is a second class of database usage that looks completely different. A business analyst wants to know: "What was the total revenue by product category in Q3 across all stores?" This query must scan millions or billions of rows, aggregate values, and return a single summary. This is Online Analytical Processing (OLAP).
| Property | OLTP | OLAP |
|---|---|---|
| Read pattern | Small number of rows per query, fetched by key | Aggregate over a large number of rows |
| Write pattern | Random-access, low-latency writes from user input | Bulk import (ETL) or streaming ingest |
| Primarily used by | End users via web/mobile applications | Data analysts, data scientists, dashboards |
| Data represents | Latest state of things (current balance, current inventory) | History of events over time |
| Dataset size | GB to low TB | TB to PB |
| Query style | SELECT * FROM users WHERE id = 42 | SELECT product, SUM(revenue) FROM sales GROUP BY product |
| Bottleneck | Disk seek time (how fast you can find a row) | Disk bandwidth (how fast you can scan data) |
Most organizations do not run analytics directly on their OLTP databases. Instead, they Extract, Transform, Load (ETL) data from multiple OLTP systems into a dedicated data warehouse optimized for analytics.
Data warehouses typically organize data into a star schema. At the center is a fact table — a very large table where each row represents an event (a sale, a click, a page view). Each fact row has foreign keys pointing to smaller dimension tables (product, store, customer, date).
The star schema gets its name from the visual layout: the fact table in the center with dimension tables radiating outward like points of a star.
A snowflake schema is a variation where dimension tables are further normalized (e.g., the product dimension has a separate category sub-dimension). This saves space but makes queries more complex with additional joins.
A common question: "Why can't the analyst just query the OLTP database directly?" Three reasons:
1. Performance interference. An OLAP scan that reads millions of rows competes with OLTP queries for buffer pool space, disk I/O, and CPU. A single dashboard refresh could evict thousands of frequently-accessed pages from the buffer pool, causing every user-facing query to hit disk instead of cache. Your 2ms point lookups suddenly become 20ms.
2. Different schemas. OLTP tables are normalized (third normal form) to minimize update anomalies. Analytics work better on denormalized star schemas where every fact row contains or points to all the context you need. ETL performs this denormalization during the transform step.
3. Different storage formats. OLTP uses row-oriented storage (fast for inserting and reading individual rows). OLAP is 10-100x faster with column-oriented storage (Chapter 8). You cannot get both benefits from a single storage format.
python # A simplified ETL pipeline for a data warehouse # 1. EXTRACT: Read from OLTP sources users = pg_conn.execute("SELECT * FROM users WHERE updated_at > %s", last_etl) orders = mysql_conn.execute("SELECT * FROM orders WHERE created_at > %s", last_etl) events = mongo_conn.find({"ts": {"$gt": last_etl}}) # 2. TRANSFORM: Clean, join, denormalize into star schema fact_rows = [] for order in orders: user = users_by_id[order.user_id] fact_rows.append({ "order_id": order.id, "date_key": to_date_key(order.created_at), "user_key": user.id, "product_key": order.product_id, "revenue": order.total, "quantity": order.qty, "user_country": user.country, # denormalized from user dim }) # 3. LOAD: Bulk insert into the data warehouse warehouse.bulk_insert("sales_facts", fact_rows)
Traditional ETL runs on a schedule — hourly or daily. This means the data warehouse is always at least one ETL cycle behind the OLTP sources. For many analytics use cases, this is fine ("what were last month's sales?" does not need real-time data).
But some use cases need fresher data ("how many users signed up in the last 5 minutes?"). Change Data Capture (CDC) replaces batch ETL with a continuous stream of changes. The OLTP database's WAL (write-ahead log) is read in real time, and each change (insert, update, delete) is streamed to the data warehouse within seconds.
| Approach | Latency | OLTP Impact | Complexity | Tools |
|---|---|---|---|---|
| Batch ETL | Hours to minutes | Read queries on source DB | Lower | Airflow, dbt, Fivetran |
| CDC (log-based) | Seconds | Near-zero (reads WAL only) | Higher | Debezium, Maxwell, AWS DMS |
See how an OLTP point lookup touches one row while an OLAP aggregation scans millions.
Fact tables in a data warehouse can be enormous. Let us size one for a mid-size e-commerce company:
The newest trend is HTAP (Hybrid Transactional/Analytical Processing): databases that handle both OLTP and OLAP in a single system. The idea: maintain a row-oriented store for transactions and a column-oriented store for analytics, with automatic real-time synchronization between them.
TiDB, for example, has two storage engines: TiKV (a distributed key-value store based on RocksDB/LSM for OLTP) and TiFlash (a columnar store for OLAP). When you write to TiKV, the data is asynchronously replicated to TiFlash in columnar format. The query optimizer routes each query to the appropriate engine — point lookups go to TiKV, aggregation queries go to TiFlash.
This is the theoretical ideal: one database, no ETL, real-time analytics. In practice, HTAP databases are still maturing, and most large organizations still separate OLTP and OLAP into different systems for operational simplicity and failure isolation.
This is the showcase chapter — the single most important idea for OLAP performance. Once you understand column-oriented storage, you understand why data warehouses are orders of magnitude faster than row-oriented databases for analytics.
In a row-oriented database (PostgreSQL, MySQL), each row is stored contiguously on disk. A row for a sales event might look like this in storage:
Now run the query: SELECT SUM(revenue) FROM sales WHERE date BETWEEN '2024-01-01' AND '2024-03-31'.
This query needs only two columns: date and revenue. But in a row-oriented layout, the database must read ALL seven columns for every row, because each row is stored as a single unit. If each row is 100 bytes and you have 100 million rows, you read 10 GB of data to access 16 bytes per row (8-byte date + 8-byte revenue) — you are reading 6x more data than you need.
In a column-oriented (columnar) database, all values from one column are stored contiguously, separate from other columns:
Storing values of the same type contiguously also enables dramatically better compression. A column of dates contains similar values (most dates cluster in recent months). A column of product categories might have only 50 distinct values across 100 million rows.
Bitmap encoding. For a column with n distinct values, create n bitmaps — one per distinct value. Each bitmap has one bit per row (1 if that row has this value, 0 otherwise). For a "country" column with 200 countries across 100M rows, each bitmap is 100M bits = 12.5 MB, and the total is 200 × 12.5 MB = 2.5 GB. But most bitmaps are very sparse (mostly zeros), so they compress extremely well with run-length encoding.
With run-length encoding, a bitmap for "NYC" across 100 million rows where 30% are NYC compresses from 12.5 MB to about 200 KB. The compression ratio for the entire column can be 10-100x.
The simulation below shows the same table stored in row-oriented and column-oriented format. Select columns for your query and see how much data each format must read.
Toggle columns to include in your query. Watch the bytes-read bar for each format.
Beyond bitmap encoding, column stores use two other powerful compression techniques:
Dictionary encoding. Replace each distinct string value with a small integer. Store a separate dictionary mapping integers to strings. If a "product_name" column has 10,000 distinct products across 100 million rows, each row stores a 2-byte integer instead of a 20-byte string — 10x compression on that column alone.
Delta encoding. For sorted columns (especially timestamps), store the difference between consecutive values instead of the absolute values. If timestamps are in milliseconds and rows are 100ms apart, each delta is small and compresses well:
Column stores unlock a performance trick that row stores cannot use: vectorized execution. Because all values in a column are the same type, stored contiguously in memory, you can process them in tight loops that exploit CPU cache lines and SIMD instructions.
A modern CPU can compare 16 integers against a constant in a single SIMD instruction (using 128-bit SSE or 256-bit AVX). Processing a column of dictionary-encoded integers with SIMD is 10-20x faster than the row-by-row processing that row stores must use.
Within each column file, the rows must be stored in the same order across all columns (so that row 42 in the date file corresponds to row 42 in the revenue file). You can choose what order to sort the rows in.
Sorting by date (the most common filter in analytics) means the date column compresses beautifully (many consecutive identical dates → long run-length encoding runs), and date-range queries scan a contiguous region.
Some databases (like Vertica, now part of OpenText) maintain multiple sort orders — several copies of the data, each sorted differently. This is like having multiple clustered indexes. It uses more space but means the query optimizer can pick the sort order that best matches each query.
For frequently-run aggregation queries, you can pre-compute the result and store it as a materialized view. A data cube (also called an OLAP cube) is a specialized materialized view that pre-computes aggregates across every combination of dimensions. For example, a cube over (date, product, store) pre-computes the SUM of revenue for every (date, product, store) combination.
The downside: materialized views and cubes must be updated when the underlying data changes, and they cannot answer queries that do not match the pre-computed dimensions. They are a caching optimization, not a general solution.
DuckDB is worth special attention because it demonstrates the power of column-oriented processing in an embedded context — no server, no network, just a library linked into your application (like SQLite, but columnar).
python import duckdb # DuckDB can directly query Parquet files (columnar format) on disk # No need to load into a database first! result = duckdb.sql(""" SELECT product_category, SUM(revenue) as total_revenue, COUNT(*) as num_sales, AVG(revenue) as avg_sale FROM read_parquet('sales_2024_*.parquet') WHERE date BETWEEN '2024-01-01' AND '2024-03-31' GROUP BY product_category ORDER BY total_revenue DESC LIMIT 10 """) # This query over 100M rows in Parquet files completes in ~2 seconds # because DuckDB: # 1. Only reads the 4 columns needed (date, product_category, revenue) # 2. Uses vectorized execution (processes 2048 values at a time) # 3. Pushes the WHERE filter down to the Parquet reader (skip row groups) # 4. Runs on all CPU cores in parallel # The same query on PostgreSQL (row-oriented) would read ALL columns # and take 20-60 seconds. 10-30x slower on the same hardware.
Column stores are not always better. They lose decisively in these scenarios:
1. Point lookups. Retrieving a single row by primary key requires reading one value from every column file and reassembling the row. If you have 200 columns, that is 200 file seeks. A row store does this in 1 seek. For OLTP-style "get user by ID" queries, row stores are 10-100x faster.
2. Small inserts. Adding one row to a column store means appending to 200 separate column files. A row store appends to one file. This is why column stores use the LSM-buffered write path (batch in memory, flush in bulk).
3. Wide SELECT * queries. If your query needs all columns (like a backup or export), column store must read every column file and reassemble rows. Row store reads the data in its natural format with zero reassembly cost.
4. Narrow tables. If your table has only 3-5 columns, the column store advantage disappears — you are reading most of the data regardless of format. The overhead of maintaining separate column files is not worth it.
Column-oriented storage is optimized for reads, but writes are tricky. Inserting a single row requires appending a value to every column file — potentially hundreds of files. The solution: use an LSM-tree within each column. Incoming writes go to an in-memory buffer (a row-oriented memtable). When the buffer is full, pivot the rows into columnar format and flush all columns as a batch. Background compaction merges small column segments into larger sorted ones.
This is exactly how Apache Parquet + Delta Lake work, and it is the write engine behind ClickHouse, DuckDB, and Snowflake. The key insight: accept row-oriented writes, batch them, convert to columnar format in bulk. This bridges the gap between the write-friendly row format and the read-friendly columnar format.
python # Conceptual column store write path class ColumnStore: def __init__(self, columns, flush_threshold=10000): self.columns = columns # list of column names self.buffer = [] # row-oriented in-memory buffer self.threshold = flush_threshold self.segments = [] # list of flushed column segments def insert(self, row): """Insert a row (dict of column_name → value).""" self.buffer.append(row) if len(self.buffer) >= self.threshold: self._flush() def _flush(self): """Pivot rows to columns and write to disk.""" segment = {} for col in self.columns: segment[col] = [row[col] for row in self.buffer] self.segments.append(segment) self.buffer = [] def scan(self, needed_columns, predicate=None): """Only read the columns we need — the whole point.""" results = [] for seg in self.segments: n_rows = len(seg[self.columns[0]]) for i in range(n_rows): row = {c: seg[c][i] for c in needed_columns} if predicate is None or predicate(row): results.append(row) return results
Understanding the Parquet file format makes column store concepts concrete. A Parquet file is organized into row groups, each containing column chunks, each containing pages:
Predicate pushdown is the key optimization. Each column chunk's metadata includes the minimum and maximum values for that chunk. If your query says WHERE date > '2024-06-01' and a row group's date column has max = '2024-05-31', the entire row group (potentially millions of rows) can be skipped without reading a single byte of data.
This chapter is your cheat sheet. Everything from Chapters 0-8 condensed into tables, decision frameworks, and ready-to-use answers for system design interviews.
| Structure | Read | Write | Space | Range Query | Best Use Case |
|---|---|---|---|---|---|
| Hash Index | O(1) | O(1) | Keys must fit in RAM | No | Few keys, frequent updates (counters, sessions) |
| LSM-Tree | O(log n) × levels | O(1) amortized | Good compression | Yes (sorted SSTs) | Write-heavy, time-series, event logs |
| B-Tree | O(log n) | O(log n) | Some fragmentation | Yes | OLTP, transactions, predictable latency |
| Column Store | Full-scan optimized | Batch only | Excellent compression | Yes (per-column) | Analytics, data warehousing |
| In-Memory (Redis) | O(1) - O(log n) | O(1) - O(log n) | RAM-bound | Depends on structure | Caching, real-time leaderboards, pub/sub |
"Design a key-value store."
"Design a time-series database."
"Design a data warehouse."
Let us work through a real design challenge, the kind you would get in a 45-minute system design interview:
| Symptom | Likely Cause | Solution |
|---|---|---|
| Reads getting slower over time | LSM compaction falling behind — too many SSTables to check per read | Tune compaction (increase background threads, reduce size ratio). Check Bloom filter effectiveness. Consider leveled compaction to limit files per level. |
| Writes spike then stall | B-tree page splits cascading up to root, or LSM write stall when L0 is full | For B-tree: pre-split pages, use fillfactor < 100%. For LSM: increase memtable size, add more compaction threads. |
| Point lookups are fast but range queries are slow | Data is not clustered by the range-query key, causing random I/O | Add a clustered index on the range-query column, or use a covering index to avoid heap lookups. |
| Analytics queries slow down OLTP | Analytics scans are evicting OLTP data from the buffer pool | Separate OLTP and OLAP into different databases. Set up ETL to a data warehouse. Or use read replicas for analytics. |
| Disk space growing despite deletes | LSM: deleted keys are tombstoned, not removed until compaction. B-tree: pages are not returned to OS after row deletion. | LSM: force compaction on affected segments. B-tree: VACUUM (PostgreSQL) or OPTIMIZE TABLE (MySQL) to reclaim space. |
Drill 1: Simple Hash Index in Python
python import os class HashIndex: """Bitcask-style hash index: append-only log + in-memory hash map.""" def __init__(self, path): self.path = path self.index = {} # key → byte offset self.f = open(path, 'a+b') # append + read, binary self._rebuild_index() def _rebuild_index(self): """Scan the file to rebuild the in-memory index on startup.""" self.f.seek(0) while True: offset = self.f.tell() line = self.f.readline() if not line: break key, _ = line.decode().strip().split(',', 1) self.index[key] = offset # last write wins def set(self, key, value): offset = self.f.seek(0, 2) # seek to end self.f.write(f"{key},{value}\n".encode()) self.f.flush() self.index[key] = offset def get(self, key): if key not in self.index: return None self.f.seek(self.index[key]) line = self.f.readline().decode().strip() _, value = line.split(',', 1) return value
Drill 2: Simple LSM Memtable + Flush
python import json, os from sortedcontainers import SortedDict # pip install sortedcontainers class LSMTree: """Minimal LSM-tree: memtable + SSTable flush.""" def __init__(self, directory, memtable_limit=1000): self.directory = directory self.memtable_limit = memtable_limit self.memtable = SortedDict() # sorted in-memory tree self.sstables = [] # list of SSTable file paths, newest first os.makedirs(directory, exist_ok=True) def put(self, key, value): self.memtable[key] = value if len(self.memtable) >= self.memtable_limit: self._flush() def get(self, key): # 1. Check memtable first (newest data) if key in self.memtable: return self.memtable[key] # 2. Check SSTables from newest to oldest for sst_path in self.sstables: with open(sst_path) as f: data = json.load(f) # [{key:..., value:...}, ...] # Binary search (data is sorted by key) lo, hi = 0, len(data) - 1 while lo <= hi: mid = (lo + hi) // 2 if data[mid]['key'] == key: return data[mid]['value'] elif data[mid]['key'] < key: lo = mid + 1 else: hi = mid - 1 return None def _flush(self): """Write sorted memtable to disk as an SSTable.""" path = os.path.join(self.directory, f"sst_{len(self.sstables)}.json") data = [{'key': k, 'value': v} for k, v in self.memtable.items()] with open(path, 'w') as f: json.dump(data, f) self.sstables.insert(0, path) # newest first self.memtable = SortedDict()
Drill 3: B-Tree Search in Python
python class BTreeNode: def __init__(self, leaf=True): self.keys = [] # sorted list of keys self.values = [] # values (only in leaf nodes) self.children = [] # child pointers (only in internal nodes) self.leaf = leaf def btree_search(node, key): """Search for a key in a B-tree. Returns (node, index) or None.""" if node is None: return None # Binary search within the node's keys lo, hi = 0, len(node.keys) - 1 while lo <= hi: mid = (lo + hi) // 2 if node.keys[mid] == key: return (node, mid) # found! elif node.keys[mid] < key: lo = mid + 1 else: hi = mid - 1 # Not found in this node if node.leaf: return None # key doesn't exist else: return btree_search(node.children[lo], key) # recurse into child # Usage: each recursive call = one disk page read # Total disk reads = depth of tree = log_b(n)
Drill 4: Bloom Filter in Python
python import hashlib class BloomFilter: """Simple Bloom filter with k=3 hash functions.""" def __init__(self, size=1000): self.size = size self.bits = [False] * size def _hashes(self, key): """Generate 3 hash positions for a key.""" h = hashlib.md5(key.encode()).hexdigest() return [ int(h[:8], 16) % self.size, int(h[8:16], 16) % self.size, int(h[16:24], 16) % self.size, ] def add(self, key): for pos in self._hashes(key): self.bits[pos] = True def might_contain(self, key): """Returns False = definitely not in set. True = maybe in set.""" return all(self.bits[pos] for pos in self._hashes(key)) # Usage in LSM read path: # if not bloom_filter.might_contain(key): # skip_this_sstable() # saved a disk read!
Let us size a real storage system from scratch — the kind of back-of-envelope calculation you would do in a system design interview:
When asked "Design a [X] storage system" in an interview, use this framework:
| If the interviewer says... | You should think... |
|---|---|
| "Write-heavy workload" | LSM-tree (sequential writes, high throughput) |
| "Read-heavy with point lookups" | B-tree or hash index |
| "Range queries" | B-tree or LSM-tree (NOT hash index) |
| "Analytics / aggregation" | Column-oriented storage |
| "All data must fit in memory" | Redis / in-memory DB |
| "Predictable p99 latency" | B-tree (no compaction spikes) |
| "Time-series data" | LSM-tree with timestamp keys |
| "Need transactions" | B-tree (WAL + latches = ACID friendly) |
| "Geospatial queries" | R-tree or space-filling curve + B-tree |
| "Full-text search" | Inverted index (Lucene/Elasticsearch) |
Jeff Dean's famous "latency numbers every programmer should know," updated for 2024 hardware and applied to storage engines:
| Operation | Latency | Storage Engine Impact |
|---|---|---|
| L1 cache reference | 1 ns | Binary search within a cached B-tree page |
| L3 cache reference | 12 ns | Memtable lookup (all in CPU cache for small memtables) |
| DRAM access | 100 ns | Hash index lookup, Bloom filter check |
| NVMe SSD random read (4KB) | 100 μs | One B-tree leaf page read = one SSD I/O |
| NVMe SSD sequential read (1MB) | 200 μs | SSTable block read during compaction or range scan |
| HDD seek + read (4KB) | 10 ms | One B-tree page on spinning disk (50x slower than SSD) |
| Network round trip (same datacenter) | 500 μs | Reading from a remote replica or distributed storage |
| Network round trip (cross-region) | 100 ms | Reading from a geo-replicated database |
Storage and retrieval is the foundation upon which everything else in database systems is built. Now that you understand how data is organized on disk (and in memory), the remaining DDIA chapters snap into focus:
| Topic | Connection to Storage & Retrieval |
|---|---|
| Data Models (Ch 2) | Relational, document, and graph models all sit on top of the storage engines we covered. A document database might use an LSM-tree (MongoDB's WiredTiger engine) while a relational database uses a B-tree (PostgreSQL). The data model determines WHAT queries you need; the storage engine determines HOW fast those queries run. |
| Encoding & Evolution (Ch 4) | When you write data to an SSTable or B-tree page, it must be serialized to bytes. The encoding format (JSON, Protobuf, Avro) determines how efficiently the data is stored, how quickly it can be decoded, and whether schema evolution is possible without rewriting all data. |
| Replication (Ch 5) | The WAL (write-ahead log) that B-trees use for crash recovery is the same mechanism used for replication — ship the WAL to a replica, replay it, and you have a copy of the database. LSM-tree SSTables can also be shipped to replicas. |
| Partitioning (Ch 6) | When data exceeds what one machine can store, you partition (shard) it. The choice of partition key interacts with the index structure: a hash-based partition key distributes writes evenly but makes range queries hit all partitions; a range-based key enables efficient range scans but can create hot spots. |
| Transactions (Ch 7) | B-trees with WAL and latches support serializable transactions naturally. LSM-trees require additional mechanisms (lock tables, MVCC) to provide transaction isolation. The storage engine's concurrency model shapes what isolation levels are practical. |
The storage engine landscape continues to evolve. Some frontiers worth exploring:
Learned indexes (2018). Tim Kraska et al. proposed replacing B-trees with neural networks that predict the position of a key. A small ML model trained on the data distribution can predict "key 42 is at approximately position 1,247,000" faster than traversing a B-tree. If the data has a near-uniform distribution, a linear model (2 multiplications) replaces a 4-level B-tree (4 random reads). Still mostly academic, but Google's internal systems reportedly use learned index components for specific workloads.
Persistent memory (Intel Optane). Non-volatile memory that sits between DRAM and SSD in both speed and capacity. It blurs the line between in-memory and disk-based databases, potentially making the WAL unnecessary (writes persist directly to memory). Intel discontinued Optane in 2022, but CXL-attached persistent memory from other vendors continues the concept.
Disaggregated storage (Aurora, Neon). Separate the compute (query processing) from storage (data persistence) and connect them via a fast network. The storage layer replicates and persists data independently, while compute nodes are stateless and can scale independently. Aurora DSQL (2024) pushes this further with active-active multi-writer across regions — something traditional storage engines were never designed for.
LSM-tree optimizations. WiscKey (2016) separates keys from values in LSM-trees — keys stay in the LSM-tree, values go in a separate log. This reduces write amplification dramatically because only small keys are compacted, not large values. RocksDB's BlobDB implements this pattern. For large values (images, embeddings), this can reduce write amplification from 50x to 5x.
The rise of Rust databases. A new generation of storage engines written in Rust (TiKV, SurrealDB, Qdrant, AgateDB) is challenging the C/C++ incumbents. Rust's ownership model prevents entire classes of bugs (buffer overflows, use-after-free) that have caused data corruption in production databases. The performance is comparable to C++, but the safety guarantees mean fewer catastrophic production incidents.
The biggest trend in storage engines today is convergence. The hard lines between categories are blurring:
| Trend | What's Happening | Examples |
|---|---|---|
| HTAP | Hybrid Transactional/Analytical Processing — one engine handles both OLTP and OLAP | TiDB (B-tree for OLTP + columnar replica for OLAP), SingleStore, AlloyDB |
| Embedded Analytics | Column-oriented engines embedded inside applications, no server needed | DuckDB (the SQLite of analytics), Apache DataFusion |
| Cloud-Native | Separate compute from storage, use object storage (S3) as the durable layer | Snowflake, Neon, Aurora DSQL |
| Vector Indexes | New index type for nearest-neighbor search on high-dimensional embeddings | pgvector, Pinecone, Milvus (using HNSW graphs or IVF indexes) |
With the rise of AI and embeddings, a new category of index has emerged: the vector index. Instead of looking up keys or ranges, you search for the nearest neighbors in a high-dimensional embedding space. "Find the 10 products most similar to this one" is a vector search query.
The dominant algorithm is HNSW (Hierarchical Navigable Small World): a multi-layer graph where each node is an embedding vector. Higher layers contain fewer nodes (for fast, coarse search), lower layers contain all nodes (for precise results). A query navigates top-down, getting closer to the target at each layer. Typical query latency: 1-5ms for finding 10 nearest neighbors among 10 million vectors.
| System | Type | Index Algorithm | Best For |
|---|---|---|---|
| pgvector | PostgreSQL extension | IVFFlat, HNSW | Existing PostgreSQL users, moderate scale (<10M vectors) |
| Pinecone | Managed service | Proprietary (HNSW-based) | No-ops vector search, production RAG systems |
| Milvus | Open-source | IVF, HNSW, DiskANN | Large-scale (>100M vectors), on-premise deployment |
| Weaviate | Open-source | HNSW | AI-native applications with hybrid search |
| Qdrant | Open-source (Rust) | HNSW | High-performance, filtering + vector search |
| Paper/Resource | Why Read It |
|---|---|
| O'Neil et al., "The Log-Structured Merge-Tree" (1996) | The original LSM-tree paper. Short and readable. |
| Bayer & McCreight, "Organization and Maintenance of Large Ordered Indexes" (1972) | The original B-tree paper. Elegant and still relevant. |
| Chang et al., "Bigtable: A Distributed Storage System" (2006) | Google's Bigtable paper. Introduced SSTables and the memtable/compaction model to a generation of engineers. |
| Stonebraker et al., "C-Store: A Column-oriented DBMS" (2005) | The seminal column store paper. Became Vertica. |
| Dong et al., "Optimizing Space Amplification in RocksDB" (2017) | RocksDB's compaction strategies explained by Facebook engineers. |
| Kraska et al., "The Case for Learned Index Structures" (2018) | The provocative paper on replacing B-trees with ML models. |
| Lu et al., "WiscKey: Separating Keys from Values in SSD-Conscious Storage" (2016) | Reducing LSM write amplification by 10x — keys in LSM, values in a separate log. |
| Malviya et al., "Rethinking Main Memory OLTP Recovery" (2014) | How VoltDB achieves durability without traditional WAL — command logging instead. |
| Abadi et al., "The Design and Implementation of Modern Column-Oriented Databases" (2013) | Comprehensive survey of column store techniques by a C-Store/Vertica co-inventor. |
Everything in this chapter comes down to one sentence: different access patterns demand different storage layouts. There is no universal "best" storage engine, just as there is no universal "best" data structure. A hash map is perfect for key lookups, terrible for sorted iteration. A sorted array is perfect for binary search, terrible for insertions. Storage engines are the same — each is optimized for a specific workload pattern.
The mark of a strong engineer is not memorizing which engine to use when. It is understanding why each engine makes the trade-offs it does, so you can reason about novel situations the way the engine designers did. If you understand that LSM-trees convert random writes into sequential writes by buffering in memory — you can predict that any write-heavy workload benefits. If you understand that B-trees keep data in a single sorted structure — you can predict that they offer better read latency and stronger transaction support. If you understand that column stores skip unnecessary data — you can predict that any analytical query benefits from columnar layout.
The principles are timeless. The specific engines will change. The trade-offs will not.
If you want to cement your understanding, try building these (in order of difficulty):
| Challenge | Difficulty | What You Learn |
|---|---|---|
| Implement a Bitcask-style key-value store in Python (as in Ch 9) | Easy (2 hours) | Append-only logs, in-memory indexing, crash recovery via log replay |
| Add compaction and segment merging to the Bitcask implementation | Medium (4 hours) | Background compaction, atomic segment switching, tombstone handling |
| Implement a memtable with sorted output (red-black tree or skip list) | Medium (3 hours) | In-memory balanced trees, sorted iteration for SSTable generation |
| Build a simple SSTable writer and reader with binary search | Medium (4 hours) | File format design, binary search on disk, sparse index creation |
| Implement a Bloom filter and integrate it with your SSTable reader | Easy (1 hour) | Probabilistic data structures, false positive tuning |
| Build a complete 2-level LSM-tree (memtable + L0 + compaction) | Hard (8 hours) | Full LSM lifecycle: write → memtable → flush → compaction → read |
| Implement a B-tree with page splits on a simulated disk (4KB pages) | Hard (8 hours) | Page-based storage, split propagation, WAL for crash recovery |
Each challenge from the table above can be done in a single file in Python. The goal is not production quality — it is to experience the engineering trade-offs firsthand. You will remember the design decisions much better after fighting with them in code.