Designing Data-Intensive Applications — Chapter 4

Storage & Retrieval

B-trees, LSM-trees, column stores — how databases actually store and find your data.

Prerequisites: Basic data structures + Big-O notation. That's it.
11
Chapters
9+
Simulations
5
Interview Dimensions

Chapter 0: The Problem

Imagine you need to build a database. Not PostgreSQL. Not anything fancy. Just a key-value store: you give it a key (like "user:42") and a value (like "Alice"), and later you ask for that key back and it returns "Alice". How hard can it be?

Here is the world's simplest database. Two bash functions:

bash
db_set() {
    echo "$1,$2" >> database   # append key,value to a file
}

db_get() {
    grep "^$1," database | sed "s/^$1,//" | tail -n 1   # find last match
}

db_set appends a line to a file. That is an O(1) operation — the operating system just adds bytes to the end. It does not touch any existing data. This is spectacularly fast. Modern SSDs can sustain hundreds of megabytes per second of sequential writes.

db_get scans the entire file looking for the key, then takes the last matching line (because later writes override earlier ones). This is an O(n) operation — it must read every line in the file. With 10 records, no one notices. With a million records, it takes seconds. With a billion records, it takes hours.

The fundamental trade-off. Our toy database has perfect write performance (O(1) append) and terrible read performance (O(n) scan). Every storage engine in the history of computing is trying to solve this asymmetry. An index is an additional data structure that makes reads faster — at the cost of making writes slightly slower (because you have to update the index on every write). This is the central tension of this entire chapter: reads vs. writes, and the indexes that bridge them.

Watch the Pain Grow

The simulation below shows our append-only database in action. Each "write" appends to the log. Each "read" must scan from the beginning. Watch how read time grows linearly while write time stays flat. Click "Add 50 Records" repeatedly and watch the read bar grow.

The Append-Only Log: O(1) Write, O(n) Read

Click to add records. Watch read latency grow linearly while write stays constant.

0 records in the log. Click "Add 50 Records" to start.

After a few clicks, the read bar is ten or twenty times the write bar. With a million records, you would be staring at a spinning cursor. This is unacceptable for any real application.

A Python Implementation of the World's Worst Database

Let us implement our toy database in Python so we can measure the actual performance difference between writes and reads:

python
import time, os

class WorldsWorstDB:
    """Append-only log with O(1) write and O(n) read."""

    def __init__(self, path="database.txt"):
        self.path = path
        self.f = open(path, 'a+')

    def set(self, key, value):
        """O(1) write — append to file."""
        self.f.write(f"{key},{value}\n")
        self.f.flush()  # ensure durability

    def get(self, key):
        """O(n) read — scan entire file."""
        self.f.seek(0)
        result = None
        for line in self.f:
            k, v = line.strip().split(',', 1)
            if k == key:
                result = v  # keep going — last match wins
        return result

# Benchmark: write 100K records, then read one
db = WorldsWorstDB()
t0 = time.time()
for i in range(100000):
    db.set(f"key{i}", f"value{i}")
write_time = time.time() - t0
# ~0.5 seconds for 100K writes (fast!)

t0 = time.time()
val = db.get("key99999")  # find the last key
read_time = time.time() - t0
# ~0.2 seconds for ONE read (terrible!)

print(f"100K writes: {write_time:.3f}s ({100000/write_time:.0f} writes/sec)")
print(f"1 read:      {read_time:.3f}s")
# Output: 100K writes: 0.45s (222,222 writes/sec)
#         1 read:      0.18s  ← scanning 100K records for ONE lookup

The solution? Keep the fast append-only write, but build an index — a separate data structure that tells you where in the file a key lives, so you can jump straight to it instead of scanning.

What Does a Real Database Look Like Inside?

Before we dive into specific index structures, let us understand the layers of a real database engine. Every database — from SQLite to PostgreSQL to Cassandra — has roughly the same architecture:

Query Parser & Optimizer
Parse SQL, plan the best execution strategy. Decides WHICH indexes to use and in what order. This layer is smart but not the focus of this chapter.
Execution Engine
Executes the plan: scans, joins, sorts, aggregates. Calls the storage engine API to fetch data.
Storage Engine (THIS CHAPTER)
The engine that organizes data on disk and in memory. Implements get(key), put(key, value), scan(start, end). B-tree (InnoDB, PostgreSQL) or LSM-tree (RocksDB, LevelDB). This is the layer that determines read/write performance.
Buffer Pool / Cache
Keeps frequently-accessed disk pages in RAM. A page hit (found in RAM) is 1000x faster than a page fault (must read from disk). The buffer pool is the single most important performance component in any database.
OS / Filesystem / Disk
The physical storage medium. The storage engine writes pages to the filesystem, which writes them to the disk. SSD random I/O: ~100μs. HDD: ~10ms. The storage engine's job is to minimize these I/O operations.

This chapter focuses entirely on the Storage Engine layer — the most architecturally interesting part. Everything above it (query optimization, SQL parsing) and below it (filesystem, disk hardware) is important but orthogonal.

The Memory Hierarchy: Why Storage Matters

To understand why storage engines exist, you need to understand the memory hierarchy. Modern computers have multiple layers of storage, each with wildly different speed and cost:

LayerAccess TimeSizeCost per GBRelative Speed
CPU L1 Cache~1 ns64 KB1x
CPU L2 Cache~4 ns256 KB4x slower
CPU L3 Cache~12 ns8-64 MB12x slower
DRAM (RAM)~100 ns16-512 GB~$5/GB100x slower
NVMe SSD~100 μs1-16 TB~$0.10/GB100,000x slower
HDD~10 ms4-20 TB~$0.02/GB10,000,000x slower
Network (S3)~50 msUnlimited~$0.02/GB50,000,000x slower

RAM is 100,000x faster than SSD. SSD is 100x faster than HDD. Every storage engine is fighting this brutal reality: the fast storage is small and expensive, the large storage is slow and cheap. The entire field of database engineering is about bridging this gap — keeping hot data in fast storage and cold data in slow storage, while making it all look seamless to the application.

The disk I/O wall. A single SSD random read takes ~100 microseconds. In that same time, a modern CPU can execute about 400,000 instructions. This means every unnecessary disk read wastes 400,000 CPU cycles of potential work. This is why reducing disk reads is the single most important optimization in database engineering. A B-tree with 4 levels does 4 reads; reducing it to 2 (via caching) doubles your throughput.

The rest of this lesson explores three families of indexes, each making different trade-offs:

Index TypeReadWriteUsed By
Hash IndexO(1)O(1)Bitcask (Riak)
LSM-Tree / SSTableO(log n)O(1) amortizedLevelDB, RocksDB, Cassandra
B-TreeO(log n)O(log n)PostgreSQL, MySQL, virtually all RDBMS

A Concrete Calculation: How Bad Is O(n)?

Let us put real numbers to this. Suppose each record is 100 bytes (a key, a value, a delimiter). We have an SSD that reads at 500 MB/s sequential throughput:

// 10,000 records: 10,000 × 100 bytes = 1 MB → 2 ms to scan
// 1,000,000 records: 1M × 100 bytes = 100 MB → 200 ms (noticeable lag)
// 100,000,000 records: 100M × 100 bytes = 10 GB → 20 seconds
// 1,000,000,000 records: 1B × 100 bytes = 100 GB → 200 seconds (3+ minutes)

// Meanwhile, every write is still ~100 nanoseconds (append to file)
// The asymmetry is staggering: 1 billion to 1

This is why databases exist. Not for durability (files give you that). Not for fancy SQL syntax. Databases exist because raw files have O(n) reads, and O(n) is unacceptable when n is large.

The Index Trade-Off

An index is any additional data structure that you derive from the primary data. It does not change the data itself — it just provides a faster path to find what you are looking for, like an index at the back of a textbook.

But there is a fundamental trade-off: every index must be updated whenever you write new data. If you have 5 indexes on a table, every insert must update all 5. This is why databases do not index everything by default — each index speeds up reads but slows down writes. The database administrator's job is to choose the right indexes for the application's access pattern.

Interview warm-up: Your team uses an append-only log file as a simple write-ahead log. Reads are performed by scanning the file. A colleague says "We should add an index to speed up reads." What is the unavoidable cost of adding any index?

Chapter 1: Hash Indexes

The simplest possible index is the one you already know from any programming language: a hash map (dictionary, hash table). The idea: keep the append-only log file on disk, but maintain an in-memory hash map that maps every key to the byte offset where that key's most recent value starts in the file.

When you write a key, you append to the file (just like before), then update the hash map to point to the new byte offset. When you read a key, you look it up in the hash map (O(1)), seek to that byte offset on disk (O(1)), and read the value. Both operations are now fast.

This is not a toy idea. This is exactly how Bitcask, the default storage engine of Riak, works. Bitcask offers very fast reads and writes as long as all keys fit in memory.

The Mechanics, Step by Step

Let us trace through a concrete example. We will insert five key-value pairs and watch both the log file and the hash map evolve:

// Initial state: empty file, empty hash map
// Log file: []
// Hash map: {}

// SET "cat" → "Whiskers" (8 bytes for key+value+delimiter)
Log: [cat,Whiskers]    offset 0
Map: { cat → 0 }

// SET "dog" → "Rex" (7 bytes)
Log: [cat,Whiskers][dog,Rex]    offset 13
Map: { cat → 0, dog → 13 }

// SET "bird" → "Tweety" (11 bytes)
Log: [cat,Whiskers][dog,Rex][bird,Tweety]    offset 20
Map: { cat → 0, dog → 13, bird → 20 }

// SET "cat" → "Mittens" (overwrite! 11 bytes)
Log: [cat,Whiskers][dog,Rex][bird,Tweety][cat,Mittens]    offset 31
Map: { cat → 31, dog → 13, bird → 20 } ← cat now points to offset 31

// GET "cat" → look up map: offset 31 → seek to byte 31 → read "Mittens" ✓
// GET "dog" → look up map: offset 13 → seek to byte 13 → read "Rex" ✓

Notice that the old value for "cat" (Whiskers at offset 0) is still in the file. It just wastes space. This is where compaction comes in.

Compaction and Segment Merging

The log file grows forever because we never delete old entries. To reclaim space, we perform compaction: read the entire segment, discard all but the latest value for each key, and write a new, smaller segment file.

In practice, you do this in the background while the current segment continues accepting writes. Once the new compacted segment is ready, you atomically switch reads to it and delete the old one. This is the same pattern as a garbage collector — never block the live path.

You can also break the log into fixed-size segments (e.g., when a file reaches 1 GB, close it and start a new one). Then compaction can merge multiple segments into one, discarding duplicate keys.

Why append-only instead of update-in-place? Four reasons: (1) Sequential writes are vastly faster than random writes on both spinning disks and SSDs — 150x on HDD, 5x on SSD. (2) Crash recovery is simpler — if you crash mid-append, you just discard the last partial record (detected via CRC). If you crash mid-update, the old value might be half-overwritten and corrupted beyond repair. (3) Concurrency is easier — immutable segments can be read without locks while a single writer appends to the active segment. (4) No fragmentation — merging old segments produces clean, compact files with no holes.

Interactive Hash Index

The simulation below lets you insert keys into a Bitcask-style hash index. Watch the log file grow on the left and the hash map update on the right. Try inserting the same key twice to see the pointer update. Click "Compact" to merge duplicate entries.

Bitcask Hash Index Simulator

Type a key and value, then click Set. Insert duplicates to see the hash map update. Click Compact to remove stale entries.

Ready. Try: key="cat", value="Whiskers"

Practical Considerations: Crash Recovery, Deletion, Concurrency

Several engineering details make hash indexes production-ready:

Crash recovery. If the process crashes, the in-memory hash map is lost. You must rebuild it by scanning the entire data file on startup, which is slow for large files. Bitcask speeds this up by storing a "hint file" alongside each segment — a stripped-down version of the segment containing only keys and offsets (not values). The hint file is much smaller and can rebuild the hash map quickly.

Deletion. You cannot remove a record from an append-only file. Instead, append a special tombstone record (a marker that says "this key is deleted"). During compaction, the tombstone causes the compaction process to discard all prior values for that key. The key is removed from the hash map immediately.

Partially written records. The database might crash mid-write, leaving a corrupted partial record at the end of the file. Bitcask includes a CRC checksum with each record so that damaged records can be detected and discarded on recovery.

Concurrency. Since the file is append-only, writes are strictly sequential — only one writer thread. But multiple reader threads can read concurrently without locks, because the data file segments are immutable once written (only the active segment is being appended to, and the hash map protects reads from seeing partial writes via the pointer update being atomic).

python
# Tombstone pattern for deletions
TOMBSTONE = "__DELETED__"

def delete(db, key):
    """Mark a key as deleted. Compaction removes it later."""
    db.set(key, TOMBSTONE)  # append tombstone to log
    del db.index[key]         # remove from in-memory hash map

def compact(old_segment, new_segment):
    """Remove tombstoned and duplicate keys."""
    seen = {}
    for key, value in read_segment(old_segment):
        if value == TOMBSTONE:
            seen.pop(key, None)  # remove if present
        else:
            seen[key] = value     # keep latest value
    for key, value in seen.items():
        write_to_segment(new_segment, key, value)

Limitations of Hash Indexes

Hash indexes are fast and simple, but they have two critical limitations:

1. All keys must fit in RAM. The hash map is entirely in memory. If you have a billion unique keys, each requiring ~64 bytes for the key plus 8 bytes for the offset, that is 72 GB of RAM just for the index. This is workable for some use cases (Bitcask was designed for relatively few keys with frequently updated values, like URL visit counters) but not for general-purpose databases.

2. Range queries are terrible. Want to find all keys between "user:1000" and "user:2000"? A hash map cannot help you. You would have to scan every key in the map and check if it falls in the range. There is no ordering. This is O(n) for range queries regardless of the result set size.

Real-world Bitcask use case. Bitcask was Riak's default engine for a reason: it excelled at workloads with a moderate number of keys (millions, not billions) that were updated very frequently (e.g., per-URL counters, per-user session data). Each key-value pair was small. The entire key set fit comfortably in RAM. Reads hit a single disk seek. Writes were pure sequential appends. For this niche, Bitcask was faster than any B-tree engine.

Worked Example: Memory Budget for Hash Index

Let us calculate exactly how much RAM a hash index needs for a realistic workload:

// Scenario: Session store for a social media platform
// 50 million active sessions
// Key: session_id (UUID, 36 bytes as string)
// Value: serialized session data (avg 2 KB, stored on disk)

// In-memory hash map needs per entry:
// - Key copy: 36 bytes
// - Byte offset: 8 bytes (64-bit pointer)
// - Hash table overhead: ~40 bytes (bucket pointer, hash value, next pointer)
// Total per entry: 36 + 8 + 40 = 84 bytes

// With load factor 0.75 (standard for hash tables):
// Actual memory = 50M × 84 bytes / 0.75 = 5.6 GB

// This fits in a 16 GB instance with plenty of room for OS, buffers, etc.
// But 500M sessions? 56 GB. Getting expensive.
// 5B sessions? 560 GB. Need a different approach (LSM-tree).

Implementation Detail: File Format

A production Bitcask-style log format is more structured than our toy "key,value\n" approach. Each record includes a header with metadata:

python
import struct, zlib, time

# On-disk record format (Bitcask-style):
# ┌──────────┬───────────┬─────────┬──────────┬───────┬────────┐
# │ CRC (4B) │ tstamp(4B)│ ksz(2B) │ vsz (4B) │ key   │ value  │
# └──────────┴───────────┴─────────┴──────────┴───────┴────────┘

HEADER_FMT = "!IIhI"  # CRC, timestamp, key_size, value_size
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # = 14 bytes

def encode_record(key, value):
    """Encode a key-value pair into the on-disk format."""
    k = key.encode() if isinstance(key, str) else key
    v = value.encode() if isinstance(value, str) else value
    ts = int(time.time())
    payload = struct.pack(HEADER_FMT[1:], ts, len(k), len(v)) + k + v
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return struct.pack("!I", crc) + payload

def decode_record(data, offset):
    """Decode a record starting at the given offset."""
    crc, ts, ksz, vsz = struct.unpack(HEADER_FMT, data[offset:offset+HEADER_SIZE])
    k = data[offset+HEADER_SIZE : offset+HEADER_SIZE+ksz]
    v = data[offset+HEADER_SIZE+ksz : offset+HEADER_SIZE+ksz+vsz]
    # Verify CRC to detect corruption
    payload = data[offset+4 : offset+HEADER_SIZE+ksz+vsz]
    assert zlib.crc32(payload) & 0xFFFFFFFF == crc, "Corrupted record!"
    return k.decode(), v.decode(), HEADER_SIZE + ksz + vsz

Segment Strategy: How Bitcask Manages Disk Space

In a real Bitcask deployment, the append-only log is divided into segments. When a segment reaches a size threshold (e.g., 1 GB), it is closed (becomes immutable) and a new segment is opened for writes. Here is the full lifecycle:

1. Active Segment
All new writes append to this one file. Only one active segment at a time. When it reaches the size limit, freeze it.
↓ segment reaches 1 GB
2. Freeze & Open New
The active segment becomes immutable. Open a new empty segment for writes. Readers can still access the frozen segment via the hash map.
↓ background thread
3. Compact & Merge
Background thread picks multiple frozen segments, reads them all, keeps only the latest value for each key, writes a single new compacted segment. Atomically swap pointers in the hash map. Delete old segments.

The merge process is crucial because without it, disk usage grows without bound. Even if you only have 1 million unique keys, writing each key 1000 times creates 1 billion records across many segments. Compaction reduces this back to 1 million records.

// Segment merge example:
// Segment 1 (old): {cat→Whiskers, dog→Rex, bird→Tweety}
// Segment 2 (newer): {cat→Mittens, fish→Nemo}
// Segment 3 (newest): {dog→Fido, cat→Shadow}

// Merge result (newest value for each key):
// {bird→Tweety, cat→Shadow, dog→Fido, fish→Nemo}
// 8 records across 3 segments → 4 records in 1 segment
// 50% space reclaimed
Interview question: You are building a URL shortener. Each shortened URL maps to a long URL. You expect 100 million unique URLs, each key averaging 8 characters (~8 bytes) plus an 8-byte offset pointer. Your server has 16 GB of RAM. Can you use a Bitcask-style hash index, and should you?

Chapter 2: SSTables & LSM-Trees

What if we take the append-only log from Chapter 1 but add one constraint: the keys in each segment file must be sorted? This seemingly simple change — called a Sorted String Table (SSTable) — gives us three enormous advantages over the hash index approach.

Why Sorted Keys Change Everything

Advantage 1: Efficient merging. When you need to compact multiple segment files into one, you can merge them like merge sort — read from all input segments simultaneously, always picking the lowest key. If the same key appears in multiple segments, keep only the value from the most recent segment. This runs in O(n) time and O(1) memory regardless of file sizes, because each segment is already sorted.

Advantage 2: Sparse index. You no longer need every key in the in-memory index. If you know that key "handbag" is at offset 12000 and key "handsaw" is at offset 13200, then "handler" must be somewhere between those offsets. You jump to offset 12000 and scan forward until you find it. With one index entry per few kilobytes of data, you can cover terabytes of on-disk data with a tiny in-memory index.

Advantage 3: Range queries. Because keys are sorted, finding all keys between "user:1000" and "user:2000" means seeking to the first key ≥ "user:1000" and scanning forward until you pass "user:2000". This is efficient — you only read the data you need.

But How Do You Sort an Append-Only Log?

You cannot sort a file as you append to it — that would require random I/O, which defeats the whole purpose. The trick is to separate the write path from the on-disk format:

1. Memtable (in-memory)
Incoming writes go to a balanced binary tree in RAM (a red-black tree, AVL tree, or skip list). This keeps keys sorted in memory. Typical size: a few megabytes.
↓ when memtable exceeds threshold
2. Flush to SSTable (on disk)
Write the entire sorted memtable to disk as a new SSTable file. This is a single sequential write — the tree is traversed in order, producing sorted output. Start a fresh empty memtable for new writes.
↓ background process
3. Compaction (merge SSTables)
Periodically merge multiple SSTable files into fewer, larger files. This removes deleted keys and duplicate entries. The merge is efficient because all input files are already sorted.

This three-level architecture is called a Log-Structured Merge-Tree (LSM-Tree). It was originally described by Patrick O'Neil in 1996 and is the engine behind LevelDB, RocksDB, Cassandra, HBase, and many modern databases.

The Read Path: Where Is My Key?

Reading from an LSM-tree requires checking multiple places, in order:

// Read path for key K:
1. Check the memtable (in-memory balanced tree)    O(log n)
2. Check the most recent SSTable on disk    O(log n) with sparse index
3. Check the next most recent SSTable    O(log n)
4. ... continue through older SSTables ...
5. If not found anywhere → key does not exist

// Worst case: key doesn't exist → must check ALL levels
// This is expensive! Solution: Bloom filters (see below)

The worst case is looking up a key that does not exist. You must check the memtable and every SSTable on disk before you can conclude it is not there. This is where Bloom filters save the day.

Bloom Filters: Avoiding Unnecessary Disk Reads

A Bloom filter is a memory-efficient probabilistic data structure that can tell you "this key is definitely not in this SSTable" or "this key might be in this SSTable." It never gives false negatives, but it can give false positives (at a controllable rate, typically 1-2%).

How it works: a Bloom filter is a bit array of m bits, initially all zeros. To add a key, hash it with k different hash functions, each producing a position in the bit array. Set those k bits to 1. To check if a key exists, hash it with the same k functions and check if all k bits are set. If any bit is 0, the key is definitely not in the set. If all are 1, the key probably is (but might be a false positive due to hash collisions).

// Bloom filter false positive rate:
FPR ≈ (1 - e-kn/m)k

// Where: m = bits in filter, n = number of keys, k = number of hash functions

// For a 1% false positive rate with 1M keys:
// m ≈ 10 × n = 10M bits = 1.25 MB
// k = (m/n) × ln(2) ≈ 10 × 0.693 ≈ 7 hash functions

// 1.25 MB of Bloom filter saves you from reading a 500 MB SSTable
// on 99% of non-existent key lookups. Extraordinary ROI.

Each SSTable has its own Bloom filter, stored alongside it. Before seeking into an SSTable file, you check its Bloom filter. If the filter says "no," you skip that file entirely. This turns most non-existent-key lookups from O(levels × log n) into O(levels), checking only a few bytes of Bloom filter per level.

Bloom filters in practice. RocksDB uses Bloom filters on every SSTable with a default 10 bits per key (about 1% false positive rate). LevelDB does the same. Cassandra lets you configure the false positive rate per table — lower rates use more memory but save more disk I/O. For a table with mostly point lookups and many non-existent key queries (e.g., checking if a username is taken), tuning the Bloom filter can make a 10x difference in read latency.

Worked Example: Complete Read and Write Paths

Let us trace a write and a read through a realistic LSM-tree with 3 levels:

// ═══ WRITE PATH: SET "user:42" → "Alice" ═══

// Step 1: Append to WAL (crash recovery)
WAL: ... | SET user:42 Alice |    → fsync to disk (guarantees durability)

// Step 2: Insert into memtable (in-memory red-black tree)
Memtable: { ..., user:39→Bob, user:42→Alice, user:45→Carol, ... }
// Memtable size: 12 MB / 16 MB limit

// Step 3: If memtable is full, freeze it and flush
// The frozen memtable becomes an immutable SSTable
// A new empty memtable is created for incoming writes
// Clients never block — they write to the new memtable immediately

// ═══ READ PATH: GET "user:42" ═══

// Step 1: Check active memtable
Memtable: contains user:42? → YES → return "Alice" ✓
// If YES, done. This is the fastest case (pure RAM).

// Step 2 (if not in memtable): Check L0 SSTables (newest → oldest)
L0-SST-3: Bloom filter says "maybe" → binary search → NOT FOUND
L0-SST-2: Bloom filter says "no" → SKIP (saved a disk read!)
L0-SST-1: Bloom filter says "maybe" → binary search → FOUND → return "Alice" ✓

// Step 3 (if not in L0): Check L1 SSTable
// L1 has non-overlapping key ranges, so only one SSTable could contain user:42
L1-SST containing [user:30 ... user:50]: sparse index → offset 48200
// Seek to offset 48200, scan forward → FOUND or NOT FOUND

// Step 4 (if not in L1): Check L2 SSTable
// Same process. If not found in any level → key does not exist.
The memtable acts as a write buffer AND a read cache. Any key written in the last few seconds is almost certainly still in the memtable, so reads for recently-written keys are pure in-memory lookups — faster than any disk-based index. This is why LSM-trees excel at workloads where recent data is read most frequently (which is most workloads).

Handling Deletes: Tombstones

You cannot simply remove a key from the memtable and call it deleted. The key might still exist in older SSTables on disk. If someone reads the key, they would find the old value in an SSTable and think it still exists.

The solution: write a special marker called a tombstone. The tombstone is a record that says "this key has been deleted." When the read path encounters a tombstone (in the memtable or any SSTable), it returns "key not found" without checking deeper levels. During compaction, tombstones cancel out the key-value pairs they cover, and both are discarded once the tombstone reaches the deepest level.

Tombstones are one of the trickiest parts of LSM-tree engineering. If you delete many keys but compaction has not run yet, the tombstones pile up and make reads slower (you still have to check all the SSTables, only to find tombstones). This is why some databases (like Cassandra) let you set a gc_grace_seconds parameter that controls how long tombstones linger before compaction is forced.

Compaction Strategies

There are two main approaches to compaction, and they produce very different behaviors:

StrategyHow it worksWrite amplificationRead amplificationSpace amplificationUsed by
Size-tieredMerge similarly-sized SSTables into one larger one. Newer, smaller SSTables are merged first.LowerHigher (more files to check)Higher (old data lingers)Cassandra, HBase
LeveledOrganize SSTables into levels. Each level is 10x the size of the previous. SSTables within a level have non-overlapping key ranges.Higher (more rewriting)Lower (at most one file per level)Lower (less redundancy)LevelDB, RocksDB
The three amplifications. Every storage engine is evaluated on three "amplification" metrics: Write amplification = how many bytes you write to disk for each byte of incoming data (LSM-trees typically 10-30x). Read amplification = how many disk reads per logical read (LSM must check multiple levels). Space amplification = how much extra disk space is used beyond the actual data size (due to stale data awaiting compaction). These three compete — improving one usually worsens another. This is THE framework for comparing storage engines in interviews.

Size-Tiered Compaction: Visual Walkthrough

In size-tiered compaction, SSTables are grouped by size. When you accumulate enough similarly-sized SSTables (typically 4), they are merged into one larger SSTable. Think of it like a tournament bracket — small files merge into medium files, medium into large.

// Size-tiered compaction example:
// Each memtable flush produces a ~64 MB SSTable

// After 4 flushes: [64MB] [64MB] [64MB] [64MB]
// → merge into one [256MB] SSTable

// After 4 more flushes: [64MB] [64MB] [64MB] [64MB] [256MB]
// → merge the four 64MB → second [256MB]

// Now we have: [256MB] [256MB] [256MB] [256MB]
// → merge into one [1GB] SSTable

// Eventually: [1GB] [1GB] [1GB] [1GB] → [4GB] → [16GB] → ...

// Problem: during compaction, both old and new SSTables exist temporarily
// Worst case: 2x disk space needed (the old files + the new merged file)

Leveled Compaction: Visual Walkthrough

In leveled compaction (used by LevelDB and RocksDB), SSTables are organized into levels. Each level is 10x the total size of the previous. Within a level (except L0), SSTable key ranges do not overlap — this is the key invariant that makes reads fast.

// Leveled compaction layout (size ratio T = 10):
L0: [sst] [sst] [sst] [sst]    ← may overlap, up to 4 files
L1: [a-f] [g-m] [n-s] [t-z]    ← non-overlapping ranges, ~640MB total
L2: [a-c] [d-f] [g-i] ... [x-z]    ← non-overlapping, ~6.4GB total
L3: [a-b] [c-d] [e-f] ... [y-z]    ← non-overlapping, ~64GB total

// Read path (key = "hello"):
// L0: check all 4 files (overlapping ranges)
// L1: only check [g-m] (binary search on SSTable boundaries)
// L2: only check one SSTable containing "h" range
// L3: only check one SSTable
// Total: at most 4 + 1 + 1 + 1 = 7 files (vs potentially dozens in size-tiered)

// Compaction process (L1 → L2):
// Pick one L1 SSTable, e.g., [g-m]
// Find all L2 SSTables whose ranges overlap with [g-m]
// Merge them all → write new L2 SSTables with non-overlapping ranges
// Delete old files atomically
The leveled compaction guarantee. At most one SSTable per level can contain any given key (except L0). This means a point lookup checks at most: all L0 files (usually 4) + one file per remaining level. For a 1 TB database with 4 levels, worst case is 4 + 4 = 8 file checks. With Bloom filters eliminating ~99% of false checks, average is 4 + 1 = 5 checks. Compare to size-tiered where you might have dozens of overlapping SSTables to check.

LSM-Tree Visualization

The simulation below shows an LSM-tree in action. Writes go to the memtable. When it fills, it flushes to Level 0. Background compaction merges L0 into L1. Watch the data flow through levels.

LSM-Tree: Write & Compaction Flow

Click "Write" to add a key. Watch memtable fill and flush to disk. "Compact" merges L0 into L1.

Memtable empty. Write some keys to begin.

Worked Example: Write Amplification

Let us calculate the write amplification of leveled compaction with a size ratio of 10:

// Leveled compaction, size ratio T = 10
// Data written to memtable: 1x (original write)
// Flush memtable to L0: 1x (sequential write)
// Compact L0 → L1: data is merged into L1 = ~1x rewrite
// Compact L1 → L2: all L1 data merged into L2 = ~1x rewrite
// ... each level adds ~1 rewrite ...

// With T=10 and L levels:
Write amplification ≈ L × T = L × 10

// For a 1 TB database with 64 MB memtable:
// Number of levels = log₁₀(1TB / 64MB) ≈ log₁₀(16384) ≈ 4.2 → 5 levels
// Worst-case write amplification ≈ 5 × 10 = 50x
// That means writing 1 byte of user data causes 50 bytes of disk I/O!
Why do people tolerate 50x write amplification? Because those 50 writes are all sequential. On an SSD, sequential writes are 10-100x faster than random writes. A B-tree might have only 2x write amplification, but each write is a random I/O. Sequential at 50x can still beat random at 2x, depending on the hardware. This is the key insight behind LSM-trees.

LSM-Tree Tuning: The Knobs That Matter

Production LSM-tree engines like RocksDB expose dozens of tuning parameters. The ones that matter most:

ParameterDefault (RocksDB)Effect of IncreasingTrade-off
write_buffer_size (memtable size)64 MBFewer flushes, larger SSTables, higher write throughputMore memory usage, longer recovery after crash (more WAL to replay)
max_write_buffer_number2More memtables can exist before stall, smoother writesMore memory usage
level0_file_num_compaction_trigger4Fewer compactions triggered, higher write throughputMore L0 files to check per read, higher read amplification
max_bytes_for_level_base (L1 size)256 MBMore data per level, fewer levels for same data sizeEach compaction moves more data, longer compaction pauses
max_background_compactions1Parallel compaction, keep up with high write ratesMore CPU and disk I/O consumed by background work
bloom_bits_per_key10Lower false positive rate, fewer wasted disk readsMore memory per SSTable
The universal tuning heuristic for LSM-trees. If writes stall: increase memtable size, increase L0 trigger, add more compaction threads. If reads are slow: decrease L0 trigger (fewer L0 files), increase Bloom filter bits, increase block cache size. If disk space is an issue: reduce the size ratio (e.g., from 10 to 8) for more aggressive compaction at the cost of higher write amplification. There is no free lunch — every improvement somewhere costs something elsewhere.
Design question: You are building a time-series database that ingests 1 million sensor readings per second. Each reading has a timestamp key and a float value. Reads are mostly range queries ("give me all readings between 2pm and 3pm"). Would you use a hash index or an LSM-tree? Why?

Chapter 3: B-Trees

While LSM-trees are the new hotness, the B-tree has been the dominant index structure since 1970. Virtually every relational database — PostgreSQL, MySQL, Oracle, SQL Server — uses B-trees as their primary index structure. If you have used a database, you have used a B-tree.

Like SSTables, B-trees keep keys sorted, which enables efficient range queries. But the design philosophy is completely different. Where LSM-trees buffer writes in memory and periodically flush sorted files, B-trees update data in place on disk.

The Structure: Pages and Pointers

A B-tree organizes data into fixed-size pages (also called blocks), typically 4 KB — the same size as a disk sector or virtual memory page. This is not a coincidence. It means reading one B-tree node costs exactly one disk I/O.

There are two types of pages:

Internal pages contain keys and pointers to child pages. If an internal page contains keys [10, 20, 30], it has four child pointers: one for keys < 10, one for keys 10-19, one for keys 20-29, and one for keys ≥ 30. The number of children per page is called the branching factor.

Leaf pages contain the actual key-value pairs (or keys plus pointers to the rows in a separate heap file).

Deriving the Depth: Why B-Trees Are So Shallow

The magic of B-trees is in the branching factor. A typical 4 KB page can hold hundreds of key-pointer pairs. Let us work through the math:

// Typical page size: 4096 bytes
// Each key: ~8 bytes (64-bit integer)
// Each pointer: ~6 bytes (page ID)
// Per entry: 8 + 6 = 14 bytes + some overhead ≈ 16 bytes
// Branching factor b ≈ 4096 / 16 = 256

// More optimistic with prefix compression: b ≈ 500

// Number of keys in a B-tree with depth d and branching factor b:
n = bd

// So depth = log_b(n)
d = logb(n) = log(n) / log(b)

// Worked examples with b = 500:
d=1: 500 keys
d=2: 5002 = 250,000 keys    (250K)
d=3: 5003 = 125,000,000 keys    (125M)
d=4: 5004 = 62,500,000,000 keys    (62.5B — more than any table)

// Key insight: even with 62.5 BILLION keys, you need only 4 disk reads
// to find any key. And the top 2-3 levels are cached in RAM,
// so in practice it's 1-2 disk reads per lookup.
Four levels store 62.5 billion keys. This is the most important number in database internals. A B-tree with branching factor 500 needs only 4 levels to index more data than any single machine could hold. Since the root page is always cached in RAM, and the second level is usually cached too, most lookups need at most 2 disk reads. Compare this to a binary search tree, where depth = log2(62.5B) = 36 disk reads. The branching factor is what makes B-trees practical.

Lookups: Following the Pointers

To find a key in a B-tree, start at the root page. Binary-search within the page to find the right child pointer. Follow it. Repeat until you reach a leaf. Each level costs one disk read (or one cache hit if the page is already in the buffer pool).

// Lookup example: find key 42 in a B-tree with branching factor 4

// Root page: [20, 50, 80]
// → 42 is between 20 and 50 → follow second child pointer

// Internal page: [30, 40, 45]
// → 42 is between 40 and 45 → follow third child pointer

// Leaf page: [41, 42, 43, 44]
// → binary search within page → found key 42 → return value

// Total: 3 page reads. On a disk with 200μs seek time: 600μs.
// With top 2 levels cached: 1 page read = 200μs.

Insertions: Splitting Pages

To insert a new key, follow the lookup path to the correct leaf page. If the leaf has room, insert the key in sorted order within the page. If the leaf is full, split it into two half-full pages and promote the middle key to the parent. If the parent is also full, split it too. This can cascade up to the root, which is the only way a B-tree grows taller.

Let us trace through a page split step by step:

// Leaf page is full (max 4 keys): [10, 20, 30, 40]
// We want to insert key 25.

// Step 1: Find where 25 goes → between 20 and 30
// Step 2: Page is full → SPLIT
// Step 3: Split at median (key 25 in this case):
// Left page: [10, 20]
// Right page: [30, 40]
// Insert 25 into left: [10, 20, 25]
// Promote median (25) to parent

// Parent page before: [15, 35] with pointers to children
// Parent page after: [15, 25, 35] with new pointer to the right split page

// If the parent was also full, IT would split too,
// potentially all the way up to the root.

The average fill factor of a B-tree after many random insertions is about 69% (ln 2). This means roughly one-third of the space in each page is wasted. Some databases let you set a lower initial fill factor (e.g., PostgreSQL's fillfactor parameter) to leave room for future updates without splitting, at the cost of a deeper tree.

The Write-Ahead Log (WAL)

B-trees update pages in place. If the system crashes mid-write (say, while splitting a page), the tree could be left in a corrupted state — one page has been written but the parent pointer has not been updated yet, leaving an orphan page.

The solution: before modifying any page, write the intended change to an append-only write-ahead log (WAL, also called redo log). If the system crashes, replay the WAL to bring the tree back to a consistent state. This means every B-tree write actually hits the disk twice: once for the WAL, once for the actual page update.

Interactive B-Tree

The simulation below lets you insert keys into a B-tree with a small branching factor (4, so you can see splits happen). Watch pages fill and split as you add keys.

B-Tree Insertion Visualizer (branching factor = 4)

Click "Insert" to add a random key. Watch pages fill, then split when full. The tree grows taller only when the root splits.

Empty tree. Insert some keys.

Concurrency: Latches, Not Locks

When multiple threads read and write the B-tree simultaneously, you need concurrency control. B-trees use latches (lightweight mutexes) on individual pages, not full database-level locks. A common strategy: acquire a read latch on the root, descend to the child, acquire a latch on the child, release the parent's latch. This "latch crabbing" protocol minimizes contention.

An alternative approach used by some modern databases (LMDB, BoltDB) is copy-on-write: instead of modifying a page in place, create a new version of the page and atomically update the parent pointer. This eliminates the need for latches and the WAL, but uses more disk space.

B-Tree vs B+ Tree

Almost all databases actually use a B+ tree, not a classic B-tree. The difference:

FeatureB-treeB+ tree (what databases use)
Values in internal nodesYes — each key has its value in the nodeNo — internal nodes only store keys and child pointers
Values in leaf nodesValues at every levelAll values are in leaf nodes only
Leaf node linksNo sibling pointersLeaves linked in a doubly-linked list
Range queriesRequires tree traversalFind first key, then follow leaf pointers
Branching factorLower (values take space in internal nodes)Higher (internal nodes are pure routing)

The B+ tree's leaf-linked-list makes range queries extremely efficient: find the starting key (O(log n) tree traversal), then follow the leaf pointers sequentially. No need to go back up and down the tree. This is why PostgreSQL, MySQL, and SQLite all use B+ trees.

B-Tree Optimizations in Practice

Production B-tree implementations use several tricks beyond the basic algorithm:

Prefix compression. Keys within a page often share a common prefix (e.g., "user:1001", "user:1002", "user:1003"). Store the common prefix once and only the differing suffixes per key. This increases the branching factor — more keys per page = shallower tree.

Bulk loading. Building a B-tree by inserting keys one at a time is O(n log n) and creates a tree with ~69% fill factor. Bulk loading sorts all keys first, then builds the tree bottom-up, filling each page to 100%. This creates a more compact tree and is much faster. PostgreSQL's CREATE INDEX CONCURRENTLY uses this approach.

Fractal trees. A hybrid of B-trees and LSM-trees. Like a B-tree, but each internal node has a small buffer for pending insertions. When the buffer fills, the insertions are pushed down to children in a batch. This gives near-LSM write performance with B-tree read performance. TokuDB (now part of Percona) used fractal trees.

Buffer pool management. The database maintains a fixed-size buffer pool of B-tree pages in RAM. Popular pages (the root, heavily-accessed leaves) stay resident. Eviction uses LRU or clock algorithms. PostgreSQL's shared_buffers parameter controls this — typically set to 25% of available RAM.

// Buffer pool math for PostgreSQL:
// Server: 64 GB RAM, shared_buffers = 16 GB
// Page size: 8 KB (PostgreSQL default)
// Buffer pool pages: 16 GB / 8 KB = 2 million pages

// B-tree on a 500M row table with branching factor 200:
// Depth = log₂₀₀(500M) ≈ 3.8 → 4 levels
// Total pages: root(1) + L1(200) + L2(40K) + L3(8M) ≈ 8 million pages
// Buffer pool holds 2M / 8M = 25% of the index
// Top 3 levels (40,201 pages) are always cached → 2 levels in RAM
// Most lookups: 1-2 disk reads (leaf + maybe one internal page)

B-Tree Deletion: Merging Pages

Deletion in B-trees is the inverse of insertion. Remove the key from the leaf page. If the leaf now has fewer keys than the minimum (typically half the maximum), it is underful. The tree must rebalance by either borrowing a key from a sibling page (called rotation) or merging the underful page with a sibling (the inverse of a split). Merged pages propagate up the tree, just like splits.

In practice, many databases do not merge pages immediately on deletion. PostgreSQL marks deleted entries as "dead" and relies on VACUUM to reclaim space later. This is simpler and avoids the complexity of page merges, but it means deleted rows leave behind dead space (called bloat). If a table has 50% bloat, it is taking twice the disk space it needs, and scans read twice as many pages.

// PostgreSQL table bloat example:
// Table: 100 GB on disk
// After deleting 40% of rows: still 100 GB (dead tuples remain)
// After VACUUM: dead tuples marked as reusable, but disk not freed
// After VACUUM FULL: table rewritten, 60 GB (but locks table exclusively!)

// Check bloat: SELECT pg_stat_user_tables.n_dead_tup
// Healthy: n_dead_tup < 10% of n_live_tup
// Unhealthy: n_dead_tup > 50% of n_live_tup → run VACUUM

B-Tree Performance Characteristics

Let us derive the exact I/O cost for B-tree operations:

// Point lookup (read one key):
Disk reads = depth of tree = logb(n)
// With top levels cached: depth - 2 disk reads typically

// Range query (read k consecutive keys):
Disk reads = logb(n) + k / keys_per_page
// B+ tree: after finding the first key, follow leaf pointers
// Each leaf page holds ~200 keys → 10,000 keys = 50 leaf reads

// Insert (one key):
Disk reads = depth (find the leaf)
Disk writes = 1 (WAL) + 1 (page write) = 2
// If page splits: additional 1 read + 2 writes per split level
// Splits are rare: probability ≈ 1/b per insert (once every ~200 inserts)

// Delete (one key):
Disk reads = depth (find the key)
Disk writes = 1 (WAL) + 1 (mark dead) = 2
// No merge — just mark as dead, VACUUM later
Debug scenario: Your PostgreSQL database has a table with 500 million rows and a B-tree index on the primary key (64-bit integer). You run a point lookup and it takes 15 ms. Your colleague says the index must be corrupt because "B-trees should be instant." What is your diagnosis?

Chapter 4: B-Trees vs LSM-Trees

This is THE comparison question in systems design interviews. "When would you use an LSM-tree versus a B-tree?" If you can answer this fluently, you demonstrate a deep understanding of storage engine trade-offs.

The Write Path: Sequential vs Random

LSM-trees batch writes into memory (the memtable) and flush them as large sequential writes. B-trees perform random I/O — seeking to the correct page and updating it in place. On any storage medium (HDD, SSD, or cloud block storage), sequential I/O is faster than random I/O:

OperationHDDSSDCloud (EBS gp3)
Sequential write~150 MB/s~500 MB/s~250 MB/s
Random write (4KB)~1 MB/s (100 IOPS)~100 MB/s (25K IOPS)~50 MB/s (12K IOPS)
Sequential/Random ratio150x5x5x

On spinning disks, the advantage is massive (150x). On SSDs, it is smaller but still significant (5x). This is why LSM-trees can sustain higher write throughput despite higher write amplification.

Write Amplification: The Detailed Comparison

// B-tree write amplification:
1. Write to WAL:    1x (sequential append)
2. Write to page:    1x (random I/O, full 4KB page even for small update)
Total: 2x minimum, but the 4KB page write for a 100-byte update is
effectively 40x amplification at the byte level.

// LSM-tree write amplification (leveled, 10 levels, size ratio 10):
1. Write to WAL:    1x (sequential append)
2. Write to memtable:    0x (RAM only)
3. Flush + compaction across levels:    10-50x (all sequential)
Total: 11-51x, but all sequential I/O.

// Net effect on SSDs:
// B-tree: 2 random writes × (1/25K IOPS) = 80 μs per write
// LSM: 50 sequential bytes × (1/500 MB/s) = 0.1 μs per write (amortized)
// LSM wins on write throughput despite higher amplification!

The Read Path: One Seek vs Many Checks

B-trees have a clear read advantage. A point lookup touches at most depth pages (typically 2-3 disk reads, often 1 due to caching). An LSM-tree must check the memtable, then potentially every level of SSTables — even with Bloom filters, this can mean 2-5 disk reads for a cold key.

Latency predictability is another B-tree strength. B-tree read latency is consistent: always depth disk reads. LSM-tree read latency varies — it might be in the memtable (instant) or buried in the oldest SSTable (multiple disk reads). If you need predictable p99 latency (e.g., for a financial trading system), B-trees win.

The Compaction Tax

LSM-trees must perform background compaction to merge SSTables. This compaction competes with foreground operations for disk bandwidth and CPU. If the write rate exceeds the compaction throughput, uncompacted SSTables pile up, reads slow down (more files to check), and eventually the database must throttle writes. This is called compaction debt.

B-trees have no compaction. Their write overhead is paid immediately (WAL + page write), not deferred. This makes their performance more predictable under sustained load.

SSD Lifetime: The Hidden Cost of Write Amplification

SSDs have a finite number of write cycles per cell (typically 1,000-10,000 for TLC NAND). Write amplification directly reduces SSD lifetime. If your LSM-tree has 30x write amplification and you are writing 1 GB/s of user data, the SSD sees 30 GB/s of actual writes. An SSD rated for 1 petabyte of total writes would last:

// SSD lifetime with write amplification:
// User write rate: 1 GB/s = 86.4 TB/day
// Actual SSD writes: 30 × 86.4 = 2,592 TB/day
// SSD rated for 1 PB total writes: 1,000 TB / 2,592 TB/day ≈ 0.39 days!

// At a more realistic 10 MB/s user writes:
// Actual: 300 MB/s → 25.9 TB/day → 1,000 TB / 25.9 ≈ 39 days

// B-tree with 2x amplification at 10 MB/s:
// Actual: 20 MB/s → 1.73 TB/day → 1,000 TB / 1.73 ≈ 579 days

// 15x longer SSD lifetime with B-tree. This matters at scale!
Write amplification is a cost multiplier. In cloud environments, you pay for IOPS and throughput. 30x write amplification means your effective storage cost is 30x what the bill says. For a write-heavy service running on AWS EBS gp3 volumes, the difference between an LSM-tree and a B-tree can be thousands of dollars per month per node. This is why storage engine selection is an infrastructure cost decision, not just a performance decision.

Transaction Support

B-trees naturally support transactions because each key lives in exactly one place (one page). You can lock individual keys, implement MVCC by storing multiple versions in the page, and ensure serializability with page-level latches.

LSM-trees are harder to transact on. A key might exist in the memtable AND in multiple SSTables simultaneously (with different versions). Implementing snapshot isolation requires careful coordination between the memtable and all SSTable levels. This is doable (RocksDB supports transactions via a pessimistic or optimistic concurrency control layer on top of the LSM), but it adds complexity and overhead that B-trees handle naturally.

Compression: LSM-Trees Win Decisively

LSM-trees achieve much better compression than B-trees, for two reasons:

1. No fragmentation. B-tree pages are rarely 100% full. After many random insertions and deletions, the average page fill factor is about 69%. That means 31% of your on-disk index is wasted space. LSM-tree SSTables are written sequentially and compacted, so they have no internal fragmentation.

2. Compression-friendly layout. SSTable files contain sorted keys, which means adjacent entries often share prefixes and have similar values. This is ideal for dictionary encoding, prefix compression, and LZ4/Snappy/Zstd compression. B-tree pages are too small (4 KB) for compression to be as effective, and compressing/decompressing individual pages adds latency to every read.

// Space efficiency comparison (1 billion 100-byte key-value pairs):

// Raw data size: 1B × 100 bytes = 100 GB

// B-tree (InnoDB):
// Pages: 69% fill factor → 100 GB / 0.69 = 145 GB data pages
// Internal pages: ~1% overhead = 1.5 GB
// Total: ~147 GB (1.47x raw data)

// LSM-tree (RocksDB with Zstd compression):
// Compressed data: 100 GB × 0.3 (typical compression ratio) = 30 GB
// Space amplification from compaction overlap: ×1.1 = 33 GB
// Total: ~33 GB (0.33x raw data)

// LSM uses 4.5x less disk space than B-tree!
// At $0.10/GB/month for cloud EBS: $14.70/month vs $3.30/month

The Emerging Middle Ground: Bw-Trees and Fractal Trees

Some modern storage engines try to get the best of both worlds:

Bw-Tree (Buzzword-Tree), developed by Microsoft Research, is a lock-free B-tree that uses a mapping table to indirect all page references. Updates create "delta records" that chain to the base page (like an LSM within each page). Periodic consolidation merges deltas into the base page. This gives B-tree read performance with near-LSM write performance. Used in SQL Server's Hekaton in-memory engine.

Fractal Tree (TokuDB, now Percona): each internal B-tree node has an attached write buffer. Inserts go into the buffer instead of being pushed all the way down to a leaf. When the buffer fills, its contents are flushed to children in a batch. This converts random writes into sequential batch writes at each level — effectively an LSM approach applied within a B-tree structure. Write amplification is O(logB(N) / B) instead of O(B × logB(N)) for standard B-tree.

Write Stalls: The LSM-Tree Achilles Heel

The most feared production issue with LSM-trees is the write stall. Here is how it happens:

// Normal operation:
// Writes → memtable → flush to L0 → compact L0→L1 → compact L1→L2
// Everything flows smoothly because compaction keeps up with writes.

// Write stall scenario:
// 1. Write rate spikes (e.g., Black Friday traffic surge)
// 2. Memtable fills faster → flushes to L0 more often
// 3. L0 accumulates SSTables faster than compaction can process
// 4. L0 reaches max_file_count (default: 20 in RocksDB)
// 5. RocksDB STALLS ALL WRITES until compaction catches up
// 6. Write latency jumps from 0.1ms to 1000ms+ (10,000x increase)

// Detection: monitor these RocksDB metrics:
// - rocksdb.is-write-stopped (1 = stalled)
// - rocksdb.actual-delayed-write-rate (bytes/sec being throttled)
// - rocksdb.num-files-at-level0 (should stay below 20)
// - rocksdb.compaction-pending (number of compaction jobs waiting)

The mitigation strategy has three layers:

1. Prevent: Size your compaction throughput for 2x your peak write rate. More compaction threads, faster disks (NVMe over SATA), larger memtable (fewer L0 flushes).

2. Soften: Use rate limiting (delayed_write_rate) to gradually slow writers before a hard stall. Writers get slower instead of completely blocked.

3. Recover: If a stall happens, temporarily reduce write rate (back-pressure from the database to the application), let compaction catch up, then resume normal operation.

This is THE LSM production war story. Every team that runs an LSM-tree at scale has been bitten by write stalls. It is the first thing you should mention when discussing LSM limitations in an interview: "LSM-trees have great average-case write performance, but background compaction can cause tail-latency spikes and write stalls under burst load. Mitigation requires careful monitoring and tuning of compaction parameters."

Real Benchmark Numbers

Theoretical analysis is useful, but let us look at real benchmark data. The following numbers come from published benchmarks comparing RocksDB (LSM) vs InnoDB (B-tree) on the same hardware:

BenchmarkRocksDB (LSM)InnoDB (B-tree)Winner
Random write (ops/sec)~80,000~15,000RocksDB (5.3x)
Sequential write (ops/sec)~200,000~100,000RocksDB (2x)
Random read (ops/sec)~40,000~65,000InnoDB (1.6x)
Range scan (rows/sec)~500,000~800,000InnoDB (1.6x)
p99 read latency~2ms~0.5msInnoDB (4x)
p99 write latency~0.3ms~5msRocksDB (17x)
Disk space (1B keys)~45 GB~80 GBRocksDB (1.8x)

Notice the pattern: RocksDB dominates writes (especially p99 write latency — 17x better) and disk space efficiency. InnoDB dominates reads (1.6x better throughput, 4x better p99 read latency). The LSM compaction spikes are what cause the higher read p99.

The p99 story is critical. Average latency is almost meaningless for user-facing systems. What matters is the tail: p99 (1 in 100 requests) and p999 (1 in 1000). LSM-tree read p99 can spike during compaction because the compaction process holds locks on SSTables and consumes disk bandwidth. B-tree p99 is more stable because there is no background process competing for resources. If your SLA says "p99 read latency under 5ms," B-tree is the safer choice.

Side-by-Side Comparison

B-Tree vs LSM-Tree: Same Workload

Adjust the read/write ratio and see which engine performs better. High writes favor LSM. High reads favor B-tree.

Write-Heavy Read-Heavy
50% reads, 50% writes — mixed workload

The Decision Framework

FactorChoose LSM-TreeChoose B-Tree
Write patternHigh write throughput, batch inserts, time-seriesModerate writes, frequent updates of same key
Read patternFew point lookups, mostly range scansMany point lookups, predictable latency needed
TransactionsNot needed (or eventual consistency OK)ACID transactions, serializable isolation
Disk usageCompression-friendly, less fragmentationMore fragmentation, but no compaction overhead
Latency SLAp50 matters more than p99Tight p99 requirements
ExamplesCassandra, RocksDB, InfluxDB, ScyllaDBPostgreSQL, MySQL InnoDB, Oracle, SQL Server
The interview answer. "It depends on the workload. For write-heavy workloads like time-series, event logs, or messaging — LSM-trees. Their sequential I/O pattern sustains higher write throughput despite higher write amplification. For read-heavy, transaction-heavy workloads like OLTP — B-trees. Their single-seek reads give predictable latency, and their in-place updates support strong ACID guarantees without the compaction tax."

Real-World Decision: How Companies Choose

Here is how some well-known companies picked their storage engines, and why:

CompanyUse CaseEngineWhy
FacebookSocial graph (friends, likes)RocksDB (LSM)Massive write volume from billions of social actions. Write throughput is king.
DiscordMessage storageScyllaDB (LSM, Cassandra-compatible)Write-once messages, time-ordered range queries, massive scale (trillions of messages).
UberTrip data, pricingPostgreSQL (B-tree) + Cassandra (LSM)PostgreSQL for ACID transactions (pricing must be exact). Cassandra for high-volume trip events.
NetflixUser profiles, viewing historyCassandra (LSM)Global distribution across regions. Eventual consistency is acceptable for viewing history.
ShopifyE-commerceMySQL/InnoDB (B-tree)ACID transactions for orders and payments. Predictable latency for checkout flow.
CloudflareKV store at edgeCustom LSMWrite-heavy with eventual consistency. Replicated to 300+ locations worldwide.
Design challenge: You are designing a key-value store for a chat application. Messages are written once and read many times. Reads are always by conversation ID + timestamp range ("show me messages in this chat from the last hour"). Messages are never updated or deleted. Which storage engine would you choose?

Chapter 5: Other Indexing Structures

So far we have talked about indexing a single key. But real databases need much more: secondary indexes (looking up by a non-primary column), multi-column indexes (querying by two fields at once), full-text search, and even geospatial queries. Each requires a different indexing approach.

Secondary Indexes

A primary index maps a unique key (like user_id) to the row's location. A secondary index maps a non-unique column (like city) to the set of rows that match. The B-tree or LSM-tree structure is the same — the difference is that secondary index entries can point to multiple rows.

There are two ways to handle this:

Heap file + pointers. The table data lives in a separate heap file in insertion order. Both the primary index and all secondary indexes contain pointers (row IDs) into this heap file. A secondary index lookup returns a list of row IDs, and you chase each pointer to the heap file to retrieve the actual row data. This pointer-chasing is called a non-clustered index scan.

Clustered index. Store the actual row data directly inside the primary index's leaf pages, sorted by the primary key. Secondary indexes still use pointers, but now they point into the primary index rather than a heap file. This is how InnoDB (MySQL) works — the primary key IS the table, and secondary indexes point to primary key values.

Clustered vs non-clustered trade-off. Clustered indexes make primary key lookups faster (no pointer chasing to a heap file) and range scans on the primary key blazing fast (data is physically sorted). But they make secondary index lookups slower (you must look up the primary key after finding the secondary index entry, then traverse the primary B-tree). They also make inserts slower (data must be inserted in primary key order, potentially splitting pages).

Multi-Column and Covering Indexes

A concatenated index combines multiple columns into a single key. An index on (last_name, first_name) stores keys like "Smith|Alice". This supports queries that filter by last_name alone or by (last_name AND first_name), but NOT by first_name alone — the index is sorted by last_name first.

A covering index (or index-only scan) includes additional column values directly in the index entries, so the database can answer a query without touching the heap file at all. For example, an index on (city) that also INCLUDEs (population) can answer "SELECT population FROM cities WHERE city = 'Tokyo'" entirely from the index.

Covering Indexes: Avoiding the Second Lookup

A covering index is one of the most powerful optimization techniques in database tuning. The idea: include the values the query needs directly in the index, so the database never has to visit the heap file or primary index at all.

python
# Standard secondary index on email (PostgreSQL):
# CREATE INDEX idx_email ON users(email);
# SELECT name FROM users WHERE email = 'alice@example.com';
# → index lookup (email → ctid) → heap fetch (ctid → row) → return name
# TWO lookups!

# Covering index (INCLUDE the columns we need):
# CREATE INDEX idx_email_covering ON users(email) INCLUDE (name);
# SELECT name FROM users WHERE email = 'alice@example.com';
# → index lookup (email → name) → return name
# ONE lookup! The "name" value is stored in the index leaf page.

# EXPLAIN output would show "Index Only Scan" instead of "Index Scan"
# This can be 2-5x faster for queries that hit many rows.
When to use covering indexes. Covering indexes are most valuable when: (1) the query is run frequently, (2) it returns a small number of columns, (3) the table is large (so heap fetches are expensive). The trade-off: the index uses more disk space and slows down writes (the included columns must be updated on every row change). In PostgreSQL, use INCLUDE in CREATE INDEX. In MySQL, secondary indexes on InnoDB automatically "cover" the primary key (since secondary indexes store the primary key value).

Multi-Dimensional Indexes

How do you find "all restaurants within 2 km of my location"? This requires searching on two dimensions (latitude and longitude) simultaneously. A standard B-tree can sort by one dimension, but not two.

R-trees solve this by grouping nearby objects into rectangular bounding boxes, organized hierarchically. PostGIS (the geospatial extension for PostgreSQL) uses R-trees (via GiST indexes) for spatial queries.

Another approach: space-filling curves like the Z-order curve or Hilbert curve map 2D coordinates to a 1D value, preserving spatial locality. Then you can use a standard B-tree on the 1D value. This is how DynamoDB's geospatial indexes work.

Full-Text Search and Fuzzy Indexes

Full-text search indexes need to handle things that B-trees and LSM-trees cannot: word stemming (running, runs, ran → run), synonym matching, relevance ranking, and fuzzy matching (tolerate typos). These use inverted indexes — a map from each word to the list of document IDs containing that word. Lucene (the engine behind Elasticsearch and Solr) uses an SSTable-like structure for its term dictionary and posting lists.

// Inverted index for full-text search:

// Documents:
// doc1: "the quick brown fox"
// doc2: "the lazy brown dog"
// doc3: "quick fox jumps"

// Inverted index (term → document list):
"brown" → [doc1, doc2]
"dog" → [doc2]
"fox" → [doc1, doc3]
"jumps" → [doc3]
"lazy" → [doc2]
"quick" → [doc1, doc3]
"the" → [doc1, doc2]

// Query: "quick brown" → intersection of [doc1, doc3] ∩ [doc1, doc2] = [doc1]

For fuzzy matching (finding "restaurant" when the user typed "resturant"), Lucene uses a finite automaton built from a Levenshtein distance threshold. The automaton can efficiently traverse the sorted term dictionary and find all terms within edit distance 1 or 2 of the query term. This is how Elasticsearch's fuzzy queries work.

R-Trees for Geospatial Queries

A standard B-tree can efficiently search one dimension. But "find all restaurants within 2 km" requires searching two dimensions (latitude and longitude) simultaneously. An R-tree extends the B-tree idea to multiple dimensions using minimum bounding rectangles (MBRs).

// R-tree structure for geospatial data:

// Root: MBR covers entire US
// ├── Child 1: MBR covers West Coast
// │ ├── Leaf: MBR covers San Francisco
// │ │ ├── Point: Restaurant A (37.77, -122.42)
// │ │ ├── Point: Restaurant B (37.78, -122.41)
// │ │ └── ...
// │ └── Leaf: MBR covers Los Angeles
// │ └── ...
// └── Child 2: MBR covers East Coast
// └── ...

// Query: "All restaurants within 2 km of (37.77, -122.42)"
// 1. At root: check which child MBRs intersect the query circle
// 2. Descend into matching children only
// 3. At leaf: check actual point distances
// Result: prune most of the tree, only check nearby points

PostGIS uses GiST indexes (Generalized Search Trees), which are a generalization of R-trees that support any decomposable search predicate. This is how Uber finds nearby drivers, how Airbnb searches for listings in a map viewport, and how DoorDash finds restaurants near you.

Index Selection: The DBA's Art

Choosing which indexes to create is one of the most impactful decisions a database administrator makes. Too few indexes → slow reads. Too many indexes → slow writes and wasted disk space. The decision framework:

QuestionIf YesIf No
Is this column in WHERE clauses of frequent queries?Index itDon't index
Does the table have >100K rows?Index is worthwhileFull scan might be fine
Is the column high cardinality (many unique values)?B-tree indexConsider bitmap index (for OLAP)
Does the query need range scans on this column?B-tree (sorted)Hash index might suffice
Is the table write-heavy (>80% writes)?Minimize indexesAdd more indexes
Does the query SELECT only indexed columns?Use covering index (INCLUDE)Standard index
python
# PostgreSQL: check if your indexes are being used
# This query shows which indexes are never used (candidates for removal)

query = """
SELECT
    schemaname, tablename, indexname,
    idx_scan as times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0  -- NEVER used!
ORDER BY pg_relation_size(indexrelid) DESC;
"""

# Each unused index costs:
# 1. Disk space (visible in index_size above)
# 2. Write overhead (every INSERT/UPDATE/DELETE updates this index)
# 3. VACUUM overhead (must clean dead entries in index too)
# Drop unused indexes for immediate write performance improvement.

Concatenated Index Deep Dive

The order of columns in a concatenated index matters enormously. Consider an index on (last_name, first_name, age):

// Index on (last_name, first_name, age):
// Sorted as: last_name first, then first_name, then age

// Queries this index CAN accelerate:
WHERE last_name = 'Smith' ✓ leftmost prefix
WHERE last_name = 'Smith' AND first_name = 'Alice' ✓ two-column prefix
WHERE last_name = 'Smith' AND first_name = 'Alice' AND age = 30 ✓ full match
WHERE last_name = 'Smith' AND first_name > 'A' ✓ prefix + range

// Queries this index CANNOT accelerate:
WHERE first_name = 'Alice' ✗ skipped leftmost column
WHERE age = 30 ✗ skipped two leftmost columns
WHERE last_name = 'Smith' AND age = 30 ✗ gap in the middle

// The "leftmost prefix rule": the index can be used for any
// query that uses a prefix of the column list, left to right,
// with at most one range condition on the last used column.
The column order decision. Put the most-filtered column first (highest cardinality, most selective). If queries always filter by last_name, put it first. If 80% of queries filter by country (only 200 values) and 20% by user_id (millions of values), put user_id first because it is more selective — the index prunes more rows. This is non-obvious and frequently comes up in database design interviews.

Secondary Index Canvas

Secondary Index: Pointer Chasing

Click a city name in the secondary index to trace the pointer chain to the heap file rows.

Click a city to trace the secondary index lookup.
Interview question: You have a users table with a clustered primary index on user_id and a secondary index on email. A query runs: SELECT * FROM users WHERE email = 'alice@example.com'. How many index lookups does this require, and why?

Chapter 6: In-Memory Databases

We have spent five chapters agonizing over disk I/O. Pages, segments, sequential vs random writes. What if we just... keep everything in RAM?

RAM prices have fallen dramatically. In 2024, a cloud instance with 256 GB of RAM costs roughly $3/hour. Many datasets fit comfortably in that budget. This opens the door to in-memory databases — systems that keep the entire dataset in RAM, with disk used only for durability (crash recovery).

How In-Memory Databases Achieve Durability

Keeping data only in RAM would mean losing everything on a crash or power failure. In-memory databases solve this with several strategies:

Write-ahead log on disk. Every write is appended to a persistent log on disk before being applied in memory. On crash, replay the log. This is how Redis (with AOF mode) and VoltDB work. The log is append-only and sequential, so it does not bottleneck writes.

Periodic snapshots. Serialize the entire in-memory state to disk at intervals (e.g., every 5 minutes). On crash, restore from the latest snapshot, then replay the WAL entries since the snapshot. Redis RDB mode does this.

Replication. Keep copies on other machines. If one machine dies, another has the data. This is how Memcached clusters survive node failures (though Memcached itself has no persistence — it relies on the application to re-populate from the primary database).

The real advantage of in-memory is NOT speed. A common misconception is that in-memory databases are fast because they avoid disk reads. But a well-tuned disk-based database with enough buffer pool memory also serves most reads from RAM (cache hits). The real advantage is that in-memory databases can use data structures that are hard to implement on disk. Redis offers sorted sets, priority queues, HyperLogLog, Streams — data structures that would be nightmarishly complex to build on top of B-tree pages. When your data lives in RAM, you are free to organize it in any way that makes operations efficient, without worrying about page boundaries or disk layout.

Redis Data Structures: Why In-Memory Matters

To make the "data structures" advantage concrete, here are operations that Redis handles in microseconds but are nightmarish on disk:

Data StructureRedis OperationTimeDisk-based equivalentDisk time
Sorted SetZADD (insert with score)O(log n), ~5μsB-tree insert + heap update~500μs
Sorted SetZRANK (rank of element)O(log n), ~5μsCOUNT(*) WHERE score > X~50ms for 10M rows
HyperLogLogPFADD + PFCOUNTO(1), ~2μsCOUNT(DISTINCT ...) with full scanSeconds to minutes
BitmapBITOP AND/ORO(n/64) bits, ~10μsJOIN two tables + filterSeconds
StreamXADD + XRANGEO(1) + O(k), ~3μsINSERT + range scan~1ms

The ZRANK operation is particularly illustrative. In Redis, a sorted set is implemented as a skip list with augmented size counts at each level. Finding the rank of an element means traversing the skip list and summing the span counts — O(log n). In PostgreSQL, finding the rank requires counting all rows with a higher score, which is a full index scan — O(n). For a 10-million-entry leaderboard, that is the difference between 5 microseconds and 50 milliseconds — a 10,000x gap.

Anti-Caching: The Best of Both Worlds

Anti-caching (used by VoltDB, MemSQL/SingleStore) flips the traditional approach. Instead of caching hot data in memory from a disk-based store, you keep everything in memory and evict cold data to disk when RAM pressure increases. It is like a user-space page cache — the database knows which tuples are cold (based on access patterns), evicts them efficiently, and fetches them back on demand.

This gives you the programming model of an in-memory database (all data structures, no page management) with the capacity of a disk-based system.

NVM and the Future of Memory

Non-Volatile Memory (NVM), most notably Intel Optane Persistent Memory (now discontinued, but the concept lives on), sits between DRAM and SSD in both speed and cost. It offers byte-addressable persistent storage with latency around 300ns — 3x slower than DRAM but 300x faster than SSD. This blurs the line between in-memory and on-disk databases.

With NVM, a database can keep its data structures (B-trees, hash maps) directly in persistent memory. No WAL needed — every write is instantly durable. No buffer pool needed — the data is already in addressable memory. No crash recovery needed — the data survives power loss. This eliminates the three biggest sources of complexity in storage engine design.

However, NVM introduces its own challenges: cache-line-level atomicity (writes smaller than a cache line are atomic, but larger writes are not), memory ordering (CPU may reorder writes, requiring explicit flush instructions), and wear leveling (like SSD, NVM cells have finite write endurance). These require new data structure designs — "persistent data structures" that are correct even if the system crashes mid-operation.

The disappearing storage hierarchy. As NVM technology matures and becomes cheaper, the sharp boundary between "fast volatile memory" and "slow persistent disk" will soften. Future storage engines may not need separate in-memory and on-disk representations at all. The B-tree and the buffer pool might merge into a single persistent B-tree in NVM. This would simplify database architecture enormously — but we are not there yet.

Cost Analysis: When Does In-Memory Make Sense?

// RAM vs SSD cost comparison (2024 cloud pricing):
RAM: ~$5/GB/month (AWS r6g instances)
SSD: ~$0.10/GB/month (AWS gp3 EBS)
// RAM is 50x more expensive than SSD per GB

// Break-even analysis for a 100 GB dataset:
// Redis: 100 GB × $5/GB = $500/month
// PostgreSQL: 100 GB × $0.10/GB + compute = ~$150/month

// But if Redis eliminates 3 PostgreSQL read replicas:
// 3 × $300/month = $900/month in replicas saved
// Net savings: $900 - $500 = $400/month

// Rule of thumb: use in-memory when:
// - Dataset < 200 GB (affordable in RAM)
// - Latency requirements < 1ms (disk can't deliver)
// - Complex data structure operations needed (sorted sets, etc.)
// - The alternative would be multiple disk-based replicas

In-Memory Database Landscape

SystemModelDurabilityBest For
RedisKey-value + data structuresAOF / RDB snapshotsCaching, sessions, leaderboards, pub/sub
MemcachedKey-value (simple)None (cache only)Simple caching layer in front of a DB
VoltDBRelational (SQL)WAL + replicationHigh-throughput OLTP, financial systems
SAP HANARelational + columnarWAL + savepointsEnterprise OLAP + OLTP (HTAP)
DragonflyRedis-compatibleSnapshots + replicationDrop-in Redis replacement with better memory efficiency
In-Memory vs Disk-Based: Latency Comparison

Compare typical latencies for various operations. The gap is largest for complex data structures.

Redis vs Memcached: When to Use Which

Both are in-memory, but they serve different use cases:

FeatureRedisMemcached
Data structuresStrings, lists, sets, sorted sets, hashes, streams, bitmaps, HyperLogLogStrings only (key → value)
PersistenceRDB snapshots + AOF logNone (pure cache)
ReplicationBuilt-in primary-replicaNone (client-side sharding)
Memory efficiencyHigher overhead per key (~70 bytes)Lower overhead (~50 bytes)
ThreadingSingle-threaded (multi-threaded I/O in Redis 7+)Multi-threaded from day one
Pub/SubBuilt-inNo
Lua scriptingBuilt-in (atomic transactions)No
Best forSessions, leaderboards, rate limiting, real-time featuresSimple read-through cache layer in front of MySQL/PostgreSQL
The simple rule: If you need data structures (sorted sets, lists, pub/sub) or persistence, use Redis. If you need a simple "cache this SQL query result" layer and nothing else, Memcached is simpler and slightly more memory-efficient. In practice, most teams choose Redis because it covers both use cases and the operational overhead of running one system is lower than two.

Redis Persistence Deep Dive

Redis offers two persistence mechanisms that can be used independently or together:

RDB (Redis Database) snapshots. Periodically fork the process and dump the entire dataset to a point-in-time snapshot file. The fork uses copy-on-write (COW), so the parent continues serving requests while the child writes to disk. Configurable: "save after 900 seconds if at least 1 key changed." Recovery: load the snapshot on startup. Data loss: up to one snapshot interval (e.g., 15 minutes of data).

AOF (Append Only File). Log every write operation to an append-only file on disk. Three durability levels: (1) always — fsync every write (slowest, no data loss), (2) everysec — fsync every second (good compromise, up to 1 second of data loss), (3) no — let the OS decide when to flush (fastest, up to 30 seconds of data loss). The AOF file is periodically rewritten (compacted) in the background to prevent unbounded growth.

// Redis persistence trade-offs:

// RDB only: Fast recovery (load binary file), minimal I/O during operation,
// but up to 15 min data loss on crash

// AOF (everysec): At most 1 second of data loss, but AOF rewrite adds
// periodic CPU/disk pressure. AOF file is larger than RDB snapshot.

// Both (recommended): RDB for fast recovery after restart,
// AOF for minimal data loss. On recovery: load AOF (more complete)
// if available, otherwise fall back to RDB snapshot.
Design question: Your team is debating whether to use Redis or PostgreSQL for a real-time leaderboard that ranks 10 million players by score. The leaderboard must support: (1) update a player's score, (2) get a player's rank, (3) get the top 100 players. Which would you choose and why?

Chapter 7: OLTP vs OLAP

So far we have been thinking about databases that serve user-facing applications: a web server reads a user record, updates a shopping cart, inserts an order. These queries touch a small number of rows, but there are thousands of them per second. This is Online Transaction Processing (OLTP).

But there is a second class of database usage that looks completely different. A business analyst wants to know: "What was the total revenue by product category in Q3 across all stores?" This query must scan millions or billions of rows, aggregate values, and return a single summary. This is Online Analytical Processing (OLAP).

Two Completely Different Access Patterns

PropertyOLTPOLAP
Read patternSmall number of rows per query, fetched by keyAggregate over a large number of rows
Write patternRandom-access, low-latency writes from user inputBulk import (ETL) or streaming ingest
Primarily used byEnd users via web/mobile applicationsData analysts, data scientists, dashboards
Data representsLatest state of things (current balance, current inventory)History of events over time
Dataset sizeGB to low TBTB to PB
Query styleSELECT * FROM users WHERE id = 42SELECT product, SUM(revenue) FROM sales GROUP BY product
BottleneckDisk seek time (how fast you can find a row)Disk bandwidth (how fast you can scan data)
Different access patterns need different storage engines. Using a B-tree-based OLTP database for analytics is like using a sports car to haul freight. It can do it, but it is terrible at it. The OLAP query "SUM revenue across 100M rows" on a row-oriented B-tree must read every column of every row, even though it only needs the revenue column. This wastes enormous disk bandwidth. This is why data warehouses exist — and why they use column-oriented storage (Chapter 8).

The Data Warehouse: ETL from OLTP

Most organizations do not run analytics directly on their OLTP databases. Instead, they Extract, Transform, Load (ETL) data from multiple OLTP systems into a dedicated data warehouse optimized for analytics.

OLTP Sources
PostgreSQL (users), MySQL (orders), MongoDB (events), Stripe API (payments) — each optimized for its own write workload.
↓ Extract
ETL Pipeline
Read data from each source. Transform: clean, deduplicate, conform schemas, compute derived columns. Scheduled (hourly, daily) or streaming (CDC).
↓ Load
Data Warehouse
Column-oriented storage. Star/snowflake schema. Optimized for full-table scans and aggregations. Examples: Snowflake, BigQuery, Redshift, ClickHouse.

Star and Snowflake Schemas

Data warehouses typically organize data into a star schema. At the center is a fact table — a very large table where each row represents an event (a sale, a click, a page view). Each fact row has foreign keys pointing to smaller dimension tables (product, store, customer, date).

The star schema gets its name from the visual layout: the fact table in the center with dimension tables radiating outward like points of a star.

// Example star schema for a retail data warehouse:

// FACT TABLE: sales_facts (billions of rows)
| sale_id | date_key | product_key | store_key | customer_key | quantity | revenue | cost |

// DIMENSION TABLES (thousands to millions of rows each):
| date_key | date | month | quarter | year | is_weekend | is_holiday |
| product_key | name | category | brand | weight | supplier |
| store_key | name | city | state | country | square_feet |
| customer_key | name | email | age_group | loyalty_tier |

// An OLAP query joins fact + dimensions:
// "Total revenue by product category in Q3 2024 at California stores"
// → scan sales_facts, join with date (quarter='Q3'), product (category),
// and store (state='CA') dimensions, aggregate SUM(revenue)

A snowflake schema is a variation where dimension tables are further normalized (e.g., the product dimension has a separate category sub-dimension). This saves space but makes queries more complex with additional joins.

Why Not Just Use the Same Database?

A common question: "Why can't the analyst just query the OLTP database directly?" Three reasons:

1. Performance interference. An OLAP scan that reads millions of rows competes with OLTP queries for buffer pool space, disk I/O, and CPU. A single dashboard refresh could evict thousands of frequently-accessed pages from the buffer pool, causing every user-facing query to hit disk instead of cache. Your 2ms point lookups suddenly become 20ms.

2. Different schemas. OLTP tables are normalized (third normal form) to minimize update anomalies. Analytics work better on denormalized star schemas where every fact row contains or points to all the context you need. ETL performs this denormalization during the transform step.

3. Different storage formats. OLTP uses row-oriented storage (fast for inserting and reading individual rows). OLAP is 10-100x faster with column-oriented storage (Chapter 8). You cannot get both benefits from a single storage format.

python
# A simplified ETL pipeline for a data warehouse

# 1. EXTRACT: Read from OLTP sources
users   = pg_conn.execute("SELECT * FROM users WHERE updated_at > %s", last_etl)
orders  = mysql_conn.execute("SELECT * FROM orders WHERE created_at > %s", last_etl)
events  = mongo_conn.find({"ts": {"$gt": last_etl}})

# 2. TRANSFORM: Clean, join, denormalize into star schema
fact_rows = []
for order in orders:
    user = users_by_id[order.user_id]
    fact_rows.append({
        "order_id": order.id,
        "date_key": to_date_key(order.created_at),
        "user_key": user.id,
        "product_key": order.product_id,
        "revenue": order.total,
        "quantity": order.qty,
        "user_country": user.country,  # denormalized from user dim
    })

# 3. LOAD: Bulk insert into the data warehouse
warehouse.bulk_insert("sales_facts", fact_rows)

ETL vs Change Data Capture (CDC)

Traditional ETL runs on a schedule — hourly or daily. This means the data warehouse is always at least one ETL cycle behind the OLTP sources. For many analytics use cases, this is fine ("what were last month's sales?" does not need real-time data).

But some use cases need fresher data ("how many users signed up in the last 5 minutes?"). Change Data Capture (CDC) replaces batch ETL with a continuous stream of changes. The OLTP database's WAL (write-ahead log) is read in real time, and each change (insert, update, delete) is streamed to the data warehouse within seconds.

ApproachLatencyOLTP ImpactComplexityTools
Batch ETLHours to minutesRead queries on source DBLowerAirflow, dbt, Fivetran
CDC (log-based)SecondsNear-zero (reads WAL only)HigherDebezium, Maxwell, AWS DMS
CDC is reading the database's own WAL. This connects back to Chapter 3's B-tree WAL: the same log that provides crash recovery also provides a real-time change stream. PostgreSQL's logical replication, MySQL's binary log, and MongoDB's oplog are all forms of CDC. Debezium reads these logs and streams changes to Kafka, which can then feed into any data warehouse or search index.

OLTP vs OLAP Canvas

OLTP vs OLAP: Access Pattern Comparison

See how an OLTP point lookup touches one row while an OLAP aggregation scans millions.

Click a button to simulate each query type.

Fact Table Sizing: How Big Do They Get?

Fact tables in a data warehouse can be enormous. Let us size one for a mid-size e-commerce company:

// E-commerce fact table: "order_items" (one row per item in an order)
// Company does 50K orders/day, avg 3 items per order = 150K rows/day

// Row schema:
// order_id (8B) + date_key (4B) + product_key (4B) + customer_key (4B)
// + store_key (4B) + quantity (2B) + unit_price (4B) + discount (4B)
// + tax (4B) + total (4B) + shipping_cost (4B) + margin (4B)
// Total: ~50 bytes per row (12 columns, all numeric)

// Growth rate: 150K rows/day × 50 bytes = 7.5 MB/day
// After 5 years: 150K × 365 × 5 = 273.75 million rows
// Uncompressed: 273.75M × 50 bytes ≈ 13.7 GB
// Column-compressed: typically 5-10x compression ≈ 1.4-2.7 GB
// This fits on a single node of any column store.

// For Amazon or Walmart scale:
// 10M orders/day × 5 items = 50M rows/day × 50 bytes = 2.5 GB/day
// After 5 years: 91 billion rows = 4.5 TB uncompressed
// Compressed: 450 GB - 900 GB
// This requires distributed columnar storage (BigQuery, Redshift).

The HTAP Trend

The newest trend is HTAP (Hybrid Transactional/Analytical Processing): databases that handle both OLTP and OLAP in a single system. The idea: maintain a row-oriented store for transactions and a column-oriented store for analytics, with automatic real-time synchronization between them.

TiDB, for example, has two storage engines: TiKV (a distributed key-value store based on RocksDB/LSM for OLTP) and TiFlash (a columnar store for OLAP). When you write to TiKV, the data is asynchronously replicated to TiFlash in columnar format. The query optimizer routes each query to the appropriate engine — point lookups go to TiKV, aggregation queries go to TiFlash.

This is the theoretical ideal: one database, no ETL, real-time analytics. In practice, HTAP databases are still maturing, and most large organizations still separate OLTP and OLAP into different systems for operational simplicity and failure isolation.

Debug scenario: A data analyst complains that their dashboard query (total sales by region for the past year) takes 45 seconds on the production PostgreSQL database. The DBA says "we can't add more indexes because it slows down writes." What is the correct organizational solution?

Chapter 8: Column-Oriented Storage

This is the showcase chapter — the single most important idea for OLAP performance. Once you understand column-oriented storage, you understand why data warehouses are orders of magnitude faster than row-oriented databases for analytics.

The Problem with Row-Oriented Storage

In a row-oriented database (PostgreSQL, MySQL), each row is stored contiguously on disk. A row for a sales event might look like this in storage:

// Row-oriented layout on disk (each row is contiguous):
[sale_id=1, date=2024-01-15, product="Widget", store="NYC", qty=3, revenue=45.00, cost=12.00]
[sale_id=2, date=2024-01-15, product="Gadget", store="SF", qty=1, revenue=99.00, cost=30.00]
[sale_id=3, date=2024-01-16, product="Widget", store="CHI", qty=7, revenue=105.0, cost=28.00]
... 100 million more rows ...

Now run the query: SELECT SUM(revenue) FROM sales WHERE date BETWEEN '2024-01-01' AND '2024-03-31'.

This query needs only two columns: date and revenue. But in a row-oriented layout, the database must read ALL seven columns for every row, because each row is stored as a single unit. If each row is 100 bytes and you have 100 million rows, you read 10 GB of data to access 16 bytes per row (8-byte date + 8-byte revenue) — you are reading 6x more data than you need.

Column-Oriented Layout: Store Each Column Separately

In a column-oriented (columnar) database, all values from one column are stored contiguously, separate from other columns:

// Column-oriented layout on disk:
sale_id file: [1, 2, 3, 4, 5, ...]
date file: [2024-01-15, 2024-01-15, 2024-01-16, ...]
product file: ["Widget", "Gadget", "Widget", ...]
store file: ["NYC", "SF", "CHI", ...]
qty file: [3, 1, 7, ...]
revenue file: [45.00, 99.00, 105.00, ...]
cost file: [12.00, 30.00, 28.00, ...]

// Same query: SELECT SUM(revenue) WHERE date BETWEEN ...
// Now only reads the date file + revenue file
// 100M rows × 16 bytes = 1.6 GB instead of 10 GB
// 6.25x less I/O!
The key insight is I/O reduction. Analytical queries typically touch 3-5 columns out of 50-200. Column-oriented storage means you only read the columns you actually need. For a table with 100 columns where the query uses 5, you read 5% of the data — a 20x improvement before any compression or indexing.

Column Compression

Storing values of the same type contiguously also enables dramatically better compression. A column of dates contains similar values (most dates cluster in recent months). A column of product categories might have only 50 distinct values across 100 million rows.

Bitmap encoding. For a column with n distinct values, create n bitmaps — one per distinct value. Each bitmap has one bit per row (1 if that row has this value, 0 otherwise). For a "country" column with 200 countries across 100M rows, each bitmap is 100M bits = 12.5 MB, and the total is 200 × 12.5 MB = 2.5 GB. But most bitmaps are very sparse (mostly zeros), so they compress extremely well with run-length encoding.

// Bitmap encoding example: "store" column with 3 values
// Data: [NYC, SF, CHI, NYC, NYC, SF, CHI, NYC, SF, NYC]

NYC bitmap: [1, 0, 0, 1, 1, 0, 0, 1, 0, 1]    → RLE: 1,1 0,2 1,2 0,2 1,1 0,1 1,1
SF bitmap: [0, 1, 0, 0, 0, 1, 0, 0, 1, 0]    → RLE: 0,1 1,1 0,3 1,1 0,2 1,1 0,1
CHI bitmap: [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]    → RLE: 0,2 1,1 0,3 1,1 0,3

// WHERE store = 'NYC' → just read the NYC bitmap
// WHERE store IN ('NYC', 'SF') → bitwise OR of two bitmaps
// WHERE store = 'NYC' AND date = '2024-01-15' → bitwise AND of two bitmaps

With run-length encoding, a bitmap for "NYC" across 100 million rows where 30% are NYC compresses from 12.5 MB to about 200 KB. The compression ratio for the entire column can be 10-100x.

Interactive: Row Store vs Column Store

The simulation below shows the same table stored in row-oriented and column-oriented format. Select columns for your query and see how much data each format must read.

Row Store vs Column Store: I/O Comparison

Toggle columns to include in your query. Watch the bytes-read bar for each format.

Select columns your query needs. Row store always reads everything. Column store reads only what you need.

Dictionary Encoding and Delta Encoding

Beyond bitmap encoding, column stores use two other powerful compression techniques:

Dictionary encoding. Replace each distinct string value with a small integer. Store a separate dictionary mapping integers to strings. If a "product_name" column has 10,000 distinct products across 100 million rows, each row stores a 2-byte integer instead of a 20-byte string — 10x compression on that column alone.

// Dictionary encoding example: product column
Dictionary: { 0: "Widget", 1: "Gadget", 2: "Gizmo", 3: "Doohickey" }

// Original: ["Widget", "Gadget", "Widget", "Gizmo", "Widget", ...]
// Encoded: [0, 1, 0, 2, 0, ...] (each entry is 1 byte instead of ~10)

// WHERE product = 'Widget' → look up dictionary: Widget = 0
// → scan encoded column for value 0 (integer comparison, CPU-friendly)

Delta encoding. For sorted columns (especially timestamps), store the difference between consecutive values instead of the absolute values. If timestamps are in milliseconds and rows are 100ms apart, each delta is small and compresses well:

// Delta encoding on a sorted timestamp column:
Original: [1704067200000, 1704067200100, 1704067200200, 1704067200350, ...]
Base value: 1704067200000
Deltas: [0, 100, 100, 150, ...]

// 8-byte timestamps → 1-2 byte deltas
// Compression ratio: 4-8x on timestamps alone

CPU-Friendly Processing: Vectorized Execution

Column stores unlock a performance trick that row stores cannot use: vectorized execution. Because all values in a column are the same type, stored contiguously in memory, you can process them in tight loops that exploit CPU cache lines and SIMD instructions.

A modern CPU can compare 16 integers against a constant in a single SIMD instruction (using 128-bit SSE or 256-bit AVX). Processing a column of dictionary-encoded integers with SIMD is 10-20x faster than the row-by-row processing that row stores must use.

Compression + vectorization = the column store double win. Column stores get two benefits at once: (1) reading less data from disk (only the needed columns, highly compressed), and (2) processing the data faster on the CPU (tight loops on same-type arrays with SIMD). These advantages compound: you read 20x less data AND process it 10x faster per byte. This is why a column store query can be 200x faster than the same query on a row store.

Sort Order and Multiple Sort Orders

Within each column file, the rows must be stored in the same order across all columns (so that row 42 in the date file corresponds to row 42 in the revenue file). You can choose what order to sort the rows in.

Sorting by date (the most common filter in analytics) means the date column compresses beautifully (many consecutive identical dates → long run-length encoding runs), and date-range queries scan a contiguous region.

Some databases (like Vertica, now part of OpenText) maintain multiple sort orders — several copies of the data, each sorted differently. This is like having multiple clustered indexes. It uses more space but means the query optimizer can pick the sort order that best matches each query.

Materialized Views and Data Cubes

For frequently-run aggregation queries, you can pre-compute the result and store it as a materialized view. A data cube (also called an OLAP cube) is a specialized materialized view that pre-computes aggregates across every combination of dimensions. For example, a cube over (date, product, store) pre-computes the SUM of revenue for every (date, product, store) combination.

The downside: materialized views and cubes must be updated when the underlying data changes, and they cannot answer queries that do not match the pre-computed dimensions. They are a caching optimization, not a general solution.

// Data cube example (3 dimensions: date × product × store):

// The cube pre-computes SUM(revenue) for every combination:
cube[2024-01-15][Widget][NYC] = $45.00
cube[2024-01-15][Widget][SF] = $120.00
cube[2024-01-15][Widget][*] = $315.00 ← aggregated across all stores
cube[2024-01-15][*][NYC] = $890.00 ← aggregated across all products
cube[2024-01-15][*][*] = $12,450 ← total revenue for Jan 15
cube[*][Widget][*] = $2.1M ← total Widget revenue all time

// Queries answered instantly (O(1) lookup):
// "Total revenue for Widget in NYC on Jan 15?" → $45.00
// "Total revenue across all products in all stores in Jan?" → sum Jan dates

// Queries NOT answerable from this cube:
// "Revenue for orders > $100?" → the cube doesn't have a $ threshold dimension
// "Revenue by customer age group?" → no age dimension in the cube

Real-World Column Store: DuckDB

DuckDB is worth special attention because it demonstrates the power of column-oriented processing in an embedded context — no server, no network, just a library linked into your application (like SQLite, but columnar).

python
import duckdb

# DuckDB can directly query Parquet files (columnar format) on disk
# No need to load into a database first!

result = duckdb.sql("""
    SELECT
        product_category,
        SUM(revenue) as total_revenue,
        COUNT(*) as num_sales,
        AVG(revenue) as avg_sale
    FROM read_parquet('sales_2024_*.parquet')
    WHERE date BETWEEN '2024-01-01' AND '2024-03-31'
    GROUP BY product_category
    ORDER BY total_revenue DESC
    LIMIT 10
""")

# This query over 100M rows in Parquet files completes in ~2 seconds
# because DuckDB:
# 1. Only reads the 4 columns needed (date, product_category, revenue)
# 2. Uses vectorized execution (processes 2048 values at a time)
# 3. Pushes the WHERE filter down to the Parquet reader (skip row groups)
# 4. Runs on all CPU cores in parallel

# The same query on PostgreSQL (row-oriented) would read ALL columns
# and take 20-60 seconds. 10-30x slower on the same hardware.
The Parquet format. Apache Parquet has become the de facto standard for analytical data on disk. It stores data in columnar format with built-in compression (Snappy, Zstd, or Gzip per column), row group metadata for predicate pushdown (skip entire row groups where min/max values don't match the WHERE clause), and rich type support. Every major data tool — Spark, Pandas, DuckDB, BigQuery, Redshift Spectrum — can read and write Parquet natively. If you are building a data pipeline in 2024, output Parquet.

Column Store vs Row Store: When Columns Lose

Column stores are not always better. They lose decisively in these scenarios:

1. Point lookups. Retrieving a single row by primary key requires reading one value from every column file and reassembling the row. If you have 200 columns, that is 200 file seeks. A row store does this in 1 seek. For OLTP-style "get user by ID" queries, row stores are 10-100x faster.

2. Small inserts. Adding one row to a column store means appending to 200 separate column files. A row store appends to one file. This is why column stores use the LSM-buffered write path (batch in memory, flush in bulk).

3. Wide SELECT * queries. If your query needs all columns (like a backup or export), column store must read every column file and reassemble rows. Row store reads the data in its natural format with zero reassembly cost.

4. Narrow tables. If your table has only 3-5 columns, the column store advantage disappears — you are reading most of the data regardless of format. The overhead of maintaining separate column files is not worth it.

The crossover point. As a rule of thumb, column stores start winning when: (a) the table has more than ~20 columns, (b) the query touches fewer than half the columns, and (c) the query scans more than ~10,000 rows. Below these thresholds, a well-tuned row store with appropriate indexes is competitive or faster.

The Column Store Write Path

Column-oriented storage is optimized for reads, but writes are tricky. Inserting a single row requires appending a value to every column file — potentially hundreds of files. The solution: use an LSM-tree within each column. Incoming writes go to an in-memory buffer (a row-oriented memtable). When the buffer is full, pivot the rows into columnar format and flush all columns as a batch. Background compaction merges small column segments into larger sorted ones.

1. Row-Oriented Memtable
Incoming writes accumulate in a row-oriented in-memory buffer. This is efficient because individual inserts are natural as rows.
↓ when buffer reaches threshold (e.g., 128 MB)
2. Pivot to Columnar
The row buffer is transposed: each column's values are extracted, compressed (dictionary + RLE + delta), and written to disk as separate column files. This single sequential write is efficient.
↓ background
3. Compaction
Small column segments are merged into larger ones, re-sorted, re-compressed. Same idea as LSM compaction but applied to columnar data. The merge reads and writes each column independently.

This is exactly how Apache Parquet + Delta Lake work, and it is the write engine behind ClickHouse, DuckDB, and Snowflake. The key insight: accept row-oriented writes, batch them, convert to columnar format in bulk. This bridges the gap between the write-friendly row format and the read-friendly columnar format.

python
# Conceptual column store write path
class ColumnStore:
    def __init__(self, columns, flush_threshold=10000):
        self.columns = columns  # list of column names
        self.buffer = []         # row-oriented in-memory buffer
        self.threshold = flush_threshold
        self.segments = []        # list of flushed column segments

    def insert(self, row):
        """Insert a row (dict of column_name → value)."""
        self.buffer.append(row)
        if len(self.buffer) >= self.threshold:
            self._flush()

    def _flush(self):
        """Pivot rows to columns and write to disk."""
        segment = {}
        for col in self.columns:
            segment[col] = [row[col] for row in self.buffer]
        self.segments.append(segment)
        self.buffer = []

    def scan(self, needed_columns, predicate=None):
        """Only read the columns we need — the whole point."""
        results = []
        for seg in self.segments:
            n_rows = len(seg[self.columns[0]])
            for i in range(n_rows):
                row = {c: seg[c][i] for c in needed_columns}
                if predicate is None or predicate(row):
                    results.append(row)
        return results
Column stores in practice. Google BigQuery, Amazon Redshift, Snowflake, ClickHouse, Apache Parquet, Apache ORC, DuckDB — all use columnar storage. When you see "10x-100x faster analytics," columnar storage is almost always the reason. Even traditional OLTP databases now offer columnar indexes for analytics: PostgreSQL has columnar extensions (cstore_fdw, Citus), and MySQL HeatWave runs analytics on a columnar copy of your InnoDB data.

Parquet File Format: Anatomy of a Column Store on Disk

Understanding the Parquet file format makes column store concepts concrete. A Parquet file is organized into row groups, each containing column chunks, each containing pages:

// Parquet file layout:
┌─────────────────────────────────────┐
│ Magic number: "PAR1" (4 bytes) │
├─────────────────────────────────────┤
│ Row Group 0 (default ~128 MB) │
│ ├── Column chunk: sale_id │ ← contiguous, compressed
│ ├── Column chunk: date │ ← contiguous, compressed
│ ├── Column chunk: product │ ← contiguous, compressed
│ ├── Column chunk: revenue │ ← contiguous, compressed
│ └── Column chunk: ... │
├─────────────────────────────────────┤
│ Row Group 1 │
│ ├── Column chunk: sale_id │
│ ├── ... │
├─────────────────────────────────────┤
│ Footer metadata │
│ ├── Schema (column names, types) │
│ ├── Row group offsets │
│ ├── Column chunk metadata │
│ │ ├── min/max values per chunk │ ← enables predicate pushdown!
│ │ ├── null count │
│ │ ├── compression codec │
│ │ └── uncompressed/compressed size │
│ └── ... │
├─────────────────────────────────────┤
│ Footer length (4 bytes) │
│ Magic number: "PAR1" (4 bytes) │
└─────────────────────────────────────┘

Predicate pushdown is the key optimization. Each column chunk's metadata includes the minimum and maximum values for that chunk. If your query says WHERE date > '2024-06-01' and a row group's date column has max = '2024-05-31', the entire row group (potentially millions of rows) can be skipped without reading a single byte of data.

// Predicate pushdown example:
// Query: WHERE date BETWEEN '2024-07-01' AND '2024-07-31'
// File has 10 row groups, each with 10M rows:

Row Group 0: date min='2024-01-01', max='2024-02-15' → SKIP
Row Group 1: date min='2024-02-16', max='2024-04-01' → SKIP
Row Group 2: date min='2024-04-02', max='2024-05-20' → SKIP
Row Group 3: date min='2024-05-21', max='2024-07-05' → READ (overlap)
Row Group 4: date min='2024-07-06', max='2024-08-20' → READ (overlap)
Row Group 5: date min='2024-08-21', max='2024-10-10' → SKIP
// ... remaining groups also skipped

// Result: read 2 out of 10 row groups = 80% of data skipped
// 20M rows read instead of 100M → 5x speedup just from metadata!
Worked problem: Your fact table has 200 columns and 1 billion rows. Each column averages 8 bytes per value. Your typical OLAP query touches 5 columns. How much data does a row-oriented scan read vs a column-oriented scan? What is the speedup factor, assuming I/O-bound performance?

Chapter 9: Interview Arsenal

This chapter is your cheat sheet. Everything from Chapters 0-8 condensed into tables, decision frameworks, and ready-to-use answers for system design interviews.

Storage Structure Comparison

StructureReadWriteSpaceRange QueryBest Use Case
Hash IndexO(1)O(1)Keys must fit in RAMNoFew keys, frequent updates (counters, sessions)
LSM-TreeO(log n) × levelsO(1) amortizedGood compressionYes (sorted SSTs)Write-heavy, time-series, event logs
B-TreeO(log n)O(log n)Some fragmentationYesOLTP, transactions, predictable latency
Column StoreFull-scan optimizedBatch onlyExcellent compressionYes (per-column)Analytics, data warehousing
In-Memory (Redis)O(1) - O(log n)O(1) - O(log n)RAM-boundDepends on structureCaching, real-time leaderboards, pub/sub

System Design Talking Points

"Design a key-value store."

Start with the simplest thing: append-only log + hash index (Bitcask). Discuss limitations (keys must fit in RAM, no range queries). If the interviewer says "now support range queries," upgrade to LSM-tree. If they say "now support transactions," switch to B-tree with WAL. If they say "now support analytics on this data," add a column-oriented read replica. Each upgrade is motivated by a new requirement — show the progression.

"Design a time-series database."

LSM-tree with timestamp-ordered keys. Ingestion is append-mostly (events always have increasing timestamps), which perfectly matches LSM's sequential write pattern. Range queries ("last hour of data") are natural on sorted SSTables. Time-based compaction: old data is compacted more aggressively, recent data is kept at higher resolution. Column-oriented within each SSTable for compression (timestamp column compresses 10-20x with delta encoding). This is how InfluxDB, TimescaleDB, and QuestDB work.

"Design a data warehouse."

Column-oriented storage on distributed shared-nothing nodes. Star schema: one fact table, multiple dimension tables. ETL from OLTP sources on a schedule (hourly or via CDC). Columnar compression (bitmap encoding + run-length encoding on low-cardinality columns, dictionary encoding on strings, delta encoding on timestamps). Materialized views for frequently-run dashboard queries. MPP (Massively Parallel Processing): partition the fact table across nodes, each node scans its partition in parallel, a coordinator merges results. This is BigQuery, Redshift, Snowflake.

Design Challenge: E-Commerce Product Catalog

Let us work through a real design challenge, the kind you would get in a 45-minute system design interview:

Problem: Design the storage layer for an e-commerce product catalog. Requirements: 50 million products, each with ~30 attributes (name, price, category, brand, color, size, description, images, ratings, reviews count, etc.). Support these queries: (1) Get product by ID. (2) Search by category + price range. (3) Full-text search on name/description. (4) Analytics: "average price by category for the last year." (5) 99th percentile read latency under 20ms for queries 1-3.
// Solution architecture:

// Primary store: PostgreSQL with B-tree indexes
// - Primary key: product_id (B-tree, O(log n) point lookup)
// - Composite index: (category, price) for range queries
// - Table size: 50M × 2KB avg row = 100 GB
// - Buffer pool: 32 GB → top 3 B-tree levels cached
// - Expected latency: 1-5ms for point lookups, 5-20ms for range queries ✓

// Full-text search: Elasticsearch (inverted index)
// - CDC from PostgreSQL → Elasticsearch via Debezium
// - Handles stemming, fuzzy matching, relevance ranking
// - Latency: 5-15ms for search queries ✓

// Analytics: ClickHouse (column-oriented)
// - Daily ETL from PostgreSQL → ClickHouse
// - Columnar compression: 100 GB → ~20 GB
// - "Avg price by category" query: 2-5 seconds on full dataset

// Caching layer: Redis
// - Cache top 1M most-viewed products (LRU eviction)
// - 1M × 2KB = 2 GB RAM (trivial)
// - Cache hit → 0.1ms response. Cache miss → PostgreSQL (5ms)

Debug Scenarios

SymptomLikely CauseSolution
Reads getting slower over timeLSM compaction falling behind — too many SSTables to check per readTune compaction (increase background threads, reduce size ratio). Check Bloom filter effectiveness. Consider leveled compaction to limit files per level.
Writes spike then stallB-tree page splits cascading up to root, or LSM write stall when L0 is fullFor B-tree: pre-split pages, use fillfactor < 100%. For LSM: increase memtable size, add more compaction threads.
Point lookups are fast but range queries are slowData is not clustered by the range-query key, causing random I/OAdd a clustered index on the range-query column, or use a covering index to avoid heap lookups.
Analytics queries slow down OLTPAnalytics scans are evicting OLTP data from the buffer poolSeparate OLTP and OLAP into different databases. Set up ETL to a data warehouse. Or use read replicas for analytics.
Disk space growing despite deletesLSM: deleted keys are tombstoned, not removed until compaction. B-tree: pages are not returned to OS after row deletion.LSM: force compaction on affected segments. B-tree: VACUUM (PostgreSQL) or OPTIMIZE TABLE (MySQL) to reclaim space.

Coding Drills

Drill 1: Simple Hash Index in Python

python
import os

class HashIndex:
    """Bitcask-style hash index: append-only log + in-memory hash map."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key → byte offset
        self.f = open(path, 'a+b')  # append + read, binary
        self._rebuild_index()

    def _rebuild_index(self):
        """Scan the file to rebuild the in-memory index on startup."""
        self.f.seek(0)
        while True:
            offset = self.f.tell()
            line = self.f.readline()
            if not line:
                break
            key, _ = line.decode().strip().split(',', 1)
            self.index[key] = offset  # last write wins

    def set(self, key, value):
        offset = self.f.seek(0, 2)  # seek to end
        self.f.write(f"{key},{value}\n".encode())
        self.f.flush()
        self.index[key] = offset

    def get(self, key):
        if key not in self.index:
            return None
        self.f.seek(self.index[key])
        line = self.f.readline().decode().strip()
        _, value = line.split(',', 1)
        return value

Drill 2: Simple LSM Memtable + Flush

python
import json, os
from sortedcontainers import SortedDict  # pip install sortedcontainers

class LSMTree:
    """Minimal LSM-tree: memtable + SSTable flush."""

    def __init__(self, directory, memtable_limit=1000):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = SortedDict()  # sorted in-memory tree
        self.sstables = []  # list of SSTable file paths, newest first
        os.makedirs(directory, exist_ok=True)

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def get(self, key):
        # 1. Check memtable first (newest data)
        if key in self.memtable:
            return self.memtable[key]
        # 2. Check SSTables from newest to oldest
        for sst_path in self.sstables:
            with open(sst_path) as f:
                data = json.load(f)  # [{key:..., value:...}, ...]
                # Binary search (data is sorted by key)
                lo, hi = 0, len(data) - 1
                while lo <= hi:
                    mid = (lo + hi) // 2
                    if data[mid]['key'] == key:
                        return data[mid]['value']
                    elif data[mid]['key'] < key:
                        lo = mid + 1
                    else:
                        hi = mid - 1
        return None

    def _flush(self):
        """Write sorted memtable to disk as an SSTable."""
        path = os.path.join(self.directory, f"sst_{len(self.sstables)}.json")
        data = [{'key': k, 'value': v} for k, v in self.memtable.items()]
        with open(path, 'w') as f:
            json.dump(data, f)
        self.sstables.insert(0, path)  # newest first
        self.memtable = SortedDict()

Drill 3: B-Tree Search in Python

python
class BTreeNode:
    def __init__(self, leaf=True):
        self.keys = []       # sorted list of keys
        self.values = []     # values (only in leaf nodes)
        self.children = []   # child pointers (only in internal nodes)
        self.leaf = leaf

def btree_search(node, key):
    """Search for a key in a B-tree. Returns (node, index) or None."""
    if node is None:
        return None

    # Binary search within the node's keys
    lo, hi = 0, len(node.keys) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        if node.keys[mid] == key:
            return (node, mid)    # found!
        elif node.keys[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1

    # Not found in this node
    if node.leaf:
        return None           # key doesn't exist
    else:
        return btree_search(node.children[lo], key)  # recurse into child

# Usage: each recursive call = one disk page read
# Total disk reads = depth of tree = log_b(n)

Drill 4: Bloom Filter in Python

python
import hashlib

class BloomFilter:
    """Simple Bloom filter with k=3 hash functions."""

    def __init__(self, size=1000):
        self.size = size
        self.bits = [False] * size

    def _hashes(self, key):
        """Generate 3 hash positions for a key."""
        h = hashlib.md5(key.encode()).hexdigest()
        return [
            int(h[:8], 16) % self.size,
            int(h[8:16], 16) % self.size,
            int(h[16:24], 16) % self.size,
        ]

    def add(self, key):
        for pos in self._hashes(key):
            self.bits[pos] = True

    def might_contain(self, key):
        """Returns False = definitely not in set. True = maybe in set."""
        return all(self.bits[pos] for pos in self._hashes(key))

# Usage in LSM read path:
# if not bloom_filter.might_contain(key):
#     skip_this_sstable()  # saved a disk read!

Capacity Planning: A Worked Example

Let us size a real storage system from scratch — the kind of back-of-envelope calculation you would do in a system design interview:

// Scenario: Design the storage layer for a URL shortener
// Requirements:
// - 100M URLs created per year
// - Each URL: short_code (7 bytes) → long_url (avg 100 bytes)
// - Read:write ratio = 100:1 (URLs are read much more than created)
// - 99th percentile read latency < 10ms
// - Retain URLs for 5 years

// Total data after 5 years:
500M URLs × (7 + 100 + overhead ~50) bytes = 500M × 157 bytes ≈ 78 GB

// Option 1: Hash index (Bitcask-style)
// In-memory: 500M × (7 + 8 + 40) bytes = 27.5 GB RAM
// On disk: 78 GB data
// Verdict: Feasible. 32 GB RAM instance. O(1) reads. Perfect fit.

// Option 2: LSM-tree (RocksDB)
// On disk: 78 GB × 1.1 (space amplification) ≈ 86 GB
// In-memory: Bloom filters + block cache ≈ 4 GB
// Read: O(log n) with 3-4 levels, 1-2 disk reads with Bloom + cache
// Verdict: Also feasible. Less RAM needed. Scales better to 5B URLs.

// Option 3: B-tree (PostgreSQL)
// On disk: 78 GB data + ~20 GB index ≈ 98 GB
// Buffer pool: 24 GB (top 3 levels of B-tree always cached)
// Read: 1-2 disk reads for cold keys, 0 for hot keys
// Verdict: Feasible. Operational simplicity of PostgreSQL is a big win.

// Recommendation: PostgreSQL with B-tree index.
// Why: read-heavy (100:1), no range queries needed, 78 GB is small,
// operational simplicity outweighs raw performance at this scale.
// Upgrade to RocksDB only if growth exceeds 10B URLs.

The Five-Minute Interview Framework

When asked "Design a [X] storage system" in an interview, use this framework:

1. Characterize the Workload (1 min)
Read/write ratio? Point lookups or range scans? Latency requirements? Data size? Growth rate? Consistency requirements?
2. Choose Storage Engine (1 min)
Match workload to engine: write-heavy → LSM. Read-heavy + transactions → B-tree. Analytics → column store. Small dataset + complex ops → in-memory. Justify with specific trade-offs.
3. Size the System (1 min)
Back-of-envelope: total data size, memory needs for index/cache, number of nodes, disk IOPS needed. Show you can translate requirements into hardware.
4. Design the Schema (1 min)
Primary key choice, secondary indexes, partition key for sharding. Each choice has trade-offs — explain them.
5. Address Failure Modes (1 min)
What happens when the disk fills up? When a node dies? When the write rate spikes 10x? Show you think about production reality, not just the happy path.

Quick Reference: When to Use What

If the interviewer says...You should think...
"Write-heavy workload"LSM-tree (sequential writes, high throughput)
"Read-heavy with point lookups"B-tree or hash index
"Range queries"B-tree or LSM-tree (NOT hash index)
"Analytics / aggregation"Column-oriented storage
"All data must fit in memory"Redis / in-memory DB
"Predictable p99 latency"B-tree (no compaction spikes)
"Time-series data"LSM-tree with timestamp keys
"Need transactions"B-tree (WAL + latches = ACID friendly)
"Geospatial queries"R-tree or space-filling curve + B-tree
"Full-text search"Inverted index (Lucene/Elasticsearch)

Performance Mental Model: The Numbers Everyone Should Know

Jeff Dean's famous "latency numbers every programmer should know," updated for 2024 hardware and applied to storage engines:

OperationLatencyStorage Engine Impact
L1 cache reference1 nsBinary search within a cached B-tree page
L3 cache reference12 nsMemtable lookup (all in CPU cache for small memtables)
DRAM access100 nsHash index lookup, Bloom filter check
NVMe SSD random read (4KB)100 μsOne B-tree leaf page read = one SSD I/O
NVMe SSD sequential read (1MB)200 μsSSTable block read during compaction or range scan
HDD seek + read (4KB)10 msOne B-tree page on spinning disk (50x slower than SSD)
Network round trip (same datacenter)500 μsReading from a remote replica or distributed storage
Network round trip (cross-region)100 msReading from a geo-replicated database
The 1000x gaps. DRAM is 1000x faster than SSD. SSD is 100x faster than HDD. Cross-region network is 1000x slower than SSD. Every storage engine decision is about managing these gaps. Cache hot data in RAM (1000x win), use sequential I/O when you must hit disk (5x win), avoid cross-datacenter reads (1000x win). When a candidate says "just add more cache," they understand the first gap. When they can articulate all three and design around them, they are staff-level.
Comprehensive scenario: You are designing the storage layer for a new social media platform. You need: (1) user profiles (read-heavy, point lookups by user_id), (2) a feed of posts (write-heavy, time-ordered), (3) analytics on engagement metrics (aggregate queries across millions of posts). Which storage engines would you use for each?

Chapter 10: Connections

Storage and retrieval is the foundation upon which everything else in database systems is built. Now that you understand how data is organized on disk (and in memory), the remaining DDIA chapters snap into focus:

TopicConnection to Storage & Retrieval
Data Models (Ch 2)Relational, document, and graph models all sit on top of the storage engines we covered. A document database might use an LSM-tree (MongoDB's WiredTiger engine) while a relational database uses a B-tree (PostgreSQL). The data model determines WHAT queries you need; the storage engine determines HOW fast those queries run.
Encoding & Evolution (Ch 4)When you write data to an SSTable or B-tree page, it must be serialized to bytes. The encoding format (JSON, Protobuf, Avro) determines how efficiently the data is stored, how quickly it can be decoded, and whether schema evolution is possible without rewriting all data.
Replication (Ch 5)The WAL (write-ahead log) that B-trees use for crash recovery is the same mechanism used for replication — ship the WAL to a replica, replay it, and you have a copy of the database. LSM-tree SSTables can also be shipped to replicas.
Partitioning (Ch 6)When data exceeds what one machine can store, you partition (shard) it. The choice of partition key interacts with the index structure: a hash-based partition key distributes writes evenly but makes range queries hit all partitions; a range-based key enables efficient range scans but can create hot spots.
Transactions (Ch 7)B-trees with WAL and latches support serializable transactions naturally. LSM-trees require additional mechanisms (lock tables, MVCC) to provide transaction isolation. The storage engine's concurrency model shapes what isolation levels are practical.

Beyond This Chapter

The storage engine landscape continues to evolve. Some frontiers worth exploring:

Learned indexes (2018). Tim Kraska et al. proposed replacing B-trees with neural networks that predict the position of a key. A small ML model trained on the data distribution can predict "key 42 is at approximately position 1,247,000" faster than traversing a B-tree. If the data has a near-uniform distribution, a linear model (2 multiplications) replaces a 4-level B-tree (4 random reads). Still mostly academic, but Google's internal systems reportedly use learned index components for specific workloads.

Persistent memory (Intel Optane). Non-volatile memory that sits between DRAM and SSD in both speed and capacity. It blurs the line between in-memory and disk-based databases, potentially making the WAL unnecessary (writes persist directly to memory). Intel discontinued Optane in 2022, but CXL-attached persistent memory from other vendors continues the concept.

Disaggregated storage (Aurora, Neon). Separate the compute (query processing) from storage (data persistence) and connect them via a fast network. The storage layer replicates and persists data independently, while compute nodes are stateless and can scale independently. Aurora DSQL (2024) pushes this further with active-active multi-writer across regions — something traditional storage engines were never designed for.

LSM-tree optimizations. WiscKey (2016) separates keys from values in LSM-trees — keys stay in the LSM-tree, values go in a separate log. This reduces write amplification dramatically because only small keys are compacted, not large values. RocksDB's BlobDB implements this pattern. For large values (images, embeddings), this can reduce write amplification from 50x to 5x.

The rise of Rust databases. A new generation of storage engines written in Rust (TiKV, SurrealDB, Qdrant, AgateDB) is challenging the C/C++ incumbents. Rust's ownership model prevents entire classes of bugs (buffer overflows, use-after-free) that have caused data corruption in production databases. The performance is comparable to C++, but the safety guarantees mean fewer catastrophic production incidents.

The Modern Landscape: Convergence

The biggest trend in storage engines today is convergence. The hard lines between categories are blurring:

TrendWhat's HappeningExamples
HTAPHybrid Transactional/Analytical Processing — one engine handles both OLTP and OLAPTiDB (B-tree for OLTP + columnar replica for OLAP), SingleStore, AlloyDB
Embedded AnalyticsColumn-oriented engines embedded inside applications, no server neededDuckDB (the SQLite of analytics), Apache DataFusion
Cloud-NativeSeparate compute from storage, use object storage (S3) as the durable layerSnowflake, Neon, Aurora DSQL
Vector IndexesNew index type for nearest-neighbor search on high-dimensional embeddingspgvector, Pinecone, Milvus (using HNSW graphs or IVF indexes)
The Kleppmann principle. "If you are selecting a storage engine for your application, you do not need to understand the internal implementation details of each engine. But when you need to tune a database to perform well on your particular workload, or when you need to make sense of mysterious performance behavior, an understanding of the internals is invaluable." — Martin Kleppmann, DDIA Chapter 3

Vector Indexes: The New Frontier

With the rise of AI and embeddings, a new category of index has emerged: the vector index. Instead of looking up keys or ranges, you search for the nearest neighbors in a high-dimensional embedding space. "Find the 10 products most similar to this one" is a vector search query.

The dominant algorithm is HNSW (Hierarchical Navigable Small World): a multi-layer graph where each node is an embedding vector. Higher layers contain fewer nodes (for fast, coarse search), lower layers contain all nodes (for precise results). A query navigates top-down, getting closer to the target at each layer. Typical query latency: 1-5ms for finding 10 nearest neighbors among 10 million vectors.

SystemTypeIndex AlgorithmBest For
pgvectorPostgreSQL extensionIVFFlat, HNSWExisting PostgreSQL users, moderate scale (<10M vectors)
PineconeManaged serviceProprietary (HNSW-based)No-ops vector search, production RAG systems
MilvusOpen-sourceIVF, HNSW, DiskANNLarge-scale (>100M vectors), on-premise deployment
WeaviateOpen-sourceHNSWAI-native applications with hybrid search
QdrantOpen-source (Rust)HNSWHigh-performance, filtering + vector search
Vector search connects to everything we learned. Vector databases face the same trade-offs as traditional databases: in-memory vs on-disk (HNSW is typically in-memory, DiskANN works on SSD), write amplification (adding vectors requires rebuilding parts of the graph), and compression (product quantization reduces vector size from 3072 bytes to 384 bytes at the cost of recall accuracy). The same mental models apply — they are just operating in 1536 dimensions instead of one.

Recommended Reading

Paper/ResourceWhy Read It
O'Neil et al., "The Log-Structured Merge-Tree" (1996)The original LSM-tree paper. Short and readable.
Bayer & McCreight, "Organization and Maintenance of Large Ordered Indexes" (1972)The original B-tree paper. Elegant and still relevant.
Chang et al., "Bigtable: A Distributed Storage System" (2006)Google's Bigtable paper. Introduced SSTables and the memtable/compaction model to a generation of engineers.
Stonebraker et al., "C-Store: A Column-oriented DBMS" (2005)The seminal column store paper. Became Vertica.
Dong et al., "Optimizing Space Amplification in RocksDB" (2017)RocksDB's compaction strategies explained by Facebook engineers.
Kraska et al., "The Case for Learned Index Structures" (2018)The provocative paper on replacing B-trees with ML models.
Lu et al., "WiscKey: Separating Keys from Values in SSD-Conscious Storage" (2016)Reducing LSM write amplification by 10x — keys in LSM, values in a separate log.
Malviya et al., "Rethinking Main Memory OLTP Recovery" (2014)How VoltDB achieves durability without traditional WAL — command logging instead.
Abadi et al., "The Design and Implementation of Modern Column-Oriented Databases" (2013)Comprehensive survey of column store techniques by a C-Store/Vertica co-inventor.

The Big Picture

Everything in this chapter comes down to one sentence: different access patterns demand different storage layouts. There is no universal "best" storage engine, just as there is no universal "best" data structure. A hash map is perfect for key lookups, terrible for sorted iteration. A sorted array is perfect for binary search, terrible for insertions. Storage engines are the same — each is optimized for a specific workload pattern.

The mark of a strong engineer is not memorizing which engine to use when. It is understanding why each engine makes the trade-offs it does, so you can reason about novel situations the way the engine designers did. If you understand that LSM-trees convert random writes into sequential writes by buffering in memory — you can predict that any write-heavy workload benefits. If you understand that B-trees keep data in a single sorted structure — you can predict that they offer better read latency and stronger transaction support. If you understand that column stores skip unnecessary data — you can predict that any analytical query benefits from columnar layout.

The principles are timeless. The specific engines will change. The trade-offs will not.

Implementation Challenges

If you want to cement your understanding, try building these (in order of difficulty):

ChallengeDifficultyWhat You Learn
Implement a Bitcask-style key-value store in Python (as in Ch 9)Easy (2 hours)Append-only logs, in-memory indexing, crash recovery via log replay
Add compaction and segment merging to the Bitcask implementationMedium (4 hours)Background compaction, atomic segment switching, tombstone handling
Implement a memtable with sorted output (red-black tree or skip list)Medium (3 hours)In-memory balanced trees, sorted iteration for SSTable generation
Build a simple SSTable writer and reader with binary searchMedium (4 hours)File format design, binary search on disk, sparse index creation
Implement a Bloom filter and integrate it with your SSTable readerEasy (1 hour)Probabilistic data structures, false positive tuning
Build a complete 2-level LSM-tree (memtable + L0 + compaction)Hard (8 hours)Full LSM lifecycle: write → memtable → flush → compaction → read
Implement a B-tree with page splits on a simulated disk (4KB pages)Hard (8 hours)Page-based storage, split propagation, WAL for crash recovery

Each challenge from the table above can be done in a single file in Python. The goal is not production quality — it is to experience the engineering trade-offs firsthand. You will remember the design decisions much better after fighting with them in code.

A Final Quote

"The task of the database engineer is to bridge an astronomically wide performance gap — a factor of 100,000 between RAM and disk — while providing the illusion that all data is instantly accessible." This gap has existed for 50 years and will exist for 50 more. B-trees, LSM-trees, column stores, and in-memory databases are all different strategies for hiding this gap from applications. Understanding them is understanding the fundamental constraint of all data systems.
Final reflection: A colleague says "Just use PostgreSQL for everything — it is battle-tested and handles any workload." In what scenario is this advice actually correct, and when does it break down?