Distributed Systems

Data Storage & Caching

Replication strategies, partitioning, NoSQL taxonomy, caching patterns, and eviction policies.

Prerequisites: Key-value stores + Basic networking. That's it.
10
Chapters
9
Simulations
0
Assumed Knowledge

Chapter 0: The Problem

You have built a web application. Users sign up, create profiles, post content. Everything lives in one PostgreSQL database on one server. Reads take 2 milliseconds. Life is good.

Then you grow. A million users. Ten million posts. Your database disk is 90% full. Read latency creeps from 2ms to 20ms because the working set no longer fits in RAM and the database starts hitting disk. Writes are even worse — every write must be persisted to disk and update indexes. Users start complaining about slow page loads.

You have three fundamental problems, and each requires a different solution:

ProblemSymptomSolution
Too much data for one machineDisk full, storage limitPartitioning — split data across machines
Too many reads for one machineHigh CPU, slow queriesReplication — copy data to read replicas
Reads are too slow (even with capacity)20ms per query, database as bottleneckCaching — keep hot data in memory

These are not just theoretical concerns. Every system that grows beyond a single machine must solve all three. And they interact: caching reduces load on replicas, replication provides read capacity that makes cache misses tolerable, and partitioning determines where data lives (which affects cache locality).

Single Database Bottleneck

Watch requests hit a single database. As load increases, latency spikes because the working set exceeds RAM. A cache absorbs the hot reads, bringing latency back down.

Click a mode to begin

With caching, 90% of reads never touch the database. They are served from memory in under a millisecond. The database only handles cache misses (10% of reads) and all writes. This is the fundamental insight of caching: most access patterns follow a power law — a small set of "hot" keys accounts for the majority of reads. If you keep those hot keys in memory, you can reduce database load by an order of magnitude.

This lesson covers the storage layer of distributed systems. We will explore how data is replicated for durability, partitioned for scale, categorized across different NoSQL families, and cached for speed. By the end, you will understand the trade-offs behind every major storage decision in a distributed system.
Your database handles 10,000 reads/sec with 5ms latency. You add 3 read replicas (4 total). What problem have you NOT solved?

Chapter 1: Replication

Replication means keeping copies of the same data on multiple machines. It serves two purposes: fault tolerance (if one machine dies, others have the data) and read scaling (spread reads across multiple copies). The fundamental challenge is keeping copies in sync when data changes.

Single-Leader Replication

One node is the leader (also called primary or master). All writes go to the leader. The leader streams changes to followers (replicas, secondaries). Reads can go to any replica. This is how PostgreSQL, MySQL, MongoDB (by default), and most traditional databases work.

Client Write
INSERT INTO users VALUES ('alice', ...)
Leader
Writes to disk, appends to WAL (Write-Ahead Log)
Replication Stream
Leader sends WAL entries to all followers
Followers
Apply WAL entries to their local copies

The critical question: does the leader wait for followers to confirm before acknowledging the write to the client? Synchronous replication waits — the client does not get "OK" until at least one follower has confirmed. Durable but slow. Asynchronous replication acknowledges immediately after the leader writes — fast but risky, because if the leader crashes before followers catch up, those writes are lost.

// Synchronous: write confirmed after follower ack
Client → Leader writes → Follower-1 writes → ACK to client
Latency: leader_write + network_RTT + follower_write ≈ 5ms + 1ms + 5ms = 11ms
Durability: guaranteed on 2 machines

// Asynchronous: write confirmed immediately
Client → Leader writes → ACK to client (followers catch up later)
Latency: leader_write ≈ 5ms
Durability: only on 1 machine until replication catches up

// Semi-synchronous (PostgreSQL default): 1 sync follower, rest async
// Best of both: fast for most, durable on 2 machines

Multi-Leader Replication

In multi-leader (also called master-master), multiple nodes accept writes. Each leader replicates to the others. This is useful for geographically distributed systems: a user in Tokyo writes to the Tokyo leader with low latency, and the change replicates to the Virginia leader asynchronously.

The poison: write conflicts. User A changes a record on leader 1 while user B changes the same record on leader 2. Both succeed locally. When the changes replicate, each leader sees a conflicting update. Which one wins? Common resolution strategies: last-write-wins (LWW, by timestamp — lossy), merge the values (application-specific), or flag the conflict for manual resolution.

Leaderless Replication

Leaderless (used by Cassandra and DynamoDB) has no designated leader. Any node can accept writes. The client writes to multiple nodes simultaneously. Reads also query multiple nodes. Consistency is achieved through quorum: if there are N replicas, a write succeeds if W nodes confirm, and a read succeeds if R nodes respond. As long as W + R > N, you are guaranteed to read your own writes.

// Quorum formula
N = 3 replicas
W = 2 (write to at least 2 nodes)
R = 2 (read from at least 2 nodes)
W + R = 4 > N = 3 → overlapping quorum, guaranteed consistency

// Fast reads (eventual consistency)
W = 3, R = 1 → writes are slow but reads are fast

// Fast writes (eventual consistency)
W = 1, R = 3 → writes are fast but reads are slow
Replication Strategies

Watch how writes propagate under each strategy. Single-leader: all writes go to one node. Multi-leader: writes go to any leader. Leaderless: writes go to a quorum.

Pick a strategy
The replication trade-off triangle. You can optimize for any two: (1) low write latency (do not wait for replicas), (2) strong consistency (read your own writes), (3) high availability (survive node failures). Single-leader with sync replication gives you consistency + availability but sacrifices latency. Leaderless with W=1 gives latency + availability but sacrifices consistency. Multi-leader with conflict resolution gives latency + availability with eventual consistency.
You use leaderless replication with N=5, W=3, R=3. A write updates nodes 1, 2, 3. A read queries nodes 3, 4, 5. Does the read see the latest write?

Chapter 2: Partitioning

Replication copies data for redundancy. Partitioning (also called sharding) splits data for capacity. Each partition holds a subset of the data on a different machine. Together, all partitions hold the full dataset. This is how you go beyond what one machine can store or process.

Range Partitioning

Assign each partition a continuous range of keys. Keys A-M go to partition 1, N-Z go to partition 2. Efficient for range queries ("all users whose name starts with 'S'") because you only query one or two partitions. But vulnerable to hotspots: if keys are timestamps, all recent writes land on one partition.

Hash Partitioning

Hash each key and assign the hash to a partition. partition = hash(key) % N. Distributes data evenly regardless of key patterns — no hotspots. But range queries are impossible: adjacent keys hash to random partitions, so "all users from A to C" requires querying every partition.

// Range partitioning: good for scans, bad for hotspots
key="2024-06-15" → partition 3 (owns June 2024)
key="2024-06-16" → partition 3 (same month, same partition)
Range query "June 2024": only partition 3 (efficient!)
But: ALL writes in June go to partition 3 (hotspot!)

// Hash partitioning: good for distribution, bad for scans
key="2024-06-15" → hash = 0x4A2B → partition 2
key="2024-06-16" → hash = 0xD103 → partition 0
Adjacent dates land on different partitions (no hotspot!)
But: range query "June 2024" must scan ALL partitions (expensive!)

Compound Partitioning

Some systems combine both. Cassandra uses the first part of the key for hash partitioning (to distribute data) and the second part for range ordering within each partition (to support efficient range queries on that second dimension).

// Cassandra compound key: (user_id, timestamp)
// Partition key: hash(user_id) → distributes users across nodes
// Clustering key: timestamp → sorted within each partition

Query: "all posts by user 42 from June" →
1. hash("user42") → partition 7 (single node!)
2. Range scan on timestamp within partition 7 (efficient!)

// Best of both worlds for this access pattern
Range vs. Hash Partitioning

Watch keys distribute across 4 partitions. Range partitioning creates hotspots with sequential keys. Hash partitioning distributes evenly.

Pick a strategy
The partitioning-replication combination. Real systems use both. You partition for capacity (5 partitions, each holding 20% of the data) and replicate each partition for durability (3 copies of each). That is 5 x 3 = 15 node assignments. A node might hold partition 1's primary copy, partition 3's replica, and partition 5's replica. This is how Kafka, Cassandra, and CockroachDB work.
You partition user data by hash(user_id) across 4 nodes. A query asks for "all users who signed up in January." How many nodes must you query?

Chapter 3: NoSQL Taxonomy

Not all data fits neatly into rows and columns. The NoSQL movement (which really means "not only SQL") emerged because relational databases are a poor fit for some workloads. There are four major families, each optimized for different access patterns.

Key-Value Stores

The simplest model: a key maps to a blob of data. The store does not care what the value contains. GET(key) and PUT(key, value) are the only operations. Because the store does not need to parse or index the value, it can be extremely fast — sub-millisecond reads.

Use cases: session storage, user preferences, shopping carts, caching. Examples: Redis, Memcached, DynamoDB (simple mode), etcd.

Trade-off: No queries on values. You cannot say "find all users where age > 30" without reading every value and checking. If you need to query by multiple attributes, a key-value store forces you to maintain secondary indexes manually.

Document Stores

Like key-value, but the store understands the value structure — it is a JSON (or BSON) document. This means you can create indexes on fields within the document and query on any field. Documents are self-contained: a user document includes their name, address, orders, and preferences in one blob.

Use cases: content management, user profiles, product catalogs — anything with variable structure. Examples: MongoDB, CouchDB, Firestore.

Trade-off: Great for reading a full document, terrible for joins across documents. If a user's order references a product document, you need two reads (or denormalize and risk inconsistency).

Wide-Column Stores

Data is organized by row key, column family, and column name. Think of it as a two-dimensional key: (row_key, column_name) → value. Each row can have a different set of columns — there is no fixed schema. Columns within a family are stored together on disk, enabling efficient reads of specific column subsets.

Use cases: time-series data, analytics, event logs — any workload with massive row counts and queries that touch specific column subsets. Examples: Cassandra, HBase, BigTable.

Graph Databases

Store data as nodes (entities) and edges (relationships). Queries traverse edges: "find all friends of friends who work at company X." Relational databases can do this with joins, but each hop requires a join, and joining 6 levels deep is prohibitively slow. Graph databases index edges natively, making traversals constant-time per hop.

Use cases: social networks, fraud detection, recommendation engines, knowledge graphs. Examples: Neo4j, Amazon Neptune, TigerGraph.

NoSQL Families

Click each family to see its data model, query pattern, and strengths.

FamilyData modelQuery styleBest for
Key-Valuekey → opaque blobGET/PUT by keyCaching, sessions, config
Documentkey → JSON documentQuery any field, index fieldsCMS, profiles, catalogs
Wide-Column(row, column) → valueScan rows, filter columnsTime-series, analytics, logs
Graphnodes + edgesTraverse relationshipsSocial, fraud, recommendations
NoSQL does not mean "no schema." It means "schema on read" instead of "schema on write." A relational database enforces structure when you write (INSERT fails if columns do not match). A document store accepts any JSON — but your application still needs to understand the structure when it reads. The schema exists; it is just enforced in code, not in the database. This is flexible but dangerous: bugs that write malformed documents are not caught until read time.
You need to find all users who are friends-of-friends with user X and also purchased product Y. Which database type handles this most efficiently?

Chapter 4: Cache-Aside Pattern

The most common caching pattern in production systems is cache-aside (also called lazy loading or look-aside). The application code is responsible for managing the cache — the cache and database do not talk to each other directly.

The Flow

Read Request
Application receives GET /user/42
Check Cache
cache.get("user:42") — returns value or NULL
↓ cache miss
Query Database
db.query("SELECT * FROM users WHERE id=42")
Populate Cache
cache.set("user:42", result, ttl=300)
Return to Client
Next read of user:42 hits cache (fast path)
// Cache-aside in code
def get_user(user_id):
  # Step 1: check cache
  cached = redis.get(f"user:{user_id}")
  if cached:
    return json.loads(cached) # Cache HIT — fast path

  # Step 2: cache miss — query database
  user = db.query("SELECT * FROM users WHERE id = %s", user_id)

  # Step 3: populate cache for next time
  redis.setex(f"user:{user_id}", 300, json.dumps(user)) # TTL = 5 min

  return user

Why Cache-Aside Works

Only hot data gets cached. The cache fills organically: data that is read often gets cached automatically (because it is read, found missing, fetched, and cached). Data that is never read is never cached. The cache naturally reflects the access pattern. No upfront work to decide what to cache.

Cache failures are survivable. If the cache goes down, the application falls back to the database. Latency increases but the system does not break. This is a critical advantage over patterns where the cache is in the write path.

The Staleness Problem

When data is updated in the database, the cache still holds the old value. There are two strategies:

TTL-based expiration: Set a TTL (Time To Live) on every cached entry. After 5 minutes, the entry expires and the next read fetches from the database. Staleness window = TTL. Simple but imprecise.

Explicit invalidation: When updating data, also delete the cache entry. The next read will miss, fetch the new value, and repopulate. More consistent but requires discipline — every write path must remember to invalidate.

// Write + invalidate
def update_user(user_id, data):
  db.execute("UPDATE users SET ... WHERE id = %s", user_id, data)
  redis.delete(f"user:{user_id}") # Invalidate cache
  # Next read will miss, fetch from DB, and repopulate

// DANGER: if invalidate fails (network blip), cache holds stale data
// until TTL expires. Always set a TTL as a safety net!
Cache-Aside Pattern

Watch reads flow through the cache. First access = miss (goes to DB). Subsequent accesses = hit (served from cache). Click "Update DB" to see invalidation.

Click Read to start
Cache-aside is the default for 90% of production systems. It is simple, resilient, and self-tuning. The main gotchas: (1) the first read for every key is slow (cold cache / "thundering herd" if many requests hit the same uncached key simultaneously), (2) stale data if invalidation is missed, (3) the application code must handle both hit and miss paths.
Your cache-aside system has a 95% hit rate and 5ms cache latency vs. 50ms DB latency. What is the average read latency?

Chapter 5: Write-Through, Write-Behind, Read-Through

Cache-aside puts the application in charge. But there are patterns where the cache itself coordinates with the database, simplifying application code at the cost of more complex infrastructure.

Write-Through

Every write goes to the cache first. The cache synchronously writes to the database before returning. The application writes to the cache, and the cache handles persistence. The cache always has the latest data — no staleness.

// Write-through flow
App writes "user:42" → Cache
Cache writes "user:42" → Database (synchronously)
Cache returns success to App

// Advantage: cache is always up-to-date (no stale reads)
// Disadvantage: every write has DB latency (slow writes)
// Disadvantage: writes data that may never be read (wasted cache space)

Write-Behind (Write-Back)

The cache accepts the write and returns immediately. The cache asynchronously flushes to the database later — either after a delay or in batches. This makes writes extremely fast (memory speed) but introduces a durability risk: if the cache crashes before flushing, those writes are lost.

// Write-behind flow
App writes "user:42" → Cache (returns immediately, ~0.5ms)
Cache queues write → flushes to DB every 5 seconds (in batch)

// Advantage: writes are memory-speed fast
// Advantage: batching reduces DB write load
// Disadvantage: data loss if cache crashes before flush
// Disadvantage: complex failure handling

Read-Through

Read-through is like cache-aside but the cache, not the application, fetches from the database on a miss. The application always reads from the cache. On a miss, the cache itself queries the DB, stores the result, and returns it. This simplifies application code — the application does not need miss-handling logic.

Write Pattern Comparison

Watch how writes flow through each pattern. Notice the latency and durability differences.

Pick a pattern
PatternWrite latencyRead consistencyDurabilityComplexity
Cache-AsideDB speed (writes bypass cache)Stale until TTL/invalidationHigh (writes go to DB first)In application
Write-ThroughDB speed (cache writes through)Always freshHighIn cache layer
Write-BehindMemory speed (async flush)Always fresh in cacheRisk of lossIn cache layer
Read-ThroughN/A (read pattern)Fresh after first readN/AIn cache layer
In practice: cache-aside + write-through is the most common combination. Reads use cache-aside (check cache, fall back to DB). Writes use write-through (write to cache and DB). This gives you always-fresh reads with simple application code. Write-behind is used when write throughput is critical and you can tolerate some data loss risk (e.g., view counters, analytics events).
Your system uses write-behind caching with a 10-second flush interval. The cache server crashes. What happens to the writes made in the last 10 seconds?

Chapter 6: Cache Eviction

Cache memory is finite. Your database might hold 1 TB of data, but your Redis instance has 64 GB of RAM. When the cache is full and a new entry needs to be stored, something must be removed. The eviction policy decides what gets kicked out.

LRU (Least Recently Used)

Remove the entry that has not been accessed for the longest time. The intuition: if you have not read a key in a while, you probably will not read it soon. This is the most popular eviction policy because it works well for most access patterns.

Implementation: A doubly-linked list + hash map. On every access, move the entry to the head of the list. When evicting, remove from the tail. Both operations are O(1). Redis uses an approximated LRU that samples a few random keys and evicts the least recently used among the sample — this avoids the overhead of maintaining a true LRU list.

LFU (Least Frequently Used)

Remove the entry that has been accessed the fewest times. The intuition: entries that are rarely accessed are less valuable. LFU is better than LRU when there are "scan" workloads — a one-time bulk read of many keys would flush the entire LRU cache, but LFU keeps the truly popular keys because their frequency count is high.

Trade-off: LFU has a "cold start" problem. A new entry has a low frequency count and gets evicted quickly, even if it is about to become popular. Some implementations decay frequency counts over time to address this.

TTL (Time To Live)

Entries expire after a fixed duration regardless of access pattern. This is not really an eviction policy — it is an expiration policy. But it serves the same purpose: keeping the cache fresh and bounded. TTL is typically used alongside LRU: entries can be evicted by LRU when the cache is full, or they can expire by TTL even if the cache is not full.

LRU vs. LFU Eviction

A cache with 5 slots. Watch entries arrive and get evicted. LRU evicts the oldest-accessed; LFU evicts the least-frequently-accessed.

Pick a policy
// Redis eviction policies
noeviction # Return error when memory full (never lose data)
allkeys-lru # LRU among all keys (most common)
allkeys-lfu # LFU among all keys
volatile-lru # LRU among keys with TTL set
volatile-lfu # LFU among keys with TTL set
allkeys-random # Random eviction (surprisingly effective)
volatile-ttl # Evict keys with shortest remaining TTL
The scan problem with LRU. Imagine your cache holds the 1,000 most popular items. A batch job reads 10,000 items once. Under LRU, those 10,000 one-time reads flush all 1,000 popular items from the cache. After the batch completes, your hit rate drops from 95% to 0% while the cache refills. LFU avoids this because the popular items have high frequency counts that survive a one-time scan. This is why Redis added LFU support in version 4.0.
Your LRU cache has 5 slots: [A(5min ago), B(1min), C(3min), D(10min), E(2min)]. A new key F arrives. Which is evicted?

Chapter 7: Local vs. External Cache

When we say "cache," we usually mean an external cache like Redis. But there is an older, simpler kind: keeping cached data right in your application's memory. Each approach has sharp trade-offs.

Local (In-Process) Cache

Store cached data in a hash map inside your application process. Reads are a memory lookup — nanoseconds, no network hop. This is as fast as caching gets.

// Python local cache (simple dict)
_cache = {}
def get_user(user_id):
  if user_id in _cache:
    return _cache[user_id] # ~100ns, no network
  user = db.query(user_id) # ~5ms, database round trip
  _cache[user_id] = user
  return user

The problem: if you have 10 application instances (for horizontal scaling), each has its own cache. Write to one instance's database, and the other 9 still have stale data. There is no shared state. Cache invalidation becomes a distributed coordination problem.

Also, local caches consume application heap memory. A large cache means more garbage collection pressure, potentially causing latency spikes. And when the application restarts (deploy, crash), the cache is lost — every key starts cold.

External (Shared) Cache

A separate cache server (Redis, Memcached) shared by all application instances. When instance 1 caches a value, instance 2 can read it. When the value is invalidated, all instances see the invalidation immediately.

// External cache (Redis)
App-1 writes: redis.set("user:42", data) # 1ms network round trip
App-2 reads: redis.get("user:42") # 1ms, gets the same data
App-3 reads: redis.get("user:42") # 1ms, also hits

// Invalidation is global:
App-1 writes: redis.delete("user:42") # All instances see the miss

The trade-off: every cache access requires a network round trip (~1ms), compared to ~100ns for a local lookup. That is 10,000x slower. For extremely hot keys accessed millions of times per second, this overhead matters.

Local vs. External Cache

3 app instances share data. Watch how local caching leads to inconsistency while external caching stays synchronized.

Pick a mode

The Two-Tier Pattern

In practice, many systems use both: a small local cache for the hottest keys (L1) backed by a shared Redis cache (L2), backed by the database (L3). This gives you nanosecond latency for the top 100 keys, millisecond latency for the top 10,000 keys, and database latency for everything else.

// Two-tier caching
def get_user(user_id):
  # L1: local cache (100ns, small, per-instance)
  if user_id in local_cache:
    return local_cache[user_id]

  # L2: Redis (1ms, large, shared)
  cached = redis.get(f"user:{user_id}")
  if cached:
    local_cache[user_id] = cached # Promote to L1
    return cached

  # L3: database (5-50ms)
  user = db.query(user_id)
  redis.setex(f"user:{user_id}", 300, user)
  local_cache[user_id] = user
  return user
Local caches must have short TTLs. Since local caches cannot be invalidated externally, use TTLs of 5-30 seconds. This bounds staleness: after a write, within 30 seconds every instance's local cache expires and fetches from Redis (which was already invalidated). For most use cases, 30 seconds of staleness is acceptable.
You have 8 app instances with local caching. User 42 updates their profile on instance 1. How many instances still serve stale data?

Chapter 8: Cache-Aside Simulator

Time to see caching in action. Below is an interactive cache-aside simulation. You control the cache size, TTL, and access pattern. Watch the hit rate climb as the cache warms up, then inject writes and see how invalidation affects performance.

Your mission: Achieve a 90%+ hit rate. Experiment with cache sizes and TTLs. Watch what happens when you switch from Zipfian (realistic) to uniform (worst-case) access patterns. Kill the cache and watch the database suffer.
Cache-Aside Pattern Simulator

Requests arrive for keys 1-100. The cache holds a limited number of entries. Watch hits (fast) and misses (slow) in real time.

Cache Size 20 slots
TTL (sec) 10s
Press Play to begin

What to Try

Experiment 1: Zipfian with 20 slots. Watch the hit rate climb above 80% quickly. A Zipfian distribution means a few keys are accessed much more often than others — these "hot" keys stay in cache.

Experiment 2: Switch to uniform access. Every key is equally likely. The hit rate drops dramatically because there are no "hot" keys to cache. This is the worst case for caching.

Experiment 3: Lower TTL to 2 seconds. Entries expire quickly, forcing more database reads. The hit rate drops even with Zipfian access. This is the cost of freshness — short TTLs mean more misses.

Experiment 4: Increase cache to 50 slots. With 50 of 100 possible keys cached, even uniform access gets a ~50% hit rate. More cache memory directly translates to fewer database reads.

Chapter 9: Connections

Data storage and caching are foundational layers. Here is how they connect to the broader distributed systems landscape.

What We Covered

ConceptKey takeaway
ReplicationSingle-leader for simplicity, multi-leader for geo-distribution, leaderless for availability
PartitioningRange for scans, hash for even distribution, compound for both
NoSQL familiesKV for speed, document for flexibility, wide-column for analytics, graph for relationships
Cache-asideApplication manages cache; only hot data cached; resilient to cache failure
Write patternsWrite-through for consistency, write-behind for speed, read-through for simplicity
EvictionLRU for general use, LFU for scan-resistant workloads, TTL as safety net
Local vs. externalLocal for nanosecond reads; external for consistency; two-tier for both

Where to Go Next

TopicConnection
Load BalancingConsistent hashing in load balancers enables stateful caching at the LB layer
Service ArchitectureService meshes add caching at the sidecar level; API gateways often include response caching
MessagingEvent-driven cache invalidation: database publishes changes to a message queue, cache subscribers invalidate entries
ConsensusLeader election in single-leader replication; distributed locks for cache stampede prevention
Storage and caching decisions define your system's personality. A system with aggressive caching and eventual consistency feels fast and forgiving. A system with write-through caching and synchronous replication feels slow but trustworthy. There is no universally correct choice — only the right choice for your access patterns, consistency requirements, and failure tolerance.
You want sub-millisecond reads for the top 100 keys, millisecond reads for the top 10,000, and can tolerate 5ms for everything else. What architecture achieves this?