Distributed Systems

Partitioning & Storage

Range partitioning, consistent hashing, rebalancing strategies, and blob storage architecture — splitting data that outgrows one machine.

Prerequisites: Basic data structures + Hashing fundamentals. That's it.
10
Chapters
8+
Simulations
0
Assumed Knowledge

Chapter 0: The Single-Node Wall

Your database stores 500 million user profiles. Each profile is 2KB. Total data: 1 terabyte. Your biggest server has 256 GB of RAM. The entire dataset doesn't fit in memory, so the database pages to disk. Disk I/O is 100x slower than memory access. Query latency degrades from 2ms to 200ms. Users notice.

You could buy a bigger machine. The biggest available has 24 TB of RAM and costs $80/hour. That's $700,000/year for a single server — which is still a single point of failure. And next year, you'll have 1 billion profiles and need another upgrade.

The alternative: split the data across multiple machines. Each machine holds a subset of the data. This is partitioning (also called sharding). Instead of one machine with 1 TB, you have four machines with 250 GB each. Each machine's data fits in memory. Query latency is back to 2ms.

Partitioning is about write scalability. Replication gives you more read capacity (every replica can serve reads). But every write still goes to every replica. Partitioning splits writes across machines: writes for user IDs 0-999 go to Node A, writes for 1000-1999 go to Node B. This is the ONLY way to scale writes beyond what a single machine can handle.

Partitioning vs. Replication

PropertyReplicationPartitioning
What it scalesRead throughput, availabilityWrite throughput, storage capacity
Data per nodeFull copy on every nodeSubset on each node
Failure impactOther replicas take overData on failed partition unavailable until recovery
Typical comboBoth together: partition data across shards, replicate each shard 3x

The Core Question

Partitioning sounds simple: split the data. But how? You need a partition key — a rule that determines which partition owns each record. The choice of partition key determines everything: whether queries are fast (hit one partition) or slow (scan all partitions), whether data is evenly distributed or skewed to one hot node.

The Single-Node Wall

Watch data grow and eventually overwhelm a single node. Then partition it across multiple nodes and see query latency recover.

Click "Add Data" to grow beyond one node's capacity.
Quick check: A single database node handles 10,000 writes/sec. You need 40,000 writes/sec. Replication won't help (writes go to all replicas). What's the minimum number of partitions needed?

Chapter 1: Range Partitioning

The simplest partitioning strategy: divide the key space into contiguous ranges. Each partition owns a range. If the partition key is alphabetical (e.g., username), Partition 1 gets A-F, Partition 2 gets G-L, Partition 3 gets M-R, Partition 4 gets S-Z.

How Range Partitioning Works

// Key space: user IDs (integers)
Partition 0: keys 0 - 249,999,999
Partition 1: keys 250,000,000 - 499,999,999
Partition 2: keys 500,000,000 - 749,999,999
Partition 3: keys 750,000,000 - 999,999,999

// Query: WHERE user_id = 372,000,000
// → Route to Partition 1 (250M-499M)
// Single-partition query: fast (no scatter-gather)

// Range query: WHERE user_id BETWEEN 240M AND 260M
// → Partitions 0 and 1 (query spans the boundary)
// Only 2 of 4 partitions need to be queried

The big advantage: range queries are efficient. If you need all records with timestamps between Monday and Friday, and your partition key is timestamp, you only hit the partitions that cover that range. Hash partitioning (covered next) destroys range query efficiency because adjacent keys end up on different partitions.

The Hotspot Problem

Range partitioning has a critical weakness: hotspots. If your partition key is timestamp, and you're inserting data with the current timestamp, ALL writes go to the last partition (the one covering "now"). The other partitions are idle. One node is overloaded while the rest sit idle.

// Time-based range partitioning → hotspot
Partition 0: Jan 1 - Mar 31 (historical, read-only)
Partition 1: Apr 1 - Jun 30 (historical, read-only)
Partition 2: Jul 1 - Sep 30 (historical, read-only)
Partition 3: Oct 1 - Dec 31 (ALL current writes go here → HOTSPOT!)

// Fix: compound partition key
// Partition key = (sensor_id, timestamp)
// Writes spread across partitions by sensor_id, range queries on timestamp still work within a sensor
Real example: HBase row key design. HBase uses range partitioning. A common mistake is using timestamp as the row key — all writes go to the last region. The fix: prefix the row key with a hash of the timestamp or a modular shard number. This spreads writes across regions while preserving some locality within each shard.
Range Partitioning Visualizer

Four partitions with range boundaries. Insert records and watch how they distribute. Notice the hotspot when inserting sequential keys.

Insert records to see how they distribute across range partitions.
Check: Your partition key is user_id (auto-incrementing integer) and you have 4 range partitions (0-25K, 25K-50K, 50K-75K, 75K-100K). Your app is growing fast, creating ~100 new users/sec. Where do all new user writes go?

Chapter 2: Hash Partitioning

To avoid hotspots, apply a hash function to the partition key, then use the hash value to determine the partition. A good hash function spreads even sequential inputs uniformly across the output range.

// Hash partitioning:
partition_number = hash(key) % num_partitions

// Example with 4 partitions:
hash("user_1001") = 2847392 → 2847392 % 4 = 0 → Partition 0
hash("user_1002") = 9182736 → 9182736 % 4 = 0 → Partition 0
hash("user_1003") = 5638291 → 5638291 % 4 = 3 → Partition 3
hash("user_1004") = 7291038 → 7291038 % 4 = 2 → Partition 2

// Sequential keys → uniform distribution (no hotspot!)

The Trade-off

PropertyRange PartitioningHash Partitioning
Point queriesFast (one partition)Fast (one partition)
Range queriesEfficient (few partitions)Expensive (ALL partitions — scatter-gather)
Write distributionCan be skewed (hotspots)Uniform
Adding/removing nodesCan split/merge rangesNaive: rehash everything (see next chapter)
The rehashing catastrophe. With naive hash(key) % N, changing N (adding or removing a node) changes the partition assignment for almost EVERY key. If you go from 4 to 5 partitions, the fraction of keys that stay on the same partition is roughly 1/5 = 20%. That means 80% of your data needs to move. This is why naive hash partitioning is impractical for dynamic clusters. The solution: consistent hashing (next chapter).
Hash vs. Range Distribution

Insert sequential keys and see how range partitioning creates hotspots while hash partitioning distributes evenly.

Insert keys to compare distribution between hash and range partitioning.
Check: You have 8 partitions using hash(key) % 8. You need to add a 9th partition. Approximately what fraction of keys need to move?

Chapter 3: Consistent Hashing

Consistent hashing solves the rehashing catastrophe. Instead of hash(key) % N, it uses a hash ring: both keys AND nodes are mapped onto a circular hash space (0 to 232-1). Each key is assigned to the first node you encounter walking clockwise around the ring from the key's position.

How the Ring Works

// Hash ring: 0 to 2^32 - 1 (wraps around)

// Nodes placed on ring by hashing their identity:
hash("Node-A") = 1,000,000 → position on ring
hash("Node-B") = 2,500,000 → position on ring
hash("Node-C") = 3,800,000 → position on ring

// Key assigned to next node clockwise:
hash("user_42") = 1,200,000 → between Node-A (1M) and Node-B (2.5M)
→ Assigned to Node-B (first node clockwise)

// Adding Node-D at position 2,000,000:
// Keys between 1M and 2M move from Node-B to Node-D
// Keys everywhere else stay put!
// Only ~1/N of keys move (much better than N-1/N)

Why It's Consistent

When you add or remove a node, only keys in the range between the new/removed node and its predecessor need to move. With N nodes, adding one node moves approximately 1/N of keys. Compare to naive hashing where (N-1)/N keys move.

OperationNaive hash(key) % NConsistent hashing
Add 1 node (4→5)~80% of keys move~20% of keys move
Add 1 node (100→101)~99% of keys move~1% of keys move
Remove 1 node~same percentage move~1/N of keys move
Who uses consistent hashing? Amazon Dynamo (the internal KV store, not DynamoDB), Apache Cassandra, Riak, Akamai CDN, Discord, and most distributed caches (Memcached, Redis Cluster). It's one of the most important algorithms in distributed systems engineering.
Consistent Hashing Ring

A hash ring with nodes and keys. Add/remove nodes to see how few keys actually move.

Add nodes and keys to the consistent hashing ring.
Check: A consistent hashing ring has 10 nodes and 10,000 keys evenly distributed. You add 1 node. Approximately how many keys need to migrate?

Chapter 4: Virtual Nodes

Basic consistent hashing has a problem: with few nodes, the key distribution can be very uneven. If you have 3 nodes on the ring, one might own 50% of the key space while another owns 15%. This creates load imbalance — one node is overloaded while others are underutilized.

The fix: virtual nodes (also called vnodes). Instead of placing each physical node at one position on the ring, place it at multiple positions. A node with 128 virtual nodes appears at 128 different points on the ring. Keys are still assigned to the nearest clockwise node, but now "nearest" means the nearest virtual node, which maps back to a physical node.

// Without virtual nodes (3 physical nodes):
Node-A: 1 position → owns 40% of ring
Node-B: 1 position → owns 35% of ring
Node-C: 1 position → owns 25% of ring
// Standard deviation: ~7.6%

// With 128 virtual nodes each:
Node-A: 128 positions → owns ~33.2% of ring
Node-B: 128 positions → owns ~33.5% of ring
Node-C: 128 positions → owns ~33.3% of ring
// Standard deviation: ~0.15% (nearly perfect)

The Math of Load Balance

With V virtual nodes per physical node and N physical nodes, each physical node owns approximately 1/N of the key space. The standard deviation of load decreases as O(1/√V). With V=128, the load imbalance is tiny.

Cassandra uses 256 vnodes by default. When a new node joins the cluster, it takes ownership of 256 random ranges from existing nodes. This ensures that the load from the new node is spread across all existing nodes, not just one neighbor. The downside: more vnodes means more metadata to track and more small data transfers during rebalancing.
Virtual Nodes Distribution

Adjust the number of virtual nodes per physical node and see how load distribution improves. The bar chart shows each node's share of the key space.

Virtual nodes per node 1
Slide to increase virtual nodes and watch load balance improve.
Check: You have 5 physical nodes, each with 100 virtual nodes. You add a 6th physical node (also with 100 vnodes). Approximately what fraction of keys migrates to the new node?

Chapter 5: Rebalancing

Nodes join and leave clusters. Rebalancing is the process of moving data to restore even distribution. It's operationally the most dangerous operation in a distributed database — moving data while serving live traffic, without downtime or data loss.

Rebalancing Strategies

StrategyHow it worksProsCons
Fixed partitionsCreate many more partitions than nodes. Partitions move between nodes.No data splitting. Simple assignment.Must choose partition count upfront. Wrong choice is hard to fix.
Dynamic splittingWhen a partition grows too large, split it in half. Like B-tree page splits.Adapts to data size. No upfront choice needed.Split is expensive. Empty database starts with one partition (cold start).
Proportional to nodesFixed number of partitions per node. Adding a node randomly splits existing partitions.Automatic, proportional scaling.More complex implementation.

Fixed Partitions (Most Common)

// Create 1024 partitions for 4 initial nodes:
Node-A: partitions 0-255 (256 partitions)
Node-B: partitions 256-511 (256 partitions)
Node-C: partitions 512-767 (256 partitions)
Node-D: partitions 768-1023 (256 partitions)

// Add Node-E: steal some partitions from each node:
Node-A: 0-204 (205 partitions)
Node-B: 256-460 (205 partitions)
Node-C: 512-716 (205 partitions)
Node-D: 768-972 (205 partitions)
Node-E: 205-255, 461-511, 717-767, 973-1023 (204 partitions)
// Each existing node gave ~51 partitions to Node-E
// Data moved: ~20% of total (1/5, as expected)
Choosing the right partition count. Too few partitions: can't rebalance smoothly. Too many: metadata overhead (each partition has routing info). Rule of thumb: start with 10-50x more partitions than your maximum expected node count. Elasticsearch defaults to 5 primary shards per index. Cassandra uses 256 vnodes. Kafka uses a configurable number of partitions per topic (usually 6-12 for typical workloads).
Rebalancing Simulator

12 fixed partitions across nodes. Add or remove nodes and watch partitions migrate to restore balance.

Add or remove nodes to see partition rebalancing.
Check: You have 100 fixed partitions across 5 nodes (20 each). You add a 6th node. The system rebalances. Approximately how many partitions does the new node receive, and from how many existing nodes?

Chapter 6: Blob Storage Architecture

Not all data is structured rows and columns. Images, videos, logs, backups, and machine learning models are blobs (Binary Large Objects) — unstructured data ranging from kilobytes to terabytes. Cloud blob storage systems (like S3) handle exabytes of this data. How do they partition and store it?

The Three-Layer Architecture

Large-scale blob storage systems use a layered architecture that separates concerns:

1. Front-End / Gateway
Receives HTTP requests (PUT/GET). Authenticates, rate-limits, routes to the correct partition. Stateless — scales horizontally.
2. Partition Layer
Maps object keys to physical storage locations. Maintains the index: key → (extent_id, offset, length). Uses range or hash partitioning on the object key. This is the "brain" of the system.
3. Extent / Storage Layer
Stores the actual bytes on disk. Data is grouped into large chunks called "extents" (~1-3 GB each). Each extent is replicated 3x across failure domains (racks, AZs). Writes append to extents. Garbage collection reclaims space from deleted objects.

How a Write Works

// Client: PUT /bucket/photos/sunset.jpg (5MB file)

Gateway: authenticate, extract bucket + key
→ Route key "photos/sunset.jpg" to partition shard 7

Partition layer (shard 7):
→ Find or allocate an extent with enough free space
→ Extent E42 has 800MB free, append the 5MB blob
→ Record index entry: "photos/sunset.jpg" → {extent: E42, offset: 2.1GB, len: 5MB}

Extent layer:
→ Write 5MB to extent E42 on primary extent node
→ Replicate to 2 secondary extent nodes (different racks)
→ ACK to partition layer when 2/3 have written

Partition layer: ACK to gateway
Gateway: 200 OK to client

// Total internal operations: ~10 RPCs, ~100ms end-to-end
Why extents? Storing each object as a separate file on disk wastes space (filesystem metadata overhead) and creates millions of tiny files that overwhelm the filesystem. By packing many objects into large extents, the system minimizes metadata and enables efficient sequential disk I/O. The trade-off: deleting an object doesn't immediately free disk space — the space is reclaimed later during garbage collection (compaction).

Erasure Coding: Beyond Simple Replication

Three-way replication has 200% storage overhead (3 copies = 3x the disk). For cold data (rarely accessed), blob storage systems use erasure coding: split data into k fragments, encode into n fragments (n > k), where any k of n fragments can reconstruct the original. This achieves similar fault tolerance with much less overhead.

// Reed-Solomon (6,4) erasure coding:
Original: 4 data fragments
Encoded: 4 data + 2 parity = 6 total fragments
Fault tolerance: any 2 fragments can be lost
Storage overhead: 6/4 = 1.5x (vs. 3x for triple replication)

// S3 uses (10,8) for infrequent-access data:
8 data + 2 parity = 10 fragments across 10 AZs
Overhead: 10/8 = 1.25x
Durability: 99.999999999% (11 nines)
Blob Storage Write Path

Watch a blob write flow through the three-layer architecture. The partition layer maps the key to an extent, and the extent layer replicates the data.

Click "Write Blob" to trace the write path through all layers.
Check: A blob storage system uses (6,4) erasure coding. One of the 6 storage nodes fails permanently. Can the data still be read? What does the system do?

Chapter 7: Partition Strategies in Practice

How do real systems choose their partitioning strategy? The answer depends on access patterns, and getting it wrong is one of the most common causes of performance issues in distributed databases.

Real-World Partition Key Design

SystemPartition StrategyKey DesignWhy
DynamoDBHashPartition key (e.g., user_id)Uniform distribution, point queries dominant
CassandraHash (Murmur3)Partition key + clustering columnsEven distribution + sorted data within partition
HBase/BigTableRange (sorted)Row key (byte-ordered)Range scans over time-series data
KafkaHash (or custom)Message key → partitionOrdering within partition, even distribution across
ElasticsearchHash_routing field (default: _id)Even distribution, all fields searchable via scatter-gather
CockroachDBRangePrimary key (sorted)Range scans, automatic splitting on size

Secondary Indexes Across Partitions

Partitioning by primary key is straightforward. But what about secondary indexes? If you partition users by user_id, how do you efficiently query "all users in city=London"?

// Two approaches to secondary indexes:

Local index (document-based):
Each partition maintains its OWN index over its data.
Query "city=London" → must ask ALL partitions (scatter-gather).
Write: fast (only update local index).
Read: slow for secondary key queries (fan-out to all partitions).
Used by: MongoDB, Cassandra, DynamoDB, Elasticsearch

Global index (term-based):
A separate partition scheme for the index itself.
city=London → all matching user_ids stored in one index partition.
Query: fast (hit one index partition).
Write: slow (may need to update index on a different node).
Used by: DynamoDB GSI, Oracle distributed indexes
The scatter-gather trade-off. Local indexes make writes fast but reads fan out. Global indexes make reads fast but writes fan out. There's no free lunch. The right choice depends on your read/write ratio. If you read 100x more than you write, optimize for reads (global index). If reads and writes are balanced, local indexes usually win because the write path is simpler.
Local vs. Global Secondary Index

Compare query patterns for local (scatter-gather) vs. global (single-partition) secondary indexes.

Compare local vs. global secondary index query patterns.
Check: You have 100 partitions. A query uses a local secondary index. How many partitions must be contacted?

Chapter 8: Consistent Hashing Ring

This is the showcase simulation. A full consistent hashing ring with nodes, virtual nodes, and keys. Add and remove nodes to see exactly which keys migrate. Watch the load distribution histogram update in real time. Compare the behavior with 1 vnode per node vs. 32 vnodes per node.

Experiment: Start with 3 nodes and 1 vnode each. Notice how uneven the distribution is. Increase to 32 vnodes and see it smooth out. Add a 4th node and count exactly how many keys move. Remove a node and see the keys redistribute to the nearest remaining node.
Interactive Consistent Hashing Ring

A hash ring with nodes (triangles) and keys (dots). Add/remove nodes to see key migration. The histogram shows load per node.

Virtual nodes per node 1
Start with 3 nodes. Add keys and adjust vnodes to explore consistent hashing.

Key observations:

ScenarioWhat to observe
1 vnode, 3 nodesHighly uneven distribution — one node may own 50%+
32 vnodes, 3 nodesNearly even — each node owns ~33%
Add a nodeOnly ~1/N keys migrate (from the predecessor)
Remove a nodeIts keys transfer to the successor — no other keys move

Chapter 9: Connections

Partitioning is the foundation of horizontal scalability. Combined with replication, it enables systems that handle petabytes of data and millions of operations per second across thousands of nodes.

The Partitioning Decision Tree

If your workload is...Use thisExample systems
Point queries (key-value lookups)Hash partitioningCassandra, DynamoDB, Redis Cluster
Range scans (time-series, ordered data)Range partitioningHBase, CockroachDB, Spanner
Full-text searchHash + scatter-gatherElasticsearch, Solr
Large unstructured objectsHash + extent storageS3, Azure Blob, GCS
Streaming events (ordered per key)Hash within topicKafka, Kinesis, Pulsar
The golden rule of partitioning: choose your partition key based on your most common query pattern, not your data model. If you mostly query by user_id, partition by user_id — even if your data model is centered around orders. The most expensive operation in a partitioned system is the scatter-gather query that must hit all partitions.
Partitioning Landscape

Where different partitioning strategies sit on the range-query vs. write-distribution spectrum.

Related lessons:

"Premature optimization is the root of all evil. But failing to partition when you need to is the root of all outages." — Paraphrased from Knuth