Range partitioning, consistent hashing, rebalancing strategies, and blob storage architecture — splitting data that outgrows one machine.
Your database stores 500 million user profiles. Each profile is 2KB. Total data: 1 terabyte. Your biggest server has 256 GB of RAM. The entire dataset doesn't fit in memory, so the database pages to disk. Disk I/O is 100x slower than memory access. Query latency degrades from 2ms to 200ms. Users notice.
You could buy a bigger machine. The biggest available has 24 TB of RAM and costs $80/hour. That's $700,000/year for a single server — which is still a single point of failure. And next year, you'll have 1 billion profiles and need another upgrade.
The alternative: split the data across multiple machines. Each machine holds a subset of the data. This is partitioning (also called sharding). Instead of one machine with 1 TB, you have four machines with 250 GB each. Each machine's data fits in memory. Query latency is back to 2ms.
| Property | Replication | Partitioning |
|---|---|---|
| What it scales | Read throughput, availability | Write throughput, storage capacity |
| Data per node | Full copy on every node | Subset on each node |
| Failure impact | Other replicas take over | Data on failed partition unavailable until recovery |
| Typical combo | Both together: partition data across shards, replicate each shard 3x | |
Partitioning sounds simple: split the data. But how? You need a partition key — a rule that determines which partition owns each record. The choice of partition key determines everything: whether queries are fast (hit one partition) or slow (scan all partitions), whether data is evenly distributed or skewed to one hot node.
Watch data grow and eventually overwhelm a single node. Then partition it across multiple nodes and see query latency recover.
The simplest partitioning strategy: divide the key space into contiguous ranges. Each partition owns a range. If the partition key is alphabetical (e.g., username), Partition 1 gets A-F, Partition 2 gets G-L, Partition 3 gets M-R, Partition 4 gets S-Z.
The big advantage: range queries are efficient. If you need all records with timestamps between Monday and Friday, and your partition key is timestamp, you only hit the partitions that cover that range. Hash partitioning (covered next) destroys range query efficiency because adjacent keys end up on different partitions.
Range partitioning has a critical weakness: hotspots. If your partition key is timestamp, and you're inserting data with the current timestamp, ALL writes go to the last partition (the one covering "now"). The other partitions are idle. One node is overloaded while the rest sit idle.
Four partitions with range boundaries. Insert records and watch how they distribute. Notice the hotspot when inserting sequential keys.
To avoid hotspots, apply a hash function to the partition key, then use the hash value to determine the partition. A good hash function spreads even sequential inputs uniformly across the output range.
| Property | Range Partitioning | Hash Partitioning |
|---|---|---|
| Point queries | Fast (one partition) | Fast (one partition) |
| Range queries | Efficient (few partitions) | Expensive (ALL partitions — scatter-gather) |
| Write distribution | Can be skewed (hotspots) | Uniform |
| Adding/removing nodes | Can split/merge ranges | Naive: rehash everything (see next chapter) |
hash(key) % N, changing N (adding or removing a node) changes the partition assignment for almost EVERY key. If you go from 4 to 5 partitions, the fraction of keys that stay on the same partition is roughly 1/5 = 20%. That means 80% of your data needs to move. This is why naive hash partitioning is impractical for dynamic clusters. The solution: consistent hashing (next chapter).Insert sequential keys and see how range partitioning creates hotspots while hash partitioning distributes evenly.
Consistent hashing solves the rehashing catastrophe. Instead of hash(key) % N, it uses a hash ring: both keys AND nodes are mapped onto a circular hash space (0 to 232-1). Each key is assigned to the first node you encounter walking clockwise around the ring from the key's position.
When you add or remove a node, only keys in the range between the new/removed node and its predecessor need to move. With N nodes, adding one node moves approximately 1/N of keys. Compare to naive hashing where (N-1)/N keys move.
| Operation | Naive hash(key) % N | Consistent hashing |
|---|---|---|
| Add 1 node (4→5) | ~80% of keys move | ~20% of keys move |
| Add 1 node (100→101) | ~99% of keys move | ~1% of keys move |
| Remove 1 node | ~same percentage move | ~1/N of keys move |
A hash ring with nodes and keys. Add/remove nodes to see how few keys actually move.
Basic consistent hashing has a problem: with few nodes, the key distribution can be very uneven. If you have 3 nodes on the ring, one might own 50% of the key space while another owns 15%. This creates load imbalance — one node is overloaded while others are underutilized.
The fix: virtual nodes (also called vnodes). Instead of placing each physical node at one position on the ring, place it at multiple positions. A node with 128 virtual nodes appears at 128 different points on the ring. Keys are still assigned to the nearest clockwise node, but now "nearest" means the nearest virtual node, which maps back to a physical node.
With V virtual nodes per physical node and N physical nodes, each physical node owns approximately 1/N of the key space. The standard deviation of load decreases as O(1/√V). With V=128, the load imbalance is tiny.
Adjust the number of virtual nodes per physical node and see how load distribution improves. The bar chart shows each node's share of the key space.
Nodes join and leave clusters. Rebalancing is the process of moving data to restore even distribution. It's operationally the most dangerous operation in a distributed database — moving data while serving live traffic, without downtime or data loss.
| Strategy | How it works | Pros | Cons |
|---|---|---|---|
| Fixed partitions | Create many more partitions than nodes. Partitions move between nodes. | No data splitting. Simple assignment. | Must choose partition count upfront. Wrong choice is hard to fix. |
| Dynamic splitting | When a partition grows too large, split it in half. Like B-tree page splits. | Adapts to data size. No upfront choice needed. | Split is expensive. Empty database starts with one partition (cold start). |
| Proportional to nodes | Fixed number of partitions per node. Adding a node randomly splits existing partitions. | Automatic, proportional scaling. | More complex implementation. |
12 fixed partitions across nodes. Add or remove nodes and watch partitions migrate to restore balance.
Not all data is structured rows and columns. Images, videos, logs, backups, and machine learning models are blobs (Binary Large Objects) — unstructured data ranging from kilobytes to terabytes. Cloud blob storage systems (like S3) handle exabytes of this data. How do they partition and store it?
Large-scale blob storage systems use a layered architecture that separates concerns:
Three-way replication has 200% storage overhead (3 copies = 3x the disk). For cold data (rarely accessed), blob storage systems use erasure coding: split data into k fragments, encode into n fragments (n > k), where any k of n fragments can reconstruct the original. This achieves similar fault tolerance with much less overhead.
Watch a blob write flow through the three-layer architecture. The partition layer maps the key to an extent, and the extent layer replicates the data.
How do real systems choose their partitioning strategy? The answer depends on access patterns, and getting it wrong is one of the most common causes of performance issues in distributed databases.
| System | Partition Strategy | Key Design | Why |
|---|---|---|---|
| DynamoDB | Hash | Partition key (e.g., user_id) | Uniform distribution, point queries dominant |
| Cassandra | Hash (Murmur3) | Partition key + clustering columns | Even distribution + sorted data within partition |
| HBase/BigTable | Range (sorted) | Row key (byte-ordered) | Range scans over time-series data |
| Kafka | Hash (or custom) | Message key → partition | Ordering within partition, even distribution across |
| Elasticsearch | Hash | _routing field (default: _id) | Even distribution, all fields searchable via scatter-gather |
| CockroachDB | Range | Primary key (sorted) | Range scans, automatic splitting on size |
Partitioning by primary key is straightforward. But what about secondary indexes? If you partition users by user_id, how do you efficiently query "all users in city=London"?
Compare query patterns for local (scatter-gather) vs. global (single-partition) secondary indexes.
This is the showcase simulation. A full consistent hashing ring with nodes, virtual nodes, and keys. Add and remove nodes to see exactly which keys migrate. Watch the load distribution histogram update in real time. Compare the behavior with 1 vnode per node vs. 32 vnodes per node.
A hash ring with nodes (triangles) and keys (dots). Add/remove nodes to see key migration. The histogram shows load per node.
Key observations:
| Scenario | What to observe |
|---|---|
| 1 vnode, 3 nodes | Highly uneven distribution — one node may own 50%+ |
| 32 vnodes, 3 nodes | Nearly even — each node owns ~33% |
| Add a node | Only ~1/N keys migrate (from the predecessor) |
| Remove a node | Its keys transfer to the successor — no other keys move |
Partitioning is the foundation of horizontal scalability. Combined with replication, it enables systems that handle petabytes of data and millions of operations per second across thousands of nodes.
| If your workload is... | Use this | Example systems |
|---|---|---|
| Point queries (key-value lookups) | Hash partitioning | Cassandra, DynamoDB, Redis Cluster |
| Range scans (time-series, ordered data) | Range partitioning | HBase, CockroachDB, Spanner |
| Full-text search | Hash + scatter-gather | Elasticsearch, Solr |
| Large unstructured objects | Hash + extent storage | S3, Azure Blob, GCS |
| Streaming events (ordered per key) | Hash within topic | Kafka, Kinesis, Pulsar |
Where different partitioning strategies sit on the range-query vs. write-distribution spectrum.
Related lessons:
"Premature optimization is the root of all evil. But failing to partition when you need to is the root of all outages." — Paraphrased from Knuth