Hashing, chaining, open addressing, perfect hashing — O(1) lookup for everything.
You are building a spell checker. Your dictionary contains 100,000 words. A user types a word and you need to answer one question: is this word in the dictionary? How fast can you answer?
The simplest approach: store all 100K words in an unsorted array. To check a word, scan every element. That is linear search — O(n) time. For 100K words, that means up to 100,000 comparisons per lookup. If the user types 50 words per minute, your spell checker makes 5 million comparisons per minute. For a single lookup this is already slow. For a database serving millions of queries per second, it is catastrophic.
Can we do better? Sort the array and use binary search: O(log n). For 100K words, that is about 17 comparisons per lookup. Dramatically better. But we can do better still.
What if we could answer in O(1) — constant time, regardless of how many words are in the dictionary? Not 17 comparisons, not 100,000. Just one. That is the promise of a hash table: a data structure that maps any key to an array index using a hash function, then looks up the value at that index directly.
The simulation below lets you see the difference. Click each lookup strategy and watch how many steps it takes to find (or not find) a word in a dictionary of varying size.
Click "Lookup" to search for a key. Watch the number of comparisons for each strategy.
Notice the pattern: as dictionary size grows, linear search becomes painfully slow, binary search grows gently (logarithmically), and hash table lookup stays flat — just 1 or 2 steps regardless of size. That constant-time lookup is why hash tables are the most widely used data structure in practice.
Let us put some numbers on this. A modern CPU can compare two 64-bit integers in about 1 nanosecond. For a dictionary of n = 1,000,000 keys:
| Method | Comparisons | Time (approx) |
|---|---|---|
| Linear scan | 1,000,000 (worst case) | 1 millisecond |
| Binary search | 20 (log₂ 10⁶) | 20 nanoseconds |
| Hash table | 1-2 (expected) | 1-2 nanoseconds |
Binary search is already fast — 20 ns is plenty for most applications. But hash tables are 10x faster still. When your database serves 100,000 queries per second, that 10x matters. When your compiler resolves millions of symbol lookups during compilation, it matters even more.
Hash tables power Python dictionaries, JavaScript objects, database indexes, caches, compilers (symbol tables), routers (IP lookup), and virtually every application that needs fast key-value access. Understanding how they work — and when they break — is fundamental.
Here is the roadmap for this lesson. We will build hash tables from the ground up, starting with the simplest possible version (direct addressing), then adding the hash function (Chapter 2), handling collisions with chaining (Chapter 3) and open addressing (Chapter 4), analyzing performance mathematically (Chapter 5), achieving worst-case O(1) with perfect hashing (Chapter 6), surveying real-world implementations (Chapter 7), and finally practicing design and debugging patterns (Chapters 8-9).
By the pigeonhole principle, if you have more possible keys than table slots (|U| > m), at least two keys must map to the same slot. But collisions happen much sooner than you might expect. The birthday paradox tells us: in a room of just 23 people, there is a >50% chance two share a birthday (out of 365 days). Similarly, with a hash table of m slots, after inserting only about √(2m) keys, you have a >50% chance of a collision. For m = 1000 slots, that is just ~45 keys — not 1000.
This is not a flaw of hash tables. It is a mathematical certainty. The entire field of hash table design is about gracefully handling collisions, not avoiding them.
Before we build a hash table, let us start with the simplest possible version. Suppose your keys are integers in the range [0, m-1]. Then you can just use an array of size m and store the value for key k at position T[k]. Done.
This is a direct-address table. Insert, delete, and search are all O(1) — you just compute the index and access it directly. No hash function needed, no collisions possible. It is the dream data structure... with one fatal flaw.
What if your keys are Social Security numbers (9 digits, range 000000000 to 999999999)? You need an array of 1 billion slots, even if you only store 1,000 people. That is 109 slots for 103 keys — a 99.9999% waste of memory. At 8 bytes per slot, that is 8 GB for a table holding 1,000 entries.
What if your keys are strings? The universe of possible strings is effectively infinite. A 10-character ASCII string has 12810 ≈ 1021 possibilities. You cannot allocate an array of 1021 slots. Even if your keys are short English words (say, up to 20 characters), the universe is impossibly large for direct addressing.
What if your keys are IPv4 addresses? The range is 0 to 232-1, which is about 4 billion. A direct-address table would need 4 billion slots. That is feasible (16 GB at 4 bytes per slot), and in practice some routers do use direct addressing for IP lookups when memory is available. But for IPv6 addresses (128 bits), the universe has 2128 ≈ 3.4 × 1038 possible keys. Direct addressing is impossible.
The simulation below shows a direct-address table. Insert a few keys and notice how most slots remain empty. This waste is exactly what hash tables fix.
Enter a key (0-15) and a value, then click Insert. Notice the empty slots.
python class DirectAddressTable: def __init__(self, m): """Create table with slots 0..m-1, all initially None.""" self.table = [None] * m # O(m) space def insert(self, key, value): self.table[key] = value # O(1) def search(self, key): return self.table[key] # O(1) — returns None if absent def delete(self, key): self.table[key] = None # O(1)
Every operation is a single array access. No comparisons, no loops, no searching. The index IS the key. The cost? Space is O(m) where m is the universe size, not the number of stored keys n. When m >> n (which is the common case for real applications), this is unacceptable.
Think about the trade-off as a spectrum. On one end: direct addressing uses space O(|U|) but achieves O(1) time with no collisions. On the other end: a single linked list uses space O(n) but requires O(n) time. Hash tables sit in the sweet spot: space O(n) (or slightly more) and O(1) expected time, at the cost of potential collisions.
| Operation | Time | Space |
|---|---|---|
| Insert | O(1) | O(m) where m = universe size |
| Search | O(1) | |
| Delete | O(1) |
Direct addressing works when the universe is small. Hash tables work when it is not. The key insight: even if the universe is huge (all possible strings, all possible 64-bit integers), the number of keys you actually store is typically much smaller. A hash function compresses the universe [0, |U|-1] down to [0, m-1] where m is a manageable size proportional to the number of stored keys n.
This compression comes at a cost: two different keys might map to the same slot. That is a collision, and dealing with collisions is the central challenge of hash table design. The next eight chapters are about choosing hash functions that minimize collisions and resolving the collisions that inevitably occur.
A hash table uses an array of size m (much smaller than the universe U) and a hash function h that maps each key k in U to a slot h(k) in [0, m-1]. The value for key k is stored at position T[h(k)].
The problem is obvious: since |U| > m, multiple keys must map to the same slot. When h(k₁) = h(k₂) for k₁ ≠ k₂, we have a collision. A good hash function minimizes collisions by distributing keys as uniformly as possible across all m slots.
The simplest hash function: divide the key by m and take the remainder.
If keys are strings, first convert to an integer. A common approach: treat each character as a digit in a base-128 number. For the string "cat": c=99, a=97, t=116, so the numeric key is 99×128² + 97×128 + 116 = 1,634,420.
Multiply the key by a constant A (0 < A < 1), extract the fractional part, then scale to [0, m-1].
Here "k · A mod 1" means "take the fractional part of k·A." Let us trace through an example to make this concrete:
The beauty: the value of m does not matter much — you can use any m, even a power of 2. Knuth suggests A ≈ (√5 - 1)/2 ≈ 0.6180339887, the golden ratio conjugate, which distributes keys particularly well.
Why the golden ratio? The three-distance theorem says that the fractional parts {k · A} for k = 1, 2, 3, ... partition [0,1] into at most three distinct gap sizes. For A = the golden ratio, these gaps are as equal as possible — giving the most uniform distribution. This is why the golden ratio appears everywhere in quasi-random sequences, low-discrepancy sampling, and hash functions.
In practice, the multiplication method is implemented using integer arithmetic for speed. Choose w = word size (32 or 64 bits), and let s = ⌊A · 2w⌋. Then:
No floating-point needed. The multiply-and-shift extracts the middle bits of the product, which are the most "mixed" bits — they depend on all bits of both k and s. This is why Knuth calls it a "multiply-shift" hash.
Both the division and multiplication methods are deterministic. An adversary who knows which hash function you use can craft keys that all collide, degrading your O(1) table to O(n). This is not theoretical — hash collision attacks have been used against web servers to cause denial-of-service.
Universal hashing defends against this by choosing the hash function randomly at runtime from a family of functions. No adversary can choose bad keys without knowing which function was selected.
A universal family for integer keys: pick a prime p ≥ |U|, then choose random a ∈ {1, ..., p-1} and b ∈ {0, ..., p-1}.
For any two distinct keys k₁ ≠ k₂, the probability of collision is at most 1/m — exactly the same as if h were truly random. This is the definition of a universal hash family.
Let us verify this with a concrete example. Let p = 17, m = 5. Choose a = 3, b = 7. Then:
Keys 0 and 4 collide (both map to slot 2), and keys 1 and 5 collide (both map to slot 0). But for a different random choice of a and b, the collisions would be different. That is the power of universality: no fixed set of keys can consistently cause collisions across all hash functions in the family.
Enter numeric keys and see where they land with different hash functions. Watch for collisions (red).
python import random def hash_division(k, m): return k % m def hash_multiplication(k, m, A=0.6180339887): return int(m * ((k * A) % 1)) class UniversalHash: def __init__(self, m, p=104729): """Pick random a, b at creation time.""" self.m = m self.p = p # prime ≥ universe size self.a = random.randint(1, p - 1) self.b = random.randint(0, p - 1) def __call__(self, k): return ((self.a * k + self.b) % self.p) % self.m
Let us hash the keys 42, 130, and 77 with table size m = 11 using all three methods.
Division method (h(k) = k mod m):
Multiplication method (h(k) = ⌊m · (k · 0.618 mod 1)⌋):
Notice: the multiplication method spread all three keys to different slots, while the division method caused a collision between 42 and 130. This does not mean multiplication is always better — it depends on the specific key distribution. But it illustrates why having multiple hash function strategies matters.
Most theoretical analysis of hash tables assumes simple uniform hashing (SUH): each key is equally likely to hash to any of the m slots, independently of where other keys hash. This is an idealization — no deterministic hash function can achieve this because the distribution depends on the actual key set. But universal hashing comes close: for any fixed set of keys, a randomly chosen universal hash function distributes them nearly uniformly in expectation.
The SUH assumption lets us derive clean expected-case bounds. When your hash function is good (close to uniform), these bounds are accurate in practice. When your hash function is bad (clustering, many collisions), performance degrades. The gap between theory and practice is entirely about hash function quality.
What if your key is not a simple integer but a struct with multiple fields? For example, a (first_name, last_name) pair. The standard approach: combine the hashes of individual fields using a mixing function.
python # Bad: XOR loses information (hash("AB","CD") == hash("CD","AB")) def bad_hash(a, b): return hash(a) ^ hash(b) # symmetric — order doesn't matter! # Good: multiply-and-add preserves order def good_hash(a, b): return hash(a) * 31 + hash(b) # asymmetric — ("AB","CD") != ("CD","AB") # Best: use Python's built-in tuple hashing def best_hash(a, b): return hash((a, b)) # Python handles it
The "multiply by a prime and add" pattern is used by Java (31), Boost (multiply by golden ratio and XOR), and many others. The key requirement: the combining function must be order-sensitive (hash(a,b) ≠ hash(b,a)) and avalanching (a small change in any field should change many bits of the output).
Two keys hash to the same slot. Now what? The simplest strategy: store all keys that hash to slot j in a linked list hanging off that slot. This is called chaining (or separate chaining).
Insert(k, v): Compute h(k) — this takes O(1). Prepend the (k, v) pair to the head of the linked list at slot h(k). Time: O(1). We prepend rather than append because prepending to a singly-linked list is O(1) (just update the head pointer), while appending is O(nj) (walk to the end).
Search(k): Compute h(k). Walk the linked list at slot h(k), comparing each key to k. If found, return the value. If you reach the end of the list without finding k, the key is not in the table. Time: O(length of the chain at slot h(k)).
Delete(k): Search for k in its chain, then remove the node. With a doubly-linked list, once you have a pointer to the node, removal is O(1) (update the predecessor's next pointer and the successor's previous pointer). Total: O(search time) = O(length of chain). With a singly-linked list, you also need the predecessor, so you track it during the search.
Define the load factor as α = n/m, where n is the number of stored keys and m is the table size. The load factor is the average chain length. If you have 100 keys in a table with 50 slots, α = 2, meaning each chain has 2 elements on average.
Insert keys and watch chains grow. Adjust table size to see how load factor affects performance.
Insert keys 5, 28, 19, 15, 20, 33, 12, 17, 10, 0 into a hash table with m = 7 using h(k) = k mod 7.
| Key k | h(k) = k mod 7 | Slot |
|---|---|---|
| 5 | 5 mod 7 = 5 | 5 |
| 28 | 28 mod 7 = 0 | 0 |
| 19 | 19 mod 7 = 5 | 5 (collision with 5) |
| 15 | 15 mod 7 = 1 | 1 |
| 20 | 20 mod 7 = 6 | 6 |
| 33 | 33 mod 7 = 5 | 5 (collision with 5, 19) |
| 12 | 12 mod 7 = 5 | 5 (collision with 5, 19, 33) |
| 17 | 17 mod 7 = 3 | 3 |
| 10 | 10 mod 7 = 3 | 3 (collision with 17) |
| 0 | 0 mod 7 = 0 | 0 (collision with 28) |
Result: slot 0 has chain [0, 28], slot 1 has [15], slot 2 is empty, slot 3 has [10, 17], slot 4 is empty, slot 5 has [12, 33, 19, 5], slot 6 has [20]. Load factor: α = 10/7 ≈ 1.43. The longest chain has 4 elements (slot 5). An unsuccessful search in slot 5 requires 4 comparisons. This is why keeping α ≤ 1 matters.
Why is the expected unsuccessful search time Θ(1 + α)? Here is the argument, step by step.
Under simple uniform hashing, each of the n keys is equally likely to hash to any of the m slots. The expected number of keys in any given slot j is n/m = α. An unsuccessful search for key k first computes h(k) in O(1), then traverses the entire chain at slot h(k). The expected chain length is α. Total: 1 + α.
For a successful search, the analysis is slightly different. We search for a key that is in the table. The expected probes depend on where the key sits in the chain. On average, the key is at position (chain length)/2 from the head. The expected chain length is 1 + (n-1)/m ≈ 1 + α. So the expected successful search examines about α/2 keys past the first, giving Θ(1 + α/2).
When α exceeds a threshold (typically 0.75), we double the table size and rehash all n keys. This costs O(n). But how often does it happen?
If the table starts at size m = 1 and we double each time it fills, after inserting 2k elements the table has been resized k times. The total rehashing cost is 1 + 2 + 4 + ... + 2k = 2k+1 - 1 < 2n. So the total cost of n insertions is O(n) for the inserts themselves plus O(n) for all resizing combined — that is O(1) amortized per insertion.
python class ChainingHashTable: def __init__(self, m=7): self.m = m self.table = [[] for _ in range(m)] self.n = 0 def _hash(self, key): return key % self.m def insert(self, key, value): h = self._hash(key) # Check if key exists — update if so for i, (k, v) in enumerate(self.table[h]): if k == key: self.table[h][i] = (key, value) return self.table[h].append((key, value)) self.n += 1 def search(self, key): h = self._hash(key) for k, v in self.table[h]: if k == key: return v return None def delete(self, key): h = self._hash(key) self.table[h] = [(k, v) for k, v in self.table[h] if k != key] self.n -= 1 def load_factor(self): return self.n / self.m
Chaining uses linked lists, which means pointer overhead and cache-unfriendly memory access. Open addressing takes a different approach: store all elements directly in the table array. No linked lists. No pointers. When a collision occurs, you probe — try another slot, then another, until you find an empty one.
The probe sequence is the order in which slots are examined. Given a key k and a probe number i (starting from 0), the probe function h(k, i) returns the slot to try on the i-th probe.
Try the home slot h'(k), then the next slot, then the next, wrapping around. Simple and cache-friendly (sequential memory access), but suffers from primary clustering: long runs of occupied slots tend to grow longer, because any key that hashes into the cluster must traverse the entire cluster to find an empty slot, extending it further.
Instead of stepping by 1 each time, step by increasing amounts (1, 3, 5, 7... with c₁=0, c₂=1, or similar). This breaks up primary clusters but introduces secondary clustering: keys with the same home slot follow the same probe sequence. Better than linear, but not ideal.
Use a second hash function to determine the step size. Different keys with the same home slot will have different step sizes, so their probe sequences diverge immediately. This produces the best distribution among the three methods — close to the theoretical ideal of uniform probing.
Consider m = 7, linear probing, and keys arriving in this order: 14, 21, 28, 35, 7, 0.
All six keys hash to slot 0, creating a cluster from slots 0 through 5. The last key needs 6 probes. Now any new key hashing to slots 0-5 must traverse the entire cluster. This is the worst case for linear probing — all keys collide. With a better hash function (one that does not map all these multiples of 7 to the same slot), this would not happen.
With double hashing and h₂(k) = 1 + (k mod 6):
Different step sizes mean the probe sequences diverge immediately. Key 28 with step size 5 probes slot 0 (full), then slot 5 (empty) — just 2 probes instead of linear probing's 3. The savings compound as the table fills.
Insert keys and watch the probe sequences. Notice how linear probing creates clusters while double hashing spreads keys out.
Table size m = 11 (prime). Hash function h'(k) = k mod 11. Insert keys: 10, 22, 31, 4, 15, 28, 17, 88, 59.
| Key | h'(k) | Probes | Final slot |
|---|---|---|---|
| 10 | 10 | slot 10 (empty) | 10 |
| 22 | 0 | slot 0 (empty) | 0 |
| 31 | 9 | slot 9 (empty) | 9 |
| 4 | 4 | slot 4 (empty) | 4 |
| 15 | 4 | slot 4 (occupied by 4), slot 5 (empty) | 5 |
| 28 | 6 | slot 6 (empty) | 6 |
| 17 | 6 | slot 6 (occupied), slot 7 (empty) | 7 |
| 88 | 0 | slot 0 (occupied), slot 1 (empty) | 1 |
| 59 | 4 | slot 4 (occ), 5 (occ), 6 (occ), 7 (occ), 8 (empty) | 8 |
Notice key 59: it hashes to slot 4 but needs 5 probes because of the cluster from slots 4 through 7. This is primary clustering in action. With double hashing, 59 would jump by h₂(59) = 1 + (59 mod 10) = 10 slots per step, landing at slot (4 + 10) mod 11 = 3 on the second probe — directly to an empty slot.
Deletion is tricky. You cannot simply set a slot to "empty" because it might break a probe chain. If key A was placed past slot j because j was occupied, and you later delete the occupant of slot j, a search for A will stop at the now-empty slot j and incorrectly report A as absent.
The solution: use a special DELETED sentinel (also called a tombstone). Search treats DELETED slots as occupied (keep probing). Insert treats them as empty (can place a new key there). This works but degrades performance over time as tombstones accumulate.
To search for key k, follow the same probe sequence used for insertion: compute h(k,0), h(k,1), h(k,2), ... until you either find k (success) or hit an empty slot (k is not in the table). If you encounter a DELETED tombstone, keep probing — the key might be beyond it.
python class OpenAddressHashTable: DELETED = "__DELETED__" def __init__(self, m=11): self.m = m self.table = [None] * m self.n = 0 def _probe(self, key, i): # Linear probing return (hash(key) + i) % self.m def insert(self, key, value): for i in range(self.m): slot = self._probe(key, i) if self.table[slot] is None or self.table[slot] == self.DELETED: self.table[slot] = (key, value) self.n += 1 return if self.table[slot][0] == key: self.table[slot] = (key, value) # update return raise Exception("Table full") def search(self, key): for i in range(self.m): slot = self._probe(key, i) if self.table[slot] is None: return None # empty slot = not found if self.table[slot] != self.DELETED and self.table[slot][0] == key: return self.table[slot][1] return None def delete(self, key): for i in range(self.m): slot = self._probe(key, i) if self.table[slot] is None: return # not found if self.table[slot] != self.DELETED and self.table[slot][0] == key: self.table[slot] = self.DELETED # tombstone self.n -= 1 return
| Property | Chaining | Open Addressing |
|---|---|---|
| Memory layout | Pointer-chasing (bad for cache) | Contiguous array (cache-friendly) |
| Space overhead | Extra pointer per entry | No extra pointers |
| Load factor | Can exceed 1.0 (chains grow) | Must stay below 1.0 |
| Deletion | Clean — just remove from list | Tombstones — accumulate, degrade perf |
| Clustering | No clustering | Primary/secondary clustering possible |
| Best for | Unknown key count, frequent deletes | Known size, read-heavy, low α |
We have seen that hash table performance depends on the load factor α = n/m. But how exactly? In this chapter we derive the expected probe counts and build an interactive graph that reveals the dramatic difference between probing strategies.
Under simple uniform hashing, the expected number of probes for different strategies is:
| Strategy | Unsuccessful Search | Successful Search |
|---|---|---|
| Chaining | 1 + α | 1 + α/2 |
| Linear Probing | ½(1 + 1/(1-α)²) | ½(1 + 1/(1-α)) |
| Double Hashing | 1/(1-α) | -(1/α) ln(1-α) |
Let us unpack what these formulas mean with concrete numbers:
| α | Chaining (unsucc) | Linear Probing (unsucc) | Double Hashing (unsucc) |
|---|---|---|---|
| 0.25 | 1.25 | 1.39 | 1.33 |
| 0.50 | 1.50 | 2.50 | 2.00 |
| 0.75 | 1.75 | 8.50 | 4.00 |
| 0.90 | 1.90 | 50.50 | 10.00 |
| 0.95 | 1.95 | 200.50 | 20.00 |
At 90% full, linear probing needs 50 probes on average for an unsuccessful search. At 95% full, it needs 200. Chaining at 95% needs just 1.95. This is why you must resize before the table gets too full.
Drag the vertical line to see expected probe counts at any load factor. Watch how linear probing explodes near α = 1. The red dashed line marks the typical resize threshold of 0.75.
Under the uniform hashing assumption (each probe sequence is equally likely to be any permutation of [0, m-1]), the probability that the first probe hits an occupied slot is n/m = α. Given that, the probability the second is also occupied is (n-1)/(m-1) ≤ α. The probability of needing at least i probes is at most αi-1.
The expected number of probes is:
This is the geometric series. At α = 0.5, we get 2 probes. At α = 0.9, we get 10. At α = 0.99, we get 100. The function has a vertical asymptote at α = 1 (table full), which is why you must never let a hash table fill completely.
For a successful search, the expected probes depend on when the key was inserted. A key inserted when the load factor was α' needed 1/(1-α') probes on average. Averaging over all n keys (inserted at load factors 0/m, 1/m, ..., (n-1)/m):
This sum is a harmonic series fragment. Using the integral approximation ∑ 1/(m-i) ≈ ln(m/(m-n)) = ln(1/(1-α)), we get:
At α = 0.5: (1/0.5)·ln(2) ≈ 1.39 probes. At α = 0.9: (1/0.9)·ln(10) ≈ 2.56 probes. Even at high load, successful searches are much cheaper than unsuccessful ones because the key was inserted when the table was less full.
Linear probing does not satisfy the uniform hashing assumption because its probe sequences overlap. If keys A, B, C hash to consecutive slots 5, 6, 7, they form a cluster. Any future key hashing to slots 3, 4, 5, 6, or 7 must probe through the entire cluster. The cluster grows at its boundaries, and larger clusters grow faster — a positive feedback loop.
Knuth's analysis (1963!) showed that the expected probes for linear probing are:
The unsuccessful search has (1-α)² in the denominator instead of (1-α). At α = 0.9, this gives 50.5 probes vs double hashing's 10. The squared term comes from the clustering effect: you need to find a gap in the cluster, and the expected gap size shrinks quadratically with load factor.
When do you resize, and what does it cost? Most implementations use a doubling strategy: when α exceeds a threshold (typically 0.75), create a new table of size 2m and rehash all n elements.
The rehashing costs O(n) — you must recompute the hash and insert every element into the new, larger table. But this O(n) cost is spread over the O(n) insertions that filled the table to the threshold. Using the accounting method of amortized analysis:
This is the same argument used for dynamic arrays (ArrayList, Python list). Although individual operations can be expensive (O(n) for a resize), the expensive operations are rare enough that the average cost per operation remains O(1).
A visual way to think about it: plot the cumulative cost of n insertions. Without resizing, the line is straight (slope 1). With resizing, the line has occasional jumps (when a resize happens) but the slope is still O(1) — the jumps are bounded by the line 3n.
Most implementations only grow, never shrink. If you insert 1 million keys and then delete 999,999, you still have a table of size ~1.3 million for a single key — massive memory waste. To fix this, some implementations halve the table when α drops below 0.25. But shrinking adds complexity: you must avoid thrashing (repeatedly growing and shrinking at the threshold boundary). The standard solution uses asymmetric thresholds: grow at α = 0.75, shrink at α = 0.25. The gap between 0.25 and 0.75 provides hysteresis that prevents thrashing.
Theory assumes uniform hashing. Does it hold in practice? The simulation below inserts N random keys and measures actual probe counts, overlaid on the theoretical curves.
Click "Run Experiment" to insert random keys and measure actual probe counts. Circles are measured values; lines are theoretical predictions.
Everything we have seen so far gives O(1) expected time. Can we guarantee O(1) worst-case? If we know all keys in advance (a static set), the answer is yes. The technique is called perfect hashing.
The idea: use two levels of hashing. The first level is an ordinary hash table with chaining. If nj keys hash to slot j, the second level is a small hash table of size nj² with its own hash function — chosen to have zero collisions.
Wait — a table of size nj² for just nj keys? That sounds wasteful. But by the birthday paradox in reverse: if you hash nj keys into nj² slots using a random universal hash function, the expected number of collisions is less than 1/2. So by trying a few random hash functions, you can quickly find one with no collisions at all.
Click "Build Perfect Hash" to construct a two-level hash table from the given keys. Notice: NO collisions at the second level.
| Property | Chaining | Open Addressing | Perfect Hashing |
|---|---|---|---|
| Search (worst) | O(n) | O(n) | O(1) |
| Search (expected) | O(1) | O(1) | O(1) |
| Insert | O(1) | O(1) amortized | O(n) rebuild |
| Space | O(n + m) | O(m) | O(n) |
| Static or dynamic? | Dynamic | Dynamic | Static |
This is one of the most elegant arguments in CLRS. Consider hashing nj keys into a table of size mj = nj² using a universal hash function. The expected number of colliding pairs is:
Since the expected number of collisions is less than 1/2, by Markov's inequality, the probability of any collision is less than 1/2. So with probability at least 1/2, a random universal hash function gives zero collisions. On average, you need only 2 random tries to find a perfect function for each bucket.
The total space across all second-level tables is ∑ nj². Using the identity ∑ nj² = n + 2 · C (where C is the number of colliding pairs at the first level), and the fact that a universal hash gives expected C = n(n-1)/(2m), we get expected total space Θ(n) when m = Θ(n). So the whole structure uses linear space — no waste.
Real-world hash tables are more sophisticated than the textbook versions. Each major language and runtime has made different engineering trade-offs. Let us tour the landscape.
Python's dict is one of the most studied hash table implementations in any language, and understanding its design choices is instructive. It uses open addressing with a custom probing strategy. The probe sequence mixes in higher-order bits of the hash to break up clustering:
Table size is always a power of 2 (for fast modulo via bitmask: hash & (size-1) instead of hash % size, which avoids an expensive integer division). Load factor threshold: 2/3 (~0.667), which is lower than Java's 0.75 because open addressing is more sensitive to high load factors than chaining. On resize, the table doubles. Since Python 3.6, dict preserves insertion order using a compact layout: a dense array of (hash, key, value) entries plus a sparse index array of pointers. This reduced memory by 20-25%.
A key performance detail: Python caches the hash value of each key. Computing hash("hello") costs ~50 ns, but comparing two hashes (integers) costs ~1 ns. By storing the hash alongside each key, Python avoids recomputing it during probing and can quickly reject non-matching entries by comparing hashes before comparing keys. This is especially important for string keys, where full comparison is expensive.
Python 3.6 also introduced a compact representation. Instead of one large array of (hash, key, value) triples with empty slots, it uses two arrays: a dense "entries" array (no gaps) and a sparse "indices" array (small integers pointing into entries). This preserves insertion order as a free side effect — iterating the entries array gives items in insertion order.
python # Conceptual layout of Python dict internals (simplified) # Sparse index array (1 byte per slot when table small): indices = [None, 0, None, 2, None, 1, None, None] # Dense entries array (no gaps, preserves insertion order): entries = [ (hash("cat"), "cat", 3), # entry 0 (hash("dog"), "dog", 7), # entry 1 (hash("bird"), "bird", 1), # entry 2 ] # To find "dog": compute hash, probe indices array, follow pointer to entries[1]
Java's HashMap uses chaining, but with a twist: when a chain exceeds 8 elements, it converts from a linked list to a red-black tree (called "treeification"). This caps worst-case search at O(log n) instead of O(n). When the chain shrinks below 6, it converts back to a list. Load factor threshold: 0.75. Table size: power of 2.
Why the threshold of 8? Under random hashing with α = 0.75, the probability of any bucket having 8 or more elements follows a Poisson distribution with λ = 0.75. P(X ≥ 8) ≈ 0.000006 — roughly one in 166,000 buckets. So treeification almost never triggers in normal use. It is a safety net against collision attacks (or pathologically bad hash functions), not a common-case optimization. The choice of 8 balances the overhead of maintaining a tree (more memory per node, more complex code) against the protection it provides.
Java also uses a technique called tree-bin thinning: when a treeified bin shrinks below 6 elements (through deletions), it converts back to a linked list. The gap between 6 (un-treeify) and 8 (treeify) prevents thrashing at the boundary.
Robin Hood hashing is a variant of open addressing with a beautiful invariant: when inserting, if the new key has a longer probe distance than the key currently in the slot, swap them. The new key takes the slot, and the displaced key continues probing. This "steals from the rich" (short probe distance) to "give to the poor" (long probe distance).
The result: the variance of probe distances drops dramatically. Instead of a few keys having very long probe chains while most have short ones, all keys have similar (moderate) probe distances. This makes the worst-case much better while keeping the average nearly the same.
Cuckoo hashing (named after the cuckoo bird, which lays eggs in other birds' nests) uses two hash functions h₁ and h₂ and two tables T₁ and T₂. Each key is stored in either T₁[h₁(k)] or T₂[h₂(k)]. Lookup checks exactly 2 locations — worst-case O(1).
Insert: try T₁[h₁(k)]. If occupied, evict the occupant and place the new key there. The evicted key goes to its alternative location in T₂. If that is also occupied, evict again. This chain of evictions (the "cuckoo" process) terminates quickly with high probability. If it cycles (after a threshold number of evictions), rehash with new functions.
python class CuckooHashTable: def __init__(self, size=11): self.size = size self.T1 = [None] * size self.T2 = [None] * size self.max_evictions = 20 def h1(self, k): return k % self.size def h2(self, k): return (k // self.size) % self.size def lookup(self, key): if self.T1[self.h1(key)] == key: return True if self.T2[self.h2(key)] == key: return True return False # exactly 2 lookups, always def insert(self, key): for _ in range(self.max_evictions): slot = self.h1(key) if self.T1[slot] is None: self.T1[slot] = key; return key, self.T1[slot] = self.T1[slot], key # evict slot = self.h2(key) if self.T2[slot] is None: self.T2[slot] = key; return key, self.T2[slot] = self.T2[slot], key # evict # Too many evictions — need to rehash with new functions raise Exception("Cycle detected — rehash needed")
Cuckoo hashing has a strict load factor limit: with two tables each of size m, you can store at most ~0.5m keys (load factor ~0.5) before cycles become frequent. Variants with 3 or 4 hash functions push this to α ≈ 0.9.
The beauty of cuckoo hashing for interviews: lookup is worst-case O(1), which no other dynamic hash table achieves. The cost is paid at insertion time, where the expected amortized cost is O(1) but worst-case can trigger a full rehash. For read-heavy workloads (which describes most real systems — typical read-to-write ratios are 100:1 or higher), this trade-off is excellent.
Not a hash table, but closely related: a Bloom filter answers "is x in the set?" using k hash functions and a bit array of m bits. On insert, set bits h₁(x), h₂(x), ..., hk(x) to 1. On query, check if all k bits are 1. False positives are possible (all bits happen to be set by other keys), but false negatives are impossible (if x was inserted, its bits are definitely set).
The false positive probability for n elements, m bits, and k hash functions is approximately:
The optimal number of hash functions is k = (m/n) · ln(2). For example, with 10 bits per element (m/n = 10), the optimal k = 7 and the false positive rate is about 0.8%. That is: using just 10 bits per element (1.25 bytes), you can test set membership with 99.2% accuracy. A full hash set would need 8-16 bytes per element.
Bloom filters are used everywhere: databases check a Bloom filter before reading from disk (avoiding expensive I/O for keys that definitely are not stored), CDNs use them to avoid caching one-hit wonders, spell checkers use them as a fast first pass, and Bitcoin nodes use them to request only relevant transactions from peers.
Variants include counting Bloom filters (replace bits with counters to support deletion), cuckoo filters (support deletion with lower false positive rates), and quotient filters (cache-friendly alternative). For interview purposes, remember: Bloom filter = probabilistic set membership, no false negatives, tunable false positive rate via m/n ratio and k hash functions.
python class BloomFilter: def __init__(self, size=1000, num_hashes=7): self.bits = [False] * size self.size = size self.k = num_hashes def _hashes(self, item): # Generate k hash values using double hashing h1 = hash(item) % self.size h2 = (hash(item) // self.size + 1) % self.size return [(h1 + i * h2) % self.size for i in range(self.k)] def add(self, item): for h in self._hashes(item): self.bits[h] = True def might_contain(self, item): return all(self.bits[h] for h in self._hashes(item))
Insert keys and watch Robin Hood hashing balance probe distances. The bar above each slot shows its probe distance. Rich keys (short distance) get displaced by poor keys (long distance).
| Implementation | Collision Strategy | Resize at | Worst Lookup | Special Feature |
|---|---|---|---|---|
| Python dict | Open (perturbation probing) | α = 2/3 | O(n) | Compact layout, insertion order |
| Java HashMap | Chaining → tree | α = 0.75 | O(log n) | Treeification at chain length 8 |
| C++ unordered_map | Chaining | α = 1.0 | O(n) | Bucket interface, local iterators |
| Robin Hood | Open (displacement) | α ≈ 0.9 | O(log n) expected | Low variance probe distance |
| Cuckoo | Two tables, eviction | α ≈ 0.5 | O(1) | Constant worst-case lookup |
Hopscotch hashing (Herlihy, Shavit, Tzafrir, 2008) is a hybrid between chaining and open addressing. Each slot has a "neighborhood" of size H (typically 32, matching the word size). A bitmap in each slot records which of the next H positions contain keys that hash to this slot. Lookup examines only the H slots in the neighborhood — a fixed constant, making worst-case lookup O(H) = O(1).
On insert, if the target neighborhood is full, the algorithm displaces an element outside the neighborhood into a further slot, then cascades displacements closer to the target. This is similar to cuckoo hashing but works within a single table. The bounded neighborhood size is particularly valuable for concurrent hash tables: a thread only needs to lock its neighborhood, not the entire table, enabling fine-grained locking and high parallelism.
Hopscotch hashing achieves load factors up to 0.9 with excellent performance, and its bitmap operations are efficiently implementable using SIMD instructions — a single instruction can check all 32 neighborhood slots simultaneously.
| Year | Innovation | Key Idea |
|---|---|---|
| 1953 | Chaining (Luhn) | Linked lists at each slot |
| 1954 | Open addressing (Amdahl, Elmore, Peterson) | Probe within the table itself |
| 1963 | Linear probing analysis (Knuth) | First rigorous analysis of clustering |
| 1979 | Universal hashing (Carter, Wegman) | Randomized defense against adversarial inputs |
| 1984 | Perfect hashing (FKS) | O(1) worst-case for static sets |
| 2001 | Cuckoo hashing (Pagh, Rodler) | Worst-case O(1) lookup for dynamic sets |
| 2005 | Robin Hood hashing (Celis) | Variance reduction via displacement |
| 2008 | Hopscotch hashing | Bounded neighborhoods for concurrency |
| 2017 | Swiss Table (Google) | SIMD-parallel probing, metadata bytes |
Hash tables appear in nearly every system design interview. This chapter covers the most common design patterns and failure modes.
"Design a cache with O(1) get and O(1) put. When the cache exceeds capacity, evict the least recently used item."
This is the classic LRU (Least Recently Used) cache, and the answer is a hash map plus a doubly-linked list. The hash map gives O(1) lookup. The doubly-linked list maintains access order — the most recently used item is at the head, the least recently used is at the tail.
python class Node: def __init__(self, key, val): self.key, self.val = key, val self.prev = self.nxt = None class LRUCache: def __init__(self, capacity): self.cap = capacity self.map = {} # key -> Node # Sentinel nodes: head.nxt = MRU, tail.prev = LRU self.head = Node(0, 0) self.tail = Node(0, 0) self.head.nxt = self.tail self.tail.prev = self.head def _remove(self, node): node.prev.nxt = node.nxt node.nxt.prev = node.prev def _add_front(self, node): node.nxt = self.head.nxt node.prev = self.head self.head.nxt.prev = node self.head.nxt = node def get(self, key): if key not in self.map: return -1 node = self.map[key] self._remove(node) self._add_front(node) return node.val def put(self, key, val): if key in self.map: self._remove(self.map[key]) node = Node(key, val) self.map[key] = node self._add_front(node) if len(self.map) > self.cap: lru = self.tail.prev self._remove(lru) del self.map[lru.key]
Get and put keys in a capacity-3 LRU cache. Watch the linked list reorder on access. Least recently used is evicted when full.
A spell checker needs to answer "is this word valid?" for a dictionary of ~200K words. A hash set (hash table with no values, just keys) gives O(1) lookup. But if memory is tight, a Bloom filter uses only ~10 bits per word (250 KB total) with a 1% false positive rate — meaning it might say "yes" for a misspelled word, but never says "no" for a real word. Use the Bloom filter as a first pass, then check a compact on-disk dictionary for the rare false positives.
"Your web service uses a hash map to cache user sessions. Response times were 2ms but suddenly jumped to 200ms. What happened?"
Systematic diagnosis:
| Hypothesis | How to check | Fix |
|---|---|---|
| Load factor too high (no resize) | Log n and m, compute α | Trigger resize at α > 0.75 |
| Hash collision attack | Check if many keys map to same slot | Switch to SipHash or universal hashing |
| Broken hash function | Check distribution of h(k) across slots | Replace hash function, ensure it uses all bits of the key |
| Tombstone accumulation (open addressing) | Count DELETED slots vs total | Periodic full rehash to eliminate tombstones |
| Memory thrashing | Check if table exceeds L3 cache | Reduce table size, improve data locality |
In 2011, researchers demonstrated that many web frameworks (PHP, Java, Python, Ruby) were vulnerable to hash collision attacks. By crafting HTTP POST parameters that all hash to the same slot, an attacker could force the server's hash table to degrade to O(n) per operation. A single malicious request with 100K parameters could tie up a CPU core for minutes.
The fix: SipHash, a keyed hash function designed by Jean-Philippe Aumasson and Daniel J. Bernstein. SipHash takes a 128-bit secret key, making it impossible for an attacker to predict which keys collide. Python (since 3.4), Rust, Ruby, and Perl all switched to SipHash for their built-in hash tables.
"Your Python service's memory grows without bound even though you are deleting entries from a dict."
Common cause: the dict stores objects that reference each other (circular references), and the garbage collector's cycle detector runs less frequently than your insertion rate. Or: you are storing large values and only deleting keys — the dict's internal array does not shrink automatically in most implementations. In CPython, dict.clear() followed by reassigning to a new dict is the only way to reclaim the backing array's memory.
Another subtlety: if you use a hash table with tombstones (open addressing), deleting keys does not actually free the slots. The table stays at its peak size. Periodic "compaction" (creating a new table and re-inserting all live keys) is necessary to truly reclaim space. This is analogous to database vacuuming (PostgreSQL's VACUUM command does exactly this for its hash indexes).
"Our hash map operations are taking 10x longer than expected. The load factor is only 0.3."
Low load factor should mean fast operations. If it is slow anyway, investigate:
| Cause | Diagnosis | Fix |
|---|---|---|
| Expensive hash function | Profile: is hash() taking most of the time? | Cache hash values, use cheaper hash |
| Expensive key comparison | Profile: is equals() taking most of the time? | Compare hash values first (fast reject) |
| Table too large for cache | Table size * entry size > L3 cache size | Shrink table, use compact representation |
| Many tombstones | Count DELETED vs occupied slots | Compact/rebuild the table |
| Pointer chasing (chaining) | Profile cache misses with perf stat | Switch to open addressing |
"Our test suite passes locally but fails in CI. The output order of items changes between runs."
In Python 3.3+, hash randomization is enabled by default (PYTHONHASHSEED is random each run). This means dict iteration order can vary between runs for dicts with non-integer keys. In Python 3.7+, dicts preserve insertion order, but sets do not. If your code depends on iteration order of a set, it will be non-deterministic. Fix: sort the output, or use an OrderedDict/list where order matters.
"Design a rate limiter that allows at most K requests per client per minute."
Use a hash map from client_id to a list of timestamps. On each request, check if the client has made K requests in the last 60 seconds. Implementation:
python import time from collections import defaultdict, deque class RateLimiter: def __init__(self, max_requests=10, window_secs=60): self.max_requests = max_requests self.window = window_secs self.clients = defaultdict(deque) # client_id -> deque of timestamps def allow(self, client_id): now = time.time() q = self.clients[client_id] # Evict timestamps older than the window while q and now - q[0] > self.window: q.popleft() if len(q) < self.max_requests: q.append(now) return True # allowed return False # rate limited
The hash map gives O(1) lookup per client. The deque gives O(1) timestamp eviction. Total: O(1) amortized per request. This is the sliding window counter pattern — one of the most common uses of hash maps in backend engineering.
"You have 100 web servers. Users can hit any server. How do you store session data so that any server can access any user's session?"
Option 1: Consistent hashing across servers. hash(session_id) determines which server "owns" the session. On a cache miss, the server computes the consistent hash, routes to the owner, and caches locally. On server failure, only 1/N sessions need to remapped.
Option 2: Centralized hash table (Redis, Memcached). All servers read/write to a shared in-memory store. Simple but introduces a single point of failure and a network hop per access.
Option 3: Replicated hash table. Each server has a full copy. Writes are broadcast to all servers. Good for read-heavy workloads but write propagation adds latency. The interviewer expects you to compare these trade-offs: consistency, availability, partition tolerance (CAP theorem).
| Variant | Avg Search | Worst Search | Space | Best For |
|---|---|---|---|---|
| Chaining | O(1+α) | O(n) | O(n+m) | General purpose, frequent deletes |
| Linear Probing | O(1/(1-α)) | O(n) | O(m) | Cache-friendly, low α |
| Double Hashing | O(1/(1-α)) | O(n) | O(m) | Better than linear at high α |
| Robin Hood | O(1/(1-α)) | O(log n) exp | O(m) | Low-variance reads |
| Cuckoo | O(1) | O(1) | O(n) | Read-heavy, static-ish sets |
| Perfect | O(1) | O(1) | O(n) | Static sets (keywords, configs) |
| Bloom Filter | O(k) | O(k) | O(m) bits | Membership test, no false negatives |
1. Two Sum (LeetCode #1) — Given an array and a target, find two elements that sum to the target. Brute force O(n²): try every pair. With a hash map: for each element x, check if (target - x) is in the map. If yes, return the pair. If no, add x to the map and continue. Single pass, O(n) time, O(n) space. This is the quintessential "use a hash map" interview problem — if you see "find a complement" or "find a pair," think hash map.
python def two_sum(nums, target): seen = {} # value -> index for i, x in enumerate(nums): complement = target - x if complement in seen: return [seen[complement], i] seen[x] = i return []
2. Group Anagrams (LeetCode #49) — Group strings that are anagrams of each other. Key insight: two strings are anagrams if and only if they have the same sorted characters. Use tuple(sorted(s)) as the hash map key. This is an example of the "canonical form" pattern: transform each element into a canonical representative, then group by representative using a hash map. The same pattern works for grouping isomorphic strings, grouping equivalent fractions, etc.
python from collections import defaultdict def group_anagrams(strs): groups = defaultdict(list) for s in strs: key = tuple(sorted(s)) # O(k log k) per string groups[key].append(s) return list(groups.values())
3. First Non-Repeating Character (LeetCode #387) — Find the first character in a string that appears exactly once. Two passes: first pass counts frequencies in a hash map, second pass finds the first character with count 1. This is the "frequency counting" pattern — use a hash map to count occurrences, then query the counts. The same technique solves: find majority element (Boyer-Moore is better but hash map works), find top K frequent elements, check if two strings are permutations.
python def first_unique(s): freq = {} for c in s: freq[c] = freq.get(c, 0) + 1 for c in s: if freq[c] == 1: return c return None
4. Subarray Sum Equals K — Count subarrays whose elements sum to k. Maintain a running prefix sum. At each index, check if (prefix_sum - k) exists in the hash map — that means there is a subarray ending here that sums to k. O(n) time, O(n) space.
python def subarray_sum(nums, k): count = 0 prefix = 0 seen = {0: 1} # prefix_sum -> count of occurrences for x in nums: prefix += x if prefix - k in seen: count += seen[prefix - k] seen[prefix] = seen.get(prefix, 0) + 1 return count
5. Consistent Hashing — For distributed systems: place servers on a ring of size 232 using hash(server_id). To route a key, compute hash(key) and walk clockwise to the first server on the ring. When a server is added or removed, only keys between the new/removed server and its predecessor need to be remapped — roughly n/k keys move (where k is the number of servers) instead of the n keys that would move with naive modular hashing (hash(key) % k changes for every key when k changes). This is how Amazon DynamoDB, Apache Cassandra, and CDN routing work.
To ensure even load distribution, each physical server is represented by multiple virtual nodes at different points on the ring. With 100-200 virtual nodes per server, load imbalance stays below 10%.
6. Hash Table from Scratch — Implement a hash table with insert, search, delete, and automatic resizing. This is a full implementation exercise that tests all concepts from this lesson. The code is in Chapter 3 (chaining version) and the comprehensive version with resize is shown below in the cheat sheet.
7. Longest Substring Without Repeating Characters (LeetCode #3) — Use a hash map (or set) to track characters in the current window. Slide the window rightward, adding characters to the set. When a duplicate is found, slide the left boundary past the previous occurrence. The hash set gives O(1) membership testing at each step, making the algorithm O(n).
python class HashMap: def __init__(self, initial_cap=8): self.cap = initial_cap self.size = 0 self.buckets = [[] for _ in range(self.cap)] self.LOAD_THRESHOLD = 0.75 def _hash(self, key): return hash(key) % self.cap def _resize(self): old = self.buckets self.cap *= 2 self.buckets = [[] for _ in range(self.cap)] self.size = 0 for bucket in old: for k, v in bucket: self.put(k, v) # rehash into new table def put(self, key, value): if self.size / self.cap >= self.LOAD_THRESHOLD: self._resize() h = self._hash(key) for i, (k, v) in enumerate(self.buckets[h]): if k == key: self.buckets[h][i] = (key, value) return self.buckets[h].append((key, value)) self.size += 1 def get(self, key): h = self._hash(key) for k, v in self.buckets[h]: if k == key: return v raise KeyError(key) def remove(self, key): h = self._hash(key) for i, (k, v) in enumerate(self.buckets[h]): if k == key: self.buckets[h].pop(i) self.size -= 1 return raise KeyError(key)
Hash tables appear in interview problems in five recurring patterns. Recognizing the pattern is 80% of the solution.
| Pattern | When to use | Example problems |
|---|---|---|
| Complement lookup | "Find a pair that satisfies a condition" | Two Sum, Two Sum II, 4Sum, K-diff pairs |
| Frequency counting | "Count occurrences, find duplicates, find majority" | First unique char, top K frequent, valid anagram |
| Canonical grouping | "Group elements by some equivalence" | Group anagrams, isomorphic strings |
| Sliding window + set | "Find longest/shortest subarray with distinct elements" | Longest substring no repeat, min window substring |
| Prefix sum + map | "Count subarrays with a target sum" | Subarray sum equals K, continuous subarray sum |
When you see a problem, ask: "Can I reduce this to a lookup?" If yes, a hash map likely helps. If the problem involves ordering (sorted output, range queries, nearest neighbor), a hash map alone will not suffice — consider a BST, heap, or sorted structure.
The LRU cache is one of the most frequently asked system design questions. Let us walk through the exact thought process an interviewer expects:
Step 1: Identify the operations. get(key) returns the value if key exists, else -1. put(key, value) inserts or updates. Both must be O(1). When capacity is exceeded, evict the least recently used key.
Step 2: Choose data structures. O(1) lookup suggests a hash map. O(1) insertion/removal at arbitrary positions suggests a doubly-linked list. The key insight: combine both. The hash map maps keys to linked list nodes. The linked list maintains recency order.
Step 3: Define invariants. The head of the list is always the most recently used. The tail is the least recently used. Every key in the hash map has a corresponding node in the list, and vice versa. These invariants must hold after every operation.
Step 4: Handle edge cases. What if get() is called on a key at the tail? It must move to the head. What if put() updates an existing key? Move it to the head. What if put() inserts into a full cache? Remove the tail, delete from the hash map, then insert the new key.
This is the pattern: state the operations, choose data structures, define invariants, handle edges. Interviewers evaluate your ability to think systematically, not just write code.
Copy this table to your whiteboard at the start of any hash table interview question:
| Operation | Chaining (avg) | Chaining (worst) | Open Addr (avg) | Open Addr (worst) |
|---|---|---|---|---|
| Insert | O(1) | O(n) | O(1/(1-α)) | O(n) |
| Search (found) | O(1 + α/2) | O(n) | O((1/α)ln(1/(1-α))) | O(n) |
| Search (not found) | O(1 + α) | O(n) | O(1/(1-α)) | O(n) |
| Delete | O(1 + α) | O(n) | Tombstone + search | O(n) |
| Space | O(n + m) | O(m) | ||
Remember: all "average" bounds assume simple uniform hashing and α = O(1). The O(n) worst cases happen with adversarial inputs or degenerate hash functions.
| Topic | Relationship to Hash Tables |
|---|---|
| Balanced BSTs (Ch 13) | O(log n) guaranteed vs hash table's O(1) average. BSTs support ordered operations (min, max, range queries) that hash tables cannot. |
| Dynamic Programming | Memoization uses hash tables to cache subproblem results. |
| Graph algorithms | Adjacency lists are often hash maps (vertex → neighbor list). |
| Counting problems | Frequency tables, histogram computation — all hash maps. |
| String algorithms | Rabin-Karp uses rolling hash for substring search. Trie vs hash map trade-offs. |
| Cryptography | Cryptographic hash functions (SHA-256) guarantee collision resistance. Hash tables need only uniformity. |
| Distributed systems | Consistent hashing for load balancing. Rendezvous hashing for server selection. |
| Resource | What You Learn |
|---|---|
| CLRS Chapter 11 | Rigorous proofs of all bounds. Universal hashing proof. Perfect hashing O(n) space proof. |
| Knuth TAOCP Vol 3, Ch 6.4 | The original analysis of linear probing (1963). Still the deepest treatment available. |
| "A Cool and Practical Alternative to Traditional Hash Tables" (Pagh & Rodler, 2001) | The cuckoo hashing paper. Elegant and readable. |
| Matt Kulukundis, CppCon 2017: "Designing a Fast, Efficient, Cache-friendly Hash Table" | The Swiss Table talk. How Google rethought hash tables using SIMD. |
| Python source code: Objects/dictobject.c | How Python's dict actually works. The perturbation probe sequence. The compact layout. |
| Mistake | Why it is wrong | What to say instead |
|---|---|---|
| "Hash tables are O(1)" | Only in the expected/average case. Worst case is O(n) for all standard variants. | "Expected O(1) under simple uniform hashing. Worst case O(n), but pathological only with adversarial inputs or a bad hash function." |
| "Just use a hash map" (for everything) | Hash maps do not support ordered operations: min, max, range queries, predecessor/successor. | "Hash map for unordered access. BST or sorted array for ordered queries. Know when each is appropriate." |
| Forgetting to handle hash collisions in custom implementations | Without collision resolution, keys silently overwrite each other. | "I will use chaining. Each slot is a linked list. Insert appends; search traverses." |
| Using mutable objects as hash keys | If the object changes after insertion, its hash changes, and the table loses the entry. | "Hash keys must be immutable. In Python, use tuples (not lists) as dict keys." |
Hash tables are not always the answer. Use a different data structure when you need:
| Need | Better Choice | Why |
|---|---|---|
| Ordered iteration | Balanced BST (TreeMap) | In-order traversal in O(n) |
| Range queries ("all keys between 10 and 50") | Balanced BST or B-tree | Hash has no notion of "between" |
| Find min/max | Heap or BST | O(1) min with heap vs O(n) scan of hash table |
| Memory-constrained membership test | Bloom filter | ~10 bits/element vs ~50+ bytes/element |
| Worst-case O(log n) guarantee | Balanced BST | Hash table worst case is O(n) |
| Persistent/functional data structure | Hash Array Mapped Trie (HAMT) | Structural sharing for immutable updates |
dict[key] in Python or access a JSON object by key, you are using a hash table. Understanding how they work — the math, the engineering trade-offs, the failure modes — separates someone who uses tools from someone who builds them.