KMP, Rabin-Karp, finite automata — finding patterns in text efficiently.
You are building a text editor. The user presses Ctrl+F and types "needle". Your document contains 50,000 characters. You need to find every occurrence of "needle" in the document and highlight them all — instantly. How?
Or you work at a security firm. Every network packet flowing through the firewall must be checked against 10,000 known malware signatures. Each packet is thousands of bytes. You need to check all signatures in real time on a 10 Gbps link. How?
Or you are a bioinformatician. You have a genome with 3 billion base pairs and a gene sequence of 200 bases. You need to find every location where that gene appears. How?
All three problems reduce to the same fundamental question.
Given a text T of length n and a pattern P of length m (where m ≤ n), find all positions in T where P occurs as a substring. More precisely: find every valid shift s (where 0 ≤ s ≤ n - m) such that T[s..s+m-1] equals P[0..m-1]. We call each valid shift a match or an occurrence.
The notation matters. We use zero-based indexing throughout. T[i] is the character at position i. T[s..s+m-1] is the substring of T starting at position s with length m. P[0..m-1] is the entire pattern.
The brute-force approach is dead simple. Place the pattern at position 0 of the text. Compare character by character, left to right. If every character matches, record a hit. Then slide the pattern one position to the right and repeat. Keep going until the pattern falls off the end of the text.
python def naive_match(T, P): """Return all positions where P occurs in T.""" n, m = len(T), len(P) matches = [] for s in range(n - m + 1): # try every shift j = 0 while j < m and T[s + j] == P[j]: j += 1 if j == m: # all m characters matched matches.append(s) return matches
Let us trace it on a small example. T = "ababcab", P = "abc".
In the best case, the first character of the pattern mismatches at every shift. That is n - m + 1 comparisons — basically O(n). This happens when the text and pattern share few characters (e.g., searching for "xyz" in "aaaaaaa").
In the worst case, the naive algorithm is devastating. Consider searching for "aaaaab" in "aaaaaaaaaa...a" (a string of all a's). At every shift, we compare 5 a's before hitting the mismatch at 'b'. Nearly every comparison is wasted because we already knew those a's matched from the previous shift. The cost is (n - m + 1) · m, which is O(n · m).
For n = 1,000,000 and m = 1,000, that is about 109 comparisons. For n = 3 · 109 (a genome) and m = 200, it is 6 · 1011. Way too slow.
Let us see the scale of the problem concretely:
| Scenario | n | m | Naive worst case | O(n+m) target | Speedup |
|---|---|---|---|---|---|
| Ctrl+F in editor | 50,000 | 10 | 500,000 | 50,010 | 10x |
| grep in log file | 107 | 20 | 2 · 108 | ~107 | 20x |
| Gene search in genome | 3 · 109 | 200 | 6 · 1011 | ~3 · 109 | 200x |
| IDS signature scan | 1,500 | 50 | 75,000 | 1,550 | 48x |
The speedup grows linearly with m. The longer the pattern, the more the naive algorithm wastes, and the more the smart algorithms save.
The simulation below makes the waste visible. Watch how the naive algorithm re-examines characters it has already seen.
Watch the pattern slide along the text. Green = match, red = mismatch, orange = current comparison. Click Step to advance one comparison at a time, or Play to animate.
Notice what happens at shift s=0: we match "ab" and then fail at position 2. At shift s=2, we compare "a" again — even though we already know T[2]='a' from the previous attempt. At shift s=0, we saw that T[0]='a', T[1]='b'. But shift s=1 does not use any of that knowledge. It starts from scratch.
This is the fundamental waste: information from partial matches is discarded. Every algorithm in this chapter is a different strategy for preserving and exploiting that information.
Before we dive into individual algorithms, here is the map of where we are going:
Hash the pattern. Hash each text window. Compare hashes in O(1). If hashes match, verify characters. Expected O(n+m), worst case O(nm). Simple, versatile, extends to multiple patterns and 2D matching.
Build a DFA from the pattern. Process text in a single pass. O(n) matching, O(m|Σ|) preprocessing. Elegant, guaranteed linear matching, but preprocessing cost depends on alphabet size.
The prefix function: for each position in the pattern, precompute the longest proper prefix that is also a suffix. Use this to skip ahead on mismatch. O(n+m) guaranteed, O(m) preprocessing. The gold standard.
The naive algorithm compares characters one by one. What if we could compare entire substrings in O(1) by comparing their fingerprints? That is the idea behind Rabin-Karp: hash the pattern, hash each window of text, and compare hashes instead of characters.
If the hash of a text window differs from the pattern's hash, we know for certain the window does not match — skip it instantly. If the hashes are equal, we might have a match (or a collision), so we verify character by character. The trick that makes this fast is computing the next window's hash from the previous one in O(1) time — the rolling hash.
Every string is a sequence of characters, and every character has a numeric code. In ASCII, 'a' = 97, 'b' = 98, 'c' = 99, and so on. We can treat a string of length m as an m-digit number in base d, where d is the alphabet size.
For a simple example, let us use decimal digits as our alphabet (d = 10). The string "314" represents the number 3 · 102 + 1 · 101 + 4 · 100 = 314. More generally:
For real characters with d = 256, this number can be astronomically large. A 10-character string produces a number up to 25610 ≈ 1024 — too large to fit in a machine word. So we take the hash modulo a large prime q:
This keeps numbers in the range [0, q-1]. The price: different strings can produce the same hash value. This is a collision, and we must handle it.
Here is the key insight that makes Rabin-Karp efficient. When the window slides from position s to s+1, the new window differs from the old one by exactly two characters: the old leftmost character drops out, and a new rightmost character enters. We can update the hash in O(1):
Read that formula step by step:
This is exactly how you would update a decimal number by dropping the leftmost digit and appending a new rightmost digit. For example, to go from 31415 to 14159: subtract 3 · 10000, multiply by 10, add 9.
Let T = "31415926535", P = "26", d = 10, q = 13. We use small numbers for visibility.
First, compute dm-1 mod q = 101 mod 13 = 10.
Then compute the pattern hash: h(P) = (2 · 10 + 6) mod 13 = 26 mod 13 = 0.
Now slide through the text windows of length 2:
Notice h("65") = 0 = h("26") — a spurious hit. Both 26 and 65 are divisible by 13. We detect it by verifying the actual characters. With a good prime q (much larger than 13), spurious hits are extremely rare in practice.
python def rabin_karp(T, P, d=256, q=1000000007): """Find all occurrences of P in T using Rabin-Karp.""" n, m = len(T), len(P) if m > n: return [] h = pow(d, m - 1, q) # d^(m-1) mod q, precomputed p_hash = 0 # hash of pattern t_hash = 0 # hash of current text window matches = [] # Compute initial hashes for P and T[0..m-1] for i in range(m): p_hash = (d * p_hash + ord(P[i])) % q t_hash = (d * t_hash + ord(T[i])) % q # Slide the window across T for s in range(n - m + 1): if p_hash == t_hash: # hash match! if T[s:s+m] == P: # verify to avoid spurious hit matches.append(s) if s < n - m: # Roll the hash: remove T[s], shift left, add T[s+m] t_hash = (d * (t_hash - ord(T[s]) * h) + ord(T[s + m])) % q if t_hash < 0: # Python handles this, but in C: t_hash += q # ensure non-negative return matches
(a - b) % q can return a negative value when a < b. You must add q to make it non-negative. Python's % operator always returns a non-negative result, so this is only a concern in other languages. This is a common source of bugs in interview implementations.Preprocessing: Computing the initial hashes takes O(m) — we process each character of P and the first window of T.
Matching: Each rolling hash update is O(1). We perform n - m updates. At each position, comparing hashes is O(1). Verification on hash match costs O(m). If there are v valid matches and c spurious hits, total verification cost is (v + c) · m.
Expected total: O(n + m + (v + c) · m). With a random prime q, the expected number of spurious hits is (n - m + 1) / q, which is negligible for large q. So expected time is O(n + m) assuming v is small.
Worst case: O(n · m) — if every hash collides (e.g., q = 1 and all hashes are 0). This is the same as naive. But with a good q, this never happens in practice.
Watch the rolling hash value update as the window slides. Teal = hash match (verify!), orange = current window. Uses d=10, q=13 for visibility.
Rabin-Karp may seem weak — its worst case is no better than naive. But it has unique strengths that no other algorithm in this chapter shares:
Multiple pattern matching. To search for k patterns simultaneously, compute all k pattern hashes and store them in a hash set. At each text window, check if the window hash is in the set. Cost: O(n + km) expected. Naive would cost O(knm). This is the most common reason to choose Rabin-Karp over KMP.
2D pattern matching. Find a small m × m matrix inside a large n × n matrix. Hash each row of the pattern and each row of the text. Use column-wise rolling hashes. This extends naturally: just compose two 1D rolling hashes. Neither FA nor KMP generalizes to 2D easily.
Plagiarism detection. Hash every m-gram (overlapping window of m words) in two documents. Find matching hashes. This is the basis of MOSS, Turnitin, and similar systems. The key insight: you do not need to compare every pair of windows across two documents. Hash all windows from both, store them, and find collisions. With a hash set, this takes O(n1 + n2) expected time.
Rabin-Karp is simple in concept but tricky in practice. Here are the bugs that trip up most implementers:
The prime q controls the collision rate. A larger q means fewer spurious hits but requires larger arithmetic. In practice:
Suppose you need to find all occurrences of three patterns: "ab", "bc", "cd" in text T = "abcde". With KMP, you would run three separate searches (3 · O(n)). With Rabin-Karp, you compute all three pattern hashes, store them in a set, and scan once:
This scales beautifully. For k = 10,000 patterns, we still make one pass through the text. The hash set lookup is O(1). Total: O(n + km) for initial hashing + O(n) for scanning = O(n + km). Compare to k separate KMP searches: O(kn + km). When n is much larger than m, Rabin-Karp's single pass wins decisively.
Find a 3×3 pattern matrix inside a 10×10 text matrix. Rabin-Karp extends naturally:
Total: O(n2) for an n×n text with an m×m pattern. Naive 2D matching would be O(n2m2).
Rabin-Karp uses hashing to skip non-matching windows. There is a completely different approach: build a machine — a deterministic finite automaton (DFA) — that reads the text character by character and transitions between states. Each state represents "how many characters of the pattern have been matched so far." When the machine reaches the accepting state (state m), we have found a match.
Think of it like playing a matching game. You are walking along the pattern from left to right, trying to match it against the incoming text. You maintain a counter: "how many characters of the pattern have I matched so far?" This counter is your state.
When the next text character matches the next expected pattern character, you advance: state goes from q to q+1. When it does not match, you need to figure out: "given what I have seen so far (the last q characters plus this new mismatched character), what is the longest prefix of the pattern that I have already matched?" You do not go back to the start — you jump to the longest position in the pattern where you could still be mid-match.
A DFA precomputes this jump for every possible (state, character) pair. After preprocessing, matching is a single left-to-right scan with no backtracking.
Consider pattern P = "ababc" (length 5) over alphabet Σ = {a, b, c}. The DFA has m + 1 = 6 states, numbered 0 through 5. State q means "the last q characters of the text processed so far match the first q characters of P." State 0 is the start (nothing matched). State 5 is the accept state (full pattern matched).
The transition function δ(q, c) answers: "If I am in state q and I read character c, what is my new state?" The new state is the length of the longest prefix of P that is also a suffix of the string "P[0..q-1] + c".
For example, δ(4, 'a') asks: I have matched "abab" (state 4) and I read 'a'. The string so far ends with "ababa". What is the longest prefix of "ababc" that is a suffix of "ababa"? The answer is "aba" (length 3), so δ(4, 'a') = 3.
For pattern P = "ababc":
| State q | Matched so far | δ(q, 'a') | δ(q, 'b') | δ(q, 'c') |
|---|---|---|---|---|
| 0 | "" | 1 | 0 | 0 |
| 1 | "a" | 1 | 2 | 0 |
| 2 | "ab" | 3 | 0 | 0 |
| 3 | "aba" | 1 | 4 | 0 |
| 4 | "abab" | 3 | 0 | 5 |
| 5 (accept) | "ababc" | 1 | 2 | 0 |
Let us verify a few entries:
Let us run this DFA on text T = "ababababcab":
The power of this approach: every text character is processed exactly once. There is no backtracking, no re-examination. The DFA transitions encode all the "failure" information implicitly. At step i=4, when we read 'a' instead of 'c', the DFA does not go back to state 0 — it jumps to state 3, because "ababa" ends with "aba" = P[0..2]. It automatically "remembers" the partial match.
Once the DFA is built, matching is a single pass through the text:
python def fa_match(T, delta, m): """Match using precomputed DFA. delta[q][c] = next state.""" q = 0 # start state matches = [] for i, c in enumerate(T): q = delta[q].get(c, 0) # transition if q == m: # reached accept state matches.append(i - m + 1) return matches
That is it. One character at a time, one table lookup per character, no backtracking. O(n) time for matching. The text index i only moves forward. No character is ever re-examined. This is the fundamental advantage over naive.
The preprocessing cost is higher. For each of the m+1 states and each of the |Σ| alphabet characters, we need to compute δ(q, c). The brute-force approach:
python def build_fa(P, alphabet): """Build DFA transition table for pattern P.""" m = len(P) delta = [{} for _ in range(m + 1)] for q in range(m + 1): for c in alphabet: # Find longest k such that P[0..k-1] is a suffix of P[0..q-1]+c k = min(m, q + 1) # can't match more than q+1 or m while k > 0: suffix = P[:q] + c if suffix.endswith(P[:k]): break k -= 1 delta[q][c] = k return delta
This runs in O(m3 |Σ|) due to the string comparisons inside the while loop. It can be optimized to O(m |Σ|) using the prefix function (which we will meet in Chapter 4). The space cost is O(m |Σ|) for the table itself.
There is a clever way to build the FA table in O(m|Σ|) using the prefix function. The insight: δ(q, c) = δ(π[q], c) whenever P[q] ≠ c. In other words, if the next character does not match the expected pattern character, the FA "falls back" to the same state that KMP would fall back to, and then tries the transition again. This recursive relationship lets us build δ from π efficiently:
python def build_fa_fast(P, alphabet): """Build DFA in O(m|Σ|) using prefix function.""" m = len(P) pi = compute_prefix(P) delta = [{} for _ in range(m + 1)] # State 0: only the pattern's first character advances for c in alphabet: delta[0][c] = 1 if c == P[0] else 0 # States 1..m: use prefix function for fallback for q in range(1, m + 1): for c in alphabet: if q < m and c == P[q]: delta[q][c] = q + 1 # match: advance else: # mismatch: same as transitioning from state pi[q-1] delta[q][c] = delta[pi[q - 1] if q > 0 else 0][c] return delta
This is the bridge between FA matching and KMP: they encode the same information in different forms. The FA stores it as an explicit table (fast lookup, more memory). KMP stores it as the π array (less memory, amortized lookup).
For DNA (|Σ| = 4), the table is tiny: 4(m+1) entries. For ASCII (|Σ| = 256), it is 256(m+1). For Unicode (|Σ| = 143,859), it is impractical. This is the main weakness of the FA approach — it scales poorly with alphabet size. KMP (next chapter) solves this by encoding the same information more compactly.
Watch the DFA process text character by character. The current state is highlighted in orange. State transitions are animated. When the accept state (5) is reached, the match is recorded in teal.
The finite automaton approach works beautifully, but building the transition table costs O(m|Σ|). KMP achieves the same O(n) matching with only O(m) preprocessing. It is the algorithm most interviewers expect you to know by heart. The secret weapon: the prefix function (also called the failure function).
When a mismatch occurs in the naive algorithm after matching q characters, we throw away all that progress and start over. The FA approach precomputes where to go for every (state, character) pair. KMP takes a middle path: it only needs to know, for each position q in the pattern, what is the longest proper prefix of P[0..q-1] that is also a suffix of P[0..q-1].
Let us unpack that sentence with an example. Pattern P = "ababc". Suppose we have matched "abab" (q = 4) and the next text character is 'a' instead of the expected 'c'. Here is what we know:
The prefix function π[q] tells us exactly how far to shift. For our example, π[4] = 2 (the prefix "ab" of length 2 equals the suffix "ab" of "abab"). On mismatch at position 4, we set q = π[4-1] = π[3] and continue. No character of the text is ever re-examined.
Pattern P = "ababaca" (m = 7). Let us compute π for every position:
The matching algorithm maintains two indices: i (position in text, never decreases) and q (number of pattern characters matched so far). At each step:
python def kmp_match(T, P): """Find all occurrences of P in T using KMP.""" n, m = len(T), len(P) pi = compute_prefix(P) # O(m) — see Chapter 4 q = 0 # characters matched so far matches = [] for i in range(n): # Fall back until we find a match or reach 0 while q > 0 and P[q] != T[i]: q = pi[q - 1] # use prefix function to skip if P[q] == T[i]: q += 1 # extend the match if q == m: # full match found matches.append(i - m + 1) q = pi[q - 1] # continue searching for overlapping matches return matches
T = "ababababcab", P = "ababc", π = [0, 0, 1, 2, 0].
Compare to naive: the naive algorithm would try shifts 0 through 7, and at each shift it might compare up to 5 characters. KMP used about 14 total comparisons. The text pointer i never moved backward — every text character was examined at most twice (once in the for loop, potentially once more via a fallback that was "paid for" by a previous match extension).
T = "aaaaaab", P = "aaab", π = [0, 1, 2, 0].
Notice the pattern: at each mismatch, KMP falls back by one position (from q=3 to q=2) and immediately re-matches (q=2 to q=3). It still advances the text pointer i each time. The naive algorithm, by contrast, would slide back to shift s+1 and re-compare all 3 a's from scratch at every shift.
Watch KMP match pattern "ababc" against text. When a mismatch occurs, the pattern shifts using π — notice how text characters are never re-examined. Purple shows the fallback. Compare the comparison count to naive.
The proof is elegant and uses an amortized argument. Define a potential Φ = q (the current number of matched characters). At each step of the for loop:
Since q starts at 0 and can never go negative, the total number of q-increases over the entire algorithm is at most n (one per for-loop iteration). Each while-loop iteration pays for one q-decrease, and total q-decreases cannot exceed total q-increases. So the while loop runs at most n times total across all iterations of the for loop.
Total work: n iterations of the for loop + at most n while-loop iterations + O(m) to build π = O(n + m).
KMP has a reputation for being hard to code from memory. Here is a mental model that makes it stick. There are only two functions, and they have nearly identical structure:
Matches P against itself. Variable k = matched prefix length. For each position q from 1 to m-1: (1) while k>0 and mismatch, fall back k = pi[k-1]. (2) if match, k++. (3) pi[q] = k.
Matches P against T. Variable q = matched prefix length. For each position i from 0 to n-1: (1) while q>0 and mismatch, fall back q = pi[q-1]. (2) if match, q++. (3) if q==m, record match, q = pi[q-1].
The structure is the same: while-fallback, if-extend, store/check. The only differences: (a) compute_prefix stores pi[q], kmp_match checks q==m. (b) kmp_match resets q after a full match. (c) compute_prefix starts at q=1 (skip the trivial base case), kmp_match starts at i=0.
Practice writing both side by side. Once you see they are the same algorithm applied to different inputs (P vs itself, and T vs P), you only need to memorize one template.
KMP and FA matching process text in the same way — one character at a time, left to right, maintaining a "how much have I matched" state. The difference is how they compute the next state:
| Aspect | FA Matching | KMP |
|---|---|---|
| State representation | Integer q (same) | Integer q (same) |
| Forward transition | Table lookup: δ[q][c] | Check P[q]==T[i], increment q |
| Backward transition | Table lookup: δ[q][c] | Follow π chain until match |
| Preprocessing | O(m|Σ|) — build full table | O(m) — build π only |
| Per-character cost | O(1) always | O(1) amortized |
FA matching has O(1) worst case per character (single table lookup), while KMP has O(1) amortized (the while loop might run multiple times for one character, but averages out). In practice, both are extremely fast. KMP wins on preprocessing cost when the alphabet is large.
The prefix function is the engine that powers KMP. Computing it is the subtle part — and beautifully, the algorithm to compute π is itself a form of string matching. It matches the pattern against itself.
π[0] = 0 always (a single character has no proper prefix = suffix). For each position q from 1 to m-1, we want the longest k such that P[0..k-1] = P[q-k+1..q].
Here is the key: π[q] can be at most π[q-1] + 1. Why? If we have a prefix-suffix of length k for P[0..q], then P[0..k-2] is a prefix-suffix of P[0..q-1] with length k-1. So k-1 ≤ π[q-1], meaning k ≤ π[q-1] + 1.
This means: to find π[q], first check if P[π[q-1]] == P[q]. If yes, π[q] = π[q-1] + 1. If not, we need to try shorter prefix-suffixes. Where? Follow the chain: try k = π[π[q-1]-1], check if P[k] == P[q]. If not, try k = π[k-1], and so on, until we find a match or k reaches 0.
python def compute_prefix(P): """Compute the prefix function for pattern P.""" m = len(P) pi = [0] * m k = 0 # length of current prefix-suffix for q in range(1, m): # Follow the failure chain until match or k=0 while k > 0 and P[k] != P[q]: k = pi[k - 1] # fall back if P[k] == P[q]: k += 1 # extend pi[q] = k return pi
Look at this code. It is almost identical to the KMP matching code. That is not a coincidence. Computing π IS matching P against itself. The variable k plays the role of the matched-character count, and q plays the role of the text index. The same amortized argument applies: k increases at most m-1 times, and each while-loop iteration decreases k, so total work is O(m).
Let us trace compute_prefix for P = "aabaaab" (m = 7):
The critical step is q=5. We tried k=2 (trying to extend the prefix-suffix of length 2 from position 4), but P[2]='b' ≠ P[5]='a'. So we followed the failure chain to k = π[1] = 1. Now we check P[1]='a' vs P[5]='a' — match! So π[5] = 2. The failure chain found a shorter prefix-suffix that we could extend.
Step through the prefix function computation for "aabaaab". Purple highlights show the failure chain. The top row shows the pattern, the bottom row shows π values as they are computed.
At first glance, the while loop inside the for loop looks dangerous. The for loop runs m-1 times. The while loop could run multiple times per iteration. Is this O(m2)?
No. The amortized argument is identical to the one for KMP matching. Think of k as a bank account:
k += 1 executes at most once per for-loop iteration (m-1 times total). Each execution deposits one unit in the bank.k = pi[k-1] strictly decreases k by at least 1. Each execution withdraws at least one unit from the bank.Deposits: at most m-1. Withdrawals: at most m-1 (cannot withdraw more than deposited). Total while-loop iterations across the entire for loop: at most m-1. Total work: O(m).
Let us compute π for a pattern with a deeper failure chain.
Notice: π[5] = 3 means "aabaab" has a 3-character prefix "aab" that equals its 3-character suffix "aab". This immediately tells us the string has period 3. This is the power of the prefix function — structural information about the string falls out for free.
The prefix function encodes deep structure about the pattern. One powerful consequence: it reveals the period of the string.
A string S of length n has period p if S[i] = S[i + p] for all valid i. The shortest period of S is n - π[n-1] (using 0-indexed π). If n is divisible by this period, the string is a perfect repetition of the period.
This property is the basis of interview problems like "find the shortest repeating unit of a string" — a one-liner once you know the prefix function.
Now the payoff. Enter any text and pattern below, and watch all four algorithms race side by side. Each one counts its character comparisons. You will see exactly where KMP and the FA approach win: on patterns with repeated prefixes that cause the naive algorithm to thrash.
Enter text and pattern, then click Run Race. Try the suggested inputs below to see dramatic differences.
| Input | Pattern | What Happens |
|---|---|---|
aaaaaaaaaaaaaab | aaab | Naive uses ~44 comparisons. KMP uses ~18. The pattern has a long repeated prefix ("aaa") that naive re-examines at every shift. KMP's π = [0,1,2,0] lets it skip ahead efficiently. |
abcxabcxabcxabcy | abcxabcy | The pattern nearly matches twice before succeeding. KMP's prefix function encodes the "abcx" overlap and skips half the work. |
hello world hello world | world | All algorithms perform similarly — "world" has no repeated prefixes, so there is little structure to exploit. The first character 'w' mismatches immediately at most positions. |
31415926535897932 | 2653 | Short pattern, numeric text. Rabin-Karp excels here — the rolling hash eliminates most windows in O(1) each. |
You might wonder: "When does the worst case actually happen? Is it just a theoretical curiosity?" No. The worst case occurs in real applications:
The interactive counter below lets you see the ratio between naive and KMP on worst-case inputs (all 'a's with pattern ending in 'b'). Drag the sliders and watch the bars diverge.
Adjust the text length and pattern repetitiveness to see how comparison counts scale. The pattern is repeated 'a's followed by 'b'. Notice: as the pattern gets longer, the naive/KMP ratio grows linearly — naive does O(nm) work while KMP stays at O(n+m).
After seeing all four algorithms race, here are the practical guidelines:
The text is very short (under 1000 characters), or you are prototyping and do not want to implement preprocessing. The naive algorithm has zero overhead and is cache-friendly. For short texts and non-repetitive patterns, it is actually competitive.
You need guaranteed O(n+m) performance. You are processing untrusted input (adversarial patterns or text). You are in a streaming scenario where text arrives character by character. KMP never backtracks in the text.
You need to search for multiple patterns simultaneously. You need 2D pattern matching. You are doing plagiarism detection or fingerprinting. The implementation is simpler than KMP and usually fast enough in practice.
You are searching through large files of natural language text (English, code, logs). The pattern is relatively long (5+ characters) and the alphabet is large. Boyer-Moore can be 2-5x faster than KMP in practice due to sublinear scanning.
CLRS covers four algorithms (naive, Rabin-Karp, FA, KMP). But the string matching story is far richer. Two important extensions deserve your attention.
Boyer-Moore is often the fastest string matching algorithm in practice, even though its worst case is O(nm). The key idea: compare characters right to left within the pattern window. When a mismatch occurs on a character that does not appear in the pattern at all, skip the entire pattern length forward.
Consider searching for "EXAMPLE" (length 7) in a long text. We align the pattern with the text and compare the last character of the pattern ('E') with the text character at that position. If the text character is 'Z' — a character that does not appear anywhere in "EXAMPLE" — then no shift of the pattern that includes this text position can possibly match. We jump forward by 7 positions in one shot.
Boyer-Moore uses two heuristics to determine the shift amount:
When a mismatch occurs at pattern position j with text character c, find the rightmost occurrence of c in P[0..j-1]. Shift the pattern so that occurrence aligns with the text position. If c does not appear at all, shift past the entire window. This is the "big skip" heuristic — it can skip m positions in one step.
When a mismatch occurs after matching a suffix of the pattern, shift the pattern so that a different occurrence of that suffix (or a matching prefix) aligns. This is the right-to-left analog of KMP's prefix function. It prevents Boyer-Moore from missing matches when the bad character rule gives a small shift.
In practice, Boyer-Moore with just the bad character rule is already very fast on English text (alphabet of ~70 common characters). Average case is sublinear: O(n/m), meaning it examines only a fraction of the text characters. The larger the alphabet and the longer the pattern, the bigger the skips.
Watch Boyer-Moore compare right-to-left and make big jumps when the mismatched character is not in the pattern. Notice how few comparisons it needs.
T = "here is a simple example", P = "example" (m=7). The bad character table for P:
This example shows Boyer-Moore's superpower: on natural text with diverse characters, most mismatches involve characters not in the pattern, causing full-length skips. The algorithm often examines only n/m characters — sublinear!
The bad character table is simple to build. For each character c in the alphabet, store the rightmost position of c in the pattern. Characters not in the pattern get value -1 (meaning "shift past the whole window"). Here is the code:
python def boyer_moore(T, P): """Find all occurrences of P in T using Boyer-Moore (bad char only).""" n, m = len(T), len(P) # Build bad character table bad_char = {} for i in range(m): bad_char[P[i]] = i # rightmost occurrence matches = [] s = 0 # current shift while s <= n - m: j = m - 1 # start comparing from the right while j >= 0 and P[j] == T[s + j]: j -= 1 if j < 0: # full match matches.append(s) s += m # shift past the match (simplified) else: bc = bad_char.get(T[s + j], -1) s += max(1, j - bc) # shift by bad character rule return matches
The shift amount j - bc aligns the rightmost occurrence of the mismatched text character with the current position. If bc > j (the character appears to the right of j in the pattern), we use max(1, ...) to avoid shifting backwards. This simple version omits the good suffix rule but is already very effective in practice.
All four CLRS algorithms search for a single pattern. What if you need to find all occurrences of k patterns simultaneously? Running KMP once per pattern costs O(k(n + mavg)). Aho-Corasick does it in O(n + total pattern length + number of matches) — one pass through the text, regardless of k.
The idea: build a trie of all patterns. Then add "failure links" connecting each trie node to the longest proper suffix that is also a prefix of some pattern (generalizing KMP's prefix function from a single string to a tree of strings). Process the text through this augmented trie. At each text character, follow the trie forward if possible, or use the failure link to find the next viable match position.
Applications: virus scanners (match thousands of malware signatures in each file), bibliographic search (find all keyword occurrences in a document), DNA motif discovery (search for thousands of known gene sequences in a genome).
All our algorithms preprocess the pattern and then scan the text. But what if you need to search the same text for many different patterns? It is better to preprocess the text once and answer pattern queries quickly.
A suffix array sorts all suffixes of the text lexicographically. Binary search finds any pattern in O(m log n). A suffix tree (Ukkonen's algorithm) can be built in O(n) and finds any pattern in O(m). These are the backbone of bioinformatics — genome databases index 3-billion-character sequences into suffix arrays so that gene lookups take milliseconds.
| Algorithm | Preprocess | Match | Best For |
|---|---|---|---|
| Naive | None | O(nm) | Simple, short text, prototyping |
| Rabin-Karp | O(m) | Expected O(n) | Multiple patterns, 2D matching, plagiarism |
| FA | O(m|Σ|) | O(n) | Small alphabets (DNA), hardware implementation |
| KMP | O(m) | O(n) | Single pattern, guaranteed O(n+m) |
| Boyer-Moore | O(m + |Σ|) | O(n/m) best, O(nm) worst | Large alphabets, English text, grep |
| Aho-Corasick | O(sum of patterns) | O(n + matches) | Multiple simultaneous patterns |
| Suffix Array | O(n) | O(m log n) per query | Many queries on fixed text |
| Suffix Tree | O(n) | O(m) per query | Many queries, largest common substring |
String matching is everywhere. You use it dozens of times a day without thinking about it. Here are the major application domains, each with the algorithm it typically uses and why that choice makes sense.
When you press Ctrl+F in VS Code or Sublime Text, the editor runs a string matching algorithm on the visible buffer. Most editors use a variant of Boyer-Moore because it is fastest on natural language text. English text has a large effective alphabet (~70 characters), and patterns rarely have repeated structure. Boyer-Moore's bad character rule skips huge chunks when the search term contains uncommon characters. Searching for "quantum" in a 10,000-line file takes far fewer than 10,000 · 7 comparisons — more like 10,000 / 7 ≈ 1,400 in favorable cases.
The classic Unix grep is actually two tools in one. When you use grep -F "pattern" (fixed string), GNU grep runs Boyer-Moore. When you use grep "pattern.*regex" (regular expression), it builds a Thompson NFA — a generalization of our finite automaton approach to handle wildcards, character classes, repetition, and alternation. The NFA is simulated directly (without converting to DFA) to avoid the potential O(2m) state explosion that DFA construction can cause on some regexes.
This is why grep -F is faster than grep for literal strings — it uses a specialized algorithm instead of the general regex engine.
A human genome has approximately 3.2 billion base pairs (characters). Searching for a gene sequence (a pattern of 20-200 bases) requires extreme efficiency. The alphabet is small: Σ = {A, C, G, T}, |Σ| = 4. This makes the FA approach viable (the transition table has only 4 columns).
In practice, tools like BLAST (Basic Local Alignment Search Tool) use a seed-and-extend approach: find exact matches of short seeds (11-mers) using hash tables (like Rabin-Karp), then extend promising seeds using dynamic programming (Smith-Waterman). Modern tools like BWA and Bowtie use the Burrows-Wheeler Transform (a close relative of suffix arrays) to index the entire genome and answer pattern queries in milliseconds.
Given a student's essay and a database of source documents, find all matching substrings above a threshold length. This is naturally a multi-pattern, multi-document problem. Rabin-Karp is the algorithm of choice: hash every m-gram (overlapping window of m words) in both the submission and the source documents. Store source hashes in a hash table. For each submission hash, check for matches in O(1). Systems like MOSS (Measure of Software Similarity) and Turnitin use variants of this approach.
Firewalls and intrusion detection systems (Snort, Suricata) scan every network packet for thousands of known attack signatures simultaneously. Each packet is 1-1500 bytes. The signature database contains 10,000+ patterns. This is the classic Aho-Corasick use case: build a trie of all attack patterns, process each packet through the trie in a single pass. Finding any matching pattern triggers an alert. At 10 Gbps, the system must process about 800,000 packets per second — there is no time to run KMP 10,000 times per packet.
Google does not do string matching on the raw web. It builds an inverted index: for each word, store a list of documents containing that word. A query like "string matching algorithm" is decomposed into three word lookups in the index, and the results are intersected. But string matching still appears in snippet generation (highlighting query terms in results), URL matching, and autocomplete.
When a compiler reads your source code, it breaks the text into tokens (keywords, identifiers, numbers, operators). This lexical analysis is a multi-pattern matching problem: "if", "else", "for", "while", "return" etc. The lexer builds a DFA from all keyword patterns (using Aho-Corasick or explicit NFA-to-DFA conversion) and processes the source code in a single pass. Tools like Flex (Fast Lexical Analyzer) automate this — you specify the patterns, and Flex generates the DFA as C code.
Let us be concrete about what goes in and what comes out of each algorithm. This is what you need to implement them.
Search a random DNA sequence for a motif using KMP. Enter a motif (e.g., "GATTACA") and click Search. The algorithm highlights all occurrences. Click "New DNA" to generate a fresh sequence.
String matching problems appear regularly in coding interviews. You are unlikely to be asked to implement KMP from scratch (it is hard to get right under pressure), but you MUST understand the prefix function — it appears in problems about repeated substrings, string periodicity, and pattern matching. Here is your battle kit.
| Dimension | What They Test | Example Question |
|---|---|---|
| CONCEPT | Do you understand the prefix function? | "Why is KMP O(n+m)? Walk through the amortized argument." |
| DESIGN | Can you pick the right algorithm? | "Design a system to detect plagiarism across 10M documents." |
| CODE | Can you implement from memory? | "Implement KMP in 15 minutes." or "Implement Rabin-Karp." |
| DEBUG | Can you find bugs in string code? | "This Rabin-Karp has a bug — find it." (Missing mod, off-by-one, negative hash) |
| FRONTIER | Do you know the broader landscape? | "How does grep work? What is Aho-Corasick? When would you use a suffix array?" |
This is the most important drill. You should be able to write both compute_prefix and kmp_match from memory in under 10 minutes. Practice until the code flows naturally.
python def kmp(T, P): """Find all occurrences of P in T. O(n+m).""" # Step 1: Build prefix function m = len(P) pi = [0] * m k = 0 for q in range(1, m): while k > 0 and P[k] != P[q]: k = pi[k - 1] if P[k] == P[q]: k += 1 pi[q] = k # Step 2: Match q = 0 result = [] for i in range(len(T)): while q > 0 and P[q] != T[i]: q = pi[q - 1] if P[q] == T[i]: q += 1 if q == m: result.append(i - m + 1) q = pi[q - 1] return result
The tricky parts: (1) computing dm-1 mod q efficiently, (2) handling negative modular arithmetic, (3) remembering to verify on hash match.
python def rabin_karp(T, P): """Find all occurrences of P in T. Expected O(n+m).""" n, m, d, q = len(T), len(P), 256, 1000000007 h = pow(d, m-1, q) ph = th = 0 for i in range(m): ph = (d*ph + ord(P[i])) % q th = (d*th + ord(T[i])) % q result = [] for s in range(n-m+1): if ph == th and T[s:s+m] == P: result.append(s) if s < n-m: th = (d*(th - ord(T[s])*h) + ord(T[s+m])) % q return result
Given T and P, find all positions where an anagram (permutation) of P appears in T. This is not string matching per se — it is a sliding window with character frequency comparison. But it appears in interviews alongside string matching problems.
python from collections import Counter def find_anagrams(T, P): """Find all positions where an anagram of P appears in T. O(n).""" m, n = len(P), len(T) if m > n: return [] p_count = Counter(P) w_count = Counter(T[:m]) result = [] if w_count == p_count: result.append(0) for i in range(m, n): w_count[T[i]] += 1 # add incoming character out = T[i - m] w_count[out] -= 1 # remove outgoing character if w_count[out] == 0: del w_count[out] # clean up zeros for equality check if w_count == p_count: result.append(i - m + 1) return result
Given string S, find the shortest string R such that S is R repeated k times (or determine that no such R exists). This is a one-liner with the prefix function.
python def shortest_period(S): """Return shortest repeating unit of S, or S itself if none.""" pi = compute_prefix(S) # from KMP n = len(S) period = n - pi[n - 1] # candidate period length if n % period == 0: return S[:period] # perfect repetition return S # no repeating unit
Given strings A and B, find the minimum number of times A must be repeated such that B is a substring of the repeated string. Example: A = "abc", B = "cabcabca" — answer is 4 ("abcabcabcabc" contains "cabcabca").
python def repeated_match(A, B): """Min repetitions of A to contain B as substring. -1 if impossible.""" # Need at least ceil(len(B)/len(A)) copies, plus at most 1 more reps = -(-len(B) // len(A)) # ceiling division for k in [reps, reps + 1]: if B in A * k: # Python's `in` uses fast string matching return k return -1
Given a string S, find the longest prefix which is also a suffix (not equal to S). This is literally π[n-1]. The "happy prefix" name is just LeetCode branding — the underlying structure is the prefix function.
python def longest_happy_prefix(s): pi = compute_prefix(s) return s[:pi[-1]] # the prefix of length pi[n-1] # "ababab" → pi = [0,0,1,2,3,4] → pi[5]=4 → "abab" # "a" → pi = [0] → pi[0]=0 → ""
| Problem | Key Technique | Complexity |
|---|---|---|
| Find all occurrences of P in T | KMP | O(n + m) |
| Find all anagram occurrences | Sliding window + char count | O(n) |
| Shortest repeating unit | Prefix function: period = n - π[n-1] | O(n) |
| Repeated string match | KMP on repeated string | O(n + m) |
| Longest prefix = suffix | π[n-1] | O(n) |
| Check if B is rotation of A | Check if B in A+A (KMP) | O(n) |
| Count distinct substrings | Suffix array + LCP array | O(n log n) |
| Longest common substring | Suffix array + LCP, or DP O(nm) | O(n log n) or O(nm) |
| Edit distance (Levenshtein) | 2D DP | O(nm) |
| Wildcard pattern matching | DP or FFT-based | O(nm) or O(n log n) |
String matching connects to several other chapters in CLRS and to broader algorithmic themes. Understanding these connections makes you a stronger algorithm designer.
Rabin-Karp is fundamentally a hashing algorithm applied to string matching. The rolling hash is a special case of polynomial hashing: treating the string as a polynomial in base d, evaluated at d, modulo q. The choice of prime q, the modular arithmetic, and the handling of collisions all draw on hash function theory from Chapter 11.
The connection goes deeper. Universal hashing — choosing hash functions randomly from a family — is what gives Rabin-Karp its expected O(n+m) guarantee. With a randomly chosen prime q, the probability that any particular non-matching window produces a spurious hit is at most m/q, which is negligible for large q.
The FA matching algorithm is a special case of DFA simulation. Regular expressions compile to NFAs (Thompson's construction), which can be converted to DFAs (subset construction) for execution. This is how grep, sed, and every regex engine works. The string matching DFA from Chapter 2 is the simplest case: matching a single literal string.
The connection to KMP: the FA transition table can be computed from the prefix function in O(m|Σ|) time, because π encodes the same "fallback" information that the DFA transitions encode. In fact, KMP's matching loop is equivalent to simulating a DFA where non-forward transitions are computed on-the-fly from π instead of stored in a table.
The prefix function computation has a DP flavor: π[q] depends on π values at smaller indices. But the deeper connection is to approximate string matching: finding the substring of T most similar to P, allowing insertions, deletions, and substitutions.
The edit distance (Levenshtein distance) between two strings is computed using a 2D DP table in O(nm) time. This is the foundation of spell checkers, DNA alignment (Needleman-Wunsch for global, Smith-Waterman for local), diff tools, and autocorrect. When exact matching is too strict, approximate matching with DP fills the gap.
A concrete connection: the edit distance DP recurrence is:
where D[i][j] is the edit distance between T[0..i-1] and P[0..j-1]. The third term (diagonal) is the "substitution" cost. When T[i] = P[j], it is free — exactly like the "match" case in string matching. Edit distance generalizes exact matching: when D[n][m] = 0, the strings are identical.
Rabin-Karp relies on modular arithmetic with large primes. The probability of a spurious hit depends on the distribution of hash values modulo q, which connects to number-theoretic results about the density of primes and the uniformity of polynomial hash functions over finite fields. Using multiple independent hash functions (different primes) reduces false positive probability exponentially — a technique also used in Bloom filters.
There is a surprising connection between string matching and polynomial multiplication. Define indicator polynomials for the text and pattern, then compute their convolution using FFT. The convolution at position s counts the number of matching characters at shift s. This gives an O(n log n) algorithm for string matching — slower than KMP for standard matching, but it generalizes to matching with wildcards, which KMP and FA cannot handle.
| If you want to learn... | Study... |
|---|---|
| Approximate matching (fuzzy search) | Edit distance (CLRS Ch15, DP), BK-trees |
| Regular expression matching | Thompson NFA, NFA-to-DFA subset construction |
| Genome indexing | Suffix arrays, BWT (Burrows-Wheeler Transform), FM-index |
| Multiple pattern matching | Aho-Corasick, Commentz-Walter |
| Compressed pattern matching | Lempel-Ziv factorization, grammar-compressed matching |
| Wildcard matching | FFT-based convolution (CLRS Ch30) |
| Streaming pattern matching | Porat-Porat algorithm (O(log m) space) |
The Z-array is an alternative to the prefix function, popular in competitive programming. For a string S of length n, Z[i] = length of the longest substring starting at position i that matches a prefix of S. Z[0] is undefined (or set to n). While π[q] looks at what ends at position q (longest prefix = suffix of P[0..q-1]), Z[i] looks at what starts at position i (longest match with the start of S).
To use Z for pattern matching: concatenate P + "$" + T (where "$" is a separator not in either string), compute the Z-array. Any position i > m where Z[i] = m indicates that T[i-m-1..] starts with the full pattern — a match. The Z-algorithm runs in O(n + m) and is often easier to code than KMP in competitions because the Z-array logic is slightly more intuitive (extending a match forward rather than falling back).
python def z_function(S): """Compute Z-array in O(n).""" n = len(S) Z = [0] * n Z[0] = n l, r = 0, 0 for i in range(1, n): if i < r: Z[i] = min(r - i, Z[i - l]) while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]: Z[i] += 1 if i + Z[i] > r: l, r = i, i + Z[i] return Z def z_match(T, P): """Find all occurrences of P in T using Z-algorithm.""" S = P + "$" + T Z = z_function(S) m = len(P) return [i - m - 1 for i in range(m + 1, len(S)) if Z[i] == m]
The prefix function and Z-array encode equivalent information and can be converted between each other in O(n). They are two views of the same combinatorial structure on strings. In interviews, knowing both gives you flexibility — use whichever one you can code faster under pressure.
"An algorithm must be seen to be believed."
— Donald Knuth