String Matching (CLRS Ch32) — From Absolute Zero to Mastery

Chapter 0: The Problem

You are building a text editor. The user presses Ctrl+F and types "needle". Your document contains 50,000 characters. You need to find every occurrence of "needle" in the document and highlight them all — instantly. How?

Or you work at a security firm. Every network packet flowing through the firewall must be checked against 10,000 known malware signatures. Each packet is thousands of bytes. You need to check all signatures in real time on a 10 Gbps link. How?

Or you are a bioinformatician. You have a genome with 3 billion base pairs and a gene sequence of 200 bases. You need to find every location where that gene appears. How?

All three problems reduce to the same fundamental question.

The Formal Problem

Given a text T of length n and a pattern P of length m (where m ≤ n), find all positions in T where P occurs as a substring. More precisely: find every valid shift s (where 0 ≤ s ≤ n - m) such that T[s..s+m-1] equals P[0..m-1]. We call each valid shift a match or an occurrence.

The notation matters. We use zero-based indexing throughout. T[i] is the character at position i. T[s..s+m-1] is the substring of T starting at position s with length m. P[0..m-1] is the entire pattern.

The Naive Algorithm

The brute-force approach is dead simple. Place the pattern at position 0 of the text. Compare character by character, left to right. If every character matches, record a hit. Then slide the pattern one position to the right and repeat. Keep going until the pattern falls off the end of the text.

python
def naive_match(T, P):
    """Return all positions where P occurs in T."""
    n, m = len(T), len(P)
    matches = []
    for s in range(n - m + 1):   # try every shift
        j = 0
        while j < m and T[s + j] == P[j]:
            j += 1
        if j == m:                    # all m characters matched
            matches.append(s)
    return matches

Let us trace it on a small example. T = "ababcab", P = "abc".

// Shift s=0: compare T[0..2] = "aba" vs P = "abc"
T[0]='a' == P[0]='a' ✓
T[1]='b' == P[1]='b' ✓
T[2]='a' != P[2]='c' ✗ → mismatch after 3 comparisons

// Shift s=1: compare T[1..3] = "bab" vs P = "abc"
T[1]='b' != P[0]='a' ✗ → mismatch after 1 comparison

// Shift s=2: compare T[2..4] = "abc" vs P = "abc"
T[2]='a' == P[0]='a' ✓
T[3]='b' == P[1]='b' ✓
T[4]='c' == P[2]='c' ✓ → MATCH at shift 2! (3 comparisons)

// Shift s=3: compare T[3..5] = "bca" vs P = "abc"
T[3]='b' != P[0]='a' ✗ → mismatch after 1 comparison

// Shift s=4: compare T[4..6] = "cab" vs P = "abc"
T[4]='c' != P[0]='a' ✗ → mismatch after 1 comparison

// Total: 5 shifts, 9 comparisons, 1 match

The Cost of Naive

In the best case, the first character of the pattern mismatches at every shift. That is n - m + 1 comparisons — basically O(n). This happens when the text and pattern share few characters (e.g., searching for "xyz" in "aaaaaaa").

In the worst case, the naive algorithm is devastating. Consider searching for "aaaaab" in "aaaaaaaaaa...a" (a string of all a's). At every shift, we compare 5 a's before hitting the mismatch at 'b'. Nearly every comparison is wasted because we already knew those a's matched from the previous shift. The cost is (n - m + 1) · m, which is O(n · m).

T_naive(n, m) = O((n - m + 1) · m)

For n = 1,000,000 and m = 1,000, that is about 10⁹ comparisons. For n = 3 · 10⁹ (a genome) and m = 200, it is 6 · 10¹¹. Way too slow.

Let us see the scale of the problem concretely:

Scenario	n	m	Naive worst case	O(n+m) target	Speedup
Ctrl+F in editor	50,000	10	500,000	50,010	10x
grep in log file	10⁷	20	2 · 10⁸	~10⁷	20x
Gene search in genome	3 · 10⁹	200	6 · 10¹¹	~3 · 10⁹	200x
IDS signature scan	1,500	50	75,000	1,550	48x

The speedup grows linearly with m. The longer the pattern, the more the naive algorithm wastes, and the more the smart algorithms save.

The simulation below makes the waste visible. Watch how the naive algorithm re-examines characters it has already seen.

Naive String Matching — Sliding Window

Watch the pattern slide along the text. Green = match, red = mismatch, orange = current comparison. Click Step to advance one comparison at a time, or Play to animate.

Comparisons: 0 | Matches found: 0

Notice what happens at shift s=0: we match "ab" and then fail at position 2. At shift s=2, we compare "a" again — even though we already know T[2]='a' from the previous attempt. At shift s=0, we saw that T[0]='a', T[1]='b'. But shift s=1 does not use any of that knowledge. It starts from scratch.

This is the fundamental waste: information from partial matches is discarded. Every algorithm in this chapter is a different strategy for preserving and exploiting that information.

The core question of this chapter. Can we avoid re-examining characters? When a mismatch occurs after matching k characters, we already know what those k characters are — they are the first k characters of the pattern. Can we use that knowledge to skip ahead? The answer is yes, and it leads to three beautiful algorithms: Rabin-Karp (expected O(n+m) via hashing), finite automaton matching (O(n) after O(m|Σ|) preprocessing), and Knuth-Morris-Pratt (O(n+m) guaranteed). Our goal: O(n + m) — linear in the total input size.

The Landscape

Before we dive into individual algorithms, here is the map of where we are going:

Rabin-Karp (Chapter 1)

Hash the pattern. Hash each text window. Compare hashes in O(1). If hashes match, verify characters. Expected O(n+m), worst case O(nm). Simple, versatile, extends to multiple patterns and 2D matching.

Finite Automaton (Chapter 2)

Build a DFA from the pattern. Process text in a single pass. O(n) matching, O(m|Σ|) preprocessing. Elegant, guaranteed linear matching, but preprocessing cost depends on alphabet size.

Knuth-Morris-Pratt (Chapters 3-4)

The prefix function: for each position in the pattern, precompute the longest proper prefix that is also a suffix. Use this to skip ahead on mismatch. O(n+m) guaranteed, O(m) preprocessing. The gold standard.

You search for pattern "abc" (length 3) in text "ababcabc" (length 8) using the naive algorithm. The valid shifts to check are s = 0, 1, 2, 3, 4, 5 (six shifts). How many total character comparisons does the naive algorithm make?

8 (one per text character) 10 12 (s=0: a,b,a mismatch=3; s=1: b mismatch=1; s=2: a,b,c match=3; s=3: b mismatch=1; s=4: c mismatch=1; s=5: a,b,c match=3) 18 (worst case upper bound 6 × 3)

Chapter 1: Rabin-Karp

The naive algorithm compares characters one by one. What if we could compare entire substrings in O(1) by comparing their fingerprints? That is the idea behind Rabin-Karp: hash the pattern, hash each window of text, and compare hashes instead of characters.

If the hash of a text window differs from the pattern's hash, we know for certain the window does not match — skip it instantly. If the hashes are equal, we might have a match (or a collision), so we verify character by character. The trick that makes this fast is computing the next window's hash from the previous one in O(1) time — the rolling hash.

Treating Strings as Numbers

Every string is a sequence of characters, and every character has a numeric code. In ASCII, 'a' = 97, 'b' = 98, 'c' = 99, and so on. We can treat a string of length m as an m-digit number in base d, where d is the alphabet size.

For a simple example, let us use decimal digits as our alphabet (d = 10). The string "314" represents the number 3 · 10² + 1 · 10¹ + 4 · 10⁰ = 314. More generally:

h(S[0..m-1]) = S[0] · d^m-1 + S[1] · d^m-2 + ... + S[m-1] · d⁰

For real characters with d = 256, this number can be astronomically large. A 10-character string produces a number up to 256¹⁰ ≈ 10²⁴ — too large to fit in a machine word. So we take the hash modulo a large prime q:

h(S) = (S[0] · d^m-1 + S[1] · d^m-2 + ... + S[m-1]) mod q

This keeps numbers in the range [0, q-1]. The price: different strings can produce the same hash value. This is a collision, and we must handle it.

The Rolling Hash

Here is the key insight that makes Rabin-Karp efficient. When the window slides from position s to s+1, the new window differs from the old one by exactly two characters: the old leftmost character drops out, and a new rightmost character enters. We can update the hash in O(1):

h(T[s+1..s+m]) = (d · (h(T[s..s+m-1]) - T[s] · d^m-1) + T[s+m]) mod q

Read that formula step by step:

Subtract the contribution of the outgoing character: T[s] · d^m-1 (the leftmost digit)
Multiply by d — this shifts every remaining digit one position left (like multiplying 314 by 10 to get 3140)
Add the incoming character: T[s+m] (the new rightmost digit)
Mod q to keep the number in range

This is exactly how you would update a decimal number by dropping the leftmost digit and appending a new rightmost digit. For example, to go from 31415 to 14159: subtract 3 · 10000, multiply by 10, add 9.

Why modular arithmetic? Without the mod q, the hash of a window could be astronomically large (d^m grows exponentially). Taking mod q keeps numbers manageable — all arithmetic fits in 64-bit integers. The tradeoff: different strings can produce the same hash (a spurious hit or collision). When hashes match, we must verify character by character. Choosing a large prime q (like 10⁹ + 7) makes collisions rare. With a random prime q, the probability of a spurious hit at any position is 1/q — negligible.

Worked Example

Let T = "31415926535", P = "26", d = 10, q = 13. We use small numbers for visibility.

First, compute d^m-1 mod q = 10¹ mod 13 = 10.

Then compute the pattern hash: h(P) = (2 · 10 + 6) mod 13 = 26 mod 13 = 0.

Now slide through the text windows of length 2:

// Initial window hash:
h("31") = (3 · 10 + 1) mod 13 = 31 mod 13 = 5
5 ≠ 0 → skip

// Roll to next window — remove '3', add '4':
h("14") = (10 · (5 - 3 · 10) + 4) mod 13
= (10 · (-25) + 4) mod 13 = (-246) mod 13 = 1
1 ≠ 0 → skip

// Easier: compute directly for each window
h("41") = 41 mod 13 = 2 ≠ 0 → skip
h("15") = 15 mod 13 = 2 ≠ 0 → skip
h("59") = 59 mod 13 = 7 ≠ 0 → skip
h("92") = 92 mod 13 = 1 ≠ 0 → skip
h("26") = 26 mod 13 = 0 == 0 → VERIFY! "26" == "26" ✓ Match!
h("65") = 65 mod 13 = 0 == 0 → VERIFY! "65" ≠ "26" ✗ Spurious hit!
h("53") = 53 mod 13 = 1 ≠ 0 → skip
h("35") = 35 mod 13 = 9 ≠ 0 → skip

Notice h("65") = 0 = h("26") — a spurious hit. Both 26 and 65 are divisible by 13. We detect it by verifying the actual characters. With a good prime q (much larger than 13), spurious hits are extremely rare in practice.

The Code

python
def rabin_karp(T, P, d=256, q=1000000007):
    """Find all occurrences of P in T using Rabin-Karp."""
    n, m = len(T), len(P)
    if m > n:
        return []

    h = pow(d, m - 1, q)  # d^(m-1) mod q, precomputed
    p_hash = 0              # hash of pattern
    t_hash = 0              # hash of current text window
    matches = []

    # Compute initial hashes for P and T[0..m-1]
    for i in range(m):
        p_hash = (d * p_hash + ord(P[i])) % q
        t_hash = (d * t_hash + ord(T[i])) % q

    # Slide the window across T
    for s in range(n - m + 1):
        if p_hash == t_hash:          # hash match!
            if T[s:s+m] == P:         # verify to avoid spurious hit
                matches.append(s)
        if s < n - m:
            # Roll the hash: remove T[s], shift left, add T[s+m]
            t_hash = (d * (t_hash - ord(T[s]) * h) + ord(T[s + m])) % q
            if t_hash < 0:            # Python handles this, but in C:
                t_hash += q            # ensure non-negative
    return matches

Gotcha: negative modular arithmetic. In languages like C and Java, the expression (a - b) % q can return a negative value when a < b. You must add q to make it non-negative. Python's % operator always returns a non-negative result, so this is only a concern in other languages. This is a common source of bugs in interview implementations.

Complexity Analysis

Preprocessing: Computing the initial hashes takes O(m) — we process each character of P and the first window of T.

Matching: Each rolling hash update is O(1). We perform n - m updates. At each position, comparing hashes is O(1). Verification on hash match costs O(m). If there are v valid matches and c spurious hits, total verification cost is (v + c) · m.

Expected total: O(n + m + (v + c) · m). With a random prime q, the expected number of spurious hits is (n - m + 1) / q, which is negligible for large q. So expected time is O(n + m) assuming v is small.

Worst case: O(n · m) — if every hash collides (e.g., q = 1 and all hashes are 0). This is the same as naive. But with a good q, this never happens in practice.

Rabin-Karp Rolling Hash Animation

Watch the rolling hash value update as the window slides. Teal = hash match (verify!), orange = current window. Uses d=10, q=13 for visibility.

Window: — | Hash: — | Pattern hash: —

Why Rabin-Karp Matters

Rabin-Karp may seem weak — its worst case is no better than naive. But it has unique strengths that no other algorithm in this chapter shares:

Multiple pattern matching. To search for k patterns simultaneously, compute all k pattern hashes and store them in a hash set. At each text window, check if the window hash is in the set. Cost: O(n + km) expected. Naive would cost O(knm). This is the most common reason to choose Rabin-Karp over KMP.

2D pattern matching. Find a small m × m matrix inside a large n × n matrix. Hash each row of the pattern and each row of the text. Use column-wise rolling hashes. This extends naturally: just compose two 1D rolling hashes. Neither FA nor KMP generalizes to 2D easily.

Plagiarism detection. Hash every m-gram (overlapping window of m words) in two documents. Find matching hashes. This is the basis of MOSS, Turnitin, and similar systems. The key insight: you do not need to compare every pair of windows across two documents. Hash all windows from both, store them, and find collisions. With a hash set, this takes O(n₁ + n₂) expected time.

Common Pitfalls in Implementation

Rabin-Karp is simple in concept but tricky in practice. Here are the bugs that trip up most implementers:

// Bug 1: Negative modular arithmetic
// In C/Java: (-5) % 13 = -5, not 8
// Fix: th = ((th % q) + q) % q

// Bug 2: Integer overflow
// d * t_hash can overflow 32-bit int when d=256, m=10+
// Fix: use 64-bit integers, or use Python (arbitrary precision)

// Bug 3: Forgetting to verify on hash match
// Hash equality does NOT guarantee string equality
// You MUST compare characters when hashes match

// Bug 4: Wrong power computation
// h = d^(m-1) mod q, NOT d^m mod q
// The highest-order digit is multiplied by d^(m-1)

Choosing the Prime q

The prime q controls the collision rate. A larger q means fewer spurious hits but requires larger arithmetic. In practice:

q = 10⁹ + 7 (1000000007) — the standard choice for competitive programming. Fits in 64-bit with room for d · t_hash without overflow.
q = 10⁹ + 9 — another good prime, sometimes used as a second hash for double-hashing to virtually eliminate false positives.
q should NOT be a power of 2 — modular arithmetic by powers of 2 is just bit masking, which does not distribute hash values uniformly across strings.

Rabin-Karp for Multiple Patterns — Worked Example

Suppose you need to find all occurrences of three patterns: "ab", "bc", "cd" in text T = "abcde". With KMP, you would run three separate searches (3 · O(n)). With Rabin-Karp, you compute all three pattern hashes, store them in a set, and scan once:

// Pattern hashes (d=256, q=101):
h("ab") = (97 · 256 + 98) mod 101 = 24930 mod 101 = 89
h("bc") = (98 · 256 + 99) mod 101 = 25187 mod 101 = 42
h("cd") = (99 · 256 + 100) mod 101 = 25444 mod 101 = 96
Hash set = {89, 42, 96}

// Scan T:
h("ab") = 89 → in set! Verify "ab"=="ab" ✓ Match!
h("bc") = 42 → in set! Verify "bc"=="bc" ✓ Match!
h("cd") = 96 → in set! Verify "cd"=="cd" ✓ Match!
h("de") = ... → not in set. Skip.

// One pass through T, all 3 patterns found.

This scales beautifully. For k = 10,000 patterns, we still make one pass through the text. The hash set lookup is O(1). Total: O(n + km) for initial hashing + O(n) for scanning = O(n + km). Compare to k separate KMP searches: O(kn + km). When n is much larger than m, Rabin-Karp's single pass wins decisively.

Extension: 2D Pattern Matching

Find a 3×3 pattern matrix inside a 10×10 text matrix. Rabin-Karp extends naturally:

Hash each row of the pattern: h_row[0], h_row[1], h_row[2].
Combine row hashes into a column hash: H_pat = hash(h_row[0], h_row[1], h_row[2]).
For each column position in the text, compute row hashes using 1D rolling hash.
For each starting row, combine 3 consecutive row hashes into a column hash using a second rolling hash.
Compare column hashes to H_pat.

Total: O(n²) for an n×n text with an m×m pattern. Naive 2D matching would be O(n²m²).

In Rabin-Karp, when the hash of a text window equals the hash of the pattern, what must we do?

Report a match immediately — hash equality guarantees string equality Verify by comparing characters one by one, because hash collisions produce spurious hits Rehash with a different prime q and check again Skip ahead by m positions since we found a match

Chapter 2: Finite Automaton Matching

Rabin-Karp uses hashing to skip non-matching windows. There is a completely different approach: build a machine — a deterministic finite automaton (DFA) — that reads the text character by character and transitions between states. Each state represents "how many characters of the pattern have been matched so far." When the machine reaches the accepting state (state m), we have found a match.

The Idea

Think of it like playing a matching game. You are walking along the pattern from left to right, trying to match it against the incoming text. You maintain a counter: "how many characters of the pattern have I matched so far?" This counter is your state.

When the next text character matches the next expected pattern character, you advance: state goes from q to q+1. When it does not match, you need to figure out: "given what I have seen so far (the last q characters plus this new mismatched character), what is the longest prefix of the pattern that I have already matched?" You do not go back to the start — you jump to the longest position in the pattern where you could still be mid-match.

A DFA precomputes this jump for every possible (state, character) pair. After preprocessing, matching is a single left-to-right scan with no backtracking.

States and Transitions

Consider pattern P = "ababc" (length 5) over alphabet Σ = {a, b, c}. The DFA has m + 1 = 6 states, numbered 0 through 5. State q means "the last q characters of the text processed so far match the first q characters of P." State 0 is the start (nothing matched). State 5 is the accept state (full pattern matched).

The transition function δ(q, c) answers: "If I am in state q and I read character c, what is my new state?" The new state is the length of the longest prefix of P that is also a suffix of the string "P[0..q-1] + c".

For example, δ(4, 'a') asks: I have matched "abab" (state 4) and I read 'a'. The string so far ends with "ababa". What is the longest prefix of "ababc" that is a suffix of "ababa"? The answer is "aba" (length 3), so δ(4, 'a') = 3.

Think of it like autocomplete. The DFA is a state machine that constantly asks: "what is the longest pattern prefix I can still be working on, given the text I have seen?" It is like autocomplete on your phone — as you type each character, the system instantly knows which completions are still possible. The DFA does the same for pattern matching.

The Complete Transition Table

For pattern P = "ababc":

State q	Matched so far	δ(q, 'a')	δ(q, 'b')	δ(q, 'c')
0	""	1	0	0
1	"a"	1	2	0
2	"ab"	3	0	0
3	"aba"	1	4	0
4	"abab"	3	0	5
5 (accept)	"ababc"	1	2	0

Let us verify a few entries:

// δ(0, 'a') = 1: we see 'a', which matches P[0]='a' → state 1
// δ(0, 'b') = 0: we see 'b', no prefix of P starts with 'b' → state 0
// δ(2, 'a') = 3: matched "ab", see 'a' → "aba" matches P[0..2] → state 3
// δ(4, 'a') = 3: matched "abab", see 'a' → "ababa"
// longest prefix of "ababc" that is suffix of "ababa"?
// "ababc"? no (c≠a). "abab"? no (b≠a). "aba"? yes! → state 3
// δ(5, 'a') = 1: matched "ababc" (done!), see 'a' → "ababca"
// longest prefix of "ababc" that is suffix of "ababca"? "a" → state 1
// (the DFA keeps looking for overlapping matches!)

Hand-Tracing the DFA on Text

Let us run this DFA on text T = "ababababcab":

// Start: state = 0
i=0: char='a', δ(0,'a')=1 → state=1 (matched "a")
i=1: char='b', δ(1,'b')=2 → state=2 (matched "ab")
i=2: char='a', δ(2,'a')=3 → state=3 (matched "aba")
i=3: char='b', δ(3,'b')=4 → state=4 (matched "abab")
i=4: char='a', δ(4,'a')=3 → state=3 (fell back to "aba")
i=5: char='b', δ(3,'b')=4 → state=4 (matched "abab")
i=6: char='a', δ(4,'a')=3 → state=3 (fell back to "aba" again)
i=7: char='b', δ(3,'b')=4 → state=4 (matched "abab")
i=8: char='c', δ(4,'c')=5 → state=5 MATCH at position 4!
i=9: char='a', δ(5,'a')=1 → state=1 (restart, matched "a")
i=10: char='b', δ(1,'b')=2 → state=2 (matched "ab")

// Done. 1 match at position 4. 11 character lookups (one per text char).
// No character was ever examined twice. That's O(n).

The power of this approach: every text character is processed exactly once. There is no backtracking, no re-examination. The DFA transitions encode all the "failure" information implicitly. At step i=4, when we read 'a' instead of 'c', the DFA does not go back to state 0 — it jumps to state 3, because "ababa" ends with "aba" = P[0..2]. It automatically "remembers" the partial match.

Matching is Trivially Simple

Once the DFA is built, matching is a single pass through the text:

python
def fa_match(T, delta, m):
    """Match using precomputed DFA. delta[q][c] = next state."""
    q = 0                           # start state
    matches = []
    for i, c in enumerate(T):
        q = delta[q].get(c, 0)     # transition
        if q == m:                   # reached accept state
            matches.append(i - m + 1)
    return matches

That is it. One character at a time, one table lookup per character, no backtracking. O(n) time for matching. The text index i only moves forward. No character is ever re-examined. This is the fundamental advantage over naive.

Building the Transition Table

The preprocessing cost is higher. For each of the m+1 states and each of the |Σ| alphabet characters, we need to compute δ(q, c). The brute-force approach:

python
def build_fa(P, alphabet):
    """Build DFA transition table for pattern P."""
    m = len(P)
    delta = [{} for _ in range(m + 1)]

    for q in range(m + 1):
        for c in alphabet:
            # Find longest k such that P[0..k-1] is a suffix of P[0..q-1]+c
            k = min(m, q + 1)      # can't match more than q+1 or m
            while k > 0:
                suffix = P[:q] + c
                if suffix.endswith(P[:k]):
                    break
                k -= 1
            delta[q][c] = k
    return delta

This runs in O(m³ |Σ|) due to the string comparisons inside the while loop. It can be optimized to O(m |Σ|) using the prefix function (which we will meet in Chapter 4). The space cost is O(m |Σ|) for the table itself.

Optimized FA Construction with π

There is a clever way to build the FA table in O(m|Σ|) using the prefix function. The insight: δ(q, c) = δ(π[q], c) whenever P[q] ≠ c. In other words, if the next character does not match the expected pattern character, the FA "falls back" to the same state that KMP would fall back to, and then tries the transition again. This recursive relationship lets us build δ from π efficiently:

python
def build_fa_fast(P, alphabet):
    """Build DFA in O(m|Σ|) using prefix function."""
    m = len(P)
    pi = compute_prefix(P)
    delta = [{} for _ in range(m + 1)]

    # State 0: only the pattern's first character advances
    for c in alphabet:
        delta[0][c] = 1 if c == P[0] else 0

    # States 1..m: use prefix function for fallback
    for q in range(1, m + 1):
        for c in alphabet:
            if q < m and c == P[q]:
                delta[q][c] = q + 1   # match: advance
            else:
                # mismatch: same as transitioning from state pi[q-1]
                delta[q][c] = delta[pi[q - 1] if q > 0 else 0][c]
    return delta

This is the bridge between FA matching and KMP: they encode the same information in different forms. The FA stores it as an explicit table (fast lookup, more memory). KMP stores it as the π array (less memory, amortized lookup).

Tradeoff: Alphabet Size

For DNA (|Σ| = 4), the table is tiny: 4(m+1) entries. For ASCII (|Σ| = 256), it is 256(m+1). For Unicode (|Σ| = 143,859), it is impractical. This is the main weakness of the FA approach — it scales poorly with alphabet size. KMP (next chapter) solves this by encoding the same information more compactly.

Finite Automaton Matching

Watch the DFA process text character by character. The current state is highlighted in orange. State transitions are animated. When the accept state (5) is reached, the match is recorded in teal.

State: 0 | Character: — | Matches: 0

Preprocessing vs. matching tradeoff. The FA approach spends O(m|Σ|) time and space building the table, then O(n) matching. KMP (next chapter) achieves the same O(n) matching with only O(m) preprocessing — by cleverly avoiding the need to store transitions for every alphabet character. The key insight: most transitions go to state 0 or follow a simple rule. KMP encodes only the "interesting" transitions (the failure links).

For pattern "aab" over alphabet {a, b}, what is δ(2, 'a')? (State 2 means "aa" has been matched.)

0 — start over, no prefix of "aab" is a suffix of "aaa" 2 — "aaa" ends with "aa" which is P[0..1], so we stay at state 2 3 — "aab" is fully matched 1 — only one 'a' carries over

Chapter 3: Knuth-Morris-Pratt (KMP)

The finite automaton approach works beautifully, but building the transition table costs O(m|Σ|). KMP achieves the same O(n) matching with only O(m) preprocessing. It is the algorithm most interviewers expect you to know by heart. The secret weapon: the prefix function (also called the failure function).

The Key Insight

When a mismatch occurs in the naive algorithm after matching q characters, we throw away all that progress and start over. The FA approach precomputes where to go for every (state, character) pair. KMP takes a middle path: it only needs to know, for each position q in the pattern, what is the longest proper prefix of P[0..q-1] that is also a suffix of P[0..q-1].

Let us unpack that sentence with an example. Pattern P = "ababc". Suppose we have matched "abab" (q = 4) and the next text character is 'a' instead of the expected 'c'. Here is what we know:

// We matched T[s..s+3] = P[0..3] = "abab"
// The text at positions s..s+3 is "a b a b"
// Look at P[0..3] = "abab":
// Prefix "ab" (length 2) equals suffix "ab" (last 2 chars)
// So the last 2 characters of what we matched ("ab")
// are the same as the first 2 characters of P!

// This means: T[s+2..s+3] = P[0..1] = "ab"
// We can shift the pattern forward to align P[0..1]
// with T[s+2..s+3] and continue comparing from P[2].
// We skip 2 characters of comparison work!

The prefix function π[q] tells us exactly how far to shift. For our example, π[4] = 2 (the prefix "ab" of length 2 equals the suffix "ab" of "abab"). On mismatch at position 4, we set q = π[4-1] = π[3] and continue. No character of the text is ever re-examined.

The prefix function π[q]. For each position q (1 ≤ q ≤ m), π[q] = length of the longest proper prefix of P[0..q-1] that is also a suffix of P[0..q-1]. "Proper" means it cannot be the entire string (otherwise π[q] = q always, which is useless). π[1] = 0 always — a single character has no proper prefix that is also a suffix. This array encodes ALL the information KMP needs.

Worked Example: Computing π

Pattern P = "ababaca" (m = 7). Let us compute π for every position:

P = a b a b a c a
pos= 0 1 2 3 4 5 6
q = 1 2 3 4 5 6 7

// π[1]: P[0..0] = "a" — no proper prefix = suffix → 0
// (A single character has nothing shorter that matches.)
π[1] = 0

// π[2]: P[0..1] = "ab"
// Proper prefixes: "a". Suffixes: "b". "a" ≠ "b" → 0
π[2] = 0

// π[3]: P[0..2] = "aba"
// Proper prefixes: "a", "ab". Suffixes: "a", "ba".
// "a" = "a" ✓ (length 1)
π[3] = 1

// π[4]: P[0..3] = "abab"
// Proper prefixes: "a","ab","aba". Suffixes: "b","ab","bab".
// "ab" = "ab" ✓ (length 2)
π[4] = 2

// π[5]: P[0..4] = "ababa"
// "aba" = "aba" ✓ (length 3)
π[5] = 3

// π[6]: P[0..5] = "ababac"
// Check: "ababa"≠"babac", "abab"≠"abac", "aba"≠"bac",
// "ab"≠"ac", "a"≠"c" → none match
π[6] = 0

// π[7]: P[0..6] = "ababaca"
// "a" = "a" ✓ (length 1)
π[7] = 1

π = [0, 0, 1, 2, 3, 0, 1]

How KMP Uses π

The matching algorithm maintains two indices: i (position in text, never decreases) and q (number of pattern characters matched so far). At each step:

If P[q] == T[i], advance both: q++, i++.
If P[q] != T[i] and q > 0, fall back: q = π[q-1]. Do NOT advance i. The text character will be compared again with the new (smaller) q.
If P[q] != T[i] and q == 0, just advance i (nothing to fall back to).
If q == m, record a match at position i - m, then fall back: q = π[q-1].

python
def kmp_match(T, P):
    """Find all occurrences of P in T using KMP."""
    n, m = len(T), len(P)
    pi = compute_prefix(P)  # O(m) — see Chapter 4
    q = 0                      # characters matched so far
    matches = []

    for i in range(n):
        # Fall back until we find a match or reach 0
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]       # use prefix function to skip
        if P[q] == T[i]:
            q += 1              # extend the match
        if q == m:              # full match found
            matches.append(i - m + 1)
            q = pi[q - 1]       # continue searching for overlapping matches
    return matches

Hand-Tracing KMP

T = "ababababcab", P = "ababc", π = [0, 0, 1, 2, 0].

// i=0: T[0]='a', P[0]='a' match → q=1
// i=1: T[1]='b', P[1]='b' match → q=2
// i=2: T[2]='a', P[2]='a' match → q=3
// i=3: T[3]='b', P[3]='b' match → q=4
// i=4: T[4]='a', P[4]='c' MISMATCH ✗
// q = π[3] = 2 (fall back: we know "ab" is still matched)
// P[2]='a', T[4]='a' match → q=3
// i=5: T[5]='b', P[3]='b' match → q=4
// i=6: T[6]='a', P[4]='c' MISMATCH ✗
// q = π[3] = 2
// P[2]='a', T[6]='a' match → q=3
// i=7: T[7]='b', P[3]='b' match → q=4
// i=8: T[8]='c', P[4]='c' match → q=5 = m → MATCH at position 4!
// q = π[4] = 0 (fall back for overlapping matches)
// i=9: T[9]='a', P[0]='a' match → q=1
// i=10: T[10]='b', P[1]='b' match → q=2
// Done! 1 match at position 4. Total comparisons: ~14

Compare to naive: the naive algorithm would try shifts 0 through 7, and at each shift it might compare up to 5 characters. KMP used about 14 total comparisons. The text pointer i never moved backward — every text character was examined at most twice (once in the for loop, potentially once more via a fallback that was "paid for" by a previous match extension).

Second Worked Example: The Worst Case for Naive

T = "aaaaaab", P = "aaab", π = [0, 1, 2, 0].

// i=0: T[0]='a', P[0]='a' match → q=1
// i=1: T[1]='a', P[1]='a' match → q=2
// i=2: T[2]='a', P[2]='a' match → q=3
// i=3: T[3]='a', P[3]='b' MISMATCH ✗
// q = π[2] = 2 (we still know "aa" is matched)
// P[2]='a', T[3]='a' match → q=3
// i=4: T[4]='a', P[3]='b' MISMATCH ✗
// q = π[2] = 2
// P[2]='a', T[4]='a' match → q=3
// i=5: T[5]='a', P[3]='b' MISMATCH ✗
// q = π[2] = 2
// P[2]='a', T[5]='a' match → q=3
// i=6: T[6]='b', P[3]='b' match → q=4 = m
// MATCH at position 3!
// q = π[3] = 0

// Total: 10 comparisons (7 from for loop + 3 fallbacks)
// Naive would do: 4 shifts × 4 comparisons = 16 comparisons
// On longer all-a strings, the ratio gets much worse

Notice the pattern: at each mismatch, KMP falls back by one position (from q=3 to q=2) and immediately re-matches (q=2 to q=3). It still advances the text pointer i each time. The naive algorithm, by contrast, would slide back to shift s+1 and re-compare all 3 a's from scratch at every shift.

KMP Matching in Action

Watch KMP match pattern "ababc" against text. When a mismatch occurs, the pattern shifts using π — notice how text characters are never re-examined. Purple shows the fallback. Compare the comparison count to naive.

Text pos: 0 | Pattern pos: 0 | Comparisons: 0 | Matches: 0

Why KMP is O(n + m)

The proof is elegant and uses an amortized argument. Define a potential Φ = q (the current number of matched characters). At each step of the for loop:

i increments by 1 (always)
q increases by at most 1 (when P[q] == T[i])
Each while-loop iteration strictly decreases q

Since q starts at 0 and can never go negative, the total number of q-increases over the entire algorithm is at most n (one per for-loop iteration). Each while-loop iteration pays for one q-decrease, and total q-decreases cannot exceed total q-increases. So the while loop runs at most n times total across all iterations of the for loop.

Total work: n iterations of the for loop + at most n while-loop iterations + O(m) to build π = O(n + m).

The amortized argument in one sentence. Every while-loop iteration consumes a unit of "potential" (matched characters) that was previously created by a character match. Since we create at most n units total, we consume at most n units total. Therefore the while loop runs at most n times across the entire algorithm, and total work is O(n + m).

How to Memorize KMP

KMP has a reputation for being hard to code from memory. Here is a mental model that makes it stick. There are only two functions, and they have nearly identical structure:

compute_prefix(P)

Matches P against itself. Variable k = matched prefix length. For each position q from 1 to m-1: (1) while k>0 and mismatch, fall back k = pi[k-1]. (2) if match, k++. (3) pi[q] = k.

kmp_match(T, P)

Matches P against T. Variable q = matched prefix length. For each position i from 0 to n-1: (1) while q>0 and mismatch, fall back q = pi[q-1]. (2) if match, q++. (3) if q==m, record match, q = pi[q-1].

The structure is the same: while-fallback, if-extend, store/check. The only differences: (a) compute_prefix stores pi[q], kmp_match checks q==m. (b) kmp_match resets q after a full match. (c) compute_prefix starts at q=1 (skip the trivial base case), kmp_match starts at i=0.

Practice writing both side by side. Once you see they are the same algorithm applied to different inputs (P vs itself, and T vs P), you only need to memorize one template.

KMP vs. the Finite Automaton

KMP and FA matching process text in the same way — one character at a time, left to right, maintaining a "how much have I matched" state. The difference is how they compute the next state:

Aspect	FA Matching	KMP
State representation	Integer q (same)	Integer q (same)
Forward transition	Table lookup: δ[q][c]	Check P[q]==T[i], increment q
Backward transition	Table lookup: δ[q][c]	Follow π chain until match
Preprocessing	O(m\|Σ\|) — build full table	O(m) — build π only
Per-character cost	O(1) always	O(1) amortized

FA matching has O(1) worst case per character (single table lookup), while KMP has O(1) amortized (the while loop might run multiple times for one character, but averages out). In practice, both are extremely fast. KMP wins on preprocessing cost when the alphabet is large.

Pattern P = "aabaa". What is π[5] (the prefix function for the full pattern)?

1 — "a" is both a prefix and suffix 2 — "aa" is a proper prefix of "aabaa" AND a suffix of "aabaa" 3 — "aab" is a prefix and suffix 0 — no proper prefix equals a suffix

Chapter 4: Building the Prefix Function

The prefix function is the engine that powers KMP. Computing it is the subtle part — and beautifully, the algorithm to compute π is itself a form of string matching. It matches the pattern against itself.

The Core Observation

π[0] = 0 always (a single character has no proper prefix = suffix). For each position q from 1 to m-1, we want the longest k such that P[0..k-1] = P[q-k+1..q].

Here is the key: π[q] can be at most π[q-1] + 1. Why? If we have a prefix-suffix of length k for P[0..q], then P[0..k-2] is a prefix-suffix of P[0..q-1] with length k-1. So k-1 ≤ π[q-1], meaning k ≤ π[q-1] + 1.

This means: to find π[q], first check if P[π[q-1]] == P[q]. If yes, π[q] = π[q-1] + 1. If not, we need to try shorter prefix-suffixes. Where? Follow the chain: try k = π[π[q-1]-1], check if P[k] == P[q]. If not, try k = π[k-1], and so on, until we find a match or k reaches 0.

The Algorithm

python
def compute_prefix(P):
    """Compute the prefix function for pattern P."""
    m = len(P)
    pi = [0] * m
    k = 0                        # length of current prefix-suffix

    for q in range(1, m):
        # Follow the failure chain until match or k=0
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]         # fall back
        if P[k] == P[q]:
            k += 1                # extend
        pi[q] = k
    return pi

Look at this code. It is almost identical to the KMP matching code. That is not a coincidence. Computing π IS matching P against itself. The variable k plays the role of the matched-character count, and q plays the role of the text index. The same amortized argument applies: k increases at most m-1 times, and each while-loop iteration decreases k, so total work is O(m).

Step-by-Step Trace

Let us trace compute_prefix for P = "aabaaab" (m = 7):

P: a a b a a a b
0 1 2 3 4 5 6

// Initialize: pi = [0, ?, ?, ?, ?, ?, ?], k = 0

// q=1: P[k]=P[0]='a', P[q]=P[1]='a'
// 'a' == 'a' ✓ → k=1, pi[1]=1
// "aa" has proper prefix "a" = suffix "a"

// q=2: P[k]=P[1]='a', P[q]=P[2]='b'
// 'a' != 'b' ✗ → while loop: k=pi[0]=0
// P[0]='a' != P[2]='b' ✗ → pi[2]=0
// "aab" has no proper prefix = suffix

// q=3: P[k]=P[0]='a', P[q]=P[3]='a'
// 'a' == 'a' ✓ → k=1, pi[3]=1
// "aaba" has prefix "a" = suffix "a"

// q=4: P[k]=P[1]='a', P[q]=P[4]='a'
// 'a' == 'a' ✓ → k=2, pi[4]=2
// "aabaa" has prefix "aa" = suffix "aa"

// q=5: P[k]=P[2]='b', P[q]=P[5]='a'
// 'b' != 'a' ✗ → while loop: k=pi[1]=1
// P[1]='a', P[5]='a' ✓ → k=2, pi[5]=2
// "aabaaa" — verify: prefix "aa" = suffix "aa" ✓

// q=6: P[k]=P[2]='b', P[q]=P[6]='b'
// 'b' == 'b' ✓ → k=3, pi[6]=3
// "aabaaab" has prefix "aab" = suffix "aab" ✓

π = [0, 1, 0, 1, 2, 2, 3]

The critical step is q=5. We tried k=2 (trying to extend the prefix-suffix of length 2 from position 4), but P[2]='b' ≠ P[5]='a'. So we followed the failure chain to k = π[1] = 1. Now we check P[1]='a' vs P[5]='a' — match! So π[5] = 2. The failure chain found a shorter prefix-suffix that we could extend.

Why follow the chain? When P[k] ≠ P[q], we know that P[0..k-1] matches P[q-k+1..q-1] (from the definition of π). We want the next-longest prefix of P[0..k-1] that is also a suffix — which is exactly π[k-1]. This gives us the next candidate. We keep following the chain until we find a character match or exhaust all possibilities (k = 0). Each step strictly shortens the candidate, guaranteeing termination.

Prefix Function Builder

Step through the prefix function computation for "aabaaab". Purple highlights show the failure chain. The top row shows the pattern, the bottom row shows π values as they are computed.

Position: 0 | k: 0 | π: [—]

The O(m) Proof — Why It Is Not O(m²)

At first glance, the while loop inside the for loop looks dangerous. The for loop runs m-1 times. The while loop could run multiple times per iteration. Is this O(m²)?

No. The amortized argument is identical to the one for KMP matching. Think of k as a bank account:

k starts at 0.
The line k += 1 executes at most once per for-loop iteration (m-1 times total). Each execution deposits one unit in the bank.
Each while-loop iteration k = pi[k-1] strictly decreases k by at least 1. Each execution withdraws at least one unit from the bank.
k never goes below 0 (the while-loop checks k > 0).

Deposits: at most m-1. Withdrawals: at most m-1 (cannot withdraw more than deposited). Total while-loop iterations across the entire for loop: at most m-1. Total work: O(m).

A Second Example: Pattern "aabaab"

Let us compute π for a pattern with a deeper failure chain.

P: a a b a a b
0 1 2 3 4 5

π[0] = 0
π[1] = 1 // "aa": prefix "a" = suffix "a"
π[2] = 0 // "aab": no prefix = suffix
π[3] = 1 // "aaba": "a" = "a"
π[4] = 2 // "aabaa": "aa" = "aa"
π[5] = 3 // "aabaab": "aab" = "aab" ✓

π = [0, 1, 0, 1, 2, 3]

// Period = 6 - 3 = 3, and 6 mod 3 = 0
// So "aabaab" = "aab" repeated twice ✓

Notice: π[5] = 3 means "aabaab" has a 3-character prefix "aab" that equals its 3-character suffix "aab". This immediately tells us the string has period 3. This is the power of the prefix function — structural information about the string falls out for free.

The Prefix Function and String Periodicity

The prefix function encodes deep structure about the pattern. One powerful consequence: it reveals the period of the string.

A string S of length n has period p if S[i] = S[i + p] for all valid i. The shortest period of S is n - π[n-1] (using 0-indexed π). If n is divisible by this period, the string is a perfect repetition of the period.

// Example: S = "abcabcabc", n=9
// π = [0, 0, 0, 1, 2, 3, 4, 5, 6]
// π[8] = 6, period = 9 - 6 = 3
// 9 mod 3 = 0 → "abc" repeats 3 times ✓

// Example: S = "abcabcab", n=8
// π = [0, 0, 0, 1, 2, 3, 4, 5]
// π[7] = 5, period = 8 - 5 = 3
// 8 mod 3 = 2 ≠ 0 → not a perfect repetition
// But the period is still 3: "abcabcab" is "abc" repeated 2.67 times

This property is the basis of interview problems like "find the shortest repeating unit of a string" — a one-liner once you know the prefix function.

For pattern "abcabc", what is the prefix function array?

[0, 0, 0, 1, 2, 3] — "abc" prefix matches "abc" suffix at position 6 [0, 0, 0, 0, 0, 0] — no repeated structure [0, 1, 0, 0, 1, 0] [0, 0, 1, 0, 0, 1]

Chapter 5: String Matching Showcase

Now the payoff. Enter any text and pattern below, and watch all four algorithms race side by side. Each one counts its character comparisons. You will see exactly where KMP and the FA approach win: on patterns with repeated prefixes that cause the naive algorithm to thrash.

Algorithm Race — All Four Side by Side

Enter text and pattern, then click Run Race. Try the suggested inputs below to see dramatic differences.

Text:

Pattern:

What to Look For

Input	Pattern	What Happens
`aaaaaaaaaaaaaab`	`aaab`	Naive uses ~44 comparisons. KMP uses ~18. The pattern has a long repeated prefix ("aaa") that naive re-examines at every shift. KMP's π = [0,1,2,0] lets it skip ahead efficiently.
`abcxabcxabcxabcy`	`abcxabcy`	The pattern nearly matches twice before succeeding. KMP's prefix function encodes the "abcx" overlap and skips half the work.
`hello world hello world`	`world`	All algorithms perform similarly — "world" has no repeated prefixes, so there is little structure to exploit. The first character 'w' mismatches immediately at most positions.
`31415926535897932`	`2653`	Short pattern, numeric text. Rabin-Karp excels here — the rolling hash eliminates most windows in O(1) each.

The big takeaway. For non-repetitive patterns (like English words), all four algorithms perform about n comparisons. The differences emerge on patterns with repeated structure ("aaab", "ababc", "abcxabcy") which trigger worst-case behavior in naive. KMP and FA are never worse than 2n comparisons. Rabin-Karp is usually about n but can degrade. In practice: KMP for guaranteed performance, Rabin-Karp for simplicity and multi-pattern, Boyer-Moore for large alphabets (see Chapter 6).

Why the Worst Case Matters

You might wonder: "When does the worst case actually happen? Is it just a theoretical curiosity?" No. The worst case occurs in real applications:

DNA sequences. With only 4 bases (A, C, G, T), repeated patterns are common. "AAAAT" in "AAAAAAA...A" is a real search that occurs when looking for poly-A tails. The naive algorithm degrades to O(nm) here.
Binary data. Searching through files that contain long runs of zeros (common in disk images, memory dumps) with a pattern like "\x00\x00\x00\x01". Naive is terrible; KMP handles it gracefully.
Adversarial inputs. If an attacker can control the text or pattern (e.g., regex denial of service), they will craft inputs that trigger worst-case behavior. Using an algorithm with guaranteed O(n+m) is a security measure.

Comparison Summary

The interactive counter below lets you see the ratio between naive and KMP on worst-case inputs (all 'a's with pattern ending in 'b'). Drag the sliders and watch the bars diverge.

Interactive Comparison Counter

Adjust the text length and pattern repetitiveness to see how comparison counts scale. The pattern is repeated 'a's followed by 'b'. Notice: as the pattern gets longer, the naive/KMP ratio grows linearly — naive does O(nm) work while KMP stays at O(n+m).

Text length 30

Pattern 'a's before 'b' 3

When to Use Which Algorithm — A Practical Guide

After seeing all four algorithms race, here are the practical guidelines:

Use Naive When...

The text is very short (under 1000 characters), or you are prototyping and do not want to implement preprocessing. The naive algorithm has zero overhead and is cache-friendly. For short texts and non-repetitive patterns, it is actually competitive.

Use KMP When...

You need guaranteed O(n+m) performance. You are processing untrusted input (adversarial patterns or text). You are in a streaming scenario where text arrives character by character. KMP never backtracks in the text.

Use Rabin-Karp When...

You need to search for multiple patterns simultaneously. You need 2D pattern matching. You are doing plagiarism detection or fingerprinting. The implementation is simpler than KMP and usually fast enough in practice.

Use Boyer-Moore When...

You are searching through large files of natural language text (English, code, logs). The pattern is relatively long (5+ characters) and the alphabet is large. Boyer-Moore can be 2-5x faster than KMP in practice due to sublinear scanning.

Chapter 6: Beyond CLRS

CLRS covers four algorithms (naive, Rabin-Karp, FA, KMP). But the string matching story is far richer. Two important extensions deserve your attention.

Boyer-Moore: Matching Backwards, Skipping Big

Boyer-Moore is often the fastest string matching algorithm in practice, even though its worst case is O(nm). The key idea: compare characters right to left within the pattern window. When a mismatch occurs on a character that does not appear in the pattern at all, skip the entire pattern length forward.

Consider searching for "EXAMPLE" (length 7) in a long text. We align the pattern with the text and compare the last character of the pattern ('E') with the text character at that position. If the text character is 'Z' — a character that does not appear anywhere in "EXAMPLE" — then no shift of the pattern that includes this text position can possibly match. We jump forward by 7 positions in one shot.

Boyer-Moore uses two heuristics to determine the shift amount:

Bad Character Rule

When a mismatch occurs at pattern position j with text character c, find the rightmost occurrence of c in P[0..j-1]. Shift the pattern so that occurrence aligns with the text position. If c does not appear at all, shift past the entire window. This is the "big skip" heuristic — it can skip m positions in one step.

Good Suffix Rule

When a mismatch occurs after matching a suffix of the pattern, shift the pattern so that a different occurrence of that suffix (or a matching prefix) aligns. This is the right-to-left analog of KMP's prefix function. It prevents Boyer-Moore from missing matches when the bad character rule gives a small shift.

In practice, Boyer-Moore with just the bad character rule is already very fast on English text (alphabet of ~70 common characters). Average case is sublinear: O(n/m), meaning it examines only a fraction of the text characters. The larger the alphabet and the longer the pattern, the bigger the skips.

Boyer-Moore Bad Character Rule

Watch Boyer-Moore compare right-to-left and make big jumps when the mismatched character is not in the pattern. Notice how few comparisons it needs.

Comparisons: 0 | Shift: — | Matches: 0

Boyer-Moore Worked Example

T = "here is a simple example", P = "example" (m=7). The bad character table for P:

// Last occurrence of each character in P:
'e': 6 (last position in "example")
'x': 1
'a': 2
'm': 3
'p': 4
'l': 5

// Alignment 1: compare P[6]='e' vs T[6]='a'
// 'a' appears in P at position 2. Shift by 6-2 = 4.
// Alignment 2 (shift 4): compare P[6]='e' vs T[10]='i'
// 'i' not in P at all! Shift by 7 (full pattern length).
// Alignment 3 (shift 11): compare P[6]='e' vs T[17]='e' ✓
// P[5]='l' vs T[16]='l' ✓
// P[4]='p' vs T[15]='p' ✓
// ... all match! FOUND at position 11.

// Total: only ~10 comparisons for a 24-character text.
// Naive would use ~50+ comparisons.
// The big skip at alignment 2 (shift by 7!) is the key.

This example shows Boyer-Moore's superpower: on natural text with diverse characters, most mismatches involve characters not in the pattern, causing full-length skips. The algorithm often examines only n/m characters — sublinear!

Boyer-Moore: The Bad Character Table

The bad character table is simple to build. For each character c in the alphabet, store the rightmost position of c in the pattern. Characters not in the pattern get value -1 (meaning "shift past the whole window"). Here is the code:

python
def boyer_moore(T, P):
    """Find all occurrences of P in T using Boyer-Moore (bad char only)."""
    n, m = len(T), len(P)
    # Build bad character table
    bad_char = {}
    for i in range(m):
        bad_char[P[i]] = i     # rightmost occurrence

    matches = []
    s = 0                        # current shift
    while s <= n - m:
        j = m - 1                # start comparing from the right
        while j >= 0 and P[j] == T[s + j]:
            j -= 1
        if j < 0:                # full match
            matches.append(s)
            s += m               # shift past the match (simplified)
        else:
            bc = bad_char.get(T[s + j], -1)
            s += max(1, j - bc)   # shift by bad character rule
    return matches

The shift amount j - bc aligns the rightmost occurrence of the mismatched text character with the current position. If bc > j (the character appears to the right of j in the pattern), we use max(1, ...) to avoid shifting backwards. This simple version omits the good suffix rule but is already very effective in practice.

Aho-Corasick: Multiple Patterns in One Pass

All four CLRS algorithms search for a single pattern. What if you need to find all occurrences of k patterns simultaneously? Running KMP once per pattern costs O(k(n + m_avg)). Aho-Corasick does it in O(n + total pattern length + number of matches) — one pass through the text, regardless of k.

The idea: build a trie of all patterns. Then add "failure links" connecting each trie node to the longest proper suffix that is also a prefix of some pattern (generalizing KMP's prefix function from a single string to a tree of strings). Process the text through this augmented trie. At each text character, follow the trie forward if possible, or use the failure link to find the next viable match position.

Applications: virus scanners (match thousands of malware signatures in each file), bibliographic search (find all keyword occurrences in a document), DNA motif discovery (search for thousands of known gene sequences in a genome).

Suffix Structures: Preprocess the Text

All our algorithms preprocess the pattern and then scan the text. But what if you need to search the same text for many different patterns? It is better to preprocess the text once and answer pattern queries quickly.

A suffix array sorts all suffixes of the text lexicographically. Binary search finds any pattern in O(m log n). A suffix tree (Ukkonen's algorithm) can be built in O(n) and finds any pattern in O(m). These are the backbone of bioinformatics — genome databases index 3-billion-character sequences into suffix arrays so that gene lookups take milliseconds.

The Complete Algorithm Comparison

Algorithm	Preprocess	Match	Best For
Naive	None	O(nm)	Simple, short text, prototyping
Rabin-Karp	O(m)	Expected O(n)	Multiple patterns, 2D matching, plagiarism
FA	O(m\|Σ\|)	O(n)	Small alphabets (DNA), hardware implementation
KMP	O(m)	O(n)	Single pattern, guaranteed O(n+m)
Boyer-Moore	O(m + \|Σ\|)	O(n/m) best, O(nm) worst	Large alphabets, English text, grep
Aho-Corasick	O(sum of patterns)	O(n + matches)	Multiple simultaneous patterns
Suffix Array	O(n)	O(m log n) per query	Many queries on fixed text
Suffix Tree	O(n)	O(m) per query	Many queries, largest common substring

Boyer-Moore compares characters within the pattern window in which direction?

Left to right, same as KMP and naive Right to left — this enables the "bad character" big-skip heuristic From the middle outward (galloping search) Random order for expected O(1) verification

Chapter 7: Applications

String matching is everywhere. You use it dozens of times a day without thinking about it. Here are the major application domains, each with the algorithm it typically uses and why that choice makes sense.

Text Editors — Ctrl+F

When you press Ctrl+F in VS Code or Sublime Text, the editor runs a string matching algorithm on the visible buffer. Most editors use a variant of Boyer-Moore because it is fastest on natural language text. English text has a large effective alphabet (~70 characters), and patterns rarely have repeated structure. Boyer-Moore's bad character rule skips huge chunks when the search term contains uncommon characters. Searching for "quantum" in a 10,000-line file takes far fewer than 10,000 · 7 comparisons — more like 10,000 / 7 ≈ 1,400 in favorable cases.

grep and Regular Expressions

The classic Unix grep is actually two tools in one. When you use grep -F "pattern" (fixed string), GNU grep runs Boyer-Moore. When you use grep "pattern.*regex" (regular expression), it builds a Thompson NFA — a generalization of our finite automaton approach to handle wildcards, character classes, repetition, and alternation. The NFA is simulated directly (without converting to DFA) to avoid the potential O(2^m) state explosion that DFA construction can cause on some regexes.

This is why grep -F is faster than grep for literal strings — it uses a specialized algorithm instead of the general regex engine.

DNA Sequence Alignment

A human genome has approximately 3.2 billion base pairs (characters). Searching for a gene sequence (a pattern of 20-200 bases) requires extreme efficiency. The alphabet is small: Σ = {A, C, G, T}, |Σ| = 4. This makes the FA approach viable (the transition table has only 4 columns).

In practice, tools like BLAST (Basic Local Alignment Search Tool) use a seed-and-extend approach: find exact matches of short seeds (11-mers) using hash tables (like Rabin-Karp), then extend promising seeds using dynamic programming (Smith-Waterman). Modern tools like BWA and Bowtie use the Burrows-Wheeler Transform (a close relative of suffix arrays) to index the entire genome and answer pattern queries in milliseconds.

Plagiarism Detection

Given a student's essay and a database of source documents, find all matching substrings above a threshold length. This is naturally a multi-pattern, multi-document problem. Rabin-Karp is the algorithm of choice: hash every m-gram (overlapping window of m words) in both the submission and the source documents. Store source hashes in a hash table. For each submission hash, check for matches in O(1). Systems like MOSS (Measure of Software Similarity) and Turnitin use variants of this approach.

Network Intrusion Detection

Firewalls and intrusion detection systems (Snort, Suricata) scan every network packet for thousands of known attack signatures simultaneously. Each packet is 1-1500 bytes. The signature database contains 10,000+ patterns. This is the classic Aho-Corasick use case: build a trie of all attack patterns, process each packet through the trie in a single pass. Finding any matching pattern triggers an alert. At 10 Gbps, the system must process about 800,000 packets per second — there is no time to run KMP 10,000 times per packet.

Search Engines

Google does not do string matching on the raw web. It builds an inverted index: for each word, store a list of documents containing that word. A query like "string matching algorithm" is decomposed into three word lookups in the index, and the results are intersected. But string matching still appears in snippet generation (highlighting query terms in results), URL matching, and autocomplete.

Compilers and Lexers

When a compiler reads your source code, it breaks the text into tokens (keywords, identifiers, numbers, operators). This lexical analysis is a multi-pattern matching problem: "if", "else", "for", "while", "return" etc. The lexer builds a DFA from all keyword patterns (using Aho-Corasick or explicit NFA-to-DFA conversion) and processes the source code in a single pass. Tools like Flex (Fast Lexical Analyzer) automate this — you specify the patterns, and Flex generates the DFA as C code.

Data Flow Summary

Let us be concrete about what goes in and what comes out of each algorithm. This is what you need to implement them.

Input

Text T (string, length n) + Pattern P (string, length m). Both are arrays of characters.

↓

Preprocess

KMP: π[0..m-1] (int array, length m)
FA: δ[0..m][c] (2D table, (m+1) × |Σ| ints)
RK: p_hash (single int), h = d^m-1 mod q (single int)

↓

Match

Single left-to-right scan of T. Output: list of shift positions (ints) where P occurs.

↓

Output

matches = [s₀, s₁, ..., s_k] where T[s_i..s_i+m-1] = P

Application: DNA Motif Finder

Search a random DNA sequence for a motif using KMP. Enter a motif (e.g., "GATTACA") and click Search. The algorithm highlights all occurrences. Click "New DNA" to generate a fresh sequence.

Motif:

Click "New DNA" to generate a sequence, then "Search" to find the motif.

A virus scanner needs to match 50,000 known malware signatures against every file on disk. Each file is scanned once. Which algorithm family is best suited?

Naive — simplest to implement for each signature KMP — guaranteed O(n+m) per pattern Aho-Corasick — processes ALL 50,000 patterns in one pass through the file Boyer-Moore — fastest single-pattern algorithm

Chapter 8: Interview Arsenal

String matching problems appear regularly in coding interviews. You are unlikely to be asked to implement KMP from scratch (it is hard to get right under pressure), but you MUST understand the prefix function — it appears in problems about repeated substrings, string periodicity, and pattern matching. Here is your battle kit.

The Five Interview Dimensions

Dimension	What They Test	Example Question
CONCEPT	Do you understand the prefix function?	"Why is KMP O(n+m)? Walk through the amortized argument."
DESIGN	Can you pick the right algorithm?	"Design a system to detect plagiarism across 10M documents."
CODE	Can you implement from memory?	"Implement KMP in 15 minutes." or "Implement Rabin-Karp."
DEBUG	Can you find bugs in string code?	"This Rabin-Karp has a bug — find it." (Missing mod, off-by-one, negative hash)
FRONTIER	Do you know the broader landscape?	"How does grep work? What is Aho-Corasick? When would you use a suffix array?"

Decision Framework

How many patterns?

1 → KMP (guaranteed) or Boyer-Moore (fast in practice)

↓ Many patterns

Scan text once?

Yes → Aho-Corasick (one pass for all patterns)

↓ Many queries on same text

Preprocess text?

Yes → Suffix array (O(m log n) per query) or suffix tree (O(m) per query)

↓ Special case

Approximate / fuzzy?

Edit distance DP (O(nm)) or BK-tree for dictionary search

Drill 1: Implement KMP

This is the most important drill. You should be able to write both compute_prefix and kmp_match from memory in under 10 minutes. Practice until the code flows naturally.

python
def kmp(T, P):
    """Find all occurrences of P in T. O(n+m)."""
    # Step 1: Build prefix function
    m = len(P)
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and P[k] != P[q]:
            k = pi[k - 1]
        if P[k] == P[q]:
            k += 1
        pi[q] = k

    # Step 2: Match
    q = 0
    result = []
    for i in range(len(T)):
        while q > 0 and P[q] != T[i]:
            q = pi[q - 1]
        if P[q] == T[i]:
            q += 1
        if q == m:
            result.append(i - m + 1)
            q = pi[q - 1]
    return result

Drill 2: Implement Rabin-Karp

The tricky parts: (1) computing d^m-1 mod q efficiently, (2) handling negative modular arithmetic, (3) remembering to verify on hash match.

python
def rabin_karp(T, P):
    """Find all occurrences of P in T. Expected O(n+m)."""
    n, m, d, q = len(T), len(P), 256, 1000000007
    h = pow(d, m-1, q)
    ph = th = 0
    for i in range(m):
        ph = (d*ph + ord(P[i])) % q
        th = (d*th + ord(T[i])) % q
    result = []
    for s in range(n-m+1):
        if ph == th and T[s:s+m] == P:
            result.append(s)
        if s < n-m:
            th = (d*(th - ord(T[s])*h) + ord(T[s+m])) % q
    return result

Drill 3: Find All Anagram Occurrences

Given T and P, find all positions where an anagram (permutation) of P appears in T. This is not string matching per se — it is a sliding window with character frequency comparison. But it appears in interviews alongside string matching problems.

python
from collections import Counter

def find_anagrams(T, P):
    """Find all positions where an anagram of P appears in T. O(n)."""
    m, n = len(P), len(T)
    if m > n: return []
    p_count = Counter(P)
    w_count = Counter(T[:m])
    result = []
    if w_count == p_count:
        result.append(0)
    for i in range(m, n):
        w_count[T[i]] += 1          # add incoming character
        out = T[i - m]
        w_count[out] -= 1          # remove outgoing character
        if w_count[out] == 0:
            del w_count[out]       # clean up zeros for equality check
        if w_count == p_count:
            result.append(i - m + 1)
    return result

Drill 4: Shortest Repeating Unit

Given string S, find the shortest string R such that S is R repeated k times (or determine that no such R exists). This is a one-liner with the prefix function.

python
def shortest_period(S):
    """Return shortest repeating unit of S, or S itself if none."""
    pi = compute_prefix(S)     # from KMP
    n = len(S)
    period = n - pi[n - 1]       # candidate period length
    if n % period == 0:
        return S[:period]       # perfect repetition
    return S                    # no repeating unit

Drill 5: Repeated String Match

Given strings A and B, find the minimum number of times A must be repeated such that B is a substring of the repeated string. Example: A = "abc", B = "cabcabca" — answer is 4 ("abcabcabcabc" contains "cabcabca").

python
def repeated_match(A, B):
    """Min repetitions of A to contain B as substring. -1 if impossible."""
    # Need at least ceil(len(B)/len(A)) copies, plus at most 1 more
    reps = -(-len(B) // len(A))  # ceiling division
    for k in [reps, reps + 1]:
        if B in A * k:              # Python's `in` uses fast string matching
            return k
    return -1

Drill 6: Longest Happy Prefix (LeetCode 1392)

Given a string S, find the longest prefix which is also a suffix (not equal to S). This is literally π[n-1]. The "happy prefix" name is just LeetCode branding — the underlying structure is the prefix function.

python
def longest_happy_prefix(s):
    pi = compute_prefix(s)
    return s[:pi[-1]]  # the prefix of length pi[n-1]

# "ababab" → pi = [0,0,1,2,3,4] → pi[5]=4 → "abab"
# "a" → pi = [0] → pi[0]=0 → ""

Common Interview Questions — Quick Reference

Problem	Key Technique	Complexity
Find all occurrences of P in T	KMP	O(n + m)
Find all anagram occurrences	Sliding window + char count	O(n)
Shortest repeating unit	Prefix function: period = n - π[n-1]	O(n)
Repeated string match	KMP on repeated string	O(n + m)
Longest prefix = suffix	π[n-1]	O(n)
Check if B is rotation of A	Check if B in A+A (KMP)	O(n)
Count distinct substrings	Suffix array + LCP array	O(n log n)
Longest common substring	Suffix array + LCP, or DP O(nm)	O(n log n) or O(nm)
Edit distance (Levenshtein)	2D DP	O(nm)
Wildcard pattern matching	DP or FFT-based	O(nm) or O(n log n)

The rotation trick. To check if string B is a rotation of string A (same length), concatenate A with itself: A+A. B is a rotation of A if and only if B appears as a substring of A+A. Use KMP to check in O(n). Example: A="abcde", A+A="abcdeabcde". Is "cdeab" a rotation? Check if "cdeab" is in "abcdeabcde" — yes! This trick appears frequently in interviews.

The prefix function is your secret weapon. It appears in so many interview problems: longest proper prefix = suffix, string periodicity, repeated pattern detection, Z-algorithm equivalence, and even some DP problems (longest palindromic prefix). If you know compute_prefix cold, you can solve a whole class of problems that would otherwise require ad-hoc reasoning. Practice writing it from memory until it takes under 2 minutes.

String S = "abababab" has prefix function [0, 0, 1, 2, 3, 4, 5, 6]. The length is 8, π[7] = 6. What is the shortest repeating unit of S?

"ab" — period = 8 - 6 = 2, and 8 mod 2 = 0, so S = "ab" repeated 4 times "abab" — half the string "ababab" — π[7] = 6 characters No repeating unit exists

Chapter 9: Connections

String matching connects to several other chapters in CLRS and to broader algorithmic themes. Understanding these connections makes you a stronger algorithm designer.

Hashing (CLRS Ch11)

Rabin-Karp is fundamentally a hashing algorithm applied to string matching. The rolling hash is a special case of polynomial hashing: treating the string as a polynomial in base d, evaluated at d, modulo q. The choice of prime q, the modular arithmetic, and the handling of collisions all draw on hash function theory from Chapter 11.

The connection goes deeper. Universal hashing — choosing hash functions randomly from a family — is what gives Rabin-Karp its expected O(n+m) guarantee. With a randomly chosen prime q, the probability that any particular non-matching window produces a spurious hit is at most m/q, which is negligible for large q.

Finite Automata and Formal Languages (CLRS Ch32.3)

The FA matching algorithm is a special case of DFA simulation. Regular expressions compile to NFAs (Thompson's construction), which can be converted to DFAs (subset construction) for execution. This is how grep, sed, and every regex engine works. The string matching DFA from Chapter 2 is the simplest case: matching a single literal string.

The connection to KMP: the FA transition table can be computed from the prefix function in O(m|Σ|) time, because π encodes the same "fallback" information that the DFA transitions encode. In fact, KMP's matching loop is equivalent to simulating a DFA where non-forward transitions are computed on-the-fly from π instead of stored in a table.

Dynamic Programming (CLRS Ch15)

The prefix function computation has a DP flavor: π[q] depends on π values at smaller indices. But the deeper connection is to approximate string matching: finding the substring of T most similar to P, allowing insertions, deletions, and substitutions.

The edit distance (Levenshtein distance) between two strings is computed using a 2D DP table in O(nm) time. This is the foundation of spell checkers, DNA alignment (Needleman-Wunsch for global, Smith-Waterman for local), diff tools, and autocorrect. When exact matching is too strict, approximate matching with DP fills the gap.

A concrete connection: the edit distance DP recurrence is:

D[i][j] = min(D[i-1][j]+1, D[i][j-1]+1, D[i-1][j-1] + (T[i]≠P[j] ? 1 : 0))

where D[i][j] is the edit distance between T[0..i-1] and P[0..j-1]. The third term (diagonal) is the "substitution" cost. When T[i] = P[j], it is free — exactly like the "match" case in string matching. Edit distance generalizes exact matching: when D[n][m] = 0, the strings are identical.

Number Theory (CLRS Ch31)

Rabin-Karp relies on modular arithmetic with large primes. The probability of a spurious hit depends on the distribution of hash values modulo q, which connects to number-theoretic results about the density of primes and the uniformity of polynomial hash functions over finite fields. Using multiple independent hash functions (different primes) reduces false positive probability exponentially — a technique also used in Bloom filters.

Polynomials and FFT (CLRS Ch30)

There is a surprising connection between string matching and polynomial multiplication. Define indicator polynomials for the text and pattern, then compute their convolution using FFT. The convolution at position s counts the number of matching characters at shift s. This gives an O(n log n) algorithm for string matching — slower than KMP for standard matching, but it generalizes to matching with wildcards, which KMP and FA cannot handle.

Where to Go Next

If you want to learn...	Study...
Approximate matching (fuzzy search)	Edit distance (CLRS Ch15, DP), BK-trees
Regular expression matching	Thompson NFA, NFA-to-DFA subset construction
Genome indexing	Suffix arrays, BWT (Burrows-Wheeler Transform), FM-index
Multiple pattern matching	Aho-Corasick, Commentz-Walter
Compressed pattern matching	Lempel-Ziv factorization, grammar-compressed matching
Wildcard matching	FFT-based convolution (CLRS Ch30)
Streaming pattern matching	Porat-Porat algorithm (O(log m) space)

The Z-Algorithm: A Close Relative of π

The Z-array is an alternative to the prefix function, popular in competitive programming. For a string S of length n, Z[i] = length of the longest substring starting at position i that matches a prefix of S. Z[0] is undefined (or set to n). While π[q] looks at what ends at position q (longest prefix = suffix of P[0..q-1]), Z[i] looks at what starts at position i (longest match with the start of S).

// S = "aabxaab"
// Z = [_, 1, 0, 0, 3, 1, 0]
// Z[1] = 1: "a" matches prefix "a"
// Z[4] = 3: "aab" matches prefix "aab"

To use Z for pattern matching: concatenate P + "$" + T (where "$" is a separator not in either string), compute the Z-array. Any position i > m where Z[i] = m indicates that T[i-m-1..] starts with the full pattern — a match. The Z-algorithm runs in O(n + m) and is often easier to code than KMP in competitions because the Z-array logic is slightly more intuitive (extending a match forward rather than falling back).

python
def z_function(S):
    """Compute Z-array in O(n)."""
    n = len(S)
    Z = [0] * n
    Z[0] = n
    l, r = 0, 0
    for i in range(1, n):
        if i < r:
            Z[i] = min(r - i, Z[i - l])
        while i + Z[i] < n and S[Z[i]] == S[i + Z[i]]:
            Z[i] += 1
        if i + Z[i] > r:
            l, r = i, i + Z[i]
    return Z

def z_match(T, P):
    """Find all occurrences of P in T using Z-algorithm."""
    S = P + "$" + T
    Z = z_function(S)
    m = len(P)
    return [i - m - 1 for i in range(m + 1, len(S)) if Z[i] == m]

The prefix function and Z-array encode equivalent information and can be converted between each other in O(n). They are two views of the same combinatorial structure on strings. In interviews, knowing both gives you flexibility — use whichever one you can code faster under pressure.

The prefix function is the unifying idea. It appears explicitly in KMP, implicitly in FA transitions (which can be built from π), in the Z-algorithm (as the Z-array), in suffix array construction (Kasai's LCP array), in Aho-Corasick (failure links are the multi-pattern generalization of π), and even in data compression (LZ77's longest match computation). Understanding π deeply — not just the code, but WHY following the failure chain works, WHY the amortized analysis gives O(m), and HOW to use it for periodicity detection — is the single most valuable investment you can make in this chapter.

To check if string B is a rotation of string A (both length n), what is the most efficient approach?

Try all n rotations of A and compare each to B: O(n²) Sort both strings and compare: O(n log n) Check if B is a substring of A + A using KMP: O(n) Hash both strings and compare: O(n)

"An algorithm must be seen to be believed."
— Donald Knuth

String Matching