String Searching
String searching (or pattern matching) is the problem of finding the occurrence of a pattern string within a text string. It’s the core logic behind the Ctrl+F feature in your browser and text editors.
1. The Problem
Given:
- Text (
T): A longer string of lengthN. - Pattern (
P): A shorter string of lengthM.
Goal: Return the index of the first occurrence of P in T, or -1 if not found.
2. The Naive Approach (Brute Force)
The simplest way to solve this is to check every possible position in the text.
- Align the pattern with the beginning of the text.
- Compare characters one by one.
- If a mismatch occurs, shift the pattern one step to the right.
- Repeat until a match is found or the end of the text is reached.
Time Complexity
- Best Case:
O(N)(e.g., Pattern found at start). - Worst Case:
O(N * M)(e.g., Text=”AAAA…B”, Pattern=”AAAB”). - Space Complexity:
O(1)(no extra space needed).
3. The Mathematical Anchor: Rabin-Karp & Rolling Hashes
To avoid re-scanning every character, Rabin-Karp uses a mathematical trick: Rolling Hashes.
The Hash Formula
Instead of comparing strings character-by-character, we compare their numerical fingerprint. For a string S of length M, the hash H is: H = (S[0] \cdot BM-1 + S[1] \cdot BM-2 + \dots + S[M-1] \cdot B0) \mod Q (Where B is a base, usually a prime like 31, and Q is a large prime to avoid overflow).
The “Rolling” Part
When we shift the window from [i] to [i+1], we don’t re-calculate the whole hash. We:
- Subtract the value of T[i].
- Multiply by base B.
- Add the value of T[i+M]. This transformation takes O(1) time, allowing us to find matches in O(N) average time.
4. The Mathematical Anchor: KMP & The LPS Array
The Knuth-Morris-Pratt (KMP) algorithm asks: “If I already matched ABAB, and the next char is a mismatch, how much of my progress can I keep?”
The Longest Prefix Suffix (LPS)
We pre-process the pattern to find the longest proper prefix that is also a suffix. P = \text{“ABABC”}
- “A”: 0
- “AB”: 0
- “ABA”: 1 (Prefix “A”, Suffix “A”)
- “ABAB”: 2 (Prefix “AB”, Suffix “AB”)
- “ABABC”: 0
When a mismatch occurs at index j, we don’t start i over. We look at LPS[j-1] and resume from there. This ensures we never backtrack in the text string, guaranteeing O(N) worst-case performance.
5. Interactive Visualizer: Naive Search
Watch how the brute force algorithm slides the pattern window across the text. Notice how it resets the comparison for every shift.
6. Implementation
The implementation is straightforward but serves as a baseline for more advanced algorithms like Knuth-Morris-Pratt (KMP) or Rabin-Karp.
public class NaiveSearch {
public static int search(String text, String pattern) {
int n = text.length();
int m = pattern.length();
// Loop through text
for (int i = 0; i <= n - m; i++) {
int j;
// Loop through pattern
for (j = 0; j < m; j++) {
if (text.charAt(i + j) != pattern.charAt(j)) {
break; // Mismatch found, break inner loop
}
}
// If we reached the end of pattern, we found a match
if (j == m) {
return i;
}
}
return -1; // Not found
}
public static void main(String[] args) {
String text = "ABABDABACDABABCABAB";
String pattern = "ABABCABAB";
System.out.println("Found at index: " + search(text, pattern));
}
}
package main
import "fmt"
func Search(text, pattern string) int {
n := len(text)
m := len(pattern)
// Loop through text
for i := 0; i <= n-m; i++ {
j := 0
// Loop through pattern
for j < m {
if text[i+j] != pattern[j] {
break // Mismatch
}
j++
}
// If j reached m, full match found
if j == m {
return i
}
}
return -1
}
func main() {
text := "ABABDABACDABABCABAB"
pattern := "ABABCABAB"
fmt.Println("Found at index:", Search(text, pattern))
}
Limitations
Notice in the visualization that when a mismatch happens, we shift by just one position. We often re-scan characters we’ve already seen. Advanced algorithms optimize this by shifting more than one step based on the mismatch.
7. Deep Dive Strategy Lab: String Searching
Intuition Through Analogy
Think of this chapter like searching text in a large log stream. The goal is not to memorize a fixed trick, but to repeatedly answer:
- What is the bottleneck?
- Which constraint dominates (time, memory, latency, correctness)?
- Which representation makes the bottleneck easier to eliminate?