String Searching

String searching (or pattern matching) is the problem of finding the occurrence of a pattern string within a text string. It’s the core logic behind the Ctrl+F feature in your browser and text editors.

1. The Problem

Given:

  • Text (T): A longer string of length N.
  • Pattern (P): A shorter string of length M.

Goal: Return the index of the first occurrence of P in T, or -1 if not found.

2. The Naive Approach (Brute Force)

The simplest way to solve this is to check every possible position in the text.

  1. Align the pattern with the beginning of the text.
  2. Compare characters one by one.
  3. If a mismatch occurs, shift the pattern one step to the right.
  4. Repeat until a match is found or the end of the text is reached.

Time Complexity

  • Best Case: O(N) (e.g., Pattern found at start).
  • Worst Case: O(N * M) (e.g., Text=”AAAA…B”, Pattern=”AAAB”).
  • Space Complexity: O(1) (no extra space needed).

3. The Mathematical Anchor: Rabin-Karp & Rolling Hashes

To avoid re-scanning every character, Rabin-Karp uses a mathematical trick: Rolling Hashes.

The Hash Formula

Instead of comparing strings character-by-character, we compare their numerical fingerprint. For a string S of length M, the hash H is: H = (S[0] \cdot BM-1 + S[1] \cdot BM-2 + \dots + S[M-1] \cdot B0) \mod Q (Where B is a base, usually a prime like 31, and Q is a large prime to avoid overflow).

The “Rolling” Part

When we shift the window from [i] to [i+1], we don’t re-calculate the whole hash. We:

  1. Subtract the value of T[i].
  2. Multiply by base B.
  3. Add the value of T[i+M]. This transformation takes O(1) time, allowing us to find matches in O(N) average time.

4. The Mathematical Anchor: KMP & The LPS Array

The Knuth-Morris-Pratt (KMP) algorithm asks: “If I already matched ABAB, and the next char is a mismatch, how much of my progress can I keep?”

The Longest Prefix Suffix (LPS)

We pre-process the pattern to find the longest proper prefix that is also a suffix. P = \text{“ABABC”}

  • “A”: 0
  • “AB”: 0
  • “ABA”: 1 (Prefix “A”, Suffix “A”)
  • “ABAB”: 2 (Prefix “AB”, Suffix “AB”)
  • “ABABC”: 0

When a mismatch occurs at index j, we don’t start i over. We look at LPS[j-1] and resume from there. This ensures we never backtrack in the text string, guaranteeing O(N) worst-case performance.


Watch how the brute force algorithm slides the pattern window across the text. Notice how it resets the comparison for every shift.

Text (T)
Pattern (P)
Shift: 0
Comparisons: 0
Click Step to start searching.

6. Implementation

The implementation is straightforward but serves as a baseline for more advanced algorithms like Knuth-Morris-Pratt (KMP) or Rabin-Karp.

public class NaiveSearch {
  public static int search(String text, String pattern) {
    int n = text.length();
    int m = pattern.length();

    // Loop through text
    for (int i = 0; i <= n - m; i++) {
      int j;
      // Loop through pattern
      for (j = 0; j < m; j++) {
        if (text.charAt(i + j) != pattern.charAt(j)) {
          break; // Mismatch found, break inner loop
        }
      }
      // If we reached the end of pattern, we found a match
      if (j == m) {
        return i;
      }
    }
    return -1; // Not found
  }

  public static void main(String[] args) {
    String text = "ABABDABACDABABCABAB";
    String pattern = "ABABCABAB";
    System.out.println("Found at index: " + search(text, pattern));
  }
}
package main

import "fmt"

func Search(text, pattern string) int {
	n := len(text)
	m := len(pattern)

	// Loop through text
	for i := 0; i <= n-m; i++ {
		j := 0
		// Loop through pattern
		for j < m {
			if text[i+j] != pattern[j] {
				break // Mismatch
			}
			j++
		}
		// If j reached m, full match found
		if j == m {
			return i
		}
	}
	return -1
}

func main() {
	text := "ABABDABACDABABCABAB"
	pattern := "ABABCABAB"
	fmt.Println("Found at index:", Search(text, pattern))
}

Limitations

Notice in the visualization that when a mismatch happens, we shift by just one position. We often re-scan characters we’ve already seen. Advanced algorithms optimize this by shifting more than one step based on the mismatch.


7. Deep Dive Strategy Lab: String Searching

Intuition Through Analogy

Think of this chapter like searching text in a large log stream. The goal is not to memorize a fixed trick, but to repeatedly answer:

  1. What is the bottleneck?
  2. Which constraint dominates (time, memory, latency, correctness)?
  3. Which representation makes the bottleneck easier to eliminate?