Search Engines & Inverted Indices
[!NOTE] This module explores the core principles of search engines and inverted indices, deriving them from first principles and hardware constraints.
1. The Hook: The Needle in the Haystack
Imagine searching for “Star Wars” in 100 million text files.
Naive Approach: grep every file, O(N \cdot L) for N files of average length L. This takes hours.
Search Engine: returns results in ~10 ms.
How? By eliminating 99.999% of documents before reading them.
2. The Core Structure: Inverted Index
A forward index maps Document → Words.
An Inverted Index maps Word → List<Document>.
Example:
- Doc A: “The cat sat”
- Doc B: “The cat jumped”
Inverted Index:
- “The”: [A, B]
- “cat”: [A, B]
- “sat”: [A]
- “jumped”: [B]

Query: “cat AND jumped”
- Get list for “cat”: [A, B]
- Get list for “jumped”: [B]
- Intersect: [B]
3. Algorithm: Efficient Intersection
How to intersect two sorted lists of size M and N?
- Two Pointers: O(M+N) (Linear scan).
- Skip Pointers: if the lists are massive, store pointers that “skip” ahead (e.g., every 100 items), so whole blocks can be ruled out without scanning; when one list is much shorter (M \ll N), this approaches roughly O(M \log N).
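The two-pointer scan above can be sketched as follows (class and method names are illustrative):

```java
import java.util.*;

public class Intersect {
    // Two-pointer intersection of two sorted postings lists: O(M + N).
    // Skip pointers would let the else branches jump ahead in blocks.
    static List<Integer> intersect(List<Integer> a, List<Integer> b) {
        List<Integer> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            int x = a.get(i), y = b.get(j);
            if (x == y) { out.add(x); i++; j++; } // docId in both lists
            else if (x < y) i++;                  // advance whichever pointer trails
            else j++;
        }
        return out;
    }
}
```

Because both lists are sorted by docId, each pointer only ever moves forward, so every element is touched at most once.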
4. Implementation: Simple Inverted Index (Java)
```java
import java.util.*;

class InvertedIndex {
    // word -> sorted postings list of docIds
    private final Map<String, List<Integer>> index = new HashMap<>();

    public void addDocument(int docId, String text) {
        for (String word : text.toLowerCase().split("\\s+")) {
            List<Integer> postings = index.computeIfAbsent(word, k -> new ArrayList<>());
            // Skip duplicates when a word repeats within the same document
            if (postings.isEmpty() || postings.get(postings.size() - 1) != docId) {
                postings.add(docId);
            }
        }
    }

    public List<Integer> search(String query) {
        String[] words = query.toLowerCase().split(" and ");
        List<Integer> result = new ArrayList<>(index.getOrDefault(words[0], Collections.emptyList()));
        for (int i = 1; i < words.length; i++) {
            result.retainAll(index.getOrDefault(words[i], Collections.emptyList())); // intersection
        }
        return result;
    }
}
```
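A quick end-to-end check of the same query flow, with the postings for Doc A (id 1, “The cat sat”) and Doc B (id 2, “The cat jumped”) built by hand:

```java
import java.util.*;

public class Demo {
    public static void main(String[] args) {
        // Postings for: Doc 1 = "the cat sat", Doc 2 = "the cat jumped"
        Map<String, List<Integer>> index = new HashMap<>();
        index.put("the", List.of(1, 2));
        index.put("cat", List.of(1, 2));
        index.put("sat", List.of(1));
        index.put("jumped", List.of(2));

        // "cat AND jumped": start from one postings list, intersect with the next
        List<Integer> result = new ArrayList<>(index.get("cat"));
        result.retainAll(index.get("jumped"));
        System.out.println(result); // [2]
    }
}
```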
5. Deep Dive Strategy Lab: Search Engines
Intuition Through Analogy
The Index in a Book. You don’t read the whole book to find “Voldemort”; you go to the index at the back, which points you to specific pages. An Inverted Index is just the “index at the back” of the internet.
Mathematical Anchor: TF-IDF
How do we rank results?
- TF (Term Frequency): how often “cat” appears in this document.
- IDF (Inverse Document Frequency): how rare “cat” is across all documents.

Score = TF \times \log(\frac{TotalDocs}{DocsWithTerm})

If a word is everywhere (“the”), IDF is near 0. If it’s rare (“quantum”), IDF is high.
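The scoring formula is one line of code (class and method names are illustrative):

```java
public class TfIdf {
    // score = tf * log(totalDocs / docsWithTerm)
    static double score(int tf, int totalDocs, int docsWithTerm) {
        return tf * Math.log((double) totalDocs / docsWithTerm);
    }
}
```

A term appearing in every document scores 0 regardless of its TF, while a rare term gets a large boost per occurrence.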