Text Analysis: How Search Works

[!NOTE] This module explains text analysis: how Elasticsearch transforms raw text into searchable tokens, and what that transformation costs at index and search time.

1. The Pipeline: From Text to Tokens

When you index the string "Running at 5pm!", Elasticsearch doesn’t just store it. It transforms it. This process is called Analysis.

The 3-Stage Pipeline:

  1. Character Filters: Clean the raw string (e.g., remove HTML tags).
    • "<b>Hello</b>" → "Hello"
  2. Tokenizer: Chop the string into a stream of tokens (e.g., split on whitespace).
    • "Hello World" → ["Hello", "World"]
  3. Token Filters: Process tokens (lowercase, synonyms, stemming).
    • "Running" → "run" (stemming)
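The three stages can be sketched in a few lines of Python. This is an illustrative toy, not Elasticsearch's actual implementation; the suffix-stripping regex is a crude stand-in for a real stemmer such as Porter.

```python
import re

def analyze(text):
    """Toy sketch of the 3-stage analysis pipeline."""
    # 1. Character filter: strip HTML tags from the raw string.
    text = re.sub(r"<[^>]+>", "", text)
    # 2. Tokenizer: split the cleaned string into word tokens.
    tokens = re.findall(r"\w+", text)
    # 3. Token filters: lowercase + crude suffix stripping as a
    #    stand-in for a real stemmer.
    return [re.sub(r"(ning|ing|es|s)$", "", t.lower()) for t in tokens]
```

Running it on the example from the top of the section, `analyze("<b>Running</b> at 5pm!")` yields `["run", "at", "5pm"]`: the tag is gone, the tokens are lowercased, and "Running" is stemmed.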

2. Standard vs Custom Analyzers

The standard Analyzer (Default)

A good default for most languages and use cases.

  • Tokenizer: Unicode Text Segmentation (splits on word boundaries per the Unicode Standard Annex #29 rules).
  • Filters: Lowercase.

The “English” Analyzer

Applies English-specific linguistic rules.

  • Stemming: foxes → fox, running → run.
  • Stopwords: Removes the, a, and.
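Both behaviors can be sketched in a toy function. The stopword set and suffix rules below are tiny illustrative subsets; the real English analyzer uses a full stopword list and the Porter stemmer.

```python
STOPWORDS = {"the", "a", "an", "and"}  # tiny illustrative subset

def english_like(tokens):
    """Toy sketch of English-analyzer behavior: stopwords + crude stemming."""
    out = []
    for tok in tokens:
        tok = tok.lower()
        if tok in STOPWORDS:
            continue  # stopword: dropped entirely
        # crude suffix stripping as a stand-in for the Porter stemmer
        for suffix in ("ning", "ing", "es", "s"):
            if tok.endswith(suffix):
                tok = tok[: -len(suffix)]
                break
        out.append(tok)
    return out
```

For example, `english_like(["The", "foxes", "and", "running"])` returns `["fox", "run"]` — the stopwords vanish and the remaining tokens are reduced to their stems.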

Custom N-Gram Analyzer (Autocomplete)

To match a partial query like "Elas", we need to index partial tokens.

  • Edge N-Gram Tokenizer:
    • Input: "Elastic"
    • Output: ["E", "El", "Ela", "Elas", "Elast", "Elasti", "Elastic"]
    • Use Case: Type-ahead search.
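Edge n-gram generation is just prefix enumeration. A minimal sketch, with parameter names mirroring the tokenizer's min_gram/max_gram settings:

```python
def edge_ngrams(token, min_gram=1, max_gram=None):
    """Emit prefixes of `token` from min_gram up to max_gram characters."""
    if max_gram is None:
        max_gram = len(token)
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]
```

So `edge_ngrams("Elastic")` produces the seven tokens listed above. Note that Elasticsearch's edge_ngram tokenizer defaults to a small max_gram (2), so you must raise it to cover longer prefixes like "Elas".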

3. Hardware Reality: Analyzer CPU Cost

Analysis happens on Writes (Indexing) AND Reads (Search).

  • Write Time: Complex analysis (synonyms, n-grams) is CPU-heavy and slows down ingestion.
  • Search Time: The query string "Foxes" must typically go through the same analyzer to produce the token "fox" that was indexed.
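This write/read symmetry can be sketched with a toy inverted index, where the same analyzer runs at both index time and query time (the documents, names, and plural-stripping rule are illustrative):

```python
def analyze(text):
    """Toy analyzer used for BOTH indexing and querying."""
    tokens = []
    for tok in text.lower().split():
        if tok.endswith("es"):
            tok = tok[:-2]  # crude stand-in for stemming: "foxes" -> "fox"
        tokens.append(tok)
    return tokens

# Write time: analyze each document and build an inverted index (token -> doc ids).
index = {}
for doc_id, text in enumerate(["Foxes run fast", "Slow turtles"]):
    for tok in analyze(text):
        index.setdefault(tok, set()).add(doc_id)

# Search time: the query passes through the SAME analyzer, so "Foxes" -> "fox"
# and matches the token that was produced at index time.
def search(query):
    hits = set()
    for tok in analyze(query):
        hits |= index.get(tok, set())
    return hits
```

Here `search("Foxes")` finds document 0 because both sides of the match were normalized to "fox". If the index-time and search-time analyzers disagree, queries silently miss documents.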

Tip: If ingestion is slow, check your CPU first. Heavy regex-based filters (e.g., pattern_replace) and wide n-gram ranges are common bottlenecks.