Text Analysis: How Search Works
[!NOTE] This module explores the core principles of text analysis in Elasticsearch: how raw text is transformed into searchable tokens, and what that transformation costs in CPU at index and search time.
1. The Pipeline: From Text to Tokens
When you index the string "Running at 5pm!", Elasticsearch doesn’t just store it.
It transforms it. This process is called Analysis.
The 3-Stage Pipeline:
- Character Filters: clean the raw string (e.g., remove HTML tags). `"<b>Hello</b>"` → `"Hello"`
- Tokenizer: chop the string into a stream of tokens (e.g., split on whitespace). `"Hello World"` → `["Hello", "World"]`
- Token Filters: process tokens (lowercase, synonyms, stemming). `"Running"` → `"run"` (stemming)
2. Standard vs Custom Analyzers
The `standard` Analyzer (Default)
A sensible, language-agnostic default for most full-text fields.
- Tokenizer: Unicode Text Segmentation.
- Filters: Lowercase.
The `english` Analyzer
Applies language-specific linguistic rules.
- Stemming: `foxes` → `fox`, `running` → `run`.
- Stopwords: removes `the`, `a`, `and`.
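The difference between the two analyzers can be sketched side by side. The stopword list and the "-es" stemming rule below are illustrative placeholders, not the real `standard` or `english` implementations:

```python
# Illustrative stopword list; the real english analyzer's list is longer.
STOPWORDS = {"the", "a", "and"}

def standard_like(text):
    # Roughly what `standard` does: tokenize, then lowercase. No stemming.
    return [t.lower() for t in text.split()]

def english_like(text):
    tokens = standard_like(text)
    # Drop stopwords, then apply a crude plural stemmer ("foxes" -> "fox").
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-2] if t.endswith("es") else t for t in tokens]

print(standard_like("The quick foxes"))  # → ['the', 'quick', 'foxes']
print(english_like("The quick foxes"))   # → ['quick', 'fox']
```

The practical consequence: a field analyzed with `english` will match the query "fox" against the document text "foxes", while a `standard`-analyzed field will not.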
Custom N-Gram Analyzer (Autocomplete)
To match a partial query like "Elas", the index needs partial tokens.
- Edge N-Gram Tokenizer:
  - Input: `"Elastic"`
  - Output: `["E", "El", "Ela", "Elas", "Elast", "Elasti", "Elastic"]`
  - Use Case: type-ahead search.
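Edge n-grams are simply the prefixes of a token. A minimal sketch, with parameter names mirroring Elasticsearch's `min_gram`/`max_gram` settings (the implementation itself is an assumption-free prefix loop, not the real tokenizer):

```python
def edge_ngrams(token, min_gram=1, max_gram=None):
    """Generate edge n-grams (prefixes) of a token, from min_gram
    to max_gram characters long. max_gram defaults to the full token."""
    if max_gram is None:
        max_gram = len(token)
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("Elastic"))
# → ['E', 'El', 'Ela', 'Elas', 'Elast', 'Elasti', 'Elastic']
print(edge_ngrams("Elastic", min_gram=2, max_gram=4))
# → ['El', 'Ela', 'Elas']
```

Every prefix is stored as its own token, which is why n-gram indices are larger and more CPU-intensive to build than standard ones.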
3. Interactive: Analysis Lab
Experiment with different components to see how tokens are generated.
4. Hardware Reality: Analyzer CPU Cost
Analysis happens on Writes (Indexing) AND Reads (Search).
- Write Time: Complex analysis (synonyms, n-grams) slows down ingestion. CPU heavy.
- Search Time: the query string `"Foxes"` must typically pass through the same analyzer so it can match the indexed token `"fox"`.
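This write/search symmetry can be sketched with a toy inverted index. Everything here (the analyzer, the index structure, the function names) is illustrative, not an Elasticsearch API:

```python
def analyze(text):
    # Stand-in analyzer: lowercase + crude plural stemmer ("foxes" -> "fox").
    return [t[:-2] if t.endswith("es") else t for t in text.lower().split()]

index = {}  # inverted index: token -> set of doc ids

def index_doc(doc_id, text):
    for token in analyze(text):        # write time: CPU spent per document
        index.setdefault(token, set()).add(doc_id)

def search(query):
    results = set()
    for token in analyze(query):       # search time: the SAME analyzer again
        results |= index.get(token, set())
    return results

index_doc(1, "Foxes jump")
print(search("Foxes"))  # → {1}: both sides produced the token "fox"
```

If the query side skipped analysis, the raw term `Foxes` would never match the stored token `fox`, which is why mismatched index-time and search-time analyzers are a classic source of "missing" results.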
Tip: If ingestion is slow, check your CPU first. Heavy regex-based filters are usually the bottleneck.