Text Analysis: How Search Works
[!NOTE] This module explores the core principles of text analysis in Elasticsearch: how raw text is transformed into searchable tokens, and what that transformation costs in CPU at index and search time.
1. The Pipeline: From Text to Tokens
When you index the string "Running at 5pm!", Elasticsearch doesn’t just store it.
It transforms it. This process is called Analysis.
The 3-Stage Pipeline:
- Character Filters: clean the raw string (e.g., remove HTML tags). `"<b>Hello</b>"` → `"Hello"`
- Tokenizer: chop the string into a stream of tokens (e.g., split on whitespace). `"Hello World"` → `["Hello", "World"]`
- Token Filters: process tokens (lowercase, synonyms, stemming). `"Running"` → `"run"` (stemming)
2. Standard vs Custom Analyzers
The `standard` Analyzer (Default)
A sensible, language-agnostic default for most full-text fields.
- Tokenizer: Unicode Text Segmentation.
- Filters: Lowercase.
The `english` Analyzer
Applies language-specific linguistic rules.
- Stemming: `foxes` → `fox`, `running` → `run`.
- Stopwords: removes `the`, `a`, `and`.
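The difference between the two analyzers can be sketched side by side. The stopword list and the "-es" stemming rule below are illustrative placeholders, not the real `standard` or `english` implementations:

```python
# Illustrative stopword list; the real english analyzer's list is longer.
STOPWORDS = {"the", "a", "and"}

def standard_like(text):
    # Roughly what `standard` does: tokenize, then lowercase. No stemming.
    return [t.lower() for t in text.split()]

def english_like(text):
    tokens = standard_like(text)
    # Drop stopwords, then apply a crude plural stemmer ("foxes" -> "fox").
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [t[:-2] if t.endswith("es") else t for t in tokens]

print(standard_like("The quick foxes"))  # → ['the', 'quick', 'foxes']
print(english_like("The quick foxes"))   # → ['quick', 'fox']
```

The practical consequence: a field analyzed with `english` will match the query "fox" against the document text "foxes", while a `standard`-analyzed field will not.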
Custom N-Gram Analyzer (Autocomplete)
To match a partial query like "Elas", the index needs partial tokens.
- Edge N-Gram Tokenizer:
  - Input: `"Elastic"`
  - Output: `["E", "El", "Ela", "Elas", "Elast", "Elasti", "Elastic"]`
  - Use Case: type-ahead search.
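Edge n-grams are simply the prefixes of a token. A minimal sketch, with parameter names mirroring Elasticsearch's `min_gram`/`max_gram` settings (the implementation itself is an assumption-free prefix loop, not the real tokenizer):

```python
def edge_ngrams(token, min_gram=1, max_gram=None):
    """Generate edge n-grams (prefixes) of a token, from min_gram
    to max_gram characters long. max_gram defaults to the full token."""
    if max_gram is None:
        max_gram = len(token)
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("Elastic"))
# → ['E', 'El', 'Ela', 'Elas', 'Elast', 'Elasti', 'Elastic']
print(edge_ngrams("Elastic", min_gram=2, max_gram=4))
# → ['El', 'Ela', 'Elas']
```

Every prefix is stored as its own token, which is why n-gram indices are larger and more CPU-intensive to build than standard ones.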
3. Interactive: Analysis Lab
Experiment with different components to see how tokens are generated.
4. Hardware Reality: Analyzer CPU Cost
Analysis happens on Writes (Indexing) AND Reads (Search).
- Write Time: Complex analysis (synonyms, n-grams) slows down ingestion. CPU heavy.
- Search Time: the query string `"Foxes"` must typically pass through the same analyzer so it can match the indexed token `"fox"`.
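This write/search symmetry can be sketched with a toy inverted index. Everything here (the analyzer, the index structure, the function names) is illustrative, not an Elasticsearch API:

```python
def analyze(text):
    # Stand-in analyzer: lowercase + crude plural stemmer ("foxes" -> "fox").
    return [t[:-2] if t.endswith("es") else t for t in text.lower().split()]

index = {}  # inverted index: token -> set of doc ids

def index_doc(doc_id, text):
    for token in analyze(text):        # write time: CPU spent per document
        index.setdefault(token, set()).add(doc_id)

def search(query):
    results = set()
    for token in analyze(query):       # search time: the SAME analyzer again
        results |= index.get(token, set())
    return results

index_doc(1, "Foxes jump")
print(search("Foxes"))  # → {1}: both sides produced the token "fox"
```

If the query side skipped analysis, the raw term `Foxes` would never match the stored token `fox`, which is why mismatched index-time and search-time analyzers are a classic source of "missing" results.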
Tip: If ingestion is slow, check your CPU first. Heavy regex-based filters are usually the bottleneck.