Text Analysis: How Search Works
[!NOTE] This section covers the core concepts of text analysis, building a foundation before diving into the technical details.
1. The Pipeline: From Text to Tokens
Imagine searching for "running shoes" in an e-commerce store. If the search engine only does an exact string match, a product titled "Nike Running Shoe" (singular) or "run shoes" might not appear. This is why text cannot simply be stored as raw strings. To provide an intuitive, highly relevant search experience, the text must be analyzed and broken down into standardized, searchable components.
When you index the string "Running at 5pm!", Elasticsearch doesn’t just store it. It transforms it. This transformation process is called Analysis.
The 3-Stage Pipeline:
- Character Filters: Clean the raw string before any splitting occurs (e.g., remove HTML tags, map "&" to "and"). "<b>Hello</b>" → "Hello"
- Tokenizer: Chops the cleaned string into a stream of tokens (e.g., split on whitespace or punctuation). "Hello World" → ["Hello", "World"]
- Token Filters: Process the tokens to standardize them (e.g., lowercasing, synonym mapping, stemming). "Running" → "run" (stemming)
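The three stages can be sketched as plain functions. This is a toy simulation of the pipeline's shape, not Elasticsearch's actual implementation (the real analyzers are far more sophisticated):

```python
import re

def char_filter(text):
    # Character filter: strip HTML tags and map "&" to "and".
    text = re.sub(r"<[^>]+>", "", text)
    return text.replace("&", "and")

def tokenizer(text):
    # Tokenizer: split on anything that is not a letter or digit.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]

def token_filters(tokens):
    # Token filter: lowercase every token.
    return [t.lower() for t in tokens]

def analyze(text):
    # Full pipeline: char filter -> tokenizer -> token filters.
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Running</b> at 5pm!"))  # ['running', 'at', '5pm']
```

The key point is the fixed order: character filters always run on the raw string, token filters always run on the already-split token stream.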
2. Standard vs Custom Analyzers
Different use cases require different analysis strategies. Elasticsearch ships with several built-in analyzers and lets you build custom ones.
The standard Analyzer (Default)
This is the default analyzer if none is specified. It is designed for general-purpose text and works reasonably well for most languages.
- Tokenizer: Unicode Text Segmentation (splits on word boundaries and removes punctuation).
- Filters: Lowercase.
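A rough approximation of this behavior in a few lines (the real tokenizer implements full Unicode text segmentation per UAX #29; `\w+` is a crude stand-in):

```python
import re

def standard_like_analyze(text):
    # Approximate the standard analyzer: split on word boundaries,
    # drop punctuation, lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]

print(standard_like_analyze("The QUICK Brown-Fox!"))
# ['the', 'quick', 'brown', 'fox']
```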
The “English” Analyzer
The standard analyzer doesn’t understand language. The English analyzer applies smart linguistic rules to improve search recall.
- Stemming: Reduces words to their root form (e.g., "foxes" → "fox", "running" → "run"). This ensures a search for "run" matches a document containing "running".
- Stopwords: Removes extremely common but less meaningful words like "the", "a", and "and", saving space and improving relevance.
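Both rules can be illustrated with a toy sketch. The suffix stripping below is nothing like the real stemmer the english analyzer uses; it is just enough to show stopword removal and stemming working together:

```python
STOPWORDS = {"the", "a", "and", "of", "to", "in"}

def naive_stem(token):
    # Toy suffix stripping -- NOT a real stemming algorithm.
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        if len(token) > 2 and token[-1] == token[-2]:  # "runn" -> "run"
            token = token[:-1]
    elif token.endswith("es") and len(token) > 4:
        token = token[:-2]
    elif token.endswith("s") and len(token) > 3:
        token = token[:-1]
    return token

def english_like_analyze(text):
    tokens = [t.lower() for t in text.split()]
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]

print(english_like_analyze("the foxes running"))  # ['fox', 'run']
```

Because "the" is a stopword it never reaches the index, and "foxes" and "running" are stored as their roots, so queries for "fox" or "run" match.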
Custom N-Gram Analyzer (Autocomplete)
If a user types "Elas", a standard analyzer won’t match the indexed token "elastic" because it performs exact token matching. We need partial tokens.
- Edge N-Gram Tokenizer: Breaks text into overlapping prefixes.
  - Input: "Elastic"
  - Output: ["E", "El", "Ela", "Elas", "Elast", "Elasti", "Elastic"]
  - Use Case: "Search as you type" or autocomplete functionality.
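Generating edge n-grams is just enumerating prefixes, as this minimal sketch shows (the `min_gram`/`max_gram` bounds mirror the settings the real tokenizer exposes):

```python
def edge_ngrams(token, min_gram=1, max_gram=20):
    # Emit every prefix of the token between min_gram and max_gram characters.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

print(edge_ngrams("Elastic"))
# ['E', 'El', 'Ela', 'Elas', 'Elast', 'Elasti', 'Elastic']
```

With these prefixes in the index, a query for "Elas" (lowercased at query time) matches the stored prefix token directly, which is what makes autocomplete work.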
3. Interactive: Analysis Lab
Experiment with different components to see how tokens are generated: type a sentence and toggle the filters to watch the raw token stream become the filtered output that gets indexed.
4. Hardware Reality: Analyzer CPU Cost
Analysis happens on Writes (Indexing) AND Reads (Search).
- Write Time: Complex analysis (synonyms, n-grams) is CPU-intensive and slows down ingestion.
- Search Time: The query string "Foxes" must typically pass through the same analyzer at query time so that it matches the indexed token "fox".
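The symmetry matters: if the query were not analyzed with the same rules, "Foxes" would never equal the stored token "fox". A minimal sketch, using a toy stand-in for a shared analyzer:

```python
def analyze(text):
    # Stand-in for a shared analyzer: lowercase + toy "es" stripping.
    tokens = []
    for t in text.lower().split():
        if t.endswith("es") and len(t) > 4:
            t = t[:-2]
        tokens.append(t)
    return tokens

indexed = set(analyze("Foxes jump"))    # {'fox', 'jump'}
query = analyze("Foxes")                # ['fox']
print(all(q in indexed for q in query))  # True
```

Running the analyzer twice per document lifecycle (once on write, once per query) is exactly why analyzer cost shows up on both the indexing and the search CPU profile.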
War Story: A team once experienced frequent CPU spikes and high ingestion latency during bulk indexing. They discovered their custom analyzer used a complex Regular Expression pattern for filtering. Because regex evaluation is computationally expensive, every document write caused a CPU bottleneck. By switching from a regex filter to a simpler pattern-matching tokenizer and N-grams, they reduced CPU utilization by 60% and vastly improved indexing throughput.
Tip: If ingestion is slow, check your CPU. Heavy Regex filters and large N-Gram generation (e.g., 1-gram to 20-gram combinations) are usually the bottlenecks.
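The n-gram blowup is easy to quantify. Assuming full (non-edge) n-grams, a word of length L produces L − n + 1 grams for each size n, so the counts add up fast:

```python
def ngram_count(word_len, min_gram=1, max_gram=20):
    # Total number of (non-edge) n-grams one word of word_len chars produces.
    return sum(word_len - n + 1
               for n in range(min_gram, min(max_gram, word_len) + 1))

# A single 10-character word with min_gram=1, max_gram=20
# explodes into 10 + 9 + ... + 1 = 55 tokens:
print(ngram_count(10))  # 55
# Edge n-grams on the same word would emit only 10 prefixes.
```

Every one of those tokens is generated, hashed, and written to the inverted index on every document write, which is why wide gram ranges show up directly as CPU and ingestion latency.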