Querying & Relevance Engineering — Review & Checklist
[!NOTE] This module reviews the core principles of querying and relevance engineering in Elasticsearch: filter vs. query context, BM25 scoring, and aggregations, with attention to the performance cost of each.
Key Takeaways
- Filter vs. Query Context: The fundamental dichotomy. Use `filter` (cached, boolean) for exact matches (status, IDs, dates). Use `query` (scored, computationally expensive) only when relevance ranking is required (full-text search).
- BitSet Caching: In Filter Context, Elasticsearch caches results in highly efficient BitSets, enabling ultra-fast bitwise AND operations for complex boolean logic.
- BM25 Scoring Fundamentals: `_score` is driven by Term Frequency (TF - saturates quickly), Inverse Document Frequency (IDF - rewards rarity), and Field Length Norm (rewards shorter fields).
- Aggregations Architecture: Think of aggregations as SQL `GROUP BY`. They are divided into Buckets (grouping docs) and Metrics (calculating stats within buckets).
- Global Ordinals: For fast aggregations on `keyword` strings, Elasticsearch uses global ordinals (mapping strings to integers). The first aggregation can be slow; use `eager_global_ordinals` to pre-load for low-latency needs.
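The filter/query split from the first takeaway can be sketched as a request body built in plain Python. This is a minimal illustration, not a recommended production query: the field names (`title`, `status`, `created_at`) and the helper `build_search_body` are hypothetical, while `bool`, `must`, `filter`, `match`, `term`, and `range` are standard Query DSL keys.

```python
import json

def build_search_body(text, status, since):
    """Build a bool query: one scored full-text clause plus unscored filters.

    Field names are hypothetical placeholders for illustration.
    """
    return {
        "query": {
            "bool": {
                # Query context: contributes to _score.
                "must": [{"match": {"title": text}}],
                # Filter context: yes/no only, no scoring, cacheable.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"created_at": {"gte": since}}},
                ],
            }
        }
    }

body = build_search_body("wireless headphones", "published", "2024-01-01")
print(json.dumps(body, indent=2))
```

Only the `match` clause influences ranking here; the `term` and `range` clauses just narrow the candidate set.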
Flashcards
Test your understanding of the core concepts.
Query Context
What question does this context answer, and what is the output?
Answers "How well does this match?"
Outputs a calculated `_score` (Float). It is slower than Filter Context because every matching document must be scored, and scored results are not cached.
Filter Context
What question does this context answer, and how does it achieve high performance?
Answers "Does this match? (Yes/No)".
It ignores scoring entirely and caches the results in memory-efficient BitSets for rapid boolean operations.
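To make the BitSet idea concrete, here is a toy sketch that uses Python integers as bitsets, one bit per document ID. It only illustrates the boolean algebra; it does not reflect how Lucene stores or caches its doc ID sets, and the helper names and document IDs are made up.

```python
def to_bitset(matching_doc_ids):
    """Pack a collection of doc IDs into one integer, one bit per document."""
    bits = 0
    for doc_id in matching_doc_ids:
        bits |= 1 << doc_id
    return bits

def from_bitset(bits):
    """Unpack a non-negative bitset back into a sorted list of doc IDs."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]

# Hypothetical per-filter results over documents 0..7.
status_published = to_bitset([0, 2, 3, 5, 7])
in_stock = to_bitset([1, 2, 5, 6, 7])
discontinued = to_bitset([7])

# filter: published AND in_stock; must_not: discontinued
matches = status_published & in_stock & ~discontinued
print(from_bitset(matches))  # [2, 5]
```

Each additional filter costs one more bitwise operation over the same compact structure, which is why stacking many filters stays cheap.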
BM25: TF Saturation
How does BM25 handle Term Frequency differently from Classic TF-IDF?
BM25 applies a non-linear saturation curve. Finding a term 100 times is only slightly better than finding it 10 times, preventing spammy documents from dominating.
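The saturation curve can be checked numerically. The sketch below computes only the term-frequency component of BM25, omitting IDF and any constant factors, and assumes the commonly used defaults `k1 = 1.2` and `b = 0.75`. The function name and the field lengths are illustrative.

```python
def bm25_tf(tf, k1=1.2, b=0.75, field_len=100, avg_field_len=100):
    """Term-frequency component of BM25 (IDF and constants omitted).

    k1 controls how fast the curve saturates; b controls how strongly
    field length (relative to the average) penalizes the score.
    """
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    return tf / (tf + norm)

# Saturation: the step from 10 to 100 occurrences adds far less than 1 to 10.
for tf in (1, 10, 100):
    print(tf, round(bm25_tf(tf), 3))
# 1 0.455
# 10 0.893
# 100 0.988

# Field length norm: the same tf scores higher in a shorter field.
print(round(bm25_tf(2, field_len=50), 3))   # 0.727
print(round(bm25_tf(2, field_len=200), 3))  # 0.488
```

The component is bounded above by 1, so repeating a term cannot push this part of the score past that ceiling.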
Global Ordinals
What are they, and why are they critical for Aggregations?
A mapping of unique strings to integer IDs. They allow ES to group by integers rather than comparing string bytes, drastically speeding up bucket aggregations on high-cardinality fields.
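The string-to-integer idea can be sketched in a few lines. This is a conceptual model only: real global ordinals are built per shard on top of per-segment ordinals, which this toy version ignores, and the function names and sample values are hypothetical.

```python
def build_ordinals(values):
    """Assign each distinct string an integer, in sorted term order."""
    terms = sorted(set(values))
    return terms, {term: i for i, term in enumerate(terms)}

def terms_agg(values):
    """Count documents per term by bucketing on integers, not strings."""
    terms, ordinal_of = build_ordinals(values)
    doc_ordinals = [ordinal_of[v] for v in values]  # one int per document
    counts = [0] * len(terms)
    for ordinal in doc_ordinals:
        counts[ordinal] += 1  # array index, no string comparison
    # Resolve ordinals back to strings only for the final buckets.
    return {terms[o]: c for o, c in enumerate(counts)}

colors = ["red", "blue", "red", "green", "blue", "red"]
print(terms_agg(colors))  # {'blue': 2, 'green': 1, 'red': 3}
```

Building the mapping is the expensive step, which is why the first aggregation after new data arrives can be slow while later ones reuse it.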
Cheat Sheet
| Concept | The “Why” | When to Use |
|---|---|---|
| `bool` query | Combines logic: `must` (score), `filter` (cache), `should` (boost), `must_not` (exclude). | The foundation of 99% of complex Elasticsearch queries. |
| BitSets | Arrays of 1s and 0s representing matched documents, combined with cheap bitwise operations (AND, OR, NOT). | Underpins the blazing speed of filter context. |
| BM25 | The math behind `_score`. Relies on TF, IDF, and Field Length. | The default scoring algorithm for full-text relevance. |
| Buckets | Bins documents (e.g., `terms`, `date_histogram`). Similar to SQL `GROUP BY`. | Creating faceted navigation or segmenting data. |
| Metrics | Calculates numbers (e.g., `avg`, `sum`) inside buckets. Similar to SQL `SELECT AVG()`. | Extracting statistics from grouped data. |
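The Buckets and Metrics rows combine naturally: a metric nested inside a bucket computes one statistic per group. The sketch below builds such a request body in Python. The field names (`category`, `price`), the aggregation names, and the helper itself are hypothetical; `aggs`, `terms`, `avg`, `size`, and `match_all` are standard request keys.

```python
def build_agg_body(bucket_field, metric_field, hits=10):
    """Bucket documents by a keyword field and average a numeric field per bucket."""
    return {
        "size": hits,  # number of search hits returned alongside the aggregation
        "query": {"match_all": {}},
        "aggs": {
            "by_bucket": {  # bucket aggregation, roughly GROUP BY bucket_field
                "terms": {"field": bucket_field},
                "aggs": {
                    # metric aggregation, roughly SELECT AVG(metric_field)
                    "avg_metric": {"avg": {"field": metric_field}}
                },
            }
        },
    }

agg_body = build_agg_body("category", "price")
print(agg_body["aggs"]["by_bucket"])
```

Setting `hits=0` would return only the analytical summary, which is a common choice when the hits themselves are not needed.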
Quick Revision
- Always prefer Filter Context unless you explicitly need documents ranked by relevance.
- The `bool` query is your orchestrator: use `filter` for hard constraints and `must`/`should` for relevance.
- BM25 rewards rarity and brevity: A rare word in a short field yields the highest score.
- Aggregations are dual-purpose: They return the search results AND the analytical summary in a single round-trip.
- Beware the first aggregation penalty: If latency is critical, use `eager_global_ordinals` to pre-build the string-to-int mappings for aggregations.
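As a sketch of the last point, `eager_global_ordinals` is set per field in the index mapping. The body below is built in Python for illustration; the field name `category` and the helper function are hypothetical.

```python
def keyword_mapping(field, eager=True):
    """Index mapping for a keyword field with global ordinals built eagerly."""
    return {
        "mappings": {
            "properties": {
                field: {
                    "type": "keyword",
                    # Build global ordinals at refresh time instead of on
                    # the first aggregation that needs them.
                    "eager_global_ordinals": eager,
                }
            }
        }
    }

mapping_body = keyword_mapping("category")
print(mapping_body)
```

The trade-off is that the build cost moves from the first search to each refresh, so this tends to suit fields that are aggregated on frequently.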
Next Steps
Now that you understand how to query and rank documents efficiently, it’s time to learn how to scale the system that handles these requests.
→ Continue to Scaling & Operations
Glossary Link
Need a refresher on specific terminology? View the Elasticsearch Glossary