Query DSL: Speaking JSON

[!NOTE] This module explores the core principles of Query DSL: Speaking JSON, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Two Contexts: Score vs No-Score

Every clause in Elasticsearch runs in one of two contexts. Mixing them up is a common cause of slow queries.

The “Sommelier vs. Bouncer” Analogy:

  • Query Context is like a Sommelier tasting wine: “How good is this wine on a scale of 1-100?” It requires careful evaluation, nuanced calculations (scoring algorithms), and is fundamentally a slower, comparative process.
  • Filter Context is like a Bouncer checking IDs at a club: “Are you over 21? Yes or No.” It’s an exact, binary decision that is extremely fast and can be easily remembered (cached) for the rest of the night.

| Feature | Query Context ("query": ...) | Filter Context ("filter": ...) |
| --- | --- | --- |
| Question | “How well does this match?” | “Does this match? (Yes/No)” |
| Output | _score (Float) | Boolean (True/False) |
| Performance | Slower (Calculates Relevance) | Fast (Cached in BitSet) |
| Use Case | Full-text search (“best pizza”) | Exact filtering (“status=active”) |

Golden Rule: If you don’t care about ranking (e.g., filtering by Date, Status, ID), ALWAYS use Filter Context.
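
As a minimal sketch of the Golden Rule, the helper below builds a search body in which every clause runs in filter context. It uses the bool wrapper covered in the next section. The function name and the field names `status` and `created_at` are hypothetical, chosen only for illustration.

```python
import json


def filter_only_query(status, created_after):
    """Build a search body whose clauses all run in filter context."""
    return {
        "query": {
            "bool": {
                # No "must" or "should" here: nothing is scored,
                # each clause only answers yes/no per document.
                "filter": [
                    {"term": {"status": status}},
                    {"range": {"created_at": {"gte": created_after}}},
                ]
            }
        }
    }


body = filter_only_query("active", "2024-01-01")
print(json.dumps(body, indent=2))
```

Because no clause contributes to relevance, hits from a body like this are not ranked against each other, which is exactly what you want for status, date, or ID lookups.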


2. The Compound bool Query

The bool query is the wrapper for combining logic. It has 4 clauses:

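  1. must (AND): Must match. Contributes to score.
  2. filter (AND): Must match. Ignores score. Eligible for caching.
  3. should (OR): Nice to have. Boosts score if present. If the bool query has no must or filter clause, at least one should clause must match.
  4. must_not (NOT): Must NOT match. Runs in filter context, so it ignores score and is eligible for caching.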

Pattern:

{
  "query": {
    "bool": {
      "must": [ { "match": { "title": "pizza" }} ],
      "filter": [ { "term": { "city": "NYC" }} ]
    }
  }
}
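
To make the role of each clause concrete, here is a toy evaluator in plain Python. It is not how Elasticsearch executes a bool query: clauses are ordinary predicate functions, and "1.0 per matching scoring clause" is an assumed stand-in for real relevance scoring. The documents and helper names are hypothetical.

```python
def toy_bool(doc, must=(), filter_=(), should=(), must_not=()):
    """Return a toy score for doc, or None if doc does not match."""
    # must and filter are both required (AND).
    if not all(clause(doc) for clause in must):
        return None
    if not all(clause(doc) for clause in filter_):
        return None
    # must_not excludes the document (NOT).
    if any(clause(doc) for clause in must_not):
        return None
    should_hits = sum(1 for clause in should if clause(doc))
    # With no must/filter clause, at least one should clause has to match.
    if should and not must and not filter_ and should_hits == 0:
        return None
    # Only must and should add to the score; filter and must_not never do.
    return float(len(must) + should_hits)


docs = [
    {"id": 1, "title": "pizza guide", "city": "NYC", "tags": ["vegan"], "closed": False},
    {"id": 2, "title": "pizza guide", "city": "NYC", "tags": [], "closed": False},
    {"id": 3, "title": "pizza guide", "city": "LA", "tags": [], "closed": False},
    {"id": 4, "title": "pizza guide", "city": "NYC", "tags": [], "closed": True},
]


def has_pizza(doc):
    return "pizza" in doc["title"]


def in_nyc(doc):
    return doc["city"] == "NYC"


def is_vegan(doc):
    return "vegan" in doc["tags"]


def is_closed(doc):
    return doc["closed"]


results = {
    doc["id"]: toy_bool(
        doc,
        must=[has_pizza],
        filter_=[in_nyc],
        should=[is_vegan],
        must_not=[is_closed],
    )
    for doc in docs
}
print(results)  # {1: 2.0, 2: 1.0, 3: None, 4: None}
```

Documents 1 and 2 both pass the filter, but only the should clause separates their scores; documents 3 and 4 are dropped without any scoring work.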

3. Interactive: The BitSet Cache

Elasticsearch can cache filter results as BitSets (arrays of 0s and 1s, one bit per document). Combining two cached filters is then a bitwise AND of their BitSets.

[Interactive demo: a row of Doc IDs, one BitSet each for the filters "status=active" and "category=tech", and a Result row showing their bitwise AND.]
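
The intersection can be sketched in a few lines of Python, using an arbitrary-precision integer as the bitset (bit i set means doc ID i matches). The doc IDs below are made up for illustration, and this shows the idea only, not Lucene's actual bitset classes.

```python
def to_bitset(doc_ids):
    """Pack a collection of doc IDs into an int, one bit per doc ID."""
    bits = 0
    for doc_id in doc_ids:
        bits |= 1 << doc_id
    return bits


def from_bitset(bits):
    """Unpack an int bitset back into a sorted list of doc IDs."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


active = to_bitset([0, 2, 3, 5, 7])  # Filter: status=active
tech = to_bitset([1, 2, 5, 6, 7])    # Filter: category=tech
both = active & tech                 # Result: one bitwise AND

# Bit 0 (doc ID 0) is the rightmost character.
print(format(active, "08b"))  # 10101101
print(format(tech, "08b"))    # 11100110
print(format(both, "08b"))    # 10100100
print(from_bitset(both))      # [2, 5, 7]
```

The AND touches every document in the segment with a single operation per machine word, which is why reusing cached filter results is cheap.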

4. Hardware Reality: CPU Instructions & Roaring Bitmaps

Why is Filter Context often so much faster? Under the hood, Elasticsearch (via Apache Lucene) can cache a filter's matching Document IDs in compact set structures: plain bitsets for dense results and, for sparse ones, a layout modeled on Roaring Bitmaps, a compressed data structure for sets of integers.

  • Memory Efficiency: Instead of storing one raw array of millions of 0s and 1s, Roaring Bitmaps split the ID space into chunks of 65,536 IDs. Sparse chunks (few matches) are stored as short sorted arrays of 16-bit values; dense chunks (many matches) are stored as bitmaps.
  • SIMD Operations: When you combine multiple filters (e.g., status=active AND category=tech) held as bitsets, the CPU doesn’t have to test documents one by one. A bitwise AND/OR on a 64-bit word combines 64 documents at once, and on CPUs with SIMD (Single Instruction, Multiple Data) support such loops can be vectorized to process 256 bits or more per instruction.
  • Math vs. Bits: Filter intersections use cheap integer bitwise operations. Query context must compute term frequencies and inverse document frequencies (TF-IDF/BM25) with floating-point math for every matched document, which costs far more per document.
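
The chunking idea behind Roaring Bitmaps can be sketched as follows. This is a simplified illustration, not Lucene's implementation: the function names are hypothetical, and the 4,096 cutoff is the point where a sorted array of 16-bit values (2 bytes each) stops being smaller than a fixed 8 KB bitmap covering 65,536 IDs.

```python
from array import array
from bisect import bisect_left

CHUNK_BITS = 16     # each chunk covers 2**16 = 65,536 doc IDs
ARRAY_LIMIT = 4096  # 4,096 IDs * 2 bytes = 8,192 bytes = size of one bitmap


def build_containers(doc_ids):
    """Group doc IDs by their high 16 bits and pick a container per chunk."""
    chunks = {}
    for doc_id in sorted(set(doc_ids)):
        chunks.setdefault(doc_id >> CHUNK_BITS, []).append(doc_id & 0xFFFF)
    containers = {}
    for key, lows in chunks.items():
        if len(lows) <= ARRAY_LIMIT:
            # Sparse chunk: sorted array of unsigned 16-bit values.
            containers[key] = ("array", array("H", lows))
        else:
            # Dense chunk: fixed bitmap of 65,536 bits (8,192 bytes).
            bitmap = bytearray(8192)
            for low in lows:
                bitmap[low >> 3] |= 1 << (low & 7)
            containers[key] = ("bitmap", bitmap)
    return containers


def contains(containers, doc_id):
    """Check membership by looking only at the chunk that could hold doc_id."""
    entry = containers.get(doc_id >> CHUNK_BITS)
    if entry is None:
        return False
    kind, data = entry
    low = doc_id & 0xFFFF
    if kind == "array":
        i = bisect_left(data, low)  # binary search in the sorted array
        return i < len(data) and data[i] == low
    return bool(data[low >> 3] & (1 << (low & 7)))


sparse_ids = [5, 70000, 70010]
dense_ids = range(131072, 131072 + 10000)
containers = build_containers(list(sparse_ids) + list(dense_ids))
kinds = {key: kind for key, (kind, _) in containers.items()}
print(kinds)  # {0: 'array', 1: 'array', 2: 'bitmap'}
```

Three isolated IDs cost a few bytes each instead of a full bitmap, while the chunk with 10,000 matches falls back to a bitmap that supports word-at-a-time AND/OR.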