Mapping: The Schema of Search

[!NOTE] This module explores the core principles of Mapping: The Schema of Search, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Hook: “Schemaless” is a Lie

Elasticsearch claims to be “schemaless”. Reality: If you don’t define a schema (Mapping), Elasticsearch guesses. And it usually guesses WRONG.

  • String → text + keyword (Indexes the value twice, which can roughly double disk usage for that field).
  • Timestamp → date (Good).
  • Floating point → float (Bad if you need double precision; wasteful if half_float or scaled_float would do).
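To make the guessing concrete, here is a minimal Python sketch that imitates the default dynamic-mapping rules listed above. It is an approximation, not Elasticsearch's code: `guess_dynamic_type` is a hypothetical helper, and `datetime.fromisoformat` is only a stand-in for Elasticsearch's real date detection.

```python
from datetime import datetime


def guess_dynamic_type(value):
    """Rough imitation of the default dynamic-mapping guess for one JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # assumption: stand-in for date detection
            return {"type": "date"}
        except ValueError:
            # Plain strings are indexed twice: analyzed text plus a keyword sub-field.
            return {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    return None  # objects, arrays, and nulls are out of scope for this sketch


doc = {"status": "active", "created": "2024-01-15T10:30:00", "price": 19.99, "retries": 3}
guessed = {field: guess_dynamic_type(value) for field, value in doc.items()}
```

Note that `status` ends up as both `text` and `keyword` even though it only ever needs exact matching, which is the kind of waste an explicit mapping avoids.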

2. Text vs Keyword (The Billion Dollar Question)

| Feature | text Field | keyword Field |
| --- | --- | --- |
| Use Case | Full-text search (“find ‘fox’ in body”) | Exact filtering (“status=‘active’”) |
| Analysis | Tokenized ("The Fox" → [the, fox]) | Untouched ("The Fox" → [The Fox]) |
| Data Structure | Inverted Index | Inverted Index + Doc Values |
| Sorting/Aggs | Disabled by default (Too much RAM) | Fast (Uses Doc Values) |

Golden Rule:

  • Do you need to search for words inside it? → text
  • Do you need to Filter, Sort, or Aggregate? → keyword
  • Both? → Multi-field ("title": { "type": "text", "fields": { "raw": { "type": "keyword" } } })
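The multi-field rule can be sketched as plain request bodies. The Python dictionaries below only build JSON; no client library is called, and the aggregation name `top_titles` is made up for illustration.

```python
import json

# Explicit mapping: "title" is analyzed text, "title.raw" is the exact keyword copy.
index_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Full-text search goes against the analyzed field.
search_body = {"query": {"match": {"title": "fox"}}}

# Exact filtering, sorting, and aggregating go against the keyword sub-field.
filter_sort_body = {
    "query": {"term": {"title.raw": "The Fox"}},
    "sort": [{"title.raw": "asc"}],
    "aggs": {"top_titles": {"terms": {"field": "title.raw"}}},
}

payload = json.dumps(index_body, indent=2)  # what would be sent when creating the index
```

The cost of this design is that the title is stored in two index structures, so it is worth reserving for fields that really need both behaviors.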

3. Storage Internals: How Mappings Hit the Disk

A. Inverted Index (Search)

Maps Term → List<DocIDs>.

  • Used for: match, term queries.
  • Structure: Sorted list of terms (Trie/FST) pointing to Posting Lists.
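A toy version of the Term → DocIDs structure, as a minimal sketch. It uses a plain sorted dictionary rather than an FST, and `analyze` is a deliberately naive analyzer (lowercase, split on whitespace) assumed only for this example.

```python
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase and split on whitespace."""
    return text.lower().split()


def build_inverted_index(docs, analyzer):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in analyzer(text):
            postings[term].add(doc_id)
    # Terms are kept in sorted order; each posting list is sorted by doc ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}


docs = {0: "The Fox", 1: "the quick brown fox", 2: "The Dog"}

text_index = build_inverted_index(docs, analyze)             # text-like: tokenized
keyword_index = build_inverted_index(docs, lambda s: [s])    # keyword-like: one term per value

fox_docs = text_index["fox"]          # documents containing the word "fox"
exact_docs = keyword_index["The Fox"]  # documents whose whole value is "The Fox"
```

The same builder produces both behaviors; only the analyzer differs, which is why "fox" is findable in the text-style index but absent from the keyword-style one.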

B. Doc Values (Sorting & Aggregations)

Maps DocID → Value.

  • This is a columnar store (similar in spirit to Parquet): all values of one field are stored together, indexed by DocID.
  • Used for: sort, aggs, script.
  • Hardware: Stored on disk, loaded into OS Page Cache.
  • Performance: Sequential access pattern. Fast!
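The DocID → Value layout can be sketched with plain Python lists, where the list position plays the role of the doc ID. The column contents and helper names are invented for illustration; real doc values are compressed on-disk structures.

```python
from collections import Counter

# One column per field: position = doc ID, entry = that document's value.
status_column = ["active", "inactive", "active", "pending", "active"]
price_column = [30.0, 12.5, 99.0, 45.0, 7.25]


def sort_doc_ids(column, reverse=False):
    """Sort doc IDs by their value in one column (what a sort clause needs)."""
    return sorted(range(len(column)), key=lambda doc_id: column[doc_id], reverse=reverse)


def terms_agg(column, matching_doc_ids):
    """Count values for the matching docs (what a terms aggregation needs)."""
    return Counter(column[doc_id] for doc_id in matching_doc_ids)


by_price = sort_doc_ids(price_column)
counts = terms_agg(status_column, range(len(status_column)))
```

Both operations start from doc IDs and read one field's values in order, which is the access pattern the inverted index cannot serve efficiently.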

C. BKD Trees (Numbers & Geo)

Elasticsearch 5.0+ changed everything for numbers.

  • Old: Numbers were strings in the Inverted Index.
  • New: Block K-Dimensional (BKD) Trees.
  • Why: Optimized for range queries (price > 100).
  • Speed: Well suited to multi-dimensional data (e.g., the lat/lon pair of a geo_point), where a one-dimensional B-Tree does not help.
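Why block-based storage helps range queries can be shown with a heavily simplified, one-dimensional sketch. This is not Lucene's BKD implementation: it is a flat list of sorted leaf blocks with min/max bounds, and `build_blocks` and `range_query` are hypothetical helpers.

```python
def build_blocks(values, block_size=4):
    """Sort (value, doc_id) pairs and cut them into leaf blocks with min/max bounds."""
    points = sorted((value, doc_id) for doc_id, value in enumerate(values))
    blocks = []
    for start in range(0, len(points), block_size):
        chunk = points[start:start + block_size]
        blocks.append({"min": chunk[0][0], "max": chunk[-1][0], "points": chunk})
    return blocks


def range_query(blocks, low, high):
    """Return doc IDs with low <= value <= high, skipping blocks that cannot match."""
    hits = []
    for block in blocks:
        if block["max"] < low or block["min"] > high:
            continue  # whole block lies outside the range: never touch its points
        if low <= block["min"] and block["max"] <= high:
            hits.extend(doc_id for _, doc_id in block["points"])  # whole block matches
        else:
            hits.extend(doc_id for value, doc_id in block["points"] if low <= value <= high)
    return sorted(hits)


prices = [120, 15, 99, 250, 101, 40, 100, 310]
blocks = build_blocks(prices)
expensive = range_query(blocks, 101, float("inf"))  # integer prices, so this is price > 100
```

In this data set the first block (prices 15 to 100) is skipped entirely and the second is accepted wholesale, so no individual value is compared. A real BKD tree applies the same bounding idea recursively and across several dimensions.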

4. Interactive: Mapping Designer

[Interactive widget: choose a field type to see whether the Inverted Index and Doc Values (Columnar) are populated, and which capabilities (Search, Sort, Aggs) become available.]
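As a static stand-in for the designer, the rules from the earlier sections can be written as a small function. This is a simplified model of this module's rules, not Elasticsearch behavior in full: `capabilities` is a hypothetical helper, it covers only text and keyword, and it ignores that newer Elasticsearch versions may still answer (slow) queries on fields that have doc values but no index.

```python
def capabilities(field_type, index=True, doc_values=True):
    """Which operations a mapping choice enables, per the simplified rules in this module."""
    if field_type == "text":
        # text has no doc values; sort/aggs would need fielddata, which is off by default.
        return {"search": index, "sort": False, "aggs": False}
    if field_type == "keyword":
        return {"search": index, "sort": doc_values, "aggs": doc_values}
    raise ValueError(f"unsupported field type in this sketch: {field_type}")


text_caps = capabilities("text")
keyword_caps = capabilities("keyword")
agg_only_caps = capabilities("keyword", index=False, doc_values=True)
```

The last call corresponds to a field that is never searched but is still available for sorting and aggregations.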

5. Staff Tip: Disabling Fields

Most JSON logs have fields you never search. For example:

"user_agent": "Mozilla/5.0 ..."

Do you search this? Probably not. Optimize:

"user_agent": {
  "type": "keyword",
  "index": false,    // No Inverted Index (Save Disk)
  "doc_values": true // Still Aggregatable (Top User Agents)
}
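  • Result: Disk savings on the order of 30% on high-volume log clusters, though the exact figure depends on how many such fields you have and how large their values are.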
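Put together, the tip might look like the request bodies below. This is a sketch that only builds JSON: the `message` field and the aggregation name `top_user_agents` are assumptions added for illustration, and no client call is made.

```python
import json

# Mapping for a log index: "message" stays searchable, "user_agent" is aggregation-only.
log_index_body = {
    "mappings": {
        "properties": {
            "message": {"type": "text"},  # hypothetical field that is actually searched
            "user_agent": {
                "type": "keyword",
                "index": False,      # no inverted index for this field
                "doc_values": True,  # keep the column so terms aggregations still work
            },
        }
    }
}

# "Top user agents" needs only doc values, so it works without the inverted index.
top_user_agents_body = {
    "size": 0,  # return aggregation buckets only, no hits
    "aggs": {"top_user_agents": {"terms": {"field": "user_agent", "size": 10}}},
}

mapping_json = json.dumps(log_index_body)  # Python False/True serialize as JSON false/true
```

Since `doc_values` already defaults to true for keyword fields, spelling it out is optional; keeping it explicit documents the intent.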