Mapping: The Schema of Search
[!NOTE] This module explores the core principles of Mapping: The Schema of Search, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Hook: “Schemaless” is a Lie
Elasticsearch claims to be “schemaless”. Reality: If you don’t define a schema (Mapping), Elasticsearch guesses. And it usually guesses WRONG.
- String → `text` + `keyword` (Bloats disk by 2x).
- Timestamp → `date` (Good).
- Floating point → `float` (Bad if a smaller type such as `half_float` or `scaled_float` gives you all the precision you need).
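To make the guessing concrete, here is a minimal Python sketch that imitates the default dynamic-mapping rules for a few JSON value types. The function name `guess_mapping`, the simplified date regex, and the field names in `explicit_mapping` are assumptions for illustration; real date detection in Elasticsearch uses configurable date formats rather than this regex.

```python
import re

# Simplified stand-in for Elasticsearch's date detection (assumption: ISO-8601-like strings only).
ISO_DATE = re.compile(
    r"^\d{4}-\d{2}-\d{2}(T\d{2}:\d{2}(:\d{2}(\.\d+)?)?(Z|[+-]\d{2}:\d{2})?)?$"
)

def guess_mapping(value):
    """Return roughly what dynamic mapping would pick for a JSON value."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        if ISO_DATE.match(value):
            return {"type": "date"}
        # Plain strings get indexed twice: analyzed text plus a keyword sub-field.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise TypeError(f"unsupported value type: {type(value).__name__}")

# An explicit alternative (hypothetical field names): nothing is left to guessing.
explicit_mapping = {
    "dynamic": "strict",
    "properties": {
        "status": {"type": "keyword"},
        "created_at": {"type": "date"},
        "cpu_load": {"type": "half_float"},
    },
}

print(guess_mapping("active"))
print(guess_mapping("2024-05-01T12:30:00Z"))
print(guess_mapping(0.5))
```

With `"dynamic": "strict"`, Elasticsearch rejects documents that contain unmapped fields instead of guessing a type for them.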
2. Text vs Keyword (The Billion Dollar Question)
| Feature | `text` Field | `keyword` Field |
|---|---|---|
| Use Case | Full-text search (“find ‘fox’ in body”) | Exact filtering (“status=’active’”) |
| Analysis | Tokenized ("The Fox" → [the, fox]) | Untouched ("The Fox" → [The Fox]) |
| Data Structure | Inverted Index | Inverted Index + Doc Values |
| Sorting/Aggs | Disabled by default (Too much RAM) | Fast (Uses Doc Values) |
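The Analysis row can be sketched in a few lines of Python. This is a toy illustration, not the real analyzer: the function names are hypothetical, and the whitespace split only approximates how the standard analyzer finds word boundaries.

```python
def analyze_as_text(value):
    """Toy analyzer: lowercase, then split on whitespace."""
    return value.lower().split()

def analyze_as_keyword(value):
    """Keyword fields index the whole value as a single, unchanged term."""
    return [value]

text_terms = analyze_as_text("The Fox")
keyword_terms = analyze_as_keyword("The Fox")

print(text_terms)     # ['the', 'fox']
print(keyword_terms)  # ['The Fox']

# Looking up the word "fox" only succeeds against the tokenized terms;
# the exact value "The Fox" only succeeds against the keyword term.
print("fox" in text_terms, "fox" in keyword_terms)
print("The Fox" in text_terms, "The Fox" in keyword_terms)
```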
Golden Rule:
- Do you need to search for words inside it? → `text`
- Do you need to Filter, Sort, or Aggregate? → `keyword`
- Both? → Multi-field (`"title": { "type": "text", "fields": { "raw": { "type": "keyword" } } }`)
3. Storage Internals: How Mappings Hit the Disk
A. The Inverted Index (Search)
Maps Term → List<DocIDs>.
- Used for: `match`, `term` queries.
- Structure: Sorted list of terms (Trie/FST) pointing to Posting Lists.
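The Term → List<DocIDs> shape can be illustrated with a toy in-memory index. This is a sketch under simplifying assumptions: a plain sorted dict stands in for the FST term dictionary, the sample documents are made up, and `build_inverted_index` is a hypothetical helper.

```python
from collections import defaultdict

def build_inverted_index(docs, analyze):
    """Map each term to the sorted list of doc IDs that contain it."""
    postings = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyze(value):
            postings[term].add(doc_id)
    # Terms are kept in sorted order, each pointing at a sorted posting list.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

docs = {0: "The quick fox", 1: "The lazy dog", 2: "Fox and dog"}

# text-style indexing: lowercase tokens
text_index = build_inverted_index(docs, lambda s: s.lower().split())
# keyword-style indexing: the whole value is one term
keyword_index = build_inverted_index(docs, lambda s: [s])

print(text_index["fox"])             # [0, 2]
print(keyword_index.get("fox"))      # None
print(keyword_index["Fox and dog"])  # [2]
```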
B. Doc Values (Sorting & Aggregations)
Maps DocID → Value.
- This is a Columnar Store (like Parquet).
- Used for: `sort`, `aggs`, `script`.
- Hardware: Stored on disk, loaded into OS Page Cache.
- Performance: Sequential access pattern. Fast!
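The DocID → Value layout is easy to picture as one array per field, indexed by doc ID. The sketch below is a toy in-memory model with made-up data and hypothetical helper names; real doc values are compressed on-disk structures.

```python
from collections import Counter

# One column per field; the list position is the doc ID.
status_column = ["active", "inactive", "active", "active"]
price_column = [30.0, 10.0, 20.0, 40.0]

def sort_docs(doc_ids, column):
    """Sort matching docs by looking up each doc's value in the column."""
    return sorted(doc_ids, key=lambda doc_id: column[doc_id])

def terms_agg(doc_ids, column):
    """Count values across matching docs, most frequent first."""
    return Counter(column[doc_id] for doc_id in doc_ids).most_common()

hits = [0, 1, 2, 3]  # doc IDs returned by some query
print(sort_docs(hits, price_column))   # [1, 2, 0, 3]
print(terms_agg(hits, status_column))  # [('active', 3), ('inactive', 1)]
```

Neither operation touches the inverted index: both start from doc IDs and read values, which is the access pattern a column store is built for.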
C. BKD Trees (Numbers & Geo)
Elasticsearch 5.0+ changed everything for numbers.
- Old: Numbers were strings in the Inverted Index.
- New: Block K-Dimensional (BKD) Trees.
- Why: Optimized for range queries (`price > 100`).
- Speed: Faster than B-Trees for multi-dimensional data (e.g., Lat/Lon + Date).
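The idea behind a block K-D tree can be sketched as follows: points live in small leaf blocks, every node records the bounding box of what is below it, and a range query skips any subtree whose box misses the range. This is a simplified illustration and not Lucene's implementation; the round-robin choice of split dimension, the `leaf_size` parameter, and the sample data are assumptions.

```python
def build(points, leaf_size=2, depth=0):
    """Build a tree from a non-empty list of (coords_tuple, doc_id) pairs."""
    dims = len(points[0][0])
    node = {
        "min": tuple(min(p[0][d] for p in points) for d in range(dims)),
        "max": tuple(max(p[0][d] for p in points) for d in range(dims)),
    }
    if len(points) <= leaf_size:
        node["points"] = points  # leaf block
        return node
    axis = depth % dims  # assumption: cycle through dimensions
    ordered = sorted(points, key=lambda p: p[0][axis])
    mid = len(ordered) // 2
    node["left"] = build(ordered[:mid], leaf_size, depth + 1)
    node["right"] = build(ordered[mid:], leaf_size, depth + 1)
    return node

def range_query(node, low, high):
    """Return doc IDs whose coords satisfy low[d] <= coord[d] <= high[d]."""
    dims = len(low)
    # Prune: the node's bounding box does not overlap the query range.
    if any(node["max"][d] < low[d] or node["min"][d] > high[d] for d in range(dims)):
        return []
    if "points" in node:
        return [doc for coords, doc in node["points"]
                if all(low[d] <= coords[d] <= high[d] for d in range(dims))]
    return range_query(node["left"], low, high) + range_query(node["right"], low, high)

# One-dimensional example: integer prices, query price > 100.
prices = [((50,), 0), ((120,), 1), ((99,), 2), ((300,), 3), ((101,), 4)]
tree = build(prices)
print(sorted(range_query(tree, (101,), (float("inf"),))))  # [1, 3, 4]
```

The same two functions work for tuples of any dimension, for example `(lat, lon)` pairs, because the bounding-box test is applied per dimension.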
4. Interactive: Mapping Designer
See how your choice changes disk usage and capability.
[Interactive widget: pick a field type to see whether the Inverted Index and Doc Values (Columnar) are populated, and whether Search, Sort, and Aggs are available.]
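As a static stand-in for the designer, the following Python sketch reports which structures a `text` or `keyword` mapping builds and what that enables. It is deliberately simplified: `capabilities` and `describe` are hypothetical helpers, and the sketch ignores options such as `fielddata` on `text` fields and any ability to query fields that only have doc values.

```python
def capabilities(field_mapping):
    """Rough capability summary for a `text` or `keyword` field mapping."""
    field_type = field_mapping["type"]
    indexed = bool(field_mapping.get("index", True))
    # keyword fields have doc values by default; text fields do not support them.
    doc_values = field_type == "keyword" and bool(field_mapping.get("doc_values", True))
    return {
        "inverted_index": indexed,
        "doc_values": doc_values,
        "search": indexed,
        "sort": doc_values,
        "aggs": doc_values,
    }

def describe(name, field_mapping):
    """Report capabilities for a field and each of its multi-field sub-fields."""
    report = {name: capabilities(field_mapping)}
    for sub_name, sub_mapping in field_mapping.get("fields", {}).items():
        report[f"{name}.{sub_name}"] = capabilities(sub_mapping)
    return report

title = {"type": "text", "fields": {"raw": {"type": "keyword"}}}
for field, caps in describe("title", title).items():
    print(field, caps)
```

For the multi-field mapping, `title` comes out searchable but not sortable, while `title.raw` supports sorting and aggregations.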
5. Staff Tip: Disabling Fields
Most JSON logs have fields you never search.

    "user_agent": "Mozilla/5.0 ..."

Do you search this? Probably not.
Optimize:

    "user_agent": {
      "type": "keyword",
      "index": false,      // No Inverted Index (Save Disk)
      "doc_values": true   // Still Aggregatable (Top User Agents)
    }

- Result: 30% Disk Savings on high-volume log clusters.
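To show what "still aggregatable" looks like in practice, here is a hedged sketch of the mapping and a matching `terms` aggregation request, built as Python dicts and printed as JSON. The aggregation name `top_user_agents` and the bucket size of 10 are assumptions; sending the bodies to a cluster is left out.

```python
import json

# Mapping body: keep doc values for aggregations, skip the inverted index.
mapping = {
    "properties": {
        "user_agent": {"type": "keyword", "index": False, "doc_values": True}
    }
}

# Search body: no hits, only the ten most common user agents.
top_user_agents_request = {
    "size": 0,
    "aggs": {
        "top_user_agents": {"terms": {"field": "user_agent", "size": 10}}
    },
}

print(json.dumps(mapping, indent=2))
print(json.dumps(top_user_agents_request, indent=2))
```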