Internals: BKD Trees vs Inverted Index
[!NOTE] Internals: BKD Trees vs Inverted Index provides a comprehensive overview of the core concepts, ensuring you have a solid foundation before diving deeper into the technical details.
1. The Tale of Two Structures
Until 2016 (Lucene 6.0), Elasticsearch stored everything in the Inverted Index.
Numbers (123) were treated as string terms ("123").
Range queries (price > 50) were a disaster.
Enter the BKD Tree.
2. The Inverted Index (Text)
We covered this in Module 1. But let’s go deeper. It has 3 parts on disk:
- FST (Finite State Transducer): A compressed Trie in memory. Maps
Prefix → Disk Offset. - Term Dictionary: Sorted list of all terms.
- Posting List: The list of Doc IDs.
Compression: Uses Frame of Reference (FOR) and Delta Encoding to squeeze integers into bits.
[100, 105, 108] → [100, +5, +3].
3. The BKD Tree (Numbers/Geo)
Block K-Dimensional Tree. Based on kd-trees, but “Blocked” to be disk-friendly (aligns with 4KB OS Pages).
How it works:
- It recursively divides the space into 2 halves.
- Imagine a timeline (1D). It cuts it at the median.
- Then cuts the halves again.
- Leaf nodes store the actual Doc IDs.
Why it wins for Ranges:
- To find
price > 50, it can discard entire “left branches” of the tree that contain only values < 50. - It’s like a Binary Search on steroids.
4. Interactive: Range Query (Scan vs Tree)
Compare finding simple terms vs a range.
5. Summary
- Strings → Inverted Index (Trie + Postings).
- Numbers/Geo → BKD Tree (Spatial Indexing).
- Latency Impact: Mismatching these causes massive performance loss. Filtering a
keywordrange is slow compared to alongrange.