Internals: BKD Trees vs Inverted Index

[!NOTE] This module explores the core principles of Internals: BKD Trees vs Inverted Index, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Tale of Two Structures

Until 2016 (Lucene 6.0), Elasticsearch stored everything in the Inverted Index. Numbers (123) were treated as string terms ("123"). Range queries (price > 50) were a disaster.

Enter the BKD Tree.

2. The Inverted Index (Text)

We covered this in Module 1. But let’s go deeper. It has 3 parts on disk:

FST (Finite State Transducer): A compressed Trie in memory. Maps Prefix → Disk Offset.
Term Dictionary: Sorted list of all terms.
Posting List: The list of Doc IDs.

Compression: Uses Frame of Reference (FOR) and Delta Encoding to squeeze integers into bits. [100, 105, 108] → [100, +5, +3].

3. The BKD Tree (Numbers/Geo)

Block K-Dimensional Tree. Based on kd-trees, but “Blocked” to be disk-friendly (aligns with 4KB OS Pages).

How it works:

It recursively divides the space into 2 halves.
Imagine a timeline (1D). It cuts it at the median.
Then cuts the halves again.
Leaf nodes store the actual Doc IDs.

Why it wins for Ranges:

To find price > 50, it can discard entire “left branches” of the tree that contain only values < 50.
It’s like a Binary Search on steroids.

4. Interactive: Range Query (Scan vs Tree)

Compare finding simple terms vs a range.

Select a method...

5. Summary

Strings → Inverted Index (Trie + Postings).
Numbers/Geo → BKD Tree (Spatial Indexing).
Latency Impact: Mismatching these causes massive performance loss. Filtering a keyword range is slow compared to a long range.