Internals: BKD Trees vs Inverted Index

[!NOTE] This module explores the core principles of Internals: BKD Trees vs Inverted Index, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Tale of Two Structures

Until 2016 (Lucene 6.0), Elasticsearch stored everything in the Inverted Index. Numbers (123) were treated as string terms ("123"). Range queries (price > 50) were a disaster.

Enter the BKD Tree.


2. The Inverted Index (Text)

We covered this in Module 1. But let’s go deeper. It has 3 parts on disk:

  1. FST (Finite State Transducer): A compressed Trie in memory. Maps Prefix → Disk Offset.
  2. Term Dictionary: Sorted list of all terms.
  3. Posting List: The list of Doc IDs.

Compression: Uses Frame of Reference (FOR) and Delta Encoding to squeeze integers into bits. [100, 105, 108][100, +5, +3].


3. The BKD Tree (Numbers/Geo)

Block K-Dimensional Tree. Based on kd-trees, but “Blocked” to be disk-friendly (aligns with 4KB OS Pages).

How it works:

  1. It recursively divides the space into 2 halves.
  2. Imagine a timeline (1D). It cuts it at the median.
  3. Then cuts the halves again.
  4. Leaf nodes store the actual Doc IDs.

Why it wins for Ranges:

  • To find price > 50, it can discard entire “left branches” of the tree that contain only values < 50.
  • It’s like a Binary Search on steroids.

4. Interactive: Range Query (Scan vs Tree)

Compare finding simple terms vs a range.

Select a method...

5. Summary

  • Strings → Inverted Index (Trie + Postings).
  • Numbers/Geo → BKD Tree (Spatial Indexing).
  • Latency Impact: Mismatching these causes massive performance loss. Filtering a keyword range is slow compared to a long range.