The Structures That Hold The World
When you do SELECT * FROM users WHERE id = 5, what happens?
The database doesn’t scan the whole disk. It uses a Tree. But not a Binary Tree.
1. The Hardware Constraint: Disk I/O
- RAM Access: Nanoseconds.
- Disk Access: Milliseconds (1,000,000x slower).
- Block Size: Disks transfer data in fixed-size blocks (e.g., 4KB or 16KB). You pay for a whole block even if you only need one byte of it.
A standard Binary Tree scatters its nodes across memory, one key per node. Reading Node -> Left -> Left might trigger 3 separate disk seeks for just 3 comparisons. Too slow.
2. B-Trees and B+Trees
A B-Tree is a “fat” tree. Each node contains many keys (e.g., 100) and many children.
- Height: Very low. A height-3 B-Tree can store millions of rows.
- Locality: One node = One Disk Page. Reading a node gets 100 keys in 1 seek.
B+Tree (The Standard)
Used by MySQL (InnoDB) and PostgreSQL.
- Internal Nodes: Only keys (for routing).
- Leaf Nodes: Hold the actual data (or pointers to the rows), linked together like a list for fast range scans.
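Those sibling links are what make range scans cheap: descend once to the first matching leaf, then walk sideways. A minimal sketch (class and field names are made up for illustration):

```python
class Leaf:
    """A B+Tree leaf page: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys, next_leaf=None):
        self.keys = keys        # sorted keys (real pages hold key -> row data)
        self.next = next_leaf   # sibling pointer enables sequential range scans

def range_scan(start_leaf, lo, hi):
    """Yield every key in [lo, hi] by walking the leaf chain once."""
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return          # keys are globally sorted: we can stop early
            if k >= lo:
                yield k
        leaf = leaf.next        # no need to re-descend from the root

# Leaves as the tree would link them: [1,3] -> [5,7] -> [9,11]
l3 = Leaf([9, 11])
l2 = Leaf([5, 7], l3)
l1 = Leaf([1, 3], l2)
print(list(range_scan(l1, 3, 9)))  # [3, 5, 7, 9]
```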
[!NOTE] The Math of Fanout: If a B+Tree has a branching factor (fanout) of 100, a tree of height 3 can store $100^3 = 1,000,000$ keys. This means finding any record out of a million takes at most 3 disk seeks!
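The note's arithmetic generalizes: the worst-case number of seeks is the tree height, the smallest $h$ with $\text{fanout}^h \ge N$. A quick sketch using exact integer arithmetic:

```python
def height_for(num_keys, fanout):
    """Smallest height h with fanout**h >= num_keys (worst-case disk seeks)."""
    height = 0
    capacity = 1
    while capacity < num_keys:
        capacity *= fanout  # each extra level multiplies capacity by the fanout
        height += 1
    return height

print(height_for(1_000_000, 100))  # 3 seeks for a million keys, as in the note
print(height_for(10**9, 100))      # 5
```

This is why databases fight to keep fanout high: going from a billion keys to a million keys only shaves off two levels.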
3. Interactive: B+Tree Split
Visualize how a node “splits” when full, pushing the median up.
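The split can be sketched in a few lines (assuming a toy max of 4 keys per node; real pages hold hundreds):

```python
def split_leaf(keys):
    """Split an overfull sorted leaf in two. In a B+Tree the separator key is
    *copied* up to the parent for routing (not moved, as in a plain B-Tree),
    because leaves must keep all the data."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    separator = right[0]  # routing key pushed up to the parent node
    return left, right, separator

# Inserting 9 into a full 4-key leaf [3, 5, 7, 8] overflows it, forcing a split:
left, right, sep = split_leaf([3, 5, 7, 8, 9])
print(left, right, sep)  # [3, 5] [7, 8, 9] 7
```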
4. LSM Trees (Log-Structured Merge)
Used by Cassandra, RocksDB, and LevelDB. (Kafka's storage is also an append-only log, though not a full LSM tree.)
- Problem: B-Trees update pages in place, producing random writes, which are slow on HDDs and cause write amplification on SSDs. If you have millions of writes per second, B-Trees become a bottleneck.
- Solution: Write sequentially to an append-only log.
Anatomy of an LSM Tree Read/Write
- Write-Ahead Log (WAL): Every write is immediately appended to a WAL on disk for durability (crash recovery).
- MemTable: The write is then stored in an in-memory balanced tree (like a Red-Black tree).
- Flush to SSTable: When the MemTable gets full (e.g., a few Megabytes), it is flushed sequentially to disk as an immutable SSTable (Sorted String Table).
- Compaction: Over time, you accumulate many SSTables. A background process periodically merges and sorts them into larger SSTables, removing deleted keys (Tombstones).
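The four steps above can be sketched end to end. This is a toy model, not real Cassandra/RocksDB code: the class name, the `memtable_limit` threshold, and the use of a `dict` (instead of a Red-Black tree or skiplist) and in-memory lists (instead of files) are all simplifications:

```python
import json

class MiniLSM:
    """Toy LSM write path: WAL append -> in-memory MemTable -> sorted flush."""
    def __init__(self, memtable_limit=3):
        self.wal = []        # stands in for an append-only file on disk
        self.memtable = {}   # stands in for a sorted in-memory structure
        self.sstables = []   # each flush produces one immutable sorted run
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append(json.dumps({"k": key, "v": value}))  # 1. durability first
        self.memtable[key] = value                           # 2. buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                                     # 3. sequential flush

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))  # immutable, sorted
        self.memtable.clear()

    def compact(self):
        """4. Merge all runs into one; the newest value wins for duplicate keys."""
        merged = {}
        for run in self.sstables:        # older runs first, so updates overwrite
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:             # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then SSTables, newest to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

db = MiniLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    db.put(k, v)             # third put triggers a flush; "a" is later updated
print(db.get("a"))           # 9 (from the MemTable, shadowing the old SSTable value)
print(db.get("b"))           # 2 (found in an SSTable)
```

Note how `get` must check the MemTable and then every SSTable from newest to oldest; this read penalty is exactly what the Bloom Filters below exist to soften.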
[!TIP] War Story: Discord's Migration. Discord originally stored billions of messages in MongoDB (B-Tree indexes). As they grew, random writes crippled performance. They migrated to Cassandra (LSM Tree) because appending to a sequential log is orders of magnitude faster than updating random B-Tree pages for their heavily write-skewed workload.
How do we Read Fast? (Bloom Filters)
Since a read might have to check the MemTable and multiple SSTables, LSM Trees use Bloom Filters (a probabilistic data structure). A Bloom Filter can tell us instantly: "This key is definitely not in this SSTable," saving a costly disk read. It can return false positives ("possibly present" when the key is absent), but never false negatives.
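A minimal Bloom Filter sketch. Salting Python's built-in `hash` stands in for the independent hash functions (e.g., MurmurHash) a real engine would use, and the bit count is arbitrary:

```python
class BloomFilter:
    """Probabilistic set membership: answers 'definitely absent' or
    'possibly present'. False positives are possible, false negatives never."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Salting one hash function simulates k independent hash functions.
        return [hash((i, key)) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # Any clear bit proves the key was never added: skip the disk read.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for key in ["user:1", "user:2", "user:3"]:
    bf.add(key)

print(bf.might_contain("user:2"))    # True: added keys are always found
print(bf.might_contain("user:999"))  # almost certainly False: skip this SSTable
```

The trade-off is tunable: more bits and more hashes lower the false-positive rate at the cost of memory, which is why engines like RocksDB expose bits-per-key as a configuration knob.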
B+Tree vs LSM Tree Comparison
| Feature | B+Tree (MySQL, Postgres) | LSM Tree (Cassandra, RocksDB) |
|---|---|---|
| Optimized For | Read-heavy workloads | Write-heavy workloads |
| Write Pattern | Random Writes (in-place updates) | Sequential Writes (append-only) |
| Read Speed | Very Fast ($O(\log N)$ seeks) | Slower (may check multiple SSTables) |
| Space Amplification | High (fragmentation in pages) | Low (Compaction cleans it up) |