The Structures That Hold The World
When you do SELECT * FROM users WHERE id = 5, what happens?
The database doesn’t scan the whole disk. It uses a Tree. But not a Binary Tree.
1. The Hardware Constraint: Disk I/O
- RAM Access: Nanoseconds.
- Disk Access: Milliseconds (1,000,000x slower).
- Block Size: Disks transfer data in fixed-size blocks (e.g., 4KB or 16KB). You pay for a whole block even if you only need one byte of it.
A standard Binary Tree scatters its nodes across memory, one key per node. Reading Node -> Left -> Left might trigger 3 separate disk seeks for just 3 comparisons. Too slow.
2. B-Trees and B+Trees
A B-Tree is a “fat” tree. Each node contains many keys (e.g., 100) and many children.
- Height: Very low. A height-3 B-Tree can store millions of rows.
- Locality: One node = One Disk Page. Reading a node gets 100 keys in 1 seek.
B+Tree (The Standard)
Used by MySQL (InnoDB) and PostgreSQL.
- Internal Nodes: Only keys (for routing).
- Leaf Nodes: Hold the actual data (or pointers to the rows), linked together like a list for fast range scans.
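Those sibling links are what make range scans cheap: descend once to the first matching leaf, then walk sideways. A minimal sketch (class and field names are made up for illustration):

```python
class Leaf:
    """A B+Tree leaf page: sorted keys plus a pointer to the next leaf."""
    def __init__(self, keys, next_leaf=None):
        self.keys = keys        # sorted keys (real pages hold key -> row data)
        self.next = next_leaf   # sibling pointer enables sequential range scans

def range_scan(start_leaf, lo, hi):
    """Yield every key in [lo, hi] by walking the leaf chain once."""
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > hi:
                return          # keys are globally sorted: we can stop early
            if k >= lo:
                yield k
        leaf = leaf.next        # no need to re-descend from the root

# Leaves as the tree would link them: [1,3] -> [5,7] -> [9,11]
l3 = Leaf([9, 11])
l2 = Leaf([5, 7], l3)
l1 = Leaf([1, 3], l2)
print(list(range_scan(l1, 3, 9)))  # [3, 5, 7, 9]
```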
[!NOTE] The Math of Fanout: If a B+Tree has a branching factor (fanout) of 100, a tree of height 3 can store $100^3 = 1,000,000$ keys. This means finding any record out of a million takes at most 3 disk seeks!
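The note's arithmetic generalizes: the worst-case number of seeks is the tree height, the smallest $h$ with $\text{fanout}^h \ge N$. A quick sketch using exact integer arithmetic:

```python
def height_for(num_keys, fanout):
    """Smallest height h with fanout**h >= num_keys (worst-case disk seeks)."""
    height = 0
    capacity = 1
    while capacity < num_keys:
        capacity *= fanout  # each extra level multiplies capacity by the fanout
        height += 1
    return height

print(height_for(1_000_000, 100))  # 3 seeks for a million keys, as in the note
print(height_for(10**9, 100))      # 5
```

This is why databases fight to keep fanout high: going from a billion keys to a million keys only shaves off two levels.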
3. Interactive: B+Tree Split
Visualize how a node “splits” when full, pushing the median up.
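The split can be sketched in a few lines (assuming a toy max of 4 keys per node; real pages hold hundreds):

```python
def split_leaf(keys):
    """Split an overfull sorted leaf in two. In a B+Tree the separator key is
    *copied* up to the parent for routing (not moved, as in a plain B-Tree),
    because leaves must keep all the data."""
    mid = len(keys) // 2
    left, right = keys[:mid], keys[mid:]
    separator = right[0]  # routing key pushed up to the parent node
    return left, right, separator

# Inserting 9 into a full 4-key leaf [3, 5, 7, 8] overflows it, forcing a split:
left, right, sep = split_leaf([3, 5, 7, 8, 9])
print(left, right, sep)  # [3, 5] [7, 8, 9] 7
```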
4. LSM Trees (Log-Structured Merge)
Used by Cassandra, RocksDB, and LevelDB. (Kafka's storage is also an append-only log, though not a full LSM tree.)
- Problem: B-Trees update pages in place, producing random writes, which are slow on HDDs and cause write amplification on SSDs. If you have millions of writes per second, B-Trees become a bottleneck.
- Solution: Write sequentially to an append-only log.
Anatomy of an LSM Tree Read/Write
- Write-Ahead Log (WAL): Every write is immediately appended to a WAL on disk for durability (crash recovery).
- MemTable: The write is then stored in an in-memory balanced tree (like a Red-Black tree).
- Flush to SSTable: When the MemTable gets full (e.g., a few Megabytes), it is flushed sequentially to disk as an immutable SSTable (Sorted String Table).
- Compaction: Over time, you accumulate many SSTables. A background process periodically merges and sorts them into larger SSTables, removing deleted keys (Tombstones).
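The four steps above can be sketched end to end. This is a toy model, not real Cassandra/RocksDB code: the class name, the `memtable_limit` threshold, and the use of a `dict` (instead of a Red-Black tree or skiplist) and in-memory lists (instead of files) are all simplifications:

```python
import json

class MiniLSM:
    """Toy LSM write path: WAL append -> in-memory MemTable -> sorted flush."""
    def __init__(self, memtable_limit=3):
        self.wal = []        # stands in for an append-only file on disk
        self.memtable = {}   # stands in for a sorted in-memory structure
        self.sstables = []   # each flush produces one immutable sorted run
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.wal.append(json.dumps({"k": key, "v": value}))  # 1. durability first
        self.memtable[key] = value                           # 2. buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()                                     # 3. sequential flush

    def flush(self):
        self.sstables.append(sorted(self.memtable.items()))  # immutable, sorted
        self.memtable.clear()

    def compact(self):
        """4. Merge all runs into one; the newest value wins for duplicate keys."""
        merged = {}
        for run in self.sstables:        # older runs first, so updates overwrite
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]

    def get(self, key):
        if key in self.memtable:             # newest data first
            return self.memtable[key]
        for run in reversed(self.sstables):  # then SSTables, newest to oldest
            for k, v in run:
                if k == key:
                    return v
        return None

db = MiniLSM()
for k, v in [("a", 1), ("b", 2), ("c", 3), ("a", 9)]:
    db.put(k, v)             # third put triggers a flush; "a" is later updated
print(db.get("a"))           # 9 (from the MemTable, shadowing the old SSTable value)
print(db.get("b"))           # 2 (found in an SSTable)
```

Note how `get` must check the MemTable and then every SSTable from newest to oldest; this read penalty is exactly what the Bloom Filters below exist to soften.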
[!TIP] War Story: Discord's Migration. Discord originally stored billions of messages in MongoDB (B-Tree indexes). As they grew, random writes crippled performance. They migrated to Cassandra (LSM Tree) because appending to a sequential log is orders of magnitude faster than updating random B-Tree pages for their heavily write-skewed workload.
How do we Read Fast? (Bloom Filters)
Since a read might have to check the MemTable and multiple SSTables, LSM Trees use Bloom Filters (a probabilistic data structure). A Bloom Filter can tell us instantly: "This key is definitely not in this SSTable," saving a costly disk read. It can return false positives ("possibly present" when the key is absent), but never false negatives.
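A minimal Bloom Filter sketch. Salting Python's built-in `hash` stands in for the independent hash functions (e.g., MurmurHash) a real engine would use, and the bit count is arbitrary:

```python
class BloomFilter:
    """Probabilistic set membership: answers 'definitely absent' or
    'possibly present'. False positives are possible, false negatives never."""
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, key):
        # Salting one hash function simulates k independent hash functions.
        return [hash((i, key)) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = True

    def might_contain(self, key):
        # Any clear bit proves the key was never added: skip the disk read.
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for key in ["user:1", "user:2", "user:3"]:
    bf.add(key)

print(bf.might_contain("user:2"))    # True: added keys are always found
print(bf.might_contain("user:999"))  # almost certainly False: skip this SSTable
```

The trade-off is tunable: more bits and more hashes lower the false-positive rate at the cost of memory, which is why engines like RocksDB expose bits-per-key as a configuration knob.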
B+Tree vs LSM Tree Comparison
| Feature | B+Tree (MySQL, Postgres) | LSM Tree (Cassandra, RocksDB) |
|---|---|---|
| Optimized For | Read-heavy workloads | Write-heavy workloads |
| Write Pattern | Random Writes (in-place updates) | Sequential Writes (append-only) |
| Read Speed | Very Fast ($O(\log N)$ seeks) | Slower (may check multiple SSTables) |
| Space Amplification | High (fragmentation in pages) | Low (Compaction cleans it up) |