The Structures That Hold The World

When you do SELECT * FROM users WHERE id = 5, what happens? The database doesn’t scan the whole disk. It uses a Tree. But not a Binary Tree.

1. The Hardware Constraint: Disk I/O

  • RAM Access: Nanoseconds.
  • Disk Access: Milliseconds (1,000,000x slower).
  • Block Size: Disks read data in blocks (e.g., 4KB or 16KB).

A standard Binary Tree spreads nodes across memory. Reading Node → Left → Left might trigger 3 disk seeks. Too slow.

2. B-Trees and B+Trees

A B-Tree is a “fat” tree. Each node contains many keys (e.g., 100) and many children.

  • Height: Very low. A height-3 B-Tree can store millions of rows.
  • Locality: One node = One Disk Page. Reading a node gets 100 keys in 1 seek.

B+Tree (The Standard)

Used by MySQL (InnoDB) and PostgreSQL.

  • Internal Nodes: Only keys (for routing).
  • Leaf Nodes: Actual Data. Linked together for fast range scans.

3. Interactive: B+Tree Split

Visualize how a node “splits” when full, pushing the median up.

Tree Empty.

4. LSM Trees (Log-Structured Merge)

Used by Cassandra, RocksDB, Kafka.

  • Problem: B-Trees do random writes (slow on HDD/SSD).
  • Solution: Write sequentially to an append-only log (MemTable).
  • Flush: When RAM is full, flush to disk as a sorted file (SSTable).
  • Merge: Background process merges SSTables.

[!TIP] Read vs Write:

  • B-Tree: Balanced. Good read, decent write.
  • LSM Tree: Fast write (append-only), slower read (check MemTable + SSTables).