Database Trees


Database Trees

[!NOTE] This module explores the core principles of Database Trees, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Conflict: Read vs. Write

Databases handle two main workloads:

  1. OLTP (Transactions): Fast lookups, row updates. (SQL).
  2. OLAP/Big Data: Massive ingestion, analytics. (NoSQL/Cassandra).

The data structure you choose determines which workload you survive.


2. B-Trees (The SQL Standard)

Used by: PostgreSQL, MySQL (InnoDB), Oracle.

  • Structure: A short, fat tree. High fan-out (100+ keys per node).
  • Why?: Minimizes Disk Seeks. If the tree height is 3, you only need 3 disk jumps to find any record in billions.
  • Pro: Blazing fast Reads (O(\logB N)).
  • Con: Random Writes. Updating a key might require jumping to a random disk sector to update the leaf. Bad for heavy write loads.

3. LSM Trees (The Write-Heavy Hero)

Used by: Cassandra, RocksDB, LevelDB.

  • Log-Structured Merge Tree.
  • Mechanism:
    1. Writes go to an in-memory buffer (MemTable) → Instant RAM speed.
    2. When full, flush to disk as a sorted file (SSTable) → Sequential Write (Fast!).
    3. Reads checks MemTable, then SSTables.
  • Pro: Insane Write Throughput (Sequential I/O).
  • Con: Slower Reads (Need to check multiple files/levels).

4. Hardware Truth: Random vs. Sequential I/O

Why does this matter?

  • Sequential Write: ~500 MB/s (HDD/SSD). The head just spins and writes.
  • Random Write: ~1 MB/s (HDD). The disk head must physically move (seek) to the sector.
  • Impact: LSM Trees turn random database updates into sequential log appends, making them 10s or 100s of times faster for writing than B-Trees.

5. Deep Dive Strategy Lab: Database Trees

Intuition Through Analogy

Think of this chapter like running a high-traffic consumer app. The goal is not to memorize a fixed trick, but to repeatedly answer:

  1. What is the bottleneck?
  2. Which constraint dominates (time, memory, latency, correctness)?
  3. Which representation makes the bottleneck easier to eliminate?