Courses / dsa

Database Trees

14 Real World Last updated: Feb 27, 2026

Database Trees

Database Trees

[!NOTE] This module explores the core principles of Database Trees, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Conflict: Read vs. Write

Databases handle two main workloads:

OLTP (Transactions): Fast lookups, row updates. (SQL).
OLAP/Big Data: Massive ingestion, analytics. (NoSQL/Cassandra).

The data structure you choose determines which workload you survive.

2. B-Trees (The SQL Standard)

Used by: PostgreSQL, MySQL (InnoDB), Oracle.

Structure: A short, fat tree. High fan-out (100+ keys per node).
Why?: Minimizes Disk Seeks. If the tree height is 3, you only need 3 disk jumps to find any record in billions.
Pro: Blazing fast Reads (O(\log_B N)).
Con: Random Writes. Updating a key might require jumping to a random disk sector to update the leaf. Bad for heavy write loads.

3. LSM Trees (The Write-Heavy Hero)

Used by: Cassandra, RocksDB, LevelDB.

Log-Structured Merge Tree.
Mechanism:
1. Writes go to an in-memory buffer (MemTable) → Instant RAM speed.
2. When full, flush to disk as a sorted file (SSTable) → Sequential Write (Fast!).
3. Reads checks MemTable, then SSTables.
Pro: Insane Write Throughput (Sequential I/O).
Con: Slower Reads (Need to check multiple files/levels).

4. Hardware Truth: Random vs. Sequential I/O

Why does this matter?

Sequential Write: ~500 MB/s (HDD/SSD). The head just spins and writes.
Random Write: ~1 MB/s (HDD). The disk head must physically move (seek) to the sector.
Impact: LSM Trees turn random database updates into sequential log appends, making them 10s or 100s of times faster for writing than B-Trees.

5. Deep Dive Strategy Lab: Database Trees

Intuition Through Analogy

Think of this chapter like running a high-traffic consumer app. The goal is not to memorize a fixed trick, but to repeatedly answer:

What is the bottleneck?
Which constraint dominates (time, memory, latency, correctness)?
Which representation makes the bottleneck easier to eliminate?

Previous Lesson ← Social Networks & Graph Partitioning Next Lesson Distributed Scheduling Systems →