Module Review: Database Basics

Congratulations on completing Module 4! This is one of the most technical and critical modules in System Design. Let’s solidify your knowledge.

1. Key Takeaways

  • SQL vs NoSQL: SQL (B-Trees) excels at complex reads and ACID, while NoSQL (LSM Trees) dominates high-throughput writes and horizontal scaling.
  • ACID Guarantees: Relational databases ensure Atomicity, Consistency, Isolation, and Durability, prioritizing safety over availability.
  • BASE Model: Distributed NoSQL systems embrace Basically Available, Soft state, and Eventual consistency for massive scale.
  • Isolation Levels: Trade-offs exist between preventing anomalies (Dirty/Phantom Reads) and database performance using MVCC.
  • WAL (Write-Ahead Logging): Databases survive crashes by appending changes to a sequential log on disk before updating main memory.

2. Cheat Sheet

Concept Definition Key Trade-off
SQL (B-Tree) Relational, Normalized, ACID. Great for Reads. Slow Writes (Random I/O). Hard to Shard.
NoSQL (LSM Tree) Flexible, Denormalized, BASE. Great for Writes (Sequential I/O). Weak for complex Reads/Joins.
ACID Atomicity, Consistency, Isolation, Durability. Safety vs Performance.
BASE Basically Available, Soft state, Eventual consistency. Availability vs Consistency.
Isolation Levels Read Uncommitted → Serializable. Preventing anomalies (Dirty, Phantom, Skew) costs speed.
WAL Log changes to disk before RAM. Durability. Converts Random I/O to Sequential I/O.
Vector Clocks Tracks causality [A:1, B:2]. Detects concurrent updates (conflicts) without data loss.
MVCC Multi-Version Concurrency Control. Readers don’t block Writers. Prevents Dirty Reads implicitly.

3. Interactive Flashcards

Test yourself. Click a card to flip it.

What is the "Golden Rule" of Distributed Consistency?

(Hint: R, W, N)

R + W > N

If Read Quorum + Write Quorum > Total Nodes, you are guaranteed Strong Consistency.

Why are LSM Trees faster for writes than B-Trees?

Sequential I/O

LSM Trees append to the end of a file (Sequential), avoiding the slow "Seek Time" (Random I/O) of B-Tree rebalancing.

What is a "Phantom Read"?

A Ghost Row

A transaction reads a set of rows matching a criteria (Age > 10). Another transaction inserts a *new* row that matches. Re-reading yields a different count.

What system call guarantees data is physically on the disk?

fsync()

It flushes the OS file system buffer to the physical disk platter, ensuring durability against power loss.

What is "Write Skew"?

Logic Bug in Snapshot Isolation

Two transactions read the same state (e.g., "2 Doctors on call"), make a decision, and update different rows, breaking a global invariant (0 Doctors on call).

How do Vector Clocks solve conflicts?

Causality Tracking

By tracking version counters per node `[A:1, B:1]`, we can detect if updates happened concurrently and ask the app to merge them (instead of blindly overwriting).

When should you use NewSQL?

Scale + ACID

When you need the horizontal scaling of NoSQL but the strict financial correctness (ACID Transactions) of SQL. E.g., Global Payments.

Why is 2PC called the "Availability Killer"?

Blocking Protocol

If the Coordinator crashes, all participants are stuck holding locks indefinitely, freezing the system. Use Sagas instead.

Why are LSM Trees slower for reads?

Read Penalty

The DB must check the MemTable (RAM) and potentially multiple SSTables (Disk) to find the key. Bloom Filters help speed this up.

What is the risk of "Group Commit"?

Data Loss

Transactions are buffered in RAM to increase throughput. If power fails before the buffer flushes to disk, those transactions are lost forever.

4. Quick Revision

  • Storage: B-Trees (Reads/SQL) vs LSM Trees (Writes/NoSQL).
  • Consistency: ACID (Safety/Banking) vs BASE (Speed/Social).
  • Transactions: Atomicity (All or nothing), Isolation (Anomalies vs Performance), Durability (WAL on disk).
  • Distributed: 2PC (Blocking) vs Sagas (Eventual). Quorum (R + W > N).
  • Anomalies: Dirty Read, Phantom Read, Write Skew (Snapshot Isolation).

5. Next Steps

Move on to the next module to learn about traffic distribution: Load Balancing. Review terms in the System Design Glossary.