Module 15: Review & Cheat Sheet
1. Quick Revision
- Dynamo: The design that shaped much of NoSQL. Prioritizes Availability (AP). Combined Consistent Hashing, Vector Clocks, and Gossip in one production key-value store.
- Cassandra: The hybrid. BigTable Data Model (Wide Column) + Dynamo Architecture (Ring). Optimized for Writes (LSM Trees).
- BigTable: The structured map. Master-Slave architecture. Used for Google Search/Maps. Uses SSTables on GFS.
- MapReduce: Distributed computing. Map (Filter/Transform) -> Shuffle (Group by Key) -> Reduce (Aggregate).
- Bloom Filters: Probabilistic set. “Maybe in set” or “Definitely not”. Used to avoid expensive disk reads.
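The Map -> Shuffle -> Reduce flow above can be sketched in a single process with a word count. This is only an illustration: the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) are made up for this sketch and are not any framework's API, and a real job would spread each phase across many machines.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, sum(values)

def word_count(documents):
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    groups = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in groups.items())

counts = word_count(["the quick fox", "the lazy dog", "The end"])
```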
2. Cheat Sheet: Database Comparison
| Feature | Dynamo (Amazon) | Cassandra (Facebook/Apache) | BigTable (Google) |
|---|---|---|---|
| Data Model | Key-Value (Blob) | Wide Column (2D Map) | Wide Column (Sparse Map) |
| Architecture | P2P (Leaderless Ring) | P2P (Leaderless Ring) | Master-Slave |
| Consistency | Eventual (AP) | Tunable (AP or CP) | Strong (CP) |
| Conflict Res. | Vector Clocks (Client Side) | LWW (Last Write Wins) | Strong (Single Row Atomic) |
| Storage Engine | Pluggable (BDB, etc.) | LSM Tree | SSTable (LSM-like) |
| Gossip? | Yes | Yes | No (Uses Chubby/Master) |
| Primary Use | Shopping Cart, Session | Activity Feed, Metrics | Analytics, Search Index |
3. Interactive Flashcards
Test your knowledge.
What is a Tombstone?
A Deletion Marker
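In LSM Trees (Cassandra), you can't delete from immutable SSTables. Instead, you write a "Tombstone" that marks the data as deleted and hides older values on reads. The tombstone and the data it shadows are purged during Compaction, once the grace period (gc_grace_seconds in Cassandra) has passed.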
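A minimal sketch of how a tombstone might shadow older data and then vanish at compaction. It assumes each SSTable is a plain dict, that tables are listed oldest first, and that compaction covers every table holding the key; it ignores timestamps and any grace period. `TOMBSTONE`, `read`, and `compact` are hypothetical names for this sketch.

```python
TOMBSTONE = object()  # sentinel standing in for a deletion marker

def read(key, sstables):
    """Look up a key, checking the newest SSTable first."""
    for table in reversed(sstables):
        if key in table:
            value = table[key]
            return None if value is TOMBSTONE else value
    return None

def compact(sstables):
    """Merge SSTables (oldest first) so newer entries overwrite older ones,
    then drop every key whose newest entry is a tombstone."""
    merged = {}
    for table in sstables:
        merged.update(table)
    return {k: merged[k] for k in sorted(merged) if merged[k] is not TOMBSTONE}

old = {"alice": "v1", "bob": "v1"}
new = {"bob": TOMBSTONE, "carol": "v1"}   # "bob" was deleted after the first flush
compacted = compact([old, new])
```

Before compaction, `read("bob", [old, new])` already returns `None` even though the old value is still on disk; compaction is what reclaims the space.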
Vector Clock
Causality Tracker
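A list of (Node, Counter) pairs used in Dynamo to detect conflicting updates. Example: [A:1, B:2]. If neither of two clocks is greater than or equal to the other in every position, the updates are concurrent and must be reconciled.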
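One way to compare two vector clocks, sketched with dicts that map node name to counter (a missing node counts as 0). The helper names `increment` and `compare` are assumptions for illustration, not Dynamo's API.

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter bumped by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def compare(a, b):
    """Classify clock a relative to clock b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"       # a happened before b
    if b_le_a:
        return "after"        # a happened after b
    return "concurrent"       # conflict: neither clock dominates

v1 = {"A": 1, "B": 2}
v2 = increment(v1, "A")   # write handled by node A
v3 = increment(v1, "B")   # independent write handled by node B
```

Here `v2` and `v3` both descend from `v1` but not from each other, so `compare(v2, v3)` reports a conflict that the client has to resolve.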
Bloom Filter Guarantee
No False Negatives
If a Bloom Filter says "No", the item is DEFINITELY not in the set. If it says "Yes", it MIGHT be (False Positive).
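A small Bloom filter sketch to show where the guarantee comes from. The sizes (`num_bits=1024`, `num_hashes=3`) and the use of salted SHA-256 as the hash family are arbitrary choices for this example; production filters size these from the expected item count and target false-positive rate.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # Any unset bit proves the item was never added (no false negatives).
        return all(self.bits[pos] for pos in self._positions(item))

seen = BloomFilter()
for key in ("user:1", "user:2", "user:3"):
    seen.add(key)
```

Every added key sets all of its bits, so `might_contain` can never answer "no" for it. An absent key answers "yes" only if other keys happened to set all of its bits, which is the false-positive case.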
Hinted Handoff
Temporary Storage
If a node is down, a neighbor accepts the write with a "hint" to replay it when the target node comes back online. Ensures Availability.
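A toy model of the hint-and-replay idea, assuming two in-memory nodes and a caller that already knows which node is the intended target. The `Node` class, `write`, and `replay_hints` are hypothetical names; real systems also expire hints and detect recovery through failure detection rather than a flag.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []   # (target_name, key, value) held for offline nodes

def write(key, value, target, fallback):
    """Write to the target, or park a hint on the fallback if it is down."""
    if target.up:
        target.data[key] = value
    else:
        fallback.hints.append((target.name, key, value))

def replay_hints(holder, nodes_by_name):
    """Deliver stored hints to targets that are back online; keep the rest."""
    remaining = []
    for target_name, key, value in holder.hints:
        target = nodes_by_name[target_name]
        if target.up:
            target.data[key] = value
        else:
            remaining.append((target_name, key, value))
    holder.hints = remaining

a, b = Node("A"), Node("B")
nodes = {"A": a, "B": b}
a.up = False
write("cart", ["book"], target=a, fallback=b)   # accepted despite A being down
a.up = True
replay_hints(b, nodes)
```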
MapReduce Combiner
Local Reducer
Runs on the Mapper node to pre-aggregate data (e.g., sum counts) before sending over the network. Reduces bandwidth usage.
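A rough sketch of the saving a combiner gives for word counting. The functions `map_words` and `combine` are illustrative only. Note that a combiner is safe only when the reduce operation is associative and commutative (sum is; a plain average is not).

```python
from collections import Counter

def map_words(line):
    """Mapper: one (word, 1) record per word."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate a single mapper's output before the shuffle."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())

mapper_output = map_words("a b a a b c a")
combined = combine(mapper_output)
```

The mapper emits 7 records, but only 3 leave the node after combining, and the reducer still arrives at the same totals.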
MemTable vs SSTable
RAM vs Disk
MemTable is the In-Memory buffer (Mutable). SSTable is the On-Disk file (Immutable). Data moves MemTable -> SSTable.
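The MemTable -> SSTable movement, sketched as a tiny in-memory store. `TinyLSM` and its `memtable_limit` are invented for this example; a sorted tuple stands in for an on-disk file, and the commit log that real systems write alongside the MemTable is left out.

```python
class TinyLSM:
    def __init__(self, memtable_limit=3):
        self.memtable = {}     # mutable buffer in RAM
        self.sstables = []     # immutable sorted runs, oldest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Freeze the MemTable into a sorted, immutable run and start a new one."""
        if self.memtable:
            self.sstables.append(tuple(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        """Check the MemTable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for table in reversed(self.sstables):
            for k, v in table:
                if k == key:
                    return v
        return None

db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)   # hits the limit, so both keys are flushed to an SSTable
db.put("a", 3)   # newer value stays in the MemTable and shadows the flushed one
```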
Gossip Protocol
Epidemic Failure Detection
Nodes randomly exchange state information to discover failures and membership changes without a central master.
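A simplified simulation of gossip-style state exchange. It assumes each node keeps a map of node name to heartbeat counter and that merging means taking the higher counter per node. It shows only how membership information spreads, not failure detection itself; `merge` and `gossip_round` are hypothetical helpers.

```python
import random

def merge(view_a, view_b):
    """Element-wise max of heartbeat counters: the freshest information wins."""
    return {n: max(view_a.get(n, 0), view_b.get(n, 0))
            for n in set(view_a) | set(view_b)}

def gossip_round(views, rng):
    """Each node picks one random peer and both adopt the merged view."""
    nodes = list(views)
    for node in nodes:
        peer = rng.choice([n for n in nodes if n != node])
        merged = merge(views[node], views[peer])
        views[node] = dict(merged)
        views[peer] = dict(merged)

# Every node starts out knowing only itself.
views = {name: {name: 1} for name in ["A", "B", "C", "D"]}
rng = random.Random(42)   # fixed seed so the run is repeatable
rounds = 0
while any(len(view) < len(views) for view in views.values()) and rounds < 100:
    gossip_round(views, rng)
    rounds += 1
```

No node acts as a coordinator here: knowledge spreads purely through pairwise exchanges, which is why the approach is often described as epidemic.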
What is YARN?
Resource Negotiator
Often described as the "OS of Hadoop" (YARN stands for Yet Another Resource Negotiator). It allocates CPU and RAM to applications (MapReduce, Spark) and manages scheduling.
BigTable Tablet
A Range of Rows
BigTable shards data into Tablets based on Row Key ranges. A Tablet splits when it grows too large (roughly 100-200 MB by default in the original design).
Merkle Tree
Efficient Sync
A hash tree used by Dynamo/Cassandra to find data differences between replicas quickly without transferring all data.
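A compact sketch of the comparison step, assuming both replicas cover the same key ranges in the same order and that the number of leaves is a power of two. `build_tree` and `diff` are illustrative names; the real anti-entropy process in these systems is more involved.

```python
import hashlib

def _h(text):
    return hashlib.sha256(text.encode()).hexdigest()

def build_tree(leaves):
    """Return the tree as a list of levels: leaf hashes first, root last."""
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def diff(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    stack = [(len(tree_a) - 1, 0)]
    mismatched = []
    while stack:
        depth, idx = stack.pop()
        if tree_a[depth][idx] == tree_b[depth][idx]:
            continue                      # identical subtree: skip it entirely
        if depth == 0:
            mismatched.append(idx)        # a leaf range that needs repair
        else:
            stack.append((depth - 1, 2 * idx))
            stack.append((depth - 1, 2 * idx + 1))
    return sorted(mismatched)

replica_a = ["k1=a", "k2=b", "k3=c", "k4=d"]
replica_b = ["k1=a", "k2=b", "k3=X", "k4=d"]
out_of_sync = diff(build_tree(replica_a), build_tree(replica_b))
```

Only hashes travel between replicas, and matching subtrees are pruned at the first equal hash, so just the differing range (index 2 here) needs its data transferred.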
Tunable Consistency
R + W > N
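The condition that guarantees Strong Consistency in a quorum-based system: every read quorum overlaps every write quorum, so a read always reaches at least one replica holding the latest write. R = Read Quorum, W = Write Quorum, N = Replication Factor.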
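The overlap argument can be checked by brute force for a small cluster. In this sketch `quorums_overlap` is the formula itself and `always_intersect` (a hypothetical helper) enumerates every possible read set and write set to confirm they share a replica.

```python
from itertools import combinations

def quorums_overlap(n, r, w):
    """The quorum condition: R + W > N."""
    return r + w > n

def always_intersect(n, r, w):
    """True if every possible read set shares a replica with every write set."""
    replicas = range(n)
    return all(set(read_set) & set(write_set)
               for read_set in combinations(replicas, r)
               for write_set in combinations(replicas, w))

# With N=3: R=2, W=2 overlaps; R=1, W=1 can miss the latest write.
strong = quorums_overlap(3, 2, 2)
weak = quorums_overlap(3, 1, 1)
```

With N=3, choosing R=1 and W=3 also satisfies the condition, trading slower writes for faster reads.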