Review & Cheat Sheet

Key Takeaways

  1. Tunable Consistency: Cassandra lets you choose between Strong Consistency (R + W > N) and High Availability per request.
  2. CAP Theorem: Cassandra is an AP system (Availability + Partition Tolerance) by default, but can be configured to behave like CP.
  3. Hinted Handoff: A temporary failure-handling mechanism in which the coordinator stores writes destined for down nodes and replays them when those nodes return. It helps replicas converge sooner but is not a replacement for repair.
  4. Anti-Entropy Repair: The process of synchronizing data between replicas.
    • Read Repair: Lazy, fixes data on read access.
    • Nodetool Repair: Active, operator-run process (manual or scheduled) that compares replicas using Merkle Trees.
  5. Merkle Trees: Hash trees used to efficiently compare massive datasets without transferring all data.
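  6. Zombie Data: Deleted data that reappears because a replica missed the tombstone (deletion marker) and later shares its stale copy. Prevented by repairing every node at least once per gc_grace_seconds (default 10 days), so that all replicas receive the tombstone before it is purged.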
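
To make the Merkle Tree takeaway concrete, here is a minimal Python sketch of how two replicas could locate a divergent block by comparing hashes instead of shipping all their data. It is an illustration only, not Cassandra's implementation: real repair hashes token ranges, while this sketch hashes a small list of strings and assumes a power-of-two number of leaves. The names build_tree, mismatched_leaves, replica_a and replica_b are hypothetical.

```python
import hashlib


def _hash(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(partitions):
    """Return the tree as a list of levels: levels[0] = leaf hashes, levels[-1] = [root]."""
    n = len(partitions)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes a power-of-two number of leaves")
    level = [_hash(p.encode("utf-8")) for p in partitions]
    levels = [level]
    while len(level) > 1:
        # Each parent is the hash of its two children concatenated.
        level = [_hash(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def mismatched_leaves(tree_a, tree_b):
    """Walk down from the root, descending only into subtrees whose hashes differ."""
    if len(tree_a) != len(tree_b):
        raise ValueError("trees must have the same depth")
    differing = []
    stack = [(len(tree_a) - 1, 0)]  # (level, index), starting at the root
    while stack:
        level, index = stack.pop()
        if tree_a[level][index] == tree_b[level][index]:
            continue  # identical subtree: nothing below it needs comparing
        if level == 0:
            differing.append(index)
        else:
            stack.append((level - 1, 2 * index))
            stack.append((level - 1, 2 * index + 1))
    return sorted(differing)


# Hypothetical replica contents; only the third block differs.
replica_a = ["k1=apple", "k2=banana", "k3=cherry", "k4=date"]
replica_b = ["k1=apple", "k2=banana", "k3=STALE", "k4=date"]

print(mismatched_leaves(build_tree(replica_a), build_tree(replica_b)))  # [2]
```

Only the block at index 2 would need to be streamed between the replicas; matching subtrees are skipped after a single hash comparison.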

Flashcards

What is the formula for Strong Consistency in Cassandra?
R + W > N (Read Nodes + Write Nodes > Replication Factor)
What is Hinted Handoff?
A mechanism where the coordinator temporarily stores a write for a down node and replays it when the node comes back online.
What data structure makes Anti-Entropy Repair efficient?
Merkle Tree (Allows comparing large datasets by hashing blocks)
Which Consistency Level forces a majority of replicas to acknowledge?
QUORUM (or LOCAL_QUORUM)
True or False: Hinted Handoff can store hints forever.
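False. The coordinator only collects hints for a limited window (default 3 hours). If a node stays down longer than that, later writes are not hinted and a manual repair is needed.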
What is a Tombstone?
A marker indicating that a row or cell has been deleted. It prevents deleted data from resurrecting (Zombie Data).
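
The R + W > N rule from the first flashcard can be checked by brute force. The Python sketch below is an illustration, not part of Cassandra; the function name always_overlaps is hypothetical. It enumerates every possible set of replicas a write could reach and every set a read could contact, and tests whether they always share at least one replica. That shared replica is what guarantees a read sees the latest acknowledged write.

```python
from itertools import combinations


def always_overlaps(r: int, w: int, n: int) -> bool:
    """True if every possible read set shares a replica with every possible write set."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


n = 3  # replication factor
for r, w in [(1, 1), (2, 2), (1, 3)]:
    print(f"R={r} W={w} N={n}: R+W>N is {r + w > n}, overlap guaranteed: {always_overlaps(r, w, n)}")
```

With N = 3, reading and writing at ONE (R = 1, W = 1) leaves room for a read to miss the written replica, while QUORUM on both sides (R = 2, W = 2) always overlaps.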

Cheat Sheet: Consistency Levels

| Level | Read Behavior | Write Behavior | Best For |
|--------------|--------------------------------------|---------------------------------|----------------------------------------------|
| ONE | Returns data from closest replica. | Acks after 1 replica writes. | Analytics, Logs, “Likes” |
| QUORUM | Returns data from majority (N/2+1). | Acks after majority writes. | General Purpose, Strong Consistency |
| ALL | Waits for all replicas. | Waits for all replicas. | Avoid. Zero fault tolerance. |
| LOCAL_QUORUM | Majority in local DC. | Majority in local DC. | Multi-region apps (Low latency) |
| EACH_QUORUM | N/A (Not supported for reads). | Majority in each DC. | Global Consistency (Very slow) |
| ANY | N/A (Not supported for reads). | Acks if even 1 hint is stored. | Dangerous. Data loss if coordinator dies. |
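
As a rough aid for reading the cheat sheet, the following Python sketch computes how many replica acknowledgements each level waits for on a write. It is a simplified model under stated assumptions: the datacenter names dc1 and dc2, the replication factors, and the helper names quorum and required_acks are all hypothetical, and the majority is taken as floor(N/2) + 1.

```python
def quorum(replicas: int) -> int:
    """Majority of a replica count: floor(N/2) + 1."""
    return replicas // 2 + 1


def required_acks(level: str, rf_by_dc: dict, local_dc: str) -> int:
    """Replica acks a write waits for at the given level (simplified model)."""
    total = sum(rf_by_dc.values())
    if level == "ANY":
        return 0  # a stored hint is enough; no replica has to acknowledge
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total)  # majority across all datacenters combined
    if level == "LOCAL_QUORUM":
        return quorum(rf_by_dc[local_dc])  # majority in the coordinator's DC only
    if level == "EACH_QUORUM":
        return sum(quorum(rf) for rf in rf_by_dc.values())  # a majority in every DC
    if level == "ALL":
        return total
    raise ValueError(f"unknown consistency level: {level}")


# Hypothetical two-datacenter keyspace with 3 replicas in each DC.
rf_by_dc = {"dc1": 3, "dc2": 3}
for level in ["ANY", "ONE", "LOCAL_QUORUM", "QUORUM", "EACH_QUORUM", "ALL"]:
    print(f"{level:13s} waits for {required_acks(level, rf_by_dc, 'dc1')} replica ack(s)")
```

In this example QUORUM and EACH_QUORUM both need 4 acks, but EACH_QUORUM additionally requires 2 of them from each datacenter, which is why it is slower and less tolerant of a datacenter outage than LOCAL_QUORUM.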

Next Steps

Now that you understand how Cassandra keeps data consistent, let’s look at how it handles massive scale.