Consensus: Raft & Paxos

A Senior Engineer knows that Consensus is how a cluster agrees on a single value (like “Who is the leader?”). A Staff Engineer knows that the real challenge isn’t reaching consensus—it’s preventing a partitioned node from starting a “dueling election” when it rejoins.


The Raft State Machine

```mermaid
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : Timeout
    Candidate --> Follower : Leader Appears / Higher Term
    Candidate --> Leader : Majority Votes
    Leader --> Follower : Higher Term Found
    Follower --> Follower : Heartbeat
```

1. The Three Pillars of Raft: Majority Rules

  1. Leader Election: One node is chosen to be the source of truth.
  2. Log Replication: The leader sends updates to followers.
  3. Safety: Only nodes with the most up-to-date log can become leaders.
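
Item 3 is Raft’s “election restriction,” and it is enforced inside the vote handler itself. Below is a minimal Go sketch of that check, using hypothetical `Node` and `VoteRequest` types rather than any real library’s API:

```go
// Hypothetical sketch of the RequestVote safety check: a voter only grants
// its vote if the candidate's log is at least as up-to-date as its own.
package raft

type VoteRequest struct {
	Term         uint64 // candidate's term
	CandidateID  string
	LastLogIndex uint64 // index of candidate's last log entry
	LastLogTerm  uint64 // term of candidate's last log entry
}

type Node struct {
	currentTerm  uint64
	votedFor     string
	lastLogIndex uint64
	lastLogTerm  uint64
}

// HandleRequestVote grants a vote only if the candidate's term is current,
// we haven't already voted for someone else this term, and the candidate's
// log is at least as up-to-date as ours (compare last term, then last index).
func (n *Node) HandleRequestVote(req VoteRequest) bool {
	if req.Term < n.currentTerm {
		return false // stale candidate
	}
	if req.Term > n.currentTerm {
		n.currentTerm = req.Term // adopt the higher term, forget any old vote
		n.votedFor = ""
	}
	if n.votedFor != "" && n.votedFor != req.CandidateID {
		return false // one vote per term
	}
	logOK := req.LastLogTerm > n.lastLogTerm ||
		(req.LastLogTerm == n.lastLogTerm && req.LastLogIndex >= n.lastLogIndex)
	if logOK {
		n.votedFor = req.CandidateID
	}
	return logOK
}
```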

Staff-Level Edge Case: Asymmetric Partitions

A node $N$ can receive heartbeats from the leader $L$, but $L$ cannot receive responses from $N$.

  • The Trap: $N$ thinks the network is fine, but $L$ sees $N$ as down. If $N$ triggers an election, it can cause “dueling elections.”
  • The Pre-Vote Fix: Before incrementing its term, $N$ first asks its peers whether they still have a leader. Because the other followers are still receiving $L$’s heartbeats, they reject $N$’s pre-candidate request, stopping the disruption before it starts.

2. Operational Fencing: The “Pre-Vote” Phase

In standard Raft, any node that stops receiving heartbeats increments its Term and starts an election. A partitioned node (e.g., a network cable is unplugged) keeps timing out and incrementing its Term in isolation; when it rejoins, its inflated Term forces the healthy leader to step down, dragging the entire cluster through a disruptive election even though nothing was actually wrong.

The Solution: Pre-Vote (etcd style)

Before a node increments its term, it enters a “Pre-Candidate” state and asks its peers: “If I were to start an election, would you vote for me?” Peers only say “Yes” if:

  1. They haven’t heard from a leader recently.
  2. The requester’s log is up-to-date.
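
A Go sketch of what that check might look like, with illustrative `Follower` and `PreVoteRequest` types (this is the idea behind etcd’s Pre-Vote, not its actual code). Note that the handler never mutates `currentTerm` or records a vote:

```go
// Hypothetical Pre-Vote handler: answers "would I vote for you?" without
// bumping the term or recording a vote, so a lagging node can't disrupt
// a healthy cluster just by asking.
package raft

import "time"

type PreVoteRequest struct {
	Term         uint64 // the term the pre-candidate *would* campaign with
	LastLogIndex uint64
	LastLogTerm  uint64
}

type Follower struct {
	currentTerm     uint64
	lastLogIndex    uint64
	lastLogTerm     uint64
	lastHeartbeat   time.Time
	electionTimeout time.Duration
}

// HandlePreVote grants a pre-vote only if (1) we have NOT heard from a live
// leader within the election timeout, and (2) the pre-candidate's log is at
// least as up-to-date as ours. Either way, our own state is left untouched.
func (f *Follower) HandlePreVote(req PreVoteRequest, now time.Time) bool {
	if now.Sub(f.lastHeartbeat) < f.electionTimeout {
		return false // a leader is alive; reject the disruption
	}
	if req.Term < f.currentTerm {
		return false // the pre-candidate is behind us
	}
	return req.LastLogTerm > f.lastLogTerm ||
		(req.LastLogTerm == f.lastLogTerm && req.LastLogIndex >= f.lastLogIndex)
}
```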

Interactive: Raft Pre-Vote vs. Naive Election

See how Pre-Vote prevents a lagging node from disrupting a healthy cluster.

[Interactive widget: a Leader and two Followers at Term 1, plus one isolated node at Term 5. Caption: “Healthy Cluster. Term: 1. Node 3 is isolated and term-bombing locally.”]

3. Joint Consensus: Cluster Membership Changes

What happens when you want to grow a cluster from 3 nodes to 5? You can’t just update all nodes at once. If you do it partially, you might have two independent majorities (3 out of 5 and 2 out of 3) during the transition—a Split Brain.

The Staff Move: Joint Consensus (Two-Phase Membership)

  1. Phase 1 (Joint): Decisions require a majority of both the old configuration ($C_{old}$) and the new configuration ($C_{new}$).
  2. Phase 2 (New): Once the joint config is committed, the system moves purely to $C_{new}$.
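
The Phase-1 rule reduces to a small predicate: an entry is committed only when both configurations independently reach a majority. A minimal Go sketch with made-up helper names:

```go
// Joint-consensus commit rule (sketch): while in the joint configuration,
// an entry commits only when acknowledged by a majority of C_old AND a
// majority of C_new.
package raft

// majorityAcked reports whether a majority of the given configuration has
// acknowledged the entry.
func majorityAcked(config []string, acks map[string]bool) bool {
	count := 0
	for _, id := range config {
		if acks[id] {
			count++
		}
	}
	return count > len(config)/2
}

// committedInJointConfig applies the Phase-1 rule: both majorities required.
func committedInJointConfig(cOld, cNew []string, acks map[string]bool) bool {
	return majorityAcked(cOld, acks) && majorityAcked(cNew, acks)
}
```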

4. Staff Hazard: The “Phantom Leader”

Even with Raft’s safety guarantees, production optimizations can introduce a Dual-Leader scenario.

  • The Optimization: To avoid hitting the disk (the log) for every READ, leaders use a Lease. For 500ms after a successful heartbeat, the leader assumes it is still the leader and serves reads locally.
  • The Failure: If the leader’s clock drifts (e.g., NTP glitch) or the network is highly unstable, a new leader might be elected while the old leader still thinks its lease is valid.
  • The Result: For a brief window, both nodes serve “Linearizable” reads. One returns old data, one returns new.
  • The Staff Move: Never rely on raw wall-clock time for strict consistency in Raft. Use ReadIndex (the leader confirms a quorum before serving the read) or lease reads whose duration explicitly budgets for a maximum clock-skew bound.
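
A Go sketch of the ReadIndex idea, with an assumed `Transport` interface standing in for the real RPC layer: before answering, the leader re-confirms a quorum and hands back an index that is safe to read at once applied.

```go
// ReadIndex sketch: record the commit index, prove leadership with a fresh
// round of heartbeats to a majority, and only then let the read proceed.
// No wall-clock lease is consulted anywhere.
package raft

import "errors"

type Transport interface {
	// Heartbeat returns true if the peer still recognises us as leader
	// at this term. (Assumed interface, not a real library's API.)
	Heartbeat(peer string, term uint64) bool
}

type Leader struct {
	term        uint64
	commitIndex uint64
	peers       []string // all other nodes in the cluster
	transport   Transport
}

// ReadIndex returns an index the caller must wait to apply up to before
// serving the read, or an error if we can no longer prove leadership.
func (l *Leader) ReadIndex() (uint64, error) {
	readIndex := l.commitIndex
	acks := 1 // count ourselves
	for _, p := range l.peers {
		if l.transport.Heartbeat(p, l.term) {
			acks++
		}
	}
	if acks <= (len(l.peers)+1)/2 {
		return 0, errors.New("lost quorum: refuse to serve the read")
	}
	return readIndex, nil
}
```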

5. The “Membership Change” Storm

Adding 2 nodes to a 3-node cluster seems simple, and section 3 already gave the fix; the failure mode is worth spelling out. If you update all nodes naively:

  1. Nodes still running the old 3-node config think the quorum is 2.
  2. Nodes already running the new 5-node config think the quorum is 3.
  3. The Split-Brain: the two groups can form two independent majorities simultaneously (see the example below).
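
A toy Go calculation makes the hazard concrete (node names and configs are hypothetical); the two quorums it finds share no nodes:

```go
// During a naive config change, two disjoint groups can each believe they
// hold a majority: {A, B} under the old 3-node view, {C, D, E} under the
// new 5-node view.
package main

import "fmt"

func isQuorum(votes, clusterSize int) bool {
	return votes > clusterSize/2
}

func main() {
	oldView, newView := 3, 5 // old config {A,B,C}; new config {A,B,C,D,E}

	// Partition side 1: A and B still run the old config.
	fmt.Println(isQuorum(2, oldView)) // true  -> can elect a leader

	// Partition side 2: C, D, E have already switched to the new config.
	fmt.Println(isQuorum(3, newView)) // true  -> can ALSO elect a leader
}
```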

The Solution: Joint Consensus

Staff engineers insist on a Two-Phase Membership Change:

  • Phase 1 (Joint): Every decision must reach a majority in both the 3-node configuration AND the 5-node configuration.
  • Phase 2 (Final): Once the joint log is committed, the system switches to the 5-node quorum only.

6. Log Bloat & Slow Snapshots

If your Raft log isn’t “compacted” into a snapshot, it grows forever, consuming all disk space.

  • The Failure: A “Snapshot” process (taking the current DB state and writing it to disk) takes 60 seconds because the state is 100GB.
  • The Impact: During those 60 seconds, the node might stop responding to Raft heartbeats or AppendEntries, causing it to be dropped from the cluster.
  • Staff Move: Use Background/Incremental Snapshotting or specialized storage engines (like RocksDB) that allow the Raft log to be pruned without blocking the main event loop.
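
One way to structure that, sketched in Go with assumed `StateMachine` and `LogStore` interfaces: the Raft loop only triggers compaction, the expensive serialization runs on a background goroutine, and the log is truncated only once the snapshot is durable.

```go
// Non-blocking log compaction sketch: snapshot in the background so
// heartbeats and AppendEntries keep flowing, then prune the log.
package raft

import "sync/atomic"

type StateMachine interface {
	// Snapshot serialises current state; safe to run alongside reads.
	Snapshot() (data []byte, lastAppliedIndex uint64)
}

type LogStore interface {
	Size() int
	SaveSnapshot(data []byte, index uint64)
	TruncateBefore(index uint64) // drop entries covered by the snapshot
}

type Compactor struct {
	sm         StateMachine
	log        LogStore
	threshold  int // e.g. compact once the log holds this many entries
	inProgress atomic.Bool
}

// MaybeCompact is called from the main Raft loop and returns immediately;
// at most one compaction runs at a time.
func (c *Compactor) MaybeCompact() {
	if c.log.Size() < c.threshold || !c.inProgress.CompareAndSwap(false, true) {
		return
	}
	go func() {
		defer c.inProgress.Store(false)
		data, lastApplied := c.sm.Snapshot()
		c.log.SaveSnapshot(data, lastApplied)
		c.log.TruncateBefore(lastApplied + 1)
	}()
}
```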

7. Staff Math: The Mechanics of Agreement

Consensus isn’t magic; it’s a game of majority intersections.

7.1. The Quorum Intersect Limit

Why do we use $\lfloor N/2 \rfloor + 1$? Because it is the smallest quorum size that guarantees any two quorums share at least one node:

$$\textbf{Safety Property:}\quad Q_1 \cap Q_2 \neq \emptyset$$

  • Split-Brain Math: If you misconfigure a 6-node cluster to use a quorum of 3 ($q=3$) instead of 4:
    • Dangerous Splits: There are $\binom{6}{3} = 20$ ways to choose which 3 nodes land on one side of an even 3–3 split, and only the even splits leave a quorum on both sides.
    • Probability: Out of the $2^6 - 2 = 62$ non-trivial two-way partitions, that is $20 / 62 \approx \mathbf{32\%}$: roughly one in three random partitions can produce two active leaders.
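
The arithmetic is easy to reproduce; this is plain Go, nothing Raft-specific:

```go
// Count the partitions of a 6-node cluster in which BOTH sides reach the
// (misconfigured) quorum of 3: only the even 3-3 splits qualify.
package main

import "fmt"

func binomial(n, k int) int {
	result := 1
	for i := 0; i < k; i++ {
		result = result * (n - i) / (i + 1) // exact at every step
	}
	return result
}

func main() {
	n, q := 6, 3
	dangerous := binomial(n, q) // 20 ways to pick the 3 nodes on one side
	partitions := (1 << n) - 2  // 62 non-trivial two-way splits
	fmt.Printf("%d / %d ≈ %.0f%%\n", dangerous, partitions,
		100*float64(dangerous)/float64(partitions)) // 20 / 62 ≈ 32%
}
```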

7.2. The Election Jitter Window

To avoid “Split-Vote Livelock,” Raft nodes use randomized election timeouts, with a window at least an order of magnitude wider than the network round-trip:

$$\text{Election Timeout} \gtrsim 10 \times \text{Network RTT}$$

  • Constraint: The randomized spread must be wide enough that when one node’s timer fires, it can usually win the election and send its first “Heartbeat” before any other node’s timer expires.
  • Example: If RTT is 10ms, use a jitter window of 150ms–300ms.
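
A Go sketch of how a node might pick its timeout within those bounds (the 150ms–300ms window is the illustrative range from the example above):

```go
// Randomized election timeout: each follower independently picks a value in
// [150ms, 300ms) so that one candidate usually wins and heartbeats the
// others before their timers fire.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	electionTimeoutMin = 150 * time.Millisecond
	electionTimeoutMax = 300 * time.Millisecond
)

func randomElectionTimeout() time.Duration {
	span := int64(electionTimeoutMax - electionTimeoutMin)
	return electionTimeoutMin + time.Duration(rand.Int63n(span))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout()) // three different values in [150ms, 300ms)
	}
}
```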

7.3. Commit Latency (The Quorum Median)

In a global 5-node cluster, your write latency is governed by the $3^{rd}$-fastest replica: the leader (whose own append is essentially free) plus its two quickest followers form the quorum.

$$\text{Write Latency} \approx \operatorname{median}(\text{RTT}_1, \text{RTT}_2, \text{RTT}_3, \text{RTT}_4, \text{RTT}_5)$$

  • Staff Insight: This makes Raft highly resilient to “Tail Latency.” Even if 2 nodes are having a major lag spike (p99), your system’s p99 write latency will only be the p50 of the healthy majority.
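
A toy Go example with made-up RTTs shows why the straggler never appears in the commit latency (assuming the leader’s own log append is effectively free, so it needs acks from just 2 of its 4 followers):

```go
// Quorum commit latency: with a 5-node cluster the write commits when the
// 2nd-fastest follower acks, so the 400ms outlier is irrelevant.
package main

import (
	"fmt"
	"sort"
	"time"
)

func main() {
	followerRTT := []time.Duration{
		12 * time.Millisecond,  // same region
		15 * time.Millisecond,  // same region
		90 * time.Millisecond,  // cross-continent
		400 * time.Millisecond, // lag spike (p99 outlier)
	}
	sort.Slice(followerRTT, func(i, j int) bool { return followerRTT[i] < followerRTT[j] })

	acksNeeded := 2 // leader + 2 followers = majority of 5
	fmt.Println("commit latency ≈", followerRTT[acksNeeded-1]) // 15ms
}
```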

Staff Takeaway

Consensus is the bedrock of distributed safety.

  • Raft is the choice for 99% of new systems due to its understandability.
  • Pre-Vote is non-negotiable for large-scale production stability to prevent term-bombing.
  • Membership Changes are the most dangerous time for a cluster; always use a formal Two-Phase Joint Consensus strategy.
