Consensus: Raft & Paxos

A Senior Engineer knows that Consensus is how a cluster agrees on a single value (like “Who is the leader?”). A Staff Engineer knows that the real challenge isn’t reaching consensus—it’s preventing a partitioned node from starting a “dueling election” when it rejoins.


The Raft State Machine

```mermaid
stateDiagram-v2
    [*] --> Follower
    Follower --> Candidate : Timeout
    Candidate --> Follower : Leader Appears / Higher Term
    Candidate --> Leader : Majority Votes
    Leader --> Follower : Higher Term Found
    Follower --> Follower : Heartbeat
```

1. The Three Pillars of Raft: Majority Rules

  1. Leader Election: One node is chosen to be the source of truth.
  2. Log Replication: The leader sends updates to followers.
  3. Safety: Only nodes with the most up-to-date log can become leaders.
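
Item 3 is Raft’s “election restriction,” and it is enforced inside the vote handler itself. Below is a minimal Go sketch of that check, using hypothetical `Node` and `VoteRequest` types rather than any real library’s API:

```go
// Hypothetical sketch of the RequestVote safety check: a voter only grants
// its vote if the candidate's log is at least as up-to-date as its own.
package raft

type VoteRequest struct {
	Term         uint64 // candidate's term
	CandidateID  string
	LastLogIndex uint64 // index of candidate's last log entry
	LastLogTerm  uint64 // term of candidate's last log entry
}

type Node struct {
	currentTerm  uint64
	votedFor     string
	lastLogIndex uint64
	lastLogTerm  uint64
}

// HandleRequestVote grants a vote only if the candidate's term is current,
// we haven't already voted for someone else this term, and the candidate's
// log is at least as up-to-date as ours (compare last term, then last index).
func (n *Node) HandleRequestVote(req VoteRequest) bool {
	if req.Term < n.currentTerm {
		return false // stale candidate
	}
	if req.Term > n.currentTerm {
		n.currentTerm = req.Term // adopt the higher term, forget any old vote
		n.votedFor = ""
	}
	if n.votedFor != "" && n.votedFor != req.CandidateID {
		return false // one vote per term
	}
	logOK := req.LastLogTerm > n.lastLogTerm ||
		(req.LastLogTerm == n.lastLogTerm && req.LastLogIndex >= n.lastLogIndex)
	if logOK {
		n.votedFor = req.CandidateID
	}
	return logOK
}
```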

Staff-Level Edge Case: Asymmetric Partitions

A node $N$ can receive heartbeats from the leader $L$, but $L$ cannot receive responses from $N$.

  • The Trap: $N$ thinks the network is fine, but $L$ sees $N$ as down. If $N$ triggers an election, it can cause “dueling elections.”
  • The Pre-Vote Fix: Before incrementing its term, $N$ first asks its peers whether they still have a leader. Because the other followers are still receiving $L$’s heartbeats, they reject $N$’s pre-candidate request, stopping the disruption before it starts.

2. Operational Fencing: The “Pre-Vote” Phase

In standard Raft, any node that stops receiving heartbeats increments its Term and starts an election. A partitioned node (e.g., a network cable is unplugged) keeps timing out and incrementing its Term in isolation; when it rejoins, its inflated Term forces the healthy leader to step down, dragging the entire cluster through a disruptive election even though nothing was actually wrong.

The Solution: Pre-Vote (etcd style)

Before a node increments its term, it enters a “Pre-Candidate” state and asks its peers: “If I were to start an election, would you vote for me?” Peers only say “Yes” if:

  1. They haven’t heard from a leader recently.
  2. The requester’s log is up-to-date.
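
A Go sketch of what that check might look like, with illustrative `Follower` and `PreVoteRequest` types (this is the idea behind etcd’s Pre-Vote, not its actual code). Note that the handler never mutates `currentTerm` or records a vote:

```go
// Hypothetical Pre-Vote handler: answers "would I vote for you?" without
// bumping the term or recording a vote, so a lagging node can't disrupt
// a healthy cluster just by asking.
package raft

import "time"

type PreVoteRequest struct {
	Term         uint64 // the term the pre-candidate *would* campaign with
	LastLogIndex uint64
	LastLogTerm  uint64
}

type Follower struct {
	currentTerm     uint64
	lastLogIndex    uint64
	lastLogTerm     uint64
	lastHeartbeat   time.Time
	electionTimeout time.Duration
}

// HandlePreVote grants a pre-vote only if (1) we have NOT heard from a live
// leader within the election timeout, and (2) the pre-candidate's log is at
// least as up-to-date as ours. Either way, our own state is left untouched.
func (f *Follower) HandlePreVote(req PreVoteRequest, now time.Time) bool {
	if now.Sub(f.lastHeartbeat) < f.electionTimeout {
		return false // a leader is alive; reject the disruption
	}
	if req.Term < f.currentTerm {
		return false // the pre-candidate is behind us
	}
	return req.LastLogTerm > f.lastLogTerm ||
		(req.LastLogTerm == f.lastLogTerm && req.LastLogIndex >= f.lastLogIndex)
}
```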

Interactive: Raft Pre-Vote vs. Naive Election

See how Pre-Vote prevents a lagging node from disrupting a healthy cluster.

[Interactive widget: a Leader and two Followers at Term 1, plus one isolated node at Term 5. Caption: “Healthy Cluster. Term: 1. Node 3 is isolated and term-bombing locally.”]

3. Joint Consensus: Cluster Membership Changes

What happens when you want to grow a cluster from 3 nodes to 5? You can’t just update all nodes at once. If you do it partially, you might have two independent majorities (3 out of 5 and 2 out of 3) during the transition—a Split Brain.

The Staff Move: Joint Consensus (Two-Phase Membership)

  1. Phase 1 (Joint): Decisions require a majority of both the old configuration ($C_{old}$) and the new configuration ($C_{new}$).
  2. Phase 2 (New): Once the joint config is committed, the system moves purely to $C_{new}$.
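
The Phase-1 rule reduces to a small predicate: an entry is committed only when both configurations independently reach a majority. A minimal Go sketch with made-up helper names:

```go
// Joint-consensus commit rule (sketch): while in the joint configuration,
// an entry commits only when acknowledged by a majority of C_old AND a
// majority of C_new.
package raft

// majorityAcked reports whether a majority of the given configuration has
// acknowledged the entry.
func majorityAcked(config []string, acks map[string]bool) bool {
	count := 0
	for _, id := range config {
		if acks[id] {
			count++
		}
	}
	return count > len(config)/2
}

// committedInJointConfig applies the Phase-1 rule: both majorities required.
func committedInJointConfig(cOld, cNew []string, acks map[string]bool) bool {
	return majorityAcked(cOld, acks) && majorityAcked(cNew, acks)
}
```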

4. Staff Hazard: The “Phantom Leader”

Even with Raft’s safety guarantees, production optimizations can introduce a Dual-Leader scenario.

  • The Optimization: To avoid hitting the disk (the log) for every READ, leaders use a Lease. For 500ms after a successful heartbeat, the leader assumes it is still the leader and serves reads locally.
  • The Failure: If the leader’s clock drifts (e.g., NTP glitch) or the network is highly unstable, a new leader might be elected while the old leader still thinks its lease is valid.
  • The Result: For a brief window, both nodes serve “Linearizable” reads. One returns old data, one returns new.
  • The Staff Move: Never rely on raw wall-clock time for strict consistency in Raft. Use ReadIndex (the leader confirms a quorum before serving the read) or lease reads whose duration explicitly budgets for a maximum clock-skew bound.
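
A Go sketch of the ReadIndex idea, with an assumed `Transport` interface standing in for the real RPC layer: before answering, the leader re-confirms a quorum and hands back an index that is safe to read at once applied.

```go
// ReadIndex sketch: record the commit index, prove leadership with a fresh
// round of heartbeats to a majority, and only then let the read proceed.
// No wall-clock lease is consulted anywhere.
package raft

import "errors"

type Transport interface {
	// Heartbeat returns true if the peer still recognises us as leader
	// at this term. (Assumed interface, not a real library's API.)
	Heartbeat(peer string, term uint64) bool
}

type Leader struct {
	term        uint64
	commitIndex uint64
	peers       []string // all other nodes in the cluster
	transport   Transport
}

// ReadIndex returns an index the caller must wait to apply up to before
// serving the read, or an error if we can no longer prove leadership.
func (l *Leader) ReadIndex() (uint64, error) {
	readIndex := l.commitIndex
	acks := 1 // count ourselves
	for _, p := range l.peers {
		if l.transport.Heartbeat(p, l.term) {
			acks++
		}
	}
	if acks <= (len(l.peers)+1)/2 {
		return 0, errors.New("lost quorum: refuse to serve the read")
	}
	return readIndex, nil
}
```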

5. The “Membership Change” Storm

Adding 2 nodes to a 3-node cluster seems simple, and section 3 already gave the fix; the failure mode is worth spelling out. If you update all nodes naively:

  1. Nodes still running the old 3-node config think the quorum is 2.
  2. Nodes already running the new 5-node config think the quorum is 3.
  3. The Split-Brain: the two groups can form two independent majorities simultaneously (see the example below).
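
A toy Go calculation makes the hazard concrete (node names and configs are hypothetical); the two quorums it finds share no nodes:

```go
// During a naive config change, two disjoint groups can each believe they
// hold a majority: {A, B} under the old 3-node view, {C, D, E} under the
// new 5-node view.
package main

import "fmt"

func isQuorum(votes, clusterSize int) bool {
	return votes > clusterSize/2
}

func main() {
	oldView, newView := 3, 5 // old config {A,B,C}; new config {A,B,C,D,E}

	// Partition side 1: A and B still run the old config.
	fmt.Println(isQuorum(2, oldView)) // true  -> can elect a leader

	// Partition side 2: C, D, E have already switched to the new config.
	fmt.Println(isQuorum(3, newView)) // true  -> can ALSO elect a leader
}
```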

The Solution: Joint Consensus

Staff engineers insist on a Two-Phase Membership Change:

  • Phase 1 (Joint): Every decision must reach a majority in both the 3-node configuration AND the 5-node configuration.
  • Phase 2 (Final): Once the joint log is committed, the system switches to the 5-node quorum only.

6. Log Bloat & Slow Snapshots

If your Raft log isn’t “compacted” into a snapshot, it grows forever, consuming all disk space.

  • The Failure: A “Snapshot” process (taking the current DB state and writing it to disk) takes 60 seconds because the state is 100GB.
  • The Impact: During those 60 seconds, the node might stop responding to Raft heartbeats or AppendEntries, causing it to be dropped from the cluster.
  • Staff Move: Use Background/Incremental Snapshotting or specialized storage engines (like RocksDB) that allow the Raft log to be pruned without blocking the main event loop.
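
One way to structure that, sketched in Go with assumed `StateMachine` and `LogStore` interfaces: the Raft loop only triggers compaction, the expensive serialization runs on a background goroutine, and the log is truncated only once the snapshot is durable.

```go
// Non-blocking log compaction sketch: snapshot in the background so
// heartbeats and AppendEntries keep flowing, then prune the log.
package raft

import "sync/atomic"

type StateMachine interface {
	// Snapshot serialises current state; safe to run alongside reads.
	Snapshot() (data []byte, lastAppliedIndex uint64)
}

type LogStore interface {
	Size() int
	SaveSnapshot(data []byte, index uint64)
	TruncateBefore(index uint64) // drop entries covered by the snapshot
}

type Compactor struct {
	sm         StateMachine
	log        LogStore
	threshold  int // e.g. compact once the log holds this many entries
	inProgress atomic.Bool
}

// MaybeCompact is called from the main Raft loop and returns immediately;
// at most one compaction runs at a time.
func (c *Compactor) MaybeCompact() {
	if c.log.Size() < c.threshold || !c.inProgress.CompareAndSwap(false, true) {
		return
	}
	go func() {
		defer c.inProgress.Store(false)
		data, lastApplied := c.sm.Snapshot()
		c.log.SaveSnapshot(data, lastApplied)
		c.log.TruncateBefore(lastApplied + 1)
	}()
}
```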

7. Staff Math: The Mechanics of Agreement

Consensus isn’t magic; it’s a game of majority intersections.

7.1. The Quorum Intersect Limit

Why do we use $\lfloor N/2 \rfloor + 1$? Because it is the smallest quorum size that guarantees any two quorums share at least one node:

$$\textbf{Safety Property:}\quad Q_1 \cap Q_2 \neq \emptyset$$

  • Split-Brain Math: If you misconfigure a 6-node cluster to use a quorum of 3 ($q=3$) instead of 4:
    • Dangerous Splits: There are $\binom{6}{3} = 20$ ways to choose which 3 nodes land on one side of an even 3–3 split, and only the even splits leave a quorum on both sides.
    • Probability: Out of the $2^6 - 2 = 62$ non-trivial two-way partitions, that is $20 / 62 \approx \mathbf{32\%}$: roughly one in three random partitions can produce two active leaders.
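
The arithmetic is easy to reproduce; this is plain Go, nothing Raft-specific:

```go
// Count the partitions of a 6-node cluster in which BOTH sides reach the
// (misconfigured) quorum of 3: only the even 3-3 splits qualify.
package main

import "fmt"

func binomial(n, k int) int {
	result := 1
	for i := 0; i < k; i++ {
		result = result * (n - i) / (i + 1) // exact at every step
	}
	return result
}

func main() {
	n, q := 6, 3
	dangerous := binomial(n, q) // 20 ways to pick the 3 nodes on one side
	partitions := (1 << n) - 2  // 62 non-trivial two-way splits
	fmt.Printf("%d / %d ≈ %.0f%%\n", dangerous, partitions,
		100*float64(dangerous)/float64(partitions)) // 20 / 62 ≈ 32%
}
```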

7.2. The Election Jitter Window

To avoid “Split-Vote Livelock,” Raft nodes use randomized election timeouts, with a window at least an order of magnitude wider than the network round-trip:

$$\text{Election Timeout} \gtrsim 10 \times \text{Network RTT}$$

  • Constraint: The randomized spread must be wide enough that when one node’s timer fires, it can usually win the election and send its first “Heartbeat” before any other node’s timer expires.
  • Example: If RTT is 10ms, use a jitter window of 150ms–300ms.
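
A Go sketch of how a node might pick its timeout within those bounds (the 150ms–300ms window is the illustrative range from the example above):

```go
// Randomized election timeout: each follower independently picks a value in
// [150ms, 300ms) so that one candidate usually wins and heartbeats the
// others before their timers fire.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	electionTimeoutMin = 150 * time.Millisecond
	electionTimeoutMax = 300 * time.Millisecond
)

func randomElectionTimeout() time.Duration {
	span := int64(electionTimeoutMax - electionTimeoutMin)
	return electionTimeoutMin + time.Duration(rand.Int63n(span))
}

func main() {
	for i := 0; i < 3; i++ {
		fmt.Println(randomElectionTimeout()) // three different values in [150ms, 300ms)
	}
}
```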

7.3. Commit Latency (The Quorum Median)

In a global 5-node cluster, your write latency is governed by the $3^{rd}$-fastest replica: the leader (whose own append is essentially free) plus its two quickest followers form the quorum.

$$\text{Write Latency} \approx \operatorname{median}(\text{RTT}_1, \text{RTT}_2, \text{RTT}_3, \text{RTT}_4, \text{RTT}_5)$$

  • Staff Insight: This makes Raft highly resilient to “Tail Latency.” Even if 2 nodes are having a major lag spike (p99), your system’s p99 write latency will only be the p50 of the healthy majority.
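
A toy Go example with made-up RTTs shows why the straggler never appears in the commit latency (assuming the leader’s own log append is effectively free, so it needs acks from just 2 of its 4 followers):

```go
// Quorum commit latency: with a 5-node cluster the write commits when the
// 2nd-fastest follower acks, so the 400ms outlier is irrelevant.
package main

import (
	"fmt"
	"sort"
	"time"
)

func main() {
	followerRTT := []time.Duration{
		12 * time.Millisecond,  // same region
		15 * time.Millisecond,  // same region
		90 * time.Millisecond,  // cross-continent
		400 * time.Millisecond, // lag spike (p99 outlier)
	}
	sort.Slice(followerRTT, func(i, j int) bool { return followerRTT[i] < followerRTT[j] })

	acksNeeded := 2 // leader + 2 followers = majority of 5
	fmt.Println("commit latency ≈", followerRTT[acksNeeded-1]) // 15ms
}
```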

Staff Takeaway

Consensus is the bedrock of distributed safety.

  • Raft is the choice for 99% of new systems due to its understandability.
  • Pre-Vote is non-negotiable for large-scale production stability to prevent term-bombing.
  • Membership Changes are the most dangerous time for a cluster; always use a formal Two-Phase Joint Consensus strategy.
