Consensus (Paxos & Raft)

Imagine you are running a bank with three database servers. A customer deposits $100. Server A receives the deposit, but before it can tell Servers B and C, the network cable is cut. If the customer connects to Server B and tries to withdraw the $100, what happens? If the servers act independently, you end up with corrupt, divergent data—a “Split Brain.” To prevent this, distributed systems must agree on a single source of truth before confirming any operation. This agreement is called Consensus.

In 1989, Leslie Lamport devised Paxos and submitted the paper (“The Part-Time Parliament”) to ACM TOCS. Reviewers found its presentation confusing and too theoretical, and Lamport shelved it; the paper was not published until 1998. For nearly a decade, the algorithm sat largely unused. Google engineers later implemented Paxos inside Google Chubby, the distributed lock service built by Mike Burrows that became the coordination backbone for Bigtable and GFS (the engineering experience is recounted in Chandra, Griesemer, and Redstone’s “Paxos Made Live”). Google’s 2006 Chubby paper finally forced the wider industry to take Paxos seriously. But Paxos was notoriously difficult to implement correctly: even Lamport admitted the original paper was confusing. In 2014, Diego Ongaro published Raft in his Stanford PhD dissertation, with the explicit goal of designing a consensus algorithm more understandable than Paxos. Raft became the basis for etcd (Kubernetes’ brain), Consul (HashiCorp), and CockroachDB. Today Raft is the default consensus algorithm for anyone building a new distributed database. One PhD thesis changed how the entire cloud industry coordinates data.

[!IMPORTANT] In this lesson, you will master:

  1. The Replication Loop: Mastering the internal mechanics of AppendEntries (the Raft RPC the Leader uses to replicate log entries and send heartbeats) and CommitIndex that keep your data perfectly synced.
  2. Hardware Write Barriers: Understanding why every log append in the consensus path requires a synchronous fsync() to protect against physical power loss.
  3. Strict Linearizability: Learning why most systems trade performance for “Real-Time Consistency” and why it’s the most expensive property to maintain.

1. The Trinity: Paxos vs Raft vs ZAB

While Paxos is the academic holy grail, Raft is the engineer’s choice.

Analogy: Think of Consensus like a group of friends trying to agree on where to eat.

  • Paxos is like a chaotic group chat where anyone can suggest a restaurant at any time, leading to overlapping proposals, ignored messages, and a lot of confusion before a decision is made.
  • Raft is like electing one friend to be the “Leader” for the evening. The Leader chooses the restaurant and asks the others. As long as a majority nod their heads, the decision is final. If the Leader goes offline, they pause and elect a new one.
| Feature | Paxos (1989) | Raft (2014) | ZAB (ZooKeeper) |
| --- | --- | --- | --- |
| Philosophy | Theoretical purity. No specific leader required. | Understandability. A strong Leader is mandatory. | Primary-backup ordering (FIFO). |
| Complexity | Extreme (Multi-Paxos is hard to implement). | Moderate. Decomposed into Election & Log Replication. | High. |
| Use Case | Google Spanner, Cassandra (LWT). | Kubernetes (etcd), Consul, CockroachDB. | ZooKeeper: Kafka (old controller), Hadoop. |

[!TIP] Interview Strategy: Explain Raft. It has a linear narrative: Leader Election → Log Replication → Commit. Paxos is a mesh of proposers and acceptors that is easy to get wrong.

[!WARNING] War Story: The GitHub Split-Brain (2018) Before adopting rigorous consensus protocols everywhere, GitHub experienced a major network partition. Their primary and backup databases lost connection to each other, but both still accepted writes. When the network was restored, the two databases had entirely different histories, a classic “split-brain” scenario. Resolving the conflicting data required hours of painful manual reconciliation and significant downtime. This is exactly the nightmare that Raft and Paxos are designed to prevent by strictly enforcing Quorum (majority rule) before committing any write.


2. Raft Log Replication

Once a Leader is elected, it handles all Client requests. The goal: Replicate SET X=5 to a majority of followers.

The Commit Flow

  1. Client → Leader: SET X=5. The Leader appends the entry to its own log. Leader log: [X=5] (Uncommitted).
  2. Leader → Followers: AppendEntries RPC. Each Follower appends the entry to its log. Follower logs: [X=5] (Uncommitted).
  3. Followers → Leader: Success (ACK). Once a majority (Quorum) has acknowledged, the Leader marks the entry COMMITTED, applies it, and replies to the Client.
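To make the three steps concrete, here is a minimal, illustrative Go sketch of the leader side of this loop. The `Leader` type, `Propose`, and `sendAppendEntries` are hypothetical names for this lesson, not the API of etcd/raft or any real library; terms, log-matching checks, and error handling are omitted.

```go
package main

import "fmt"

// Entry is a single Raft log entry.
type Entry struct {
	Term    int
	Command string // e.g. "SET X=5"
}

// Leader holds only the fields needed to illustrate the commit flow.
type Leader struct {
	log         []Entry
	commitIndex int            // highest log index known to be committed
	matchIndex  map[string]int // per follower: highest index known to be replicated there
	peers       []string       // follower IDs
}

// Propose appends the command locally (step 1), replicates it via
// AppendEntries (step 2), and commits once a majority, leader included,
// has persisted the entry (step 3).
func (l *Leader) Propose(term int, cmd string) {
	l.log = append(l.log, Entry{Term: term, Command: cmd}) // uncommitted on the leader
	idx := len(l.log)                                      // 1-based log index

	acks := 1 // the leader counts itself
	for _, p := range l.peers {
		if l.sendAppendEntries(p, idx) { // real Raft keeps retrying this RPC on failure
			l.matchIndex[p] = idx
			acks++
		}
	}

	if acks > (len(l.peers)+1)/2 { // majority of the full cluster reached
		l.commitIndex = idx
		fmt.Printf("committed %q at index %d\n", cmd, idx)
	}
}

// sendAppendEntries is a stand-in for the real RPC; assume it returns true
// only after the follower has appended AND fsync'ed the entry.
func (l *Leader) sendAppendEntries(peer string, idx int) bool { return true }

func main() {
	l := &Leader{matchIndex: map[string]int{}, peers: []string{"follower-1", "follower-2"}}
	l.Propose(1, "SET X=5")
}
```

Running it prints `committed "SET X=5" at index 1`; in a real implementation the leader also pushes its new CommitIndex to followers in the next AppendEntries (the heartbeat), which is how followers learn the entry is committed.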

3. Interactive Demo: The Replication Loop

Cyberpunk Mode: Visualize the commit flow.

  • Mission: Replicate SET X=99 to the cluster.
  • Obstacle: Network Partition.
  • Visuals: Watch the logs turn from Yellow (Uncommitted) to Green (Committed).

[!TIP] Try it yourself:

  1. Click “Send SET X=99”. Watch the Leader replicate to Followers, get ACKs, and Commit.
  2. Click “Network” to cut the connection.
  3. Try sending again. Watch the Leader retry endlessly because it can’t reach a Quorum.
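The “retry endlessly” behavior in step 3 falls straight out of the majority rule: with the network cut, the Leader can never collect enough ACKs to advance its commit index, so it stalls rather than committing unsafely. Below is a small sketch of that check, reusing the hypothetical `Leader` type and `sendAppendEntries` stub from the previous example.

```go
// hasQuorum reports whether acks (leader included) form a majority of the
// cluster. In a 3-node cluster the threshold is 2: a leader cut off from
// both followers (acks == 1) can never reach it.
func hasQuorum(acks, clusterSize int) bool {
	return acks > clusterSize/2
}

// replicateUntilCommitted keeps retrying AppendEntries until a majority holds
// the entry. During a partition this loop never succeeds, so the entry stays
// uncommitted (yellow) instead of being committed unsafely.
func (l *Leader) replicateUntilCommitted(idx int) {
	for {
		acks := 1 // the leader counts itself
		for _, p := range l.peers {
			if l.sendAppendEntries(p, idx) {
				acks++
			}
		}
		if hasQuorum(acks, len(l.peers)+1) {
			l.commitIndex = idx
			return
		}
		// back off here before retrying (omitted for brevity)
	}
}
```

This is also why an entry that was stuck during a partition can show up as committed once the network heals: the Leader never abandoned it, it simply could not reach a Quorum yet.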

4. Deep Dive: Linearizability vs Serializability

This is the most confusing topic in Distributed Systems.

  • Linearizability (Strong Consistency): A real-time guarantee. If Operation A completes at time T, any Operation B that starts after T MUST see A. It makes the system behave like a Single Copy of Data. Raft provides this via Quorum Reads or Lease-based Reads to avoid the bottleneck of a full consensus round for every GET.
  • Serializability (Isolation): A transaction guarantee. If Transactions A and B run concurrently, the result must be as if they ran one after the other (A then B, or B then A). It does not care about real-time.

[!TIP] Raft provides Linearizability. Databases (like PostgreSQL) provide Serializability. Spanner provides BOTH (Strict Serializability) using TrueTime hardware (GPS/Atomic Clocks).
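To make the Raft side concrete, here is a hedged sketch of a ReadIndex-style linearizable read (the “Quorum Read” mentioned above): the leader captures its current commit index, confirms it is still leader by hearing from a quorum, waits for its state machine to catch up, and only then answers. Every name below (`node`, `quorumAlive`, `LinearizableGet`) is hypothetical, not a real library’s API.

```go
package main

import (
	"errors"
	"fmt"
)

// node is a toy stand-in for a Raft leader; this sketches the ReadIndex idea.
type node struct {
	commitIndex  int
	appliedIndex int
	kv           map[string]string
	quorumAlive  func() bool // "did a quorum answer my latest heartbeat round?"
}

// LinearizableGet serves a read without appending a log entry.
func (n *node) LinearizableGet(key string) (string, error) {
	readIndex := n.commitIndex // capture the commit point as of "now"

	// Confirm leadership with a quorum; a deposed leader might otherwise
	// serve stale data and break the real-time guarantee.
	if !n.quorumAlive() {
		return "", errors.New("not leader: redirect the client")
	}

	// Wait until the state machine has applied everything up to readIndex;
	// the read then reflects every write that completed before it began.
	for n.appliedIndex < readIndex {
		// in a real system: block on an apply notification
	}
	return n.kv[key], nil
}

func main() {
	n := &node{commitIndex: 1, appliedIndex: 1,
		kv:          map[string]string{"X": "5"},
		quorumAlive: func() bool { return true }}
	v, _ := n.LinearizableGet("X")
	fmt.Println("X =", v) // X = 5
}
```

etcd uses this ReadIndex approach for its linearizable reads; lease-based reads go one step further and skip the quorum round-trip by trusting a time-bounded leadership lease, which in turn depends on bounded clock drift.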

5. Summary

  • Commit happens only when a Majority acknowledges.
  • Linearizability is guaranteed by reading from the Leader (who checks with a quorum).

Staff Engineer Tip: Consensus is not a “Software” bottleneck; it’s a Physics Bottleneck. To guarantee durability, Raft MUST perform an fsync() (a physical disk flush) on a majority of nodes (2 of 3 in a typical three-node cluster) before it can reply to a client. Even with NVMe SSDs, this introduces a “Commit Latency” of ~1ms. If your app requires 100,000 writes per second, one-Raft-entry-per-request will fail you. You must use Batching (combining, say, 100 client requests into 1 Raft log entry) to amortize the per-entry flush cost and stay within physical limits.
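Here is a hedged sketch of that batching idea: collect whatever client commands arrive within a short window (or up to a size cap) and propose them as a single Raft entry, so one fsync on each node covers many writes. The `Batcher` type and the `propose` callback are hypothetical, not a real library’s API.

```go
package main

import (
	"fmt"
	"time"
)

// Batcher groups client commands so that one Raft entry (and therefore one
// fsync on each node in the quorum) covers many writes.
type Batcher struct {
	in      chan string
	propose func(batch []string) // stand-in for "append one Raft log entry"
}

// run drains the input channel, cutting a batch at maxBatch commands or when
// the batching window expires, whichever comes first.
func (b *Batcher) run(maxBatch int, window time.Duration) {
	for cmd := range b.in {
		batch := []string{cmd}
		timer := time.NewTimer(window)
	collect:
		for len(batch) < maxBatch {
			select {
			case next := <-b.in:
				batch = append(batch, next)
			case <-timer.C:
				break collect // window expired: ship what we have
			}
		}
		timer.Stop()
		b.propose(batch) // e.g. 100 client writes -> 1 log entry -> 1 fsync per node
	}
}

func main() {
	b := &Batcher{
		in:      make(chan string, 1024),
		propose: func(batch []string) { fmt.Printf("proposing %d commands as one entry\n", len(batch)) },
	}
	go b.run(100, 2*time.Millisecond)

	for i := 0; i < 250; i++ {
		b.in <- fmt.Sprintf("SET K%d=%d", i, i)
	}
	time.Sleep(50 * time.Millisecond) // demo only: let the batcher drain
}
```

The trade-off is a small added latency (the batching window) per request in exchange for an order-of-magnitude gain in commit throughput.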

Mnemonic — “Paxos = Theory, Raft = Practice”: Paxos (1989, rejected, finally published 1998) = mathematical purity, multiple proposers, hard to implement. Raft (2014) = understandability first, single strong leader, decomposed into Election + Log Replication. For interviews: always explain Raft. For financial systems (Spanner): Paxos. Commit flow: Client → Leader → AppendEntries RPC → Followers ACK → Leader Commits → Leader notifies Followers. Every log append requires fsync(). Use batching to survive the Physics Tax.