Distributed Leader Election

In October 2018, GitHub’s MySQL cluster suffered a split-brain during a routine network maintenance window. A brief network partition between the US-East and US-West data centers made each side believe the other was dead, and each region independently elected a new MySQL master. For 43 seconds, both masters accepted writes — diverging the dataset. GitHub engineers then spent more than 24 hours manually resolving 946 conflicting rows to reconstruct a consistent state. The incident shocked the industry: GitHub was already using Orchestrator (an automated failover tool), but automatic failover without the right quorum mechanism can be worse than no failover at all. The fix: quorum-based election rules that require (N/2) + 1 votes to promote a new leader, making it mathematically impossible for two partitions to hold a majority at the same time.

In this lesson, you will master:
  1. Consensus Stability: Why single-leader architectures are the standard way to avoid irreversible data corruption in CP systems.
  2. Hardware Clock Drift: Why Raft uses relative timers and logical epochs (Terms) instead of physical clocks, which drift and are subject to network jitter.
  3. Quorum Calculus: The (N/2)+1 math that prevents a catastrophic "Split Brain" during physical network partitions.

1. The Analogy: The Cyber-Council

Imagine a fleet of autonomous drones (nodes) executing a mission. They need to agree on a target. They cannot all shout commands at once.

  1. Election: The drones vote. The one with the strongest signal or highest ID becomes the Commander.
  2. Authority: Only the Commander issues strike orders. Everyone else listens.
  3. Re-election: If the Commander is shot down, the remaining drones detect the silence (Timeout) and hold a new vote.
  4. Term Limits: Each election starts a new “Epoch” (or Term). If an old Commander reconnects, they are demoted because their “Term” number is outdated.

2. The Enemy: Split Brain

What happens when the network fractures?

  • Sector A (2 nodes) thinks the Commander is dead.
  • Sector B (3 nodes) still sees the Commander.

If Sector A elects a new Commander, we now have Two Leaders. Both accept writes. The database diverges. This is Split Brain.

Detective Mode: GitHub Outage (2018)

This is exactly the incident from the introduction: a brief network partition caused the West Coast data center to elect a new MySQL leader while the East Coast leader was still active. Writes were accepted in both places, and reconciliation took over 24 hours of manual data surgery.

The Shield: Quorum (Majority Vote)

  • If the network splits into 2 vs 3, only the group of 3 can elect a leader. The group of 2 is frozen (cannot write).

Hardware-First Intuition: The "Physical Partition" tax. When a network switch fails or an SFP+ transceiver "flaps," the hardware doesn't just cut communication; it creates Intermittent Packet Loss. In a Raft cluster, this causes "Heartbeat Jitter." If the network delay fluctuates by 100ms, a Follower might perceive the Leader as dead even if the physical cable is still connected. To handle this, Staff Engineers tune the Election Timeout to be at least 10x the physical network RTT (Round Trip Time), ensuring that tiny hardware glitches don't trigger constant, expensive re-elections.
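The 10x rule can be sketched numerically. This is an illustrative Python sketch; the function names and the doubled randomization window are assumptions for demonstration, not any specific library's API:

```python
import random

def election_timeout_ms(rtt_ms, multiplier=10):
    """Return a (low, high) randomized election-timeout window in ms."""
    base = rtt_ms * multiplier  # floor: ordinary jitter can't look like death
    return (base, base * 2)     # randomized window avoids simultaneous timeouts

def pick_timeout(rtt_ms):
    """Each follower draws its own timeout from the window."""
    low, high = election_timeout_ms(rtt_ms)
    return random.uniform(low, high)

# A 15 ms data-center RTT yields a 150-300 ms window,
# matching the classic Raft defaults.
assert election_timeout_ms(15) == (150, 300)
```

The randomization matters as much as the floor: if every follower used the same timeout, they would all become candidates simultaneously and split the vote.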

Interactive: Quorum Checker

Click nodes to toggle their status (Online/Offline). See if the remaining nodes can form a Quorum.

Total Nodes: 5 | Quorum Needed: 3
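The rule behind the checker is a one-liner. A minimal sketch (function names are hypothetical):

```python
def quorum_size(n):
    """Minimum votes to elect a leader: floor(N/2) + 1."""
    return n // 2 + 1

def can_elect(online, total):
    """Only a partition holding a strict majority may elect."""
    return online >= quorum_size(total)

# 5-node cluster split 3/2: the 3-side elects, the 2-side freezes.
assert quorum_size(5) == 3
assert can_elect(3, 5) and not can_elect(2, 5)
```

Note that a 4-node cluster also needs 3 votes, which is why clusters are almost always sized with an odd number of nodes.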

3. The Solution: Raft Consensus Algorithm

Raft is the industry standard (used in Kubernetes/Etcd, Consul, CockroachDB). It turns the chaos of distributed consensus into a predictable state machine.

Node States

  1. Follower: Passive. Responds to requests from Leaders and Candidates.
  2. Candidate: Active. Campaigning for votes.
  3. Leader: Active. Handles all client requests and sends Heartbeats to suppress rebellion.

The Election Process

  1. Heartbeat Timeout: Every follower has a random countdown timer (e.g., 150-300ms).
  2. Trigger: If the timer hits zero with no Heartbeat from a Leader, the Follower assumes the Leader is dead.
  3. Campaign:
    • Increment Term (Epoch).
    • Vote for self.
    • Send RequestVote RPC to everyone.
  4. Victory: If it receives votes from a majority, it becomes Leader.
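The campaign steps above can be sketched as a vote-counting loop. This is a deliberately simplified sketch: real Raft additionally rejects candidates whose log is less up-to-date than the voter's, which is omitted here:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    node_id: int
    term: int = 0
    voted_for: Optional[int] = None

    def handle_request_vote(self, candidate_id, candidate_term):
        """Grant at most one vote per term; reject stale terms."""
        if candidate_term < self.term:
            return False
        if candidate_term > self.term:      # newer term: reset our vote
            self.term = candidate_term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            return True
        return False

def campaign(candidate, peers):
    """Steps 1-4: bump term, vote for self, request votes, count majority."""
    candidate.term += 1
    candidate.voted_for = candidate.node_id
    votes = 1 + sum(p.handle_request_vote(candidate.node_id, candidate.term)
                    for p in peers)
    return votes >= (len(peers) + 1) // 2 + 1   # quorum of the full cluster

peers = [Node(i) for i in range(2, 6)]
assert campaign(Node(1), peers) is True    # first candidate wins
assert campaign(Node(0), peers) is False   # same term: peers already voted
```

The one-vote-per-term rule is what makes two winners in the same term impossible.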

4. Interactive Demo: Raft Cluster Simulator

Interactive Simulator: Visualize the election process.

  • Green Ring: Leader (Sending Heartbeats).
  • Yellow Pulse: Candidate (Campaigning).
  • Red Border: Dead/Disconnected.

Try it yourself:
  1. Observe the Heartbeats (pulsing rings) from the Leader.
  2. Click "Kill Leader" to assassinate the current leader.
  3. Watch a Follower time out, turn Yellow (Candidate), and request votes.
  4. See a new Leader emerge (Green) once it secures 3 votes.
  5. Bonus: Click "Partition Network" to split the cluster. Notice how the minority partition cannot elect a leader!

5. Alternative: The Bully Algorithm

Before Raft, many systems (like old MongoDB) used the Bully Algorithm. Simple, but brutal.

How it works

  1. Rank: Every node has a unique ID. Higher ID = “Bigger Bully”.
  2. Election: When a node suspects the leader is dead, it sends ELECTION to all nodes with Higher IDs.
  3. Victory: If no one higher responds, it declares “I am the Boss” (Victory Message).
  4. Takeover: If a node with a higher ID comes online, it immediately “bullies” the current leader and takes over.

Why Raft Won: The Bully Algorithm suffers from "Flapping". If the highest-ID node has a flaky connection, it will constantly trigger re-elections, destabilizing the cluster. Raft is "sticky"—a leader stays leader as long as it's healthy, regardless of ID.
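Because any higher node takes over, the bully outcome reduces to a one-liner. A sketch that elides the ELECTION/OK/COORDINATOR message exchange:

```python
def bully_election(starter_id, alive_ids):
    """Simplified bully outcome: the highest alive ID always wins."""
    higher = {i for i in alive_ids if i > starter_id}
    return max(higher) if higher else starter_id

# Node 2 calls an election; node 5 answers and takes over.
assert bully_election(2, {1, 2, 3, 5}) == 5
# With node 5 down, node 3 is the biggest bully left.
assert bully_election(2, {1, 2, 3}) == 3
```

The flapping problem is visible here: whenever node 5's link recovers, the winner changes back to 5, forcing yet another takeover.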

6. Advanced Raft Optimizations

In production systems (like Etcd), basic Raft isn’t enough. We need speed and safety.

A. Pre-Vote (The Disruptive Node Problem)

Imagine a node gets partitioned. It can’t see the Leader, so it increments its Term (e.g., from 10 to 100) and keeps voting for itself. When it reconnects, its high Term forces the valid Leader to step down, causing a useless election.

  • Solution: Before incrementing the Term, a candidate asks: “Is anyone else there?” (Pre-Vote). If no one responds, it doesn’t disrupt the cluster.
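A hypothetical sketch of the Pre-Vote gate: the candidate counts non-binding pre-votes and only starts a real election if a majority would say yes:

```python
def pre_vote_passes(peer_grants, cluster_size):
    """Start a real election only if a majority would grant a pre-vote.

    The candidate polls peers at (term + 1) WITHOUT incrementing its own
    term, so a partitioned node that hears nobody stays quiet instead of
    inflating its term and disrupting the cluster on reconnect.
    """
    return 1 + sum(peer_grants) >= cluster_size // 2 + 1  # +1: own vote

# Partitioned node in a 5-node cluster: zero responses, no disruption.
assert pre_vote_passes([], 5) is False
# Healthy node with two peer grants: safe to campaign for real.
assert pre_vote_passes([True, True], 5) is True
```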

B. Check Quorum (Lease Read)

To serve a “Strongly Consistent Read” without writing to the log (which is slow), the Leader must confirm it is still the Leader.

  • Mechanism: The Leader periodically pings a quorum. “Do you still see me?” If yes, it can serve reads locally. If no, it steps down.
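The check can be sketched as a freshness test on the last quorum acknowledgment (a simplified model; production implementations tie this window to their heartbeat and election intervals):

```python
def can_serve_lease_read(last_quorum_ack_s, lease_s, now_s):
    """Leader serves local reads only while its quorum confirmation is fresh.

    last_quorum_ack_s: monotonic time the leader last heard "I still see
    you" from a majority. Past the window it must re-confirm or step down.
    """
    return (now_s - last_quorum_ack_s) < lease_s

# Quorum acked 2 s ago with a 5 s lease: read locally, skip the log write.
assert can_serve_lease_read(100.0, 5.0, now_s=102.0) is True
# Acked 6 s ago: lease stale, fall back to a full quorum check.
assert can_serve_lease_read(100.0, 5.0, now_s=106.0) is False
```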

7. Advanced: Leader Leases

What if the Leader is slow but not dead? A “Zombie Leader” might still try to write to the DB. Leader Leases solve this by using time-bound authority.

  1. Leader obtains a “Lease” (e.g., 10 seconds).
  2. It can only write if Current_Time < Lease_Expiry.
  3. It must renew the lease before it expires (e.g., at 5 seconds).
  4. If it crashes, the system waits for the lease to expire (up to 10 seconds) before electing a new leader.
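The lease rules above reduce to a single time comparison. A minimal toy model; real systems must also budget for clock skew between nodes, which this sketch ignores:

```python
class LeaderLease:
    """Time-bound write authority (toy model; ignores clock skew)."""

    def __init__(self, duration_s):
        self.duration_s = duration_s
        self.expiry = 0.0

    def renew(self, now_s):
        # Step 3: renew well before expiry (e.g., at the halfway point).
        self.expiry = now_s + self.duration_s

    def may_write(self, now_s):
        # Step 2: a write is legal only while Current_Time < Lease_Expiry.
        return now_s < self.expiry

lease = LeaderLease(duration_s=10.0)
lease.renew(now_s=0.0)
assert lease.may_write(now_s=5.0) is True    # mid-lease: still the leader
assert lease.may_write(now_s=11.0) is False  # expired: stop writing
```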

This leads directly into Distributed Locking, our next mission.

8. The Fencing Problem: “Authority is not absolute”

Electing a leader is only 50% of the battle. The other 50% is ensuring the storage hardware respects that leadership.

Elite Warning: The Zombie Leader. If a Leader undergoes a long GC pause, its heartbeats stop while it is frozen, so the cluster elects a new Leader at a higher Term. When the old Leader wakes up, it still thinks it is the leader and tries to write to the DB. Without further protection, this leads to irreversible data corruption.

The Shield: Fencing Tokens

To prevent zombie writes, every leader election must generate a Fencing Token (a monotonic sequence number).

  1. Leader A gets Token 34.
  2. Leader A freezes. Leader B is elected with Token 35.
  3. Leader B writes to the DB. The DB records “Last seen token: 35”.
  4. Leader A wakes up and tries to write with Token 34.
  5. The Fencing Rule: The storage layer rejects any write with a token < the highest seen token.
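The fencing rule is simple to express at the storage layer. A minimal sketch reproducing the Token 34/35 scenario (class and method names are illustrative):

```python
class FencedStore:
    """Storage layer that enforces the fencing rule on every write."""

    def __init__(self):
        self.highest_token = 0
        self.data = {}

    def write(self, token, key, value):
        # The Fencing Rule: reject any token below the highest seen.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True

store = FencedStore()
assert store.write(34, "x", "from-leader-A") is True   # Leader A, Token 34
assert store.write(35, "x", "from-leader-B") is True   # Leader B, Token 35
assert store.write(34, "x", "zombie-write") is False   # frozen A wakes: rejected
assert store.data["x"] == "from-leader-B"
```

The critical property is that the check lives in the storage layer itself, so it holds no matter how confused the old leader is.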

Without Fencing Tokens, Leader Election is just an “Efficiency Suggestion,” not a “Consistency Guarantee.”

9. Summary

  • Quorums: (N/2)+1 is the magic number to ensure only one partition can proceed.
  • Pre-Vote: Prevents disruptive nodes from forcing unnecessary elections.
  • Fencing: The physical data-layer check that prevents “Zombie Leaders” from corrupting state.

Staff Engineer Tip: Raft is a “Global Consensus” protocol, which means intensive network heartbeats. If you have 1,000 nodes, Raft will saturate your NIC Bandwidth with just heartbeats. For massive clusters, engineers use Gossip Protocols (like SWIM or memberlist) for health checks and “Eventual Consensus,” while reserving Raft only for a tiny, elected “Control Plane” of 3-5 nodes (like Kubernetes’ Etcd) to manage the global state.

Hardware Nuance: SFP+ Flapping. In high-performance data centers, a faulty SFP+ optical transceiver can “flap”—rapidly connecting and disconnecting at the microsecond level. This creates a stream of “Candidate Timeouts” only for the nodes connected to that specific switch. If your Raft cluster doesn’t have Pre-Vote enabled, this flapping hardware can force the entire global cluster into constant, expensive leader re-elections, effectively DDOSing your own control plane.

Mnemonic — “Quorum = (N/2)+1 = Safety”: 5 nodes → need 3 to elect. Split 2/3 → only the 3 can elect. Split 3/2 → only the 3 can elect. Key: minority freezes, doesn’t corrupt. Raft states: Follower (passive) → Candidate (timeout, vote for self) → Leader (heartbeats). Term = epoch (logical clock, prevents zombie leaders). Heartbeat timeout » Network RTT (10x rule). GitHub lesson: Orchestrator auto-promote = Split Brain risk. Always use fencing tokens.