Resilience: Surviving Split Brain

[!NOTE] This module explores the core principles of Resilience: Surviving Split Brain, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Nightmare: Split Brain

Imagine a 2-node cluster (Node A, Node B). The network cable between them is cut.

  • Node A thinks: “Node B is dead. I am Master.”
  • Node B thinks: “Node A is dead. I am Master.”

Result: You now have two clusters, and the application writes data to both. When the network returns, the two copies have diverged and cannot be merged, so some of those writes are lost.


2. The Solution: Quorum (floor(N/2) + 1)

To be Master, a node needs votes from a majority of Master-Eligible nodes. The division is integer division, so the result is rounded down before adding 1.

  • 2 Nodes: floor(2/2) + 1 = 2.
      • If the nodes are disconnected, neither has 2 votes. The cluster blocks writes. Correct.
      • But if 1 node dies, the survivor has only 1 vote and cannot elect a Master. The cluster is unavailable. Not High Availability.
  • 3 Nodes: floor(3/2) + 1 = 2.
      • If 1 node dies, the remaining 2 form a quorum. The cluster survives.

Rule: Always use 3 Master-Eligible Nodes (or 5, 7…). Never 2.
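
The quorum arithmetic above can be checked with a minimal Python sketch. The function names `quorum` and `tolerated_failures` are hypothetical helpers introduced for illustration; they are not part of any Elasticsearch API.

```python
def quorum(master_eligible: int) -> int:
    """Votes needed to elect a master: floor(N/2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1  # // is integer (floor) division


def tolerated_failures(master_eligible: int) -> int:
    """Master-eligible nodes that can be lost while a quorum still exists."""
    return master_eligible - quorum(master_eligible)


if __name__ == "__main__":
    # Print the quorum size and failure tolerance for small clusters.
    for n in range(1, 8):
        print(f"{n} nodes: quorum={quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this model, 4 nodes tolerate the same single failure as 3 nodes, which is one way to see why odd counts are the usual recommendation.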


3. Interactive: Election Simulator

See how network partitions affect the cluster.

[Interactive simulator: Node 1, Node 2, Node 3. Initial status: Healthy (3/3 Connected).]
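
The effect of a partition on a three-node cluster can also be sketched in a few lines of Python. This is a simplified model that only counts reachable nodes per partition; it assumes every node is master-eligible and ignores the timing and voting details of a real election. The names `electing_partitions` and `describe` are hypothetical.

```python
def electing_partitions(total_nodes, partitions):
    """Return the partitions (lists of node names) that hold a majority.

    At most one partition can qualify, because a majority is more than
    half of total_nodes.
    """
    needed = total_nodes // 2 + 1
    return [group for group in partitions if len(group) >= needed]


def describe(total_nodes, partitions):
    """Summarise what happens to the cluster for a given partition layout."""
    winners = electing_partitions(total_nodes, partitions)
    if not winners:
        return "no partition has a quorum, writes are blocked everywhere"
    return f"master elected among {winners[0]}, other partitions block writes"


if __name__ == "__main__":
    scenarios = {
        "healthy": [["node1", "node2", "node3"]],
        "node3 isolated": [["node1", "node2"], ["node3"]],
        "all isolated": [["node1"], ["node2"], ["node3"]],
    }
    for name, parts in scenarios.items():
        print(f"{name}: {describe(3, parts)}")
```

Because no two disjoint partitions can both reach a majority, this rule never produces two Masters, which is exactly the property that prevents split brain.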

4. Troubleshooting: Red vs Yellow Cluster

  • Green: All primary and replica shards are assigned.
  • Yellow: All primaries are assigned, but some replicas are unassigned.
      • Cause: For example, you have 1 node but asked for 1 replica (a replica cannot live on the same node as its primary).
  • Red: Some primary shards are unassigned. Their data is unavailable.
      • Cause: For example, multiple nodes crash simultaneously and take every copy of a shard with them.
      • Fix: Start with GET _cluster/allocation/explain, which reports why a shard is unassigned.
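
The colour rules above reduce to two checks, sketched here in Python. The `Shard` dataclass and `cluster_color` function are hypothetical illustrations of the rule, not Elasticsearch's implementation or the shape of its API responses.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    index: str      # index the shard belongs to
    number: int     # shard number within the index
    primary: bool   # True for a primary, False for a replica
    assigned: bool  # True if the shard is allocated to a node


def cluster_color(shards):
    """Derive the health colour from shard assignment alone."""
    # Any unassigned primary means some data is unavailable.
    if any(s.primary and not s.assigned for s in shards):
        return "red"
    # All primaries are assigned, but at least one replica is not.
    if any(not s.primary and not s.assigned for s in shards):
        return "yellow"
    return "green"


if __name__ == "__main__":
    # Single-node cluster with 1 replica requested: the replica has nowhere to go.
    single_node = [
        Shard("logs", 0, primary=True, assigned=True),
        Shard("logs", 0, primary=False, assigned=False),
    ]
    print(cluster_color(single_node))
```

The single-node example reproduces the Yellow case described above: the primary is assigned, but its replica cannot be placed on the same node.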