Resilience: Surviving Split Brain
[!NOTE] This module explores the core principles of Resilience: Surviving Split Brain, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Nightmare: Split Brain
Imagine a 2-node cluster (Node A, Node B). The network cable between them is cut.
- Node A thinks: “Node B is dead. I am Master.”
- Node B thinks: “Node A is dead. I am Master.”
Result: You now have TWO clusters, and the application writes data to both. When the network returns, the data has diverged. The conflicting writes cannot be merged automatically, so one side's writes are lost.
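The scenario above can be reproduced with a toy model. This is only an illustrative sketch, not how any real cluster is implemented: the `Node` class, the `elect_without_quorum` rule, and the `order-42` key are all hypothetical names invented for the example.

```python
class Node:
    """Toy node: a name, a master flag, and an in-memory key/value store."""

    def __init__(self, name):
        self.name = name
        self.is_master = False
        self.data = {}

    def elect_without_quorum(self, peer_reachable):
        # Naive rule (the bug): "if I cannot see my peer, I must be the master".
        if not peer_reachable:
            self.is_master = True

    def write(self, key, value):
        if not self.is_master:
            raise RuntimeError(f"{self.name} is not master")
        self.data[key] = value


node_a, node_b = Node("A"), Node("B")

# The cable is cut: neither node can reach the other, so both promote themselves.
node_a.elect_without_quorum(peer_reachable=False)
node_b.elect_without_quorum(peer_reachable=False)

# Clients on each side of the partition update the same document.
node_a.write("order-42", "shipped")
node_b.write("order-42", "cancelled")

# Keys whose values disagree between the two "masters".
diverged = {k for k in node_a.data if node_b.data.get(k) != node_a.data[k]}
print(diverged)  # {'order-42'}
```

Both values were acknowledged to a client, and nothing in the data says which one should win.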
2. The Solution: Quorum (floor(N/2) + 1)
To be Master, a node needs votes from a majority of Master-Eligible nodes. The division rounds down.
- 2 Nodes: floor(2/2) + 1 = 2.
  - If disconnected, neither has 2 votes. Cluster blocks writes. Correct.
  - BUT, if 1 node dies, the survivor has only 1 vote and cannot elect a Master. The cluster is unavailable. Not High Availability.
- 3 Nodes: floor(3/2) + 1 = 2.
  - If 1 node dies, remaining 2 form a quorum. Cluster survives.
Rule: Use an odd number of Master-Eligible Nodes, at least 3 (3, 5, 7…). Never 2.
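The quorum arithmetic is easy to check for other cluster sizes. The helper names `quorum` and `tolerated_failures` below are made up for this sketch. The only thing taken from the text is the majority formula, with the division rounded down.

```python
def quorum(master_eligible: int) -> int:
    """Votes needed for a majority: floor(N / 2) + 1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def tolerated_failures(master_eligible: int) -> int:
    """How many master-eligible nodes can be lost while a quorum remains."""
    return master_eligible - quorum(master_eligible)


for n in range(1, 8):
    print(f"nodes={n} quorum={quorum(n)} tolerated_failures={tolerated_failures(n)}")
```

Running this shows why odd sizes are preferred: 4 nodes tolerate 1 failure, the same as 3, and 6 tolerate 2, the same as 5. The extra even node adds cost without adding fault tolerance.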
3. Interactive: Election Simulator
See how network partitions affect the cluster.
[Interactive simulator: Node 1, Node 2, Node 3. Initial status: Healthy (3/3 Connected)]
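For readers without the interactive widget, the same experiment can be approximated in a few lines. This is a simplified sketch that assumes a partition is just a list of node groups that can still reach each other. The function names and the `node-1`/`node-2`/`node-3` labels are hypothetical, and real election protocols involve much more (terms, timeouts, cluster state versions).

```python
def electable_groups(all_nodes, partitions):
    """Return the partitions that hold a majority of all master-eligible nodes."""
    needed = len(all_nodes) // 2 + 1
    return [group for group in partitions if len(group) >= needed]


def status(all_nodes, partitions):
    winners = electable_groups(all_nodes, partitions)
    if not winners:
        return "No master: writes blocked"
    # Two disjoint groups cannot both hold a majority, so at most one winner exists.
    group = winners[0]
    return f"Master elected among {sorted(group)} ({len(group)}/{len(all_nodes)} connected)"


nodes = {"node-1", "node-2", "node-3"}

# Healthy cluster: everyone can reach everyone.
print(status(nodes, [{"node-1", "node-2", "node-3"}]))
# node-3 is cut off: the pair keeps a quorum, the isolated node cannot win.
print(status(nodes, [{"node-1", "node-2"}, {"node-3"}]))
# Every link is down: nobody has 2 votes, so no master is elected anywhere.
print(status(nodes, [{"node-1"}, {"node-2"}, {"node-3"}]))
```

Whatever the partition shape, at most one side can ever elect a master, which is exactly the property that prevents split brain.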
4. Troubleshooting: Red vs Yellow Cluster
- Green: All Primary & Replica shards are assigned.
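- Yellow: All Primaries are assigned, but some Replicas are unassigned. Data is fully available, but redundancy is reduced.
  - Cause: You have 1 node but asked for 1 Replica (a Replica is never allocated on the same node as its Primary).
- Red: Some Primary shards are unassigned. Data in those shards is unavailable.
  - Cause: Multiple nodes crash simultaneously, taking down every copy of a shard.
  - Fix: Start by asking the cluster why the shard is unassigned:
    GET _cluster/allocation/explain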
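The colour rules can be written as a small decision function. This is a toy model of the rules as stated in this section, not the cluster's actual implementation. The `Shard` dataclass, its fields, and the `logs` index name are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Shard:
    index: str
    number: int
    primary: bool
    assigned: bool


def cluster_color(shards):
    """Red if any primary is unassigned, yellow if only replicas are, else green."""
    if any(s.primary and not s.assigned for s in shards):
        return "red"
    if any(not s.primary and not s.assigned for s in shards):
        return "yellow"
    return "green"


# Single node, one primary plus one requested replica: the replica has nowhere to go.
single_node = [
    Shard("logs", 0, primary=True, assigned=True),
    Shard("logs", 0, primary=False, assigned=False),
]
print(cluster_color(single_node))  # yellow
```

Note the order of the checks: a single unassigned primary makes the whole cluster red, regardless of how many replicas are healthy.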