High Availability & Failover
Redis Cluster is designed to survive node failures. Every Master node should have at least one Replica. If the Master fails, the Replica promotes itself to become the new Master, ensuring the cluster remains available.
1. The Election Process
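The failover mechanism borrows from the Raft consensus algorithm: epochs play the role of Raft terms, and a replica must collect votes from a majority of masters before it can be promoted.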
- Detection: A majority of masters report the Master as unreachable (PFAIL), and the cluster marks it as FAIL.
- Request: One of the failed Master’s replicas initiates an election. It bumps its currentEpoch and broadcasts a FAILOVER_AUTH_REQUEST.
- Voting: The remaining healthy Masters vote. They grant a vote (FAILOVER_AUTH_ACK) if:
  - The request comes from a replica of the failed master.
  - The replica’s data is fresh enough.
  - The master hasn’t voted for anyone else in this epoch.
- Promotion: If a replica receives votes from the majority of masters, it promotes itself, takes over the hash slots, and broadcasts a PONG to update the cluster configuration.
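The voting and promotion steps above can be sketched as a small Python model. This is an illustration only, not Redis source code: the names AuthRequest, VotingMaster, grant_vote, and run_election are hypothetical, the freshness check is reduced to a boolean flag, and the sketch assumes the majority is counted over all masters, including the failed one.

```python
from dataclasses import dataclass


@dataclass
class AuthRequest:
    """Simplified stand-in for a FAILOVER_AUTH_REQUEST (hypothetical model)."""
    replica: str      # replica asking for votes
    master: str       # master this replica was replicating
    epoch: int        # the replica's bumped currentEpoch
    data_fresh: bool  # stand-in for the "fresh enough" check


@dataclass
class VotingMaster:
    """A healthy master that can grant at most one vote per epoch."""
    name: str
    last_vote_epoch: int = 0

    def grant_vote(self, req: AuthRequest, failed: set) -> bool:
        # Condition 1: the request must come from a replica of a failed master.
        if req.master not in failed:
            return False
        # Condition 2: the replica's data must be fresh enough.
        if not req.data_fresh:
            return False
        # Condition 3: this master has not already voted in this epoch.
        if req.epoch <= self.last_vote_epoch:
            return False
        self.last_vote_epoch = req.epoch
        return True


def run_election(req: AuthRequest, voters: list, failed: set,
                 total_masters: int) -> bool:
    """Return True if the replica wins a majority of all masters."""
    votes = sum(1 for m in voters if m.grant_vote(req, failed))
    return votes >= total_masters // 2 + 1
```

In this model, with three masters A, B, and C where A has failed, a replica of A needs votes from both B and C. A second replica of A that asks for votes in the same epoch is refused, because each master has already recorded a vote for that epoch.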
2. Split Brain Protection
What happens if the network partitions, and both sides try to elect a master? Or if a client writes to the old master while a new one is being elected?
To limit how much data can be lost in these cases, you can configure min-replicas-to-write together with min-replicas-max-lag:
min-replicas-to-write 1
min-replicas-max-lag 10
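This ensures that a Master accepts writes only if it is connected to at least 1 replica whose last acknowledgement arrived within the past 10 seconds. If a partition isolates a Master from its replicas, it stops accepting writes once that lag limit is exceeded. This bounds the window in which a “Split Brain” can occur, where two versions of history diverge.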
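The rule behind these two settings can be illustrated with a minimal Python sketch. It is a model of the check, not Redis code: the function accepts_writes and its parameters are hypothetical names, and each entry in replica_lags is assumed to be the number of seconds since that replica last acknowledged the master.

```python
def accepts_writes(replica_lags, min_replicas=1, max_lag=10):
    """Return True if a master would accept writes under this model.

    replica_lags: seconds since each connected replica's last acknowledgement.
    min_replicas: models min-replicas-to-write.
    max_lag:      models min-replicas-max-lag (seconds).
    """
    # Setting either option to 0 disables the check.
    if min_replicas == 0 or max_lag == 0:
        return True
    # Count only replicas that acknowledged recently enough.
    good = sum(1 for lag in replica_lags if lag <= max_lag)
    return good >= min_replicas
```

For example, accepts_writes([2, 30]) is True because one replica is within the 10 second limit, while accepts_writes([30]) and accepts_writes([]) are both False: an isolated master has no sufficiently recent replica and refuses writes.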
3. Summary
- Automatic Recovery: When a Master fails, Redis Cluster promotes one of its replicas without human intervention.
- Consensus: Masters vote to authorize promotions, preventing split-brain scenarios.
- Safety: Using min-replicas-to-write adds an extra layer of data safety during partitions.