High Availability & Failover

Redis Cluster is designed to survive node failures. Every Master should have at least one Replica. If a Master fails, one of its Replicas is promoted to Master through an election, keeping the cluster available.
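The end result of a failover can be pictured as a reassignment of hash slots. This is a minimal sketch (hypothetical data structures, not Redis internals) of what "the Replica takes over" means:

```python
# Hypothetical cluster map: each node owns a role and a set of hash slots.
cluster = {
    "master-1": {"role": "master", "slots": range(0, 5461), "replicas": ["replica-1"]},
    "replica-1": {"role": "replica", "slots": None, "replicas": []},
}

def promote(cluster, failed, replica):
    """Move the failed Master's hash slots to the promoted Replica."""
    cluster[replica]["role"] = "master"
    cluster[replica]["slots"] = cluster[failed]["slots"]
    cluster[failed]["role"] = "failed"
    cluster[failed]["slots"] = None

promote(cluster, "master-1", "replica-1")
print(cluster["replica-1"]["role"])  # master
```

In the real cluster this reassignment only happens after the election described below succeeds; the sketch shows only the final state change.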

1. The Election Process

The failover mechanism borrows ideas from the Raft consensus algorithm (epochs and majority voting), though it is not a full Raft implementation.

  1. Detection: Each Master independently flags the unreachable node as PFAIL (possible failure); once a majority of Masters agree, the flag is promoted to FAIL cluster-wide.
  2. Request: One of the failed Master’s replicas initiates an election. It bumps its currentEpoch and broadcasts a FAILOVER_AUTH_REQUEST.
  3. Voting: The remaining healthy Masters vote. They grant a vote (FAILOVER_AUTH_ACK) if:
    • The request comes from a replica of the failed master.
    • The replica’s data is fresh enough (its replication offset is not too far behind the failed master).
    • The master hasn’t voted for anyone else in this epoch.
  4. Promotion: If a replica receives votes from the majority of masters, it promotes itself, takes over the hash slots, and broadcasts a PONG to update the cluster configuration.
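The voting rules above can be sketched as a single predicate. This is an illustrative simplification (hypothetical field names, not Redis source): a master grants at most one vote per epoch, and only to a sufficiently fresh replica of the failed master.

```python
def grant_vote(master_state, request):
    """Return True if this master grants FAILOVER_AUTH_ACK for the request."""
    if request["replica_of"] != request["failed_master"]:
        return False  # must come from a replica of the failed master
    if request["repl_offset"] < master_state["min_fresh_offset"]:
        return False  # replica's data is too stale
    if master_state["voted_epoch"] >= request["epoch"]:
        return False  # already voted in this (or a later) epoch
    master_state["voted_epoch"] = request["epoch"]
    return True

# Three healthy masters, none of which has voted in epoch 5 yet.
masters = [{"min_fresh_offset": 100, "voted_epoch": 4} for _ in range(3)]
request = {"replica_of": "m1", "failed_master": "m1",
           "repl_offset": 150, "epoch": 5}

votes = sum(grant_vote(m, request) for m in masters)
print(votes >= len(masters) // 2 + 1)  # True: majority reached, replica promotes
```

Note how recording `voted_epoch` makes a second request in the same epoch fail, which is what prevents two replicas of the same master from both winning.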

2. Split Brain Protection

What happens if the network partitions, and both sides try to elect a master? Or if a client writes to the old master while a new one is being elected?

To prevent data loss, you can configure min-replicas-to-write.

min-replicas-to-write 1
min-replicas-max-lag 10

This ensures that a Master accepts writes only while it is connected to at least 1 replica whose last acknowledgment is no more than 10 seconds old. If a partition isolates a Master from its replicas, it stops accepting writes once the lag window expires, preventing a “Split Brain” where two diverging copies of the data both accept writes.
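The gate that this configuration implies can be sketched as follows (simplified logic, not Redis source; the constants mirror the two settings above):

```python
MIN_REPLICAS_TO_WRITE = 1
MIN_REPLICAS_MAX_LAG = 10  # seconds

def can_accept_writes(replica_lags):
    """replica_lags: seconds since each connected replica's last ACK."""
    fresh = sum(1 for lag in replica_lags if lag <= MIN_REPLICAS_MAX_LAG)
    return fresh >= MIN_REPLICAS_TO_WRITE

print(can_accept_writes([2]))   # True: one fresh replica is enough
print(can_accept_writes([]))    # False: isolated master refuses writes
print(can_accept_writes([45]))  # False: the only replica is too far behind
```

The trade-off is availability for safety: an isolated Master returns errors to writers instead of accepting writes that would be lost when a new Master is elected on the other side of the partition.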

3. Interactive: Failover Simulation

Watch a Replica promote itself when its Master fails.

[Interactive widget: cluster status view showing Master 1 (Leader) with Replica 1 (Follower), plus Master 2 and Master 3 (Leaders); status line reads “Cluster Healthy.”]

4. Summary

  • Automatic Recovery: Redis Cluster heals itself without human intervention.
  • Consensus: Masters vote to authorize promotions, preventing split-brain scenarios.
  • Safety: Using min-replicas-to-write adds an extra layer of data safety during partitions.