Sentinel and Failover

You’ve secured your data with RDB and AOF, ensuring that if a server restarts, the data is not lost. But what if the server hardware itself dies or the network cable is severed? Your application will still experience downtime, unable to process writes or reads until manual intervention occurs. To prevent this single point of failure, we use Redis Sentinel.

Think of a Redis Master as a bank vault manager, and Replicas as the assistant managers. If the main manager unexpectedly leaves, an assistant needs to step up. However, the assistants don’t have the authority to promote themselves. They need a board of directors (Sentinels) to constantly monitor the manager, agree when the manager is truly gone, and officially vote one of the assistants into the manager role.

1. What is Sentinel?

Redis Sentinel is a distributed system designed to act as this “board of directors.” It monitors your Redis instances and performs the following mission-critical tasks:

  • Monitoring: Periodically checks whether your Master and Replica instances are working as expected by sending them PING commands.
  • Notification: Alerts your application or system administrator when something goes wrong (via Pub/Sub channels).
  • Automatic Failover: If the Master is confirmed down, Sentinel promotes a suitable Replica (normally the most up-to-date one) to be the new Master and points the remaining Replicas at it.
  • Configuration Provider: Acts as a source of truth for client libraries. Clients connect to Sentinels to ask “Who is the current Master?”, allowing them to seamlessly redirect traffic after a failover.
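The Configuration Provider role can be sketched as a small client-side lookup loop. This is a minimal illustration, not a real client: `discover_master`, `ask`, and `fake_ask` are hypothetical names, the addresses are made up, and the network call is replaced by a stand-in function. In a real client, `ask` would send `SENTINEL get-master-addr-by-name <master-name>` to one Sentinel.

```python
from typing import Callable, Iterable, Optional, Tuple

Address = Tuple[str, int]


def discover_master(
    sentinels: Iterable[Address],
    master_name: str,
    ask: Callable[[Address, str], Optional[Address]],
) -> Address:
    """Return the Master address reported by the first Sentinel that answers.

    `ask` is a hypothetical transport hook. It should query one Sentinel and
    return (host, port), or None if that Sentinel does not know the Master.
    """
    for sentinel in sentinels:
        try:
            address = ask(sentinel, master_name)
        except OSError:
            continue  # this Sentinel is unreachable, try the next one
        if address is not None:
            return address
    raise RuntimeError(f"no Sentinel could resolve master {master_name!r}")


# Stand-in for real network calls, used only to exercise the function.
def fake_ask(sentinel: Address, master_name: str) -> Optional[Address]:
    if sentinel == ("10.0.0.1", 26379):
        raise ConnectionError("sentinel down")
    return ("10.0.0.20", 6379)


current_master = discover_master(
    [("10.0.0.1", 26379), ("10.0.0.2", 26379)], "mymaster", fake_ask
)
print(current_master)
```

The first Sentinel in the list is unreachable, so the loop falls through to the second one. Client libraries with Sentinel support typically perform this lookup for you and repeat it when a connection to the Master breaks.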

2. SDOWN, ODOWN, and The Quorum

Sentinel relies on agreement between several Sentinel processes to avoid catastrophic mistakes, such as a split-brain scenario in which both sides of a network partition believe they hold the valid Master.

To achieve this, Sentinel differentiates between a node that seems down to one observer and a node that enough Sentinels agree is down:

  • SDOWN (Subjectively Down): A single Sentinel node stops receiving valid PONG replies from the Master for a configured threshold (down-after-milliseconds). To this specific Sentinel, the Master is dead. However, this could just be a localized network issue.
  • ODOWN (Objectively Down): To verify the death, the Sentinel asks other Sentinels (via the SENTINEL is-master-down-by-addr command) if they also see the Master as down. If enough Sentinels agree, the state is escalated to ODOWN.
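The two states can be sketched as two small checks. This is a simplified model under stated assumptions, not Sentinel's actual implementation: the function names are hypothetical, timestamps are plain integers in milliseconds, and the peer answers are passed in as a list of booleans instead of being collected over the network. The value 30000 for `down_after_ms` is only an example.

```python
from typing import Sequence


def is_sdown(last_valid_reply_ms: int, now_ms: int, down_after_ms: int) -> bool:
    """One Sentinel's local view: no valid reply for longer than the threshold."""
    return now_ms - last_valid_reply_ms > down_after_ms


def is_odown(local_sdown: bool, peer_votes: Sequence[bool], quorum: int) -> bool:
    """Escalate to ODOWN when this Sentinel plus agreeing peers reach the quorum."""
    if not local_sdown:
        return False
    agreeing = 1 + sum(peer_votes)  # count ourselves, then each peer that agrees
    return agreeing >= quorum


# Example: last valid reply at t=0, now t=31000, threshold 30000 ms.
sdown = is_sdown(last_valid_reply_ms=0, now_ms=31_000, down_after_ms=30_000)

# One of two peers also sees the Master as down; the quorum is 2.
odown = is_odown(sdown, peer_votes=[True, False], quorum=2)

# No peer agrees: the Master stays SDOWN for this Sentinel only.
isolated = is_odown(sdown, peer_votes=[False, False], quorum=2)

print(sdown, odown, isolated)
```

The third case is the localized network issue described above: one Sentinel loses sight of the Master, but without agreement from its peers nothing is escalated.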

The number of Sentinels required to agree is called the Quorum.

  • If you have 3 Sentinels and a Quorum of 2, at least 2 Sentinels must agree to reach ODOWN.
  • Crucial Rule: While a Quorum is enough to detect a failure, an actual failover requires a majority of all Sentinels to elect a Leader (e.g., 2 out of 3, or 3 out of 5). A failover cannot start while the majority of the Sentinel fleet is unreachable.
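The difference between detecting a failure and being allowed to act on it comes down to simple arithmetic. The sketch below is an illustration with hypothetical function names. It assumes `agreeing` is the number of Sentinels that both see the Master as down and can reach each other, and that the Leader needs votes from a majority and, if the Quorum is set higher than the majority, from at least that many Sentinels.

```python
def majority(total_sentinels: int) -> int:
    """Smallest number of Sentinels that is more than half of the fleet."""
    return total_sentinels // 2 + 1


def failover_status(total_sentinels: int, agreeing: int, quorum: int) -> str:
    """Classify what a group of agreeing, mutually reachable Sentinels can do."""
    if agreeing < quorum:
        return "failure not confirmed"
    # Assumption: election needs the larger of the majority and the quorum.
    votes_needed = max(majority(total_sentinels), quorum)
    if agreeing < votes_needed:
        return "ODOWN reached, failover not authorized"
    return "failover can be authorized"


# Five Sentinels with a Quorum of 2.
for agreeing in (1, 2, 3):
    print(agreeing, failover_status(total_sentinels=5, agreeing=agreeing, quorum=2))
```

With five Sentinels and a Quorum of 2, two Sentinels are enough to mark the Master as ODOWN, but a third must be reachable before a Leader can be elected.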

3. The Failover Process Deep Dive

Once ODOWN is reached and a Sentinel Leader is elected, the failover proceeds through a strict sequence:

  1. Replica Selection: The Leader cannot just pick any Replica. It evaluates them based on:
    • Disconnection Time: Rejecting Replicas that have been disconnected from the Master for too long.
    • Priority: Checking the replica-priority configuration (lower number is better; 0 means never promote).
    • Replication Offset: Picking the Replica that has consumed the most data from the Master (most up-to-date).
    • Run ID: As a tie-breaker, picking the Replica with the lexicographically smallest Run ID.
  2. Promotion (REPLICAOF NO ONE): The chosen Replica is sent a command to stop replicating and become a Master.
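  3. Configuration Epoch: The new configuration is versioned with the Configuration Epoch (a version number similar to a Raft term) that the Leader obtained when it was elected. The Leader broadcasts the new Master’s details under this epoch so that the other Sentinels adopt the newest configuration.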
  4. Reconfiguration: All other Replicas are commanded to start replicating from the new Master.
  5. Client Redirect: Clients querying Sentinel for the Master are given the new IP/Port.
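The Replica Selection step maps naturally onto a filter followed by a sort. The following is a minimal sketch under stated assumptions: the `Replica` class, its field names, and the single `max_disconnect_ms` cutoff are hypothetical simplifications, and the real Sentinel applies additional health checks before ranking candidates.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    run_id: str
    priority: int         # replica-priority; 0 means never promote
    repl_offset: int      # how much of the Master's data has been consumed
    disconnected_ms: int  # time since the link to the Master was lost


def select_replica(replicas: List[Replica], max_disconnect_ms: int) -> Optional[Replica]:
    """Pick the promotion candidate, or None if no Replica qualifies."""
    # Filter: drop Replicas that must never be promoted or are too stale.
    candidates = [
        r for r in replicas
        if r.priority != 0 and r.disconnected_ms <= max_disconnect_ms
    ]
    if not candidates:
        return None
    # Rank: lowest priority number, then highest offset, then smallest Run ID.
    return min(candidates, key=lambda r: (r.priority, -r.repl_offset, r.run_id))


replicas = [
    Replica("c3", priority=100, repl_offset=5000, disconnected_ms=200),
    Replica("a1", priority=100, repl_offset=5000, disconnected_ms=300),
    Replica("b2", priority=100, repl_offset=4000, disconnected_ms=100),
    Replica("d4", priority=0, repl_offset=9000, disconnected_ms=50),       # never promote
    Replica("e5", priority=10, repl_offset=9000, disconnected_ms=600_000),  # too stale
]

chosen = select_replica(replicas, max_disconnect_ms=60_000)
print(chosen.run_id if chosen else None)
```

Here `d4` and `e5` are filtered out despite having the highest offsets, `b2` loses on offset, and the tie between `a1` and `c3` is broken by Run ID, so `a1` is chosen.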

4. Interactive: Sentinel Failover

Kill the Master and watch Sentinel promote a Replica. Notice the transition from healthy PINGs to an ODOWN state, followed by promotion.

[Interactive diagram: a MASTER and a REPLICA, both watched by a Sentinel Quorum. Initial state: Replica offset ready; status healthy (PING/PONG).]

5. Summary

Sentinel is a vital component of the Redis ecosystem for applications requiring high availability. By introducing consensus and automated failover, it transforms a manually managed database into a self-healing system capable of surviving hardware failures, network partitions, and unexpected crashes with minimal downtime.