Topology & Gossip

Unlike Redis Sentinel, which relies on separate monitoring processes, Redis Cluster is a Peer-to-Peer (P2P) system. Every node keeps a connection to every other node, forming a full mesh: in a cluster of N nodes, each node holds N-1 outgoing and N-1 incoming connections.

This architecture eliminates the need for a central proxy or load balancer, removing a critical Single Point of Failure (SPOF).

1. The Cluster Bus

Every Redis Cluster node opens two TCP ports:

  1. Client Port (e.g., 6379): Accepts commands from clients (GET, SET).
  2. Cluster Bus Port (e.g., 16379): Used for node-to-node communication.

The Cluster Bus uses a compact binary protocol to exchange information about the cluster state. This traffic is low-bandwidth but high-frequency.

[!IMPORTANT] Firewall Rules: If you open port 6379 but block 16379, the nodes cannot reach each other and your cluster will not form. Always allow node-to-node traffic on the Cluster Bus port, which by default is the Client Port + 10000.
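The port arithmetic above can be sketched in a few lines. This is only an illustration of the default "+10000" rule; the function names `cluster_bus_port` and `ports_to_allow` are hypothetical helpers, not part of Redis or any client library.

```python
MAX_PORT = 65535
DEFAULT_BUS_OFFSET = 10000  # default gap between the client port and the Cluster Bus port


def cluster_bus_port(client_port: int, offset: int = DEFAULT_BUS_OFFSET) -> int:
    """Return the Cluster Bus port that pairs with a client port."""
    bus_port = client_port + offset
    if not (0 < client_port <= MAX_PORT) or bus_port > MAX_PORT:
        raise ValueError(f"client port {client_port} leaves no room for a bus port")
    return bus_port


def ports_to_allow(client_ports: list[int]) -> list[int]:
    """List every port a firewall must allow for the given nodes (hypothetical helper)."""
    allowed = set()
    for port in client_ports:
        allowed.add(port)                    # client traffic (GET, SET, ...)
        allowed.add(cluster_bus_port(port))  # node-to-node traffic
    return sorted(allowed)


print(ports_to_allow([6379, 6380]))  # [6379, 6380, 16379, 16380]
```

One consequence of the default offset is that a client port above 55535 has no valid Cluster Bus port, which is why the sketch rejects it.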

2. The Gossip Protocol

Nodes use a Gossip Protocol to propagate information.

About once per second, each node:

  1. Pings a few random nodes.
  2. Sends its own view of the cluster (who is master, who is replica).
  3. Sends information about other nodes it has recently communicated with.

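Because every node that learns of a state change (like a new node joining or a node failing) passes it on, the number of informed nodes grows roughly exponentially with each round, and the change reaches the whole cluster within a few rounds.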
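To get a feel for how quickly gossip spreads, here is a minimal simulation. It assumes a simplified push model in which every informed node tells `fanout` random peers per round; the real Redis peer selection and packet contents are more involved, and `simulate_gossip`, `fanout`, and `seed` are names made up for this sketch.

```python
import random


def simulate_gossip(node_count: int, fanout: int = 3, seed: int = 0,
                    max_rounds: int = 1000) -> list[int]:
    """Return how many nodes know the update after each round (index 0 = start)."""
    rng = random.Random(seed)
    informed = {0}                 # node 0 is the first to learn the update
    history = [len(informed)]
    for _ in range(max_rounds):
        if len(informed) == node_count:
            break                  # everyone knows, nothing left to spread
        newly_informed = set()
        for node in informed:
            peers = [p for p in range(node_count) if p != node]
            # Each informed node tells a few random peers (assumed fanout).
            for peer in rng.sample(peers, min(fanout, len(peers))):
                newly_informed.add(peer)
        informed |= newly_informed
        history.append(len(informed))
    return history


history = simulate_gossip(node_count=64)
print(history)  # informed-node count per round, ending at 64
```

The early rounds show the count multiplying quickly; the last few rounds are slower because most random picks land on nodes that already know.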

3. Failure Detection: PFAIL vs FAIL

Redis Cluster uses a two-stage failure detection mechanism to prevent false positives.

Stage 1: PFAIL (Possible Failure)

If Node A tries to contact Node B and receives no response within the cluster-node-timeout, Node A marks Node B as PFAIL.

  • This is only a local view. Node A thinks Node B is down, but the problem may just be a network partition between A and B.
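The PFAIL rule can be sketched as a simple timeout check. This is a simplified model, not the Redis source: the `PeerView` structure and `update_pfail` function are hypothetical, and the 15000 ms value is the usual default for cluster-node-timeout.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_NODE_TIMEOUT_MS = 15_000  # usual default for cluster-node-timeout


@dataclass
class PeerView:
    """What one node remembers about a peer (hypothetical structure)."""
    name: str
    ping_sent_ms: Optional[int] = None  # when the unanswered ping was sent; None if none pending
    pfail: bool = False


def update_pfail(peer: PeerView, now_ms: int,
                 node_timeout_ms: int = DEFAULT_NODE_TIMEOUT_MS) -> bool:
    """Flag the peer as PFAIL if a ping has gone unanswered for longer than the timeout."""
    waiting = peer.ping_sent_ms is not None
    peer.pfail = waiting and (now_ms - peer.ping_sent_ms) > node_timeout_ms
    return peer.pfail


node_b = PeerView(name="B", ping_sent_ms=1_000)
print(update_pfail(node_b, now_ms=10_000))  # False: only 9 s without a reply
print(update_pfail(node_b, now_ms=17_000))  # True: 16 s exceeds the 15 s timeout
```

Note that the flag lives only in Node A's own `PeerView`; nothing here involves any other node, which is exactly what makes PFAIL a local judgment.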

Stage 2: FAIL (Confirmed Failure)

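Node A gossips this PFAIL state to other nodes, and they gossip their own views back. If Node A already considers Node B PFAIL and, within a limited time window, a majority of master nodes have also reported Node B as PFAIL or FAIL, Node A upgrades the state to FAIL.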

  • This reflects agreement among a majority of masters, not just one node's opinion.
  • Node A broadcasts a FAIL message to all reachable nodes.
  • Node B is now officially considered down. If it was a master with replicas, this triggers a failover.
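The majority rule for promoting PFAIL to FAIL can be sketched as a vote count. This is an illustrative model only: `needed_quorum` and `should_mark_fail` are hypothetical names, the sketch assumes a master counts its own view as one vote, and it leaves out the expiry of old failure reports.

```python
def needed_quorum(master_count: int) -> int:
    """Smallest number of masters that forms a strict majority."""
    return master_count // 2 + 1


def should_mark_fail(i_see_pfail: bool, i_am_master: bool,
                     reporting_masters: set[str], master_count: int) -> bool:
    """Decide whether this node may upgrade a peer from PFAIL to FAIL.

    reporting_masters holds the IDs of *other* masters whose recent gossip
    flagged the peer as PFAIL or FAIL.
    """
    if not i_see_pfail:
        return False  # our own view must agree first
    votes = len(reporting_masters) + (1 if i_am_master else 0)
    return votes >= needed_quorum(master_count)


# Node A is a master in a cluster with 5 masters and suspects Node B.
print(should_mark_fail(True, True, {"C"}, master_count=5))       # False: 2 of 5
print(should_mark_fail(True, True, {"C", "D"}, master_count=5))  # True: 3 of 5
```

Requiring a strict majority means that a minority of masters cut off by a partition cannot declare a healthy node failed on their own.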

4. Interactive: Gossip Propagation

[Interactive demo: three nodes (Node 1, Node 2, Node 3) start in the OK state with the cluster healthy; killing a node shows the PFAIL/FAIL state propagating via gossip.]

5. Summary

  • P2P Mesh: No central bottleneck.
  • Gossip: Efficient state propagation.
  • PFAIL/FAIL: Robust failure detection preventing false positives.

In the next chapter, we will see how clients interact with this distributed system.