Topology & Gossip
Unlike Redis Sentinel, which relies on a separate monitoring process, Redis Cluster is a Peer-to-Peer (P2P) system. Every node is connected to every other node in a full mesh topology.
This architecture eliminates the need for a central proxy or load balancer, removing a critical Single Point of Failure (SPOF).
1. The Cluster Bus
Every Redis Cluster node opens two TCP ports:
- Client Port (e.g., 6379): Accepts commands from clients (GET, SET).
- Cluster Bus Port (e.g., 16379): Used for node-to-node communication.
The Cluster Bus uses a compact binary protocol to exchange information about the cluster state. This traffic is low-bandwidth but high-frequency.
[!IMPORTANT] Firewall Rules: If you open port 6379 but block 16379, your cluster will not form. Always allow node-to-node traffic on the cluster bus port as well, which by default is the client port + 10000.
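The port convention above can be sketched in a few lines of Python. This is a minimal illustration that assumes the default offset of 10000; the helper names `cluster_bus_port` and `ports_reachable` are hypothetical and not part of any Redis client library.

```python
import socket

CLUSTER_BUS_OFFSET = 10000  # assumed default offset between client port and bus port


def cluster_bus_port(client_port: int, offset: int = CLUSTER_BUS_OFFSET) -> int:
    """Return the cluster bus port that corresponds to a client port."""
    bus_port = client_port + offset
    if not 0 < bus_port <= 65535:
        raise ValueError(f"bus port {bus_port} is outside the valid TCP range")
    return bus_port


def ports_reachable(host: str, client_port: int, timeout: float = 1.0) -> dict:
    """Attempt a TCP connect to both ports and report which ones accept it."""
    results = {}
    for port in (client_port, cluster_bus_port(client_port)):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, filtered, or timed out: treat the port as unreachable.
            results[port] = False
    return results
```

Calling `ports_reachable("10.0.0.5", 6379)` from another cluster host (the address is a placeholder) would show whether a firewall is blocking 16379 while 6379 is open. A successful TCP connect only proves the port is reachable, not that the cluster is healthy.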
2. The Gossip Protocol
Nodes use a Gossip Protocol to propagate information.
Roughly once per second, each node:
- Pings a few random nodes.
- Sends its own view of the cluster (who is master, who is replica).
- Sends information about other nodes it has recently communicated with.
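This lets state changes (such as a new node joining or a node failing) spread quickly: each node that learns of a change passes it on, so the number of informed nodes can grow roughly exponentially from one gossip round to the next.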
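To get a feel for how quickly gossip spreads, here is a toy simulation. It is a simplified push-gossip model, not Redis's actual peer-selection logic: every informed node tells `fanout` random peers per round. The function name `simulate_gossip` and its parameters are assumptions made for illustration.

```python
import random


def simulate_gossip(num_nodes: int, fanout: int = 3, seed: int = 0,
                    max_rounds: int = 1000) -> list:
    """Return the number of informed nodes after each round.

    Simplified model: every node that already knows about the change
    tells `fanout` randomly chosen peers in each round.
    """
    rng = random.Random(seed)
    informed = {0}                 # node 0 is the first to observe the change
    history = [len(informed)]
    for _ in range(max_rounds):
        if len(informed) == num_nodes:
            break
        newly_informed = set()
        for node in informed:
            peers = [n for n in range(num_nodes) if n != node]
            newly_informed.update(rng.sample(peers, min(fanout, len(peers))))
        informed |= newly_informed
        history.append(len(informed))
    return history
```

Printing `simulate_gossip(50)` shows the informed count per round. The early rounds grow fastest because most randomly chosen peers have not heard the news yet. Growth then slows as informed nodes increasingly pick peers that already know.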
3. Failure Detection: PFAIL vs FAIL
Redis Cluster uses a two-stage failure detection mechanism to prevent false positives.
Stage 1: PFAIL (Possible Failure)
If Node A tries to contact Node B and receives no response within the cluster-node-timeout, Node A marks Node B as PFAIL.
- This is a local view. Node A thinks Node B is down, but maybe it’s just a network partition between A and B.
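The local PFAIL decision can be sketched as simple per-peer bookkeeping. This is a hedged illustration, not Redis's internal data structure: the `PeerState` class and its method names are hypothetical, and times are plain floats in seconds.

```python
from typing import Optional


class PeerState:
    """Node A's local bookkeeping for one peer (hypothetical structure)."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.ping_sent_at: Optional[float] = None  # None: no ping outstanding
        self.pfail = False

    def on_ping_sent(self, now: float) -> None:
        # Remember only the oldest unanswered ping.
        if self.ping_sent_at is None:
            self.ping_sent_at = now

    def on_pong_received(self) -> None:
        # Any reply clears the suspicion.
        self.ping_sent_at = None
        self.pfail = False

    def check_timeout(self, now: float, node_timeout: float) -> bool:
        """Flag the peer as PFAIL if a ping has gone unanswered too long."""
        if self.ping_sent_at is not None and now - self.ping_sent_at > node_timeout:
            self.pfail = True
        return self.pfail
```

For example, with `node_timeout=15.0`, a ping sent at time 0 and still unanswered at time 15.5 flips `pfail` to `True`. A later pong resets it. Nothing here involves other nodes, which is why PFAIL alone is not enough to trigger a failover.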
Stage 2: FAIL (Confirmed Failure)
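Node A gossips this PFAIL state to other nodes, and they gossip their own observations back. If, within a limited time window, Node A collects PFAIL (or FAIL) reports about Node B from a majority of master nodes, it upgrades the state to FAIL.
- This is no longer a local view: a majority of masters agree that Node B is unreachable.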
- Node A broadcasts a FAIL message to all reachable nodes.
- Node B is now officially considered down, triggering a failover if it is a master with at least one replica.
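The promotion from PFAIL to FAIL can be sketched as a majority count over recent reports. This is a simplified model under stated assumptions, not Redis's implementation: the `FailureTracker` class, its method names, and the `report_validity` window are hypothetical, and Node A's own observation is recorded as just another report.

```python
from typing import Dict, Set


class FailureTracker:
    """Counts recent failure reports from masters (hypothetical sketch)."""

    def __init__(self, master_ids: Set[str], report_validity: float):
        self.master_ids = set(master_ids)
        self.report_validity = report_validity  # seconds a report stays fresh
        # suspect id -> {reporter id: time of latest report}
        self.reports: Dict[str, Dict[str, float]] = {}

    def add_report(self, suspect: str, reporter: str, now: float) -> None:
        # Only masters vote, and a node cannot vote on itself.
        if reporter in self.master_ids and reporter != suspect:
            self.reports.setdefault(suspect, {})[reporter] = now

    def is_failed(self, suspect: str, now: float) -> bool:
        """True once a majority of masters have a fresh report on `suspect`."""
        fresh = [
            reporter
            for reporter, ts in self.reports.get(suspect, {}).items()
            if now - ts <= self.report_validity
        ]
        quorum = len(self.master_ids) // 2 + 1
        return len(fresh) >= quorum
```

With five masters the quorum is three, so two reports about Node B are not enough and a third one tips the state to FAIL. Expiring old reports keeps stale suspicions from accumulating into a false positive.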
4. Summary
- P2P Mesh: No central bottleneck.
- Gossip: Efficient state propagation.
- PFAIL/FAIL: Robust failure detection preventing false positives.
In the next chapter, we will see how clients interact with this distributed system.