Gossip Protocol: The Cluster’s Heartbeat
In a decentralized system like Cassandra, there is no “master” node telling everyone what to do. So, how do nodes know which other nodes are alive, dead, or joining the cluster?
They gossip.
Every second, each node communicates with up to three other nodes to exchange state information. This epidemic-style protocol ensures that information (like “Node A is down” or “Node B has a new schema”) spreads through the cluster in O(log N) time.
1. How Gossip Works
- Peer Selection: Every second, a node selects a random peer to gossip with.
- SYN: The node sends a
SYNmessage containing a digest of its knowledge (node states and versions). - ACK: The peer compares the digest with its own. If it has newer information, it sends it back. If it’s missing information, it requests it. This is the
ACK. - ACK2: The original node updates its state and sends back any information the peer requested.
ACK2.
This 3-way handshake (SYN, ACK, ACK2) ensures that the latest state propagates rapidly.
The Math of Epidemics
Why is Gossip efficient? If one node knows a secret and tells 3 random friends, and they tell 3 random friends…
- Round 1: 1 node knows.
- Round 2: ~4 nodes know.
- Round 3: ~16 nodes know.
- Round 4: ~64 nodes know.
The information spreads exponentially. In a cluster of N nodes, it takes approximately O(log N) rounds for everyone to know. For a cluster of 1000 nodes, it takes only about 10 rounds (~10 seconds) for state to converge.
2. Seed Nodes & Topology
Since nodes need to know someone to start gossiping, Cassandra uses Seed Nodes.
- Bootstrapping: When a new node joins, it contacts the seed nodes to learn about the cluster topology.
- Gossip Convergence: Seed nodes help facilitate gossip traffic and ensure the cluster doesn’t fragment into partitioned islands.
- Best Practice: Configure 2-3 seed nodes per datacenter. Do not make every node a seed node (it slows down gossip).
[!NOTE] Snitches: The “Snitch” tells Cassandra which datacenter and rack a node belongs to. This topology information is also gossiped, allowing the driver to route requests efficiently (e.g., to the nearest replica).
Interactive Visualizer: Gossip Propagation
Simulate how a piece of information (e.g., “Node A is DOWN”) spreads across a cluster of nodes. Click “Start Gossip” to see the infection spread.
3. Failure Detection: Phi Accrual
How does a node know if a peer is down or just slow? Cassandra doesn’t use a fixed timeout (e.g., “if no response in 5s, mark dead”). Instead, it uses a mathematical model called Phi Accrual Failure Detection.
- Concept: It calculates a “suspicion level” (Phi φ) based on the history of inter-arrival times of gossip heartbeats.
- Formula:
φ = -log10(P), wherePis the probability that the heartbeat will arrive later than the current time. - Meaning:
- φ = 1: Maybe slow (~10% error rate).
- φ = 8: Extremely likely dead (~0.000001% chance it’s just slow).
- Threshold: By default, Cassandra marks a node DOWN when φ > 8.
This adaptive mechanism prevents “flapping” (nodes rapidly flipping between UP and DOWN) during temporary network congestion.
Code Implementation: Gossip State Exchange
Here’s a conceptual implementation of how two nodes might exchange state versions (Generation and Version) to determine who has the latest data.
import java.util.HashMap;
import java.util.Map;
public class GossipState {
private final String nodeId;
private int generation; // Timestamp of boot
private int version; // Incremented on every state change
public GossipState(String nodeId, int generation, int version) {
this.nodeId = nodeId;
this.generation = generation;
this.version = version;
}
// Compare local state with remote state
public String compare(GossipState remote) {
if (remote.generation > this.generation) {
return "Remote has newer restart. Overwrite local.";
} else if (remote.version > this.version) {
return "Remote has newer state updates. Request diff.";
} else if (remote.version < this.version) {
return "Local has newer state. Send update.";
}
return "In sync.";
}
public static void main(String[] args) {
GossipState localNode = new GossipState("192.168.1.10", 1620000000, 5);
GossipState remoteNode = new GossipState("192.168.1.10", 1620000000, 8);
System.out.println(localNode.compare(remoteNode));
}
}
package main
import "fmt"
type GossipState struct {
NodeID string
Generation int64 // Timestamp of boot
Version int // Incremented on state change
}
// Compare determines if we need to sync with remote
func (local GossipState) Compare(remote GossipState) string {
if remote.Generation > local.Generation {
return "Remote has newer restart. Overwrite local."
}
if remote.Version > local.Version {
return "Remote has newer state updates. Request diff."
}
if remote.Version < local.Version {
return "Local has newer state. Send update."
}
return "In sync."
}
func main() {
localNode := GossipState{
NodeID: "192.168.1.10",
Generation: 1620000000,
Version: 5,
}
remoteNode := GossipState{
NodeID: "192.168.1.10",
Generation: 1620000000,
Version: 8,
}
fmt.Println(localNode.Compare(remoteNode))
}
Summary
- Gossip is the mechanism for cluster state propagation.
- Seed Nodes act as introduction points for new nodes.
- Phi Accrual Failure Detector provides a flexible, adaptive way to detect node failures without rigid timeouts.