Anti-Entropy Repair
So far, we’ve talked about Write Consistency (Tunable Consistency) and Temporary Failure Handling (Hinted Handoff). But what happens if a node is down for weeks? Hints expire. And what if a disk silently corrupts data?
Distributed systems tend towards entropy (disorder/inconsistency). Anti-Entropy Repair is the process of comparing data across replicas and fixing inconsistencies to restore order.
Cassandra has two primary ways to fix data:
- Read Repair (Lazy, on-access)
- Anti-Entropy Repair (Active, scheduled maintenance)
1. Read Repair
Read Repair happens automatically during read requests.
- Client requests data with `CL=QUORUM`.
- Coordinator asks Replicas A, B, and C for the data.
- Replica A returns `Ver: 2, Val: "Foo"`.
- Replica B returns `Ver: 1, Val: "Bar"` (stale!).
- Action: The coordinator sees the mismatch. It returns the latest data (“Foo”) to the client immediately.
- Repair: In the background, the coordinator sends a write to Replica B to update it to “Foo”.
[!NOTE] This is great for “hot” data that is read frequently. But “cold” data (archived logs) might never be read, so it stays inconsistent forever without active repair.
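The read-repair flow above can be sketched as a small reconciliation routine. This is an illustrative model, not Cassandra’s actual coordinator code: `ReplicaResponse`, the single version number, and the repair queue are simplified stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a coordinator's read-repair logic.
// Class and field names here are illustrative, not Cassandra internals.
public class ReadRepairSketch {

    record ReplicaResponse(String replica, long version, String value) {}

    // Returns the freshest value and queues background repair writes
    // for any replica that returned stale data.
    static String reconcile(List<ReplicaResponse> responses, List<String> repairQueue) {
        ReplicaResponse latest = responses.get(0);
        for (ReplicaResponse r : responses) {
            if (r.version() > latest.version()) latest = r;
        }
        for (ReplicaResponse r : responses) {
            if (r.version() < latest.version()) {
                // Background write: push the winning value to the stale replica.
                repairQueue.add(r.replica() + " <- " + latest.value());
            }
        }
        return latest.value(); // returned to the client immediately
    }

    public static void main(String[] args) {
        List<String> repairs = new ArrayList<>();
        String result = reconcile(List.of(
                new ReplicaResponse("A", 2, "Foo"),
                new ReplicaResponse("B", 1, "Bar")), repairs);
        System.out.println(result);  // Foo
        System.out.println(repairs); // [B <- Foo]
    }
}
```

Cassandra actually reconciles individual cells by write timestamp rather than a single version number, but the shape is the same: serve the freshest copy now, heal the stale replica in the background.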
2. Active Repair: The Merkle Tree
To repair data that isn’t read often, you run `nodetool repair`. But wait: if you have 10TB of data, do you have to transfer 10TB across the network to compare it?
No. Cassandra uses Merkle Trees.
What is a Merkle Tree?
A Merkle Tree is a hash tree where:
- Leaves = Hashes of individual data blocks (rows).
- Branches = Hashes of their children.
- Root = Hash of the entire dataset.
To compare two massive datasets, you just compare the Root Hash.
- Match? The data is identical. Done. (O(1) comparison).
- Mismatch? Compare the children. Follow the path of the mismatch down to the specific leaf.
- Result: You only stream the specific rows that differ, not the whole dataset.
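A toy version of this comparison, assuming a power-of-two number of blocks and SHA-256 leaf hashes (Cassandra’s real trees hash token ranges rather than row arrays, but the descent is the same idea):

```java
import java.security.MessageDigest;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy Merkle tree over an array of row blocks (illustrative only).
public class MerkleSketch {

    static byte[] sha256(byte[] input) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(input);
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // Builds a perfect binary hash tree bottom-up. tree[1] is the root;
    // children of node i are 2i and 2i+1; leaves occupy the last n slots.
    static byte[][] build(String[] blocks) {
        int n = blocks.length; // assumed to be a power of two
        byte[][] tree = new byte[2 * n][];
        for (int i = 0; i < n; i++) tree[n + i] = sha256(blocks[i].getBytes());
        for (int i = n - 1; i >= 1; i--) {
            byte[] combined = new byte[tree[2 * i].length + tree[2 * i + 1].length];
            System.arraycopy(tree[2 * i], 0, combined, 0, tree[2 * i].length);
            System.arraycopy(tree[2 * i + 1], 0, combined, tree[2 * i].length, tree[2 * i + 1].length);
            tree[i] = sha256(combined);
        }
        return tree;
    }

    // Walks both trees from the root, descending only into mismatching
    // subtrees, and collects the indices of the differing leaf blocks.
    static void diff(byte[][] a, byte[][] b, int node, int n, List<Integer> out) {
        if (Arrays.equals(a[node], b[node])) return;  // identical subtree: prune
        if (node >= n) { out.add(node - n); return; } // differing leaf found
        diff(a, b, 2 * node, n, out);
        diff(a, b, 2 * node + 1, n, out);
    }

    public static void main(String[] args) {
        String[] replicaA = {"r0", "r1", "r2", "r3"};
        String[] replicaB = {"r0", "r1", "XX", "r3"}; // block 2 is corrupt
        List<Integer> bad = new ArrayList<>();
        diff(build(replicaA), build(replicaB), 1, 4, bad);
        System.out.println(bad); // [2]: only block 2 needs streaming
    }
}
```

Note how the left subtree (blocks 0–1) is pruned after a single hash comparison; only the mismatching path is explored.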
3. Interactive Visualizer: Merkle Tree Comparison
See how Cassandra efficiently finds the “bad block” without reading everything.
Panels: Replica A (Correct) vs. Replica B (Corrupt).
4. The Zombie Data Problem
Why do we need repair if we have hinted handoff?
When you delete data in Cassandra, it isn’t removed immediately. Instead, a marker called a Tombstone is written. This tombstone replicates to other nodes.
If a node is down when the tombstone is written, and it stays down longer than `gc_grace_seconds` (default: 10 days), it misses the tombstone. When it comes back up, it thinks it still has the data. If you don’t run repair, this old data can propagate back to the cluster. This is called Zombie Data.
Rule of Thumb: You MUST run a full repair on every node at least once within `gc_grace_seconds`.
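The resurrection scenario can be modeled in a few lines. Everything here is simplified for illustration: real Cassandra merges individual cells by write timestamp and tracks deletion metadata separately, but the failure mode is the same.

```java
// Illustrative timeline showing how a tombstone that is garbage-collected
// before a dead replica returns leads to zombie data.
public class ZombieSketch {

    static final long GC_GRACE_SECONDS = 10 * 24 * 3600; // default: 10 days

    // A value with a write timestamp; deleted == true means tombstone.
    record Cell(String value, long writtenAt, boolean deleted) {}

    // Last-write-wins merge between what two replicas hold for the same key.
    static Cell merge(Cell a, Cell b) {
        if (a == null) return b;
        if (b == null) return a;
        return a.writtenAt() >= b.writtenAt() ? a : b;
    }

    // Compaction drops tombstones older than gc_grace_seconds.
    static Cell compact(Cell c, long now) {
        if (c != null && c.deleted() && now - c.writtenAt() > GC_GRACE_SECONDS) return null;
        return c;
    }

    public static void main(String[] args) {
        long t0 = 0;
        Cell onDeadNode = new Cell("old-data", t0, false); // node goes down holding this
        Cell tombstone = new Cell(null, t0 + 100, true);   // delete issued while it is down

        // 11 days later: the tombstone is past gc_grace and gets compacted away.
        long now = t0 + 11L * 24 * 3600;
        Cell onHealthyNode = compact(tombstone, now);      // null: tombstone is gone

        // The dead node rejoins; its stale cell wins the merge: zombie data.
        Cell result = merge(onHealthyNode, onDeadNode);
        System.out.println(result.value()); // old-data
    }
}
```

Running a repair before the tombstone expires would have copied it to the dead node, so the delete would win the merge instead.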
5. Running Repairs
You trigger Anti-Entropy Repair using the `nodetool` utility.
Full Repair
Validates all data on the node. Very expensive (CPU/disk I/O). Because incremental repair is the default in modern Cassandra, you force a full repair explicitly:

```shell
nodetool repair -full
```
Incremental Repair (Default in modern Cassandra)
Only repairs data written since the last repair. Repaired SSTables are marked as “repaired” so they don’t need to be checked again. Since Cassandra 2.2 this is the default mode, so the plain command runs an incremental repair:

```shell
nodetool repair
```
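Conceptually, the “repaired” marker works like a filter over SSTables: only never-repaired ones feed the next round of Merkle-tree validation. The `SSTable` record and `repairedAt` field below are loose, illustrative stand-ins for Cassandra’s SSTable metadata.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of why incremental repair is cheap: only SSTables that have
// never been through a repair are validated again (illustrative only).
public class IncrementalRepairSketch {

    // repairedAt == 0 means the SSTable has never been repaired.
    record SSTable(String name, long repairedAt) {}

    static List<SSTable> toValidate(List<SSTable> all) {
        return all.stream()
                  .filter(s -> s.repairedAt() == 0)
                  .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<SSTable> tables = List.of(
                new SSTable("big-old-data.db", 1700000000L), // already repaired: skipped
                new SSTable("recent-writes.db", 0));         // new data: validated
        System.out.println(toValidate(tables).get(0).name()); // recent-writes.db
    }
}
```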
Targeted Repair (Specific Keyspace/Table)
Repair only a specific keyspace or table:

```shell
nodetool repair my_keyspace my_table
```
6. Programmatic Repair (JMX)
Sometimes you want to build your own repair scheduler (like Cassandra Reaper). You can use JMX (Java Management Extensions) to trigger repairs programmatically.
```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import java.util.HashMap;
import java.util.Map;

public class RepairTrigger {
    public static void main(String[] args) throws Exception {
        // Connect to Cassandra JMX (standard port 7199)
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:7199/jmxrmi");
        try (JMXConnector jmxc = JMXConnectorFactory.connect(url, null)) {
            MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();

            // Get the StorageService MBean
            ObjectName ssName = new ObjectName("org.apache.cassandra.db:type=StorageService");

            // Define repair options
            Map<String, String> options = new HashMap<>();
            options.put("ranges", "100-200"); // Optional: repair specific token ranges
            options.put("parallelism", "parallel");
            options.put("incremental", "true");

            // Invoke repairAsync.
            // The return value is an int command number.
            // In a real app, you would subscribe to JMX notifications to track progress.
            int cmd = (int) mbsc.invoke(ssName, "repairAsync",
                    new Object[]{ "my_keyspace", options },
                    new String[]{ "java.lang.String", "java.util.Map" });
            System.out.println("Repair command sent. Command ID: " + cmd);
        }
    }
}
```
[!TIP] Use Cassandra Reaper: Don’t run `nodetool repair` via cron scripts. Use a tool like Cassandra Reaper, which handles scheduling, pausing, and resuming repairs intelligently to avoid overloading your cluster.
7. Summary
- Read Repair fixes inconsistencies lazily when data is accessed.
- Anti-Entropy Repair (using `nodetool repair`) is active maintenance.
- Merkle Trees allow Cassandra to compare petabytes of data by only transferring small hashes.
- Zombie Data: Regular repair prevents deleted data from resurrecting.