Replication and ISR
Kafka’s primary promise is durability. Once a message is acknowledged, it should never be lost, even if servers catch fire. Kafka achieves this through Replication—storing multiple copies of every message on different physical brokers.
In this chapter, we’ll dissect how Kafka ensures data safety without sacrificing performance, using the concepts of Leaders, Followers, and the ISR (In-Sync Replicas).
1. The Anatomy of a Partition
Every partition in Kafka has a configured Replication Factor (usually 3). This means there are three copies of the data living on three different brokers.
1.1 Leader vs. Follower
- The Leader: One broker is elected as the “Leader” for the partition. All reads and writes go to the Leader. It is the single source of truth.
- The Followers: The other brokers are “Followers”. Their only job is to fetch data from the Leader and write it to their own local logs. They are passive.
> [!NOTE]
> Unlike some databases (e.g., MongoDB, Cassandra) where you might read from replicas to scale reads, Kafka consumers typically read only from the Leader (though “Follower Fetching” exists for cross-datacenter scenarios).
1.2 The In-Sync Replicas (ISR)
Not all followers are created equal. A follower might be down for maintenance, or it might be running slow due to network congestion. Kafka maintains a list of In-Sync Replicas (ISR).
- In-Sync: A follower that is caught up with the Leader (specifically, it has fetched up to the Leader’s log end offset within the time window set by `replica.lag.time.max.ms`).
- Out-of-Sync: A follower that has fallen too far behind.
The ISR is critical because only members of the ISR are eligible to become the new Leader if the current Leader fails. This guarantees no committed data is lost during a failover.
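The broker’s ISR membership rule can be modeled in a few lines. The sketch below is a simplified model of the `replica.lag.time.max.ms` check, not the broker’s actual code; the broker names, timestamps, and the 30-second window are illustrative values:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class IsrCheck {
    // A follower stays in the ISR only if it caught up to the Leader's
    // log end offset within the configured lag window.
    static List<String> computeIsr(Map<String, Long> lastCaughtUpMs, long nowMs, long maxLagMs) {
        List<String> isr = new ArrayList<>();
        for (Map.Entry<String, Long> e : lastCaughtUpMs.entrySet()) {
            if (nowMs - e.getValue() <= maxLagMs) {
                isr.add(e.getKey());
            }
        }
        return isr;
    }

    public static void main(String[] args) {
        // broker-2 caught up 1s ago; broker-3 has been stalled for 45s
        Map<String, Long> lastCaughtUp = Map.of("broker-2", 99_000L, "broker-3", 55_000L);
        // With a 30s lag window, only broker-2 remains in the ISR
        System.out.println(computeIsr(lastCaughtUp, 100_000L, 30_000L));
    }
}
```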
2. Interactive: The Replication Simulator
Use the simulator below to understand how messages flow from the Leader to the Followers and how the ISR list changes when a node fails.
[Interactive simulator: Broker 1 (Leader), Broker 2 (Follower), Broker 3 (Follower)]
2.1 The High Watermark (HW)
You might notice that a consumer cannot read a message the moment it hits the Leader. It must wait until the message is committed.
A message is considered committed when it has been replicated to all brokers in the current ISR. The offset of the last committed message is called the High Watermark.
> [!IMPORTANT]
> Consumers can ONLY read up to the High Watermark. This ensures that even if the Leader crashes, the consumer has only read data that is guaranteed to exist on the new Leader.
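The rule is simple arithmetic: the High Watermark is the smallest log-end offset among the ISR members (leader included). A toy calculation of that rule (a simplified sketch; real brokers update this incrementally as follower fetch requests arrive):

```java
import java.util.List;

public class HighWatermark {
    // HW = minimum log-end offset across all ISR members, leader included.
    static long highWatermark(List<Long> isrLogEndOffsets) {
        return isrLogEndOffsets.stream().mapToLong(Long::longValue).min().orElse(0L);
    }

    public static void main(String[] args) {
        // Leader has written up to offset 10; followers have replicated up to 8 and 9.
        // Consumers can therefore only read up to offset 8.
        System.out.println(highWatermark(List.of(10L, 8L, 9L)));
    }
}
```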
3. Configuring Durability
You can control durability using the acks (acknowledgments) setting on the Producer.
| Acks Setting | Description | Durability | Latency |
|---|---|---|---|
| `acks=0` | Producer fires and forgets; doesn’t wait for a server response. | None (risky) | Lowest |
| `acks=1` | Producer waits for the Leader to append the message to its local log. | Medium (safe only while the Leader lives) | Low |
| `acks=all` (or `-1`) | Producer waits for the Leader and all ISR members to acknowledge. | Maximum | Higher |
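In the Java producer, these are ordinary config properties. The sketch below builds a durability-oriented property set (the bootstrap address is a placeholder; constructing the actual `KafkaProducer` additionally requires the `kafka-clients` dependency):

```java
import java.util.Properties;

public class ProducerDurabilityConfig {
    static Properties durableProducerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder address
        props.put("acks", "all");                          // wait for the full ISR
        props.put("retries", Integer.toString(Integer.MAX_VALUE)); // retry transient failures
        props.put("enable.idempotence", "true");           // avoid duplicates on retry
        return props;
    }

    public static void main(String[] args) {
        System.out.println(durableProducerProps().getProperty("acks"));
    }
}
```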
3.1 The Magic Formula for Zero Data Loss
To guarantee zero data loss, you need three settings working together:
- Replication Factor ≥ 3
- `acks=all` on the Producer
- `min.insync.replicas` ≥ 2 (on the Broker or Topic config)
If min.insync.replicas=2 and only 1 broker is in the ISR, the Leader will reject writes with acks=all. This protects you from writing data that cannot be safely replicated.
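That acceptance rule can be sketched in a few lines (a simplified model of the broker’s behavior, not its actual code):

```java
public class MinIsrGuard {
    // With acks=all, the Leader rejects produce requests when the ISR has
    // shrunk below min.insync.replicas (the producer sees NotEnoughReplicasException).
    // The check only applies to acks=all; acks=0 and acks=1 ignore it.
    static boolean acceptWrite(int isrSize, int minInsyncReplicas, String acks) {
        if (!acks.equals("all") && !acks.equals("-1")) {
            return true;
        }
        return isrSize >= minInsyncReplicas;
    }

    public static void main(String[] args) {
        System.out.println(acceptWrite(1, 2, "all")); // rejected: ISR too small
        System.out.println(acceptWrite(2, 2, "all")); // accepted
    }
}
```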
4. Code Example: Inspecting Replication
We can use the Kafka AdminClient to programmatically inspect the ISR and replica status of a topic.
Java Implementation
```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeTopicsResult;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class ReplicationInspector {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            String topicName = "critical-orders";
            DescribeTopicsResult result = admin.describeTopics(Collections.singleton(topicName));
            TopicDescription topicDesc = result.allTopicNames().get().get(topicName);

            System.out.println("Topic: " + topicName);
            for (TopicPartitionInfo partition : topicDesc.partitions()) {
                System.out.printf("Partition: %d%n", partition.partition());
                System.out.printf("  Leader:   %s%n", partition.leader().id());
                System.out.printf("  Replicas: %s%n", partition.replicas());
                System.out.printf("  ISR:      %s%n", partition.isr());

                // Fewer ISR members than replicas means the partition is under-replicated
                if (partition.isr().size() < partition.replicas().size()) {
                    System.out.println("  [WARNING] Partition is under-replicated!");
                }
            }
        }
    }
}
```
Go Implementation
```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Initialize the client
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Wrap it in an admin client
	admin := kadm.NewClient(client)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Describe the topic
	topicName := "critical-orders"
	topics, err := admin.DescribeTopics(ctx, topicName)
	if err != nil {
		log.Fatal(err)
	}

	for _, topic := range topics {
		fmt.Printf("Topic: %s\n", topic.Topic)
		for _, p := range topic.Partitions {
			fmt.Printf("Partition: %d\n", p.Partition)
			fmt.Printf("  Leader:   %d\n", p.Leader)
			fmt.Printf("  Replicas: %v\n", p.Replicas)
			fmt.Printf("  ISR:      %v\n", p.ISR)

			// Fewer ISR members than replicas means the partition is under-replicated
			if len(p.ISR) < len(p.Replicas) {
				fmt.Println("  [WARNING] Partition is under-replicated!")
			}
		}
	}
}
```
5. Summary
- Replication provides redundancy.
- ISR ensures consistency. Only caught-up followers can become leaders.
- High Watermark protects consumers from seeing uncommitted data.
- Use `acks=all` and `min.insync.replicas=2` for critical data.