Replication and ISR

Kafka’s primary promise is durability. Once a message is acknowledged, it should never be lost, even if servers catch fire. Kafka achieves this through Replication: storing multiple copies of every message on different physical brokers.

In this chapter, we’ll dissect how Kafka ensures data safety without sacrificing performance, using the concepts of Leaders, Followers, and the ISR (In-Sync Replicas).

1. The Anatomy of a Partition

Every partition in Kafka has a configured Replication Factor (usually 3). This means there are three copies of the data living on three different brokers.

1.1 Leader vs. Follower

  • The Leader: One broker is elected as the “Leader” for the partition. All reads and writes go to the Leader. It is the single source of truth.
  • The Followers: The other brokers are “Followers”. Their only job is to fetch data from the Leader and write it to their own local logs. They are passive.

[!NOTE] Unlike some databases (e.g., MongoDB, Cassandra) where you might read from replicas to scale reads, Kafka consumers typically read only from the Leader (though “Follower Fetching” exists for cross-datacenter scenarios).

1.2 The In-Sync Replicas (ISR)

Not all followers are created equal. A follower might be down for maintenance, or it might be running slow due to network congestion. Kafka maintains a list of In-Sync Replicas (ISR).

  • In-Sync: A follower that is caught up with the Leader; specifically, it has fetched up to the Leader’s log end offset within the last replica.lag.time.max.ms (30 seconds by default).
  • Out-of-Sync: A follower that has lagged for longer than that window. It is removed from the ISR until it catches up again.

The ISR is critical because only members of the ISR are eligible to become the new Leader if the current Leader fails. This guarantees no committed data is lost during a failover.
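The ISR bookkeeping can be sketched in a few lines of Java. This is a conceptual model, not the broker’s actual implementation: a follower stays in the ISR as long as it has fully caught up to the Leader within the last replica.lag.time.max.ms.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class IsrCheck {
    // Mirrors the broker default for replica.lag.time.max.ms (30 seconds)
    static final long REPLICA_LAG_TIME_MAX_MS = 30_000;

    // lastCaughtUpMs: for each follower, the last time it was fully caught up
    // to the Leader's log end offset.
    static List<Integer> computeIsr(int leaderId, Map<Integer, Long> lastCaughtUpMs, long nowMs) {
        List<Integer> isr = new ArrayList<>();
        isr.add(leaderId); // the Leader is always a member of its own ISR
        lastCaughtUpMs.forEach((brokerId, caughtUpAt) -> {
            if (nowMs - caughtUpAt <= REPLICA_LAG_TIME_MAX_MS) {
                isr.add(brokerId);
            }
        });
        Collections.sort(isr);
        return isr;
    }

    public static void main(String[] args) {
        // Broker 2 caught up 5s ago (in sync); broker 3 caught up 60s ago (lagging)
        Map<Integer, Long> followers = Map.of(2, 95_000L, 3, 40_000L);
        System.out.println(computeIsr(1, followers, 100_000L)); // prints [1, 2]
    }
}
```

Broker 3 misses the window, so it drops out of the ISR even though it is still alive; it rejoins automatically once it catches back up.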


2. Walkthrough: Replication in Action

Consider a partition with replication factor 3. Initially, all three brokers are at the same offset and every replica is in the ISR:

  • Broker 1 (Leader): Offset 0
  • Broker 2 (Follower): Offset 0, in ISR
  • Broker 3 (Follower): Offset 0, in ISR
  • ISR List: [1, 2, 3]

As the producer writes, the Leader’s offset advances first; the Followers fetch the new records and catch up. If a Follower stops fetching (for example, because its broker crashes), it is removed from the ISR and replication continues with the remaining members.
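The follower side of this flow can be modeled as a toy fetch loop. This is illustrative only: real followers fetch in batches over Kafka’s replication protocol, and the log holds records, not plain strings.

```java
import java.util.ArrayList;
import java.util.List;

public class FetchLoop {
    // The Leader's log: a simple append-only list of records
    static List<String> leaderLog = new ArrayList<>();

    // A follower "fetch": copy everything from the follower's current
    // log end offset up to the Leader's log end offset.
    static void fetch(List<String> followerLog) {
        for (int off = followerLog.size(); off < leaderLog.size(); off++) {
            followerLog.add(leaderLog.get(off));
        }
    }

    public static void main(String[] args) {
        List<String> follower2 = new ArrayList<>();
        List<String> follower3 = new ArrayList<>();
        leaderLog.addAll(List.of("m0", "m1", "m2")); // producer writes to the Leader only
        fetch(follower2); // follower 2 catches up and stays in the ISR
        // follower 3 never fetches; after replica.lag.time.max.ms it would leave the ISR
        System.out.println(follower2.size() + " " + follower3.size()); // prints 3 0
    }
}
```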

2.1 The High Watermark (HW)

You might notice that a consumer cannot read a message the moment it hits the Leader. It must wait until the message is committed.

A message is considered committed once it has been replicated to every broker in the current ISR. The offset of the last committed message is called the High Watermark. For example, if the Leader’s log ends at offset 10 but the slowest ISR member has only replicated up to offset 8, the High Watermark is 8.

[!IMPORTANT] Consumers can ONLY read up to the High Watermark. This ensures that even if the Leader crashes, the consumer has only read data that is guaranteed to exist on the new Leader.
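A minimal sketch of the rule above, assuming we track each broker’s log end offset: the High Watermark is simply the minimum log end offset across the current ISR.

```java
import java.util.Map;
import java.util.Set;

public class HighWatermark {
    // The HW is the minimum log end offset across the current ISR:
    // every offset below it exists on every in-sync replica.
    static long highWatermark(Map<Integer, Long> logEndOffsets, Set<Integer> isr) {
        return isr.stream().mapToLong(logEndOffsets::get).min().orElse(0L);
    }

    public static void main(String[] args) {
        Map<Integer, Long> logEnd = Map.of(1, 10L, 2, 8L, 3, 9L);
        // All three brokers in the ISR: consumers may read up to offset 8
        System.out.println(highWatermark(logEnd, Set.of(1, 2, 3))); // prints 8
        // Broker 2 falls out of the ISR: the HW can advance to 9
        System.out.println(highWatermark(logEnd, Set.of(1, 3)));    // prints 9
    }
}
```

Note the second call: shrinking the ISR lets the High Watermark advance, which is exactly why a slow follower is evicted rather than allowed to stall consumers forever.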


3. Configuring Durability

You can control durability using the acks (acknowledgments) setting on the Producer.

| Acks Setting | Description | Durability | Latency |
| --- | --- | --- | --- |
| acks=0 | Producer fires and forgets; it does not wait for any server response. | None (risky) | Lowest |
| acks=1 | Producer waits for the Leader to append the message to its local log (not necessarily flushed to disk). | Medium (safe only while the Leader lives) | Low |
| acks=all (or -1) | Producer waits for the Leader and every current ISR member to acknowledge. | Maximum (guaranteed once committed) | Higher |

3.1 The Magic Formula for Zero Data Loss

To guarantee zero data loss, you need three settings working together:

  1. Replication Factor ≥ 3
  2. acks=all on Producer
  3. min.insync.replicas ≥ 2 (on Broker or Topic config)

If min.insync.replicas=2 and only 1 broker is in the ISR, the Leader will reject acks=all writes with a NOT_ENOUGH_REPLICAS error. This protects you from writing data that cannot be safely replicated.
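The producer side of this recipe looks like the sketch below, using the standard kafka-clients property names. The broker address is a placeholder, and min.insync.replicas is a topic/broker setting, so it appears only as a comment; the resulting Properties would be passed to a KafkaProducer constructor.

```java
import java.util.Properties;

public class DurableProducerConfig {
    public static Properties durableProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder address
        props.put("acks", "all");                // wait for the full ISR, not just the Leader
        props.put("enable.idempotence", "true"); // retries cannot duplicate or reorder messages
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        // Pair this with the topic/broker setting: min.insync.replicas=2
        return props;
    }

    public static void main(String[] args) {
        System.out.println(durableProps().getProperty("acks")); // prints all
    }
}
```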


4. Code Example: Inspecting Replication

We can use the Kafka AdminClient to programmatically inspect the ISR and replica status of a topic.

Java Implementation

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.DescribeTopicsResult;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutionException;

public class ReplicationInspector {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            String topicName = "critical-orders";
            DescribeTopicsResult result = admin.describeTopics(Collections.singleton(topicName));
            TopicDescription topicDesc = result.allTopicNames().get().get(topicName);

            System.out.println("Topic: " + topicName);
            for (TopicPartitionInfo partition : topicDesc.partitions()) {
                System.out.printf("Partition: %d%n", partition.partition());
                // leader() can be null while a leader election is in progress
                System.out.printf("  Leader: %s%n", partition.leader() == null ? "none" : partition.leader().id());
                System.out.printf("  Replicas: %s%n", partition.replicas());
                System.out.printf("  ISR: %s%n", partition.isr());

                if (partition.isr().size() < partition.replicas().size()) {
                    System.out.println("  [WARNING] Partition is under-replicated!");
                }
            }
        }
    }
}

Go Implementation

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/twmb/franz-go/pkg/kadm"
	"github.com/twmb/franz-go/pkg/kgo"
)

func main() {
	// Initialize client
	client, err := kgo.NewClient(
		kgo.SeedBrokers("localhost:9092"),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Create Admin Client
	admin := kadm.NewClient(client)
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Fetch topic metadata (includes per-partition leader, replica, and ISR info)
	topicName := "critical-orders"
	topics, err := admin.ListTopics(ctx, topicName)
	if err != nil {
		log.Fatal(err)
	}

	for _, topic := range topics {
		fmt.Printf("Topic: %s\n", topic.Topic)
		for _, p := range topic.Partitions {
			fmt.Printf("Partition: %d\n", p.Partition)
			fmt.Printf("  Leader: %d\n", p.Leader)
			fmt.Printf("  Replicas: %v\n", p.Replicas)
			fmt.Printf("  ISR: %v\n", p.ISR)

			if len(p.ISR) < len(p.Replicas) {
				fmt.Println("  [WARNING] Partition is under-replicated!")
			}
		}
	}
}

5. Summary

  • Replication provides redundancy.
  • ISR ensures consistency. Only caught-up followers can become leaders.
  • High Watermark protects consumers from seeing uncommitted data.
  • Use acks=all and min.insync.replicas=2 for critical data.