Rebalance Protocols

Imagine you are managing a busy restaurant with four waiters and sixteen tables. Suddenly, one waiter calls in sick. How do you handle their tables? Do you ask every waiter to drop their plates, freeze all service, and completely reassign all sixteen tables from scratch? Or do you let the three remaining waiters continue serving their tables, and only reassign the four orphaned tables?

This exact dilemma exists in Kafka. A Rebalance is the critical coordination process where the Kafka Group Coordinator redistributes topic partitions among active consumer group members. While necessary for scalability and fault tolerance, rebalances can be a source of significant latency if not managed correctly.

In this chapter, we will dive deep into why rebalances happen, the legacy “Stop-the-World” approach, and how modern Kafka solves this with Incremental Cooperative Rebalancing.


1. The Genesis: Why Rebalances Happen

A consumer group is dynamic. Consumers scale up during peak traffic, scale down to save costs, or crash due to Out-Of-Memory (OOM) errors. Kafka must adapt to these changes dynamically.

A rebalance is triggered when there is a change to the membership of the consumer group or the metadata of the subscribed topics.

  • New Consumer Joins — The new consumer needs partitions to process, so partitions must be taken from existing consumers and given to it. Common root cause: horizontal auto-scaling (e.g., a Kubernetes HPA spinning up a new pod).
  • Consumer Leaves Gracefully — The consumer sends a LeaveGroup request; its partitions are now orphaned and must be reassigned. Common root cause: rolling deployments or deliberate scale-down events.
  • Consumer Crashes — The consumer stops sending heartbeats within session.timeout.ms, so the coordinator assumes it is dead. Common root cause: node failure, an Out-Of-Memory (OOM) error, or a network partition.
  • Processing Timeout — The consumer takes longer than max.poll.interval.ms to process a batch of messages, violating its liveness contract. Common root cause: slow external API calls or unoptimized database inserts during message processing.
  • Topic Changes — The number of partitions in a subscribed topic increases, and the new partitions must be assigned to someone. Common root cause: an administrator manually adding partitions to handle increased throughput.

2. The Legacy Approach: The Eager Rebalance Protocol

Prior to Kafka 2.4, the Eager Rebalance protocol was the only behavior available, and it remained the default until 3.0. It favored simplicity over performance, acting as a complete “Stop-the-World” event for your application.

The Eager Workflow

  1. Revoke All: Every single consumer in the group stops fetching data and relinquishes all of its assigned partitions, even if those partitions don’t need to move.
  2. Join Group: The consumers send a JoinGroup request to the Group Coordinator.
  3. Sync Group: The Group Leader (typically the first consumer to join) calculates the new assignment for everyone and sends it to the Coordinator. The Coordinator distributes this assignment via the SyncGroup response.
  4. Resume: Every consumer receives its new partitions and begins fetching data again.
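In code, the protocol in use is determined by the partition.assignment.strategy consumer setting. A minimal configuration sketch of the eager style (the helper class name is illustrative; only the strategy line is the point):

```java
import java.util.Properties;

// Illustrative helper class; the partition.assignment.strategy entry is
// what selects the rebalance protocol.
public class EagerConfig {

    static Properties consumerProps() {
        Properties props = new Properties();
        // RangeAssignor (the pre-2.4 default) follows the eager protocol:
        // every rebalance first revokes all partitions from all members.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.RangeAssignor");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("partition.assignment.strategy"));
    }
}
```

RoundRobinAssignor and StickyAssignor also follow the eager protocol; only the assignment they compute differs.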

The “Stop-the-World” Problem

The fatal flaw of Eager Rebalancing is the massive pause in processing. If you have 50 consumers processing 500 partitions, and one consumer restarts, all 500 partitions stop processing.

During rolling deployments in a large cluster, this can lead to consecutive rebalances, causing the entire consumer group to halt for several minutes. This is known as a Rebalance Storm.
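The cost difference is easy to quantify with a back-of-the-envelope sketch (assuming an even assignment of partitions across consumers):

```java
// Back-of-the-envelope comparison: how many partitions pause when one
// consumer out of a group leaves, under each protocol.
public class RebalanceImpact {

    // Eager: every partition in the group is revoked, so all of them pause.
    static int pausedEager(int totalPartitions) {
        return totalPartitions;
    }

    // Cooperative: only the departing consumer's share has to move
    // (assuming partitions are spread evenly across the group).
    static int pausedCooperative(int totalPartitions, int consumers) {
        return (int) Math.ceil((double) totalPartitions / consumers);
    }

    public static void main(String[] args) {
        int partitions = 500, consumers = 50;
        System.out.println("Eager pause: " + pausedEager(partitions) + " partitions");
        System.out.println("Cooperative pause: " + pausedCooperative(partitions, consumers) + " partitions");
    }
}
```

For the 50-consumer, 500-partition example above, eager pauses all 500 partitions while the cooperative approach (next section) pauses only the roughly 10 that actually need a new owner.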


3. The Modern Approach: Cooperative Sticky Rebalance

Introduced in Kafka 2.4 and made default in 3.0, the Cooperative Sticky Rebalance (or Incremental Cooperative Rebalancing) solves the “Stop-the-World” problem.

Instead of a global reset, this protocol operates on two core principles:

  1. Sticky Assignments: Consumers keep their existing partitions unless moving them is absolutely necessary.
  2. Incremental Revocation: Only the partitions that must move to restore balance are revoked; everything else stays put.

The Cooperative Workflow

  1. First Phase (Detection & Sticky Retention): The coordinator announces a rebalance. Consumers rejoin the group, but they do not drop their existing partitions. The Group Leader calculates the new assignment.
  2. Second Phase (Targeted Revocation): If the Leader decides Consumer A needs to give a partition to Consumer B, it tells Consumer A to revoke only that specific partition. Consumer A stops processing that partition and commits its offsets.
  3. Third Phase (Targeted Assignment): The revoked partition is now free. Consumer A rejoins the group, triggering a short follow-up rebalance in which the partition is assigned to Consumer B.

Throughout this entire process, every other partition continues processing without interruption.
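Opting in is a one-line configuration change; a minimal sketch (the helper class name is illustrative, and note that the CooperativeStickyAssignor is already part of the default strategy list from Kafka 3.0 onward):

```java
import java.util.Properties;

// Illustrative helper class; the partition.assignment.strategy entry is
// what opts the consumer into the cooperative protocol.
public class CooperativeConfig {

    static Properties consumerProps() {
        Properties props = new Properties();
        // CooperativeStickyAssignor revokes only the partitions that must
        // move, keeping the rest of the group processing throughout.
        props.put("partition.assignment.strategy",
                "org.apache.kafka.clients.consumer.CooperativeStickyAssignor");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(consumerProps().getProperty("partition.assignment.strategy"));
    }
}
```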


4. Interactive: Cooperative Rebalance Simulator

Watch how partitions move during a Cooperative Rebalance. When Consumer 1 (owning P0 and P1) crashes, Consumer 2 keeps its original partitions (P2 and P3) processing smoothly while absorbing Consumer 1’s orphaned workload.

5. Hooking into the Lifecycle: ConsumerRebalanceListener

When a rebalance occurs, you often need to perform cleanup tasks, such as committing offsets manually or closing database connections. Kafka provides the ConsumerRebalanceListener interface for this exact purpose.

Java

// Assumes a KafkaConsumer<String, String> named `consumer` is already
// created, and that the processing loop maintains a map of the latest
// processed offsets per partition:
Map<TopicPartition, OffsetAndMetadata> currentOffsets = new HashMap<>();

consumer.subscribe(Collections.singletonList("orders-topic"), new ConsumerRebalanceListener() {

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Triggered right BEFORE the consumer loses these partitions.
        // Good place to synchronously commit offsets for processed messages
        // to prevent duplicate processing on the next owner.
        System.out.println("Partitions revoked: " + partitions);
        consumer.commitSync(currentOffsets);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Triggered right AFTER the consumer receives new partitions.
        // Good place to load state, connect to a database, or seek to a specific offset.
        System.out.println("Partitions assigned: " + partitions);
    }
});

Go

package consumer

import (
	"log"

	"github.com/IBM/sarama"
)

// RebalanceHandler implements sarama.ConsumerGroupHandler.
type RebalanceHandler struct{}

func (h *RebalanceHandler) Setup(session sarama.ConsumerGroupSession) error {
	// Equivalent to onPartitionsAssigned:
	// set up resources before consuming starts.
	log.Printf("Partitions assigned: %v", session.Claims())
	return nil
}

func (h *RebalanceHandler) Cleanup(session sarama.ConsumerGroupSession) error {
	// Equivalent to onPartitionsRevoked:
	// commit offsets and clean up resources before losing partitions.
	session.Commit()
	log.Printf("Partitions revoked")
	return nil
}

func (h *RebalanceHandler) ConsumeClaim(session sarama.ConsumerGroupSession, claim sarama.ConsumerGroupClaim) error {
	// Normal message consumption loop: process each record, then mark it
	// so its offset is included in the next commit.
	for msg := range claim.Messages() {
		log.Printf("Message: topic=%s partition=%d offset=%d", msg.Topic, msg.Partition, msg.Offset)
		session.MarkMessage(msg, "")
	}
	return nil
}

6. Tuning Rebalances: Avoiding the “Rebalance Storm”

Unnecessary rebalances are the enemy of high-throughput streaming applications. To prevent your consumers from constantly rebalancing, you must align your configurations with your application’s actual behavior.

  • session.timeout.ms (Default: 45s): The maximum time the coordinator will wait for a heartbeat before declaring the consumer dead. If your consumers run in an environment with frequent network blips or heavy GC pauses, consider increasing this slightly. However, setting it too high means it will take longer to detect a real failure.
  • heartbeat.interval.ms (Default: 3s): How often the background thread sends heartbeats. Rule of thumb: this should be no more than 1/3 of the session.timeout.ms.
  • max.poll.interval.ms (Default: 5m): Crucial Configuration. This dictates the maximum time your application is allowed to spend processing a batch of records returned by poll(). If your processing logic involves slow database calls and takes 6 minutes, but this is set to 5 minutes, the consumer will deliberately trigger a rebalance. Always ensure max.poll.interval.ms is strictly greater than your worst-case processing time.
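Tying the three settings together, a configuration sketch for a workload whose worst-case batch takes around eight minutes (the helper class name and the workload numbers are illustrative):

```java
import java.util.Properties;

// Illustrative helper class tying the three rebalance timeouts together.
public class TimeoutTuning {

    static Properties consumerProps() {
        Properties props = new Properties();
        props.put("session.timeout.ms", "45000");    // default: 45s
        props.put("heartbeat.interval.ms", "3000");  // default: 3s, well under 1/3 of the session timeout
        props.put("max.poll.interval.ms", "600000"); // raised to 10m for an assumed ~8m worst-case batch
        return props;
    }

    public static void main(String[] args) {
        Properties p = consumerProps();
        long session = Long.parseLong(p.getProperty("session.timeout.ms"));
        long heartbeat = Long.parseLong(p.getProperty("heartbeat.interval.ms"));
        // Sanity-check the rule of thumb: heartbeat interval <= session timeout / 3.
        System.out.println(heartbeat * 3 <= session);
    }
}
```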

Summary

Modern Kafka has vastly improved the rebalance process with Cooperative Sticky assignments, but the underlying mechanics remain the same. By writing efficient processing logic, correctly sizing your timeouts, and implementing ConsumerRebalanceListener to cleanly manage state, you can ensure your consumer groups remain stable, performant, and resilient.