Rebalance Protocols

A rebalance is the process by which a Kafka consumer group redistributes partition ownership among its members, coordinated by the broker that acts as Group Coordinator. Rebalances are necessary for scalability and fault tolerance, but they can add significant latency if not managed correctly.

1. Why Rebalances Happen

  • A new consumer joins the group.
  • An existing consumer leaves the group (graceful shutdown).
  • A consumer crashes (detected when no heartbeat arrives within session.timeout.ms).
  • A consumer takes too long to process a batch of messages, so the gap between calls to poll() exceeds max.poll.interval.ms.
  • The topic’s partition count is increased.
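The two timeout-based triggers are easy to confuse, so here is a minimal sketch of how they differ. It is a simplified model, not Kafka client code: the ConsumerState class and rebalance_trigger function are hypothetical names, and the default values are assumptions that mirror the defaults of recent Java clients (check your own client version).

```python
from dataclasses import dataclass


@dataclass
class ConsumerState:
    last_heartbeat_ms: int  # time of the most recent heartbeat
    last_poll_ms: int       # time of the most recent poll() call


def rebalance_trigger(state, now_ms,
                      session_timeout_ms=45_000,      # assumed default
                      max_poll_interval_ms=300_000):  # assumed default
    """Return why this consumer would drop out of the group, or None."""
    # Heartbeats stopped arriving: the consumer is presumed dead.
    if now_ms - state.last_heartbeat_ms > session_timeout_ms:
        return "session.timeout.ms exceeded"
    # Heartbeats are fine, but the application stopped calling poll().
    if now_ms - state.last_poll_ms > max_poll_interval_ms:
        return "max.poll.interval.ms exceeded"
    return None


# A consumer that still heartbeats but has been stuck in processing.
stuck = ConsumerState(last_heartbeat_ms=390_000, last_poll_ms=0)
reason = rebalance_trigger(stuck, now_ms=400_000)
```

The point of the sketch is that a live, heartbeating consumer can still trigger a rebalance if its processing loop is too slow.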

2. Eager Rebalance (EagerProtocol)

This is the original protocol, and the only one available before Kafka 2.4. Assignors such as RangeAssignor still use it.

  1. Every consumer stops reading and relinquishes all partitions.
  2. The group sits idle while the new assignment is computed (by the group leader, a consumer chosen by the coordinator) and distributed.
  3. Each consumer is assigned its new partitions and starts reading again.
    • The Problem: Every rebalance is a “Stop-the-World” event for the entire consumer group, even for consumers whose partitions do not change.
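The three steps above can be sketched as a small simulation. This is an illustrative model only: round_robin is a deliberately simple stand-in for a real assignor, and the consumer and partition names are made up for the example.

```python
def round_robin(partitions, members):
    """Spread partitions across members in order (illustrative assignor)."""
    assignment = {m: [] for m in members}
    for i, p in enumerate(partitions):
        assignment[members[i % len(members)]].append(p)
    return assignment


def eager_rebalance(current, members):
    """Return (revoked, new_assignment) for an eager rebalance."""
    # Step 1: every member gives up everything it owns.
    revoked = {m: list(ps) for m, ps in current.items()}
    # Step 2: the full partition set is reassigned from scratch.
    partitions = sorted(p for ps in current.values() for p in ps)
    # Step 3: members resume with whatever they are handed.
    return revoked, round_robin(partitions, members)


# C3 joins a group in which C1 and C2 each own two partitions.
current = {"C1": ["P0", "P1"], "C2": ["P2", "P3"]}
revoked, new = eager_rebalance(current, ["C1", "C2", "C3"])
revoked_count = sum(len(ps) for ps in revoked.values())  # 4: everything
```

All four partitions are revoked even though only a subset needed to change owners.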

3. Cooperative Sticky Rebalance

Introduced in Kafka 2.4 (KIP-429), this protocol performs Incremental Cooperative Rebalancing.

  1. Instead of everyone stopping, only the partitions that need to move are revoked.
  2. The consumers that are losing partitions stop reading those partitions and commit their offsets.
  3. The consumers that are keeping their partitions continue processing without interruption.
    • The Benefit: Much smaller impact on latency and throughput.
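A minimal sketch of the idea follows, under stated assumptions: cooperative_rebalance is a hypothetical helper, the quota logic is a simplification of what a sticky assignor does, and the real protocol runs two consecutive rebalances (one to revoke, one to hand out the revoked partitions) that this model collapses into a single step.

```python
def cooperative_rebalance(current, members):
    """Return (revoked, new_assignment); only partitions that must move are revoked."""
    total = sum(len(ps) for ps in current.values())
    base, extra = divmod(total, len(members))
    # Members that already own the most are considered first for the
    # larger quotas, so existing owners keep as much as possible.
    ordered = sorted(members, key=lambda m: -len(current.get(m, [])))
    quota = {m: base + (1 if i < extra else 0) for i, m in enumerate(ordered)}

    revoked, new, pool = {}, {}, []
    for m in members:
        owned = current.get(m, [])
        new[m] = owned[:quota[m]]      # kept without interruption
        surplus = owned[quota[m]:]     # must move to another member
        if surplus:
            revoked[m] = surplus
            pool.extend(surplus)
    # Partitions owned by members that have left the group.
    for m, ps in current.items():
        if m not in members:
            pool.extend(ps)
    # Hand the pooled partitions to members that are below quota.
    for m in members:
        while len(new[m]) < quota[m] and pool:
            new[m].append(pool.pop(0))
    return revoked, new


# C3 joins a group in which C1 and C2 each own two partitions.
before = {"C1": ["P0", "P1"], "C2": ["P2", "P3"]}
moved, after = cooperative_rebalance(before, ["C1", "C2", "C3"])
```

In this run only one partition changes hands, and C1 is never interrupted. With the Java client, the protocol is selected by setting partition.assignment.strategy to org.apache.kafka.clients.consumer.CooperativeStickyAssignor.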

4. Interactive: Rebalance Simulator

[Interactive simulator, not reproduced here. Starting state, labeled “Healthy Workload”: C1 owns P0 and P1; C2 owns P2 and P3.]

5. Conclusion

Unnecessary rebalances are the enemy of Kafka performance. Tune session.timeout.ms to how quickly you need a crashed consumer to be detected, and max.poll.interval.ms to your application’s worst-case processing time per batch. If your code takes 10 minutes to process a batch but max.poll.interval.ms is 5 minutes (the default), the consumer is removed from the group on every batch and you will be in a constant “Rebalance Storm.”
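As a rough sanity check, the timeout can be compared against the measured worst-case batch time. The sketch below is an assumption-laden illustration: the config is a plain dictionary that is not passed to any client, the broker address and group name are placeholders, and fits_poll_interval with its headroom factor is a hypothetical helper, not part of any Kafka library.

```python
consumer_config = {
    "bootstrap.servers": "localhost:9092",  # placeholder broker address
    "group.id": "example-group",            # hypothetical group name
    "session.timeout.ms": 45_000,           # crash detection window
    "max.poll.interval.ms": 300_000,        # 5 minutes between poll() calls
}


def fits_poll_interval(worst_case_batch_ms, config, headroom=1.5):
    """True if the worst-case batch time, with headroom, fits the poll interval."""
    return worst_case_batch_ms * headroom <= config["max.poll.interval.ms"]


ten_minutes_ms = 10 * 60 * 1000

# The scenario from the text: 10-minute batches against a 5-minute limit.
too_slow = not fits_poll_interval(ten_minutes_ms, consumer_config)

# One possible fix: raise the limit to 15 minutes.
tuned = {**consumer_config, "max.poll.interval.ms": 900_000}
ok_after_tuning = fits_poll_interval(ten_minutes_ms, tuned)
```

Raising the limit is only one option. Processing smaller batches so that each poll() loop finishes sooner has the same effect without delaying the detection of a genuinely stuck consumer.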