Leader Election
In a distributed system, failures are not exceptions; they are the norm. When a broker hosting a partition leader crashes, the cluster must react quickly, electing a new leader from the surviving followers so the partition remains available for reads and writes.
This process is orchestrated by a special entity called the Controller.
1. The Controller: The Brain of the Cluster
One of the brokers in the Kafka cluster is elected as the Controller. Its job is to manage the state of partitions and replicas.
1.1 Responsibilities
- Monitor Broker Health: It watches for brokers joining or leaving the cluster (via Zookeeper or KRaft heartbeats).
- Elect Leaders: When a broker fails, the Controller finds all partitions that lost their leader and selects a new one from the ISR.
- Update Metadata: It sends `LeaderAndIsr` requests to all brokers, telling them: “Broker X is the new leader for Partition Y.”
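The Controller’s failover logic can be sketched roughly as follows. This is an illustrative toy model, not Kafka’s actual code: `PartitionState` and `handle_broker_failure` are hypothetical names, and real elections involve epochs, metadata persistence, and RPCs to every affected broker.

```python
# Toy sketch of the Controller reacting to a broker failure.
# PartitionState and handle_broker_failure are illustrative names only.

from dataclasses import dataclass


@dataclass
class PartitionState:
    leader: str
    isr: list  # in-sync replicas, leader included


def handle_broker_failure(partitions, dead_broker):
    """Find partitions led by the dead broker and elect new leaders from the ISR."""
    for p in partitions:
        if p.leader != dead_broker:
            continue
        surviving_isr = [b for b in p.isr if b != dead_broker]
        if surviving_isr:
            # Promote a surviving ISR member; it has all committed data.
            p.leader = surviving_isr[0]
            p.isr = surviving_isr
            # Real Kafka would now send LeaderAndIsr requests to every
            # broker hosting a replica of this partition.


partitions = [PartitionState(leader="broker-1", isr=["broker-1", "broker-2", "broker-3"])]
handle_broker_failure(partitions, "broker-1")
print(partitions[0].leader)  # broker-2
```

Note that any surviving ISR member is a safe choice, because ISR membership guarantees it holds all committed records.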
1.2 Zookeeper vs. KRaft
- Legacy (Zookeeper): The Controller writes state to Zookeeper. This was a bottleneck for large clusters because all metadata updates had to flow through Zookeeper.
- Modern (KRaft): Kafka runs a dedicated metadata quorum based on the Raft consensus algorithm. The Controller is now an internal role, and metadata is stored in an internal topic (`__cluster_metadata`). This allows much faster failovers and support for millions of partitions.
2. The Election Process
When a broker fails, the Controller detects the lost heartbeat, elects a new leader from the ISR for every partition the dead broker was leading, and increments each affected partition’s Leader Epoch.
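You can simulate this sequence with a small toy model. The `Partition` class below is purely illustrative; the point is that every election removes the dead broker from the ISR, promotes a survivor, and bumps the epoch.

```python
# Toy simulation of repeated leader elections and epoch increments.
# Illustrative only; real Kafka persists this state via the Controller.

class Partition:
    def __init__(self, leader, isr):
        self.leader = leader
        self.isr = list(isr)
        self.leader_epoch = 0

    def fail_leader(self):
        """Simulate the current leader crashing and the Controller reacting."""
        dead = self.leader
        self.isr.remove(dead)
        if not self.isr:
            raise RuntimeError("no in-sync replica left to elect")
        self.leader = self.isr[0]
        self.leader_epoch += 1  # every election bumps the Leader Epoch


p = Partition(leader="A", isr=["A", "B", "C"])
p.fail_leader()  # A dies -> B becomes leader, epoch 1
p.fail_leader()  # B dies -> C becomes leader, epoch 2
print(p.leader, p.leader_epoch)  # C 2
```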
3. The “Split Brain” Nightmare
Imagine a scenario where the Controller decides Broker A has died and elects Broker B as the new Leader. But Broker A isn’t actually dead—it was just stuck in a long Garbage Collection (GC) pause.
Suddenly, Broker A wakes up. It still thinks it is the Leader. Broker B knows it is the Leader. We have two leaders. This is called a Split Brain.
If clients write to both, we get data corruption.
3.1 Solution: Epochs (Zombie Fencing)
Kafka solves this using Leader Epochs.
- Every time a new leader is elected, the Controller increments the Leader Epoch counter (e.g., from 5 to 6).
- The new leader (Broker B) shares this epoch (6) with all followers.
- When the “Zombie” leader (Broker A) wakes up, it tries to send a command to its followers with Epoch 5.
- The followers see Epoch 5, compare it to their current Epoch 6, and reject the request saying, “You are from the past. Go away.”
- Broker A realizes it is no longer the leader and demotes itself to a follower.
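The fencing check on the follower side can be sketched like this. The `replicate()` method is a stand-in for Kafka’s actual replication protocol, and the string return codes are illustrative (Kafka’s real error here is `FENCED_LEADER_EPOCH`):

```python
# Sketch of epoch-based fencing from a follower's point of view.
# The replicate() "RPC" is an illustration, not Kafka's wire protocol.

class Follower:
    def __init__(self, current_epoch):
        self.current_epoch = current_epoch

    def replicate(self, leader_epoch, records):
        if leader_epoch < self.current_epoch:
            # The sender is a fenced "zombie" leader from an older epoch.
            return "FENCED_LEADER_EPOCH"
        self.current_epoch = leader_epoch
        return "OK"


f = Follower(current_epoch=6)
print(f.replicate(leader_epoch=5, records=["x"]))  # zombie Broker A -> rejected
print(f.replicate(leader_epoch=6, records=["y"]))  # real leader B -> accepted
```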
4. Unclean Leader Election
By default, Kafka will only elect a leader from the ISR. Why? Because ISR members have all the committed data.
But what if ALL ISR members die simultaneously? You have two choices:
- Availability over Consistency (Unclean Election): Allow a non-ISR replica (which might be missing data) to become leader.
- Pros: The partition comes back online.
- Cons: Any committed records the out-of-sync replica never copied are permanently lost.
- Config: `unclean.leader.election.enable=true`
- Consistency over Availability (Default): Wait until an ISR member comes back to life.
- Pros: No data loss.
- Cons: The partition remains offline (downtime) until one of the failed ISR replicas recovers.
- Config: `unclean.leader.election.enable=false` (default)
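If you do decide to enable it, `unclean.leader.election.enable` is also a topic-level config, so you can scope the risk to a single topic rather than the whole cluster. A sketch using the stock `kafka-configs.sh` tool (the topic name `orders` and the bootstrap address are placeholders for your own values):

```shell
# Enable unclean leader election for ONE topic only (placeholder names).
# Requires a reachable cluster; think twice before doing this in production.
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name orders \
  --alter --add-config unclean.leader.election.enable=true
```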
[!WARNING] Enabling unclean leader election is dangerous. It should only be done as a last resort if availability is absolutely critical and data loss is acceptable.
5. Summary
- The Controller orchestrates failovers.
- Leader Epochs prevent Split Brain scenarios by fencing off “zombie” leaders.
- Unclean Leader Election is a trade-off: Data Loss vs. Downtime.