ZooKeeper Basics
In 2008, Yahoo’s Hadoop cluster had a coordination problem. Their distributed MapReduce jobs needed a way to elect a master among 1,000 nodes, and when the master crashed, all 1,000 nodes would simultaneously discover it was dead and try to become the next master. The result was a thundering herd: thousands of nodes hitting the same registry at once, overloading the very service meant to keep them coordinated.

The Yahoo team built ZooKeeper as the solution: a dedicated coordination service with one key insight, the Ephemeral ZNode. A node that registers creates an Ephemeral ZNode that automatically disappears when its TCP session drops. All interested parties “Watch” the ZNode directory; when it vanishes, ZooKeeper fires exactly one notification to each watcher. No polling, no thundering herd, just a single surgical event. ZooKeeper became the coordination substrate for Hadoop, Kafka (before KRaft), HBase, and dozens of other distributed systems. The Netflix engineering blog compared it to “the central nervous system of distributed computing.”
[!IMPORTANT] In this lesson, you will master:
- Ephemeral Lifecycles: Understanding how ZK uses heartbeats over the TCP session to detect client failure (crash, hang, or partition) and instantly revoke authority.
- RAM-First Architecture: Why ZooKeeper trades Storage Capacity for Latency, and the dangers of JVM Garbage Collection for coordination stability.
- Atomic Broadcast (ZAB): The primary-backup protocol that provides the “FIFO Total Order” required for complex distributed recipes.
1. The Data Model: ZNodes
ZooKeeper looks like a standard Linux file system, but it is optimized for coordination.
- Root: /
- Nodes: /app, /app/config
- Data: Each node stores tiny amounts of data (<1MB).
Node Types (The “Mode”)
- Persistent: Stays forever. (e.g., /db-config)
- Ephemeral: Dies when the client disconnects. (e.g., /active-users/user-1) Crucial for Failure Detection.
- Sequential: Appends a number. (e.g., /queue/task-001) Crucial for Ordering.
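The three modes above can be sketched as a toy in-memory store. This is a simulation of the semantics only (the class and method names here are hypothetical, not the real ZooKeeper API); note how the sequential mode mimics ZK’s zero-padded counter suffix.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy single-session sketch of ZNode creation modes (not the real API).
class ZNodeStore {
    enum Mode { PERSISTENT, EPHEMERAL, SEQUENTIAL }

    private final Map<String, Mode> tree = new TreeMap<>();
    private int counter = 0;

    // Create a node; SEQUENTIAL appends a 10-digit zero-padded counter,
    // which is the suffix format real ZooKeeper uses.
    String create(String path, Mode mode) {
        if (mode == Mode.SEQUENTIAL) {
            path = path + String.format("%010d", counter++);
        }
        tree.put(path, mode);
        return path;
    }

    // Simulate the client's session expiring: every ephemeral node vanishes.
    // (Real ZK tracks ownership per session; this toy has one implicit client.)
    void sessionClosed() {
        tree.values().removeIf(m -> m == Mode.EPHEMERAL);
    }

    boolean exists(String path) { return tree.containsKey(path); }
}
```

In the real client API, the mode is passed as a CreateMode argument to create(), e.g. CreateMode.EPHEMERAL.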
ZooKeeper uses Watches (Push notifications).
- Client: “Watch /config.”
- ZooKeeper: “Acknowledged.”
- Admin updates /config.
- ZooKeeper → Client: “EVENT: NodeDataChanged”.
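One subtlety the flow above hides: a ZooKeeper watch is a one-shot trigger. It fires exactly once per registration, and the client must re-arm it to hear about the next change. A minimal sketch of that semantic (toy classes, not the real API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy sketch of ZooKeeper's one-shot watch semantics (hypothetical names).
class WatchTable {
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    // Register a one-shot watch on a path.
    void watch(String path, Consumer<String> callback) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
    }

    // A write fires each registered watcher exactly once, then clears them:
    // the next change is silent unless the client re-arms the watch.
    void nodeChanged(String path) {
        List<Consumer<String>> fired = watches.remove(path);
        if (fired != null) {
            for (Consumer<String> cb : fired) cb.accept("NodeDataChanged " + path);
        }
    }
}
```

This is why real ZK clients typically re-register the watch inside the watch callback itself.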
[!NOTE] Staff Engineer Intuition: The Thundering Herd. In naive systems, 1,000 nodes might poll a database every second to check for a master’s death (1,000 RPS of wasted work). ZooKeeper’s Watches turn this into a Push-based event. When the master (Ephemeral node) dies, ZooKeeper sends exactly 1,000 packets — one to each waiter. This surgical precision is why ZooKeeper is called the “firefly” of distributed systems.
[Diagram: ZAB Atomic Broadcast]
3. Interactive Demo: The Service Registry
Cyberpunk Mode: Visualize Service Discovery.
- Role: You are the Load Balancer.
- Task: Route traffic to active workers.
- Mechanism: Watch the /workers directory.
Instructions:
- Start Worker: Spawns a new Ephemeral Node. Watch the Load Balancer update instantly.
- Kill Worker: Simulates a crash. The Ephemeral Node vanishes. The Load Balancer removes it from the routing table.
[!TIP] Try it yourself: Click “Start Worker” to add a node. Then click “Crash Random Worker” to see the Ephemeral node vanish and the Load Balancer update its routing table.
4. The Curator Framework
Writing raw ZooKeeper code is painful: connection drops, session expirations, and retry edge cases all have to be handled by hand. Netflix created Curator (now Apache Curator) to solve this.
Recipe: Distributed Lock
```java
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Connect to ZK (retry up to 3 times, starting with a 1-second backoff)
CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
client.start();

// The "Recipe" handles all the logic: creating ephemeral nodes,
// watching for the lock holder, and retrying on connection loss.
InterProcessMutex lock = new InterProcessMutex(client, "/my-distributed-lock");
try {
    if (lock.acquire(10, TimeUnit.SECONDS)) {
        try {
            // Critical Section: access the shared resource
            doWork();
        } finally {
            lock.release();
        }
    }
} catch (Exception e) {
    // Handle failure (connection loss, interruption, etc.)
}
```
Recipe: Leader Election
```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;

LeaderSelector selector = new LeaderSelector(client, "/master-election",
        new LeaderSelectorListenerAdapter() {
    @Override
    public void takeLeadership(CuratorFramework client) throws Exception {
        // You are now the Leader!
        // Do work here; block for as long as you want to lead.
        // When this method exits, you give up leadership.
    }
});
selector.autoRequeue(); // re-enter the election after losing leadership
selector.start();
```
5. ZAB vs Raft: The Engineering Trade-off
While both achieve consensus, they are built for different operational profiles.
| Feature | ZAB (ZooKeeper) | Raft (Etcd/Consul) |
|---|---|---|
| Commitment | 2-Phase Broadcast. Leader proposes, followers ACK, Leader broadcasts COMMIT. | Log Matching. Followers replicate log; Leader updates commit index and piggybacks it on next request. |
| Total Order | Guaranteed. Every write has a sequential Zxid (ZooKeeper Transaction ID). | Implicit. The log itself defines the sequence. |
| Read Profile | Read-Local Optimized. Followers serve reads natively (stale possible unless sync() is called). | Leader-Centric. Reads usually go to the leader for linearizability. |
| Internal Stack | JVM (Java). High RAM usage, GC risks. | Go/C++. Lower memory footprint, predictable performance. |
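The “stale possible unless sync() is called” cell is worth internalizing. A toy model of that follower-lag behavior (hypothetical names; this simulates the semantics, not the actual replication protocol):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of ZK's read-local trade-off: followers apply the leader's
// log asynchronously, so local reads can be stale until the client forces
// a catch-up (the role sync() plays in the real API).
class ReplicaPair {
    private final List<String> leaderLog = new ArrayList<>();
    private int followerApplied = 0; // how far the follower has caught up

    // Committed on the leader; the follower has NOT applied it yet.
    void write(String value) { leaderLog.add(value); }

    // The follower serves whatever it has applied so far -- possibly stale.
    String readFromFollower() {
        return followerApplied == 0 ? null : leaderLog.get(followerApplied - 1);
    }

    // Catch the follower up to the leader before the next read.
    void sync() { followerApplied = leaderLog.size(); }
}
```

In exchange for this staleness window, every follower can absorb read traffic, which is why ZK read throughput scales with cluster size while Raft-style leader reads do not.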
6. Summary
- Watches: The nervous system. Enables real-time reaction to cluster changes.
- Curator: The shield. Use it instead of raw ZK libraries to handle session expiration and retries.
Staff Engineer Tip: ZooKeeper is a Memory-Bound System. Because it stores everything in RAM for speed, its capacity is strictly limited by the physical RAM DIMMs on your servers. If you store large configuration files in ZNode data (>1MB), you will trigger massive JVM Heap Fragmentation. This leads to the “GC Death Spiral”: a 20-second pause triggers a session timeout, which triggers a cluster re-election, which puts more load on the new leader, triggering another GC pause. Always keep ZNode data tiny to protect the physical coordination backplane.
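One way to enforce that discipline operationally is through the server’s JVM settings. A sketch, assuming the standard zkServer.sh wrapper (which sources conf/java.env); the values below are illustrative, not recommendations:

```shell
# conf/java.env -- illustrative guardrails for a ZooKeeper server
# Cap the heap well below physical RAM and prefer a low-pause collector,
# so a GC cycle cannot stretch past the session timeout.
export JVMFLAGS="-Xmx4g -XX:+UseG1GC"
# jute.maxbuffer caps znode payload size (default ~1 MB).
# Raising it to accommodate fat znodes is usually a design smell.
# export JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=1048575"
```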
Mnemonic — “Ephemeral = Firefly, Persistent = Stone”: Persistent ZNode = stays forever (config, schemas). Ephemeral ZNode = dies when TCP session drops (service registry, leader lock). Sequential = auto-numbered (distributed queues). Watch = fire-and-forget notification (no polling). ZK vs Etcd: ZK = JVM, older, 3.x has Watch limitations. Etcd = Go, newer, K8s-native, gRPC, simpler API. Use ZK for legacy Hadoop/Kafka stacks. Use etcd for cloud-native Kubernetes coordination.