ZooKeeper Basics
In 2008, Yahoo’s Hadoop cluster had a coordination problem. Their distributed MapReduce jobs needed a way to elect a master among 1,000 nodes, and when the master crashed, all 1,000 nodes would simultaneously discover it was dead and try to become the next master. The result was a thundering herd: thousands of nodes hitting the same registry at once, overloading the very service meant to keep them coordinated.

The Yahoo team built ZooKeeper as the solution: a dedicated coordination service with one key insight, the Ephemeral ZNode. A node that registers creates an Ephemeral ZNode that automatically disappears when its TCP session drops. All interested parties “Watch” the ZNode directory; when it vanishes, ZooKeeper fires exactly one notification to each watcher. No polling, no thundering herd, just a single surgical event. ZooKeeper became the coordination substrate for Hadoop, Kafka (before KRaft), HBase, and dozens of other distributed systems. The Netflix engineering blog compared it to “the central nervous system of distributed computing.”
[!IMPORTANT] In this lesson, you will master:
- Ephemeral Lifecycles: Understanding how ZK uses heartbeats over the TCP session to detect client failure (crash, hang, or partition) and instantly revoke authority.
- RAM-First Architecture: Why ZooKeeper trades Storage Capacity for Latency, and the dangers of JVM Garbage Collection for coordination stability.
- Atomic Broadcast (ZAB): The primary-backup protocol that provides the “FIFO Total Order” required for complex distributed recipes.
1. The Data Model: ZNodes
ZooKeeper looks like a standard Linux file system, but it is optimized for coordination.
- Root: /
- Nodes: /app, /app/config
- Data: Each node stores tiny amounts of data (<1MB).
Node Types (The “Mode”)
- Persistent: Stays forever. (e.g., /db-config)
- Ephemeral: Dies when the client disconnects. (e.g., /active-users/user-1) Crucial for Failure Detection.
- Sequential: Appends a number. (e.g., /queue/task-001) Crucial for Ordering.
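The three modes above can be sketched as a toy in-memory store. This is a simulation of the semantics only (the class and method names here are hypothetical, not the real ZooKeeper API); note how the sequential mode mimics ZK’s zero-padded counter suffix.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy single-session sketch of ZNode creation modes (not the real API).
class ZNodeStore {
    enum Mode { PERSISTENT, EPHEMERAL, SEQUENTIAL }

    private final Map<String, Mode> tree = new TreeMap<>();
    private int counter = 0;

    // Create a node; SEQUENTIAL appends a 10-digit zero-padded counter,
    // which is the suffix format real ZooKeeper uses.
    String create(String path, Mode mode) {
        if (mode == Mode.SEQUENTIAL) {
            path = path + String.format("%010d", counter++);
        }
        tree.put(path, mode);
        return path;
    }

    // Simulate the client's session expiring: every ephemeral node vanishes.
    // (Real ZK tracks ownership per session; this toy has one implicit client.)
    void sessionClosed() {
        tree.values().removeIf(m -> m == Mode.EPHEMERAL);
    }

    boolean exists(String path) { return tree.containsKey(path); }
}
```

In the real client API, the mode is passed as a CreateMode argument to create(), e.g. CreateMode.EPHEMERAL.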
ZooKeeper uses Watches (Push notifications).
- Client: “Watch /config.”
- ZooKeeper: “Acknowledged.”
- Admin updates /config.
- ZooKeeper → Client: “EVENT: NodeDataChanged”.
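One subtlety the flow above hides: a ZooKeeper watch is a one-shot trigger. It fires exactly once per registration, and the client must re-arm it to hear about the next change. A minimal sketch of that semantic (toy classes, not the real API):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy sketch of ZooKeeper's one-shot watch semantics (hypothetical names).
class WatchTable {
    private final Map<String, List<Consumer<String>>> watches = new HashMap<>();

    // Register a one-shot watch on a path.
    void watch(String path, Consumer<String> callback) {
        watches.computeIfAbsent(path, p -> new ArrayList<>()).add(callback);
    }

    // A write fires each registered watcher exactly once, then clears them:
    // the next change is silent unless the client re-arms the watch.
    void nodeChanged(String path) {
        List<Consumer<String>> fired = watches.remove(path);
        if (fired != null) {
            for (Consumer<String> cb : fired) cb.accept("NodeDataChanged " + path);
        }
    }
}
```

This is why real ZK clients typically re-register the watch inside the watch callback itself.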
[!NOTE] Staff Engineer Intuition: The Thundering Herd. In naive systems, 1,000 nodes might poll a database every second to check for a master’s death (1,000 RPS of wasted work). ZooKeeper’s Watches turn this into a Push-based event. When the master (Ephemeral node) dies, ZooKeeper sends exactly 1,000 packets — one to each waiter. This surgical precision is why ZooKeeper is called the “firefly” of distributed systems.
[Diagram: ZAB Atomic Broadcast]
3. Interactive Demo: The Service Registry
Cyberpunk Mode: Visualize Service Discovery.
- Role: You are the Load Balancer.
- Task: Route traffic to active workers.
- Mechanism: Watch the /workers directory.
Instructions:
- Start Worker: Spawns a new Ephemeral Node. Watch the Load Balancer update instantly.
- Kill Worker: Simulates a crash. The Ephemeral Node vanishes. The Load Balancer removes it from the routing table.
[!TIP] Try it yourself: Click “Start Worker” to add a node. Then click “Crash Random Worker” to see the Ephemeral node vanish and the Load Balancer update its routing table.
4. The Curator Framework
Writing raw ZooKeeper code is painful: connection drops, session expirations, and retry edge cases all have to be handled by hand. Netflix created Curator (now Apache Curator) to solve this.
Recipe: Distributed Lock
```java
import java.util.concurrent.TimeUnit;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Connect to ZK (retry up to 3 times, starting with a 1-second backoff)
CuratorFramework client = CuratorFrameworkFactory.newClient(
        "localhost:2181", new ExponentialBackoffRetry(1000, 3));
client.start();

// The "Recipe" handles all the logic: creating ephemeral nodes,
// watching for the lock holder, and retrying on connection loss.
InterProcessMutex lock = new InterProcessMutex(client, "/my-distributed-lock");
try {
    if (lock.acquire(10, TimeUnit.SECONDS)) {
        try {
            // Critical Section: access the shared resource
            doWork();
        } finally {
            lock.release();
        }
    }
} catch (Exception e) {
    // Handle failure (connection loss, interruption, etc.)
}
```
Recipe: Leader Election
```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;

LeaderSelector selector = new LeaderSelector(client, "/master-election",
        new LeaderSelectorListenerAdapter() {
    @Override
    public void takeLeadership(CuratorFramework client) throws Exception {
        // You are now the Leader!
        // Do work here; block for as long as you want to lead.
        // When this method exits, you give up leadership.
    }
});
selector.autoRequeue(); // re-enter the election after losing leadership
selector.start();
```
5. ZAB vs Raft: The Engineering Trade-off
While both achieve consensus, they are built for different operational profiles.
| Feature | ZAB (ZooKeeper) | Raft (Etcd/Consul) |
|---|---|---|
| Commitment | 2-Phase Broadcast. Leader proposes, followers ACK, Leader broadcasts COMMIT. | Log Matching. Followers replicate log; Leader updates commit index and piggybacks it on next request. |
| Total Order | Guaranteed. Every write has a sequential Zxid (ZooKeeper Transaction ID). | Implicit. The log itself defines the sequence. |
| Read Profile | Read-Local Optimized. Followers serve reads natively (stale possible unless sync() is called). | Leader-Centric. Reads usually go to the leader for linearizability. |
| Internal Stack | JVM (Java). High RAM usage, GC risks. | Go/C++. Lower memory footprint, predictable performance. |
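The “stale possible unless sync() is called” cell is worth internalizing. A toy model of that follower-lag behavior (hypothetical names; this simulates the semantics, not the actual replication protocol):

```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of ZK's read-local trade-off: followers apply the leader's
// log asynchronously, so local reads can be stale until the client forces
// a catch-up (the role sync() plays in the real API).
class ReplicaPair {
    private final List<String> leaderLog = new ArrayList<>();
    private int followerApplied = 0; // how far the follower has caught up

    // Committed on the leader; the follower has NOT applied it yet.
    void write(String value) { leaderLog.add(value); }

    // The follower serves whatever it has applied so far -- possibly stale.
    String readFromFollower() {
        return followerApplied == 0 ? null : leaderLog.get(followerApplied - 1);
    }

    // Catch the follower up to the leader before the next read.
    void sync() { followerApplied = leaderLog.size(); }
}
```

In exchange for this staleness window, every follower can absorb read traffic, which is why ZK read throughput scales with cluster size while Raft-style leader reads do not.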
6. Summary
- Watches: The nervous system. Enables real-time reaction to cluster changes.
- Curator: The shield. Use it instead of raw ZK libraries to handle session expiration and retries.
Staff Engineer Tip: ZooKeeper is a Memory-Bound System. Because it stores everything in RAM for speed, its capacity is strictly limited by the physical RAM DIMMs on your servers. If you store large configuration files in ZNode data (>1MB), you will trigger massive JVM Heap Fragmentation. This leads to the “GC Death Spiral”: a 20-second pause triggers a session timeout, which triggers a cluster re-election, which puts more load on the new leader, triggering another GC pause. Always keep ZNode data tiny to protect the physical coordination backplane.
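One way to enforce that discipline operationally is through the server’s JVM settings. A sketch, assuming the standard zkServer.sh wrapper (which sources conf/java.env); the values below are illustrative, not recommendations:

```shell
# conf/java.env -- illustrative guardrails for a ZooKeeper server
# Cap the heap well below physical RAM and prefer a low-pause collector,
# so a GC cycle cannot stretch past the session timeout.
export JVMFLAGS="-Xmx4g -XX:+UseG1GC"
# jute.maxbuffer caps znode payload size (default ~1 MB).
# Raising it to accommodate fat znodes is usually a design smell.
# export JVMFLAGS="$JVMFLAGS -Djute.maxbuffer=1048575"
```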
Mnemonic — “Ephemeral = Firefly, Persistent = Stone”: Persistent ZNode = stays forever (config, schemas). Ephemeral ZNode = dies when TCP session drops (service registry, leader lock). Sequential = auto-numbered (distributed queues). Watch = fire-and-forget notification (no polling). ZK vs Etcd: ZK = JVM, older, 3.x has Watch limitations. Etcd = Go, newer, K8s-native, gRPC, simpler API. Use ZK for legacy Hadoop/Kafka stacks. Use etcd for cloud-native Kubernetes coordination.