Etcd: The Single Source of Truth
> [!WARNING]
> Back up Etcd! If you lose your Etcd data, you lose your entire cluster. The API Server is stateless. The Scheduler is stateless. Only Etcd holds the state.
1. First Principles: Why a Distributed Store?
To understand Etcd, we must understand the hardware reality of a distributed system. Imagine a cluster of 1,000 servers (Worker Nodes). When an administrator tells the Control Plane “deploy 5 copies of this web server”, that command cannot be stored on just one hard drive.
If the hard drive fails, the cluster forgets its mission. The data must be replicated.
Etcd is a distributed, consistent key-value store.
- Distributed: It runs on multiple servers (nodes) for high availability, meaning data is mirrored across physical disks.
- Consistent: It prioritizes Consistency over Availability (a CP system in the CAP theorem). If you write data to Etcd and get a success response, any subsequent read is guaranteed to return that data. This is achieved via a mathematical consensus protocol called Raft.
- Key-Value: Data is stored as simple directory-like keys (e.g., /registry/pods/default/mypod) and values (serialized as Protobuf for speed and compactness).
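The key space behaves like a flat map of path-like keys. A minimal Python sketch (a plain dict standing in for Etcd's actual index; the key names and placeholder values are illustrative) shows the two access patterns this enables, direct lookup and prefix listing:

```python
# Sketch of Etcd's key space: path-like keys mapping to serialized values.
# A plain dict stands in for the real store; values are placeholders.
store = {
    "/registry/pods/default/mypod": b"<protobuf-encoded Pod>",
    "/registry/pods/default/otherpod": b"<protobuf-encoded Pod>",
    "/registry/services/default/mysvc": b"<protobuf-encoded Service>",
}

# Direct lookup of a specific resource by its full key.
pod = store["/registry/pods/default/mypod"]

def list_prefix(store, prefix):
    """Range (prefix) query: how clients list e.g. all Pods in a namespace."""
    return sorted(k for k in store if k.startswith(prefix))

print(list_prefix(store, "/registry/pods/default/"))
# ['/registry/pods/default/mypod', '/registry/pods/default/otherpod']
```

Because every resource has exactly one canonical key, there is no schema to design and no JOIN to run; the hierarchy lives in the key names themselves.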
2. Hardware Reality: Why not MySQL or Postgres?
Why doesn’t Kubernetes use a traditional relational database? The answer lies in the specific constraints of the system:
- The Watch Mechanism (Push vs Pull): Relational databases are built for polling (pulling data). Kubernetes controllers (like the ReplicaSet Controller) must react instantly to state changes. If 50 controllers polled the database every second, it would burn CPU serving mostly redundant queries. Etcd instead uses long-polling HTTP/gRPC streams to push changes to watchers instantly, minimizing CPU cycles and network overhead.
- Concurrency Control: In a large cluster, hundreds of components might try to update a Pod’s status simultaneously. Traditional databases use heavy row locks. Etcd uses MVCC (Multi-Version Concurrency Control) and Optimistic Locking via resourceVersion. This means Etcd never locks; it simply rejects updates if the resourceVersion doesn’t match the latest sequence number, forcing the client to retry. This is significantly faster for read-heavy, high-contention workloads.
- Data Structure: Kubernetes configuration objects are JSON/YAML documents containing deep hierarchies. Flattening these into rigid tables (rows/columns) requires expensive JOINs. A Key-Value store mapped directly to a file system hierarchy (like /registry/pods/namespace/pod-name) provides O(1) lookups for any specific resource.
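The optimistic-locking pattern can be sketched in a few lines. This is an illustration of compare-and-swap on resourceVersion, not Etcd's actual API; the store here is just a dict plus a monotonic counter:

```python
# Illustrative sketch of optimistic locking via resourceVersion.
# Not the real Etcd API: a dict and an integer counter stand in for the store.

store = {}          # key -> (resource_version, value)
next_version = [1]  # global monotonic counter, like Etcd's revision number

def write(key, value, expected_version=None):
    """Reject (don't block) the write if the caller's resourceVersion is stale."""
    current = store.get(key)
    if expected_version is not None and current and current[0] != expected_version:
        return None  # conflict: caller must re-read and retry
    version = next_version[0]
    next_version[0] += 1
    store[key] = (version, value)
    return version

v1 = write("/registry/pods/default/mypod", "phase=Pending")
# A writer holding a stale version is rejected instantly, never blocked:
assert write("/registry/pods/default/mypod", "phase=Failed", expected_version=999) is None
# A writer holding the latest version succeeds:
v2 = write("/registry/pods/default/mypod", "phase=Running", expected_version=v1)
assert v2 > v1
```

The key property is that the failure path does no waiting: a stale writer learns immediately that it lost the race and retries with fresh data.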
3. How Etcd Works: The Raft Consensus Algorithm
In a distributed system, how do you ensure all nodes agree on the data? Etcd uses Raft.
The Problem: Split Brain
Imagine you have 3 Etcd nodes. If the network splits, you don’t want two different leaders accepting conflicting writes.
The Solution: Quorum
Raft requires a Quorum (majority) to commit a write.
- 3 Nodes: Need 2 to agree. Can survive 1 failure.
- 5 Nodes: Need 3 to agree. Can survive 2 failures.
- Even Numbers?: Bad idea. With 4 nodes, if the network splits 2 vs 2, neither side has a majority (3). The cluster stops accepting writes.
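The quorum arithmetic above generalizes: an n-member cluster needs floor(n/2) + 1 votes to commit, so it tolerates n minus that many failures. A quick sketch:

```python
def quorum(n):
    """Votes needed to commit a write in an n-member Raft cluster."""
    return n // 2 + 1

def fault_tolerance(n):
    """Members that can fail while the cluster keeps accepting writes."""
    return n - quorum(n)

for n in (3, 4, 5):
    print(f"{n} nodes: quorum={quorum(n)}, survives {fault_tolerance(n)} failure(s)")
# 3 nodes: quorum=2, survives 1 failure(s)
# 4 nodes: quorum=3, survives 1 failure(s)  <- no better than 3 nodes
# 5 nodes: quorum=3, survives 2 failure(s)
```

This is why even-sized clusters are wasteful: a 4th node raises the quorum to 3 without raising fault tolerance beyond what 3 nodes already provide.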
Leader Election
One node is elected Leader. All writes MUST go to the Leader. The Leader replicates the data to the Followers. Once a majority acknowledge the write, the Leader commits it.
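The commit rule above can be modeled in a toy function, assuming followers acknowledge instantly; real Raft also tracks terms, log indices, and elections, none of which appear here:

```python
# Toy model of Raft's commit rule: the Leader commits a write once a
# majority of the cluster (itself included) has acknowledged it.

def replicate(write, followers_acking, cluster_size):
    """Return True if the write commits: the Leader's own copy plus
    follower acknowledgements must reach a majority of the cluster."""
    acks = 1 + followers_acking  # the Leader always counts itself
    return acks >= cluster_size // 2 + 1

# 3-node cluster: Leader + 1 follower ack = 2 of 3 -> committed.
assert replicate("put /registry/pods/default/mypod", followers_acking=1, cluster_size=3)
# 5-node cluster: Leader + 1 follower ack = 2 of 5 -> NOT yet committed.
assert not replicate("put /registry/pods/default/mypod", followers_acking=1, cluster_size=5)
```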
4. Interactive: Raft Consensus Visualizer
[Interactive widget: simulate a 3-node Etcd cluster and watch how writes are replicated.]
5. Summary
Etcd is the only stateful part of the Control Plane. It uses the Raft algorithm to ensure that your cluster configuration is safe, consistent, and available even if a machine fails.
In the next chapter, we explore the Declarative Model, which relies entirely on Etcd’s ability to store the “Desired State”.