Replication: The Art of Copying Data
Why Replicate?
Replication means keeping a copy of the same data on multiple machines that are connected via a network. Why do we do this?
- Availability: If one node goes down, the data is still available on another.
- Scalability: You can distribute read queries across multiple replicas.
- Latency: You can place replicas geographically closer to users (e.g., US, EU, Asia).
Strategy 1: Master-Slave (Single Leader)
Architecture:
- One Master (Leader): Handles ALL Writes. Can also handle Reads.
- Multiple Slaves (Followers): Replicate data from the Master. Handle ONLY Reads.
Flow:
- Client writes
x = 5to Master. - Master writes to its disk (WAL).
- Master sends the change to Slaves (Async or Sync).
- Slaves update their disk.
Pros: Simple. No write conflicts (because only one Master). Cons: Master is a Write Bottleneck. If Master dies, you need a failover process.
[!WARNING] Replication Lag: If you use Asynchronous replication (common for performance), a user might write
x = 5, then immediately read from a Slave and seex = 4. The slave hasn’t updated yet! This is Eventual Consistency.
Timeline: The Danger of Async Replication
Strategy 2: Multi-Master (Multi-Leader)
Architecture:
- Multiple Masters: Any Master can accept Writes.
- Masters replicate changes to each other.
Use Case: Global apps. A user in the US writes to the US Master. A user in EU writes to the EU Master.
Pros: High Write Availability. Local write latency.
Cons: Write Conflicts. What if US User sets x = 5 and EU User sets x = 10 at the exact same time? You need conflict resolution (Last Write Wins, Vector Clocks).
Strategy 3: Leaderless (Dynamo-style)
Architecture:
- No Masters. All nodes are equal.
- Client sends Writes to multiple nodes (e.g., 3 out of 5).
- Client reads from multiple nodes to detect conflicts.
Examples: Cassandra, DynamoDB. Pros: Zero downtime. No single point of failure. Cons: Complex consistency logic (Read Repair, Quorums).
Interactive Demo: Replication Lag & Split Brain
Simulate network conditions and see their impact.
- Lag Slider: Increase latency. Write to Master. Read from Slave immediately. Do you see old data?
- Cut Network: Creates a Partition. Slaves stop receiving updates.
- Promote Slave: Causes Split Brain (Two Masters).
Deep Dive: Split Brain
What happens in a Master-Slave system if the Master loses network connectivity but doesn’t crash?
- The Slaves think the Master is dead.
- They elect a New Master (Slave 1).
- The Old Master comes back online. It still thinks it is the Master.
- Now you have Two Masters accepting writes. This is Split Brain.
The Solution: Fencing Tokens
To solve this, we use an Epoch Number (Generation ID).
- Old Master was Epoch 1.
- New Master is elected as Epoch 2.
- If Old Master tries to write to the storage (or send a request), the storage sees “Epoch 1” and rejects it because it has already seen “Epoch 2”.
- This effectively “fences off” the Zombie Master.
Advanced: Chain Replication
Used by systems like MongoDB and CockroachDB.
Instead of Master sending to all Slaves (Star topology), it forms a chain:
Master -> Slave 1 -> Slave 2.
- Write: Goes to Head (Master).
- Propagate: Head -> S1 -> S2.
- Commit: Tail (S2) confirms to Client.
- Pros: Strong Consistency (all nodes have data before success).
- Cons: Higher Latency (Wait for longest path).
[!TIP] Interview Tip: Always clarify if the system needs Strong Consistency (Bank) or Eventual Consistency (Social Media). This dictates whether you use Sync Replication (Slow, Safe) or Async Replication (Fast, Risky).