Replication: The Art of Copying Data

Why Replicate?

Replication means keeping a copy of the same data on multiple machines that are connected via a network. Why do we do this?

  1. Availability: If one node goes down, the data is still available on another.
  2. Scalability: You can distribute read queries across multiple replicas.
  3. Latency: You can place replicas geographically closer to users (e.g., US, EU, Asia).

Strategy 1: Master-Slave (Single Leader)

Architecture:

  • One Master (Leader): Handles ALL Writes. Can also handle Reads.
  • Multiple Slaves (Followers): Replicate data from the Master. Handle ONLY Reads.

Flow:

  1. Client writes x = 5 to Master.
  2. Master writes to its disk (WAL).
  3. Master sends the change to Slaves (Async or Sync).
  4. Slaves update their disk.

Pros: Simple. No write conflicts (because only one Master). Cons: Master is a Write Bottleneck. If Master dies, you need a failover process.

[!WARNING] Replication Lag: If you use Asynchronous replication (common for performance), a user might write x = 5, then immediately read from a Slave and see x = 4. The slave hasn’t updated yet! This is Eventual Consistency.

Timeline: The Danger of Async Replication

T=0ms
User writes x=5 to Master. Master commits. Returns "OK".
T=1ms
User thinks data is safe.
T=2ms
🔥 MASTER CRASHES! (Before sending x=5 to Slave)
T=100ms
Slave Promoted to New Master. It thinks x=4 (Old Value).
Result
DATA LOSS: The write x=5 is gone forever.
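The timeline above can be replayed as a toy simulation (all names hypothetical): an async Master acknowledges a write, then crashes before replicating it, and the promoted Slave only has the old value.

```python
class Node:
    def __init__(self, data):
        self.data = dict(data)

master = Node({"x": 4})
slave = Node({"x": 4})

# T=0ms: client writes x=5; master commits and acks immediately (async)
master.data["x"] = 5
ack = "OK"                  # client believes the write is durable

# T=2ms: master crashes before sending x=5 to the slave
master = None

# T=100ms: slave is promoted to new master; the acknowledged write is gone
new_master = slave
print(new_master.data["x"])  # → 4: the client's x=5 was lost forever
```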

Strategy 2: Multi-Master (Multi-Leader)

Architecture:

  • Multiple Masters: Any Master can accept Writes.
  • Masters replicate changes to each other.

Use Case: Global apps. A user in the US writes to the US Master; a user in the EU writes to the EU Master.

Pros: High Write Availability. Local write latency. Cons: Write Conflicts. What if the US user sets x = 5 and the EU user sets x = 10 at the exact same time? You need conflict resolution (Last Write Wins, Vector Clocks).
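A minimal sketch of Last Write Wins, one of the conflict-resolution options mentioned above (assumed scheme: each write carries a timestamp and the highest timestamp wins). Note that LWW silently discards the losing write, which is its main drawback:

```python
def lww_merge(a, b):
    """Each write is a (value, timestamp) pair; keep the newer one."""
    return a if a[1] >= b[1] else b

us_write = (5, 1700000000.001)   # US Master: x = 5 at t1
eu_write = (10, 1700000000.002)  # EU Master: x = 10, a hair later

value, ts = lww_merge(us_write, eu_write)
print(value)  # → 10: the EU write wins; the US write is silently dropped
```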

Strategy 3: Leaderless (Dynamo-style)

Architecture:

  • No Masters. All nodes are equal.
  • Client sends Writes to multiple nodes (e.g., 3 out of 5).
  • Client reads from multiple nodes to detect conflicts.

Examples: Cassandra, DynamoDB. Pros: Zero downtime. No single point of failure. Cons: Complex consistency logic (Read Repair, Quorums).
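A quorum sketch for the leaderless style (hypothetical code, not Cassandra's or DynamoDB's actual API): with N replicas, writing to W nodes and reading from R nodes guarantees overlap whenever W + R > N, so every read sees at least one up-to-date copy.

```python
N, W, R = 5, 3, 3
replicas = [{"x": (4, 1)} for _ in range(N)]   # each node: key -> (value, version)

def write(key, value, version):
    for node in replicas[:W]:                  # only W of the N nodes get the write
        node[key] = (value, version)

def read(key):
    # read R copies and return the one with the highest version number
    copies = [node[key] for node in replicas[-R:]]
    return max(copies, key=lambda vv: vv[1])

write("x", 5, version=2)
value, version = read("x")
print(value)  # → 5: W + R = 6 > N = 5, so the read and write quorums overlap
```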


Interactive Demo: Replication Lag & Split Brain

Simulate network conditions and see their impact.

  1. Lag Slider: Increase latency. Write to Master. Read from Slave immediately. Do you see old data?
  2. Cut Network: Creates a Partition. Slaves stop receiving updates.
  3. Promote Slave: Causes Split Brain (Two Masters).

Deep Dive: Split Brain

What happens in a Master-Slave system if the Master loses network connectivity but doesn’t crash?

  1. The Slaves think the Master is dead.
  2. They elect a New Master (Slave 1).
  3. The Old Master comes back online. It still thinks it is the Master.
  4. Now you have Two Masters accepting writes. This is Split Brain.

The Solution: Fencing Tokens

To solve this, we use an Epoch Number (Generation ID).

  • Old Master was Epoch 1.
  • New Master is elected as Epoch 2.
  • If Old Master tries to write to the storage (or send a request), the storage sees “Epoch 1” and rejects it because it has already seen “Epoch 2”.
  • This effectively “fences off” the Zombie Master.
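The fencing logic can be sketched in a few lines (hypothetical code): the storage tracks the highest epoch it has seen and rejects any request tagged with an older one.

```python
class FencedStorage:
    def __init__(self):
        self.data = {}
        self.highest_epoch = 0

    def write(self, key, value, epoch):
        if epoch < self.highest_epoch:
            return "REJECTED"          # request from a fenced-off zombie master
        self.highest_epoch = max(self.highest_epoch, epoch)
        self.data[key] = value
        return "OK"

storage = FencedStorage()
print(storage.write("x", 5, epoch=2))  # new Master (Epoch 2) → OK
print(storage.write("x", 9, epoch=1))  # old Master (Epoch 1) → REJECTED
print(storage.data["x"])               # → 5: the zombie write never landed
```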

Advanced: Chain Replication

Used by some systems for strong consistency (Azure Storage's stream layer is a well-known example, and MongoDB supports a chained replication mode for its secondaries). Instead of the Master sending to all Slaves (Star topology), the nodes form a chain: Master -> Slave 1 -> Slave 2.

  • Write: Goes to Head (Master).
  • Propagate: Head -> S1 -> S2.
  • Commit: Tail (S2) confirms to Client.
  • Pros: Strong Consistency (all nodes have data before success).
  • Cons: Higher Latency (the write must traverse the entire chain before the client gets an ACK).
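The chain flow can be sketched as a recursive hand-off (hypothetical code, not any real system's implementation): each node applies the write, forwards it to its successor, and only the tail acknowledges.

```python
class ChainNode:
    def __init__(self, name, successor=None):
        self.name = name
        self.data = {}
        self.successor = successor

    def write(self, key, value):
        self.data[key] = value                 # apply locally first
        if self.successor:                     # then propagate down the chain
            return self.successor.write(key, value)
        return f"ACK from tail {self.name}"    # tail confirms to the client

tail = ChainNode("S2")
mid = ChainNode("S1", successor=tail)
head = ChainNode("Master", successor=mid)

print(head.write("x", 5))   # → ACK from tail S2
print(tail.data["x"])       # → 5: every node has the write before the client sees success
```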

[!TIP] Interview Tip: Always clarify if the system needs Strong Consistency (Bank) or Eventual Consistency (Social Media). This dictates whether you use Sync Replication (Slow, Safe) or Async Replication (Fast, Risky).