Module Review: Replication & Sharding

[!NOTE] This module reviews the core mechanics of replication and sharding: how the oplog keeps secondaries in sync, how elections and write concerns protect data, and how shard-key design determines scalability.

1. Key Takeaways

  • Replica Sets vs Sharding:
      • Replica Sets provide High Availability (redundancy, failover).
      • Sharding provides Horizontal Scalability (throughput, storage capacity).
  • The Oplog: The critical mechanism that powers replication. Secondaries tail the Primary’s oplog to stay in sync.
  • Elections: If a Primary fails, an election occurs. The node with the freshest oplog usually wins.
  • Shard Keys: The most important architectural decision. Hashed keys distribute writes evenly; Ranged keys optimize range queries.
  • Consistency Control: Use Write Concern (w:majority) for durability and Read Concern (majority) for isolation.
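The oplog mechanism above can be sketched as a toy model, assuming simplified single-key "set" operations (the `Primary`/`Secondary` classes and the `(ts, key, value)` entry format are illustrative, not MongoDB's actual oplog schema):

```python
# Illustrative sketch of oplog tailing -- not the real oplog format.

class Primary:
    def __init__(self):
        self.data = {}
        self.oplog = []   # ordered list of (optime, key, value) entries
        self.ts = 0

    def write(self, key, value):
        self.ts += 1
        self.data[key] = value
        self.oplog.append((self.ts, key, value))

class Secondary:
    def __init__(self):
        self.data = {}
        self.last_applied = 0   # optime of the last replayed entry

    def tail(self, primary):
        # Replay only entries newer than what we already applied.
        for ts, key, value in primary.oplog:
            if ts > self.last_applied:
                self.data[key] = value
                self.last_applied = ts

p, s = Primary(), Secondary()
p.write("a", 1)
p.write("b", 2)
s.tail(p)                  # secondary catches up
p.write("a", 3)
s.tail(p)                  # repeated tailing applies only the new entry
print(s.data == p.data)    # → True
```

The secondary tracks its own optime, which is exactly what makes the "Oplog Window" matter: it can only resume from its last applied entry if that entry still exists in the primary's oplog.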

2. Flashcards

What is the role of an Arbiter?

Vote Only

An Arbiter participates in elections to ensure a majority vote but does not hold data. It saves cost compared to a full secondary.
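The cost saving comes down to vote counting. A minimal sketch of the majority rule (ignoring member priorities and the real election protocol):

```python
def can_elect_primary(total_votes, reachable_votes):
    """An election succeeds only if the reachable members hold a strict majority."""
    majority = total_votes // 2 + 1
    return reachable_votes >= majority

# Two data nodes, no arbiter: losing one node leaves 1 of 2 votes -- no majority.
print(can_elect_primary(total_votes=2, reachable_votes=1))  # → False

# Two data nodes + one arbiter: 2 of 3 votes remain, so a primary can be elected.
print(can_elect_primary(total_votes=3, reachable_votes=2))  # → True
```

This is why an arbiter is useful in a two-data-node set: it adds a vote without the storage and hardware cost of a third data-bearing member.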

What happens if the Oplog is too small?

Initial Sync Required

If a Secondary lags behind longer than the "Oplog Window", it falls off the oplog and must restart replication from scratch (Initial Sync).
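The window can be estimated with back-of-the-envelope arithmetic, assuming a steady write rate (the figures below are invented for illustration):

```python
def oplog_window_hours(oplog_size_gb, write_rate_gb_per_hour):
    """Roughly how long a secondary can be offline before falling off the oplog."""
    return oplog_size_gb / write_rate_gb_per_hour

window = oplog_window_hours(oplog_size_gb=50, write_rate_gb_per_hour=10)
print(window)                    # → 5.0 hours of headroom

downtime_hours = 8
print(downtime_hours > window)   # → True: this secondary needs an Initial Sync
```

In practice the write rate is bursty, so sizing the oplog against the peak rate, not the average, is the safer estimate.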

Why use a Hashed Shard Key?

Write Distribution

Hashed keys ensure even distribution of data across shards, preventing "Hot Shards" caused by monotonic inserts (e.g., timestamps).
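The effect is easy to demonstrate: hash monotonically increasing keys and they scatter across shards; route them by range and they all pile onto one shard. A sketch using MD5 (MongoDB's hashed keys are also MD5-based, though the real hash differs in detail):

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4
timestamps = range(1_000_000, 1_001_000)  # 1000 monotonic "inserts"

def hashed_shard(key):
    digest = hashlib.md5(str(key).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

hashed = Counter(hashed_shard(ts) for ts in timestamps)
print(hashed)   # roughly 250 inserts per shard: even write distribution

# Ranged key: each shard owns a contiguous range of values; monotonic
# inserts always land in the same range -> a single "Hot Shard".
range_bounds = [2_000_000, 3_000_000, 4_000_000, 5_000_000]  # upper bound per shard
def ranged_shard(key):
    return next(i for i, hi in enumerate(range_bounds) if key < hi)

ranged = Counter(ranged_shard(ts) for ts in timestamps)
print(ranged)   # → Counter({0: 1000}): every insert hits one shard
```

The trade-off from the takeaways shows up here too: the hashed layout that spreads writes evenly also scatters adjacent timestamps, so a range query must contact every shard.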

What does 'w: majority' guarantee?

Durability

It ensures the write has been acknowledged by a majority of replica set members, protecting it from rollback if the Primary fails.
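The rollback protection follows from overlapping majorities: any two majorities of the same member set share at least one node, so a majority-acknowledged write is visible to at least one member of any electing majority. A sketch of the acknowledgment rule (not the real driver protocol):

```python
def is_durable(total_members, acks):
    """w: 'majority' succeeds once more than half the members have the write."""
    return acks >= total_members // 2 + 1

# 5-member set: 3 acknowledgments commit the write.
print(is_durable(5, 3))  # → True

# w: 1 semantics: only the Primary acked; if it fails before replicating,
# the write can be rolled back by the new Primary.
print(is_durable(5, 1))  # → False
```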

What is Zone Sharding?

Data Locality

Assigning specific ranges of data to specific shards (often by geography) to comply with data sovereignty laws (GDPR) or reduce latency.
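Routing by zone can be sketched as a lookup from a key prefix to a pinned shard. The zone map, country codes, and shard names below are all hypothetical:

```python
# Hypothetical zone map: country-code prefix of the shard key -> pinned shard.
ZONES = {"DE": "shard-eu", "FR": "shard-eu", "US": "shard-us", "CA": "shard-us"}

def route(doc):
    """Send each document to the shard that owns its zone (data locality)."""
    return ZONES[doc["country"]]

print(route({"country": "DE", "user": 1}))  # → shard-eu: EU data stays in the EU
print(route({"country": "US", "user": 2}))  # → shard-us
```

Note this only works if the geographic field is part of the shard key; zones are defined as ranges of shard key values.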

What is a Jumbo Chunk?

Unsplittable Chunk

A chunk that exceeds the max size (64MB) but cannot be split because all documents share the same shard key value (Low Cardinality).
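The split failure can be shown directly: a chunk can only be split at a boundary between two distinct shard key values. A sketch, assuming a chunk is just a list of key values:

```python
def split_chunk(shard_key_values):
    """Return two halves split on a key boundary, or None if unsplittable."""
    distinct = sorted(set(shard_key_values))
    if len(distinct) < 2:
        return None  # jumbo: every document shares one shard key value
    midpoint = distinct[len(distinct) // 2]
    left = [v for v in shard_key_values if v < midpoint]
    right = [v for v in shard_key_values if v >= midpoint]
    return left, right

print(split_chunk([1, 2, 3, 4]))   # → ([1, 2], [3, 4]): splits fine
print(split_chunk([7] * 10_000))   # → None: low-cardinality key, jumbo chunk
```

This is why low-cardinality fields (status flags, booleans, country codes on their own) make poor shard keys: no matter how large the chunk grows, the balancer has no boundary to split on.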

What is 'readPreference: nearest'?

Lowest Latency

The driver reads from the node with the lowest network latency, regardless of whether it is Primary or Secondary. Useful for global apps.
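The selection logic is a simple minimum over measured round-trip times. A sketch, assuming the driver has already pinged each member (node names and latencies are invented):

```python
# Hypothetical measured round-trip times, in milliseconds, per cluster member.
latencies = {
    "primary.eu":       120,
    "secondary.us-east": 15,
    "secondary.us-west": 40,
}

def nearest(latency_by_node):
    """readPreference: 'nearest' -- pick the lowest-latency node, role ignored."""
    return min(latency_by_node, key=latency_by_node.get)

print(nearest(latencies))  # → secondary.us-east, even though it is not the Primary
```

The trade-off is the usual one for secondary reads: the nearest node may serve slightly stale data unless paired with an appropriate read concern.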

3. Cheat Sheet

| Concept | Description | Trade-off / Note |
| --- | --- | --- |
| Replica Set | N nodes: 1 Primary, N-1 Secondaries | High Availability, Automatic Failover |
| Sharding | Partitioning data across servers | Unlimited Scale vs Complexity |
| Mongos | Query Router | Clients connect here, not to shards |
| Config Server | Stores cluster metadata | Must be a replica set itself |
| Hashed Key | MD5 hash of value | Good Write Distribution, Bad Range Queries |
| Ranged Key | Values in order | Good Range Queries, Risk of Hot Shards |
| Jumbo Chunk | Chunk > 64MB that cannot split | Caused by Low Cardinality Shard Keys |
| `w: 1` | Ack by Primary only | Fast, Risk of Data Loss |
| `w: majority` | Ack by >50% of nodes | Slower, Safe from Rollback |
| `readPreference` | `secondaryPreferred`, etc. | Offload Reads vs Stale Data |

4. Next Steps