MongoDB Architecture

Understanding how MongoDB works under the hood is critical for performance tuning and troubleshooting. Unlike a monolithic SQL server, MongoDB is designed to be distributed.

1. Core Components

A production MongoDB deployment is built from three main components:

  1. mongod: The primary daemon process for the database system. It handles data requests, manages data access, and performs background management operations.
  2. mongos: The query router. In a Sharded Cluster, your application connects to mongos, which acts as a load balancer and directs queries to the correct shards.
  3. Config Servers: Store the metadata for a sharded cluster (e.g., which data lives on which shard).
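The routing logic above can be sketched in a few lines. This is a minimal, illustrative simulation of how a mongos-style router might use config-server metadata (a map of shard-key ranges to shards) to direct a query; the chunk boundaries and shard names are invented for the example, not real MongoDB internals.

```python
# Config-server metadata: shard-key ranges mapped to shards (illustrative).
CHUNK_MAP = [
    {"min": "A", "max": "N", "shard": "shard1"},  # users A-M
    {"min": "N", "max": "~", "shard": "shard2"},  # users N-Z
]

def route(shard_key_value: str) -> str:
    """Return the shard that owns the given shard-key value."""
    for chunk in CHUNK_MAP:
        if chunk["min"] <= shard_key_value < chunk["max"]:
            return chunk["shard"]
    raise KeyError(f"no chunk owns {shard_key_value!r}")

print(route("Alice"))  # shard1
print(route("Zoe"))    # shard2
```

The application never needs to know this map exists; it simply sends queries to the router, which keeps a cached copy of the metadata from the config servers.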

Cluster Architecture

[Diagram: production Sharded Cluster.] The Application Driver (application layer) connects to mongos (routing layer). mongos consults the Config Servers' metadata map and forwards each query to the shard that owns the data, e.g., Shard 1 (Users A-M) or Shard 2 (Users N-Z).


2. Storage Engine: WiredTiger

Since MongoDB 3.2, WiredTiger is the default storage engine. It is responsible for managing how data is stored on disk (HDD/SSD) and in memory (RAM).

Key Features

  1. Document-Level Concurrency: Unlike older engines (MMAPv1) that locked the entire database or collection during a write, WiredTiger locks only the specific document being modified. This allows massive write concurrency.
  2. Compression: WiredTiger compresses data on disk (the Snappy algorithm by default; zlib and zstd are also supported), which can substantially reduce the storage footprint, often by 50-70% depending on the data.
  3. Journaling (WAL): To ensure durability, every write is first recorded in a sequential Journal (Write Ahead Log) before being applied to the data files. If the server crashes, MongoDB replays the journal to recover data.
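The benefit of document-level concurrency can be illustrated with a small simulation (a sketch of the idea, not WiredTiger code): give each document its own lock, so writers touching different documents never block each other, unlike a collection-wide or database-wide lock.

```python
import threading
from collections import defaultdict

# One lock per document: writers to different documents proceed in parallel.
doc_locks = defaultdict(threading.Lock)
collection = {"doc1": {"count": 0}, "doc2": {"count": 0}}

def increment(doc_id: str, times: int) -> None:
    for _ in range(times):
        with doc_locks[doc_id]:          # lock only this document
            collection[doc_id]["count"] += 1

threads = [
    threading.Thread(target=increment, args=("doc1", 10_000)),
    threading.Thread(target=increment, args=("doc2", 10_000)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(collection)  # both counters reach 10000
```

With a single global lock, the two threads would serialize completely; with per-document locks they contend only when they target the same document.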

WiredTiger Checkpoints & Journaling

How does MongoDB ensure speed (RAM) and safety (Disk)?

  1. Writes go to the Journal (Disk) and Cache (RAM).
  2. Every 60 seconds (by default), a Checkpoint flushes dirty pages from the Cache to the Data Files.
  3. The Journal is truncated after a successful checkpoint.
The Cache (RAM) is fast but volatile; the Journal (disk) is safe and sequential. If power fails, everything in RAM is lost, but MongoDB replays the Journal on restart to restore the data.
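The journal/checkpoint cycle above can be sketched as a toy write-ahead log (purely illustrative; the names and structures are assumptions for the example, not MongoDB internals):

```python
journal = []        # sequential write-ahead log (disk)
cache = {}          # dirty pages (RAM, volatile)
data_file = {}      # checkpointed data (disk)

def write(key, value):
    journal.append((key, value))  # 1. durable journal entry first...
    cache[key] = value            # 2. ...then the fast in-memory update

def checkpoint():
    data_file.update(cache)       # flush dirty pages to the data files
    journal.clear()               # journal can now be truncated

def recover():
    restored = dict(data_file)    # start from the last checkpoint...
    for key, value in journal:    # ...and replay journal entries in order
        restored[key] = value
    return restored

write("a", 1)
checkpoint()          # "a" reaches the data file; journal truncated
write("b", 2)         # journaled but not yet checkpointed
cache.clear()         # simulate power loss: RAM contents are gone
print(recover())      # {'a': 1, 'b': 2} -- nothing was lost
```

Note the ordering: because the journal entry lands on disk before the in-memory update, a crash at any point leaves enough information to reconstruct the state.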

3. High Availability: Replica Sets

A Replica Set is a group of mongod processes that maintain the same data set.

  • Primary: The only node that accepts Writes. It replicates changes to secondaries via an “Oplog” (Operations Log).
  • Secondary: Replicates data from the Primary. Can be configured to accept Reads (Read Preference).
  • Automatic Failover: If the Primary dies, the remaining members hold an election and one Secondary is promoted to Primary, typically within seconds.
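The pull-based, asynchronous nature of oplog replication can be sketched as follows (class names and structure are invented for illustration): the primary appends every write to its oplog, and each secondary remembers how far it has read and replays any newer entries when it syncs.

```python
class Primary:
    def __init__(self):
        self.data, self.oplog = {}, []

    def write(self, key, value):
        self.data[key] = value
        self.oplog.append(("set", key, value))  # record the operation

class Secondary:
    def __init__(self):
        self.data, self.applied = {}, 0   # 'applied' = oplog position read so far

    def sync(self, primary):
        for op, key, value in primary.oplog[self.applied:]:
            self.data[key] = value        # replay operations in order
        self.applied = len(primary.oplog)

primary, sec1, sec2 = Primary(), Secondary(), Secondary()
primary.write("user:1", {"name": "Ada"})
sec1.sync(primary)                        # sec1 catches up
primary.write("user:2", {"name": "Lin"})  # sec1 now lags by one op
sec2.sync(primary)
print(sec1.data)  # only user:1 -- replication is asynchronous
print(sec2.data)  # user:1 and user:2
```

Because secondaries pull asynchronously, a secondary can briefly lag the primary; this is why reads from secondaries may return slightly stale data.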

Visualizing Replication

[Diagram: Primary replicating to Sec 1 and Sec 2.] Secondaries pull operations from the Primary's Oplog asynchronously.

4. Scalability: Sharding

Sharding is the method for distributing data across multiple machines.

  • When to Shard: When your working set exceeds the RAM of a single server, or when write throughput becomes a bottleneck.
  • Shard Key: You must choose a field (e.g., user_id) to partition data.
  • Chunks: MongoDB splits data into “chunks” based on the Shard Key ranges.
  • Balancer: A background process that moves chunks between shards to keep the cluster balanced.
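The chunk/balancer idea can be sketched in a few lines (a simplified model with invented helper names; real chunk splitting is by key range and size, and the real balancer accounts for migration cost):

```python
def split_into_chunks(keys, chunk_size):
    """Group sorted shard-key values into fixed-size chunks (simplified)."""
    keys = sorted(keys)
    return [keys[i:i + chunk_size] for i in range(0, len(keys), chunk_size)]

def balance(shards):
    """Move chunks until no shard holds 2+ more chunks than another."""
    while max(map(len, shards.values())) - min(map(len, shards.values())) > 1:
        donor = max(shards, key=lambda s: len(shards[s]))
        recipient = min(shards, key=lambda s: len(shards[s]))
        shards[recipient].append(shards[donor].pop())  # migrate one chunk
    return shards

# 40 user IDs split into 4 chunks, all initially on shard1.
chunks = split_into_chunks([f"user{i:03d}" for i in range(40)], chunk_size=10)
shards = {"shard1": chunks, "shard2": []}
balance(shards)
print({s: len(c) for s, c in shards.items()})  # {'shard1': 2, 'shard2': 2}
```

This is why shard-key choice matters: if all new writes land in one key range (e.g., a monotonically increasing timestamp), every insert hits the same chunk and the balancer cannot spread the write load.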

5. Summary

  • Use Replica Sets for high availability (redundancy).
  • Use Sharding for horizontal scalability (performance/size).
  • WiredTiger provides compression and document-level locking for high performance.
  • In a sharded cluster, your app connects to mongos, not the shards directly.