Schema Design: The “Together” Rule

In Relational Database Management Systems (RDBMS), we design schemas based on the relationships between data entities (Third Normal Form). In MongoDB, we design schemas based on access patterns.

[!IMPORTANT] The Golden Rule: Data that is accessed together should be stored together.

This shift in thinking is not just stylistic. It is rooted in the physics of how data is read from a disk.

1. The Physics of Reads (Seek vs. Scan)

To understand why MongoDB favors embedding, we must look at the hardware level.

  • Random I/O (Joins): When you join tables in SQL, the database engine often has to perform multiple “seeks” to find rows scattered across the disk. Each seek adds latency (especially on HDDs, but also relevant for SSD throughput).
  • Sequential I/O (Embedding): When you embed data in a single document, the database performs one seek to find the document, then sequentially reads the BSON.

Example: The Cost of Joins

Consider reading a user profile and their 5 most recent orders.

  • Normalized: Requires finding the User, then hunting for 5 separate Order records across the disk.
  • Embedded: The User and Orders are stored in one contiguous block, requiring only a single seek.
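
The comparison above can be sketched as a small seek-counting model. This is only an illustration: the function names and the 10 ms per-seek cost are assumptions made for the example, and real latency depends on the storage device and on how much data is already cached.

```go
package main

import "fmt"

// normalizedSeeks models one seek for the user plus one seek per order,
// assuming every order record sits at a different location on disk.
func normalizedSeeks(orders int) int {
	return 1 + orders
}

// embeddedSeeks models a single seek, because the orders are stored
// inside the user document regardless of how many there are.
func embeddedSeeks(_ int) int {
	return 1
}

// latencyMs converts a seek count into latency for an assumed per-seek cost.
func latencyMs(seeks int, perSeekMs float64) float64 {
	return float64(seeks) * perSeekMs
}

func main() {
	const orders = 5
	const perSeekMs = 10.0 // assumed value, roughly HDD-like

	n := normalizedSeeks(orders)
	e := embeddedSeeks(orders)
	fmt.Printf("normalized: seeks=%d latency=%.0fms\n", n, latencyMs(n, perSeekMs))
	fmt.Printf("embedded:   seeks=%d latency=%.0fms\n", e, latencyMs(e, perSeekMs))
}
```

Under these assumptions the normalized read costs 6 seeks (60 ms) and the embedded read costs 1 seek (10 ms).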

2. Modeling for BSON

MongoDB stores data in BSON (Binary JSON). When you retrieve a document, the entire BSON blob is read into memory. This has two implications:

  1. Overhead: Field names are stored in every document (unlike SQL where column names are stored once in the table definition).
  2. Locality: If you only need the user’s name but the document contains 5MB of order history, you are wasting I/O bandwidth.
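
To make the overhead point concrete, the sketch below computes the encoded size of a BSON document by hand, following the BSON layout for string elements. It covers only flat documents whose values are all strings, and the `field` type and helper names are made up for this example; a real application would let the driver do the encoding.

```go
package main

import "fmt"

// field is a hypothetical key/value pair used only for this illustration.
type field struct {
	Key, Value string
}

// stringElementSize returns the encoded size of one BSON string element:
// 1 type byte + key bytes + 1 key terminator +
// 4-byte length prefix + value bytes + 1 value terminator.
func stringElementSize(key, value string) int {
	return 1 + len(key) + 1 + 4 + len(value) + 1
}

// docSize returns the encoded size of a flat, all-string BSON document:
// 4-byte total length + all elements + 1 trailing null byte.
func docSize(fields []field) int {
	size := 4 + 1
	for _, f := range fields {
		size += stringElementSize(f.Key, f.Value)
	}
	return size
}

func main() {
	long := []field{{"name", "Ada"}, {"email", "ada@example.com"}}
	short := []field{{"nm", "Ada"}, {"email", "ada@example.com"}}

	fmt.Println("with \"name\":", docSize(long))  // 46 bytes
	fmt.Println("with \"nm\":  ", docSize(short)) // 44 bytes
}
```

The key is paid for in every document, so renaming one field changes the size of every document in the collection by the same amount.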

Code Example: The “User” Entity

Even though MongoDB does not enforce a schema by default, your application code (Java/Go) usually defines a strict structure. Mapping fields to short BSON names through annotations (Java) or struct tags (Go) is a common optimization to reduce storage costs.

Java

import org.bson.codecs.pojo.annotations.BsonProperty;
import org.bson.types.ObjectId;
import java.util.List;

public class User {
    private ObjectId id;

    // "nm" saves bytes compared to "name" in every doc
    @BsonProperty("nm")
    private String name;

    private String email;

    // Embedded Relationship: Stored directly inside the User document
    private List<Address> addresses;

    // Getters and Setters...
}

// Address.java (in Java, each public class lives in its own file)
public class Address {
    private String street;
    private String city;
    private String zip;
}

Go

package main

import "go.mongodb.org/mongo-driver/bson/primitive"

type User struct {
    ID        primitive.ObjectID `bson:"_id,omitempty"`

    // "nm" saves bytes compared to "name" in every doc
    Name      string             `bson:"nm"`
    Email     string             `bson:"email"`

    // Embedded Relationship: Stored directly inside the User document
    Addresses []Address          `bson:"addresses,omitempty"`
}

type Address struct {
    Street string `bson:"street"`
    City   string `bson:"city"`
    Zip    string `bson:"zip"`
}

[!TIP] Pro Tip: Notice the @BsonProperty("nm") and bson:"nm" tags? Since field names are repeated in every BSON document, shortening “firstName” to “fn” saves 7 bytes per document before compression, which can add up to significant storage for collections with billions of documents.
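
As a rough back-of-the-envelope check on the tip, the saving is simply the difference in key length multiplied by the number of documents. The helper name and the one-billion document count below are assumptions for illustration, and the figure is the uncompressed BSON size, so the on-disk saving with storage compression enabled will likely be smaller.

```go
package main

import "fmt"

// bytesSaved estimates the uncompressed bytes saved by renaming one field
// that appears once in each of docs documents.
func bytesSaved(longKey, shortKey string, docs int64) int64 {
	return int64(len(longKey)-len(shortKey)) * docs
}

func main() {
	const docs = 1_000_000_000 // assumed collection size
	fmt.Println(bytesSaved("firstName", "fn", docs)) // 7000000000
}
```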

3. The 16MB Hard Limit

A single MongoDB document cannot exceed 16 megabytes.

  • Why? To prevent a single large document from hogging too much RAM or network bandwidth during processing.
  • Implication: You cannot embed everything. If a user can have an unbounded number of comments, you cannot embed them all in the User document.
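
A quick capacity estimate shows how soon an embedded array can run into the limit. The sketch below is an approximation under stated assumptions: the base document size and average comment size are invented inputs, and the small per-element overhead BSON adds for array entries is ignored, so the true ceiling is somewhat lower.

```go
package main

import "fmt"

// maxDocBytes is the 16MB document limit expressed in bytes.
const maxDocBytes = 16 * 1024 * 1024

// maxEmbeddedItems estimates how many child items of avgItemBytes fit in
// one document next to baseBytes of parent data.
func maxEmbeddedItems(baseBytes, avgItemBytes int) int {
	if avgItemBytes <= 0 || baseBytes >= maxDocBytes {
		return 0
	}
	return (maxDocBytes - baseBytes) / avgItemBytes
}

// fitsInOneDocument reports whether count items are within that estimate.
func fitsInOneDocument(baseBytes, avgItemBytes, count int) bool {
	return count <= maxEmbeddedItems(baseBytes, avgItemBytes)
}

func main() {
	// Assumed sizes: a 1 KiB user document and 512-byte comments.
	fmt.Println(maxEmbeddedItems(1024, 512))          // 32766
	fmt.Println(fitsInOneDocument(1024, 512, 1000))   // true
	fmt.Println(fitsInOneDocument(1024, 512, 100000)) // false
}
```

With these inputs a user document tops out at roughly 32,000 embedded comments, which a popular account could exceed.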

When to Embed vs. Reference?

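| Feature | Embedding | Referencing |
| --- | --- | --- |
| Read Speed | Fast (1 seek) | Slower (multiple seeks / $lookup) |
| Write Speed | Fast (atomic single-document update) | Slower when related documents must change together (multi-document transaction) |
| Data Size | Limited to 16MB per document | Not bounded by the per-document limit |
| Consistency | Strong (single document) | Not atomic across documents (unless using transactions) |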
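
The trade-offs in this comparison can be condensed into a rule of thumb. The sketch below is one possible encoding of it, not an official algorithm: the `Relationship` type, its fields, and the order of the checks are assumptions chosen for illustration.

```go
package main

import "fmt"

// maxDocBytes is the 16MB document limit expressed in bytes.
const maxDocBytes = 16 * 1024 * 1024

// Strategy names the two modeling options.
type Strategy string

const (
	Embed     Strategy = "embed"
	Reference Strategy = "reference"
)

// Relationship is a hypothetical description of a parent/child relationship.
type Relationship struct {
	Unbounded             bool // child count can grow without limit
	AccessedIndependently bool // children are often queried without the parent
	EstimatedBytes        int  // expected size of parent plus embedded children
}

// choose embeds by default and falls back to referencing when the data
// could outgrow one document or is usually read on its own.
func choose(r Relationship) Strategy {
	switch {
	case r.Unbounded:
		return Reference
	case r.EstimatedBytes >= maxDocBytes:
		return Reference
	case r.AccessedIndependently:
		return Reference
	default:
		return Embed
	}
}

func main() {
	addresses := Relationship{EstimatedBytes: 2048}
	comments := Relationship{Unbounded: true}
	fmt.Println(choose(addresses)) // embed
	fmt.Println(choose(comments))  // reference
}
```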

4. Key Takeaways

  1. Optimize for Reads: Many applications read far more often than they write. Design your schema so that the most common queries are fast.
  2. Embed by Default: Start by embedding related data. Only move to referencing if you hit the 16MB limit or have specific access patterns (e.g., accessing the child data independently).
  3. Short Keys: Map fields to short BSON names in your code annotations or struct tags to save disk space, most noticeably in very large collections.