Relationships: Embed or Reference?

The most common question in MongoDB modeling is: “Should I embed this data or put it in its own collection?” The answer depends almost entirely on cardinality: how many “child” records exist for each “parent”.

1. The Cardinality Spectrum

We categorize relationships into four buckets. As the number of children grows, the pressure to reference increases.

  1. One-to-One (1:1): A User has one Profile. → Embed.
  2. One-to-Few (1:N): A Post has 20 Tags. → Embed.
  3. One-to-Many (1:N): A Publisher has 5,000 Books. → Reference (usually, to stay clear of the 16 MB document size limit).
  4. One-to-Squillions (1:N): A Sensor has 10 billion Readings. → Reference (Mandatory, often with Time-Series collections).
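
To make the size pressure concrete, here is a rough back-of-the-envelope sketch. The function names and the byte sizes are illustrative assumptions, not measurements, and the estimate ignores BSON's per-field and per-array overhead, so treat it as a lower bound rather than an exact figure.

```go
package main

import "fmt"

// maxDocBytes is MongoDB's maximum BSON document size: 16 MiB.
const maxDocBytes = 16 * 1024 * 1024

// embeddedSize estimates the size of a parent document that embeds
// `children` items of `avgChildBytes` each, on top of `parentBytes`
// for the parent's own fields. All inputs are assumed averages.
func embeddedSize(parentBytes, children, avgChildBytes int64) int64 {
	return parentBytes + children*avgChildBytes
}

// fitsInOneDocument reports whether the estimate stays within the limit.
func fitsInOneDocument(parentBytes, children, avgChildBytes int64) bool {
	return embeddedSize(parentBytes, children, avgChildBytes) <= maxDocBytes
}

func main() {
	// A post with 20 short tags (assumed 32 bytes each).
	fmt.Println(fitsInOneDocument(1024, 20, 32)) // true

	// A publisher with 5,000 books (assumed 4 KiB each).
	fmt.Println(fitsInOneDocument(1024, 5000, 4096)) // false
}
```

Even when the estimate fits, a document that keeps growing toward the limit is usually a sign that the children belong in their own collection.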

2. Implementing References (The Manual Join)

MongoDB has no equivalent of a SQL JOIN in ordinary find queries. The closest server-side option is the $lookup stage in aggregation pipelines, which is powerful but can be expensive. So when you choose to reference, you usually perform a “Manual Join” (or “Application-Side Join”).

  1. Query 1: Get the Parent (e.g., Author).
  2. Query 2: Get the Children using the Parent’s ID (e.g., Books where author_id = X).

This is usually fast because the first query uses the _id index, which every collection has by default, and the second can use an index on author_id, provided you have created one.

Code Example: Manual Join

Java

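// Imports this snippet relies on:
import static com.mongodb.client.model.Filters.eq;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import org.bson.types.ObjectId;

// 1. Get the Author (db is an existing MongoDatabase)
MongoCollection<Document> authors = db.getCollection("authors");
Document author = authors.find(eq("_id", new ObjectId("65df..."))).first();
if (author == null) {
    throw new IllegalStateException("Author not found");
}

// 2. Get the Books for this Author
// Requires an index on "author_id" for performance
MongoCollection<Document> books = db.getCollection("books");
FindIterable<Document> authorBooks = books.find(eq("author_id", author.get("_id")));

for (Document book : authorBooks) {
    System.out.println(book.toJson());
}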

Go

// 1. Get the Author
var author Author
err := authorsCollection.FindOne(ctx, bson.M{"_id": authorID}).Decode(&author)
if err != nil {
    log.Fatal(err)
}

// 2. Get the Books for this Author
// Requires an index on "author_id"
cursor, err := booksCollection.Find(ctx, bson.M{"author_id": author.ID})
if err != nil {
    log.Fatal(err)
}
defer cursor.Close(ctx)

var books []Book
if err = cursor.All(ctx, &books); err != nil {
    log.Fatal(err)
}

// Result: You have the author struct and a slice of book structs
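
To see why two indexed lookups stay cheap, it can help to model them without a database. The sketch below is a toy in-memory stand-in, not driver code: the `store` type and its methods are hypothetical, and Go maps play the role of the two indexes (real MongoDB indexes are B-trees, so this only illustrates the access pattern).

```go
package main

import "fmt"

type Author struct {
	ID   string
	Name string
}

type Book struct {
	ID       string
	AuthorID string
	Title    string
}

// store is a hypothetical stand-in for two collections and their indexes.
type store struct {
	authorsByID     map[string]Author // plays the role of the _id index
	booksByAuthorID map[string][]Book // plays the role of an author_id index
}

func newStore() *store {
	return &store{
		authorsByID:     make(map[string]Author),
		booksByAuthorID: make(map[string][]Book),
	}
}

func (s *store) addAuthor(a Author) { s.authorsByID[a.ID] = a }

func (s *store) addBook(b Book) {
	s.booksByAuthorID[b.AuthorID] = append(s.booksByAuthorID[b.AuthorID], b)
}

// authorWithBooks performs the two lookups of a manual join.
// The final bool is false when the author does not exist.
func (s *store) authorWithBooks(id string) (Author, []Book, bool) {
	author, ok := s.authorsByID[id] // Query 1: parent by _id
	if !ok {
		return Author{}, nil, false
	}
	return author, s.booksByAuthorID[author.ID], true // Query 2: children by author_id
}

func main() {
	s := newStore()
	s.addAuthor(Author{ID: "a1", Name: "Example Author"})
	s.addBook(Book{ID: "b1", AuthorID: "a1", Title: "First Book"})
	s.addBook(Book{ID: "b2", AuthorID: "a1", Title: "Second Book"})

	author, books, ok := s.authorWithBooks("a1")
	fmt.Println(ok, author.Name, len(books))
}
```

Neither lookup scans a whole collection. Without the author_id index, the second step would have to examine every book.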

3. The Subset Pattern

What if you have a “One-to-Many” relationship where you usually only need the top 5 items, but sometimes need all of them?

Example: A Product Page. You want to show the product details and the Top 5 Reviews. There might be 5,000 reviews total.

Solution: Use the Subset Pattern.

  1. Embed the top 5 reviews directly in the Product document.
  2. Store all 5,000 reviews in a separate Reviews collection, each holding a reference to its product.

When loading the page, a single query returns the product together with its top reviews. If the user clicks “See All”, you query the Reviews collection.
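
The write side of the pattern is where the extra work goes: each new review lands in the full collection, and the embedded subset is refreshed. The sketch below models that in memory. The `catalog` type, the `Rating` field, and ranking by rating are assumptions for illustration; a real system might rank by recency or helpfulness and would issue two database writes.

```go
package main

import (
	"fmt"
	"sort"
)

// subsetSize is how many reviews are embedded in the product.
const subsetSize = 5

type Review struct {
	ID     string
	Rating int
}

type Product struct {
	ID         string
	TopReviews []Review // embedded subset, at most subsetSize entries
}

// catalog is a hypothetical stand-in for the products and reviews collections.
type catalog struct {
	product Product
	reviews []Review // the full Reviews collection
}

// addReview stores the review in the full collection, then recomputes the
// embedded subset from the previous subset plus the new review.
func (c *catalog) addReview(r Review) {
	c.reviews = append(c.reviews, r)

	top := append(c.product.TopReviews, r)
	// Highest rating first; the stable sort keeps older reviews ahead on ties.
	sort.SliceStable(top, func(i, j int) bool { return top[i].Rating > top[j].Rating })
	if len(top) > subsetSize {
		top = top[:subsetSize]
	}
	c.product.TopReviews = top
}

func main() {
	c := &catalog{product: Product{ID: "p1"}}
	for i := 1; i <= 8; i++ {
		c.addReview(Review{ID: fmt.Sprintf("r%d", i), Rating: i})
	}
	// The page load reads only the product; "See All" reads c.reviews.
	fmt.Println(len(c.product.TopReviews), len(c.reviews))
}
```

The cost of the pattern is this duplication: the embedded copies must be kept in step with the full collection whenever a review is added, edited, or removed.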

4. The Trade-off: Atomicity

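The biggest downside of referencing is that a change spanning several documents is no longer atomic. MongoDB guarantees atomicity for a write to a single document, but by default not across documents.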

  • Embedded: Updating a user’s profile and their address is one atomic write. It either succeeds completely or not at all.
  • Referenced: Creating an Author and then creating a Book are two separate writes. If the second one fails, you might have an “orphan” record or inconsistent state.

[!NOTE] MongoDB supports multi-document ACID transactions (since 4.0 on replica sets and 4.2 on sharded clusters), allowing you to update multiple documents and collections atomically. However, transactions come with a performance cost and should not be the default for every operation.
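
Where a transaction is not warranted, one option is a compensating write: if the second write fails, undo the first. The sketch below simulates this in memory. The `db` type, its methods, and the `failBookInsert` switch are all hypothetical, and in a real system the compensating delete can itself fail, so this narrows the window for orphans rather than closing it.

```go
package main

import (
	"errors"
	"fmt"
)

// db is a hypothetical in-memory stand-in for two collections.
type db struct {
	authors        map[string]string // author ID -> name
	books          map[string]string // book ID -> author ID
	failBookInsert bool              // simulates a failing second write
}

func newDB() *db {
	return &db{authors: map[string]string{}, books: map[string]string{}}
}

func (d *db) insertAuthor(id, name string) { d.authors[id] = name }

func (d *db) insertBook(id, authorID string) error {
	if d.failBookInsert {
		return errors.New("simulated book insert failure")
	}
	d.books[id] = authorID
	return nil
}

// createAuthorWithBook issues two separate writes. If the second fails,
// it removes the author again so no half-finished state is left behind.
func (d *db) createAuthorWithBook(authorID, name, bookID string) error {
	d.insertAuthor(authorID, name)
	if err := d.insertBook(bookID, authorID); err != nil {
		delete(d.authors, authorID) // compensating write
		return fmt.Errorf("create book %q: %w", bookID, err)
	}
	return nil
}

func main() {
	d := newDB()
	d.failBookInsert = true
	err := d.createAuthorWithBook("a1", "Example Author", "b1")
	fmt.Println(err != nil, len(d.authors), len(d.books))
}
```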

5. Key Takeaways

  1. Prefer Embedding: It’s faster (fewer queries) and simpler (atomic updates).
  2. Reference when Growing: If the array of children grows without limit (comments, logs), reference it.
  3. Subset Pattern: Best of both worlds—embed the “working set” (recent/top items) and reference the rest.