Index Fundamentals & The B-Tree

Why do databases need indexes? To understand this, we must look at the physical reality of how computers store data.

1. The Genesis: O(N) vs O(log N)

Imagine a library with 1,000,000 books but no catalog. To find a specific book, you would have to walk through every aisle, looking at every single spine. This is a Linear Scan, or O(N). In database terms, this is a Collection Scan (COLLSCAN).

Now imagine a library with a sorted card catalog. You don’t scan every card. You jump to the “M” section, then “Ma”, then “Mac”, and find your book in seconds. This is a Logarithmic Search, or O(log N).
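The gap between the two is easy to measure. Below is a rough sketch in Go (the helper names `linearSteps` and `binarySteps` are made up for illustration) that counts how many comparisons each strategy needs to find one ID among a million:

```go
package main

import (
	"fmt"
	"sort"
)

// linearSteps counts how many documents a full scan touches
// before it finds the target.
func linearSteps(ids []int, target int) int {
	steps := 0
	for _, id := range ids {
		steps++
		if id == target {
			return steps
		}
	}
	return steps
}

// binarySteps counts how many probes a binary search over the
// sorted slice needs for the same target.
func binarySteps(ids []int, target int) int {
	steps := 0
	sort.Search(len(ids), func(i int) bool {
		steps++
		return ids[i] >= target
	})
	return steps
}

func main() {
	ids := make([]int, 1_000_000)
	for i := range ids {
		ids[i] = i
	}
	fmt.Println("linear:", linearSteps(ids, 500_000)) // 500,001 comparisons
	fmt.Println("binary:", binarySteps(ids, 500_000)) // ~20 probes
}
```

Half a million comparisons versus roughly twenty: that is the O(N) vs O(log N) difference in concrete numbers.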

Hardware Reality: The Cost of I/O

The bottleneck in almost every database is Disk I/O.

  • RAM Access: ~100 nanoseconds.
  • SSD Access: ~100 microseconds (1,000x slower).
  • HDD Access: ~10 milliseconds (100,000x slower).

Reading a document from disk is expensive. A Collection Scan forces the disk to read everything. An Index allows the database to read only the specific blocks containing the data you need.


2. The B-Tree Data Structure

MongoDB's storage engine (WiredTiger) keeps its indexes in B-Trees: self-balancing trees in which each node holds many keys. A B-Tree is designed specifically for storage systems because each node maps to a disk block (page), so the tree stays shallow and a lookup touches only a handful of pages.
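To see why the shallow shape matters, here is a back-of-the-envelope sketch (assuming an idealized, fully packed tree with a hypothetical fan-out of 100 keys per node; real fan-out depends on key size and page size):

```go
package main

import "fmt"

// btreeDepth returns how many levels an idealized, fully packed
// B-Tree needs to hold n keys when each node holds up to fanout keys.
// Each level costs roughly one disk page read.
func btreeDepth(n, fanout int) int {
	depth := 0
	capacity := 1
	for capacity < n {
		capacity *= fanout
		depth++
	}
	return depth
}

func main() {
	fmt.Println(btreeDepth(1_000_000, 100))     // 3 levels for 1M keys
	fmt.Println(btreeDepth(1_000_000_000, 100)) // 5 levels for 1B keys
}
```

Even a billion keys costs only about five page reads, which is why the index lookup stays cheap as the collection grows.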

Visualizing the Difference

(Interactive demo omitted: a race to find ID 500,000 in a collection of 1,000,000 documents. The Collection Scan, O(N), touches up to 1,000,000 documents; the Index Scan, O(log N), touches about 20 nodes.)

3. Index Types

MongoDB supports several index types to handle different data patterns.

Single Field Index

The most basic index. It creates a sorted list of values for a single field.

**Java**

```java
// Create an ascending index on "email"
MongoCollection<Document> collection = database.getCollection("users");
collection.createIndex(Indexes.ascending("email"));

// Create a descending index on "score"
collection.createIndex(Indexes.descending("score"));
```

**Go**

```go
// Create an ascending index on "email"
model := mongo.IndexModel{
    Keys: bson.D{{"email", 1}},
}
collection.Indexes().CreateOne(context.TODO(), model)

// Create a descending index on "score"
modelScore := mongo.IndexModel{
    Keys: bson.D{{"score", -1}},
}
collection.Indexes().CreateOne(context.TODO(), modelScore)
```

Compound Index

An index on multiple fields. This is where optimization gets interesting. The order of fields is critical. A compound index on { lastname: 1, firstname: 1 } sorts first by lastname, then by firstname.
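The ordering a compound index maintains is exactly a composite sort: compare on the first field, and only break ties with the second. A small Go sketch (with made-up sample names) of the { lastname: 1, firstname: 1 } ordering:

```go
package main

import (
	"fmt"
	"sort"
)

type user struct{ lastname, firstname string }

// sortUsers orders users the way a { lastname: 1, firstname: 1 }
// compound index would: by lastname first, then firstname as a tiebreaker.
func sortUsers(users []user) {
	sort.Slice(users, func(i, j int) bool {
		if users[i].lastname != users[j].lastname {
			return users[i].lastname < users[j].lastname
		}
		return users[i].firstname < users[j].firstname
	})
}

func main() {
	users := []user{
		{"Smith", "Zoe"},
		{"Jones", "Amy"},
		{"Smith", "Adam"},
	}
	sortUsers(users)
	fmt.Println(users) // [{Jones Amy} {Smith Adam} {Smith Zoe}]
}
```

Note that this ordering is useless for finding a firstname on its own, which is why field order in a compound index is critical.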

[!TIP] Index Prefixes: A compound index supports queries on any leading prefix of its keys. If you have an index on { A: 1, B: 1 }, queries on A alone can use it, so you effectively also have an index on { A: 1 }. You do not have an index on { B: 1 }.

Multikey Index (Arrays)

If you index a field that contains an array, MongoDB creates an index entry for every element in that array.

[!WARNING] Explosion Risk: If you have a document with 100 tags and create an index on tags, that one document creates 100 index entries. This can significantly increase write latency and storage size.
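The multiplication is simple but easy to underestimate. A quick sketch of the entry count (assuming, as the warning above states, one index entry per array element per document):

```go
package main

import "fmt"

// multikeyEntries counts the index entries a multikey index on "tags"
// would create: one entry per array element, per document.
func multikeyEntries(tagsPerDoc []int) int {
	total := 0
	for _, n := range tagsPerDoc {
		total += n
	}
	return total
}

func main() {
	// Three documents with 2, 100, and 5 tags respectively.
	fmt.Println(multikeyEntries([]int{2, 100, 5})) // 107 entries for just 3 documents
}
```

Every insert or update of a heavily tagged document has to maintain all of those entries, which is where the write-latency cost comes from.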

TTL (Time To Live) Index

Automatically deletes documents after a specified time. Useful for session data, logs, or temporary events.

**Java**

```java
// Expire documents 1 hour (3600 seconds) after the "createdAt" time
IndexOptions options = new IndexOptions().expireAfter(3600L, TimeUnit.SECONDS);
collection.createIndex(Indexes.ascending("createdAt"), options);
```

**Go**

```go
// Expire documents 1 hour after "createdAt"
opts := options.Index().SetExpireAfterSeconds(3600)
model := mongo.IndexModel{
    Keys:    bson.D{{"createdAt", 1}},
    Options: opts,
}
collection.Indexes().CreateOne(context.TODO(), model)
```

Sparse & Partial Indexes

  • Sparse: Only indexes documents that contain the field. If a document doesn’t have the field, it’s not in the index. Saves space.
  • Partial: Only indexes documents that match a filter expression (e.g., only index users where status: "active"). Partial indexes offer a superset of sparse-index functionality and are generally preferred.
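The difference between the two is just the predicate that decides whether a document gets an index entry. A sketch of that predicate (the function names are hypothetical; this models the behavior, not the driver API):

```go
package main

import "fmt"

type doc map[string]any

// sparseIndexed mimics a sparse index on "status": a document is
// indexed if and only if the field exists at all.
func sparseIndexed(d doc) bool {
	_, ok := d["status"]
	return ok
}

// partialIndexed mimics a partial index with
// partialFilterExpression: { status: "active" }.
func partialIndexed(d doc) bool {
	return d["status"] == "active"
}

func main() {
	docs := []doc{
		{"name": "a", "status": "active"},
		{"name": "b", "status": "banned"},
		{"name": "c"}, // no status field at all
	}
	for _, d := range docs {
		fmt.Println(d["name"], "sparse:", sparseIndexed(d), "partial:", partialIndexed(d))
	}
}
```

The sparse index still pays for every "banned" user; the partial index keeps only the documents the filter actually matches, which is why it is the more flexible tool.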

4. Summary & Best Practices

  1. Always Index query targets: If you query by a field often, index it.
  2. Avoid over-indexing: Every index slows down writes (insert/update/delete) because the B-Tree must be updated.
  3. Use Compound Indexes wisely: Follow the ESR rule (covered in the next chapter).
  4. Monitor RAM: Indexes should ideally fit in RAM (WiredTiger Cache). If indexes exceed RAM, performance falls off a cliff due to disk thrashing.