Denormalization Strategy
> [!NOTE]
> This module explores the core principles of Denormalization Strategy, deriving solutions from first principles and hardware constraints to build production-ready expertise.
1. The Art of Duplication
In the RDBMS world, duplication is a sin (violates Normal Forms). In Cassandra, duplication is a virtue.
Denormalization is the process of copying the same data into multiple tables to optimize for different read patterns.
Why do we do it?
Because Disk Space is cheap, but CPU and I/O are expensive. It is far cheaper to store a User’s profile 3 times than it is to coordinate a distributed JOIN across 3 nodes every time you need to read it.
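As a concrete sketch (table and column names here are illustrative, not a fixed schema), "storing the profile 3 times" means one table per read path, each keyed by the column you query with:

```cql
-- One table per query pattern; the profile columns are duplicated in each.
CREATE TABLE users_by_id (
    user_id  uuid PRIMARY KEY,
    username text,
    email    text,
    name     text
);

CREATE TABLE users_by_email (
    email    text PRIMARY KEY,
    user_id  uuid,
    username text,
    name     text
);

CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    user_id  uuid,
    email    text,
    name     text
);
```

Each lookup is now a single-partition read on one node; no cross-node coordination is needed.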
2. Write Amplification vs. Read Latency
This tradeoff is the core of Cassandra data modeling.
- Normalized (SQL): 1 Write per update. Complex, slow Reads (JOINs).
- Denormalized (CQL): N Writes per update. Simple, fast Reads (O(1)).
Since most web-scale applications are Read-Heavy (e.g., 100 reads for every 1 write), optimizing for reads yields the best total system performance.
Scenario: updating a User Profile that is stored in 3 different tables (by ID, by email, by username).
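The break-even arithmetic behind this trade-off can be sketched in a few lines. The per-operation latencies below are illustrative assumptions (10 ms for a multi-partition "join"-style read, 1 ms for a single-partition operation), not benchmarks:

```go
package main

import "fmt"

// totalLatency returns the total time spent serving `reads` reads and
// `writes` writes, given per-operation costs in milliseconds.
func totalLatency(reads, writes int, readMs, writeMs float64) float64 {
	return float64(reads)*readMs + float64(writes)*writeMs
}

func main() {
	const reads, updates = 100, 1 // a typical read-heavy 100:1 workload

	// Normalized: 1 write per update, but each read must gather data
	// from multiple partitions (assumed 10 ms vs. 1 ms).
	normalized := totalLatency(reads, updates*1, 10.0, 1.0)

	// Denormalized: 3 writes per update (one per table), but every
	// read is a 1 ms single-partition lookup.
	denormalized := totalLatency(reads, updates*3, 1.0, 1.0)

	fmt.Printf("normalized:   %.0f ms\n", normalized)   // 100*10 + 1*1 = 1001 ms
	fmt.Printf("denormalized: %.0f ms\n", denormalized) // 100*1  + 3*1 = 103 ms
}
```

Under these assumptions the denormalized model wins by roughly 10x at a 100:1 read/write ratio; write amplification only dominates when the workload becomes write-heavy.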
3. Implementation Strategy: Logged Batches
When you denormalize, you must ensure that all copies of the data are updated together. If one update fails and others succeed, your data becomes inconsistent.
Cassandra provides Logged Batches to guarantee atomicity (all or nothing) for updates across multiple partitions.
How Logged Batches Work
Unlike an RDBMS transaction, a Logged Batch does not provide Isolation: other clients may observe partial updates while the batch is in flight. It only guarantees Atomicity, meaning either all statements will eventually be applied, or none will.
Logged Batch Sequence:
1. The client sends the batch to a coordinator node.
2. The coordinator persists the batch to the batchlog on two other nodes.
3. The coordinator executes each statement against its target partitions.
4. On success, the batchlog entry is removed; if the coordinator fails mid-batch, the batchlog nodes replay the remaining statements.
Code: Syncing Multiple Tables
```java
// Using the DataStax Java Driver (4.x)
import java.util.UUID;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.BatchType;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public void updateUser(UUID userId, String newEmail, String newName) {
    // Update table 1: users_by_id
    SimpleStatement update1 = SimpleStatement.builder(
            "UPDATE users_by_id SET email = ?, name = ? WHERE user_id = ?")
        .addPositionalValues(newEmail, newName, userId)
        .build();

    // Update table 2: users_by_email
    // Note: in reality, changing a primary-key column requires DELETE + INSERT.
    // This is simplified.
    SimpleStatement update2 = SimpleStatement.builder(
            "UPDATE users_by_email SET name = ? WHERE email = ?")
        .addPositionalValues(newName, newEmail)
        .build();

    // Logged Batch ensures all-or-nothing atomicity.
    // BatchStatement is immutable in driver 4.x, so add() returns a new instance.
    BatchStatement batch = BatchStatement.newInstance(BatchType.LOGGED)
        .add(update1)
        .add(update2);

    session.execute(batch);
}
```
```go
func UpdateUser(session *gocql.Session, userID gocql.UUID, newEmail, newName string) error {
	// Logged Batch ensures all-or-nothing atomicity
	batch := session.NewBatch(gocql.LoggedBatch)

	// Update table 1
	batch.Query(
		"UPDATE users_by_id SET email = ?, name = ? WHERE user_id = ?",
		newEmail, newName, userID,
	)

	// Update table 2
	batch.Query(
		"UPDATE users_by_email SET name = ? WHERE email = ?",
		newName, newEmail,
	)

	return session.ExecuteBatch(batch)
}
```
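The simplification noted in the comments matters in practice: `email` is the partition key of `users_by_email`, and Cassandra cannot update a partition-key column in place. A sketch of the full pattern, assuming the application still knows the old email (the values and UUID below are placeholders), moves the row with DELETE + INSERT inside the same logged batch:

```cql
BEGIN BATCH
  UPDATE users_by_id
     SET email = 'new@example.com', name = 'Ada'
   WHERE user_id = f47ac10b-58cc-4372-a567-0e02b2c3d479;

  -- The partition key changed, so move the row instead of updating it:
  DELETE FROM users_by_email
   WHERE email = 'old@example.com';

  INSERT INTO users_by_email (email, user_id, name)
  VALUES ('new@example.com', f47ac10b-58cc-4372-a567-0e02b2c3d479, 'Ada');
APPLY BATCH;
```

This requires reading the old email first, which is one reason primary-key changes are the most expensive kind of denormalized update.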
4. Materialized Views (Warning)
Cassandra has a feature called Materialized Views (MV) that automates this denormalization server-side.
```cql
CREATE MATERIALIZED VIEW users_by_email AS
  SELECT * FROM users_by_id
  WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);
```
> [!CAUTION]
> **Performance Hazard: Read-Before-Write.** Materialized Views force Cassandra to perform a Read-Before-Write to ensure consistency when updating the view. This kills write performance. Additionally, MVs have a history of instability. The industry standard is Application-Side Denormalization (using Batches) as shown above.
5. Summary
- Denormalize to achieve O(1) reads.
- Accept Write Amplification as the cost of performance.
- Use Logged Batches to keep tables consistent.
- Avoid Materialized Views in favor of explicit application logic.
In the final chapter, we’ll review everything and provide a cheat sheet.