Denormalization Strategy
> [!NOTE]
> This module explores the core principles of Denormalization Strategy, deriving solutions from first principles and hardware constraints to build production-ready expertise.
1. The Art of Duplication
In the RDBMS world, duplication is a sin (violates Normal Forms). In Cassandra, duplication is a virtue.
Denormalization is the process of copying the same data into multiple tables to optimize for different read patterns.
Why do we do it?
Because Disk Space is cheap, but CPU and I/O are expensive. It is far cheaper to store a User’s profile 3 times than it is to coordinate a distributed JOIN across 3 nodes every time you need to read it.
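As a concrete sketch (table and column names here are illustrative, not a fixed schema), "storing the profile 3 times" means one table per read path, each keyed by the column you query with:

```cql
-- One table per query pattern; the profile columns are duplicated in each.
CREATE TABLE users_by_id (
    user_id  uuid PRIMARY KEY,
    username text,
    email    text,
    name     text
);

CREATE TABLE users_by_email (
    email    text PRIMARY KEY,
    user_id  uuid,
    username text,
    name     text
);

CREATE TABLE users_by_username (
    username text PRIMARY KEY,
    user_id  uuid,
    email    text,
    name     text
);
```

Each lookup is now a single-partition read on one node; no cross-node coordination is needed.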
2. Write Amplification vs. Read Latency
This tradeoff is the core of Cassandra data modeling.
- Normalized (SQL): 1 Write per update. Complex, slow Reads (JOINs).
- Denormalized (CQL): N Writes per update. Simple, fast Reads (O(1)).
Since most web-scale applications are Read-Heavy (e.g., 100 reads for every 1 write), optimizing for reads yields the best total system performance.
Scenario: updating a User Profile that is stored in 3 different tables (by ID, by email, by username).
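The break-even arithmetic behind this trade-off can be sketched in a few lines. The per-operation latencies below are illustrative assumptions (10 ms for a multi-partition "join"-style read, 1 ms for a single-partition operation), not benchmarks:

```go
package main

import "fmt"

// totalLatency returns the total time spent serving `reads` reads and
// `writes` writes, given per-operation costs in milliseconds.
func totalLatency(reads, writes int, readMs, writeMs float64) float64 {
	return float64(reads)*readMs + float64(writes)*writeMs
}

func main() {
	const reads, updates = 100, 1 // a typical read-heavy 100:1 workload

	// Normalized: 1 write per update, but each read must gather data
	// from multiple partitions (assumed 10 ms vs. 1 ms).
	normalized := totalLatency(reads, updates*1, 10.0, 1.0)

	// Denormalized: 3 writes per update (one per table), but every
	// read is a 1 ms single-partition lookup.
	denormalized := totalLatency(reads, updates*3, 1.0, 1.0)

	fmt.Printf("normalized:   %.0f ms\n", normalized)   // 100*10 + 1*1 = 1001 ms
	fmt.Printf("denormalized: %.0f ms\n", denormalized) // 100*1  + 3*1 = 103 ms
}
```

Under these assumptions the denormalized model wins by roughly 10x at a 100:1 read/write ratio; write amplification only dominates when the workload becomes write-heavy.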
3. Implementation Strategy: Logged Batches
When you denormalize, you must ensure that all copies of the data are updated together. If one update fails and others succeed, your data becomes inconsistent.
Cassandra provides Logged Batches to guarantee atomicity (all or nothing) for updates across multiple partitions.
How Logged Batches Work
Unlike an RDBMS transaction, a Logged Batch does not provide Isolation: other clients may observe partial updates while the batch is in flight. It only guarantees Atomicity, meaning either all statements will eventually be applied, or none will.
Logged Batch Sequence:
1. The client sends the batch to a coordinator node.
2. The coordinator persists the batch to the batchlog on two other nodes.
3. The coordinator executes each statement against its target partitions.
4. On success, the batchlog entry is removed; if the coordinator fails mid-batch, the batchlog nodes replay the remaining statements.
Code: Syncing Multiple Tables
```java
// Using the DataStax Java Driver (4.x)
import java.util.UUID;
import com.datastax.oss.driver.api.core.cql.BatchStatement;
import com.datastax.oss.driver.api.core.cql.BatchType;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

public void updateUser(UUID userId, String newEmail, String newName) {
    // Update table 1: users_by_id
    SimpleStatement update1 = SimpleStatement.builder(
            "UPDATE users_by_id SET email = ?, name = ? WHERE user_id = ?")
        .addPositionalValues(newEmail, newName, userId)
        .build();

    // Update table 2: users_by_email
    // Note: in reality, changing a primary-key column requires DELETE + INSERT.
    // This is simplified.
    SimpleStatement update2 = SimpleStatement.builder(
            "UPDATE users_by_email SET name = ? WHERE email = ?")
        .addPositionalValues(newName, newEmail)
        .build();

    // Logged Batch ensures all-or-nothing atomicity.
    // BatchStatement is immutable in driver 4.x, so add() returns a new instance.
    BatchStatement batch = BatchStatement.newInstance(BatchType.LOGGED)
        .add(update1)
        .add(update2);

    session.execute(batch);
}
```
```go
func UpdateUser(session *gocql.Session, userID gocql.UUID, newEmail, newName string) error {
	// Logged Batch ensures all-or-nothing atomicity
	batch := session.NewBatch(gocql.LoggedBatch)

	// Update table 1
	batch.Query(
		"UPDATE users_by_id SET email = ?, name = ? WHERE user_id = ?",
		newEmail, newName, userID,
	)

	// Update table 2
	batch.Query(
		"UPDATE users_by_email SET name = ? WHERE email = ?",
		newName, newEmail,
	)

	return session.ExecuteBatch(batch)
}
```
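The simplification noted in the comments matters in practice: `email` is the partition key of `users_by_email`, and Cassandra cannot update a partition-key column in place. A sketch of the full pattern, assuming the application still knows the old email (the values and UUID below are placeholders), moves the row with DELETE + INSERT inside the same logged batch:

```cql
BEGIN BATCH
  UPDATE users_by_id
     SET email = 'new@example.com', name = 'Ada'
   WHERE user_id = f47ac10b-58cc-4372-a567-0e02b2c3d479;

  -- The partition key changed, so move the row instead of updating it:
  DELETE FROM users_by_email
   WHERE email = 'old@example.com';

  INSERT INTO users_by_email (email, user_id, name)
  VALUES ('new@example.com', f47ac10b-58cc-4372-a567-0e02b2c3d479, 'Ada');
APPLY BATCH;
```

This requires reading the old email first, which is one reason primary-key changes are the most expensive kind of denormalized update.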
4. Materialized Views (Warning)
Cassandra has a feature called Materialized Views (MV) that automates this denormalization server-side.
```cql
CREATE MATERIALIZED VIEW users_by_email AS
  SELECT * FROM users_by_id
  WHERE email IS NOT NULL AND user_id IS NOT NULL
  PRIMARY KEY (email, user_id);
```
> [!CAUTION]
> **Performance Hazard: Read-Before-Write.** Materialized Views force Cassandra to perform a Read-Before-Write to ensure consistency when updating the view. This kills write performance. Additionally, MVs have a history of instability. The industry standard is Application-Side Denormalization (using Batches) as shown above.
5. Summary
- Denormalize to achieve O(1) reads.
- Accept Write Amplification as the cost of performance.
- Use Logged Batches to keep tables consistent.
- Avoid Materialized Views in favor of explicit application logic.
In the final chapter, we’ll review everything and provide a cheat sheet.