Module Review: Data Modeling

Key Takeaways

Query-First: Always start with your application queries. Map 1 Query → 1 Table.
Partition Key: Determines which node stores the data. Must have high cardinality to avoid Hot Partitions.
Clustering Key: Determines the sort order of data on disk. Enables efficient range queries.
Denormalization: Duplicating data is necessary to achieve fast reads.
Write Amplification: Writing to multiple tables is cheaper than doing distributed JOINs.

Flashcards

What determines data distribution?

Tap to reveal

The Partition Key

Cassandra hashes the Partition Key to a Token, which determines which node stores the data.

What is the purpose of the Clustering Key?

Tap to reveal

Sorting & Range Queries

It sorts data within a partition, allowing for efficient range scans (e.g., time-series data).

Why do we Denormalize?

Tap to reveal

To Optimize Reads

We accept slower writes (Write Amplification) to achieve O(1) read latency without JOINs.

What is a Hot Partition?

Tap to reveal

Uneven Data Distribution

Caused by a low-cardinality Partition Key, sending too much traffic to a single node.

How do we ensure consistency across tables?

Tap to reveal

Logged Batches

Logged Batches guarantee atomicity (all-or-nothing) when updating multiple tables for denormalization.

Cheat Sheet

Primary Key Syntax

Syntax	Partition Key	Clustering Key
`PRIMARY KEY (a)`	`a`	None
`PRIMARY KEY (a, b)`	`a`	`b`
`PRIMARY KEY ((a, b), c)`	`a`, `b`	`c`
`PRIMARY KEY ((a), b, c)`	`a`	`b`, `c`

Modeling Do’s and Don’ts

Do	Don’t
✅ Start with Queries	❌ Start with Tables
✅ Duplicate Data	❌ Use client-side JOINs
✅ High Cardinality PK	❌ Low Cardinality PK (e.g., Boolean)
✅ Use Batches for Sync	❌ Use Batches for Bulk Load
✅ Order by Clustering Key	❌ Order by client-side sorting

Practice Scenario

Task: Design a schema for a “IoT Sensor Network”.

We have thousands of sensors.
We need to see the latest temperature for a specific sensor.
We need to see all temperature readings for a specific sensor for a specific day.

Solution:

CREATE TABLE sensor_readings_by_day (
    sensor_id uuid,
    date date,
    recorded_at timestamp,
    temperature decimal,
    PRIMARY KEY ((sensor_id, date), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

Partition Key: (sensor_id, date) - Ensures that a single partition doesn’t grow indefinitely. Each day is a new partition.
Clustering Key: recorded_at - Sorts readings chronologically.