Module Review: Data Modeling
Key Takeaways
- Query-First: Always start with your application queries. Map 1 Query → 1 Table.
- Partition Key: Determines which node stores the data. Must have high cardinality to avoid Hot Partitions.
- Clustering Key: Determines the sort order of data on disk. Enables efficient range queries.
- Denormalization: Duplicating data is necessary to achieve fast reads.
- Write Amplification: Writing to multiple tables is cheaper than doing distributed JOINs.
Flashcards
What determines data distribution?
Tap to reveal
The Partition Key
Cassandra hashes the Partition Key to a Token, which determines which node stores the data.
What is the purpose of the Clustering Key?
Tap to reveal
Sorting & Range Queries
It sorts data within a partition, allowing for efficient range scans (e.g., time-series data).
Why do we Denormalize?
Tap to reveal
To Optimize Reads
We accept slower writes (Write Amplification) to achieve O(1) read latency without JOINs.
What is a Hot Partition?
Tap to reveal
Uneven Data Distribution
Caused by a low-cardinality Partition Key, sending too much traffic to a single node.
How do we ensure consistency across tables?
Tap to reveal
Logged Batches
Logged Batches guarantee atomicity (all-or-nothing) when updating multiple tables for denormalization.
Cheat Sheet
Primary Key Syntax
| Syntax | Partition Key | Clustering Key |
|---|---|---|
PRIMARY KEY (a) |
a |
None |
PRIMARY KEY (a, b) |
a |
b |
PRIMARY KEY ((a, b), c) |
a, b |
c |
PRIMARY KEY ((a), b, c) |
a |
b, c |
Modeling Do’s and Don’ts
| Do | Don’t |
|---|---|
| ✅ Start with Queries | ❌ Start with Tables |
| ✅ Duplicate Data | ❌ Use client-side JOINs |
| ✅ High Cardinality PK | ❌ Low Cardinality PK (e.g., Boolean) |
| ✅ Use Batches for Sync | ❌ Use Batches for Bulk Load |
| ✅ Order by Clustering Key | ❌ Order by client-side sorting |
Practice Scenario
Task: Design a schema for a “IoT Sensor Network”.
- We have thousands of sensors.
- We need to see the latest temperature for a specific sensor.
- We need to see all temperature readings for a specific sensor for a specific day.
Solution:
CREATE TABLE sensor_readings_by_day (
sensor_id uuid,
date date,
recorded_at timestamp,
temperature decimal,
PRIMARY KEY ((sensor_id, date), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
- Partition Key:
(sensor_id, date)- Ensures that a single partition doesn’t grow indefinitely. Each day is a new partition. - Clustering Key:
recorded_at- Sorts readings chronologically.