Module Review: Data Modeling

Key Takeaways

  • Query-First: Always start with your application queries. Map 1 Query → 1 Table.
  • Partition Key: Determines which node stores the data. Must have high cardinality to avoid Hot Partitions.
  • Clustering Key: Determines the sort order of data on disk. Enables efficient range queries.
  • Denormalization: Duplicating data is necessary to achieve fast reads.
  • Write Amplification: Writing to multiple tables is cheaper than doing distributed JOINs.

Flashcards

What determines data distribution?

Tap to reveal

The Partition Key

Cassandra hashes the Partition Key to a Token, which determines which node stores the data.

What is the purpose of the Clustering Key?

Tap to reveal

Sorting & Range Queries

It sorts data within a partition, allowing for efficient range scans (e.g., time-series data).

Why do we Denormalize?

Tap to reveal

To Optimize Reads

We accept slower writes (Write Amplification) to achieve O(1) read latency without JOINs.

What is a Hot Partition?

Tap to reveal

Uneven Data Distribution

Caused by a low-cardinality Partition Key, sending too much traffic to a single node.

How do we ensure consistency across tables?

Tap to reveal

Logged Batches

Logged Batches guarantee atomicity (all-or-nothing) when updating multiple tables for denormalization.

Cheat Sheet

Primary Key Syntax

Syntax Partition Key Clustering Key
PRIMARY KEY (a) a None
PRIMARY KEY (a, b) a b
PRIMARY KEY ((a, b), c) a, b c
PRIMARY KEY ((a), b, c) a b, c

Modeling Do’s and Don’ts

Do Don’t
Start with Queries ❌ Start with Tables
Duplicate Data ❌ Use client-side JOINs
High Cardinality PK ❌ Low Cardinality PK (e.g., Boolean)
Use Batches for Sync ❌ Use Batches for Bulk Load
Order by Clustering Key ❌ Order by client-side sorting

Practice Scenario

Task: Design a schema for a “IoT Sensor Network”.

  1. We have thousands of sensors.
  2. We need to see the latest temperature for a specific sensor.
  3. We need to see all temperature readings for a specific sensor for a specific day.

Solution:

CREATE TABLE sensor_readings_by_day (
    sensor_id uuid,
    date date,
    recorded_at timestamp,
    temperature decimal,
    PRIMARY KEY ((sensor_id, date), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
  • Partition Key: (sensor_id, date) - Ensures that a single partition doesn’t grow indefinitely. Each day is a new partition.
  • Clustering Key: recorded_at - Sorts readings chronologically.