Module Review: Data Modeling

This module review covers the essential principles of Cassandra data modeling, including query-driven design, keys, and denormalization.

Key Takeaways

  • Query-First: Always start with your application queries. Map 1 Query → 1 Table.
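  • Partition Key: Determines which node (and its replicas) stores the data. Must have high cardinality, with reads and writes spread evenly across its values, to avoid Hot Partitions.
  • Clustering Key: Determines the sort order of rows within a partition on disk. Enables efficient range queries inside that partition.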
  • Denormalization: Duplicating data is necessary to achieve fast reads.
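  • Write Amplification: Cassandra has no server-side JOINs, so the same data is written to several tables. Those extra writes are cheaper than combining data from multiple tables at read time.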
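
As a rough sketch of the query-first and denormalization points above, the same data can be stored in two tables, one per query. The table and column names here (videos_by_id, videos_by_uploader, and so on) are hypothetical and not part of this module's schema.

```cql
-- Hypothetical example: one table per query, with the data duplicated.

-- Query 1: fetch a single video by its id.
CREATE TABLE videos_by_id (
    video_id uuid,
    title text,
    uploader_id uuid,
    uploaded_at timestamp,
    PRIMARY KEY (video_id)
);

-- Query 2: list one uploader's videos, newest first.
CREATE TABLE videos_by_uploader (
    uploader_id uuid,
    uploaded_at timestamp,
    video_id uuid,
    title text,
    PRIMARY KEY ((uploader_id), uploaded_at, video_id)
) WITH CLUSTERING ORDER BY (uploaded_at DESC, video_id ASC);

-- Each query reads exactly one partition of exactly one table.
SELECT title, uploader_id
FROM videos_by_id
WHERE video_id = 123e4567-e89b-12d3-a456-426614174000;

SELECT video_id, title
FROM videos_by_uploader
WHERE uploader_id = 9b2f1c3a-5d4e-4f6a-8b7c-0d1e2f3a4b5c
LIMIT 20;
```

The cost is that every new video must be written to both tables, which is the write amplification described above.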

Cheat Sheet

Primary Key Syntax

Syntax                  | Partition Key | Clustering Key
------------------------|---------------|---------------
PRIMARY KEY (a)         | a             | None
PRIMARY KEY (a, b)      | a             | b
PRIMARY KEY ((a, b), c) | a, b          | c
PRIMARY KEY ((a), b, c) | a             | b, c
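
To make the PRIMARY KEY ((a, b), c) form concrete, here is a minimal sketch with a hypothetical table (orders_by_customer_month is an invented name, and using a timestamp as the only clustering column is a simplification). It shows which WHERE clauses the key layout supports.

```cql
-- Hypothetical table: composite partition key (customer_id, order_month),
-- clustering key placed_at.
CREATE TABLE orders_by_customer_month (
    customer_id uuid,
    order_month text,        -- e.g. '2024-05'
    placed_at timestamp,
    total decimal,
    PRIMARY KEY ((customer_id, order_month), placed_at)
);

-- Supported: every partition key column is restricted by equality.
SELECT placed_at, total
FROM orders_by_customer_month
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
  AND order_month = '2024-05';

-- Supported: full partition key plus a range on the clustering key.
SELECT placed_at, total
FROM orders_by_customer_month
WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
  AND order_month = '2024-05'
  AND placed_at >= '2024-05-10 00:00:00+0000'
  AND placed_at <  '2024-05-11 00:00:00+0000';

-- Not supported as written: only part of the partition key is given, so
-- Cassandra cannot locate a single partition and rejects the query
-- (unless filtering is explicitly allowed, which scans many partitions).
-- SELECT placed_at, total
-- FROM orders_by_customer_month
-- WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000;
```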

Modeling Do’s and Don’ts

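Do                                                     | Don’t
-------------------------------------------------------|------------------------------------------------------------
Start with queries                                     | ❌ Start with tables
Duplicate data across tables                           | ❌ Rely on client-side JOINs
Choose a high-cardinality partition key                | ❌ Choose a low-cardinality partition key (e.g., a boolean)
Use logged batches to keep denormalized tables in sync | ❌ Use batches for bulk loading
Let the clustering key order the results               | ❌ Sort results on the client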
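
The batch guideline can be illustrated with a small sketch. The two tables below (users_by_id, users_by_email) are hypothetical; the point is that a logged batch (the default for BEGIN BATCH) is meant for keeping a few duplicated rows consistent, not for speeding up large loads.

```cql
-- Hypothetical denormalized pair: the same user stored under two keys.
CREATE TABLE users_by_id (
    user_id uuid PRIMARY KEY,
    email text,
    name text
);

CREATE TABLE users_by_email (
    email text PRIMARY KEY,
    user_id uuid,
    name text
);

-- Logged batch: either both inserts are eventually applied or neither is,
-- so the two tables do not drift apart after a partial failure.
BEGIN BATCH
    INSERT INTO users_by_id (user_id, email, name)
    VALUES (123e4567-e89b-12d3-a456-426614174000, 'ada@example.com', 'Ada');

    INSERT INTO users_by_email (email, user_id, name)
    VALUES ('ada@example.com', 123e4567-e89b-12d3-a456-426614174000, 'Ada');
APPLY BATCH;
```

For bulk loading, many independent rows packed into one batch put extra work on the coordinator node, which is why the table above lists it under Don’t.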

Practice Scenario

Task: Design a schema for an “IoT Sensor Network”.

  1. We have thousands of sensors.
  2. We need to see the latest temperature for a specific sensor.
  3. We need to see all temperature readings for a specific sensor for a specific day.

Solution:

CREATE TABLE sensor_readings_by_day (
    sensor_id uuid,
    date date,
    recorded_at timestamp,
    temperature decimal,
    PRIMARY KEY ((sensor_id, date), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
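  • Partition Key: (sensor_id, date) - Ensures that a single partition doesn’t grow indefinitely. Each day is a new partition, so all readings for one sensor on one day are read from a single partition.
  • Clustering Key: recorded_at - Sorts readings within a partition by time. Because of CLUSTERING ORDER BY (recorded_at DESC), the newest reading is stored first, so the latest temperature for a sensor is the first row of the current day’s partition.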
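
The two required queries might look like the sketch below. The sensor id and dates are made-up sample values, and the "latest reading" query assumes the application supplies the current date and that the sensor has already reported on that day; if it has not, the application would need to try the previous day's partition.

```cql
-- Sample write (illustrative values only).
INSERT INTO sensor_readings_by_day (sensor_id, date, recorded_at, temperature)
VALUES (123e4567-e89b-12d3-a456-426614174000, '2024-05-10',
        '2024-05-10 08:15:00+0000', 21.5);

-- Requirement 2: latest temperature for a specific sensor.
-- Rows are stored newest first, so LIMIT 1 returns the most recent reading
-- in the given day's partition.
SELECT recorded_at, temperature
FROM sensor_readings_by_day
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND date = '2024-05-10'
LIMIT 1;

-- Requirement 3: all readings for a specific sensor on a specific day.
-- This reads exactly one partition.
SELECT recorded_at, temperature
FROM sensor_readings_by_day
WHERE sensor_id = 123e4567-e89b-12d3-a456-426614174000
  AND date = '2024-05-10';
```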