Partition & Clustering Keys
[!NOTE] This module covers how partition keys and clustering keys determine where Cassandra stores data and how it sorts that data on disk.
1. The Primary Key Deconstructed
In Cassandra, the PRIMARY KEY is not just a unique identifier. It is a roadmap that tells the database exactly where to store the data and how to sort it on disk.
The Primary Key is composed of two parts:
PRIMARY KEY ((Partition Key), Clustering Key)
- Partition Key: Determines WHERE the data lives (which node).
- Clustering Key: Determines HOW the data is sorted (on that node).
The Library Analogy
Think of Cassandra as a massive library distributed across multiple buildings.
- Partition Key: This is the Building or Section.
  - Example: genre = 'Sci-Fi'. All Sci-Fi books go to Building A.
  - If you query without this key, you have to search every building (inefficient!).
- Clustering Key: This is the Shelf Order.
  - Example: author_name. Inside Building A, books are sorted alphabetically by author.
  - This makes finding “Asimov” instant once you are in the right building.
2. The Partition Key: Distribution
The Partition Key is responsible for data distribution across the cluster. Cassandra uses Consistent Hashing to map partition keys to “Tokens”.
- Hash(PartitionKey) → Token
- Token → Node
Every node in the cluster owns a range of tokens. When you write data, Cassandra hashes the partition key and sends the data to the node that owns that token range.
The Consistent Hash Ring
Visual Warning: Hot Partitions
If you choose a Partition Key with low cardinality (e.g., status = 'ACTIVE'), every row with that value hashes to the same token, so massive amounts of data and traffic land on a single node.
[!WARNING] Hot Partition = Cluster Death. If one node receives 90% of the traffic, the entire cluster is bottlenecked by that single node. Always choose a key with high cardinality (e.g., user_id, sensor_id, device_id).
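The effect of cardinality on distribution can be illustrated with a small simulation. This is a rough sketch, not Cassandra's actual routing: `nodeFor` and `distribute` are hypothetical helpers, hash-modulo stands in for token ranges, and FNV-1a stands in for the Murmur3 hash used by Cassandra's default partitioner.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeFor maps a partition key to one of n nodes. Hash-modulo is a
// simplification of token ranges, and FNV-1a is a stand-in for Murmur3.
func nodeFor(partitionKey string, n int) int {
	h := fnv.New32a()
	h.Write([]byte(partitionKey))
	return int(h.Sum32() % uint32(n))
}

// distribute counts how many writes each node receives for the given keys.
func distribute(keys []string, n int) []int {
	counts := make([]int, n)
	for _, k := range keys {
		counts[nodeFor(k, n)]++
	}
	return counts
}

func main() {
	const writes, nodes = 9000, 3
	byStatus := make([]string, writes) // low cardinality: a single value
	byUser := make([]string, writes)   // high cardinality: one value per user
	for i := range byStatus {
		byStatus[i] = "ACTIVE"
		byUser[i] = fmt.Sprintf("user-%d", i)
	}
	fmt.Println("status key: ", distribute(byStatus, nodes))
	fmt.Println("user_id key:", distribute(byUser, nodes))
}
```

With the status key, all 9000 writes are counted against one node, because identical keys always hash to the same place. With the user_id key, the counts spread across the three nodes.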
3. Interactive: Token Ring Simulator
Visualize how different keys map to different nodes in a 3-node cluster.
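The hash-and-route step for a 3-node cluster might be sketched as follows. The `Ring` type, the node names, and the token boundaries are made up for illustration. FNV-1a over a 32-bit space is an assumption chosen to keep the example in the standard library; Cassandra's default partitioner uses Murmur3 with 64-bit tokens.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// ringEntry marks the upper bound of a token range: the node owns every
// token up to and including Token (hypothetical type, illustration only).
type ringEntry struct {
	Token uint32
	Node  string
}

// Ring is a list of entries sorted by ascending token.
type Ring []ringEntry

// tokenFor hashes a partition key to a token. FNV-1a is a stand-in here.
func tokenFor(partitionKey string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(partitionKey))
	return h.Sum32()
}

// NodeFor returns the node owning the first token >= the key's token,
// wrapping to the first entry when the token is past the last boundary.
func (r Ring) NodeFor(partitionKey string) string {
	t := tokenFor(partitionKey)
	i := sort.Search(len(r), func(i int) bool { return r[i].Token >= t })
	if i == len(r) {
		i = 0 // wrap around the ring
	}
	return r[i].Node
}

func main() {
	// Three nodes splitting the 32-bit token space into equal ranges.
	ring := Ring{
		{Token: 1431655765, Node: "node-a"},
		{Token: 2863311530, Node: "node-b"},
		{Token: 4294967295, Node: "node-c"},
	}
	for _, key := range []string{"sensor-1", "sensor-2", "sensor-3", "sensor-4"} {
		fmt.Printf("%-9s token=%10d -> %s\n", key, tokenFor(key), ring.NodeFor(key))
	}
}
```

The same key always produces the same token, so reads for a partition are routed to the same node that received its writes.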
4. The Clustering Key: Sorting
Once the data lands on a node (thanks to the Partition Key), the Clustering Key decides how it is stored on the disk.
This allows us to perform efficient range queries within a partition.
Example: Time Series Data
We want to store sensor readings and query them by time range: “Give me all temperatures for Sensor A between 10:00 and 11:00”.
CREATE TABLE sensor_readings (
    sensor_id uuid,
    recorded_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);
- Partition Key (sensor_id): All data for “Sensor A” lives in a single partition on one node (and its replicas).
- Clustering Key (recorded_at): Data is sorted by time, newest first.
Composite Keys
You can have multiple columns in both keys.
PRIMARY KEY ((region, country), year, month)
- Partition Key: (region, country) → Data for “West, US” stays together. A query must supply both columns to locate the partition.
- Clustering Key: year, month → Sorted by year, then by month.
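The grouping and ordering implied by this key can be modeled in a few lines. This is an illustrative sketch only: the `row` type, `partitionID`, and `less` are hypothetical names, and joining the columns with a separator is just a stand-in for how the partition key columns are combined before hashing.

```go
package main

import (
	"fmt"
	"sort"
)

// row models one record of a table keyed by ((region, country), year, month).
type row struct {
	Region, Country string // composite partition key
	Year, Month     int    // clustering columns
}

// partitionID combines both partition key columns; neither alone identifies
// the partition.
func partitionID(r row) string { return r.Region + "|" + r.Country }

// less orders rows inside a partition by year first, then by month.
func less(a, b row) bool {
	if a.Year != b.Year {
		return a.Year < b.Year
	}
	return a.Month < b.Month
}

func main() {
	rows := []row{
		{"West", "US", 2024, 3},
		{"West", "US", 2023, 11},
		{"East", "US", 2024, 1},
		{"West", "US", 2024, 1},
	}
	// Group rows by partition, then sort each partition by its clustering columns.
	partitions := map[string][]row{}
	for _, r := range rows {
		id := partitionID(r)
		partitions[id] = append(partitions[id], r)
	}
	for id, p := range partitions {
		sort.Slice(p, func(i, j int) bool { return less(p[i], p[j]) })
		fmt.Println(id, p)
	}
}
```

Rows for "West|US" end up in one group ordered 2023-11, 2024-01, 2024-03, while the "East|US" row forms a separate partition.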
5. Code Examples
Using the DataStax Java Driver.
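import com.datastax.oss.driver.api.core.cql.SimpleStatement;
import com.datastax.oss.driver.api.mapper.annotations.ClusteringColumn;
import com.datastax.oss.driver.api.mapper.annotations.CqlName;
import com.datastax.oss.driver.api.mapper.annotations.Entity;
import com.datastax.oss.driver.api.mapper.annotations.PartitionKey;
import java.time.Instant;
import java.util.UUID;

@Entity
@CqlName("sensor_readings")
public class SensorReading {
    @PartitionKey
    @CqlName("sensor_id")
    private UUID sensorId;
    // First clustering column, enables range queries within the partition
    @ClusteringColumn
    @CqlName("recorded_at")
    private Instant recordedAt;
    private Double temperature;
    // Constructors, getters, setters...

    // Querying a range within one partition (both bounds exclusive)
    public static SimpleStatement rangeQuery(UUID id, Instant start, Instant end) {
        return SimpleStatement.builder(
                "SELECT * FROM sensor_readings WHERE sensor_id = ? AND recorded_at > ? AND recorded_at < ?")
            .addPositionalValues(id, start, end)
            .build();
    }
}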
Using gocql.
package main

import (
	"time"

	"github.com/gocql/gocql"
)

// SensorReading mirrors one row of the sensor_readings table.
type SensorReading struct {
	SensorID    gocql.UUID
	RecordedAt  time.Time
	Temperature float64
}

// GetReadingsInRange returns one sensor's readings with start <= recorded_at <= end.
func GetReadingsInRange(session *gocql.Session, sensorID gocql.UUID, start, end time.Time) ([]SensorReading, error) {
	var readings []SensorReading
	// Range queries are only efficient because 'recorded_at' is a Clustering Key
	iter := session.Query(`
		SELECT sensor_id, recorded_at, temperature
		FROM sensor_readings
		WHERE sensor_id = ? AND recorded_at >= ? AND recorded_at <= ?`,
		sensorID, start, end).Iter()

	var r SensorReading
	for iter.Scan(&r.SensorID, &r.RecordedAt, &r.Temperature) {
		readings = append(readings, r)
	}
	// Close reports any error encountered while iterating.
	return readings, iter.Close()
}
6. Summary
- Partition Key = Routing (Node selection).
- Clustering Key = Sorting (Disk order).
- Composite Key = Multiple columns in the partition key, the clustering key, or both; together they form the unique row key.
In the next chapter, we will learn how to strategically duplicate data using Denormalization.