Partition & Clustering Keys

[!NOTE] This module covers how Cassandra's partition and clustering keys determine where data is stored and how it is sorted, and why the choice of key matters in production.

1. The Primary Key Deconstructed

In Cassandra, the PRIMARY KEY is not just a unique identifier. It is a roadmap that tells the database exactly where to store the data and how to sort it on disk.

The Primary Key is composed of two parts:

PRIMARY KEY ((Partition Key), Clustering Key)

Partition Key: Determines WHERE the data lives (which node owns it).
Clustering Key: Determines HOW the data is sorted (within its partition on that node).

The Library Analogy

Think of Cassandra as a massive library distributed across multiple buildings.

Partition Key: This is the Building or Section.
Example: genre = 'Sci-Fi'. All Sci-Fi books go to Building A.
If you query without this key, you have to search every building (Inefficient!).
Clustering Key: This is the Shelf Order.
Example: author_name. Inside Building A, books are sorted alphabetically by Author.
This makes finding “Asimov” fast once you are in the right building.

2. The Partition Key: Distribution

The Partition Key is responsible for data distribution across the cluster. Cassandra uses Consistent Hashing to map partition keys to “Tokens”.

Hash(PartitionKey) → Token
Token → Node

Every node in the cluster owns a range of tokens. When you write data, Cassandra hashes the partition key and sends the data to the node that owns the range containing that token, plus any additional replicas required by the keyspace's replication factor.
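The two mapping steps above can be sketched in Go. This is a minimal illustration, not Cassandra's implementation: the first 8 bytes of SHA-256 stand in for the Murmur3 token, tokens are unsigned, replication is ignored, and the names Ring, Token, OwnerOf, and NodeFor are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// tokenOwner marks the upper end of a token range and the node that owns it.
type tokenOwner struct {
	token uint64
	node  string
}

// Ring is a list of token owners sorted by token.
type Ring struct {
	owners []tokenOwner
}

// NewRing spaces the nodes evenly around the token space (a hypothetical layout).
// It expects at least one node.
func NewRing(nodes ...string) *Ring {
	step := ^uint64(0) / uint64(len(nodes))
	r := &Ring{}
	for i, n := range nodes {
		r.owners = append(r.owners, tokenOwner{token: step * uint64(i+1), node: n})
	}
	return r
}

// Token hashes a partition key to a token.
// Assumption: SHA-256 is only a stand-in for Cassandra's Murmur3 partitioner.
func Token(partitionKey string) uint64 {
	sum := sha256.Sum256([]byte(partitionKey))
	return binary.BigEndian.Uint64(sum[:8])
}

// OwnerOf returns the node whose range (previous token, own token] contains t.
func (r *Ring) OwnerOf(t uint64) string {
	i := sort.Search(len(r.owners), func(i int) bool { return r.owners[i].token >= t })
	if i == len(r.owners) {
		i = 0 // past the last token: wrap around to the first node on the ring
	}
	return r.owners[i].node
}

// NodeFor chains both steps: Hash(PartitionKey) -> Token -> Node.
func (r *Ring) NodeFor(partitionKey string) string {
	return r.OwnerOf(Token(partitionKey))
}

func main() {
	ring := NewRing("node-1", "node-2", "node-3")
	for _, key := range []string{"sensor-a", "sensor-b", "sensor-c"} {
		fmt.Printf("%s -> token %d -> %s\n", key, Token(key), ring.NodeFor(key))
	}
}
```

The same key always hashes to the same token, so reads and writes for one partition key are routed to the same owner without any central lookup table.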

The Consistent Hash Ring

Visual Warning: Hot Partitions

If you choose a Partition Key with low cardinality (e.g., status = 'ACTIVE'), every row that shares a value hashes to the same token, so massive amounts of data and traffic land on a single node.

[!WARNING] Hot Partition = Cluster Death. If one node receives 90% of the traffic, the entire cluster is bottlenecked by that single node. Always choose a key with high cardinality (e.g., user_id, sensor_id, device_id).
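To make the skew concrete, the sketch below counts how many rows land on each node of a 3-node cluster for a low-cardinality key and for a high-cardinality key. It rests on the same kind of assumptions as any toy model: SHA-256 stands in for Murmur3, each node owns one equal token range, and the helper names (nodeIndex, rowsPerNode, repeatKey, uniqueKeys) are hypothetical.

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

// nodeIndex maps a partition key to one of n equally sized token ranges.
// Assumption: SHA-256 is only a stand-in for Cassandra's Murmur3 partitioner.
func nodeIndex(partitionKey string, n int) int {
	if n <= 1 {
		return 0
	}
	sum := sha256.Sum256([]byte(partitionKey))
	token := binary.BigEndian.Uint64(sum[:8])
	rangeSize := ^uint64(0)/uint64(n) + 1
	return int(token / rangeSize)
}

// rowsPerNode counts how many rows each node receives, one row per key.
func rowsPerNode(keys []string, n int) []int {
	counts := make([]int, n)
	for _, k := range keys {
		counts[nodeIndex(k, n)]++
	}
	return counts
}

// repeatKey models a low-cardinality key: every row has the same value.
func repeatKey(key string, rows int) []string {
	keys := make([]string, rows)
	for i := range keys {
		keys[i] = key
	}
	return keys
}

// uniqueKeys models a high-cardinality key: every row has its own value.
func uniqueKeys(prefix string, rows int) []string {
	keys := make([]string, rows)
	for i := range keys {
		keys[i] = fmt.Sprintf("%s-%d", prefix, i)
	}
	return keys
}

func main() {
	const rows, nodes = 10000, 3
	fmt.Println("status = 'ACTIVE':", rowsPerNode(repeatKey("ACTIVE", rows), nodes))
	fmt.Println("user_id:          ", rowsPerNode(uniqueKeys("user", rows), nodes))
}
```

With the repeated key, one node holds every row and the other two hold none; with unique keys, the rows spread across all three nodes.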

3. Interactive: Token Ring Simulator

Visualize how different keys map to different nodes in a 3-node cluster.

[Interactive widget: enter a Partition Key to see its Murmur3 hash and the node it is assigned to.]

4. The Clustering Key: Sorting

Once the data lands on a node (thanks to the Partition Key), the Clustering Key decides the order in which rows are stored on disk within that partition.

This allows us to perform efficient range queries within a partition.

Example: Time Series Data

We want to store sensor readings and query them by time range: “Give me all temperatures for Sensor A between 10:00 and 11:00”.

CREATE TABLE sensor_readings (
    sensor_id uuid,
    recorded_at timestamp,
    temperature double,
    PRIMARY KEY ((sensor_id), recorded_at)
) WITH CLUSTERING ORDER BY (recorded_at DESC);

Partition Key (sensor_id): All data for “Sensor A” lives in one partition, on one node.
Clustering Key (recorded_at): Data is sorted by time, newest first.
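Why sorted storage makes the range query cheap can be sketched with an in-memory model of a single partition. This is an illustration under stated assumptions, not Cassandra's storage engine: the partition is a slice kept newest first, and the names reading, partition, insert, between, and at are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// reading is one row: the clustering key (recordedAt) plus a regular column.
type reading struct {
	recordedAt  time.Time
	temperature float64
}

// partition holds every row for one sensor_id, kept sorted newest first
// to mirror CLUSTERING ORDER BY (recorded_at DESC).
type partition struct {
	rows []reading
}

// insert places r at its sorted position; a row with the same clustering key is overwritten.
func (p *partition) insert(r reading) {
	i := sort.Search(len(p.rows), func(i int) bool {
		return !p.rows[i].recordedAt.After(r.recordedAt)
	})
	if i < len(p.rows) && p.rows[i].recordedAt.Equal(r.recordedAt) {
		p.rows[i] = r
		return
	}
	p.rows = append(p.rows, reading{})
	copy(p.rows[i+1:], p.rows[i:])
	p.rows[i] = r
}

// between returns the rows with start <= recordedAt <= end.
// Two binary searches find the bounds; the result is one contiguous slice.
func (p *partition) between(start, end time.Time) []reading {
	lo := sort.Search(len(p.rows), func(i int) bool {
		return !p.rows[i].recordedAt.After(end)
	})
	hi := sort.Search(len(p.rows), func(i int) bool {
		return p.rows[i].recordedAt.Before(start)
	})
	if lo >= hi {
		return nil
	}
	return p.rows[lo:hi]
}

// at builds a timestamp on an arbitrary fixed day (hypothetical sample data).
func at(hour, minute int) time.Time {
	return time.Date(2024, time.January, 1, hour, minute, 0, 0, time.UTC)
}

func main() {
	p := &partition{}
	p.insert(reading{at(10, 45), 21.9})
	p.insert(reading{at(9, 30), 20.1})
	p.insert(reading{at(11, 30), 23.0})
	p.insert(reading{at(10, 15), 21.4})

	// "All temperatures for Sensor A between 10:00 and 11:00", newest first.
	for _, r := range p.between(at(10, 0), at(11, 0)) {
		fmt.Println(r.recordedAt.Format("15:04"), r.temperature)
	}
}
```

Because the rows are already in clustering order, the range is a single contiguous slice and no row outside the window is touched.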

Composite Keys

You can have multiple columns in both keys.

PRIMARY KEY ((region, country), year, month)

Partition Key: (region, country) → Data for region “West”, country “US” stays together.
Clustering Key: year, month → Sorted by Year, then by Month.

5. Code Examples

Using the DataStax Java Driver.

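import com.datastax.oss.driver.api.mapper.annotations.ClusteringColumn;
import com.datastax.oss.driver.api.mapper.annotations.CqlName;
import com.datastax.oss.driver.api.mapper.annotations.Entity;
import com.datastax.oss.driver.api.mapper.annotations.PartitionKey;
import java.time.Instant;
import java.util.UUID;

@Entity
@CqlName("sensor_readings")
public class SensorReading {

    // Partition key: decides which node stores the row
    @PartitionKey
    @CqlName("sensor_id")
    private UUID sensorId;

    // First clustering key: sorts rows within the partition and enables range queries
    @ClusteringColumn
    @CqlName("recorded_at")
    private Instant recordedAt;

    private Double temperature;

    // No-arg constructor, getters, and setters omitted for brevity...
}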

// Querying a range
// SimpleStatement s = SimpleStatement.builder("SELECT * FROM sensor_readings WHERE sensor_id = ? AND recorded_at > ? AND recorded_at < ?")
//    .addPositionalValues(id, start, end)
//    .build();

Using gocql.

package main

import (
    "time"

    "github.com/gocql/gocql"
)

type SensorReading struct {
    SensorID    gocql.UUID
    RecordedAt  time.Time
    Temperature float64
}

func GetReadingsInRange(session *gocql.Session, sensorID gocql.UUID, start, end time.Time) ([]SensorReading, error) {
    var readings []SensorReading

    // Range queries are only efficient because 'recorded_at' is a Clustering Key
    iter := session.Query(`
        SELECT sensor_id, recorded_at, temperature
        FROM sensor_readings
        WHERE sensor_id = ? AND recorded_at >= ? AND recorded_at <= ?`,
        sensorID, start, end).Iter()

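    // Scan fills r for each row; append stores a copy of the struct
    var r SensorReading
    for iter.Scan(&r.SensorID, &r.RecordedAt, &r.Temperature) {
        readings = append(readings, r)
    }

    // Close reports any error hit while iterating; avoid returning partial results
    if err := iter.Close(); err != nil {
        return nil, err
    }
    return readings, nil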
}

6. Summary

Partition Key = Routing (Node selection).
Clustering Key = Sorting (Disk order).
Composite Key = Multiple columns in the partition key, the clustering key, or both.

In the next chapter, we will learn how to strategically duplicate data using Denormalization.