Keyspaces & Tables: The Blueprint
In the world of distributed databases, your schema is not just a definition of data types: it is a deployment strategy. Unlike an RDBMS, where tables are merely logical groupings, in Cassandra your schema dictates exactly where data lives on the physical cluster and how efficiently it can be retrieved.
This chapter covers the two fundamental building blocks: Keyspaces (the container) and Tables (the data structure).
1. The Keyspace: Your Replication Container
[!TIP] Think of a Keyspace like a “Database” in PostgreSQL or MySQL, but with one superpower: it controls Replication.
A Keyspace defines the scope for a set of tables. Crucially, it defines the Replication Strategy—the algorithm that determines which nodes hold copies of your data.
Replication Strategies
There are two strategies you must know. One for your laptop, and one for production.
| Strategy | Usage | Description |
|---|---|---|
| SimpleStrategy | Development Only | Places replicas on the next N nodes in the ring clockwise. Does not understand racks or data centers. |
| NetworkTopologyStrategy | Production | Rack-aware and DC-aware. Allows you to specify different replication factors for different data centers (e.g., RF=3 in US-East, RF=3 in EU-West). |
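The clockwise placement rule in the table above can be sketched in a few lines of Go. This is a toy model, not the driver's token-map API: the node names and ring are made up, and the real partitioner first hashes the partition key to a token to pick the owning node.

```go
package main

import "fmt"

// replicasSimple sketches SimpleStrategy placement: starting from the node
// that owns the partition's token, walk the ring clockwise and take the
// next rf nodes. No rack or data center awareness, hence "dev only".
func replicasSimple(ring []string, ownerIdx, rf int) []string {
	replicas := make([]string, 0, rf)
	for i := 0; i < rf && i < len(ring); i++ {
		replicas = append(replicas, ring[(ownerIdx+i)%len(ring)])
	}
	return replicas
}

func main() {
	ring := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
	// A partition whose token lands on n5, with RF=3, is copied to n5, n6, n1.
	fmt.Println(replicasSimple(ring, 4, 3)) // [n5 n6 n1]
}
```

Note how the walk wraps around the end of the ring: with RF=3 and an owner near the "end", the last replica lands back at the first node.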
CQL Definition
-- Production Keyspace
CREATE KEYSPACE ecommerce WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

-- Development Keyspace
CREATE KEYSPACE dev_test WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};
2. The Table: Partitioning & Sorting
In SQL, a Primary Key mostly just means “unique”. In Cassandra, the Primary Key defines the physical data model. It has two parts:
The Partition Key (The “Where”)
The first part of the Primary Key is the Partition Key.
- Role: Determines which node holds the data.
- Goal: Even distribution.
- Bad Example: `status` (low cardinality, creates hotspots).
- Good Examples: `user_id`, `device_id`, `sensor_id`.
The Clustering Key (The “Order”)
Everything after the Partition Key is a Clustering Key.
- Role: Sorts data within the partition on disk.
- Goal: Fast range queries (e.g., “Give me orders for User X between Jan 1st and Jan 31st”).
- Analogy: The Partition Key finds the filing cabinet (Node). The Clustering Key sorts the folders inside the cabinet.
Schema Definition
CREATE TABLE sensor_data (
    sensor_id uuid,       -- partition key: which node stores the row
    timestamp timestamp,  -- clustering key: sort order within the partition
    temperature double,
    PRIMARY KEY ((sensor_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
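A minimal Go sketch of this storage model (plain structs and a sort, not the actual SSTable format): rows are grouped by `sensor_id`, then kept sorted by `timestamp` descending inside each partition, mirroring the CLUSTERING ORDER BY above.

```go
package main

import (
	"fmt"
	"sort"
)

type row struct {
	sensorID string  // partition key: picks the "filing cabinet" (node)
	ts       int64   // clustering key: sort order inside the partition
	temp     float64
}

// layout groups rows by partition key and sorts each partition by the
// clustering key, descending (newest first).
func layout(rows []row) map[string][]row {
	parts := map[string][]row{}
	for _, r := range rows {
		parts[r.sensorID] = append(parts[r.sensorID], r)
	}
	for id := range parts {
		p := parts[id]
		sort.Slice(p, func(i, j int) bool { return p[i].ts > p[j].ts })
	}
	return parts
}

func main() {
	parts := layout([]row{
		{"s1", 100, 21.5}, {"s1", 300, 22.0}, {"s1", 200, 21.8},
	})
	// Newest reading first: "latest value" queries read the partition head,
	// and a time-range query is one contiguous scan within the partition.
	fmt.Println(parts["s1"][0].ts) // 300
}
```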
3. Implementation: Java & Go
How to create schema programmatically.
Java (DataStax Driver)
import com.datastax.oss.driver.api.core.CqlSession;

public class SchemaManager {

    public void createSchema(CqlSession session) {
        // 1. Create Keyspace
        String createKeyspace = "CREATE KEYSPACE IF NOT EXISTS inventory " +
                "WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}";
        session.execute(createKeyspace);

        // 2. Create Table
        // Partition Key: product_id
        // Clustering Key: warehouse_id (to query stock per warehouse)
        String createTable = "CREATE TABLE IF NOT EXISTS inventory.stock (" +
                "product_id uuid, " +
                "warehouse_id uuid, " +
                "quantity int, " +
                "last_updated timestamp, " +
                "PRIMARY KEY ((product_id), warehouse_id))";
        session.execute(createTable);
    }
}
Go (Gocql)
package main

import (
    "log"

    "github.com/gocql/gocql"
)

func createSchema(session *gocql.Session) {
    // 1. Create Keyspace
    err := session.Query(`
        CREATE KEYSPACE IF NOT EXISTS inventory
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
    `).Exec()
    if err != nil {
        log.Fatal(err)
    }

    // 2. Create Table
    // Partition Key: product_id
    // Clustering Key: warehouse_id
    err = session.Query(`
        CREATE TABLE IF NOT EXISTS inventory.stock (
            product_id uuid,
            warehouse_id uuid,
            quantity int,
            last_updated timestamp,
            PRIMARY KEY ((product_id), warehouse_id)
        )
    `).Exec()
    if err != nil {
        log.Fatal(err)
    }
}
4. Best Practices
[!WARNING] Partition Sizing: Keep partitions under 100MB. If a partition gets too large (wide rows), Cassandra struggles to compact and read it.
- Query First Design: Don’t model your data like entities (User, Product). Model it based on your queries (Select User By Email, Select User By ID).
- Minimize Indexes: Secondary indexes in Cassandra are often performance killers. Prefer separate query tables (Materialized Views can also serve, but come with their own operational caveats).
- Use UUIDs: Avoid auto-incrementing integers. They require coordination (locks), which distributed systems hate. Use `uuid` or `timeuuid`.
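To see why no coordination is needed, here is a stdlib-only Go sketch that assembles an RFC 4122 version-4 UUID from locally generated random bytes. In practice you would use your driver's helpers instead (e.g. gocql's `TimeUUID()` when time-ordered IDs matter); this is only to show that each node can mint IDs independently.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 builds a version-4 UUID: 122 random bits plus fixed version
// and variant bits. No counter, no lock, no cross-node round trip.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // e.g. 8-4-4-4-12 hex digits, unique without any lock
}
```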