Keyspaces & Tables: The Blueprint
In the world of distributed databases, your schema is not just a definition of data types: it is a deployment strategy. Unlike an RDBMS, where tables are merely logical groupings, in Cassandra your schema dictates exactly where data lives on the physical cluster and how efficiently it can be retrieved.
This chapter covers the two fundamental building blocks: Keyspaces (the container) and Tables (the data structure).
1. The Keyspace: Your Replication Container
[!TIP] Think of a Keyspace like a “Database” in PostgreSQL or MySQL, but with one superpower: it controls Replication.
A Keyspace defines the scope for a set of tables. Crucially, it defines the Replication Strategy—the algorithm that determines which nodes hold copies of your data.
Replication Strategies
There are two strategies you must know. One for your laptop, and one for production.
| Strategy | Usage | Description |
|---|---|---|
| SimpleStrategy | Development Only | Places replicas on the next N nodes in the ring clockwise. Does not understand racks or data centers. |
| NetworkTopologyStrategy | Production | Rack-aware and DC-aware. Allows you to specify different replication factors for different data centers (e.g., RF=3 in US-East, RF=3 in EU-West). |
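The clockwise placement rule in the table above can be sketched in a few lines of Go. This is a toy model, not the driver's token-map API: the node names and ring are made up, and the real partitioner first hashes the partition key to a token to pick the owning node.

```go
package main

import "fmt"

// replicasSimple sketches SimpleStrategy placement: starting from the node
// that owns the partition's token, walk the ring clockwise and take the
// next rf nodes. No rack or data center awareness, hence "dev only".
func replicasSimple(ring []string, ownerIdx, rf int) []string {
	replicas := make([]string, 0, rf)
	for i := 0; i < rf && i < len(ring); i++ {
		replicas = append(replicas, ring[(ownerIdx+i)%len(ring)])
	}
	return replicas
}

func main() {
	ring := []string{"n1", "n2", "n3", "n4", "n5", "n6"}
	// A partition whose token lands on n5, with RF=3, is copied to n5, n6, n1.
	fmt.Println(replicasSimple(ring, 4, 3)) // [n5 n6 n1]
}
```

Note how the walk wraps around the end of the ring: with RF=3 and an owner near the "end", the last replica lands back at the first node.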
CQL Definition
-- Production Keyspace
CREATE KEYSPACE ecommerce WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'eu-west': 3
};

-- Development Keyspace
CREATE KEYSPACE dev_test WITH replication = {
    'class': 'SimpleStrategy',
    'replication_factor': 1
};
2. The Table: Partitioning & Sorting
In SQL, a Primary Key mostly just means “unique”. In Cassandra, the Primary Key defines the physical data model. It has two parts:
The Partition Key (The “Where”)
The first part of the Primary Key is the Partition Key.
- Role: Determines which node holds the data.
- Goal: Even distribution.
- Bad Example: `status` (low cardinality, creates hotspots).
- Good Examples: `user_id`, `device_id`, `sensor_id`.
The Clustering Key (The “Order”)
Everything after the Partition Key is a Clustering Key.
- Role: Sorts data within the partition on disk.
- Goal: Fast range queries (e.g., “Give me orders for User X between Jan 1st and Jan 31st”).
- Analogy: The Partition Key finds the filing cabinet (Node). The Clustering Key sorts the folders inside the cabinet.
Schema Definition
CREATE TABLE sensor_data (
    sensor_id uuid,       -- partition key: which node stores the row
    timestamp timestamp,  -- clustering key: sort order within the partition
    temperature double,
    PRIMARY KEY ((sensor_id), timestamp)
) WITH CLUSTERING ORDER BY (timestamp DESC);
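A minimal Go sketch of this storage model (plain structs and a sort, not the actual SSTable format): rows are grouped by `sensor_id`, then kept sorted by `timestamp` descending inside each partition, mirroring the CLUSTERING ORDER BY above.

```go
package main

import (
	"fmt"
	"sort"
)

type row struct {
	sensorID string  // partition key: picks the "filing cabinet" (node)
	ts       int64   // clustering key: sort order inside the partition
	temp     float64
}

// layout groups rows by partition key and sorts each partition by the
// clustering key, descending (newest first).
func layout(rows []row) map[string][]row {
	parts := map[string][]row{}
	for _, r := range rows {
		parts[r.sensorID] = append(parts[r.sensorID], r)
	}
	for id := range parts {
		p := parts[id]
		sort.Slice(p, func(i, j int) bool { return p[i].ts > p[j].ts })
	}
	return parts
}

func main() {
	parts := layout([]row{
		{"s1", 100, 21.5}, {"s1", 300, 22.0}, {"s1", 200, 21.8},
	})
	// Newest reading first: "latest value" queries read the partition head,
	// and a time-range query is one contiguous scan within the partition.
	fmt.Println(parts["s1"][0].ts) // 300
}
```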
3. Implementation: Java & Go
How to create schema programmatically.
Java (DataStax Driver)
import com.datastax.oss.driver.api.core.CqlSession;

public class SchemaManager {

    public void createSchema(CqlSession session) {
        // 1. Create Keyspace
        String createKeyspace = "CREATE KEYSPACE IF NOT EXISTS inventory " +
                "WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}";
        session.execute(createKeyspace);

        // 2. Create Table
        // Partition Key: product_id
        // Clustering Key: warehouse_id (to query stock per warehouse)
        String createTable = "CREATE TABLE IF NOT EXISTS inventory.stock (" +
                "product_id uuid, " +
                "warehouse_id uuid, " +
                "quantity int, " +
                "last_updated timestamp, " +
                "PRIMARY KEY ((product_id), warehouse_id))";
        session.execute(createTable);
    }
}
Go (Gocql)
package main

import (
    "log"

    "github.com/gocql/gocql"
)

func createSchema(session *gocql.Session) {
    // 1. Create Keyspace
    err := session.Query(`
        CREATE KEYSPACE IF NOT EXISTS inventory
        WITH replication = {'class': 'NetworkTopologyStrategy', 'us-east': 3}
    `).Exec()
    if err != nil {
        log.Fatal(err)
    }

    // 2. Create Table
    // Partition Key: product_id
    // Clustering Key: warehouse_id
    err = session.Query(`
        CREATE TABLE IF NOT EXISTS inventory.stock (
            product_id uuid,
            warehouse_id uuid,
            quantity int,
            last_updated timestamp,
            PRIMARY KEY ((product_id), warehouse_id)
        )
    `).Exec()
    if err != nil {
        log.Fatal(err)
    }
}
4. Best Practices
[!WARNING] Partition Sizing: Keep partitions under 100MB. If a partition gets too large (wide rows), Cassandra struggles to compact and read it.
- Query First Design: Don’t model your data like entities (User, Product). Model it based on your queries (Select User By Email, Select User By ID).
- Minimize Indexes: Secondary indexes in Cassandra are often performance killers. Prefer separate query tables (Materialized Views can also serve, but come with their own operational caveats).
- Use UUIDs: Avoid auto-incrementing integers. They require coordination (locks), which distributed systems hate. Use `uuid` or `timeuuid`.
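To see why no coordination is needed, here is a stdlib-only Go sketch that assembles an RFC 4122 version-4 UUID from locally generated random bytes. In practice you would use your driver's helpers instead (e.g. gocql's `TimeUUID()` when time-ordered IDs matter); this is only to show that each node can mint IDs independently.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 builds a version-4 UUID: 122 random bits plus fixed version
// and variant bits. No counter, no lock, no cross-node round trip.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // version 4
	b[8] = (b[8] & 0x3f) | 0x80 // RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x",
		b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id) // e.g. 8-4-4-4-12 hex digits, unique without any lock
}
```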