Query Driven Modeling

[!NOTE] This module explores the core principles of Query Driven Modeling, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. The Paradigm Shift

If you are coming from a Relational Database (RDBMS) background, you need to unlearn almost everything you know about data modeling.

In the SQL world, you model your data based on entities and their relationships. You create a Users table, a posts table, and a comments table. You normalize data to avoid duplication (3rd Normal Form). When you need to retrieve a user’s timeline, you rely on the database engine to JOIN these tables at read time.

[!IMPORTANT] In Cassandra, Reads are Expensive, Writes are Cheap.

We optimize for Reads by doing the heavy lifting during Writes. This means we often duplicate data across multiple tables to answer specific queries.

2. The Genesis: Why Normalization Fails at Scale

To understand why we do this, we must look at the hardware reality: Disk I/O.

The Physics of the Disk

Seek Time (Latency): Moving the disk head to the correct sector. This is slow (~10ms for HDD, ~0.1ms for SSD).
Sequential Read (Throughput): Reading data continuously once the head is positioned. This is extremely fast.

RDBMS (Joins) = Scattered Reads

When you perform a JOIN, the database engine often has to “hop” around the disk to find related rows in different tables (Random I/O). As your data grows, these hops become the bottleneck.

Cassandra (Denormalization) = Sequential Reads

By denormalizing (copying all necessary data into a single row), we ensure that answering a query requires only one seek followed by a fast sequential read.

Random vs. Sequential Access

3. The Golden Rule: Query-First Design

In Cassandra, you don’t start by drawing an Entity-Relationship Diagram (ERD). You start by listing your Application Queries.

Queries → Schema → Data

You must know your queries before you design your schema.

Interactive: The Cost of Scale

Let’s look at a classic “Social Media Timeline” feature. We need to show a user all the posts from people they follow. Compare the complexity of retrieving a timeline in SQL vs. Cassandra as the dataset grows.

Scale Simulator

Number of Followers 1,000

Posts per User 50

SQL JOIN Cost (Rows Scanned)

50,000

Cassandra Cost (Partitions Accessed)

* In Cassandra, we create a specialized Timeline_By_User table. When a post is created, we write it to every follower's timeline (Write Amplification). The read becomes a single O(1) lookup.

4. Data Modeling Methodology

We follow a strict methodology to arrive at the perfect schema.

Understand the Data: List your entities and relationships.
Identify Access Patterns: List every query your application will run.
Map Queries to Tables: Create one table per query.

Example: E-Commerce Order History

Imagine we are building an order history page.

Access Pattern 1: “Get order details by Order ID.” Access Pattern 2: “Get all orders for a specific Customer.”

In a Relational world, we’d have Orders and Order_Items tables.

In Cassandra, we create two tables:

Table 1: `orders_by_id`

Optimized for looking up a single order.

CREATE TABLE orders_by_id (
    order_id uuid,
    customer_id uuid,
    total_amount decimal,
    status text,
    created_at timestamp,
    PRIMARY KEY (order_id)
);

Table 2: `orders_by_customer`

Optimized for listing a customer’s history.

CREATE TABLE orders_by_customer (
    customer_id uuid,
    created_at timestamp,
    order_id uuid,
    total_amount decimal,
    status text,
    PRIMARY KEY ((customer_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);

[!NOTE] Notice that we duplicate total_amount and status. This allows us to satisfy the query without reading the other table.

5. Implementation Examples

Let’s look at how we define these entities in code.

Using the Datastax Java Driver and Object Mapping.

// Entity for 'orders_by_id'
@Entity
@CqlName("orders_by_id")
public class OrderById {
    @PartitionKey
    @CqlName("order_id")
    private UUID orderId;

    @CqlName("customer_id")
    private UUID customerId;

    private BigDecimal totalAmount;
    private String status;
    private Instant createdAt;

    // Getters and Setters...
}

// Entity for 'orders_by_customer'
@Entity
@CqlName("orders_by_customer")
public class OrderByCustomer {
    @PartitionKey
    @CqlName("customer_id")
    private UUID customerId;

    @ClusteringColumn
    @CqlName("created_at")
    private Instant createdAt;

    // We include orderId as part of the primary key (clustering key)
    // to ensure uniqueness if a customer places multiple orders at the exact same millisecond.
    @ClusteringColumn(1)
    @CqlName("order_id")
    private UUID orderId;

    private BigDecimal totalAmount;
    private String status;
}

Using gocql.

package main

import (
    "time"
    "github.com/gocql/gocql"
)

// OrderById represents the 'orders_by_id' table
type OrderById struct {
    OrderId     gocql.UUID
    CustomerId  gocql.UUID
    TotalAmount float64
    Status      string
    CreatedAt   time.Time
}

// OrderByCustomer represents the 'orders_by_customer' table
type OrderByCustomer struct {
    CustomerId  gocql.UUID
    CreatedAt   time.Time
    OrderId     gocql.UUID
    TotalAmount float64
    Status      string
}

func CreateOrder(session *gocql.Session, order OrderById) error {
    // We must write to BOTH tables manually (or use a Batch)
    batch := session.NewBatch(gocql.LoggedBatch)

    // Statement 1: Insert into orders_by_id
    batch.Query(`INSERT INTO orders_by_id (order_id, customer_id, total_amount, status, created_at)
                 VALUES (?, ?, ?, ?, ?)`,
        order.OrderId, order.CustomerId, order.TotalAmount, order.Status, order.CreatedAt)

    // Statement 2: Insert into orders_by_customer
    batch.Query(`INSERT INTO orders_by_customer (customer_id, created_at, order_id, total_amount, status)
                 VALUES (?, ?, ?, ?, ?)`,
        order.CustomerId, order.CreatedAt, order.OrderId, order.TotalAmount, order.Status)

    return session.ExecuteBatch(batch)
}

6. Key Takeaways

Forget Normalization: Duplicate data to optimize for reads.
Start with Queries: List your access patterns before defining tables.
One Table Per Query: Don’t be afraid to create 3 different tables for the same data entity if you have 3 different ways to look it up.
Write Amplification is OK: Disk space is cheap; latency is expensive.

In the next chapter, we will dive deep into the mechanics of the Primary Key—the single most important concept in Cassandra.

Query Driven Modeling

Query Driven Modeling

1. The Paradigm Shift

2. The Genesis: Why Normalization Fails at Scale

The Physics of the Disk

RDBMS (Joins) = Scattered Reads

Cassandra (Denormalization) = Sequential Reads

Random vs. Sequential Access

3. The Golden Rule: Query-First Design

Queries → Schema → Data

Interactive: The Cost of Scale

Scale Simulator

SQL JOIN Cost (Rows Scanned)

Cassandra Cost (Partitions Accessed)

4. Data Modeling Methodology

Example: E-Commerce Order History

Table 1: orders_by_id

Table 2: orders_by_customer

5. Implementation Examples

6. Key Takeaways

Table 1: `orders_by_id`

Table 2: `orders_by_customer`