Query Driven Modeling
[!NOTE] This module explores the core principles of Query Driven Modeling, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.
1. The Paradigm Shift
If you are coming from a Relational Database (RDBMS) background, you need to unlearn almost everything you know about data modeling.
In the SQL world, you model your data based on entities and their relationships. You create a Users table, a posts table, and a comments table. You normalize data to avoid duplication (3rd Normal Form). When you need to retrieve a user’s timeline, you rely on the database engine to JOIN these tables at read time.
[!IMPORTANT] In Cassandra, Reads are Expensive, Writes are Cheap.
We optimize for Reads by doing the heavy lifting during Writes. This means we often duplicate data across multiple tables to answer specific queries.
2. The Genesis: Why Normalization Fails at Scale
To understand why we do this, we must look at the hardware reality: Disk I/O.
The Physics of the Disk
- Seek Time (Latency): Moving the disk head to the correct sector. This is slow (~10ms for HDD, ~0.1ms for SSD).
- Sequential Read (Throughput): Reading data continuously once the head is positioned. This is extremely fast.
RDBMS (Joins) = Scattered Reads
When you perform a JOIN, the database engine often has to “hop” around the disk to find related rows in different tables (Random I/O). As your data grows, these hops become the bottleneck.
Cassandra (Denormalization) = Sequential Reads
By denormalizing (copying all necessary data into a single row), we ensure that answering a query requires only one seek followed by a fast sequential read.
Random vs. Sequential Access
3. The Golden Rule: Query-First Design
In Cassandra, you don’t start by drawing an Entity-Relationship Diagram (ERD). You start by listing your Application Queries.
Queries → Schema → Data
You must know your queries before you design your schema.
Interactive: The Cost of Scale
Let’s look at a classic “Social Media Timeline” feature. We need to show a user all the posts from people they follow. Compare the complexity of retrieving a timeline in SQL vs. Cassandra as the dataset grows.
Scale Simulator
SQL JOIN Cost (Rows Scanned)
Cassandra Cost (Partitions Accessed)
Timeline_By_User table. When a post is created, we write it to every follower's timeline (Write Amplification). The read becomes a single O(1) lookup.
4. Data Modeling Methodology
We follow a strict methodology to arrive at the perfect schema.
- Understand the Data: List your entities and relationships.
- Identify Access Patterns: List every query your application will run.
- Map Queries to Tables: Create one table per query.
Example: E-Commerce Order History
Imagine we are building an order history page.
Access Pattern 1: “Get order details by Order ID.” Access Pattern 2: “Get all orders for a specific Customer.”
In a Relational world, we’d have Orders and Order_Items tables.
In Cassandra, we create two tables:
Table 1: orders_by_id
Optimized for looking up a single order.
CREATE TABLE orders_by_id (
order_id uuid,
customer_id uuid,
total_amount decimal,
status text,
created_at timestamp,
PRIMARY KEY (order_id)
);
Table 2: orders_by_customer
Optimized for listing a customer’s history.
CREATE TABLE orders_by_customer (
customer_id uuid,
created_at timestamp,
order_id uuid,
total_amount decimal,
status text,
PRIMARY KEY ((customer_id), created_at)
) WITH CLUSTERING ORDER BY (created_at DESC);
[!NOTE] Notice that we duplicate
total_amountandstatus. This allows us to satisfy the query without reading the other table.
5. Implementation Examples
Let’s look at how we define these entities in code.
Using the Datastax Java Driver and Object Mapping.
// Entity for 'orders_by_id'
@Entity
@CqlName("orders_by_id")
public class OrderById {
@PartitionKey
@CqlName("order_id")
private UUID orderId;
@CqlName("customer_id")
private UUID customerId;
private BigDecimal totalAmount;
private String status;
private Instant createdAt;
// Getters and Setters...
}
// Entity for 'orders_by_customer'
@Entity
@CqlName("orders_by_customer")
public class OrderByCustomer {
@PartitionKey
@CqlName("customer_id")
private UUID customerId;
@ClusteringColumn
@CqlName("created_at")
private Instant createdAt;
// We include orderId as part of the primary key (clustering key)
// to ensure uniqueness if a customer places multiple orders at the exact same millisecond.
@ClusteringColumn(1)
@CqlName("order_id")
private UUID orderId;
private BigDecimal totalAmount;
private String status;
}
Using gocql.
package main
import (
"time"
"github.com/gocql/gocql"
)
// OrderById represents the 'orders_by_id' table
type OrderById struct {
OrderId gocql.UUID
CustomerId gocql.UUID
TotalAmount float64
Status string
CreatedAt time.Time
}
// OrderByCustomer represents the 'orders_by_customer' table
type OrderByCustomer struct {
CustomerId gocql.UUID
CreatedAt time.Time
OrderId gocql.UUID
TotalAmount float64
Status string
}
func CreateOrder(session *gocql.Session, order OrderById) error {
// We must write to BOTH tables manually (or use a Batch)
batch := session.NewBatch(gocql.LoggedBatch)
// Statement 1: Insert into orders_by_id
batch.Query(`INSERT INTO orders_by_id (order_id, customer_id, total_amount, status, created_at)
VALUES (?, ?, ?, ?, ?)`,
order.OrderId, order.CustomerId, order.TotalAmount, order.Status, order.CreatedAt)
// Statement 2: Insert into orders_by_customer
batch.Query(`INSERT INTO orders_by_customer (customer_id, created_at, order_id, total_amount, status)
VALUES (?, ?, ?, ?, ?)`,
order.CustomerId, order.CreatedAt, order.OrderId, order.TotalAmount, order.Status)
return session.ExecuteBatch(batch)
}
6. Key Takeaways
- Forget Normalization: Duplicate data to optimize for reads.
- Start with Queries: List your access patterns before defining tables.
- One Table Per Query: Don’t be afraid to create 3 different tables for the same data entity if you have 3 different ways to look it up.
- Write Amplification is OK: Disk space is cheap; latency is expensive.
In the next chapter, we will dive deep into the mechanics of the Primary Key—the single most important concept in Cassandra.