Select, Insert, Update

CQL (Cassandra Query Language) was designed to look like SQL to make RDBMS developers feel at home. This was a marketing trick. Under the hood, CQL behaves nothing like SQL.

In SQL, a write is expensive (ACID checks, B-Tree rebalancing) and reads are flexible (Joins, arbitrary WHERE clauses). In Cassandra, writes are cheap (append-only), and reads are strict.

1. The Write Path (INSERT & UPDATE)

In Cassandra, INSERT and UPDATE are effectively the same operation: an Upsert. If the row exists, it’s overwritten. If not, it’s created.

Why Writes are Fast

Cassandra is optimized for write throughput.

Commit Log: Appends to disk (Sequential I/O).
MemTable: Writes to memory (RAM).
Done: The client gets an acknowledgement.

There are no read-before-write checks (unless using Lightweight Transactions).

Time To Live (TTL)

You can set data to self-destruct. This is perfect for time-series data or session management.

INSERT INTO user_sessions (session_id, user_id)
VALUES (uuid(), 'user_123')
USING TTL 3600; -- Expires in 1 hour

2. The Read Path (SELECT)

This is where beginners crash and burn. In a distributed system, you cannot just ask “Find me all users named John”. You have to ask “Find me the user named John on Node 5”.

The Golden Rule of Querying

[!IMPORTANT] You MUST provide the Partition Key in your WHERE clause.

If you provide the Partition Key, the Coordinator node can hash it, determine exactly which replica holds the data, and send a direct request. This is O(1).

If you don’t provide the Partition Key, the Coordinator must send the query to ALL nodes in the cluster. This is O(N) and is called a Scatter-Gather query.

Interactive: Query Router

See the difference between an efficient query and a cluster-killing query.

Analysis: Enter a query...

ALLOW FILTERING: The Silent Killer

When you try to query by a non-primary key column, Cassandra will reject it: InvalidRequest: Cannot execute this query as it might involve data filtering...

You might be tempted to append ALLOW FILTERING to make it work.

[!CAUTION] NEVER use ALLOW FILTERING in production applications. It forces Cassandra to read every single row in the table, transfer them to the coordinator, and filter them in memory. It will time out and crash your cluster under load.

3. Implementation: Java & Go

Java (DataStax Driver)

import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.PreparedStatement;
import com.datastax.oss.driver.api.core.cql.ResultSet;
import com.datastax.oss.driver.api.core.cql.Row;
import java.util.UUID;

public class UserManager {
    private final CqlSession session;
    private final PreparedStatement insertStmt;
    private final PreparedStatement selectStmt;

    public UserManager(CqlSession session) {
        this.session = session;
        // Always prepare statements once!
        this.insertStmt = session.prepare(
            "INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)");
        this.selectStmt = session.prepare(
            "SELECT * FROM users WHERE user_id = ?");
    }

    public void createUser(UUID userId, String name, String email) {
        session.execute(insertStmt.bind(userId, name, email));
    }

    public String getUserName(UUID userId) {
        ResultSet rs = session.execute(selectStmt.bind(userId));
        Row row = rs.one();
        return (row != null) ? row.getString("name") : null;
    }
}

Go (Gocql)

package main

import (
    "log"
    "github.com/gocql/gocql"
)

type User struct {
    ID    gocql.UUID
    Name  string
    Email string
}

func createUser(session *gocql.Session, u User) {
    if err := session.Query(`
        INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)`,
        u.ID, u.Name, u.Email).Exec(); err != nil {
        log.Fatal(err)
    }
}

func getUser(session *gocql.Session, id gocql.UUID) (User, error) {
    var u User
    u.ID = id
    err := session.Query(`
        SELECT name, email FROM users WHERE user_id = ?`,
        id).Scan(&u.Name, &u.Email)
    return u, err
}

4. Batches

Cassandra has BATCH syntax, but it’s not for bulk loading.

Logged Batch (Default): Ensures Atomicity (all or nothing) across multiple partitions. It has a performance penalty (batch log write). Use this for keeping tables in sync (e.g., updating a users table and a users_by_email table).
Unlogged Batch: No atomicity guarantee. Slightly faster.
Bulk Loading: To load millions of rows, do not use batches. Use async INSERT statements in parallel (token-aware client).

Select Insert Update