Module Review: Streams

[!NOTE] This module explores the core principles of Module Review: Streams, deriving solutions from first principles and hardware constraints to build world-class, production-ready expertise.

1. Key Takeaways

  • Library, Not Cluster: Kafka Streams runs inside your application instances. It scales by adding more instances (Consumer Groups).
  • The Duality: A KStream is an append-only log of events (History). A KTable is a snapshot of the current state (Truth).
  • State Stores: Stateful operations (aggregations, joins) use embedded RocksDB for fast local access.
  • Fault Tolerance: Local state is backed up to Changelog Topics in Kafka. If a node fails, the state is restored from the changelog.
  • Windowing: Essential for time-based aggregations. Tumbling (distinct), Hopping (overlapping), and Session (activity-based).
  • Co-Partitioning: To join two streams, they must have the same partition count and partitioning strategy.

2. Interactive Flashcards

Test your knowledge of Kafka Streams.

What is the difference between KStream and KTable?
KStream represents a stream of independent events (INSERT). KTable represents a stream of updates to a state (UPSERT).
Where is the state stored in Kafka Streams?
Locally in an embedded RocksDB instance on the application server's disk.
How is state recovered after a crash?
By replaying the internal "Changelog Topic" from Kafka, which acts as the source of truth for the state store.
What is a Tumbling Window?
A fixed-size, non-overlapping, contiguous time window (e.g., every 5 minutes).
What is Co-Partitioning?
The requirement that joined topics must have the same number of partitions and be partitioned by the same key.
Is Kafka Streams a separate cluster?
No, it is a Java/Scala library that runs within your application process.

3. Cheat Sheet

Concept Description Analogy
Topology The graph of processing nodes (Source → Processor → Sink). Assembly Line
KStream Record stream. Unbounded sequence of immutable data records. Log File
KTable Changelog stream. Represents the latest state for a key. Database Table
RocksDB Embedded key-value store used for stateful operations. Local Cache
Changelog Internal topic used to back up local state. WAL (Write Ahead Log)
Tumbling Fixed size, no overlap. Pages in a book
Hopping Fixed size, overlap. Moving average
Session Dynamic size, inactivity gap. User visit

4. Next Steps

Now that you understand how to process data, you need to ensure your system is reliable. Module 05: Reliability covers Delivery Semantics (At-least-once vs Exactly-once) and Replication.