Streams Architecture & Duality
Kafka Streams is a client library for building applications and microservices whose input and output data are stored in Kafka clusters. Unlike Spark or Flink, which require a separate processing cluster, Kafka Streams is just a library.
1. The “Library” Philosophy
Traditional stream processing systems (such as Spark Streaming or Flink) are cluster-based: you submit a job to a cluster, and the cluster manages its execution.
Kafka Streams is different:
- No Cluster: It runs inside your application (e.g., a Spring Boot app, a microservice).
- Scalability: It leverages Kafka’s consumer group protocol to scale. If you want to scale up, you just start more instances of your application.
- Deployment: You deploy it just like any other application (Docker, Kubernetes, bare metal).
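Because the library runs inside your application, its configuration is just application config. A minimal sketch, assuming a local broker (the property names `application.id` and `bootstrap.servers` are Kafka Streams' real config keys; the values are placeholders):

```java
import java.util.Properties;

public class StreamsConfigSketch {
    public static Properties buildConfig() {
        Properties props = new Properties();
        // application.id doubles as the consumer group id: every instance
        // started with the same id joins the same group and shares the
        // partitions of the input topics. This is what makes "start more
        // instances" the scaling mechanism.
        props.put("application.id", "order-processor");
        props.put("bootstrap.servers", "localhost:9092");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(buildConfig());
    }
}
```

Scaling out is then just starting another copy of the application with the same `application.id`.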
2. The Processor Topology
A Kafka Streams application is defined by a Topology: a directed graph of processing nodes, where source processors read from input topics, stream processors transform records, and sink processors write to output topics.
3. KStream vs. KTable: The Duality
This is the most critical concept in Kafka Streams.
KStream (The “INSERT” Stream)
A KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded dataset.
- Semantics: “Fact”. Something happened.
- Example: “Alice sent 10”, “Alice sent 5”.
- Analogy: A ledger of transactions.
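The ledger analogy can be sketched with plain Java collections (no Kafka required, names are illustrative): each record is an independent fact appended to the stream, so two records for the same key do not replace each other, and an aggregation sees both:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class KStreamSketch {
    // A KStream is modeled here as an append-only list of key/value facts.
    public static double totalFor(String key, List<Map.Entry<String, Double>> stream) {
        return stream.stream()
                .filter(e -> e.getKey().equals(key))
                .mapToDouble(Map.Entry::getValue)
                .sum();
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> stream = new ArrayList<>();
        stream.add(new SimpleEntry<>("Alice", 10.0)); // "Alice sent 10"
        stream.add(new SimpleEntry<>("Alice", 5.0));  // "Alice sent 5"
        // Both facts are retained: the total is 15, not 5.
        System.out.println(totalFor("Alice", stream)); // prints 15.0
    }
}
```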
KTable (The “UPSERT” Stream)
A KTable is an abstraction of a changelog stream, where each data record represents an update.
- Semantics: “State”. The current value for a key.
- Example: “Alice’s Balance is 10”, then “Alice’s Balance is 15”.
- Analogy: A database table.
[!TIP] Stream-Table Duality:
- Stream as Table: A stream can be considered a changelog of a table. If you replay the stream from the beginning, you can reconstruct the table.
- Table as Stream: A table can be considered a stream of updates. Every time the table changes, it produces a change event.
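The "Stream as Table" direction can be sketched in plain Java (no Kafka required, names are illustrative): replaying changelog records into a map reconstructs the table, because the latest value per key wins, which is exactly upsert semantics:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DualitySketch {
    // Stream as Table: replay a changelog of (key, value) updates into a map.
    public static Map<String, Double> replay(List<Map.Entry<String, Double>> changelog) {
        Map<String, Double> table = new LinkedHashMap<>();
        for (Map.Entry<String, Double> update : changelog) {
            table.put(update.getKey(), update.getValue()); // upsert: latest wins
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> changelog = List.of(
                Map.entry("Alice", 10.0),  // "Alice's balance is 10"
                Map.entry("Alice", 15.0)); // "Alice's balance is 15"
        // The reconstructed table holds only the current state per key.
        System.out.println(replay(changelog)); // prints {Alice=15.0}
    }
}
```

The reverse direction is the same loop read backwards: every `put` into the table is one change event in the stream.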
Interactive: Duality Visualizer
[Interactive widget: the same input stream (topic) rendered side by side as a record stream and as a KTable view of current state.]
4. Code Implementation
Here is how you define a simple topology that reads from a source topic, filters records, and writes to a sink topic.
Java (Kafka Streams)
```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// 0. Configuration (application.id doubles as the consumer group id)
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processor");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

// 1. Create a StreamsBuilder
StreamsBuilder builder = new StreamsBuilder();

// 2. Create a KStream from a topic
KStream<String, String> sourceStream = builder.stream("orders");

// 3. Define the topology (filter out low-value orders)
KStream<String, String> highValueOrders = sourceStream.filter(
        (key, value) -> {
            // Assuming the value is the order amount as a plain string
            double amount = Double.parseDouble(value);
            return amount > 100.0;
        }
);

// 4. Write to the sink topic
highValueOrders.to("high-value-orders");

// 5. Build and start
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
```
Go (Consumer Loop Pattern)
Kafka Streams is a Java-only library. In Go, we implement similar logic using a Consumer-Producer loop.
```go
package main

import (
	"log"
	"strconv"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "order-processor",
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatal(err)
	}
	defer p.Close()

	// Go cannot take the address of a string literal, so the sink
	// topic needs a named variable.
	topic := "high-value-orders"

	c.SubscribeTopics([]string{"orders"}, nil)
	for {
		msg, err := c.ReadMessage(-1)
		if err != nil {
			log.Printf("consumer error: %v", err)
			continue
		}
		// Processing logic (the "topology")
		amount, err := strconv.ParseFloat(string(msg.Value), 64)
		if err != nil {
			continue // skip malformed records
		}
		// Filter
		if amount > 100.0 {
			// Sink
			p.Produce(&kafka.Message{
				TopicPartition: kafka.TopicPartition{Topic: &topic, Partition: kafka.PartitionAny},
				Key:            msg.Key,
				Value:          msg.Value,
			}, nil)
		}
	}
}
```