Streams Architecture & Duality

Kafka Streams is a client library for building applications and microservices whose input and output data are stored in Kafka clusters. Unlike Spark or Flink, which require a separate processing cluster, Kafka Streams is just a library.

1. The “Library” Philosophy

Traditional stream processing systems (like Spark Streaming or Flink) are cluster frameworks: you submit a job to a cluster, and the cluster manages its execution.

Kafka Streams is different:

  • No Cluster: It runs inside your application (e.g., a Spring Boot app, a microservice).
  • Scalability: It leverages Kafka’s consumer group protocol to scale. If you want to scale up, you just start more instances of your application.
  • Deployment: You deploy it just like any other application (Docker, Kubernetes, bare metal).
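
As a sketch of what this looks like in configuration (property names are from the Kafka Streams documentation; the id and server values here are placeholders): the `application.id` doubles as the Kafka consumer group id, so every instance started with the same id joins the same group and automatically shares the input partitions.

```java
import java.util.Properties;

public class ScalingConfig {
    public static Properties build() {
        Properties props = new Properties();
        // application.id doubles as the consumer group id: all instances
        // with the same id join one group and split the partitions.
        props.setProperty("application.id", "order-processor");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Parallelism within one instance; more instances add
        // parallelism across machines.
        props.setProperty("num.stream.threads", "2");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(ScalingConfig.build().getProperty("application.id"));
    }
}
```

Starting a second container with this same configuration is the whole scaling story: no job resubmission, no cluster resize.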

2. The Processor Topology

A Kafka Streams application is defined by a Topology—a graph of processing nodes.

Source (Topic: Orders) → Processor (Filter: Amount > $100) → Sink (Topic: HighValueOrders)

3. KStream vs. KTable: The Duality

This is the most critical concept in Kafka Streams.

KStream (The “INSERT” Stream)

A KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded dataset.

  • Semantics: “Fact”. Something happened.
  • Example: “Alice sent 10”, “Alice sent 5”.
  • Analogy: A ledger of transactions.

KTable (The “UPSERT” Stream)

A KTable is an abstraction of a changelog stream, where each data record represents an update.

  • Semantics: “State”. The current value for a key.
  • Example: “Alice’s Balance is 10”, then “Alice’s Balance is 15”.
  • Analogy: A database table.
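
The difference can be sketched with plain Java collections, no Kafka involved (the key "alice" and the numbers follow the examples above): reading records as a stream means summing independent facts, while reading them as a table means keeping only the latest value per key.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DualitySketch {
    // KStream semantics: every record is an independent fact (INSERT),
    // so the balance is the sum of all "sent" events.
    static int balanceFromStream(List<Integer> sends) {
        return sends.stream().mapToInt(Integer::intValue).sum();
    }

    // KTable semantics: every record is the new value for the key (UPSERT),
    // so the balance is simply the latest update.
    static int balanceFromTable(List<Integer> balanceUpdates) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (int b : balanceUpdates) table.put("alice", b);
        return table.get("alice");
    }

    public static void main(String[] args) {
        // "Alice sent 10", "Alice sent 5": facts, summed
        System.out.println(balanceFromStream(List.of(10, 5)));   // 15
        // "Alice's balance is 10", "Alice's balance is 15": state, latest wins
        System.out.println(balanceFromTable(List.of(10, 15)));   // 15
    }
}
```

Both views arrive at the same balance, which is exactly the duality the next section formalizes.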

[!TIP] Stream-Table Duality:

  • Stream as Table: A stream can be considered a changelog of a table. If you replay the stream from the beginning, you can reconstruct the table.
  • Table as Stream: A table can be considered a stream of updates. Every time the table changes, it produces a change event.
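
Both directions can be illustrated in plain Java, with the changelog modeled as a simple list of key/value pairs (the keys and values here are illustrative):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class StreamTableDuality {
    // Stream as Table: replaying the changelog from the beginning
    // reconstructs the table's current state (latest value per key).
    static Map<String, String> replay(List<String[]> changelog) {
        Map<String, String> table = new HashMap<>();
        for (String[] update : changelog) table.put(update[0], update[1]);
        return table;
    }

    // Table as Stream: applying updates to a table while recording each
    // mutation reproduces the changelog as a stream of change events.
    static List<String> capture(List<String[]> changelog) {
        List<String> events = new ArrayList<>();
        Map<String, String> table = new HashMap<>();
        for (String[] update : changelog) {
            table.put(update[0], update[1]);
            events.add(update[0] + "=" + update[1]); // the change event
        }
        return events;
    }

    public static void main(String[] args) {
        List<String[]> changelog = List.of(
            new String[]{"alice", "10"},
            new String[]{"bob", "20"},
            new String[]{"alice", "15"}
        );
        System.out.println(replay(changelog).get("alice")); // 15 (latest wins)
        System.out.println(capture(changelog));             // [alice=10, bob=20, alice=15]
    }
}
```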


4. Code Implementation

Here is how you define a simple topology that reads from a source topic, filters records, and writes to a sink topic.

Java (Kafka Streams)


import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// 0. Configure the application (application.id doubles as the consumer group id)
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processor");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

// 1. Create a StreamsBuilder
StreamsBuilder builder = new StreamsBuilder();

// 2. Create a KStream from a topic
KStream<String, String> sourceStream = builder.stream("orders");

// 3. Define the topology (filter)
KStream<String, String> highValueOrders = sourceStream.filter(
    (key, value) -> {
        // Assuming the value is the order amount as a plain string
        double amount = Double.parseDouble(value);
        return amount > 100.0;
    }
);

// 4. Write to the sink topic
highValueOrders.to("high-value-orders");

// 5. Build and start
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

Go (Consumer Loop Pattern)

Kafka Streams is a JVM-only library. In Go, we implement similar logic using a Consumer-Producer loop.


package main

import (
	"log"
	"strconv"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "localhost:9092",
		"group.id":          "order-processor",
		"auto.offset.reset": "earliest",
	})
	if err != nil {
		log.Fatalf("failed to create consumer: %v", err)
	}
	p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
	if err != nil {
		log.Fatalf("failed to create producer: %v", err)
	}

	c.SubscribeTopics([]string{"orders"}, nil)

	// The address of a string literal cannot be taken directly in Go,
	// so the sink topic needs a named variable.
	sinkTopic := "high-value-orders"

	for {
		msg, err := c.ReadMessage(-1)
		if err != nil {
			log.Printf("consumer error: %v", err)
			continue
		}

		// Processing logic (the "topology")
		amount, err := strconv.ParseFloat(string(msg.Value), 64)
		if err != nil {
			continue // skip malformed records
		}

		// Filter
		if amount > 100.0 {
			// Sink
			p.Produce(&kafka.Message{
				TopicPartition: kafka.TopicPartition{Topic: &sinkTopic, Partition: kafka.PartitionAny},
				Key:            msg.Key,
				Value:          msg.Value,
			}, nil)
		}
	}
}