Streams Architecture & Duality

Kafka Streams is a client library for building applications and microservices, where the input and output data are stored in Kafka clusters. Unlike Spark or Flink, which require a separate processing cluster, Kafka Streams is just a library.

1. The “Library” Philosophy

Traditional stream processing systems (like Spark Streaming or Flink) are cluster-based: you submit a job to a cluster, and the cluster manages its execution.

Kafka Streams is different:

  • No Cluster: It runs inside your application (e.g., a Spring Boot app, a microservice).
  • Scalability: It leverages Kafka’s consumer group protocol to scale. To scale out, you simply start more instances of your application.
  • Deployment: You deploy it just like any other application (Docker, Kubernetes, bare metal).
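The scaling model falls out of configuration: every instance started with the same application.id joins the same consumer group and is assigned a share of the input partitions. A minimal sketch in plain Java (the class name and values here are illustrative; a real deployment would also configure serdes, state directories, etc.):

```java
import java.util.Properties;

public class StreamsConfigSketch {
    // Builds the minimal configuration for a Kafka Streams instance.
    // All instances sharing this application.id form one consumer group,
    // so Kafka rebalances the input partitions across them automatically.
    static Properties buildConfig() {
        Properties props = new Properties();
        props.put("application.id", "order-processor"); // doubles as the consumer group id
        props.put("bootstrap.servers", "localhost:9092");
        return props;
    }

    public static void main(String[] args) {
        Properties props = buildConfig();
        System.out.println(props.getProperty("application.id"));
    }
}
```

Starting a second container with this exact configuration is all it takes to double the processing capacity (up to the partition count of the input topic).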

2. The Processor Topology

A Kafka Streams application is defined by a Topology—a graph of processing nodes.

Source (Topic: Orders) → Processor (Filter: Amount > $100) → Sink (Topic: HighValueOrders)

3. KStream vs. KTable: The Duality

This is the most critical concept in Kafka Streams.

KStream (The “INSERT” Stream)

A KStream is an abstraction of a record stream, where each data record represents a self-contained datum in the unbounded dataset.

  • Semantics: “Fact”. Something happened.
  • Example: “Alice sent 10”, “Alice sent 5”.
  • Analogy: A ledger of transactions.

KTable (The “UPSERT” Stream)

A KTable is an abstraction of a changelog stream, where each data record represents an update.

  • Semantics: “State”. The current value for a key.
  • Example: “Alice’s Balance is 10”, then “Alice’s Balance is 15”.
  • Analogy: A database table.

[!TIP] Stream-Table Duality:

  • Stream as Table: A stream can be considered a changelog of a table. If you replay the stream from the beginning, you can reconstruct the table.
  • Table as Stream: A table can be considered a stream of updates. Every time the table changes, it produces a change event.
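The duality can be simulated in plain Java without Kafka at all: the raw list of records is the stream, and replaying that list into a map reconstructs the table. (DualityDemo and asTable are illustrative names, not Kafka Streams API.)

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DualityDemo {
    // The stream view keeps every record (INSERT semantics); the table
    // view keeps only the latest value per key (UPSERT semantics).
    static Map<String, Integer> asTable(List<Map.Entry<String, Integer>> stream) {
        Map<String, Integer> table = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> record : stream) {
            table.put(record.getKey(), record.getValue()); // later records overwrite earlier ones
        }
        return table;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> stream = new ArrayList<>();
        stream.add(Map.entry("alice", 10));
        stream.add(Map.entry("alice", 15));
        stream.add(Map.entry("bob", 7));

        System.out.println(stream.size()); // stream view: 3 facts
        System.out.println(asTable(stream)); // table view: {alice=15, bob=7}
    }
}
```

Replaying the stream from the beginning always rebuilds the same table, which is exactly how Kafka Streams restores a KTable's state after a restart.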


4. Code Implementation

Here is how you define a simple topology that reads from a source topic, filters records, and writes to a sink topic.

Java (Kafka Streams)


import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

// 0. Configure the application
Properties config = new Properties();
config.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-processor");
config.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
config.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
config.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

// 1. Create a StreamsBuilder
StreamsBuilder builder = new StreamsBuilder();

// 2. Create a KStream from a topic
KStream<String, String> sourceStream = builder.stream("orders");

// 3. Define the topology (Filter)
KStream<String, String> highValueOrders = sourceStream.filter(
  (key, value) -> {
    // value is assumed to be a numeric string (the order amount)
    double amount = Double.parseDouble(value);
    return amount > 100.0;
  }
);

// 4. Write to sink topic
highValueOrders.to("high-value-orders");

// 5. Build and start
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();

Go (Consumer Loop Pattern)

Kafka Streams is a Java-only library. In Go, we implement similar logic using a Consumer-Producer loop.


package main

import (
  "log"
  "strconv"

  "github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
  c, err := kafka.NewConsumer(&kafka.ConfigMap{
    "bootstrap.servers": "localhost:9092",
    "group.id":          "order-processor",
    "auto.offset.reset": "earliest",
  })
  if err != nil {
    log.Fatalf("failed to create consumer: %v", err)
  }
  p, err := kafka.NewProducer(&kafka.ConfigMap{"bootstrap.servers": "localhost:9092"})
  if err != nil {
    log.Fatalf("failed to create producer: %v", err)
  }

  // A string literal cannot be addressed directly in Go,
  // so the sink topic goes into a variable.
  sink := "high-value-orders"

  c.SubscribeTopics([]string{"orders"}, nil)

  for {
    msg, err := c.ReadMessage(-1)
    if err != nil {
      continue
    }

    // Processing Logic (The "Topology")
    amount, err := strconv.ParseFloat(string(msg.Value), 64)
    if err != nil {
      continue // skip malformed records
    }

    // Filter
    if amount > 100.0 {
      // Sink
      p.Produce(&kafka.Message{
        TopicPartition: kafka.TopicPartition{Topic: &sink, Partition: kafka.PartitionAny},
        Key:            msg.Key,
        Value:          msg.Value,
      }, nil)
    }
  }
}

5. Streaming in Practice: Asynchronous vs. Synchronous

When developers first encounter Kafka Streams, the concept of a “Stream” can be abstract.

  • Batch Processing (The Past): You collect data all day in a database, and at midnight, a script runs a SELECT * query to calculate daily totals. This is high latency (answers are up to 24 hours old).
  • Synchronous Processing (The Present): A user clicks a button, an API is called, the API calculates a total, and returns the result. This blocks the user interface and doesn’t scale well for heavy analytical workloads.
  • Stream Processing (The Future): Data is an unbounded, infinite flow. As soon as a single event occurs (e.g., a “Buy” click), it enters the Kafka Stream. The Streams application processes that single event instantly—updating a rolling total, triggering a fraud alert, or joining it with another stream—all asynchronously. The downstream systems are updated in milliseconds without ever blocking the user who clicked the button.
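The contrast can be sketched as a loop that updates state on every event, instead of summing everything at the end of the day (RollingTotal is an illustrative name; a real Streams application would express this as an aggregation on a KStream):

```java
public class RollingTotal {
    // Processes events one at a time, updating a running total as each
    // arrives -- the stream-processing model -- rather than batching all
    // events and computing a single total at midnight.
    static double[] runningTotals(double[] events) {
        double[] totals = new double[events.length];
        double sum = 0.0;
        for (int i = 0; i < events.length; i++) {
            sum += events[i];  // update state on every event
            totals[i] = sum;   // downstream systems see the new total immediately
        }
        return totals;
    }

    public static void main(String[] args) {
        double[] totals = runningTotals(new double[]{120.0, 35.5, 210.0});
        // Each intermediate total was available the moment its event arrived
        System.out.println(totals[totals.length - 1]);
    }
}
```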

Kafka Streams provides the tools (like filter, map, and join) to build these real-time, asynchronous processing pipelines natively within your application code.