Streaming & CDC: Syncing Data

[!NOTE] This module explores the core principles of Streaming & CDC: Syncing Data, deriving a production-ready design from first principles and real-world failure modes.

1. The Sync Problem

The Scenario: Imagine you are running an e-commerce platform. Your source of truth is a highly reliable Postgres database (the orders table), where every new order transaction is safely recorded. However, to power the lightning-fast, typo-tolerant “Search Orders” bar on your website, you use Elasticsearch (the orders index).

The challenge: How do you keep these two completely different systems in sync? Think of it like a business with a master accounting ledger in a vault (Postgres) and a quick-reference catalog at the front desk (Elasticsearch). Keeping them aligned while thousands of transactions occur every second is non-trivial. There are three common approaches:

  1. Dual Writes: The application writes to the DB, and immediately writes to ES.
    • Problem: Race conditions and distributed transactions. What if the DB write succeeds but the ES network request times out? You now have a silent data mismatch (inconsistency).
  2. Batch ETL: Run a cron job or ETL script every hour to bulk-copy data from DB to ES.
    • Problem: The search index can be stale for up to an hour, so customers who search for a just-placed order won't find it.
  3. CDC (Change Data Capture): The Gold Standard.
    • Instead of querying the database, listen to the database’s internal Write-Ahead Log (WAL).
    • Stream every committed change (INSERT, UPDATE, DELETE) into a message broker like Kafka.
    • Sink those Kafka messages into Elasticsearch asynchronously.
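The failure mode of dual writes (option 1) is easy to demonstrate. A minimal sketch, using in-memory dicts as stand-ins for the two stores; the function name and the simulated timeout are illustrative, not a real driver API:

```python
# Sketch of the dual-write anti-pattern: write to the DB, then to ES.
# If the ES call fails after the DB commit, the stores silently diverge.

db = {}        # stand-in for the Postgres orders table
es_index = {}  # stand-in for the Elasticsearch orders index

def place_order_dual_write(order_id, payload, es_available=True):
    db[order_id] = payload               # DB commit succeeds
    if not es_available:                 # simulated network timeout to ES
        raise TimeoutError("ES write timed out")
    es_index[order_id] = payload

# Happy path: both stores agree.
place_order_dual_write(1, {"status": "processing"})

# Failure path: DB committed, ES write lost.
try:
    place_order_dual_write(2, {"status": "processing"}, es_available=False)
except TimeoutError:
    pass  # a retry here may fail again, or double-apply later updates

print(2 in db, 2 in es_index)  # DB has order 2, the search index does not
```

There is no transaction spanning both systems, so no amount of local error handling makes the pair atomic; that is the gap CDC closes.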

2. The Architecture: Debezium + Kafka

Instead of the application talking to both databases, the architecture relies on asynchronous replication:

<div class="code-tab-group" markdown="1"> Postgres (WAL) &rarr; Debezium Connector &rarr; Kafka Topic &rarr; Elasticsearch Sink &rarr; Index </div>

  • Near Real-Time: Generally ~500ms lag from the moment Postgres commits to the moment Elasticsearch refreshes.
  • Resilient: If Elasticsearch goes down for maintenance, Kafka buffers the messages safely on disk. Once ES is back online, it resumes consuming without any data loss.
  • Ordering: Kafka guarantees message order for the same entity ID (Partition Key). All updates to order_id=123 land in the same Kafka partition and are processed sequentially.
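Concretely, the Debezium side is a Kafka Connect connector registered with a JSON config. A minimal sketch (hostname, database name, credentials, and the connector name are placeholders):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "shop",
    "topic.prefix": "shop",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput"
  }
}
```

With this config, committed changes to public.orders are published to the topic shop.public.orders, keyed by the table's primary key, which is exactly what gives the per-order ordering guarantee above.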

3. Interactive: CDC Pipeline

Visualize the flow of an UPDATE command.

(Interactive demo: an UPDATE flows Postgres → Kafka → Elasticsearch; both stores start at name='Alice'.)

4. Hardware Reality: Version Conflicts

Because CDC systems rely on asynchronous message brokers, network realities apply. In rare cases (such as a network partition triggering a producer retry), a consumer can observe an OLD update after a NEW one: the update setting the status to “shipped” arrives first, then the earlier update setting it to “processing” arrives late. Applied in arrival order, the stale “processing” would overwrite “shipped”.

Solution: Use External Versioning (version_type=external) in Elasticsearch.

  • When streaming changes, send the Postgres LSN (Log Sequence Number), which increases monotonically with every WAL record, as the document version. A high-precision timestamp can also work, but ties and clock skew make it weaker.
  • With external versioning, Elasticsearch rejects any write whose version is not strictly greater than the stored version (returning a version conflict), so a late-arriving stale update cannot corrupt the index state.
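The rejection rule can be sketched in plain Python. This is an in-memory stand-in for the index, not the real Elasticsearch client; it mirrors the version_type=external rule that a write is applied only when its version is strictly greater than the stored one:

```python
# External-versioning sketch: the Postgres LSN serves as the document
# version; stale (lower-or-equal LSN) writes are rejected.

index = {}  # doc_id -> {"version": lsn, "doc": payload}

def upsert_external(doc_id, lsn, doc):
    """Apply the write iff lsn > stored version; return True if applied."""
    current = index.get(doc_id)
    if current is not None and lsn <= current["version"]:
        return False  # stale or duplicate -> rejected, index untouched
    index[doc_id] = {"version": lsn, "doc": doc}
    return True

# The newer event (higher LSN) happens to arrive first...
upsert_external(123, lsn=2001, doc={"status": "shipped"})
# ...then the older event arrives late and is rejected.
applied = upsert_external(123, lsn=1005, doc={"status": "processing"})

print(applied, index[123]["doc"]["status"])  # False shipped
```

Note that the check also rejects equal versions, which makes redelivered duplicates (Kafka's at-least-once delivery) harmless no-ops rather than extra writes.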