Streaming & CDC: Syncing Data

[!NOTE] This module explores the core principles of Streaming & CDC: Syncing Data, deriving a production-ready design from first principles and real-world failure modes.

1. The Sync Problem

The Scenario: Imagine you are running an e-commerce platform. Your source of truth is a highly reliable Postgres database (the orders table), where every new order transaction is safely recorded. However, to power the lightning-fast, typo-tolerant “Search Orders” bar on your website, you use Elasticsearch (the orders index).

The challenge: How do you keep these two completely different systems in sync? Think of it like a business with a master accounting ledger in a vault (Postgres) and a quick-reference catalog at the front desk (Elasticsearch). Keeping them aligned while thousands of transactions occur every second is non-trivial. There are three common approaches:

  1. Dual Writes: The application writes to the DB, and immediately writes to ES.
    • Problem: Race conditions and distributed transactions. What if the DB write succeeds but the ES network request times out? You now have a silent data mismatch (inconsistency).
  2. Batch ETL: Run a cron job or ETL script every hour to bulk-copy data from DB to ES.
    • Problem: The search index can be stale for up to an hour, so customers who search for a just-placed order won't find it.
  3. CDC (Change Data Capture): The Gold Standard.
    • Instead of querying the database, listen to the database’s internal Write-Ahead Log (WAL).
    • Stream every committed change (INSERT, UPDATE, DELETE) into a message broker like Kafka.
    • Sink those Kafka messages into Elasticsearch asynchronously.
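The failure mode of dual writes (option 1) is easy to demonstrate. A minimal sketch, using in-memory dicts as stand-ins for the two stores; the function name and the simulated timeout are illustrative, not a real driver API:

```python
# Sketch of the dual-write anti-pattern: write to the DB, then to ES.
# If the ES call fails after the DB commit, the stores silently diverge.

db = {}        # stand-in for the Postgres orders table
es_index = {}  # stand-in for the Elasticsearch orders index

def place_order_dual_write(order_id, payload, es_available=True):
    db[order_id] = payload               # DB commit succeeds
    if not es_available:                 # simulated network timeout to ES
        raise TimeoutError("ES write timed out")
    es_index[order_id] = payload

# Happy path: both stores agree.
place_order_dual_write(1, {"status": "processing"})

# Failure path: DB committed, ES write lost.
try:
    place_order_dual_write(2, {"status": "processing"}, es_available=False)
except TimeoutError:
    pass  # a retry here may fail again, or double-apply later updates

print(2 in db, 2 in es_index)  # DB has order 2, the search index does not
```

There is no transaction spanning both systems, so no amount of local error handling makes the pair atomic; that is the gap CDC closes.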

2. The Architecture: Debezium + Kafka

Instead of the application talking to both databases, the architecture relies on asynchronous replication:

<div class="code-tab-group" markdown="1"> Postgres (WAL) &rarr; Debezium Connector &rarr; Kafka Topic &rarr; Elasticsearch Sink &rarr; Index </div>

  • Near Real-Time: Generally ~500ms lag from the moment Postgres commits to the moment Elasticsearch refreshes.
  • Resilient: If Elasticsearch goes down for maintenance, Kafka buffers the messages safely on disk. Once ES is back online, it resumes consuming without any data loss.
  • Ordering: Kafka guarantees message order for the same entity ID (Partition Key). All updates to order_id=123 land in the same Kafka partition and are processed sequentially.
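Concretely, the Debezium side is a Kafka Connect connector registered with a JSON config. A minimal sketch (hostname, database name, credentials, and the connector name are placeholders):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "shop",
    "topic.prefix": "shop",
    "table.include.list": "public.orders",
    "plugin.name": "pgoutput"
  }
}
```

With this config, committed changes to public.orders are published to the topic shop.public.orders, keyed by the table's primary key, which is exactly what gives the per-order ordering guarantee above.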

3. Interactive: CDC Pipeline

Visualize the flow of an UPDATE command.

(Interactive demo: an UPDATE flows Postgres → Kafka → Elasticsearch; both stores start at name='Alice'.)

4. Hardware Reality: Version Conflicts

Because CDC systems rely on asynchronous message brokers, network realities apply. In rare cases (such as a network partition triggering a producer retry), a consumer can observe an OLD update after a NEW one: the update setting the status to “shipped” arrives first, then the earlier update setting it to “processing” arrives late. Applied in arrival order, the stale “processing” would overwrite “shipped”.

Solution: Use External Versioning (version_type=external) in Elasticsearch.

  • When streaming changes, send the Postgres LSN (Log Sequence Number), which increases monotonically with every WAL record, as the document version. A high-precision timestamp can also work, but ties and clock skew make it weaker.
  • With external versioning, Elasticsearch rejects any write whose version is not strictly greater than the stored version (returning a version conflict), so a late-arriving stale update cannot corrupt the index state.
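The rejection rule can be sketched in plain Python. This is an in-memory stand-in for the index, not the real Elasticsearch client; it mirrors the version_type=external rule that a write is applied only when its version is strictly greater than the stored one:

```python
# External-versioning sketch: the Postgres LSN serves as the document
# version; stale (lower-or-equal LSN) writes are rejected.

index = {}  # doc_id -> {"version": lsn, "doc": payload}

def upsert_external(doc_id, lsn, doc):
    """Apply the write iff lsn > stored version; return True if applied."""
    current = index.get(doc_id)
    if current is not None and lsn <= current["version"]:
        return False  # stale or duplicate -> rejected, index untouched
    index[doc_id] = {"version": lsn, "doc": doc}
    return True

# The newer event (higher LSN) happens to arrive first...
upsert_external(123, lsn=2001, doc={"status": "shipped"})
# ...then the older event arrives late and is rejected.
applied = upsert_external(123, lsn=1005, doc={"status": "processing"})

print(applied, index[123]["doc"]["status"])  # False shipped
```

Note that the check also rejects equal versions, which makes redelivered duplicates (Kafka's at-least-once delivery) harmless no-ops rather than extra writes.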