Streaming & CDC: Syncing Data
[!NOTE] This module covers the core principles of streaming and Change Data Capture (CDC): why naive syncing strategies fail, and how a log-based pipeline keeps heterogeneous data stores consistent.
1. The Sync Problem
The Scenario: Imagine you are running an e-commerce platform. Your source of truth is a highly reliable Postgres database (the orders table), where every new order transaction is safely recorded. However, to power the lightning-fast, typo-tolerant “Search Orders” bar on your website, you use Elasticsearch (the orders index).
The challenge: How do you keep these two completely different systems perfectly in sync? Think of it like a business with a master accounting ledger in a vault (Postgres) and a quick-reference catalog at the front desk (Elasticsearch). Keeping them aligned while thousands of transactions occur every second is non-trivial.
- Dual Writes: The application writes to the DB, and immediately writes to ES.
- Problem: Race conditions and distributed transactions. What if the DB write succeeds but the ES network request times out? You now have a silent data mismatch (inconsistency).
- Batch ETL: Run a cron job or ETL script every hour to bulk-copy data from DB to ES.
- Problem: The search index is completely stale for up to 59 minutes, leading to poor user experience when customers search for their newly placed orders.
- CDC (Change Data Capture): The Gold Standard.
- Instead of querying the database, listen to the database’s internal Write-Ahead Log (WAL).
- Stream every committed change (`INSERT`, `UPDATE`, `DELETE`) into a message broker like Kafka.
- Sink those Kafka messages into Elasticsearch asynchronously.
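To make this concrete, here is roughly what a Debezium change event for an `UPDATE` on the orders table looks like. The envelope fields (`before`, `after`, `source`, `op`, `ts_ms`) follow Debezium's standard format; the row values and LSN are illustrative:

```json
{
  "payload": {
    "before": { "id": 123, "status": "processing" },
    "after":  { "id": 123, "status": "shipped" },
    "source": { "connector": "postgresql", "table": "orders", "lsn": 24023128 },
    "op": "u",
    "ts_ms": 1700000000000
  }
}
```

Note that the event carries both the old and new row state plus the source position (LSN), which is exactly what a downstream sink needs to apply the change idempotently.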
2. The Architecture: Debezium + Kafka
Instead of the application talking to both databases, the architecture relies on asynchronous replication:
<div class="code-tab-group" markdown="1">
Postgres (WAL) → Debezium Connector → Kafka Topic → Elasticsearch Sink → Index
</div>
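Wiring this up is mostly configuration. A minimal sketch of a Debezium Postgres connector registration, assuming hypothetical connection details (host, credentials, and the `orders` table; property names follow Debezium's connector configuration):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "shop",
    "topic.prefix": "shop",
    "table.include.list": "public.orders"
  }
}
```

With this in place, committed changes to `public.orders` stream into a Kafka topic (here `shop.public.orders`) with no application code changes.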
- Near Real-Time: Typically sub-second lag (on the order of hundreds of milliseconds) from the moment Postgres commits to the moment Elasticsearch refreshes, depending on consumer lag and the index refresh interval.
- Resilient: If Elasticsearch goes down for maintenance, Kafka buffers the messages safely on disk. Once ES is back online, it resumes consuming without any data loss.
- Ordering: Kafka guarantees message order within a partition. By using the entity ID as the partition key, all updates to `order_id=123` land in the same Kafka partition and are processed sequentially.
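The ordering guarantee above can be sketched in a few lines. Kafka's default partitioner hashes the message key (murmur2 in the Java client); the MD5-based hash here is a stand-in, since the point is only that the key-to-partition mapping is deterministic:

```python
import hashlib

NUM_PARTITIONS = 6  # illustrative topic size

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Deterministically map a message key to a partition (stand-in hash)."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

events = [
    ("order-123", "processing"),
    ("order-123", "shipped"),
    ("order-456", "processing"),
]
partitions = [partition_for(key) for key, _ in events]

# Both events for order-123 hash to the same partition, so a single
# consumer of that partition sees them in commit order.
assert partitions[0] == partitions[1]
```

This is why the partition key must be the entity ID: keying by, say, a random UUID would scatter one order's updates across partitions and destroy the per-order ordering guarantee.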
3. Hardware Reality: Version Conflicts
Because CDC systems rely on asynchronous message brokers, network realities apply. In rare cases (such as a network partition triggering a retry), a consumer might receive an OLD update after a NEW update. For instance, an update changing the status to “shipped” arrives before the update changing it to “processing”.
Solution: Use External Versioning (version_type=external) in Elasticsearch.
- When streaming changes, send the Postgres LSN (Log Sequence Number) as the document version; it increases monotonically with commit order. A high-precision timestamp can substitute, but risks ties between events in the same instant.
- Elasticsearch's external versioning will reject any write whose version is not strictly greater than the stored one (New Version ≤ Current Version fails), preventing out-of-order writes from corrupting the index state.
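The rejection logic can be sketched with an in-memory stand-in for the index (this is not the Elasticsearch client; the version numbers play the role of the Postgres LSN carried on each CDC event):

```python
class VersionedIndex:
    """In-memory sketch of version_type=external semantics."""

    def __init__(self):
        self.docs = {}  # doc_id -> (version, body)

    def index(self, doc_id, version, body):
        """Accept the write only if version is strictly greater than stored."""
        current_version = self.docs.get(doc_id, (-1, None))[0]
        if version <= current_version:
            return False  # stale or duplicate event: rejected, index unchanged
        self.docs[doc_id] = (version, body)
        return True

idx = VersionedIndex()
# Newer event (higher LSN) happens to arrive first...
idx.index("order-123", version=200, body={"status": "shipped"})
# ...then the older event arrives late after a retry.
accepted = idx.index("order-123", version=100, body={"status": "processing"})

assert accepted is False
assert idx.docs["order-123"][1]["status"] == "shipped"
```

Because the check is per-document and monotonic, the sink consumer never needs to coordinate with other consumers: even replayed or reordered events converge to the state of the highest LSN.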