Streaming & CDC: Syncing Data
[!NOTE] This module covers the core principles of streaming and Change Data Capture (CDC): keeping two data stores in sync reliably and in near real time.
1. The Sync Problem
Your source of truth is Postgres (orders table).
Your search engine is Elasticsearch (orders index).
How do you keep them in sync?
- Dual Writes: The app writes to the DB, then to ES.
  - Problem: race conditions and partial failure. If the DB write succeeds but the ES write fails, the two stores silently diverge.
- Batch ETL: Run a sync script every hour.
  - Problem: data is stale for up to an hour between runs.
- CDC (Change Data Capture): The gold standard.
  - Tail the Postgres Write-Ahead Log (WAL).
  - Stream every change (INSERT, UPDATE, DELETE) to Kafka.
  - Sink the Kafka topic into Elasticsearch.
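Each WAL change arrives at the sink as a change event. A minimal sketch of consuming one, using a simplified version of Debezium's event shape (`op`, `before`/`after` row images, and the source LSN; real events wrap this in a schema/payload envelope):

```python
import json

# Simplified Debezium-style change event for an UPDATE on the orders table.
# "op" is c/u/d for insert/update/delete; source.lsn is the WAL position.
event = {
    "op": "u",
    "before": {"id": 42, "name": "Alice"},
    "after": {"id": 42, "name": "Alicia"},
    "source": {"table": "orders", "lsn": 23785608},
}

def to_es_action(evt):
    """Translate a change event into an Elasticsearch-style sink action."""
    if evt["op"] == "d":
        # Deletes carry no "after" image; key the delete off "before".
        return {"delete": {"_id": evt["before"]["id"]}}
    return {"index": {"_id": evt["after"]["id"]}, "doc": evt["after"]}

action = to_es_action(event)
print(json.dumps(action))
```

The sink never needs to diff the two stores: replaying every event in order reproduces the DB state in the index.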
2. The Architecture: Debezium + Kafka
Postgres (WAL) → Debezium Connector → Kafka Topic → Elasticsearch Sink → Index
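The Debezium leg of this pipeline is configured by registering a connector with Kafka Connect. A sketch of such a registration payload (hostnames, credentials, and names here are placeholders; key names follow Debezium's Postgres connector):

```json
{
  "name": "orders-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "plugin.name": "pgoutput",
    "database.hostname": "db.example.internal",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "********",
    "database.dbname": "shop",
    "topic.prefix": "shop",
    "table.include.list": "public.orders"
  }
}
```

With this in place, changes to `public.orders` land on the `shop.public.orders` topic without any application code changes.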
- Near Real-Time: typically sub-second lag (~500 ms end to end).
- Resilient: if ES is down, Kafka retains the messages and the sink resumes where it left off. No data loss.
- Ordering: Kafka preserves order within a partition, so keying messages by the row's primary key (the partition key) keeps all updates to the same row in order.
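The per-key ordering guarantee above rests on one property: a keyed partitioner always maps the same key to the same partition. A sketch of that idea (Kafka's default partitioner hashes the key with murmur2; `crc32` is an illustrative stand-in here):

```python
from zlib import crc32

NUM_PARTITIONS = 12  # assumed partition count for the orders topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Hash the message key and map it onto a partition. Because the hash
    # is deterministic, every event for one order lands on one partition,
    # and that partition is consumed strictly in order.
    return crc32(key.encode("utf-8")) % num_partitions

events = [("order-42", "v1"), ("order-7", "v1"), ("order-42", "v2")]
partitions = [partition_for(key) for key, _ in events]
```

Note the guarantee is per key, not global: events for `order-42` and `order-7` may interleave arbitrarily, which is fine as long as each row's own updates stay ordered.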
3. Interactive: CDC Pipeline
Visualize the flow of an UPDATE command.
(Interactive widget: an UPDATE in Postgres flows through Kafka into Elasticsearch; both stores start at name='Alice'.)
4. Hardware Reality: Version Conflicts
CDC systems are asynchronous.
In rare cases (e.g., a network partition followed by a retry), the sink can receive an OLD update after a NEWER one.
Solution: Use External Versioning (version_type=external).
- Send the Postgres LSN (Log Sequence Number) or Timestamp as the version.
- Elasticsearch will reject any write whose version is not strictly greater than the current one (with `version_type=external`, versions must be strictly increasing).
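The external-versioning rule can be simulated in a few lines: each write carries the Postgres LSN as its version, and a stale (lower-versioned) write is dropped, mirroring the `version_type=external` behavior described above:

```python
class VersionedStore:
    """Toy model of Elasticsearch's version_type=external semantics."""

    def __init__(self):
        self.docs = {}      # doc id -> document
        self.versions = {}  # doc id -> last accepted version (e.g. Postgres LSN)

    def index(self, doc_id, doc, version):
        # Accept the write only if the incoming version is strictly greater
        # than the stored one; otherwise drop it as stale.
        if version <= self.versions.get(doc_id, -1):
            return False  # stale write rejected, state unchanged
        self.docs[doc_id] = doc
        self.versions[doc_id] = version
        return True

store = VersionedStore()
store.index(42, {"name": "Alicia"}, version=23785700)        # newer update arrives first
late = store.index(42, {"name": "Alice"}, version=23785608)  # delayed older update
```

Because versions come from the WAL position, the check is idempotent: replaying the whole topic after a failure converges to the same final state.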