Streaming & CDC: Syncing Data

[!NOTE] This module covers keeping a downstream store in sync with a source-of-truth database: why dual writes and batch ETL fall short, and how log-based Change Data Capture (CDC) solves the problem.

1. The Sync Problem

Your source of truth is Postgres (orders table). Your search engine is Elasticsearch (orders index). How do you keep them in sync?

  1. Dual Writes: App writes to DB, then to ES.
    • Problem: Race conditions. What if DB succeeds but ES fails? Data mismatch.
  2. Batch ETL: Run a script every hour.
    • Problem: Data can be stale for up to an hour between runs.
  3. CDC (Change Data Capture): The Gold Standard.
    • Listen to the Postgres Write-Ahead Log (WAL).
    • Stream every change (INSERT, UPDATE, DELETE) to Kafka.
    • Sink Kafka to Elasticsearch.
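The dual-write failure mode above can be sketched in a few lines. This is a toy simulation, not real client code: two in-memory dicts stand in for the Postgres table and the Elasticsearch index, and a forced connection error plays the part of a failed second write.

```python
# Toy sketch of the dual-write race: two stores, no shared transaction.

db = {}   # stands in for the Postgres `orders` table (source of truth)
es = {}   # stands in for the Elasticsearch `orders` index

def dual_write(order_id, doc, es_available=True):
    db[order_id] = doc                             # write 1: DB succeeds
    if not es_available:
        raise ConnectionError("ES unreachable")    # write 2 fails mid-flight
    es[order_id] = doc                             # write 2: search index

dual_write(1, {"name": "Alice"})                   # happy path: both agree
try:
    dual_write(2, {"name": "Bob"}, es_available=False)
except ConnectionError:
    pass                                           # app moves on; nothing replays the miss

print(sorted(db))   # [1, 2]
print(sorted(es))   # [1]  -> order 2 exists in the DB but not in the index
```

Because there is no transaction spanning both systems, the mismatch is silent and permanent until something else reconciles it. CDC avoids the problem by making the database log the single write path.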

2. The Architecture: Debezium + Kafka

Postgres (WAL) → Debezium Connector → Kafka Topic → Elasticsearch Sink → Index

  • Near Real-Time: end-to-end lag is typically sub-second (~500 ms).
  • Resilient: If Elasticsearch is down, Kafka retains the messages until the sink catches up. No data loss (within the topic's retention window).
  • Ordering: Kafka preserves order within a partition. Keying messages by the row's primary key routes all changes for a given row to the same partition, so they are consumed in order.
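The Debezium side of this architecture is driven by a connector config submitted to the Kafka Connect REST API. A hedged sketch of that config follows: the `connector.class` and `plugin.name` values are Debezium's documented ones, while the hostname, credentials, database name, and topic prefix are placeholders you would replace.

```python
# Sketch of a Debezium Postgres source connector config (placeholder values).

import json

connector_config = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # logical decoding plugin built into Postgres 10+
        "database.hostname": "postgres",        # placeholder host
        "database.port": "5432",
        "database.user": "debezium",            # placeholder credentials
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",                 # Debezium 2.x key (1.x used database.server.name)
        "table.include.list": "public.orders",  # stream only the orders table
    },
}

# Submitted to Kafka Connect, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data @connector.json http://localhost:8083/connectors
print(json.dumps(connector_config, indent=2))
```

With this registered, every committed change to `public.orders` appears as an event on the `shop.public.orders` topic.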

3. Interactive: CDC Pipeline

Visualize the flow of an UPDATE command.

(Static rendering of the interactive demo: an UPDATE travels Postgres → Kafka → Elasticsearch; once the event is applied, both stores converge, with DB state name='Alice' and ES state name='Alice'.)
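The UPDATE flowing through the pipeline can also be shown as data. The envelope fields below (`op`, `before`, `after`, `source.lsn`) follow Debezium's documented change-event format; the `to_es_action` helper is an illustrative sketch of what a sink does with each event, not a real connector API.

```python
# What an UPDATE looks like as a Debezium change event, and how a sink
# might translate it into an Elasticsearch operation.

event = {
    "op": "u",                               # u=update, c=create, d=delete
    "before": {"id": 1, "name": "Alicia"},   # row state prior to the UPDATE
    "after":  {"id": 1, "name": "Alice"},    # row state after the UPDATE
    "source": {"lsn": 24023128},             # WAL position of this change (illustrative)
}

def to_es_action(evt):
    """Translate a Debezium event into an (action, id, doc) tuple."""
    if evt["op"] == "d":
        return ("delete", evt["before"]["id"], None)
    return ("index", evt["after"]["id"], evt["after"])

print(to_es_action(event))   # ('index', 1, {'id': 1, 'name': 'Alice'})
```

Note that deletes carry the row in `before` and a null `after`, which is why the sink must key deletes off the prior state.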

4. Failure Reality: Version Conflicts

CDC pipelines are asynchronous. In rare cases (e.g., a network partition followed by a retry), a sink can receive an OLD update after a NEW one. Solution: use External Versioning (version_type=external).

  • Send the Postgres LSN (Log Sequence Number) or Timestamp as the version.
  • Elasticsearch will reject any write whose version is not strictly greater than the current version (New Version ≤ Current Version → conflict).
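The rule above can be simulated in a few lines. This is an in-memory model of the strictly-greater check that `version_type=external` enforces, not the Elasticsearch client itself; with the real Python client the call is roughly `es.index(index="orders", id=1, document=doc, version=lsn, version_type="external")`.

```python
# In-memory simulation of Elasticsearch external versioning: a write is
# accepted only if its version is strictly greater than the stored one,
# so a stale update replayed out of order is rejected.

index = {}   # doc_id -> (version, doc)

def index_with_external_version(doc_id, doc, version):
    current = index.get(doc_id)
    if current is not None and version <= current[0]:
        return False                      # version conflict: stale write rejected
    index[doc_id] = (version, doc)
    return True

# The NEW update (LSN 200) arrives first; the OLD one (LSN 100) is replayed later.
index_with_external_version(1, {"name": "Alice"}, version=200)
stale_accepted = index_with_external_version(1, {"name": "Alicia"}, version=100)

print(stale_accepted)   # False
print(index[1])         # (200, {'name': 'Alice'})
```

Using the WAL LSN as the version makes the ordering decision deterministic: whichever write carries the later log position wins, regardless of arrival order.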