Streaming & CDC: Syncing Data

[!NOTE] This module covers keeping a downstream store in sync with a source-of-truth database: why dual writes and batch ETL fall short, and how log-based Change Data Capture (CDC) solves the problem.

1. The Sync Problem

Your source of truth is Postgres (orders table). Your search engine is Elasticsearch (orders index). How do you keep them in sync?

  1. Dual Writes: App writes to DB, then to ES.
    • Problem: Race conditions. What if DB succeeds but ES fails? Data mismatch.
  2. Batch ETL: Run a script every hour.
    • Problem: Data can be stale for up to an hour between runs.
  3. CDC (Change Data Capture): The Gold Standard.
    • Listen to the Postgres Write-Ahead Log (WAL).
    • Stream every change (INSERT, UPDATE, DELETE) to Kafka.
    • Sink Kafka to Elasticsearch.
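The dual-write failure mode above can be sketched in a few lines. This is a toy simulation, not real client code: two in-memory dicts stand in for the Postgres table and the Elasticsearch index, and a forced connection error plays the part of a failed second write.

```python
# Toy sketch of the dual-write race: two stores, no shared transaction.

db = {}   # stands in for the Postgres `orders` table (source of truth)
es = {}   # stands in for the Elasticsearch `orders` index

def dual_write(order_id, doc, es_available=True):
    db[order_id] = doc                             # write 1: DB succeeds
    if not es_available:
        raise ConnectionError("ES unreachable")    # write 2 fails mid-flight
    es[order_id] = doc                             # write 2: search index

dual_write(1, {"name": "Alice"})                   # happy path: both agree
try:
    dual_write(2, {"name": "Bob"}, es_available=False)
except ConnectionError:
    pass                                           # app moves on; nothing replays the miss

print(sorted(db))   # [1, 2]
print(sorted(es))   # [1]  -> order 2 exists in the DB but not in the index
```

Because there is no transaction spanning both systems, the mismatch is silent and permanent until something else reconciles it. CDC avoids the problem by making the database log the single write path.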

2. The Architecture: Debezium + Kafka

Postgres (WAL) → Debezium Connector → Kafka Topic → Elasticsearch Sink → Index

  • Near Real-Time: end-to-end lag is typically sub-second (~500 ms).
  • Resilient: If Elasticsearch is down, Kafka retains the messages until the sink catches up. No data loss (within the topic's retention window).
  • Ordering: Kafka preserves order within a partition. Keying messages by the row's primary key routes all changes for a given row to the same partition, so they are consumed in order.
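The Debezium side of this architecture is driven by a connector config submitted to the Kafka Connect REST API. A hedged sketch of that config follows: the `connector.class` and `plugin.name` values are Debezium's documented ones, while the hostname, credentials, database name, and topic prefix are placeholders you would replace.

```python
# Sketch of a Debezium Postgres source connector config (placeholder values).

import json

connector_config = {
    "name": "orders-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",              # logical decoding plugin built into Postgres 10+
        "database.hostname": "postgres",        # placeholder host
        "database.port": "5432",
        "database.user": "debezium",            # placeholder credentials
        "database.password": "secret",
        "database.dbname": "shop",
        "topic.prefix": "shop",                 # Debezium 2.x key (1.x used database.server.name)
        "table.include.list": "public.orders",  # stream only the orders table
    },
}

# Submitted to Kafka Connect, e.g.:
#   curl -X POST -H "Content-Type: application/json" \
#        --data @connector.json http://localhost:8083/connectors
print(json.dumps(connector_config, indent=2))
```

With this registered, every committed change to `public.orders` appears as an event on the `shop.public.orders` topic.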

3. Interactive: CDC Pipeline

Visualize the flow of an UPDATE command.

(Static rendering of the interactive demo: an UPDATE travels Postgres → Kafka → Elasticsearch; once the event is applied, both stores converge, with DB state name='Alice' and ES state name='Alice'.)
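The UPDATE flowing through the pipeline can also be shown as data. The envelope fields below (`op`, `before`, `after`, `source.lsn`) follow Debezium's documented change-event format; the `to_es_action` helper is an illustrative sketch of what a sink does with each event, not a real connector API.

```python
# What an UPDATE looks like as a Debezium change event, and how a sink
# might translate it into an Elasticsearch operation.

event = {
    "op": "u",                               # u=update, c=create, d=delete
    "before": {"id": 1, "name": "Alicia"},   # row state prior to the UPDATE
    "after":  {"id": 1, "name": "Alice"},    # row state after the UPDATE
    "source": {"lsn": 24023128},             # WAL position of this change (illustrative)
}

def to_es_action(evt):
    """Translate a Debezium event into an (action, id, doc) tuple."""
    if evt["op"] == "d":
        return ("delete", evt["before"]["id"], None)
    return ("index", evt["after"]["id"], evt["after"])

print(to_es_action(event))   # ('index', 1, {'id': 1, 'name': 'Alice'})
```

Note that deletes carry the row in `before` and a null `after`, which is why the sink must key deletes off the prior state.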

4. Failure Reality: Version Conflicts

CDC pipelines are asynchronous. In rare cases (e.g., a network partition followed by a retry), a sink can receive an OLD update after a NEW one. Solution: use External Versioning (version_type=external).

  • Send the Postgres LSN (Log Sequence Number) or Timestamp as the version.
  • Elasticsearch will reject any write whose version is not strictly greater than the current version (New Version ≤ Current Version → conflict).
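The rule above can be simulated in a few lines. This is an in-memory model of the strictly-greater check that `version_type=external` enforces, not the Elasticsearch client itself; with the real Python client the call is roughly `es.index(index="orders", id=1, document=doc, version=lsn, version_type="external")`.

```python
# In-memory simulation of Elasticsearch external versioning: a write is
# accepted only if its version is strictly greater than the stored one,
# so a stale update replayed out of order is rejected.

index = {}   # doc_id -> (version, doc)

def index_with_external_version(doc_id, doc, version):
    current = index.get(doc_id)
    if current is not None and version <= current[0]:
        return False                      # version conflict: stale write rejected
    index[doc_id] = (version, doc)
    return True

# The NEW update (LSN 200) arrives first; the OLD one (LSN 100) is replayed later.
index_with_external_version(1, {"name": "Alice"}, version=200)
stale_accepted = index_with_external_version(1, {"name": "Alicia"}, version=100)

print(stale_accepted)   # False
print(index[1])         # (200, {'name': 'Alice'})
```

Using the WAL LSN as the version makes the ordering decision deterministic: whichever write carries the later log position wins, regardless of arrival order.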