Failover Strategies: Surviving the Crash

Replication creates a copy of your data, but a copy alone does not keep you online: if the Primary server catches fire, your application still goes down until another node takes over.

Failover is the process of promoting a Standby to become the new Primary.

It sounds simple: “If Primary is dead, promote Standby.” But in distributed systems, reliably deciding that a node is “dead” is the hard part, because an unreachable node may only be cut off from the network, not crashed.

2. The Nightmare: Split-Brain

Imagine this scenario:

Network cable between Primary and Standby breaks.
Standby thinks Primary is dead.
Standby promotes itself to Primary.
BUT: The original Primary is still alive and accepting writes from clients!

Now you have two Primaries, each accepting writes the other never sees. When the network comes back, the two histories have diverged and cannot be merged automatically. This is Split-Brain.
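The scenario above can be reproduced with a toy model. This is a minimal sketch, not real failover code: the `Node` class, the `naive_failover` function, and the sample writes are all hypothetical names made up for illustration. It assumes the Standby uses the naive rule "promote myself whenever I cannot reach the Primary".

```python
class Node:
    """Toy database node (hypothetical, for illustration only)."""

    def __init__(self, name, is_primary=False):
        self.name = name
        self.is_primary = is_primary
        self.log = []  # writes accepted while acting as primary

    def write(self, value):
        if not self.is_primary:
            raise RuntimeError(f"{self.name} is a standby and rejects writes")
        self.log.append(value)


def naive_failover(standby, primary_reachable):
    """Naive rule: promote the standby whenever the primary looks unreachable."""
    if not primary_reachable:
        standby.is_primary = True


node1 = Node("node1", is_primary=True)
node2 = Node("node2")

# Network partition: node2 cannot see node1, but node1 is still running.
naive_failover(node2, primary_reachable=False)

# Clients on each side of the partition keep writing.
node1.write("order-1001 from client A")
node2.write("order-1001 from client B")

primaries = [n.name for n in (node1, node2) if n.is_primary]
diverged = node1.log != node2.log
print(primaries, diverged)
```

Both nodes end up as Primary with different logs, which is exactly the state that a lock held in a third-party store is meant to rule out.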

4. Interactive: Split-Brain Simulator

A quorum-based system (like Patroni + Etcd) prevents Split-Brain: a node can only be Primary while it holds the “Leader Key” in the Distributed Configuration Store (DCS), which is Patroni's name for the consensus store.

[Interactive simulator. Initial state: Node 1 is Primary and holds the lock in the Consensus Store (Etcd), expiring in 10s; Node 2 is Standby. Status: System healthy, Node 1 is Leader.]
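Why a quorum helps can be shown with simple arithmetic. The sketch below is an illustration under stated assumptions, not how Etcd is implemented: `has_quorum` is a hypothetical helper, and the rule it encodes is the usual majority rule (strictly more than half of the members).

```python
def has_quorum(reachable_members, cluster_size):
    """Majority rule: strictly more than half of the members must be reachable."""
    return reachable_members > cluster_size // 2


# A 3-member store split 2 / 1: only the larger side can still grant the lock.
majority_side = has_quorum(2, 3)
minority_side = has_quorum(1, 3)

# A 2-member store split 1 / 1: neither side has a majority, so nobody can
# take the lock. That is safe, but it also means no automatic failover.
two_node_split = (has_quorum(1, 2), has_quorum(1, 2))

print(majority_side, minority_side, two_node_split)
```

Because two disjoint groups can never both hold a majority, at most one side of a partition can write the Leader Key. This is also why consensus stores are usually deployed with an odd number of members.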

6. The Solution: Patroni

You should not write your own failover scripts. Use Patroni, a widely used open-source HA manager for Postgres.

How Patroni Works

Runs as a daemon on every Postgres node.
Uses a Distributed Configuration Store (DCS) backed by a consensus system such as Etcd, Consul, or ZooKeeper.
Leader Key: Only one node can hold the “Leader” key in Etcd (with a TTL, e.g., 10 seconds).
Heartbeats: The Primary must update the key every 5 seconds.
Failover: If Primary crashes (stops updating key), the key expires. The Standby notices the key is gone and races to grab it.
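The leader key, heartbeat, and expiry steps above can be sketched with an in-memory stand-in for the consensus store. This is a toy model under stated assumptions: `ToyDCS`, `acquire`, and `renew` are hypothetical names, time is passed in explicitly as `now`, and the real Etcd or Patroni APIs look different. The 10-second TTL and 5-second heartbeat match the example values in the list.

```python
class ToyDCS:
    """In-memory stand-in for a consensus store holding one leader key."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def _expire(self, now):
        # Drop the key once its TTL has run out.
        if self.holder is not None and now >= self.expires_at:
            self.holder = None

    def acquire(self, node, now, ttl):
        """Create the key only if nobody holds it (atomic in a real store)."""
        self._expire(now)
        if self.holder is None:
            self.holder, self.expires_at = node, now + ttl
            return True
        return False

    def renew(self, node, now, ttl):
        """Heartbeat: extend the TTL, but only for the current holder."""
        self._expire(now)
        if self.holder == node:
            self.expires_at = now + ttl
            return True
        return False


TTL = 10
dcs = ToyDCS()
events = [
    dcs.acquire("node1", now=0, ttl=TTL),   # node1 becomes leader
    dcs.acquire("node2", now=1, ttl=TTL),   # rejected: key is held
    dcs.renew("node1", now=5, ttl=TTL),     # heartbeat, key now valid until t=15
    # node1 crashes here and stops renewing
    dcs.acquire("node2", now=12, ttl=TTL),  # rejected: key has not expired yet
    dcs.acquire("node2", now=15, ttl=TTL),  # key expired, node2 takes over
    dcs.renew("node1", now=16, ttl=TTL),    # rejected: old primary lost the key
]
print(events)
```

The last call is the important one: if the old Primary comes back, its renewal fails, which is its signal to stop acting as Primary.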

Fencing (STONITH)

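To ensure the old Primary is truly out of action (and not just partitioned), Patroni can use a watchdog device. If the Primary loses contact with Etcd, Patroni demotes Postgres itself. If the Patroni process hangs or crashes and stops sending keepalives to the watchdog, the watchdog resets the server before the leader key expires, so it cannot keep accepting writes.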

STONITH: Shoot The Other Node In The Head.
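The timing argument behind a watchdog can be sketched as follows. This is a toy model, not Patroni's code and not the Linux watchdog interface: `ToyWatchdog`, `kick`, and `must_reset` are hypothetical names, and the 8-second timeout is an assumed value chosen only to be shorter than the 10-second leader key TTL used earlier.

```python
class ToyWatchdog:
    """Toy watchdog: demands a reset if it is not kicked within `timeout`."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_kick = None  # None until the watchdog is armed

    def kick(self, now):
        """Keepalive sent by the HA daemon while it is healthy."""
        self.last_kick = now

    def must_reset(self, now):
        """True once the daemon has been silent for the whole timeout."""
        return self.last_kick is not None and now - self.last_kick >= self.timeout


LEADER_TTL = 10        # leader key lifetime from the example above
WATCHDOG_TIMEOUT = 8   # assumed: must be shorter than the leader key TTL

wd = ToyWatchdog(WATCHDOG_TIMEOUT)
wd.kick(now=0)
wd.kick(now=5)         # last keepalive; the leader key was also renewed at t=5
key_expires_at = 5 + LEADER_TTL

# The daemon hangs after t=5 and sends no more keepalives.
reset_at = next(t for t in range(5, 30) if wd.must_reset(t))
print(reset_at, key_expires_at, reset_at < key_expires_at)
```

The node is reset at t=13, before the leader key expires at t=15, so the old Primary is already gone by the time a Standby can win the key.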

8. Client-Side Failover

How does your application know which IP address is the Primary? You have two options:

Option A: VIP (Virtual IP)

Use tools like keepalived to float a Virtual IP address (e.g., 10.0.0.100) to whichever node is Primary. The app always connects to 10.0.0.100, wherever the Primary currently runs.

Option B: libpq Multi-Host Connection

libpq, the C client library that many Postgres drivers build on, accepts multiple hosts in the connection string, and so does the JDBC driver (pgJDBC).

JDBC URL: jdbc:postgresql://node1,node2,node3/mydb?targetServerType=primary
libpq connection string: host=node1,node2,node3 port=5432 target_session_attrs=read-write

The driver tries node1.
It connects and checks the server's role. Conceptually this is SELECT pg_is_in_recovery(); the exact check depends on the driver and version.
If true (Standby), it disconnects and tries node2.
If false (Primary), it stays connected.
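The probing loop above can be sketched in a few lines. This is a simplified model of the idea, not the driver's actual implementation: `find_primary` and the `is_in_recovery` probe are hypothetical names, and the fake cluster stands in for real servers so the example runs without a database.

```python
def find_primary(hosts, is_in_recovery):
    """Return the first host that reports it is not in recovery, else None.

    `is_in_recovery(host)` is a hypothetical probe: it returns True for a
    standby, False for a primary, and raises ConnectionError if the host
    cannot be reached.
    """
    for host in hosts:
        try:
            if not is_in_recovery(host):
                return host      # primary found: keep this connection
        except ConnectionError:
            continue             # host is down: try the next one
    return None                  # no primary reachable right now


# Fake cluster state: None = unreachable, True = standby, False = primary.
cluster = {"node1": None, "node2": True, "node3": False}


def fake_probe(host):
    state = cluster[host]
    if state is None:
        raise ConnectionError(f"{host} unreachable")
    return state


chosen = find_primary(["node1", "node2", "node3"], fake_probe)
print(chosen)
```

In a real application you would not write this loop yourself: passing the multi-host connection string to a libpq-based driver lets the library do the probing.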

This removes the need for a load balancer for failover!

10. Summary

Strategy	Speed	Complexity	Risk
Manual	Slow (Minutes)	Low	High (Human Error)
Repmgr	Fast	Medium	Medium (Split-brain risk)
Patroni	Fast (<30s)	High (Requires a DCS such as Etcd)	Low (Leader lock held in a consensus store)
