A PostgreSQL primary loses power at 2am. Writes resume in under thirty seconds. The on-call engineer reads the alert in the morning, sees that the cluster healed itself, and goes back to coffee. That is the outcome PostgreSQL high availability is supposed to deliver.
A working Patroni cluster, on its own, gets you partway there. The leader election runs. A standby gets promoted. The cluster state in etcd stays consistent. Then the application keeps trying to reach an IP address that now points at the wrong node, the old primary needs a manual rejoin, and the on-call engineer is on a conference bridge instead of in bed.
I have seen this pattern enough times to call it the default. The cluster does its job. The application waits on a human. The runbook comes out. RTO passes the SLA. Everyone agrees afterward that “we should look at HA more seriously.”
The arithmetic of recovery time
The case for automation is mostly arithmetic.
When the cluster heals itself, the RTO clock starts at the failure detection and stops at the application’s first successful write. With Patroni’s TTL set to 30 seconds, a routing layer that follows promotion within another second or two, and an application that retries with backoff, the whole sequence finishes in under a minute. Often well under.
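For reference, the timing knobs behind that cluster side of the clock live in Patroni's dynamic configuration, editable with patronictl edit-config; the values below are the defaults.

    ttl: 30            # leader lease in seconds; its expiry is what triggers a failover
    loop_wait: 10      # seconds between runs of Patroni's HA loop
    retry_timeout: 10  # how long to retry DCS and PostgreSQL operations before giving up the lease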
Bring a human into the loop, and a different clock starts. The monitoring system needs to detect the failure, group it into an alert, and deliver it to the on-call engineer’s pager. That alone is often 30 to 60 seconds. The engineer needs to wake up, find a laptop, log into the bastion, and load enough context to know what is happening. Even for a sharp engineer, that is 5 to 10 minutes in the best case. Then comes the investigation: which node failed, what state the cluster is in, what is safe to do next. That is another 5 to 30 minutes depending on how clean the runbook is and how confident the engineer feels. Finally, the actual fix: editing a connection string, running a pg_rewind, restarting an application pool, or whatever the gap requires.
Best case with a human in the path is around 15 minutes. Realistic case is 30 to 60 minutes. Worst case, when the page goes to a second escalation, the engineer is on the road, or the failure mode is unfamiliar, is hours. I have seen a single missed page on a PagerDuty escalation policy stretch a 30-second cluster event into an eight-hour incident that ended at sunrise.
This is the whole point of the second half. Every layer covered below exists to keep the application online and the engineer in bed, so the 2am page becomes a morning summary.
What Patroni gets right
Patroni is, at heart, a leader election engine wrapped around PostgreSQL. It uses a distributed configuration store such as etcd, Consul, ZooKeeper, or the Kubernetes API to coordinate cluster state. It promotes a standby when the primary stops responding to its lease. It exposes a REST API at /primary and /replica that other tools can ask, “Who is the leader right now?” It supports watchdog fencing so a stuck primary is forced to release writes the moment its lease expires.
That is real, careful work, and it is the right tool for the cluster layer. The point of this post is that the cluster layer is one layer of four.
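A quick way to watch that machinery from the outside is to ask the REST API directly; node1 below is a placeholder, and 8008 is Patroni's default REST port.

    # Returns 200 only on the current leader, 503 everywhere else.
    curl -s -o /dev/null -w '%{http_code}\n' http://node1:8008/primary

    # Returns 200 on any healthy streaming replica.
    curl -s -o /dev/null -w '%{http_code}\n' http://node1:8008/replica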
Routing the connection to the new primary
Your application reaches PostgreSQL through some combination of hostnames, IPs, connection pools, and drivers. None of those update on their own when Patroni promotes a new primary. Something has to route fresh connections to whichever node currently holds the leader role.
Three patterns work well in production.
A virtual IP managed by keepalived or vip-manager floats with the primary. The application connects to one address, and the address moves at promotion time. Latency overhead is essentially zero. The tradeoff is that VIPs need a Layer 2 network you control, which limits them to data centers and private networks where you own the routing.
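As a sketch of the VIP pattern, keepalived can track Patroni's REST API so that only the node answering /primary holds the address; the interface, VIP, and router ID below are placeholders, and vip-manager achieves the same by watching the leader key in the DCS.

    vrrp_script chk_patroni {
        script "/usr/bin/curl -sf http://127.0.0.1:8008/primary"
        interval 2
        fall 2
        rise 2
    }

    vrrp_instance VI_PG {
        state BACKUP
        interface eth0
        virtual_router_id 51
        priority 100
        track_script {
            chk_patroni
        }
        virtual_ipaddress {
            10.0.0.50/24
        }
    }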
An HAProxy frontend that health-checks Patroni’s REST API works in any network, including public cloud. HAProxy queries /primary on each backend and routes only to the node that returns 200. The tradeoff is one extra network hop and one more component to monitor.
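A minimal sketch of that frontend, in the spirit of the example that ships with the Patroni documentation; the bind port and node names are placeholders.

    listen postgres_write
        bind *:5000
        mode tcp
        option httpchk GET /primary
        http-check expect status 200
        default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
        server node1 node1:5432 maxconn 100 check port 8008
        server node2 node2:5432 maxconn 100 check port 8008
        server node3 node3:5432 maxconn 100 check port 8008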
A multi-host connection string is the lightest-weight option. Setting host=node1,node2,node3 target_session_attrs=read-write lets a libpq-based driver find the primary on its own; it works for psql, psycopg, and most modern Postgres clients, and the JDBC driver offers an equivalent. The tradeoff is that some older application stacks and ORMs only ever use the first host in the list and ignore the rest.
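The same idea spelled out; host, database, and user names below are placeholders.

    # libpq-style DSN: the driver walks the list until it finds a node that accepts writes.
    psql "host=node1,node2,node3 port=5432 dbname=app user=app target_session_attrs=read-write"

    # JDBC equivalent for applications on the PostgreSQL JDBC driver.
    jdbc:postgresql://node1:5432,node2:5432,node3:5432/app?targetServerType=primary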
Pick whichever fits your network and your driver. The point is that one of them is in place before the first failover, rather than added afterward.
Bringing the old primary back
When a primary fails over, the old primary often has WAL the new primary never saw. Bringing it back as a standby requires either pg_rewind or a full base backup rebuild.
Two Patroni settings carry most of the load here. Setting use_pg_rewind: true tells Patroni to attempt a rewind when a former primary tries to rejoin. Setting remove_data_directory_on_rewind_failure: true tells Patroni to fall back to a clean rebuild from the new primary if the rewind cannot complete.
The combination of those two is what makes the rejoin truly hands-off. With both in place, the cluster heals on its own, and the upper bound is that a 2TB rebuild runs for a few hours in the background while everything else stays online.
pg_rewind itself depends on wal_log_hints = on or data checksums being enabled, and both are easiest to set at the start of the cluster’s life. Adding wal_log_hints later requires a restart; enabling checksums later means taking the cluster down while pg_checksums rewrites every data file.
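Put together, the rejoin settings and their prerequisites look roughly like this in the Patroni YAML; treat it as a sketch of the relevant keys rather than a complete configuration.

    bootstrap:
      initdb:
        - data-checksums          # enables checksums at cluster creation; pg_rewind needs this or wal_log_hints
    postgresql:
      use_pg_rewind: true
      remove_data_directory_on_rewind_failure: true
      parameters:
        wal_log_hints: "on"       # the alternative prerequisite; needs a restart if added later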
Keeping replicas in sync
A standby that drifts behind the primary will eventually need WAL the primary has already recycled. Replication slots prevent that recycling at the cost of disk on the primary.
Slots are the right answer for almost every cluster. The piece teams miss is the monitoring around them. A single stalled standby whose slot keeps retaining WAL can fill a primary’s disk in hours and take the whole cluster offline. Watch pg_replication_slots for slots whose restart_lsn stops advancing, watch the size of the pg_wal directory for unexpected growth, and set a disk alert that fires while there is still room to react and rebuild the lagging replica.
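A query along these lines covers both checks; the thresholds you alert on are yours to pick.

    -- WAL retained on the primary by each slot; a large, growing number means a stalled consumer.
    SELECT slot_name,
           active,
           pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots
    ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;

    -- Approximate size of the pg_wal directory itself.
    SELECT pg_size_pretty(sum(size)) AS wal_dir_size FROM pg_ls_waldir();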
WAL archiving to object storage, paired with slots, gives you a second source of truth. If a slot needs to be dropped to save the primary, the standby can catch up from the archive instead of needing a fresh base backup. The combination is especially useful for clusters with cross-region replicas or long replication paths.
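What that looks like in postgresql.conf depends on the archiver; as one example, assuming wal-g pointed at a bucket you have already configured:

    # On the primary: push every completed WAL segment to object storage.
    archive_mode = on
    archive_command = 'wal-g wal-push %p'

    # On a standby that lost its slot: pull the missing segments back from the archive.
    restore_command = 'wal-g wal-fetch %f %p'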
Teaching the application to reconnect
Drivers handle connection failures differently. The libpq family, with a multi-host connection string, will retry across the host list and find the new primary on its own. The JDBC driver behaves similarly when configured with targetServerType=primary. Many ORM connection pools, however, hold onto cached connections and only discover they are broken on the next query attempt.
Two habits make this layer reliable. Pool eviction policies should be short enough that a stale connection gets discarded inside a few seconds rather than minutes. Application code should retry idempotent writes with backoff. The combination means that a fifteen-second failover surfaces as a brief slowdown rather than a wave of errored requests.
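A minimal sketch of the retry habit, using psycopg 3 and the multi-host DSN from earlier; the DSN, table, and limits are placeholders, and it assumes the write is safe to repeat.

    import time
    import psycopg

    DSN = ("host=node1,node2,node3 port=5432 dbname=app user=app "
           "target_session_attrs=read-write connect_timeout=3")

    def write_with_retry(sql, params, attempts=5, base_delay=0.5):
        """Retry an idempotent write with exponential backoff across a failover."""
        for attempt in range(attempts):
            try:
                with psycopg.connect(DSN) as conn:   # commits and closes on clean exit
                    conn.execute(sql, params)
                    return
            except psycopg.OperationalError:
                if attempt == attempts - 1:
                    raise
                time.sleep(base_delay * 2 ** attempt)   # back off while the new primary takes over

    write_with_retry("INSERT INTO events (payload) VALUES (%s)", ("failover-drill",))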
The application reconnect path is the layer that gets the least attention and surfaces the most surprises during a real failover. Test it under load, with a real workload, before you trust it.
The test that proves HA
A controlled switchover in a maintenance window exercises the leader election and promotion mechanics. The routing, application reconnect, and rejoin paths stay untested until you run an unannounced failover under real load.
That second test is uncomfortable on purpose. Run a steady write workload against the application during business hours. Pull the network on the primary node, or kill the Postgres process, or stop the VM. Time how long the application takes to resume writes. Count the writes that returned an error during the gap. Walk over to the demoted primary and confirm it rejoined the cluster on its own.
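One way to run it, sketched below; the host names, database, and duration are placeholders, and the failure you inject should match the failure you actually fear.

    # Steady write load through the normal application path for ten minutes
    # (assumes pgbench -i was run against appdb beforehand).
    pgbench -h app-endpoint -U app -c 8 -j 2 -T 600 appdb &

    # On the current primary, inject a hard failure (pick one):
    #   sudo ip link set eth0 down                                          # pull the network
    #   sudo systemctl kill -s SIGKILL patroni && sudo pkill -9 postgres    # kill the processes
    #   power the VM off from the hypervisor

    # Afterwards, confirm the old primary came back as a replica with no human input.
    patronictl -c /etc/patroni.yml list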
If the application recovered inside your RTO target, the writes that errored were inside your RPO target, and the old primary is back as a healthy standby with no human input, you have HA.
If any of those three things needed a person, you have a leader election and a runbook. Both are useful tools. Both become HA when paired with the routing, rejoin, and reconnect work covered above.
Where I see this pattern decide the outcome
Across production engagements over the last few months, I have seen the same scenario play out for very different teams. A 5TB Patroni cluster on cloud infrastructure handled a node loss exactly as designed, and the application stayed offline for nineteen minutes while a human edited a connection string. A second cluster, on bare metal with a VIP in front, handled the same scenario in twelve seconds. Same database. Same Patroni. Different second half.
The teams that get the second half right tend to share three habits. They treat the routing layer as part of the cluster from day one, rather than a decision to make later. They run unannounced failover tests on a schedule. They keep the runbook short, because they have already built the second half that would otherwise need it.
Patroni is a fine tool for the job it does. It is, on its own, a leader election engine. The HA platform is what you build around it, and the work pays off the first time a primary loses power at 2am and you find out about it from the morning summary.

