High availability for PostgreSQL is often treated as a single, big, dramatic decision: “Are we doing HA or not?”
That framing pushes teams into two extremes:
- a “hero architecture” that costs a lot and still feels tense to operate, or
- a minimalistic architecture that everyone hopes will just keep running.
A calmer way to design this is to treat HA and DR as layers. You start with a baseline, then add specific capabilities only when your RPO/RTO and budget justify them.
Let us walk through the layers from “single primary” to “multi-site DR posture”.
Start with outcomes
Before topology, align on three things:
1. Failure scope
- A database host fails
- A zone or data center goes away
- A full region outage happens
- Human error
2. RPO (Recovery Point Objective)
- We can tolerate up to 15 minutes of data loss
- We want close to zero
3. RTO (Recovery Time Objective)
- We can be back in 30 minutes
- We want service back in under 2 minutes
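As a rough sanity check, worst-case RPO follows directly from how often data leaves the primary. A minimal sketch of that arithmetic (the figures are illustrative assumptions, not benchmarks):

```python
# Rough worst-case RPO estimate: without WAL archiving you can lose
# everything since the last backup; with archiving, roughly everything
# since the last archived WAL segment.
def worst_case_rpo_minutes(backup_interval_min, wal_archive_interval_min=None):
    if wal_archive_interval_min is None:
        return backup_interval_min
    return wal_archive_interval_min

print(worst_case_rpo_minutes(24 * 60))     # nightly backups only -> 1440
print(worst_case_rpo_minutes(24 * 60, 5))  # plus ~5-minute WAL archiving -> 5
```

Seeing the numbers side by side is often what convinces a team which layer they actually need.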
Here is my stance (and it saves money!): You get strong availability outcomes by layering in the right order.
Layer 0 – Single primary (baseline, no backups)
This is the baseline: one PostgreSQL primary in one site. All reads and writes go to it.
That is it. No replicas. No archiving. No backup flow in this model.
What you get:
- simplicity
- low cost
- low operational overhead
What it means operationally:
- Your “recovery plan” is effectively “rebuild and rehydrate from wherever you can” (which might be infrastructure snapshots, application-level rebuilds, or other ad hoc processes depending on your environment).
- Your availability depends heavily on the stability of the underlying host, storage, and platform.
If you are running Layer 0, the best mindset is: keep it stable and observable.
- solid monitoring (latency, errors, saturation)
- sane maintenance (bloat, stats, connection hygiene)
- predictable change management
Layer 0 is not a “bad” architecture. It is simply the baseline. The moment you want a reliable recovery posture, you move to Layer 1.
Layer 1 – Add offsite backups (your first real safety net)
Layer 1 keeps the same single primary in Site A, and adds backup storage in Site B.
This model introduces a defined recovery path.
What you gain:
- You can lose the primary server and still recover your data.
- You can meet an RPO that is “last successful backup” (which is often perfectly acceptable for many systems).
Practical ways teams implement this:
- pgBackRest or Barman sending backups to object storage (often in another region/account)
- retention policies that reflect compliance and business needs
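One common shape for this, sketched with pgBackRest (the bucket, region, paths, and stanza name here are assumptions you would replace with your own):

```ini
# /etc/pgbackrest/pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=acme-pg-backups
repo1-s3-endpoint=s3.eu-west-1.amazonaws.com
repo1-s3-region=eu-west-1
repo1-path=/backups
repo1-retention-full=2     ; keep two full backups plus their dependents

[main]
pg1-path=/var/lib/postgresql/17/main
```

A scheduled `pgbackrest --stanza=main --type=full backup` (with differentials in between) then lands backups in the remote bucket, and the retention setting keeps the set bounded.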
An important point here: a backup is only as good as its restorability. If you can’t restore a backup, there is no point in taking one. Best practice is to run periodic drills that test the restore procedure, measure how long it takes, and verify the data it produces.
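A restore drill can be as small as restoring the latest backup onto a scratch host and checking what comes back. A sketch with pgBackRest (paths and stanza name are assumptions):

```shell
# On a drill host: restore the latest backup into a scratch data directory
pgbackrest --stanza=main --pg1-path=/var/lib/postgresql/17/drill restore

# Start Postgres against the restored directory, then verify:
# row counts, checksums, or application-level invariants against expectations.
# Record how long the whole drill took -- that number is your practical RTO.
```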
Layer 2 – Add WAL archiving (PITR-ready recovery)
Layer 2 builds on Layer 1 by adding WAL archiving from Site A to Site B.
This is where recovery becomes precise and continuous.
Backups alone restore you to “the last backup.” WAL archiving lets you restore to a point in time.
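With a backup tool like pgBackRest already in place, enabling archiving is a few settings on the primary (the stanza name is an assumption):

```ini
# postgresql.conf on the primary
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 5min   # force a segment switch on quiet systems, bounding RPO
```

From then on, every completed WAL segment is shipped to Site B, and your recoverable history becomes continuous rather than backup-to-backup.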
What you gain:
- PITR (Point-in-Time Recovery)
- Tighter RPO
- A clean response to human error
The habit that makes this layer valuable:
- restore drills
- timed drills
- runbooks that a tired engineer can follow at 2 AM
Layer 2 is one of the highest-ROI layers in the entire model because it turns recovery into a controlled process rather than improvisation.
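For concreteness, a PITR run with pgBackRest might look like the following sketch (the timestamp and stanza name are examples; stop Postgres before restoring):

```shell
# Restore up to just before the bad deployment or accidental DELETE
pgbackrest --stanza=main --delta \
    --type=time "--target=2025-03-01 14:30:00+00" \
    --target-action=promote restore

# Start Postgres: it replays archived WAL up to the target time, then promotes.
```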
Layer 3 – Add a hot standby
Layer 3 keeps backups + WAL archiving, and adds a hot standby in Site A (often in a different zone or DC).
Primary → standby uses asynchronous streaming replication.
What you gain:
- much faster RTO (fail over to the standby instead of rebuilding)
- the option for load balancing (route read queries to the standby)
- planned switchovers for maintenance that do not disrupt operations
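Seeding the standby is typically a single `pg_basebackup` run. A sketch (hostname, user, slot name, and paths are assumptions):

```shell
pg_basebackup -h primary.db.internal -U replicator \
    -D /var/lib/postgresql/17/main -R -X stream -C -S standby1
# -R writes standby.signal and primary_conninfo for you
# -X stream streams WAL during the copy so the standby starts consistent
# -C -S standby1 creates a replication slot so the primary retains WAL
#    while the standby catches up (remember to monitor slot retention)
```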
Additional monitoring requirements:
- replication lag
- WAL generation rate
- standby replay delay
- failover readiness
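Most of that monitoring can be fed from two queries. A sketch using the standard statistics views:

```sql
-- On the primary: per-standby state and byte lag
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On the standby: replay delay in wall-clock terms
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```

Alerting on both byte lag and wall-clock delay catches the two distinct failure modes: a flooded standby and a stalled one.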
This is also where teams choose between:
- disciplined manual failover
- automated failover using an HA manager (such as Patroni or repmgr)
Either path works when it is tested and documented.
Layer 4 – Add synchronous replication
Layer 4 is where teams typically run a primary and multiple standbys, using:
- synchronous replication for stronger data guarantees, and
- asynchronous replication for flexibility and additional redundancy.
What you gain:
- near-zero data loss for transactions protected by synchronous commit
What you accept:
- added write latency
- more explicit failure handling
An important part of the policy:
- When the synchronous standby is unavailable, do you allow commits to continue (falling back to asynchronous behavior), or do you block writes until a synchronous standby returns?
Teams that decide this up front operate Layer 4 calmly. Teams that leave it implicit tend to discover their “real” policy during an incident.
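The core of the setup is one line on the primary. A sketch (the standby names are examples):

```ini
# postgresql.conf on the primary
synchronous_standby_names = 'ANY 1 (standby1, standby2)'
synchronous_commit = on   # or remote_apply for read-your-writes on standbys
```

Note the default behavior this implies: if no listed standby is connected, commits wait. That is the "block writes until sync returns" policy by default, so if you prefer continued writes during standby outages, you need explicit tooling or runbook steps to clear `synchronous_standby_names`.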
Layer 5 – Add a warm standby in Site B
Layer 5 is where you treat a second site as a true recovery location, adding regional redundancy.
You keep your HA setup in Site A and maintain a warm standby in Site B, fed by backups and WAL archives that are continuously applied to the standby node.
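A warm standby of this kind is essentially a node in permanent recovery, pulling from the archive. A sketch (stanza name is an assumption):

```ini
# postgresql.conf on the Site B standby
restore_command = 'pgbackrest --stanza=main archive-get %f "%p"'
```

Together with an empty `standby.signal` file in the data directory, the node continuously replays archived WAL and can be promoted with `pg_ctl promote` when Site A is lost.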
What you gain:
- a cleaner plan for site-level outages
- a faster recovery path to Site B, reducing RTO
This layer also forces a useful reality check: DR is not only a database design. You also want:
- routing (DNS/LB) that can switch cleanly
- application configuration that supports failover
- secrets and access that work in the DR site
- rehearsed runbooks
When those pieces are ready, Layer 5 feels like a controlled switchover instead of a high-stress scramble.
Common gotchas that show up in production
These are the ones I see repeatedly:
- Backups exist; restore is untested. At best, this is Schrödinger’s backup – and you will only find out during an outage.
- WAL archiving is configured but not monitored. You want to make sure the consumer is consuming the files, so they don’t pile up on the producer.
- Replication slots retain WAL longer than expected. This needs to be monitored, and you need to ask ‘why’.
- Synchronous replication without a clear failure policy. Write the rule down, test it, and make it visible to the on-call team.
- Read traffic routed to standbys without thinking about staleness. Replica reads are great when you choose the right queries and accept the consistency model.
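The slot-retention gotcha above is easy to check. A query sketch against the standard catalog view:

```sql
-- How much WAL each replication slot is pinning on the primary
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn))
           AS retained_wal
FROM pg_replication_slots;
```

An inactive slot with growing `retained_wal` is the classic disk-filler; `max_slot_wal_keep_size` (PostgreSQL 13+) can cap the damage, at the cost of invalidating slots that fall too far behind.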

