
Achieving High Availability in PostgreSQL: From 90% to 99.999%

When you are running mission-critical applications, like online banking, healthcare systems, or global e-commerce platforms, every second of downtime can cost millions and damage your business reputation. That’s why many customers aim for four-nines (99.99%) or five-nines (99.999%) availability for their applications.

In this post, we will walk through what those nines really mean and, more importantly, which PostgreSQL cluster setup will get you there.

What Do “Nines” Mean?

The number of nines describes how much of the year your system stays up (the short calculation sketch after this list reproduces these numbers):

  • 90% (1 nine) → ~328 days up; 36½ days down
  • 99% (2 nines) → ~361 days up; 3½ days down
  • 99.9% (3 nines) → ~365 days up; 8¾ hours down
  • 99.99% (4 nines) → ~365 days up; 53 minutes down
  • 99.999% (5 nines) → ~365 days up; ~5¼ minutes down
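
If you want to sanity-check these budgets yourself, here is a minimal Python sketch (nothing PostgreSQL-specific) that converts an availability percentage into its yearly downtime allowance:

    # Convert an availability percentage into the yearly downtime budget.
    MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

    for availability in (90.0, 99.0, 99.9, 99.99, 99.999):
        downtime_minutes = MINUTES_PER_YEAR * (1 - availability / 100)
        if downtime_minutes >= 24 * 60:
            budget = f"{downtime_minutes / (24 * 60):.1f} days"
        elif downtime_minutes >= 60:
            budget = f"{downtime_minutes / 60:.1f} hours"
        else:
            budget = f"{downtime_minutes:.1f} minutes"
        print(f"{availability}% uptime -> about {budget} of downtime per year")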

Every extra nine gives you more uptime, but it also adds complexity to the setup. In the end, it’s a trade-off: either you deal with downtime later or put in the effort upfront to build a resilient system.

Mapping PostgreSQL Clusters to High Availability Levels

In this section, we will explore how different PostgreSQL cluster architectures map to various levels of high availability (HA), starting from a basic single-node setup offering around 90% uptime, all the way to advanced bi-directional replication setups capable of delivering 99.999% uptime.

Note: The uptime mentioned for each cluster assumes that everything operates as expected, without any unexpected failures or disruptions.

Single Instance

What it is: One PostgreSQL server doing all the work.
Uptime you get: Roughly 90–98% per year
When Disaster Strikes: If that single server fails due to hardware issues or accidental deletion of the data directory, all your data/applications can become unavailable.

Why is it tricky?

  • Since there’s only a single server, if it crashes, we will need to provision a new instance, install PostgreSQL, and restore the most recent backup, assuming one is available.
  • If WAL (Write-Ahead Log) files aren’t being archived to a backup location, there’s a risk of data loss, which means the RPO (Recovery Point Objective) could be quite high (a small archiving check is sketched after this list).
  • Similarly, the RTO (Recovery Time Objective) is also on the higher side, as the entire system needs to be rebuilt and brought back online from scratch.
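
Because a stalled archive_command silently inflates your RPO, it is worth watching pg_stat_archiver. Below is a hedged sketch using psycopg2; the connection string and the 5-minute threshold are assumptions you would adapt to your own environment:

    # Warn if WAL archiving looks stalled or failing (psycopg2 assumed installed).
    import psycopg2

    ALERT_AFTER_SECONDS = 300  # hypothetical threshold; tune to your RPO target

    conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")  # placeholder DSN
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT archived_count,
                   failed_count,
                   EXTRACT(EPOCH FROM (now() - last_archived_time)) AS seconds_since_last
            FROM pg_stat_archiver
        """)
        archived, failed, seconds_since_last = cur.fetchone()
        if failed > 0 or seconds_since_last is None or seconds_since_last > ALERT_AFTER_SECONDS:
            print("WARNING: WAL archiving looks unhealthy, so the RPO is growing")
        else:
            print(f"Archiving OK: {archived} segments archived, last one {seconds_since_last:.0f}s ago")
    conn.close()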

Recovery Window

  • If recovery takes longer than 36½ days, you can’t even claim 90% availability, since that level allows at most 36½ days of total downtime per year.
  • Even if you are fast, manual recovery risks human error and extended downtime.

Note: While it’s technically possible to achieve 98% uptime with a single node by applying the right measures, it’s a tricky and fragile setup. Events like hardware failure, manual errors (e.g., accidental deletion of the data directory), or OS-level crashes can significantly extend recovery time. As a result, you are always at risk of falling below the 98% uptime threshold, and any such incident can have a noticeable impact on availability.

Primary + Standby (Manual Failover)

What it is: One live primary server plus one warm standby, created with pg_basebackup and kept in sync via streaming replication.
Uptime you get: About 99–99.5% per year
When Disaster Strikes:

  • On primary node failure, monitoring alerts the ops team.
  • A human promotes the standby to primary (a promotion sketch follows this list).
  • The application is reconfigured to point to the new primary.
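
For illustration, the promotion step itself can be as small as a call to pg_promote() (available since PostgreSQL 12). This is a hedged sketch; the standby host name and credentials are placeholders, and in a real incident you would still have to repoint the application or its connection pooler afterwards:

    # Promote the warm standby after the primary has failed (psycopg2 assumed installed).
    import psycopg2

    STANDBY_DSN = "host=standby1.example.com dbname=postgres user=postgres"  # placeholder

    conn = psycopg2.connect(STANDBY_DSN)
    conn.autocommit = True
    with conn.cursor() as cur:
        # pg_promote() waits (up to 60 seconds by default) and returns true once promotion finishes.
        cur.execute("SELECT pg_promote()")
        promoted = cur.fetchone()[0]
        print("Standby promoted" if promoted else "Promotion did not complete in time")
    conn.close()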

Why is it tricky?

  • A human needs to be available 24/7 to manually promote the standby to primary and reconfigure the application to point to the new primary.
  • This setup also requires reliable monitoring and alerting systems to quickly detect failures and notify the team for timely action.

Recovery window:

Across a year, detecting the failure, generating alerts, getting someone involved to perform the manual failover, and redirecting the application to the new primary can easily add up to anywhere from about 1.8 days to as much as 3.6 days of downtime. This results in application availability between approximately 99% and 99.5%.

Pros & Cons:

  • RPO can be close to zero
  • RTO is manageable
  • Manual steps can be slow.

Primary + 2 Standbys (Automated Failover)

What it is: A 3-node setup with tools (Patroni, or Pgpool-II with a VIP manager) that automatically detect failures and promote a standby, which means no human effort is required.
Uptime you get: 99.9%–99.95% per year
When Disaster Strikes:

  • Patroni (with etcd) or Pgpool-II (with its watchdog) continuously monitors the primary node (a leader-check sketch follows this list).
  • If it goes down, the tool immediately promotes the standby.
  • A virtual IP reroutes application connections to the new leader.
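
As a quick illustration of how the current leader can be discovered, here is a hedged Python sketch against Patroni’s REST API (default port 8008), where GET /primary returns HTTP 200 only on the node currently holding the leader lock. The host names are placeholders:

    # Find the current Patroni leader by probing each node's REST API.
    import requests

    PATRONI_NODES = ["pg-node1", "pg-node2", "pg-node3"]  # placeholder host names

    leader = None
    for node in PATRONI_NODES:
        try:
            if requests.get(f"http://{node}:8008/primary", timeout=2).status_code == 200:
                leader = node
                break
        except requests.RequestException:
            continue  # node unreachable; keep checking the others

    print(f"Current leader: {leader}" if leader else "No leader found (failover may be in progress)")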

Recovery window:

Because manual intervention is eliminated, the yearly downtime budget shrinks to roughly 8.7 hours, or even to under 4.3 hours. The application remains operational during failover, keeping availability at 99.9%–99.95%.

Pros & Cons:

  • Almost invisible failover to end users.
  • No risk of late human intervention.
  • Still vulnerable if your entire zone or data center fails.

Multi-AZ / Multi-Zone Setup

What it is: Nodes spread across multiple availability zones (AZs) or data centers. Can be managed (e.g., AWS RDS Multi-AZ) or self-managed with Patroni or Pgpool + pg_basebackup.
Uptime you get: 99.99% (four nines).
When Disaster Strikes:

  • An entire AZ or data center goes offline.
  • Depending on the topology, human effort may be required to promote a standby in another zone or region.

Recovery window:

  • If the primary goes down inside the region, no action is needed, as another standby is promoted to be the new primary automatically (a client-side connection sketch follows this list).
  • However, if the entire region goes down, strong monitoring and alerting systems can help detect the issue, but human intervention would still be required to promote the warm standby in another region as the new primary. To meet a four-nines availability target, this failover must be completed within just 52 minutes.
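
On the client side, libpq’s multi-host connection strings (PostgreSQL 10+) let the application try each node and stick to whichever one currently accepts writes, which pairs nicely with nodes spread across zones. A hedged sketch with psycopg2, using placeholder host names:

    # Connect to whichever node across the AZs currently accepts writes.
    import psycopg2

    DSN = (
        "host=pg-az1.example.com,pg-az2.example.com,pg-az3.example.com "
        "port=5432,5432,5432 dbname=app user=app "
        "target_session_attrs=read-write connect_timeout=3"
    )

    conn = psycopg2.connect(DSN)
    with conn, conn.cursor() as cur:
        cur.execute("SELECT inet_server_addr(), pg_is_in_recovery()")
        addr, in_recovery = cur.fetchone()
        print(f"Connected to writable node {addr} (in recovery: {in_recovery})")
    conn.close()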

Pros & Cons:

  • Business operations survive regional outages.
  • Fully automated in many managed services.
  • More complex networking and cross-zone replication.
  • In some scenarios, human intervention can be eliminated entirely by spreading an odd number of nodes across availability zones so that consensus can always be reached, but this adds extra complexity to deploy and manage.

Multi-Region Active-Active

What it is: Multiple primary nodes running in different regions and kept in sync with each other. Because of the complexity of conflict resolution, latency challenges, and the risk of split-brain scenarios, this is generally only feasible with proprietary solutions like PGD (EDB Postgres Distributed).
Uptime you get: 99.999% (Five nines)
When Disaster Strikes:

  • If a primary node within a region goes down, there’s no need to worry; thanks to the proxy server, the application can continue writing to another primary node in the same region. 
  • Even if the entire region goes down, the application can continue writing to the disaster recovery (DR) region without interruption. While this may introduce some latency, it’s a far better trade-off than experiencing a complete outage (a simple fallback sketch follows this list).
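
Not PGD-specific, but as an application-side illustration of “write locally, fall back to the DR region”, here is a hedged sketch; the region endpoints are placeholders, and a real multi-master product would handle routing and conflict resolution for you:

    # Try the local region first; fall back to the DR region if it is unreachable.
    import psycopg2

    LOCAL_REGION_DSN = "host=pg-eu.example.com dbname=app user=app connect_timeout=3"  # placeholder
    DR_REGION_DSN = "host=pg-us.example.com dbname=app user=app connect_timeout=3"     # placeholder

    def connect_with_fallback():
        for dsn in (LOCAL_REGION_DSN, DR_REGION_DSN):
            try:
                return psycopg2.connect(dsn)
            except psycopg2.OperationalError:
                continue  # region unreachable; try the next one
        raise RuntimeError("No region reachable")

    conn = connect_with_fallback()
    print("Connected; writes continue even if the local region is down")
    conn.close()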

Pros & Cons:

  • We survive continent-scale disasters.
  • Writes are available locally in each region.
  • Because of the multi-master architecture, write latency is usually low, since each region writes locally.
  • High licensing and operational costs.
  • Very complex to set up and maintain.

Don’t Forget About the Single Point of Failure

A Single Point of Failure (SPOF) anywhere in your stack can wreck your availability claims. For instance, take a three-node Patroni cluster that relies on a single etcd/Consul node: if that one node dies, Patroni can no longer hold the leader lock, the primary is demoted, and the cluster stops accepting writes even though the database nodes themselves are fine. Rebuilding that node could take days, wiping out even your 90% or 99% availability target.
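
The quorum arithmetic below (a plain Python sketch, nothing cluster-specific) shows why a single DCS node is such a weak spot: with n voting members a majority is needed, so the cluster only tolerates (n - 1) // 2 failures.

    # How many DCS member failures a cluster survives at different sizes.
    for members in (1, 2, 3, 5):
        quorum = members // 2 + 1          # majority needed to keep the leader lock
        tolerated = members - quorum       # failures the cluster can absorb
        print(f"{members} DCS node(s): quorum = {quorum}, survives {tolerated} failure(s)")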

Wrapping Up

  • Decide your “nines” based on how long your business can tolerate downtime.
  • Match your architecture from single instance all the way to multi-region active-active.
  • Automate, monitor, and remove every SPOF you can find.
