Understanding Patroni Failovers

What is Patroni and How Does it Work

Patroni is the most popular high availability system for PostgreSQL today. It supports both Windows and Linux deployments and relies on a distributed consensus store to manage cluster state.

Patroni manages PostgreSQL instances: it expects to start and stop them and to run the PostgreSQL processes as the Patroni user. It uses a distributed key-value store for managing cluster state, and although most deployments use Etcd for this role, Consul and ZooKeeper are also supported. A particularly popular combination is Etcd, Patroni, and vip-manager, with vip-manager watching Etcd for Patroni state changes and applying virtual IP settings to the appropriate node.
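
To make the moving pieces concrete, here is a minimal sketch of a per-node Patroni configuration pointing at an Etcd cluster. The cluster name, node name, addresses, credentials, and data directory are placeholders, and many settings a production deployment would need are omitted:

  # /etc/patroni.yml (illustrative values only)
  scope: demo_cluster          # cluster name shared by all nodes
  name: node1                  # unique name of this node

  etcd3:
    hosts: 10.0.0.11:2379,10.0.0.12:2379,10.0.0.13:2379

  restapi:
    listen: 0.0.0.0:8008
    connect_address: 10.0.0.21:8008

  postgresql:
    listen: 0.0.0.0:5432
    connect_address: 10.0.0.21:5432
    data_dir: /var/lib/postgresql/17/main
    authentication:
      replication:
        username: replicator
        password: change-me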

Patroni is a heartbeat-based high availability system with ten-second heartbeats. When heartbeats are missed, nodes are removed from the cluster. They can then be bootstrapped back into the cluster automatically, though a number of settings may be required to make this possible (in particular, setting a PostgreSQL restore command and telling Patroni to use pg_rewind; caveats to the latter are discussed below).

In essence, Patroni sends out heartbeats every ten seconds. Each node does this by writing its state to the distributed key/value store (usually Etcd) and reading back the state of the rest of the cluster, which it can then act on. If a node fails to issue a heartbeat request, it is assumed to be dead and the cluster reacts appropriately.
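
The heartbeat cadence, and how long a missed heartbeat is tolerated, are controlled by the cluster's dynamic configuration. A sketch using Patroni's default values (these can be viewed and changed with patronictl edit-config):

  ttl: 30            # leader key expires after 30 seconds without a refresh
  loop_wait: 10      # seconds between heartbeat loops (the ten seconds above)
  retry_timeout: 10  # seconds to retry DCS and PostgreSQL operations before giving up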

In most cases, Patroni will be set up to use replication slots, which ensure that if a replica fails, it can be brought back to a fully caught-up state because the write-ahead log it still needs is retained on the primary. This is optional, and you can use a restore_command and an external archive to address the same problem instead.

Replication slots are only created when the replicas connect, so on very high-throughput systems there is a window during which they offer no protection, but more on this below.
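
Both behaviours are driven by configuration. A sketch of the relevant pieces, assuming slots are enabled in the dynamic (DCS) configuration and an external WAL archive whose path is a placeholder; exactly where restore_command is set can vary with your PostgreSQL and Patroni versions:

  postgresql:
    use_slots: true    # the default; the primary retains WAL for each connected replica
    parameters:
      restore_command: 'cp /mnt/wal_archive/%f "%p"'   # fetch missing WAL from an archive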

Patroni Switchovers

Patroni has the option of manual switchovers. These have the advantage of providing additional checks, for example ensuring that the chosen replica is caught up and has the most recent data, so that tools like pg_rewind are not required.

The command for a Patroni switchover is:

patronictl switchover

In this case, Patroni will make sure that every node is in a safe condition, then stop the primary, promote the replica, and attach the old primary as a replica so that it can continue to stream changes. This is done by reconfiguring PostgreSQL. The write-ahead log will adopt a new timeline, and replication will resume from the point of promotion.
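
Run on its own, the command prompts interactively for the cluster, the current leader, and the candidate; these can also be supplied up front. A sketch, in which the config path and node names are placeholders and option names may differ slightly between Patroni releases:

  patronictl -c /etc/patroni.yml switchover --leader node1 --candidate node2 --force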

Switchovers can also be scheduled. This is useful for orderly preparation for maintenance on the primary system. In this case, no failure is detected by the software (and no failure is present), so no failover occurs. Because every step is coordinated, switchovers are safe to perform during normal operation.
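
Scheduling is done with the --scheduled option, which takes a timestamp. A sketch; the timestamp and node name are only examples:

  patronictl -c /etc/patroni.yml switchover --candidate node2 --scheduled '2025-01-15T02:00:00+00:00'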

Patroni Failovers

In contrast to a switchover, a failover occurs when the Patroni primary fails to issue a heartbeat request. In this case, it is removed from the cluster, and the Patroni process is expected to shut down its PostgreSQL instance. If Patroni fails while PostgreSQL keeps running properly (which I have never seen happen, but it could), the PostgreSQL process might not be stopped. In that event there is a possibility of split-brain, but virtual IP addresses help protect against it, since it is unlikely that PostgreSQL would still be running while both Patroni and vip-manager are not.

A network failure, however, can produce exactly that situation. High availability is a hard topic, but the risk is mitigated by the fact that once the network issue is resolved, Etcd will reconcile the differences and vip-manager will converge to a consistent state with the rest of the cluster.
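
For reference, vip-manager itself is driven by a small configuration file that names the Etcd key to watch and the virtual IP to manage. This is a sketch only, with placeholder values; the key names follow vip-manager's sample configuration and should be checked against the version you have installed:

  # vip-manager.yml (illustrative values only)
  trigger-key: /service/demo_cluster/leader   # Patroni leader key in the DCS
  trigger-value: node1                        # this node's name
  ip: 10.0.0.100                              # the virtual IP to assign
  netmask: 24
  interface: eth0
  dcs-type: etcd
  dcs-endpoints:
    - http://10.0.0.11:2379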

In a failover the following steps occur in rough order:

  1. A heartbeat cycle is initiated.
  2. The primary fails to register a heartbeat. This can happen for any number of reasons.
  3. The cluster removes the primary, and the replicas hold an election to choose a new leader.
  4. The winner of this election promotes itself, and the remaining replicas are reconfigured to use it as their source.
  5. If vip-manager is used, the virtual IP is assigned to the new primary.
  6. On connection, each replica creates its own replication slot on the new primary.
  7. The old primary, if available, is evaluated and may be recovered if possible, given the existing options.
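
Once these steps complete, the state of the cluster can be checked from any node; patronictl prints each member's host, role, state, timeline, and replication lag. The config path here is a placeholder:

  patronictl -c /etc/patroni.yml list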

Recovering the Old Primary

If the replicas were fully caught up, the old primary is simply re-attached as a replica of the new primary. This can happen if, for example, the PostgreSQL host disappeared during a quiet period and autovacuum was not active at the time.

If the replicas were not fully caught up, then the old primary must be recovered via pg_rewind. Whether this step is automatic depends on settings, and different use cases suggest different approaches here.
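
The relevant knobs live in the node-level postgresql section of the Patroni configuration. A sketch; the two remove_data_directory_* options are optional fallbacks that tell Patroni to discard the data directory and re-clone the node when a rewind cannot be performed:

  postgresql:
    use_pg_rewind: true
    remove_data_directory_on_rewind_failure: true
    remove_data_directory_on_diverged_timelines: true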

Caveats to pg_rewind

The use of pg_rewind can (but does not necessarily) indicate a possibility of losing the most recent transactions. In most cases, if the replicas were well caught up, there may be some in-flight (uncommitted) transactions whose writes must be undone, or there could be autovacuum processes writing cleaned-up blocks whose writes must likewise be undone. Remember, to start replicating we have to have the same files.

The second major caveat to pg_rewind is that if PostgreSQL crashed because of, for example, underlying data storage failures, rewinding will not correct the problem. It would be better to determine the cause, replace the hardware, and then bootstrap a clean new instance of the PostgreSQL database, using Patroni to do this. Failing to diagnose the problem here could lead to further data loss later due to the underlying storage failures, so some people prefer to run pg_rewind manually after diagnosing the cause of the failover.
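
For those who take the manual route, a sketch of the invocation; the data directory, host, and connection details are placeholders, and the old primary must have been shut down cleanly first. The --dry-run option reports what would change without touching the files:

  pg_rewind \
    --target-pgdata=/var/lib/postgresql/17/main \
    --source-server='host=10.0.0.22 port=5432 user=postgres dbname=postgres' \
    --progress --dry-run

Drop --dry-run to perform the rewind, then start the node under Patroni so it rejoins as a replica.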

Replication Slot Edge Cases

Replication slots have a couple of edge cases in this setup which can be managed in a few different ways. Managing them is critical, and understanding the implications of using replication slots is vital.

PostgreSQL replication works by streaming write-ahead log records to the replicas in real time; the replicas replay these records against their own files in order to keep up. Periodically, when a checkpoint occurs, write-ahead log files that are no longer needed are removed.

Replication slots store state information about which write-ahead log records have been sent to each replica. This information is used to determine whether write-ahead log files can be safely deleted or not. If you have a replication slot that was active but is no longer advancing because its replica is not consuming WAL, then that WAL cannot be cleared and will build up on the server until it runs out of disk space, at which point PostgreSQL will crash.
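
A quick way to spot this is to check how much WAL each slot is holding back. A sketch to run on the primary; the query uses standard catalog columns, but thresholds and alerting are up to you:

  SELECT slot_name,
         active,
         pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
    FROM pg_replication_slots;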

During a checkpoint, a configurable amount of write-ahead log is retained. This is useful for preventing the files from being removed out from under replicas for which no replication slot exists yet. On high-traffic systems, you may need to tune wal_keep_size in order to improve your chances of being able to simply resume replication. Alternatively, using a restore_command and an external archive can allow nodes to retrieve missing write-ahead log segments from that archive.
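
Through Patroni this is just PostgreSQL parameter tuning. A sketch with illustrative values only; wal_keep_size and max_slot_wal_keep_size exist in PostgreSQL 13 and later (older releases use wal_keep_segments and have no per-slot cap):

  postgresql:
    parameters:
      wal_keep_size: '4GB'            # WAL kept regardless of slots
      max_slot_wal_keep_size: '50GB'  # cap on WAL a slot may retain, protecting the disk

Capping slot retention means a replica that falls too far behind can have its slot invalidated and will then need a restore_command or a re-clone to catch up.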

Concluding Thoughts

Patroni is a wonderful tool, but understanding the fundamentals is still important. Like any high availability system, it involves tradeoffs between considerations which must be taken into account; several of these are discussed above.

I hope you all have gained some insight into how Patroni handles failovers.
