July 2026 - Stormatics

Upgrading PostgreSQL 9.6 to 17 with pg_upgrade

When you are upgrading across major PostgreSQL versions, there are a few ways to go. Dump and restore is the simplest to reason about, but downtime scales directly with database size, so for anything multi-terabyte it is off the table. Logical replication gets you near-zero downtime, but it only works from PostgreSQL 10 onward; if your source cluster runs a version older than 10, that path is not available natively. That leaves pg_upgrade, the community-maintained tool for in-place major version upgrades. With the --link flag, it creates hard links instead of copying data files, so the upgrade step itself stays fast no matter how big the database is.
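As a rough illustration of the hard-link approach, the sketch below only assembles a pg_upgrade command line; it does not run anything. The function name and the directory paths are hypothetical and will differ per installation. The flags themselves (--old-bindir, --new-bindir, --old-datadir, --new-datadir, --link, --check) are standard pg_upgrade options.

```python
from pathlib import Path


def build_pg_upgrade_command(old_bin, new_bin, old_data, new_data, check_only=True):
    """Return the pg_upgrade argument list for a hard-link upgrade.

    All four paths are supplied by the caller; nothing here is executed.
    """
    cmd = [
        str(Path(new_bin) / "pg_upgrade"),  # pg_upgrade is run from the NEW version's binaries
        "--old-bindir", str(old_bin),
        "--new-bindir", str(new_bin),
        "--old-datadir", str(old_data),
        "--new-datadir", str(new_data),
        "--link",  # hard-link data files instead of copying them
    ]
    if check_only:
        # --check reports compatibility problems without modifying either cluster
        cmd.append("--check")
    return cmd


# Hypothetical paths, for illustration only.
example = build_pg_upgrade_command(
    old_bin="/usr/lib/postgresql/9.6/bin",
    new_bin="/usr/lib/postgresql/17/bin",
    old_data="/var/lib/postgresql/9.6/main",
    new_data="/var/lib/postgresql/17/main",
)
print(" ".join(example))
```

Running with check_only=True first and only then repeating the command without --check is a common precaution, since a hard-link upgrade leaves the old cluster unsafe to start once the new one has been started.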

Zero-Pain PostgreSQL DDL Migrations: Avoiding Locks and Long-running Queries in Production

Database migrations are a critical step in the lifecycle of any application, allowing teams to deploy new features, create and change database objects, and scale infrastructure. However, in high-volume, mission-critical environments, executing Data Definition Language (DDL) statements can quickly turn into a production nightmare. The primary culprit behind application downtime during DDL deployments is the mismanagement of PostgreSQL's locking mechanism. A single poorly planned ALTER TABLE statement can request an ACCESS EXCLUSIVE lock, which blocks incoming application queries, exhausts connection pools, causes cascading timeouts, and ultimately leads to revenue loss. To achieve true zero-downtime deployments, database migrations must be fast, defensive, and meticulously designed to avoid heavy locks and long-running queries.
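One defensive pattern that fits this description is to cap how long a DDL statement may wait for its lock and to retry if the cap is hit, so the statement never sits in the lock queue blocking everything behind it. The excerpt does not name this technique, so treat the sketch below as an assumption about what "defensive" could look like. The `execute` callable, the `LockNotAvailable` exception, and the function name are hypothetical stand-ins for whatever database driver is in use; `lock_timeout` is a real PostgreSQL setting.

```python
import time


class LockNotAvailable(Exception):
    """Hypothetical stand-in for the driver error raised when lock_timeout
    expires (PostgreSQL reports SQLSTATE 55P03, lock_not_available)."""


def run_ddl_with_lock_timeout(execute, ddl, lock_timeout="2s", attempts=5,
                              backoff_seconds=1.0, sleep=time.sleep):
    """Run one DDL statement, giving up quickly if the lock is not available.

    `execute` is any callable that sends a single SQL string to the server.
    Returns the attempt number that succeeded; re-raises after the last attempt.
    """
    for attempt in range(1, attempts + 1):
        try:
            execute("BEGIN")
            # SET LOCAL limits the setting to this transaction only.
            execute(f"SET LOCAL lock_timeout = '{lock_timeout}'")
            execute(ddl)
            execute("COMMIT")
            return attempt
        except LockNotAvailable:
            execute("ROLLBACK")
            if attempt == attempts:
                raise
            # Back off a little longer each time before trying again.
            sleep(backoff_seconds * attempt)
```

With a real driver, `execute` would wrap a cursor call and translate the driver's lock-timeout error into `LockNotAvailable`. The timeout and retry counts here are placeholders, not recommendations.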

PostgreSQL Disaster Recovery with pgBackRest TLS Transport

If you've read the pgBackRest DR guide on this blog, you already know the standard setup: two servers, passwordless SSH, pgBackRest pulling backups across the wire. It works reliably, it's what most teams run, and it suits small deployments well. The challenge emerges at scale: as the number of machines grows, managing individual key pairs, distributing them, rotating them, and auditing who has what becomes increasingly complex. SSH also supports host-based authentication, where host keys are used to authenticate connections in an Ident-like model, which simplifies certain setups. But enforced key rotation across a large fleet remains genuinely difficult. TLS, in contrast, relies on the X.509 public key infrastructure to manage and verify public keys. Rather than pre-sharing a key, its owner presents it embedded in a certificate that also states who the key belongs to, the validity period, and so forth. A certificate authority signs the certificate. The receiver only needs to know the certificate authority’s public key to verify the certificate and then decide whether to trust it. Because fewer keys have to be shared up front, management at scale becomes easier. That's exactly the problem pgBackRest's TLS server mode solves, although it adds a new layer of systems to manage: the certificate authorities themselves.
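The "receiver only needs the certificate authority's public key" idea can be sketched with Python's standard ssl module. This is generic TLS client setup, not pgBackRest configuration, and the function name and file parameters are hypothetical. The context trusts only the CA file it is given and, optionally, presents its own certificate so the server can verify the client in the same way.

```python
import ssl


def build_client_context(ca_file=None, cert_file=None, key_file=None):
    """Build a TLS client context that verifies peers against a single CA.

    ca_file:   PEM file containing the CA certificate to trust (hypothetical path)
    cert_file: this host's own certificate, signed by that CA (optional)
    key_file:  private key matching cert_file (optional)
    """
    # PROTOCOL_TLS_CLIENT enables certificate and hostname verification by default.
    context = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    if ca_file is not None:
        # Only certificates that chain to this CA will be accepted.
        context.load_verify_locations(cafile=ca_file)
    if cert_file is not None:
        # Present our own certificate so the server can authenticate us too.
        context.load_cert_chain(certfile=cert_file, keyfile=key_file)
    return context
```

Note what is absent: no per-host key is distributed in advance. Adding a new machine means issuing it a certificate from the CA, not updating every peer.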

Inside a PostgreSQL Checkpointer Bug: A Production Postmortem

One of our client’s PostgreSQL 16.8 production databases started logging what looked like a memory error:

ERROR: invalid memory alloc request size

The error immediately pointed toward two likely suspects:

- Memory exhaustion
- Memory corruption

As it turned out, neither was the culprit. Instead, it had encountered a known PostgreSQL bug that trapped the checkpointer in an infinite retry loop. The only way to recover was a forced restart, followed by an extended period of WAL replay during crash recovery. This article explains what happened, why manual checkpoints couldn't fix it, and how a PostgreSQL minor version upgrade permanently resolved the issue.

Understanding the purpose of a checkpoint

When a transaction modifies data, PostgreSQL does not immediately write the changed page to disk. Instead, it follows a two-step process:

- Write the change to the Write-Ahead Log (WAL) - a sequential, append-only record of every modification.
- Keep the modified page in shared memory as a dirty buffer until it is written later.

This design is intentional. WAL writes are sequential and therefore inexpensive, whereas writing data pages directly to their final location requires random disk I/O, which is much more costly. Decoupling these two operations is a fundamental part of PostgreSQL's I/O architecture.

Eventually, however, the dirty buffers in memory must be synchronized with the actual data files on disk. That is the job of a checkpoint. During a checkpoint, the checkpointer:

- Flushes every dirty buffer from shared memory to its corresponding data file.
- Calls fsync() on those files to ensure the data has reached durable storage rather than remaining in the operating system's cache.
- Records the checkpoint location in the WAL once all writes have been safely persisted.

This checkpoint record is critical for crash recovery. If PostgreSQL crashes, recovery only needs to replay WAL generated after the most recent completed checkpoint, because everything before that point has already been written safely to disk. Without checkpoints, PostgreSQL would have to replay the entire WAL history from the beginning, making recovery increasingly slow as WAL accumulates.

To keep track of which files still require an fsync() before a checkpoint can finish, the checkpointer maintains an internal structure called the fsync request queue. Every data file modified during checkpoint processing is added to this queue. As each file is successfully fsynced, its entry is removed. Under normal conditions, the queue drains steadily until the checkpoint completes. The problem begins when it doesn't.
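The queue behaviour described above can be pictured with a toy model. This is not PostgreSQL's implementation and does not reproduce the bug; the function name, the `fsync` callable, and the retry cap are all hypothetical. It only shows why a checkpoint cannot complete while any entry refuses to leave the queue.

```python
from collections import deque


def run_checkpoint(dirty_files, fsync, max_attempts_per_file=3):
    """Toy model of draining an fsync request queue.

    dirty_files: names of files that must be fsynced before the checkpoint completes
    fsync:       callable taking a file name and returning True on success
    Returns (completed, remaining_queue).
    """
    queue = deque(dirty_files)
    attempts = {}
    while queue:
        path = queue[0]
        attempts[path] = attempts.get(path, 0) + 1
        if fsync(path):
            # Success: the entry leaves the queue and the next file is processed.
            queue.popleft()
        elif attempts[path] >= max_attempts_per_file:
            # The toy model gives up here; a loop without such a cap would
            # retry forever and the checkpoint would never be recorded.
            return False, list(queue)
    # Queue fully drained: the checkpoint record could now be written.
    return True, []


# Every fsync succeeds, so the queue drains and the checkpoint completes.
print(run_checkpoint(["base/1", "base/2"], lambda path: True))
# One file never syncs, so it and everything queued behind it remain.
print(run_checkpoint(["base/1", "base/2", "base/3"], lambda path: path != "base/2"))
```

In the second call the checkpoint never completes, which in a real system means the recovery starting point stops advancing and the amount of WAL to replay after a crash keeps growing.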

