StormaticsStormatics

How To Run Logical Replication On A Production PostgreSQL Cluster

Upgrading PostgreSQL in production is not fun. You have real data, real users, and a version of Postgres that's probably several major releases behind. The obvious approach (dump, upgrade, restore) works fine on a 10GB database on a Saturday night. It doesn't work when you have TBs of data, applications that can't afford hours of downtime, and a team watching the monitoring dashboard at 2am. Logical replication solves this. I’ve run it in production, and it works well, but there are a few practical considerations that are easier to appreciate once you’ve been through the process yourself. Why Logical Replication You keep your source cluster running (old Postgres version, production traffic, and all), bring up a destination cluster with the new version, and replicate all data changes while both run in parallel. When the destination has caught up, you cut over. Downtime is seconds to minutes, not hours. Source is your existing cluster (say, Postgres 14). Destination is your new cluster running Postgres 17. The thing most guides skim over is that logical replication does not copy schema, it only copies data. PostgreSQL replicates INSERT, UPDATE, and DELETE operations, but not your table definitions, indexes, roles, or anything structural. An example would be if logical replication is setup then truncate operation on a single source table would also be replicated, but table1, table2, and table3, etc. won't be replicated. You have to handle that yourself before replication starts. Miss this, and your subscription will either fail silently or throw confusing errors that don't point you back to the actual problem. Step 1: The Primary Key Check Logical replication requires every replicated table to have a replica identity to replicate UPDATE and DELETE operations. By default, that's the primary key. Tables without one will still replicate INSERTs, but any UPDATE or DELETE will error on the publisher. Replica Identity A published table must have a replica identity configured so that UPDATE and DELETE operations can be replicated, because the replica identity is what allows the correct row to be located on the subscriber. By default, the replica identity is the primary key. A suitable unique index can also be used, subject to certain requirements. If no suitable key exists, the replica identity can be set to FULL, which uses the entire row as the key. Replica identity FULL should be treated as a fallback to be used only when no other option is available, because without a suitable index on the subscriber, the row search can be very inefficient. If a table without a replica identity is added to a publication that replicates UPDATE or DELETE operations, those operations will fail on the publisher. INSERT operations proceed regardless of the replica identity.
Read More

When Patroni Silently Deletes Your Replication Slots

If you have ever been in the middle of a database migration and suddenly found your logical replication slots missing, you know how unsettling that feeling is. No obvious error. No warning. Just gone. That is exactly what happened to us while migrating data from a Patroni-managed cluster to a standalone PostgreSQL cluster. What started as a smooth operation turned into a debugging session that taught us something important about how Patroni behaves after a restart.The SetupWe were running a data migration where the source was a Patroni cluster and the target was a another PostgreSQL cluster. Logical replication was our chosen approach, which meant we had created replication slots on the Patroni side to track the changes flowing out. Things were going well at the start. Data was moving, lag was low, and we felt good about the plan.Something Went WrongAt some point, the publisher node on the Patroni side got restarted. After the restart, our logical replication slots were simply gone. This was strange. These were not temporary slots. Temporary slots are expected to disappear after a session ends, but permanent slots are supposed to survive restarts. PostgreSQL itself has no reason to drop them.
Read More

Critical PMM Alerts Every PostgreSQL DBA Must Track

Have you ever left for home on a Friday evening feeling confident about your work for the day, at peace knowing your system would survive the coming weekend? We’ve all felt that way at some point. Meanwhile, the disk on the server had quietly reached 90% utilization. Write-Ahead Log (WAL) files had accumulated enormously, one long-running query had been running for over an hour, and nobody noticed because, some time earlier, the dashboard had looked fine.
Read More

You have a Patroni leader election. You are only halfway to PostgreSQL high availability.

A PostgreSQL primary loses power at 2am. Writes resume in under thirty seconds. The on-call engineer reads the alert in the morning, sees that the cluster healed itself, and goes back to coffee. That is the outcome PostgreSQL high availability is supposed to deliver.A working Patroni cluster, on its own, gets you partway there. The leader election runs. A standby gets promoted. The cluster state in etcd stays consistent. Then the application keeps trying to reach an IP address that points at the wrong node now, the old primary needs a manual rejoin, and the on-call engineer is on a conference bridge instead of in bed.
Read More

PostgreSQL High Availability on OCI: Why Your Failover Passes Every Test But Breaks in Production

If you have built PostgreSQL high availability clusters on AWS or Azure, you have probably gotten comfortable with how virtual IPs work. You assign a VIP, your failover tool moves it, and your application reconnects to the new primary. Clean. Simple. Done.Then you try the same thing on Oracle Cloud Infrastructure and something quietly goes wrong.The cluster promotes. Patroni (or repmgr, or whatever you are using) does its job. The standby becomes the new primary. But the VIP does not follow. Your application keeps sending traffic to the old node — the one that just failed. From the outside, it looks like the database is down. From the inside, everything looks green.
Read More

Thinking of PostgreSQL High Availability as Layers

High availability for PostgreSQL is often treated as a single, big, dramatic decision: “Are we doing HA or not?”That framing pushes teams into two extremes:- a “hero architecture” that costs a lot and still feels tense to operate, or - a minimalistic architecture that everyone hopes will just keep running.A calmer way to design this is to treat HA and DR as layers. You start with a baseline, then add specific capabilities only when your RPO/RTO and budget justify them.Let us walk through the layers from “single primary” to “multi-site DR posture”.Start with outcomesBefore topology, align on three things:1. Failure scope a. A database host fails b. A zone or data center goes away c. A full region outage happens d. Human error2. RPO (Recovery Point Objective) a. We can tolerate up to 15 minutes of data loss b. We want close to zero3. RTO (Recovery Time Objective) a. We can be back in 30 minutes b. We want service back in under 2 minutesHere is my stance (and it saves money!): You get strong availability outcomes by layering in the right order.
Read More

Unlocking High-Performance PostgreSQL: Key Memory Optimizations

PostgreSQL can scale extremely well in production, but many deployments run on conservative defaults that are safe yet far from optimal. The crux of performance optimization is to understand what each setting really controls, how settings interact under concurrency, and how to verify impact with real metrics.This guide walks through the two most important memory parameters : - shared_buffers - work_mem
Read More

The Road to Deploy a Production-Grade, Highly Available System with Open-Source Tools

Everyone wants high availability, and that’s completely understandable. When an app goes down, users get frustrated, business stops, and pressure builds.But here’s the challenge: high availability often feels like a big monster. Many people think, If I need to set up high availability, I must master every tool involved. And there’s another common belief too: Open-source tools are not enough for real HA, so I must buy paid tools.
Read More

Understanding Disaster Recovery in PostgreSQL

System outages, hardware failures, or accidental data loss can strike without warning. What determines whether operations resume smoothly or grind to a halt is the strength of the disaster recovery setup. PostgreSQL is built with powerful features that make reliable recovery possible. This post takes a closer look at how these components work together behind the scenes to protect data integrity, enable consistent restores, and ensure your database can recover from any failure scenario.
Read More

Understanding PostgreSQL WAL and optimizing it with a dedicated disk

If you manage a PostgreSQL database with heavy write activity, one of the most important components to understand is the Write-Ahead Log (WAL). WAL is the foundation of PostgreSQL’s durability and crash recovery as it records every change before it’s applied to the main data files. But because WAL writes are synchronous and frequent, they can also become a serious performance bottleneck when they share the same disk with regular data I/O.
Read More