StormaticsStormatics

When Patroni Silently Deletes Your Replication Slots

If you have ever been in the middle of a database migration and suddenly found your logical replication slots missing, you know how unsettling that feeling is. No obvious error. No warning. Just gone. That is exactly what happened to us while migrating data from a Patroni-managed cluster to a standalone PostgreSQL cluster. What started as a smooth operation turned into a debugging session that taught us something important about how Patroni behaves after a restart.The SetupWe were running a data migration where the source was a Patroni cluster and the target was a another PostgreSQL cluster. Logical replication was our chosen approach, which meant we had created replication slots on the Patroni side to track the changes flowing out. Things were going well at the start. Data was moving, lag was low, and we felt good about the plan.Something Went WrongAt some point, the publisher node on the Patroni side got restarted. After the restart, our logical replication slots were simply gone. This was strange. These were not temporary slots. Temporary slots are expected to disappear after a session ends, but permanent slots are supposed to survive restarts. PostgreSQL itself has no reason to drop them.
Read More

The Night Our Tables Wouldn’t Stop Growing

We were doing everything right. The migration plan was solid, the team was experienced, and we had done this kind of thing before. But somewhere around midnight, someone on the team noticed something strange. Tables on the destination side were ballooning at an unexpected rate with hundreds of gigabytes being used, while the source side tables sat quietly at just a few megabytes. Something was very wrong, and we had no idea what.
Read More

Before You Move a Single Row, Plan Your Cutover

You have deployed your new cluster. Now comes the work of moving your data and cutting over to it. Reading that sentence, you might assume cutover is something you figure out at the end, after the migration is done. And in practice, that is the order in which things happen. But technically, it is your cutover strategy that decides how you migrate, not the other way around. The strategy you pick determines how you configure replication, how many slots you provision, how you handle schema changes, and what your rollback path looks like. So before you touch replication, decide how you want to cut over. In this post, I will walk through the two most common cutover strategies, what each one costs you, and what each one gives you back. Approach 1: Cut Over One Database at a Time Imagine you are moving an entire office to a new location. One way to do it is to move one team at a time, finance this week, engineering next week. Each team settles in before the next one arrives. If something goes wrong with finance’s move, it doesn’t affect engineering. You fix the problem, learn from it, and continue. That’s exactly how this approach works with databases. You pick one database, migrate it, test it, cut it over, confirm everything is fine, and then move to the next. Why This Approach Makes Sense 1. It’s Easier to Manage When you are watching one database go through a cutover, you know exactly where to look if something breaks. Your team isn’t juggling ten things at once. Attention is focused, and problems surface quickly. 2. Issues Show Up Early The first database you cut over is like a fire drill. You discover what your runbook missed, what monitoring didn’t catch, what your rollback steps actually look like in practice, all with limited impact. By the time you reach database number five, your team is smooth and confident. 3. Replication Slots Don’t Pile Up During logical replication, the publisher (your old cluster) has to keep a replication slot open for each active subscriber. These slots hold WAL data, so the subscriber doesn’t miss anything. If you keep many databases in sync at the same time, those slots pile up, resulting in more disk usage, more memory pressure, and more risk. When you cut over one database at a time, you drop the slot as soon as that cutover is done. The publisher breathes easier. 4. Rolling Back Is Simple If something goes wrong after cutting over a single database, you only need to revert one database. Your reverse replication path is short and clean. You are not trying to undo a dozen cutovers at once. 5. CPU Stays Calm Managing replication for many databases at the same time puts a real load on the publisher. WAL senders, replication workers, and the constant work of tracking changes across many slots can spike your CPU at the worst possible moment, right when you need the system to be stable. One database at a time keeps the load predictable. 6. Downtime Can Be Just Seconds A well-prepared cutover for a single database is fast. You pause writes, confirm the subscriber has caught up (lag is zero), update your connection strings, and resume. 7. Less Room for Human Error Cutover is a high-stress, time-sensitive operation. The fewer things your team has to do at once, the fewer mistakes happen. One database at a time means one checklist, one confirmation, one rollback plan, not ten of them running in parallel. 8. The Migration Can Be Done in Hours or a Few Days Each database is self-contained. You don’t have to wait for every database in the cluster to be ready before you can start cutting over. Even if a database is huge, the migration window is limited to just that one.
Read More

Unused Indexes In PostgreSQL: Risks, Detection, And Safe Removal

Indexes exist to speed up data access. They allow PostgreSQL to avoid full table scans, significantly reducing query execution time for read-heavy workloads.From real production experience, we have observed that well-designed, targeted indexes can improve query performance by 5× or more, especially on large transactional tables. However, indexes are not free. And in this blog, we are going to discuss what issues unused indexes can cause and how to remove them from production systems with a rollback plan, safely
Read More

The Road to Deploy a Production-Grade, Highly Available System with Open-Source Tools

Everyone wants high availability, and that’s completely understandable. When an app goes down, users get frustrated, business stops, and pressure builds.But here’s the challenge: high availability often feels like a big monster. Many people think, If I need to set up high availability, I must master every tool involved. And there’s another common belief too: Open-source tools are not enough for real HA, so I must buy paid tools.
Read More

Scaling Up Wasn’t the Plan — Until It Was the Only Plan

If you have ever generated a complex report in Odoo only to watch the loading spinner for minutes, you are not alone. One of our customers ran into exactly this scenario: their system ground to a near stall whenever they tried to compile business reports. After a systematic investigation, we achieved a 93 % performance improvement, but only by choosing the last resort: upgrading the instance’s resources.
Read More

Don’t Skip ANALYZE: A Real-World PostgreSQL Story

Recently, we worked on a production PostgreSQL database where a customer reported that a specific SELECT query was performing extremely slowly. The issue was critical since this query was part of a daily business process that directly impacted their operations.
Read More

Cold, Warm, and Hot Standby in PostgreSQL: Key Differences

When working with customers, a common question we get is: “Which standby type is best for our HA needs?” Before answering, we ensure they fully understand the concepts behind each standby type and provide the necessary guidance A standby server is essentially a copy of your primary database that can take over if the primary fails. There are different types of standby setups, each with its own use cases, pros, and cons. In this blog, we will discuss the three types: Cold Standby, Warm Standby, and Hot Standby.
Read More

Achieving High Availability in PostgreSQL: From 90% to 99.999%

When you are running mission-critical applications, like online banking, healthcare systems, or global e-commerce platforms, every second of downtime can cost millions and damage your business reputation. That’s why many customers aim for four-nines (99.99%) or five-nines (99.999%) availability for their applications n this post, we will walk through what those nines really mean and, more importantly, which PostgreSQL cluster setup will get you there.
Read More