StormaticsStormatics

The Night Our Tables Wouldn’t Stop Growing

We were doing everything right. The migration plan was solid, the team was experienced, and we had done this kind of thing before. But somewhere around midnight, someone on the team noticed something strange. Tables on the destination side were ballooning at an unexpected rate with hundreds of gigabytes being used, while the source side tables sat quietly at just a few megabytes.

Something was very wrong, and we had no idea what.

How It Started

The customer had a PostgreSQL database that they wanted to migrate to a new server. The approach was a logical replication

The initial step in logical replication is the initial table copy. PostgreSQL copies all the existing rows from the source to the destination before replication kicks in and starts streaming live changes. For most tables, this is quick and uneventful.

But when we checked on things the next morning, the picture was not pretty.

What We Saw

On the source side, the tables looked completely normal. Some were 50 GB, some were 90 GB, and a few were just megabytes. Nothing unusual.

On the destination side, the subscriber had a few of those same tables that had grown to over 400 GB each. Overnight. Tables that were 50 to 90GB on the source were now hundreds of gigabytes on the destination, and they were still growing.

What Was Happening

After digging in, we found the culprit. It was statement_timeout.

statement_timeout is a PostgreSQL setting that tells the database, “If any query or operation runs longer than X amount of time, kill it.” It is a sensible safety net. You do not want runaway queries eating resources forever on a production system.

The problem was that on the publisher server, statement_timeout was set to a relatively low value(1 min). And the initial table copy during logical replication setup, which is essentially one big, long-running operation, kept hitting that timeout and getting terminated before it could finish.

Here is where it gets painful.

When PostgreSQL’s logical replication copy process gets interrupted, it does not just stop and wait for you to restart it manually. It automatically tries again. And again. And again. Every time it starts copying rows, it inserts them into the destination table. Every time statement_timeout kills it before it finishes, PostgreSQL treats those partially written rows as dead tuples that exist in the table but are no longer valid.

Dead tuples take up space. They do not disappear on their own immediately. Autovacuum cleans them up eventually, but it could not keep up with how fast they were being created because the copy kept restarting at full speed.

So we had this cycle running through the night:

  1. Replication starts copying rows into the destination table
  2. Publisher statement_timeout kicks in and kills the initial COPY after 1 min 
  3. The copied rows become dead tuples, bloating the table
  4. Replication restarts and begins copying again
  5. Repeat, forever

The table size on the destination kept climbing with no end in sight, because dead garbage was piling up faster than it could be cleaned.

The Fix

Once we understood what was happening, the fix was actually straightforward. We needed to tell PostgreSQL: “For the replication user specifically, do not apply statement_timeout.”

The key here is that we did not want to remove statement_timeout for everyone. That setting exists for a reason and protects the production system from badly behaved queries. We only wanted to lift it for the logical replication process.

PostgreSQL lets you do this at the role level. We ran:

ALTER ROLE replication_user SET statement_timeout = 0;

This sets statement_timeout to zero, meaning no timeout, but only for that specific replication role. Every other user on the system kept their timeout settings untouched. Production queries were still protected. Only the replication copy process was given the breathing room it needed to run to completion.

Once that was in place, the initial table copy ran without interruption. The destination tables stopped ballooning. Replication caught up cleanly.

The Takeaway

Two things that are individually fine, a reasonable statement_timeout and a logical replication initial copy, can quietly wreck each other when they meet. The timeout kills the copy. The copy restarts. The dead tuples pile up. And if no one is watching, this can run for hours.

The fix is one line of SQL. But you have to know how to look for it.

Leave A Comment