Long-running transactions, job queues, and the cascade that wreaks havoc

A scheduled PostgreSQL migration step held an open transaction snapshot for hours during the initial data copy. A job queue running at production write velocity began to slow down. Twelve hours later, the queue was seven million rows deep, the primary was pinned at 100 percent CPU across 24 cores, and customer support was fielding complaints about delayed transactions.

Staging had validated the migration plan. Every rehearsal was green. And yet here we were.

The team had the system stable that same day. A joint check the next morning confirmed it was operating cleanly. Three changes to our migration SOP have emerged from the retrospective. This post walks through what happened, why staging validated the procedure cleanly while the production failure mode sat outside its reach, and the changes we have made so this specific cascade stays out of future migrations.

The setup

The database in question was a production OLTP system at a significant scale. The largest table was append-only, sitting at around 11 TB and holding transactional records. The same database hosted a job queue processed via SELECT FOR UPDATE SKIP LOCKED, with worker concurrency tuned for production throughput.

The customer was migrating to a new version of PostgreSQL through logical replication. The initial data sync used COPY. The plan had been validated in staging across multiple rehearsal runs. The COPY completed cleanly there. Replication caught up. Resource footprint stayed within expectations. The plan was as ready as a plan gets.

The migration sequence began with the largest table first. Starting with the biggest workload gives the most runway for monitoring, validation, and adjustment before cutover. It is a reasonable choice for most migrations.

Level one of the cascade

A long-running COPY holds a transaction snapshot open for as long as the copy takes to complete. That part is well known. The next part is the one that hurts: PostgreSQL autovacuum holds tuples that are newer than the oldest snapshot anywhere in the cluster. The rule applies to every table on the cluster, not only to the one being copied.

On the job queue tables, the implication is direct. Every row the queue processes marks a tuple as dead. At production write velocity, those dead tuples accumulate by the millions. With autovacuum’s removal cutoff anchored to the COPY snapshot, the queue tables stayed bloated. Index scans began traversing increasing numbers of invalid heap pointers. Per-query latency climbed.

At around 4 million dead tuples in the active queue table, selection queries crossed the 1-second mark. At production cadence, the queue was filling faster than it could drain.

Our engineer caught it the next morning, when the database hit 100% CPU. The COPY was killed, the queue tables were manually vacuumed, and throughput recovered.

Level two of the cascade

Recovery lasted about 30 minutes, then throughput began dropping again. The new bottleneck was the multi-transaction SLRU.

The queue workload uses SELECT FOR UPDATE SKIP LOCKED. Once enough concurrent workers piled onto the same hot pages of the queue index, the multi-xact buffers became the contention point. Lightweight lock waits dominated the wait event profile. Vacuuming the tables again left throughput unchanged because the contention had moved into a lower layer of the system.

The fix at this level was structural. We changed the queue selection logic to begin index scans at randomized positions rather than always from the head of the index. That spread the workers across the keyspace and reduced concurrent access to the multi-xact code paths. Once the change was deployed, the database was no longer the bottleneck. The remaining backlog drained through the message broker downstream.

Level three of the cascade

Several hours later, the transaction age on the cluster crossed 195 million. The on-call team flagged it as a possible approach to transaction wraparound.

The reading was a healthy signal, not a danger one. PostgreSQL’s autovacuum_freeze_max_age default triggers preventive vacuuming at roughly 10 percent of the runway to wraparound, which lands in the 200 million range. That is the system behaving exactly as designed. The serious threshold sits closer to 1 billion. We confirmed with a transaction-age check, then ran an aggressive vacuum on two outbox tables to bring the cluster back within a comfortable range.

While investigating, one more contributor surfaced. A Patroni replica had fallen behind earlier in the incident and had been reseeded from a standby. The replication slot it left behind held a stale backend_xmin, which kept the vacuum cutoff anchored even after the COPY transaction was long gone. Recreating the slot released the cutoff, and the cluster fully stabilized.

Why staging validated the procedure cleanly

Here is the part that matters for anyone planning a similar migration. Staging validated the procedure correctly. The COPY ran. Replication caught up. Resource usage looked clean. Every rehearsal was green.

Staging operates at a fraction of production velocity. The job queue tables in staging carry a fraction of the production write rate. At the staging cadence, dead tuples accumulated during the COPY snapshot remain well below the threshold at which index scans become operationally significant. The mechanism existed in staging. The conditions that would trigger it remained outside its reach.

This is the structural lesson. Staging validates the procedure. Staging validates tooling. Staging validates the sequence of steps. For workloads where the dominant cost is rate-driven, the failure modes that matter most are the ones that surface at production velocity. Treating staging as a complete safety net for that class of workload is a habit worth retiring.

What changed in our SOP

After the system was stable, we ran a full retrospective. Three changes have come out of it.

Pre-migration V3 review on every database. V3 stands for Volume, Velocity, Variety. Volume is the data size, with attention to the largest tables individually. Velocity is the write throughput and queue cadence. Variety is the workload mix: job queues, long-running analytics, hot OLTP tables, ETL paths, anything that may interact with replication in less obvious ways. Any database that crosses thresholds on any of the three axes triggers an additional review layer before the first migration step runs.

Long-running transactions and dead-tuple monitoring are prerequisites. Before any migration touchpoint, we require alerts for transactions older than a configured threshold and for dead-tuple counts on critical tables. A standing query to inspect the current vacuum-removable cutoff and the transaction holding it is now part of the migration runbook. This is the single most useful monitoring tool during a migration window, and the cost of adding it is small.

Tailored migration plans for any database with a job queue. Job queues and long-running transactions interact in ways that compound quickly. For any database that hosts a queue, the migration plan now keeps snapshots short over the queue tables. For large append-only tables, a narrowly scoped subscription that copies new rows separately from the historical backfill keeps the snapshot window short.

Why we wrote this up

We share these retrospectives because the pain is portable. A long-running transaction blocking vacuum on tables it has nothing to do with is a pattern that surfaces in many production environments, not only during migrations. Long backups, replica lag, slow analytical queries, abandoned BEGIN statements in a connection pool, anything that holds an old snapshot does the same work. The job queue interaction described above will occur whenever the snapshot remains open long enough.

The most expensive PostgreSQL incidents tend to arise from interactions among components that are individually safe. A COPY is safe. A job queue is safe. The two together, at production velocity, constitute a different category of risk.