In July 2025, during the PG19-1 CommitFest, I reviewed a patch targeting the lack of parallelism when adding foreign keys in pg_restore. Around the same time, I was helping a client with a large production migration where pg_restore dragged on for more than 24 hours and crashed multiple times.
In this post, I will walk through the technical limitation in PostgreSQL, the proposed fix, and a practical workaround for surviving large restores.
Background
There are two main types of backups in PostgreSQL:
- Physical backups (e.g., pg_basebackup, file-level copies): these copy the database files as they exist on disk. Restoring is fast, but the backup can only be used on the same PostgreSQL version and often requires identical hardware or configuration.
- Logical backups (pg_dump and pg_restore): instead of copying raw files, PostgreSQL exports the database into SQL statements. These statements describe how to rebuild everything: the schema (CREATE TABLE, CREATE INDEX), the data (using COPY), and the constraints (ALTER TABLE … ADD FOREIGN KEY).
The trade-off is that a logical restore is not just a copy; it's a rebuild. PostgreSQL has to re-run all those CREATE, COPY, and ALTER statements as if the database were being created for the first time.
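For reference, the archive formats that pg_restore can work with are produced like this (my_db and my_backup.dump are placeholder names used throughout this post):

```
# Custom-format dump; required for pg_restore's -j and --section options
pg_dump -Fc -f my_backup.dump my_db
```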
The Core Problem
On small databases, logical backup and restore are quick, but on multi-terabyte systems with thousands of partitions the process can stretch for days. The data itself can often be restored quickly, but steps like validating foreign keys or building indexes take far longer. This is because when you run:
ALTER TABLE the_master_partitioned_table
ADD CONSTRAINT ...
FOREIGN KEY (columnX) REFERENCES another_table(columnX);
Under the hood, PostgreSQL has to scan all rows in each child partition and check that each value exists in the parent table (another_table). It does this with an anti-join query like:
SELECT fk."columnX" FROM ONLY "public"."table_partition_214" fk
LEFT OUTER JOIN ONLY "public"."another_table" pk
ON ( pk."columnX" OPERATOR(pg_catalog.=) fk."columnX")
WHERE pk."columnX" IS NULL AND (fk."columnX" IS NOT NULL)
This query finds any rows in the child partition table_partition_214 where columnX has a value that does not exist in the parent table another_table. If the join finds child rows with no matching parent row, the constraint fails.
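The semantics of that anti-join can be sketched in a few lines of Python (illustrative only; PostgreSQL does this with join plans inside the executor, not application code):

```python
def fk_violations(child_keys, parent_keys):
    """Return child key values that have no match in the parent.

    Mirrors the anti-join: NULL (None) child keys are skipped, since a
    NULL foreign key is allowed; any remaining unmatched value is a
    violation.
    """
    parent = set(parent_keys)
    return [k for k in child_keys if k is not None and k not in parent]

# A single violation (42) fails the whole ALTER TABLE ... ADD FOREIGN KEY.
print(fk_violations([1, 2, None, 42], [1, 2, 3]))  # [42]
```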
The problem is that pg_restore does not parallelize foreign key checks. Even if you specify multiple jobs (-j), PostgreSQL schedules foreign key validation sequentially. Why? There are two main causes:
- pg_restore uses a dependency graph to determine execution order. It’s overly cautious and assumes that all FKs on the same table must be run in order, which serializes them.
- On the backend, PostgreSQL does not parallelize FK validation per partition. Even if a table has thousands of partitions, it checks each one in turn.
On partitioned tables with hundreds or thousands of partitions, each constraint triggers sequential anti-joins on each child partition. This anti-join query can consume gigabytes of temporary files and run for 1–2 days per constraint. For a 10TB database with 1,000 partitions, foreign key checks alone can take days, making logical restores impractical without workarounds.
Underlying Architecture
When executing an anti-join query for a constraint check, the executor needs to scan the tables. If available, the planner uses an index on table_partition_214(columnX) and confirms the value of columnX is indeed not NULL. For each now-verified columnX value from the child, it must check for its existence in another_table. This often triggers a similar fetch-and-check cycle on the parent’s side.
If the value is not found in the parent (pk."columnX" IS NULL), it's a violation and is returned. We know that an index scan is faster than a sequential scan, and an index-only scan is faster still. So, can we force the use of an index-only scan to speed up the process here? The answer is yes, but first we need to understand the difference between an index scan and an index-only scan.
Understanding Index-Only Scans
A standard index scan is a two-step process:
- Index Lookup: The database traverses the index structure (e.g., a B-Tree) to find the Tuple ID (TID) for each row that matches the query predicate. The TID is a physical address, specifying the exact page and offset where the row resides.
- Heap Fetch: For each TID found, the database must access the corresponding heap page to retrieve the row data. This is necessary for two reasons:
- To get the values of columns not present in the index.
- To perform a visibility check, which determines if the row version is visible to the current transaction based on the MVCC rules stored in its header (xmin, xmax).
This heap fetch operation is a primary source of I/O overhead, especially for large datasets, as it involves random disk access.
An index-only scan aims to eliminate the costly heap fetch. If all columns required by a query are contained within an index, the engine can, in theory, answer the query by reading only the index, which is typically smaller and more cache-friendly than the main heap. However, there’s a fundamental catch: the index does not store visibility information. The engine cannot determine from the index alone whether a row is visible to the current transaction. Without this, an index-only scan could return rows from aborted transactions or rows that are not yet committed, violating core ACID principles.
Importance of Visibility Map in Query Performance
The visibility map solves this problem. It is a separate fork (physical file) for each table, acting as a bitmap where each bit represents a single heap page.
- A bit set to 1 indicates that every tuple on the corresponding heap page is known to be visible to all transactions. It means that VACUUM has examined the page and confirmed that no row has an xmin or xmax value that would require a visibility check for any future transaction.
- A bit set to 0 indicates that at least one tuple on the page may not be visible to all transactions. The engine cannot make any assumptions about the visibility of rows on this page.
With this map in place, the query executor’s logic for an index-only scan becomes efficient. The executor traverses the index and retrieves the TID for a matching row. It then examines the visibility map bit for the heap page referenced by the TID.
- If the bit is 1 (all-visible): The executor can skip the heap fetch entirely. It is guaranteed that the row is visible, so it can safely return the data directly from the index.
- If the bit is 0 (not all-visible): The executor must fall back to the standard method: it performs a heap fetch and a full visibility check to ensure correctness.
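The executor's decision can be sketched as a small simulation (the data structures here are hypothetical stand-ins; real PostgreSQL works on heap pages and TIDs in C):

```python
def index_only_scan(index_entries, vm_all_visible, heap):
    """Simulate an index-only scan over (value, (page, offset)) entries.

    vm_all_visible: dict page -> bool, the visibility map bit per heap page.
    heap: dict (page, offset) -> {"visible": bool}, the heap tuples.
    Returns the visible values and how many heap fetches were needed.
    """
    results, heap_fetches = [], 0
    for value, tid in index_entries:
        page = tid[0]
        if vm_all_visible.get(page, False):
            # Bit is 1: every tuple on the page is visible, so the value
            # can be returned straight from the index, no heap access.
            results.append(value)
        else:
            # Bit is 0: fall back to a heap fetch plus visibility check.
            heap_fetches += 1
            if heap[tid]["visible"]:
                results.append(value)
    return results, heap_fetches
```

With every visibility map bit set, the heap fetch count stays at zero, which is exactly the state a freshly loaded (and vacuumed) table should be in before constraint validation.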
The Role of VACUUM
The visibility map is maintained by the VACUUM process (both manual and autovacuum). When VACUUM processes a heap page, it does two things:
- It removes dead tuple versions.
- It assesses the page’s visibility status.
If it confirms that every living tuple on the page is definitely visible to all transactions, it sets the corresponding bit in the visibility map to 1.
DML operations (INSERT, UPDATE, DELETE) can invalidate the map’s state. An UPDATE creates a new row version, leaving the old one dead. An INSERT adds a new tuple. These actions on a previously “all-visible” page will cause its visibility map bit to be cleared to 0 until the next VACUUM runs and re-evaluates it.
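You can check how much of a table the visibility map currently covers by comparing relallvisible with relpages in pg_class (my_table is a placeholder name):

```
-- Fraction of heap pages marked all-visible; near 1.0 means
-- index-only scans can mostly skip the heap.
SELECT relname,
       relpages,
       relallvisible,
       round(relallvisible::numeric / greatest(relpages, 1), 2) AS vm_fraction
FROM pg_class
WHERE relname = 'my_table';
```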
Understanding the Restore Bottleneck
This brings us to the core performance bottleneck in pg_restore. During the restore process, we know that no other transactions are running in parallel; the only active transaction is the restore itself. If we could mark all rows in the heap table as all-visible, PostgreSQL would be able to use index-only scans for constraint checks and subsequent post-data operations wherever possible. This would significantly improve restore speed.
This idea is exactly what was proposed in the PostgreSQL patch. There are two possible approaches to achieve this, but let's first discuss the method suggested in the patch itself.
To leverage this insight, we must break the restore into its logical and dependent phases. The three phases are:
1. Create the schema: the first phase builds the skeleton by creating schemas, tables, etc. No data exists yet, but the structure is ready to receive it.
pg_restore --section=pre-data -d my_new_db my_backup.dump
2. Load all table data:
pg_restore --section=data -j n -d my_new_db my_backup.dump
With n parallel workers (-j n), PostgreSQL streams data into the tables via high-speed COPY commands. The database is now full but unoptimized and unvalidated.
3. Apply constraints:
pg_restore --section=post-data -d my_new_db my_backup.dump
Here, PostgreSQL builds secondary indexes and validates every foreign key constraint. It must prove that every key in every child table exists in the parent table.
In the second phase, the data loading is performed using the COPY command, for example:
COPY customers (id, name, email)
FROM '/tmp/customers.csv'
WITH (FORMAT csv, HEADER true);
The COPY operation inserts rows directly into the database's heap pages. This is highly optimized and very fast, but it does not update the visibility map. After COPY completes, the table is fully populated with data; however, the visibility map is still empty (all bits = 0) because VACUUM has not yet run. The visibility map is what allows PostgreSQL to safely skip heap fetches during index-only scans.
At this point, when PostgreSQL proceeds to the post-data phase (e.g., foreign key validation queries), it runs checks such as anti-joins against the newly loaded table. The planner sees that the required index does exist, but it also sees that the visibility map is not populated, meaning it cannot guarantee that all rows are visible. Therefore, it cannot use an index-only scan and falls back to a less efficient plan: an index scan plus a heap fetch for every row (to confirm visibility). This leads to extremely slow constraint validation and heavy use of temporary files (for sorting and join processing).
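You can observe this fallback yourself by running the validation query under EXPLAIN (table and column names follow the earlier example; the Heap Fetches figure shown in the comment is illustrative):

```
EXPLAIN (ANALYZE, BUFFERS)
SELECT fk."columnX"
FROM ONLY "public"."table_partition_214" fk
LEFT OUTER JOIN ONLY "public"."another_table" pk
  ON pk."columnX" = fk."columnX"
WHERE pk."columnX" IS NULL AND fk."columnX" IS NOT NULL;
-- A line such as "Heap Fetches: 8500000" under an Index Only Scan node
-- means the visibility map is empty and every index hit still pays
-- for a heap visit.
```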
The Future Fix: COPY FREEZE
One solution to this problem was submitted by Dimitrios Apostolou as a patch to the PostgreSQL 19 CommitFest. The idea is to add a new --freeze option to pg_restore, which leverages:
COPY table_name FROM STDIN WITH (FREEZE);
during restore. The FREEZE parameter resolves the issue by instructing COPY to pre-build the visibility map. It does this by writing each row with a transaction ID (xmin) marked as "frozen". In PostgreSQL's MVCC system, a frozen transaction ID is a special value treated as older than every normal transaction ID, so a frozen row is visible to every current and future transaction. By employing COPY FREEZE, we fundamentally alter the state of the newly loaded table: as pages are filled with frozen rows, the system can immediately and correctly mark them as all-visible in the visibility map.
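One caveat: FREEZE only works when the table was created or truncated in the same transaction that runs the COPY; otherwise PostgreSQL raises an error. A minimal sketch (my_table and the file path are placeholders):

```
BEGIN;
TRUNCATE my_table;  -- FREEZE requires a table created or truncated
                    -- in the current (sub)transaction
COPY my_table FROM '/tmp/my_table.csv' WITH (FORMAT csv, FREEZE);
COMMIT;
```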
There is no need to wait for a VACUUM process to later inspect the page and deduce this fact; the visibility map is built concurrently with the data load. In the next phase, when the foreign key validation query runs, the planner sees that the required columnX exists in the index.
- It reads each entry: the columnX value and the TID.
- For each entry, the executor checks the visibility map using the TID’s page number.
Seeing the page is all-visible, it skips the heap fetch entirely. It doesn’t need to check the row’s visibility because FREEZE guarantees it. It also doesn’t need to fetch the row to check columnX IS NOT NULL because that value is already right there in the index. It never touches the main heap data for the child table. This reduces the I/O requirement by orders of magnitude.
Practical Workaround: Strategic Use of VACUUM ANALYZE
Around the same time I was reviewing this patch, a client hit the exact same issue during a production migration using pg_dump/pg_restore. The dump was massive, around 11 TB, and the restore ran on a powerful machine with:
- 672 GB RAM
- 96 CPUs
The client first attempted pg_restore with 8 parallel workers. The data phase moved along steadily and the database grew to about 10 TB, but then progress stalled. pg_stat_activity showed it was stuck for hours on the ‘Add FK constraint’ step, and after nearly 24 hours with no movement, the process had to be terminated manually.
In the next attempt, additional resources were allocated. The number of parallel workers was increased to 48, WAL was disabled, autovacuum was turned off, and several memory-related parameters were tuned specifically for the restore. While the process initially appeared faster, after six hours memory usage reached 100% and the system crashed with an Out-of-Memory (OOM) error.
Next, memory related parameters were reduced in an attempt to prevent OOM failures. However, the same crashes continued to occur. At that stage, it became clear that the issue was not simply a lack of resources. The real bottleneck was in the post-data phase, specifically constraint validation.
This is where the patch discussion provided a useful insight. The problem is that constraint validation is inherently slow, because foreign key checks are not parallelized. Increasing the -j value or allocating more hardware resources won’t change this behavior.
With that in mind, I suggested a workaround to run a manual VACUUM ANALYZE between the data and post-data phases. The idea was that VACUUM would build the visibility map and update statistics, giving the planner enough metadata to use index-only scans during foreign key validation. This, in turn, would help reduce the time spent in the post-data phase.
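Concretely, the phased restore with the extra vacuum step looks like this (database and dump names follow the earlier examples; vacuumdb --jobs parallelizes the vacuum across tables):

```
pg_restore --section=pre-data -d my_new_db my_backup.dump
pg_restore --section=data -j 32 -d my_new_db my_backup.dump

# Build the visibility map and planner statistics before post-data.
vacuumdb --analyze --jobs=32 -d my_new_db

pg_restore --section=post-data -d my_new_db my_backup.dump
```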
So, we reinitiated the restore in phases to get better control over where the bottlenecks were. The pre-data phase barely took a minute to complete. After that, we moved into the data-only restore using 32 parallel workers. This part finished in about seven hours, bringing the database size close to 9 TB.
At that point, we ran a manual VACUUM ANALYZE, which took around three hours. During this pause we also dropped a few unnecessary tables to reduce the workload. Once the vacuum was complete, we kicked off the post-data phase, which included indexes, constraints, and triggers. That section finished in about an hour and a half.
All in all, the full restore completed in roughly 11.5 hours, significantly better than the earlier attempts. The key difference was the vacuum step, which allowed constraint checks to run efficiently by avoiding unnecessary heap lookups.
Conclusion
The main lesson from this exercise is that constraint checks are the hidden bottleneck in large database restores: no matter how much hardware is thrown at the problem, the post-data phase keeps dragging.
Introducing a manual VACUUM ANALYZE step between the data and post-data phases helps fill that gap. Vacuum processes the heap, builds the visibility map, and updates statistics, which in turn allows the planner to use index-only scans for foreign key checks.
That said, this remains a workaround rather than a true fix. The actual solution lies in the COPY FREEZE patch, which avoids the problem altogether by marking pages as all-visible during the bulk load itself. The patch has already been submitted to the PG19-1 CommitFest and once committed, should remove the need for manual VACUUM steps in future versions.
Comments (2)
Mano
To achieve freeze, we use CopyManager in the Java Postgres driver and perform the restore. In pg_restore, you mentioned VACUUM ANALYZE is executed instead; is it better to execute VACUUM FREEZE?
Warda Bibi
Hi Mano, thanks for the great question! The proposed pg_restore --freeze patch would automate this 'COPY with FREEZE' solution seamlessly. Regarding your second point, yes, we could do that, but VACUUM FREEZE does more than we need. Since our only goal is to build the visibility map, the lighter-weight VACUUM ANALYZE is sufficient and faster. Feel free to reach out if you have any other doubts or ideas! Thanks