Data is often called the new gold, and databases are where we store and manage this precious resource as it constantly changes and grows. At first glance, updating data might seem like a simple task—just modify a row.
But behind the scenes, it’s more complex to ensure that data remains consistent and accessible. In today’s blog, I’ll answer some frequently asked questions from our customers and dive into why PostgreSQL relies on a process called VACUUM to efficiently manage data updates.
Updating a row in PostgreSQL isn’t as straightforward as directly changing the existing data. Instead, PostgreSQL avoids in-place updates, meaning it doesn’t overwrite rows directly.
But what does this actually mean?
When an update occurs, PostgreSQL creates a new row, inserts the updated data there, and marks this new row as the latest version. The old row, meanwhile, is flagged as obsolete.
A similar process applies to deletes, where rows are marked as outdated rather than removed immediately.
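To make this concrete, here is a minimal sketch using a hypothetical accounts table and PostgreSQL's built-in system columns ctid (the row's physical location) and xmin (the transaction that created the row version). After an UPDATE, the row reappears at a new location with a new xmin, while the old version lingers as a dead tuple until vacuuming removes it:

-- Hypothetical table for illustration
CREATE TABLE accounts (id int PRIMARY KEY, balance numeric);
INSERT INTO accounts VALUES (1, 100);

-- Inspect the physical location and creating transaction of the current row version
SELECT ctid, xmin, id, balance FROM accounts WHERE id = 1;

UPDATE accounts SET balance = 150 WHERE id = 1;

-- The same logical row now has a different ctid and a new xmin;
-- the previous version remains in the table as a dead tuple
SELECT ctid, xmin, id, balance FROM accounts WHERE id = 1;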
This raises an interesting question: why did PostgreSQL choose this more complex approach for handling updates and deletes?
The answer lies in its design philosophy, which is rooted in Multi-Version Concurrency Control (MVCC). MVCC ensures data consistency and allows for high concurrency.
What is Multi-Version Concurrency Control (MVCC)?
Multi-Version Concurrency Control (MVCC) allows PostgreSQL to manage multiple transactions at once, enabling consistent data views for each transaction without interference.
Imagine a library with a single book titled The Ultimate Guide to Databases.
Without Multi-Version Concurrency Control (MVCC), if two people want to check out the book at the same time, one would have to wait for the other to finish reading it. This is similar to a traditional database where transactions can block each other, preventing simultaneous access.
With MVCC, the process works differently. When the first person checks out the book, the library creates a copy just for them. The second person can also check out the book at the same time, but they receive their own copy. Both individuals can read and make notes in their respective copies without affecting the other’s experience. Once they’re done, they return the copies, and the library can clean them up for future use.
In this analogy, the book represents a data record in a database, and the copies of the book are like different versions of that data.
MVCC allows PostgreSQL to create and manage multiple versions of data, enabling multiple transactions to access and modify the data concurrently without interfering with each other. This ensures that each transaction gets a consistent view of the data while allowing for high performance and concurrency.
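As a small illustration of that consistent view (reusing the hypothetical accounts table from the sketch above), two sessions can work on the same row at the same time without blocking each other, each against its own snapshot:

-- Session 1
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT balance FROM accounts WHERE id = 1;  -- sees 150

-- Session 2, running concurrently
BEGIN;
UPDATE accounts SET balance = 200 WHERE id = 1;
COMMIT;

-- Back in Session 1: no blocking occurred, and the snapshot is unchanged
SELECT balance FROM accounts WHERE id = 1;  -- still sees 150
COMMIT;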
However, just like the library ends up with multiple copies of the book that are no longer being read, PostgreSQL ends up with versions of the data that are no longer needed, called dead tuples.
These dead tuples are like outdated copies of the book that no one is checking out anymore. Over time, as more transactions occur, these dead tuples accumulate, taking up space and potentially slowing down the system. This is where the process of vacuuming comes in—just like the library regularly clears out old, unused books to make room for new ones, PostgreSQL uses vacuuming to clean up dead tuples, reclaim storage, and maintain optimal performance.
What are Dead Tuples?
When a row is updated or deleted in PostgreSQL, the previous version remains in the table as a dead tuple. These dead tuples accumulate over time, leading to table bloat. Just like old, unused files piling up on a computer, table bloat consumes unnecessary storage and can significantly degrade performance. As these outdated rows stay in the table, queries may take longer to execute, and the database’s overall efficiency suffers.
Additionally, index bloat can occur as well. When rows are updated or deleted, the associated index entries may not be immediately cleaned up, causing indexes to become larger than necessary. This increases the amount of disk space used by the index and can further slow down query performance. Both table bloat and index bloat contribute to inefficiencies, but through regular vacuuming and index maintenance, PostgreSQL can reclaim the space and optimize the system for better performance.
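As a rough way to see this in practice, PostgreSQL's statistics views track approximate dead-tuple counts per table. A sketch like the following, based on the standard pg_stat_user_tables view, highlights the tables accumulating the most dead tuples and when they were last vacuumed:

SELECT
    relname AS table_name,
    n_live_tup AS live_tuples,
    n_dead_tup AS dead_tuples,
    last_vacuum,
    last_autovacuum
FROM
    pg_stat_user_tables
ORDER BY
    n_dead_tup DESC
LIMIT 20;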
What is Vacuuming?
Vacuuming is a PostgreSQL maintenance process that removes dead tuples, reclaims storage, and helps keep your database running efficiently.
Is VACUUM Responsible Only for Cleaning Up Dead Tuples?
The simple answer is no.
Vacuuming has an even more critical role beyond cleaning up dead tuples: it prevents Transaction ID Wraparound.
Transaction ID Wraparound
Every transaction in PostgreSQL is assigned a unique identifier called a Transaction ID (XID), which is similar to numbering each page in a notebook.
A transaction refers to a set of one or more SQL operations executed as a single unit of work, such as inserting, updating, or deleting data.
This unique ID helps PostgreSQL keep track of the changes made during the transaction and ensures data consistency. PostgreSQL uses a 32-bit integer for these XIDs, meaning that there are approximately 4 billion possible unique transaction IDs.
Max Transaction IDs = 2^32 ≈ 4,294,967,296
Once this limit is reached, the system starts reusing IDs, which could cause confusion if not properly managed. Handling these old XIDs is another duty of VACUUM.
To illustrate the potential problem, imagine you’re using a notebook where you number each page to keep track of your notes. If you keep filling up pages and eventually run out of numbers, you might start reusing old page numbers. This would create confusion because you wouldn’t know whether you’re referring to a fresh note or an old one.
Similarly, when PostgreSQL reaches this limit, it wraps around and starts reusing IDs. Without careful management, PostgreSQL could mistake an old transaction for a new one due to ID reuse, leading to data inconsistencies.
How does PostgreSQL prevent Transaction ID Wraparound?
By vacuuming. Vacuuming freezes rows created by old transactions, marking them as permanently valid so their old transaction IDs no longer matter. This lets PostgreSQL reuse transaction IDs safely without confusion.
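As an illustration (the table name here is just a placeholder), you can check the threshold that triggers emergency freezing and force the freezing manually:

-- Threshold at which autovacuum forcefully freezes old tuples (default: 200 million)
SHOW autovacuum_freeze_max_age;

-- Vacuum a table and aggressively freeze its old row versions
VACUUM (FREEZE, VERBOSE) accounts;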
How to Monitor Transaction ID Wraparound?
We can run the following query to check the database age, the percentage of progress toward emergency autovacuum (when old tuples are forcefully frozen), and the wraparound risk percentage (when PostgreSQL would stop accepting new transactions until a vacuum completes).
SELECT
    datname,
    age(datfrozenxid) AS database_age,
    100 * (age(datfrozenxid) / current_setting('autovacuum_freeze_max_age')::float) AS emergency_autovacuum_percent,
    (age(datfrozenxid)::numeric / 2000000000 * 100)::numeric(4,2) AS wraparound_risk_percent
FROM
    pg_database
ORDER BY
    age(datfrozenxid) DESC;
To check table-wise age, run the following query:
SELECT
    c.oid::regclass AS table_name,
    age(c.relfrozenxid) AS table_age,
    100 * (age(c.relfrozenxid) / current_setting('autovacuum_freeze_max_age')::float) AS emergency_autovacuum_percent,
    pg_size_pretty(pg_total_relation_size(c.oid)) AS table_size
FROM
    pg_class c
JOIN
    pg_namespace n ON c.relnamespace = n.oid
WHERE
    c.relkind IN ('r', 't', 'm') -- tables, TOAST tables, and materialized views
    AND n.nspname NOT IN ('pg_toast', 'pg_catalog', 'information_schema') -- exclude system schemas
ORDER BY
    age(c.relfrozenxid) DESC
LIMIT 100;
Is Vacuuming Automatic or Manual?
In PostgreSQL, vacuuming can be both automatic and manual. Autovacuum is a built-in feature that runs in the background to clean up dead rows and prevent table bloat.
Autovacuum reclaims storage and updates statistics without user intervention, keeping the database performant. However, manual vacuuming is available when you want more control: you can run the VACUUM command to clean specific tables or to manage tables with heavier workloads.
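A few common forms of the manual command look like this (the table names are placeholders):

-- Reclaim space from dead tuples in a specific table
VACUUM my_table;

-- Vacuum and refresh planner statistics at the same time
VACUUM (ANALYZE, VERBOSE) my_table;

-- Rewrite the table to return space to the operating system
-- (takes an exclusive lock, so use with care)
VACUUM FULL my_table;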
What Type of Workload Generates Table Bloat?
Table bloat is primarily caused by workloads with frequent updates or deletes. When rows are updated, PostgreSQL keeps old row versions as dead tuples, gradually bloating the table. Similarly, deleted rows aren’t removed immediately; instead, they’re marked as dead. This is why workloads with high deletion rates can also contribute significantly to bloat.
Managing bloat is essential for database health, especially in systems with heavy data modification.
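For tables with particularly heavy update or delete traffic, one option is to make autovacuum more aggressive on a per-table basis. The table name and value below are purely illustrative:

-- Trigger autovacuum after roughly 5% of the table has changed
-- instead of the default 20%
ALTER TABLE orders SET (autovacuum_vacuum_scale_factor = 0.05);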
Further Readings: