How PostgreSQL Scans Your Data

To understand how PostgreSQL scans data, we first need to understand how PostgreSQL stores it.

  • A table is stored as a collection of 8KB pages (by default) on disk.
  • Each page has a header, an array of item pointers (also called line pointers), and the actual tuple data growing from the bottom up.
  • Each tuple has its own header containing visibility info: xmin, xmax, cmin/cmax, and infomask bits.
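These on-disk structures can be inspected directly with the pageinspect extension that ships with PostgreSQL. The users table below is a placeholder for any table of your own:

```sql
-- pageinspect ships with PostgreSQL as a contrib extension.
CREATE EXTENSION IF NOT EXISTS pageinspect;

-- The header of page 0 of a hypothetical "users" table.
SELECT * FROM page_header(get_raw_page('users', 0));

-- The line pointers and tuple headers (including xmin/xmax) on that page.
SELECT lp, lp_off, lp_len, t_xmin, t_xmax, t_ctid
FROM heap_page_items(get_raw_page('users', 0));
```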

There are different ways PostgreSQL can read data from disk. Depending on the query and available indexes, it can choose from several scan strategies:

  1. Sequential Scan 
  2. Index Scan
  3. Index-Only Scan
  4. Bitmap Index Scan

In this blog post, we’ll explore each of these scan types one by one.

Sequential Scan

The sequential scan is PostgreSQL’s brute-force access method. It reads every page of the table from block 0 to relpages – 1.

Step by Step

  1. PostgreSQL opens the primary physical file for the relation.
  2. Each 8 KB page is pulled into the shared buffer pool. If already cached, it’s a hit. Otherwise, Postgres reads it from disk. A lightweight pin prevents the buffer manager from evicting the page during reading.
  3. For each tuple, PostgreSQL compares xmin/xmax against the current transaction’s snapshot to determine whether the tuple is visible. Dead and in-progress tuples are skipped.
  4. Visible tuples are tested against the WHERE clause quals, and non-matching tuples are discarded.
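The whole sequence is easy to watch with EXPLAIN. The users table and the numbers below are illustrative, not output you should expect verbatim:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM users WHERE status = 'active';

-- Illustrative plan (your costs and counts will differ):
--  Seq Scan on users  (cost=0.00..18334.00 rows=500000 width=72)
--    Filter: (status = 'active'::text)
--    Rows Removed by Filter: 500000
--    Buffers: shared hit=2048 read=6286
```

The Buffers line shows step 2 in action: pages found in the shared buffer pool versus pages read from disk.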

Visibility Map Optimization

Normally, PostgreSQL must check each row’s visibility before returning it. Because of MVCC, a row might have been inserted, updated, or deleted by another transaction, so PostgreSQL verifies that the row is visible to the current snapshot. To make this faster, PostgreSQL maintains a visibility map. The visibility map stores a bit for each page in the table. If all rows on a page are visible to all transactions, that page is marked as all-visible in the visibility map. When this bit is set:

  • PostgreSQL knows every row on that page is visible.
  • It can skip the expensive per-row visibility checks.
  • It can move straight to applying the WHERE condition.

This significantly reduces CPU overhead during large scans.
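The per-page bits can be examined with the pg_visibility contrib extension (again, users is a stand-in table name):

```sql
CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- One row per heap page: is it marked all-visible (and all-frozen)?
SELECT blkno, all_visible, all_frozen
FROM pg_visibility_map('users');

-- Aggregate counts for the whole table.
SELECT * FROM pg_visibility_map_summary('users');
```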

Synchronized Scans


If multiple backends start a sequential scan on the same large table around the same time, PostgreSQL doesn’t let them all begin at page 0. Instead, it coordinates them using synchronized scans. When a second scan starts, it joins the first scan at its current position in the table and continues scanning from there. After reaching the end of the table, it wraps around to read the remaining pages from the beginning.

This approach reduces redundant work and makes better use of shared memory and disk bandwidth.
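Synchronized scans are on by default and are controlled by a single setting; they only kick in for tables larger than about a quarter of shared_buffers:

```sql
SHOW synchronize_seqscans;  -- on by default

-- Can be disabled per session, e.g. to make repeated unordered scans
-- start from block 0 again (row order is never guaranteed regardless):
SET synchronize_seqscans = off;
```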

Parallel Seqscans

For very large tables, PostgreSQL can scan in parallel. A parallel leader process divides the table into block ranges. Multiple worker processes are launched, and each worker scans a different portion of the table.

When a worker finishes its assigned range, it requests another. This dynamic assignment keeps the work evenly distributed, so all CPU cores can participate and large scans complete faster.
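A parallel plan is recognizable by the Gather node sitting above a Parallel Seq Scan. The orders table and worker counts below are illustrative:

```sql
SET max_parallel_workers_per_gather = 4;

EXPLAIN (ANALYZE)
SELECT count(*) FROM orders;

-- Illustrative plan:
--  Finalize Aggregate
--    ->  Gather (Workers Planned: 4, Workers Launched: 4)
--          ->  Partial Aggregate
--                ->  Parallel Seq Scan on orders
```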

When Does the Planner Choose seqscan?

PostgreSQL uses a cost model to estimate which plan will be cheapest. Two important settings influence this decision: seq_page_cost (default: 1.0) and random_page_cost (default: 4.0).

Sequential page reads are considered cheap, while random page reads are assumed to be four times as expensive by default. Because of this, a sequential scan is usually chosen when:

  • A large percentage of the table’s rows are expected to match
  • The table is small
  • The table already fits in memory
  • Using an index would require too many random page reads

In these cases, reading the table once from start to finish is estimated to be cheaper than jumping around via an index.
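We can ask the planner to show its work. Disabling enable_seqscan for a session doesn't forbid sequential scans, it just makes them look very expensive, which lets us compare the estimates of the competing plans:

```sql
SHOW seq_page_cost;     -- 1 by default
SHOW random_page_cost;  -- 4 by default

EXPLAIN SELECT * FROM users WHERE status = 'active';

-- Penalize the seq scan and see what the alternative would have cost:
SET enable_seqscan = off;
EXPLAIN SELECT * FROM users WHERE status = 'active';
RESET enable_seqscan;
```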

Index Scan

An index scan uses a separate structure to find matching rows. Instead of reading the entire table, PostgreSQL:

  1. Looks up matching values in the index
  2. Gets the physical locations of those rows
  3. Fetches only those rows from the table

The index and the table (heap) are two separate structures. The index stores keys and pointers, while the heap stores the actual row data.

What Is a TID?


Every index entry points to something called a TID (Tuple ID). A TID (also called an ItemPointer) is a small physical address made of:

  1. Block number (which heap page to read)
  2. Offset (which row slot inside that page)

Every index entry points to exactly one TID in the heap. Now let’s say you run:

SELECT * FROM users WHERE id = 42;

Here’s what happens during an index scan on id, step by step:

  1. PostgreSQL scans the index to find the entry where id = 42.
  2. From that index entry, it reads the TID.
  3. Using the block number in the TID, Postgres fetches the corresponding heap page into shared buffers (or uses it if it’s already cached).
  4. Inside that page, it uses the offset to locate the exact row.
  5. Before returning it, Postgres performs an MVCC visibility check (to make sure the row is visible to the current transaction).
  6. If the row is visible (and matches the query), PostgreSQL returns it.

So for each row returned by an index scan, there is at least one I/O for the index and another for the heap page.
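The TID is not hidden from us: every table exposes it as the system column ctid, and PostgreSQL can even fetch by TID directly with a Tid Scan. The (3,7) value below is just an example:

```sql
-- ctid is the (block number, offset) pair described above.
SELECT ctid, id FROM users WHERE id = 42;
-- e.g. ctid = (3,7): block 3, line pointer 7 of that page.

-- Fetching by a known TID skips the index entirely:
EXPLAIN SELECT * FROM users WHERE ctid = '(3,7)';
--  Tid Scan on users  (cost=0.00..4.01 rows=1 width=72)
```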

HOT Chains: Avoiding Index Bloat

When a row is updated, PostgreSQL creates a new version of that row. If any indexed column changes, the index must also be updated. But if the indexed columns stay the same, PostgreSQL can avoid touching the index entirely. This is called a HOT (Heap-Only Tuple) update.

In a HOT update, the new row version is placed on the same heap page (if there’s space). The existing index entry still points to the original TID. Instead of creating a new index entry, PostgreSQL links the old tuple to the new version using an internal pointer (t_ctid). During an index scan, PostgreSQL fetches the heap page using the TID from the index. If that tuple was updated, it follows the chain (t_ctid) inside the same page to find the latest visible version and returns it.

Because no new index entry is created, the index stays smaller and cleaner. This reduces index bloat and makes updates cheaper when indexed columns don’t change.
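Whether updates are actually HOT is visible in the statistics views, and the odds can be improved by leaving free space on each page via fillfactor (the table name and value here are illustrative):

```sql
-- How many of this table's updates were heap-only?
SELECT n_tup_upd, n_tup_hot_upd
FROM pg_stat_user_tables
WHERE relname = 'users';

-- Reserve 20% free space per page so new row versions are more
-- likely to fit on the same page, enabling HOT updates:
ALTER TABLE users SET (fillfactor = 80);
```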

Index-Only Scans

Normally, even when PostgreSQL uses an index, it still has to visit the heap to fetch the row. However, if all the columns the query needs are already stored in the index, PostgreSQL can return the result directly from the index without touching the heap. This is called an index-only scan.

There’s one catch: indexes do not store visibility information, so PostgreSQL still needs to make sure each row is visible to the current transaction. To do that, it consults the visibility map.

  • If the visibility map bit for the corresponding heap page is set, every row on that page is visible to all transactions, and PostgreSQL can safely return the data directly from the index with no heap access.
  • If the bit is not set, PostgreSQL must fall back to fetching the heap page to verify visibility.

This is why index-only scans work much better on tables that are regularly vacuumed. Frequent VACUUM updates the visibility map, setting more pages to all-visible, which allows PostgreSQL to skip heap lookups more often.
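A sketch of what this looks like in practice, using a hypothetical covering index (INCLUDE is available since PostgreSQL 11):

```sql
CREATE INDEX users_id_email_idx ON users (id) INCLUDE (email);

VACUUM users;  -- sets visibility map bits

EXPLAIN (ANALYZE)
SELECT id, email FROM users WHERE id < 1000;

-- Illustrative plan:
--  Index Only Scan using users_id_email_idx on users
--    Heap Fetches: 0   <- no heap access needed; rises on unvacuumed tables
```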

Bitmap Scan

If too many rows match, a regular index scan becomes expensive because it causes excessive random I/O. But if not enough rows match to justify reading the entire table with a sequential scan, PostgreSQL chooses a third option: the bitmap scan.

It’s similar to an index scan, but instead of jumping to the heap immediately for every match, PostgreSQL splits the work into two phases.

Phase 1: Build the Bitmap (Bitmap Index Scan)

First, PostgreSQL scans the index. But instead of fetching rows right away, it collects all matching TIDs into an in-memory bitmap. The bitmap groups matches by heap page. Conceptually, it looks like this:

Page 3:    [0, 1, 0, 0, 1, 0, 1, 0]  ← rows 1, 4, 6 match

Page 8:    [1, 0, 0, 0, 0, 0, 0, 0]  ← row 0 matches

Page 472:  [0, 0, 1, 0, 0, 1, 0, 0]  ← rows 2, 5 match

Instead of fetching rows one at a time in random order, PostgreSQL now knows exactly which pages contain matches.

Phase 2: Read Heap Pages in Order (Bitmap Heap Scan)

Once the bitmap is built, PostgreSQL sorts the matching heap pages by block number and reads them in physical order. Each page is read at most once. Instead of jumping around like this:

Page 472 → Page 3 → Page 472 → Page 8801 → Page 3 → Page 8

It reads pages like this:

Page 3 → Page 8 → Page 472 → Page 8801

If multiple matching rows are on the same page, they are all processed together. This avoids repeatedly reading the same page from disk. Without this approach, a page might be loaded into shared buffers, evicted, and then loaded again later. Bitmap scans prevent that waste.

Exact vs Lossy Bitmap

When memory allows, PostgreSQL keeps an exact bitmap. That means it tracks the exact tuple positions inside each page. Later, it knows precisely which rows to fetch. But the bitmap is stored in memory, and it is limited by work_mem. If the bitmap grows too large, PostgreSQL switches to a lossy bitmap. In this mode, it no longer remembers exact tuple positions. It only remembers:

This page has at least one matching row.

Because it no longer knows exactly which rows match, PostgreSQL must re-check the WHERE condition for every row on those pages during the heap scan. In EXPLAIN ANALYZE, this shows up as a Recheck Cond line, and the Heap Blocks line reports how many pages were stored exactly versus lossily.

If we increase work_mem, PostgreSQL is more likely to use an exact bitmap and avoid this extra recheck work.
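We can provoke a lossy bitmap deliberately by shrinking work_mem. The table, index, and numbers are illustrative:

```sql
SET work_mem = '64kB';  -- artificially small, to force a lossy bitmap

EXPLAIN (ANALYZE)
SELECT * FROM orders WHERE amount > 100;

-- Illustrative plan:
--  Bitmap Heap Scan on orders
--    Recheck Cond: (amount > 100)
--    Rows Removed by Index Recheck: 31872
--    Heap Blocks: exact=128 lossy=4096
--    ->  Bitmap Index Scan on orders_amount_idx
--          Index Cond: (amount > 100)

RESET work_mem;
```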

Combining Indexes

Bitmap scans can combine multiple indexes. For example:

SELECT * FROM orders
WHERE amount > 100 AND region = 'APAC';

PostgreSQL can:

  • Build one bitmap from the amount index
  • Build another bitmap from the region index
  • Combine them using a fast bitwise AND

Only pages present in both bitmaps are scanned. This is not possible with a regular index scan, which can only use one index at a time. Bitmap scans can also combine indexes with OR conditions using a bitwise OR. This ability to merge multiple index results is one of the main reasons bitmap scans exist.
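Assuming separate indexes on amount and region exist, the combined plan looks roughly like this:

```sql
EXPLAIN
SELECT * FROM orders
WHERE amount > 100 AND region = 'APAC';

-- Illustrative plan:
--  Bitmap Heap Scan on orders
--    Recheck Cond: ((amount > 100) AND (region = 'APAC'::text))
--    ->  BitmapAnd
--          ->  Bitmap Index Scan on orders_amount_idx
--                Index Cond: (amount > 100)
--          ->  Bitmap Index Scan on orders_region_idx
--                Index Cond: (region = 'APAC'::text)
```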

Choosing the Right Plan

The PostgreSQL planner uses a cost model to pick between these access methods. Here is the rough decision landscape:

| Selectivity | Rows matched | Likely plan | Why |
| --- | --- | --- | --- |
| < 1% | Very few rows | Index Scan | Random I/O cost acceptable at this scale |
| 1–20% | Moderate rows | Bitmap Index Scan | Sorting TIDs converts random I/O to sequential |
| > 20% | Most rows | Seq Scan | Index overhead exceeds benefit |
| Multi-column conditions | Any | BitmapAnd / BitmapOr | Only mechanism that combines two indexes |

The default random_page_cost = 4.0 was tuned for spinning disks. On SSDs, lowering it to around 1.1 tells the planner that random I/O is cheap, so it will favor index scans more aggressively. Understanding these internals lets us reason about query performance: why the planner made a particular decision, and how to guide it with random_page_cost, work_mem, or appropriate indexes.
