Understand Explain Plans in PostgreSQL

In a previous post titled Exploring Postgres Performance: A Deep Dive into pg_stat_statements, we discussed the utility of pg_stat_statements as a tool for identifying resource-intensive queries in terms of memory, CPU, or I/O usage.

Now, let’s suppose you’ve pinpointed certain queries you’d like to investigate further. The EXPLAIN command is used to generate execution plans. It includes:

The highest-level node provides a summary that includes cumulative statistics gathered from its child nodes. Each child node possesses its individual statistics, which may also include statistics from any child nodes it holds. These statistics encompass factors like cost, total execution time, the number of rows returned, and the data width.
Details about scans performed on tables, including sequential scans, index scans, index-only scans, parallel scans and an estimation of the number of rows to be returned or matching the specified condition.
Information regarding the join algorithms utilized, such as Hash join, Merge join, or nested loop join.

Key Parameters for PostgreSQL Query Optimization

ANALYZE

The ANALYZE statement functions by executing the SQL statement and discarding the output information. It provides information on actual time, rows returned from each node and number of times a node was executed(loops).

EXPLAIN ANALYZE SELECT * FROM users;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------
Seq Scan ON users (cost=0.00..18584.82 ROWS=1890082 width=24) (actual TIME=0.005..232.380 ROWS=1600000 loops=1)
Total runtime: 269.666 ms
(2 rows)

If you want to run analyze with statements such as INSERT, UPDATE, or DELETE without affecting the data,it should be executed inside transaction, as follows:

BEGIN;
EXPLAIN ANALYZE UPDATE users set id = 1 where name = 'abc';
ROLLBACK;

VERBOSE

The VERBOSE parameter allows you to display additional information regarding the query plan, such as the output column list for each node in the plan tree and schema-qualified table and function names. This parameter is set to FALSE by default.

COSTS

The COSTS option includes the estimated startup and total costs of each plan node, as well as the estimated number of rows and the estimated width of each row in the query plan. PostgreSQL uses a cost-based optimizer, which means it evaluates multiple execution strategies for a query and selects the most efficient one based on cost. Lower cost indicates a more efficient operation. COSTS reflect the effort required for operations in terms of time and resources.

Note: A common misunderstanding is that they are in milliseconds or some other unit of time, but that’s not the case. The costs are measured arbitrary units based on the GUCs mentioned below.

Following GUC parameters are used in cost planning.

seq_page_cost = 1.0
random_page_cost = 4.0
cpu_tuple_cost = 0.01
cpu_index_tuple_cost = 0.005
cpu_operator_cost = 0.0025

More information on above parameters: https://www.postgresql.org/docs/current/runtime-config-query.html#RUNTIME-CONFIG-QUERY-CONSTANTS

But how are query costs calculated? For example

EXPLAIN SELECT * FROM account;
QUERY PLAN
-----------------------------------------------------------------
Seq Scan on account(cost=0.00..3372572.84 rows=73953584 width=386)

Cost in normally shown in a range(cost=a..b). a shows estimate of how long it will take to fetch the first row and b shows how long it will take to return all the rows. In above example, the startup cost of sequential scan is 0.00. For a sequential scan, the startup cost will generally be close to zero, as it can start fetching rows straight away.In contrast, operations like sorting have higher costs because significant work must be done before rows can be returned.

Let’s find the total cost of above operation. First we need to find total tuples count and total pages used to accomodate those tuples using pg_class catalog table:

SELECT relpages, reltuples FROM pg_class WHERE relname = 'account';

relpages = 2633037 => It will be used as estimated sequential page reads
reltuples = 73953584 => It will be used as total tuples inside table

Total cost of a Seq Scan = (estimated sequential page reads * seq_page_cost) + (cpu_tuple_cost + cpu_operator_cost) * total tuples to be scanned

For instance, in the provided example:
Total cost of Seq Scan = (2633037 * 1) + (0.0025 * 0.01) * 73953584 = 3372572.8

One exception is LIMIT clause, in which the planner uses to estimate whether it can terminate the child operation early. For example

EXPLAIN SELECT * FROM users LIMIT 10;

QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.00..0.058 rows=10 width=386)
-> Seq Scan on users (cost=0.00..3372572.84 rows=73953584 width=386)

Here, the total cost for the Seq Scan node remains 3372572.84, but the total cost of the Limit operation is reported as 0.058 because the planner expects that only a small number of rows need to be processed and as a result, the LIMIT will stop the scan early.

BUFFERS

The BUFFERS option helps to monitor the amount of IO work performed by each node inside query execution plan. It is particularly valuable for database systems that often carry out IO-intensive operations, as excessive data page access can be a major cause of performance issues.

The BUFFERS option provides details about shared blocks (containing data from regular tables and indexes), local blocks (containing data from temporary tables and indexes), and temporary blocks (holding short-term working data used in sorting, hashing, Materialize plan nodes, and similar scenarios).

Blocks hit means that a disk read was avoided because the block was found already in memory when needed.
Blocks read means that a data was not avaialble inside shared_buffers so Postgresql went to fetch data from OS cache/disk(We don’t know).
Blocks dirtied indicates the number of blocks that were changed by this query(Can happen during inserts/updates or select hint bits).
Blocks written indicates the number of previously-dirtied blocks evicted from memory to disk by this backend during query processing. The number of blocks shown for an upper-level node includes those used by all its child nodes.

Example:

Buffers: shared hit=4785 READ=4541 dirtied=204 written=663, temp READ=788

This means:

PostgreSQL found 4785 blocks in memory and fetched 4541 blocks from the OS cache/disk.
204 blocks were modified (e.g., new rows, updates, deletes, hint bits).
663 blocks were evicted from memory to free up space.
788 blocks were read from temporary files for hash/aggregation/sorting operations.

WAL

Include information on WAL record generation. Specifically, include the number of records, number of full page images (fpi) and the amount of WAL generated in bytes. In text format, only non-zero values are printed. This parameter may only be used when ANALYZE is also enabled. It defaults to FALSE.

FORMAT

Specify the output format, which can be TEXT, XML, JSON, or YAML. Non-text output contains the same information as the text output format, but is easier for programs to parse. This parameter defaults to TEXT.

Examples: PostgreSQL Query Plans

First Query

explain (analyze,buffers)
SELECT
"r_partner".id
FROM
"r_partner"
WHERE
(
("r_partner"."active" = true)
AND (
"r_partner"."x_studio_driver_application_id" = 405300
)
)
AND (
(
(
(
"r_partner"."partner_share" IS NULL
or "r_partner"."partner_share" = false
)
OR (
"r_partner"."company_id" in (1)
)
)
OR "r_partner"."company_id" IS NULL
)
AND (
(
(
(
"r_partner"."type" != 'private'
)
OR "r_partner"."type" IS NULL
)
OR "r_partner"."type" IS NULL=
)
OR ("r_partner"."type" = 'private')
)
)
ORDER BY
"r_partner"."display_name"
Limit 1;

Query Plan before Index

QUERY PLAN
-----------------------------------------------------------------
Limit (cost=0.42..6322.54 rows=1 width=23) (actual time=398.392..398.393 rows=1 loops=1)
Buffers: shared hit=373305
-> Index Scan using r_partner_display_name_index on r_partner (cost=0.42..101154.35 rows=16 width=23) (actual time=398.390..398.391 rows=1 loops=1)
Filter: (active AND ((partner_share IS NULL) OR (NOT partner_share) OR (company_id = 1) OR (company_id IS NULL)) AND (x_studio_driver_application_id = 405300) AND (((type)::text <> 'private'::text) OR (type IS NULL) OR (type IS NULL) OR ((type)::text = 'private'::text)))
Rows Removed by Filter: 370671
Buffers: shared hit=373305
Planning Time: 0.144 ms
Execution Time: 398.417 ms

Let’s get the information from above plan:

Total execution time = 398.417 ms
Shared hit = 373305 which means that we are touching 373305 blocks/pages from memory. Each block has a size of 8kb in PostgreSQL which means that we are accessing 373305 * 8/1000 = 3GB of data just to return 1 row.
Rows Removed by Filter: 370671 which means that 99% of rows are removed after reading them

What we need to think is how we can reduce number of blocks accessed, filtered rows and find out columns which are good candidate for indexing.

Lets create an index on x_studio_driver_application_id column since it has high cardinality and mostly unique. Don’t forget to analyze the table for updating statistics.

create index concurrently idx_r_partner_driver_application_id on r_partner(x_studio_driver_application_id);
ANALYZE r_partner;

Query Plan after Index

QUERY PLAN
-----------------------------------------------------------------
Limit (cost=6.26..6.26 rows=1 width=23) (actual time=0.026..0.027 rows=1 loops=1)
Buffers: shared hit=4
-> Sort (cost=6.26..6.30 rows=16 width=23) (actual time=0.025..0.026 rows=1 loops=1)
Sort Key: display_name
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=4
-> Index Scan using idx_r_partner_driver_application_id on r_partner (cost=0.42..6.18 rows=16 width=23) (actual time=0.018..0.019 rows=1 loops=1)
Index Cond: (x_studio_driver_application_id = 405300)
Filter: (active AND ((partner_share IS NULL) OR (NOT partner_share) OR (company_id = 1) OR (company_id IS NULL)) AND (((type)::text <> 'private'::text) OR (type IS NULL) OR (type IS NULL) OR ((type)::text = 'private'::text)))
Buffers: shared hit=4
Planning Time: 0.149 ms
Execution Time: 0.047 ms

Here we can see that:

- idx_r_partner_driver_application_id is used for column x_studio_driver_application_id.
- Shared_hit = 4 which means we are now only accessing 4 * 8kb = 32KB dataset.
- Execution time has been reduced from 398ms to 0.047ms.

Second Query

SELECT count(1) FROM accounts WHERE ((accounts.state = 'cancel')
AND (accounts.date <= '2023-12-31'))
AND (accounts.company_id IS NULL OR (accounts.company_id in (1)))

Query Plan before Index

QUERY PLAN
-----------------------------------------------------------------
Finalize Aggregate (cost=997128.91..997128.92 rows=1 width=8) (actual time=988.165..990.682 rows=1 loops=1)
Buffers: shared hit=896417
-> Gather (cost=997128.49..997128.90 rows=4 width=8) (actual time=988.066..990.675 rows=5 loops=1)
Workers Planned: 4
Workers Launched: 4
Buffers: shared hit=896417
-> Partial Aggregate (cost=996128.49..996128.50 rows=1 width=8) (actual time=985.039..985.039 rows=1 loops=5)
Buffers: shared hit=896417
-> Parallel Seq Scan on accounts (cost=0.00..996114.35 rows=5654 width=0) (actual time=0.071..984.457 rows=5084 loops=5)
Filter: ((date <= '2023-12-31'::date) AND ((company_id IS NULL) OR (company_id = 1)) AND ((state)::text = 'cancel'::text))
Rows Removed by Filter: 4550296
Buffers: shared hit=896417
Planning:
Buffers: shared hit=4
Planning Time: 0.167 ms
Execution Time: 990.718 ms

Let’s get the information from above plan:

Total execution time = 990.718 ms.
Shared hit = 896417 which means that we are touching 896417 blocks/pages from memory. Each block has a size of 8kb in PostgreSQL which means that we are accessing 896417 * 8/1000 = 7.2 GB of data just to return the count.
Parallel sequential scan is performed with loops = 5 which means 5 parallel workers(1 leader and 4 launched workers as shown in Workers Launched: 4). Most of the time of this query is spent on this node. Rows Removed by filter are 4550296 and rows returned are 5084. Note that these are average row values for a single loop. Since we have loops = 5, we can say that around 4550296 * 5 = 22,751,480 rows were filtered(Almost 99% of the table) and 5084 * 5 = 25,420 rows were returned from this parallel scan.

Since this query is run with high frequency everyday, it can be we can optimize it by using a proper index. Lets check the state column first as it is used as filter.

select count(*),state from account_move group by state;

count     |   state
----------------------
25436     |  cancel
118014    |  draft
22633447  |  posted

There are only 3 unique values in state column and 95% data is aligned with posted status. Looks like we can get benefit from a partial index on state column.

CREATE INDEX CONCURRENTLY idx_account_move_state on account_move(state) where state in ('cancel','draft');
ANALYZE account_move;

We could’ve created full index on state column but:

The index creation time will increase as well as index size.
Index can include rows that are never searched and there is a possibility that planner will never use it when state = posted.

Query Plan after Index

QUERY PLAN
-----------------------------------------------------------------

Aggregate (cost=2268.88..2268.89 rows=1 width=8) (actual time=24.444..24.445 rows=1 loops=1)
Buffers: shared hit=23040
-> Index Scan using idx_account_move_state on account_move (cost=0.29..2199.06 rows=27930 width=0) (actual time=0.026..22.491 rows=25419 loops=1)
Index Cond: ((state)::text = 'cancel'::text)
Filter: ((date <= '2023-12-31'::date) AND ((company_id IS NULL) OR (company_id = 1)))
Rows Removed by Filter: 17
Buffers: shared hit=23040
Planning:
Buffers: shared hit=4
Planning Time: 0.192 ms
Execution Time: 24.476 ms

Here we can see that:

idx_account_move_state is used for column state.
Shared_hit = 23040 which means we are now only accessing 23040 * 8kb = 184MB dataset
Execution time has been reduced from 990ms to 24ms.

Note: Keep in mind that Indexes shouldn’t be used every time you want to optimize a query. Every index has to be updated for every write, they take space depending on the dataset and can avoid HOT optimization.

Query Optimization Using PostgreSQL Explain Plans

Reading explain plan can be complex, especially for newbie PostgreSQL users. However, by focusing on some fundamental concepts, you can gradually become more proficient at interpreting the information provided and identifying potential performance issues in your queries.

EXPLAIN alone shows cost estimates. With ANALYZE, you’ll actually run the query and see the time it took to create the query plan, plus the time it took to execute the query according to that plan. By using BUFFERS, you can track how many blocks are read from memory, how many are fetched from the disk, and other I/O-related details.
The tables involved in the query must be vacuumed and analyzed. You can check last_*vacuum and last_*analyze columns inside pg_stat_all_tables against each table. For more info on vacuum, checkout Vacuum In PostgreSQL.
Look for nodes with the following operations and stats. These can be potential bottlenecks.
- - High costs
  - Bad estimated row counts(planned vs returned)
  - High buffer operations (read and hit)
  - Filters that remove a lot of rows
  - Sequential scans on large tables
  - External disk merge operation
  - Temp reads and writes

Ensure that indexes are used appropriately as they helps in working with fewer database blocks(buffers). Use CONCURRENTLY while creating index in production environment to avoid locks.
Following parameters should be set inside postgresql.conf for better performance:
- - shared_buffers
  - effective_cache_size
  - work_mem
  - maintenance_work_mem
  - random_page_cost

Check out for more information on PostgreSQL memory optimization: Memory Management in PostgreSQL

To visualize query plans, you can use DALIBO and DEPESZ tool which can help in understanding PostgreSQL EXPLAIN plans.

References:

https://www.postgresql.org/docs/current/sql-explain.html
https://www.depesz.com/tag/unexplainable/