PostgreSQL is Not Slow. Your Queries Are.

A field guide to the seven things that are actually making our database feel slow and how to stop blaming the wrong suspect.

It usually starts with a Slack message: “The app feels slow”. This is normally followed by a ticket, then an internal meeting, and finally someone, and there is always someone, saying: “I think we need to switch databases. PostgreSQL can’t handle this load.”

We have heard this exact sentence more times than we can count and almost every single time, after we dig in, PostgreSQL is doing exactly what it was asked to do. Diligently. Faithfully. Scanning millions of rows because nobody told it not to. Recalculating the same aggregation on every request because nobody cached it. Executing 847 individual queries to render one page because an ORM thought that was fine.

PostgreSQL is not the problem. The problem is what we are asking PostgreSQL to do.

This is not a dig at developers. These are honest mistakes that happen when teams move fast, when ORMs abstract away the SQL, when things work fine at 1,000 rows and then fall apart at 10 million. This blog is about recognizing them before our users do.

Culprit #1: The Missing Index

We start with the obvious one, because it is still the most common thing we find. A table grows from 10,000 rows to 10 million rows over eighteen months, and nobody noticed that a critical query was doing a sequential scan the whole time. At 10,000 rows, it was fast enough. At 10 million, it takes seconds, not milliseconds.

If we filter on a column, sort by a column, or join on a column, and there is no index on that column, PostgreSQL has to find another strategy. If no other indexes are available, it will read every single row in the table using a sequential scan. But if another index exists on a related column in the query, the planner may use that instead, which can be better or worse.

How to catch it

Running this query shows us tables that are being sequentially scanned the most. These are candidates for missing indexes:

-- Find tables getting hammered by sequential scans
SELECT relname AS table_name,
   seq_scan,
   seq_tup_read,
   idx_scan,
   seq_tup_read / NULLIF(seq_scan, 0) AS avg_rows_per_seq_scan
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 20;

If a table has a high seq_tup_read and a low idx_scan, that is a problem. Find the queries hitting that table, look at which columns appear in WHERE, ORDER BY, and JOIN clauses, and add indexes on those columns.

The other side: too many indexes slow down writes

Before adding indexes liberally, it’s important to understand the cost. In PostgreSQL, every index on a table must be updated for every write that is not a HOT (Heap Only Tuple) update. This means more WAL (Write-Ahead Log) writes, more WAL contention, more disk I/O, and more write amplification. On a write-heavy table, a large pile of indexes can make writes measurably slower. Every index that is not actively used by queries is purely overhead. The goal is targeted indexing — not adding indexes until the problem goes away.
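
One way to see which indexes are earning their keep is pg_stat_user_indexes. A minimal sketch, with the caveat that idx_scan only counts usage since the last statistics reset:

-- Indexes that have never been used since the last statistics reset
SELECT s.schemaname,
       s.relname      AS table_name,
       s.indexrelname AS index_name,
       pg_size_pretty(pg_relation_size(s.indexrelid)) AS index_size,
       s.idx_scan
FROM pg_stat_user_indexes s
JOIN pg_index i ON i.indexrelid = s.indexrelid
WHERE s.idx_scan = 0
  AND NOT i.indisunique   -- keep unique and primary key indexes
ORDER BY pg_relation_size(s.indexrelid) DESC;

A zero in idx_scan is not an automatic reason to drop an index; confirm it is not needed by infrequent but important queries first.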

The trap inside the trap

There is a subtler version of this that catches us off guard: we have an index, but PostgreSQL is not using it. This happens when we wrap a column in a function:

-- PostgreSQL CANNOT use an index on created_at for this:
SELECT * FROM orders WHERE DATE(created_at) = '2024-01-15';

-- PostgreSQL CAN use the index for this:
SELECT * FROM orders
WHERE created_at >= '2024-01-15'
AND created_at < '2024-01-16';

The moment we wrap a column in a function, the index on that column becomes invisible to the planner. Write WHERE clauses so the column stands alone on one side. That said, if rewriting the query is not feasible, PostgreSQL also supports expression indexes, and we can index the expression directly:

-- Note: index expressions must be immutable (true for DATE() on timestamp, not on timestamptz)
CREATE INDEX ON orders (DATE(created_at));

Culprit #2: The N+1 Query, Death by a Thousand Cuts

This one is almost entirely an ORM problem, and it is beautiful in how invisible it is until it isn’t.

Here is the scenario. We are building a page that shows a list of orders, and for each order, we want to display the customer’s name. Our ORM fetches the orders in one query. Then, for each order, it fetches the customer in a separate query. If we have 100 orders on the page, we just made 101 database round-trips to render it.

At a small scale, this is annoying. At production scale, with 500 concurrent users each triggering 101 queries, we have 50,500 simultaneous queries hitting our database for what should be a single page load. PostgreSQL is executing every single one correctly and quickly. But the volume is what kills us.

We once worked on a dashboard that took 14 seconds to load. The cause was exactly this pattern.

How to catch it

Enable query logging temporarily and look for patterns. If we see the same query shape repeating hundreds of times with different parameter values, for example SELECT * FROM customers WHERE id = $1, we have an N+1 problem.
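
As a rough sketch of how to switch that logging on temporarily, assuming we can tolerate the overhead for a short diagnostic window (a threshold of 0 logs every statement):

-- Log every statement with its duration (expensive; keep the window short)
ALTER SYSTEM SET log_min_duration_statement = 0;
SELECT pg_reload_conf();

-- Revert when done
ALTER SYSTEM SET log_min_duration_statement = -1;
SELECT pg_reload_conf();

If pg_stat_statements is installed, its calls column surfaces the same repetition without touching the logs at all.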

The ORM knows how to do the JOIN; we just have to tell it to. The fix is ORM-specific, which means solving it requires understanding how the ORM translates object traversal into SQL, not just knowing what the SQL should look like.

-- What the ORM was doing (N+1):
SELECT * FROM orders LIMIT 100;
SELECT * FROM customers WHERE id = 1;
SELECT * FROM customers WHERE id = 2;
-- ... 98 more times

-- What it should do (1 query):
SELECT o.*, c.name, c.email
FROM orders o
JOIN customers c ON c.id = o.customer_id
LIMIT 100;

Culprit #3: Stale Statistics, The Planner Is Flying Blind

This one is sneaky because the query is written correctly, the index exists, and still somehow PostgreSQL chooses a terrible execution plan. What happened?

PostgreSQL’s query planner makes decisions based on statistics about our data. These statistics are collected by ANALYZE and stored in system catalogs. If the statistics are stale, the planner is making decisions based on outdated information.

How to check and fix it

-- Check statistics freshness
SELECT relname,
       last_analyze,
       last_autoanalyze,
       n_live_tup,
       n_dead_tup
FROM pg_stat_user_tables
ORDER BY last_analyze NULLS FIRST;

If last_analyze is NULL or very old for a heavily-used table, run ANALYZE table_name manually. For columns with heavily skewed distributions, increase the statistics target for that column:

ALTER TABLE orders
ALTER COLUMN status SET STATISTICS 500;
ANALYZE orders;

Autovacuum runs ANALYZE automatically, but it can fall behind on tables with very high write volume. If we have a table that receives thousands of inserts per minute, autovacuum’s default thresholds may not keep up. Check autovacuum settings for busy tables.
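
For a table that busy, the autovacuum thresholds can be tightened per table instead of globally. The values below are purely illustrative, not recommendations:

-- Make autovacuum and autoanalyze trigger sooner on a hot table
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor  = 0.02,
    autovacuum_analyze_scale_factor = 0.01
);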

Culprit #4: The Query That Runs Fine, Alone

Here is a situation every DBA has encountered. We run a query in psql, and it returns in 200 milliseconds. In production, under load, the same query takes 8 seconds. Nothing changed.

The usual cause: lock contention.

Lock contention

PostgreSQL uses locks extensively. A long-running transaction holds locks that block other queries. If our application has a background job that takes 30 seconds and locks a table, and 50 other queries need data from that table, those 50 queries queue up behind it.

The query isn’t slow. It is waiting.

-- Find queries currently waiting for locks
SELECT pid,
       now() - query_start AS wait_duration,
       state,
       wait_event_type,
       wait_event,
       query
FROM pg_stat_activity
WHERE wait_event IS NOT NULL
AND state = 'active'
ORDER BY wait_duration DESC;

If we see wait_event_type = ‘Lock’, we have a blocking problem. One useful nuance here: row-level lock waits do not appear directly as row locks in pg_locks. Instead, they are aggregated on the blocking transaction’s transaction ID lock. When digging deeper into pg_locks, look for waits on transaction IDs; that is how row-level contention surfaces.
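
To go from "something is waiting" to "this session is the blocker", pg_blocking_pids() does the mapping for us. A minimal sketch:

-- Pair each waiting backend with the session(s) blocking it
SELECT waiting.pid    AS waiting_pid,
       blocking.pid   AS blocking_pid,
       waiting.query  AS waiting_query,
       blocking.query AS blocking_query
FROM pg_stat_activity waiting
JOIN pg_stat_activity blocking
  ON blocking.pid = ANY (pg_blocking_pids(waiting.pid))
WHERE waiting.wait_event_type = 'Lock';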

The idle transaction time bomb

A particularly nasty variant: a transaction that was opened and then forgotten. It sits open, holding locks. Other queries pile up behind it.

-- Find idle transactions holding locks
SELECT pid,
       now() - xact_start AS idle_duration,
       state,
       query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
AND xact_start < now() - interval '30 seconds'
ORDER BY idle_duration DESC;

Idle-in-transaction sessions that are unusually long for the application’s expected behavior are almost always a bug. The right threshold depends entirely on workload: some applications legitimately hold transactions open for minutes, and batch jobs or certain distributed coordination patterns can go longer still. Set the timeout based on what the application actually does:

idle_in_transaction_session_timeout = 'required_threshold'   # e.g. '5min', sized to the longest legitimate transaction

One more nuance worth knowing: a transaction that has been opened but never assigned a transaction ID (XID) causes far less harm than one that has. An XID-less transaction does not hold row-level locks in the way that matters and does not block VACUUM from advancing. The concern is specifically transactions that have taken locks or been assigned an XID.
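
pg_stat_activity makes this distinction visible: backend_xid is the XID assigned to the session’s current transaction, if any. A quick way to see which open transactions have actually been assigned one:

-- Open transactions that have been assigned an XID
SELECT pid,
       state,
       backend_xid,
       now() - xact_start AS xact_age
FROM pg_stat_activity
WHERE backend_xid IS NOT NULL
ORDER BY xact_age DESC NULLS LAST;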

Culprit #5: EXPLAIN ANALYZE Exists, Use It

This is the skill that renders everything else optional. If we know how to read EXPLAIN ANALYZE output, we can diagnose any of the problems above in minutes.

The right way to run EXPLAIN

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT o.id, o.total, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE o.status = 'pending'
AND o.created_at > now() - interval '7 days';

Focus on:

  • rows= vs actual rows= : a large gap between the planner’s estimate and reality points to stale statistics
  • Seq Scan vs Index Scan: a sequential scan that filters a huge table down to a handful of rows points to a missing index
  • actual time= : where the milliseconds are actually being spent, node by node
  • Buffers: hit= vs read= : how much of the data came from shared buffers versus disk

What good output looks like vs. what bad output looks like

Here is a real before-and-after. Before adding an index, the planner does a full sequential scan, reading every row in the table:

-- BEFORE: No index on orders.status
Seq Scan on orders  (cost=0.00..48210.00 rows=432 width=48)
                   (actual time=0.043..312.819 rows=432 loops=1)
 Filter: ((status)::text = 'pending'::text)
Rows Removed by Filter: 2399568
Buffers: shared hit=12 read=23198
Planning Time: 0.4 ms
Execution Time: 313.1 ms

After adding CREATE INDEX ON orders(status), the planner switches to an index scan, skipping 2.4 million rows entirely:

-- AFTER: Index on orders.status
Index Scan using orders_status_idx on orders
               (cost=0.43..1891.44 rows=432 width=48)
               (actual time=0.031..2.761 rows=432 loops=1)
 Index Cond: ((status)::text = 'pending'::text)
Buffers: shared hit=437
Planning Time: 0.3 ms
Execution Time: 2.9 ms

313 ms down to 2.9 ms. The table didn’t change. The data didn’t change. We just told PostgreSQL where to look.

Culprit #6: Connection Exhaustion, PostgreSQL is Full

Every connection to PostgreSQL is a full OS process. At 100 connections, that is fine. At 1,000 connections, the server spends more time managing processes than running queries. Most teams respond by raising max_connections. This makes things worse. More connections means more memory pressure, which means slower everything.

The correct fix is a connection pooler. PgBouncer in transaction mode lets hundreds of application threads share a small number of actual database connections. Applications think they have 500 connections. PostgreSQL sees 20. Throughput goes up. Incidents go down.
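
A minimal PgBouncer sketch to show the shape of that setup; the database name, port, and pool sizes here are placeholders to adapt, not recommendations:

; pgbouncer.ini (illustrative)
[databases]
appdb = host=127.0.0.1 port=5432 dbname=appdb

[pgbouncer]
listen_port       = 6432
pool_mode         = transaction   ; server connection is shared per transaction
default_pool_size = 20            ; real PostgreSQL connections per database/user pair
max_client_conn   = 500           ; application-side connections PgBouncer will accept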

See connections and what they are doing

SELECT state, count(*)
FROM   pg_stat_activity
WHERE  datname = 'db_name'
GROUP BY state
ORDER BY count(*) DESC;

Thousands of idle connections are the smoking gun. The application is opening connections and not releasing them fast enough. Add PgBouncer before touching anything else. Raising max_connections without a pooler is the database equivalent of adding lanes to a highway to fix traffic. It does not work, and someone has written a paper about why.

Culprit #7: Reporting Queries Running on Production

Someone wrote a beautiful analytics query. Multiple CTEs, window functions, three JOINs, forty-five seconds to run. That is fine. Analytics are heavy. It is not fine to run it directly against the production OLTP database at 9 am when real users are trying to check out.

Long-running analytical queries hold snapshots that prevent VACUUM from cleaning up dead rows. They consume I/O that transactional queries need. They can starve the connection pool entirely. The query is not wrong. Its address is.

Find long-running queries in real time

SELECT pid, now() - query_start AS duration, state, LEFT(query, 200)
FROM   pg_stat_activity
WHERE  state = 'active'
AND    now() - query_start > interval '5 seconds'
ORDER BY duration DESC;

Where to route them instead

The instinct is to point reporting queries at a read replica. This is not as safe as it sounds. If hot_standby_feedback is enabled on the replica (it prevents the primary from vacuuming rows the replica is still reading), a long-running query on the replica causes the same VACUUM-blocking bloat problem back on the primary. We have moved the symptom, not fixed it.
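
To check whether a replica is running in that mode, and what horizon it is imposing back on the primary, something along these lines works:

-- On the replica: is feedback enabled?
SHOW hot_standby_feedback;

-- On the primary: what xmin horizon is each standby feeding back?
SELECT application_name, state, backend_xmin
FROM pg_stat_replication;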

Safer alternatives:

  • Logical replicas: Unlike streaming replicas, logical replicas do not have this coupling with the primary’s VACUUM behavior.
  • A dedicated reporting database: A separate instance with its own replication pipeline, fully decoupled from production.

Production is for serving users. Analytics need their own home. 

The Pattern We’ll Keep Seeing

If we work with PostgreSQL long enough, we start to notice that slow database complaints follow a pattern. The complaint arrives. We look at the query. We find one of these seven things. We fix it. The “slow database” becomes fast again. PostgreSQL hadn’t changed at all.

The hard part isn’t fixing these problems. The hard part is building the habit of investigating before concluding.

Before we blame the database, before we provision more hardware, before we start evaluating alternatives, an hour with pg_stat_activity, pg_stat_user_tables, and EXPLAIN ANALYZE is almost always time well spent.

PostgreSQL is remarkably good at what it does. It is not the bottleneck. 
