When querying large datasets in PostgreSQL, the combination of ORDER BY and LIMIT clauses can significantly influence performance. ORDER BY sorts the data, and LIMIT restricts the number of rows returned, but together they can create a performance bottleneck. Understanding how these operations interact and optimizing their use is crucial for maintaining efficient database performance and ensuring quick query responses.
In this blog, we are going to see how they can affect the performance of a query/
We have a simple table named person here is the structure of this table:
postgres=# \d person
Table "public.person"
Column | Type | Collation | Nullable | Default
--------+-----------------------+-----------+----------+------------------------------------
id | integer | | not null | nextval('person_id_seq'::regclass)
name | character varying(50) | | |
age | integer | | |
city | character varying(50) | | |
Indexes:
"person_pkey" PRIMARY KEY, btree (id)postgres=# select count(*) from person;
count
--------
100000
(1 row)postgres=# select * from person limit 1;
id | name | age | city
----+---------+-----+------
1 | Person1 | 29 | Taif
(1 row)
Let’s see how much time PostgreSQL will took to fetch all the rows without any index
postgres=# explain analyze select * from person;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Seq Scan on person (cost=0.00..1736.00 rows=100000 width=25) (actual time=0.014..12.892 rows=100000 loops=1)
Planning Time: 0.069 ms
Execution Time: 21.260 ms
(3 rows)
PostgreSQL performs a sequential scan (Seq Scan) to retrieve all rows from the person table. The execution plan shows that this process scans the entire table, which contains 100,000 rows, taking approximately 12.892 milliseconds. The total execution time, including planning, is 21.260 milliseconds. This approach reads each row sequentially, making it straightforward but potentially slow for large tables, as it does not use indexes to speed up data retrieval.
Now let’s see how much time PostgreSQL will require to get the first 5 records only
postgres=# explain analyze select * from person limit 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.09 rows=5 width=25) (actual time=0.009..0.010 rows=5 loops=1)
-> Seq Scan on person (cost=0.00..1736.00 rows=100000 width=25) (actual time=0.007..0.008 rows=5 loops=1)
Planning Time: 0.043 ms
Execution Time: 0.019 ms
(4 rows)
PostgreSQL takes just 0.019 ms to retrieve the first rows using LIMIT. In production environments, it’s crucial to limit the number of rows retrieved to only what is needed. This approach significantly enhances query performance compared to retrieving the entire dataset.
Suppose we want to get a list of the 5 youngest people in our dataset. We could run the following query:
postgres=# select * from person order by age limit 5;
id | name | age | city
-----+-----------+-----+--------
137 | Person137 | 20 | Taif
209 | Person209 | 20 | Riyadh
54 | Person54 | 20 | Jaddah
68 | Person68 | 20 | Riyadh
274 | Person274 | 20 | Jaddah
(5 rows)
Lets now see how much time it took to execute this query
postgres=# explain analyze select * from person order by age limit 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=3396.96..3396.98 rows=5 width=25) (actual time=68.293..68.295 rows=5 loops=1)
-> Sort (cost=3396.96..3646.96 rows=100000 width=25) (actual time=68.291..68.292 rows=5 loops=1)
Sort Key: age
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on person (cost=0.00..1736.00 rows=100000 width=25) (actual time=0.010..21.802 rows=100000 loops=1)
Planning Time: 0.065 ms
Execution Time: 68.311 ms
(7 rows)
In this query, PostgreSQL retrieves the first 5 rows from the person table, ordered by the age column. The execution plan reveals an inefficient process: PostgreSQL performs a sequential scan of the entire table (Seq Scan), taking about 21.8 milliseconds to read all 100,000 rows. It then sorts these rows by age using a top-N heapsort, which takes approximately 46.4 milliseconds.
Finally, it applies the LIMIT 5 clause to return the first 5 sorted rows, resulting in a total execution time of 68.311 milliseconds. This approach is inefficient because it scans and sorts the entire dataset, which is time-consuming and resource-intensive, especially for large tables.
Let’s see how an index can help to speed this up let’s create an index on the age column
postgres=# create index on person (age);
CREATE INDEX
postgres=#
postgres=# explain analyze select * from person order by age limit 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.29..0.53 rows=5 width=25) (actual time=0.034..0.039 rows=5 loops=1)
-> Index Scan using person_age_idx on person (cost=0.29..4798.73 rows=100000 width=25) (actual time=0.033..0.036 rows=5 loops=1)
Planning Time: 0.293 ms
Execution Time: 0.052 ms
(4 rows)
After creating an index on the age column of the person table, the query to retrieve the first 5 rows ordered by age becomes significantly more efficient. The execution plan shows that PostgreSQL uses an Index Scan person_age_idx, which directly accesses the rows in the desired order. This eliminates the need for a full table scan and explicit sorting. The Index Scan retrieves the first 5 rows in just 0.036 milliseconds, resulting in a total execution time of 0.052 milliseconds.
This efficient method drastically reduces the query time compared to the previous approach, as it leverages the index to quickly find and return the sorted rows, demonstrating the performance benefits of indexing for such queries.
Understanding the performance impact of using ORDER BY with LIMIT in PostgreSQL is essential for optimizing query efficiency. By leveraging indexes and limiting result sets, developers can enhance database performance and ensure faster query responses.