
Pitfalls of using SELECT *

Let’s say you’re writing SQL code and need to retrieve data from one or more tables. It’s easy to use SELECT * to fetch all columns at once and then process the data on the client side. But hold on! Let’s examine why SELECT * might not be the best option. In this blog, we’ll dive into the drawbacks of using SELECT *, from increased network traffic and heavier resource usage in sorting and hashing to TOAST overhead and missed index-only scans.

Network Traffic

With SELECT *, columns the application never uses still add to the volume of information transferred between the database server and your application server or client machine. Increased network traffic translates to higher consumption of network bandwidth and resources, which can impact the performance of other applications and services sharing the same network. Moreover, this increased traffic can incur higher costs for network infrastructure and data transfer.
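
As a simple illustration, listing only the columns the application actually needs keeps the result set, and therefore the traffic, small. A minimal sketch, using hypothetical table and column names:

-- Pulls every column across the network, including ones the client never reads
SELECT * FROM orders WHERE customer_id = 42;

-- Transfers only what the application actually uses
SELECT order_id, status, total_amount FROM orders WHERE customer_id = 42;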

Sorting and Hashing 

Database operations such as sorting and hashing scale with the volume of data they handle. When SELECT * is used, it increases the data size, making these operations more resource-intensive.

For instance, the sorting algorithm has to deal with a larger dataset, requiring more memory (work_mem) and computational power. This can lead to slower sort times and elevated CPU usage, impacting the overall efficiency of the database. The larger row width can also result in inefficient execution plans and the selection of a suboptimal sorting algorithm for the given query; for example, an external merge sort tends to be slower than an in-memory quicksort.
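
The effect is easy to observe with EXPLAIN. The sketch below assumes a pgbench database (the same one used in the examples later in this post) and a work_mem small enough that the wider sort spills to disk; compare the "Sort Method" lines in the two plans (external merge vs. in-memory quicksort):

-- Sorting full rows: larger sort input, more likely to spill to disk
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM pgbench_accounts ORDER BY abalance;

-- Sorting only the needed columns: smaller sort input, more likely to stay in memory
EXPLAIN (ANALYZE, BUFFERS) SELECT aid, abalance FROM pgbench_accounts ORDER BY abalance;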

Similarly, hashing operations, such as grouping or joining data based on hash keys, can also suffer from the increased data size caused by SELECT *. When a hash table is built, hash values and the corresponding rows are inserted into it. With a larger dataset, this process can quickly exhaust the available memory and fall back to disk-based hashing, resulting in slower processing times.
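
The same idea applies to hash joins. A minimal sketch with hypothetical table names; if the planner builds its hash table over the wide side of the join, SELECT * forces every output column into that hash table, and a spill to disk shows up as "Batches" greater than 1 in the plan:

-- Wide rows are carried into the hash table
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id;

-- Only the referenced columns are carried through the hash
EXPLAIN (ANALYZE, BUFFERS)
SELECT o.order_id, c.name FROM orders o JOIN customers c ON o.customer_id = c.id;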

More on hash operations in PostgreSQL:

Understanding Hash aggregates and Hash Joins in PostgreSQL

TOAST Columns

PostgreSQL utilizes the TOAST (The Oversized-Attribute Storage Technique) mechanism to efficiently store large values such as text, jsonb, or binary (bytea) data. Postgres compresses these values using an internal algorithm by default. If the compressed values remain too large, Postgres relocates them to a separate table known as the TOAST table, while maintaining pointers in the original table.

When SELECT * is used, PostgreSQL retrieves all columns from the primary table, including any toasted columns. Retrieving toasted columns may require decompression, potentially consuming additional CPU cycles. Moreover, if the toasted values reside in a separate table, there is an additional cost associated with fetching tuples from there.
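
You can check from SQL whether a table has a TOAST table at all, and how large a stored value really is. A minimal sketch, with my_table and big_json_col as placeholder names:

-- The TOAST table backing a given table (shows "-" if there is none)
SELECT reltoastrelid::regclass AS toast_table
FROM pg_class
WHERE relname = 'my_table';

-- Stored (possibly compressed or out-of-line) size vs. detoasted length in bytes
SELECT pg_column_size(big_json_col) AS stored_bytes,
       octet_length(big_json_col::text) AS detoasted_bytes
FROM my_table
LIMIT 5;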

Index-only scans

Indexes serve as efficient lookup structures based on specific columns. Indexes are typically smaller and more efficiently organized than the actual table data. When a query requires only data present in the index, PostgreSQL can execute an index-only scan. However, three conditions must be met for an index-only scan to occur:

  1. The index type must support index-only scans. While B-tree indexes universally support index-only scans, GiST and SP-GiST indexes only support them for select operator classes.
  2. The query should only reference columns stored in the index.
  3. The visibility map (the “vm” fork) tracks which heap pages contain only tuples visible to all transactions. For an index-only scan to skip heap access, the visibility map must confirm that the tuples referenced by the index are all-visible (see the note after this list).
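
In practice, the third condition is usually satisfied on tables that are vacuumed regularly, since VACUUM is what updates the visibility map. When experimenting, it can help to vacuum the table first so that the planner is more likely to choose an index-only scan with zero heap fetches:

VACUUM (ANALYZE) pgbench_accounts;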

An index-only scan retrieves the needed data directly from the index, avoiding access to the actual table data. However, using SELECT * in a query can rule out an index-only scan, because it instructs the database to fetch all columns, including those not stored in the index. Retrieving table data typically involves additional disk reads, while an index-only scan requires less I/O and also reduces the memory footprint.

For example:

Query # 1

postgres=# explain (analyze,buffers) select * from pgbench_accounts where aid between 1 and 10000;
                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.56..344.10 rows=8702 width=97) (actual time=0.015..1.523 rows=10000 loops=1)
   Index Cond: ((aid >= 1) AND (aid <= 10000))
   Buffers: shared hit=195
 Planning:
 Planning Time: 0.093 ms
 Execution Time: 2.890 ms

Query # 2

postgres=# explain (analyze,buffers) select aid from pgbench_accounts where aid between 1 and 10000;
                                                              QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.56..201.00 rows=8702 width=4) (actual time=0.020..1.063 rows=10000 loops=1)
   Index Cond: ((aid >= 1) AND (aid <= 10000))
   Heap Fetches: 0
   Buffers: shared hit=32
 Planning:
 Planning Time: 0.095 ms
 Execution Time: 1.428 ms

Here, we can compare the shared hit values of both queries. The first query reads 195 buffers of 8KB each, totaling 1560KB, with an execution time of 2.890ms. The second query reads only 32 buffers (256KB), and the execution time drops to 1.428ms.
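
If a query genuinely needs one or two columns beyond the index key, a covering index can sometimes preserve the index-only scan. A minimal sketch, assuming PostgreSQL 11 or newer and a query that also reads abalance:

-- Store abalance in the index as a non-key (INCLUDE) column so that
-- "SELECT aid, abalance ... WHERE aid BETWEEN ..." can still be answered
-- from the index alone
CREATE INDEX pgbench_accounts_aid_incl
    ON pgbench_accounts (aid) INCLUDE (abalance);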



Comments (2)

  • Bruce Momjian

    April 3, 2024 - 7:16 pm

    Great summary. I had not considered some of these items, like memory used for sorts.

    • Spokey

      April 5, 2024 - 1:20 pm

      Same, Bruce!

      Muhammad, I’d go a bit further than SELECT * too – I think all SQL should be reviewed to make sure that the columns requested are all required – you only need one superfluous column to break an index-only scan, for instance.

      Great read, though, thank you!
