Optimizing PostgreSQL with Composite and Partial Indexes: A Quick Comparison

Indexes are crucial for accelerating database queries and enhancing the performance of your PostgreSQL applications. However, not all indexes function the same way. Composite and partial indexes are two common types, each with distinct purposes and effects on performance. In this blog, we’ll dive into what composite and partial indexes are, how they operate, and when to use them to achieve the best results for your database.

Composite indexes 

A composite index is created on multiple columns, enabling PostgreSQL to efficiently process queries that include several columns in their search conditions. This type of index is especially beneficial when queries frequently filter or sort data using multiple fields. For instance, an index on both “last_name” and “first_name” can accelerate searches where both names are specified in the query.

Advantages of using composite indexes

  • Composite indexes can serve as covering indexes, meaning they include all the columns needed by a query. This allows the database to retrieve data directly from the index without looking at the main table, reducing the number of I/O operations required.
  • If we often run queries that sort by “last_name” and “first_name,” a composite index on these columns will speed them up by providing a pre-sorted order, making sorting quicker and more efficient.
  • We can also use a composite index with a UNIQUE constraint on the same columns to ensure both faster performance and data integrity.
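The covering-index and UNIQUE points above can be sketched as follows; the table and column names here are hypothetical, chosen only for illustration:

```sql
-- Hypothetical: enforce uniqueness and speed up lookups with one index
CREATE UNIQUE INDEX idx_people_name_uniq ON people (last_name, first_name);

-- On PostgreSQL 11+, non-key columns can be appended with INCLUDE so the
-- index alone can satisfy a query (an index-only scan), without visiting
-- the main table:
CREATE INDEX idx_people_name_email ON people (last_name, first_name)
    INCLUDE (email);
```

With the second index, a query such as `SELECT email FROM people WHERE last_name = 'Smith' AND first_name = 'Anna'` can be answered entirely from the index.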

Disadvantages of using composite indexes

  • If a composite index includes columns that are rarely used together in queries, it won’t boost performance and may even slow it down due to the extra work required to update the index.
  • If any column in a composite index is updated frequently, the index must be modified often, which can negatively impact performance.
  • Frequent updates to the index from INSERT or UPDATE queries can increase storage use and require more maintenance.
  • The order of columns in a composite index is crucial; PostgreSQL can only use the index efficiently if the query starts with the first column or matches the first few columns in order.
  • Composite indexes work best on columns with many distinct values; if the indexed columns have only a few unique values, the index may not be effective.
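The column-order point above can be illustrated with a hypothetical people table (names are assumptions, not from the examples below):

```sql
-- Hypothetical table and index
CREATE INDEX idx_people_name ON people (last_name, first_name);

-- These queries can use the index, because the leading column
-- (last_name) appears in the condition:
SELECT * FROM people WHERE last_name = 'Smith';
SELECT * FROM people WHERE last_name = 'Smith' AND first_name = 'Anna';

-- This query typically cannot use the index efficiently, because
-- the leading column is missing from the condition:
SELECT * FROM people WHERE first_name = 'Anna';
```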

Composite index example

Let’s consider a simple table for storing sales data, which has a primary key but no other indexes. Here’s how this table might look:

postgres=# \d sales
                                    Table "public.sales"
   Column    |     Type      | Collation | Nullable |                Default                 
-------------+---------------+-----------+----------+----------------------------------------
 sale_id     | integer       |           | not null | nextval('sales_sale_id_seq'::regclass)
 customer_id | integer       |           | not null | 
 product_id  | integer       |           | not null | 
 sale_date   | date          |           | not null | 
 amount      | numeric(10,2) |           | not null | 
Indexes:
    "sales_pkey" PRIMARY KEY, btree (sale_id)

Let’s execute a simple SELECT query to fetch all sales for product_id = 408 where the sale_date is 2024-08-17.

postgres=# EXPLAIN ANALYZE SELECT * FROM sales WHERE product_id = 408 AND sale_date = '2024-08-17';
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..7310.30 rows=3 width=22) (actual time=4.374..19.059 rows=4 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on sales  (cost=0.00..6310.00 rows=1 width=22) (actual time=9.093..14.098 rows=1 loops=3)
         Filter: ((product_id = 408) AND (sale_date = '2024-08-17'::date))
         Rows Removed by Filter: 166665
 Planning Time: 0.251 ms
 Execution Time: 19.078 ms
(8 rows)

The PostgreSQL planner chose to use a parallel sequential scan to fetch the required result, which makes sense since there are no indexes yet. This took 19.078 ms to complete.

Now, let’s create a composite index on the product_id and sale_date columns in the same table.

CREATE INDEX idx_sales_product_id_sale_date ON sales(product_id, sale_date);
CREATE INDEX

Now, let’s run the same SELECT query again.

EXPLAIN ANALYZE SELECT * FROM sales WHERE product_id = 408 AND sale_date = '2024-08-17';
                                                              QUERY PLAN                                                               
---------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on sales  (cost=4.45..16.22 rows=3 width=22) (actual time=0.048..0.055 rows=4 loops=1)
   Recheck Cond: ((product_id = 408) AND (sale_date = '2024-08-17'::date))
   Heap Blocks: exact=4
   ->  Bitmap Index Scan on idx_sales_product_id_sale_date  (cost=0.00..4.45 rows=3 width=0) (actual time=0.045..0.045 rows=4 loops=1)
         Index Cond: ((product_id = 408) AND (sale_date = '2024-08-17'::date))
 Planning Time: 0.265 ms
 Execution Time: 0.074 ms
(7 rows)

Wow! The execution time dropped from 19.078 ms to 0.074 ms, roughly a 258x performance improvement.
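If you want to confirm that a new index is actually being used over time, PostgreSQL’s statistics views can help. A minimal check against the standard pg_stat_user_indexes view (the counts you see will depend on your own workload):

```sql
-- idx_scan counts how many times each index on the table has been scanned
SELECT indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE relname = 'sales';
```

An index whose idx_scan stays at zero for a long time is a candidate for removal, since it costs storage and write overhead without helping reads.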

Partial indexes 

Partial indexes index only a subset of data that meets specific conditions, rather than covering all rows in a table. They are perfect for cases where only a portion of the data is frequently queried, such as indexing only active users or recent transactions. By targeting a specific subset of data, partial indexes can reduce storage needs and enhance query performance.

Advantages of using partial indexes

  • Partial indexes cover only a part of the table, leading to lower storage costs compared to full indexes.
  • Their smaller size requires less time and resources to maintain during inserts, updates, and deletes.
  • UPDATE operations are generally faster because the partial index doesn’t need to be updated with every change.
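The “active users” case mentioned earlier can be sketched as follows; the table and column names are hypothetical:

```sql
-- Index only the rows that are queried often; inactive users never
-- enter the index, keeping it small and cheap to maintain
CREATE INDEX idx_users_active_email ON users (email) WHERE active;

-- This query can use the partial index, because its condition
-- implies the index predicate (active):
SELECT * FROM users WHERE active AND email = 'a@example.com';
```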

Disadvantages of using partial indexes

  • Partial indexes don’t cover all the data, so joins or filters on non-indexed data can lead to slower performance as the table size increases.
  • Partial indexes are useful only if the indexed subset is frequently queried; otherwise, they may not improve performance if the queries don’t match the index conditions.
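The predicate-matching caveat above can be illustrated with a hypothetical partial index on a users table:

```sql
-- Assume this partial index exists:
-- CREATE INDEX idx_users_active_email ON users (email) WHERE active;

-- This query cannot use the partial index: the planner cannot prove
-- that the requested rows satisfy the index predicate (active),
-- so it falls back to other access paths
SELECT * FROM users WHERE email = 'a@example.com';
```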

Partial index example

Let’s create a table to store COVID-19 data and insert 3 million dummy records into it.

CREATE TABLE IF NOT EXISTS covid_data (
    id SERIAL PRIMARY KEY,
    country varchar(20),
    title varchar(10),
    names varchar(20),
    vaccinated varchar(3)
);
CREATE TABLE

postgres=# INSERT INTO covid_data (country, title, names, vaccinated)
postgres-# SELECT
-- Random country from a predefined list
(ARRAY['USA', 'Canada', 'UK', 'Germany', 'France', 'India', 'China', 'Brazil', 'Australia', 'Japan'])[floor(random() * 10 + 1)],
-- Random title from a predefined list
(ARRAY['Mr.', 'Ms.', 'Dr.', 'Prof.'])[floor(random() * 4 + 1)],
-- Random name from a predefined list
(ARRAY['John', 'Jane', 'Alex', 'Emily', 'Michael', 'Sarah', 'David', 'Laura', 'Robert', 'Linda'])[floor(random() * 10 + 1)],
-- Random vaccinated status ('Yes' or 'No')
CASE
  WHEN random() < 0.8 THEN 'Yes'  -- 80% chance of being 'Yes'
  ELSE 'No'  -- 20% chance of being 'No'
END
FROM generate_series(1, 3000000);

First, let’s run the SELECT query with EXPLAIN ANALYZE to analyze the execution plan and get the results.

postgres=# EXPLAIN ANALYZE SELECT * FROM covid_data WHERE vaccinated = 'Yes' AND country = 'UK' AND title = 'Prof.';
                                                          QUERY PLAN                                                           
-------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..27738.89 rows=1 width=174) (actual time=0.361..450.847 rows=59916 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on covid_data  (cost=0.00..26738.79 rows=1 width=174) (actual time=0.075..432.697 rows=19972 loops=3)
         Filter: (((vaccinated)::text = 'Yes'::text) AND ((country)::text = 'UK'::text) AND ((title)::text = 'Prof.'::text))
         Rows Removed by Filter: 980028
 Planning Time: 0.096 ms
 Execution Time: 456.062 ms
(8 rows)

Since there is no index, a sequential scan was used, and it returned the result in 456 ms.

Now, let’s create three indexes, one for each column used in the SELECT query.

postgres=# CREATE INDEX vaccinated_full_idx ON covid_data(vaccinated);
CREATE INDEX
postgres=# CREATE INDEX country_full_idx ON covid_data(country);
CREATE INDEX
postgres=# CREATE INDEX title_full_idx ON covid_data(title);
CREATE INDEX

Now, let’s run the same SELECT query again and check if the execution time is reduced.

postgres=# EXPLAIN ANALYZE SELECT * FROM covid_data WHERE vaccinated = 'Yes' AND country = 'UK' AND title = 'Prof.';
                                                                   QUERY PLAN                                                                   
------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on covid_data  (cost=495.29..499.31 rows=1 width=174) (actual time=100.977..229.152 rows=59435 loops=1)
   Recheck Cond: (((title)::text = 'Prof.'::text) AND ((country)::text = 'UK'::text) AND ((vaccinated)::text = 'Yes'::text))
   Heap Blocks: exact=19605
   ->  BitmapAnd  (cost=495.29..495.29 rows=1 width=0) (actual time=97.916..97.917 rows=0 loops=1)
         ->  Bitmap Index Scan on title_full_idx  (cost=0.00..164.93 rows=15000 width=0) (actual time=23.191..23.192 rows=749479 loops=1)
               Index Cond: ((title)::text = 'Prof.'::text)
         ->  Bitmap Index Scan on country_full_idx  (cost=0.00..164.93 rows=15000 width=0) (actual time=11.334..11.334 rows=299158 loops=1)
               Index Cond: ((country)::text = 'UK'::text)
         ->  Bitmap Index Scan on vaccinated_full_idx  (cost=0.00..164.93 rows=15000 width=0) (actual time=62.001..62.001 rows=2400565 loops=1)
               Index Cond: ((vaccinated)::text = 'Yes'::text)
 Planning Time: 0.477 ms
 Execution Time: 232.855 ms
(12 rows)

postgres=# SELECT pg_size_pretty(pg_relation_size('title_full_idx'));
 pg_size_pretty 
----------------
 20 MB
(1 row)

postgres=# SELECT pg_size_pretty(pg_relation_size('country_full_idx'));
 pg_size_pretty
----------------
 20 MB
(1 row)

postgres=# SELECT pg_size_pretty(pg_relation_size('vaccinated_full_idx'));
 pg_size_pretty 
----------------
 20 MB
(1 row)

This time, the execution time decreased to 232 ms, resulting in a 1.9x performance improvement. PostgreSQL used nearly 60 MB to store these indexes.

Now, let’s use a partial index to see if we can optimize both speed and space and achieve better results.

Create a partial index

postgres=# CREATE INDEX vaccinated_partial_idx ON covid_data(vaccinated) WHERE vaccinated = 'Yes' AND country = 'UK' AND title = 'Prof.';
CREATE INDEX

Now execute the same SELECT query

postgres=# EXPLAIN ANALYZE SELECT * FROM covid_data WHERE vaccinated = 'Yes' AND country = 'UK' AND title = 'Prof.';
                                                                QUERY PLAN                                                                
------------------------------------------------------------------------------------------------------------------------------------------
 Index Scan using vaccinated_partial_idx on covid_data  (cost=0.29..8.30 rows=1 width=174) (actual time=0.023..79.981 rows=60644 loops=1)
 Planning Time: 0.237 ms
 Execution Time: 83.855 ms
(3 rows)

postgres=# SELECT pg_size_pretty(pg_relation_size('vaccinated_partial_idx'));
 pg_size_pretty 
----------------
 424 kB
(1 row)

Wow! With the partial index, the query completed in 83.855 ms, a 5.4x improvement over the original sequential scan (and roughly 2.8x faster than the three full indexes), while the index itself takes just 424 kB, about 99.3% less storage than the roughly 60 MB consumed by the three full indexes.
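One thing to note: the partial index above is tailored to this exact query, since its predicate pins down all three columns. A slightly more reusable variant (an assumption for illustration, not part of the example above) would keep only the most selective condition in the predicate and move the other queried columns into the index key:

```sql
-- Hypothetical variant: any query filtering on country and/or title
-- within the vaccinated subset can use this one index
CREATE INDEX covid_vaccinated_country_title_idx
    ON covid_data (country, title)
    WHERE vaccinated = 'Yes';
```

This index would be larger than the highly specific one, but it can serve many different country/title combinations instead of a single hard-coded query.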
