An Introduction to Indexing and Partitioning in PostgreSQL

When it comes to managing large and complex databases in PostgreSQL, an important decision you’ll face is how to optimize your data storage and retrieval strategies. Two common techniques for improving database performance and manageability are indexing and partitioning in PostgreSQL.

What is Indexing in PostgreSQL?

In PostgreSQL indexing is a common database performance optimization technique. Using indexes, a database server can find and retrieve specific rows much faster than it could without them. Indexes are particularly useful when optimizing queries involving conditions in WHERE clauses, JOIN operations, or ORDER BY clauses. They reduce the need for scanning the entire table and reduce disk I/O, leading to significant performance improvements.

Types of Indexes in PostgreSQL:

PostgreSQL offers various types of indexes, including B-tree, Hash, GiST, GIN, and SP-GiST. Each type has its own strengths and use cases. B-tree indexes are the most commonly used and are suitable for most common scenarios.

B-Tree Index

B-tree(Balanced Tree) indexes are the default index in PostgreSQL. B-Tree indexes are suitable for equality and range queries on sortable data. PostgreSQL query planner will use a B-tree index for comparisons on an indexed column using the operators <, <=, =, >=, >, this includes BETWEEN and IN which are combinations of these operators. B-Tree index can also be used for pattern matching using LIKE and ~ operators.

BRIN (Block Range Indexes):

BRIN indexes divide the table into blocks with each block containing a range of values. BRIN indexes store the maximum and minimum values from within the block instead of storing index entry for each row. BRIN indexes ideal for large tables with sorted data. BRIN indexes are smaller in size compared to other indexes.

Partial Index

Partial index is built over only a subset of the table, which is defined using conditional expressions. The index only has entries that satisfy the conditional expression, which in turn helps reduce the size of the index.

Composite Index

Composite indexes are indexes defined on more than one column of the table, also called multicolumn index. . When defining composite indexes, columns most frequently used in WHERE clause should be placed left in the column list and the columns used less frequently should be placed towards the right.

Hash Index

Hash index stores 32-bit hash code from the indexed column value. Hash indexes are ideal for handling simple equality comparisons using the = operator. Hash indexes are not suitable when dealing with sorting or range queries.

Covering Index

Covering indexes allows us to include additional columns to the index using the INCLUDE keyword. This enables us to perform index-only scans when the columns specified in SELECT queries are already included in the index i.e. SELECT clause will get data directly from the index pages. While this can significantly reduce disk I/O, it is important to know that columns added to index still use space on disk.

PostgreSQL also offers GIST, SP-GIST and GIN indexes which are useful when dealing with complex data types such as Arrays, JSONB Documents and Geometrical shapes.

When To Use Indexes

Faster Data Retrieval: Indexes can lead to significant performance improvements as they help to find and retrieve small percentage of rows much faster instead of scanning the entire table which reduces disk I/O.
Improve Query Performance: When a query includes conditions in the WHERE clause, indexes can help identify and access the rows that meet the specified criteria. Indexes can also speed up queries involving joins.
Efficient Sorting: B-Tree index, by default store entries in an ascending order, due to which B-Tree indexes can return sorted data without an extra step for sorting. This makes ORDER BY clauses more efficient.

Drawbacks of Indexing

Write Overhead: Indexes speed up read operations but can slow down writes, as the index must be updated with each insert, update, or delete.
Disk Space: Indexes consume additional disk space, so disk usage is a factor to be mindful of.
HOT Prevention: PostgreSQL can avoid index changes when an update is performed on non-indexed columns using an optimization technique called heap-only-tuples (HOT), but when there are frequent updates on indexed columns PostgreSQL cannot benefit from HOT optimization which also increases disk I/O.

Indexing Best Practices

- Rebuild Indexes: Indexes require maintenance to ensure performance over time. REINDEX and REINDEX CONCURRENTLY can be used to remove bloat from index to ensure performance.
- Review Index Usage: Indexes can become less effective over time as the data increases and query patterns change, hence it is important to review and optimize indexes according to query patterns. PostgreSQL has a pg_stat_all_indexes view which provides us with detailed statistics about all indexes in our database.
- Don’t Index on Low Cardinality Columns: Avoid indexing columns with very few distinct values to prevent potential performance degradation, instead index on columns which have a high number of distinct values(High Cardinality).
- Don’t Over Index: Performance can significantly deteriorate due to over indexing because with every INSERT, UPDATE and DELETE index would be updated. It is advised to not create indexes on every column.
- Update Statistics when an index is added or removed: Statistics must be updated after adding or removing an index using ANALYZE command on the table, otherwise performance could be poor as query optimizer won’t have accurate statistics about values in indexed columns.

Partitioning In PostgreSQL

Partitioning is a data organization technique that involves splitting a large table into smaller, more manageable pieces called partitions. Each partition holds a subset of the data and is usually based on a specific criteria, such as a range of values or a list of values.

Types of Partitioning

Range Partitioning:

In range partitioning tables are partitioned based on specific value ranges inside a column. Range partitioning can help improve query performance as data can be efficiently retrieved from specific ranges. Range partitioning is ideal when dealing with time-series data or any sorted data.

List Partitioning:

In list partitioning table is partitioned on the basis of explicitly listed values. List partitioning can be useful when data can be categorized into distinct values.

Hash Partitioning:

In hash partitioning table is partitioned on the basis of a hash value. Hash partitioning is useful when we don’t have a partition key like range or value. Hash partitioning evenly distributes data across partitions.

When To Use Partitioning

Large Tables: Partitioning can be very useful when dealing with large tables. With increase in table size, maintenance overhead also increases, causing performance degradation. With partitioning, you only need to maintain smaller partitions which would avoid performance degradation.
Frequent Bulk Deletion: If usage pattern is taken into consideration while designing partitions, bulk deletion can be achieved by simply dropping a partition which is way faster than bulk deletion. Dropping a partition also avoids vacuum overhead caused by bulk deletion.
Query Performance: When used correctly, partitioning can improve query performance by limiting the amount of data that needs to be scanned. Queries that target specific partitions can be significantly faster than queries that scan an entire unpartitioned table because partition pruning excludes partitions that don’t need to be scanned from the query plan based on the query’s WHERE clause.

Drawbacks of partitioning

Full Scans: If your partition key is not used in the WHERE clause of a query, you’ll end up scanning all partitions which can result in poor query performance.
Too many partitions: Having too many will increase planning time and could negatively affect query time.
Non uniform partitioning: If your partitions are too large, then you won’t be able to use ranges to exclude data, and the effectiveness of partition pruning will be minimized.

Partitioning Best Practices

Choose Right Partition Size: Partition size must not be too large or too small as it could negatively affect performance. It is also advised to keep partition size uniform so that performance stays consistent across partitions.
Optimize Queries: Make sure that only the necessary partitions are scanned, this can be done by analyzing query execution plan.
Choose the right partition key: The partition key shall align with the query patterns. For improved performance the partition key must be in the WHERE clause.
Use Appropriate Storage: When saving partitions on different storage media, make sure that the frequently accessed partitions are on faster storage, whereas the less frequently accessed partitions should be on slower and cheaper storage. It is also possible to specify a separate tablespace and storage parameters for each partition.

Conclusion

In PostgreSQL, both indexing and partitioning are essential techniques for optimizing database performance and manageability. The choice between the two primarily depends on your specific use case and the characteristics of your data