Exploring-the-Power-of-Partitioning-in-PostgreSQL-2-opt

Exploring the Power of Partitioning in PostgreSQL

PostgreSQL is renowned for its exceptional performance in managing data. One of its standout features is partitioning, a technique that divides large datasets into smaller, more manageable segments. Partitioning provides several benefits, including improved query performance, streamlined data management, and enhanced scalability. By organizing data into partitions, PostgreSQL can execute searches more efficiently and handle tasks with greater ease. 

In this blog, we will delve into the details of partitioning in PostgreSQL, exploring its various types, advantages, and drawbacks. We’ll uncover how partitioning can revolutionize data management and decision-making processes in database environments.

Types of Partitioning 

Range partitioning

Range partitioning in PostgreSQL involves dividing a table based on ranges of values in a specified column, such as dates or numeric ranges. Each partition stores data falling within a defined range, enabling efficient data retrieval and management. 

This method is particularly useful for organizing time-series data or numeric ranges.

Create a partition table named sales 

CREATE TABLE sales (
id SERIAL,
sale_date DATE NOT NULL,
amount NUMERIC,
PRIMARY KEY (id, sale_date)
) PARTITION BY RANGE (sale_date);

We are going to create the following partition tables that store respective data

  • sales_jan2023
  • sales_feb2023
  • sales_mar2023
CREATE TABLE sales_jan2023 PARTITION OF sales 
FOR VALUES FROM ('2023-01-01') TO ('2023-02-01');

CREATE TABLE sales_feb2023 PARTITION OF sales
FOR VALUES FROM ('2023-02-01') TO ('2023-03-01');

CREATE TABLE sales_mar2023 PARTITION OF sales
FOR VALUES FROM ('2023-03-01') TO ('2023-04-01');

Let’s populate the sales table with data, and afterward, we’ll examine the partition utilized when executing a SELECT query on the sales table. We’ll use the EXPLAIN ANALYZE command to analyze which partition is accessed to retrieve the specified date.

INSERT INTO sales (sale_date, amount) VALUES
   ('2023-01-15', 100.00),
   ('2023-02-20', 200.00),
   ('2023-03-10', 150.00);

Let’s attempt to run SELECT queries and observe which partition it accesses.

explain analyze select * from sales;
                                                       QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
Append  (cost=0.00..84.00 rows=3600 width=40) (actual time=0.008..0.015 rows=3 loops=1)
  ->  Seq Scan on sales_jan2023 sales_1  (cost=0.00..22.00 rows=1200 width=40) (actual time=0.008..0.008 rows=1 loops=1)
  ->  Seq Scan on sales_feb2023 sales_2  (cost=0.00..22.00 rows=1200 width=40) (actual time=0.003..0.003 rows=1 loops=1)
  ->  Seq Scan on sales_mar2023 sales_3  (cost=0.00..22.00 rows=1200 width=40) (actual time=0.002..0.002 rows=1 loops=1)
Planning Time: 0.096 ms
Execution Time: 0.028 ms
(6 rows)

explain analyze SELECT * FROM sales WHERE sale_date BETWEEN '2023-01-01' AND '2023-01-31';
                                                 QUERY PLAN                                                   
---------------------------------------------------------------------------------------------------------------
Seq Scan on sales_jan2023 sales  (cost=0.00..28.00 rows=6 width=40) (actual time=0.009..0.010 rows=1 loops=1)
  Filter: ((sale_date >= '2023-01-01'::date) AND (sale_date <= '2023-01-31'::date))
Planning Time: 0.094 ms
Execution Time: 0.020 ms
(4 rows)

In the EXPLAIN ANALYZE output, the partitions used for accessing data are indicated by the table names. In the first query, the “sales_jan2023”, “sales_feb2023”, and “sales_mar2023” tables represent the partitions accessed during the scan. Similarly, in the second query, the “sales_jan2023” table is utilized, as shown in the “QUERY PLAN” section.

List Partitioning

List partitioning in PostgreSQL involves dividing a table based on specific values in a designated column. Each partition contains rows with values matching predefined lists. This method allows for precise control over data distribution, enabling efficient data retrieval and management based on predefined criteria.

Lets create a partition table named customer_orders with list partitioning 

CREATE TABLE customer_orders (
   order_id SERIAL,
   customer_id INT NOT NULL,
   order_date DATE NOT NULL,
   region TEXT NOT NULL,
   amount NUMERIC,
   PRIMARY KEY (order_id, region)
) PARTITION BY LIST (region);

We are going to create the following partition tables that store respective data

  • customer_orders_south
  • customer_orders_east
  • customer_orders_west
  • customer_orders_north
CREATE TABLE customer_orders_north PARTITION OF customer_orders
FOR VALUES IN ('North');

CREATE TABLE customer_orders_south PARTITION OF customer_orders
  FOR VALUES IN ('South');

CREATE TABLE customer_orders_east PARTITION OF customer_orders
  FOR VALUES IN ('East');

CREATE TABLE customer_orders_west PARTITION OF customer_orders
  FOR VALUES IN ('West');

Inset data inside customer_orders table

INSERT INTO customer_orders (customer_id, order_date, region, amount) VALUES
(1, '2023-01-15', 'North', 100.00),
  (2, '2023-02-20', 'South', 200.00),
  (3, '2023-03-10', 'East', 150.00),
  (4, '2023-04-05', 'West', 250.00);

Let’s attempt a query to locate all orders within the North region and observe which partition it accesses.

explain analyze SELECT * FROM customer_orders WHERE region = 'North';
                                                          QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
Seq Scan on customer_orders_north customer_orders  (cost=0.00..19.75 rows=4 width=76) (actual time=0.010..0.011 rows=1 loops=1)
  Filter: (region = 'North'::text)
Planning Time: 0.093 ms
Execution Time: 0.023 ms
(4 rows)

As we can see the only partition hit by the SELECT query is customer_orders_north 

Hash Partitioning

In hash partitioning, the modulus operation (often represented as %) is used in conjunction with a hash function to determine which partition a specific record should be stored in. Here’s how it works

Hash Function: The hash function takes the partition key (e.g., a numeric value or a string) as input and computes a hash value, which is typically a numeric value.

Modulus Operation: Once the hash value is computed, the modulus operation is applied to the hash value. The result of the modulus operation is an integer value representing the remainder when dividing the hash value by the total number of partitions.

Partition Assignment: The remainder obtained from the modulus operation determines which partition the record will be assigned to. For example, if there are four partitions numbered 0, 1, 2, and 3, and the remainder is 2, then the record will be assigned to partition 2.

Let’s create a hash partition table named user_data

CREATE TABLE user_data (
user_id INT PRIMARY KEY,
  user_name TEXT NOT NULL,
  user_email TEXT NOT NULL
) PARTITION BY HASH (user_id);

Now let’s create separate partition tables

CREATE TABLE user_data_p0 PARTITION OF user_data
FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE user_data_p1 PARTITION OF user_data

  FOR VALUES WITH (MODULUS 4, REMAINDER 1);

CREATE TABLE user_data_p2 PARTITION OF user_data
  FOR VALUES WITH (MODULUS 4, REMAINDER 2);

CREATE TABLE user_data_p3 PARTITION OF user_data

  FOR VALUES WITH (MODULUS 4, REMAINDER 3);

Let’s insert dummy data inside our user_data table. And see which portion is hit when we are trying to retrieve the data where user_id = 1 

INSERT INTO user_data (user_id, user_name, user_email) VALUES
(1, 'Alice', 'alice@example.com'),
  (2, 'Bob', 'bob@example.com'),
  (3, 'Charlie', 'charlie@example.com'),
  (4, 'Diana', 'diana@example.com'),
  (5, 'Eve', 'eve@example.com');

explain analyze SELECT * FROM user_data WHERE user_id = 1;
                                                               QUERY PLAN                                                                 
-------------------------------------------------------------------------------------------------------------------------------------------
Index Scan using user_data_p0_pkey on user_data_p0 user_data  (cost=0.15..8.17 rows=1 width=68) (actual time=0.011..0.013 rows=1 loops=1)
  Index Cond: (user_id = 1)
Planning Time: 0.123 ms
Execution Time: 0.026 ms
(4 rows)

So we are only hitting the user_data_p0 table and retrieving the required data 

Benefits of Partitioning

Enhanced Query Performance: Partitioning allows PostgreSQL to access only relevant partitions during query execution, minimizing the amount of data scanned. This targeted access leads to faster query processing times, particularly for large datasets, as it reduces I/O operations and optimizes resource utilization.

Improved Data Management: Partitioning facilitates better data management by logically organizing data into smaller, more manageable units. This division simplifies tasks such as data loading, purging, and archiving, as operations can be performed on individual partitions without impacting the entire table. Additionally, maintenance tasks like backup and restore operations can be executed more efficiently on partitioned tables.

Scalability: Partitioning enhances scalability by distributing data across multiple partitions, allowing PostgreSQL to handle larger datasets and higher query loads. As data grows, additional partitions can be added without significantly impacting performance. This scalability ensures that PostgreSQL can effectively support growing data volumes and evolving application requirements.

Optimized Indexing and Vacuuming: Partitioning enables more efficient indexing and vacuuming operations by allowing them to be performed on individual partitions rather than the entire table. This targeted approach reduces the overhead associated with maintaining indexes and reclaiming disk space, leading to improved database performance and reduced maintenance overhead.

FAQ

Q) Why composite primary key is required to create a partition table

A) In PostgreSQL, a composite primary key is vital for partitioned tables to ensure unique identification across all partitions. For instance, in a sales table partitioned by date, a composite primary key consisting of an identifier and the date column ensures each sale is uniquely identified within its respective partition. This prevents data duplication and maintains integrity across partitions, facilitating efficient query processing and data management.

Q) How to attach/detach a new partition with the original partitional table

A) Yes we can do both to attach a new partition table execute this query

ALTER TABLE sales ATTACH PARTITION sales_april2023 FOR VALUES FROM ('2023-04-01') TO ('2023-05-01');

Similarly to detach the partition from the partitional table execute this query 

ALTER TABLE sales DETACH PARTITION sales_jan2023;

Q) What happens if a row’s value falls outside the defined partitions?

A) If a row’s value falls outside the defined partitions, PostgreSQL will raise an error when attempting to insert or update the row. But if you want to capture the data that is out of range we can use a default partition table for this like that

CREATE TABLE sales_data_falling PARTITION OF sales DEFAULT;

INSERT INTO sales (sale_date, amount) VALUES ('2023-10-15', 100.00);


explain analyze select * from sales where sale_date = '2023-10-15';

                                                     QUERY PLAN                                                    
--------------------------------------------------------------------------------------------------------------------
Seq Scan on sales_data_falling sales  (cost=0.00..25.00 rows=6 width=40) (actual time=0.008..0.009 rows=1 loops=1)
   Filter: (sale_date = '2023-10-15'::date)
Planning Time: 0.174 ms
Execution Time: 0.020 ms

(4 rows)

Q) What are the Real-world use-cases of each list, range and hash partitioning?

A) List Partitioning (E-commerce): Imagine an online store with millions of customer orders. You can partition the order table by region (US, Europe, Asia) in separate partitions. This allows for faster retrieval of orders specific to a region, improving efficiency when analyzing regional sales trends.

Range Partitioning (Time-Series Data): Sensors collect environmental data (temperature, humidity) every minute. You can partition the data table by date (year, month, day). Queries filtering data for a specific date range only need to access the relevant partition(s), significantly speeding up analysis of historical trends. 

Hash Partitioning (User Logins): A social media platform has billions of users logging in daily. You can partition the login table by user_id hashed value. This distributes login data evenly across partitions, improving query performance when searching for specific user login activities or filtering by recent login time.

Comments (6)

  • Agaoua reda

    June 13, 2024 - 7:24 pm

    Great article ! If I may, a correction : you can’t actually write “ALTER TABLE … ADD PARTITION …” or “ALTER TABLE … DROP PARTITION …”

    If you want to DROP a partition, you can simply write :
    DROP TABLE

    Similarly, creating a partition can simply be done with :
    CREATE TABLE PARTITION OF … FOR VALUES …

    You can however detach a partition from its partitionned table with :
    ALTER TABLE DETACH PARTITION
    This wouldn’t drop the partition nor delete its data.

    Finally, you can attach a table to a partition with :
    ALTER TABLE ATTACH PARTITION FOR VALUE …
    Or course, the data already present in the table being attached (ie, the partition) needs to satisfy the contraints defined by the “FOR VALUE …” part.

    Hope this was helpful !

    • Semab Tariq

      June 14, 2024 - 6:28 am

      Hi Agaoua
      Thank you for your keen observation and for pointing out the syntax mistake in the queries. I have updated the blog with the correct syntax. Your feedback helps improve the content and ensures accuracy for everyone. Thanks again for your contribution!

  • Agharta

    June 12, 2024 - 7:27 pm

    Q) What happens if a row’s value falls outside the defined partitions?
    A) If a row’s value falls outside the defined partitions, PostgreSQL will raise an error when attempting to insert or update the row. It’s crucial to regularly review and update partition definitions to accommodate new data ranges or adjust partition boundaries accordingly to avoid such errors.

    You can use:
    CREATE TABLE user_data_falling PARTITION OF user_data FOR VALUES DEFAULT
    to catch them’all!

  • Agharta

    June 12, 2024 - 7:26 pm

    *sorry*
    Right syntax
    CREATE TABLE user_data_falling PARTITION OF user_data FOR VALUES DEFAULT

  • agharta

    June 12, 2024 - 7:24 pm

    Q) What happens if a row’s value falls outside the defined partitions?
    A) If a row’s value falls outside the defined partitions, PostgreSQL will raise an error when attempting to insert or update the row. It’s crucial to regularly review and update partition definitions to accommodate new data ranges or adjust partition boundaries accordingly to avoid such errors.

    A point: you can use DEFAULT DDL when creating a partition to catch all falling rows.
    Like this:
    CREATE TABLE user_data_falling PARTITION OF user_data BY DEFAULT VALUES

    • Semab Tariq

      June 14, 2024 - 6:46 am

      Hi Agharta
      Thank you for your insightful suggestion! I have implemented the workaround you recommended with the correct syntax and added an example as well. I appreciate your help and engagement with the blog. Thanks again!

Leave A Comment