I have always been a fan of RANGE partitioning using a date/time value in PostgreSQL. This isn’t always possible, however, and I recently came across a scenario where a table had grown large enough that it had to be partitioned, and the only reasonable key to use was a UUID-style identifier.
The goal of this post is to highlight when and why hashing your data across partitions in PostgreSQL might be a better approach.
Table of Contents
- Range Partitioning (A Quick Recap)
- Hash Partitioning 101
- Pitfalls of Range Partitioning in Certain Workloads
- When Hash Partitioning Is a Better Fit
- Implementation Details in PostgreSQL
- Hybrid Approaches in PostgreSQL
- Using pg_partman for Automated Partition Management
- Best Practices and Lessons Learned
- Conclusion
Range vs. Hash Partitioning in PostgreSQL
Range Partitioning (A Quick Recap)
Range partitioning in PostgreSQL uses boundary values that define slices of the data, often by date or numeric ranges. If you have a transactions table, you might create monthly partitions based on a transaction_date column. This is intuitive for time-series data because each partition holds rows from a specific date range.
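To make that concrete, here’s a minimal sketch of the monthly scheme (table and column names are illustrative, not from a real system):

CREATE TABLE transactions (
    transaction_id BIGINT,
    transaction_date DATE NOT NULL,
    amount NUMERIC
) PARTITION BY RANGE (transaction_date);

CREATE TABLE transactions_2025_01 PARTITION OF transactions
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01');
CREATE TABLE transactions_2025_02 PARTITION OF transactions
    FOR VALUES FROM ('2025-02-01') TO ('2025-03-01');

-- Retiring January later is a single, cheap statement:
DROP TABLE transactions_2025_01;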
Advantages of Range Partitioning:
- Easy pruning for date-based queries.
- Straightforward approach to archiving old data: drop an entire partition for a past month, rather than issuing a massive DELETE.
- Pairs nicely with time-based ingestion pipelines, where every day or month gets its own partition.
But as convenient as that is, there are cases where range partitioning runs into problems.
Why Range Partitioning Can Fall Short
- Data Skew: If a huge portion of data lands in a single time interval—say, because of a traffic spike in the current month—that monthly partition might end up significantly larger than the others.
- Complex Backfills: Not everyone ingests data in an orderly, daily manner. Sometimes you need to re-ingest or correct data that spans multiple periods. Merging or splitting range partitions can get cumbersome.
- Non-Date Dimensions: Some tables aren’t naturally tied to a sequential numeric or date dimension. If your queries center on user IDs or device IDs, dividing by date might not solve your performance issues.
Hash Partitioning 101
Hash partitioning, introduced in PostgreSQL 11 with major improvements in later releases, solves a different kind of problem. Instead of defining numeric or date boundaries, you choose one or more columns and let PostgreSQL distribute rows across partitions based on a hash function.
Here’s a simplified example of creating a hash-partitioned table:
CREATE TABLE big_data (
    id SERIAL,
    user_id INT,
    info TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (id, user_id)  -- any PK on a partitioned table must include the partition key
)
PARTITION BY HASH (user_id);
Then, you’d create partitions like so:
CREATE TABLE big_data_p0 PARTITION OF big_data
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE big_data_p1 PARTITION OF big_data
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE big_data_p2 PARTITION OF big_data
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE big_data_p3 PARTITION OF big_data
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);
In this example, we’re using four partitions (0 through 3). PostgreSQL will take each user_id, compute a hash, and then perform modulo arithmetic to decide which partition a row belongs in. By spreading data evenly, each partition gets a roughly equal share of the workload and storage.
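If you want to verify how rows actually land, each row’s tableoid reveals its partition; this is a handy spot check (on a big table, expect a full scan):

SELECT tableoid::regclass AS partition, count(*)
FROM big_data
GROUP BY 1
ORDER BY 1;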
Pitfalls of Range Partitioning in Certain Workloads
Let’s dig a little deeper into the reasons you might switch from range partitioning to hash partitioning.
- Uneven Volume by Time: If your data does arrive in large bursts, you might wind up with a single day or month that dwarfs the rest. Queries on that hot partition can slow down, and your partition management tasks become lopsided.
- High Concurrency Bottlenecks: Most PostgreSQL workloads see a spike in reads and writes for recent data. If you only have one “current” range partition, that single partition could become a bottleneck, hurting concurrency.
- Odd Backfill or Correction Patterns: Occasionally, a data fix might involve reprocessing rows from irregular time spans—maybe every Thursday for six months or a random set of user IDs from different weeks. With range partitions, you’re forced to update each affected date range. That can be a hassle.
- Sharding-Like Requirements: If you’re functionally trying to “shard” data within a single PostgreSQL instance, range partitioning might not be enough to distribute load effectively. Hash partitioning can mimic a sharding-like approach by spreading the data more evenly across your partitions.
When Hash Partitioning Is a Better Fit
Below are concrete scenarios where hash partitioning solves real problems.
1. Even Data Distribution
Let’s say you’re running a platform with thousands of distinct customer IDs, and each customer has unpredictable usage patterns. Some days, Customer A generates mountains of data while Customer B only logs in once. If your goal is to keep storage and queries balanced across multiple partitions, hashing on customer_id helps ensure no single partition gets hammered too much.
In PostgreSQL, you’d do something like:
CREATE TABLE customer_events (
    customer_id INT,
    event_data JSONB,
    event_time TIMESTAMP
)
PARTITION BY HASH (customer_id);
Then create as many partitions as you feel comfortable managing, each defined by (MODULUS n, REMAINDER x). If you choose the right number of partitions, large surges of data from a few customers won’t overload a single partition.
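Creating each partition by hand gets tedious; a small DO block can generate the whole set. A sketch assuming eight partitions:

DO $$
BEGIN
    FOR i IN 0..7 LOOP
        EXECUTE format(
            'CREATE TABLE customer_events_p%s PARTITION OF customer_events
                FOR VALUES WITH (MODULUS 8, REMAINDER %s);', i, i);
    END LOOP;
END $$;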
2. High Concurrency
Whenever multiple sessions read or write to the same data slice, locking and I/O contention can become a major pain point. If you suspect that concurrency is becoming a bottleneck because of intense activity on a single partition, hash partitioning can spread out that activity.
You’ll still need to consider indexing strategies (e.g., indexes on the partitioning key), but distributing the rows across multiple partitions can help concurrency by dividing up the physical data. PostgreSQL handles partition locks at the table or row level in most cases, so separate partitions may reduce the chance of collisions.
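Since PostgreSQL 11, an index created on the partitioned parent cascades to every existing and future partition, so a single statement usually suffices:

CREATE INDEX customer_events_cust_time_idx
    ON customer_events (customer_id, event_time);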
3. Complex Backfill or Reloads
Some pipelines don’t just ingest today’s data in a tidy, linear fashion. You might discover an error that affects a random subset of customers or transaction IDs across multiple dates. With range partitions, you’d have to carefully identify and possibly lock each date-based partition that holds those rows.
If you hash-partition by something like user_id or transaction_id, and you know exactly which IDs need re-ingestion, you can limit your reprocessing or updates to just the partitions that hold those keys. You’re not forced to work with an entire month’s worth of data if your corrections only affect a tiny fraction of it.
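To see which partitions currently hold a given set of keys, one trick is to read tableoid off the matching rows (the IDs here are hypothetical):

SELECT DISTINCT tableoid::regclass AS partition
FROM customer_events
WHERE customer_id IN (1001, 2042, 3077);  -- the IDs needing re-ingestion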
4. Unpredictable Growth
Not every project has a predictable ingestion pattern. Some experience sporadic jumps in data volume. When you’re not sure how large each date-based partition might get, a hash-based approach can be more flexible. As long as your chosen key maintains a fairly even distribution, it won’t matter if data volume doubles or triples in certain periods; the load is spread across partitions regardless of when the data arrives.
5. Emulating Sharding
PostgreSQL can scale quite high, but large deployments might consider sharding or distributing data across multiple nodes. Hash partitioning can be a stepping stone to that architecture. Each partition can be thought of as a shard, especially if you end up placing those partitions on separate tablespaces or even separate machines. While that’s more advanced to set up, it can be a logical extension of straightforward hash partitioning within a single node.
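As a sketch of that idea, individual partitions can be placed on separate tablespaces. Here metrics is a hypothetical hash-partitioned parent, and fast_ssd a tablespace you’d have created beforehand:

CREATE TABLE metrics_p0 PARTITION OF metrics
    FOR VALUES WITH (MODULUS 4, REMAINDER 0)
    TABLESPACE fast_ssd;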
Implementation Details in PostgreSQL
Syntax for Hash Partitioning
As of PostgreSQL 11, you can declare a partition by hash like so:
CREATE TABLE my_table (
    ...
) PARTITION BY HASH (some_column);
To create partitions, define MODULUS and REMAINDER:
CREATE TABLE my_table_p0 PARTITION OF my_table
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE my_table_p1 PARTITION OF my_table
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);
— and so on…
The MODULUS is the total number of partitions in the set, while the REMAINDER selects which bucket a given partition receives. With MODULUS 4, each row’s hash is taken modulo 4, landing it in one of four buckets (0, 1, 2, or 3).
Choosing a Hash Key
Pick a column that has these characteristics:
- High Cardinality: A large variety of distinct values (e.g., user IDs, session IDs, transaction IDs).
- Uniform Distribution: You don’t want 90% of your data to share a single key value.
A typical example is user_id or device_id in a multi-tenant environment. You can also hash on multiple columns if needed:
PARTITION BY HASH (customer_id, product_id)
Though in practice, many use a single column that’s known to be fairly diverse.
Number of Partitions
Starting with 4 or 8 partitions might be enough if your data volume isn’t massive. Going beyond that can be tempting, but keep in mind each partition maintains its own metadata and indexes. Over-partitioning adds overhead. Balance is key here: you want enough partitions to distribute data evenly, but not so many that managing them becomes complex.
Monitoring and Maintenance
Once you’ve launched a hash-partitioned table, keep an eye on:
- Partition Size: Make sure your data distribution remains even. If you see a partition getting unusually large, investigate whether your hash key is skewed (a size query is sketched just after this list).
- Query Plans: Check EXPLAIN and EXPLAIN ANALYZE to see if the optimizer is correctly using indexes on each partition.
- Vacuum and Reindexing: Each partition is essentially a separate table under the hood, so routine maintenance applies to each one.
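A query along these lines lists each child partition with its size, so skew shows up at a glance (big_data as the running example):

SELECT c.relname AS partition,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_inherits i
JOIN pg_class c ON c.oid = i.inhrelid
WHERE i.inhparent = 'big_data'::regclass
ORDER BY pg_total_relation_size(c.oid) DESC;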
Hybrid Approaches in PostgreSQL
It’s entirely possible to combine range and hash partitioning, sometimes called range-hash partitioning or sub-partitioning. For instance, you might first partition by month, then within each month partition by hash of user_id.
An example:
CREATE TABLE logs (
    log_id SERIAL,
    user_id INT,
    event_time TIMESTAMP,
    data JSONB
) PARTITION BY RANGE (event_time);
For each monthly partition, you could create sub-partitions by hash:
CREATE TABLE logs_2025_01 PARTITION OF logs
    FOR VALUES FROM ('2025-01-01') TO ('2025-02-01')
    PARTITION BY HASH (user_id);

CREATE TABLE logs_2025_01_p0 PARTITION OF logs_2025_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
— and so on…
The advantage here is that you can quickly drop or archive an entire month while still enjoying the concurrency and distribution benefits of hash partitioning within that month. The downside is added complexity—there are more tables to manage and more partitions to keep track of.
Using pg_partman for Automated Partition Management
Managing partitions manually can become tedious, especially when handling a large number of partitions over time. This is where pg_partman comes in.
pg_partman is an extension designed to automate the creation and maintenance of partitioned tables in PostgreSQL. It manages time- and serial-based range partition sets (hash partition sets are static once created, so they don’t need ongoing partition creation) and significantly simplifies maintenance.
Key Features of pg_partman:
- Automated Partition Creation: New partitions are created automatically as needed, eliminating the need for manual intervention.
- Retention Policies: Older partitions can be dropped or archived based on pre-configured settings.
- Sub-Partitioning Support: Can manage sub-partitioned (hybrid) layouts through its create_sub_parent() function.
- Partition Pruning Enhancements: Works well with PostgreSQL’s native partition pruning for efficient query execution.
- Background Maintenance: Automates partition management tasks via cron jobs or scheduled tasks.
Example Usage
To set up an automatically managed partitioned table with pg_partman, you first need to install the extension and create the required schema:
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman SCHEMA partman;
Then, register an existing partitioned table with it (the call below uses the pg_partman 5.x signature; 4.x takes an extra partition-type argument):
SELECT partman.create_parent(
    p_parent_table => 'public.my_partitioned_table',
    p_control      => 'created_at',
    p_interval     => '1 month'
);
This ensures new partitions are created automatically every month.
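Retention is then configured per parent table in partman.part_config, and maintenance runs on whatever schedule you give it (exact column names and defaults vary a little between pg_partman versions, so treat this as a sketch):

UPDATE partman.part_config
SET retention = '12 months',        -- drop partitions older than a year
    retention_keep_table = false    -- actually drop them, don't just detach
WHERE parent_table = 'public.my_partitioned_table';

-- Typically invoked from cron or pg_cron:
SELECT partman.run_maintenance();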
Using pg_partman reduces the operational overhead of managing partitions, making it easier to scale PostgreSQL partitioning strategies in production environments.
Best Practices and Lessons Learned
Key Tips
- Spend Time on Your Partition Key: The right key has high cardinality and an even distribution. If it’s skewed (say, one value occurs much more frequently than others), hash partitioning might still be imbalanced.
- Plan Your Initial Partition Count: It’s easier to start with a small number of partitions (4, 8, or 16) and grow if needed. While PostgreSQL does let you add new hash partitions, reorganizing existing data can be time-consuming if you do it too often.
- Stay on Top of Maintenance: Partitioned tables require vacuuming, reindexing, and analyzing each child table. Automate these tasks or schedule them so your partitions stay healthy.
- Use the Latest PostgreSQL Features: Partitioning in PostgreSQL has improved with each version. Upgrading may give you more flexibility or better performance (for example, improvements in query planning for partitioned tables).
Common Pitfalls
- Over-Partitioning: A large number of partitions can bog down the planner or cause a spike in file descriptors. Aim for a sweet spot.
- Ignoring Growth: Your partition plan should evolve along with your data. Keep an eye on performance metrics and usage patterns.
- Relying on a Weak Key: If your chosen column doesn’t have enough unique values, you’ll see some partitions hog all the data. Revisit your design if distribution isn’t as even as you hoped.
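A quick way to sanity-check a candidate key is to look at the planner’s statistics for it (run ANALYZE first; big_data and user_id are the running example):

SELECT n_distinct,                     -- negative values mean a fraction of the row count
       most_common_freqs[1] AS top_value_share
FROM pg_stats
WHERE tablename = 'big_data'
  AND attname = 'user_id';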
Conclusion
PostgreSQL’s range partitioning is often the first choice for many data ingestion scenarios—especially time-series or numeric boundaries. Yet there are times when a uniform spread of data is more valuable than neatly packaged time slices. This is where hash partitioning is preferred.
When you see uneven growth, experience concurrency issues on a single hot partition, or deal with sporadic data updates across random subsets of the data, it might be time to explore PostgreSQL’s hash partitioning features. By selecting a column with high cardinality and distributing your rows across multiple partitions, you can minimize bottlenecks and simplify ongoing maintenance.
There’s no “one-size-fits-all” approach to partitioning. Each workload has unique query patterns, growth rates, and data distribution quirks. The best way to decide is to prototype. Set up a test environment, create both a range-partitioned table and a hash-partitioned table, and hammer them with your typical queries. Measure how each design holds up in terms of read and write performance, concurrency, and maintenance overhead.
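For that prototyping step, generate_series makes it easy to load both designs with identical synthetic data; the volumes and value ranges below are arbitrary placeholders:

INSERT INTO big_data (user_id, info, created_at)
SELECT (random() * 100000)::int,        -- synthetic key; shape this to match your real skew
       md5(g::text),                    -- filler payload
       now() - g * interval '1 second'  -- timestamps spread over the past ~12 days
FROM generate_series(1, 1000000) AS g;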
If you discover that data distribution or concurrency is the real challenge, hash partitioning can be a life saver. If your data remains heavily tied to time and you rarely need to reprocess random subsets, range partitioning still might be a simpler choice. And if you want both—for instance, easy dropping of old data plus concurrency improvements—then combining range and hash partitioning is an option worth considering.
The key is to stay flexible. PostgreSQL gives you the tools to partition in the way that best suits your needs. By recognizing the scenarios where hash partitioning excels, you’ll be better equipped to build a database schema that performs well, scales smoothly, and makes life easier in the long run.