Understanding Hash aggregates and Hash Joins in PostgreSQL

PostgreSQL employs various techniques for data joining and aggregation in its queries, among which the hash-based method stands out for its efficiency in particular situations and different data sizes. We will discuss hash joins and hash aggregates in PostgreSQL, providing insights on how they work and parameters which influence this algorithm.

What is Hash Aggregate in PostgreSQL?

If you use aggregate like SUM, COUNT, or AVG in a PostgreSQL query with GROUP BY, the query planner might use hash aggregation. This process involves creating a hash table for the query. As PostgreSQL iterates through each row, it hashes the GROUP BY key and the corresponding row is placed as per calculated hash value. Values belonging to the same hash are placed in the same bucket (we are going to discuss below). Once all the necessary rows have been processed, PostgreSQL works out the final average for each group using this hash table data, and then displays these results.

If you’re running a query like

SELECT id, SUM(Amount)

FROM Sales

GROUP BY id;

Values from the id column are utilized for hashing. As PostgreSQL assigns rows to various buckets, it begins aggregating the amount values. For example, when the first row comes with id 1 and an Amount of 100, the total for id 1 becomes 100. Subsequently, if it encounters another row with the same id 1 with the same or different amount(let’s say 150), it adjusts the total for id 1 to 250 by adding the new amount (100 + 150). After processing all rows, the hash table will hold the aggregated total amounts against each id. This compiled data is then sent as a result to the client.

What is Hash Join and how does it work?

A hash join in an execution plan involves creating a hash table based on the join key of one of the tables involved in the join, usually the smaller one. Once the hash table is created, the larger table is scanned, and for each of its rows, the hash value of the join key is calculated and used to search the hash table for matching rows.

Hash joins can be produced in the following conditions:

Join condition is based on equality (=) operator.
Tables involving large data sets.
forcing planner to use hash join by disabling nested loop join(enable_nestloop) and merge join(enable_mergejoin)

A hash table stores data as hash key and value pairs. It provides constant-time complexity (O(1)) for data retrieval which means that the search time is unaffected by hash table size. The hash keys are spread across buckets.

The larger table is then scanned, and for each row, the hash value of the join key is computed and looked up in the hash table to find matching rows.

What is a bucket in Hash Join?

Buckets are used to group rows with the same hash value. When the hash table is created, each unique hash value corresponds to a bucket, and rows with the same hash value are stored in the same bucket. This allows for efficient retrieval of matching rows during the join operation. The total number of buckets is always two to the power N, and the bucket number for a given hash key is the last N bits of its hash function result.

Let’s say the hash key is 10001001 and there are 8 buckets in use, which equates to 2^3 buckets. In this case, the bucket number assigned will be 001.

What advantages does a Hash Join offer compared to nested loop joins and gather merge joins?

Hash joins typically outperform nested loop joins with bigger datasets, as they eliminate the necessity of repeatedly scanning the entire dataset. Furthermore, when dealing with unsorted data and the hash table can be fit within the memory, hash joins tend to be more efficient than gather merge joins.

How much memory is used by Hash Operations?

The maximum memory a hash table can use is controlled by the following parameters:

work_mem: This setting specifies the memory allocated for sorting and hash tables. Increasing work_mem allows for the creation of larger hash tables, which can improve the performance of hash joins.
hash_mem_multiplier: This parameter calculates the maximum memory that hash-based operations are allowed to use, in relation to work_mem. The formula is:
Total memory = work_mem * hash_mem_multiplier

its default value was increased from 1.0 to 2.0 in PostgreSQL 15. This change means hash-based operations now use twice the base amount of work_mem. This adjustment was made because the performance of hash-based operations depends on various factors like data uniqueness, the number of rows and data size, making their memory requirements harder to predict.

temp_file_limit: This parameter limits the maximum disk space that a process can use for temporary files, which includes files for sorting, hash operations, and cursors. If a process tries to use more space than this limit, the transaction will be canceled. The default value is set to -1 by default, which means unlimited temporary files can be created in the pgsql_tmp directory.

What happens if the Hash table cannot fit into the memory?

If the hash table for a join in PostgreSQL is too big to fit in the available memory, the system handles it by dividing the data into smaller subsets, called batches.These batches are processed one by one, and uses temporary files to store data that doesn’t fit into memory. Here’s how it works in simpler terms:

PostgreSQL breaks the data into batches. The number of batches is always a power of two. Each batch is identified by a part of its hash value.
Each batch is made to fit into the available memory. The system does this by choosing the right number of buckets for each batch that can be fit inside the memory.
When PostgreSQL starts processing the inner table, it first builds a hash table for the initial batch. If a row of data belongs to this first batch, it’s kept in memory. If it belongs to any other batch, it’s saved in a temporary file on disk. You can call it as batch_1_inner, batch_2_inner…
As PostgreSQL goes through the data inside the outer table, it matches rows from the first batch with the hash table. If a row doesn’t belong to the first batch, it’s stored in a temporary file. You can call it as batch_1_outer, batch_2_outer…
After processing the first batch, PostgreSQL clears the hash table. Then it takes the next batch from its temporary file(batch_1_inner and batch_1_outer…), builds a new hash table from it, and matches it with the corresponding data from the other set. Once a batch is completed, the associated temporary files and the hash table are cleared, This process is repeated for each batch.

For better understanding: Hash Join batches and processing

Example

->  Hash Join  (cost=1.36..29.43 rows=1000 width=70) (actual time=0.038..0.350 rows=1000 loops=1)

Hash Cond: (fc.category_id = c.category_id)

-> Seq Scan on film_category fc (cost=0.00..16.00 rows=1000 width=4) (actual time=0.010..0.134 rows=1000 loops=1)

-> Hash (cost=1.16..1.16 rows=16 width=72) (actual time=0.018..0.018 rows=16 loops=1)

Buckets: 1024 Batches: 1 Memory Usage: 9kB

-> Seq Scan on category c (cost=0.00..1.16 rows=16 width=72) (actual time=0.010..0.012 rows=16 loops=1)

Planning time: 0.393 ms

Execution time: 5.271 ms

1: In this process, the category table will serve as the hash table(inner table), while the film_category table will be treated as the outer table in the join.

2: The fact that only 1 batch was needed suggests that the hash table was successfully created and managed within the allocated memory. Specifically, the portion that stands out shows that it utilized 9KB of work_mem.

3: The presence of multiple batches would indicate that the current settings for work_mem or hash_mem_multiplier are insufficient for the query’s needs, and we might consider increasing them if necessary.

4: To get details on the size of spill files, you can execute the explain command with the buffers option. This will give you statistics related to disk operations, including cache and disk reads, as well as reads and writes to temporary files.

Considerations:

1: To ensure that the table’s statistics are updated, set the autoanalyze parameters appropriately or run the ANALYZE command manually. This helps the query planner to make accurate estimates.

2: Be careful with setting higher values of work_mem and hash_mem_multiplier to increase hash join performance. This setting applies to each node, and setting it too high might cause high memory usage and other queries can be starved for memory. For example if you have 3 hash operations in a query, it will take 3 * work_mem * hash_mem_multipler of the total memory

3: Hash joins are particularly effective for integer and string data types.

4: Uneven data distribution can affect the efficiency of hash joins. For instance, if you have a table with 10,000 rows, where 5,000 rows share the same value and the rest are unique(random example), one bucket will end up with those 5,000 identical values and processing this bucket could be time-consuming.

5: Including unnecessary columns in the query can increase the size of the hash table and the memory it requires.