
Optimizing PostgreSQL Cluster Performance, Part 1 – Load Balancing

In this blog, we will explore how standby nodes contribute to efficient workload distribution and optimal query execution by directing all read (SELECT) queries to them.

How does load balancing work in pg_cirrus?

pg_cirrus uses pgpool behind the scenes. Its load balancing feature works as follows:

  • pgpool accepts client requests on port 9999 (the default) and parses each query.
  • If the query contains a DELETE/UPDATE/INSERT/DDL statement, pgpool sends it to the primary database.
  • If the query is a SELECT, pgpool sends it to a load balance node chosen from the available PostgreSQL servers.
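For example, both statements below go through pgpool on the same port, yet they are routed differently: the INSERT goes to the primary while the SELECT is eligible for load balancing (a minimal sketch using the pgpool address from this post; test_table is a hypothetical table):

./psql -p 9999 -h 172.16.14.154 -U postgres -d postgres -c "INSERT INTO test_table VALUES (1);"   # routed to the primary
./psql -p 9999 -h 172.16.14.154 -U postgres -d postgres -c "SELECT * FROM test_table;"            # eligible for load balancing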

In pg_cirrus, load balancing is turned on by default. The configuration is done using the following parameters:

load_balance_mode = on (default)

backend_weight0 = 0

backend_weight1 = 0.5

backend_weight2 = 0.5

The backend_weight parameter represents the load balance ratio of each node. Since all backend_weight values are normalised onto a [0, 1] scale, each weight can be set to any non-negative integer or floating point value.

For instance, if we give each standby node in a three-node cluster a backend_weight of 0.5 (with the primary's weight at 0), each standby will receive 50% of the read traffic.
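Because of this normalisation, only the ratios between the weights matter. The following hypothetical configuration (not the pg_cirrus default) routes traffic exactly the same way as 0 / 0.5 / 0.5:

backend_weight0 = 0   # primary:  0 / (0 + 2 + 2) =  0% of SELECT traffic
backend_weight1 = 2   # standby1: 2 / (0 + 2 + 2) = 50% of SELECT traffic
backend_weight2 = 2   # standby2: 2 / (0 + 2 + 2) = 50% of SELECT traffic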

Setting up a 3-Node HA Cluster with pg_cirrus

Before proceeding, please make sure you have successfully deployed your 3-node HA cluster. If not, please refer to our guide on how to set up a 3-Node HA PostgreSQL cluster using pg_cirrus [https://stormatics.tech/how-to-use-pg_cirrus-to-setup-a-highly-available-postgresql-cluster].

Checking Cluster Status

Once your cluster is up and running, you can check its status by executing the following command: 

./psql -d postgres -U postgres -p 9999 -h 172.16.14.154 -c "SELECT * FROM pg_stat_replication;"
  pid   | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
--------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 200440 |    16384 | repuser | walreceiver      | 172.16.14.163 |                 |       51502 | 2023-08-03 07:51:52.787014+05 |              | streaming | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 |           |           |            |             0 | async      | 2023-08-03 08:40:03.137874+05
 200441 |    16384 | repuser | walreceiver      | 172.16.14.162 |                 |       40258 | 2023-08-03 07:51:52.791413+05 |              | streaming | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 |           |           |            |             0 | async      | 2023-08-03 08:40:03.195437+05
(2 rows)

This shows the replication status from the primary node to the standby nodes. Here we can clearly see two standby nodes connected via streaming replication.

Load Balancing with pg_cirrus

Load balancing status can be verified using the following command:

./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres
 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 0.000000  | primary | primary | 2          | false             | 0                 |                   |                        | 2023-08-03 08:12:17
 1       | 172.16.14.163 | 5432 | up     | up        | 0.500000  | standby | standby | 112234     | true              | 0                 |                   |                        | 2023-08-03 08:12:17
 2       | 172.16.14.162 | 5432 | up     | up        | 0.500000  | standby | standby | 142076     | true              | 0                 |                   |                        | 2023-08-03 08:12:17
(3 rows)

This command displays the status and load balancing information of each node in the cluster. The lb_weight column shows that SELECT queries are effectively disabled on the primary node (weight 0, load_balance_node false), while they are distributed equally between the two standby nodes (weight 0.5 each).

Disabling Load Balancing

To disable load balancing, modify the pgpool.conf file by updating the properties below:

sudo vim /etc/pgpool2/pgpool.conf 

#load_balance_mode = on 

backend_weight0 = 1 

backend_weight1 = 0 

backend_weight2 = 0 
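For the new settings to take effect, restart the pgpool service (a sketch assuming the systemd unit installed by the Ubuntu pgpool2 package; adjust the service name to your installation):

sudo systemctl restart pgpool2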

Now let's check the load balancing status again:

./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres
 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 1.000000  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-08-03 07:41:38
 1       | 172.16.14.163 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:41:38
 2       | 172.16.14.162 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:41:38
(3 rows)

We can see that the primary node now handles all of the load, since lb_weight is 1 for the primary and 0 for the standby nodes. The load_balance_node column confirms this as well.

Testing the Cluster for Load Balancing

Before running the benchmark, let's initialise the database using pgbench's -i switch:

./pgbench -i -s 100 -d postgres -h 172.16.14.154 -p 9999 -U postgres
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 13.24 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 19.09 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 15.19 s, vacuum 0.20 s, primary keys 3.68 s).

By default, pgbench creates the following tables inside the given database, with row counts shown for scale factor 1:

table                   # of rows
---------------------------------
pgbench_branches        1
pgbench_tellers         10
pgbench_accounts        100000
pgbench_history         0
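With -s 100, each of these counts is multiplied by 100, so pgbench_accounts should hold 10,000,000 rows (matching the tuple count in the initialisation output above). You can verify this yourself:

./psql -p 9999 -h 172.16.14.154 -U postgres -d postgres -c "SELECT count(*) FROM pgbench_accounts;"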

For more information on pgbench and how to run benchmark tests for PostgreSQL, check out the following link:
https://www.postgresql.org/docs/current/pgbench.html

Without Load Balancing

Let's run pgbench to generate load on our cluster:

./pgbench postgres -h 172.16.14.154 -p 9999 -U postgres -c 10 -S -T 60
pgbench (15.2, server 15.3 (Ubuntu 15.3-1.pgdg22.04+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 225394
number of failed transactions: 0 (0.000%)
latency average = 2.656 ms
initial connection time = 143.063 ms
tps = 3764.944861 (without initial connection time)

Here is a breakdown of the pgbench command used:

  • -c 10 simulates 10 concurrent connections to the PostgreSQL database
  • -S tells pgbench to perform only SELECT queries
  • -T 60 runs the benchmark for 60 seconds

Now let's see how SELECT transactions behave on the cluster:

./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres
 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 1.000000  | primary | primary | 225358     | true              | 0                 |                   |                        | 2023-08-03 07:52:05
 1       | 172.16.14.163 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:52:05
 2       | 172.16.14.162 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:52:05
(3 rows)

In the above output, all SELECT operations were performed by the primary node itself; none of the SELECT queries went to the standby nodes, since select_cnt is 0 for the standbys and 225358 for the primary.

With Load Balancing

First, revert the changes we made inside the /etc/pgpool2/pgpool.conf file in order to re-enable load balancing with the settings below:

sudo vim /etc/pgpool2/pgpool.conf 

load_balance_mode = on 

backend_weight0 = 0 

backend_weight1 = 0.5 

backend_weight2 = 0.5

Here backend_weight is set to 0.5 for both standby nodes and 0 for the primary node, meaning the two standbys will each handle 50% of the SELECT operations while the primary won't participate.
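As before, restart pgpool so the new weights take effect and rerun the identical read-only benchmark (the restart command is a sketch assuming the same systemd unit name as earlier):

sudo systemctl restart pgpool2
./pgbench postgres -h 172.16.14.154 -p 9999 -U postgres -c 10 -S -T 60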

Once the benchmark completes, we get the following results:

./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres
 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 0.000000  | primary | primary | 1          | false             | 0                 |                   |                        | 2023-08-03 07:57:56
 1       | 172.16.14.163 | 5432 | up     | up        | 0.500000  | standby | standby | 118067     | false             | 0                 |                   |                        | 2023-08-03 07:57:56
 2       | 172.16.14.162 | 5432 | up     | up        | 0.500000  | standby | standby | 116334     | true              | 0                 |                   |                        | 2023-08-03 07:57:56
(3 rows)

We can see that standby 1 executed 118067 queries and standby 2 executed 116334, while only a single query went to the primary node. That lone query makes no practical difference here, so we have achieved our goal of distributing the read load across the standby nodes.

Summary

Enabling load balancing in a highly available PostgreSQL cluster significantly improves transaction throughput. By distributing select queries to standby nodes, the primary node is freed up to handle other operations, resulting in better overall performance and scalability. 

In the next blog, we will explore techniques to further improve TPS of our cluster.

Comments (2)

  • Andreas Kretschmer

    August 24, 2023 - 6:27 pm

    The main problem with pgpool and load balancing of SELECT queries is functions with write operations inside. Yes, I know, there are black/white lists available to control that.

    • Muhammad Ali

      August 28, 2023 - 4:25 pm

      Hi Andreas, you're right. Load balancing SELECT queries in the presence of functions with write operations is a challenge. With the latest releases of pgpool, we can specify read-only functions (which need to be load balanced) with the read_only_function_list parameter, and functions which update the database with the write_function_list parameter. However, there are some caveats related to function classification which need to be tested. We are going to explore these parameters in upcoming blogs. Let us know if you have any questions or suggestions. Thanks!
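For readers curious what that looks like, here is an illustrative pgpool.conf fragment (assuming pgpool-II 4.1 or later; the function names are purely hypothetical examples):

# SELECTs that call only functions matched here stay eligible for load balancing;
# the complementary write_function_list parameter instead marks functions that modify data.
read_only_function_list = 'get_balance,lookup_.*'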
