In this blog, we will explore how standby nodes contribute to efficient workload distribution and optimal query execution by directing all read (SELECT) queries to them.
How does load balancing work in pg_cirrus?
Behind the scenes, pg_cirrus relies on pgpool. The load balancing feature works as follows:
- pgpool accepts client requests on port 9999 (the default) and parses each query.
- If the query contains a DELETE, UPDATE, INSERT, or DDL statement, pgpool sends it to the primary database.
- If the query contains only SELECT statements, pgpool sends it to a load balance node selected from the available PostgreSQL servers, as illustrated below.
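To make the routing concrete, consider the two commands below (the accounts table is hypothetical, and the host and port match the cluster used later in this post). Both are sent to the same pgpool endpoint, yet they land on different backends:

# A write: pgpool routes this to the primary node
./psql -p 9999 -h 172.16.14.154 -U postgres -d postgres \
  -c "UPDATE accounts SET balance = balance - 100 WHERE id = 1;"

# A read: pgpool routes this to one of the load balance nodes
./psql -p 9999 -h 172.16.14.154 -U postgres -d postgres \
  -c "SELECT balance FROM accounts WHERE id = 1;"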
In pg_cirrus, load balancing is turned on by default. The configuration is done using the following parameters:
load_balance_mode = on   # on by default in pg_cirrus
backend_weight0 = 0
backend_weight1 = 0.5
backend_weight2 = 0.5
The backend_weight parameter represents the load balance ratio of each node. Since all backend_weight values are eventually normalised to the range [0, 1], each weight can be set to any non-negative integer or floating-point value; only the ratio between the weights matters. For instance, if we give each standby node in a three-node cluster a backend_weight of 0.5 (and the primary a weight of 0), each standby node will receive 50% of the read traffic.
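Because only the ratios matter, the weights do not have to sum to 1. As a worked example, the following hypothetical settings are equivalent to the defaults shown above once normalised:

# Same 0% / 50% / 50% split as the defaults after normalisation:
backend_weight0 = 0   # primary:  0 / (0 + 1 + 1) =  0% of SELECT traffic
backend_weight1 = 1   # standby1: 1 / (0 + 1 + 1) = 50% of SELECT traffic
backend_weight2 = 1   # standby2: 1 / (0 + 1 + 1) = 50% of SELECT traffic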
Setting up a 3-Node HA Cluster with pg_cirrus
Before proceeding, please make sure you have successfully deployed your 3-Node HA cluster. If not, please refer to our guide on how to set up a 3-Node HA PostgreSQL cluster using pg_cirrus [https://stormatics.tech/how-to-use-pg_cirrus-to-setup-a-highly-available-postgresql-cluster].
Checking Cluster Status
Once your cluster is up and running, you can check its status by executing the following command:
./psql -d postgres -U postgres -p 9999 -h 172.16.14.154 -c "SELECT * FROM pg_stat_replication;"
  pid   | usesysid | usename | application_name |  client_addr  | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn | write_lag | flush_lag | replay_lag | sync_priority | sync_state |          reply_time
--------+----------+---------+------------------+---------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------+-----------+------------+---------------+------------+-------------------------------
 200440 | 16384    | repuser | walreceiver      | 172.16.14.163 |                 | 51502       | 2023-08-03 07:51:52.787014+05 |              | streaming | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 |           |           |            | 0             | async      | 2023-08-03 08:40:03.137874+05
 200441 | 16384    | repuser | walreceiver      | 172.16.14.162 |                 | 40258       | 2023-08-03 07:51:52.791413+05 |              | streaming | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 | 0/50CCC1A0 |           |           |            | 0             | async      | 2023-08-03 08:40:03.195437+05
(2 rows)
This shows the replication status from the primary node to the standby nodes. Here we can clearly see two standby nodes connected via streaming replication.
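If the full output is too wide to read comfortably, a narrower query (a convenience sketch, not something pg_cirrus requires) shows just the essentials:

./psql -d postgres -U postgres -p 9999 -h 172.16.14.154 \
  -c "SELECT client_addr, state, sync_state, replay_lsn FROM pg_stat_replication;"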
Load Balancing with pg_cirrus
Load balancing status can be verified using the following command:
./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres
 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 0.000000  | primary | primary | 2          | false             | 0                 |                   |                        | 2023-08-03 08:12:17
 1       | 172.16.14.163 | 5432 | up     | up        | 0.500000  | standby | standby | 112234     | true              | 0                 |                   |                        | 2023-08-03 08:12:17
 2       | 172.16.14.162 | 5432 | up     | up        | 0.500000  | standby | standby | 142076     | true              | 0                 |                   |                        | 2023-08-03 08:12:17
(3 rows)
This command displays the status and load balancing information of each node in the cluster. From the lb_weight column we can see that SELECT queries are effectively disabled on the primary node (weight 0, with the load_balance_node flag false), while they will be distributed equally between the standby 1 and standby 2 nodes.
Disable load balancing
To disable load balancing, modify the pgpool.conf file by updating the properties below:
sudo vim /etc/pgpool2/pgpool.conf

#load_balance_mode = on
backend_weight0 = 1
backend_weight1 = 0
backend_weight2 = 0
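pgpool must be restarted for these changes to take effect. Assuming a systemd-managed installation (the service name may differ on your system), something like:

sudo systemctl restart pgpool2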
After the restart, let's check the load balancing status again:
./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres

 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 1.000000  | primary | primary | 0          | true              | 0                 |                   |                        | 2023-08-03 07:41:38
 1       | 172.16.14.163 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:41:38
 2       | 172.16.14.162 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:41:38
(3 rows)
We can see that the primary node is now handling all the load, as lb_weight is 1 for the primary and 0 for the standby nodes. The load_balance_node column confirms this as well.
Testing the cluster for load balancing
Before running the benchmark, let's initialise the database using pgbench's -i switch:
./pgbench -i -s 100 -d postgres -h 172.16.14.154 -p 9999 -U postgres
dropping old tables...
NOTICE:  table "pgbench_accounts" does not exist, skipping
NOTICE:  table "pgbench_branches" does not exist, skipping
NOTICE:  table "pgbench_history" does not exist, skipping
NOTICE:  table "pgbench_tellers" does not exist, skipping
creating tables...
generating data (client-side)...
10000000 of 10000000 tuples (100%) done (elapsed 13.24 s, remaining 0.00 s)
vacuuming...
creating primary keys...
done in 19.09 s (drop tables 0.00 s, create tables 0.01 s, client-side generate 15.19 s, vacuum 0.20 s, primary keys 3.68 s).
By default, pgbench creates the following tables inside the given database, with the row counts shown for a scale factor of 1:
 table              # of rows
---------------------------------
 pgbench_branches           1
 pgbench_tellers            10
 pgbench_accounts       100000
 pgbench_history            0
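Since we initialised with a scale factor of 100, each of those row counts is multiplied by 100. As a quick sanity check through pgpool (an optional verification step, not part of the original setup):

./psql -p 9999 -h 172.16.14.154 -U postgres -d postgres \
  -c "SELECT count(*) FROM pgbench_accounts;"
# expected: 10000000, matching the tuples generated above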
For more information on pgbench and how to run benchmark tests for PostgreSQL, check out the following link:
https://www.postgresql.org/docs/current/pgbench.html
Without Load Balancing
Let's run pgbench to generate load on our cluster:
./pgbench postgres -h 172.16.14.154 -p 9999 -U postgres -c 10 -S -T 60
pgbench (15.2, server 15.3 (Ubuntu 15.3-1.pgdg22.04+1))
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 100
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 1
duration: 60 s
number of transactions actually processed: 225394
number of failed transactions: 0 (0.000%)
latency average = 2.656 ms
initial connection time = 143.063 ms
tps = 3764.944861 (without initial connection time)
Here is a breakdown of the pgbench options used:
- -c 10 simulates 10 concurrent client connections to the PostgreSQL database
- -S tells pgbench to perform SELECT-only transactions
- -T 60 runs the benchmark for 60 seconds
Now let's see how SELECT transactions behave on the cluster:
./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres

 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 1.000000  | primary | primary | 225358     | true              | 0                 |                   |                        | 2023-08-03 07:52:05
 1       | 172.16.14.163 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:52:05
 2       | 172.16.14.162 | 5432 | up     | up        | 0.000000  | standby | standby | 0          | false             | 0                 |                   |                        | 2023-08-03 07:52:05
(3 rows)
In the above output, all SELECT operations were performed by the primary node itself and none of the SELECT queries went to the standby nodes: select_cnt is 0 for the standbys and 225358 for the primary.
With Load Balancing
First, revert the changes we made in the /etc/pgpool2/pgpool.conf file in order to re-enable load balancing with the settings below:
sudo vim /etc/pgpool2/pgpool.conf

load_balance_mode = on
backend_weight0 = 0
backend_weight1 = 0.5
backend_weight2 = 0.5
Here backend_weight is set to 0.5 for both the standby 1 and standby 2 nodes and to 0 for the primary node, which means the two standbys will each handle 50% of the SELECT operations while the primary won't participate.
After restarting the pgpool service and executing pgbench with the same parameters, we get the following results:
./psql -p 9999 -h 172.16.14.154 -c "show pool_nodes" postgres -U postgres

 node_id |   hostname    | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+---------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 172.16.14.165 | 5432 | up     | up        | 0.000000  | primary | primary | 1          | false             | 0                 |                   |                        | 2023-08-03 07:57:56
 1       | 172.16.14.163 | 5432 | up     | up        | 0.500000  | standby | standby | 118067     | false             | 0                 |                   |                        | 2023-08-03 07:57:56
 2       | 172.16.14.162 | 5432 | up     | up        | 0.500000  | standby | standby | 116334     | true              | 0                 |                   |                        | 2023-08-03 07:57:56
(3 rows)
We can see that standby 1 executed 118067 queries, standby 2 executed 116334, and only a single query went to the primary node. That one query could be anything, but it makes little difference here; we have achieved our initial goal of distributing the read load across the standby nodes.
Summary
Enabling load balancing in a highly available PostgreSQL cluster significantly improves transaction throughput. By distributing select queries to standby nodes, the primary node is freed up to handle other operations, resulting in better overall performance and scalability.
In the next blog, we will explore techniques to further improve TPS of our cluster.
Comments (2)
Andreas Kretschmer
The main problem with pgpool and load balancing of SELECT queries is functions with write operations inside. Yes, I know, there are black/white lists available to control that.
Muhammad Ali
Hi Andreas, you're right: load balancing SELECT queries in the presence of functions with write operations is a challenge. With the latest releases of pgpool, we can specify read-only functions (which need to be load balanced) with the read_only_function_list parameter, and functions which update the database with the write_function_list parameter. However, there are some caveats related to function classification that need to be tested. We are going to explore these parameters in upcoming blogs. Let us know if you have any questions or suggestions. Thanks
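For readers who want to experiment before then, here is a minimal sketch of those parameters in pgpool.conf. The insert_.* pattern is a hypothetical example; currval, lastval, nextval, and setval are the sequence functions pgpool has traditionally classified as writes, and the pgpool documentation notes that only one of the two lists should be set:

# Functions matching these patterns are treated as writes and routed to the primary
write_function_list = 'currval,lastval,nextval,setval,insert_.*'
# Leave the other list empty; read_only_function_list and write_function_list
# cannot both be specified
read_only_function_list = ''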