[pgpool-general: 8803] queries not redirecting to primary when "delay_threshold_by_time" is exceeded

zam bak zam6ak at gmail.com
Sat Jun 3 01:42:11 JST 2023


Hi

We have pgpool 4.4.2 in a streaming replication setup with PG 14.7
(primary + 1 replica)

The configuration checks for replication lag every 5 seconds, and we
have also set delay_threshold_by_time = 1 (meaning 1 second).
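For reference, these are the relevant settings (the full diff against the
sample config is attached below):

  sr_check_period = 5           # check replication lag every 5 seconds
  sr_check_user = 'postgres'
  delay_threshold_by_time = 1   # lag threshold in seconds; our understanding
                                # is that SELECTs should stop being load
                                # balanced to a standby lagging past this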

We can see replication lag messages in the log, and we can also check
the lag using SHOW POOL_NODES and pcp_node_info.
All of that works fine.
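For example, querying node 1 over PCP (9898 is just the default pcp_port,
adjust as needed):

  [postgres at pgpool-01 ~]$ pcp_node_info -h localhost -p 9898 -U postgres -w 1

The replication delay it reports matches what SHOW POOL_NODES shows.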

However, we are noticing that even after the lag is detected and
exceeds 1 second, SELECT queries are NOT being routed to the primary
server!

The attached file has our config and a sample from the log (taken with
log_per_node_statement = on).
This is a rather impactful issue for us because the application is
getting stale data.
I know we can redirect all queries to the primary on a per-application
basis, but we would rather leverage load balancing when possible.
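For completeness, the workaround I have in mind is something along these
lines, using the application_name seen in our logs (untested on our side):

  app_name_redirect_preference_list = 'app_web_service_XY01:primary'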
-------------- next part --------------
#
# two node streaming replication setup, lag detection enabled
#
[postgres at pgpool-01 ~]$ psql -w -h localhost -p 9999 -U postgres -c "SHOW POOL_NODES"
 node_id |  hostname   | port | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change
---------+-------------+------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | 11.12.34.14 | 5432 | up     | up        | 0.500000  | primary | primary | 16480897   | false             | 0                 |                   |                        | 2023-05-29 12:49:16
 1       | 11.12.34.24 | 5432 | up     | up        | 0.500000  | standby | standby | 6404357    | true              | 0.001623 second   | streaming         | async                  | 2023-05-29 12:49:16
(2 rows)



#
# showing only settings we changed from defaults in original sample config
#
[root at pgpool-01 pgpool-II]# diff --unchanged-line-format= --old-line-format= --new-line-format='%L' pgpool.conf.sample pgpool.conf
listen_addresses = '*'
serialize_accept = on
backend_hostname0 = '11.12.34.14'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/data/pgdata/14'
backend_flag0 = 'ALWAYS_PRIMARY'
backend_application_name0 = 'primary'
backend_hostname1 = '11.12.34.24'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/data/pgdata/14'
backend_flag1 = 'DISALLOW_TO_FAILOVER'
backend_application_name1 = 'replica_1'
num_init_children = 96
max_pool = 2
child_life_time = 0
log_line_prefix = '%m [%p]: [%l-1] user=%u,db=%d,app=%a '   # printf-style string to output at beginning of each log line.
log_connections = on
log_disconnections = on
log_per_node_statement = on
log_standby_delay = 'always'
logging_collector = on
log_directory = '/var/log/pgpool-II'
log_filename = 'pgpool-01-%a.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 100MB
write_function_list = 'currval,lastval,nextval,setval,public.upgrade_.*,upgrade_.*,public.update_.*,update_.*'
statement_level_load_balance = on
sr_check_period = 5
sr_check_user = 'postgres'
delay_threshold_by_time = 1
health_check_period = 10
health_check_timeout = 30
health_check_user = 'postgres'
health_check_max_retries = 10
failover_on_backend_error = off
check_temp_table = trace



#
# pgpool log file, showing (one of many) examples where SELECT queries are going to the replica (node 1), even after delay_threshold_by_time has been exceeded...
# note: SELECT queries are multi-line, so only the 1st line shows
#
[root at pgpool-01 pgpool-II]# grep -iE "(behind)|(app_web_service_XY01.*node\s+id:\s+1.*execute:\s+SELECT$)" /var/log/pgpool-II/pgpool-01-Tue.log
...
...
2023-05-30 08:47:05.114 [656949]: [131618-1] user=[No Connection],db=[No Connection],app=sr_check_worker LOG:  Replication of node: 1 is behind 4.520635 second(s) from the primary server (node: 0)
2023-05-30 08:47:05.599 [656879]: [1175368-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508960 statement: Execute: SELECT
2023-05-30 08:47:05.798 [656866]: [1177354-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508932 statement: Execute: SELECT
2023-05-30 08:47:05.999 [656892]: [1169321-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508963 statement: Execute: SELECT
2023-05-30 08:47:06.198 [656907]: [1159747-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508965 statement: Execute: SELECT
2023-05-30 08:47:06.333 [663876]: [952487-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508968 statement: Execute: SELECT
2023-05-30 08:47:06.414 [663876]: [952622-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508968 statement: Execute: SELECT
2023-05-30 08:47:06.816 [656885]: [1168773-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508949 statement: Execute: SELECT
2023-05-30 08:47:06.818 [656885]: [1168781-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508949 statement: Execute: SELECT
2023-05-30 08:47:06.977 [656929]: [1165589-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508974 statement: Execute: SELECT
2023-05-30 08:47:07.173 [656859]: [1163054-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508975 statement: Execute: SELECT
2023-05-30 08:47:07.322 [656904]: [1168277-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508979 statement: Execute: SELECT
2023-05-30 08:47:08.058 [656889]: [1170768-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508982 statement: Execute: SELECT
2023-05-30 08:47:08.162 [656911]: [1171972-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508983 statement: Execute: SELECT
2023-05-30 08:47:08.950 [656919]: [1174657-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508991 statement: Execute: SELECT
2023-05-30 08:47:09.658 [682106]: [298130-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508995 statement: Execute: SELECT
2023-05-30 08:47:09.779 [680738]: [375124-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508998 statement: Execute: SELECT
2023-05-30 08:47:09.853 [680738]: [375259-1] user=postgres,db=pacs_live,app=app_web_service_XY01 LOG:  DB node id: 1 backend pid: 1508998 statement: Execute: SELECT
2023-05-30 08:47:10.139 [656949]: [131620-1] user=[No Connection],db=[No Connection],app=sr_check_worker LOG:  Replication of node: 1 is behind 9.555652 second(s) from the primary server (node: 0)
2023-05-30 08:47:15.164 [656949]: [131622-1] user=[No Connection],db=[No Connection],app=sr_check_worker LOG:  Replication of node: 1 is behind 14.574173 second(s) from the primary server (node: 0)
...
...
