[pgpool-general: 8745] Re: [pgpool 4.0.4 ] PCP + psql commands get stuck after systemd-networkd restart

Wed Apr 19 17:38:33 JST 2023

Thank you for your response.

Answers to your questions:
1. I am using pgpool 4.0.4
2. DEBUG was specifically enabled to debug this issue. (PCP commands frozen)
3. Yes, all the mentioned properties are enabled. (All the pgpool
configurations are below for the reference)

>If the network goes down, watchdog will detect the network failure and
shutdown itself.
>To avoid such problems, it is recommended to shutdown pgpool before
restarting network.

1. The network was not down; It got restarted. i.e. it came back up in no
time (Within seconds)
In my current understanding, the watchdog settings: wd_interval and
wd_life_point
should have covered/tolerated this network downtime?
2. Most of the time in my prod environment, the restart or a glitch in
network is not in the application control, to pre-emptively stop pgPool.

My follow-up questions:
-------
1. Am I hitting a bug in pgPool?
2. Is this scenario (Network glitch) handled better in a newer PgPool
version? (So that I can upgrade, if possible with minimal changes to the
confs.)
-------------------------
allow_clear_text_frontend_auth = off
allow_multiple_failover_requests_from_node = off
allow_sql_comments = off
app_name_redirect_preference_list = ''
arping_cmd = 'arping -U $_IP_$ -w 1'
arping_path = '/sbin'
authentication_timeout = 60
backend_data_directory0 = '/db/data'
backend_data_directory1 = '/db/data'
backend_data_directory2 = '/db/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_flag2 = 'ALLOW_TO_FAILOVER'
backend_hostname0 = '10.108.104.31'
backend_hostname1 = '10.108.104.32'
backend_hostname2 = '10.108.104.33'
backend_port0 = 5432
backend_port1 = 5432
backend_port2 = 5432
backend_weight0 = 1
backend_weight1 = 1
backend_weight2 = 1
black_function_list = 'currval,lastval,nextval,setval'
black_memqcache_table_list = ''
black_query_pattern_list = ''
check_temp_table = on
check_unlogged_table = on
child_life_time = 300
child_max_connections = 0
clear_memqcache_on_escalation = on
client_idle_limit = 0
client_idle_limit_in_recovery = 0
connect_timeout = 10000
connection_cache = on
connection_life_time = 0
database_redirect_preference_list = ''
delay_threshold = 10000000
delegate_IP = ''
detach_false_primary = off
disable_load_balance_on_write = 'transaction'
enable_pool_hba = off
failback_command = ''
failover_command = '/usr/local/etc/failover.sh %d %h %p %D %m %H %M %P %r
%R'
failover_if_affected_tuples_mismatch = off
failover_on_backend_error = on
failover_require_consensus = on
failover_when_quorum_exists = on
follow_master_command = '/usr/local/etc/follow_master.sh %d %h %p %D %m %M
%H %P %r %R'
health_check_database = ''
health_check_max_retries = 3
health_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'
health_check_period = 5
health_check_retry_delay = 1
health_check_timeout = 30
health_check_user = 'pgpool'
heartbeat_destination0 = '10.108.104.32'
heartbeat_destination1 = '10.108.104.33'
heartbeat_destination_port0 = 9694
heartbeat_destination_port1 = 9694
heartbeat_device0 = ''
heartbeat_device1 = ''
if_down_cmd = ''
if_up_cmd = ''
ifconfig_path = '/sbin'
ignore_leading_white_space = on
insert_lock = off
listen_addresses = '*'
listen_backlog_multiplier = 2
load_balance_mode = on
lobj_lock_table = ''
log_client_messages = off
log_connections = off
log_destination = 'syslog'
log_hostname = off
log_line_prefix = '%t: pid %p: '
log_per_node_statement = off
log_standby_delay = 'if_over_threshold'
log_statement = off
logdir = '/tmp'
master_slave_mode = on
master_slave_sub_mode = 'stream'
max_pool = 4
memory_cache_enabled = off
memqcache_auto_cache_invalidation = on
memqcache_cache_block_size = 1048576
memqcache_expire = 0
memqcache_max_num_cache = 1000000
memqcache_maxcache = 409600
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_method = 'shmem'
memqcache_oiddir = '/var/log/pgpool/oiddir'
memqcache_total_size = 67108864
num_init_children = 32
other_pgpool_hostname0 = '10.108.104.32'
other_pgpool_hostname1 = '10.108.104.33'
other_pgpool_port0 = 9999
other_pgpool_port1 = 9999
other_wd_port0 = 9000
other_wd_port1 = 9000
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/tmp'
pid_file_name = '/var/run/pgpool/pgpool.pid'
ping_path = '/bin'
pool_passwd = 'pool_passwd'
port = 9999
recovery_1st_stage_command = 'recovery_1st_stage'
recovery_2nd_stage_command = ''
recovery_password = 'ZPH3Xnuh8ISKMZjSqLvIBQe_WTOzXbPF'
recovery_timeout = 90
recovery_user = 'postgres'
relcache_expire = 0
relcache_size = 256
replicate_select = off
replication_mode = off
replication_stop_on_mismatch = off
reset_query_list = 'ABORT; DISCARD ALL'
search_primary_node_timeout = 300
serialize_accept = off
socket_dir = '/var/run/pgpool/socket'
sr_check_database = 'postgres'
sr_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'
sr_check_period = 10
sr_check_user = 'pgpool'
ssl = off
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
ssl_prefer_server_ciphers = off
syslog_facility = 'LOCAL1'
syslog_ident = 'pgpool'
trusted_servers = ''
use_watchdog = on
wd_authkey = ''
wd_de_escalation_command = '/usr/local/etc/desc.sh'
wd_escalation_command = '/usr/local/etc/esc.sh'
wd_heartbeat_deadtime = 30
wd_heartbeat_keepalive = 2
wd_heartbeat_port = 9694
wd_hostname = '10.108.104.31'
wd_interval = 10
wd_ipc_socket_dir = '/tmp'
wd_life_point = 3
wd_lifecheck_dbname = 'template1'
wd_lifecheck_method = 'heartbeat'
wd_lifecheck_password = ''
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_user = 'nobody'
wd_monitoring_interfaces_list = 'any'
wd_port = 9000
wd_priority = 1
white_function_list = ''
white_memqcache_table_list = ''

*Thanks*
*Gopi*

On Wed, Apr 19, 2023 at 1:13 PM Bo Peng <pengbo at sraoss.co.jp> wrote:

> Hello,
>
> > When I do:
> > systemctl restart systemd-networkd
> >
> > After that, I am not able to execute any PCP commands like:
> > pcp_watchdog_info
> > It is frozen.
>
> If the network goes down, watchdog will detect the network failure and
> shutdown itself.
> To avoid such problems, it is recommended to shutdown pgpool before
> restarting network.
>
> BTW, which version of Pgpool-II are you using?
>
> > I tried restarting pgpool and postgres to no avail.
> > However, rebooting the system gets it back to a workable state. (PCP
> > commands are running again and I can attach the nodes back to the pool)
> >
> > The pgPool logs shows that the pg-pool was shutdown due to the network
> > event:
> > -------------------------------
> >
> > 2023-04-17T15:27:25.190949+00:00 vmvrlcm-104-32 g[3042]: [268-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network event received
> > 2023-04-17T15:27:25.191041+00:00 vmvrlcm-104-32 g[3042]: [268-2]
> 2023-04-17
> > 15:27:25: pid 3042: DETAIL:  deleted = YES Link change event = NO
> > 2023-04-17T15:27:25.191186+00:00 vmvrlcm-104-32 g[3042]: [269-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  STATE MACHINE INVOKED WITH EVENT = NETWORK IP
> > IS REMOVED Current State = STANDBY
> > 2023-04-17T15:27:25.191243+00:00 vmvrlcm-104-32 g[3042]: [270-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network interface lo having flags 65609
> > 2023-04-17T15:27:25.191296+00:00 vmvrlcm-104-32 g[3042]: [271-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network interface eth0 having flags 69699
> > 2023-04-17T15:27:25.191352+00:00 vmvrlcm-104-32 g[3042]: [272-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network interface "eth0" link is active
> > 2023-04-17T15:27:25.191401+00:00 vmvrlcm-104-32 g[3042]: [273-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network interface "eth0" link is up
> > 2023-04-17T15:27:25.191449+00:00 vmvrlcm-104-32 g[3042]: [274-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network interface lo having flags 65609
> > 2023-04-17T15:27:25.191497+00:00 vmvrlcm-104-32 g[3042]: [275-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  network interface "eth0" is up and we can
> > continue
> > 2023-04-17T15:27:25.191551+00:00 vmvrlcm-104-32 g[3042]: [276-1]
> 2023-04-17
> > 15:27:25: pid 3042: WARNING:  network IP is removed and system has no IP
> is
> > assigned
> > 2023-04-17T15:27:25.191614+00:00 vmvrlcm-104-32 g[3042]: [276-2]
> 2023-04-17
> > 15:27:25: pid 3042: DETAIL:  changing the state to in network trouble
> > 2023-04-17T15:27:25.191667+00:00 vmvrlcm-104-32 g[3042]: [277-1]
> 2023-04-17
> > 15:27:25: pid 3042: LOG:  watchdog node state changed from [STANDBY] to
> [IN
> > NETWORK TROUBLE]
> > 2023-04-17T15:27:25.191713+00:00 vmvrlcm-104-32 g[3042]: [278-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE
> > CHANGED Current State = IN NETWORK TROUBLE
> > 2023-04-17T15:27:25.191759+00:00 vmvrlcm-104-32 g[3042]: [279-1]
> 2023-04-17
> > 15:27:25: pid 3042: FATAL:  system has lost the network
> > 2023-04-17T15:27:25.191807+00:00 vmvrlcm-104-32 g[3042]: [280-1]
> 2023-04-17
> > 15:27:25: pid 3042: LOG:  Watchdog is shutting down
> > 2023-04-17T15:27:25.191849+00:00 vmvrlcm-104-32 g[3042]: [281-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  sending packet, watchdog node:[
> > vmvrlcm-104-31.eng.vmware.com:9999 Linux vmvrlcm-104-31.eng.vmware.com]
> > command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
> > 2023-04-17T15:27:25.191894+00:00 vmvrlcm-104-32 g[3042]: [282-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  sending watchdog packet to socket:8,
> type:[X],
> > command ID:10, data Length:0
> > 2023-04-17T15:27:25.191952+00:00 vmvrlcm-104-32 g[3042]: [283-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  sending packet, watchdog node:[
> > vmvrlcm-104-33.eng.vmware.com:9999 Linux vmvrlcm-104-33.eng.vmware.com]
> > command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
> > 2023-04-17T15:27:25.192001+00:00 vmvrlcm-104-32 g[3042]: [284-1]
> 2023-04-17
> > 15:27:25: pid 3042: DEBUG:  sending watchdog packet to socket:9,
> type:[X],
> > command ID:10, data Length:0
> > 2023-04-17T15:27:25.192671+00:00 vmvrlcm-104-32 pgpool[3040]: [24-1]
> > 2023-04-17 15:27:25: pid 3040: DEBUG:  reaper handler
> > 2023-04-17T15:27:25.192753+00:00 vmvrlcm-104-32 pgpool[3040]: [25-1]
> > 2023-04-17 15:27:25: pid 3040: DEBUG:  watchdog child process with pid:
> > 3042 exit with FATAL ERROR. pgpool-II will be shutdown
> > 2023-04-17T15:27:25.192803+00:00 vmvrlcm-104-32 pgpool[3040]: [26-1]
> > 2023-04-17 15:27:25: pid 3040: LOG:  watchdog child process with pid:
> 3042
> > exits with status 768
> > 2023-04-17T15:27:25.192864+00:00 vmvrlcm-104-32 pgpool[3040]: [27-1]
> > 2023-04-17 15:27:25: pid 3040: FATAL:  watchdog child process exit with
> > fatal error. exiting pgpool-II
> > 2023-04-17T15:27:25.197530+00:00 vmvrlcm-104-32 ck[3157]: [23-1]
> 2023-04-17
> > 15:27:25: pid 3157: DEBUG:  lifecheck child receives shutdown request
> > signal 2, forwarding to all children
> > 2023-04-17T15:27:25.197611+00:00 vmvrlcm-104-32 ck[3157]: [24-1]
> 2023-04-17
> > 15:27:25: pid 3157: DEBUG:  lifecheck child receives fast shutdown
> request
> > 2023-04-17T15:27:25.197658+00:00 vmvrlcm-104-32 at sender[3159]: [148-1]
> > 2023-04-17 15:27:25: pid 3159: DEBUG:  watchdog heartbeat sender child
> > receives shutdown request signal 2
> > 2023-04-17T15:27:25.197994+00:00 vmvrlcm-104-32 at sender[3163]: [148-1]
> > 2023-04-17 15:27:25: pid 3163: DEBUG:  watchdog heartbeat sender child
> > receives shutdown request signal 2
> > 2023-04-17T15:27:25.199168+00:00 vmvrlcm-104-32 at receiver[3161]: [18-1]
> > 2023-04-17 15:27:25: pid 3161: DEBUG:  watchdog heartbeat receiver child
> > receives shutdown request signal 2
> > 2023-04-17T15:27:25.199567+00:00 vmvrlcm-104-32 at receiver[3158]: [18-1]
> > 2023-04-17 15:27:25: pid 3158: DEBUG:  watchdog heartbeat receiver child
> > receives shutdown request signal 2
> > 2023-04-17T15:27:25.448554+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [386-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:25.448689+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [387-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  SSL is requested but SSL
> > support is not available
> > 2023-04-17T15:27:25.450621+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [388-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate kind = 5
> > 2023-04-17T15:27:25.451892+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [389-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate backend: key
> > data received
> > 2023-04-17T15:27:25.451987+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [390-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate backend:
> > transaction state: I
> > 2023-04-17T15:27:25.452043+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [391-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:25.452096+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [392-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:25.455020+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [386-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:25.455096+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [387-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  SSL is requested but SSL
> > support is not available
> > 2023-04-17T15:27:25.457196+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [388-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate kind = 5
> > 2023-04-17T15:27:25.458437+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [389-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate backend: key
> > data received
> > 2023-04-17T15:27:25.458556+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [390-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate backend:
> > transaction state: I
> > 2023-04-17T15:27:25.458674+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [391-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:25.458742+00:00 vmvrlcm-104-32 check process(0)[3196]:
> > [392-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:30.452427+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [393-1] 2023-04-17 15:27:30: pid 3197: DEBUG:  health check: clearing
> alarm
> > 2023-04-17T15:27:30.454041+00:00 vmvrlcm-104-32 check process(2)[3197]:
> > [394-1] 2023-04-17 15:27:30: pid 3197: DEBUG:  SSL is requested but SSL
> > support is not available
> >
> > ------------------
> >
> > After this, it is in kind of a loop of the 'clearing alarm' + 'SSL
> support
> > is not available'
> >
> > The relevant (In my current understanding) watchdog settings are:
> >
> ----------------------------------------------------------------------------------------
> > wd_hostname = '10.108.104.31'
> > wd_lifecheck_method = 'heartbeat'
> > wd_interval = 10
> > wd_heartbeat_keepalive = 2
> > wd_heartbeat_deadtime = 30
> > heartbeat_destination0 = '10.108.104.32'
> > heartbeat_device0 = ''
> > heartbeat_destination1 = '10.108.104.33'
> > heartbeat_device1 = ''
> > wd_monitoring_interfaces_list = 'any'
>
> Above logs are DEBUG messages and I don't think they caused this issue.
> Do these DEBUG messages only appear when you restart the network?
>
> If you are using watchdog, you also need to configure the following
> parameters:
>
> heartbeat_destination_port0
> heartbeat_destination_port1
> other_pgpool_hostname0
> other_pgpool_port0
> other_pgpool_hostname1
> other_pgpool_port1
>
> Regards,
>
> --
> Bo Peng <pengbo at sraoss.co.jp>
> SRA OSS LLC
> https://www.sraoss.co.jp/
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20230419/3dc87a57/attachment-0001.htm>