[pgpool-general: 8748] Re: [pgpool 4.0.4 ] PCP + psql commands get stuck after systemd-networkd restart

Mon May 1 17:15:37 JST 2023

Hello,

It is the feature Pgpool will immediately detect the failure and shutdown when network failure occurs.
This is a solution to avoid split brain.

And wd_interval and wd_heartbeat_deadtime don't affect this behavior.

Could you try to disable wd_monitoring_interfaces_list and see if this issue is improved?

  wd_monitoring_interfaces_list = ''

On Mon, 1 May 2023 10:36:15 +0530
Gopikrishnan <nksgopikrishnan at gmail.com> wrote:

> Hi all,
> 
> Any pointers would be helpful.
> Currently, the pgpool is highly unstable due to momentary network
> disruptions.
> 
> While it looks like pgpool was already built with that in mind,
> It is not working correctly for me due to a bug or misconfiguration.
> 
> *Thanks*
> *Gopi*
> 
> 
> On Wed, Apr 19, 2023 at 2:08 PM Gopikrishnan <nksgopikrishnan at gmail.com>
> wrote:
> 
> > Thank you for your response.
> >
> > Answers to your questions:
> > 1. I am using pgpool 4.0.4
> > 2. DEBUG was specifically enabled to debug this issue. (PCP commands
> > frozen)
> > 3. Yes, all the mentioned properties are enabled. (All the pgpool
> > configurations are below for the reference)
> >
> > >If the network goes down, watchdog will detect the network failure and
> > shutdown itself.
> > >To avoid such problems, it is recommended to shutdown pgpool before
> > restarting network.
> >
> > 1. The network was not down; It got restarted. i.e. it came back up in no
> > time (Within seconds)
> > In my current understanding, the watchdog settings: wd_interval and wd_life_point
> > should have covered/tolerated this network downtime?
> > 2. Most of the time in my prod environment, the restart or a glitch in
> > network is not in the application control, to pre-emptively stop pgPool.
> >
> > My follow-up questions:
> > -------
> > 1. Am I hitting a bug in pgPool?
> > 2. Is this scenario (Network glitch) handled better in a newer PgPool
> > version? (So that I can upgrade, if possible with minimal changes to the
> > confs.)
> > -------------------------
> > allow_clear_text_frontend_auth = off
> > allow_multiple_failover_requests_from_node = off
> > allow_sql_comments = off
> > app_name_redirect_preference_list = ''
> > arping_cmd = 'arping -U $_IP_$ -w 1'
> > arping_path = '/sbin'
> > authentication_timeout = 60
> > backend_data_directory0 = '/db/data'
> > backend_data_directory1 = '/db/data'
> > backend_data_directory2 = '/db/data'
> > backend_flag0 = 'ALLOW_TO_FAILOVER'
> > backend_flag1 = 'ALLOW_TO_FAILOVER'
> > backend_flag2 = 'ALLOW_TO_FAILOVER'
> > backend_hostname0 = '10.108.104.31'
> > backend_hostname1 = '10.108.104.32'
> > backend_hostname2 = '10.108.104.33'
> > backend_port0 = 5432
> > backend_port1 = 5432
> > backend_port2 = 5432
> > backend_weight0 = 1
> > backend_weight1 = 1
> > backend_weight2 = 1
> > black_function_list = 'currval,lastval,nextval,setval'
> > black_memqcache_table_list = ''
> > black_query_pattern_list = ''
> > check_temp_table = on
> > check_unlogged_table = on
> > child_life_time = 300
> > child_max_connections = 0
> > clear_memqcache_on_escalation = on
> > client_idle_limit = 0
> > client_idle_limit_in_recovery = 0
> > connect_timeout = 10000
> > connection_cache = on
> > connection_life_time = 0
> > database_redirect_preference_list = ''
> > delay_threshold = 10000000
> > delegate_IP = ''
> > detach_false_primary = off
> > disable_load_balance_on_write = 'transaction'
> > enable_pool_hba = off
> > failback_command = ''
> > failover_command = '/usr/local/etc/failover.sh %d %h %p %D %m %H %M %P %r
> > %R'
> > failover_if_affected_tuples_mismatch = off
> > failover_on_backend_error = on
> > failover_require_consensus = on
> > failover_when_quorum_exists = on
> > follow_master_command = '/usr/local/etc/follow_master.sh %d %h %p %D %m %M
> > %H %P %r %R'
> > health_check_database = ''
> > health_check_max_retries = 3
> > health_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'
> > health_check_period = 5
> > health_check_retry_delay = 1
> > health_check_timeout = 30
> > health_check_user = 'pgpool'
> > heartbeat_destination0 = '10.108.104.32'
> > heartbeat_destination1 = '10.108.104.33'
> > heartbeat_destination_port0 = 9694
> > heartbeat_destination_port1 = 9694
> > heartbeat_device0 = ''
> > heartbeat_device1 = ''
> > if_down_cmd = ''
> > if_up_cmd = ''
> > ifconfig_path = '/sbin'
> > ignore_leading_white_space = on
> > insert_lock = off
> > listen_addresses = '*'
> > listen_backlog_multiplier = 2
> > load_balance_mode = on
> > lobj_lock_table = ''
> > log_client_messages = off
> > log_connections = off
> > log_destination = 'syslog'
> > log_hostname = off
> > log_line_prefix = '%t: pid %p: '
> > log_per_node_statement = off
> > log_standby_delay = 'if_over_threshold'
> > log_statement = off
> > logdir = '/tmp'
> > master_slave_mode = on
> > master_slave_sub_mode = 'stream'
> > max_pool = 4
> > memory_cache_enabled = off
> > memqcache_auto_cache_invalidation = on
> > memqcache_cache_block_size = 1048576
> > memqcache_expire = 0
> > memqcache_max_num_cache = 1000000
> > memqcache_maxcache = 409600
> > memqcache_memcached_host = 'localhost'
> > memqcache_memcached_port = 11211
> > memqcache_method = 'shmem'
> > memqcache_oiddir = '/var/log/pgpool/oiddir'
> > memqcache_total_size = 67108864
> > num_init_children = 32
> > other_pgpool_hostname0 = '10.108.104.32'
> > other_pgpool_hostname1 = '10.108.104.33'
> > other_pgpool_port0 = 9999
> > other_pgpool_port1 = 9999
> > other_wd_port0 = 9000
> > other_wd_port1 = 9000
> > pcp_listen_addresses = '*'
> > pcp_port = 9898
> > pcp_socket_dir = '/tmp'
> > pid_file_name = '/var/run/pgpool/pgpool.pid'
> > ping_path = '/bin'
> > pool_passwd = 'pool_passwd'
> > port = 9999
> > recovery_1st_stage_command = 'recovery_1st_stage'
> > recovery_2nd_stage_command = ''
> > recovery_password = 'ZPH3Xnuh8ISKMZjSqLvIBQe_WTOzXbPF'
> > recovery_timeout = 90
> > recovery_user = 'postgres'
> > relcache_expire = 0
> > relcache_size = 256
> > replicate_select = off
> > replication_mode = off
> > replication_stop_on_mismatch = off
> > reset_query_list = 'ABORT; DISCARD ALL'
> > search_primary_node_timeout = 300
> > serialize_accept = off
> > socket_dir = '/var/run/pgpool/socket'
> > sr_check_database = 'postgres'
> > sr_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'
> > sr_check_period = 10
> > sr_check_user = 'pgpool'
> > ssl = off
> > ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
> > ssl_prefer_server_ciphers = off
> > syslog_facility = 'LOCAL1'
> > syslog_ident = 'pgpool'
> > trusted_servers = ''
> > use_watchdog = on
> > wd_authkey = ''
> > wd_de_escalation_command = '/usr/local/etc/desc.sh'
> > wd_escalation_command = '/usr/local/etc/esc.sh'
> > wd_heartbeat_deadtime = 30
> > wd_heartbeat_keepalive = 2
> > wd_heartbeat_port = 9694
> > wd_hostname = '10.108.104.31'
> > wd_interval = 10
> > wd_ipc_socket_dir = '/tmp'
> > wd_life_point = 3
> > wd_lifecheck_dbname = 'template1'
> > wd_lifecheck_method = 'heartbeat'
> > wd_lifecheck_password = ''
> > wd_lifecheck_query = 'SELECT 1'
> > wd_lifecheck_user = 'nobody'
> > wd_monitoring_interfaces_list = 'any'
> > wd_port = 9000
> > wd_priority = 1
> > white_function_list = ''
> > white_memqcache_table_list = ''
> >
> >
> >
> >
> >
> > *Thanks*
> > *Gopi*
> >
> >
> > On Wed, Apr 19, 2023 at 1:13 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
> >
> >> Hello,
> >>
> >> > When I do:
> >> > systemctl restart systemd-networkd
> >> >
> >> > After that, I am not able to execute any PCP commands like:
> >> > pcp_watchdog_info
> >> > It is frozen.
> >>
> >> If the network goes down, watchdog will detect the network failure and
> >> shutdown itself.
> >> To avoid such problems, it is recommended to shutdown pgpool before
> >> restarting network.
> >>
> >> BTW, which version of Pgpool-II are you using?
> >>
> >> > I tried restarting pgpool and postgres to no avail.
> >> > However, rebooting the system gets it back to a workable state. (PCP
> >> > commands are running again and I can attach the nodes back to the pool)
> >> >
> >> > The pgPool logs shows that the pg-pool was shutdown due to the network
> >> > event:
> >> > -------------------------------
> >> >
> >> > 2023-04-17T15:27:25.190949+00:00 vmvrlcm-104-32 g[3042]: [268-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network event received
> >> > 2023-04-17T15:27:25.191041+00:00 vmvrlcm-104-32 g[3042]: [268-2]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DETAIL:  deleted = YES Link change event = NO
> >> > 2023-04-17T15:27:25.191186+00:00 vmvrlcm-104-32 g[3042]: [269-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  STATE MACHINE INVOKED WITH EVENT = NETWORK
> >> IP
> >> > IS REMOVED Current State = STANDBY
> >> > 2023-04-17T15:27:25.191243+00:00 vmvrlcm-104-32 g[3042]: [270-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network interface lo having flags 65609
> >> > 2023-04-17T15:27:25.191296+00:00 vmvrlcm-104-32 g[3042]: [271-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network interface eth0 having flags 69699
> >> > 2023-04-17T15:27:25.191352+00:00 vmvrlcm-104-32 g[3042]: [272-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network interface "eth0" link is active
> >> > 2023-04-17T15:27:25.191401+00:00 vmvrlcm-104-32 g[3042]: [273-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network interface "eth0" link is up
> >> > 2023-04-17T15:27:25.191449+00:00 vmvrlcm-104-32 g[3042]: [274-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network interface lo having flags 65609
> >> > 2023-04-17T15:27:25.191497+00:00 vmvrlcm-104-32 g[3042]: [275-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  network interface "eth0" is up and we can
> >> > continue
> >> > 2023-04-17T15:27:25.191551+00:00 vmvrlcm-104-32 g[3042]: [276-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: WARNING:  network IP is removed and system has no
> >> IP is
> >> > assigned
> >> > 2023-04-17T15:27:25.191614+00:00 vmvrlcm-104-32 g[3042]: [276-2]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DETAIL:  changing the state to in network trouble
> >> > 2023-04-17T15:27:25.191667+00:00 vmvrlcm-104-32 g[3042]: [277-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: LOG:  watchdog node state changed from [STANDBY] to
> >> [IN
> >> > NETWORK TROUBLE]
> >> > 2023-04-17T15:27:25.191713+00:00 vmvrlcm-104-32 g[3042]: [278-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE
> >> > CHANGED Current State = IN NETWORK TROUBLE
> >> > 2023-04-17T15:27:25.191759+00:00 vmvrlcm-104-32 g[3042]: [279-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: FATAL:  system has lost the network
> >> > 2023-04-17T15:27:25.191807+00:00 vmvrlcm-104-32 g[3042]: [280-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: LOG:  Watchdog is shutting down
> >> > 2023-04-17T15:27:25.191849+00:00 vmvrlcm-104-32 g[3042]: [281-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  sending packet, watchdog node:[
> >> > vmvrlcm-104-31.eng.vmware.com:9999 Linux vmvrlcm-104-31.eng.vmware.com]
> >> > command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
> >> > 2023-04-17T15:27:25.191894+00:00 vmvrlcm-104-32 g[3042]: [282-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  sending watchdog packet to socket:8,
> >> type:[X],
> >> > command ID:10, data Length:0
> >> > 2023-04-17T15:27:25.191952+00:00 vmvrlcm-104-32 g[3042]: [283-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  sending packet, watchdog node:[
> >> > vmvrlcm-104-33.eng.vmware.com:9999 Linux vmvrlcm-104-33.eng.vmware.com]
> >> > command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
> >> > 2023-04-17T15:27:25.192001+00:00 vmvrlcm-104-32 g[3042]: [284-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3042: DEBUG:  sending watchdog packet to socket:9,
> >> type:[X],
> >> > command ID:10, data Length:0
> >> > 2023-04-17T15:27:25.192671+00:00 vmvrlcm-104-32 pgpool[3040]: [24-1]
> >> > 2023-04-17 15:27:25: pid 3040: DEBUG:  reaper handler
> >> > 2023-04-17T15:27:25.192753+00:00 vmvrlcm-104-32 pgpool[3040]: [25-1]
> >> > 2023-04-17 15:27:25: pid 3040: DEBUG:  watchdog child process with pid:
> >> > 3042 exit with FATAL ERROR. pgpool-II will be shutdown
> >> > 2023-04-17T15:27:25.192803+00:00 vmvrlcm-104-32 pgpool[3040]: [26-1]
> >> > 2023-04-17 15:27:25: pid 3040: LOG:  watchdog child process with pid:
> >> 3042
> >> > exits with status 768
> >> > 2023-04-17T15:27:25.192864+00:00 vmvrlcm-104-32 pgpool[3040]: [27-1]
> >> > 2023-04-17 15:27:25: pid 3040: FATAL:  watchdog child process exit with
> >> > fatal error. exiting pgpool-II
> >> > 2023-04-17T15:27:25.197530+00:00 vmvrlcm-104-32 ck[3157]: [23-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3157: DEBUG:  lifecheck child receives shutdown request
> >> > signal 2, forwarding to all children
> >> > 2023-04-17T15:27:25.197611+00:00 vmvrlcm-104-32 ck[3157]: [24-1]
> >> 2023-04-17
> >> > 15:27:25: pid 3157: DEBUG:  lifecheck child receives fast shutdown
> >> request
> >> > 2023-04-17T15:27:25.197658+00:00 vmvrlcm-104-32 at sender[3159]: [148-1]
> >> > 2023-04-17 15:27:25: pid 3159: DEBUG:  watchdog heartbeat sender child
> >> > receives shutdown request signal 2
> >> > 2023-04-17T15:27:25.197994+00:00 vmvrlcm-104-32 at sender[3163]: [148-1]
> >> > 2023-04-17 15:27:25: pid 3163: DEBUG:  watchdog heartbeat sender child
> >> > receives shutdown request signal 2
> >> > 2023-04-17T15:27:25.199168+00:00 vmvrlcm-104-32 at receiver[3161]:
> >> [18-1]
> >> > 2023-04-17 15:27:25: pid 3161: DEBUG:  watchdog heartbeat receiver child
> >> > receives shutdown request signal 2
> >> > 2023-04-17T15:27:25.199567+00:00 vmvrlcm-104-32 at receiver[3158]:
> >> [18-1]
> >> > 2023-04-17 15:27:25: pid 3158: DEBUG:  watchdog heartbeat receiver child
> >> > receives shutdown request signal 2
> >> > 2023-04-17T15:27:25.448554+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [386-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:25.448689+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [387-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  SSL is requested but SSL
> >> > support is not available
> >> > 2023-04-17T15:27:25.450621+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [388-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate kind = 5
> >> > 2023-04-17T15:27:25.451892+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [389-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate backend: key
> >> > data received
> >> > 2023-04-17T15:27:25.451987+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [390-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate backend:
> >> > transaction state: I
> >> > 2023-04-17T15:27:25.452043+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [391-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:25.452096+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [392-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:25.455020+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [386-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:25.455096+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [387-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  SSL is requested but SSL
> >> > support is not available
> >> > 2023-04-17T15:27:25.457196+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [388-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate kind = 5
> >> > 2023-04-17T15:27:25.458437+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [389-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate backend: key
> >> > data received
> >> > 2023-04-17T15:27:25.458556+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [390-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate backend:
> >> > transaction state: I
> >> > 2023-04-17T15:27:25.458674+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [391-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:25.458742+00:00 vmvrlcm-104-32 check process(0)[3196]:
> >> > [392-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:30.452427+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [393-1] 2023-04-17 15:27:30: pid 3197: DEBUG:  health check: clearing
> >> alarm
> >> > 2023-04-17T15:27:30.454041+00:00 vmvrlcm-104-32 check process(2)[3197]:
> >> > [394-1] 2023-04-17 15:27:30: pid 3197: DEBUG:  SSL is requested but SSL
> >> > support is not available
> >> >
> >> > ------------------
> >> >
> >> > After this, it is in kind of a loop of the 'clearing alarm' + 'SSL
> >> support
> >> > is not available'
> >> >
> >> > The relevant (In my current understanding) watchdog settings are:
> >> >
> >> ----------------------------------------------------------------------------------------
> >> > wd_hostname = '10.108.104.31'
> >> > wd_lifecheck_method = 'heartbeat'
> >> > wd_interval = 10
> >> > wd_heartbeat_keepalive = 2
> >> > wd_heartbeat_deadtime = 30
> >> > heartbeat_destination0 = '10.108.104.32'
> >> > heartbeat_device0 = ''
> >> > heartbeat_destination1 = '10.108.104.33'
> >> > heartbeat_device1 = ''
> >> > wd_monitoring_interfaces_list = 'any'
> >>
> >> Above logs are DEBUG messages and I don't think they caused this issue.
> >> Do these DEBUG messages only appear when you restart the network?
> >>
> >> If you are using watchdog, you also need to configure the following
> >> parameters:
> >>
> >> heartbeat_destination_port0
> >> heartbeat_destination_port1
> >> other_pgpool_hostname0
> >> other_pgpool_port0
> >> other_pgpool_hostname1
> >> other_pgpool_port1
> >>
> >> Regards,
> >>
> >> --
> >> Bo Peng <pengbo at sraoss.co.jp>
> >> SRA OSS LLC
> >> https://www.sraoss.co.jp/
> >>
> >

-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS LLC
https://www.sraoss.co.jp/