[pgpool-general: 8747] Re: [pgpool 4.0.4 ] PCP + psql commands get stuck after systemd-networkd restart

Gopikrishnan nksgopikrishnan at gmail.com
Mon May 1 14:06:15 JST 2023


Hi all,

Any pointers would be helpful.
At the moment pgpool is highly unstable for us because of momentary network
disruptions.

Pgpool appears to have been built with such disruptions in mind, but it is
not handling them correctly for me, either due to a bug or a
misconfiguration on my side.
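
For reference, the sequence that reproduces the hang on a watchdog node is
roughly the following (the PCP port is taken from my config below; host and
user are placeholders):

    systemctl restart systemd-networkd
    # network is back within seconds, but any PCP command now hangs, e.g.:
    pcp_watchdog_info -h 127.0.0.1 -p 9898 -U pgpool
    # never returns; only rebooting the machine makes PCP respond again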

*Thanks*
*Gopi*


On Wed, Apr 19, 2023 at 2:08 PM Gopikrishnan <nksgopikrishnan at gmail.com>
wrote:

> Thank you for your response.
>
> Answers to your questions:
> 1. I am using pgpool 4.0.4
> 2. DEBUG logging was enabled specifically to debug this issue (the frozen
> PCP commands).
> 3. Yes, all the mentioned properties are enabled. (The full pgpool
> configuration is included below for reference.)
>
> > If the network goes down, the watchdog will detect the network failure
> > and shut itself down.
> > To avoid such problems, it is recommended to shut down pgpool before
> > restarting the network.
>
> 1. The network was not down; it was restarted, i.e. it came back up within
> seconds. In my current understanding, the watchdog settings wd_interval and
> wd_life_point should have tolerated this brief downtime (rough numbers
> below)?
> 2. In my prod environment, a network restart or glitch is usually outside
> the application's control, so I cannot pre-emptively stop pgpool.
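>
> A rough tolerance estimate from my settings (this is my reading of the
> docs, so please correct me if the formula is off):
>
>     lifecheck:  wd_interval * wd_life_point = 10 s * 3 = ~30 s
>     heartbeat:  wd_heartbeat_deadtime       = 30 s
>
> Both windows should comfortably cover a restart that completes within
> seconds.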
>
> My follow-up questions:
> -------
> 1. Am I hitting a bug in pgpool?
> 2. Is this scenario (a network glitch) handled better in a newer Pgpool-II
> version? (If so, I could upgrade, ideally with minimal changes to the
> confs.)
> -------------------------
> allow_clear_text_frontend_auth = off
> allow_multiple_failover_requests_from_node = off
> allow_sql_comments = off
> app_name_redirect_preference_list = ''
> arping_cmd = 'arping -U $_IP_$ -w 1'
> arping_path = '/sbin'
> authentication_timeout = 60
> backend_data_directory0 = '/db/data'
> backend_data_directory1 = '/db/data'
> backend_data_directory2 = '/db/data'
> backend_flag0 = 'ALLOW_TO_FAILOVER'
> backend_flag1 = 'ALLOW_TO_FAILOVER'
> backend_flag2 = 'ALLOW_TO_FAILOVER'
> backend_hostname0 = '10.108.104.31'
> backend_hostname1 = '10.108.104.32'
> backend_hostname2 = '10.108.104.33'
> backend_port0 = 5432
> backend_port1 = 5432
> backend_port2 = 5432
> backend_weight0 = 1
> backend_weight1 = 1
> backend_weight2 = 1
> black_function_list = 'currval,lastval,nextval,setval'
> black_memqcache_table_list = ''
> black_query_pattern_list = ''
> check_temp_table = on
> check_unlogged_table = on
> child_life_time = 300
> child_max_connections = 0
> clear_memqcache_on_escalation = on
> client_idle_limit = 0
> client_idle_limit_in_recovery = 0
> connect_timeout = 10000
> connection_cache = on
> connection_life_time = 0
> database_redirect_preference_list = ''
> delay_threshold = 10000000
> delegate_IP = ''
> detach_false_primary = off
> disable_load_balance_on_write = 'transaction'
> enable_pool_hba = off
> failback_command = ''
> failover_command = '/usr/local/etc/failover.sh %d %h %p %D %m %H %M %P %r
> %R'
> failover_if_affected_tuples_mismatch = off
> failover_on_backend_error = on
> failover_require_consensus = on
> failover_when_quorum_exists = on
> follow_master_command = '/usr/local/etc/follow_master.sh %d %h %p %D %m %M
> %H %P %r %R'
> health_check_database = ''
> health_check_max_retries = 3
> health_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'
> health_check_period = 5
> health_check_retry_delay = 1
> health_check_timeout = 30
> health_check_user = 'pgpool'
> heartbeat_destination0 = '10.108.104.32'
> heartbeat_destination1 = '10.108.104.33'
> heartbeat_destination_port0 = 9694
> heartbeat_destination_port1 = 9694
> heartbeat_device0 = ''
> heartbeat_device1 = ''
> if_down_cmd = ''
> if_up_cmd = ''
> ifconfig_path = '/sbin'
> ignore_leading_white_space = on
> insert_lock = off
> listen_addresses = '*'
> listen_backlog_multiplier = 2
> load_balance_mode = on
> lobj_lock_table = ''
> log_client_messages = off
> log_connections = off
> log_destination = 'syslog'
> log_hostname = off
> log_line_prefix = '%t: pid %p: '
> log_per_node_statement = off
> log_standby_delay = 'if_over_threshold'
> log_statement = off
> logdir = '/tmp'
> master_slave_mode = on
> master_slave_sub_mode = 'stream'
> max_pool = 4
> memory_cache_enabled = off
> memqcache_auto_cache_invalidation = on
> memqcache_cache_block_size = 1048576
> memqcache_expire = 0
> memqcache_max_num_cache = 1000000
> memqcache_maxcache = 409600
> memqcache_memcached_host = 'localhost'
> memqcache_memcached_port = 11211
> memqcache_method = 'shmem'
> memqcache_oiddir = '/var/log/pgpool/oiddir'
> memqcache_total_size = 67108864
> num_init_children = 32
> other_pgpool_hostname0 = '10.108.104.32'
> other_pgpool_hostname1 = '10.108.104.33'
> other_pgpool_port0 = 9999
> other_pgpool_port1 = 9999
> other_wd_port0 = 9000
> other_wd_port1 = 9000
> pcp_listen_addresses = '*'
> pcp_port = 9898
> pcp_socket_dir = '/tmp'
> pid_file_name = '/var/run/pgpool/pgpool.pid'
> ping_path = '/bin'
> pool_passwd = 'pool_passwd'
> port = 9999
> recovery_1st_stage_command = 'recovery_1st_stage'
> recovery_2nd_stage_command = ''
> recovery_password = 'ZPH3Xnuh8ISKMZjSqLvIBQe_WTOzXbPF'
> recovery_timeout = 90
> recovery_user = 'postgres'
> relcache_expire = 0
> relcache_size = 256
> replicate_select = off
> replication_mode = off
> replication_stop_on_mismatch = off
> reset_query_list = 'ABORT; DISCARD ALL'
> search_primary_node_timeout = 300
> serialize_accept = off
> socket_dir = '/var/run/pgpool/socket'
> sr_check_database = 'postgres'
> sr_check_password = 'e2f2da4a027a41bf8517406dd9ca970e'
> sr_check_period = 10
> sr_check_user = 'pgpool'
> ssl = off
> ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
> ssl_prefer_server_ciphers = off
> syslog_facility = 'LOCAL1'
> syslog_ident = 'pgpool'
> trusted_servers = ''
> use_watchdog = on
> wd_authkey = ''
> wd_de_escalation_command = '/usr/local/etc/desc.sh'
> wd_escalation_command = '/usr/local/etc/esc.sh'
> wd_heartbeat_deadtime = 30
> wd_heartbeat_keepalive = 2
> wd_heartbeat_port = 9694
> wd_hostname = '10.108.104.31'
> wd_interval = 10
> wd_ipc_socket_dir = '/tmp'
> wd_life_point = 3
> wd_lifecheck_dbname = 'template1'
> wd_lifecheck_method = 'heartbeat'
> wd_lifecheck_password = ''
> wd_lifecheck_query = 'SELECT 1'
> wd_lifecheck_user = 'nobody'
> wd_monitoring_interfaces_list = 'any'
> wd_port = 9000
> wd_priority = 1
> white_function_list = ''
> white_memqcache_table_list = ''
>
>
>
>
>
> *Thanks*
> *Gopi*
>
>
> On Wed, Apr 19, 2023 at 1:13 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
>
>> Hello,
>>
>> > When I do:
>> > systemctl restart systemd-networkd
>> >
>> > After that, I am not able to execute any PCP commands like:
>> > pcp_watchdog_info
>> > It is frozen.
>>
>> If the network goes down, the watchdog will detect the network failure
>> and shut itself down.
>> To avoid such problems, it is recommended to shut down pgpool before
>> restarting the network.
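>>
>> (For a planned network restart, the ordering I have in mind is roughly
>> the following; the service name is only an example and depends on how
>> pgpool is installed:
>>
>>   systemctl stop pgpool
>>   systemctl restart systemd-networkd
>>   systemctl start pgpool
>> )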
>>
>> BTW, which version of Pgpool-II are you using?
>>
>> > I tried restarting pgpool and postgres to no avail.
>> > However, rebooting the system gets it back to a workable state. (PCP
>> > commands are running again and I can attach the nodes back to the pool)
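>> > (Attaching the nodes back is done with pcp_attach_node, for example:
>> >   pcp_attach_node -h 127.0.0.1 -p 9898 -U pgpool -n 1
>> > with the node id of the detached backend; host and user here are
>> > placeholders.)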
>> >
>> > The pgpool logs show that pgpool was shut down due to the network
>> > event:
>> > -------------------------------
>> >
>> > 2023-04-17T15:27:25.190949+00:00 vmvrlcm-104-32 g[3042]: [268-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network event received
>> > 2023-04-17T15:27:25.191041+00:00 vmvrlcm-104-32 g[3042]: [268-2] 2023-04-17 15:27:25: pid 3042: DETAIL:  deleted = YES Link change event = NO
>> > 2023-04-17T15:27:25.191186+00:00 vmvrlcm-104-32 g[3042]: [269-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  STATE MACHINE INVOKED WITH EVENT = NETWORK IP IS REMOVED Current State = STANDBY
>> > 2023-04-17T15:27:25.191243+00:00 vmvrlcm-104-32 g[3042]: [270-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network interface lo having flags 65609
>> > 2023-04-17T15:27:25.191296+00:00 vmvrlcm-104-32 g[3042]: [271-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network interface eth0 having flags 69699
>> > 2023-04-17T15:27:25.191352+00:00 vmvrlcm-104-32 g[3042]: [272-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network interface "eth0" link is active
>> > 2023-04-17T15:27:25.191401+00:00 vmvrlcm-104-32 g[3042]: [273-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network interface "eth0" link is up
>> > 2023-04-17T15:27:25.191449+00:00 vmvrlcm-104-32 g[3042]: [274-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network interface lo having flags 65609
>> > 2023-04-17T15:27:25.191497+00:00 vmvrlcm-104-32 g[3042]: [275-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  network interface "eth0" is up and we can continue
>> > 2023-04-17T15:27:25.191551+00:00 vmvrlcm-104-32 g[3042]: [276-1] 2023-04-17 15:27:25: pid 3042: WARNING:  network IP is removed and system has no IP is assigned
>> > 2023-04-17T15:27:25.191614+00:00 vmvrlcm-104-32 g[3042]: [276-2] 2023-04-17 15:27:25: pid 3042: DETAIL:  changing the state to in network trouble
>> > 2023-04-17T15:27:25.191667+00:00 vmvrlcm-104-32 g[3042]: [277-1] 2023-04-17 15:27:25: pid 3042: LOG:  watchdog node state changed from [STANDBY] to [IN NETWORK TROUBLE]
>> > 2023-04-17T15:27:25.191713+00:00 vmvrlcm-104-32 g[3042]: [278-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State = IN NETWORK TROUBLE
>> > 2023-04-17T15:27:25.191759+00:00 vmvrlcm-104-32 g[3042]: [279-1] 2023-04-17 15:27:25: pid 3042: FATAL:  system has lost the network
>> > 2023-04-17T15:27:25.191807+00:00 vmvrlcm-104-32 g[3042]: [280-1] 2023-04-17 15:27:25: pid 3042: LOG:  Watchdog is shutting down
>> > 2023-04-17T15:27:25.191849+00:00 vmvrlcm-104-32 g[3042]: [281-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  sending packet, watchdog node:[vmvrlcm-104-31.eng.vmware.com:9999 Linux vmvrlcm-104-31.eng.vmware.com] command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
>> > 2023-04-17T15:27:25.191894+00:00 vmvrlcm-104-32 g[3042]: [282-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  sending watchdog packet to socket:8, type:[X], command ID:10, data Length:0
>> > 2023-04-17T15:27:25.191952+00:00 vmvrlcm-104-32 g[3042]: [283-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  sending packet, watchdog node:[vmvrlcm-104-33.eng.vmware.com:9999 Linux vmvrlcm-104-33.eng.vmware.com] command id:[10] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
>> > 2023-04-17T15:27:25.192001+00:00 vmvrlcm-104-32 g[3042]: [284-1] 2023-04-17 15:27:25: pid 3042: DEBUG:  sending watchdog packet to socket:9, type:[X], command ID:10, data Length:0
>> > 2023-04-17T15:27:25.192671+00:00 vmvrlcm-104-32 pgpool[3040]: [24-1] 2023-04-17 15:27:25: pid 3040: DEBUG:  reaper handler
>> > 2023-04-17T15:27:25.192753+00:00 vmvrlcm-104-32 pgpool[3040]: [25-1] 2023-04-17 15:27:25: pid 3040: DEBUG:  watchdog child process with pid: 3042 exit with FATAL ERROR. pgpool-II will be shutdown
>> > 2023-04-17T15:27:25.192803+00:00 vmvrlcm-104-32 pgpool[3040]: [26-1] 2023-04-17 15:27:25: pid 3040: LOG:  watchdog child process with pid: 3042 exits with status 768
>> > 2023-04-17T15:27:25.192864+00:00 vmvrlcm-104-32 pgpool[3040]: [27-1] 2023-04-17 15:27:25: pid 3040: FATAL:  watchdog child process exit with fatal error. exiting pgpool-II
>> > 2023-04-17T15:27:25.197530+00:00 vmvrlcm-104-32 ck[3157]: [23-1] 2023-04-17 15:27:25: pid 3157: DEBUG:  lifecheck child receives shutdown request signal 2, forwarding to all children
>> > 2023-04-17T15:27:25.197611+00:00 vmvrlcm-104-32 ck[3157]: [24-1] 2023-04-17 15:27:25: pid 3157: DEBUG:  lifecheck child receives fast shutdown request
>> > 2023-04-17T15:27:25.197658+00:00 vmvrlcm-104-32 at sender[3159]: [148-1] 2023-04-17 15:27:25: pid 3159: DEBUG:  watchdog heartbeat sender child receives shutdown request signal 2
>> > 2023-04-17T15:27:25.197994+00:00 vmvrlcm-104-32 at sender[3163]: [148-1] 2023-04-17 15:27:25: pid 3163: DEBUG:  watchdog heartbeat sender child receives shutdown request signal 2
>> > 2023-04-17T15:27:25.199168+00:00 vmvrlcm-104-32 at receiver[3161]: [18-1] 2023-04-17 15:27:25: pid 3161: DEBUG:  watchdog heartbeat receiver child receives shutdown request signal 2
>> > 2023-04-17T15:27:25.199567+00:00 vmvrlcm-104-32 at receiver[3158]: [18-1] 2023-04-17 15:27:25: pid 3158: DEBUG:  watchdog heartbeat receiver child receives shutdown request signal 2
>> > 2023-04-17T15:27:25.448554+00:00 vmvrlcm-104-32 check process(2)[3197]: [386-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:25.448689+00:00 vmvrlcm-104-32 check process(2)[3197]: [387-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  SSL is requested but SSL support is not available
>> > 2023-04-17T15:27:25.450621+00:00 vmvrlcm-104-32 check process(2)[3197]: [388-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate kind = 5
>> > 2023-04-17T15:27:25.451892+00:00 vmvrlcm-104-32 check process(2)[3197]: [389-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate backend: key data received
>> > 2023-04-17T15:27:25.451987+00:00 vmvrlcm-104-32 check process(2)[3197]: [390-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  authenticate backend: transaction state: I
>> > 2023-04-17T15:27:25.452043+00:00 vmvrlcm-104-32 check process(2)[3197]: [391-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:25.452096+00:00 vmvrlcm-104-32 check process(2)[3197]: [392-1] 2023-04-17 15:27:25: pid 3197: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:25.455020+00:00 vmvrlcm-104-32 check process(0)[3196]: [386-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:25.455096+00:00 vmvrlcm-104-32 check process(0)[3196]: [387-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  SSL is requested but SSL support is not available
>> > 2023-04-17T15:27:25.457196+00:00 vmvrlcm-104-32 check process(0)[3196]: [388-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate kind = 5
>> > 2023-04-17T15:27:25.458437+00:00 vmvrlcm-104-32 check process(0)[3196]: [389-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate backend: key data received
>> > 2023-04-17T15:27:25.458556+00:00 vmvrlcm-104-32 check process(0)[3196]: [390-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  authenticate backend: transaction state: I
>> > 2023-04-17T15:27:25.458674+00:00 vmvrlcm-104-32 check process(0)[3196]: [391-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:25.458742+00:00 vmvrlcm-104-32 check process(0)[3196]: [392-1] 2023-04-17 15:27:25: pid 3196: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:30.452427+00:00 vmvrlcm-104-32 check process(2)[3197]: [393-1] 2023-04-17 15:27:30: pid 3197: DEBUG:  health check: clearing alarm
>> > 2023-04-17T15:27:30.454041+00:00 vmvrlcm-104-32 check process(2)[3197]: [394-1] 2023-04-17 15:27:30: pid 3197: DEBUG:  SSL is requested but SSL support is not available
>> >
>> > ------------------
>> >
>> > After this, the log keeps looping through the 'clearing alarm' and
>> > 'SSL support is not available' messages.
>> >
>> > The watchdog settings relevant here (in my current understanding) are:
>> >
>> > ----------------------------------------------------------------------------------------
>> > wd_hostname = '10.108.104.31'
>> > wd_lifecheck_method = 'heartbeat'
>> > wd_interval = 10
>> > wd_heartbeat_keepalive = 2
>> > wd_heartbeat_deadtime = 30
>> > heartbeat_destination0 = '10.108.104.32'
>> > heartbeat_device0 = ''
>> > heartbeat_destination1 = '10.108.104.33'
>> > heartbeat_device1 = ''
>> > wd_monitoring_interfaces_list = 'any'
>>
>> The logs above are DEBUG messages and I don't think they caused this issue.
>> Do these DEBUG messages only appear when you restart the network?
>>
>> If you are using watchdog, you also need to configure the following
>> parameters:
>>
>> heartbeat_destination_port0
>> heartbeat_destination_port1
>> other_pgpool_hostname0
>> other_pgpool_port0
>> other_pgpool_hostname1
>> other_pgpool_port1
>>
>> Regards,
>>
>> --
>> Bo Peng <pengbo at sraoss.co.jp>
>> SRA OSS LLC
>> https://www.sraoss.co.jp/
>>
>