[pgpool-general: 6727] Re: Watchdog problem - crashed standby not being detected?

Tue Oct 8 08:53:41 JST 2019

Hi Usama,

Can you please look into this?

This sounds weird to me too, because:

1) tcp_keepalive does not affect heartbeat, since heartbeat uses UDP, not TCP.

2) Why does heartbeat not work in this case?
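
To illustrate point 1: with a UDP heartbeat there is no connection for
the kernel to keep alive, so the receiver has to decide liveness from
its own timer. Below is a minimal Python sketch of that idea. It is not
pgpool's actual code; the function name dead_peers and the last_seen
mapping are hypothetical, and I am assuming wd_heartbeat_deadtime acts
as the silence threshold in seconds.

```python
def dead_peers(last_seen, now, deadtime):
    """Return the peers whose last heartbeat is older than `deadtime` seconds.

    last_seen: dict mapping a peer name to the timestamp (in seconds) of the
               last heartbeat packet received from it (hypothetical structure).
    now:       current timestamp in seconds.
    deadtime:  allowed silence in seconds, assumed to correspond to
               wd_heartbeat_deadtime.
    """
    return sorted(peer for peer, ts in last_seen.items() if now - ts > deadtime)


# pgpool1 has been silent for 35 seconds, pgpool0 was heard just now.
last_seen = {"pgpool0.company.com": 100.0, "pgpool1.company.com": 65.0}
print(dead_peers(last_seen, now=100.0, deadtime=30))
```

If heartbeat worked along these lines, the quoted wd_heartbeat_deadtime
= 30 would flag a powered-off node roughly 30 seconds after its last
packet, independent of any TCP keepalive setting.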

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

From: Martin Goodson <kaemaril at googlemail.com>
Subject: [pgpool-general: 6726] Watchdog problem - crashed standby not being detected?
Date: Mon, 7 Oct 2019 21:55:28 +0100
Message-ID: <5D9BA640.9030105 at googlemail.com>

> Hello.
> 
> I think I've got a bit of a weird one here.
> 
> I've got a PostgreSQL 10.5 physical replication set-up running, with a
> primary database and a single standby server. Sitting on top of that
> is a three-node pgpool 4.0.4 cluster running with watchdog enabled and
> a delegate IP, in master/streaming mode.
> 
> All five servers are VMware stretched clusters running on the same
> subnet, all running Red Hat Linux 7.6.
> 
> All three pgpool nodes seem to be up and running OK and talking
> amongst themselves. I can communicate with the primary database via
> the VIP, pcp_watchdog_info shows information OK, verifies the quorum
> is up, the VIP is correctly assigned to the master node, and so
> forth. Shutting down any node results in that being detected by the
> other two nodes, and of course if the master is the node that is
> shut down, it fails over properly to one of the two standbys. No
> complaints on that end, all seems fine.
> 
> As part of a recent high availability and disaster recovery test we
> forced one of the pgpool standby servers down in an uncontrolled
> fashion to simulate a server crash. We were expecting pgpool to detect
> that it had lost the standby node. Instead we saw that the pgpool
> master and remaining standby server continued to believe the failed
> server was still up in standby, even with the server completely
> powered down. Even after 30 minutes or so, no observed change of
> status.
> 
> I am at a loss to understand this. Surely the remaining two nodes
> should have, hopefully quite quickly, noticed that they had lost
> contact with the downed server via the watchdog?
> 
> The watchdog parameters I've got set are these (below is an extract
> from the pgpool.conf of our third pgpool node, pgpool2, with servers,
> ports etc changed to protect the innocent :) )
> 
> ------------------------------------------------------------------------------------------
> 
> use_watchdog = on
> 
> wd_hostname = 'pgpool2.company.com'
> wd_port = 40010
> wd_priority = 1
> wd_authkey = '<database>'
> 
> clear_memqcache_on_escalation = on
> wd_escalation_command = ''
> wd_de_escalation_command = ''
> 
> failover_when_quorum_exists = on
> failover_require_consensus = on
> allow_multiple_failover_requests_from_node = off
> 
> wd_monitoring_interfaces_list = ''
> wd_lifecheck_method = 'heartbeat'
> wd_interval = 10
> 
> wd_heartbeat_port = 40015
> wd_heartbeat_keepalive = 2
> wd_heartbeat_deadtime = 30
> 
> 
> heartbeat_destination0 = 'pgpool0.company.com'
> heartbeat_destination_port0 = 40015
> heartbeat_device0 = 'eth0'
> 
> heartbeat_destination1 = 'pgpool1.company.com'
> heartbeat_destination_port1 = 40015
> heartbeat_device1 = 'eth0'
> 
> other_pgpool_hostname0 = 'pgpool0.company.com'
> other_pgpool_port0 = 40000
> other_wd_port0 = 40010
> 
> other_pgpool_hostname1 = 'pgpool1.company.com'
> other_pgpool_port1 = 40000
> other_wd_port1 = 40010
> 
> ------------------------------------------------------------------------------------------
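
One thing that might be worth ruling out first is a mismatch between
the watchdog peers and the heartbeat destinations. A rough Python sketch
of such a check follows. parse_conf and missing_heartbeat_targets are
hypothetical helpers, not part of pgpool, and the parser is deliberately
naive: it assumes one key = 'value' pair per line and no '#' inside
values.

```python
import re

# Extract of the quoted settings, used here only as sample input.
SAMPLE = """
heartbeat_destination0 = 'pgpool0.company.com'
heartbeat_destination_port0 = 40015
heartbeat_destination1 = 'pgpool1.company.com'
heartbeat_destination_port1 = 40015
other_pgpool_hostname0 = 'pgpool0.company.com'
other_pgpool_hostname1 = 'pgpool1.company.com'
"""


def parse_conf(text):
    """Parse simple key = 'value' lines into a dict (naive, see assumptions)."""
    params = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        params[key.strip()] = value.strip().strip("'")
    return params


def missing_heartbeat_targets(params):
    """Return watchdog peers that have no matching heartbeat destination."""
    others = {v for k, v in params.items()
              if re.fullmatch(r"other_pgpool_hostname\d+", k)}
    targets = {v for k, v in params.items()
               if re.fullmatch(r"heartbeat_destination\d+", k)}
    return sorted(others - targets)


print(missing_heartbeat_targets(parse_conf(SAMPLE)))
```

For the settings quoted above this reports no missing destinations, so
at least on pgpool2 the peer list and the heartbeat list agree. The same
check would have to be repeated against the pgpool.conf of the other two
nodes.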
> 
> I'm by no means a network expert or sysadmin, but I thought the point
> of the watchdog was that the nodes were proactively constantly
> 'talking amongst themselves' to determine the status of the other
> nodes.
> 
> Our sysadmin believes that the issue is related to tcp keepalive, and
> he lowered the keepalive value on the server to 60 seconds. After
> changing that, the remaining pgpool nodes detected the change quite
> quickly.
> 
> But surely the watchdog shouldn't be relying on that? Am I
> misunderstanding something basic about watchdog? Have I somehow missed
> something horribly obvious?
> 
> This is the third or fourth pgpool I've built, and this is the first
> one I've seen this behavior on. If this is unexpected behavior, how
> might I begin to troubleshoot the issue?
> 
> Any insight into this problem would be greatly appreciated.
> 
> Many thanks.
> 
> Martin.
> 
> -- 
> Martin Goodson
> 
> In bed above we're deep asleep,
> While greater love lies further deep.
> This dream must end, the world must know,
> We all depend on the beast below.
> 