[pgpool-general: 6728] Re: Watchdog problem - crashed standby not being detected?

Tatsuo Ishii ishii at sraoss.co.jp
Tue Oct 8 09:17:54 JST 2019


My wild guess is that the watchdog communication socket (it uses
TCP/IP) was left hanging by the standby node crash, and this froze the
watchdog state machine. Thus the watchdog did not notice that the
heartbeat channel was down.

> Hi Usama,
> 
> Can you please look into this?
> 
> This sounds weird to me too because:
> 
> 1) tcp_keepalive does not affect the heartbeat, since the heartbeat uses UDP, not TCP.
> 
> 2) Why did the heartbeat not work in this case?
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> From: Martin Goodson <kaemaril at googlemail.com>
> Subject: [pgpool-general: 6726] Watchdog problem - crashed standby not being detected?
> Date: Mon, 7 Oct 2019 21:55:28 +0100
> Message-ID: <5D9BA640.9030105 at googlemail.com>
> 
>> Hello.
>> 
>> I think I've got a bit of a weird one here.
>> 
>> I've got a PostgreSQL 10.5 physical replication set-up running, with a
>> primary database and a single standby server. Sitting on top of that
>> is a three-node pgpool 4.0.4 cluster running with watchdog enabled and
>> a delegate IP, in master/streaming mode.
>> 
>> All five servers are VMware virtual machines in a stretched cluster
>> on the same subnet, all running Red Hat Enterprise Linux 7.6.
>> 
>> All three pgpool nodes seem to be up and running OK and talking
>> amongst themselves. I can communicate with the primary database via
>> the VIP, pcp_watchdog_info shows information OK, verifies the quorum
>> is up, the VIP is correctly assigned to the master node, and so
>> forth. Shutting down any node results in the shutdown being detected
>> by the other two nodes, and of course if the master is the node that
>> is shut down, it fails over properly to one of the two standbys. No
>> complaints on that end; all seems fine.
>> 
>> As part of a recent high availability and disaster recovery test we
>> forced one of the pgpool standby servers down in an uncontrolled
>> fashion to simulate a server crash. We were expecting pgpool to detect
>> that it had lost the standby node. Instead we saw that the pgpool
>> master and remaining standby server continued to believe the failed
>> server was still up in standby, even with the server completely
>> powered down. Even after 30 minutes or so, no observed change of
>> status.
>> 
>> I am at a loss to understand this. Surely the remaining two nodes
>> should have, hopefully quite quickly, noticed that they had lost
>> contact with the downed server via the watchdog?
>> 
>> The watchdog parameters I've got set are these (below is an extract
>> from the pgpool.conf of our third pgpool node, pgpool2, with servers,
>> ports etc changed to protect the innocent :) )
>> 
>> ------------------------------------------------------------------------------------------
>> 
>> use_watchdog = on
>> 
>> wd_hostname = 'pgpool2.company.com'
>> wd_port = 40010
>> wd_priority = 1
>> wd_authkey = '<database>'
>> 
>> clear_memqcache_on_escalation = on
>> wd_escalation_command = ''
>> wd_de_escalation_command = ''
>> 
>> failover_when_quorum_exists = on
>> failover_require_consensus = on
>> allow_multiple_failover_requests_from_node = off
>> 
>> wd_monitoring_interfaces_list = ''
>> wd_lifecheck_method = 'heartbeat'
>> wd_interval = 10
>> 
>> wd_heartbeat_port = 40015
>> wd_heartbeat_keepalive = 2
>> wd_heartbeat_deadtime = 30
>> 
>> 
>> heartbeat_destination0 = 'pgpool0.company.com'
>> heartbeat_destination_port0 = 40015
>> heartbeat_device0 = 'eth0'
>> 
>> heartbeat_destination1 = 'pgpool1.company.com'
>> heartbeat_destination_port1 = 40015
>> heartbeat_device1 = 'eth0'
>> 
>> other_pgpool_hostname0 = 'pgpool0.company.com'
>> other_pgpool_port0 = 40000
>> other_wd_port0 = 40010
>> 
>> other_pgpool_hostname1 = 'pgpool1.company.com'
>> other_pgpool_port1 = 40000
>> other_wd_port1 = 40010
>> 
>> ------------------------------------------------------------------------------------------
>> 
>> I'm by no means a network expert or sysadmin, but I thought the point
>> of the watchdog was that the nodes were constantly and proactively
>> 'talking amongst themselves' to determine the status of the other
>> nodes.
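That is indeed the idea behind the heartbeat lifecheck: each node periodically sends UDP datagrams to its peers, and a peer that has been silent for longer than wd_heartbeat_deadtime (30 s in the configuration above) should be declared down. The following is a toy Python model of that bookkeeping, written for illustration only; it is not pgpool's implementation, and the host names are just the placeholders from the config.

```python
class HeartbeatMonitor:
    """Toy model of a UDP-heartbeat lifecheck: a peer is considered
    down once no heartbeat has arrived for `deadtime` seconds.
    Illustrative only, not pgpool's actual implementation."""

    def __init__(self, deadtime=30):
        self.deadtime = deadtime
        self.last_seen = {}  # peer -> timestamp of last heartbeat

    def record(self, peer, now):
        """Called whenever a heartbeat datagram arrives from `peer`."""
        self.last_seen[peer] = now

    def dead_peers(self, now):
        """Peers silent for longer than the deadtime."""
        return [p for p, t in self.last_seen.items()
                if now - t > self.deadtime]

# Simulated timeline: pgpool0 keeps sending; pgpool1 goes silent at t=5.
mon = HeartbeatMonitor(deadtime=30)
mon.record("pgpool1.company.com", now=5)
for t in range(0, 60, 10):
    mon.record("pgpool0.company.com", now=t)

print(mon.dead_peers(now=50))  # pgpool1 silent for 45 s > 30 s deadtime
```

With the posted settings, a crashed node should therefore have been flagged within roughly 30 seconds, which is why the 30-minute silence is surprising.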
>> 
>> Our sysadmin believes the issue is related to TCP keepalive, and he
>> lowered the keepalive value on the server to 60 seconds. After that
>> change, the remaining pgpool nodes detected the failure quite
>> quickly.
>> 
>> But surely the watchdog shouldn't be relying on that? Am I
>> misunderstanding something basic about watchdog? Have I somehow missed
>> something horribly obvious?
>> 
>> This is the third or fourth pgpool cluster I've built, and it's the
>> first on which I've seen this behavior. If this is unexpected
>> behavior, how might I begin to troubleshoot the issue?
>> 
>> Any insight into this problem would be greatly appreciated.
>> 
>> Many thanks.
>> 
>> Martin.
>> 
>> -- 
>> Martin Goodson
>> 
>> In bed above we're deep asleep,
>> While greater love lies further deep.
>> This dream must end, the world must know,
>> We all depend on the beast below.
>> 
>> _______________________________________________
>> pgpool-general mailing list
>> pgpool-general at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-general

