[pgpool-general: 6726] Watchdog problem - crashed standby not being detected?

Martin Goodson kaemaril at googlemail.com
Tue Oct 8 05:55:28 JST 2019


Hello.

I think I've got a bit of a weird one here.

I've got a PostgreSQL 10.5 physical replication set-up running, with a 
primary database and a single standby server. Sitting on top of that is 
a three-node pgpool 4.0.4 cluster running with watchdog enabled and a 
delegate IP, in master/streaming mode.

All five servers are VMware VMs running in a stretched cluster on the same 
subnet, all running Red Hat Linux 7.6.

All three pgpool nodes seem to be up, running, and talking amongst 
themselves. I can communicate with the primary database via the VIP; 
pcp_watchdog_info reports the expected information, confirms the quorum 
is complete, shows the VIP correctly assigned to the master node, and so 
forth. Shutting down any node is detected by the other two nodes, and of 
course if the master is the node that is shut down, it fails over 
properly to one of the two standbys. No complaints on that front, all 
seems fine.
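For reference, I've been checking the cluster view with something along 
these lines from each node (the pcp port and user below are placeholders 
rather than our real values):

    pcp_watchdog_info -h pgpool2.company.com -p 9898 -U pcpadmin -v

and the verbose output has consistently looked healthy.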

As part of a recent high availability and disaster recovery test, we 
forced one of the pgpool standby servers down in an uncontrolled fashion 
to simulate a server crash. We expected pgpool to detect that it had 
lost the standby node. Instead, the pgpool master and the remaining 
standby continued to believe the failed server was still up as a 
standby, even with the server completely powered down. Even after 30 
minutes or so, there was no observed change of status.

I am at a loss to understand this. Surely the remaining two nodes should 
have noticed, hopefully quite quickly, that they had lost contact with 
the downed server via the watchdog?

The watchdog parameters I've got set are these (below is an extract from 
the pgpool.conf of our third pgpool node, pgpool2, with servers, ports, 
etc. changed to protect the innocent :) ):

------------------------------------------------------------------------------------------

use_watchdog = on

wd_hostname = 'pgpool2.company.com'
wd_port = 40010
wd_priority = 1
wd_authkey = '<database>'

clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''

failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off

wd_monitoring_interfaces_list = ''
wd_lifecheck_method = 'heartbeat'
wd_interval = 10

wd_heartbeat_port = 40015
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30


heartbeat_destination0 = 'pgpool0.company.com'
heartbeat_destination_port0 = 40015
heartbeat_device0 = 'eth0'

heartbeat_destination1 = 'pgpool1.company.com'
heartbeat_destination_port1 = 40015
heartbeat_device1 = 'eth0'

other_pgpool_hostname0 = 'pgpool0.company.com'
other_pgpool_port0 = 40000
other_wd_port0 = 40010

other_pgpool_hostname1 = 'pgpool1.company.com'
other_pgpool_port1 = 40000
other_wd_port1 = 40010

------------------------------------------------------------------------------------------
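If I'm reading the documentation correctly, with wd_heartbeat_keepalive 
= 2 and wd_heartbeat_deadtime = 30 each node should send a heartbeat 
every 2 seconds, and the surviving nodes should declare a node lost 
roughly 30 seconds after its last heartbeat packet, with no dependence 
on TCP keepalive at all. One thing I intend to check (assuming eth0 is 
actually the right interface on these hosts) is whether the UDP 
heartbeat packets are flowing between the nodes, with something like:

    tcpdump -ni eth0 udp port 40015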

I'm by no means a network expert or sysadmin, but I thought the point of 
the watchdog was that the nodes constantly and proactively 'talk amongst 
themselves' to determine the status of the other nodes.

Our sysadmin believes the issue is related to TCP keepalive, and he 
lowered the keepalive value on the servers to 60 seconds. After that 
change, the remaining pgpool nodes detected the failed node quite quickly.
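(I believe the setting he lowered was the kernel TCP keepalive time, 
i.e. something along the lines of

    sysctl -w net.ipv4.tcp_keepalive_time=60

but I'd have to confirm exactly which parameter he changed.)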

But surely the watchdog shouldn't be relying on that? Am I 
misunderstanding something basic about watchdog? Have I somehow missed 
something horribly obvious?

This is the third or fourth pgpool cluster I've built, and it is the 
first one on which I've seen this behavior. If this is unexpected 
behavior, how might I begin to troubleshoot the issue?
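In case it's useful, my rough plan so far is to raise the logging level 
on the two surviving nodes and repeat the test, e.g. with something like 
this in pgpool.conf (I'm assuming a debug level is what's needed to 
surface the lifecheck/heartbeat messages):

    log_min_messages = debug1

and then watch the logs while the third node is powered off, but I'd 
welcome any better suggestions.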

Any insight into this problem would be greatly appreciated.

Many thanks.

Martin.

-- 
Martin Goodson

In bed above we're deep asleep,
While greater love lies further deep.
This dream must end, the world must know,
We all depend on the beast below.


