[pgpool-hackers: 4227] Issue with failover_require_consensus

Mon Nov 21 09:38:48 JST 2022

Hi Usama,

I think I found an issue with failover_require_consensus. When this
parameter is enabled, a watchdog node asks the other watchdog nodes to
confirm the failover event. If the other nodes reply and the
originating watchdog can reach a consensus, the failover process
begins. However, if no replies arrive before
FAILOVER_COMMAND_FINISH_TIMEOUT expires, the failover request is
discarded and failover does not begin. If this happened only once or
twice, we could expect a subsequent health check to trigger failover.
But actually this can repeat forever (which means failover never
happens) if health_check_period is larger than
FAILOVER_COMMAND_FINISH_TIMEOUT (currently 15 seconds). For example,
suppose health_check_period = 30 seconds and watchdog node 1 starts
50 seconds after watchdog node 0 (suppose this is the leader node).
Then every time a failover consensus request is made (suppose the
time is t), it will be canceled at t + 15, because the failover event
on watchdog node 1 will not happen until time t + 20 ( = 50 - 30).
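
To make the arithmetic in this example concrete, here is a small
Python sketch. It is not Pgpool-II code, and the function names are
made up for illustration. It assumes each node runs its health check
at fixed intervals counted from its own start time, and that node 0
issues the consensus request at one of its own health check times.

```python
# Timeout value quoted above (seconds).
FAILOVER_COMMAND_FINISH_TIMEOUT = 15


def confirmation_delay(health_check_period, start_offset):
    """Seconds between node 0's request at time t and node 1's next
    health check, given that node 1 started start_offset seconds after
    node 0 (hypothetical model, see the assumptions above)."""
    return start_offset % health_check_period


def request_expires_first(health_check_period, start_offset,
                          timeout=FAILOVER_COMMAND_FINISH_TIMEOUT):
    """True if the request is discarded before node 1 can confirm."""
    return confirmation_delay(health_check_period, start_offset) > timeout


# The case described above: period 30 s, node 1 starts 50 s later.
delay = confirmation_delay(30, 50)
expired = request_expires_first(30, 50)
print(delay, expired)
```

With these numbers the delay is 20 seconds, which is longer than the
15 second timeout, and since the offset between the two nodes stays
the same on every cycle, the same thing happens on every request.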

Since we allow other watchdog nodes to join a watchdog cluster at any
time, I think this is not the behavior we expect.

Can we make FAILOVER_COMMAND_FINISH_TIMEOUT longer, or disable the
expiration when failover_require_consensus is on?

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp