[pgpool-hackers: 4263] Re: Watchdog heartbeat issue

Tatsuo Ishii ishii at sraoss.co.jp
Tue Jan 17 22:46:52 JST 2023


Hi Usama,

Thank you for investigating the issue.

> Hi Ishii San
> 
> Thanks for figuring out the issue.
> I think removing the code in question altogether could mark the remote node
> as dead too early at startup and could delay watchdog cluster stabilization
> when there is a delay of a few seconds between node startups.
> So IMHO the way to solve this is to wait for twice the wd_interval or
> wd_heartbeat_deadtime (depending on the configuration) if
> is_wd_lifecheck_ready() reports a failure.
> 
> What do you think of the attached patch?
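
To make sure I read the proposal correctly, here is a minimal C sketch of
the waiting rule as I understand it. It is not taken from the attached
patch. Only wd_interval, wd_heartbeat_deadtime and is_wd_lifecheck_ready()
come from this thread; the enum, the struct and both function names are
hypothetical. I am also assuming that "depending on the configuration"
means heartbeat mode uses wd_heartbeat_deadtime and any other mode uses
wd_interval, and that the doubling applies to both.

```c
#include <stdbool.h>

/* Hypothetical lifecheck mode selector; not the actual pgpool-II type. */
typedef enum
{
    LIFECHECK_BY_HEARTBEAT,
    LIFECHECK_BY_QUERY
} lifecheck_method;

/* Hypothetical subset of the watchdog settings named in this thread. */
typedef struct
{
    lifecheck_method method;
    int wd_interval;            /* seconds */
    int wd_heartbeat_deadtime;  /* seconds */
} wd_settings;

/*
 * Grace period at startup: twice the timing parameter that applies to
 * the configured lifecheck method (assumed mapping, see above).
 */
int
startup_grace_seconds(const wd_settings *cfg)
{
    int base = (cfg->method == LIFECHECK_BY_HEARTBEAT)
        ? cfg->wd_heartbeat_deadtime
        : cfg->wd_interval;

    return 2 * base;
}

/*
 * Decide whether lifecheck may start judging remote nodes.
 * lifecheck_ready stands in for the result of is_wd_lifecheck_ready().
 * If it reported a failure, keep waiting until the grace period is over.
 */
bool
lifecheck_may_proceed(bool lifecheck_ready, int seconds_waited,
                      const wd_settings *cfg)
{
    if (lifecheck_ready)
        return true;

    return seconds_waited >= startup_grace_seconds(cfg);
}
```

For example, with wd_heartbeat_deadtime = 30 in heartbeat mode, a node that
is not ready would be given 60 seconds before lifecheck starts judging it.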

I am probably missing something, but I wonder why the lifecheck on the
watchdog leader node does not notice that the node 1 watchdog will never
send a heartbeat signal. In the pgpool0 log:

2023-01-14 00:27:15: watchdog pid 26708: LOG:  read from socket failed, remote end closed the connection
2023-01-14 00:27:15: watchdog pid 26708: LOG:  client socket of localhost:50004 Linux abf1b59af489 is closed
2023-01-14 00:27:15: watchdog pid 26708: LOG:  remote node "localhost:50004 Linux abf1b59af489" is shutting down
2023-01-14 00:27:15: watchdog pid 26708: LOG:  removing watchdog node "localhost:50004 Linux abf1b59af489" from the standby list

It seems the leader watchdog had already noticed that node 1 was down.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

