[pgpool-hackers: 4260] Re: Watchdog heartbeat issue

Sun Jan 15 22:28:10 JST 2023

> Sometimes we cannot avoid the bind error itself. But the problem is
> that the leader watchdog node (pgpool0) goes into an infinite loop in
> wd_lifecheck.c:lifecheck_main():
> 
> 	/* wait until ready to go */
> 	while (WD_OK != is_wd_lifecheck_ready())
> 	{
> 		sleep(pool_config->wd_interval * 10);
> 	}
> 
> If is_wd_lifecheck_ready() fails to receive a heartbeat signal, it
> returns WD_NG and the loop runs again. As a result, the
> regression test failed with a timeout (see attached pgpool0.log).
> There's no way to avoid the bind error on the socket, and I think all
> we can do is sleep before starting the watchdog.
> 
> But in my opinion the infinite loop above is a real issue in
> production systems. Can we make the loop retry only for a limited time
> and, if it does not succeed, disregard the node in question?

Or can't we just remove the code above? If the other node is fine, we
don't need to wait. If the other node is down, that will be detected
later anyway, so we don't need to wait in that case either.

As a trial, I removed the code (see attached patch) and ran the
regression tests. All tests passed.
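
For comparison, the time-limited loop suggested in the quoted text
could look roughly like the sketch below. This is only an illustration
and is not the attached patch: wait_lifecheck_ready(), the probe
function pointer, max_tries and the two stub probes are hypothetical
names, and WD_OK/WD_NG are redefined locally (with arbitrary values)
only so that the snippet compiles on its own.

```c
#include <unistd.h>

/* Local stand-ins so the sketch is self-contained; the values are arbitrary. */
#define WD_OK 0
#define WD_NG 1

/* Hypothetical probe type standing in for is_wd_lifecheck_ready(). */
typedef int (*ready_probe_fn) (void);

/*
 * Call the probe up to max_tries times, sleeping interval_sec seconds
 * between attempts. Returns WD_OK as soon as the probe succeeds, or
 * WD_NG once the attempts are exhausted, so the caller can give up on
 * the node instead of looping forever.
 */
int
wait_lifecheck_ready(ready_probe_fn probe, int max_tries,
					 unsigned int interval_sec)
{
	int			i;

	for (i = 0; i < max_tries; i++)
	{
		if (probe() == WD_OK)
			return WD_OK;
		if (i + 1 < max_tries)	/* no need to sleep after the last try */
			sleep(interval_sec);
	}
	return WD_NG;
}

/* Stub probes for experimenting; they only count how often they run. */
int			probe_calls = 0;

int
never_ready(void)
{
	probe_calls++;
	return WD_NG;
}

int
ready_on_third_call(void)
{
	probe_calls++;
	return (probe_calls >= 3) ? WD_OK : WD_NG;
}
```

With something like this, lifecheck_main() could log a warning and
carry on when WD_NG comes back, rather than blocking startup. Removing
the wait altogether, as in the attached patch, is simpler if later
lifecheck cycles detect a dead node anyway.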

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lifecheck.patch
Type: text/x-patch
Size: 469 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20230115/df4484e1/attachment.bin>