[pgpool-hackers: 4262] Re: Watchdog heartbeat issue

Muhammad Usama muhammad.usama at percona.com
Tue Jan 17 18:22:20 JST 2023


Hi Ishii San

Thanks for figuring out the issue.
I think removing the code in question altogether could cause the remote node
to be marked as dead too early at startup, and could delay watchdog cluster
stabilization when the nodes start a few seconds apart.
So IMHO the way to solve this is to wait for twice the wd_interval or
wd_heartbeat_deadtime (depending on the configuration) if
is_wd_lifecheck_ready() reports a failure.
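
To make the idea concrete, here is a rough sketch of a bounded wait. It is only
one reading of the proposal, not the attached patch: the stubbed readiness
check, the sleep hook, the helper names and the heartbeat_mode flag are all
assumptions made for illustration, and the real is_wd_lifecheck_ready() and
pool_config fields are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

#define WD_OK 0
#define WD_NG 1

/* Hypothetical stand-in for is_wd_lifecheck_ready(): it reports WD_NG
 * a configurable number of times and WD_OK afterwards. */
static int fake_failures_left = 0;

static int is_wd_lifecheck_ready_stub(void)
{
    if (fake_failures_left > 0)
    {
        fake_failures_left--;
        return WD_NG;
    }
    return WD_OK;
}

/* Hypothetical sleep hook so the sketch runs instantly; real code
 * would call sleep(). It only records the requested seconds. */
static int total_slept = 0;

static void sleep_stub(int seconds)
{
    total_slept += seconds;
}

/* Assumed policy: the upper bound is twice wd_heartbeat_deadtime in
 * heartbeat mode, otherwise twice wd_interval. */
static int startup_wait_limit(bool heartbeat_mode, int wd_interval,
                              int wd_heartbeat_deadtime)
{
    return 2 * (heartbeat_mode ? wd_heartbeat_deadtime : wd_interval);
}

/* Poll until lifecheck is ready, but give up once max_wait seconds
 * have been spent sleeping. Returns true if ready, false on give-up. */
static bool wait_for_lifecheck_ready(int poll_interval, int max_wait)
{
    int waited = 0;

    assert(poll_interval > 0);

    while (is_wd_lifecheck_ready_stub() != WD_OK)
    {
        if (waited >= max_wait)
            return false;   /* caller proceeds instead of looping forever */
        sleep_stub(poll_interval);
        waited += poll_interval;
    }
    return true;
}
```

With a bound like this, a node whose heartbeat never arrives only delays
startup by a fixed amount, and the regular lifecheck can still mark it dead
later.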

What do you think of the attached patch?

Best regards
Muhammad Usama

On Sun, Jan 15, 2023 at 6:28 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > We sometimes cannot avoid the bind error itself. But the problem is that the
> > leader watchdog node (pgpool0) goes into an infinite loop in
> > wd_lifecheck.c:lifecheck_main():
> >
> >       /* wait until ready to go */
> >       while (WD_OK != is_wd_lifecheck_ready())
> >       {
> >               sleep(pool_config->wd_interval * 10);
> >       }
> >
> > If is_wd_lifecheck_ready() fails to receive a heartbeat signal, it
> > returns WD_NG and the loop continues. As a result, the
> > regression test failed with a timeout (see attached pgpool0.log). There's no
> > way to avoid the socket bind error, so I think all we can do is
> > sleep before starting the watchdog.
> >
> > But in my opinion the infinite loop above is a real issue in
> > production systems. Can we let the loop continue for a limited time and,
> > if it does not succeed, disregard the node in question?
>
> Or can't we just remove the code above? If the other node is fine, we
> don't need to wait. If the other node is down, it will be detected later
> on anyway, so we don't need to wait in that case either.
>
> As a trial, I removed the code (see attached patch) and ran the
> regression test. All tests passed.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lifecheck_v2.patch
Type: application/octet-stream
Size: 836 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20230117/1942d9e8/attachment.obj>