[pgpool-general: 6730] Re: Watchdog problem - crashed standby not being detected?

Thu Oct 10 15:55:43 JST 2019

Hi Martin,

The Pgpool-II watchdog relies on two mechanisms to detect node failure.
The first is when it is informed by the heartbeat (lifecheck) process, and
the second is when the watchdog core itself fails to send data to or
receive data from a particular node.
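
To illustrate the heartbeat-based (lifecheck) mechanism, here is a minimal
sketch in Python. It is not the actual Pgpool-II code: the class name, the
method names, and the 30-second deadtime are hypothetical and chosen only
for illustration. The idea is simply that a node is considered dead once no
heartbeat has been seen from it for longer than the deadtime.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Toy lifecheck (hypothetical, not Pgpool-II internals).

    A node counts as dead once no heartbeat has arrived from it
    for more than `deadtime` seconds.
    """
    deadtime: float = 30.0  # illustrative value, in seconds
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record_heartbeat(self, node: str, now: float) -> None:
        # Remember the time of the most recent heartbeat from this node.
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> List[str]:
        # A node is dead if its last heartbeat is older than the deadtime.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.deadtime
        )


monitor = HeartbeatMonitor(deadtime=30.0)
monitor.record_heartbeat("master", now=0.0)
monitor.record_heartbeat("standby", now=0.0)
# The standby crashes; only the master keeps sending heartbeats.
monitor.record_heartbeat("master", now=28.0)
print(monitor.dead_nodes(now=31.0))  # ['standby']
```

With a check like this, a crashed standby should be reported as dead as
soon as the deadtime has elapsed, regardless of the state of any TCP
connection to it.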

Now, while working on another watchdog-related bug, I found an issue in the
mechanism that sends the periodic status updates from the standby to the
master node for failure detection. That issue could delay the detection of
a standby node failure by the watchdog core when a standby crashes.
I have already created a patch for it and will commit it in a day or two.

But even without that fix, this issue shouldn't have happened: the
lifecheck should have detected the absence of heartbeat messages from the
crashed node. So I still need to figure out what could have caused the
lifecheck process to think the crashed node is still alive and active.
If you happen to have the pgpool logs for this scenario, they would help
in debugging the cause.

Thanks
Best regards
Muhammad Usama

On Wed, Oct 9, 2019 at 2:13 AM Martin Goodson <kaemaril at googlemail.com>
wrote:

> On 08/10/2019 01:17, Tatsuo Ishii wrote:
> > My wild guess is that the watchdog communication socket (it uses
> > TCP/IP) was blocked by the standby node crash, and this made the
> > watchdog state machine freeze. Thus the watchdog did not notice that
> > the heartbeat channel was down.
> >
> >> Hi Usama,
> >>
> >> Can you please look into this?
> >>
> >> This sounds weird to me too because:
> >>
> >> 1) tcp_keepalive does not affect heartbeat since it uses UDP, not TCP.
> >>
> >> 2) Why does heartbeat not work in this case?
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese:http://www.sraoss.co.jp
>
> Hello. We had another HA/DR test today, but unfortunately we didn't get
> as far as force-crashing one of the pgpools; the tests that were done
> focused on the backend nodes instead.
>
> However, I was able to run a tcpdump on the UDP port, and I could see
> that the traffic was definitely going through at two-second intervals.
> Our sysadmin's initial thought, before settling on the keepalive theory,
> was that the heartbeat traffic was somehow being blocked by a firewall,
> with pgpool somehow silently ignoring it. So that idea at least has been
> ruled out :)
>
> I will see if I can force-crash a server in our dev environment
> tomorrow while dumping the UDP traffic, and see what happens to the
> traffic with regard to keepalives, etc.
>
> I'll ramp up the logging level as well, and see what happens.
>
> Regards,
>
> M.
> --
> Martin Goodson
>
> "Have you thought up some clever plan, Doctor?"
> "Yes, Jamie, I believe I have."
> "What're you going to do?"
> "Bung a rock at it."