[pgpool-general: 6731] Re: Watchdog problem - crashed standby not being detected?

Martin Goodson kaemaril at googlemail.com
Tue Oct 15 17:39:10 JST 2019


Hello. Apologies for the delay in replying, it's been a busy few days with
an unrelated production incident occupying most of my time :(

Unfortunately it looks like the logs weren't retained, so I'm going to see
if I can reproduce the problem in our test environment this week.

Is there anything you can suggest I set up in advance to capture more
detail if I do manage to reproduce it? Tcpdumps on the pgpools, setting
log_min_messages to a specific debug level, etc.?
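
As an illustration of the kind of check a capture would make possible, here is a rough sketch that scans heartbeat packet arrival times for long silences. It assumes the timestamps have already been extracted from a tcpdump as seconds; the function name and the 30-second threshold are placeholders chosen for the example, not pgpool settings.

```python
from typing import Iterable, List, Tuple


def find_heartbeat_gaps(
    timestamps: Iterable[float], deadtime: float = 30.0
) -> List[Tuple[float, float]]:
    """Return (last_seen, next_seen) pairs where the silence exceeded deadtime.

    timestamps: packet arrival times in seconds (hypothetical input,
                e.g. parsed from a tcpdump capture of the heartbeat port).
    deadtime:   illustrative silence threshold in seconds (an assumption).
    """
    ordered = sorted(timestamps)
    gaps = []
    # Compare each arrival with the one before it.
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > deadtime:
            gaps.append((prev, cur))
    return gaps


if __name__ == "__main__":
    # Packets every 2 seconds, then a 36-second silence.
    print(find_heartbeat_gaps([0, 2, 4, 40, 42]))
```

Any gap this reports could then be lined up against the pgpool log timestamps to see whether lifecheck reacted to the silence.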

Your help is very much appreciated on this, as it is a real puzzler :(

Regards,

Martin

(Apologies for any typos etc - sent from mobile)

On Thu, 10 Oct 2019, 07:55 Muhammad Usama, <m.usama at gmail.com> wrote:

> Hi Martin,
>
> The Pgpool-II watchdog relies on two mechanisms to detect node failure:
> one when it is informed by the heartbeat (lifecheck) process, and the
> second by itself, when its core fails to send data to or receive data
> from a particular node.
>
> Now, while working on another watchdog-related bug, I found an issue in
> the mechanism that sends the periodic status updates from the standby to
> the master node for detecting failure. That issue could delay the
> detection of a standby node failure by the watchdog core in case of a
> standby crash.
> I have already created a patch for that and will be committing it in
> a day or two.
>
> But even without that fix, this issue shouldn't have happened, and the
> lifecheck should have detected the absence of heartbeat messages from
> the crashed node. So I still need to figure out what could have caused
> the lifecheck process to think the crashed node is still alive and
> active. If you happen to have the pgpool logs for the scenario, that
> would help in debugging the cause.
>
>
> Thanks
> Best regards
> Muhammad Usama
>
>
> On Wed, Oct 9, 2019 at 2:13 AM Martin Goodson <kaemaril at googlemail.com>
> wrote:
>
>> On 08/10/2019 01:17, Tatsuo Ishii wrote:
>> > My wild guess is that the watchdog communication socket (it uses
>> > TCP/IP) was blocked by the standby node crash, and this froze the
>> > watchdog state machine. Thus the watchdog did not notice that the
>> > heartbeat channel was down.
>> >
>> >> Hi Usama,
>> >>
>> >> Can you please look into this?
>> >>
>> >> This sounds weird to me too because:
>> >>
>> >> 1) tcp_keepalive does not affect the heartbeat, since it uses UDP,
>> >> not TCP.
>> >>
>> >> 2) Why does the heartbeat not work in this case?
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese:http://www.sraoss.co.jp
>>
>> Hello. We had another HA/DR test today, but unfortunately we didn't
>> get as far as force-crashing one of the pgpools; the tests that were
>> done were dedicated to the backend nodes instead.
>>
>> However, I was able to run a tcpdump on the UDP port, and I could see
>> that the traffic was definitely going through at two-second intervals.
>> Our sysadmin's initial thought, before settling on the keepalive
>> theory, was that the heartbeat traffic was somehow being blocked by a
>> firewall and that pgpool was silently ignoring this. So that idea at
>> least has been ruled out :)
>>
>> I will see if I can force-crash a server in our dev environment
>> tomorrow while dumping the UDP traffic, and see what happens to the
>> traffic with regard to keepalives, etc.
>>
>> I'll ramp up the logging level as well, and see what happens.
>>
>> Regards,
>>
>> M.
>> --
>> Martin Goodson
>>
>> "Have you thought up some clever plan, Doctor?"
>> "Yes, Jamie, I believe I have."
>> "What're you going to do?"
>> "Bung a rock at it."
>> _______________________________________________
>> pgpool-general mailing list
>> pgpool-general at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-general
>>
>