<div dir="ltr"><div>Hi Martin,</div><div><br></div><div>The Pgpool-II watchdog relies on two mechanisms to detect node failure. The first is when it is informed by the heartbeat (lifecheck)</div><div>process, and the second is on its own, when its core fails to send data to or receive data from a particular node.</div><div><br></div><div>While working on another watchdog-related bug, I found an issue in the mechanism that</div><div>sends the periodic status updates from the standby to the master node for detecting failure. That issue could</div><div>delay the detection of a standby node failure by the watchdog core when a standby crashes.</div><div>I have already created a patch for that and will commit it in a day or two.</div><div><br></div><div>But even without that fix, this issue shouldn't have happened, and the lifecheck should have detected the absence of heartbeat</div><div>messages from the crashed node. So I still need to figure out what could have caused the lifecheck process to</div><div>think the crashed node was still alive and active. If you happen to have the pgpool logs for this scenario, they would help in debugging the</div><div>cause.</div><div><br></div><div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 9, 2019 at 2:13 AM Martin Goodson <<a href="mailto:kaemaril@googlemail.com">kaemaril@googlemail.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 08/10/2019 01:17, Tatsuo Ishii wrote:<br>
> My wild guess is, watchdog communication socket (it uses TCP/IP) was<br>
> blocked by the standby node crash, and this froze the watchdog state<br>
> machine. Thus the watchdog did not notice that the heartbeat channel was down.<br>
> <br>
>> Hi Usama,<br>
>><br>
>> Can you please look into this?<br>
>><br>
>> This sounds weird to me too because:<br>
>><br>
>> 1) tcp_keepalive does not affect the heartbeat, since it uses UDP, not TCP.<br>
>><br>
>> 2) Why does the heartbeat not work in this case?<br>
>><br>
>> Best regards,<br>
>> --<br>
>> Tatsuo Ishii<br>
>> SRA OSS, Inc. Japan<br>
>> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>
>> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>
<br>
Hello. We had another HA/DR test today, but unfortunately we <br>
didn't get as far as force-crashing one of the pgpools; the tests that were <br>
done were dedicated to the backend nodes instead.<br>
<br>
However, I was able to do a tcp dump on the UDP port, and I could see <br>
that the traffic was definitely going through at two second intervals. <br>
The initial thought from our sysadmin, before settling on the keepalive <br>
theory, was that, somehow, the heartbeat traffic was being blocked by a <br>
firewall which pgpool was somehow silently discarding. So that idea at <br>
least has been ruled out :)<br>
<br>
I will see if I can force crash a server in our dev environment <br>
tomorrow while dumping the UDP traffic, and see what happens to the <br>
traffic with regard to keepalives, etc.<br>
<br>
I'll ramp up the logging level as well, and see what happens.<br>
<br>
Regards,<br>
<br>
M.<br>
-- <br>
Martin Goodson<br>
<br>
"Have you thought up some clever plan, Doctor?"<br>
"Yes, Jamie, I believe I have."<br>
"What're you going to do?"<br>
"Bung a rock at it."<br>
_______________________________________________<br>
pgpool-general mailing list<br>
<a href="mailto:pgpool-general@pgpool.net" target="_blank">pgpool-general@pgpool.net</a><br>
<a href="http://www.pgpool.net/mailman/listinfo/pgpool-general" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-general</a><br>
</blockquote></div></div>