<div dir="ltr"><div>Hi Martin,</div><div><br></div><div>The Pgpool-II watchdog relies on two mechanisms to detect node failure: first, when it is informed by the heartbeat (lifecheck)</div><div>process, and second, by itself, when its core fails to send data to or receive data from a particular node.</div><div><br></div><div>While working on another watchdog-related bug, I found an issue in the mechanism that</div><div>sends periodic status updates from the standby to the master node for failure detection. That issue could</div><div>delay the watchdog core's detection of a standby node failure when a standby crashes.</div><div>I have already created a patch for it and will commit it in a day or two.</div><div><br></div><div>But even without that fix, this problem shouldn't have happened: the lifecheck should have detected the absence of heartbeat</div><div>messages from the crashed node. I still need to figure out what could have caused the lifecheck process to</div><div>think the crashed node was still alive and active. If you happen to have the pgpool logs for this scenario, they would help in debugging the</div><div>cause.</div><div><br></div><div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Oct 9, 2019 at 2:13 AM Martin Goodson &lt;<a href="mailto:kaemaril@googlemail.com">kaemaril@googlemail.com</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">On 08/10/2019 01:17, Tatsuo Ishii wrote:<br>

> My wild guess is, watchdog communication socket (it uses TCP/IP) was<br>

> blocked by the standby node crash, and this made the watchdog state<br>

> machine freeze. Thus the watchdog did not notice the heartbeat channel was down.<br>

> <br>

>> Hi Usama,<br>

>><br>

>> Can you please look into this?<br>

>><br>

>> This sounds weird to me too because:<br>

>><br>

>> 1) tcp_keepalive does not affect the heartbeat, since it uses UDP, not TCP.<br>

>><br>

>> 2) Why does the heartbeat not work in this case?<br>

>><br>

>> Best regards,<br>

>> --<br>

>> Tatsuo Ishii<br>

>> SRA OSS, Inc. Japan<br>

>> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

>> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

<br>

Hello. We had another HA/DR test today, but unfortunately we <br>

didn't get as far as force-crashing one of the pgpools; the other tests <br>

focused on the backend nodes instead.<br>

<br>

However, I was able to run tcpdump on the UDP port, and I could see <br>

that the traffic was definitely going through at two-second intervals. <br>

The initial thought from our sysadmin, before settling on the keepalive <br>

theory, was that the heartbeat traffic was somehow being blocked by a <br>

firewall, which pgpool was somehow silently discarding.  So that idea at <br>

least has been ruled out :)<br>

<br>

  I will see if I can force crash a server in our dev environment <br>

tomorrow while dumping the UDP traffic, and see what happens to the <br>

traffic with regards to keepalives, etc.<br>

<br>

I'll ramp up the logging level as well, and see what happens.<br>

<br>

Regards,<br>

<br>

M.<br>

-- <br>

Martin Goodson<br>

<br>

"Have you thought up some clever plan, Doctor?"<br>

"Yes, Jamie, I believe I have."<br>

"What're you going to do?"<br>

"Bung a rock at it."<br>

_______________________________________________<br>

pgpool-general mailing list<br>

<a href="mailto:pgpool-general@pgpool.net" target="_blank">pgpool-general@pgpool.net</a><br>

<a href="http://www.pgpool.net/mailman/listinfo/pgpool-general" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-general</a><br>

</blockquote></div></div>
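<div>For readers following the thread, the lifecheck behaviour discussed above (a node should be declared dead once its heartbeat messages stop arriving) can be sketched roughly as below. This is only an illustrative model and not Pgpool-II's actual implementation: the class <code>HeartbeatMonitor</code>, its methods, the node names, and the 30-second deadline are all assumptions made for the example. Only the two-second send interval comes from the traffic observed in the thread.</div>

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical model of a heartbeat-based lifecheck (not pgpool code)."""

    # Seconds of silence after which a node is considered dead (assumed value).
    deadtime: float = 30.0
    # Timestamp of the last heartbeat received from each node.
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        """Note that a heartbeat packet from `node` arrived at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> List[str]:
        """Return nodes whose last heartbeat is older than the deadline."""
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.deadtime
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(deadtime=30.0)
    # Both nodes send a heartbeat at t=0; "standby2" then crashes.
    monitor.record("standby1", now=0.0)
    monitor.record("standby2", now=0.0)
    # "standby1" keeps sending every two seconds; only the last one matters.
    for t in range(2, 30, 2):
        monitor.record("standby1", now=float(t))
    print(monitor.dead_nodes(now=31.0))
```

<div>Under these assumptions, only the node that stopped sending is reported once the deadline passes. The point of the sketch is that this check depends solely on when heartbeats were last received, which is why the lifecheck would be expected to notice a crashed node independently of the state of any TCP connection.</div>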