[pgpool-general: 5206] Possible bug in de-escalation

Fri Dec 30 02:11:25 JST 2016

Hi all,

I have two pgpool 3.6.1 instances on CentOS 7, configured to share a delegate IP.
We're trying to assess the viability of escalation/de-escalation in case of
network failure or simply when one of the pgpool instances is shut down to
perform maintenance.

Our test consists of shutting down the MASTER pgpool and letting the other one
escalate. Escalation always worked correctly, but de-escalation on the instance
being shut down sometimes did not bring down the delegate IP correctly.
The final result is that both nodes end up holding the delegate IP.

Looking at the logs, when de-escalation worked there is a log line saying
"watchdog: de-escalation started". This line is emitted in
fork_plunging_process in watchdog/wd_escalation.c.

On the contrary, when de-escalation did not work, this line did not appear
in the log.

I've added some more verbose logging and found that in the problematic cases
fork_plunging_process does not complete.
If I add an ereport immediately before
   POOL_SETMASK(&UnBlockSig);
and one immediately after, the second one is never executed, as the process
is killed before reaching it.
I'm not familiar with the code, but it seems that signal 15 (SIGTERM) is
unblocked for at least a short window of time.

In my private setup this bug is quite difficult to reproduce.
Thanks to Murphy's law, in our customer's setup it happens in more than half
of the cases.

This seems to be a bug.
Should I open a ticket?

Best regards,

Gabriele Monfardini

-----
Gabriele Monfardini
LdP Progetti GIS
tel: 0577.531049
email: monfardini at ldpgis.it