[pgpool-general: 5006] Failover command not invoked by secondary pgpool after its promotion to primary in HA

Wed Sep 21 23:12:02 JST 2016

Hi all,

I have the following setup: two identical virtual machines running
CentOS 7, each with pgpool 3.5.4 and postgresql 9.5.4.

The first one, let's call it A, has master postgresql and primary pgpool.
The second one, let's call it B, has slave postgresql (in hot standby) and
secondary pgpool.
I think this setup is pretty common.

Both pgpool instances are configured with a delegate IP, watchdog, health
checks on both backends and a failover command.
At first the setup seems to work correctly.

The problem appears when we test failover by bringing down the network
on VM A.
This should cause the following:

   - pgpool on B should react by becoming the primary pgpool and adding the
   delegate IP to its ethernet device
   - after the configured number of failed attempts to connect to the
   master database on A, pgpool should promote postgresql on B using the
   configured failover_command.
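
To make the second point concrete, this is roughly the behaviour we expect,
written as a small Python sketch. It is only a simplified model of the
health check as we understand it, not pgpool's actual code. The names
probe, run_failover, max_retries and retry_delay are hypothetical and
chosen for illustration.

```python
import time


def health_check(probe, run_failover, max_retries, retry_delay=0.0):
    """Simplified model: probe the backend, retry on failure, then fail over.

    probe() returns True when the backend answers; run_failover() stands in
    for the configured failover_command. Both are hypothetical callables.
    """
    for attempt in range(max_retries + 1):  # first try plus max_retries retries
        if probe():
            return True  # backend is healthy, nothing to do
        if attempt < max_retries:
            time.sleep(retry_delay)  # wait before the next retry
    run_failover()  # all attempts failed: the step that never happens for us
    return False
```

In our test the equivalent of run_failover() is never reached, although the
attempts themselves clearly fail.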

The first point was successful, the second was not.
After the attempts to connect to postgresql on A, a message appears
notifying that the backend on A is declared down, but no failover command
is issued and SHOW pool_nodes still shows postgresql on A as online.
Moreover, attempts to connect to postgresql on A continue forever.
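
The inconsistency could be detected from outside pgpool with something like
the following minimal Python sketch. It assumes the rows of SHOW pool_nodes
have already been fetched into dicts with the keys node_id, hostname, port
and status, and that status values 1 and 2 mean "up" (our reading of the
numeric status column in this pgpool version). The function names are
hypothetical.

```python
import socket


def backend_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stale_nodes(pool_nodes, probe=backend_reachable):
    """Return ids of nodes reported as up that the probe cannot reach.

    pool_nodes: iterable of dicts built from the rows of SHOW pool_nodes.
    Assumption: status 1 or 2 means the node is considered up.
    """
    up_statuses = {"1", "2"}
    return [
        node["node_id"]
        for node in pool_nodes
        if str(node["status"]) in up_statuses
        and not probe(node["hostname"], int(node["port"]))
    ]
```

In our case such a check would flag the node on A: it is still listed as
online while it is unreachable.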

Note that restarting pgpool fixes all problems.
It declares itself primary since it cannot reach its counterpart on A,
brings up the delegate IP and, after the configured number of attempts,
promotes postgresql on B to new master.

So this definitely seems to be a bug, and indeed a ticket has already been
opened for it.
http://www.pgpool.net/mantisbt/view.php?id=227&history=1

My question is the following: is our setup, in particular having two nodes
each with one pgpool and one postgresql, ideal, or is it advisable to
arrange the services across more nodes or differently?

IMHO the probability that one node goes down entirely or becomes
unreachable is higher than that of a failure of the postgresql service
alone.
Thus, due to this bug, we risk having no high availability at all and being
left with a read-only database.
Obviously our client is not very happy with that, since we have a more
complex setup with little advantage.

Is there any news on when the bug will be fixed?

Is there any workaround that I can use to mitigate the impact?
For instance, I could set up the pgpools so that the primary is on B.
Avoiding the pgpool promotion would probably let the postgresql failover
trigger.

What is your opinion?

Best regards,

Gabriele Monfardini

-----
Gabriele Monfardini
LdP Progetti GIS
tel: 0577.531049
email: monfardini at ldpgis.it