[pgpool-general: 7903] Re: Possible race condition during startup causing node to enter network isolation

Emond Papegaaij emond.papegaaij at gmail.com
Fri Nov 26 17:25:30 JST 2021


On Fri, Nov 26, 2021 at 3:11 AM Bo Peng <pengbo at sraoss.co.jp> wrote:

> > In our tests we are seeing sporadic failures when services on one node
> are
> > restarted.
>
> Could you provide a scenario to reproduce this issue?
> Did you restart the Pgpool-II service only?
> Did watchdog and lifecheck work at the initial startup?
>
> Could you share your watchdog configurations of each Pgpool-II node?
>
Hi,

Unfortunately, the failure is not reproducible in a reliable way. It occurs
about once every 10 to 20 runs of our tests. The test that fails is a full
upgrade of our appliance. This upgrade includes a new docker container for
both postgresql and Pgpool, and both services are restarted. Pgpool is
upgraded from 4.2.4 to 4.2.6. When the upgrade starts, the cluster is fully
healthy and all Pgpool nodes are connected. In the logs from node 2, you can
also see that node 1 was lost because it was shut down (The newly joined
node: "172.29.30.1:5432 Linux 8e410fda51ac" had left the cluster because it
was shutdown). Looking at the sequence of events, it seems the lifecheck
process on node 2 marks node 1 as DEAD at a very inconvenient moment: right
when it is handling the incoming connection from node 1.

In the logs from successful runs I can see that the socket for incoming
heartbeats is only opened after the node has joined the cluster and synced
with the leader. I don't know if this is intentional, but perhaps the socket
could be opened earlier, before the node tries to join the cluster. That
should prevent this situation.
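To sketch the ordering I mean (this is not pgpool's actual code; the
`heartbeat_port` and `join_cluster` names are placeholders, and pgpool's
watchdog uses UDP for heartbeats, which this sketch assumes):

```python
import socket

def start_watchdog(heartbeat_port, join_cluster):
    # Bind the heartbeat receive socket FIRST, so that a peer's heartbeats
    # cannot be missed while the join/sync with the leader is still in
    # flight. In the failing runs this ordering appears to be reversed.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", heartbeat_port))

    # Only then initiate the join and sync with the leader; incoming
    # heartbeats queue on the bound socket in the meantime.
    join_cluster()
    return sock
```

With the socket already bound, the lifecheck process would keep seeing
node 1's heartbeats while node 2 handles the join handshake, so the DEAD
marking could not race with it.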

Our tests do not store the pgpool configuration used, but it is identical
on every installation with 3 nodes (with only some differences in
backend_application_name2). I've attached a copy of our pgpool configuration
of node 1 from one of our installations. Nodes 2 and 3 are identical, with
just the IP addresses changed. Let me know if you need additional
information. Unfortunately the build is no longer available on our build
servers, but if this happens again, I'll make a copy of the entire build.
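For reference, the per-node backend entries look like the following
(a hypothetical excerpt in pgpool-II 4.2 style; the IPs and application
names are placeholders, not our real values):

```
# Hypothetical excerpt; IPs and application names are placeholders.
backend_hostname0 = '172.29.30.1'
backend_port0 = 5432
backend_application_name0 = 'server0'
backend_hostname1 = '172.29.30.2'
backend_port1 = 5432
backend_application_name1 = 'server1'
backend_hostname2 = '172.29.30.3'
backend_port2 = 5432
backend_application_name2 = 'server2'
```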

Best regards,
Emond
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.conf
Type: application/octet-stream
Size: 41288 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20211126/cd71700d/attachment-0001.obj>


More information about the pgpool-general mailing list