[pgpool-general: 7904] Re: Possible race condition during startup causing node to enter network isolation

Mon Nov 29 13:32:50 JST 2021

Hello,

On Fri, 26 Nov 2021 09:25:30 +0100
Emond Papegaaij <emond.papegaaij at gmail.com> wrote:

> On Fri, Nov 26, 2021 at 3:11 AM Bo Peng <pengbo at sraoss.co.jp> wrote:
> 
> > > In our tests we are seeing sporadic failures when services on one node
> > > are restarted.
> >
> > Could you provide a scenario to reproduce this issue?
> > Did you restart only the Pgpool-II service?
> > Did watchdog and lifecheck work at initial startup?
> >
> > Could you share your watchdog configurations of each Pgpool-II node?
> >
> Hi,
> 
> Unfortunately, the failure is not reproducible in a reliable way. It occurs
> about once every 10 to 20 runs of our tests. The test that fails is a full
> upgrade of our appliance. This upgrade includes a new docker container for
> both postgresql and Pgpool. Both services are restarted. Pgpool is upgraded
> from 4.2.4 to 4.2.6. 

I have checked your pgpool.conf, and I don't think the issue is caused by
the configuration.
Does this issue occur only during the 4.2.4 -> 4.2.6 upgrade?
Does it still occur after all nodes have been upgraded to 4.2.6?

> When the upgrade starts, the cluster is fully healthy.
> All Pgpool nodes are connected. In the logs from node 2, you can also see
> that node 1 was lost because it was shut down (The newly joined node:"
> 172.29.30.1:5432 Linux 8e410fda51ac" had left the cluster because it was
> shutdown). Looking at the sequence of events, it seems the lifecheck
> process in node 2 marks node 1 as DEAD at a very inconvenient moment: right
> at the moment it is handling the incoming connection from node 1.
> 
> In the logs from successful runs I can see that the socket for incoming
> heartbeats is opened after the node has joined the cluster and synced with
> the leader. I don't know if this is intentional, but perhaps it is possible
> to move this to the front and open the socket before it tries to join the
> cluster. This should prevent this situation.
> 
> Our tests do not store the pgpool configuration used, but it is identical
> on every installation with 3 nodes (with only some differences in
> backend_application_name2). I've attached a copy of our pgpool configuration
> for node 1 from one of our installations. Nodes 2 and 3 are identical, with
> just the IP addresses changed. Let me know if you need additional
> information. Unfortunately the build is no longer available on our build
> servers, but if this happens again, I'll make a copy of the entire build.
> 
> Best regards,
> Emond
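
For reference, the ordering suggested above (open the socket for incoming
heartbeats first, then join the cluster) can be sketched roughly as below.
This is only an illustration in Python, not Pgpool-II code (the watchdog is
written in C). It assumes heartbeats arrive as UDP datagrams, and the names
open_heartbeat_receiver, start_node and fake_join are hypothetical. Whether
this ordering would actually avoid the race described above is not confirmed.

```python
import socket


def open_heartbeat_receiver(host="127.0.0.1", port=0):
    """Bind a UDP socket for incoming heartbeats.

    Port 0 lets the OS pick a free port; a real node would use its
    configured heartbeat port instead (assumption for this sketch).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    sock.settimeout(1.0)
    return sock


def start_node(join_cluster):
    """Open the heartbeat receiver first, then run the join step.

    join_cluster is a hypothetical callback standing in for
    "join the cluster and sync with the leader".
    """
    receiver = open_heartbeat_receiver()
    # From this point the kernel queues heartbeats sent by peers,
    # even though the join step has not finished yet.
    join_cluster(receiver.getsockname())
    return receiver


def demo():
    """Show that a heartbeat sent during the join step is not lost."""
    def fake_join(addr):
        # Stands in for a peer that sends a heartbeat while we are joining.
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as peer:
            peer.sendto(b"heartbeat", addr)

    receiver = start_node(fake_join)
    try:
        data, _ = receiver.recvfrom(1024)
    finally:
        receiver.close()
    return data
```

Calling demo() reads the datagram that was sent while the join step was
still running, which is the property the suggested ordering relies on.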

-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan
http://www.sraoss.co.jp/