[pgpool-general: 8544] Re: Issues taking a node out of a cluster

Emond Papegaaij emond.papegaaij at gmail.com
Mon Jan 16 17:05:06 JST 2023


On Mon, Jan 16, 2023 at 1:33 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > We are seeing failures in our test suite on a specific set of tests
> > related to taking a node out of a cluster. In short, it seems the
> > following sequence of events occurs:
> > * We start with a healthy cluster with 3 nodes (0, 1 and 2), each node
> > running pgpool and postgresql. Node 0 runs the primary database.
> > * node 1 is shut down
> > * pgpool on nodes 0 and 2 correctly mark backend 1 down
> > * pgpool on node 0 is reconfigured, removing node 1 from the
> > configuration; backend 0 remains backend 0, backend 2 is now known as
> > backend 1
> > * pgpool on node 0 starts up again, and receives the cluster status
> > from node 2, which includes backend 1 being down
> > * pgpool on node 0 now also marks backend 1 as being down, but because
> > of the renumbering, it actually marks the backend on node 2 as down
> > * pgpool on node 2 gets its new configuration, same as on node 0
> > * pgpool on node 2 (which now runs backend 1) gets the cluster status
> > from node 0, and marks backend 1 down
> > * the cluster ends up with pgpool and postgresql running on both
> > remaining nodes, but backend 1 is down. It never recovers from this
> > state automatically, even though auto_failback is enabled and
> > postgresql is up and streaming.
> >
> > For node 2 (with backend 1), pcp_node_info returns the following
> > information for backend 1:
> > Hostname               : 172.29.30.3
> > Port                   : 5432
> > Status                 : 3
> > Weight                 : 0.500000
> > Status Name            : down
> > Backend Status Name    : up
> > Role                   : standby
> > Backend Role           : standby
> > Replication Delay      : 0
> > Replication State      : streaming
> > Replication Sync State : async
> > Last Status Change     : 2023-01-09 22:28:41
> >
> > My first question is: Can we somehow prevent the state of backend 1 being
> > assigned to the wrong node during the configuration update?
>
> Have you removed the pgpool_status file before restarting pgpool? The
> file remembers the backend status along with the node id, hence you
> need to update the file. If the file does not exist upon pgpool
> startup, it will be automatically created.
>

Yes, we remove the status file when we change the configuration of pgpool.
From what we can see in the logs, the backend is set to down after syncing
the status in the cluster. Are backends identified by their index in the
cluster? After node 0 gets its new configuration, its backend 1 will point
to node 2, while on node 2, backend 1 still points to the former node 1. It
seems like this causes the backends to get mixed up and the wrong one is
marked down.
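
To make the renumbering concrete, here is a minimal sketch of the backend
sections in pgpool.conf (the addresses for nodes 0 and 1 are placeholders;
only 172.29.30.3 for node 2 is the real address from the output above):

    # Before the change, identical on all nodes:
    backend_hostname0 = '172.29.30.1'    # node 0 (primary)
    backend_hostname1 = '172.29.30.2'    # node 1 (standby, shut down)
    backend_hostname2 = '172.29.30.3'    # node 2 (standby)

    # After removing node 1, at first applied on node 0 only
    # (pgpool_status is removed before pgpool is restarted):
    backend_hostname0 = '172.29.30.1'    # node 0 (primary)
    backend_hostname1 = '172.29.30.3'    # node 2, renumbered from backend 2

Until node 2 receives the same configuration, "backend 1" still means the
removed node 1 there, so a DOWN status for backend 1 exchanged between the
two pgpools ends up being applied to different hosts.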


> > My second question: Why does the auto_failback not reattach backend 1
> > when it detects the database is up and streaming?
>
> Maybe because of this?
>
>
> https://www.pgpool.net/docs/44/en/html/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS
>
> > Note: auto_failback may not work, when replication slot is used. There
> > is possibility that the streaming replication is stopped, because
> > failover_command is executed and replication slot is deleted by the
> > command.
>

We do not use replication slots, or at least we do not create them
manually. Moreover, in this scenario no failover is performed: the primary
database runs on node 0 and is never taken offline. It is the standby
database on node 1 that is taken offline. Backend 1 (the backend on node
2), which is marked down, isn't touched either. In the database logs, I
can see that the databases are running and never lost their connection.
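
For completeness, the auto_failback related settings look roughly like
this (a sketch; the interval and check period shown here are just the
defaults, for illustration only):

    auto_failback = on              # reattach a standby that pgpool marked
                                    # down once it is seen streaming again
    auto_failback_interval = 1min   # minimum interval between automatic
                                    # failbacks
    sr_check_period = 10            # streaming replication check, which
                                    # auto_failback relies on to see that
                                    # the standby is streaming

As far as I understand, this should reattach a backend whose pgpool status
is "down" while the backend itself is up and streaming, which is exactly
the state shown in the pcp_node_info output above. We could of course
reattach the node manually with pcp_attach_node, but we would expect
auto_failback to take care of it.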

Best regards,
Emond