[pgpool-general: 8553] Re: Issues taking a node out of a cluster

Emond Papegaaij emond.papegaaij at gmail.com
Fri Jan 20 23:42:35 JST 2023


On Fri, Jan 20, 2023 at 2:37 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> >> Can you elaborate on the motivation behind this? It seems you just
> >> want to stop PostgreSQL node 1 and then rename PostgreSQL node 2 to
> >> node 1. I don't see any benefit from this for admins/users except
> >> removing node 1 entirely from the configuration file.
> >>
> >
> > We use pgpool as an integral part of our solution to provide high
> > availability to our customers. We recommend that our customers run 3
> > instances on 3 sites for the highest availability. Every instance of
> > our application is a virtual machine running the application, pgpool,
> > postgresql and some other components. Sometimes, things break. For
> > example, we've seen a case where connectivity to one of the sites was
> > bad, causing intermittent failures. For this, we offer the option to
> > temporarily disable one of the nodes in the cluster. This takes the
> > node out of the cluster, preventing the other nodes from trying to
> > communicate with it over the unreliable connection. When the issue is
> > fixed, the node can be re-enabled and put back into the cluster.
> >
> >
> > In the above scenario, the changes to the topology of the cluster are
> > made on a live environment and downtime should be kept to a minimum.
> > Two of the three nodes are healthy and capable of handling requests.
> > They will, however, need to be reconfigured to (temporarily) forget
> > about the faulty node. We can perform the reconfiguration on one node
> > at a time, taking it out of the load balancer during this process,
> > thus avoiding any downtime. If, however, we need to restart pgpool on
> > all nodes simultaneously, rather than one at a time, that would
> > interrupt service.
> >
> > Initially, we implemented this feature keeping the indexes of the
> > backends in place. So node 0 would only have a backend0 and a
> > backend2, but that didn't work. I don't know exactly what the problem
> > was with that setup, as this was quite some time ago (is such a
> > configuration even allowed in pgpool?). Because that setup did not
> > work, we switched to reindexing the backends, making sure we always
> > start at 0 and do not skip any numbers. This however confuses pgpool
> > during the reconfiguration phase.
> >
> > I hope this makes our situation clear.
>
> Still, I don't see why you can't leave backend1 in "down" status
> instead of trying to take backend1 out (if my understanding is
> correct, pgpool node1 is brought to down status at the same time,
> because node1 and backend1 are on the same virtual machine).
>
> This way, node0 and node2 can access backend0 and backend2 without
> being disturbed by backend1.
>

Under some circumstances, the faulty node can disrupt the service. This
is especially the case when a link between nodes is unreliable. For
example, we've seen problems when one of the interlinks is not working
(i.e. node 1 and node 2 cannot communicate with each other, but both can
still communicate with node 0). Such a scenario can cause a variety of
failures. Temporarily taking a node out of the cluster allows the
administrators to troubleshoot the issue without having to worry about
interfering with the service.
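
To be clear about the alternative: on our side, leaving backend1 in
"down" status would boil down to something like the sketch below (the
PCP host, port and user are placeholders for our actual setup). This
only marks the backend as down; node 1 itself remains part of the
configured cluster, which is exactly what we want to avoid while the
link is unreliable.

  # Mark backend node 1 as "down" from one of the healthy nodes
  # (host/port/user are placeholders for our PCP configuration)
  pcp_detach_node -h localhost -p 9898 -U pgpool -n 1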

Another case we've seen is where one of the nodes needs to be migrated
to a different site. If it is not possible to perform a live migration
of the VM, the second-best alternative is to first add a new node to the
cluster and then remove the old one. The same steps need to be performed
when a VM host has had a catastrophic hardware failure and a node is
lost permanently. Our entire software stack, with the exception of
pgpool, supports performing these kinds of modifications without
introducing any downtime.
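
For completeness, the configuration we first tried (keeping the
original backend indexes in place, as mentioned above) looked roughly
like the sketch below on node 0; hostnames and ports are placeholders:

  # pgpool.conf on node 0 with the disabled node temporarily removed,
  # keeping the original backend indexes (placeholders for our hosts)
  backend_hostname0 = 'node0.example.com'
  backend_port0 = 5432
  backend_weight0 = 1

  # backend1 (the disabled node) is omitted entirely

  backend_hostname2 = 'node2.example.com'
  backend_port2 = 5432
  backend_weight2 = 1

The reindexed variant that we use now renumbers the remaining backends
so that they start at 0 with no gaps, which is where pgpool gets
confused during the rolling reconfiguration.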

Best regards,
Emond