[pgpool-general: 8556] Re: Issues taking a node out of a cluster

Tatsuo Ishii ishii at sraoss.co.jp
Tue Jan 24 16:41:23 JST 2023


>> Still I don't see why you can't leave the backend1 as "down" status
>> instead of trying to take out the backend1 (if my understanding is
>> correct, at the same time the pgpool node1 is brought to down status
>> because node1 and backend1 are on the same virtual machine).
>>
>> This way, the node0 and the node2 can access the backend0 and the
>> backend2 without being disturbed by the backend1.
>>
> 
> Under some circumstances the faulty node can cause disruption of service.
> This is especially the case when a link between nodes is unreliable. For
> example, we've seen problems when one of the interlinks is not working
> (i.e. node 1 and node 2 cannot communicate with each other, but can both
> communicate with node 0). Such a scenario can cause various failures.
> Temporarily taking a node out of the cluster allows administrators to
> troubleshoot the issue without having to worry about interfering with
> the service.

The problem description is too vague. Please provide a concrete example
of the problem.

> Another case we've seen is where one of the nodes needs to be migrated to a
> different site. If it is not possible to perform a live migration of the
> VM, the second best alternative is to first add a new node to the cluster
> and then remove the old one. The same steps need to be performed when a VM
> host has had a catastrophic hardware failure and a node is lost
> permanently. Our entire software stack, with the exception of pgpool,
> supports performing these kinds of modifications without introducing any
> downtime.

Pgpool-II consists of many child processes. Each process handles a user
session and multiple backends. There are many places in the child
process code that look something like this:

for (i = 0; i < number_of_backends; i++)
{
	/* do something... */
}

If the backend configuration changes in the middle of such a loop, the
code will produce unpredictable errors. To prevent that, we would need
to treat each loop as a critical section, which requires locking. This
would cause serious performance degradation due to lock contention.
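
To illustrate the cost, here is a minimal sketch. It is not pgpool-II
source: names such as backend_fd and process_query_for_all_backends are
made up for illustration, and a pthread mutex stands in for the
interprocess lock the multi-process pgpool-II would actually need:

#include <pthread.h>

#define MAX_BACKENDS 128

/* In pgpool-II these would live in shared memory; simplified here. */
static int number_of_backends;
static int backend_fd[MAX_BACKENDS];	/* hypothetical per-backend state */
static pthread_mutex_t backend_lock = PTHREAD_MUTEX_INITIALIZER;

static void
process_query_for_all_backends(void)
{
	int	i;

	/*
	 * Without the lock, a concurrent "remove backend" could shrink
	 * number_of_backends or invalidate backend_fd[i] while the loop
	 * is running. With the lock, every session serializes on every
	 * query, which is the performance problem described above.
	 */
	pthread_mutex_lock(&backend_lock);
	for (i = 0; i < number_of_backends; i++)
	{
		/* do something with backend_fd[i]... */
	}
	pthread_mutex_unlock(&backend_lock);
}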

Note that adding a new backend is possible without restarting pgpool,
i.e. by reloading the configuration file. For example, if you have 3
backends (0, 1 and 2) and want to remove backend 1:

1) add backend 3 to each pgpool.conf (a configuration sketch follows
   this list). If you configure per-node health check parameters
   (i.e. health_check_user0 etc.), you need to add the health check
   parameters for backend 3 as well.

2) stop all standby watchdog pgpool instances (a possible command
   sequence for steps 2-6 is sketched after this list).

3) create the new backend on the new vm and start it (streaming from
   the existing primary backend). This will become backend 3.

4) detach backend 1 by using pcp_detach_node.

5) execute "pgpool reload" on the leader watchdog node. This will add
   the new backend 3. Its status should be "down" at this point.

6) attach backend 3 by using pcp_attach_node.

7) optionally, you can remove the backend 1 configuration
   parameters. The "status" column of show pool_nodes will be "unused"
   after restarting pgpool.
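
As a sketch of step 1, the pgpool.conf fragment for the new backend 3
could look like the following; the host name, port, directory and flag
values are placeholders to be adapted to your setup:

backend_hostname3 = 'newvm.example.com'
backend_port3 = 5432
backend_weight3 = 1
backend_data_directory3 = '/var/lib/pgsql/data'
backend_flag3 = 'ALLOW_TO_FAILOVER'
health_check_user3 = 'pgpool'	# only if you use per-node health checks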
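
And a possible command sequence for steps 2 through 6, assuming the
default pcp port 9898, a PCP user named "pcp_user", and placeholder
host names; pg_basebackup is just one way to create the new standby:

# 2) on each standby watchdog node, stop pgpool
pgpool -m fast stop

# 3) on the new vm, build the standby from the existing primary,
#    then start it so it streams from the primary
pg_basebackup -h primary.example.com -D /var/lib/pgsql/data -X stream -R
pg_ctl -D /var/lib/pgsql/data start

# 4) detach backend 1 via the leader pgpool's pcp interface
pcp_detach_node -h leader.example.com -p 9898 -U pcp_user 1

# 5) on the leader watchdog node, reload the configuration
pgpool reload

# 6) attach the new backend 3
pcp_attach_node -h leader.example.com -p 9898 -U pcp_user 3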

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

