[pgpool-general: 7471] Re: Recovery after a backend has failed

Bo Peng pengbo at sraoss.co.jp
Wed Mar 31 14:47:15 JST 2021


Hi,



On Tue, 30 Mar 2021 16:33:08 +0200
Emond Papegaaij <emond.papegaaij at gmail.com> wrote:

> Hi,
> 
> We are working on a configuration for a cluster that requires
> minimal effort to keep running and is mostly resilient to failures. We
> use streaming replication on PG 12/Pgpool 4.1.4 with the following
> settings:
> 
> master_slave_mode = on
> master_slave_sub_mode = 'stream'
> sr_check_period = 5
> sr_check_database = 'postgres'
> delay_threshold = 0
> 
> Health checks are configured as:
> 
> health_check_period = 5
> health_check_timeout = 20
> health_check_database = ''
> health_check_max_retries = 0
> health_check_retry_delay = 1
> connect_timeout = 10000
> 
> For failover/failback and consensus we use:
> 
> failover_on_backend_error = on
> detach_false_primary = off
> search_primary_node_timeout = 0
> auto_failback = on
> auto_failback_interval = 10
> 
> failover_when_quorum_exists = on
> failover_require_consensus = on
> allow_multiple_failover_requests_from_node = off
> enable_consensus_with_half_votes = off
> 
> With this setup we get a fairly reliable failover when a backend node
> is lost. However, when connectivity to that node is restored, it
> sometimes does not rejoin the cluster. Using pcp_node_info to get the
> status of the node, we get quite inconsistent results:
> 
> Hostname : 172.29.30.1
> Port : 5432
> Status : 3
> Weight : 0.250000
> Status Name : down
> Role : standby
> Replication Delay : 0
> Replication State : streaming
> Replication Sync State : async
> Last Status Change : 2021-03-29 16:02:09
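> 
> (For reference, this is the verbose output format of pcp_node_info; a
> sketch of the invocation, assuming the PCP interface on port 9898 and
> node id 0, with host and user as placeholders:
> 
>     pcp_node_info -h localhost -p 9898 -U pgpool -n 0 --verbose
> )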
> 
> So pgpool detects that the node is streaming/async with 0 delay, but
> still reports it as down. I would expect this node to be re-attached
> automatically because auto_failback = on. In this situation, we
> tried restarting the pgpool nodes one by one, but the status = down
> persisted in the cluster. Only when we stopped all pgpool nodes
> simultaneously were they able to recover. Is this behavior expected,
> or do we need to change something in our configuration for pgpool to
> recover from a situation like this?

If "auto_failback" is turned on, a DOWN node with "Replication State : streaming"
should be re-attached automatically.
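
If it is not re-attached, the node can normally be brought back manually
through the PCP interface. A minimal sketch, assuming PCP listens on
port 9898 and the down node has id 0 (adjust host, user and node id to
your environment):

    pcp_attach_node -h localhost -p 9898 -U pgpool -n 0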

To figure out the cause, could you share the pgpool log?
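
If the normal log is not detailed enough, you can also run pgpool in the
foreground with debug output enabled, for example:

    pgpool -n -d > /tmp/pgpool.log 2>&1

(The log file path here is just an example.)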

> Best regards,
> Emond Papegaaij
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan

