[pgpool-hackers: 3296] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Thu Apr 11 08:12:53 JST 2019

I think this has been discussed before:
http://www.sraoss.jp/pipermail/pgpool-hackers/2018-March/002756.html
(The original discussion was on the Japanese local list:
[pgpool-general-jp: 1504]
https://www.pgpool.net/pipermail/pgpool-general-jp/2018-March/001503.html)

and I believe Usama has been working on it.
http://www.sraoss.jp/pipermail/pgpool-hackers/2018-March/002757.html

Usama, any progress on this?

BTW,

> When the communication between master/coordinator pgpool and
> primary PostgreSQL node is down during a short period

I wonder why you don't set appropriate health check retry parameters
to avoid such a temporary communication failure in the first place.
Brain surgery to ignore the error reports from Pgpool-II does not seem
to be a sane choice.
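
As a rough sketch of what I mean, something like the following in
pgpool.conf. The parameter names are the existing health check
settings; the values are only illustrative assumptions and need to be
tuned to how long an outage you want to tolerate:

```
# Illustrative values only (assumptions, not recommendations)
health_check_period = 10        # seconds between health checks
health_check_timeout = 20       # seconds before a single check gives up
health_check_max_retries = 3    # retry a failed check this many times before failover
health_check_retry_delay = 5    # seconds to sleep between retries
```

With retries enabled, a short communication failure is retried first
and a failover request is only raised if all retries fail.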
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Hello, Pgpool developers
> 
> 
> I found that the Pgpool-II watchdog is too strict about duplicate failover
> requests with the allow_multiple_failover_requests_from_node=off setting.
> 
> For example, consider a watchdog cluster with 3 pgpool instances.
> Their backends are PostgreSQL servers using streaming replication.
> 
> When the communication between master/coordinator pgpool and
> primary PostgreSQL node is down during a short period
> (or pgpool makes a false-positive judgement for various reasons),
> the pgpool tries to fail over but cannot get the consensus,
> so it puts the primary node into quarantine status. The quarantine
> cannot be reset automatically. As a result, the service becomes unavailable.
> 
> This case generates logs like the following:
> 
> pid 1234: LOG:  new IPC connection received
> pid 1234: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
> pid 1234: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
> pid 1234: DETAIL:  request ignored
> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
> pid 1234: DETAIL:  failover request noted
> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog
> pid 4321: DETAIL:  watchdog is taking time to build consensus
> 
> Note that this case doesn't involve any communication trouble among
> the Pgpool watchdog nodes.
> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
> reject the health check access from one pgpool node for a short period.
> 
> The documentation doesn't say that duplicate failover requests quarantine
> the node immediately. I think the request should just be ignored.
> 
> A patch file for the head of V3_7_STABLE is attached.
> With this patch, repeated failover requests from a single pgpool still
> do not cause a failover, but Pgpool can recover once the connection
> trouble is gone.
> 
> Does this change cause any problems?
> 
> 
> with best regards,
> TAKATSUKA Haruka <harukat at sraoss.co.jp>