[pgpool-hackers: 3295] duplicate failover request over allow_multiple_failover_requests_from_node=off
harukat at sraoss.co.jp
Wed Apr 10 18:04:24 JST 2019
Hello, Pgpool developers
I found that the Pgpool-II watchdog is too strict about duplicate failover
requests when allow_multiple_failover_requests_from_node=off is set.
For example, consider a watchdog cluster with 3 pgpool instances.
Their backends are PostgreSQL servers using streaming replication.
When communication between the master/coordinator pgpool and the
primary PostgreSQL node goes down for a short period
(or that pgpool makes a false-positive judgement for any other reason),
the pgpool tries to fail over but cannot get consensus,
so it puts the primary node into quarantine status. The quarantine is
not reset automatically. As a result, the service becomes unavailable.
This case generates logs like the following:
pid 1234: LOG: new IPC connection received
pid 1234: LOG: watchdog received the failover command from local pgpool-II on IPC interface
pid 1234: LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
pid 1234: LOG: Duplicate failover request from "pg1:5432 Linux pg1" node
pid 1234: DETAIL: request ignored
pid 1234: LOG: failover requires the majority vote, waiting for consensus
pid 1234: DETAIL: failover request noted
pid 4321: LOG: degenerate backend request for 1 node(s) from pid , is changed to quarantine node request by watchdog
pid 4321: DETAIL: watchdog is taking time to build consensus
Note that in this case there is no communication trouble among
the Pgpool watchdog nodes.
You can reproduce it by changing one PostgreSQL server's pg_hba.conf to
reject health check access from one pgpool node for a short period.
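As a rough illustration of that reproduction step, a pg_hba.conf entry like the one below should be enough. The address 192.0.2.11 is only a placeholder for the pgpool node to be blocked, and matching on all databases and users is an assumption made to keep the example short; it is not taken from my actual setup.

```
# pg_hba.conf on one PostgreSQL backend
# 192.0.2.11 is a placeholder for the address of one pgpool node.
# TYPE  DATABASE  USER  ADDRESS          METHOD
host    all       all   192.0.2.11/32    reject
```

PostgreSQL uses the first matching entry, so the line has to come before the existing entries that allow this host. Reload the server (for example with pg_ctl reload) to apply it, then remove the line and reload again to end the simulated outage.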
The documentation doesn't say that duplicate failover requests put the node
into quarantine immediately. I think the request should just be ignored.
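To make the difference concrete, here is a toy model in Python. It is not the watchdog code and it is not the attached patch; it only restates the behavior described in this mail. The function name decide, the node names, and the ignore_duplicates flag are made up for illustration.

```python
def decide(requests, total_nodes=3, ignore_duplicates=False):
    """Toy model of the coordinator's outcome for one backend.

    requests: watchdog node names, in arrival order, each asking to
    degenerate the same backend. All names here are hypothetical.
    """
    majority = total_nodes // 2 + 1
    voters = set()
    for node in requests:
        if node in voters:
            if ignore_duplicates:
                # Suggested behavior: drop the duplicate, keep waiting.
                continue
            # Behavior reported in this mail: the duplicate request
            # ends with the backend in quarantine.
            return "quarantine"
        voters.add(node)
        if len(voters) >= majority:
            return "failover"
    return "waiting for consensus"


# One pgpool node repeating its request, no other node agreeing:
print(decide(["pgpool0", "pgpool0"]))
print(decide(["pgpool0", "pgpool0"], ignore_duplicates=True))
```

In this model a single node can never force a failover by repeating itself in either mode; the only difference is whether the repeat is harmless or ends in quarantine.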
A patch file for head of V3_7_STABLE is attached.
With this patch, Pgpool still does not fail over on repeated failover
requests from a single pgpool node, but it can recover once the
connection trouble is gone.
Does this change cause any problems?
with best regards,
TAKATSUKA Haruka <harukat at sraoss.co.jp>