[pgpool-hackers: 3297] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Mon Apr 15 23:14:54 JST 2019

Hi  TAKATSUKA Haruka,

Thanks for the patch. However, your patch effectively disables node
quarantine, which doesn't seem to be the right approach.
A backend node that was quarantined because of the absence of quorum
and/or consensus is already unreachable
from the Pgpool-II node. If we did not mark it as quarantined, it could
still be selected as the load-balance node (if it was a secondary) or be
considered available when it is not.

In my opinion, the right way to tackle the issue is to keep setting the
quarantine state as is done currently, but
also to keep the health check running on quarantined nodes, so that as soon as
connectivity to the
quarantined node resumes, it automatically becomes part of the cluster again.
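
To make the idea concrete, here is a rough sketch of the intended behaviour
as a toy model. It is not the Pgpool-II implementation and not the attached
patch; every name in it (NodeStatus, BackendNode, run_health_check_cycle,
load_balance_candidates, and the is_reachable / has_consensus callbacks) is
hypothetical and exists only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class NodeStatus(Enum):
    UP = "up"
    QUARANTINE = "quarantine"   # unreachable, but no consensus for failover
    DOWN = "down"               # detached by a real failover


@dataclass
class BackendNode:
    node_id: int
    status: NodeStatus = NodeStatus.UP


def run_health_check_cycle(
    nodes: Dict[int, BackendNode],
    is_reachable: Callable[[int], bool],    # hypothetical health-check probe
    has_consensus: Callable[[int], bool],   # hypothetical watchdog vote result
) -> None:
    """One health-check pass. Quarantined nodes are probed too."""
    for node in nodes.values():
        if node.status is NodeStatus.DOWN:
            # Assumption: a node detached by failover is not probed here
            # and does not come back on its own in this model.
            continue
        reachable = is_reachable(node.node_id)
        if node.status is NodeStatus.UP and not reachable:
            if has_consensus(node.node_id):
                node.status = NodeStatus.DOWN
            else:
                # Still mark the node as quarantined, as is done currently.
                node.status = NodeStatus.QUARANTINE
        elif node.status is NodeStatus.QUARANTINE and reachable:
            # Connectivity resumed: the node rejoins automatically.
            node.status = NodeStatus.UP


def load_balance_candidates(nodes: Dict[int, BackendNode]) -> List[int]:
    """Only UP nodes are eligible; quarantined nodes are never selected."""
    return [n.node_id for n in nodes.values() if n.status is NodeStatus.UP]
```

In this model a node that fails the health check without consensus is
excluded from load balancing while it is quarantined, and it returns to UP
on the first later pass in which the probe succeeds.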

Can you please try out the attached patch to see if the solution works for
your situation?
The patch is generated against the current master branch.

Thanks
Best Regards
Muhammad Usama

On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
wrote:

> Hello, Pgpool developers
>
>
> I found that the Pgpool-II watchdog is too strict about duplicate failover
> requests with the allow_multiple_failover_requests_from_node=off setting.
>
> For example, consider a watchdog cluster with 3 pgpool instances.
> Their backends are PostgreSQL servers using streaming replication.
>
> When the communication between the master/coordinator pgpool and
> the primary PostgreSQL node is down for a short period
> (or pgpool makes a false-positive judgement for various reasons),
> the pgpool tries to fail over but cannot get the consensus,
> so it puts the primary node into quarantine status. That status cannot
> be reset automatically. As a result, the service becomes unavailable.
>
> This case generates logs like the following:
>
> pid 1234: LOG:  new IPC connection received
> pid 1234: LOG:  watchdog received the failover command from local
> pgpool-II on IPC interface
> pid 1234: LOG:  watchdog is processing the failover command
> [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
> pid 1234: DETAIL:  request ignored
> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
> pid 1234: DETAIL:  failover request noted
> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321],
> is changed to quarantine node request by watchdog
> pid 4321: DETAIL:  watchdog is taking time to build consensus
>
> Note that this case doesn't involve any communication trouble among
> the Pgpool watchdog nodes.
> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
> reject the health check access from one pgpool node for a short period.
>
> The documentation doesn't say that duplicate failover requests make the
> node quarantined immediately. I think the request should just be ignored.
>
> A patch file for the head of V3_7_STABLE is attached.
> With this patch, repeated failover requests from a single pgpool
> still disturb failover, but Pgpool can recover once the connection
> trouble is gone.
>
> Does this change cause any problems?
>
>
> with best regards,
> TAKATSUKA Haruka <harukat at sraoss.co.jp>
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: quarantine_fix.diff
Type: application/octet-stream
Size: 2123 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-hackers/attachments/20190415/753261bc/attachment.obj>