[pgpool-hackers: 3300] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Tue Apr 16 16:15:01 JST 2019

> On Tue, Apr 16, 2019 at 7:55 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Hi Usama,
>>
>> > Hi  TAKATSUKA Haruka,
>> >
>> > Thanks for the patch, but your patch effectively disables the node
>> > quarantine, which doesn't seem the right way.
>> > A backend node that was quarantined because of the absence of quorum
>> > and/or consensus is already unreachable
>> > from the Pgpool-II node, and we don't want to select it as the load-balance
>> > node (in case the node was a secondary) or consider it
>> > as available when it is not, which is what happens if it is not marked as quarantined.
>> >
>> > In my opinion the right way to tackle the issue is to keep setting the
>> > quarantine state as is done currently, but
>> > also keep the health check working on quarantined nodes, so that as soon as
>> > the connectivity to the
>> > quarantined node resumes, it becomes part of the cluster again automatically.
>>
>> What if the connection failure between the primary PostgreSQL and one
>> of the Pgpool-II servers is permanent? Doesn't health checking continue
>> forever?
>>
> 
> Yes, only for the quarantined PostgreSQL nodes. But I don't think there is
> a problem
> in that. Conceptually, the quarantined nodes are not failed nodes (they are
> just unusable at that moment),
> and taking a node out of quarantine shouldn't require manual
> intervention. So I think it's the correct
> way to continue the health checking on quarantined nodes.
> 
> Do you see an issue with the approach?

Yes. Think about the case when the PostgreSQL node is the primary. Users
cannot issue write queries while the health check keeps retrying. The
network failure could persist for days, and the whole database cluster
would be unusable during that period.

BTW,

> > When the communication between master/coordinator pgpool and
> > primary PostgreSQL node is down during a short period
>
> I wonder why you don't set appropriate health check retry parameters
> to avoid such a temporary communication failure in the first place. A
> brain surgery to ignore the error reports from Pgpool-II does not seem
> to be a sane choice.

The original reporter didn't answer my question. I think it is likely
a misconfiguration problem (longer health check retry settings should
be used).

In summary, I think that for a short communication failure, just
increasing the health check retry parameters is enough. For a longer
communication failure, however, the watchdog node should decline the
role.
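
To make the first point concrete, here is a rough sketch (not Pgpool-II
code) of how the retry parameters translate into the length of outage
that is ridden out before a failover request is raised. It assumes a
simplified model: one initial attempt plus health_check_max_retries
retries, a sleep of health_check_retry_delay seconds between attempts,
and each attempt failing either immediately or only after
health_check_timeout seconds. The function name is hypothetical, and the
real timing also depends on things like connect_timeout, so treat the
result as an estimate only.

```python
# Rough model, NOT Pgpool-II source code: estimate how long a backend can be
# unreachable before the health check gives up and requests failover.
# Assumption: one initial attempt plus `max_retries` retries, a fixed
# `retry_delay` sleep between attempts, and each attempt failing either
# immediately (lower bound) or only after `timeout` seconds (upper bound).

def tolerated_outage_seconds(max_retries: int, retry_delay: int, timeout: int) -> tuple[int, int]:
    """Return (lower, upper) bounds in seconds under the model above."""
    if min(max_retries, retry_delay, timeout) < 0:
        raise ValueError("parameters must be non-negative")
    attempts = max_retries + 1  # the initial check plus every retry
    lower = max_retries * retry_delay  # every attempt fails instantly
    upper = attempts * timeout + max_retries * retry_delay  # every attempt times out
    return lower, upper


if __name__ == "__main__":
    # Hypothetical values standing in for health_check_max_retries,
    # health_check_retry_delay and health_check_timeout.
    print(tolerated_outage_seconds(max_retries=3, retry_delay=10, timeout=20))
```

If the expected length of a temporary network problem is longer than the
lower bound, the retry settings are probably too short for that
environment.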

>> > Can you please try out the attached patch, to see if the solution works for
>> > the situation?
>> > The patch is generated against the current master branch.
>> >
>> > Thanks
>> > Best Regards
>> > Muhammad Usama
>> >
>> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
>> > wrote:
>> >
>> >> Hello, Pgpool developers
>> >>
>> >>
>> >> I found that the Pgpool-II watchdog is too strict about duplicate failover
>> >> requests with the allow_multiple_failover_requests_from_node=off setting.
>> >>
>> >> For example, consider a watchdog cluster with 3 pgpool instances.
>> >> Their backends are PostgreSQL servers using streaming replication.
>> >>
>> >> When the communication between master/coordinator pgpool and
>> >> primary PostgreSQL node is down during a short period
>> >> (or pgpool makes a false-positive judgement for some other reason),
>> >> the pgpool tries to fail over but cannot get consensus,
>> >> so it puts the primary node into quarantine status. That status cannot
>> >> be reset automatically. As a result, the service becomes unavailable.
>> >>
>> >> This case generates logs like the following:
>> >>
>> >> pid 1234: LOG:  new IPC connection received
>> >> pid 1234: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
>> >> pid 1234: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
>> >> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
>> >> pid 1234: DETAIL:  request ignored
>> >> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
>> >> pid 1234: DETAIL:  failover request noted
>> >> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog
>> >> pid 4321: DETAIL:  watchdog is taking time to build consensus
>> >>
>> >> Note that this case doesn't involve any communication trouble among
>> >> the Pgpool watchdog nodes.
>> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
>> >> reject the health check access from one pgpool node for a short period.
>> >>
>> >> The documentation doesn't say that duplicate failover requests make the
>> >> node quarantined immediately. I think pgpool should just ignore the request.
>> >>
>> >> A patch file for the head of V3_7_STABLE is attached.
>> >> Pgpool with this patch still prevents a failover driven by a single
>> >> pgpool's repeated failover requests, but it can recover once the
>> >> connection trouble is gone.
>> >>
>> >> Does this change have any problem?
>> >>
>> >>
>> >> with best regards,
>> >> TAKATSUKA Haruka <harukat at sraoss.co.jp>