[pgpool-hackers: 3302] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Tue Apr 16 16:49:09 JST 2019

> On Tue, Apr 16, 2019 at 12:14 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > On Tue, Apr 16, 2019 at 7:55 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> Hi Usama,
>> >>
>> >> > Hi  TAKATSUKA Haruka,
>> >> >
>> >> > Thanks for the patch, but it effectively disables node quarantine,
>> >> > which doesn't seem the right way.
>> >> > A backend node that was quarantined because of the absence of quorum
>> >> > and/or consensus is already unreachable from the Pgpool-II node. If we
>> >> > don't mark it as quarantined, it could be selected as the load-balance
>> >> > node (in case the node was a secondary) or be considered available
>> >> > when it is not, and we don't want either.
>> >> >
>> >> > In my opinion the right way to tackle the issue is to keep setting the
>> >> > quarantine state as is done currently, but also keep the health check
>> >> > running on quarantined nodes, so that as soon as connectivity to the
>> >> > quarantined node resumes, it becomes part of the cluster again
>> >> > automatically.
>> >>
>> >> What if the connection failure between the primary PostgreSQL and one
>> >> of the Pgpool-II servers is permanent? Doesn't health checking continue
>> >> forever?
>> >>
>> >
>> > Yes, but only for the quarantined PostgreSQL nodes, and I don't think that
>> > is a problem. Conceptually, quarantined nodes are not failed nodes (they
>> > are just unusable at that moment), and taking a node out of quarantine
>> > shouldn't require manual intervention. So I think it's correct to
>> > continue health checking on quarantined nodes.
>> >
>> > Do you see an issue with the approach?
>>
>> Yes. Think about the case when the PostgreSQL node is the primary. Users
>> cannot issue write queries while health checking is retrying. The network
>> failure could persist for days, and the whole database cluster is unusable
>> for that period.
>>
> 
> Yes, that's true. But not allowing the node to go into the quarantine state
> will still not solve it, because the primary would be unavailable anyway,
> whether or not we set the quarantine state. So the whole idea of this patch
> is to recover from the quarantine state automatically as soon as
> connectivity resumes.
> Similarly, failover of that node is again not an option if the user wants
> failover to happen only when the network consensus exists; otherwise he
> should just disable failover_require_consensus.

The question is, why can't we automatically recover from the detached
state as well as from the quarantine state?

>> BTW,
>>
>> > > When the communication between master/coordinator pgpool and
>> > > primary PostgreSQL node is down during a short period
>> >
>> > I wonder why you don't set appropriate health check retry parameters
>> > to avoid such a temporary communication failure in the first place. A
>> > brain surgery to ignore the error reports from Pgpool-II does not seem
>> > to be a sane choice.
>>
>> The original reporter didn't answer my question. I think it is likely
>> a problem of misconfiguration (should use longer health check retry).
>>
>> In summary, I think for a shorter communication failure, just
>> increasing the health check parameters is enough. However, for a longer
>> communication failure, the watchdog node should decline the
>> role.
>>
> 
> I am sorry, I didn't quite get what you mean here.
> Do you mean that the Pgpool-II node that has the primary node in quarantine
> state should resign as the master/coordinator
> Pgpool-II node (if it was the master/coordinator) in that case?

Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
keeping the quarantine state is fine because users' queries can still be
processed.
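
To make the rule concrete, here is a rough sketch in Python. It is not
Pgpool-II code: the names QuarantineView, Action and decide are made up for
illustration, and retry_window_seconds simply stands for whatever outage
length the health check retry settings are able to absorb. The behavior for
a Pgpool-II node that is not the master/coordinator is my assumption; the
discussion above only covers the master/coordinator case.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    KEEP_RETRYING = "keep retrying the health check"
    KEEP_QUARANTINE = "keep the node quarantined"
    RESIGN_COORDINATOR = "resign as master/coordinator"


@dataclass
class QuarantineView:
    """Hypothetical view of one unreachable backend from one Pgpool-II node."""
    node_is_primary: bool        # is the unreachable PostgreSQL node the primary?
    is_coordinator: bool         # is this Pgpool-II the master/coordinator?
    outage_seconds: float        # how long the backend has been unreachable
    retry_window_seconds: float  # outage the health check retries can absorb


def decide(view: QuarantineView) -> Action:
    # Short failure: longer health check retry settings are enough.
    if view.outage_seconds <= view.retry_window_seconds:
        return Action.KEEP_RETRYING
    # Long failure on the primary: writes are blocked, so the
    # master/coordinator should give up its role.
    if view.node_is_primary and view.is_coordinator:
        return Action.RESIGN_COORDINATOR
    # Standby (or non-coordinator, by assumption): quarantine is harmless.
    return Action.KEEP_QUARANTINE
```

For example, decide(QuarantineView(True, True, 3600, 60)) yields
Action.RESIGN_COORDINATOR, while the same outage on a standby yields
Action.KEEP_QUARANTINE.
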

> Thanks
> Best Regards
> Muhammad Usama
> 
> 
>> >> > Can you please try out the attached patch to see if the solution works
>> >> > for the situation?
>> >> > The patch is generated against the current master branch.
>> >> >
>> >> > Thanks
>> >> > Best Regards
>> >> > Muhammad Usama
>> >> >
>> >> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <harukat at sraoss.co.jp> wrote:
>> >> >
>> >> >> Hello, Pgpool developers
>> >> >>
>> >> >>
>> >> >> I found that the Pgpool-II watchdog is too strict about duplicate
>> >> >> failover requests with the
>> >> >> allow_multiple_failover_requests_from_node=off setting.
>> >> >>
>> >> >> For example, consider a watchdog cluster with 3 pgpool instances.
>> >> >> Their backends are PostgreSQL servers using streaming replication.
>> >> >>
>> >> >> When the communication between master/coordinator pgpool and
>> >> >> primary PostgreSQL node is down during a short period
>> >> >> (or pgpool makes a false-positive judgement for whatever reason),
>> >> >> the pgpool tries to fail over but cannot get the consensus,
>> >> >> so it puts the primary node into quarantine status. This status cannot
>> >> >> be reset automatically. As a result, the service becomes unavailable.
>> >> >>
>> >> >> This case generates logs like the following:
>> >> >>
>> >> >> pid 1234: LOG:  new IPC connection received
>> >> >> pid 1234: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
>> >> >> pid 1234: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
>> >> >> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
>> >> >> pid 1234: DETAIL:  request ignored
>> >> >> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
>> >> >> pid 1234: DETAIL:  failover request noted
>> >> >> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog
>> >> >> pid 4321: DETAIL:  watchdog is taking time to build consensus
>> >> >>
>> >> >> Note that this case doesn't involve any communication trouble among
>> >> >> the Pgpool watchdog nodes.
>> >> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
>> >> >> reject the health check access from one pgpool node for a short period.
>> >> >>
>> >> >> The documentation doesn't say that duplicate failover requests make the
>> >> >> node quarantined immediately. I think it should just ignore the request.
>> >> >>
>> >> >> A patch file for the head of V3_7_STABLE is attached.
>> >> >> Pgpool with this patch still blocks a failover caused by a single
>> >> >> pgpool's repeated failover requests. But it can recover when the
>> >> >> connection trouble is gone.
>> >> >>
>> >> >> Does this change have any problem?
>> >> >>
>> >> >>
>> >> >> with best regards,
>> >> >> TAKATSUKA Haruka <harukat at sraoss.co.jp>
>> >> >>
>> >>
>>