[pgpool-hackers: 3305] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Muhammad Usama m.usama at gmail.com
Tue Apr 16 17:33:37 JST 2019


On Tue, Apr 16, 2019 at 1:27 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> >> Question is, why can't we automatically recover from detached state as
> >> well as quarantine state?
> >>
> >
> > Well, ideally we should also recover automatically from the detached
> > state, but the problem is that when a node is detached, specifically
> > the primary node, the failover procedure promotes another standby to
> > make it the new master, and follow_master adjusts the standby nodes to
> > point to the new master. So even when the old primary that was
> > detached becomes reachable again, attaching it automatically could
> > lead to a variety of problems, including split-brain.
> > I think it is possible to implement a mechanism that verifies the
> > state of the detached PostgreSQL node when it becomes reachable again
> > and, after taking the appropriate actions, attaches it back
> > automatically, but currently we don't have anything like that in
> > Pgpool. So instead we rely on user intervention to do the re-attach
> > using pcp_attach_node or the online recovery mechanisms.
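
(For reference, re-attaching a recovered node by hand is typically done
with the pcp_attach_node command; the host, PCP port, user and node id
below are placeholders, not values from this thread:)

    # Re-attach backend node 0 once it has been verified to be safe
    pcp_attach_node -h pgpool-host -p 9898 -U pcp_user -n 0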
> >
> > Now if we look at quarantined nodes, they are just as good as alive
> > nodes (only unreachable by pgpool at the moment). When a node is
> > quarantined, Pgpool-II never executes any failover and/or
> > follow_master commands and does not interfere with the PostgreSQL
> > backend in any way that would alter its timeline or recovery state.
> > So when a quarantined node becomes reachable again, it is safe to
> > automatically connect it back to Pgpool-II.
>
> Ok, that makes sense.
>
> >> >> BTW,
> >> >>
> >> >> > > When the communication between master/coordinator pgpool and
> >> >> > > primary PostgreSQL node is down during a short period
> >> >> >
> >> >> > I wonder why you don't set appropriate health check retry
> >> >> > parameters to avoid such a temporary communication failure in the
> >> >> > first place. A brain surgery to ignore the error reports from
> >> >> > Pgpool-II does not seem to be a sane choice.
> >> >>
> >> >> The original reporter didn't answer my question. I think it is likely
> >> >> a problem of misconfiguration (they should use a longer health check
> >> >> retry).
> >> >>
> >> >> In summary, I think that for a short-period communication failure,
> >> >> just increasing the health check parameters is enough. However, for a
> >> >> longer-period communication failure, the watchdog node should decline
> >> >> the role.
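
(For reference, the retry behaviour discussed here is controlled by the
health check parameters in pgpool.conf; the values below are only
illustrative, not a recommendation from this thread:)

    # Tolerate short outages before requesting a failover
    health_check_period      = 10   # seconds between health checks
    health_check_timeout     = 20   # timeout for each health check attempt
    health_check_max_retries = 5    # retries before the node is reported down
    health_check_retry_delay = 5    # seconds to wait between retries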
> >> >>
> >> >
> >> > I am sorry, I didn't totally get what you mean here.
> >> > Do you mean that the Pgpool-II node that has the primary node in the
> >> > quarantine state should resign as the master/coordinator Pgpool-II
> >> > node (if it was the master/coordinator) in that case?
> >>
> >> Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
> >> keeping the quarantine state is fine because users' queries can still
> >> be processed.
> >>
> >
> > Yes, that makes total sense. I will make that change as a separate patch.
>
> Thanks. However, this will change existing behavior. Probably we should
> make the change against the master branch only?
>

Probably yes, because the fix I have in mind for this involves a
configurable timeout parameter to make the master pgpool resign. Let me
come up with the patch first, and then we can work out which part of it
needs to be back-ported.

And regarding the patch I shared upthread to continue the health check on
quarantined nodes, do you think we should also back-patch it to the older
versions as well?

Thanks
Best Regards
Muhammad Usama


>
> > Thanks
> > Best Regards
> > Muhammad Usama
> >
> >
> >> > Thanks
> >> > Best Regards
> >> > Muhammad Usama
> >> >
> >> >
> >> >> >> > Can you please try out the attached patch, to see if the solution
> >> >> >> > works for the situation?
> >> >> >> > The patch is generated against the current master branch.
> >> >> >> >
> >> >> >> > Thanks
> >> >> >> > Best Regards
> >> >> >> > Muhammad Usama
> >> >> >> >
> >> >> >> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka
> >> >> >> > <harukat at sraoss.co.jp> wrote:
> >> >> >> >
> >> >> >> >> Hello, Pgpool developers
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> I found that the Pgpool-II watchdog is too strict about duplicate
> >> >> >> >> failover requests with the
> >> >> >> >> allow_multiple_failover_requests_from_node=off setting.
> >> >> >> >>
> >> >> >> >> For example, take a watchdog cluster with 3 pgpool instances.
> >> >> >> >> Their backends are PostgreSQL servers using streaming replication.
> >> >> >> >>
> >> >> >> >> When the communication between the master/coordinator pgpool and
> >> >> >> >> the primary PostgreSQL node is down for a short period (or pgpool
> >> >> >> >> makes a false-positive judgement for various reasons), the pgpool
> >> >> >> >> tries to fail over but cannot get the consensus, so it puts the
> >> >> >> >> primary node into quarantine status. This status cannot be reset
> >> >> >> >> automatically. As a result, the service becomes unavailable.
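
(For reference, the consensus behaviour described above corresponds to
watchdog settings along these lines in pgpool.conf; the values shown are
the ones implied by the report, given here only for context:)

    failover_when_quorum_exists                = on
    failover_require_consensus                 = on
    allow_multiple_failover_requests_from_node = off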
> >> >> >> >>
> >> >> >> >> This case generates logs like the following:
> >> >> >> >>
> >> >> >> >> pid 1234: LOG:  new IPC connection received
> >> >> >> >> pid 1234: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
> >> >> >> >> pid 1234: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> >> >> >> >> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
> >> >> >> >> pid 1234: DETAIL:  request ignored
> >> >> >> >> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
> >> >> >> >> pid 1234: DETAIL:  failover request noted
> >> >> >> >> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog
> >> >> >> >> pid 4321: DETAIL:  watchdog is taking time to build consensus
> >> >> >> >>
> >> >> >> >> Note that this case doesn't have any communication trouble among
> >> >> >> >> the Pgpool watchdog nodes.
> >> >> >> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
> >> >> >> >> reject the health check access from one pgpool node for a short
> >> >> >> >> period.
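
(A minimal sketch of such a pg_hba.conf change, assuming the health check
connects as user "pgpool" and the affected pgpool node is at 192.168.1.11;
the user name and address are placeholders, not values from the report:)

    # Temporarily reject health check connections from one pgpool node
    host    all    pgpool    192.168.1.11/32    reject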
> >> >> >> >>
> >> >> >> >> The documentation doesn't say that duplicate failover requests
> >> >> >> >> put the node into quarantine immediately. I think it should just
> >> >> >> >> ignore the request.
> >> >> >> >>
> >> >> >> >> A patch file against the head of V3_7_STABLE is attached.
> >> >> >> >> Pgpool with this patch still prevents failover from being driven
> >> >> >> >> by a single pgpool's repeated failover requests, but it can
> >> >> >> >> recover when the connection trouble is gone.
> >> >> >> >>
> >> >> >> >> Does this change have any problems?
> >> >> >> >>
> >> >> >> >>
> >> >> >> >> with best regards,
> >> >> >> >> TAKATSUKA Haruka <harukat at sraoss.co.jp>
> >> >> >> >> _______________________________________________
> >> >> >> >> pgpool-hackers mailing list
> >> >> >> >> pgpool-hackers at pgpool.net
> >> >> >> >> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
> >> >> >> >>
> >> >> >>
> >> >>
> >>
>