[pgpool-hackers: 3303] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Tue Apr 16 17:22:15 JST 2019

On Tue, Apr 16, 2019 at 12:49 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > On Tue, Apr 16, 2019 at 12:14 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >
> >> > On Tue, Apr 16, 2019 at 7:55 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> >> >
> >> >> Hi Usama,
> >> >>
> >> >> > Hi  TAKATSUKA Haruka,
> >> >> >
> >> >> > Thanks for the patch, but your patch effectively disables node
> >> >> > quarantine, which doesn't seem the right way.
> >> >> > A backend node that was quarantined because of the absence of quorum
> >> >> > and/or consensus is already unreachable from the Pgpool-II node, and
> >> >> > we don't want to select it as the load-balance node (in case the node
> >> >> > was a secondary) or consider it available when it is not, which is
> >> >> > what happens if we don't mark it as quarantined.
> >> >> >
> >> >> > In my opinion the right way to tackle the issue is to keep setting the
> >> >> > quarantine state as is done currently, but also keep the health check
> >> >> > running on quarantined nodes, so that as soon as connectivity to the
> >> >> > quarantined node resumes, it becomes part of the cluster automatically.
> >> >>
> >> >> What if the connection failure between the primary PostgreSQL and one
> >> >> of Pgpool-II servers is permanent? Doesn't health checking continue
> >> >> forever?
> >> >>
> >> >
> >> > Yes, only for the quarantined PostgreSQL nodes. But I don't think there
> >> > is a problem in that. Conceptually the quarantined nodes are not failed
> >> > nodes (they are just unusable at that moment), and taking a node out of
> >> > quarantine shouldn't require manual intervention. So I think it is
> >> > correct to continue health checking on quarantined nodes.
> >> >
> >> > Do you see an issue with the approach ?
> >>
> >> Yes. Think about the case when the PostgreSQL node is primary. Users
> >> cannot issue write queries while the health check is retrying. The
> >> network failure could persist for days, and the whole database cluster
> >> is unusable during that period.
> >>
> >
> > Yes, that's true, but not allowing the node to go into the quarantine
> > state will still not solve it, because the primary would be unavailable
> > anyway, whether or not we set the quarantine state. So the whole idea of
> > this patch is to recover from the quarantine state automatically as soon
> > as connectivity resumes.
> > Similarly, failover of that node is again not an option if the user wants
> > failover only when the network consensus exists; otherwise he should just
> > disable failover_require_consensus.
>
> Question is, why can't we automatically recover from detached state as
> well as quarantine state?
>

Ideally we should recover automatically from the detached state as well.
The problem is that when a node is detached, specifically the primary
node, the failover procedure promotes a standby to be the new master, and
follow_master repoints the remaining standbys at the new master. Even if
the old primary becomes reachable again, attaching it automatically would
lead to a variety of problems, including split-brain.
I think it is possible to implement a mechanism that verifies the status
of a detached PostgreSQL node once it becomes reachable again and, after
taking appropriate actions, attaches it back automatically, but currently
we don't have anything like that in Pgpool-II. So instead we rely on user
intervention to re-attach the node, using pcp_attach_node or the online
recovery mechanism.

Quarantined nodes, on the other hand, are just as good as alive nodes
(they are only unreachable by Pgpool-II at the moment). When the node was
quarantined, Pgpool-II never executed any failover and/or follow_master
commands and did not interfere with the PostgreSQL backend in any way
that would alter its timeline or recovery state. So when a quarantined
node becomes reachable again, it is safe to connect it back to Pgpool-II
automatically.
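
To make the distinction concrete, here is a minimal Python sketch of the
policy I am describing. It is only an illustration, not Pgpool-II code:
NodeState, Backend and on_health_check_success are hypothetical names
made up for this mail, and the real logic would live in the C sources.

```python
from dataclasses import dataclass
from enum import Enum, auto


class NodeState(Enum):
    UP = auto()
    QUARANTINED = auto()  # unreachable, but no failover was ever executed
    DETACHED = auto()     # failover / follow_master may already have run


@dataclass
class Backend:
    node_id: int
    state: NodeState


def on_health_check_success(node: Backend) -> bool:
    """Return True if the node was re-attached automatically.

    Hypothetical hook, called when a health check on a non-UP node succeeds.
    """
    if node.state is NodeState.QUARANTINED:
        # Nothing touched the backend's timeline or recovery state,
        # so bringing it back is safe.
        node.state = NodeState.UP
        return True
    # A detached node may have diverged from the new master; leave it for
    # the administrator (pcp_attach_node or online recovery).
    return False


quarantined = Backend(0, NodeState.QUARANTINED)
detached = Backend(1, NodeState.DETACHED)
print(on_health_check_success(quarantined), quarantined.state.name)
print(on_health_check_success(detached), detached.state.name)
```

The point of the sketch is that only the quarantine branch changes state;
a detached node is never touched without user intervention.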

> >> BTW,
> >>
> >> > > When the communication between master/coordinator pgpool and
> >> > > primary PostgreSQL node is down during a short period
> >> >
> >> > I wonder why you don't set appropriate health check retry parameters
> >> > to avoid such a temporary communication failure in the first place. A
> >> > brain surgery to ignore the error reports from Pgpool-II does not seem
> >> > to be a sane choice.
> >>
> >> The original reporter didn't answer my question. I think it is likely
> >> a problem of misconfiguration (should use longer health check retry).
> >>
> >> In summary, I think that for a short communication failure just
> >> increasing the health check parameters is enough. However, for a
> >> longer communication failure, the watchdog node should decline the
> >> role.
> >>
> >
> > I am sorry, I didn't totally get what you mean here.
> > Do you mean that the Pgpool-II node that has the primary node in the
> > quarantine state should resign as the master/coordinator Pgpool-II node
> > (if it was the master/coordinator) in that case?
>
> Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
> keeping the quarantine state is fine because users' queries can still be
> processed.
>

Yes, that makes total sense. I will make that change as a separate patch.
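
Just to confirm my understanding, the rule would look roughly like the
following Python sketch. Again, this is an assumption about the
behaviour, not the patch itself: should_resign_coordinator and its
arguments are hypothetical names used only for illustration.

```python
from typing import Optional, Set


def should_resign_coordinator(
    is_coordinator: bool,
    primary_node_id: Optional[int],
    quarantined_node_ids: Set[int],
) -> bool:
    """Decide whether this Pgpool-II node should give up the coordinator role."""
    if not is_coordinator or primary_node_id is None:
        return False
    # Only a quarantined primary blocks write queries; a quarantined
    # standby still lets user queries be processed, so the node can stay.
    return primary_node_id in quarantined_node_ids


print(should_resign_coordinator(True, 0, {0}))   # primary is quarantined
print(should_resign_coordinator(True, 0, {1}))   # only a standby is quarantined
```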

Thanks
Best Regards
Muhammad Usama

> > Thanks
> > Best Regards
> > Muhammad Usama
> >
> >
> >> >> > Can you please try out the attached patch, to see if the solution
> >> >> > works for the situation?
> >> >> > The patch is generated against current master branch.
> >> >> >
> >> >> > Thanks
> >> >> > Best Regards
> >> >> > Muhammad Usama
> >> >> >
> >> >> > On Wed, Apr 10, 2019 at 2:04 PM TAKATSUKA Haruka <harukat at sraoss.co.jp> wrote:
> >> >> >
> >> >> >> Hello, Pgpool developers
> >> >> >>
> >> >> >>
> >> >> >> I found Pgpool-II watchdog is too strict for duplicate failover
> >> >> >> requests with the allow_multiple_failover_requests_from_node=off
> >> >> >> setting.
> >> >> >>
> >> >> >> For example, consider a watchdog cluster with 3 pgpool instances.
> >> >> >> Their backends are PostgreSQL servers using streaming replication.
> >> >> >>
> >> >> >> When the communication between the master/coordinator pgpool and
> >> >> >> the primary PostgreSQL node is down for a short period
> >> >> >> (or pgpool makes a false-positive judgement for whatever reason),
> >> >> >> the pgpool tries to fail over but cannot get the consensus,
> >> >> >> so it puts the primary node into quarantine status. It cannot
> >> >> >> be reset automatically. As a result, the service becomes unavailable.
> >> >> >>
> >> >> >> This case generates logs like the following:
> >> >> >>
> >> >> >> pid 1234: LOG:  new IPC connection received
> >> >> >> pid 1234: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
> >> >> >> pid 1234: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> >> >> >> pid 1234: LOG:  Duplicate failover request from "pg1:5432 Linux pg1" node
> >> >> >> pid 1234: DETAIL:  request ignored
> >> >> >> pid 1234: LOG:  failover requires the majority vote, waiting for consensus
> >> >> >> pid 1234: DETAIL:  failover request noted
> >> >> >> pid 4321: LOG:  degenerate backend request for 1 node(s) from pid [4321], is changed to quarantine node request by watchdog
> >> >> >> pid 4321: DETAIL:  watchdog is taking time to build consensus
> >> >> >>
> >> >> >> Note that this case doesn't have any communication trouble among
> >> >> >> the Pgpool watchdog nodes.
> >> >> >> You can reproduce it by changing one PostgreSQL's pg_hba.conf to
> >> >> >> reject the health check access from one pgpool node for a short
> >> >> >> period.
> >> >> >>
> >> >> >> The documentation doesn't say that duplicate failover requests put
> >> >> >> the node into quarantine immediately. I think the request should
> >> >> >> just be ignored.
> >> >> >>
> >> >> >> A patch file for head of V3_7_STABLE is attached.
> >> >> >> With this patch, Pgpool still prevents a failover triggered only by
> >> >> >> a single pgpool's repeated failover requests, but it can recover
> >> >> >> once the connection trouble is gone.
> >> >> >>
> >> >> >> Does this change have any problem?
> >> >> >>
> >> >> >>
> >> >> >> with best regards,
> >> >> >> TAKATSUKA Haruka <harukat at sraoss.co.jp>
> >> >> >> _______________________________________________
> >> >> >> pgpool-hackers mailing list
> >> >> >> pgpool-hackers at pgpool.net
> >> >> >> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
> >> >> >>
> >> >>
> >>
>