[pgpool-hackers: 3330] Re: [pgpool-committers: 5734] pgpool: Fix for duplicate failover request ...

Mon May 27 13:54:21 JST 2019

Hello, Usama and pgpool-hackers

Thank you very much for writing and committing patches to solve the issue
that I reported.

What do you think about back-patching them into 3.7.x or older versions?
Pgpool-II watchdog cluster can easily fall into the status that doesn't
work without the patch.

The quarantine feature hasn't been documented in detail up to now, so
I think the behavior change of the first patch (*1) has no problem.
My customer who has trouble related to this mail thread uses 3.7.x version
and strongly hopes such a pgpool improvement.

*1
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=3dd1cd3f1528

Thanks,
Haruka Takatsuka

On Tue, 21 May 2019 12:00:03 +0500
Muhammad Usama <m.usama at gmail.com> wrote:

> Hi Ishii-San
> 
> The discussion on the thread  [pgpool-hackers: 3318] yielded two patches,
> one was related
> to continuing the health check on the quarantined node and the other one
> was related to the
> de-escalation and resigning of the master watchdog if the primary backend
> node gets into
> quarantine state on the master.
> So this commit only takes care of the first part that is to continue health
> check and I still have
> to commit the second patch taking care of the resigning from master status
> part. The regression
> failure of  test 013.watchdog_failover_require_consensus will also follow
> the second patch for this issue.
> 
> I am sorry I think I am missing something on the part of consensus made in
> the discussion,
> I think we agreed on the thread to commit both the patch  but only in the
> master branch since
> it was change of the existing behaviour and we don't wanted to back port it
> to older branches.
> 
> Please see the snippet from our discussion on the thread from which I infer
> that we are in agreement
> to commit the changes
> 
> --quote--
> ...
> > Now if we look at the quarantine nodes, they are just as good as alive
> > nodes (but unreachable by pgpool at the moment).
> > Because when the node was quarantined, Pgpool-II never executed any
> > failover and/or follow_master commands
> > and did not interfered with the PostgreSQL backend in any way to alter its
> > timeline or recovery states,
> > So when the quarantine node becomes reachable again it is safe to
> > automatically connect them back to the Pgpool-II
> 
> Ok, that makes sense.
> 
> >> >> BTW,
> >> >>
> >> >> > > When the communication between master/coordinator pgpool and
> >> >> > > primary PostgreSQL node is down during a short period
> >> >> >
> >> >> > I wonder why you don't set appropriate health check retry parameters
> >> >> > to avoid such a temporary communication failure in the firs place. A
> >> >> > brain surgery to ignore the error reports from Pgpool-II does not
> seem
> >> >> > to be a sane choice.
> >> >>
> >> >> The original reporter didn't answer my question. I think it is likely
> >> >> a problem of misconfiguraton (should use longer heath check retry).
> >> >>
> >> >> In summary I think for shorter period communication failure just
> >> >> increasing health check parameters is enough. However for longer
> >> >> period communication failure, the watchdog node should decline the
> >> >> role.
> >> >>
> >> >
> >> > I am sorry I didn't totally get it what you mean here.
> >> > Do you mean that the pgpool-II node that has the primary node in
> >> quarantine
> >> > state should resign from the master/coordinator
> >> > pgpool-II node (if it was a master/coordinator) in that case?
> >>
> >> Yes, exactly. Note that if the PostgreSQL node is one of standbys,
> >> keeping the quarantine state is fine because users query could be
> >> processed.
> >>
> >
> > Yes that makes total sense. I will make that change as separate patch.
> 
> Thanks. However this will change existing behavior. Probably we should
> make the change against master branch only?
> 
> --un quote--
> 
> 
> 
> On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
> > Usama,
> >
> > Since this commit regression test/buildfarm are failing:
> >
> > testing 013.watchdog_failover_require_consensus... failed.
> >
> > Also I think this commit seems to be against the consensus made in the
> > discussion [pgpool-hackers: 3295] thread.
> >
> > I thought we agreed on [pgpool-hackers: 3318] so that:
> >
> > ---------------------------------------------------------------------
> > Hi Usama,
> >
> > > Hi
> > >
> > > I have drafted a patch to make the master watchdog node resigns from
> > master
> > > responsibilities if it fails to get the consensus for its
> > > primary backend node failover request. The patch is still little short on
> > > testing but I want to share the early version to get
> > > the feedback on behaviour.
> > > Also with this implementation the master/coordinator node only resigns
> > from
> > > being a master
> > > when it fails to get the consensus for the primary node failover, but in
> > > case of failed consensus for standby node failover
> > > no action is taken by the watchdog master node. Do you think master
> > should
> > > also resign in this case as well ?
> >
> > I don't think so because still queries can be routed to primary (or
> > other standby servers if there are two or more standbys).
> >
> 
> My understand from this part of discussion was that, we agreed to keep the
> master status
> of the watchdog node if one of the standby node on the pgpool
> watchdog-master gets into
> quarantine and only go for resignation if the primary gets quarantine.
> 
> Have I misunderstood something?
> 
> Thanks
> Best regards
> Muhammad Usama
> 
> 
> 
> 
> > ---------------------------------------------------------------------
> >
> > From: Muhammad Usama <m.usama at gmail.com>
> > Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295]
> > duplicate failover request ...
> > Date: Wed, 15 May 2019 21:40:01 +0000
> > Message-ID: <E1hR1d3-0005o4-1k at gothos.postgresql.org>
> >
> > > Fix for [pgpool-hackers: 3295] duplicate failover request ...
> > >
> > > Pgpool should keep the backend health check running on quarantined nodes
> > so
> > > that when the connectivity resumes, they should automatically get removed
> > > from the quarantine. Otherwise the temporary network glitch could send
> > the node
> > > into permanent quarantine state.
> > >
> > > Branch
> > > ------
> > > master
> > >
> > > Details
> > > -------
> > >
> > https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a
> > >
> > > Modified Files
> > > --------------
> > > src/main/health_check.c | 28 ++++++++++++++++++++++++----
> > > 1 file changed, 24 insertions(+), 4 deletions(-)
> > >