[pgpool-hackers: 3327] Re: [pgpool-committers: 5734] pgpool: Fix for duplicate failover request ...

Tue May 21 17:22:11 JST 2019

Hi Usama,

Oh ok. So you are going to create the second part of patches. I am
looking forward to seeing it.

Sorry for my misunderstanding.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hi Ishii-San
> 
> The discussion on the thread [pgpool-hackers: 3318] yielded two patches:
> one continues the health check on quarantined nodes, and the other makes
> the master watchdog de-escalate and resign if the primary backend node
> gets into the quarantine state on the master.
> This commit only takes care of the first part, continuing the health
> check. I still have to commit the second patch, which takes care of
> resigning from master status. The regression failure of test
> 013.watchdog_failover_require_consensus will also be addressed by the
> second patch.
> 
> I am sorry, I think I am missing something about the consensus reached in
> the discussion. I think we agreed on the thread to commit both patches,
> but only to the master branch, since it was a change to the existing
> behaviour and we didn't want to backport it to older branches.
> 
> Please see the snippet from our discussion on the thread, from which I
> infer that we are in agreement to commit the changes.
> 
> --quote--
> ...
>> Now if we look at the quarantined nodes, they are just as good as alive
>> nodes (but unreachable by pgpool at the moment).
>> Because when a node was quarantined, Pgpool-II never executed any
>> failover and/or follow_master commands
>> and did not interfere with the PostgreSQL backend in any way to alter its
>> timeline or recovery state.
>> So when a quarantined node becomes reachable again, it is safe to
>> automatically connect it back to Pgpool-II.
> 
> Ok, that makes sense.
> 
>>> >> BTW,
>>> >>
>>> >> > > When the communication between master/coordinator pgpool and
>>> >> > > primary PostgreSQL node is down during a short period
>>> >> >
>>> >> > I wonder why you don't set appropriate health check retry parameters
>>> >> > to avoid such a temporary communication failure in the first place. A
>>> >> > brain surgery to ignore the error reports from Pgpool-II does not seem
>>> >> > to be a sane choice.
>>> >>
>>> >> The original reporter didn't answer my question. I think it is likely
>>> >> a problem of misconfiguration (should use longer health check retry).
>>> >>
>>> >> In summary, I think for a short communication failure, just
>>> >> increasing the health check parameters is enough. However, for a
>>> >> longer communication failure, the watchdog node should decline the
>>> >> role.
>>> >>
>>> >
>>> > I am sorry, I didn't totally get what you mean here.
>>> > Do you mean that the pgpool-II node that has the primary node in
>>> > quarantine state should resign from being the master/coordinator
>>> > pgpool-II node (if it was a master/coordinator) in that case?
>>>
>>> Yes, exactly. Note that if the PostgreSQL node is one of the standbys,
>>> keeping the quarantine state is fine because user queries can still be
>>> processed.
>>>
>>
>> Yes that makes total sense. I will make that change as separate patch.
> 
> Thanks. However this will change existing behavior. Probably we should
> make the change against master branch only?
> 
> --unquote--
> 
> 
> 
> On Tue, May 21, 2019 at 4:32 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Usama,
>>
>> Since this commit, the regression tests/buildfarm are failing:
>>
>> testing 013.watchdog_failover_require_consensus... failed.
>>
>> Also I think this commit seems to be against the consensus made in the
>> discussion [pgpool-hackers: 3295] thread.
>>
>> I thought we agreed on [pgpool-hackers: 3318] so that:
>>
>> ---------------------------------------------------------------------
>> Hi Usama,
>>
>> > Hi
>> >
>> > I have drafted a patch to make the master watchdog node resign from
>> > master responsibilities if it fails to get the consensus for its
>> > primary backend node failover request. The patch is still a little short
>> > on testing, but I want to share the early version to get
>> > feedback on the behaviour.
>> > Also, with this implementation the master/coordinator node only resigns
>> > from being a master
>> > when it fails to get the consensus for the primary node failover, but in
>> > case of failed consensus for a standby node failover
>> > no action is taken by the watchdog master node. Do you think the master
>> > should also resign in this case?
>>
>> I don't think so, because queries can still be routed to the primary (or
>> other standby servers if there are two or more standbys).
>>
> 
> My understanding from this part of the discussion was that we agreed to
> keep the master status of the watchdog node if one of the standby nodes on
> the pgpool watchdog-master gets into quarantine, and only go for
> resignation if the primary gets quarantined.
> 
> Have I misunderstood something?
> 
> Thanks
> Best regards
> Muhammad Usama
> 
> 
> 
> 
>> ---------------------------------------------------------------------
>>
>> From: Muhammad Usama <m.usama at gmail.com>
>> Subject: [pgpool-committers: 5734] pgpool: Fix for [pgpool-hackers: 3295]
>> duplicate failover request ...
>> Date: Wed, 15 May 2019 21:40:01 +0000
>> Message-ID: <E1hR1d3-0005o4-1k at gothos.postgresql.org>
>>
>> > Fix for [pgpool-hackers: 3295] duplicate failover request ...
>> >
>> > Pgpool should keep the backend health check running on quarantined nodes
>> > so that when the connectivity resumes, they should automatically get
>> > removed from the quarantine. Otherwise the temporary network glitch could
>> > send the node into permanent quarantine state.
>> >
>> > Branch
>> > ------
>> > master
>> >
>> > Details
>> > -------
>> >
>> > https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=3dd1cd3f15287ee6bb8b09f0642f99db98e9776a
>> >
>> > Modified Files
>> > --------------
>> > src/main/health_check.c | 28 ++++++++++++++++++++++++----
>> > 1 file changed, 24 insertions(+), 4 deletions(-)
>> >
>>
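
For readers following the thread, the behaviour discussed above can be summarised as a small decision rule. The following is only an illustrative sketch in Python, not Pgpool-II code: every name in it (NodeStatus, BackendRole, WatchdogAction, next_status, action_on_quarantine) is hypothetical and does not correspond to anything in src/main/health_check.c or the watchdog sources. It assumes the first patch amounts to "a quarantined node that the health check can reach again leaves quarantine", and the second to "the master watchdog resigns only when its primary backend is quarantined". The thread does not say what a non-master watchdog does, so that case is an assumption.

```python
from enum import Enum


class BackendRole(Enum):
    PRIMARY = "primary"
    STANDBY = "standby"


class NodeStatus(Enum):
    UP = "up"
    QUARANTINE = "quarantine"


class WatchdogAction(Enum):
    KEEP_MASTER = "keep master role"
    RESIGN_MASTER = "resign master role"
    NOT_APPLICABLE = "not the master"


def next_status(status, reachable):
    """First patch (sketch): the health check keeps probing a quarantined
    node, and the node leaves quarantine once it is reachable again.
    Entering quarantine is not modelled here."""
    if status is NodeStatus.QUARANTINE and reachable:
        return NodeStatus.UP
    return status


def action_on_quarantine(is_watchdog_master, role):
    """Second patch (sketch): what the local watchdog does when one of its
    backend nodes is quarantined."""
    if not is_watchdog_master:
        # Assumption: the thread only discusses the master/coordinator.
        return WatchdogAction.NOT_APPLICABLE
    if role is BackendRole.PRIMARY:
        # Queries cannot reach the primary, so the master should resign.
        return WatchdogAction.RESIGN_MASTER
    # A quarantined standby is tolerated: queries can still be processed.
    return WatchdogAction.KEEP_MASTER


if __name__ == "__main__":
    print(next_status(NodeStatus.QUARANTINE, reachable=True))
    print(action_on_quarantine(True, BackendRole.PRIMARY))
    print(action_on_quarantine(True, BackendRole.STANDBY))
```

The two functions are kept separate on purpose, mirroring the split into two patches: the first concerns only the health check, the second only the watchdog role.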