[pgpool-hackers: 3311] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off
m.usama at gmail.com
Thu Apr 18 01:43:15 JST 2019
I have drafted a patch to make the master watchdog node resign from master
responsibilities if it fails to get consensus for its primary backend node
failover request. The patch is still a little short on testing, but I want to
share the early version to get feedback on the behaviour.
Also, with this implementation the master/coordinator node only resigns from
being a master when it fails to get consensus for the primary node failover;
in case of a failed consensus for a standby node failover, the watchdog
master node takes no action. Do you think the master should also resign in
that case?
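The proposed decision logic can be sketched roughly as follows. This is an illustrative model only: none of these function or variable names are pgpool-II internals, and the majority rule here is a simplifying assumption standing in for the watchdog's actual consensus check.

```python
# Hypothetical sketch of the proposed behaviour; names and the simple
# majority rule are illustrative, not pgpool-II internals.

def has_consensus(votes_received: int, total_nodes: int) -> bool:
    """Assume consensus means a majority of watchdog nodes reported the failure."""
    return votes_received > total_nodes // 2

def decide_master_action(votes_received: int, total_nodes: int,
                         failed_node_is_primary: bool) -> str:
    """What the master/coordinator does after a failover request.

    Per the proposal, the master resigns only when consensus fails for a
    *primary* backend failover; a failed standby consensus takes no action.
    """
    if has_consensus(votes_received, total_nodes):
        return "perform_failover"
    if failed_node_is_primary:
        return "resign_master"   # let the watchdog elect a new leader
    return "no_action"           # open question: should it resign here too?
```

For example, with three watchdog nodes, a single vote for a primary failure would make the master resign rather than fail over.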
On Tue, Apr 16, 2019 at 3:16 PM Muhammad Usama <m.usama at gmail.com> wrote:
> Hi Haruka Takatsuka,
> On Tue, Apr 16, 2019 at 2:42 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
>> Hello Usama, and Pgpool Hackers
>> Thanks for your answer.
>> I tried your patch adjusting it for V3.7.x.
>> Thanks for trying out the patch.
>> In the scenario where the enabled health check finds the connection failure
>> and its recovery, it works fine. But in the scenario where the health check
>> is disabled and frontend requests detect the failure, the quarantine status
>> remains stuck in the pgpool.
> Yes, for disabled health-check scenarios it's difficult to recover the node
> automatically. But then, it is not advisable to use the consensus mechanism
> for failover while disabling the health check, because that would actually
> lead to a situation where the watchdog would never reach consensus even in
> the case of genuine backend failures: the other pgpool nodes that are not
> serving clients would never learn about the backend node failure, would
> keep sitting idle, and would never vote for the backend failure.
> I believe that is also documented in the failover_require_consensus section
> of the documentation.
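The caveat above amounts to a configuration rule: consensus-based failover only works when every watchdog node can detect backend failures on its own, i.e. with the health check enabled. A minimal pgpool.conf sketch (the parameter names are pgpool-II's; the values are only examples):

```ini
# Consensus-based failover assumes every pgpool node can independently
# observe backend failures, so keep the health check enabled.
failover_when_quorum_exists = on
failover_require_consensus = on

# Example health-check settings; tune for your environment.
health_check_period = 10        # seconds between health checks
health_check_timeout = 20
health_check_max_retries = 3
```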
>> I understand that this patch aims to recover from the quarantine status
>> via the health check. I confirmed it works well. I think it can be a help
>> to our customers in certain cases.
>> However, there is a problem Ishii-san pointed out, which is the continuous
>> health check failure messages while the cause remains.
> That's a valid observation, and I guess we can downgrade the log message
> in that case and make it a DEBUG log.
>> A pgpool node that notices it cannot get consensus, or that it is in the
>> minority, will go down soon; I prefer this simple behaviour to
>> quarantining. Can anyone tell me the reason why this design wasn't adopted?
> Taking the node down would be too aggressive a strategy, and would actually
> defeat the purpose.
> The original idea of building consensus for failover was to guard against
> temporary network glitches, because failover is a very expensive operation
> and comes with its own complexities and the possibility of data loss.
> Now consider the option of taking down the pgpool node when it is not able
> to build consensus for a backend node failure because of some network
> glitch. That would mean that as soon as the glitch occurs, the setup loses
> one pgpool node. That is a disaster in itself: the setup now has one less
> pgpool node, which is not only bad for the high-availability requirements
> but might also cause the setup to lose its quorum.
> So I guess the best way out here is what we discussed above: when the
> master/coordinator node fails to build the consensus, it should give up its
> coordinator status and let the watchdog elect a new leader.
> Best Regards
> Muhammad Usama
>> with best regards,
>> Haruka Takatsuka
>> On Mon, 15 Apr 2019 19:14:54 +0500
>> Muhammad Usama <m.usama at gmail.com> wrote:
>> > Thanks for the patch, but your patch effectively disables the node
>> > quarantine, which doesn't seem the right way.
>> > The backend node that was quarantined because of the absence of quorum
>> > and/or consensus is already unreachable from the Pgpool-II node, and we
>> > don't want to select it as the load-balance node (in case the node was a
>> > secondary), or consider it available when it is not, which is what would
>> > happen without marking it as quarantined.
>> > In my opinion, the right way to tackle the issue is to keep setting the
>> > quarantine state as is done currently, but also keep the health check
>> > working on quarantined nodes, so that as soon as connectivity to the
>> > quarantined node resumes, it becomes part of the cluster automatically.
>> > Can you please try out the attached patch to see if the solution works
>> > for this situation?
>> > The patch is generated against the current master branch.
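The patch's idea of keeping the health check running on quarantined nodes can be modelled as a simple state transition. This is a hypothetical sketch of the described behaviour, not pgpool-II code; the state names and function are invented for illustration.

```python
# Illustrative model of "health check continues on quarantined nodes":
# a quarantined backend rejoins automatically once it is reachable again.
# Names are hypothetical, not pgpool-II internals.

NODE_UP = "up"
NODE_QUARANTINE = "quarantine"

def health_check_tick(node_status: str, backend_reachable: bool) -> str:
    """One health-check round for a backend node; returns the new status."""
    if node_status == NODE_QUARANTINE and backend_reachable:
        return NODE_UP            # connectivity resumed: rejoin the cluster
    if node_status == NODE_UP and not backend_reachable:
        return NODE_QUARANTINE    # no quorum/consensus: quarantine, not failover
    return node_status            # otherwise unchanged
```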
>> pgpool-hackers mailing list
>> pgpool-hackers at pgpool.net