[pgpool-hackers: 3309] Re: duplicate failover request over allow_multiple_failover_requests_from_node=off

Muhammad Usama m.usama at gmail.com
Tue Apr 16 19:16:30 JST 2019


Hi Haruka Takatsuka,

On Tue, Apr 16, 2019 at 2:42 PM TAKATSUKA Haruka <harukat at sraoss.co.jp>
wrote:

> Hello Usama, and Pgpool Hackers
>
> Thanks for your answer.
> I tried your patch, adjusting it for V3.7.x.
>
Thanks for trying out the patch.


> In the scenario where the enabled health check finds the connection
> failure and its recovery, it works fine. But in the scenario where health
> check is disabled and frontend requests detect the failure, the quarantine
> status persists in pgpool.
>

Yes, for disabled health-check scenarios it is difficult to recover the node
automatically. But again, it is not advisable to use the consensus mechanism
for failover while disabling health check, because that would actually lead
to the situation where the watchdog would never come to consensus even in
the case of genuine backend failures: the other pgpool nodes that are not
serving the clients would never get to know about the backend node failure,
would keep sitting idle, and would never vote for the failover.

I believe that is also documented in the failover_require_consensus section
of the documentation.
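For reference, a rough sketch of the relevant pgpool.conf settings when
using consensus-based failover (the values below are only placeholders, not
a recommendation for any particular setup):

    # Keep health check enabled so that every watchdog node can detect a
    # backend failure on its own and cast its vote for the failover.
    health_check_period = 10           # seconds between checks
    health_check_timeout = 20
    health_check_max_retries = 3
    health_check_user = 'pgpool'

    # Consensus-based failover over the watchdog
    failover_when_quorum_exists = on
    failover_require_consensus = on
    allow_multiple_failover_requests_from_node = off

With health check enabled on every node, each pgpool node can report the
failure independently and the required majority can actually be reached.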


>
> I understand that this patch aims to recover from the quarantine status
> by health check. I confirmed that it works well. I think it can be a help
> for our customers in certain cases.
>


> However, there is a problem Ishii-san pointed out, which is that it
> continues emitting health check failure messages while its cause remains.
>
That's a valid observation, and I guess we can downgrade the log message in
that case and make it a DEBUG log.
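Something along these lines (just a hypothetical sketch against pgpool-II's
PostgreSQL-style elog machinery; the helper name and message below are made
up, the actual report site lives in the health check code):

    /* Hypothetical helper: pick the elevel for a health check failure
     * report.  Once the node is already quarantined, repeated failures
     * are reported at DEBUG1 instead of LOG so the log is not flooded
     * while the cause of the quarantine persists. */
    static int
    health_check_report_level(bool node_is_quarantined)
    {
        return node_is_quarantined ? DEBUG1 : LOG;
    }

    /* ...and at the report site, roughly:
     * ereport(health_check_report_level(quarantined),
     *         (errmsg("health check failed on node %d", node_id)));
     */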



> A pgpool node that notices it cannot get consensus, or that it is in the
> minority, goes down soon; I prefer this simple behavior to quarantining.
> Can anyone tell me the reason why this design wasn't adopted?
>
Taking the node down would be too aggressive a strategy, and it would
actually defeat the purpose.
The original idea of building the consensus for failover was to guard
against temporary network glitches, because failover is a very expensive
operation and comes with its own complexities and possibility of data loss.
Now consider the option of taking down the pgpool node when it is not able
to build consensus for a backend node failure because of some network
glitch. That would mean that as soon as the glitch occurs the setup loses
one pgpool node. That is a disaster in itself, since the setup now has one
less pgpool node, which is not only bad for the high availability
requirements but might also cause the setup to lose its quorum altogether.
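For example, the default quorum rule needs strictly more than half of the
watchdog nodes to be alive (a tiny illustration; the helper below is not an
actual pgpool-II function):

    /* Illustration only: minimum number of live watchdog nodes required
     * for quorum under the default "more than half" rule. */
    static int
    quorum_size(int total_nodes)
    {
        return total_nodes / 2 + 1;
    }

    /* With 3 pgpool nodes quorum_size(3) == 2, so if one node shuts
     * itself down over a transient glitch, a single further failure or
     * partition leaves only 1 live node and the quorum is gone. */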

So I guess the best way out here is what we discussed above: when the
master/coordinator node fails to build the consensus, it should give up its
coordinator status and let the watchdog decide on a new leader.

Thanks
Best Regards
Muhammad Usama


> with best regards,
> Haruka Takatsuka
>
>
> On Mon, 15 Apr 2019 19:14:54 +0500
> Muhammad Usama <m.usama at gmail.com> wrote:
>
> > Thanks for the patch, but your patch effectively disables the node
> > quarantine, which doesn't seem the right way. The backend node that was
> > quarantined because of the absence of quorum and/or consensus is already
> > unreachable from the Pgpool-II node, and we don't want to select it as
> > the load-balance node (in case the node was a secondary) or, by not
> > marking it as quarantined, consider it available when it is not.
> >
> > In my opinion the right way to tackle the issue is to keep setting the
> > quarantine state as is done currently, but also keep the health check
> > working on quarantined nodes, so that as soon as the connectivity to the
> > quarantined node resumes, it becomes part of the cluster again
> > automatically.
> >
> > Can you please try out the attached patch to see if the solution works
> > for the situation?
> > The patch is generated against the current master branch.
>
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>