[pgpool-hackers: 4232] Re: Issue with failover_require_consensus

Tatsuo Ishii ishii at sraoss.co.jp
Mon Nov 28 16:51:57 JST 2022


> Hi Ishii-San
> 
> Sorry for the delayed response.

No problem.

> With the attached fix I guess the failover objects will linger on forever
> in case of a false alarm by a health check or small glitch.

That's not good.

> One way to get around the issue could be to compute
> FAILOVER_COMMAND_FINISH_TIMEOUT based on the maximum value
> of health_check_period across the cluster.
> Something like: failover_command_finish_timeout = max(health_check_period)
> * 2 = 60

This is much better than my previous proposal.
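
For illustration only, the calculation could look like the minimal sketch
below. It is written in Python rather than pgpool's C, the function name and
the list of per-node periods are hypothetical, and the multiplier of 2 is
simply the factor from the proposal above. How the health_check_period values
would actually be collected across the cluster is not shown.

```python
def failover_command_finish_timeout(health_check_periods, multiplier=2):
    """Proposed timeout in seconds: max(health_check_period) * multiplier.

    health_check_periods is a hypothetical list holding the
    health_check_period value (in seconds) of each node in the cluster.
    """
    if not health_check_periods:
        raise ValueError("need at least one health_check_period value")
    return max(health_check_periods) * multiplier


# With the largest period being 30 seconds, the timeout becomes 60 seconds.
print(failover_command_finish_timeout([30, 10, 30]))  # 60
```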

> If you agree with the proposal I can cook up the patch and share it with
> you.

I agree with you. Please go ahead.

> Thanks
> Best regards
> Muhammad Usama
> 
> On Mon, Nov 21, 2022 at 3:38 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > Hi Usama,
>> >
>> > I think I found an issue with failover_require_consensus. When this
>> > parameter is enabled, a watchdog asks the other watchdogs to confirm the
>> > failover event. If the other watchdogs reply and the originating
>> > watchdog can reach a consensus, the failover process begins. However, if no
>> > replies arrive before FAILOVER_COMMAND_FINISH_TIMEOUT expires, the
>> > failover request is discarded and failover does not begin. If this
>> > happened only once or twice, we could expect a subsequent health
>> > check to trigger failover. But it can actually repeat forever
>> > (which means failover never happens) if health_check_period is larger
>> > than FAILOVER_COMMAND_FINISH_TIMEOUT (currently 15 seconds). For
>> > example, if health_check_period = 30 seconds and the other watchdog node
>> > 1 starts 50 seconds after watchdog node 0 (suppose this is the leader
>> > node), then every time a failover consensus request is made (suppose the
>> > time is t), it will be canceled at t + 15, because the failover request on
>> > watchdog node 1 will only happen at time t + 20 ( = 50 - 30).
>> >
>> > Since we allow other watchdog nodes to join a watchdog cluster at any
>> > time, I think this is not the behavior we expect.
>> >
>> > Can we make FAILOVER_COMMAND_FINISH_TIMEOUT longer or disable the
>> > expiring when failover_require_consensus is on?
>>
>> Attached is the patch for this.
>>
>> > disable the
>> > expiring when failover_require_consensus is on?
>>
>> It seems the patch solves the issue and passes all of the regression
>> tests. But I wonder if the patch has unwanted side effects. What
>> do you think?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS LLC
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>
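
As a rough check of the timing in the report quoted above, here is a minimal
Python sketch. It is a simplified model that follows only what the report
describes: both nodes run health checks every health_check_period seconds,
node 1 starts later than node 0, and a request raised on node 0 is dropped
after the timeout. The function names are hypothetical and nothing here is
taken from the pgpool source.

```python
def gap_to_peer_check(health_check_period, peer_start_delay):
    """Seconds from a health check on node 0 to the next one on node 1.

    Assumes both nodes check every health_check_period seconds and that
    node 1 started peer_start_delay seconds after node 0.
    """
    return peer_start_delay % health_check_period


def request_expires_first(health_check_period, peer_start_delay, timeout):
    """True if the request on node 0 times out before node 1's next check."""
    return gap_to_peer_check(health_check_period, peer_start_delay) > timeout


print(gap_to_peer_check(30, 50))          # 20
print(request_expires_first(30, 50, 15))  # True
print(request_expires_first(30, 50, 60))  # False
```

With the current 15 second timeout the request always expires 5 seconds
before node 1 raises its own, which matches the t + 15 versus t + 20 timing
in the report. With a 60 second timeout the gap of 20 seconds fits inside the
window.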

