<div dir="ltr"><div>Hi Ishii-San,</div><div><br></div><div>Sorry for the delayed response.</div><div>With the attached fix, I think the failover objects will linger forever whenever a health check raises a false alarm or there is a small glitch.</div><div>One way around the issue could be to compute FAILOVER_COMMAND_FINISH_TIMEOUT from the maximum value</div><div>of health_check_period across the cluster. </div><div>Something like: failover_command_finish_timeout = max(health_check_period) * 2, which gives 60 seconds when the largest health_check_period is 30, as in your example.<br></div><div><br></div>If you agree with the proposal, I can cook up the patch and share it with you.<div><br></div><div>Thanks</div><div>Best regards</div><div>Muhammad Usama</div><div><br></div><div><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, Nov 21, 2022 at 3:38 PM Tatsuo Ishii &lt;<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> Hi Usama,<br>

> <br>

> I think I found an issue with failover_require_consensus. When this<br>

> parameter is enabled, the watchdog asks the other watchdogs to confirm the<br>

> failover event. If the other watchdogs reply and the originator<br>

> watchdog can reach a consensus, the failover process begins. However, if no<br>

> replies arrive before FAILOVER_COMMAND_FINISH_TIMEOUT expires, the<br>

> failover request is discarded and failover will not begin. If this<br>

> only happens once or twice, we could expect that a subsequent health<br>

> check would trigger failover. But in fact this could repeat forever<br>

> (which means failover never happens) if health_check_period is larger<br>

> than FAILOVER_COMMAND_FINISH_TIMEOUT (currently 15 seconds). For<br>

> example, if health_check_period = 30 seconds, and the other watchdog node<br>

> 1 starts 50 seconds after watchdog node 0 (suppose this is the leader<br>

> node), then every time a failover consensus request is made (suppose the<br>

> time is t), it will be canceled at t + 15, because failover on<br>

> watchdog node 1 will happen at time t + 20 (= 50 - 30).<br>

> <br>

> Since we allow other watchdog nodes to join a watchdog cluster at any time, I<br>

> think this is not the behavior we expect.<br>

> <br>

> Can we make FAILOVER_COMMAND_FINISH_TIMEOUT longer or disable the<br>

> expiring when failover_require_consensus is on?<br>

<br>

Attached is the patch for this.<br>

<br>

> disable the<br>

> expiring when failover_require_consensus is on?<br>

<br>

It seems the patch solves the issue and passes all of the regression<br>

tests. But I wonder if the patch will have unwanted side effects. What<br>

do you think?<br>

<br>

Best regards,<br>

--<br>

Tatsuo Ishii<br>

SRA OSS LLC<br>

English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>

Japanese: <a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

</blockquote></div></div></div>