[pgpool-hackers: 4247] Re: Issue with failover_require_consensus

Mon Dec 19 01:52:36 JST 2022

Hi Ishii San

Thanks for the feedback. Please find the updated patch that takes care of
transferring the per node health check params through WD_POOL_CONFIG_DATA
and use those in calculations.
Could you verify if the attached fixes the issue?

As for standard_packet_processor() is concerned, it Is intended to handle
messages
that do not require to be handled differently for each watchdog state.

Thanks
Best Regards
Muhammad Usama

On Sat, Dec 17, 2022 at 5:01 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> >>> On Tue, Nov 29, 2022 at 3:27 AM Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> >>>
> >>>> >> Hi Ishii-San
> >>>> >>
> >>>> >> Sorry for the delayed response.
> >>>> >
> >>>> > No problem.
> >>>> >
> >>>> >> With the attached fix I guess the failover objects will linger on
> >>>> forever
> >>>> >> in case of a false alarm by a health check or small glitch.
> >>>> >
> >>>> > That's not good.
> >>>> >
> >>>> >> One way to get around the issue could be to compute
> >>>> >> FAILOVER_COMMAND_FINISH_TIMEOUT based on the maximum value
> >>>> >> of health_check_peroid across the cluster.
> >>>> >> something like: failover_command_finish_timouut =
> >>>> max(health_check_period)
> >>>> >> * 2 = 60
> >>>>
> >>>> After thinking more, I think we need to take account
> >>>> health_check_max_retries and health_check_retry_delay as
> >>>> well. i.e. instead of max(health_check_period), something like:
> >>>> max(health_check_period + (health_check_retry_delay *
> >>>> health_check_max_retries)).
> >>>>
> >>>> What do you think?
> >>>>
> >>>
> >>> Thanks for the valuable suggestions.
> >>> Can you try out the attached patch to see if it solves the issue?
> >>
> >> Unfortunately the patch did not pass my test case.
> >>
> >> - 3 watchdog nodes and 2 PostgreSQL servers, streaming replication
> >>   cluster (created by watchdog_setup). pgpool0 is the watchdog leader.
> >>
> >> - health_check_period = 300, health_check_max_retries = 0
> >>
> >> - pgpool1 starts 120 seconds after pgpool0 starts
> >>
> >> - pgpool2 does not start
> >>
> >> - after watchdog cluster becomes ready, shutdown PostgreSQL node 1
> (standby).
> >>
> >> - wait for 600 seconds to expect a failover.
> >>
> >> Unfortunately failover did not happen.
> >>
> >> Attached is the test script and pgpool0 log.
> >>
> >> To run the test:
> >>
> >> - unpack test.tar.gz
> >>
> >> - run prepare.sh
> >>   $ sh prepare.sh
> >>   This should create "testdir" directory with 3 watchdog node +
> PostgreSQL 2 node cluster.
> >>
> >> - cd testdir and run the test
> >>   $ sh ../start.sg -o 120
> >>   This will start the test, "-o" specifies how long wait before
> strating pgpool1.
> >
> > After the test failure, I examined the pgpool log on the pgpool leader
> > node (node 0). It seems timeout was not updated as expected.
> >
> > 2022-12-17 08:07:11.419: watchdog pid 707483: LOG:  failover request
> from 1 nodes with ID:42 is expired
> > 2022-12-17 08:07:11.419: watchdog pid 707483: DETAIL:  marking the
> failover object for removal. timeout: 15
> >
> > After looking into the code, I found update_failover_timeout() only
> > examines "health_check_period".  I think you need to examine
> > "health_check_period0" etc. as well and find the larget one for the
> > timeout caliculation.
> >
> > By the way,
> >
> >> failover_command_timout
> >> g_cluster.failover_command_timout
> >
> > I think "timout" should be "timeout".
>
> I was trying to create a proof of concept patch for this:
>
> > After looking into the code, I found update_failover_timeout() only
> > examines "health_check_period".  I think you need to examine
> > "health_check_period0" etc. as well and find the larget one for the
> > timeout caliculation.
>
> and noticed that update_failover_timeout() is called by
> standard_packet_processor() when WD_POOL_CONFIG_DATA packet is
> received like: update_failover_timeout(wdNode, standby_config); But
> standby_config was created by get_pool_config_from_json() which does
> not seem to create health_check_params in pool_config data. Also I
> wonder why standard_packet_processor() needs to be called when
> WD_POOL_CONFIG_DATA is recieved. Can't we simply omit the call to
> update_failover_timeout() in this case?
>
> Best reagards,
> --
> Tatsuo Ishii
> SRA OSS LLC
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20221218/6db711ff/attachment.htm>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fix_failover_command_timout_v2.diff
Type: application/octet-stream
Size: 7411 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20221218/6db711ff/attachment.obj>