[pgpool-general: 4699] Re: Info on health_check parameters

Tatsuo Ishii ishii at postgresql.org
Sat May 21 05:56:20 JST 2016


> Hi,
> 
>>> Let's say
>>>
>>>    - health_check_period = 10
>>>    - health_check_max_retries = 11
>>>    - health_check_timeout = 10
>>>    - health_check_retry_delay = 1
>>>
>>> If at time 0 the master goes down and health checking starts, after 11
>>> tries that take 10+1 seconds each, failover is triggered at time 121.
> 
>> That's the worst case. Most error checks will return far before the
>> timeout, so usually failover would be triggered around time 11 * 1 = 11.
> 
> The problem here is that I need failover not to happen before 120 s, but
> obviously I would not like it to happen much later either. The best
> option for me would be failover after exactly 120 s.
> 
> There are also two timeouts to be considered: connect_timeout (10 s)
> and health_check_timeout (let's say 10 seconds).
> 
> I've made a small test using psql in two different cases:
>   * trying to connect to a node that is up but the postgresql service is down
>   * trying to connect to a node that is down
> 
> Having a look at strace, the behaviour is quite different.
> In case 1, an error is returned almost instantly by the connect() and the
> poll().
> In case 2, the connect() instead hits a timeout after about 30 s.
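>
> To make the two cases easy to reproduce, here is a minimal sketch of the
> same test (Python instead of psql; the hostnames and the 30 s timeout are
> placeholders, not values from my real setup):
>
>     import socket
>
>     def try_connect(host, port=5432, timeout=30):
>         s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>         s.settimeout(timeout)
>         try:
>             s.connect((host, port))
>             print(host, "connected")
>         except ConnectionRefusedError:
>             # case 1: node up, service down -> RST comes back at once
>             print(host, "refused almost instantly")
>         except socket.timeout:
>             # case 2: node down -> the SYN goes unanswered until the timeout
>             print(host, "timed out after", timeout, "s")
>         finally:
>             s.close()
>
>     try_connect("node-with-postgres-stopped")   # case 1
>     try_connect("node-that-is-down")            # case 2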
> 
> I think that when postgresql is down for an upgrade but the node is up,
> pgpool's behaviour is probably similar to case 1, so the health checks
> fail quickly.
> 
> On the other hand, if the node is down or there are network issues,
> pgpool will probably end up hitting connect_timeout or
> health_check_timeout.
> 
> Since my nodes are located in the same place and the network between them
> is reliable, I should probably reduce both timeouts to a few seconds and
> wait a little longer between tries.
> 
> If I choose:
>   * health_check_timeout and connect_timeout = 3
>   * health_check_retry_delay = 8
>   * health_check_max_retries = 15
> 
> I should probably obtain failover not before 15 * 8 = 120 s if postgresql
> is down (case 1, quick failures) and not after (8 + 3) * 15 = 165 s if the
> node is down or unreachable (case 2, connection timeout).
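>
> As a sanity check on those bounds (plain Python; the formulas are just my
> reading of the retry loop, not taken from the pgpool source):
>
>     retries, delay, timeout = 15, 8, 3
>
>     # case 1: every probe fails instantly, only the retry delay counts
>     print(retries * delay)               # 120 s: earliest failover
>
>     # case 2: every probe burns the full timeout, then the delay
>     print(retries * (timeout + delay))   # 165 s: latest failover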
> 
> I think that 3 seconds may be enough for a health test even under heavy
> load. Moreover, those tests are allowed to fail occasionally without
> triggering failover.
> 
> Could you please comment on my parameter choice?

Thank you for the study. Your choice of parameters looks good.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Thank you and best regards,
> 
> Gabriele Monfardini
> 
> -----
> Gabriele Monfardini
> LdP Progetti GIS
> tel: 0577.531049
> email: monfardini at ldpgis.it
> 
> On Fri, May 20, 2016 at 4:30 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>>
>> >
>> > Hi all,
>> >
>> > I have a setup with two pgpools in HA and two backends in streaming
>> > replication.
>> > The problem is that, due to an unattended upgrade, the master has been
>> > restarted and the master pgpool has correctly started failover.
>> >
>> > We would like to prevent this by tuning the health_check parameters,
>> > so that pgpool copes with a short master outage without performing
>> > failover.
>> >
>> > I've found an old blog post by Tatsuo Ishii,
>> > http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html, in
>> > which the following statement is made:
>> >
>> >> Please note that "health_check_max_retries *
>> >> (health_check_timeout+health_check_retry_delay)" should be smaller than
>> >> health_check_period.
>>
>> Yeah, this is not the best advice.
>>
>> > Looking at the code, however, it seems to me that things are a little
>> > different (though I may be wrong).
>> >
>> >    1. in the main loop, a health check is performed for each backend
>> >    (do_health_check), from 0 to number_of_backends
>> >    2. suppose that the i-th backend's health check fails because of a
>> >    timeout; the process is interrupted by the timer
>> >    3. if (current_try <= health_check_max_retries) =>
>> >    sleep health_check_retry_delay
>> >    4. we're back in the main loop; the health check restarts from i, the
>> >    backend for which the health check failed
>> >    5. suppose that the health check fails again and again
>> >    6. when (current_try > health_check_max_retries) => set the backend
>> >    down
>> >    7. we're back in the main loop; the health check restarts from i, the
>> >    backend for which the health check failed, but now its state is DOWN,
>> >    so we continue to the next backend
>> >    8. in the main loop, when do_health_check exits, all backends are
>> >    down or all backends not currently down are healthy
>> >    9. then we sleep health_check_period in the main loop before starting
>> >    the check again from the beginning (sketched below)
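>> >
>> > In other words (a rough Python sketch of my reading of the loop, not
>> > the actual C source; all names are mine):
>> >
>> >     from time import sleep
>> >
>> >     MAX_RETRIES, RETRY_DELAY, PERIOD = 11, 1, 10
>> >
>> >     def health_check(backend):
>> >         return backend["up"]              # stand-in for the real probe
>> >
>> >     def main_loop(backends):
>> >         while True:
>> >             for b in backends:                    # steps 1 and 4
>> >                 if b["state"] == "DOWN":
>> >                     continue                      # step 7
>> >                 tries = 0
>> >                 while not health_check(b):        # step 2, may time out
>> >                     tries += 1
>> >                     if tries > MAX_RETRIES:
>> >                         b["state"] = "DOWN"       # step 6
>> >                         break
>> >                     sleep(RETRY_DELAY)            # step 3
>> >             sleep(PERIOD)                         # step 9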
>> >
>> >
>> > If I understand it correctly, health_check_period is slept
>> > unconditionally at the end of the check, so it does not need to be set
>> > as high as the formula in the blog suggests.
>>
>> Correct.
>>
>> > Moreover, if there are many backends and many failures, the last
>> > backend may be checked again only after a long time, in the worst case
>> > after about
>> >
>> > (number_of_backends-1) * health_check_max_retries *
>> > (health_check_timeout+health_check_retry_delay) + health_check_period
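>> >
>> > For instance (hypothetical numbers, just to put a figure on it):
>> >
>> >     nodes, retries, timeout, delay, period = 3, 11, 10, 1, 10
>> >     print((nodes - 1) * retries * (timeout + delay) + period)   # 252 s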
>>
>> Again, correct. To improve this, we would need to create separate
>> health check processes, each performing the health check for one
>> PostgreSQL concurrently.
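>>
>> A minimal sketch of that idea (Python threads standing in for separate
>> processes; not pgpool's implementation, just the shape of it):
>>
>>     import threading, time
>>
>>     def check(backend):
>>         print("probing", backend)        # stub for the real health check
>>
>>     def health_check_loop(backend, period=10):
>>         # each worker probes one backend independently, so a node that
>>         # hangs until timeout no longer delays the checks of the others
>>         while True:
>>             check(backend)
>>             time.sleep(period)
>>
>>     for b in ["backend0", "backend1"]:
>>         t = threading.Thread(target=health_check_loop, args=(b,), daemon=True)
>>         t.start()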
>>
>> > Suppose I decide it is acceptable for the master to be down for at
>> > most 120 seconds before failover.
>> >
>> > Since I have only two backends, I should probably set
>> >
>> > health_check_max_retries *
>> > (health_check_timeout+health_check_retry_delay) + health_check_period
>> >
>> > to about 120s.
>> >
>> > Let's say
>> >
>> >    - health_check_period = 10
>> >    - health_check_max_retries = 11
>> >    - health_check_timeout = 10
>> >    - health_check_retry_delay = 1
>> >
>> > If at time 0 the master goes down and health checking starts, after 11
>> > tries that take 10+1 seconds each, failover is triggered at time 121.
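>> >
>> > As arithmetic (assuming every probe hits the full timeout):
>> >
>> >     print(11 * (10 + 1))   # 121 s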
>>
>> That's the worst case. Most error checks will return far before the
>> timeout, so usually failover would be triggered around time 11 * 1 = 11.
>>
>> > In case all health checks return OK in negligible time, which should
>> > happen almost always, health_check_period ensures that no checks are
>> > done for the next 10 seconds.
>>
>> Right.
>>
>> > Can you please confirm my findings or correct me?
>>
>> Thank you for your analysis!
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Best regards,
>> >
>> > Gabriele Monfardini
>> >
>> > -----
>> > Gabriele Monfardini
>> > LdP Progetti GIS
>> > tel: 0577.531049
>> > email: monfardini at ldpgis.it
>> _______________________________________________
>> pgpool-general mailing list
>> pgpool-general at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-general

