[pgpool-general: 4690] Re: Info on health_check parameters

Fri May 20 23:30:53 JST 2016

> Hi all,
> 
> I have a setup with two pgpools in HA and two backends in streaming
> replication.
> The problem is that, due to an unattended upgrade, the master was restarted
> and the master pgpool correctly started a failover.
> 
> We would like to prevent this by tuning the health_check parameters, so
> that pgpool can cope with a short master outage without performing
> failover.
> 
> I've found an old blog post by Tatsuo Ishii,
> http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html, in
> which the following statement is made:
> 
>> Please note that "health_check_max_retries *
>> (health_check_timeout+health_check_retry_delay)" should be smaller than
>> health_check_period.

Yeah, this is not the best advice.

> Looking at the code, however, it seems to me that things are a little
> different (I may be wrong).
> 
>    1. in main loop health check for backends is performed
>    (do_health_check), starting from 0 to number_of_backends
>    2. suppose that i-th backend health check fails because of timeout. The
>    process is interrupted by the timer.
>    3. if (current_try <= health_check_max_retries) =>
>    sleep health_check_retry_delay
>    4. we're back in main loop, the health check restarts from i, the backend
>    for which health_check failed
>    5. suppose that health_check fails again and again
>    6. when (current_try > health_check_max_retries) => set backend down
>    7. we're back in main loop, the health check restarts from i, the backend
>    for which health_check failed, but now its state is DOWN so we continue to
>    the next backend
>    8. in main loop when do_health_check exits, all backends are down or all
>    backends not currently down are healthy
>    9. then we sleep health_check_period in main loop before starting again
>    the check from the beginning.
> 
> 
> If I understand it correctly, health_check_period is slept unconditionally
> at the end of the check, so there is no need to set it as high as the
> formula in the blog requires.

Correct.
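
For illustration, the loop described in steps 1 to 9 above can be sketched
in Python roughly as follows. This is only a sketch of the behaviour as
summarized in this thread, not pgpool's actual C code. The function name
health_check_pass, the probe and sleep callbacks, and the exact point at
which the retry counter is incremented are all assumptions.

```python
from typing import Callable, List


def health_check_pass(
    statuses: List[str],
    probe: Callable[[int], bool],
    max_retries: int,
    retry_delay: float,
    sleep: Callable[[float], None],
) -> None:
    """One pass over all backends (steps 1-8 of the quoted list).

    statuses[i] is "up" or "down"; probe(i) returns True when backend i
    answers the health check and False on error or timeout.
    """
    for i in range(len(statuses)):
        if statuses[i] == "down":
            continue  # step 7: a backend already marked down is skipped
        current_try = 0  # assumption: the counter starts fresh per backend
        while not probe(i):  # step 2: the check failed or timed out
            current_try += 1
            if current_try > max_retries:
                statuses[i] = "down"  # step 6: retries exhausted
                break
            sleep(retry_delay)  # step 3: wait, then retry the same backend
```

In this sketch the main loop would call health_check_pass and then sleep
health_check_period unconditionally after every pass (step 9), whether or
not any backend failed.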

> Moreover, if there are many backends and many failures, the last backend
> may be checked again only after a long time, in the worst case after about
> 
> (number_of_backends-1) * health_check_max_retries *
> (health_check_timeout+health_check_retry_delay) + health_check_period

Again, correct. To improve this, we would need to create separate health
check processes, with each process performing the health check for one
PostgreSQL backend concurrently.
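
As a rough illustration of the worst-case formula quoted above, here is a
small Python helper. The function and argument names are made up for this
example and are not part of pgpool.

```python
def worst_case_recheck_interval(
    number_of_backends: int,
    max_retries: int,
    timeout: int,
    retry_delay: int,
    period: int,
) -> int:
    """Worst-case seconds before the last backend is checked again.

    Follows the quoted formula: every other backend exhausts all of its
    retries, each retry costing timeout + retry_delay, and then the loop
    sleeps health_check_period once.
    """
    per_backend = max_retries * (timeout + retry_delay)
    return (number_of_backends - 1) * per_backend + period


# Two backends with the values discussed in this thread.
print(worst_case_recheck_interval(2, 11, 10, 1, 10))
```

With one health check process per backend, as suggested above, the
(number_of_backends - 1) factor would presumably no longer apply.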

> Suppose I decide that it is acceptable for the master to be down for at
> most 120 seconds before failover.
> 
> Since I have only two backends, I should probably set
> 
> health_check_max_retries * (health_check_timeout+health_check_retry_delay)
> + health_check_period
> 
> to about 120s.
> 
> Let's say
> 
>    - health_check_period = 10
>    - health_check_max_retries = 11
>    - health_check_timeout = 10
>    - health_check_retry_delay = 1
> 
> If at time 0 the master goes down and health_check is started, after 11
> tries that take 10+1 seconds each, failover is triggered at time 121.

That's the worst case. Most failed checks will return an error well before
the timeout, so usually failover would be triggered at about time 11 * 1 = 11.
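
The two figures can be reproduced with a short Python sketch. It assumes,
as in the discussion above, that each try costs the time the check takes
plus health_check_retry_delay. The function name and the check_duration
argument are invented for this illustration.

```python
def failover_delay(max_retries: int, retry_delay: int, check_duration: int) -> int:
    """Seconds from the first failed check until failover is triggered.

    check_duration is how long one failing check takes: up to
    health_check_timeout in the worst case, close to 0 when the
    connection is refused immediately.
    """
    return max_retries * (check_duration + retry_delay)


# Values from the example: max_retries = 11, timeout = 10, retry_delay = 1.
worst_case = failover_delay(11, 1, check_duration=10)  # every check times out
typical = failover_delay(11, 1, check_duration=0)      # checks fail at once
print(worst_case, typical)
```

So to ride out a short restart in the typical case, it is mainly
health_check_max_retries * health_check_retry_delay that has to cover the
expected outage.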

> In case all health checks return OK in negligible time, which should
> happen almost always, health_check_period ensures that no checks are done
> for the next 10 seconds.

Right.

> Can you please confirm my findings or correct me?

Thank you for your analysis!

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Best regards,
> 
> Gabriele Monfardini
> 
> -----
> Gabriele Monfardini
> LdP Progetti GIS
> tel: 0577.531049
> email: monfardini at ldpgis.it