[pgpool-general: 4690] Re: Info on health_check parameters
ishii at postgresql.org
Fri May 20 23:30:53 JST 2016
> Hi all,
> I have a setup with two pgpools in HA and two backends in streaming
> replication.
> The problem is that, due to unattended upgrade, master has been restarted
> and master pgpool has correctly started failover.
> We would like to prevent this by tuning the health_check parameters, so
> that pgpool can cope with a short master outage without performing
> failover.
> I've found an old blog post of Tatsuo Ishii,
> http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html, in
> which the following statement is made:
>> Please note that "health_check_max_retries *
>> (health_check_timeout+health_check_retry_delay)" should be smaller than
>> health_check_period.
Yeah, this is not the best advice.
> Looking at the code however it seems to me that things are a little
> different (probably I'm wrong).
> 1. in main loop health check for backends is performed
> (do_health_check), starting from 0 to number_of_backends
> 2. suppose that i-th backend health check fails because of timeout. The
> process is interrupted by the timer.
> 3. if (current_try <= health_check_max_retries) =>
> sleep health_check_retry_delay
> 4. we're back in main loop, the health check restarts from i, the backend
> for which health_check failed
> 5. suppose that health_check fails again and again
> 6. when (current_try > health_check_max_retries) => set backend down
> 7. we're back in main loop, the health check restarts from i, the backend
> for which health_check failed, but now its state is DOWN so we continue to
> the next backend
> 8. in main loop when do_health_check exits, either all backends are down
> or all backends currently not down are healthy
> 9. then we sleep health_check_period in main loop before starting again
> the check from the beginning.
> If I understand it correctly, health_check_period is slept unconditionally
> at the end of the check so it is not needed to set it as high as per the
> formula in the blog.
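The steps above can be sketched as a small model. This is only an
illustration of the loop as described in this message, not pgpool's
actual C code: the function name, the event names and the exact
off-by-one in the retry counter are assumptions.

```python
def health_check_pass(states, is_healthy, max_retries):
    """Model one pass of the main loop (steps 1-9 above).

    states:      dict backend -> "UP" or "DOWN" (updated in place)
    is_healthy:  hypothetical callable(backend) -> bool standing in for
                 the real health check
    Returns the list of events in the order they would happen.
    """
    events = []
    for backend in states:
        if states[backend] == "DOWN":
            continue  # step 7: a backend already marked DOWN is skipped
        failures = 0
        while True:
            events.append(("check", backend))
            if is_healthy(backend):
                break
            failures += 1
            if failures > max_retries:
                states[backend] = "DOWN"  # step 6
                events.append(("set_down", backend))
                break
            events.append(("sleep_retry_delay", backend))  # step 3
    # step 9: health_check_period is slept whatever happened above
    events.append(("sleep_health_check_period", None))
    return events
```

In this model the final sleep is appended on every pass, whether or not
any backend failed, which is the point made about health_check_period.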
> Moreover, if there are many backends and many failures, the last backend
> may be checked again only after a long time, in the worst case after about
> (number_of_backends-1) * health_check_max_retries *
> (health_check_timeout+health_check_retry_delay) + health_check_period
Again, correct. To improve this, we would need to create separate health
check processes, with each process performing the health check for one
backend.
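As a rough aid, the worst-case formula quoted above can be written as a
small function. The function and argument names are made up for this
sketch; only the formula itself comes from the message.

```python
def worst_case_recheck_delay(number_of_backends, max_retries,
                             timeout, retry_delay, period):
    """Seconds before the last backend is checked again, per the quoted
    formula: every other backend uses up all its retries, each retry
    costing a full timeout plus the retry delay, and then
    health_check_period is slept once.
    """
    per_backend = max_retries * (timeout + retry_delay)
    return (number_of_backends - 1) * per_backend + period


# Two backends with the values proposed later in this thread.
print(worst_case_recheck_delay(2, 11, 10, 1, 10))
```

The delay grows linearly with the number of backends, which is why one
health check process per backend would remove the dependency.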
> Suppose I decide that it is acceptable for the master to be down for at
> most 120 seconds before failover.
> Since I have only two backends, I should probably set
> health_check_max_retries * (health_check_timeout+health_check_retry_delay)
> + health_check_period
> to about 120s.
> Let's say
> - health_check_period = 10
> - health_check_max_retries = 11
> - health_check_timeout = 10
> - health_check_retry_delay = 1
> If at time 0 the master goes down and health_check is started, after 11
> tries that take 10+1 seconds each, failover is triggered at time 121
That's the worst case. Most failed checks return an error long before the
timeout, so usually failover would be triggered at about time 11 * 1 = 11.
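The two figures can be reproduced with the arithmetic used in this
thread. This is a sketch only: the helper names are invented here, and
it assumes every try costs the check duration plus
health_check_retry_delay, as in the calculation above.

```python
def time_to_failover(max_retries, check_duration, retry_delay):
    """Seconds from the first failed check until failover is triggered,
    assuming each try costs check_duration + retry_delay."""
    return max_retries * (check_duration + retry_delay)


def max_retries_for_budget(budget, timeout, retry_delay):
    """Largest retry count whose worst case still fits in the budget."""
    return budget // (timeout + retry_delay)


worst = time_to_failover(11, 10, 1)    # every check runs into the timeout
typical = time_to_failover(11, 0, 1)   # every check fails immediately
print(worst, typical)
print(max_retries_for_budget(120, 10, 1))
```

With a strict 120 second limit and these timeout and delay values, the
worst case for 11 retries is one second over, so 10 retries would be the
largest value that stays within the limit.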
> If all health checks return OK in negligible time, which should happen
> almost always, health_check_period ensures that no checks are done
> for the next 10 seconds.
> Can you please confirm my findings or correct me?
Thank you for your analysis!
SRA OSS, Inc. Japan
> Best regards,
> Gabriele Monfardini
> Gabriele Monfardini
> LdP Progetti GIS
> tel: 0577.531049
> email: monfardini at ldpgis.it