[pgpool-general: 4687] Info on health_check parameters

Gabriele Monfardini monfardini at ldpgis.it
Wed May 18 01:17:20 JST 2016


Hi all,

I have a setup with two pgpools in HA and two backends in streaming
replication.
The problem is that, due to an unattended upgrade, the master was restarted
and the master pgpool correctly started a failover.

We would like to prevent this by tuning the health_check parameters, so
that pgpool can cope with a short master outage without performing a
failover.

I've found an old blog post by Tatsuo Ishii,
http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html, in
which the following statement is made:

> Please note that "health_check_max_retries *
> (health_check_timeout+health_check_retry_delay)" should be smaller than
> health_check_period.
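
To make the quoted rule concrete, here is a minimal Python sketch that evaluates the inequality for a given set of values. The function name blog_rule_holds and the first set of numbers are only illustrative; nothing here comes from pgpool itself.

```python
def blog_rule_holds(period, max_retries, timeout, retry_delay):
    """Return True if max_retries * (timeout + retry_delay) < period,
    which is the condition stated in the blog post."""
    return max_retries * (timeout + retry_delay) < period


# Hypothetical values: 2 * (5 + 1) = 12 seconds, smaller than a 30 s period.
print(blog_rule_holds(period=30, max_retries=2, timeout=5, retry_delay=1))

# Values proposed later in this message: 11 * (10 + 1) = 121 seconds,
# which is not smaller than a 10 s period.
print(blog_rule_holds(period=10, max_retries=11, timeout=10, retry_delay=1))
```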


Looking at the code, however, it seems to me that things are a little
different (though I may well be wrong).

   1. in the main loop, the health check for the backends is performed
   (do_health_check), iterating from 0 to number_of_backends
   2. suppose that the health check of the i-th backend fails because of a
   timeout. The process is interrupted by the timer.
   3. if (current_try <= health_check_max_retries) =>
   sleep health_check_retry_delay
   4. we're back in the main loop; the health check restarts from i, the
   backend for which the health check failed
   5. suppose that the health check fails again and again
   6. when (current_try > health_check_max_retries) => set the backend down
   7. we're back in the main loop; the health check restarts from i, the
   backend for which the health check failed, but now its state is DOWN, so
   we continue to the next backend
   8. in the main loop, when do_health_check exits, either all backends are
   down or all backends not currently down are healthy
   9. then we sleep health_check_period in the main loop before starting the
   check again from the beginning.
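
The steps above can be sketched as a small simulation. This is my reading of the flow, not pgpool's actual code: the function simulate_pass and its arguments are made up for illustration, time is a simulated counter instead of real sleeps, and I assume that current_try starts at 1, that every failed try costs health_check_timeout plus health_check_retry_delay, and that health_check_max_retries is at least 1.

```python
def simulate_pass(backend_healthy, max_retries, timeout, retry_delay, period):
    """Simulate one pass of the main loop over all backends.

    backend_healthy[i] is True if backend i answers in negligible time.
    Returns (clock, status, down_at): the simulated time at the end of the
    pass, the per-backend status, and the time at which each failed backend
    was set DOWN.
    """
    clock = 0
    status = ["UP"] * len(backend_healthy)
    down_at = {}
    for i, healthy in enumerate(backend_healthy):        # step 1
        current_try = 1
        while not healthy and current_try <= max_retries:
            clock += timeout                              # step 2: timer fires
            clock += retry_delay                          # step 3: sleep, retry
            current_try += 1                              # steps 4-5
        if not healthy:
            status[i] = "DOWN"                            # step 6
            down_at[i] = clock
        # step 7: a DOWN backend is skipped and we move on to the next one
    clock += period                                       # step 9
    return clock, status, down_at


# Two backends, the first one (the master) not answering at all.
clock, status, down_at = simulate_pass(
    [False, True], max_retries=11, timeout=10, retry_delay=1, period=10)
print(clock, status, down_at)
```

Under these assumptions the first backend is set DOWN at simulated time 121 and the pass, including the final health_check_period sleep, ends at time 131.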


If I understand it correctly, health_check_period is slept unconditionally
at the end of the check, so there is no need to set it as high as the
formula in the blog requires.

Moreover, if there are many backends and many failures, the last backend may
be checked again only after a long time, in the worst case after about

(number_of_backends-1) * health_check_max_retries *
(health_check_timeout+health_check_retry_delay) + health_check_period
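
As a quick way to evaluate this worst-case estimate, here is a minimal sketch. The function name worst_case_recheck_delay is just for illustration, and the four-backend case uses hypothetical numbers.

```python
def worst_case_recheck_delay(number_of_backends, max_retries, timeout,
                             retry_delay, period):
    """Worst-case time (seconds) before the last backend is checked again,
    assuming every other backend exhausts all its retries first."""
    return ((number_of_backends - 1) * max_retries * (timeout + retry_delay)
            + period)


# Two backends: (2 - 1) * 11 * (10 + 1) + 10
print(worst_case_recheck_delay(2, 11, 10, 1, 10))

# Hypothetical four-backend cluster with the same settings:
# (4 - 1) * 11 * (10 + 1) + 10
print(worst_case_recheck_delay(4, 11, 10, 1, 10))
```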

Suppose I decide it is acceptable for the master to be down for at most
120 seconds before failover.

Since I have only two backends, I should probably set

health_check_max_retries * (health_check_timeout+health_check_retry_delay)
+ health_check_period

to about 120s.

Let's say

   - health_check_period = 10
   - health_check_max_retries = 11
   - health_check_timeout = 10
   - health_check_retry_delay = 1

If at time 0 the master goes down and the health check starts, then after
11 tries that take 10+1 seconds each, failover is triggered at time 121.
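
The arithmetic for the values listed above can be checked with a few lines of Python. This is only a sketch of the calculation as I understand it, with each failed try assumed to cost the full timeout plus the retry delay; the variable names mirror the pgpool parameters but the script does not read pgpool.conf.

```python
# Values proposed in this message.
health_check_period = 10
health_check_max_retries = 11
health_check_timeout = 10
health_check_retry_delay = 1

# Time spent retrying before the backend is set down (failover trigger).
retry_phase = health_check_max_retries * (
    health_check_timeout + health_check_retry_delay)

# The same figure plus the unconditional sleep between two check passes.
retry_phase_plus_period = retry_phase + health_check_period

print("failover triggered after", retry_phase, "seconds")
print("including health_check_period:", retry_phase_plus_period, "seconds")
```

So the retry phase alone takes 121 seconds, and the sum that also includes health_check_period comes to 131 seconds.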

In case all health checks return OK in negligible time, which should
happen almost always, health_check_period ensures that no checks are done
for the next 10 seconds.

Can you please confirm my findings or correct me?

Best regards,

Gabriele Monfardini

-----
Gabriele Monfardini
LdP Progetti GIS
tel: 0577.531049
email: monfardini at ldpgis.it