[pgpool-general: 4698] Re: Info on health_check parameters

Gabriele Monfardini monfardini at ldpgis.it
Sat May 21 05:24:51 JST 2016


Hi,

>> Let's say
>>
>>    - health_check_period = 10
>>    - health_check_max_retries = 11
>>    - health_check_timeout = 10
>>    - health_check_retry_delay = 1
>>
>> If at time 0 the master goes down and the health check starts, after 11
>> tries that take 10+1 seconds each, failover is triggered at time 121.

> That's the worst case. Most error checks will return far before the
> timeout, so usually the failover would be triggered around time 11 * 1 = 11.

The problem here is that I need failover not to happen before 120s, but I
also don't want it to happen much later than that. The best option for me
would be failover after exactly 120s.

There are also two timeouts to be considered: connect_timeout (10s) and
health_check_timeout (let's say 10 seconds).

I've made a small test using psql, with two different cases:
  * trying to connect to a node that is up but whose PostgreSQL service
    is down
  * trying to connect to a node that is down

Looking at the strace output, the behaviour is quite different.
In case 1, an error is returned almost instantly by connect() and poll().
In case 2, connect() instead hits a timeout after about 30s.
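
To make the two cases concrete, here is the gist of that test as a small
Python sketch (the hostnames are placeholders, and the 30s timeout simply
mirrors what strace showed for the connect timeout):

    import socket
    import time

    def timed_connect(host, port=5432, timeout=30):
        """Try a TCP connection and report how long a failure takes."""
        start = time.monotonic()
        try:
            with socket.create_connection((host, port), timeout=timeout):
                outcome = "connected"
        except ConnectionRefusedError:
            # case 1: node up, PostgreSQL down -> refused almost instantly
            outcome = "refused"
        except OSError as exc:
            # case 2: node down or unreachable -> we wait out the timeout
            outcome = "failed: %s" % exc
        print("%s: %s after %.2fs" % (host, outcome,
                                      time.monotonic() - start))

    timed_connect("node-with-postgres-stopped")  # fails in milliseconds
    timed_connect("node-that-is-down")           # fails after ~30s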

I think that when PostgreSQL is down for an upgrade but the node itself is
up, pgpool's behaviour should be similar to case 1, so the health checks
probably fail quickly.

On the other hand, if the node is down or there are network issues, pgpool
will probably end up hitting connect_timeout or health_check_timeout.

Since my nodes are located in the same place and the network between them
is reliable, I should probably reduce both timeouts to a few seconds and
wait a little longer between retries.

If I choose:
  * health_check_timeout and connect_timeout = 3
  * health_check_retry_delay = 8
  * health_check_max_retries = 15

I should probably obtain failover not before 15 * 8 = 120s if PostgreSQL is
down (case 1, quick failures) and no later than (8 + 3) * 15 = 165s if the
node is down or unreachable (case 2, connection timeout).
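
As a sanity check on the arithmetic, here are the two bounds worked out in
a few lines of Python (a sketch of the accounting only, not pgpool's actual
scheduling logic; note that pgpool.conf expresses connect_timeout in
milliseconds, so 3 seconds means a value of 3000 there):

    # proposed values
    health_check_timeout = 3       # seconds; connect_timeout set to match
    health_check_retry_delay = 8   # seconds
    health_check_max_retries = 15

    # best case: every failed probe returns immediately (case 1)
    best = health_check_max_retries * health_check_retry_delay
    # worst case: every probe burns its full timeout before failing (case 2)
    worst = health_check_max_retries * (health_check_retry_delay
                                        + health_check_timeout)
    print("failover between ~%ds and ~%ds" % (best, worst))  # ~120s, ~165s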

I think that 3 seconds may be enough for a health check even after heavy
load. Moreover, those checks are allowed to fail occasionally without
triggering failover.

Could you please comment on my parameter choice?

Thank you and best regards,

Gabriele Monfardini

-----
Gabriele Monfardini
LdP Progetti GIS
tel: 0577.531049
email: monfardini at ldpgis.it

On Fri, May 20, 2016 at 4:30 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> > Hi all,
> >
> > I have a setup with two pgpools in HA and two backends in streaming
> > replication.
> > The problem is that, due to an unattended upgrade, the master has been
> > restarted and the master pgpool has correctly started failover.
> >
> > We would like to prevent this, playing with the health_check parameters,
> > in order for pgpool to cope with a short master outage without
> > performing failover.
> >
> > I've found an old blog post by Tatsuo Ishii,
> > http://pgsqlpgpool.blogspot.it/2013/09/health-check-parameters.html, in
> > which the following statement is made:
> >
> >> Please note that "health_check_max_retries *
> >> (health_check_timeout + health_check_retry_delay)" should be smaller
> >> than health_check_period.
>
> Yeah, this is not the best advice.
>
> > Looking at the code, however, it seems to me that things are a little
> > different (though I may be wrong).
> >
> >    1. in the main loop, a health check is performed for each backend
> >    (do_health_check), starting from 0 up to number_of_backends
> >    2. suppose that the i-th backend's health check fails because of a
> >    timeout; the process is interrupted by the timer
> >    3. if (current_try <= health_check_max_retries) =>
> >    sleep health_check_retry_delay
> >    4. we're back in the main loop; the health check restarts from i,
> >    the backend for which the health check failed
> >    5. suppose that the health check fails again and again
> >    6. when (current_try > health_check_max_retries) => set the backend
> >    down
> >    7. we're back in the main loop; the health check restarts from i,
> >    the backend for which the health check failed, but now its state is
> >    DOWN, so we continue to the next backend
> >    8. in the main loop, when do_health_check exits, all backends are
> >    down or all backends currently not down are healthy
> >    9. then we sleep health_check_period in the main loop before
> >    starting the check again from the beginning
> >
> >
> > If I understand it correctly, health_check_period is slept
> > unconditionally at the end of the check, so there is no need to set it
> > as high as the formula in the blog suggests.
>
> Correct.
>
> > Moreover, if there are many backends and many failures, the last
> > backend may be checked again only after a long time, in the worst case
> > after about
> >
> > (number_of_backends - 1) * health_check_max_retries *
> > (health_check_timeout + health_check_retry_delay) + health_check_period
>
> Again, correct. To enhance this, we would need to create separate health
> check processes, each performing the health check for one PostgreSQL
> concurrently.
>
> > Suppose I decide it is acceptable for the master to be down for at
> > most 120 seconds before failover.
> >
> > Since I have only two backends, I should probably set
> >
> > health_check_max_retries *
> > (health_check_timeout + health_check_retry_delay) + health_check_period
> >
> > to about 120s.
> >
> > Let's say
> >
> >    - health_check_period = 10
> >    - health_check_max_retries = 11
> >    - health_check_timeout = 10
> >    - health_check_retry_delay = 1
> >
> > If at time 0 the master goes down and the health check starts, after 11
> > tries that take 10+1 seconds each, failover is triggered at time 121.
>
> That's the worst case. Most error checks will return far before the
> timeout, so usually the failover would be triggered around time 11 * 1 = 11.
>
> > In case all health checks return OK in negligible time, which should
> > happen almost always, health_check_period ensures that no checks are
> > done for the next 10 seconds.
>
> Right.
>
> > Can you please confirm my findings or correct me?
>
> Thank you for your analysis!
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Best regards,
> >
> > Gabriele Monfardini
> >
> > -----
> > Gabriele Monfardini
> > LdP Progetti GIS
> > tel: 0577.531049
> > email: monfardini at ldpgis.it
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general