[Pgpool-hackers] Health check retries (patch)

Sat Nov 19 04:53:25 UTC 2011

Matt,

Thank you! The patch looks pretty good.  Patch committed with a few
modifications.
http://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=55199bdfa7630cf9a5142703ef85ee7695bb4221

1) While retrying, emit a log message (rather than a debug message).
This is more useful for DBAs because it makes clear that pgpool is
trying to recover. Here is a sample message.
2011-11-19 13:23:12 LOG:   pid 10375: health check retry sleep time: 1 second(s)

2) After a successful retry, emit a log message.
2011-11-19 13:23:19 LOG:   pid 10375: after some retrying backend returned to healthy state
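
For readers who want to see the retry behavior and the two log messages
together, here is a minimal sketch in Python. It is not the committed C
code; the function name, the check/log/sleep callables, and the exact
ordering of steps are assumptions made for illustration. Only the two
setting names and the message texts come from this thread.

```python
import time
from typing import Callable


def health_check_with_retries(
    check: Callable[[], bool],          # hypothetical: returns True if backend is healthy
    max_retries: int = 0,               # mirrors health_check_max_retries (0 = no retries)
    retry_delay: int = 1,               # mirrors health_check_retry_delay, in seconds
    log: Callable[[str], None] = print,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the backend passes within 1 + max_retries attempts."""
    retries = 0
    while True:
        if check():
            if retries > 0:
                # Message 2) above: logged only when a retry succeeded.
                log("after some retrying backend returned to healthy state")
            return True
        if retries >= max_retries:
            # Retries exhausted; the caller would start failover here.
            return False
        retries += 1
        # Message 1) above: logged before each sleep between retries.
        log(f"health check retry sleep time: {retry_delay} second(s)")
        sleep(retry_delay)
```

With max_retries left at 0, the first failed check returns False
immediately, which matches the feature being off by default.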

BTW, to make the new feature work better, I think it's best to turn
off fail_over_on_backend_error, because even if the health check
retries, an error while writing to the backend socket causes immediate
failover when fail_over_on_backend_error is set to on.

Also, new_connection() was fixed because it caused immediate failover
when trying to connect to a backend even though
fail_over_on_backend_error was set to off.

Could you provide English documentation for this?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hi everyone.  In August, I wrote to the pgpool-general list (see below) asking if there was any
> way to have pgpool-II retry a failed health check before promoting the slave.
> 
> I'm attaching a patch that adds this functionality.  Would anyone care to review it?  We've been
> using it successfully in production for about 3 months now, and it's working great.
> 
> This is my first time submitting a patch to PostgreSQL or PgPool, so go easy :-).
> 
> Some comments:
> - The purpose of this feature is to allow pgpool-II to handle brief networking interruptions
>    without being "fooled" into thinking that the master node is down and the slave needs to
>    be promoted.
> - This patch adds two new configuration settings.
> - The "health_check_max_retries" setting is the maximum number of times to retry a health
>    check before giving up.
> - The "health_check_retry_delay" is the amount of time (in seconds) to sleep between retries.
> - The feature is turned *off* by default (health_check_max_retries defaults to 0, or no retries).
> 
> Patch is against git HEAD revision (commit 58043c962b8305507de0f450be74c24cbe4c8430).
> 
> Please let me know if you have any questions or comments.
>      
> -- Matt
> 
> Begin forwarded message:
> 
>> From: Matt Solnit <msolnit at soasta.com>
>> Subject: Re: [Pgpool-general] Can pgpool-II retry failed health checks?
>> Date: August 4, 2011 10:57:27 PM PDT
>> To: Guillaume Lelarge <guillaume at lelarge.info>
>> Cc: "pgpool-general at pgfoundry.org" <pgpool-general at pgfoundry.org>
>> 
>> On Aug 4, 2011, at 10:54 PM, Guillaume Lelarge wrote:
>> 
>>> On Fri, 2011-08-05 at 00:17 -0400, Matt Solnit wrote:
>>>> On Jul 29, 2011, at 10:37 PM, Matthew Solnit wrote:
>>>> 
>>>>> Hi everyone.  I'm using pgpool-II 3.0.4 with PostgreSQL 9.0.2, in streaming replication mode.  We've had
>>>>> a couple of cases where pgpool-II got a network timeout while performing a health check on the master
>>>>> node, and then immediately initiated failover and promoted the slave.  This was a problem in our case
>>>>> because the master was actually fine -- there was just a temporary network "hiccup" that caused a timeout.
>>>>> 
>>>>> Is there any way to configure pgpool-II to retry in this case?  I couldn't find one in the documentation.
>>>>> 
>>>>> I did see the "Unplugged Wire" thread (http://pgfoundry.org/pipermail/pgpool-general/2010-March/002589.html),
>>>>> which indicates that there was a single retry at one point, which was removed.  But what I am more interested
>>>>> in is a configurable number of retries, with a configurable delay between retries.
>>>>> 
>>>>> -- Matt
>>>> 
>>>> Hi everyone.  I just wanted to try one more time to get an answer for this :-).  We would really, really
>>>> like to find a solution.
>>>> 
>>> 
>>> That kind of configuration doesn't exist right now, but could be
>>> interesting to add to a future release.
>>> 
>>> 
>>> -- 
>>> Guillaume
>>> http://blog.guillaume.lelarge.info
>>> http://www.dalibo.com
>>> 
>> 
>> Thanks.  That's what I thought, but it's good to have it confirmed.
>> 
>> -- Matt