[Pgpool-hackers] Health check retries (patch)

Fri Nov 18 21:28:44 UTC 2011

Hi everyone.  In August, I wrote to the pgpool-general list (see below) asking if there was any
way to have pgpool-II retry a failed health check before promoting the slave.

I'm attaching a patch that adds this functionality.  Would anyone care to review it?  We've been
using it successfully in production for about 3 months now, and it's working great.

This is my first time submitting a patch to PostgreSQL or pgpool-II, so go easy :-).

Some comments:
- The purpose of this feature is to allow pgpool-II to handle brief networking interruptions
   without being "fooled" into thinking that the master node is down and the slave needs to
   be promoted.
- This patch adds two new configuration settings.
- The "health_check_max_retries" setting is the maximum number of times to retry a health
   check before giving up.
- The "health_check_retry_delay" setting is the amount of time (in seconds) to sleep between retries.
- The feature is turned *off* by default (health_check_max_retries defaults to 0, meaning no retries).
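
To illustrate the intended behavior, here is a minimal Python sketch of the retry logic.  It is not
code from the attached patch.  The function name, the "check" callable, the injectable "sleep"
parameter, and the 1-second default delay are all assumptions made for the example; only the two
setting names and the default of 0 retries come from the description above.

```python
import time
from typing import Callable


def health_check_with_retries(
    check: Callable[[], bool],
    max_retries: int = 0,        # mirrors health_check_max_retries (0 = no retries)
    retry_delay: float = 1.0,    # mirrors health_check_retry_delay; this default is assumed
    sleep: Callable[[float], None] = time.sleep,  # hypothetical hook so tests need not wait
) -> bool:
    """Return True if the node passes a health check, retrying on failure.

    The check is attempted once, then up to `max_retries` more times,
    sleeping `retry_delay` seconds before each retry.  A False result
    means every attempt failed and failover may proceed.
    """
    retries_used = 0
    while True:
        if check():
            return True
        if retries_used >= max_retries:
            return False
        retries_used += 1
        sleep(retry_delay)
```

With max_retries left at 0 the check runs exactly once, which is the same as the behavior without
retries.  With, say, max_retries=3 and retry_delay=5, a network interruption would have to last
through the initial check and all three retries before the node is treated as down.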

Patch is against git HEAD revision (commit 58043c962b8305507de0f450be74c24cbe4c8430).

Please let me know if you have any questions or comments.

-- Matt

Begin forwarded message:

> From: Matt Solnit <msolnit at soasta.com>
> Subject: Re: [Pgpool-general] Can pgpool-II retry failed health checks?
> Date: August 4, 2011 10:57:27 PM PDT
> To: Guillaume Lelarge <guillaume at lelarge.info>
> Cc: "pgpool-general at pgfoundry.org" <pgpool-general at pgfoundry.org>
> 
> On Aug 4, 2011, at 10:54 PM, Guillaume Lelarge wrote:
> 
>> On Fri, 2011-08-05 at 00:17 -0400, Matt Solnit wrote:
>>> On Jul 29, 2011, at 10:37 PM, Matthew Solnit wrote:
>>> 
>>>> Hi everyone.  I'm using pgpool-II 3.0.4 with PostgreSQL 9.0.2, in streaming replication mode.  We've had
>>>> a couple of cases where pgpool-II got a network timeout while performing a health check on the master
>>>> node, and then immediately initiated failover and promoted the slave.  This was a problem in our case
>>>> because the master was actually fine -- there was just a temporary network "hiccup" that caused a timeout.
>>>> 
>>>> Is there any way to configure pgpool-II to retry in this case?  I couldn't find one in the documentation.
>>>> 
>>>> I did see the "Unplugged Wire" thread (http://pgfoundry.org/pipermail/pgpool-general/2010-March/002589.html),
>>>> which indicates that there was a single retry at one point, which was removed.  But what I am more interested
>>>> in is a configurable number of retries, with a configurable delay between retries.
>>>> 
>>>> -- Matt
>>> 
>>> Hi everyone.  I just wanted to try one more time to get an answer for this :-).  We would really, really
>>> like to find a solution.
>>> 
>> 
>> That kind of configuration doesn't exist right now, but could be interesting
>> to add to a future release.
>> 
>> 
>> -- 
>> Guillaume
>> http://blog.guillaume.lelarge.info
>> http://www.dalibo.com
>> 
> 
> Thanks.  That's what I thought, but it's good to have it confirmed.
> 
> -- Matt
-------------- next part --------------
A non-text attachment was scrubbed...
Name: health_check_retries.patch
Type: application/octet-stream
Size: 8300 bytes
Desc: health_check_retries.patch
URL: <http://pgfoundry.org/pipermail/pgpool-hackers/attachments/20111118/3cc2b019/attachment.obj>