[pgpool-general: 184] Re: Healthcheck timeout not always respected

Stevo Slavić sslavic at gmail.com
Fri Jan 20 08:30:26 JST 2012


Using exit_request was wrong and caused a bug. A 4th patch is needed -
health_check_timer_expired is now global, so code outside of main.c can
check whether it has been set to 1.
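
A minimal sketch of that change (assumed shape of the patch, not the literal diff):

    #include <signal.h>   /* sig_atomic_t */

    /* main.c - the definition loses its "static" qualifier so the flag
     * is visible outside this translation unit */
    volatile sig_atomic_t health_check_timer_expired = 0;

    /* pool_connection_pool.c - declared extern and tested in the
     * connect retry loop */
    extern volatile sig_atomic_t health_check_timer_expired;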

Kind regards,
Stevo.

2012/1/19 Stevo Slavić <sslavic at gmail.com>

> Using exit_code was not wise. I tested and encountered a case where it
> results in a bug, so I have to work on it more. The main issue is how the
> connect_inet_domain_socket_by_port function in pool_connection_pool.c can
> know that the health check timer has expired (been set to 1). Any ideas?
>
> Kind regards,
> Stevo.
>
>
> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>
>> Tatsuo,
>>
>> Here are the patches which should be applied to current pgpool head for
>> fixing this issue:
>>
>> Fixes-health-check-timeout.patch
>> Fixes-health-check-retrying-after-failover.patch
>> Fixes-clearing-exitrequest-flag.patch
>>
>> The quirk I noticed in the logs is resolved as well - after failover
>> pgpool would perform a health check and report it as the (max retries +
>> 1)th health check, which was confusing. I've adjusted it so that after
>> failover it performs, and reports, a new health check cycle.
>>
>> I've tested and it works well - in raw mode, with backends set to
>> disallow failover, failover on backend error disabled, and health checks
>> configured with retries (30sec interval, 5sec timeout, 2 retries, 10sec
>> delay between retries); see the configuration sketch below.
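
For reference, a sketch of that setup as pgpool.conf settings (real parameter names from the stock pgpool.conf; with the patches applied, only a failed health check after max retries should trigger failover):

    health_check_period = 30                  # seconds between health check rounds
    health_check_timeout = 5                  # seconds before one attempt is abandoned
    health_check_max_retries = 2              # retries before failover is considered
    health_check_retry_delay = 10             # seconds between retries
    fail_over_on_backend_error = off          # no failover on child write errors
    backend_flag0 = 'DISALLOW_TO_FAILOVER'    # no failover on child-detected errors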
>>
>> Please test, and if confirmed ok, include it in the next release.
>>
>> Kind regards,
>>
>> Stevo.
>>
>>
>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>>
>>> Here are pgpool.log, strace.out, and pgpool.conf from a test with my
>>> latest patch for the health check timeout applied. It works well, except
>>> for a single quirk: after failover completed, the log reported that a 3rd
>>> health check retry was done (even though just 2 are configured, see
>>> pgpool.conf) and that the backend had returned to a healthy state. The
>>> interesting part of the log file follows:
>>>
>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 DEBUG: pid
>>> 1163: retrying 3 th health checking
>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 DEBUG: pid
>>> 1163: health_check: 0 th DB node status: 3
>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 LOG:   pid
>>> 1163: after some retrying backend returned to healthy state
>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15 DEBUG: pid
>>> 1163: starting health checking
>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15 DEBUG: pid
>>> 1163: health_check: 0 th DB node status: 3
>>>
>>>
>>> As can be seen in pgpool.conf, there is only one backend configured.
>>> pgpool failed over correctly after the health check max retries had been
>>> reached (it degraded that single backend to status 3 and restarted the
>>> child processes).
>>>
>>> After this quirk was logged, subsequent health check logs were as
>>> expected. Apart from those couple of weird log entries, everything seems
>>> to be ok. Maybe the quirk is a corner case of the single-backend
>>> configuration. Tomorrow I will check whether it occurs with a
>>> dual-backend configuration.
>>>
>>> Regards,
>>> Stevo.
>>>
>>>
>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>>>
>>>> Hello Tatsuo,
>>>>
>>>> Unfortunately, with your patch, when A is on
>>>> (pool_config->health_check_period > 0) and B is on, failover will be
>>>> disallowed once the retry count is exhausted, because B is on.
>>>>
>>>> Nenad's patch allows failover to be triggered only by the health check.
>>>> Here is the patch which includes Nenad's fix but also fixes the issue of
>>>> the health check timeout not being respected.
>>>>
>>>> The key points in the fix for the health check timeout being respected
>>>> are:
>>>> - in the connect_inet_domain_socket_by_port function in
>>>> pool_connection_pool.c, the file descriptor is set to non-blocking mode
>>>> before the connect attempt, and the non-blocking error codes EINPROGRESS
>>>> and EALREADY are handled (please verify the changes here, especially
>>>> regarding closing the fd)
>>>> - in main.c, health_check_timer_handler has been changed to signal
>>>> exit_request to the health-check-initiated
>>>> connect_inet_domain_socket_by_port call (please verify this; maybe there
>>>> is a better way for connect_inet_domain_socket_by_port to check whether
>>>> health_check_timer_expired has been set to 1)
>>>>
>>>> These changes effectively make the connect attempt non-blocking and
>>>> repeated (see the sketch after this list) until:
>>>> - connection is made, or
>>>> - unhandled connection error condition is reached, or
>>>> - health check timer alarm has been raised, or
>>>> - some other exit request (shutdown) has been issued.
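
A minimal sketch of that loop, assuming the two flags named above are set from the SIGALRM handler and the shutdown path (illustrative only; the real connect_inet_domain_socket_by_port also handles address setup and would not spin without a pause):

    #include <sys/socket.h>
    #include <signal.h>
    #include <fcntl.h>
    #include <errno.h>
    #include <unistd.h>

    extern volatile sig_atomic_t health_check_timer_expired; /* set by SIGALRM handler */
    extern volatile sig_atomic_t exit_request;               /* set on shutdown request */

    static int connect_nonblocking(int fd, struct sockaddr *addr, socklen_t len)
    {
        /* switch the socket to non-blocking mode before connecting */
        int flags = fcntl(fd, F_GETFL, 0);
        if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
            return -1;

        for (;;)
        {
            if (exit_request || health_check_timer_expired)
            {
                close(fd);        /* timer expired or shutdown: give up */
                return -1;
            }
            if (connect(fd, addr, len) == 0 || errno == EISCONN)
                return 0;         /* connection established */
            if (errno == EINPROGRESS || errno == EALREADY || errno == EINTR)
                continue;         /* attempt still in flight; re-check the flags */
            close(fd);            /* unhandled connection error */
            return -1;
        }
    }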
>>>>
>>>>
>>>> Kind regards,
>>>> Stevo.
>>>>
>>>> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>>>>
>>>>> Ok, let me clarify use cases regarding failover.
>>>>>
>>>>> Currently there are three parameters:
>>>>> a) health_check
>>>>> b) DISALLOW_TO_FAILOVER
>>>>> c) fail_over_on_backend_error
>>>>>
>>>>> The sources of errors which can trigger failover are: 1) health check,
>>>>> 2) write to backend socket, 3) read from backend socket. I represent
>>>>> 1) as A, 2) as B, and 3) as C.
>>>>>
>>>>> 1) trigger failover if A or B or C is error
>>>>> a = on, b = off, c = on
>>>>>
>>>>> 2) trigger failover only when B or C is error
>>>>> a = off, b = off, c = on
>>>>>
>>>>> 3) trigger failover only when B is error
>>>>> Impossible. Because C error always triggers failover.
>>>>>
>>>>> 4) trigger failover only when C is error
>>>>> a = off, b = off, c = off
>>>>>
>>>>> 5) trigger failover only when A is error(Stevo wants this)
>>>>> Impossible. Because C error always triggers failover.
>>>>>
>>>>> 6) never trigger failover
>>>>> Impossible. Because C error always triggers failover.
>>>>>
>>>>> As you can see, C is the problem here (look at #3, #5 and #6)
>>>>>
>>>>> If we implemented this:
>>>>> >> However I think we should disable failover if DISALLOW_TO_FAILOVER set
>>>>> >> in case of reading data from backend. This should have been done when
>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is exactly what
>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you think?
>>>>>
>>>>> 1) trigger failover if A or B or C is error
>>>>> a = on, b = off, c = on
>>>>>
>>>>> 2) trigger failover only when B or C is error
>>>>> a = off, b = off, c = on
>>>>>
>>>>> 3) trigger failover only when B is error
>>>>> a = off, b = on, c = on
>>>>>
>>>>> 4) trigger failover only when C is error
>>>>> a = off, b = off, c = off
>>>>>
>>>>> 5) trigger failover only when A is error(Stevo wants this)
>>>>> a = on, b = on, c = off
>>>>>
>>>>> 6) never trigger failover
>>>>> a = off, b = on, c = off
>>>>>
>>>>> So it seems my patch will solve all the problems including yours.
>>>>> (timeout while retrying is another issue of course).
>>>>> --
>>>>> Tatsuo Ishii
>>>>> SRA OSS, Inc. Japan
>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>> Japanese: http://www.sraoss.co.jp
>>>>>
>>>>> > I agree, fail_over_on_backend_error isn't useful; it just adds
>>>>> > confusion by overlapping with DISALLOW_TO_FAILOVER.
>>>>> >
>>>>> > With your patch or without it, it is not possible to fail over only on
>>>>> > health check (max retries) failure. With Nenad's patch, that part works
>>>>> > ok, and I think that patch is semantically ok - failover occurs even
>>>>> > though DISALLOW_TO_FAILOVER is set for the backend, but only when health
>>>>> > check is configured too. Configuring health check without failover on
>>>>> > failed health check has no purpose. Likewise, health check configured
>>>>> > with failover allowed on any condition other than health check (max
>>>>> > retries) failure has no purpose.
>>>>> >
>>>>> > Kind regards,
>>>>> > Stevo.
>>>>> >
>>>>> > 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>>>>> >
>>>>> >> fail_over_on_backend_error has different meaning from
>>>>> >> DISALLOW_TO_FAILOVER. From the doc:
>>>>> >>
>>>>> >>  If true, and an error occurs when writing to the backend
>>>>> >>  communication, pgpool-II will trigger the fail over procedure .
>>>>> This
>>>>> >>  is the same behavior as of pgpool-II 2.2.x or earlier. If set to
>>>>> >>  false, pgpool will report an error and disconnect the session.
>>>>> >>
>>>>> >> This means that if pgpool fails to read from the backend, it will
>>>>> >> trigger failover even with fail_over_on_backend_error set to off. So
>>>>> >> unconditionally disabling failover would introduce a backward
>>>>> >> incompatibility.
>>>>> >>
>>>>> >> However I think we should disable failover if DISALLOW_TO_FAILOVER
>>>>> set
>>>>> >> in case of reading data from backend. This should have been done
>>>>> when
>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is exactly what
>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you think?
>>>>> >> --
>>>>> >> Tatsuo Ishii
>>>>> >> SRA OSS, Inc. Japan
>>>>> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> Japanese: http://www.sraoss.co.jp
>>>>> >>
>>>>> >> > For a moment I thought we could have set
>>>>> fail_over_on_backend_error to
>>>>> >> off,
>>>>> >> > and have backends set with ALLOW_TO_FAILOVER flag. But then I
>>>>> looked in
>>>>> >> > code.
>>>>> >> >
>>>>> >> > In child.c there is a loop child process goes through in its
>>>>> lifetime.
>>>>> >> When
>>>>> >> > fatal error condition occurs before child process exits it will
>>>>> call
>>>>> >> > notice_backend_error which will call degenerate_backend_set which
>>>>> will
>>>>> >> not
>>>>> >> > take into account fail_over_on_backend_error is set to off,
>>>>> causing
>>>>> >> backend
>>>>> >> > to be degenerated and failover to occur. That's why we have
>>>>> backends set
>>>>> >> > with DISALLOW_TO_FAILOVER but with our patch applied, health
>>>>> check could
>>>>> >> > cause failover to occur as expected.
>>>>> >> >
>>>>> >> > Maybe it would be enough just to modify degenerate_backend_set to
>>>>> >> > take fail_over_on_backend_error into account, just like it already
>>>>> >> > takes DISALLOW_TO_FAILOVER into account (see the sketch below).
>>>>> >> >
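
A sketch of that suggested guard (hypothetical; the simplified stand-in types below replace pgpool's real pool.h structures):

    #include <stdbool.h>

    /* Simplified stand-ins for pgpool's real configuration structures */
    typedef struct { bool disallow_to_failover; } BackendInfo;
    typedef struct { bool fail_over_on_backend_error; } PoolConfig;

    extern PoolConfig  *pool_config;
    extern BackendInfo *backend_info(int node_id);
    extern void         notice_node_down(int node_id); /* stand-in for the failover path */

    /* degenerate_backend_set() already skips nodes flagged DISALLOW_TO_FAILOVER;
     * the suggestion is to skip likewise when fail_over_on_backend_error is off */
    void degenerate_backend_set(int *node_id_set, int count)
    {
        for (int i = 0; i < count; i++)
        {
            if (backend_info(node_id_set[i])->disallow_to_failover)
                continue;   /* existing check */
            if (!pool_config->fail_over_on_backend_error)
                continue;   /* proposed additional check */
            notice_node_down(node_id_set[i]);
        }
    }
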
>>>>> >> > Kind regards,
>>>>> >> > Stevo.
>>>>> >> >
>>>>> >> > 2012/1/15 Stevo Slavić <sslavic at gmail.com>
>>>>> >> >
>>>>> >> >> Yes, and that behaviour which you describe as expected is not what
>>>>> >> >> we want. We want pgpool to degrade backend0 and fail over when the
>>>>> >> >> configured max health check retries have failed, and to fail over
>>>>> >> >> only in that case - not sooner (e.g. on a connection/child error
>>>>> >> >> condition), but as soon as max health check retries have been
>>>>> >> >> attempted.
>>>>> >> >>
>>>>> >> >> Maybe examples will be more clear.
>>>>> >> >>
>>>>> >> >> Imagine two nodes (node 1 and node 2). On each node a single
>>>>> pgpool and
>>>>> >> a
>>>>> >> >> single backend. Apps/clients access db through pgpool on their
>>>>> own node.
>>>>> >> >> Two backends are configured in postgres native streaming
>>>>> replication.
>>>>> >> >> pgpools are used in raw mode. Both pgpools have same backend as
>>>>> >> backend0,
>>>>> >> >> and same backend as backend1.
>>>>> >> >> initial state: both backends are up and pgpool can access them,
>>>>> clients
>>>>> >> >> connect to their pgpool and do their work on master backend,
>>>>> backend0.
>>>>> >> >>
>>>>> >> >> 1st case: unmodified/non-patched pgpool 3.1.1 is used, backends
>>>>> are
>>>>> >> >> configured with ALLOW_TO_FAILOVER flag
>>>>> >> >> - temporary network outage happens between pgpool on node 2 and
>>>>> backend0
>>>>> >> >> - error condition is reported by child process, and since
>>>>> >> >> ALLOW_TO_FAILOVER is set, pgpool performs failover without giving
>>>>> >> chance to
>>>>> >> >> pgpool health check retries to control whether backend is just
>>>>> >> temporarily
>>>>> >> >> inaccessible
>>>>> >> >> - failover command on node 2 promotes standby backend to a new
>>>>> master -
>>>>> >> >> split brain occurs, with two masters
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> 2nd case: unmodified/non-patched pgpool 3.1.1 is used, backends
>>>>> are
>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>>>>> >> >> - temporary network outage happens between pgpool on node 2 and
>>>>> backend0
>>>>> >> >> - error condition is reported by child process, and since
>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>>>>> >> >> - health check gets a chance to check backend0 condition,
>>>>> determines
>>>>> >> that
>>>>> >> >> it's not accessible, there will be no health check retries
>>>>> because
>>>>> >> >> DISALLOW_TO_FAILOVER is set, no failover occurs ever
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> 3rd case, pgpool 3.1.1 + patch you've sent applied, and backends
>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>>>>> >> >> - temporary network outage happens between pgpool on node 2 and
>>>>> backend0
>>>>> >> >> - error condition is reported by child process, and since
>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>>>>> >> >> - health check gets a chance to check backend0 condition,
>>>>> determines
>>>>> >> that
>>>>> >> >> it's not accessible, health check retries happen, and even after
>>>>> max
>>>>> >> >> retries, no failover happens since failover is disallowed
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> 4th expected behaviour, pgpool 3.1.1 + patch we sent, and
>>>>> backends
>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>>>>> >> >> - temporary network outage happens between pgpool on node 2 and
>>>>> backend0
>>>>> >> >> - error condition is reported by child process, and since
>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>>>>> >> >> - health check gets a chance to check backend0 condition,
>>>>> determines
>>>>> >> that
>>>>> >> >> it's not accessible, health check retries happen, before a max
>>>>> retry
>>>>> >> >> network condition is cleared, retry happens, and backend0
>>>>> remains to be
>>>>> >> >> master, no failover occurs, temporary network issue did not
>>>>> cause split
>>>>> >> >> brain
>>>>> >> >> - after some time, temporary network outage happens again
>>>>> between pgpool
>>>>> >> >> on node 2 and backend0
>>>>> >> >> - error condition is reported by child process, and since
>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>>>>> >> >> - health check gets a chance to check backend0 condition,
>>>>> determines
>>>>> >> that
>>>>> >> >> it's not accessible, health check retries happen, after max
>>>>> retries
>>>>> >> >> backend0 is still not accessible, failover happens, standby is
>>>>> new
>>>>> >> master
>>>>> >> >> and backend0 is degraded
>>>>> >> >>
>>>>> >> >> Kind regards,
>>>>> >> >> Stevo.
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>>>>> >> >>
>>>>> >> >>> In my test environment, the patch works as expected. I have two
>>>>> >> >>> backends. Health check retry conf is as follows:
>>>>> >> >>>
>>>>> >> >>> health_check_max_retries = 3
>>>>> >> >>> health_check_retry_delay = 1
>>>>> >> >>>
>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: Backend status file
>>>>> >> >>> /home/t-ishii/work/git.postgresql.org/test/log/pgpool_status discarded
>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: pgpool-II successfully
>>>>> started.
>>>>> >> >>> version 3.2alpha1 (hatsuiboshi)
>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: find_primary_node:
>>>>> primary node
>>>>> >> id
>>>>> >> >>> is 0
>>>>> >> >>> -- backend1 was shutdown
>>>>> >> >>>
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>>>>> check_replication_time_lag: could
>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>>>>> sr_check_password
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> -- health check failed
>>>>> >> >>>
>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411: health check failed. 1 th
>>>>> host
>>>>> >> /tmp
>>>>> >> >>> at port 11001 is down
>>>>> >> >>> -- start retrying
>>>>> >> >>> 2012-01-15 09:17:50 LOG:   pid 21411: health check retry sleep
>>>>> time: 1
>>>>> >> >>> second(s)
>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411: health check failed. 1 th
>>>>> host
>>>>> >> /tmp
>>>>> >> >>> at port 11001 is down
>>>>> >> >>> 2012-01-15 09:17:51 LOG:   pid 21411: health check retry sleep
>>>>> time: 1
>>>>> >> >>> second(s)
>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411: health check failed. 1 th
>>>>> host
>>>>> >> /tmp
>>>>> >> >>> at port 11001 is down
>>>>> >> >>> 2012-01-15 09:17:52 LOG:   pid 21411: health check retry sleep
>>>>> time: 1
>>>>> >> >>> second(s)
>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411: health check failed. 1 th
>>>>> host
>>>>> >> /tmp
>>>>> >> >>> at port 11001 is down
>>>>> >> >>> 2012-01-15 09:17:53 LOG:   pid 21411: health_check: 1 failover
>>>>> is
>>>>> >> canceld
>>>>> >> >>> because failover is disallowed
>>>>> >> >>> -- after 3 retries, pgpool wanted to failover, but gave up
>>>>> because
>>>>> >> >>> DISALLOW_TO_FAILOVER is set for backend1
>>>>> >> >>>
>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>>>>> check_replication_time_lag: could
>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>>>>> sr_check_password
>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411: health check failed. 1 th
>>>>> host
>>>>> >> /tmp
>>>>> >> >>> at port 11001 is down
>>>>> >> >>> 2012-01-15 09:18:03 LOG:   pid 21411: health check retry sleep
>>>>> time: 1
>>>>> >> >>> second(s)
>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>>>>> >> connect_unix_domain_socket_by_port:
>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>>>>> directory
>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>>>>> make_persistent_db_connection:
>>>>> >> >>> connection to /tmp(11001) failed
>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411: health check failed. 1 th
>>>>> host
>>>>> >> /tmp
>>>>> >> >>> at port 11001 is down
>>>>> >> >>> 2012-01-15 09:18:04 LOG:   pid 21411: health check retry sleep
>>>>> time: 1
>>>>> >> >>> second(s)
>>>>> >> >>> 2012-01-15 09:18:05 LOG:   pid 21411: after some retrying
>>>>> backend
>>>>> >> >>> returned to healthy state
>>>>> >> >>> -- started backend1 and pgpool succeeded in health checking.
>>>>> Resumed
>>>>> >> >>> using backend1
>>>>> >> >>> --
>>>>> >> >>> Tatsuo Ishii
>>>>> >> >>> SRA OSS, Inc. Japan
>>>>> >> >>> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> Japanese: http://www.sraoss.co.jp
>>>>> >> >>>
>>>>> >> >>> > Hello Tatsuo,
>>>>> >> >>> >
>>>>> >> >>> > Thank you for the patch and effort, but unfortunately this
>>>>> change
>>>>> >> won't
>>>>> >> >>> > work for us. We need to set disallow failover to prevent
>>>>> failover on
>>>>> >> >>> child
>>>>> >> >>> > reported connection errors (it's ok if few clients lose their
>>>>> >> >>> connection or
>>>>> >> >>> > can not connect), and still have pgpool perform failover but
>>>>> only on
>>>>> >> >>> failed
>>>>> >> >>> > health check (if configured, after max retries threshold has
>>>>> been
>>>>> >> >>> reached).
>>>>> >> >>> >
>>>>> >> >>> > Maybe it would be best to add an extra value for backend_flag
>>>>> -
>>>>> >> >>> > ALLOW_TO_FAILOVER_ON_HEALTH_CHECK or
>>>>> >> >>> DISALLOW_TO_FAILOVER_ON_CHILD_ERROR.
>>>>> >> >>> > It should behave same as DISALLOW_TO_FAILOVER is set, with
>>>>> only
>>>>> >> >>> difference
>>>>> >> >>> > in behaviour when health check (if set, max retries) has
>>>>> failed -
>>>>> >> unlike
>>>>> >> >>> > DISALLOW_TO_FAILOVER, this new flag should allow failover in
>>>>> this
>>>>> >> case
>>>>> >> >>> only.
>>>>> >> >>> >
>>>>> >> >>> > Without this change health check (especially health check
>>>>> retries)
>>>>> >> >>> doesn't
>>>>> >> >>> > make much sense - child error is more likely to occur on
>>>>> (temporary)
>>>>> >> >>> > backend failure then health check and will or will not cause
>>>>> >> failover to
>>>>> >> >>> > occur depending on backend flag, without giving health check
>>>>> retries
>>>>> >> a
>>>>> >> >>> > chance to determine if failure was temporary or not, risking
>>>>> split
>>>>> >> brain
>>>>> >> >>> > situation with two masters just because of temporary network
>>>>> link
>>>>> >> >>> hiccup.
>>>>> >> >>> >
>>>>> >> >>> > Our main problem remains though with the health check timeout
>>>>> not
>>>>> >> being
>>>>> >> >>> > respected in these special conditions we have. Maybe Nenad
>>>>> can help
>>>>> >> you
>>>>> >> >>> > more to reproduce the issue on your environment.
>>>>> >> >>> >
>>>>> >> >>> > Kind regards,
>>>>> >> >>> > Stevo.
>>>>> >> >>> >
>>>>> >> >>> > 2012/1/13 Tatsuo Ishii <ishii at postgresql.org>
>>>>> >> >>> >
>>>>> >> >>> >> Thanks for pointing it out.
>>>>> >> >>> >> Yes, checking DISALLOW_TO_FAILOVER before retrying is wrong.
>>>>> >> >>> >> However, after retry count over, we should check
>>>>> >> DISALLOW_TO_FAILOVER I
>>>>> >> >>> >> think.
>>>>> >> >>> >> Attached is the patch attempt to fix it. Please try.
>>>>> >> >>> >> --
>>>>> >> >>> >> Tatsuo Ishii
>>>>> >> >>> >> SRA OSS, Inc. Japan
>>>>> >> >>> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >>
>>>>> >> >>> >> > pgpool is being used in raw mode - just for (health check
>>>>> based)
>>>>> >> >>> failover
>>>>> >> >>> >> > part, so applications are not required to restart when
>>>>> standby
>>>>> >> gets
>>>>> >> >>> >> > promoted to new master. Here is pgpool.conf file and a
>>>>> very small
>>>>> >> >>> patch
>>>>> >> >>> >> > we're using applied to pgpool 3.1.1 release.
>>>>> >> >>> >> >
>>>>> >> >>> >> > We have to have DISALLOW_TO_FAILOVER set for the backend
>>>>> since any
>>>>> >> >>> child
>>>>> >> >>> >> > process that detects condition that master/backend0 is not
>>>>> >> >>> available, if
>>>>> >> >>> >> > DISALLOW_TO_FAILOVER was not set, will degenerate backend
>>>>> without
>>>>> >> >>> giving
>>>>> >> >>> >> > health check a chance to retry. We need health check with
>>>>> retries
>>>>> >> >>> because
>>>>> >> >>> >> > condition that backend0 is not available could be temporary
>>>>> >> (network
>>>>> >> >>> >> > glitches to the remote site where master is, or deliberate
>>>>> >> failover
>>>>> >> >>> of
>>>>> >> >>> >> > master postgres service from one node to the other on
>>>>> remote site
>>>>> >> -
>>>>> >> >>> in
>>>>> >> >>> >> both
>>>>> >> >>> >> > cases remote means remote to the pgpool that is going to
>>>>> perform
>>>>> >> >>> health
>>>>> >> >>> >> > checks and ultimately the failover) and we don't want
>>>>> standby to
>>>>> >> be
>>>>> >> >>> >> > promoted as easily to a new master, to prevent temporary
>>>>> network
>>>>> >> >>> >> conditions
>>>>> >> >>> >> > which could occur frequently to frequently cause split
>>>>> brain with
>>>>> >> two
>>>>> >> >>> >> > masters.
>>>>> >> >>> >> >
>>>>> >> >>> >> > But then, with DISALLOW_TO_FAILOVER set, without the patch
>>>>> health
>>>>> >> >>> check
>>>>> >> >>> >> > will not retry and will thus give only one chance to
>>>>> backend (if
>>>>> >> >>> health
>>>>> >> >>> >> > check ever occurs before child process failure to connect
>>>>> to the
>>>>> >> >>> >> backend),
>>>>> >> >>> >> > rendering retry settings effectively to be ignored. That's
>>>>> where
>>>>> >> this
>>>>> >> >>> >> patch
>>>>> >> >>> >> > comes into action - enables health check retries while
>>>>> child
>>>>> >> >>> processes
>>>>> >> >>> >> are
>>>>> >> >>> >> > prevented to degenerate backend.
>>>>> >> >>> >> >
>>>>> >> >>> >> > I don't think, but I could be wrong, that this patch
>>>>> influences
>>>>> >> the
>>>>> >> >>> >> > behavior we're seeing with unwanted health check attempt
>>>>> delays.
>>>>> >> >>> Also,
>>>>> >> >>> >> > knowing this, maybe pgpool could be patched or some other
>>>>> support
>>>>> >> be
>>>>> >> >>> >> built
>>>>> >> >>> >> > into it to cover this use case.
>>>>> >> >>> >> >
>>>>> >> >>> >> > Regards,
>>>>> >> >>> >> > Stevo.
>>>>> >> >>> >> >
>>>>> >> >>> >> >
>>>>> >> >>> >> > 2012/1/12 Tatsuo Ishii <ishii at postgresql.org>
>>>>> >> >>> >> >
>>>>> >> >>> >> >> I have accepted the moderation request. Your post should
>>>>> be sent
>>>>> >> >>> >> shortly.
>>>>> >> >>> >> >> Also I have raised the post size limit to 1MB.
>>>>> >> >>> >> >> I will look into this...
>>>>> >> >>> >> >> --
>>>>> >> >>> >> >> Tatsuo Ishii
>>>>> >> >>> >> >> SRA OSS, Inc. Japan
>>>>> >> >>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >> >>
>>>>> >> >>> >> >> > Here is the log file and strace output file (this time
>>>>> in an
>>>>> >> >>> archive,
>>>>> >> >>> >> >> > didn't know about 200KB constraint on post size which
>>>>> requires
>>>>> >> >>> >> moderator
>>>>> >> >>> >> >> > approval). Timings configured are 30sec health check
>>>>> interval,
>>>>> >> >>> 5sec
>>>>> >> >>> >> >> > timeout, and 2 retries with 10sec retry delay.
>>>>> >> >>> >> >> >
>>>>> >> >>> >> >> > It takes a lot more than 5sec from started health check
>>>>> to
>>>>> >> >>> sleeping
>>>>> >> >>> >> 10sec
>>>>> >> >>> >> >> > for first retry.
>>>>> >> >>> >> >> >
>>>>> >> >>> >> >> > Seen in code (main.c, health_check() function), within
>>>>> (retry)
>>>>> >> >>> attempt
>>>>> >> >>> >> >> > there is inner retry (first with postgres database then
>>>>> with
>>>>> >> >>> >> template1)
>>>>> >> >>> >> >> and
>>>>> >> >>> >> >> > that part doesn't seem to be interrupted by alarm.
>>>>> >> >>> >> >> >
>>>>> >> >>> >> >> > Regards,
>>>>> >> >>> >> >> > Stevo.
>>>>> >> >>> >> >> >
>>>>> >> >>> >> >> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
>>>>> >> >>> >> >> >
>>>>> >> >>> >> >> >> Here is the log file and strace output file. Timings
>>>>> >> configured
>>>>> >> >>> are
>>>>> >> >>> >> >> 30sec
>>>>> >> >>> >> >> >> health check interval, 5sec timeout, and 2 retries
>>>>> with 10sec
>>>>> >> >>> retry
>>>>> >> >>> >> >> delay.
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >> It takes a lot more than 5sec from started health
>>>>> check to
>>>>> >> >>> sleeping
>>>>> >> >>> >> >> 10sec
>>>>> >> >>> >> >> >> for first retry.
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >> Seen in code (main.c, health_check() function), within
>>>>> (retry)
>>>>> >> >>> >> attempt
>>>>> >> >>> >> >> >> there is inner retry (first with postgres database
>>>>> then with
>>>>> >> >>> >> template1)
>>>>> >> >>> >> >> and
>>>>> >> >>> >> >> >> that part doesn't seem to be interrupted by alarm.
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >> Regards,
>>>>> >> >>> >> >> >> Stevo.
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >>> Ok, I will do it. In the mean time you could use
>>>>> "strace -tt
>>>>> >> -p
>>>>> >> >>> PID"
>>>>> >> >>> >> >> >>> to see which system call is blocked.
>>>>> >> >>> >> >> >>> --
>>>>> >> >>> >> >> >>> Tatsuo Ishii
>>>>> >> >>> >> >> >>> SRA OSS, Inc. Japan
>>>>> >> >>> >> >> >>> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> >> >>> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >> >> >>>
>>>>> >> >>> >> >> >>> > OK, got the info - key point is that ip forwarding
>>>>> is
>>>>> >> >>> disabled for
>>>>> >> >>> >> >> >>> security
>>>>> >> >>> >> >> >>> > reasons. Rules in iptables are not important,
>>>>> iptables can
>>>>> >> be
>>>>> >> >>> >> >> stopped,
>>>>> >> >>> >> >> >>> or
>>>>> >> >>> >> >> >>> > previously added rules removed.
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> > Here are the steps to reproduce (kudos to my
>>>>> colleague
>>>>> >> Nenad
>>>>> >> >>> >> >> Bulatovic
>>>>> >> >>> >> >> >>> for
>>>>> >> >>> >> >> >>> > providing this):
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> > 1.) make sure that ip forwarding is off:
>>>>> >> >>> >> >> >>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
>>>>> >> >>> >> >> >>> > 2.) create IP alias on some interface (and have
>>>>> postgres
>>>>> >> >>> listen on
>>>>> >> >>> >> >> it):
>>>>> >> >>> >> >> >>> >     ip addr add x.x.x.x/yy dev ethz
>>>>> >> >>> >> >> >>> > 3.) set backend_hostname0 to aforementioned IP
>>>>> >> >>> >> >> >>> > 4.) start pgpool and monitor health checks
>>>>> >> >>> >> >> >>> > 5.) remove IP alias:
>>>>> >> >>> >> >> >>> >     ip addr del x.x.x.x/yy dev ethz
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> > Here is the interesting part in pgpool log after
>>>>> this:
>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting
>>>>> health
>>>>> >> checking
>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check:
>>>>> 0 th DB
>>>>> >> >>> node
>>>>> >> >>> >> >> >>> status: 2
>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check:
>>>>> 1 th DB
>>>>> >> >>> node
>>>>> >> >>> >> >> >>> status: 1
>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting
>>>>> health
>>>>> >> checking
>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check:
>>>>> 0 th DB
>>>>> >> >>> node
>>>>> >> >>> >> >> >>> status: 2
>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check:
>>>>> 0 th DB
>>>>> >> >>> node
>>>>> >> >>> >> >> >>> status: 2
>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 ERROR: pid 24358: health check
>>>>> failed.
>>>>> >> 0
>>>>> >> >>> th
>>>>> >> >>> >> host
>>>>> >> >>> >> >> >>> > 192.168.2.27 at port 5432 is down
>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 LOG:   pid 24358: health check
>>>>> retry
>>>>> >> sleep
>>>>> >> >>> >> time:
>>>>> >> >>> >> >> 10
>>>>> >> >>> >> >> >>> > second(s)
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> > That pgpool was configured with health check
>>>>> interval of
>>>>> >> >>> 30sec,
>>>>> >> >>> >> 5sec
>>>>> >> >>> >> >> >>> > timeout, and 10sec retry delay with 2 max retries.
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> > Making use of libpq instead for connecting to db in
>>>>> health
>>>>> >> >>> checks
>>>>> >> >>> >> IMO
>>>>> >> >>> >> >> >>> > should resolve it, but you'll best determine which
>>>>> call
>>>>> >> >>> exactly
>>>>> >> >>> >> gets
>>>>> >> >>> >> >> >>> > blocked waiting. Btw, psql with PGCONNECT_TIMEOUT
>>>>> env var
>>>>> >> >>> >> configured
>>>>> >> >>> >> >> >>> > respects that env var timeout.
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> > Regards,
>>>>> >> >>> >> >> >>> > Stevo.
>>>>> >> >>> >> >> >>> >
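
For illustration, a health check connection via libpq with a connection timeout might look like the sketch below (assumed usage of libpq's standard connect_timeout keyword; this is not pgpool's actual code):

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Returns 0 if the backend accepted a connection within 5 seconds. */
    int check_backend(const char *host, const char *port)
    {
        const char *const keys[]   = {"host", "port", "dbname", "connect_timeout", NULL};
        const char *const values[] = {host,   port,   "postgres", "5",             NULL};

        PGconn *conn = PQconnectdbParams(keys, values, 0);
        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "health check failed: %s", PQerrorMessage(conn));
            PQfinish(conn);
            return -1;
        }
        PQfinish(conn);
        return 0;
    }

The same limit can be tried from the shell, e.g. PGCONNECT_TIMEOUT=5 psql -h <host> postgres.
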
>>>>> >> >>> >> >> >>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <
>>>>> >> >>> sslavic at gmail.com
>>>>> >> >>> >> >
>>>>> >> >>> >> >> >>> wrote:
>>>>> >> >>> >> >> >>> >
>>>>> >> >>> >> >> >>> >> Tatsuo,
>>>>> >> >>> >> >> >>> >>
>>>>> >> >>> >> >> >>> >> Did you restart iptables after adding rule?
>>>>> >> >>> >> >> >>> >>
>>>>> >> >>> >> >> >>> >> Regards,
>>>>> >> >>> >> >> >>> >> Stevo.
>>>>> >> >>> >> >> >>> >>
>>>>> >> >>> >> >> >>> >>
>>>>> >> >>> >> >> >>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <
>>>>> >> >>> >> sslavic at gmail.com>
>>>>> >> >>> >> >> >>> wrote:
>>>>> >> >>> >> >> >>> >>
>>>>> >> >>> >> >> >>> >>> Looking into this to verify if these are all
>>>>> necessary
>>>>> >> >>> changes
>>>>> >> >>> >> to
>>>>> >> >>> >> >> have
>>>>> >> >>> >> >> >>> >>> port unreachable message silently rejected
>>>>> (suspecting
>>>>> >> some
>>>>> >> >>> >> kernel
>>>>> >> >>> >> >> >>> >>> parameter tuning is needed).
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>> Just to clarify it's not a problem that host is
>>>>> being
>>>>> >> >>> detected
>>>>> >> >>> >> by
>>>>> >> >>> >> >> >>> pgpool
>>>>> >> >>> >> >> >>> >>> to be down, but the timing when that happens. On
>>>>> >> environment
>>>>> >> >>> >> where
>>>>> >> >>> >> >> >>> issue is
>>>>> >> >>> >> >> >>> >>> reproduced pgpool as part of health check attempt
>>>>> tries
>>>>> >> to
>>>>> >> >>> >> connect
>>>>> >> >>> >> >> to
>>>>> >> >>> >> >> >>> >>> backend and hangs for tcp timeout instead of being
>>>>> >> >>> interrupted
>>>>> >> >>> >> by
>>>>> >> >>> >> >> >>> timeout
>>>>> >> >>> >> >> >>> >>> alarm. Can you verify/confirm please the health
>>>>> check
>>>>> >> retry
>>>>> >> >>> >> timings
>>>>> >> >>> >> >> >>> are not
>>>>> >> >>> >> >> >>> >>> delayed?
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>> Regards,
>>>>> >> >>> >> >> >>> >>> Stevo.
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <
>>>>> >> >>> >> >> ishii at postgresql.org
>>>>> >> >>> >> >> >>> >wrote:
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>>> Ok, I did:
>>>>> >> >>> >> >> >>> >>>>
>>>>> >> >>> >> >> >>> >>>> # iptables -A FORWARD -j REJECT --reject-with
>>>>> >> >>> >> >> icmp-port-unreachable
>>>>> >> >>> >> >> >>> >>>>
>>>>> >> >>> >> >> >>> >>>> on the host where pgpool is running. And pull
>>>>> network
>>>>> >> cable
>>>>> >> >>> from
>>>>> >> >>> >> >> >>> >>>> backend0 host network interface. Pgpool detected
>>>>> the
>>>>> >> host
>>>>> >> >>> being
>>>>> >> >>> >> >> down
>>>>> >> >>> >> >> >>> >>>> as expected...
>>>>> >> >>> >> >> >>> >>>> --
>>>>> >> >>> >> >> >>> >>>> Tatsuo Ishii
>>>>> >> >>> >> >> >>> >>>> SRA OSS, Inc. Japan
>>>>> >> >>> >> >> >>> >>>> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> >> >>> >>>> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >> >> >>> >>>>
>>>>> >> >>> >> >> >>> >>>> > Backend is not destination of this message,
>>>>> pgpool
>>>>> >> host
>>>>> >> >>> is,
>>>>> >> >>> >> and
>>>>> >> >>> >> >> we
>>>>> >> >>> >> >> >>> >>>> don't
>>>>> >> >>> >> >> >>> >>>> > want it to ever get it. With command I've sent
>>>>> you
>>>>> >> rule
>>>>> >> >>> will
>>>>> >> >>> >> be
>>>>> >> >>> >> >> >>> >>>> created for
>>>>> >> >>> >> >> >>> >>>> > any source and destination.
>>>>> >> >>> >> >> >>> >>>> >
>>>>> >> >>> >> >> >>> >>>> > Regards,
>>>>> >> >>> >> >> >>> >>>> > Stevo.
>>>>> >> >>> >> >> >>> >>>> >
>>>>> >> >>> >> >> >>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii
>>>>> <
>>>>> >> >>> >> >> >>> ishii at postgresql.org>
>>>>> >> >>> >> >> >>> >>>> wrote:
>>>>> >> >>> >> >> >>> >>>> >
>>>>> >> >>> >> >> >>> >>>> >> I did following:
>>>>> >> >>> >> >> >>> >>>> >>
>>>>> >> >>> >> >> >>> >>>> >> Do following on the host where pgpool is
>>>>> running on:
>>>>> >> >>> >> >> >>> >>>> >>
>>>>> >> >>> >> >> >>> >>>> >> # iptables -A FORWARD -j REJECT --reject-with
>>>>> >> >>> >> >> >>> icmp-port-unreachable -d
>>>>> >> >>> >> >> >>> >>>> >> 133.137.177.124
>>>>> >> >>> >> >> >>> >>>> >> (133.137.177.124 is the host where backend is
>>>>> running
>>>>> >> >>> on)
>>>>> >> >>> >> >> >>> >>>> >>
>>>>> >> >>> >> >> >>> >>>> >> Pull network cable from backend0 host network
>>>>> >> interface.
>>>>> >> >>> >> Pgpool
>>>>> >> >>> >> >> >>> >>>> >> detected the host being down as expected. Am I
>>>>> >> missing
>>>>> >> >>> >> >> something?
>>>>> >> >>> >> >> >>> >>>> >> --
>>>>> >> >>> >> >> >>> >>>> >> Tatsuo Ishii
>>>>> >> >>> >> >> >>> >>>> >> SRA OSS, Inc. Japan
>>>>> >> >>> >> >> >>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> >> >>> >>>> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >> >> >>> >>>> >>
>>>>> >> >>> >> >> >>> >>>> >> > Hello Tatsuo,
>>>>> >> >>> >> >> >>> >>>> >> >
>>>>> >> >>> >> >> >>> >>>> >> > With backend0 on one host just configure
>>>>> following
>>>>> >> >>> rule on
>>>>> >> >>> >> >> other
>>>>> >> >>> >> >> >>> >>>> host
>>>>> >> >>> >> >> >>> >>>> >> where
>>>>> >> >>> >> >> >>> >>>> >> > pgpool is:
>>>>> >> >>> >> >> >>> >>>> >> >
>>>>> >> >>> >> >> >>> >>>> >> > iptables -A FORWARD -j REJECT --reject-with
>>>>> >> >>> >> >> >>> icmp-port-unreachable
>>>>> >> >>> >> >> >>> >>>> >> >
>>>>> >> >>> >> >> >>> >>>> >> > and then have pgpool startup with health
>>>>> checking
>>>>> >> and
>>>>> >> >>> >> >> retrying
>>>>> >> >>> >> >> >>> >>>> >> configured,
>>>>> >> >>> >> >> >>> >>>> >> > and then pull network cable from backend0
>>>>> host
>>>>> >> network
>>>>> >> >>> >> >> >>> interface.
>>>>> >> >>> >> >> >>> >>>> >> >
>>>>> >> >>> >> >> >>> >>>> >> > Regards,
>>>>> >> >>> >> >> >>> >>>> >> > Stevo.
>>>>> >> >>> >> >> >>> >>>> >> >
>>>>> >> >>> >> >> >>> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo
>>>>> Ishii <
>>>>> >> >>> >> >> >>> ishii at postgresql.org
>>>>> >> >>> >> >> >>> >>>> >
>>>>> >> >>> >> >> >>> >>>> >> wrote:
>>>>> >> >>> >> >> >>> >>>> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> I want to try to test the situation you
>>>>> descrived:
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> >> > When system is configured for security
>>>>> >> reasons
>>>>> >> >>> not
>>>>> >> >>> >> to
>>>>> >> >>> >> >> >>> return
>>>>> >> >>> >> >> >>> >>>> >> >> destination
>>>>> >> >>> >> >> >>> >>>> >> >> >> > host unreachable messages, even though
>>>>> >> >>> >> >> >>> health_check_timeout is
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> But I don't know how to do it. I pulled
>>>>> out the
>>>>> >> >>> network
>>>>> >> >>> >> >> cable
>>>>> >> >>> >> >> >>> and
>>>>> >> >>> >> >> >>> >>>> >> >> pgpool detected it as expected. Also I
>>>>> configured
>>>>> >> the
>>>>> >> >>> >> server
>>>>> >> >>> >> >> >>> which
>>>>> >> >>> >> >> >>> >>>> >> >> PostgreSQL is running on to disable the
>>>>> 5432
>>>>> >> port. In
>>>>> >> >>> >> this
>>>>> >> >>> >> >> case
>>>>> >> >>> >> >> >>> >>>> >> >> connect(2) returned EHOSTUNREACH (No route
>>>>> to
>>>>> >> host)
>>>>> >> >>> so
>>>>> >> >>> >> >> pgpool
>>>>> >> >>> >> >> >>> >>>> detected
>>>>> >> >>> >> >> >>> >>>> >> >> the error as expected.
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> Could you please instruct me?
>>>>> >> >>> >> >> >>> >>>> >> >> --
>>>>> >> >>> >> >> >>> >>>> >> >> Tatsuo Ishii
>>>>> >> >>> >> >> >>> >>>> >> >> SRA OSS, Inc. Japan
>>>>> >> >>> >> >> >>> >>>> >> >> English:
>>>>> http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> >> >>> >>>> >> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> > Hello Tatsuo,
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> > Thank you for replying!
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> > I'm not sure what exactly is blocking,
>>>>> just by
>>>>> >> >>> pgpool
>>>>> >> >>> >> code
>>>>> >> >>> >> >> >>> >>>> analysis I
>>>>> >> >>> >> >> >>> >>>> >> >> > suspect it is the part where a
>>>>> connection is
>>>>> >> made
>>>>> >> >>> to
>>>>> >> >>> >> the
>>>>> >> >>> >> >> db
>>>>> >> >>> >> >> >>> and
>>>>> >> >>> >> >> >>> >>>> it
>>>>> >> >>> >> >> >>> >>>> >> >> doesn't
>>>>> >> >>> >> >> >>> >>>> >> >> > seem to get interrupted by alarm. Tested
>>>>> >> thoroughly
>>>>> >> >>> >> health
>>>>> >> >>> >> >> >>> check
>>>>> >> >>> >> >> >>> >>>> >> >> behaviour,
>>>>> >> >>> >> >> >>> >>>> >> >> > it works really well when host/ip is
>>>>> there and
>>>>> >> just
>>>>> >> >>> >> >> >>> >>>> backend/postgres
>>>>> >> >>> >> >> >>> >>>> >> is
>>>>> >> >>> >> >> >>> >>>> >> >> > down, but not when backend host/ip is
>>>>> down. I
>>>>> >> could
>>>>> >> >>> >> see in
>>>>> >> >>> >> >> >>> log
>>>>> >> >>> >> >> >>> >>>> that
>>>>> >> >>> >> >> >>> >>>> >> >> initial
>>>>> >> >>> >> >> >>> >>>> >> >> > health check and each retry got delayed
>>>>> when
>>>>> >> >>> host/ip is
>>>>> >> >>> >> >> not
>>>>> >> >>> >> >> >>> >>>> reachable,
>>>>> >> >>> >> >> >>> >>>> >> >> > while when just backend is not listening
>>>>> (is
>>>>> >> down)
>>>>> >> >>> on
>>>>> >> >>> >> the
>>>>> >> >>> >> >> >>> >>>> reachable
>>>>> >> >>> >> >> >>> >>>> >> >> host/ip
>>>>> >> >>> >> >> >>> >>>> >> >> > then initial health check and all
>>>>> retries are
>>>>> >> >>> exact to
>>>>> >> >>> >> the
>>>>> >> >>> >> >> >>> >>>> settings in
>>>>> >> >>> >> >> >>> >>>> >> >> > pgpool.conf.
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of
>>>>> the libpq
>>>>> >> >>> >> >> environment
>>>>> >> >>> >> >> >>> >>>> variables
>>>>> >> >>> >> >> >>> >>>> >> in
>>>>> >> >>> >> >> >>> >>>> >> >> > the docs (see
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> http://www.postgresql.org/docs/9.1/static/libpq-envars.html)
>>>>> >> >>> >> >> >>> >>>> >> >> > There is equivalent parameter in libpq
>>>>> >> >>> >> PGconnectdbParams (
>>>>> >> >>> >> >> >>> see
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> >> >>> >>>> >>
>>>>> >> >>> >> >> >>> >>>>
>>>>> >> >>> >> >> >>>
>>>>> >> >>> >> >>
>>>>> >> >>> >>
>>>>> >> >>>
>>>>> >>
>>>>> http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT
>>>>> >> >>> >> >> >>> >>>> >> >> )
>>>>> >> >>> >> >> >>> >>>> >> >> > At the beginning of that same page there
>>>>> are
>>>>> >> some
>>>>> >> >>> >> >> important
>>>>> >> >>> >> >> >>> >>>> infos on
>>>>> >> >>> >> >> >>> >>>> >> >> using
>>>>> >> >>> >> >> >>> >>>> >> >> > these functions.
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> > Regards,
>>>>> >> >>> >> >> >>> >>>> >> >> > Stevo.
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo
>>>>> Ishii <
>>>>> >> >>> >> >> >>> >>>> ishii at postgresql.org>
>>>>> >> >>> >> >> >>> >>>> >> >> wrote:
>>>>> >> >>> >> >> >>> >>>> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> >> > Hello pgpool community,
>>>>> >> >>> >> >> >>> >>>> >> >> >> >
>>>>> >> >>> >> >> >>> >>>> >> >> >> > When system is configured for security
>>>>> >> reasons
>>>>> >> >>> not
>>>>> >> >>> >> to
>>>>> >> >>> >> >> >>> return
>>>>> >> >>> >> >> >>> >>>> >> >> destination
>>>>> >> >>> >> >> >>> >>>> >> >> >> > host unreachable messages, even though
>>>>> >> >>> >> >> >>> health_check_timeout is
>>>>> >> >>> >> >> >>> >>>> >> >> >> configured,
>>>>> >> >>> >> >> >>> >>>> >> >> >> > socket call will block and alarm will
>>>>> not get
>>>>> >> >>> raised
>>>>> >> >>> >> >> >>> until TCP
>>>>> >> >>> >> >> >>> >>>> >> timeout
>>>>> >> >>> >> >> >>> >>>> >> >> >> > occurs.
>>>>> >> >>> >> >> >>> >>>> >> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> >> Interesting. So are you saying that
>>>>> read(2)
>>>>> >> >>> cannot be
>>>>> >> >>> >> >> >>> >>>> interrupted by
>>>>> >> >>> >> >> >>> >>>> >> >> >> alarm signal if the system is
>>>>> configured not to
>>>>> >> >>> return
>>>>> >> >>> >> >> >>> >>>> destination
>>>>> >> >>> >> >> >>> >>>> >> >> >> host unreachable message? Could you
>>>>> please
>>>>> >> guide
>>>>> >> >>> me
>>>>> >> >>> >> >> where I
>>>>> >> >>> >> >> >>> can
>>>>> >> >>> >> >> >>> >>>> get
>>>>> >> >>> >> >> >>> >>>> >> >> >> such that info? (I'm not a network
>>>>> expert).
>>>>> >> >>> >> >> >>> >>>> >> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> >> > Not a C programmer, found some info
>>>>> that
>>>>> >> select
>>>>> >> >>> call
>>>>> >> >>> >> >> >>> could be
>>>>> >> >>> >> >> >>> >>>> >> replace
>>>>> >> >>> >> >> >>> >>>> >> >> >> with
>>>>> >> >>> >> >> >>> >>>> >> >> >> > select/pselect calls. Maybe it would
>>>>> be best
>>>>> >> if
>>>>> >> >>> >> >> >>> >>>> PGCONNECT_TIMEOUT
>>>>> >> >>> >> >> >>> >>>> >> >> value
>>>>> >> >>> >> >> >>> >>>> >> >> >> > could be used here for connection
>>>>> timeout.
>>>>> >> >>> pgpool
>>>>> >> >>> >> has
>>>>> >> >>> >> >> >>> libpq as
>>>>> >> >>> >> >> >>> >>>> >> >> >> dependency,
>>>>> >> >>> >> >> >>> >>>> >> >> >> > why isn't it using libpq for the
>>>>> healthcheck
>>>>> >> db
>>>>> >> >>> >> connect
>>>>> >> >>> >> >> >>> >>>> calls, then
>>>>> >> >>> >> >> >>> >>>> >> >> >> > PGCONNECT_TIMEOUT would be applied?
>>>>> >> >>> >> >> >>> >>>> >> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> >> I don't think libpq uses select/pselect
>>>>> for
>>>>> >> >>> >> establishing
>>>>> >> >>> >> >> >>> >>>> connection,
>>>>> >> >>> >> >> >>> >>>> >> >> >> but using libpq instead of homebrew
>>>>> code seems
>>>>> >> to
>>>>> >> >>> be
>>>>> >> >>> >> an
>>>>> >> >>> >> >> >>> idea.
>>>>> >> >>> >> >> >>> >>>> Let me
>>>>> >> >>> >> >> >>> >>>> >> >> >> think about it.
>>>>> >> >>> >> >> >>> >>>> >> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >> >> One question. Are you sure that libpq
>>>>> can deal
>>>>> >> >>> with
>>>>> >> >>> >> the
>>>>> >> >>> >> >> case
>>>>> >> >>> >> >> >>> >>>> (not to
>>>>> >> >>> >> >> >>> >>>> >> >> >> return destination host unreachable
>>>>> messages)
>>>>> >> by
>>>>> >> >>> using
>>>>> >> >>> >> >> >>> >>>> >> >> >> PGCONNECT_TIMEOUT?
>>>>> >> >>> >> >> >>> >>>> >> >> >> --
>>>>> >> >>> >> >> >>> >>>> >> >> >> Tatsuo Ishii
>>>>> >> >>> >> >> >>> >>>> >> >> >> SRA OSS, Inc. Japan
>>>>> >> >>> >> >> >>> >>>> >> >> >> English:
>>>>> http://www.sraoss.co.jp/index_en.php
>>>>> >> >>> >> >> >>> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >>> >> >> >>> >>>> >> >> >>
>>>>> >> >>> >> >> >>> >>>> >> >>
>>>>> >> >>> >> >> >>> >>>> >>
>>>>> >> >>> >> >> >>> >>>>
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>>
>>>>> >> >>> >> >> >>> >>
>>>>> >> >>> >> >> >>>
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >> >>
>>>>> >> >>> >> >>
>>>>> >> >>> >>
>>>>> >> >>>
>>>>> >> >>
>>>>> >> >>
>>>>> >>
>>>>>
>>>>
>>>>
>>>
>>
>
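
For reference, the select()-based connect timeout floated in the quoted discussion above is usually shaped like this (an illustrative sketch, not pgpool or libpq code):

    #include <sys/select.h>
    #include <sys/socket.h>
    #include <fcntl.h>
    #include <errno.h>

    /* Connect with a timeout: non-blocking connect(), then wait for
     * writability with select(). Returns 0 on success, -1 on error/timeout. */
    int connect_with_timeout(int fd, struct sockaddr *addr, socklen_t len, int seconds)
    {
        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, addr, len) == 0)
            return 0;                      /* connected immediately */
        if (errno != EINPROGRESS)
            return -1;                     /* immediate failure */

        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };

        if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
            return -1;                     /* timed out or select() failed */

        int err = 0;                       /* did the connect itself succeed? */
        socklen_t errlen = sizeof(err);
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0)
            return -1;
        return 0;
    }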
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fixes-usage-of-exitcode-for-signaling-health-check-t.patch
Type: text/x-patch
Size: 3057 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20120120/c28a23ff/attachment-0001.bin>

