[pgpool-general: 212] Re: Healthcheck timeout not always respected

Sun Feb 5 14:49:06 JST 2012

Finally I have time to check your patches. Here are the results of my review.

> Hello Tatsuo,
> 
> Here is cumulative patch to be applied on pgpool master branch with
> following fixes included:
> 
>    1. fix for health check bug
>       1. it was not possible to allow backend failover only on failed
>       health check(s);
>       2. to achieve this one just configures backend to
>       DISALLOW_TO_FAILOVER, sets fail_over_on_backend_error to off, and
>       configures health checks;
>       3. for this fix in code an unwanted check was removed in main.c,
>       after health check failed if DISALLOW_TO_FAILOVER was set for backend
>       failover would have been always prevented, even when one configures
>       health check whose sole purpose is to control failover

This is not acceptable, at least for stable releases.
DISALLOW_TO_FAILOVER and fail_over_on_backend_error serve different
purposes. The former prevents any failover, including one triggered
by a failed health check. The latter only controls whether a failed
write to the backend communication socket triggers failover.

fail_over_on_backend_error = on
                                   # Initiates failover when writing to the
                                   # backend communication socket fails
                                   # This is the same behaviour of pgpool-II
                                   # 2.2.x and previous releases
                                   # If set to off, pgpool will report an
                                   # error and disconnect the session.

Your patch changes the existing semantics. Another point is that
DISALLOW_TO_FAILOVER allows per-backend control of this behavior. Your
patch breaks that.
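
To make the distinction concrete, here is a rough sketch in Python. It
is not pgpool code: the names ErrorSource and may_fail_over are
hypothetical, and the treatment of read errors (failover unless the
backend is flagged DISALLOW_TO_FAILOVER) is an assumption based on the
earlier discussion in this thread.

```python
from enum import Enum


class ErrorSource(Enum):
    """Where the failure that might trigger failover was detected."""
    HEALTH_CHECK = "health check failed after all retries"
    SOCKET_WRITE = "write to backend communication socket failed"
    SOCKET_READ = "read from backend communication socket failed"


def may_fail_over(source, backend_disallows_failover, fail_over_on_backend_error):
    """Simplified model of the two settings (illustrative only).

    backend_disallows_failover: per-backend flag (DISALLOW_TO_FAILOVER).
    fail_over_on_backend_error: global setting for socket write errors.
    """
    # The per-backend flag blocks every failover, health check included.
    if backend_disallows_failover:
        return False
    # The global setting only matters for socket write errors.
    if source is ErrorSource.SOCKET_WRITE:
        return fail_over_on_backend_error
    # Assumption: health check and read errors fail over otherwise.
    return True
```

In this model the first argument pair can differ for each backend while
the last one is shared by all of them, which is the per-backend control
mentioned above.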

>       2. fix for health check bug
>       1. health check timeout was not being respected in all conditions
>       (icmp host unreachable messages dropped for security reasons, or no
>       active network component to send those messages)
>       2. for this fix in code (main.c, pool.h, pool_connection_pool.c) inet
>       connections have been made to be non blocking, and during connection
>       retries status of now global health_check_timer_expired variable is being
>       checked

This seems good, but I need to investigate more. For example, your
patch sets sockets to non-blocking but never reverts them to blocking.
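
The general pattern in question can be sketched as follows, in Python
rather than C for brevity. This is only an illustration under stated
assumptions, not the patch itself: connect_with_deadline, should_abort
and the 0.1 second polling slice are hypothetical, with should_abort
standing in for a flag such as health_check_timer_expired. The point is
the last step, where the socket is put back into blocking mode before
it is handed to the caller.

```python
import errno
import os
import select
import socket
import time


def connect_with_deadline(host, port, timeout_sec, should_abort=lambda: False):
    """Connect without blocking past timeout_sec, then restore blocking mode.

    host should be an IP address; name resolution would block.
    """
    deadline = time.monotonic() + timeout_sec
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        err = sock.connect_ex((host, port))
        in_progress = (errno.EINPROGRESS, errno.EALREADY, errno.EWOULDBLOCK)
        if err != 0 and err not in in_progress:
            raise OSError(err, os.strerror(err))
        while err != 0:
            # Stand-in for checking a timer-expired flag between attempts.
            if should_abort():
                raise TimeoutError("connect aborted by caller")
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                raise TimeoutError("connect timed out")
            # Wait in short slices so the abort flag is checked regularly.
            _, writable, _ = select.select([], [sock], [], min(remaining, 0.1))
            if writable:
                # A writable socket means connect finished; SO_ERROR says how.
                err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
                if err != 0:
                    raise OSError(err, os.strerror(err))
        # Revert to blocking mode so later reads and writes behave as before.
        sock.setblocking(True)
        return sock
    except BaseException:
        sock.close()
        raise
```

Without the final setblocking(True), every later read or write on the
connection would have to cope with EAGAIN, which code written for
blocking sockets does not expect.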

>       3. fix for failback bug
>       1. in raw mode, after failback (through pcp_attach_node) standby
>       node/backend would remain in invalid state 

It turned out that even failover had a bug. The status was not set to
CON_DOWN, which left it at CON_CONNECT_WAIT and prevented failback
from returning to the normal state. I fixed this on the master branch.

> (it would be in CON_UP, so on
>       failover after failback pgpool would not be able to connect to standby as
>       get_next_master_node expects standby nodes/backends in raw mode to be in
>       CON_CONNECT_WAIT state when finding next master node)
>       2. for this fix in code, when in raw mode on failback status of all
>       nodes/backends with CON_UP state is set to CON_CONNECT_WAIT - all
>       children are restarted anyway

> Neither of these fixes changes expected behaviour of related features so
> there are no changes to the documentation.
> 
> 
> Kind regards,
> 
> Stevo.
> 
> 
> 2012/1/24 Tatsuo Ishii <ishii at postgresql.org>
> 
>> > Additional testing confirmed that this fix ensures health check timer
>> gets
>> > respected (should I create a ticket on some issue tracker? send
>> cumulative
>> > patch with all changes to have it accepted?).
>>
>> We have a problem with the Mantis bug tracker and decided to stop using
>> it (unless someone volunteers to fix it). Please send a cumulative patch
>> against master head to this list so that we will be able to look
>> into it (be sure to include English doc changes).
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Problem is that with all the testing another issue has been encountered,
>> > now with pcp_attach_node.
>> >
>> > With pgpool in raw mode and two backends in postgres 9 streaming
>> > replication, when backend0 fails, after health checks retries pgpool
>> calls
>> > failover command and degenerates backend0, backend1 gets promoted to new
>> > master, pgpool can connect to that master, and two backends are in pgpool
>> > state 3/2. And this is ok and expected.
>> >
>> > Once backend0 is recovered, it's attached back to pgpool using
>> > pcp_attach_node, and pgpool will show two backends in state 2/2 (in logs
>> > and in show pool_nodes; query) with backend0 taking all the load (raw
>> > mode). If after that recovery and attachment of backend0 pgpool is not
>> > restarted, and after some time backend0 fails again, after health check
>> > retries backend0 will get degenerated, failover command will get called
>> > (promotes standby to master), but pgpool will not be able to connect to
>> > backend1 (regardless if unix or inet sockets are used for backend1). Only
>> > if pgpool is restarted before second (complete) failure of backend0, will
>> > pgpool be able to connect to backend1.
>> >
>> > Following code, pcp_attach_node (failback of backend0) will actually
>> > execute same code as for failover. Not sure what, but that failover does
>> > something with backend1 state or in memory settings, so that pgpool can
>> no
>> > longer connect to backend1. Is this a known issue?
>> >
>> > Kind regards,
>> > Stevo.
>> >
>> > 2012/1/20 Stevo Slavić <sslavic at gmail.com>
>> >
>> >> Key file was missing from that commit/change - pool.h where
>> >> health_check_timer_expired was made global. Included now attached patch.
>> >>
>> >> Kind regards,
>> >> Stevo.
>> >>
>> >>
>> >> 2012/1/20 Stevo Slavić <sslavic at gmail.com>
>> >>
>> >>> Using exit_request was wrong and caused a bug. 4th patch needed -
>> >>> health_check_timer_expired is global now so it can be verified if it
>> was
>> >>> set to 1 outside of main.c
>> >>>
>> >>>
>> >>> Kind regards,
>> >>> Stevo.
>> >>>
>> >>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>> >>>
>> >>>> Using exit_code was not wise. Tested and encountered a case where this
>> >>>> results in a bug. Have to work on it more. Main issue is how in
>> >>>> pool_connection_pool.c connect_inet_domain_socket_by_port function to
>> know
>> >>>> that health check timer has expired (set to 1). Any ideas?
>> >>>>
>> >>>> Kind regards,
>> >>>> Stevo.
>> >>>>
>> >>>>
>> >>>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>> >>>>
>> >>>>> Tatsuo,
>> >>>>>
>> >>>>> Here are the patches which should be applied to current pgpool head
>> for
>> >>>>> fixing this issue:
>> >>>>>
>> >>>>> Fixes-health-check-timeout.patch
>> >>>>> Fixes-health-check-retrying-after-failover.patch
>> >>>>> Fixes-clearing-exitrequest-flag.patch
>> >>>>>
>> >>>>> Quirk I noticed in logs was resolved as well - after failover pgpool
>> >>>>> would perform healthcheck and report it is doing (max retries + 1) th
>> >>>>> health check which was confusing. Rather I've adjusted that it does
>> and
>> >>>>> reports it's doing a new health check cycle after failover.
>> >>>>>
>> >>>>> I've tested and it works well - when in raw mode, backends set to
>> >>>>> disallow failover, failover on backend failure disabled, and health
>> checks
>> >>>>> configured with retries (30sec interval, 5sec timeout, 2 retries,
>> 10sec
>> >>>>> delay between retries).
>> >>>>>
>> >>>>> Please test, and if confirmed ok include in next release.
>> >>>>>
>> >>>>> Kind regards,
>> >>>>>
>> >>>>> Stevo.
>> >>>>>
>> >>>>>
>> >>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>> >>>>>
>> >>>>>> Here is pgpool.log, strace.out, and pgpool.conf when I tested with
>> my
>> >>>>>> latest patch for health check timeout applied. It works well,
>> except for
>> >>>>>> single quirk, after failover completed in log files it was reported
>> that
>> >>>>>> 3rd health check retry was done (even though just 2 are configured,
>> see
>> >>>>>> pgpool.conf) and that backend has returned to healthy state. That
>> >>>>>> interesting part from log file follows:
>> >>>>>>
>> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 DEBUG: pid
>> >>>>>> 1163: retrying 3 th health checking
>> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 DEBUG: pid
>> >>>>>> 1163: health_check: 0 th DB node status: 3
>> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 LOG:   pid
>> >>>>>> 1163: after some retrying backend returned to healthy state
>> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15 DEBUG: pid
>> >>>>>> 1163: starting health checking
>> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15 DEBUG: pid
>> >>>>>> 1163: health_check: 0 th DB node status: 3
>> >>>>>>
>> >>>>>>
>> >>>>>> As can be seen in pgpool.conf, there is only one backend configured.
>> >>>>>> pgpool did failover well after health check max retries has been
>> reached
>> >>>>>> (pgpool just degraded that single backend to 3, and restarted child
>> >>>>>> processes).
>> >>>>>>
>> >>>>>> After this quirk has been logged, next health check logs were as
>> >>>>>> expected. Except those couple weird log entries, everything seems
>> to be ok.
>> >>>>>> Maybe that quirk was caused by single backend only configuration
>> corner
>> >>>>>> case. Will try tomorrow if it occurs on dual backend configuration.
>> >>>>>>
>> >>>>>> Regards,
>> >>>>>> Stevo.
>> >>>>>>
>> >>>>>>
>> >>>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>> >>>>>>
>> >>>>>>> Hello Tatsuo,
>> >>>>>>>
>> >>>>>>> Unfortunately, with your patch when A is on
>> >>>>>>> (pool_config->health_check_period > 0) and B is on, when retry
>> count is
>> >>>>>>> over, failover will be disallowed because of B being on.
>> >>>>>>>
>> >>>>>>> Nenad's patch allows failover to be triggered only by health check.
>> >>>>>>> Here is the patch which includes Nenad's fix but also fixes issue
>> with
>> >>>>>>> health check timeout not being respected.
>> >>>>>>>
>> >>>>>>> Key points in fix for health check timeout being respected are:
>> >>>>>>> - in pool_connection_pool.c connect_inet_domain_socket_by_port
>> >>>>>>> function, before trying to connect, file descriptor is set to
>> non-blocking
>> >>>>>>> mode, and also non-blocking mode error codes are handled,
>> EINPROGRESS and
>> >>>>>>> EALREADY (please verify changes here, especially regarding closing
>> fd)
>> >>>>>>> - in main.c health_check_timer_handler has been changed to signal
>> >>>>>>> exit_request to health check initiated
>> connect_inet_domain_socket_by_port
>> >>>>>>> function call (please verify this, maybe there is a better way to
>> check
>> >>>>>>> from connect_inet_domain_socket_by_port if in
>> health_check_timer_expired
>> >>>>>>> has been set to 1)
>> >>>>>>>
>> >>>>>>> These changes will practically make connect attempt to be
>> >>>>>>> non-blocking and repeated until:
>> >>>>>>> - connection is made, or
>> >>>>>>> - unhandled connection error condition is reached, or
>> >>>>>>> - health check timer alarm has been raised, or
>> >>>>>>> - some other exit request (shutdown) has been issued.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Kind regards,
>> >>>>>>> Stevo.
>> >>>>>>>
>> >>>>>>> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >>>>>>>
>> >>>>>>>> Ok, let me clarify use cases regarding failover.
>> >>>>>>>>
>> >>>>>>>> Currently there are three parameters:
>> >>>>>>>> a) health_check
>> >>>>>>>> b) DISALLOW_TO_FAILOVER
>> >>>>>>>> c) fail_over_on_backend_error
>> >>>>>>>>
>> >>>>>>>> Source of errors which can trigger failover are 1)health check
>> >>>>>>>> 2)write
>> >>>>>>>> to backend socket 3)read backend from socket. I represent each 1)
>> as
>> >>>>>>>> A, 2) as B, 3) as C.
>> >>>>>>>>
>> >>>>>>>> 1) trigger failover if A or B or C is error
>> >>>>>>>> a = on, b = off, c = on
>> >>>>>>>>
>> >>>>>>>> 2) trigger failover only when B or C is error
>> >>>>>>>> a = off, b = off, c = on
>> >>>>>>>>
>> >>>>>>>> 3) trigger failover only when B is error
>> >>>>>>>> Impossible. Because C error always triggers failover.
>> >>>>>>>>
>> >>>>>>>> 4) trigger failover only when C is error
>> >>>>>>>> a = off, b = off, c = off
>> >>>>>>>>
>> >>>>>>>> 5) trigger failover only when A is error(Stevo wants this)
>> >>>>>>>> Impossible. Because C error always triggers failover.
>> >>>>>>>>
>> >>>>>>>> 6) never trigger failover
>> >>>>>>>> Impossible. Because C error always triggers failover.
>> >>>>>>>>
>> >>>>>>>> As you can see, C is the problem here (look at #3, #5 and #6)
>> >>>>>>>>
>> >>>>>>>> If we implemented this:
>> >>>>>>>> >> However I think we should disable failover if
>> >>>>>>>> DISALLOW_TO_FAILOVER set
>> >>>>>>>> >> in case of reading data from backend. This should have been
>> done
>> >>>>>>>> when
>> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is exactly
>> what
>> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you think?
>> >>>>>>>>
>> >>>>>>>> 1) trigger failover if A or B or C is error
>> >>>>>>>> a = on, b = off, c = on
>> >>>>>>>>
>> >>>>>>>> 2) trigger failover only when B or C is error
>> >>>>>>>> a = off, b = off, c = on
>> >>>>>>>>
>> >>>>>>>> 3) trigger failover only when B is error
>> >>>>>>>> a = off, b = on, c = on
>> >>>>>>>>
>> >>>>>>>> 4) trigger failover only when C is error
>> >>>>>>>> a = off, b = off, c = off
>> >>>>>>>>
>> >>>>>>>> 5) trigger failover only when A is error(Stevo wants this)
>> >>>>>>>> a = on, b = on, c = off
>> >>>>>>>>
>> >>>>>>>> 6) never trigger failover
>> >>>>>>>> a = off, b = on, c = off
>> >>>>>>>>
>> >>>>>>>> So it seems my patch will solve all the problems including yours.
>> >>>>>>>> (timeout while retrying is another issue of course).
>> >>>>>>>> --
>> >>>>>>>> Tatsuo Ishii
>> >>>>>>>> SRA OSS, Inc. Japan
>> >>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> Japanese: http://www.sraoss.co.jp
>> >>>>>>>>
>> >>>>>>>> > I agree, fail_over_on_backend_error isn't useful, just adds
>> >>>>>>>> confusion by
>> >>>>>>>> > overlapping with DISALLOW_TO_FAILOVER.
>> >>>>>>>> >
>> >>>>>>>> > With your patch or without it, it is not possible to failover
>> only
>> >>>>>>>> on
>> >>>>>>>> > health check (max retries) failure. With Nenad's patch, that
>> part
>> >>>>>>>> works ok
>> >>>>>>>> > and I think that patch is semantically ok - failover occurs even
>> >>>>>>>> though
>> >>>>>>>> > DISALLOW_TO_FAILOVER is set for backend but only when health
>> check
>> >>>>>>>> is
>> >>>>>>>> > configured too. Configuring health check without failover on
>> >>>>>>>> failed health
>> >>>>>>>> > check has no purpose. Also health check configured with allowed
>> >>>>>>>> failover on
>> >>>>>>>> > any condition other than health check (max retries) failure has
>> no
>> >>>>>>>> purpose.
>> >>>>>>>> >
>> >>>>>>>> > Kind regards,
>> >>>>>>>> > Stevo.
>> >>>>>>>> >
>> >>>>>>>> > 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >>>>>>>> >
>> >>>>>>>> >> fail_over_on_backend_error has different meaning from
>> >>>>>>>> >> DISALLOW_TO_FAILOVER. From the doc:
>> >>>>>>>> >>
>> >>>>>>>> >>  If true, and an error occurs when writing to the backend
>> >>>>>>>> >>  communication, pgpool-II will trigger the fail over procedure
>> .
>> >>>>>>>> This
>> >>>>>>>> >>  is the same behavior as of pgpool-II 2.2.x or earlier. If set
>> to
>> >>>>>>>> >>  false, pgpool will report an error and disconnect the session.
>> >>>>>>>> >>
>> >>>>>>>> >> This means that if pgpool fails to read from the backend, it will
>> >>>>>>>> >> trigger failover even if fail_over_on_backend_error is set to off.
>> >>>>>>>> >> So unconditionally disabling failover would lead to backward
>> >>>>>>>> >> incompatibility.
>> >>>>>>>> >>
>> >>>>>>>> >> However I think we should disable failover if
>> >>>>>>>> DISALLOW_TO_FAILOVER set
>> >>>>>>>> >> in case of reading data from backend. This should have been
>> done
>> >>>>>>>> when
>> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is exactly
>> what
>> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you think?
>> >>>>>>>> >> --
>> >>>>>>>> >> Tatsuo Ishii
>> >>>>>>>> >> SRA OSS, Inc. Japan
>> >>>>>>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >>
>> >>>>>>>> >> > For a moment I thought we could have set
>> >>>>>>>> fail_over_on_backend_error to
>> >>>>>>>> >> off,
>> >>>>>>>> >> > and have backends set with ALLOW_TO_FAILOVER flag. But then I
>> >>>>>>>> looked in
>> >>>>>>>> >> > code.
>> >>>>>>>> >> >
>> >>>>>>>> >> > In child.c there is a loop child process goes through in its
>> >>>>>>>> lifetime.
>> >>>>>>>> >> When
>> >>>>>>>> >> > fatal error condition occurs before child process exits it
>> will
>> >>>>>>>> call
>> >>>>>>>> >> > notice_backend_error which will call degenerate_backend_set
>> >>>>>>>> which will
>> >>>>>>>> >> not
>> >>>>>>>> >> > take into account fail_over_on_backend_error is set to off,
>> >>>>>>>> causing
>> >>>>>>>> >> backend
>> >>>>>>>> >> > to be degenerated and failover to occur. That's why we have
>> >>>>>>>> backends set
>> >>>>>>>> >> > with DISALLOW_TO_FAILOVER but with our patch applied, health
>> >>>>>>>> check could
>> >>>>>>>> >> > cause failover to occur as expected.
>> >>>>>>>> >> >
>> >>>>>>>> >> > Maybe it would be enough just to modify
>> degenerate_backend_set,
>> >>>>>>>> to take
>> >>>>>>>> >> > fail_over_on_backend_error into account just like it already
>> >>>>>>>> takes
>> >>>>>>>> >> > DISALLOW_TO_FAILOVER into account.
>> >>>>>>>> >> >
>> >>>>>>>> >> > Kind regards,
>> >>>>>>>> >> > Stevo.
>> >>>>>>>> >> >
>> >>>>>>>> >> > 2012/1/15 Stevo Slavić <sslavic at gmail.com>
>> >>>>>>>> >> >
>> >>>>>>>> >> >> Yes and that behaviour which you describe as expected, is
>> not
>> >>>>>>>> what we
>> >>>>>>>> >> >> want. We want pgpool to degrade backend0 and failover when
>> >>>>>>>> configured
>> >>>>>>>> >> max
>> >>>>>>>> >> >> health check retries have failed, and to failover only in
>> that
>> >>>>>>>> case, so
>> >>>>>>>> >> not
>> >>>>>>>> >> >> sooner e.g. connection/child error condition, but as soon as
>> >>>>>>>> max health
>> >>>>>>>> >> >> check retries have been attempted.
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> Maybe examples will be more clear.
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> Imagine two nodes (node 1 and node 2). On each node a single
>> >>>>>>>> pgpool and
>> >>>>>>>> >> a
>> >>>>>>>> >> >> single backend. Apps/clients access db through pgpool on
>> their
>> >>>>>>>> own node.
>> >>>>>>>> >> >> Two backends are configured in postgres native streaming
>> >>>>>>>> replication.
>> >>>>>>>> >> >> pgpools are used in raw mode. Both pgpools have same
>> backend as
>> >>>>>>>> >> backend0,
>> >>>>>>>> >> >> and same backend as backend1.
>> >>>>>>>> >> >> initial state: both backends are up and pgpool can access
>> >>>>>>>> them, clients
>> >>>>>>>> >> >> connect to their pgpool and do their work on master backend,
>> >>>>>>>> backend0.
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> 1st case: unmodified/non-patched pgpool 3.1.1 is used,
>> >>>>>>>> backends are
>> >>>>>>>> >> >> configured with ALLOW_TO_FAILOVER flag
>> >>>>>>>> >> >> - temporary network outage happens between pgpool on node 2
>> >>>>>>>> and backend0
>> >>>>>>>> >> >> - error condition is reported by child process, and since
>> >>>>>>>> >> >> ALLOW_TO_FAILOVER is set, pgpool performs failover without
>> >>>>>>>> giving
>> >>>>>>>> >> chance to
>> >>>>>>>> >> >> pgpool health check retries to control whether backend is
>> just
>> >>>>>>>> >> temporarily
>> >>>>>>>> >> >> inaccessible
>> >>>>>>>> >> >> - failover command on node 2 promotes standby backend to a
>> new
>> >>>>>>>> master -
>> >>>>>>>> >> >> split brain occurs, with two masters
>> >>>>>>>> >> >>
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> 2nd case: unmodified/non-patched pgpool 3.1.1 is used,
>> >>>>>>>> backends are
>> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >>>>>>>> >> >> - temporary network outage happens between pgpool on node 2
>> >>>>>>>> and backend0
>> >>>>>>>> >> >> - error condition is reported by child process, and since
>> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> failover
>> >>>>>>>> >> >> - health check gets a chance to check backend0 condition,
>> >>>>>>>> determines
>> >>>>>>>> >> that
>> >>>>>>>> >> >> it's not accessible, there will be no health check retries
>> >>>>>>>> because
>> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, no failover occurs ever
>> >>>>>>>> >> >>
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> 3rd case, pgpool 3.1.1 + patch you've sent applied, and
>> >>>>>>>> backends
>> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >>>>>>>> >> >> - temporary network outage happens between pgpool on node 2
>> >>>>>>>> and backend0
>> >>>>>>>> >> >> - error condition is reported by child process, and since
>> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> failover
>> >>>>>>>> >> >> - health check gets a chance to check backend0 condition,
>> >>>>>>>> determines
>> >>>>>>>> >> that
>> >>>>>>>> >> >> it's not accessible, health check retries happen, and even
>> >>>>>>>> after max
>> >>>>>>>> >> >> retries, no failover happens since failover is disallowed
>> >>>>>>>> >> >>
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> 4th expected behaviour, pgpool 3.1.1 + patch we sent, and
>> >>>>>>>> backends
>> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >>>>>>>> >> >> - temporary network outage happens between pgpool on node 2
>> >>>>>>>> and backend0
>> >>>>>>>> >> >> - error condition is reported by child process, and since
>> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> failover
>> >>>>>>>> >> >> - health check gets a chance to check backend0 condition,
>> >>>>>>>> determines
>> >>>>>>>> >> that
>> >>>>>>>> >> >> it's not accessible, health check retries happen, before a
>> max
>> >>>>>>>> retry
>> >>>>>>>> >> >> network condition is cleared, retry happens, and backend0
>> >>>>>>>> remains to be
>> >>>>>>>> >> >> master, no failover occurs, temporary network issue did not
>> >>>>>>>> cause split
>> >>>>>>>> >> >> brain
>> >>>>>>>> >> >> - after some time, temporary network outage happens again
>> >>>>>>>> between pgpool
>> >>>>>>>> >> >> on node 2 and backend0
>> >>>>>>>> >> >> - error condition is reported by child process, and since
>> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> failover
>> >>>>>>>> >> >> - health check gets a chance to check backend0 condition,
>> >>>>>>>> determines
>> >>>>>>>> >> that
>> >>>>>>>> >> >> it's not accessible, health check retries happen, after max
>> >>>>>>>> retries
>> >>>>>>>> >> >> backend0 is still not accessible, failover happens, standby
>> is
>> >>>>>>>> new
>> >>>>>>>> >> master
>> >>>>>>>> >> >> and backend0 is degraded
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> Kind regards,
>> >>>>>>>> >> >> Stevo.
>> >>>>>>>> >> >>
>> >>>>>>>> >> >>
>> >>>>>>>> >> >> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >>>>>>>> >> >>
>> >>>>>>>> >> >>> In my test environment, the patch works as expected. I have
>> >>>>>>>> >> >>> two backends. Health check retry conf is as follows:
>> >>>>>>>> >> >>>
>> >>>>>>>> >> >>> health_check_max_retries = 3
>> >>>>>>>> >> >>> health_check_retry_delay = 1
>> >>>>>>>> >> >>>
>> >>>>>>>> >> >>> 5 09:17:20 LOG:   pid 21411: Backend status file
>> >>>>>>>> /home/t-ishii/work/
>> >>>>>>>> >> >>> git.postgresql.org/test/log/pgpool_status discarded
>> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: pgpool-II
>> successfully
>> >>>>>>>> started.
>> >>>>>>>> >> >>> version 3.2alpha1 (hatsuiboshi)
>> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: find_primary_node:
>> >>>>>>>> primary node
>> >>>>>>>> >> id
>> >>>>>>>> >> >>> is 0
>> >>>>>>>> >> >>> -- backend1 was shutdown
>> >>>>>>>> >> >>>
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >>>>>>>> check_replication_time_lag: could
>> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>> >>>>>>>> sr_check_password
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> -- health check failed
>> >>>>>>>> >> >>>
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411: health check failed.
>> 1
>> >>>>>>>> th host
>> >>>>>>>> >> /tmp
>> >>>>>>>> >> >>> at port 11001 is down
>> >>>>>>>> >> >>> -- start retrying
>> >>>>>>>> >> >>> 2012-01-15 09:17:50 LOG:   pid 21411: health check retry
>> >>>>>>>> sleep time: 1
>> >>>>>>>> >> >>> second(s)
>> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411: health check failed.
>> 1
>> >>>>>>>> th host
>> >>>>>>>> >> /tmp
>> >>>>>>>> >> >>> at port 11001 is down
>> >>>>>>>> >> >>> 2012-01-15 09:17:51 LOG:   pid 21411: health check retry
>> >>>>>>>> sleep time: 1
>> >>>>>>>> >> >>> second(s)
>> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411: health check failed.
>> 1
>> >>>>>>>> th host
>> >>>>>>>> >> /tmp
>> >>>>>>>> >> >>> at port 11001 is down
>> >>>>>>>> >> >>> 2012-01-15 09:17:52 LOG:   pid 21411: health check retry
>> >>>>>>>> sleep time: 1
>> >>>>>>>> >> >>> second(s)
>> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411: health check failed.
>> 1
>> >>>>>>>> th host
>> >>>>>>>> >> /tmp
>> >>>>>>>> >> >>> at port 11001 is down
>> >>>>>>>> >> >>> 2012-01-15 09:17:53 LOG:   pid 21411: health_check: 1
>> >>>>>>>> failover is
>> >>>>>>>> >> canceld
>> >>>>>>>> >> >>> because failover is disallowed
>> >>>>>>>> >> >>> -- after 3 retries, pgpool wanted to failover, but gave up
>> >>>>>>>> because
>> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER is set for backend1
>> >>>>>>>> >> >>>
>> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >>>>>>>> check_replication_time_lag: could
>> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>> >>>>>>>> sr_check_password
>> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411: health check failed.
>> 1
>> >>>>>>>> th host
>> >>>>>>>> >> /tmp
>> >>>>>>>> >> >>> at port 11001 is down
>> >>>>>>>> >> >>> 2012-01-15 09:18:03 LOG:   pid 21411: health check retry
>> >>>>>>>> sleep time: 1
>> >>>>>>>> >> >>> second(s)
>> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file or
>> >>>>>>>> directory
>> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>> >>>>>>>> make_persistent_db_connection:
>> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411: health check failed.
>> 1
>> >>>>>>>> th host
>> >>>>>>>> >> /tmp
>> >>>>>>>> >> >>> at port 11001 is down
>> >>>>>>>> >> >>> 2012-01-15 09:18:04 LOG:   pid 21411: health check retry
>> >>>>>>>> sleep time: 1
>> >>>>>>>> >> >>> second(s)
>> >>>>>>>> >> >>> 2012-01-15 09:18:05 LOG:   pid 21411: after some retrying
>> >>>>>>>> backend
>> >>>>>>>> >> >>> returned to healthy state
>> >>>>>>>> >> >>> -- started backend1 and pgpool succeeded in health
>> checking.
>> >>>>>>>> Resumed
>> >>>>>>>> >> >>> using backend1
>> >>>>>>>> >> >>> --
>> >>>>>>>> >> >>> Tatsuo Ishii
>> >>>>>>>> >> >>> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>>
>> >>>>>>>> >> >>> > Hello Tatsuo,
>> >>>>>>>> >> >>> >
>> >>>>>>>> >> >>> > Thank you for the patch and effort, but unfortunately
>> this
>> >>>>>>>> change
>> >>>>>>>> >> won't
>> >>>>>>>> >> >>> > work for us. We need to set disallow failover to prevent
>> >>>>>>>> failover on
>> >>>>>>>> >> >>> child
>> >>>>>>>> >> >>> > reported connection errors (it's ok if few clients lose
>> >>>>>>>> their
>> >>>>>>>> >> >>> connection or
>> >>>>>>>> >> >>> > can not connect), and still have pgpool perform failover
>> >>>>>>>> but only on
>> >>>>>>>> >> >>> failed
>> >>>>>>>> >> >>> > health check (if configured, after max retries threshold
>> >>>>>>>> has been
>> >>>>>>>> >> >>> reached).
>> >>>>>>>> >> >>> >
>> >>>>>>>> >> >>> > Maybe it would be best to add an extra value for
>> >>>>>>>> backend_flag -
>> >>>>>>>> >> >>> > ALLOW_TO_FAILOVER_ON_HEALTH_CHECK or
>> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER_ON_CHILD_ERROR.
>> >>>>>>>> >> >>> > It should behave same as DISALLOW_TO_FAILOVER is set,
>> with
>> >>>>>>>> only
>> >>>>>>>> >> >>> difference
>> >>>>>>>> >> >>> > in behaviour when health check (if set, max retries) has
>> >>>>>>>> failed -
>> >>>>>>>> >> unlike
>> >>>>>>>> >> >>> > DISALLOW_TO_FAILOVER, this new flag should allow failover
>> >>>>>>>> in this
>> >>>>>>>> >> case
>> >>>>>>>> >> >>> only.
>> >>>>>>>> >> >>> >
>> >>>>>>>> >> >>> > Without this change health check (especially health check retries) doesn't
>> >>>>>>>> >> >>> > make much sense - child error is more likely to occur on (temporary)
>> >>>>>>>> >> >>> > backend failure than health check and will or will not cause failover to
>> >>>>>>>> >> >>> > occur depending on backend flag, without giving health check retries a
>> >>>>>>>> >> >>> > chance to determine if failure was temporary or not, risking split brain
>> >>>>>>>> >> >>> > situation with two masters just because of temporary network link hiccup.
>> >>>>>>>> >> >>> >
>> >>>>>>>> >> >>> > Our main problem remains though with the health check timeout not being
>> >>>>>>>> >> >>> > respected in these special conditions we have. Maybe Nenad can help you
>> >>>>>>>> >> >>> > more to reproduce the issue on your environment.
>> >>>>>>>> >> >>> >
>> >>>>>>>> >> >>> > Kind regards,
>> >>>>>>>> >> >>> > Stevo.
>> >>>>>>>> >> >>> >
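The flag semantics requested in the message above can be written down as a small decision function. This is only a sketch of the proposal, not pgpool code: DISALLOW_TO_FAILOVER_ON_CHILD_ERROR is the name suggested in this thread and is not an existing backend_flag value, and the type and function names (backend_policy, failure_source, should_failover) are hypothetical. It also ignores fail_over_on_backend_error for brevity.

```c
#include <stdbool.h>

/* Hypothetical per-backend policy. The first two mirror existing
 * backend_flag values; the third is only the proposal from this thread. */
typedef enum {
    ALLOW_TO_FAILOVER,
    DISALLOW_TO_FAILOVER,
    DISALLOW_TO_FAILOVER_ON_CHILD_ERROR
} backend_policy;

/* Where the failure report came from (hypothetical names). */
typedef enum {
    FAILURE_CHILD_ERROR,            /* a child process could not reach the backend */
    FAILURE_HEALTH_CHECK_EXHAUSTED  /* health check failed after all retries */
} failure_source;

/* Decide whether a reported failure should trigger failover. */
bool should_failover(backend_policy policy, failure_source source)
{
    switch (policy) {
    case ALLOW_TO_FAILOVER:
        return true;
    case DISALLOW_TO_FAILOVER:
        return false;   /* never, not even after health check retries */
    case DISALLOW_TO_FAILOVER_ON_CHILD_ERROR:
        /* only a health check that used up its retries may fail over */
        return source == FAILURE_HEALTH_CHECK_EXHAUSTED;
    }
    return false;
}
```

With such a flag, child-reported errors would only cost individual client connections, while the decision to degenerate the backend would stay with the health check and its retry settings.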
>> >>>>>>>> >> >>> > 2012/1/13 Tatsuo Ishii <ishii at postgresql.org>
>> >>>>>>>> >> >>> >
>> >>>>>>>> >> >>> >> Thanks for pointing it out.
>> >>>>>>>> >> >>> >> Yes, checking DISALLOW_TO_FAILOVER before retrying is wrong.
>> >>>>>>>> >> >>> >> However, once the retry count is exceeded, I think we should check
>> >>>>>>>> >> >>> >> DISALLOW_TO_FAILOVER.
>> >>>>>>>> >> >>> >> Attached is a patch that attempts to fix it. Please try it.
>> >>>>>>>> >> >>> >> --
>> >>>>>>>> >> >>> >> Tatsuo Ishii
>> >>>>>>>> >> >>> >> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >>
>> >>>>>>>> >> >>> >> > pgpool is being used in raw mode - just for (health check based) failover
>> >>>>>>>> >> >>> >> > part, so applications are not required to restart when standby gets
>> >>>>>>>> >> >>> >> > promoted to new master. Here is pgpool.conf file and a very small patch
>> >>>>>>>> >> >>> >> > we're using applied to pgpool 3.1.1 release.
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> > We have to have DISALLOW_TO_FAILOVER set for the backend since any child
>> >>>>>>>> >> >>> >> > process that detects condition that master/backend0 is not available, if
>> >>>>>>>> >> >>> >> > DISALLOW_TO_FAILOVER was not set, will degenerate backend without giving
>> >>>>>>>> >> >>> >> > health check a chance to retry. We need health check with retries because
>> >>>>>>>> >> >>> >> > condition that backend0 is not available could be temporary (network
>> >>>>>>>> >> >>> >> > glitches to the remote site where master is, or deliberate failover of
>> >>>>>>>> >> >>> >> > master postgres service from one node to the other on remote site - in both
>> >>>>>>>> >> >>> >> > cases remote means remote to the pgpool that is going to perform health
>> >>>>>>>> >> >>> >> > checks and ultimately the failover) and we don't want standby to be
>> >>>>>>>> >> >>> >> > promoted as easily to a new master, to prevent temporary network conditions,
>> >>>>>>>> >> >>> >> > which could occur frequently, from frequently causing split brain with two
>> >>>>>>>> >> >>> >> > masters.
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> > But then, with DISALLOW_TO_FAILOVER set, without the patch health check
>> >>>>>>>> >> >>> >> > will not retry and will thus give only one chance to backend (if health
>> >>>>>>>> >> >>> >> > check ever occurs before child process failure to connect to the backend),
>> >>>>>>>> >> >>> >> > so the retry settings are effectively ignored. That's where this patch
>> >>>>>>>> >> >>> >> > comes into action - it enables health check retries while child processes
>> >>>>>>>> >> >>> >> > are prevented from degenerating the backend.
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> > I don't think, but I could be wrong, that this patch influences the
>> >>>>>>>> >> >>> >> > behavior we're seeing with unwanted health check attempt delays. Also,
>> >>>>>>>> >> >>> >> > knowing this, maybe pgpool could be patched or some other support be built
>> >>>>>>>> >> >>> >> > into it to cover this use case.
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> > Regards,
>> >>>>>>>> >> >>> >> > Stevo.
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> > 2012/1/12 Tatsuo Ishii <ishii at postgresql.org>
>> >>>>>>>> >> >>> >> >
>> >>>>>>>> >> >>> >> >> I have accepted the moderation request. Your post should be sent shortly.
>> >>>>>>>> >> >>> >> >> Also I have raised the post size limit to 1MB.
>> >>>>>>>> >> >>> >> >> I will look into this...
>> >>>>>>>> >> >>> >> >> --
>> >>>>>>>> >> >>> >> >> Tatsuo Ishii
>> >>>>>>>> >> >>> >> >> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> >> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >> >>
>> >>>>>>>> >> >>> >> >> > Here is the log file and strace output file (this time in an archive,
>> >>>>>>>> >> >>> >> >> > didn't know about 200KB constraint on post size which requires moderator
>> >>>>>>>> >> >>> >> >> > approval). Timings configured are 30sec health check interval, 5sec
>> >>>>>>>> >> >>> >> >> > timeout, and 2 retries with 10sec retry delay.
>> >>>>>>>> >> >>> >> >> >
>> >>>>>>>> >> >>> >> >> > It takes a lot more than 5sec from started health check to sleeping 10sec
>> >>>>>>>> >> >>> >> >> > for first retry.
>> >>>>>>>> >> >>> >> >> >
>> >>>>>>>> >> >>> >> >> > Seen in code (main.c, health_check() function), within (retry) attempt
>> >>>>>>>> >> >>> >> >> > there is inner retry (first with postgres database then with template1) and
>> >>>>>>>> >> >>> >> >> > that part doesn't seem to be interrupted by alarm.
>> >>>>>>>> >> >>> >> >> >
>> >>>>>>>> >> >>> >> >> > Regards,
>> >>>>>>>> >> >>> >> >> > Stevo.
>> >>>>>>>> >> >>> >> >> >
>> >>>>>>>> >> >>> >> >> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
>> >>>>>>>> >> >>> >> >> >
>> >>>>>>>> >> >>> >> >> >> Here is the log file and strace output file. Timings configured are 30sec
>> >>>>>>>> >> >>> >> >> >> health check interval, 5sec timeout, and 2 retries with 10sec retry delay.
>> >>>>>>>> >> >>> >> >> >>
>> >>>>>>>> >> >>> >> >> >> It takes a lot more than 5sec from started health check to sleeping 10sec
>> >>>>>>>> >> >>> >> >> >> for first retry.
>> >>>>>>>> >> >>> >> >> >>
>> >>>>>>>> >> >>> >> >> >> Seen in code (main.c, health_check() function), within (retry) attempt
>> >>>>>>>> >> >>> >> >> >> there is inner retry (first with postgres database then with template1) and
>> >>>>>>>> >> >>> >> >> >> that part doesn't seem to be interrupted by alarm.
>> >>>>>>>> >> >>> >> >> >>
>> >>>>>>>> >> >>> >> >> >> Regards,
>> >>>>>>>> >> >>> >> >> >> Stevo.
>> >>>>>>>> >> >>> >> >> >>
>> >>>>>>>> >> >>> >> >> >>
>> >>>>>>>> >> >>> >> >> >> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
>> >>>>>>>> >> >>> >> >> >>
>> >>>>>>>> >> >>> >> >> >>> Ok, I will do it. In the mean time you could use "strace -tt -p PID"
>> >>>>>>>> >> >>> >> >> >>> to see which system call is blocked.
>> >>>>>>>> >> >>> >> >> >>> --
>> >>>>>>>> >> >>> >> >> >>> Tatsuo Ishii
>> >>>>>>>> >> >>> >> >> >>> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> >> >>> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> >> >>> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >> >> >>>
>> >>>>>>>> >> >>> >> >> >>> > OK, got the info - key point is that ip forwarding is disabled for security
>> >>>>>>>> >> >>> >> >> >>> > reasons. Rules in iptables are not important, iptables can be stopped, or
>> >>>>>>>> >> >>> >> >> >>> > previously added rules removed.
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic for
>> >>>>>>>> >> >>> >> >> >>> > providing this):
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > 1.) make sure that ip forwarding is off:
>> >>>>>>>> >> >>> >> >> >>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
>> >>>>>>>> >> >>> >> >> >>> > 2.) create IP alias on some interface (and have postgres listen on it):
>> >>>>>>>> >> >>> >> >> >>> >     ip addr add x.x.x.x/yy dev ethz
>> >>>>>>>> >> >>> >> >> >>> > 3.) set backend_hostname0 to aforementioned IP
>> >>>>>>>> >> >>> >> >> >>> > 4.) start pgpool and monitor health checks
>> >>>>>>>> >> >>> >> >> >>> > 5.) remove IP alias:
>> >>>>>>>> >> >>> >> >> >>> >     ip addr del x.x.x.x/yy dev ethz
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > Here is the interesting part in pgpool log after this:
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host 192.168.2.27 at port 5432 is down
>> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10 second(s)
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > That pgpool was configured with health check interval of 30sec, 5sec
>> >>>>>>>> >> >>> >> >> >>> > timeout, and 10sec retry delay with 2 max retries.
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > Making use of libpq instead for connecting to db in health checks IMO
>> >>>>>>>> >> >>> >> >> >>> > should resolve it, but you are best placed to determine which call exactly
>> >>>>>>>> >> >>> >> >> >>> > gets blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured
>> >>>>>>>> >> >>> >> >> >>> > respects that env var timeout.
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > Regards,
>> >>>>>>>> >> >>> >> >> >>> > Stevo.
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>> >>>>>>>> >> >>> >> >> >>> >
>> >>>>>>>> >> >>> >> >> >>> >> Tatsuo,
>> >>>>>>>> >> >>> >> >> >>> >>
>> >>>>>>>> >> >>> >> >> >>> >> Did you restart iptables after adding rule?
>> >>>>>>>> >> >>> >> >> >>> >>
>> >>>>>>>> >> >>> >> >> >>> >> Regards,
>> >>>>>>>> >> >>> >> >> >>> >> Stevo.
>> >>>>>>>> >> >>> >> >> >>> >>
>> >>>>>>>> >> >>> >> >> >>> >>
>> >>>>>>>> >> >>> >> >> >>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>> >>>>>>>> >> >>> >> >> >>> >>
>> >>>>>>>> >> >>> >> >> >>> >>> Looking into this to verify if these are all necessary changes to have
>> >>>>>>>> >> >>> >> >> >>> >>> port unreachable message silently rejected (suspecting some kernel
>> >>>>>>>> >> >>> >> >> >>> >>> parameter tuning is needed).
>> >>>>>>>> >> >>> >> >> >>> >>>
>> >>>>>>>> >> >>> >> >> >>> >>> Just to clarify, it's not a problem that host is being detected by pgpool
>> >>>>>>>> >> >>> >> >> >>> >>> to be down, but the timing when that happens. On the environment where the
>> >>>>>>>> >> >>> >> >> >>> >>> issue is reproduced, pgpool as part of health check attempt tries to connect
>> >>>>>>>> >> >>> >> >> >>> >>> to backend and hangs for tcp timeout instead of being interrupted by timeout
>> >>>>>>>> >> >>> >> >> >>> >>> alarm. Can you verify/confirm please the health check retry timings are not
>> >>>>>>>> >> >>> >> >> >>> >>> delayed?
>> >>>>>>>> >> >>> >> >> >>> >>>
>> >>>>>>>> >> >>> >> >> >>> >>> Regards,
>> >>>>>>>> >> >>> >> >> >>> >>> Stevo.
>> >>>>>>>> >> >>> >> >> >>> >>>
>> >>>>>>>> >> >>> >> >> >>> >>>
>> >>>>>>>> >> >>> >> >> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>>>>>>> >> >>> >> >> >>> >>>
>> >>>>>>>> >> >>> >> >> >>> >>>> Ok, I did:
>> >>>>>>>> >> >>> >> >> >>> >>>>
>> >>>>>>>> >> >>> >> >> >>> >>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>> >>>>>>>> >> >>> >> >> >>> >>>>
>> >>>>>>>> >> >>> >> >> >>> >>>> on the host where pgpool is running. And pull network cable from
>> >>>>>>>> >> >>> >> >> >>> >>>> backend0 host network interface. Pgpool detected the host being down
>> >>>>>>>> >> >>> >> >> >>> >>>> as expected...
>> >>>>>>>> >> >>> >> >> >>> >>>> --
>> >>>>>>>> >> >>> >> >> >>> >>>> Tatsuo Ishii
>> >>>>>>>> >> >>> >> >> >>> >>>> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> >> >>> >>>> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> >> >>> >>>> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >> >> >>> >>>>
>> >>>>>>>> >> >>> >> >> >>> >>>> > Backend is not destination of this message, pgpool host is, and we don't
>> >>>>>>>> >> >>> >> >> >>> >>>> > want it to ever get it. With command I've sent you rule will be created for
>> >>>>>>>> >> >>> >> >> >>> >>>> > any source and destination.
>> >>>>>>>> >> >>> >> >> >>> >>>> >
>> >>>>>>>> >> >>> >> >> >>> >>>> > Regards,
>> >>>>>>>> >> >>> >> >> >>> >>>> > Stevo.
>> >>>>>>>> >> >>> >> >> >>> >>>> >
>> >>>>>>>> >> >>> >> >> >>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>>>>>>> >> >>> >> >> >>> >>>> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> I did following:
>> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> Do following on the host where pgpool is running on:
>> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d 133.137.177.124
>> >>>>>>>> >> >>> >> >> >>> >>>> >> (133.137.177.124 is the host where backend is running on)
>> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> Pull network cable from backend0 host network interface. Pgpool
>> >>>>>>>> >> >>> >> >> >>> >>>> >> detected the host being down as expected. Am I missing something?
>> >>>>>>>> >> >>> >> >> >>> >>>> >> --
>> >>>>>>>> >> >>> >> >> >>> >>>> >> Tatsuo Ishii
>> >>>>>>>> >> >>> >> >> >>> >>>> >> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> >> >>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> >> >>> >>>> >> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > Hello Tatsuo,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > With backend0 on one host just configure following rule on other host
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > where pgpool is:
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > and then have pgpool startup with health checking and retrying configured,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > and then pull network cable from backend0 host network interface.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > Regards,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > Stevo.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> I want to try to test the situation you described:
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > When system is configured for security reasons not to return destination
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > host unreachable messages, even though health_check_timeout is
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> But I don't know how to do it. I pulled out the network cable and
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> pgpool detected it as expected. Also I configured the server which
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool detected
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> the error as expected.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> Could you please instruct me?
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> --
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> Tatsuo Ishii
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Hello Tatsuo,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Thank you for replying!
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code analysis I
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > suspect it is the part where a connection is made to the db and it doesn't
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > seem to get interrupted by alarm. I tested health check behaviour
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > thoroughly: it works really well when host/ip is there and just
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > backend/postgres is down, but not when backend host/ip is down. I could
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > see in log that initial health check and each retry got delayed when
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > host/ip is not reachable, while when just backend is not listening (is
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > down) on the reachable host/ip then initial health check and all retries
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > follow the settings in pgpool.conf exactly.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment variables in
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > the docs (see http://www.postgresql.org/docs/9.1/static/libpq-envars.html).
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > There is an equivalent parameter in libpq PQconnectdbParams (see
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT).
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > At the beginning of that same page there is some important information on
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > using these functions.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Regards,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Stevo.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > Hello pgpool community,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> >
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > When system is configured for security reasons not to return destination
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > host unreachable messages, even though health_check_timeout is configured,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > socket call will block and alarm will not get raised until TCP timeout
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > occurs.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> Interesting. So are you saying that read(2) cannot be interrupted by
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> alarm signal if the system is configured not to return destination
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> host unreachable message? Could you please guide me to where I can get
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> such info? (I'm not a network expert).
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > Not a C programmer, found some info that select call could be replaced
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > with select/pselect calls. Maybe it would be best if PGCONNECT_TIMEOUT
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > value could be used here for connection timeout. pgpool has libpq as
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > dependency, why isn't it using libpq for the healthcheck db connect
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > calls, then PGCONNECT_TIMEOUT would be applied?
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> I don't think libpq uses select/pselect for establishing connection,
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> but using libpq instead of homebrew code seems to be an idea. Let me
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> think about it.
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> One question. Are you sure that libpq can deal with the case (not to
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> return destination host unreachable messages) by using
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> PGCONNECT_TIMEOUT?
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> --
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> Tatsuo Ishii
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> SRA OSS, Inc. Japan
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
>> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
>>
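The quoted thread describes connect attempts that hang until the TCP timeout when no ICMP unreachable message comes back. One common way to bound the connect phase without relying on a signal is a non-blocking connect() followed by select() with a timeout. The sketch below illustrates only that general technique. It is not the patch discussed in this thread, connect_with_timeout is a hypothetical helper name, and it handles only numeric IPv4 addresses and whole-second timeouts for brevity.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect to ipv4:port, waiting at most timeout_sec seconds.
 * Returns a connected, blocking socket fd, or -1 with errno set
 * (ETIMEDOUT when the deadline passed without an answer). */
int connect_with_timeout(const char *ipv4, unsigned short port, int timeout_sec)
{
    struct sockaddr_in addr;
    struct timeval tv;
    fd_set wfds;
    int fd, flags, rc, saved, err = 0;
    socklen_t len = sizeof(err);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        goto fail;

    rc = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
    if (rc < 0 && errno != EINPROGRESS)
        goto fail;

    if (rc < 0) {
        /* Handshake in progress: wait until the socket becomes writable. */
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;
        rc = select(fd + 1, NULL, &wfds, NULL, &tv);
        if (rc == 0) {
            errno = ETIMEDOUT;      /* nothing came back in time */
            goto fail;
        }
        if (rc < 0)
            goto fail;
        /* Writable does not mean connected: fetch the final result. */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
            goto fail;
        if (err != 0) {
            errno = err;
            goto fail;
        }
    }

    /* Restore the original (blocking) mode for the caller. */
    if (fcntl(fd, F_SETFL, flags) < 0)
        goto fail;
    return fd;

fail:
    saved = errno;
    close(fd);
    errno = saved;
    return -1;
}
```

A health check built this way would, for example, call connect_with_timeout("192.168.2.27", 5432, 5) and treat -1 with ETIMEDOUT as a failed attempt, so a silent host costs about 5 seconds per attempt rather than the full TCP timeout.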