[pgpool-general: 242] Re: Healthcheck timeout not always respected

Tatsuo Ishii ishii at postgresql.org
Fri Feb 24 08:39:52 JST 2012


> Hello Tatsuo,
> 
> Thank you for accepting and improving the majority of the changes.
> Unfortunately, the part that was not accepted is a show stopper, so I still
> have to use a patched/customized pgpool version in production, since it
> still seems impossible to configure pgpool so that the health check alone
> controls if and when failover should be triggered. With the latest sources,
> and pgpool configured as you suggested (backend flag set to
> ALLOW_TO_FAILOVER, and fail_over_on_backend_error set to off) with two
> backends in raw mode, after the initial stable state pgpool triggered
> failover of the primary backend as soon as that backend became inaccessible
> to pgpool, without giving the health check a chance. The primary backend was
> shut down to simulate failure/relocation, but the same would happen if
> connecting to the backend merely failed because of a temporary network
> issue.
> 
> This behaviour is in line with the documentation of the
> fail_over_on_backend_error configuration parameter, which states:
> "Please note that even if this parameter is set to off, however, pgpool
> will also do the fail over when connecting to a backend fails or pgpool
> detects the administrative shutdown of postmaster."
> 
> But it is a perfectly valid requirement to want to prevent failover from
> occurring immediately when connecting to a backend fails - that condition
> could be temporary, e.g. a transient network issue. Health check retries
> were designed to cover this situation: the health check may fail to connect
> several times, yet all is fine and no failover should occur as long as the
> backend is accessible again after the configured number of retries.

I understand the point. It would be nice to retry connecting to the
backend in the pgpool child, as the health check does.
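
For reference, here is a minimal pgpool.conf sketch of the combination
discussed in this thread (parameter values are illustrative only; the
interval/timeout/retry values match those mentioned later in the thread):
fail_over_on_backend_error is off so that backend socket errors alone do not
degenerate a backend, the backend keeps ALLOW_TO_FAILOVER so that the health
check can still trigger failover, and retries are configured to ride out
short outages.

fail_over_on_backend_error = off    # do not fail over on backend write errors
backend_flag0 = 'ALLOW_TO_FAILOVER' # DISALLOW_TO_FAILOVER would also block
                                    # failover triggered by the health check
health_check_period = 30            # seconds between health checks
health_check_timeout = 5            # timeout of a single check, in seconds
health_check_max_retries = 2        # retries before degenerating the backend
health_check_retry_delay = 10       # seconds between retries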

> Also, it is a perfectly valid requirement to prevent failover from occurring
> immediately when an administrative shutdown of the backend is performed. For
> example, for high availability and easy maintenance a single backend can be
> configured as a cluster service with e.g. two or more nodes on which it can
> run, while it actually runs on only one node at a given point in time. So
> when the admin wants to upgrade the postgres installation on each of the
> nodes within the cluster, to upgrade the node where the postgres service is
> currently active the admin relocates the service to some other node in the
> cluster. Relocation stops (administratively shuts down) the postgres service
> on the currently active node and starts it on another node. pgpool which is
> configured to use such a clustered postgres service as a single backend
> (bound to the cluster service IP) should not perform failover on a detected
> administrative shutdown - relocation takes time, the health check is
> configured to give relocation enough time, and it should be the only thing
> to trigger failover if the backend is still not accessible after the
> configured number of retries and the delays between them.

This I don't understand. Why don't you use pcp_attach_node in this
case after failover?
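
For reference, re-attaching the recovered node after a failover is a single
PCP command; a sketch, assuming the default PCP port 9898, node id 0 and
illustrative credentials:

    # pcp_attach_node <timeout> <host> <pcp-port> <user> <password> <node-id>
    pcp_attach_node 10 localhost 9898 pcpadmin secret 0

After that, "show pool_nodes;" should report the node as attached again.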

> Given these two examples, I hope you'll agree that it is a valid requirement
> to want to let the health check alone control when failover should be
> triggered.
> 
> Unfortunately this is not possible at the moment. Configuring the backend
> flag to DISALLOW_TO_FAILOVER will prevent the health check from triggering
> failover. Setting fail_over_on_backend_error to off still lets failover be
> triggered immediately on temporary conditions that the health check with
> retries should handle.
> 
> Did I miss something? How does one configure pgpool so that the health check
> is the only process in pgpool that triggers failover?
> 
> Kind regards,
> Stevo.
> 
> 2012/2/19 Tatsuo Ishii <ishii at postgresql.org>
> 
>> Stevo,
>>
>> Thanks for the patches. I have committed the changes except the part in
>> which you ignore DISALLOW_TO_FAILOVER. Instead I modified the low level
>> socket reading functions not to unconditionally fail over when reading
>> from backend sockets fails (failover only happens when
>> fail_over_on_backend_error is on). So if you want to trigger failover only
>> when health checking fails, you want to turn off fail_over_on_backend_error
>> and turn off DISALLOW_TO_FAILOVER.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hello Tatsuo,
>> >
>> > Attached is cumulative patch rebased to current master branch head which:
>> > - Fixes health check timeout not always respected (includes unsetting
>> > non-blocking mode after connection has been successfully established);
>> > - Fixes failover on health check only support.
>> >
>> > Kind regards,
>> > Stevo.
>> >
>> > 2012/2/5 Stevo Slavić <sslavic at gmail.com>
>> >
>> >> Tatsuo,
>> >>
>> >> Thank you very much for your time and effort put into analysis of the
>> >> submitted patch,
>> >>
>> >>
>> >> Obviously I'm missing something regarding healthcheck feature, so please
>> >> clarify:
>> >>
>> >>    - what is the purpose of healthcheck when backend flag is set to
>> >>    DISALLOW_TO_FAILOVER? To log that healthchecks are on time but will
>> not
>> >>    actually do anything?
>> >>    - what is the purpose of healthcheck (especially with retries
>> >>    configured) when backend flag is set to ALLOW_TO_FAILOVER? When
>> answering
>> >>    please consider case of non-helloworld application that connects to
>> db via
>> >>    pgpool - will healthcheck be given a chance to fail even once?
>> >>    - since there is no other backend flag value than the mentioned two,
>> >>    what is the purpose of healthcheck (especially with retries
>> configured) if
>> >>    it's not to be the sole process controlling when to failover?
>> >>
>> >> I disagree that changing pgpool to give healthcheck feature a meaning
>> >> disrupts DISALLOW_TO_FAILOVER meaning, it extends it just for case when
>> >> healthcheck is configured - if one doesn't want healthcheck just keep on
>> >> not-using it, it's disabled by default. Health checks and retries have
>> only
>> >> recently been introduced so I doubt there are many if any users of
>> health
>> >> check especially which have configured DISALLOW_TO_FAILOVER with
>> >> expectation to just have health check logging but not actually do
>> anything.
>> >> Out of all pgpool healthcheck users which have backends set to
>> >> DISALLOW_TO_FAILOVER too I believe most of them expect but do not know
>> that
>> >> this will not allow failover on health check, it will just make log
>> bigger.
>> >> Changes included in patch do not affect users which have health check
>> >> configured and backend set to ALLOW_TO_FAILOVER.
>> >>
>> >>
>> >> About non-blocking connection to backend change:
>> >>
>> >>    - with pgpool in raw mode and extensive testing (endurance tests,
>> >>    failover and failback tests), I didn't notice any unwanted change in
>> >>    behaviour, apart from wanted non-blocking timeout aware health
>> checks;
>> >>    - do you see or know about anything in pgpool depending on connection
>> >>    to backend being blocking one? will have a look myself, just asking
>> maybe
>> >>    you've found something already. will look into means to set
>> connection back
>> >>    to being blocking after it's successfully established - maybe just
>> changing
>> >>    that flag will do.
>> >>
>> >>
>> >> Kind regards,
>> >>
>> >> Stevo.
>> >>
>> >>
>> >> On Feb 5, 2012 6:50 AM, "Tatsuo Ishii" <ishii at postgresql.org> wrote:
>> >>
>> >>> Finally I have time to check your patches. Here is the result of my
>> >>> review.
>> >>>
>> >>> > Hello Tatsuo,
>> >>> >
>> >>> > Here is cumulative patch to be applied on pgpool master branch with
>> >>> > following fixes included:
>> >>> >
>> >>> >    1. fix for health check bug
>> >>> >       1. it was not possible to allow backend failover only on failed
>> >>> >       health check(s);
>> >>> >       2. to achieve this one just configures backend to
>> >>> >       DISALLOW_TO_FAILOVER, sets fail_over_on_backend_error to off,
>> and
>> >>> >       configures health checks;
>> >>> >       3. for this fix in code an unwanted check was removed in
>> main.c,
>> >>> >       after health check failed if DISALLOW_TO_FAILOVER was set for
>> >>> backend
>> >>> >       failover would have been always prevented, even when one
>> >>> > configures health
>> >>> >       check whose sole purpose is to control failover
>> >>>
>> >>> This is not acceptable, at least for stable
>> >>> releases. DISALLOW_TO_FAILOVER and fail_over_on_backend_error are
>> >>> for different purposes. The former is for preventing any failover,
>> >>> including by the health check. The latter is for writes to the
>> >>> communication socket.
>> >>>
>> >>> fail_over_on_backend_error = on
>> >>>                                   # Initiates failover when writing to
>> the
>> >>>                                   # backend communication socket fails
>> >>>                                   # This is the same behaviour of
>> >>> pgpool-II
>> >>>                                   # 2.2.x and previous releases
>> >>>                                   # If set to off, pgpool will report
>> an
>> >>>                                   # error and disconnect the session.
>> >>>
>> >>> Your patch changes the existing semantics. Another point is that
>> >>> DISALLOW_TO_FAILOVER allows controlling behavior per backend. Your
>> >>> patch breaks that.
>> >>>
>> >>> >       2. fix for health check bug
>> >>> >       1. health check timeout was not being respected in all
>> conditions
>> >>> >       (icmp host unreachable messages dropped for security reasons,
>> or
>> >>> > no active
>> >>> >       network component to send those message)
>> >>> >       2. for this fix in code (main.c, pool.h,
>> pool_connection_pool.c)
>> >>> inet
>> >>> >       connections have been made to be non blocking, and during
>> >>> connection
>> >>> >       retries status of now global health_check_timer_expired
>> variable
>> >>> is being
>> >>> >       checked
>> >>>
>> >>> This seems good. But I need more investigation. For example, your
>> >>> patch sets sockets to non-blocking but never reverts them back to
>> >>> blocking.
>> >>>
>> >>> >       3. fix for failback bug
>> >>> >       1. in raw mode, after failback (through pcp_attach_node)
>> standby
>> >>> >       node/backend would remain in invalid state
>> >>>
>> >>> It turned out that even failover was bugged. The status was not set to
>> >>> CON_DOWN. This left the status at CON_CONNECT_WAIT and prevented
>> >>> failback from returning to the normal state. I fixed this on the master
>> >>> branch.
>> >>>
>> >>> > (it would be in CON_UP, so on
>> >>> >       failover after failback pgpool would not be able to connect to
>> >>> standby as
>> >>> >       get_next_master_node expects standby nodes/backends in raw mode
>> >>> to be in
>> >>> >       CON_CONNECT_WAIT state when finding next master node)
>> >>> >       2. for this fix in code, when in raw mode on failback status of
>> >>> all
>> >>> >       nodes/backends with CON_UP state is set to CON_CONNECT_WAIT -
>> >>> > all children
>> >>> >       are restarted anyway
>> >>>
>> >>>
>> >>> > Neither of these fixes changes expected behaviour of related
>> features so
>> >>> > there are no changes to the documentation.
>> >>> >
>> >>> >
>> >>> > Kind regards,
>> >>> >
>> >>> > Stevo.
>> >>> >
>> >>> >
>> >>> > 2012/1/24 Tatsuo Ishii <ishii at postgresql.org>
>> >>> >
>> >>> >> > Additional testing confirmed that this fix ensures health check
>> timer
>> >>> >> gets
>> >>> >> > respected (should I create a ticket on some issue tracker? send
>> >>> >> cumulative
>> >>> >> > patch with all changes to have it accepted?).
>> >>> >>
>> >>> >> We have a problem with the Mantis bug tracker and decided to stop
>> >>> >> using it (unless someone volunteers to fix it). Please send a
>> >>> >> cumulative patch against master head to this list so that we will
>> >>> >> be able to look into it (be sure to include English doc changes).
>> >>> >> --
>> >>> >> Tatsuo Ishii
>> >>> >> SRA OSS, Inc. Japan
>> >>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >> Japanese: http://www.sraoss.co.jp
>> >>> >>
>> >>> >> > Problem is that with all the testing another issue has been
>> >>> encountered,
>> >>> >> > now with pcp_attach_node.
>> >>> >> >
>> >>> >> > With pgpool in raw mode and two backends in postgres 9 streaming
>> >>> >> > replication, when backend0 fails, after health checks retries
>> pgpool
>> >>> >> calls
>> >>> >> > failover command and degenerates backend0, backend1 gets promoted
>> to
>> >>> new
>> >>> >> > master, pgpool can connect to that master, and two backends are in
>> >>> pgpool
>> >>> >> > state 3/2. And this is ok and expected.
>> >>> >> >
>> >>> >> > Once backend0 is recovered, it's attached back to pgpool using
>> >>> >> > pcp_attach_node, and pgpool will show two backends in state 2/2
>> (in
>> >>> logs
>> >>> >> > and in show pool_nodes; query) with backend0 taking all the load
>> (raw
>> >>> >> > mode). If after that recovery and attachment of backend0 pgpool is
>> >>> not
>> >>> >> > restarted, and after some time backend0 fails again, after health
>> >>> check
>> >>> >> > retries backend0 will get degenerated, failover command will get
>> >>> called
>> >>> >> > (promotes standby to master), but pgpool will not be able to
>> connect
>> >>> to
>> >>> >> > backend1 (regardless if unix or inet sockets are used for
>> backend1).
>> >>> Only
>> >>> >> > if pgpool is restarted before second (complete) failure of
>> backend0,
>> >>> will
>> >>> >> > pgpool be able to connect to backend1.
>> >>> >> >
>> >>> >> > Following code, pcp_attach_node (failback of backend0) will
>> actually
>> >>> >> > execute same code as for failover. Not sure what, but that
>> failover
>> >>> does
>> >>> >> > something with backend1 state or in memory settings, so that
>> pgpool
>> >>> can
>> >>> >> no
>> >>> >> > longer connect to backend1. Is this a known issue?
>> >>> >> >
>> >>> >> > Kind regards,
>> >>> >> > Stevo.
>> >>> >> >
>> >>> >> > 2012/1/20 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >
>> >>> >> >> Key file was missing from that commit/change - pool.h where
>> >>> >> >> health_check_timer_expired was made global. Included now attached
>> >>> patch.
>> >>> >> >>
>> >>> >> >> Kind regards,
>> >>> >> >> Stevo.
>> >>> >> >>
>> >>> >> >>
>> >>> >> >> 2012/1/20 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>
>> >>> >> >>> Using exit_request was wrong and caused a bug. 4th patch needed
>> -
>> >>> >> >>> health_check_timer_expired is global now so it can be verified
>> if
>> >>> it
>> >>> >> was
>> >>> >> >>> set to 1 outside of main.c
>> >>> >> >>>
>> >>> >> >>>
>> >>> >> >>> Kind regards,
>> >>> >> >>> Stevo.
>> >>> >> >>>
>> >>> >> >>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>>
>> >>> >> >>>> Using exit_code was not wise. Tested and encountered a case
>> where
>> >>> this
>> >>> >> >>>> results in a bug. Have to work on it more. Main issue is how in
>> >>> >> >>>> pool_connection_pool.c connect_inet_domain_socket_by_port
>> >>> function to
>> >>> >> know
>> >>> >> >>>> that health check timer has expired (set to 1). Any ideas?
>> >>> >> >>>>
>> >>> >> >>>> Kind regards,
>> >>> >> >>>> Stevo.
>> >>> >> >>>>
>> >>> >> >>>>
>> >>> >> >>>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>>>
>> >>> >> >>>>> Tatsuo,
>> >>> >> >>>>>
>> >>> >> >>>>> Here are the patches which should be applied to current pgpool
>> >>> head
>> >>> >> for
>> >>> >> >>>>> fixing this issue:
>> >>> >> >>>>>
>> >>> >> >>>>> Fixes-health-check-timeout.patch
>> >>> >> >>>>> Fixes-health-check-retrying-after-failover.patch
>> >>> >> >>>>> Fixes-clearing-exitrequest-flag.patch
>> >>> >> >>>>>
>> >>> >> >>>>> Quirk I noticed in logs was resolved as well - after failover
>> >>> pgpool
>> >>> >> >>>>> would perform healthcheck and report it is doing (max retries
>> +
>> >>> 1) th
>> >>> >> >>>>> health check which was confusing. Rather I've adjusted that it
>> >>> does
>> >>> >> and
>> >>> >> >>>>> reports it's doing a new health check cycle after failover.
>> >>> >> >>>>>
>> >>> >> >>>>> I've tested and it works well - when in raw mode, backends
>> set to
>> >>> >> >>>>> disallow failover, failover on backend failure disabled, and
>> >>> health
>> >>> >> checks
>> >>> >> >>>>> configured with retries (30sec interval, 5sec timeout, 2
>> retries,
>> >>> >> 10sec
>> >>> >> >>>>> delay between retries).
>> >>> >> >>>>>
>> >>> >> >>>>> Please test, and if confirmed ok include in next release.
>> >>> >> >>>>>
>> >>> >> >>>>> Kind regards,
>> >>> >> >>>>>
>> >>> >> >>>>> Stevo.
>> >>> >> >>>>>
>> >>> >> >>>>>
>> >>> >> >>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>>>>
>> >>> >> >>>>>> Here is pgpool.log, strace.out, and pgpool.conf when I tested
>> >>> with
>> >>> >> my
>> >>> >> >>>>>> latest patch for health check timeout applied. It works well,
>> >>> >> except for
>> >>> >> >>>>>> single quirk, after failover completed in log files it was
>> >>> reported
>> >>> >> that
>> >>> >> >>>>>> 3rd health check retry was done (even though just 2 are
>> >>> configured,
>> >>> >> see
>> >>> >> >>>>>> pgpool.conf) and that backend has returned to healthy state.
>> >>> That
>> >>> >> >>>>>> interesting part from log file follows:
>> >>> >> >>>>>>
>> >>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
>> >>> DEBUG: pid
>> >>> >> >>>>>> 1163: retrying 3 th health checking
>> >>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
>> >>> DEBUG: pid
>> >>> >> >>>>>> 1163: health_check: 0 th DB node status: 3
>> >>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
>> LOG:
>> >>>   pid
>> >>> >> >>>>>> 1163: after some retrying backend returned to healthy state
>> >>> >> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15
>> >>> DEBUG: pid
>> >>> >> >>>>>> 1163: starting health checking
>> >>> >> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15
>> >>> DEBUG: pid
>> >>> >> >>>>>> 1163: health_check: 0 th DB node status: 3
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> As can be seen in pgpool.conf, there is only one backend
>> >>> configured.
>> >>> >> >>>>>> pgpool did failover well after health check max retries has
>> been
>> >>> >> reached
>> >>> >> >>>>>> (pgpool just degraded that single backend to 3, and restarted
>> >>> child
>> >>> >> >>>>>> processes).
>> >>> >> >>>>>>
>> >>> >> >>>>>> After this quirk has been logged, next health check logs
>> were as
>> >>> >> >>>>>> expected. Except those couple weird log entries, everything
>> >>> seems
>> >>> >> to be ok.
>> >>> >> >>>>>> Maybe that quirk was caused by single backend only
>> configuration
>> >>> >> corner
>> >>> >> >>>>>> case. Will try tomorrow if it occurs on dual backend
>> >>> configuration.
>> >>> >> >>>>>>
>> >>> >> >>>>>> Regards,
>> >>> >> >>>>>> Stevo.
>> >>> >> >>>>>>
>> >>> >> >>>>>>
>> >>> >> >>>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>>>>>
>> >>> >> >>>>>>> Hello Tatsuo,
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> Unfortunately, with your patch when A is on
>> >>> >> >>>>>>> (pool_config->health_check_period > 0) and B is on, when
>> retry
>> >>> >> count is
>> >>> >> >>>>>>> over, failover will be disallowed because of B being on.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> Nenad's patch allows failover to be triggered only by health
>> >>> check.
>> >>> >> >>>>>>> Here is the patch which includes Nenad's fix but also fixes
>> >>> issue
>> >>> >> with
>> >>> >> >>>>>>> health check timeout not being respected.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> Key points in fix for health check timeout being respected
>> are:
>> >>> >> >>>>>>> - in pool_connection_pool.c
>> connect_inet_domain_socket_by_port
>> >>> >> >>>>>>> function, before trying to connect, file descriptor is set
>> to
>> >>> >> non-blocking
>> >>> >> >>>>>>> mode, and also non-blocking mode error codes are handled,
>> >>> >> EINPROGRESS and
>> >>> >> >>>>>>> EALREADY (please verify changes here, especially regarding
>> >>> closing
>> >>> >> fd)
>> >>> >> >>>>>>> - in main.c health_check_timer_handler has been changed to
>> >>> signal
>> >>> >> >>>>>>> exit_request to health check initiated
>> >>> >> connect_inet_domain_socket_by_port
>> >>> >> >>>>>>> function call (please verify this, maybe there is a better
>> way
>> >>> to
>> >>> >> check
>> >>> >> >>>>>>> from connect_inet_domain_socket_by_port if in
>> >>> >> health_check_timer_expired
>> >>> >> >>>>>>> has been set to 1)
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> These changes will practically make connect attempt to be
>> >>> >> >>>>>>> non-blocking and repeated until:
>> >>> >> >>>>>>> - connection is made, or
>> >>> >> >>>>>>> - unhandled connection error condition is reached, or
>> >>> >> >>>>>>> - health check timer alarm has been raised, or
>> >>> >> >>>>>>> - some other exit request (shutdown) has been issued.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> Kind regards,
>> >>> >> >>>>>>> Stevo.
>> >>> >> >>>>>>>
>> >>> >> >>>>>>> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >>> >> >>>>>>>
>> >>> >> >>>>>>>> Ok, let me clarify use cases regarding failover.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> Currently there are three parameters:
>> >>> >> >>>>>>>> a) health_check
>> >>> >> >>>>>>>> b) DISALLOW_TO_FAILOVER
>> >>> >> >>>>>>>> c) fail_over_on_backend_error
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> Source of errors which can trigger failover are 1)health
>> check
>> >>> >> >>>>>>>> 2)write
>> >>> >> >>>>>>>> to backend socket 3)read backend from socket. I represent
>> >>> each 1)
>> >>> >> as
>> >>> >> >>>>>>>> A, 2) as B, 3) as C.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 1) trigger failover if A or B or C is error
>> >>> >> >>>>>>>> a = on, b = off, c = on
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 2) trigger failover only when B or C is error
>> >>> >> >>>>>>>> a = off, b = off, c = on
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 3) trigger failover only when B is error
>> >>> >> >>>>>>>> Impossible. Because C error always triggers failover.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 4) trigger failover only when C is error
>> >>> >> >>>>>>>> a = off, b = off, c = off
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 5) trigger failover only when A is error(Stevo wants this)
>> >>> >> >>>>>>>> Impossible. Because C error always triggers failover.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 6) never trigger failover
>> >>> >> >>>>>>>> Impossible. Because C error always triggers failover.
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> As you can see, C is the problem here (look at #3, #5 and
>> #6)
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> If we implemented this:
>> >>> >> >>>>>>>> >> However I think we should disable failover if
>> >>> >> >>>>>>>> DISALLOW_TO_FAILOVER set
>> >>> >> >>>>>>>> >> in case of reading data from backend. This should have
>> been
>> >>> >> done
>> >>> >> >>>>>>>> when
>> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is
>> exactly
>> >>> >> what
>> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you
>> >>> think?
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 1) trigger failover if A or B or C is error
>> >>> >> >>>>>>>> a = on, b = off, c = on
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 2) trigger failover only when B or C is error
>> >>> >> >>>>>>>> a = off, b = off, c = on
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 3) trigger failover only when B is error
>> >>> >> >>>>>>>> a = off, b = on, c = on
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 4) trigger failover only when C is error
>> >>> >> >>>>>>>> a = off, b = off, c = off
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 5) trigger failover only when A is error(Stevo wants this)
>> >>> >> >>>>>>>> a = on, b = on, c = off
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> 6) never trigger failover
>> >>> >> >>>>>>>> a = off, b = on, c = off
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> So it seems my patch will solve all the problems including
>> >>> yours.
>> >>> >> >>>>>>>> (timeout while retrying is another issue of course).
>> >>> >> >>>>>>>> --
>> >>> >> >>>>>>>> Tatsuo Ishii
>> >>> >> >>>>>>>> SRA OSS, Inc. Japan
>> >>> >> >>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>> >>> >> >>>>>>>> Japanese: http://www.sraoss.co.jp
>> >>> >> >>>>>>>>
>> >>> >> >>>>>>>> > I agree, fail_over_on_backend_error isn't useful, just
>> adds
>> >>> >> >>>>>>>> confusion by
>> >>> >> >>>>>>>> > overlapping with DISALLOW_TO_FAILOVER.
>> >>> >> >>>>>>>> >
>> >>> >> >>>>>>>> > With your patch or without it, it is not possible to
>> >>> failover
>> >>> >> only
>> >>> >> >>>>>>>> on
>> >>> >> >>>>>>>> > health check (max retries) failure. With Nenad's patch,
>> that
>> >>> >> part
>> >>> >> >>>>>>>> works ok
>> >>> >> >>>>>>>> > and I think that patch is semantically ok - failover
>> occurs
>> >>> even
>> >>> >> >>>>>>>> though
>> >>> >> >>>>>>>> > DISALLOW_TO_FAILOVER is set for backend but only when
>> health
>> >>> >> check
>> >>> >> >>>>>>>> is
>> >>> >> >>>>>>>> > configured too. Configuring health check without
>> failover on
>> >>> >> >>>>>>>> failed health
>> >>> >> >>>>>>>> > check has no purpose. Also health check configured with
>> >>> allowed
>> >>> >> >>>>>>>> failover on
>> >>> >> >>>>>>>> > any condition other than health check (max retries)
>> failure
>> >>> has
>> >>> >> no
>> >>> >> >>>>>>>> purpose.
>> >>> >> >>>>>>>> >
>> >>> >> >>>>>>>> > Kind regards,
>> >>> >> >>>>>>>> > Stevo.
>> >>> >> >>>>>>>> >
>> >>> >> >>>>>>>> > 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >>> >> >>>>>>>> >
>> >>> >> >>>>>>>> >> fail_over_on_backend_error has different meaning from
>> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER. From the doc:
>> >>> >> >>>>>>>> >>
>> >>> >> >>>>>>>> >>  If true, and an error occurs when writing to the
>> backend
>> >>> >> >>>>>>>> >>  communication, pgpool-II will trigger the fail over
>> >>> procedure
>> >>> >> .
>> >>> >> >>>>>>>> This
>> >>> >> >>>>>>>> >>  is the same behavior as of pgpool-II 2.2.x or earlier.
>> If
>> >>> set
>> >>> >> to
>> >>> >> >>>>>>>> >>  false, pgpool will report an error and disconnect the
>> >>> session.
>> >>> >> >>>>>>>> >>
>> >>> >> >>>>>>>> >> This means that if pgpool fails to read from the backend,
>> >>> >> >>>>>>>> >> it will trigger failover even if fail_over_on_backend_error
>> >>> >> >>>>>>>> >> is set to off. So unconditionally disabling failover would
>> >>> >> >>>>>>>> >> lead to backward incompatibility.
>> >>> >> >>>>>>>> >>
>> >>> >> >>>>>>>> >> However I think we should disable failover if
>> >>> >> >>>>>>>> DISALLOW_TO_FAILOVER set
>> >>> >> >>>>>>>> >> in case of reading data from backend. This should have
>> been
>> >>> >> done
>> >>> >> >>>>>>>> when
>> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is
>> exactly
>> >>> >> what
>> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you
>> >>> think?
>> >>> >> >>>>>>>> >> --
>> >>> >> >>>>>>>> >> Tatsuo Ishii
>> >>> >> >>>>>>>> >> SRA OSS, Inc. Japan
>> >>> >> >>>>>>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >> >>>>>>>> >> Japanese: http://www.sraoss.co.jp
>> >>> >> >>>>>>>> >>
>> >>> >> >>>>>>>> >> > For a moment I thought we could have set
>> >>> >> >>>>>>>> fail_over_on_backend_error to
>> >>> >> >>>>>>>> >> off,
>> >>> >> >>>>>>>> >> > and have backends set with ALLOW_TO_FAILOVER flag. But
>> >>> then I
>> >>> >> >>>>>>>> looked in
>> >>> >> >>>>>>>> >> > code.
>> >>> >> >>>>>>>> >> >
>> >>> >> >>>>>>>> >> > In child.c there is a loop child process goes through
>> in
>> >>> its
>> >>> >> >>>>>>>> lifetime.
>> >>> >> >>>>>>>> >> When
>> >>> >> >>>>>>>> >> > fatal error condition occurs before child process
>> exits
>> >>> it
>> >>> >> will
>> >>> >> >>>>>>>> call
>> >>> >> >>>>>>>> >> > notice_backend_error which will call
>> >>> degenerate_backend_set
>> >>> >> >>>>>>>> which will
>> >>> >> >>>>>>>> >> not
>> >>> >> >>>>>>>> >> > take into account fail_over_on_backend_error is set to
>> >>> off,
>> >>> >> >>>>>>>> causing
>> >>> >> >>>>>>>> >> backend
>> >>> >> >>>>>>>> >> > to be degenerated and failover to occur. That's why we
>> >>> have
>> >>> >> >>>>>>>> backends set
>> >>> >> >>>>>>>> >> > with DISALLOW_TO_FAILOVER but with our patch applied,
>> >>> health
>> >>> >> >>>>>>>> check could
>> >>> >> >>>>>>>> >> > cause failover to occur as expected.
>> >>> >> >>>>>>>> >> >
>> >>> >> >>>>>>>> >> > Maybe it would be enough just to modify
>> >>> >> degenerate_backend_set,
>> >>> >> >>>>>>>> to take
>> >>> >> >>>>>>>> >> > fail_over_on_backend_error into account just like it
>> >>> already
>> >>> >> >>>>>>>> takes
>> >>> >> >>>>>>>> >> > DISALLOW_TO_FAILOVER into account.
>> >>> >> >>>>>>>> >> >
>> >>> >> >>>>>>>> >> > Kind regards,
>> >>> >> >>>>>>>> >> > Stevo.
>> >>> >> >>>>>>>> >> >
>> >>> >> >>>>>>>> >> > 2012/1/15 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>>>>>>> >> >
>> >>> >> >>>>>>>> >> >> Yes and that behaviour which you describe as
>> expected,
>> >>> is
>> >>> >> not
>> >>> >> >>>>>>>> what we
>> >>> >> >>>>>>>> >> >> want. We want pgpool to degrade backend0 and failover
>> >>> when
>> >>> >> >>>>>>>> configured
>> >>> >> >>>>>>>> >> max
>> >>> >> >>>>>>>> >> >> health check retries have failed, and to failover
>> only
>> >>> in
>> >>> >> that
>> >>> >> >>>>>>>> case, so
>> >>> >> >>>>>>>> >> not
>> >>> >> >>>>>>>> >> >> sooner e.g. connection/child error condition, but as
>> >>> soon as
>> >>> >> >>>>>>>> max health
>> >>> >> >>>>>>>> >> >> check retries have been attempted.
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> Maybe examples will be more clear.
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> Imagine two nodes (node 1 and node 2). On each node a
>> >>> single
>> >>> >> >>>>>>>> pgpool and
>> >>> >> >>>>>>>> >> a
>> >>> >> >>>>>>>> >> >> single backend. Apps/clients access db through
>> pgpool on
>> >>> >> their
>> >>> >> >>>>>>>> own node.
>> >>> >> >>>>>>>> >> >> Two backends are configured in postgres native
>> streaming
>> >>> >> >>>>>>>> replication.
>> >>> >> >>>>>>>> >> >> pgpools are used in raw mode. Both pgpools have same
>> >>> >> backend as
>> >>> >> >>>>>>>> >> backend0,
>> >>> >> >>>>>>>> >> >> and same backend as backend1.
>> >>> >> >>>>>>>> >> >> initial state: both backends are up and pgpool can
>> >>> access
>> >>> >> >>>>>>>> them, clients
>> >>> >> >>>>>>>> >> >> connect to their pgpool and do their work on master
>> >>> backend,
>> >>> >> >>>>>>>> backend0.
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> 1st case: unmodified/non-patched pgpool 3.1.1 is
>> used,
>> >>> >> >>>>>>>> backends are
>> >>> >> >>>>>>>> >> >> configured with ALLOW_TO_FAILOVER flag
>> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> >>> node 2
>> >>> >> >>>>>>>> and backend0
>> >>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> >>> since
>> >>> >> >>>>>>>> >> >> ALLOW_TO_FAILOVER is set, pgpool performs failover
>> >>> without
>> >>> >> >>>>>>>> giving
>> >>> >> >>>>>>>> >> chance to
>> >>> >> >>>>>>>> >> >> pgpool health check retries to control whether
>> backend
>> >>> is
>> >>> >> just
>> >>> >> >>>>>>>> >> temporarily
>> >>> >> >>>>>>>> >> >> inaccessible
>> >>> >> >>>>>>>> >> >> - failover command on node 2 promotes standby backend
>> >>> to a
>> >>> >> new
>> >>> >> >>>>>>>> master -
>> >>> >> >>>>>>>> >> >> split brain occurs, with two masters
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> 2nd case: unmodified/non-patched pgpool 3.1.1 is
>> used,
>> >>> >> >>>>>>>> backends are
>> >>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> >>> node 2
>> >>> >> >>>>>>>> and backend0
>> >>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> >>> since
>> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >>> >> failover
>> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> >>> condition,
>> >>> >> >>>>>>>> determines
>> >>> >> >>>>>>>> >> that
>> >>> >> >>>>>>>> >> >> it's not accessible, there will be no health check
>> >>> retries
>> >>> >> >>>>>>>> because
>> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, no failover occurs ever
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> 3rd case, pgpool 3.1.1 + patch you've sent applied,
>> and
>> >>> >> >>>>>>>> backends
>> >>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> >>> node 2
>> >>> >> >>>>>>>> and backend0
>> >>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> >>> since
>> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >>> >> failover
>> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> >>> condition,
>> >>> >> >>>>>>>> determines
>> >>> >> >>>>>>>> >> that
>> >>> >> >>>>>>>> >> >> it's not accessible, health check retries happen, and
>> >>> even
>> >>> >> >>>>>>>> after max
>> >>> >> >>>>>>>> >> >> retries, no failover happens since failover is
>> >>> disallowed
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> 4th expected behaviour, pgpool 3.1.1 + patch we sent,
>> >>> and
>> >>> >> >>>>>>>> backends
>> >>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> >>> node 2
>> >>> >> >>>>>>>> and backend0
>> >>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> >>> since
>> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >>> >> failover
>> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> >>> condition,
>> >>> >> >>>>>>>> determines
>> >>> >> >>>>>>>> >> that
>> >>> >> >>>>>>>> >> >> it's not accessible, health check retries happen,
>> >>> before a
>> >>> >> max
>> >>> >> >>>>>>>> retry
>> >>> >> >>>>>>>> >> >> network condition is cleared, retry happens, and
>> >>> backend0
>> >>> >> >>>>>>>> remains to be
>> >>> >> >>>>>>>> >> >> master, no failover occurs, temporary network issue
>> did
>> >>> not
>> >>> >> >>>>>>>> cause split
>> >>> >> >>>>>>>> >> >> brain
>> >>> >> >>>>>>>> >> >> - after some time, temporary network outage happens
>> >>> again
>> >>> >> >>>>>>>> between pgpool
>> >>> >> >>>>>>>> >> >> on node 2 and backend0
>> >>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> >>> since
>> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >>> >> failover
>> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> >>> condition,
>> >>> >> >>>>>>>> determines
>> >>> >> >>>>>>>> >> that
>> >>> >> >>>>>>>> >> >> it's not accessible, health check retries happen,
>> after
>> >>> max
>> >>> >> >>>>>>>> retries
>> >>> >> >>>>>>>> >> >> backend0 is still not accessible, failover happens,
>> >>> standby
>> >>> >> is
>> >>> >> >>>>>>>> new
>> >>> >> >>>>>>>> >> master
>> >>> >> >>>>>>>> >> >> and backend0 is degraded
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> Kind regards,
>> >>> >> >>>>>>>> >> >> Stevo.
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >>> >> >>>>>>>> >> >>
>> >>> >> >>>>>>>> >> >>> In my test evironment, the patch works as expected.
>> I
>> >>> have
>> >>> >> two
>> >>> >> >>>>>>>> >> >>> backends. Health check retry conf is as follows:
>> >>> >> >>>>>>>> >> >>>
>> >>> >> >>>>>>>> >> >>> health_check_max_retries = 3
>> >>> >> >>>>>>>> >> >>> health_check_retry_delay = 1
>> >>> >> >>>>>>>> >> >>>
>> >>> >> >>>>>>>> >> >>> 5 09:17:20 LOG:   pid 21411: Backend status file
>> >>> >> >>>>>>>> /home/t-ishii/work/
>> >>> >> >>>>>>>> >> >>> git.postgresql.org/test/log/pgpool_status discarded
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: pgpool-II
>> >>> >> successfully
>> >>> >> >>>>>>>> started.
>> >>> >> >>>>>>>> >> >>> version 3.2alpha1 (hatsuiboshi)
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411:
>> >>> find_primary_node:
>> >>> >> >>>>>>>> primary node
>> >>> >> >>>>>>>> >> id
>> >>> >> >>>>>>>> >> >>> is 0
>> >>> >> >>>>>>>> >> >>> -- backend1 was shutdown
>> >>> >> >>>>>>>> >> >>>
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >>> >> >>>>>>>> check_replication_time_lag: could
>> >>> >> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>> >>> >> >>>>>>>> sr_check_password
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> -- health check failed
>> >>> >> >>>>>>>> >> >>>
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411: health check
>> >>> failed.
>> >>> >> 1
>> >>> >> >>>>>>>> th host
>> >>> >> >>>>>>>> >> /tmp
>> >>> >> >>>>>>>> >> >>> at port 11001 is down
>> >>> >> >>>>>>>> >> >>> -- start retrying
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 LOG:   pid 21411: health check
>> >>> retry
>> >>> >> >>>>>>>> sleep time: 1
>> >>> >> >>>>>>>> >> >>> second(s)
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411: health check
>> >>> failed.
>> >>> >> 1
>> >>> >> >>>>>>>> th host
>> >>> >> >>>>>>>> >> /tmp
>> >>> >> >>>>>>>> >> >>> at port 11001 is down
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 LOG:   pid 21411: health check
>> >>> retry
>> >>> >> >>>>>>>> sleep time: 1
>> >>> >> >>>>>>>> >> >>> second(s)
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411: health check
>> >>> failed.
>> >>> >> 1
>> >>> >> >>>>>>>> th host
>> >>> >> >>>>>>>> >> /tmp
>> >>> >> >>>>>>>> >> >>> at port 11001 is down
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 LOG:   pid 21411: health check
>> >>> retry
>> >>> >> >>>>>>>> sleep time: 1
>> >>> >> >>>>>>>> >> >>> second(s)
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411: health check
>> >>> failed.
>> >>> >> 1
>> >>> >> >>>>>>>> th host
>> >>> >> >>>>>>>> >> /tmp
>> >>> >> >>>>>>>> >> >>> at port 11001 is down
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 LOG:   pid 21411: health_check:
>> 1
>> >>> >> >>>>>>>> failover is
>> >>> >> >>>>>>>> >> canceld
>> >>> >> >>>>>>>> >> >>> because failover is disallowed
>> >>> >> >>>>>>>> >> >>> -- after 3 retries, pgpool wanted to failover, but
>> >>> gave up
>> >>> >> >>>>>>>> because
>> >>> >> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER is set for backend1
>> >>> >> >>>>>>>> >> >>>
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >>> >> >>>>>>>> check_replication_time_lag: could
>> >>> >> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>> >>> >> >>>>>>>> sr_check_password
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411: health check
>> >>> failed.
>> >>> >> 1
>> >>> >> >>>>>>>> th host
>> >>> >> >>>>>>>> >> /tmp
>> >>> >> >>>>>>>> >> >>> at port 11001 is down
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 LOG:   pid 21411: health check
>> >>> retry
>> >>> >> >>>>>>>> sleep time: 1
>> >>> >> >>>>>>>> >> >>> second(s)
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
>> file
>> >>> or
>> >>> >> >>>>>>>> directory
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>> >>> >> >>>>>>>> make_persistent_db_connection:
>> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411: health check
>> >>> failed.
>> >>> >> 1
>> >>> >> >>>>>>>> th host
>> >>> >> >>>>>>>> >> /tmp
>> >>> >> >>>>>>>> >> >>> at port 11001 is down
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 LOG:   pid 21411: health check
>> >>> retry
>> >>> >> >>>>>>>> sleep time: 1
>> >>> >> >>>>>>>> >> >>> second(s)
>> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:05 LOG:   pid 21411: after some
>> >>> retrying
>> >>> >> >>>>>>>> backend
>> >>> >> >>>>>>>> >> >>> returned to healthy state
>> >>> >> >>>>>>>> >> >>> -- started backend1 and pgpool succeeded in health
>> >>> >> checking.
>> >>> >> >>>>>>>> Resumed
>> >>> >> >>>>>>>> >> >>> using backend1
>> >>> >> >>>>>>>> >> >>> --
>> >>> >> >>>>>>>> >> >>> Tatsuo Ishii
>> >>> >> >>>>>>>> >> >>> SRA OSS, Inc. Japan
>> >>> >> >>>>>>>> >> >>> English: http://www.sraoss.co.jp/index_en.php
>> >>> >> >>>>>>>> >> >>> Japanese: http://www.sraoss.co.jp
>> >>> >> >>>>>>>> >> >>>
>> >>> >> >>>>>>>> >> >>> > Hello Tatsuo,
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> > Thank you for the patch and effort, but
>> unfortunately
>> >>> >> this
>> >>> >> >>>>>>>> change
>> >>> >> >>>>>>>> >> won't
>> >>> >> >>>>>>>> >> >>> > work for us. We need to set disallow failover to
>> >>> prevent
>> >>> >> >>>>>>>> failover on
>> >>> >> >>>>>>>> >> >>> child
>> >>> >> >>>>>>>> >> >>> > reported connection errors (it's ok if few clients
>> >>> lose
>> >>> >> >>>>>>>> their
>> >>> >> >>>>>>>> >> >>> connection or
>> >>> >> >>>>>>>> >> >>> > can not connect), and still have pgpool perform
>> >>> failover
>> >>> >> >>>>>>>> but only on
>> >>> >> >>>>>>>> >> >>> failed
>> >>> >> >>>>>>>> >> >>> > health check (if configured, after max retries
>> >>> threshold
>> >>> >> >>>>>>>> has been
>> >>> >> >>>>>>>> >> >>> reached).
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> > Maybe it would be best to add an extra value for
>> >>> >> >>>>>>>> backend_flag -
>> >>> >> >>>>>>>> >> >>> > ALLOW_TO_FAILOVER_ON_HEALTH_CHECK or
>> >>> >> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER_ON_CHILD_ERROR.
>> >>> >> >>>>>>>> >> >>> > It should behave same as DISALLOW_TO_FAILOVER is
>> set,
>> >>> >> with
>> >>> >> >>>>>>>> only
>> >>> >> >>>>>>>> >> >>> difference
>> >>> >> >>>>>>>> >> >>> > in behaviour when health check (if set, max
>> retries)
>> >>> has
>> >>> >> >>>>>>>> failed -
>> >>> >> >>>>>>>> >> unlike
>> >>> >> >>>>>>>> >> >>> > DISALLOW_TO_FAILOVER, this new flag should allow
>> >>> failover
>> >>> >> >>>>>>>> in this
>> >>> >> >>>>>>>> >> case
>> >>> >> >>>>>>>> >> >>> only.
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> > Without this change health check (especially
>> health
>> >>> check
>> >>> >> >>>>>>>> retries)
>> >>> >> >>>>>>>> >> >>> doesn't
>> >>> >> >>>>>>>> >> >>> > make much sense - child error is more likely to
>> >>> occur on
>> >>> >> >>>>>>>> (temporary)
>> >>> >> >>>>>>>> >> >>> > backend failure than health check and will or will
>> >>> not
>> >>> >> cause
>> >>> >> >>>>>>>> >> failover to
>> >>> >> >>>>>>>> >> >>> > occur depending on backend flag, without giving
>> >>> health
>> >>> >> >>>>>>>> check retries
>> >>> >> >>>>>>>> >> a
>> >>> >> >>>>>>>> >> >>> > chance to determine if failure was temporary or
>> not,
>> >>> >> >>>>>>>> risking split
>> >>> >> >>>>>>>> >> brain
>> >>> >> >>>>>>>> >> >>> > situation with two masters just because of
>> temporary
>> >>> >> >>>>>>>> network link
>> >>> >> >>>>>>>> >> >>> hiccup.
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> > Our main problem remains though with the health
>> check
>> >>> >> >>>>>>>> timeout not
>> >>> >> >>>>>>>> >> being
>> >>> >> >>>>>>>> >> >>> > respected in these special conditions we have.
>> Maybe
>> >>> >> Nenad
>> >>> >> >>>>>>>> can help
>> >>> >> >>>>>>>> >> you
>> >>> >> >>>>>>>> >> >>> > more to reproduce the issue on your environment.
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> > Kind regards,
>> >>> >> >>>>>>>> >> >>> > Stevo.
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> > 2012/1/13 Tatsuo Ishii <ishii at postgresql.org>
>> >>> >> >>>>>>>> >> >>> >
>> >>> >> >>>>>>>> >> >>> >> Thanks for pointing it out.
>> >>> >> >>>>>>>> >> >>> >> Yes, checking DISALLOW_TO_FAILOVER before
>> retrying
>> >>> is
>> >>> >> >>>>>>>> wrong.
>> >>> >> >>>>>>>> >> >>> >> However, after retry count over, we should check
>> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER I
>> >>> >> >>>>>>>> >> >>> >> think.
>> >>> >> >>>>>>>> >> >>> >> Attached is the patch attempt to fix it. Please
>> try.
>> >>> >> >>>>>>>> >> >>> >> --
>> >>> >> >>>>>>>> >> >>> >> Tatsuo Ishii
>> >>> >> >>>>>>>> >> >>> >> SRA OSS, Inc. Japan
>> >>> >> >>>>>>>> >> >>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >> >>>>>>>> >> >>> >> Japanese: http://www.sraoss.co.jp
>> >>> >> >>>>>>>> >> >>> >>
>> >>> >> >>>>>>>> >> >>> >> > pgpool is being used in raw mode - just for
>> >>> (health
>> >>> >> >>>>>>>> check based)
>> >>> >> >>>>>>>> >> >>> failover
>> >>> >> >>>>>>>> >> >>> >> > part, so applications are not required to
>> restart
>> >>> when
>> >>> >> >>>>>>>> standby
>> >>> >> >>>>>>>> >> gets
>> >>> >> >>>>>>>> >> >>> >> > promoted to new master. Here is pgpool.conf
>> file
>> >>> and a
>> >>> >> >>>>>>>> very small
>> >>> >> >>>>>>>> >> >>> patch
>> >>> >> >>>>>>>> >> >>> >> > we're using applied to pgpool 3.1.1 release.
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> > We have to have DISALLOW_TO_FAILOVER set for
>> the
>> >>> >> backend
>> >>> >> >>>>>>>> since any
>> >>> >> >>>>>>>> >> >>> child
>> >>> >> >>>>>>>> >> >>> >> > process that detects condition that
>> >>> master/backend0 is
>> >>> >> >>>>>>>> not
>> >>> >> >>>>>>>> >> >>> available, if
>> >>> >> >>>>>>>> >> >>> >> > DISALLOW_TO_FAILOVER was not set, will
>> degenerate
>> >>> >> >>>>>>>> backend without
>> >>> >> >>>>>>>> >> >>> giving
>> >>> >> >>>>>>>> >> >>> >> > health check a chance to retry. We need health
>> >>> check
>> >>> >> >>>>>>>> with retries
>> >>> >> >>>>>>>> >> >>> because
>> >>> >> >>>>>>>> >> >>> >> > condition that backend0 is not available could
>> be
>> >>> >> >>>>>>>> temporary
>> >>> >> >>>>>>>> >> (network
>> >>> >> >>>>>>>> >> >>> >> > glitches to the remote site where master is, or
>> >>> >> >>>>>>>> deliberate
>> >>> >> >>>>>>>> >> failover
>> >>> >> >>>>>>>> >> >>> of
>> >>> >> >>>>>>>> >> >>> >> > master postgres service from one node to the
>> >>> other on
>> >>> >> >>>>>>>> remote site
>> >>> >> >>>>>>>> >> -
>> >>> >> >>>>>>>> >> >>> in
>> >>> >> >>>>>>>> >> >>> >> both
>> >>> >> >>>>>>>> >> >>> >> > cases remote means remote to the pgpool that is
>> >>> going
>> >>> >> to
>> >>> >> >>>>>>>> perform
>> >>> >> >>>>>>>> >> >>> health
>> >>> >> >>>>>>>> >> >>> >> > checks and ultimately the failover) and we
>> don't
>> >>> want
>> >>> >> >>>>>>>> standby to
>> >>> >> >>>>>>>> >> be
>> >>> >> >>>>>>>> >> >>> >> > promoted as easily to a new master, to prevent
>> >>> >> temporary
>> >>> >> >>>>>>>> network
>> >>> >> >>>>>>>> >> >>> >> conditions
>> >>> >> >>>>>>>> >> >>> >> > which could occur frequently to frequently
>> cause
>> >>> split
>> >>> >> >>>>>>>> brain with
>> >>> >> >>>>>>>> >> two
>> >>> >> >>>>>>>> >> >>> >> > masters.
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> > But then, with DISALLOW_TO_FAILOVER set,
>> without
>> >>> the
>> >>> >> >>>>>>>> patch health
>> >>> >> >>>>>>>> >> >>> check
>> >>> >> >>>>>>>> >> >>> >> > will not retry and will thus give only one
>> chance
>> >>> to
>> >>> >> >>>>>>>> backend (if
>> >>> >> >>>>>>>> >> >>> health
>> >>> >> >>>>>>>> >> >>> >> > check ever occurs before child process failure
>> to
>> >>> >> >>>>>>>> connect to the
>> >>> >> >>>>>>>> >> >>> >> backend),
>> >>> >> >>>>>>>> >> >>> >> > rendering retry settings effectively to be
>> >>> ignored.
>> >>> >> >>>>>>>> That's where
>> >>> >> >>>>>>>> >> this
>> >>> >> >>>>>>>> >> >>> >> patch
>> >>> >> >>>>>>>> >> >>> >> > comes into action - enables health check
>> retries
>> >>> while
>> >>> >> >>>>>>>> child
>> >>> >> >>>>>>>> >> >>> processes
>> >>> >> >>>>>>>> >> >>> >> are
>> >>> >> >>>>>>>> >> >>> >> > prevented to degenerate backend.
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> > I don't think, but I could be wrong, that this
>> >>> patch
>> >>> >> >>>>>>>> influences
>> >>> >> >>>>>>>> >> the
>> >>> >> >>>>>>>> >> >>> >> > behavior we're seeing with unwanted health
>> check
>> >>> >> attempt
>> >>> >> >>>>>>>> delays.
>> >>> >> >>>>>>>> >> >>> Also,
>> >>> >> >>>>>>>> >> >>> >> > knowing this, maybe pgpool could be patched or
>> >>> some
>> >>> >> >>>>>>>> other support
>> >>> >> >>>>>>>> >> be
>> >>> >> >>>>>>>> >> >>> >> built
>> >>> >> >>>>>>>> >> >>> >> > into it to cover this use case.
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> > Regards,
>> >>> >> >>>>>>>> >> >>> >> > Stevo.
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> > 2012/1/12 Tatsuo Ishii <ishii at postgresql.org>
>> >>> >> >>>>>>>> >> >>> >> >
>> >>> >> >>>>>>>> >> >>> >> >> I have accepted the moderation request. Your
>> post
>> >>> >> >>>>>>>> should be sent
>> >>> >> >>>>>>>> >> >>> >> shortly.
>> >>> >> >>>>>>>> >> >>> >> >> Also I have raised the post size limit to 1MB.
>> >>> >> >>>>>>>> >> >>> >> >> I will look into this...
>> >>> >> >>>>>>>> >> >>> >> >> --
>> >>> >> >>>>>>>> >> >>> >> >> Tatsuo Ishii
>> >>> >> >>>>>>>> >> >>> >> >> SRA OSS, Inc. Japan
>> >>> >> >>>>>>>> >> >>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >> >>>>>>>> >> >>> >> >> Japanese: http://www.sraoss.co.jp
>> >>> >> >>>>>>>> >> >>> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> > Here is the log file and strace output file
>> >>> (this
>> >>> >> >>>>>>>> time in an
>> >>> >> >>>>>>>> >> >>> archive,
>> >>> >> >>>>>>>> >> >>> >> >> > didn't know about 200KB constraint on post
>> size
>> >>> >> which
>> >>> >> >>>>>>>> requires
>> >>> >> >>>>>>>> >> >>> >> moderator
>> >>> >> >>>>>>>> >> >>> >> >> > approval). Timings configured are 30sec
>> health
>> >>> >> check
>> >>> >> >>>>>>>> interval,
>> >>> >> >>>>>>>> >> >>> 5sec
>> >>> >> >>>>>>>> >> >>> >> >> > timeout, and 2 retries with 10sec retry
>> delay.
>> >>> >> >>>>>>>> >> >>> >> >> >
>> >>> >> >>>>>>>> >> >>> >> >> > It takes a lot more than 5sec from started
>> >>> health
>> >>> >> >>>>>>>> check to
>> >>> >> >>>>>>>> >> >>> sleeping
>> >>> >> >>>>>>>> >> >>> >> 10sec
>> >>> >> >>>>>>>> >> >>> >> >> > for first retry.
>> >>> >> >>>>>>>> >> >>> >> >> >
>> >>> >> >>>>>>>> >> >>> >> >> > Seen in code (main.x, health_check()
>> function),
>> >>> >> >>>>>>>> within (retry)
>> >>> >> >>>>>>>> >> >>> attempt
>> >>> >> >>>>>>>> >> >>> >> >> > there is inner retry (first with postgres
>> >>> database
>> >>> >> >>>>>>>> then with
>> >>> >> >>>>>>>> >> >>> >> template1)
>> >>> >> >>>>>>>> >> >>> >> >> and
>> >>> >> >>>>>>>> >> >>> >> >> > that part doesn't seem to be interrupted by
>> >>> alarm.
>> >>> >> >>>>>>>> >> >>> >> >> >
>> >>> >> >>>>>>>> >> >>> >> >> > Regards,
>> >>> >> >>>>>>>> >> >>> >> >> > Stevo.
>> >>> >> >>>>>>>> >> >>> >> >> >
>> >>> >> >>>>>>>> >> >>> >> >> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
>> >>> >> >>>>>>>> >> >>> >> >> >
>> >>> >> >>>>>>>> >> >>> >> >> >> Here is the log file and strace output
>> file.
>> >>> >> Timings
>> >>> >> >>>>>>>> >> configured
>> >>> >> >>>>>>>> >> >>> are
>> >>> >> >>>>>>>> >> >>> >> >> 30sec
>> >>> >> >>>>>>>> >> >>> >> >> >> health check interval, 5sec timeout, and 2
>> >>> retries
>> >>> >> >>>>>>>> with 10sec
>> >>> >> >>>>>>>> >> >>> retry
>> >>> >> >>>>>>>> >> >>> >> >> delay.
>> >>> >> >>>>>>>> >> >>> >> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> >> It takes a lot more than 5sec from started
>> >>> health
>> >>> >> >>>>>>>> check to
>> >>> >> >>>>>>>> >> >>> sleeping
>> >>> >> >>>>>>>> >> >>> >> >> 10sec
>> >>> >> >>>>>>>> >> >>> >> >> >> for first retry.
>> >>> >> >>>>>>>> >> >>> >> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> >> Seen in code (main.x, health_check()
>> >>> function),
>> >>> >> >>>>>>>> within (retry)
>> >>> >> >>>>>>>> >> >>> >> attempt
>> >>> >> >>>>>>>> >> >>> >> >> >> there is inner retry (first with postgres
>> >>> database
>> >>> >> >>>>>>>> then with
>> >>> >> >>>>>>>> >> >>> >> template1)
>> >>> >> >>>>>>>> >> >>> >> >> and
>> >>> >> >>>>>>>> >> >>> >> >> >> that part doesn't seem to be interrupted by
>> >>> alarm.
>> >>> >> >>>>>>>> >> >>> >> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> >> Regards,
>> >>> >> >>>>>>>> >> >>> >> >> >> Stevo.
>> >>> >> >>>>>>>> >> >>> >> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> >> 2012/1/11 Tatsuo Ishii <
>> ishii at postgresql.org>
>> >>> >> >>>>>>>> >> >>> >> >> >>
>> >>> >> >>>>>>>> >> >>> >> >> >>> Ok, I will do it. In the mean time you
>> could
>> >>> use
>> >>> >> >>>>>>>> "strace -tt
>> >>> >> >>>>>>>> >> -p
>> >>> >> >>>>>>>> >> >>> PID"
>> >>> >> >>>>>>>> >> >>> >> >> >>> to see which system call is blocked.
>> >>> >> >>>>>>>> >> >>> >> >> >>> --
>> >>> >> >>>>>>>> >> >>> >> >> >>> Tatsuo Ishii
>> >>> >> >>>>>>>> >> >>> >> >> >>> SRA OSS, Inc. Japan
>> >>> >> >>>>>>>> >> >>> >> >> >>> English:
>> >>> http://www.sraoss.co.jp/index_en.php
>> >>> >> >>>>>>>> >> >>> >> >> >>> Japanese: http://www.sraoss.co.jp
>> >>> >> >>>>>>>> >> >>> >> >> >>>
>>>> OK, got the info - key point is that ip forwarding is disabled for
>>>> security reasons. Rules in iptables are not important, iptables can be
>>>> stopped, or previously added rules removed.
>>>>
>>>> Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic
>>>> for providing this):
>>>>
>>>> 1.) make sure that ip forwarding is off:
>>>>     echo 0 > /proc/sys/net/ipv4/ip_forward
>>>> 2.) create IP alias on some interface (and have postgres listen on it):
>>>>     ip addr add x.x.x.x/yy dev ethz
>>>> 3.) set backend_hostname0 to aforementioned IP
>>>> 4.) start pgpool and monitor health checks
>>>> 5.) remove IP alias:
>>>>     ip addr del x.x.x.x/yy dev ethz
>>>>
>>>> Here is the interesting part in pgpool log after this:
>>>> 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
>>>> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>>>> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
>>>> 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
>>>> 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>>>> 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>>>> 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host
>>>> 192.168.2.27 at port 5432 is down
>>>> 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10
>>>> second(s)
>>>>
>>>> That pgpool was configured with health check interval of 30sec, 5sec
>>>> timeout, and 10sec retry delay with 2 max retries.
>>>>
>>>> Making use of libpq instead for connecting to db in health checks IMO
>>>> should resolve it, but you'll best determine which call exactly gets
>>>> blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured
>>>> respects that env var timeout.
>>>>
>>>> Regards,
>>>> Stevo.
>>>>
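
On the suggestion to use libpq for the health check connect: a rough sketch
of such a probe using PQconnectdbParams with the connect_timeout parameter
(the connection-string equivalent of PGCONNECT_TIMEOUT). This is only an
illustration of the API, not the actual pgpool health check code; the
dbname/user values are placeholders:

    #include <stdio.h>
    #include <libpq-fe.h>

    /* Illustrative health check probe: returns 0 if the backend answered
       within "timeout" seconds, -1 otherwise. dbname/user are placeholders. */
    static int probe_backend(const char *host, const char *port, int timeout)
    {
        char timeout_str[16];
        const char *keys[]   = {"host", "port", "dbname", "user",
                                "connect_timeout", NULL};
        const char *values[] = {host, port, "template1", "pgpool",
                                timeout_str, NULL};
        PGconn *conn;
        int ok;

        snprintf(timeout_str, sizeof(timeout_str), "%d", timeout);

        /* connect_timeout is enforced by libpq itself while it waits for
           the connection to complete. */
        conn = PQconnectdbParams(keys, values, 0);
        ok = (PQstatus(conn) == CONNECTION_OK);
        if (!ok)
            fprintf(stderr, "health check probe failed: %s",
                    PQerrorMessage(conn));
        PQfinish(conn);

        return ok ? 0 : -1;
    }

As far as I understand, libpq enforces connect_timeout by waiting on a
non-blocking connection with poll/select, so it should not depend on an
ICMP unreachable message arriving or on a signal interrupting connect() --
but that is exactly the point worth verifying in the blackholed-host setup
described above.
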
>>>> On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>>>>
>>>>> Tatsuo,
>>>>>
>>>>> Did you restart iptables after adding rule?
>>>>>
>>>>> Regards,
>>>>> Stevo.
>>>>>
>>>>>
>>>>> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>>>>>
>>>>>> Looking into this to verify if these are all necessary changes to have
>>>>>> port unreachable message silently rejected (suspecting some kernel
>>>>>> parameter tuning is needed).
>>>>>>
>>>>>> Just to clarify, it's not a problem that the host is being detected by
>>>>>> pgpool to be down, but the timing when that happens. On the environment
>>>>>> where the issue is reproduced, pgpool as part of a health check attempt
>>>>>> tries to connect to the backend and hangs for the tcp timeout instead
>>>>>> of being interrupted by the timeout alarm. Can you verify/confirm
>>>>>> please that the health check retry timings are not delayed?
>>>>>>
>>>>>> Regards,
>>>>>> Stevo.
>>>>>>
>>>>>>
>>>>>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>>
>>>>>>> Ok, I did:
>>>>>>>
>>>>>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>>>>>
>>>>>>> on the host where pgpool is running. And pull network cable from
>>>>>>> backend0 host network interface. Pgpool detected the host being down
>>>>>>> as expected...
>>>>>>> --
>>>>>>> Tatsuo Ishii
>>>>>>> SRA OSS, Inc. Japan
>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>>> Japanese: http://www.sraoss.co.jp
>>>>>>>
>>>>>>>> Backend is not destination of this message, pgpool host is, and we
>>>>>>>> don't want it to ever get it. With command I've sent you, rule will
>>>>>>>> be created for any source and destination.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Stevo.
>>>>>>>>
>>>>>>>> On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>>>>
>>>>>>>>> I did following:
>>>>>>>>>
>>>>>>>>> Do following on the host where pgpool is running on:
>>>>>>>>>
>>>>>>>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d
>>>>>>>>> 133.137.177.124
>>>>>>>>> (133.137.177.124 is the host where backend is running on)
>>>>>>>>>
>>>>>>>>> Pull network cable from backend0 host network interface. Pgpool
>>>>>>>>> detected the host being down as expected. Am I missing something?
>>>>>>>>> --
>>>>>>>>> Tatsuo Ishii
>>>>>>>>> SRA OSS, Inc. Japan
>>>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>>>>> Japanese: http://www.sraoss.co.jp
>>>>>>>>>
>>>>>>>>>> Hello Tatsuo,
>>>>>>>>>>
>>>>>>>>>> With backend0 on one host, just configure following rule on other
>>>>>>>>>> host where pgpool is:
>>>>>>>>>>
>>>>>>>>>> iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>>>>>>>>
>>>>>>>>>> and then have pgpool startup with health checking and retrying
>>>>>>>>>> configured, and then pull network cable from backend0 host network
>>>>>>>>>> interface.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Stevo.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> I want to try to test the situation you described:
>>>>>>>>>>>
>>>>>>>>>>>> When system is configured for security reasons not to return
>>>>>>>>>>>> destination host unreachable messages, even though
>>>>>>>>>>>> health_check_timeout is
>>>>>>>>>>>
>>>>>>>>>>> But I don't know how to do it. I pulled out the network cable and
>>>>>>>>>>> pgpool detected it as expected. Also I configured the server which
>>>>>>>>>>> PostgreSQL is running on to disable the 5432 port. In this case
>>>>>>>>>>> connect(2) returned EHOSTUNREACH (No route to host) so pgpool
>>>>>>>>>>> detected the error as expected.
>>>>>>>>>>>
>>>>>>>>>>> Could you please instruct me?
>>>>>>>>>>> --
>>>>>>>>>>> Tatsuo Ishii
>>>>>>>>>>> SRA OSS, Inc. Japan
>>>>>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>>>>>>> Japanese: http://www.sraoss.co.jp
>>>>>>>>>>>
>>>>>>>>>>>> Hello Tatsuo,
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you for replying!
>>>>>>>>>>>>
>>>>>>>>>>>> I'm not sure what exactly is blocking, just by pgpool code analysis
>>>>>>>>>>>> I suspect it is the part where a connection is made to the db and
>>>>>>>>>>>> it doesn't seem to get interrupted by alarm. Tested thoroughly
>>>>>>>>>>>> health check behaviour, it works really well when host/ip is there
>>>>>>>>>>>> and just backend/postgres is down, but not when backend host/ip is
>>>>>>>>>>>> down. I could see in log that initial health check and each retry
>>>>>>>>>>>> got delayed when host/ip is not reachable, while when just backend
>>>>>>>>>>>> is not listening (is down) on the reachable host/ip then initial
>>>>>>>>>>>> health check and all retries are exact to the settings in
>>>>>>>>>>>> pgpool.conf.
>>>>>>>>>>>>
>>>>>>>>>>>> PGCONNECT_TIMEOUT is listed as one of the libpq environment
>>>>>>>>>>>> variables in the docs (see
>>>>>>>>>>>> http://www.postgresql.org/docs/9.1/static/libpq-envars.html)
>>>>>>>>>>>> There is equivalent parameter in libpq PQconnectdbParams (see
>>>>>>>>>>>> http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT
>>>>>>>>>>>> )
>>>>>>>>>>>> At the beginning of that same page there are some important infos
>>>>>>>>>>>> on using these functions.
>>>>>>>>>>>>
>>>>>>>>>>>> psql respects PGCONNECT_TIMEOUT.
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Stevo.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello pgpool community,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When system is configured for security reasons not to return
>>>>>>>>>>>>>> destination host unreachable messages, even though
>>>>>>>>>>>>>> health_check_timeout is configured,
>>>>>>>>>>>>>> socket call will block and alarm will not get raised until TCP
>>>>>>>>>>>>>> timeout occurs.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Interesting. So are you saying that read(2) cannot be interrupted
>>>>>>>>>>>>> by alarm signal if the system is configured not to return
>>>>>>>>>>>>> destination host unreachable message? Could you please guide me
>>>>>>>>>>>>> where I can get such info? (I'm not a network expert.)
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Not a C programmer, found some info that socket call could be
>>>>>>>>>>>>>> replaced with select/pselect calls. Maybe it would be best if
>>>>>>>>>>>>>> PGCONNECT_TIMEOUT value could be used here for connection
>>>>>>>>>>>>>> timeout. pgpool has libpq as dependency, why isn't it using
>>>>>>>>>>>>>> libpq for the healthcheck db connect calls, then
>>>>>>>>>>>>>> PGCONNECT_TIMEOUT would be applied?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I don't think libpq uses select/pselect for establishing
>>>>>>>>>>>>> connection, but using libpq instead of homebrew code seems to be
>>>>>>>>>>>>> an idea. Let me think about it.
>>>>>>>>>>>>>
>>>>>>>>>>>>> One question. Are you sure that libpq can deal with the case (not
>>>>>>>>>>>>> to return destination host unreachable messages) by using
>>>>>>>>>>>>> PGCONNECT_TIMEOUT?
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Tatsuo Ishii
>>>>>>>>>>>>> SRA OSS, Inc. Japan
>>>>>>>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>>>>>>>>> Japanese: http://www.sraoss.co.jp
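
On the select/pselect idea: the usual way to bound a connect attempt
without relying on ICMP errors or on a signal interrupting the call is a
non-blocking connect(2) followed by select(2) with a timeout; the deadline
then expires locally no matter what the network does. A minimal sketch
(illustrative only, error handling and restoring the blocking flag omitted):

    #include <errno.h>
    #include <fcntl.h>
    #include <sys/select.h>
    #include <sys/socket.h>

    /* Connect with an explicit timeout: returns 0 on success, -1 on error
       or timeout. Returns within timeout_sec even if no "host unreachable"
       ICMP message ever arrives, because select() enforces the deadline. */
    static int connect_timed(int fd, const struct sockaddr *addr,
                             socklen_t len, int timeout_sec)
    {
        fd_set wfds;
        struct timeval tv;
        int err = 0;
        socklen_t errlen = sizeof(err);

        fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

        if (connect(fd, addr, len) == 0)
            return 0;                     /* connected immediately */
        if (errno != EINPROGRESS)
            return -1;                    /* immediate failure */

        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
            return -1;                    /* timed out (or select error) */

        /* connection attempt finished; check whether it succeeded */
        if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0 || err != 0)
            return -1;

        return 0;
    }
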


More information about the pgpool-general mailing list