[pgpool-general: 214] Re: Healthcheck timeout not always respected

Stevo Slavić sslavic at gmail.com
Mon Feb 6 06:16:32 JST 2012


Hello Tatsuo,

Attached is a cumulative patch, rebased onto the current master branch head, which:
- Fixes the health check timeout not always being respected (includes unsetting
non-blocking mode after the connection has been successfully established; see
the sketch below);
- Fixes support for failover triggered by health check only.
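
For reference, the core of the non-blocking connect change looks roughly like
this. It is a simplified sketch only: connect_inet_domain_socket_by_port and
health_check_timer_expired are the names actually touched by the patch, while
the helper below, its signature and the 100 ms poll interval are purely
illustrative.

#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/socket.h>

/* made global (in pool.h) by the patch; set to 1 by the SIGALRM handler */
extern volatile sig_atomic_t health_check_timer_expired;

/* Sketch: connect to a backend without blocking past the health check timeout. */
static int connect_backend_nonblocking(int fd, struct sockaddr *addr, socklen_t len)
{
    int flags = fcntl(fd, F_GETFL, 0);

    fcntl(fd, F_SETFL, flags | O_NONBLOCK);      /* switch the socket to non-blocking */

    for (;;)
    {
        if (health_check_timer_expired)          /* health_check_timeout elapsed */
            return -1;                           /* give up instead of hanging */

        if (connect(fd, addr, len) == 0 || errno == EISCONN)
            break;                               /* connected */

        if (errno != EINPROGRESS && errno != EALREADY && errno != EINTR)
            return -1;                           /* genuine connection error */

        /* still in progress: wait briefly for writability, then re-check the timer */
        fd_set wfds;
        struct timeval tv = {0, 100000};         /* 100 ms */
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);
        select(fd + 1, NULL, &wfds, NULL, &tv);
    }

    fcntl(fd, F_SETFL, flags);                   /* restore blocking mode after success */
    return 0;
}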

Kind regards,
Stevo.

2012/2/5 Stevo Slavić <sslavic at gmail.com>

> Tatsuo,
>
> Thank you very much for the time and effort you put into analyzing the
> submitted patch.
>
>
> Obviously I'm missing something regarding the health check feature, so please
> clarify:
>
>    - what is the purpose of health check when the backend flag is set to
>    DISALLOW_TO_FAILOVER? To log that health checks ran on time but will never
>    actually do anything?
>    - what is the purpose of health check (especially with retries
>    configured) when the backend flag is set to ALLOW_TO_FAILOVER? When answering,
>    please consider the case of a non-hello-world application that connects to the
>    db via pgpool - will the health check be given a chance to fail even once?
>    - since there is no backend flag value other than the two mentioned,
>    what is the purpose of health check (especially with retries configured) if
>    it is not meant to be the sole process controlling when to fail over?
>
> I disagree that changing pgpool to give the health check feature a meaning
> disrupts the meaning of DISALLOW_TO_FAILOVER; it extends it only for the case
> when health check is configured - if one doesn't want health checks, just keep
> not using them, they are disabled by default. Health checks and retries have
> only recently been introduced, so I doubt there are many users, if any, who
> have configured health check together with DISALLOW_TO_FAILOVER expecting to
> get nothing but health check logging. Of all the pgpool health check users who
> also have backends set to DISALLOW_TO_FAILOVER, I believe most expect failover
> on failed health checks and do not realize that it will never happen - the
> checks just make the log bigger. The changes included in the patch do not
> affect users who have health check configured and backends set to
> ALLOW_TO_FAILOVER.
>
>
> About the non-blocking connection to backend change:
>
>    - with pgpool in raw mode and extensive testing (endurance tests,
>    failover and failback tests), I didn't notice any unwanted change in
>    behaviour, apart from the wanted non-blocking, timeout-aware health checks;
>    - do you see or know of anything in pgpool that depends on the connection
>    to the backend being a blocking one? I will have a look myself, just asking in
>    case you've already found something. I will also look into a way to set the
>    connection back to blocking after it's successfully established - maybe just
>    clearing that flag will do, as sketched below.
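>
>    For reference, what I have in mind is just the usual fcntl flag dance
>    (illustrative sketch, not the patch itself):
>
>        int flags = fcntl(fd, F_GETFL, 0);        /* read current descriptor flags */
>        fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);  /* clear O_NONBLOCK: back to blocking */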
>
>
> Kind regards,
>
> Stevo.
>
>
> On Feb 5, 2012 6:50 AM, "Tatsuo Ishii" <ishii at postgresql.org> wrote:
>
>> Finally I have time to check your patches. Here is the result of my review.
>>
>> > Hello Tatsuo,
>> >
>> > Here is cumulative patch to be applied on pgpool master branch with
>> > following fixes included:
>> >
>> >    1. fix for health check bug
>> >       1. it was not possible to allow backend failover only on failed
>> >       health check(s);
>> >       2. to achieve this one just configures backend to
>> >       DISALLOW_TO_FAILOVER, sets fail_over_on_backend_error to off, and
>> >       configures health checks;
>> >       3. for this fix, an unwanted check was removed in main.c: after a
>> >       health check failed, if DISALLOW_TO_FAILOVER was set for the backend,
>> >       failover would always have been prevented, even when one configures a
>> >       health check whose sole purpose is to control failover
>>
>> This is not acceptable, at least for stable
>> releases. DISALLOW_TO_FAILOVER and fail_over_on_backend_error are
>> for different purposes. The former prevents any failover,
>> including failover triggered by health check. The latter concerns errors
>> when writing to the backend communication socket.
>>
>> fail_over_on_backend_error = on
>>                                   # Initiates failover when writing to the
>>                                   # backend communication socket fails
>>                                   # This is the same behaviour of
>> pgpool-II
>>                                   # 2.2.x and previous releases
>>                                   # If set to off, pgpool will report an
>>                                   # error and disconnect the session.
>>
>> Your patch changes the existing semantics. Another point is that
>> DISALLOW_TO_FAILOVER allows controlling the behavior per backend. Your
>> patch breaks that.
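>>
>> For example, the per-backend flags can be mixed in pgpool.conf (illustrative
>> fragment; backend_flag is the per-backend parameter carrying these values):
>>
>> backend_flag0 = 'ALLOW_TO_FAILOVER'      # backend 0 may be failed over
>> backend_flag1 = 'DISALLOW_TO_FAILOVER'   # backend 1 must never be failed over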
>>
>> >       2. fix for health check bug
>> >       1. health check timeout was not being respected in all conditions
>> >       (ICMP host unreachable messages dropped for security reasons, or no
>> >       active network component to send those messages)
>> >       2. for this fix, in code (main.c, pool.h, pool_connection_pool.c) inet
>> >       connections have been made non-blocking, and during connection retries
>> >       the status of the now-global health_check_timer_expired variable is
>> >       checked
>>
>> This seems good, but I need to investigate more. For example, your
>> patch sets sockets to non-blocking but never reverts them back to blocking.
>>
>> >       3. fix for failback bug
>> >       1. in raw mode, after failback (through pcp_attach_node) standby
>> >       node/backend would remain in invalid state
>>
>> It turned out that even failover was buggy. The status was not set to
>> CON_DOWN. This left the status at CON_CONNECT_WAIT, which prevented
>> failback from returning to the normal state. I fixed this on the master branch.
>>
>> > (it would be in CON_UP, so on
>> >       failover after failback pgpool would not be able to connect to the
>> >       standby, as get_next_master_node expects standby nodes/backends in raw
>> >       mode to be in CON_CONNECT_WAIT state when finding the next master node)
>> >       2. for this fix, when in raw mode, on failback the status of all
>> >       nodes/backends in CON_UP state is set to CON_CONNECT_WAIT - all
>> >       children are restarted anyway
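>> >
>> >       In code terms the failback adjustment amounts to roughly the following
>> >       (sketch only; BACKEND_INFO, NUM_BACKENDS and RAW_MODE stand in for the
>> >       corresponding pgpool internals):
>> >
>> >       int i;
>> >       if (RAW_MODE)
>> >           for (i = 0; i < NUM_BACKENDS; i++)
>> >               if (BACKEND_INFO(i).backend_status == CON_UP)
>> >                   BACKEND_INFO(i).backend_status = CON_CONNECT_WAIT;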
>>
>>
>> > Neither of these fixes changes the expected behaviour of the related
>> > features, so there are no changes to the documentation.
>> >
>> >
>> > Kind regards,
>> >
>> > Stevo.
>> >
>> >
>> > 2012/1/24 Tatsuo Ishii <ishii at postgresql.org>
>> >
>> >> > Additional testing confirmed that this fix ensures health check timer
>> >> gets
>> >> > respected (should I create a ticket on some issue tracker? send
>> >> cumulative
>> >> > patch with all changes to have it accepted?).
>> >>
>> >> We have a problem with the Mantis bug tracker and decided to stop using
>> >> it (unless someone volunteers to fix it). Please send a cumulative patch
>> >> against master head to this list so that we will be able to look
>> >> into it (be sure to include English doc changes).
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese: http://www.sraoss.co.jp
>> >>
>> >> > Problem is that with all the testing another issue has been
>> encountered,
>> >> > now with pcp_attach_node.
>> >> >
>> >> > With pgpool in raw mode and two backends in postgres 9 streaming
>> >> > replication, when backend0 fails, after health checks retries pgpool
>> >> calls
>> >> > failover command and degenerates backend0, backend1 gets promoted to
>> new
>> >> > master, pgpool can connect to that master, and two backends are in
>> pgpool
>> >> > state 3/2. And this is ok and expected.
>> >> >
>> >> > Once backend0 is recovered, it's attached back to pgpool using
>> >> > pcp_attach_node, and pgpool will show two backends in state 2/2 (in
>> logs
>> >> > and in show pool_nodes; query) with backend0 taking all the load (raw
>> >> > mode). If after that recovery and attachment of backend0 pgpool is
>> not
>> >> > restarted, and after some time backend0 fails again, after health
>> check
>> >> > retries backend0 will get degenerated, failover command will get
>> called
>> >> > (promotes standby to master), but pgpool will not be able to connect
>> to
>> >> > backend1 (regardless if unix or inet sockets are used for backend1).
>> Only
>> >> > if pgpool is restarted before second (complete) failure of backend0,
>> will
>> >> > pgpool be able to connect to backend1.
>> >> >
>> >> > Following the code, pcp_attach_node (failback of backend0) will actually
>> >> > execute same code as for failover. Not sure what, but that failover
>> does
>> >> > something with backend1 state or in memory settings, so that pgpool
>> can
>> >> no
>> >> > longer connect to backend1. Is this a known issue?
>> >> >
>> >> > Kind regards,
>> >> > Stevo.
>> >> >
>> >> > 2012/1/20 Stevo Slavić <sslavic at gmail.com>
>> >> >
>> >> >> Key file was missing from that commit/change - pool.h where
>> >> >> health_check_timer_expired was made global. Included now attached
>> patch.
>> >> >>
>> >> >> Kind regards,
>> >> >> Stevo.
>> >> >>
>> >> >>
>> >> >> 2012/1/20 Stevo Slavić <sslavic at gmail.com>
>> >> >>
>> >> >>> Using exit_request was wrong and caused a bug. 4th patch needed -
>> >> >>> health_check_timer_expired is global now so it can be verified if
>> it
>> >> was
>> >> >>> set to 1 outside of main.c
>> >> >>>
>> >> >>>
>> >> >>> Kind regards,
>> >> >>> Stevo.
>> >> >>>
>> >> >>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>> >> >>>
>> >> >>>> Using exit_code was not wise. Tested and encountered a case where
>> this
>> >> >>>> results in a bug. Have to work on it more. Main issue is how in
>> >> >>>> pool_connection_pool.c connect_inet_domain_socket_by_port
>> function to
>> >> know
>> >> >>>> that health check timer has expired (set to 1). Any ideas?
>> >> >>>>
>> >> >>>> Kind regards,
>> >> >>>> Stevo.
>> >> >>>>
>> >> >>>>
>> >> >>>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
>> >> >>>>
>> >> >>>>> Tatsuo,
>> >> >>>>>
>> >> >>>>> Here are the patches which should be applied to current pgpool
>> head
>> >> for
>> >> >>>>> fixing this issue:
>> >> >>>>>
>> >> >>>>> Fixes-health-check-timeout.patch
>> >> >>>>> Fixes-health-check-retrying-after-failover.patch
>> >> >>>>> Fixes-clearing-exitrequest-flag.patch
>> >> >>>>>
>> >> >>>>> A quirk I noticed in the logs was resolved as well - after failover
>> >> >>>>> pgpool would perform a health check and report it was doing the
>> >> >>>>> (max retries + 1)th health check, which was confusing. I've adjusted it
>> >> >>>>> so that it performs and reports a new health check cycle after failover.
>> >> >>>>>
>> >> >>>>> I've tested and it works well - when in raw mode, backends set to
>> >> >>>>> disallow failover, failover on backend failure disabled, and
>> health
>> >> checks
>> >> >>>>> configured with retries (30sec interval, 5sec timeout, 2 retries,
>> >> 10sec
>> >> >>>>> delay between retries).
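>> >> >>>>>
>> >> >>>>> In pgpool.conf terms that test setup corresponds to roughly:
>> >> >>>>>
>> >> >>>>> health_check_period = 30
>> >> >>>>> health_check_timeout = 5
>> >> >>>>> health_check_max_retries = 2
>> >> >>>>> health_check_retry_delay = 10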
>> >> >>>>>
>> >> >>>>> Please test, and if confirmed ok include in next release.
>> >> >>>>>
>> >> >>>>> Kind regards,
>> >> >>>>>
>> >> >>>>> Stevo.
>> >> >>>>>
>> >> >>>>>
>> >> >>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>> >> >>>>>
>> >> >>>>>> Here is pgpool.log, strace.out, and pgpool.conf when I tested
>> with
>> >> my
>> >> >>>>>> latest patch for health check timeout applied. It works well,
>> >> except for
>> >> >>>>>> single quirk, after failover completed in log files it was
>> reported
>> >> that
>> >> >>>>>> 3rd health check retry was done (even though just 2 are
>> configured,
>> >> see
>> >> >>>>>> pgpool.conf) and that backend has returned to healthy state.
>> That
>> >> >>>>>> interesting part from log file follows:
>> >> >>>>>>
>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
>> DEBUG: pid
>> >> >>>>>> 1163: retrying 3 th health checking
>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
>> DEBUG: pid
>> >> >>>>>> 1163: health_check: 0 th DB node status: 3
>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45 LOG:
>>   pid
>> >> >>>>>> 1163: after some retrying backend returned to healthy state
>> >> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15
>> DEBUG: pid
>> >> >>>>>> 1163: starting health checking
>> >> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15
>> DEBUG: pid
>> >> >>>>>> 1163: health_check: 0 th DB node status: 3
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> As can be seen in pgpool.conf, there is only one backend
>> configured.
>> >> >>>>>> pgpool did failover well after health check max retries has been
>> >> reached
>> >> >>>>>> (pgpool just degraded that single backend to 3, and restarted
>> child
>> >> >>>>>> processes).
>> >> >>>>>>
>> >> >>>>>> After this quirk has been logged, next health check logs were as
>> >> >>>>>> expected. Except those couple weird log entries, everything
>> seems
>> >> to be ok.
>> >> >>>>>> Maybe that quirk was caused by single backend only configuration
>> >> corner
>> >> >>>>>> case. Will try tomorrow if it occurs on dual backend
>> configuration.
>> >> >>>>>>
>> >> >>>>>> Regards,
>> >> >>>>>> Stevo.
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
>> >> >>>>>>
>> >> >>>>>>> Hello Tatsuo,
>> >> >>>>>>>
>> >> >>>>>>> Unfortunately, with your patch when A is on
>> >> >>>>>>> (pool_config->health_check_period > 0) and B is on, when retry
>> >> count is
>> >> >>>>>>> over, failover will be disallowed because of B being on.
>> >> >>>>>>>
>> >> >>>>>>> Nenad's patch allows failover to be triggered only by health
>> check.
>> >> >>>>>>> Here is the patch which includes Nenad's fix but also fixes
>> issue
>> >> with
>> >> >>>>>>> health check timeout not being respected.
>> >> >>>>>>>
>> >> >>>>>>> Key points in fix for health check timeout being respected are:
>> >> >>>>>>> - in pool_connection_pool.c connect_inet_domain_socket_by_port
>> >> >>>>>>> function, before trying to connect, file descriptor is set to
>> >> non-blocking
>> >> >>>>>>> mode, and also non-blocking mode error codes are handled,
>> >> EINPROGRESS and
>> >> >>>>>>> EALREADY (please verify changes here, especially regarding
>> closing
>> >> fd)
>> >> >>>>>>> - in main.c health_check_timer_handler has been changed to
>> signal
>> >> >>>>>>> exit_request to health check initiated
>> >> connect_inet_domain_socket_by_port
>> >> >>>>>>> function call (please verify this, maybe there is a better way
>> to
>> >> check
>> >> >>>>>>> from connect_inet_domain_socket_by_port if in
>> >> health_check_timer_expired
>> >> >>>>>>> has been set to 1)
>> >> >>>>>>>
>> >> >>>>>>> These changes will practically make connect attempt to be
>> >> >>>>>>> non-blocking and repeated until:
>> >> >>>>>>> - connection is made, or
>> >> >>>>>>> - unhandled connection error condition is reached, or
>> >> >>>>>>> - health check timer alarm has been raised, or
>> >> >>>>>>> - some other exit request (shutdown) has been issued.
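>> >> >>>>>>>
>> >> >>>>>>> The timer side is essentially just a flag set from the SIGALRM handler,
>> >> >>>>>>> something like the following (sketch, not the exact code in main.c):
>> >> >>>>>>>
>> >> >>>>>>> volatile sig_atomic_t health_check_timer_expired = 0;
>> >> >>>>>>>
>> >> >>>>>>> static void health_check_timer_handler(int sig)
>> >> >>>>>>> {
>> >> >>>>>>>     /* health_check_timeout elapsed; the connect retry loop polls
>> >> >>>>>>>        this flag and aborts the attempt when it sees it set */
>> >> >>>>>>>     health_check_timer_expired = 1;
>> >> >>>>>>> }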
>> >> >>>>>>>
>> >> >>>>>>>
>> >> >>>>>>> Kind regards,
>> >> >>>>>>> Stevo.
>> >> >>>>>>>
>> >> >>>>>>> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >> >>>>>>>
>> >> >>>>>>>> Ok, let me clarify use cases regarding failover.
>> >> >>>>>>>>
>> >> >>>>>>>> Currently there are three parameters:
>> >> >>>>>>>> a) health_check
>> >> >>>>>>>> b) DISALLOW_TO_FAILOVER
>> >> >>>>>>>> c) fail_over_on_backend_error
>> >> >>>>>>>>
>> >> >>>>>>>> Source of errors which can trigger failover are 1)health check
>> >> >>>>>>>> 2)write
>> >> >>>>>>>> to backend socket 3)read backend from socket. I represent
>> each 1)
>> >> as
>> >> >>>>>>>> A, 2) as B, 3) as C.
>> >> >>>>>>>>
>> >> >>>>>>>> 1) trigger failover if A or B or C is error
>> >> >>>>>>>> a = on, b = off, c = on
>> >> >>>>>>>>
>> >> >>>>>>>> 2) trigger failover only when B or C is error
>> >> >>>>>>>> a = off, b = off, c = on
>> >> >>>>>>>>
>> >> >>>>>>>> 3) trigger failover only when B is error
>> >> >>>>>>>> Impossible. Because C error always triggers failover.
>> >> >>>>>>>>
>> >> >>>>>>>> 4) trigger failover only when C is error
>> >> >>>>>>>> a = off, b = off, c = off
>> >> >>>>>>>>
>> >> >>>>>>>> 5) trigger failover only when A is error(Stevo wants this)
>> >> >>>>>>>> Impossible. Because C error always triggers failover.
>> >> >>>>>>>>
>> >> >>>>>>>> 6) never trigger failover
>> >> >>>>>>>> Impossible. Because C error always triggers failover.
>> >> >>>>>>>>
>> >> >>>>>>>> As you can see, C is the problem here (look at #3, #5 and #6)
>> >> >>>>>>>>
>> >> >>>>>>>> If we implemented this:
>> >> >>>>>>>> >> However I think we should disable failover if
>> >> >>>>>>>> DISALLOW_TO_FAILOVER set
>> >> >>>>>>>> >> in case of reading data from backend. This should have been
>> >> done
>> >> >>>>>>>> when
>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is exactly
>> >> what
>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you
>> think?
>> >> >>>>>>>>
>> >> >>>>>>>> 1) trigger failover if A or B or C is error
>> >> >>>>>>>> a = on, b = off, c = on
>> >> >>>>>>>>
>> >> >>>>>>>> 2) trigger failover only when B or C is error
>> >> >>>>>>>> a = off, b = off, c = on
>> >> >>>>>>>>
>> >> >>>>>>>> 3) trigger failover only when B is error
>> >> >>>>>>>> a = off, b = on, c = on
>> >> >>>>>>>>
>> >> >>>>>>>> 4) trigger failover only when C is error
>> >> >>>>>>>> a = off, b = off, c = off
>> >> >>>>>>>>
>> >> >>>>>>>> 5) trigger failover only when A is error(Stevo wants this)
>> >> >>>>>>>> a = on, b = on, c = off
>> >> >>>>>>>>
>> >> >>>>>>>> 6) never trigger failover
>> >> >>>>>>>> a = off, b = on, c = off
>> >> >>>>>>>>
>> >> >>>>>>>> So it seems my patch will solve all the problems including
>> yours.
>> >> >>>>>>>> (timeout while retrying is another issue of course).
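>> >> >>>>>>>>
>> >> >>>>>>>> In pgpool.conf terms case 5 would then be (illustrative):
>> >> >>>>>>>>
>> >> >>>>>>>> health_check_period = 30                  # a: health check enabled
>> >> >>>>>>>> backend_flag0 = 'DISALLOW_TO_FAILOVER'    # b: on
>> >> >>>>>>>> fail_over_on_backend_error = off          # c: off
>> >> >>>>>>>>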
>> >> >>>>>>>> --
>> >> >>>>>>>> Tatsuo Ishii
>> >> >>>>>>>> SRA OSS, Inc. Japan
>> >> >>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>>
>> >> >>>>>>>> > I agree, fail_over_on_backend_error isn't useful, just adds
>> >> >>>>>>>> confusion by
>> >> >>>>>>>> > overlapping with DISALLOW_TO_FAILOVER.
>> >> >>>>>>>> >
>> >> >>>>>>>> > With your patch or without it, it is not possible to
>> failover
>> >> only
>> >> >>>>>>>> on
>> >> >>>>>>>> > health check (max retries) failure. With Nenad's patch, that
>> >> part
>> >> >>>>>>>> works ok
>> >> >>>>>>>> > and I think that patch is semantically ok - failover occurs
>> even
>> >> >>>>>>>> though
>> >> >>>>>>>> > DISALLOW_TO_FAILOVER is set for backend but only when health
>> >> check
>> >> >>>>>>>> is
>> >> >>>>>>>> > configured too. Configuring health check without failover on
>> >> >>>>>>>> failed health
>> >> >>>>>>>> > check has no purpose. Also health check configured with
>> allowed
>> >> >>>>>>>> failover on
>> >> >>>>>>>> > any condition other than health check (max retries) failure
>> has
>> >> no
>> >> >>>>>>>> purpose.
>> >> >>>>>>>> >
>> >> >>>>>>>> > Kind regards,
>> >> >>>>>>>> > Stevo.
>> >> >>>>>>>> >
>> >> >>>>>>>> > 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >> >>>>>>>> >
>> >> >>>>>>>> >> fail_over_on_backend_error has different meaning from
>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER. From the doc:
>> >> >>>>>>>> >>
>> >> >>>>>>>> >>  If true, and an error occurs when writing to the backend
>> >> >>>>>>>> >>  communication, pgpool-II will trigger the fail over
>> procedure
>> >> .
>> >> >>>>>>>> This
>> >> >>>>>>>> >>  is the same behavior as of pgpool-II 2.2.x or earlier. If
>> set
>> >> to
>> >> >>>>>>>> >>  false, pgpool will report an error and disconnect the
>> session.
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> > This means that if pgpool fails to read from the backend, it
>> >> >>>>>>>> >> > will trigger failover even if fail_over_on_backend_error is
>> >> >>>>>>>> >> > set to off. So unconditionally disabling failover would lead
>> >> >>>>>>>> >> > to backward incompatibility.
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> However I think we should disable failover if
>> >> >>>>>>>> DISALLOW_TO_FAILOVER set
>> >> >>>>>>>> >> in case of reading data from backend. This should have been
>> >> done
>> >> >>>>>>>> when
>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is exactly
>> >> what
>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you
>> think?
>> >> >>>>>>>> >> --
>> >> >>>>>>>> >> Tatsuo Ishii
>> >> >>>>>>>> >> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >>
>> >> >>>>>>>> >> > For a moment I thought we could have set
>> >> >>>>>>>> fail_over_on_backend_error to
>> >> >>>>>>>> >> off,
>> >> >>>>>>>> >> > and have backends set with ALLOW_TO_FAILOVER flag. But
>> then I
>> >> >>>>>>>> looked in
>> >> >>>>>>>> >> > code.
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> > In child.c there is a loop child process goes through in
>> its
>> >> >>>>>>>> lifetime.
>> >> >>>>>>>> >> When
>> >> >>>>>>>> >> > fatal error condition occurs before child process exits
>> it
>> >> will
>> >> >>>>>>>> call
>> >> >>>>>>>> >> > notice_backend_error which will call
>> degenerate_backend_set
>> >> >>>>>>>> which will
>> >> >>>>>>>> >> not
>> >> >>>>>>>> >> > take into account fail_over_on_backend_error is set to
>> off,
>> >> >>>>>>>> causing
>> >> >>>>>>>> >> backend
>> >> >>>>>>>> >> > to be degenerated and failover to occur. That's why we
>> have
>> >> >>>>>>>> backends set
>> >> >>>>>>>> >> > with DISALLOW_TO_FAILOVER but with our patch applied,
>> health
>> >> >>>>>>>> check could
>> >> >>>>>>>> >> > cause failover to occur as expected.
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> > Maybe it would be enough just to modify
>> >> degenerate_backend_set,
>> >> >>>>>>>> to take
>> >> >>>>>>>> >> > fail_over_on_backend_error into account just like it
>> already
>> >> >>>>>>>> takes
>> >> >>>>>>>> >> > DISALLOW_TO_FAILOVER into account.
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> > Kind regards,
>> >> >>>>>>>> >> > Stevo.
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> > 2012/1/15 Stevo Slavić <sslavic at gmail.com>
>> >> >>>>>>>> >> >
>> >> >>>>>>>> >> >> Yes and that behaviour which you describe as expected,
>> is
>> >> not
>> >> >>>>>>>> what we
>> >> >>>>>>>> >> >> want. We want pgpool to degrade backend0 and failover
>> when
>> >> >>>>>>>> configured
>> >> >>>>>>>> >> max
>> >> >>>>>>>> >> >> health check retries have failed, and to failover only
>> in
>> >> that
>> >> >>>>>>>> case, so
>> >> >>>>>>>> >> not
>> >> >>>>>>>> >> >> sooner e.g. connection/child error condition, but as
>> soon as
>> >> >>>>>>>> max health
>> >> >>>>>>>> >> >> check retries have been attempted.
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> Maybe examples will be more clear.
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> Imagine two nodes (node 1 and node 2). On each node a
>> single
>> >> >>>>>>>> pgpool and
>> >> >>>>>>>> >> a
>> >> >>>>>>>> >> >> single backend. Apps/clients access db through pgpool on
>> >> their
>> >> >>>>>>>> own node.
>> >> >>>>>>>> >> >> Two backends are configured in postgres native streaming
>> >> >>>>>>>> replication.
>> >> >>>>>>>> >> >> pgpools are used in raw mode. Both pgpools have same
>> >> backend as
>> >> >>>>>>>> >> backend0,
>> >> >>>>>>>> >> >> and same backend as backend1.
>> >> >>>>>>>> >> >> initial state: both backends are up and pgpool can
>> access
>> >> >>>>>>>> them, clients
>> >> >>>>>>>> >> >> connect to their pgpool and do their work on master
>> backend,
>> >> >>>>>>>> backend0.
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> 1st case: unmodified/non-patched pgpool 3.1.1 is used,
>> >> >>>>>>>> backends are
>> >> >>>>>>>> >> >> configured with ALLOW_TO_FAILOVER flag
>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> node 2
>> >> >>>>>>>> and backend0
>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> since
>> >> >>>>>>>> >> >> ALLOW_TO_FAILOVER is set, pgpool performs failover
>> without
>> >> >>>>>>>> giving
>> >> >>>>>>>> >> chance to
>> >> >>>>>>>> >> >> pgpool health check retries to control whether backend
>> is
>> >> just
>> >> >>>>>>>> >> temporarily
>> >> >>>>>>>> >> >> inaccessible
>> >> >>>>>>>> >> >> - failover command on node 2 promotes standby backend
>> to a
>> >> new
>> >> >>>>>>>> master -
>> >> >>>>>>>> >> >> split brain occurs, with two masters
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> 2nd case: unmodified/non-patched pgpool 3.1.1 is used,
>> >> >>>>>>>> backends are
>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> node 2
>> >> >>>>>>>> and backend0
>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> since
>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >> failover
>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> condition,
>> >> >>>>>>>> determines
>> >> >>>>>>>> >> that
>> >> >>>>>>>> >> >> it's not accessible, there will be no health check
>> retries
>> >> >>>>>>>> because
>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, no failover occurs ever
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> 3rd case, pgpool 3.1.1 + patch you've sent applied, and
>> >> >>>>>>>> backends
>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> node 2
>> >> >>>>>>>> and backend0
>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> since
>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >> failover
>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> condition,
>> >> >>>>>>>> determines
>> >> >>>>>>>> >> that
>> >> >>>>>>>> >> >> it's not accessible, health check retries happen, and
>> even
>> >> >>>>>>>> after max
>> >> >>>>>>>> >> >> retries, no failover happens since failover is
>> disallowed
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> 4th expected behaviour, pgpool 3.1.1 + patch we sent,
>> and
>> >> >>>>>>>> backends
>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool on
>> node 2
>> >> >>>>>>>> and backend0
>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> since
>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >> failover
>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> condition,
>> >> >>>>>>>> determines
>> >> >>>>>>>> >> that
>> >> >>>>>>>> >> >> it's not accessible, health check retries happen,
>> before a
>> >> max
>> >> >>>>>>>> retry
>> >> >>>>>>>> >> >> network condition is cleared, retry happens, and
>> backend0
>> >> >>>>>>>> remains to be
>> >> >>>>>>>> >> >> master, no failover occurs, temporary network issue did
>> not
>> >> >>>>>>>> cause split
>> >> >>>>>>>> >> >> brain
>> >> >>>>>>>> >> >> - after some time, temporary network outage happens
>> again
>> >> >>>>>>>> between pgpool
>> >> >>>>>>>> >> >> on node 2 and backend0
>> >> >>>>>>>> >> >> - error condition is reported by child process, and
>> since
>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not perform
>> >> failover
>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
>> condition,
>> >> >>>>>>>> determines
>> >> >>>>>>>> >> that
>> >> >>>>>>>> >> >> it's not accessible, health check retries happen, after
>> max
>> >> >>>>>>>> retries
>> >> >>>>>>>> >> >> backend0 is still not accessible, failover happens,
>> standby
>> >> is
>> >> >>>>>>>> new
>> >> >>>>>>>> >> master
>> >> >>>>>>>> >> >> and backend0 is degraded
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> Kind regards,
>> >> >>>>>>>> >> >> Stevo.
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>> >> >>>>>>>> >> >>
>> >> >>>>>>>> >> >>> In my test evironment, the patch works as expected. I
>> have
>> >> two
>> >> >>>>>>>> >> >>> backends. Health check retry conf is as follows:
>> >> >>>>>>>> >> >>>
>> >> >>>>>>>> >> >>> health_check_max_retries = 3
>> >> >>>>>>>> >> >>> health_check_retry_delay = 1
>> >> >>>>>>>> >> >>>
>> >> >>>>>>>> >> >>> 5 09:17:20 LOG:   pid 21411: Backend status file
>> >> >>>>>>>> /home/t-ishii/work/
>> >> >>>>>>>> >> >>> git.postgresql.org/test/log/pgpool_status discarded
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: pgpool-II
>> >> successfully
>> >> >>>>>>>> started.
>> >> >>>>>>>> >> >>> version 3.2alpha1 (hatsuiboshi)
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411:
>> find_primary_node:
>> >> >>>>>>>> primary node
>> >> >>>>>>>> >> id
>> >> >>>>>>>> >> >>> is 0
>> >> >>>>>>>> >> >>> -- backend1 was shutdown
>> >> >>>>>>>> >> >>>
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
>> >> >>>>>>>> check_replication_time_lag: could
>> >> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>> >> >>>>>>>> sr_check_password
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> -- health check failed
>> >> >>>>>>>> >> >>>
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411: health check
>> failed.
>> >> 1
>> >> >>>>>>>> th host
>> >> >>>>>>>> >> /tmp
>> >> >>>>>>>> >> >>> at port 11001 is down
>> >> >>>>>>>> >> >>> -- start retrying
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 LOG:   pid 21411: health check
>> retry
>> >> >>>>>>>> sleep time: 1
>> >> >>>>>>>> >> >>> second(s)
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411: health check
>> failed.
>> >> 1
>> >> >>>>>>>> th host
>> >> >>>>>>>> >> /tmp
>> >> >>>>>>>> >> >>> at port 11001 is down
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 LOG:   pid 21411: health check
>> retry
>> >> >>>>>>>> sleep time: 1
>> >> >>>>>>>> >> >>> second(s)
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411: health check
>> failed.
>> >> 1
>> >> >>>>>>>> th host
>> >> >>>>>>>> >> /tmp
>> >> >>>>>>>> >> >>> at port 11001 is down
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 LOG:   pid 21411: health check
>> retry
>> >> >>>>>>>> sleep time: 1
>> >> >>>>>>>> >> >>> second(s)
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411: health check
>> failed.
>> >> 1
>> >> >>>>>>>> th host
>> >> >>>>>>>> >> /tmp
>> >> >>>>>>>> >> >>> at port 11001 is down
>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 LOG:   pid 21411: health_check: 1
>> >> >>>>>>>> failover is
>> >> >>>>>>>> >> canceld
>> >> >>>>>>>> >> >>> because failover is disallowed
>> >> >>>>>>>> >> >>> -- after 3 retries, pgpool wanted to failover, but
>> gave up
>> >> >>>>>>>> because
>> >> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER is set for backend1
>> >> >>>>>>>> >> >>>
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
>> >> >>>>>>>> check_replication_time_lag: could
>> >> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
>> >> >>>>>>>> sr_check_password
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411: health check
>> failed.
>> >> 1
>> >> >>>>>>>> th host
>> >> >>>>>>>> >> /tmp
>> >> >>>>>>>> >> >>> at port 11001 is down
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 LOG:   pid 21411: health check
>> retry
>> >> >>>>>>>> sleep time: 1
>> >> >>>>>>>> >> >>> second(s)
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such file
>> or
>> >> >>>>>>>> directory
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
>> >> >>>>>>>> make_persistent_db_connection:
>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411: health check
>> failed.
>> >> 1
>> >> >>>>>>>> th host
>> >> >>>>>>>> >> /tmp
>> >> >>>>>>>> >> >>> at port 11001 is down
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 LOG:   pid 21411: health check
>> retry
>> >> >>>>>>>> sleep time: 1
>> >> >>>>>>>> >> >>> second(s)
>> >> >>>>>>>> >> >>> 2012-01-15 09:18:05 LOG:   pid 21411: after some
>> retrying
>> >> >>>>>>>> backend
>> >> >>>>>>>> >> >>> returned to healthy state
>> >> >>>>>>>> >> >>> -- started backend1 and pgpool succeeded in health
>> >> checking.
>> >> >>>>>>>> Resumed
>> >> >>>>>>>> >> >>> using backend1
>> >> >>>>>>>> >> >>> --
>> >> >>>>>>>> >> >>> Tatsuo Ishii
>> >> >>>>>>>> >> >>> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> >>> English: http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> >>> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >> >>>
>> >> >>>>>>>> >> >>> > Hello Tatsuo,
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> > Thank you for the patch and effort, but unfortunately
>> >> this
>> >> >>>>>>>> change
>> >> >>>>>>>> >> won't
>> >> >>>>>>>> >> >>> > work for us. We need to set disallow failover to
>> prevent
>> >> >>>>>>>> failover on
>> >> >>>>>>>> >> >>> child
>> >> >>>>>>>> >> >>> > reported connection errors (it's ok if few clients
>> lose
>> >> >>>>>>>> their
>> >> >>>>>>>> >> >>> connection or
>> >> >>>>>>>> >> >>> > can not connect), and still have pgpool perform
>> failover
>> >> >>>>>>>> but only on
>> >> >>>>>>>> >> >>> failed
>> >> >>>>>>>> >> >>> > health check (if configured, after max retries
>> threshold
>> >> >>>>>>>> has been
>> >> >>>>>>>> >> >>> reached).
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> > Maybe it would be best to add an extra value for
>> >> >>>>>>>> backend_flag -
>> >> >>>>>>>> >> >>> > ALLOW_TO_FAILOVER_ON_HEALTH_CHECK or
>> >> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER_ON_CHILD_ERROR.
>> >> >>>>>>>> >> >>> > It should behave same as DISALLOW_TO_FAILOVER is set,
>> >> with
>> >> >>>>>>>> only
>> >> >>>>>>>> >> >>> difference
>> >> >>>>>>>> >> >>> > in behaviour when health check (if set, max retries)
>> has
>> >> >>>>>>>> failed -
>> >> >>>>>>>> >> unlike
>> >> >>>>>>>> >> >>> > DISALLOW_TO_FAILOVER, this new flag should allow
>> failover
>> >> >>>>>>>> in this
>> >> >>>>>>>> >> case
>> >> >>>>>>>> >> >>> only.
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> > Without this change health check (especially health
>> check
>> >> >>>>>>>> retries)
>> >> >>>>>>>> >> >>> doesn't
>> >> >>>>>>>> >> >>> > make much sense - child error is more likely to
>> occur on
>> >> >>>>>>>> (temporary)
>> >> >>>>>>>> >> >>> > backend failure then health check and will or will
>> not
>> >> cause
>> >> >>>>>>>> >> failover to
>> >> >>>>>>>> >> >>> > occur depending on backend flag, without giving
>> health
>> >> >>>>>>>> check retries
>> >> >>>>>>>> >> a
>> >> >>>>>>>> >> >>> > chance to determine if failure was temporary or not,
>> >> >>>>>>>> risking split
>> >> >>>>>>>> >> brain
>> >> >>>>>>>> >> >>> > situation with two masters just because of temporary
>> >> >>>>>>>> network link
>> >> >>>>>>>> >> >>> hiccup.
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> > Our main problem remains though with the health check
>> >> >>>>>>>> timeout not
>> >> >>>>>>>> >> being
>> >> >>>>>>>> >> >>> > respected in these special conditions we have. Maybe
>> >> Nenad
>> >> >>>>>>>> can help
>> >> >>>>>>>> >> you
>> >> >>>>>>>> >> >>> > more to reproduce the issue on your environment.
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> > Kind regards,
>> >> >>>>>>>> >> >>> > Stevo.
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> > 2012/1/13 Tatsuo Ishii <ishii at postgresql.org>
>> >> >>>>>>>> >> >>> >
>> >> >>>>>>>> >> >>> >> Thanks for pointing it out.
>> >> >>>>>>>> >> >>> >> Yes, checking DISALLOW_TO_FAILOVER before retrying
>> is
>> >> >>>>>>>> wrong.
>> >> >>>>>>>> >> >>> >> However, after retry count over, we should check
>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER I
>> >> >>>>>>>> >> >>> >> think.
>> >> >>>>>>>> >> >>> >> Attached is the patch attempt to fix it. Please try.
>> >> >>>>>>>> >> >>> >> --
>> >> >>>>>>>> >> >>> >> Tatsuo Ishii
>> >> >>>>>>>> >> >>> >> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> >>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> >>> >> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >> >>> >>
>> >> >>>>>>>> >> >>> >> > pgpool is being used in raw mode - just for
>> (health
>> >> >>>>>>>> check based)
>> >> >>>>>>>> >> >>> failover
>> >> >>>>>>>> >> >>> >> > part, so applications are not required to restart
>> when
>> >> >>>>>>>> standby
>> >> >>>>>>>> >> gets
>> >> >>>>>>>> >> >>> >> > promoted to new master. Here is pgpool.conf file
>> and a
>> >> >>>>>>>> very small
>> >> >>>>>>>> >> >>> patch
>> >> >>>>>>>> >> >>> >> > we're using applied to pgpool 3.1.1 release.
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> > We have to have DISALLOW_TO_FAILOVER set for the
>> >> backend
>> >> >>>>>>>> since any
>> >> >>>>>>>> >> >>> child
>> >> >>>>>>>> >> >>> >> > process that detects condition that
>> master/backend0 is
>> >> >>>>>>>> not
>> >> >>>>>>>> >> >>> available, if
>> >> >>>>>>>> >> >>> >> > DISALLOW_TO_FAILOVER was not set, will degenerate
>> >> >>>>>>>> backend without
>> >> >>>>>>>> >> >>> giving
>> >> >>>>>>>> >> >>> >> > health check a chance to retry. We need health
>> check
>> >> >>>>>>>> with retries
>> >> >>>>>>>> >> >>> because
>> >> >>>>>>>> >> >>> >> > condition that backend0 is not available could be
>> >> >>>>>>>> temporary
>> >> >>>>>>>> >> (network
>> >> >>>>>>>> >> >>> >> > glitches to the remote site where master is, or
>> >> >>>>>>>> deliberate
>> >> >>>>>>>> >> failover
>> >> >>>>>>>> >> >>> of
>> >> >>>>>>>> >> >>> >> > master postgres service from one node to the
>> other on
>> >> >>>>>>>> remote site
>> >> >>>>>>>> >> -
>> >> >>>>>>>> >> >>> in
>> >> >>>>>>>> >> >>> >> both
>> >> >>>>>>>> >> >>> >> > cases remote means remote to the pgpool that is
>> going
>> >> to
>> >> >>>>>>>> perform
>> >> >>>>>>>> >> >>> health
>> >> >>>>>>>> >> >>> >> > checks and ultimately the failover) and we don't
>> want
>> >> >>>>>>>> standby to
>> >> >>>>>>>> >> be
>> >> >>>>>>>> >> >>> >> > promoted as easily to a new master, to prevent
>> >> temporary
>> >> >>>>>>>> network
>> >> >>>>>>>> >> >>> >> conditions
>> >> >>>>>>>> >> >>> >> > which could occur frequently to frequently cause
>> split
>> >> >>>>>>>> brain with
>> >> >>>>>>>> >> two
>> >> >>>>>>>> >> >>> >> > masters.
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> > But then, with DISALLOW_TO_FAILOVER set, without
>> the
>> >> >>>>>>>> patch health
>> >> >>>>>>>> >> >>> check
>> >> >>>>>>>> >> >>> >> > will not retry and will thus give only one chance
>> to
>> >> >>>>>>>> backend (if
>> >> >>>>>>>> >> >>> health
>> >> >>>>>>>> >> >>> >> > check ever occurs before child process failure to
>> >> >>>>>>>> connect to the
>> >> >>>>>>>> >> >>> >> backend),
>> >> >>>>>>>> >> >>> >> > rendering retry settings effectively to be
>> ignored.
>> >> >>>>>>>> That's where
>> >> >>>>>>>> >> this
>> >> >>>>>>>> >> >>> >> patch
>> >> >>>>>>>> >> >>> >> > comes into action - enables health check retries
>> while
>> >> >>>>>>>> child
>> >> >>>>>>>> >> >>> processes
>> >> >>>>>>>> >> >>> >> are
>> >> >>>>>>>> >> >>> >> > prevented to degenerate backend.
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> > I don't think, but I could be wrong, that this
>> patch
>> >> >>>>>>>> influences
>> >> >>>>>>>> >> the
>> >> >>>>>>>> >> >>> >> > behavior we're seeing with unwanted health check
>> >> attempt
>> >> >>>>>>>> delays.
>> >> >>>>>>>> >> >>> Also,
>> >> >>>>>>>> >> >>> >> > knowing this, maybe pgpool could be patched or
>> some
>> >> >>>>>>>> other support
>> >> >>>>>>>> >> be
>> >> >>>>>>>> >> >>> >> built
>> >> >>>>>>>> >> >>> >> > into it to cover this use case.
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> > Regards,
>> >> >>>>>>>> >> >>> >> > Stevo.
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> > 2012/1/12 Tatsuo Ishii <ishii at postgresql.org>
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> >> I have accepted the moderation request. Your post
>> >> >>>>>>>> should be sent
>> >> >>>>>>>> >> >>> >> shortly.
>> >> >>>>>>>> >> >>> >> >> Also I have raised the post size limit to 1MB.
>> >> >>>>>>>> >> >>> >> >> I will look into this...
>> >> >>>>>>>> >> >>> >> >> --
>> >> >>>>>>>> >> >>> >> >> Tatsuo Ishii
>> >> >>>>>>>> >> >>> >> >> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> >>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> >>> >> >> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >> >>> >> >>
>> >> >>>>>>>> >> >>> >> >> > Here is the log file and strace output file
>> (this
>> >> >>>>>>>> time in an
>> >> >>>>>>>> >> >>> archive,
>> >> >>>>>>>> >> >>> >> >> > didn't know about 200KB constraint on post size
>> >> which
>> >> >>>>>>>> requires
>> >> >>>>>>>> >> >>> >> moderator
>> >> >>>>>>>> >> >>> >> >> > approval). Timings configured are 30sec health
>> >> check
>> >> >>>>>>>> interval,
>> >> >>>>>>>> >> >>> 5sec
>> >> >>>>>>>> >> >>> >> >> > timeout, and 2 retries with 10sec retry delay.
>> >> >>>>>>>> >> >>> >> >> >
>> >> >>>>>>>> >> >>> >> >> > It takes a lot more than 5sec from started
>> health
>> >> >>>>>>>> check to
>> >> >>>>>>>> >> >>> sleeping
>> >> >>>>>>>> >> >>> >> 10sec
>> >> >>>>>>>> >> >>> >> >> > for first retry.
>> >> >>>>>>>> >> >>> >> >> >
>> >> >>>>>>>> >> >>> >> >> > Seen in code (main.x, health_check() function),
>> >> >>>>>>>> within (retry)
>> >> >>>>>>>> >> >>> attempt
>> >> >>>>>>>> >> >>> >> >> > there is inner retry (first with postgres
>> database
>> >> >>>>>>>> then with
>> >> >>>>>>>> >> >>> >> template1)
>> >> >>>>>>>> >> >>> >> >> and
>> >> >>>>>>>> >> >>> >> >> > that part doesn't seem to be interrupted by
>> alarm.
>> >> >>>>>>>> >> >>> >> >> >
>> >> >>>>>>>> >> >>> >> >> > Regards,
>> >> >>>>>>>> >> >>> >> >> > Stevo.
>> >> >>>>>>>> >> >>> >> >> >
>> >> >>>>>>>> >> >>> >> >> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
>> >> >>>>>>>> >> >>> >> >> >
>> >> >>>>>>>> >> >>> >> >> >> Here is the log file and strace output file.
>> >> Timings
>> >> >>>>>>>> >> configured
>> >> >>>>>>>> >> >>> are
>> >> >>>>>>>> >> >>> >> >> 30sec
>> >> >>>>>>>> >> >>> >> >> >> health check interval, 5sec timeout, and 2
>> retries
>> >> >>>>>>>> with 10sec
>> >> >>>>>>>> >> >>> retry
>> >> >>>>>>>> >> >>> >> >> delay.
>> >> >>>>>>>> >> >>> >> >> >>
>> >> >>>>>>>> >> >>> >> >> >> It takes a lot more than 5sec from started
>> health
>> >> >>>>>>>> check to
>> >> >>>>>>>> >> >>> sleeping
>> >> >>>>>>>> >> >>> >> >> 10sec
>> >> >>>>>>>> >> >>> >> >> >> for first retry.
>> >> >>>>>>>> >> >>> >> >> >>
>> >> >>>>>>>> >> >>> >> >> >> Seen in code (main.x, health_check()
>> function),
>> >> >>>>>>>> within (retry)
>> >> >>>>>>>> >> >>> >> attempt
>> >> >>>>>>>> >> >>> >> >> >> there is inner retry (first with postgres
>> database
>> >> >>>>>>>> then with
>> >> >>>>>>>> >> >>> >> template1)
>> >> >>>>>>>> >> >>> >> >> and
>> >> >>>>>>>> >> >>> >> >> >> that part doesn't seem to be interrupted by
>> alarm.
>> >> >>>>>>>> >> >>> >> >> >>
>> >> >>>>>>>> >> >>> >> >> >> Regards,
>> >> >>>>>>>> >> >>> >> >> >> Stevo.
>> >> >>>>>>>> >> >>> >> >> >>
>> >> >>>>>>>> >> >>> >> >> >>
>> >> >>>>>>>> >> >>> >> >> >> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
>> >> >>>>>>>> >> >>> >> >> >>
>> >> >>>>>>>> >> >>> >> >> >>> Ok, I will do it. In the mean time you could
>> use
>> >> >>>>>>>> "strace -tt
>> >> >>>>>>>> >> -p
>> >> >>>>>>>> >> >>> PID"
>> >> >>>>>>>> >> >>> >> >> >>> to see which system call is blocked.
>> >> >>>>>>>> >> >>> >> >> >>> --
>> >> >>>>>>>> >> >>> >> >> >>> Tatsuo Ishii
>> >> >>>>>>>> >> >>> >> >> >>> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> >>> >> >> >>> English:
>> http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> >>> >> >> >>> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >> >>> >> >> >>>
>> >> >>>>>>>> >> >>> >> >> >>> > OK, got the info - key point is that ip
>> >> >>>>>>>> forwarding is
>> >> >>>>>>>> >> >>> disabled for
>> >> >>>>>>>> >> >>> >> >> >>> security
>> >> >>>>>>>> >> >>> >> >> >>> > reasons. Rules in iptables are not
>> important,
>> >> >>>>>>>> iptables can
>> >> >>>>>>>> >> be
>> >> >>>>>>>> >> >>> >> >> stopped,
>> >> >>>>>>>> >> >>> >> >> >>> or
>> >> >>>>>>>> >> >>> >> >> >>> > previously added rules removed.
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > Here are the steps to reproduce (kudos to
>> my
>> >> >>>>>>>> colleague
>> >> >>>>>>>> >> Nenad
>> >> >>>>>>>> >> >>> >> >> Bulatovic
>> >> >>>>>>>> >> >>> >> >> >>> for
>> >> >>>>>>>> >> >>> >> >> >>> > providing this):
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > 1.) make sure that ip forwarding is off:
>> >> >>>>>>>> >> >>> >> >> >>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
>> >> >>>>>>>> >> >>> >> >> >>> > 2.) create IP alias on some interface (and
>> have
>> >> >>>>>>>> postgres
>> >> >>>>>>>> >> >>> listen on
>> >> >>>>>>>> >> >>> >> >> it):
>> >> >>>>>>>> >> >>> >> >> >>> >     ip addr add x.x.x.x/yy dev ethz
>> >> >>>>>>>> >> >>> >> >> >>> > 3.) set backend_hostname0 to
>> aforementioned IP
>> >> >>>>>>>> >> >>> >> >> >>> > 4.) start pgpool and monitor health checks
>> >> >>>>>>>> >> >>> >> >> >>> > 5.) remove IP alias:
>> >> >>>>>>>> >> >>> >> >> >>> >     ip addr del x.x.x.x/yy dev ethz
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > Here is the interesting part in pgpool log
>> >> after
>> >> >>>>>>>> this:
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358:
>> starting
>> >> >>>>>>>> health
>> >> >>>>>>>> >> checking
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358:
>> >> >>>>>>>> health_check: 0 th DB
>> >> >>>>>>>> >> >>> node
>> >> >>>>>>>> >> >>> >> >> >>> status: 2
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358:
>> >> >>>>>>>> health_check: 1 th DB
>> >> >>>>>>>> >> >>> node
>> >> >>>>>>>> >> >>> >> >> >>> status: 1
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358:
>> starting
>> >> >>>>>>>> health
>> >> >>>>>>>> >> checking
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358:
>> >> >>>>>>>> health_check: 0 th DB
>> >> >>>>>>>> >> >>> node
>> >> >>>>>>>> >> >>> >> >> >>> status: 2
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358:
>> >> >>>>>>>> health_check: 0 th DB
>> >> >>>>>>>> >> >>> node
>> >> >>>>>>>> >> >>> >> >> >>> status: 2
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 ERROR: pid 24358:
>> health
>> >> >>>>>>>> check failed.
>> >> >>>>>>>> >> 0
>> >> >>>>>>>> >> >>> th
>> >> >>>>>>>> >> >>> >> host
>> >> >>>>>>>> >> >>> >> >> >>> > 192.168.2.27 at port 5432 is down
>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 LOG:   pid 24358:
>> health
>> >> >>>>>>>> check retry
>> >> >>>>>>>> >> sleep
>> >> >>>>>>>> >> >>> >> time:
>> >> >>>>>>>> >> >>> >> >> 10
>> >> >>>>>>>> >> >>> >> >> >>> > second(s)
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > That pgpool was configured with health
>> check
>> >> >>>>>>>> interval of
>> >> >>>>>>>> >> >>> 30sec,
>> >> >>>>>>>> >> >>> >> 5sec
>> >> >>>>>>>> >> >>> >> >> >>> > timeout, and 10sec retry delay with 2 max
>> >> retries.
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > Making use of libpq instead for connecting
>> to
>> >> db
>> >> >>>>>>>> in health
>> >> >>>>>>>> >> >>> checks
>> >> >>>>>>>> >> >>> >> IMO
>> >> >>>>>>>> >> >>> >> >> >>> > should resolve it, but you'll best
>> determine
>> >> >>>>>>>> which call
>> >> >>>>>>>> >> >>> exactly
>> >> >>>>>>>> >> >>> >> gets
>> >> >>>>>>>> >> >>> >> >> >>> > blocked waiting. Btw, psql with
>> >> PGCONNECT_TIMEOUT
>> >> >>>>>>>> env var
>> >> >>>>>>>> >> >>> >> configured
>> >> >>>>>>>> >> >>> >> >> >>> > respects that env var timeout.
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > Regards,
>> >> >>>>>>>> >> >>> >> >> >>> > Stevo.
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo
>> Slavić
>> >> <
>> >> >>>>>>>> >> >>> sslavic at gmail.com
>> >> >>>>>>>> >> >>> >> >
>> >> >>>>>>>> >> >>> >> >> >>> wrote:
>> >> >>>>>>>> >> >>> >> >> >>> >
>> >> >>>>>>>> >> >>> >> >> >>> >> Tatsuo,
>> >> >>>>>>>> >> >>> >> >> >>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >> Did you restart iptables after adding
>> rule?
>> >> >>>>>>>> >> >>> >> >> >>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >> Regards,
>> >> >>>>>>>> >> >>> >> >> >>> >> Stevo.
>> >> >>>>>>>> >> >>> >> >> >>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo
>> >> Slavić <
>> >> >>>>>>>> >> >>> >> sslavic at gmail.com>
>> >> >>>>>>>> >> >>> >> >> >>> wrote:
>> >> >>>>>>>> >> >>> >> >> >>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >>> Looking into this to verify if these are
>> all
>> >> >>>>>>>> necessary
>> >> >>>>>>>> >> >>> changes
>> >> >>>>>>>> >> >>> >> to
>> >> >>>>>>>> >> >>> >> >> have
>> >> >>>>>>>> >> >>> >> >> >>> >>> port unreachable message silently
>> rejected
>> >> >>>>>>>> (suspecting
>> >> >>>>>>>> >> some
>> >> >>>>>>>> >> >>> >> kernel
>> >> >>>>>>>> >> >>> >> >> >>> >>> parameter tuning is needed).
>> >> >>>>>>>> >> >>> >> >> >>> >>>
>> >> >>>>>>>> >> >>> >> >> >>> >>> Just to clarify it's not a problem that
>> host
>> >> is
>> >> >>>>>>>> being
>> >> >>>>>>>> >> >>> detected
>> >> >>>>>>>> >> >>> >> by
>> >> >>>>>>>> >> >>> >> >> >>> pgpool
>> >> >>>>>>>> >> >>> >> >> >>> >>> to be down, but the timing when that
>> >> happens. On
>> >> >>>>>>>> >> environment
>> >> >>>>>>>> >> >>> >> where
>> >> >>>>>>>> >> >>> >> >> >>> issue is
>> >> >>>>>>>> >> >>> >> >> >>> >>> reproduced pgpool as part of health check
>> >> >>>>>>>> attempt tries
>> >> >>>>>>>> >> to
>> >> >>>>>>>> >> >>> >> connect
>> >> >>>>>>>> >> >>> >> >> to
>> >> >>>>>>>> >> >>> >> >> >>> >>> backend and hangs for tcp timeout
>> instead of
>> >> >>>>>>>> being
>> >> >>>>>>>> >> >>> interrupted
>> >> >>>>>>>> >> >>> >> by
>> >> >>>>>>>> >> >>> >> >> >>> timeout
>> >> >>>>>>>> >> >>> >> >> >>> >>> alarm. Can you verify/confirm please the
>> >> health
>> >> >>>>>>>> check
>> >> >>>>>>>> >> retry
>> >> >>>>>>>> >> >>> >> timings
>> >> >>>>>>>> >> >>> >> >> >>> are not
>> >> >>>>>>>> >> >>> >> >> >>> >>> delayed?
>> >> >>>>>>>> >> >>> >> >> >>> >>>
>> >> >>>>>>>> >> >>> >> >> >>> >>> Regards,
>> >> >>>>>>>> >> >>> >> >> >>> >>> Stevo.
>> >> >>>>>>>> >> >>> >> >> >>> >>>
>> >> >>>>>>>> >> >>> >> >> >>> >>>
>> >> >>>>>>>> >> >>> >> >> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo
>> >> Ishii <
>> >> >>>>>>>> >> >>> >> >> ishii at postgresql.org
>> >> >>>>>>>> >> >>> >> >> >>> >wrote:
>> >> >>>>>>>> >> >>> >> >> >>> >>>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> Ok, I did:
>> >> >>>>>>>> >> >>> >> >> >>> >>>>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> # iptables -A FORWARD -j REJECT
>> >> --reject-with
>> >> >>>>>>>> >> >>> >> >> icmp-port-unreachable
>> >> >>>>>>>> >> >>> >> >> >>> >>>>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> on the host where pgpoo is running. And
>> pull
>> >> >>>>>>>> network
>> >> >>>>>>>> >> cable
>> >> >>>>>>>> >> >>> from
>> >> >>>>>>>> >> >>> >> >> >>> >>>> backend0 host network interface. Pgpool
>> >> >>>>>>>> detected the
>> >> >>>>>>>> >> host
>> >> >>>>>>>> >> >>> being
>> >> >>>>>>>> >> >>> >> >> down
>> >> >>>>>>>> >> >>> >> >> >>> >>>> as expected...
>> >> >>>>>>>> >> >>> >> >> >>> >>>> --
>> >> >>>>>>>> >> >>> >> >> >>> >>>> Tatsuo Ishii
>> >> >>>>>>>> >> >>> >> >> >>> >>>> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> >>> >> >> >>> >>>> English:
>> >> http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> >>> >> >> >>> >>>> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >> >>> >> >> >>> >>>>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> > Backend is not destination of this
>> >> message,
>> >> >>>>>>>> pgpool
>> >> >>>>>>>> >> host
>> >> >>>>>>>> >> >>> is,
>> >> >>>>>>>> >> >>> >> and
>> >> >>>>>>>> >> >>> >> >> we
>> >> >>>>>>>> >> >>> >> >> >>> >>>> don't
>> >> >>>>>>>> >> >>> >> >> >>> >>>> > want it to ever get it. With command
>> I've
>> >> >>>>>>>> sent you
>> >> >>>>>>>> >> rule
>> >> >>>>>>>> >> >>> will
>> >> >>>>>>>> >> >>> >> be
>> >> >>>>>>>> >> >>> >> >> >>> >>>> created for
>> >> >>>>>>>> >> >>> >> >> >>> >>>> > any source and destination.
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
>> >> >>>>>>>> >> >>> >> >> >>> >>>> > Regards,
>> >> >>>>>>>> >> >>> >> >> >>> >>>> > Stevo.
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
>> >> >>>>>>>> >> >>> >> >> >>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM,
>> Tatsuo
>> >> >>>>>>>> Ishii <
>> >> >>>>>>>> >> >>> >> >> >>> ishii at postgresql.org>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> wrote:
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> I did following:
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Do following on the host where
>> pgpool is
>> >> >>>>>>>> running on:
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> # iptables -A FORWARD -j REJECT
>> >> >>>>>>>> --reject-with
>> >> >>>>>>>> >> >>> >> >> >>> icmp-port-unreachable -d
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> 133.137.177.124
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> (133.137.177.124 is the host where
>> >> backend
>> >> >>>>>>>> is running
>> >> >>>>>>>> >> >>> on)
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Pull network cable from backend0 host
>> >> >>>>>>>> network
>> >> >>>>>>>> >> interface.
>> >> >>>>>>>> >> >>> >> Pgpool
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> detected the host being down as
>> expected.
>> >> >>>>>>>> Am I
>> >> >>>>>>>> >> missing
>> >> >>>>>>>> >> >>> >> >> something?
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> --
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Tatsuo Ishii
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> SRA OSS, Inc. Japan
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> English:
>> >> >>>>>>>> http://www.sraoss.co.jp/index_en.php
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Japanese: http://www.sraoss.co.jp
>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
>>>> Hello Tatsuo,
>>>>
>>>> With backend0 on one host, just configure the following rule on the
>>>> other host, where pgpool is:
>>>>
>>>> iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>>
>>>> and then have pgpool start up with health checking and retrying
>>>> configured, and then pull the network cable from the backend0 host
>>>> network interface.
>>>>
>>>> Regards,
>>>> Stevo.
>>>>
>>>> On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org>
>>>> wrote:
>>>>
>>>>> I want to try to test the situation you described:
>>>>>
>>>>> > When system is configured for security reasons not to return
>>>>> > destination host unreachable messages, even though
>>>>> > health_check_timeout is
>>>>>
>>>>> But I don't know how to do it. I pulled out the network cable and
>>>>> pgpool detected it as expected. I also configured the server which
>>>>> PostgreSQL is running on to disable the 5432 port. In this case
>>>>> connect(2) returned EHOSTUNREACH (No route to host), so pgpool detected
>>>>> the error as expected.
>>>>>
>>>>> Could you please instruct me?
>>>>> --
>>>>> Tatsuo Ishii
>>>>> SRA OSS, Inc. Japan
>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>> Japanese: http://www.sraoss.co.jp
>>>>>
>>>>>> Hello Tatsuo,
>>>>>>
>>>>>> Thank you for replying!
>>>>>>
>>>>>> I'm not sure what exactly is blocking; just from pgpool code analysis I
>>>>>> suspect it is the part where a connection is made to the db, which
>>>>>> doesn't seem to get interrupted by the alarm. I tested health check
>>>>>> behaviour thoroughly: it works really well when the host/IP is there
>>>>>> and just the backend/postgres is down, but not when the backend host/IP
>>>>>> is down. I could see in the log that the initial health check and each
>>>>>> retry got delayed when the host/IP is not reachable, while when just
>>>>>> the backend is not listening (is down) on a reachable host/IP, the
>>>>>> initial health check and all retries match the settings in pgpool.conf
>>>>>> exactly.
>>>>>>
>>>>>> PGCONNECT_TIMEOUT is listed as one of the libpq environment variables
>>>>>> (see http://www.postgresql.org/docs/9.1/static/libpq-envars.html).
>>>>>> There is an equivalent connection parameter, connect_timeout, for
>>>>>> libpq's PQconnectdbParams (see
>>>>>> http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT).
>>>>>> At the beginning of that same page there is some important information
>>>>>> on using these functions.
>>>>>>
>>>>>> psql respects PGCONNECT_TIMEOUT.
>>>>>>
>>>>>> Regards,
>>>>>> Stevo.
>>>>>>
>>>>>> On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org>
>>>>>> wrote:
>>>>>>
>>>>>>>> Hello pgpool community,
>>>>>>>>
>>>>>>>> When the system is configured, for security reasons, not to return
>>>>>>>> destination host unreachable messages, then even though
>>>>>>>> health_check_timeout is configured, the socket call will block and
>>>>>>>> the alarm will not get raised until the TCP timeout occurs.
>>>>>>>
>>>>>>> Interesting. So are you saying that read(2) cannot be interrupted by
>>>>>>> the alarm signal if the system is configured not to return destination
>>>>>>> host unreachable messages? Could you please guide me to where I can
>>>>>>> find such info? (I'm not a network expert.)
>>>>>>>
>>>>>>>> I am not a C programmer, but I found some info that the socket call
>>>>>>>> could be replaced with select/pselect calls. Maybe it would be best
>>>>>>>> if the PGCONNECT_TIMEOUT value could be used here for the connection
>>>>>>>> timeout. pgpool has libpq as a dependency, so why isn't it using
>>>>>>>> libpq for the health check db connect calls? Then PGCONNECT_TIMEOUT
>>>>>>>> would be applied.
>>>>>>>
>>>>>>> I don't think libpq uses select/pselect for establishing a connection,
>>>>>>> but using libpq instead of homebrew code seems to be an idea. Let me
>>>>>>> think about it.
>>>>>>>
>>>>>>> One question. Are you sure that libpq can deal with this case (not
>>>>>>> returning destination host unreachable messages) by using
>>>>>>> PGCONNECT_TIMEOUT?
>>>>>>> --
>>>>>>> Tatsuo Ishii
>>>>>>> SRA OSS, Inc. Japan
>>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>>> Japanese: http://www.sraoss.co.jp
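
To make the failure mode in the quoted thread concrete, below is a minimal
sketch of the non-blocking connect-with-timeout technique under discussion.
It is illustrative only, not the attached patch or pgpool's actual health
check code; the function name, parameters, and error handling are invented
for the example.

/*
 * Sketch: connect(2) on a non-blocking socket, bounded by select(2), so a
 * silently-dropped SYN (no ICMP "destination unreachable" reply) cannot make
 * the caller hang until the kernel's TCP timeout.
 */
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns a connected fd, or -1 on error or timeout. */
int connect_with_timeout(const struct sockaddr_in *addr, int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately. */
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    if (connect(fd, (const struct sockaddr *) addr, sizeof(*addr)) < 0 &&
        errno != EINPROGRESS)
    {
        close(fd);
        return -1;
    }

    /* Wait for the connection to complete, but no longer than timeout_sec. */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
    {
        close(fd);              /* timed out, or select() failed */
        return -1;
    }

    /* The socket became writable; check whether the connect succeeded. */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0)
    {
        close(fd);
        return -1;
    }

    /* Restore blocking mode, since callers usually expect a blocking fd. */
    fcntl(fd, F_SETFL, flags);
    return fd;
}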
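
For comparison, the libpq route raised in the quoted discussion would look
roughly like the sketch below. The connection values (host, port, database,
user) are placeholders, not pgpool's real configuration; the point is only
that libpq's connect_timeout parameter, or the PGCONNECT_TIMEOUT environment
variable, bounds the connection attempt without hand-rolled socket code.

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Placeholder connection values; connect_timeout caps the attempt at
     * 10 seconds. PGCONNECT_TIMEOUT in the environment would work too. */
    const char *const keywords[] = { "host", "port", "dbname", "user",
                                     "connect_timeout", NULL };
    const char *const values[]   = { "backend0.example.com", "5432",
                                     "postgres", "pgpool", "10", NULL };

    PGconn *conn = PQconnectdbParams(keywords, values, 0);

    if (PQstatus(conn) != CONNECTION_OK)
        fprintf(stderr, "health check failed: %s", PQerrorMessage(conn));
    else
        printf("backend is reachable\n");

    PQfinish(conn);
    return 0;
}

Whether libpq enforces connect_timeout when the peer silently drops packets
(the no-ICMP case above) is exactly the open question in the thread, so this
is a starting point for testing rather than a confirmed fix.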
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20120205/3758fe85/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fixes-rebased.patch
Type: text/x-patch
Size: 3937 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20120205/3758fe85/attachment-0001.bin>

