[pgpool-general: 243] Re: Healthcheck timeout not always respected

Fri Feb 24 10:35:42 JST 2012

Hello Tatsuo,

Imagine two geo-redundant data centers, in each data center couple of nodes
dedicated to running postgres as clustered service, and the two clustered
postgres services (one in each data center) configured in streaming
replication (so one is master, and other standby, standby initiates
replication).

Then imagine pgpool performing failover immediately upon relocation of
currently active postgres master service from one node to the other node
within same data center/cluster, for maintenance of node where master
service was active. Failover will promote standby postgres to new master,
and original master will relocate after couple of seconds and startup as
master - split brain condition - not good.

This can happen in abrupt failures too of a node with active postgres
master service - cluster manager will detect failure of a node and recover
postgres service on another node in same cluster. With pgpool not letting
health check control failover, and such condition occurs a split brain
condition would last some time, and recovery would be hard and painful.

If pgpool didn't failover immediately, and let health check decide if and
when to failover instead, and if pgpool health check is tuned to
retry/delay giving enough time to postgres cluster service to
relocate/recover - split brain condition would not occur.

Kind regards,
Stevo.

2012/2/24 Tatsuo Ishii <ishii at postgresql.org>

> > Hello Tatsuo,
> >
> > Thank you for accepting and improving majority of the changes.
> > Unfortunately, not accepted part is a show stopper so I still have to use
> > patched/customized pgpool version in production, since it seems still to
> be
> > impossible to configure pgpool so that health check is only controls if
> and
> > when failover should triggered. With latest sources, and pgpool
> configured
> > as you suggested (backed flag set to ALLOW_TO_FAILOVER, and
> > fail_over_on_backend_error set to off) with two backends in raw mode,
> after
> > initial stable state pgpool triggered failover of primary backend as soon
> > as that backend became inaccessible to pgpool without giving healthcheck
> a
> > chance. Primary backend was shutdown to simulate failure/relocation but
> > same would happen if just connecting to backend failed because of
> temporary
> > network issue.
> >
> > This behaviour is in line with documentation of
> fail_over_on_backend_error
> > configuration parameter which states:
> > "Please note that even if this parameter is set to off, however, pgpool
> > will also do the fail over when connecting to a backend fails or pgpool
> > detects the administrative shutdown of postmaster."
> >
> > But, it is perfectly valid requirement to want to prevent failover to
> occur
> > immediately when connecting to a backend fails - that condition could be
> > temporary, e.g. temporary network condition. Health check retries were
> > designed to cover this situation, so one can configure even for health
> > check to fail connecting several times, but all is fine and no failover
> > should occur as long as after configured number of retries backend is
> > accessible again.
>
> I understand the point. It would be nice to retry connecting to
> backend in pgpool child as the health check does.
>
> > Also, it is perfectly valid requirement to prevent failover to occur
> > immediately when administrative shutdown of backend is performed. For
> > example, a single backend for high availability and easy maintenance can
> be
> > configured as cluster service with e.g. two or more nodes where it can
> run
> > while it actually runs on one node only at a given point in time. So e.g.
> > when admin wants to upgrade postgres installation on each of the nodes
> > within the cluster, to upgrade postgres installation on a node where
> > postgres service is currently active, admin relocates service to some
> other
> > node in a cluster. Relocation causes stop (administrative shutdown) of
> > postgres service on currently active node, and starts it on another node.
> > pgpool which is configured to use such clustered postgres service as a
> > single backend (bound to cluster service ip) should not perform failover
> on
> > detected administrative shutdown - reloaction takes time, and healthcheck
> > is configured to give relocation enough time, and it should be only one
> to
> > trigger failover if backend is still not accessible after configured
> number
> > of retries and delays between them.
>
> This I don't understand. Why don't you use pcp_attach_node in this
> case after failover?
>
> > Given these two examples, I hope you'll agree that it is valid
> requirement
> > to want to let healthcheck only control when failover should be
> triggered.
> >
> > Unfortunately this is not possible at the moment. Configuring backend
> flag
> > to DISALLOW_TO_FAILOVER will prevent health check to trigger failover.
> With
> > fail_over_on_backend_error set to off, will let failover be triggered
> > immediately on temporary conditions that health check with retries should
> > handle.
> >
> > Did I miss something, how does one configure pgpool to have health check
> to
> > be only process in pgpool that triggers failover?
> >
> > Kind regards,
> > Stevo.
> >
> > 2012/2/19 Tatsuo Ishii <ishii at postgresql.org>
> >
> >> Stevo,
> >>
> >> Thanks for the patches. I have committed changes except the part which
> >> you ignore DISALLOW_TO_FAILOVER. Instead I modified low level socket
> >> reading functions not to unconditionaly failover when fails to read
> >> from backend sockets (only failover when If fail_over_on_backend_error
> >> is on). So if you want to trigger failover only when health checking
> >> fails, you want to turn off fail_over_on_backend_error and turn off
> >> DISALLOW_TO_FAILOVER.
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese: http://www.sraoss.co.jp
> >>
> >> > Hello Tatsuo,
> >> >
> >> > Attached is cumulative patch rebased to current master branch head
> which:
> >> > - Fixes health check timeout not always respected (includes unsetting
> >> > non-blocking mode after connection has been successfully established);
> >> > - Fixes failover on health check only support.
> >> >
> >> > Kind regards,
> >> > Stevo.
> >> >
> >> > 2012/2/5 Stevo Slavić <sslavic at gmail.com>
> >> >
> >> >> Tatsuo,
> >> >>
> >> >> Thank you very much for your time and effort put into analysis of the
> >> >> submitted patch,
> >> >>
> >> >>
> >> >> Obviously I'm missing something regarding healthcheck feature, so
> please
> >> >> clarify:
> >> >>
> >> >>    - what is the purpose of healthcheck when backend flag is set to
> >> >>    DISALLOW_TO_FAILOVER? To log that healthchecks are on time but
> will
> >> not
> >> >>    actually do anything?
> >> >>    - what is the purpose of healthcheck (especially with retries
> >> >>    configured) when backend flag is set to ALLOW_TO_FAILOVER? When
> >> answering
> >> >>    please consider case of non-helloworld application that connects
> to
> >> db via
> >> >>    pgpool - will healthcheck be given a chance to fail even once?
> >> >>    - since there is no other backend flag value than the mentioned
> two,
> >> >>    what is the purpose of healthcheck (especially with retries
> >> configured) if
> >> >>    it's not to be the sole process controlling when to failover?
> >> >>
> >> >> I disagree that changing pgpool to give healthcheck feature a meaning
> >> >> disrupts DISALLOW_TO_FAILOVER meaning, it extends it just for case
> when
> >> >> healthcheck is configured - if one doesn't want healthcheck just
> keep on
> >> >> not-using it, it's disabled by default. Health checks and retries
> have
> >> only
> >> >> recently been introduced so I doubt there are many if any users of
> >> health
> >> >> check especially which have configured DISALLOW_TO_FAILOVER with
> >> >> expectation to just have health check logging but not actually do
> >> anything.
> >> >> Out of all pgpool healthcheck users which have backends set to
> >> >> DISALLOW_TO_FAILOVER too I believe most of them expect but do not
> know
> >> that
> >> >> this will not allow failover on health check, it will just make log
> >> bigger.
> >> >> Changes included in patch do not affect users which have health check
> >> >> configured and backend set to ALLOW_TO_FAILOVER.
> >> >>
> >> >>
> >> >> About non-blocking connection to backend change:
> >> >>
> >> >>    - with pgpool in raw mode and extensive testing (endurance tests,
> >> >>    failover and failback tests), I didn't notice any unwanted change
> in
> >> >>    behaviour, apart from wanted non-blocking timeout aware health
> >> checks;
> >> >>    - do you see or know about anything in pgpool depending on
> connection
> >> >>    to backend being blocking one? will have a look myself, just
> asking
> >> maybe
> >> >>    you've found something already. will look into means to set
> >> connection back
> >> >>    to being blocking after it's successfully established - maybe just
> >> changing
> >> >>    that flag will do.
> >> >>
> >> >>
> >> >> Kind regards,
> >> >>
> >> >> Stevo.
> >> >>
> >> >>
> >> >> On Feb 5, 2012 6:50 AM, "Tatsuo Ishii" <ishii at postgresql.org> wrote:
> >> >>
> >> >>> Finially I have time to check your patches. Here is the result of
> >> review.
> >> >>>
> >> >>> > Hello Tatsuo,
> >> >>> >
> >> >>> > Here is cumulative patch to be applied on pgpool master branch
> with
> >> >>> > following fixes included:
> >> >>> >
> >> >>> >    1. fix for health check bug
> >> >>> >       1. it was not possible to allow backend failover only on
> failed
> >> >>> >       health check(s);
> >> >>> >       2. to achieve this one just configures backend to
> >> >>> >       DISALLOW_TO_FAILOVER, sets fail_over_on_backend_error to
> off,
> >> and
> >> >>> >       configures health checks;
> >> >>> >       3. for this fix in code an unwanted check was removed in
> >> main.c,
> >> >>> >       after health check failed if DISALLOW_TO_FAILOVER was set
> for
> >> >>> backend
> >> >>> >       failover would have been always prevented, even when one
> >> >>> > configures health
> >> >>> >       check whose sole purpose is to control failover
> >> >>>
> >> >>> This is not acceptable, at least for stable
> >> >>> releases. DISALLOW_TO_FAILOVER and sets fail_over_on_backend_error
> are
> >> >>> for different purposes. The former is for preventing any failover
> >> >>> including health check. The latter is for write to communication
> >> >>> socket.
> >> >>>
> >> >>> fail_over_on_backend_error = on
> >> >>>                                   # Initiates failover when writing
> to
> >> the
> >> >>>                                   # backend communication socket
> fails
> >> >>>                                   # This is the same behaviour of
> >> >>> pgpool-II
> >> >>>                                   # 2.2.x and previous releases
> >> >>>                                   # If set to off, pgpool will
> report
> >> an
> >> >>>                                   # error and disconnect the
> session.
> >> >>>
> >> >>> Your patch changes the existing semantics. Another point is,
> >> >>> DISALLOW_TO_FAILOVER allows to control per backend behavior. Your
> >> >>> patch breaks it.
> >> >>>
> >> >>> >       2. fix for health check bug
> >> >>> >       1. health check timeout was not being respected in all
> >> conditions
> >> >>> >       (icmp host unreachable messages dropped for security
> reasons,
> >> or
> >> >>> > no active
> >> >>> >       network component to send those message)
> >> >>> >       2. for this fix in code (main.c, pool.h,
> >> pool_connection_pool.c)
> >> >>> inet
> >> >>> >       connections have been made to be non blocking, and during
> >> >>> connection
> >> >>> >       retries status of now global health_check_timer_expired
> >> variable
> >> >>> is being
> >> >>> >       checked
> >> >>>
> >> >>> This seems good. But I need more investigation. For example, your
> >> >>> patch set non blocking to sockets but never revert back to blocking.
> >> >>>
> >> >>> >       3. fix for failback bug
> >> >>> >       1. in raw mode, after failback (through pcp_attach_node)
> >> standby
> >> >>> >       node/backend would remain in invalid state
> >> >>>
> >> >>> It turned out that even failover was bugged. The status was not set
> to
> >> >>> CON_DOWN. This leaves the status to CON_CONNECT_WAIT and it
> prevented
> >> >>> failback from returning to normal state. I fixed this on master
> branch.
> >> >>>
> >> >>> > (it would be in CON_UP, so on
> >> >>> >       failover after failback pgpool would not be able to connect
> to
> >> >>> standby as
> >> >>> >       get_next_master_node expects standby nodes/backends in raw
> mode
> >> >>> to be in
> >> >>> >       CON_CONNECT_WAIT state when finding next master node)
> >> >>> >       2. for this fix in code, when in raw mode on failback
> status of
> >> >>> all
> >> >>> >       nodes/backends with CON_UP state is set to CON_CONNECT_WAIT
> -
> >> >>> > all children
> >> >>> >       are restarted anyway
> >> >>>
> >> >>>
> >> >>> > Neither of these fixes changes expected behaviour of related
> >> features so
> >> >>> > there are no changes to the documentation.
> >> >>> >
> >> >>> >
> >> >>> > Kind regards,
> >> >>> >
> >> >>> > Stevo.
> >> >>> >
> >> >>> >
> >> >>> > 2012/1/24 Tatsuo Ishii <ishii at postgresql.org>
> >> >>> >
> >> >>> >> > Additional testing confirmed that this fix ensures health check
> >> timer
> >> >>> >> gets
> >> >>> >> > respected (should I create a ticket on some issue tracker? send
> >> >>> >> cumulative
> >> >>> >> > patch with all changes to have it accepted?).
> >> >>> >>
> >> >>> >> We have problem with Mantis bug tracker and decided to stop using
> >> >>> >> it(unless someone volunteers to fix it). Please send cumulative
> >> patch
> >> >>> >> againt master head to this list so that we will be able to look
> >> >>> >> into(be sure to include English doc changes).
> >> >>> >> --
> >> >>> >> Tatsuo Ishii
> >> >>> >> SRA OSS, Inc. Japan
> >> >>> >> English: http://www.sraoss.co.jp/index_en.php
> >> >>> >> Japanese: http://www.sraoss.co.jp
> >> >>> >>
> >> >>> >> > Problem is that with all the testing another issue has been
> >> >>> encountered,
> >> >>> >> > now with pcp_attach_node.
> >> >>> >> >
> >> >>> >> > With pgpool in raw mode and two backends in postgres 9
> streaming
> >> >>> >> > replication, when backend0 fails, after health checks retries
> >> pgpool
> >> >>> >> calls
> >> >>> >> > failover command and degenerates backend0, backend1 gets
> promoted
> >> to
> >> >>> new
> >> >>> >> > master, pgpool can connect to that master, and two backends
> are in
> >> >>> pgpool
> >> >>> >> > state 3/2. And this is ok and expected.
> >> >>> >> >
> >> >>> >> > Once backend0 is recovered, it's attached back to pgpool using
> >> >>> >> > pcp_attach_node, and pgpool will show two backends in state 2/2
> >> (in
> >> >>> logs
> >> >>> >> > and in show pool_nodes; query) with backend0 taking all the
> load
> >> (raw
> >> >>> >> > mode). If after that recovery and attachment of backend0
> pgpool is
> >> >>> not
> >> >>> >> > restarted, and afetr some time backend0 fails again, after
> health
> >> >>> check
> >> >>> >> > retries backend0 will get degenerated, failover command will
> get
> >> >>> called
> >> >>> >> > (promotes standby to master), but pgpool will not be able to
> >> connect
> >> >>> to
> >> >>> >> > backend1 (regardless if unix or inet sockets are used for
> >> backend1).
> >> >>> Only
> >> >>> >> > if pgpool is restarted before second (complete) failure of
> >> backend0,
> >> >>> will
> >> >>> >> > pgpool be able to connect to backend1.
> >> >>> >> >
> >> >>> >> > Following code, pcp_attach_node (failback of backend0) will
> >> actually
> >> >>> >> > execute same code as for failover. Not sure what, but that
> >> failover
> >> >>> does
> >> >>> >> > something with backend1 state or in memory settings, so that
> >> pgpool
> >> >>> can
> >> >>> >> no
> >> >>> >> > longer connect to backend1. Is this a known issue?
> >> >>> >> >
> >> >>> >> > Kind regards,
> >> >>> >> > Stevo.
> >> >>> >> >
> >> >>> >> > 2012/1/20 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >
> >> >>> >> >> Key file was missing from that commit/change - pool.h where
> >> >>> >> >> health_check_timer_expired was made global. Included now
> attached
> >> >>> patch.
> >> >>> >> >>
> >> >>> >> >> Kind regards,
> >> >>> >> >> Stevo.
> >> >>> >> >>
> >> >>> >> >>
> >> >>> >> >> 2012/1/20 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >>
> >> >>> >> >>> Using exit_request was wrong and caused a bug. 4th patch
> needed
> >> -
> >> >>> >> >>> health_check_timer_expired is global now so it can be
> verified
> >> if
> >> >>> it
> >> >>> >> was
> >> >>> >> >>> set to 1 outside of main.c
> >> >>> >> >>>
> >> >>> >> >>>
> >> >>> >> >>> Kind regards,
> >> >>> >> >>> Stevo.
> >> >>> >> >>>
> >> >>> >> >>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >>>
> >> >>> >> >>>> Using exit_code was not wise. Tested and encountered a case
> >> where
> >> >>> this
> >> >>> >> >>>> results in a bug. Have to work on it more. Main issue is
> how in
> >> >>> >> >>>> pool_connection_pool.c connect_inet_domain_socket_by_port
> >> >>> function to
> >> >>> >> know
> >> >>> >> >>>> that health check timer has expired (set to 1). Any ideas?
> >> >>> >> >>>>
> >> >>> >> >>>> Kind regards,
> >> >>> >> >>>> Stevo.
> >> >>> >> >>>>
> >> >>> >> >>>>
> >> >>> >> >>>> 2012/1/19 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >>>>
> >> >>> >> >>>>> Tatsuo,
> >> >>> >> >>>>>
> >> >>> >> >>>>> Here are the patches which should be applied to current
> pgpool
> >> >>> head
> >> >>> >> for
> >> >>> >> >>>>> fixing this issue:
> >> >>> >> >>>>>
> >> >>> >> >>>>> Fixes-health-check-timeout.patch
> >> >>> >> >>>>> Fixes-health-check-retrying-after-failover.patch
> >> >>> >> >>>>> Fixes-clearing-exitrequest-flag.patch
> >> >>> >> >>>>>
> >> >>> >> >>>>> Quirk I noticed in logs was resolved as well - after
> failover
> >> >>> pgpool
> >> >>> >> >>>>> would perform healthcheck and report it is doing (max
> retries
> >> +
> >> >>> 1) th
> >> >>> >> >>>>> health check which was confusing. Rather I've adjusted
> that it
> >> >>> does
> >> >>> >> and
> >> >>> >> >>>>> reports it's doing a new health check cycle after failover.
> >> >>> >> >>>>>
> >> >>> >> >>>>> I've tested and it works well - when in raw mode, backends
> >> set to
> >> >>> >> >>>>> disallow failover, failover on backend failure disabled,
> and
> >> >>> health
> >> >>> >> checks
> >> >>> >> >>>>> configured with retries (30sec interval, 5sec timeout, 2
> >> retries,
> >> >>> >> 10sec
> >> >>> >> >>>>> delay between retries).
> >> >>> >> >>>>>
> >> >>> >> >>>>> Please test, and if confirmed ok include in next release.
> >> >>> >> >>>>>
> >> >>> >> >>>>> Kind regards,
> >> >>> >> >>>>>
> >> >>> >> >>>>> Stevo.
> >> >>> >> >>>>>
> >> >>> >> >>>>>
> >> >>> >> >>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >>>>>
> >> >>> >> >>>>>> Here is pgpool.log, strace.out, and pgpool.conf when I
> tested
> >> >>> with
> >> >>> >> my
> >> >>> >> >>>>>> latest patch for health check timeout applied. It works
> well,
> >> >>> >> except for
> >> >>> >> >>>>>> single quirk, after failover completed in log files it was
> >> >>> reported
> >> >>> >> that
> >> >>> >> >>>>>> 3rd health check retry was done (even though just 2 are
> >> >>> configured,
> >> >>> >> see
> >> >>> >> >>>>>> pgpool.conf) and that backend has returned to healthy
> state.
> >> >>> That
> >> >>> >> >>>>>> interesting part from log file follows:
> >> >>> >> >>>>>>
> >> >>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
> >> >>> DEBUG: pid
> >> >>> >> >>>>>> 1163: retrying 3 th health checking
> >> >>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
> >> >>> DEBUG: pid
> >> >>> >> >>>>>> 1163: health_check: 0 th DB node status: 3
> >> >>> >> >>>>>> Jan 16 01:31:45 sslavic pgpool[1163]: 2012-01-16 01:31:45
> >> LOG:
> >> >>>   pid
> >> >>> >> >>>>>> 1163: after some retrying backend returned to healthy
> state
> >> >>> >> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15
> >> >>> DEBUG: pid
> >> >>> >> >>>>>> 1163: starting health checking
> >> >>> >> >>>>>> Jan 16 01:32:15 sslavic pgpool[1163]: 2012-01-16 01:32:15
> >> >>> DEBUG: pid
> >> >>> >> >>>>>> 1163: health_check: 0 th DB node status: 3
> >> >>> >> >>>>>>
> >> >>> >> >>>>>>
> >> >>> >> >>>>>> As can be seen in pgpool.conf, there is only one backend
> >> >>> configured.
> >> >>> >> >>>>>> pgpool did failover well after health check max retries
> has
> >> been
> >> >>> >> reached
> >> >>> >> >>>>>> (pgpool just degraded that single backend to 3, and
> restarted
> >> >>> child
> >> >>> >> >>>>>> processes).
> >> >>> >> >>>>>>
> >> >>> >> >>>>>> After this quirk has been logged, next health check logs
> >> were as
> >> >>> >> >>>>>> expected. Except those couple weird log entries,
> everything
> >> >>> seems
> >> >>> >> to be ok.
> >> >>> >> >>>>>> Maybe that quirk was caused by single backend only
> >> configuration
> >> >>> >> corner
> >> >>> >> >>>>>> case. Will try tomorrow if it occurs on dual backend
> >> >>> configuration.
> >> >>> >> >>>>>>
> >> >>> >> >>>>>> Regards,
> >> >>> >> >>>>>> Stevo.
> >> >>> >> >>>>>>
> >> >>> >> >>>>>>
> >> >>> >> >>>>>> 2012/1/16 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >>>>>>
> >> >>> >> >>>>>>> Hello Tatsuo,
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>> Unfortunately, with your patch when A is on
> >> >>> >> >>>>>>> (pool_config->health_check_period > 0) and B is on, when
> >> retry
> >> >>> >> count is
> >> >>> >> >>>>>>> over, failover will be disallowed because of B being on.
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>> Nenad's patch allows failover to be triggered only by
> health
> >> >>> check.
> >> >>> >> >>>>>>> Here is the patch which includes Nenad's fix but also
> fixes
> >> >>> issue
> >> >>> >> with
> >> >>> >> >>>>>>> health check timeout not being respected.
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>> Key points in fix for health check timeout being
> respected
> >> are:
> >> >>> >> >>>>>>> - in pool_connection_pool.c
> >> connect_inet_domain_socket_by_port
> >> >>> >> >>>>>>> function, before trying to connect, file descriptor is
> set
> >> to
> >> >>> >> non-blocking
> >> >>> >> >>>>>>> mode, and also non-blocking mode error codes are handled,
> >> >>> >> EINPROGRESS and
> >> >>> >> >>>>>>> EALREADY (please verify changes here, especially
> regarding
> >> >>> closing
> >> >>> >> fd)
> >> >>> >> >>>>>>> - in main.c health_check_timer_handler has been changed
> to
> >> >>> signal
> >> >>> >> >>>>>>> exit_request to health check initiated
> >> >>> >> connect_inet_domain_socket_by_port
> >> >>> >> >>>>>>> function call (please verify this, maybe there is a
> better
> >> way
> >> >>> to
> >> >>> >> check
> >> >>> >> >>>>>>> from connect_inet_domain_socket_by_port if in
> >> >>> >> health_check_timer_expired
> >> >>> >> >>>>>>> has been set to 1)
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>> These changes will practically make connect attempt to be
> >> >>> >> >>>>>>> non-blocking and repeated until:
> >> >>> >> >>>>>>> - connection is made, or
> >> >>> >> >>>>>>> - unhandled connection error condition is reached, or
> >> >>> >> >>>>>>> - health check timer alarm has been raised, or
> >> >>> >> >>>>>>> - some other exit request (shutdown) has been issued.
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>> Kind regards,
> >> >>> >> >>>>>>> Stevo.
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>>> Ok, let me clarify use cases regarding failover.
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> Currently there are three parameters:
> >> >>> >> >>>>>>>> a) health_check
> >> >>> >> >>>>>>>> b) DISALLOW_TO_FAILOVER
> >> >>> >> >>>>>>>> c) fail_over_on_backend_error
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> Source of errors which can trigger failover are 1)health
> >> check
> >> >>> >> >>>>>>>> 2)write
> >> >>> >> >>>>>>>> to backend socket 3)read backend from socket. I
> represent
> >> >>> each 1)
> >> >>> >> as
> >> >>> >> >>>>>>>> A, 2) as B, 3) as C.
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 1) trigger failover if A or B or C is error
> >> >>> >> >>>>>>>> a = on, b = off, c = on
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 2) trigger failover only when B or C is error
> >> >>> >> >>>>>>>> a = off, b = off, c = on
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 3) trigger failover only when B is error
> >> >>> >> >>>>>>>> Impossible. Because C error always triggers failover.
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 4) trigger failover only when C is error
> >> >>> >> >>>>>>>> a = off, b = off, c = off
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 5) trigger failover only when A is error(Stevo wants
> this)
> >> >>> >> >>>>>>>> Impossible. Because C error always triggers failover.
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 6) never trigger failover
> >> >>> >> >>>>>>>> Impossible. Because C error always triggers failover.
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> As you can see, C is the problem here (look at #3, #5
> and
> >> #6)
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> If we implemented this:
> >> >>> >> >>>>>>>> >> However I think we should disable failover if
> >> >>> >> >>>>>>>> DISALLOW_TO_FAILOVER set
> >> >>> >> >>>>>>>> >> in case of reading data from backend. This should
> have
> >> been
> >> >>> >> done
> >> >>> >> >>>>>>>> when
> >> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is
> >> exactly
> >> >>> >> what
> >> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you
> >> >>> think?
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 1) trigger failover if A or B or C is error
> >> >>> >> >>>>>>>> a = on, b = off, c = on
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 2) trigger failover only when B or C is error
> >> >>> >> >>>>>>>> a = off, b = off, c = on
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 3) trigger failover only when B is error
> >> >>> >> >>>>>>>> a = off, b = on, c = on
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 4) trigger failover only when C is error
> >> >>> >> >>>>>>>> a = off, b = off, c = off
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 5) trigger failover only when A is error(Stevo wants
> this)
> >> >>> >> >>>>>>>> a = on, b = on, c = off
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> 6) never trigger failover
> >> >>> >> >>>>>>>> a = off, b = on, c = off
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> So it seems my patch will solve all the problems
> including
> >> >>> yours.
> >> >>> >> >>>>>>>> (timeout while retrying is another issue of course).
> >> >>> >> >>>>>>>> --
> >> >>> >> >>>>>>>> Tatsuo Ishii
> >> >>> >> >>>>>>>> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> English: http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>> > I agree, fail_over_on_backend_error isn't useful, just
> >> adds
> >> >>> >> >>>>>>>> confusion by
> >> >>> >> >>>>>>>> > overlapping with DISALLOW_TO_FAILOVER.
> >> >>> >> >>>>>>>> >
> >> >>> >> >>>>>>>> > With your patch or without it, it is not possible to
> >> >>> failover
> >> >>> >> only
> >> >>> >> >>>>>>>> on
> >> >>> >> >>>>>>>> > health check (max retries) failure. With Nenad's
> patch,
> >> that
> >> >>> >> part
> >> >>> >> >>>>>>>> works ok
> >> >>> >> >>>>>>>> > and I think that patch is semantically ok - failover
> >> occurs
> >> >>> even
> >> >>> >> >>>>>>>> though
> >> >>> >> >>>>>>>> > DISALLOW_TO_FAILOVER is set for backend but only when
> >> health
> >> >>> >> check
> >> >>> >> >>>>>>>> is
> >> >>> >> >>>>>>>> > configured too. Configuring health check without
> >> failover on
> >> >>> >> >>>>>>>> failed health
> >> >>> >> >>>>>>>> > check has no purpose. Also health check configured
> with
> >> >>> allowed
> >> >>> >> >>>>>>>> failover on
> >> >>> >> >>>>>>>> > any condition other than health check (max retries)
> >> failure
> >> >>> has
> >> >>> >> no
> >> >>> >> >>>>>>>> purpose.
> >> >>> >> >>>>>>>> >
> >> >>> >> >>>>>>>> > Kind regards,
> >> >>> >> >>>>>>>> > Stevo.
> >> >>> >> >>>>>>>> >
> >> >>> >> >>>>>>>> > 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
> >> >>> >> >>>>>>>> >
> >> >>> >> >>>>>>>> >> fail_over_on_backend_error has different meaning from
> >> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER. From the doc:
> >> >>> >> >>>>>>>> >>
> >> >>> >> >>>>>>>> >>  If true, and an error occurs when writing to the
> >> backend
> >> >>> >> >>>>>>>> >>  communication, pgpool-II will trigger the fail over
> >> >>> procedure
> >> >>> >> .
> >> >>> >> >>>>>>>> This
> >> >>> >> >>>>>>>> >>  is the same behavior as of pgpool-II 2.2.x or
> earlier.
> >> If
> >> >>> set
> >> >>> >> to
> >> >>> >> >>>>>>>> >>  false, pgpool will report an error and disconnect
> the
> >> >>> session.
> >> >>> >> >>>>>>>> >>
> >> >>> >> >>>>>>>> >> This means that if pgpool failed to read from
> backend,
> >> it
> >> >>> will
> >> >>> >> >>>>>>>> trigger
> >> >>> >> >>>>>>>> >> failover even if fail_over_on_backend_error to off.
> So
> >> >>> >> >>>>>>>> unconditionaly
> >> >>> >> >>>>>>>> >> disabling failover will lead backward imcompatibilty.
> >> >>> >> >>>>>>>> >>
> >> >>> >> >>>>>>>> >> However I think we should disable failover if
> >> >>> >> >>>>>>>> DISALLOW_TO_FAILOVER set
> >> >>> >> >>>>>>>> >> in case of reading data from backend. This should
> have
> >> been
> >> >>> >> done
> >> >>> >> >>>>>>>> when
> >> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER was introduced because this is
> >> exactly
> >> >>> >> what
> >> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER tries to accomplish. What do you
> >> >>> think?
> >> >>> >> >>>>>>>> >> --
> >> >>> >> >>>>>>>> >> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> English: http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >>
> >> >>> >> >>>>>>>> >> > For a moment I thought we could have set
> >> >>> >> >>>>>>>> fail_over_on_backend_error to
> >> >>> >> >>>>>>>> >> off,
> >> >>> >> >>>>>>>> >> > and have backends set with ALLOW_TO_FAILOVER flag.
> But
> >> >>> then I
> >> >>> >> >>>>>>>> looked in
> >> >>> >> >>>>>>>> >> > code.
> >> >>> >> >>>>>>>> >> >
> >> >>> >> >>>>>>>> >> > In child.c there is a loop child process goes
> through
> >> in
> >> >>> its
> >> >>> >> >>>>>>>> lifetime.
> >> >>> >> >>>>>>>> >> When
> >> >>> >> >>>>>>>> >> > fatal error condition occurs before child process
> >> exits
> >> >>> it
> >> >>> >> will
> >> >>> >> >>>>>>>> call
> >> >>> >> >>>>>>>> >> > notice_backend_error which will call
> >> >>> degenerate_backend_set
> >> >>> >> >>>>>>>> which will
> >> >>> >> >>>>>>>> >> not
> >> >>> >> >>>>>>>> >> > take into account fail_over_on_backend_error is
> set to
> >> >>> off,
> >> >>> >> >>>>>>>> causing
> >> >>> >> >>>>>>>> >> backend
> >> >>> >> >>>>>>>> >> > to be degenerated and failover to occur. That's
> why we
> >> >>> have
> >> >>> >> >>>>>>>> backends set
> >> >>> >> >>>>>>>> >> > with DISALLOW_TO_FAILOVER but with our patch
> applied,
> >> >>> health
> >> >>> >> >>>>>>>> check could
> >> >>> >> >>>>>>>> >> > cause failover to occur as expected.
> >> >>> >> >>>>>>>> >> >
> >> >>> >> >>>>>>>> >> > Maybe it would be enough just to modify
> >> >>> >> degenerate_backend_set,
> >> >>> >> >>>>>>>> to take
> >> >>> >> >>>>>>>> >> > fail_over_on_backend_error into account just like
> it
> >> >>> already
> >> >>> >> >>>>>>>> takes
> >> >>> >> >>>>>>>> >> > DISALLOW_TO_FAILOVER into account.
> >> >>> >> >>>>>>>> >> >
> >> >>> >> >>>>>>>> >> > Kind regards,
> >> >>> >> >>>>>>>> >> > Stevo.
> >> >>> >> >>>>>>>> >> >
> >> >>> >> >>>>>>>> >> > 2012/1/15 Stevo Slavić <sslavic at gmail.com>
> >> >>> >> >>>>>>>> >> >
> >> >>> >> >>>>>>>> >> >> Yes and that behaviour which you describe as
> >> expected,
> >> >>> is
> >> >>> >> not
> >> >>> >> >>>>>>>> what we
> >> >>> >> >>>>>>>> >> >> want. We want pgpool to degrade backend0 and
> failover
> >> >>> when
> >> >>> >> >>>>>>>> configured
> >> >>> >> >>>>>>>> >> max
> >> >>> >> >>>>>>>> >> >> health check retries have failed, and to failover
> >> only
> >> >>> in
> >> >>> >> that
> >> >>> >> >>>>>>>> case, so
> >> >>> >> >>>>>>>> >> not
> >> >>> >> >>>>>>>> >> >> sooner e.g. connection/child error condition, but
> as
> >> >>> soon as
> >> >>> >> >>>>>>>> max health
> >> >>> >> >>>>>>>> >> >> check retries have been attempted.
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> Maybe examples will be more clear.
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> Imagine two nodes (node 1 and node 2). On each
> node a
> >> >>> single
> >> >>> >> >>>>>>>> pgpool and
> >> >>> >> >>>>>>>> >> a
> >> >>> >> >>>>>>>> >> >> single backend. Apps/clients access db through
> >> pgpool on
> >> >>> >> their
> >> >>> >> >>>>>>>> own node.
> >> >>> >> >>>>>>>> >> >> Two backends are configured in postgres native
> >> streaming
> >> >>> >> >>>>>>>> replication.
> >> >>> >> >>>>>>>> >> >> pgpools are used in raw mode. Both pgpools have
> same
> >> >>> >> backend as
> >> >>> >> >>>>>>>> >> backend0,
> >> >>> >> >>>>>>>> >> >> and same backend as backend1.
> >> >>> >> >>>>>>>> >> >> initial state: both backends are up and pgpool can
> >> >>> access
> >> >>> >> >>>>>>>> them, clients
> >> >>> >> >>>>>>>> >> >> connect to their pgpool and do their work on
> master
> >> >>> backend,
> >> >>> >> >>>>>>>> backend0.
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> 1st case: unmodified/non-patched pgpool 3.1.1 is
> >> used,
> >> >>> >> >>>>>>>> backends are
> >> >>> >> >>>>>>>> >> >> configured with ALLOW_TO_FAILOVER flag
> >> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool
> on
> >> >>> node 2
> >> >>> >> >>>>>>>> and backend0
> >> >>> >> >>>>>>>> >> >> - error condition is reported by child process,
> and
> >> >>> since
> >> >>> >> >>>>>>>> >> >> ALLOW_TO_FAILOVER is set, pgpool performs failover
> >> >>> without
> >> >>> >> >>>>>>>> giving
> >> >>> >> >>>>>>>> >> chance to
> >> >>> >> >>>>>>>> >> >> pgpool health check retries to control whether
> >> backend
> >> >>> is
> >> >>> >> just
> >> >>> >> >>>>>>>> >> temporarily
> >> >>> >> >>>>>>>> >> >> inaccessible
> >> >>> >> >>>>>>>> >> >> - failover command on node 2 promotes standby
> backend
> >> >>> to a
> >> >>> >> new
> >> >>> >> >>>>>>>> master -
> >> >>> >> >>>>>>>> >> >> split brain occurs, with two masters
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> 2nd case: unmodified/non-patched pgpool 3.1.1 is
> >> used,
> >> >>> >> >>>>>>>> backends are
> >> >>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
> >> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool
> on
> >> >>> node 2
> >> >>> >> >>>>>>>> and backend0
> >> >>> >> >>>>>>>> >> >> - error condition is reported by child process,
> and
> >> >>> since
> >> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not
> perform
> >> >>> >> failover
> >> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
> >> >>> condition,
> >> >>> >> >>>>>>>> determines
> >> >>> >> >>>>>>>> >> that
> >> >>> >> >>>>>>>> >> >> it's not accessible, there will be no health check
> >> >>> retries
> >> >>> >> >>>>>>>> because
> >> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, no failover occurs
> ever
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> 3rd case, pgpool 3.1.1 + patch you've sent
> applied,
> >> and
> >> >>> >> >>>>>>>> backends
> >> >>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
> >> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool
> on
> >> >>> node 2
> >> >>> >> >>>>>>>> and backend0
> >> >>> >> >>>>>>>> >> >> - error condition is reported by child process,
> and
> >> >>> since
> >> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not
> perform
> >> >>> >> failover
> >> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
> >> >>> condition,
> >> >>> >> >>>>>>>> determines
> >> >>> >> >>>>>>>> >> that
> >> >>> >> >>>>>>>> >> >> it's not accessible, health check retries happen,
> and
> >> >>> even
> >> >>> >> >>>>>>>> after max
> >> >>> >> >>>>>>>> >> >> retries, no failover happens since failover is
> >> >>> disallowed
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> 4th expected behaviour, pgpool 3.1.1 + patch we
> sent,
> >> >>> and
> >> >>> >> >>>>>>>> backends
> >> >>> >> >>>>>>>> >> >> configured with DISALLOW_TO_FAILOVER
> >> >>> >> >>>>>>>> >> >> - temporary network outage happens between pgpool
> on
> >> >>> node 2
> >> >>> >> >>>>>>>> and backend0
> >> >>> >> >>>>>>>> >> >> - error condition is reported by child process,
> and
> >> >>> since
> >> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not
> perform
> >> >>> >> failover
> >> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
> >> >>> condition,
> >> >>> >> >>>>>>>> determines
> >> >>> >> >>>>>>>> >> that
> >> >>> >> >>>>>>>> >> >> it's not accessible, health check retries happen,
> >> >>> before a
> >> >>> >> max
> >> >>> >> >>>>>>>> retry
> >> >>> >> >>>>>>>> >> >> network condition is cleared, retry happens, and
> >> >>> backend0
> >> >>> >> >>>>>>>> remains to be
> >> >>> >> >>>>>>>> >> >> master, no failover occurs, temporary network
> issue
> >> did
> >> >>> not
> >> >>> >> >>>>>>>> cause split
> >> >>> >> >>>>>>>> >> >> brain
> >> >>> >> >>>>>>>> >> >> - after some time, temporary network outage
> happens
> >> >>> again
> >> >>> >> >>>>>>>> between pgpool
> >> >>> >> >>>>>>>> >> >> on node 2 and backend0
> >> >>> >> >>>>>>>> >> >> - error condition is reported by child process,
> and
> >> >>> since
> >> >>> >> >>>>>>>> >> >> DISALLOW_TO_FAILOVER is set, pgpool does not
> perform
> >> >>> >> failover
> >> >>> >> >>>>>>>> >> >> - health check gets a chance to check backend0
> >> >>> condition,
> >> >>> >> >>>>>>>> determines
> >> >>> >> >>>>>>>> >> that
> >> >>> >> >>>>>>>> >> >> it's not accessible, health check retries happen,
> >> after
> >> >>> max
> >> >>> >> >>>>>>>> retries
> >> >>> >> >>>>>>>> >> >> backend0 is still not accessible, failover
> happens,
> >> >>> standby
> >> >>> >> is
> >> >>> >> >>>>>>>> new
> >> >>> >> >>>>>>>> >> master
> >> >>> >> >>>>>>>> >> >> and backend0 is degraded
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> Kind regards,
> >> >>> >> >>>>>>>> >> >> Stevo.
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >>> In my test evironment, the patch works as
> expected.
> >> I
> >> >>> have
> >> >>> >> two
> >> >>> >> >>>>>>>> >> >>> backends. Health check retry conf is as follows:
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>> health_check_max_retries = 3
> >> >>> >> >>>>>>>> >> >>> health_check_retry_delay = 1
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>> 5 09:17:20 LOG:   pid 21411: Backend status file
> >> >>> >> >>>>>>>> /home/t-ishii/work/
> >> >>> >> >>>>>>>> >> >>> git.postgresql.org/test/log/pgpool_statusdiscarded
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411: pgpool-II
> >> >>> >> successfully
> >> >>> >> >>>>>>>> started.
> >> >>> >> >>>>>>>> >> >>> version 3.2alpha1 (hatsuiboshi)
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:20 LOG:   pid 21411:
> >> >>> find_primary_node:
> >> >>> >> >>>>>>>> primary node
> >> >>> >> >>>>>>>> >> id
> >> >>> >> >>>>>>>> >> >>> is 0
> >> >>> >> >>>>>>>> >> >>> -- backend1 was shutdown
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21445:
> >> >>> >> >>>>>>>> check_replication_time_lag: could
> >> >>> >> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
> >> >>> >> >>>>>>>> sr_check_password
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> -- health check failed
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 ERROR: pid 21411: health
> check
> >> >>> failed.
> >> >>> >> 1
> >> >>> >> >>>>>>>> th host
> >> >>> >> >>>>>>>> >> /tmp
> >> >>> >> >>>>>>>> >> >>> at port 11001 is down
> >> >>> >> >>>>>>>> >> >>> -- start retrying
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:50 LOG:   pid 21411: health
> check
> >> >>> retry
> >> >>> >> >>>>>>>> sleep time: 1
> >> >>> >> >>>>>>>> >> >>> second(s)
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 ERROR: pid 21411: health
> check
> >> >>> failed.
> >> >>> >> 1
> >> >>> >> >>>>>>>> th host
> >> >>> >> >>>>>>>> >> /tmp
> >> >>> >> >>>>>>>> >> >>> at port 11001 is down
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:51 LOG:   pid 21411: health
> check
> >> >>> retry
> >> >>> >> >>>>>>>> sleep time: 1
> >> >>> >> >>>>>>>> >> >>> second(s)
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 ERROR: pid 21411: health
> check
> >> >>> failed.
> >> >>> >> 1
> >> >>> >> >>>>>>>> th host
> >> >>> >> >>>>>>>> >> /tmp
> >> >>> >> >>>>>>>> >> >>> at port 11001 is down
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:52 LOG:   pid 21411: health
> check
> >> >>> retry
> >> >>> >> >>>>>>>> sleep time: 1
> >> >>> >> >>>>>>>> >> >>> second(s)
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 ERROR: pid 21411: health
> check
> >> >>> failed.
> >> >>> >> 1
> >> >>> >> >>>>>>>> th host
> >> >>> >> >>>>>>>> >> /tmp
> >> >>> >> >>>>>>>> >> >>> at port 11001 is down
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:17:53 LOG:   pid 21411:
> health_check:
> >> 1
> >> >>> >> >>>>>>>> failover is
> >> >>> >> >>>>>>>> >> canceld
> >> >>> >> >>>>>>>> >> >>> because failover is disallowed
> >> >>> >> >>>>>>>> >> >>> -- after 3 retries, pgpool wanted to failover,
> but
> >> >>> gave up
> >> >>> >> >>>>>>>> because
> >> >>> >> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER is set for backend1
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:00 ERROR: pid 21445:
> >> >>> >> >>>>>>>> check_replication_time_lag: could
> >> >>> >> >>>>>>>> >> >>> not connect to DB node 1, check sr_check_user and
> >> >>> >> >>>>>>>> sr_check_password
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 ERROR: pid 21411: health
> check
> >> >>> failed.
> >> >>> >> 1
> >> >>> >> >>>>>>>> th host
> >> >>> >> >>>>>>>> >> /tmp
> >> >>> >> >>>>>>>> >> >>> at port 11001 is down
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:03 LOG:   pid 21411: health
> check
> >> >>> retry
> >> >>> >> >>>>>>>> sleep time: 1
> >> >>> >> >>>>>>>> >> >>> second(s)
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
> >> >>> >> >>>>>>>> >> connect_unix_domain_socket_by_port:
> >> >>> >> >>>>>>>> >> >>> connect() failed to /tmp/.s.PGSQL.11001: No such
> >> file
> >> >>> or
> >> >>> >> >>>>>>>> directory
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411:
> >> >>> >> >>>>>>>> make_persistent_db_connection:
> >> >>> >> >>>>>>>> >> >>> connection to /tmp(11001) failed
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 ERROR: pid 21411: health
> check
> >> >>> failed.
> >> >>> >> 1
> >> >>> >> >>>>>>>> th host
> >> >>> >> >>>>>>>> >> /tmp
> >> >>> >> >>>>>>>> >> >>> at port 11001 is down
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:04 LOG:   pid 21411: health
> check
> >> >>> retry
> >> >>> >> >>>>>>>> sleep time: 1
> >> >>> >> >>>>>>>> >> >>> second(s)
> >> >>> >> >>>>>>>> >> >>> 2012-01-15 09:18:05 LOG:   pid 21411: after some
> >> >>> retrying
> >> >>> >> >>>>>>>> backend
> >> >>> >> >>>>>>>> >> >>> returned to healthy state
> >> >>> >> >>>>>>>> >> >>> -- started backend1 and pgpool succeeded in
> health
> >> >>> >> checking.
> >> >>> >> >>>>>>>> Resumed
> >> >>> >> >>>>>>>> >> >>> using backend1
> >> >>> >> >>>>>>>> >> >>> --
> >> >>> >> >>>>>>>> >> >>> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> English: http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>> > Hello Tatsuo,
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> > Thank you for the patch and effort, but
> >> unfortunately
> >> >>> >> this
> >> >>> >> >>>>>>>> change
> >> >>> >> >>>>>>>> >> won't
> >> >>> >> >>>>>>>> >> >>> > work for us. We need to set disallow failover
> to
> >> >>> prevent
> >> >>> >> >>>>>>>> failover on
> >> >>> >> >>>>>>>> >> >>> child
> >> >>> >> >>>>>>>> >> >>> > reported connection errors (it's ok if few
> clients
> >> >>> lose
> >> >>> >> >>>>>>>> their
> >> >>> >> >>>>>>>> >> >>> connection or
> >> >>> >> >>>>>>>> >> >>> > can not connect), and still have pgpool perform
> >> >>> failover
> >> >>> >> >>>>>>>> but only on
> >> >>> >> >>>>>>>> >> >>> failed
> >> >>> >> >>>>>>>> >> >>> > health check (if configured, after max retries
> >> >>> threshold
> >> >>> >> >>>>>>>> has been
> >> >>> >> >>>>>>>> >> >>> reached).
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> > Maybe it would be best to add an extra value
> for
> >> >>> >> >>>>>>>> backend_flag -
> >> >>> >> >>>>>>>> >> >>> > ALLOW_TO_FAILOVER_ON_HEALTH_CHECK or
> >> >>> >> >>>>>>>> >> >>> DISALLOW_TO_FAILOVER_ON_CHILD_ERROR.
> >> >>> >> >>>>>>>> >> >>> > It should behave same as DISALLOW_TO_FAILOVER
> is
> >> set,
> >> >>> >> with
> >> >>> >> >>>>>>>> only
> >> >>> >> >>>>>>>> >> >>> difference
> >> >>> >> >>>>>>>> >> >>> > in behaviour when health check (if set, max
> >> retries)
> >> >>> has
> >> >>> >> >>>>>>>> failed -
> >> >>> >> >>>>>>>> >> unlike
> >> >>> >> >>>>>>>> >> >>> > DISALLOW_TO_FAILOVER, this new flag should
> allow
> >> >>> failover
> >> >>> >> >>>>>>>> in this
> >> >>> >> >>>>>>>> >> case
> >> >>> >> >>>>>>>> >> >>> only.
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> > Without this change health check (especially
> >> health
> >> >>> check
> >> >>> >> >>>>>>>> retries)
> >> >>> >> >>>>>>>> >> >>> doesn't
> >> >>> >> >>>>>>>> >> >>> > make much sense - child error is more likely to
> >> >>> occur on
> >> >>> >> >>>>>>>> (temporary)
> >> >>> >> >>>>>>>> >> >>> > backend failure then health check and will or
> will
> >> >>> not
> >> >>> >> cause
> >> >>> >> >>>>>>>> >> failover to
> >> >>> >> >>>>>>>> >> >>> > occur depending on backend flag, without giving
> >> >>> health
> >> >>> >> >>>>>>>> check retries
> >> >>> >> >>>>>>>> >> a
> >> >>> >> >>>>>>>> >> >>> > chance to determine if failure was temporary or
> >> not,
> >> >>> >> >>>>>>>> risking split
> >> >>> >> >>>>>>>> >> brain
> >> >>> >> >>>>>>>> >> >>> > situation with two masters just because of
> >> temporary
> >> >>> >> >>>>>>>> network link
> >> >>> >> >>>>>>>> >> >>> hiccup.
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> > Our main problem remains though with the health
> >> check
> >> >>> >> >>>>>>>> timeout not
> >> >>> >> >>>>>>>> >> being
> >> >>> >> >>>>>>>> >> >>> > respected in these special conditions we have.
> >> Maybe
> >> >>> >> Nenad
> >> >>> >> >>>>>>>> can help
> >> >>> >> >>>>>>>> >> you
> >> >>> >> >>>>>>>> >> >>> > more to reproduce the issue on your
> environment.
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> > Kind regards,
> >> >>> >> >>>>>>>> >> >>> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> > 2012/1/13 Tatsuo Ishii <ishii at postgresql.org>
> >> >>> >> >>>>>>>> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> Thanks for pointing it out.
> >> >>> >> >>>>>>>> >> >>> >> Yes, checking DISALLOW_TO_FAILOVER before
> >> retrying
> >> >>> is
> >> >>> >> >>>>>>>> wrong.
> >> >>> >> >>>>>>>> >> >>> >> However, after retry count over, we should
> check
> >> >>> >> >>>>>>>> >> DISALLOW_TO_FAILOVER I
> >> >>> >> >>>>>>>> >> >>> >> think.
> >> >>> >> >>>>>>>> >> >>> >> Attached is the patch attempt to fix it.
> Please
> >> try.
> >> >>> >> >>>>>>>> >> >>> >> --
> >> >>> >> >>>>>>>> >> >>> >> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> English: http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> > pgpool is being used in raw mode - just for
> >> >>> (health
> >> >>> >> >>>>>>>> check based)
> >> >>> >> >>>>>>>> >> >>> failover
> >> >>> >> >>>>>>>> >> >>> >> > part, so applications are not required to
> >> restart
> >> >>> when
> >> >>> >> >>>>>>>> standby
> >> >>> >> >>>>>>>> >> gets
> >> >>> >> >>>>>>>> >> >>> >> > promoted to new master. Here is pgpool.conf
> >> file
> >> >>> and a
> >> >>> >> >>>>>>>> very small
> >> >>> >> >>>>>>>> >> >>> patch
> >> >>> >> >>>>>>>> >> >>> >> > we're using applied to pgpool 3.1.1 release.
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> > We have to have DISALLOW_TO_FAILOVER set for
> >> the
> >> >>> >> backend
> >> >>> >> >>>>>>>> since any
> >> >>> >> >>>>>>>> >> >>> child
> >> >>> >> >>>>>>>> >> >>> >> > process that detects condition that
> >> >>> master/backend0 is
> >> >>> >> >>>>>>>> not
> >> >>> >> >>>>>>>> >> >>> available, if
> >> >>> >> >>>>>>>> >> >>> >> > DISALLOW_TO_FAILOVER was not set, will
> >> degenerate
> >> >>> >> >>>>>>>> backend without
> >> >>> >> >>>>>>>> >> >>> giving
> >> >>> >> >>>>>>>> >> >>> >> > health check a chance to retry. We need
> health
> >> >>> check
> >> >>> >> >>>>>>>> with retries
> >> >>> >> >>>>>>>> >> >>> because
> >> >>> >> >>>>>>>> >> >>> >> > condition that backend0 is not available
> could
> >> be
> >> >>> >> >>>>>>>> temporary
> >> >>> >> >>>>>>>> >> (network
> >> >>> >> >>>>>>>> >> >>> >> > glitches to the remote site where master
> is, or
> >> >>> >> >>>>>>>> deliberate
> >> >>> >> >>>>>>>> >> failover
> >> >>> >> >>>>>>>> >> >>> of
> >> >>> >> >>>>>>>> >> >>> >> > master postgres service from one node to the
> >> >>> other on
> >> >>> >> >>>>>>>> remote site
> >> >>> >> >>>>>>>> >> -
> >> >>> >> >>>>>>>> >> >>> in
> >> >>> >> >>>>>>>> >> >>> >> both
> >> >>> >> >>>>>>>> >> >>> >> > cases remote means remote to the pgpool
> that is
> >> >>> going
> >> >>> >> to
> >> >>> >> >>>>>>>> perform
> >> >>> >> >>>>>>>> >> >>> health
> >> >>> >> >>>>>>>> >> >>> >> > checks and ultimately the failover) and we
> >> don't
> >> >>> want
> >> >>> >> >>>>>>>> standby to
> >> >>> >> >>>>>>>> >> be
> >> >>> >> >>>>>>>> >> >>> >> > promoted as easily to a new master, to
> prevent
> >> >>> >> temporary
> >> >>> >> >>>>>>>> network
> >> >>> >> >>>>>>>> >> >>> >> conditions
> >> >>> >> >>>>>>>> >> >>> >> > which could occur frequently to frequently
> >> cause
> >> >>> split
> >> >>> >> >>>>>>>> brain with
> >> >>> >> >>>>>>>> >> two
> >> >>> >> >>>>>>>> >> >>> >> > masters.
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> > But then, with DISALLOW_TO_FAILOVER set,
> >> without
> >> >>> the
> >> >>> >> >>>>>>>> patch health
> >> >>> >> >>>>>>>> >> >>> check
> >> >>> >> >>>>>>>> >> >>> >> > will not retry and will thus give only one
> >> chance
> >> >>> to
> >> >>> >> >>>>>>>> backend (if
> >> >>> >> >>>>>>>> >> >>> health
> >> >>> >> >>>>>>>> >> >>> >> > check ever occurs before child process
> failure
> >> to
> >> >>> >> >>>>>>>> connect to the
> >> >>> >> >>>>>>>> >> >>> >> backend),
> >> >>> >> >>>>>>>> >> >>> >> > rendering retry settings effectively to be
> >> >>> ignored.
> >> >>> >> >>>>>>>> That's where
> >> >>> >> >>>>>>>> >> this
> >> >>> >> >>>>>>>> >> >>> >> patch
> >> >>> >> >>>>>>>> >> >>> >> > comes into action - enables health check
> >> retries
> >> >>> while
> >> >>> >> >>>>>>>> child
> >> >>> >> >>>>>>>> >> >>> processes
> >> >>> >> >>>>>>>> >> >>> >> are
> >> >>> >> >>>>>>>> >> >>> >> > prevented to degenerate backend.
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> > I don't think, but I could be wrong, that
> this
> >> >>> patch
> >> >>> >> >>>>>>>> influences
> >> >>> >> >>>>>>>> >> the
> >> >>> >> >>>>>>>> >> >>> >> > behavior we're seeing with unwanted health
> >> check
> >> >>> >> attempt
> >> >>> >> >>>>>>>> delays.
> >> >>> >> >>>>>>>> >> >>> Also,
> >> >>> >> >>>>>>>> >> >>> >> > knowing this, maybe pgpool could be patched
> or
> >> >>> some
> >> >>> >> >>>>>>>> other support
> >> >>> >> >>>>>>>> >> be
> >> >>> >> >>>>>>>> >> >>> >> built
> >> >>> >> >>>>>>>> >> >>> >> > into it to cover this use case.
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> > Regards,
> >> >>> >> >>>>>>>> >> >>> >> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> > 2012/1/12 Tatsuo Ishii <
> ishii at postgresql.org>
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> I have accepted the moderation request.
> Your
> >> post
> >> >>> >> >>>>>>>> should be sent
> >> >>> >> >>>>>>>> >> >>> >> shortly.
> >> >>> >> >>>>>>>> >> >>> >> >> Also I have raised the post size limit to
> 1MB.
> >> >>> >> >>>>>>>> >> >>> >> >> I will look into this...
> >> >>> >> >>>>>>>> >> >>> >> >> --
> >> >>> >> >>>>>>>> >> >>> >> >> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> >> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> >> English:
> http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> >> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> > Here is the log file and strace output
> file
> >> >>> (this
> >> >>> >> >>>>>>>> time in an
> >> >>> >> >>>>>>>> >> >>> archive,
> >> >>> >> >>>>>>>> >> >>> >> >> > didn't know about 200KB constraint on
> post
> >> size
> >> >>> >> which
> >> >>> >> >>>>>>>> requires
> >> >>> >> >>>>>>>> >> >>> >> moderator
> >> >>> >> >>>>>>>> >> >>> >> >> > approval). Timings configured are 30sec
> >> health
> >> >>> >> check
> >> >>> >> >>>>>>>> interval,
> >> >>> >> >>>>>>>> >> >>> 5sec
> >> >>> >> >>>>>>>> >> >>> >> >> > timeout, and 2 retries with 10sec retry
> >> delay.
> >> >>> >> >>>>>>>> >> >>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> > It takes a lot more than 5sec from
> started
> >> >>> health
> >> >>> >> >>>>>>>> check to
> >> >>> >> >>>>>>>> >> >>> sleeping
> >> >>> >> >>>>>>>> >> >>> >> 10sec
> >> >>> >> >>>>>>>> >> >>> >> >> > for first retry.
> >> >>> >> >>>>>>>> >> >>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> > Seen in code (main.x, health_check()
> >> function),
> >> >>> >> >>>>>>>> within (retry)
> >> >>> >> >>>>>>>> >> >>> attempt
> >> >>> >> >>>>>>>> >> >>> >> >> > there is inner retry (first with postgres
> >> >>> database
> >> >>> >> >>>>>>>> then with
> >> >>> >> >>>>>>>> >> >>> >> template1)
> >> >>> >> >>>>>>>> >> >>> >> >> and
> >> >>> >> >>>>>>>> >> >>> >> >> > that part doesn't seem to be interrupted
> by
> >> >>> alarm.
> >> >>> >> >>>>>>>> >> >>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> > Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> > 2012/1/12 Stevo Slavić <
> sslavic at gmail.com>
> >> >>> >> >>>>>>>> >> >>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >> Here is the log file and strace output
> >> file.
> >> >>> >> Timings
> >> >>> >> >>>>>>>> >> configured
> >> >>> >> >>>>>>>> >> >>> are
> >> >>> >> >>>>>>>> >> >>> >> >> 30sec
> >> >>> >> >>>>>>>> >> >>> >> >> >> health check interval, 5sec timeout,
> and 2
> >> >>> retries
> >> >>> >> >>>>>>>> with 10sec
> >> >>> >> >>>>>>>> >> >>> retry
> >> >>> >> >>>>>>>> >> >>> >> >> delay.
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >> It takes a lot more than 5sec from
> started
> >> >>> health
> >> >>> >> >>>>>>>> check to
> >> >>> >> >>>>>>>> >> >>> sleeping
> >> >>> >> >>>>>>>> >> >>> >> >> 10sec
> >> >>> >> >>>>>>>> >> >>> >> >> >> for first retry.
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >> Seen in code (main.x, health_check()
> >> >>> function),
> >> >>> >> >>>>>>>> within (retry)
> >> >>> >> >>>>>>>> >> >>> >> attempt
> >> >>> >> >>>>>>>> >> >>> >> >> >> there is inner retry (first with
> postgres
> >> >>> database
> >> >>> >> >>>>>>>> then with
> >> >>> >> >>>>>>>> >> >>> >> template1)
> >> >>> >> >>>>>>>> >> >>> >> >> and
> >> >>> >> >>>>>>>> >> >>> >> >> >> that part doesn't seem to be
> interrupted by
> >> >>> alarm.
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >> Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >> Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >> 2012/1/11 Tatsuo Ishii <
> >> ishii at postgresql.org>
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> Ok, I will do it. In the mean time you
> >> could
> >> >>> use
> >> >>> >> >>>>>>>> "strace -tt
> >> >>> >> >>>>>>>> >> -p
> >> >>> >> >>>>>>>> >> >>> PID"
> >> >>> >> >>>>>>>> >> >>> >> >> >>> to see which system call is blocked.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> --
> >> >>> >> >>>>>>>> >> >>> >> >> >>> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> >> >>> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> >> >>> English:
> >> >>> http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> >> >>> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >> >> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > OK, got the info - key point is that
> ip
> >> >>> >> >>>>>>>> forwarding is
> >> >>> >> >>>>>>>> >> >>> disabled for
> >> >>> >> >>>>>>>> >> >>> >> >> >>> security
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > reasons. Rules in iptables are not
> >> >>> important,
> >> >>> >> >>>>>>>> iptables can
> >> >>> >> >>>>>>>> >> be
> >> >>> >> >>>>>>>> >> >>> >> >> stopped,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> or
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > previously added rules removed.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > Here are the steps to reproduce
> (kudos
> >> to
> >> >>> my
> >> >>> >> >>>>>>>> colleague
> >> >>> >> >>>>>>>> >> Nenad
> >> >>> >> >>>>>>>> >> >>> >> >> Bulatovic
> >> >>> >> >>>>>>>> >> >>> >> >> >>> for
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > providing this):
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 1.) make sure that ip forwarding is
> off:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >     echo 0 >
> >> /proc/sys/net/ipv4/ip_forward
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2.) create IP alias on some interface
> >> (and
> >> >>> have
> >> >>> >> >>>>>>>> postgres
> >> >>> >> >>>>>>>> >> >>> listen on
> >> >>> >> >>>>>>>> >> >>> >> >> it):
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >     ip addr add x.x.x.x/yy dev ethz
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 3.) set backend_hostname0 to
> >> >>> aforementioned IP
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 4.) start pgpool and monitor health
> >> checks
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 5.) remove IP alias:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >     ip addr del x.x.x.x/yy dev ethz
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > Here is the interesting part in
> pgpool
> >> log
> >> >>> >> after
> >> >>> >> >>>>>>>> this:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358:
> >> >>> starting
> >> >>> >> >>>>>>>> health
> >> >>> >> >>>>>>>> >> checking
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358:
> >> >>> >> >>>>>>>> health_check: 0 th DB
> >> >>> >> >>>>>>>> >> >>> node
> >> >>> >> >>>>>>>> >> >>> >> >> >>> status: 2
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358:
> >> >>> >> >>>>>>>> health_check: 1 th DB
> >> >>> >> >>>>>>>> >> >>> node
> >> >>> >> >>>>>>>> >> >>> >> >> >>> status: 1
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358:
> >> >>> starting
> >> >>> >> >>>>>>>> health
> >> >>> >> >>>>>>>> >> checking
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358:
> >> >>> >> >>>>>>>> health_check: 0 th DB
> >> >>> >> >>>>>>>> >> >>> node
> >> >>> >> >>>>>>>> >> >>> >> >> >>> status: 2
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358:
> >> >>> >> >>>>>>>> health_check: 0 th DB
> >> >>> >> >>>>>>>> >> >>> node
> >> >>> >> >>>>>>>> >> >>> >> >> >>> status: 2
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 ERROR: pid 24358:
> >> >>> health
> >> >>> >> >>>>>>>> check failed.
> >> >>> >> >>>>>>>> >> 0
> >> >>> >> >>>>>>>> >> >>> th
> >> >>> >> >>>>>>>> >> >>> >> host
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 192.168.2.27 at port 5432 is down
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > 2012-01-11 17:41:46 LOG:   pid 24358:
> >> >>> health
> >> >>> >> >>>>>>>> check retry
> >> >>> >> >>>>>>>> >> sleep
> >> >>> >> >>>>>>>> >> >>> >> time:
> >> >>> >> >>>>>>>> >> >>> >> >> 10
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > second(s)
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > That pgpool was configured with
> health
> >> >>> check
> >> >>> >> >>>>>>>> interval of
> >> >>> >> >>>>>>>> >> >>> 30sec,
> >> >>> >> >>>>>>>> >> >>> >> 5sec
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > timeout, and 10sec retry delay with 2
> >> max
> >> >>> >> retries.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > Making use of libpq instead for
> >> connecting
> >> >>> to
> >> >>> >> db
> >> >>> >> >>>>>>>> in health
> >> >>> >> >>>>>>>> >> >>> checks
> >> >>> >> >>>>>>>> >> >>> >> IMO
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > should resolve it, but you'll best
> >> >>> determine
> >> >>> >> >>>>>>>> which call
> >> >>> >> >>>>>>>> >> >>> exactly
> >> >>> >> >>>>>>>> >> >>> >> gets
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > blocked waiting. Btw, psql with
> >> >>> >> PGCONNECT_TIMEOUT
> >> >>> >> >>>>>>>> env var
> >> >>> >> >>>>>>>> >> >>> >> configured
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > respects that env var timeout.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> > On Wed, Jan 11, 2012 at 11:15 AM,
> Stevo
> >> >>> Slavić
> >> >>> >> <
> >> >>> >> >>>>>>>> >> >>> sslavic at gmail.com
> >> >>> >> >>>>>>>> >> >>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> wrote:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >> Tatsuo,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >> Did you restart iptables after
> adding
> >> >>> rule?
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >> Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >> Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >> On Wed, Jan 11, 2012 at 11:12 AM,
> Stevo
> >> >>> >> Slavić <
> >> >>> >> >>>>>>>> >> >>> >> sslavic at gmail.com>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> wrote:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> Looking into this to verify if
> these
> >> are
> >> >>> all
> >> >>> >> >>>>>>>> necessary
> >> >>> >> >>>>>>>> >> >>> changes
> >> >>> >> >>>>>>>> >> >>> >> to
> >> >>> >> >>>>>>>> >> >>> >> >> have
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> port unreachable message silently
> >> >>> rejected
> >> >>> >> >>>>>>>> (suspecting
> >> >>> >> >>>>>>>> >> some
> >> >>> >> >>>>>>>> >> >>> >> kernel
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> parameter tuning is needed).
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> Just to clarify it's not a problem
> >> that
> >> >>> host
> >> >>> >> is
> >> >>> >> >>>>>>>> being
> >> >>> >> >>>>>>>> >> >>> detected
> >> >>> >> >>>>>>>> >> >>> >> by
> >> >>> >> >>>>>>>> >> >>> >> >> >>> pgpool
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> to be down, but the timing when
> that
> >> >>> >> happens. On
> >> >>> >> >>>>>>>> >> environment
> >> >>> >> >>>>>>>> >> >>> >> where
> >> >>> >> >>>>>>>> >> >>> >> >> >>> issue is
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> reproduced pgpool as part of health
> >> check
> >> >>> >> >>>>>>>> attempt tries
> >> >>> >> >>>>>>>> >> to
> >> >>> >> >>>>>>>> >> >>> >> connect
> >> >>> >> >>>>>>>> >> >>> >> >> to
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> backend and hangs for tcp timeout
> >> >>> instead of
> >> >>> >> >>>>>>>> being
> >> >>> >> >>>>>>>> >> >>> interrupted
> >> >>> >> >>>>>>>> >> >>> >> by
> >> >>> >> >>>>>>>> >> >>> >> >> >>> timeout
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> alarm. Can you verify/confirm
> please
> >> the
> >> >>> >> health
> >> >>> >> >>>>>>>> check
> >> >>> >> >>>>>>>> >> retry
> >> >>> >> >>>>>>>> >> >>> >> timings
> >> >>> >> >>>>>>>> >> >>> >> >> >>> are not
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> delayed?
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM,
> >> Tatsuo
> >> >>> >> Ishii <
> >> >>> >> >>>>>>>> >> >>> >> >> ishii at postgresql.org
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >wrote:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> Ok, I did:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> # iptables -A FORWARD -j REJECT
> >> >>> >> --reject-with
> >> >>> >> >>>>>>>> >> >>> >> >> icmp-port-unreachable
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> on the host where pgpoo is
> running.
> >> And
> >> >>> pull
> >> >>> >> >>>>>>>> network
> >> >>> >> >>>>>>>> >> cable
> >> >>> >> >>>>>>>> >> >>> from
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> backend0 host network interface.
> >> Pgpool
> >> >>> >> >>>>>>>> detected the
> >> >>> >> >>>>>>>> >> host
> >> >>> >> >>>>>>>> >> >>> being
> >> >>> >> >>>>>>>> >> >>> >> >> down
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> as expected...
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> --
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> English:
> >> >>> >> http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> Japanese: http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> > Backend is not destination of
> this
> >> >>> >> message,
> >> >>> >> >>>>>>>> pgpool
> >> >>> >> >>>>>>>> >> host
> >> >>> >> >>>>>>>> >> >>> is,
> >> >>> >> >>>>>>>> >> >>> >> and
> >> >>> >> >>>>>>>> >> >>> >> >> we
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> don't
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> > want it to ever get it. With
> >> command
> >> >>> I've
> >> >>> >> >>>>>>>> sent you
> >> >>> >> >>>>>>>> >> rule
> >> >>> >> >>>>>>>> >> >>> will
> >> >>> >> >>>>>>>> >> >>> >> be
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> created for
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> > any source and destination.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> > Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> > On Wed, Jan 11, 2012 at 10:38
> AM,
> >> >>> Tatsuo
> >> >>> >> >>>>>>>> Ishii <
> >> >>> >> >>>>>>>> >> >>> >> >> >>> ishii at postgresql.org>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> wrote:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> I did following:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Do following on the host where
> >> >>> pgpool is
> >> >>> >> >>>>>>>> running on:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> # iptables -A FORWARD -j REJECT
> >> >>> >> >>>>>>>> --reject-with
> >> >>> >> >>>>>>>> >> >>> >> >> >>> icmp-port-unreachable -d
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> 133.137.177.124
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> (133.137.177.124 is the host
> where
> >> >>> >> backend
> >> >>> >> >>>>>>>> is running
> >> >>> >> >>>>>>>> >> >>> on)
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Pull network cable from
> backend0
> >> host
> >> >>> >> >>>>>>>> network
> >> >>> >> >>>>>>>> >> interface.
> >> >>> >> >>>>>>>> >> >>> >> Pgpool
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> detected the host being down as
> >> >>> expected.
> >> >>> >> >>>>>>>> Am I
> >> >>> >> >>>>>>>> >> missing
> >> >>> >> >>>>>>>> >> >>> >> >> something?
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> --
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> English:
> >> >>> >> >>>>>>>> http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> Japanese:
> http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > Hello Tatsuo,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > With backend0 on one host
> just
> >> >>> >> configure
> >> >>> >> >>>>>>>> following
> >> >>> >> >>>>>>>> >> >>> rule on
> >> >>> >> >>>>>>>> >> >>> >> >> other
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> host
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> where
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > pgpool is:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > iptables -A FORWARD -j REJECT
> >> >>> >> >>>>>>>> --reject-with
> >> >>> >> >>>>>>>> >> >>> >> >> >>> icmp-port-unreachable
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > and then have pgpool startup
> >> with
> >> >>> >> health
> >> >>> >> >>>>>>>> checking
> >> >>> >> >>>>>>>> >> and
> >> >>> >> >>>>>>>> >> >>> >> >> retrying
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> configured,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > and then pull network cable
> from
> >> >>> >> backend0
> >> >>> >> >>>>>>>> host
> >> >>> >> >>>>>>>> >> network
> >> >>> >> >>>>>>>> >> >>> >> >> >>> interface.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> > On Wed, Jan 11, 2012 at 6:27
> AM,
> >> >>> Tatsuo
> >> >>> >> >>>>>>>> Ishii <
> >> >>> >> >>>>>>>> >> >>> >> >> >>> ishii at postgresql.org
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> wrote:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> I want to try to test the
> >> >>> situation
> >> >>> >> you
> >> >>> >> >>>>>>>> descrived:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > When system is
> configured
> >> for
> >> >>> >> >>>>>>>> security
> >> >>> >> >>>>>>>> >> reasons
> >> >>> >> >>>>>>>> >> >>> not
> >> >>> >> >>>>>>>> >> >>> >> to
> >> >>> >> >>>>>>>> >> >>> >> >> >>> return
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> destination
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > host unreachable
> messages,
> >> >>> even
> >> >>> >> >>>>>>>> though
> >> >>> >> >>>>>>>> >> >>> >> >> >>> health_check_timeout is
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> But I don't know how to do
> it.
> >> I
> >> >>> >> pulled
> >> >>> >> >>>>>>>> out the
> >> >>> >> >>>>>>>> >> >>> network
> >> >>> >> >>>>>>>> >> >>> >> >> cable
> >> >>> >> >>>>>>>> >> >>> >> >> >>> and
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> pgpool detected it as
> expected.
> >> >>> Also I
> >> >>> >> >>>>>>>> configured
> >> >>> >> >>>>>>>> >> the
> >> >>> >> >>>>>>>> >> >>> >> server
> >> >>> >> >>>>>>>> >> >>> >> >> >>> which
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> PostgreSQL is running on to
> >> >>> disable
> >> >>> >> the
> >> >>> >> >>>>>>>> 5432
> >> >>> >> >>>>>>>> >> port. In
> >> >>> >> >>>>>>>> >> >>> >> this
> >> >>> >> >>>>>>>> >> >>> >> >> case
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> connect(2) returned
> >> EHOSTUNREACH
> >> >>> (No
> >> >>> >> >>>>>>>> route to
> >> >>> >> >>>>>>>> >> host)
> >> >>> >> >>>>>>>> >> >>> so
> >> >>> >> >>>>>>>> >> >>> >> >> pgpool
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> detected
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> the error as expected.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> Could you please instruct
> me?
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> --
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> English:
> >> >>> >> >>>>>>>> http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> Japanese:
> >> http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Hello Tatsuo,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Thank you for replying!
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > I'm not sure what exactly
> is
> >> >>> >> blocking,
> >> >>> >> >>>>>>>> just by
> >> >>> >> >>>>>>>> >> >>> pgpool
> >> >>> >> >>>>>>>> >> >>> >> code
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> analysis I
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > suspect it is the part
> where
> >> a
> >> >>> >> >>>>>>>> connection is
> >> >>> >> >>>>>>>> >> made
> >> >>> >> >>>>>>>> >> >>> to
> >> >>> >> >>>>>>>> >> >>> >> the
> >> >>> >> >>>>>>>> >> >>> >> >> db
> >> >>> >> >>>>>>>> >> >>> >> >> >>> and
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> it
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> doesn't
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > seem to get interrupted by
> >> >>> alarm.
> >> >>> >> >>>>>>>> Tested
> >> >>> >> >>>>>>>> >> thoroughly
> >> >>> >> >>>>>>>> >> >>> >> health
> >> >>> >> >>>>>>>> >> >>> >> >> >>> check
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> behaviour,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > it works really well when
> >> >>> host/ip is
> >> >>> >> >>>>>>>> there and
> >> >>> >> >>>>>>>> >> just
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> backend/postgres
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> is
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > down, but not when backend
> >> >>> host/ip
> >> >>> >> is
> >> >>> >> >>>>>>>> down. I
> >> >>> >> >>>>>>>> >> could
> >> >>> >> >>>>>>>> >> >>> >> see in
> >> >>> >> >>>>>>>> >> >>> >> >> >>> log
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> that
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> initial
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > health check and each
> retry
> >> got
> >> >>> >> >>>>>>>> delayed when
> >> >>> >> >>>>>>>> >> >>> host/ip is
> >> >>> >> >>>>>>>> >> >>> >> >> not
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> reachable,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > while when just backend is
> >> not
> >> >>> >> >>>>>>>> listening (is
> >> >>> >> >>>>>>>> >> down)
> >> >>> >> >>>>>>>> >> >>> on
> >> >>> >> >>>>>>>> >> >>> >> the
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> reachable
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> host/ip
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > then initial health check
> and
> >> >>> all
> >> >>> >> >>>>>>>> retries are
> >> >>> >> >>>>>>>> >> >>> exact to
> >> >>> >> >>>>>>>> >> >>> >> the
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> settings in
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > pgpool.conf.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > PGCONNECT_TIMEOUT is
> listed
> >> as
> >> >>> one
> >> >>> >> of
> >> >>> >> >>>>>>>> the libpq
> >> >>> >> >>>>>>>> >> >>> >> >> environment
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> variables
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> in
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > the docs (see
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >>
> >> >>> >> >>>>>>>>
> >> http://www.postgresql.org/docs/9.1/static/libpq-envars.html)
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > There is equivalent
> >> parameter in
> >> >>> >> libpq
> >> >>> >> >>>>>>>> >> >>> >> PGconnectdbParams (
> >> >>> >> >>>>>>>> >> >>> >> >> >>> see
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>>
> >> >>> >> >>>>>>>> >> >>> >> >>
> >> >>> >> >>>>>>>> >> >>> >>
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >>
> >> >>> >> >>>>>>>>
> >> >>> >>
> >> >>>
> >>
> http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> )
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > At the beginning of that
> same
> >> >>> page
> >> >>> >> >>>>>>>> there are
> >> >>> >> >>>>>>>> >> some
> >> >>> >> >>>>>>>> >> >>> >> >> important
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> infos on
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> using
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > these functions.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > psql respects
> >> PGCONNECT_TIMEOUT.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Regards,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > Stevo.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> > On Wed, Jan 11, 2012 at
> 12:13
> >> >>> AM,
> >> >>> >> >>>>>>>> Tatsuo Ishii <
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> ishii at postgresql.org>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> wrote:
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > Hello pgpool community,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> >
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > When system is
> configured
> >> for
> >> >>> >> >>>>>>>> security
> >> >>> >> >>>>>>>> >> reasons
> >> >>> >> >>>>>>>> >> >>> not
> >> >>> >> >>>>>>>> >> >>> >> to
> >> >>> >> >>>>>>>> >> >>> >> >> >>> return
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> destination
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > host unreachable
> messages,
> >> >>> even
> >> >>> >> >>>>>>>> though
> >> >>> >> >>>>>>>> >> >>> >> >> >>> health_check_timeout is
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> configured,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > socket call will block
> and
> >> >>> alarm
> >> >>> >> >>>>>>>> will not get
> >> >>> >> >>>>>>>> >> >>> raised
> >> >>> >> >>>>>>>> >> >>> >> >> >>> until TCP
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> timeout
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > occurs.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> Interesting. So are you
> >> saying
> >> >>> that
> >> >>> >> >>>>>>>> read(2)
> >> >>> >> >>>>>>>> >> >>> cannot be
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> interrupted by
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> alarm signal if the
> system
> >> is
> >> >>> >> >>>>>>>> configured not to
> >> >>> >> >>>>>>>> >> >>> return
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> destination
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> host unreachable message?
> >> >>> Could you
> >> >>> >> >>>>>>>> please
> >> >>> >> >>>>>>>> >> guide
> >> >>> >> >>>>>>>> >> >>> me
> >> >>> >> >>>>>>>> >> >>> >> >> where I
> >> >>> >> >>>>>>>> >> >>> >> >> >>> can
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> get
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> such that info? (I'm not
> a
> >> >>> network
> >> >>> >> >>>>>>>> expert).
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > Not a C programmer,
> found
> >> >>> some
> >> >>> >> info
> >> >>> >> >>>>>>>> that
> >> >>> >> >>>>>>>> >> select
> >> >>> >> >>>>>>>> >> >>> call
> >> >>> >> >>>>>>>> >> >>> >> >> >>> could be
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> replace
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> with
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > select/pselect calls.
> >> Maybe
> >> >>> it
> >> >>> >> >>>>>>>> would be best
> >> >>> >> >>>>>>>> >> if
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> PGCONNECT_TIMEOUT
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> value
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > could be used here for
> >> >>> connection
> >> >>> >> >>>>>>>> timeout.
> >> >>> >> >>>>>>>> >> >>> pgpool
> >> >>> >> >>>>>>>> >> >>> >> has
> >> >>> >> >>>>>>>> >> >>> >> >> >>> libpq as
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> dependency,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > why isn't it using
> libpq
> >> for
> >> >>> the
> >> >>> >> >>>>>>>> healthcheck
> >> >>> >> >>>>>>>> >> db
> >> >>> >> >>>>>>>> >> >>> >> connect
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> calls, then
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> > PGCONNECT_TIMEOUT
> would be
> >> >>> >> applied?
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> I don't think libpq uses
> >> >>> >> >>>>>>>> select/pselect for
> >> >>> >> >>>>>>>> >> >>> >> establishing
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> connection,
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> but using libpq instead
> of
> >> >>> homebrew
> >> >>> >> >>>>>>>> code seems
> >> >>> >> >>>>>>>> >> to
> >> >>> >> >>>>>>>> >> >>> be
> >> >>> >> >>>>>>>> >> >>> >> an
> >> >>> >> >>>>>>>> >> >>> >> >> >>> idea.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> Let me
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> think about it.
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> One question. Are you
> sure
> >> that
> >> >>> >> libpq
> >> >>> >> >>>>>>>> can deal
> >> >>> >> >>>>>>>> >> >>> with
> >> >>> >> >>>>>>>> >> >>> >> the
> >> >>> >> >>>>>>>> >> >>> >> >> case
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> (not to
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> return destination host
> >> >>> unreachable
> >> >>> >> >>>>>>>> messages)
> >> >>> >> >>>>>>>> >> by
> >> >>> >> >>>>>>>> >> >>> using
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> PGCONNECT_TIMEOUT?
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> --
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> Tatsuo Ishii
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> SRA OSS, Inc. Japan
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> English:
> >> >>> >> >>>>>>>> http://www.sraoss.co.jp/index_en.php
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >> Japanese:
> >> >>> http://www.sraoss.co.jp
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>>
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >> >>
> >> >>> >> >>>>>>>> >> >>> >> >>
> >> >>> >> >>>>>>>> >> >>> >>
> >> >>> >> >>>>>>>> >> >>>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >> >>
> >> >>> >> >>>>>>>> >>
> >> >>> >> >>>>>>>>
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>>
> >> >>> >> >>>>>>
> >> >>> >> >>>>>
> >> >>> >> >>>>
> >> >>> >> >>>
> >> >>> >> >>
> >> >>> >>
> >> >>>
> >> >>
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20120224/41d1f8be/attachment.htm>