[pgpool-general: 159] Re: Healthcheck timeout not always respected

Tatsuo Ishii ishii at postgresql.org
Sun Jan 15 17:24:37 JST 2012


fail_over_on_backend_error has a different meaning from
DISALLOW_TO_FAILOVER. From the doc:

  If true, and an error occurs when writing to the backend
  communication, pgpool-II will trigger the fail over procedure. This
  is the same behavior as of pgpool-II 2.2.x or earlier. If set to
  false, pgpool will report an error and disconnect the session.

This means that if pgpool fails to read from a backend, it will trigger
failover even if fail_over_on_backend_error is off. So unconditionally
disabling failover would introduce a backward incompatibility.

However, I think we should disable failover when DISALLOW_TO_FAILOVER is
set and the error occurs while reading data from the backend. This should
have been done when DISALLOW_TO_FAILOVER was introduced, because this is
exactly what DISALLOW_TO_FAILOVER tries to accomplish. What do you think?
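
To illustrate the idea, a rough sketch (the identifiers below only loosely
mirror the pgpool source and may differ from the actual code):

  /* on a read error from the backend, consult the per-backend flag
   * before degenerating the node */
  if (POOL_DISALLOW_TO_FAILOVER(BACKEND_INFO(node_id).flag))
  {
      pool_log("reading from backend %d failed, but failover is disallowed",
               node_id);
      /* report the error to the client and disconnect the session instead */
  }
  else
  {
      notice_backend_error(node_id);  /* leads to degenerate_backend_set() */
  }
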
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> For a moment I thought we could have set fail_over_on_backend_error to off,
> and have backends set with ALLOW_TO_FAILOVER flag. But then I looked in
> code.
> 
> In child.c there is a loop that the child process goes through in its
> lifetime. When a fatal error condition occurs, before the child process
> exits it calls notice_backend_error, which calls degenerate_backend_set,
> which does not take into account that fail_over_on_backend_error is set to
> off, causing the backend to be degenerated and failover to occur. That's
> why we have backends set with DISALLOW_TO_FAILOVER; but with our patch
> applied, health check could still cause failover to occur, as expected.
> 
> Maybe it would be enough just to modify degenerate_backend_set, to take
> fail_over_on_backend_error into account just like it already takes
> DISALLOW_TO_FAILOVER into account.
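> 
> Roughly along these lines; just a sketch, and the real function signature
> and surrounding code may well differ:
> 
>   void degenerate_backend_set(int *node_id_set, int count)
>   {
>       /* hypothetical guard: make child-reported backend errors respect
>        * fail_over_on_backend_error (note that health-check-triggered
>        * failover presumably reaches this function too, so a blanket
>        * guard here would affect that path as well) */
>       if (!pool_config->fail_over_on_backend_error)
>       {
>           pool_log("degenerate_backend_set: failover on backend error is off");
>           return;
>       }
> 
>       /* existing per-node handling, including the DISALLOW_TO_FAILOVER
>        * check, continues here */
>   }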
> 
> Kind regards,
> Stevo.
> 
> 2012/1/15 Stevo Slavić <sslavic at gmail.com>
> 
>> Yes, and that behaviour which you describe as expected is not what we
>> want. We want pgpool to degrade backend0 and fail over when the configured
>> max health check retries have failed, and to fail over only in that case:
>> not sooner (e.g. on a connection/child error condition), but as soon as
>> the max health check retries have been attempted.
>>
>> Maybe examples will be more clear.
>>
>> Imagine two nodes (node 1 and node 2). On each node there is a single
>> pgpool and a single backend. Apps/clients access the db through the pgpool
>> on their own node. The two backends are configured with postgres native
>> streaming replication. The pgpools are used in raw mode. Both pgpools have
>> the same backend as backend0, and the same backend as backend1.
>> Initial state: both backends are up and pgpool can access them; clients
>> connect to their pgpool and do their work on the master backend, backend0.
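>>
>> For illustration, the relevant pgpool.conf fragment on each node might look
>> roughly like this (hostnames and values are only examples):
>>
>>   # raw mode: replication_mode, master_slave_mode and load_balance_mode off
>>   backend_hostname0 = 'db-master.example.com'    # current master
>>   backend_port0 = 5432
>>   backend_flag0 = 'DISALLOW_TO_FAILOVER'
>>   backend_hostname1 = 'db-standby.example.com'   # streaming replication standby
>>   backend_port1 = 5432
>>   backend_flag1 = 'DISALLOW_TO_FAILOVER'
>>   health_check_period = 30
>>   health_check_timeout = 5
>>   health_check_max_retries = 2
>>   health_check_retry_delay = 10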
>>
>> 1st case: unmodified/non-patched pgpool 3.1.1 is used, backends are
>> configured with ALLOW_TO_FAILOVER flag
>> - temporary network outage happens between pgpool on node 2 and backend0
>> - an error condition is reported by a child process, and since
>> ALLOW_TO_FAILOVER is set, pgpool performs failover without giving the
>> health check retries a chance to determine whether the backend is just
>> temporarily inaccessible
>> - the failover command on node 2 promotes the standby backend to a new
>> master; split brain occurs, with two masters
>>
>>
>> 2nd case: unmodified/non-patched pgpool 3.1.1 is used, backends are
>> configured with DISALLOW_TO_FAILOVER
>> - temporary network outage happens between pgpool on node 2 and backend0
>> - error condition is reported by child process, and since
>> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>> - the health check gets a chance to check backend0's condition and
>> determines that it's not accessible; there will be no health check retries
>> because DISALLOW_TO_FAILOVER is set, and no failover ever occurs
>>
>>
>> 3rd case, pgpool 3.1.1 + patch you've sent applied, and backends
>> configured with DISALLOW_TO_FAILOVER
>> - temporary network outage happens between pgpool on node 2 and backend0
>> - error condition is reported by child process, and since
>> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>> - the health check gets a chance to check backend0's condition and
>> determines that it's not accessible; health check retries happen, and even
>> after max retries no failover happens, since failover is disallowed
>>
>>
>> 4th case (expected behaviour): pgpool 3.1.1 + the patch we sent, backends
>> configured with DISALLOW_TO_FAILOVER
>> - temporary network outage happens between pgpool on node 2 and backend0
>> - error condition is reported by child process, and since
>> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>> - the health check gets a chance to check backend0's condition and
>> determines that it's not accessible; health check retries happen; before
>> the max retry is reached the network condition clears, a retry succeeds,
>> and backend0 remains master; no failover occurs, and the temporary network
>> issue did not cause split brain
>> - after some time, temporary network outage happens again between pgpool
>> on node 2 and backend0
>> - error condition is reported by child process, and since
>> DISALLOW_TO_FAILOVER is set, pgpool does not perform failover
>> - the health check gets a chance to check backend0's condition and
>> determines that it's not accessible; health check retries happen; after
>> max retries backend0 is still not accessible, so failover happens: the
>> standby is the new master and backend0 is degraded
>>
>> Kind regards,
>> Stevo.
>>
>>
>> 2012/1/15 Tatsuo Ishii <ishii at postgresql.org>
>>
>>> In my test environment, the patch works as expected. I have two
>>> backends. Health check retry conf is as follows:
>>>
>>> health_check_max_retries = 3
>>> health_check_retry_delay = 1
>>>
>>> 2012-01-15 09:17:20 LOG:   pid 21411: Backend status file /home/t-ishii/work/
>>> git.postgresql.org/test/log/pgpool_status discarded
>>> 2012-01-15 09:17:20 LOG:   pid 21411: pgpool-II successfully started.
>>> version 3.2alpha1 (hatsuiboshi)
>>> 2012-01-15 09:17:20 LOG:   pid 21411: find_primary_node: primary node id
>>> is 0
>>> -- backend1 was shutdown
>>>
>>> 2012-01-15 09:17:50 ERROR: pid 21445: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:17:50 ERROR: pid 21445: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:17:50 ERROR: pid 21445: check_replication_time_lag: could
>>> not connect to DB node 1, check sr_check_user and sr_check_password
>>> 2012-01-15 09:17:50 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:17:50 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:17:50 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:17:50 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> -- health check failed
>>>
>>> 2012-01-15 09:17:50 ERROR: pid 21411: health check failed. 1 th host /tmp
>>> at port 11001 is down
>>> -- start retrying
>>> 2012-01-15 09:17:50 LOG:   pid 21411: health check retry sleep time: 1
>>> second(s)
>>> 2012-01-15 09:17:51 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:17:51 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:17:51 ERROR: pid 21411: health check failed. 1 th host /tmp
>>> at port 11001 is down
>>> 2012-01-15 09:17:51 LOG:   pid 21411: health check retry sleep time: 1
>>> second(s)
>>> 2012-01-15 09:17:52 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:17:52 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:17:52 ERROR: pid 21411: health check failed. 1 th host /tmp
>>> at port 11001 is down
>>> 2012-01-15 09:17:52 LOG:   pid 21411: health check retry sleep time: 1
>>> second(s)
>>> 2012-01-15 09:17:53 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:17:53 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:17:53 ERROR: pid 21411: health check failed. 1 th host /tmp
>>> at port 11001 is down
>>> 2012-01-15 09:17:53 LOG:   pid 21411: health_check: 1 failover is canceld
>>> because failover is disallowed
>>> -- after 3 retries, pgpool wanted to fail over, but gave up because
>>> DISALLOW_TO_FAILOVER is set for backend1
>>>
>>> 2012-01-15 09:18:00 ERROR: pid 21445: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:18:00 ERROR: pid 21445: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:18:00 ERROR: pid 21445: check_replication_time_lag: could
>>> not connect to DB node 1, check sr_check_user and sr_check_password
>>> 2012-01-15 09:18:03 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:18:03 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:18:03 ERROR: pid 21411: health check failed. 1 th host /tmp
>>> at port 11001 is down
>>> 2012-01-15 09:18:03 LOG:   pid 21411: health check retry sleep time: 1
>>> second(s)
>>> 2012-01-15 09:18:04 ERROR: pid 21411: connect_unix_domain_socket_by_port:
>>> connect() failed to /tmp/.s.PGSQL.11001: No such file or directory
>>> 2012-01-15 09:18:04 ERROR: pid 21411: make_persistent_db_connection:
>>> connection to /tmp(11001) failed
>>> 2012-01-15 09:18:04 ERROR: pid 21411: health check failed. 1 th host /tmp
>>> at port 11001 is down
>>> 2012-01-15 09:18:04 LOG:   pid 21411: health check retry sleep time: 1
>>> second(s)
>>> 2012-01-15 09:18:05 LOG:   pid 21411: after some retrying backend
>>> returned to healthy state
>>> -- started backend1 and pgpool succeeded in health checking. Resumed
>>> using backend1
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>> > Hello Tatsuo,
>>> >
>>> > Thank you for the patch and effort, but unfortunately this change won't
>>> > work for us. We need to set DISALLOW_TO_FAILOVER to prevent failover on
>>> > child-reported connection errors (it's OK if a few clients lose their
>>> > connection or cannot connect), and still have pgpool perform failover,
>>> > but only on a failed health check (if configured, after the max retries
>>> > threshold has been reached).
>>> >
>>> > Maybe it would be best to add an extra value for backend_flag, e.g.
>>> > ALLOW_TO_FAILOVER_ON_HEALTH_CHECK or DISALLOW_TO_FAILOVER_ON_CHILD_ERROR.
>>> > It should behave the same as when DISALLOW_TO_FAILOVER is set, the only
>>> > difference in behaviour being when the health check (if set, after max
>>> > retries) has failed: unlike DISALLOW_TO_FAILOVER, this new flag should
>>> > allow failover in this case only.
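>>> >
>>> > Purely to illustrate the intent (the names and values below are
>>> > hypothetical, not existing pgpool code):
>>> >
>>> >   /* hypothetical extra backend_flag value */
>>> >   #define POOL_FAILOVER_ON_HEALTH_CHECK_ONLY  0x0002
>>> >
>>> >   /* child error path: never degenerate a node carrying this flag;
>>> >    * health check path, once health_check_max_retries is exhausted: */
>>> >   if (BACKEND_INFO(node_id).flag & POOL_FAILOVER_ON_HEALTH_CHECK_ONLY)
>>> >       degenerate_backend_set(&node_id, 1);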
>>> >
>>> > Without this change, health check (especially health check retries)
>>> > doesn't make much sense: on a (temporary) backend failure a child error
>>> > is more likely to occur than a health check, and it will or will not
>>> > cause failover depending on the backend flag, without giving health
>>> > check retries a chance to determine whether the failure was temporary,
>>> > risking a split-brain situation with two masters just because of a
>>> > temporary network link hiccup.
>>> >
>>> > Our main problem, though, remains the health check timeout not being
>>> > respected under these special conditions we have. Maybe Nenad can help
>>> > you reproduce the issue in your environment.
>>> >
>>> > Kind regards,
>>> > Stevo.
>>> >
>>> > 2012/1/13 Tatsuo Ishii <ishii at postgresql.org>
>>> >
>>> >> Thanks for pointing it out.
>>> >> Yes, checking DISALLOW_TO_FAILOVER before retrying is wrong.
>>> >> However, after the retry count is exceeded, I think we should check
>>> >> DISALLOW_TO_FAILOVER.
>>> >> Attached is a patch attempting to fix it. Please try it.
>>> >> --
>>> >> Tatsuo Ishii
>>> >> SRA OSS, Inc. Japan
>>> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> Japanese: http://www.sraoss.co.jp
>>> >>
>>> >> > pgpool is being used in raw mode, just for the (health check based)
>>> >> > failover part, so applications are not required to restart when the
>>> >> > standby gets promoted to the new master. Here is the pgpool.conf file
>>> >> > and a very small patch we're using, applied to the pgpool 3.1.1 release.
>>> >> >
>>> >> > We have to have DISALLOW_TO_FAILOVER set for the backend, since any
>>> >> > child process that detects that master/backend0 is not available
>>> >> > would, if DISALLOW_TO_FAILOVER were not set, degenerate the backend
>>> >> > without giving the health check a chance to retry. We need health
>>> >> > check with retries because the condition that backend0 is not
>>> >> > available could be temporary (network glitches to the remote site
>>> >> > where the master is, or a deliberate failover of the master postgres
>>> >> > service from one node to the other on the remote site; in both cases
>>> >> > remote means remote to the pgpool that is going to perform the health
>>> >> > checks and ultimately the failover), and we don't want the standby to
>>> >> > be promoted to a new master that easily, to prevent temporary network
>>> >> > conditions, which could occur frequently, from frequently causing
>>> >> > split brain with two masters.
>>> >> >
>>> >> > But then, with DISALLOW_TO_FAILOVER set and without the patch, the
>>> >> > health check will not retry and will thus give the backend only one
>>> >> > chance (if a health check ever occurs before a child process fails to
>>> >> > connect to the backend), effectively rendering the retry settings
>>> >> > ignored. That's where this patch comes into action: it enables health
>>> >> > check retries while child processes are prevented from degenerating
>>> >> > the backend.
>>> >> >
>>> >> > I don't think, though I could be wrong, that this patch influences the
>>> >> > behavior we're seeing with the unwanted health check attempt delays.
>>> >> > Also, knowing this, maybe pgpool could be patched, or some other
>>> >> > support built into it, to cover this use case.
>>> >> >
>>> >> > Regards,
>>> >> > Stevo.
>>> >> >
>>> >> >
>>> >> > 2012/1/12 Tatsuo Ishii <ishii at postgresql.org>
>>> >> >
>>> >> >> I have accepted the moderation request. Your post should be sent
>>> >> shortly.
>>> >> >> Also I have raised the post size limit to 1MB.
>>> >> >> I will look into this...
>>> >> >> --
>>> >> >> Tatsuo Ishii
>>> >> >> SRA OSS, Inc. Japan
>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> Japanese: http://www.sraoss.co.jp
>>> >> >>
>>> >> >> > Here is the log file and the strace output file (this time in an
>>> >> >> > archive; I didn't know about the 200KB constraint on post size,
>>> >> >> > which requires moderator approval). Timings configured are a 30sec
>>> >> >> > health check interval, 5sec timeout, and 2 retries with a 10sec
>>> >> >> > retry delay.
>>> >> >> >
>>> >> >> > It takes a lot more than 5sec from the start of the health check to
>>> >> >> > the 10sec sleep before the first retry.
>>> >> >> >
>>> >> >> > Seen in the code (main.c, health_check() function): within a (retry)
>>> >> >> > attempt there is an inner retry (first with the postgres database,
>>> >> >> > then with template1), and that part doesn't seem to be interrupted
>>> >> >> > by the alarm.
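>>> >> >> >
>>> >> >> > A simplified illustration (not the actual pgpool code) of an
>>> >> >> > alarm()-bounded connect: if the interruption is ignored and the wait
>>> >> >> > resumed, the attempt can last until the TCP-level timeout instead.
>>> >> >> >
>>> >> >> >   #include <signal.h>
>>> >> >> >   #include <unistd.h>
>>> >> >> >   #include <errno.h>
>>> >> >> >   #include <sys/socket.h>
>>> >> >> >
>>> >> >> >   static volatile sig_atomic_t timer_expired = 0;
>>> >> >> >   static void timer_handler(int sig) { timer_expired = 1; }
>>> >> >> >
>>> >> >> >   static int checked_connect(int fd, const struct sockaddr *addr,
>>> >> >> >                              socklen_t len, unsigned timeout_secs)
>>> >> >> >   {
>>> >> >> >       struct sigaction sa;
>>> >> >> >       sa.sa_handler = timer_handler;
>>> >> >> >       sigemptyset(&sa.sa_mask);
>>> >> >> >       sa.sa_flags = 0;               /* deliberately no SA_RESTART */
>>> >> >> >       sigaction(SIGALRM, &sa, NULL);
>>> >> >> >       alarm(timeout_secs);
>>> >> >> >
>>> >> >> >       int rc = connect(fd, addr, len);
>>> >> >> >       if (rc < 0 && errno == EINTR && timer_expired)
>>> >> >> >           rc = -1;                   /* treat as health check timeout */
>>> >> >> >       /* if this EINTR were ignored and the wait simply resumed,
>>> >> >> >        * the configured timeout would effectively be lost */
>>> >> >> >
>>> >> >> >       alarm(0);
>>> >> >> >       return rc;
>>> >> >> >   }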
>>> >> >> >
>>> >> >> > Regards,
>>> >> >> > Stevo.
>>> >> >> >
>>> >> >> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
>>> >> >> >
>>> >> >> >> Here is the log file and strace output file. Timings configured
>>> are
>>> >> >> 30sec
>>> >> >> >> health check interval, 5sec timeout, and 2 retries with 10sec
>>> retry
>>> >> >> delay.
>>> >> >> >>
>>> >> >> >> It takes a lot more than 5sec from started health check to
>>> sleeping
>>> >> >> 10sec
>>> >> >> >> for first retry.
>>> >> >> >>
>>> >> >> >> Seen in the code (main.c, health_check() function): within a (retry)
>>> >> >> >> attempt there is an inner retry (first with the postgres database,
>>> >> >> >> then with template1), and that part doesn't seem to be interrupted
>>> >> >> >> by the alarm.
>>> >> >> >>
>>> >> >> >> Regards,
>>> >> >> >> Stevo.
>>> >> >> >>
>>> >> >> >>
>>> >> >> >> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
>>> >> >> >>
>>> >> >> >>> Ok, I will do it. In the mean time you could use "strace -tt -p
>>> PID"
>>> >> >> >>> to see which system call is blocked.
>>> >> >> >>> --
>>> >> >> >>> Tatsuo Ishii
>>> >> >> >>> SRA OSS, Inc. Japan
>>> >> >> >>> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> >>> Japanese: http://www.sraoss.co.jp
>>> >> >> >>>
>>> >> >> >>> > OK, got the info - key point is that ip forwarding is
>>> disabled for
>>> >> >> >>> security
>>> >> >> >>> > reasons. Rules in iptables are not important, iptables can be
>>> >> >> stopped,
>>> >> >> >>> or
>>> >> >> >>> > previously added rules removed.
>>> >> >> >>> >
>>> >> >> >>> > Here are the steps to reproduce (kudos to my colleague Nenad
>>> >> >> Bulatovic
>>> >> >> >>> for
>>> >> >> >>> > providing this):
>>> >> >> >>> >
>>> >> >> >>> > 1.) make sure that ip forwarding is off:
>>> >> >> >>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
>>> >> >> >>> > 2.) create IP alias on some interface (and have postgres
>>> listen on
>>> >> >> it):
>>> >> >> >>> >     ip addr add x.x.x.x/yy dev ethz
>>> >> >> >>> > 3.) set backend_hostname0 to aforementioned IP
>>> >> >> >>> > 4.) start pgpool and monitor health checks
>>> >> >> >>> > 5.) remove IP alias:
>>> >> >> >>> >     ip addr del x.x.x.x/yy dev ethz
>>> >> >> >>> >
>>> >> >> >>> >
>>> >> >> >>> > Here is the interesting part in pgpool log after this:
>>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
>>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB
>>> node
>>> >> >> >>> status: 2
>>> >> >> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB
>>> node
>>> >> >> >>> status: 1
>>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
>>> >> >> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB
>>> node
>>> >> >> >>> status: 2
>>> >> >> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB
>>> node
>>> >> >> >>> status: 2
>>> >> >> >>> > 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0
>>> th
>>> >> host
>>> >> >> >>> > 192.168.2.27 at port 5432 is down
>>> >> >> >>> > 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep
>>> >> time:
>>> >> >> 10
>>> >> >> >>> > second(s)
>>> >> >> >>> >
>>> >> >> >>> > That pgpool was configured with health check interval of
>>> 30sec,
>>> >> 5sec
>>> >> >> >>> > timeout, and 10sec retry delay with 2 max retries.
>>> >> >> >>> >
>>> >> >> >>> > Making use of libpq instead for connecting to the db in the health
>>> >> >> >>> > checks should IMO resolve it, but you'll best determine which call
>>> >> >> >>> > exactly gets blocked waiting. Btw, psql with the PGCONNECT_TIMEOUT
>>> >> >> >>> > env var configured respects that timeout.
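>>> >> >> >>> >
>>> >> >> >>> > For example, a probe along these lines (illustrative only, not pgpool
>>> >> >> >>> > code; the dbname and the 5 second value are placeholders) delegates
>>> >> >> >>> > the timeout to libpq's connect_timeout parameter:
>>> >> >> >>> >
>>> >> >> >>> >   #include <libpq-fe.h>
>>> >> >> >>> >
>>> >> >> >>> >   static int backend_is_alive(const char *host, const char *port)
>>> >> >> >>> >   {
>>> >> >> >>> >       const char *const keys[] = {"host", "port", "dbname",
>>> >> >> >>> >                                   "connect_timeout", NULL};
>>> >> >> >>> >       const char *const vals[] = {host, port, "postgres", "5", NULL};
>>> >> >> >>> >
>>> >> >> >>> >       PGconn *conn = PQconnectdbParams(keys, vals, 0);
>>> >> >> >>> >       int alive = (PQstatus(conn) == CONNECTION_OK);
>>> >> >> >>> >       PQfinish(conn);
>>> >> >> >>> >       return alive;
>>> >> >> >>> >   }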
>>> >> >> >>> >
>>> >> >> >>> > Regards,
>>> >> >> >>> > Stevo.
>>> >> >> >>> >
>>> >> >> >>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <
>>> sslavic at gmail.com
>>> >> >
>>> >> >> >>> wrote:
>>> >> >> >>> >
>>> >> >> >>> >> Tatsuo,
>>> >> >> >>> >>
>>> >> >> >>> >> Did you restart iptables after adding rule?
>>> >> >> >>> >>
>>> >> >> >>> >> Regards,
>>> >> >> >>> >> Stevo.
>>> >> >> >>> >>
>>> >> >> >>> >>
>>> >> >> >>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <
>>> >> sslavic at gmail.com>
>>> >> >> >>> wrote:
>>> >> >> >>> >>
>>> >> >> >>> >>> Looking into this to verify if these are all necessary
>>> changes
>>> >> to
>>> >> >> have
>>> >> >> >>> >>> port unreachable message silently rejected (suspecting some
>>> >> kernel
>>> >> >> >>> >>> parameter tuning is needed).
>>> >> >> >>> >>>
>>> >> >> >>> >>> Just to clarify it's not a problem that host is being
>>> detected
>>> >> by
>>> >> >> >>> pgpool
>>> >> >> >>> >>> to be down, but the timing when that happens. On environment
>>> >> where
>>> >> >> >>> issue is
>>> >> >> >>> >>> reproduced pgpool as part of health check attempt tries to
>>> >> connect
>>> >> >> to
>>> >> >> >>> >>> backend and hangs for tcp timeout instead of being
>>> interrupted
>>> >> by
>>> >> >> >>> timeout
>>> >> >> >>> >>> alarm. Can you verify/confirm please the health check retry
>>> >> timings
>>> >> >> >>> are not
>>> >> >> >>> >>> delayed?
>>> >> >> >>> >>>
>>> >> >> >>> >>> Regards,
>>> >> >> >>> >>> Stevo.
>>> >> >> >>> >>>
>>> >> >> >>> >>>
>>> >> >> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <
>>> >> >> ishii at postgresql.org
>>> >> >> >>> >wrote:
>>> >> >> >>> >>>
>>> >> >> >>> >>>> Ok, I did:
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> # iptables -A FORWARD -j REJECT --reject-with
>>> >> >> icmp-port-unreachable
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> on the host where pgpoo is running. And pull network cable
>>> from
>>> >> >> >>> >>>> backend0 host network interface. Pgpool detected the host
>>> being
>>> >> >> down
>>> >> >> >>> >>>> as expected...
>>> >> >> >>> >>>> --
>>> >> >> >>> >>>> Tatsuo Ishii
>>> >> >> >>> >>>> SRA OSS, Inc. Japan
>>> >> >> >>> >>>> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> >>> >>>> Japanese: http://www.sraoss.co.jp
>>> >> >> >>> >>>>
>>> >> >> >>> >>>> > Backend is not destination of this message, pgpool host
>>> is,
>>> >> and
>>> >> >> we
>>> >> >> >>> >>>> don't
>>> >> >> >>> >>>> > want it to ever get it. With command I've sent you rule
>>> will
>>> >> be
>>> >> >> >>> >>>> created for
>>> >> >> >>> >>>> > any source and destination.
>>> >> >> >>> >>>> >
>>> >> >> >>> >>>> > Regards,
>>> >> >> >>> >>>> > Stevo.
>>> >> >> >>> >>>> >
>>> >> >> >>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <
>>> >> >> >>> ishii at postgresql.org>
>>> >> >> >>> >>>> wrote:
>>> >> >> >>> >>>> >
>>> >> >> >>> >>>> >> I did following:
>>> >> >> >>> >>>> >>
>>> >> >> >>> >>>> >> Do following on the host where pgpool is running on:
>>> >> >> >>> >>>> >>
>>> >> >> >>> >>>> >> # iptables -A FORWARD -j REJECT --reject-with
>>> >> >> >>> icmp-port-unreachable -d
>>> >> >> >>> >>>> >> 133.137.177.124
>>> >> >> >>> >>>> >> (133.137.177.124 is the host where backend is running
>>> on)
>>> >> >> >>> >>>> >>
>>> >> >> >>> >>>> >> Pull network cable from backend0 host network interface.
>>> >> Pgpool
>>> >> >> >>> >>>> >> detected the host being down as expected. Am I missing
>>> >> >> something?
>>> >> >> >>> >>>> >> --
>>> >> >> >>> >>>> >> Tatsuo Ishii
>>> >> >> >>> >>>> >> SRA OSS, Inc. Japan
>>> >> >> >>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> >>> >>>> >> Japanese: http://www.sraoss.co.jp
>>> >> >> >>> >>>> >>
>>> >> >> >>> >>>> >> > Hello Tatsuo,
>>> >> >> >>> >>>> >> >
>>> >> >> >>> >>>> >> > With backend0 on one host just configure following
>>> rule on
>>> >> >> other
>>> >> >> >>> >>>> host
>>> >> >> >>> >>>> >> where
>>> >> >> >>> >>>> >> > pgpool is:
>>> >> >> >>> >>>> >> >
>>> >> >> >>> >>>> >> > iptables -A FORWARD -j REJECT --reject-with
>>> >> >> >>> icmp-port-unreachable
>>> >> >> >>> >>>> >> >
>>> >> >> >>> >>>> >> > and then have pgpool startup with health checking and
>>> >> >> retrying
>>> >> >> >>> >>>> >> configured,
>>> >> >> >>> >>>> >> > and then pull network cable from backend0 host network
>>> >> >> >>> interface.
>>> >> >> >>> >>>> >> >
>>> >> >> >>> >>>> >> > Regards,
>>> >> >> >>> >>>> >> > Stevo.
>>> >> >> >>> >>>> >> >
>>> >> >> >>> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <
>>> >> >> >>> ishii at postgresql.org
>>> >> >> >>> >>>> >
>>> >> >> >>> >>>> >> wrote:
>>> >> >> >>> >>>> >> >
>>> >> >> >>> >>>> >> >> I want to try to test the situation you described:
>>> >> >> >>> >>>> >> >>
>>> >> >> >>> >>>> >> >> >> > When system is configured for security reasons
>>> not
>>> >> to
>>> >> >> >>> return
>>> >> >> >>> >>>> >> >> destination
>>> >> >> >>> >>>> >> >> >> > host unreachable messages, even though
>>> >> >> >>> health_check_timeout is
>>> >> >> >>> >>>> >> >>
>>> >> >> >>> >>>> >> >> But I don't know how to do it. I pulled out the
>>> network
>>> >> >> cable
>>> >> >> >>> and
>>> >> >> >>> >>>> >> >> pgpool detected it as expected. Also I configured the
>>> >> server
>>> >> >> >>> which
>>> >> >> >>> >>>> >> >> PostgreSQL is running on to disable the 5432 port. In
>>> >> this
>>> >> >> case
>>> >> >> >>> >>>> >> >> connect(2) returned EHOSTUNREACH (No route to host)
>>> so
>>> >> >> pgpool
>>> >> >> >>> >>>> detected
>>> >> >> >>> >>>> >> >> the error as expected.
>>> >> >> >>> >>>> >> >>
>>> >> >> >>> >>>> >> >> Could you please instruct me?
>>> >> >> >>> >>>> >> >> --
>>> >> >> >>> >>>> >> >> Tatsuo Ishii
>>> >> >> >>> >>>> >> >> SRA OSS, Inc. Japan
>>> >> >> >>> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> >>> >>>> >> >> Japanese: http://www.sraoss.co.jp
>>> >> >> >>> >>>> >> >>
>>> >> >> >>> >>>> >> >> > Hello Tatsuo,
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> > Thank you for replying!
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> > I'm not sure what exactly is blocking, just by
>>> pgpool
>>> >> code
>>> >> >> >>> >>>> analysis I
>>> >> >> >>> >>>> >> >> > suspect it is the part where a connection is made
>>> to
>>> >> the
>>> >> >> db
>>> >> >> >>> and
>>> >> >> >>> >>>> it
>>> >> >> >>> >>>> >> >> doesn't
>>> >> >> >>> >>>> >> >> > seem to get interrupted by alarm. Tested thoroughly
>>> >> health
>>> >> >> >>> check
>>> >> >> >>> >>>> >> >> behaviour,
>>> >> >> >>> >>>> >> >> > it works really well when host/ip is there and just
>>> >> >> >>> >>>> backend/postgres
>>> >> >> >>> >>>> >> is
>>> >> >> >>> >>>> >> >> > down, but not when backend host/ip is down. I could
>>> >> see in
>>> >> >> >>> log
>>> >> >> >>> >>>> that
>>> >> >> >>> >>>> >> >> initial
>>> >> >> >>> >>>> >> >> > health check and each retry got delayed when
>>> host/ip is
>>> >> >> not
>>> >> >> >>> >>>> reachable,
>>> >> >> >>> >>>> >> >> > while when just backend is not listening (is down)
>>> on
>>> >> the
>>> >> >> >>> >>>> reachable
>>> >> >> >>> >>>> >> >> host/ip
>>> >> >> >>> >>>> >> >> > then initial health check and all retries are
>>> exact to
>>> >> the
>>> >> >> >>> >>>> settings in
>>> >> >> >>> >>>> >> >> > pgpool.conf.
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq
>>> >> >> environment
>>> >> >> >>> >>>> variables
>>> >> >> >>> >>>> >> in
>>> >> >> >>> >>>> >> >> > the docs (see
>>> >> >> >>> >>>> >> >>
>>> >> http://www.postgresql.org/docs/9.1/static/libpq-envars.html)
>>> >> >> >>> >>>> >> >> > There is equivalent parameter in libpq
>>> >> PGconnectdbParams (
>>> >> >> >>> see
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >>
>>> >> >> >>> >>>> >>
>>> >> >> >>> >>>>
>>> >> >> >>>
>>> >> >>
>>> >>
>>> http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT
>>> >> >> >>> >>>> >> >> )
>>> >> >> >>> >>>> >> >> > At the beginning of that same page there are some
>>> >> >> important
>>> >> >> >>> >>>> infos on
>>> >> >> >>> >>>> >> >> using
>>> >> >> >>> >>>> >> >> > these functions.
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> > Regards,
>>> >> >> >>> >>>> >> >> > Stevo.
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <
>>> >> >> >>> >>>> ishii at postgresql.org>
>>> >> >> >>> >>>> >> >> wrote:
>>> >> >> >>> >>>> >> >> >
>>> >> >> >>> >>>> >> >> >> > Hello pgpool community,
>>> >> >> >>> >>>> >> >> >> >
>>> >> >> >>> >>>> >> >> >> > When system is configured for security reasons
>>> not
>>> >> to
>>> >> >> >>> return
>>> >> >> >>> >>>> >> >> destination
>>> >> >> >>> >>>> >> >> >> > host unreachable messages, even though
>>> >> >> >>> health_check_timeout is
>>> >> >> >>> >>>> >> >> >> configured,
>>> >> >> >>> >>>> >> >> >> > socket call will block and alarm will not get
>>> raised
>>> >> >> >>> until TCP
>>> >> >> >>> >>>> >> timeout
>>> >> >> >>> >>>> >> >> >> > occurs.
>>> >> >> >>> >>>> >> >> >>
>>> >> >> >>> >>>> >> >> >> Interesting. So are you saying that read(2)
>>> cannot be
>>> >> >> >>> >>>> interrupted by
>>> >> >> >>> >>>> >> >> >> alarm signal if the system is configured not to
>>> return
>>> >> >> >>> >>>> destination
>>> >> >> >>> >>>> >> >> >> host unreachable message? Could you please guide
>>> me
>>> >> >> where I
>>> >> >> >>> can
>>> >> >> >>> >>>> get
>>> >> >> >>> >>>> >> >> >> such that info? (I'm not a network expert).
>>> >> >> >>> >>>> >> >> >>
>>> >> >> >>> >>>> >> >> >> > Not a C programmer, but I found some info that the blocking connect
>>> >> >> >>> >>>> >> >> >> > call could be replaced with select/pselect calls. Maybe it would be
>>> >> >> >>> >>>> >> >> >> > best if the PGCONNECT_TIMEOUT value could be used here for the
>>> >> >> >>> >>>> >> >> >> > connection timeout. pgpool has libpq as a dependency, so why isn't it
>>> >> >> >>> >>>> >> >> >> > using libpq for the health check db connect calls? Then
>>> >> >> >>> >>>> >> >> >> > PGCONNECT_TIMEOUT would be applied.
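>>> >> >> >>> >>>> >> >> >> >
>>> >> >> >>> >>>> >> >> >> > As a sketch of the select-based idea (illustrative only, not pgpool
>>> >> >> >>> >>>> >> >> >> > code): put the socket in non-blocking mode and let select() enforce
>>> >> >> >>> >>>> >> >> >> > the deadline instead of relying on SIGALRM.
>>> >> >> >>> >>>> >> >> >> >
>>> >> >> >>> >>>> >> >> >> >   #include <sys/select.h>
>>> >> >> >>> >>>> >> >> >> >   #include <sys/socket.h>
>>> >> >> >>> >>>> >> >> >> >   #include <fcntl.h>
>>> >> >> >>> >>>> >> >> >> >   #include <unistd.h>
>>> >> >> >>> >>>> >> >> >> >   #include <errno.h>
>>> >> >> >>> >>>> >> >> >> >
>>> >> >> >>> >>>> >> >> >> >   static int connect_with_timeout(int fd, const struct sockaddr *addr,
>>> >> >> >>> >>>> >> >> >> >                                   socklen_t len, int seconds)
>>> >> >> >>> >>>> >> >> >> >   {
>>> >> >> >>> >>>> >> >> >> >       fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);
>>> >> >> >>> >>>> >> >> >> >       if (connect(fd, addr, len) == 0)
>>> >> >> >>> >>>> >> >> >> >           return 0;                  /* connected immediately */
>>> >> >> >>> >>>> >> >> >> >       if (errno != EINPROGRESS)
>>> >> >> >>> >>>> >> >> >> >           return -1;                 /* immediate failure */
>>> >> >> >>> >>>> >> >> >> >
>>> >> >> >>> >>>> >> >> >> >       fd_set wfds;
>>> >> >> >>> >>>> >> >> >> >       FD_ZERO(&wfds);
>>> >> >> >>> >>>> >> >> >> >       FD_SET(fd, &wfds);
>>> >> >> >>> >>>> >> >> >> >       struct timeval tv = { seconds, 0 };
>>> >> >> >>> >>>> >> >> >> >       if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0)
>>> >> >> >>> >>>> >> >> >> >           return -1;                 /* timed out or select() failed */
>>> >> >> >>> >>>> >> >> >> >
>>> >> >> >>> >>>> >> >> >> >       int err = 0;
>>> >> >> >>> >>>> >> >> >> >       socklen_t errlen = sizeof(err);
>>> >> >> >>> >>>> >> >> >> >       getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen);
>>> >> >> >>> >>>> >> >> >> >       return err == 0 ? 0 : -1;
>>> >> >> >>> >>>> >> >> >> >   }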
>>> >> >> >>> >>>> >> >> >>
>>> >> >> >>> >>>> >> >> >> I don't think libpq uses select/pselect for
>>> >> establishing
>>> >> >> >>> >>>> connection,
>>> >> >> >>> >>>> >> >> >> but using libpq instead of homebrew code seems to
>>> be
>>> >> an
>>> >> >> >>> idea.
>>> >> >> >>> >>>> Let me
>>> >> >> >>> >>>> >> >> >> think about it.
>>> >> >> >>> >>>> >> >> >>
>>> >> >> >>> >>>> >> >> >> One question. Are you sure that libpq can deal
>>> with
>>> >> the
>>> >> >> case
>>> >> >> >>> >>>> (not to
>>> >> >> >>> >>>> >> >> >> return destination host unreachable messages) by
>>> using
>>> >> >> >>> >>>> >> >> >> PGCONNECT_TIMEOUT?
>>> >> >> >>> >>>> >> >> >> --
>>> >> >> >>> >>>> >> >> >> Tatsuo Ishii
>>> >> >> >>> >>>> >> >> >> SRA OSS, Inc. Japan
>>> >> >> >>> >>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
>>> >> >> >>> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
>>> >> >> >>> >>>> >> >> >>
>>> >> >> >>> >>>> >> >>
>>> >> >> >>> >>>> >>
>>> >> >> >>> >>>>
>>> >> >> >>> >>>
>>> >> >> >>> >>>
>>> >> >> >>> >>
>>> >> >> >>>
>>> >> >> >>
>>> >> >> >>
>>> >> >>
>>> >>
>>>
>>
>>

