[pgpool-general: 148] Re: Healthcheck timeout not always respected

Thu Jan 12 17:30:12 JST 2012

I have accepted the moderation request. Your post should be sent shortly.
Also I have raised the post size limit to 1MB.
I will look into this...
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Here is the log file and strace output file (this time in an archive,
> didn't know about 200KB constraint on post size which requires moderator
> approval). Timings configured are 30sec health check interval, 5sec
> timeout, and 2 retries with 10sec retry delay.
> 
> It takes much longer than 5sec from the start of the health check to the
> 10sec sleep before the first retry.
> 
> Seen in code (main.c, health_check() function): within each (retry) attempt
> there is an inner retry (first with the postgres database, then with
> template1), and that part doesn't seem to be interrupted by the alarm.
> 
> Regards,
> Stevo.
> 
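For illustration, a minimal sketch of bounding a connection attempt with its own deadline, so that the wait ends even when no ICMP error or TCP reset ever arrives. It is written in Python for brevity and is not pgpool code; the function name connect_with_deadline and its parameters are hypothetical.

```python
import errno
import select
import socket


def connect_with_deadline(host, port, timeout_s):
    """Return True if a TCP connection completes within timeout_s seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.setblocking(False)
        # A non-blocking connect returns at once, usually with EINPROGRESS.
        rc = sock.connect_ex((host, port))
        if rc == 0:
            return True
        if rc not in (errno.EINPROGRESS, errno.EWOULDBLOCK):
            return False  # immediate failure, e.g. connection refused
        # Wait until the socket becomes writable or the deadline passes.
        _, writable, _ = select.select([], [sock], [], timeout_s)
        if not writable:
            return False  # deadline reached without any reply
        # Writable only means the attempt finished; SO_ERROR says how.
        return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
    finally:
        sock.close()
```

Passing a numeric IPv4 address as host avoids a name lookup, which would block outside of this deadline.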
> 2012/1/12 Stevo Slavić <sslavic at gmail.com>
> 
>> Here is the log file and strace output file. Timings configured are 30sec
>> health check interval, 5sec timeout, and 2 retries with 10sec retry delay.
>>
>> It takes much longer than 5sec from the start of the health check to the
>> 10sec sleep before the first retry.
>>
>> Seen in code (main.c, health_check() function): within each (retry) attempt
>> there is an inner retry (first with the postgres database, then with
>> template1), and that part doesn't seem to be interrupted by the alarm.
>>
>> Regards,
>> Stevo.
>>
>>
>> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
>>
>>> Ok, I will do it. In the meantime you could use "strace -tt -p PID"
>>> to see which system call is blocked.
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>> > OK, got the info - key point is that ip forwarding is disabled for
>>> > security reasons. Rules in iptables are not important, iptables can be
>>> > stopped, or previously added rules removed.
>>> >
>>> > Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic
>>> > for providing this):
>>> >
>>> > 1.) make sure that ip forwarding is off:
>>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
>>> > 2.) create IP alias on some interface (and have postgres listen on it):
>>> >     ip addr add x.x.x.x/yy dev ethz
>>> > 3.) set backend_hostname0 to aforementioned IP
>>> > 4.) start pgpool and monitor health checks
>>> > 5.) remove IP alias:
>>> >     ip addr del x.x.x.x/yy dev ethz
>>> >
>>> >
>>> > Here is the interesting part in pgpool log after this:
>>> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
>>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
>>> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
>>> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>>> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>>> > 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host 192.168.2.27 at port 5432 is down
>>> > 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10 second(s)
>>> >
>>> > That pgpool was configured with health check interval of 30sec, 5sec
>>> > timeout, and 10sec retry delay with 2 max retries.
>>> >
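To put numbers on the delay, here is a small sketch that computes the gaps between the timestamps in the log excerpt above. The variable names are illustrative only; the timestamps and the 5sec timeout are taken from the message.

```python
from datetime import datetime

# Timestamps copied from the log excerpt above.
check_started = datetime(2012, 1, 11, 17, 38, 34)
attempt_logged = datetime(2012, 1, 11, 17, 41, 43)
failure_logged = datetime(2012, 1, 11, 17, 41, 46)

health_check_timeout_s = 5  # configured health check timeout

# Time with no log output after the check of node 0 started.
blocked_s = (attempt_logged - check_started).total_seconds()
# Time until the first "health check failed" line appeared.
total_s = (failure_logged - check_started).total_seconds()

print(f"no log output for {blocked_s:.0f} s")
print(f"first failure reported after {total_s:.0f} s")
print(f"overshoot vs. timeout: {total_s - health_check_timeout_s:.0f} s")
```

That is 189 seconds without log output and 192 seconds until the first failure is reported, against a configured timeout of 5 seconds.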
>>> > Making use of libpq instead for connecting to db in health checks IMO
>>> > should resolve it, but you'll best determine which call exactly gets
>>> > blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured
>>> > respects that env var timeout.
>>> >
>>> > Regards,
>>> > Stevo.
>>> >
>>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>>> >
>>> >> Tatsuo,
>>> >>
>>> >> Did you restart iptables after adding rule?
>>> >>
>>> >> Regards,
>>> >> Stevo.
>>> >>
>>> >>
>>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>>> >>
>>> >>> Looking into this to verify if these are all necessary changes to have
>>> >>> port unreachable message silently rejected (suspecting some kernel
>>> >>> parameter tuning is needed).
>>> >>>
>>> >>> Just to clarify: the problem is not that pgpool detects the host as
>>> >>> down, but when it does so. In the environment where the issue is
>>> >>> reproduced, pgpool tries to connect to the backend as part of a health
>>> >>> check attempt and hangs until the TCP timeout instead of being
>>> >>> interrupted by the timeout alarm. Can you please verify that the health
>>> >>> check retry timings are not delayed?
>>> >>>
>>> >>> Regards,
>>> >>> Stevo.
>>> >>>
>>> >>>
>>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>> >>>
>>> >>>> Ok, I did:
>>> >>>>
>>> >>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>> >>>>
>>> >>>> on the host where pgpool is running, and then pulled the network cable
>>> >>>> from the backend0 host's network interface. Pgpool detected the host
>>> >>>> being down as expected...
>>> >>>> --
>>> >>>> Tatsuo Ishii
>>> >>>> SRA OSS, Inc. Japan
>>> >>>> English: http://www.sraoss.co.jp/index_en.php
>>> >>>> Japanese: http://www.sraoss.co.jp
>>> >>>>
>>> >>>> > The backend is not the destination of this message; the pgpool host
>>> >>>> > is, and we don't want it to ever get it. With the command I've sent
>>> >>>> > you, the rule will be created for any source and destination.
>>> >>>> >
>>> >>>> > Regards,
>>> >>>> > Stevo.
>>> >>>> >
>>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>> >>>> >
>>> >>>> >> I did the following on the host where pgpool is running:
>>> >>>> >>
>>> >>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d 133.137.177.124
>>> >>>> >> (133.137.177.124 is the host where the backend is running)
>>> >>>> >>
>>> >>>> >> Then I pulled the network cable from the backend0 host's network
>>> >>>> >> interface. Pgpool detected the host being down as expected. Am I
>>> >>>> >> missing something?
>>> >>>> >> --
>>> >>>> >> Tatsuo Ishii
>>> >>>> >> SRA OSS, Inc. Japan
>>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>>> >>>> >> Japanese: http://www.sraoss.co.jp
>>> >>>> >>
>>> >>>> >> > Hello Tatsuo,
>>> >>>> >> >
>>> >>>> >> > With backend0 on one host, just configure the following rule on the
>>> >>>> >> > other host, where pgpool is:
>>> >>>> >> >
>>> >>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>> >>>> >> >
>>> >>>> >> > and then have pgpool start up with health checking and retrying
>>> >>>> >> > configured, and then pull the network cable from the backend0 host's
>>> >>>> >> > network interface.
>>> >>>> >> >
>>> >>>> >> > Regards,
>>> >>>> >> > Stevo.
>>> >>>> >> >
>>> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>> >>>> >> >
>>> >>>> >> >> I want to try to test the situation you described:
>>> >>>> >> >>
>>> >>>> >> >> >> > When system is configured for security reasons not to return
>>> >>>> >> >> >> > destination host unreachable messages, even though
>>> >>>> >> >> >> > health_check_timeout is
>>> >>>> >> >>
>>> >>>> >> >> But I don't know how to do it. I pulled out the network cable and
>>> >>>> >> >> pgpool detected it as expected. Also I configured the server which
>>> >>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case
>>> >>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool
>>> >>>> >> >> detected the error as expected.
>>> >>>> >> >>
>>> >>>> >> >> Could you please instruct me?
>>> >>>> >> >> --
>>> >>>> >> >> Tatsuo Ishii
>>> >>>> >> >> SRA OSS, Inc. Japan
>>> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>> >>>> >> >> Japanese: http://www.sraoss.co.jp
>>> >>>> >> >>
>>> >>>> >> >> > Hello Tatsuo,
>>> >>>> >> >> >
>>> >>>> >> >> > Thank you for replying!
>>> >>>> >> >> >
>>> >>>> >> >> > I'm not sure what exactly is blocking; just from pgpool code
>>> >>>> >> >> > analysis I suspect it is the part where a connection is made to the
>>> >>>> >> >> > db, which doesn't seem to get interrupted by the alarm. I tested
>>> >>>> >> >> > health check behaviour thoroughly: it works really well when the
>>> >>>> >> >> > host/IP is there and just the backend/postgres is down, but not when
>>> >>>> >> >> > the backend host/IP is down. I could see in the log that the initial
>>> >>>> >> >> > health check and each retry got delayed when the host/IP is not
>>> >>>> >> >> > reachable, while when just the backend is not listening (is down) on
>>> >>>> >> >> > a reachable host/IP, the initial health check and all retries
>>> >>>> >> >> > follow the settings in pgpool.conf exactly.
>>> >>>> >> >> >
>>> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment
>>> >>>> >> >> > variables in the docs (see
>>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-envars.html ).
>>> >>>> >> >> > There is an equivalent parameter in libpq's PQconnectdbParams (see
>>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT ).
>>> >>>> >> >> > At the beginning of that same page there is some important
>>> >>>> >> >> > information on using these functions.
>>> >>>> >> >> >
>>> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>>> >>>> >> >> >
>>> >>>> >> >> > Regards,
>>> >>>> >> >> > Stevo.
>>> >>>> >> >> >
>>> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>> >>>> >> >> >
>>> >>>> >> >> >> > Hello pgpool community,
>>> >>>> >> >> >> >
>>> >>>> >> >> >> > When system is configured for security reasons not to return
>>> >>>> >> >> >> > destination host unreachable messages, even though
>>> >>>> >> >> >> > health_check_timeout is configured, socket call will block and
>>> >>>> >> >> >> > alarm will not get raised until TCP timeout occurs.
>>> >>>> >> >> >>
>>> >>>> >> >> >> Interesting. So are you saying that read(2) cannot be interrupted by
>>> >>>> >> >> >> the alarm signal if the system is configured not to return
>>> >>>> >> >> >> destination host unreachable messages? Could you please guide me to
>>> >>>> >> >> >> where I can get such info? (I'm not a network expert).
>>> >>>> >> >> >>
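For illustration, a minimal sketch of the alarm-based timeout pattern under discussion: a SIGALRM handler that breaks out of a blocking read. It is written in Python and is not pgpool code; HealthCheckTimeout and blocking_read_with_alarm are hypothetical names. It runs only on Unix and only in the main thread. In C, whether a blocked call actually returns EINTR depends on how the handler was installed (for example with or without SA_RESTART), which this sketch does not model.

```python
import signal
import socket


class HealthCheckTimeout(Exception):
    """Raised from the SIGALRM handler (hypothetical name)."""


def _on_alarm(signum, frame):
    raise HealthCheckTimeout()


def blocking_read_with_alarm(sock, timeout_s):
    """Return one byte read from sock, or None if the alarm fired first."""
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, timeout_s)
    try:
        return sock.recv(1)  # blocks until data arrives or the signal fires
    except HealthCheckTimeout:
        return None
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the old handler


if __name__ == "__main__":
    a, b = socket.socketpair()  # nothing is ever written to b
    print(blocking_read_with_alarm(a, 0.2))  # prints None
```

The read here is interrupted because the handler raises; a handler that returns normally would let Python retry the call.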
>>> >>>> >> >> >> > Not a C programmer, found some info that select call could be
>>> >>>> >> >> >> > replaced with select/pselect calls. Maybe it would be best if
>>> >>>> >> >> >> > PGCONNECT_TIMEOUT value could be used here for connection timeout.
>>> >>>> >> >> >> > pgpool has libpq as a dependency, so why isn't it using libpq for
>>> >>>> >> >> >> > the healthcheck db connect calls? Then PGCONNECT_TIMEOUT would be
>>> >>>> >> >> >> > applied.
>>> >>>> >> >> >>
>>> >>>> >> >> >> I don't think libpq uses select/pselect for establishing connection,
>>> >>>> >> >> >> but using libpq instead of homebrew code seems to be an idea. Let me
>>> >>>> >> >> >> think about it.
>>> >>>> >> >> >>
>>> >>>> >> >> >> One question. Are you sure that libpq can deal with the case (not to
>>> >>>> >> >> >> return destination host unreachable messages) by using
>>> >>>> >> >> >> PGCONNECT_TIMEOUT?
>>> >>>> >> >> >> --
>>> >>>> >> >> >> Tatsuo Ishii
>>> >>>> >> >> >> SRA OSS, Inc. Japan
>>> >>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
>>> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
>>> >>>> >> >> >>