[pgpool-general: 143] Re: Healthcheck timeout not always respected

Tatsuo Ishii ishii at postgresql.org
Thu Jan 12 07:37:58 JST 2012


Ok, I will do it. In the meantime, you could use "strace -tt -p PID"
to see which system call is blocking.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> OK, got the info - key point is that ip forwarding is disabled for security
> reasons. Rules in iptables are not important, iptables can be stopped, or
> previously added rules removed.
> 
> Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic for
> providing this):
> 
> 1.) make sure that ip forwarding is off:
>     echo 0 > /proc/sys/net/ipv4/ip_forward
> 2.) create IP alias on some interface (and have postgres listen on it):
>     ip addr add x.x.x.x/yy dev ethz
> 3.) set backend_hostname0 to aforementioned IP
> 4.) start pgpool and monitor health checks
> 5.) remove IP alias:
>     ip addr del x.x.x.x/yy dev ethz
> 
> 
> Here is the interesting part in pgpool log after this:
> 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
> 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
> 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host 192.168.2.27 at port 5432 is down
> 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10 second(s)
> 
> That pgpool was configured with health check interval of 30sec, 5sec
> timeout, and 10sec retry delay with 2 max retries.
> 
> Using libpq to connect to the db in health checks should IMO resolve
> it, but you are best placed to determine exactly which call blocks.
> Btw, psql with the PGCONNECT_TIMEOUT env var set respects that
> timeout.
> 
> Regards,
> Stevo.
> 
> On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
> 
>> Tatsuo,
>>
>> Did you restart iptables after adding rule?
>>
>> Regards,
>> Stevo.
>>
>>
>> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>>
>>> Looking into this to verify if these are all necessary changes to have
>>> port unreachable message silently rejected (suspecting some kernel
>>> parameter tuning is needed).
>>>
>>> Just to clarify, the problem is not that pgpool detects the host as down,
>>> but the timing of that detection. In the environment where the issue is
>>> reproduced, pgpool tries to connect to the backend as part of the health
>>> check and hangs for the TCP timeout instead of being interrupted by the
>>> timeout alarm. Can you please confirm that the health check retry timings
>>> are not delayed?
>>>
>>> Regards,
>>> Stevo.
>>>
>>>
>>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>
>>>> Ok, I did:
>>>>
>>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>>
>>>> on the host where pgpool is running, and pulled the network cable from
>>>> the backend0 host network interface. Pgpool detected the host being down
>>>> as expected...
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>>
>>>> > Backend is not destination of this message, pgpool host is, and we don't
>>>> > want it to ever get it. With command I've sent you rule will be created for
>>>> > any source and destination.
>>>> >
>>>> > Regards,
>>>> > Stevo.
>>>> >
>>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>> >
>>>> >> I did the following on the host where pgpool is running:
>>>> >>
>>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d 133.137.177.124
>>>> >> (133.137.177.124 is the host where the backend is running)
>>>> >>
>>>> >> Then I pulled the network cable from the backend0 host network interface.
>>>> >> Pgpool detected the host being down as expected. Am I missing something?
>>>> >> --
>>>> >> Tatsuo Ishii
>>>> >> SRA OSS, Inc. Japan
>>>> >> English: http://www.sraoss.co.jp/index_en.php
>>>> >> Japanese: http://www.sraoss.co.jp
>>>> >>
>>>> >> > Hello Tatsuo,
>>>> >> >
>>>> >> > With backend0 on one host just configure following rule on other host where
>>>> >> > pgpool is:
>>>> >> >
>>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>> >> >
>>>> >> > and then have pgpool startup with health checking and retrying configured,
>>>> >> > and then pull network cable from backend0 host network interface.
>>>> >> >
>>>> >> > Regards,
>>>> >> > Stevo.
>>>> >> >
>>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>> >> >
>>>> >> >> I want to try to test the situation you described:
>>>> >> >>
>>>> >> >> >> > When system is configured for security reasons not to return destination
>>>> >> >> >> > host unreachable messages, even though health_check_timeout is
>>>> >> >>
>>>> >> >> But I don't know how to do it. I pulled out the network cable and
>>>> >> >> pgpool detected it as expected. Also I configured the server which
>>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case
>>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool detected
>>>> >> >> the error as expected.
>>>> >> >>
>>>> >> >> Could you please instruct me?
>>>> >> >> --
>>>> >> >> Tatsuo Ishii
>>>> >> >> SRA OSS, Inc. Japan
>>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>>> >> >> Japanese: http://www.sraoss.co.jp
>>>> >> >>
>>>> >> >> > Hello Tatsuo,
>>>> >> >> >
>>>> >> >> > Thank you for replying!
>>>> >> >> >
>>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code analysis I
>>>> >> >> > suspect it is the part where a connection is made to the db and it doesn't
>>>> >> >> > seem to get interrupted by alarm. Tested thoroughly health check behaviour,
>>>> >> >> > it works really well when host/ip is there and just backend/postgres is
>>>> >> >> > down, but not when backend host/ip is down. I could see in log that initial
>>>> >> >> > health check and each retry got delayed when host/ip is not reachable,
>>>> >> >> > while when just backend is not listening (is down) on the reachable host/ip
>>>> >> >> > then initial health check and all retries are exact to the settings in
>>>> >> >> > pgpool.conf.
>>>> >> >> >
>>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment variables in
>>>> >> >> > the docs (see http://www.postgresql.org/docs/9.1/static/libpq-envars.html )
>>>> >> >> > There is an equivalent parameter in libpq PQconnectdbParams ( see
>>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT )
>>>> >> >> > At the beginning of that same page there is some important info on using
>>>> >> >> > these functions.
>>>> >> >> >
>>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>>>> >> >> >
>>>> >> >> > Regards,
>>>> >> >> > Stevo.
>>>> >> >> >
>>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>> >> >> >
>>>> >> >> >> > Hello pgpool community,
>>>> >> >> >> >
>>>> >> >> >> > When system is configured for security reasons not to return destination
>>>> >> >> >> > host unreachable messages, even though health_check_timeout is configured,
>>>> >> >> >> > socket call will block and alarm will not get raised until TCP timeout
>>>> >> >> >> > occurs.
>>>> >> >> >>
>>>> >> >> >> Interesting. So are you saying that read(2) cannot be interrupted by
>>>> >> >> >> alarm signal if the system is configured not to return destination
>>>> >> >> >> host unreachable message? Could you please guide me to where I can find
>>>> >> >> >> such info? (I'm not a network expert).
>>>> >> >> >>
>>>> >> >> >> > Not a C programmer, found some info that select call could be replaced with
>>>> >> >> >> > select/pselect calls. Maybe it would be best if PGCONNECT_TIMEOUT value
>>>> >> >> >> > could be used here for connection timeout. pgpool has libpq as dependency,
>>>> >> >> >> > why isn't it using libpq for the healthcheck db connect calls, then
>>>> >> >> >> > PGCONNECT_TIMEOUT would be applied?
>>>> >> >> >>
>>>> >> >> >> I don't think libpq uses select/pselect for establishing connection,
>>>> >> >> >> but using libpq instead of homebrew code seems to be an idea. Let me
>>>> >> >> >> think about it.
>>>> >> >> >>
>>>> >> >> >> One question. Are you sure that libpq can deal with the case (not to
>>>> >> >> >> return destination host unreachable messages) by using
>>>> >> >> >> PGCONNECT_TIMEOUT?
>>>> >> >> >> --
>>>> >> >> >> Tatsuo Ishii
>>>> >> >> >> SRA OSS, Inc. Japan
>>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
>>>> >> >> >> Japanese: http://www.sraoss.co.jp
>>>> >> >> >>
>>>> >> >>
>>>> >>
>>>>
>>>
>>>
>>
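The gap in the quoted pgpool log can be checked with a little arithmetic. The sketch below (Python, not part of pgpool) measures the stall between the 17:38:34 and 17:41:43 log lines and compares it with how long a blocking connect() would wait when its SYN packets are never answered. It assumes an initial retransmission timeout of 3 seconds and 5 SYN retries, which are believed to be the Linux defaults of that era; neither value is stated anywhere in the thread.

```python
from datetime import datetime

# Timestamps copied from the pgpool log quoted in the thread.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S"
check_started = datetime.strptime("2012-01-11 17:38:34", LOG_FORMAT)
next_log_line = datetime.strptime("2012-01-11 17:41:43", LOG_FORMAT)
observed_stall = int((next_log_line - check_started).total_seconds())

# Health check timeout quoted in the thread (seconds).
HEALTH_CHECK_TIMEOUT = 5

# ASSUMPTIONS, not taken from the thread: kernel defaults for the
# initial SYN retransmission timeout and the number of SYN retries.
INITIAL_RTO = 3
SYN_RETRIES = 5


def syn_timeout(initial_rto, retries):
    """Seconds until connect() gives up: the first SYN plus `retries`
    retransmissions, each waiting twice as long as the one before."""
    return sum(initial_rto * 2 ** attempt for attempt in range(retries + 1))


expected_stall = syn_timeout(INITIAL_RTO, SYN_RETRIES)

print(f"observed stall:        {observed_stall} s")   # 189 s
print(f"assumed SYN timeout:   {expected_stall} s")   # 189 s
print(f"configured timeout:    {HEALTH_CHECK_TIMEOUT} s")
```

If those assumptions hold, both figures come out at 189 seconds, which would suggest that the 5 second health_check_timeout never took effect and the check waited for the kernel's own connect timeout instead.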

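The select-based idea raised in the thread can be sketched as follows. This is a Python illustration for POSIX systems only; it is not pgpool's C code and makes no claim about how libpq works internally. The function name connect_with_deadline and its parameters are made up for the example. It switches the socket to non-blocking mode, starts the connect, and then waits in select() for at most the given timeout, so the wait stays bounded even when no ICMP error or RST ever comes back.

```python
import errno
import os
import select
import socket


def connect_with_deadline(host, port, timeout):
    """Return a connected TCP socket, or raise OSError.

    Hypothetical helper: `host` should be an IPv4 address (a host name
    would trigger a blocking DNS lookup), `timeout` is in seconds.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # A non-blocking connect normally returns EINPROGRESS at once.
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS):
            raise OSError(err, os.strerror(err))

        # The socket becomes writable when the connect has finished,
        # whether it succeeded or failed. An empty result means the
        # deadline passed first.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            raise TimeoutError(errno.ETIMEDOUT,
                               f"connect did not finish in {timeout} s")

        # SO_ERROR holds the real outcome of the connect attempt.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err != 0:
            raise OSError(err, os.strerror(err))

        sock.setblocking(True)
        return sock
    except BaseException:
        sock.close()
        raise
```

With the values from the thread, a call such as connect_with_deadline("192.168.2.27", 5432, 5) would be expected to either return a socket or raise within roughly 5 seconds, instead of waiting for the TCP timeout.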
