[pgpool-general: 144] Re: Healthcheck timeout not always respected

Thu Jan 12 11:43:07 JST 2012

I did the test and it seems pgpool works as expected.
What I did was:

>> 1.) make sure that ip forwarding is off:
>>     echo 0 > /proc/sys/net/ipv4/ip_forward
On backend1 where PostgreSQL is running:
cat /proc/sys/net/ipv4/ip_forward
0

>> 2.) create IP alias on some interface (and have postgres listen on it):
>>     ip addr add x.x.x.x/yy dev ethz
On backend1 where PostgreSQL is running:
ip addr add 133.137.177.125/23 dev eth0

>> 3.) set backend_hostname0 to aforementioned IP
Done on the host where pgpool is running.

>> 4.) start pgpool and monitor health checks
>> 5.) remove IP alias:
>>     ip addr del x.x.x.x/yy dev ethz
ip addr del 133.137.177.125/23 dev eth0

Here are excerpts from the pgpool log:
2012-01-12 11:31:29 DEBUG: pid 29973: starting health checking
2012-01-12 11:31:29 DEBUG: pid 29973: health_check: 0 th DB node status: 0
2012-01-12 11:31:29 DEBUG: pid 29973: health_check: 1 th DB node status: 1
2012-01-12 11:31:39 DEBUG: pid 29973: starting health checking
2012-01-12 11:31:39 DEBUG: pid 29973: health_check: 0 th DB node status: 0
2012-01-12 11:31:39 DEBUG: pid 29973: health_check: 1 th DB node status: 1
2012-01-12 11:31:44 ERROR: pid 29973: connect_inet_domain_socket: connect() failed: Interrupted system call
2012-01-12 11:31:44 ERROR: pid 29973: health check failed. 1 th host 133.137.177.125 at port 5432 is down
2012-01-12 11:31:44 LOG:   pid 29973: set 1 th backend down status
2012-01-12 11:31:44 DEBUG: pid 29973: failover_handler called
:
:

Here are my health check settings:
health_check_period = 10
health_check_timeout = 5

So the health check timed out at 2012-01-12 11:31:44, which is 5
seconds after the health check started at 11:31:39 and matches
health_check_timeout. This shows that pgpool works as expected.
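
For anyone who wants to repeat this comparison on their own log, here
is a minimal sketch in Python (not part of pgpool). The function names
are made up for illustration, and it assumes every log line begins
with a "YYYY-MM-DD HH:MM:SS" timestamp as in the excerpt above.

```python
from datetime import datetime

# Two lines copied from the log excerpt above.
LOG = """\
2012-01-12 11:31:39 DEBUG: pid 29973: starting health checking
2012-01-12 11:31:44 ERROR: pid 29973: connect_inet_domain_socket: connect() failed: Interrupted system call
"""

HEALTH_CHECK_TIMEOUT = 5  # seconds, from the settings above


def timestamp(line):
    # The first 19 characters hold "YYYY-MM-DD HH:MM:SS".
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def seconds_until_failure(log_text):
    """Seconds from the latest 'starting health checking' line
    to the first ERROR line that follows it (None if there is none)."""
    started = None
    for line in log_text.splitlines():
        if "starting health checking" in line:
            started = timestamp(line)
        elif " ERROR: " in line and started is not None:
            return (timestamp(line) - started).total_seconds()
    return None


elapsed = seconds_until_failure(LOG)
print(elapsed, elapsed == HEALTH_CHECK_TIMEOUT)
```

Applied to the log quoted further down (check started at 17:38:34,
error at 17:41:46), the same calculation gives 192 seconds against a
5 second timeout, which is the delay being reported.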

Here is the system info.
Linux localhost.localdomain 2.6.35-21vl6 #1 SMP Sun Jan 1 18:40:00 JST 2012 x86_64 x86_64 x86_64 GNU/Linux
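
To compare environments independently of pgpool, one could also time a
plain TCP connect with an explicit deadline. The following is only a
sketch in Python, not what pgpool does internally. timed_connect is a
hypothetical helper name, and the address and port in the example call
are placeholders to be replaced with the backend's values.

```python
import socket
import time


def timed_connect(host, port, timeout):
    """Try a TCP connect with a deadline; return (succeeded, elapsed_seconds)."""
    start = time.monotonic()
    try:
        # create_connection() applies `timeout` to each address it tries,
        # so for a single IP address it gives up after `timeout` seconds
        # even if no reply of any kind comes back.
        with socket.create_connection((host, port), timeout=timeout):
            ok = True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        ok = False
    return ok, time.monotonic() - start


if __name__ == "__main__":
    # Placeholder values: substitute the backend address from pgpool.conf.
    ok, elapsed = timed_connect("127.0.0.1", 5432, timeout=5)
    print("connected" if ok else "failed", "after %.1f s" % elapsed)
```

If the elapsed time stays close to the deadline after the IP alias is
removed, the deadline is being honoured on that host. If it runs far
beyond it, the blocking happens below the application.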
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Ok, I will do it. In the mean time you could use "strace -tt -p PID"
> to see which system call is blocked.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
> 
>> OK, got the info - key point is that ip forwarding is disabled for security
>> reasons. Rules in iptables are not important, iptables can be stopped, or
>> previously added rules removed.
>> 
>> Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic for
>> providing this):
>> 
>> 1.) make sure that ip forwarding is off:
>>     echo 0 > /proc/sys/net/ipv4/ip_forward
>> 2.) create IP alias on some interface (and have postgres listen on it):
>>     ip addr add x.x.x.x/yy dev ethz
>> 3.) set backend_hostname0 to aforementioned IP
>> 4.) start pgpool and monitor health checks
>> 5.) remove IP alias:
>>     ip addr del x.x.x.x/yy dev ethz
>> 
>> 
>> Here is the interesting part in pgpool log after this:
>> 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
>> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
>> 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
>> 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host
>> 192.168.2.27 at port 5432 is down
>> 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10
>> second(s)
>> 
>> That pgpool was configured with health check interval of 30sec, 5sec
>> timeout, and 10sec retry delay with 2 max retries.
>> 
>> Making use of libpq instead for connecting to db in health checks IMO
>> should resolve it, but you'll best determine which call exactly gets
>> blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured
>> respects that env var timeout.
>> 
>> Regards,
>> Stevo.
>> 
>> On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>> 
>>> Tatsuo,
>>>
>>> Did you restart iptables after adding rule?
>>>
>>> Regards,
>>> Stevo.
>>>
>>>
>>> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>>>
>>>> Looking into this to verify if these are all necessary changes to have
>>>> port unreachable message silently rejected (suspecting some kernel
>>>> parameter tuning is needed).
>>>>
>>>> Just to clarify it's not a problem that host is being detected by pgpool
>>>> to be down, but the timing when that happens. On environment where issue is
>>>> reproduced pgpool as part of health check attempt tries to connect to
>>>> backend and hangs for tcp timeout instead of being interrupted by timeout
>>>> alarm. Can you verify/confirm please the health check retry timings are not
>>>> delayed?
>>>>
>>>> Regards,
>>>> Stevo.
>>>>
>>>>
>>>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>
>>>>> Ok, I did:
>>>>>
>>>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>>>
>>>>> on the host where pgpool is running. And pull network cable from
>>>>> backend0 host network interface. Pgpool detected the host being down
>>>>> as expected...
>>>>> --
>>>>> Tatsuo Ishii
>>>>> SRA OSS, Inc. Japan
>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>> Japanese: http://www.sraoss.co.jp
>>>>>
>>>>> > Backend is not destination of this message, pgpool host is, and we
>>>>> don't
>>>>> > want it to ever get it. With command I've sent you rule will be
>>>>> created for
>>>>> > any source and destination.
>>>>> >
>>>>> > Regards,
>>>>> > Stevo.
>>>>> >
>>>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org>
>>>>> wrote:
>>>>> >
>>>>> >> I did the following:
>>>>> >>
>>>>> >> On the host where pgpool is running:
>>>>> >>
>>>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d
>>>>> >> 133.137.177.124
>>>>> >> (133.137.177.124 is the host where backend is running on)
>>>>> >>
>>>>> >> Pull network cable from backend0 host network interface. Pgpool
>>>>> >> detected the host being down as expected. Am I missing something?
>>>>> >> --
>>>>> >> Tatsuo Ishii
>>>>> >> SRA OSS, Inc. Japan
>>>>> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> Japanese: http://www.sraoss.co.jp
>>>>> >>
>>>>> >> > Hello Tatsuo,
>>>>> >> >
>>>>> >> > With backend0 on one host just configure following rule on other
>>>>> host
>>>>> >> where
>>>>> >> > pgpool is:
>>>>> >> >
>>>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>>>>> >> >
>>>>> >> > and then have pgpool startup with health checking and retrying
>>>>> >> configured,
>>>>> >> > and then pull network cable from backend0 host network interface.
>>>>> >> >
>>>>> >> > Regards,
>>>>> >> > Stevo.
>>>>> >> >
>>>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org
>>>>> >
>>>>> >> wrote:
>>>>> >> >
>>>>> >> >> I want to try to test the situation you described:
>>>>> >> >>
>>>>> >> >> >> > When system is configured for security reasons not to return
>>>>> >> >> destination
>>>>> >> >> >> > host unreachable messages, even though health_check_timeout is
>>>>> >> >>
>>>>> >> >> But I don't know how to do it. I pulled out the network cable and
>>>>> >> >> pgpool detected it as expected. Also I configured the server which
>>>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case
>>>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool
>>>>> detected
>>>>> >> >> the error as expected.
>>>>> >> >>
>>>>> >> >> Could you please instruct me?
>>>>> >> >> --
>>>>> >> >> Tatsuo Ishii
>>>>> >> >> SRA OSS, Inc. Japan
>>>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >>
>>>>> >> >> > Hello Tatsuo,
>>>>> >> >> >
>>>>> >> >> > Thank you for replying!
>>>>> >> >> >
>>>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code
>>>>> analysis I
>>>>> >> >> > suspect it is the part where a connection is made to the db and
>>>>> it
>>>>> >> >> doesn't
>>>>> >> >> > seem to get interrupted by alarm. Tested thoroughly health check
>>>>> >> >> behaviour,
>>>>> >> >> > it works really well when host/ip is there and just
>>>>> backend/postgres
>>>>> >> is
>>>>> >> >> > down, but not when backend host/ip is down. I could see in log
>>>>> that
>>>>> >> >> initial
>>>>> >> >> > health check and each retry got delayed when host/ip is not
>>>>> reachable,
>>>>> >> >> > while when just backend is not listening (is down) on the
>>>>> reachable
>>>>> >> >> host/ip
>>>>> >> >> > then initial health check and all retries are exact to the
>>>>> settings in
>>>>> >> >> > pgpool.conf.
>>>>> >> >> >
>>>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment
>>>>> variables
>>>>> >> in
>>>>> >> >> > the docs (see
>>>>> >> >> http://www.postgresql.org/docs/9.1/static/libpq-envars.html )
>>>>> >> >> > There is equivalent parameter in libpq PGconnectdbParams ( see
>>>>> >> >> >
>>>>> >> >>
>>>>> >>
>>>>> http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT
>>>>> >> >> )
>>>>> >> >> > At the beginning of that same page there are some important
>>>>> infos on
>>>>> >> >> using
>>>>> >> >> > these functions.
>>>>> >> >> >
>>>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>>>>> >> >> >
>>>>> >> >> > Regards,
>>>>> >> >> > Stevo.
>>>>> >> >> >
>>>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <
>>>>> ishii at postgresql.org>
>>>>> >> >> wrote:
>>>>> >> >> >
>>>>> >> >> >> > Hello pgpool community,
>>>>> >> >> >> >
>>>>> >> >> >> > When system is configured for security reasons not to return
>>>>> >> >> destination
>>>>> >> >> >> > host unreachable messages, even though health_check_timeout is
>>>>> >> >> >> configured,
>>>>> >> >> >> > socket call will block and alarm will not get raised until TCP
>>>>> >> timeout
>>>>> >> >> >> > occurs.
>>>>> >> >> >>
>>>>> >> >> >> Interesting. So are you saying that read(2) cannot be
>>>>> interrupted by
>>>>> >> >> >> alarm signal if the system is configured not to return
>>>>> destination
>>>>> >> >> >> host unreachable message? Could you please guide me where I can
>>>>> get
>>>>> >> >> >> such that info? (I'm not a network expert).
>>>>> >> >> >>
>>>>> >> >> >> > Not a C programmer, found some info that select call could be
>>>>> >> replace
>>>>> >> >> >> with
>>>>> >> >> >> > select/pselect calls. Maybe it would be best if
>>>>> PGCONNECT_TIMEOUT
>>>>> >> >> value
>>>>> >> >> >> > could be used here for connection timeout. pgpool has libpq as
>>>>> >> >> >> dependency,
>>>>> >> >> >> > why isn't it using libpq for the healthcheck db connect
>>>>> calls, then
>>>>> >> >> >> > PGCONNECT_TIMEOUT would be applied?
>>>>> >> >> >>
>>>>> >> >> >> I don't think libpq uses select/pselect for establishing
>>>>> connection,
>>>>> >> >> >> but using libpq instead of homebrew code seems to be an idea.
>>>>> Let me
>>>>> >> >> >> think about it.
>>>>> >> >> >>
>>>>> >> >> >> One question. Are you sure that libpq can deal with the case
>>>>> (not to
>>>>> >> >> >> return destination host unreachable messages) by using
>>>>> >> >> >> PGCONNECT_TIMEOUT?
>>>>> >> >> >> --
>>>>> >> >> >> Tatsuo Ishii
>>>>> >> >> >> SRA OSS, Inc. Japan
>>>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
>>>>> >> >> >> Japanese: http://www.sraoss.co.jp
>>>>> >> >> >>
>>>>> >> >>
>>>>> >>
>>>>>
>>>>
>>>>
>>>