[pgpool-general: 147] Re: Healthcheck timeout not always respected

Thu Jan 12 17:16:18 JST 2012

Attached are the log file and the strace output file. The configured timings
are a 30sec health check interval, a 5sec timeout, and 2 retries with a 10sec
retry delay.

It takes far longer than 5sec from the start of a health check until pgpool
sleeps 10sec before the first retry.

As seen in the code (main.x, health_check() function), each (retry) attempt
contains an inner retry (first with the postgres database, then with
template1), and that part doesn't seem to be interrupted by the alarm.
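
For illustration only: one common way to bound a TCP connect, whether or not
any ICMP reply ever arrives, is a non-blocking connect() followed by poll()
with a deadline. The sketch below is not pgpool code; connect_with_timeout is
a hypothetical helper name, and it assumes an IPv4 literal address.

```c
#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper (not part of pgpool): connect to an IPv4 address,
 * waiting at most timeout_ms for the handshake to finish.
 * Returns a connected, blocking socket, or -1 with errno set
 * (ETIMEDOUT if the wait expired).
 */
int connect_with_timeout(const char *ipv4, int port, int timeout_ms)
{
    struct sockaddr_in addr;
    int fd, flags, err = 0;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons((unsigned short) port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        errno = EINVAL;              /* not a valid IPv4 literal */
        return -1;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Switch to non-blocking mode so connect() returns immediately. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        err = errno;
    } else if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
        if (errno != EINPROGRESS) {
            err = errno;             /* immediate failure, e.g. refused */
        } else {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };
            int rc;

            /* Note: on EINTR this restarts the full wait; a stricter
             * version would track the remaining time. */
            do {
                rc = poll(&pfd, 1, timeout_ms);
            } while (rc < 0 && errno == EINTR);

            if (rc == 0) {
                err = ETIMEDOUT;     /* nothing came back in time */
            } else if (rc < 0) {
                err = errno;
            } else {
                /* The socket is ready: fetch the result of the connect. */
                socklen_t len = sizeof(err);
                if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0)
                    err = errno;
            }
        }
    }

    /* On success, restore the original (blocking) flags. */
    if (err == 0 && fcntl(fd, F_SETFL, flags) < 0)
        err = errno;

    if (err != 0) {
        close(fd);
        errno = err;
        return -1;
    }
    return fd;
}
```

With the settings above, a call such as
connect_with_timeout("192.168.2.27", 5432, 5000) would give up after about
5sec instead of waiting for the TCP timeout. This only bounds the connect
step; the reads and writes that follow would need their own limit.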

Regards,
Stevo.

2012/1/11 Tatsuo Ishii <ishii at postgresql.org>

> Ok, I will do it. In the mean time you could use "strace -tt -p PID"
> to see which system call is blocked.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > OK, got the info - key point is that ip forwarding is disabled for
> > security reasons. Rules in iptables are not important, iptables can be
> > stopped, or previously added rules removed.
> >
> > Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic
> > for providing this):
> >
> > 1.) make sure that ip forwarding is off:
> >     echo 0 > /proc/sys/net/ipv4/ip_forward
> > 2.) create IP alias on some interface (and have postgres listen on it):
> >     ip addr add x.x.x.x/yy dev ethz
> > 3.) set backend_hostname0 to aforementioned IP
> > 4.) start pgpool and monitor health checks
> > 5.) remove IP alias:
> >     ip addr del x.x.x.x/yy dev ethz
> >
> >
> > Here is the interesting part in pgpool log after this:
> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> > 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host 192.168.2.27 at port 5432 is down
> > 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10 second(s)
> >
> > That pgpool was configured with a health check interval of 30sec, a 5sec
> > timeout, and a 10sec retry delay with 2 max retries.
> >
> > IMO using libpq to connect to the db in health checks should resolve
> > it, but you are best placed to determine exactly which call gets
> > blocked waiting. Btw, psql with the PGCONNECT_TIMEOUT env var configured
> > respects that timeout.
> >
> > Regards,
> > Stevo.
> >
> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
> >
> >> Tatsuo,
> >>
> >> Did you restart iptables after adding rule?
> >>
> >> Regards,
> >> Stevo.
> >>
> >>
> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
> >>
> >>> Looking into this to verify whether these are all the changes needed to
> >>> have the port unreachable message silently rejected (I suspect some
> >>> kernel parameter tuning is needed).
> >>>
> >>> Just to clarify: the problem is not that pgpool detects the host as
> >>> down, but the timing of when that happens. In the environment where the
> >>> issue is reproduced, pgpool tries to connect to the backend as part of a
> >>> health check attempt and hangs for the TCP timeout instead of being
> >>> interrupted by the timeout alarm. Can you please confirm that the
> >>> health check retry timings are not delayed?
> >>>
> >>> Regards,
> >>> Stevo.
> >>>
> >>>
> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>>
> >>>> Ok, I did:
> >>>>
> >>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
> >>>>
> >>>> on the host where pgpool is running, and pulled the network cable from
> >>>> the backend0 host network interface. Pgpool detected the host being
> >>>> down as expected...
> >>>> --
> >>>> Tatsuo Ishii
> >>>> SRA OSS, Inc. Japan
> >>>> English: http://www.sraoss.co.jp/index_en.php
> >>>> Japanese: http://www.sraoss.co.jp
> >>>>
> >>>> > Backend is not destination of this message, pgpool host is, and we
> >>>> > don't want it to ever get it. With command I've sent you rule will
> >>>> > be created for any source and destination.
> >>>> >
> >>>> > Regards,
> >>>> > Stevo.
> >>>> >
> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>>> >
> >>>> >> I did the following on the host where pgpool is running:
> >>>> >>
> >>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d 133.137.177.124
> >>>> >> (133.137.177.124 is the host where the backend is running)
> >>>> >>
> >>>> >> Then I pulled the network cable from the backend0 host network
> >>>> >> interface. Pgpool detected the host being down as expected. Am I
> >>>> >> missing something?
> >>>> >> --
> >>>> >> Tatsuo Ishii
> >>>> >> SRA OSS, Inc. Japan
> >>>> >> English: http://www.sraoss.co.jp/index_en.php
> >>>> >> Japanese: http://www.sraoss.co.jp
> >>>> >>
> >>>> >> > Hello Tatsuo,
> >>>> >> >
> >>>> >> > With backend0 on one host just configure following rule on other
> >>>> >> > host where pgpool is:
> >>>> >> >
> >>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
> >>>> >> >
> >>>> >> > and then have pgpool startup with health checking and retrying
> >>>> >> > configured, and then pull network cable from backend0 host network
> >>>> >> > interface.
> >>>> >> >
> >>>> >> > Regards,
> >>>> >> > Stevo.
> >>>> >> >
> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>>> >> >
> >>>> >> >> I want to try to test the situation you described:
> >>>> >> >>
> >>>> >> >> >> > When system is configured for security reasons not to return
> >>>> >> >> >> > destination host unreachable messages, even though
> >>>> >> >> >> > health_check_timeout is
> >>>> >> >>
> >>>> >> >> But I don't know how to do it. I pulled out the network cable and
> >>>> >> >> pgpool detected it as expected. Also I configured the server which
> >>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case
> >>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool
> >>>> >> >> detected the error as expected.
> >>>> >> >>
> >>>> >> >> Could you please instruct me?
> >>>> >> >> --
> >>>> >> >> Tatsuo Ishii
> >>>> >> >> SRA OSS, Inc. Japan
> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
> >>>> >> >> Japanese: http://www.sraoss.co.jp
> >>>> >> >>
> >>>> >> >> > Hello Tatsuo,
> >>>> >> >> >
> >>>> >> >> > Thank you for replying!
> >>>> >> >> >
> >>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code
> >>>> >> >> > analysis I suspect it is the part where a connection is made to
> >>>> >> >> > the db and it doesn't seem to get interrupted by alarm. Tested
> >>>> >> >> > thoroughly health check behaviour, it works really well when
> >>>> >> >> > host/ip is there and just backend/postgres is down, but not when
> >>>> >> >> > backend host/ip is down. I could see in log that initial health
> >>>> >> >> > check and each retry got delayed when host/ip is not reachable,
> >>>> >> >> > while when just backend is not listening (is down) on the
> >>>> >> >> > reachable host/ip then initial health check and all retries are
> >>>> >> >> > exact to the settings in pgpool.conf.
> >>>> >> >> >
> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment
> >>>> >> >> > variables in the docs (see
> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-envars.html )
> >>>> >> >> > There is an equivalent parameter in libpq PQconnectdbParams ( see
> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT )
> >>>> >> >> > At the beginning of that same page there is some important
> >>>> >> >> > info on using these functions.
> >>>> >> >> >
> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
> >>>> >> >> >
> >>>> >> >> > Regards,
> >>>> >> >> > Stevo.
> >>>> >> >> >
> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>>> >> >> >
> >>>> >> >> >> > Hello pgpool community,
> >>>> >> >> >> >
> >>>> >> >> >> > When system is configured for security reasons not to return
> >>>> >> >> >> > destination host unreachable messages, even though
> >>>> >> >> >> > health_check_timeout is configured, socket call will block and
> >>>> >> >> >> > alarm will not get raised until TCP timeout occurs.
> >>>> >> >> >>
> >>>> >> >> >> Interesting. So are you saying that read(2) cannot be
> >>>> >> >> >> interrupted by alarm signal if the system is configured not to
> >>>> >> >> >> return destination host unreachable message? Could you please
> >>>> >> >> >> guide me to where I can get such info? (I'm not a network
> >>>> >> >> >> expert).
> >>>> >> >> >>
> >>>> >> >> >> > Not a C programmer, found some info that select call could be
> >>>> >> >> >> > replaced with select/pselect calls. Maybe it would be best if
> >>>> >> >> >> > PGCONNECT_TIMEOUT value could be used here for connection
> >>>> >> >> >> > timeout. pgpool has libpq as dependency, why isn't it using
> >>>> >> >> >> > libpq for the healthcheck db connect calls, then
> >>>> >> >> >> > PGCONNECT_TIMEOUT would be applied?
> >>>> >> >> >>
> >>>> >> >> >> I don't think libpq uses select/pselect for establishing
> >>>> >> >> >> connection, but using libpq instead of homebrew code seems to be
> >>>> >> >> >> an idea. Let me think about it.
> >>>> >> >> >>
> >>>> >> >> >> One question. Are you sure that libpq can deal with the case
> >>>> >> >> >> (not to return destination host unreachable messages) by using
> >>>> >> >> >> PGCONNECT_TIMEOUT?
> >>>> >> >> >> --
> >>>> >> >> >> Tatsuo Ishii
> >>>> >> >> >> SRA OSS, Inc. Japan
> >>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool-II-91.log
Type: text/x-log
Size: 53354 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20120112/80e926a3/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: strace.out
Type: application/octet-stream
Size: 125313 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20120112/80e926a3/attachment.obj>