[pgpool-general: 149] Re: Healthcheck timeout not always respected

Thu Jan 12 18:00:40 JST 2012

pgpool is being used in raw mode, just for the (health check based)
failover part, so that applications do not need to be restarted when the
standby gets promoted to the new master. Attached are the pgpool.conf file
and a very small patch that we apply to the pgpool 3.1.1 release.

We have to set DISALLOW_TO_FAILOVER for the backend. Without it, any child
process that detects that master/backend0 is not available will degenerate
the backend without giving the health check a chance to retry. We need the
health check to retry because backend0 may be unavailable only temporarily:
a network glitch to the remote site where the master is, or a deliberate
failover of the master postgres service from one node to another on the
remote site (in both cases remote means remote to the pgpool that performs
the health checks and ultimately the failover). We don't want the standby
to be promoted to a new master that easily, because temporary network
conditions, which could occur frequently, would then frequently cause split
brain with two masters.

But with DISALLOW_TO_FAILOVER set and without the patch, the health check
will not retry, so it gives the backend only one chance (if a health check
ever occurs before a child process fails to connect to the backend), and
the retry settings are effectively ignored. That's where this patch comes
in: it enables health check retries while child processes are prevented
from degenerating the backend.
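
As a rough illustration only (this is not the attached patch, and pgpool
itself is written in C), the retry behaviour described above could be
sketched like this in Python. The names health_check_with_retries,
check_once, max_retries, retry_delay and worst_case_seconds are made up for
the sketch; they only mirror the retry count, retry delay and timeout
settings mentioned in this thread.

```python
import time
from typing import Callable


def health_check_with_retries(
    check_once: Callable[[], bool],
    max_retries: int = 2,
    retry_delay: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the backend answered, False once all retries are used up.

    check_once is a hypothetical callable that performs a single health
    check attempt (bounded by its own timeout) and returns True on success.
    """
    attempts = max_retries + 1  # the initial attempt plus the retries
    for attempt in range(attempts):
        if check_once():
            return True
        if attempt < attempts - 1:
            sleep(retry_delay)  # wait before the next retry
    return False  # only at this point would the backend be degenerated


def worst_case_seconds(timeout: float, max_retries: int, retry_delay: float) -> float:
    """Upper bound on detection time if every attempt honours its timeout."""
    return (max_retries + 1) * timeout + max_retries * retry_delay
```

With a 5sec timeout, 2 retries and a 10sec retry delay, that bound is
3 * 5 + 2 * 10 = 35 seconds, provided each attempt really stops at its
timeout.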

I don't think, but I could be wrong, that this patch influences the
behavior we're seeing with the unwanted delays in health check attempts.
Also, knowing this, maybe pgpool could be patched, or some other support
built into it, to cover this use case.
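
To make the timing expectation concrete: a health check attempt should be a
connect that gives up after the configured timeout, whether or not the
network ever answers. A minimal sketch of that idea in Python follows. It is
not pgpool code; the function name can_connect is invented for this example,
and the 5 second default is simply the timeout from the configuration
discussed in this thread.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Try one TCP connect and give up after `timeout` seconds.

    socket.create_connection applies the timeout to each connect attempt,
    so a silently dropped SYN cannot keep the caller waiting for the
    kernel's much longer TCP timeout.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers timeouts, refused and unreachable errors
        return False
```

A caller would run one such bounded attempt per health check try, for
example can_connect("192.168.2.27", 5432, timeout=5.0).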

Regards,
Stevo.

2012/1/12 Tatsuo Ishii <ishii at postgresql.org>

> I have accepted the moderation request. Your post should be sent shortly.
> Also I have raised the post size limit to 1MB.
> I will look into this...
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Here is the log file and strace output file (this time in an archive,
> > didn't know about 200KB constraint on post size which requires moderator
> > approval). Timings configured are 30sec health check interval, 5sec
> > timeout, and 2 retries with 10sec retry delay.
> >
> > It takes a lot more than 5sec from started health check to sleeping 10sec
> > for first retry.
> >
> > Seen in code (main.c, health_check() function), within (retry) attempt
> > there is inner retry (first with postgres database then with template1)
> > and that part doesn't seem to be interrupted by alarm.
> >
> > Regards,
> > Stevo.
> >
> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
> >
> >> Here is the log file and strace output file. Timings configured are
> >> 30sec health check interval, 5sec timeout, and 2 retries with 10sec
> >> retry delay.
> >>
> >> It takes a lot more than 5sec from started health check to sleeping
> >> 10sec for first retry.
> >>
> >> Seen in code (main.c, health_check() function), within (retry) attempt
> >> there is inner retry (first with postgres database then with template1)
> >> and that part doesn't seem to be interrupted by alarm.
> >>
> >> Regards,
> >> Stevo.
> >>
> >>
> >> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
> >>
> >>> Ok, I will do it. In the meantime you could use "strace -tt -p PID"
> >>> to see which system call is blocked.
> >>> --
> >>> Tatsuo Ishii
> >>> SRA OSS, Inc. Japan
> >>> English: http://www.sraoss.co.jp/index_en.php
> >>> Japanese: http://www.sraoss.co.jp
> >>>
> >>> > OK, got the info - key point is that ip forwarding is disabled for
> >>> > security reasons. Rules in iptables are not important, iptables can
> >>> > be stopped, or previously added rules removed.
> >>> >
> >>> > Here are the steps to reproduce (kudos to my colleague Nenad
> >>> > Bulatovic for providing this):
> >>> >
> >>> > 1.) make sure that ip forwarding is off:
> >>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
> >>> > 2.) create IP alias on some interface (and have postgres listen on it):
> >>> >     ip addr add x.x.x.x/yy dev ethz
> >>> > 3.) set backend_hostname0 to aforementioned IP
> >>> > 4.) start pgpool and monitor health checks
> >>> > 5.) remove IP alias:
> >>> >     ip addr del x.x.x.x/yy dev ethz
> >>> >
> >>> >
> >>> > Here is the interesting part in pgpool log after this:
> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
> >>> > 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host 192.168.2.27 at port 5432 is down
> >>> > 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10 second(s)
> >>> >
> >>> > That pgpool was configured with health check interval of 30sec, 5sec
> >>> > timeout, and 10sec retry delay with 2 max retries.
> >>> >
> >>> > Making use of libpq instead for connecting to db in health checks IMO
> >>> > should resolve it, but you'll best determine which call exactly gets
> >>> > blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured
> >>> > respects that env var timeout.
> >>> >
> >>> > Regards,
> >>> > Stevo.
> >>> >
> >>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
> >>> >
> >>> >> Tatsuo,
> >>> >>
> >>> >> Did you restart iptables after adding rule?
> >>> >>
> >>> >> Regards,
> >>> >> Stevo.
> >>> >>
> >>> >>
> >>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
> >>> >>
> >>> >>> Looking into this to verify if these are all necessary changes to
> >>> >>> have port unreachable message silently rejected (suspecting some
> >>> >>> kernel parameter tuning is needed).
> >>> >>>
> >>> >>> Just to clarify it's not a problem that host is being detected by
> >>> >>> pgpool to be down, but the timing when that happens. On environment
> >>> >>> where issue is reproduced pgpool as part of health check attempt
> >>> >>> tries to connect to backend and hangs for tcp timeout instead of
> >>> >>> being interrupted by timeout alarm. Can you verify/confirm please
> >>> >>> the health check retry timings are not delayed?
> >>> >>>
> >>> >>> Regards,
> >>> >>> Stevo.
> >>> >>>
> >>> >>>
> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>> >>>
> >>> >>>> Ok, I did:
> >>> >>>>
> >>> >>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
> >>> >>>>
> >>> >>>> on the host where pgpool is running. And pull network cable from
> >>> >>>> backend0 host network interface. Pgpool detected the host being
> >>> >>>> down as expected...
> >>> >>>> --
> >>> >>>> Tatsuo Ishii
> >>> >>>> SRA OSS, Inc. Japan
> >>> >>>> English: http://www.sraoss.co.jp/index_en.php
> >>> >>>> Japanese: http://www.sraoss.co.jp
> >>> >>>>
> >>> >>>> > Backend is not destination of this message, pgpool host is, and
> >>> >>>> > we don't want it to ever get it. With command I've sent you rule
> >>> >>>> > will be created for any source and destination.
> >>> >>>> >
> >>> >>>> > Regards,
> >>> >>>> > Stevo.
> >>> >>>> >
> >>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>> >>>> >
> >>> >>>> >> I did following:
> >>> >>>> >>
> >>> >>>> >> Do following on the host where pgpool is running on:
> >>> >>>> >>
> >>> >>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d 133.137.177.124
> >>> >>>> >> (133.137.177.124 is the host where backend is running on)
> >>> >>>> >>
> >>> >>>> >> Pull network cable from backend0 host network interface. Pgpool
> >>> >>>> >> detected the host being down as expected. Am I missing
> >>> >>>> >> something?
> >>> >>>> >> --
> >>> >>>> >> Tatsuo Ishii
> >>> >>>> >> SRA OSS, Inc. Japan
> >>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
> >>> >>>> >> Japanese: http://www.sraoss.co.jp
> >>> >>>> >>
> >>> >>>> >> > Hello Tatsuo,
> >>> >>>> >> >
> >>> >>>> >> > With backend0 on one host just configure following rule on
> >>> >>>> >> > other host where pgpool is:
> >>> >>>> >> >
> >>> >>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
> >>> >>>> >> >
> >>> >>>> >> > and then have pgpool startup with health checking and
> >>> >>>> >> > retrying configured, and then pull network cable from backend0
> >>> >>>> >> > host network interface.
> >>> >>>> >> >
> >>> >>>> >> > Regards,
> >>> >>>> >> > Stevo.
> >>> >>>> >> >
> >>> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>> >>>> >> >
> >>> >>>> >> >> I want to try to test the situation you described:
> >>> >>>> >> >>
> >>> >>>> >> >> >> > When system is configured for security reasons not to
> >>> >>>> >> >> >> > return destination host unreachable messages, even
> >>> >>>> >> >> >> > though health_check_timeout is
> >>> >>>> >> >>
> >>> >>>> >> >> But I don't know how to do it. I pulled out the network
> >>> >>>> >> >> cable and pgpool detected it as expected. Also I configured
> >>> >>>> >> >> the server which PostgreSQL is running on to disable the 5432
> >>> >>>> >> >> port. In this case connect(2) returned EHOSTUNREACH (No route
> >>> >>>> >> >> to host) so pgpool detected the error as expected.
> >>> >>>> >> >>
> >>> >>>> >> >> Could you please instruct me?
> >>> >>>> >> >> --
> >>> >>>> >> >> Tatsuo Ishii
> >>> >>>> >> >> SRA OSS, Inc. Japan
> >>> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
> >>> >>>> >> >> Japanese: http://www.sraoss.co.jp
> >>> >>>> >> >>
> >>> >>>> >> >> > Hello Tatsuo,
> >>> >>>> >> >> >
> >>> >>>> >> >> > Thank you for replying!
> >>> >>>> >> >> >
> >>> >>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code
> >>> >>>> >> >> > analysis I suspect it is the part where a connection is
> >>> >>>> >> >> > made to the db and it doesn't seem to get interrupted by
> >>> >>>> >> >> > alarm. Tested thoroughly health check behaviour, it works
> >>> >>>> >> >> > really well when host/ip is there and just backend/postgres
> >>> >>>> >> >> > is down, but not when backend host/ip is down. I could see
> >>> >>>> >> >> > in log that initial health check and each retry got delayed
> >>> >>>> >> >> > when host/ip is not reachable, while when just backend is
> >>> >>>> >> >> > not listening (is down) on the reachable host/ip then
> >>> >>>> >> >> > initial health check and all retries are exact to the
> >>> >>>> >> >> > settings in pgpool.conf.
> >>> >>>> >> >> >
> >>> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment
> >>> >>>> >> >> > variables in the docs (see
> >>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-envars.html).
> >>> >>>> >> >> > There is equivalent parameter in libpq PQconnectdbParams (see
> >>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT).
> >>> >>>> >> >> > At the beginning of that same page there are some important
> >>> >>>> >> >> > infos on using these functions.
> >>> >>>> >> >> >
> >>> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
> >>> >>>> >> >> >
> >>> >>>> >> >> > Regards,
> >>> >>>> >> >> > Stevo.
> >>> >>>> >> >> >
> >>> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>> >>>> >> >> >
> >>> >>>> >> >> >> > Hello pgpool community,
> >>> >>>> >> >> >> >
> >>> >>>> >> >> >> > When system is configured for security reasons not to
> >>> >>>> >> >> >> > return destination host unreachable messages, even
> >>> >>>> >> >> >> > though health_check_timeout is configured, socket call
> >>> >>>> >> >> >> > will block and alarm will not get raised until TCP
> >>> >>>> >> >> >> > timeout occurs.
> >>> >>>> >> >> >>
> >>> >>>> >> >> >> Interesting. So are you saying that read(2) cannot be
> >>> >>>> >> >> >> interrupted by alarm signal if the system is configured not
> >>> >>>> >> >> >> to return destination host unreachable message? Could you
> >>> >>>> >> >> >> please guide me where I can get such info? (I'm not a
> >>> >>>> >> >> >> network expert).
> >>> >>>> >> >> >>
> >>> >>>> >> >> >> > Not a C programmer, found some info that select call
> >>> >>>> >> >> >> > could be replaced with select/pselect calls. Maybe it
> >>> >>>> >> >> >> > would be best if PGCONNECT_TIMEOUT value could be used
> >>> >>>> >> >> >> > here for connection timeout. pgpool has libpq as
> >>> >>>> >> >> >> > dependency, why isn't it using libpq for the healthcheck
> >>> >>>> >> >> >> > db connect calls, then PGCONNECT_TIMEOUT would be applied?
> >>> >>>> >> >> >>
> >>> >>>> >> >> >> I don't think libpq uses select/pselect for establishing
> >>> >>>> >> >> >> connection, but using libpq instead of homebrew code seems
> >>> >>>> >> >> >> to be an idea. Let me think about it.
> >>> >>>> >> >> >>
> >>> >>>> >> >> >> One question. Are you sure that libpq can deal with the
> >>> >>>> >> >> >> case (not to return destination host unreachable messages)
> >>> >>>> >> >> >> by using PGCONNECT_TIMEOUT?
> >>> >>>> >> >> >> --
> >>> >>>> >> >> >> Tatsuo Ishii
> >>> >>>> >> >> >> SRA OSS, Inc. Japan
> >>> >>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
> >>> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: conf-n-patch.zip
Type: application/zip
Size: 1753 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20120112/bbd450ba/attachment.zip>