[pgpool-general: 150] Re: Healthcheck timeout not always respected

Fri Jan 13 12:58:49 JST 2012

Thanks for pointing it out.
Yes, checking DISALLOW_TO_FAILOVER before retrying is wrong.
However, once the retry count is exhausted, I think we should check DISALLOW_TO_FAILOVER.
Attached is a patch that attempts to fix it. Please try it.
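
To make the intended order concrete, here is a minimal sketch in Python. It is not pgpool's C code and not the attached patch; the function name, arguments, and return strings are hypothetical and chosen only for illustration. The health check retries regardless of the flag, and DISALLOW_TO_FAILOVER is consulted only after the retries are used up.

```python
from typing import Callable


def run_health_check(check: Callable[[], bool],
                     max_retries: int,
                     disallow_to_failover: bool,
                     retry_delay: float = 10.0,
                     sleep: Callable[[float], None] = lambda seconds: None) -> str:
    """Illustrative only. Returns 'up', 'degenerate', or 'down_no_failover'."""
    retries_done = 0
    while True:
        if check():
            return "up"
        if retries_done >= max_retries:
            break  # retry count is over
        retries_done += 1
        sleep(retry_delay)  # pass time.sleep here for real delays
    # Only now is the flag consulted: it blocks the failover, not the retries.
    return "down_no_failover" if disallow_to_failover else "degenerate"
```

With 2 retries configured, the backend gets three attempts in total before the flag decides whether it is degenerated.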
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> pgpool is being used in raw mode - just for (health check based) failover
> part, so applications are not required to restart when standby gets
> promoted to new master. Here is pgpool.conf file and a very small patch
> we're using applied to pgpool 3.1.1 release.
> 
> We have to have DISALLOW_TO_FAILOVER set for the backend since any child
> process that detects condition that master/backend0 is not available, if
> DISALLOW_TO_FAILOVER was not set, will degenerate backend without giving
> health check a chance to retry. We need health check with retries because
> condition that backend0 is not available could be temporary (network
> glitches to the remote site where master is, or deliberate failover of
> master postgres service from one node to the other on remote site - in both
> cases remote means remote to the pgpool that is going to perform health
> checks and ultimately the failover), and we don't want the standby to be
> promoted to new master that easily: temporary network conditions, which
> could occur frequently, would then frequently cause split brain with two
> masters.
> 
> But then, with DISALLOW_TO_FAILOVER set and without the patch, health check
> will not retry and will thus give the backend only one chance (if a health
> check ever occurs before a child process fails to connect to the backend),
> so the retry settings are effectively ignored. That's where this patch
> comes in: it enables health check retries while child processes are
> prevented from degenerating the backend.
> 
> I don't think, but I could be wrong, that this patch influences the
> behavior we're seeing with unwanted health check attempt delays. Also,
> knowing this, maybe pgpool could be patched or some other support be built
> into it to cover this use case.
> 
> Regards,
> Stevo.
> 
> 
> 2012/1/12 Tatsuo Ishii <ishii at postgresql.org>
> 
>> I have accepted the moderation request. Your post should be sent shortly.
>> Also I have raised the post size limit to 1MB.
>> I will look into this...
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Here is the log file and strace output file (this time in an archive,
>> > didn't know about 200KB constraint on post size which requires moderator
>> > approval). Timings configured are 30sec health check interval, 5sec
>> > timeout, and 2 retries with 10sec retry delay.
>> >
>> > It takes a lot more than 5sec from started health check to sleeping 10sec
>> > for first retry.
>> >
>> > Seen in code (main.c, health_check() function), within (retry) attempt
>> > there is inner retry (first with postgres database then with template1)
>> > and that part doesn't seem to be interrupted by alarm.
>> >
>> > Regards,
>> > Stevo.
>> >
>> > 2012/1/12 Stevo Slavić <sslavic at gmail.com>
>> >
>> >> Here is the log file and strace output file. Timings configured are
>> >> 30sec health check interval, 5sec timeout, and 2 retries with 10sec
>> >> retry delay.
>> >>
>> >> It takes a lot more than 5sec from started health check to sleeping
>> >> 10sec for first retry.
>> >>
>> >> Seen in code (main.c, health_check() function), within (retry) attempt
>> >> there is inner retry (first with postgres database then with template1)
>> >> and that part doesn't seem to be interrupted by alarm.
>> >>
>> >> Regards,
>> >> Stevo.
>> >>
>> >>
>> >> 2012/1/11 Tatsuo Ishii <ishii at postgresql.org>
>> >>
>> >>> Ok, I will do it. In the mean time you could use "strace -tt -p PID"
>> >>> to see which system call is blocked.
>> >>> --
>> >>> Tatsuo Ishii
>> >>> SRA OSS, Inc. Japan
>> >>> English: http://www.sraoss.co.jp/index_en.php
>> >>> Japanese: http://www.sraoss.co.jp
>> >>>
>> >>> > OK, got the info - key point is that ip forwarding is disabled for
>> >>> > security reasons. Rules in iptables are not important, iptables can be
>> >>> > stopped, or previously added rules removed.
>> >>> >
>> >>> > Here are the steps to reproduce (kudos to my colleague Nenad Bulatovic
>> >>> > for providing this):
>> >>> >
>> >>> > 1.) make sure that ip forwarding is off:
>> >>> >     echo 0 > /proc/sys/net/ipv4/ip_forward
>> >>> > 2.) create IP alias on some interface (and have postgres listen on it):
>> >>> >     ip addr add x.x.x.x/yy dev ethz
>> >>> > 3.) set backend_hostname0 to aforementioned IP
>> >>> > 4.) start pgpool and monitor health checks
>> >>> > 5.) remove IP alias:
>> >>> >     ip addr del x.x.x.x/yy dev ethz
>> >>> >
>> >>> >
>> >>> > Here is the interesting part in pgpool log after this:
>> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: starting health checking
>> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> >>> > 2012-01-11 17:38:04 DEBUG: pid 24358: health_check: 1 th DB node status: 1
>> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: starting health checking
>> >>> > 2012-01-11 17:38:34 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> >>> > 2012-01-11 17:41:43 DEBUG: pid 24358: health_check: 0 th DB node status: 2
>> >>> > 2012-01-11 17:41:46 ERROR: pid 24358: health check failed. 0 th host 192.168.2.27 at port 5432 is down
>> >>> > 2012-01-11 17:41:46 LOG:   pid 24358: health check retry sleep time: 10 second(s)
>> >>> >
>> >>> > That pgpool was configured with health check interval of 30sec, 5sec
>> >>> > timeout, and 10sec retry delay with 2 max retries.
>> >>> >
>> >>> > Making use of libpq instead for connecting to db in health checks IMO
>> >>> > should resolve it, but you'll best determine which call exactly gets
>> >>> > blocked waiting. Btw, psql with PGCONNECT_TIMEOUT env var configured
>> >>> > respects that env var timeout.
>> >>> >
>> >>> > Regards,
>> >>> > Stevo.
>> >>> >
>> >>> > On Wed, Jan 11, 2012 at 11:15 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>> >>> >
>> >>> >> Tatsuo,
>> >>> >>
>> >>> >> Did you restart iptables after adding rule?
>> >>> >>
>> >>> >> Regards,
>> >>> >> Stevo.
>> >>> >>
>> >>> >>
>> >>> >> On Wed, Jan 11, 2012 at 11:12 AM, Stevo Slavić <sslavic at gmail.com> wrote:
>> >>> >>
>> >>> >>> Looking into this to verify if these are all necessary changes to have
>> >>> >>> port unreachable message silently rejected (suspecting some kernel
>> >>> >>> parameter tuning is needed).
>> >>> >>>
>> >>> >>> Just to clarify it's not a problem that host is being detected by pgpool
>> >>> >>> to be down, but the timing when that happens. On environment where issue
>> >>> >>> is reproduced pgpool as part of health check attempt tries to connect to
>> >>> >>> backend and hangs for tcp timeout instead of being interrupted by timeout
>> >>> >>> alarm. Can you verify/confirm please the health check retry timings are
>> >>> >>> not delayed?
>> >>> >>>
>> >>> >>> Regards,
>> >>> >>> Stevo.
>> >>> >>>
>> >>> >>>
>> >>> >>> On Wed, Jan 11, 2012 at 10:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>> >>>
>> >>> >>>> Ok, I did:
>> >>> >>>>
>> >>> >>>> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>> >>> >>>>
>> >>> >>>> on the host where pgpool is running, and pulled the network cable from
>> >>> >>>> the backend0 host's network interface. Pgpool detected the host being
>> >>> >>>> down as expected...
>> >>> >>>> --
>> >>> >>>> Tatsuo Ishii
>> >>> >>>> SRA OSS, Inc. Japan
>> >>> >>>> English: http://www.sraoss.co.jp/index_en.php
>> >>> >>>> Japanese: http://www.sraoss.co.jp
>> >>> >>>>
>> >>> >>>> > Backend is not destination of this message, pgpool host is, and we
>> >>> >>>> > don't want it to ever get it. With command I've sent you rule will be
>> >>> >>>> > created for any source and destination.
>> >>> >>>> >
>> >>> >>>> > Regards,
>> >>> >>>> > Stevo.
>> >>> >>>> >
>> >>> >>>> > On Wed, Jan 11, 2012 at 10:38 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>> >>>> >
>> >>> >>>> >> I did the following on the host where pgpool is running:
>> >>> >>>> >>
>> >>> >>>> >> # iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable -d 133.137.177.124
>> >>> >>>> >> (133.137.177.124 is the host where the backend is running)
>> >>> >>>> >>
>> >>> >>>> >> Then I pulled the network cable from the backend0 host's network
>> >>> >>>> >> interface. Pgpool detected the host being down as expected. Am I
>> >>> >>>> >> missing something?
>> >>> >>>> >> --
>> >>> >>>> >> Tatsuo Ishii
>> >>> >>>> >> SRA OSS, Inc. Japan
>> >>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >>>> >> Japanese: http://www.sraoss.co.jp
>> >>> >>>> >>
>> >>> >>>> >> > Hello Tatsuo,
>> >>> >>>> >> >
>> >>> >>>> >> > With backend0 on one host just configure following rule on other host
>> >>> >>>> >> > where pgpool is:
>> >>> >>>> >> >
>> >>> >>>> >> > iptables -A FORWARD -j REJECT --reject-with icmp-port-unreachable
>> >>> >>>> >> >
>> >>> >>>> >> > and then have pgpool startup with health checking and retrying
>> >>> >>>> >> > configured, and then pull network cable from backend0 host network
>> >>> >>>> >> > interface.
>> >>> >>>> >> >
>> >>> >>>> >> > Regards,
>> >>> >>>> >> > Stevo.
>> >>> >>>> >> >
>> >>> >>>> >> > On Wed, Jan 11, 2012 at 6:27 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>> >>>> >> >
>> >>> >>>> >> >> I want to try to test the situation you described:
>> >>> >>>> >> >>
>> >>> >>>> >> >> >> > When system is configured for security reasons not to return
>> >>> >>>> >> >> >> > destination host unreachable messages, even though
>> >>> >>>> >> >> >> > health_check_timeout is
>> >>> >>>> >> >>
>> >>> >>>> >> >> But I don't know how to do it. I pulled out the network cable and
>> >>> >>>> >> >> pgpool detected it as expected. Also I configured the server which
>> >>> >>>> >> >> PostgreSQL is running on to disable the 5432 port. In this case
>> >>> >>>> >> >> connect(2) returned EHOSTUNREACH (No route to host) so pgpool
>> >>> >>>> >> >> detected the error as expected.
>> >>> >>>> >> >>
>> >>> >>>> >> >> Could you please instruct me?
>> >>> >>>> >> >> --
>> >>> >>>> >> >> Tatsuo Ishii
>> >>> >>>> >> >> SRA OSS, Inc. Japan
>> >>> >>>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >>>> >> >> Japanese: http://www.sraoss.co.jp
>> >>> >>>> >> >>
>> >>> >>>> >> >> > Hello Tatsuo,
>> >>> >>>> >> >> >
>> >>> >>>> >> >> > Thank you for replying!
>> >>> >>>> >> >> >
>> >>> >>>> >> >> > I'm not sure what exactly is blocking, just by pgpool code analysis I
>> >>> >>>> >> >> > suspect it is the part where a connection is made to the db and it
>> >>> >>>> >> >> > doesn't seem to get interrupted by alarm. Tested thoroughly health
>> >>> >>>> >> >> > check behaviour, it works really well when host/ip is there and just
>> >>> >>>> >> >> > backend/postgres is down, but not when backend host/ip is down. I
>> >>> >>>> >> >> > could see in log that initial health check and each retry got delayed
>> >>> >>>> >> >> > when host/ip is not reachable, while when just backend is not
>> >>> >>>> >> >> > listening (is down) on the reachable host/ip then initial health
>> >>> >>>> >> >> > check and all retries are exact to the settings in pgpool.conf.
>> >>> >>>> >> >> >
>> >>> >>>> >> >> > PGCONNECT_TIMEOUT is listed as one of the libpq environment variables
>> >>> >>>> >> >> > in the docs (see
>> >>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-envars.html).
>> >>> >>>> >> >> > There is an equivalent parameter in libpq PQconnectdbParams (see
>> >>> >>>> >> >> > http://www.postgresql.org/docs/9.1/static/libpq-connect.html#LIBPQ-CONNECT-CONNECT-TIMEOUT).
>> >>> >>>> >> >> > At the beginning of that same page there is some important
>> >>> >>>> >> >> > information on using these functions.
>> >>> >>>> >> >> >
>> >>> >>>> >> >> > psql respects PGCONNECT_TIMEOUT.
>> >>> >>>> >> >> >
>> >>> >>>> >> >> > Regards,
>> >>> >>>> >> >> > Stevo.
>> >>> >>>> >> >> >
>> >>> >>>> >> >> > On Wed, Jan 11, 2012 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >>> >>>> >> >> >
>> >>> >>>> >> >> >> > Hello pgpool community,
>> >>> >>>> >> >> >> >
>> >>> >>>> >> >> >> > When system is configured for security reasons not to return
>> >>> >>>> >> >> >> > destination host unreachable messages, even though
>> >>> >>>> >> >> >> > health_check_timeout is configured, socket call will block and
>> >>> >>>> >> >> >> > alarm will not get raised until TCP timeout occurs.
>> >>> >>>> >> >> >>
>> >>> >>>> >> >> >> Interesting. So are you saying that read(2) cannot be interrupted by
>> >>> >>>> >> >> >> alarm signal if the system is configured not to return destination
>> >>> >>>> >> >> >> host unreachable message? Could you please guide me where I can get
>> >>> >>>> >> >> >> such info? (I'm not a network expert).
>> >>> >>>> >> >> >>
>> >>> >>>> >> >> >> > Not a C programmer, found some info that select call could be
>> >>> >>>> >> >> >> > replaced with select/pselect calls. Maybe it would be best if
>> >>> >>>> >> >> >> > PGCONNECT_TIMEOUT value could be used here for connection
>> >>> >>>> >> >> >> > timeout. pgpool has libpq as dependency, why isn't it using libpq
>> >>> >>>> >> >> >> > for the healthcheck db connect calls, then PGCONNECT_TIMEOUT
>> >>> >>>> >> >> >> > would be applied?
>> >>> >>>> >> >> >>
>> >>> >>>> >> >> >> I don't think libpq uses select/pselect for establishing connection,
>> >>> >>>> >> >> >> but using libpq instead of homebrew code seems to be an idea. Let me
>> >>> >>>> >> >> >> think about it.
>> >>> >>>> >> >> >>
>> >>> >>>> >> >> >> One question. Are you sure that libpq can deal with the case (not to
>> >>> >>>> >> >> >> return destination host unreachable messages) by using
>> >>> >>>> >> >> >> PGCONNECT_TIMEOUT?
>> >>> >>>> >> >> >> --
>> >>> >>>> >> >> >> Tatsuo Ishii
>> >>> >>>> >> >> >> SRA OSS, Inc. Japan
>> >>> >>>> >> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >>> >>>> >> >> >> Japanese: http://www.sraoss.co.jp
>> >>> >>>> >> >> >>
>> >>> >>>> >> >>
>> >>> >>>> >>
>> >>> >>>>
>> >>> >>>
>> >>> >>>
>> >>> >>
>> >>>
>> >>
>> >>
>>
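
For reference, the problem discussed in the quoted thread is a connect attempt that hangs until the TCP timeout instead of honouring health_check_timeout. One way to avoid depending on an alarm signal is to put the deadline on the socket itself. The following is a minimal sketch in Python, not pgpool's implementation; the function name and arguments are hypothetical.

```python
import socket


def backend_reachable(host: str, port: int, timeout_s: float) -> bool:
    """Illustrative only: try a TCP connect that gives up after timeout_s."""
    try:
        # The timeout is applied to the connect attempt itself, so a host
        # that silently drops packets cannot block for the full TCP timeout.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers timeouts, connection refused, and unreachable hosts.
        return False
```

With the settings from the thread, a call such as backend_reachable("192.168.2.27", 5432, 5.0) is intended to give up after roughly 5 seconds for a single IP address. This checks TCP reachability only; it does not log in to PostgreSQL as the real health check does.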
-------------- next part --------------
A non-text attachment was scrubbed...
Name: main.c.patch
Type: text/x-patch
Size: 1650 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20120113/0f3265a8/attachment.bin>