[Pgpool-general] The Unplugged wire

Daniel Codina dcodina at laigu.net
Tue Mar 2 12:05:42 UTC 2010


Thanks to you, Tatsuo.

I am glad my report helped you! As soon as you tell me the fix is done, I will
start testing and debugging it again.

2010/3/2 Tatsuo Ishii <ishii at sraoss.co.jp>

> Daniel,
>
> Thanks for the report!
>
> > I mentioned this problem in another message, but I have since debugged
> > further and have more specific information that may be useful.
> > (The earlier thread was:
> > http://lists.pgfoundry.org/pipermail/pgpool-general/2010-February/002565.html
> > )
> >
> > I am working with two VB virtual machines with CentOS 5 (i386), running
> > PostgreSQL 8.3.9 and pgpool 2.3.2.1.
> >
> > The test was simple. While I was inserting values every second, I unplugged
> > one of the nodes.
> > The health check runs every second and its timeout is 2 seconds.
> >
> > At that moment all inserts stop, and pgpool waits.
> > The point where it stops is:
> >
> > [...]
> > [pid 29444] 10:47:55.537470 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid 29444] 10:47:55.537591 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
> > [pid 29444] 10:47:55.537726 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > [pid 29444] 10:47:55.537886 connect(9, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16) = ? ERESTARTSYS (To be restarted)
> > [pid 29444] 10:47:56.529113 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > [pid 29444] 10:47:56.529235 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
> > [pid 29444] 10:47:56.529428 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid 29444] 10:47:56.529602 sigreturn() = ? (mask now [])
> > [pid 29444] 10:47:56.529894 connect(9, {sa_family=AF_INET, sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16 <unfinished ...>
> >
> >
> > First it does a connect(), which receives the SIGALRM and continues. But then
> > it does another connect(), and this time it does not receive any SIGALRM,
> > so it waits (I think) until the system gives up on the connection.
> >
> > After waiting (too long) it starts working again (now with the node down):
> >
> > [...]
> > [pid 29445] 10:49:30.273727 <... connect resumed> ) = -1 EHOSTUNREACH (No route to host)
> > [pid 29444] 10:49:30.274739 <... connect resumed> ) = -1 EHOSTUNREACH (No route to host)
> > [pid 29445] 10:49:30.274809 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > [pid 29445] 10:49:30.275057 time(NULL)  = 1267436970
> > [pid 29445] 10:49:30.275202 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2593, ...}) = 0
> > [pid 29445] 10:49:30.275485 write(2, "2010-03-01 10:49:30 ERROR: pid 2"..., 101
> > 2010-03-01 10:49:30 ERROR: pid 29445: connect_inet_domain_socket: connect() failed: No route to host
> > ) = 101
> > [pid 29445] 10:49:30.275911 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [pid 29445] 10:49:30.276062 close(7)    = 0
> > [pid 29445] 10:49:30.276221 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > [pid 29445] 10:49:30.276389 time(NULL)  = 1267436970
> > [pid 29445] 10:49:30.276715 stat64("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2593, ...}) = 0
> > [pid 29445] 10:49:30.276895 write(2, "2010-03-01 10:49:30 ERROR: pid 2"..., 78
> > 2010-03-01 10:49:30 ERROR: pid 29445: connection to 192.168.1.10(5432) failed
> > ) = 78
> > [...]
> >
> > As you can see, it resumes after a minute and a half (which is too long).
> > It is always the same (without changing any system values).
> >
> > If necessary, I can show more debug lines.
> >
> > Looking through the source, we think it could be a problem with the
> > connection blocking. Maybe it would be possible to make the socket
> > non-blocking.
> > We suppose something is happening in pool_connection_pool.c around line 473
> > ("connect_inet_domain_socket_by_port").
> >
> > Or maybe I am doing something wrong. Has anybody else tested the
> > "unplugged wire" case? Is it working?
>
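To illustrate the non-blocking idea from my original message, here is a
minimal sketch in Python (pgpool itself is C, so this is only a model of the
approach, not pgpool code). The function name connect_with_timeout and its
parameters are my own invention for the example. The socket is put in
non-blocking mode, connect is started, and select() bounds the wait so that an
unplugged node cannot hold the caller longer than the chosen timeout.

```python
import errno
import os
import select
import socket


def connect_with_timeout(host, port, timeout):
    """Connect to (host, port), giving up after `timeout` seconds.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # connect_ex() returns an errno value instead of raising.
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))
    if err != 0:
        # The connection is still in progress: wait until the socket becomes
        # writable, but never longer than the timeout.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            sock.close()
            raise TimeoutError("connect() did not finish in time")
        # Writable does not mean connected; fetch the final status.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        if err != 0:
            sock.close()
            raise OSError(err, os.strerror(err))
    sock.setblocking(True)
    return sock
```

With a 2 second timeout, as in my health check settings, a call against the
unplugged node would be expected to fail with TimeoutError after about
2 seconds instead of waiting for the kernel to report "No route to host".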
> What health_check() does here is:
>
> start alarm (done by caller of health_check)
> connect()
> write()
> read()
> :
> :
>
> If the wire is unplugged, one of the system calls will block, and
> eventually the alarm interrupts whichever of connect/write/read is running
> and health_check returns with an error code. write() and read() are fine.
> The problem is that connect() is done by connect_inet_domain_socket_by_port,
> which retries if connect() is interrupted by a signal.
>
> I believe that is what you saw:
> > First it does a connect(), which receives the SIGALRM and continues. But then
> > it does another connect(), and this time it does not receive any SIGALRM,
> > so it waits (I think) until the system gives up on the connection.
>
> The retry should be turned off when it is called from health_check(). I will
> fix it.
>
> Thanks again for the good testing and analysis.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
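To make the retry behaviour described above concrete, here is a small Python
model of the control flow (again, not pgpool's C code). The names
connect_with_retry, make_unplugged_connect and the retry_on_eintr flag are
hypothetical and exist only in this sketch. The fake connect is interrupted
once, as by SIGALRM, and then fails with EHOSTUNREACH, which in my trace only
happened about a minute and a half later.

```python
import errno


def connect_with_retry(connect_once, retry_on_eintr=True):
    """Call connect_once() and optionally retry when it is interrupted.

    Returns (attempts, error), where error is None on success or an errno
    value on failure.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            connect_once()
            return attempts, None
        except InterruptedError:
            if retry_on_eintr:
                # The alarm has already fired, so nothing will interrupt
                # the next attempt: this is where the long wait comes from.
                continue
            return attempts, errno.EINTR
        except OSError as exc:
            return attempts, exc.errno


def make_unplugged_connect():
    """Build a fake connect: interrupted once, then 'No route to host'."""
    calls = {"n": 0}

    def connect_once():
        calls["n"] += 1
        if calls["n"] == 1:
            raise InterruptedError(errno.EINTR, "interrupted by SIGALRM")
        raise OSError(errno.EHOSTUNREACH, "No route to host")

    return connect_once
```

With retry_on_eintr=True the model makes a second connect attempt and only
returns once that attempt fails. With retry_on_eintr=False it returns right
after the interruption, which is the behaviour the health check needs.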