[Pgpool-general] The Unplugged wire

Sat Mar 6 12:56:13 UTC 2010

Daniel,

I have committed fix for this. Please check it out.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Daniel,
> 
> Looks like interrupted connect() ought to check if SIGQUIT has been
> delivered. Will check. I have no time until next week though.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
> 
> > Hi Tatsuo,
> > 
> > I did some more debug, and it seems the problem I had is solved. But there
> > is still a problem.
> > 
> > The test I did is:
> > 1) Start postgres node2
> > 2) Start postgres node1
> > 3) start pgpool (node1)
> > 4) From a remote terminal (client1):
> > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d database -p
> > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i, 'nom$i');";
> > echo $i; sleep 1; }; done
> > 5) Unplug node2
> > Here the inserts on client terminal stops. It restarts more or less 5 min
> > and a half later.
> > 6)After 10 second (more or les) after unplug the wire,  From a remote
> > terminal (client2):
> > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d database -p
> > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i, 'nom$i');";
> > echo $i; sleep 1; }; done
> > Works perfectly!! I mean, pgpool is not blocked, it works,... there is just
> > some problem in the first session.
> > 
> > 
> > The lines bothering me are:
> > [...]
> > 11:33:42.414222 select(6, 0xbfa01ae8, 0, 0, 0xbfa01b8c2010-03-04 11:33:42
> > DEBUG: pid 4744: pool_read_message_length: slot: 0 length: 8
> > [...]
> > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length: slot: 0
> > length: 5
> > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length: slot: 1
> > length: 5
> > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: message length: 5
> > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: transaction state: I
> > 2010-03-04 11:33:42 DEBUG: pid 4744: read kind from frontend X(58)
> > 2010-03-04 11:33:43 DEBUG: pid 4744: I am 4744 accept fd 5
> > 2010-03-04 11:33:43 LOG:   pid 4744: connection received: host=192.168.1.13
> > port=38751
> > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 1234 Minor: 5679
> > database:  user:
> > 2010-03-04 11:33:43 DEBUG: pid 4744: SSLRequest from client
> > 2010-03-04 11:33:43 DEBUG: pid 4744: pool_ssl: SSL requested but SSL support
> > is not available
> > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 3 Minor: 0 database:
> > database user: postgres
> > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 0 backend
> > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 1 backend
> >  <unfinished ...>
> > 11:33:43.357089 --- SIGALRM (Alarm clock) ---
> > 11:33:44.416652 <... select resumed> )                        = 0
> > 11:33:44.416860 sigprocmask(2, 0x8180380, NULL)               = 0
> > 11:33:44.417547 sigprocmask(2, 0x8180280, NULL)               = 0
> > 11:33:44.418019 gettimeofday(0xbfa01b94, NULL)                = 0
> > 11:33:44.418723 sigprocmask(2, 0x8180380, NULL)               = 0
> > 11:33:44.419311 sigprocmask(2, 0x8180380, 0xbfa01b14)         = 0
> > 11:33:44.419866 getpid()                                      = 4743
> > 11:33:44.420353 time(NULL)                                    = 1267698824
> > 11:33:44.420878 localtime(0xbfa01ae4)                         = 0xcbb300
> > [...]
> > 
> > 
> > When pgpool is started I can see 3 processes: (sorry if times an PID does
> > not match, because I am takeing data from different debug sessions)
> >  5348 pts/1    S+     0:00 pgpool -d -n
> >  5349 pts/1    S+     0:00 pgpool: wait for connection request
> >  5350 pts/1    S+     0:00 pgpool: PCP: wait for connection request
> > 
> >  I monitored de process with name pgpool: wait for connection request
> >  And it stops at:
> >  [...]
> >  11:33:43.107622 time(NULL)              = 1267698823
> > 11:33:43.107773 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > st_size=2593, ...}) = 0
> > 11:33:43.107961 write(2, "2010-03-04 11:33:43 DEBUG: pid 4"..., 74) = 74
> > 11:33:43.108359 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > 11:33:43.108510 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> > 11:33:43.108670 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > 11:33:43.108823 connect(7, {sa_family=AF_INET, sin_port=htons(5432),
> > sin_addr=inet_addr("192.168.1.10")}, 16) = ? ERESTARTSYS (To be restarted)
> > 11:33:45.460457 --- SIGQUIT (Quit) @ 0 (0) ---
> > 11:33:45.460997 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV
> > CONT SYS RTMIN RT_1], [QUIT], 8) = 0
> > 11:33:45.461156 time(NULL)              = 1267698825
> > 11:33:45.461305 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > st_size=2593, ...}) = 0
> > 11:33:45.461537 write(2, "2010-03-04 11:33:45 DEBUG: pid 4"..., 78) = 78
> > 11:33:45.461720 rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0
> > 11:33:45.475194 sigreturn()             = ? (mask now [])
> > 11:33:45.475320 connect(7, {sa_family=AF_INET, sin_port=htons(5432),
> > sin_addr=inet_addr("192.168.1.10")}, 16) = -1 EHOSTUNREACH (No route to
> > host)
> > 
> > 
> > After some time it continues with:
> > [...]
> > 11:35:19.066376 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV
> > CONT SYS RTMIN RT_1], [], 8) = 0
> > 11:35:19.066888 time(NULL)              = 1267698919
> > 11:35:19.067042 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > st_size=2593, ...}) = 0
> > 11:35:19.067408 write(2, "2010-03-04 11:35:19 ERROR: pid 4"..., 100) = 100
> > 11:35:19.067848 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > [...]
> > 
> > 
> > I observed, in that time that now there is one more process, and the process
> > I am monitoring changed:
> >  5348 pts/1    S+     0:00 pgpool -d -n
> >  5349 pts/1    S+     0:00 pgpool: accept connection
> >  5350 pts/1    S+     0:00 pgpool: PCP: wait for connection request
> >  5467 pts/1    S+     0:00 pgpool: wait for connection request
> > 
> > 
> >  I would say that there are two "problems".
> >  One of them is that pgpool takes so long to see node2 failed (i cannot see
> > how long,.. it seems to me that is not allways the same). I have
> > connection_life_time=1 . Is there any other timeout that I would have to
> > change? (the health check is every second).
> > I noticed too that, after the node is down, some healths checks fail to
> > detect it, don't know if maybe it could be because of the
> > virtualitzation,...
> > 
> >  I see that, everytime the node falls, the process(wait for connection
> > request) is killed and anotherone is started. That is not a problem if in
> > the meantime there is not a query. If a query is done before the old process
> > dies and the nwe is create,, the "old process" will take it and won't die,
> > and will wait the 5 minutes (I don't know why this time) as "pgpool: accept
> > connection" still if the new process is alredy started, my for loop will
> > wait there. All the queryies that comes after the new process is started
> > will work perfectly.
> > The problem is that, once the new process is started (and works ok), the old
> > process is still there, retaining the clients(wich started the connection
> > there) connection (and stoping them of working), and, maybe it would be
> > better if, once the new process exist, the oldone would die and would free
> > the "retained" query.
> > 
> > Hope this can be helpfull for you.
> > 
> > Thanks!
> > 
> > 2010/3/3 Tatsuo Ishii <ishii at sraoss.co.jp>
> > 
> > > Daniel,
> > >
> > > I have committed fix into CVS HEAD. Could you try it out?
> > > --
> > > Tatsuo Ishii
> > > SRA OSS, Inc. Japan
> > > English: http://www.sraoss.co.jp/index_en.php
> > > Japanese: http://www.sraoss.co.jp
> > >
> > > > Thanks to you Tatsuo,...
> > > >
> > > > I am glad  my report helped you! As soon as you tell me it is done I will
> > > > start testing/debugging it again.
> > > >
> > > > 2010/3/2 Tatsuo Ishii <ishii at sraoss.co.jp>
> > > >
> > > > > Daniel,
> > > > >
> > > > > Thanks for the report!
> > > > >
> > > > > > I spoke in another message about this problem, yet, I debugged deeper
> > > and
> > > > > I
> > > > > > have more specific information, that, maybe, can be usefull.
> > > > > > (The thread I spoke something about was:
> > > > > >
> > > > >
> > > http://lists.pgfoundry.org/pipermail/pgpool-general/2010-February/002565.html
> > > > > > )
> > > > > >
> > > > > > I am working with two VB Virtual machines with CentOS 5 (i386).
> > > Running
> > > > > > PostgreSQL 8.3.9 and pgpool 2.3.2.1.
> > > > > >
> > > > > > The test was simple. While I was inserting values every second, I
> > > > > unplugged
> > > > > > one of the nodes.
> > > > > > health check is every second and it's timeout is 2 seconds.
> > > > > >
> > > > > > In that moment all inserts stops, and pgpool waits.
> > > > > > The point where it stops is:
> > > > > >
> > > > > > [...]
> > > > > > [pid 29444] 10:47:55.537470 rt_sigprocmask(SIG_SETMASK, [], NULL, 8)
> > > = 0
> > > > > > [pid 29444] 10:47:55.537591 socket(PF_INET, SOCK_STREAM, IPPROTO_IP)
> > > = 9
> > > > > > [pid 29444] 10:47:55.537726 setsockopt(9, SOL_TCP, TCP_NODELAY, [1],
> > > 4) =
> > > > > 0
> > > > > > [pid 29444] 10:47:55.537886 connect(9, {sa_family=AF_INET,
> > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16) = ?
> > > > > > ERESTARTSYS (To be restarted)
> > > > > > [pid 29444] 10:47:56.529113 --- SIGALRM (Alarm clock) @ 0 (0) ---
> > > > > > [pid 29444] 10:47:56.529235 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP
> > > ABRT
> > > > > BUS
> > > > > > FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
> > > > > > [pid 29444] 10:47:56.529428 rt_sigprocmask(SIG_SETMASK, [], NULL, 8)
> > > = 0
> > > > > > [pid 29444] 10:47:56.529602 sigreturn() = ? (mask now [])
> > > > > > [pid 29444] 10:47:56.529894 connect(9, {sa_family=AF_INET,
> > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16
> > > <unfinished
> > > > > > ...>
> > > > > >
> > > > > >
> > > > > > First it does a connect() wich receives de SIGALARM, and continues.
> > > But
> > > > > then
> > > > > > it does another connect(), and this time it does not receive any
> > > > > SIGALARM,
> > > > > > so, it waits (I think) till the system closes the connection.
> > > > > >
> > > > > > After waiting (too long) it starts working again (now with the node
> > > > > down):
> > > > > >
> > > > > > [...]
> > > > > > [pid 29445] 10:49:30.273727 <... connect resumed> ) = -1 EHOSTUNREACH
> > > (No
> > > > > > route to host)
> > > > > > [pid 29444] 10:49:30.274739 <... connect resumed> ) = -1 EHOSTUNREACH
> > > (No
> > > > > > route to host)
> > > > > > [pid 29445] 10:49:30.274809 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP
> > > ABRT
> > > > > BUS
> > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > [pid 29445] 10:49:30.275057 time(NULL)  = 1267436970
> > > > > > [pid 29445] 10:49:30.275202 stat64("/etc/localtime",
> > > > > {st_mode=S_IFREG|0644,
> > > > > > st_size=2593, ...}) = 0
> > > > > > [pid 29445] 10:49:30.275485 write(2, "2010-03-01 10:49:30 ERROR: pid
> > > > > 2"...,
> > > > > > 1012010-03-01 10:49:30 ERROR: pid 29445: connect_inet_domain_socket:
> > > > > > connect() failed: No route to host
> > > > > > ) = 101
> > > > > > [pid 29445] 10:49:30.275911 rt_sigprocmask(SIG_SETMASK, [], NULL, 8)
> > > = 0
> > > > > > [pid 29445] 10:49:30.276062 close(7)    = 0
> > > > > > [pid 29445] 10:49:30.276221 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP
> > > ABRT
> > > > > BUS
> > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > [pid 29445] 10:49:30.276389 time(NULL)  = 1267436970
> > > > > > [pid 29445] 10:49:30.276715 stat64("/etc/localtime",
> > > > > {st_mode=S_IFREG|0644,
> > > > > > st_size=2593, ...}) = 0
> > > > > > [pid 29445] 10:49:30.276895 write(2, "2010-03-01 10:49:30 ERROR: pid
> > > > > 2"...,
> > > > > > 782010-03-01 10:49:30 ERROR: pid 29445: connection to
> > > 192.168.1.10(5432)
> > > > > > failed
> > > > > > ) = 78
> > > > > > [...]
> > > > > >
> > > > > > As you can see it restarts after 1 min and a half (wich is too much).
> > > It
> > > > > is
> > > > > > always the same (without changeing any system values)
> > > > > >
> > > > > > If it is necessary I can show more debug lines.
> > > > > >
> > > > > > Looking trough the source, we think, maybe it could be a problem with
> > > the
> > > > > > connection being blocked. Maybe, it would be possible not to block it
> > > > > > (speaking about the socket).
> > > > > > We suppose something is happening in pool_connection_pool.c arround
> > > line
> > > > > 473
> > > > > > ("connect_inet_domain_socket_by_port").
> > > > > >
> > > > > > Or maybe I am doing something wrong,... does anybody else tested the
> > > > > > "unpluged wire" ? Is it working?
> > > > >
> > > > > What health_check() does here is:
> > > > >
> > > > > start alarm (done by caller of health_check)
> > > > > connect()
> > > > > write()
> > > > > read()
> > > > > :
> > > > > :
> > > > >
> > > > > If the wire is unplugged, one of system calls will be blocked and
> > > > > eventually alarm interrupt any of connect/write/read and health_check
> > > > > returns with error code. Write() and read() are fine. Problem is,
> > > > > connect is done by connect_inet_domain_socket_by_port, which does
> > > > > retry if connect() is interrupted by a system call.
> > > > >
> > > > > I belive what you saw was that.
> > > > > > First it does a connect() wich receives de SIGALARM, and continues.
> > > But
> > > > > then
> > > > > > it does another connect(), and this time it does not receive any
> > > > > SIGALARM,
> > > > > > so, it waits (I think) till the system closes the connection.
> > > > >
> > > > > The retry should be turn off if it's called from health_check(). Will
> > > > > fix.
> > > > >
> > > > > Thanks again for good testing and analysis.
> > > > > --
> > > > > Tatsuo Ishii
> > > > > SRA OSS, Inc. Japan
> > > > > English: http://www.sraoss.co.jp/index_en.php
> > > > > Japanese: http://www.sraoss.co.jp
> > > > >
> > >
> _______________________________________________
> Pgpool-general mailing list
> Pgpool-general at pgfoundry.org
> http://pgfoundry.org/mailman/listinfo/pgpool-general