[Pgpool-general] The Unplugged wire

Daniel Codina dcodina at laigu.net
Mon Mar 8 15:29:14 UTC 2010


Hi trhere Tatsuo,

I really didn't have much time, but I have been able to test if the changes
went well. As this time it does not stuck un the connect, is quite difficult
to see "what is wrong", anyway, I send you a part of the debug I did,
starting with the last correct health check since the failover is done:

[pid  6509] 12:05:12.844235 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:12.844421 write(2, "2010-03-08 12:05:12 DEBUG: pid 6509:
health_check: 1 th DB node status: 2\n", 742010-03-08 12:05:12 DEBUG: pid
6509: health_check: 1 th DB node status: 2
) = 74
[pid  6509] 12:05:12.844555 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  6509] 12:05:12.844689 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 9
[pid  6509] 12:05:12.844834 setsockopt(9, SOL_TCP, TCP_NODELAY, [1], 4) = 0
[pid  6509] 12:05:12.844992 connect(9, {sa_family=AF_INET,
sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16) = ?
ERESTARTSYS (To be restarted)
[pid  6509] 12:05:13.832561 --- SIGALRM (Alarm clock) @ 0 (0) ---
[pid  6509] 12:05:13.832711 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.832944 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  6509] 12:05:13.833279 sigreturn() = ? (mask now [])
[pid  6509] 12:05:13.834129 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
[pid  6509] 12:05:13.834431 time(NULL)  = 1268046313
[pid  6509] 12:05:13.834650 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.835114 write(2, "2010-03-08 12:05:13 ERROR: pid 6509:
connect_inet_domain_socket: connect() failed: Interrupted system call\n",
1072010-03-08 12:05:13 ERROR: pid 6509: connect_inet_domain_socket:
connect() failed: Interrupted system call
) = 107
[pid  6509] 12:05:13.835423 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  6509] 12:05:13.835635 close(9)    = 0
[pid  6509] 12:05:13.835876 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
[pid  6509] 12:05:13.836320 time(NULL)  = 1268046313
[pid  6509] 12:05:13.836340 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.836340 write(2, "2010-03-08 12:05:13 ERROR: pid 6509:
health check failed. 1 th host 192.168.1.10 at port 5432 is down\n",
1022010-03-08 12:05:13 ERROR: pid 6509: health check failed. 1 th host
192.168.1.10 at port 5432 is down
) = 102
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], ~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP
SYS RTMIN RT_1], 8) = 0
[pid  6509] 12:05:13.836340 time(NULL)  = 1268046313
[pid  6509] 12:05:13.836340 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.836340 write(2, "2010-03-08 12:05:13 LOG:   pid 6509:
set 1 th backend down status\n", 662010-03-08 12:05:13 LOG:   pid 6509: set
1 th backend down status
) = 66
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], ~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP
SYS RTMIN RT_1], 8) = 0
[pid  6509] 12:05:13.836340 time(NULL)  = 1268046313
[pid  6509] 12:05:13.836340 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.836340 write(2, "2010-03-08 12:05:13 DEBUG: pid 6509:
failover_handler called\n", 612010-03-08 12:05:13 DEBUG: pid 6509:
failover_handler called
) = 61
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.836340 semop(983048, 0xbf86e7ce, 1) = 0
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], ~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP
SYS RTMIN RT_1], 8) = 0
[pid  6509] 12:05:13.836340 time(NULL)  = 1268046313
[pid  6509] 12:05:13.836340 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.836340 write(2, "2010-03-08 12:05:13 DEBUG: pid 6509:
failover_handler: starting to select new master node\n", 902010-03-08
12:05:13 DEBUG: pid 6509: failover_handler: starting to select new master
node
) = 90
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.836340 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1], ~[ILL TRAP ABRT BUS FPE KILL SEGV CONT STOP
SYS RTMIN RT_1], 8) = 0
[pid  6509] 12:05:13.836340 time(NULL)  = 1268046313
[pid  6509] 12:05:13.836340 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.837099 write(2, "2010-03-08 12:05:13 LOG:   pid 6509:
starting degeneration. shutdown host 192.168.1.10(5432)\n", 932010-03-08
12:05:13 LOG:   pid 6509: starting degeneration. shutdown host
192.168.1.10(5432)
) = 93
[pid  6509] 12:05:13.837388 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.837529 kill(6510, SIGQUIT <unfinished ...>
[pid  6510] 12:05:13.837658 <... select resumed> ) = ? ERESTARTNOHAND (To be
restarted)
[pid  6509] 12:05:13.837726 <... kill resumed> ) = 0
[pid  6510] 12:05:13.837796 --- SIGQUIT (Quit) @ 0 (0) ---
[pid  6509] 12:05:13.838069 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1],  <unfinished ...>
[pid  6510] 12:05:13.838155 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1],  <unfinished ...>
[pid  6509] 12:05:13.838341 <... rt_sigprocmask resumed> ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], 8) = 0
[pid  6510] 12:05:13.838426 <... rt_sigprocmask resumed> [QUIT], 8) = 0
[pid  6509] 12:05:13.838548 time( <unfinished ...>
[pid  6510] 12:05:13.838641 time( <unfinished ...>
[pid  6509] 12:05:13.838716 <... time resumed> NULL) = 1268046313
[pid  6510] 12:05:13.838829 <... time resumed> NULL) = 1268046313
[pid  6509] 12:05:13.838932 stat64("/etc/localtime",  <unfinished ...>
[pid  6510] 12:05:13.839074 stat64("/etc/localtime",  <unfinished ...>
[pid  6509] 12:05:13.839154 <... stat64 resumed> {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6510] 12:05:13.839278 <... stat64 resumed> {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.839392 write(2, "2010-03-08 12:05:13 DEBUG: pid 6509:
failover_handler: kill 6510\n", 652010-03-08 12:05:13 DEBUG: pid 6509:
failover_handler: kill 6510
 <unfinished ...>
[pid  6510] 12:05:13.839504 write(2, "2010-03-08 12:05:13 DEBUG: pid 6510:
child received shutdown request signal 3\n", 782010-03-08 12:05:13 DEBUG:
pid 6510: child received shutdown request signal 3
 <unfinished ...>
[pid  6509] 12:05:13.839648 <... write resumed> ) = 65
[pid  6510] 12:05:13.839720 <... write resumed> ) = 78
[pid  6509] 12:05:13.839848 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1],  <unfinished ...>
[pid  6510] 12:05:13.839939 rt_sigprocmask(SIG_SETMASK, [QUIT],  <unfinished
...>
[pid  6509] 12:05:13.840066 <... rt_sigprocmask resumed> NULL, 8) = 0
[pid  6510] 12:05:13.840271 <... rt_sigprocmask resumed> NULL, 8) = 0
[pid  6509] 12:05:13.840344 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE SEGV CONT SYS RTMIN RT_1],  <unfinished ...>
[pid  6510] 12:05:13.840432 sigreturn( <unfinished ...>
[pid  6509] 12:05:13.840524 <... rt_sigprocmask resumed> ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], 8) = 0
[pid  6510] 12:05:13.840607 <... sigreturn resumed> ) = ? (mask now [])
[pid  6509] 12:05:13.840687 time( <unfinished ...>
[pid  6510] 12:05:13.840771 select(8, [7], NULL, [7], {27, 943000}
<unfinished ...>
[pid  6509] 12:05:13.840859 <... time resumed> NULL) = 1268046313
[pid  6509] 12:05:13.840934 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
st_size=2593, ...}) = 0
[pid  6509] 12:05:13.841152 write(2, "2010-03-08 12:05:13 LOG:   pid 6509:
execute command: echo host:192.168.1.10, new master id:0, old master id:0
/opt/pgpoolII/log/failover.log\n", 1422010-03-08 12:05:13 LOG:   pid 6509:
execute command: echo host:192.168.1.10, new master id:0, old master id:0
/opt/pgpoolII/log/failover.log
) = 142
[pid  6509] 12:05:13.841332 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0
[pid  6509] 12:05:13.841524 rt_sigaction(SIGINT, {0x1, [], 0}, {0x804b710,
[], SA_RESTART}, 8) = 0
[pid  6509] 12:05:13.841670 rt_sigaction(SIGQUIT, {0x1, [], 0}, {0x804b710,
[], SA_RESTART}, 8) = 0
[pid  6509] 12:05:13.841813 rt_sigprocmask(SIG_BLOCK, [CHLD], ~[ILL TRAP
ABRT BUS FPE KILL SEGV CONT STOP SYS RTMIN RT_1], 8) = 0
[pid  6509] 12:05:13.842065 clone(Process 6598 attached (waiting for parent)
Process 6598 resumed (parent 6509 ready)
child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0xbf86e784)
= 6598
[pid  6509] 12:05:13.842567 waitpid(6598, Process 6509 suspended
 <unfinished ...>
[pid  6598] 12:05:13.843065 rt_sigaction(SIGINT, {0x804b710, [],
SA_RESTART}, NULL, 8) = 0
[pid  6598] 12:05:13.843277 rt_sigaction(SIGQUIT, {0x804b710, [],
SA_RESTART}, NULL, 8) = 0
[pid  6598] 12:05:13.843435 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS
FPE KILL SEGV CONT STOP SYS RTMIN RT_1], NULL, 8) = 0
[pid  6598] 12:05:13.843676 execve("/bin/sh", ["sh", "-c", "echo
host:192.168.1.10, new master id:0, old master id:0
/opt/pgpoolII/log/failover.log"], [/* 23 vars */]) = 0
[pid  6598] 12:05:13.844359 brk(0)      = 0x8771000
[pid  6598] 12:05:13.844715 access("/etc/ld.so.preload", R_OK) = -1 ENOENT
(No such file or directory)
[pid  6598] 12:05:13.844935 open("/etc/ld.so.cache", O_RDONLY) = 9
[pid  6598] 12:05:13.845092 fstat64(9, {st_mode=S_IFREG|0644, st_size=22568,
...}) = 0
[pid  6598] 12:05:13.845236 mmap2(NULL, 22568, PROT_READ, MAP_PRIVATE, 9, 0)
= 0xb7f89000
[pid  6598] 12:05:13.845365 close(9)    = 0
[pid  6598] 12:05:13.845569 open("/lib/libtermcap.so.2", O_RDONLY) = 9
[pid  6598] 12:05:13.845695 read(9,
"\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\320\352\327\0004\0\0\0\224.\0\0\0\0\0\0004\0
\0\5\0(\0\35\0\34\0\1\0\0\0\0\0\0\0\0\340\327\0\0\340\327\0\334&\0\0\334&\0\0\5\0\0\0\0\20\0\0\1\0\0\0\334&\0\0\334\26\330\0\334\26\330\0\220\1\0\0\f\2\0\0\6\0\0\0\0\20\0\0\2\0\0\0\370&\0\0\370\26\330\0\370\26\330\0\310\0\0\0\310\0\0\0\6\0\0\0\4\0\0\0P\345td<$\0\0<\4\330\0<\4\330\0\214\0\0\0\214\0\0\0\4\0\0\0\4\0\0\0Q\345td\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\6\0\0\0\4\0\0\0\21\0\0\0!\0\0\0\4\0\0\0\7\0\0\0`\f<\4\240\20\4\r\v\6\2\1\200\216\204\0!\0\0\0\"\0\0\0$\0\0"...,
512) = 512


I don't know if this tells you much... What I see is that it changes the PID
from 6509 to 6598, but the process is not killed, it is yet waiting the
insert (query) to be done. The processes I can see after I unplugged the
wire are:



 6509 pts/0    S+     0:00 pgpool -d -n
 6510 pts/0    S+     0:03 pgpool: postgres dbname 192.168.1.13(43586)
INSERT
 6511 pts/0    S+     0:00 pgpool: PCP: wait for connection request
 6596 ?        Ss     0:00 postgres: postgres dbname 192.168.1.5(59340) idle
in transaction
 6599 pts/0    S+     0:00 pgpool: wait for connection request


The 192.168.1.13 is the client, and 192.168.1.5 is where pgpool is running.

How this can help you Tatsuo,... Thank you again!


2010/3/6 Tatsuo Ishii <ishii at sraoss.co.jp>

> Daniel,
>
> I have committed fix for this. Please check it out.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Daniel,
> >
> > Looks like interrupted connect() ought to check if SIGQUIT has been
> > delivered. Will check. I have no time until next week though.
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> > English: http://www.sraoss.co.jp/index_en.php
> > Japanese: http://www.sraoss.co.jp
> >
> > > Hi Tatsuo,
> > >
> > > I did some more debug, and it seems the problem I had is solved. But
> there
> > > is still a problem.
> > >
> > > The test I did is:
> > > 1) Start postgres node2
> > > 2) Start postgres node1
> > > 3) start pgpool (node1)
> > > 4) From a remote terminal (client1):
> > > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d database
> -p
> > > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i,
> 'nom$i');";
> > > echo $i; sleep 1; }; done
> > > 5) Unplug node2
> > > Here the inserts on client terminal stops. It restarts more or less 5
> min
> > > and a half later.
> > > 6)After 10 second (more or les) after unplug the wire,  From a remote
> > > terminal (client2):
> > > for i in $(seq 0 2000); do { echo $(date +%H:%M:%S); psql -d database
> -p
> > > 9999 -h 192.168.1.5 -U postgres -c "insert into pri values ($i,
> 'nom$i');";
> > > echo $i; sleep 1; }; done
> > > Works perfectly!! I mean, pgpool is not blocked, it works,... there is
> just
> > > some problem in the first session.
> > >
> > >
> > > The lines bothering me are:
> > > [...]
> > > 11:33:42.414222 select(6, 0xbfa01ae8, 0, 0, 0xbfa01b8c2010-03-04
> 11:33:42
> > > DEBUG: pid 4744: pool_read_message_length: slot: 0 length: 8
> > > [...]
> > > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length: slot: 0
> > > length: 5
> > > 2010-03-04 11:33:42 DEBUG: pid 4744: pool_read_message_length: slot: 1
> > > length: 5
> > > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: message length: 5
> > > 2010-03-04 11:33:42 DEBUG: pid 4744: ReadyForQuery: transaction state:
> I
> > > 2010-03-04 11:33:42 DEBUG: pid 4744: read kind from frontend X(58)
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: I am 4744 accept fd 5
> > > 2010-03-04 11:33:43 LOG:   pid 4744: connection received:
> host=192.168.1.13
> > > port=38751
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 1234 Minor: 5679
> > > database:  user:
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: SSLRequest from client
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: pool_ssl: SSL requested but SSL
> support
> > > is not available
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: Protocol Major: 3 Minor: 0
> database:
> > > database user: postgres
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 0
> backend
> > > 2010-03-04 11:33:43 DEBUG: pid 4744: new_connection: connecting 1
> backend
> > >  <unfinished ...>
> > > 11:33:43.357089 --- SIGALRM (Alarm clock) ---
> > > 11:33:44.416652 <... select resumed> )                        = 0
> > > 11:33:44.416860 sigprocmask(2, 0x8180380, NULL)               = 0
> > > 11:33:44.417547 sigprocmask(2, 0x8180280, NULL)               = 0
> > > 11:33:44.418019 gettimeofday(0xbfa01b94, NULL)                = 0
> > > 11:33:44.418723 sigprocmask(2, 0x8180380, NULL)               = 0
> > > 11:33:44.419311 sigprocmask(2, 0x8180380, 0xbfa01b14)         = 0
> > > 11:33:44.419866 getpid()                                      = 4743
> > > 11:33:44.420353 time(NULL)                                    =
> 1267698824
> > > 11:33:44.420878 localtime(0xbfa01ae4)                         =
> 0xcbb300
> > > [...]
> > >
> > >
> > > When pgpool is started I can see 3 processes: (sorry if times an PID
> does
> > > not match, because I am takeing data from different debug sessions)
> > >  5348 pts/1    S+     0:00 pgpool -d -n
> > >  5349 pts/1    S+     0:00 pgpool: wait for connection request
> > >  5350 pts/1    S+     0:00 pgpool: PCP: wait for connection request
> > >
> > >  I monitored de process with name pgpool: wait for connection request
> > >  And it stops at:
> > >  [...]
> > >  11:33:43.107622 time(NULL)              = 1267698823
> > > 11:33:43.107773 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > > st_size=2593, ...}) = 0
> > > 11:33:43.107961 write(2, "2010-03-04 11:33:43 DEBUG: pid 4"..., 74) =
> 74
> > > 11:33:43.108359 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > 11:33:43.108510 socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 7
> > > 11:33:43.108670 setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
> > > 11:33:43.108823 connect(7, {sa_family=AF_INET, sin_port=htons(5432),
> > > sin_addr=inet_addr("192.168.1.10")}, 16) = ? ERESTARTSYS (To be
> restarted)
> > > 11:33:45.460457 --- SIGQUIT (Quit) @ 0 (0) ---
> > > 11:33:45.460997 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE
> SEGV
> > > CONT SYS RTMIN RT_1], [QUIT], 8) = 0
> > > 11:33:45.461156 time(NULL)              = 1267698825
> > > 11:33:45.461305 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > > st_size=2593, ...}) = 0
> > > 11:33:45.461537 write(2, "2010-03-04 11:33:45 DEBUG: pid 4"..., 78) =
> 78
> > > 11:33:45.461720 rt_sigprocmask(SIG_SETMASK, [QUIT], NULL, 8) = 0
> > > 11:33:45.475194 sigreturn()             = ? (mask now [])
> > > 11:33:45.475320 connect(7, {sa_family=AF_INET, sin_port=htons(5432),
> > > sin_addr=inet_addr("192.168.1.10")}, 16) = -1 EHOSTUNREACH (No route to
> > > host)
> > >
> > >
> > > After some time it continues with:
> > > [...]
> > > 11:35:19.066376 rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE
> SEGV
> > > CONT SYS RTMIN RT_1], [], 8) = 0
> > > 11:35:19.066888 time(NULL)              = 1267698919
> > > 11:35:19.067042 stat64("/etc/localtime", {st_mode=S_IFREG|0644,
> > > st_size=2593, ...}) = 0
> > > 11:35:19.067408 write(2, "2010-03-04 11:35:19 ERROR: pid 4"..., 100) =
> 100
> > > 11:35:19.067848 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > > [...]
> > >
> > >
> > > I observed, in that time that now there is one more process, and the
> process
> > > I am monitoring changed:
> > >  5348 pts/1    S+     0:00 pgpool -d -n
> > >  5349 pts/1    S+     0:00 pgpool: accept connection
> > >  5350 pts/1    S+     0:00 pgpool: PCP: wait for connection request
> > >  5467 pts/1    S+     0:00 pgpool: wait for connection request
> > >
> > >
> > >  I would say that there are two "problems".
> > >  One of them is that pgpool takes so long to see node2 failed (i cannot
> see
> > > how long,.. it seems to me that is not allways the same). I have
> > > connection_life_time=1 . Is there any other timeout that I would have
> to
> > > change? (the health check is every second).
> > > I noticed too that, after the node is down, some healths checks fail to
> > > detect it, don't know if maybe it could be because of the
> > > virtualitzation,...
> > >
> > >  I see that, everytime the node falls, the process(wait for connection
> > > request) is killed and anotherone is started. That is not a problem if
> in
> > > the meantime there is not a query. If a query is done before the old
> process
> > > dies and the nwe is create,, the "old process" will take it and won't
> die,
> > > and will wait the 5 minutes (I don't know why this time) as "pgpool:
> accept
> > > connection" still if the new process is alredy started, my for loop
> will
> > > wait there. All the queryies that comes after the new process is
> started
> > > will work perfectly.
> > > The problem is that, once the new process is started (and works ok),
> the old
> > > process is still there, retaining the clients(wich started the
> connection
> > > there) connection (and stoping them of working), and, maybe it would be
> > > better if, once the new process exist, the oldone would die and would
> free
> > > the "retained" query.
> > >
> > > Hope this can be helpfull for you.
> > >
> > > Thanks!
> > >
> > > 2010/3/3 Tatsuo Ishii <ishii at sraoss.co.jp>
> > >
> > > > Daniel,
> > > >
> > > > I have committed fix into CVS HEAD. Could you try it out?
> > > > --
> > > > Tatsuo Ishii
> > > > SRA OSS, Inc. Japan
> > > > English: http://www.sraoss.co.jp/index_en.php
> > > > Japanese: http://www.sraoss.co.jp
> > > >
> > > > > Thanks to you Tatsuo,...
> > > > >
> > > > > I am glad  my report helped you! As soon as you tell me it is done
> I will
> > > > > start testing/debugging it again.
> > > > >
> > > > > 2010/3/2 Tatsuo Ishii <ishii at sraoss.co.jp>
> > > > >
> > > > > > Daniel,
> > > > > >
> > > > > > Thanks for the report!
> > > > > >
> > > > > > > I spoke in another message about this problem, yet, I debugged
> deeper
> > > > and
> > > > > > I
> > > > > > > have more specific information, that, maybe, can be usefull.
> > > > > > > (The thread I spoke something about was:
> > > > > > >
> > > > > >
> > > >
> http://lists.pgfoundry.org/pipermail/pgpool-general/2010-February/002565.html
> > > > > > > )
> > > > > > >
> > > > > > > I am working with two VB Virtual machines with CentOS 5 (i386).
> > > > Running
> > > > > > > PostgreSQL 8.3.9 and pgpool 2.3.2.1.
> > > > > > >
> > > > > > > The test was simple. While I was inserting values every second,
> I
> > > > > > unplugged
> > > > > > > one of the nodes.
> > > > > > > health check is every second and it's timeout is 2 seconds.
> > > > > > >
> > > > > > > In that moment all inserts stops, and pgpool waits.
> > > > > > > The point where it stops is:
> > > > > > >
> > > > > > > [...]
> > > > > > > [pid 29444] 10:47:55.537470 rt_sigprocmask(SIG_SETMASK, [],
> NULL, 8)
> > > > = 0
> > > > > > > [pid 29444] 10:47:55.537591 socket(PF_INET, SOCK_STREAM,
> IPPROTO_IP)
> > > > = 9
> > > > > > > [pid 29444] 10:47:55.537726 setsockopt(9, SOL_TCP, TCP_NODELAY,
> [1],
> > > > 4) =
> > > > > > 0
> > > > > > > [pid 29444] 10:47:55.537886 connect(9, {sa_family=AF_INET,
> > > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16)
> = ?
> > > > > > > ERESTARTSYS (To be restarted)
> > > > > > > [pid 29444] 10:47:56.529113 --- SIGALRM (Alarm clock) @ 0 (0)
> ---
> > > > > > > [pid 29444] 10:47:56.529235 rt_sigprocmask(SIG_SETMASK, ~[ILL
> TRAP
> > > > ABRT
> > > > > > BUS
> > > > > > > FPE SEGV CONT SYS RTMIN RT_1], NULL, 8) = 0
> > > > > > > [pid 29444] 10:47:56.529428 rt_sigprocmask(SIG_SETMASK, [],
> NULL, 8)
> > > > = 0
> > > > > > > [pid 29444] 10:47:56.529602 sigreturn() = ? (mask now [])
> > > > > > > [pid 29444] 10:47:56.529894 connect(9, {sa_family=AF_INET,
> > > > > > > sin_port=htons(5432), sin_addr=inet_addr("192.168.1.10")}, 16
> > > > <unfinished
> > > > > > > ...>
> > > > > > >
> > > > > > >
> > > > > > > First it does a connect() wich receives de SIGALARM, and
> continues.
> > > > But
> > > > > > then
> > > > > > > it does another connect(), and this time it does not receive
> any
> > > > > > SIGALARM,
> > > > > > > so, it waits (I think) till the system closes the connection.
> > > > > > >
> > > > > > > After waiting (too long) it starts working again (now with the
> node
> > > > > > down):
> > > > > > >
> > > > > > > [...]
> > > > > > > [pid 29445] 10:49:30.273727 <... connect resumed> ) = -1
> EHOSTUNREACH
> > > > (No
> > > > > > > route to host)
> > > > > > > [pid 29444] 10:49:30.274739 <... connect resumed> ) = -1
> EHOSTUNREACH
> > > > (No
> > > > > > > route to host)
> > > > > > > [pid 29445] 10:49:30.274809 rt_sigprocmask(SIG_SETMASK, ~[ILL
> TRAP
> > > > ABRT
> > > > > > BUS
> > > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > > [pid 29445] 10:49:30.275057 time(NULL)  = 1267436970
> > > > > > > [pid 29445] 10:49:30.275202 stat64("/etc/localtime",
> > > > > > {st_mode=S_IFREG|0644,
> > > > > > > st_size=2593, ...}) = 0
> > > > > > > [pid 29445] 10:49:30.275485 write(2, "2010-03-01 10:49:30
> ERROR: pid
> > > > > > 2"...,
> > > > > > > 1012010-03-01 10:49:30 ERROR: pid 29445:
> connect_inet_domain_socket:
> > > > > > > connect() failed: No route to host
> > > > > > > ) = 101
> > > > > > > [pid 29445] 10:49:30.275911 rt_sigprocmask(SIG_SETMASK, [],
> NULL, 8)
> > > > = 0
> > > > > > > [pid 29445] 10:49:30.276062 close(7)    = 0
> > > > > > > [pid 29445] 10:49:30.276221 rt_sigprocmask(SIG_SETMASK, ~[ILL
> TRAP
> > > > ABRT
> > > > > > BUS
> > > > > > > FPE SEGV CONT SYS RTMIN RT_1], [], 8) = 0
> > > > > > > [pid 29445] 10:49:30.276389 time(NULL)  = 1267436970
> > > > > > > [pid 29445] 10:49:30.276715 stat64("/etc/localtime",
> > > > > > {st_mode=S_IFREG|0644,
> > > > > > > st_size=2593, ...}) = 0
> > > > > > > [pid 29445] 10:49:30.276895 write(2, "2010-03-01 10:49:30
> ERROR: pid
> > > > > > 2"...,
> > > > > > > 782010-03-01 10:49:30 ERROR: pid 29445: connection to
> > > > 192.168.1.10(5432)
> > > > > > > failed
> > > > > > > ) = 78
> > > > > > > [...]
> > > > > > >
> > > > > > > As you can see it restarts after 1 min and a half (wich is too
> much).
> > > > It
> > > > > > is
> > > > > > > always the same (without changeing any system values)
> > > > > > >
> > > > > > > If it is necessary I can show more debug lines.
> > > > > > >
> > > > > > > Looking trough the source, we think, maybe it could be a
> problem with
> > > > the
> > > > > > > connection being blocked. Maybe, it would be possible not to
> block it
> > > > > > > (speaking about the socket).
> > > > > > > We suppose something is happening in pool_connection_pool.c
> arround
> > > > line
> > > > > > 473
> > > > > > > ("connect_inet_domain_socket_by_port").
> > > > > > >
> > > > > > > Or maybe I am doing something wrong,... does anybody else
> tested the
> > > > > > > "unpluged wire" ? Is it working?
> > > > > >
> > > > > > What health_check() does here is:
> > > > > >
> > > > > > start alarm (done by caller of health_check)
> > > > > > connect()
> > > > > > write()
> > > > > > read()
> > > > > > :
> > > > > > :
> > > > > >
> > > > > > If the wire is unplugged, one of system calls will be blocked and
> > > > > > eventually alarm interrupt any of connect/write/read and
> health_check
> > > > > > returns with error code. Write() and read() are fine. Problem is,
> > > > > > connect is done by connect_inet_domain_socket_by_port, which does
> > > > > > retry if connect() is interrupted by a system call.
> > > > > >
> > > > > > I belive what you saw was that.
> > > > > > > First it does a connect() wich receives de SIGALARM, and
> continues.
> > > > But
> > > > > > then
> > > > > > > it does another connect(), and this time it does not receive
> any
> > > > > > SIGALARM,
> > > > > > > so, it waits (I think) till the system closes the connection.
> > > > > >
> > > > > > The retry should be turn off if it's called from health_check().
> Will
> > > > > > fix.
> > > > > >
> > > > > > Thanks again for good testing and analysis.
> > > > > > --
> > > > > > Tatsuo Ishii
> > > > > > SRA OSS, Inc. Japan
> > > > > > English: http://www.sraoss.co.jp/index_en.php
> > > > > > Japanese: http://www.sraoss.co.jp
> > > > > >
> > > >
> > _______________________________________________
> > Pgpool-general mailing list
> > Pgpool-general at pgfoundry.org
> > http://pgfoundry.org/mailman/listinfo/pgpool-general
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://pgfoundry.org/pipermail/pgpool-general/attachments/20100308/8d257bb7/attachment-0001.html>


More information about the Pgpool-general mailing list