[pgpool-general: 3316] Re: Watchdog - ifconfig up failed

Jérôme Schell jsh at myreseau.org
Sat Nov 29 02:24:22 JST 2014


On 28/11/2014 00:44, Tatsuo Ishii wrote:
>> Using strace to diagnose the execv done by Pgpool I see:
>> ~~~~
>> [pid 30943] wait4(-1, Process 30943 suspended
>>  <unfinished ...>
>> [pid 30987] <... dup2 resumed> )        = 1
>> [pid 30987] close(9)                    = 0
>> [pid 30987] execve("/usr/bin/sudo", ["sudo", "/sbin/ip", "addr", "add",
>> "xxx.xxx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
>> "USER=postgres", "MAIL=/var/mail/postgres",
>> "PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
>> "LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
>> "LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
>> ...
>> Process 30943 resumed
>> Process 30987 detached
>> <... wait4 resumed> 0x7fff489bc77c, 0, NULL) = -1 ECHILD (No child
>> processes)
>> ~~~~
>> In the last line the hex address may mean a bad pointer, and I think
>> "No child processes" is not the expected behavior.
> 
> Not sure about the hex address, but ECHILD smells like a
> bug. Apparently the child process (30987) was there, but the parent did
> not see it. A possible theory is that, because watchdog uses "wait(2)"
> rather than "waitpid(2)" to wait for its child and calls wait several
> times, one of the wait calls mistakenly caught the SIGCHLD event for
> 30987. If the theory is correct, the fix is to use waitpid rather than
> wait. The attached one-line patch addresses this. Could you please try
> it out?

Hello,

Thank you very much for looking at my problem.
Unfortunately the patch does not solve the problem: I still get the
same ECHILD "No child processes" error.
When the command is run by the main pgpool process (when starting) it
works without problem. The problem appears only when it is run by the
watchdog process.
Stracing the watchdog process does not show any other wait call that
could handle the child. Stracing the main pgpool process does not show
anything either.
I have tried simplifying the exec_ifconfig function by removing the
pipe() call and the for(;;) loop, which seemed unnecessary, but without
success so far.
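
For reference, a stripped-down fork/exec/wait helper of the kind described
above could look roughly like the following sketch. This is not pgpool's
exec_ifconfig: run_command is a hypothetical name, the pipe is left out, and
the only loop kept is the retry on EINTR.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper (not pgpool code): run argv[0] with the given
 * arguments and wait for it to finish.
 * Returns the child's exit status (127 if execv failed in the child),
 * or -1 if fork/waitpid failed or the child was killed by a signal.
 */
int run_command(char *const argv[])
{
    pid_t pid = fork();
    if (pid == -1)
        return -1;

    if (pid == 0) {
        /* Child: replace the process image; only returns on failure. */
        execv(argv[0], argv);
        _exit(127);
    }

    /* Parent: wait for this specific child, retrying if interrupted. */
    int status;
    pid_t r;
    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1)
        return -1;  /* errno is ECHILD in the failing case seen here */

    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the default SIGCHLD disposition, waitpid() in such a helper returns the
child's pid, so an ECHILD result points at something outside the helper
itself.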

Strace output with the patch applied:
~~~~
clone(Process 19586 attached
child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
child_tidptr=0x7f1bd37f3a10) = 19586
[pid 19519] close(10)                   = 0
[pid 19586] close(1 <unfinished ...>
[pid 19519] wait4(19586, Process 19519 suspended
 <unfinished ...>
[pid 19586] <... close resumed> )       = 0
[pid 19586] dup2(10, 1)                 = 1
[pid 19586] close(9)                    = 0
[pid 19586] execve("/usr/bin/sudo", ["sudo", "/bin/ip", "addr", "add",
"xxx.xx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
"USER=postgres", "MAIL=/var/mail/postgres",
"PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
"LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
"LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
[pid 19586] brk(0)                      = 0xf18000
[pid 19586] fcntl(0, F_GETFD)           = 0
[pid 19586] fcntl(1, F_GETFD)           = 0
[pid 19586] fcntl(2, F_GETFD)           = 0
...
[pid 19586] munmap(0x7f5277a33000, 2318784) = 0
[pid 19586] munmap(0x7f5277831000, 2101280) = 0
[pid 19586] munmap(0x7f527762f000, 2101296) = 0
[pid 19586] exit_group(0)               = ?
Process 19519 resumed
Process 19586 detached
<... wait4 resumed> 0x7fff6c8fdecc, 0, NULL) = -1 ECHILD (No child
processes)
rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
RT_1], [], 8) = 0
time([1417177341])                      = 1417177341
sendto(3, "<131>Nov 28 12:22:21 pgpool[1951"..., 64, MSG_NOSIGNAL, NULL,
0) = 64
~~~~

Could there be a multi-thread gotcha somewhere?

Thanks.
Regards.
-- 
Jérôme