[pgpool-general: 3317] Re: Watchdog - ifconfig up failed

Tatsuo Ishii ishii at postgresql.org
Sat Nov 29 08:18:42 JST 2014


> On 28/11/2014 00:44, Tatsuo Ishii wrote:
>>> Using strace to diagnose the execv done by Pgpool I see:
>>> ~~~~
>>> [pid 30943] wait4(-1, Process 30943 suspended
>>>  <unfinished ...>
>>> [pid 30987] <... dup2 resumed> )        = 1
>>> [pid 30987] close(9)                    = 0
>>> [pid 30987] execve("/usr/bin/sudo", ["sudo", "/sbin/ip", "addr", "add",
>>> "xxx.xxx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
>>> "USER=postgres", "MAIL=/var/mail/postgres",
>>> "PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
>>> "LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
>>> "LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
>>> ...
>>> Process 30943 resumed
>>> Process 30987 detached
>>> <... wait4 resumed> 0x7fff489bc77c, 0, NULL) = -1 ECHILD (No child
>>> processes)
>>> ~~~~
>>> In the last line the hex address may indicate a bad pointer, and the "No
>>> child processes" result is not the expected behavior, I think.
>> 
>> Not sure about the hex address, but ECHILD smells like a
>> bug. Apparently the child process (30987) was there, but the parent did
>> not see it. A possible theory: because watchdog uses "wait(2)" rather
>> than "waitpid(2)" to wait for its child, and watchdog calls wait several
>> times, one of the wait calls mistakenly caught the SIGCHLD event for
>> 30987. If the theory is correct, the fix is to use waitpid rather than
>> wait. The attached one-line patch addresses this. Could you please try it
>> out?
> 
> Hello,
> 
> Thank you very much for looking at my problem.
> Unfortunately the patch does not solve the problem; I still get the
> same ECHILD "No child processes".
> When the command is run by the main pgpool process (when starting) it
> works without problem. The problem appears only when it is run by the
> watchdog process.
> Stracing the watchdog process does not show any other wait call that
> could handle the child. Stracing the main pgpool process does not show
> anything either.
> I have tried simplifying the exec_ifconfig function by removing pipe()
> or for(;;) that seemed useless but without success for now.
> 
> Strace with the patch applied
> ~~~~
> clone(Process 19586 attached
> child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD,
> child_tidptr=0x7f1bd37f3a10) = 19586
> [pid 19519] close(10)                   = 0
> [pid 19586] close(1 <unfinished ...>
> [pid 19519] wait4(19586, Process 19519 suspended
>  <unfinished ...>
> [pid 19586] <... close resumed> )       = 0
> [pid 19586] dup2(10, 1)                 = 1
> [pid 19586] close(9)                    = 0
> [pid 19586] execve("/usr/bin/sudo", ["sudo", "/bin/ip", "addr", "add",
> "xxx.xx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
> "USER=postgres", "MAIL=/var/mail/postgres",
> "PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
> "LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
> "LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
> [pid 19586] brk(0)                      = 0xf18000
> [pid 19586] fcntl(0, F_GETFD)           = 0
> [pid 19586] fcntl(1, F_GETFD)           = 0
> [pid 19586] fcntl(2, F_GETFD)           = 0
> ...
> [pid 19586] munmap(0x7f5277a33000, 2318784) = 0
> [pid 19586] munmap(0x7f5277831000, 2101280) = 0
> [pid 19586] munmap(0x7f527762f000, 2101296) = 0
> [pid 19586] exit_group(0)               = ?
> Process 19519 resumed
> Process 19586 detached
> <... wait4 resumed> 0x7fff6c8fdecc, 0, NULL) = -1 ECHILD (No child
> processes)
> rt_sigprocmask(SIG_SETMASK, ~[ILL TRAP ABRT BUS FPE SEGV CONT SYS RTMIN
> RT_1], [], 8) = 0
> time([1417177341])                      = 1417177341
> sendto(3, "<131>Nov 28 12:22:21 pgpool[1951"..., 64, MSG_NOSIGNAL, NULL,
> 0) = 64
> ~~~~

Hmm. waitpid waits for child 19586, and the process is definitely
there. Very strange.

> Could there be a multi-thread gotcha somewhere?

Yeah, could be. I will follow up if another idea comes to me.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Thanks.
> Regards.
> -- 
> Jérôme

