[pgpool-general: 3313] Re: Watchdog - ifconfig up failed

Tatsuo Ishii ishii at postgresql.org
Fri Nov 28 08:44:25 JST 2014


> Hello all,
> 
> I am facing a strange problem with PGPool 3.3.4 in a 2 nodes cluster
> setup using the watchdog feature.
> It seems that when switching to master status the "ifconfig up" command
> always fails (exit status 127). When initially starting, the same
> command succeeds.
> 
> I have tried several configurations:
> * Using "sudo" in the command with pgpool running as postgres user
> (running the command from the command line is ok)
> * Using direct command with pgpool running as root
> 
> Every time I stop the master Pgpool I get this log on the slave that
> takes over the delegate IP:
> Nov 27 16:13:34 db02 pgpool[27600]: wd_escalation: escalating to master
> pgpool
> Nov 27 16:13:34 db02 pgpool[27600]: exec_ifconfig: 'sudo /bin/ip addr
> add xxx.xxx.xxx.xxx/24 dev bond0' failed. exit status: 127
> Nov 27 16:13:34 db02 pgpool[27600]: wd_IP_up: ifconfig up failed
> Nov 27 16:13:34 db02 pgpool[27600]: wd_declare: send the packet to
> declare the new master
> Nov 27 16:13:34 db02 pgpool[27600]: wd_escalation: escalated to master
> pgpool with some errors
> Nov 27 16:13:34 db02 pgpool[27603]: exec_ping: succeed to ping
> xxxx.xxx.xxx.xxx
> 
> The strange thing is that the IP is correctly assigned to the interface,
> but the arping command is not run, leaving the ARP tables of other hosts
> in a bad state.
> 
> Using strace to diagnose the execv done by Pgpool I see:
> ~~~~
> [pid 30943] wait4(-1, Process 30943 suspended
>  <unfinished ...>
> [pid 30987] <... dup2 resumed> )        = 1
> [pid 30987] close(9)                    = 0
> [pid 30987] execve("/usr/bin/sudo", ["sudo", "/sbin/ip", "addr", "add",
> "xxx.xxx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
> "USER=postgres", "MAIL=/var/mail/postgres",
> "PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
> "LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
> "LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
> ...
> Process 30943 resumed
> Process 30987 detached
> <... wait4 resumed> 0x7fff489bc77c, 0, NULL) = -1 ECHILD (No child
> processes)
> ~~~~
> In the last line the hex address may mean a bad pointer, and the "No
> child processes" error is not the expected behavior, I think.

I am not sure about the hex address, but ECHILD smells like a
bug. Apparently the child process (30987) was there, but the parent did
not see it. A possible theory: because the watchdog uses wait(2) rather
than waitpid(2) to wait for its child, and the watchdog calls wait
several times, one of the wait calls mistakenly caught the SIGCHLD event
for 30987. If the theory is correct, the fix is to use waitpid rather
than wait. The attached one-line patch addresses this. Could you please
try it out?
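
To illustrate the theory, here is a minimal standalone sketch. It is not
the attached patch and not pgpool code; the function names spawn_child,
wait_for_child, late_wait_errno and targeted_exit_code are made up for
this example, and fork error handling is left out. It assumes the process
has no other children and that SIGCHLD has its default disposition.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits at once with exit_code.
 * Returns the child's pid (hypothetical helper, no error handling). */
pid_t spawn_child(int exit_code)
{
    pid_t pid = fork();
    if (pid == 0)
        _exit(exit_code);
    return pid;
}

/* Wait for one specific child with waitpid(2).
 * Returns its exit code, or -1 on error or abnormal termination. */
int wait_for_child(pid_t pid)
{
    int status;
    pid_t r;

    do {
        r = waitpid(pid, &status, 0);
    } while (r == -1 && errno == EINTR);

    if (r == -1 || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}

/* wait(2) reaps ANY child. If earlier wait() calls have already reaped
 * every child, the caller that wanted the command's status gets an error.
 * Returns the errno seen by that late caller (0 if wait succeeded). */
int late_wait_errno(void)
{
    int status;

    spawn_child(0);   /* stands in for the if_up_cmd child */
    spawn_child(0);   /* stands in for some other child */

    wait(NULL);       /* "reap anything" call made elsewhere */
    wait(NULL);       /* ... and a second one */

    errno = 0;
    if (wait(&status) == -1)
        return errno; /* nothing left to wait for */
    return 0;
}

/* waitpid(2) with an explicit pid only returns the child we asked for,
 * so the other child cannot be mistaken for the command. */
int targeted_exit_code(void)
{
    pid_t other = spawn_child(3);
    pid_t cmd = spawn_child(7);
    int code = wait_for_child(cmd);

    wait_for_child(other);   /* reap the remaining child too */
    return code;
}
```

Called from a small main(), late_wait_errno() should return ECHILD, the
same error as in the failing strace output, while targeted_exit_code()
should return 7 regardless of when the other child exits.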

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> When the command succeeds I get:
> ~~~~
> [pid 30793] wait4(-1, Process 30793 suspended
>  <unfinished ...>
> [pid 30886] dup2(10, 1)                 = 1
> [pid 30886] close(8)                    = 0
> [pid 30886] execve("/usr/bin/sudo", ["sudo", "/sbin/ip", "addr", "del",
> "xxx.xxx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
> "USER=postgres", "MAIL=/var/mail/postgres",
> "PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
> "LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
> "LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
> ...
> Process 30793 resumed
> Process 30886 detached
> [pid 30793] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}],
> 0, NULL) = 30886
> ~~~~
> 
> My config related to watchdog:
> use_watchdog                         = true
> trusted_servers                      = 'xxx.xxx.xxx.xxx'
> wd_port                              = 9000
> delegate_IP                          = 'xxx.xxx.xxx.xxx'
> ifconfig_path                        = '/usr/bin'
> arping_path                          = '/usr/bin'
> ping_path                            = '/bin'
> if_up_cmd                            = 'sudo /bin/ip addr add
> xxx.xxx.xxx.xxx/24 dev bond0'
> if_down_cmd                          = 'sudo /bin/ip addr del
> xxx.xxx.xxx.xxx/24 dev bond0'
> arping_cmd                           = 'sudo /usr/bin/arping -w 1 -U -I
> bond0 xxx.xxx.xxx.xxx'
> wd_interval                          = 3
> wd_life_point                        = 3
> wd_lifecheck_query                   = 'SELECT 1'
> wd_lifecheck_dbname                  = 'postgres'
> wd_lifecheck_user                    = 'postgres'
> wd_lifecheck_password                = ''
> wd_hostname                          = 'db01'
> wd_authkey                           = 'xxxxxxxxxxxxx'
> wd_escalation_command                = ''
> wd_lifecheck_method                  = 'heartbeat'
> wd_heartbeat_port                    = 9899
> wd_heartbeat_keepalive               = 2
> wd_heartbeat_deadtime                = 200
> heartbeat_destination0               = 'db02'
> heartbeat_destination_port0          = 9899
> heartbeat_device0                    = ''
> clear_memqcache_on_escalation        = on
> other_pgpool_hostname0               = 'db02'
> other_pgpool_port0                   = 9898
> other_wd_port0                       = 9000
> relcache_expire                      = 0
> 
> Sorry I was a bit long ;)
> 
> Any thoughts anybody ?
> 
> Thanks.
> Regards.
> -- 
> Jérôme
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
-------------- next part --------------
A non-text attachment was scrubbed...
Name: watchdog.patch
Type: text/x-patch
Size: 357 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20141128/10d140d5/attachment.bin>

