[pgpool-general: 3310] Watchdog - ifconfig up failed

Fri Nov 28 01:38:37 JST 2014

Hello all,

I am facing a strange problem with pgpool-II 3.3.4 in a two-node cluster
using the watchdog feature.
When a node escalates to master, the "ifconfig up" command (if_up_cmd)
always fails with exit status 127. At initial startup, the same command
succeeds.

I have tried two configurations, with the same result:
* Using "sudo" in the command, with pgpool running as the postgres user
(running the same command by hand from the command line works)
* Using the command directly, with pgpool running as root

Every time I stop the master pgpool, I get this log on the slave that
takes over the delegate IP:
Nov 27 16:13:34 db02 pgpool[27600]: wd_escalation: escalating to master
pgpool
Nov 27 16:13:34 db02 pgpool[27600]: exec_ifconfig: 'sudo /bin/ip addr
add xxx.xxx.xxx.xxx/24 dev bond0' failed. exit status: 127
Nov 27 16:13:34 db02 pgpool[27600]: wd_IP_up: ifconfig up failed
Nov 27 16:13:34 db02 pgpool[27600]: wd_declare: send the packet to
declare the new master
Nov 27 16:13:34 db02 pgpool[27600]: wd_escalation: escalated to master
pgpool with some errors
Nov 27 16:13:34 db02 pgpool[27603]: exec_ping: succeed to ping
xxxx.xxx.xxx.xxx

The strange thing is that the IP is correctly assigned to the interface,
but the arping command is never run, which leaves the ARP tables of the
other hosts in a stale state.
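
This is part of why the 127 puzzles me. As far as I know, 127 is the
status a POSIX shell reports when it cannot find the command, which
does not fit a command that evidently ran. A minimal Python sketch of
that convention (the command name is made up for the demo, and I am
not claiming pgpool goes through a shell here):

```python
import subprocess


def shell_exit_status(command_line):
    """Run command_line through sh and return its exit status."""
    completed = subprocess.run(
        ["sh", "-c", command_line],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return completed.returncode


# Hypothetical command name, chosen so that it does not exist.
print(shell_exit_status("no-such-command-for-this-demo"))
# A command that exists and succeeds, for comparison.
print(shell_exit_status("true"))
```

So the 127 in the log may not be the real exit status of "ip" at all.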

Tracing the exec done by pgpool with strace, I see this when it fails:
~~~~
[pid 30943] wait4(-1, Process 30943 suspended
 <unfinished ...>
[pid 30987] <... dup2 resumed> )        = 1
[pid 30987] close(9)                    = 0
[pid 30987] execve("/usr/bin/sudo", ["sudo", "/sbin/ip", "addr", "add",
"xxx.xxx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
"USER=postgres", "MAIL=/var/mail/postgres",
"PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
"LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
"LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
...
Process 30943 resumed
Process 30987 detached
<... wait4 resumed> 0x7fff489bc77c, 0, NULL) = -1 ECHILD (No child
processes)
~~~~
In the last line, the hexadecimal address might indicate a bad pointer,
and I don't think "No child processes" (ECHILD) is the expected result.
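
For what it's worth, one way I know of to get ECHILD from a wait call
on Linux is when the child has already been reaped, for instance
because the parent ignores SIGCHLD. I have NOT confirmed that this is
what pgpool does during escalation; the sketch below only reproduces
the symptom under that assumption. The function name is mine.

```python
import errno
import os
import signal


def wait_for_child(ignore_sigchld):
    """Fork a child that exits 0, then wait for it.

    Returns ("ok", exit_status) on success or ("error", errno_name).
    """
    previous = signal.getsignal(signal.SIGCHLD)
    if previous is None:
        # Handler was not installed from Python; fall back to default.
        previous = signal.SIG_DFL
    # Assumption for illustration: with SIG_IGN the kernel reaps the
    # child itself, so there is nothing left for the parent to wait on.
    disposition = signal.SIG_IGN if ignore_sigchld else signal.SIG_DFL
    signal.signal(signal.SIGCHLD, disposition)
    try:
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # child: exit immediately with status 0
        try:
            _, status = os.waitpid(pid, 0)
            return ("ok", os.WEXITSTATUS(status))
        except ChildProcessError as exc:
            return ("error", errno.errorcode[exc.errno])
    finally:
        # Restore whatever disposition the parent had before.
        signal.signal(signal.SIGCHLD, previous)


print(wait_for_child(ignore_sigchld=False))
print(wait_for_child(ignore_sigchld=True))
```

If something like this happens in the escalation path, the command
itself could succeed while pgpool still sees a failure.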

When the command succeeds I get:
~~~~
[pid 30793] wait4(-1, Process 30793 suspended
 <unfinished ...>
[pid 30886] dup2(10, 1)                 = 1
[pid 30886] close(8)                    = 0
[pid 30886] execve("/usr/bin/sudo", ["sudo", "/sbin/ip", "addr", "del",
"xxx.xxx.xx.xx/24", "dev", "bond0"], ["SHELL=/bin/bash", "TERM=screen",
"USER=postgres", "MAIL=/var/mail/postgres",
"PATH=/usr/local/bin:/usr/bin:/bi"..., "PWD=/var/lib/postgresql",
"LANG=en_US.UTF-8", "SHLVL=1", "HOME=/var/lib/postgresql",
"LANGUAGE=en_US:en", "LOGNAME=postgres", "_=/usr/sbin/pgpool"]) = 0
...
Process 30793 resumed
Process 30886 detached
[pid 30793] <... wait4 resumed> [{WIFEXITED(s) && WEXITSTATUS(s) == 0}],
0, NULL) = 30886
~~~~

My watchdog-related configuration (from db02; wd_hostname and the
other_pgpool settings are swapped on the other node):
use_watchdog                         = true
trusted_servers                      = 'xxx.xxx.xxx.xxx'
wd_port                              = 9000
delegate_IP                          = 'xxx.xxx.xxx.xxx'
ifconfig_path                        = '/usr/bin'
arping_path                          = '/usr/bin'
ping_path                            = '/bin'
if_up_cmd                            = 'sudo /bin/ip addr add
xxx.xxx.xxx.xxx/24 dev bond0'
if_down_cmd                          = 'sudo /bin/ip addr del
xxx.xxx.xxx.xxx/24 dev bond0'
arping_cmd                           = 'sudo /usr/bin/arping -w 1 -U -I
bond0 xxx.xxx.xxx.xxx'
wd_interval                          = 3
wd_life_point                        = 3
wd_lifecheck_query                   = 'SELECT 1'
wd_lifecheck_dbname                  = 'postgres'
wd_lifecheck_user                    = 'postgres'
wd_lifecheck_password                = ''
wd_hostname                          = 'db01'
wd_authkey                           = 'xxxxxxxxxxxxx'
wd_escalation_command                = ''
wd_lifecheck_method                  = 'heartbeat'
wd_heartbeat_port                    = 9899
wd_heartbeat_keepalive               = 2
wd_heartbeat_deadtime                = 200
heartbeat_destination0               = 'db02'
heartbeat_destination_port0          = 9899
heartbeat_device0                    = ''
clear_memqcache_on_escalation        = on
other_pgpool_hostname0               = 'db02'
other_pgpool_port0                   = 9898
other_wd_port0                       = 9000
relcache_expire                      = 0

Sorry for the long message ;)

Any thoughts, anybody?

Thanks.
Regards.
-- 
Jérôme