[pgpool-general: 2512] Re: wd_escalation_command exit code

Fri Jan 31 05:50:37 JST 2014

On Jan 30, 2014, at 8:40 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:

> Hi,
> 
> On Wed, 29 Jan 2014 10:26:00 +0400
> Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
> 
>> Hi!
>> 
>> I'm testing this patch on a vagrant/virtualbox based VM. 
>> 
>> # uname -a
>> Linux lb-node1 3.2.0-55-generic #85-Ubuntu SMP Wed Oct 2 12:29:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> # cat /etc/issue
>> Ubuntu 12.04.3 LTS \n \l
>> 
>> This is the output of ifconfig before starting pgpool:
>> 
>> eth0      Link encap:Ethernet  HWaddr 08:00:27:03:2b:89
>>          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
>>          inet6 addr: fe80::a00:27ff:fe03:2b89/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:7198 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:4853 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:553607 (553.6 KB)  TX bytes:722721 (722.7 KB)
>> 
>> eth1      Link encap:Ethernet  HWaddr 08:00:27:70:46:a0
>>          inet addr:192.168.33.11  Bcast:192.168.33.255  Mask:255.255.255.0
>>          inet6 addr: fe80::a00:27ff:fe70:46a0/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:23682 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:4876 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:2551344 (2.5 MB)  TX bytes:646217 (646.2 KB)
>> 
>> lo        Link encap:Local Loopback
>>          inet addr:127.0.0.1  Mask:255.0.0.0
>>          inet6 addr: ::1/128 Scope:Host
>>          UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>          RX packets:1636 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:1636 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:0
>>          RX bytes:109868 (109.8 KB)  TX bytes:109868 (109.8 KB)
>> 
>> 
>> /etc/pgpool2/pgpool.conf:
>> ...
>> debug_level                   = 9
>> ...
>> delegate_IP                   = '192.168.33.200'
>> ...
>> ifconfig_path                 = '/sbin'
>> if_up_cmd                     = 'ifconfig eth1:0 $_IP_$ netmask 255.255.255.0'
>> if_down_cmd                   = 'ifconfig eth1:0 down'
>> ...
>> 
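
For readers following the config above: as far as I understand it, pgpool runs the command from ifconfig_path and replaces $_IP_$ with the delegate_IP value. The sketch below only illustrates that expansion. build_if_cmd is a hypothetical helper, not pgpool's actual code, and it handles just the first placeholder.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not taken from the pgpool source): writes
 * "<path>/<cmd>" into out, replacing the first "$_IP_$" in cmd with ip.
 * Returns 0 on success, -1 if the result does not fit in out.
 */
int build_if_cmd(char *out, size_t outlen, const char *path,
                 const char *cmd, const char *ip)
{
    static const char placeholder[] = "$_IP_$";
    const char *hit;
    int n;

    assert(out != NULL && path != NULL && cmd != NULL && ip != NULL);

    hit = strstr(cmd, placeholder);
    if (hit == NULL)
        /* No placeholder (e.g. if_down_cmd): only prepend the path. */
        n = snprintf(out, outlen, "%s/%s", path, cmd);
    else
        /* Text before the placeholder, then the IP, then the rest. */
        n = snprintf(out, outlen, "%s/%.*s%s%s", path,
                     (int)(hit - cmd), cmd, ip,
                     hit + strlen(placeholder));

    /* snprintf reports truncation by returning n >= outlen. */
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With the settings quoted above, the if_up_cmd would expand to "/sbin/ifconfig eth1:0 192.168.33.200 netmask 255.255.255.0", and the if_down_cmd to "/sbin/ifconfig eth1:0 down".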
>> 
>> Once I start pgpool I get the following ifconfig output
>> 
>> 
>> eth0      Link encap:Ethernet  HWaddr 08:00:27:03:2b:89
>>          inet addr:10.0.2.15  Bcast:10.0.2.255  Mask:255.255.255.0
>>          inet6 addr: fe80::a00:27ff:fe03:2b89/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:7939 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:5404 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:606232 (606.2 KB)  TX bytes:816924 (816.9 KB)
>> 
>> eth1      Link encap:Ethernet  HWaddr 08:00:27:70:46:a0
>>          inet addr:192.168.33.11  Bcast:192.168.33.255  Mask:255.255.255.0
>>          inet6 addr: fe80::a00:27ff:fe70:46a0/64 Scope:Link
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>          RX packets:25179 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:5204 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:1000
>>          RX bytes:2704567 (2.7 MB)  TX bytes:690834 (690.8 KB)
>> 
>> eth1:0    Link encap:Ethernet  HWaddr 08:00:27:70:46:a0
>>          inet addr:192.168.33.200  Bcast:192.168.33.255  Mask:255.255.255.0
>>          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> 
>> lo        Link encap:Local Loopback
>>          inet addr:127.0.0.1  Mask:255.0.0.0
>>          inet6 addr: ::1/128 Scope:Host
>>          UP LOOPBACK RUNNING  MTU:16436  Metric:1
>>          RX packets:1745 errors:0 dropped:0 overruns:0 frame:0
>>          TX packets:1745 errors:0 dropped:0 overruns:0 carrier:0
>>          collisions:0 txqueuelen:0
>>          RX bytes:117264 (117.2 KB)  TX bytes:117264 (117.2 KB)
>> 
>> 
>> 
>> # ping 192.168.33.200
>> PING 192.168.33.200 (192.168.33.200) 56(84) bytes of data.
>> 64 bytes from 192.168.33.200: icmp_req=1 ttl=64 time=0.060 ms
>> ^C
>> --- 192.168.33.200 ping statistics ---
>> 1 packets transmitted, 1 received, 0% packet loss, time 0ms
>> rtt min/avg/max/mdev = 0.060/0.060/0.060/0.000 ms
>> 
>> 
>> And these are some messages from pgpool.log:
>> 
>> pgpool[4152]: wd_chk_setuid all commands have setuid bit
>> pgpool[4152]: watchdog might call network commands which using setuid bit.
>> pgpool[4152]: exec_ping: failed to ping 192.168.33.200
>> pgpool[4152]: wd_escalation: escalating to master pgpool
>> pgpool[4152]: wd_IP_up: ifconfig up failed
>> pgpool[4152]: wd_declare: send the packet to declare the new master
>> pgpool[4152]: wd_escalation: escalated to master pgpool with some errors
> 
> That's odd. The log says "failed to ping", but the VIP was in fact brought up.
> There may be a delay between ifconfig and the ping succeeding. However, pgpool
> should try to ping up to three times until it succeeds, and in this case it
> tried only once.
> 
> For analysis, I would appreciate it if you would apply the attached patch and
> send the log output messages.
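
To make the expected behaviour described above concrete (ping up to three times, stopping at the first success), here is a minimal sketch of such a retry loop. The names retry_probe and fail_n_times are my own, and the probe is a stand-in for the real ping call. This is not the code from wd_if.c or wd_ping.c.

```c
#include <assert.h>
#include <unistd.h>

/* A probe returns 0 on success and non-zero on failure. */
typedef int (*probe_fn)(void *ctx);

/*
 * Calls probe up to max_tries times, sleeping delay_sec between attempts.
 * Returns the 1-based attempt number that succeeded, or 0 if all failed.
 */
int retry_probe(probe_fn probe, void *ctx, int max_tries, unsigned delay_sec)
{
    int i;

    assert(probe != NULL && max_tries > 0);

    for (i = 1; i <= max_tries; i++) {
        if (probe(ctx) == 0)
            return i;
        /* Give the interface time to come up, but not after the last try. */
        if (i < max_tries && delay_sec > 0)
            sleep(delay_sec);
    }
    return 0;
}

/*
 * Stand-in probe for illustration only: fails while the counter behind
 * ctx is positive, then succeeds. A real probe would run ping instead.
 */
int fail_n_times(void *ctx)
{
    int *left = ctx;

    if (*left > 0) {
        (*left)--;
        return 1;
    }
    return 0;
}
```

If the probe fails twice and then succeeds, retry_probe(fail_n_times, &remaining, 3, 0) returns 3. My logs show only a single "failed to ping" line, which matches your observation that the ping is attempted just once here.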

start:

pgpool[2493]: num_backends: 2 total_weight: 2.000000
pgpool[2493]: backend 0 weight: 1073741823.500000
pgpool[2493]: backend 0 flag: 0000
pgpool[2493]: backend 1 weight: 1073741823.500000
pgpool[2493]: backend 1 flag: 0000
pgpool[2493]: loading "/etc/pgpool2/pool_hba.conf" for client authentication configuration file
pgpool[2493]: wd_chk_setuid all commands have setuid bit
pgpool[2493]: watchdog might call network commands which using setuid bit.
pgpool[2493]: Backend status file /var/log/postgresql/pgpool_status discarded
pgpool[2493]: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
pgpool[2493]: send_packet_4_nodes: packet for lb-node2.site:9000 is canceled
pgpool[2493]: exec_ping: failed to ping 192.168.33.200: exit code 1
pgpool[2493]: wd_escalation: escalating to master pgpool
pgpool[2493]: wd_IP_up: ifconfig up failed
pgpool[2493]: wd_declare: send the packet to declare the new master
pgpool[2493]: wd_escalation: escalated to master pgpool with some errors
pgpool[2493]: wd_init: start watchdog

stop:

pgpool[2504]: wd_IP_down: not delegate IP holder
pgpool[2502]: hb_receiver child receives shutdown request signal 2
pgpool[2503]: hb_sender child receives shutdown request signal 2
pgpool[2589]: child received shutdown request signal 2
pgpool[2493]: shmem_exit(0)

BTW, when I start/stop the unpatched 3.3.2 version I see the same messages about ping failure, but everything works well in that case.

unpatched start:

pgpool[7189]: exec_ping: failed to ping 192.168.33.200
pgpool[7189]: wd_escalation: escalating to master pgpool
pgpool[7189]: wd_declare: send the packet to declare the new master
pgpool[7189]: wd_escalation: escalated to master pgpool successfully

unpatched stop:

pgpool[7198]: hb_receiver child receives shutdown request signal 2
pgpool[7199]: hb_sender child receives shutdown request signal 2
pgpool[7200]: exec_ping: failed to ping 192.168.33.200
pgpool[7200]: wd_IP_down: ifconfig down succeeded
pgpool[7189]: shmem_exit(0)

> 
>> 
>> 
>> When I stop pgpool I get the following messages in pgpool.log:
>> 
>> pgpool[4163]: wd_IP_down: not delegate IP holder
>> pgpool[4161]: hb_receiver child receives shutdown request signal 2
>> pgpool[4162]: hb_sender child receives shutdown request signal 2
>> pgpool[4152]: shmem_exit(0)
>> 
>> 
>> 
>> 
>> 
>> On Jan 29, 2014, at 6:42 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
>> 
>>> On Tue, 28 Jan 2014 23:03:20 +0400
>>> Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>> 
>>>> Hi!
>>>> This patch applied successfully. But now there is a new problem. When I start the pgpool service I get a new interface eth0:0 with the failover IP address assigned, as expected. But when I stop the pgpool service, eth0:0 won't go down. It remains even after pgpool has shut down completely.
>>> 
>>> Odd, I can't reproduce this. Are there any error messages?
>>> What ifconfig command do you use?
>>> 
>>>> 
>>>> I tried 3.3.2 without this patch and everything worked well. 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Jan 27, 2014, at 5:18 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
>>>> 
>>>>> On Sat, 25 Jan 2014 15:31:44 +0400
>>>>> Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>>>> 
>>>>>> 
>>>>>> On Jan 24, 2014, at 1:25 PM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
>>>>>> 
>>>>>>> On Tue, 21 Jan 2014 15:24:02 +0400
>>>>>>> Sergey Arlashin <sergeyarl.maillist at gmail.com> wrote:
>>>>>>> 
>>>>>>>> Great! Now it is working!
>>>>>>>> 
>>>>>>>> pgpool[31903]: wd_escalation: escalation command failed. exit status: 1
>>>>>>>> 
>>>>>>>> Thank you!
>>>>>>>> 
>>>>>>>> Will this patch be included in 3.3.3 ?
>>>>>>>> 
>>>>>>>> Also, what about failed if_up_cmd and further pgpool behaviour (my second message in the thread.) ?
>>>>>>> 
>>>>>>> I attached the patch. Could you try this? In this fix, pgpool outputs an error 
>>>>>>> message for if_up_cmd failure. This patch should be applied after the previous
>>>>>>> patch. This fix will be included in 3.3.3.
>>>>>> 
>>>>>> 
>>>>>> Hi!
>>>>>> 
>>>>>> I tried to apply the patch against both 3.3.1 and 3.3.2
>>>>>> 
>>>>>> this is what I got:
>>>>> 
>>>>> Hmm.. Could you try the attached patch to 3.3.2? This includes all the fixes
>>>>> for escalation command and ifconfig errors.
>>>>> 
>>>>>> 
>>>>>> node1:~/pgpool-orig# patch -p1 < /root/op/esc.patch
>>>>>> 
>>>>>> patching file src/watchdog/wd_packet.c
>>>>>> Hunk #1 succeeded at 954 (offset 23 lines).
>>>>>> 
>>>>>> node1:~/pgpool-orig# patch -p1 < /root/op/ifup.patch
>>>>>> 
>>>>>> patching file src/watchdog/wd_if.c
>>>>>> Hunk #1 succeeded at 42 with fuzz 1 (offset 3 lines).
>>>>>> Hunk #2 succeeded at 62 (offset 3 lines).
>>>>>> Hunk #3 succeeded at 117 (offset 3 lines).
>>>>>> patching file src/watchdog/wd_packet.c
>>>>>> Hunk #1 succeeded at 654 (offset 23 lines).
>>>>>> Hunk #2 succeeded at 939 (offset 23 lines).
>>>>>> Hunk #3 FAILED at 932.
>>>>>> Hunk #4 succeeded at 976 (offset 18 lines).
>>>>>> 1 out of 4 hunks FAILED -- saving rejects to file src/watchdog/wd_packet.c.rej
>>>>>> 
>>>>>> 
>>>>>> src/watchdog/wd_packet.c.rej:
>>>>>> 
>>>>>> 
>>>>>> --- src/watchdog/wd_packet.c
>>>>>> +++ src/watchdog/wd_packet.c
>>>>>> @@ -932,22 +933,31 @@
>>>>>> 	/* execute escalation command */
>>>>>> 	if (strlen(pool_config->wd_escalation_command))
>>>>>> 	{
>>>>>> -		int r;
>>>>>> 		r = system(pool_config->wd_escalation_command);
>>>>>> 		if (WIFEXITED(r))
>>>>>> 		{
>>>>>> 			if (WEXITSTATUS(r) == EXIT_SUCCESS)
>>>>>> 				pool_log("wd_escalation: escalation command succeeded");
>>>>>> 			else
>>>>>> +			{
>>>>>> 				pool_error("wd_escalation: escalation command failed. exit status: %d", WEXITSTATUS(r));
>>>>>> +				has_error = true;
>>>>>> +			}
>>>>>> 		}
>>>>>> 		else
>>>>>> +		{
>>>>>> 			pool_error("wd_escalation: escalation command exit abnormally");
>>>>>> +			has_error = true;
>>>>>> +		}
>>>>>> 	}
>>>>>> 
>>>>>> 	/* interface up as delegate IP */
>>>>>> 	if (strlen(pool_config->delegate_IP) != 0)
>>>>>> -		wd_IP_up();
>>>>>> +	{
>>>>>> +		r = wd_IP_up();
>>>>>> +		if (r == WD_NG)
>>>>>> +			has_error = true;
>>>>>> +	}
>>>>>> 
>>>>>> 	/* set master status to the wd list */
>>>>>> 	wd_set_wd_list(pool_config->wd_hostname, pool_config->port,
>>>>>> 
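
The rejected hunk above relies on how the return value of system() is decoded: WIFEXITED tells whether the command exited normally, and only then does WEXITSTATUS give its exit code. A minimal standalone sketch of that pattern follows. run_and_get_exit_code is a hypothetical name and is not part of the patch.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/wait.h>

/*
 * Runs cmd through the shell with system().
 * Returns the exit code (0..255) if the command exited normally,
 * or -1 if system() itself failed or the command was killed by a signal.
 */
int run_and_get_exit_code(const char *cmd)
{
    int status;

    assert(cmd != NULL);

    status = system(cmd);
    if (status == -1)
        return -1;                  /* the child could not be created */

    if (WIFEXITED(status))
        return WEXITSTATUS(status); /* normal exit: report its code */

    return -1;                      /* abnormal termination */
}
```

A script that ends with "exit 1" makes this return 1, which corresponds to the "escalation command failed. exit status: 1" line reported earlier in the thread.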
>>>>>> 
>>>>>>> 
>>>>>>> In addition, I think that pgpool should go into down status when if_up_cmd fails, 
>>>>>>> since it is then worthless as a member of the watchdog cluster. I'll make this fix for
>>>>>>> either 3.3.3 or 3.4.0.
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> Sounds reasonable. 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> -- 
>>>>> Yugo Nagata <nagata at sraoss.co.jp>
>>>>> <escalation_error_all.patch>
>>>> 
>>> 
>>> 
>>> -- 
>>> Yugo Nagata <nagata at sraoss.co.jp>
>> 
> 
> 
> -- 
> Yugo Nagata <nagata at sraoss.co.jp>
> <escalation_error_all_for_analysis.patch>