[pgpool-general: 1056] Re: watchdog enabled delegate_IP on multiple nodes simultaneously

Lonni J Friedman netllama at gmail.com
Mon Oct 1 04:38:12 JST 2012


No ideas or suggestions?

On Wed, Sep 26, 2012 at 9:05 AM, Lonni J Friedman <netllama at gmail.com> wrote:
> I'm running pgpool-II 3.2.0 on two Linux servers, with use_watchdog=on.
> Two days ago, I noticed that the watchdog had enabled the delegate_IP on
> both servers simultaneously (and it remains in that state as of today).
> This seems like the wrong behavior.  I was under the impression that the
> delegate_IP should be up on only one server at any time.  I suppose this
> might be harmless as long as both servers are otherwise working OK,
> but having it up on both at once seems to defeat the point of enabling
> the IP on one server only when the other is down.
>
> I've verified that both servers are responding by pinging the
> delegate_IP (from a separate system), and checking the HWaddress from
> the 'arp' command.  If I then purge the arp cache and ping again, it
> will eventually show the other server's HWaddress associated with the
> delegate_IP.
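
For reference, the ARP check described above boils down to recording which
hardware address answers for the delegate_IP on each round and flagging the
IP if more than one MAC ever shows up.  A minimal Python sketch of that
bookkeeping follows.  The function names and the sample list are
hypothetical and only illustrate the idea; the two MACs are the eth0
addresses from the ifconfig output quoted further down.

```python
from collections import defaultdict


def macs_per_ip(observations):
    """Group observed MAC addresses by IP address (case-insensitive)."""
    seen = defaultdict(set)
    for ip, mac in observations:
        seen[ip].add(mac.lower())
    return dict(seen)


def conflicting_ips(observations):
    """Return, sorted, every IP that was answered for by more than one MAC."""
    grouped = macs_per_ip(observations)
    return sorted(ip for ip, macs in grouped.items() if len(macs) > 1)


# Illustrative samples only: (IP, MAC) pairs as they might be noted down
# from the arp table after each ping / cache purge.
samples = [
    ("10.31.97.78", "52:54:00:FC:5A:DD"),
    ("10.31.97.78", "00:16:3E:87:6F:43"),
    ("10.31.99.165", "52:54:00:FC:5A:DD"),
]

print(conflicting_ips(samples))  # ['10.31.97.78']
```

An IP that appears in the result is being claimed by two interfaces, which
is the situation reported here for the delegate_IP.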
>
> In the pgpool logs on both servers, I do see the following around the
> time that this happened:
> wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>
> My primary concern is why both servers have the delegate_IP up
> simultaneously when there was clearly some sort of problem that should
> have caused it to be brought down on at least 1 of them.
>
> Both servers can communicate with each other (they can ping each
> other, and I can invoke psql to connect to localhost pgpool from both
> servers).  Here are all the uncommented settings in the WATCHDOG
> section of pgpool.conf (with wd_hostname differing for each server):
> #########
> use_watchdog = on
>                                     # Activates watchdog
> trusted_servers = 'cuda-fs1,cuda-vm0,cuda-fs2'
>                                     # trusted server list which are used
>                                     # to confirm network connection
>                                     # (hostA,hostB,hostC,...)
> delegate_IP = '10.31.97.78'
>                                     # delegate IP address
> wd_hostname = '10.31.99.166'
>                                     # Host name or IP address of this watchdog
> wd_port = 9000
>                                     # port number for watchdog service
> wd_interval = 10
>                                     # lifecheck interval (sec) > 0
> ping_path = '/bin'
>                                     # ping command path
> ifconfig_path = '/sbin'
>                                     # ifconfig command path
> if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.252.0'
>                                     # startup delegate IP command
> if_down_cmd = 'ifconfig eth0:0 down'
>                                     # shutdown delegate IP command
> arping_path = '/usr/sbin'           # arping command path
> arping_cmd = 'arping -U $_IP_$ -w 1'
>                                     # arping command
> wd_life_point = 3
>                                     # lifecheck retry times
> wd_lifecheck_query = 'SELECT 1'
>                                     # lifecheck query to pgpool from watchdog
> other_pgpool_hostname0 = '10.31.99.165'
>                                     # Host name or IP address to connect to for other pgpool 0
> other_pgpool_port0 = 9999
>                                     # Port number for other pgpool 0
> other_wd_port0 = 9000
> #########
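
As a sanity check on the settings above, the netmask in if_up_cmd can be
applied to the delegate_IP and to both watchdog hosts to see whether they
all land in the same network.  The sketch below uses only Python's standard
ipaddress module; the constant and function names are mine, and the
addresses are copied from the configuration.  It only rules out a netmask
mismatch and says nothing about why both nodes escalated.

```python
import ipaddress

NETMASK = "255.255.252.0"                     # netmask used in if_up_cmd
DELEGATE_IP = "10.31.97.78"                   # delegate_IP
WD_HOSTS = ["10.31.99.165", "10.31.99.166"]   # wd_hostname on each server


def network_of(addr, netmask):
    """Return the network containing addr under the given netmask."""
    # strict=False lets us pass a host address and have the host bits masked.
    return ipaddress.ip_network(f"{addr}/{netmask}", strict=False)


delegate_net = network_of(DELEGATE_IP, NETMASK)
host_nets = {network_of(host, NETMASK) for host in WD_HOSTS}
same_subnet = host_nets == {delegate_net}

# Prints: 10.31.96.0/22 10.31.99.255 True
print(delegate_net, delegate_net.broadcast_address, same_subnet)
```

The computed broadcast address matches the Bcast value that ifconfig
reports on both servers.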
>
> Here's ifconfig output from each server.  First 10.31.99.165:
> #########
> eth0      Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
>           inet addr:10.31.99.165  Bcast:10.31.99.255  Mask:255.255.252.0
>           inet6 addr: fe80::5054:ff:fefc:5add/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:4195383952 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:4618638427 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5682945530436 (5.1 TiB)  TX bytes:7468635177326 (6.7 TiB)
>
> eth0:0    Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
>           inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> #########
>
> And 10.31.99.166:
> #########
> eth0      Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
>           inet addr:10.31.99.166  Bcast:10.31.99.255  Mask:255.255.252.0
>           inet6 addr: fe80::216:3eff:fe87:6f43/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:4618130586 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:200152071 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:5702786076291 (5.1 TiB)  TX bytes:15085410736 (14.0 GiB)
>
> eth0:0    Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
>           inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> #########
>
> Here's the content of the pgpool log from 10.31.99.165:
> #########
> 2012-09-24 10:55:34 ERROR: pid 28064: new_connection: create_cp() failed
> 2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:45 ERROR: pid 28088: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:45 ERROR: pid 28088: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:45 ERROR: pid 28088: new_connection: create_cp() failed
> 2012-09-24 10:55:46 ERROR: pid 28086: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:46 ERROR: pid 28086: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:46 ERROR: pid 28086: new_connection: create_cp() failed
> 2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:57 ERROR: pid 28099: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:57 ERROR: pid 28099: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:57 ERROR: pid 28099: new_connection: create_cp() failed
> 2012-09-24 10:55:58 ERROR: pid 27674: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:58 ERROR: pid 27674: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:58 ERROR: pid 27674: new_connection: create_cp() failed
> 2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:56:09 ERROR: pid 28108: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:56:09 ERROR: pid 28108: connection to cuda-db0(5432) failed
> 2012-09-24 10:56:09 ERROR: pid 28108: new_connection: create_cp() failed
> 2012-09-24 10:56:10 ERROR: pid 28106: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:56:10 ERROR: pid 28106: connection to cuda-db0(5432) failed
> 2012-09-24 10:56:10 ERROR: pid 28106: new_connection: create_cp() failed
> 2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool
> 2012-09-24 10:56:24 LOG:   pid 27612: wd_escalation:  escaleted to delegate_IP holder
> 2012-09-24 10:57:04 LOG:   pid 27813: send_failback_request: fail back 0 th node request from pid 27813
> 2012-09-24 10:57:04 ERROR: pid 27596: failover_handler: invalid node_id 0 status:2 MAX_NUM_BACKENDS: 128
> 2012-09-24 10:57:06 LOG:   pid 27813: send_failback_request: fail back 1 th node request from pid 27813
> 2012-09-24 10:57:06 ERROR: pid 27596: failover_handler: invalid node_id 1 status:2 MAX_NUM_BACKENDS: 128
> 2012-09-24 10:57:09 LOG:   pid 27813: send_failback_request: fail back 2 th node request from pid 27813
> 2012-09-24 10:57:09 ERROR: pid 27596: failover_handler: invalid node_id 2 status:2 MAX_NUM_BACKENDS: 128
> #########
>
> and 10.31.99.166:
> #########
> 2012-09-24 10:55:22 ERROR: pid 7192: new_connection: create_cp() failed
> 2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:22 ERROR: pid 7191: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:22 ERROR: pid 7191: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:22 ERROR: pid 7191: new_connection: create_cp() failed
> 2012-09-24 10:55:34 ERROR: pid 7202: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:34 ERROR: pid 7202: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:34 ERROR: pid 7202: new_connection: create_cp() failed
> 2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:34 ERROR: pid 7209: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:34 ERROR: pid 7209: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:34 ERROR: pid 7209: new_connection: create_cp() failed
> 2012-09-24 10:55:46 ERROR: pid 7213: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:46 ERROR: pid 7213: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:46 ERROR: pid 7213: new_connection: create_cp() failed
> 2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:46 ERROR: pid 7221: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:46 ERROR: pid 7221: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:46 ERROR: pid 7221: new_connection: create_cp() failed
> 2012-09-24 10:55:58 ERROR: pid 7223: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:58 ERROR: pid 7223: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:58 ERROR: pid 7223: new_connection: create_cp() failed
> 2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:55:58 ERROR: pid 7231: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:55:58 ERROR: pid 7231: connection to cuda-db0(5432) failed
> 2012-09-24 10:55:58 ERROR: pid 7231: new_connection: create_cp() failed
> 2012-09-24 10:56:10 ERROR: pid 7233: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:56:10 ERROR: pid 7233: connection to cuda-db0(5432) failed
> 2012-09-24 10:56:10 ERROR: pid 7233: new_connection: create_cp() failed
> 2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 2012-09-24 10:56:10 ERROR: pid 7243: connect_inet_domain_socket: connect() failed: Connection refused
> 2012-09-24 10:56:10 ERROR: pid 7243: connection to cuda-db0(5432) failed
> 2012-09-24 10:56:10 ERROR: pid 7243: new_connection: create_cp() failed
> 2012-09-24 10:56:22 LOG:   pid 6724: wd_escalation: escalated to master pgpool
> 2012-09-24 10:56:24 LOG:   pid 6724: wd_escalation:  escaleted to delegate_IP holder
> #########
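
Both logs end with wd_escalation lines carrying the same timestamps, which
is the clearest sign that each node promoted itself.  A small Python sketch
for spotting that across two log files is below.  The regular expression,
the function names, and the 5-second window are my own assumptions, not
anything pgpool provides; the sample lines are the escalation entries from
the two logs, with the mail client's line wrapping undone and the message
text (including its spelling) kept verbatim.

```python
import re
from datetime import datetime

# Matches lines such as:
#   2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to ...
LINE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+):\s+pid (\d+): (.*)$"
)


def escalations(log_text):
    """Return the timestamps of all wd_escalation lines in a log."""
    stamps = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group(4).startswith("wd_escalation:"):
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return stamps


def both_escalated(log_a, log_b, window_seconds=5):
    """True if each log has an escalation within window_seconds of the other."""
    return any(
        abs((a - b).total_seconds()) <= window_seconds
        for a in escalations(log_a)
        for b in escalations(log_b)
    )


node_165 = """\
2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool
2012-09-24 10:56:24 LOG:   pid 27612: wd_escalation:  escaleted to delegate_IP holder
"""
node_166 = """\
2012-09-24 10:56:22 LOG:   pid 6724: wd_escalation: escalated to master pgpool
2012-09-24 10:56:24 LOG:   pid 6724: wd_escalation:  escaleted to delegate_IP holder
"""

print(both_escalated(node_165, node_166))  # True
```

A True result only says that both nodes logged an escalation at about the
same time; it does not identify the cause.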
>
> I'm not sure what information is needed to debug what went wrong.
> Let me know if anything else is needed, and I'll do my best to
> provide it.  Thanks.

