[pgpool-general: 1089] Re: watchdog enabled delegate_IP on multiple nodes simultaneously

Tatsuo Ishii ishii at postgresql.org
Tue Oct 16 16:45:56 JST 2012

Our investigation showed that the problem could occur if pgpool is
running in RAW mode (i.e. neither replication mode nor master slave
mode is on). Is this your case?
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
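
For the RAW-mode question above, one way to check is to look at the two
mode switches in pgpool.conf. This is only a rough Python sketch: the
helper names parse_pgpool_conf and is_raw_mode are made up for
illustration, the parser handles only simple "name = value" lines, and
treating a missing parameter as off is an assumption of the sketch.

```python
import re


def parse_pgpool_conf(text):
    """Return {name: value} for simple `name = value` lines.

    Simplification: everything after '#' is treated as a comment, so a
    value that itself contains '#' would be cut short.
    """
    settings = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        match = re.match(r"^(\w+)\s*=\s*(.*)$", line)
        if match:
            settings[match.group(1)] = match.group(2).strip().strip("'\"")
    return settings


def is_raw_mode(settings):
    """True when neither replication mode nor master slave mode is on."""
    enabled = {"on", "true", "yes", "1"}
    # Assumption: a parameter that is absent counts as off.
    replication = settings.get("replication_mode", "off").lower() in enabled
    master_slave = settings.get("master_slave_mode", "off").lower() in enabled
    return not replication and not master_slave
```

Usage would be something like is_raw_mode(parse_pgpool_conf(open(path).read())),
where path points at the pgpool.conf of the node being checked.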

> Any new status on this investigation?
> On Mon, Oct 1, 2012 at 7:10 AM, Lonni J Friedman <netllama at gmail.com> wrote:
>> Thanks for your reply.  Let me know if you need any other info from me.
>> On Mon, Oct 1, 2012 at 2:42 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
>>> I'm investigating the cause of this.
>>> From the log, it seems that both pgpools temporarily failed to connect
>>> to the DB (cuda-db0). During this period the services of both pgpools
>>> were down. After the connection to the DB recovered, each pgpool might
>>> have considered the other still down, and escalated to active server
>>> (VIP holder).
>>> I'm going to reproduce it.
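
The sequence suspected above can be written down as a toy model. This is
a sketch of the hypothesis only, not pgpool's actual watchdog code: the
Node class, the stale peer_alive flag and the escalation rule are all
assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    alive: bool = True        # did this node's own lifecheck succeed?
    peer_alive: bool = True   # this node's last recorded view of its peer
    holds_vip: bool = False


def lifecheck(node, db_up):
    """One lifecheck round in the toy model."""
    node.alive = db_up
    # Assumed flaw: the peer is marked down while the DB is unreachable,
    # and that view is never refreshed once the DB comes back.
    if not db_up:
        node.peer_alive = False


def maybe_escalate(node):
    """A node takes the VIP when it is alive and believes its peer is down."""
    if node.alive and not node.peer_alive:
        node.holds_vip = True


def simulate(db_states=(False, True)):
    """Run both nodes through the given DB states; return who holds the VIP."""
    a, b = Node("a"), Node("b")
    for db_up in db_states:
        lifecheck(a, db_up)
        lifecheck(b, db_up)
        maybe_escalate(a)
        maybe_escalate(b)
    return a.holds_vip, b.holds_vip
```

In this model, a DB outage followed by recovery leaves both nodes holding
the VIP, while a run with no outage leaves neither node escalating.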
>>> On Sun, 30 Sep 2012 12:38:12 -0700
>>> Lonni J Friedman <netllama at gmail.com> wrote:
>>>> No ideas or suggestions?
>>>> On Wed, Sep 26, 2012 at 9:05 AM, Lonni J Friedman <netllama at gmail.com> wrote:
>>>> > I'm running 3.2.0 on two Linux servers, with use_watchdog=on.  Two
>>>> > days ago, I noticed that the watchdog enabled the delegate_IP on both
>>>> > servers simultaneously (and remains in that state as of today).  This
>>>> > seems like the wrong behavior.  I was under the impression that the
>>>> > delegate_IP should be up on only 1 server at any time?  I suppose this
>>>> > might be harmless, as long as both servers are otherwise working ok,
>>>> > but it seems to defeat the point of enabling the IP only when the
>>>> > other server is down, if they're both up simultaneously.
>>>> >
>>>> > I've verified that both servers are responding by pinging the
>>>> > delegate_IP (from a separate system), and checking the HWaddress from
>>>> > the 'arp' command.  If I then purge the arp cache and ping again, it
>>>> > will eventually show the other server's HWaddress associated with the
>>>> > delegate_IP.
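
The manual arp check described above could be scripted roughly as
follows. This is a sketch, assuming the arp output has been captured as
text after each ping (for example once before and once after purging the
cache); the function names macs_seen and vip_is_duplicated are
hypothetical.

```python
import re

# Matches a colon-separated MAC address such as 52:54:00:FC:5A:DD.
MAC_RE = re.compile(r"\b(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}\b")


def macs_seen(arp_outputs):
    """Collect the distinct MAC addresses found across several arp outputs."""
    seen = set()
    for output in arp_outputs:
        seen.update(mac.lower() for mac in MAC_RE.findall(output))
    return seen


def vip_is_duplicated(arp_outputs):
    """True if more than one hardware address answered for the same IP."""
    return len(macs_seen(arp_outputs)) > 1
```

Each captured output should contain only the entry for the delegate_IP,
otherwise MAC addresses of unrelated hosts would be counted as well.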
>>>> >
>>>> > In the pgpool logs on both servers, I do see the following around the
>>>> > time that this happened:
>>>> > wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> >
>>>> > My primary concern is why both servers have the delegate_IP up
>>>> > simultaneously when there was clearly some sort of problem that should
>>>> > have caused it to be brought down on at least 1 of them.
>>>> >
>>>> > Both servers can communicate with each other (they can ping each
>>>> > other, and I can invoke psql to connect to localhost pgpool from both
>>>> > servers).  Here are all the uncommented settings in the WATCHDOG
>>>> > section of pgpool.conf (with wd_hostname differing for each server):
>>>> > #########
>>>> > use_watchdog = on
>>>> >                                     # Activates watchdog
>>>> > trusted_servers = 'cuda-fs1,cuda-vm0,cuda-fs2'
>>>> >                                     # trusted server list which are used
>>>> >                                     # to confirm network connection
>>>> >                                     # (hostA,hostB,hostC,...)
>>>> > delegate_IP = ''
>>>> >                                     # delegate IP address
>>>> > wd_hostname = ''
>>>> >                                     # Host name or IP address of this watchdog
>>>> > wd_port = 9000
>>>> >                                     # port number for watchdog service
>>>> > wd_interval = 10
>>>> >                                     # lifecheck interval (sec) > 0
>>>> > ping_path = '/bin'
>>>> >                                     # ping command path
>>>> > ifconfig_path = '/sbin'
>>>> >                                     # ifconfig command path
>>>> > if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask'
>>>> >                                     # startup delegate IP command
>>>> > if_down_cmd = 'ifconfig eth0:0 down'
>>>> >                                     # shutdown delegate IP command
>>>> > arping_path = '/usr/sbin'           # arping command path
>>>> > arping_cmd = 'arping -U $_IP_$ -w 1'
>>>> >                                     # arping command
>>>> > wd_life_point = 3
>>>> >                                     # lifecheck retry times
>>>> > wd_lifecheck_query = 'SELECT 1'
>>>> >                                     # lifecheck query to pgpool from watchdog
>>>> > other_pgpool_hostname0 = ''
>>>> >                                     # Host name or IP address to
>>>> >                                     # connect to for other pgpool 0
>>>> > other_pgpool_port0 = 9999
>>>> >                                     # Port number for other pgpool 0
>>>> > other_wd_port0 = 9000
>>>> > #########
>>>> >
>>>> > Here's ifconfig output from each server.  First
>>>> > #########
>>>> > eth0      Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
>>>> >           inet addr:  Bcast:  Mask:
>>>> >           inet6 addr: fe80::5054:ff:fefc:5add/64 Scope:Link
>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>> >           RX packets:4195383952 errors:0 dropped:0 overruns:0 frame:0
>>>> >           TX packets:4618638427 errors:0 dropped:0 overruns:0 carrier:0
>>>> >           collisions:0 txqueuelen:1000
>>>> >           RX bytes:5682945530436 (5.1 TiB)  TX bytes:7468635177326 (6.7 TiB)
>>>> >
>>>> > eth0:0    Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
>>>> >           inet addr:  Bcast:  Mask:
>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>> > #########
>>>> >
>>>> > And
>>>> > #########
>>>> > eth0      Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
>>>> >           inet addr:  Bcast:  Mask:
>>>> >           inet6 addr: fe80::216:3eff:fe87:6f43/64 Scope:Link
>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>> >           RX packets:4618130586 errors:0 dropped:0 overruns:0 frame:0
>>>> >           TX packets:200152071 errors:0 dropped:0 overruns:0 carrier:0
>>>> >           collisions:0 txqueuelen:1000
>>>> >           RX bytes:5702786076291 (5.1 TiB)  TX bytes:15085410736 (14.0 GiB)
>>>> >
>>>> > eth0:0    Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
>>>> >           inet addr:  Bcast:  Mask:
>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>>>> > #########
>>>> >
>>>> > Here's the content of the pgpool log from
>>>> > #########
>>>> > 2012-09-24 10:55:34 ERROR: pid 28064: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:45 ERROR: pid 28088: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:45 ERROR: pid 28088: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:45 ERROR: pid 28088: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:46 ERROR: pid 28086: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:46 ERROR: pid 28086: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:46 ERROR: pid 28086: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:57 ERROR: pid 28099: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:57 ERROR: pid 28099: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:57 ERROR: pid 28099: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:58 ERROR: pid 27674: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:58 ERROR: pid 27674: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:58 ERROR: pid 27674: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:56:09 ERROR: pid 28108: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:56:09 ERROR: pid 28108: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:56:09 ERROR: pid 28108: new_connection: create_cp() failed
>>>> > 2012-09-24 10:56:10 ERROR: pid 28106: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:56:10 ERROR: pid 28106: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:56:10 ERROR: pid 28106: new_connection: create_cp() failed
>>>> > 2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool
>>>> > 2012-09-24 10:56:24 LOG:   pid 27612: wd_escalation:  escaleted to
>>>> > delegate_IP holder
>>>> > 2012-09-24 10:57:04 LOG:   pid 27813: send_failback_request: fail back
>>>> > 0 th node request from pid 27813
>>>> > 2012-09-24 10:57:04 ERROR: pid 27596: failover_handler: invalid
>>>> > node_id 0 status:2 MAX_NUM_BACKENDS: 128
>>>> > 2012-09-24 10:57:06 LOG:   pid 27813: send_failback_request: fail back
>>>> > 1 th node request from pid 27813
>>>> > 2012-09-24 10:57:06 ERROR: pid 27596: failover_handler: invalid
>>>> > node_id 1 status:2 MAX_NUM_BACKENDS: 128
>>>> > 2012-09-24 10:57:09 LOG:   pid 27813: send_failback_request: fail back
>>>> > 2 th node request from pid 27813
>>>> > 2012-09-24 10:57:09 ERROR: pid 27596: failover_handler: invalid
>>>> > node_id 2 status:2 MAX_NUM_BACKENDS: 128
>>>> > #########
>>>> >
>>>> > and
>>>> > #########
>>>> > 2012-09-24 10:55:22 ERROR: pid 7192: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:22 ERROR: pid 7191: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:22 ERROR: pid 7191: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:22 ERROR: pid 7191: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:34 ERROR: pid 7202: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:34 ERROR: pid 7202: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:34 ERROR: pid 7202: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:34 ERROR: pid 7209: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:34 ERROR: pid 7209: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:34 ERROR: pid 7209: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:46 ERROR: pid 7213: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:46 ERROR: pid 7213: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:46 ERROR: pid 7213: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:46 ERROR: pid 7221: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:46 ERROR: pid 7221: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:46 ERROR: pid 7221: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:58 ERROR: pid 7223: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:58 ERROR: pid 7223: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:58 ERROR: pid 7223: new_connection: create_cp() failed
>>>> > 2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:55:58 ERROR: pid 7231: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:55:58 ERROR: pid 7231: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:55:58 ERROR: pid 7231: new_connection: create_cp() failed
>>>> > 2012-09-24 10:56:10 ERROR: pid 7233: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:56:10 ERROR: pid 7233: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:56:10 ERROR: pid 7233: new_connection: create_cp() failed
>>>> > 2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>>>> > times. pgpool seems not to be working
>>>> > 2012-09-24 10:56:10 ERROR: pid 7243: connect_inet_domain_socket:
>>>> > connect() failed: Connection refused
>>>> > 2012-09-24 10:56:10 ERROR: pid 7243: connection to cuda-db0(5432) failed
>>>> > 2012-09-24 10:56:10 ERROR: pid 7243: new_connection: create_cp() failed
>>>> > 2012-09-24 10:56:22 LOG:   pid 6724: wd_escalation: escalated to master pgpool
>>>> > 2012-09-24 10:56:24 LOG:   pid 6724: wd_escalation:  escaleted to
>>>> > delegate_IP holder
>>>> > #########
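
To spot this situation from the logs, one could compare the escalation
timestamps of the two nodes. The following is a rough sketch, assuming
the raw, unwrapped log lines (without the mail quote prefixes) are
available as text; the function names and the 60-second window are
arbitrary choices for illustration.

```python
import re
from datetime import datetime

# Matches lines like:
# 2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool
ESCALATION_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) LOG:\s+pid \d+: "
    r"wd_escalation: escalated to master pgpool"
)


def escalation_times(log_text):
    """Return the timestamps of all escalation messages in one node's log."""
    times = []
    for line in log_text.splitlines():
        match = ESCALATION_RE.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return times


def both_escalated(log_a, log_b, window_seconds=60):
    """True if both nodes logged an escalation within the given window."""
    return any(
        abs((ta - tb).total_seconds()) <= window_seconds
        for ta in escalation_times(log_a)
        for tb in escalation_times(log_b)
    )
```

In the two logs quoted above, both nodes report the escalation at
2012-09-24 10:56:22, so a check like this would flag them.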
>>>> >
>>>> > I'm not sure what sort of information is needed to debug what went
>>>> > wrong.  Let me know if something else is needed, and I'll do my best
>>>> > to provide it.  thanks
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
