[pgpool-general: 1362] Re: watchdog enabled delegate_IP on multiple nodes simultaneously

Lonni J Friedman netllama at gmail.com
Sat Feb 2 07:40:58 JST 2013


Thanks Yugo.  I'll give this a try, but it likely won't be any sooner
than the end of next week.

On Thu, Jan 31, 2013 at 10:18 PM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> Hello Lonni,
>
> On Fri, 2 Nov 2012 13:17:24 +0900
> Yugo Nagata <nagata at sraoss.co.jp> wrote:
>
>> I guess the situation you described is as follows.
>>
>> 1. Detach all the nodes from each pgpool using pcp_detach_node etc.
>> 2. Attach a node to each pgpool using pcp_attach_node etc.
>>  -> Both pgpools bring up the virtual IP.
>>
>> If so, this is a restriction of the watchdog functionality, and is not a bug.
>>
>> (see Restrictions in
>>  http://www.pgpool.net/docs/latest/pgpool-en.html#watchdog )
>>
>> If all the nodes are detached from pgpool, you have to restart pgpool.
>>
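>> A minimal sketch of the trigger sequence (the host, port, and credentials
>> below are placeholders for your pcp settings; argument order follows the
>> pgpool-II 3.x pcp command syntax):
>> #########
>> # 1. detach every backend node from each pgpool (repeat for each node id)
>> pcp_detach_node 10 localhost 9898 postgres secret 0
>> # 2. later, re-attach a node to each pgpool
>> pcp_attach_node 10 localhost 9898 postgres secret 0
>> #  -> both watchdogs may now bring up the virtual IP
>> #########
>>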
>> # The situation of raw mode you reported previously is a bug. I'll fix it.
>
> Sorry for the delay. I attached a patch resolving the split-brain problem
> (the situation where multiple pgpools hold the delegate IP).
> Could you try it?
>
> In this fix, once all backend DB nodes are detached from pgpool (by
> pcp_detach_node, a DB server going down, etc.), pgpool goes to DOWN status
> until it is restarted. A pgpool in DOWN status cannot escalate to
> delegate IP holder, so the split-brain situation is avoided. All pgpools
> in DOWN status must be restarted to recover the correct status.
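>
> For example, recovery then amounts to restarting pgpool on every host in
> DOWN status (the config path below is a placeholder; adjust for your
> installation):
> #########
> # on each pgpool host in DOWN status:
> pgpool -f /etc/pgpool-II/pgpool.conf -m fast stop
> pgpool -f /etc/pgpool-II/pgpool.conf
> #########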
>
>
> Yugo Nagata
>
>>
>> On Mon, 29 Oct 2012 13:06:30 -0700
>> Lonni J Friedman <netllama at gmail.com> wrote:
>>
>> > I'm now seeing this behavior with replication mode too.  I seem to
>> > have triggered it by detaching pgpool from all the nodes (using
>> > pcp_detach_node).
>> >
>> > This is a rather bad situation, as I've now got two different pgpool
>> > servers responding on the same IP address.
>> >
>> > On Tue, Oct 16, 2012 at 7:36 AM, Lonni J Friedman <netllama at gmail.com> wrote:
>> > > Yes, I believe that's correct.  I had a single database server, with
>> > > pgpool sitting in front of it for connection pooling & pgpool
>> > > failover.
>> > >
>> > > On Tue, Oct 16, 2012 at 12:50 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> > >>> Our investigation showed that the problem could occur if pgpool is
>> > >>> running in RAW mode (i.e. neither replication mode nor master slave
>> > >>> mode if off). Is this your case?
>> > >>
>> > >> Oops. I meant "neither replication mode nor master slave mode is on"
>> > >> --
>> > >> Tatsuo Ishii
>> > >> SRA OSS, Inc. Japan
>> > >> English: http://www.sraoss.co.jp/index_en.php
>> > >> Japanese: http://www.sraoss.co.jp
>> > >>
>> > >>>> Any new status on this investigation?
>> > >>>>
>> > >>>> On Mon, Oct 1, 2012 at 7:10 AM, Lonni J Friedman <netllama at gmail.com> wrote:
>> > >>>>> Thanks for your reply.  Let me know if you need any other info from me.
>> > >>>>>
>> > >>>>> On Mon, Oct 1, 2012 at 2:42 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
>> > >>>>>> I'm investigating the cause of this.
>> > >>>>>>
>> > >>>>>> From the log, it seems that both pgpools temporarily failed to connect
>> > >>>>>> to the DB (cuda-db0). During this period, both pgpool services were down.
>> > >>>>>> After the connection to the DB recovered, each pgpool might have considered
>> > >>>>>> the other to be still down, and escalated to the active server (VIP holder).
>> > >>>>>>
>> > >>>>>> I'm going to reproduce it.
>> > >>>>>>
>> > >>>>>> On Sun, 30 Sep 2012 12:38:12 -0700
>> > >>>>>> Lonni J Friedman <netllama at gmail.com> wrote:
>> > >>>>>>
>> > >>>>>>> No ideas or suggestions?
>> > >>>>>>>
>> > >>>>>>> On Wed, Sep 26, 2012 at 9:05 AM, Lonni J Friedman <netllama at gmail.com> wrote:
>> > >>>>>>> > I'm running 3.2.0 on two Linux servers, with use_watchdog=on.  Two
>> > >>>>>>> > days ago, I noticed that the watchdog enabled the delegate_IP on both
>> > >>>>>>> > servers simultaneously (and remains in that state as of today).  This
>> > >>>>>>> > seems like the wrong behavior.  I was under the impression that the
>> > >>>>>>> > delegate_IP should be up on only 1 server at any time?  I suppose this
>> > >>>>>>> > might be harmless, as long as both servers are otherwise working ok,
>> > >>>>>>> > but it seems to defeat the point of enabling the IP only when the
>> > >>>>>>> > other server is down, if they're both up simultaneously.
>> > >>>>>>> >
>> > >>>>>>> > I've verified that both servers are responding by pinging the
>> > >>>>>>> > delegate_IP (from a separate system), and checking the HWaddress from
>> > >>>>>>> > the 'arp' command.  If I then purge the arp cache and ping again, it
>> > >>>>>>> > will eventually show the other server's HWaddress associated with the
>> > >>>>>>> > delegate_IP.
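>> > >>>>>>> >
>> > >>>>>>> > For reference, that check looks roughly like this (run from a
>> > >>>>>>> > separate system; the cache purge requires root):
>> > >>>>>>> > #########
>> > >>>>>>> > ping -c 1 10.31.97.78
>> > >>>>>>> > arp -n 10.31.97.78   # note the HWaddress
>> > >>>>>>> > arp -d 10.31.97.78   # purge the cached entry
>> > >>>>>>> > ping -c 1 10.31.97.78
>> > >>>>>>> > arp -n 10.31.97.78   # may now show the other server's HWaddress
>> > >>>>>>> > #########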
>> > >>>>>>> >
>> > >>>>>>> > In the pgpool logs on both servers, I do see the following around the
>> > >>>>>>> > time that this happened:
>> > >>>>>>> > wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>> > >>>>>>> >
>> > >>>>>>> > My primary concern is why both servers have the delegate_IP up
>> > >>>>>>> > simultaneously when there was clearly some sort of problem that should
>> > >>>>>>> > have caused it to be brought down on at least 1 of them.
>> > >>>>>>> >
>> > >>>>>>> > Both servers can communicate with each other (they can ping each
>> > >>>>>>> > other, and I can invoke psql to connect to localhost pgpool from both
>> > >>>>>>> > servers).  Here are all the uncommented settings in the WATCHDOG
>> > >>>>>>> > section of pgpool.conf (with wd_hostname differing for each server):
>> > >>>>>>> > #########
>> > >>>>>>> > use_watchdog = on
>> > >>>>>>> >                                     # Activates watchdog
>> > >>>>>>> > trusted_servers = 'cuda-fs1,cuda-vm0,cuda-fs2'
>> > >>>>>>> >                                     # trusted server list which are used
>> > >>>>>>> >                                     # to confirm network connection
>> > >>>>>>> >                                     # (hostA,hostB,hostC,...)
>> > >>>>>>> > delegate_IP = '10.31.97.78'
>> > >>>>>>> >                                     # delegate IP address
>> > >>>>>>> > wd_hostname = '10.31.99.166'
>> > >>>>>>> >                                     # Host name or IP address of this watchdog
>> > >>>>>>> > wd_port = 9000
>> > >>>>>>> >                                     # port number for watchdog service
>> > >>>>>>> > wd_interval = 10
>> > >>>>>>> >                                     # lifecheck interval (sec) > 0
>> > >>>>>>> > ping_path = '/bin'
>> > >>>>>>> >                                     # ping command path
>> > >>>>>>> > ifconfig_path = '/sbin'
>> > >>>>>>> >                                     # ifconfig command path
>> > >>>>>>> > if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.252.0'
>> > >>>>>>> >                                     # startup delegate IP command
>> > >>>>>>> > if_down_cmd = 'ifconfig eth0:0 down'
>> > >>>>>>> >                                     # shutdown delegate IP command
>> > >>>>>>> > arping_path = '/usr/sbin'           # arping command path
>> > >>>>>>> > arping_cmd = 'arping -U $_IP_$ -w 1'
>> > >>>>>>> >                                     # arping command
>> > >>>>>>> > wd_life_point = 3
>> > >>>>>>> >                                     # lifecheck retry times
>> > >>>>>>> > wd_lifecheck_query = 'SELECT 1'
>> > >>>>>>> >                                     # lifecheck query to pgpool from watchdog
>> > >>>>>>> > other_pgpool_hostname0 = '10.31.99.165'
>> > >>>>>>> >                                     # Host name or IP address to
>> > >>>>>>> > connect to for other pgpool 0
>> > >>>>>>> > other_pgpool_port0 = 9999
>> > >>>>>>> >                                     # Port number for other pgpool 0
>> > >>>>>>> > other_wd_port0 = 9000
>> > >>>>>>> > #########
>> > >>>>>>> >
>> > >>>>>>> > Here's ifconfig output from each server.  First 10.31.99.165:
>> > >>>>>>> > #########
>> > >>>>>>> > eth0      Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
>> > >>>>>>> >           inet addr:10.31.99.165  Bcast:10.31.99.255  Mask:255.255.252.0
>> > >>>>>>> >           inet6 addr: fe80::5054:ff:fefc:5add/64 Scope:Link
>> > >>>>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > >>>>>>> >           RX packets:4195383952 errors:0 dropped:0 overruns:0 frame:0
>> > >>>>>>> >           TX packets:4618638427 errors:0 dropped:0 overruns:0 carrier:0
>> > >>>>>>> >           collisions:0 txqueuelen:1000
>> > >>>>>>> >           RX bytes:5682945530436 (5.1 TiB)  TX bytes:7468635177326 (6.7 TiB)
>> > >>>>>>> >
>> > >>>>>>> > eth0:0    Link encap:Ethernet  HWaddr 52:54:00:FC:5A:DD
>> > >>>>>>> >           inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
>> > >>>>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > >>>>>>> > #########
>> > >>>>>>> >
>> > >>>>>>> > And 10.31.99.166:
>> > >>>>>>> > #########
>> > >>>>>>> > eth0      Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
>> > >>>>>>> >           inet addr:10.31.99.166  Bcast:10.31.99.255  Mask:255.255.252.0
>> > >>>>>>> >           inet6 addr: fe80::216:3eff:fe87:6f43/64 Scope:Link
>> > >>>>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > >>>>>>> >           RX packets:4618130586 errors:0 dropped:0 overruns:0 frame:0
>> > >>>>>>> >           TX packets:200152071 errors:0 dropped:0 overruns:0 carrier:0
>> > >>>>>>> >           collisions:0 txqueuelen:1000
>> > >>>>>>> >           RX bytes:5702786076291 (5.1 TiB)  TX bytes:15085410736 (14.0 GiB)
>> > >>>>>>> >
>> > >>>>>>> > eth0:0    Link encap:Ethernet  HWaddr 00:16:3E:87:6F:43
>> > >>>>>>> >           inet addr:10.31.97.78  Bcast:10.31.99.255  Mask:255.255.252.0
>> > >>>>>>> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>> > >>>>>>> > #########
>> > >>>>>>> >
>> > >>>>>>> > Here's the content of the pgpool log from 10.31.99.165:
>> > >>>>>>> > #########
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 28064: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:34 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:45 ERROR: pid 28088: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:45 ERROR: pid 28088: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:45 ERROR: pid 28088: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 28086: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 28086: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 28086: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:46 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:57 ERROR: pid 28099: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:57 ERROR: pid 28099: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:57 ERROR: pid 28099: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 27674: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 27674: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 27674: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:58 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:56:09 ERROR: pid 28108: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:56:09 ERROR: pid 28108: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:56:09 ERROR: pid 28108: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 28106: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 28106: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 28106: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:56:10 LOG:   pid 27612: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:56:22 LOG:   pid 27612: wd_escalation: escalated to master pgpool
>> > >>>>>>> > 2012-09-24 10:56:24 LOG:   pid 27612: wd_escalation:  escaleted to
>> > >>>>>>> > delegate_IP holder
>> > >>>>>>> > 2012-09-24 10:57:04 LOG:   pid 27813: send_failback_request: fail back
>> > >>>>>>> > 0 th node request from pid 27813
>> > >>>>>>> > 2012-09-24 10:57:04 ERROR: pid 27596: failover_handler: invalid
>> > >>>>>>> > node_id 0 status:2 MAX_NUM_BACKENDS: 128
>> > >>>>>>> > 2012-09-24 10:57:06 LOG:   pid 27813: send_failback_request: fail back
>> > >>>>>>> > 1 th node request from pid 27813
>> > >>>>>>> > 2012-09-24 10:57:06 ERROR: pid 27596: failover_handler: invalid
>> > >>>>>>> > node_id 1 status:2 MAX_NUM_BACKENDS: 128
>> > >>>>>>> > 2012-09-24 10:57:09 LOG:   pid 27813: send_failback_request: fail back
>> > >>>>>>> > 2 th node request from pid 27813
>> > >>>>>>> > 2012-09-24 10:57:09 ERROR: pid 27596: failover_handler: invalid
>> > >>>>>>> > node_id 2 status:2 MAX_NUM_BACKENDS: 128
>> > >>>>>>> > #########
>> > >>>>>>> >
>> > >>>>>>> > and 10.31.99.166:
>> > >>>>>>> > #########
>> > >>>>>>> > 2012-09-24 10:55:22 ERROR: pid 7192: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:22 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:22 ERROR: pid 7191: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:22 ERROR: pid 7191: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:22 ERROR: pid 7191: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 7202: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 7202: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 7202: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:34 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 7209: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 7209: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:34 ERROR: pid 7209: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 7213: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 7213: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 7213: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:46 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 7221: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 7221: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:46 ERROR: pid 7221: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 7223: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 7223: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 7223: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:58 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 7231: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 7231: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:55:58 ERROR: pid 7231: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 7233: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 7233: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 7233: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:56:10 LOG:   pid 6724: wd_lifecheck: lifecheck failed 3
>> > >>>>>>> > times. pgpool seems not to be working
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 7243: connect_inet_domain_socket:
>> > >>>>>>> > connect() failed: Connection refused
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 7243: connection to cuda-db0(5432) failed
>> > >>>>>>> > 2012-09-24 10:56:10 ERROR: pid 7243: new_connection: create_cp() failed
>> > >>>>>>> > 2012-09-24 10:56:22 LOG:   pid 6724: wd_escalation: escalated to master pgpool
>> > >>>>>>> > 2012-09-24 10:56:24 LOG:   pid 6724: wd_escalation:  escaleted to
>> > >>>>>>> > delegate_IP holder
>> > >>>>>>> > #########
>> > >>>>>>> >
>> > >>>>>>> > I'm not sure what sort of information is needed to debug what went
>> > >>>>>>> > wrong.  Let me know if something else is needed, and I'll do my best
>> > >>>>>>> > to provide it.  thanks
>> > _______________________________________________
>> > pgpool-general mailing list
>> > pgpool-general at pgpool.net
>> > http://www.pgpool.net/mailman/listinfo/pgpool-general
>>
>>

