[Pgpool-general] Strange "cross" degeneration in a cluster with pgpool and openswan

Pablo Milano pm-pgpool at datatransfer.com.ar
Fri Oct 21 14:34:07 GMT 2005


Oops, I confused when changing the server names on the fisrt message.

What happened was: First, the slave stopped seing the master and entered
degeneration mode. After a while, the master detected a mismatch and shut
down the slave. As a result, master directed queries to master and slave to
slave.
The problem started while openswan was doing the rekeing routines. Searching
on the net, I found a guy whit a similar problem
(http://oss.sgi.com/projects/netdev/archive/2004-06/msg00235.html) but this
exceedes pgpool interests.

Beyond the IPsec issue, I think I need a way to verify (may be externally)
if backends entered "cross" degeneration mode...

> -----Original Message-----
> From: Tatsuo Ishii [mailto:ishii at sraoss.co.jp] 
> Sent: Jueves, 20 de Octubre de 2005 10:07 p.m.
> To: Pablo G. Milano
> Cc: pgpool-general at pgfoundry.org
> Subject: Re: [Pgpool-general] Strange "cross" degeneration in 
> a cluster with pgpool and openswan
> 
> > Yes, of course.
> > 
> > These are the versions of the debian packages:
> > 
> > ii  pgpool         2.5.2-1        connection pool server 
> for PostgreSQL
> > ii  postgresql     7.4.7-6sarge1  object-relational SQL 
> database management
> > sy 
> > 
> > Both configuration files are the same (as follows). Always 
> master backend is
> > "srv1"
> 
> Weren't you saying that srv1 was slave?
> 
> > Happened that srv1 entered degeneration mode because srv2 
> was not reachable,
> > but srv2 did not enter degeneration mode. After some load, 
> srv2 discovered a
> > mismatch between backends and entered degeneration mode 
> shutting down srv1
> > (the slave backend). As a result, srv1 pointed to srv1 and 
> srv2 to srv2,
> > breaking the synchronization.
> --
> SRA OSS, Inc. Japan
> Tatsuo Ishii
> 
> > # pgpool.conf
> > # Note: The real server names were changed to "srv1" and "srv2"
> > #
> > listen_addresses = '*'
> > port = 5433
> > socket_dir = '/var/run/postgresql'
> > backend_host_name = 'srv1'
> > backend_port = 5432
> > backend_socket_dir = '/var/run/postgresql'
> > secondary_backend_host_name = 'srv2'
> > secondary_backend_port = 5432
> > num_init_children = 32
> > max_pool = 4
> > child_life_time = 300
> > connection_life_time = 0
> > logdir = '/var/run/postgresql'
> > replication_mode = true
> > replication_strict = true
> > replication_timeout = 1000
> > load_balance_mode = true
> > weight_master = 0.5
> > weight_secondary = 0.5
> > replication_stop_on_mismatch = true
> > reset_query_list = 'ABORT; RESET ALL; SET SESSION 
> AUTHORIZATION DEFAULT'
> > print_timestamp = true
> > master_slave_mode = false
> > connection_cache = true
> > health_check_timeout = 20
> > health_check_period = 30
> > health_check_user = 'monitor'
> > 
> > We think the health check uses blocking primitives, so it 
> delayed some
> > minutes in detecting the fail, because all the processes 
> were blocked
> > waiting for the kernel.
> > 
> > Regards.
> > 
> > 
> > > -----Original Message-----
> > > From: Tatsuo Ishii [mailto:ishii at sraoss.co.jp] 
> > > Sent: Martes, 18 de Octubre de 2005 02:14 a.m.
> > > To: Pablo G. Milano
> > > Cc: pgpool-general at pgfoundry.org
> > > Subject: Re: [Pgpool-general] Strange "cross" degeneration in 
> > > a cluster with pgpool and openswan
> > > 
> > > Can you show me pgpool/PostgreSQL version, and pgpool's 
> configuration
> > > file contents? Especially I'm interested in which server 
> is the master
> > > for pgool?
> > > --
> > > SRA OSS, Inc. Japan
> > > Tatsuo Ishii
> > > 
> > > > A very strange "cross" degeneration occured last week in 
> > > our production
> > > > cluster, wich uses pgpool and openswan (IPSec). The cluster 
> > > consists of two
> > > > machines (srv1 and srv2), each of which has an 
> appliction server, a
> > > > postgreSQL backend and a pgpool. On both servers, 
> pgpool replicates
> > > > transactions in both backends.
> > > > Happened that srv1 entered degeneration mode because srv2 
> > > was not reachable,
> > > > but srv2 did not enter degeneration mode. After some load, 
> > > srv2 discovered a
> > > > mismatch between backends and entered degeneration mode 
> > > shutting down srv1
> > > > (the slave backend). As a result, srv1 pointed to srv1 and 
> > > srv2 to srv2,
> > > > breaking the synchronization.
> > > > I'm still unsure about the real cause that made srv1 unable 
> > > to contact
> > > > (temporarily) srv2. But maybe openswan related, because the 
> > > log showed the
> > > > following error:
> > > > 
> > > > Oct 12 19:48:25 srv2 kernel: pmtu discovery on SA 
> > > ESP/e5fb8ff6/c0a85015
> > > > 
> > > > ¿Has anyone seen something like this?
> > > > ¿Is there any method to avoid, or at least detect "cross 
> > > degeneration" in
> > > > this type of environment?
> > > > 
> > > > 
> > > > 
> > > > For further information, these are the relevant logs:
> > > > 
> > > > srv1
> > > > ======
> > > > 
> > > > Oct 12 19:19:29 srv1 pluto[1346]: "srv2-srv1" #1177: 
> received Delete
> > > > SA(0xa98b908a) payload: deleting IPSEC State #1167
> > > > Oct 12 19:19:29 srv1 pluto[1346]: "srv2-srv1" #1177: 
> > > received and ignored
> > > > informational message
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > > responding to Main Mode
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > > transition from state
> > > > (null) to state STATE_MAIN_R1
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > > transition from state
> > > > STATE_MAIN_R1 to state STATE_MAIN_R2
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: Peer ID is
> > > > ID_IPV4_ADDR: '192.168.80.20'
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: I did 
> > > not send a
> > > > certificate because I do not have one.
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > > transition from state
> > > > STATE_MAIN_R2 to state STATE_MAIN_R3
> > > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: sent 
> > > MR3, ISAKMP SA
> > > > established
> > > > Oct 12 19:23:01 srv1 CRON[31441]: (pam_unix) session opened 
> > > for user mail by
> > > > (uid=0)
> > > > Oct 12 19:23:01 srv1 CRON[31441]: (pam_unix) session closed 
> > > for user mail
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 10278:
> > > > pool_read: read failed (Connection reset by peer)
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 10278:
> > > > ProcessFrontendResponse: failed to read kind from 
> frontend. fronend
> > > > abnormally exited
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 
> > > 7543: pool_read:
> > > > read failed (Connection reset by peer)
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 7543:
> > > > ProcessFrontendResponse: failed to read kind from 
> frontend. fronend
> > > > abnormally exited
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 32275:
> > > > pool_read: read failed (Connection reset by peer)
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 32275:
> > > > ProcessFrontendResponse: failed to read kind from 
> frontend. fronend
> > > > abnormally exited
> > > > 
> > > > ....
> > > > 	Lots of same messages
> > > > ....
> > > > 
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 25087:
> > > > pool_read: read failed (Connection reset by peer)
> > > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: 
> pid 25087:
> > > > ProcessFrontendResponse: failed to read kind from 
> frontend. fronend
> > > > abnormally exited
> > > > Oct 12 19:28:32 srv1 pgpool: 2005-10-12 19:28:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:29:02 srv1 pgpool: 2005-10-12 19:29:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:29:32 srv1 pgpool: 2005-10-12 19:29:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:30:02 srv1 pgpool: 2005-10-12 19:30:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:30:28 srv1 pluto[1346]: "srv2-srv1" #1178: next 
> > > payload type of
> > > > ISAKMP Hash Payload has an unknown value: 219
> > > > Oct 12 19:30:28 srv1 pluto[1346]: "srv2-srv1" #1178: 
> > > malformed payload in
> > > > packet
> > > > Oct 12 19:30:28 srv1 pluto[1346]: "srv2-srv1" #1178: 
> > > sending encrypted
> > > > notification PAYLOAD_MALFORMED to 192.168.80.20:500
> > > > Oct 12 19:30:32 srv1 pgpool: 2005-10-12 19:30:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:31:02 srv1 pgpool: 2005-10-12 19:31:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:31:32 srv1 pgpool: 2005-10-12 19:31:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:32:02 srv1 pgpool: 2005-10-12 19:32:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:32:32 srv1 pgpool: 2005-10-12 19:32:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:33:02 srv1 pgpool: 2005-10-12 19:33:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:33:32 srv1 pgpool: 2005-10-12 19:33:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:34:02 srv1 pgpool: 2005-10-12 19:34:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:34:32 srv1 pgpool: 2005-10-12 19:34:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:35:02 srv1 pgpool: 2005-10-12 19:35:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:35:32 srv1 pgpool: 2005-10-12 19:35:32 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:36:02 srv1 pgpool: 2005-10-12 19:36:02 LOG:   pid 
> > > 17352: starting
> > > > health checking
> > > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 ERROR: 
> pid 23909:
> > > > pool_read: read failed (Connection reset by peer)
> > > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 ERROR: 
> pid 23909:
> > > > ProcessFrontendResponse: failed to read kind from 
> frontend. fronend
> > > > abnormally exited
> > > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 ERROR: 
> pid 23909:
> > > > pool_flush: fflush failed (Connection timed out)
> > > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 LOG:   
> pid 23909:
> > > > notice_backend_error: master: 0 fail over request from pid 23909
> > > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 LOG:   pid 
> > > 17352: starting
> > > > degeneration. shutdown secondary host srv2(5432)
> > > > 
> > > > 
> > > > 
> > > > srv2
> > > > ====
> > > > 
> > > > Oct 12 19:18:06 srv2 pluto[1558]: "srv2-srv1" #895: 
> > > ignoring Delete SA
> > > > payload: PROTO_IPSEC_ESP SA(0xe661038c) not
> > > > found (maybe expired)
> > > > Oct 12 19:18:06 srv2 pluto[1558]: "srv2-srv1" #895: 
> > > received and ignored
> > > > informational message
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > > initiating Main Mode to
> > > > replace #895
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > > transition from state
> > > > STATE_MAIN_I1 to state STATE_MAIN_I2
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: I 
> did not send a
> > > > certificate because I do not have one.
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > > transition from state
> > > > STATE_MAIN_I2 to state STATE_MAIN_I3
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: Peer ID 
> > > is ID_IPV4_ADDR:
> > > > '192.168.80.21'
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > > transition from state
> > > > STATE_MAIN_I3 to state STATE_MAIN_I4
> > > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: ISAKMP 
> > > SA established
> > > > Oct 12 19:22:45 srv2 sudo:  egutter : TTY=pts/1 ; 
> > > PWD=/home/egutter ;
> > > > USER=root ; COMMAND=/usr/bin/less /var/log/jboss/DT
> > > > .log
> > > > Oct 12 19:23:01 srv2 CRON[1921]: (pam_unix) session opened 
> > > for user mail by
> > > > (uid=0)
> > > > Oct 12 19:23:01 srv2 CRON[1921]: (pam_unix) session closed 
> > > for user mail
> > > > Oct 12 19:29:05 srv2 pluto[1558]: "srv2-srv1" #895: 
> > > received Delete SA
> > > > payload: deleting ISAKMP State #895
> > > > Oct 12 19:29:05 srv2 pluto[1558]: packet from 
> > > 192.168.80.21:500: received
> > > > and ignored informational message
> > > > Oct 12 19:29:05 srv2 pluto[1558]: "srv2-srv1" #896: 
> > > ignoring informational
> > > > payload, type PAYLOAD_MALFORMED
> > > > Oct 12 19:29:05 srv2 pluto[1558]: "srv2-srv1" #896: 
> > > received and ignored
> > > > informational message
> > > > Oct 12 19:19:47 srv2 pgpool: 2005-10-12 19:19:47 LOG:   pid 
> > > 1602: starting
> > > > health checking
> > > > Oct 12 19:19:52 srv2 kernel: pmtu discovery on SA 
> > > ESP/e5fb8ff6/c0a85015
> > > > Oct 12 19:20:05 srv2 last message repeated 7 times
> > > > Oct 12 19:20:17 srv2 pgpool: 2005-10-12 19:20:17 LOG:   pid 
> > > 1602: starting
> > > > health checking
> > > > Oct 12 19:20:18 srv2 kernel: pmtu discovery on SA 
> > > ESP/e5fb8ff6/c0a85015
> > > > ....
> > > > 	Lots of same messages
> > > > ....
> > > > Oct 12 19:48:17 srv2 pgpool: 2005-10-12 19:48:17 LOG:   pid 
> > > 1602: starting
> > > > health checking
> > > > Oct 12 19:48:25 srv2 kernel: pmtu discovery on SA 
> > > ESP/e5fb8ff6/c0a85015
> > > > 
> > > > 
> > > > 
> > > > _______________________________________________
> > > > Pgpool-general mailing list
> > > > Pgpool-general at pgfoundry.org
> > > > http://pgfoundry.org/mailman/listinfo/pgpool-general
> > > > 
> > > 
> > 
> 



More information about the Pgpool-general mailing list