[Pgpool-general] Strange "cross" degeneration in a cluster with pgpool and openswan

Tatsuo Ishii ishii at sraoss.co.jp
Fri Oct 21 01:07:22 GMT 2005


> Yes, of course.
> 
> These are the versions of the debian packages:
> 
> ii  pgpool         2.5.2-1        connection pool server for PostgreSQL
> ii  postgresql     7.4.7-6sarge1  object-relational SQL database management
> sy 
> 
> Both configuration files are the same (as follows). Always master backend is
> "srv1"

Weren't you saying that srv1 was slave?

> Happened that srv1 entered degeneration mode because srv2 was not reachable,
> but srv2 did not enter degeneration mode. After some load, srv2 discovered a
> mismatch between backends and entered degeneration mode shutting down srv1
> (the slave backend). As a result, srv1 pointed to srv1 and srv2 to srv2,
> breaking the synchronization.
--
SRA OSS, Inc. Japan
Tatsuo Ishii

> # pgpool.conf
> # Note: The real server names were changed to "srv1" and "srv2"
> #
> listen_addresses = '*'
> port = 5433
> socket_dir = '/var/run/postgresql'
> backend_host_name = 'srv1'
> backend_port = 5432
> backend_socket_dir = '/var/run/postgresql'
> secondary_backend_host_name = 'srv2'
> secondary_backend_port = 5432
> num_init_children = 32
> max_pool = 4
> child_life_time = 300
> connection_life_time = 0
> logdir = '/var/run/postgresql'
> replication_mode = true
> replication_strict = true
> replication_timeout = 1000
> load_balance_mode = true
> weight_master = 0.5
> weight_secondary = 0.5
> replication_stop_on_mismatch = true
> reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'
> print_timestamp = true
> master_slave_mode = false
> connection_cache = true
> health_check_timeout = 20
> health_check_period = 30
> health_check_user = 'monitor'
> 
> We think the health check uses blocking primitives, so it delayed some
> minutes in detecting the fail, because all the processes were blocked
> waiting for the kernel.
> 
> Regards.
> 
> 
> > -----Original Message-----
> > From: Tatsuo Ishii [mailto:ishii at sraoss.co.jp] 
> > Sent: Martes, 18 de Octubre de 2005 02:14 a.m.
> > To: Pablo G. Milano
> > Cc: pgpool-general at pgfoundry.org
> > Subject: Re: [Pgpool-general] Strange "cross" degeneration in 
> > a cluster with pgpool and openswan
> > 
> > Can you show me pgpool/PostgreSQL version, and pgpool's configuration
> > file contents? Especially I'm interested in which server is the master
> > for pgool?
> > --
> > SRA OSS, Inc. Japan
> > Tatsuo Ishii
> > 
> > > A very strange "cross" degeneration occured last week in 
> > our production
> > > cluster, wich uses pgpool and openswan (IPSec). The cluster 
> > consists of two
> > > machines (srv1 and srv2), each of which has an appliction server, a
> > > postgreSQL backend and a pgpool. On both servers, pgpool replicates
> > > transactions in both backends.
> > > Happened that srv1 entered degeneration mode because srv2 
> > was not reachable,
> > > but srv2 did not enter degeneration mode. After some load, 
> > srv2 discovered a
> > > mismatch between backends and entered degeneration mode 
> > shutting down srv1
> > > (the slave backend). As a result, srv1 pointed to srv1 and 
> > srv2 to srv2,
> > > breaking the synchronization.
> > > I'm still unsure about the real cause that made srv1 unable 
> > to contact
> > > (temporarily) srv2. But maybe openswan related, because the 
> > log showed the
> > > following error:
> > > 
> > > Oct 12 19:48:25 srv2 kernel: pmtu discovery on SA 
> > ESP/e5fb8ff6/c0a85015
> > > 
> > > ¿Has anyone seen something like this?
> > > ¿Is there any method to avoid, or at least detect "cross 
> > degeneration" in
> > > this type of environment?
> > > 
> > > 
> > > 
> > > For further information, these are the relevant logs:
> > > 
> > > srv1
> > > ======
> > > 
> > > Oct 12 19:19:29 srv1 pluto[1346]: "srv2-srv1" #1177: received Delete
> > > SA(0xa98b908a) payload: deleting IPSEC State #1167
> > > Oct 12 19:19:29 srv1 pluto[1346]: "srv2-srv1" #1177: 
> > received and ignored
> > > informational message
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > responding to Main Mode
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > transition from state
> > > (null) to state STATE_MAIN_R1
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > transition from state
> > > STATE_MAIN_R1 to state STATE_MAIN_R2
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: Peer ID is
> > > ID_IPV4_ADDR: '192.168.80.20'
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: I did 
> > not send a
> > > certificate because I do not have one.
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: 
> > transition from state
> > > STATE_MAIN_R2 to state STATE_MAIN_R3
> > > Oct 12 19:20:27 srv1 pluto[1346]: "srv2-srv1" #1179: sent 
> > MR3, ISAKMP SA
> > > established
> > > Oct 12 19:23:01 srv1 CRON[31441]: (pam_unix) session opened 
> > for user mail by
> > > (uid=0)
> > > Oct 12 19:23:01 srv1 CRON[31441]: (pam_unix) session closed 
> > for user mail
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 10278:
> > > pool_read: read failed (Connection reset by peer)
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 10278:
> > > ProcessFrontendResponse: failed to read kind from frontend. fronend
> > > abnormally exited
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 
> > 7543: pool_read:
> > > read failed (Connection reset by peer)
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 7543:
> > > ProcessFrontendResponse: failed to read kind from frontend. fronend
> > > abnormally exited
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 32275:
> > > pool_read: read failed (Connection reset by peer)
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 32275:
> > > ProcessFrontendResponse: failed to read kind from frontend. fronend
> > > abnormally exited
> > > 
> > > ....
> > > 	Lots of same messages
> > > ....
> > > 
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 25087:
> > > pool_read: read failed (Connection reset by peer)
> > > Oct 12 19:24:52 srv1 pgpool: 2005-10-12 19:24:52 ERROR: pid 25087:
> > > ProcessFrontendResponse: failed to read kind from frontend. fronend
> > > abnormally exited
> > > Oct 12 19:28:32 srv1 pgpool: 2005-10-12 19:28:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:29:02 srv1 pgpool: 2005-10-12 19:29:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:29:32 srv1 pgpool: 2005-10-12 19:29:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:30:02 srv1 pgpool: 2005-10-12 19:30:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:30:28 srv1 pluto[1346]: "srv2-srv1" #1178: next 
> > payload type of
> > > ISAKMP Hash Payload has an unknown value: 219
> > > Oct 12 19:30:28 srv1 pluto[1346]: "srv2-srv1" #1178: 
> > malformed payload in
> > > packet
> > > Oct 12 19:30:28 srv1 pluto[1346]: "srv2-srv1" #1178: 
> > sending encrypted
> > > notification PAYLOAD_MALFORMED to 192.168.80.20:500
> > > Oct 12 19:30:32 srv1 pgpool: 2005-10-12 19:30:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:31:02 srv1 pgpool: 2005-10-12 19:31:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:31:32 srv1 pgpool: 2005-10-12 19:31:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:32:02 srv1 pgpool: 2005-10-12 19:32:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:32:32 srv1 pgpool: 2005-10-12 19:32:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:33:02 srv1 pgpool: 2005-10-12 19:33:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:33:32 srv1 pgpool: 2005-10-12 19:33:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:34:02 srv1 pgpool: 2005-10-12 19:34:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:34:32 srv1 pgpool: 2005-10-12 19:34:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:35:02 srv1 pgpool: 2005-10-12 19:35:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:35:32 srv1 pgpool: 2005-10-12 19:35:32 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:36:02 srv1 pgpool: 2005-10-12 19:36:02 LOG:   pid 
> > 17352: starting
> > > health checking
> > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 ERROR: pid 23909:
> > > pool_read: read failed (Connection reset by peer)
> > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 ERROR: pid 23909:
> > > ProcessFrontendResponse: failed to read kind from frontend. fronend
> > > abnormally exited
> > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 ERROR: pid 23909:
> > > pool_flush: fflush failed (Connection timed out)
> > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 LOG:   pid 23909:
> > > notice_backend_error: master: 0 fail over request from pid 23909
> > > Oct 12 19:36:21 srv1 pgpool: 2005-10-12 19:36:21 LOG:   pid 
> > 17352: starting
> > > degeneration. shutdown secondary host srv2(5432)
> > > 
> > > 
> > > 
> > > srv2
> > > ====
> > > 
> > > Oct 12 19:18:06 srv2 pluto[1558]: "srv2-srv1" #895: 
> > ignoring Delete SA
> > > payload: PROTO_IPSEC_ESP SA(0xe661038c) not
> > > found (maybe expired)
> > > Oct 12 19:18:06 srv2 pluto[1558]: "srv2-srv1" #895: 
> > received and ignored
> > > informational message
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > initiating Main Mode to
> > > replace #895
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > transition from state
> > > STATE_MAIN_I1 to state STATE_MAIN_I2
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: I did not send a
> > > certificate because I do not have one.
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > transition from state
> > > STATE_MAIN_I2 to state STATE_MAIN_I3
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: Peer ID 
> > is ID_IPV4_ADDR:
> > > '192.168.80.21'
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: 
> > transition from state
> > > STATE_MAIN_I3 to state STATE_MAIN_I4
> > > Oct 12 19:19:04 srv2 pluto[1558]: "srv2-srv1" #897: ISAKMP 
> > SA established
> > > Oct 12 19:22:45 srv2 sudo:  egutter : TTY=pts/1 ; 
> > PWD=/home/egutter ;
> > > USER=root ; COMMAND=/usr/bin/less /var/log/jboss/DT
> > > .log
> > > Oct 12 19:23:01 srv2 CRON[1921]: (pam_unix) session opened 
> > for user mail by
> > > (uid=0)
> > > Oct 12 19:23:01 srv2 CRON[1921]: (pam_unix) session closed 
> > for user mail
> > > Oct 12 19:29:05 srv2 pluto[1558]: "srv2-srv1" #895: 
> > received Delete SA
> > > payload: deleting ISAKMP State #895
> > > Oct 12 19:29:05 srv2 pluto[1558]: packet from 
> > 192.168.80.21:500: received
> > > and ignored informational message
> > > Oct 12 19:29:05 srv2 pluto[1558]: "srv2-srv1" #896: 
> > ignoring informational
> > > payload, type PAYLOAD_MALFORMED
> > > Oct 12 19:29:05 srv2 pluto[1558]: "srv2-srv1" #896: 
> > received and ignored
> > > informational message
> > > Oct 12 19:19:47 srv2 pgpool: 2005-10-12 19:19:47 LOG:   pid 
> > 1602: starting
> > > health checking
> > > Oct 12 19:19:52 srv2 kernel: pmtu discovery on SA 
> > ESP/e5fb8ff6/c0a85015
> > > Oct 12 19:20:05 srv2 last message repeated 7 times
> > > Oct 12 19:20:17 srv2 pgpool: 2005-10-12 19:20:17 LOG:   pid 
> > 1602: starting
> > > health checking
> > > Oct 12 19:20:18 srv2 kernel: pmtu discovery on SA 
> > ESP/e5fb8ff6/c0a85015
> > > ....
> > > 	Lots of same messages
> > > ....
> > > Oct 12 19:48:17 srv2 pgpool: 2005-10-12 19:48:17 LOG:   pid 
> > 1602: starting
> > > health checking
> > > Oct 12 19:48:25 srv2 kernel: pmtu discovery on SA 
> > ESP/e5fb8ff6/c0a85015
> > > 
> > > 
> > > 
> > > _______________________________________________
> > > Pgpool-general mailing list
> > > Pgpool-general at pgfoundry.org
> > > http://pgfoundry.org/mailman/listinfo/pgpool-general
> > > 
> > 
> 


More information about the Pgpool-general mailing list