[pgpool-general: 4163] Heartbeat randomly fails between pgpool instances

Thomas SIMON tsimon at neteven.com
Tue Nov 3 19:09:14 JST 2015


Hi all,

Last week I set up a master/slave pgpool scheme in production, and I 
sometimes see strange behavior from the heartbeat check.

After some time (sometimes days), the master pgpool instance's heartbeat 
fails for no apparent reason. This triggers a failover of the delegate_IP 
to the slave instance, which then becomes the master.

The log says the master instance is going down, but the master instance is fine...

Below are the logs from the slave:

2015-10-30T05:45:13.362789+01:00 db11 pgpool[38615]: check_pgpool_status_by_hb: lifecheck failed. pgpool 1 (db10.xxx.com:9999) seems not to be working
2015-10-30T05:45:13.362817+01:00 db11 pgpool[38615]: pgpool_down: db10.xxx.com:9999 is going down
2015-10-30T05:45:13.377575+01:00 db11 pgpool[38615]: pgpool_down: I'm oldest so standing for master
2015-10-30T05:45:13.383137+01:00 db11 pgpool[38615]: wd_escalation: escalating to master pgpool

2015-10-30T05:46:36.808437+01:00 db11 pgpool[38615]: wd_escalation: escalation command succeeded
2015-10-30T05:46:36.838590+01:00 db11 pgpool[38615]: wd_IP_up: ifconfig up failed
2015-10-30T05:46:36.838608+01:00 db11 pgpool[38615]: wd_escalation: escalated to master pgpool with some errors
2015-10-30T05:46:39.838775+01:00 db11 pgpool[38615]: check_pgpool_status_by_hb: pgpool 1 (db10.xxx.com:9999) is in down status
2015-10-30T05:46:42.838886+01:00 db11 pgpool[38615]: check_pgpool_status_by_hb: pgpool 1 (db10.xxx.com:9999) is in down status
...
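To put numbers on this timeline, here is a small Python sketch that parses the timestamps from the slave log excerpt above (copied in by hand; the short labels are my own shorthand, not pgpool message names) and computes the gap between consecutive events. On these entries it shows that the escalation command took about 83 seconds to report success, and that the "is in down status" messages then repeat roughly every 3 seconds.

```python
from datetime import datetime

# Timestamps copied from the db11 (slave) log excerpt. The labels are
# illustrative shorthand, not exact pgpool messages.
EVENTS = [
    ("2015-10-30T05:45:13.362789+01:00", "lifecheck failed for pgpool 1"),
    ("2015-10-30T05:45:13.362817+01:00", "pgpool_down: db10 going down"),
    ("2015-10-30T05:45:13.377575+01:00", "pgpool_down: standing for master"),
    ("2015-10-30T05:45:13.383137+01:00", "wd_escalation: escalating"),
    ("2015-10-30T05:46:36.808437+01:00", "wd_escalation: command succeeded"),
    ("2015-10-30T05:46:36.838590+01:00", "wd_IP_up: ifconfig up failed"),
    ("2015-10-30T05:46:36.838608+01:00", "wd_escalation: escalated with errors"),
    ("2015-10-30T05:46:39.838775+01:00", "pgpool 1 in down status"),
    ("2015-10-30T05:46:42.838886+01:00", "pgpool 1 in down status"),
]


def gaps(events):
    """Return (label, seconds since the previous event) for every event after the first."""
    # fromisoformat (Python 3.7+) accepts microseconds and a +HH:MM offset.
    parsed = [(datetime.fromisoformat(ts), label) for ts, label in events]
    return [
        (label, (t - prev_t).total_seconds())
        for (prev_t, _), (t, label) in zip(parsed, parsed[1:])
    ]


if __name__ == "__main__":
    for label, seconds in gaps(EVENTS):
        print(f"{seconds:10.6f} s  {label}")
```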

The logs still say the (old) master instance is down.

However, the port is open, as this telnet check shows:

[09:52:04]root@db11:/var/log$ telnet db10.xxx.com 9999
Trying xx.xx.xx.xx...
Connected to db10.xxx.com.
Escape character is '^]'.
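Note that this telnet test only shows that db10 accepts TCP connections on 9999, the pgpool client port. As far as I understand, the heartbeat lifecheck instead exchanges UDP packets on wd_heartbeat_port (9694 here), so a successful TCP connection to 9999 does not by itself prove that heartbeat packets are getting through. For scripting the TCP part of the check, a minimal Python equivalent of the telnet test could look like this (the function name and the 3-second timeout are my own choices):

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds, like `telnet host port`.

    This only tests a TCP listener (e.g. pgpool's client port 9999); it says
    nothing about UDP heartbeat traffic on wd_heartbeat_port.
    """
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, DNS failure and timeout.
        return False
```

Usage, with the host name as obfuscated in this post: `tcp_port_open("db10.xxx.com", 9999)`.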


This can happen on either pgpool instance (I see the same log message, 
"dbxx:9999 is in down status", on both the master and the slave instance).



Furthermore, the delegate IP is now configured on both instances (but 
traffic is still routed to the old master instance...):




db10
eth3:1    Link encap:Ethernet  HWaddr 90:e2:ba:6d:b1:49
           inet addr:172.25.1.5  Bcast:172.25.255.255 Mask:255.255.0.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

db11
eth5:1    Link encap:Ethernet  HWaddr 90:e2:ba:8a:ca:5d
           inet addr:172.25.1.5  Bcast:172.25.255.255 Mask:255.255.0.0
           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1




Parameters:

db10:
delegate_IP = '172.25.1.5'
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 20
wd_heartbeat_deadtime = 30
heartbeat_destination0 = 'db11.xxx.com'
heartbeat_destination_port0 = 9694
heartbeat_device0 = 'eth3'



db11:
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 20
wd_heartbeat_deadtime = 30
heartbeat_destination0 = 'db10.neteven.com'
heartbeat_destination_port0 = 9694
heartbeat_device0 = 'eth5'
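One way to look at the keepalive/deadtime pair is how many heartbeat packets can be lost in a row before the peer is declared down. Under a simplified model (my assumption, not pgpool's actual code) in which a heartbeat is sent every wd_heartbeat_keepalive seconds and the peer is declared down once the silence exceeds wd_heartbeat_deadtime, k consecutive losses produce a silence of (k + 1) * keepalive, so the tolerated number of losses is floor(deadtime / keepalive) - 1. The model ignores that the lifecheck runs at discrete intervals, so real detection happens somewhat later. With the values above (keepalive = 20, deadtime = 30) it gives 0; with keepalive = 2, which I believe is pgpool's default, and the same deadtime it gives 14.

```python
def tolerated_losses(keepalive: int, deadtime: int) -> int:
    """Consecutive heartbeats that may be lost before the peer is declared down.

    Simplified model (assumption): one heartbeat every `keepalive` seconds,
    peer declared down when the silence is strictly greater than `deadtime`.
    After k consecutive losses the silence is (k + 1) * keepalive, so this
    returns the largest k with (k + 1) * keepalive <= deadtime.
    A negative result means the peer is declared down even with no loss.
    """
    if keepalive <= 0:
        raise ValueError("keepalive must be positive")
    return deadtime // keepalive - 1


if __name__ == "__main__":
    print(tolerated_losses(20, 30))  # 0: values from the db10/db11 config
    print(tolerated_losses(2, 30))   # 14
```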



Does anyone have an idea?

Thanks

-- 

Thomas


