View Issue Details

ID: 0000215
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2016-08-01 23:43
Reporter: supp_k
Assigned To: Muhammad Usama
Priority: urgent
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: pgpool-2
OS: CentOS
OS Version: 6x and 7x
Product Version:
Target Version:
Fixed in Version:
Summary: 0000215: pgpool doesn't escalate IP in case of another node's unavailability
Description:
Hi,

We are trying to set up two pgpool servers in front of two or more PostgreSQL instances.
pgpool (pgpool-II version 3.5.3 (ekieboshi)) balances all requests and everything works, with one very significant exception.

We configured it to delegate a virtual IP (please see the attached configuration). The delegation works only if the primary pgpool node shuts down cleanly (e.g. service pgpool stop). But if we emulate a primary server failure or a network problem by bringing down its network interface, the standby pgpool instance does not bring up the delegate IP.

Please see below the logs generated by the slave pgpool instance:
2016-07-08 19:01:55: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet
2016-07-08 19:01:55: pid 20148: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.7.20:9694
2016-07-08 19:01:57: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet
2016-07-08 19:01:57: pid 20148: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.7.20:9694
2016-07-08 19:01:59: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet
2016-07-08 19:01:59: pid 20148: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.7.20:9694
2016-07-08 19:02:01: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet
2016-07-08 19:02:01: pid 20148: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.7.20:9694
2016-07-08 19:02:03: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet
2016-07-08 19:02:03: pid 20148: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.7.20:9694
2016-07-08 19:02:03: pid 20141: DEBUG: STATE MACHINE INVOKED WITH EVENT = TIMEOUT Current State = MASTER
2016-07-08 19:02:03: pid 20141: DEBUG: not sending watchdog internal command packet to DEAD Linux_warm0.local_9999
2016-07-08 19:02:05: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet
2016-07-08 19:02:05: pid 20148: DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.7.20:9694
2016-07-08 19:02:06: pid 20146: DEBUG: watchdog checking life check is ready
2016-07-08 19:02:06: pid 20146: DETAIL: pgpool:1 at "192.168.7.20:9999" has not send the heartbeat signal yet
2016-07-08 19:02:07: pid 20148: DEBUG: watchdog heartbeat: send 224 byte packet




Thanks in advance!

Additional Information: P.S. The configuration file of the slave server is attached. The primary's configuration file is identical except for the IP addresses.
Tags: No tags attached.

Activities

supp_k

2016-07-09 01:07

reporter  

pgpool.conf (33,252 bytes)

supp_k

2016-07-09 01:40

reporter  

data.tar.gz (174,080 bytes)

supp_k

2016-07-11 16:29

reporter   ~0000884

I managed to solve the problem by specifying hostnames in the configuration files instead of IP addresses. Now it works and I see "lifecheck started" in the log files.
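
For readers trying the same workaround, a minimal pgpool.conf sketch of what "hostnames instead of IP addresses" might look like on the standby node. The hostnames are hypothetical placeholders (the reporter's actual names are not shown in this issue); the ports are the ones that appear in the logs above. Treat this as an illustration of using one naming style consistently, not as the reporter's actual configuration.

```
# Sketch only: both hostnames below are hypothetical placeholders.
use_watchdog = on
wd_hostname = 'pgpool2.example.local'             # this node, by name rather than IP
wd_lifecheck_method = 'heartbeat'
heartbeat_destination0 = 'pgpool1.example.local'  # peer node, by name
heartbeat_destination_port0 = 9694
other_pgpool_hostname0 = 'pgpool1.example.local'  # same name as used above
other_pgpool_port0 = 9999
```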

Muhammad Usama

2016-07-13 00:33

developer   ~0000899

Hi,
Thanks for the report. I have found a problem in the heartbeat receiver code: it fails to identify the watchdog node that sent a heartbeat when the heartbeat destination is specified as an IP address while wd_hostname is configured as a hostname string, or vice versa.

Could you please try the attached wd_hb_fix.diff patch and check whether it solves the issue?
(The patch can be applied to both the master and 3.5 branches.)
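
To illustrate the matching problem described in this note, here is a minimal standalone C sketch. It is not the pgpool source: `PeerNode`, `find_heartbeat_sender`, and the hostname `pgpool1.example.local` are hypothetical names made up for the example, while the address and port are taken from the logs in the report. The idea is that a node counts as the sender if its configured host equals either the name carried in the packet or the address the packet arrived from.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified stand-in for a watchdog life-check node entry. */
typedef struct
{
	const char *host_name;   /* host as written in the configuration */
	int         pgpool_port; /* pgpool port configured for that host */
} PeerNode;

/*
 * Return the index of the node a heartbeat came from, or -1 if none matches.
 * pkt_from  : host string the sender placed in the packet
 * sender_ip : dotted-quad address the packet arrived from
 */
int
find_heartbeat_sender(const PeerNode *nodes, int count,
					  const char *pkt_from, const char *sender_ip, int port)
{
	for (int i = 0; i < count; i++)
	{
		/* Accept a match on either the announced name or the source address. */
		int host_matches = strcmp(nodes[i].host_name, pkt_from) == 0 ||
						   strcmp(nodes[i].host_name, sender_ip) == 0;

		if (host_matches && nodes[i].pgpool_port == port)
			return i;
	}
	return -1;
}

/* Small self-check: the peer is configured by IP but announces a hostname. */
void
check_examples(void)
{
	PeerNode nodes[] = {{"192.168.7.20", 9999}};

	/* Matches through the sender address even though the names differ. */
	assert(find_heartbeat_sender(nodes, 1, "pgpool1.example.local",
								 "192.168.7.20", 9999) == 0);
	/* A different port must still be rejected. */
	assert(find_heartbeat_sender(nodes, 1, "pgpool1.example.local",
								 "192.168.7.20", 9998) == -1);
}
```

Comparing only the configured host against the name in the packet would return -1 in the first case, which corresponds to the "has not send the heartbeat signal yet" symptom in the report.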


Muhammad Usama

2016-07-13 00:34

developer  

wd_hb_fix.diff (2,864 bytes)
diff --git a/src/watchdog/wd_heartbeat.c b/src/watchdog/wd_heartbeat.c
index 7b1cb98..67729b5 100644
--- a/src/watchdog/wd_heartbeat.c
+++ b/src/watchdog/wd_heartbeat.c
@@ -76,7 +76,7 @@ static int wd_create_hb_send_socket(WdHbIf * hb_if);
 static int wd_create_hb_recv_socket(WdHbIf * hb_if);
 
 static void wd_hb_send(int sock, WdHbPacket * pkt, int len, const char * destination, const int dest_port);
-static void wd_hb_recv(int sock, WdHbPacket * pkt);
+static void wd_hb_recv(int sock, WdHbPacket * pkt, char *from_addr);
 
 /* create socket for sending heartbeat */
 static int
@@ -301,9 +301,12 @@ wd_hb_send(int sock, WdHbPacket * pkt, int len, const char * host, const int por
 
 }
 
-/* receive heartbeat signal */
+/*
+ * receive heartbeat signal
+ * the function expects the from_addr to be at least WD_MAX_HOST_NAMELEN bytes long
+ */
 void
-static wd_hb_recv(int sock, WdHbPacket * pkt)
+static wd_hb_recv(int sock, WdHbPacket * pkt, char *from_addr)
 {
 	int rtn;
 	struct sockaddr_in senderinfo;
@@ -324,6 +327,8 @@ static wd_hb_recv(int sock, WdHbPacket * pkt)
 		ereport(DEBUG2,
 				(errmsg("watchdog heartbeat: received %d byte packet", rtn)));
 
+	strncpy(from_addr,inet_ntoa(senderinfo.sin_addr),WD_MAX_HOST_NAMELEN-1);
+
 	ntoh_wd_hb_packet(pkt, &buf);
 }
 
@@ -405,7 +410,7 @@ wd_hb_receiver(int fork_wait_time, WdHbIf *hb_if)
 		MemoryContextResetAndDeleteChildren(ProcessLoopContext);
 
 		/* receive heartbeat signal */
-		wd_hb_recv(sock, &pkt);
+		wd_hb_recv(sock, &pkt, from);
 			/* authentication */
 		if (strlen(pool_config->wd_authkey))
 		{
@@ -427,25 +432,24 @@ wd_hb_receiver(int fork_wait_time, WdHbIf *hb_if)
 		gettimeofday(&tv, NULL);
 
 		/* who send this packet? */
-		strlcpy(from, pkt.from, sizeof(from));
 		from_pgpool_port = pkt.from_pgpool_port;
 		for (i = 0; i< gslifeCheckCluster->nodeCount; i++)
 		{
 			LifeCheckNode* node = &gslifeCheckCluster->lifeCheckNodes[i];
 
 			ereport(DEBUG2,
-					(errmsg("received heartbeat signal from \"%s:%d\"",
-							from, from_pgpool_port)));
+					(errmsg("received heartbeat signal from \"%s:%d\" hostname:%s",
+							from, from_pgpool_port, pkt.from)));
 
-			if (!strcmp(node->hostName, from) && node->pgpoolPort == from_pgpool_port)
+			if ( (!strcmp(node->hostName, pkt.from) || !strcmp(node->hostName, from)) && node->pgpoolPort == from_pgpool_port)
 			{
 				/* this is the first packet or the latest packet */
 				if (!WD_TIME_ISSET(node->hb_send_time) ||
 					WD_TIME_BEFORE(node->hb_send_time, pkt.send_time))
 				{
 					ereport(DEBUG1,
-							(errmsg("received heartbeat signal from \"%s:%d\"",
-									from, from_pgpool_port)));
+							(errmsg("received heartbeat signal from \"%s(%s):%d\" node:%s",
+									from,pkt.from, from_pgpool_port, node->nodeName)));
 					
 					node->hb_send_time = pkt.send_time;
 					node->hb_last_recv_time = tv;

supp_k

2016-08-01 15:52

reporter   ~0000951

Hi,

Can you apply this patch to the "stable" release? I can apply it locally, but I need an official build that I can recommend to my customers, because they will not accept "patching" on the fly.

Serk.

Muhammad Usama

2016-08-01 23:43

developer   ~0000953

Thanks for the verification. I have pushed the fix to the master and 3.5 branches:

https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=ff7a6e8218346da56b5442b33913b2673f73bf7b

Issue History

Date Modified Username Field Change
2016-07-09 01:07 supp_k New Issue
2016-07-09 01:07 supp_k File Added: pgpool.conf
2016-07-09 01:40 supp_k File Added: data.tar.gz
2016-07-11 16:29 supp_k Note Added: 0000884
2016-07-12 13:31 t-ishii Assigned To => Muhammad Usama
2016-07-12 13:31 t-ishii Status new => assigned
2016-07-12 13:31 t-ishii Additional Information Updated View Revisions
2016-07-13 00:33 Muhammad Usama Note Added: 0000899
2016-07-13 00:33 Muhammad Usama Status assigned => confirmed
2016-07-13 00:34 Muhammad Usama File Added: wd_hb_fix.diff
2016-08-01 15:52 supp_k Note Added: 0000951
2016-08-01 23:43 Muhammad Usama Status confirmed => resolved
2016-08-01 23:43 Muhammad Usama Resolution open => fixed
2016-08-01 23:43 Muhammad Usama Note Added: 0000953