View Issue Details

ID: 0000545
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2019-10-31 18:40
Reporter: nglp
Assigned To: Muhammad Usama
Priority: normal
Severity: major
Reproducibility: random
Status: resolved
Resolution: fixed
Product Version: 4.0.4
Target Version: 4.0.7
Fixed in Version: 4.0.7
Summary: 0000545: Quorum lost and not recovered
Description:

Hi,

We have a 3-node Pgpool-II setup in front of a 3-node PostgreSQL master-standby-standby cluster with streaming replication, running on VMware VMs.

The Pgpool-II watchdog lifecheck method is heartbeat.

We had an issue because our backup method (VM snapshot) froze the master watchdog node (node2 at that moment) for longer than wd_heartbeat_deadtime, so the VIP was released, which is normal behaviour.

The problem is that after that, quorum was lost and never recovered, and the VIP was not reallocated to either of the other 2 nodes, so the service went down.

Quorum was never recovered and we had to restart the entire Pgpool-II stack, even though checking the watchdog state on pgpool still showed a master node. This has happened in the past and quorum was recovered and the VIP reallocated successfully, so I guess it is random.
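
(For reference, the watchdog/quorum state we were checking can be queried with the pcp_watchdog_info utility; the host, port and user below are placeholders for our environment:)

pcp_watchdog_info -h <pgpool_host> -p 9898 -U <pcp_user> -v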

The logs and our pgpool configuration are attached.

We will be happy to answer any questions.

Thanks, Guille
Steps To Reproduce:

This has happened after other master failovers caused by the same hang, but essentially:
* Hang the watchdog master node for slightly longer than wd_heartbeat_deadtime
* Quorum is lost
Additional Information:

node1, node2 and node3 have different wd_priority values, in that order; the other pgpool settings are exactly the same.

node2 was the watchdog master at that moment, and remained so after quorum was lost.
Tags: No tags attached.

Activities

nglp

2019-09-09 20:25

reporter  

node3.txt (983 bytes)
Sep  5 20:00:27 node3 pgpool[23592]: [12513-1] 2019-09-05 20:00:27: pid 23592: LOG:  read from socket failed, remote end closed the connection
Sep  5 20:00:27 node3 pgpool[23592]: [12514-1] 2019-09-05 20:00:27: pid 23592: LOG:  outbound socket of node2:9999 Linux node2 is closed
Sep  5 20:00:27 node3 pgpool[23592]: [12515-1] 2019-09-05 20:00:27: pid 23592: LOG:  new watchdog node connection is received from "node2:13969"
Sep  5 20:00:27 node3 pgpool[23592]: [12516-1] 2019-09-05 20:00:27: pid 23592: LOG:  read from socket failed, remote end closed the connection
Sep  5 20:00:27 node3 pgpool[23592]: [12517-1] 2019-09-05 20:00:27: pid 23592: LOG:  client socket of node2:9999 Linux node2 is closed
Sep  5 20:00:27 node3 pgpool[23592]: [12518-1] 2019-09-05 20:00:27: pid 23592: LOG:  new outbound connection to node2:9000
Sep  5 20:00:27 node3 pgpool[23592]: [12519-1] 2019-09-05 20:00:27: pid 23592: LOG:  new node joined the cluster hostname:"node2" port:9000 pgpool_port:9999
node2.txt (5,123 bytes)
Sep  5 20:00:27 node2 pgpool[22281]: [30-1] 2019-09-05 20:00:27: pid 22281: LOG:  informing the node status change to watchdog
Sep  5 20:00:27 node2 pgpool[22281]: [30-2] 2019-09-05 20:00:27: pid 22281: DETAIL:  node id :1 status = "NODE DEAD" message:"No heartbeat signal from node"
Sep  5 20:00:27 node2 pgpool[22281]: [31-1] 2019-09-05 20:00:27: pid 22281: LOG:  informing the node status change to watchdog
Sep  5 20:00:27 node2 pgpool[22281]: [31-2] 2019-09-05 20:00:27: pid 22281: DETAIL:  node id :2 status = "NODE DEAD" message:"No heartbeat signal from node"
Sep  5 20:00:27 node2 pgpool[22278]: [13484-1] 2019-09-05 20:00:27: pid 22278: LOG:  new IPC connection received
Sep  5 20:00:27 node2 pgpool[22278]: [13485-1] 2019-09-05 20:00:27: pid 22278: LOG:  received node status change ipc message
Sep  5 20:00:27 node2 pgpool[22278]: [13485-2] 2019-09-05 20:00:27: pid 22278: DETAIL:  No heartbeat signal from node
Sep  5 20:00:27 node2 pgpool[22278]: [13486-1] 2019-09-05 20:00:27: pid 22278: LOG:  remote node "node1:9999 Linux node1" is lost
Sep  5 20:00:27 node2 pgpool[22278]: [13487-1] 2019-09-05 20:00:27: pid 22278: LOG:  removing watchdog node "node1:9999 Linux node1" from the standby list
Sep  5 20:00:27 node2 pgpool[22278]: [13488-1] 2019-09-05 20:00:27: pid 22278: LOG:  new IPC connection received
Sep  5 20:00:27 node2 pgpool[22278]: [13489-1] 2019-09-05 20:00:27: pid 22278: LOG:  received node status change ipc message
Sep  5 20:00:27 node2 pgpool[22278]: [13489-2] 2019-09-05 20:00:27: pid 22278: DETAIL:  No heartbeat signal from node
Sep  5 20:00:27 node2 pgpool[22278]: [13490-1] 2019-09-05 20:00:27: pid 22278: LOG:  remote node "node3:9999 Linux node3" is lost
Sep  5 20:00:27 node2 pgpool[22278]: [13491-1] 2019-09-05 20:00:27: pid 22278: LOG:  removing watchdog node "node3:9999 Linux node3" from the standby list
Sep  5 20:00:27 node2 pgpool[22278]: [13492-1] 2019-09-05 20:00:27: pid 22278: LOG:  We have lost the quorum
Sep  5 20:00:27 node2 pgpool[22277]: [30530-1] 2019-09-05 20:00:27: pid 22277: LOG:  child process with pid: 25928 exits with status 256
Sep  5 20:00:27 node2 pgpool[22278]: [13493-1] 2019-09-05 20:00:27: pid 22278: LOG:  new watchdog node connection is received from "node3:12951"
Sep  5 20:00:27 node2 pgpool[22278]: [13494-1] 2019-09-05 20:00:27: pid 22278: LOG:  new outbound connection to node3:9000
Sep  5 20:00:27 node2 pgpool[22278]: [13495-1] 2019-09-05 20:00:27: pid 22278: LOG:  new node joined the cluster hostname:"node3" port:9000 pgpool_port:9999
Sep  5 20:00:27 node2 pgpool[22277]: [30531-1] 2019-09-05 20:00:27: pid 22277: LOG:  fork a new child process with pid: 10947
Sep  5 20:00:27 node2 pgpool[22277]: [30532-1] 2019-09-05 20:00:27: pid 22277: LOG:  Pgpool-II parent process received watchdog quorum change signal from watchdog
Sep  5 20:00:27 node2 pgpool[22278]: [13496-1] 2019-09-05 20:00:27: pid 22278: LOG:  new IPC connection received
Sep  5 20:00:27 node2 pgpool[22278]: [13497-1] 2019-09-05 20:00:27: pid 22278: LOG:  new outbound connection to node1:9000
Sep  5 20:00:27 node2 pgpool[22278]: [13498-1] 2019-09-05 20:00:27: pid 22278: LOG:  new watchdog node connection is received from "node1:18566"
Sep  5 20:00:27 node2 pgpool[22278]: [13499-1] 2019-09-05 20:00:27: pid 22278: LOG:  new node joined the cluster hostname:"node1" port:9000 pgpool_port:9999
Sep  5 20:00:27 node2 pgpool[10946]: [13493-1] 2019-09-05 20:00:27: pid 10946: LOG:  watchdog: de-escalation started
Sep  5 20:00:27 node2 pgpool[10946]: [13494-1] 2019-09-05 20:00:27: pid 10946: LOG:  successfully released the delegate IP:"vip"
Sep  5 20:00:27 node2 pgpool[10946]: [13494-2] 2019-09-05 20:00:27: pid 10946: DETAIL:  'if_down_cmd' returned with success
Sep  5 20:00:27 node2 pgpool[22278]: [13500-1] 2019-09-05 20:00:27: pid 22278: LOG:  watchdog de-escalation process with pid: 10946 exit with SUCCESS.
Sep  5 20:00:29 node2 pgpool[22281]: [32-1] 2019-09-05 20:00:29: pid 22281: LOG:  informing the node status change to watchdog
Sep  5 20:00:29 node2 pgpool[22281]: [32-2] 2019-09-05 20:00:29: pid 22281: DETAIL:  node id :1 status = "NODE ALIVE" message:"Heartbeat signal found"
Sep  5 20:00:29 node2 pgpool[22281]: [33-1] 2019-09-05 20:00:29: pid 22281: LOG:  informing the node status change to watchdog
Sep  5 20:00:29 node2 pgpool[22281]: [33-2] 2019-09-05 20:00:29: pid 22281: DETAIL:  node id :2 status = "NODE ALIVE" message:"Heartbeat signal found"
Sep  5 20:00:29 node2 pgpool[22278]: [13501-1] 2019-09-05 20:00:29: pid 22278: LOG:  new IPC connection received
Sep  5 20:00:29 node2 pgpool[22278]: [13502-1] 2019-09-05 20:00:29: pid 22278: LOG:  new IPC connection received
Sep  5 20:00:29 node2 pgpool[22278]: [13503-1] 2019-09-05 20:00:29: pid 22278: LOG:  received node status change ipc message
Sep  5 20:00:29 node2 pgpool[22278]: [13503-2] 2019-09-05 20:00:29: pid 22278: DETAIL:  Heartbeat signal found
Sep  5 20:00:29 node2 pgpool[22278]: [13504-1] 2019-09-05 20:00:29: pid 22278: LOG:  received node status change ipc message
Sep  5 20:00:29 node2 pgpool[22278]: [13504-2] 2019-09-05 20:00:29: pid 22278: DETAIL:  Heartbeat signal found
pgpool.conf (4,479 bytes)
node1.txt (983 bytes)
Sep  5 20:00:27 node1 pgpool[19434]: [24251-1] 2019-09-05 20:00:27: pid 19434: LOG:  read from socket failed, remote end closed the connection
Sep  5 20:00:27 node1 pgpool[19434]: [24252-1] 2019-09-05 20:00:27: pid 19434: LOG:  client socket of node2:9999 Linux node2 is closed
Sep  5 20:00:27 node1 pgpool[19434]: [24253-1] 2019-09-05 20:00:27: pid 19434: LOG:  read from socket failed, remote end closed the connection
Sep  5 20:00:27 node1 pgpool[19434]: [24254-1] 2019-09-05 20:00:27: pid 19434: LOG:  outbound socket of node2:9999 Linux node2 is closed
Sep  5 20:00:27 node1 pgpool[19434]: [24255-1] 2019-09-05 20:00:27: pid 19434: LOG:  new watchdog node connection is received from "node2:61630"
Sep  5 20:00:27 node1 pgpool[19434]: [24256-1] 2019-09-05 20:00:27: pid 19434: LOG:  new node joined the cluster hostname:"node2" port:9000 pgpool_port:9999
Sep  5 20:00:27 node1 pgpool[19434]: [24257-1] 2019-09-05 20:00:27: pid 19434: LOG:  new outbound connection to node2:9000

t-ishii

2019-09-12 10:36

developer   ~0002835

I pinged Usama, who is the authority on the watchdog, a few days ago to look into this issue.
You had probably better wait for a response from him.

In the meantime, I noticed you set wd_heartbeat_deadtime = 5, which means that if the watchdog does not receive a heartbeat signal within 5 seconds, that pgpool node will be regarded as dead (which is your situation). As a workaround, you could set it to a longer value, say 300 seconds, so that the server being backed up is not regarded as dead.
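
(For illustration only, the relevant pgpool.conf line with such a relaxed value might look like the following; 300 is just the example figure mentioned above, not a tested recommendation:)

wd_heartbeat_deadtime = 300    # seconds without a heartbeat before the remote pgpool node is declared dead (was 5 in the reported configuration)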

nglp

2019-09-12 16:09

reporter   ~0002836

Hi t-ishii,

Thank you for your response, we will wait for him :)

Yes, once we had identified that timeout we increased wd_heartbeat_deadtime to 30 (it was 5) and wd_interval to 10 (it was 2) to avoid these problems. Those are the default values in the streaming replication cluster example configuration.

We had configured lower values in the past to allow the fastest failover possible.

Best regards, Guille

t-ishii

2019-09-12 16:49

developer   ~0002837

This one:
https://www.pgpool.net/mantisbt/view.php?id=547
seems to be a similar issue?

nglp

2019-09-12 17:19

reporter   ~0002838

Hi,

No, it's not the same. If you look at the attached logs, the following things happen:
* Heartbeat signals are lost
* Quorum is lost
* Watchdog de-escalation starts
* The VIP is removed
* Heartbeat signals come back

But at that point, quorum is not recovered, so the VIP stays down.

The issue in our case is that quorum is never recovered, even though the heartbeat signals are back.

Best regards

Muhammad Usama

2019-09-12 18:54

developer   ~0002839

Hi

The log files you shared do not have much information, and it is difficult to guess what could have gone wrong.
Can you please share the complete pgpool logs for all three nodes?

nglp

2019-09-12 19:07

reporter   ~0002840

Hi,

I'm sorry to say that is the only output pgpool generated for this issue.

If you want, I can attach the entire logs, but they are filled with irrelevant information.

Let me know

Thanks, Guille

Muhammad Usama

2019-09-12 19:17

developer   ~0002841

Are you able to reliably reproduce the issue? If so, sharing the exact steps would be enough.

Otherwise, please share the entire log along with the time when the issue happened, and I will try to see
if some usable information can be extracted from it.

Thanks

nglp

2019-09-12 19:27

reporter   ~0002842

I'm setting up a new environment to run some tests; the other environments are all production.

The way I think you can reproduce it is:
* set low wd_heartbeat_deadtime and wd_interval values (such as 2 and 1)
* take a virtual machine snapshot of the master watchdog node
* check what happens

Expect a random result; at other times the VIP has failed over successfully and quorum has been recovered correctly.

I will prepare the logs for upload (I have to replace some strings, such as hostnames and IPs, and remove some queries).

Thanks, Guille

nglp

2019-09-13 21:51

reporter   ~0002846

Hi,

I'm back with "good" news: I am able to reproduce the issue in my test environment with the attached pgpool.conf and the following modifications:
* wd_interval = 1
* wd_heartbeat_deadtime = 2
* log_error_verbosity = verbose
* client_min_messages = debug5
* log_min_messages = debug1

Attaching logs. pgpool was stopped, the log cleared, pgpool started, we waited 2 minutes, the issue was reproduced, we waited 1 minute and then stopped it.

Let me know if you need a more detailed log_min_messages level.

Thanks, Guille

nglp

2019-09-13 21:51

reporter  

server1.txt.gz (115,952 bytes)
server2.txt.gz (112,256 bytes)
server3.txt.gz (116,210 bytes)

Muhammad Usama

2019-09-13 23:02

developer   ~0002848

Hi

Thanks a bundle for the logs. I am looking into them.

Muhammad Usama

2019-09-17 05:05

developer   ~0002853

Hi

Thanks for providing the logs. I was able to identify the issue using them.
I am working on the fix.

Thanks

nglp

2019-09-17 16:37

reporter   ~0002857

Hi,

Thank you so much for your work

If we can help you in any way (reproducing it more times, trying other parameters, etc.), just let me know.

Best regards, Guille

Muhammad Usama

2019-09-18 23:18

developer   ~0002860

Hi Guille,

Can you please try out the attached patch and see if it fixes the issue?

Thanks
Best Regards
Muhammad Usama

recover_quorum_fix.diff (8,328 bytes)
diff --git a/src/include/watchdog/watchdog.h b/src/include/watchdog/watchdog.h
index 78df76c..61c5ff1 100644
--- a/src/include/watchdog/watchdog.h
+++ b/src/include/watchdog/watchdog.h
@@ -6,7 +6,7 @@
  * pgpool: a language independent connection pool server for PostgreSQL
  * written by Tatsuo Ishii
  *
- * Copyright (c) 2003-2017	PgPool Global Development Group
+ * Copyright (c) 2003-2019	PgPool Global Development Group
  *
  * Permission to use, copy, modify, and distribute this software and
  * its documentation for any purpose and without fee is hereby
@@ -38,6 +38,30 @@
                                     ((a).tv_usec - (b).tv_usec) / 1000000.0)
 
 /*
+ * Data version numnber of watchdog messages
+ * The version number is in major.minor format
+ * The major versions always kept compitable.
+ *
+ * Increment the minor version whenever a minor change is
+ * made to message data, where the older versions can still
+ * work even when that change is not present.
+ *
+ * while increamenting the major version would mean that
+ * the watcdog node with older major version will not be
+ * allowed to join the cluster
+ *
+ * Since the message data version was not present from the
+ * begining so the default version is considered to be 1.0
+ * meaning if the data version number is not present in the
+ * watchdog node info then it will be considered as version 1.0
+ */
+
+#define WD_MESSAGE_DATA_VERSION_MAJOR	"1"
+#define WD_MESSAGE_DATA_VERSION_MINOR	"1"
+#define WD_MESSAGE_DATA_VERSION	WD_MESSAGE_DATA_VERSION_MAJOR "." WD_MESSAGE_DATA_VERSION_MINOR
+#define MAX_VERSION_STR_LEN	10
+
+/*
  * watchdog state
  */
 typedef enum
@@ -89,8 +113,8 @@ typedef enum
 
 	WD_EVENT_NODE_CON_LOST,
 	WD_EVENT_NODE_CON_FOUND,
-	WD_EVENT_CLUSTER_QUORUM_CHANGED
-
+	WD_EVENT_CLUSTER_QUORUM_CHANGED,
+	WD_EVENT_WD_STATE_REQUIRE_RELOAD
 }			WD_EVENTS;
 
 typedef struct SocketConnection
@@ -111,6 +135,10 @@ typedef struct WatchdogNode
 									 * from the node */
 	struct timeval last_sent_time;	/* timestamp when last packet was sent on
 									 * the node */
+	char 		pgp_version[MAX_VERSION_STR_LEN];		/* Pgpool-II version */
+	int			wd_data_major_version;
+	int			wd_data_minor_version;
+
 	char		nodeName[WD_MAX_NODE_NAMELEN];	/* name of this node */
 	char		hostname[WD_MAX_HOST_NAMELEN];	/* host name */
 	int			wd_port;		/* watchdog port */
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index 1a03927..3b6b659 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -325,7 +325,8 @@ PgpoolMain(bool discard_status, bool clear_memcache_oidmaps)
 		}
 
 		ereport(LOG,
-				(errmsg("watchdog process is initialized")));
+				(errmsg("watchdog process is initialized"),
+				 errdetail("watchdog messaging data version: %s",WD_MESSAGE_DATA_VERSION)));
 
 		/*
 		 * initialize the lifecheck process
diff --git a/src/watchdog/watchdog.c b/src/watchdog/watchdog.c
index f0b0f0d..7788dd7 100644
--- a/src/watchdog/watchdog.c
+++ b/src/watchdog/watchdog.c
@@ -128,6 +128,7 @@ typedef enum IPC_CMD_PREOCESS_RES
 #define CLUSTER_IAM_NOT_TRUE_MASTER			'X'
 #define CLUSTER_IAM_RESIGNING_FROM_MASTER	'R'
 #define CLUSTER_NODE_INVALID_VERSION		'V'
+#define CLUSTER_NODE_REQUIRE_TO_RELOAD		'I'
 
 #define WD_MASTER_NODE getMasterWatchdogNode()
 
@@ -197,7 +198,8 @@ char	   *wd_event_name[] =
 	"THIS NODE FOUND",
 	"NODE CONNECTION LOST",
 	"NODE CONNECTION FOUND",
-	"CLUSTER QUORUM STATUS CHANGED"
+	"CLUSTER QUORUM STATUS CHANGED",
+	"NODE REQUIRE TO RELOAD STATE"
 };
 
 char	   *wd_state_names[] = {
@@ -1496,7 +1498,10 @@ read_sockets(fd_set *rmask, int pending_fds_count)
 								close_socket_connection(&wdNode->server_socket);
 								strlcpy(wdNode->delegate_ip, tempNode->delegate_ip, WD_MAX_HOST_NAMELEN);
 								strlcpy(wdNode->nodeName, tempNode->nodeName, WD_MAX_HOST_NAMELEN);
+								strlcpy(wdNode->pgp_version, tempNode->pgp_version, MAX_VERSION_STR_LEN);
 								wdNode->state = tempNode->state;
+								wdNode->wd_data_major_version = tempNode->wd_data_major_version;
+								wdNode->wd_data_minor_version = tempNode->wd_data_minor_version;
 								wdNode->startup_time.tv_sec = tempNode->startup_time.tv_sec;
 								wdNode->wd_priority = tempNode->wd_priority;
 								wdNode->server_socket = *conn;
@@ -2008,7 +2013,7 @@ static IPC_CMD_PREOCESS_RES process_IPC_nodeStatusChange_command(WDCommandData *
 {
 	int			nodeStatus;
 	int			nodeID;
-	char	   *message;
+	char	   *message = NULL;
 	bool		ret;
 
 	if (ipcCommand->sourcePacket.len <= 0 || ipcCommand->sourcePacket.data == NULL)
@@ -2025,11 +2030,12 @@ static IPC_CMD_PREOCESS_RES process_IPC_nodeStatusChange_command(WDCommandData *
 	}
 
 	if (message)
+	{
 		ereport(LOG,
 				(errmsg("received node status change ipc message"),
 				 errdetail("%s", message)));
-	pfree(message);
-
+		pfree(message);
+	}
 	if (fire_node_status_event(nodeID, nodeStatus) == false)
 		return IPC_CMD_ERROR;
 
@@ -3757,6 +3763,12 @@ cluster_service_message_processor(WatchdogNode * wdNode, WDPacketData * pkt)
 			}
 			break;
 
+		case CLUSTER_NODE_REQUIRE_TO_RELOAD:
+		{
+			watchdog_state_machine(WD_EVENT_WD_STATE_REQUIRE_RELOAD, NULL, NULL, NULL);
+		}
+			break;
+
 		case CLUSTER_NODE_INVALID_VERSION:
 			{
 				/*
@@ -5662,6 +5674,32 @@ watchdog_state_machine_coordinator(WD_EVENTS event, WatchdogNode * wdNode, WDPac
 			}
 			break;
 
+		case WD_EVENT_REMOTE_NODE_FOUND:
+		{
+			ereport(LOG,
+				(errmsg("remote node \"%s\" became reachable again", wdNode->nodeName),
+					errdetail("trying to add it back as a standby")));
+			/* If I am the cluster master. Ask for the node info and to re-send the join message */
+			send_message_of_type(wdNode, WD_REQ_INFO_MESSAGE, NULL);
+			if (wdNode->wd_data_major_version >= 1 && wdNode->wd_data_minor_version >= 1)
+			{
+				/*
+				 * Since data version 1.1 we support CLUSTER_NODE_REQUIRE_TO_RELOAD
+				 * which makes the standby nodes to re-send the join master node
+				 */
+				send_cluster_service_message(wdNode, pkt, CLUSTER_NODE_REQUIRE_TO_RELOAD);
+			}
+			else
+			{
+				/*
+				 * The node is on older version
+				 * So ask it to re-join the cluster
+				 */
+				send_cluster_service_message(wdNode, pkt, CLUSTER_NEEDS_ELECTION);
+			}
+			break;
+		}
+
 		case WD_EVENT_PACKET_RCV:
 			{
 				switch (pkt->type)
@@ -6164,6 +6202,10 @@ watchdog_state_machine_standby(WD_EVENTS event, WatchdogNode * wdNode, WDPacketD
 			set_timeout(5);
 			break;
 
+		case WD_EVENT_WD_STATE_REQUIRE_RELOAD:
+			send_cluster_command(WD_MASTER_NODE, WD_JOIN_COORDINATOR_MESSAGE, 5);
+			break;
+
 		case WD_EVENT_COMMAND_FINISHED:
 			{
 				if (clusterCommand->commandPacket.type == WD_JOIN_COORDINATOR_MESSAGE)
diff --git a/src/watchdog/wd_json_data.c b/src/watchdog/wd_json_data.c
index 4fa8dd7..d0abc2c 100644
--- a/src/watchdog/wd_json_data.c
+++ b/src/watchdog/wd_json_data.c
@@ -435,6 +435,10 @@ get_watchdog_node_info_json(WatchdogNode * wdNode, char *authkey)
 
 	JsonNode   *jNode = jw_create_with_object(true);
 
+	jw_put_string(jNode, "PGPOOL_VERSION", VERSION);
+	jw_put_string(jNode, "DATA_VERSION_MAJOR", WD_MESSAGE_DATA_VERSION_MAJOR);
+	jw_put_string(jNode, "DATA_VERSION_MINOR", WD_MESSAGE_DATA_VERSION_MINOR);
+
 	jw_put_int(jNode, "State", wdNode->state);
 	jw_put_int(jNode, "WdPort", wdNode->wd_port);
 	jw_put_int(jNode, "PgpoolPort", wdNode->pgpool_port);
@@ -534,6 +538,24 @@ get_watchdog_node_from_json(char *json_data, int data_len, char **authkey)
 		goto ERROR_EXIT;
 	strncpy(wdNode->delegate_ip, ptr, sizeof(wdNode->delegate_ip) - 1);
 
+	ptr = json_get_string_value_for_key(root, "DATA_VERSION_MAJOR");
+	if (ptr == NULL)
+		wdNode->wd_data_major_version = 1;
+	else
+		wdNode->wd_data_major_version = atoi(ptr);
+
+	ptr = json_get_string_value_for_key(root, "DATA_VERSION_MINOR");
+	if (ptr == NULL)
+		wdNode->wd_data_minor_version = 0;
+	else
+		wdNode->wd_data_minor_version = atoi(ptr);
+
+	ptr = json_get_string_value_for_key(root, "PGPOOL_VERSION");
+	if (ptr != NULL)
+		strncpy(wdNode->pgp_version, ptr, sizeof(wdNode->pgp_version) - 1);
+	else
+		wdNode->pgp_version[0] = '0';
+
 	if (authkey)
 	{
 		ptr = json_get_string_value_for_key(root, "authkey");
@@ -542,6 +564,7 @@ get_watchdog_node_from_json(char *json_data, int data_len, char **authkey)
 		else
 			*authkey = NULL;
 	}
+
 	return wdNode;
 
 ERROR_EXIT:

nglp

2019-09-19 00:17

reporter   ~0002862

Hi,

Should I download the source code for 4.0.4 and compile it with the patch, or is the patch for another version? Currently pgpool has been installed using the RPM package from the repository.

Best regards

t-ishii

2019-09-19 13:26

developer   ~0002864

Last edited: 2019-09-19 13:29

After applying the patch to the master branch, I get a few regression test failures:
testing 004.watchdog...ok.
testing 011.watchdog_quorum_failover...failed.
testing 012.watchdog_failover_when_quorum_exists...failed.
testing 013.watchdog_failover_require_consensus...failed.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.

013 has already been failing recently, but the 011 and 012 failures are new.

Muhammad Usama

2019-09-19 18:24

developer   ~0002865

Somehow I am getting a clean regression run with the patch.

..
testing 006.memqcache...ok.
testing 007.memqcache-memcached...ok.
testing 008.dbredirect...ok.
testing 009.sql_comments...ok.
testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok.
testing 025.enable_shared_relcache...ok.
testing 026.temp_table...ok.
testing 027.auto_failback...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.
testing 050.bug58...ok.
testing 051.bug60...ok.
..

So I will try to figure out why it could be failing on your system.
I am running the regression on the master branch with PG12.

Muhammad Usama

2019-09-19 18:26

developer   ~0002866

Hi Guille,

If you have a development environment set up on your system, you can pull the code, apply the patch on the master branch
(I believe the same patch should also work with the 4_0 branch), and test the scenario.
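
(For anyone following along, a rough sketch of that workflow might look like the lines below; the repository URL, branch name and configure options are assumptions on my part, not exact instructions:)

git clone https://git.postgresql.org/git/pgpool2.git      # repository URL assumed
cd pgpool2
git checkout V4_0_STABLE                                  # assumed name of the 4_0 branch; or stay on master as suggested above
patch -p1 < /path/to/recover_quorum_fix.diff              # the patch attached to this report
./configure --with-pgsql=/usr/pgsql-11                    # point at your PostgreSQL installation
make
sudo make install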

nglp

2019-09-19 19:17

reporter   ~0002867

Hi Muhammad,

I downloaded the source code for version 4.0.4, applied the patch, compiled it on all machines and started the service with the new binary.

I'm sad to report that the issue still occurs.

My PostgreSQL version is 11.4 in this environment (I saw while compiling that it needs some PostgreSQL libraries).

Best regards

t-ishii

2019-09-19 20:39

developer   ~0002868

The regression logs for 011 and 012 are attached.
This is the master branch + PostgreSQL 11.4.

regression_log.tar.gz (1,891 bytes)

Muhammad Usama

2019-09-20 18:50

developer   ~0002869

Hi Ishii-San

Could you also share the pgpool.log files for both of these test cases? I am still not able to reproduce the failures running against either PG11 or PG12.

t-ishii

2019-09-20 20:58

developer   ~0002870

Sure.

011-012.tar.gz (6,663 bytes)

Muhammad Usama

2019-09-20 23:39

developer   ~0002871

Hi Ishii-San

Many thanks for providing the logs. There is something strange here; I am not able to work out what is going wrong with your setup.
However, I noticed that the logs you shared do not contain any health check debug messages,

such as:
WARNING: check_backend_down_request: failed to open file standby2/log/backend_down_request

So I am wondering: was Pgpool-II compiled with health check debugging enabled?

I am also attaching the corresponding logs from running the regression on my system, in case you can spot a difference that identifies what is going wrong.

Thanks

reg_logs_usama.zip (29,726 bytes)

t-ishii

2019-09-21 08:03

developer   ~0002872

Oops, you are right. I did not enable health check debugging while compiling. After enabling it, the 011 and 012 tests succeed.
I am terribly sorry for the misinformation.

Muhammad Usama

2019-09-24 15:31

developer   ~0002874

Hi Ishii-San

;-) no issues. Happens with me all the time.

So I was able to test the patch thoroughly using the new wd_cli utility I checked in yesterday, and the patch is behaving as expected.
Do you agree with me going ahead and committing it?

t-ishii

2019-09-24 15:39

developer   ~0002875

Not sure. I think you need to get logs from nglp to find out why his issue is not resolved with the patch.

nglp

2019-09-24 16:23

reporter   ~0002876

Hi,

Will do; it will probably take 1-2 days.

Thanks, Guille

Muhammad Usama

2019-09-24 17:05

developer   ~0002877

Hi Ishii-San
Sorry, my bad. I somehow overlooked the note from Guille and did not realise he had reported that the issue was not fixed by the patch.

@Guille I will look into the issue once you share the pgpool log files with the patch applied.
Thanks in advance.

Kind Regards
Muhammad Usama

nglp

2019-09-24 20:21

reporter   ~0002878

Hi,

Attaching logs. pgpool was stopped, the log cleared, pgpool started, we waited 2 minutes, the issue was reproduced, we waited 1 minute and then stopped it.

Compiled PGPool 4.0.4 + Patch
PostgreSQL 11.4

Let me know

Best regards

server3.txt-2.gz (97,602 bytes)
server1.txt-2.gz (106,666 bytes)
server2.txt-2.gz (97,357 bytes)

Muhammad Usama

2019-09-25 19:09

developer   ~0002880

Hi Guille,

Many thanks for providing the logs. Looking at them, I suspect there is some issue with how the patch was applied,
since the source line numbers in the logs' LOCATION fields appear at an offset from where they should be.
For example:
Sep 24 07:16:20 server1 pgpool[15336]: [557-1] 2019-09-24 07:16:20: pid 15336: DEBUG: I am the cluster leader node command finished with status:[COMMAND TIMEED OUT] which is success
Sep 24 07:16:20 server1 pgpool[15336]: [557-2] 2019-09-24 07:16:20: pid 15336: DETAIL: The command was sent to 2 nodes and 1 nodes replied to it
Sep 24 07:16:20 server1 pgpool[15336]: [557-3] 2019-09-24 07:16:20: pid 15336: LOCATION: watchdog.c:5548

With the Pgpool-II 4.0.4 source code, this line should say "LOCATION: watchdog.c:5517", not "LOCATION: watchdog.c:5548".

Also, the other log messages suggest a similar problem with the code.
So could you send me the code with the patch applied? Alternatively, I can send you the built binaries (if you are using macOS or CentOS 7), whichever suits you better.
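
(As a hypothetical quick check for this kind of situation, a dry run of the patch from the unpacked source tree shows whether every hunk applies cleanly, and grepping for a symbol the patch introduces confirms whether it actually went in:)

cd pgpool-II-4.0.4
patch -p1 --dry-run < recover_quorum_fix.diff                        # reports hunks that fail or look already applied
grep -n WD_EVENT_WD_STATE_REQUIRE_RELOAD src/watchdog/watchdog.c     # only present if the patch was applied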

Thanks in advance for helping in debugging this.

Best Regards

nglp

2019-09-25 19:57

reporter   ~0002881

Hi,

Sure, my Pgpool-II 4.0.4 + patch source is attached.

It is probably my mistake in applying it; I did it manually. So if you have the source code with the patch properly applied, I can compile it in my test environment (Red Hat 7).

Also, we are open to updating Pgpool-II to the latest version (a few days ago we suffered a segfault in a connection pool process; we are still trying to reproduce it or obtain a core dump in order to open an issue), so if you have the patch applied on top of 4.0.6, that will work for us too.

Many thanks to you, Guille

pgpool-II-4.0.4.fix.tgz (3,898,639 bytes)

Muhammad Usama

2019-09-25 20:53

developer   ~0002882

Hi

Many thanks for the prompt reply. My suspicion was correct: there was an issue with the patching.
I am attaching the patched code for the same Pgpool-II 4.0.4; can you please try it out?
Once you verify the fix, I will apply it to all supported branches.

Thanks

pgpool-II-4.0.4-patched.tar.gz (3,365,708 bytes)

nglp

2019-09-25 22:11

reporter   ~0002883

Hi,

Good news: the issue looks to be solved with that code.

Attaching logs. pgpool was stopped, the log cleared, pgpool started, we waited 2 minutes, the issue was reproduced, we waited 1 minute and then stopped it.

Also, beyond the attached logs, I repeated the test 10-15 times and every time it went OK.

Thank you all for your work

Best regards, Guille

server3.txt-3.gz (107,761 bytes)
server1.txt-3.gz (117,111 bytes)
server2.txt-3.gz (106,937 bytes)

Muhammad Usama

2019-09-25 23:41

developer   ~0002884

Hi Guille,

Many thanks for testing and confirming the fix. I will commit the patch to all affected branches.

Best Regards
Muhammad Usama

Muhammad Usama

2019-09-29 06:06

developer   ~0002892

I have committed the fix to all supported branches

https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106

Muhammad Usama

2019-09-29 06:07

developer   ~0002893

Fix committed to all supported branches


https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106

Issue History

Date Modified Username Field Change
2019-09-09 20:25 nglp New Issue
2019-09-09 20:25 nglp File Added: node1.txt
2019-09-09 20:25 nglp File Added: node2.txt
2019-09-09 20:25 nglp File Added: node3.txt
2019-09-09 20:25 nglp File Added: pgpool.conf
2019-09-12 10:36 t-ishii Note Added: 0002835
2019-09-12 16:09 nglp Note Added: 0002836
2019-09-12 16:17 t-ishii Assigned To => Muhammad Usama
2019-09-12 16:17 t-ishii Status new => assigned
2019-09-12 16:49 t-ishii Note Added: 0002837
2019-09-12 17:19 nglp Note Added: 0002838
2019-09-12 18:54 Muhammad Usama Note Added: 0002839
2019-09-12 19:07 nglp Note Added: 0002840
2019-09-12 19:17 Muhammad Usama Note Added: 0002841
2019-09-12 19:27 nglp Note Added: 0002842
2019-09-13 21:51 nglp Note Added: 0002846
2019-09-13 21:51 nglp File Added: server1.txt.gz
2019-09-13 21:51 nglp File Added: server2.txt.gz
2019-09-13 21:51 nglp File Added: server3.txt.gz
2019-09-13 23:02 Muhammad Usama Note Added: 0002848
2019-09-17 05:05 Muhammad Usama Note Added: 0002853
2019-09-17 16:37 nglp Note Added: 0002857
2019-09-18 23:18 Muhammad Usama File Added: recover_quorum_fix.diff
2019-09-18 23:18 Muhammad Usama Note Added: 0002860
2019-09-19 00:17 nglp Note Added: 0002862
2019-09-19 13:26 t-ishii Note Added: 0002864
2019-09-19 13:29 t-ishii Note Edited: 0002864 View Revisions
2019-09-19 18:24 Muhammad Usama Note Added: 0002865
2019-09-19 18:26 Muhammad Usama Note Added: 0002866
2019-09-19 19:17 nglp Note Added: 0002867
2019-09-19 20:39 t-ishii File Added: regression_log.tar.gz
2019-09-19 20:39 t-ishii Note Added: 0002868
2019-09-20 18:50 Muhammad Usama Note Added: 0002869
2019-09-20 20:58 t-ishii File Added: 011-012.tar.gz
2019-09-20 20:58 t-ishii Note Added: 0002870
2019-09-20 23:39 Muhammad Usama File Added: reg_logs_usama.zip
2019-09-20 23:39 Muhammad Usama Note Added: 0002871
2019-09-21 08:03 t-ishii Note Added: 0002872
2019-09-24 15:31 Muhammad Usama Note Added: 0002874
2019-09-24 15:39 t-ishii Note Added: 0002875
2019-09-24 16:23 nglp Note Added: 0002876
2019-09-24 17:05 Muhammad Usama Note Added: 0002877
2019-09-24 20:21 nglp File Added: server1.txt-2.gz
2019-09-24 20:21 nglp File Added: server2.txt-2.gz
2019-09-24 20:21 nglp File Added: server3.txt-2.gz
2019-09-24 20:21 nglp Note Added: 0002878
2019-09-25 19:09 Muhammad Usama Note Added: 0002880
2019-09-25 19:57 nglp File Added: pgpool-II-4.0.4.fix.tgz
2019-09-25 19:57 nglp Note Added: 0002881
2019-09-25 20:53 Muhammad Usama File Added: pgpool-II-4.0.4-patched.tar.gz
2019-09-25 20:53 Muhammad Usama Note Added: 0002882
2019-09-25 22:11 nglp File Added: server1.txt-3.gz
2019-09-25 22:11 nglp File Added: server2.txt-3.gz
2019-09-25 22:11 nglp File Added: server3.txt-3.gz
2019-09-25 22:11 nglp Note Added: 0002883
2019-09-25 23:41 Muhammad Usama Note Added: 0002884
2019-09-29 06:06 Muhammad Usama Note Added: 0002892
2019-09-29 06:07 Muhammad Usama Status assigned => resolved
2019-09-29 06:07 Muhammad Usama Resolution open => fixed
2019-09-29 06:07 Muhammad Usama Note Added: 0002893
2019-10-01 11:53 administrator Target Version => 4.0.7
2019-10-31 18:40 administrator Fixed in Version => 4.0.7