View Issue Details

ID: 0000306
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2017-08-29 16:13
Reporter: supp_k
Assigned To: Muhammad Usama
Priority: high
Severity: major
Reproducibility: always
Status: feedback
Resolution: reopened
Platform: x86_64
OS: CentOS
OS Version: 7.X
Product Version: 3.6.2
Target Version:
Fixed in Version:
Summary: 0000306: Pgpool steals back MASTER status
Description: When the master Pgpool that was disconnected from the other 2 nodes becomes connected again, it steals back the MASTER status. As a result, the newly promoted MASTER (elected because the previous one was disconnected) gets demoted to standby.

Steps To Reproduce: 3 Pgpool instances:

Step 1:
1) A - standby
2) B - standby
3) C - master

Bring down the network interface of Pgpool C (ifdown). The picture changes to:
1) A - standby
2) B - master <= it was elected because Pgpool C 'disappeared'
3) C - master <= it is still MASTER!! But it should be demoted



Step 2: restore the network on server C; the result is the following:
1) A - standby
2) B - standby <= it was demoted! Why?
3) C - master <= it is MASTER again






Tags: No tags attached.

Activities

supp_k

2017-04-25 03:03

reporter   ~0001458

Instance C then reports QUORUM_ABSENT when it gets disconnected from the other 2 Pgpool instances.

Muhammad Usama

2017-04-25 05:26

developer   ~0001459

What was your expectation in that case?
Do you think node B should remain the MASTER after node C joins back in, since B has the quorum while C does not?

supp_k

2017-04-25 05:56

reporter   ~0001460

Muhammad,
I checked out the HEAD of the V3_6 branch, and quick testing revealed there is no such problem, i.e. Pgpool B remains master while C turns into standby. It seems the issue was fixed by your recent commit.

Give me a couple of days and I will confirm that the problem has disappeared.

supp_k

2017-04-25 18:58

reporter   ~0001461

Muhammad,

Yes. The problem is not reproduced on top of the V3_6 branch.

supp_k

2017-04-25 21:42

reporter   ~0001463

I'm sorry for the previous quick answer, but the issue is reproduced: not often, but it does reproduce on the HEAD of the V3_6_STABLE branch.
Let's consider the environment:
1) Pgpool A (Standby)
2) Pgpool B (Master) + PG backend B (Master)
3) Pgpool C (Standby) + PG backend (Sync slave)

The details are the same: when the former Pgpool master gets connected again to the remaining Pgpool nodes (A and B), it steals the master status from the newly elected master (say, B). That by itself might seem acceptable, but DB clients suffer a database connection shock once again because the VIP is relocated back to C.

And also, which is very bad: at this moment the old Pgpool master (instance C) considers PG backend B to be DOWN because it was separated by the network. When the connection is restored, it forwards all SQL to the PG on C. But this instance doesn't contain fresh data! During the period when Pgpool C was absent, the new Pgpool master on B was forwarding all SQL to its local PG instance, which had been promoted to master!

Muhammad Usama

2017-04-25 22:07

developer   ~0001464

First of all, many thanks for testing this and reporting the issue. The latest code on the HEAD of the V3_6_STABLE branch is supposed to consider the node escalation state before anything else while deciding the rightful master node, but apparently it is not behaving as intended in your test environment.
Can you confirm that all three nodes are using the code on the HEAD of V3_6_STABLE? Also, if possible, can you share the pgpool log around the time when the disconnected old master rejoins the cluster and the situation occurs?
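
For illustration only, here is a minimal sketch of the resolution order described above (this is not the actual pgpool-II code; the struct, the helper name and the tie-breakers beyond the escalation state are assumptions): when two nodes both claim to be coordinator, the escalation state is compared first, and only then other criteria such as quorum.

/* Illustrative sketch, not pgpool-II source code. */
typedef struct
{
    int escalated;        /* 1 if the node holds the delegate IP (escalated) */
    int has_quorum;       /* 1 if the node currently sees a majority         */
    int connected_nodes;  /* number of reachable watchdog nodes              */
} node_view;

/* Returns 1 if the local node should stay master in a split-brain. */
static int local_should_remain_master(const node_view *local, const node_view *remote)
{
    /* 1. the escalated node wins first */
    if (local->escalated != remote->escalated)
        return local->escalated;

    /* 2. then the node that holds the quorum (assumed tie-breaker) */
    if (local->has_quorum != remote->has_quorum)
        return local->has_quorum;

    /* 3. finally the node that sees more of the cluster (assumed tie-breaker) */
    return local->connected_nodes >= remote->connected_nodes;
}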

supp_k

2017-04-25 22:10

reporter   ~0001465

> What was your expectation in that case?
> Do you think node B should remain the MASTER after node C joins back in, since B has the quorum while C does not?

Yes, exactly. If the MASTER instance loses the quorum, then it should resign (or become a master candidate): this is the concept of distributed voting. But as of now I see it remains MASTER despite the fact that it is disconnected from the other Pgpool instances.
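
To make the expectation concrete, here is a minimal sketch of the behaviour being asked for, assuming a simple strict-majority quorum over the configured node count (illustrative only, not pgpool-II's actual quorum code; the function names are hypothetical):

#include <stdbool.h>

/* Strict majority: more than half of the configured watchdog nodes.
 * reachable_nodes counts this node plus the remote nodes it can reach. */
static bool has_quorum(int total_nodes, int reachable_nodes)
{
    return reachable_nodes > total_nodes / 2;
}

/*
 * A coordinator that can no longer reach a majority of the cluster should
 * step down (resign the MASTER role and release the VIP) instead of keeping it.
 */
static bool coordinator_should_resign(int total_nodes, int reachable_nodes,
                                      bool i_am_coordinator)
{
    return i_am_coordinator && !has_quorum(total_nodes, reachable_nodes);
}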

##################################################################################
The 'pcp_watchdog_info' for the disconnected Pgpool instance C gives this result:
[root@OAA-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone

OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone 192.168.68.164 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 8 LOST
##################################################################################
B Pgpool instance (newly elected Master):
[root@OAB-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 YES OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone

OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone 192.168.69.178 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 7 STANDBY
##################################################################################

supp_k

2017-04-25 22:19

reporter   ~0001467

Muhammad,

The used version was freshly built from the HEAD of V3_6_STABLE:
[root@OAA-4f0e8fadf6f0 pgha]# pgpool --version
pgpool-II version 3.6.2 (subaruboshi)

Also the configuration files for 3 pgpool instances are attached.

pgpool_conf.tar.gz (20,480 bytes)

Muhammad Usama

2017-04-25 22:24

developer   ~0001468

Thanks for the configuration files, but if you could share the pgpool log files as well, that would be really helpful.

supp_k

2017-04-26 00:20

reporter   ~0001469

Muhammad,

I'm trying to reproduce the issue. You probably won't believe me, but I have been failing to do so for more than an hour. As soon as I reproduce the problem I'll share the logs!

supp_k

2017-04-26 00:59

reporter   ~0001470

Muhammad,

I attached the archive pgpool_fail_voting.tar.gz.

Please look into the 'db_node_b.log' file.
1) At 18:37:50 DB node B detected that the Pgpool on "OAA-4f0e8fadf6f0.aqa.int.zone" was lost. Then it was promoted to MASTER. This is OK, because the network interface of the previous master on "OAA-4f0e8fadf6f0.aqa.int.zone" was brought down (ifdown eth0).

2) At approximately 18:40:54 the network interface of the former master was restored, and I see the following sequence of events in the same log:

Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-1] LOG: new watchdog node connection is received from "192.168.69.178:22753"
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-2] LOCATION: watchdog.c:3178
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-1] LOG: new node joined the cluster hostname:"OAA-4f0e8fadf6f0.aqa.int.zone" port:9000 pgpool_port:5432
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-2] LOCATION: watchdog.c:1442
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-1] LOG: remote node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone" decided it is the true master
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-2] DETAIL: re-initializing the local watchdog cluster state because of split-brain
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-3] LOCATION: watchdog.c:3682
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-1] LOG: watchdog node state changed from [MASTER] to [JOINING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-1] LOG: watchdog: de-escalation started
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-2] LOCATION: wd_escalation.c:181
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-1] LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 su: (to postgres) root on none
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-1] LOG: successfully released the delegate IP:"192.168.52.96"
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-2] DETAIL: 'if_down_cmd' returned with success
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-3] LOCATION: wd_if.c:208
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-1] LOG: watchdog de-escalation process with pid: 22991 exit with SUCCESS.
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-2] LOCATION: watchdog.c:3053
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-1] LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-2] LOCATION: watchdog.c:6313
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-1] LOG: successfully joined the watchdog cluster as standby node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-2] DETAIL: our join coordinator request is accepted by cluster leader node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone"
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-3] LOCATION: watchdog.c:6107
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-1] LOG: new IPC connection received
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-2] LOCATION: watchdog.c:3210
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-1] LOG: we have joined the watchdog cluster as STANDBY node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-2] DETAIL: syncing the backend states from the MASTER watchdog node





Also, please pay attention to messages like:
- Apr 25 18:39:53 OAB-4f0e8fadf6f0 pgpool[8063]: [7535-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket

There are a lot of such messages that follow the command 'Apr 25 18:38:02 OAB-4f0e8fadf6f0 pgpool[8063]: [7107-1] LOG: execute command: sudo /usr/local/pem/bin/pgha/pghactl.py do-failover --new-master-node-id=1 --new-master-host=b.db.node'

The problem is that 'b.db.node' is a synonym for 'OAB-4f0e8fadf6f0', and its Postgres instance was promoted to master by the mentioned failover command. Yet the message states it is unavailable. DB clients also face problems with DB connections at the same time.

supp_k

2017-04-26 01:01

reporter  

pgpool_fail_voting.tar.gz (10,926,080 bytes)

supp_k

2017-04-26 01:02

reporter   ~0001471

Attaching logs..

pgpool_fail_voting.tar-2.gz (10,926,080 bytes)

supp_k

2017-04-27 02:20

reporter   ~0001476

Hi Muhammad,

Did the logs help you? Can you see the problem?

supp_k

2017-04-27 04:06

reporter   ~0001477

I could also observe a case where a standby Pgpool stole the master status from the acting master Pgpool after the network was restored.

Obviously this is because the standby was self-promoted to master while the network was split. I think it can logically be solved by not performing voting when a quorum does not exist for a single node.

supp_k

2017-04-27 06:08

reporter   ~0001478

Muhammad,

please see the attached file "no_failover.log".
In "23:55:53" the Postgres on DB node B was killed (-9) and it was supposed that Pgpool should be initiate failover command. But this didn't happen (((
The log collected from Master Pgpool (pgpool-II version 3.6.2 (subaruboshi)) which was built from HEAD of V3_6_STABLE branch.

supp_k

2017-04-27 06:08

reporter   ~0001479

no_failover.log

no_failover.log (455,354 bytes)

Muhammad Usama

2017-04-28 00:44

developer   ~0001481

Hi,

The Pgpool-II logs "pgpool_fail_voting.tar" you sent earlier do not contain the correct pgpool log for head_pgpool, so it was hard to say what was happening in that case, but I found a potential issue in the quorum calculation which can lead to the same behaviour.

Can you please try the attached wd_quorum.diff patch against the current 3.6 head and check whether it fixes the issue of rightful master node selection after the cluster recovers from the split-brain? Note that it is a WIP patch intended to confirm or rule out the problem, and it needs to be applied on all Pgpool-II nodes participating in the test.

Meanwhile I am also looking into the other problem where the failover was not performed.

Thanks

wd_quorum.diff (6,597 bytes)
diff --git a/src/include/watchdog/watchdog.h b/src/include/watchdog/watchdog.h
index 63604217..727c71c2 100644
--- a/src/include/watchdog/watchdog.h
+++ b/src/include/watchdog/watchdog.h
@@ -84,6 +84,8 @@ typedef enum {
 	WD_EVENT_REMOTE_NODE_FOUND,
 	WD_EVENT_LOCAL_NODE_FOUND,
 
+	WD_EVENT_WD_REMOTE_NODE_STATE_CHANGED,
+
 	WD_EVENT_NODE_CON_LOST,
 	WD_EVENT_NODE_CON_FOUND
 
diff --git a/src/watchdog/watchdog.c b/src/watchdog/watchdog.c
index 4156329f..e1e152af 100644
--- a/src/watchdog/watchdog.c
+++ b/src/watchdog/watchdog.c
@@ -171,6 +171,7 @@ char *wd_event_name[] =
 	"REMOTE NODE LOST",
 	"REMOTE NODE FOUND",
 	"THIS NODE FOUND",
+	"REMOTE NODE STATE_CHANGED",
 	"NODE CONNECTION LOST",
 	"NODE CONNECTION FOUND"
 };
@@ -3842,6 +3843,8 @@ static int standard_packet_processor(WatchdogNode* wdNode, WDPacketData* pkt)
 		case WD_INFO_MESSAGE:
 		{
 			char *authkey = NULL;
+			bool nodd_state_changed = false;
+
 			WatchdogNode* tempNode = parse_node_info_message(pkt, &authkey);
 			if (tempNode == NULL)
 			{
@@ -3850,6 +3853,7 @@ static int standard_packet_processor(WatchdogNode* wdNode, WDPacketData* pkt)
 				send_cluster_service_message(wdNode,pkt,CLUSTER_NODE_INVALID_VERSION);
 				break;
 			}
+			nodd_state_changed = (wdNode->state != tempNode->state);
 			wdNode->state = tempNode->state;
 			wdNode->startup_time.tv_sec = tempNode->startup_time.tv_sec;
 			wdNode->wd_priority = tempNode->wd_priority;
@@ -3921,6 +3925,9 @@ static int standard_packet_processor(WatchdogNode* wdNode, WDPacketData* pkt)
 				}
 			}
 
+			if (nodd_state_changed)
+				watchdog_state_machine(WD_EVENT_WD_REMOTE_NODE_STATE_CHANGED, wdNode, pkt, NULL);
+
 			pfree(tempNode);
 		}
 			break;
@@ -4525,6 +4532,25 @@ static int update_connected_node_count(void)
 	return g_cluster.aliveNodeCount;
 }
 
+/*
+ * The function only considers the node state.
+ * All node states count towards the cluster participating nodes
+ * except the dead and lost nodes.
+ */
+static int get_cluster_node_count(void)
+{
+	int i;
+	int count = 0;
+	for (i = 0; i< g_cluster.remoteNodeCount; i++)
+	{
+		WatchdogNode* wdNode = &(g_cluster.remoteNodes[i]);
+		if (wdNode->state == WD_DEAD || wdNode->state == WD_LOST || wdNode->state == WD_SHUTDOWN)
+			continue;
+		count++;
+	}
+	return count;
+}
+
 
 static bool service_lost_connections(void)
 {
@@ -4559,24 +4585,6 @@ static bool service_lost_connections(void)
 	return ret;
 }
 
-/*
- * The function only considers the node state.
- * All node states count towards the cluster participating nodes
- * except the dead and lost nodes.
- */
-static int get_cluster_node_count(void)
-{
-	int i;
-	int count = 0;
-	for (i = 0; i< g_cluster.remoteNodeCount; i++)
-	{
-		WatchdogNode* wdNode = &(g_cluster.remoteNodes[i]);
-		if (wdNode->state == WD_DEAD || wdNode->state == WD_LOST || wdNode->state == WD_SHUTDOWN)
-			continue;
-		count++;
-	}
-	return count;
-}
 
 static WDPacketData* get_message_of_type(char type, WDPacketData* replyFor)
 {
@@ -5541,6 +5549,8 @@ static int watchdog_state_machine_coordinator(WD_EVENTS event, WatchdogNode* wdN
 			set_timeout(BEACON_MESSAGE_INTERVAL_SECONDS);
 			break;
 
+		case WD_EVENT_WD_REMOTE_NODE_STATE_CHANGED:
+		case WD_EVENT_REMOTE_NODE_FOUND:
 		case WD_EVENT_REMOTE_NODE_LOST:
 		{
 			/*
@@ -5550,30 +5560,25 @@ static int watchdog_state_machine_coordinator(WD_EVENTS event, WatchdogNode* wdN
 			update_quorum_status();
 			if (g_cluster.quorum_status == -1)
 			{
-				ereport(LOG,
-						(errmsg("We have lost the quorum after loosing \"%s\"",wdNode->nodeName)));
-				/*
-				 * We have lost the quorum, stay as a master node but
-				 * perform de-escalation. As keeping the VIP may result in
-				 * split-brain
-				 */
-				resign_from_escalated_node();
+				if (g_cluster.escalated == true)
+				{
+					ereport(LOG,
+							(errmsg("We have lost the quorum after loosing \"%s\"",wdNode->nodeName)));
+					/*
+					 * We have lost the quorum, stay as a master node but
+					 * perform de-escalation. As keeping the VIP may result in
+					 * split-brain
+					 */
+					resign_from_escalated_node();
+				}
 			}
-			else
-				ereport(DEBUG1,
-						(errmsg("We have lost the node \"%s\" but quorum still holds",wdNode->nodeName)));
-		}
-			break;
-
-		case WD_EVENT_REMOTE_NODE_FOUND:
-		{
-			update_quorum_status();
-			if (g_cluster.escalated == false)
+			else /* g_cluster.quorum_status >= 0 */
 			{
-				if (g_cluster.quorum_status >= 0)
+				/* Quorum is present now, escalate the node if not already */
+				if (g_cluster.escalated == false)
 				{
 					ereport(LOG,
-						(errmsg("quorum is complete after node \"%s\" joined the cluster",wdNode->nodeName),
+							(errmsg("quorum is complete after node \"%s\" joined the cluster",wdNode->nodeName),
 							 errdetail("starting escalation process")));
 					start_escalated_node();
 				}
@@ -5819,7 +5824,7 @@ static int I_am_master_and_cluser_in_split_brain(WatchdogNode* otherMasterNode)
 	/* Decide which node should stay as master */
 	if (otherMasterNode->escalated != g_cluster.escalated)
 	{
-		if (otherMasterNode->escalated != 0 && g_cluster.escalated == false)
+		if (otherMasterNode->escalated == true && g_cluster.escalated == false)
 		{
 			/* remote node stays as the master */
 			ereport(LOG,
@@ -5937,7 +5942,6 @@ static void handle_split_brain(WatchdogNode* otherMasterNode, WDPacketData* pkt)
 				 errdetail("asking the remote node \"%s\" to step down",otherMasterNode->nodeName)));
 		send_cluster_service_message(otherMasterNode,pkt,CLUSTER_IAM_TRUE_MASTER);
 	}
-
 }
 
 static void start_escalated_node(void)
@@ -6257,11 +6261,26 @@ static int watchdog_state_machine_standby(WD_EVENTS event, WatchdogNode* wdNode,
  */
 static int update_quorum_status(void)
 {
-	if ( get_cluster_node_count() > get_mimimum_nodes_required_for_quorum())
+	int i;
+	int cluster_nodes_count = 0;
+	for (i = 0; i< g_cluster.remoteNodeCount; i++)
+	{
+		WatchdogNode* wdNode = &(g_cluster.remoteNodes[i]);
+		if (wdNode->state == WD_STANDBY)
+			cluster_nodes_count++;
+		if (wdNode->state == WD_COORDINATOR)
+		{
+			/* only count master node if it is master as per our record */
+			if (g_cluster.masterNode == wdNode)
+				cluster_nodes_count++;
+		}
+	}
+
+	if ( cluster_nodes_count > get_mimimum_nodes_required_for_quorum())
 	{
 		g_cluster.quorum_status = 1;
 	}
-	else if ( get_cluster_node_count() == get_mimimum_nodes_required_for_quorum())
+	else if ( cluster_nodes_count == get_mimimum_nodes_required_for_quorum())
 	{
 		if (g_cluster.remoteNodeCount % 2 != 0)
 			g_cluster.quorum_status = 0; /* on the edge */
wd_quorum.diff (6,597 bytes)

supp_k

2017-04-28 01:03

reporter   ~0001482

Muhammad,
I'll schedule automatic tests right now. Results will not come quickly, because the problem is not reproduced reliably.

Muhammad Usama

2017-04-28 01:28

developer   ~0001483

Sure, no problem, and many thanks for the testing.

supp_k

2017-04-28 01:50

reporter   ~0001484

After 5 iterations of ifdown/ifup actions the problem was reproduced.
Please see the attached log from the former master. It was demoted to standby at 19:40:23, as can be seen from the log, and later it was self-promoted to master when the network was restored.

I need to reconfigure the environment... When it is done I'll share the logs from the 3 Pgpool instances, and you will be able to observe the "complete picture" for another case.

log_failed_master.log (52,958 bytes)

supp_k

2017-04-28 05:22

reporter   ~0001485

Muhammad,

I'm not sure why, but I could reproduce the error only after attaching the PG backends to the Pgpool cluster. Before that I was unable to reproduce the problem for a long period with your fresh patch!

A clean reproduction of the problem is attached: logs + configuration files. The names of the files within the attached tarball are self-explanatory.
=======================================================

Environment before the "network split":
1) Head Pgpool
2) DB Node A: Pgpool + PG
3) DB Node B: Pgpool (Master) + PG

The "ifdown" was initiated in DB Node B at "22:58:50". Right after that the Pgpool in DB Node A was promoted into Master - this OK.

A few seconds later the network was restored (ifup) on DB node B, and now the logs confirm that Pgpool A was demoted and Pgpool B has taken back the master status.

pgpool_wd_issue.tar.gz (1,382,297 bytes)

Muhammad Usama

2017-05-04 01:00

developer   ~0001491

Hi Supp_K,

I have created a patch for the second part of the problem, where the failover was not performed by Pgpool-II. Can you please try the attached patch and check whether it solves that problem?

wd_failover_not_performed.diff (16,565 bytes)
diff --git a/src/include/pool.h b/src/include/pool.h
index 4983be8c..d5dd63aa 100644
--- a/src/include/pool.h
+++ b/src/include/pool.h
@@ -550,10 +550,10 @@ extern void NoticeResponse(POOL_CONNECTION *frontend,
 								  POOL_CONNECTION_POOL *backend);
 
 extern void notice_backend_error(int node_id, bool switch_over);
-extern void degenerate_backend_set(int *node_id_set, int count, bool switch_over, unsigned int wd_failover_id);
+extern bool degenerate_backend_set(int *node_id_set, int count, bool switch_over, unsigned int wd_failover_id);
 extern bool degenerate_backend_set_ex(int *node_id_set, int count, bool error, bool test_only, bool switch_over, unsigned int wd_failover_id);
-extern void promote_backend(int node_id, unsigned int wd_failover_id);
-extern void send_failback_request(int node_id, bool throw_error, unsigned int wd_failover_id);
+extern bool promote_backend(int node_id, unsigned int wd_failover_id);
+extern bool send_failback_request(int node_id, bool throw_error, unsigned int wd_failover_id);
 
 
 extern void pool_set_timeout(int timeoutval);
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index df71128a..5e47c2d9 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -1089,20 +1089,20 @@ bool degenerate_backend_set_ex(int *node_id_set, int count, bool error, bool tes
  * wrapper over degenerate_backend_set_ex function to register
  * NODE down operation request
  */
-void degenerate_backend_set(int *node_id_set, int count, bool switch_over, unsigned int wd_failover_id)
+bool degenerate_backend_set(int *node_id_set, int count, bool switch_over, unsigned int wd_failover_id)
 {
-	degenerate_backend_set_ex(node_id_set, count, false, false, switch_over, wd_failover_id);
+	return degenerate_backend_set_ex(node_id_set, count, false, false, switch_over, wd_failover_id);
 }
 
 /* send promote node request using SIGUSR1 */
-void promote_backend(int node_id, unsigned int wd_failover_id)
+bool promote_backend(int node_id, unsigned int wd_failover_id)
 {
 	WDFailoverCMDResults res = FAILOVER_RES_PROCEED;
-
+	bool ret = false;
 
 	if (!MASTER_SLAVE || pool_config->master_slave_sub_mode != STREAM_MODE)
 	{
-		return;
+		return false;
 	}
 
 	if (node_id < 0 || node_id >= MAX_NUM_BACKENDS || !VALID_BACKEND(node_id))
@@ -1115,7 +1115,7 @@ void promote_backend(int node_id, unsigned int wd_failover_id)
 			ereport(LOG,
 					(errmsg("invalid promote backend request, node id : %d status: [%d] not valid"
 							,node_id,BACKEND_INFO(node_id).backend_status)));
-		return;
+		return false;
 	}
 	ereport(LOG,
 			(errmsg("received promote backend request for node_id: %d from pid [%d]",
@@ -1146,7 +1146,7 @@ void promote_backend(int node_id, unsigned int wd_failover_id)
 
 	if (res == FAILOVER_RES_PROCEED)
 	{
-		register_node_operation_request(PROMOTE_NODE_REQUEST, &node_id, 1, false, wd_failover_id);
+		ret = register_node_operation_request(PROMOTE_NODE_REQUEST, &node_id, 1, false, wd_failover_id);
 	}
 	else if (res == FAILOVER_RES_WILL_BE_DONE)
 	{
@@ -1160,12 +1160,14 @@ void promote_backend(int node_id, unsigned int wd_failover_id)
 				(errmsg("promote backend request for node_id: %d from pid [%d] is canceled  by other pgpool"
 						, node_id, getpid())));
 	}
+	return ret;
 }
 
 /* send failback request using SIGUSR1 */
-void send_failback_request(int node_id,bool throw_error, unsigned int wd_failover_id)
+bool send_failback_request(int node_id,bool throw_error, unsigned int wd_failover_id)
 {
 	WDFailoverCMDResults res = FAILOVER_RES_PROCEED;
+	bool ret = false;
 
     if (node_id < 0 || node_id >= MAX_NUM_BACKENDS ||
 		(RAW_MODE && BACKEND_INFO(node_id).backend_status != CON_DOWN && VALID_BACKEND(node_id)))
@@ -1178,7 +1180,7 @@ void send_failback_request(int node_id,bool throw_error, unsigned int wd_failove
 			ereport(throw_error?ERROR:LOG,
 					(errmsg("invalid failback request, node id : %d status: [%d] not valid for failback"
 							,node_id,BACKEND_INFO(node_id).backend_status)));
-		return;
+		return false;
 	}
 
 	ereport(LOG,
@@ -1210,7 +1212,7 @@ void send_failback_request(int node_id,bool throw_error, unsigned int wd_failove
 
 	if (res == FAILOVER_RES_PROCEED)
 	{
-		register_node_operation_request(NODE_UP_REQUEST, &node_id, 1, false, wd_failover_id);
+		ret = register_node_operation_request(NODE_UP_REQUEST, &node_id, 1, false, wd_failover_id);
 	}
 	else if (res == FAILOVER_RES_WILL_BE_DONE)
 	{
@@ -1218,13 +1220,13 @@ void send_failback_request(int node_id,bool throw_error, unsigned int wd_failove
 				(errmsg("failback request for node_id: %d from pid [%d], will be handled by watchdog"
 						, node_id, getpid())));
 	}
-
 	else
 	{
 		ereport(throw_error?ERROR:LOG,
 				(errmsg("failback request for node_id: %d from pid [%d] is canceled  by other pgpool"
 						, node_id, getpid())));
 	}
+	return ret;
 }
 
 static RETSIGTYPE exit_handler(int sig)
diff --git a/src/watchdog/watchdog.c b/src/watchdog/watchdog.c
index 4156329f..821ef2bf 100644
--- a/src/watchdog/watchdog.c
+++ b/src/watchdog/watchdog.c
@@ -82,6 +82,9 @@ typedef enum IPC_CMD_PREOCESS_RES
 #define	MAX_SECS_WAIT_FOR_REPLY_FROM_NODE	5  /* time in seconds to wait for the reply from
 												* remote watchdog node
 												*/
+#define	MAX_SECS_WAIT_FOR_NODE_FAILOVER_TIME 10  /* time in seconds to wait for the Pgpool-II to finish
+												 * performing failover of backend node
+												 */
 
 
 
@@ -333,6 +336,9 @@ static WDFailoverObject* get_failover_object(POOL_REQUEST_KIND reqKind, int node
 static WDFailoverObject* get_failover_object_by_id(unsigned int failoverID);
 static bool does_int_array_contains_value(int *intArray, int count, int value);
 static bool remove_failover_object_by_id(unsigned int failoverID);
+static void remove_failovers_from_node(WatchdogNode* wdNode);
+static void remove_failover_object(WDFailoverObject* failoverObj);
+static void service_expired_failovers(void);
 
 static int send_command_packet_to_remote_nodes(WDCommandData* ipcCommand, bool source_included);
 static void wd_command_is_complete(WDCommandData* ipcCommand);
@@ -468,7 +474,7 @@ static bool write_ipc_command_with_result_data(WDCommandData* ipcCommand, char t
 static void process_wd_func_commands_for_timer_events(void);
 static void add_wd_command_for_timer_events(unsigned int expire_secs, bool need_tics, WDFunctionCommandData* wd_func_command);
 static bool reply_is_received_for_pgpool_replicate_command(WatchdogNode* wdNode, WDPacketData* pkt, WDCommandData* ipcCommand);
-static void process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt, char* func_name, int node_count, int* node_id_list, unsigned int failover_id);
+static bool process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt, char* func_name, int node_count, int* node_id_list, unsigned int failover_id);
 static void process_pgpool_remote_failover_command(WatchdogNode* wdNode, WDPacketData* pkt);
 static void process_remote_online_recovery_command(WatchdogNode* wdNode, WDPacketData* pkt);
 
@@ -1135,7 +1141,7 @@ watchdog_main(void)
 		}
 
 		check_for_current_command_timeout();
-		
+
 		if (service_lost_connections() == true)
 		{
 			service_internal_command();
@@ -1145,6 +1151,8 @@ watchdog_main(void)
 		service_unreachable_nodes();
 
 		update_connected_node_count();
+
+		service_expired_failovers();
 	}
 	return 0;
 }
@@ -2007,21 +2015,91 @@ static WDFailoverObject* get_failover_object_by_id(unsigned int failoverID)
 	}
 	return NULL;
 }
+
+static void remove_failover_object(WDFailoverObject* failoverObj)
+{
+	ereport(DEBUG1,
+			(errmsg("removing failover object from \"%s\" with ID:%d", failoverObj->wdRequestingNode->nodeName,failoverObj->failoverID)));
+	g_cluster.wdCurrentFailovers = list_delete_ptr(g_cluster.wdCurrentFailovers,failoverObj);
+	pfree(failoverObj->nodeList);
+	pfree(failoverObj);
+}
+
 static bool remove_failover_object_by_id(unsigned int failoverID)
 {
 	WDFailoverObject* failoverObj = get_failover_object_by_id(failoverID);
 	if (failoverObj)
 	{
-		ereport(DEBUG2,
-				(errmsg("removing failover object with ID:%d",failoverID)));
-		g_cluster.wdCurrentFailovers = list_delete_ptr(g_cluster.wdCurrentFailovers,failoverObj);
-		pfree(failoverObj->nodeList);
-		pfree(failoverObj);
+		remove_failover_object(failoverObj);
 		return true;
 	}
 	return false;
 }
 
+/* if the wdNode is NULL. The function removes all failover objects */
+static void remove_failovers_from_node(WatchdogNode* wdNode)
+{
+	ListCell *lc;
+	List *failovers_to_del = NULL;
+
+	foreach(lc, g_cluster.wdCurrentFailovers)
+	{
+		WDFailoverObject* failoverObj = lfirst(lc);
+		if (failoverObj)
+		{
+			if (wdNode == NULL || failoverObj->wdRequestingNode == wdNode)
+			{
+				failovers_to_del = lappend(failovers_to_del,failoverObj);
+			}
+		}
+	}
+
+	/* delete the failover objects */
+
+	foreach(lc, failovers_to_del)
+	{
+		WDFailoverObject* failoverObj = lfirst(lc);
+		remove_failover_object(failoverObj);
+	}
+}
+
+/* if the wdNode is NULL. The function removes all failover objects */
+static void service_expired_failovers(void)
+{
+	ListCell *lc;
+	List *failovers_to_del = NULL;
+	struct timeval currTime;
+
+	if (get_local_node_state() != WD_COORDINATOR)
+		return;
+
+	gettimeofday(&currTime,NULL);
+
+	foreach(lc, g_cluster.wdCurrentFailovers)
+	{
+		WDFailoverObject* failoverObj = lfirst(lc);
+		if (failoverObj)
+		{
+			if (WD_TIME_DIFF_SEC(currTime,failoverObj->startTime) >=  MAX_SECS_WAIT_FOR_NODE_FAILOVER_TIME)
+			{
+				failovers_to_del = lappend(failovers_to_del,failoverObj);
+				ereport(DEBUG1,
+					(errmsg("failover object from \"%s\" with ID:%d is timeout", failoverObj->wdRequestingNode->nodeName,failoverObj->failoverID),
+						 errdetail("adding the failover object for removal")));
+
+			}
+		}
+	}
+
+	/* delete the failover objects */
+
+	foreach(lc, failovers_to_del)
+	{
+		WDFailoverObject* failoverObj = lfirst(lc);
+		remove_failover_object(failoverObj);
+	}
+}
+
 static bool does_int_array_contains_value(int *intArray, int count, int value)
 {
 	int i;
@@ -2242,6 +2320,7 @@ static IPC_CMD_PREOCESS_RES process_failover_command_on_coordinator(WDCommandDat
 	}
 	else
 	{
+		/* No similar failover is in progress */
 		MemoryContext oldCxt;
 		ereport(DEBUG1,
 				(errmsg("proceeding with the failover command [%s] request from %s",
@@ -2254,10 +2333,28 @@ static IPC_CMD_PREOCESS_RES process_failover_command_on_coordinator(WDCommandDat
 		 */
 		wd_packet_shallow_copy(&ipcCommand->sourcePacket, &ipcCommand->commandPacket);
 		ipcCommand->commandPacket.type = WD_REMOTE_FAILOVER_REQUEST;
+		ipcCommand->sendToNode = NULL; /* command needs to be sent to all nodes */
 		set_next_commandID_in_message(&ipcCommand->commandPacket);
 
+		if (ipcCommand->commandSource != COMMAND_SOURCE_IPC)
+		{
+			if (process_wd_command_function(ipcCommand->sourceWdNode, &ipcCommand->sourcePacket,
+										func_name, node_count, node_id_list, ipcCommand->commandPacket.command_id) == false)
+			{
+				return IPC_CMD_COMPLETE;
+			}
+		}
+
+		/* send to all alive nodes */
+		ereport(LOG,
+			(errmsg("forwarding the failover request [%s] to all alive nodes",func_name),
+				 errdetail("watchdog cluster currently has %d connected remote nodes",get_cluster_node_count())));
+		send_command_packet_to_remote_nodes(ipcCommand, false);
+
+		/* create a failover object. to make sure we know the node failovers
+		 * is already in progress
+		 */
 		oldCxt = MemoryContextSwitchTo(TopMemoryContext);
-		/* No similar failover is in progress */
 		failoverObj = palloc0(sizeof(WDFailoverObject));
 		failoverObj->reqKind = reqKind;
 		failoverObj->nodesCount = node_count;
@@ -2268,20 +2365,10 @@ static IPC_CMD_PREOCESS_RES process_failover_command_on_coordinator(WDCommandDat
 		}
 		failoverObj->failoverID = ipcCommand->commandPacket.command_id; /* use command id as failover id */
 		gettimeofday(&failoverObj->startTime, NULL);
-		failoverObj->wdRequestingNode = g_cluster.localNode;
+		failoverObj->wdRequestingNode = ipcCommand->sourceWdNode;
 		g_cluster.wdCurrentFailovers = lappend(g_cluster.wdCurrentFailovers,failoverObj);
 
 		MemoryContextSwitchTo(oldCxt);
-		/* We may also need to send the Accept message here for remote node */
-
-		ipcCommand->sendToNode = NULL; /* command needs to be sent to all nodes */
-
-		ereport(LOG,
-			(errmsg("forwarding the failover request [%s] to all alive nodes",func_name),
-				 errdetail("watchdog cluster currently has %d connected remote nodes",get_cluster_node_count())));
-
-		/* see if there is any node we want to send to */
-		send_command_packet_to_remote_nodes(ipcCommand, false);
 
 		/* For a moment just think it is successfully sent to all nodes.*/
 		if (ipcCommand->commandSource == COMMAND_SOURCE_IPC)
@@ -2291,9 +2378,7 @@ static IPC_CMD_PREOCESS_RES process_failover_command_on_coordinator(WDCommandDat
 		}
 		else
 		{
-			process_wd_command_function(ipcCommand->sourceWdNode, &ipcCommand->sourcePacket,
-										func_name, node_count, node_id_list, failoverObj->failoverID);
-			if (get_cluster_node_count() == 1)
+			if (get_cluster_node_count() <= 1)
 			{
 				/* Since its just 2 nodes cluster, and the only other
 				 * node is the one that actually issued the failover
@@ -4878,7 +4963,6 @@ static int watchdog_state_machine(WD_EVENTS event, WatchdogNode* wdNode, WDPacke
 		wdNode->last_sent_time.tv_sec = 0;
 		wdNode->last_sent_time.tv_usec = 0;
 		node_lost_while_ipc_command(wdNode);
-
 	}
 	else if (event == WD_EVENT_PACKET_RCV)
 	{
@@ -5562,6 +5646,9 @@ static int watchdog_state_machine_coordinator(WD_EVENTS event, WatchdogNode* wdN
 			else
 				ereport(DEBUG1,
 						(errmsg("We have lost the node \"%s\" but quorum still holds",wdNode->nodeName)));
+
+			remove_failovers_from_node(wdNode);
+
 		}
 			break;
 
@@ -6307,8 +6394,10 @@ static int set_state(WD_STATES newState)
 		gettimeofday(&g_cluster.localNode->current_state_time, NULL);
 		/* if we changing from the coordinator state, do the de-escalation if required */
 		if (oldState == WD_COORDINATOR)
+		{
 			resign_from_escalated_node();
-
+			remove_failovers_from_node(NULL);
+		}
 		ereport(LOG,
 				(errmsg("watchdog node state changed from [%s] to [%s]",wd_state_names[oldState],wd_state_names[newState])));
 		watchdog_state_machine(WD_EVENT_WD_STATE_CHANGED, NULL, NULL, NULL);
@@ -6463,9 +6552,10 @@ static void process_remote_online_recovery_command(WatchdogNode* wdNode, WDPacke
 		pfree(node_id_list);
 }
 
-static void process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt, char* func_name,
+static bool process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt, char* func_name,
 										int node_count, int* node_id_list, unsigned int failover_id)
 {
+	bool ret = false;
 	if (strcasecmp(WD_FUNCTION_FAILBACK_REQUEST, func_name) == 0)
 	{
 		if (Req_info->switching)
@@ -6477,8 +6567,7 @@ static void process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt,
 		}
 		else
 		{
-			reply_with_minimal_message(wdNode, WD_ACCEPT_MESSAGE, pkt);
-			send_failback_request(node_id_list[0],false, failover_id);
+			ret = send_failback_request(node_id_list[0],false, failover_id);
 		}
 	}
 	
@@ -6489,12 +6578,10 @@ static void process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt,
 			ereport(LOG,
 					(errmsg("sending watchdog response"),
 					 errdetail("failover request from other pgpool is canceled because of switching")));
-			reply_with_minimal_message(wdNode, WD_REJECT_MESSAGE, pkt);
 		}
 		else
 		{
-			reply_with_minimal_message(wdNode, WD_ACCEPT_MESSAGE, pkt);
-			degenerate_backend_set(node_id_list, node_count, false, failover_id);
+			ret = degenerate_backend_set(node_id_list, node_count, false, failover_id);
 		}
 	}
 
@@ -6505,19 +6592,22 @@ static void process_wd_command_function(WatchdogNode* wdNode, WDPacketData* pkt,
 			ereport(LOG,
 					(errmsg("sending watchdog response"),
 					 errdetail("failover request from other pgpool is canceled because of switching")));
-			reply_with_minimal_message(wdNode, WD_REJECT_MESSAGE, pkt);
 		}
 		else
 		{
-			reply_with_minimal_message(wdNode, WD_ACCEPT_MESSAGE, pkt);
-			promote_backend(node_id_list[0], failover_id);
+			ret = promote_backend(node_id_list[0], failover_id);
 		}
 	}
 	else
 	{
 		/* This is not supported function */
-		reply_with_minimal_message(wdNode, WD_ERROR_MESSAGE, pkt);
+		ereport(LOG,
+			(errmsg("invalid failover function \"%s\"",func_name)));
 	}
+
+	reply_with_minimal_message(wdNode, ret?WD_ACCEPT_MESSAGE:WD_REJECT_MESSAGE, pkt);
+
+	return ret;
 }
 
 

supp_k

2017-05-05 00:39

reporter   ~0001492

Hi Muhammad,

I applied the fix to the HEAD of the 3_6_STABLE branch. Results will be provided later because I want to test it carefully.

supp_k

2017-05-15 18:38

reporter   ~0001508

Hi Muhammad,
Sorry for the long delay from my side. I'm not sure whether the patch is OK, but I see lots of changes in the repository. Do they include the patch you provided?
I think it is better if I continue verifying the latest HEAD of the V3_6_STABLE branch.

Muhammad Usama

2017-05-16 20:00

developer   ~0001511

Hi Supp_K,
Yes, the changes in the patch are already committed to the repository. You can use the HEAD of the V3_6_STABLE branch to verify the fix.

Thanks

supp_k

2017-05-16 22:17

reporter   ~0001512

Muhammad,
the previous standalone fix and the latest HEAD of the V3_6_STABLE branch work OK. In my test cases I could observe stable failover triggering.
But I still observe the primary problem of incorrect split-brain resolution when a disconnected master connects again to the set of Pgpools that have chosen a new master.

supp_k

2017-05-16 22:24

reporter   ~0001513

I also observe a case where the split-brain never ends. It happens when the network-disconnected master Pgpool instance connects again to the set of Pgpools.

I see the pcp_watchdog_info output:
[root@OAB-7976adbf5291 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone

OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone 5432 9000 4 MASTER
OAA-7976adbf5291.aqa.int.zone:5432 Linux OAA-7976adbf5291.aqa.int.zone 192.168.83.16 5432 9000 8 LOST
POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone 192.168.83.180 5432 9000 2 JOINING


It never ends. Please see the attached pgpool log of the former master, which has reconnected.

supp_k

2017-05-16 22:25

reporter   ~0001514

Never ending voting

restored_pgpool_b.log (3,727,982 bytes)

Muhammad Usama

2017-05-17 03:26

developer   ~0001515

Thanks for the input I am looking into this one.

Muhammad Usama

2017-05-17 21:58

developer   ~0001517

Hi Supp_k,

Is it possible for you to share the Pgpool-II log for the "POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone" node as well?

supp_k

2017-05-19 00:35

reporter   ~0001519

Hi Muhammad,

I have failed to reproduce the endless split-brain resolution case. But anyway, please find attached fresh logs of the incorrect split-brain resolution case.

split_brain_pgpool.tar.gz (183,073 bytes)

supp_k

2017-05-23 00:46

reporter   ~0001522

Hi Muhammad,

with the help of the attached logs, have you reproduced the issue with incorrect split-brain resolution on your side? Do you need any additional help?

Muhammad Usama

2017-05-30 22:22

developer   ~0001525

Hi sup_k,

Thanks for your help, and yes, I was able to reproduce the issue once; it happens occasionally.
I have made some adjustments to how a node handles and detects the split-brain, and so far the tests are passing on my setup. Can you please also try out the attached patch file bug_306.diff to see if it solves the issue?

Also, while testing the scenario, please collect and share the complete Pgpool-II logs for all Pgpool-II nodes if possible.

Thanks
Best regards
Muhammad usama

bug_306.diff (4,365 bytes)
diff --git a/src/watchdog/watchdog.c b/src/watchdog/watchdog.c
index 8b306474..59517340 100644
--- a/src/watchdog/watchdog.c
+++ b/src/watchdog/watchdog.c
@@ -3816,6 +3816,7 @@ static void cluster_service_message_processor(WatchdogNode* wdNode, WDPacketData
 
 		case CLUSTER_IN_SPLIT_BRAIN:
 		{
+			try_connecting_with_all_unreachable_nodes();
 			if (get_local_node_state() == WD_COORDINATOR)
 			{
 				ereport(LOG,
@@ -3974,7 +3975,7 @@ static int standard_packet_processor(WatchdogNode* wdNode, WDPacketData* pkt)
 				}
 				else if (WD_MASTER_NODE != wdNode)
 				{
-					ereport(WARNING,
+					ereport(LOG,
 							(errmsg("\"%s\" is the coordinator as per our record but \"%s\" is also announcing as a coordinator",
 									WD_MASTER_NODE->nodeName, wdNode->nodeName),
 							 errdetail("cluster is in the split-brain")));
@@ -3986,11 +3987,6 @@ static int standard_packet_processor(WatchdogNode* wdNode, WDPacketData* pkt)
 						 */
 
 						send_cluster_service_message(NULL,pkt,CLUSTER_IN_SPLIT_BRAIN);
-
-						ereport(LOG,
-								(errmsg("re-initializing the local watchdog cluster state because of split-brain")));
-
-						set_state(WD_JOINING);
 					}
 					else
 					{
@@ -4054,11 +4050,11 @@ static int standard_packet_processor(WatchdogNode* wdNode, WDPacketData* pkt)
 			if (WD_MASTER_NODE != NULL && wdNode != WD_MASTER_NODE)
 			{
 				ereport(LOG,
-						(errmsg("re-initializing the local watchdog cluster state because of split-brain")));
+					(errmsg("\"%s\" is our coordinator node, but \"%s\" is also announcing as a coordinator",
+							WD_MASTER_NODE->nodeName, wdNode->nodeName),
+						 errdetail("broadcasting the cluster in split-brain message")));
 
 				send_cluster_service_message(NULL,pkt,CLUSTER_IN_SPLIT_BRAIN);
-				set_state(WD_JOINING);
-
 			}
 			else
 			{
@@ -4400,6 +4396,7 @@ static bool watchdog_internal_command_packet_processor(WatchdogNode* wdNode, WDP
 	int i;
 	WDCommandNodeResult* nodeResult = NULL;
 	WDCommandData* clusterCommand = get_wd_cluster_command_from_reply(pkt);
+
 	if (clusterCommand == NULL || clusterCommand->commandStatus != COMMAND_IN_PROGRESS)
 		return false;
 
@@ -5392,11 +5389,11 @@ static int watchdog_state_machine_standForCord(WD_EVENTS event, WatchdogNode* wd
 			}
 		}
 			break;
-			
+
 		case WD_EVENT_TIMEOUT:
 			set_state(WD_COORDINATOR);
 			break;
-			
+
 		case WD_EVENT_PACKET_RCV:
 		{
 			switch (pkt->type)
@@ -5559,13 +5556,14 @@ static int watchdog_state_machine_coordinator(WD_EVENTS event, WatchdogNode* wdN
 									   clusterCommand->commandReplyFromCount
 									   )));
 				}
-				else
+				else if (clusterCommand->commandStatus == COMMAND_FINISHED_NODE_REJECTED)
 				{
-					/* command is finished but because of an error */
-					ereport(WARNING,
-							(errmsg("possible split brain scenario detected by \"%s\" node", wdNode->nodeName),
-							 (errdetail("re-initializing cluster"))));
-					set_state(WD_JOINING);
+					/* one of the node rejected out I am coordinator message */
+					ereport(LOG,
+							(errmsg("possible split brain, \"%s\" node has rejected our coordinator beacon", wdNode->nodeName),
+							 (errdetail("removing the node from out standby list"))));
+
+					standby_node_left_cluster(wdNode);
 				}
 			}
 		}
@@ -6206,6 +6204,7 @@ static int watchdog_state_machine_standby(WD_EVENTS event, WatchdogNode* wdNode,
 				break;
 
 			case WD_DECLARE_COORDINATOR_MESSAGE:
+			{
 				if (wdNode != WD_MASTER_NODE)
 				{
 					/*
@@ -6221,6 +6220,29 @@ static int watchdog_state_machine_standby(WD_EVENTS event, WatchdogNode* wdNode,
 				}
 				break;
 
+				case WD_IAM_COORDINATOR_MESSAGE:
+				{
+					/*
+					 * if the message is received from coordinator reply with info,
+					 * otherwise reject
+					 */
+					if (wdNode != WD_MASTER_NODE)
+					{
+						ereport(LOG,
+								(errmsg("\"%s\" is our coordinator node, but \"%s\" is also announcing as a coordinator",
+										WD_MASTER_NODE->nodeName,wdNode->nodeName),
+								 errdetail("broadcasting the cluster in split-brain message")));
+
+						send_cluster_service_message(NULL,pkt,CLUSTER_IN_SPLIT_BRAIN);
+					}
+					else
+					{
+						send_message_of_type(wdNode, WD_INFO_MESSAGE, pkt);
+						beacon_message_received_from_node(wdNode, pkt);
+					}
+				}
+			}
+
 			default:
 				standard_packet_processor(wdNode, pkt);
 				break;
bug_306.diff (4,365 bytes)

supp_k

2017-05-31 03:52

reporter   ~0001527

Hi Muhammad,

what branch should I apply the fix to?

Muhammad Usama

2017-05-31 17:02

developer   ~0001528

3_6_STABLE or master. It should work with both

supp_k

2017-06-08 18:17

reporter   ~0001537

Environment:
1) Head Slave Pgpool
2) DB node A + Master Pgpool
3) DB node B + Slave Pgpool

Sequence of actions
- At 11:24:10 the network interface of DB node A was brought down (ifdown eth0).
- At Jun 8 11:24:11 the Pgpool on DB node B discovered that the master Pgpool had disappeared (see db_node_b_pgpool.log).
After that the election process started and finished within several seconds. But the Pgpool cluster was not providing any connections to DB clients. You can see a series of messages like:
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-2] DETAIL: health check timer expired
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-3] LOCATION: pool_connection_pool.c:567
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-1] ERROR: failed to make persistent db connection
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-2] DETAIL: connection to host:"b.db.node:15432" failed
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-3] LOCATION: child.c:1218
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-1] LOG: find_primary_node: checking backend no 0
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-1] LOG: find_primary_node: checking backend no 1
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [220-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket


- At Jun 8 11:27:20 the network interface of DB node A was restored (ifup eth0), and you can see that the split-brain was detected.
- The re-election process was initiated. After the election the new master Pgpool (DB node B) remained master, which is OK. But unfortunately the Pgpool cluster is not operational: it doesn't provide connections to DB clients (see the log fragment):

Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11970]: [38690-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-1] LOG: child process with pid: 11970 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-1] LOG: fork a new child process with pid: 12138
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-1] LOG: child process with pid: 11792 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-1] LOG: fork a new child process with pid: 12139
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39029-1] LOG: child process with pid:


All logs are attached.

pgpool_logs.tar.gz (3,863,657 bytes)

Muhammad Usama

2017-06-21 19:33

developer   ~0001541

Hi

Thanks for performing the test and providing the logs. I have analysed all the logs, and the master election after split-brain, as well as the time taken to recover from the split-brain, now seem to be working fine and as expected.
Now for the second part: after the split-brain was resolved, Pgpool-II was not accepting any connections because it couldn't find any valid backend node at that time.

Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool

This is not related to the watchdog; it happened because the Pgpool-II backend health check was not able to connect to the backends, so it failed over (detached) the backend nodes.

See the following log message extracts from db_node_b_pgpool.log

a.db.node failure logs
=================
22659 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-1] LOG: failed to connect to PostgreSQL server on "a.db.node:15432" using INET socket
22660 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-2] DETAIL: select() system call failed with an error "Interrupted system call"
22661 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-3] LOCATION: pool_connection_pool.c:699
22662 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-1] ERROR: failed to make persistent db connection
22663 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-2] DETAIL: connection to host:"a.db.node:15432" failed\
22664 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-3] LOCATION: child.c:1218

And b.db.node failure logs
======================
22785 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
22786 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-2] DETAIL: health check timer expired
22787 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-3] LOCATION: pool_connection_pool.c:567
22788 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-1] ERROR: failed to make persistent db connection
22789 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-2] DETAIL: connection to host:"b.db.node:15432" failed
22790 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-3] LOCATION: child.c:1218

So both nodes were detached from Pgpool-II because the health check advised it to do so (the health check was not able to connect to the backends). Once the failover of b.db.node had also been performed, Pgpool-II was not accepting any client connections, since it had no alive PG backend attached.
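
For context on the mechanism described above, here is a rough sketch of how a periodic health check that keeps failing ends up detaching a backend (illustrative only, not the actual pgpool-II health-check code; try_connect_backend() and detach_backend() are hypothetical helpers, and the loop only roughly mirrors the health_check_period / health_check_max_retries / health_check_retry_delay settings):

#include <stdbool.h>
#include <unistd.h>

extern bool try_connect_backend(int node_id);   /* hypothetical connection probe     */
extern void detach_backend(int node_id);        /* hypothetical failover/detach call */

static void health_check_loop(int node_id, int period_sec,
                              int max_retries, int retry_delay_sec)
{
    for (;;)
    {
        int attempt = 0;
        while (!try_connect_backend(node_id))
        {
            if (++attempt > max_retries)
            {
                /* retries exhausted: mark the node down and trigger failover,
                 * which is what left no alive backend in the logs above */
                detach_backend(node_id);
                return;
            }
            sleep(retry_delay_sec);
        }
        sleep(period_sec);   /* backend healthy: wait for the next check cycle */
    }
}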

Now, the reason why Pgpool-II was not able to connect to the alive PG nodes may need a separate investigation; it may have been caused by some network issue or some other reason, and the PostgreSQL log may be more helpful than the Pgpool-II log for doing the postmortem on that.

So, as a way forward, I am going to commit the patch (bug_306.diff, the one with which this test was performed), since it solves the originally reported problem, and close this bug report.
We can analyse the reason why the health check was not able to connect to the backend PG node in a separate thread.

Thanks
Best Regards
Muhammad Usama

supp_k

2017-08-14 20:51

reporter   ~0001650

Hi Muhammad,

I'm not sure whether this is correct, but I've detected another instance of the split-brain case.
Say we have an environment:
1) Server A:
- Head Pgpool
2) Server B:
- Pgpool (Master) + PostgreSQL (Master)
3) Server C:
- Pgpool + PostgreSQL (Sync Slave)

Sequence of actions:
- Perform "ifdown eth0" on Server B, and in a period of time Server C (it has larger wd_priority) promoted into Master and it is OK.
- From the perspective of Server C the PostgreSQL on Server B is marked as Down and it is OK
- From the perspective of Server B the PostgreSQL on Server C is marked as Down and it is OK
- Restore network on Server B by execution "ifup eth0";
   Server B rejoins the pgpool cluster as Slave - it is OK

Issue: both PostgreSQL instances are marked as down within the Pgpool cluster (as confirmed by pcp_node_info).

supp_k

2017-08-29 16:13

reporter   ~0001687

Hi Pengbo,

why have you closed the issue? The problem is not fixed yet, and we are still suffering from it.

Issue History

Date Modified Username Field Change
2017-04-25 02:07 supp_k New Issue
2017-04-25 03:03 supp_k Note Added: 0001458
2017-04-25 05:26 Muhammad Usama Note Added: 0001459
2017-04-25 05:56 supp_k Note Added: 0001460
2017-04-25 18:58 supp_k Note Added: 0001461
2017-04-25 21:42 supp_k Note Added: 0001463
2017-04-25 22:07 Muhammad Usama Note Added: 0001464
2017-04-25 22:10 supp_k Note Added: 0001465
2017-04-25 22:19 supp_k File Added: pgpool_conf.tar.gz
2017-04-25 22:19 supp_k Note Added: 0001467
2017-04-25 22:24 Muhammad Usama Note Added: 0001468
2017-04-26 00:20 supp_k Note Added: 0001469
2017-04-26 00:59 supp_k Note Added: 0001470
2017-04-26 01:01 supp_k File Added: pgpool_fail_voting.tar.gz
2017-04-26 01:02 supp_k File Added: pgpool_fail_voting.tar-2.gz
2017-04-26 01:02 supp_k Note Added: 0001471
2017-04-27 02:20 supp_k Note Added: 0001476
2017-04-27 04:06 supp_k Note Added: 0001477
2017-04-27 06:08 supp_k Note Added: 0001478
2017-04-27 06:08 supp_k File Added: no_failover.log
2017-04-27 06:08 supp_k Note Added: 0001479
2017-04-27 08:02 t-ishii Assigned To => Muhammad Usama
2017-04-27 08:02 t-ishii Status new => assigned
2017-04-28 00:44 Muhammad Usama File Added: wd_quorum.diff
2017-04-28 00:44 Muhammad Usama Note Added: 0001481
2017-04-28 01:03 supp_k Note Added: 0001482
2017-04-28 01:28 Muhammad Usama Note Added: 0001483
2017-04-28 01:50 supp_k File Added: log_failed_master.log
2017-04-28 01:50 supp_k Note Added: 0001484
2017-04-28 05:22 supp_k File Added: pgpool_wd_issue.tar.gz
2017-04-28 05:22 supp_k Note Added: 0001485
2017-05-04 01:00 Muhammad Usama File Added: wd_failover_not_performed.diff
2017-05-04 01:00 Muhammad Usama Note Added: 0001491
2017-05-05 00:39 supp_k Note Added: 0001492
2017-05-15 18:38 supp_k Note Added: 0001508
2017-05-16 20:00 Muhammad Usama Note Added: 0001511
2017-05-16 22:17 supp_k Note Added: 0001512
2017-05-16 22:24 supp_k Note Added: 0001513
2017-05-16 22:25 supp_k File Added: restored_pgpool_b.log
2017-05-16 22:25 supp_k Note Added: 0001514
2017-05-17 03:26 Muhammad Usama Note Added: 0001515
2017-05-17 21:58 Muhammad Usama Note Added: 0001517
2017-05-19 00:35 supp_k File Added: split_brain_pgpool.tar.gz
2017-05-19 00:35 supp_k Note Added: 0001519
2017-05-23 00:46 supp_k Note Added: 0001522
2017-05-30 22:22 Muhammad Usama File Added: bug_306.diff
2017-05-30 22:22 Muhammad Usama Note Added: 0001525
2017-05-31 03:52 supp_k Note Added: 0001527
2017-05-31 17:02 Muhammad Usama Note Added: 0001528
2017-06-08 18:17 supp_k File Added: pgpool_logs.tar.gz
2017-06-08 18:17 supp_k Note Added: 0001537
2017-06-21 19:33 Muhammad Usama Note Added: 0001541
2017-08-14 20:51 supp_k Note Added: 0001650
2017-08-29 09:51 pengbo Status assigned => closed
2017-08-29 16:13 supp_k Status closed => feedback
2017-08-29 16:13 supp_k Resolution open => reopened
2017-08-29 16:13 supp_k Note Added: 0001687