View Issue Details

ID: 0000228
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2016-08-04 23:51
Reporter: supp_k
Assigned To: Muhammad Usama
Priority: high
Severity: major
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: pgpool
OS: CentOS
OS Version: 6 & 7
Product Version: 3.5.3
Target Version:
Fixed in Version:
Summary: 0000228: pgpool doesn't de-escalate IP in case network restored
Description: Pgpool doesn't de-escalate the IP address when the split brain is resolved and it turns from Master into Standby.
Steps To Reproduce:
Environment:
1) Pgpool A (Master) hosts VIP (virtual IP)
2) Pgpool B (Standby)
Watchdog and heartbeat processes are OK.

Steps to reproduce:
- Emulate a network failure between Pgpool A & Pgpool B that lasts longer than the heartbeat receive time period. When the heartbeat time is exceeded, Pgpool initiates voting and brings up the VIP, which is OK. Now we have 2 Pgpool masters within the network => this is the "split brain" case.

- Restore network connectivity between Pgpool A & Pgpool B => the pgpools restart voting and one of the masters turns into Standby (let it be Pgpool B). That is OK as well, but at that moment Pgpool B doesn't bring down (ip addr del ...) the VIP. Should it?
Tags: watchdog

Activities

Muhammad Usama

2016-08-03 23:14

developer  

de-esc_bug_228.diff (643 bytes)
diff --git a/src/watchdog/watchdog.c b/src/watchdog/watchdog.c
index 07fb2fa..f5257db 100644
--- a/src/watchdog/watchdog.c
+++ b/src/watchdog/watchdog.c
@@ -4948,6 +4948,10 @@ static int set_state(WD_STATES newState)
 	g_cluster.localNode->state = newState;
 	if (oldState != newState)
 	{
+		/* if we changing from the coordinator state, do the de-escalation if required */
+		if (oldState == WD_COORDINATOR)
+			resign_from_escalated_node();
+
 		ereport(LOG,
 				(errmsg("watchdog node state changed from [%s] to [%s]",wd_state_names[oldState],wd_state_names[newState])));
 		watchdog_state_machine(WD_EVENT_WD_STATE_CHANGED, NULL, NULL);
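The idea behind the patch can be illustrated with a small toy model: whenever a node leaves the coordinator state, it releases the delegate IP as part of the state change. This is only a sketch under simplifying assumptions, not pgpool's implementation. All `toy_*` and `NODE_*` names are hypothetical, the state list is reduced, and escalation is folded into the state setter purely to keep the example short. The `patched` flag switches between the reported behaviour and the behaviour after the fix.

```c
#include <stdbool.h>
#include <stdio.h>

/* Simplified stand-ins for the watchdog states; NOT the real WD_STATES enum. */
typedef enum {
    NODE_JOINING,
    NODE_INITIALIZING,
    NODE_STANDBY,
    NODE_COORDINATOR
} NodeState;

typedef struct {
    const char *name;
    NodeState   state;
    bool        holds_vip;  /* true while the delegate IP is up on this node */
} ToyNode;

/* Hypothetical stand-in for escalation: bring the delegate IP up. */
static void toy_escalate(ToyNode *n)
{
    n->holds_vip = true;
}

/* Hypothetical stand-in for resign_from_escalated_node(): drop the VIP if held. */
static void toy_deescalate(ToyNode *n)
{
    if (n->holds_vip) {
        n->holds_vip = false;
        printf("%s: de-escalation, delegate IP released\n", n->name);
    }
}

/* State change that mirrors the rule added by the patch:
 * leaving the coordinator state triggers de-escalation.
 * With patched == false the rule is skipped, as before the fix. */
static void toy_set_state(ToyNode *n, NodeState new_state, bool patched)
{
    NodeState old_state = n->state;

    n->state = new_state;
    if (old_state == new_state)
        return;
    if (patched && old_state == NODE_COORDINATOR)
        toy_deescalate(n);
    if (new_state == NODE_COORDINATOR)
        toy_escalate(n);  /* simplification: escalate on becoming coordinator */
}

/* Replays the reported scenario for Pgpool B: it becomes a second coordinator
 * during the partition, then re-joins as standby once the network is back.
 * Returns true if B still holds the VIP at the end. */
bool toy_vip_left_behind(bool patched)
{
    ToyNode b = { "Pgpool B", NODE_STANDBY, false };

    toy_set_state(&b, NODE_COORDINATOR, patched);   /* split brain */
    toy_set_state(&b, NODE_JOINING, patched);       /* network restored */
    toy_set_state(&b, NODE_INITIALIZING, patched);
    toy_set_state(&b, NODE_STANDBY, patched);
    return b.holds_vip;
}
```

In this model, `toy_vip_left_behind(false)` reports that the VIP stays on the demoted node, while `toy_vip_left_behind(true)` reports that it is released as soon as the node leaves the coordinator state.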

Muhammad Usama

2016-08-03 23:16

developer   ~0000963

Hi

I was able to reproduce the issue. Can you please try the attached patch "de-esc_bug_228.diff" and check whether it solves your problem?

supp_k

2016-08-04 01:41

reporter   ~0000964

Hi,

Yes, the problem disappeared.

Here are the log records:
2016-08-03 19:37:34: pid 2604: WARNING: "Linux_warm1.local_9999" is the coordinator as per our record but "Linux_warm0.local_9999" is also announcing as a coordinator
2016-08-03 19:37:34: pid 2604: DETAIL: re-initializing the cluster
2016-08-03 19:37:34: pid 2604: LOG: watchdog node state changed from [MASTER] to [JOINING]
2016-08-03 19:37:34: pid 2952: LOG: watchdog: de-escalation started
2016-08-03 19:37:34: pid 2604: WARNING: the coordinator as per our record is not coordinator anymore
2016-08-03 19:37:34: pid 2604: DETAIL: re-initializing the cluster
2016-08-03 19:37:34: pid 2604: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
2016-08-03 19:37:35: pid 2604: LOG: watchdog node state changed from [INITIALIZING] to [STANDING FOR MASTER]
2016-08-03 19:37:35: pid 2604: LOG: watchdog node state changed from [STANDING FOR MASTER] to [PARTICIPATING IN ELECTION]
2016-08-03 19:37:35: pid 2604: LOG: watchdog node state changed from [PARTICIPATING IN ELECTION] to [INITIALIZING]
2016-08-03 19:37:35: pid 2605: LOG: informing the node status change to watchdog
2016-08-03 19:37:35: pid 2605: DETAIL: node id :1 status = "NODE ALIVE" message:"Heartbeat signal found"
2016-08-03 19:37:35: pid 2604: LOG: new IPC connection received
2016-08-03 19:37:35: pid 2604: LOG: received node status change ipc message
2016-08-03 19:37:35: pid 2604: DETAIL: Heartbeat signal found
2016-08-03 19:37:36: pid 2604: LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
2016-08-03 19:37:40: pid 2604: LOG: successfully joined the watchdog cluster as standby node
2016-08-03 19:37:40: pid 2604: DETAIL: our join coordinator request is accepted by cluster leader node "Linux_warm0.local_9999"
2016-08-03 19:37:46: pid 2952: WARNING: watchdog failed to ping host"192.168.7.7"
2016-08-03 19:37:46: pid 2952: DETAIL: ping process exits with code: 1
2016-08-03 19:37:46: pid 2952: LOG: watchdog bringing down delegate IP
2016-08-03 19:37:46: pid 2952: DETAIL: if_down_cmd succeeded
2016-08-03 19:37:46: pid 2604: LOG: watchdog de-escalation process with pid: 2952 exit with SUCCESS.



Thank you!

Muhammad Usama

2016-08-04 23:51

developer   ~0000965

Thanks for confirming the fix. I have committed it to the master and 3.5 branches.

http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=cf57d9970f46a92c52315b42eae9dbee73c90525

Issue History

Date Modified Username Field Change
2016-08-02 01:53 supp_k New Issue
2016-08-02 10:22 t-ishii Assigned To => Muhammad Usama
2016-08-02 10:22 t-ishii Status new => assigned
2016-08-02 13:44 t-ishii Tag Attached: watchdog
2016-08-03 23:14 Muhammad Usama File Added: de-esc_bug_228.diff
2016-08-03 23:16 Muhammad Usama Status assigned => feedback
2016-08-03 23:16 Muhammad Usama Note Added: 0000963
2016-08-04 01:41 supp_k Note Added: 0000964
2016-08-04 01:41 supp_k Status feedback => assigned
2016-08-04 23:51 Muhammad Usama Status assigned => resolved
2016-08-04 23:51 Muhammad Usama Resolution open => fixed
2016-08-04 23:51 Muhammad Usama Note Added: 0000965