View Issue Details

ID: 0000306
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2017-08-29 16:13
Reporter: supp_k
Assigned To: Muhammad Usama
Priority: high
Severity: major
Reproducibility: always
Status: feedback
Resolution: reopened
Platform: x86_64
OS: CentOS
OS Version: 7.X
Product Version: 3.6.2
Summary: 0000306: Pgpool steals back MASTER status
Description: When a master Pgpool instance that was disconnected from the other 2 nodes gets connected once again, it steals back the MASTER status. Thus the newly promoted master (elected because the previous one was disconnected) gets demoted to standby.

Steps To Reproduce: 3 pgpool instances:

Step 1:
1) A - standby
2) B - standby
3) C - master

Bring down the network interface of pgpool C (ifdown). The picture changes to:
1) A - standby
2) B - master <= it was elected because the Pgpool C 'disappeared'
3) C - master <= it is still MASTER!! But it should be demoted



Step 2: restore the network on server C. The result is the following:
1) A - standby
2) B - standby <= it was demoted! Why?
3) C - master <= it is MASTER again






Tags: No tags attached.

Activities

supp_k

2017-04-25 03:03

reporter   ~0001458

Instance C reports QUORUM_ABSENT when it gets disconnected from the other 2 Pgpool instances.

Muhammad Usama

2017-04-25 05:26

developer   ~0001459

What was your expectation in that case?
Do you think node B should remain as MASTER after node C joins back in, since B has the quorum while C does not?

supp_k

2017-04-25 05:56

reporter   ~0001460

Muhammad,
I checked out the HEAD of the V3_6 branch, and quick testing revealed no such problem, i.e. Pgpool B remains master while C turns into standby. It seems the issue was fixed by your recent commit.

Give me a couple of days and I will confirm that the problem has disappeared.

supp_k

2017-04-25 18:58

reporter   ~0001461

Muhammad,

yes, the problem is not reproduced on the top of the V3_6 branch.

supp_k

2017-04-25 21:42

reporter   ~0001463

I'm sorry for the previous quick answer, but the issue is reproduced - not often, but reproduced on the HEAD of the V3_6_STABLE branch.
Let's consider the environment:
1) Pgpool A (Standby)
2) Pgpool B (Master) + PG backend B (Master)
3) Pgpool C (Standby) + PG backend (Sync slave)

The details are the same - when the former Pgpool master gets connected to the remaining pgpool nodes (A & B), it steals the master status from the newly elected master (say, B). That might seem acceptable, but DB clients face a database connection shock once again because the VIP is relocated back to C.

And, which is very bad: at this moment the old Pgpool master (instance C) still considers PG backend B to be DOWN because it was separated by the network failure. When the connection is restored, it forwards all SQL to the PG on C. But this instance doesn't contain fresh data! During the period when Pgpool C was absent, the new Pgpool master on B was forwarding all SQL to its local PG instance, which had been promoted to master!

Muhammad Usama

2017-04-25 22:07

developer   ~0001464

First of all, many thanks for testing this and reporting the issue. The latest code on the HEAD of the V3_6_STABLE branch is supposed to consider the node escalation state before anything else while deciding the rightful master node, but apparently it's not behaving as it is supposed to in your test environment.
Can you confirm that all three nodes are using the code on the HEAD of V3_6_STABLE? Also, if possible, can you share the pgpool log around the time when the disconnected old master joins back the cluster and the situation occurs?
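The selection rule described here (comparing the nodes' states to decide the rightful master after a split-brain) can be sketched as a toy model. This is an illustration only, not pgpool-II's watchdog code; the `Node` fields and the exact tie-breaker order are assumptions for the example:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    has_quorum: bool   # does this node currently see a majority of the cluster?
    escalated: bool    # did it complete escalation (e.g. acquired the VIP)?
    wd_priority: int   # configured watchdog priority

def resolve_split_brain(a: Node, b: Node) -> Node:
    """Pick the rightful master between two nodes that both claim MASTER.
    Quorum outranks everything; escalation state and priority break ties."""
    key = lambda n: (n.has_quorum, n.escalated, n.wd_priority)
    return max(a, b, key=key)

# Node B (with quorum) vs node C (isolated old master, no quorum):
b = Node("B", has_quorum=True, escalated=True, wd_priority=1)
c = Node("C", has_quorum=False, escalated=True, wd_priority=5)
print(resolve_split_brain(b, c).name)  # "B": C's claim fails for lack of quorum
```

Under this rule, the reconnecting node C could never "steal back" MASTER from B, which is the behaviour the reporter expects.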

supp_k

2017-04-25 22:10

reporter   ~0001465

> What was your expectation in that case?
> Do you think node B should remain as MASTER after node C joins back in, since B has the quorum while C does not?

Yes, exactly. If the MASTER instance loses quorum, then it should resign (or become a master candidate) - this is the concept of distributed voting. But as of now I see it remains MASTER despite the fact that it is disconnected from the other pgpool instances.
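The voting rule proposed above can be sketched as follows. This is a minimal illustration of the concept, not pgpool-II's implementation; the state names and the majority formula are assumptions:

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    """A node holds quorum only if it can see more than half of the cluster."""
    return alive_nodes > total_nodes // 2

def next_state(current_state: str, alive_nodes: int, total_nodes: int) -> str:
    """A MASTER that loses quorum resigns to a candidate role instead of
    staying MASTER while isolated from the rest of the cluster."""
    if current_state == "MASTER" and not has_quorum(alive_nodes, total_nodes):
        return "MASTER_CANDIDATE"  # resign; re-enter the election once quorum returns
    return current_state

# Isolated node C sees only itself out of 3 nodes, so it must resign:
print(next_state("MASTER", alive_nodes=1, total_nodes=3))  # MASTER_CANDIDATE
# Node B still sees 2 of 3 nodes, so it keeps MASTER:
print(next_state("MASTER", alive_nodes=2, total_nodes=3))  # MASTER
```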

##################################################################################
The 'pcp_watchdog_info' output for the disconnected C Pgpool instance:
[root@OAA-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone

OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone 192.168.68.164 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 8 LOST
##################################################################################
B Pgpool instance (newly elected Master):
[root@OAB-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 YES OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone

OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone 192.168.69.178 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 7 STANDBY
##################################################################################

supp_k

2017-04-25 22:19

reporter   ~0001467

Muhammad,

The version used was freshly built from the HEAD of V3_6_STABLE:
[root@OAA-4f0e8fadf6f0 pgha]# pgpool --version
pgpool-II version 3.6.2 (subaruboshi)

The configuration files for the 3 pgpool instances are also attached.
pgpool_conf.tar.gz (20,480 bytes)

Muhammad Usama

2017-04-25 22:24

developer   ~0001468

Thanks for the configuration files, but if you can share the pgpool log files, that would be really helpful.

supp_k

2017-04-26 00:20

reporter   ~0001469

Muhammad,

I'm trying to reproduce the issue. You probably won't believe me, but I have been failing to do so for more than an hour. As soon as I reproduce the problem, I'll share the logs!

supp_k

2017-04-26 00:59

reporter   ~0001470

Muhammad,

attached is the archive: pgpool_fail_voting.tar.gz.

Please look into the 'db_node_b.log' file.
1) At 18:37:50, DB node B detected that the Pgpool on "OAA-4f0e8fadf6f0.aqa.int.zone" was lost. Then it was promoted to MASTER. This is OK, because the network interface of the previous master on "OAA-4f0e8fadf6f0.aqa.int.zone" had been brought down (ifdown eth0).

2) At approximately 18:40:54 the network interface of the former master was restored, and I see the following sequence of events in the same log:

Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-1] LOG: new watchdog node connection is received from "192.168.69.178:22753"
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-2] LOCATION: watchdog.c:3178
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-1] LOG: new node joined the cluster hostname:"OAA-4f0e8fadf6f0.aqa.int.zone" port:9000 pgpool_port:5432
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-2] LOCATION: watchdog.c:1442
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-1] LOG: remote node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone" decided it is the true master
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-2] DETAIL: re-initializing the local watchdog cluster state because of split-brain
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-3] LOCATION: watchdog.c:3682
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-1] LOG: watchdog node state changed from [MASTER] to [JOINING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-1] LOG: watchdog: de-escalation started
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-2] LOCATION: wd_escalation.c:181
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-1] LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 su: (to postgres) root on none
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-1] LOG: successfully released the delegate IP:"192.168.52.96"
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-2] DETAIL: 'if_down_cmd' returned with success
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-3] LOCATION: wd_if.c:208
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-1] LOG: watchdog de-escalation process with pid: 22991 exit with SUCCESS.
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-2] LOCATION: watchdog.c:3053
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-1] LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-2] LOCATION: watchdog.c:6313
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-1] LOG: successfully joined the watchdog cluster as standby node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-2] DETAIL: our join coordinator request is accepted by cluster leader node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone"
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-3] LOCATION: watchdog.c:6107
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-1] LOG: new IPC connection received
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-2] LOCATION: watchdog.c:3210
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-1] LOG: we have joined the watchdog cluster as STANDBY node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-2] DETAIL: syncing the backend states from the MASTER watchdog node





Also please pay attention to messages like:
- Apr 25 18:39:53 OAB-4f0e8fadf6f0 pgpool[8063]: [7535-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket

There are a lot of such messages following the command 'Apr 25 18:38:02 OAB-4f0e8fadf6f0 pgpool[8063]: [7107-1] LOG: execute command: sudo /usr/local/pem/bin/pgha/pghactl.py do-failover --new-master-node-id=1 --new-master-host=b.db.node'

The problem is that 'b.db.node' is a synonym of 'OAB-4f0e8fadf6f0', and its Postgres instance was promoted to master by the mentioned failover command. Yet the message states it is unavailable. DB clients also face problems with DB connections at the same time.

supp_k

2017-04-26 01:01

reporter  

pgpool_fail_voting.tar.gz (10,926,080 bytes)

supp_k

2017-04-26 01:02

reporter   ~0001471

Attaching logs..
pgpool_fail_voting.tar-2.gz (10,926,080 bytes)

supp_k

2017-04-27 02:20

reporter   ~0001476

Hi Muhammad,

did the logs help you? Can you see a problem?

supp_k

2017-04-27 04:06

reporter   ~0001477

I could also observe a case where a standby Pgpool stole the master status from the acting master Pgpool after the network was restored.

Obviously this is because the standby was self-promoted to master when the network was split. I think logically it can be solved by not performing voting when quorum does not exist within a single node.

supp_k

2017-04-27 06:08

reporter   ~0001478

Muhammad,

please see the attached file "no_failover.log".
At "23:55:53" the Postgres on DB node B was killed (-9), and Pgpool was supposed to initiate the failover command. But this didn't happen.
The log was collected from the master Pgpool (pgpool-II version 3.6.2 (subaruboshi)), which was built from the HEAD of the V3_6_STABLE branch.

supp_k

2017-04-27 06:08

reporter   ~0001479

no_failover.log
no_failover.log (455,354 bytes)

Muhammad Usama

2017-04-28 00:44

developer   ~0001481

Hi,

The Pgpool-II logs "pgpool_fail_voting.tar" you sent earlier do not contain the correct pgpool log for head_pgpool, so it was hard to say what was happening in that case. But I found a potential issue in the quorum calculation which can lead to the same behaviour.

Can you please try the attached wd_quorum.diff patch against the current 3.6 head, to see if it fixes the issue of rightful master node selection after the cluster recovers from the split-brain? Note that it's a WIP patch to confirm/decline whether it solves the problem, and it needs to be applied on all pgpool-II nodes participating in the test.

Meanwhile I am also looking into the other problem, where the failover was not performed.

Thanks
wd_quorum.diff (6,597 bytes)   

supp_k

2017-04-28 01:03

reporter   ~0001482

Muhammad,
I'll schedule automatic tests right now. The result will not be provided quickly, because the problem is not reproduced reliably.

Muhammad Usama

2017-04-28 01:28

developer   ~0001483

Sure, no problem, and many thanks for the testing.

supp_k

2017-04-28 01:50

reporter   ~0001484

After 5 iterations of ifdown/ifup actions the problem was reproduced.
Please see the attached log from the former master. It was demoted to standby at 19:40:23, as can be seen from the log, and later it self-promoted to master when the network was restored.

I need to reconfigure the environment... when that is done, I'll share logs from the 3 pgpool instances and you will be able to observe the "complete picture" for another case.
log_failed_master.log (52,958 bytes)   

supp_k

2017-04-28 05:22

reporter   ~0001485

Muhammad,

I'm not sure, but I could reproduce the error only after attaching PG backends to the Pgpool cluster. Before that, I was unable to reproduce the problem over a long period with your fresh patch!

The captured evidence of the problem is attached: logs + configuration files. The names of the files within the attached tarball are self-explanatory.
=======================================================

Environment before the "network split":
1) Head Pgpool
2) DB Node A: Pgpool + PG
3) DB Node B: Pgpool (Master) + PG

The "ifdown" was initiated on DB node B at "22:58:50". Right after that, the Pgpool on DB node A was promoted to master - this is OK.

A few seconds later the network was restored (ifup) on DB node B, and now the logs confirm that Pgpool A was demoted and Pgpool B took back the master status.
pgpool_wd_issue.tar.gz (1,382,297 bytes)

Muhammad Usama

2017-05-04 01:00

developer   ~0001491

Hi Supp_K,

I have created a patch for the second part of the problem, where the failover was not performed by Pgpool-II. Can you please try the attached patch and check whether it solves that problem?
wd_failover_not_performed.diff (16,565 bytes)   

supp_k

2017-05-05 00:39

reporter   ~0001492

Hi Muhammad,

I applied the fix to the head version of the 3_6_stable branch. The result will be provided later because I want to test it carefully.

supp_k

2017-05-15 18:38

reporter   ~0001508

Hi Muhammad,
sorry for the long delay from my side. I'm not sure the patch is OK. But I see lots of changes in the repository. Do they include the patch you provided?
I think it is better if I continue verifying the latest HEAD of the V3_6_STABLE branch.

Muhammad Usama

2017-05-16 20:00

developer   ~0001511

Hi Supp_K,
Yes, the changes in the patch are already committed to the repository. You can use the V3_6_STABLE branch from HEAD to verify the fix.

Thanks

supp_k

2017-05-16 22:17

reporter   ~0001512

Muhammad,
the previous standalone fix and the latest HEAD of the V3_6_STABLE branch work OK. In my test cases I could observe stable failover triggering.
But I still observe the primary problem of incorrect split-brain resolution when the disconnected master connects again to the set of pgpools that have chosen the new master.

supp_k

2017-05-16 22:24

reporter   ~0001513

I also observe a case where the split-brain resolution never ends. It happens when the network-disconnected master pgpool instance connects again to the set of pgpools.

I see the pcp_watchdog_info output:
[root@OAB-7976adbf5291 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone

OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone 5432 9000 4 MASTER
OAA-7976adbf5291.aqa.int.zone:5432 Linux OAA-7976adbf5291.aqa.int.zone 192.168.83.16 5432 9000 8 LOST
POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone 192.168.83.180 5432 9000 2 JOINING


It never ends. Please see the attached pgpool log of the former master, which has reconnected.

supp_k

2017-05-16 22:25

reporter   ~0001514

Never ending voting
restored_pgpool_b.log (3,727,982 bytes)

Muhammad Usama

2017-05-17 03:26

developer   ~0001515

Thanks for the input I am looking into this one.

Muhammad Usama

2017-05-17 21:58

developer   ~0001517

Hi Supp_k,

Is it possible for you to share the pgpool-II log for the "POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone" node as well?

supp_k

2017-05-19 00:35

reporter   ~0001519

Hi Muhammad,

I failed to reproduce the endless split-brain resolution case. But in any case, please find attached fresh logs of the incorrect split-brain resolution case.
split_brain_pgpool.tar.gz (183,073 bytes)

supp_k

2017-05-23 00:46

reporter   ~0001522

Hi Muhammad,

with the help of the attached logs, have you reproduced the issue with the incorrect split-brain resolution on your side? Do you need any additional help?

Muhammad Usama

2017-05-30 22:22

developer   ~0001525

Hi supp_k,

Thanks for your help, and yes, I was able to reproduce the issue once; it happens occasionally.
I have made some adjustments to how a node handles and detects the split-brain, and so far the tests are passing on my setup. Can you please also try out the attached patch file bug_306.diff to see if it solves the issue?

Also while testing the scenario if possible please collect and share the complete Pgpool-II logs for all Pgpool-II nodes.

Thanks
Best regards
Muhammad usama
bug_306.diff (4,365 bytes)   

supp_k

2017-05-31 03:52

reporter   ~0001527

Hi Muhammad,

which branch should I apply the fix to?

Muhammad Usama

2017-05-31 17:02

developer   ~0001528

3_6_STABLE or master. It should work with both.

supp_k

2017-06-08 18:17

reporter   ~0001537

Environment:
1) Head Slave Pgpool
2) DB node A + Master Pgpool
3) DB node B + Slave Pgpool

Sequence of actions:
- At 11:24:10 the network interface of DB node A was brought down (ifdown eth0).
- At Jun 8 11:24:11 the Pgpool on DB node B discovered that the master Pgpool had disappeared (see db_node_b_pgpool.log).
After that, the election process started and ended within several seconds. But the pgpool cluster was not providing any connections to DB clients. You can see series of messages like:
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-2] DETAIL: health check timer expired
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-3] LOCATION: pool_connection_pool.c:567
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-1] ERROR: failed to make persistent db connection
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-2] DETAIL: connection to host:"b.db.node:15432" failed
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-3] LOCATION: child.c:1218
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-1] LOG: find_primary_node: checking backend no 0
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-1] LOG: find_primary_node: checking backend no 1
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [220-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket


- At Jun 8 11:27:20 the network interface of DB node A was restored (ifup eth0), and you can see that the split brain was detected.
- The re-election process was initiated. After the election, the new master Pgpool (DB node B) remained master - this is OK. But unfortunately the Pgpool cluster is not operational - it doesn't provide connections to DB clients (see the log fragment):

Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11970]: [38690-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-1] LOG: child process with pid: 11970 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-1] LOG: fork a new child process with pid: 12138
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-1] LOG: child process with pid: 11792 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-1] LOG: fork a new child process with pid: 12139
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39029-1] LOG: child process with pid:


All logs are attached.
pgpool_logs.tar.gz (3,863,657 bytes)

Muhammad Usama

2017-06-21 19:33

developer   ~0001541

Hi

Thanks for performing the test and providing the logs. I have analysed all the logs, and now the master election after split-brain, as well as the time taken to recover from split-brain, seems to be working fine and as expected.
Now for the second part: after the split-brain was recovered, Pgpool-II was not accepting any connections because it couldn't find any valid backend node at that time.

Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool

is not related to the watchdog; it happened because the Pgpool-II backend health check was not able to connect to the backends, so it failed over the backend nodes.

See the following log message extracts from db_node_b_pgpool.log

a.db.node failure logs
=================
22659 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-1] LOG: failed to connect to PostgreSQL server on "a.db.node:15432" using INET socket
22660 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-2] DETAIL: select() system call failed with an error "Interrupted system call"
22661 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-3] LOCATION: pool_connection_pool.c:699
22662 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-1] ERROR: failed to make persistent db connection
22663 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-2] DETAIL: connection to host:"a.db.node:15432" failed\
22664 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-3] LOCATION: child.c:1218

And b.db.node failure logs
======================
22785 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
22786 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-2] DETAIL: health check timer expired
22787 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-3] LOCATION: pool_connection_pool.c:567
22788 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-1] ERROR: failed to make persistent db connection
22789 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-2] DETAIL: connection to host:"b.db.node:15432" failed
22790 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-3] LOCATION: child.c:1218

So both nodes were detached from Pgpool-II because the pgpool health check advised it to do so (the health check was not able to connect to the backends). After the failover of b.db.node, Pgpool-II was not accepting any client connections, since it had no alive PG backend attached.

Now, the reason why pgpool-II was not able to connect to alive PG nodes may need a separate investigation; it may have been caused by some network issue or some other reason, and the PostgreSQL log may be more helpful than the pgpool-II log for the postmortem.

So, as a way forward, I am going to commit the patch (bug_306.diff, the one with which this test was performed), since it solves the originally reported problem, and close this bug report.
We can analyse the reason for the health check not being able to connect to the backend PG node in a separate thread.
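The failure chain described above (health check cannot connect, backend is detached, no alive backends remain, pgpool refuses new connections) can be illustrated with a toy model. The function and state names here are assumptions for illustration, not pgpool-II's actual code:

```python
def run_health_checks(backends: dict, reachable: set) -> dict:
    """Detach (mark down) any backend whose health check cannot connect."""
    for name in backends:
        if name not in reachable:
            backends[name] = "down"  # failover: detach the unreachable backend
    return backends

def accepting_connections(backends: dict) -> bool:
    """pgpool requires at least one valid (up) backend node to serve clients."""
    return any(state == "up" for state in backends.values())

backends = {"a.db.node": "up", "b.db.node": "up"}
# The health check cannot reach either backend, as in the reported logs:
run_health_checks(backends, reachable=set())
print(accepting_connections(backends))
# False -> corresponds to "pgpool is not accepting any new connections"
```

This mirrors the log sequence: both a.db.node and b.db.node failed their health checks, were detached, and every child process then reported "all backend nodes are down".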

Thanks
Best Regards
Muhammad Usama

supp_k

2017-08-14 20:51

reporter   ~0001650

Hi Muhammad,

I'm not sure whether it is correct, but I've detected another instance of the split brain.
Say we have an environment:
1) Server A:
- Head Pgpool
2) Server B:
- Pgpool (Master) + PostgreSQL (Master)
3) Server C:
- Pgpool + PostgreSQL (Sync Slave)

Sequence of actions:
- Perform "ifdown eth0" on Server B; after a period of time Server C (it has a larger wd_priority) is promoted to master - it is OK.
- From the perspective of Server C, the PostgreSQL on Server B is marked as down - it is OK.
- From the perspective of Server B, the PostgreSQL on Server C is marked as down - it is OK.
- Restore the network on Server B by executing "ifup eth0";
   Server B rejoins the pgpool cluster as a slave - it is OK.

Issue: both PostgreSQL instances are marked as down within the Pgpool cluster (as confirmed by pcp_node_info).

supp_k

2017-08-29 16:13

reporter   ~0001687

Hi Pengbo,

why have you closed the issue? It is not resolved yet, and we are still suffering from it.

Issue History

Date Modified Username Field Change
2017-04-25 02:07 supp_k New Issue
2017-04-25 03:03 supp_k Note Added: 0001458
2017-04-25 05:26 Muhammad Usama Note Added: 0001459
2017-04-25 05:56 supp_k Note Added: 0001460
2017-04-25 18:58 supp_k Note Added: 0001461
2017-04-25 21:42 supp_k Note Added: 0001463
2017-04-25 22:07 Muhammad Usama Note Added: 0001464
2017-04-25 22:10 supp_k Note Added: 0001465
2017-04-25 22:19 supp_k File Added: pgpool_conf.tar.gz
2017-04-25 22:19 supp_k Note Added: 0001467
2017-04-25 22:24 Muhammad Usama Note Added: 0001468
2017-04-26 00:20 supp_k Note Added: 0001469
2017-04-26 00:59 supp_k Note Added: 0001470
2017-04-26 01:01 supp_k File Added: pgpool_fail_voting.tar.gz
2017-04-26 01:02 supp_k File Added: pgpool_fail_voting.tar-2.gz
2017-04-26 01:02 supp_k Note Added: 0001471
2017-04-27 02:20 supp_k Note Added: 0001476
2017-04-27 04:06 supp_k Note Added: 0001477
2017-04-27 06:08 supp_k Note Added: 0001478
2017-04-27 06:08 supp_k File Added: no_failover.log
2017-04-27 06:08 supp_k Note Added: 0001479
2017-04-27 08:02 t-ishii Assigned To => Muhammad Usama
2017-04-27 08:02 t-ishii Status new => assigned
2017-04-28 00:44 Muhammad Usama File Added: wd_quorum.diff
2017-04-28 00:44 Muhammad Usama Note Added: 0001481
2017-04-28 01:03 supp_k Note Added: 0001482
2017-04-28 01:28 Muhammad Usama Note Added: 0001483
2017-04-28 01:50 supp_k File Added: log_failed_master.log
2017-04-28 01:50 supp_k Note Added: 0001484
2017-04-28 05:22 supp_k File Added: pgpool_wd_issue.tar.gz
2017-04-28 05:22 supp_k Note Added: 0001485
2017-05-04 01:00 Muhammad Usama File Added: wd_failover_not_performed.diff
2017-05-04 01:00 Muhammad Usama Note Added: 0001491
2017-05-05 00:39 supp_k Note Added: 0001492
2017-05-15 18:38 supp_k Note Added: 0001508
2017-05-16 20:00 Muhammad Usama Note Added: 0001511
2017-05-16 22:17 supp_k Note Added: 0001512
2017-05-16 22:24 supp_k Note Added: 0001513
2017-05-16 22:25 supp_k File Added: restored_pgpool_b.log
2017-05-16 22:25 supp_k Note Added: 0001514
2017-05-17 03:26 Muhammad Usama Note Added: 0001515
2017-05-17 21:58 Muhammad Usama Note Added: 0001517
2017-05-19 00:35 supp_k File Added: split_brain_pgpool.tar.gz
2017-05-19 00:35 supp_k Note Added: 0001519
2017-05-23 00:46 supp_k Note Added: 0001522
2017-05-30 22:22 Muhammad Usama File Added: bug_306.diff
2017-05-30 22:22 Muhammad Usama Note Added: 0001525
2017-05-31 03:52 supp_k Note Added: 0001527
2017-05-31 17:02 Muhammad Usama Note Added: 0001528
2017-06-08 18:17 supp_k File Added: pgpool_logs.tar.gz
2017-06-08 18:17 supp_k Note Added: 0001537
2017-06-21 19:33 Muhammad Usama Note Added: 0001541
2017-08-14 20:51 supp_k Note Added: 0001650
2017-08-29 09:51 pengbo Status assigned => closed
2017-08-29 16:13 supp_k Status closed => feedback
2017-08-29 16:13 supp_k Resolution open => reopened
2017-08-29 16:13 supp_k Note Added: 0001687