View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000306 | Pgpool-II | Bug | public | 2017-04-25 02:07 | 2017-08-29 16:13 |
| Reporter | supp_k | Assigned To | Muhammad Usama | ||
| Priority | high | Severity | major | Reproducibility | always |
| Status | feedback | Resolution | reopened | ||
| Platform | x86_64 | OS | CentOS | OS Version | 7.X |
| Product Version | 3.6.2 | ||||
| Summary | 0000306: Pgpool steals back MASTER status | ||||
| Description | When a master Pgpool node that has been disconnected from the other two nodes reconnects, it steals back the MASTER status. As a result, the newly promoted master (elected because the previous master was unreachable) is demoted to standby. | ||||
| Steps To Reproduce | 3 Pgpool instances: 1) A - standby, 2) B - standby, 3) C - master. Step 1: run "ifdown" on the network interface of node C. The picture changes to: 1) A - standby, 2) B - master (elected because Pgpool C "disappeared"), 3) C - master (still MASTER, although it should have been demoted). Step 2: restore the network on node C; the result is: 1) A - standby, 2) B - standby (demoted - why?), 3) C - master again. | ||||
| Tags | No tags attached. | ||||
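As a hedged aside (not part of the original report): pgpool-II's watchdog has a `trusted_servers` setting that lets an isolated node discover that it, rather than its peers, has lost the network, which is exactly the situation node C is in above. A sketch of the relevant pgpool.conf fragment, with hypothetical host names:

```
# pgpool.conf (watchdog section) - illustrative values only
use_watchdog    = on
trusted_servers = 'gateway.example.com,dns1.example.com'
# If the node cannot reach any trusted server, the watchdog concludes that
# its own network is down instead of assuming the peer nodes have failed,
# so an isolated node should not keep claiming MASTER.
```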
|
|
Instance C then reports QUORUM_ABSENT when it is disconnected from the other two Pgpool instances. |
|
|
What was your expectation in that case? Do you think node B should remain MASTER after node C joins back, since B has the quorum while C does not? |
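For reference, the majority rule at stake here can be stated in a few lines of shell (a sketch, not pgpool-II code; the node counts are taken from the scenario in this report):

```shell
# Majority-quorum arithmetic for a 3-node watchdog cluster (illustrative).
total_nodes=3
visible_nodes=1                      # the isolated node C sees only itself
majority=$(( total_nodes / 2 + 1 ))  # 2 of 3 nodes needed for quorum
if [ "$visible_nodes" -ge "$majority" ]; then
  echo "QUORUM_EXIST"
else
  echo "QUORUM_ABSENT"               # matches what instance C reports
fi
```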
|
|
Muhammad, I checked out the HEAD of the V3_6 branch, and quick testing revealed no such problem now, i.e. Pgpool B remains master while C turns into standby. It seems the issue was fixed by your recent commit. Give me a couple of days and I will confirm that the problem has disappeared. |
|
|
Muhammad, yes. The problem is no longer reproduced on the top of the V3_6 branch. |
|
|
I'm sorry for the previous quick answer, but the issue is still reproduced - not often, but it is reproduced on the HEAD of the V3_6_STABLE branch. Consider the environment:

1) Pgpool A (standby)
2) Pgpool B (master) + PG backend B (master)
3) Pgpool C (standby) + PG backend C (sync slave)

The details are the same: when the former Pgpool master reconnects to the remaining Pgpool nodes (A and B), it steals the master status from the newly elected master (say B). That might seem acceptable, but DB clients suffer a connection shock once again because the VIP is relocated back to C. What is worse: at that moment the old Pgpool master (instance C) still considers PG backend B to be DOWN, because C was separated from the network. Once the connection is restored, C forwards all SQL to the PG instance on C - but that instance does not contain fresh data! During the period when Pgpool C was absent, the new Pgpool master on B was forwarding all SQL to its local PG instance, which had been promoted to master. |
|
|
First of all, many thanks for testing this and reporting the issue. The latest code on the HEAD of the V3_6_STABLE branch is supposed to consider the node escalation state before anything else when deciding the rightful master node, but apparently it is not behaving as intended in your test environment. Can you confirm that all three nodes are running the code from the HEAD of V3_6_STABLE? Also, if possible, can you share the Pgpool log from around the time when the disconnected old master rejoins the cluster and the situation occurs? |
|
|
> Do you think node B should remain MASTER after node C joins back, since B has the quorum while C does not?

Yes, exactly. If the MASTER instance loses quorum, it should resign (or become a master candidate) - that is the concept of distributed voting. But as of now I see that it remains MASTER despite being disconnected from the other Pgpool instances.

The 'pcp_watchdog_info' output for the disconnected C Pgpool instance:

```
[root@OAA-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone

OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone 192.168.68.164 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 8 LOST
```

And for the B Pgpool instance (the newly elected master):

```
[root@OAB-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 YES OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone

OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone 192.168.69.178 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 7 STANDBY
```
|
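As an aside, the two watchdog views can be compared mechanically. This hedged helper (not a pgpool tool) counts how many peers a `pcp_watchdog_info` listing reports as LOST, using a trimmed sample that mirrors node C's view from this report:

```shell
# Count watchdog peers reported as LOST (sample mirrors node C's view above).
sample='OAB-4f0e8fadf6f0.aqa.int.zone:5432 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 5432 9000 8 LOST
OAA-4f0e8fadf6f0.aqa.int.zone:5432 5432 9000 4 MASTER'
lost=$(printf '%s\n' "$sample" | grep -c ' LOST$')
echo "LOST peers: $lost"   # C has lost both peers, yet still claims MASTER
```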
|
|
Muhammad, the version used was freshly built from the HEAD of V3_6_STABLE:

```
[root@OAA-4f0e8fadf6f0 pgha]# pgpool --version
pgpool-II version 3.6.2 (subaruboshi)
```

The configuration files for the 3 Pgpool instances are also attached. |
|
|
Thanks for the configuration files, but if you can share the Pgpool log files, that would be really helpful. |
|
|
Muhammad, I'm trying to reproduce the issue. You probably won't believe me, but I have been failing to do so for more than an hour. As soon as I reproduce the problem, I'll share the logs! |
|
|
Muhammad, attached is the archive pgpool_fail_voting.tar.gz. Please look into the 'db_node_b.log' file.

1) At 18:37:50 DB node B detected that the Pgpool on "OAA-4f0e8fadf6f0.aqa.int.zone" was lost, and it was then promoted to MASTER. This is OK, because the network interface of the previous master on "OAA-4f0e8fadf6f0.aqa.int.zone" had been brought down (ifdown eth0).

2) At approximately 18:40:54 the network interface of the former master was restored, and I see the following sequence of events in the same log:

```
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-1] LOG: new watchdog node connection is received from "192.168.69.178:22753"
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-2] LOCATION: watchdog.c:3178
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-1] LOG: new node joined the cluster hostname:"OAA-4f0e8fadf6f0.aqa.int.zone" port:9000 pgpool_port:5432
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-2] LOCATION: watchdog.c:1442
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-1] LOG: remote node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone" decided it is the true master
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-2] DETAIL: re-initializing the local watchdog cluster state because of split-brain
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-3] LOCATION: watchdog.c:3682
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-1] LOG: watchdog node state changed from [MASTER] to [JOINING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-1] LOG: watchdog: de-escalation started
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-2] LOCATION: wd_escalation.c:181
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-1] LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 su: (to postgres) root on none
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-1] LOG: successfully released the delegate IP:"192.168.52.96"
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-2] DETAIL: 'if_down_cmd' returned with success
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-3] LOCATION: wd_if.c:208
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-1] LOG: watchdog de-escalation process with pid: 22991 exit with SUCCESS.
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-2] LOCATION: watchdog.c:3053
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-1] LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-2] LOCATION: watchdog.c:6313
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-1] LOG: successfully joined the watchdog cluster as standby node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-2] DETAIL: our join coordinator request is accepted by cluster leader node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone"
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-3] LOCATION: watchdog.c:6107
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-1] LOG: new IPC connection received
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-2] LOCATION: watchdog.c:3210
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-1] LOG: we have joined the watchdog cluster as STANDBY node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-2] DETAIL: syncing the backend states from the MASTER watchdog node
```

Also please pay attention to messages like:

```
Apr 25 18:39:53 OAB-4f0e8fadf6f0 pgpool[8063]: [7535-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
```

There are a lot of such messages, and they follow the command:

```
Apr 25 18:38:02 OAB-4f0e8fadf6f0 pgpool[8063]: [7107-1] LOG: execute command: sudo /usr/local/pem/bin/pgha/pghactl.py do-failover --new-master-node-id=1 --new-master-host=b.db.node
```

The problem is that 'b.db.node' is a synonym of 'OAB-4f0e8fadf6f0', and its Postgres instance was promoted to master by the failover command above. Yet the messages state that it is unavailable, and DB clients face problems with DB connections at the same time. |
|
|
|
|
|
Attaching logs. |
|
|
Hi Muhammad, did the logs help you? Can you see the problem? |
|
|
I could also observe a case where a standby Pgpool stole the master status from the acting master Pgpool after the network was restored. Obviously this is because the standby self-promoted to master while the network was split. I think this can be solved logically by not performing voting when quorum does not exist within a single node. |
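The suggested rule - do not claim mastership from a minority partition - could be sketched like this (an illustration of the proposal, not actual pgpool-II code; `decide_role` is a hypothetical name):

```shell
# Hypothetical guard: a node may only claim MASTER when it sees a majority.
decide_role() {
  visible=$1
  total=$2
  if [ "$visible" -ge $(( total / 2 + 1 )) ]; then
    echo "MAY_CLAIM_MASTER"
  else
    echo "MUST_RESIGN"   # minority partition: resign instead of self-promoting
  fi
}
decide_role 1 3   # the isolated former master
decide_role 2 3   # the majority side keeps its elected master
```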
|
|
Muhammad, please see the attached file "no_failover.log". At 23:55:53 the Postgres on DB node B was killed (kill -9), and Pgpool was supposed to initiate the failover command. But this did not happen. The log was collected from the master Pgpool (pgpool-II version 3.6.2 (subaruboshi)), which was built from the HEAD of the V3_6_STABLE branch. |
|
|
no_failover.log |
|
|
Hi, the Pgpool-II logs "pgpool_fail_voting.tar" you sent earlier do not contain the correct pgpool log for head_pgpool, so it was hard to say what was happening in that case. However, I found a potential issue in the quorum calculation which can lead to the same behaviour. Can you please try the attached wd_quorum.diff patch against the current 3.6 HEAD and see whether it fixes the rightful master node selection after the cluster recovers from the split-brain? Note that it is a WIP patch intended to confirm or rule out the problem, and it needs to be applied on all Pgpool-II nodes participating in the test. Meanwhile I am also looking into the other problem, where the failover was not performed. Thanks |
|
|
Muhammad, I'll schedule automatic tests right now. The result will not come quickly, because the problem is not reproduced reliably. |
|
|
Sure, no problem, and many thanks for the testing. |
|
|
After 5 iterations of ifdown/ifup actions the problem was reproduced. Please see the attached log from the former master. As can be seen from the log, it was demoted to standby at 19:40:23, and later it self-promoted to master when the network was restored. I need to reconfigure the environment; when that is done, I'll share the logs from all 3 Pgpool instances so you can observe the complete picture for another case. |
|
|
Muhammad, I'm not sure why, but I could reproduce the error only after attaching PG backends to the Pgpool cluster. Before that I was unable to reproduce the problem over a long period with your fresh patch! Clean evidence of the problem is attached: logs + configuration files. The file names within the attached tarball are self-explanatory.

Environment before the "network split":
1) Head Pgpool
2) DB node A: Pgpool + PG
3) DB node B: Pgpool (master) + PG

The "ifdown" was initiated on DB node B at 22:58:50. Right after that, the Pgpool on DB node A was promoted to master - this is OK. A few seconds later the network was restored (ifup) on DB node B, and the logs confirm that Pgpool A was demoted and Pgpool B took back the master status. |
|
|
Hi supp_k, I have created a patch for the second part of the problem, where the failover was not performed by Pgpool-II. Can you please try the attached patch to see if it solves that problem? |
|
|
Hi Muhammad, I applied the fix to the HEAD of the 3_6_STABLE branch. The result will be provided later, because I want to test it carefully. |
|
|
Hi Muhammad, sorry for the long delay on my side. I'm not sure the patch is OK. But I see lots of changes in the repository - do they include the patch you provided? I think it is better if I continue verifying the latest HEAD of the V3_6_STABLE branch. |
|
|
Hi supp_k, yes, the changes in the patch are already committed to the repository. You can use the HEAD of the V3_6_STABLE branch to verify the fix. Thanks |
|
|
Muhammad, the previous standalone fix and the latest HEAD of the V3_6_STABLE branch work OK: in my test cases I could observe stable failover triggering. But I still observe the primary problem of incorrect split-brain resolution when the disconnected master reconnects to the set of Pgpools that have chosen a new master. |
|
|
I also observe a case where the split-brain resolution never ends. It happens when the network-disconnected master Pgpool instance reconnects to the set of Pgpools. The pcp_watchdog_info output:

```
[root@OAB-7976adbf5291 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone

OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone 5432 9000 4 MASTER
OAA-7976adbf5291.aqa.int.zone:5432 Linux OAA-7976adbf5291.aqa.int.zone 192.168.83.16 5432 9000 8 LOST
POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone 192.168.83.180 5432 9000 2 JOINING
```

The JOINING state never ends. Please see the attached pgpool log of the former master, which has reconnected. |
|
|
Never-ending voting |
|
|
Thanks for the input; I am looking into this one. |
|
|
Hi supp_k, is it possible for you to share the Pgpool-II log for the "POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone" node as well? |
|
|
Hi Muhammad, I have failed to reproduce the endless split-brain resolution case. But please find attached fresh logs of the incorrect split-brain resolution case anyway. |
|
|
Hi Muhammad, with the help of the attached logs, have you reproduced the issue with the incorrect split-brain resolution on your side? Do you need any additional help? |
|
|
Hi supp_k, thanks for your help. Yes, I was able to reproduce the issue once; it happens occasionally. I have made some adjustments to how a node handles and detects the split-brain, and so far the tests are passing on my setup. Can you please also try out the attached patch file bug_306.diff to see if it solves the issue? Also, while testing the scenario, please collect and share the complete Pgpool-II logs for all Pgpool-II nodes if possible. Thanks. Best regards, Muhammad Usama |
|
|
Hi Muhammad, which branch should I apply the fix to? |
|
|
3_6_STABLE or master; it should work with both. |
|
|
Environment:
1) Head slave Pgpool
2) DB node A + master Pgpool
3) DB node B + slave Pgpool

Sequence of actions:

- At 11:24:10 the network interface of DB node A was brought down (ifdown eth0).
- At Jun 8 11:24:11 the Pgpool on DB node B discovered that the master Pgpool had disappeared (see db_node_b_pgpool.log). After that the election process started, and it finished within several seconds. But the Pgpool cluster was not providing any connections to DB clients; you can see series of messages like:

```
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-2] DETAIL: health check timer expired
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-3] LOCATION: pool_connection_pool.c:567
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-1] ERROR: failed to make persistent db connection
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-2] DETAIL: connection to host:"b.db.node:15432" failed
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-3] LOCATION: child.c:1218
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-1] LOG: find_primary_node: checking backend no 0
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-1] LOG: find_primary_node: checking backend no 1
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [220-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
```

- At Jun 8 11:27:20 the network interface of DB node A was restored (ifup eth0), and you can see that the split-brain was detected.
- The re-election process was initiated. After the election, the new master Pgpool (DB node B) remained master - this is OK.

But unfortunately the Pgpool cluster is not operational - it does not provide connections to DB clients (see the log fragment):

```
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11970]: [38690-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-1] LOG: child process with pid: 11970 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-1] LOG: fork a new child process with pid: 12138
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-1] LOG: child process with pid: 11792 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-1] LOG: fork a new child process with pid: 12139
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39029-1] LOG: child process with pid:
```

All logs are attached. |
|
|
Hi, thanks for performing the test and providing the logs. I have analysed all the logs, and now both the master election after split-brain and the action taken to recover from split-brain seem to be working fine and as expected.

As for the second part, where Pgpool-II was not accepting any connections after the split-brain was recovered: that happened because it could not find any valid backend node at the time.

```
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool
```

This is not related to the watchdog; it happened because the Pgpool-II backend health check was not able to connect to the backends, so it failed over the backend nodes. See the following log extracts from db_node_b_pgpool.log.

a.db.node failure logs:

```
22659 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-1] LOG: failed to connect to PostgreSQL server on "a.db.node:15432" using INET socket
22660 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-2] DETAIL: select() system call failed with an error "Interrupted system call"
22661 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-3] LOCATION: pool_connection_pool.c:699
22662 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-1] ERROR: failed to make persistent db connection
22663 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-2] DETAIL: connection to host:"a.db.node:15432" failed
22664 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-3] LOCATION: child.c:1218
```

And b.db.node failure logs:

```
22785 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
22786 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-2] DETAIL: health check timer expired
22787 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-3] LOCATION: pool_connection_pool.c:567
22788 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-1] ERROR: failed to make persistent db connection
22789 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-2] DETAIL: connection to host:"b.db.node:15432" failed
22790 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-3] LOCATION: child.c:1218
```

So, since both nodes were detached from Pgpool-II because the health check advised it (the health check was not able to connect to the backends), after the failover of b.db.node Pgpool-II was not accepting any client connections, since it had no live PG backend attached.

Why Pgpool-II was not able to connect to the live PG nodes may need a separate investigation; it may have been caused by a network issue or some other reason, and the PostgreSQL log may be more helpful than the Pgpool-II log for that postmortem.

As a way forward, I am going to commit the patch (bug_306.diff, the one with which this test was performed), since it solves the originally reported problem, and close this bug report. We can analyse the reason for the health check not being able to connect to the backend PG node as a separate thread.

Thanks. Best regards, Muhammad Usama |
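On the health-check side discussed above, pgpool-II has retry parameters that keep a brief connectivity blip from detaching an otherwise healthy backend. A hedged pgpool.conf sketch with illustrative values (tune for your own environment; the parameter names are real pgpool-II settings, the values are assumptions):

```
# pgpool.conf - health check retries (illustrative values)
health_check_period      = 10   # seconds between health checks
health_check_timeout     = 20   # give up on a single probe after this long
health_check_max_retries = 3    # retries before declaring the backend down
health_check_retry_delay = 5    # seconds between retries
```

With retries configured, a backend is only failed over after repeated consecutive probe failures, which reduces the chance of the "all backend nodes are down" state seen here after a short partition.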
|
|
Hi Muhammad, I'm not sure whether it is related, but I've detected another instance of the split-brain. Say we have the environment:
1) Server A: Head Pgpool
2) Server B: Pgpool (master) + PostgreSQL (master)
3) Server C: Pgpool + PostgreSQL (sync slave)

Sequence of actions:
- Perform "ifdown eth0" on Server B; after a period of time Server C (which has a larger wd_priority) is promoted to master - this is OK.
- From the perspective of Server C, the PostgreSQL on Server B is marked as down - this is OK.
- From the perspective of Server B, the PostgreSQL on Server C is marked as down - this is OK.
- Restore the network on Server B by executing "ifup eth0"; Server B rejoins the Pgpool cluster as a slave - this is OK.

Issue: both PostgreSQL instances are marked as down within the Pgpool cluster (as confirmed by pcp_node_info). |
|
|
Hi Pengbo, why have you closed the issue? It is not resolved yet, and we are suffering. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-04-25 02:07 | supp_k | New Issue | |
| 2017-04-25 03:03 | supp_k | Note Added: 0001458 | |
| 2017-04-25 05:26 | Muhammad Usama | Note Added: 0001459 | |
| 2017-04-25 05:56 | supp_k | Note Added: 0001460 | |
| 2017-04-25 18:58 | supp_k | Note Added: 0001461 | |
| 2017-04-25 21:42 | supp_k | Note Added: 0001463 | |
| 2017-04-25 22:07 | Muhammad Usama | Note Added: 0001464 | |
| 2017-04-25 22:10 | supp_k | Note Added: 0001465 | |
| 2017-04-25 22:19 | supp_k | File Added: pgpool_conf.tar.gz | |
| 2017-04-25 22:19 | supp_k | Note Added: 0001467 | |
| 2017-04-25 22:24 | Muhammad Usama | Note Added: 0001468 | |
| 2017-04-26 00:20 | supp_k | Note Added: 0001469 | |
| 2017-04-26 00:59 | supp_k | Note Added: 0001470 | |
| 2017-04-26 01:01 | supp_k | File Added: pgpool_fail_voting.tar.gz | |
| 2017-04-26 01:02 | supp_k | File Added: pgpool_fail_voting.tar-2.gz | |
| 2017-04-26 01:02 | supp_k | Note Added: 0001471 | |
| 2017-04-27 02:20 | supp_k | Note Added: 0001476 | |
| 2017-04-27 04:06 | supp_k | Note Added: 0001477 | |
| 2017-04-27 06:08 | supp_k | Note Added: 0001478 | |
| 2017-04-27 06:08 | supp_k | File Added: no_failover.log | |
| 2017-04-27 06:08 | supp_k | Note Added: 0001479 | |
| 2017-04-27 08:02 | t-ishii | Assigned To | => Muhammad Usama |
| 2017-04-27 08:02 | t-ishii | Status | new => assigned |
| 2017-04-28 00:44 | Muhammad Usama | File Added: wd_quorum.diff | |
| 2017-04-28 00:44 | Muhammad Usama | Note Added: 0001481 | |
| 2017-04-28 01:03 | supp_k | Note Added: 0001482 | |
| 2017-04-28 01:28 | Muhammad Usama | Note Added: 0001483 | |
| 2017-04-28 01:50 | supp_k | File Added: log_failed_master.log | |
| 2017-04-28 01:50 | supp_k | Note Added: 0001484 | |
| 2017-04-28 05:22 | supp_k | File Added: pgpool_wd_issue.tar.gz | |
| 2017-04-28 05:22 | supp_k | Note Added: 0001485 | |
| 2017-05-04 01:00 | Muhammad Usama | File Added: wd_failover_not_performed.diff | |
| 2017-05-04 01:00 | Muhammad Usama | Note Added: 0001491 | |
| 2017-05-05 00:39 | supp_k | Note Added: 0001492 | |
| 2017-05-15 18:38 | supp_k | Note Added: 0001508 | |
| 2017-05-16 20:00 | Muhammad Usama | Note Added: 0001511 | |
| 2017-05-16 22:17 | supp_k | Note Added: 0001512 | |
| 2017-05-16 22:24 | supp_k | Note Added: 0001513 | |
| 2017-05-16 22:25 | supp_k | File Added: restored_pgpool_b.log | |
| 2017-05-16 22:25 | supp_k | Note Added: 0001514 | |
| 2017-05-17 03:26 | Muhammad Usama | Note Added: 0001515 | |
| 2017-05-17 21:58 | Muhammad Usama | Note Added: 0001517 | |
| 2017-05-19 00:35 | supp_k | File Added: split_brain_pgpool.tar.gz | |
| 2017-05-19 00:35 | supp_k | Note Added: 0001519 | |
| 2017-05-23 00:46 | supp_k | Note Added: 0001522 | |
| 2017-05-30 22:22 | Muhammad Usama | File Added: bug_306.diff | |
| 2017-05-30 22:22 | Muhammad Usama | Note Added: 0001525 | |
| 2017-05-31 03:52 | supp_k | Note Added: 0001527 | |
| 2017-05-31 17:02 | Muhammad Usama | Note Added: 0001528 | |
| 2017-06-08 18:17 | supp_k | File Added: pgpool_logs.tar.gz | |
| 2017-06-08 18:17 | supp_k | Note Added: 0001537 | |
| 2017-06-21 19:33 | Muhammad Usama | Note Added: 0001541 | |
| 2017-08-14 20:51 | supp_k | Note Added: 0001650 | |
| 2017-08-29 09:51 | pengbo | Status | assigned => closed |
| 2017-08-29 16:13 | supp_k | Status | closed => feedback |
| 2017-08-29 16:13 | supp_k | Resolution | open => reopened |
| 2017-08-29 16:13 | supp_k | Note Added: 0001687 |