View Issue Details

ID: 0000306
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2017-08-29 16:13
Reporter: supp_k
Assigned To: Muhammad Usama
Priority: high
Severity: major
Reproducibility: always
Status: feedback
Resolution: reopened
Platform: x86_64
OS: CentOS
OS Version: 7.X
Product Version: 3.6.2
Summary: 0000306: Pgpool steals back MASTER status
Description: When a master Pgpool instance that was disconnected from the other 2 nodes gets connected once again, it steals back the MASTER status. Thus the newly promoted master (elected because the previous one was disconnected) gets demoted to standby.

Steps To Reproduce: 3 pgpool instances:

Step 1:
1) A - standby
2) B - standby
3) C - master

Bring down the network interface of pgpool C (ifdown). The picture changes to:
1) A - standby
2) B - master <= it was elected because the Pgpool C 'disappeared'
3) C - master <= it is still MASTER!! But it should be demoted



Step 2: restore the network on server C. The result is the following:
1) A - standby
2) B - standby <= it was demoted! Why?
3) C - master <= it is MASTER again






Tags: No tags attached.

Activities

supp_k

2017-04-25 03:03

reporter   ~0001458

Instance C reports QUORUM_ABSENT when it gets disconnected from the other 2 Pgpool instances.

Muhammad Usama

2017-04-25 05:26

developer   ~0001459

What was your expectation in that case?
Do you think node B should remain as MASTER after node C joins back in, since B has the quorum while C does not?

supp_k

2017-04-25 05:56

reporter   ~0001460

Muhammad,
I checked out the HEAD of the V3_6 branch, and quick testing revealed no such problem, i.e. Pgpool B remains master while C turns into standby. It seems the issue was fixed by your recent commit.

Give me a couple of days and I will confirm that the problem has disappeared.

supp_k

2017-04-25 18:58

reporter   ~0001461

Muhammad,

yes, the problem is not reproduced on the top of the V3_6 branch.

supp_k

2017-04-25 21:42

reporter   ~0001463

I'm sorry for the previous quick answer, but the issue is reproduced - not often, but reproduced on the HEAD of the V3_6_STABLE branch.
Let's consider the environment:
1) Pgpool A (Standby)
2) Pgpool B (Master) + PG backend B (Master)
3) Pgpool C (Standby) + PG backend (Sync slave)

The details are the same - when the former Pgpool master gets connected to the remaining pgpool nodes (A & B), it steals the master status from the newly elected master (say, B). That might seem acceptable, but DB clients face a database connection shock once again because the VIP is relocated back to C.

And, which is very bad: at this moment the old Pgpool master (instance C) still considers PG backend B to be DOWN because it was separated by the network failure. When the connection is restored, it forwards all SQL to the PG on C. But this instance doesn't contain fresh data! During the period when Pgpool C was absent, the new Pgpool master on B was forwarding all SQL to its local PG instance, which had been promoted to master!

Muhammad Usama

2017-04-25 22:07

developer   ~0001464

First of all, many thanks for testing this and reporting the issue. The latest code on the HEAD of the V3_6_STABLE branch is supposed to consider the node escalation state before anything else while deciding the rightful master node, but apparently it's not behaving as it is supposed to in your test environment.
Can you confirm that all three nodes are using the code on the HEAD of V3_6_STABLE? Also, if possible, can you share the pgpool log around the time when the disconnected old master joins back the cluster and the situation occurs?
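The selection rule described here (comparing the nodes' states to decide the rightful master after a split-brain) can be sketched as a toy model. This is an illustration only, not pgpool-II's watchdog code; the `Node` fields and the exact tie-breaker order are assumptions for the example:

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    has_quorum: bool   # does this node currently see a majority of the cluster?
    escalated: bool    # did it complete escalation (e.g. acquired the VIP)?
    wd_priority: int   # configured watchdog priority

def resolve_split_brain(a: Node, b: Node) -> Node:
    """Pick the rightful master between two nodes that both claim MASTER.
    Quorum outranks everything; escalation state and priority break ties."""
    key = lambda n: (n.has_quorum, n.escalated, n.wd_priority)
    return max(a, b, key=key)

# Node B (with quorum) vs node C (isolated old master, no quorum):
b = Node("B", has_quorum=True, escalated=True, wd_priority=1)
c = Node("C", has_quorum=False, escalated=True, wd_priority=5)
print(resolve_split_brain(b, c).name)  # "B": C's claim fails for lack of quorum
```

Under this rule, the reconnecting node C could never "steal back" MASTER from B, which is the behaviour the reporter expects.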

supp_k

2017-04-25 22:10

reporter   ~0001465

> What was your expectation in that case?
> Do you think node B should remain as MASTER after node C joins back in, since B has the quorum while C does not?

Yes, exactly. If the MASTER instance loses quorum, then it should resign (or become a master candidate) - this is the concept of distributed voting. But as of now I see it remains MASTER despite the fact that it is disconnected from the other pgpool instances.
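The voting rule proposed above can be sketched as follows. This is a minimal illustration of the concept, not pgpool-II's implementation; the state names and the majority formula are assumptions:

```python
def has_quorum(alive_nodes: int, total_nodes: int) -> bool:
    """A node holds quorum only if it can see more than half of the cluster."""
    return alive_nodes > total_nodes // 2

def next_state(current_state: str, alive_nodes: int, total_nodes: int) -> str:
    """A MASTER that loses quorum resigns to a candidate role instead of
    staying MASTER while isolated from the rest of the cluster."""
    if current_state == "MASTER" and not has_quorum(alive_nodes, total_nodes):
        return "MASTER_CANDIDATE"  # resign; re-enter the election once quorum returns
    return current_state

# Isolated node C sees only itself out of 3 nodes, so it must resign:
print(next_state("MASTER", alive_nodes=1, total_nodes=3))  # MASTER_CANDIDATE
# Node B still sees 2 of 3 nodes, so it keeps MASTER:
print(next_state("MASTER", alive_nodes=2, total_nodes=3))  # MASTER
```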

##################################################################################
The 'pcp_watchdog_info' output for the disconnected C Pgpool instance:
[root@OAA-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone

OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone OAA-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone 192.168.68.164 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 8 LOST
##################################################################################
B Pgpool instance (newly elected Master):
[root@OAB-4f0e8fadf6f0 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 YES OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone

OAB-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAB-4f0e8fadf6f0.aqa.int.zone OAB-4f0e8fadf6f0.aqa.int.zone 5432 9000 4 MASTER
OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone 192.168.69.178 5432 9000 8 LOST
HEAD-4f0e8fadf6f0.aqa.int.zone:5432 Linux HEAD-4f0e8fadf6f0.aqa.int.zone 192.168.225.54 5432 9000 7 STANDBY
##################################################################################

supp_k

2017-04-25 22:19

reporter   ~0001467

Muhammad,

The version used was freshly built from the HEAD of V3_6_STABLE:
[root@OAA-4f0e8fadf6f0 pgha]# pgpool --version
pgpool-II version 3.6.2 (subaruboshi)

The configuration files for the 3 pgpool instances are also attached.
pgpool_conf.tar.gz (20,480 bytes)

Muhammad Usama

2017-04-25 22:24

developer   ~0001468

Thanks for the configuration files, but if you can share the pgpool log files, that would be really helpful.

supp_k

2017-04-26 00:20

reporter   ~0001469

Muhammad,

I'm trying to reproduce the issue. You probably won't believe me, but I have been failing to do so for more than an hour. As soon as I reproduce the problem, I'll share the logs!

supp_k

2017-04-26 00:59

reporter   ~0001470

Muhammad,

attached is the archive: pgpool_fail_voting.tar.gz.

Please look into the 'db_node_b.log' file.
1) At 18:37:50, DB node B detected that the Pgpool on "OAA-4f0e8fadf6f0.aqa.int.zone" was lost. Then it was promoted to MASTER. This is OK, because the network interface of the previous master on "OAA-4f0e8fadf6f0.aqa.int.zone" had been brought down (ifdown eth0).

2) At approximately 18:40:54 the network interface of the former master was restored, and I see the following sequence of events in the same log:

Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-1] LOG: new watchdog node connection is received from "192.168.69.178:22753"
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1548-2] LOCATION: watchdog.c:3178
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-1] LOG: new node joined the cluster hostname:"OAA-4f0e8fadf6f0.aqa.int.zone" port:9000 pgpool_port:5432
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1549-2] LOCATION: watchdog.c:1442
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-1] LOG: remote node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone" decided it is the true master
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-2] DETAIL: re-initializing the local watchdog cluster state because of split-brain
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1550-3] LOCATION: watchdog.c:3682
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-1] LOG: watchdog node state changed from [MASTER] to [JOINING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1551-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-1] LOG: watchdog: de-escalation started
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[22991]: [1551-2] LOCATION: wd_escalation.c:181
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-1] LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Apr 25 18:40:55 OAB-4f0e8fadf6f0 pgpool[8064]: [1552-2] LOCATION: watchdog.c:6313
Apr 25 18:40:55 OAB-4f0e8fadf6f0 su: (to postgres) root on none
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-1] LOG: successfully released the delegate IP:"192.168.52.96"
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-2] DETAIL: 'if_down_cmd' returned with success
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[22991]: [1552-3] LOCATION: wd_if.c:208
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-1] LOG: watchdog de-escalation process with pid: 22991 exit with SUCCESS.
Apr 25 18:40:56 OAB-4f0e8fadf6f0 pgpool[8064]: [1553-2] LOCATION: watchdog.c:3053
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-1] LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1554-2] LOCATION: watchdog.c:6313
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-1] LOG: successfully joined the watchdog cluster as standby node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-2] DETAIL: our join coordinator request is accepted by cluster leader node "OAA-4f0e8fadf6f0.aqa.int.zone:5432 Linux OAA-4f0e8fadf6f0.aqa.int.zone"
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1555-3] LOCATION: watchdog.c:6107
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-1] LOG: new IPC connection received
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8064]: [1556-2] LOCATION: watchdog.c:3210
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-1] LOG: we have joined the watchdog cluster as STANDBY node
Apr 25 18:40:57 OAB-4f0e8fadf6f0 pgpool[8063]: [8530-2] DETAIL: syncing the backend states from the MASTER watchdog node





Also please pay attention to messages like:
- Apr 25 18:39:53 OAB-4f0e8fadf6f0 pgpool[8063]: [7535-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket

There are a lot of such messages following the command 'Apr 25 18:38:02 OAB-4f0e8fadf6f0 pgpool[8063]: [7107-1] LOG: execute command: sudo /usr/local/pem/bin/pgha/pghactl.py do-failover --new-master-node-id=1 --new-master-host=b.db.node'

The problem is that 'b.db.node' is a synonym of 'OAB-4f0e8fadf6f0', and its Postgres instance was promoted to master by the mentioned failover command. Yet the message states it is unavailable. DB clients also face problems with DB connections at the same time.

supp_k

2017-04-26 01:01

reporter  

pgpool_fail_voting.tar.gz (10,926,080 bytes)

supp_k

2017-04-26 01:02

reporter   ~0001471

Attaching logs..
pgpool_fail_voting.tar-2.gz (10,926,080 bytes)

supp_k

2017-04-27 02:20

reporter   ~0001476

Hi Muhammad,

did the logs help you? Can you see a problem?

supp_k

2017-04-27 04:06

reporter   ~0001477

I could also observe a case where a standby Pgpool stole the master status from the acting master Pgpool after the network was restored.

Obviously this is because the standby was self-promoted to master when the network was split. I think logically it can be solved by not performing voting when quorum does not exist within a single node.

supp_k

2017-04-27 06:08

reporter   ~0001478

Muhammad,

please see the attached file "no_failover.log".
At "23:55:53" the Postgres on DB node B was killed (-9), and Pgpool was supposed to initiate the failover command. But this didn't happen.
The log was collected from the master Pgpool (pgpool-II version 3.6.2 (subaruboshi)), which was built from the HEAD of the V3_6_STABLE branch.

supp_k

2017-04-27 06:08

reporter   ~0001479

no_failover.log
no_failover.log (455,354 bytes)

Muhammad Usama

2017-04-28 00:44

developer   ~0001481

Hi,

The Pgpool-II logs "pgpool_fail_voting.tar" you sent earlier do not contain the correct pgpool log for head_pgpool, so it was hard to say what was happening in that case. But I found a potential issue in the quorum calculation which can lead to the same behaviour.

Can you please try the attached wd_quorum.diff patch against the current 3.6 head, to see if it fixes the issue of rightful master node selection after the cluster recovers from the split-brain? Note that it's a WIP patch to confirm/decline whether it solves the problem, and it needs to be applied on all pgpool-II nodes participating in the test.

Meanwhile I am also looking into the other problem, where the failover was not performed.

Thanks
wd_quorum.diff (6,597 bytes)   

supp_k

2017-04-28 01:03

reporter   ~0001482

Muhammad,
I'll schedule automatic tests right now. The result will not be provided quickly, because the problem is not reproduced reliably.

Muhammad Usama

2017-04-28 01:28

developer   ~0001483

Sure, no problem, and many thanks for the testing.

supp_k

2017-04-28 01:50

reporter   ~0001484

After 5 iterations of ifdown/ifup actions the problem was reproduced.
Please see the attached log from the former master. It was demoted to standby at 19:40:23, as can be seen from the log, and later it self-promoted to master when the network was restored.

I need to reconfigure the environment... when that is done, I'll share logs from the 3 pgpool instances and you will be able to observe the "complete picture" for another case.
log_failed_master.log (52,958 bytes)   

supp_k

2017-04-28 05:22

reporter   ~0001485

Muhammad,

I'm not sure, but I could reproduce the error only after attaching PG backends to the Pgpool cluster. Before that, I was unable to reproduce the problem over a long period with your fresh patch!

The captured evidence of the problem is attached: logs + configuration files. The names of the files within the attached tarball are self-explanatory.
=======================================================

Environment before the "network split":
1) Head Pgpool
2) DB Node A: Pgpool + PG
3) DB Node B: Pgpool (Master) + PG

The "ifdown" was initiated on DB node B at "22:58:50". Right after that, the Pgpool on DB node A was promoted to master - this is OK.

A few seconds later the network was restored (ifup) on DB node B, and now the logs confirm that Pgpool A was demoted and Pgpool B took back the master status.
pgpool_wd_issue.tar.gz (1,382,297 bytes)

Muhammad Usama

2017-05-04 01:00

developer   ~0001491

Hi Supp_K,

I have created a patch for the second part of the problem, where the failover was not performed by Pgpool-II. Can you please try the attached patch and check whether it solves that problem?
wd_failover_not_performed.diff (16,565 bytes)   

supp_k

2017-05-05 00:39

reporter   ~0001492

Hi Muhammad,

I applied the fix to the head version of the 3_6_stable branch. The result will be provided later because I want to test it carefully.

supp_k

2017-05-15 18:38

reporter   ~0001508

Hi Muhammad,
sorry for the long delay from my side. I'm not sure the patch is OK. But I see lots of changes in the repository. Do they include the patch you provided?
I think it is better if I continue verifying the latest HEAD of the V3_6_STABLE branch.

Muhammad Usama

2017-05-16 20:00

developer   ~0001511

Hi Supp_K,
Yes, the changes in the patch are already committed to the repository. You can use the V3_6_STABLE branch from HEAD to verify the fix.

Thanks

supp_k

2017-05-16 22:17

reporter   ~0001512

Muhammad,
the previous standalone fix and the latest HEAD of the V3_6_STABLE branch work OK. In my test cases I could observe stable failover triggering.
But I still observe the primary problem of incorrect split-brain resolution when the disconnected master connects again to the set of pgpools that have chosen the new master.

supp_k

2017-05-16 22:24

reporter   ~0001513

I also observe a case where the split-brain resolution never ends. It happens when the network-disconnected master pgpool instance connects again to the set of pgpools.

I see the pcp_watchdog_info output:
[root@OAB-7976adbf5291 pgha]# pcp_watchdog_info -h 127.0.0.1
Password:
3 NO OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone

OAB-7976adbf5291.aqa.int.zone:5432 Linux OAB-7976adbf5291.aqa.int.zone OAB-7976adbf5291.aqa.int.zone 5432 9000 4 MASTER
OAA-7976adbf5291.aqa.int.zone:5432 Linux OAA-7976adbf5291.aqa.int.zone 192.168.83.16 5432 9000 8 LOST
POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone 192.168.83.180 5432 9000 2 JOINING


It never ends. Please see the attached pgpool log of the former master, which has reconnected.

supp_k

2017-05-16 22:25

reporter   ~0001514

Never ending voting
restored_pgpool_b.log (3,727,982 bytes)

Muhammad Usama

2017-05-17 03:26

developer   ~0001515

Thanks for the input I am looking into this one.

Muhammad Usama

2017-05-17 21:58

developer   ~0001517

Hi Supp_k,

Is it possible for you to share the pgpool-II log for the "POAMN-7976adbf5291.aqa.int.zone:5432 Linux POAMN-7976adbf5291.aqa.int.zone" node as well?

supp_k

2017-05-19 00:35

reporter   ~0001519

Hi Muhammad,

I failed to reproduce the endless split-brain resolution case. But in any case, please find attached fresh logs of the incorrect split-brain resolution case.
split_brain_pgpool.tar.gz (183,073 bytes)

supp_k

2017-05-23 00:46

reporter   ~0001522

Hi Muhammad,

with the help of the attached logs, have you reproduced the issue with the incorrect split-brain resolution on your side? Do you need any additional help?

Muhammad Usama

2017-05-30 22:22

developer   ~0001525

Hi supp_k,

Thanks for your help, and yes, I was able to reproduce the issue once; it happens occasionally.
I have made some adjustments to how a node handles and detects the split-brain, and so far the tests are passing on my setup. Can you please also try out the attached patch file bug_306.diff to see if it solves the issue?

Also while testing the scenario if possible please collect and share the complete Pgpool-II logs for all Pgpool-II nodes.

Thanks
Best regards
Muhammad usama
bug_306.diff (4,365 bytes)   

supp_k

2017-05-31 03:52

reporter   ~0001527

Hi Muhammad,

which branch should I apply the fix to?

Muhammad Usama

2017-05-31 17:02

developer   ~0001528

3_6_STABLE or master. It should work with both.

supp_k

2017-06-08 18:17

reporter   ~0001537

Environment:
1) Head Slave Pgpool
2) DB node A + Master Pgpool
3) DB node B + Slave Pgpool

Sequence of actions:
- At 11:24:10 the network interface of DB node A was brought down (ifdown eth0).
- At Jun 8 11:24:11 the Pgpool on DB node B discovered that the master Pgpool had disappeared (see db_node_b_pgpool.log).
After that, the election process started and ended within several seconds. But the pgpool cluster was not providing any connections to DB clients. You can see series of messages like:
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-2] DETAIL: health check timer expired
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [216-3] LOCATION: pool_connection_pool.c:567
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-1] ERROR: failed to make persistent db connection
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-2] DETAIL: connection to host:"b.db.node:15432" failed
Jun 8 11:25:24 POADBNodeB-fca51b7864ff pgpool[24330]: [217-3] LOCATION: child.c:1218
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-1] LOG: find_primary_node: checking backend no 0
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [218-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-1] LOG: find_primary_node: checking backend no 1
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [219-2] LOCATION: pgpool_main.c:2970
Jun 8 11:25:25 POADBNodeB-fca51b7864ff pgpool[24330]: [220-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket


- At Jun 8 11:27:20 the network interface of DB node A was restored (ifup eth0), and you can see that the split brain was detected.
- The re-election process was initiated. After the election, the new master Pgpool (DB node B) remained master - this is OK. But unfortunately the Pgpool cluster is not operational - it doesn't provide connections to DB clients (see the log fragment):

Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11970]: [38690-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-1] LOG: child process with pid: 11970 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39025-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-1] LOG: fork a new child process with pid: 12138
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39026-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11792]: [38334-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-1] LOG: child process with pid: 11792 exits with status 256
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39027-2] LOCATION: pgpool_main.c:2460
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-1] LOG: fork a new child process with pid: 12139
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39028-2] LOCATION: pgpool_main.c:2546
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-4] LOCATION: child.c:2130
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[24330]: [39029-1] LOG: child process with pid:


All logs are attached.
pgpool_logs.tar.gz (3,863,657 bytes)

Muhammad Usama

2017-06-21 19:33

developer   ~0001541

Hi

Thanks for performing the test and providing the logs. I have analysed all the logs, and now the master election after split-brain, as well as the time taken to recover from split-brain, seems to be working fine and as expected.
Now for the second part: after the split-brain was recovered, Pgpool-II was not accepting any connections because it couldn't find any valid backend node at that time.

Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-1] FATAL: pgpool is not accepting any new connections
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-2] DETAIL: all backend nodes are down, pgpool requires at least one valid node
Jun 8 11:28:53 POADBNodeB-fca51b7864ff pgpool[11971]: [38692-3] HINT: repair the backend nodes and restart pgpool

is not related to the watchdog; it happened because the Pgpool-II backend health check was not able to connect to the backends, so it failed over the backend nodes.

See the following log message extracts from db_node_b_pgpool.log

a.db.node failure logs
=================
22659 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-1] LOG: failed to connect to PostgreSQL server on "a.db.node:15432" using INET socket
22660 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-2] DETAIL: select() system call failed with an error "Interrupted system call"
22661 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [27-3] LOCATION: pool_connection_pool.c:699
22662 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-1] ERROR: failed to make persistent db connection
22663 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-2] DETAIL: connection to host:"a.db.node:15432" failed\
22664 Jun 8 11:24:33 POADBNodeB-fca51b7864ff pgpool[24330]: [28-3] LOCATION: child.c:1218

And b.db.node failure logs
======================
22785 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-1] LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
22786 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-2] DETAIL: health check timer expired
22787 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [36-3] LOCATION: pool_connection_pool.c:567
22788 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-1] ERROR: failed to make persistent db connection
22789 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-2] DETAIL: connection to host:"b.db.node:15432" failed
22790 Jun 8 11:24:39 POADBNodeB-fca51b7864ff pgpool[24330]: [37-3] LOCATION: child.c:1218

So both nodes were detached from Pgpool-II because the pgpool health check advised it to do so (the health check was not able to connect to the backends). After the failover of b.db.node, Pgpool-II was not accepting any client connections, since it had no alive PG backend attached.

Now, the reason why pgpool-II was not able to connect to alive PG nodes may need a separate investigation; it may have been caused by some network issue or some other reason, and the PostgreSQL log may be more helpful than the pgpool-II log for the postmortem.

So, as a way forward, I am going to commit the patch (bug_306.diff, the one with which this test was performed), since it solves the originally reported problem, and close this bug report.
We can analyse the reason for the health check not being able to connect to the backend PG node in a separate thread.
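The failure chain described above (health check cannot connect, backend is detached, no alive backends remain, pgpool refuses new connections) can be illustrated with a toy model. The function and state names here are assumptions for illustration, not pgpool-II's actual code:

```python
def run_health_checks(backends: dict, reachable: set) -> dict:
    """Detach (mark down) any backend whose health check cannot connect."""
    for name in backends:
        if name not in reachable:
            backends[name] = "down"  # failover: detach the unreachable backend
    return backends

def accepting_connections(backends: dict) -> bool:
    """pgpool requires at least one valid (up) backend node to serve clients."""
    return any(state == "up" for state in backends.values())

backends = {"a.db.node": "up", "b.db.node": "up"}
# The health check cannot reach either backend, as in the reported logs:
run_health_checks(backends, reachable=set())
print(accepting_connections(backends))
# False -> corresponds to "pgpool is not accepting any new connections"
```

This mirrors the log sequence: both a.db.node and b.db.node failed their health checks, were detached, and every child process then reported "all backend nodes are down".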

Thanks
Best Regards
Muhammad Usama

supp_k

2017-08-14 20:51

reporter   ~0001650

Hi Muhammad,

I'm not sure whether it is correct, but I've detected another instance of the split brain.
Say we have an environment:
1) Server A:
- Head Pgpool
2) Server B:
- Pgpool (Master) + PostgreSQL (Master)
3) Server C:
- Pgpool + PostgreSQL (Sync Slave)

Sequence of actions:
- Perform "ifdown eth0" on Server B; after a period of time Server C (it has a larger wd_priority) is promoted to master - it is OK.
- From the perspective of Server C, the PostgreSQL on Server B is marked as down - it is OK.
- From the perspective of Server B, the PostgreSQL on Server C is marked as down - it is OK.
- Restore the network on Server B by executing "ifup eth0";
   Server B rejoins the pgpool cluster as a slave - it is OK.

Issue: both PostgreSQL instances are marked as down within the Pgpool cluster (as confirmed by pcp_node_info).

supp_k

2017-08-29 16:13

reporter   ~0001687

Hi Pengbo,

why have you closed the issue? It is not resolved yet, and we are still suffering from it.

Issue History

Date Modified Username Field Change
2017-04-25 02:07 supp_k New Issue
2017-04-25 03:03 supp_k Note Added: 0001458
2017-04-25 05:26 Muhammad Usama Note Added: 0001459
2017-04-25 05:56 supp_k Note Added: 0001460
2017-04-25 18:58 supp_k Note Added: 0001461
2017-04-25 21:42 supp_k Note Added: 0001463
2017-04-25 22:07 Muhammad Usama Note Added: 0001464
2017-04-25 22:10 supp_k Note Added: 0001465
2017-04-25 22:19 supp_k File Added: pgpool_conf.tar.gz
2017-04-25 22:19 supp_k Note Added: 0001467
2017-04-25 22:24 Muhammad Usama Note Added: 0001468
2017-04-26 00:20 supp_k Note Added: 0001469
2017-04-26 00:59 supp_k Note Added: 0001470
2017-04-26 01:01 supp_k File Added: pgpool_fail_voting.tar.gz
2017-04-26 01:02 supp_k File Added: pgpool_fail_voting.tar-2.gz
2017-04-26 01:02 supp_k Note Added: 0001471
2017-04-27 02:20 supp_k Note Added: 0001476
2017-04-27 04:06 supp_k Note Added: 0001477
2017-04-27 06:08 supp_k Note Added: 0001478
2017-04-27 06:08 supp_k File Added: no_failover.log
2017-04-27 06:08 supp_k Note Added: 0001479
2017-04-27 08:02 t-ishii Assigned To => Muhammad Usama
2017-04-27 08:02 t-ishii Status new => assigned
2017-04-28 00:44 Muhammad Usama File Added: wd_quorum.diff
2017-04-28 00:44 Muhammad Usama Note Added: 0001481
2017-04-28 01:03 supp_k Note Added: 0001482
2017-04-28 01:28 Muhammad Usama Note Added: 0001483
2017-04-28 01:50 supp_k File Added: log_failed_master.log
2017-04-28 01:50 supp_k Note Added: 0001484
2017-04-28 05:22 supp_k File Added: pgpool_wd_issue.tar.gz
2017-04-28 05:22 supp_k Note Added: 0001485
2017-05-04 01:00 Muhammad Usama File Added: wd_failover_not_performed.diff
2017-05-04 01:00 Muhammad Usama Note Added: 0001491
2017-05-05 00:39 supp_k Note Added: 0001492
2017-05-15 18:38 supp_k Note Added: 0001508
2017-05-16 20:00 Muhammad Usama Note Added: 0001511
2017-05-16 22:17 supp_k Note Added: 0001512
2017-05-16 22:24 supp_k Note Added: 0001513
2017-05-16 22:25 supp_k File Added: restored_pgpool_b.log
2017-05-16 22:25 supp_k Note Added: 0001514
2017-05-17 03:26 Muhammad Usama Note Added: 0001515
2017-05-17 21:58 Muhammad Usama Note Added: 0001517
2017-05-19 00:35 supp_k File Added: split_brain_pgpool.tar.gz
2017-05-19 00:35 supp_k Note Added: 0001519
2017-05-23 00:46 supp_k Note Added: 0001522
2017-05-30 22:22 Muhammad Usama File Added: bug_306.diff
2017-05-30 22:22 Muhammad Usama Note Added: 0001525
2017-05-31 03:52 supp_k Note Added: 0001527
2017-05-31 17:02 Muhammad Usama Note Added: 0001528
2017-06-08 18:17 supp_k File Added: pgpool_logs.tar.gz
2017-06-08 18:17 supp_k Note Added: 0001537
2017-06-21 19:33 Muhammad Usama Note Added: 0001541
2017-08-14 20:51 supp_k Note Added: 0001650
2017-08-29 09:51 pengbo Status assigned => closed
2017-08-29 16:13 supp_k Status closed => feedback
2017-08-29 16:13 supp_k Resolution open => reopened
2017-08-29 16:13 supp_k Note Added: 0001687