View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000547 | Pgpool-II | Bug | public | 2019-09-12 16:37 | 2019-10-31 18:41 |
| Reporter | harukat | Assigned To | Muhammad Usama | ||
| Priority | normal | Severity | major | Reproducibility | always |
| Status | assigned | Resolution | open | ||
| Product Version | 3.7.11 | ||||
| Target Version | 3.7.12 | Fixed in Version | 3.7.12 | ||
| Summary | 0000547: We need to do arping again after recovering from split-brain. | ||||
| Description | Pgpool-II should do arping again (or run wd_IP_down() and wd_IP_up() again) on the master/coordinator node after recovering from split-brain. In the following scenario, Pgpool-II doesn't work well. Scenario: 1. There are watchdog cluster nodes n1, n2, and n3. n3 is master/coordinator, and the VIP is set on n3. 2. Network trouble occurs. n1 and n2 decide that n3 is down. n2 becomes the new master/coordinator, and the VIP is set on n2 with arping. 3. The network recovers. n1, n2, and n3 notice the split-brain status. They decide n3 is the best master/coordinator. n2 resigns as master/coordinator, and the VIP is released on n2. 4. The VIP has been set on n3, but client ARP tables still point the VIP at n2. | ||||
| Tags | No tags attached. | ||||
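For context, the VIP handover discussed in this report is driven by the watchdog escalation commands in pgpool.conf. A minimal sketch of the relevant settings (the address, netmask, and interface name are illustrative, not taken from this report):

```
# pgpool.conf (watchdog section) -- illustrative values
delegate_IP = '10.10.10.152'                             # the VIP
if_up_cmd   = 'ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = 'ip addr del $_IP_$/24 dev eth0'
arping_cmd  = 'arping -U $_IP_$ -w 1'                    # gratuitous ARP after VIP up
```

The `arping_cmd` is what refreshes client ARP caches after escalation; the bug report is about the cases where escalation (and therefore arping) is not re-run after split-brain recovery.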
|
|
According to Usama's explanation in a similar case:
https://www.pgpool.net/pipermail/pgpool-general/2019-August/006733.html

After "3. Network recovers."

> n1, n2, and n3 notice split-brain status.
> They decide n3 is the best master/coordinator.
> n2 resigns as master/coordinator. The VIP is released on n2.

n2 should not have resigned as master in the first place. Since the other case upthread of the URL above looks quite similar, I guess there's something wrong in the watchdog code handling the case when the former master node comes back. So, regarding:

> We need to do arping again after recovering from split-brain.

the point is not there. |
|
|
In the above-explained scenario, as per the current design, the watchdog should behave as follows:

1. There are watchdog cluster nodes n1, n2, and n3. n3 is master/coordinator. The VIP is set on n3.
2. Network trouble occurs. n1 and n2 decide that n3 is down. n2 becomes the new master/coordinator, and the VIP is set on n2 with arping.
2-b. At the same time, n3 should perform de-escalation, since it has lost the quorum. Even if n3 stays as master, it should remove the VIP because of losing the quorum.
3. The network recovers. n1, n2, and n3 notice the split-brain status. They decide n3 is the best master/coordinator. n2 resigns as master/coordinator, and the VIP is released on n2.
4. Even in this case, if n3 is chosen as the best master, n3 will perform the escalation again, because it had already performed the de-escalation when the quorum was lost. And when the escalation is performed again after recovery from split-brain, the arping step is performed again.

Now, if step 4 of the original scenario ("VIP has been set on n3, but ARP table says n2 is VIP target yet") is somehow happening, that means the watchdog did not perform the de-escalation when the quorum was lost. If that is the case, can you please provide me the log files from when that happened? |
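The quorum rule underlying steps 2-b and 4 can be sketched as follows. This is a simplified model for illustration, not the actual watchdog code:

```python
def has_quorum(reachable: int, total: int) -> bool:
    """A node holds quorum when it can reach a strict majority of the
    cluster, counting itself (simplified model of the watchdog rule)."""
    return reachable > total // 2

# Step 2-b: an isolated n3 counts only itself out of 3 nodes, so it must
# de-escalate and release the VIP.
print(has_quorum(1, 3))   # False

# Steps 2 and 4: n1/n2 reach 2 of 3 nodes, so that side holds quorum and
# n2 may escalate (acquire the VIP with arping).
print(has_quorum(2, 3))   # True
```

Under this rule, at most one side of any partition of an odd-sized cluster can hold quorum, which is why re-escalation on the winning side should always re-run arping.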
|
|
This is the log for this issue. The IP addresses and host names are modified from our customer's originals. n3 had been master/coordinator with the VIP for a long time. When n1 and n2 decided that n3 was lost, n3 didn't log any watchdog errors. |
|
|
Our customer says: In our case, n3 did not seem to know the quorum was lost... for most of that network partition, n3 could *see* n1 and n2, but n1 and n2 could not *see* n3. I'm not certain whether n3 had a way to know the quorum was lost during the network partition. We do know with certainty that the end result was two production outages within 3 days. |
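The customer is describing an asymmetric partition: each node counts only the peers it can hear, so the two sides can disagree about quorum. A toy illustration of that view (a hypothetical reachability model, not pgpool code):

```python
# reachable[a][b] == True means a's watchdog still hears heartbeats from b.
# Asymmetric failure: n3 still hears n1 and n2, but they no longer hear n3.
reachable = {
    "n1": {"n1": True, "n2": True, "n3": False},
    "n2": {"n1": True, "n2": True, "n3": False},
    "n3": {"n1": True, "n2": True, "n3": True},
}

def counted_alive(node: str) -> int:
    """How many nodes this node believes are alive, counting itself."""
    return sum(reachable[node].values())

# n3 counts all three nodes alive, so it never notices quorum loss and
# keeps the VIP; meanwhile n1 and n2 each count only two nodes, hold a
# majority on their side, and elect n2 as the new master. The result is
# two VIP holders -- consistent with the reported outages.
print(counted_alive("n3"))   # 3
print(counted_alive("n1"))   # 2
```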
|
|
This is a simple patch for V3_7_STABLE to do arping after recovering from split-brain. It passed the regression test. |
|
|
The regression suite does not do any VIP/arp test. That means passing the regression test says nothing about whether your patch is OK as far as the VIP/arp issue is concerned. |
|
|
Have you actually tested the patch with the case described above? |
|
|
No, I haven't recreated the reported situation. |
|
|
We do not accept untested patches. |
|
|
Hi Harukat,

First of all, sorry for the delayed response, and thanks a lot for providing the patch and log files. The log file contained the messages from all three nodes and was quite huge, so it took me a while to understand the issue.

Anyhow, after reviewing the log files I realized that there was confusion in the watchdog code about how to deal with life-check failure scenarios, especially the cases where life-check reports a node failure while the watchdog core is still able to communicate with the remote nodes, and also the case where node A's life-check reports node B as lost while B still thinks A is alive and healthy. So I have reviewed the whole watchdog design around the life-check reports and have made some fixes in that area in the attached patch.

You can try the attached patch, test your scenario, and see if you still end up in the same situation as described in the initial bug report. The patch is generated against the current MASTER branch; I will commit it after a little more testing and then backport it to all supported branches.

Finally, the original idea in your patch and bug report, doing arping after recovering from the split-brain, seems reasonable, but your patch needs a little more thought on the design, since executing the wd_arping function from the watchdog main process is not the right thing to do. Effectively, after my patch (attached) you should never end up in a situation where multiple watchdog nodes have performed the escalation, provided you are using an odd number of Pgpool-II nodes. The arping solution you suggested should only be needed when the watchdog cluster is configured with an even total number of nodes and a network partition divides the network in such a way that both partitions get exactly half the nodes each.

Thanks
Best regards
Muhammad Usama |
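Usama's last point, that the even-node half/half split is the one remaining case for the arping idea, can be illustrated with a small model. This is a simplified sketch, not pgpool code; the half-votes option mirrors a configurable behavior in later pgpool versions:

```python
def quorum_exists(reachable: int, total: int, allow_half: bool = False) -> bool:
    """Simplified quorum test: strict majority, or exactly half the nodes
    when allow_half is set (hypothetical flag for illustration)."""
    if allow_half and total % 2 == 0 and reachable == total // 2:
        return True
    return reachable > total // 2

# A 4-node cluster partitioned 2-2: with strict majority, neither side has
# quorum and nobody escalates; with half votes allowed, BOTH partitions
# claim quorum and both escalate, leaving two VIP holders until recovery --
# the case harukat's arping-after-recovery idea addresses.
strict = [quorum_exists(2, 4) for _ in range(2)]
half   = [quorum_exists(2, 4, allow_half=True) for _ in range(2)]
print(strict)   # [False, False]
print(half)     # [True, True]
```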
|
|
Usama, after applying the patch, one of the watchdog regression tests failed: testing 011.watchdog_quorum_failover...failed. The regression log is attached. |
|
|
Also pgpool.log attached. |
|
|
Hi Ishii-San,

I am looking into the attached logs. Somehow the regression always passes on my machine :-)

testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok. |
|
|
From the look of it, it seems like the setup of the PostgreSQL server is failing the test case. See line 23 in the attached 011.watchdog_quorum_failover:

23 recovery node 1...ERROR: executing recovery, execution of command failed at "1st stage"
24 DETAIL: command:"basebackup.sh"

But I can't figure out what the reason for that could be. |
|
|
Oops. You are right. I had broken my pgpool setup while migrating to PostgreSQL 12. Sorry for the noise. |
|
|
Thanks, Usama. I'll test the patch in the V4_1_STABLE environment. It looks like a big change. Should I not expect a fix for the 3.7.x version? |
|
|
Once you verify the fix, I will backport it to all supported branches, including 3.7. Thanks |
|
|
I did a test that runs Pgpool-II 4.1.x nodes in an artificially unstable network by executing the attached script. This "make_splitbrain.sh" script randomly blocks the watchdog communication on a node while it is the master node. In spite of the severe network environment, the patched Pgpool-II nodes ran well for a long time (over 1 day) in most cases. They often recovered from split-brain status successfully.

I also got a failed case. Its log is attached in the "test_20191007.tgz" file. I'm sorry, it may be difficult to read because it includes Japanese-locale output. In that case, I needed to execute arping manually in order to use Pgpool-II via the VIP. Its log says at the end:

Oct 7 16:19:00 cp2 [16456]: [2889-1] 2019-10-07 16:19:00: pid 16456: LOG: successfully acquired the delegate IP:"10.10.10.152"
Oct 7 16:19:00 cp2 [16456]: [2889-2] 2019-10-07 16:19:00: pid 16456: DETAIL: 'if_up_cmd' returned with success

But I could not access Pgpool-II on the cp2 host via the VIP without arping. I couldn't find the cause. |
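For readers who want to reproduce this kind of test, the core of such an injector can be approximated with iptables. This is a hypothetical sketch, not the attached make_splitbrain.sh; the watchdog port value and the DRY_RUN guard are assumptions, and the real script's "only while master" check is omitted:

```shell
#!/bin/bash
# Hypothetical split-brain injector sketch: block watchdog traffic for a
# random interval, then restore it. NOT the attached make_splitbrain.sh.
WD_PORT=9000            # pgpool watchdog port (wd_port); assumed value
DRY_RUN=${DRY_RUN:-1}   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

block_secs=$(( (RANDOM % 20) + 5 ))   # block for 5..24 seconds
run iptables -A INPUT -p tcp --dport "$WD_PORT" -j DROP
run sleep "$block_secs"
run iptables -D INPUT -p tcp --dport "$WD_PORT" -j DROP
```

Run it with DRY_RUN=0 (as root) to actually inject the fault; looping it with random pauses approximates the "severe network environment" described above.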
|
|
Hi Harukat,

First of all, many thanks for doing thorough testing. I have gone through all the attached logs, and it seems that watchdog is behaving correctly. Despite some flooding of log messages, I can't see any issue, at least in the logs.

As for the problem you described, that you had to do arping on cp2 after the escalation that happened at 'Oct 7 16:19:00': I believe it is not because of anything done wrong by Pgpool-II or watchdog (at least nothing in the logs suggests anything wrong with watchdog or pgpool that could have caused it). If you look at the logs of all three nodes around that time, you can see that only one node, cp2, performed the escalation and brought up the VIP, and no other node acquired the VIP after that time. And since the pgpool escalation process does perform the arping and ping after acquiring the VIP, I am guessing some external factor might be involved, because nothing in the logs points to any situation or issue that could require a manual arping.

I am not sure, but I wonder if it could be caused by the nature of the test case, as it was causing very frequent escalations and de-escalations. Do you think it is possible that a network switch or VM host might have played a role in it? For example, the last de-escalation on cp2 happened at "7 16:18:45" (VIP released by cp2) and the new escalation started at "7 16:18:56" (VIP acquired again), so there is just a 10-second gap between the release and the re-acquisition of the VIP on cp2. I am just thinking out loud: what if the ARP tables on the client machine got it wrong because of these frequent updates, and the client machine somehow still didn't receive the new VIP record? Though I still think that might not be the case, and some other external element caused it. |
|
|
Thank you for confirming my report. If you see nothing in the Pgpool-II code that would explain the failed case, I also think some external element caused it, because the log suggests the code must have done arping last. Moreover, my test script could start dropping packets on the VIP-holder host immediately after its escalation, without any stable period. It can be considered a kind of double fault, so I think it's OK that the patched Pgpool-II code doesn't cover such a case. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-09-12 16:37 | harukat | New Issue | |
| 2019-09-12 16:52 | t-ishii | Assigned To | => Muhammad Usama |
| 2019-09-12 16:52 | t-ishii | Status | new => assigned |
| 2019-09-13 15:57 | t-ishii | Note Added: 0002845 | |
| 2019-09-13 17:07 | t-ishii | Note Edited: 0002845 | |
| 2019-09-13 21:57 | Muhammad Usama | Note Added: 0002847 | |
| 2019-09-13 21:58 | Muhammad Usama | Note Edited: 0002847 | |
| 2019-09-17 11:10 | harukat | File Added: lost_arping_case.log | |
| 2019-09-17 11:10 | harukat | Note Added: 0002855 | |
| 2019-09-25 13:03 | harukat | Note Added: 0002879 | |
| 2019-09-27 12:19 | harukat | File Added: pgpool2_V3_7_STSBLE_arping_again.patch | |
| 2019-09-27 12:19 | harukat | Note Added: 0002885 | |
| 2019-09-27 12:32 | t-ishii | Note Added: 0002886 | |
| 2019-09-27 13:14 | t-ishii | Note Added: 0002887 | |
| 2019-10-01 18:10 | harukat | Note Added: 0002896 | |
| 2019-10-02 09:36 | t-ishii | Note Added: 0002898 | |
| 2019-10-02 23:07 | Muhammad Usama | File Added: watchdog_node_lost_fix.diff | |
| 2019-10-02 23:07 | Muhammad Usama | Note Added: 0002899 | |
| 2019-10-02 23:28 | t-ishii | File Added: 011.watchdog_quorum_failover | |
| 2019-10-02 23:28 | t-ishii | Note Added: 0002900 | |
| 2019-10-02 23:32 | t-ishii | File Added: pgpool-logs.tar.gz | |
| 2019-10-02 23:32 | t-ishii | Note Added: 0002902 | |
| 2019-10-03 00:19 | Muhammad Usama | Note Added: 0002903 | |
| 2019-10-03 00:33 | Muhammad Usama | Note Added: 0002904 | |
| 2019-10-03 07:13 | t-ishii | Note Added: 0002905 | |
| 2019-10-03 13:33 | harukat | Note Added: 0002906 | |
| 2019-10-03 18:19 | Muhammad Usama | Note Added: 0002907 | |
| 2019-10-08 13:31 | harukat | File Added: make_splitbrain.sh | |
| 2019-10-08 13:31 | harukat | File Added: test_20191007.tgz | |
| 2019-10-08 13:31 | harukat | Note Added: 0002912 | |
| 2019-10-08 13:32 | harukat | Note Edited: 0002912 | |
| 2019-10-11 00:19 | Muhammad Usama | Note Added: 0002918 | |
| 2019-10-15 11:54 | harukat | Note Added: 0002927 | |
| 2019-10-31 18:41 | administrator | Fixed in Version | => 3.7.12 |
| 2019-10-31 18:41 | administrator | Target Version | => 3.7.12 |