View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000547 | Pgpool-II | Bug | public | 2019-09-12 16:37 | 2019-10-31 18:41 |
| Reporter | harukat | Assigned To | Muhammad Usama | ||
| Priority | normal | Severity | major | Reproducibility | always |
| Status | assigned | Resolution | open | ||
| Product Version | 3.7.11 | ||||
| Target Version | 3.7.12 | Fixed in Version | 3.7.12 | ||
| Summary | 0000547: We need to do arping again after recovering from split-brain. | ||||
| Description | Pgpool-II should do arping again (or run wd_IP_down() and wd_IP_up() again) on the master/coordinator node after recovering from split-brain. In the following scenario, Pgpool-II doesn't work well. Scenario: 1. There are watchdog cluster nodes n1, n2, and n3. n3 is master/coordinator, and the VIP is set on n3. 2. Network trouble occurs. n1 and n2 decide that n3 is down. n2 becomes the new master/coordinator, and the VIP is set on n2 with arping. 3. The network recovers. n1, n2, and n3 notice the split-brain status. They decide n3 is the best master/coordinator. n2 resigns as master/coordinator, and the VIP is released on n2. 4. The VIP has been set on n3, but client ARP tables still point the VIP at n2. | ||||
| Tags | No tags attached. | ||||
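For context, the VIP handover discussed in this report is driven by the watchdog escalation commands in pgpool.conf. A minimal sketch of the relevant settings (the address, netmask, and interface name are illustrative, not taken from this report):

```
# pgpool.conf (watchdog section) -- illustrative values
delegate_IP = '10.10.10.152'                             # the VIP
if_up_cmd   = 'ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = 'ip addr del $_IP_$/24 dev eth0'
arping_cmd  = 'arping -U $_IP_$ -w 1'                    # gratuitous ARP after VIP up
```

The `arping_cmd` is what refreshes client ARP caches after escalation; the bug report is about the cases where escalation (and therefore arping) is not re-run after split-brain recovery.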
|
|
According to Usama's explanation in a similar case:
https://www.pgpool.net/pipermail/pgpool-general/2019-August/006733.html

After "3. Network recovers."

> n1, n2, and n3 notice split-brain status.
> They decide n3 is the best master/coordinator.
> n2 resigns as master/coordinator. The VIP is released on n2.

n2 should not have resigned as master in the first place. Since the other case upthread of the URL above looks quite similar, I guess there's something wrong in the watchdog code handling the case when the former master node comes back. So, regarding:

> We need to do arping again after recovering from split-brain.

the point is not there. |
|
|
In the above-explained scenario, as per the current design, the watchdog should behave as follows:

1. There are watchdog cluster nodes n1, n2, and n3. n3 is master/coordinator. The VIP is set on n3.
2. Network trouble occurs. n1 and n2 decide that n3 is down. n2 becomes the new master/coordinator, and the VIP is set on n2 with arping.
2-b. At the same time, n3 should perform de-escalation, since it has lost the quorum. Even if n3 stays as master, it should remove the VIP because of losing the quorum.
3. The network recovers. n1, n2, and n3 notice the split-brain status. They decide n3 is the best master/coordinator. n2 resigns as master/coordinator, and the VIP is released on n2.
4. Even in this case, if n3 is chosen as the best master, n3 will perform the escalation again, because it had already performed the de-escalation when the quorum was lost. And when the escalation is performed again after recovery from split-brain, the arping step is performed again.

Now, if step 4 of the original scenario ("VIP has been set on n3, but ARP table says n2 is VIP target yet") is somehow happening, that means the watchdog did not perform the de-escalation when the quorum was lost. If that is the case, can you please provide me the log files from when that happened? |
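The quorum rule underlying steps 2-b and 4 can be sketched as follows. This is a simplified model for illustration, not the actual watchdog code:

```python
def has_quorum(reachable: int, total: int) -> bool:
    """A node holds quorum when it can reach a strict majority of the
    cluster, counting itself (simplified model of the watchdog rule)."""
    return reachable > total // 2

# Step 2-b: an isolated n3 counts only itself out of 3 nodes, so it must
# de-escalate and release the VIP.
print(has_quorum(1, 3))   # False

# Steps 2 and 4: n1/n2 reach 2 of 3 nodes, so that side holds quorum and
# n2 may escalate (acquire the VIP with arping).
print(has_quorum(2, 3))   # True
```

Under this rule, at most one side of any partition of an odd-sized cluster can hold quorum, which is why re-escalation on the winning side should always re-run arping.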
|
|
This is the log for this issue. The IP addresses and host names are modified from our customer's originals. n3 had been master/coordinator with the VIP for a long time. When n1 and n2 decided that n3 was lost, n3 didn't log any watchdog errors. |
|
|
Our customer says: In our case, n3 did not seem to know the quorum was lost... for most of that network partition, n3 could *see* n1 and n2, but n1 and n2 could not *see* n3. I'm not certain whether n3 had a way to know the quorum was lost during the network partition. We do know with certainty that the end result was two production outages within 3 days. |
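The customer is describing an asymmetric partition: each node counts only the peers it can hear, so the two sides can disagree about quorum. A toy illustration of that view (a hypothetical reachability model, not pgpool code):

```python
# reachable[a][b] == True means a's watchdog still hears heartbeats from b.
# Asymmetric failure: n3 still hears n1 and n2, but they no longer hear n3.
reachable = {
    "n1": {"n1": True, "n2": True, "n3": False},
    "n2": {"n1": True, "n2": True, "n3": False},
    "n3": {"n1": True, "n2": True, "n3": True},
}

def counted_alive(node: str) -> int:
    """How many nodes this node believes are alive, counting itself."""
    return sum(reachable[node].values())

# n3 counts all three nodes alive, so it never notices quorum loss and
# keeps the VIP; meanwhile n1 and n2 each count only two nodes, hold a
# majority on their side, and elect n2 as the new master. The result is
# two VIP holders -- consistent with the reported outages.
print(counted_alive("n3"))   # 3
print(counted_alive("n1"))   # 2
```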
|
|
This is a simple patch for V3_7_STABLE to do arping after recovering from split-brain. It passed the regression test. |
|
|
The regression suite does not do any VIP/arp test. That means passing the regression test says nothing about whether your patch is OK as far as the VIP/arp issue is concerned. |
|
|
Have you actually tested the patch with the case described above? |
|
|
No, I haven't recreated the reported situation. |
|
|
We do not accept untested patches. |
|
|
Hi Harukat,

First of all, sorry for the delayed response, and thanks a lot for providing the patch and log files. The log file contained the messages from all three nodes and was quite huge, so it took me a while to understand the issue.

Anyhow, after reviewing the log files I realized that there was confusion in the watchdog code about how to deal with life-check failure scenarios, especially the cases where life-check reports a node failure while the watchdog core is still able to communicate with the remote nodes, and also the case where node A's life-check reports node B as lost while B still thinks A is alive and healthy. So I have reviewed the whole watchdog design around the life-check reports and have made some fixes in that area in the attached patch.

You can try the attached patch, test your scenario, and see if you still end up in the same situation as described in the initial bug report. The patch is generated against the current MASTER branch; I will commit it after a little more testing and then backport it to all supported branches.

Finally, the original idea in your patch and bug report, doing arping after recovering from the split-brain, seems reasonable, but your patch needs a little more thought on the design, since executing the wd_arping function from the watchdog main process is not the right thing to do. Effectively, after my patch (attached) you should never end up in a situation where multiple watchdog nodes have performed the escalation, provided you are using an odd number of Pgpool-II nodes. The arping solution you suggested should only be needed when the watchdog cluster is configured with an even total number of nodes and a network partition divides the network in such a way that both partitions get exactly half the nodes each.

Thanks
Best regards
Muhammad Usama |
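Usama's last point, that the even-node half/half split is the one remaining case for the arping idea, can be illustrated with a small model. This is a simplified sketch, not pgpool code; the half-votes option mirrors a configurable behavior in later pgpool versions:

```python
def quorum_exists(reachable: int, total: int, allow_half: bool = False) -> bool:
    """Simplified quorum test: strict majority, or exactly half the nodes
    when allow_half is set (hypothetical flag for illustration)."""
    if allow_half and total % 2 == 0 and reachable == total // 2:
        return True
    return reachable > total // 2

# A 4-node cluster partitioned 2-2: with strict majority, neither side has
# quorum and nobody escalates; with half votes allowed, BOTH partitions
# claim quorum and both escalate, leaving two VIP holders until recovery --
# the case harukat's arping-after-recovery idea addresses.
strict = [quorum_exists(2, 4) for _ in range(2)]
half   = [quorum_exists(2, 4, allow_half=True) for _ in range(2)]
print(strict)   # [False, False]
print(half)     # [True, True]
```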
|
|
Usama, after applying the patch, one of the watchdog regression tests failed: testing 011.watchdog_quorum_failover...failed. The regression log is attached. |
|
|
Also pgpool.log attached. |
|
|
Hi Ishii-San,

I am looking into the attached logs. Somehow the regression always passes on my machine :-)

testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok. |
|
|
From the look of it, it seems like the setup of the PostgreSQL server is failing the test case. See line 23 in the attached 011.watchdog_quorum_failover:

23 recovery node 1...ERROR: executing recovery, execution of command failed at "1st stage"
24 DETAIL: command:"basebackup.sh"

But I can't figure out what the reason for that could be. |
|
|
Oops. You are right. I had broken my pgpool setup while migrating to PostgreSQL 12. Sorry for the noise. |
|
|
Thanks, Usama. I'll test the patch in the V4_1_STABLE environment. It looks like a big change. Should I not expect a fix for the 3.7.x version? |
|
|
Once you verify the fix, I will backport it to all supported branches, including 3.7. Thanks |
|
|
I did a test that runs Pgpool-II 4.1.x nodes in an artificially unstable network by executing the attached script. This "make_splitbrain.sh" script randomly blocks the watchdog communication on a node while it is the master node. In spite of the severe network environment, the patched Pgpool-II nodes ran well for a long time (over 1 day) in most cases. They often recovered from split-brain status successfully.

I also got a failed case. Its log is attached in the "test_20191007.tgz" file. I'm sorry, it may be difficult to read because it includes Japanese-locale output. In that case, I needed to execute arping manually in order to use Pgpool-II via the VIP. Its log says at the end:

Oct 7 16:19:00 cp2 [16456]: [2889-1] 2019-10-07 16:19:00: pid 16456: LOG: successfully acquired the delegate IP:"10.10.10.152"
Oct 7 16:19:00 cp2 [16456]: [2889-2] 2019-10-07 16:19:00: pid 16456: DETAIL: 'if_up_cmd' returned with success

But I could not access Pgpool-II on the cp2 host via the VIP without arping. I couldn't find the cause. |
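For readers who want to reproduce this kind of test, the core of such an injector can be approximated with iptables. This is a hypothetical sketch, not the attached make_splitbrain.sh; the watchdog port value and the DRY_RUN guard are assumptions, and the real script's "only while master" check is omitted:

```shell
#!/bin/bash
# Hypothetical split-brain injector sketch: block watchdog traffic for a
# random interval, then restore it. NOT the attached make_splitbrain.sh.
WD_PORT=9000            # pgpool watchdog port (wd_port); assumed value
DRY_RUN=${DRY_RUN:-1}   # default: print the commands instead of running them

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$@"; else "$@"; fi
}

block_secs=$(( (RANDOM % 20) + 5 ))   # block for 5..24 seconds
run iptables -A INPUT -p tcp --dport "$WD_PORT" -j DROP
run sleep "$block_secs"
run iptables -D INPUT -p tcp --dport "$WD_PORT" -j DROP
```

Run it with DRY_RUN=0 (as root) to actually inject the fault; looping it with random pauses approximates the "severe network environment" described above.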
|
|
Hi Harukat,

First of all, many thanks for doing thorough testing. I have gone through all the attached logs, and it seems that watchdog is behaving correctly. Despite some flooding of log messages, I can't see any issue, at least in the logs.

As for the problem you described, that you had to do arping on cp2 after the escalation that happened at 'Oct 7 16:19:00': I believe it is not because of anything done wrong by Pgpool-II or watchdog (at least nothing in the logs suggests anything wrong with watchdog or pgpool that could have caused it). If you look at the logs of all three nodes around that time, you can see that only one node, cp2, performed the escalation and brought up the VIP, and no other node acquired the VIP after that time. And since the pgpool escalation process does perform the arping and ping after acquiring the VIP, I am guessing some external factor might be involved, because nothing in the logs points to any situation or issue that could require a manual arping.

I am not sure, but I wonder if it could be caused by the nature of the test case, as it was causing very frequent escalations and de-escalations. Do you think it is possible that a network switch or VM host might have played a role in it? For example, the last de-escalation on cp2 happened at "7 16:18:45" (VIP released by cp2) and the new escalation started at "7 16:18:56" (VIP acquired again), so there is just a 10-second gap between the release and the re-acquisition of the VIP on cp2. I am just thinking out loud: what if the ARP tables on the client machine got it wrong because of these frequent updates, and the client machine somehow still didn't receive the new VIP record? Though I still think that might not be the case, and some other external element caused it. |
|
|
Thank you for confirming my report. If you see nothing in the Pgpool-II code that would explain the failed case, I also think some external element caused it, because the log suggests the code must have done arping last. Moreover, my test script could start dropping packets on the VIP-holder host immediately after its escalation, without any stable period. It can be considered a kind of double fault, so I think it's OK that the patched Pgpool-II code doesn't cover such a case. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-09-12 16:37 | harukat | New Issue | |
| 2019-09-12 16:52 | t-ishii | Assigned To | => Muhammad Usama |
| 2019-09-12 16:52 | t-ishii | Status | new => assigned |
| 2019-09-13 15:57 | t-ishii | Note Added: 0002845 | |
| 2019-09-13 17:07 | t-ishii | Note Edited: 0002845 | |
| 2019-09-13 21:57 | Muhammad Usama | Note Added: 0002847 | |
| 2019-09-13 21:58 | Muhammad Usama | Note Edited: 0002847 | |
| 2019-09-17 11:10 | harukat | File Added: lost_arping_case.log | |
| 2019-09-17 11:10 | harukat | Note Added: 0002855 | |
| 2019-09-25 13:03 | harukat | Note Added: 0002879 | |
| 2019-09-27 12:19 | harukat | File Added: pgpool2_V3_7_STSBLE_arping_again.patch | |
| 2019-09-27 12:19 | harukat | Note Added: 0002885 | |
| 2019-09-27 12:32 | t-ishii | Note Added: 0002886 | |
| 2019-09-27 13:14 | t-ishii | Note Added: 0002887 | |
| 2019-10-01 18:10 | harukat | Note Added: 0002896 | |
| 2019-10-02 09:36 | t-ishii | Note Added: 0002898 | |
| 2019-10-02 23:07 | Muhammad Usama | File Added: watchdog_node_lost_fix.diff | |
| 2019-10-02 23:07 | Muhammad Usama | Note Added: 0002899 | |
| 2019-10-02 23:28 | t-ishii | File Added: 011.watchdog_quorum_failover | |
| 2019-10-02 23:28 | t-ishii | Note Added: 0002900 | |
| 2019-10-02 23:32 | t-ishii | File Added: pgpool-logs.tar.gz | |
| 2019-10-02 23:32 | t-ishii | Note Added: 0002902 | |
| 2019-10-03 00:19 | Muhammad Usama | Note Added: 0002903 | |
| 2019-10-03 00:33 | Muhammad Usama | Note Added: 0002904 | |
| 2019-10-03 07:13 | t-ishii | Note Added: 0002905 | |
| 2019-10-03 13:33 | harukat | Note Added: 0002906 | |
| 2019-10-03 18:19 | Muhammad Usama | Note Added: 0002907 | |
| 2019-10-08 13:31 | harukat | File Added: make_splitbrain.sh | |
| 2019-10-08 13:31 | harukat | File Added: test_20191007.tgz | |
| 2019-10-08 13:31 | harukat | Note Added: 0002912 | |
| 2019-10-08 13:32 | harukat | Note Edited: 0002912 | |
| 2019-10-11 00:19 | Muhammad Usama | Note Added: 0002918 | |
| 2019-10-15 11:54 | harukat | Note Added: 0002927 | |
| 2019-10-31 18:41 | administrator | Fixed in Version | => 3.7.12 |
| 2019-10-31 18:41 | administrator | Target Version | => 3.7.12 |