View Issue Details

ID: 0000545
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2019-10-31 18:40
Reporter: nglp
Assigned To: Muhammad Usama
Priority: normal
Severity: major
Reproducibility: random
Status: resolved
Resolution: fixed
Product Version: 4.0.4
Target Version: 4.0.7
Fixed in Version: 4.0.7
Summary: 0000545: Quorum lost and not recovered
Description

Hi,

We have a 3-node Pgpool-II setup in front of a 3-node PostgreSQL master-standby-standby cluster with streaming replication, running on VMware VMs.

The Pgpool-II health check method is heartbeat.

We had an incident: our backup method (VM snapshot) froze the watchdog master node (node2 at that moment) for longer than wd_heartbeat_deadtime, so the VIP was released. That is normal behaviour.

The issue came afterwards: the quorum was lost and never recovered, the VIP was not reacquired by either of the other two nodes, and the service went down.

Because the quorum never recovered, we had to restart the entire Pgpool-II stack, even though checking the watchdog state in pgpool still indicated a master node. The same hang has happened in the past with the quorum recovered and the VIP reacquired successfully, so I guess the outcome is random.

Find attached our logs and our pgpool configuration.

We will be happy to answer any questions.

Thanks, Guille
Steps To Reproduce

This has happened after several master failovers caused by the same kind of hang, but in short:
* Hang the watchdog master node for slightly longer than wd_heartbeat_deadtime
* The quorum is lost

Additional Information

node1, node2 and node3 have different wd_priority values, in that order; the other pgpool settings are exactly the same.

node2 was the watchdog master at that moment, and remained so after the quorum was lost.

Tags: No tags attached.

Activities

nglp

2019-09-09 20:25

reporter  

node1.txt (983 bytes)
node2.txt (5,123 bytes)
node3.txt (983 bytes)
pgpool.conf (4,479 bytes)

t-ishii

2019-09-12 10:36

developer   ~0002835

I pinged Usama, who is an authority on the watchdog, a few days ago to ask him to look into this issue.
You had probably better wait for his response.

In the meantime, I noticed that you set wd_heartbeat_deadtime = 5, which means that if the watchdog does not receive a heartbeat signal within 5 seconds, that pgpool node is regarded as dead (which is exactly your situation). As a workaround, you could set it to something longer, say 300 seconds, so that the server is not regarded as dead while it is being backed up.
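Expressed as pgpool.conf lines, the suggested workaround might look like this (300 is the illustrative figure from this note, not a recommendation; pick a value longer than your longest expected snapshot pause):

```ini
# pgpool.conf -- illustrative workaround value only.
# Declare a node dead only after a heartbeat silence longer than the
# longest expected backup/snapshot pause.
wd_heartbeat_deadtime = 300   # seconds without heartbeat before the node is considered dead
```

The trade-off is failover latency: the larger this value, the longer a genuinely dead node keeps the VIP before the cluster reacts.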

nglp

2019-09-12 16:09

reporter   ~0002836

Hi t-ishii,

Thank you for your response; we will wait for him :)

Yes, once we had identified that timeout, we increased wd_heartbeat_deadtime to 30 (from 5) and wd_interval to 10 (from 2) to avoid these problems. Those are the default values in the streaming replication cluster example configuration.

We had configured lower values in the past to allow the fastest possible failover.

Best regards, Guille

t-ishii

2019-09-12 16:49

developer   ~0002837

This one:
https://www.pgpool.net/mantisbt/view.php?id=547
seems to be a similar issue?

nglp

2019-09-12 17:19

reporter   ~0002838

Hi,

No, it is not the same. If you look at the attached logs, the following things happen:
* Heartbeat signals are lost
* The quorum is lost
* Watchdog de-escalation starts
* The VIP is removed
* Heartbeat signals come back

But at this point the quorum is not recovered, so the VIP stays lost.

The issue in our case is that the quorum is never recovered, even though the heartbeat signals are back.

Best regards

Muhammad Usama

2019-09-12 18:54

developer   ~0002839

Hi

The log files you shared do not have much information, and it is difficult to guess what could have gone wrong.
Can you please share the complete pgpool logs for all three nodes?

nglp

2019-09-12 19:07

reporter   ~0002840

Hi,

I am sorry to say that is the only output pgpool generated for this issue.

If you want, I can attach the entire logs, but they are filled with non-relevant information.

Let me know

Thanks, Guille

Muhammad Usama

2019-09-12 19:17

developer   ~0002841

Are you able to reproduce the issue reliably? If so, sharing the exact steps would be enough.

Otherwise, please share the entire log along with the time when the issue happened, and I will try to see
whether some usable information can be extracted from it.

Thanks

nglp

2019-09-12 19:27

reporter   ~0002842

I am setting up a new environment to run some tests; the others are all production environments.

The way I think you can reproduce it is:
* Set low wd_heartbeat_deadtime and wd_interval values (e.g. 2 and 1)
* Take a virtual machine snapshot of the watchdog master node
* Check what happens

Expect a random result; on other occasions the VIP has failed over successfully and the quorum has been recovered correctly.
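For anyone without convenient VM-snapshot access, a rough stand-in for the freeze is pausing a process with SIGSTOP for longer than wd_heartbeat_deadtime. This sketch uses a placeholder `sleep` process rather than pgpool itself (an assumption; on a real node you would stop the pgpool processes, and a snapshot freezes the whole VM, not just one process):

```shell
# Approximate the snapshot-induced hang by freezing a process with SIGSTOP.
# A 'sleep' stands in for pgpool here; on a real watchdog master you would
# send the signals to the pgpool processes instead.
sleep 60 &                           # placeholder for the pgpool master process
pid=$!
kill -STOP "$pid"                    # freeze: heartbeats from this node would stop
sleep 3                              # stay frozen longer than wd_heartbeat_deadtime=2
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "state while frozen: $state"    # 'T' means stopped
kill -CONT "$pid"                    # thaw: heartbeats would resume
kill "$pid" 2>/dev/null
```

Whether the quorum recovers after the CONT is precisely the behaviour under test here.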

I will prepare the logs for upload (I have to replace some strings, such as hostnames and IPs, and remove some queries).

Thanks, Guille

nglp

2019-09-13 21:51

reporter   ~0002846

Hi,

I am back with "good" news: I am able to reproduce the issue in my test environment with the attached pgpool.conf and the following modifications:
* wd_interval = 1
* wd_heartbeat_deadtime = 2
* log_error_verbosity = verbose
* client_min_messages = debug5
* log_min_messages = debug1
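The same overrides as pgpool.conf lines (values copied from the list above):

```ini
# Reproduction overrides applied on top of the attached pgpool.conf
wd_interval = 1                   # watchdog health check interval (seconds)
wd_heartbeat_deadtime = 2         # declare a node dead after 2s of heartbeat silence
log_error_verbosity = verbose     # include LOCATION (file:line) in log messages
client_min_messages = debug5
log_min_messages = debug1
```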

Attaching the logs. pgpool was stopped, the log was cleared, pgpool was started, I waited 2 minutes, reproduced the issue, waited 1 minute and then stopped pgpool.

Let me know if you need a higher debug level on log_min_messages.

Thanks, Guille

nglp

2019-09-13 21:51

reporter  

server1.txt.gz (115,952 bytes)
server2.txt.gz (112,256 bytes)
server3.txt.gz (116,210 bytes)

Muhammad Usama

2019-09-13 23:02

developer   ~0002848

Hi

Thanks a bundle for the logs. I am looking into them.

Muhammad Usama

2019-09-17 05:05

developer   ~0002853

Hi

Thanks for providing the logs. I was able to identify the issue using them.
I am working on the fix.

Thanks

nglp

2019-09-17 16:37

reporter   ~0002857

Hi,

Thank you so much for your work.

If we can help in any way (reproducing the issue more times, trying other parameters, etc.), just let me know.

Best regards, Guille

Muhammad Usama

2019-09-18 23:18

developer   ~0002860

Hi Guille,

Can you please try the attached patch to see whether it fixes the issue.

Thanks
Best Regards
Muhammad Usama
recover_quorum_fix.diff (8,328 bytes)

nglp

2019-09-19 00:17

reporter   ~0002862

Hi,

Should I download the source code of 4.0.4 and compile it with the patch, or is the patch for another version? It is currently installed from the rpm package in the repository.

Best regards

t-ishii

2019-09-19 13:26

developer   ~0002864

Last edited: 2019-09-19 13:29

After applying the patch to the master branch, I get a few regression test failures:
testing 004.watchdog...ok.
testing 011.watchdog_quorum_failover...failed.
testing 012.watchdog_failover_when_quorum_exists...failed.
testing 013.watchdog_failover_require_consensus...failed.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.

013 has already been failing recently, but the 011 and 012 failures are new.

Muhammad Usama

2019-09-19 18:24

developer   ~0002865

Somehow I am getting a clean regression run with the patch.

..
testing 006.memqcache...ok.
testing 007.memqcache-memcached...ok.
testing 008.dbredirect...ok.
testing 009.sql_comments...ok.
testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok.
testing 025.enable_shared_relcache...ok.
testing 026.temp_table...ok.
testing 027.auto_failback...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.
testing 050.bug58...ok.
testing 051.bug60...ok.
..

So I will try to figure out why it could be failing on your system.
I am running the regression on the master branch with PG12.

Muhammad Usama

2019-09-19 18:26

developer   ~0002866

Hi Guille,

If you have a development environment set up on your system, you can pull the code, apply the patch on the master branch
(I believe the same patch should also work with the 4_0 branch) and test the scenario.

nglp

2019-09-19 19:17

reporter   ~0002867

Hi Muhammad,

I have downloaded the source code for version 4.0.4, applied the patch, compiled it on all machines and started the service with the new binary.

I am sad to report that the issue still occurs.

The PostgreSQL version in this environment is 11.4 (I saw while compiling that it needs some of its libraries).

Best regards

t-ishii

2019-09-19 20:39

developer   ~0002868

The regression logs for 011 and 012 are attached.
This is the master branch + PostgreSQL 11.4.
regression_log.tar.gz (1,891 bytes)

Muhammad Usama

2019-09-20 18:50

developer   ~0002869

Hi Ishii-San

Is it possible for you to also share the pgpool.log files for both of these test cases? I am still not able to reproduce the failures running against either PG11 or PG12.

t-ishii

2019-09-20 20:58

developer   ~0002870

Sure.
011-012.tar.gz (6,663 bytes)

Muhammad Usama

2019-09-20 23:39

developer   ~0002871

Hi Ishii-San

Many thanks for providing the logs. There is something strange here: I am not able to work out what is going wrong with your setup.
However, I noticed that the logs you shared do not contain the health-check debug messages, like the one below:

WARNING: check_backend_down_request: failed to open file standby2/log/backend_down_request

So I am wondering: was pgpool-II compiled with health check debugging enabled?

I am also attaching the corresponding logs from running the regression on my system, in case you can spot a difference that identifies what could be going wrong.

Thanks
reg_logs_usama.zip (29,726 bytes)

t-ishii

2019-09-21 08:03

developer   ~0002872

Oops, you are right. I did not enable health check debugging while compiling. After enabling it, the 011 and 012 tests succeed.
I am terribly sorry for the misinformation.

Muhammad Usama

2019-09-24 15:31

developer   ~0002874

Hi Ishii-San

;-) No problem, it happens to me all the time.

I have now been able to test the patch thoroughly using the new wd_cli utility I checked in yesterday, and the patch behaves as expected.
Do you agree that I should go ahead and commit it?

t-ishii

2019-09-24 15:39

developer   ~0002875

Not sure. I think you need to get logs from nglp to find out why his issue was not resolved by the patch.

nglp

2019-09-24 16:23

reporter   ~0002876

Hi,

Will do; it will probably take 1-2 days.

Thanks, Guille

Muhammad Usama

2019-09-24 17:05

developer   ~0002877

Hi Ishii-San
Sorry, my bad. I somehow overlooked the note from Guille and did not realise he had reported that the patch did not fix the issue.

@Guille I will look into the issue once you share the pgpool log files with the patch applied.
Thanks in advance.

Kind Regards
Muhammad Usama

nglp

2019-09-24 20:21

reporter   ~0002878

Hi,

Attaching the logs. pgpool was stopped, the log was cleared, pgpool was started, I waited 2 minutes, reproduced the issue, waited 1 minute and then stopped pgpool.

Compiled PGPool 4.0.4 + Patch
PostgreSQL 11.4

Let me know

Best regards
server1.txt-2.gz (106,666 bytes)
server2.txt-2.gz (97,357 bytes)
server3.txt-2.gz (97,602 bytes)

Muhammad Usama

2019-09-25 19:09

developer   ~0002880

Hi Guille,

Many thanks for providing the logs. Looking at them, I suspect there is some issue with how the patch was applied,
since the code line numbers in the log messages appear at an offset from where they should be.
For example:
Sep 24 07:16:20 server1 pgpool[15336]: [557-1] 2019-09-24 07:16:20: pid 15336: DEBUG: I am the cluster leader node command finished with status:[COMMAND TIMEED OUT] which is success
Sep 24 07:16:20 server1 pgpool[15336]: [557-2] 2019-09-24 07:16:20: pid 15336: DETAIL: The command was sent to 2 nodes and 1 nodes replied to it
Sep 24 07:16:20 server1 pgpool[15336]: [557-3] 2019-09-24 07:16:20: pid 15336: LOCATION: watchdog.c:5548

With the pgpool-II 4.0.4 source code this line should read "LOCATION: watchdog.c:5517", not "LOCATION: watchdog.c:5548".

The other log messages suggest a similar problem with the code.
So is it possible for you to send me the code with the patch applied? Alternatively, I can send you the built binaries (if you are using macOS or CentOS 7), whichever suits you better.

Thanks in advance for helping in debugging this.

Best Regards

nglp

2019-09-25 19:57

reporter   ~0002881

Hi,

Sure, please find my pgpool-II 4.0.4 + patch attached.

It is probably my fault in applying it; I did it manually. So if you have the source code with the patch properly applied, I can compile it in my test environment (Red Hat 7).
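One way to catch a mis-applied patch before compiling is `patch --dry-run`, which reports rejected hunks without modifying the tree. The snippet below demonstrates the technique on a tiny stand-in file; in the real case you would run something like `patch -p1 --dry-run < recover_quorum_fix.diff` (the strip level depends on the paths in the diff) from the pgpool-II source root:

```shell
# Demonstrate 'patch --dry-run' as a pre-build sanity check, using a
# throwaway file and diff as stand-ins for the real source tree and patch.
workdir=$(mktemp -d)
cd "$workdir"
printf 'old line\n' > watchdog.c
printf -- '--- watchdog.c\n+++ watchdog.c\n@@ -1 +1 @@\n-old line\n+new line\n' > fix.diff
if patch -p0 --dry-run < fix.diff > /dev/null; then
    echo "patch applies cleanly"
else
    echo "patch does NOT apply cleanly; inspect the reported hunks"
fi
```

A dry run that reports offsets, fuzz, or rejects would explain log LOCATION lines pointing at shifted line numbers.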

Also, we are open to updating Pgpool-II to the latest version (some days ago we suffered a segfault in a connection pool process; we are still trying to reproduce it or obtain a core dump to open an issue), so if you have the patch applied on top of 4.0.6, that will work for us too.

Many thanks to you, Guille
pgpool-II-4.0.4.fix.tgz (3,898,639 bytes)

Muhammad Usama

2019-09-25 20:53

developer   ~0002882

Hi

Many thanks for the prompt reply. My suspicion was correct: there was an issue with the patching.
I am attaching the patched code for the same pgpool-II 4.0.4. Can you please try it out?
Once you verify the fix, I will apply it to all supported branches.

Thanks

nglp

2019-09-25 22:11

reporter   ~0002883

Hi,

Good news: the issue looks to be solved with that code.

Attaching the logs. pgpool was stopped, the log was cleared, pgpool was started, I waited 2 minutes, reproduced the issue, waited 1 minute and then stopped pgpool.

Beyond what is in the logs, I repeated the scenario about 10-15 times, and every time it went fine.

Thank you all for your work

Best regards, Guille
server1.txt-3.gz (117,111 bytes)
server2.txt-3.gz (106,937 bytes)
server3.txt-3.gz (107,761 bytes)

Muhammad Usama

2019-09-25 23:41

developer   ~0002884

Hi Guille,

Many thanks for testing and confirming the fix. I will commit the patch to all affected branches.

Best Regards
Muhammad Usama

Muhammad Usama

2019-09-29 06:06

developer   ~0002892

I have committed the fix to all supported branches

https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106

Muhammad Usama

2019-09-29 06:07

developer   ~0002893

Fix committed to all supported branches


https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106

Issue History

Date Modified Username Field Change
2019-09-09 20:25 nglp New Issue
2019-09-09 20:25 nglp File Added: node1.txt
2019-09-09 20:25 nglp File Added: node2.txt
2019-09-09 20:25 nglp File Added: node3.txt
2019-09-09 20:25 nglp File Added: pgpool.conf
2019-09-12 10:36 t-ishii Note Added: 0002835
2019-09-12 16:09 nglp Note Added: 0002836
2019-09-12 16:17 t-ishii Assigned To => Muhammad Usama
2019-09-12 16:17 t-ishii Status new => assigned
2019-09-12 16:49 t-ishii Note Added: 0002837
2019-09-12 17:19 nglp Note Added: 0002838
2019-09-12 18:54 Muhammad Usama Note Added: 0002839
2019-09-12 19:07 nglp Note Added: 0002840
2019-09-12 19:17 Muhammad Usama Note Added: 0002841
2019-09-12 19:27 nglp Note Added: 0002842
2019-09-13 21:51 nglp Note Added: 0002846
2019-09-13 21:51 nglp File Added: server1.txt.gz
2019-09-13 21:51 nglp File Added: server2.txt.gz
2019-09-13 21:51 nglp File Added: server3.txt.gz
2019-09-13 23:02 Muhammad Usama Note Added: 0002848
2019-09-17 05:05 Muhammad Usama Note Added: 0002853
2019-09-17 16:37 nglp Note Added: 0002857
2019-09-18 23:18 Muhammad Usama File Added: recover_quorum_fix.diff
2019-09-18 23:18 Muhammad Usama Note Added: 0002860
2019-09-19 00:17 nglp Note Added: 0002862
2019-09-19 13:26 t-ishii Note Added: 0002864
2019-09-19 13:29 t-ishii Note Edited: 0002864
2019-09-19 18:24 Muhammad Usama Note Added: 0002865
2019-09-19 18:26 Muhammad Usama Note Added: 0002866
2019-09-19 19:17 nglp Note Added: 0002867
2019-09-19 20:39 t-ishii File Added: regression_log.tar.gz
2019-09-19 20:39 t-ishii Note Added: 0002868
2019-09-20 18:50 Muhammad Usama Note Added: 0002869
2019-09-20 20:58 t-ishii File Added: 011-012.tar.gz
2019-09-20 20:58 t-ishii Note Added: 0002870
2019-09-20 23:39 Muhammad Usama File Added: reg_logs_usama.zip
2019-09-20 23:39 Muhammad Usama Note Added: 0002871
2019-09-21 08:03 t-ishii Note Added: 0002872
2019-09-24 15:31 Muhammad Usama Note Added: 0002874
2019-09-24 15:39 t-ishii Note Added: 0002875
2019-09-24 16:23 nglp Note Added: 0002876
2019-09-24 17:05 Muhammad Usama Note Added: 0002877
2019-09-24 20:21 nglp File Added: server1.txt-2.gz
2019-09-24 20:21 nglp File Added: server2.txt-2.gz
2019-09-24 20:21 nglp File Added: server3.txt-2.gz
2019-09-24 20:21 nglp Note Added: 0002878
2019-09-25 19:09 Muhammad Usama Note Added: 0002880
2019-09-25 19:57 nglp File Added: pgpool-II-4.0.4.fix.tgz
2019-09-25 19:57 nglp Note Added: 0002881
2019-09-25 20:53 Muhammad Usama File Added: pgpool-II-4.0.4-patched.tar.gz
2019-09-25 20:53 Muhammad Usama Note Added: 0002882
2019-09-25 22:11 nglp File Added: server1.txt-3.gz
2019-09-25 22:11 nglp File Added: server2.txt-3.gz
2019-09-25 22:11 nglp File Added: server3.txt-3.gz
2019-09-25 22:11 nglp Note Added: 0002883
2019-09-25 23:41 Muhammad Usama Note Added: 0002884
2019-09-29 06:06 Muhammad Usama Note Added: 0002892
2019-09-29 06:07 Muhammad Usama Status assigned => resolved
2019-09-29 06:07 Muhammad Usama Resolution open => fixed
2019-09-29 06:07 Muhammad Usama Note Added: 0002893
2019-10-01 11:53 administrator Target Version => 4.0.7
2019-10-31 18:40 administrator Fixed in Version => 4.0.7