View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000545 | Pgpool-II | Bug | public | 2019-09-09 20:25 | 2019-10-31 18:40 |
| Reporter | nglp | Assigned To | Muhammad Usama | ||
| Priority | normal | Severity | major | Reproducibility | random |
| Status | resolved | Resolution | fixed | ||
| Product Version | 4.0.4 | ||||
| Target Version | 4.0.7 | Fixed in Version | 4.0.7 | ||
| Summary | 0000545: Quorum lost and not recovered | ||||
| Description | Hi, We have a 3-node Pgpool-II setup in front of a 3-node PostgreSQL master-standby-standby cluster with streaming replication, running on VMware VMs. The Pgpool-II health check method is heartbeat. We had an incident: our backup method (VM snapshot) hung the watchdog master node (node2 at that moment) for longer than wd_heartbeat_deadtime, and the VIP was released, which is normal behaviour. The issue is that after that, quorum was lost and never recovered, and the VIP was not reacquired by the other two nodes, so the service went down. Since quorum was never recovered, we had to restart the entire Pgpool-II stack, even though the watchdog status was still reporting a master. This has happened in the past and quorum was recovered and the VIP reallocated successfully, so I guess it is random. Find attached our logs and Pgpool-II configuration. We will be happy to answer any questions. Thanks, Guille | ||||
| Steps To Reproduce | This has happened after several master failovers caused by the same hang, but in short: * Hang the watchdog master node for slightly longer than wd_heartbeat_deadtime * Quorum is lost | ||||
| Additional Information | node1, node2 and node3 have different wd_priority values, in that order; the other Pgpool-II settings are identical. node2 was the watchdog master at that moment, and remained so after quorum was lost | ||||
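For context, the per-node priority ordering described above would look something like the following in each node's pgpool.conf. These values are illustrative only (the real ones are in the attached configuration); a higher wd_priority makes a node the preferred watchdog master in an election:

```ini
# Illustrative sketch, not taken from the attached pgpool.conf.
# Higher wd_priority is preferred when electing the watchdog master.
# node1:
wd_priority = 3
# node2:
wd_priority = 2
# node3:
wd_priority = 1
```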
| Tags | No tags attached. | ||||
|
|
|
|
|
I pinged Usama, who is an authority on the watchdog, a few days ago to look into this issue. You'd probably better wait for his response. In the meantime, I noticed you set wd_heartbeat_deadtime = 5, which means that if the watchdog does not receive a heartbeat signal within 5 seconds, that pgpool node is regarded as dead (which is your situation). As a workaround, you could set it to something longer, say 300 seconds, so that the server being backed up is not regarded as dead. |
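The workaround above can be sketched as a pgpool.conf fragment. The values here are illustrative (only wd_heartbeat_deadtime = 5 is stated in the thread); a node is declared dead once no heartbeat arrives within wd_heartbeat_deadtime:

```ini
# Watchdog heartbeat timing (illustrative values)
wd_heartbeat_keepalive = 2     # seconds between heartbeat sends
wd_heartbeat_deadtime = 300    # declare a node dead only after 300s of silence,
                               # long enough to ride out a snapshot pause
```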
|
|
Hi t-ishii, Thank you for your response, we will wait for him :) Yes, once we had identified that timeout, we increased wd_heartbeat_deadtime to 30 (from 5) and wd_interval to 10 (from 2) to avoid these problems. Those are the default values in the streaming replication cluster example configuration. We had configured lower values in the past to allow the fastest possible failover. Best regards, Guille |
|
|
This one: https://www.pgpool.net/mantisbt/view.php?id=547 seems to be a similar issue? |
|
|
Hi, No, it's not the same. If you look at the attached logs, the following things happen: * Heartbeat signals are lost * Quorum is lost * Watchdog de-escalation starts * The VIP is removed * Heartbeat signals come back But at that point, quorum is not recovered, so the VIP stays down. The issue in our case is that quorum is never recovered, even though the heartbeat signals are back. Best regards |
|
|
Hi The log files you shared do not have much information, and it is difficult to guess what could have gone wrong. Can you please share the complete pgpool logs for all three nodes? |
|
|
Hi, I'm sorry to say that's the only output pgpool generated for this issue. If you want, I can attach the entire logs, but they are filled with non-relevant information. Let me know. Thanks, Guille |
|
|
Are you able to reliably reproduce the issue? If so, sharing the exact steps would be enough. Otherwise, please share the entire log along with the time when the issue happened, and I will try to see if some usable information can be extracted from it. Thanks |
|
|
I'm setting up a new environment to make some tests; the other ones are all production. The way I think you can reproduce it is: * set low wd_heartbeat_deadtime and wd_interval values (like 2 and 1) * take a virtual machine snapshot of the watchdog master node * check what happens Expect a random result; on other occasions the VIP has failed over successfully and quorum has been recovered OK. I will prepare the logs to upload (I have to replace some strings, like hostnames and IPs, and remove some queries) Thanks, Guille |
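If taking a VM snapshot is not convenient, a hang of the same kind can be approximated by freezing a process with SIGSTOP. A minimal, self-contained sketch, using a plain `sleep` as a stand-in for the pgpool watchdog process (the 3-second freeze assumes wd_heartbeat_deadtime = 2 as in the steps above):

```shell
# Simulate a watchdog-master hang longer than wd_heartbeat_deadtime by
# suspending a stand-in process with SIGSTOP, roughly the effect a VM
# snapshot pause has on pgpool. `sleep` is a placeholder, not pgpool itself.
sleep 60 &                    # stand-in for the pgpool process
pid=$!
kill -STOP "$pid"             # freeze it, as the snapshot does
sleep 3                       # stay frozen past wd_heartbeat_deadtime (2s here)
kill -CONT "$pid"             # resume; heartbeats would come back at this point
kill "$pid" 2>/dev/null       # clean up the stand-in process
wait "$pid" 2>/dev/null || true
echo "simulated hang complete"
```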
|
|
Hi, I'm back with "good" news: I'm able to reproduce the issue in my test environment with the attached pgpool.conf and the following modifications: * wd_interval = 1 * wd_heartbeat_deadtime = 2 * log_error_verbosity = verbose * client_min_messages = debug5 * log_min_messages = debug1 Attaching logs: pgpool was stopped, the log cleared, pgpool started, waited 2 minutes, issue reproduced, waited 1 minute and then stopped. Let me know if you need more debug via log_min_messages. Thanks, Guille |
|
|
|
|
|
Hi Thanks a bundle for the logs. I am looking into them |
|
|
Hi Thanks for providing the logs. I was able to identify the issue using them, and I am working on a fix. Thanks |
|
|
Hi, Thank you so much for your work. If we can help you in any way (reproducing it more times, other parameters, etc.) just let me know. Best regards, Guille |
|
|
Hi Guille, Can you please try the attached patch to see if it fixes the issue? Thanks Best Regards Muhammad Usama |
|
|
Hi, Should I download the source code of 4.0.4 and compile it with the patch, or is the patch for another version? Currently it has been installed using the rpm package from the repository. Best regards |
|
|
After applying the patch to the master branch, I get a few regression test failures:
testing 004.watchdog...ok.
testing 011.watchdog_quorum_failover...failed.
testing 012.watchdog_failover_when_quorum_exists...failed.
testing 013.watchdog_failover_require_consensus...failed.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.
013 has already been failing these days, but 011 and 012 are new. |
|
|
Somehow I am getting a clean regression run with the patch:
..
testing 006.memqcache...ok.
testing 007.memqcache-memcached...ok.
testing 008.dbredirect...ok.
testing 009.sql_comments...ok.
testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok.
testing 025.enable_shared_relcache...ok.
testing 026.temp_table...ok.
testing 027.auto_failback...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.
testing 050.bug58...ok.
testing 051.bug60...ok.
..
So I will try to figure out why it could be failing on your system. I am running the regression on the master branch with PG12 |
|
|
Hi Guille, If you have a development environment set up on your system, you can pull the code and apply the patch on the master branch (I believe the same patch should also work with the 4_0 branch) and test the scenario. |
|
|
Hi Muhammad, I've downloaded the source code for version 4.0.4, applied the patch, compiled it on all machines and started the service with the new binary. I'm sad to report the issue still occurs. My PostgreSQL version is 11.4 in this environment (I saw during compilation that it needs some libraries). Best regards |
|
|
The regression logs for 011 and 012 are attached. This is the master branch + PostgreSQL 11.4. |
|
|
Hi Ishii-San Is it possible for you to also share the pgpool.log files for both of these test cases? I am still not able to reproduce the failures running against either PG11 or PG12 |
|
|
Sure. |
|
|
Hi Ishii-San Many thanks for providing the logs. There is something strange: I am not able to work out what's going wrong with your setup. However, I noticed the logs you shared do not have health-check debug related messages, like the one below: WARNING: check_backend_down_request: failed to open file standby2/log/backend_down_request So I am wondering: was pgpool-II compiled with health check debug enabled? I am also attaching the corresponding logs from running the regression on my system, in case you can spot a difference that identifies what could be going wrong. Thanks |
|
|
Oops. You are right. I did not enable health check debug while compiling. After enabling it, 011 and 012 tests succeed. I am terribly sorry for the misinformation. |
|
|
Hi Ishii-San ;-) No problem, it happens to me all the time. I was able to test the patch thoroughly using the new wd_cli utility I checked in yesterday, and the patch is behaving as expected. Do you agree if I go ahead and commit this one? |
|
|
Not sure. I think you need to get logs from nglp to find out why his issue is not resolved with the patch. |
|
|
Hi, Will do; it will probably take 1-2 days. Thanks, Guille |
|
|
Hi Ishii-San Sorry, my bad. I somehow overlooked the note from Guille and did not realize he had reported that the issue was not fixed by the patch. @Guille I will look into the issue once you share the pgpool log files with the patch applied. Thanks in advance. Kind Regards Muhammad Usama |
|
|
Hi, Attaching logs: pgpool was stopped, the log cleared, pgpool started, waited 2 minutes, issue reproduced, waited 1 minute and then stopped. Compiled Pgpool-II 4.0.4 + patch, PostgreSQL 11.4. Let me know. Best regards |
|
|
Hi Guille, Many thanks for providing the logs. Looking at them, I suspect there is some issue with how the patch was applied, since the code line numbers in the logs appear at an offset from where they should be. For example: Sep 24 07:16:20 server1 pgpool[15336]: [557-1] 2019-09-24 07:16:20: pid 15336: DEBUG: I am the cluster leader node command finished with status:[COMMAND TIMEED OUT] which is success Sep 24 07:16:20 server1 pgpool[15336]: [557-2] 2019-09-24 07:16:20: pid 15336: DETAIL: The command was sent to 2 nodes and 1 nodes replied to it Sep 24 07:16:20 server1 pgpool[15336]: [557-3] 2019-09-24 07:16:20: pid 15336: LOCATION: watchdog.c:5548 With the pgpool-II 4.0.4 source code, this line should be "LOCATION: watchdog.c:5517", not "LOCATION: watchdog.c:5548". The other log messages suggest a similar problem with the code. So is it possible for you to send me the code with the patch applied? Or I can send you the built binaries (if you are using macOS or CentOS 7), whichever suits you better. Thanks in advance for helping debug this. Best Regards |
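As an aside, line-number offsets like the ones above are typical of a hand-applied diff. A safer route is to let `patch` (or `git apply`) apply it, with a dry run first. A self-contained toy example; the file contents and diff here are fabricated for illustration, not the actual recover_quorum_fix.diff:

```shell
# Toy demo of applying a unified diff with patch(1).
mkdir -p demo/src && cd demo
printf 'old line\n' > src/watchdog.c
# A minimal fabricated diff in the usual a/ b/ unified format:
cat > ../fix.diff <<'EOF'
--- a/src/watchdog.c
+++ b/src/watchdog.c
@@ -1 +1 @@
-old line
+new line
EOF
patch -p1 --dry-run < ../fix.diff   # check it applies cleanly before touching files
patch -p1 < ../fix.diff             # apply for real; failed hunks would land in *.rej
cat src/watchdog.c
```

A clean `--dry-run` with no fuzz or offset messages is a quick way to confirm the patch matches the source version being built.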
|
|
Hi, Sure, find my Pgpool-II 4.0.4 + patch attached. It's probably my fault in applying it; I did it manually, so if you have the source code with the patch properly applied, I can compile it in my test environment (Red Hat 7). Also, we are open to updating Pgpool-II to the latest version (some days ago we suffered a segfault in a connection pool process; we are still trying to reproduce it or obtain a core dump to open an issue), so if you have the patch applied on top of 4.0.6, that will work for us too. Many thanks to you, Guille |
|
|
Hi Many thanks for the prompt reply. My suspicion was correct: there was an issue with the patching. I am attaching the patched code for the same pgpool-II 4.0.4; can you please try it out? Once you verify the fix, I will apply it to all supported branches. Thanks |
|
|
Hi, Good news, the issue looks to be solved with that code. Attaching logs: pgpool was stopped, the log cleared, pgpool started, waited 2 minutes, the scenario reproduced, waited 1 minute and then stopped. Also, outside of those logs, I repeated the scenario 10-15 times and every time it went OK. Thank you all for your work. Best regards, Guille |
|
|
Hi Guille, Many thanks for testing and confirming the fix. I will commit the patch to all affected branches. Best Regards Muhammad Usama |
|
|
I have committed the fix to all supported branches https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106 |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-09-09 20:25 | nglp | New Issue | |
| 2019-09-09 20:25 | nglp | File Added: node1.txt | |
| 2019-09-09 20:25 | nglp | File Added: node2.txt | |
| 2019-09-09 20:25 | nglp | File Added: node3.txt | |
| 2019-09-09 20:25 | nglp | File Added: pgpool.conf | |
| 2019-09-12 10:36 | t-ishii | Note Added: 0002835 | |
| 2019-09-12 16:09 | nglp | Note Added: 0002836 | |
| 2019-09-12 16:17 | t-ishii | Assigned To | => Muhammad Usama |
| 2019-09-12 16:17 | t-ishii | Status | new => assigned |
| 2019-09-12 16:49 | t-ishii | Note Added: 0002837 | |
| 2019-09-12 17:19 | nglp | Note Added: 0002838 | |
| 2019-09-12 18:54 | Muhammad Usama | Note Added: 0002839 | |
| 2019-09-12 19:07 | nglp | Note Added: 0002840 | |
| 2019-09-12 19:17 | Muhammad Usama | Note Added: 0002841 | |
| 2019-09-12 19:27 | nglp | Note Added: 0002842 | |
| 2019-09-13 21:51 | nglp | Note Added: 0002846 | |
| 2019-09-13 21:51 | nglp | File Added: server1.txt.gz | |
| 2019-09-13 21:51 | nglp | File Added: server2.txt.gz | |
| 2019-09-13 21:51 | nglp | File Added: server3.txt.gz | |
| 2019-09-13 23:02 | Muhammad Usama | Note Added: 0002848 | |
| 2019-09-17 05:05 | Muhammad Usama | Note Added: 0002853 | |
| 2019-09-17 16:37 | nglp | Note Added: 0002857 | |
| 2019-09-18 23:18 | Muhammad Usama | File Added: recover_quorum_fix.diff | |
| 2019-09-18 23:18 | Muhammad Usama | Note Added: 0002860 | |
| 2019-09-19 00:17 | nglp | Note Added: 0002862 | |
| 2019-09-19 13:26 | t-ishii | Note Added: 0002864 | |
| 2019-09-19 13:29 | t-ishii | Note Edited: 0002864 | |
| 2019-09-19 18:24 | Muhammad Usama | Note Added: 0002865 | |
| 2019-09-19 18:26 | Muhammad Usama | Note Added: 0002866 | |
| 2019-09-19 19:17 | nglp | Note Added: 0002867 | |
| 2019-09-19 20:39 | t-ishii | File Added: regression_log.tar.gz | |
| 2019-09-19 20:39 | t-ishii | Note Added: 0002868 | |
| 2019-09-20 18:50 | Muhammad Usama | Note Added: 0002869 | |
| 2019-09-20 20:58 | t-ishii | File Added: 011-012.tar.gz | |
| 2019-09-20 20:58 | t-ishii | Note Added: 0002870 | |
| 2019-09-20 23:39 | Muhammad Usama | File Added: reg_logs_usama.zip | |
| 2019-09-20 23:39 | Muhammad Usama | Note Added: 0002871 | |
| 2019-09-21 08:03 | t-ishii | Note Added: 0002872 | |
| 2019-09-24 15:31 | Muhammad Usama | Note Added: 0002874 | |
| 2019-09-24 15:39 | t-ishii | Note Added: 0002875 | |
| 2019-09-24 16:23 | nglp | Note Added: 0002876 | |
| 2019-09-24 17:05 | Muhammad Usama | Note Added: 0002877 | |
| 2019-09-24 20:21 | nglp | File Added: server1.txt-2.gz | |
| 2019-09-24 20:21 | nglp | File Added: server2.txt-2.gz | |
| 2019-09-24 20:21 | nglp | File Added: server3.txt-2.gz | |
| 2019-09-24 20:21 | nglp | Note Added: 0002878 | |
| 2019-09-25 19:09 | Muhammad Usama | Note Added: 0002880 | |
| 2019-09-25 19:57 | nglp | File Added: pgpool-II-4.0.4.fix.tgz | |
| 2019-09-25 19:57 | nglp | Note Added: 0002881 | |
| 2019-09-25 20:53 | Muhammad Usama | File Added: pgpool-II-4.0.4-patched.tar.gz | |
| 2019-09-25 20:53 | Muhammad Usama | Note Added: 0002882 | |
| 2019-09-25 22:11 | nglp | File Added: server1.txt-3.gz | |
| 2019-09-25 22:11 | nglp | File Added: server2.txt-3.gz | |
| 2019-09-25 22:11 | nglp | File Added: server3.txt-3.gz | |
| 2019-09-25 22:11 | nglp | Note Added: 0002883 | |
| 2019-09-25 23:41 | Muhammad Usama | Note Added: 0002884 | |
| 2019-09-29 06:06 | Muhammad Usama | Note Added: 0002892 | |
| 2019-09-29 06:07 | Muhammad Usama | Status | assigned => resolved |
| 2019-09-29 06:07 | Muhammad Usama | Resolution | open => fixed |
| 2019-09-29 06:07 | Muhammad Usama | Note Added: 0002893 | |
| 2019-10-01 11:53 | administrator | Target Version | => 4.0.7 |
| 2019-10-31 18:40 | administrator | Fixed in Version | => 4.0.7 |