View Issue Details

ID: 0000545
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2019-10-31 18:40
Reporter: nglp
Assigned To: Muhammad Usama
Priority: normal
Severity: major
Reproducibility: random
Status: resolved
Resolution: fixed
Product Version: 4.0.4
Target Version: 4.0.7
Fixed in Version: 4.0.7
Summary: 0000545: Quorum lost and not recovered
Description

Hi,

We have a 3-node Pgpool-II setup in front of a 3-node PostgreSQL master-standby-standby cluster with streaming replication, running on VMware VMs.

The Pgpool-II health check method is heartbeat.

We had an incident: our backup method (VM snapshot) froze the watchdog master node (node2 at that moment) for longer than wd_heartbeat_deadtime, so the VIP was released. That is normal behaviour.

The issue came afterwards: the quorum was lost and never recovered, the VIP was not reacquired by either of the other two nodes, and the service went down.

Because the quorum never recovered, we had to restart the entire Pgpool-II stack, even though checking the watchdog state in pgpool still indicated a master node. The same hang has happened in the past with the quorum recovered and the VIP reacquired successfully, so I guess the outcome is random.

Find attached our logs and our pgpool configuration.

We will be happy to answer any questions.

Thanks, Guille
Steps To Reproduce

This has happened after several master failovers caused by the same kind of hang, but in short:
* Hang the watchdog master node for slightly longer than wd_heartbeat_deadtime
* The quorum is lost

Additional Information

node1, node2 and node3 have different wd_priority values, in that order; the other pgpool settings are exactly the same.

node2 was the watchdog master at that moment, and remained so after the quorum was lost.

Tags: No tags attached.

Activities

nglp

2019-09-09 20:25

reporter  

node1.txt (983 bytes)
node2.txt (5,123 bytes)
node3.txt (983 bytes)
pgpool.conf (4,479 bytes)

t-ishii

2019-09-12 10:36

developer   ~0002835

I pinged Usama, who is an authority on the watchdog, a few days ago to ask him to look into this issue.
You had probably better wait for his response.

In the meantime, I noticed that you set wd_heartbeat_deadtime = 5, which means that if the watchdog does not receive a heartbeat signal within 5 seconds, that pgpool node is regarded as dead (which is exactly your situation). As a workaround, you could set it to something longer, say 300 seconds, so that the server is not regarded as dead while it is being backed up.
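Expressed as pgpool.conf lines, the suggested workaround might look like this (300 is the illustrative figure from this note, not a recommendation; pick a value longer than your longest expected snapshot pause):

```ini
# pgpool.conf -- illustrative workaround value only.
# Declare a node dead only after a heartbeat silence longer than the
# longest expected backup/snapshot pause.
wd_heartbeat_deadtime = 300   # seconds without heartbeat before the node is considered dead
```

The trade-off is failover latency: the larger this value, the longer a genuinely dead node keeps the VIP before the cluster reacts.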

nglp

2019-09-12 16:09

reporter   ~0002836

Hi t-ishii,

Thank you for your response; we will wait for him :)

Yes, once we had identified that timeout, we increased wd_heartbeat_deadtime to 30 (from 5) and wd_interval to 10 (from 2) to avoid these problems. Those are the default values in the streaming replication cluster example configuration.

We had configured lower values in the past to allow the fastest possible failover.

Best regards, Guille

t-ishii

2019-09-12 16:49

developer   ~0002837

This one:
https://www.pgpool.net/mantisbt/view.php?id=547
seems to be a similar issue?

nglp

2019-09-12 17:19

reporter   ~0002838

Hi,

No, it is not the same. If you look at the attached logs, the following things happen:
* Heartbeat signals are lost
* The quorum is lost
* Watchdog de-escalation starts
* The VIP is removed
* Heartbeat signals come back

But at this point the quorum is not recovered, so the VIP stays lost.

The issue in our case is that the quorum is never recovered, even though the heartbeat signals are back.

Best regards

Muhammad Usama

2019-09-12 18:54

developer   ~0002839

Hi

The log files you shared do not have much information, and it is difficult to guess what could have gone wrong.
Can you please share the complete pgpool logs for all three nodes?

nglp

2019-09-12 19:07

reporter   ~0002840

Hi,

I am sorry to say that is the only output pgpool generated for this issue.

If you want, I can attach the entire logs, but they are filled with non-relevant information.

Let me know

Thanks, Guille

Muhammad Usama

2019-09-12 19:17

developer   ~0002841

Are you able to reproduce the issue reliably? If so, sharing the exact steps would be enough.

Otherwise, please share the entire log along with the time when the issue happened, and I will try to see
whether some usable information can be extracted from it.

Thanks

nglp

2019-09-12 19:27

reporter   ~0002842

I am setting up a new environment to run some tests; the others are all production environments.

The way I think you can reproduce it is:
* Set low wd_heartbeat_deadtime and wd_interval values (e.g. 2 and 1)
* Take a virtual machine snapshot of the watchdog master node
* Check what happens

Expect a random result; on other occasions the VIP has failed over successfully and the quorum has been recovered correctly.
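For anyone without convenient VM-snapshot access, a rough stand-in for the freeze is pausing a process with SIGSTOP for longer than wd_heartbeat_deadtime. This sketch uses a placeholder `sleep` process rather than pgpool itself (an assumption; on a real node you would stop the pgpool processes, and a snapshot freezes the whole VM, not just one process):

```shell
# Approximate the snapshot-induced hang by freezing a process with SIGSTOP.
# A 'sleep' stands in for pgpool here; on a real watchdog master you would
# send the signals to the pgpool processes instead.
sleep 60 &                           # placeholder for the pgpool master process
pid=$!
kill -STOP "$pid"                    # freeze: heartbeats from this node would stop
sleep 3                              # stay frozen longer than wd_heartbeat_deadtime=2
state=$(ps -o stat= -p "$pid" | tr -d ' ')
echo "state while frozen: $state"    # 'T' means stopped
kill -CONT "$pid"                    # thaw: heartbeats would resume
kill "$pid" 2>/dev/null
```

Whether the quorum recovers after the CONT is precisely the behaviour under test here.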

I will prepare the logs for upload (I have to replace some strings, such as hostnames and IPs, and remove some queries).

Thanks, Guille

nglp

2019-09-13 21:51

reporter   ~0002846

Hi,

I am back with "good" news: I am able to reproduce the issue in my test environment with the attached pgpool.conf and the following modifications:
* wd_interval = 1
* wd_heartbeat_deadtime = 2
* log_error_verbosity = verbose
* client_min_messages = debug5
* log_min_messages = debug1
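The same overrides as pgpool.conf lines (values copied from the list above):

```ini
# Reproduction overrides applied on top of the attached pgpool.conf
wd_interval = 1                   # watchdog health check interval (seconds)
wd_heartbeat_deadtime = 2         # declare a node dead after 2s of heartbeat silence
log_error_verbosity = verbose     # include LOCATION (file:line) in log messages
client_min_messages = debug5
log_min_messages = debug1
```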

Attaching the logs. pgpool was stopped, the log was cleared, pgpool was started, I waited 2 minutes, reproduced the issue, waited 1 minute and then stopped pgpool.

Let me know if you need a higher debug level on log_min_messages.

Thanks, Guille

nglp

2019-09-13 21:51

reporter  

server1.txt.gz (115,952 bytes)
server2.txt.gz (112,256 bytes)
server3.txt.gz (116,210 bytes)

Muhammad Usama

2019-09-13 23:02

developer   ~0002848

Hi

Thanks a bundle for the logs. I am looking into them.

Muhammad Usama

2019-09-17 05:05

developer   ~0002853

Hi

Thanks for providing the logs. I was able to identify the issue using them.
I am working on the fix.

Thanks

nglp

2019-09-17 16:37

reporter   ~0002857

Hi,

Thank you so much for your work.

If we can help in any way (reproducing the issue more times, trying other parameters, etc.), just let me know.

Best regards, Guille

Muhammad Usama

2019-09-18 23:18

developer   ~0002860

Hi Guille,

Can you please try the attached patch to see whether it fixes the issue.

Thanks
Best Regards
Muhammad Usama
recover_quorum_fix.diff (8,328 bytes)

nglp

2019-09-19 00:17

reporter   ~0002862

Hi,

Should I download the source code of 4.0.4 and compile it with the patch, or is the patch for another version? It is currently installed from the rpm package in the repository.

Best regards

t-ishii

2019-09-19 13:26

developer   ~0002864

Last edited: 2019-09-19 13:29

After applying the patch to the master branch, I get a few regression test failures:
testing 004.watchdog...ok.
testing 011.watchdog_quorum_failover...failed.
testing 012.watchdog_failover_when_quorum_exists...failed.
testing 013.watchdog_failover_require_consensus...failed.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.

013 has already been failing recently, but the 011 and 012 failures are new.

Muhammad Usama

2019-09-19 18:24

developer   ~0002865

Somehow I am getting a clean regression run with the patch.

..
testing 006.memqcache...ok.
testing 007.memqcache-memcached...ok.
testing 008.dbredirect...ok.
testing 009.sql_comments...ok.
testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok.
testing 025.enable_shared_relcache...ok.
testing 026.temp_table...ok.
testing 027.auto_failback...ok.
testing 028.watchdog_enable_consensus_with_half_votes...ok.
testing 050.bug58...ok.
testing 051.bug60...ok.
..

So I will try to figure out why it could be failing on your system.
I am running the regression on the master branch with PG12.

Muhammad Usama

2019-09-19 18:26

developer   ~0002866

Hi Guille,

If you have a development environment set up on your system, you can pull the code, apply the patch on the master branch
(I believe the same patch should also work with the 4_0 branch) and test the scenario.

nglp

2019-09-19 19:17

reporter   ~0002867

Hi Muhammad,

I have downloaded the source code for version 4.0.4, applied the patch, compiled it on all machines and started the service with the new binary.

I am sad to report that the issue still occurs.

The PostgreSQL version in this environment is 11.4 (I saw while compiling that it needs some of its libraries).

Best regards

t-ishii

2019-09-19 20:39

developer   ~0002868

The regression logs for 011 and 012 are attached.
This is the master branch + PostgreSQL 11.4.
regression_log.tar.gz (1,891 bytes)

Muhammad Usama

2019-09-20 18:50

developer   ~0002869

Hi Ishii-San

Is it possible for you to also share the pgpool.log files for both of these test cases? I am still not able to reproduce the failures running against either PG11 or PG12.

t-ishii

2019-09-20 20:58

developer   ~0002870

Sure.
011-012.tar.gz (6,663 bytes)

Muhammad Usama

2019-09-20 23:39

developer   ~0002871

Hi Ishii-San

Many thanks for providing the logs. There is something strange here: I am not able to work out what is going wrong with your setup.
However, I noticed that the logs you shared do not contain the health-check debug messages, like the one below:

WARNING: check_backend_down_request: failed to open file standby2/log/backend_down_request

So I am wondering: was pgpool-II compiled with health check debugging enabled?

I am also attaching the corresponding logs from running the regression on my system, in case you can spot a difference that identifies what could be going wrong.

Thanks
reg_logs_usama.zip (29,726 bytes)

t-ishii

2019-09-21 08:03

developer   ~0002872

Oops, you are right. I did not enable health check debugging while compiling. After enabling it, the 011 and 012 tests succeed.
I am terribly sorry for the misinformation.

Muhammad Usama

2019-09-24 15:31

developer   ~0002874

Hi Ishii-San

;-) No problem, it happens to me all the time.

I have now been able to test the patch thoroughly using the new wd_cli utility I checked in yesterday, and the patch behaves as expected.
Do you agree that I should go ahead and commit it?

t-ishii

2019-09-24 15:39

developer   ~0002875

Not sure. I think you need to get logs from nglp to find out why his issue was not resolved by the patch.

nglp

2019-09-24 16:23

reporter   ~0002876

Hi,

Will do; it will probably take 1-2 days.

Thanks, Guille

Muhammad Usama

2019-09-24 17:05

developer   ~0002877

Hi Ishii-San
Sorry, my bad. I somehow overlooked the note from Guille and did not realise he had reported that the patch did not fix the issue.

@Guille I will look into the issue once you share the pgpool log files with the patch applied.
Thanks in advance.

Kind Regards
Muhammad Usama

nglp

2019-09-24 20:21

reporter   ~0002878

Hi,

Attaching the logs. pgpool was stopped, the log was cleared, pgpool was started, I waited 2 minutes, reproduced the issue, waited 1 minute and then stopped pgpool.

Compiled PGPool 4.0.4 + Patch
PostgreSQL 11.4

Let me know

Best regards
server1.txt-2.gz (106,666 bytes)
server2.txt-2.gz (97,357 bytes)
server3.txt-2.gz (97,602 bytes)

Muhammad Usama

2019-09-25 19:09

developer   ~0002880

Hi Guille,

Many thanks for providing the logs. Looking at them, I suspect there is some issue with how the patch was applied,
since the code line numbers in the log messages appear at an offset from where they should be.
For example:
Sep 24 07:16:20 server1 pgpool[15336]: [557-1] 2019-09-24 07:16:20: pid 15336: DEBUG: I am the cluster leader node command finished with status:[COMMAND TIMEED OUT] which is success
Sep 24 07:16:20 server1 pgpool[15336]: [557-2] 2019-09-24 07:16:20: pid 15336: DETAIL: The command was sent to 2 nodes and 1 nodes replied to it
Sep 24 07:16:20 server1 pgpool[15336]: [557-3] 2019-09-24 07:16:20: pid 15336: LOCATION: watchdog.c:5548

With the pgpool-II 4.0.4 source code this line should read "LOCATION: watchdog.c:5517", not "LOCATION: watchdog.c:5548".

The other log messages suggest a similar problem with the code.
So is it possible for you to send me the code with the patch applied? Alternatively, I can send you the built binaries (if you are using macOS or CentOS 7), whichever suits you better.

Thanks in advance for helping in debugging this.

Best Regards

nglp

2019-09-25 19:57

reporter   ~0002881

Hi,

Sure, please find my pgpool-II 4.0.4 + patch attached.

It is probably my fault in applying it; I did it manually. So if you have the source code with the patch properly applied, I can compile it in my test environment (Red Hat 7).
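One way to catch a mis-applied patch before compiling is `patch --dry-run`, which reports rejected hunks without modifying the tree. The snippet below demonstrates the technique on a tiny stand-in file; in the real case you would run something like `patch -p1 --dry-run < recover_quorum_fix.diff` (the strip level depends on the paths in the diff) from the pgpool-II source root:

```shell
# Demonstrate 'patch --dry-run' as a pre-build sanity check, using a
# throwaway file and diff as stand-ins for the real source tree and patch.
workdir=$(mktemp -d)
cd "$workdir"
printf 'old line\n' > watchdog.c
printf -- '--- watchdog.c\n+++ watchdog.c\n@@ -1 +1 @@\n-old line\n+new line\n' > fix.diff
if patch -p0 --dry-run < fix.diff > /dev/null; then
    echo "patch applies cleanly"
else
    echo "patch does NOT apply cleanly; inspect the reported hunks"
fi
```

A dry run that reports offsets, fuzz, or rejects would explain log LOCATION lines pointing at shifted line numbers.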

Also, we are open to updating Pgpool-II to the latest version (some days ago we suffered a segfault in a connection pool process; we are still trying to reproduce it or obtain a core dump to open an issue), so if you have the patch applied on top of 4.0.6, that will work for us too.

Many thanks to you, Guille
pgpool-II-4.0.4.fix.tgz (3,898,639 bytes)

Muhammad Usama

2019-09-25 20:53

developer   ~0002882

Hi

Many thanks for the prompt reply. My suspicion was correct: there was an issue with the patching.
I am attaching the patched code for the same pgpool-II 4.0.4. Can you please try it out?
Once you verify the fix, I will apply it to all supported branches.

Thanks

nglp

2019-09-25 22:11

reporter   ~0002883

Hi,

Good news: the issue looks to be solved with that code.

Attaching the logs. pgpool was stopped, the log was cleared, pgpool was started, I waited 2 minutes, reproduced the issue, waited 1 minute and then stopped pgpool.

Beyond what is in the logs, I repeated the scenario about 10-15 times, and every time it went fine.

Thank you all for your work

Best regards, Guille
server1.txt-3.gz (117,111 bytes)
server2.txt-3.gz (106,937 bytes)
server3.txt-3.gz (107,761 bytes)

Muhammad Usama

2019-09-25 23:41

developer   ~0002884

Hi Guille,

Many thanks for testing and confirming the fix. I will commit the patch to all affected branches.

Best Regards
Muhammad Usama

Muhammad Usama

2019-09-29 06:06

developer   ~0002892

I have committed the fix to all supported branches

https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106

Muhammad Usama

2019-09-29 06:07

developer   ~0002893

Fix committed to all supported branches


https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=963ac83ce19dbb3d1a423ae5ce582c6866b47106

Issue History

Date Modified Username Field Change
2019-09-09 20:25 nglp New Issue
2019-09-09 20:25 nglp File Added: node1.txt
2019-09-09 20:25 nglp File Added: node2.txt
2019-09-09 20:25 nglp File Added: node3.txt
2019-09-09 20:25 nglp File Added: pgpool.conf
2019-09-12 10:36 t-ishii Note Added: 0002835
2019-09-12 16:09 nglp Note Added: 0002836
2019-09-12 16:17 t-ishii Assigned To => Muhammad Usama
2019-09-12 16:17 t-ishii Status new => assigned
2019-09-12 16:49 t-ishii Note Added: 0002837
2019-09-12 17:19 nglp Note Added: 0002838
2019-09-12 18:54 Muhammad Usama Note Added: 0002839
2019-09-12 19:07 nglp Note Added: 0002840
2019-09-12 19:17 Muhammad Usama Note Added: 0002841
2019-09-12 19:27 nglp Note Added: 0002842
2019-09-13 21:51 nglp Note Added: 0002846
2019-09-13 21:51 nglp File Added: server1.txt.gz
2019-09-13 21:51 nglp File Added: server2.txt.gz
2019-09-13 21:51 nglp File Added: server3.txt.gz
2019-09-13 23:02 Muhammad Usama Note Added: 0002848
2019-09-17 05:05 Muhammad Usama Note Added: 0002853
2019-09-17 16:37 nglp Note Added: 0002857
2019-09-18 23:18 Muhammad Usama File Added: recover_quorum_fix.diff
2019-09-18 23:18 Muhammad Usama Note Added: 0002860
2019-09-19 00:17 nglp Note Added: 0002862
2019-09-19 13:26 t-ishii Note Added: 0002864
2019-09-19 13:29 t-ishii Note Edited: 0002864
2019-09-19 18:24 Muhammad Usama Note Added: 0002865
2019-09-19 18:26 Muhammad Usama Note Added: 0002866
2019-09-19 19:17 nglp Note Added: 0002867
2019-09-19 20:39 t-ishii File Added: regression_log.tar.gz
2019-09-19 20:39 t-ishii Note Added: 0002868
2019-09-20 18:50 Muhammad Usama Note Added: 0002869
2019-09-20 20:58 t-ishii File Added: 011-012.tar.gz
2019-09-20 20:58 t-ishii Note Added: 0002870
2019-09-20 23:39 Muhammad Usama File Added: reg_logs_usama.zip
2019-09-20 23:39 Muhammad Usama Note Added: 0002871
2019-09-21 08:03 t-ishii Note Added: 0002872
2019-09-24 15:31 Muhammad Usama Note Added: 0002874
2019-09-24 15:39 t-ishii Note Added: 0002875
2019-09-24 16:23 nglp Note Added: 0002876
2019-09-24 17:05 Muhammad Usama Note Added: 0002877
2019-09-24 20:21 nglp File Added: server1.txt-2.gz
2019-09-24 20:21 nglp File Added: server2.txt-2.gz
2019-09-24 20:21 nglp File Added: server3.txt-2.gz
2019-09-24 20:21 nglp Note Added: 0002878
2019-09-25 19:09 Muhammad Usama Note Added: 0002880
2019-09-25 19:57 nglp File Added: pgpool-II-4.0.4.fix.tgz
2019-09-25 19:57 nglp Note Added: 0002881
2019-09-25 20:53 Muhammad Usama File Added: pgpool-II-4.0.4-patched.tar.gz
2019-09-25 20:53 Muhammad Usama Note Added: 0002882
2019-09-25 22:11 nglp File Added: server1.txt-3.gz
2019-09-25 22:11 nglp File Added: server2.txt-3.gz
2019-09-25 22:11 nglp File Added: server3.txt-3.gz
2019-09-25 22:11 nglp Note Added: 0002883
2019-09-25 23:41 Muhammad Usama Note Added: 0002884
2019-09-29 06:06 Muhammad Usama Note Added: 0002892
2019-09-29 06:07 Muhammad Usama Status assigned => resolved
2019-09-29 06:07 Muhammad Usama Resolution open => fixed
2019-09-29 06:07 Muhammad Usama Note Added: 0002893
2019-10-01 11:53 administrator Target Version => 4.0.7
2019-10-31 18:40 administrator Fixed in Version => 4.0.7