View Issue Details

ID: 0000547    Project: Pgpool-II    Category: Bug    View Status: public    Last Update: 2019-10-31 18:41
Reporter: harukat    Assigned To: Muhammad Usama
Priority: normal    Severity: major    Reproducibility: always
Status: assigned    Resolution: open
Product Version: 3.7.11
Target Version: 3.7.12    Fixed in Version: 3.7.12
Summary: 0000547: We need to do arping again after recovering from split-brain.
Description:
Pgpool-II should do arping again (or run wd_IP_down() and wd_IP_up() again)
on the master/coordinator node after recovering from split-brain.
In the following scenario, Pgpool-II doesn't work well.

scenario:

1. There are watchdog cluster nodes: n1, n2, and n3.
   n3 is the master/coordinator. The VIP is set on n3.

2. Network trouble occurs.
   n1 and n2 decide that n3 is down.
   n2 becomes the new master/coordinator; the VIP is set on n2 with arping.

3. The network recovers.
   n1, n2, and n3 notice the split-brain status.
   They decide n3 is the best master/coordinator.
   n2 resigns as master/coordinator. The VIP is released on n2.

4. The VIP has been set on n3, but the ARP tables still point to n2 as the VIP target.
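When step 4 occurs, a gratuitous ARP sent from the node that currently holds the VIP refreshes the clients' ARP caches. A minimal sketch of such a manual workaround follows; the VIP address 10.10.10.152 matches the delegate IP in the logs later in this thread, the interface eth0 is an assumption, and for safety the sketch only echoes the command:

```shell
#!/bin/bash
# Hypothetical manual workaround: re-announce the VIP from the node that
# currently holds it (here n3). VIP and interface are illustrative.
ARPING=${ARPING:-echo arping}   # remove the "echo" default to really send ARPs
$ARPING -U -I eth0 -c 3 10.10.10.152   # -U: unsolicited (gratuitous) ARP
```

This is exactly the kind of re-announcement that Pgpool-II's own escalation step is supposed to perform via arping_cmd.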
Tags: No tags attached.

Activities

t-ishii

2019-09-13 15:57

developer   ~0002845

Last edited: 2019-09-13 17:07

According to Usama's explanation in a similar case:
https://www.pgpool.net/pipermail/pgpool-general/2019-August/006733.html
After "3. Network recovers."
> n1, n2, and n3 notice split-brain status.
> They decide n3 is best master/coordinator.
> n2 resigns master/coordinator. VIP is released on n2.
n2 should not have resigned as master in the first place. Since the other case in the upthread of the URL above looks quite similar, I guess there's something wrong in the watchdog code handling the case when the former master node comes back.

So, as for
> We need to do arping again after recovering from split-brain.
the point is not there.

Muhammad Usama

2019-09-13 21:57

developer   ~0002847

Last edited: 2019-09-13 21:58

In the above-explained scenario, as per the current design, the watchdog should behave as follows:

1. There are watchdog cluster nodes: n1, n2, and n3.
   n3 is the master/coordinator. The VIP is set on n3.

2. Network trouble occurs.
   n1 and n2 decide that n3 is down.
   n2 becomes the new master/coordinator; the VIP is set on n2 with arping.

2-b. At the same time, n3 would perform de-escalation, since it has lost the quorum.
     Even if n3 stays as master, it would remove the VIP because the quorum is lost.

3. The network recovers.
   n1, n2, and n3 notice the split-brain status.
   They decide n3 is the best master/coordinator.
   n2 resigns as master/coordinator. The VIP is released on n2.

4. Even in this case, if n3 is chosen as the best master, n3 will perform
   the escalation again, because it had already performed de-escalation when the
   quorum was lost. And when the escalation is performed again after recovering from
   split-brain, the arping step is performed again.
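For context, the escalation described here (bringing up the VIP and sending the gratuitous ARP) is driven by the watchdog parameters in pgpool.conf. The values below are illustrative, in the style of the documented defaults, and are not taken from the reporter's configuration:

```
# pgpool.conf (watchdog section) -- illustrative values, not the reporter's config
delegate_IP = '10.10.10.152'
if_up_cmd   = 'ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = 'ip addr del $_IP_$/24 dev eth0'
arping_cmd  = 'arping -U $_IP_$ -w 1'
```

arping_cmd is what re-announces the VIP after if_up_cmd brings it up, so skipping escalation means skipping this ARP refresh as well.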

Now, if step 4 of the original scenario ("VIP has been set on n3, but the ARP table still says n2 is the VIP target") is actually happening, it means the watchdog did not perform de-escalation when the quorum was lost.

If that is the case, can you please provide the log files from when it happened?

harukat

2019-09-17 11:10

developer   ~0002855

This is the log for this issue.
The IP addresses and host names are modified from our customer's originals.
n3 had been the master/coordinator with the VIP for a long time.
When n1 and n2 decided n3 was lost, n3 didn't log any watchdog errors.
lost_arping_case.log (239,651 bytes)

harukat

2019-09-25 13:03

developer   ~0002879

Our customer says:
"In our case, n3 did not seem to know the quorum was lost. For most of that network partition,
n3 could *see* n1 and n2, but n1 and n2 could not *see* n3.
I'm not certain whether n3 had a way to know the quorum was lost during the network partition.
We do know with certainty that the end result was two production outages within 3 days."

harukat

2019-09-27 12:19

developer   ~0002885

This is a simple patch for V3_7_STABLE to do arping after recovering from split-brain.
It passed the regression test.

t-ishii

2019-09-27 12:32

developer   ~0002886

The regression suite does not include any VIP/ARP test. That means passing the regression test says nothing about whether your patch is OK as far as the VIP/ARP issue is concerned.

t-ishii

2019-09-27 13:14

developer   ~0002887

Have you actually tested the patch with the case described above?

harukat

2019-10-01 18:10

developer   ~0002896

No, I haven't recreated the reported situation.

t-ishii

2019-10-02 09:36

developer   ~0002898

We do not accept untested patches.

Muhammad Usama

2019-10-02 23:07

developer   ~0002899

Hi, Harukat,

First of all, sorry for the delayed response, and thanks a lot for providing the patch and log files.

The log file contained the messages from all three nodes and was quite huge, so it took me a while to understand
the issue.

Anyhow, after reviewing the log files I realized that there was confusion in the watchdog code about how to deal
with life-check failure scenarios, especially the cases where the life-check reports a node failure while the
watchdog core is still able to communicate with the remote nodes, and the case where node A's life-check reports
node B as lost while B still thinks A is alive and healthy.

So I have reviewed the whole watchdog design around the life-check reports and made some fixes in that area
in the attached patch.

You can try the attached patch and test your scenario to see if you still end up in the situation described
in the initial bug report.
The patch is generated against the current MASTER branch; I will commit it after a little more testing and then backport
it to all supported branches.

Finally, the original idea in your patch and bug report, doing arping after recovering from split-brain, seems reasonable,
but your patch needs a little more thought on the design, since executing the wd_arping function from the watchdog main process
is not the right thing to do.
But effectively, after my patch (attached), you should never end up in a situation where
multiple watchdog nodes perform the escalation, provided you are using an odd number of Pgpool-II nodes.
The arping solution you suggested should only be needed when the watchdog cluster is configured with an
even number of nodes and a network partition divides the network in such a way that both partitions get exactly
half of the nodes each.
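To illustrate the odd/even distinction, the quorum rule can be sketched as a strict majority of the configured nodes. This is a simplification, not pgpool's actual implementation (pgpool 4.1 also offers enable_consensus_with_half_votes, which relaxes the rule for even-sized clusters):

```shell
#!/bin/bash
# Sketch of the quorum arithmetic only; not pgpool's actual code.
quorum_exists() {   # $1 = total configured nodes, $2 = reachable nodes
    [ "$2" -gt $(( $1 / 2 )) ]   # strict majority required
}

quorum_exists 3 2 && echo "3 nodes, 2 reachable: quorum held, escalation allowed"
quorum_exists 3 1 || echo "3 nodes, 1 reachable: quorum lost, node de-escalates"
quorum_exists 4 2 || echo "4 nodes split 2/2: neither side has a strict majority"
```

With an odd node count, at most one partition can hold a strict majority, so at most one side escalates; with an exact even split neither side does, unless half votes are allowed to count as quorum.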

Thanks
Best regards
Muhammad Usama
watchdog_node_lost_fix.diff (54,091 bytes)   

t-ishii

2019-10-02 23:28

developer   ~0002900

Usama,
After applying the patch, one of the watchdog regression tests failed.
testing 011.watchdog_quorum_failover...failed.

Regression log attached.

t-ishii

2019-10-02 23:32

developer   ~0002902

pgpool.log is also attached.
pgpool-logs.tar.gz (5,493 bytes)

Muhammad Usama

2019-10-03 00:19

developer   ~0002903

Hi Ishii-San,

I am looking into the attached logs.

Somehow the regression always passes on my machine :-)

testing 011.watchdog_quorum_failover...ok.
testing 012.watchdog_failover_when_quorum_exists...ok.
testing 013.watchdog_failover_require_consensus...ok.
testing 014.watchdog_test_quorum_bypass...ok.
testing 015.watchdog_master_and_backend_fail...ok.
testing 016.node_0_is_not_primary...ok.
testing 017.node_0_is_down...ok.
testing 018.detach_primary...ok.
testing 019.log_client_messages...ok.
testing 020.allow_clear_text_frontend_auth...ok.
testing 021.pool_passwd_auth...ok.
testing 022.pool_passwd_alternative_auth...ok.
testing 023.ssl_connection...ok.
testing 024.cert_auth...ok.

Muhammad Usama

2019-10-03 00:33

developer   ~0002904

From the look of it, it seems the PostgreSQL server setup is failing in the test case.

See line 23 in the attached 011.watchdog_quorum_failover:

 23 recovery node 1...ERROR: executing recovery, execution of command failed at "1st stage"
 24 DETAIL: command:"basebackup.sh"

But I can't figure out what the reason for that could be.

t-ishii

2019-10-03 07:13

developer   ~0002905

Oops. You are right. I had broken my pgpool setup while migrating to PostgreSQL 12. Sorry for the noise.

harukat

2019-10-03 13:33

developer   ~0002906

Thanks, Usama.
I'll test the patch in the V4_1_STABLE environment.

It looks like a big change.
Should I not expect a fix for the 3.7.x versions?

Muhammad Usama

2019-10-03 18:19

developer   ~0002907

Once you verify the fix, I will backport it to all supported branches, including 3.7.

Thanks

harukat

2019-10-08 13:31

developer   ~0002912

Last edited: 2019-10-08 13:32

I did a test that runs Pgpool-II 4.1.x nodes on an artificially unstable network by executing the attached script.
This make_splitbrain.sh script randomly blocks the watchdog communication on whichever node is currently master.
In spite of the severe network environment, the patched Pgpool-II instances ran well for a long time (over 1 day) in most cases.
They often recovered from split-brain status successfully.

I also got one failed case. Its log is in the attached test_20191007.tgz file.
I'm sorry, it may be difficult to read because it includes Japanese-locale output.
In that case, I needed to execute arping to be able to use Pgpool-II via the VIP.
The log says at the end:
  Oct 7 16:19:00 cp2 [16456]: [2889-1] 2019-10-07 16:19:00: pid 16456: LOG: successfully acquired the delegate IP:"10.10.10.152"
  Oct 7 16:19:00 cp2 [16456]: [2889-2] 2019-10-07 16:19:00: pid 16456: DETAIL: 'if_up_cmd' returned with success
But I could not access Pgpool-II on host cp2 via the VIP without arping. I couldn't find the cause.

make_splitbrain.sh (999 bytes)
test_20191007.tgz (113,231 bytes)
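The attached make_splitbrain.sh is not reproduced here; the following is only a rough sketch of the idea it describes, under the assumptions that iptables is used to drop traffic and that the watchdog listens on pgpool's default wd_port of 9000 (the real script may differ). For safety the sketch only echoes the iptables commands:

```shell
#!/bin/bash
# Rough sketch of a split-brain-inducing test loop (assumptions: iptables,
# wd_port 9000; the attached make_splitbrain.sh may differ).
IPTABLES=${IPTABLES:-echo iptables}   # remove "echo" to really change the firewall
WD_PORT=9000

block_watchdog()   { $IPTABLES -A INPUT -p tcp --dport "$WD_PORT" -j DROP; }
unblock_watchdog() { $IPTABLES -D INPUT -p tcp --dport "$WD_PORT" -j DROP; }

for i in 1 2 3; do
    block_watchdog                 # start an artificial partition
    sleep $(( RANDOM % 3 + 1 ))    # (a real test would keep it blocked far longer)
    unblock_watchdog               # let the cluster resolve the split-brain
    sleep $(( RANDOM % 3 + 1 ))
done
```

Blocking only the watchdog port while leaving the rest of the network intact is what produces the asymmetric visibility described earlier in the thread, where some nodes declare a peer lost while it still considers itself healthy.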

Muhammad Usama

2019-10-11 00:19

developer   ~0002918

Hi Harukat,

First of all, many thanks for doing thorough testing.
I have gone through all the attached logs, and it seems that the watchdog is behaving correctly. Despite some
flooding of log messages, I can't see any issue in the logs at the moment.

As for the problem you described, that you had to do arping on cp2 after the escalation at 'Oct 7 16:19:00':

I believe it is not because of anything wrong done by Pgpool-II or the watchdog (at least nothing in the logs suggests
anything wrong with the watchdog or pgpool that could have caused it).

If you look at the logs of all three nodes around that time, you can see that only one node, cp2, performed
the escalation and brought up the VIP, and no other node acquired the VIP after that time.
And since the pgpool escalation process does perform arping and ping after acquiring the VIP, I am guessing some
external factor might be involved, because nothing in the logs points to any situation that could require a manual arping.

I am not sure, but I wonder whether this was caused by the nature of the test case, since it produced very frequent escalations and de-escalations.
Do you think a network switch or VM host might have played a role in it?
For example, the last de-escalation on cp2 happened at "7 16:18:45" (VIP released by cp2) and the new escalation started
at "7 16:18:56" (VIP acquired again), so there was only about a 10-second gap between the release and the acquisition of the VIP on cp2.
I am just thinking out loud: what if the ARP tables on the client machine got it wrong because of these frequent updates,
and the client machine simply never received the new VIP record?
Though I still think that might not be the case, and some other external element caused it.

harukat

2019-10-15 11:54

developer   ~0002927

Thank you for confirming my report.
Since you see no cause for the failed case in the Pgpool-II code,
I also think some external element caused it, because the log suggests the code must have done arping last.
Also, my test script could start dropping packets on the VIP-holder host immediately
after its escalation, without any stable period. That can be called a kind of double fault, so I think
it's OK that the patched Pgpool-II code doesn't cover such a case.

Issue History

Date Modified Username Field Change
2019-09-12 16:37 harukat New Issue
2019-09-12 16:52 t-ishii Assigned To => Muhammad Usama
2019-09-12 16:52 t-ishii Status new => assigned
2019-09-13 15:57 t-ishii Note Added: 0002845
2019-09-13 17:07 t-ishii Note Edited: 0002845
2019-09-13 21:57 Muhammad Usama Note Added: 0002847
2019-09-13 21:58 Muhammad Usama Note Edited: 0002847
2019-09-17 11:10 harukat File Added: lost_arping_case.log
2019-09-17 11:10 harukat Note Added: 0002855
2019-09-25 13:03 harukat Note Added: 0002879
2019-09-27 12:19 harukat File Added: pgpool2_V3_7_STSBLE_arping_again.patch
2019-09-27 12:19 harukat Note Added: 0002885
2019-09-27 12:32 t-ishii Note Added: 0002886
2019-09-27 13:14 t-ishii Note Added: 0002887
2019-10-01 18:10 harukat Note Added: 0002896
2019-10-02 09:36 t-ishii Note Added: 0002898
2019-10-02 23:07 Muhammad Usama File Added: watchdog_node_lost_fix.diff
2019-10-02 23:07 Muhammad Usama Note Added: 0002899
2019-10-02 23:28 t-ishii File Added: 011.watchdog_quorum_failover
2019-10-02 23:28 t-ishii Note Added: 0002900
2019-10-02 23:32 t-ishii File Added: pgpool-logs.tar.gz
2019-10-02 23:32 t-ishii Note Added: 0002902
2019-10-03 00:19 Muhammad Usama Note Added: 0002903
2019-10-03 00:33 Muhammad Usama Note Added: 0002904
2019-10-03 07:13 t-ishii Note Added: 0002905
2019-10-03 13:33 harukat Note Added: 0002906
2019-10-03 18:19 Muhammad Usama Note Added: 0002907
2019-10-08 13:31 harukat File Added: make_splitbrain.sh
2019-10-08 13:31 harukat File Added: test_20191007.tgz
2019-10-08 13:31 harukat Note Added: 0002912
2019-10-08 13:32 harukat Note Edited: 0002912
2019-10-11 00:19 Muhammad Usama Note Added: 0002918
2019-10-15 11:54 harukat Note Added: 0002927
2019-10-31 18:41 administrator Fixed in Version => 3.7.12
2019-10-31 18:41 administrator Target Version => 3.7.12