<div dir="ltr"><div dir="ltr"><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Thu, Jul 15, 2021 at 10:42 AM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp">ishii@sraoss.co.jp</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">Hi Usama,<br>

<br>

I am trying to understand your proposal. Please correct me if I am<br>

wrong.  It seems the proposal just gives up the concept of quorum. For<br>

example, we start with 3-node cluster A, B and C.  Due to a network<br>

problem, C is separated from A and B. A and B can still<br>

communicate. After wd_lost_node_to_remove_timeout has passed, A and B become<br>

a 2-node cluster with quorum. C becomes a 1-node cluster with<br>

quorum. So a split brain occurs.<br></blockquote><div><br></div><div>Hi Ishii_San</div><div><br></div><div>Your understanding of the proposal is correct. Basically, IMHO, whatever we</div><div>do to remedy the original issue, there will always be a chance of split-brain.</div><div><br></div><div>The reason I am proposing this solution is that with this proposed design the behaviour</div><div>would be configurable. For example, if the user sets <span style="background-color:transparent">wd_lost_node_to_remove_timeout = 0</span></div><div><span style="background-color:transparent">then this will disable the lost node removal function and the watchdog would</span></div><div><span style="background-color:transparent">behave as it does currently.</span></div><div><span style="background-color:transparent">Normally I expect the </span><span style="background-color:transparent">wd_lost_node_to_remove_timeout value to be set in the</span></div><div><span style="background-color:transparent">range of 5 to 10 mins, </span><span style="background-color:transparent">because a blackout of more than 5 to 10 mins would mean</span></div><div><span style="background-color:transparent">there is some serious problem in the </span><span style="background-color:transparent">network if a node is unable to </span>communicate<span style="background-color:transparent"> for</span></div><div><span style="background-color:transparent">such a long period of time, and we need to </span><span style="background-color:transparent">resume the service even if it comes with</span></div><div><span style="background-color:transparent">the risk of a split-brain.</span></div><div><span style="background-color:transparent"><br></span></div><div><span style="background-color:transparent">The second part of the proposal talks about the nodes that are properly shut down. 
In that</span></div><div><span style="background-color:transparent">case, the proposal is to stop counting those nodes towards the quorum calculation since</span></div><div><span style="background-color:transparent">we already know that these nodes are not alive anymore. But again, it also has associated</span></div><div><span style="background-color:transparent">risks in case a previously shut down node is started again but is unable to communicate</span></div><div><span style="background-color:transparent">with the existing cluster.</span></div><div><span style="background-color:transparent"><br></span></div><div><span style="background-color:transparent">Best regards</span></div><div><span style="background-color:transparent">Muhammad Usama</span></div><div><span style="background-color:transparent"> </span><br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">

<br>

Am I missing something?<br>

<br>

Best regards,<br>

--<br>

Tatsuo Ishii<br>

SRA OSS, Inc. Japan<br>

English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

<br>

> Hi,<br>

> <br>

> I have been thinking about this issue and I believe the concerns are genuine<br>

> and we need to figure out a way around.<br>

> <br>

> IMHO one possible solution is to change how watchdog does the quorum<br>

> calculations<br>

> and which nodes make up the watchdog cluster.<br>

> <br>

> The current implementation calculates the quorum based on the number of<br>

> configured<br>

> watchdog nodes and alive nodes. And if we make the watchdog cluster adjust<br>

> itself dynamically<br>

> based on the current situation, then we can have a better user experience.<br>

> <br>

> As of now the watchdog cluster definition recognises a node as either alive<br>

> or absent.<br>

> And the number of alive nodes needs to be more than half of the total number of<br>

> configured nodes<br>

> for the quorum to hold.<br>

> <br>

> So my suggestion is that instead of using a binary status, we consider that<br>

> a watchdog node<br>

> can be in one of three states 'Alive', 'Dead' or 'Lost', and all dead nodes<br>

> should be considered<br>

> as not part of the current cluster.<br>

> <br>

> Consider the example where we have 5 configured watchdog nodes.<br>

> With current implementation the quorum will require 3 alive nodes.<br>

> <br>

> Now suppose we have started only 3 nodes. That would be good enough to make<br>

> the cluster<br>

> hold the quorum and one of the nodes will eventually acquire the VIP, so no<br>

> problems there.<br>

> But as soon as we shutdown one of the nodes or it becomes 'Lost' the<br>

> cluster will lose the<br>

> quorum and release the VIP, making the service unavailable.<br>

> <br>

> Consider the same scenario, with the above-mentioned new definition of watchdog<br>

> cluster.<br>

> When we initially start 3 nodes out of 5 the cluster marks the remaining<br>

> two nodes<br>

> (after a configurable time) as dead, and removes them from the cluster until<br>

> one of those nodes<br>

> is started and connects with the cluster. So after that configured time,<br>

> even if we have 5 configured<br>

> watchdog nodes our cluster dynamically adjusts itself and considers the<br>

> cluster having<br>

> only 3 nodes (instead of 5) and that will require only 2 nodes to be alive.<br>

> <br>
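To make the arithmetic in the 5-node example above concrete, here is a minimal Python sketch.<br>
It assumes the quorum is a strict majority of the nodes currently counted as cluster members,<br>
and it ignores enable_consensus_with_half_vote. The function names are made up for illustration<br>
and are not Pgpool-II APIs.<br>

```python
def quorum_threshold(cluster_size: int) -> int:
    """Smallest number of alive nodes that forms a strict majority."""
    return cluster_size // 2 + 1


def holds_quorum(alive: int, cluster_size: int) -> bool:
    """True when the alive nodes are a strict majority of the cluster."""
    return alive >= quorum_threshold(cluster_size)


# Current behaviour: the cluster size is the number of configured nodes.
configured = 5
print(quorum_threshold(configured))   # 3
print(holds_quorum(3, configured))    # True
print(holds_quorum(2, configured))    # False

# Proposed behaviour: nodes marked dead are dropped from the cluster size.
dead = 2
effective = configured - dead
print(quorum_threshold(effective))    # 2
print(holds_quorum(2, effective))     # True
```

The same arithmetic gives quorum_threshold(1) == 1, so an isolated node whose view of the cluster<br>
has shrunk to itself would also hold the quorum. That is the split-brain risk in this design.<br>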

> By this new definition if one of the nodes gets lost, the cluster will still<br>

> hold the quorum<br>

> since it considers itself to consist of 3 nodes. And that lost node will<br>

> again be marked<br>

> as dead after a configured amount of time and eventually further shrink the<br>

> cluster size to 2 nodes.<br>

> Similarly, when some previously dead node joins the cluster, the cluster<br>

> will expand itself again to<br>

> accommodate that node.<br>

> <br>

> On top of that if some watchdog node is properly shut down then it would be<br>

> immediately<br>

> marked as dead and removed from the cluster.<br>

> <br>
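The three-state model described above could be sketched roughly as follows. This is only an<br>
illustration under stated assumptions: the class and function names are hypothetical,<br>
remove_timeout stands in for the proposed wd_lost_node_to_remove_timeout, and the quorum is<br>
taken to be a strict majority of the non-dead nodes. It is not how the watchdog is implemented.<br>

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class NodeState(Enum):
    ALIVE = "alive"
    LOST = "lost"   # unreachable, but still counted as a cluster member
    DEAD = "dead"   # removed from the cluster until it reconnects


@dataclass
class Node:
    name: str
    state: NodeState = NodeState.ALIVE
    lost_since: Optional[float] = None


def mark_lost(node: Node, now: float) -> None:
    node.state = NodeState.LOST
    node.lost_since = now


def mark_shutdown(node: Node) -> None:
    # A properly shut down node is marked dead immediately.
    node.state = NodeState.DEAD
    node.lost_since = None


def mark_alive(node: Node) -> None:
    # A dead or lost node that reconnects grows the cluster again.
    node.state = NodeState.ALIVE
    node.lost_since = None


def expire_lost(nodes: List[Node], now: float, remove_timeout: float) -> None:
    # A timeout of 0 disables removal, so lost nodes stay members.
    if remove_timeout <= 0:
        return
    for node in nodes:
        if (node.state is NodeState.LOST and node.lost_since is not None
                and now - node.lost_since >= remove_timeout):
            node.state = NodeState.DEAD


def holds_quorum(nodes: List[Node]) -> bool:
    members = [n for n in nodes if n.state is not NodeState.DEAD]
    alive = sum(1 for n in members if n.state is NodeState.ALIVE)
    return alive >= len(members) // 2 + 1


nodes = [Node("A"), Node("B"), Node("C")]
mark_lost(nodes[2], now=0.0)
print(holds_quorum(nodes))      # True: 2 alive out of 3 members
expire_lost(nodes, now=300.0, remove_timeout=300.0)
print(nodes[2].state.name)      # DEAD: the cluster shrinks to 2 members
mark_lost(nodes[1], now=400.0)
print(holds_quorum(nodes))      # False: 1 alive out of 2 members
expire_lost(nodes, now=700.0, remove_timeout=300.0)
print(holds_quorum(nodes))      # True: A alone is now a 1-member cluster
```

The last step shows the trade-off: once every unreachable peer has been expired, a single node<br>
regains the quorum, which restores service but cannot rule out a split-brain.<br>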

> Of course, this is not bullet-proof and comes with the risk of having a<br>

> split-brain in case of<br>

> a few network partitioning scenarios, but I think it would work in 99% of<br>

> cases.<br>

> <br>

> This new implementation would require two new (proposed)<br>

> configuration parameters.<br>

> 1- wd_lost_node_to_remove_timeout (seconds)<br>

> 2- wd_initial_node_showup_time (seconds)<br>

> <br>

> We can also implement a new PCP command to force the lost node to be<br>

> marked as dead.<br>

> <br>

> Thoughts and suggestions?<br>

> <br>

> Thanks<br>

> Best regards<br>

> Muhammad Usama<br>

> <br>

> On Tue, May 11, 2021 at 7:18 AM Tatsuo Ishii <<a href="mailto:ishii@sraoss.co.jp" target="_blank">ishii@sraoss.co.jp</a>> wrote:<br>

> <br>

>> Hi Pgpool-II developers,<br>

>><br>

>> Recently we got the complaint below from a user.<br>

>><br>

>> Currently Pgpool-II releases VIP if the quorum is lost.  This is<br>

>> reasonable and safe so that we can prevent split-brain problems.<br>

>><br>

>> However, I feel it would be nice if there were a way to hold the VIP<br>

>> even if the quorum is lost, for emergencies.<br>

>><br>

>> Suppose we have a 3-node pgpool cluster with each node in a different city. Two of those<br>

>> cities are knocked out by an earthquake, and the user wants to keep their<br>

>> business running on the remaining 1 node. Of course we could disable<br>

>> watchdog and restart pgpool so that applications can directly connect<br>

>> to pgpool. However in this case applications need to change the IP<br>

>> they connect to.<br>

>><br>

>> Also as the user pointed out, with 2-node configuration the VIP can be<br>

>> used by enabling enable_consensus_with_half_vote even if<br>

>> only 1 node remains. It seems as if 2-node config is better than<br>

>> 3-node config in this regard. Of course this is not true since 3-node<br>

>> config is much more resistant to split-brain problems.<br>

>><br>

>> I think there are multiple ways to deal with the problem:<br>

>><br>

>> 1) invent a new config parameter so that pgpool keeps VIP even if the<br>

>> quorum is lost.<br>

>><br>

>> 2) add a new pcp command which re-attaches the VIP after VIP is lost<br>

>> due to loss of the quorum.<br>

>><br>

>> #1 could easily create duplicate VIPs. #2 looks better but when other<br>

>>  nodes come up, it could be possible that duplicate VIPs are created.<br>

>><br>

>> Thoughts?<br>

>><br>

>> Best regards,<br>

>> --<br>

>> Tatsuo Ishii<br>

>> SRA OSS, Inc. Japan<br>

>> English: <a href="http://www.sraoss.co.jp/index_en.php" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en.php</a><br>

>> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

>><br>

>> > Dear all,<br>

>> ><br>

>> > I have a fairly common 3-node cluster, with each node running a PgPool<br>

>> > and a PostgreSQL instance.<br>

>> ><br>

>> > I have set up priorities so that:<br>

>> >   - when all 3 nodes are up, the 1st node is gonna have the VIP,<br>

>> >   - when the 1st node is down, the 2nd node is gonna have the VIP, and<br>

>> >   - when both the 1st and the 2nd nodes are down, then the 3rd node<br>

>> > should get the VIP.<br>

>> ><br>

>> > My problem is that when only 1 node is up, the VIP is not brought up,<br>

>> > because there is no quorum.<br>

>> > How can I get PgPool to bring up the VIP to the only remaining node,<br>

>> > which still could and should serve requests?<br>

>> ><br>

>> > Regards,<br>

>> ><br>

>> > tamas<br>

>> ><br>

>> > --<br>

>> > Rébeli-Szabó Tamás<br>

>> ><br>

>> > _______________________________________________<br>

>> > pgpool-general mailing list<br>

>> > <a href="mailto:pgpool-general@pgpool.net" target="_blank">pgpool-general@pgpool.net</a><br>

>> > <a href="http://www.pgpool.net/mailman/listinfo/pgpool-general" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-general</a><br>

>> _______________________________________________<br>

>> pgpool-hackers mailing list<br>

>> <a href="mailto:pgpool-hackers@pgpool.net" target="_blank">pgpool-hackers@pgpool.net</a><br>

>> <a href="http://www.pgpool.net/mailman/listinfo/pgpool-hackers" rel="noreferrer" target="_blank">http://www.pgpool.net/mailman/listinfo/pgpool-hackers</a><br>

>><br>

</blockquote></div></div>