[pgpool-hackers: 4056] Re: [pgpool-general: 7543] VIP with one node

Tatsuo Ishii ishii at sraoss.co.jp
Mon Nov 8 11:33:51 JST 2021


Hi Usama,

Thank you for the work. I have added it to the Pgpool-II 4.3 release
notes.  Could you please take a look and check whether I have
misunderstood anything regarding the commit?

One thing I noticed in the doc you added is that you did not mention
the risk of split-brain when this feature is enabled. Should we add
that?

> Hi Ishii-San,
> 
> On Tue, Nov 2, 2021 at 5:58 AM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Hi Usama,
>>
>> I confirmed your patch works as expected. Thank you for your great work!
>>
> 
> Many thanks for the confirmation. I have made a few cosmetic changes and
> committed the patch and documentation update.
> 
> Best Regards
> Muhammad Usama
> 
> 
> 
>> > Hi Tatsuo,
>> >
>> > On Mon, Nov 1, 2021 at 12:21 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> >
>> >> Hi Usama,
>> >>
>> >> Thank you for the patch. Unfortunately the patch does not apply to the
>> >> master branch anymore. Can you please rebase it?
>> >>
>> >
>> > Please find the rebased patch
>> >
>> > Thanks
>> > Best regards
>> > Muhammad Usama
>> >
>> >
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >> > Hi,
>> >> >
>> >> > So I have cooked up a WIP patch that implements the behavior discussed
>> >> > above.
>> >> >
>> >> > The attached patch adds three new configuration parameters
>> >> >
>> >> > #wd_remove_shutdown_nodes = off
>> >> >                                     # when enabled, properly shutdown
>> >> >                                     # watchdog nodes get removed from
>> >> >                                     # the cluster and do not count
>> >> >                                     # towards the quorum and consensus
>> >> >                                     # computations
>> >> >
>> >> > #wd_lost_node_removal_timeout = 0s
>> >> >                                     # time after which LOST watchdog
>> >> >                                     # nodes get removed from the cluster
>> >> >                                     # and do not count towards the
>> >> >                                     # quorum and consensus computations
>> >> >                                     # setting it to 0 means LOST nodes
>> >> >                                     # are never removed
>> >> >
>> >> > #wd_initial_node_showup_time = 0s
>> >> >                                     # time to wait for watchdog nodes
>> >> >                                     # to connect to the cluster; after
>> >> >                                     # that time the nodes are considered
>> >> >                                     # not to be part of the cluster and
>> >> >                                     # do not count towards the quorum
>> >> >                                     # and consensus computations
>> >> >                                     # setting it to 0 will wait forever
>> >> >
>> >> >
>> >> > Keeping the default values for these parameters retains the existing
>> >> > behavior.
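>> >> >
>> >> > Just as an illustration (the values below are arbitrary examples, not
>> >> > recommendations), a non-default setup in pgpool.conf could look like:
>> >> >
>> >> > wd_remove_shutdown_nodes = on
>> >> >                                     # properly shutdown nodes stop
>> >> >                                     # counting towards quorum
>> >> > wd_lost_node_removal_timeout = 30s
>> >> >                                     # LOST nodes stop counting after
>> >> >                                     # 30 seconds
>> >> > wd_initial_node_showup_time = 60s
>> >> >                                     # wait at most 60 seconds for
>> >> >                                     # nodes to join at startup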
>> >> >
>> >> >
>> >> > Moreover, the patch also enhances the pcp_watchdog_info utility to
>> >> > output the current "Quorum State" for each watchdog node, as well as
>> >> > the "number of nodes required for quorum" and the "valid remote nodes
>> >> > count" as per the current status of the watchdog cluster. This change
>> >> > might also require a bump of the pcp lib version.
>> >> >
>> >> >
>> >> >
>> >> > bin/pcp_watchdog_info -U postgres -v
>> >> > Watchdog Cluster Information
>> >> > Total Nodes              : 3
>> >> > Remote Nodes             : 2
>> >> > Valid Remote Nodes       : 1
>> >> > Alive Remote Nodes       : 0
>> >> > Nodes required for quorum: 2
>> >> > Quorum state             : QUORUM ABSENT
>> >> > VIP up on local node     : NO
>> >> > Leader Node Name         : localhost:9990 Darwin Usama-Macbook-Pro.local
>> >> > Leader Host Name         : localhost
>> >> >
>> >> > Watchdog Node Information
>> >> > Node Name      : localhost:9990 Darwin Usama-Macbook-Pro.local
>> >> > ...
>> >> > Status Name    : LEADER
>> >> > Quorum State   : ACTIVE
>> >> >
>> >> > Node Name      : localhost:9991 Darwin Usama-Macbook-Pro.local
>> >> > ...
>> >> > Status         : 10
>> >> > Status Name    : SHUTDOWN
>> >> > Quorum State   : ACTIVE
>> >> >
>> >> > Node Name      : Not_Set
>> >> > ...
>> >> > Status Name    : DEAD
>> >> > Quorum State   : REMOVED-NO-SHOW
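>> >> >
>> >> > (As a side note, not part of the patch itself: with this output the
>> >> > cluster-level quorum state can be checked from a script, e.g.
>> >> >
>> >> > bin/pcp_watchdog_info -U postgres -v | grep 'Quorum state'
>> >> >
>> >> > which prints only the cluster-wide "Quorum state" line shown above,
>> >> > since the per-node lines use "Quorum State" with a capital S.)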
>> >> >
>> >> > The patch is still in a WIP state, mainly because it lacks the
>> >> > documentation updates. I am sharing it to get opinions and suggestions
>> >> > on the behavior and the configuration parameter names.
>> >> >
>> >> > Thanks
>> >> > Best regards
>> >> > Muhammad Usama
>> >> >
>> >> >
>> >> > On Mon, Aug 23, 2021 at 6:05 AM Tatsuo Ishii <ishii at sraoss.co.jp>
>> wrote:
>> >> >
>> >> >> Hi Usama,
>> >> >>
>> >> >> Sorry for late reply.
>> >> >>
>> >> >> From: Muhammad Usama <m.usama at gmail.com>
>> >> >> Subject: Re: [pgpool-hackers: 3898] Re: [pgpool-general: 7543] VIP
>> with
>> >> >> one node
>> >> >> Date: Thu, 22 Jul 2021 14:12:59 +0500
>> >> >> Message-ID: <
>> >> >> CAEJvTzXsKE2B0QMd0AjGBmXK6zocWZZcGU7yzzkSnmff0iAfqA at mail.gmail.com>
>> >> >>
>> >> >> > On Tue, Jul 20, 2021 at 4:40 AM Tatsuo Ishii <ishii at sraoss.co.jp>
>> >> wrote:
>> >> >> >
>> >> >> >> >> Is it possible to configure watchdog to enable the lost node
>> >> >> >> >> removal function only when a node is properly shut down?
>> >> >> >> >>
>> >> >> >>
>> >> >> >> > Yes, if we disable wd_lost_node_to_remove_timeout (by setting it
>> >> >> >> > to 0), the lost node removal will only happen for properly
>> >> >> >> > shutdown nodes.
>> >> >> >>
>> >> >> >> Oh, I thought setting wd_lost_node_to_remove_timeout to 0 would
>> >> >> >> keep the existing behavior.
>> >> >> >>
>> >> >> >
>> >> >> > There are two parts to the proposal. The first one deals with
>> >> >> > removing a lost node from the cluster after
>> >> >> > wd_lost_node_to_remove_timeout amount of time, while the second
>> >> >> > part is about removing properly shutdown nodes from the cluster.
>> >> >> >
>> >> >> > Disabling wd_lost_node_to_remove_timeout (setting it to 0) will
>> >> >> > keep the existing behaviour as far as the lost node removal portion
>> >> >> > of the proposal is concerned.
>> >> >> >
>> >> >> > However, not counting properly shutdown nodes as part of the
>> >> >> > watchdog cluster is not configurable (as per the original proposal).
>> >> >> > So if we want to make this part configurable as well, so that we
>> >> >> > can switch back to 100% of the current behaviour, we can add
>> >> >> > another config parameter for that, like
>> >> >> > consider_shutdown_nodes_part_of_wd_cluster = [on|off]
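>> >> >> >
>> >> >> > As a rough sketch only (reading the proposed name literally; the
>> >> >> > name and exact semantics are still open, and the committed
>> >> >> > parameters may end up being called differently), keeping 100% of
>> >> >> > the current behaviour would then look like:
>> >> >> >
>> >> >> > wd_lost_node_to_remove_timeout = 0s
>> >> >> >                        # never remove LOST nodes (current behaviour)
>> >> >> > consider_shutdown_nodes_part_of_wd_cluster = on
>> >> >> >                        # shutdown nodes keep counting towards quorum
>> >> >> >                        # (current behaviour); off would enable the
>> >> >> >                        # new removal behaviour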
>> >> >>
>> >> >> +1 to add the new parameter.
>> >> >>
>> >> >> The reason is that some users may want to avoid the split-brain
>> >> >> problem even at the cost of losing quorum/VIP.  Suppose there are two
>> >> >> admins: A for the system (OS) and B for the database. B never wants
>> >> >> to risk split brain. If A shuts down the system, B may not notice
>> >> >> that there are no longer enough nodes to form consensus if
>> >> >> consider_shutdown_nodes_part_of_wd_cluster is on, because the
>> >> >> quorum/VIP will be kept until no node remains.
>> >> >>
>> >> >> In summary, I think there are use cases for both
>> >> >> consider_shutdown_nodes_part_of_wd_cluster on and off.
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS, Inc. Japan
>> >> >> English: http://www.sraoss.co.jp/index_en.php
>> >> >> Japanese:http://www.sraoss.co.jp
>> >> >>
>> >> >>
>> >>
>>
