[pgpool-hackers: 2525] Re: New Feature with patch: Quorum and Consensus for backend failover

Tatsuo Ishii ishii at sraoss.co.jp
Mon Sep 11 15:01:53 JST 2017


Usama,

I have modified the watchdog regression script to test your quorum-aware
failover patch. Here are the differences from the existing script.

- Install a streaming replication primary and a standby DB node
  (previously: raw mode with only one node). This is necessary to test an
  ordinary failover scenario. The current script creates only one DB node,
  so if that node goes down, the cluster falls into "all DB nodes down"
  status. I already reported a possible problem with this status upthread
  and am waiting for the answer from Usama.

- Add one more Pgpool-II node, "standby2". For this purpose, a new
  configuration file "standby2.conf" is added.

- Add new test scenario: "fake" failover. By using the infrastructure
  I have created to simulate the communication path between standby2
  and DB node 1. The test checks if such a error raised a failover
  request from standby2, but it is safely ignored.

- Add new test scenario: "real" failover. Shutting down DB node 1,
  which should raise a failover request.

- Modify test.sh to agree with the changes above.
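
For reference, here is a rough sketch of the "fake" failover check in
test.sh. The helper name, log paths and log messages below are placeholders
for illustration only, not the exact ones used in the test:

# break only the communication path between standby2 and DB node 1
# (simulate_network_failure is a placeholder for the helper mentioned above)
simulate_network_failure standby2 node1
sleep 10

# standby2 must have raised a failover request, but the master must not
# have executed it because no consensus was reached
if grep -q "failover request" standby2/log/pgpool.log && \
   ! grep -q "failover done" master/log/pgpool.log; then
    echo "fake failover test: ok"
else
    echo "fake failover test: failed"
    exit 1
fi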

Since the modification requires your quorum patch to be committed first, I
haven't pushed the change yet.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> I have tested the patch a little bit using the 004 watchdog regression
> test.  After the test ended, I manually started the master and standby
> Pgpool-II.
> 
> 1) Stop the master PostgreSQL. Since only one PostgreSQL node is
> configured, I expected:
> 
> psql: ERROR:  pgpool is not accepting any new connections
> DETAIL:  all backend nodes are down, pgpool requires at least one valid node
> HINT:  repair the backend nodes and restart pgpool
> 
> but the master Pgpool-II replies:
> 
> psql: FATAL:  failed to create a backend connection
> DETAIL:  executing failover on backend
> 
> Is this normal?
> 
> 2) I shut down the master node to see if the standby escalates.
> 
> After shutting down the master, I see this using pcp_watchdog_info:
> 
> pcp_watchdog_info -p 11105
> localhost:11100 Linux tishii-CF-SX3HE4BP localhost 11100 21104 4 MASTER
> localhost:11000 Linux tishii-CF-SX3HE4BP localhost 11000 21004 10 SHUTDOWN
> 
> Seems ok but I want to confirm.
> 
> master and standby pgpool logs attached.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
>> On Fri, Aug 25, 2017 at 5:05 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> 
>>> > On Fri, Aug 25, 2017 at 12:53 PM, Tatsuo Ishii <ishii at sraoss.co.jp>
>>> wrote:
>>> >
>>> >> Usama,
>>> >>
>>> >> With the new patch, the regression tests all passed.
>>> >>
>>> >
>>> > Glad to hear that :-)
>>> > Did you have a chance to look at the node quarantine state I added?
>>> > What are your thoughts on that?
>>>
>>> I'm going to look into the patch this weekend.
>>>
>> 
>> Many thanks
>> 
>> Best Regards
>> Muhammad Usama
>> 
>>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese:http://www.sraoss.co.jp
>>>
>>> >> > Hi Ishii-San
>>> >> >
>>> >> > Please find the updated patch. It fixes the regression issue you were
>>> >> > facing and also another bug which I encountered during my testing.
>>> >> >
>>> >> > -- Adding Yugo to the thread,
>>> >> > Hi Yugo,
>>> >> >
>>> >> > Since you are an expert on the watchdog feature, I thought you might
>>> >> > have something to say, especially regarding the discussion points
>>> >> > mentioned in the initial mail.
>>> >> >
>>> >> >
>>> >> > Thanks
>>> >> > Best Regards
>>> >> > Muhammad Usama
>>> >> >
>>> >> >
>>> >> > On Thu, Aug 24, 2017 at 11:25 AM, Muhammad Usama <m.usama at gmail.com>
>>> >> wrote:
>>> >> >
>>> >> >>
>>> >> >>
>>> >> >> On Thu, Aug 24, 2017 at 4:34 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
>>> >> wrote:
>>> >> >>
>>> >> >>> After applying the patch, many of the regression tests fail. It seems
>>> >> >>> pgpool.conf.sample has a bogus comment which causes the pgpool.conf
>>> >> >>> parser to raise a parse error.
>>> >> >>>
>>> >> >>> 2017-08-24 08:22:36: pid 6017: FATAL:  syntex error in configuration file
>>> >> >>> "/home/t-ishii/work/pgpool-II/current/pgpool2/src/test/regression/tests/004.watchdog/standby/etc/pgpool.conf"
>>> >> >>> 2017-08-24 08:22:36: pid 6017: DETAIL:  parse error at line 568 '*' token = 8
>>> >> >>>
>>> >> >>
>>> >> >> Really sorry. Somehow I overlooked the sample config file changes I
>>> >> >> made at the last minute.
>>> >> >> I will send you the updated version.
>>> >> >>
>>> >> >> Thanks
>>> >> >> Best Regards
>>> >> >> Muhammad Usama
>>> >> >>
>>> >> >>>
>>> >> >>> Best regards,
>>> >> >>> --
>>> >> >>> Tatsuo Ishii
>>> >> >>> SRA OSS, Inc. Japan
>>> >> >>> English: http://www.sraoss.co.jp/index_en.php
>>> >> >>> Japanese:http://www.sraoss.co.jp
>>> >> >>>
>>> >> >>> > Usama,
>>> >> >>> >
>>> >> >>> > Thanks for the patch. I am going to review it.
>>> >> >>> >
>>> >> >>> > In the meantime, when I applied your patch, I got some trailing
>>> >> >>> > whitespace errors. Can you please fix them?
>>> >> >>> >
>>> >> >>> > /home/t-ishii/quorum_aware_failover.diff:470: trailing whitespace.
>>> >> >>> >
>>> >> >>> > /home/t-ishii/quorum_aware_failover.diff:485: trailing whitespace.
>>> >> >>> >
>>> >> >>> > /home/t-ishii/quorum_aware_failover.diff:564: trailing whitespace.
>>> >> >>> >
>>> >> >>> > /home/t-ishii/quorum_aware_failover.diff:1428: trailing whitespace.
>>> >> >>> >
>>> >> >>> > /home/t-ishii/quorum_aware_failover.diff:1450: trailing whitespace.
>>> >> >>> >
>>> >> >>> > warning: squelched 3 whitespace errors
>>> >> >>> > warning: 8 lines add whitespace errors.
>>> >> >>> >
>>> >> >>> > Best regards,
>>> >> >>> > --
>>> >> >>> > Tatsuo Ishii
>>> >> >>> > SRA OSS, Inc. Japan
>>> >> >>> > English: http://www.sraoss.co.jp/index_en.php
>>> >> >>> > Japanese:http://www.sraoss.co.jp
>>> >> >>> >
>>> >> >>> >> Hi
>>> >> >>> >>
>>> >> >>> >> I was working on the new feature to make the backend node failover
>>> >> >>> >> quorum aware, and halfway through the implementation I also added a
>>> >> >>> >> majority consensus feature for the same.
>>> >> >>> >>
>>> >> >>> >> So please find the first version of the patch for review. It makes
>>> >> >>> >> the backend node failover consider the watchdog cluster quorum
>>> >> >>> >> status and seek a majority consensus before performing the failover.
>>> >> >>> >>
>>> >> >>> >> *Changes in the Failover mechanism with watchdog.*
>>> >> >>> >> For this new feature I have modified Pgpool-II's existing failover
>>> >> >>> >> mechanism with watchdog.
>>> >> >>> >> Previously, as you know, when Pgpool-II needed to perform a node
>>> >> >>> >> operation (failover, failback, promote-node) with the watchdog, the
>>> >> >>> >> watchdog propagated the failover request to all the Pgpool-II nodes
>>> >> >>> >> in the watchdog cluster, and as soon as the request was received by
>>> >> >>> >> a node, it initiated the local failover; that failover was then
>>> >> >>> >> synchronised on all nodes using the distributed locks.
>>> >> >>> >>
>>> >> >>> >> *Now Only the Master node performs the failover.*
>>> >> >>> >> The attached patch changes the mechanism of synchronised failover:
>>> >> >>> >> now only the Pgpool-II of the master watchdog node performs the
>>> >> >>> >> failover, and all other standby nodes sync the backend statuses
>>> >> >>> >> after the master Pgpool-II has finished the failover.
>>> >> >>> >>
>>> >> >>> >> *Overview of new failover mechanism.*
>>> >> >>> >> -- If a failover request is received by a standby watchdog node
>>> >> >>> >> (from the local Pgpool-II), that request is forwarded to the master
>>> >> >>> >> watchdog and the Pgpool-II main process gets the
>>> >> >>> >> FAILOVER_RES_WILL_BE_DONE return code. Upon receiving
>>> >> >>> >> FAILOVER_RES_WILL_BE_DONE from the watchdog for the failover
>>> >> >>> >> request, the requesting Pgpool-II moves forward without doing
>>> >> >>> >> anything further for that particular failover command.
>>> >> >>> >>
>>> >> >>> >> -- When the failover request from a standby node is received by the
>>> >> >>> >> master watchdog, the failover request is triggered on the local
>>> >> >>> >> Pgpool-II after performing the validation and applying the
>>> >> >>> >> consensus rules.
>>> >> >>> >>
>>> >> >>> >> -- When the failover request is received by the master watchdog
>>> >> >>> >> node from its local Pgpool-II (on the IPC channel), the watchdog
>>> >> >>> >> process tells the requesting Pgpool-II process to proceed with the
>>> >> >>> >> failover (provided all failover rules are satisfied).
>>> >> >>> >>
>>> >> >>> >> -- After the failover is finished on the master Pgpool-II, the
>>> >> >>> >> failover function calls *wd_failover_end*(), which sends a "backend
>>> >> >>> >> sync required" message to all standby watchdogs.
>>> >> >>> >>
>>> >> >>> >> -- Upon receiving the sync required message from the master
>>> >> >>> >> watchdog node, all Pgpool-II nodes sync the new status of each
>>> >> >>> >> backend node from the master watchdog.
>>> >> >>> >>
>>> >> >>> >> *No More Failover locks*
>>> >> >>> >> Since with this new failover mechanism we no longer require any
>>> >> >>> >> synchronisation or guards against the execution of
>>> >> >>> >> failover_commands by multiple Pgpool-II nodes, the patch removes
>>> >> >>> >> all the distributed locks from the failover function. This makes
>>> >> >>> >> the failover simpler and faster.
>>> >> >>> >>
>>> >> >>> >> *New kind of Failover operation NODE_QUARANTINE_REQUEST*
>>> >> >>> >> The patch adds a new kind of backend node operation,
>>> >> >>> >> NODE_QUARANTINE, which is effectively the same as NODE_DOWN, but
>>> >> >>> >> with node quarantine the failover_command is not triggered.
>>> >> >>> >> A NODE_DOWN_REQUEST is automatically converted to a
>>> >> >>> >> NODE_QUARANTINE_REQUEST when the failover is requested on a backend
>>> >> >>> >> node but the watchdog cluster does not hold the quorum.
>>> >> >>> >> This means that in the absence of quorum the failed backend nodes
>>> >> >>> >> are quarantined, and when the quorum becomes available again
>>> >> >>> >> Pgpool-II performs the failback operation on all quarantined nodes.
>>> >> >>> >> And again, when the failback is performed on a quarantined backend
>>> >> >>> >> node, the failover function does not trigger the failback_command.
>>> >> >>> >>
>>> >> >>> >> *Controlling the Failover behaviour.*
>>> >> >>> >> The patch adds three new configuration parameters to control the
>>> >> >>> >> failover behaviour from the user side (an illustrative example
>>> >> >>> >> follows the three parameter descriptions below).
>>> >> >>> >>
>>> >> >>> >> *failover_when_quorum_exists*
>>> >> >>> >> When enabled, the failover command will only be executed when the
>>> >> >>> >> watchdog cluster holds the quorum. When the quorum is absent and
>>> >> >>> >> failover_when_quorum_exists is enabled, the failed backend nodes
>>> >> >>> >> are quarantined until the quorum becomes available again.
>>> >> >>> >> Disabling it restores the old behaviour of failover commands.
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> *failover_require_consensus*
>>> >> >>> >> This new configuration parameter can be used to make sure we get a
>>> >> >>> >> majority vote before performing the failover on a node. When
>>> >> >>> >> *failover_require_consensus* is enabled, the failover is only
>>> >> >>> >> performed after receiving failover requests from a majority of
>>> >> >>> >> Pgpool-II nodes.
>>> >> >>> >> For example, in a three-node cluster the failover will not be
>>> >> >>> >> performed until at least two nodes ask for a failover of the
>>> >> >>> >> particular backend node.
>>> >> >>> >>
>>> >> >>> >> It is also worthwhile to mention here that
>>> >> >>> >> *failover_require_consensus* only works when
>>> >> >>> >> failover_when_quorum_exists is enabled.
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> *enable_multiple_failover_requests_from_node*
>>> >> >>> >> This parameter works in connection with the
>>> >> >>> >> *failover_require_consensus* setting. When enabled, a single
>>> >> >>> >> Pgpool-II node can vote for failover multiple times.
>>> >> >>> >> For example, in a three-node cluster, if one Pgpool-II node sends
>>> >> >>> >> the failover request for a particular node twice, that is counted
>>> >> >>> >> as two votes in favour of failover, and the failover will be
>>> >> >>> >> performed even if we do not get a vote from the other two nodes.
>>> >> >>> >>
>>> >> >>> >> When *enable_multiple_failover_requests_from_node* is disabled,
>>> >> >>> >> only the first vote from each Pgpool-II will be accepted and all
>>> >> >>> >> subsequent votes will be marked as duplicates and rejected.
>>> >> >>> >> So in that case we will require majority votes from distinct nodes
>>> >> >>> >> to execute the failover.
>>> >> >>> >> Again, *enable_multiple_failover_requests_from_node* only becomes
>>> >> >>> >> effective when both *failover_when_quorum_exists* and
>>> >> >>> >> *failover_require_consensus* are enabled.
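>>> >> >>> >>
>>> >> >>> >> Just to illustrate how the three parameters fit together (the
>>> >> >>> >> values below are only an example, not proposed defaults; see the
>>> >> >>> >> open question about defaults further down), in pgpool.conf on
>>> >> >>> >> every Pgpool-II node:
>>> >> >>> >>
>>> >> >>> >> failover_when_quorum_exists = on
>>> >> >>> >> failover_require_consensus = on
>>> >> >>> >> enable_multiple_failover_requests_from_node = off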
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> *Controlling the failover: The Coding perspective.*
>>> >> >>> >> Although the failover functions are made quorum and consensus
>>> >> >>> >> aware, there is still a way to bypass the quorum conditions and the
>>> >> >>> >> requirement of consensus.
>>> >> >>> >>
>>> >> >>> >> For this the patch uses the existing request_details flags in
>>> >> >>> >> POOL_REQUEST_NODE to control the behaviour of failover.
>>> >> >>> >>
>>> >> >>> >> Here are the newly added flag values.
>>> >> >>> >>
>>> >> >>> >> *REQ_DETAIL_WATCHDOG*:
>>> >> >>> >> Setting this flag while issuing the failover command will not send
>>> >> >>> >> the failover request to the watchdog. This flag may not be useful
>>> >> >>> >> anywhere other than where it is already used.
>>> >> >>> >> Mostly this flag is used to keep a failover command that already
>>> >> >>> >> originated from the watchdog from being sent back to the watchdog;
>>> >> >>> >> otherwise we could end up in an infinite loop.
>>> >> >>> >>
>>> >> >>> >> *REQ_DETAIL_CONFIRMED*:
>>> >> >>> >> Setting this flag will bypass the *failover_require_consensus*
>>> >> >>> >> configuration and immediately perform the failover if the quorum is
>>> >> >>> >> present.
>>> >> >>> >> This flag can be used for failover requests originating from a PCP
>>> >> >>> >> command.
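>>> >> >>> >>
>>> >> >>> >> For example (host, port, user and node id below are just
>>> >> >>> >> placeholders), a detach issued through a PCP command carries this
>>> >> >>> >> flag, so it needs only the quorum and not a consensus from the
>>> >> >>> >> other nodes:
>>> >> >>> >>
>>> >> >>> >> pcp_detach_node -h localhost -p 9898 -U pgpool -n 1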
>>> >> >>> >>
>>> >> >>> >> *REQ_DETAIL_UPDATE*:
>>> >> >>> >> This flag is used for the command where we are failing back the
>>> >> >>> >> quarantined nodes. Setting this flag will not trigger the
>>> >> >>> >> failback_command.
>>> >> >>> >>
>>> >> >>> >> *Some conditional flags used:*
>>> >> >>> >> I was not sure about the configuration for each type of failover
>>> >> >>> >> operation. We have three main failover operations: NODE_UP_REQUEST,
>>> >> >>> >> NODE_DOWN_REQUEST, and PROMOTE_NODE_REQUEST.
>>> >> >>> >> So I was wondering whether we need to give users a configuration
>>> >> >>> >> option to enable/disable quorum checking and consensus for each
>>> >> >>> >> individual failover operation type.
>>> >> >>> >> For example: is it a practical configuration where a user would
>>> >> >>> >> want to ensure quorum while performing a NODE_DOWN operation but
>>> >> >>> >> does not want it for NODE_UP?
>>> >> >>> >> So in this patch I use three compile-time defines to enable/disable
>>> >> >>> >> this for each individual failover operation, until we decide on the
>>> >> >>> >> best solution (an illustrative build command follows the list of
>>> >> >>> >> defines below).
>>> >> >>> >>
>>> >> >>> >> NODE_UP_REQUIRE_CONSENSUS: defining it enables the quorum checking
>>> >> >>> >> feature for NODE_UP_REQUESTs.
>>> >> >>> >>
>>> >> >>> >> NODE_DOWN_REQUIRE_CONSENSUS: defining it enables the quorum
>>> >> >>> >> checking feature for NODE_DOWN_REQUESTs.
>>> >> >>> >>
>>> >> >>> >> NODE_PROMOTE_REQUIRE_CONSENSUS: defining it enables the quorum
>>> >> >>> >> checking feature for PROMOTE_NODE_REQUESTs.
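>>> >> >>> >>
>>> >> >>> >> For illustration only (this assumes the defines are consumed via
>>> >> >>> >> #ifdef and are not already hard-coded in a header; the exact build
>>> >> >>> >> invocation may differ), one way to turn all three on while
>>> >> >>> >> experimenting would be:
>>> >> >>> >>
>>> >> >>> >> ./configure CPPFLAGS="-DNODE_UP_REQUIRE_CONSENSUS \
>>> >> >>> >>     -DNODE_DOWN_REQUIRE_CONSENSUS -DNODE_PROMOTE_REQUIRE_CONSENSUS"
>>> >> >>> >> make && make install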
>>> >> >>> >>
>>> >> >>> >> *Some Points for Discussion:*
>>> >> >>> >>
>>> >> >>> >> *Do we really need to check the Req_info->switching flag before
>>> >> >>> >> enqueuing a failover request?*
>>> >> >>> >> While working on the patch I was wondering why we disallow
>>> >> >>> >> enqueuing a failover command when a failover is already in
>>> >> >>> >> progress. For example, in the *pcp_process_command*() function, if
>>> >> >>> >> we see the *Req_info->switching* flag set, we bail out with an
>>> >> >>> >> error instead of enqueuing the command. Is this really necessary?
>>> >> >>> >>
>>> >> >>> >> *Do we need more granular control over each failover operation?*
>>> >> >>> >> As described in the section "Some conditional flags used", I would
>>> >> >>> >> like opinions on whether we need configuration parameters in
>>> >> >>> >> pgpool.conf to enable/disable quorum and consensus checking for
>>> >> >>> >> individual failover types.
>>> >> >>> >>
>>> >> >>> >> *Which failovers should be marked as Confirmed:*
>>> >> >>> >> As described in the REQ_DETAIL_CONFIRMED section above, we can mark
>>> >> >>> >> a failover request as not needing consensus; currently the requests
>>> >> >>> >> from the PCP commands are fired with this flag. But I was wondering
>>> >> >>> >> whether there may be more places where we need to use the flag.
>>> >> >>> >> For example, I currently use the same confirmed flag when failover
>>> >> >>> >> is triggered because of *replication_stop_on_mismatch*.
>>> >> >>> >>
>>> >> >>> >> I think we should consider this flag for each place where failover
>>> >> >>> >> is triggered, e.g.
>>> >> >>> >> because of a health_check failure,
>>> >> >>> >> because of a replication mismatch,
>>> >> >>> >> because of a backend_error,
>>> >> >>> >> etc.
>>> >> >>> >>
>>> >> >>> >> *Node Quarantine behaviour.*
>>> >> >>> >> What do you think about the node quarantine behaviour used by this
>>> >> >>> >> patch? Can you think of any problems it could cause?
>>> >> >>> >>
>>> >> >>> >> *What should be the default value for each newly added config
>>> >> >>> >> parameter?*
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> *TODOs*
>>> >> >>> >>
>>> >> >>> >> -- Updating the documentation is still to do. I will do that once
>>> >> >>> >> every aspect of the feature has been finalised.
>>> >> >>> >> -- Some code warnings and cleanups are still not done.
>>> >> >>> >> -- I am still a little short on testing.
>>> >> >>> >> -- Regression test cases for the feature.
>>> >> >>> >>
>>> >> >>> >>
>>> >> >>> >> Thoughts and suggestions are most welcome.
>>> >> >>> >>
>>> >> >>> >> Thanks
>>> >> >>> >> Best regards
>>> >> >>> >> Muhammad Usama
>>> >> >>> > _______________________________________________
>>> >> >>> > pgpool-hackers mailing list
>>> >> >>> > pgpool-hackers at pgpool.net
>>> >> >>> > http://www.pgpool.net/mailman/listinfo/pgpool-hackers
>>> >> >>>
>>> >> >>
>>> >> >>
>>> >>
>>>


More information about the pgpool-hackers mailing list