[pgpool-hackers: 1493] Re: Proposal: minimize process restart when fail over occurs
ishii at postgresql.org
Thu Apr 7 10:41:24 JST 2016
I have moved forward a little bit with this. At this point I have just
created the necessary infrastructure to reach the goal. See
[pgpool-committers: 3127] for more details.
SRA OSS, Inc. Japan
> So this is a proposal for pgpool-II 3.6.
> I already did some discussion on this:
> From: Tatsuo Ishii <ishii at postgresql.org>
> Subject: [pgpool-hackers: 1413] Item #11, torward pgpool-II 3.6
> Date: Fri, 19 Feb 2016 12:03:12 +0900 (JST)
> Message-ID: <20160219.120312.816223524770393776.t-ishii at sraoss.co.jp>
> Here is a more or less formal proposal which replaces it.
> Currently pgpool-II kills all child processes when a fail over (or a
> switch over by pcp_detach_node) occurs. Of course this disconnects
> all existing client connections because the peer process to which
> each client is connected is gone. This proposal seeks a way to
> minimize such session disconnections.
> o Precondition:
> I assume this proposal is for streaming replication mode only. Maybe
> we could expand this for other modes in the future. I also assume the
> broken server is not primary.
> o Consideration:
> Why do we need to kill the child processes? Basically the
> problem is the retry in the TCP/IP stack layer when the connection
> goes wrong, for example, when the network cable is pulled out. In this
> case the only way to stop the retry is restarting the process.
> There are several cases where we could avoid the restart:
> 1) Knowing that we are not dealing with a fail over caused by a
> cabling problem. There are at least two cases where we know the
> problem is not a cabling problem:
> a) the fail over is triggered by pcp_detach_node.
> b) the fail over is triggered by postmaster shutdown.
> For other cases we need to find a way to know whether the problem is a
> cabling problem or not. Currently we use a timeout to detect such a
> situation. So if we could know whether the timeout occurred, then
> we could know whether the problem is a cabling problem.
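
To make 1) concrete, here is a minimal sketch of the classification,
not the actual pgpool-II code. The enum, its members and the function
name are all hypothetical; the only thing taken from the proposal is
that pcp_detach_node and postmaster shutdown are known not to be
cabling problems, and that anything else (detected by timeout) might be.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical classification of what triggered the fail over. */
typedef enum {
    CAUSE_PCP_DETACH_NODE,     /* case a): pcp_detach_node was issued */
    CAUSE_POSTMASTER_SHUTDOWN, /* case b): postmaster was shut down */
    CAUSE_TIMEOUT,             /* a timeout fired; the network may be down */
    CAUSE_COUNT                /* number of causes, not a cause itself */
} failover_cause;

/*
 * Return true when a cabling problem cannot be ruled out, that is,
 * the TCP/IP stack may still be retrying and only a process restart
 * would stop it.
 */
static bool
may_be_cabling_problem(failover_cause cause)
{
    assert((int) cause < CAUSE_COUNT);

    switch (cause)
    {
        case CAUSE_PCP_DETACH_NODE:
        case CAUSE_POSTMASTER_SHUTDOWN:
            return false;   /* known not to be a cabling problem */
        default:
            return true;    /* be conservative for everything else */
    }
}
```

Any cause that is not explicitly known to be safe is treated as a
possible cabling problem, so the worst case is the current behavior.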
> 2) Once we succeed in #1, the next thing we need to do is determine
> whether the session in question is using the broken server. This is
> fairly easy because we already have the info in shared memory. If the
> session uses the broken server, then we need to restart the process;
> there is no other way. Otherwise we just close the connection to the
> broken backend (if any).
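
The per-session decision in 2) could look roughly like the sketch
below. The session_info struct is a made-up stand-in for the info in
shared memory, and I am assuming here that "uses the broken server"
means the broken standby is the node the session sends its reads to;
both are assumptions for illustration, not the real data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_BACKENDS 4  /* hypothetical limit for this sketch */

/* Hypothetical per-session record standing in for the shared memory info. */
typedef struct {
    int  load_balance_node;        /* standby this session sends reads to */
    bool connected[MAX_BACKENDS];  /* true if a connection to the node is open */
} session_info;

typedef enum {
    ACTION_NONE,           /* session never touched the broken node */
    ACTION_CLOSE_BACKEND,  /* close the idle connection to the broken node */
    ACTION_RESTART_CHILD   /* session uses the broken node: restart */
} session_action;

/* Decide what a child process has to do for one session. */
static session_action
decide_session_action(const session_info *s, int broken_node)
{
    assert(s != NULL);
    assert(broken_node >= 0 && broken_node < MAX_BACKENDS);

    if (s->load_balance_node == broken_node)
        return ACTION_RESTART_CHILD;
    if (s->connected[broken_node])
        return ACTION_CLOSE_BACKEND;
    return ACTION_NONE;
}
```

The primary does not appear in the check because of the precondition
that the broken server is not the primary.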
> o Things we need to do:
> - Invent a way to know if the fail over request was created by
> pcp_detach_node. Probably we add a new flag to the fail over request
> packet to indicate whether the origin of the request is
> pcp_detach_node or not.
> - The same technique can be used for the case where the administrator
> shuts down PostgreSQL.
> - Create an API to deal with connections using the broken server.
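
As for the flag in the fail over request packet, something along these
lines might be enough. The struct layout, the flag names and the helper
are hypothetical and do not reflect the current request packet; the
sketch only shows one bit per known origin carried with the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bits; not actual pgpool-II identifiers. */
#define FAILOVER_FLAG_PCP_DETACH     (1u << 0) /* request came from pcp_detach_node */
#define FAILOVER_FLAG_ADMIN_SHUTDOWN (1u << 1) /* administrator shut down PostgreSQL */

/* Hypothetical fail over request carrying the origin flags. */
typedef struct {
    int          node_id;  /* backend node being detached */
    unsigned int flags;    /* bitwise OR of FAILOVER_FLAG_* */
} failover_request;

/* True if the request's origin is known not to be a cabling problem. */
static bool
origin_is_known_safe(const failover_request *req)
{
    assert(req != NULL);
    return (req->flags &
            (FAILOVER_FLAG_PCP_DETACH | FAILOVER_FLAG_ADMIN_SHUTDOWN)) != 0;
}
```

A request with no flag set keeps today's behavior, so senders that do
not know about the flags would not need to change.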
> o What are the benefits once the above proposal is implemented?
> - If the conditions below are met, the user session can survive a fail over.
>   - Operated in streaming replication mode
>   - The failed server is not the primary
>   - The session does not connect to the broken standby server
> Comments, opinions?
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php