[pgpool-hackers: 1504] Re: Proposal: minimize process restart when fail over occurs

Sun Apr 17 18:15:52 JST 2016

OK, I have succeeded in not restarting the child process when all of
the following conditions are met:

 - Streaming replication mode
 - pcp_detach_node is used
 - the process does not use the detached node as its load balance node
   (that means the process does not issue queries to that node)
 - the detached node is not the primary node
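
As a rough illustration only (this is not the actual pgpool-II code, and
every name below is hypothetical), the conditions above could be
expressed as a single predicate evaluated per child process:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DetachContext:
    """Hypothetical snapshot of what a child process knows at detach time."""
    streaming_replication: bool          # pgpool runs in streaming replication mode
    requested_by_pcp_detach_node: bool   # the detach came from pcp_detach_node
    detached_node: int                   # DB node N being detached
    primary_node: int                    # current primary node id
    load_balance_node: int               # load balance node of this session


def child_can_survive(ctx: DetachContext) -> bool:
    """Return True only if every listed condition holds for this child."""
    return (
        ctx.streaming_replication
        and ctx.requested_by_pcp_detach_node
        and ctx.load_balance_node != ctx.detached_node
        and ctx.detached_node != ctx.primary_node
    )
```

If any one condition fails, the sketch falls back to the existing
behavior, that is, the child process is restarted.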

At this point this only enables the following use cases (we assume
that pcp_detach_node detaches DB node N):

1) Lucky users connected to database server N are not affected by
   pcp_detach_node.

2) Planned DB shutdown. For demonstration purposes, I used pgbench -C.

   - start pgbench -C
   - edit pgpool.conf to set the weight of backend N to 0.
   - pgpool reload
   - pcp_detach_node N
   - pgbench happily continues the benchmark
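
The steps above could be written down roughly as follows. This is only
a sketch: the host, ports, user, database name and run time are
placeholders, and the pcp_detach_node options assume the pgpool-II 3.5
style command line (-h, -p, -U, -n). It only prints the commands and
does not run them.

```python
import shlex


def planned_shutdown_steps(node_id, pcp_host="localhost", pcp_port=9898,
                           pcp_user="pgpool", pgpool_port=9999, dbname="test"):
    """Return the demonstration steps, in order, as (description, argv) pairs.

    argv is None for the manual pgpool.conf edit. All default values are
    placeholders chosen for illustration.
    """
    return [
        # pgbench keeps running in the background during the later steps
        ("start pgbench, reconnecting for every transaction",
         ["pgbench", "-C", "-p", str(pgpool_port), "-T", "60", dbname]),
        (f"set backend_weight{node_id} = 0 in pgpool.conf", None),
        ("reload the pgpool configuration", ["pgpool", "reload"]),
        ("detach the node",
         ["pcp_detach_node", "-h", pcp_host, "-p", str(pcp_port),
          "-U", pcp_user, "-n", str(node_id)]),
    ]


if __name__ == "__main__":
    for description, argv in planned_shutdown_steps(1):
        command = shlex.join(argv) if argv else "(manual edit)"
        print(f"{description}: {command}")
```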

#2 is probably the practically useful one.

I think we could extend this to certain other cases, such as PostgreSQL
being shut down by the admin. I will continue to work on this.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> I have moved forward a little bit with this. At this point I have just
> created the necessary infrastructure to achieve the goal. See
> [pgpool-committers: 3127] for more details.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
>> So this is a proposal for pgpool-II 3.6.
>> 
>> I have already discussed this to some extent:
>> 
>> From: Tatsuo Ishii <ishii at postgresql.org>
>> Subject: [pgpool-hackers: 1413] Item #11, torward pgpool-II 3.6
>> Date: Fri, 19 Feb 2016 12:03:12 +0900 (JST)
>> Message-ID: <20160219.120312.816223524770393776.t-ishii at sraoss.co.jp>
>> 
>> Here is a more or less formal proposal that replaces it.
>> 
>> Goal:
>> 
>> Currently pgpool-II kills all child processes when a fail over (or a
>> switch over by pcp_detach_node) occurs. Of course this disconnects
>> all existing client connections, because the child process each
>> client is connected to is gone. This proposal seeks a way to
>> minimize such session disconnections.
>> 
>> o Precondition:
>> 
>> I assume this proposal is for streaming replication mode only. Maybe
>> we could extend this to other modes in the future. I also assume the
>> broken server is not the primary.
>> 
>> o Consideration:
>> 
>> Why do we need to kill the child processes? Basically the problem is
>> the retry in the TCP/IP stack layer when the connection goes wrong,
>> for example when the network cable is pulled out. In this case the
>> only way to stop the retry is to restart the process.
>> 
>> There are several cases in which we could avoid the restart:
>> 
>> 1) Knowing that we are not dealing with a fail over caused by a
>> cabling problem. There are at least two cases in which we know the
>> problem is not a cabling problem:
>> 
>>  a) the fail over is triggered by pcp_detach_node.
>> 
>>  b) the fail over is triggered by postmaster shutdown.
>> 
>> For other cases we need to find a way to know whether the problem is
>> a cabling problem or not. Currently we use a timeout to detect that
>> situation. So if we could know whether the timeout has occurred, then
>> we could know whether the problem is a cabling problem or not.
>> 
>> 2) Once we succeed in #1, the next thing we need to do is determine
>> whether the session in question is using the broken server. This is
>> fairly easy because we already have the info in shared memory. If the
>> session uses the broken server, then we need to restart the process;
>> there is no way around it. Otherwise we just close the connection to
>> the broken backend (if any).
>> 
>> o Things we need to do:
>> 
>> - Invent a way to know whether the fail over request was created by
>>   pcp_detach_node. Probably we will add a new flag to the fail over
>>   request packet to indicate whether the origin of the request is
>>   pcp_detach_node or not.
>> 
>> - The same technique can be used for the case where the admin shuts
>>   down PostgreSQL.
>> 
>> - Create an API to deal with connections using the broken server.
>> 
>> o What are the benefits once the above proposal is implemented?
>> 
>> - If the conditions below are met, the user session can survive a fail over.
>> 
>>  - Operated in streaming replication mode
>> 
>>  - The failed server is not the primary
>> 
>>  - The session does not connect to the broken standby server
>> 
>> Comments, opinions?
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>> _______________________________________________
>> pgpool-hackers mailing list
>> pgpool-hackers at pgpool.net
>> http://www.pgpool.net/mailman/listinfo/pgpool-hackers