[pgpool-hackers: 3213] Re: Deal with recovery failure by an abnormally exiting child process

Tue Jan 8 11:16:19 JST 2019

>> In bug 431, it was reported that the second stage of recovery fails if
>> a child process has exited abnormally (typically because of SIGKILL or
>> a segfault). This is because the global connection counter
>> (Req_info->conn_counter) is not decremented when a child process exits
>> abnormally. In general there is nothing we can do about abnormally
>> exiting processes, and we recommend restarting the whole Pgpool-II in
>> this case.
>> 
>> However, I have found a tricky solution for one particular situation: if
>> client_idle_limit_in_recovery is properly set (i.e.
>> client_idle_limit_in_recovery >= recovery_timeout).

Sorry, this should have been: 0 < client_idle_limit_in_recovery <= recovery_timeout || client_idle_limit_in_recovery == -1

>> The logic is shown in the patch:
>> 
>> 	/*
>> 	 * recovery_timeout has expired. Before returning with failure status,
>> 	 * let's check if this is caused by a stale conn_counter. If a child
>> 	 * process exits abnormally (killed by SIGKILL or a segfault, for example),
>> 	 * then conn_counter is not decremented at process exit, so it will
>> 	 * never return to 0. This can be detected by checking whether
>> 	 * client_idle_limit_in_recovery is enabled and smaller than
>> 	 * recovery_timeout, because all clients must have been kicked out by the
>> 	 * time client_idle_limit_in_recovery expires. If so, we should reset
>> 	 * conn_counter to 0 also.
>> 
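To make the idea concrete, here is a minimal standalone sketch of the check, using the corrected condition given above. It is not the attached patch: ReqInfoSketch, clients_must_be_gone() and reset_stale_conn_counter() are hypothetical names made up for illustration, and the real code works on the shared Req_info structure instead.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared request info (assumption). */
typedef struct
{
	int			conn_counter;	/* number of currently connected clients */
} ReqInfoSketch;

/*
 * Return true if every client must have been disconnected by the time
 * recovery_timeout expires: either clients are disconnected immediately
 * (-1), or the idle limit is enabled (> 0) and does not exceed
 * recovery_timeout.
 */
bool
clients_must_be_gone(int client_idle_limit_in_recovery, int recovery_timeout)
{
	if (client_idle_limit_in_recovery == -1)
		return true;

	return client_idle_limit_in_recovery > 0 &&
		client_idle_limit_in_recovery <= recovery_timeout;
}

/*
 * Meant to be called after recovery_timeout has expired. If conn_counter
 * is still positive although no client can remain, treat it as stale,
 * reset it to 0 and return true. Otherwise leave it alone and return
 * false, so that the caller reports the failure as before.
 */
bool
reset_stale_conn_counter(ReqInfoSketch *req,
						 int client_idle_limit_in_recovery,
						 int recovery_timeout)
{
	if (req->conn_counter > 0 &&
		clients_must_be_gone(client_idle_limit_in_recovery, recovery_timeout))
	{
		req->conn_counter = 0;
		return true;
	}

	return false;
}
```

With client_idle_limit_in_recovery = 30 and recovery_timeout = 90, a counter left at 2 by a killed child would be reset. With client_idle_limit_in_recovery = 0 (disabled) or 120 it would be left untouched, and recovery would fail as it does today.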
>> Should we employ this? Is it too tricky? Comments are welcome.
> 
> I forgot to attach the patch.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp