[pgpool-hackers: 3211] Deal with recovery failure by an abnormally exiting child process

Tue Jan 8 10:33:23 JST 2019

In bug 431, it was reported that the second stage of recovery fails if
a child process has exited abnormally (typically because of SIGKILL or
a segfault). This is because the global connection counter
(Req_info->conn_counter) is not decremented when a child process exits
abnormally, so it never returns to 0. In general there is nothing we
can do about an abnormally exiting process, and we recommend
restarting the whole Pgpool-II in this case.

However, I found a tricky solution for one particular situation: when
client_idle_limit_in_recovery is properly set (i.e. it is enabled and
client_idle_limit_in_recovery < recovery_timeout).

The logic is shown in the patch:

	/*
	 * recovery_timeout has expired. Before returning with failure status,
	 * let's check if this is caused by a malformed conn_counter. If a child
	 * process exits abnormally (killed by SIGKILL or a segfault, for
	 * example), then conn_counter is not decremented at process exit, thus
	 * it will never return to 0. This can be detected by checking whether
	 * client_idle_limit_in_recovery is enabled and less than
	 * recovery_timeout, because all clients must have been kicked out by the
	 * time client_idle_limit_in_recovery expires. If so, we should reset
	 * conn_counter to 0 as well.
	 */
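
To make the proposed check concrete, here is a minimal sketch of the
decision taken after the timeout. It is not the patch itself: the
ReqInfo and Config structs and the function name are hypothetical
stand-ins for Pgpool-II's shared memory and configuration, and the
sketch only handles a positive client_idle_limit_in_recovery.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the shared request info (Req_info). */
typedef struct
{
	int			conn_counter;	/* client connections believed to be open */
} ReqInfo;

/* Hypothetical stand-in for the relevant configuration parameters. */
typedef struct
{
	int			recovery_timeout;	/* seconds */
	int			client_idle_limit_in_recovery;	/* seconds */
} Config;

/*
 * Assumed to be called once recovery_timeout has expired while waiting
 * for conn_counter to reach 0. Returns true if recovery may proceed,
 * false if it must fail.
 */
bool
conn_counter_ok_after_timeout(ReqInfo *req, const Config *cfg)
{
	if (req->conn_counter == 0)
		return true;

	/*
	 * Assumption of this sketch: only a positive limit counts as
	 * "enabled"; any other value is treated as not applicable here.
	 * If the limit is shorter than recovery_timeout, every live client
	 * has already been disconnected, so a non-zero counter can only be
	 * left over from an abnormally exited child. Reset it.
	 */
	if (cfg->client_idle_limit_in_recovery > 0 &&
		cfg->client_idle_limit_in_recovery < cfg->recovery_timeout)
	{
		req->conn_counter = 0;
		return true;
	}

	return false;
}
```

For example, with recovery_timeout = 90 and
client_idle_limit_in_recovery = 30, a leftover conn_counter of 3 would
be reset and recovery would continue. With the limit not enabled, the
function reports failure and leaves the counter untouched.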

Should we employ this? Is it too tricky? Comments are welcome.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp