0000483: online-recovery is blocked after a child process exits abnormally with replication mode and watchdog - Pgpool-II Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000483	Pgpool-II	Bug	public	2019-03-26 21:18	2019-08-15 19:14

Reporter	nagata	Assigned To	Muhammad Usama
Priority	normal	Severity	minor	Reproducibility	have not tried
Status	resolved	Resolution	fixed
Product Version	3.6.15
Fixed in Version	3.6.18

Summary	0000483: online-recovery is blocked after a child process exits abnormally with replication mode and watchdog
Description	Pgpool-II 3.6.15 fixed a failure of online-recovery which occurs when child processes exit abnormally. - Fix online recovery failed due to client_idle_limit_in_recovery in certain cases. (bug 431) (Tatsuo Ishii) (https://www.pgpool.net/mantisbt/view.php?id=431) However, got the same problem even in 3.6.15, after a child was terminated by segfault, although client_idle_limit_in_recovery = -1. I found this is due to watchdog. When watchdog is enabled, wd_start_recovery() is called just after "2nd stage" starts. In wd_start_recovery(), the recovery request is sent to other pgpool, and the other pgpool will waits for all children exiting. However, if some child process has exited abnormally in the other pgpool, this never returns a response because Req_info->conn_counter cannot be zero. Therefore, the original pgpool will waits for the response until the timeout is detected, and the online recovery fails eventually. I guess a fix similar to one in 3.6.15 will be needed in process_wd_command_timer_event(), where Req_info->conn_counter is checked periodically for processing recovery commands received by watchdog.
Steps To Reproduce	The related discussion is in [pgpool-hackers: 3211] Deal with recovery failure by an abnormally exiting child proces See this thread for details.
Tags	No tags attached.

Muhammad Usama 2019-04-16 00:21 developer ~0002556	Hi yugo I think your analysis is spot on. I have applied the solution done by Tatsuo Ishii for bug 431into watchdog code flow. Can you please try out the attached patch to see if it solves the issue. Thanks bug_483.diff (2,295 bytes) bug_483.diff (2,295 bytes)

harukat 2019-05-17 19:41 developer ~0002598 Last edited: 2019-05-17 19:54	Hello Usama, I did a test for the bug_483.diff patch with Pgpool-II 3.6.17. It seems not to work. (connect via pgpool1 and kill its child process) $ psql -p 9999 db1 $ ps x \| grep pgpool \| grep idle $ kill -9 12345 (recovery request via pgpool2) $ pcp_recovery_node -p 9797 1 ERROR: node recovery failed, failed to send start recovery packet This can be success without the previous step. pgpool1 (that has a killed child pgpool proces) log: 2019-05-17 19:33:51: pid 13391: LOG: watchdog received online recovery request from "localhost:9998 Linux srapc2499" 2019-05-17 19:33:51: pid 13391: LOG: wait_connection_closed: mulformed conn_counter (1) detected. reset it to 0 pgpool2 log: 2019-05-17 19:33:51: pid 13654: WARNING: start recovery command lock failed 2019-05-17 19:33:51: pid 13654: DETAIL: ipc command timeout 2019-05-17 19:33:51: pid 13393: LOG: new IPC connection received 2019-05-17 19:33:51: pid 13393: LOG: read from socket failed, remote end closed the connection 2019-05-17 19:33:51: pid 13393: LOG: online recovery request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "localhost:9998 Linux srapc2499" 2019-05-17 19:33:51: pid 13393: DETAIL: waiting for the reply... 2019-05-17 19:33:51: pid 13654: ERROR: node recovery failed, failed to send start recovery packet 2019-05-17 19:33:51: pid 13608: LOG: PCP process with pid: 13654 exit with SUCCESS. 2019-05-17 19:33:51: pid 13608: LOG: PCP process with pid: 13654 exits with status 0 According to the script logs, recovery_1st_stage was executed, but recovery_2nd_stage and pgpool_remote_start weren't executed.

harukat 2019-05-30 14:09 developer ~0002642	Hello, Usama and Yugo. It has been a while. I am reporting more. In this case above, pgpool2's issue_command_to_watchdog() always resulted in timeout that exceeds recovery_timeout like the following logic flow. In pgpool2, which receives the pcp_recovery_node request: wd_start_recovery() - issue_command_to_watchdog() with recovery_timeout In pgpool1, which has a child process that killed: ensure_conn_counter_validity() after timer_expired (recovery_timeout) I confirmed that the change like the following for example can avoid this failure. src/watchdog/wd_commands.c: wd_start_recovery(void): WDIPCCmdResult *result = issue_command_to_watchdog(WD_IPC_ONLINE_RECOVERY_COMMAND, - pool_config->recovery_timeout, + pool_config->recovery_timeout + 10, func, strlen(func), true); When I do pcp_recovery_node for pgpool1 with bug_483.diff patch, it always succeeds.

Muhammad Usama 2019-06-19 15:27 developer ~0002668	Hi Haruka, Thanks for the patch. I am looking into this one.

Muhammad Usama 2019-08-08 23:45 developer ~0002757	Hi Haruka, I can confirm your suggested changes were required on top of the original patch. I have committed the modified patch on all supported branches. https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=c402060814ba9737e9572ba07eef1be7664c9679 Thanks

Date Modified	Username	Field	Change
2019-03-26 21:18	nagata	New Issue
2019-04-01 16:32	administrator	Assigned To	=> t-ishii
2019-04-01 16:32	administrator	Status	new => assigned
2019-04-02 13:19	t-ishii	Assigned To	t-ishii => Muhammad Usama
2019-04-16 00:21	Muhammad Usama	File Added: bug_483.diff
2019-04-16 00:21	Muhammad Usama	Note Added: 0002556
2019-05-17 19:41	harukat	Note Added: 0002598
2019-05-17 19:51	harukat	Note Edited: 0002598
2019-05-17 19:54	harukat	Note Edited: 0002598
2019-05-30 14:09	harukat	Note Added: 0002642
2019-06-19 15:27	Muhammad Usama	Note Added: 0002668
2019-08-08 23:45	Muhammad Usama	Status	assigned => resolved
2019-08-08 23:45	Muhammad Usama	Resolution	open => fixed
2019-08-08 23:45	Muhammad Usama	Note Added: 0002757
2019-08-15 19:14	administrator	Fixed in Version	=> 3.6.18