View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000483 | Pgpool-II | Bug | public | 2019-03-26 21:18 | 2019-08-15 19:14 |
| Reporter | nagata | Assigned To | Muhammad Usama | ||
| Priority | normal | Severity | minor | Reproducibility | have not tried |
| Status | resolved | Resolution | fixed | ||
| Product Version | 3.6.15 | ||||
| Fixed in Version | 3.6.18 | ||||
| Summary | 0000483: online-recovery is blocked after a child process exits abnormally with replication mode and watchdog | ||||
| Description | Pgpool-II 3.6.15 fixed a failure of online-recovery which occurs when child processes exit abnormally. - Fix online recovery failed due to client_idle_limit_in_recovery in certain cases. (bug 431) (Tatsuo Ishii) (https://www.pgpool.net/mantisbt/view.php?id=431) However, got the same problem even in 3.6.15, after a child was terminated by segfault, although client_idle_limit_in_recovery = -1. I found this is due to watchdog. When watchdog is enabled, wd_start_recovery() is called just after "2nd stage" starts. In wd_start_recovery(), the recovery request is sent to other pgpool, and the other pgpool will waits for all children exiting. However, if some child process has exited abnormally in the other pgpool, this never returns a response because Req_info->conn_counter cannot be zero. Therefore, the original pgpool will waits for the response until the timeout is detected, and the online recovery fails eventually. I guess a fix similar to one in 3.6.15 will be needed in process_wd_command_timer_event(), where Req_info->conn_counter is checked periodically for processing recovery commands received by watchdog. | ||||
| Steps To Reproduce | The related discussion is in [pgpool-hackers: 3211] Deal with recovery failure by an abnormally exiting child proces See this thread for details. | ||||
| Tags | No tags attached. | ||||
|
|
Hi yugo I think your analysis is spot on. I have applied the solution done by Tatsuo Ishii for bug 431into watchdog code flow. Can you please try out the attached patch to see if it solves the issue. Thanks |
|
|
Hello Usama, I did a test for the bug_483.diff patch with Pgpool-II 3.6.17. It seems not to work. (connect via pgpool1 and kill its child process) $ psql -p 9999 db1 $ ps x | grep pgpool | grep idle $ kill -9 12345 (recovery request via pgpool2) $ pcp_recovery_node -p 9797 1 ERROR: node recovery failed, failed to send start recovery packet This can be success without the previous step. pgpool1 (that has a killed child pgpool proces) log: 2019-05-17 19:33:51: pid 13391: LOG: watchdog received online recovery request from "localhost:9998 Linux srapc2499" 2019-05-17 19:33:51: pid 13391: LOG: wait_connection_closed: mulformed conn_counter (1) detected. reset it to 0 pgpool2 log: 2019-05-17 19:33:51: pid 13654: WARNING: start recovery command lock failed 2019-05-17 19:33:51: pid 13654: DETAIL: ipc command timeout 2019-05-17 19:33:51: pid 13393: LOG: new IPC connection received 2019-05-17 19:33:51: pid 13393: LOG: read from socket failed, remote end closed the connection 2019-05-17 19:33:51: pid 13393: LOG: online recovery request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "localhost:9998 Linux srapc2499" 2019-05-17 19:33:51: pid 13393: DETAIL: waiting for the reply... 2019-05-17 19:33:51: pid 13654: ERROR: node recovery failed, failed to send start recovery packet 2019-05-17 19:33:51: pid 13608: LOG: PCP process with pid: 13654 exit with SUCCESS. 2019-05-17 19:33:51: pid 13608: LOG: PCP process with pid: 13654 exits with status 0 According to the script logs, recovery_1st_stage was executed, but recovery_2nd_stage and pgpool_remote_start weren't executed. |
|
|
Hello, Usama and Yugo. It has been a while. I am reporting more. In this case above, pgpool2's issue_command_to_watchdog() always resulted in timeout that exceeds recovery_timeout like the following logic flow. In pgpool2, which receives the pcp_recovery_node request: wd_start_recovery() - issue_command_to_watchdog() with recovery_timeout In pgpool1, which has a child process that killed: ensure_conn_counter_validity() after timer_expired (recovery_timeout) I confirmed that the change like the following for example can avoid this failure. src/watchdog/wd_commands.c: wd_start_recovery(void): WDIPCCmdResult *result = issue_command_to_watchdog(WD_IPC_ONLINE_RECOVERY_COMMAND, - pool_config->recovery_timeout, + pool_config->recovery_timeout + 10, func, strlen(func), true); When I do pcp_recovery_node for pgpool1 with bug_483.diff patch, it always succeeds. |
|
|
Hi Haruka, Thanks for the patch. I am looking into this one. |
|
|
Hi Haruka, I can confirm your suggested changes were required on top of the original patch. I have committed the modified patch on all supported branches. https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=c402060814ba9737e9572ba07eef1be7664c9679 Thanks |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-03-26 21:18 | nagata | New Issue | |
| 2019-04-01 16:32 | administrator | Assigned To | => t-ishii |
| 2019-04-01 16:32 | administrator | Status | new => assigned |
| 2019-04-02 13:19 | t-ishii | Assigned To | t-ishii => Muhammad Usama |
| 2019-04-16 00:21 | Muhammad Usama | File Added: bug_483.diff | |
| 2019-04-16 00:21 | Muhammad Usama | Note Added: 0002556 | |
| 2019-05-17 19:41 | harukat | Note Added: 0002598 | |
| 2019-05-17 19:51 | harukat | Note Edited: 0002598 | |
| 2019-05-17 19:54 | harukat | Note Edited: 0002598 | |
| 2019-05-30 14:09 | harukat | Note Added: 0002642 | |
| 2019-06-19 15:27 | Muhammad Usama | Note Added: 0002668 | |
| 2019-08-08 23:45 | Muhammad Usama | Status | assigned => resolved |
| 2019-08-08 23:45 | Muhammad Usama | Resolution | open => fixed |
| 2019-08-08 23:45 | Muhammad Usama | Note Added: 0002757 | |
| 2019-08-15 19:14 | administrator | Fixed in Version | => 3.6.18 |