View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0000249 | pgpool-II | Bug | public | 2016-09-20 21:31 | 2017-02-24 04:30 |
Reporter | ysobj | Assigned To | Muhammad Usama | ||
Priority | high | Severity | major | Reproducibility | random |
Status | resolved | Resolution | fixed | ||
Platform | OS | CentOS | OS Version | 7.2 | |
Product Version | 3.5.3 | ||||
Target Version | Fixed in Version | ||||
Summary | 0000249: watchdog sometimes fails de-escalation. | ||||
Description | I shut down pgpool through systemctl. VIP sometimes was not removed. I use 'pgpool-II version 3.5.4 (ekieboshi)' that had been installed by RPM. | ||||
Steps To Reproduce | I tested like below(I tested with MASTER only) systemctl start pgpool.service (wait for the VIP is added.) systemctl stop pgpool.service | ||||
Additional Information | (try 1) Sep 20 20:22:11 vega pgpool[20223]: [23-1] 2016-09-20 20:22:11: pid 20223: LOG: Watchdog is shutting down Sep 20 20:22:11 vega pgpool[20248]: [10-1] 2016-09-20 20:22:11: pid 20248: LOG: child process received shutdown request signal 15 Sep 20 20:22:11 vega pgpool[20222]: [13-1] 2016-09-20 20:22:11: pid 20222: LOG: received smart shutdown request Sep 20 20:22:11 vega pgpool[20222]: [14-1] 2016-09-20 20:22:11: pid 20222: LOG: shutdown request. closing listen socket Sep 20 20:22:11 vega pgpool[20248]: [11-1] 2016-09-20 20:22:11: pid 20248: LOG: child process received shutdown request signal 15 (snip) Sep 20 20:22:11 vega pgpool[20241]: [10-1] 2016-09-20 20:22:11: pid 20241: LOG: child process received shutdown request signal 15 Sep 20 20:22:11 vega pgpool[20312]: [24-1] 2016-09-20 20:22:11: pid 20312: LOG: watchdog: de-escalation started Sep 20 20:22:11 vega pgpool[20242]: [10-1] 2016-09-20 20:22:11: pid 20242: LOG: child process received shutdown request signal 15 Sep 20 20:22:11 vega pgpool[20243]: [10-1] 2016-09-20 20:22:11: pid 20243: LOG: child process received shutdown request signal 15 Sep 20 20:22:11 vega pgpool[20244]: [10-1] 2016-09-20 20:22:11: pid 20244: LOG: child process received shutdown request signal 15 Sep 20 20:22:11 vega pgpool[20312]: [25-1] 2016-09-20 20:22:11: pid 20312: LOG: watchdog de-escalation successful Sep 20 20:22:11 vega pgpool[20245]: [10-1] 2016-09-20 20:22:11: pid 20245: LOG: child process received shutdown request signal 15 (snip) -> VIP was removed. (try 2) Sep 20 20:25:06 vega pgpool[20366]: [13-1] 2016-09-20 20:25:06: pid 20366: LOG: received smart shutdown request Sep 20 20:25:06 vega pgpool[20366]: [14-1] 2016-09-20 20:25:06: pid 20366: LOG: shutdown request. closing listen socket Sep 20 20:25:06 vega pgpool[20367]: [23-1] 2016-09-20 20:25:06: pid 20367: LOG: Watchdog is shutting down Sep 20 20:25:06 vega pgpool[20374]: [10-1] 2016-09-20 20:25:06: pid 20374: LOG: child process received shutdown request signal 15 (snip) Sep 20 20:25:06 vega pgpool[20415]: [10-1] 2016-09-20 20:25:06: pid 20415: LOG: child process received shutdown request signal 15 Sep 20 20:25:06 vega pgpool[20820]: [24-1] 2016-09-20 20:25:06: pid 20820: LOG: watchdog: de-escalation started -> VIP was removed. However, "LOG: watchdog de-escalation successful" was not appered. (try 3) Sep 20 20:27:37 vega pgpool[20843]: [13-1] 2016-09-20 20:27:37: pid 20843: LOG: received smart shutdown request Sep 20 20:27:37 vega pgpool[20843]: [14-1] 2016-09-20 20:27:37: pid 20843: LOG: shutdown request. closing listen socket Sep 20 20:27:37 vega pgpool[20844]: [23-1] 2016-09-20 20:27:37: pid 20844: LOG: Watchdog is shutting down Sep 20 20:27:37 vega pgpool[20848]: [10-1] 2016-09-20 20:27:37: pid 20848: LOG: child process received shutdown request signal 15 (snip) -> VIP wasn't removed. (try 4) Sep 20 20:31:43 vega pgpool[21551]: [13-1] 2016-09-20 20:31:43: pid 21551: LOG: received smart shutdown request Sep 20 20:31:43 vega pgpool[21551]: [14-1] 2016-09-20 20:31:43: pid 21551: LOG: shutdown request. closing listen socket Sep 20 20:31:43 vega pgpool[21552]: [23-1] 2016-09-20 20:31:43: pid 21552: LOG: Watchdog is shutting down Sep 20 20:31:43 vega pgpool[21566]: [10-1] 2016-09-20 20:31:43: pid 21566: LOG: child process received shutdown request signal 15 (snip) -> VIP wasn't removed. | ||||
Tags | No tags attached. | ||||
|
pgpool.conf (34,559 bytes) |
|
this test was always successful in 'pgpool-II version 3.5.3'. |
|
I think this problem has occured by a changeset e94b63b 'Fix usage of wait(2) in pgpool main process'. this problem is not appeared in changeset '9f64257'. |
|
Hi. Can you please share the complete pgpool-II log for the try-4 where the VIP was not removed. |
|
Also. it would be really helpful if you can share the pgpool-II log with DEBUG enabled |
|
try4.log (14,382 bytes) |
|
I tested with 'log_min_messages = debug1'. pgpool_debug.log (38,642 bytes) |
|
Hi all, I was able to reproduce this bug quite consistently using 3.6.1 (Master). As in bug description, when de-escalation worked there is a log line saying "watchdog: de-escalation started". This line is emitted in fork_plunging_process in watchdog/wd_escalation.c. On the contrary, when de-escalation did not work, this line did not appear in the log. I've added some more verbose log and found that in problematic cases function fork_plunging_process does not complete. If I add an ereport immediately before POOL_SETMASK(&UnBlockSig); and one immediately after, the second one is never executed, as the process were killed beforehand. I'm not familiar with the code but it seems that signal 15 is unblocked at least for a short window of time. |
|
Any update on this issue? |
|
Hi I have committed the fix for the problem which could cause this issue. Can you please verify building from the code if the problem is also solved at your end. Below is the commit. http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=28a0184da02037c011ae81c88418ff515d68fb66 |
|
Hi Usama, thanks for your quick response. the fix you committed seems to solve the issue. Now stopping the service takes some seconds instead of being immediate, and the log line "Stopped pgpool-II" is always the last in the log. Without the fix it may happen that the line "Stopped pgpool-II" were between other log lines issued by children "child process received shutdown request signal 15". This is expected since now main process waits for the child to terminate. Thus for me the issue is solved, thank again for your work. |
|
Are you going to release a new version with the patched code? |
|
Is there any plan to release an official version with this bug fixed? I have one client on hold and I would like to provide a deploy plan |
|
Recently we also faced this case! When the official version with the fixed bug will be provided? We want to test it and go live. |
|
We have fetched the snapshot https://git.postgresql.org/gitweb/?p=pgpool2.git;a=snapshot;h=4f98bf00eaac62092e39f9194618206ff0154d74;sf=tgz and built it. But the issue didn't disappear. We have the set: 1) Pgpool master + VIP 3) Pgpool standby When we stop pgpool master it doesn't down the VIP: Logs: Jan 29 00:19:59.639 pgpool[2434]: [346-1] DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.86.51:9694 Jan 29 00:19:59.639 pgpool[2438]: [346-2] LOCATION: wd_heartbeat.c:564 Jan 29 00:19:59.639 pgpool[2434]: [346-2] LOCATION: wd_heartbeat.c:564 Jan 29 00:20:00.613 pgpool[2432]: [243-1] DEBUG: watchdog heartbeat: received 224 byte packet Jan 29 00:20:00.613 pgpool[2432]: [243-2] LOCATION: wd_heartbeat.c:328 Jan 29 00:20:00.613 pgpool[2432]: [244-1] DEBUG: received heartbeat signal from "192.168.86.51:5432" hostname:OAA-72c02c29743e.aqa.int.zone Jan 29 00:20:00.613 pgpool[2432]: [244-2] LOCATION: wd_heartbeat.c:442 Jan 29 00:20:00.613 pgpool[2432]: [245-1] DEBUG: received heartbeat signal from "192.168.86.51:5432" hostname:OAA-72c02c29743e.aqa.int.zone Jan 29 00:20:00.613 pgpool[2432]: [245-2] LOCATION: wd_heartbeat.c:442 Jan 29 00:20:00.613 pgpool[2432]: [246-1] DEBUG: received heartbeat signal from "192.168.86.51(OAA-72c02c29743e.aqa.int.zone):5432" node:OAA-72c02c29743e.aqa.int.zone:5432 Linux OAA-72c02c29743e.aqa.int.zone Jan 29 00:20:00.613 pgpool[2432]: [246-2] LOCATION: wd_heartbeat.c:452 Jan 29 00:20:00.621 pgpool[2404]: [136-1] DEBUG: watchdog checking life check is ready Jan 29 00:20:00.621 pgpool[2404]: [136-2] DETAIL: pgpool:2 at "192.168.102.95:5432" has not send the heartbeat signal yet Jan 29 00:20:00.621 pgpool[2404]: [136-3] LOCATION: wd_lifecheck.c:647 Jan 29 00:20:01.639 pgpool[2438]: [347-1] DEBUG: watchdog heartbeat: send 224 byte packet Jan 29 00:20:01.639 pgpool[2434]: [347-1] DEBUG: watchdog heartbeat: send 224 byte packet Jan 29 00:20:01.639 pgpool[2438]: [347-2] LOCATION: wd_heartbeat.c:300 Jan 29 00:20:01.639 pgpool[2434]: [347-2] LOCATION: wd_heartbeat.c:300 Jan 29 00:20:01.639 pgpool[2434]: [348-1] DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.86.51:9694 Jan 29 00:20:01.639 pgpool[2438]: [348-1] DEBUG: watchdog heartbeat: send heartbeat signal to 192.168.102.95:9694 Jan 29 00:20:01.639 pgpool[2434]: [348-2] LOCATION: wd_heartbeat.c:564 Jan 29 00:20:01.639 pgpool[2438]: [348-2] LOCATION: wd_heartbeat.c:564 Jan 29 00:20:02.614 pgpool[2432]: [247-1] DEBUG: watchdog heartbeat: received 224 byte packet Jan 29 00:20:02.614 pgpool[2432]: [247-2] LOCATION: wd_heartbeat.c:328 Jan 29 00:20:02.614 pgpool[2432]: [248-1] DEBUG: received heartbeat signal from "192.168.86.51:5432" hostname:OAA-72c02c29743e.aqa.int.zone Jan 29 00:20:02.614 pgpool[2432]: [248-2] LOCATION: wd_heartbeat.c:442 Jan 29 00:20:02.614 pgpool[2432]: [249-1] DEBUG: received heartbeat signal from "192.168.86.51:5432" hostname:OAA-72c02c29743e.aqa.int.zone Jan 29 00:20:02.614 pgpool[2432]: [249-2] LOCATION: wd_heartbeat.c:442 Jan 29 00:20:02.614 pgpool[2432]: [250-1] DEBUG: received heartbeat signal from "192.168.86.51(OAA-72c02c29743e.aqa.int.zone):5432" node:OAA-72c02c29743e.aqa.int.zone:5432 Linux OAA-72c02c29743e.aqa.int.zone Jan 29 00:20:02.614 pgpool[2432]: [250-2] LOCATION: wd_heartbeat.c:452 Jan 29 00:20:02.956 pgpool[2403]: [548-1] LOG: Watchdog is shutting down Jan 29 00:20:02.956 pgpool[2403]: [548-2] LOCATION: watchdog.c:2967 Jan 29 00:20:02.956 pgpool[2403]: [549-1] DEBUG: sending watchdog packet Socket:9, Type:[INFORM I AM GOING DOWN], Command_ID:35, data Length:0 Jan 29 00:20:02.956 pgpool[2403]: [549-2] LOCATION: watchdog.c:3185 Jan 29 00:20:02.956 pgpool[2403]: [550-1] DEBUG: not sending watchdog internal command packet to DEAD Jan 29 00:20:02.956 pgpool[2403]: [550-2] LOCATION: watchdog.c:4085 Jan 29 00:20:02.956 pgpool[2402]: [268-1] LOG: received smart shutdown request Jan 29 00:20:02.956 pgpool[2402]: [268-2] LOCATION: pgpool_main.c:1390 Jan 29 00:20:02.956 pgpool[2402]: [269-1] LOG: shutdown request. closing listen socket Jan 29 00:20:02.956 pgpool[2402]: [269-2] LOCATION: pgpool_main.c:1412 Jan 29 00:20:02.956 pgpool[2404]: [137-1] DEBUG: lifecheck child receives shutdown request signal 15, forwarding to all children Jan 29 00:20:02.956 pgpool[2404]: [137-2] LOCATION: wd_lifecheck.c:275 Jan 29 00:20:02.956 pgpool[2404]: [138-1] DEBUG: lifecheck child receives smart shutdown request Jan 29 00:20:02.956 pgpool[2404]: [138-2] LOCATION: wd_lifecheck.c:280 Jan 29 00:20:02.956 pgpool[2436]: [175-1] DEBUG: watchdog heartbeat receiver child receives shutdown request signal 15 Jan 29 00:20:02.956 pgpool[2436]: [175-2] LOCATION: wd_heartbeat.c:599 Jan 29 00:20:02.956 pgpool[2436]: [176-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2436]: [176-2] LOCATION: elog.c:2587 Jan 29 00:20:02.956 pgpool[2436]: [177-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2436]: [177-2] LOCATION: elog.c:2559 Jan 29 00:20:02.956 pgpool[2434]: [349-1] DEBUG: watchdog heartbeat sender child receives shutdown request signal 15 Jan 29 00:20:02.956 pgpool[2434]: [349-2] LOCATION: wd_heartbeat.c:580 Jan 29 00:20:02.956 pgpool[2434]: [350-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2434]: [350-2] LOCATION: elog.c:2587 Jan 29 00:20:02.956 pgpool[2434]: [351-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2434]: [351-2] LOCATION: elog.c:2559 Jan 29 00:20:02.956 pgpool[2406]: [126-1] DEBUG: UNLOCKING select() Jan 29 00:20:02.956 pgpool[2406]: [126-2] LOCATION: child.c:1915 Jan 29 00:20:02.956 pgpool[2406]: [127-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2406]: [127-2] LOCATION: elog.c:2587 Jan 29 00:20:02.956 pgpool[2406]: [128-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2406]: [128-2] LOCATION: elog.c:2559 Jan 29 00:20:02.956 pgpool[2649]: [250-1] DEBUG: PCP child receives shutdown request signal 15, Forwarding to all children Jan 29 00:20:02.956 pgpool[2649]: [250-2] LOCATION: pcp_child.c:358 Jan 29 00:20:02.956 pgpool[2649]: [251-1] DEBUG: PCP child receives smart shutdown request Jan 29 00:20:02.956 pgpool[2649]: [251-2] LOCATION: pcp_child.c:365 Jan 29 00:20:02.956 pgpool[2649]: [252-1] DEBUG: shmem_exit(-1): 2 callbacks to make Jan 29 00:20:02.956 pgpool[2649]: [252-2] LOCATION: elog.c:2587 Jan 29 00:20:02.956 pgpool[2409]: [125-1] DEBUG: LOCKING select() Jan 29 00:20:02.956 pgpool[2649]: [253-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.956 pgpool[2409]: [125-2] LOCATION: child.c:1904 Jan 29 00:20:02.956 pgpool[2649]: [253-2] LOCATION: elog.c:2559 Jan 29 00:20:02.957 pgpool[2438]: [349-1] DEBUG: watchdog heartbeat sender child receives shutdown request signal 15 Jan 29 00:20:02.957 pgpool[2438]: [349-2] LOCATION: wd_heartbeat.c:580 Jan 29 00:20:02.957 pgpool[2438]: [350-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2438]: [350-2] LOCATION: elog.c:2587 Jan 29 00:20:02.957 pgpool[2438]: [351-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2438]: [351-2] LOCATION: elog.c:2559 Jan 29 00:20:02.957 pgpool[2404]: [139-1] DEBUG: lifecheck child receives shutdown request signal 15, forwarding to all children Jan 29 00:20:02.957 pgpool[2404]: [139-2] LOCATION: wd_lifecheck.c:275 Jan 29 00:20:02.957 pgpool[2404]: [140-1] DEBUG: lifecheck child receives smart shutdown request Jan 29 00:20:02.957 pgpool[2404]: [140-2] LOCATION: wd_lifecheck.c:280 Jan 29 00:20:02.957 pgpool[2432]: [251-1] DEBUG: watchdog heartbeat receiver child receives shutdown request signal 15 Jan 29 00:20:02.957 pgpool[2432]: [251-2] LOCATION: wd_heartbeat.c:599 Jan 29 00:20:02.957 pgpool[2432]: [252-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2432]: [252-2] LOCATION: elog.c:2587 Jan 29 00:20:02.957 pgpool[2432]: [253-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2432]: [253-2] LOCATION: elog.c:2559 Jan 29 00:20:02.957 pgpool[2735]: [551-1] LOG: watchdog: de-escalation started Jan 29 00:20:02.957 pgpool[2735]: [551-2] LOCATION: wd_escalation.c:181 Jan 29 00:20:02.957 pgpool[2404]: [141-1] DEBUG: shmem_exit(-1): 4 callbacks to make Jan 29 00:20:02.957 pgpool[2404]: [141-2] LOCATION: elog.c:2587 Jan 29 00:20:02.957 pgpool[2404]: [142-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2404]: [142-2] LOCATION: elog.c:2559 Jan 29 00:20:02.957 pgpool[2650]: [125-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2650]: [125-2] LOCATION: elog.c:2587 Jan 29 00:20:02.957 pgpool[2650]: [126-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:20:02.957 pgpool[2650]: [126-2] LOCATION: elog.c:2559 Jan 29 00:20:02.958 pgpool[2403]: [551-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:20:02.958 pgpool[2403]: [551-2] LOCATION: elog.c:2587 Jan 29 00:20:02.958 pgpool[2403]: [552-1] DEBUG: proc_exit(-1): 1 callbacks to make Jan 29 00:20:02.958 pgpool[2403]: [552-2] LOCATION: elog.c:2559 |
|
Also please see the attached log file (core.log). It contains errors: Jan 29 00:31:02.270 pgpool[2927]: [126-2] LOCATION: child.c:926 Jan 29 00:31:02.270 pgpool[2925]: [126-1] LOG: child process received shutdown request signal 15 Jan 29 00:31:02.270 pgpool[2925]: [126-2] LOCATION: child.c:926 Jan 29 00:31:02.270 pgpool[2922]: [126-1] LOG: child process received shutdown request signal 15 Jan 29 00:31:02.270 pgpool[2922]: [126-2] LOCATION: child.c:926 Jan 29 00:31:02.270 pgpool[2829]: [126-1] DEBUG: LOCKING select() Jan 29 00:31:02.270 pgpool[2829]: [126-2] LOCATION: child.c:1888 Jan 29 00:31:02.307 pgpool[2825]: [228-1] DEBUG: shmem_exit(-1): 17 callbacks to make Jan 29 00:31:02.307 pgpool[2825]: [228-2] LOCATION: elog.c:2587 Jan 29 00:31:02.309 pgpool[2830]: [127-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.309 pgpool[2830]: [127-2] LOCATION: pool_sema.c:129 Jan 29 00:31:02.309 pgpool[2862]: [127-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.309 pgpool[2830]: [128-1] DEBUG: LOCKING select() Jan 29 00:31:02.309 pgpool[2862]: [127-2] LOCATION: pool_sema.c:129 Jan 29 00:31:02.309 pgpool[2830]: [128-2] LOCATION: child.c:1888 Jan 29 00:31:02.309 pgpool[2862]: [128-1] DEBUG: LOCKING select() Jan 29 00:31:02.309 pgpool[2862]: [128-2] LOCATION: child.c:1888 Jan 29 00:31:02.309 pgpool[2860]: [127-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.309 pgpool[2860]: [127-2] LOCATION: pool_sema.c:129 AND Jan 29 00:31:02.313 pgpool[2910]: [128-1] DEBUG: LOCKING select() Jan 29 00:31:02.313 pgpool[2910]: [128-2] LOCATION: child.c:1888 Jan 29 00:31:02.313 pgpool[3048]: [128-1] DEBUG: LOCKING select() Jan 29 00:31:02.313 pgpool[3048]: [128-2] LOCATION: child.c:1888 Jan 29 00:31:02.313 pgpool[2980]: [126-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.313 pgpool[3049]: [127-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.313 pgpool[2980]: [126-2] LOCATION: pool_sema.c:129 Jan 29 00:31:02.313 pgpool[3049]: [127-2] LOCATION: pool_sema.c:129 Jan 29 00:31:02.313 pgpool[2980]: [127-1] DEBUG: LOCKING select() Jan 29 00:31:02.313 pgpool[3049]: [128-1] DEBUG: LOCKING select() Jan 29 00:31:02.313 pgpool[2980]: [127-2] LOCATION: child.c:1888 Jan 29 00:31:02.313 pgpool[3049]: [128-2] LOCATION: child.c:1888 Jan 29 00:31:02.313 pgpool[2970]: [127-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.313 pgpool[3051]: [127-1] WARNING: failed to lock semaphore error:"Identifier removed" Jan 29 00:31:02.313 pgpool[2970]: [127-2] LOCATION: pool_sema.c:129 Jan 29 00:31:02.313 pgpool[3051]: [127-2] LOCATION: pool_sema.c:129 core.log (2,266,433 bytes) |
|
Jan 29 00:31:02.384 pgpool[3059]: [133-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:31:02.384 pgpool[3059]: [133-2] LOCATION: elog.c:2559 Jan 29 00:31:02.384 pgpool[3058]: [133-2] LOCATION: elog.c:2559 Jan 29 00:31:02.384 pgpool[3060]: [129-1] LOG: child process received shutdown request signal 15 Jan 29 00:31:02.384 pgpool[3060]: [129-2] LOCATION: child.c:926 Jan 29 00:31:02.384 pgpool[3060]: [130-1] WARNING: failed to unlock semaphore error:"Invalid argument" Jan 29 00:31:02.384 pgpool[3060]: [130-2] LOCATION: pool_sema.c:158 Jan 29 00:31:02.384 pgpool[3060]: [131-1] DEBUG: UNLOCKING select() Jan 29 00:31:02.384 pgpool[3060]: [131-2] LOCATION: child.c:1899 Jan 29 00:31:02.384 pgpool[3060]: [132-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:31:02.384 pgpool[3060]: [132-2] LOCATION: elog.c:2587 Jan 29 00:31:02.384 pgpool[3060]: [133-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:31:02.384 pgpool[3060]: [133-2] LOCATION: elog.c:2559 Jan 29 00:31:02.384 pgpool[3061]: [129-1] LOG: child process received shutdown request signal 15 Jan 29 00:31:02.384 pgpool[3061]: [129-2] LOCATION: child.c:926 Jan 29 00:31:02.384 pgpool[3061]: [130-1] WARNING: failed to unlock semaphore error:"Invalid argument" Jan 29 00:31:02.384 pgpool[3061]: [130-2] LOCATION: pool_sema.c:158 Jan 29 00:31:02.384 pgpool[3061]: [131-1] DEBUG: UNLOCKING select() Jan 29 00:31:02.384 pgpool[3061]: [131-2] LOCATION: child.c:1899 Jan 29 00:31:02.384 pgpool[3061]: [132-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 29 00:31:02.384 pgpool[3061]: [132-2] LOCATION: elog.c:2587 Jan 29 00:31:02.384 pgpool[3061]: [133-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 29 00:31:02.384 pgpool[3061]: [133-2] LOCATION: elog.c:2559 Jan 29 00:31:02.386 pgpool[3062]: [129-1] LOG: child process received shutdown request signal 15 Jan 29 00:31:02.386 pgpool[3064]: [129-1] LOG: child process received shutdown request signal 15 Jan 29 00:31:02.386 pgpool[3062]: [129-2] LOCATION: child.c:926 Jan 29 00:31:02.386 pgpool[3064]: [129-2] LOCATION: child.c:926 Jan 29 00:31:02.386 pgpool[3062]: [130-1] WARNING: failed to unlock semaphore error:"Invalid argument" Jan 29 00:31:02.386 pgpool[3064]: [130-1] WARNING: failed to unlock semaphore error:"Invalid argument" Jan 29 00:31:02.386 pgpool[3062]: [130-2] LOCATION: pool_sema.c:158 Jan 29 00:31:02.386 pgpool[3064]: [130-2] LOCATION: pool_sema.c:158 Jan 29 00:31:02.386 pgpool[3062]: [131-1] DEBUG: UNLOCKING select() |
|
In our environment we see the strange case. environment: - setup "2+' pgpool instances - stop master pgpool ( systemctl pgpool.service stop) At this moment in log files we that the master pgpool brings down the VIP and then pings it. But for some reasons we see it is performed twice! but the second operation detects it is up already ....because it was up by a new master. At this moment the logic fails. |
|
Muhammad, We did careful testing: the issue is still reproduced. Please see the results (attached artifact vip_test.zip) of the testing the latest version of V3_6_STABLE: >[root@POAMN-72c02c29743e pgpool2]# git log -2 >commit 4f98bf00eaac62092e39f9194618206ff0154d74 Environment: - 2x active Pgpool instances (Master VIP holder + standby) Facts: 1) We perform (/bin/systemctl stop pgpool.service) on CentOS with Master Pgpool 2) It can be seen pgpool performs shutdown for something like 20 seconds - sometimes faster, sometimes longer 3) VIP not bringed DOWN 3.1) Socket files are not removed: # ls -lah /var/run/pgpool/ total 4.0K drwxr-xr-x 2 pgpool pgpool 100 Jan 30 00:04 . drwxr-xr-x 25 root root 680 Jan 29 23:25 .. -rw------- 1 pgpool pgpool 6 Jan 30 00:03 pgpool.pid srwxrwxrwx 1 pgpool pgpool 0 Jan 30 00:03 .s.PGSQL.5432 srwxrwxrwx 1 pgpool pgpool 0 Jan 30 00:03 .s.PGSQL.9898 4) Status request gives result: /bin/systemctl status pgpool.service pgpool.service - pgpool-II Loaded: loaded (/usr/lib/systemd/system/pgpool.service; disabled; vendor preset: disabled) Drop-In: /etc/systemd/system/pgpool.service.d └─pgpool.conf Active: failed (Result: signal) since Mon 2017-01-30 00:06:12 MSK; 17min ago Process: 14601 ExecStart=/usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf $OPTS (code=killed, signal=KILL) Process: 14566 ExecStartPre=/etc/pgpool-II/pool_start_check.sh (code=exited, status=0/SUCCESS) Main PID: 14601 (code=killed, signal=KILL) Also we detected one very complex case: if the environment consists of 2+ pgpool instances and the Master is being shutdown then it can be seen that a NEW MASTER brings up the VIP. It is OK, but from the Previous Master logs we see it performs repeated 'if_down_cmd' command because the ping gives positive result nevertheless the fact the previous 'if_down_cmd' succeeded! We think the ping command after succeeded 'if_down_cmd' should clarified as optional here. vip_test.zip (72,810 bytes) |
|
Hi supp_k, As per the log (core.log) file shared above, It seems that Pgpool-II has performed all the de-escalation steps correctly at the time of Master shutdown. And apparently the command provided in the pgpool-II configuration (vip_down.sh $_IP_$) for bringing down the VIP is failing. So the Pgpool-II doesn't seem to be at fault in this. |
|
Hi supp_k, Your comment about getting the positive ping result, after bringing down the VIP because the VIP is brought up at the same time by another Pgpool-II node (that assumes the master responsibility) is valid. And I am looking into a possiible fix for that. |
|
Hi gabrimonfa, I will discuss the next release date with the community leader Ishii-San and will update you on that accordingly. |
|
Hi Muhammad, ok, we will perform deeper tests and prodive the results tomorrow. |
|
Muhammad, we repeated a series of tests and unfortunately the error is reproduced in 50% of cases (approximately). Please see the attached wd_error.zip file. It contains syslog log file and the corresponding pgpool.conf. The log file contains information on the 2 startups. ====================================================================================== The 1st startup (22:37:38.389) and stop (22:40:03.730) completed successfully with brought down VIP: Jan 31 22:38:33.611 pgpool[26446]: [290-1] DEBUG: sending watchdog packet Socket:9, Type:[INFORM I AM GOING DOWN], Command_ID:15, data Length:0 Jan 31 22:38:33.611 pgpool[26446]: [290-2] LOCATION: watchdog.c:3187 Jan 31 22:38:33.611 pgpool[26446]: [291-1] DEBUG: not sending watchdog internal command packet to DEAD Jan 31 22:38:33.611 pgpool[26446]: [291-2] LOCATION: watchdog.c:4087 Jan 31 22:38:33.612 pgpool[26955]: [292-1] LOG: watchdog: de-escalation started Jan 31 22:38:33.612 pgpool[26955]: [292-2] LOCATION: wd_escalation.c:181 Jan 31 22:38:33.612 pgpool[26447]: [129-1] DEBUG: shmem_exit(-1): 4 callbacks to make Jan 31 22:38:33.612 pgpool[26447]: [129-2] LOCATION: elog.c:2587 Jan 31 22:38:33.612 pgpool[26447]: [130-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:38:33.612 pgpool[26447]: [130-2] LOCATION: elog.c:2559 Jan 31 22:38:33.613 pgpool[26955]: [293-1] DEBUG: watchdog exec interface up/down command: 'ip addr del $_IP_$/16 dev eth0' succeeded Jan 31 22:38:33.613 pgpool[26955]: [293-2] LOCATION: wd_if.c:338 Jan 31 22:38:33.613 pgpool[26955]: [294-1] DEBUG: watchdog trying to ping host "192.168.110.32" Jan 31 22:38:33.613 pgpool[26955]: [294-2] LOCATION: wd_ping.c:103 Jan 31 22:38:36.616 pgpool[26955]: [295-1] WARNING: watchdog failed to ping host"192.168.110.32" Jan 31 22:38:36.616 pgpool[26955]: [295-2] DETAIL: ping process exits with code: 1 Jan 31 22:38:36.616 pgpool[26955]: [295-3] LOCATION: wd_ping.c:167 Jan 31 22:38:36.616 pgpool[26955]: [296-1] LOG: watchdog bringing down delegate IP Jan 31 22:38:36.617 pgpool[26955]: [296-2] DETAIL: if_down_cmd succeeded Jan 31 22:38:36.617 pgpool[26955]: [296-3] LOCATION: wd_if.c:213 ====================================================================================== But the 2nd launch (22:40:37.642) completed without invocation of (if_down_cmd): Jan 31 22:41:28.127 pgpool[27013]: [299-1] DEBUG: sending watchdog packet Socket:9, Type:[INFORM I AM GOING DOWN], Command_ID:15, data Length:0 Jan 31 22:41:28.127 pgpool[27025]: [186-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27032]: [162-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27013]: [299-2] LOCATION: watchdog.c:3187 Jan 31 22:41:28.127 pgpool[27025]: [186-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27032]: [162-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27025]: [187-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27032]: [163-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27025]: [187-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27032]: [163-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27013]: [300-1] DEBUG: not sending watchdog internal command packet to DEAD Jan 31 22:41:28.127 pgpool[27013]: [300-2] LOCATION: watchdog.c:4087 Jan 31 22:41:28.127 pgpool[27015]: [125-1] DEBUG: lifecheck child receives shutdown request signal 15, forwarding to all children Jan 31 22:41:28.127 pgpool[27015]: [125-2] LOCATION: wd_lifecheck.c:275 Jan 31 22:41:28.127 pgpool[27012]: [1180-1] LOG: received smart shutdown request Jan 31 22:41:28.127 pgpool[27015]: [126-1] DEBUG: lifecheck child receives smart shutdown request Jan 31 22:41:28.127 pgpool[27015]: [126-2] LOCATION: wd_lifecheck.c:280 Jan 31 22:41:28.127 pgpool[27012]: [1180-2] LOCATION: pgpool_main.c:1390 Jan 31 22:41:28.127 pgpool[27012]: [1181-1] LOG: shutdown request. closing listen socket Jan 31 22:41:28.127 pgpool[27012]: [1181-2] LOCATION: pgpool_main.c:1412 Jan 31 22:41:28.127 pgpool[27027]: [161-1] DEBUG: watchdog heartbeat sender child receives shutdown request signal 15 Jan 31 22:41:28.127 pgpool[27027]: [161-2] LOCATION: wd_heartbeat.c:580 Jan 31 22:41:28.127 pgpool[27027]: [162-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27027]: [162-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27027]: [163-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27027]: [163-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27266]: [448-1] DEBUG: UNLOCKING select() Jan 31 22:41:28.127 pgpool[27266]: [448-2] LOCATION: child.c:1915 Jan 31 22:41:28.127 pgpool[27266]: [449-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27266]: [449-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27266]: [450-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27266]: [450-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27029]: [125-1] DEBUG: watchdog heartbeat receiver child receives shutdown request signal 15 Jan 31 22:41:28.127 pgpool[27029]: [125-2] LOCATION: wd_heartbeat.c:599 Jan 31 22:41:28.127 pgpool[27029]: [126-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27029]: [126-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27029]: [127-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27029]: [127-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27267]: [447-1] DEBUG: LOCKING select() Jan 31 22:41:28.127 pgpool[27267]: [447-2] LOCATION: child.c:1904 Jan 31 22:41:28.127 pgpool[27506]: [448-1] DEBUG: PCP child receives shutdown request signal 15, Forwarding to all children Jan 31 22:41:28.127 pgpool[27506]: [448-2] LOCATION: pcp_child.c:358 Jan 31 22:41:28.127 pgpool[27506]: [449-1] DEBUG: PCP child receives smart shutdown request Jan 31 22:41:28.127 pgpool[27506]: [449-2] LOCATION: pcp_child.c:365 Jan 31 22:41:28.127 pgpool[27506]: [450-1] DEBUG: shmem_exit(-1): 2 callbacks to make Jan 31 22:41:28.127 pgpool[27506]: [450-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27015]: [127-1] DEBUG: lifecheck child receives shutdown request signal 15, forwarding to all children Jan 31 22:41:28.127 pgpool[27015]: [127-2] LOCATION: wd_lifecheck.c:275 Jan 31 22:41:28.127 pgpool[27015]: [128-1] DEBUG: lifecheck child receives smart shutdown request Jan 31 22:41:28.127 pgpool[27015]: [128-2] LOCATION: wd_lifecheck.c:280 Jan 31 22:41:28.127 pgpool[27506]: [451-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27506]: [451-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27506]: [452-1] DEBUG: PCP child receives shutdown request signal 15, Forwarding to all children Jan 31 22:41:28.127 pgpool[27506]: [452-2] LOCATION: pcp_child.c:358 Jan 31 22:41:28.127 pgpool[27506]: [453-1] DEBUG: PCP child receives smart shutdown request Jan 31 22:41:28.127 pgpool[27506]: [453-2] LOCATION: pcp_child.c:365 Jan 31 22:41:28.127 pgpool[27507]: [1180-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27507]: [1180-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27507]: [1181-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27015]: [129-1] DEBUG: shmem_exit(-1): 4 callbacks to make Jan 31 22:41:28.127 pgpool[27013]: [301-1] DEBUG: shmem_exit(-1): 0 callbacks to make Jan 31 22:41:28.127 pgpool[27507]: [1181-2] LOCATION: elog.c:2559 Jan 31 22:41:28.127 pgpool[27013]: [301-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27015]: [129-2] LOCATION: elog.c:2587 Jan 31 22:41:28.127 pgpool[27013]: [302-1] DEBUG: proc_exit(-1): 1 callbacks to make Jan 31 22:41:28.127 pgpool[27013]: [302-2] LOCATION: elog.c:2559 Jan 31 22:41:28.128 pgpool[27015]: [130-1] DEBUG: proc_exit(-1): 0 callbacks to make Jan 31 22:41:28.128 pgpool[27015]: [130-2] LOCATION: elog.c:2559 ====================================================================================== We illuminated all possible obstacles for execution of /usr/sbin/ip util. We still see the error is not fixed :(((( wd_error.zip (71,447 bytes) |
|
Muhammad, after modification of the systemd service configuration file we see the 'if_down_cmd' command is executed always (please see it). ============================================================== [Unit] Description=pgpool-II After=syslog.target network.target [Service] EnvironmentFile=-/etc/sysconfig/pgpool ExecStart=/usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf $OPTS ExecStop=/usr/bin/pgpool -f /etc/pgpool-II/pgpool.conf stop [Install] WantedBy=multi-user.target =============================================================== The previous version of the file did not contain the "ExecStop" command!!! It can be considered as a solution but we have a theory that pgpool doesnt process the SIGTERM signal "completely" and some its child processes are just "killed". Thus when the "stop" option is specified the pgpool ends lifecycle "NORMALLY". The issue for sure will not be reproduced on CentOS 6.X because it is init.d based and the corresponding init.d script shutdowns pgpool using this "stop" option. |
|
Hi Muhammad, may I ask you to comment the last message? If the proposed patch is good? We can see that pgpool.service file in git repository doesn't contain the described configuration options. |
|
Hi supp_k, Thanks for the thorough analysis, systemd service configuration file clearly seems to be missing the stop and reload commands, and your patch looks fine to me. I have posted the question to community leader and pgpool-II hacker list regarding this and will update the sevice file as soon as I get the answer from the community. Meanwhile, I am also looking into the possible problem with SIGTERM shutdown of pgpool-II |
|
Hi Muhammad, are there any updates about incorrect PING during shutdown of the master? Also are you going to accept the patch to pgpool.service file? |
|
Hi supp_k, I have pushed the service file changes to all related branches http://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=8fa731a3eed7962b127710236823de9ab0379b44 And I am still looking for some robust solution for the PING issue. |
|
Muhammad, I think pgpool should not be responsible for further validation checks like (ping) if 'if_down_cmd' returned 0 (zero). After all the Zero value is contract. |
|
Hi Muhammad, are there any updates on the question with the repeated ping? Also when will it be possible to obtain the latest build which includes fixed problems? |
|
Hi Pgpool Community, would it be possible to obtain the build of the Pgpool with this issue fixed? |
|
Hi Sorry for the delayed response, I was travelling and need the community leader's consent for making the change, which I have got today and committed the fix for the ping issue. https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=2b7217ef30a9b2e68cfa5845baf3429027d1626b And on the second part the point release of all supported versions is expected in the next week. |
Date Modified | Username | Field | Change |
---|---|---|---|
2016-09-20 21:31 | ysobj | New Issue | |
2016-09-20 21:31 | ysobj | File Added: pgpool.conf | |
2016-09-21 13:58 | ysobj | Note Added: 0001075 | |
2016-09-21 16:33 | t-ishii | Assigned To | => Muhammad Usama |
2016-09-21 16:33 | t-ishii | Status | new => assigned |
2016-09-21 16:56 | ysobj | Note Added: 0001078 | |
2016-09-21 22:09 | Muhammad Usama | Status | assigned => feedback |
2016-09-21 22:09 | Muhammad Usama | Note Added: 0001079 | |
2016-09-21 22:12 | Muhammad Usama | Note Added: 0001080 | |
2016-09-21 22:22 | ysobj | File Added: try4.log | |
2016-09-21 22:42 | ysobj | File Added: pgpool_debug.log | |
2016-09-21 22:42 | ysobj | Note Added: 0001081 | |
2016-09-21 22:42 | ysobj | Status | feedback => assigned |
2016-12-30 18:39 | gabrimonfa | Note Added: 0001275 | |
2017-01-02 22:01 | gabrimonfa | Note Added: 0001276 | |
2017-01-05 00:17 | Muhammad Usama | Note Added: 0001279 | |
2017-01-05 17:32 | gabrimonfa | Note Added: 0001280 | |
2017-01-05 17:33 | gabrimonfa | Note Added: 0001281 | |
2017-01-24 01:45 | gabrimonfa | Note Added: 0001294 | |
2017-01-29 05:09 | supp_k | Note Added: 0001305 | |
2017-01-29 06:26 | supp_k | Note Added: 0001306 | |
2017-01-29 06:34 | supp_k | File Added: core.log | |
2017-01-29 06:34 | supp_k | Note Added: 0001307 | |
2017-01-29 06:35 | supp_k | Note Added: 0001308 | |
2017-01-29 07:23 | elepuser | Note Added: 0001309 | |
2017-01-30 06:26 | supp_k | File Added: vip_test.zip | |
2017-01-30 06:26 | supp_k | Note Added: 0001311 | |
2017-01-31 23:35 | Muhammad Usama | Note Added: 0001314 | |
2017-01-31 23:41 | Muhammad Usama | Note Added: 0001315 | |
2017-01-31 23:43 | Muhammad Usama | Note Added: 0001316 | |
2017-02-01 01:05 | supp_k | Note Added: 0001317 | |
2017-02-01 04:59 | supp_k | File Added: wd_error.zip | |
2017-02-01 04:59 | supp_k | Note Added: 0001318 | |
2017-02-01 06:09 | supp_k | Note Added: 0001319 | |
2017-02-02 18:22 | supp_k | Note Added: 0001320 | |
2017-02-02 20:49 | Muhammad Usama | Note Added: 0001321 | |
2017-02-06 16:42 | supp_k | Note Added: 0001330 | |
2017-02-07 00:46 | Muhammad Usama | Note Added: 0001334 | |
2017-02-07 01:09 | supp_k | Note Added: 0001335 | |
2017-02-13 16:22 | supp_k | Note Added: 0001340 | |
2017-02-22 05:49 | supp_k | Note Added: 0001352 | |
2017-02-24 01:29 | Muhammad Usama | Note Added: 0001363 | |
2017-02-24 04:30 | Muhammad Usama | Status | assigned => resolved |
2017-02-24 04:30 | Muhammad Usama | Resolution | open => fixed |