View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000325 | Pgpool-II | Bug | public | 2017-08-02 18:00 | 2017-08-09 21:27 |
| Reporter | tho | Assigned To | t-ishii | ||
| Priority | urgent | Severity | major | Reproducibility | random |
| Status | closed | Resolution | open | ||
| Platform | VM | OS | Fedora Server | OS Version | 25 |
| Product Version | 3.6.4 | ||||
| Summary | 0000325: pool_check_fd call failed with an error "Interrupted system call" | ||||
| Tags | failover, watchdog | ||||

Description

We're encountering the exact same problem that was mentioned in issue 307. It is not predictable when it happens, but it could be related to high load. In the unfortunate event of two consecutive failures, pgpool triggers a failover, which leads to a split-brain condition:

```
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [46-1] 2017-07-31 13:40:24: pid 14418: WARNING: health check timed out while waiting for reading data
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-1] 2017-07-31 13:40:24: pid 14418: ERROR: unable to read data from DB node 0
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-2] 2017-07-31 13:40:24: pid 14418: DETAIL: pool_check_fd call failed with an error "Interrupted system call"
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [48-1] 2017-07-31 13:40:44: pid 14418: WARNING: health check timed out while waiting for reading data
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [49-1] 2017-07-31 13:40:44: pid 14418: ERROR: unable to read data from DB node 0
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [49-2] 2017-07-31 13:40:44: pid 14418: DETAIL: pool_check_fd call failed with an error "Interrupted system call"
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [50-1] 2017-07-31 13:40:44: pid 14418: LOG: setting backend node 1 status to NODE DOWN
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [51-1] 2017-07-31 13:40:44: pid 14418: LOG: received degenerate backend request for node_id: 1 from pid [14418]
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [40-1] 2017-07-31 13:40:44: pid 14419: LOG: new IPC connection received
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [41-1] 2017-07-31 13:40:44: pid 14419: LOG: failover request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx"
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [41-2] 2017-07-31 13:40:44: pid 14419: DETAIL: waiting for the reply...
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [42-1] 2017-07-31 13:40:49: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" is not replying..
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [42-2] 2017-07-31 13:40:49: pid 14419: DETAIL: marking the node as lost
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [43-1] 2017-07-31 13:40:49: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" is lost
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [44-1] 2017-07-31 13:40:49: pid 14419: LOG: watchdog cluster has lost the coordinator node
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [45-1] 2017-07-31 13:40:49: pid 14419: LOG: unassigning the remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" from watchdog cluster master
Jul 31 13:40:49 db1.xxxxxxx.xx pgpool[14418]: [52-1] 2017-07-31 13:40:49: pid 14418: LOG: degenerate backend request for 1 node(s) from pid [14418] is canceled by other pgpool
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [46-1] 2017-07-31 13:40:49: pid 14419: LOG: We have lost the cluster master node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx"
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [47-1] 2017-07-31 13:40:49: pid 14419: LOG: watchdog node state changed from [STANDBY] to [JOINING]
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [48-1] 2017-07-31 13:40:49: pid 14419: LOG: new outbond connection to db2.xxxxxxx.xx:9000
Jul 31 13:40:53 db1.xxxxxxx.xx atchdog[14419]: [49-1] 2017-07-31 13:40:53: pid 14419: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [50-1] 2017-07-31 13:40:54: pid 14419: LOG: I am the only alive node in the watchdog cluster
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [50-2] 2017-07-31 13:40:54: pid 14419: HINT: skiping stand for coordinator state
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [51-1] 2017-07-31 13:40:54: pid 14419: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [52-1] 2017-07-31 13:40:54: pid 14419: LOG: I am announcing my self as master/coordinator watchdog node
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [53-1] 2017-07-31 13:40:54: pid 14419: LOG: setting the remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" as watchdog cluster master
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [54-1] 2017-07-31 13:40:54: pid 14419: LOG: new watchdog node connection is received from "xxx.xxx.xxx.xxx:30408"
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [55-1] 2017-07-31 13:40:54: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" decided it is the true master
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [55-2] 2017-07-31 13:40:54: pid 14419: DETAIL: re-initializing the local watchdog cluster state because of split-brain
```
Notes

Note 0001620 (t-ishii, 2017-08-03 15:48):

It seems the backend just did not respond within [health_check_timeout] seconds, possibly due to high load. You could make [health_check_timeout] longer than it is now and/or increase [health_check_max_retries].

BTW, this line:

```
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-1] 2017-07-31 13:40:24: pid 14418: ERROR: unable to read data from DB node 0
```

complains that Pgpool-II cannot read data from backend 0. This could be wrong: the failing backend could be one other than 0, because of a Pgpool-II bug. I committed the fix for this today. https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=8f00811ec5d0d941158c47d6c9e982bd2d43712c
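To make that advice concrete, the pgpool.conf parameters involved look like the fragment below. The values are illustrative assumptions, not settings confirmed in this report; the thread only confirms that the reporter uses health_check_max_retries = 3.

```
# pgpool.conf health-check tuning (illustrative values, not recommendations)
health_check_period = 10         # seconds between health checks of each backend
health_check_timeout = 30        # raise this if checks time out under load spikes
health_check_max_retries = 3     # failed attempts tolerated before degeneration
health_check_retry_delay = 5     # seconds to wait between retries
```

With settings like these, the worst case before a node is declared down is roughly health_check_timeout * (1 + health_check_max_retries) + health_check_retry_delay * health_check_max_retries, i.e. about 30 * 4 + 5 * 3 = 135 seconds here, so a brief load spike no longer triggers an immediate failover.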
Note 0001625 (tho, 2017-08-04 14:06):

You are correct: node 0 was indeed the wrong one; it should have been node 1. Besides, I've set [health_check_max_retries] = 3. My guess is that the backend didn't respond due to a load spike; network connectivity seems fine after some days of monitoring.
Note 0001631 (t-ishii, 2017-08-09 19:20):

Can we close this issue?
Note 0001632 (tho, 2017-08-09 19:39):

Sure, go ahead.

Issue History
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-08-02 18:00 | tho | New Issue | |
| 2017-08-02 18:00 | tho | Tag Attached: failover | |
| 2017-08-02 18:00 | tho | Tag Attached: watchdog | |
| 2017-08-03 15:48 | t-ishii | Note Added: 0001620 | |
| 2017-08-04 13:53 | t-ishii | Assigned To | => t-ishii |
| 2017-08-04 13:53 | t-ishii | Status | new => feedback |
| 2017-08-04 14:06 | tho | Note Added: 0001625 | |
| 2017-08-04 14:06 | tho | Status | feedback => assigned |
| 2017-08-09 19:20 | t-ishii | Note Added: 0001631 | |
| 2017-08-09 19:39 | tho | Note Added: 0001632 | |
| 2017-08-09 21:27 | t-ishii | Status | assigned => closed |