View Issue Details

ID: 0000325
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2017-08-09 21:27
Reporter: tho
Assigned To: t-ishii
Priority: urgent
Severity: major
Reproducibility: random
Status: closed
Resolution: open
Platform: VM
OS: Fedora Server
OS Version: 25
Product Version: 3.6.4
Summary: 0000325: pool_check_fd call failed with an error "Interrupted system call"
Description: We're encountering the exact same problem that was mentioned in issue 307.

It is not predictable when it happens, but it could be related to high load.

In the unfortunate event of two consecutive failures, pgpool will trigger a failover, which leads to a split-brain condition:

Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [46-1] 2017-07-31 13:40:24: pid 14418: WARNING: health check timed out while waiting for reading data
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-1] 2017-07-31 13:40:24: pid 14418: ERROR: unable to read data from DB node 0
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-2] 2017-07-31 13:40:24: pid 14418: DETAIL: pool_check_fd call failed with an error "Interrupted system call"
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [48-1] 2017-07-31 13:40:44: pid 14418: WARNING: health check timed out while waiting for reading data
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [49-1] 2017-07-31 13:40:44: pid 14418: ERROR: unable to read data from DB node 0
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [49-2] 2017-07-31 13:40:44: pid 14418: DETAIL: pool_check_fd call failed with an error "Interrupted system call"
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [50-1] 2017-07-31 13:40:44: pid 14418: LOG: setting backend node 1 status to NODE DOWN
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [51-1] 2017-07-31 13:40:44: pid 14418: LOG: received degenerate backend request for node_id: 1 from pid [14418]
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [40-1] 2017-07-31 13:40:44: pid 14419: LOG: new IPC connection received
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [41-1] 2017-07-31 13:40:44: pid 14419: LOG: failover request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx"
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [41-2] 2017-07-31 13:40:44: pid 14419: DETAIL: waiting for the reply...
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [42-1] 2017-07-31 13:40:49: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" is not replying..
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [42-2] 2017-07-31 13:40:49: pid 14419: DETAIL: marking the node as lost
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [43-1] 2017-07-31 13:40:49: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" is lost
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [44-1] 2017-07-31 13:40:49: pid 14419: LOG: watchdog cluster has lost the coordinator node
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [45-1] 2017-07-31 13:40:49: pid 14419: LOG: unassigning the remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" from watchdog cluster master
Jul 31 13:40:49 db1.xxxxxxx.xx pgpool[14418]: [52-1] 2017-07-31 13:40:49: pid 14418: LOG: degenerate backend request for 1 node(s) from pid [14418] is canceled by other pgpool
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [46-1] 2017-07-31 13:40:49: pid 14419: LOG: We have lost the cluster master node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx"
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [47-1] 2017-07-31 13:40:49: pid 14419: LOG: watchdog node state changed from [STANDBY] to [JOINING]
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [48-1] 2017-07-31 13:40:49: pid 14419: LOG: new outbond connection to db2.xxxxxxx.xx:9000
Jul 31 13:40:53 db1.xxxxxxx.xx atchdog[14419]: [49-1] 2017-07-31 13:40:53: pid 14419: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [50-1] 2017-07-31 13:40:54: pid 14419: LOG: I am the only alive node in the watchdog cluster
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [50-2] 2017-07-31 13:40:54: pid 14419: HINT: skiping stand for coordinator state
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [51-1] 2017-07-31 13:40:54: pid 14419: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [52-1] 2017-07-31 13:40:54: pid 14419: LOG: I am announcing my self as master/coordinator watchdog node
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [53-1] 2017-07-31 13:40:54: pid 14419: LOG: setting the remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" as watchdog cluster master
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [54-1] 2017-07-31 13:40:54: pid 14419: LOG: new watchdog node connection is received from "xxx.xxx.xxx.xxx:30408"
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [55-1] 2017-07-31 13:40:54: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" decided it is the true master
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [55-2] 2017-07-31 13:40:54: pid 14419: DETAIL: re-initializing the local watchdog cluster state because of split-brain
Tags: failover, watchdog

Activities

t-ishii

2017-08-03 15:48

developer   ~0001620

It seems the backend just did not respond within [health_check_timeout] seconds, possibly due to high load. You could make [health_check_timeout] longer than it is now and/or increase [health_check_max_retries].
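
For reference, a minimal pgpool.conf sketch of the health check settings involved (the values below are illustrative only, not recommendations; tune them to your environment and load):

# pgpool.conf -- health check tuning (example values)
health_check_period = 30        # seconds between health checks of each backend
health_check_timeout = 40       # per-attempt timeout; raise it if backends respond slowly under load
health_check_max_retries = 3    # retries before the node is considered down
health_check_retry_delay = 5    # seconds to wait between retries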

BTW,

Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-1] 2017-07-31 13:40:24: pid 14418: ERROR: unable to read data from DB node 0

this complains that Pgpool-II cannot read data from backend 0. That node id could be wrong: the backend could be one other than 0, because of a Pgpool-II bug. I committed the fix for this today:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=8f00811ec5d0d941158c47d6c9e982bd2d43712c

tho

2017-08-04 14:06

reporter   ~0001625

You are correct: node 0 was indeed the wrong one; it should have been node 1.

In addition, I've set [health_check_max_retries] = 3. My guess is that the backend didn't respond due to a load spike; network connectivity seems to be fine after some days of monitoring.

t-ishii

2017-08-09 19:20

developer   ~0001631

Can we close this issue?

tho

2017-08-09 19:39

reporter   ~0001632

Sure, go ahead.

Issue History

Date Modified Username Field Change
2017-08-02 18:00 tho New Issue
2017-08-02 18:00 tho Tag Attached: failover
2017-08-02 18:00 tho Tag Attached: watchdog
2017-08-03 15:48 t-ishii Note Added: 0001620
2017-08-04 13:53 t-ishii Assigned To => t-ishii
2017-08-04 13:53 t-ishii Status new => feedback
2017-08-04 14:06 tho Note Added: 0001625
2017-08-04 14:06 tho Status feedback => assigned
2017-08-09 19:20 t-ishii Note Added: 0001631
2017-08-09 19:39 tho Note Added: 0001632
2017-08-09 21:27 t-ishii Status assigned => closed