View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000325 | Pgpool-II | Bug | public | 2017-08-02 18:00 | 2017-08-09 21:27 |
| Reporter | tho | Assigned To | t-ishii | ||
| Priority | urgent | Severity | major | Reproducibility | random |
| Status | closed | Resolution | open | ||
| Platform | VM | OS | Fedora Server | OS Version | 25 |
| Product Version | 3.6.4 | ||||
| Summary | 0000325: pool_check_fd call failed with an error "Interrupted system call" | ||||
| Tags | failover, watchdog | ||||

Description

We're encountering the exact same problem that was mentioned in issue 307. It is not predictable when it happens, but it could be related to high load. In the unfortunate event of two consecutive failures, pgpool triggers a failover, which leads to a split-brain condition:

```
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [46-1] 2017-07-31 13:40:24: pid 14418: WARNING: health check timed out while waiting for reading data
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-1] 2017-07-31 13:40:24: pid 14418: ERROR: unable to read data from DB node 0
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-2] 2017-07-31 13:40:24: pid 14418: DETAIL: pool_check_fd call failed with an error "Interrupted system call"
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [48-1] 2017-07-31 13:40:44: pid 14418: WARNING: health check timed out while waiting for reading data
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [49-1] 2017-07-31 13:40:44: pid 14418: ERROR: unable to read data from DB node 0
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [49-2] 2017-07-31 13:40:44: pid 14418: DETAIL: pool_check_fd call failed with an error "Interrupted system call"
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [50-1] 2017-07-31 13:40:44: pid 14418: LOG: setting backend node 1 status to NODE DOWN
Jul 31 13:40:44 db1.xxxxxxx.xx pgpool[14418]: [51-1] 2017-07-31 13:40:44: pid 14418: LOG: received degenerate backend request for node_id: 1 from pid [14418]
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [40-1] 2017-07-31 13:40:44: pid 14419: LOG: new IPC connection received
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [41-1] 2017-07-31 13:40:44: pid 14419: LOG: failover request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx"
Jul 31 13:40:44 db1.xxxxxxx.xx atchdog[14419]: [41-2] 2017-07-31 13:40:44: pid 14419: DETAIL: waiting for the reply...
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [42-1] 2017-07-31 13:40:49: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" is not replying..
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [42-2] 2017-07-31 13:40:49: pid 14419: DETAIL: marking the node as lost
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [43-1] 2017-07-31 13:40:49: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" is lost
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [44-1] 2017-07-31 13:40:49: pid 14419: LOG: watchdog cluster has lost the coordinator node
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [45-1] 2017-07-31 13:40:49: pid 14419: LOG: unassigning the remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" from watchdog cluster master
Jul 31 13:40:49 db1.xxxxxxx.xx pgpool[14418]: [52-1] 2017-07-31 13:40:49: pid 14418: LOG: degenerate backend request for 1 node(s) from pid [14418] is canceled by other pgpool
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [46-1] 2017-07-31 13:40:49: pid 14419: LOG: We have lost the cluster master node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx"
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [47-1] 2017-07-31 13:40:49: pid 14419: LOG: watchdog node state changed from [STANDBY] to [JOINING]
Jul 31 13:40:49 db1.xxxxxxx.xx atchdog[14419]: [48-1] 2017-07-31 13:40:49: pid 14419: LOG: new outbond connection to db2.xxxxxxx.xx:9000
Jul 31 13:40:53 db1.xxxxxxx.xx atchdog[14419]: [49-1] 2017-07-31 13:40:53: pid 14419: LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [50-1] 2017-07-31 13:40:54: pid 14419: LOG: I am the only alive node in the watchdog cluster
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [50-2] 2017-07-31 13:40:54: pid 14419: HINT: skiping stand for coordinator state
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [51-1] 2017-07-31 13:40:54: pid 14419: LOG: watchdog node state changed from [INITIALIZING] to [MASTER]
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [52-1] 2017-07-31 13:40:54: pid 14419: LOG: I am announcing my self as master/coordinator watchdog node
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [53-1] 2017-07-31 13:40:54: pid 14419: LOG: setting the remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" as watchdog cluster master
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [54-1] 2017-07-31 13:40:54: pid 14419: LOG: new watchdog node connection is received from "xxx.xxx.xxx.xxx:30408"
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [55-1] 2017-07-31 13:40:54: pid 14419: LOG: remote node "db2.xxxxxxx.xx:9999 Linux db2.xxxxxxx.xx" decided it is the true master
Jul 31 13:40:54 db1.xxxxxxx.xx atchdog[14419]: [55-2] 2017-07-31 13:40:54: pid 14419: DETAIL: re-initializing the local watchdog cluster state because of split-brain
```
Notes

Note 0001620 (t-ishii, 2017-08-03 15:48):

It seems the backend just did not respond within [health_check_timeout] seconds, possibly due to high load. You could make [health_check_timeout] longer than it is now and/or increase [health_check_max_retries].

BTW, this line:

```
Jul 31 13:40:24 db1.xxxxxxx.xx pgpool[14418]: [47-1] 2017-07-31 13:40:24: pid 14418: ERROR: unable to read data from DB node 0
```

complains that Pgpool-II cannot read data from backend 0. This could be wrong: the failing backend could be one other than 0, because of a Pgpool-II bug. I committed the fix for this today. https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=8f00811ec5d0d941158c47d6c9e982bd2d43712c
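To make that advice concrete, the pgpool.conf parameters involved look like the fragment below. The values are illustrative assumptions, not settings confirmed in this report; the thread only confirms that the reporter uses health_check_max_retries = 3.

```
# pgpool.conf health-check tuning (illustrative values, not recommendations)
health_check_period = 10         # seconds between health checks of each backend
health_check_timeout = 30        # raise this if checks time out under load spikes
health_check_max_retries = 3     # failed attempts tolerated before degeneration
health_check_retry_delay = 5     # seconds to wait between retries
```

With settings like these, the worst case before a node is declared down is roughly health_check_timeout * (1 + health_check_max_retries) + health_check_retry_delay * health_check_max_retries, i.e. about 30 * 4 + 5 * 3 = 135 seconds here, so a brief load spike no longer triggers an immediate failover.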
Note 0001625 (tho, 2017-08-04 14:06):

You are correct: node 0 was indeed the wrong one; it should have been node 1. Besides, I've set [health_check_max_retries] = 3. My guess is that the backend didn't respond due to a load spike; network connectivity seems fine after some days of monitoring.
Note 0001631 (t-ishii, 2017-08-09 19:20):

Can we close this issue?
Note 0001632 (tho, 2017-08-09 19:39):

Sure, go ahead.

Issue History
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-08-02 18:00 | tho | New Issue | |
| 2017-08-02 18:00 | tho | Tag Attached: failover | |
| 2017-08-02 18:00 | tho | Tag Attached: watchdog | |
| 2017-08-03 15:48 | t-ishii | Note Added: 0001620 | |
| 2017-08-04 13:53 | t-ishii | Assigned To | => t-ishii |
| 2017-08-04 13:53 | t-ishii | Status | new => feedback |
| 2017-08-04 14:06 | tho | Note Added: 0001625 | |
| 2017-08-04 14:06 | tho | Status | feedback => assigned |
| 2017-08-09 19:20 | t-ishii | Note Added: 0001631 | |
| 2017-08-09 19:39 | tho | Note Added: 0001632 | |
| 2017-08-09 21:27 | t-ishii | Status | assigned => closed |