0000322: PGPool child processes go into <defunct> state when a backend node is unresponsive

ID	Project	Category	View Status	Date Submitted	Last Update

0000322	Pgpool-II	General	public	2017-07-18 16:59	2018-06-22 15:29

Reporter	sahana shetty	Assigned To	Muhammad Usama
Priority	high	Severity	major	Reproducibility	unable to reproduce
Status	closed	Resolution	open
Product Version	3.4.6

Summary	0000322: PGPool child processes go into <defunct> state when a backend node is unresponsive
Description	I have a master/slave pgpool (3.4.6) setup with watchdog running atop 3 PostgreSQL (9.5.3) nodes configured in streaming replication mode. When one of the slave backend nodes became unresponsive (the machine itself hung: ssh and telnet to postgres port 5432 both hung), child processes on both pgpool nodes went into <defunct> state, one with every new connection request. Neither pgpool node executed a failover. Once all child processes were exhausted this way, pgpool could no longer serve new connections, and new connection attempts hung.

Below are the pgpool logs from the pgpool nodes (identical logs on both nodes):
================================================================
pid 26104: LOG: new connection received
pid 26104: DETAIL: connecting host=<ip> port=54940
pid 26104: LOG: received degenerate backend request for node_id: 2 from pid [26104]
pid 14439: LOG: new connection received
pid 14439: DETAIL: connecting host=<ip> port=55007
pid 14439: LOG: received degenerate backend request for node_id: 2 from pid [14439]
pid 25644: LOG: new connection received
pid 25644: DETAIL: connecting host=<ip> port=55094
pid 25644: LOG: received degenerate backend request for node_id: 2 from pid [25644]
pid 34180: LOG: new connection received
pid 34180: DETAIL: connecting host=<ip> port=55235
pid 26615: LOG: new connection received
pid 26615: DETAIL: connecting host=<ip> port=55236
pid 34180: LOG: received degenerate backend request for node_id: 2 from pid [34180]
pid 26615: LOG: received degenerate backend request for node_id: 2 from pid [26615]

I would normally expect "received degenerate backend request for node_id" messages only on the pgpool node that does not run the failover. In this case, however, neither pgpool node ran the failover, and both logged similar messages.

Is this a pgpool bug, or are there pgpool configuration parameters that I must tune to handle hung backend nodes?
Additional Information	Some of my pgpool config params
=================================
num_init_children = 80
max_pool = 1
child_life_time = 300
child_max_connections = 0
connection_life_time = 90
client_idle_limit = 0
connect_timeout = 10000
Tags	backend, defunct pgpool processes, failover, watchdog

sahana shetty 2017-07-18 22:50 reporter ~0001582	This issue is reproducible. It is seen whenever a backend node is unresponsive.

Steps to reproduce:
1. SIGSTOP a backend node.
2. Connect to the pgpool nodes. With every new connection request, a child process goes into defunct state and logs the following:

LOG: received degenerate backend request for node_id: 2 from pid
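Step 1 of the reproduction can be scripted. The sketch below is only an illustration and is not part of pgpool or PostgreSQL: `frozen` is a hypothetical helper name, and it assumes a POSIX system where the caller is allowed to signal the target process (for example the postmaster, whose PID you would have to look up yourself).

```python
import contextlib
import os
import signal


@contextlib.contextmanager
def frozen(pid):
    """Suspend process `pid` with SIGSTOP and resume it with SIGCONT on exit.

    Hypothetical helper for reproducing a hung backend; POSIX only.
    """
    os.kill(pid, signal.SIGSTOP)  # the process stops but keeps its sockets open
    try:
        yield
    finally:
        os.kill(pid, signal.SIGCONT)  # always resume, even if the body raises
```

Used as `with frozen(postmaster_pid): ...`, the body would then open connections through pgpool while the backend is suspended.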

t-ishii 2017-07-19 14:22 developer ~0001586	> 1. SIGSTOP a backend node What do you mean by this? Sending a signal to postmaster?

sahana shetty 2017-07-19 15:56 reporter ~0001591	Yes, it signals the postmaster. The process is stopped and can be resumed later.

sahana shetty 2017-07-21 16:44 reporter ~0001597	Any updates on this? This is critical for us to go into production.

t-ishii 2017-07-21 17:11 developer ~0001598	I'm looking for someone who has time to take care of this.

sahana shetty 2017-07-24 16:47 reporter ~0001602	As a workaround, I tried triggering a manual failover from shell scripts when we detect that a server is hung. But the manual failover also takes effect only after the hung server resumes operation. I have run out of ideas for handling this. It appears that the read/write operation on the socket fails. I'm guessing the select system call in the "connect_inet_domain_socket_by_port" flow should have caught this even before the read/write operation. Any suggestions are appreciated.
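To illustrate the distinction drawn above between the connect step and the later read/write step, here is a minimal sketch in Python. It is not pgpool's `connect_inet_domain_socket_by_port`; `probe_backend` and its return strings are hypothetical names chosen for this example. It relies on the fact that when only the server process is stopped, the kernel can still complete the TCP handshake for a listening socket, so a select-guarded connect succeeds and only a deadline on the subsequent read reveals the hang.

```python
import errno
import select
import socket

# PostgreSQL SSLRequest packet: int32 length 8, int32 code 80877103.
# A responsive server normally answers with a single byte, 'S' or 'N'.
SSL_REQUEST = b"\x00\x00\x00\x08\x04\xd2\x16\x2f"


def probe_backend(host, port, timeout_s):
    """Return 'ok', 'closed', 'connect_failed' or 'read_timeout'.

    Hypothetical helper for illustration only; not pgpool code.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # Non-blocking connect: returns at once, usually with EINPROGRESS.
        err = sock.connect_ex((host, port))
        if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
            return "connect_failed"
        # The socket becomes writable once the connect attempt has finished.
        _, writable, _ = select.select([], [sock], [], timeout_s)
        if not writable:
            return "connect_failed"
        if sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) != 0:
            return "connect_failed"
        # Ask for a reply and wait for it with a deadline instead of blocking.
        sock.send(SSL_REQUEST)
        readable, _, _ = select.select([sock], [], [], timeout_s)
        if not readable:
            return "read_timeout"
        return "ok" if sock.recv(1) else "closed"
    except OSError:
        return "connect_failed"
    finally:
        sock.close()
```

A monitoring script could call `probe_backend(host, 5432, 5.0)` and treat `read_timeout` as a hung backend. Whether that is enough to make a manual failover take effect in this setup is not established by this thread.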

t-ishii 2018-03-22 16:11 developer ~0001977	Sorry for the long absence. I have a theory: the Pgpool-II main process keeps trying to find a new primary node until search_primary_node_timeout expires. Since the default for that parameter is 300 seconds, everything should go back to normal after 5 minutes have passed. If my theory is correct, you will also see something like "find_primary_node_repeatedly: waiting for finding a primary node" in your log file. If this is the case, you could shorten the parameter so that access to Pgpool-II recovers more quickly.
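One way to check the theory above is to scan the pgpool log for the message it predicts. The following is a rough sketch; `count_primary_search_waits` is a hypothetical helper, and the marker string is taken from the comment above, which itself only says the log will contain "something like" it, so the exact wording in a given Pgpool-II version is an assumption.

```python
def count_primary_search_waits(log_lines, marker="find_primary_node_repeatedly"):
    """Count log lines that mention the primary-node search loop.

    Hypothetical helper; `marker` is assumed from the developer's comment.
    """
    return sum(1 for line in log_lines if marker in line)


sample = [
    "pid 26104: LOG: new connection received",
    "pid 100: LOG: find_primary_node_repeatedly: waiting for finding a primary node",
]
print(count_primary_search_waits(sample))
```

A non-zero count during the outage would support the theory, and lowering search_primary_node_timeout would then be the setting to try.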

t-ishii 2018-05-18 17:28 developer ~0002026	No response for over a month. I am going to close the issue unless there is an objection.

t-ishii 2018-06-22 15:29 developer ~0002068	Issue closed.

Date Modified	Username	Field	Change
2017-07-18 16:59	sahana shetty	New Issue
2017-07-18 16:59	sahana shetty	Tag Attached: backend
2017-07-18 16:59	sahana shetty	Tag Attached: failover
2017-07-18 16:59	sahana shetty	Tag Attached: watchdog
2017-07-18 16:59	sahana shetty	Tag Attached: defunct pgpool processes
2017-07-18 22:50	sahana shetty	Note Added: 0001582
2017-07-19 14:22	t-ishii	Note Added: 0001586
2017-07-19 14:25	t-ishii	Status	new => feedback
2017-07-19 15:56	sahana shetty	Note Added: 0001591
2017-07-19 15:56	sahana shetty	Status	feedback => new
2017-07-21 16:44	sahana shetty	Note Added: 0001597
2017-07-21 17:11	t-ishii	Note Added: 0001598
2017-07-24 16:47	sahana shetty	Note Added: 0001602
2017-07-26 10:07	t-ishii	Assigned To	=> Muhammad Usama
2017-07-26 10:07	t-ishii	Status	new => assigned
2018-03-22 16:11	t-ishii	Note Added: 0001977
2018-03-22 16:12	t-ishii	Status	assigned => feedback
2018-05-18 17:28	t-ishii	Note Added: 0002026
2018-06-22 15:29	t-ishii	Note Added: 0002068
2018-06-22 15:29	t-ishii	Status	feedback => closed
