View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000322 | Pgpool-II | General | public | 2017-07-18 16:59 | 2018-06-22 15:29 |
| Reporter | sahana shetty | Assigned To | Muhammad Usama | ||
| Priority | high | Severity | major | Reproducibility | unable to reproduce |
| Status | closed | Resolution | open | ||
| Product Version | 3.4.6 | ||||
| Summary | 0000322: PGPool child processes go into <defunct> state when a backend node is unresponsive | ||||
| Description | I have a master/slave pgpool (3.4.6) setup with watchdog running atop 3 PostgreSQL (9.5.3) nodes configured in streaming replication mode. When one of the slave backend nodes became unresponsive (the machine itself became unresponsive: ssh and telnet to postgres port 5432 hung), the child processes of both pgpool nodes went into the <defunct> state with every new connection request. Neither of the pgpool nodes executed a failover. When all child processes were exhausted this way, pgpool refused any new connections and new connection attempts hung. Below are the pgpool logs from the pgpool nodes (identical logs on both nodes) ================================================================ pid 26104: LOG: new connection received pid 26104: DETAIL: connecting host=<ip> port=54940 pid 26104: LOG: received degenerate backend request for node_id: 2 from pid [26104] pid 14439: LOG: new connection received pid 14439: DETAIL: connecting host=<ip> port=55007 pid 14439: LOG: received degenerate backend request for node_id: 2 from pid [14439] pid 25644: LOG: new connection received pid 25644: DETAIL: connecting host=<ip> port=55094 pid 25644: LOG: received degenerate backend request for node_id: 2 from pid [25644] pid 34180: LOG: new connection received pid 34180: DETAIL: connecting host=<ip> port=55235 pid 26615: LOG: new connection received pid 26615: DETAIL: connecting host=<ip> port=55236 pid 34180: LOG: received degenerate backend request for node_id: 2 from pid [34180] pid 26615: LOG: received degenerate backend request for node_id: 2 from pid [26615] ================================================================ I would normally expect "received degenerate backend request for node_id" messages on the pgpool node that does not run the failover. But in this case neither pgpool node ran the failover, and both had similar log messages. Is this a pgpool bug, or are there pgpool configuration parameters that I must tune to handle hung backend nodes? | ||||
| Additional Information | Some of my pgpool config params: num_init_children = 80; max_pool = 1; child_life_time = 300; child_max_connections = 0; connection_life_time = 90; client_idle_limit = 0; connect_timeout = 10000 | ||||
| Tags | backend, defunct pgpool processes, failover, watchdog | ||||
|
|
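The log excerpt in the description can be summarised per backend node to see how many child processes reported the failure. The following is a minimal sketch, not part of Pgpool-II: `count_degenerate_requests` is a hypothetical helper, and the regular expression assumes the exact message wording quoted in the report.

```python
import re

# Matches the message quoted in the report, for example:
# "pid 26104: LOG: received degenerate backend request for node_id: 2 from pid [26104]"
DEGENERATE_RE = re.compile(
    r"received degenerate backend request for node_id: (\d+) from pid \[(\d+)\]"
)


def count_degenerate_requests(log_text):
    """Return {node_id: set of child pids that asked to detach that node}."""
    requests = {}
    for node_id, pid in DEGENERATE_RE.findall(log_text):
        requests.setdefault(int(node_id), set()).add(int(pid))
    return requests


if __name__ == "__main__":
    sample = (
        "pid 26104: LOG: received degenerate backend request for node_id: 2 from pid [26104] "
        "pid 14439: LOG: received degenerate backend request for node_id: 2 from pid [14439]"
    )
    for node_id, pids in count_degenerate_requests(sample).items():
        print(f"node {node_id}: {len(pids)} child processes requested detach")
```

If the number of distinct pids for one node approaches `num_init_children` (80 in this report), the pool of child processes is probably close to exhausted.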
This issue is reproducible. It is seen when the backend node is unresponsive. Steps to reproduce: 1. SIGSTOP a backend node. 2. Connect to the pgpool nodes. Each child process goes into the defunct state with every new connection request, logging the message below: LOG: received degenerate backend request for node_id: 2 from pid |
|
|
> 1. SIGSTOP a backend node

What do you mean by this? Sending a signal to postmaster? |
|
|
Yes, it sends the signal to the postmaster. The process is stopped and can be resumed later. |
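The stop-and-resume step can be sketched as follows. This is an illustration under assumptions: `pause_process` and `resume_process` are hypothetical helper names, and the demo uses a throwaway child process as a stand-in for the postmaster. On a test system the PID would be the postmaster's (for example, the first line of `postmaster.pid` in the data directory).

```python
import os
import signal
import subprocess
import sys


def pause_process(pid):
    """Suspend a process with SIGSTOP. It keeps its sockets open but stops running."""
    os.kill(pid, signal.SIGSTOP)


def resume_process(pid):
    """Let a previously stopped process continue with SIGCONT."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in for the postmaster: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    try:
        pause_process(child.pid)
        # WUNTRACED makes waitpid report a stopped (not exited) child.
        _, status = os.waitpid(child.pid, os.WUNTRACED)
        print("child stopped:", os.WIFSTOPPED(status))
        resume_process(child.pid)
    finally:
        child.kill()
        child.wait()
```

SIGSTOP cannot be caught or ignored by the target, which is why it is a convenient way to simulate a hung server without taking the machine down.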
|
|
Any updates on this? This is critical for us to go into production. |
|
|
I'm looking for someone who has time to take care of this. |
|
|
As a workaround, I tried triggering a manual failover from shell scripts when we detect that a server is hung. But the manual failover also takes effect only after the hung server resumes operation. I have run out of ideas for handling this. It appears that the read/write operation on the socket fails. I'm guessing the select system call in the "connect_inet_domain_socket_by_port" flow should have handled this even before the read/write operation. Any suggestions are appreciated. |
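One possible reason a connect-level check does not catch a stopped postmaster: the kernel typically completes the TCP handshake for a listening socket even while the owning process is stopped, so only the subsequent read blocks. The sketch below illustrates that distinction from the outside. It is not Pgpool-II code; `probe_backend` and its return labels are hypothetical, and it relies on the PostgreSQL SSLRequest packet, to which a running server replies with a single byte.

```python
import socket
import struct

# PostgreSQL SSLRequest: int32 length (8) followed by int32 request code 80877103.
# A running server answers with one byte, b"S" or b"N".
SSL_REQUEST = struct.pack("!ii", 8, 80877103)


def probe_backend(host, port, connect_timeout=2.0, read_timeout=2.0):
    """Classify a backend as 'unreachable', 'hung' or 'alive'."""
    try:
        sock = socket.create_connection((host, port), timeout=connect_timeout)
    except OSError:
        # Refused, timed out, or no route: the connect step itself failed.
        return "unreachable"
    with sock:
        sock.settimeout(read_timeout)
        try:
            sock.sendall(SSL_REQUEST)
            reply = sock.recv(1)
        except socket.timeout:
            # Connected, but nobody answered within read_timeout.
            return "hung"
        except OSError:
            return "unreachable"
    # An empty read means the peer closed the connection without answering.
    return "alive" if reply else "hung"


if __name__ == "__main__":
    # A listener that never calls accept() stands in for a stopped postmaster.
    listener = socket.socket()
    listener.bind(("127.0.0.1", 0))
    listener.listen(5)
    print(probe_backend("127.0.0.1", listener.getsockname()[1], read_timeout=0.5))
    listener.close()
```

A monitoring script built this way needs its own read timeout, because the connect step alone may succeed against a stopped server.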
|
|
Sorry for the long absence. I have a theory: the Pgpool-II main process keeps trying to find a new primary node until search_primary_node_timeout expires. Since the parameter defaults to 300 seconds, everything should recover once 5 minutes have passed. Also, if my theory is correct, you will see something like "find_primary_node_repeatedly: waiting for finding a primary node" in your log file. If this is the case, you could shorten the parameter so that access to Pgpool-II recovers more quickly. |
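The theory above can be checked with a small script that reads the configured timeout and looks for the quoted log message. This is a rough sketch: both function names are hypothetical, the parser assumes the value is written as a plain integer number of seconds, and it assumes the last occurrence in the file takes effect.

```python
import re

# Default quoted in the note above (seconds).
DEFAULT_SEARCH_PRIMARY_NODE_TIMEOUT = 300

# Log message quoted in the note above.
WAIT_MESSAGE = "find_primary_node_repeatedly: waiting for finding a primary node"

SETTING_RE = re.compile(r"search_primary_node_timeout\s*=\s*'?(\d+)'?\s*$")


def read_search_primary_node_timeout(conf_text):
    """Return search_primary_node_timeout from pgpool.conf text, or the default."""
    value = DEFAULT_SEARCH_PRIMARY_NODE_TIMEOUT
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        match = SETTING_RE.match(line)
        if match:
            value = int(match.group(1))  # assumption: the last setting wins
    return value


def is_searching_for_primary(log_text):
    """True if the log contains the 'waiting for finding a primary node' message."""
    return WAIT_MESSAGE in log_text


if __name__ == "__main__":
    conf = "num_init_children = 80\nsearch_primary_node_timeout = 30  # seconds\n"
    print("timeout:", read_search_primary_node_timeout(conf))
```

If the message is present and the timeout is still at its default, lowering `search_primary_node_timeout` in `pgpool.conf` would be the first thing to try.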
|
|
No response for over a month. I am going to close the issue unless there is an objection. |
|
|
Issue closed. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-07-18 16:59 | sahana shetty | New Issue | |
| 2017-07-18 16:59 | sahana shetty | Tag Attached: backend | |
| 2017-07-18 16:59 | sahana shetty | Tag Attached: failover | |
| 2017-07-18 16:59 | sahana shetty | Tag Attached: watchdog | |
| 2017-07-18 16:59 | sahana shetty | Tag Attached: defunct pgpool processes | |
| 2017-07-18 22:50 | sahana shetty | Note Added: 0001582 | |
| 2017-07-19 14:22 | t-ishii | Note Added: 0001586 | |
| 2017-07-19 14:25 | t-ishii | Status | new => feedback |
| 2017-07-19 15:56 | sahana shetty | Note Added: 0001591 | |
| 2017-07-19 15:56 | sahana shetty | Status | feedback => new |
| 2017-07-21 16:44 | sahana shetty | Note Added: 0001597 | |
| 2017-07-21 17:11 | t-ishii | Note Added: 0001598 | |
| 2017-07-24 16:47 | sahana shetty | Note Added: 0001602 | |
| 2017-07-26 10:07 | t-ishii | Assigned To | => Muhammad Usama |
| 2017-07-26 10:07 | t-ishii | Status | new => assigned |
| 2018-03-22 16:11 | t-ishii | Note Added: 0001977 | |
| 2018-03-22 16:12 | t-ishii | Status | assigned => feedback |
| 2018-05-18 17:28 | t-ishii | Note Added: 0002026 | |
| 2018-06-22 15:29 | t-ishii | Note Added: 0002068 | |
| 2018-06-22 15:29 | t-ishii | Status | feedback => closed |