View Issue Details

ID: 0000322
Project: Pgpool-II
Category: [All Projects] General
View Status: public
Last Update: 2018-06-22 15:29
Reporter: sahana shetty
Assigned To: Muhammad Usama
Priority: high
Severity: major
Reproducibility: unable to reproduce
Status: closed
Resolution: open
Product Version: 3.4.6
Target Version:
Fixed in Version:
Summary: 0000322: PGPool child processes go into <defunct> state when a backend node is unresponsive
Description:
I have a master/slave Pgpool-II (3.4.6) setup with watchdog, running on top of three PostgreSQL (9.5.3) nodes configured in streaming replication mode.

When one of the slave backend nodes became unresponsive (the machine hung: ssh and telnet to the postgres port 5432 both hung), the child processes on both Pgpool-II nodes went into <defunct> state with every new connection request. Neither of the Pgpool-II nodes executed a failover.

Once all child processes were exhausted this way, Pgpool-II stopped serving new connections and incoming connection attempts simply hung.

Below are the Pgpool-II logs from both nodes (the logs were identical on both):
================================================================
pid 26104: LOG: new connection received
pid 26104: DETAIL: connecting host=<ip> port=54940
pid 26104: LOG: received degenerate backend request for node_id: 2 from pid [26104]
pid 14439: LOG: new connection received
pid 14439: DETAIL: connecting host=<ip> port=55007
pid 14439: LOG: received degenerate backend request for node_id: 2 from pid [14439]
pid 25644: LOG: new connection received
pid 25644: DETAIL: connecting host=<ip> port=55094
pid 25644: LOG: received degenerate backend request for node_id: 2 from pid [25644]
pid 34180: LOG: new connection received
pid 34180: DETAIL: connecting host=<ip> port=55235
pid 26615: LOG: new connection received
pid 26615: DETAIL: connecting host=<ip> port=55236
pid 34180: LOG: received degenerate backend request for node_id: 2 from pid [34180]
pid 26615: LOG: received degenerate backend request for node_id: 2 from pid [26615]

I would normally expect "received degenerate backend request for node_id" messages only on the Pgpool-II node that does not run the failover. In this case, however, neither Pgpool-II node ran the failover, and both produced similar log messages.
Is this a Pgpool-II bug, or are there configuration parameters I must tune to handle hung backend nodes?
Additional Information:
Some of my pgpool.conf parameters
=================================
num_init_children = 80
max_pool = 1
child_life_time = 300
child_max_connections = 0
connection_life_time = 90
client_idle_limit = 0
connect_timeout = 10000
Tags: backend, defunct pgpool processes, failover, watchdog

Activities

sahana shetty

2017-07-18 22:50

reporter   ~0001582

This issue is reproducible. It occurs when a backend node is unresponsive.

Steps to reproduce:
1. Send SIGSTOP to a backend node (see the sketch after the log line below).
2. Connect to the Pgpool-II nodes. Each child process goes into <defunct> state with every new connection request, with the log line below:

LOG: received degenerate backend request for node_id: 2 from pid
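
To make step 1 concrete, here is a minimal C sketch (only an illustration, not part of Pgpool-II; the postmaster PID is passed on the command line) that pauses the backend's postmaster with SIGSTOP and later resumes it with SIGCONT:

/* pause_postmaster.c - hypothetical helper for reproduction step 1 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s <postmaster-pid> <stop|cont>\n", argv[0]);
        return 1;
    }

    pid_t pid = (pid_t) atoi(argv[1]);
    /* "stop" freezes the process; "cont" resumes it later */
    int sig = (argv[2][0] == 's') ? SIGSTOP : SIGCONT;

    if (kill(pid, sig) != 0) {
        perror("kill");
        return 1;
    }
    return 0;
}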

t-ishii

2017-07-19 14:22

developer   ~0001586

> 1. SIGSTOP a backend node
What do you mean by this? Sending a signal to the postmaster?

sahana shetty

2017-07-19 15:56

reporter   ~0001591

Yes, the signal goes to the postmaster. The process is stopped and can later be resumed.

sahana shetty

2017-07-21 16:44

reporter   ~0001597

Any updates on this? This is critical for us before we go into production.

t-ishii

2017-07-21 17:11

developer   ~0001598

I'm looking for someone who has time to take care of this.

sahana shetty

2017-07-24 16:47

reporter   ~0001602

As a workaround, I tried triggering a manual failover from shell scripts when we detect that a server is hung. But the manual failover also takes effect only after the hung server resumes operation.
I have run out of ideas for handling this.
It appears that the read/write operation on the socket fails. I'm guessing the select system call in the "connect_inet_domain_socket_by_port" flow should have handled this even before the read/write operation. (A rough sketch of that kind of select-based connect timeout follows below.)
Any suggestions are appreciated.
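
For illustration only, a generic C sketch of a connect() guarded by a select() timeout, which is the pattern I am referring to. This is not the actual Pgpool-II connect_inet_domain_socket_by_port() code; the function name, host/port handling, and timeout value are placeholders:

#include <arpa/inet.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

int connect_with_timeout(const char *host, int port, int timeout_sec)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* put the socket into non-blocking mode so connect() returns immediately */
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, host, &addr.sin_addr);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }

    /* wait until the socket is writable, but no longer than timeout_sec */
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { .tv_sec = timeout_sec, .tv_usec = 0 };

    if (select(fd + 1, NULL, &wfds, NULL, &tv) <= 0) {
        close(fd);          /* timed out (or select failed): give up on this backend */
        return -1;
    }

    /* the socket became writable; check whether the connection really succeeded */
    int err = 0;
    socklen_t len = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &len) < 0 || err != 0) {
        close(fd);
        return -1;
    }

    /* restore blocking mode before handing the socket back to the caller */
    fcntl(fd, F_SETFL, flags);
    return fd;
}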

t-ishii

2018-03-22 16:11

developer   ~0001977

Sorry for the long absence. I have a theory: the Pgpool-II main process keeps trying to find a new primary node until search_primary_node_timeout expires. Since the default for that parameter is 300 seconds, everything should recover after 5 minutes have passed.
Also, if my theory is correct, you will see something like:

"find_primary_node_repeatedly: waiting for finding a primary node"

in your log file.

If this is the case, you could shorten the parameter so that access to Pgpool-II recovers more quickly (a possible pgpool.conf change is sketched below).
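
For illustration, something like the following in pgpool.conf; the 10-second value is only an example, not a specific recommendation from this thread:

# default is 300 seconds; a shorter value makes Pgpool-II give up the
# primary-node search sooner, so client access recovers more quickly
search_primary_node_timeout = 10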

t-ishii

2018-05-18 17:28

developer   ~0002026

No response for over a month. I am going to close the issue unless there's an objection.

t-ishii

2018-06-22 15:29

developer   ~0002068

Issue closed.

Issue History

Date Modified Username Field Change
2017-07-18 16:59 sahana shetty New Issue
2017-07-18 16:59 sahana shetty Tag Attached: backend
2017-07-18 16:59 sahana shetty Tag Attached: failover
2017-07-18 16:59 sahana shetty Tag Attached: watchdog
2017-07-18 16:59 sahana shetty Tag Attached: defunct pgpool processes
2017-07-18 22:50 sahana shetty Note Added: 0001582
2017-07-19 14:22 t-ishii Note Added: 0001586
2017-07-19 14:25 t-ishii Status new => feedback
2017-07-19 15:56 sahana shetty Note Added: 0001591
2017-07-19 15:56 sahana shetty Status feedback => new
2017-07-21 16:44 sahana shetty Note Added: 0001597
2017-07-21 17:11 t-ishii Note Added: 0001598
2017-07-24 16:47 sahana shetty Note Added: 0001602
2017-07-26 10:07 t-ishii Assigned To => Muhammad Usama
2017-07-26 10:07 t-ishii Status new => assigned
2018-03-22 16:11 t-ishii Note Added: 0001977
2018-03-22 16:12 t-ishii Status assigned => feedback
2018-05-18 17:28 t-ishii Note Added: 0002026
2018-06-22 15:29 t-ishii Note Added: 0002068
2018-06-22 15:29 t-ishii Status feedback => closed