0000423: Connecting Pgpool-II during failback causes a possible hang - Pgpool-II Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update

0000423	Pgpool-II	Bug	public	2018-08-13 11:53	2018-08-16 15:37

Reporter	nagata	Assigned To	nagata
Priority	low	Severity	minor	Reproducibility	sometimes
Status	closed	Resolution	fixed
Product Version	3.6.6

Summary	0000423: Connecting Pgpool-II during failback causes a possible hang
Description	One of our clients reported that the failback command hung when psql connected to Pgpool-II from within failback_command. The client plans to use failback_command to confirm that replication is healthy by querying pg_stat_replication. The failback is triggered at the end of the online recovery executed in follow_master_command. Their environment has one primary and two standbys, so online recovery and failback are each executed twice.

My analysis of this phenomenon is as follows. At the first failback, after the first recovery, the flag indicating that all child processes need to be restarted is set, but the child processes are not restarted yet. After the second pcp_recovery_node completes, SIGUSR2 is sent to all child processes to wake them up; each child wakes up and exits immediately because the flag is set. Just after this, the second failback starts. In the failback command, psql tries to connect to Pgpool-II, but Pgpool-II cannot accept the connection because all child processes have exited, so psql hangs.

The original reporter is wondering whether this behaviour should be treated as a bug. I think we should not send any queries to Pgpool-II during failover/failback, because the backend status is in transition and not stable; still, a safety net would be nice if we can provide one. If we cannot, I think we should at least include a warning in the documentation.
Steps To Reproduce	A similar phenomenon can be reproduced by executing pcp_recovery_node in quick succession.

1. Set up a 3-node cluster with "pgpool_setup -n 3".

2. Create failback.sh:

    $ cat > /tmp/failback.sh
    #!/bin/bash
    psql -h localhost -p 11000 test -c "select 1;"

3. Configure pgpool.conf:

    failback_command = '/tmp/failback.sh'

4. Start pgpool:

    $ ./startall

5. Shut down the standbys:

    $ pg_ctl -D data1 stop
    $ pg_ctl -D data2 stop
    $ psql -h localhost -p 11000 test -c "show pool_nodes"
     node_id | hostname | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
    ---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
     0       | /tmp     | 11002 | up     | 0.333333  | primary | 0          | true              | 0                 | 2018-08-14 19:07:15
     1       | /tmp     | 11003 | down   | 0.333333  | standby | 0          | false             | 0                 | 2018-08-14 19:07:10
     2       | /tmp     | 11004 | down   | 0.333333  | standby | 0          | false             | 0                 | 2018-08-14 19:07:15
    (3 rows)

6. Run pcp_recovery_node in quick succession:

    $ pcp_recovery_node -h localhost -p 11001 -w -n1; pcp_recovery_node -h localhost -p 11001 -w -n2

7. Confirm that the child processes are zombies:

    $ ps aux | grep pgpool
    ....(snip) ...
    yugo-n 19054 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19055 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19056 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19057 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19058 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19059 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19060 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19061 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19062 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19063 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19065 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19066 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19067 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    yugo-n 19068 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct>
    ...

8. Confirm that psql hangs:

    $ ps aux | grep psql
    yugo-n 19547 0.0 0.0 125672 8732 pts/3 S 19:07 0:00 /usr/lib/postgresql/10/bin/psql -h localhost -p 11000 test -c select 1;
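Step 7 above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming `ps aux` output in the column layout shown in the report (process state in the eighth field); `count_defunct_pgpool` is a hypothetical helper name, not part of Pgpool-II.

```shell
#!/bin/sh
# Count zombie Pgpool-II children in `ps aux`-style output read from stdin.
# count_defunct_pgpool is a hypothetical helper name, not a Pgpool-II tool.
count_defunct_pgpool() {
    # Field 8 of `ps aux` is the process state; a state starting with Z
    # marks a zombie. The command column then reads "[pgpool] <defunct>".
    awk '$8 ~ /^Z/ && /\[pgpool\] <defunct>/ { n++ } END { print n + 0 }'
}

# Usage: ps aux | count_defunct_pgpool
```

A non-zero count right after the second pcp_recovery_node would suggest that the reproduction reached the state described in the report.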
Tags	No tags attached.

pengbo 2018-08-14 10:29 developer ~0002153	Thank you for reporting this issue. We will check that.

nagata 2018-08-14 18:52 developer ~0002155 Last edited: 2018-08-14 20:21	I think the basic causes of the hang are as follows.

1. Child processes that terminate during failover/failback are not respawned until the failover/failback ends.
2. Pgpool-II keeps clients waiting indefinitely until the child processes are respawned.
3. The failover/failback command has no timeout.

Do you consider any of these a bug? If so, can we fix it in a maintenance release? If these are specifications rather than bugs, and we should not send queries to Pgpool-II during failover or failback, I think it would be better to describe this in the documentation.
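Regarding point 3: whether or not Pgpool-II itself should time out, a script can bound the individual commands it runs. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed (it exits with status 124 when the limit is reached); `run_bounded` is a hypothetical helper name.

```shell
#!/bin/sh
# run_bounded SECONDS CMD [ARGS...]
# Run CMD, but give up after SECONDS so the calling script cannot hang forever.
# run_bounded is a hypothetical helper; it relies on GNU coreutils `timeout`.
run_bounded() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    rc=$?
    # GNU timeout reports an expired limit with exit status 124.
    if [ "$rc" -eq 124 ]; then
        echo "command timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# Usage inside a script: run_bounded 10 psql -h localhost -p 11000 test -c "select 1;"
```

This only limits how long the script waits on one command; it does not change how Pgpool-II handles the failover or failback itself.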

pengbo 2018-08-15 13:03 developer ~0002157	I reproduced this issue. We will discuss it within our development team.

t-ishii 2018-08-16 15:01 developer ~0002158	> 1. child processes which terminated during failover/failback are not respawned until the failover/failback ends.

Expected behavior.

> 2. Pgpool-II keeps the clients waiting all the time until child processes are respawned.

Pgpool-II does not do this, so I don't know what you mean.

> 3. failover/failback command does not have timeout.

Correct. A timeout would leave Pgpool-II and PostgreSQL in an unknown state, which would be hard to recover from.

> If these are specifications not bugs and we should not send queries to Pgpool-II during failover or failback, I think it is better to describe this in the documentation.

This seems different from what you said in the report. You said that the failover/failback script tries to connect to Pgpool-II. I would say that connecting to Pgpool-II from within a failover/failback script should be avoided. I will add this caution to the docs.

nagata 2018-08-16 15:36 developer ~0002159	>> 2. Pgpool-II keeps the clients waiting all the time until child processes are respawned.
> Pgpool-II does not do this. So I don't know what you mean

To be precise, Pgpool-II keeps clients waiting for as long as no child processes exist or all of them are zombies. No error is returned to the client either, because the parent process is still listening on the socket, as we discussed by email.

>> If these are specifications not bugs and we should not send queries to Pgpool-II during failover or failback, I think it is better to describe this in the documentation.
> This seems different from what you said in the report. You said, the failover/failback script tries to connect to Pgpool-II.

Yes, you are right. The problem I reported is that the failback script hung when psql tried to connect to Pgpool-II from within the script, and the cause lies in the behaviours I mentioned above.

> I would say, trying to connect to Pgpool-II within a failover/failback script should be avoided. I will write this caution in the docs.

Thank you for deciding to add this caution to the docs. I have confirmed that these are not bugs and will not be fixed.
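Following the conclusion above, a failback script that needs to check replication could query the primary PostgreSQL backend directly instead of going through Pgpool-II. The following is a minimal sketch under stated assumptions: `check_replication` is a hypothetical helper, the host and port arguments are whatever the script is given for the primary backend, the `postgres` database is assumed to exist and be reachable without a password prompt, and `PSQL` is an override introduced here only so the function can be exercised without a running server.

```shell
#!/bin/sh
# Hypothetical failback helper: ask the primary PostgreSQL backend directly
# how many standbys are streaming, instead of connecting to Pgpool-II.
# Arguments (assumed): $1 = primary host or socket directory, $2 = primary port.
check_replication() {
    host="$1"
    port="$2"
    # PGCONNECT_TIMEOUT is the libpq environment variable for connect_timeout,
    # so an unreachable backend does not block the script indefinitely.
    # PSQL may be overridden for testing; it defaults to psql found on PATH.
    PGCONNECT_TIMEOUT=5 "${PSQL:-psql}" -h "$host" -p "$port" -d postgres -At \
        -c "select count(*) from pg_stat_replication"
}

# Usage with the reproduction setup, where node 0 listens on /tmp:11002:
#   check_replication /tmp 11002
```

Because the query goes to the backend port rather than the Pgpool-II port (11000 in the reproduction), it does not depend on Pgpool-II child processes being available during the failback.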

Date Modified	Username	Field	Change
2018-08-13 11:53	nagata	New Issue
2018-08-14 10:29	pengbo	Note Added: 0002153
2018-08-14 18:52	nagata	Note Added: 0002155
2018-08-14 20:09	nagata	Steps to Reproduce Updated
2018-08-14 20:12	nagata	Note View State: 0002155: private
2018-08-14 20:12	nagata	Steps to Reproduce Updated
2018-08-14 20:13	nagata	Note View State: 0002155: public
2018-08-14 20:21	nagata	Note Edited: 0002155
2018-08-15 13:03	pengbo	Note Added: 0002157
2018-08-16 15:01	t-ishii	Note Added: 0002158
2018-08-16 15:36	nagata	Note Added: 0002159
2018-08-16 15:37	nagata	Assigned To	=> nagata
2018-08-16 15:37	nagata	Status	new => closed
2018-08-16 15:37	nagata	Resolution	open => fixed