View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000423 | Pgpool-II | Bug | public | 2018-08-13 11:53 | 2018-08-16 15:37 |
| Reporter | nagata | Assigned To | nagata | ||
| Priority | low | Severity | minor | Reproducibility | sometimes |
| Status | closed | Resolution | fixed | ||
| Product Version | 3.6.6 | ||||
| Summary | 0000423: Connecting Pgpool-II during failback causes a possible hang | ||||
| Description | One of our clients reported that the failback command hung when psql connected to Pgpool-II from within failback_command. The client plans to use failback_command to confirm that replication is healthy by checking pg_stat_replication. The failback is triggered at the end of the online recovery executed in follow_master_command. Their environment has one primary and two standbys, so online recovery and failback are each executed twice. My analysis of this phenomenon is as follows: At the first failback, after the first recovery, the flag indicating that all child processes need to be restarted is set, but the child processes are not restarted yet. After the second pcp_recovery_node finishes, SIGUSR2 is sent to all child processes to wake them up; each child process wakes up and exits immediately because the flag is set. Just after this, the second failback starts. In the failback command, psql tries to connect to Pgpool-II, but Pgpool-II cannot accept the connection because all child processes have exited, so psql hangs. The original reporter is wondering whether this behaviour should be treated as a bug. I think we should not send any queries to Pgpool-II during failover/failback, because the backend status is in transition and not stable; still, it would be nice to have a safety net. Even if we cannot provide one, I think we should include a warning in the documentation. | ||||
| Steps To Reproduce | A similar phenomenon can be reproduced by executing pcp_recovery_node twice in quick succession. 1. Set up a 3-node cluster with "pgpool_setup -n 3" 2. Create failback.sh $ cat > /tmp/failback.sh #!/bin/bash psql -h localhost -p 11000 test -c "select 1;" 3. Configure pgpool.conf failback_command = '/tmp/failback.sh' 4. Start pgpool $ ./startall 5. Shut down the standbys $ pg_ctl -D data1 stop $ pg_ctl -D data2 stop $ psql -h localhost -p 11000 test -c "show pool_nodes" node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay | last_status_change ---------+----------+-------+--------+-----------+---------+------------+-------------------+-------------------+--------------------- 0 | /tmp | 11002 | up | 0.333333 | primary | 0 | true | 0 | 2018-08-14 19:07:15 1 | /tmp | 11003 | down | 0.333333 | standby | 0 | false | 0 | 2018-08-14 19:07:10 2 | /tmp | 11004 | down | 0.333333 | standby | 0 | false | 0 | 2018-08-14 19:07:15 (3 rows) 6. Run pcp_recovery_node twice in quick succession $ pcp_recovery_node -h localhost -p 11001 -w -n1; pcp_recovery_node -h localhost -p 11001 -w -n2 7. Confirm that the child processes are zombies $ ps aux | grep pgpool ....(snip) ... yugo-n 19054 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19055 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19056 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19057 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19058 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19059 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19060 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19061 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19062 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19063 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19065 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19066 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19067 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> yugo-n 19068 0.0 0.0 0 0 pts/3 Z 19:06 0:00 [pgpool] <defunct> ... 8. Confirm that psql hangs $ ps aux | grep psql yugo-n 19547 0.0 0.0 125672 8732 pts/3 S 19:07 0:00 /usr/lib/postgresql/10/bin/psql -h localhost -p 11000 test -c select 1; | ||||
| Tags | No tags attached. | ||||
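
Step 7 of the reproduction steps checks for zombie children by reading `ps aux` output by eye. As a small convenience, that check could be scripted roughly as follows. This is only a sketch: the function name `count_defunct_pgpool` is made up for illustration, and it assumes the default `ps aux` column layout (STAT in the 8th field) seen in the report.

```shell
#!/bin/bash
# Count defunct (zombie) pgpool processes in `ps aux`-style input on stdin.
# Assumption: STAT is the 8th whitespace-separated field, as in the
# `ps aux` output quoted in the report. The function name is hypothetical.
count_defunct_pgpool() {
    awk '$8 ~ /^Z/ && /\[pgpool\] <defunct>/ { n++ } END { print n + 0 }'
}
```

Typical use would be `ps aux | count_defunct_pgpool`; a non-zero count right after the second pcp_recovery_node suggests the state described in this report.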
|
|
Thank you for reporting this issue. We will check that. |
|
|
I think the basic causes of the hang are as follows.

1. Child processes which terminated during failover/failback are not respawned until the failover/failback ends.
2. Pgpool-II keeps the clients waiting all the time until child processes are respawned.
3. The failover/failback command does not have a timeout.

Do you consider any of these a bug? If so, can we fix it in a maintenance release? If these are specifications, not bugs, and we should not send queries to Pgpool-II during failover or failback, I think it is better to describe this in the documentation. |
|
|
I reproduced this issue. We will discuss it within our development team. |
|
|
> 1. Child processes which terminated during failover/failback are not respawned until the failover/failback ends.

Expected behavior.

> 2. Pgpool-II keeps the clients waiting all the time until child processes are respawned.

Pgpool-II does not do this, so I don't know what you mean.

> 3. The failover/failback command does not have a timeout.

Correct, because a timeout would leave Pgpool-II and PostgreSQL in an unknown state, which would be hard to recover from.

> If these are specifications, not bugs, and we should not send queries to Pgpool-II during failover or failback, I think it is better to describe this in the documentation.

This seems different from what you said in the report. You said that the failover/failback script tries to connect to Pgpool-II. I would say that trying to connect to Pgpool-II within a failover/failback script should be avoided. I will write this caution in the docs. |
|
|
>> 2. Pgpool-II keeps the clients waiting all the time until child processes are respawned.
> Pgpool-II does not do this, so I don't know what you mean.

To be precise, Pgpool-II keeps the clients waiting for as long as all child processes are gone or are zombies. No error is returned to the client either, because the parent process is still listening on the socket, as we discussed by email.

>> If these are specifications, not bugs, and we should not send queries to Pgpool-II during failover or failback, I think it is better to describe this in the documentation.
> This seems different from what you said in the report. You said that the failover/failback script tries to connect to Pgpool-II.

Yes, you are right. The problem I reported is that the failback script hung when psql tried to connect to Pgpool-II within the script, and the cause lies in the behaviours I mentioned above.

> I would say that trying to connect to Pgpool-II within a failover/failback script should be avoided. I will write this caution in the docs.

Thank you for your decision to add this caution to the docs. I confirmed that these are not bugs and will not be fixed. |
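
The caution agreed on in this thread (do not connect to Pgpool-II from within a failover/failback script) leaves open how to run the pg_stat_replication check that motivated the report. One possible approach, sketched below, is to query the primary PostgreSQL backend directly instead of going through Pgpool-II. Everything specific here is an assumption rather than something stated in the issue: the script is assumed to receive the primary's host and port as its two arguments, the database name `postgres` and the 5-second `PGCONNECT_TIMEOUT` are illustrative, and the `PSQL` variable and function names are hypothetical.

```shell
#!/bin/bash
# Sketch of a failback script that does NOT connect to Pgpool-II.
# Assumptions (not from the report): $1 is the primary PostgreSQL host
# (or socket directory) and $2 its port; the database name "postgres" and
# the 5-second connect timeout are illustrative values only.

PSQL="${PSQL:-psql}"   # overridable, so the logic can be tried without a server

# Print the number of standbys currently streaming from the given primary.
# The query goes straight to the backend, bypassing Pgpool-II.
count_streaming_standbys() {
    local host="$1" port="$2"
    PGCONNECT_TIMEOUT=5 "$PSQL" -h "$host" -p "$port" -d postgres -At \
        -c "SELECT count(*) FROM pg_stat_replication WHERE state = 'streaming'"
}

# Log the replication state. Always return 0 so that a failed check
# is reported but never reported as a failure of the failback itself.
main() {
    local n
    if n=$(count_streaming_standbys "$1" "$2"); then
        echo "failback check: ${n} standby(s) streaming from $1:$2"
    else
        echo "failback check: could not query $1:$2 directly" >&2
    fi
    return 0
}

# Run only when a host and a port were passed in.
if [ "$#" -ge 2 ]; then
    main "$@"
fi
```

With the cluster from the reproduction steps, where the primary listens on the /tmp socket directory at port 11002, this would be invoked as `/tmp/failback.sh /tmp 11002`. Because the check never touches the Pgpool-II port, it does not depend on Pgpool-II child processes being available during failback.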
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2018-08-13 11:53 | nagata | New Issue | |
| 2018-08-14 10:29 | pengbo | Note Added: 0002153 | |
| 2018-08-14 18:52 | nagata | Note Added: 0002155 | |
| 2018-08-14 20:09 | nagata | Steps to Reproduce Updated | |
| 2018-08-14 20:12 | nagata | Note View State: 0002155: private | |
| 2018-08-14 20:12 | nagata | Steps to Reproduce Updated | |
| 2018-08-14 20:13 | nagata | Note View State: 0002155: public | |
| 2018-08-14 20:21 | nagata | Note Edited: 0002155 | |
| 2018-08-15 13:03 | pengbo | Note Added: 0002157 | |
| 2018-08-16 15:01 | t-ishii | Note Added: 0002158 | |
| 2018-08-16 15:36 | nagata | Note Added: 0002159 | |
| 2018-08-16 15:37 | nagata | Assigned To | => nagata |
| 2018-08-16 15:37 | nagata | Status | new => closed |
| 2018-08-16 15:37 | nagata | Resolution | open => fixed |