[pgpool-committers: 7806] pgpool: Fix pcp_detach_node leaves down node.

Tatsuo Ishii ishii at sraoss.co.jp
Tue Jun 8 19:25:24 JST 2021


Fix pcp_detach_node leaves down node.

Detaching primary node using pcp_detach_node leaves a standby node
after follow primary command was executed.

This can be reproduced reliably by following steps:

$ pgpool_setup -n 4
$ ./startall
$ pcp_detatch_node -p 11001 0

This is caused by that pcp_recovery_node is denied by pcp child process:

2021-06-05 07:22:17: follow_child pid 6593: LOG:  execute command: /home/t-ishii/work/Pgpool-II/current/x/etc/follow_primary.sh 3 /tmp 11005 /home/t-ishii/work/Pgpool-II/current/x/data3 1 0 /tmp 0 11002 /home/t-ishii/work/Pgpool-II/current/x/data0
2021-06-05 07:22:17: pcp_main pid 6848: LOG:  forked new pcp worker, pid=7027 socket=6
2021-06-05 07:22:17: pcp_child pid 7027: ERROR:  failed to process PCP request at the moment
2021-06-05 07:22:17: pcp_child pid 7027: DETAIL:  failback is in progress

it complains that a failback request is still going. The reason why the
failback is not completed is, find_primary_node_repeatedly() is trying
to acquire the follow primary lock. However the follow primary command
has already acquired the lock and it is waiting for the completion of
the failback request. Thus this is a kind of dead lock situation.

How to solve this?

The purpose of the follow primary lock is to prevent concurrent run of
follow primary command and detach false primary by the streaming
replication check. We cannot throw it away. However it is not always
necessary to acquire the lock by find_primary_node_repeatedly(). If
it does not try to acquire the lock, failover/failback will not be
blocked and will finish soon, thus Req_info->switching flags will be
promptly turned to false.

When a primary node is detached, failover command is called and new
primary is selected. At this point find_primary_node_repeatedly() is
surely needed to run to find the new primary. However, once follow
primary command starts, the primary will not be changed. So my idea
is, find_primary_node_repeatedly() checks whether follow primary
command is running or not. If it is running, just returns the current
primary. Otherwise acquires the lock.

For this purpose, new shared memory variable
Req_info->follow_primary_ongoing was introduced. The flag is set/unset
by follow primary process.

New regression test 075.detach_primary_left_down_node is added.

Discussion: https://www.pgpool.net/pipermail/pgpool-hackers/2021-June/003916.html

Branch
------
V4_2_STABLE

Details
-------
https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=79e5ee3396a515e5bd0a3bee316e02e6cf43cae7

Modified Files
--------------
src/include/pool.h                                 |  1 +
src/main/pgpool_main.c                             | 22 ++++++++++
.../075.detach_primary_left_down_node/test.sh      | 49 ++++++++++++++++++++++
3 files changed, 72 insertions(+)



More information about the pgpool-committers mailing list