[pgpool-general: 7533] Re: Strange behavior on switchover with detach_false_primary enabled

Tatsuo Ishii ishii at sraoss.co.jp
Fri Apr 30 09:37:18 JST 2021


>> But by the time the main process forks off a child process which
>> will kick the follow primary command, find_primary_node has already
>> finished. So there's no window here, no?
> 
> When working on multi-threaded pieces of code, I always like to reason
> about the effects by introducing exaggerated pauses. The order of
> things is unclear and depends on the scheduling done by the OS. Under
> load, forked processes may spend significant time before they are
> actually started. So, in this case, what would happen if you put a
> 1-minute sleep in fork_follow_child, after the fork call, just before
> the lock is acquired (inside the if)? In my opinion, the pause in the
> forked process will allow the detach_false_primary to kick in again,
> because the detected state will be: node 0 primary with 1 slave, node
> 1 primary without slaves and node 2 slave. It is unlikely such a long
> pause will ever occur, but when dealing with multi-threading, the
> unlikely is bound to occur at some point in time.

Yeah, we need to protect follow_child from detach_false_primary
(actually executed in a separate process: pgpool_worker). For this
purpose I think we could use another shared memory variable,
Req_info->switching. The variable is set to true while the failover
procedure is running. detach_false_primary will not be executed while
Req_info->switching is true. I will implement this in the next patch
set.
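
For illustration, here is a minimal, self-contained sketch of the idea
(aside from the Req_info->switching concept, all names are made up for
the example; this is not the actual pgpool-II code). A flag in shared
memory is set by the main process around the failover window, and the
worker-style child tests it before running the false-primary check.
Because the flag stays true until the whole window closes, even the
1-minute sleep experiment you suggested would just leave the check
suppressed:

/*
 * Sketch only -- not the actual pgpool-II code.  A flag in shared
 * memory (standing in for Req_info->switching) is set by the main
 * process around the failover/follow window; the worker-like child
 * checks it before acting, so even a long pause cannot let the
 * false-primary check fire mid-failover.
 */
#include <stdbool.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

static volatile bool *switching;        /* shared across fork() */

static void maybe_detach_false_primary(void)
{
    if (*switching)
    {
        puts("worker: failover in progress, skipping false-primary check");
        return;
    }
    puts("worker: running the false-primary verification");
}

int main(void)
{
    switching = mmap(NULL, sizeof(*switching), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (switching == MAP_FAILED)
        return 1;
    *switching = true;                  /* failover window opens */

    if (fork() == 0)                    /* pgpool_worker-like process */
    {
        maybe_detach_false_primary();   /* suppressed: flag is true */
        _exit(0);
    }
    wait(NULL);

    *switching = false;                 /* failover + follow primary done */
    maybe_detach_false_primary();       /* allowed again */
    return 0;
}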

> Looking at logs again, I'm starting to think the original problem may
> be more complicated. In the attached logging I do not see the node
> being marked invalid. Instead, I see this:
> 
> (node 1, 172.29.30.2)
> 2021-04-26 04:03:33: pid 13: LOG:  watchdog received the failover
> command from remote pgpool-II node "172.29.30.1:5432 Linux
> 218337d23478"
> 2021-04-26 04:03:33: pid 13: LOG:  watchdog is processing the failover
> command [DEGENERATE_BACKEND_REQUEST] received from 172.29.30.1:5432
> Linux 218337d23478
> 2021-04-26 04:03:33: pid 13: LOG:  The failover request does not need
> quorum to hold
> 2021-04-26 04:03:33: pid 13: DETAIL:  proceeding with the failover
> 2021-04-26 04:03:33: pid 13: HINT:  REQ_DETAIL_CONFIRMED
> 2021-04-26 04:03:33: pid 13: LOG:  received degenerate backend request
> for node_id: 0 from pid [13]
> 2021-04-26 04:03:33: pid 1: LOG:  Pgpool-II parent process has
> received failover request
> 2021-04-26 04:03:33: pid 13: LOG:  new IPC connection received
> 2021-04-26 04:03:33: pid 13: LOG:  received the failover indication
> from Pgpool-II on IPC interface
> 2021-04-26 04:03:33: pid 13: LOG:  watchdog is informed of failover
> start by the main process
> 2021-04-26 04:03:33: pid 1: LOG:  starting degeneration. shutdown host
> 172.29.30.1(5432)
> ....
> 2021-04-26 04:03:34: pid 13: LOG:  watchdog is processing the failover
> command [DEGENERATE_BACKEND_REQUEST] received from 172.29.30.3:5432
> Linux 8d98afb57601
> 2021-04-26 04:03:34: pid 13: LOG:  The failover request does not need
> quorum to hold
> 2021-04-26 04:03:34: pid 13: DETAIL:  proceeding with the failover
> 2021-04-26 04:03:34: pid 13: HINT:  REQ_DETAIL_CONFIRMED
> 2021-04-26 04:03:34: pid 13: LOG:  received degenerate backend request
> for node_id: 1 from pid [13]
> 2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
> 1 does not connect to standby 2
> 2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
> 1 owns only 0 standbys out of 1
> 2021-04-26 04:03:34: pid 68: LOG:  pgpool_worker_child: invalid node found 1
> 2021-04-26 04:03:34: pid 68: LOG:  received degenerate backend request
> for node_id: 1 from pid [68]
> 
> So it seems the request to degenerate backend node 1 (the new
> primary) comes from 172.29.30.3 (node 2). In the logging on node 2, I
> can see this:
> 
> (node 2, 172.29.30.3)
> 2021-04-26 04:03:31: pid 1: LOG:  leader watchdog node
> "172.29.30.2:5432 Linux 127663cd6d79" returned status for 3 backend
> nodes
> 2021-04-26 04:03:31: pid 1: LOG:  backend:0 is set to UP status
> 2021-04-26 04:03:31: pid 1: DETAIL:  backend:0 is UP on cluster leader
> "172.29.30.2:5432 Linux 127663cd6d79"
> 2021-04-26 04:03:31: pid 1: LOG:  backend:1 is set to UP status
> 2021-04-26 04:03:31: pid 1: DETAIL:  backend:1 is UP on cluster leader
> "172.29.30.2:5432 Linux 127663cd6d79"
> 2021-04-26 04:03:31: pid 1: LOG:  backend:2 is set to UP status
> 2021-04-26 04:03:31: pid 1: DETAIL:  backend:2 is UP on cluster leader
> "172.29.30.2:5432 Linux 127663cd6d79"
> 2021-04-26 04:03:31: pid 1: LOG:  backend nodes status remains same
> after the sync from "172.29.30.2:5432 Linux 127663cd6d79"
> 2021-04-26 04:03:31: pid 12: LOG:  new watchdog node connection is
> received from "172.29.30.1:50410"
> 2021-04-26 04:03:31: pid 12: LOG:  new node joined the cluster
> hostname:"172.29.30.1" port:9009 pgpool_port:5432
> 2021-04-26 04:03:31: pid 12: DETAIL:  Pgpool-II version:"4.2.2"
> watchdog messaging version: 1.2
> 2021-04-26 04:03:31: pid 12: LOG:  The newly joined
> node:"172.29.30.1:5432 Linux 218337d23478" had left the cluster
> because it was shutdown
> 2021-04-26 04:03:31: pid 12: LOG:  new outbound connection to 172.29.30.1:9009
> 2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
> 1 does not connect to standby 2
> 2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
> 1 owns only 0 standbys out of 1
> 2021-04-26 04:03:34: pid 68: LOG:  pgpool_worker_child: invalid node found 1
> 2021-04-26 04:03:34: pid 68: LOG:  received degenerate backend request
> for node_id: 1 from pid [68]
> 2021-04-26 04:03:34: pid 12: LOG:  new IPC connection received
> 2021-04-26 04:03:34: pid 12: LOG:  failover request from local
> pgpool-II node received on IPC interface is forwarded to leader
> watchdog node "172.29.30.2:5432 Linux 127663cd6d79"
> 2021-04-26 04:03:34: pid 12: DETAIL:  waiting for the reply...
> 2021-04-26 04:03:34: pid 68: LOG:  degenerate backend request for 1
> node(s) from pid [68], will be handled by watchdog
> 
> This makes me think the sequence of events involves the other pgpool nodes:
> * The instruction to detach primary node 0 is performed on node 0 and
> forwarded to node 1.
> * pgpool node 1 starts the failover, promoting backend node 1.
> * pgpool node 2 wrongly detects a false primary and requests to
> detach backend node 1.
> * pgpool node 1 accepts this request and starts to detach node 1,
> while it is in the middle of instructing nodes 0 and 2 to follow this
> node as primary.

Thanks for the info. I will look into this.

>> > Another thing is that I'm not sure whether the change to
>> > verify_backend_node_status might cause other unwanted effects.
>> > This requires a bit more explanation on why we enabled
>> > detach_false_primary in the first place. We were seeing some failures
>> > in our tests in the following scenario:
>> > * We start with a healthy, 3 node cluster, node 0 is both primary
>> > database and pgpool leader
>> > * We terminate node 2 and completely wipe its database directory
>> > * Node 2 is restarted and will initialize a new, empty database
>> > * Node 2 rejoins the cluster
>> > * Pgpool now detects 2 primaries (nodes 0 and 2) and 1 standby (node
>> > 1), but without detach_false_primary, it doesn't act on its own
>> > * find_primary_node gets the following status information: primary,
>> > standby, primary and will return the highest backend number, which is
>> > 2 in this case. This conflicts with the actual state of the system, as
>> > node 0 was the real primary.
>> > In our case, this was causing issues in our application. I'm not
>> > familiar enough with the internals of pgpool itself, but I can imagine
>> > this might cause issues in some situations. IMHO pgpool should not
>> > 'change its mind' on which node is the primary in the middle of a
>> > failover, as that is probably going to cause a lot of issues. It could
>> > well be that the outcome of find_primary_node does not really matter
>> > at this point, but that's hard for me to judge.
>>
>> I see your point. Actually I was able to reproduce what you said.
>>
>> # 1. create 3 node streaming replication cluster
>> # primary, standby, standby
>> pgpool_setup -n 3
>>
>> # 2. detach node 2
>> pcp_detach_node 2
>>
>> # 3. make node 2 primary
>> pg_ctl -D data2 promote
>>
>> # 4. attach node 2
>> pcp_attach_node 2
>>
>> At this point, the cluster is: standby, standby, primary, as you said.
> 
> You mean 'primary, standby, primary' here, right?

No. See below:

t-ishii$ pgpool_setup -n 3
[snip]
t-ishii$ ./startall 
waiting for server to start....20719 2021-04-30 09:17:25 JST LOG:  redirecting log output to logging collector process
20719 2021-04-30 09:17:25 JST HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....20732 2021-04-30 09:17:25 JST LOG:  redirecting log output to logging collector process
20732 2021-04-30 09:17:25 JST HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....20747 2021-04-30 09:17:25 JST LOG:  redirecting log output to logging collector process
20747 2021-04-30 09:17:25 JST HINT:  Future log output will appear in directory "log".
 done
server started
t-ishii$ !psql
psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | /tmp     | 11002 | up     | up        | 0.333333  | primary | primary | 0          | false             | 0                 |                   |                        | 2021-04-30 09:17:48
 1       | /tmp     | 11003 | up     | up        | 0.333333  | standby | standby | 0          | true              | 0                 | streaming         | async                  | 2021-04-30 09:17:48
 2       | /tmp     | 11004 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2021-04-30 09:17:48
(3 rows)

t-ishii$ pcp_detach_node -w -p 11001 2
pcp_detach_node -- Command Successful
t-ishii$ pg_ctl -D data2 promote
waiting for server to promote.... done
server promoted
t-ishii$ !psql
psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | /tmp     | 11002 | up     | up        | 0.333333  | primary | primary | 0          | true              | 0                 |                   |                        | 2021-04-30 09:17:48
 1       | /tmp     | 11003 | up     | up        | 0.333333  | standby | standby | 0          | false             | 0                 | streaming         | async                  | 2021-04-30 09:17:48
 2       | /tmp     | 11004 | down   | up        | 0.333333  | standby | primary | 0          | false             | 0                 | streaming         | async                  | 2021-04-30 09:18:07
(3 rows)

t-ishii$ pcp_attach_node -w -p 11001 2
pcp_attach_node -- Command Successful
t-ishii$ !psql
psql -p 11000 -c "show pool_nodes" test
 node_id | hostname | port  | status | pg_status | lb_weight |  role   | pg_role | select_cnt | load_balance_node | replication_delay | replication_state | replication_sync_state | last_status_change  
---------+----------+-------+--------+-----------+-----------+---------+---------+------------+-------------------+-------------------+-------------------+------------------------+---------------------
 0       | /tmp     | 11002 | up     | up        | 0.333333  | standby | primary | 0          | true              | 100663792         |                   |                        | 2021-04-30 09:19:05
 1       | /tmp     | 11003 | up     | up        | 0.333333  | standby | standby | 0          | false             | 280               |                   |                        | 2021-04-30 09:17:48
 2       | /tmp     | 11004 | up     | up        | 0.333333  | primary | primary | 0          | false             | 0                 |                   |                        | 2021-04-30 09:19:11
(3 rows)

Note that this pgpool is built from the master branch HEAD. The
"pg_role" column, newly added in the master branch, indicates the
actual backend status.
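
To see why node 2 ends up as the primary here, the behaviour you
described upthread is enough: find_primary_node effectively returns the
highest-numbered backend that reports itself as primary. A simplified,
compilable sketch of that selection (the array and loop are
illustrative, not the actual find_primary_node implementation):

/* Simplified model of the selection described upthread -- not the
 * actual find_primary_node code.  With pg_role = primary, standby,
 * primary (as in the output above), scanning in ascending node-id
 * order and letting later primaries win yields node 2. */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
    bool reports_primary[3] = { true, false, true };  /* nodes 0, 1, 2 */
    int primary_node = -1;

    for (int node = 0; node < 3; node++)
        if (reports_primary[node])
            primary_node = node;    /* the last primary found wins */

    printf("selected primary: node %d\n", primary_node);  /* prints 2 */
    return 0;
}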

> What I meant is that the most likely primary should be decided at one
> point, and pgpool should stick with that until it at least completed
> the steps it decided to take. In the past few weeks, I've observed
> several instances of pgpool 'changing its mind' while it was still
> executing failover or follow primary commands. This resulted in
> strange commands, like: make node 1 follow node 1 as its primary, but
> also things like make node 2 do a pg_basebackup from node 1 while node
> 1 was in the middle of a pg_basebackup from node 0. Many of these
> failures are less likely to occur in the real world, but do surface in our
> tests, because these tests have very short pauses between actions. So,
> if you decide node x will be the new primary, at least stick to that
> conclusion until you are done executing all steps necessary for a
> complete failover to node x, including the follow primary commands.

I agree that pgpool should not change the primary node while the
follow primary command is executing. That's the aim of the proposed
patch.

On the other hand, the failover command may change the primary node in
the middle of the failover script, because it may need to elect a new
primary node. We should allow this; otherwise no new primary node
would ever be chosen.
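
In other words, the switching window should suppress
detach_false_primary for the whole failover plus follow primary
sequence, while the failover logic inside the window remains free to
elect a new primary once. A stubbed outline of that scoping (the
function names are placeholders, not the actual pgpool-II code):

/* Stubbed outline of the intended scoping -- illustrative only.
 * detach_false_primary is suppressed for the whole window, while the
 * failover step inside it is still free to elect a new primary. */
#include <stdbool.h>
#include <stdio.h>

static bool switching;                      /* stands in for Req_info->switching */

static int  run_failover_script(void)       { return 1; }  /* elects node 1 */
static void run_follow_primary(int primary) { printf("follow node %d\n", primary); }

static void detach_false_primary_check(void)
{
    if (switching)
        return;                             /* suppressed inside the window */
    puts("verifying primary/standby topology");
}

int main(void)
{
    switching = true;                        /* window opens                  */
    int new_primary = run_failover_script(); /* primary may change here       */
    detach_false_primary_check();            /* no-op while switching is true */
    run_follow_primary(new_primary);         /* primary is fixed from here on */
    switching = false;                       /* window closes                 */
    detach_false_primary_check();            /* permitted again               */
    return 0;
}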

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

