[pgpool-general: 7532] Re: Strange behavior on switchover with detach_false_primary enabled

Emond Papegaaij emond.papegaaij at gmail.com
Thu Apr 29 20:58:15 JST 2021


> > Thanks for the patch. Unfortunately, I'm not in a position to test
> > your patch at this moment. Hopefully I can give it a try sometime
> > next week. I did take a look at the code and do have a few questions.
> >
> > First of all, you acquire an exclusive lock just before you start the
> > follow primary loop. However, this process has at that point already
> > forked off from the main process. I think this means there is a small
> > window in which the main process has already decided it will need to
> > start a follow process, but the lock isn't held yet (there is no
> > guarantee that a forked process starts to execute before the main
> > process continues). During this time window, the main process could
> > already enter the detach_false_primary check. I'm not exactly sure
> > what the state of the cluster is at that moment, but I believe it
> > could already be in the state we've seen before. I think the main
> > process should either wait for the lock to be acquired by the forked
> > process, or it should take the lock itself and pass it on to the
> > forked process (not sure if that is possible).
>
> But by the time the main process forks off a child process which
> will kick the follow primary command, find_primary_node has already
> finished. So there's no window here, no?

When working on multi-threaded pieces of code, I always like to reason
about the effects by introducing exaggerated pauses. The order in which
things happen is unclear and depends on the scheduling done by the OS.
Under load, forked processes may wait a significant time before they
are actually scheduled. So, in this case, what would happen if you put
a 1 minute sleep in fork_follow_child, after the fork call, just before
the lock is acquired (inside the if)? In my opinion, the pause in the
forked process will allow the detach_false_primary check to kick in
again, because the detected state will be: node 0 primary with 1 slave,
node 1 primary without slaves and node 2 slave. It is unlikely such a
long pause will ever occur, but when dealing with multi-threading, the
unlikely is bound to occur at some point in time.
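
To make that window concrete, here is a small standalone C program
(not pgpool code; the lock is reduced to an flock on a made-up
temporary file, and the 60 second sleep plays the role of the
scheduler delay). The parent continues immediately after fork() and
still finds the lock free, which is exactly the moment at which a
detach_false_primary check could run.

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/wait.h>

int main(void)
{
    int fd = open("/tmp/follow_primary.lock", O_CREAT | O_RDWR, 0600);

    pid_t pid = fork();
    if (pid == 0)
    {
        /* child: open its own descriptor so it holds an independent lock */
        int child_fd = open("/tmp/follow_primary.lock", O_RDWR);
        /* pretend the scheduler delays us (the exaggerated pause) */
        sleep(60);
        flock(child_fd, LOCK_EX);   /* the lock is taken far too late */
        printf("child: running the follow primary loop under the lock\n");
        flock(child_fd, LOCK_UN);
        _exit(0);
    }

    /* parent: continues immediately; the lock is still free here, so a
     * rule like "skip detach_false_primary while the follow primary
     * lock is held" does not protect this window */
    if (flock(fd, LOCK_EX | LOCK_NB) == 0)
    {
        printf("parent: lock was free, the false primary check would run\n");
        flock(fd, LOCK_UN);
    }
    waitpid(pid, NULL, 0);
    return 0;
}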

Looking at the logs again, I'm starting to think the original problem
may be more complicated. In the attached logging, I do not see the node
being marked invalid. Instead, I see this:

(node 1, 172.29.30.2)
2021-04-26 04:03:33: pid 13: LOG:  watchdog received the failover
command from remote pgpool-II node "172.29.30.1:5432 Linux
218337d23478"
2021-04-26 04:03:33: pid 13: LOG:  watchdog is processing the failover
command [DEGENERATE_BACKEND_REQUEST] received from 172.29.30.1:5432
Linux 218337d23478
2021-04-26 04:03:33: pid 13: LOG:  The failover request does not need
quorum to hold
2021-04-26 04:03:33: pid 13: DETAIL:  proceeding with the failover
2021-04-26 04:03:33: pid 13: HINT:  REQ_DETAIL_CONFIRMED
2021-04-26 04:03:33: pid 13: LOG:  received degenerate backend request
for node_id: 0 from pid [13]
2021-04-26 04:03:33: pid 1: LOG:  Pgpool-II parent process has
received failover request
2021-04-26 04:03:33: pid 13: LOG:  new IPC connection received
2021-04-26 04:03:33: pid 13: LOG:  received the failover indication
from Pgpool-II on IPC interface
2021-04-26 04:03:33: pid 13: LOG:  watchdog is informed of failover
start by the main process
2021-04-26 04:03:33: pid 1: LOG:  starting degeneration. shutdown host
172.29.30.1(5432)
....
2021-04-26 04:03:34: pid 13: LOG:  watchdog is processing the failover
command [DEGENERATE_BACKEND_REQUEST] received from 172.29.30.3:5432
Linux 8d98afb57601
2021-04-26 04:03:34: pid 13: LOG:  The failover request does not need
quorum to hold
2021-04-26 04:03:34: pid 13: DETAIL:  proceeding with the failover
2021-04-26 04:03:34: pid 13: HINT:  REQ_DETAIL_CONFIRMED
2021-04-26 04:03:34: pid 13: LOG:  received degenerate backend request
for node_id: 1 from pid [13]
2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
1 does not connect to standby 2
2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
1 owns only 0 standbys out of 1
2021-04-26 04:03:34: pid 68: LOG:  pgpool_worker_child: invalid node found 1
2021-04-26 04:03:34: pid 68: LOG:  received degenerate backend request
for node_id: 1 from pid [68]

So it seems the request to degenerate backend node 1 (the new
primary) comes from 172.29.30.3 (node 2). In the logging on node 2, I
can see this:

(node 2, 172.29.30.3)
2021-04-26 04:03:31: pid 1: LOG:  leader watchdog node
"172.29.30.2:5432 Linux 127663cd6d79" returned status for 3 backend
nodes
2021-04-26 04:03:31: pid 1: LOG:  backend:0 is set to UP status
2021-04-26 04:03:31: pid 1: DETAIL:  backend:0 is UP on cluster leader
"172.29.30.2:5432 Linux 127663cd6d79"
2021-04-26 04:03:31: pid 1: LOG:  backend:1 is set to UP status
2021-04-26 04:03:31: pid 1: DETAIL:  backend:1 is UP on cluster leader
"172.29.30.2:5432 Linux 127663cd6d79"
2021-04-26 04:03:31: pid 1: LOG:  backend:2 is set to UP status
2021-04-26 04:03:31: pid 1: DETAIL:  backend:2 is UP on cluster leader
"172.29.30.2:5432 Linux 127663cd6d79"
2021-04-26 04:03:31: pid 1: LOG:  backend nodes status remains same
after the sync from "172.29.30.2:5432 Linux 127663cd6d79"
2021-04-26 04:03:31: pid 12: LOG:  new watchdog node connection is
received from "172.29.30.1:50410"
2021-04-26 04:03:31: pid 12: LOG:  new node joined the cluster
hostname:"172.29.30.1" port:9009 pgpool_port:5432
2021-04-26 04:03:31: pid 12: DETAIL:  Pgpool-II version:"4.2.2"
watchdog messaging version: 1.2
2021-04-26 04:03:31: pid 12: LOG:  The newly joined
node:"172.29.30.1:5432 Linux 218337d23478" had left the cluster
because it was shutdown
2021-04-26 04:03:31: pid 12: LOG:  new outbound connection to 172.29.30.1:9009
2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
1 does not connect to standby 2
2021-04-26 04:03:34: pid 68: LOG:  verify_backend_node_status: primary
1 owns only 0 standbys out of 1
2021-04-26 04:03:34: pid 68: LOG:  pgpool_worker_child: invalid node found 1
2021-04-26 04:03:34: pid 68: LOG:  received degenerate backend request
for node_id: 1 from pid [68]
2021-04-26 04:03:34: pid 12: LOG:  new IPC connection received
2021-04-26 04:03:34: pid 12: LOG:  failover request from local
pgpool-II node received on IPC interface is forwarded to leader
watchdog node "172.29.30.2:5432 Linux 127663cd6d79"
2021-04-26 04:03:34: pid 12: DETAIL:  waiting for the reply...
2021-04-26 04:03:34: pid 68: LOG:  degenerate backend request for 1
node(s) from pid [68], will be handled by watchdog

This makes me think the sequence of events involves the other pgpool nodes:
* The instruction to detach primary node 0 is performed on node 0 and
forwarded to node 1.
* pgpool node 1 starts the failover, promoting backend node 1.
* pgpool node 2 wrongly detects a false primary and requests to
detach backend node 1.
* pgpool node 1 accepts this request and starts to detach backend node
1, while it is in the middle of instructing nodes 0 and 2 to follow
this node as the new primary.

> > Another thing is that I'm not sure whether the change to
> > verify_backend_node_status might cause other unwanted effects.
> > This requires a bit more explanation of why we enabled
> > detach_false_primary in the first place. We were seeing some failures
> > in our tests in the following scenario:
> > * We start with a healthy, 3 node cluster; node 0 is both the primary
> > database and the pgpool leader
> > * We terminate node 2 and completely wipe its database directory
> > * Node 2 is restarted and will initialize a new, empty database
> > * Node 2 rejoins the cluster
> > * Pgpool now detects it has 2 primaries (nodes 0 and 2) and 1 standby
> > (node 1), but without detach_false_primary, it doesn't act on its own
> > * find_primary_node gets the following status information: primary,
> > standby, primary and will return the highest backend number, which is
> > 2 in this case. This conflicts with the actual state of the system, as
> > node 0 was the real primary.
> > In our case, this was causing issues in our application. I'm not
> > familiar enough with the internals of pgpool itself, but I can imagine
> > this might cause issues in some situations. IMHO pgpool should not
> > 'change its mind' on which node is the primary in the middle of a
> > failover, as that is probably going to cause a lot of issues. It could
> > well be that the outcome of find_primary_node does not really matter
> > at this point, but that's hard for me to judge.
>
> I see your point. Actually I was able to reproduce what you said.
>
> # 1. create 3 node streaming replication cluster
> # primary, standby, standby
> pgpool_setup -n 3
>
> # 2. detach node 2
> pcp_detach_node 2
>
> # 2. make node 2 primary
> pg_ctl -D data2 promote
>
> # 3. attach node 2
> pcp_attach_node 2
>
> At this point, the cluster is: standby, standby, primary, as you said.

You mean 'primary, standby, primary' here, right?

> In this case find_primary_node really ought to choose node 0 as the
> real primary because node 1 is a connected standby of node 0, but node
> 2 does not have any connected standbys. I will fix this.

Great, I think the logic to determine the (most likely) primary should
be the same whether detach_false_primary is enabled or not. The switch
should only enable/disable the detaching part.
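
For what it's worth, the rule I have in mind is roughly the following
(a rough sketch with made-up types and names, not a patch against
pgpool; how ties between primaries should break is left open): among
the nodes that report themselves as primary, prefer the one that
actually owns connected standbys, rather than simply taking the
highest node id.

/* node_status and pick_most_likely_primary are invented for this sketch */
typedef struct
{
    int node_id;
    int is_primary;          /* pg_is_in_recovery() returned false */
    int connected_standbys;  /* number of rows in pg_stat_replication */
} node_status;

int
pick_most_likely_primary(const node_status *nodes, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++)
    {
        if (!nodes[i].is_primary)
            continue;
        /* a primary that standbys are actually connected to wins over
         * a primary without any connected standbys */
        if (best < 0 ||
            nodes[i].connected_standbys > nodes[best].connected_standbys)
            best = i;
    }
    return best < 0 ? -1 : nodes[best].node_id;
}

In the reproduction above this picks node 0 (primary with one
connected standby) over node 2 (primary without standbys), regardless
of whether detach_false_primary is enabled.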

> In the meantime I am not really sure Pgpool-II should not 'change its
> mind' on which node is the primary in the middle of a failover,
> because the former primary is not necessarily the correct
> primary. For example, if we start with #3 above and detach/attach
> node 0, we probably want node 0 to be the new primary.
>
> Probably the new rule would be:
>
> Pgpool-II should not 'change its mind' on which node is the primary in
> the middle of a failover if there's no reliable way to judge which is
> the correct primary (for example, there are multiple primaries but no
> standby).
>
> What do you think?

What I meant is that the most likely primary should be decided at one
point, and pgpool should stick with that until it has at least
completed the steps it decided to take. In the past few weeks, I've
observed several instances of pgpool 'changing its mind' while it was
still executing failover or follow primary commands. This resulted in
strange commands, like: make node 1 follow node 1 as its primary, but
also things like make node 2 do a pg_basebackup from node 1 while node
1 was in the middle of a pg_basebackup from node 0. Many of these
failures are less likely to occur in the real world, but they do
surface in our tests, because these tests have very short pauses
between actions. So, if you decide node x will be the new primary, at
least stick to that conclusion until you are done executing all steps
necessary for a complete failover to node x, including the follow
primary commands.
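
In pseudo-C, the behaviour I'm asking for looks something like this
(all names are invented for the sketch; in a multi-process setup this
state would of course have to live in shared memory and be protected
by the lock discussed above):

/* -1 means: no failover in progress */
static int chosen_primary = -1;

void
failover_started(int new_primary)
{
    chosen_primary = new_primary;    /* the decision is made exactly once */
}

int
false_primary_check_allowed(void)
{
    /* while the failover (including its follow primary commands) is
     * still running, detach_false_primary must not second-guess the
     * chosen primary */
    return chosen_primary == -1;
}

void
failover_finished(void)
{
    chosen_primary = -1;             /* safe to re-evaluate from here on */
}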

Best regards,
Emond

