[pgpool-general: 7538] Re: Strange behavior on switchover with detach_false_primary enabled

Emond Papegaaij emond.papegaaij at gmail.com
Sat May 1 05:44:59 JST 2021


On Fri, Apr 30, 2021 at 2:11 PM Emond Papegaaij
<emond.papegaaij at gmail.com> wrote:
> > >> I have started to think that detach_false_primary should not be
> > >> active on any node other than the leader watchdog node, because a
> > >> standby watchdog node can be interrupted by other watchdog nodes,
> > >> causing an unexpected failover. I will investigate further.
> > >
> > > I was in the middle of typing this, so it seems we are in agreement:
> > > I'm wondering why a node that is not the pgpool leader makes this
> > > decision at all. Shouldn't the view of the database cluster be the
> > > same from all pgpool nodes? I'm not that familiar with how the
> > > watchdog makes its decisions, but to me it seems that only the
> > > leader should make autonomous decisions. A node may request that
> > > the cluster status be reviewed, but for an automated detach, I
> > > would at least expect a quorum to be required.
> >
> > Attached is the patch which implements this:
> >
> > - only the leader watchdog node runs detach_false_primary
> > - if a quorum does not exist, the leader node does not run detach_false_primary
>
> This patch is a bit over my head; I'm not that familiar with the
> internals of pgpool. I've started a dedicated build job to test our
> application against pgpool master with this patch applied. As each run
> takes several hours, and I may need two or three runs to get things
> right, I expect to have some results next week.
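The guard described in the patch can be sketched as follows. This is illustrative only: pgpool itself is written in C, and the function and parameter names here are my own, not pgpool's actual internals.

```python
def should_run_detach_false_primary(is_leader, has_quorum):
    """Sketch of the patch's guard: detach_false_primary runs only on
    the watchdog leader, and only while a quorum exists.
    (Hypothetical names; the real check lives in pgpool's C code.)"""
    if not is_leader:
        # Standby watchdog nodes must not detach a primary on their
        # own; they can be interrupted by other watchdog nodes,
        # triggering an unexpected failover.
        return False
    if not has_quorum:
        # Without a quorum, the leader's view of the cluster may be
        # the result of a network partition, so an automated detach
        # would be unsafe.
        return False
    return True
```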

I've started several runs so far, but none of them have been
successful. The problem seems to be caused by the change in the
pcp_node_info output, which breaks some of my scripts. It's easy to
fix the scripts to accommodate the new values, but I've noticed
something strange in the output:
Node 0: 172.29.30.1 5432 1 0.333333 waiting down primary primary 0 2021-04-30 18:20:26
Node 1: 172.29.30.2 5432 1 0.333333 waiting down standby standby 0 streaming async 2021-04-30 18:21:00
Node 2: 172.29.30.3 5432 1 0.333333 waiting down standby standby 0 streaming async 2021-04-30 18:20:47
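For reference, here is a small parser for this whitespace-delimited output. The column names are my reading of the lines quoted above (pcp_node_info's exact column set differs between pgpool versions, and the replication-state columns are absent on the primary), so treat this as a sketch rather than the official format.

```python
def parse_pcp_node_info(line):
    """Parse one line of pcp_node_info output into a dict.
    Only the leading fixed columns are mapped, since the trailing
    replication columns vary between primary and standby lines.
    Column names are assumptions based on the output quoted above."""
    fields = line.split()
    return {
        "host": fields[0],
        "port": int(fields[1]),
        "status_code": fields[2],
        "weight": float(fields[3]),
        "pgpool_status": fields[4],   # e.g. 'waiting'
        "backend_status": fields[5],  # the new 'actual backend status', e.g. 'down'
        "role": fields[6],
        "actual_role": fields[7],
    }

info = parse_pcp_node_info(
    "172.29.30.1 5432 1 0.333333 waiting down primary primary 0 2021-04-30 18:20:26"
)
```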

As you can see, all nodes report 'down' as the 'actual backend
status'. This is probably because pgpool runs inside a Docker
container that does not have PostgreSQL installed, so pg_isready is
not available. I don't know whether this has consequences for pgpool
itself or is just a reporting issue, but I would prefer that pgpool
not depend on the PostgreSQL binaries for its reporting.
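As an illustration of checking backend reachability without the PostgreSQL binaries, a plain TCP connect to the backend port can stand in for pg_isready. Note that this is a weaker check than the real one: it only proves the port accepts connections, not that the server would answer the PostgreSQL wire protocol.

```python
import socket

def tcp_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to (host, port) succeeds.
    A crude stand-in for pg_isready when PostgreSQL client binaries
    are not installed in the container; it does not speak the
    PostgreSQL protocol, so a hung server could still pass."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```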

I hope the next test run will succeed, but my working week is over
now, so I'll check it next week.

Best regards,
Emond
