[pgpool-general: 7509] Re: Problems with pg_rewind

Emond Papegaaij emond.papegaaij at gmail.com
Thu Apr 15 03:55:51 JST 2021


Hi all,

I can confirm that the pg_rewind issue is indeed caused by a timing
issue. I've updated our pg_rewind script to retry a couple of times
when it detects that no rewind is required, and now we get this:

Apr 14 17:31:27 tkh-server webhook: Removing tkh-db ... done
Apr 14 17:31:27 tkh-server webhook: Going to remove tkh-db
Apr 14 17:31:27 tkh-server webhook: === Start of pg_rewind output ===
Apr 14 17:31:27 tkh-server webhook: Creating tkh_tkh-db_run ... #015
Creating tkh_tkh-db_run ... done#015 pg_rewind: source and target
cluster are on the same timeline pg_rewind: no rewind required
Apr 14 17:31:27 tkh-server webhook: === End of pg_rewind output ===
Apr 14 17:31:27 tkh-server webhook: Rewind on same timeline
Apr 14 17:31:31 tkh-server webhook: === Start of pg_rewind output ===
Apr 14 17:31:31 tkh-server webhook: Creating tkh_tkh-db_run ... #015
Creating tkh_tkh-db_run ... done#015 pg_rewind: servers diverged at
WAL location 0/9093858 on timeline 1 pg_rewind: rewinding from last
common checkpoint at 0/8000060 on timeline 1 pg_rewind: Done!
Apr 14 17:31:31 tkh-server webhook: === End of pg_rewind output ===
Apr 14 17:31:31 tkh-server webhook: Rewind successful
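
For reference, the retry logic in the script is roughly the following
(a minimal sketch only; the data directory, connection string and
retry count below are placeholders, not our actual configuration):

  #!/bin/bash
  # Retry pg_rewind while it reports "no rewind required", because the
  # new primary may not have switched to its new timeline yet.
  PGDATA=/var/lib/postgresql/data          # placeholder path
  SOURCE="host=new-primary user=postgres"  # placeholder conninfo

  for attempt in 1 2 3; do
      if output=$(pg_rewind --target-pgdata="$PGDATA" \
                            --source-server="$SOURCE" 2>&1); then
          if echo "$output" | grep -q "no rewind required"; then
              # Possibly premature: sleep and try again.
              sleep 5
              continue
          fi
          echo "$output"
          exit 0  # a real rewind was performed
      else
          echo "$output"
          exit 1  # pg_rewind itself failed
      fi
  done
  # Every attempt said "no rewind required"; we treat this as a
  # failure, which is exactly the weakness described below.
  exit 1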

Between the two attempts, the script simply sleeps for a few seconds
and then tries again. No changes are made on the node, and the
PostgreSQL database is still down. It seems the new primary node only
starts the new timeline a few seconds after it has been promoted:

Apr 14 17:31:23 tkh-server webhook: [webhook] 2021/04/14 19:31:23
[1b624d] command output: waiting for server to promote.... done
Apr 14 17:31:23 tkh-server webhook: server promoted

The workaround will do for now, but I would be very interested in a
more robust solution. The problem with this workaround is that a
genuine "no rewind required" on the same timeline is now also treated
as a failure. For example, with 3 nodes, where node 0 is primary,
node 1 follows node 0, and node 2 follows node 1: the follow_primary
script that makes node 2 follow node 0 will then incorrectly conclude
that the rewind failed and perform a full pg_basebackup, while in
fact no action was required at all.
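
A more robust approach might be to determine, independently of
pg_rewind, whether the new primary is already writing on a newer
timeline than the local node, and only retry in that case. A rough,
untested sketch (PGDATA and the conninfo are again placeholders):

  #!/bin/bash
  PGDATA=/var/lib/postgresql/data                  # placeholder path
  PRIMARY="host=new-primary user=postgres dbname=postgres"

  # Timeline of the (stopped) local node, from its control file.
  # LANG=C keeps pg_controldata's labels in English for the match.
  local_tli=$(LANG=C pg_controldata "$PGDATA" \
      | awk -F': *' '/Latest checkpoint.s TimeLineID/ {print $2}')

  # Timeline the new primary is currently writing WAL on. The first
  # 8 hex digits of a WAL file name encode the timeline.
  primary_tli=$((16#$(psql "$PRIMARY" -Atc \
      "SELECT substr(pg_walfile_name(pg_current_wal_lsn()), 1, 8)")))

  if [ "$primary_tli" -gt "$local_tli" ]; then
      # The primary is already on a newer timeline, so "no rewind
      # required" from pg_rewind is premature: keep retrying.
      echo "timeline switch visible ($local_tli -> $primary_tli)"
  else
      # Genuinely the same timeline: no retry and no pg_basebackup
      # needed.
      echo "same timeline ($local_tli); nothing to do"
  fi

The reason for looking at the current WAL file name rather than the
primary's control file is that the control file only advances at the
next checkpoint, which is presumably the same delay we are seeing
after the promotion.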

Best regards,
Emond

On Tue, Apr 13, 2021 at 7:29 PM Emond Papegaaij
<emond.papegaaij at gmail.com> wrote:
>
> > One possible reason I can think of that may be causing this
> > problem is that our test cases require the creation of snapshots
> > for the VMs running the cluster. These snapshots are restored at
> > the beginning of the test. We do make sure the cluster is in a
> > consistent state before we continue with the test, but maybe some
> > part of the cluster is still not entirely in sync. Yesterday we
> > changed the re-synchronization code to fully stop and restart the
> > entire cluster in a reliable way, rather than trying to fix a
> > distorted cluster. We haven't seen the error since, but it will
> > require some more runs to be sure. It is still a strange issue,
> > because even if it is caused by restoring the VM snapshots, all
> > pgpool instances and PostgreSQL databases had already been
> > restarted prior to this failure.
>
> An update on this possible cause: we've observed the same failure
> with our new setup. In this setup, all pgpool and PostgreSQL nodes
> are stopped and restarted in a predictable way. This works well and
> consistently restores the cluster to a healthy state. However, we
> still hit the rewind problem.
>
> Also, we've updated pgpool to the latest version, 4.2.2.
>
> Best regards,
> Emond

