<div dir="ltr">Hi all,<div><br></div><div>We are seeing failures in our test suite on a specific set of tests related to taking a node out of a cluster. In short, the following sequence of events seems to occur:</div><div>* We start with a healthy cluster of 3 nodes (0, 1 and 2), each node running pgpool and postgresql. Node 0 runs the primary database.</div><div>* node 1 is shut down</div><div>* pgpool on nodes 0 and 2 correctly marks backend 1 as down</div><div>* pgpool on node 0 is reconfigured to remove node 1 from the configuration; backend 0 remains backend 0, and backend 2 is now known as backend 1</div><div>* pgpool on node 0 starts up again and receives the cluster status from node 2, which includes backend 1 being down</div><div>* pgpool on node 0 now also marks backend 1 as down, but because of the renumbering, it actually marks the backend on node 2 as down</div><div>* pgpool on node 2 gets its new configuration, the same as on node 0</div><div>* pgpool on node 2 (which now runs backend 1) gets the cluster status from node 0 and marks backend 1 as down</div><div>* the cluster ends up with pgpool and postgresql running on both remaining nodes, but backend 1 is down. 
It never recovers from this state automatically, even though auto_failback is enabled and postgresql is up and streaming.</div><div><br></div><div>For node 2 (with backend 1), pcp_node_info returns the following information for backend 1:</div><div>Hostname               : 172.29.30.3<br>Port                   : 5432<br>Status                 : 3<br>Weight                 : 0.500000<br>Status Name            : down<br>Backend Status Name    : up<br>Role                   : standby<br>Backend Role           : standby<br>Replication Delay      : 0<br>Replication State      : streaming<br>Replication Sync State : async<br>Last Status Change     : 2023-01-09 22:28:41<br></div><div><br></div><div>My first question is: Can we somehow prevent the state of the old backend 1 from being assigned to the wrong node during the configuration update?</div><div><br></div><div>My second question is: Why does auto_failback not reattach backend 1 when it detects that the database is up and streaming?</div><div><br></div><div>Best regards,</div><div>Emond Papegaaij</div></div>
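<div><br></div><div>To illustrate the renumbering problem described above, here is a minimal sketch in Python. It is only a model of the behaviour we observe, not pgpool code: the host names and the functions renumber, apply_status_by_index and apply_status_by_host are hypothetical, and keying status by host name is just one possible way to avoid the mix-up.</div>

```python
# Hypothetical model, not pgpool's actual implementation: backend status is
# keyed by index, so removing a node shifts the meaning of each index.

def renumber(hosts, removed):
    """Return the host list after removing one node; later indexes shift down."""
    return [h for h in hosts if h != removed]

def apply_status_by_index(hosts, status_by_index):
    """Apply a status map keyed by backend index (the failure mode we see)."""
    return {hosts[i]: s for i, s in status_by_index.items() if i < len(hosts)}

def apply_status_by_host(hosts, status_by_host):
    """Apply a status map keyed by host name (one possible alternative)."""
    return {h: status_by_host[h] for h in hosts if h in status_by_host}

# Cluster before the change: node1 (backend 1) has been shut down.
old_hosts = ["node0", "node1", "node2"]
old_status = {0: "up", 1: "down", 2: "up"}

# New configuration: node1 removed, so node2 becomes backend 1.
new_hosts = renumber(old_hosts, "node1")

# Status received from a peer that still uses the old numbering.
by_index = apply_status_by_index(new_hosts, old_status)

# The same status, translated to host names before it is applied.
old_status_by_host = {old_hosts[i]: s for i, s in old_status.items()}
by_host = apply_status_by_host(new_hosts, old_status_by_host)

print(by_index)  # node2 inherits the "down" status of the removed node1
print(by_host)   # node2 keeps its own "up" status
```

<div>In this model the index-keyed variant ends with node2 marked down, which matches the state our cluster ends up in.</div>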