[pgpool-hackers: 3966] Health check doesn’t fail on storage failure and detach_false_primary doesn’t work properly with one remaining node

anirudh.goel at cern.ch
Wed Jul 14 04:15:37 JST 2021


Hello

I was performing some tests on my HA cluster, which has the following configuration: 1 primary and 2 standbys behind 1 pgpool, without watchdog.

TEST 1 - Storage failure (by making remote storage inaccessible)

Action - With one client connected to pgpool, the instance's storage disk was made inaccessible by blocking the connection to it at the firewall.
Expected Behaviour - The pgpool health check should start to fail and a failover should be triggered after 'health_check_max_retries' retries.
Observed Behaviour - The pgpool health check didn't detect anything wrong. `show pool_nodes` (handled by pgpool itself) kept working fine, but `\l` (a query that reaches PostgreSQL) got stuck forever, and ^C to cancel it didn't work either. No new client was able to connect to pgpool.
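
To make the observation concrete, this is roughly how the symptom looks from a shell. The host and port below are placeholders for my pgpool endpoint, and the 10-second `timeout` wrapper is only there so the hung command returns at all:

PGPOOL_HOST=localhost   # placeholder for wherever pgpool listens
PGPOOL_PORT=9999        # placeholder for pgpool's listen port

# pgpool's own SHOW command still answers, which matches pgpool not noticing anything wrong
timeout 10 psql -h "$PGPOOL_HOST" -p "$PGPOOL_PORT" -U postgres -d postgres \
    -c 'show pool_nodes;' && echo "pgpool-level query OK"

# anything that has to touch the backend's storage never returned in my test;
# without the external timeout it simply hangs and ^C does not cancel it
timeout 10 psql -h "$PGPOOL_HOST" -p "$PGPOOL_PORT" -U postgres -d postgres \
    -c '\l' || echo "backend-level query hung (killed after 10s)"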


TEST 2 - Failover on failure of Online Recovery (or node incorrectly detected as invalid?)

The setup was running with 1 primary (Node 2) and 2 failed standbys (Node 0 & Node 1). I tried to restore one of the standbys by performing an online recovery. The recovery command waited for longer than usual and then failed, and at the same moment pgpool detected the only running node (the primary) as an invalid node and shut it down. From what I understand, there are two possible cases:

• The online recovery failed for some external reason, and that failure caused pgpool to perform the failover. This would be a problem, because in a failed online recovery the current primary usually stays unaffected, as it should.
• Or, the other way around: marking the current primary as an invalid node is what made the online recovery fail. In that case I would like to understand why pgpool would mark the single running instance as a false primary.


I am more inclined to believe in the first case, since the cluster was working fine before I ran the online recovery command. Here are the logs of the incident:


2021-07-12 20:13:21: pid 205: LOG: starting recovering node 0
2021-07-12 20:13:21: pid 205: LOG: executing recovery
2021-07-12 20:13:21: pid 205: DETAIL: starting recovery command: "SELECT pgpool_recovery('recovery_1st_stage', 'dbod-ag-pg01.ch', '/ORA/dbs03/AG_PG01/data', '6600', 0, '6601')"
2021-07-12 20:13:21: pid 205: LOG: executing recovery
2021-07-12 20:13:21: pid 205: DETAIL: disabling statement_timeout
2021-07-12 20:15:09: pid 126: LOG: get_query_result falied: status: -2
2021-07-12 20:15:09: pid 126: CONTEXT: while checking replication time lag
2021-07-12 20:15:09: pid 126: LOG: Replication of node:1 is behind 43088 bytes from the primary server (node:2)
2021-07-12 20:15:09: pid 126: CONTEXT: while checking replication time lag
2021-07-12 20:15:09: pid 126: LOG: pgpool_worker_child: invalid node found 2
2021-07-12 20:15:09: pid 126: LOG: received degenerate backend request for node_id: 2 from pid [126]
2021-07-12 20:15:09: pid 9: LOG: Pgpool-II parent process has received failover request
2021-07-12 20:15:09: pid 9: LOG: starting degeneration. shutdown host dbod-ag-pg03.ch(6600)
2021-07-12 20:15:09: pid 9: WARNING: All the DB nodes are in down status and skip writing status file.
2021-07-12 20:15:09: pid 9: LOG: failover: no valid backend node found
2021-07-12 20:15:09: pid 9: LOG: Restart all children
2021-07-12 20:15:09: pid 9: LOG: execute command: /usr/local/bin/failover.sh 2 dbod-ag-pg03.ch 6600 /ORA/dbs03/AG_PG03/data -1 "" 2 2 "" "" dbod-ag-pg03.ch 6600
2021-07-12 20:15:09: pid 165: LOG: frontend disconnection: session time: 0:27:55.239 user=postgres database=postgres host=dbod-hac-c02.ch port=41958
2021-07-12 20:15:09: pid 9: LOG: find_primary_node_repeatedly: all of the backends are down. Giving up finding primary node
2021-07-12 20:15:09: pid 9: LOG: failover: no follow backends are degenerated
2021-07-12 20:15:09: pid 9: LOG: failover: set new primary node: -1
failover done. shutdown host dbod-ag-pg03.ch(6600)2021-07-12 20:15:09: pid 9: LOG: failover done. shutdown host dbod-ag-pg03.ch(6600)
2021-07-12 20:15:09: pid 126: LOG: worker process received restart request
2021-07-12 20:15:10: pid 200: LOG: restart request received in pcp child process
2021-07-12 20:18:05: pid 212: FATAL: pgpool is not accepting any new connections
2021-07-12 20:18:05: pid 212: DETAIL: all backend nodes are down, pgpool requires at least one valid node
2021-07-12 20:18:05: pid 212: HINT: repair the backend nodes and restart pgpool
2021-07-12 20:18:05: pid 212: LOG: frontend disconnection: session time: 0:00:00.003 user= database= host=
2021-07-12 20:23:25: pid 205: ERROR: executing recovery, execution of command failed at "1st stage"
2021-07-12 20:23:25: pid 205: DETAIL: command:"recovery_1st_stage"
2021-07-12 20:23:25: pid 9: LOG: PCP child 200 exits with status 0 in failover()
2021-07-12 20:23:25: pid 9: LOG: fork a new PCP child pid 240 in failover()
2021-07-12 20:23:25: pid 240: LOG: PCP process: 240 started


Relevant config vars:
failover_on_backend_error = off
detach_false_primary = on

health_check_period = 10
health_check_timeout = 5
health_check_max_retries = 5
health_check_retry_delay = 2
connect_timeout = 10000



Snippet of recovery_1st_stage where the process failed:
ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -t postgres@${DEST_NODE_HOST} "
set -o errexit
echo -e 'Starting online recovery\n' >> ${DEST_NODE_LOG}

# Disable Puppet
sudo /opt/puppetlabs/bin/puppet agent --disable 'Online recovery of instance: ${DEST_NODE_NAME,,}';
waiting=0
while [[ \${waiting} -lt 60 ]] && sudo [ -f /opt/puppetlabs/puppet/cache/state/agent_catalog_run.lock ];
 do waiting=\$((waiting+1)); sleep 10;
done;
if [[ \${waiting} -eq 60 ]]; then
 echo 'Waited too long. Exiting'; exit 2
fi


Failure Log:
recovery_1st_stage: start: Online Recovery for Standby node 0
INFO - Initialize kerberos session
Waited too long. Exiting
ERROR: recovery_1st_stage: end: pg_basebackup failed. online recovery failed


For Test 1, I would like to understand whether it is by design that pgpool's health check does not detect storage issues on a node. If pgpool checks the health of the PostgreSQL nodes with a simple ping/connection, maybe that check could also write something to a test database to make sure the instance is actually reachable and working? Or, at the least, we could add a timeout for queries running via pgpool, so that if something like this occurs the query terminates automatically after a few seconds. A rough sketch of both ideas follows.
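
The probe table, host, port and timeout values below are placeholders I made up, not anything pgpool provides today; this is just the kind of check I have in mind:

BACKEND_HOST=dbod-ag-pg03.ch   # placeholder: the backend whose storage is suspect
BACKEND_PORT=6600              # placeholder

# (a) write-based probe: a plain connection (which is what I assume the health
#     check does today) can still succeed, but an INSERT has to hit storage, so it
#     hangs or fails when the disk is gone; "healthcheck_probe" is a hypothetical
#     table created in advance just for this purpose
timeout 10 psql -h "$BACKEND_HOST" -p "$BACKEND_PORT" -U postgres -d postgres \
    -c "INSERT INTO healthcheck_probe (ts) VALUES (now());" \
  || echo "write probe failed: node should be treated as unhealthy"

# (b) server-side query timeout so client queries stuck on dead storage eventually
#     error out instead of hanging forever (30s is only an example value; this may
#     not fire if the backend is stuck in an uninterruptible I/O wait, which could
#     be exactly what happened to my \l above)
psql -h "$BACKEND_HOST" -p "$BACKEND_PORT" -U postgres -d postgres \
    -c "ALTER SYSTEM SET statement_timeout = '30s'" \
    -c "SELECT pg_reload_conf()"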

For Test 2, if this failover happened because of the failing online recovery, is there something I can do to prevent it in the future, such as a configuration change? The only workaround I can think of is sketched below.
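
Based on the "pgpool_worker_child: invalid node found 2" line, my guess is that the streaming-replication check behind detach_false_primary is what degenerated the primary, so the sketch below temporarily turns that parameter off around the recovery. It assumes detach_false_primary can be changed with a reload, and the config path, PCP port and node id are only illustrative:

# turn off false-primary detection before the recovery, then reload pgpool
sed -i "s/^detach_false_primary = on/detach_false_primary = off/" /etc/pgpool-II/pgpool.conf
pgpool -f /etc/pgpool-II/pgpool.conf reload

# run the online recovery of standby node 0
pcp_recovery_node -h localhost -p 9898 -U pgpool -W -n 0

# turn it back on afterwards
sed -i "s/^detach_false_primary = off/detach_false_primary = on/" /etc/pgpool-II/pgpool.conf
pgpool -f /etc/pgpool-II/pgpool.conf reload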

Thanks in anticipation!

Anirudh