[pgpool-general: 8430] Re: pgpool failing over on bad queries

Jon SCHEWE jon.schewe at raytheon.com
Wed Sep 28 23:59:07 JST 2022


Yesterday evening pgpool took all of my backends offline for SSL "non-errors". To get them back online I stopped pgpool on all nodes, then removed /tmp/pgpool_status from all nodes, then restarted pgpool on all nodes and everything was fine. Is this expected?


After that challenge I decided to rethink my SSL setup, at least until pgpool has a release that patches these non-errors.

My pool_hba.conf has this statement, allowing connections to all databases from a particular subnet. The authentication mechanism is "password" so that I don't need to keep a separate file on the pgpool nodes containing the database user passwords. I'm using "hostssl" to keep the password from going in the clear. 
hostssl    all         all           192.1.213.0/24          password
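
For reference, "hostssl" only matches when the client reaches pgpool over SSL, so this line relies on SSL being enabled on the pgpool side. That part of my pgpool.conf looks roughly like the following (the paths are placeholders, not my real ones):

ssl = on
ssl_key = '/path/to/server.key'
ssl_cert = '/path/to/server.crt'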

My pg_hba.conf file is shown below. It allows connections only from the pgpool frontends, without SSL and using scram-sha-256. I figure that since pgpool has the username and password in the clear, it can handle the SCRAM challenge-response itself, so no passwords go in the clear between the pgpool frontends and the PostgreSQL backends. Does this make sense?

host replication all 192.1.213.41/32 scram-sha-256
host db_check db_check 192.1.213.41/32 scram-sha-256
host all all 192.1.213.41/32 scram-sha-256
host replication all 192.1.213.42/32 scram-sha-256
host db_check db_check 192.1.213.42/32 scram-sha-256
host all all 192.1.213.42/32 scram-sha-256
host replication all 192.1.213.43/32 scram-sha-256
host db_check db_check 192.1.213.43/32 scram-sha-256
host all all 192.1.213.43/32 scram-sha-256

I also set "ssl = off" in my postgresql.conf file.

With this setup, am I correct that pgpool will handle the SCRAM challenge-response? Or is the challenge sent all the way back to the client? I ask because after making this change I found that some of my client libraries needed to be upgraded for connections to work.
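
A quick way to sanity-check the backend side of this, independent of pgpool, is to connect directly from one of the pgpool hosts (the host below is a placeholder, and db_check is just the monitoring role from my pg_hba.conf):

psql "host=<backend-host> dbname=db_check user=db_check sslmode=disable" -c 'SELECT 1'

If that succeeds, the scram-sha-256 rules work without pgpool in the middle, which narrows the question down to what pgpool forwards to the client.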

Given these changes I didn't expect any more issues with my secondary nodes going offline. However, that was wrong: this morning I was greeted by a message that my secondary nodes had gone offline. I started digging into the pgpool log file on one of the secondary nodes and found the following.


I see lots of messages like this; is this normal?
2022-09-28 00:01:49.306: main pid 1449554: LOG:  child process with pid: 1528566 exits with status 256

Some also have a status of "0" and some have "SUCCESS".
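
If I understand the message correctly, the status shown is the raw wait(2) value, so the exit code is in the high byte; decoded that way, 256 is just an exit code of 1, while "0"/"SUCCESS" are clean exits:

echo $(( 256 >> 8 ))    # prints 1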

I also see lots of these messages. Should I be changing my log level to not see them?
2022-09-28 00:14:34.815: watchdog pid 1449578: LOG:  new IPC connection received


What would cause this error?
2022-09-28 10:23:02.560: [unknown] pid 1575894: LOG:  pool_read_kind: error message from main backend:remaining connection slots are reserved for non-replication superuser connections
2022-09-28 10:23:02.560: [unknown] pid 1575894: ERROR:  unable to read message kind
2022-09-28 10:23:02.561: [unknown] pid 1575894: DETAIL:  kind does not match between main(45) slot[1] (53)
2022-09-28 10:23:04.908: health_check0 pid 1449995: LOG:  health check retrying on DB node: 0 (round:1)
2022-09-28 10:23:05.530: watchdog pid 1449578: LOG:  new IPC connection received
2022-09-28 10:23:05.561: sr_check_worker pid 1504222: FATAL:  unable to read data from DB node 0
2022-09-28 10:23:05.562: sr_check_worker pid 1504222: DETAIL:  EOF encountered with backend
2022-09-28 10:23:05.565: main pid 1449554: LOG:  worker child process with pid: 1504222 exits with status 256
2022-09-28 10:23:05.566: main pid 1449554: LOG:  fork a new worker child process with pid: 1577472
2022-09-28 10:23:05.566: sr_check_worker pid 1577472: LOG:  process started
2022-09-28 10:23:05.567: watchdog pid 1449578: LOG:  new IPC connection received
2022-09-28 10:23:05.598: sr_check_worker pid 1577472: FATAL:  unable to read data from DB node 0
2022-09-28 10:23:05.598: sr_check_worker pid 1577472: DETAIL:  EOF encountered with backend
2022-09-28 10:23:05.601: main pid 1449554: LOG:  worker child process with pid: 1577472 exits with status 256
2022-09-28 10:23:05.601: main pid 1449554: LOG:  fork a new worker child process with pid: 1577473
2022-09-28 10:23:05.602: sr_check_worker pid 1577473: LOG:  process started
2022-09-28 10:23:05.602: watchdog pid 1449578: LOG:  new IPC connection received
2022-09-28 10:23:05.631: sr_check_worker pid 1577473: FATAL:  unable to read data from DB node 0
2022-09-28 10:23:05.631: sr_check_worker pid 1577473: DETAIL:  EOF encountered with backend
2022-09-28 10:23:05.634: main pid 1449554: LOG:  worker child process with pid: 1577473 exits with status 256
2022-09-28 10:23:05.634: main pid 1449554: LOG:  fork a new worker child process with pid: 1577474
2022-09-28 10:23:05.635: sr_check_worker pid 1577474: LOG:  process started
2022-09-28 10:23:05.635: watchdog pid 1449578: LOG:  new IPC connection received
2022-09-28 10:23:05.665: sr_check_worker pid 1577474: FATAL:  unable to read data from DB node 0
2022-09-28 10:23:05.665: sr_check_worker pid 1577474: DETAIL:  EOF encountered with backend
2022-09-28 10:23:05.668: main pid 1449554: LOG:  worker child process with pid: 1577474 exits with status 256
2022-09-28 10:23:05.668: main pid 1449554: LOG:  fork a new worker child process with pid: 1577475
2022-09-28 10:23:05.669: sr_check_worker pid 1577475: LOG:  process started
2022-09-28 10:23:05.669: watchdog pid 1449578: LOG:  new IPC connection received
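
The "remaining connection slots are reserved" message comes from PostgreSQL itself when max_connections is exhausted, so my working theory is that pgpool's worst-case connection count (plus the health check and sr_check connections) exceeded the backend's limit. If I read the pgpool docs right, the sizing rule is roughly num_init_children * max_pool <= max_connections - superuser_reserved_connections; with made-up numbers for illustration:

num_init_children = 32             # pgpool.conf
max_pool = 4                       # worst case 32 * 4 = 128 backend connections
max_connections = 200              # postgresql.conf, must also cover the check connections
superuser_reserved_connections = 3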


I'm also not sure what would have caused this error:

2022-09-28 10:38:23.069: main pid 1449554: LOG:  fork a new child process with pid: 1579534
+ '[' 0 -eq 0 ']'
+ /usr/bin/pcp_attach_node -w -h localhost -U pgpool -p 9898 -n 2
2022-09-28 10:38:23.137: child pid 1579532: ERROR:  failed while reading startup packet
2022-09-28 10:38:23.137: child pid 1579532: DETAIL:  incorrect packet length (1291845628)
ERROR: unable to read data from socket.
+ '[' 1 -ne 0 ']'
+ echo ERROR: follow_primary.sh: end: pcp_attach_node failed
ERROR: follow_primary.sh: end: pcp_attach_node failed
+ exit 1

I went to look at the PostgreSQL logs and realized that they were no longer available because the recovery process had rewritten the data directory.


I was able to successfully execute pcp_recovery_node for both secondary nodes and the pool is back online with all nodes.
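
The invocations were something like the following for each detached node, reusing the pcp options follow_primary.sh already uses (node id 2 shown as an example):

/usr/bin/pcp_recovery_node -w -h localhost -U pgpool -p 9898 -n 2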

Additional question:
If I were to disable the health check ("health_check_period = 0"), how would pgpool determine that a backend is down?
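
The parameters I have in mind are below; my guess is that with the periodic health check disabled, pgpool would only notice a dead backend when a live connection to it fails, which I assume is what failover_on_backend_error governs, but I'd like confirmation:

health_check_period = 0            # disables the periodic health check
failover_on_backend_error = on     # failover when pgpool hits a read/write error on a backend socket?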

