[pgpool-general: 8441] Re: pgpool failing over on bad queries

Tatsuo Ishii ishii at sraoss.co.jp
Wed Oct 5 20:12:17 JST 2022


> Yesterday evening pgpool took all of my backends offline for SSL "non-errors". To get them back online I stopped pgpool on all nodes, then removed /tmp/pgpool_status from all nodes, then restarted pgpool on all nodes and everything was fine. Is this expected?

Yes.
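
For reference, the procedure you used is the usual way to clear a
cached DOWN status. A minimal sketch (the service name and the
pgpool_status location depend on how pgpool was installed; adjust for
your setup):

  # on every pgpool node
  systemctl stop pgpool2          # or however pgpool is started on your hosts
  rm -f /tmp/pgpool_status        # pgpool_status lives under logdir in pgpool.conf
  systemctl start pgpool2

Removing pgpool_status means pgpool does not restore the previous
(down) backend status at startup and probes the backends afresh.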

> After that challenge I decided to rethink my SSL setup, at least until pgpool has a release that patches these non-errors.
> 
> My pool_hba.conf has this statement, allowing connections to all databases from a particular subnet. The authentication mechanism is "password" so that I don't need to keep a separate file on the pgpool nodes containing the database user passwords. I'm using "hostssl" to keep the password from going in the clear. 
> hostssl    all         all           192.1.213.0/24          password
> 
> My pg_hba.conf file looks like below. This allows connections only from the pgpool frontends, without SSL and using scram-sha-256. I figure that since pgpool has the username and password in the clear, it can handle the SCRAM challenge-response, and passwords don't go in the clear between the pgpool frontends and the PostgreSQL backends. Does this make sense?

Yes.
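
If you want to double-check the client leg, an SSL connection through
pgpool should succeed, while a non-SSL attempt from the same subnet
should be refused, since hostssl is the only entry shown for it. A
quick sketch (assuming pgpool listens on its default port 9999;
adjust host/user/database for your setup):

  psql "host=192.1.213.41 port=9999 dbname=postgres user=app sslmode=require" -c "SELECT 1"   # should work
  psql "host=192.1.213.41 port=9999 dbname=postgres user=app sslmode=disable" -c "SELECT 1"   # no matching pool_hba.conf entry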

> host replication all 192.1.213.41/32 scram-sha-256
> host db_check db_check 192.1.213.41/32 scram-sha-256
> host all all 192.1.213.41/32 scram-sha-256
> host replication all 192.1.213.42/32 scram-sha-256
> host db_check db_check 192.1.213.42/32 scram-sha-256
> host all all 192.1.213.42/32 scram-sha-256
> host replication all 192.1.213.43/32 scram-sha-256
> host db_check db_check 192.1.213.43/32 scram-sha-256
> host all all 192.1.213.43/32 scram-sha-256
> 
> I also set "ssl = off" in my postgresql.conf file.
> 
> With this setup am I correct that pgpool will be handling the scram challenge response? Or is the challenge sent all the way back to the client? I ask because after making this change I found that some of my client libraries needed to be upgraded for connections to work.
> 
> Given these changes I didn't expect any more issues with my secondary nodes going offline. However, that was wrong: this morning I was greeted by a message that my secondary nodes had gone offline. I started digging into the pgpool log file on one of the secondary nodes that went offline and found the following.
> 
> 
> I see lots of messages like this; is this normal?

No. "remaining connection slots are reserved for non-replication
superuser connections" is a complain from PostgreSQL.  See your
postgresql.conf to fix this. My guess is max_connections is not large
enough.
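
As a rule of thumb, the backends need enough slots for everything
pgpool can open plus the reserved superuser slots, roughly:

  num_init_children * max_pool <= max_connections - superuser_reserved_connections

A sketch of the settings involved (the numbers are only examples;
size them for your workload):

  # postgresql.conf on each backend
  max_connections = 300                  # raise so the formula above holds
  superuser_reserved_connections = 3     # default; these are the reserved slots in the error

  # pgpool.conf
  num_init_children = 32
  max_pool = 4

You can check the current values on a backend with
"SHOW max_connections;" and "SHOW superuser_reserved_connections;".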

> 2022-09-28 00:01:49.306: main pid 1449554: LOG:  child process with pid: 1528566 exits with status 256
> 
> Some also have status of "0" and some have "SUCCESS".
> 
> I also see lots of these messages. Should I be changing my log level to not see them?
> 2022-09-28 00:14:34.815: watchdog pid 1449578: LOG:  new IPC connection received
> 
> 
> What would cause this error?
> 2022-09-28 10:23:02.560: [unknown] pid 1575894: LOG:  pool_read_kind: error message from main backend:remaining connection slots are reserved for non-replication superuser connections
> 2022-09-28 10:23:02.560: [unknown] pid 1575894: ERROR:  unable to read message kind
> 2022-09-28 10:23:02.561: [unknown] pid 1575894: DETAIL:  kind does not match between main(45) slot[1] (53)
> 2022-09-28 10:23:04.908: health_check0 pid 1449995: LOG:  health check retrying on DB node: 0 (round:1)
> 2022-09-28 10:23:05.530: watchdog pid 1449578: LOG:  new IPC connection received
> 2022-09-28 10:23:05.561: sr_check_worker pid 1504222: FATAL:  unable to read data from DB node 0
> 2022-09-28 10:23:05.562: sr_check_worker pid 1504222: DETAIL:  EOF encountered with backend
> 2022-09-28 10:23:05.565: main pid 1449554: LOG:  worker child process with pid: 1504222 exits with status 256
> 2022-09-28 10:23:05.566: main pid 1449554: LOG:  fork a new worker child process with pid: 1577472
> 2022-09-28 10:23:05.566: sr_check_worker pid 1577472: LOG:  process started
> 2022-09-28 10:23:05.567: watchdog pid 1449578: LOG:  new IPC connection received
> 2022-09-28 10:23:05.598: sr_check_worker pid 1577472: FATAL:  unable to read data from DB node 0
> 2022-09-28 10:23:05.598: sr_check_worker pid 1577472: DETAIL:  EOF encountered with backend
> 2022-09-28 10:23:05.601: main pid 1449554: LOG:  worker child process with pid: 1577472 exits with status 256
> 2022-09-28 10:23:05.601: main pid 1449554: LOG:  fork a new worker child process with pid: 1577473
> 2022-09-28 10:23:05.602: sr_check_worker pid 1577473: LOG:  process started
> 2022-09-28 10:23:05.602: watchdog pid 1449578: LOG:  new IPC connection received
> 2022-09-28 10:23:05.631: sr_check_worker pid 1577473: FATAL:  unable to read data from DB node 0
> 2022-09-28 10:23:05.631: sr_check_worker pid 1577473: DETAIL:  EOF encountered with backend
> 2022-09-28 10:23:05.634: main pid 1449554: LOG:  worker child process with pid: 1577473 exits with status 256
> 2022-09-28 10:23:05.634: main pid 1449554: LOG:  fork a new worker child process with pid: 1577474
> 2022-09-28 10:23:05.635: sr_check_worker pid 1577474: LOG:  process started
> 2022-09-28 10:23:05.635: watchdog pid 1449578: LOG:  new IPC connection received
> 2022-09-28 10:23:05.665: sr_check_worker pid 1577474: FATAL:  unable to read data from DB node 0
> 2022-09-28 10:23:05.665: sr_check_worker pid 1577474: DETAIL:  EOF encountered with backend
> 2022-09-28 10:23:05.668: main pid 1449554: LOG:  worker child process with pid: 1577474 exits with status 256
> 2022-09-28 10:23:05.668: main pid 1449554: LOG:  fork a new worker child process with pid: 1577475
> 2022-09-28 10:23:05.669: sr_check_worker pid 1577475: LOG:  process started
> 2022-09-28 10:23:05.669: watchdog pid 1449578: LOG:  new IPC connection received
> 
> 
> I'm also not sure what would have caused this error:

No idea.

> 2022-09-28 10:38:23.069: main pid 1449554: LOG:  fork a new child process with pid: 1579534
> + '[' 0 -eq 0 ']'
> + /usr/bin/pcp_attach_node -w -h localhost -U pgpool -p 9898 -n 2
> 2022-09-28 10:38:23.137: child pid 1579532: ERROR:  failed while reading startup packet
> 2022-09-28 10:38:23.137: child pid 1579532: DETAIL:  incorrect packet length (1291845628)
> ERROR: unable to read data from socket.
> + '[' 1 -ne 0 ']'
> + echo ERROR: follow_primary.sh: end: pcp_attach_node failed
> ERROR: follow_primary.sh: end: pcp_attach_node failed
> + exit 1
> 
> I went to look at the postgresql logs and realized that they weren't available anymore due to the recovery process rewriting the data directory.
> 
> 
> I was able to successfully execute pcp_recovery_node for both secondary nodes and the pool is back online with all nodes.
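
For reference, pcp_recovery_node takes the same connection options as
the pcp_attach_node call shown in your follow_primary.sh output, so
the invocation is along the lines of (node ids are examples for your
two standbys):

  pcp_recovery_node -w -h localhost -U pgpool -p 9898 -n 1
  pcp_recovery_node -w -h localhost -U pgpool -p 9898 -n 2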
> 
> Additional question:
> If I were to disable the health check "health_check_period=0", how does pgpool determine that a backend is down?

I already answered that in another email.

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese: http://www.sraoss.co.jp

