[pgpool-general: 8372] Re: pgpool failing over on bad queries

Sat Sep 3 05:49:52 JST 2022

>> I'm finding that I have a client sending a bunch of bad queries to pgpool. After a large number of these errors pgpool appears to decide that the primary backend is in a bad state and fails over to another backend. Why is pgpool doing this and how do I prevent it?
>
>Syntax error itself does not trigger failover. There must be other
>reason which triggers the failover. To find the cause, I need to
>inspect the pgpool log file.
>
What should I look for in the pgpool log file?

I have logs of log messages "new IPC connection received", those appear normal.
I also see "child process with pid ... exits with status 256" and "fork a new child process..."

I see the following on psql-03 when the primary node changed:

2022-08-31 01:11:57.819: watchdog pid 971756: LOG:  watchdog received the failover command from remote pgpool-II node "psql-01:9898 Linux psql-01"
2022-08-31 01:11:57.819: watchdog pid 971756: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from psql-01:9898 Linux psql-01
2022-08-31 01:11:57.819: watchdog pid 971756: LOG:  failover requires the majority vote, waiting for consensus
2022-08-31 01:11:57.819: watchdog pid 971756: DETAIL:  failover request noted
2022-08-31 01:11:57.819: watchdog pid 971756: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "psql-01:9898 Linux psql-01" is queued, waiting for the confirmation from other nodes
2022-08-31 01:11:57.819: watchdog pid 971756: LOG:  signal_user1_to_parent_with_reason(4)
2022-08-31 01:11:57.819: main pid 971732: LOG:  Pgpool-II parent process received SIGUSR1
2022-08-31 01:11:57.819: main pid 971732: LOG:  Pgpool-II parent process received inform quarantine nodes signal from watchdog
2022-08-31 01:12:04.923: watchdog pid 971756: LOG:  new IPC connection received
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  watchdog received the failover command from remote pgpool-II node "psql-02:9898 Linux psql-02"
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from psql-02:9898 Linux psql-02
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  we have got the consensus to perform the failover
2022-08-31 01:12:07.919: watchdog pid 971756: DETAIL:  2 node(s) voted in the favor
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  received degenerate backend request for node_id: 0 from pid [971756]
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  signal_user1_to_parent_with_reason(0)
2022-08-31 01:12:07.919: main pid 971732: LOG:  Pgpool-II parent process received SIGUSR1
2022-08-31 01:12:07.919: main pid 971732: LOG:  Pgpool-II parent process has received failover request
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  new IPC connection received
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  received the failover indication from Pgpool-II on IPC interface
2022-08-31 01:12:07.919: watchdog pid 971756: LOG:  watchdog is informed of failover start by the main process
2022-08-31 01:12:07.920: main pid 971732: LOG:  starting degeneration. shutdown host psql-01(5432)
2022-08-31 01:12:07.922: main pid 971732: LOG:  Restart all children
2022-08-31 01:12:07.923: main pid 971732: LOG:  execute command: /etc/pgpool-II/failover.sh 0 psql-01 5432 /var/lib/pgsql/14/data 1 psql-02 0 0 5432 /var/lib/pgsql/14/data psql-01 5432

On psql-01 I see
2022-08-31 01:11:57.818: sr_check_worker pid 2111409: LOG:  pool_ssl: "SSL_read": "no SSL error reported"
2022-08-31 01:11:57.818: sr_check_worker pid 2111409: LOG:  received degenerate backend request for node_id: 0 from pid [2111409]
2022-08-31 01:11:57.818: watchdog pid 2080471: LOG:  new IPC connection received
2022-08-31 01:11:57.818: watchdog pid 2080471: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "psql-03:9898 Linux psql-03"
2022-08-31 01:11:57.818: watchdog pid 2080471: DETAIL:  waiting for the reply...
2022-08-31 01:11:57.819: watchdog pid 2080471: LOG:  remote node "psql-03:9898 Linux psql-03" is asking to inform about quarantined backend nodes
2022-08-31 01:11:57.819: watchdog pid 2080471: LOG:  signal_user1_to_parent_with_reason(4)
2022-08-31 01:11:57.819: sr_check_worker pid 2111409: LOG:  degenerate backend request for node_id: 0 from pid [2111409], will be handled by watchdog, which is building consensus for request
2022-08-31 01:11:57.819: main pid 2080446: LOG:  Pgpool-II parent process received SIGUSR1
2022-08-31 01:11:57.819: main pid 2080446: LOG:  Pgpool-II parent process received inform quarantine nodes signal from watchdog
2022-08-31 01:12:07.884: sr_check_worker pid 2111409: ERROR:  Failed to check replication time lag
2022-08-31 01:12:07.884: sr_check_worker pid 2111409: DETAIL:  No persistent db connection for the node 0
2022-08-31 01:12:07.884: sr_check_worker pid 2111409: HINT:  check sr_check_user and sr_check_password
2022-08-31 01:12:07.884: sr_check_worker pid 2111409: CONTEXT:  while checking replication time lag
2022-08-31 01:12:07.884: watchdog pid 2080471: LOG:  new IPC connection received
2022-08-31 01:12:09.642: watchdog pid 2080471: LOG:  processing follow primary looking[acquire] request from remote node "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.733: watchdog pid 2080471: LOG:  signal_user1_to_parent_with_reason(2)
2022-08-31 01:12:09.733: main pid 2080446: LOG:  Pgpool-II parent process received SIGUSR1
2022-08-31 01:12:09.733: main pid 2080446: LOG:  Pgpool-II parent process received sync backend signal from watchdog
2022-08-31 01:12:09.734: watchdog pid 2080471: LOG:  new IPC connection received
2022-08-31 01:12:09.734: main pid 2080446: LOG:  leader watchdog has performed failover
2022-08-31 01:12:09.734: main pid 2080446: DETAIL:  syncing the backend states from the LEADER watchdog node
2022-08-31 01:12:09.734: watchdog pid 2080471: LOG:  new IPC connection received
2022-08-31 01:12:09.734: watchdog pid 2080471: LOG:  received the get data request from local pgpool-II on IPC interface
2022-08-31 01:12:09.734: watchdog pid 2080471: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: watchdog pid 2080471: DETAIL:  waiting for the reply...
2022-08-31 01:12:09.735: main pid 2080446: LOG:  leader watchdog node "psql-03:9898 Linux psql-03" returned status for 3 backend nodes
2022-08-31 01:12:09.735: main pid 2080446: LOG:  backend:0 is set to down status
2022-08-31 01:12:09.735: main pid 2080446: DETAIL:  backend:0 is DOWN on cluster leader "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: main pid 2080446: LOG:  backend:2 is set to down status
2022-08-31 01:12:09.735: main pid 2080446: DETAIL:  backend:2 is DOWN on cluster leader "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: main pid 2080446: LOG:  primary node:1 on leader watchdog node "psql-03:9898 Linux psql-03" is different from local primary node:0
2022-08-31 01:12:09.735: main pid 2080446: LOG:  primary node was changed after the sync from "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: main pid 2080446: DETAIL:  all children needs to be restarted
2022-08-31 01:12:09.857: sr_check_worker pid 2111409: LOG:  worker process received restart request

On psql-02 I see
2022-08-31 01:11:57.819: watchdog pid 1187187: LOG:  remote node "psql-03:9898 Linux psql-03" is asking to inform about quarantined backend nodes
2022-08-31 01:11:57.819: watchdog pid 1187187: LOG:  signal_user1_to_parent_with_reason(4)
2022-08-31 01:11:57.819: main pid 1187162: LOG:  Pgpool-II parent process received SIGUSR1
2022-08-31 01:11:57.819: main pid 1187162: LOG:  Pgpool-II parent process received inform quarantine nodes signal from watchdog
2022-08-31 01:12:07.885: watchdog pid 1187187: LOG:  new IPC connection received
2022-08-31 01:12:07.918: sr_check_worker pid 2277999: LOG:  pool_ssl: "SSL_read": "no SSL error reported"
2022-08-31 01:12:07.918: sr_check_worker pid 2277999: LOG:  received degenerate backend request for node_id: 0 from pid [2277999]
2022-08-31 01:12:07.918: watchdog pid 1187187: LOG:  new IPC connection received
2022-08-31 01:12:07.918: watchdog pid 1187187: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "psql-03:9898 Linux psql-03"
2022-08-31 01:12:07.918: watchdog pid 1187187: DETAIL:  waiting for the reply...
2022-08-31 01:12:07.919: sr_check_worker pid 2277999: LOG:  degenerate backend request for 1 node(s) from pid [2277999], will be handled by watchdog
2022-08-31 01:12:09.642: watchdog pid 1187187: LOG:  processing follow primary looking[acquire] request from remote node "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.733: watchdog pid 1187187: LOG:  signal_user1_to_parent_with_reason(2)
2022-08-31 01:12:09.733: main pid 1187162: LOG:  Pgpool-II parent process received SIGUSR1
2022-08-31 01:12:09.734: main pid 1187162: LOG:  Pgpool-II parent process received sync backend signal from watchdog
2022-08-31 01:12:09.734: watchdog pid 1187187: LOG:  new IPC connection received
2022-08-31 01:12:09.734: main pid 1187162: LOG:  leader watchdog has performed failover
2022-08-31 01:12:09.734: main pid 1187162: DETAIL:  syncing the backend states from the LEADER watchdog node
2022-08-31 01:12:09.734: watchdog pid 1187187: LOG:  new IPC connection received
2022-08-31 01:12:09.734: watchdog pid 1187187: LOG:  received the get data request from local pgpool-II on IPC interface
2022-08-31 01:12:09.734: watchdog pid 1187187: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.734: watchdog pid 1187187: DETAIL:  waiting for the reply...
2022-08-31 01:12:09.735: main pid 1187162: LOG:  leader watchdog node "psql-03:9898 Linux psql-03" returned status for 3 backend nodes
2022-08-31 01:12:09.735: main pid 1187162: LOG:  backend:0 is set to down status
2022-08-31 01:12:09.735: main pid 1187162: DETAIL:  backend:0 is DOWN on cluster leader "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: main pid 1187162: LOG:  backend:2 is set to down status
2022-08-31 01:12:09.735: main pid 1187162: DETAIL:  backend:2 is DOWN on cluster leader "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: main pid 1187162: LOG:  primary node:1 on leader watchdog node "psql-03:9898 Linux psql-03" is different from local primary node:0
2022-08-31 01:12:09.735: main pid 1187162: LOG:  primary node was changed after the sync from "psql-03:9898 Linux psql-03"
2022-08-31 01:12:09.735: main pid 1187162: DETAIL:  all children needs to be restarted
2022-08-31 01:12:09.872: sr_check_worker pid 2277999: ERROR:  Failed to check replication time lag
2022-08-31 01:12:09.872: sr_check_worker pid 2277999: DETAIL:  No persistent db connection for the node 0
2022-08-31 01:12:09.873: sr_check_worker pid 2277999: HINT:  check sr_check_user and sr_check_password
2022-08-31 01:12:09.873: sr_check_worker pid 2277999: CONTEXT:  while checking replication time lag
2022-08-31 01:12:09.873: sr_check_worker pid 2277999: LOG:  worker process received restart request