[pgpool-general: 7599] Failover of a STANDBY backend during replication lag

Láznička Vladimír Vladimir.Laznicka at cca.cz
Tue Jun 29 16:28:48 JST 2021


Hello,

We have recently encountered a problematic situation in our production environment with the following configuration:

1) SVLPGP1PP - MASTER pgPool node (version 4.1.5), configured with Watchdog.
2) SVLPGP2PP - STANDBY pgPool node (version 4.1.5), configured with Watchdog.
3) SVLDBD1PP - PRIMARY PostgreSQL node (version 11.7), configured for streaming replication of WAL to the STANDBY node.
4) SVLDBD2PP - HOT STANDBY PostgreSQL node (version 11.7), configured for receiving WAL from the PRIMARY via streaming replication, used for HA and load-balancing.
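
For reference, the backend definitions in our pgpool.conf look roughly like this (a trimmed sketch only; the addresses and the data directory match the failover command logged below, while the weights are illustrative values - the full file is attached):

backend_hostname0 = '10.5.5.13'                       # SVLDBD1PP - PRIMARY
backend_port0 = 5432
backend_weight0 = 1                                   # illustrative value
backend_data_directory0 = '/postgres/pgsql/11/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_hostname1 = '10.5.5.14'                       # SVLDBD2PP - HOT STANDBY
backend_port1 = 5432
backend_weight1 = 1                                   # illustrative value
backend_data_directory1 = '/postgres/pgsql/11/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'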

During normal operation, one session apparently started a SELECT on the STANDBY PostgreSQL node; in the meantime, another session started changing the same data on the PRIMARY node, which resulted in the following on the STANDBY:

Jun 27 01:46:46 SVLDBD2PP postgres[4581]: [8] 2021-06-27 01:46:46.816 CEST ::: APP - PostgreSQL JDBC Driver :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: HOST - 10.5.5.11 :H: CMD - idle in transaction :C: |FATAL:  terminating connection due to conflict with recovery
2021-06-27 01:46:46.816 CEST ::: APP - PostgreSQL JDBC Driver :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: HOST - 10.5.5.11 :H: CMD - idle in transaction :C: |DETAIL:  User query might have needed to see row versions that must be removed.
2021-06-27 01:46:46.816 CEST ::: APP - PostgreSQL JDBC Driver :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: HOST - 10.5.5.11 :H: CMD - idle in transaction :C: |HINT:  In a moment you should be able to reconnect to the database and repeat your command.
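
If I understand the message correctly, the startup process on the STANDBY cancelled the session because replaying WAL from the PRIMARY conflicted with a snapshot that the idle-in-transaction session was still holding. How long the STANDBY waits before cancelling such sessions is controlled on the PostgreSQL side; a sketch with the stock defaults (I am not stating our actual values here):

max_standby_streaming_delay = 30s   # how long WAL replay waits for conflicting queries before cancelling them
max_standby_archive_delay = 30s     # the same limit, for WAL restored from an archive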

On the pgPool side this was apparently taken as a failure of the SVLDBD2PP node, which was then detached from the cluster:

Jun 27 01:46:18 SVLPGP1PP pgpool[15246]: [1982-1] 2021-06-27 01:46:18 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Replication of node:1 is behind 187064 bytes from the primary server (node:0)
Jun 27 01:46:18 SVLPGP1PP pgpool[15246]: [1982-2] 2021-06-27 01:46:18 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |CONTEXT:  while checking replication time lag
Jun 27 01:46:23 SVLPGP1PP pgpool[15246]: [1983-1] 2021-06-27 01:46:23 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Replication of node:1 is behind 1888264 bytes from the primary server (node:0)
Jun 27 01:46:23 SVLPGP1PP pgpool[15246]: [1983-2] 2021-06-27 01:46:23 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |CONTEXT:  while checking replication time lag
Jun 27 01:46:28 SVLPGP1PP pgpool[15246]: [1984-1] 2021-06-27 01:46:28 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Replication of node:1 is behind 3890888 bytes from the primary server (node:0)
Jun 27 01:46:28 SVLPGP1PP pgpool[15246]: [1984-2] 2021-06-27 01:46:28 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |CONTEXT:  while checking replication time lag
Jun 27 01:46:33 SVLPGP1PP pgpool[15246]: [1985-1] 2021-06-27 01:46:33 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Replication of node:1 is behind 5832176 bytes from the primary server (node:0)
Jun 27 01:46:33 SVLPGP1PP pgpool[15246]: [1985-2] 2021-06-27 01:46:33 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |CONTEXT:  while checking replication time lag
Jun 27 01:46:38 SVLPGP1PP pgpool[15246]: [1986-1] 2021-06-27 01:46:38 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Replication of node:1 is behind 732568 bytes from the primary server (node:0)
Jun 27 01:46:38 SVLPGP1PP pgpool[15246]: [1986-2] 2021-06-27 01:46:38 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |CONTEXT:  while checking replication time lag
Jun 27 01:46:43 SVLPGP1PP pgpool[15246]: [1987-1] 2021-06-27 01:46:43 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Replication of node:1 is behind 1685264 bytes from the primary server (node:0)
Jun 27 01:46:43 SVLPGP1PP pgpool[15246]: [1987-2] 2021-06-27 01:46:43 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |CONTEXT:  while checking replication time lag
Jun 27 01:46:46 SVLPGP1PP pgpool[9339]: [5262-1] 2021-06-27 01:46:46 ::: APP - [unknown] :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: |ERROR:  unable to forward message to frontend
Jun 27 01:46:46 SVLPGP1PP pgpool[9339]: [5262-2] 2021-06-27 01:46:46 ::: APP - [unknown] :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: |DETAIL:  FATAL error occured on backend
Jun 27 01:46:46 SVLPGP1PP pgpool[9339]: [5263-1] 2021-06-27 01:46:46 ::: APP - [unknown] :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: |LOG:  received degenerate backend request for node_id: 1 from pid [9339]
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2784-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  new IPC connection received
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2785-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  watchdog received the failover command from local pgpool-II on IPC interface
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2786-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2787-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  we have got the consensus to perform the failover
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2787-2] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |DETAIL:  1 node(s) voted in the favor
Jun 27 01:46:46 SVLPGP1PP pgpool[9339]: [5264-1] 2021-06-27 01:46:46 ::: APP - [unknown] :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: |WARNING:  write on backend 1 failed with error :"Success"
Jun 27 01:46:46 SVLPGP1PP pgpool[9339]: [5264-2] 2021-06-27 01:46:46 ::: APP - [unknown] :A: DB - aisgdpp :D: USER - aisg_backend_apl :U: |DETAIL:  while trying to write data from offset: 0 wlen: 5
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11838-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Pgpool-II parent process has received failover request
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2788-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  new IPC connection received
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2789-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  received the failover indication from Pgpool-II on IPC interface
Jun 27 01:46:46 SVLPGP1PP pgpool[3885]: [2790-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  watchdog is informed of failover start by the main process
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11839-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  starting degeneration. shutdown host 10.5.5.14(5432)
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11840-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  Do not restart children because we are switching over node id 1 host: 10.5.5.14 port: 5432 and we are in streaming replication mode
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11841-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 14561 needs to restart because pool 0 uses backend 1
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11842-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 21415 needs to restart because pool 0 uses backend 1
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11843-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 14173 needs to restart because pool 0 uses backend 1
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11844-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 31996 needs to restart because pool 0 uses backend 1
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11845-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 17116 needs to restart because pool 0 uses backend 1
Jun 27 01:46:46 SVLPGP1PP pgpool[3883]: [11846-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 874 needs to restart because pool 0 uses backend 1
Jun 27 01:46:47 SVLPGP1PP pgpool[3883]: [11847-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 9339 needs to restart because pool 0 uses backend 1
Jun 27 01:46:47 SVLPGP1PP pgpool[3883]: [11848-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  child pid 15191 needs to restart because pool 0 uses backend 1
Jun 27 01:46:47 SVLPGP1PP pgpool[3883]: [11849-1] 2021-06-27 01:46:46 ::: APP - [No Connection] :A: DB - [No Connection] :D: USER - [No Connection] :U: |LOG:  execute command: /etc/pgpool-II/failover.sh 1 10.5.5.14 5432 /postgres/pgsql/11/data 0 10.5.5.13 0 0 5432 /postgres/pgsql/11/data 10.5.5.13 5432
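
For context, the "Replication of node:1 is behind ... bytes" messages above come from pgPool's streaming replication check, which logs the lag it measured against delay_threshold. Judging by the 5-second intervals, the relevant settings are along these lines (a sketch; the exact values are in the attached pgpool.conf):

sr_check_period = 5                       # matches the 5-second interval of the lag messages above
delay_threshold = 102400                  # lag in bytes above which the delay is reported (illustrative value)
log_standby_delay = 'if_over_threshold'   # only log the standby delay when it exceeds the threshold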

My theory is that this happens because the parameter failover_on_backend_error is set to on, but I was not able to reproduce the issue in our development/test environment. If I just interrupt the running query on the STANDBY via kill -7, it does not result in a failover in any case, and if I outright kill the process with kill -9, the failover is performed regardless of the parameter.
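
For clarity, the change I have in mind is the following (a sketch; in the attached pgpool.conf the parameter is currently at its default of on):

failover_on_backend_error = off   # my understanding of the documentation: with off, an error on the
                                  # backend connection only terminates the affected session instead
                                  # of triggering the failover procedure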

So my questions are:
1) Would setting the parameter failover_on_backend_error to off prevent the failover in this particular situation?
2) What would be the best way to test the effect of the parameter?
3) In case the parameter does not prevent the failover, would you recommend turning on the hot_standby_feedback parameter on the PostgreSQL side to prevent the termination of the session? (A sketch of that change follows below.)
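
What I would try on the HOT STANDBY, if recommended, is roughly this (a sketch, not yet applied anywhere):

hot_standby_feedback = on   # the standby reports its oldest transaction to the primary, so VACUUM
                            # does not remove row versions the standby still needs; the trade-off
                            # is potentially delayed cleanup (bloat) on the primary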

I am also sending the full pgpool.conf file from SVLPGP1PP as an attachment.

Thank you for any advice.

With regards,
Vladimír Láznička
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.conf
Type: application/octet-stream
Size: 43700 bytes
Desc: pgpool.conf
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20210629/f646ca39/attachment-0001.obj>

