[pgpool-general: 4149] Online recovery issue

Tue Oct 27 23:29:53 JST 2015

Hi all,

I'm having an issue with online recovery.

My pgpool config:
two servers cara0 and cara1 (virtual ip resolves to caracara - we name all our servers with orange names, so this was a perfect name :-)
both run pgpool on port 5432 with watchdog - failover works great
both have postgresql on port 54321 as backend nodes
pgpool is configured with replication and load balancing
I'm using the pgpool package from the epel repo on CentOS 7 
postgresql-pgpool-II-3.4.3-3.el7.x86_64
postgresql-pgpool-II-extensions-3.4.3-3.el7.x86_64

Issue:
I've been running pgpool for a few months now with my less crucial db's and everything has been fine. Recently, I moved the remaining, mission critical and busier db's onto the cluster. Something happened that made the cluster drop one backend node (cara1) and I have been unable to do an online recovery. Once I have the recovery process working properly, I will continue to debug my setup to see what caused the degradation of backend node cara1.

Online recovery has worked for me in the past, but in the last couple of days (early mornings) I have attempted a recovery and it always times out at stage 2.

Oct 27 07:42:41 cara0 pgpool[548]: [617-1] 2015-10-27 07:42:41: pid 548: LOG:  fork a new PCP child process with pid: 23500
Oct 27 07:42:41 cara0 pgpool: 2015-10-27 07:42:41: pid 548: LOG:  fork a new PCP child process with pid: 23500
Oct 27 07:49:21 cara0 pgpool[23500]: [617-1] 2015-10-27 07:49:21: pid 23500: LOG:  starting recovering node 1
Oct 27 07:49:21 cara0 pgpool: 2015-10-27 07:49:21: pid 23500: LOG:  starting recovering node 1
Oct 27 07:49:21 cara0 pgpool[23500]: [618-1] 2015-10-27 07:49:21: pid 23500: LOG:  node recovery, CHECKPOINT in the 1st stage done
Oct 27 07:49:21 cara0 pgpool[23500]: [619-1] 2015-10-27 07:49:21: pid 23500: LOG:  executing recovery
Oct 27 07:49:21 cara0 pgpool[23500]: [619-2] 2015-10-27 07:49:21: pid 23500: DETAIL:  starting recovery command: "SELECT pgpool_recovery('pgpool-recovery-base.sh', 'cara1', '/var/lib/pgsql/data', '54321')"
Oct 27 07:49:21 cara0 pgpool[23500]: [620-1] 2015-10-27 07:49:21: pid 23500: LOG:  executing recovery
Oct 27 07:49:21 cara0 pgpool[23500]: [620-2] 2015-10-27 07:49:21: pid 23500: DETAIL:  disabling statement_timeout
Oct 27 07:49:21 cara0 pgpool: 2015-10-27 07:49:21: pid 23500: LOG:  node recovery, CHECKPOINT in the 1st stage done
Oct 27 07:49:21 cara0 pgpool: 2015-10-27 07:49:21: pid 23500: LOG:  executing recovery
Oct 27 07:49:21 cara0 pgpool: 2015-10-27 07:49:21: pid 23500: DETAIL:  starting recovery command: "SELECT pgpool_recovery('pgpool-recovery-base.sh', 'cara1', '/var/lib/pgsql/data', '54321')"
Oct 27 07:49:21 cara0 pgpool: 2015-10-27 07:49:21: pid 23500: LOG:  executing recovery
Oct 27 07:49:21 cara0 pgpool: 2015-10-27 07:49:21: pid 23500: DETAIL:  disabling statement_timeout
Oct 27 07:59:16 cara0 pgpool[23500]: [621-1] 2015-10-27 07:59:16: pid 23500: LOG:  node recovery, 1st stage is done
Oct 27 07:59:16 cara0 pgpool[23500]: [622-1] 2015-10-27 07:59:16: pid 23500: LOG:  node recovery, starting 2nd stage
Oct 27 07:59:16 cara0 pgpool: 2015-10-27 07:59:16: pid 23500: LOG:  node recovery, 1st stage is done
Oct 27 07:59:16 cara0 pgpool: 2015-10-27 07:59:16: pid 23500: LOG:  node recovery, starting 2nd stage
Oct 27 08:02:19 cara0 pgpool[23500]: [623-1] 2015-10-27 08:02:19: pid 23500: LOG:  wait_connection_closed: existing connections did not close in 180 sec.
Oct 27 08:02:19 cara0 pgpool[23500]: [624-1] 2015-10-27 08:02:19: pid 23500: ERROR:  node recovery failed, waiting connection closed in the other pgpools timeout
Oct 27 08:02:19 cara0 pgpool[23500]: [625-1] 2015-10-27 08:02:19: pid 23500: FATAL:  unable to read from client
Oct 27 08:02:19 cara0 pgpool[23500]: [625-2] 2015-10-27 08:02:19: pid 23500: DETAIL:  pcp_read failed with error : "Success"
Oct 27 08:02:19 cara0 pgpool[548]: [618-1] 2015-10-27 08:02:19: pid 548: LOG:  PCP child process with pid: 23500 exits with status 256
Oct 27 08:02:19 cara0 pgpool: 2015-10-27 08:02:19: pid 23500: LOG:  wait_connection_closed: existing connections did not close in 180 sec.
Oct 27 08:02:19 cara0 pgpool: 2015-10-27 08:02:19: pid 23500: ERROR:  node recovery failed, waiting connection closed in the other pgpools timeout
Oct 27 08:02:19 cara0 pgpool: 2015-10-27 08:02:19: pid 23500: FATAL:  unable to read from client
Oct 27 08:02:19 cara0 pgpool: 2015-10-27 08:02:19: pid 23500: DETAIL:  pcp_read failed with error : "Success"
Oct 27 08:02:19 cara0 pgpool: 2015-10-27 08:02:19: pid 548: LOG:  PCP child process with pid: 23500 exits with status 256

Is it possible to get a list of open connections to help me sort out what is refusing to drop its connection?

Cheers,
Gervais
------------------------------------
Gervais de Montbrun
Systems Administrator
GTalk: gervais at silverorange.com

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20151027/06f8a650/attachment.html>