[Pgpool-general] Node recovery failing in 3.0.x

Thu Feb 17 12:17:39 UTC 2011

Hi!

We were running pgpool-II 2.3.3 with two PostgreSQL 8.3.11 nodes.

Yesterday, we switched to pgpool-II 3.0.1.
After a while, node 1 degenerated:

> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: ERROR: pid 18268: rewrite_timestamp: could not get current timestamp
> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: LOG:   pid 18268: pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 18670 statement: insert into zaehler (typ,timestamp,begintime,endtime,ref,info,val1,id) values (('xxx'),now(),now(),now(),67,('xxx'),3.90000000,16180085);/*xxx*/; message: current transaction is aborted, commands ignored until end of transaction block
> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: ERROR: pid 18268: read_kind_from_backend: 1 th kind C does not match with master or majority connection kind E
> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: ERROR: pid 18268: kind mismatch among backends. Possible last query was: "insert into zaehler (typ,timestamp,begintime,endtime,ref,info,val1,id) values (('xxx'),now(),now(),now(),67,('xxx'),3.90000000,16180085);/*xxx*/;" kind details are: 0[E: current transaction is aborted, commands ignored until end of transaction block] 1[C]
> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: LOG:   pid 18268: notice_backend_error: 1 fail over request from pid 18268
> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: LOG:   pid 18221: starting degeneration. shutdown host xxx.34(5433)
> 2011-02-16T15:17:28.000+01:00 pg1 pgpool: LOG:   pid 18221: execute command: ./sendalarm.sh alarm "pgpool.fail.node1" nok "Fail=xxx.34:5433 New_Master_Node=0"; if ((0==2)); then ./sendalarm.sh alarm "pgpool.fail" nok "ALL NODES DOWN!"; fi
> 2011-02-16T15:17:29.000+01:00 pg1 pgpool: LOG:   pid 18221: failover_handler: set new master node: 0
> 2011-02-16T15:17:29.000+01:00 pg1 pgpool: LOG:   pid 18221: failover done. shutdown host 172.22.44.34(5433)

We then simply tried to recover the node, using the recovery scripts
still in place from 2.3.3, where they worked. Everything ran as usual, up
to and including the recovery_1st_stage_command. That is where it stops:
client sessions are still disconnected after the timeout, but the second
checkpoint is never created and the recovery_2nd_stage_command is never
executed...

We have since updated to 3.0.2, but to no avail:

> 2011-02-17T12:06:20.000+01:00 pg1 pgpool: LOG:   pid 30065: 1st stage is done
> 2011-02-17T12:06:20.000+01:00 pg1 pgpool: LOG:   pid 30065: starting 2nd stage
> 2011-02-17T12:06:21.000+01:00 pg1 pgpool: LOG:   pid 31409: pool_process_query: child connection forced to terminate due to client_idle_limit_in_recovery(1) reached
> 2011-02-17T12:06:21.000+01:00 pg1 pgpool: LOG:   pid 31409: statement: ABORT
> 2011-02-17T12:06:21.000+01:00 pg1 pgpool: LOG:   pid 31409: statement:  DISCARD ALL
... some more of those...
> 2011-02-17T12:06:21.000+01:00 pg1 pgpool: LOG:   pid 29924: pool_process_query: child connection forced to terminate due to client_idle_limit_in_recovery(1) reached
> 2011-02-17T12:06:21.000+01:00 pg1 pgpool: LOG:   pid 29924: statement: ABORT
> 2011-02-17T12:06:21.000+01:00 pg1 pgpool: LOG:   pid 29924: statement:  DISCARD ALL
> 2011-02-17T12:13:09.000+01:00 pg1 pgpool: LOG:   pid 29807: received fast shutdown request
> 2011-02-17T12:13:15.000+01:00 pg1 pgpool: LOG:   pid 3065: read_status_file: 1 th backend is set to down status

... and then pgpool just idles until I restart it after a couple of
minutes. With 2.3.3, this stage of the recovery took only a few seconds.

pgpool.conf has been adapted to the parameters that are new in 3.0.x, so
that shouldn't be the problem either.
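
For orientation, the online-recovery part of a pgpool.conf has roughly the
shape sketched below. This is not a copy of our file: the user, the script
names and the recovery_timeout value are placeholders chosen for
illustration. Only client_idle_limit_in_recovery = 1 is taken from the log
output above.

```
# Illustrative sketch only; values marked "placeholder" are assumptions.
recovery_user = 'postgres'                       # placeholder
recovery_password = ''                           # placeholder
recovery_1st_stage_command = 'first_stage.sh'    # placeholder script name
recovery_2nd_stage_command = 'second_stage.sh'   # placeholder script name
recovery_timeout = 90                            # placeholder, in seconds
client_idle_limit_in_recovery = 1                # value shown in the log above
```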

Suggestions?

Karsten