[Pgpool-general] Second stage online recovery with PITR problems on pgpool 3.0.3 / postgresql 9.0.4

Tue Sep 13 11:12:32 UTC 2011

I have problem with 2nd. stage online PITR recovery procedure.
The data received in second stage after base backup and prior to WAL
switch gets lost.

I've managed to isolate the problem down to postgresql without the
pgpool-II running:
- stop failed node
//1st stage
- start backup
- rsync files to failed node
- stop backup
- do intentional insert in master node
//2nd stage
- do pg_switch_log (tested also pgpool_xlog_switch with same results)
- rsync archive WAL files to failed node
- start failed node

The failed node starts fine and it does recovery, but for the last WAL
file it always reports "invalid record length" error, and it returns
to last known good WAL file (the one created in backup step).

Log from failed node when I do restore (increasing verbosity reveals
no more information):
[2011-09-13 12:42:58 CEST]-[]-[31877|] LOG:  database system was
interrupted; last known up at 2011-09-13 12:40:46 CEST
[2011-09-13 12:42:58 CEST]-[]-[31877|] LOG:  creating missing WAL
directory "pg_xlog/archive_status"
[2011-09-13 12:42:58 CEST]-[]-[31877|] LOG:  starting archive recovery
[2011-09-13 12:42:58 CEST]-[postgres]-[31882|] FATAL:  the database
system is starting up
[2011-09-13 12:42:59 CEST]-[]-[31877|] LOG:  restored log file
"000000020000000100000020" from archive
[2011-09-13 12:42:59 CEST]-[]-[31877|] LOG:  redo starts at 1/20000078
[2011-09-13 12:42:59 CEST]-[]-[31877|] LOG:  consistent recovery state
reached at 1/21000000
[2011-09-13 12:42:59 CEST]-[postgres]-[31886|] FATAL:  the database
system is starting up
[2011-09-13 12:43:00 CEST]-[postgres]-[31887|] FATAL:  the database
system is starting up
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  restored log file
"000000020000000100000021" from archive
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  invalid record length at 1/21000020
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  redo done at 1/200000A0
[2011-09-13 12:43:01 CEST]-[postgres]-[31890|] FATAL:  the database
system is starting up
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  restored log file
"000000020000000100000020" from archive
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  selected new timeline ID: 3
[2011-09-13 12:43:01 CEST]-[]-[31877|] LOG:  archive recovery complete
[2011-09-13 12:43:01 CEST]-[]-[31883|] LOG:  checkpoint starting:
end-of-recovery immediate wait
[2011-09-13 12:43:02 CEST]-[]-[31883|] LOG:  checkpoint complete:
wrote 0 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 0
recycled; write=0.000 s, sync=0.000 s, total=0.659 s
[2011-09-13 12:43:02 CEST]-[]-[31876|] LOG:  database system is ready
to accept connections
[2011-09-13 12:43:02 CEST]-[]-[31896|] LOG:  autovacuum launcher started

I've done md5sum of 000000020000000100000021 WAL file in archive dir
on master and target node, and the file is the same on both nodes.

So my question goes:
Did I miss something, or did I get the procedure wrong?
Is online recovery with PITR procedure still valid as it is presented
in manual?
Can I replace the pg_switch_xlog with another pg_start_backup and
pg_stop_backup call and what are performance implications in this
case?

Software versions:
I'm using: PostgreSQL 9.0.4 on both nodes with same OS
Restore master:
Linux miho 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011
x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux
Restore target:
Linux alice 3.0-ARCH #1 SMP PREEMPT Wed Aug 17 21:55:57 CEST 2011
x86_64 Intel(R) Core(TM) i7 CPU 930 @ 2.80GHz GenuineIntel GNU/Linux

Thanks for help.
Nikola