[Pgpool-general] pcp_recovery_node and errors in postgres log

Wed Dec 23 03:41:14 UTC 2009

> On 22.12.2009 23:05, Tomasz Chmielewski wrote:
> > I followed http://linuxsilo.net/articles/postgresql-pgpool.html to set up pgpool-II replication.
> > 
> > 
> > When I detach and recover a node with these commands:
> > 
> > # pcp_detach_node -d 240 127.0.0.1 9898 user pass 1
> > # pcp_recovery_node -d 240 127.0.0.1 9898 user pass 1
> > 
> > 
> > I can observe the following in the postgres logs on node 1:
> > 
> > 2009-12-23 06:03:15 SGT LOG:  database system was interrupted; last known up at 2009-12-23 06:03:12 SGT
> > 2009-12-23 06:03:15 SGT LOG:  starting archive recovery
> > 2009-12-23 06:03:15 SGT LOG:  restore_command = '/usr/bin/scp db10:/var/lib/postgresql/8.3/main/pg_xlog_archive/%f %p'
> > scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/00000002.history: No such file or directory
> 
> 
> > Because of these errors, recovery sometimes fails.
> > 
> > How does postgres on the node being recovered determine which %f files it needs to copy?
> 
> OK, I see it's normal that it asks for files which are not present:
> 
> http://developer.postgresql.org/pgdocs/postgres/continuous-archiving.html
> 
>    It is important for the command to return a zero exit status if and 
>    only if it succeeds. The command will be asked for file names that 
>    are not present in the archive; it must return nonzero when so 
>    asked. 
> 
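As a minimal sketch of that exit-status contract, restore_command could call a small wrapper like the one below. The function name restore_wal and the COPY and ARCHIVE variables are hypothetical, introduced only for illustration; the archive location is the one shown in the quoted log. The wrapper just passes a failed copy through as a nonzero status, so a file that is missing from the archive is reported to postgres as "not available" rather than as success.

```shell
#!/bin/sh
# Hypothetical restore_command wrapper; all names here are illustrative.
COPY=${COPY:-/usr/bin/scp}
ARCHIVE=${ARCHIVE:-db10:/var/lib/postgresql/8.3/main/pg_xlog_archive}

restore_wal() {
    # $1 is the file name postgres asks for (%f), $2 is the target path (%p).
    if "$COPY" "$ARCHIVE/$1" "$2"; then
        return 0    # zero only when the copy really succeeded
    fi
    echo "restore_wal: $1 is not available in the archive" >&2
    return 1        # nonzero tells postgres the file is not in the archive
}

# A script file used as restore_command would end with (assumption):
# restore_wal "$@"
```

With such a script installed under some path of your choosing, restore_command would pass it %f and %p as its two arguments. Requests for files like 00000002.history then still fail, as the documentation says they must, but with an explicit message.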
> 
> However, postgres on the recovered node fails to start if it finds no files to copy, e.g.:
> 
> 2009-12-23 06:21:40 SGT LOG:  database system was shut down at 2009-12-23 06:21:36 SGT
> 2009-12-23 06:21:40 SGT LOG:  starting archive recovery
> 2009-12-23 06:21:40 SGT LOG:  restore_command = '/usr/bin/scp db10:/var/lib/postgresql/8.3/main/pg_xlog_archive/%f %p'
> scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/00000003.history: No such file or directory
> scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/000000030000000000000063: No such file or directory
> 2009-12-23 06:21:40 SGT LOG:  could not open file "pg_xlog/000000030000000000000063" (log file 0, segment 99): No such file or directory
> 2009-12-23 06:21:40 SGT LOG:  invalid primary checkpoint record
> 2009-12-23 06:21:40 SGT LOG:  incomplete startup packet
> scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/000000030000000000000063: No such file or directory
> 2009-12-23 06:21:40 SGT LOG:  could not open file "pg_xlog/000000030000000000000063" (log file 0, segment 99): No such file or directory
> 2009-12-23 06:21:40 SGT LOG:  invalid secondary checkpoint record
> 2009-12-23 06:21:40 SGT PANIC:  could not locate a valid checkpoint record
> 2009-12-23 06:21:40 SGT LOG:  startup process (PID 24196) was terminated by signal 6: Aborted
> 2009-12-23 06:21:40 SGT LOG:  aborting startup due to startup process failure
> 
> 
> To reproduce:
> 
> 1) on a failed node, do:
> 
> tail -f /var/log/postgresql/postgresql-8.3-main.log
> 
> 
> 2) run pcp_recovery_node, then pcp_detach_node, and then pcp_recovery_node again:
> 
> pcp_recovery_node -d 240 127.0.0.1 9898 user pass 1
> pcp_detach_node -d 240 127.0.0.1 9898 user pass 1
> pcp_recovery_node -d 240 127.0.0.1 9898 user pass 1
> 
> The log on node 1 will show postgres startup failure; pcp_recovery_node will "hang" until it times out.
> 
> Is it expected?

I don't understand Spanish, so I'm not sure I read the following URL
correctly, but...

http://linuxsilo.net/articles/postgresql-pgpool.html

I noticed that the "base-backup" script in the article does this:

$LOGGER "Rsyncing directory pg_xlog" 
$RSYNC $SRC_DATA/pg_xlog/ $DST_HOST:$DST_DATA/pg_xlog/ 

I think this is not necessary and probably not good. Instead, you
would want to clear $DST_HOST:$DST_DATA/pg_xlog/*.
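
A rough sketch of such a clearing step, which could stand in for the rsync of pg_xlog, might look like the following. The function name clear_remote_xlog and the SSH variable are assumptions for illustration and are not taken from the article; only $DST_HOST and $DST_DATA come from it. The sketch deletes only the plain files directly under pg_xlog and leaves subdirectories such as archive_status in place, which is one possible reading of "clear". It also assumes a find on the destination host that supports -maxdepth and -delete (GNU find does).

```shell
#!/bin/sh
# Hypothetical replacement for the pg_xlog rsync step in "base-backup".
SSH=${SSH:-ssh}

clear_remote_xlog() {
    host=$1
    data=$2
    # Delete only regular files directly under pg_xlog; subdirectories
    # (e.g. archive_status) are kept. Paths containing single quotes
    # are not handled by this sketch.
    $SSH "$host" "find '$data/pg_xlog' -maxdepth 1 -type f -delete"
}

# Intended use inside the script (commented out so the sketch has no side effects):
# clear_remote_xlog "$DST_HOST" "$DST_DATA"
```

Please try it on a test node first, since it removes files on the destination without asking.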
--
Tatsuo Ishii
SRA OSS, Inc. Japan