[Pgpool-general] pcp_recovery_node and errors in postgres log

Tue Dec 22 22:29:29 UTC 2009

On 22.12.2009 23:05, Tomasz Chmielewski wrote:
> I followed http://linuxsilo.net/articles/postgresql-pgpool.html to set up pgpool-II replication.
> 
> 
> When I detach and recover a node with these commands:
> 
> # pcp_detach_node -d 240 127.0.0.1 9898 user pass 1
> # pcp_recovery_node -d 240 127.0.0.1 9898 user pass 1
> 
> 
> I can observe the following in the postgres logs on node 1:
> 
> 2009-12-23 06:03:15 SGT LOG:  database system was interrupted; last known up at 2009-12-23 06:03:12 SGT
> 2009-12-23 06:03:15 SGT LOG:  starting archive recovery
> 2009-12-23 06:03:15 SGT LOG:  restore_command = '/usr/bin/scp db10:/var/lib/postgresql/8.3/main/pg_xlog_archive/%f %p'
> scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/00000002.history: No such file or directory

> Because of these errors, recovery sometimes fails.
> 
> How does postgres on the node being recovered determine which %f files it needs to copy?

OK, I see it's normal that it asks for files which are not present:

http://developer.postgresql.org/pgdocs/postgres/continuous-archiving.html

   It is important for the command to return a zero exit status if and 
   only if it succeeds. The command will be asked for file names that 
   are not present in the archive; it must return nonzero when so 
   asked. 
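
To make that contract concrete, here is a minimal sketch of a wrapper that could stand in for the plain scp call. It is not what my setup uses: it assumes the archive is reachable as a local directory (for example an NFS mount), and the function name restore_wal and the script path in the comment are hypothetical. It only illustrates "zero if and only if the file was copied, nonzero when the file is not in the archive".

```shell
#!/bin/sh
# Hypothetical restore_command wrapper (a sketch, not part of pgpool or postgres).
# Assumed recovery.conf entry:
#   restore_command = '/usr/local/bin/restore_wal.sh /path/to/archive %f %p'

restore_wal() {
    archive_dir=$1   # assumed: archive directory readable on this host
    file=$2          # %f: file name postgres is asking for
    dest=$3          # %p: path postgres wants the file copied to

    # Requests for files that are not archived (e.g. a .history file for a
    # timeline that does not exist yet) are normal: answer with nonzero.
    if [ ! -f "$archive_dir/$file" ]; then
        return 1
    fi

    # Zero exit status only if the copy itself succeeded.
    cp "$archive_dir/$file" "$dest"
}

# Run only when called with the three expected arguments.
if [ "$#" -eq 3 ]; then
    restore_wal "$@"
fi
```

With scp the exit status already follows the same rule, so the "No such file or directory" lines for .history files are harmless by themselves.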

However, postgres on the recovered node fails to start if it finds no files to copy, for example:

2009-12-23 06:21:40 SGT LOG:  database system was shut down at 2009-12-23 06:21:36 SGT
2009-12-23 06:21:40 SGT LOG:  starting archive recovery
2009-12-23 06:21:40 SGT LOG:  restore_command = '/usr/bin/scp db10:/var/lib/postgresql/8.3/main/pg_xlog_archive/%f %p'
scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/00000003.history: No such file or directory
scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/000000030000000000000063: No such file or directory
2009-12-23 06:21:40 SGT LOG:  could not open file "pg_xlog/000000030000000000000063" (log file 0, segment 99): No such file or directory
2009-12-23 06:21:40 SGT LOG:  invalid primary checkpoint record
2009-12-23 06:21:40 SGT LOG:  incomplete startup packet
scp: /var/lib/postgresql/8.3/main/pg_xlog_archive/000000030000000000000063: No such file or directory
2009-12-23 06:21:40 SGT LOG:  could not open file "pg_xlog/000000030000000000000063" (log file 0, segment 99): No such file or directory
2009-12-23 06:21:40 SGT LOG:  invalid secondary checkpoint record
2009-12-23 06:21:40 SGT PANIC:  could not locate a valid checkpoint record
2009-12-23 06:21:40 SGT LOG:  startup process (PID 24196) was terminated by signal 6: Aborted
2009-12-23 06:21:40 SGT LOG:  aborting startup due to startup process failure
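
As far as I understand it, the %f names requested above are not arbitrary: in 8.3 a WAL segment file name is the timeline ID, the log file number and the segment number, each printed as 8 hex digits. A small sketch (the function name wal_segment_name is my own, not a postgres tool) reproduces the name from the "log file 0, segment 99" message on timeline 3:

```shell
# Build a WAL segment file name from timeline, log file number and segment
# number (all decimal), assuming the 8.3-era 3 x 8 hex digit naming scheme.
wal_segment_name() {
    printf '%08X%08X%08X\n' "$1" "$2" "$3"
}

# Timeline 3, log file 0, segment 99 (0x63), as in the log above.
wal_segment_name 3 0 99
# prints 000000030000000000000063
```

So the node is asking for the segment that should contain its latest checkpoint record, and that segment is neither in the archive nor in pg_xlog.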

To reproduce:

1) On the failed node, watch the postgres log:

tail -f /var/log/postgresql/postgresql-8.3-main.log

2) Run pcp_recovery_node, then pcp_detach_node, and then pcp_recovery_node again:

pcp_recovery_node -d 240 127.0.0.1 9898 user pass 1
pcp_detach_node -d 240 127.0.0.1 9898 user pass 1
pcp_recovery_node -d 240 127.0.0.1 9898 user pass 1

The log on node 1 will show the postgres startup failure, and pcp_recovery_node will "hang" until it times out.

Is this expected?

-- 
Tomasz Chmielewski
http://wpkg.org