[pgpool-general: 2638] pcp_recovery_node failing in stage 2

Fri Mar 21 00:18:05 JST 2014

Hi,

My current setup is a pair of pgpool 3.3.2 instances in front of two 
backend PostgreSQL 9.2.4 servers, all running on CentOS 6.4.  The 
PostgreSQL data directories are quite large (144GB).  I have run into a 
situation where pcp_recovery_node consistently fails with a 
BackendError.

The stage 1 recovery command is a script called do-base-backup.sh that 
runs an rsync as follows:

     rsync -Cacvv --delete \
             --exclude postmaster.pid --exclude postmaster.opts \
             --exclude recovery.done \
             --exclude pg_log/\* --exclude pg_xlog/\* \
             $SOURCE/ $DESTINATION/ 2>&1 |
     mailx -s "rsync verbose output" sean at compusult.net

For some reason this rsync is failing after some minutes (typically 10 
to 12) with undocumented exit code 255.  The verbose rsync logging says 
this:

     Killed by signal 2.
     rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
     rsync: connection unexpectedly closed (50735 bytes received so far) [sender]
     rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]

Googling has not brought up anything helpful other than bugs with large 
files in older versions of rsync.  I'm fairly certain that is not the 
case here, especially because of the "Killed by signal 2", which is 
suggestive of some sort of timeout on the pgpool end.

The specific command line I'm using to recover the second database node is:

     sudo -u postgres /usr/local/bin/pcp_recovery_node 10000 psql01 9898 postgres XXXXXX 1

With such a large timeout value I shouldn't be hitting a timeout there.

The weird thing, which makes me point the finger at either pgpool or 
pcp_recovery_node, is that if I run do-base-backup.sh manually it works 
fine (and takes much, much longer, as expected).

Does pgpool have some internal limit on how long it will wait for the 
1st stage command to run?  I've attached the log file but it isn't very 
informative.  (Note that the do-base-backup.sh script isn't 
communicating the rsync failure back to pgpool, so pgpool goes ahead and 
runs stage 2.  Of course, that fails because not everything has been 
synced.)
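
For reference, here is a rough sketch of how the script could report the 
rsync failure instead of hiding it.  In a pipeline the exit status is 
that of the last command (mailx here), so the rsync status is lost.  The 
sketch captures the output in a temporary file first.  The names 
run_reported and report_output are made up for illustration, and I am 
assuming pgpool treats a non-zero exit from the stage 1 script as a 
failed recovery.

```shell
#!/bin/sh
# Hypothetical helper: run a command, hand its combined stdout/stderr to
# report_output, and return the command's own exit status.
run_reported() {
    rr_log=$(mktemp) || return 1
    rr_status=0
    # Capture output in a file so no pipeline masks the exit status.
    "$@" >"$rr_log" 2>&1 || rr_status=$?
    report_output <"$rr_log"
    rm -f "$rr_log"
    return "$rr_status"
}

# Placeholder reporter (assumption): in do-base-backup.sh this would be
# the existing mailx call reading from stdin.
report_output() {
    cat >/dev/null
}

# Intended use inside the script (not executed here):
#   run_reported rsync -Cacvv --delete ... "$SOURCE/" "$DESTINATION/" || exit 1
```

In bash, reading ${PIPESTATUS[0]} right after the original pipeline would 
be an alternative that keeps the output streaming to mailx.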

Thanks,
Sean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.log
Type: text/x-log
Size: 21887 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20140320/fb3abe15/attachment.bin>