[pgpool-general: 2639] Re: pcp_recovery_node failing in stage 1

Fri Mar 21 00:21:13 JST 2014

Sorry, the subject line should have said stage *1*.

On 14-03-20 12:48 PM, Sean Hogan wrote:
> Hi,
>
> In my setup at the moment I have a pair of pgpool 3.3.2 instances
> with two backend PostgreSQL 9.2.4 servers, all running on
> CentOS 6.4.  The PostgreSQL data directories are quite large: 144GB.
> I have run into a situation where pcp_recovery_node consistently fails
> with a BackendError.
>
> The stage 1 recovery command is a script called do-base-backup.sh that 
> runs an rsync as follows:
>
>     rsync -Cacvv --delete \
>             --exclude postmaster.pid --exclude postmaster.opts \
>             --exclude recovery.done \
>             --exclude pg_log/\* --exclude pg_xlog/\* \
>             $SOURCE/ $DESTINATION/ 2>&1 |
>     mailx -s "rsync verbose output" sean at compusult.net
>
> For some reason this rsync is failing after some minutes (typically 10 
> to 12) with undocumented exit code 255.  The verbose rsync logging 
> says this:
>
>     Killed by signal 2.
>     rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]: Broken pipe (32)
>     rsync: connection unexpectedly closed (50735 bytes received so far) [sender]
>     rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
>
> Googling has not brought up anything helpful other than bugs with 
> large files in older versions of rsync.  I'm fairly certain that is 
> not the case here, especially because of the "Killed by signal 2", 
> which is suggestive of some sort of timeout on the pgpool end.
>
> The specific command line I'm using to recover the second database 
> node is:
>
>     sudo -u postgres /usr/local/bin/pcp_recovery_node 10000 psql01 9898 postgres XXXXXX 1
>
> With such a large timeout value I shouldn't be hitting a timeout there.
>
> The weird thing, which makes me point the finger at either pgpool or
> pcp_recovery_node, is that if I run do-base-backup.sh manually it
> works fine (and takes much, much longer, as expected).
>
> Does pgpool have some internal limit on how long it will wait for the 
> 1st stage command to run?  I've attached the log file but it isn't 
> very informative.  (Note that the do-base-backup.sh script isn't 
> communicating the rsync failure back to pgpool, so pgpool goes ahead 
> and runs stage 2.  Of course, that fails because not everything has 
> been synced.)
>
> Thanks,
> Sean

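On the point in the quoted message that do-base-backup.sh does not report the rsync failure back to pgpool: the exit status of a pipeline is normally that of its last command, so with rsync piped into mailx the script ends up with mailx's status, not rsync's. The following is only a rough sketch of one way around that, not a tested fix. The names run_piped, mail_report and MAILTO are made up for illustration and are not in the current script.

```shell
#!/bin/sh
# Sketch only: run_piped, mail_report and MAILTO are hypothetical names.

# Run COMMAND, send its stdout and stderr to REPORTER on stdin, and return
# COMMAND's exit status instead of REPORTER's.
# Usage: run_piped REPORTER COMMAND [ARGS...]
run_piped() {
    rp_reporter=$1
    shift
    rp_status_file=$(mktemp) || return 1
    # The left side of the pipe runs in a subshell, so the command's status
    # is handed back through a temporary file.
    { "$@" 2>&1; echo "$?" > "$rp_status_file"; } | "$rp_reporter"
    rp_status=$(cat "$rp_status_file")
    rm -f "$rp_status_file"
    return "$rp_status"
}

# Reporter that mails whatever arrives on stdin (address assumed to be in MAILTO).
mail_report() {
    mailx -s "rsync verbose output" "$MAILTO"
}

# Possible use in do-base-backup.sh (left commented out here):
#   run_piped mail_report rsync -Cacvv --delete \
#           --exclude postmaster.pid --exclude postmaster.opts \
#           --exclude recovery.done \
#           --exclude pg_log/\* --exclude pg_xlog/\* \
#           "$SOURCE/" "$DESTINATION/" || exit 1
```

With something like this, a non-zero status from rsync would make the stage 1 script exit non-zero as well. In bash, reading ${PIPESTATUS[0]} right after the pipeline, or enabling set -o pipefail, should give the same result without the temporary file.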