[pgpool-general: 2644] Re: pcp_recovery_node failing in stage 1

Fri Mar 21 17:50:23 JST 2014

> Sorry, the subject line should have said stage *1*.

Really? This is what I read in the pgpool log:

2014-03-20 12:42:43 LOG:   pid 18259: 1st stage is done
2014-03-20 12:42:43 LOG:   pid 18259: starting 2nd stage
2014-03-20 12:42:47 LOG:   pid 18259: CHECKPOINT in the 2nd stage done
2014-03-20 12:42:47 LOG:   pid 18259: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr.sh', 'psql02.compusult.net', '/var/lib/pgsql/9.2/data')"
2014-03-20 12:42:49 LOG:   pid 18259: check_postmaster_started: try to connect to postmaster on hostname:psql02.compusult.net database:postgres user:postgres (retry 0 times)

I saw "1st stage is done", so I guess the first stage succeeded and it
was the second stage that failed. What does the second stage look like?
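
The quoted message below notes that do-base-backup.sh does not pass the
rsync failure back to pgpool, which would explain why the log reports
the 1st stage as done. A minimal sketch of one way to propagate the
status follows. It assumes that pgpool treats a non-zero exit status
from the recovery script as a failed stage, which is not confirmed in
this thread. run_and_report and report_output are hypothetical names,
and a temporary file replaces the original pipe into mailx so that the
pipeline does not mask the rsync status:

```shell
#!/bin/sh
# Sketch only: run_and_report and report_output are hypothetical helpers,
# not part of pgpool or of the original do-base-backup.sh.

# Placeholder for the reporting step. The quoted script pipes the rsync
# output into mailx; here the captured output is copied to stderr.
report_output() {
    cat >&2
}

# Run the given command with stdout and stderr captured in a temporary
# file, hand the captured output to report_output, and return the
# command's own exit status instead of the reporter's.
run_and_report() {
    log=$(mktemp) || return 1
    status=0
    "$@" >"$log" 2>&1 || status=$?
    report_output <"$log"
    rm -f "$log"
    return "$status"
}

# Possible use at the end of do-base-backup.sh, with the same rsync
# options as in the quoted script:
#   run_and_report rsync -Cacvv --delete "$SOURCE/" "$DESTINATION/" || exit 1
```

With this shape, a failing rsync makes the script exit non-zero instead
of returning the status of the mail command.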

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> On 14-03-20 12:48 PM, Sean Hogan wrote:
>> Hi,
>>
>> In my setup at the moment I have a pair of version 3.3.2 pgpool
>> instances with two backend PostgreSQL 9.2.4 servers, all running on
>> CentOS 6.4.  The PostgreSQL data directories are quite large - 144GB.
>> I have run into a situation where pcp_recovery_node consistently fails
>> with a BackendError.
>>
>> The stage 1 recovery command is a script called do-base-backup.sh that
>> runs an rsync as follows:
>>
>>     rsync -Cacvv --delete \
>>             --exclude postmaster.pid --exclude postmaster.opts \
>>             --exclude recovery.done \
>>             --exclude pg_log/\* --exclude pg_xlog/\* \
>>             $SOURCE/ $DESTINATION/ 2>&1 |
>>     mailx -s "rsync verbose output" sean at compusult.net
>>
>> For some reason this rsync is failing after some minutes (typically 10
>> to 12) with undocumented exit code 255.  The verbose rsync logging
>> says this:
>>
>>     Killed by signal 2.
>>     rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]:
>>     Broken pipe (32)
>>     rsync: connection unexpectedly closed (50735 bytes received so far)
>>     [sender]
>>     rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
>>
>> Googling has not brought up anything helpful other than bugs with
>> large files in older versions of rsync.  I'm fairly certain that is
>> not the case here, especially because of the "Killed by signal 2",
>> which is suggestive of some sort of timeout on the pgpool end.
>>
>> The specific command line I'm using to recover the second database
>> node is:
>>
>>     sudo -u postgres /usr/local/bin/pcp_recovery_node 10000 psql01 9898 \
>>         postgres XXXXXX 1
>>
>> With such a large timeout value I shouldn't be hitting a timeout
>> there.
>>
>> The weird thing, which makes me point the finger at either pgpool or
>> pcp_recovery_node, is that if I run do-base-backup.sh manually it
>> works fine (and takes much much longer, as expected).
>>
>> Does pgpool have some internal limit on how long it will wait for the
>> 1st stage command to run?  I've attached the log file but it isn't
>> very informative.  (Note that the do-base-backup.sh script isn't
>> communicating the rsync failure back to pgpool, so pgpool goes ahead
>> and runs stage 2.  Of course, that fails because not everything has
>> been synced.)
>>
>> Thanks,
>> Sean