[pgpool-general: 2645] Re: pcp_recovery_node failing in stage 1

Sean Hogan sean at compusult.net
Fri Mar 21 20:16:25 JST 2014


The stage 1 script is not careful with exit codes, so it continues after
the failed rsync and eventually exits with success.  This tricks pgpool
into continuing with stage 2, but it's definitely the stage 1 command
that is failing.
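
To illustrate how the failure gets hidden: a pipeline's exit status is
normally that of its last command, so piping rsync into mailx reports
mailx's status.  The following is only a minimal sketch, not the actual
do-base-backup.sh.  The names sync_cmd and report_cmd are hypothetical
stand-ins for the rsync and mailx invocations, and the simulated exit
code 255 is taken from the rsync log quoted below.

```shell
#!/bin/sh
# Hypothetical stand-ins so the sketch runs anywhere; in a real script
# these would be the rsync and mailx commands.
sync_cmd() { echo "simulated rsync output"; return 255; }
report_cmd() { cat >/dev/null; }

# Naive form: $? is the status of the LAST pipeline command (report_cmd),
# so the failure of sync_cmd is lost.
sync_cmd 2>&1 | report_cmd
naive_status=$?

# Portable form: record the first command's status in a temporary file
# before its output reaches the reporter.
status_file=$(mktemp)
{ rc=0; sync_cmd 2>&1 || rc=$?; echo "$rc" >"$status_file"; } | report_cmd
checked_status=$(cat "$status_file")
rm -f "$status_file"

echo "naive=$naive_status checked=$checked_status"
# prints: naive=0 checked=255

# A stage 1 script would then finish with something like:
#   exit "$checked_status"
# so that a failed sync is reported back to the caller.
```

In bash, `set -o pipefail` or `${PIPESTATUS[0]}` would be a shorter way
to get the same result; the temporary file is used here only to stay
within plain sh.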

Sean

On 14-03-21 06:20 AM, Tatsuo Ishii wrote:
>> Sorry, the subject line should have said stage *1*.
> Really? From what I read from pgpool log:
>
> 2014-03-20 12:42:43 LOG:   pid 18259: 1st stage is done
> 2014-03-20 12:42:43 LOG:   pid 18259: starting 2nd stage
> 2014-03-20 12:42:47 LOG:   pid 18259: CHECKPOINT in the 2nd stage done
> 2014-03-20 12:42:47 LOG:   pid 18259: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr.sh', 'psql02.compusult.net', '/var/lib/pgsql/9.2/data')"
> 2014-03-20 12:42:49 LOG:   pid 18259: check_postmaster_started: try to connect to postmaster on hostname:psql02.compusult.net database:postgres user:postgres (retry 0 times)
>
> I saw "1st stage is done", so I guess the first stage succeeded
> but the second stage failed. What does the second stage look like?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
>> On 14-03-20 12:48 PM, Sean Hogan wrote:
>>> Hi,
>>>
>>> In my setup at the moment I have a pair of version 3.3.2 pgpool
>>> instances with two backend PostgreSQL 9.2.4 servers, all running on
>>> CentOS 6.4.  The PostgreSQL data directories are quite large - 144GB.
>>> I have run into a situation where pcp_recovery_node consistently fails
>>> with a BackendError.
>>>
>>> The stage 1 recovery command is a script called do-base-backup.sh that
>>> runs an rsync as follows:
>>>
>>>      rsync -Cacvv --delete \
>>>              --exclude postmaster.pid --exclude postmaster.opts \
>>>              --exclude recovery.done \
>>>              --exclude pg_log/\* --exclude pg_xlog/\* \
>>>              $SOURCE/ $DESTINATION/ 2>&1 |
>>>      mailx -s "rsync verbose output" sean at compusult.net
>>>
>>> For some reason this rsync is failing after some minutes (typically 10
>>> to 12) with undocumented exit code 255.  The verbose rsync logging
>>> says this:
>>>
>>>      Killed by signal 2.
>>>      rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]:
>>>      Broken pipe (32)
>>>      rsync: connection unexpectedly closed (50735 bytes received so far)
>>>      [sender]
>>>      rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
>>>
>>> Googling has not brought up anything helpful other than bugs with
>>> large files in older versions of rsync.  I'm fairly certain that is
>>> not the case here, especially because of the "Killed by signal 2",
>>> which is suggestive of some sort of timeout on the pgpool end.
>>>
>>> The specific command line I'm using to recover the second database
>>> node is:
>>>
>>>      sudo -u postgres /usr/local/bin/pcp_recovery_node 10000 psql01 9898 \
>>>              postgres XXXXXX 1
>>>
>>> With such a large timeout value I shouldn't be hitting a timeout
>>> there.
>>>
>>> The weird thing, which makes me point the finger at either pgpool or
>>> pcp_recovery_node, is that if I run do-base-backup.sh manually it
>>> works fine (and takes much much longer, as expected).
>>>
>>> Does pgpool have some internal limit on how long it will wait for the
>>> 1st stage command to run?  I've attached the log file but it isn't
>>> very informative.  (Note that the do-base-backup.sh script isn't
>>> communicating the rsync failure back to pgpool, so pgpool goes ahead
>>> and runs stage 2.  Of course, that fails because not everything has
>>> been synced.)
>>>
>>> Thanks,
>>> Sean
