[pgpool-general: 2686] Re: Fwd: Re: Re: pcp_recovery_node failing in stage 1
Tatsuo Ishii
ishii at postgresql.org
Wed Apr 2 08:07:50 JST 2014
Hello Sean,
No, I didn't see any sign in the log that a signal was delivered.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
> Hello Tatsuo,
>
> Did the attached log provide any insight?
>
> Thanks,
> Sean
>
>
> -------- Original Message --------
> Subject: Re: [pgpool-general: 2639] Re: pcp_recovery_node failing in
> stage 1
> Date: Fri, 21 Mar 2014 10:59:21 -0230
> From: Sean Hogan <sean at compusult.net>
> To: Tatsuo Ishii <ishii at postgresql.org>
> CC: pgpool-general at pgpool.net
>
>
>
> I agree, it makes no sense. The strace is attached.
>
> Sean
>
> On 14-03-21 10:02 AM, Tatsuo Ishii wrote:
>> Ridiculous. There's no code in pgpool which sends signal 2 to the recovery
>> command. Is it possible to start pgpool under strace and run the
>> recovery so that we can find out who sends the signal?
>>
>> strace -f pgpool start
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
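
For the strace run suggested above, the trace of a multi-minute recovery can be very large, so it may help to write it to a file and search it for calls that send SIGINT (signal 2 on Linux). The following is only a sketch: the helper name find_sigint_senders and the path /tmp/pgpool.strace are assumptions for illustration, and the pattern assumes strace's usual "pid syscall(args) = result" line format when -f and -o are used together.

```shell
#!/bin/sh
# Sketch only. Capture the trace to a file first, for example:
#   strace -f -e trace=signal -o /tmp/pgpool.strace pgpool start
# (/tmp/pgpool.strace is a hypothetical path.)

# find_sigint_senders is a hypothetical helper, not part of pgpool or strace.
# $1: path to a strace output file written with -f -o.
# Prints every kill/tkill/tgkill call whose arguments mention SIGINT;
# the leading number on each printed line is the PID that made the call.
find_sigint_senders() {
    grep -E 'kill\([^)]*SIGINT' "$1"
}
```

If the function prints nothing, no traced process sent SIGINT through these system calls, which would point to a sender outside the process tree that strace followed.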
>>
>>> The stage 1 script is not careful with exit codes, so it continues
>>> after the failed rsync and eventually exits with success. This tricks
>>> pgpool into continuing with stage 2, but it's definitely the stage 1
>>> command that is failing.
>>>
>>> Sean
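
The exit-code problem described above comes from piping the rsync output into another command: the script then sees the status of the last command in the pipe, not of rsync. A minimal sketch of one way to keep the real status follows. It is not the actual do-base-backup.sh; the helper name run_and_report is hypothetical, and cat stands in for the mailx step.

```shell
#!/bin/sh
# Sketch only: run a command, hand its output to a reporting step,
# and return the command's own exit status to the caller.
# run_and_report is a hypothetical helper, not part of pgpool.
run_and_report() {
    # Temporary file for the command's combined stdout and stderr.
    log=$(mktemp) || return 1

    # "$@" is the command to run, for example the rsync invocation.
    "$@" >"$log" 2>&1
    status=$?

    # Stand-in for the reporting step (mailx in the thread).
    cat "$log"
    rm -f "$log"

    # Report the command's status, not the reporting step's.
    return "$status"
}

# Usage inside a stage 1 script: stop with the failing status so the
# caller sees a non-zero exit code.
#   run_and_report rsync ... "$SOURCE/" "$DESTINATION/" || exit $?
```

In bash, `set -o pipefail` or the `PIPESTATUS` array would allow keeping the original pipe; the temporary file is used here only so the sketch also runs under a plain POSIX sh.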
>>>
>>> On 14-03-21 06:20 AM, Tatsuo Ishii wrote:
>>>>> Sorry, the subject line should have said stage *1*.
>>>> Really? From what I read in the pgpool log:
>>>>
>>>> 2014-03-20 12:42:43 LOG: pid 18259: 1st stage is done
>>>> 2014-03-20 12:42:43 LOG: pid 18259: starting 2nd stage
>>>> 2014-03-20 12:42:47 LOG: pid 18259: CHECKPOINT in the 2nd stage done
>>>> 2014-03-20 12:42:47 LOG: pid 18259: starting recovery command: "SELECT
>>>> pgpool_recovery('pgpool_recovery_pitr.sh', 'psql02.compusult.net',
>>>> '/var/lib/pgsql/9.2/data')"
>>>> 2014-03-20 12:42:49 LOG: pid 18259: check_postmaster_started: try to
>>>> connect to postmaster on hostname:psql02.compusult.net
>>>> database:postgres user:postgres (retry 0 times)
>>>>
>>>> I saw "1st stage is done", so I guess the first stage succeeded
>>>> but the second stage failed. What does the second stage look
>>>> like?
>>>>
>>>> Best regards,
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>>
>>>>> On 14-03-20 12:48 PM, Sean Hogan wrote:
>>>>>> Hi,
>>>>>>
>>>>>> In my setup at the moment I have a pair of version 3.3.2 pgpool
>>>>>> instances with two backend PostgreSQL 9.2.4 servers, all running on
>>>>>> CentOS 6.4. The PostgreSQL data directories are quite large - 144GB.
>>>>>> I have run into a situation where pcp_recovery_node consistently fails
>>>>>> with a BackendError.
>>>>>>
>>>>>> The stage 1 recovery command is a script called do-base-backup.sh that
>>>>>> runs an rsync as follows:
>>>>>>
>>>>>> rsync -Cacvv --delete \
>>>>>> --exclude postmaster.pid --exclude postmaster.opts \
>>>>>> --exclude recovery.done \
>>>>>> --exclude pg_log/\* --exclude pg_xlog/\* \
>>>>>> $SOURCE/ $DESTINATION/ 2>&1 |
>>>>>> mailx -s "rsync verbose output" sean at compusult.net
>>>>>>
>>>>>> For some reason this rsync is failing after some minutes (typically 10
>>>>>> to 12) with undocumented exit code 255. The verbose rsync logging
>>>>>> says this:
>>>>>>
>>>>>> Killed by signal 2.
>>>>>> rsync: writefd_unbuffered failed to write 4 bytes to socket [sender]:
>>>>>> Broken pipe (32)
>>>>>> rsync: connection unexpectedly closed (50735 bytes received so far)
>>>>>> [sender]
>>>>>> rsync error: unexplained error (code 255) at io.c(600) [sender=3.0.6]
>>>>>>
>>>>>> Googling has not brought up anything helpful other than bugs with
>>>>>> large files in older versions of rsync. I'm fairly certain that is
>>>>>> not the case here, especially because of the "Killed by signal 2",
>>>>>> which is suggestive of some sort of timeout on the pgpool end.
>>>>>>
>>>>>> The specific command line I'm using to recover the second database
>>>>>> node is:
>>>>>>
>>>>>> sudo -u postgres /usr/local/bin/pcp_recovery_node 10000 psql01 9898
>>>>>> postgres XXXXXX 1
>>>>>>
>>>>>> With such a large timeout value I shouldn't be hitting a timeout
>>>>>> there.
>>>>>>
>>>>>> The weird thing, which makes me point the finger at either pgpool or
>>>>>> pcp_recovery_node, is that if I run do-base-backup.sh manually it
>>>>>> works fine (and takes much much longer, as expected).
>>>>>>
>>>>>> Does pgpool have some internal limit on how long it will wait for the
>>>>>> 1st stage command to run? I've attached the log file but it isn't
>>>>>> very informative. (Note that the do-base-backup.sh script isn't
>>>>>> communicating the rsync failure back to pgpool, so pgpool goes ahead
>>>>>> and runs stage 2. Of course, that fails because not everything has
>>>>>> been synced.)
>>>>>>
>>>>>> Thanks,
>>>>>> Sean
>>>>>>
>>>>>>