[Pgpool-general] Cannot add node after failure

Fri Dec 18 13:31:07 UTC 2009

Hello,

In case someone runs into the same problem: the issue was with
password-less ssh and a wrong entry in /etc/hosts.

So, if you run into this problem, check ssh access using your postgres  
user :)
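
A quick way to run that check, as a minimal sketch. The function name check_ssh is made up for illustration. Run it as the postgres user on the node that executes the recovery script, since that is the account whose keys matter.

```shell
#!/bin/sh
# Sketch: verify that the current user can reach a host over ssh
# without being asked for a password.
# check_ssh is a hypothetical helper name, not part of pgpool.
check_ssh() {
    host=$1
    # BatchMode=yes makes ssh fail instead of prompting for a password;
    # ConnectTimeout keeps a bad /etc/hosts entry from hanging the check.
    if ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
        echo "ok: $host"
    else
        echo "FAILED: $host (check ssh keys and /etc/hosts)"
        return 1
    fi
}
```

For example, `check_ssh im-pp3` (im-pp3 being the node name that appears in the log further down), and on Linux `getent hosts im-pp3` shows which address the name actually resolves to.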

Regards,
---

Fernando Marcelo
www.consultorpc.com
fernando at consultorpc.com

On 17/12/2009, at 10:40, Fernando Morgenstern wrote:

> Hello,
>
> Sorry for the lack of replies.
>
> Before checking your email, I realized that recovery was only
> working on the nodes where pgpool had been compiled. I decided to
> compile it on the others and now it works OK (just compiled, I didn't
> run pgpool there).
>
> Over the last day, I have been simulating recoveries by killing
> postgres and shutting down servers at random. Most of the time,
> recovery completes correctly, but there are some specific cases where
> pcp_recovery_node reports that the command is complete but the
> recovery isn't done (e.g. some databases that I created while the
> failed node was down are not present when it starts up).
>
> I tried to isolate the log messages from when this behaviour happens;
> here they are: http://pastebin.ca/1718115
>
> I really don't see anything different, but the fact is that some  
> data is missing on node 1, which is being recovered.
>
> Would you mind suggesting things that I should check, or pointing
> out anything that you think is wrong?
>
>
>
> By the way, I am using pgpool_recovery as both the 1st and 2nd stage
> recovery command in pgpool, and the script looks like this:
>
> #! /bin/sh
> # Copy the PostgreSQL data directories to the node being recovered.
>
> if [ $# -ne 3 ]
> then
>    echo "pgpool_recovery datadir remote_host remote_datadir"
>    exit 1
> fi
>
> datadir=$1
> DEST=$2
> DESTDIR=$3
>
> # Sync each directory in the background, then wait for all transfers.
> rsync -aurz --delete -e ssh $datadir/global/ $DEST:$DESTDIR/global/ &
> rsync -aurz --delete -e ssh $datadir/base/ $DEST:$DESTDIR/base/ &
> rsync -aurz --delete -e ssh $datadir/pg_multixact/ $DEST:$DESTDIR/pg_multixact/ &
> rsync -aurz --delete -e ssh $datadir/pg_subtrans/ $DEST:$DESTDIR/pg_subtrans/ &
> rsync -aurz --delete -e ssh $datadir/pg_clog/ $DEST:$DESTDIR/pg_clog/ &
> rsync -aurz --delete -e ssh $datadir/pg_xlog/ $DEST:$DESTDIR/pg_xlog/ &
> rsync -aurz --delete -e ssh $datadir/pg_twophase/ $DEST:$DESTDIR/pg_twophase/ &
> wait
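
One thing worth noting about the script above: a bare `wait` exits with status 0 no matter how the background jobs ended, so a failed rsync (for instance because ssh could not log in) would not make the script fail. That may be one way for pcp_recovery_node to report success while data is missing. The following is only a sketch of how the exit statuses could be collected; the function name sync_dirs and the SYNC_CMD override are assumptions added for illustration and are not part of the original script.

```shell
#!/bin/sh
# Sketch: run the transfers in parallel but fail if any of them fails.
# sync_dirs and SYNC_CMD are hypothetical names; SYNC_CMD defaults to the
# rsync invocation used in the original script and exists only so the
# function can be exercised without a remote host.
sync_dirs() {
    datadir=$1
    dest=$2
    destdir=$3
    pids=""
    for d in global base pg_multixact pg_subtrans pg_clog pg_xlog pg_twophase; do
        ${SYNC_CMD:-rsync -aurz --delete -e ssh} "$datadir/$d/" "$dest:$destdir/$d/" &
        pids="$pids $!"   # remember the PID of each background job
    done
    status=0
    for pid in $pids; do
        # "wait PID" returns that job's exit status; a bare "wait" returns 0.
        wait "$pid" || status=1
    done
    return $status
}
```

The script would then end with something like `sync_dirs "$1" "$2" "$3" || exit 1`, so that a failed transfer is reported back as a failed recovery command.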
>
> Regards,
> ---
>
> Fernando Marcelo
> www.consultorpc.com
> fernando at consultorpc.com
> Tel: +34 902 998971
> Fax: +34 91 7903701
>
> On 16/12/2009, at 06:47, Tatsuo Ishii wrote:
>
>>> Hello,
>>>
>>> Thanks for your info!
>>>
>>> I was able to make some progress with node recovery when using
>>> pgpool_recovery for both recovery commands.
>>>
>>> I am able to recover most of the time, but sometimes it fails with
>>> the following error:
>>>
>>> $ pcp_recovery_node  -d 90 localhost 9898 postgres ******* 2
>>> DEBUG: send: tos="R", len=46
>>> DEBUG: recv: tos="r", len=21, data=AuthenticationOK
>>> DEBUG: send: tos="D", len=6
>>> DEBUG: recv: tos="e", len=20, data=recovery failed
>>> DEBUG: command failed. reason=recovery failed
>>> BackendError
>>> DEBUG: send: tos="X", len=4
>>>
>>> pgpool log
>>>
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'M'
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: salt sent to the client
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'R'
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: authentication OK
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'O'
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: start online recovery
>>> 2009-12-15 20:10:56 LOG:   pid 8747: starting recovering node 2
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: exec_checkpoint: start checkpoint
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: exec_checkpoint: finish checkpoint
>>> 2009-12-15 20:10:56 LOG:   pid 8747: CHECKPOINT in the 1st stage done
>>> 2009-12-15 20:10:56 LOG:   pid 8747: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery', 'im-pp3', '/usr/local/pgsql/data')"
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: exec_recovery: start recovery
>>> 2009-12-15 20:10:56 ERROR: pid 8747: exec_recovery: pgpool_recovery command failed at 1st stage
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: exec_recovery: finish recovery
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: received PCP packet type of service 'X'
>>> 2009-12-15 20:10:56 DEBUG: pid 8747: pcp_child: client disconnecting. close connection
>>> 2009-12-15 20:11:22 DEBUG: pid 8446: starting health checking
>>>
>>> Unfortunately I am not sure what this error means. Did it fail at
>>> "SELECT pgpool_recovery('pgpool_recovery', 'im-pp3', '/usr/local/pgsql/data')"?
>>> How can I find the reason?
>>
>> The recovery command "pgpool_recovery" failed for some reason. Check
>> the PostgreSQL log on the master node. If that is not clear, try
>> adding -x to the shell line in your pgpool_recovery script, i.e.:
>>
>> #! /bin/sh -x
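
If the -x trace is hard to find among the server messages, it can also be sent to its own file. This is only a sketch: the function name start_trace and the TRACE_LOG path are assumptions chosen for illustration, and it presumes the user running the script can write to that path.

```shell
#!/bin/sh
# Sketch: send the -x trace to a file so it is easy to find when the
# script is run by the PostgreSQL server instead of from a terminal.
# TRACE_LOG and start_trace are hypothetical names.
TRACE_LOG=${TRACE_LOG:-/tmp/pgpool_recovery.trace}
start_trace() {
    exec 2>>"$TRACE_LOG"   # redirect stderr, where -x writes, to the file
    set -x                 # print each command before it is executed
}
```

Calling `start_trace` near the top of pgpool_recovery would then record every command, including the rsync or ssh call that fails.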
>>
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>>
>>> Best Regards,
>>> ---
>>>
>>> Fernando Marcelo
>>> www.consultorpc.com
>>> fernando at consultorpc.com
>>>
>>>
>>> On 15/12/2009, at 13:36, Jaume Sabater wrote:
>>>
>>>> On Tue, Dec 15, 2009 at 4:20 PM, Fernando Morgenstern
>>>> <fernando at consultorpc.com> wrote:
>>>>
>>>>> While reading the pgpool manual I found this:
>>>>> Note that there is a restriction about online recovery. If
>>>>> pgpool-II works on multiple hosts, online recovery does not work
>>>>> correctly, because pgpool-II stops clients on the 2nd stage of
>>>>> online recovery. If there are some pgpool hosts, pgpool-II
>>>>> excepted for receiving online recovery request cannot block
>>>>> connections.
>>>>
>>>> It means running two or more pgpool-II instances simultaneously,
>>>> which won't be your case since, with Heartbeat, you'll configure
>>>> pgpool-II as a resource, hence it will only be active in one node
>>>> at a given time.
>>>>
>>>> -- 
>>>> Jaume Sabater
>>>> http://linuxsilo.net/
>>>>
>>>> "Ubi sapientas ibi libertas"
>>>
>>> _______________________________________________
>>> Pgpool-general mailing list
>>> Pgpool-general at pgfoundry.org
>>> http://pgfoundry.org/mailman/listinfo/pgpool-general