[Pgpool-general] diagnosing BackendError from pcp_recovery_node

Wed Oct 5 22:15:07 UTC 2011

I have a follow_master_command set which calls the pcp_recovery_node script
after editing a config file on the replica which has been degenerated. 
Occasionally this fails with BackendError.  It looks like the node doesn't get
degenerated fast enough and so it's marked as still alive, thus cancelling the
recovery that I just requested:

BTW, my backend hostnames are:

backend_hostname0: pgpool
backend_hostname1: node1
backend_hostname2: node2

2011-10-05 14:30:02 LOG:   pid 31743: execute command:
/var/lib/postgresql/scripts/failover 0 0 node1
Warning: Permanently added 'node1,216.121.61.228' (RSA) to the list of known
hosts.
Warning: Permanently added 'node1,216.121.61.228' (RSA) to the list of known
hosts.
2011-10-05 14:30:14 LOG:   pid 31743: find_primary_node_repeatedly: waiting
for finding a primary node
2011-10-05 14:30:14 LOG:   pid 31743: find_primary_node: primary node id is 1
2011-10-05 14:30:14 LOG:   pid 31743: starting follow degeneration. shutdown
host pgpool(5432)
2011-10-05 14:30:14 LOG:   pid 31743: starting follow degeneration. shutdown
host node2(5432)
2011-10-05 14:30:14 LOG:   pid 31743: failover: 2 follow backends have been
degenerated
2011-10-05 14:30:14 LOG:   pid 31743: failover: set new primary node: 1
2011-10-05 14:30:14 LOG:   pid 31743: failover: set new master node: 1
2011-10-05 14:30:14 LOG:   pid 2941: start triggering follow command.
2011-10-05 14:30:14 LOG:   pid 2941: execute command:
/var/lib/postgresql/scripts/pgpool_remaster_replica pgpool node1 0
2011-10-05 14:30:14 LOG:   pid 31743: failover done. shutdown host pgpool(5432)
DEBUG: send: tos="R", len=44
DEBUG: recv: tos="r", len=21, data=AuthenticationOK
2011-10-05 14:30:15 LOG:   pid 2536: starting recovering node 0
2011-10-05 14:30:15 ERROR: pid 2536: start_recovery: backend node 0 is alive
DEBUG: send: tos="D", len=6
DEBUG: recv: tos="e", len=20, data=recovery failed
DEBUG: command failed. reason=recovery failed
DEBUG: send: tos="X", len=4
BackendError

Notice that the 'shutdown host pgpool(5432)' message was logged AFTER my
follow command was executed, so does this: ERROR: pid 2536: start_recovery:
backend node 0 is alive

mean that the node is not detached or that postgresql is still running? or
something else I'm not thinking of?

Looks like this is the bit of code in question:

        pool_log("starting recovering node %d", recovery_node);

        if (VALID_BACKEND(recovery_node))
        {
                pool_error("start_recovery: backend node %d is alive",
recovery_node);
                return 1;
        }

So, how does it define a VALID_BACKEND and how can I modify my
follow_master_command to allow the pcp_recovery_node to finish successfully?

If I run the pcp_recovery_node again by hand afterward, it always works fine.
I thought perhaps adding a pcp_detach_node before the pcp_recovery_node would
do the trick, but it does not.

/var/lib/postgresql/scripts/pgpool_remaster_replica:
#!/bin/bash

CONF_FILE=/var/lib/postgresql/scripts/pitr-replication.conf
. ${CONF_FILE}
SCRIPT_DIR=$(dirname ${CONF_FILE})
echo "$0 $* run on $(hostname)" >> ${REPLICATION_LOG}

REPLICA=$1
MASTER=$2
NODEID=$3

# Update the MASTER value in the replication config file
${SSH} ${SSH_OPT} ${REPLICA} /bin/sh -c "'sed -e
\"s/MASTER=.*$/MASTER=${MASTER}/\" ${CONF_FILE} > ${CONF_FILE}.swap && mv
${CONF_FILE}.swap ${CONF_FILE}'"

${SSH} ${SSH_OPT} -T ${REPLICA} $PGCTL ${PGCTLOPTIONS} -m fast -w -D ${PGDATA}
stop 2>/dev/null 1>/dev/null < /dev/null &
#/usr/local/bin/pcp_detach_node --debug ${PCPTIMEOUT} ${PGPOOLHOST} ${PCPPORT}
${PGPOOLUSER} ${PGPOOLPASS} ${NODEID}
/usr/local/bin/pcp_recovery_node --debug ${PCPTIMEOUT} ${PGPOOLHOST}
${PCPPORT} ${PGPOOLUSER} ${PGPOOLPASS} ${NODEID}

-- 
Jeff Frost <jeff at pgexperts.com>
CTO, PostgreSQL Experts, Inc.
Phone: 1-888-PG-EXPRT x506
FAX: 415-762-5122
http://www.pgexperts.com/