[Pgpool-hackers] New patches for pcp_promote_node

Wed Mar 9 11:27:45 UTC 2011

On 09/03/2011 10:19, Tatsuo Ishii wrote:
> After further testing, it seems the error occurs when online recovery
> is repeated two or more times. This time I got:
>
> 2011-03-09 18:13:04 ERROR: pid 13569: health check failed. 1 th host /tmp at port 5434 is down
> 2011-03-09 18:13:04 LOG:   pid 13569: set 1 th backend down status
> 2011-03-09 18:13:04 LOG:   pid 13569: starting degeneration. shutdown host /tmp(5434)
> 2011-03-09 18:13:04 LOG:   pid 13569: execute command: /usr/local/etc/failover.sh 1 "/tmp" 5434 /usr/local/pgsql/standby 0 1 "/tmp" 1
> 2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: 0 node is standby
> 2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: no primary node found
> 2011-03-09 18:13:05 LOG:   pid 13569: Primary node id saved: -1
> 2011-03-09 18:13:05 LOG:   pid 13569: failover done. shutdown host /tmp(5434)
> 2011-03-09 18:13:18 LOG:   pid 13604: starting recovering node 1
> 2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect master node.
>
> I ran the tests in the following sequence:
>
> 1) node 0 down, node 1 primary
> 2) recover node 0 (fine)
> 3) node 0 standby, node 1 primary
> 4) node 1 down, node 0 promoted to primary
> 5) recover node 1 and got the above errors
OK, I was able to reproduce the problem. It occurs when the newly promoted
node starts too slowly after the trigger file is created, so that
find_primary_node() cannot connect to it.
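
To illustrate the timing issue, here is a minimal sketch (in Python, not
the pgpool C code) of a primary search that retries until a deadline
instead of giving up after a single pass. Everything in it is an
assumption made for illustration: the name find_primary_with_retry, the
is_primary callback, and the timeout and interval parameters are
hypothetical and do not exist in pgpool. In a real setup the callback
could connect to the backend and test SELECT pg_is_in_recovery(). The
return value -1 mirrors the "Primary node id saved: -1" line in the log.

```python
import time
from typing import Callable, Sequence


def find_primary_with_retry(
    node_ids: Sequence[int],
    is_primary: Callable[[int], bool],
    timeout: float = 10.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Return the id of the first node reported as primary, or -1.

    Hypothetical helper: is_primary(node_id) is supplied by the caller and
    may raise ConnectionError while the promoted node is still starting.
    """
    deadline = clock() + timeout
    while True:
        for node_id in node_ids:
            try:
                if is_primary(node_id):
                    return node_id
            except ConnectionError:
                # Node not accepting connections yet; try the next one.
                continue
        if clock() >= deadline:
            # No primary showed up before the deadline.
            return -1
        sleep(interval)
```

With a single pass (timeout=0) a slow promotion yields -1, which matches
the failure above; a longer timeout gives the promoted node time to
accept connections before the search gives up.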

Forget this patch for the moment; I don't have time to work on it right
now. I'm also fairly sure I've already fixed that somewhere. I will
check and fix it ASAP. Sorry for the noise.

Regards,

-- 
Gilles Darold
http://dalibo.com - http://dalibo.org