[Pgpool-hackers] New patches for pcp_promote_node

Thu Mar 10 00:03:41 UTC 2011

>> By further testing, it seems the error occurs when online recovery
>> repeats two or more times. This time I got:
>>
>> 2011-03-09 18:13:04 ERROR: pid 13569: health check failed. 1 th host /tmp at port 5434 is down
>> 2011-03-09 18:13:04 LOG:   pid 13569: set 1 th backend down status
>> 2011-03-09 18:13:04 LOG:   pid 13569: starting degeneration. shutdown host /tmp(5434)
>> 2011-03-09 18:13:04 LOG:   pid 13569: execute command: /usr/local/etc/failover.sh 1 "/tmp" 5434 /usr/local/pgsql/standby 0 1 "/tmp" 1
>> 2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: 0 node is standby
>> 2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: no primary node found
>> 2011-03-09 18:13:05 LOG:   pid 13569: Primary node id saved: -1
>> 2011-03-09 18:13:05 LOG:   pid 13569: failover done. shutdown host /tmp(5434)
>> 2011-03-09 18:13:18 LOG:   pid 13604: starting recovering node 1
>> 2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect master node.
>>
>> I did the testing in the following sequence:
>>
>> 1) node 0 down, node 1 primary
>> 2) recover node 0 (fine)
>> 3) node 0 standby, node 1 primary
>> 4) node 1 down, node 0 promotes to primary
>> 5) recover node 1 and got above errors
> Ok, I was able to reproduce the problem. It occurs when the newly
> promoted node starts too slowly after the trigger file is created, so
> that find_primary_node() cannot connect to it.
> 
> Forget this patch for the moment, I don't have time to work on it for
> now. I'm also pretty sure I've already fixed that somewhere. I will
> check and fix that asap, sorry for the noise.

Hmm. In your patches you changed the condition that checks whether the
node is a standby from this:

    SELECT pg_is_in_recovery() AND pgpool_walrecrunning()

to this:

    not (SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning())

which is logically equivalent (by De Morgan's law) to:

    SELECT pg_is_in_recovery() OR pgpool_walrecrunning()
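
The equivalence can be checked by enumerating all four input
combinations. This is only a sketch in Python rather than SQL; the
function names patched() and simplified() are made up for illustration
and are not part of pgpool.

```python
from itertools import product

def patched(in_recovery, walrec_running):
    # Condition as written in the patch: not (not A AND not B)
    return not ((not in_recovery) and (not walrec_running))

def simplified(in_recovery, walrec_running):
    # Condition after applying De Morgan's law: A OR B
    return in_recovery or walrec_running

# Compare the two forms for every combination of the two booleans.
equivalent = all(
    patched(a, b) == simplified(a, b)
    for a, b in product([False, True], repeat=2)
)
print("equivalent:", equivalent)
```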

The problem is that pg_is_in_recovery() still returns true while the
node is being promoted to primary. So find_primary_node() cannot find
the primary node if the promotion is too slow.

However, this one:

    SELECT pg_is_in_recovery() AND pgpool_walrecrunning()

returns true only if the node is a standby *AND* not promoting. If the
node is promoting, the WAL receiver process is not running, and that is
what pgpool_walrecrunning() checks (otherwise we would not need
pgpool_walrecrunning() at all).
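
To make the difference concrete, here is a small sketch in Python that
evaluates both conditions for each node state. The flag values for
"standby" and "promoting" follow the description above; the values for
"primary" (both false) are my assumption, and all names in the sketch
are made up for illustration.

```python
# (pg_is_in_recovery(), pgpool_walrecrunning()) for each node state.
# "primary" = (False, False) is an assumption, not stated in this mail.
STATES = {
    "primary":   (False, False),
    "standby":   (True,  True),
    "promoting": (True,  False),
}

def is_standby_and(in_recovery, walrec_running):
    # Original check: standby only if both are true.
    return in_recovery and walrec_running

def is_standby_or(in_recovery, walrec_running):
    # Patched check, simplified: standby if either is true.
    return in_recovery or walrec_running

def classify(check):
    # Map each state name to "is it reported as a standby?"
    return {name: check(*flags) for name, flags in STATES.items()}

and_result = classify(is_standby_and)
or_result = classify(is_standby_or)
for name in STATES:
    print(name, "AND:", and_result[name], "OR:", or_result[name])
```

With the OR form a promoting node is still reported as a standby, so
no node qualifies as primary until the promotion has finished.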

In summary, I think you need to revert the patches for
find_primary_node().
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp