[Pgpool-hackers] New patches for pcp_promote_node

Thu Mar 10 07:47:24 UTC 2011

On 10/03/2011 01:03, Tatsuo Ishii wrote:
>>> By further testing, it seems the error occurs when online recovery
>>> repeats two or more times. This time I got:
>>>
>>> 2011-03-09 18:13:04 ERROR: pid 13569: health check failed. 1 th host /tmp at port 5434 is down
>>> 2011-03-09 18:13:04 LOG:   pid 13569: set 1 th backend down status
>>> 2011-03-09 18:13:04 LOG:   pid 13569: starting degeneration. shutdown host /tmp(5434)
>>> 2011-03-09 18:13:04 LOG:   pid 13569: execute command: /usr/local/etc/failover.sh 1 "/tmp" 5434 /usr/local/pgsql/standby 0 1 "/tmp" 1
>>> 2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: 0 node is standby
>>> 2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: no primary node found
>>> 2011-03-09 18:13:05 LOG:   pid 13569: Primary node id saved: -1
>>> 2011-03-09 18:13:05 LOG:   pid 13569: failover done. shutdown host /tmp(5434)
>>> 2011-03-09 18:13:18 LOG:   pid 13604: starting recovering node 1
>>> 2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect master node.
>>>
>>> I tested in the following sequence:
>>>
>>> 1) node 0 down, node 1 primary
>>> 2) recover node 0 (fine)
>>> 3) node 0 standby, node 1 primary
>>> 4) node 1 down, node 0 promotes to primary
>>> 5) recover node 1 and got above errors
>> OK, I was able to reproduce the problem. It occurs when the newly
>> promoted node starts too slowly after the trigger file is created, so
>> that find_primary_node() cannot connect to it.
>>
>> Forget this patch for the moment; I don't have time to work on it
>> right now. I'm also pretty sure I've already fixed that somewhere. I
>> will check and fix it asap, sorry for the noise.
> Hum. In your patches you changed the condition to check if the node is
> the standby or not:
>
>     SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
>
> to this:
>
>     not (SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning())
>
> which is logically equal to:
>
>     SELECT pg_is_in_recovery() OR pgpool_walrecrunning()
>
> The problem is that pg_is_in_recovery() returns true even while the
> node is promoting to primary. So find_primary_node() cannot find the
> primary node if the promotion is too slow.
>
> However, this one:
>
>     SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
>
> returns true only if the node is a standby *AND* not promoting. If the
> node is promoting, the WAL receiver process is not running, which is
> what pgpool_walrecrunning() checks (otherwise we wouldn't need
> pgpool_walrecrunning() at all).
>
> In summary, I think you need to revert the patches for
> find_primary_node().
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
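
For reference, the difference between the two conditions quoted above can be checked with a small truth table. This is only a sketch in Python, not pgpool code: the function names and the three node states are illustrative assumptions. The flag values for a promoting node (pg_is_in_recovery() true, WAL receiver not running) are taken from the explanation in the quoted message.

```python
from itertools import product


def original_check(in_recovery, walrec_running):
    # SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
    return in_recovery and walrec_running


def patched_check(in_recovery, walrec_running):
    # not (SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning())
    return not (not in_recovery and not walrec_running)


# De Morgan: the patched condition is equivalent to "A OR B".
assert all(
    patched_check(a, b) == (a or b)
    for a, b in product([False, True], repeat=2)
)

# Hypothetical node states as (pg_is_in_recovery, pgpool_walrecrunning).
STATES = {
    "primary": (False, False),
    "streaming standby": (True, True),
    "promoting": (True, False),  # still in recovery, WAL receiver stopped
}


def classify(check):
    """Return, per state, whether the given check treats the node as standby."""
    return {
        name: "standby" if check(*flags) else "not standby"
        for name, flags in STATES.items()
    }


if __name__ == "__main__":
    for label, check in (("original", original_check), ("patched", patched_check)):
        for name, verdict in classify(check).items():
            print(f"{label:8s} {name:18s} -> {verdict}")
```

With the original condition a promoting node is not reported as a standby, while the patched condition still reports it as a standby until promotion completes, which matches the "no primary node found" log above.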

Yes, I agree, but that doesn't cover all cases either. Please take a look
at the following bug report:
http://pgfoundry.org/pipermail/pgpool-hackers/2011-January/000525.html

We need to fix that. Any ideas? I attached a video demonstrating the
problem to the last reply in that thread.

-- 
Gilles Darold
http://dalibo.com - http://dalibo.org