[Pgpool-hackers] New patches for pcp_promote_node

Wed Mar 9 09:19:10 UTC 2011

>> BTW, after applying the patches I got the following errors while doing
>> online recovery.  In my testing node 0 is in down status and is the
>> recovery target. Node 1 is up and running as primary node. This worked
>> perfectly before applying your patches. Thoughts?
>>
>> 2011-03-09 08:37:17 ERROR: pid 15531: health check failed. 0 th host /tmp at port 5433 is down
>> 2011-03-09 08:37:17 LOG:   pid 15531: set 0 th backend down status
>> 2011-03-09 08:37:17 LOG:   pid 15531: starting degeneration. shutdown host /tmp(5433)
>> 2011-03-09 08:37:17 LOG:   pid 15531: execute command: /usr/local/etc/failover.sh 0 "/tmp" 5433 /usr/local/pgsql/data 1 0 "/tmp" 0
>> 2011-03-09 08:37:17 LOG:   pid 15531: find_primary_node: 1 node is standby
>> 2011-03-09 08:37:17 LOG:   pid 15531: find_primary_node: no primary node found
>> 2011-03-09 08:37:17 LOG:   pid 15531: Primary node id saved: -1
>> 2011-03-09 08:37:17 LOG:   pid 15531: failover done. shutdown host /tmp(5433)
>> 2011-03-09 08:37:34 LOG:   pid 15566: starting recovering node 0
>> 2011-03-09 08:37:34 ERROR: pid 15566: start_recover: could not connect master node.
> 
> Have you applied the entire patch? Any rejects?

No.

> I mean this error
> appears when the change in find_primary_node() has not been made. Please
> take a look, you must have:
> 
>     SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
> 
> replaced by:
> 
>     SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning()
> 
> and the response comparison: strcmp(res->data[0], "t") replaced by
> strcmp(res->data[0], "f")
> 
> Could you please check that? I will check again on my side to see if I
> forgot something in the patch.

All of the above seems to be fine.
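
For reference, the two queries quoted above are not plain negations of
each other. The following is a minimal sketch, in Python rather than
pgpool's C, that tabulates both for every combination of the two
function results. The names old_predicate, new_predicate and
truth_table are illustrative only, and how pgpool maps the result to
primary or standby is deliberately not modeled.

```python
from itertools import product


def old_predicate(in_recovery: bool, walrec_running: bool) -> bool:
    # SELECT pg_is_in_recovery() AND pgpool_walrecrunning()
    return in_recovery and walrec_running


def new_predicate(in_recovery: bool, walrec_running: bool) -> bool:
    # SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning()
    return (not in_recovery) and (not walrec_running)


def sql_bool(value: bool) -> str:
    # PostgreSQL prints booleans as "t" / "f" in text format,
    # which is the string the strcmp() in the thread compares against.
    return "t" if value else "f"


def truth_table():
    # One row per combination:
    # (in_recovery, walrec_running, old result, new result)
    rows = []
    for in_recovery, walrec_running in product([False, True], repeat=2):
        rows.append((
            in_recovery,
            walrec_running,
            sql_bool(old_predicate(in_recovery, walrec_running)),
            sql_bool(new_predicate(in_recovery, walrec_running)),
        ))
    return rows


if __name__ == "__main__":
    for row in truth_table():
        print(row)
```

The two queries differ when both functions return false and when both
return true. In the mixed cases they both yield "f".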

On further testing, it seems the error occurs when online recovery
is run two or more times. This time I got:

2011-03-09 18:13:04 ERROR: pid 13569: health check failed. 1 th host /tmp at port 5434 is down
2011-03-09 18:13:04 LOG:   pid 13569: set 1 th backend down status
2011-03-09 18:13:04 LOG:   pid 13569: starting degeneration. shutdown host /tmp(5434)
2011-03-09 18:13:04 LOG:   pid 13569: execute command: /usr/local/etc/failover.sh 1 "/tmp" 5434 /usr/local/pgsql/standby 0 1 "/tmp" 1
2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: 0 node is standby
2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: no primary node found
2011-03-09 18:13:05 LOG:   pid 13569: Primary node id saved: -1
2011-03-09 18:13:05 LOG:   pid 13569: failover done. shutdown host /tmp(5434)
2011-03-09 18:13:18 LOG:   pid 13604: starting recovering node 1
2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect master node.
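
When comparing runs like the one above, it can help to pull the
find_primary_node lines out of the log mechanically. This is a rough
sketch only: the regular expression is inferred from the lines pasted
in this mail, not from any documented pgpool log format, and
parse_line and find_primary_results are hypothetical helper names.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# Pattern inferred from the log lines in this thread (an assumption).
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+):\s+pid (?P<pid>\d+): (?P<msg>.*)$"
)


class Entry(NamedTuple):
    ts: str
    level: str
    pid: int
    msg: str


def parse_line(line: str) -> Optional[Entry]:
    """Split one log line into its fields; return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    return Entry(m["ts"], m["level"], int(m["pid"]), m["msg"])


def find_primary_results(lines: List[str]) -> List[Tuple[int, str]]:
    """Return (pid, message) for every find_primary_node line."""
    results = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry.msg.startswith("find_primary_node:"):
            results.append((entry.pid, entry.msg))
    return results


if __name__ == "__main__":
    sample = [
        "2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: 0 node is standby",
        "2011-03-09 18:13:05 LOG:   pid 13569: find_primary_node: no primary node found",
        "2011-03-09 18:13:18 ERROR: pid 13604: start_recover: could not connect master node.",
    ]
    for pid, msg in find_primary_results(sample):
        print(pid, msg)
```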

I tested in the following sequence:

1) node 0 down, node 1 primary
2) recover node 0 (fine)
3) node 0 standby, node 1 primary
4) node 1 down, node 0 is promoted to primary
5) recover node 1 (got the errors above)
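
The sequence above can be written down as a small state table to make
explicit which node is expected to be primary at each step. This is
only an illustrative model of the test procedure, not pgpool code:
the role names, the apply_event helper and the -1 return value
(chosen to mirror "Primary node id saved: -1" in the log) are
assumptions made for the sketch.

```python
from typing import Dict


def apply_event(state: Dict[int, str], event: str, node: int) -> Dict[int, str]:
    """Return a new node-role map after one step of the test sequence."""
    new_state = dict(state)
    if event == "down":
        new_state[node] = "down"
    elif event == "recover":
        # A node brought back by online recovery rejoins as a standby.
        new_state[node] = "standby"
    elif event == "promote":
        new_state[node] = "primary"
    else:
        raise ValueError(f"unknown event: {event}")
    return new_state


def expected_primary(state: Dict[int, str]) -> int:
    """Return the id of the single primary node, or -1 if there is none."""
    primaries = [n for n, role in sorted(state.items()) if role == "primary"]
    return primaries[0] if len(primaries) == 1 else -1


if __name__ == "__main__":
    state = {0: "down", 1: "primary"}          # step 1
    state = apply_event(state, "recover", 0)   # steps 2 and 3
    state = apply_event(state, "down", 1)      # step 4, node 1 goes down
    state = apply_event(state, "promote", 0)   # step 4, node 0 promoted
    print(state, expected_primary(state))
```

Under this model node 0 is the expected primary when step 5 starts,
whereas the log above shows find_primary_node reporting node 0 as
standby and saving -1 as the primary node id.
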
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp