[pgpool-hackers: 2321] Corner case bug in primary failover
ishii at sraoss.co.jp
Tue May 9 10:16:53 JST 2017
While adding a new test case to the 003.failover regression test, I found
a corner case bug in primary failover.
Suppose Pgpool-II has started but has not yet found the primary node. If a
primary failover happens at that point, it skips finding the primary node and
lets the initial value (Req_info->primary_node_id == -1) be used as the new
primary node id. As a result, no primary node id exists until next
Initially I thought the problem was in the code of
pgpool_main.c:failover(), which tries to optimize finding the primary node
    /*
     * If the down node was a standby node in streaming replication
     * mode, we can avoid calling find_primary_node_repeatedly() and
     * recognize the former primary as the new primary node, which
     * will reduce the time to process standby down.
     */
    else if (MASTER_SLAVE && pool_config->master_slave_sub_mode == STREAM_MODE &&
             reqkind == NODE_DOWN_REQUEST)
    {
        if (Req_info->primary_node_id != node_id)
            new_primary = Req_info->primary_node_id;
        else
            new_primary = find_primary_node_repeatedly();
    }
I was attempting to fix it by checking whether Req_info->primary_node_id
still has its initial value (-1). If it is -1,
find_primary_node_repeatedly() needs to be called.
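The check described above can be sketched as follows. This is only a minimal illustration, not the actual patch: choose_new_primary(), UNKNOWN_PRIMARY, search_calls and stub_find_primary_node_repeatedly() are hypothetical names, and the stub returns a fixed node id instead of polling the backends.

```c
#define UNKNOWN_PRIMARY (-1)   /* initial value of Req_info->primary_node_id */

/* Counts how often the (stubbed) primary search is invoked. */
static int search_calls = 0;

/* Hypothetical stand-in for find_primary_node_repeatedly(): the real
 * function searches the backends; this stub just reports node 1. */
static int stub_find_primary_node_repeatedly(void)
{
    search_calls++;
    return 1;
}

/*
 * Decide the new primary after a node-down request.
 *   current_primary: the value of Req_info->primary_node_id
 *   down_node:       the id of the node that went down
 */
static int choose_new_primary(int current_primary, int down_node)
{
    /* Reuse the recorded primary only if it is actually known and it is
     * not the node that just went down. */
    if (current_primary != UNKNOWN_PRIMARY && current_primary != down_node)
        return current_primary;

    /* Otherwise (still -1, or the primary itself went down), search. */
    return stub_find_primary_node_repeatedly();
}
```

With this guard, a standby going down while the primary is known still avoids the search, while the -1 case falls through to the search instead of being kept as the "new primary".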
But looking into pgpool_main() closely, I suspect there's a
1) It processes failover in CHECK_REQUEST *before* setting
    /*
     * check for child signals to ensure child startup before reporting successfull start
     */
    CHECK_REQUEST;

    ereport(LOG,
            (errmsg("%s successfully started. version %s (%s)", PACKAGE, VERSION, PGPOOLVERSION)));

    /*
     * if the primary node id is not loaded by watchdog, search for it
     */
    if (Req_info->primary_node_id < 0)
    {
        /* Save primary node id */
        Req_info->primary_node_id = find_primary_node();
    }
2) It uses find_primary_node() rather than
find_primary_node_repeatedly(). So if, for some reason (for example,
the backend has not come up yet), find_primary_node() fails,
Req_info->primary_node_id is set to -1.
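The difference between a one-shot search and a repeated one can be illustrated with a small sketch. It does not reproduce the Pgpool-II functions: probe_primary_once(), probe_primary_with_retry() and probes_until_up are hypothetical, and the retry is bounded by an attempt count purely for illustration (a real implementation would wait between attempts and honor a timeout).

```c
/* Number of probes that will still fail before the simulated backend
 * is considered up. */
static int probes_until_up = 3;

/* Hypothetical one-shot probe: returns the primary node id, or -1 if
 * the backend has not come up yet. */
static int probe_primary_once(void)
{
    if (probes_until_up > 0)
    {
        probes_until_up--;
        return -1;
    }
    return 0;   /* node 0 is the primary in this simulation */
}

/* Repeat the one-shot probe up to max_attempts times and return the
 * first valid node id, or -1 if every attempt failed. */
static int probe_primary_with_retry(int max_attempts)
{
    for (int i = 0; i < max_attempts; i++)
    {
        int id = probe_primary_once();

        if (id >= 0)
            return id;
    }
    return -1;
}
```

A single probe issued while the backend is still starting returns -1 and leaves the primary unknown, whereas the retrying variant finds it as long as the backend comes up within the allowed attempts.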
I think the proper fix is to move the CHECK_REQUEST call above inside the
main loop, and to change the find_primary_node() call to
find_primary_node_repeatedly().
Attached is the patch to do that (plus a change of
search_primary_node_timeout to a smaller value in the 055.backend_all_down
test; otherwise, the regression timeout is triggered) against master
What do you think?
SRA OSS, Inc. Japan