[Pgpool-hackers] Major problem in find_primary_node()

Wed Jan 19 18:43:16 UTC 2011

Hi,

I've found something critical into the way that find_primary_node()
detect if a node is a standby host and by deduction the primary node in
streaming replication mode.

When it call "SELECT pg_is_in_recovery() AND pgpool_walrecrunning()" it
always return false if pg_is_in_recovery = 'true' and
pgpool_walrecrunning = 'false'. That mean that the host can be promoted
as master, the situation will going worse if a failover occurs and
PgPool takes this fake host as the master.

This is not a problem that can occurs easily but let check the following:

- Master has failover to node 1.
- Node 0 is reconstructed but the primary_conninfo host in recovery.conf
is wrong/broken for some reason.

At this stage, both nodes are running with PgPool status 2 and currently
the only problem is that node 0 is not replicated from the master node.

Then if a failover from node 1 occurs, PgPool will try to find a new
master by querying "SELECT pg_is_in_recovery() AND
pgpool_walrecrunning()" to all nodes, but as node 0 will return false
because the wal receiver is not running, PgPool will take it as the new
master, and very bad annoying things will happen from this point.

What I suggest is to try to find the master by finding a node that is
not in recovery mode and where wal receiver is not running. This will
allow a simple fix by changing the SQL query :

SELECT  pg_is_in_recovery() = 'f' AND pgpool_walrecrunning() = 'f';
  or
SELECT not pg_is_in_recovery() AND not pgpool_walrecrunning();

What's your opinion, if every one is ok with that fix I already have a
patch for that.

Regards,

-- 
Gilles Darold
http://dalibo.com - http://dalibo.org