[pgpool-hackers: 1979] New feature candidate: verify standby node while finding primary node

Thu Jan 12 11:34:59 JST 2017

This is a proposal for a new feature for Pgpool-II 3.7.

Currently Pgpool-II identifies the primary node and the standby nodes
as follows (this happens when Pgpool-II starts up or performs a
failover):

1) Issue "SELECT pg_is_in_recovery()" to a node in question.

2) If it returns "t", decide that the node is a standby and go to the
   next node (go back to step 1).

3) If it returns anything else, decide that the node is the
   primary. The remaining nodes are regarded as standbys without
   being checked.
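
To make the weakness concrete, here is a minimal Python sketch of the
logic above. It is not the actual Pgpool-II code (which is written in
C); find_primary and the is_in_recovery callback are hypothetical
names, and the callback stands in for issuing
"SELECT pg_is_in_recovery()" to a node.

```python
from typing import Callable, Dict, List


def find_primary(node_ids: List[int],
                 is_in_recovery: Callable[[int], bool]) -> Dict[int, str]:
    """Sketch of the current logic: the first node that is not in
    recovery becomes the primary; every later node is assumed to be
    a standby and is never queried."""
    roles: Dict[int, str] = {}
    primary_found = False
    for node_id in node_ids:
        if primary_found:
            # Step 3: remaining nodes are regarded as standby, unchecked.
            roles[node_id] = "standby"
        elif is_in_recovery(node_id):
            # Step 2: node is a standby, move on to the next node.
            roles[node_id] = "standby"
        else:
            # Step 3: first node that is not in recovery is the primary.
            roles[node_id] = "primary"
            primary_found = True
    return roles


# Hypothetical cluster where node 1 has been promoted by mistake:
# neither node is in recovery, yet node 1 is still reported as standby.
recovery_state = {0: False, 1: False}
roles = find_primary([0, 1], lambda n: recovery_state[n])
```

With both nodes out of recovery, node 1 is still labeled as a standby,
which is the situation described in the scenario that follows.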

This logic mostly works well, except in an unusual scenario like this:

i) We have two nodes: node 0 is primary, node 1 is standby.

ii) An admin mistakenly issues "pg_ctl promote" on the standby node,
  and node 1 becomes a standalone PostgreSQL server.

In this case, node 1 will eventually fall behind node 0, because no
replication happens. If the replication delay check is enabled,
Pgpool-II avoids sending queries to node 1 because of the replication
delay. However, if the replication delay check is not enabled or the
replication delay threshold is large, users will not notice the
situation.

This scenario is also known as "split brain", which users want to
avoid. I think we need to do something here.

Here is the modified procedure to avoid it.

1) Issue "SELECT pg_is_in_recovery()" to a node in question.

2) If it returns "t", decide that the node is a standby and go to the
   next node (go back to step 1).

3) If it returns anything else, decide that the node is the
   primary. Then check whether each of the remaining nodes is actually
   a standby by issuing "SELECT pg_is_in_recovery()" to it.
   Additionally, if the PostgreSQL version is 9.6 or higher, we could
   use the pg_stat_wal_receiver view to check whether the standby is
   actually connected to the primary node.
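
The modified procedure could be sketched in Python as follows. This is
only an illustration under assumptions, not a proposed implementation:
classify_nodes, is_in_recovery and streams_from are hypothetical names,
and the "invalid" label is a placeholder for whatever status is
eventually chosen. The is_in_recovery callback stands in for
"SELECT pg_is_in_recovery()", and the optional streams_from callback
stands in for a check based on pg_stat_wal_receiver (for example,
comparing its conninfo column with the primary's address).

```python
from typing import Callable, Dict, List, Optional


def classify_nodes(
    node_ids: List[int],
    is_in_recovery: Callable[[int], bool],
    streams_from: Optional[Callable[[int, int], bool]] = None,
) -> Dict[int, str]:
    """Sketch of the modified procedure: find the primary, then verify
    that every other node really is a standby of that primary."""
    # Steps 1-2: walk the nodes until one is not in recovery.
    primary: Optional[int] = None
    for node_id in node_ids:
        if not is_in_recovery(node_id):
            primary = node_id
            break

    roles: Dict[int, str] = {}
    for node_id in node_ids:
        if node_id == primary:
            roles[node_id] = "primary"
        elif primary is not None and not is_in_recovery(node_id):
            # Step 3: a second node outside recovery is not a proper standby.
            roles[node_id] = "invalid"  # hypothetical status label
        elif (primary is not None and streams_from is not None
              and not streams_from(node_id, primary)):
            # Optional 9.6+ check: standby is not connected to the primary.
            roles[node_id] = "invalid"  # hypothetical status label
        else:
            roles[node_id] = "standby"
    return roles


# Node 1 was promoted by mistake, so it is flagged instead of trusted.
recovery_state = {0: False, 1: False}
roles = classify_nodes([0, 1], lambda n: recovery_state[n])
```

For simplicity the sketch queries nodes located before the primary
twice; a real implementation would presumably remember the first
answer.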

The question is what to do if the check in #3 reveals that the node in
question is not a "proper" standby:

- Do we want to add new status code other than "up", "down", "not
  connected" and "unused"?

- Do we want to automatically detach the node so that Pgpool-II does
  not use the node?

- Do we want to run the check more frequently, say at a similar
  interval to health checking?

Comments and suggestions are welcome.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp