[pgpool-hackers: 4391] Re: Proposal: promoting standby which is most up to date to become the new Primary

Tatsuo Ishii ishii at sraoss.co.jp
Tue Aug 29 20:24:42 JST 2023


Hi,

> Hi,
> 
> Currently, pgpool promotes the standby node with minimal node id to become the new primary.

Actually that's not pgpool's decision, but the failover script's
choice.

> If there is replication lag between primary and the standby node,
> it may cause data loss after failover.

Yeah, that's not good.

> I propose to promote the standby, which is most up to date, to become the new Primary.
> This feature is able to prevent data loss.
> 
> I am not sure which senario is better. Do you have any ideas?
> 
> 1. Change get_next_main_node() to return the standby which is most up to date

I suggest not to change get_next_main_node(). As I already said
above, it's not pgpool but the failover script that promotes the
main node. For the purpose (promoting the least delay standby node),
separate information should be provided.

> 2. Add a new PCP command to return the standby which is most up to date.
>    Modify the failover sample script to run the PCP command and set the result to NEW_MAIN_NODE_ID.

Let's step back. I think for the purpose we need to implement two
infrastructure.

(a) Track replication delay and keep it.

At the time when failover script gets called, apparently we cannot
collect replication delay information since the existing primary
already went down. Probably we have to save the last replication delay
information collected by streaming replication check process to
somewhere (probably to shared memory).

(b) Pass the replication delay info to failover script.

Once the information is collected in (a), it needs be handed to the
failover script so that the script can choose the appropriate standby
node to promote. Inventing new PCP command is an idea. Another idea is
adding new parameter to the failover script. For example, we could
have new parameter: %l which is a string representing the least delay
standby node information. Example:

1 host1 5432 /usr/local/pgsql/data

which means:

node id = 1
hostname = host1
port number = 5432
database cluster path = /usr/local/pgsql/data

What do you think?
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp


More information about the pgpool-hackers mailing list