[pgpool-general: 7453] Re: Faulty state of pgPool-II cluster after disconnecting and reconnecting MASTER pgPool from network

Láznička Vladimír Vladimir.Laznicka at cca.cz
Mon Mar 22 17:34:50 JST 2021


Hello,

Thank you for confirming the use of the "trusted_servers" parameter. I am sending you the configuration files of both PG servers. I set "trusted_servers" in the PG1 config but not in the PG2 config, mainly for the rare case when both PG servers get temporarily disconnected from the network, so that I don't end up with both of them dead. I tested disconnecting each of PG1 and PG2 from the network and it seems to work as expected.
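
For reference, the relevant part looks roughly like this (the hostnames are simplified placeholders, not the exact values from the attached files):

    # PG1 pgpool.conf - pgPool shuts itself down if none of these hosts answer ping
    trusted_servers = 'apl1,apl2'

    # PG2 pgpool.conf - deliberately left empty, so PG2 survives a short network outage
    trusted_servers = ''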

However - I also want to ask another set of questions...

The production environment will have two independent datacenters connected via network and half of the servers (for both the database and the application) will be in each, so it will look like this:

Datacenter 1:

DB1
PG1
APL1, APL2

Datacenter 2:

DB2
PG2
APL3, APL4

My main question is - what will happen with pgPool when the network between the datacenters goes down for several minutes and then comes back up, in the case that I use all 4 APL servers as "trusted_servers" for PG1?

Since PG1 can still ping 2 of the 4 APL servers, it won't shut itself down and will stay in the MASTER role, while PG2 will bring up the VIP address and switch itself to the MASTER role because it lost the connection to PG1. At the same time PG1's health check will degenerate DB2, and PG2's health check will degenerate DB1 and promote DB2. But I am not sure what the result will be when the network between the datacenters comes back up and the PG servers start communicating via watchdog again.

Will they recover from the split brain, with one of the PG servers staying MASTER while the other gets demoted to STANDBY?
What will happen with the state of the DB servers - does it depend on which PG server wins (if PG1 becomes MASTER, then DB1 will be up and DB2 down, and the reverse if PG2 becomes MASTER)?
If I use the "wd_priority" parameter and give PG1 the greater priority (see the sketch below), will PG1 always become MASTER?
Is there a possibility of some data loss during the time the watchdog is resolving the split brain?
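
To illustrate the wd_priority question, I imagine the setting would look something like this (the values are only an example):

    # PG1 pgpool.conf - the higher value should win the master election when both nodes are healthy
    wd_priority = 2

    # PG2 pgpool.conf
    wd_priority = 1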

Sadly, I cannot test this in practice because our test environment runs on the same hardware in a single network.
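
Once the production environment exists, I suppose the watchdog and backend state after reconnection could at least be inspected with the pcp tools, for example (the pcp port and user below are just defaults/placeholders):

    # watchdog state as seen from each pgPool node
    pcp_watchdog_info -h pg1 -p 9898 -U pgpool -v
    pcp_watchdog_info -h pg2 -p 9898 -U pgpool -v

    # backend (DB) node status; a degenerated node can then be reattached
    pcp_node_info -h pg1 -p 9898 -U pgpool -n 0
    pcp_node_info -h pg1 -p 9898 -U pgpool -n 1
    pcp_attach_node -h pg1 -p 9898 -U pgpool -n 1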

Thank you for your help.

Vladimír Láznička      

-----Original Message-----
From: Bo Peng <pengbo at sraoss.co.jp> 
Sent: Wednesday, March 17, 2021 9:02 AM
To: Láznička Vladimír <Vladimir.Laznicka at cca.cz>
Cc: pgpool-general at pgpool.net
Subject: Re: [pgpool-general: 7435] Faulty state of pgPool-II cluster after disconnecting and reconnecting MASTER pgPool from network

Hi,

> I have been testing the ability of pgPool to recover from certain disaster scenarios and found one particular situation which ended with the database being completely unavailable without manual intervention. As for the cluster - I have 2 servers with a pgPool service (version 4.1.5) with Watchdog and 2 servers with a PostgreSQL service (version 11.11) configured with streaming replication. By default the servers have the following roles:
> 
> PG1 - MASTER pgPool node with VIP address brought up.
> PG2 - STANDBY pgPool node.
> DB1 - PRIMARY PostgreSQL server.
> DB2 - STANDBY PostgreSQL server.
> 
> The problematic situation came up when I tried to disconnect the PG1 server from the network without shutting it down. The result was that the VIP was brought up on the PG2 server, but since PG1 had lost all communication with the other servers, it retained the VIP address as well (thinking it was actually the PG2 server that went down, effectively causing a split-brain situation between PG1 and PG2), and at the same time its health check failed for both DB servers. At this moment it was still fine from the outside, since PG2 was working as a connection point for clients, with both DB servers attached.
> 
> However - when I connected the PG1 server to the network again, it started communicating with the PG2 server via Watchdog and apparently went on to resolve the split-brain situation by demoting the PG2 server to the STANDBY role and bringing down its VIP address. But at the same time it caused the degeneration of both DB servers, probably because the health check from the PG1 node had been failing while it was off the network. The result was that both DB nodes were down and the database became inaccessible to clients.
> 
> I am sending you a log from the PG1 server which describes the situation.
> 
> Now when I tried the same with the PG2 server (which is the STANDBY node by default), it went much the same way for the first half of the scenario (it promoted itself to MASTER and brought up the VIP address while disconnected from the network), but when I connected it back to the network, the VIP on that server was brought down and both DB servers were still attached to the cluster without any impact on clients - so it returned to the default state.
> 
> After that I came up with a solution: use the trusted_servers parameter to "kill" the pgPool service on the PG1 server in case it is cut off from the network, while NOT using it on the PG2 server, for the case that both PG servers briefly drop off the network, so that both PG servers don't end up effectively dead from the client's point of view. Can you please advise whether this solution will work in general and won't cause any additional problems with the database being unavailable, or whether I should rather look for a different solution? Sadly, I don't have the option to add a third PG server to prevent a split-brain situation outright.

In this case, you should specify "trusted_servers" in pgpool.conf.
By specifying this parameter, PG1 will be marked as "NODE DEAD" when it loses the network.
I can't find any "trusted_servers" configuration in your pgpool.conf.
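For example, something along these lines (the hostnames are placeholders; choose hosts that should always be reachable, such as gateways or the application servers):

    trusted_servers = 'server1,server2,server3'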
 
> I am sending you the pgpool.conf from the PG2 server (the config on PG1 was the same except for IP addresses in the Watchdog section being reversed).
> 
> Let me know, if you have any idea how to handle this situation. Thank you for your time.
> 
> With best regards,
> Vladimír Láznička


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PG1-pgpool.conf
Type: application/octet-stream
Size: 43714 bytes
Desc: PG1-pgpool.conf
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20210322/a5d55412/attachment-0002.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: PG2-pgpool.conf
Type: application/octet-stream
Size: 43674 bytes
Desc: PG2-pgpool.conf
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20210322/a5d55412/attachment-0003.obj>
