[pgpool-general: 7436] Re: Faulty state of pgPool-II cluster after disconnecting and reconnecting MASTER pgPool from network

Bo Peng pengbo at sraoss.co.jp
Wed Mar 17 17:01:39 JST 2021


Hi,

> I have been testing the ability of Pgpool-II to recover from certain disaster scenarios and found one particular situation that ended with the database being completely unavailable without manual intervention. As for the cluster: I have 2 servers running the Pgpool-II service (version 4.1.5) with Watchdog and 2 servers running the PostgreSQL service (version 11.11) configured with streaming replication. By default the servers have the following roles:
> 
> PG1 - MASTER pgPool node with VIP address brought up.
> PG2 - STANDBY pgPool node.
> DB1 - PRIMARY PostgreSQL server.
> DB2 - STANDBY PostgreSQL server.
> 
> The problematic situation came up when I tried to disconnect server PG1 from the network without shutting it down. The result was that the VIP was brought up on the PG2 server, but since PG1 had lost all communication with the other servers, it retained the VIP address as well (believing it was actually PG2 that went down), effectively causing a split-brain situation between PG1 and PG2; at the same time its health checks failed for both DB servers. At this moment the cluster was still fine from the outside, since PG2 was working as a connection point for any client, with both DB servers attached.
> 
> However, when I connected the PG1 server to the network again, it re-established communication with PG2 via Watchdog and apparently resolved the split-brain situation by demoting the PG2 server to the STANDBY role and bringing down its VIP address. But at the same time it caused the degeneration of both DB servers, probably because the health checks from the PG1 node had been failing while it was off the network. The result was that both DB nodes were marked down and the database became inaccessible to clients.
> 
> I am sending you a log from the PG1 server which describes the situation.
> 
> Now, when I tried the same with the PG2 server (the STANDBY node by default), the first half of the scenario went much the same way (it promoted itself to MASTER status and brought up the VIP address while disconnected from the network), but when I reconnected it, the VIP on that server was brought down and both DB servers remained attached to the cluster without any impact on clients, so everything returned to the default state.
> 
> After that I came up with a solution: use the trusted_servers parameter to "kill" the Pgpool-II service on the PG1 server when it is cut off from the network, while NOT using it on the PG2 server, so that a brief network outage affecting both PG servers does not leave them both effectively dead for clients. Can you please advise whether this solution will work in general and won't cause any additional unavailability, or whether I should look for a different solution? Unfortunately I don't have the option of adding a third PG server to prevent the split-brain situation.

In this case, you should specify "trusted_servers" in pgpool.conf.
By specifying this parameter, PG1 will be marked as "NODE DEAD"
when it loses contact with the trusted servers, instead of taking
over the cluster. I can't find a "trusted_servers" setting in the
pgpool.conf you sent.
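For reference, a minimal sketch of the setting (the addresses below are placeholders; use hosts on your network that are always reachable and respond to ping, such as the default gateway or a DNS server):

```
# pgpool.conf -- example only; substitute your own always-up hosts.
# If pgpool cannot ping any of these, the watchdog treats the local
# node as isolated from the network rather than assuming the peer died.
trusted_servers = '192.168.0.1,192.168.0.2'
```

Note that ICMP ping must be permitted from the Pgpool-II hosts to the listed servers for this check to work.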
 
> I am sending you the pgpool.conf from the PG2 server (the config on PG1 was the same except for IP addresses in the Watchdog section being reversed).
> 
> Let me know, if you have any idea how to handle this situation. Thank you for your time.
> 
> With best regards,
> Vladimír Láznička


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan

