[pgpool-general: 5961] Re: Split-brain remedy

Pierre Timmermans ptim007 at yahoo.com
Thu Mar 1 19:13:13 JST 2018


I think this is totally expected behavior: this pgpool instance discovered that it cannot ping the trusted server, so it commits suicide to avoid a split-brain scenario. You should check that the other pgpool took over as cluster leader and that it acquired the VIP.
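For example, on the surviving node (10.0.0.2 in your case) something like this should show it as leader and holding the VIP (the interface name and the delegate IP below are placeholders):

pcp_watchdog_info -h 10.0.0.2 -p 9898 -w
ip addr show eth0 | grep <delegate IP>
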
So it looks good!
Pierre 

    On Wednesday, February 28, 2018, 11:01:08 PM GMT+1, Alexander Dorogensky <amazinglifetime at gmail.com> wrote:  
 
It looks like the pgpool child crashes (see below), but I'm not sure.
So the question remains: is it a bug or expected behavior?

DEBUG:  watchdog trying to ping host "10.0.0.100"
WARNING:  watchdog failed to ping host"10.0.0.100"
DETAIL:  ping process exits with code: 2
WARNING:  watchdog lifecheck, failed to connect to any trusted servers
LOG:  informing the node status change to watchdog
DETAIL:  node id :0 status = "NODE DEAD" message:"trusted server is unreachable"
LOG:  new IPC connection received
LOCATION:  watchdog.c:3319
LOG:  received node status change ipc message
DETAIL:  trusted server is unreachable
DEBUG:  processing node status changed to DEAD event for node ID:0
STATE MACHINE INVOKED WITH EVENT = THIS NODE LOST Current State = MASTER
WARNING:  watchdog lifecheck reported, we are disconnected from the network
DETAIL:  changing the state to LOST
DEBUG:  removing all watchdog nodes from the standby list
DETAIL:  standby list contains 1 nodes
LOG:  watchdog node state changed from [MASTER] to [LOST]
DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State = LOST
FATAL:  system has lost the network
LOG:  Watchdog is shutting down
DEBUG:  sending packet, watchdog node:[10.0.0.2:5432 Linux alex2] command id:[67] type:[INFORM I AM GOING DOWN] state:[LOST]
DEBUG:  sending watchdog packet to socket:7, type:[X], command ID:67, data Length:0
DEBUG:  sending watchdog packet, command id:[67] type:[INFORM I AM GOING DOWN] state :[LOST]
DEBUG:  new cluster command X issued with command id 67
LOG:  watchdog: de-escalation started
DEBUG:  shmem_exit(-1): 0 callbacks to make
DEBUG:  proc_exit(-1): 0 callbacks to make
DEBUG:  shmem_exit(3): 0 callbacks to make
DEBUG:  proc_exit(3): 1 callbacks to make
DEBUG:  exit(3)
DEBUG:  shmem_exit(-1): 0 callbacks to make
DEBUG:  proc_exit(-1): 0 callbacks to make
DEBUG:  reaper handler
DEBUG:  watchdog child process with pid: 30288 exit with FATAL ERROR. pgpool-II will be shutdown
LOG:  watchdog child process with pid: 30288 exits with status 768
FATAL:  watchdog child process exit with fatal error. exiting pgpool-II
LOG:  setting the local watchdog node name to "10.0.0.1:5432 Linux alex1"
LOG:  watchdog cluster is configured with 1 remote nodes
LOG:  watchdog remote node:0 on 10.0.0.2:9000
LOG:  interface monitoring is disabled in watchdog
DEBUG:  pool_write: to backend: 0 kind:X
DEBUG:  pool_flush_it: flush size: 5
...
DEBUG:  shmem_exit(-1): 0 callbacks to make
...
DEBUG:  lifecheck child receives shutdown request signal 2, forwarding to all children
DEBUG:  lifecheck child receives fast shutdown request
DEBUG:  watchdog heartbeat receiver child receives shutdown request signal 2
DEBUG:  shmem_exit(-1): 0 callbacks to make
DEBUG:  proc_exit(-1): 0 callbacks to make
...

On Wed, Feb 28, 2018 at 1:53 PM, Pierre Timmermans <ptim007 at yahoo.com> wrote:

I am using pgpool inside a Docker container, so I cannot tell what the service command would say.
I think you should have a look at the pgpool log file at the moment you unplug the interface: it will probably say something about the fact that it cannot reach the trusted_server, and that it will exclude itself from the cluster (I am not sure). You can also start pgpool in debug mode to get extra logging; I think I validated that in the past, but I cannot find the doc anymore.
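For example (from memory; -n keeps pgpool in the foreground and -d enables debug logging):

pgpool -n -d
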
You can also execute the following command:
pcp_watchdog_info -h <ip pgpool> -p 9898 -w
It will return information about the watchdog, among other things the cluster quorum status.
NB: due to a bug in the packaging, if you installed pgpool from the postgres yum repositories (and not from pgpool's own) then pcp_watchdog_info will not be in the PATH (but in a directory somewhere, I forget which).
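If that is the case, something like this should locate the binary without knowing the exact packaging layout:

find /usr -name pcp_watchdog_info 2>/dev/null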


Pierre 

    On Wednesday, February 28, 2018, 5:37:49 PM GMT+1, Alexander Dorogensky <amazinglifetime at gmail.com> wrote:  
 
With 'trusted_servers' configured, when I unplug 10.0.0.1, pgpool dies, i.e. 'service pgpool status' reports 'pgpool dead but subsys locked'.
Is that how it should be?

Plug/unplug = ifconfig eth0 up/down


On Tue, Feb 27, 2018 at 1:49 PM, Pierre Timmermans <ptim007 at yahoo.com> wrote:

To prevent this split-brain scenario (caused by a network partition) you can use the trusted_servers configuration parameter. This setting is a list of servers that pgpool can use to determine whether a node is suffering a network partition or not. If a node cannot reach any of the servers in the list, then it will assume it is isolated (by a network partition) and will not promote itself to master.
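In pgpool.conf it looks like this (the addresses below are only illustrative; use hosts that are always up and independent of the two pgpool nodes, such as the default gateway):

# pgpool.conf
trusted_servers = '10.0.0.254,8.8.8.8'
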
In general, I believe it is not safe to do an automatic failover when you have only two nodes, unless you have some kind of fencing mechanism (meaning you can shut a failed node down and prevent it from coming back after a failure).
Pierre 

    On Tuesday, February 27, 2018, 7:58:55 PM GMT+1, Alexander Dorogensky <amazinglifetime at gmail.com> wrote:  
 
 Hi All,

I have a 10.0.0.1/10.0.0.2 master/hot-standby configuration with streaming replication, where each node runs postgres and pgpool with watchdog enabled.

I shut down the network interface on 10.0.0.1 and wait until 10.0.0.2 triggers failover and promotes itself to master through my failover script.

Now the watchdogs on 10.0.0.1 and 10.0.0.2 are out of sync: they have conflicting views of which node has failed, and both think they are master.

When I bring back the network interface on 10.0.0.1, 'show pool_nodes' says that 10.0.0.1 is master/up and 10.0.0.2 is standby/down. 

I want 10.0.0.1 to be standby and 10.0.0.2 to be master. 

I've been playing with the failover script, e.g.:

if (default network gateway is not pingable) {
    shut down pgpool and postgres on this node
} else if (this node is standby) {
    promote this node to master
    create a job that will run every minute and try to recover the failed node (base backup)
    cancel the job upon successful recovery
}
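
In shell it would be roughly this (only a sketch; GATEWAY, ROLE and PGDATA are placeholders I would have to fill in):

#!/bin/sh
# failover sketch: step down when isolated, promote when the peer is down
GATEWAY=10.0.0.254            # placeholder: default gateway
if ! ping -c 1 -W 2 "$GATEWAY" >/dev/null 2>&1; then
    # cannot reach the gateway: this node is isolated, so step down
    pgpool -m fast stop
    pg_ctl -D "$PGDATA" stop -m fast
elif [ "$ROLE" = "standby" ]; then
    # the other node really failed: take over
    pg_ctl -D "$PGDATA" promote
    # plus a cron job that retries a base backup of the failed node
    # every minute and removes itself once recovery succeeds
fi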

Can you please help me with this? Any ideas would be highly appreciated.

Regards, Alex