[pgpool-general: 5965] Re: Split-brain remedy

Alexander Dorogensky amazinglifetime at gmail.com
Fri Mar 2 02:22:32 JST 2018


I believe it happens because, when pgpool terminates itself, it doesn't remove
the PID file that was created when it was started with the 'service' command.
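
A quick way to confirm and clean up before restarting pgpool with 'service'; a
minimal sketch, and the paths are assumptions (they depend on your init script
and on pid_file_name in pgpool.conf):

    # make sure no pgpool process is actually left running
    pgrep -f pgpool || echo "no pgpool process found"

    # the init script's subsys lock and pgpool's own PID file may be left behind
    ls -l /var/lock/subsys/pgpool /var/run/pgpool/pgpool.pid 2>/dev/null

    # remove the stale files only after verifying pgpool is really stopped
    rm -f /var/lock/subsys/pgpool /var/run/pgpool/pgpool.pid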

On Thu, Mar 1, 2018 at 11:08 AM, Pierre Timmermans <ptim007 at yahoo.com>
wrote:

> Not sure why, but I would think that pgpool stops with a non-zero return
> code, so the service command reports it as a failure. That is probably good,
> because then you can monitor it and see that this node requires an
> intervention (restart pgpool once the network partition is resolved).
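>
> A minimal monitoring probe along those lines (the alert command and address
> are hypothetical, and 'service' exit codes vary by distro):
>
>     # a non-zero status means pgpool is not running on this node
>     if ! service pgpool status > /dev/null 2>&1; then
>         echo "pgpool is down on $(hostname), manual restart needed" \
>             | mail -s "pgpool alert" ops@example.com
>     fi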
>
> Rgds,
>
> Pierre
>
>
> On Thursday, March 1, 2018, 5:40:36 PM GMT+1, Alexander Dorogensky <
> amazinglifetime at gmail.com> wrote:
>
>
> Shutting down pgpool does make sense! My doubts are due to the service not
> terminating gracefully... any ideas why?
>
> Other than that, the other pgpool takes over and all looks good.
>
> Thank you Pierre!
>
> On Thu, Mar 1, 2018 at 4:13 AM, Pierre Timmermans <ptim007 at yahoo.com>
> wrote:
>
> I think this is totally expected behavior: this pgpool instance discovered
> that it cannot ping the trusted server, so it commits suicide to avoid a
> split-brain scenario. You should check that the other pgpool took over as
> cluster leader and that it acquired the VIP.
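>
> For example, on the surviving node (the interface name is an assumption, and
> the virtual IP is whatever delegate_IP is set to in your pgpool.conf):
>
>     # confirm this node is now the watchdog master/leader
>     pcp_watchdog_info -h 10.0.0.2 -p 9898 -w
>
>     # confirm the virtual IP is actually bound to the interface
>     ip addr show eth0 | grep '<your delegate_IP>'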
>
> So it looks good!
>
> Pierre
>
>
> On Wednesday, February 28, 2018, 11:01:08 PM GMT+1, Alexander Dorogensky <
> amazinglifetime at gmail.com> wrote:
>
>
> It looks like a pgpool child crashes... see below, but I'm not sure.
> So the question remains: is it a bug or expected behavior?
>
> DEBUG:  watchdog trying to ping host "10.0.0.100"
> WARNING:  watchdog failed to ping host"10.0.0.100"
> DETAIL:  ping process exits with code: 2
> WARNING:  watchdog lifecheck, failed to connect to any trusted servers
> LOG:  informing the node status change to watchdog
> DETAIL:  node id :0 status = "NODE DEAD" message:"trusted server is
> unreachable"
> LOG:  new IPC connection received
> LOCATION:  watchdog.c:3319
> LOG:  received node status change ipc message
> DETAIL:  trusted server is unreachable
> DEBUG:  processing node status changed to DEAD event for node ID:0
> STATE MACHINE INVOKED WITH EVENT = THIS NODE LOST Current State = MASTER
> WARNING:  watchdog lifecheck reported, we are disconnected from the network
> DETAIL:  changing the state to LOST
> DEBUG:  removing all watchdog nodes from the standby list
> DETAIL:  standby list contains 1 nodes
> LOG:  watchdog node state changed from [MASTER] to [LOST]
> DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State =
> LOST
> FATAL:  system has lost the network
> LOG:  Watchdog is shutting down
> DEBUG:  sending packet, watchdog node:[10.0.0.2:5432 Linux alex2] command
> id:[67] type:[INFORM I AM GOING DOWN] state:[LOST]
> DEBUG:  sending watchdog packet to socket:7, type:[X], command ID:67, data
> Length:0
> DEBUG:  sending watchdog packet, command id:[67] type:[INFORM I AM GOING
> DOWN] state :[LOST]
> DEBUG:  new cluster command X issued with command id 67
> LOG:  watchdog: de-escalation started
> DEBUG:  shmem_exit(-1): 0 callbacks to make
> DEBUG:  proc_exit(-1): 0 callbacks to make
> DEBUG:  shmem_exit(3): 0 callbacks to make
> DEBUG:  proc_exit(3): 1 callbacks to make
> DEBUG:  exit(3)
> DEBUG:  shmem_exit(-1): 0 callbacks to make
> DEBUG:  proc_exit(-1): 0 callbacks to make
> DEBUG:  reaper handler
> DEBUG:  watchdog child process with pid: 30288 exit with FATAL ERROR.
> pgpool-II will be shutdown
> LOG:  watchdog child process with pid: 30288 exits with status 768
> FATAL:  watchdog child process exit with fatal error. exiting pgpool-II
> LOG:  setting the local watchdog node name to "10.0.0.1:5432 Linux alex1"
> LOG:  watchdog cluster is configured with 1 remote nodes
> LOG:  watchdog remote node:0 on 10.0.0.2:9000
> LOG:  interface monitoring is disabled in watchdog
> DEBUG:  pool_write: to backend: 0 kind:X
> DEBUG:  pool_flush_it: flush size: 5
> ...
> DEBUG:  shmem_exit(-1): 0 callbacks to make
> ...
> DEBUG:  lifecheck child receives shutdown request signal 2, forwarding to
> all children
> DEBUG:  lifecheck child receives fast shutdown request
> DEBUG:  watchdog heartbeat receiver child receives shutdown request signal
> 2
> DEBUG:  shmem_exit(-1): 0 callbacks to make
> DEBUG:  proc_exit(-1): 0 callbacks to make
> ...
>
> On Wed, Feb 28, 2018 at 1:53 PM, Pierre Timmermans <ptim007 at yahoo.com>
> wrote:
>
> I am using pgpool inside a docker container, so I cannot tell what the
> service command will say.
>
> I think you should have a look at the pgpool log file at the moment you
> unplug the interface: it will probably say something about the fact that it
> cannot reach the trusted_server and that it will exclude itself from the
> cluster (I am not sure). You can also start pgpool in debug mode to get
> extra logging. I think I validated that in the past, but I cannot find the
> doc anymore.
>
> You can also execute the following command:
>
> pcp_watchdog_info -h <ip pgpool> -p 9898 -w
>
> it will return information about the watchdog, among other things the
> cluster quorum status.
>
> NB: due to a bug in the packaging by postgres, if you installed pgpool from
> the postgres yum repositories (and not from pgpool) then pcp_watchdog_info
> will not be on the PATH (but in a directory somewhere, I forgot which).
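>
> If it is not on your PATH, something like this should locate it (I do not
> remember the exact package layout):
>
>     find / -name pcp_watchdog_info -type f 2>/dev/null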
>
>
>
> Pierre
>
>
> On Wednesday, February 28, 2018, 5:37:49 PM GMT+1, Alexander Dorogensky <
> amazinglifetime at gmail.com> wrote:
>
>
> With 'trusted_servers' configured, when I unplug 10.0.0.1 it kills pgpool,
> i.e. 'service pgpool status' reports 'pgpool dead but subsys locked'.
> Is that how it should be?
>
> Plug/unplug = ifconfig eth0 up/down
>
>
>
> On Tue, Feb 27, 2018 at 1:49 PM, Pierre Timmermans <ptim007 at yahoo.com>
> wrote:
>
> To prevent this split brain scenario (caused by a network partition) you
> can use the configuration trusted_servers. This setting is a list of
> servers that pgpool can use to determine if a node is suffering a network
> partition or not. If a node cannot reach any of the servers in the list,
> then it will assume it is isolated (by a network partition) and will not
> promote itself to master.
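>
> For example, in pgpool.conf (these particular addresses are only an
> illustration; pick hosts that are always reachable on your network, such as
> the default gateway or a DNS server):
>
>     # comma-separated list of hosts used to detect a network partition
>     trusted_servers = '10.0.0.254,8.8.8.8'
>     # directory containing the ping command used for the check
>     ping_path = '/bin'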
>
> In general, when you have only two nodes, I believe it is not safe to do an
> automatic failover, unless you have some kind of fencing mechanism (meaning
> you can shut down a failed node and prevent it from coming back after a
> failure).
>
> Pierre
>
>
> On Tuesday, February 27, 2018, 7:58:55 PM GMT+1, Alexander Dorogensky <
> amazinglifetime at gmail.com> wrote:
>
>
> Hi All,
>
> I have a 10.0.0.1/10.0.0.2 master/hot standby configuration with
> streaming replication, where each node runs pgpool with watchdog enabled
> and postgres.
>
> I shut down the network interface on 10.0.0.1 and wait until 10.0.0.2
> triggers failover and promotes itself to master through my failover script.
>
> Now the watchdogs on 10.0.0.1 and 10.0.0.2 are out of sync: they have
> conflicting views on which node has failed, and both think they are master.
>
> When I bring back the network interface on 10.0.0.1, 'show pool_nodes'
> says that 10.0.0.1 is master/up and 10.0.0.2 is standby/down.
>
> I want 10.0.0.1 to be standby and 10.0.0.2 to be master.
>
> I've been playing with the failover script, e.g.:
>
> if (default network gateway is pingable) {
>     shut down pgpool and postgres
> } else if (this node is standby) {
>     promote this node to master
>     create a job that will run every minute and try to recover failed node
> (base backup)
>     cancel the job upon successful recovery
> }
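>
> A rough bash rendering of that outline, just as a sketch: the gateway
> address, data directory, trigger file and the recover_peer.sh helper are all
> hypothetical, and in a real pgpool failover_command the failed node's id and
> host would be passed in as arguments (%d, %H, ...):
>
>     #!/bin/bash
>     GATEWAY=10.0.0.254                  # default network gateway
>     PGDATA=/var/lib/pgsql/data
>     TRIGGER=$PGDATA/failover.trigger    # trigger_file from recovery.conf
>
>     if ping -c 1 -w 2 "$GATEWAY" > /dev/null 2>&1; then
>         # gateway reachable: per the outline above, shut down pgpool and postgres
>         pgpool -m fast stop
>         pg_ctl -D "$PGDATA" -m fast stop
>     elif [ -f "$PGDATA/recovery.conf" ]; then
>         # still a standby: promote this node to master via the trigger file
>         touch "$TRIGGER"
>         # retry recovery of the failed node every minute (pg_basebackup etc.);
>         # recover_peer.sh is expected to remove this cron entry once it succeeds
>         echo '* * * * * root /usr/local/bin/recover_peer.sh' > /etc/cron.d/recover_peer
>     fi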
>
> Can you please help me with this? Any ideas would be highly appreciated.
>
> Regards, Alex
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
>
>
>
>
>