[pgpool-general: 5959] Re: Split-brain remedy

Alexander Dorogensky amazinglifetime at gmail.com
Thu Mar 1 07:01:06 JST 2018


It looks like the pgpool watchdog child crashes (see the log below), but I'm not sure.
So the question remains: is this a bug or expected behavior?

DEBUG:  watchdog trying to ping host "10.0.0.100"
WARNING:  watchdog failed to ping host"10.0.0.100"
DETAIL:  ping process exits with code: 2
WARNING:  watchdog lifecheck, failed to connect to any trusted servers
LOG:  informing the node status change to watchdog
DETAIL:  node id :0 status = "NODE DEAD" message:"trusted server is
unreachable"
LOG:  new IPC connection received
LOCATION:  watchdog.c:3319
LOG:  received node status change ipc message
DETAIL:  trusted server is unreachable
DEBUG:  processing node status changed to DEAD event for node ID:0
STATE MACHINE INVOKED WITH EVENT = THIS NODE LOST Current State = MASTER
WARNING:  watchdog lifecheck reported, we are disconnected from the network
DETAIL:  changing the state to LOST
DEBUG:  removing all watchdog nodes from the standby list
DETAIL:  standby list contains 1 nodes
LOG:  watchdog node state changed from [MASTER] to [LOST]
DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State =
LOST
FATAL:  system has lost the network
LOG:  Watchdog is shutting down
DEBUG:  sending packet, watchdog node:[10.0.0.2:5432 Linux alex2] command
id:[67] type:[INFORM I AM GOING DOWN] state:[LOST]
DEBUG:  sending watchdog packet to socket:7, type:[X], command ID:67, data
Length:0
DEBUG:  sending watchdog packet, command id:[67] type:[INFORM I AM GOING
DOWN] state :[LOST]
DEBUG:  new cluster command X issued with command id 67
LOG:  watchdog: de-escalation started
DEBUG:  shmem_exit(-1): 0 callbacks to make
DEBUG:  proc_exit(-1): 0 callbacks to make
DEBUG:  shmem_exit(3): 0 callbacks to make
DEBUG:  proc_exit(3): 1 callbacks to make
DEBUG:  exit(3)
DEBUG:  shmem_exit(-1): 0 callbacks to make
DEBUG:  proc_exit(-1): 0 callbacks to make
DEBUG:  reaper handler
DEBUG:  watchdog child process with pid: 30288 exit with FATAL ERROR.
pgpool-II will be shutdown
LOG:  watchdog child process with pid: 30288 exits with status 768
FATAL:  watchdog child process exit with fatal error. exiting pgpool-II
LOG:  setting the local watchdog node name to "10.0.0.1:5432 Linux alex1"
LOG:  watchdog cluster is configured with 1 remote nodes
LOG:  watchdog remote node:0 on 10.0.0.2:9000
LOG:  interface monitoring is disabled in watchdog
DEBUG:  pool_write: to backend: 0 kind:X
DEBUG:  pool_flush_it: flush size: 5
...
DEBUG:  shmem_exit(-1): 0 callbacks to make
...
DEBUG:  lifecheck child receives shutdown request signal 2, forwarding to
all children
DEBUG:  lifecheck child receives fast shutdown request
DEBUG:  watchdog heartbeat receiver child receives shutdown request signal 2
DEBUG:  shmem_exit(-1): 0 callbacks to make
DEBUG:  proc_exit(-1): 0 callbacks to make
...
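
For reference, the lifecheck behavior above is driven by the watchdog settings in
pgpool.conf. A minimal sketch of the relevant entries for this 10.0.0.1/10.0.0.2 pair
(the trusted server address and the ports are taken from the log; the other values are
assumptions):

    use_watchdog = on
    wd_hostname = '10.0.0.1'              # '10.0.0.2' on the other node
    wd_port = 9000
    trusted_servers = '10.0.0.100'        # lifecheck pings these; if none answer, the node declares itself LOST
    ping_path = '/bin'
    wd_lifecheck_method = 'heartbeat'
    wd_interval = 10
    heartbeat_destination0 = '10.0.0.2'
    heartbeat_destination_port0 = 9694
    other_pgpool_hostname0 = '10.0.0.2'
    other_pgpool_port0 = 5432
    other_wd_port0 = 9000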

On Wed, Feb 28, 2018 at 1:53 PM, Pierre Timmermans <ptim007 at yahoo.com>
wrote:

> I am using pgpool inside a docker container so I cannot tell what the
> service command will say
>
> I think you should have a look at the pgpool log file at the moment you
> unplug the interface: it will probably say that it cannot reach the
> trusted_server and that it will exclude itself from the cluster (I am not
> sure). You can also start pgpool in debug mode to get extra logging. I think
> I validated that in the past, but I cannot find the doc anymore.
>
> You can also execute the following command:
>
> pcp_watchdog_info -h <ip pgpool> -p 9898 -w
>
> it will return information about the watchdog, among other things the
> cluster quorum state
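>
> For example, one way to script that check (a rough sketch, not part of
> pgpool itself; it assumes pcp_watchdog_info is in the PATH, a ~/.pcppass
> file so that -w works, and that the verbose output contains a "Quorum state"
> line, which can vary between pgpool versions):
>
>     import subprocess
>
>     def quorum_state(host, pcp_port=9898):
>         """Return the 'Quorum state' reported by pcp_watchdog_info -v."""
>         out = subprocess.run(
>             ["pcp_watchdog_info", "-h", host, "-p", str(pcp_port), "-v", "-w"],
>             capture_output=True, text=True, check=True).stdout
>         for line in out.splitlines():
>             if line.strip().startswith("Quorum state"):
>                 return line.split(":", 1)[1].strip()
>         return "unknown"
>
>     print(quorum_state("10.0.0.1"))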
>
> NB: due to a bug in the PostgreSQL packaging, if you installed pgpool from
> the PostgreSQL yum repositories (and not from pgpool.net) then
> pcp_watchdog_info will not be in the PATH (but in a directory somewhere, I
> forgot which)
>
>
>
> Pierre
>
>
> On Wednesday, February 28, 2018, 5:37:49 PM GMT+1, Alexander Dorogensky <
> amazinglifetime at gmail.com> wrote:
>
>
> With 'trusted_servers' configured, when I unplug 10.0.0.1 it kills pgpool,
> i.e. 'service pgpool status' reports 'pgpool dead but subsys locked'.
> Is that how it should be?
>
> Plug/unplug = ifconfig eth0 up/down
>
>
>
> On Tue, Feb 27, 2018 at 1:49 PM, Pierre Timmermans <ptim007 at yahoo.com>
> wrote:
>
> To prevent this split brain scenario (caused by a network partition) you
> can use the configuration trusted_servers. This setting is a list of
> servers that pgpool can use to determine if a node is suffering a network
> partition or not. If a node cannot reach any of the servers in the list,
> then it will assume it is isolated (by a network partition) and will not
> promote itself to master.
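>
> As a rough illustration of that decision (the trusted server list here is
> just an example; pgpool does this internally by pinging each entry in
> trusted_servers):
>
>     import subprocess
>
>     TRUSTED_SERVERS = ["10.0.0.100"]   # e.g. the default gateway or another always-up host
>
>     def network_isolated():
>         """True when none of the trusted servers answer a single ping."""
>         for host in TRUSTED_SERVERS:
>             if subprocess.run(["ping", "-c", "1", "-W", "2", host],
>                               stdout=subprocess.DEVNULL,
>                               stderr=subprocess.DEVNULL).returncode == 0:
>                 return False   # at least one trusted server is reachable
>         return True            # nobody reachable: assume this node is partitioned
>
> A node for which network_isolated() returns True should not promote itself.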
>
> In general, when you have only two nodes, I believe it is not safe to do
> automatic failover unless you have some kind of fencing mechanism (meaning
> you can shut a failed node down and prevent it from coming back after a
> failure).
>
> Pierre
>
>
> On Tuesday, February 27, 2018, 7:58:55 PM GMT+1, Alexander Dorogensky <
> amazinglifetime at gmail.com> wrote:
>
>
> Hi All,
>
> I have a 10.0.0.1/10.0.0.2 master/hot standby configuration with
> streaming replication, where each node runs pgpool with watchdog enabled
> and postgres.
>
> I shut down the network interface on 10.0.0.1 and wait until 10.0.0.2
> triggers failover and promotes itself to master through my failover script.
>
> Now the watchdogs on 10.0.0.1 and 10.0.0.2 are out of sync: they have
> conflicting views of which node has failed, and both think they are master.
>
> When I bring back the network interface on 10.0.0.1, 'show pool_nodes'
> says that 10.0.0.1 is master/up and 10.0.0.2 is standby/down.
>
> I want 10.0.0.1 to be standby and 10.0.0.2 to be master.
>
> I've been playing with the failover script, e.g.:
>
> if (default network gateway is pingable) {
>     shut down pgpool and postgres
> } else if (this node is standby) {
>     promote this node to master
>     create a job that will run every minute and try to recover failed node
> (base backup)
>     cancel the job upon successful recovery
> }
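>
> For the "try to recover the failed node every minute" part, a minimal sketch
> using pcp_recovery_node (the node id, pcp port and host are assumptions, and
> online recovery must already be configured on the primary):
>
>     import subprocess, time
>
>     FAILED_NODE_ID = 0     # backend id of the node to re-attach (assumption)
>
>     def recover_until_done():
>         """Re-run pcp_recovery_node (base backup) once a minute until it succeeds."""
>         while subprocess.run(["pcp_recovery_node", "-h", "localhost", "-p", "9898",
>                               "-n", str(FAILED_NODE_ID), "-w"]).returncode != 0:
>             time.sleep(60)
>
>     recover_until_done()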
>
> Can you please help me with this? Any ideas would be highly appreciated.
>
> Regards, Alex
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
>
>
>