[pgpool-general: 7780] Re: Fwd: watch_dog cluster down since "system has lost the network"

Bo Peng pengbo at sraoss.co.jp
Wed Oct 13 13:59:45 JST 2021


Hello,

Thank you for sharing the network config.

> 2021-09-20 15:53:37: pid 1900172: WARNING:  network IP is removed and system has no IP is assigned
> 2021-09-20 15:53:37: pid 1900172: DETAIL:  changing the state to in network trouble

> In pgpool.conf, after changing wd_monitoring_interfaces_list from empty
> ("") to ens10f1, the problem seems to be resolved. I am wondering whether a
> status change on another network interface could cause this issue.

Do you mean that the network status changes 3 to 4 hours after startup?
Is it possible to share the initial network status?
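
For comparing the initial status with a later snapshot, a small parser over
saved `ip addr` output may help. The following is only a sketch, not part of
Pgpool-II: the function names parse_ip_addr and usable_interfaces are
hypothetical, the sample text is a shortened copy of the output quoted below,
and it assumes the output is saved unwrapped (one header line per interface).

```python
import re

# Header line, e.g. "7: ens10f1: <BROADCAST,...> mtu 1500 ... state UP ..."
HEADER = re.compile(r"^(\d+):\s+([^:@\s]+)[^:]*:\s+<([^>]*)>.*?\bstate\s+(\S+)")
# IPv4 address line, e.g. "    inet 192.168.1.121/24 brd ..." (inet6 is skipped)
INET = re.compile(r"^\s+inet\s+(\S+)")


def parse_ip_addr(text):
    """Return {interface: {"flags": [...], "state": str, "ipv4": [...]}}."""
    interfaces = {}
    current = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            current = m.group(2)
            interfaces[current] = {
                "flags": m.group(3).split(","),
                "state": m.group(4),
                "ipv4": [],
            }
            continue
        m = INET.match(line)
        if m and current is not None:
            interfaces[current]["ipv4"].append(m.group(1))
    return interfaces


def usable_interfaces(interfaces):
    """Non-loopback interfaces with carrier (LOWER_UP) and an IPv4 address."""
    return sorted(
        name
        for name, info in interfaces.items()
        if "LOOPBACK" not in info["flags"]
        and "LOWER_UP" in info["flags"]
        and info["ipv4"]
    )


# Shortened, unwrapped sample based on the output quoted in this thread.
SAMPLE = """\
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    inet 127.0.0.1/8 scope host lo
2: eno145: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
7: ens10f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    inet 192.168.1.121/24 brd 192.168.1.255 scope global noprefixroute ens10f1
"""

if __name__ == "__main__":
    print(usable_interfaces(parse_ip_addr(SAMPLE)))
```

Running the same function on a snapshot taken right after startup and on one
taken after the failure would show whether the set of interfaces with a
carrier and an IPv4 address changed in between.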

> The network config for one of the watch_dog nodes is:
> 
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: eno145: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 20:04:0f:f1:c2:48 brd ff:ff:ff:ff:ff:ff
>     altname enp24s0f0
> 3: eno146: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 20:04:0f:f1:c2:49 brd ff:ff:ff:ff:ff:ff
>     altname enp24s0f1
> 4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 20:04:0f:f1:c2:4a brd ff:ff:ff:ff:ff:ff
>     altname enp25s0f0
> 5: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 20:04:0f:f1:c2:4b brd ff:ff:ff:ff:ff:ff
>     altname enp25s0f1
> 6: ens10f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
>     link/ether 00:1b:21:bd:58:0e brd ff:ff:ff:ff:ff:ff
>     altname enp177s0f0
> 7: ens10f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
>     link/ether 00:1b:21:bd:58:0f brd ff:ff:ff:ff:ff:ff
>     altname enp177s0f1
>     inet 192.168.1.121/24 brd 192.168.1.255 scope global noprefixroute ens10f1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::1614:8143:8362:6611/64 scope link noprefixroute
>        valid_lft forever preferred_lft forever
> 
> 
> In pgpool.conf, after changing wd_monitoring_interfaces_list from empty
> ("") to ens10f1, the problem seems to be resolved. I am wondering whether a
> status change on another network interface could cause this issue.
> 
> Bo Peng <pengbo at sraoss.co.jp> wrote on Mon, Oct 4, 2021 at 11:45 AM:
> 
> > Hello,
> >
> > Sorry for the late response.
> >
> > > Hi,
> > >
> > > I have two PG servers and three watch_dog nodes to set up a PG HA
> > > environment.
> > >
> > >    - OS: Ubuntu 20.04
> > >    - PG version:12.8
> > >    - Pgpool version: 4.1.4
> > >
> > >
> > >    - PG -primary: 192.168.1.122
> > >    - PG -slave: 192.168.1.121
> > >    - Watch_dog node0: 192.168.1.122
> > >    - Watch_dog node1: 192.168.1.121
> > >    - Watch_dog node2: 192.168.1.101
> > >
> > >
> > > The HA environment works fine, but after 3-4 hours two watch_dog nodes
> > > go down, leaving only one watch_dog node (192.168.1.101) running. The
> > > log of the watch_dog leader shows the error below, although the network
> > > IP 192.168.1.122 is alive.
> > >
> > > 2021-09-20 15:53:37: pid 1900172: WARNING:  network IP is removed and system has no IP is assigned
> > > 2021-09-20 15:53:37: pid 1900172: DETAIL:  changing the state to in network trouble
> > > 2021-09-20 15:53:37: pid 1900172: DEBUG:  removing all watchdog nodes from the standby list
> >
> > I think it may be caused by a temporary network problem.
> > Does this issue occur every time?
> >
> > > 2021-09-20 15:53:37: pid 1900172: DETAIL:  standby list contains 1 nodes
> > > 2021-09-20 15:53:37: pid 1900172: DEBUG:  Removing all failover objects
> > > 2021-09-20 15:53:37: pid 1900172: LOG:  watchdog node state changed from [MASTER] to [IN NETWORK TROUBLE]
> > > 2021-09-20 15:53:37: pid 1900172: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State = IN NETWORK TROUBLE
> > > 2021-09-20 15:53:37: pid 1900172: FATAL:  system has lost the network
> > > 2021-09-20 15:53:37: pid 1900172: LOG:  Watchdog is shutting down
> > > 2021-09-20 15:53:37: pid 1900172: DEBUG:  sending packet, watchdog node:[ 192.168.1.101:9999 Linux dell-PowerEdge-R740] command id:[1113] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
> > > 2021-09-20 15:53:37: pid 1900172: DEBUG:  sending watchdog packet to socket:8, type:[X], command ID:1113, data Length:0
> > > 2021-09-20 15:53:37: pid 1933141: LOG:  watchdog: de-escalation started
> > > 2021-09-20 15:53:37: pid 1933141: DEBUG:  watchdog exec interface up/down command: '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev ens2f0' succeeded
> > > 2021-09-20 15:53:37: pid 1933141: LOG:  successfully released the delegate IP:"192.168.1.129"
> > > 2021-09-20 15:53:37: pid 1933141: DETAIL:  'if_down_cmd' returned with success
> > > 2021-09-20 15:53:37: pid 1900168: DEBUG:  reaper handler
> > > 2021-09-20 15:53:37: pid 1900168: DEBUG:  watchdog child process with pid: 1900172 exit with FATAL ERROR. pgpool-II will be shutdown
> > > 2021-09-20 15:53:37: pid 1900168: LOG:  watchdog child process with pid: 1900172 exits with status 768
> > > 2021-09-20 15:53:37: pid 1900168: FATAL:  watchdog child process exit with fatal error. exiting pgpool-II
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  setting the local watchdog node name to "192.168.1.122:9999 Linux dell-PowerEdge-R740"
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog cluster is configured with 2 remote nodes
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog remote node:0 on 192.168.1.121:9000
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog remote node:1 on 192.168.1.101:9000
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  interface monitoring is disabled in watchdog
> > > 2021-09-20 15:53:37: pid 1933148: INFO:  IPC socket path: "/tmp/.s.PGPOOLWD_CMD.9000"
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog node state changed from [DEAD] to [LOADING]
> > > 2021-09-20 15:53:37: pid 1933148: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State = LOADING
> > > 2021-09-20 15:53:37: pid 1933148: DEBUG:  error in outbound connection to 192.168.1.121:9000
> > > 2021-09-20 15:53:37: pid 1933148: DETAIL:  Connection refused
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  new outbound connection to 192.168.1.101:9000
> > > 2021-09-20 15:53:37: pid 1900189: DEBUG:  lifecheck child receives shutdown request signal 2, forwarding to all children
> > > 2021-09-20 15:53:37: pid 1900189: DEBUG:  lifecheck child receives fast shutdown request
> > > 2021-09-20 15:53:37: pid 1933148: LOG:  Watchdog is shutting down
> > >
> > > Please refer to the pgpool.conf and running log on each server. Any
> > > advice on how to fix it?
> >
> >
> > --
> > Bo Peng <pengbo at sraoss.co.jp>
> > SRA OSS, Inc. Japan
> > http://www.sraoss.co.jp/
> >


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan
http://www.sraoss.co.jp/

