[pgpool-general: 7743] Re: Fwd: watch_dog cluster down since "system has lost the network"

Forest Lin zhijia.lin at gmail.com
Tue Oct 5 23:37:10 JST 2021


The network configuration for one of the watch_dog nodes is:

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eno145: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 20:04:0f:f1:c2:48 brd ff:ff:ff:ff:ff:ff
    altname enp24s0f0
3: eno146: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 20:04:0f:f1:c2:49 brd ff:ff:ff:ff:ff:ff
    altname enp24s0f1
4: eno3: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 20:04:0f:f1:c2:4a brd ff:ff:ff:ff:ff:ff
    altname enp25s0f0
5: eno4: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 20:04:0f:f1:c2:4b brd ff:ff:ff:ff:ff:ff
    altname enp25s0f1
6: ens10f0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc mq state DOWN group default qlen 1000
    link/ether 00:1b:21:bd:58:0e brd ff:ff:ff:ff:ff:ff
    altname enp177s0f0
7: ens10f1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP group default qlen 1000
    link/ether 00:1b:21:bd:58:0f brd ff:ff:ff:ff:ff:ff
    altname enp177s0f1
    inet 192.168.1.121/24 brd 192.168.1.255 scope global noprefixroute ens10f1
       valid_lft forever preferred_lft forever
    inet6 fe80::1614:8143:8362:6611/64 scope link noprefixroute
       valid_lft forever preferred_lft forever


In pgpool.conf, after changing wd_monitoring_interfaces_list from empty
("") to ens10f1, the problem seems to be resolved. I am wondering whether
a status change on one of the other network interfaces caused this issue.
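
For reference, the change described above would look roughly like this in
pgpool.conf. This is only a sketch of the single line I touched: the comments
are mine, and ens10f1 is specific to this node (it is the only interface
reported as UP in the listing above), so other nodes may need a different
device name.

```
# pgpool.conf (watchdog section), sketch of the changed line only.
# Before: wd_monitoring_interfaces_list = ''
# After: monitor only the interface that actually carries the node's IP.
# The device name is specific to this host and is an assumption elsewhere.
wd_monitoring_interfaces_list = 'ens10f1'
```

Pgpool-II has to be restarted on that node for the change to take effect.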

Bo Peng <pengbo at sraoss.co.jp> wrote on Mon, Oct 4, 2021 at 11:45 AM:

> Hello,
>
> Sorry for the late response.
>
> > Hi,
> >
> > I have two PG servers and three watch_dog nodes to set up a PG HA
> > environment.
> >
> >    - OS: Ubuntu 20.04
> >    - PG version:12.8
> >    - Pgpool version: 4.1.4
> >
> >
> >    - PG -primary: 192.168.1.122
> >    - PG -slave: 192.168.1.121
> >    - Watch_dog node0: 192.168.1.122
> >    - Watch_dog node1: 192.168.1.121
> >    - Watch_dog node2: 192.168.1.101
> >
> >
> > The HA environment works fine, but after 3-4 hours two watch_dog nodes
> > go down, leaving only one watch_dog node (192.168.1.101) running. The
> > log of the watch_dog leader shows the error below, although the IP
> > address 192.168.1.122 is alive.
> >
> > 2021-09-20 15:53:37: pid 1900172: WARNING:  network IP is removed and system has no IP is assigned
> > 2021-09-20 15:53:37: pid 1900172: DETAIL:  changing the state to in network trouble
> > 2021-09-20 15:53:37: pid 1900172: DEBUG:  removing all watchdog nodes from the standby list
>
> I think it may be caused by a temporary network problem.
> Does this issue occur every time?
>
> > 2021-09-20 15:53:37: pid 1900172: DETAIL:  standby list contains 1 nodes
> > 2021-09-20 15:53:37: pid 1900172: DEBUG:  Removing all failover objects
> > 2021-09-20 15:53:37: pid 1900172: LOG:  watchdog node state changed from [MASTER] to [IN NETWORK TROUBLE]
> > 2021-09-20 15:53:37: pid 1900172: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State = IN NETWORK TROUBLE
> > 2021-09-20 15:53:37: pid 1900172: FATAL:  system has lost the network
> > 2021-09-20 15:53:37: pid 1900172: LOG:  Watchdog is shutting down
> > 2021-09-20 15:53:37: pid 1900172: DEBUG:  sending packet, watchdog node:[192.168.1.101:9999 Linux dell-PowerEdge-R740] command id:[1113] type:[INFORM I AM GOING DOWN] state:[IN NETWORK TROUBLE]
> > 2021-09-20 15:53:37: pid 1900172: DEBUG:  sending watchdog packet to socket:8, type:[X], command ID:1113, data Length:0
> > 2021-09-20 15:53:37: pid 1933141: LOG:  watchdog: de-escalation started
> > 2021-09-20 15:53:37: pid 1933141: DEBUG:  watchdog exec interface up/down command: '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev ens2f0' succeeded
> > 2021-09-20 15:53:37: pid 1933141: LOG:  successfully released the delegate IP:"192.168.1.129"
> > 2021-09-20 15:53:37: pid 1933141: DETAIL:  'if_down_cmd' returned with success
> > 2021-09-20 15:53:37: pid 1900168: DEBUG:  reaper handler
> > 2021-09-20 15:53:37: pid 1900168: DEBUG:  watchdog child process with pid: 1900172 exit with FATAL ERROR. pgpool-II will be shutdown
> > 2021-09-20 15:53:37: pid 1900168: LOG:  watchdog child process with pid: 1900172 exits with status 768
> > 2021-09-20 15:53:37: pid 1900168: FATAL:  watchdog child process exit with fatal error. exiting pgpool-II
> > 2021-09-20 15:53:37: pid 1933148: LOG:  setting the local watchdog node name to "192.168.1.122:9999 Linux dell-PowerEdge-R740"
> > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog cluster is configured with 2 remote nodes
> > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog remote node:0 on 192.168.1.121:9000
> > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog remote node:1 on 192.168.1.101:9000
> > 2021-09-20 15:53:37: pid 1933148: LOG:  interface monitoring is disabled in watchdog
> > 2021-09-20 15:53:37: pid 1933148: INFO:  IPC socket path: "/tmp/.s.PGPOOLWD_CMD.9000"
> > 2021-09-20 15:53:37: pid 1933148: LOG:  watchdog node state changed from [DEAD] to [LOADING]
> > 2021-09-20 15:53:37: pid 1933148: DEBUG:  STATE MACHINE INVOKED WITH EVENT = STATE CHANGED Current State = LOADING
> > 2021-09-20 15:53:37: pid 1933148: DEBUG:  error in outbound connection to 192.168.1.121:9000
> > 2021-09-20 15:53:37: pid 1933148: DETAIL:  Connection refused
> > 2021-09-20 15:53:37: pid 1933148: LOG:  new outbound connection to 192.168.1.101:9000
> > 2021-09-20 15:53:37: pid 1900189: DEBUG:  lifecheck child receives shutdown request signal 2, forwarding to all children
> > 2021-09-20 15:53:37: pid 1900189: DEBUG:  lifecheck child receives fast shutdown request
> > 2021-09-20 15:53:37: pid 1933148: LOG:  Watchdog is shutting down
> >
> > Please refer to the pgpool.conf and running log on each server. Any
> > advice on how to fix it?
>
>
> --
> Bo Peng <pengbo at sraoss.co.jp>
> SRA OSS, Inc. Japan
> http://www.sraoss.co.jp/
>