[pgpool-general: 7900] Re: Possible race condition during startup causing node to enter network isolation

Bo Peng pengbo at sraoss.co.jp
Fri Nov 26 11:11:36 JST 2021


Hello,

On Thu, 25 Nov 2021 09:54:50 +0100
Emond Papegaaij <emond.papegaaij at gmail.com> wrote:

> Hi all,
> 
> In our tests we are seeing sporadic failures when services on one node are
> restarted. 

Could you provide a scenario to reproduce this issue?
Did you restart the Pgpool-II service only?
Did the watchdog and lifecheck work correctly at the initial startup?

Could you share the watchdog configuration of each Pgpool-II node?
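
For reference, the settings of interest are the watchdog/lifecheck part of
pgpool.conf on each node. A minimal sketch of what that section typically
looks like in Pgpool-II 4.2 (the values below are only placeholders built
from the addresses in your logs):

    use_watchdog = on
    wd_lifecheck_method = 'heartbeat'
    wd_interval = 10                  # lifecheck interval in seconds
    wd_heartbeat_keepalive = 2        # heartbeat send interval in seconds
    wd_heartbeat_deadtime = 30        # seconds without heartbeat before "NODE DEAD"

    # one block per node; repeat with suffix 1 and 2 for the other nodes
    hostname0 = '172.29.30.1'
    wd_port0 = 9009
    pgpool_port0 = 5432
    heartbeat_hostname0 = '172.29.30.1'
    heartbeat_port0 = 9694

In particular, wd_heartbeat_deadtime is relevant here, since it controls how
quickly the lifecheck declares a node dead ("No heartbeat signal from node").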

> These tests run in a 3-node setup, with all nodes running a database
> and pgpool: node 1 is the one restarting its services, node 2 is
> both the pgpool leader and the primary database, and node 3 runs a standby
> database and pgpool. Looking at the logs (see below), it seems node 1 is
> not allowed to connect to node 2 because node 2 has marked node 1 as dead
> via the lifecheck. However, node 1 will not become alive until it has
> joined the cluster. The end result is that node 1 keeps trying to join the
> cluster and node 2 keeps rejecting it.
> 
> In the logs below, what I call node 1 is 172.29.30.1, node 2 is 172.29.30.2
> and node 3 is 172.29.30.3. I think the problem lies in the 7th line of the
> node 2 logs: "node id :0 status = "NODE DEAD" message:"No heartbeat signal
> from node"". Node 2 never lets node 1 recover from this status. Note that
> the node 2 logs also show the automatic failback of the database on node 1.
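
(As a side note, when the cluster gets stuck like this, the watchdog view
from each side can be inspected with pcp_watchdog_info; the PCP port and
user below are placeholders:

    pcp_watchdog_info -h 172.29.30.2 -p 9898 -U pcpuser -v

Run against node 2, it would presumably keep showing node 1 as lost, while
node 1 itself cycles between JOINING and NETWORK ISOLATION.)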
> 
> Best regards,
> Emond
> 
> The logs for node 1:
> 2021-11-24 23:44:44: pid 13: LOG:  watchdog cluster is configured with 2
> remote nodes
> 2021-11-24 23:44:44: pid 13: LOG:  watchdog remote node:0 on
> 172.29.30.2:9009
> 2021-11-24 23:44:44: pid 13: LOG:  watchdog remote node:1 on
> 172.29.30.3:9009
> 2021-11-24 23:44:44: pid 13: LOG:  interface monitoring is disabled in
> watchdog
> 2021-11-24 23:44:44: pid 13: LOG:  watchdog node state changed from [DEAD]
> to [LOADING]
> 2021-11-24 23:44:44: pid 13: LOG:  new outbound connection to
> 172.29.30.2:9009
> 2021-11-24 23:44:44: pid 13: LOG:  setting the remote node "172.29.30.2:5432
> Linux 610cdb714a72" as watchdog cluster leader
> 2021-11-24 23:44:44: pid 13: LOG:  watchdog node state changed from
> [LOADING] to [INITIALIZING]
> 2021-11-24 23:44:44: pid 13: LOG:  new watchdog node connection is received
> from "172.29.30.2:53967
> 2021-11-24 23:44:44: pid 13: LOG:  new node joined the cluster
> hostname:"172.29.30.2" port:9009 pgpool_port:5432
> 2021-11-24 23:44:44: pid 13: DETAIL:  Pgpool-II version:"4.2.4" watchdog
> messaging version: 1.2
> 2021-11-24 23:44:44: pid 13: LOG:  new outbound connection to
> 172.29.30.3:9009
> 2021-11-24 23:44:44: pid 13: LOG:  new watchdog node connection is received
> from "172.29.30.3:57577"
> 2021-11-24 23:44:44: pid 13: LOG:  new node joined the cluster
> hostname:"172.29.30.3" port:9009 pgpool_port:5432
> 2021-11-24 23:44:44: pid 13: DETAIL:  Pgpool-II version:"4.2.4" watchdog
> messaging version: 1.2
> 2021-11-24 23:44:45: pid 13: LOG:  read from socket failed, remote end
> closed the connection
> 2021-11-24 23:44:45: pid 13: LOG:  client socket of 172.29.30.2:5432 Linux
> 610cdb714a72 is closed
> 2021-11-24 23:44:45: pid 13: LOG:  remote node "172.29.30.2:5432 Linux
> 610cdb714a72" is reporting that it has lost us
> 2021-11-24 23:44:45: pid 13: LOG:  read from socket failed, remote end
> closed the connection
> 2021-11-24 23:44:45: pid 13: LOG:  outbound socket of 172.29.30.2:5432
> Linux 610cdb714a72 is closed
> 2021-11-24 23:44:45: pid 13: LOG:  remote node "172.29.30.2:5432 Linux
> 610cdb714a72" is not reachable
> 2021-11-24 23:44:45: pid 13: DETAIL:  marking the node as lost
> 2021-11-24 23:44:45: pid 13: LOG:  remote node "172.29.30.2:5432 Linux
> 610cdb714a72" is lost
> 2021-11-24 23:44:45: pid 13: LOG:  watchdog cluster has lost the
> coordinator node
> 2021-11-24 23:44:45: pid 13: LOG:  removing the remote node "
> 172.29.30.2:5432 Linux 610cdb714a72" from watchdog cluster leader
> 2021-11-24 23:44:46: pid 13: LOG:  watchdog node state changed from
> [INITIALIZING] to [STANDING FOR LEADER]
> 2021-11-24 23:44:46: pid 13: LOG:  our stand for coordinator request is
> rejected by node "172.29.30.3:5432 Linux 589ce3e63006"
> 2021-11-24 23:44:46: pid 13: DETAIL:  we might be in partial network
> isolation and cluster already have a valid leader
> 2021-11-24 23:44:46: pid 13: HINT:  please verify the watchdog life-check
> and network is working properly
> 2021-11-24 23:44:46: pid 13: LOG:  watchdog node state changed from
> [STANDING FOR LEADER] to [NETWORK ISOLATION]
> 2021-11-24 23:44:46: pid 13: LOG:  read from socket failed, remote end
> closed the connection
> 2021-11-24 23:44:46: pid 13: LOG:  client socket of 172.29.30.3:5432 Linux
> 589ce3e63006 is closed
> 2021-11-24 23:44:46: pid 13: LOG:  remote node "172.29.30.3:5432 Linux
> 589ce3e63006" is reporting that it has lost us
> 2021-11-24 23:44:46: pid 13: LOG:  read from socket failed, remote end
> closed the connection
> 2021-11-24 23:44:46: pid 13: LOG:  outbound socket of 172.29.30.3:5432
> Linux 589ce3e63006 is closed
> 2021-11-24 23:44:46: pid 13: LOG:  remote node "172.29.30.3:5432 Linux
> 589ce3e63006" is not reachable
> 2021-11-24 23:44:46: pid 13: DETAIL:  marking the node as lost
> 2021-11-24 23:44:46: pid 13: LOG:  remote node "172.29.30.3:5432 Linux
> 589ce3e63006" is lost
> 2021-11-24 23:44:56: pid 13: LOG:  trying again to join the cluster
> 2021-11-24 23:44:56: pid 13: LOG:  watchdog node state changed from
> [NETWORK ISOLATION] to [JOINING]
> 2021-11-24 23:44:56: pid 13: LOG:  new outbound connection to
> 172.29.30.2:9009
> 2021-11-24 23:44:56: pid 13: LOG:  new outbound connection to
> 172.29.30.3:9009
> 2021-11-24 23:44:56: pid 13: LOG:  new watchdog node connection is received
> from "172.29.30.2:720"
> 2021-11-24 23:44:56: pid 13: LOG:  new node joined the cluster
> hostname:"172.29.30.2" port:9009 pgpool_port:5432
> 2021-11-24 23:44:56: pid 13: DETAIL:  Pgpool-II version:"4.2.4" watchdog
> messaging version: 1.2
> 2021-11-24 23:44:56: pid 13: LOG:  The newly joined node:"172.29.30.2:5432
> Linux 610cdb714a72" had left the cluster because it was lost
> 2021-11-24 23:44:56: pid 13: DETAIL:  lost reason was "NOT REACHABLE" and
> startup time diff = 1
> 2021-11-24 23:44:56: pid 13: LOG:  new watchdog node connection is received
> from "172.29.30.3:3306"
> 2021-11-24 23:44:56: pid 13: LOG:  new node joined the cluster
> hostname:"172.29.30.3" port:9009 pgpool_port:5432
> 2021-11-24 23:44:56: pid 13: DETAIL:  Pgpool-II version:"4.2.4" watchdog
> messaging version: 1.2
> 2021-11-24 23:44:56: pid 13: LOG:  The newly joined node:"172.29.30.3:5432
> Linux 589ce3e63006" had left the cluster because it was lost
> 2021-11-24 23:44:56: pid 13: DETAIL:  lost reason was "NOT REACHABLE" and
> startup time diff = 0
> 2021-11-24 23:45:00: pid 13: LOG:  watchdog node state changed from
> [JOINING] to [INITIALIZING]
> 2021-11-24 23:45:01: pid 13: LOG:  watchdog node state changed from
> [INITIALIZING] to [STANDING FOR LEADER]
> 2021-11-24 23:45:01: pid 13: LOG:  our stand for coordinator request is
> rejected by node "172.29.30.2:5432 Linux 610cdb714a72"
> 2021-11-24 23:45:01: pid 13: LOG:  watchdog node state changed from
> [STANDING FOR LEADER] to [PARTICIPATING IN ELECTION]
> 2021-11-24 23:45:06: pid 13: LOG:  watchdog node state changed from
> [PARTICIPATING IN ELECTION] to [JOINING]
> 2021-11-24 23:45:06: pid 13: LOG:  setting the remote node "172.29.30.2:5432
> Linux 610cdb714a72" as watchdog cluster leader
> 2021-11-24 23:45:06: pid 13: LOG:  watchdog node state changed from
> [JOINING] to [INITIALIZING]
> 2021-11-24 23:45:07: pid 13: LOG:  watchdog node state changed from
> [INITIALIZING] to [STANDBY]
> 2021-11-24 23:45:07: pid 13: NOTICE:  our join coordinator is rejected by
> node "172.29.30.2:5432 Linux 610cdb714a72"
> 2021-11-24 23:45:07: pid 13: HINT:  rejoining the cluster.
> 2021-11-24 23:45:07: pid 13: LOG:  leader node "172.29.30.2:5432 Linux
> 610cdb714a72" thinks we are lost, and "172.29.30.2:5432 Linux 610cdb714a72"
> is not letting us join
> 2021-11-24 23:45:07: pid 13: HINT:  please verify the watchdog life-check
> and network is working properly
> 2021-11-24 23:45:07: pid 13: LOG:  watchdog node state changed from
> [STANDBY] to [NETWORK ISOLATION]
> 2021-11-24 23:45:17: pid 13: LOG:  trying again to join the cluster
> 2021-11-24 23:45:17: pid 13: LOG:  watchdog node state changed from
> [NETWORK ISOLATION] to [JOINING]
> 2021-11-24 23:45:17: pid 13: LOG:  removing the remote node "
> 172.29.30.2:5432 Linux 610cdb714a72" from watchdog cluster leader
> 2021-11-24 23:45:17: pid 13: LOG:  setting the remote node "172.29.30.2:5432
> Linux 610cdb714a72" as watchdog cluster leader
> 2021-11-24 23:45:17: pid 13: LOG:  watchdog node state changed from
> [JOINING] to [INITIALIZING]
> 
> The logs for node 2:
> 2021-11-24 23:44:44: pid 12: LOG:  new watchdog node connection is received
> from "172.29.30.1:36034"
> 2021-11-24 23:44:44: pid 12: LOG:  new node joined the cluster
> hostname:"172.29.30.1" port:9009 pgpool_port:5432
> 2021-11-24 23:44:44: pid 12: DETAIL:  Pgpool-II version:"4.2.6" watchdog
> messaging version: 1.2
> 2021-11-24 23:44:44: pid 12: LOG:  The newly joined node:"172.29.30.1:5432
> Linux 8e410fda51ac" had left the cluster because it was shutdown
> 2021-11-24 23:44:44: pid 12: LOG:  new outbound connection to
> 172.29.30.1:9009
> 2021-11-24 23:44:45: pid 13: LOG:  informing the node status change to
> watchdog
> 2021-11-24 23:44:45: pid 13: DETAIL:  node id :0 status = "NODE DEAD"
> message:"No heartbeat signal from node"
> 2021-11-24 23:44:45: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:44:45: pid 12: LOG:  received node status change ipc message
> 2021-11-24 23:44:45: pid 12: DETAIL:  No heartbeat signal from node
> 2021-11-24 23:44:45: pid 12: LOG:  remote node "172.29.30.1:5432 Linux
> 8e410fda51ac" is lost
> 2021-11-24 23:44:46: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:44:47: pid 12: LOG:  watchdog received the failover command
> from remote pgpool-II node "172.29.30.3:5432 Linux 589ce3e63006"
> 2021-11-24 23:44:47: pid 12: LOG:  watchdog is processing the failover
> command [FAILBACK_REQUEST] received from 172.29.30.3:5432 Linux 589ce3e63006
> 2021-11-24 23:44:47: pid 12: LOG:  The failover request does not need
> quorum to hold
> 2021-11-24 23:44:47: pid 12: DETAIL:  proceeding with the failover
> 2021-11-24 23:44:47: pid 12: HINT:  REQ_DETAIL_CONFIRMED
> 2021-11-24 23:44:47: pid 12: LOG:  received failback request for node_id: 0
> from pid [12]
> 2021-11-24 23:44:47: pid 12: LOG:  signal_user1_to_parent_with_reason(0)
> 2021-11-24 23:44:47: pid 1: LOG:  Pgpool-II parent process received SIGUSR1
> 2021-11-24 23:44:47: pid 1: LOG:  Pgpool-II parent process has received
> failover request
> 2021-11-24 23:44:47: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:44:47: pid 12: LOG:  received the failover indication from
> Pgpool-II on IPC interface
> 2021-11-24 23:44:47: pid 12: LOG:  watchdog is informed of failover start
> by the main process
> 2021-11-24 23:44:47: pid 1: LOG:  starting fail back. reconnect host
> 172.29.30.1(5432)
> 2021-11-24 23:44:47: pid 1: LOG:  Node 1 is not down (status: 2)
> 2021-11-24 23:44:47: pid 1: LOG:  Do not restart children because we are
> failing back node id 0 host: 172.29.30.1 port: 5432 and we are in streaming
> replication mode and not all backends were down
> 2021-11-24 23:44:47: pid 1: LOG:  find_primary_node_repeatedly: waiting for
> finding a primary node
> 2021-11-24 23:44:47: pid 1: LOG:  find_primary_node: standby node is 0
> 2021-11-24 23:44:47: pid 1: LOG:  find_primary_node: primary node is 1
> 2021-11-24 23:44:47: pid 1: LOG:  find_primary_node: standby node is 2
> 2021-11-24 23:44:47: pid 1: LOG:  failover: set new primary node: 1
> 2021-11-24 23:44:47: pid 1: LOG:  failover: set new main node: 0
> 2021-11-24 23:44:47: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:44:47: pid 12: LOG:  received the failover indication from
> Pgpool-II on IPC interface
> 2021-11-24 23:44:47: pid 12: LOG:  watchdog is informed of failover end by
> the main process
> 2021-11-24 23:44:47: pid 1: LOG:  failback done. reconnect host
> 172.29.30.1(5432)
> 2021-11-24 23:44:47: pid 190: LOG:  worker process received restart request
> 2021-11-24 23:44:48: pid 189: LOG:  restart request received in pcp child
> process
> 2021-11-24 23:44:48: pid 1: LOG:  PCP child 189 exits with status 0 in
> failover()
> 2021-11-24 23:44:48: pid 1: LOG:  fork a new PCP child pid 191 in failover()
> 2021-11-24 23:44:48: pid 1: LOG:  worker child process with pid: 190 exits
> with status 256
> 2021-11-24 23:44:48: pid 191: LOG:  PCP process: 191 started
> 2021-11-24 23:44:48: pid 1: LOG:  fork a new worker child process with pid:
> 192
> 2021-11-24 23:44:48: pid 192: LOG:  process started
> 2021-11-24 23:44:48: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:44:53: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:44:56: pid 12: LOG:  new watchdog node connection is received
> from "172.29.30.1:39618"
> 2021-11-24 23:44:56: pid 12: LOG:  new outbound connection to
> 172.29.30.1:9009
> 2021-11-24 23:44:56: pid 12: LOG:  new node joined the cluster
> hostname:"172.29.30.1" port:9009 pgpool_port:5432
> 2021-11-24 23:44:56: pid 12: DETAIL:  Pgpool-II version:"4.2.6" watchdog
> messaging version: 1.2
> 2021-11-24 23:44:56: pid 12: LOG:  The newly joined node:"172.29.30.1:5432
> Linux 8e410fda51ac" had left the cluster because it was lost
> 2021-11-24 23:44:56: pid 12: DETAIL:  lost reason was "REPORTED BY
> LIFECHECK" and startup time diff = 1
> 2021-11-24 23:44:56: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:44:56: pid 12: DETAIL:  only lifecheck process can mark this
> node alive again
> 2021-11-24 23:44:56: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:44:56: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:44:56: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:44:56: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:44:56: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:44:56: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:44:56: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:44:56: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:44:58: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:45:00: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:00: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:00: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:00: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:00: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:00: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:00: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:00: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:01: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:01: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:01: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:01: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:01: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:01: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:01: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:01: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:01: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:01: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:01: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:01: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:01: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:01: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:01: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:01: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:03: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:45:06: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:06: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:06: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:06: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:06: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:06: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:06: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:06: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:06: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:06: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:06: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:06: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:06: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:06: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:06: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:06: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:07: pid 12: LOG:  lost remote node "172.29.30.1:5432 Linux
> 8e410fda51ac" is requesting to join the cluster
> 2021-11-24 23:45:07: pid 12: DETAIL:  rejecting the request until
> life-check inform us that it is reachable again
> 2021-11-24 23:45:07: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:07: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:07: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:07: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:07: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:07: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:07: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:07: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:07: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:07: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:07: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:07: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:07: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:07: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:07: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:07: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:09: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:45:14: pid 12: LOG:  new IPC connection received
> 2021-11-24 23:45:17: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:17: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:17: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:17: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:17: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:17: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:17: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:17: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:17: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:17: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:17: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:17: pid 12: DETAIL:  only life-check process can mark this
> node alive again
> 2021-11-24 23:45:17: pid 12: LOG:  we have received the NODE INFO message
> from the node:"172.29.30.1:5432 Linux 8e410fda51ac" that was lost
> 2021-11-24 23:45:17: pid 12: DETAIL:  we had lost this node because of
> "REPORTED BY LIFECHECK"
> 2021-11-24 23:45:17: pid 12: LOG:  node:"172.29.30.1:5432 Linux
> 8e410fda51ac" was reported lost by the lifecheck process
> 2021-11-24 23:45:17: pid 12: DETAIL:  only life-check process can mark this
> node alive again


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan
http://www.sraoss.co.jp/

