[pgpool-general: 6204] Re: Behavior when the heartbeat is not received

Tomoaki Sato sato at sraoss.co.jp
Mon Aug 20 13:32:25 JST 2018


Hi,

>> I'm using the heartbeat mode as the lifecheck method.  What happens if
>> the heartbeat signal is not received?
>> 
>> In Pgpool-II 3.6.12, when I close the heartbeat port on the master,
>> the heartbeat signal is no longer received and the standby is
>> disconnected at first.
>> 
>>   Aug 16 10:01:30 centos7-1 pgpool[12906]: [10-1] LOG:  watchdog: lifecheck started
>> 
>>   # firewall-cmd --remove-port=9694/udp
>> 
>>   Aug 16 10:02:20 centos7-1 pgpool[12906]: [11-1] LOG:  informing the node status change to watchdog
>>   Aug 16 10:02:20 centos7-1 pgpool[12906]: [11-2] DETAIL:  node id :1 status = "NODE DEAD" message:"No heartbeat signal from node"
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [24-1] LOG:  new IPC connection received
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [25-1] LOG:  received node status change ipc message
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [25-2] DETAIL:  No heartbeat signal from node
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [26-1] LOG:  remote node "192.168.137.72:9999 Linux centos7-2" is lost
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [27-1] LOG:  removing watchdog node "192.168.137.72:9999 Linux centos7-2" from the standby list
>> 
>> However, the standby watchdog then reconnects.
>> 
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [28-1] LOG:  new outbound connection to 192.168.137.72:9000
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [29-1] LOG:  new watchdog node connection is received from "192.168.137.72:12471"
>>   Aug 16 10:02:20 centos7-1 pgpool[12904]: [30-1] LOG:  new node joined the cluster hostname:"192.168.137.72" port:9000 pgpool_port:9999
>> 
>> Is this behavior correct?
> 
> Yes, it is the correct behaviour of the watchdog.
> 
> The basic recovery mechanism of the watchdog is that
> it keeps trying to connect to lost nodes, so that if
> a node was lost because of network partitioning or some
> other network issue, it reconnects as soon as the
> communication link is established again.
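
For context, the heartbeat lifecheck settings under discussion look
roughly like this; the port matches the firewall-cmd above, but the
keepalive/deadtime values here are only the defaults, assumed for
illustration (see the attached pgpool.conf for the actual files):

  use_watchdog = on
  wd_lifecheck_method = 'heartbeat'
  wd_heartbeat_port = 9694                   # UDP port closed in the test above
  wd_heartbeat_keepalive = 2                 # seconds between heartbeat packets (assumed default)
  wd_heartbeat_deadtime = 30                 # seconds without heartbeat before "NODE DEAD" (assumed default)
  heartbeat_destination0 = '192.168.137.72'  # the peer node, as seen from centos7-1
  heartbeat_destination_port0 = 9694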

After the standby has been marked dead because of the missing
heartbeat signal, stopping the primary still fails over to the dead
standby.

  Aug 20 12:56:49 centos7-2 pgpool[16958]: [35-1] LOG:  remote node "192.168.137.71:9999 Linux centos7-1" is shutting down
  Aug 20 12:56:49 centos7-2 pgpool[16958]: [36-1] LOG:  watchdog cluster has lost the coordinator node
  Aug 20 12:56:49 centos7-2 pgpool[16958]: [37-1] LOG:  unassigning the remote node "192.168.137.71:9999 Linux centos7-1" from watchdog cluster master
  Aug 20 12:56:49 centos7-2 pgpool[16958]: [38-1] LOG:  We have lost the cluster master node "192.168.137.71:9999 Linux centos7-1"
  Aug 20 12:56:49 centos7-2 pgpool[16958]: [39-1] LOG:  watchdog node state changed from [STANDBY] to [JOINING]
  Aug 20 12:56:53 centos7-2 pgpool[16958]: [40-1] LOG:  watchdog node state changed from [JOINING] to [INITIALIZING]
  Aug 20 12:56:54 centos7-2 pgpool[16958]: [41-1] LOG:  I am the only alive node in the watchdog cluster
  Aug 20 12:56:54 centos7-2 pgpool[16958]: [41-2] HINT:  skipping stand for coordinator state
  Aug 20 12:56:54 centos7-2 pgpool[16958]: [42-1] LOG:  watchdog node state changed from [INITIALIZING] to [MASTER]
  Aug 20 12:56:54 centos7-2 pgpool[16958]: [43-1] LOG:  I am announcing my self as master/coordinator watchdog node
  Aug 20 12:56:58 centos7-2 pgpool[16958]: [44-1] LOG:  I am the cluster leader node
  Aug 20 12:56:58 centos7-2 pgpool[16958]: [44-2] DETAIL:  our declare coordinator message is accepted by all nodes
  Aug 20 12:56:58 centos7-2 pgpool[16958]: [45-1] LOG:  setting the local node "192.168.137.72:9999 Linux centos7-2" as watchdog cluster master
  Aug 20 12:56:58 centos7-2 pgpool[16958]: [46-1] LOG:  I am the cluster leader node. Starting escalation process
  Aug 20 12:56:58 centos7-2 pgpool[16958]: [47-1] LOG:  escalation process started with PID:17027
  Aug 20 12:56:58 centos7-2 pgpool[16958]: [48-1] LOG:  new IPC connection received
  Aug 20 12:56:58 centos7-2 pgpool[17027]: [47-1] LOG:  watchdog: escalation started
  Aug 20 12:57:02 centos7-2 pgpool[17027]: [48-1] LOG:  successfully acquired the delegate IP:"192.168.137.91"
  Aug 20 12:57:02 centos7-2 pgpool[17027]: [48-2] DETAIL:  'if_up_cmd' returned with success
  Aug 20 12:57:02 centos7-2 pgpool[16958]: [49-1] LOG:  watchdog escalation process with pid: 17027 exit with SUCCESS.
  Aug 20 12:57:19 centos7-2 pgpool[16959]: [11-1] LOG:  informing the node status change to watchdog
  Aug 20 12:57:19 centos7-2 pgpool[16959]: [11-2] DETAIL:  node id :1 status = "NODE DEAD" message:"No heartbeat signal from node"
  Aug 20 12:57:19 centos7-2 pgpool[16958]: [50-1] LOG:  new IPC connection received
  Aug 20 12:57:19 centos7-2 pgpool[16958]: [51-1] LOG:  received node status change ipc message
  Aug 20 12:57:19 centos7-2 pgpool[16958]: [51-2] DETAIL:  No heartbeat signal from node
  Aug 20 12:57:19 centos7-2 pgpool[16958]: [52-1] LOG:  remote node "192.168.137.71:9999 Linux centos7-1" is shutting down
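
For reference, the escalation above (acquiring the delegate IP via
if_up_cmd) is driven by settings along these lines; the IP matches the
log above, but the device name eth0 is an assumption:

  delegate_IP = '192.168.137.91'
  # 'eth0' is an assumption; replace it with the actual interface name
  if_up_cmd = 'ip addr add $_IP_$/24 dev eth0 label eth0:0'
  if_down_cmd = 'ip addr del $_IP_$/24 dev eth0'
  arping_cmd = 'arping -U $_IP_$ -w 1'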

Is this behavior correct?

Aren't dead nodes handled differently from alive nodes?

> This way watchdog makes sure that if split-brain happens because of network
> partitioning it should be immediately recovered from.

In addition, will the standby watchdog reconnect only when a network
issue occurs?

Even when I kill the heartbeat receiver process on the standby, the
standby watchdog reconnects as well, although the heartbeat receiver
itself stays dead.

  standby$ ps ax | grep "heartbeat receiver"
   16638 ?        S      0:00 pgpool: heartbeat receiver
   16667 pts/1    R+     0:00 grep --color=auto heartbeat receiver
  standby$ kill 16638

  Aug 20 11:32:45 centos7-1 pgpool[17965]: [25-1] LOG:  read from socket failed, remote end closed the connection
  Aug 20 11:32:45 centos7-1 pgpool[17965]: [26-1] LOG:  client socket of 192.168.137.72:9999 Linux centos7-2 is closed
  Aug 20 11:32:45 centos7-1 pgpool[17965]: [27-1] LOG:  read from socket failed, remote end closed the connection
  Aug 20 11:32:45 centos7-1 pgpool[17965]: [28-1] LOG:  outbound socket of 192.168.137.72:9999 Linux centos7-2 is closed
  Aug 20 11:32:45 centos7-1 pgpool[17965]: [29-1] LOG:  new outbound connection to 192.168.137.72:9000
  Aug 20 11:32:45 centos7-1 pgpool[17965]: [30-1] LOG:  new watchdog node connection is received from "192.168.137.72:14007"
  Aug 20 11:32:45 centos7-1 pgpool[17965]: [31-1] LOG:  new node joined the cluster hostname:"192.168.137.72" port:9000 pgpool_port:9999
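
A re-check on the standby, as a sketch (the brackets keep grep itself
out of the listing, so no output means the receiver was not restarted):

  standby$ ps ax | grep "[h]eartbeat receiver"
  standby$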

>> The pcp_watchdog_info command result is as follows:
>> 
>>   $ pcp_watchdog_info
>>   2 YES 192.168.137.71:9999 Linux centos7-1 192.168.137.71
>>   
>>   192.168.137.71:9999 Linux centos7-1 192.168.137.71 9999 9000 4 MASTER
>>   192.168.137.72:9999 Linux centos7-2 192.168.137.72 9999 9000 7 STANDBY
>> 
>> I will attach the pgpool.conf files for both nodes.

Best regards,


----
Tomoaki Sato <sato at sraoss.co.jp>
SRA OSS, Inc. Japan

