[pgpool-hackers: 4269] heartbeat behavior

Sat Jan 21 20:38:58 JST 2023

Hi Usama,

I wanted to test the heartbeat of Pgpool-II, so I created a test
support tool that stops the heartbeat sender from sending heartbeat
packets to a specified pgpool node. Attached is the patch that
implements it.

While testing the heartbeat with the tool, I found an interesting
behavior of the watchdog.

1) create a cluster of 3 pgpool nodes

$ watchdog_setup -wn 3

2) start the cluster and wait for the lifecheck process to start.

3) create a "heartbeat_sender_control" file to prevent the heartbeat
sender on pgpool1 from sending heartbeat packets to pgpool2.

$ echo 2 > pgpool1/log/heartbeat_sender_control

4) create a "heartbeat_sender_control" file to prevent the heartbeat
sender on pgpool2 from sending heartbeat packets to pgpool1.

$ echo 1 > pgpool2/log/heartbeat_sender_control

At this point pgpool1 sends heartbeats to pgpool0 but not to pgpool2.
Likewise, pgpool2 sends heartbeats to pgpool0 but not to pgpool1.
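
To make the resulting topology explicit, here is a minimal sketch in
Python that models the setup. It is not part of the patch. It assumes
that each control file holds a single node id and that the sender on
that node skips exactly that destination, as in steps 3 and 4. The
names missing_heartbeats and control_files are made up for
illustration.

```python
# Model of the test setup: which heartbeats never arrive at which node.
# Assumption: each control file holds one node id, and the heartbeat
# sender on that node skips exactly that destination.

NODES = [0, 1, 2]

# Contents of pgpoolN/log/heartbeat_sender_control, keyed by sending node.
control_files = {1: "2\n", 2: "1\n"}


def missing_heartbeats(nodes, control):
    """Return {receiver: sorted list of senders whose heartbeat never arrives}."""
    missing = {n: [] for n in nodes}
    for sender, content in control.items():
        blocked = int(content.strip())  # destination the sender skips
        missing[blocked].append(sender)
    return {n: sorted(senders) for n, senders in missing.items()}


result = missing_heartbeats(NODES, control_files)
for node, senders in result.items():
    print(f"pgpool{node} misses heartbeats from: {senders}")
```

In this model pgpool0 still receives heartbeats from both peers, so
only pgpool1 and pgpool2 should see each other as dead.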

5) wait until life_check reports that the node is in "NODE DEAD" status.

Here is a pgpool log from pgpool1.

2023-01-21 20:14:13.541: life_check pid 598177: LOG:  informing the node status change to watchdog
2023-01-21 20:14:13.541: life_check pid 598177: DETAIL:  node id :2 status = "NODE DEAD" message:"No heartbeat signal from node"
2023-01-21 20:14:13.541: watchdog pid 598125: LOG:  received node status change ipc message
2023-01-21 20:14:13.541: watchdog pid 598125: DETAIL:  No heartbeat signal from node
2023-01-21 20:14:13.541: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" is lost
2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  new watchdog node connection is received from "127.0.0.1:44185"
2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  new outbound connection to localhost:50010 
2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  new node joined the cluster hostname:"localhost" port:50010 pgpool_port:50008
2023-01-21 20:14:13.542: watchdog pid 598125: DETAIL:  Pgpool-II version:"4.5devel" watchdog messaging version: 1.2
2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  The newly joined node:"localhost:50008 Linux tishii-CFSV9-2" had left the cluster because it was lost
2023-01-21 20:14:13.542: watchdog pid 598125: DETAIL:  lost reason was "REPORTED BY LIFECHECK" and startup time diff = 0

As I expected, "localhost:50008" (that is pgpool2) left the cluster. So far so good.

Then a strange thing happened:

2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" became reachable again
2023-01-21 20:14:13.542: watchdog pid 598125: DETAIL:  requesting the node info
2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" is reporting that it has found us again

Why did pgpool2 come back even though life check continues to report
that the node is dead? It seems the life check report has been ignored.
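
For anyone who wants to check their own logs for the same pattern,
here is a minimal sketch in Python that pairs each "is lost" message
with the next "became reachable again" message for the same node. It
assumes the log line layout shown above (a 23-character timestamp at
the start of each line). The function name find_flaps is made up for
illustration and is not part of Pgpool-II.

```python
import re
from datetime import datetime

LOST = re.compile(r'remote node "([^"]+)" is lost')
BACK = re.compile(r'remote node "([^"]+)" became reachable again')
STAMP = "%Y-%m-%d %H:%M:%S.%f"


def find_flaps(lines):
    """Return (node, seconds) pairs for nodes marked lost and then reachable again."""
    lost_at = {}
    flaps = []
    for line in lines:
        # The timestamp is the first 23 characters, e.g. "2023-01-21 20:14:13.541".
        try:
            when = datetime.strptime(line[:23], STAMP)
        except ValueError:
            continue  # not a timestamped log line
        m = LOST.search(line)
        if m:
            lost_at[m.group(1)] = when
            continue
        m = BACK.search(line)
        if m and m.group(1) in lost_at:
            delta = when - lost_at.pop(m.group(1))
            flaps.append((m.group(1), delta.total_seconds()))
    return flaps


sample = [
    '2023-01-21 20:14:13.541: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" is lost',
    '2023-01-21 20:14:13.542: watchdog pid 598125: LOG:  remote node "localhost:50008 Linux tishii-CFSV9-2" became reachable again',
]
flaps = find_flaps(sample)
for node, seconds in flaps:
    print(f"{node}: reachable again {seconds:.3f}s after being lost")
```

On the two lines quoted from the pgpool1 log, the node is reported as
reachable again about one millisecond after it was marked lost.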

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp
-------------- next part --------------
A non-text attachment was scrubbed...
Name: lifecheck.patch
Type: text/x-patch
Size: 469 bytes
Desc: not available
URL: <http://www.pgpool.net/pipermail/pgpool-hackers/attachments/20230121/61195816/attachment.bin>