[pgpool-hackers: 4265] Re: Watchdog heartbeat issue

Tatsuo Ishii ishii at sraoss.co.jp
Wed Jan 18 22:36:49 JST 2023


> On Tue, Jan 17, 2023 at 6:49 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> Hi Usama,
>>
>> Thank you for investigating the issue.
>>
>> > Hi Ishii San
>> >
>> > Thanks for figuring out the issue.
>> > I think removing the code in question altogether could mark the remote
>> > node as dead too early at startup and could delay the watchdog cluster
>> > stabilization when there is a delay of a few seconds between node
>> > startups.
>> > So IMHO the way to solve this is to wait for twice the wd_interval or
>> > wd_heartbeat_deadtime (depending on the configuration) if
>> > is_wd_lifecheck_ready() reports a failure.
>> >
>> > What do you think of the attached patch?
>>
>> Probably I am missing something, but I wonder why the watchdog leader
>> node's lifecheck does not notice that the node 1 watchdog will never
>> send a heartbeat signal. In the pgpool0 log:
>>
>> 2023-01-14 00:27:15: watchdog pid 26708: LOG:  read from socket failed, remote end closed the connection
>> 2023-01-14 00:27:15: watchdog pid 26708: LOG:  client socket of localhost:50004 Linux abf1b59af489 is closed
>> 2023-01-14 00:27:15: watchdog pid 26708: LOG:  remote node "localhost:50004 Linux abf1b59af489" is shutting down
>> 2023-01-14 00:27:15: watchdog pid 26708: LOG:  removing watchdog node "localhost:50004 Linux abf1b59af489" from the standby list
>>
>> It seems the leader watchdog already noticed that node 1 was down.
>>
> 
> When the watchdog fails to communicate with a remote node despite retries,
> it marks the node status as lost/down. As for the lifecheck
> process, it only reports the node-down status to the watchdog process when
> the heartbeat breaks after at least one successful heartbeat cycle
> has completed.

I see.

I would like to confirm if my understanding is correct.

There are 3 nodes configured. Nodes 0 and 1 started, but node 2 did not
start. In this case I think lifecheck does not start on node 0 and
node 1, because the lifecheck process is waiting for node 2 to come up.
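
To make sure we are talking about the same thing, here is a small
Python sketch of how I understand the behavior. It is only a model, not
pgpool code: NodeState, lifecheck_ready(), nodes_to_report_down() and
the grace parameter are names made up for illustration, and what
happens after the grace period expires is my assumption about the
proposed approach.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeState:
    """Hypothetical per-node view kept by a lifecheck process."""
    name: str
    started: bool = False                   # has the node's watchdog come up at all?
    last_heartbeat: Optional[float] = None  # time of last heartbeat, None if never seen


def lifecheck_ready(nodes: List[NodeState]) -> bool:
    """Model: lifecheck starts only once every configured node is up."""
    return all(n.started for n in nodes)


def nodes_to_report_down(nodes, now, deadtime, started_at, grace=None):
    """Return the names lifecheck would report as down to the watchdog.

    grace=None models my understanding of the current behavior: nothing
    is reported while lifecheck is still waiting for a node to start.
    grace=<seconds> models the proposed approach (an assumption): stop
    waiting once that much time has passed since startup.
    """
    waiting = not lifecheck_ready(nodes)
    grace_expired = grace is not None and now - started_at >= grace
    if waiting and not grace_expired:
        return []  # lifecheck has not started, so no node is reported

    down = []
    for n in nodes:
        if n.last_heartbeat is None:
            # No heartbeat cycle ever completed. Report it only when the
            # grace period has run out (assumed behavior).
            if grace_expired:
                down.append(n.name)
        elif now - n.last_heartbeat > deadtime:
            # Heartbeat broke after at least one successful cycle.
            down.append(n.name)
    return down


# Nodes 0 and 1 started, node 2 never did; node 1 went silent at t=100.
nodes = [
    NodeState("node0", started=True, last_heartbeat=195.0),
    NodeState("node1", started=True, last_heartbeat=100.0),
    NodeState("node2"),
]
print(nodes_to_report_down(nodes, now=200.0, deadtime=30.0, started_at=0.0))
print(nodes_to_report_down(nodes, now=200.0, deadtime=30.0, started_at=0.0,
                           grace=60.0))
```

In this model the first call reports nothing, because lifecheck is
still waiting for node 2, while the second call reports both node 1 and
node 2 once the grace period has passed.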

Best regards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

