[pgpool-hackers: 4148] exit handler in pgpool main process

Tatsuo Ishii ishii at sraoss.co.jp
Sun Apr 10 09:14:55 JST 2022


While inspecting buildfarm failures, I noticed that the exit handler
in the pgpool main process was interrupted while it was executing.

https://pgpool.net/buildfarm/20220408-buildfarm-CentOS7.tar.gz

> Subject: [pgpool-buildfarm: 2070] pgpool-II buildfarm results CentOS7
> From: buildfarm at pgpool.net
> To: pgpool-buildfarm at pgpool.net
> Date: Sat, 09 Apr 2022 08:16:00 +0900
> Sender: "pgpool-buildfarm" <pgpool-buildfarm-bounces at pgpool.net>
> User-Agent: Heirloom mailx 12.5 7/5/10
> 
> =========================================================================
> * master  PostgreSQL 13  CentOS7
> testing 011.watchdog_quorum_failover...timeout.

Here is an excerpt from pgpool.log before the timeout.

2022-04-07 22:47:57.565: watchdog pid 16523: LOG:  I am the cluster leader node
2022-04-07 22:47:57.565: watchdog pid 16523: DETAIL:  our declare coordinator message is accepted by all nodes
2022-04-07 22:47:57.565: watchdog pid 16523: LOG:  setting the local node "localhost:11100 Linux 863294c37d9f" as watchdog cluster leader
2022-04-07 22:47:57.565: watchdog pid 16523: LOG:  signal_user1_to_parent_with_reason(1)
2022-04-07 22:47:57.565: watchdog pid 16523: LOG:  I am the cluster leader node but we do not have enough nodes in cluster
2022-04-07 22:47:57.565: watchdog pid 16523: DETAIL:  waiting for the quorum to start escalation process
2022-04-07 22:47:57.565: main pid 16515: LOG:  Pgpool-II parent process received SIGUSR1
2022-04-07 22:47:57.565: main pid 16515: LOG:  Pgpool-II parent process received watchdog state change signal from watchdog
2022-04-07 22:47:57.565: watchdog pid 16523: LOG:  new IPC connection received
2022-04-07 22:47:58.566: watchdog pid 16523: LOG:  adding watchdog node "localhost:11200 Linux 863294c37d9f" to the standby list
2022-04-07 22:47:58.566: watchdog pid 16523: LOG:  quorum found
2022-04-07 22:47:58.566: watchdog pid 16523: DETAIL:  starting escalation process
2022-04-07 22:47:58.567: main pid 16515: LOG:  shutting down
2022-04-07 22:47:58.567: main pid 16515: LOG:  terminating all child processes
2022-04-07 22:47:58.582: watchdog_utility pid 16751: LOG:  watchdog: escalation started
2022-04-07 22:57:03.253: main pid 16515: LOG:  shutting down
2022-04-07 22:57:03.254: main pid 16515: LOG:  terminating all child processes

The main process entered the exit signal handler at 22:47:58.567 and
then, while it was reaping child processes, it was interrupted again at
22:57:03.253.

The signal handler (exit_handler) is registered for SIGTERM, SIGINT
and SIGQUIT. It first blocks most signals except SIGTERM, SIGQUIT
and SIGALRM. As you know, SIGTERM is used for smart shutdown, and
SIGQUIT is used for immediate shutdown. In my understanding, a signal
handler is automatically protected from the same signal that invoked
it. So if exit_handler is invoked by SIGTERM, the next SIGTERM will be
blocked, BUT a different signal (for example SIGINT or SIGQUIT) will
not be blocked. So my theory is that exit_handler was invoked by one
of SIGTERM, SIGINT or SIGQUIT, and was then interrupted by a different
one of those signals. I think this should be avoided because it causes
an infinite wait in terminate_childrens(), which is called from
exit_handler, as we see in the buildfarm log.

I will think about a fix for this.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

