[pgpool-hackers: 4396] Re: [pgpool-committers: 9566] pgpool: Test: attempt to fix 001.load_balance test timeout.

Tatsuo Ishii ishii at sraoss.co.jp
Sat Sep 16 17:29:03 JST 2023


This looks working except this:

> Subject: [pgpool-buildfarm: 2933] Pgpool-II buildfarm results CentOS7
> From: buildfarm at pgpool.net
> To: pgpool-buildfarm at pgpool.net
> Date: Fri, 15 Sep 2023 16:35:15 +0900
> Sender: "pgpool-buildfarm" <pgpool-buildfarm-bounces at pgpool.net>
> User-Agent: Heirloom mailx 12.5 7/5/10
> 
> =========================================================================
> * master  PostgreSQL 12  CentOS7
> testing 001.load_balance...timeout.

This time the test timed out. Pgpool log reveals the cause:

2023-09-14 05:43:55.006: pcp_main pid 9183: LOG:  restart request received in pcp child process
2023-09-14 05:43:55.006: pcp_main pid 9183: DEBUG:  shmem_exit(-1): 2 callbacks to make
2023-09-14 05:43:55.006: pcp_main pid 9183: DEBUG:  proc_exit(-1): 0 callbacks to make
2023-09-14 05:43:55.008: main pid 9047: LOG:  PCP child 9183 exits with status 0 in failover()
2023-09-14 05:43:55.009: main pid 9047: LOG:  fork a new PCP child pid 9326 in failover()
2023-09-14 05:43:55.009: main pid 9047: LOG:  SIGINT is member
2023-09-14 05:43:55.009: main pid 9047: LOG:  exit handler called (signal: 2)
2023-09-14 05:43:55.009: main pid 9047: LOG:  shutting down by signal 2
2023-09-14 05:43:55.009: main pid 9047: LOG:  terminating all child processes
2023-09-14 05:43:55.009: pcp_main pid 9326: DEBUG:  I am PCP child with pid:9326
2023-09-14 05:43:55.009: child pid 9141: DEBUG:  shmem_exit(-1): 0 callbacks to make

Also this is src/test/regression/log/001.load_balance:

tcp        0      0 0.0.0.0:11000           0.0.0.0:*               LISTEN      9326/pgpool:        
tcp6       0      0 :::11000                :::*                    LISTEN      9326/pgpool:        
tcp        0      0 0.0.0.0:11000           0.0.0.0:*               LISTEN      9326/pgpool:        
tcp6       0      0 :::11000                :::*                    LISTEN      9326/pgpool:        
tcp        0      0 0.0.0.0:11000           0.0.0.0:*               LISTEN      9326/pgpool:        
tcp6       0      0 :::11000                :::*                    LISTEN      9326/pgpool:       
:
:
(test timeout)

So apprently pid 9326 which was created here:

2023-09-14 05:43:55.009: main pid 9047: LOG:  fork a new PCP child pid 9326 in failover()

remained and this is the cause of timeout error. Question is, why pcp child did not exit?

>From src/pcp_con/pcp_child.c:

	ereport(DEBUG1,
			(errmsg("I am PCP child with pid:%d", getpid())));

	ereport(LOG,
			(errmsg("PCP process: %d started", getpid())));

	for (;;)
	{


My guess is process 9326's exit handler was called after "I am PCP child with pid: 9326" log message.
At this point exit handler was already set up:

	pool_signal(SIGTERM, pcp_exit_handler);
	pool_signal(SIGINT, pcp_exit_handler);
	pool_signal(SIGQUIT, pcp_exit_handler);

In the exit handler:

	if (list_length(pcp_worker_children) > 0)
	{
		do
		{
			wpid = wait(NULL);
		} while (wpid > 0 || (wpid == -1 && errno == EINTR));

		list_free(pcp_worker_children);
	}

My guess is the exit handler stuck somewhere in the the loop. I don't
know why it happend. Probably at this point no child was exiting?

Best reagards,
--
Tatsuo Ishii
SRA OSS LLC
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

From: Tatsuo Ishii <ishii at sraoss.co.jp>
Subject: [pgpool-committers: 9566] pgpool: Test: attempt to fix 001.load_balance test timeout.
Date: Wed, 13 Sep 2023 09:17:42 +0000
Message-ID: <E1qgM0E-002UR1-IY at gothos.postgresql.org>

> Test: attempt to fix 001.load_balance test timeout.
> 
> Examining the log reveals that the error was caused by bind error:
> address already used.  This is strange because SO_REUSEADDR flags is
> set to the listening socket.  I guess the cause is, the kernel is not
> fast enough to clean up the socket used by previous pgpool process.
> To fix this, existing CHECK_TIME_WAIT environment variable parameter
> is modified so that it checks the socket by using netstat -tulpn,
> rather than netstat -a, which is too slow.
> 
> Branch
> ------
> master
> 
> Details
> -------
> https://git.postgresql.org/gitweb?p=pgpool2.git;a=commitdiff;h=fb0800e3f0211ba833ca05cf506387e4d2a279b8
> 
> Modified Files
> --------------
> src/test/pgpool_setup.in                           | 2 +-
> src/test/regression/tests/001.load_balance/test.sh | 1 +
> 2 files changed, 2 insertions(+), 1 deletion(-)
> 


More information about the pgpool-hackers mailing list