View Issue Details

ID: 0000046
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2013-01-23 11:01
Reporter: mcousin
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: sometimes
Status: resolved
Resolution: open
Platform: Linux
OS: Linux
OS Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0000046: Watchdog failing to connect sometimes
Description: Hi, I'm having a problem similar to this one: http://www.pgpool.net/pipermail/pgpool-general/2012-December/001242.html, but on Linux. Capturing TCP frames, I can see that a connection is correctly established between PGPool's watchdog and PostgreSQL, but PGPool thinks the connection is dead.

We see messages like these in the log:

Nov 22 11:55:22 fantomas1 pgpool[30351]: connect_inet_domain_socket: connect() failed: Connection timed out
Nov 22 11:55:22 fantomas1 pgpool[30351]: connection to fantomas4.prod.extelia.fr(5432) failed
Nov 22 11:55:22 fantomas1 pgpool[30351]: new_connection: create_cp() failed
Nov 22 11:55:22 fantomas1 pgpool[30351]: degenerate_backend_set: 1 fail over request from pid 30351

A tcpdump from the same period shows that the connection has been established.

Digging further into the problem, I took a look at the code, and I wonder whether the cause lies in the connect_inet_domain_socket_by_port function.

This function tries to connect using a non-blocking TCP connection.

It first calls pool_set_nonblock(fd), and only then calls connect (connect(fd, (struct sockaddr *)&addr, len)), which it does in a loop.

Linux's man page for socket(7) says this:

« It is possible to do nonblocking I/O on sockets by setting the O_NONBLOCK flag on a socket file descriptor using fcntl(2). Then all operations that would block will (usually) return with EAGAIN (operation should be retried later); connect(2) will return EINPROGRESS error. The user can then wait for various events via poll(2) or select(2). »

This seems to mean that a non-blocking connect() is to be coupled with a poll or select.

That's not what the code is doing. I think this may explain both what I am seeing (the connect does not block and is sometimes still in progress when connect is called again, and that second call then fails) and the Mac OS problem (the connect was in progress during the first iteration and already complete on the second, hence the «Socket is already connected» error).

There is another potential problem in this code (if I'm not mistaken): if the connect() takes time (over a slow link, for instance), the code will spin in a tight loop over a system call and may eat a lot of CPU.
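
The socket(7) passage quoted above can be sketched roughly as follows. This is a minimal illustration only, not pgpool-II code: the helper names connect_nonblock_poll and close_keep_errno, the IPv4-only address handling, and the timeout parameter are assumptions made for the example. It issues connect() once, sleeps in poll(2) instead of calling connect() again in a loop, and then reads SO_ERROR to find out whether the connection actually succeeded.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Close fd without clobbering the errno that describes the failure. */
static int close_keep_errno(int fd)
{
    int saved = errno;
    close(fd);
    errno = saved;
    return -1;
}

/*
 * Hypothetical helper (an assumption for this example, not pgpool-II code).
 * Connects to an IPv4 address with a non-blocking socket, waiting at most
 * timeout_ms for completion. Returns a connected fd (left in non-blocking
 * mode) or -1 with errno set.
 */
int connect_nonblock_poll(const char *ipv4, unsigned short port, int timeout_ms)
{
    struct sockaddr_in addr;
    struct pollfd pfd;
    int fd, flags, rc, err = 0;
    socklen_t errlen = sizeof(err);

    assert(ipv4 != NULL);
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, ipv4, &addr.sin_addr) != 1) {
        errno = EINVAL;
        return -1;
    }

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0)
        return close_keep_errno(fd);

    /* Issue connect() exactly once. */
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) == 0)
        return fd;                       /* completed immediately */
    if (errno != EINPROGRESS)
        return close_keep_errno(fd);     /* immediate failure */

    /* Sleep until the socket becomes writable or the timeout expires. */
    pfd.fd = fd;
    pfd.events = POLLOUT;
    pfd.revents = 0;
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);  /* note: restarts the full timeout */
    if (rc == 0)
        errno = ETIMEDOUT;
    if (rc <= 0)
        return close_keep_errno(fd);

    /* Ready means "finished", not "succeeded": SO_ERROR has the outcome. */
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0)
        return close_keep_errno(fd);
    if (err != 0) {
        errno = err;
        return close_keep_errno(fd);
    }
    return fd;
}
```

A caller would use it as, for example, connect_nonblock_poll("127.0.0.1", 5432, 1000). Because the process sleeps inside poll(2) while the handshake is pending, a slow link costs waiting time rather than CPU.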
Tags: No tags attached.

Activities

t-ishii

2012-12-15 09:45

developer   ~0000188

As you can see in the thread you are referring to, at least the "Socket is already connected" problem has been fixed in the git repo.

t-ishii

2012-12-15 17:49

developer   ~0000190

OK, I rewrote connect_inet_domain_socket_by_port() using select(2). Can you try it out? The attached patch is against pgpool-II 3.2.1.

t-ishii

2012-12-15 17:50

developer  

patch_against_3.2.1.patch (3,239 bytes)
*** pgpool-II-3.2.1/pool_connection_pool.c	2012-09-25 10:18:45.000000000 +0900
--- pgpool2/pool_connection_pool.c	2012-12-15 17:37:52.775565248 +0900
***************
*** 511,516 ****
--- 511,524 ----
  	int on = 1;
  	struct sockaddr_in addr;
  	struct hostent *hp;
+ 	struct timeval timeout;
+ 	fd_set rset, wset;
+ 	int error;
+ 	socklen_t socklen;
+ 	int sts;
+ 
+ #define CONNECT_TIMEOUT_MSEC 100		/* specify select(2) timeout in millisecond */
+ 
  
  	fd = socket(AF_INET, SOCK_STREAM, 0);
  	if (fd < 0)
***************
*** 566,581 ****
  
  		if (connect(fd, (struct sockaddr *)&addr, len) < 0)
  		{
  			if ((errno == EINTR && retry) || errno == EAGAIN)
  				continue;
  
! 			/* Non block fd could return these */
! 			if (errno == EINPROGRESS || errno == EALREADY)
! 				continue;
  
! 			pool_error("connect_inet_domain_socket: connect() failed: %s",strerror(errno));
! 			close(fd);
! 			return -1;
  		}
  		break;
  	}
--- 574,659 ----
  
  		if (connect(fd, (struct sockaddr *)&addr, len) < 0)
  		{
+ 			if (errno == EISCONN)
+ 			{
+ 				/* Socket is already connected */
+ 				break;
+ 			}
+ 
  			if ((errno == EINTR && retry) || errno == EAGAIN)
  				continue;
  
! 			/*
! 			 * If error was "connect(2) is in progress", then wait for
! 			 * completion.  Otherwise error out.
! 			 */
! 			if (errno != EINPROGRESS && errno != EALREADY)
! 			{
! 				pool_error("connect_inet_domain_socket: connect() failed: %s",strerror(errno));
! 				close(fd);
! 				return -1;
! 			}
  
! 			timeout.tv_sec = 0;
! 			timeout.tv_usec = CONNECT_TIMEOUT_MSEC * 1000;
! 			FD_ZERO(&rset);
! 			FD_SET(fd, &rset);	
! 			FD_ZERO(&wset);
! 			FD_SET(fd, &wset);
! 			sts = select(fd+1, &rset, &wset, NULL, &timeout);
! 
! 			if (sts == 0)
! 			{
! 				/* select timeout */
! 				if (retry)
! 				{
! 					pool_log("connect_inet_domain_socket: select() timedout. retrying...");
! 					continue;
! 				}
! 
! 				/*
! 				 * If read data or write data was set, either connect
! 				 * succeeded or error.  We need to figure it out. This
! 				 * is the hardest part in using non blocking
! 				 * connect(2).  See W. Richar Stevens's "UNIX Network
! 				 * Programming: Volume 1, Second Edition" section
! 				 * 15.4.
! 				 */
! 				if (FD_ISSET(fd, &rset) || FD_ISSET(fd, &wset))
! 				{
! 					error = 0;
! 					socklen = sizeof(error);
! 					if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &error, &socklen) < 0)
! 					{
! 						/* Solaris returns error in this case */
! 						pool_error("connect_inet_domain_socket: getsockopt() failed: %s", strerror(errno));
! 						close(fd);
! 						return -1;
! 					}
! 
! 					/* Non Solaris case */
! 					if (error != 0)
! 					{
! 						pool_error("connect_inet_domain_socket: getsockopt() detects error: %s", strerror(error));
! 						close(fd);
! 						return -1;
! 					}
! 				}
! 				else
! 				{
! 					pool_error("connect_inet_domain_socket: both read data and write data was not set");
! 					close(fd);
! 					return -1;
! 				}
! 			}
! 			else		/* select returns error */
! 			{
! 				if((errno == EINTR && retry) || errno == EAGAIN)
! 				{
! 					pool_log("connect_inet_domain_socket: select() interrupted. retrying...");
! 					continue;
! 				}
! 			}
  		}
  		break;
  	}

mcousin

2012-12-17 16:45

reporter   ~0000196

Thanks a lot! We will apply it ASAP and keep you informed.

mcousin

2013-01-16 19:14

reporter   ~0000215

Hi, and sorry for the long wait.

The patch was applied on Monday. Everything seems to have worked fine since then. We will report back in another week to confirm this.

t-ishii

2013-01-16 20:11

developer   ~0000216

Thanks. Looking forward to hearing from you next week.

mcousin

2013-01-22 19:21

reporter   ~0000220

The problem has completely disappeared. It hasn't occurred since applying the patch.

t-ishii

2013-01-23 10:50

developer   ~0000221

Great. Fix committed to master and 3.2 stable. Thanks!

Issue History

Date Modified Username Field Change
2012-12-15 01:01 mcousin New Issue
2012-12-15 09:41 t-ishii Assigned To => t-ishii
2012-12-15 09:41 t-ishii Status new => assigned
2012-12-15 09:45 t-ishii Note Added: 0000188
2012-12-15 17:49 t-ishii Note Added: 0000190
2012-12-15 17:50 t-ishii File Added: patch_against_3.2.1.patch
2012-12-17 16:45 mcousin Note Added: 0000196
2013-01-08 09:46 t-ishii Status assigned => feedback
2013-01-16 19:14 mcousin Note Added: 0000215
2013-01-16 19:14 mcousin Status feedback => assigned
2013-01-16 20:11 t-ishii Note Added: 0000216
2013-01-17 06:38 t-ishii Status assigned => feedback
2013-01-22 19:21 mcousin Note Added: 0000220
2013-01-22 19:21 mcousin Status feedback => assigned
2013-01-23 10:50 t-ishii Note Added: 0000221
2013-01-23 10:50 t-ishii Status assigned => resolved
2013-01-23 11:01 t-ishii Changeset attached => pgpool2 master 249af07c