View Issue Details

ID: 0000107
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2014-07-24 00:28
Reporter: sw
Assigned To: Muhammad Usama
Priority: normal
Severity: major
Reproducibility: unable to reproduce
Status: resolved
Resolution: open
Product Version:
Target Version:
Fixed in Version:
Summary: 0000107: Stuck connections
Description: Since I upgraded my DB cluster to PostgreSQL 9.3 and Pgpool-II 3.3.3, it has had a problem with stuck connections in pgpool. Connections never time out, and the only way to get rid of them is to kill the processes.

Versions in use:
Systems are Ubuntu 12.04 LTS
pgpool2=3.3.3-2.pgdg12.4+2
postgresql-9.3=9.3.4-1.pgdg12.4+1
postgresql-9.3-pgpool2=3.3.3-1.pgdg12.4+1

The connections are stuck in the reset_query_list (DISCARD ALL) in state 'idle in transaction'.
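For reference, the sessions in question can be listed on the backend with a query along these lines (the 20-minute threshold is only an example):

# run on the PostgreSQL backend; adjust the threshold as needed
POSTGRES_MASTER ~ $ psql -U postgres -c "SELECT pid, client_addr, query, state_change FROM pg_stat_activity WHERE state = 'idle in transaction' AND now() - state_change > interval '20 minutes';"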

More details here:
http://www.pgpool.net/pipermail/pgpool-general/2014-July/003044.html

Maybe related cases:
http://www.pgpool.net/mantisbt/view.php?id=106
http://www.pgpool.net/pipermail/pgpool-general/2014-July/003049.html
Steps To Reproduce: I haven't found a way to reproduce it. It seems to happen mostly when pgpool/the database is under heavy load.
Additional Information: Example:
$ date --rfc-2822
Tue, 08 Jul 2014 09:13:44 +0000

PGPOOL ~ $ ps aux|grep 7163
postgres 7163 0.0 0.0 214168 7872 ? S Jul07 0:00 pgpool: db_www db_www 10.20.5.215(41780) SET

PGPOOL ~ $ pcp_proc_info --verbose 90 localhost 9898 root password 7163
Database : db_www
Username : db_www
Start time : 2014-07-07 11:38:51
Creation time: 2014-07-07 11:38:58
Major : 3
Minor : 0
Counter : 11
Backend PID : 13797
Connected : 1
Database : db_www
Username : db_www
Start time : 2014-07-07 11:38:51
Creation time: 2014-07-07 11:38:58
Major : 3
Minor : 0
Counter : 11
Backend PID : 29006
Connected : 1
Database : db_www
Username : db_www
Start time : 2014-07-07 11:38:51
Creation time: 2014-07-07 11:38:58
Major : 3
Minor : 0
Counter : 11
Backend PID : 28686
Connected : 1
Database : db_www
Username : db_www
Start time : 2014-07-07 11:38:51
Creation time: 2014-07-07 11:38:58
Major : 3
Minor : 0
Counter : 11
Backend PID : 25065
Connected : 1


 datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | waiting | state | query
-------+----------+-------+----------+--------------+------------------+--------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+---------+---------------------+----------------------------------------------------------
 17969 | db_www | 13797 | 16396 | db_www | | 10.20.5.189 | | 42993 | 2014-07-07 11:38:58.554118+00 | 2014-07-07 11:39:18.346472+00 | 2014-07-07 11:39:31.17378+00 | 2014-07-07 11:39:31.173841+00 | f | idle in transaction | SET SESSION AUTHORIZATION DEFAULT


POSTGRES_MASTER ~ $ ps aux | grep 13797
postgres 13797 0.0 0.0 2630544 10256 ? Ss Jul07 0:00 postgres: db_www db_www 10.20.5.189(42993) idle in transaction

 17969 | db_www | 13797 | 16396 | db_www | | 172.20.5.189 | | 42993 | 2014-07-07 11:38:58.554118+00 | 2014-07-07 11:39:18.346472+00 | 2014-07-07 11:39:31.17378+00 | 2014-07-07 11:39:31.173841+00 | f | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
 
 
POSTGRES_MASTER ~ $ sudo strace -p 13797
Process 13797 attached - interrupt to quit
recvfrom(9,
Tags: No tags attached.

Activities

tuomas

2014-07-11 03:12

reporter   ~0000446

We are having the very same problem. Pgpool 3.3.3 is not properly closing the connections and releasing them back to the pool when the client has already closed the connection. We're also seeing only DISCARD ALL on the connections that are stuck, but that query has successfully returned.

Because of this, the pool fills up and no longer accepts new connections, causing the services to go down, so this is critical.

In our case the app is Rails 4.1.4, but there has to be something in pgpool itself, as the connection isn't released even when the client has successfully and completely closed the connection.

tuomas

2014-07-11 04:45

reporter   ~0000447

To add some details:

This is what the processes look like:
pgpool 1903 0.1 0.0 100756 12584 ? S< 16:00 0:09 pgpool: app app_production fe02(39459) DISCARD
pgpool 1906 0.1 0.0 100924 12660 ? S< 16:00 0:08 pgpool: app app_production fe04(60458) DISCARD
pgpool 1927 0.1 0.0 100996 12688 ? S< 16:00 0:12 pgpool: app app_production fe02(41058) DISCARD
pgpool 1951 0.1 0.0 100772 12500 ? S< 16:00 0:08 pgpool: app app_production fe05(48386) DISCARD
pgpool 1994 0.0 0.0 100420 12256 ? S< 16:00 0:04 pgpool: app app_production fe01(57090) DISCARD
pgpool 2019 0.0 0.0 100644 12480 ? S< 16:00 0:04 pgpool: app app_production fe05(38709) DISCARD
pgpool 13340 0.0 0.0 100900 12652 ? S< 16:10 0:04 pgpool: app app_production fe05(38094) DISCARD
pgpool 13348 0.0 0.0 100712 12472 ? S< 16:10 0:02 pgpool: app app_production fe05(55029) DISCARD
pgpool 15216 0.0 0.0 90676 2392 ? S< 17:17 0:01 pgpool: app app_production fe05(37722) DISCARD
pgpool 15296 0.1 0.0 90676 2304 ? S< 17:51 0:01 pgpool: app app_production fe05(45917) DISCARD

In Postgres those connections are idle, and the connections from those ports seem closed; on the client side we see them in FIN_WAIT2 state until they time out there. Even after the connection has fully closed (nothing in netstat for the port), it still appears in the pgpool process list as if in use.
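To cross-check, something along these lines can be done (port 39459 is the client port of the first process above; the exact commands are just an example):

# no socket left on the pgpool host for that client port...
$ ss -tan | grep 39459
# ...yet the pgpool child still shows the connection as in use
$ ps aux | grep 'pgpool.*39459'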

tuomas

2014-07-11 09:05

reporter   ~0000448

I dumped the core of one of the stuck processes to see which state it is in:

(gdb) bt
#0 0x00007f1a73322743 in select () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000000000041675d in pool_check_fd (cp=<optimized out>) at pool_process_query.c:951
#2 0x000000000041de70 in pool_read (cp=0x1a30470, buf=<optimized out>, len=1) at pool_stream.c:139
#3 0x000000000041c8fe in read_kind_from_backend (frontend=0x1a2a640, backend=0x1a2a1e0, decided_kind=0x7ffffeed801f "") at pool_process_query.c:3771
#4 0x000000000044f6a9 in ProcessBackendResponse (frontend=0x1a2a640, backend=0x1a2a1e0, state=0x7ffffeed8094, num_fields=0x7ffffeed809c) at pool_proto_modules.c:2742
#5 0x000000000041a095 in pool_process_query (frontend=0x1a2a640, backend=0x1a2a1e0, reset_request=1) at pool_process_query.c:289
#6 0x000000000040c716 in do_child (unix_fd=5, inet_fd=<optimized out>) at child.c:403
#7 0x00000000004075df in fork_a_child (unix_fd=5, inet_fd=6, id=148) at main.c:1238
#8 0x0000000000407d0a in reaper () at main.c:2457
#9 reaper () at main.c:2369
#10 0x00000000004086bd in pool_sleep (second=<optimized out>) at main.c:2654
#11 0x0000000000406349 in main (argc=<optimized out>, argv=<optimized out>) at main.c:836
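(For reference, a backtrace like this can be obtained by attaching gdb to a stuck child, roughly as follows; the pid is one from the process list above:)

# attach to the stuck pgpool child, print its stack, then detach
$ sudo gdb -p 1903 -batch -ex bt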

In the logs there are also errors "ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited" for the pids that are stuck.

Muhammad Usama

2014-07-16 19:38

developer  

connection_stuck.patch (1,931 bytes)
diff --git a/child.c b/child.c
index 767ecef..e5cc34e 100644
--- a/child.c
+++ b/child.c
@@ -387,12 +387,15 @@ void do_child(int unix_fd, int inet_fd)
 			{
 				/* client exits */
 				case POOL_END:
+				case POOL_END_WITH_FRONTEND_ERROR:
 					/*
 					 * do not cache connection if:
+					 * frontend abnormally exits or
 					 * pool_config->connection_cahe == 0 or
 					 * database name is template0, template1, postgres or regression
 					 */
-					if (pool_config->connection_cache == 0 ||
+					if (status == POOL_END_WITH_FRONTEND_ERROR ||
+						pool_config->connection_cache == 0 ||
 						!strcmp(sp->database, "template0") ||
 						!strcmp(sp->database, "template1") ||
 						!strcmp(sp->database, "postgres") ||
diff --git a/pool.h b/pool.h
index f90964a..d4b23e2 100644
--- a/pool.h
+++ b/pool.h
@@ -6,7 +6,7 @@
  * pgpool: a language independent connection pool server for PostgreSQL 
  * written by Tatsuo Ishii
  *
- * Copyright (c) 2003-2013	PgPool Global Development Group
+ * Copyright (c) 2003-2014	PgPool Global Development Group
  *
  * Permission to use, copy, modify, and distribute this software and
  * its documentation for any purpose and without fee is hereby
@@ -79,6 +79,7 @@ typedef enum {
 	POOL_CONTINUE = 0,
 	POOL_IDLE,
 	POOL_END,
+	POOL_END_WITH_FRONTEND_ERROR,
 	POOL_ERROR,
 	POOL_FATAL,
 	POOL_DEADLOCK
diff --git a/pool_proto_modules.c b/pool_proto_modules.c
index 7f8ee30..ee05b50 100644
--- a/pool_proto_modules.c
+++ b/pool_proto_modules.c
@@ -2556,7 +2556,7 @@ POOL_STATUS ProcessFrontendResponse(POOL_CONNECTION *frontend,
 	if (pool_read(frontend, &fkind, 1) < 0)
 	{
 		pool_log("ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited");
-		return POOL_END;
+		return POOL_END_WITH_FRONTEND_ERROR;
 	}
 
 	pool_debug("ProcessFrontendResponse: kind from frontend %c(%02x)", fkind, fkind);

Muhammad Usama

2014-07-16 19:44

developer   ~0000449

I have been trying to replicate this issue but have had little luck so far. For further analysis, could you please share the PostgreSQL log and the pgpool-II configuration you are using when this situation arises?
Meanwhile, we have come up with a patch for the problem that most likely causes this stuck-connection situation. Could you please try the attached patch (connection_stuck.patch) and check whether it solves the problem?
Please note that the patch is generated against the 3.3-stable branch.
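Applying and building it should look roughly like this (the source directory name is only an example):

# inside a pgpool-II 3.3.x source tree
$ patch -p1 < connection_stuck.patch
$ ./configure && make && sudo make install
# restart pgpool afterwards so new child processes use the patched binary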

sw

2014-07-17 17:45

reporter   ~0000450

Thanks, Muhammad. I patched pgpool and upgraded all servers. I will keep you posted on how things change.

Config is:
$ grep -v ^$ /etc/pgpool2/pgpool.conf |egrep -v '^\s*#'
listen_addresses = '*'
port = 9999
socket_dir = '/var/run/postgresql'
pcp_port = 9898
pcp_socket_dir = '/var/run/postgresql'
backend_hostname0 = '172.20.15.1'
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/data/pgsql/data'
backend_hostname1 = '172.20.15.3'
backend_port1 = 5432
backend_weight1 = 5
backend_data_directory1 = '/data/pgsql/data'
backend_hostname2 = '172.20.5.119'
backend_port2 = 5432
backend_weight2 = 5
backend_data_directory2 = '/data/postgresql/9.3/main'
backend_hostname3 = '172.20.5.195'
backend_port3 = 5432
backend_weight3 = 5
backend_data_directory3 = '/data/postgresql/9.3/main'
enable_pool_hba = on
pool_passwd = ''
authentication_timeout = 10
ssl = off
num_init_children = 512
max_pool = 4
child_life_time = 10
child_max_connections = 2000
connection_life_time = 300
client_idle_limit = 1200
log_destination = 'stderr'
print_timestamp = on
log_connections = off
log_hostname = off
log_statement = off
log_per_node_statement = off
log_standby_delay = 'if_over_threshold'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
debug_level = 0
pid_file_name = '/var/run/postgresql/pgpool.pid'
logdir = '/var/log/postgresql'
connection_cache = on
reset_query_list = 'ABORT; SET SESSION AUTHORIZATION DEFAULT; RESET ALL; DEALLOCATE ALL; CLOSE ALL; UNLISTEN *; SELECT pg_advisory_unlock_all(); DISCARD PLANS; DISCARD TEMP;'
replication_mode = off
replicate_select = off
insert_lock = on
lobj_lock_table = ''
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off
load_balance_mode = on
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'nextval,setval'
master_slave_mode = on
master_slave_sub_mode = 'stream'
sr_check_period = 10
sr_check_user = 'someuser'
sr_check_password = 'somesecretpassword'
delay_threshold = 32000000
follow_master_command = ''
parallel_mode = off
pgpool2_hostname = ''
health_check_period = 40
health_check_timeout = 10
health_check_user = 'someuser'
health_check_password = 'somesecretpassword'
health_check_max_retries = 4
health_check_retry_delay = 5
failover_command = ''
failback_command = ''
fail_over_on_backend_error = off
search_primary_node_timeout = 120
client_idle_limit_in_recovery = 0
use_watchdog = on # Activates watchdog
trusted_servers = ''
ping_path = '/bin'
wd_hostname = '172.20.5.191'
wd_port = 9000
wd_authkey = ''
delegate_IP = '172.20.15.2'
ifconfig_path = '/etc/pgpool2'
if_up_cmd = 'pgpool_ifconfig eth1:3 inet $_IP_$ netmask 255.255.0.0'
if_down_cmd = 'pgpool_ifconfig eth1:3 down'
arping_path = '/usr/sbin' # arping command path
arping_cmd = 'arping -U -I eth1 -w 1 -s 172.20.5.191 $_IP_$'
clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 20
heartbeat_destination0 = '172.20.5.189'
heartbeat_destination_port0 = 9694
heartbeat_device0 = 'eth1'
heartbeat_destination1 = '172.20.5.190'
heartbeat_destination_port1 = 9694
heartbeat_device1 = 'eth1'
wd_life_point = 6
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'postgres'
wd_lifecheck_user = 'someuser'
wd_lifecheck_password = 'somesecretpassword'
other_pgpool_hostname0 = '172.20.5.189'
other_pgpool_port0 = 5432
other_wd_port0 = 9000
other_pgpool_hostname1 = '172.20.5.190'
other_pgpool_port1 = 5432
other_wd_port1 = 9000
relcache_expire = 60
relcache_size = 512
check_temp_table = off
memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 67108864
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 409600
memqcache_cache_block_size = 1048576
memqcache_oiddir = '/var/log/pgpool/oiddir'
white_memqcache_table_list = ''
black_memqcache_table_list = ''

sw

2014-07-17 17:45

reporter  

pgpool.log (10,604 bytes)

sw

2014-07-17 17:47

reporter   ~0000451

Log file attached. I kill connections/processes before pgpool runs out of available connections, so there is no log message from that critical situation but maybe it helps. File: pgpool.log

tuomas

2014-07-18 03:41

reporter   ~0000452

I've also applied the patch, and it indeed seems to fix the problem. No more stuck connections. Thanks!

sw

2014-07-21 17:07

reporter   ~0000453

After three days in production I can tell that the patch has mitigated the stuck connection issue but has not entirely fixed it.

This morning I found these connections still open:

PGPOOL ~ $ ps auxfww | grep pgpool.*SET
postgres 5010 0.0 0.0 219468 7652 ? S Jul18 0:00 \_ pgpool: db_www db_www 172.20.5.38(36343) SET
postgres 27254 0.0 0.0 219468 7652 ? S Jul19 0:00 \_ pgpool: db_www db_www 172.20.5.38(14781) SET
postgres 23719 0.0 0.0 219468 7900 ? S Jul20 0:00 \_ pgpool: db_www db_www 172.20.5.38(35763) SET
postgres 22589 0.0 0.0 219468 7652 ? S 02:30 0:00 \_ pgpool: db_www db_www 172.20.5.38(23499) SET

(Please note that my reset_query_list contains the individual commands that make up DISCARD ALL rather than DISCARD ALL itself; see the config in my earlier post above.)

 datid | datname | pid | usesysid | usename | application_name | client_addr | client_hostname | client_port | backend_start | xact_start | query_start | state_change | waiting | state | query
-------+---------+-------+----------+---------+------------------+--------------+-----------------+-------------+-------------------------------+-------------------------------+-------------------------------+-------------------------------+---------+---------------------+------------------------------------
 17969 | db_www | 28747 | 16396 | db_www | | 172.20.5.191 | | 8948 | 2014-07-21 02:30:14.013174+00 | 2014-07-21 02:30:14.021105+00 | 2014-07-21 02:50:16.285306+00 | 2014-07-21 02:50:16.285364+00 | f | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
 17969 | db_www | 2340 | 16396 | db_www | | 172.20.5.191 | | 55078 | 2014-07-20 02:30:07.414983+00 | 2014-07-20 02:30:13.828291+00 | 2014-07-20 02:50:16.129658+00 | 2014-07-20 02:50:16.129699+00 | f | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
 17969 | db_www | 30416 | 16396 | db_www | | 172.20.5.191 | | 64824 | 2014-07-18 02:30:03.504255+00 | 2014-07-18 02:30:03.514478+00 | 2014-07-18 02:50:05.827223+00 | 2014-07-18 02:50:05.827315+00 | f | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
 17969 | db_www | 24639 | 16396 | db_www | | 172.20.5.191 | | 28828 | 2014-07-19 02:30:13.349585+00 | 2014-07-19 02:30:13.363914+00 | 2014-07-19 02:50:15.631723+00 | 2014-07-19 02:50:15.631799+00 | f | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
(4 rows)

Interestingly, xact_start and query_start are very similar every time; it looks like it is the same query every day. The 20-minute difference between the two timestamps matches the client_idle_limit (1200 s). I couldn't find anything related to this in the log files yet.

Maybe there is just one extra case that needs to be handled in the patch to entirely fix the issue.

t-ishii

2014-07-22 00:37

developer  

connection_stuck2.patch (843 bytes)
diff --git a/pool_process_query.c b/pool_process_query.c
index e6ada25..c4417ef 100644
--- a/pool_process_query.c
+++ b/pool_process_query.c
@@ -4877,12 +4877,12 @@ SELECT_RETRY:
 
 			if (idle_count > pool_config->client_idle_limit)
 			{
-				pool_log("pool_process_query: child connection forced to terminate due to client_idle_limit(%d) reached",
+				pool_log("pool_process_query: child connection forced to terminate due to client_idle_limit (%d) reached",
 						 pool_config->client_idle_limit);
 				pool_send_error_message(frontend, MAJOR(backend),
 										"57000", "connection terminated due to client idle limit reached",
 										"","",  __FILE__, __LINE__);
-				return POOL_END;
+				return POOL_END_WITH_FRONTEND_ERROR;
 			}
 		}
 		else if (*InRecovery > RECOVERY_INIT && pool_config->client_idle_limit_in_recovery > 0)

t-ishii

2014-07-22 00:41

developer   ~0000454

I looked into the code related to client_idle_limit and found a problem similar to the one fixed by the previous patch. The attached patch (connection_stuck2.patch) should fix this case. Could you please apply it and see whether the problem is solved?
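Note that connection_stuck2.patch uses the POOL_END_WITH_FRONTEND_ERROR status introduced by the earlier connection_stuck.patch, so it has to be applied on top of it, roughly like this (the source directory name is only an example):

# in the already-patched pgpool-II 3.3.x source tree
$ patch -p1 < connection_stuck2.patch
$ make && sudo make install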

sw

2014-07-23 22:05

reporter   ~0000455

Looks like both patches together solve this issue. No stuck connections since yesterday. Thanks!

t-ishii

2014-07-24 00:28

developer   ~0000456

Great! I will commit the patch to all supported branches.

Issue History

Date Modified Username Field Change
2014-07-08 18:14 sw New Issue
2014-07-11 03:12 tuomas Note Added: 0000446
2014-07-11 04:45 tuomas Note Added: 0000447
2014-07-11 09:05 tuomas Note Added: 0000448
2014-07-11 13:37 t-ishii Assigned To => Muhammad Usama
2014-07-11 13:37 t-ishii Status new => assigned
2014-07-16 19:38 Muhammad Usama File Added: connection_stuck.patch
2014-07-16 19:44 Muhammad Usama Note Added: 0000449
2014-07-17 17:45 sw Note Added: 0000450
2014-07-17 17:45 sw File Added: pgpool.log
2014-07-17 17:47 sw Note Added: 0000451
2014-07-18 03:41 tuomas Note Added: 0000452
2014-07-21 17:07 sw Note Added: 0000453
2014-07-22 00:37 t-ishii File Added: connection_stuck2.patch
2014-07-22 00:41 t-ishii Note Added: 0000454
2014-07-23 22:05 sw Note Added: 0000455
2014-07-24 00:28 t-ishii Note Added: 0000456
2014-07-24 00:28 t-ishii Status assigned => resolved