View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000107 | Pgpool-II | Bug | public | 2014-07-08 18:14 | 2020-10-15 23:05 |

| Field | Value |
|---|---|
| Reporter | sw |
| Assigned To | Muhammad Usama |
| Priority | normal |
| Severity | major |
| Reproducibility | unable to reproduce |
| Status | resolved |
| Resolution | open |
| Summary | 0000107: Stuck connections |
Description

Since I upgraded my DB cluster to PostgreSQL 9.3 and Pgpool-II 3.3.3, connections get stuck in Pgpool. They never time out, and the only way to get rid of them is to kill the processes.

Versions in use (systems are Ubuntu 12.04 LTS):

    pgpool2=3.3.3-2.pgdg12.4+2
    postgresql-9.3=9.3.4-1.pgdg12.4+1
    postgresql-9.3-pgpool2=3.3.3-1.pgdg12.4+1

The connections are stuck in the reset_query_list (DISCARD ALL) in state 'idle in transaction'. More details here:
http://www.pgpool.net/pipermail/pgpool-general/2014-July/003044.html

Possibly related cases:
http://www.pgpool.net/mantisbt/view.php?id=106
http://www.pgpool.net/pipermail/pgpool-general/2014-July/003049.html
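A minimal shell sketch for spotting this symptom (host name and credentials below are placeholders, not values from this report): it lists backends sitting in 'idle in transaction' on one of the reset queries, and the pgpool children still attached to them.

```sh
#!/bin/sh
PRIMARY_HOST=primary.example.com   # placeholder for the primary PostgreSQL host

# Backends left "idle in transaction" on one of the reset queries.
psql -h "$PRIMARY_HOST" -U postgres -d postgres -x -c "
SELECT pid, usename, client_addr, state_change,
       now() - state_change AS stuck_for, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND query ~* '^(DISCARD|ABORT|RESET|SET SESSION AUTHORIZATION)';"

# On the pgpool host, matching children show the reset command (SET / DISCARD)
# in their process title.
ps aux | grep -E 'pgpool: .*(DISCARD|SET)' | grep -v grep
```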
Steps To Reproduce

I haven't found a way to reproduce it. It seems to happen mostly when pgpool/the database is under heavy load.
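Since the problem reportedly shows up mainly under heavy load, one way to try to reproduce it is to push concurrent connect/disconnect traffic through pgpool rather than straight at PostgreSQL. A rough sketch using pgbench; the pgpool host is a placeholder, and port 9999 is taken from the configuration posted later in this thread.

```sh
#!/bin/sh
PGPOOL_HOST=pgpool.example.com   # placeholder

# Create and initialize a throwaway benchmark database through pgpool.
createdb -h "$PGPOOL_HOST" -p 9999 pgbench_test
pgbench  -h "$PGPOOL_HOST" -p 9999 -i -s 10 pgbench_test

# Many clients, a new connection per transaction (-C), so the
# reset_query_list path is exercised on every disconnect; run for 10 minutes.
pgbench -h "$PGPOOL_HOST" -p 9999 -c 64 -j 8 -C -T 600 pgbench_test

# Meanwhile, watch for children that stay on DISCARD/SET after their
# client has gone away.
watch -n 5 "ps aux | grep -E 'pgpool: .*(DISCARD|SET)' | grep -v grep"
```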
Additional Information

Example:

    $ date --rfc-2822
    Tue, 08 Jul 2014 09:13:44 +0000

    PGPOOL ~ $ ps aux|grep 7163
    postgres  7163  0.0  0.0 214168  7872 ?  S  Jul07  0:00 pgpool: db_www db_www 10.20.5.215(41780) SET

    PGPOOL ~ $ pcp_proc_info --verbose 90 localhost 9898 root password 7163
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 13797
    Connected    : 1
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 29006
    Connected    : 1
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 28686
    Connected    : 1
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 25065
    Connected    : 1

    datid | datname |  pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | waiting |        state        |               query
    17969 | db_www  | 13797 |    16396 | db_www  |                  | 10.20.5.189 |                 |       42993 | 2014-07-07 11:38:58.554118+00 | 2014-07-07 11:39:18.346472+00 | 2014-07-07 11:39:31.17378+00  | 2014-07-07 11:39:31.173841+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT

    POSTGRES_MASTER ~ $ ps aux | grep 13797
    postgres 13797  0.0  0.0 2630544 10256 ?  Ss  Jul07  0:00 postgres: db_www db_www 10.20.5.189(42993) idle in transaction

    17969 | db_www  | 13797 |    16396 | db_www  |                  | 172.20.5.189 |                 |       42993 | 2014-07-07 11:38:58.554118+00 | 2014-07-07 11:39:18.346472+00 | 2014-07-07 11:39:31.17378+00  | 2014-07-07 11:39:31.173841+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT

    POSTGRES_MASTER ~ $ sudo strace -p 13797
    Process 13797 attached - interrupt to quit
    recvfrom(9,
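The manual chain shown above (pgpool child PID, then pcp_proc_info for the backend PIDs, then pg_stat_activity on the primary) can be scripted. A sketch reusing the same pcp_proc_info invocation as in the report; the helper script name, credentials, and host are placeholders.

```sh
#!/bin/sh
# Usage: ./stuck_child.sh <pgpool-child-pid>   (hypothetical helper script)
CHILD_PID=$1
PRIMARY_HOST=primary.example.com   # placeholder

# Same pcp_proc_info invocation as above; extract the backend PIDs it holds.
pcp_proc_info --verbose 90 localhost 9898 root password "$CHILD_PID" \
  | awk -F':' '/^Backend PID/ {gsub(/ /, "", $2); print $2}' \
  | while read BACKEND_PID; do
      # Look each backend up on the primary.
      psql -h "$PRIMARY_HOST" -U postgres -d postgres -x -c "
        SELECT pid, state, xact_start, query_start, state_change, query
        FROM pg_stat_activity WHERE pid = $BACKEND_PID;"
    done
```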
Tags: No tags attached.
Notes

(0000446) tuomas, 2014-07-11 03:12
We are having the very same problem. Pgpool 3.3.3 is not properly closing connections and releasing them back to the pool after the client has already closed its connection. We are also seeing only DISCARD ALL on the stuck connections, even though that query has returned successfully. Because of this the pool fills up and eventually stops accepting new connections, taking the services down, so this is critical for us. In our case the application is Rails 4.1.4, but the problem has to be in pgpool, since the connection is not released even after the client has fully and cleanly closed its side.
(0000447) tuomas, 2014-07-11 04:45
To add some details, this is what the processes look like:

    pgpool    1903  0.1  0.0 100756 12584 ?  S<  16:00  0:09 pgpool: app app_production fe02(39459) DISCARD
    pgpool    1906  0.1  0.0 100924 12660 ?  S<  16:00  0:08 pgpool: app app_production fe04(60458) DISCARD
    pgpool    1927  0.1  0.0 100996 12688 ?  S<  16:00  0:12 pgpool: app app_production fe02(41058) DISCARD
    pgpool    1951  0.1  0.0 100772 12500 ?  S<  16:00  0:08 pgpool: app app_production fe05(48386) DISCARD
    pgpool    1994  0.0  0.0 100420 12256 ?  S<  16:00  0:04 pgpool: app app_production fe01(57090) DISCARD
    pgpool    2019  0.0  0.0 100644 12480 ?  S<  16:00  0:04 pgpool: app app_production fe05(38709) DISCARD
    pgpool   13340  0.0  0.0 100900 12652 ?  S<  16:10  0:04 pgpool: app app_production fe05(38094) DISCARD
    pgpool   13348  0.0  0.0 100712 12472 ?  S<  16:10  0:02 pgpool: app app_production fe05(55029) DISCARD
    pgpool   15216  0.0  0.0  90676  2392 ?  S<  17:17  0:01 pgpool: app app_production fe05(37722) DISCARD
    pgpool   15296  0.1  0.0  90676  2304 ?  S<  17:51  0:01 pgpool: app app_production fe05(45917) DISCARD

In Postgres those connections are idle, and the connections from those ports appear closed. On the client side we see them in FIN_WAIT2 until they time out there. Even after the connection has fully closed (nothing in netstat for the port), it still shows up in the pgpool process list as if it were in use.
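A quick way to confirm that the TCP connection really is gone while pgpool still holds the slot is to look for the client's source port on both ends; a sketch using the first port from the process list above (39459):

```sh
#!/bin/sh
PORT=39459   # client source port, as shown in the pgpool process title

# On the pgpool host: nothing here while the child still shows "DISCARD"
# means the frontend socket is already gone.
netstat -tnp 2>/dev/null | grep ":$PORT"

# On the client host (fe02 in this case): the same connection lingering
# in FIN_WAIT2 until it times out.
netstat -tn | grep ":$PORT"

# The pgpool child that should have released the slot:
ps aux | grep "pgpool: app app_production" | grep "$PORT" | grep -v grep
```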
(0000448) tuomas, 2014-07-11 09:05
I dumped the core of one of the stuck processes to see what state it is in:

    (gdb) bt
    #0  0x00007f1a73322743 in select () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x000000000041675d in pool_check_fd (cp=<optimized out>) at pool_process_query.c:951
    #2  0x000000000041de70 in pool_read (cp=0x1a30470, buf=<optimized out>, len=1) at pool_stream.c:139
    #3  0x000000000041c8fe in read_kind_from_backend (frontend=0x1a2a640, backend=0x1a2a1e0, decided_kind=0x7ffffeed801f "") at pool_process_query.c:3771
    #4  0x000000000044f6a9 in ProcessBackendResponse (frontend=0x1a2a640, backend=0x1a2a1e0, state=0x7ffffeed8094, num_fields=0x7ffffeed809c) at pool_proto_modules.c:2742
    #5  0x000000000041a095 in pool_process_query (frontend=0x1a2a640, backend=0x1a2a1e0, reset_request=1) at pool_process_query.c:289
    #6  0x000000000040c716 in do_child (unix_fd=5, inet_fd=<optimized out>) at child.c:403
    #7  0x00000000004075df in fork_a_child (unix_fd=5, inet_fd=6, id=148) at main.c:1238
    #8  0x0000000000407d0a in reaper () at main.c:2457
    #9  reaper () at main.c:2369
    #10 0x00000000004086bd in pool_sleep (second=<optimized out>) at main.c:2654
    #11 0x0000000000406349 in main (argc=<optimized out>, argv=<optimized out>) at main.c:836

In the logs there are also errors "ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited" for the PIDs that are stuck.
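A backtrace like the one above can be captured from any stuck child; a sketch, assuming debug symbols are available for the pgpool binary (the binary path shown is the Debian/Ubuntu packaging default and may differ):

```sh
#!/bin/sh
PID=7163   # PID of a stuck pgpool child, e.g. from the ps output earlier

# Attach briefly, print the stack, and detach.
sudo gdb --batch -ex bt -p "$PID"

# Or write a core file for offline analysis.
sudo gcore -o /tmp/pgpool-stuck "$PID"
sudo gdb /usr/sbin/pgpool "/tmp/pgpool-stuck.$PID" --batch -ex 'bt full'
```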
(0000449) Muhammad Usama, 2014-07-16 19:44
I have been trying to replicate this issue but have had little luck so far. For further analysis, could you please share the PostgreSQL log and the pgpool-II configuration you are using when this situation arises? Meanwhile, we have come up with a patch for the problem most likely to be causing this stuck-connection situation. Could you please try out the attached patch (connection_stuck.patch) and check whether it solves the problem? Please note that the patch is generated against the 3.3-stable branch.
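For reference, applying the patch and rebuilding typically looks like the sketch below; the source directory name, strip level (-p1), and configure options are assumptions, not instructions from the patch itself.

```sh
#!/bin/sh
# Rough sketch: rebuild pgpool-II from a 3.3.x source tree with the patch applied.
cd pgpool-II-3.3.3                          # unpacked source tree (assumed path)
patch -p1 < /path/to/connection_stuck.patch # strip level is an assumption

./configure
make
sudo make install

# Then restart the pgpool service so the patched binary takes over,
# using whatever init mechanism runs pgpool on the host.
```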
(0000450) sw, 2014-07-17 17:45
Thanks, Muhammad. I patched pgpool and upgraded all servers. Will keep you posted on how things develop. Config is:

    $ grep -v ^$ /etc/pgpool2/pgpool.conf | egrep -v '^\s*#'
    listen_addresses = '*'
    port = 9999
    socket_dir = '/var/run/postgresql'
    pcp_port = 9898
    pcp_socket_dir = '/var/run/postgresql'
    backend_hostname0 = '172.20.15.1'
    backend_port0 = 5432
    backend_weight0 = 0
    backend_data_directory0 = '/data/pgsql/data'
    backend_hostname1 = '172.20.15.3'
    backend_port1 = 5432
    backend_weight1 = 5
    backend_data_directory1 = '/data/pgsql/data'
    backend_hostname2 = '172.20.5.119'
    backend_port2 = 5432
    backend_weight2 = 5
    backend_data_directory2 = '/data/postgresql/9.3/main'
    backend_hostname3 = '172.20.5.195'
    backend_port3 = 5432
    backend_weight3 = 5
    backend_data_directory3 = '/data/postgresql/9.3/main'
    enable_pool_hba = on
    pool_passwd = ''
    authentication_timeout = 10
    ssl = off
    num_init_children = 512
    max_pool = 4
    child_life_time = 10
    child_max_connections = 2000
    connection_life_time = 300
    client_idle_limit = 1200
    log_destination = 'stderr'
    print_timestamp = on
    log_connections = off
    log_hostname = off
    log_statement = off
    log_per_node_statement = off
    log_standby_delay = 'if_over_threshold'
    syslog_facility = 'LOCAL0'
    syslog_ident = 'pgpool'
    debug_level = 0
    pid_file_name = '/var/run/postgresql/pgpool.pid'
    logdir = '/var/log/postgresql'
    connection_cache = on
    reset_query_list = 'ABORT; SET SESSION AUTHORIZATION DEFAULT; RESET ALL; DEALLOCATE ALL; CLOSE ALL; UNLISTEN *; SELECT pg_advisory_unlock_all(); DISCARD PLANS; DISCARD TEMP;'
    replication_mode = off
    replicate_select = off
    insert_lock = on
    lobj_lock_table = ''
    replication_stop_on_mismatch = off
    failover_if_affected_tuples_mismatch = off
    load_balance_mode = on
    ignore_leading_white_space = on
    white_function_list = ''
    black_function_list = 'nextval,setval'
    master_slave_mode = on
    master_slave_sub_mode = 'stream'
    sr_check_period = 10
    sr_check_user = 'someuser'
    sr_check_password = 'somesecretpassword'
    delay_threshold = 32000000
    follow_master_command = ''
    parallel_mode = off
    pgpool2_hostname = ''
    health_check_period = 40
    health_check_timeout = 10
    health_check_user = 'someuser'
    health_check_password = 'somesecretpassword'
    health_check_max_retries = 4
    health_check_retry_delay = 5
    failover_command = ''
    failback_command = ''
    fail_over_on_backend_error = off
    search_primary_node_timeout = 120
    client_idle_limit_in_recovery = 0
    use_watchdog = on                    # Activates watchdog
    trusted_servers = ''
    ping_path = '/bin'
    wd_hostname = '172.20.5.191'
    wd_port = 9000
    wd_authkey = ''
    delegate_IP = '172.20.15.2'
    ifconfig_path = '/etc/pgpool2'
    if_up_cmd = 'pgpool_ifconfig eth1:3 inet $_IP_$ netmask 255.255.0.0'
    if_down_cmd = 'pgpool_ifconfig eth1:3 down'
    arping_path = '/usr/sbin'            # arping command path
    arping_cmd = 'arping -U -I eth1 -w 1 -s 172.20.5.191 $_IP_$'
    clear_memqcache_on_escalation = on
    wd_escalation_command = ''
    wd_lifecheck_method = 'heartbeat'
    wd_interval = 10
    wd_heartbeat_port = 9694
    wd_heartbeat_keepalive = 2
    wd_heartbeat_deadtime = 20
    heartbeat_destination0 = '172.20.5.189'
    heartbeat_destination_port0 = 9694
    heartbeat_device0 = 'eth1'
    heartbeat_destination1 = '172.20.5.190'
    heartbeat_destination_port1 = 9694
    heartbeat_device1 = 'eth1'
    wd_life_point = 6
    wd_lifecheck_query = 'SELECT 1'
    wd_lifecheck_dbname = 'postgres'
    wd_lifecheck_user = 'someuser'
    wd_lifecheck_password = 'somesecretpassword'
    other_pgpool_hostname0 = '172.20.5.189'
    other_pgpool_port0 = 5432
    other_wd_port0 = 9000
    other_pgpool_hostname1 = '172.20.5.190'
    other_pgpool_port1 = 5432
    other_wd_port1 = 9000
    relcache_expire = 60
    relcache_size = 512
    check_temp_table = off
    memory_cache_enabled = off
    memqcache_method = 'shmem'
    memqcache_memcached_host = 'localhost'
    memqcache_memcached_port = 11211
    memqcache_total_size = 67108864
    memqcache_max_num_cache = 1000000
    memqcache_expire = 0
    memqcache_auto_cache_invalidation = on
    memqcache_maxcache = 409600
    memqcache_cache_block_size = 1048576
    memqcache_oiddir = '/var/log/pgpool/oiddir'
    white_memqcache_table_list = ''
    black_memqcache_table_list = ''
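The settings most relevant to this thread are the pool-sizing and idle-timeout knobs; a small sketch for pulling them out of a configuration like the one above (the file path is the Debian/Ubuntu default used in this report):

```sh
#!/bin/sh
# Show only the parameters that govern pooling and idle timeouts.
egrep '^(num_init_children|max_pool|child_life_time|child_max_connections|connection_life_time|client_idle_limit|reset_query_list)' \
  /etc/pgpool2/pgpool.conf

# With the values above, each backend may see up to
#   num_init_children * max_pool = 512 * 4 = 2048
# cached connections, and client_idle_limit = 1200 means an idle frontend is
# disconnected after 20 minutes -- a number that shows up again later in this
# thread.
```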
(0000451) sw, 2014-07-17 17:47
Log file attached. I kill connections/processes before pgpool runs out of available connections, so there is no log message from that critical situation, but maybe it helps anyway.

File: pgpool.log
(0000452) tuomas, 2014-07-18 03:41
I've also applied the patch, and it indeed seems to fix the problem. No more stuck connections. Thanks!
(0000453) sw, 2014-07-21 17:07
After three days in production I can tell that the patch has mitigated the stuck-connection issue but has not entirely fixed it. This morning I found these connections still open:

    PGPOOL ~ $ ps auxfww | grep pgpool.*SET
    postgres  5010  0.0  0.0 219468 7652 ?  S  Jul18  0:00  \_ pgpool: db_www db_www 172.20.5.38(36343) SET
    postgres 27254  0.0  0.0 219468 7652 ?  S  Jul19  0:00  \_ pgpool: db_www db_www 172.20.5.38(14781) SET
    postgres 23719  0.0  0.0 219468 7900 ?  S  Jul20  0:00  \_ pgpool: db_www db_www 172.20.5.38(35763) SET
    postgres 22589  0.0  0.0 219468 7652 ?  S  02:30  0:00  \_ pgpool: db_www db_www 172.20.5.38(23499) SET

(Note that my reset_query_list contains the individual commands that make up DISCARD ALL rather than DISCARD ALL itself; see the config in my earlier post.)

    datid | datname |  pid  | usesysid | usename | application_name | client_addr  | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | waiting |        state        |               query
    17969 | db_www  | 28747 |    16396 | db_www  |                  | 172.20.5.191 |                 |        8948 | 2014-07-21 02:30:14.013174+00 | 2014-07-21 02:30:14.021105+00 | 2014-07-21 02:50:16.285306+00 | 2014-07-21 02:50:16.285364+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    17969 | db_www  |  2340 |    16396 | db_www  |                  | 172.20.5.191 |                 |       55078 | 2014-07-20 02:30:07.414983+00 | 2014-07-20 02:30:13.828291+00 | 2014-07-20 02:50:16.129658+00 | 2014-07-20 02:50:16.129699+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    17969 | db_www  | 30416 |    16396 | db_www  |                  | 172.20.5.191 |                 |       64824 | 2014-07-18 02:30:03.504255+00 | 2014-07-18 02:30:03.514478+00 | 2014-07-18 02:50:05.827223+00 | 2014-07-18 02:50:05.827315+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    17969 | db_www  | 24639 |    16396 | db_www  |                  | 172.20.5.191 |                 |       28828 | 2014-07-19 02:30:13.349585+00 | 2014-07-19 02:30:13.363914+00 | 2014-07-19 02:50:15.631723+00 | 2014-07-19 02:50:15.631799+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    (4 rows)

Interestingly, xact_start and query_start are very similar every time; it looks like it is the same query every day. The difference of 20 minutes between the two timestamps matches the client_idle_limit (1200 s). I couldn't find anything related to this in the log files yet. Maybe there is just one extra case to be handled in the patch to fix the issue entirely.
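Building on the observation that the gap between xact_start and query_start matches client_idle_limit (1200 s = 20 min), a query like the following sketch (placeholder connection details) can flag such sessions before they pile up; terminating them from the PostgreSQL side is then a last resort.

```sh
#!/bin/sh
PRIMARY_HOST=primary.example.com   # placeholder

# Sessions idle in transaction for longer than client_idle_limit (1200 s).
psql -h "$PRIMARY_HOST" -U postgres -d postgres -c "
SELECT pid, client_addr, xact_start, query_start,
       query_start - xact_start AS gap, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '1200 seconds'
ORDER BY xact_start;"

# If one must be cleaned up manually (instead of killing the pgpool child):
#   SELECT pg_terminate_backend(<pid>);
```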
(0000454) t-ishii, 2014-07-22 00:41
I looked into the code related to client_idle_limit and found a problem similar to the one fixed by the previous patch. The attached patch (connection_stuck2.patch) should fix this case. Could you please apply the patch and see whether the problem is solved?
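One way to exercise the client_idle_limit path this second patch touches is to hold a frontend session idle through pgpool for longer than the limit and then check that the child is actually released; a rough sketch, with a placeholder pgpool host and the 1200 s limit from the configuration above:

```sh
#!/bin/sh
PGPOOL_HOST=pgpool.example.com   # placeholder

# Hold an otherwise idle session open through pgpool for longer than
# client_idle_limit (1200 s); pgpool is expected to disconnect the idle
# frontend and return the child to the pool.
sleep 1300 | psql -h "$PGPOOL_HOST" -p 9999 -U db_www db_www

# Afterwards no child should remain stuck on a reset query:
ps aux | grep -E 'pgpool: .*(DISCARD|SET)' | grep -v grep
```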
(0000455) sw, 2014-07-23 22:05
Looks like both patches together solve this issue. No stuck connections since yesterday. Thanks!
(0000456) t-ishii, 2014-07-24 00:28
Great! I will commit the patch to all supported branches.
Issue History

| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2014-07-08 18:14 | sw | New Issue | |
| 2014-07-11 03:12 | tuomas | Note Added: 0000446 | |
| 2014-07-11 04:45 | tuomas | Note Added: 0000447 | |
| 2014-07-11 09:05 | tuomas | Note Added: 0000448 | |
| 2014-07-11 13:37 | t-ishii | Assigned To | => Muhammad Usama |
| 2014-07-11 13:37 | t-ishii | Status | new => assigned |
| 2014-07-16 19:38 | Muhammad Usama | File Added: connection_stuck.patch | |
| 2014-07-16 19:44 | Muhammad Usama | Note Added: 0000449 | |
| 2014-07-17 17:45 | sw | Note Added: 0000450 | |
| 2014-07-17 17:45 | sw | File Added: pgpool.log | |
| 2014-07-17 17:47 | sw | Note Added: 0000451 | |
| 2014-07-18 03:41 | tuomas | Note Added: 0000452 | |
| 2014-07-21 17:07 | sw | Note Added: 0000453 | |
| 2014-07-22 00:37 | t-ishii | File Added: connection_stuck2.patch | |
| 2014-07-22 00:41 | t-ishii | Note Added: 0000454 | |
| 2014-07-23 22:05 | sw | Note Added: 0000455 | |
| 2014-07-24 00:28 | t-ishii | Note Added: 0000456 | |
| 2014-07-24 00:28 | t-ishii | Status | assigned => resolved |