View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000107 | Pgpool-II | Bug | public | 2014-07-08 18:14 | 2020-10-15 23:05 |

| Field | Value |
|---|---|
| Reporter | sw |
| Assigned To | Muhammad Usama |
| Priority | normal |
| Severity | major |
| Reproducibility | unable to reproduce |
| Status | resolved |
| Resolution | open |
| Summary | 0000107: Stuck connections |
Description

Since I upgraded my DB cluster to PostgreSQL 9.3 and Pgpool-II 3.3.3, connections get stuck in Pgpool. They never time out, and the only way to get rid of them is to kill the processes.

Versions in use (systems are Ubuntu 12.04 LTS):

    pgpool2=3.3.3-2.pgdg12.4+2
    postgresql-9.3=9.3.4-1.pgdg12.4+1
    postgresql-9.3-pgpool2=3.3.3-1.pgdg12.4+1

The connections are stuck in the reset_query_list (DISCARD ALL) in state 'idle in transaction'. More details here:
http://www.pgpool.net/pipermail/pgpool-general/2014-July/003044.html

Possibly related cases:
http://www.pgpool.net/mantisbt/view.php?id=106
http://www.pgpool.net/pipermail/pgpool-general/2014-July/003049.html
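A minimal shell sketch for spotting this symptom (host name and credentials below are placeholders, not values from this report): it lists backends sitting in 'idle in transaction' on one of the reset queries, and the pgpool children still attached to them.

```sh
#!/bin/sh
PRIMARY_HOST=primary.example.com   # placeholder for the primary PostgreSQL host

# Backends left "idle in transaction" on one of the reset queries.
psql -h "$PRIMARY_HOST" -U postgres -d postgres -x -c "
SELECT pid, usename, client_addr, state_change,
       now() - state_change AS stuck_for, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND query ~* '^(DISCARD|ABORT|RESET|SET SESSION AUTHORIZATION)';"

# On the pgpool host, matching children show the reset command (SET / DISCARD)
# in their process title.
ps aux | grep -E 'pgpool: .*(DISCARD|SET)' | grep -v grep
```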
Steps To Reproduce

I haven't found a way to reproduce it. It seems to happen mostly when pgpool/the database is under heavy load.
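Since the problem reportedly shows up mainly under heavy load, one way to try to reproduce it is to push concurrent connect/disconnect traffic through pgpool rather than straight at PostgreSQL. A rough sketch using pgbench; the pgpool host is a placeholder, and port 9999 is taken from the configuration posted later in this thread.

```sh
#!/bin/sh
PGPOOL_HOST=pgpool.example.com   # placeholder

# Create and initialize a throwaway benchmark database through pgpool.
createdb -h "$PGPOOL_HOST" -p 9999 pgbench_test
pgbench  -h "$PGPOOL_HOST" -p 9999 -i -s 10 pgbench_test

# Many clients, a new connection per transaction (-C), so the
# reset_query_list path is exercised on every disconnect; run for 10 minutes.
pgbench -h "$PGPOOL_HOST" -p 9999 -c 64 -j 8 -C -T 600 pgbench_test

# Meanwhile, watch for children that stay on DISCARD/SET after their
# client has gone away.
watch -n 5 "ps aux | grep -E 'pgpool: .*(DISCARD|SET)' | grep -v grep"
```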
Additional Information

Example:

    $ date --rfc-2822
    Tue, 08 Jul 2014 09:13:44 +0000

    PGPOOL ~ $ ps aux|grep 7163
    postgres  7163  0.0  0.0 214168  7872 ?  S  Jul07  0:00 pgpool: db_www db_www 10.20.5.215(41780) SET

    PGPOOL ~ $ pcp_proc_info --verbose 90 localhost 9898 root password 7163
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 13797
    Connected    : 1
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 29006
    Connected    : 1
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 28686
    Connected    : 1
    Database     : db_www
    Username     : db_www
    Start time   : 2014-07-07 11:38:51
    Creation time: 2014-07-07 11:38:58
    Major        : 3
    Minor        : 0
    Counter      : 11
    Backend PID  : 25065
    Connected    : 1

    datid | datname |  pid  | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | waiting |        state        |               query
    17969 | db_www  | 13797 |    16396 | db_www  |                  | 10.20.5.189 |                 |       42993 | 2014-07-07 11:38:58.554118+00 | 2014-07-07 11:39:18.346472+00 | 2014-07-07 11:39:31.17378+00  | 2014-07-07 11:39:31.173841+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT

    POSTGRES_MASTER ~ $ ps aux | grep 13797
    postgres 13797  0.0  0.0 2630544 10256 ?  Ss  Jul07  0:00 postgres: db_www db_www 10.20.5.189(42993) idle in transaction

    17969 | db_www  | 13797 |    16396 | db_www  |                  | 172.20.5.189 |                 |       42993 | 2014-07-07 11:38:58.554118+00 | 2014-07-07 11:39:18.346472+00 | 2014-07-07 11:39:31.17378+00  | 2014-07-07 11:39:31.173841+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT

    POSTGRES_MASTER ~ $ sudo strace -p 13797
    Process 13797 attached - interrupt to quit
    recvfrom(9,
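The manual chain shown above (pgpool child PID, then pcp_proc_info for the backend PIDs, then pg_stat_activity on the primary) can be scripted. A sketch reusing the same pcp_proc_info invocation as in the report; the helper script name, credentials, and host are placeholders.

```sh
#!/bin/sh
# Usage: ./stuck_child.sh <pgpool-child-pid>   (hypothetical helper script)
CHILD_PID=$1
PRIMARY_HOST=primary.example.com   # placeholder

# Same pcp_proc_info invocation as above; extract the backend PIDs it holds.
pcp_proc_info --verbose 90 localhost 9898 root password "$CHILD_PID" \
  | awk -F':' '/^Backend PID/ {gsub(/ /, "", $2); print $2}' \
  | while read BACKEND_PID; do
      # Look each backend up on the primary.
      psql -h "$PRIMARY_HOST" -U postgres -d postgres -x -c "
        SELECT pid, state, xact_start, query_start, state_change, query
        FROM pg_stat_activity WHERE pid = $BACKEND_PID;"
    done
```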
Tags: No tags attached.
Notes

(0000446) tuomas, 2014-07-11 03:12
We are having the very same problem. Pgpool 3.3.3 is not properly closing connections and releasing them back to the pool after the client has already closed its connection. We are also seeing only DISCARD ALL on the stuck connections, even though that query has returned successfully. Because of this the pool fills up and eventually stops accepting new connections, taking the services down, so this is critical for us. In our case the application is Rails 4.1.4, but the problem has to be in pgpool, since the connection is not released even after the client has fully and cleanly closed its side.
(0000447) tuomas, 2014-07-11 04:45
To add some details, this is what the processes look like:

    pgpool    1903  0.1  0.0 100756 12584 ?  S<  16:00  0:09 pgpool: app app_production fe02(39459) DISCARD
    pgpool    1906  0.1  0.0 100924 12660 ?  S<  16:00  0:08 pgpool: app app_production fe04(60458) DISCARD
    pgpool    1927  0.1  0.0 100996 12688 ?  S<  16:00  0:12 pgpool: app app_production fe02(41058) DISCARD
    pgpool    1951  0.1  0.0 100772 12500 ?  S<  16:00  0:08 pgpool: app app_production fe05(48386) DISCARD
    pgpool    1994  0.0  0.0 100420 12256 ?  S<  16:00  0:04 pgpool: app app_production fe01(57090) DISCARD
    pgpool    2019  0.0  0.0 100644 12480 ?  S<  16:00  0:04 pgpool: app app_production fe05(38709) DISCARD
    pgpool   13340  0.0  0.0 100900 12652 ?  S<  16:10  0:04 pgpool: app app_production fe05(38094) DISCARD
    pgpool   13348  0.0  0.0 100712 12472 ?  S<  16:10  0:02 pgpool: app app_production fe05(55029) DISCARD
    pgpool   15216  0.0  0.0  90676  2392 ?  S<  17:17  0:01 pgpool: app app_production fe05(37722) DISCARD
    pgpool   15296  0.1  0.0  90676  2304 ?  S<  17:51  0:01 pgpool: app app_production fe05(45917) DISCARD

In Postgres those connections are idle, and the connections from those ports appear closed. On the client side we see them in FIN_WAIT2 until they time out there. Even after the connection has fully closed (nothing in netstat for the port), it still shows up in the pgpool process list as if it were in use.
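A quick way to confirm that the TCP connection really is gone while pgpool still holds the slot is to look for the client's source port on both ends; a sketch using the first port from the process list above (39459):

```sh
#!/bin/sh
PORT=39459   # client source port, as shown in the pgpool process title

# On the pgpool host: nothing here while the child still shows "DISCARD"
# means the frontend socket is already gone.
netstat -tnp 2>/dev/null | grep ":$PORT"

# On the client host (fe02 in this case): the same connection lingering
# in FIN_WAIT2 until it times out.
netstat -tn | grep ":$PORT"

# The pgpool child that should have released the slot:
ps aux | grep "pgpool: app app_production" | grep "$PORT" | grep -v grep
```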
(0000448) tuomas, 2014-07-11 09:05
I dumped the core of one of the stuck processes to see what state it is in:

    (gdb) bt
    #0  0x00007f1a73322743 in select () from /lib/x86_64-linux-gnu/libc.so.6
    #1  0x000000000041675d in pool_check_fd (cp=<optimized out>) at pool_process_query.c:951
    #2  0x000000000041de70 in pool_read (cp=0x1a30470, buf=<optimized out>, len=1) at pool_stream.c:139
    #3  0x000000000041c8fe in read_kind_from_backend (frontend=0x1a2a640, backend=0x1a2a1e0, decided_kind=0x7ffffeed801f "") at pool_process_query.c:3771
    #4  0x000000000044f6a9 in ProcessBackendResponse (frontend=0x1a2a640, backend=0x1a2a1e0, state=0x7ffffeed8094, num_fields=0x7ffffeed809c) at pool_proto_modules.c:2742
    #5  0x000000000041a095 in pool_process_query (frontend=0x1a2a640, backend=0x1a2a1e0, reset_request=1) at pool_process_query.c:289
    #6  0x000000000040c716 in do_child (unix_fd=5, inet_fd=<optimized out>) at child.c:403
    #7  0x00000000004075df in fork_a_child (unix_fd=5, inet_fd=6, id=148) at main.c:1238
    #8  0x0000000000407d0a in reaper () at main.c:2457
    #9  reaper () at main.c:2369
    #10 0x00000000004086bd in pool_sleep (second=<optimized out>) at main.c:2654
    #11 0x0000000000406349 in main (argc=<optimized out>, argv=<optimized out>) at main.c:836

In the logs there are also errors "ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited" for the PIDs that are stuck.
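A backtrace like the one above can be captured from any stuck child; a sketch, assuming debug symbols are available for the pgpool binary (the binary path shown is the Debian/Ubuntu packaging default and may differ):

```sh
#!/bin/sh
PID=7163   # PID of a stuck pgpool child, e.g. from the ps output earlier

# Attach briefly, print the stack, and detach.
sudo gdb --batch -ex bt -p "$PID"

# Or write a core file for offline analysis.
sudo gcore -o /tmp/pgpool-stuck "$PID"
sudo gdb /usr/sbin/pgpool "/tmp/pgpool-stuck.$PID" --batch -ex 'bt full'
```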
(0000449) Muhammad Usama, 2014-07-16 19:44
I have been trying to replicate this issue but have had little luck so far. For further analysis, could you please share the PostgreSQL log and the pgpool-II configuration you are using when this situation arises? Meanwhile, we have come up with a patch for the problem most likely to be causing this stuck-connection situation. Could you please try out the attached patch (connection_stuck.patch) and check whether it solves the problem? Please note that the patch is generated against the 3.3-stable branch.
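For reference, applying the patch and rebuilding typically looks like the sketch below; the source directory name, strip level (-p1), and configure options are assumptions, not instructions from the patch itself.

```sh
#!/bin/sh
# Rough sketch: rebuild pgpool-II from a 3.3.x source tree with the patch applied.
cd pgpool-II-3.3.3                          # unpacked source tree (assumed path)
patch -p1 < /path/to/connection_stuck.patch # strip level is an assumption

./configure
make
sudo make install

# Then restart the pgpool service so the patched binary takes over,
# using whatever init mechanism runs pgpool on the host.
```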
(0000450) sw, 2014-07-17 17:45
Thanks, Muhammad. I patched pgpool and upgraded all servers. Will keep you posted on how things develop. Config is:

    $ grep -v ^$ /etc/pgpool2/pgpool.conf | egrep -v '^\s*#'
    listen_addresses = '*'
    port = 9999
    socket_dir = '/var/run/postgresql'
    pcp_port = 9898
    pcp_socket_dir = '/var/run/postgresql'
    backend_hostname0 = '172.20.15.1'
    backend_port0 = 5432
    backend_weight0 = 0
    backend_data_directory0 = '/data/pgsql/data'
    backend_hostname1 = '172.20.15.3'
    backend_port1 = 5432
    backend_weight1 = 5
    backend_data_directory1 = '/data/pgsql/data'
    backend_hostname2 = '172.20.5.119'
    backend_port2 = 5432
    backend_weight2 = 5
    backend_data_directory2 = '/data/postgresql/9.3/main'
    backend_hostname3 = '172.20.5.195'
    backend_port3 = 5432
    backend_weight3 = 5
    backend_data_directory3 = '/data/postgresql/9.3/main'
    enable_pool_hba = on
    pool_passwd = ''
    authentication_timeout = 10
    ssl = off
    num_init_children = 512
    max_pool = 4
    child_life_time = 10
    child_max_connections = 2000
    connection_life_time = 300
    client_idle_limit = 1200
    log_destination = 'stderr'
    print_timestamp = on
    log_connections = off
    log_hostname = off
    log_statement = off
    log_per_node_statement = off
    log_standby_delay = 'if_over_threshold'
    syslog_facility = 'LOCAL0'
    syslog_ident = 'pgpool'
    debug_level = 0
    pid_file_name = '/var/run/postgresql/pgpool.pid'
    logdir = '/var/log/postgresql'
    connection_cache = on
    reset_query_list = 'ABORT; SET SESSION AUTHORIZATION DEFAULT; RESET ALL; DEALLOCATE ALL; CLOSE ALL; UNLISTEN *; SELECT pg_advisory_unlock_all(); DISCARD PLANS; DISCARD TEMP;'
    replication_mode = off
    replicate_select = off
    insert_lock = on
    lobj_lock_table = ''
    replication_stop_on_mismatch = off
    failover_if_affected_tuples_mismatch = off
    load_balance_mode = on
    ignore_leading_white_space = on
    white_function_list = ''
    black_function_list = 'nextval,setval'
    master_slave_mode = on
    master_slave_sub_mode = 'stream'
    sr_check_period = 10
    sr_check_user = 'someuser'
    sr_check_password = 'somesecretpassword'
    delay_threshold = 32000000
    follow_master_command = ''
    parallel_mode = off
    pgpool2_hostname = ''
    health_check_period = 40
    health_check_timeout = 10
    health_check_user = 'someuser'
    health_check_password = 'somesecretpassword'
    health_check_max_retries = 4
    health_check_retry_delay = 5
    failover_command = ''
    failback_command = ''
    fail_over_on_backend_error = off
    search_primary_node_timeout = 120
    client_idle_limit_in_recovery = 0
    use_watchdog = on                    # Activates watchdog
    trusted_servers = ''
    ping_path = '/bin'
    wd_hostname = '172.20.5.191'
    wd_port = 9000
    wd_authkey = ''
    delegate_IP = '172.20.15.2'
    ifconfig_path = '/etc/pgpool2'
    if_up_cmd = 'pgpool_ifconfig eth1:3 inet $_IP_$ netmask 255.255.0.0'
    if_down_cmd = 'pgpool_ifconfig eth1:3 down'
    arping_path = '/usr/sbin'            # arping command path
    arping_cmd = 'arping -U -I eth1 -w 1 -s 172.20.5.191 $_IP_$'
    clear_memqcache_on_escalation = on
    wd_escalation_command = ''
    wd_lifecheck_method = 'heartbeat'
    wd_interval = 10
    wd_heartbeat_port = 9694
    wd_heartbeat_keepalive = 2
    wd_heartbeat_deadtime = 20
    heartbeat_destination0 = '172.20.5.189'
    heartbeat_destination_port0 = 9694
    heartbeat_device0 = 'eth1'
    heartbeat_destination1 = '172.20.5.190'
    heartbeat_destination_port1 = 9694
    heartbeat_device1 = 'eth1'
    wd_life_point = 6
    wd_lifecheck_query = 'SELECT 1'
    wd_lifecheck_dbname = 'postgres'
    wd_lifecheck_user = 'someuser'
    wd_lifecheck_password = 'somesecretpassword'
    other_pgpool_hostname0 = '172.20.5.189'
    other_pgpool_port0 = 5432
    other_wd_port0 = 9000
    other_pgpool_hostname1 = '172.20.5.190'
    other_pgpool_port1 = 5432
    other_wd_port1 = 9000
    relcache_expire = 60
    relcache_size = 512
    check_temp_table = off
    memory_cache_enabled = off
    memqcache_method = 'shmem'
    memqcache_memcached_host = 'localhost'
    memqcache_memcached_port = 11211
    memqcache_total_size = 67108864
    memqcache_max_num_cache = 1000000
    memqcache_expire = 0
    memqcache_auto_cache_invalidation = on
    memqcache_maxcache = 409600
    memqcache_cache_block_size = 1048576
    memqcache_oiddir = '/var/log/pgpool/oiddir'
    white_memqcache_table_list = ''
    black_memqcache_table_list = ''
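The settings most relevant to this thread are the pool-sizing and idle-timeout knobs; a small sketch for pulling them out of a configuration like the one above (the file path is the Debian/Ubuntu default used in this report):

```sh
#!/bin/sh
# Show only the parameters that govern pooling and idle timeouts.
egrep '^(num_init_children|max_pool|child_life_time|child_max_connections|connection_life_time|client_idle_limit|reset_query_list)' \
  /etc/pgpool2/pgpool.conf

# With the values above, each backend may see up to
#   num_init_children * max_pool = 512 * 4 = 2048
# cached connections, and client_idle_limit = 1200 means an idle frontend is
# disconnected after 20 minutes -- a number that shows up again later in this
# thread.
```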
(0000451) sw, 2014-07-17 17:47
Log file attached. I kill connections/processes before pgpool runs out of available connections, so there is no log message from that critical situation, but maybe it helps anyway.

File: pgpool.log
(0000452) tuomas, 2014-07-18 03:41
I've also applied the patch, and it indeed seems to fix the problem. No more stuck connections. Thanks!
(0000453) sw, 2014-07-21 17:07
After three days in production I can tell that the patch has mitigated the stuck-connection issue but has not entirely fixed it. This morning I found these connections still open:

    PGPOOL ~ $ ps auxfww | grep pgpool.*SET
    postgres  5010  0.0  0.0 219468 7652 ?  S  Jul18  0:00  \_ pgpool: db_www db_www 172.20.5.38(36343) SET
    postgres 27254  0.0  0.0 219468 7652 ?  S  Jul19  0:00  \_ pgpool: db_www db_www 172.20.5.38(14781) SET
    postgres 23719  0.0  0.0 219468 7900 ?  S  Jul20  0:00  \_ pgpool: db_www db_www 172.20.5.38(35763) SET
    postgres 22589  0.0  0.0 219468 7652 ?  S  02:30  0:00  \_ pgpool: db_www db_www 172.20.5.38(23499) SET

(Note that my reset_query_list contains the individual commands that make up DISCARD ALL rather than DISCARD ALL itself; see the config in my earlier post.)

    datid | datname |  pid  | usesysid | usename | application_name | client_addr  | client_hostname | client_port |         backend_start         |          xact_start           |          query_start          |         state_change          | waiting |        state        |               query
    17969 | db_www  | 28747 |    16396 | db_www  |                  | 172.20.5.191 |                 |        8948 | 2014-07-21 02:30:14.013174+00 | 2014-07-21 02:30:14.021105+00 | 2014-07-21 02:50:16.285306+00 | 2014-07-21 02:50:16.285364+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    17969 | db_www  |  2340 |    16396 | db_www  |                  | 172.20.5.191 |                 |       55078 | 2014-07-20 02:30:07.414983+00 | 2014-07-20 02:30:13.828291+00 | 2014-07-20 02:50:16.129658+00 | 2014-07-20 02:50:16.129699+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    17969 | db_www  | 30416 |    16396 | db_www  |                  | 172.20.5.191 |                 |       64824 | 2014-07-18 02:30:03.504255+00 | 2014-07-18 02:30:03.514478+00 | 2014-07-18 02:50:05.827223+00 | 2014-07-18 02:50:05.827315+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    17969 | db_www  | 24639 |    16396 | db_www  |                  | 172.20.5.191 |                 |       28828 | 2014-07-19 02:30:13.349585+00 | 2014-07-19 02:30:13.363914+00 | 2014-07-19 02:50:15.631723+00 | 2014-07-19 02:50:15.631799+00 | f       | idle in transaction | SET SESSION AUTHORIZATION DEFAULT
    (4 rows)

Interestingly, xact_start and query_start are very similar every time; it looks like it is the same query every day. The difference of 20 minutes between the two timestamps matches the client_idle_limit (1200 s). I couldn't find anything related to this in the log files yet. Maybe there is just one extra case to be handled in the patch to fix the issue entirely.
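Building on the observation that the gap between xact_start and query_start matches client_idle_limit (1200 s = 20 min), a query like the following sketch (placeholder connection details) can flag such sessions before they pile up; terminating them from the PostgreSQL side is then a last resort.

```sh
#!/bin/sh
PRIMARY_HOST=primary.example.com   # placeholder

# Sessions idle in transaction for longer than client_idle_limit (1200 s).
psql -h "$PRIMARY_HOST" -U postgres -d postgres -c "
SELECT pid, client_addr, xact_start, query_start,
       query_start - xact_start AS gap, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
  AND now() - xact_start > interval '1200 seconds'
ORDER BY xact_start;"

# If one must be cleaned up manually (instead of killing the pgpool child):
#   SELECT pg_terminate_backend(<pid>);
```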
(0000454) t-ishii, 2014-07-22 00:41
I looked into the code related to client_idle_limit and found a problem similar to the one fixed by the previous patch. The attached patch (connection_stuck2.patch) should fix this case. Could you please apply the patch and see whether the problem is solved?
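One way to exercise the client_idle_limit path this second patch touches is to hold a frontend session idle through pgpool for longer than the limit and then check that the child is actually released; a rough sketch, with a placeholder pgpool host and the 1200 s limit from the configuration above:

```sh
#!/bin/sh
PGPOOL_HOST=pgpool.example.com   # placeholder

# Hold an otherwise idle session open through pgpool for longer than
# client_idle_limit (1200 s); pgpool is expected to disconnect the idle
# frontend and return the child to the pool.
sleep 1300 | psql -h "$PGPOOL_HOST" -p 9999 -U db_www db_www

# Afterwards no child should remain stuck on a reset query:
ps aux | grep -E 'pgpool: .*(DISCARD|SET)' | grep -v grep
```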
(0000455) sw, 2014-07-23 22:05
Looks like both patches together solve this issue. No stuck connections since yesterday. Thanks!
(0000456) t-ishii, 2014-07-24 00:28
Great! I will commit the patch to all supported branches.
Issue History

| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2014-07-08 18:14 | sw | New Issue | |
| 2014-07-11 03:12 | tuomas | Note Added: 0000446 | |
| 2014-07-11 04:45 | tuomas | Note Added: 0000447 | |
| 2014-07-11 09:05 | tuomas | Note Added: 0000448 | |
| 2014-07-11 13:37 | t-ishii | Assigned To | => Muhammad Usama |
| 2014-07-11 13:37 | t-ishii | Status | new => assigned |
| 2014-07-16 19:38 | Muhammad Usama | File Added: connection_stuck.patch | |
| 2014-07-16 19:44 | Muhammad Usama | Note Added: 0000449 | |
| 2014-07-17 17:45 | sw | Note Added: 0000450 | |
| 2014-07-17 17:45 | sw | File Added: pgpool.log | |
| 2014-07-17 17:47 | sw | Note Added: 0000451 | |
| 2014-07-18 03:41 | tuomas | Note Added: 0000452 | |
| 2014-07-21 17:07 | sw | Note Added: 0000453 | |
| 2014-07-22 00:37 | t-ishii | File Added: connection_stuck2.patch | |
| 2014-07-22 00:41 | t-ishii | Note Added: 0000454 | |
| 2014-07-23 22:05 | sw | Note Added: 0000455 | |
| 2014-07-24 00:28 | t-ishii | Note Added: 0000456 | |
| 2014-07-24 00:28 | t-ishii | Status | assigned => resolved |