View Issue Details

IDProjectCategoryView StatusLast Update
0000076Pgpool-IIBugpublic2013-10-19 12:47
ReporterglenngerhardtAssigned Tot-ishii 
PrioritynormalSeveritymajorReproducibilityalways
Status resolvedResolutionopen 
PlatformLinuxOSCentOSOS Version2.6.32-279.22.1
Product Version 
Target VersionFixed in Version 
Summary0000076: pgpool 3.3.1 has segfault under load of UPDATEs to psql 9.2 database.
DescriptionCore file stack trace:

Program terminated with signal 11, Segmentation fault.
#0 0x000000000044ce18 in Execute (frontend=0x1248280, backend=0x10c27d0, len=5, contents=0x121d440 "") at pool_proto_modules.c:745
745 pool_proto_modules.c: No such file or directory.
        in pool_proto_modules.c
#0 0x000000000044ce18 in Execute (frontend=0x1248280, backend=0x10c27d0, len=5, contents=0x121d440 "") at pool_proto_modules.c:745
0000001 0x000000000044e27d in ProcessFrontendResponse (frontend=0x1248280, backend=0x10c27d0) at pool_proto_modules.c:2496
0000002 0x000000000041c8d0 in pool_process_query (frontend=<value optimized out>, backend=0x10c27d0, reset_request=<value optimized out>)
    at pool_process_query.c:232
0000003 0x000000000040b968 in do_child (unix_fd=4, inet_fd=<value optimized out>) at child.c:371
0000004 0x0000000000404dff in fork_a_child (unix_fd=4, inet_fd=5, id=454) at main.c:1246
0000005 0x0000000000408010 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:654
Steps To ReproduceLoad test. Running approximately 300 concurrent UPDATES to pgpool.

The UPDATE looks like :

UPDATE client SET (device_profile, device_type, model, nonce, os, os_version, vendor, updated_at) = ('(iPhone4;4.2.1;Apple)', 'IMC2.3.0_IP_D2.2.0_S0', 'iPhone4', '<0>drm_token', 'iphone', '4.2.1', 'Apple', NOW() AT TIME ZONE 'UTC') WHERE device_id = 'brg3_IMC2.3.0_IP_D2.2.0_S0_1' AND owner_id = '2' AND user_token = 'brg3'

Steps to reproduce are difficult. There's no 1, 2, 3, BANG that does it. I let the load run and occasionally see the crash.
Additional InformationPgpool log:
2013-10-11 17:41:17 ERROR: pid 99744: Child process 100200 was terminated by segmentation fault

/var/log/messages:
Oct 11 17:41:01 butler kernel: pgpool[100200]: segfault at ffffffffbe51e684 ip 000000000044ce18 sp 00007fff65f626e0 error 4 in pgpool[400000+107000]

This defect could be a duplicate of this, but not sure.
http://www.pgpool.net/mantisbt/view.php?id=55
TagsNo tags attached.

Activities

glenngerhardt

2013-10-12 03:17

reporter  

pgpool.conf (20,432 bytes)

t-ishii

2013-10-15 17:46

developer   ~0000339

The place where segfault gets happend is in the code path of query cache. While bug 0000055 did not use the query cache. So I think it is not likely those bugs are directly connected. In the mean time I found a nasty bug with the code path of the query cache. This may or may not be connected your problem but can you please try it out?

t-ishii

2013-10-15 17:48

developer  

pool_proto_modules.c.patch (539 bytes)
diff --git a/pool_proto_modules.c b/pool_proto_modules.c
index cdc3ae3..c982a3f 100644
--- a/pool_proto_modules.c
+++ b/pool_proto_modules.c
@@ -745,7 +745,7 @@ POOL_STATUS Execute(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
 				snprintf(hex_str, sizeof(hex_str), (i == 0) ? " %02X" : "%02X", 0xff & query_in_bind_msg[i]);
 				hexlen = strlen(hex_str);
 
-				if ((len+hexlen) > alloc_len)
+				if ((len+hexlen) >= alloc_len)
 				{
 					alloc_len += STR_ALLOC_SIZE;
 					search_query = realloc(search_query, alloc_len);

glenngerhardt

2013-10-16 02:27

reporter   ~0000343

I am sorry, but the patch did not work. The patch was on line 748 and the crash is on line 745 (before that loop). It still crashes. Here is some info from the core file that should be helpful. It seems that the bind_msg->param_offset is a very large negative number.

(gdb) bt
#0 0x000000000044ce18 in Execute (frontend=0x1b75a20, backend=0x1aaa080, len=5, contents=0x1b6f630 "") at pool_proto_modu
0000001 0x000000000044e27d in ProcessFrontendResponse (frontend=0x1b75a20, backend=0x1aaa080) at pool_proto_modules.c:2496
0000002 0x000000000041c8d0 in pool_process_query (frontend=<value optimized out>, backend=0x1aaa080, reset_request=<value opti
    at pool_process_query.c:232
0000003 0x000000000040b968 in do_child (unix_fd=4, inet_fd=<value optimized out>) at child.c:371
0000004 0x0000000000404dff in fork_a_child (unix_fd=4, inet_fd=5, id=285) at main.c:1246
0000005 0x000000000040648a in reaper () at main.c:2446
0000006 0x0000000000408854 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:702
(gdb) p hex_str
$1 = "^v\340", <incomplete sequence \313>
(gdb) p query_in_bind_msg
$2 = 0x722240 ""
(gdb) p bind_msg
$3 = (POOL_SENT_MESSAGE *) 0x1b67218
(gdb) p *bind_msg
$4 = {kind = 66 'B', len = 24, contents = 0x1b67258 "", num_tsparams = 0, name = 0x1b67278 "", query_context = 0x1b68bb0,
  param_offset = -2054491066}
(gdb) p i
$5 = <value optimized out>
(gdb) p query_in_bind_msg
$6 = 0x722240 ""
(gdb) p query_in_bind_msg[i]
value has been optimized out
(gdb) p query_in_bind_msg[0]
$7 = 0 '\000'
(gdb)

t-ishii

2013-10-16 10:20

developer   ~0000345

Thanks for the info. I think I found the cause of the problem. In the first place, the code should not execute if the query cannot be cached. bind_msg->param_offse is set when the query is SELECT. However, in the code path in question does not check if the query can be cached or not. As a result, if the query cannot be cached, the contents of bind_msg->param_offse is garbage. Attached patch should fix the problem. Also, there was a logic bug with the caliculation of bind_msg->param_offse and it was fixed in the patch as well. This should not affect your case however.

t-ishii

2013-10-16 10:20

developer  

pool_proto_modules.c-v2.patch (964 bytes)
diff --git a/pool_proto_modules.c b/pool_proto_modules.c
index c982a3f..0a9e517 100644
--- a/pool_proto_modules.c
+++ b/pool_proto_modules.c
@@ -722,7 +722,7 @@ POOL_STATUS Execute(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
 		/*
 		 * Add bind message's info to query to search.
 		 */
-		if (bind_msg->param_offset && bind_msg->contents)
+		if (query_context->is_cache_safe && bind_msg->param_offset && bind_msg->contents)
 		{
 			/* Extract binary contents from bind message */
 			char *query_in_bind_msg = bind_msg->contents + bind_msg->param_offset;
@@ -1293,7 +1293,7 @@ POOL_STATUS Bind(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend,
 	 */
 	if (query_context->is_cache_safe)
 	{
-		bind_msg->param_offset = sizeof(int) + sizeof(char) * (strlen(portal_name) + strlen(pstmt_name));
+		bind_msg->param_offset = sizeof(char) * (strlen(portal_name) + strlen(pstmt_name) + 2);
 	}
 
 	session_context->uncompleted_message = bind_msg;

glenngerhardt

2013-10-17 06:48

reporter   ~0000346

Both of these changes have been applied to the code and the test has been running for about 6 hours with no crash seen. Previously it would crash in roughly an hour. Looks good, but we will continue to monitor the test. Thanks!

Issue History

Date Modified Username Field Change
2013-10-12 03:17 glenngerhardt New Issue
2013-10-12 03:17 glenngerhardt File Added: pgpool.conf
2013-10-15 17:46 t-ishii Note Added: 0000339
2013-10-15 17:46 t-ishii Assigned To => t-ishii
2013-10-15 17:46 t-ishii Status new => assigned
2013-10-15 17:48 t-ishii File Added: pool_proto_modules.c.patch
2013-10-15 17:48 t-ishii Status assigned => feedback
2013-10-16 02:27 glenngerhardt Note Added: 0000343
2013-10-16 02:27 glenngerhardt Status feedback => assigned
2013-10-16 10:20 t-ishii Note Added: 0000345
2013-10-16 10:20 t-ishii File Added: pool_proto_modules.c-v2.patch
2013-10-16 10:21 t-ishii Status assigned => feedback
2013-10-17 06:48 glenngerhardt Note Added: 0000346
2013-10-17 06:48 glenngerhardt Status feedback => assigned
2013-10-19 12:47 t-ishii Status assigned => resolved