View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000318 | Pgpool-II | Bug | public | 2017-07-03 17:35 | 2017-08-29 09:28 |
| Reporter | avi | Assigned To | pengbo | ||
| Priority | urgent | Severity | block | Reproducibility | always |
| Status | closed | Resolution | open | ||
| Product Version | 3.6.4 | ||||
| Summary | 0000318: Exception in plpgsql function is not passed to application | ||||
| Description | Hi If exception occurs inside plpgsql function the application fails to get the exception from pgpool and it keeps waiting forever. Please note that in postgres, the connection is marked as "idle in transaction (aborted)". This bug is not reproduced with version 3.6.1. Only version 3.6.3 and 3.6.4 has it. In addition if the slave machine is down and we work only with master, the bug is not reproduced. If we bypass pgpool and connect directly to port 5432 it works. P.S. I saw differences in sniffer captures between good case (error is reported immediately) vs “bad” case (error is not reported) – It seems there is missing one packet in “bad” case (packets 0000045-46 in “pgpool_3_6_1.pcap” vs packet#46 in “pgpool_3_6_4.pcap”). Versions: pgpool: 3.6.4 PostgreSQL: “PostgreSQL 9.6.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 4.4.7 20120313 (Red Hat 4.4.7-17), 64-bit” JDBC PG Driver: 42.1.1 or 9.4.1212.jre7 (in case of NMS + Hibernate4 + Spring4 transaction manager) Java8 | ||||
| Steps To Reproduce | Please use the attached unit test | ||||
| Tags | No tags attached. | ||||
|
|
|
|
|
Greetings, Can you collect gdb backtraces of the pgpool process and postgres backend? |
|
|
I cannot reproduce the problem by using pgproto. pgproto -f bug318.data -p 11000 -d test FE=> Parse(stmt="S1", query="BEGIN") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Parse(stmt="S1", query="SELECT count(*) as dummy from bar()") FE=> Bind(stmt="S1", portal="") FE=> Describe(stmt="S1") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Parse(stmt="S1", query="COMMIT") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE BindComplete <= BE CommandComplete(BEGIN) <= BE CloseComplete <= BE ParseComplete <= BE BindComplete <= BE ParameterDescription <= BE RowDescription <= BE NoticeResponse(S NOTICE V NOTICE C 00000 M AlexL, will perfrom DibByZero W PL/pgSQL function bar() line 2 at RAISE F pl_exec.c L 3165 R exec_stmt_raise ) <= BE ErrorResponse(S ERROR V ERROR C 22012 M division by zero W SQL statement "select 1/0" PL/pgSQL function bar() line 3 at EXECUTE F int.c L 719 R int4div ) <= BE ReadyForQuery(E) FE=> Terminate t-ishii@localhost: pgpool --version pgpool-II version 3.6.4 (subaruboshi) |
|
|
Hi If I execute the same "SELECT count(*) as dummy from bar()" command from psql -h vip_ip -p 9999 it works. However, when I use the sample application with the attached unit test it get stuck 20 -30 percent of the times (it also fails when I execute a simple "select 1/0"). I assume pgproto is more similar to psql than to the sample unit-test therefore I think it will be more beneficial to simulate if using the sample unit-test. If you want to connect to us it is also possible |
|
|
pgproto (https://github.com/tatsuo-ishii/pgproto) can execute either simple queries (like psql) and extended queries (like Java). The test case is for an extended query thus essentially it looks same from Pgpool-II's point of view. |
|
|
Hi Can you please send me the bug318.data file that you used. Is it being written manually or it can be auto generated in some way? In addition why did you use port 11000 and not 9999 Thanks |
|
|
Hi The following is the bug318.data file I used to get it stuck (1 in about 10 attempts got stuck). (If this is not the proper commands sequence please send me the file you would like me to use.) 'P' "S1" "SELECT 1/0" 0 'B' "" "S1" 0 0 0 'E' "" 0 'C' 'S' "S1" # Issue Sync message 'S' # Receive response from backend 'Y' # Send terminate message 'X' The following is the data I got when it completed [root@h2-nms trunk]# pgproto -f bug318.data -p 9999 -h172.18.255.40 -u postgres -d test_db FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE ErrorResponse(S ERROR V ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ReadyForQuery(I) FE=> Terminate [root@h2-nms trunk]# pgproto -f bug318.data -p 9999 -h172.18.255.40 -u postgres -d test_db FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE ErrorResponse(S ERROR V ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ReadyForQuery(I) FE=> Terminate [root@h2-nms trunk]# pgproto -f bug318.data -p 9999 -h172.18.255.40 -u postgres -d test_db FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE ErrorResponse(S ERROR V ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ReadyForQuery(I) FE=> Terminate [root@h2-nms trunk]# pgproto -f bug318.data -p 9999 -h172.18.255.40 -u postgres -d test_db FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE ErrorResponse(S ERROR V ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ReadyForQuery(I) FE=> Terminate The following is the data I got when it got stuck !!!! [root@h2-nms trunk]# pgproto -f bug318.data -p 9999 -h172.18.255.40 -u postgres -d test_db FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE (S) <= BE (S) <= BE (S) <= BE (S) <= BE (S) <= BE (S) <= BE (S) <= BE (S) <= BE (S) |
|
|
I will try to reproduce the bug. |
|
|
Hi Just to give a little more info If I execute pg_terminate_backend on the idle in transaction (aborted) connection I then get the stuck "division by zero" error message + some EOF message <= BE (S) <= BE ErrorResponse(S ERROR V ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ErrorResponse(S FATAL C XX000 M connection to postmaster on DB node 0 was lost due to pg_terminate_backend D pg_terminate_backend was called on the backend F pool_process_query.c L 4650 ) read_char: EOF detected[root@h2-nms trunk]# If I wait for very long time without initiating pg_terminate_backend I eventually get the following (I do not know how long since it was overnight) <= BE (S) <= BE (S) read_char: EOF detected[root@h2-nms trunk]# |
|
|
I used pgpool_setup to execute the following command more than 20 times, but I couldn't reproduce the bug. [pengbo@localhost pgpoolsetup]$ pgproto -f bug318.data -p 11000 -u pengbo -d postgres FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE ErrorResponse(S ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ReadyForQuery(I) FE=> Terminate [pengbo@localhost pgpoolsetup]$ pgproto -f bug318.data -p 11000 -u pengbo -d postgres FE=> Parse(stmt="S1", query="SELECT 1/0") FE=> Bind(stmt="S1", portal="") FE=> Execute(portal="") FE=> Close(stmt="S1") FE=> Sync <= BE ParseComplete <= BE ErrorResponse(S ERROR C 22012 M division by zero F int.c L 719 R int4div ) <= BE ReadyForQuery(I) FE=> Terminate |
|
|
The following is a short shell script to run pgproto in a loop. It fails all the time on all our setups. Please note that it fails on setup running pgpool 3.6.4 with replication master+slave and load balancing. If you have setup with only master (slave is down) you will not see the bug. Please run the sh script and see if it manage to finish the loop without error. #!/bin/bash File="./run_test.log" for ((n=0; n<10000; n++)) do ( echo "iteration $n " pgproto -f bug318.data -p 9999 -h172.18.255.40 -u postgres -d test_db > $File; ) 2>&1 | awk '{ print strftime("%Y-%m-%d %H:%M:%S:"), $0; fflush(); }' | tee -a "$File" if grep -q Terminate "$File"; then echo "ok" else echo "Bug" exit fi done --------------------------------------------------------------------------------------------------------------- The following is a bug318.data used by the batch script (which calls pgproto) 'P' "S1" "SELECT 1/0" 0 'B' "" "S1" 0 0 0 'E' "" 0 'C' 'S' "S1" # Issue Sync message 'S' # Receive response from backend 'Y' # Send terminate message 'X' |
|
|
We do experience the same issue (ERRORs hanging a frontend connection instead of being reported) when Pgpool is balancing over two streaming slaves with their master. I can somewhat reliably reproduce it by running simple error-throwing statements via a JDBC-using command line tool called "sqlline" like below. Running git-bisect yielded commit 7355ebb4d as the first failing one. ,----[ sqlline 3.6.5 or master ] | $ sqlline -n postgres -p "" -u 'jdbc:postgresql://localhost:19999/postgres?loggerLevel=TRACE&loggerFile=/home/andreas/jdbc.log&prepareThreshold=1' | Connecting to jdbc:postgresql://localhost:19999/postgres?loggerLevel=TRACE&loggerFile=/home/andreas/jdbc.log&prepareThreshold=1 | Connected to: PostgreSQL (version 9.6.3) | Driver: PostgreSQL Native Driver (version PostgreSQL 9.4.1212) | Autocommit status: true | Transaction isolation: TRANSACTION_REPEATABLE_READ | sqlline version 1.0.2 by Marc Prud'hommeaux | 0: jdbc:postgresql://localhost:19999/postgres> create table t(c int primary key); | No rows affected (0,034 seconds) | 0: jdbc:postgresql://localhost:19999/postgres> insert into t values(1); | 1 row affected (0,034 seconds) | 0: jdbc:postgresql://localhost:19999/postgres> insert into t values(1); | -- pgpool connection hangs here instead of reporting the unique violation `---- Errors reported correctly in 3.6.1. ,----[ 3.6.1 ] | 0: jdbc:postgresql://localhost:19999/postgres> insert into t values(1); | Error: ERROR: duplicate key value violates unique constraint "t_pkey" | Detail: Key (c)=(1) already exists. (state=23505,code=0) | org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique constraint "t_pkey" `---- |
|
|
ansel, Thank you for the analysis. I also noticed the same thing and created a patch. If you could try with attached patch and see if the problem is solved, that would be very helpful for me. If the problem is not solved, please run pgpool with debug option (-d) and share the pgpool log file. |
|
|
With the patch, I successfully runs sqlline with your example. t-ishii@localhost: JAVA_CLASSPATH=/usr/local/pgsql/share//postgresql-9.2-1003.jdbc4.jar sqlline -n t-ishii -p "" -u 'jdbc:postgresql://localhost:11000/test?&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&prepareThreshold=1' [warning] /usr/bin/sqlline: Unable to locate postgresql-jdbc4 in /usr/share/java [warning] /usr/bin/sqlline: Unable to locate mysql-connector-java in /usr/share/java [warning] /usr/bin/sqlline: Unable to locate jtds in /usr/share/java Connecting to jdbc:postgresql://localhost:11000/test?&loggerLevel=TRACE&loggerFile=/tmp/jdbc.log&prepareThreshold=1 Connected to: PostgreSQL (version 9.6.3) Driver: PostgreSQL Native Driver (version PostgreSQL 9.2 JDBC4 (build 1003)) Autocommit status: true Transaction isolation: TRANSACTION_REPEATABLE_READ sqlline version 1.0.2 by Marc Prud'hommeaux 0: jdbc:postgresql://localhost:11000/test> drop table t; drop table t; No rows affected (0.004 seconds) 0: jdbc:postgresql://localhost:11000/test> create table t(c int primary key); create table t(c int primary key); No rows affected (0.006 seconds) 0: jdbc:postgresql://localhost:11000/test> insert into t values(1); insert into t values(1); 1 row affected (0.004 seconds) 0: jdbc:postgresql://localhost:11000/test> insert into t values(1); insert into t values(1); Error: ERROR: duplicate key value violates unique constraint "t_pkey" Detail: Key (c)=(1) already exists. (state=23505,code=0) 0: jdbc:postgresql://localhost:11000/test> |
|
|
Thanks for the quick response. I can confirm that the patch fixes the issue for me when applied on top of master. pgpool generates the following LOG message now in this situation: 2017-07-26 08:05:26: pid 21769: LOG: Process backend response: sync pending message found after receiving error response I'm attaching the debug output between two of those messages. |
|
|
Ansel, Thank for you for testing the patch. I will commit the patch to master, 3.6 and 3.5 stable trees. > pgpool generates the following LOG message now in this situation: This is normal and I think we should degrade the message to DEBUG1 or so. > I'm attaching the debug output between two of those messages. Thank you for the log. I will look into it so that I can confirm my patch works as expected in your environment. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-07-03 17:35 | avi | New Issue | |
| 2017-07-03 17:35 | avi | File Added: idle_in_transaction_aborted.zip | |
| 2017-07-04 04:48 | drrtuy | Note Added: 0001566 | |
| 2017-07-04 10:35 | t-ishii | Note Added: 0001567 | |
| 2017-07-04 16:39 | avi | Note Added: 0001570 | |
| 2017-07-04 16:50 | t-ishii | Note Added: 0001571 | |
| 2017-07-04 20:47 | avi | Note Added: 0001573 | |
| 2017-07-04 22:24 | avi | Note Added: 0001575 | |
| 2017-07-05 15:39 | pengbo | Assigned To | => pengbo |
| 2017-07-05 15:39 | pengbo | Status | new => assigned |
| 2017-07-05 15:42 | pengbo | Note Added: 0001576 | |
| 2017-07-06 15:22 | avi | Note Added: 0001577 | |
| 2017-07-07 16:53 | pengbo | File Added: pgpool.conf | |
| 2017-07-07 16:53 | pengbo | Note Added: 0001579 | |
| 2017-07-09 15:59 | avi | Note Added: 0001580 | |
| 2017-07-25 22:57 | ansel | Note Added: 0001603 | |
| 2017-07-26 07:50 | t-ishii | File Added: pgpool-hung.diff | |
| 2017-07-26 07:50 | t-ishii | Note Added: 0001605 | |
| 2017-07-26 09:34 | t-ishii | Note Added: 0001606 | |
| 2017-07-26 15:11 | ansel | File Added: sync-pending-message.txt | |
| 2017-07-26 15:11 | ansel | Note Added: 0001607 | |
| 2017-07-26 18:13 | t-ishii | Note Added: 0001608 | |
| 2017-08-29 09:28 | pengbo | Status | assigned => closed |