View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000194 | Pgpool-II | Bug | public | 2016-05-06 19:34 | 2016-06-05 17:45 |
| Reporter | harukat | Assigned To | t-ishii | ||
| Priority | normal | Severity | major | Reproducibility | always |
| Status | resolved | Resolution | open | ||
| Product Version | 3.3 | ||||
| Summary | 0000194: "statement timeout" causes kind mismatch errors | ||||
| Description | I checked this in pgpool-II 3.3.10, but I think other versions have the same problem. | ||||
| Steps To Reproduce | $ pgbench -n -h localhost -p 9999 --protocol=extended -c 3 -t 10 -f sleep.sql Client 1 aborted in state 16: ERROR: kind mismatch among backends. Possible last query was: "Parse: END;" kind details are: 0[E: canceling statement due to statement timeout] 1[1] 2[1] HINT: check data consistency among db nodes pgpool-II log -------------- May 6 16:19:18 pg1 pgpool[19398]: pool_send_and_wait: Error or notice message from backend: : DB node id: 0 backend pid: 19488 statement: END; message: canceling statement due to statement timeout May 6 16:19:18 pg1 pgpool[19398]: Parse: Error or notice message from backend: : DB node id: 0 backend pid: 19488 statement: END; message: canceling statement due to statement timeout May 6 16:19:18 pg1 pgpool[19398]: DB node id: 2 backend pid: 8238 statement: Parse: END; May 6 16:19:18 pg1 pgpool[19398]: read_kind_from_backend: 2 th kind 1 does not match with master or majority connection kind E May 6 16:19:18 pg1 pgpool[19398]: kind mismatch among backends. Possible last query was: "Parse: END;" kind details are: 0[E: canceling statement due to statement timeout] 2[1] PostgreSQL log (2) ------------------ May 6 16:19:03 pg1 postgres[19488]: [35-1] [572c4567.4c20-17][0][00000](postgres,postgres,10.10.10.103(47899),pgbench)DEBUG: parse pgpool19398: SELECT current_setting('transaction_isolation'); May 6 16:19:03 pg1 postgres[19488]: [36-1] [572c4567.4c20-18][0][00000](postgres,postgres,10.10.10.103(47899),pgbench)DEBUG: StartTransactionCommand May 6 16:19:03 pg1 postgres[19488]: [37-1] [572c4567.4c20-19][0][00000](postgres,postgres,10.10.10.103(47899),pgbench)DEBUG: bind pgpool19398 to pgpool19398 May 6 16:19:03 pg1 postgres[19488]: [38-1] [572c4567.4c20-20][0][00000](postgres,postgres,10.10.10.103(47899),pgbench)LOG: execute pgpool19398/pgpool19398: SELECT current_setting('transaction_isolation'); May 6 16:19:18 pg1 postgres[19488]: [39-1] [572c4567.4c20-21][0][57014](postgres,postgres,10.10.10.103(47899),pgbench)ERROR: canceling statement due to statement timeout pgpool.conf ----------- master_slave_mode = on load_balance_mode = on debug_level = 1 # (for debug) backend_*0 = ... # (more than 2 backends) backend_*1 = ... backend_*2 = ... postgresql.conf --------------- statement_timeout = 10s log_min_messages = debug3 # (for debug) sleep.sql --------- BEGIN; SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); SELECT pg_sleep(1); END; | ||||
| Additional Information | This patch may fix this bug. "do_query()" doesn't send sync messages. It is a violation of PostgreSQL protocol. $ diff -c pool_process_query.c.ORG pool_process_query.c *** pool_process_query.c.ORG 2016-04-26 12:41:36.000000000 +0900 --- pool_process_query.c 2016-05-06 18:28:07.400598842 +0900 *************** *** 2551,2556 **** --- 2551,2563 ---- pool_write(backend, &len, sizeof(len)); /* + * Send sync message + */ + pool_write(backend, "S", 1); + len = htonl(4U); + pool_write_and_flush(backend, &len, sizeof(len)); + + /* * Send close portal message */ pool_write(backend, "C", 1); | ||||
| Tags | No tags attached. | ||||
|
|
The proposed fix breaks some existing cases. Sync will close implicit transaction and it will break unnamed portal which might have been created by a user query. BTW, I don't understand why pgpool does a "violation of PostgreSQL protocol". Why do you think so? |
|
|
> BTW, I don't understand why pgpool does a "violation of PostgreSQL protocol". Why do you think so? The document says "At completion of each series of extended-query messages, the frontend should issue a Sync message." http://www.postgresql.org/docs/9.4/static/protocol-flow.html#PROTOCOL-FLOW-EXT-QUERY The server cannot recognize the finish of the statement execution for lack of Sync. One "DEBUG: CommitTransactionCommand" message should be out for one "DEBUG: StartTransactionCommand" message in PostgreSQL log, but it isn't out in this case. |
|
|
You misunderstand the document. It merely says "should not", not "must not". Doing something "should not" is not a violation against the protocol. In fact, JDBC driver happily sends sequence of commands "PARSE,BIND,EXECUTE,PARSE,BIND,EXECUTE..." without a sync message. |
|
|
I read the git blame and understood there was a reason not to throw Sync. BTW, JDBC driver throws 'S' just after "execute" properly, avoids this problem. { conn.setAutoCommit(false); PreparedStatement pst1 = conn.prepareStatement(sql1); pst1.setInt(1, 1); pst1.execute(); try{ Thread.sleep(20 * 1000); } catch(InterruptedException e){} /* other commands */ conn.commit(); } 11:22:49.576251 IP 127.0.0.1.55776 > 127.0.0.1.postgres: Flags [P.], seq 171:296, ack 346, win 521, options [nop,nop,TS val 1901774517 ecr 1901771423], length 125 0x0000: 4500 00b1 7a27 4000 4006 c21d 7f00 0001 E...z'@.@....... 0x0010: 7f00 0001 d9e0 1538 777e c121 37a1 c3ab .......8w~.!7... 0x0020: 8018 0209 fea5 0000 0101 080a 715a c6b5 ............qZ.. 0x0030: 715a ba9f 5000 0000 0d00 4245 4749 4e00 qZ..P.....BEGIN. 0x0040: 0000 4200 0000 0c00 0000 0000 0000 0045 ..B............E 0x0050: 0000 0009 0000 0000 0050 0000 002a 0053 .........P...*.S 0x0060: 454c 4543 5420 7620 4652 4f4d 2074 3120 ELECT.v.FROM.t1. 0x0070: 5748 4552 4520 6964 203d 2024 3100 0001 WHERE.id.=.$1... 0x0080: 0000 0017 4200 0000 1600 0000 0100 0100 ....B........... 0x0090: 0100 0000 0400 0000 0100 0044 0000 0006 ...........D.... 0x00a0: 5000 4500 0000 0900 0000 0000 5300 0000 P.E.........S... 0x00b0: 04 . |
|
|
That's just an one of examples of JDBC driver's behaviors. There are number of counter examples. See http://www.pgpool.net/mantisbt/view.php?id=167 for a counter example. |
|
|
|
|
|
Here is a new fix proposal. I attached "pgpool-II-3.3.10.bug194.patch". It makes pgpool throw "SELECT current_setting('transaction_isolation')" to the load balance node rather than master or primary. |
|
|
This breaks the case "SET transaction_isolation TO 'serializable'". Pgpool-II sends this to always primary, and it wants to know the transaction isolation mode on primary. Moreover, "SELECT current_setting('transaction_isolation')" is not the only command which do_query() sends. |
|
|
BTW I think you misunderstand the timing when the statement_time out fires. It is fired in ProcessInterrupts(), not at the end of transaction. |
|
|
> This breaks the case "SET transaction_isolation TO 'serializable'". Yes, I was misunderstanding. This doesn't work. > Moreover, "SELECT current_setting('transaction_isolation')" is > not the only command which do_query() sends The next SQL execution of most of other "do_query()" comes soon, so lack of Sync will not make problem. I study the following soutions now. (A) avoid "ERROR: canceling statement due to statement timeout" occurs in PostgreSQL. (A1) throw Sync at the special case in do_query(). (B) avoid pgpool-II takes such ERROR at the end of transaction. (B1) ignore ERROR response from backends at the special case. |
|
|
|
|
|
"pgpool-II-3.3.10.bug194-fix2.patch" is the implementation of (A1). Changes are followings: * add do_query_ex() based on do_query() that can throw a sync message * use do_query_ex() instead of do_query() in pool_get_transaction_isolation() pool_get_transaction_isolation() saves the return value into session_context's variable, so it calls do_query() first time only. This means that breaking unnamed portal is harmless here. |
|
|
What if the script is something like: BEGIN; SELECT pg_sleep(1); SELECT * FROM tiny_table; : : Surely the first SELECT pg_sleep(1) is not suffered by the statement time out by applying your patch but the second SELECT will be timed out because it causes do_query() gets called. |
|
|
I think the following adjustments are needed totally for this issue. Where do_query() is used and status for this issue: - main process / worker_child .. no problem - insert lock .................. no problem, next SQL throws Sync soon - rewrite timestamp ............ no problem, next SQL throws Sync soon - rewrite lo_creat ............. no problem, next SQL throws Sync soon - memqcache(get database oid) .. no problem, next SQL throws Sync soon - relcache ..................... need fix, do_query on load balance node (pgpool-II-3.3.10.bug194-fix2.patch) - get_transaction_isolation .... need fix, do_query with Sync (pgpool-II-3.3.10.bug194-relcache.patch) |
|
|
|
|
|
You didn't answer my question for the example case above. I don't buy the idea "let issue sync message in do_query" approach because it will easily break some working cases. I don't want to apply such a fragile solution to pgpool-II. Plus, even if we could solve the problem using the approach you proposed, we will have yet another "kind mismatch error" if the statement time out actually raises. Probably the only workable workaround is, if users want to use statement time out, let the load balance to be off. |
|
|
> You didn't answer my question for the example case above. pgpool-II-3.3.10.bug194-fix2.patch and pgpool-II-3.3.10.bug194-relcache.patch have solved it. > Plus, even if we could solve the problem using the approach you proposed, > we will have yet another "kind mismatch error" if the statement > time out actually raises. If the worst thing that the patch causes is only yet another "kind mismatch error", I think this fix has merit to reduce the error. |
|
|
The pgpool-II-3.3.10.bug194-relcache.patch is broken. - It make a query to other than primary node (in streaming replication mode) which makes break some cases (consider temp tables) - Bad coding style. Pgpool-II follows PostgreSQL coding style. BTW, currently I am working on different ideas. Will post some patches while I am in PGCon. |
|
|
|
|
|
Ok, "bug194-3.3.diff" tries to mitigate the problem, at least when user query is in an explicit transaction. If statement timeout is enabled on backend and do_query() sent a query to primary node, and all of following user queries are sent to standby, it is possible that the next command, for example END, could cause a statement timeout error on the primary, and a kind mismatch error on pgpool-II. This fix solves the case by sending sync message instead of flush message because the sync message reset the statement timeout timer. We cannot use this technique for implicit transaction case, because the sync message removes the unnamed portal if there's any. |
|
|
Patch bug194-3.3.diff is seem to clear up the kind mismatch error in the first reported case, but disconnections occur in the other node's role formation. This disconnection can occur without this patch, so I think bug194-3.3.diff is still useful. db1=# show pool_nodes; node_id | hostname | port | status | lb_weight | role ---------+----------+------+--------+-----------+--------- 0 | pg1 | 5432 | 2 | 0.333333 | standby 1 | pg2 | 5432 | 2 | 0.333333 | primary 2 | pg3 | 5432 | 2 | 0.333333 | standby (3 rows) [postgres@pg1 ~]$ pgbench -n -h localhost -p 9999 --protocol=extended -c 3 -t 20 -f sleeps.sql db1 Client 0 aborted in state 1. Probably the backend died while processing. Client 1 aborted in state 1. Probably the backend died while processing. Client 2 aborted in state 1. Probably the backend died while processing. transaction type: Custom query scaling factor: 1 query mode: extended number of clients: 3 number of threads: 1 number of transactions per client: 20 number of transactions actually processed: 3/60 pgpool-II: May 20 16:41:17 pg1 pgpool[11403]: send_to_where: 2 query: SELECT pg_sleep(1); May 20 16:41:17 pg1 pgpool[11403]: do_query: extended:1 query:SELECT current_setting('transaction_isolation') May 20 16:41:17 pg1 pgpool[11403]: do_query: kind: E May 20 16:41:17 pg1 pgpool[11403]: do_query: error message from backend: canceling statement due to statement timeout. Exit this session. May 20 16:41:17 pg1 pgpool[11393]: reap_handler called May 20 16:41:17 pg1 pgpool[11393]: reap_handler: call wait3 May 20 16:41:17 pg1 pgpool[11393]: child 11403 exits with status 256 PostgreSQL: May 20 16:41:01 pg1 postgres[11415]: [3-1] [573ebf8d.2c97-1][0][00000](postgres,db1,10.10.10.103(34853),pgbench)LOG: execute pgpool11403/pgpool11403: SELECT current_setting('transaction_isolation') May 20 16:41:01 pg1 postgres[11416]: [3-1] [573ebf8d.2c98-1][0][00000](postgres,db1,10.10.10.103(34858),pgbench)LOG: execute pgpool11400/pgpool11400: SELECT current_setting('transaction_isolation') -snip- May 20 16:41:17 pg1 postgres[11415]: [4-1] [573ebf8d.2c97-2][0][57014](postgres,db1,10.10.10.103(34853),pgbench)ERROR: canceling statement due to statement timeout May 20 16:41:17 pg1 postgres[11416]: [4-1] [573ebf8d.2c98-2][0][57014](postgres,db1,10.10.10.103(34858),pgbench)ERROR: canceling statement due to statement timeout |
|
|
I have committed bug194-3.3.diff into all supported branches. The newer bug you found is surely different from the one bug194-3.3 fixes. I will come up with a fix soon. |
|
|
Ok, I have committed fix to deal with the problem. http://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=bc01f4f7b1fe9c18fe4411b42b29b64f17eb387c |
|
|
Thank you. I confirmed that the failures were gone in V3_3_STABLE head and V3_4_STABLE head on various nodes' formations. |
|
|
Thank you for your cooperation! I marked this issue as "resolved". |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2016-05-06 19:34 | harukat | New Issue | |
| 2016-05-06 20:19 | t-ishii | Assigned To | => t-ishii |
| 2016-05-06 20:19 | t-ishii | Status | new => assigned |
| 2016-05-06 20:19 | t-ishii | Steps to Reproduce Updated | |
| 2016-05-06 20:39 | t-ishii | Note Added: 0000795 | |
| 2016-05-09 12:13 | harukat | Note Added: 0000798 | |
| 2016-05-09 12:40 | t-ishii | Note Added: 0000799 | |
| 2016-05-10 12:19 | harukat | Note Added: 0000800 | |
| 2016-05-10 12:34 | t-ishii | Note Added: 0000801 | |
| 2016-05-11 15:42 | harukat | File Added: pgpool-II-3.3.10.bug194.patch | |
| 2016-05-11 15:52 | harukat | Note Added: 0000806 | |
| 2016-05-11 16:33 | t-ishii | Note Added: 0000807 | |
| 2016-05-11 16:36 | t-ishii | Note Added: 0000808 | |
| 2016-05-12 10:28 | harukat | Note Added: 0000809 | |
| 2016-05-12 12:13 | harukat | File Added: pgpool-II-3.3.10.bug194-fix2.patch | |
| 2016-05-12 17:44 | harukat | Note Added: 0000811 | |
| 2016-05-13 10:05 | t-ishii | Note Added: 0000813 | |
| 2016-05-13 14:17 | harukat | Note Added: 0000814 | |
| 2016-05-13 14:18 | harukat | File Added: pgpool-II-3.3.10.bug194-relcache.patch | |
| 2016-05-13 15:53 | t-ishii | Note Added: 0000815 | |
| 2016-05-15 17:18 | harukat | Note Added: 0000817 | |
| 2016-05-16 09:21 | t-ishii | Note Added: 0000818 | |
| 2016-05-19 12:09 | t-ishii | File Added: bug194-3.3.diff | |
| 2016-05-19 12:13 | t-ishii | Note Added: 0000830 | |
| 2016-05-19 12:14 | t-ishii | Status | assigned => feedback |
| 2016-05-23 11:33 | harukat | Note Added: 0000837 | |
| 2016-05-23 11:33 | harukat | Status | feedback => assigned |
| 2016-05-24 23:47 | t-ishii | Note Added: 0000840 | |
| 2016-05-25 11:29 | t-ishii | Note Added: 0000841 | |
| 2016-05-25 11:30 | t-ishii | Status | assigned => feedback |
| 2016-06-02 14:47 | harukat | Note Added: 0000845 | |
| 2016-06-02 14:47 | harukat | Status | feedback => assigned |
| 2016-06-05 17:45 | t-ishii | Note Added: 0000848 | |
| 2016-06-05 17:45 | t-ishii | Status | assigned => resolved |