View Issue Details

IDProjectCategoryView StatusLast Update
0000060Pgpool-IIBugpublic2013-07-02 17:44
ReporteremhnAssigned Tot-ishii 
PrioritynormalSeveritymajorReproducibilityalways
Status resolvedResolutionopen 
Platformpgpool-II 3.1.3OSLinux amd64OS VersionDebian 7.0
Product Version 
Target VersionFixed in Version 
Summary0000060: Exceptions on COMMIT cause abnormal pgpool process termination and disconnects
DescriptionWe detected this bug using pgpool-II 3.1.3 as shipped by Debian GNU/Linux 7.0.

We have streaming replication working between a master and a single slave using PostgreSQL 9.1.9. The database has several DEFERRED constraints. A transaction starts to perform a sequence of operations that will violate those constraints. When the transaction attempts to COMMIT, some deferred constraint will raise an exception on the master node as expected, but it won't raise it on the slave node. Pgpool interprets this as a "kind error" since it got conflicting answers from both nodes, and abnormally shuts down the process, breaking the connection with the client application.

The expected behavior would be to effectively pass the exception to the caller, as if the exception was triggered by a statement within the transaction, and fthe pgpool process should continue working normally.

We haven't tried newer versions of pgpool-II, as we must keep on using pgpool-II 3.1.3 as shipped by Debian for ease of operation and deployment. We'll file a bug on Debian's BTS as soon as we have resolution with pgpool, to ask for an upgrade if possible.
Steps To ReproduceProvided you have set up streaming replication between two 9.1 databases, you'll find a tarball attached containing three scripts written by Luis Muñoz <lem@isc.org> that help reproduce the bug:

1. database-setup.sql will create a table with a deferred constraint
   enforced by a trigger that simply rises and exception. This simulates
   the otherwise more complex setup we have, that just raises and
   exception on commit.

2. bug.sql should be run through pgpool to reproduce the bug. The script
   performs a simple statement that will raise the exception on COMMIT.
   The pgpool daemon will die abnormally because the master will answer
   with an error while the slave won't.

3. database-clean.sql will clean-up the objects created by database-setup.sql.

We can always reproduce the problem, with output similar to

psql (9.1.9)
Type "help" for help.
    
test=# \i bug.sql
BEGIN
INSERT 0 1
psql:bug.sql:12: ERROR: kind mismatch among backends. Possible last query was: "commit;" kind details are: 0[E: some integrity violation] 1[C]
HINT: check data consistency among db nodes
ERROR: kind mismatch among backends. Possible last query was: "commit;" kind details are: 0[E: some integrity violation] 1[C]
HINT: check data consistency among db nodes
psql:bug.sql:12: connection to server was lost
Additional InformationThe same tarball has a patch written by Francisco Obispo <fobispo@isc.org> that was applied to our pgpool-II 3.1.3 installation. When using the aforementioned test scripts on the patched installation, we get

psql (9.1.9)
Type "help" for help.
    
test=# \i bug.sql
BEGIN
INSERT 0 1
psql:bug.sql:12: ERROR: some integrity violation
test=# \q

as expected. The original application that led us to the discovery of the bug also works as expected.
TagsNo tags attached.

Activities

emhn

2013-05-31 22:48

reporter  

fails-on-commit.tgz (1,389 bytes)

t-ishii

2013-06-02 20:44

developer   ~0000290

Problem confirmed. However attached patch cannot be applied since it has some problems: 1) does not address the case when a transaction is committed by "end" command(even does not work if " commit" for example) 2) it will not work for the case there's more than 1 standbys. I have committed modified patch to git repository. Just in case I attached patch for V3.1 stable tree.

t-ishii

2013-06-02 20:45

developer  

pool_process_query.c-v3.1.patch (2,506 bytes)
diff --git a/pool_process_query.c b/pool_process_query.c
index c7c6bc7..8bb25ca 100644
--- a/pool_process_query.c
+++ b/pool_process_query.c
@@ -77,6 +77,7 @@ static bool pool_has_insert_lock(void);
 static POOL_STATUS add_lock_target(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char* table);
 static bool has_lock_target(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char* table, bool for_update);
 static POOL_STATUS insert_oid_into_insert_lock(POOL_CONNECTION *frontend, POOL_CONNECTION_POOL *backend, char* table);
+static bool is_all_slaves_command_complete(unsigned char *kind_list, int num_backends, int master);
 
 /* timeout sec for pool_check_fd */
 static int timeoutsec;
@@ -3674,6 +3675,27 @@ POOL_STATUS read_kind_from_one_backend(POOL_CONNECTION *frontend, POOL_CONNECTIO
 }
 
 /*
+ * returns true if all slaves status are 'C' (Command Complete)
+ */
+static bool is_all_slaves_command_complete(unsigned char *kind_list, int num_backends, int master)
+{
+	int i;
+	int ok = true;
+
+	for (i=0;i<num_backends;i++)
+	{
+		if (i == master || kind_list[i] == 0)
+			continue;
+		if (kind_list[i] != 'C')
+		{
+			ok = false;
+			break;
+		}
+	}
+	return ok;
+}
+		
+/*
  * read_kind_from_backend: read kind from backends.
  * the "frontend" parameter is used to send "kind mismatch" error message to the frontend.
  * the out parameter "decided_kind" is the packet kind decided by this function.
@@ -3784,7 +3806,23 @@ POOL_STATUS read_kind_from_backend(POOL_CONNECTION *frontend, POOL_CONNECTION_PO
 		 * not all backends agree with kind. We need to do "decide by majority"
 		 */
 
-		if (max_count <= NUM_BACKENDS / 2.0)
+		/*
+		 * However here is a special case. In master slave mode, if
+		 * master gets an error at commit, while other slaves are
+		 * normal at commit, we don't need to degenrate any backend
+		 * because it is likely that the error was caused by a
+		 * deferred trigger.
+		 */
+		if (MASTER_SLAVE && query_context->parse_tree &&
+			is_commit_query(query_context->parse_tree) &&
+			kind_list[MASTER_NODE_ID] == 'E' &&
+			is_all_slaves_command_complete(kind_list, NUM_BACKENDS, MASTER_NODE_ID))
+		{
+			*decided_kind = kind_list[MASTER_NODE_ID];
+			pool_log("read_kind_from_backend: do not degenerate because it is likely caused by a delayed commit");
+			return POOL_CONTINUE;
+		}
+		else if (max_count <= NUM_BACKENDS / 2.0)
 		{
 			/* no one gets majority. We trust master node's kind */
 			trust_kind = kind_list[MASTER_NODE_ID];

emhn

2013-07-02 03:00

reporter   ~0000300

I backported your patch into a private Debian package for 3.1.3 and it's been working fine so far. Thanks!

Is this patch going to make it into 3.1.8? If so, what's the expected date of release? I'm working with the official Debian maintainer to see if we can refresh the package to the latest one.

t-ishii

2013-07-02 08:04

developer   ~0000301

> Is this patch going to make it into 3.1.8?

Yes.

> If so, what's the expected date of release?

Not decided yet but probably this week or next week.

Issue History

Date Modified Username Field Change
2013-05-31 22:48 emhn New Issue
2013-05-31 22:48 emhn File Added: fails-on-commit.tgz
2013-06-02 18:41 t-ishii Assigned To => t-ishii
2013-06-02 18:41 t-ishii Status new => assigned
2013-06-02 20:44 t-ishii Note Added: 0000290
2013-06-02 20:45 t-ishii File Added: pool_process_query.c-v3.1.patch
2013-06-02 20:46 t-ishii Status assigned => feedback
2013-06-02 21:01 t-ishii Changeset attached => pgpool2 master e7e35046
2013-06-02 21:01 t-ishii Changeset attached => pgpool2 V3_1_STABLE 6d8c6d26
2013-06-02 21:01 t-ishii Changeset attached => pgpool2 V3_0_STABLE 865d7690
2013-07-02 03:00 emhn Note Added: 0000300
2013-07-02 03:00 emhn Status feedback => assigned
2013-07-02 08:04 t-ishii Note Added: 0000301
2013-07-02 17:44 t-ishii Status assigned => resolved