[Pgpool-hackers] pgpool-II ideas

Sat Jun 16 14:54:16 UTC 2007

Sorry for the delay. We (the pgpool team at SRA OSS) have been
discussing your ideas.

> I have a few ideas about possible improvements to
> PgPool-II. These are just some thoughts, and I'd like
> to know how possible or useful they would be.
> 
> 1. Adding support for 2PC in replication mode
> 
> Transform all transactions to be of the form:
> BEGIN;
> -- statements go here
> 
> -- pgpool.transaction is a table that exists on all nodes
> INSERT INTO pgpool.transaction(id)
> VALUES('some_unique_id');
> PREPARE TRANSACTION 'some_unique_id';
> -- wait for all nodes to prepare
> 
> COMMIT PREPARED 'some_unique_id';
> 
> executed on all nodes. If any one node fails to
> pre-commit, it can be dropped and an alert raised. If
> the pgpool server crashes and dies, we can bring up
> all the nodes and find any prepared (but not committed)
> transactions and, if its ID matches any record in
> pgpool.transaction on any node, issue a COMMIT
> PREPARED 'ID'. If no node has a matching transaction
> committed, we ROLLBACK PREPARED 'ID'. This should
> bring all the nodes into sync, and pgpool can resume
> normal operation without any data loss.
> 
> Currently, pgpool-II is safe from any of the nodes
> failing, but unsafe if pgpool-II itself crashes, if I
> understand correctly. This would fix that problem, I
> think.
> 
> This could be a configurable option, since it will
> hurt write performance. I think it would need to be
> combined with strict-enough write ordering to prevent
> deadlocks.
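
The recovery rule quoted above can be sketched as a small decision
function. This is only an illustration under assumptions: the function
name and the dict-of-sets inputs are hypothetical, and in a real setup
the prepared IDs would have to be collected from each node (for example
from PostgreSQL's pg_prepared_xacts view) and the committed IDs from the
proposed pgpool.transaction table.

```python
def resolve_prepared(prepared_by_node, committed_ids_by_node):
    """Pick COMMIT PREPARED or ROLLBACK PREPARED for each dangling transaction.

    prepared_by_node:      node name -> set of IDs still prepared on that node
    committed_ids_by_node: node name -> set of IDs visible in pgpool.transaction
                           (a row is visible only once its transaction committed)
    Both structures are hypothetical inputs for this sketch.
    """
    # An ID committed on any node means the transaction must be committed everywhere.
    committed_anywhere = set().union(*committed_ids_by_node.values())

    actions = []
    for node, gids in sorted(prepared_by_node.items()):
        for gid in sorted(gids):
            verb = "COMMIT" if gid in committed_anywhere else "ROLLBACK"
            actions.append((node, f"{verb} PREPARED '{gid}'"))
    return actions


if __name__ == "__main__":
    prepared = {"node0": set(), "node1": {"tx1"}, "node2": {"tx2"}}
    committed = {"node0": {"tx1"}, "node1": set(), "node2": set()}
    for node, sql in resolve_prepared(prepared, committed):
        print(node, sql)
```

In the example, tx1 is already committed on node0, so node1 finishes it
with COMMIT PREPARED, while tx2 was committed nowhere and is rolled back.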

I think the essential problem here is that pgpool has no information
about node status (which nodes have committed the transaction and
which have not) that survives a pgpool crash. Once we have such
status, pgpool could keep the nodes consistent even without 2PC, i.e.
by detaching the victim nodes.

So what we really need is a durable node commit status, which I think
is very similar to pg_clog in PostgreSQL.
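
One possible shape for such a durable status, as a rough sketch only: an
append-only log that is fsync'ed after each node acknowledges a COMMIT,
plus a recovery pass that lists the nodes to detach. The file format,
function names, and node numbering are assumptions made for
illustration, not existing pgpool code, and the sketch does not cover
the window between a node's COMMIT and the log write.

```python
import json
import os


def record_commit(path, tx_id, node):
    """Durably note that `node` has acknowledged COMMIT for `tx_id`."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"tx": tx_id, "node": node}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # force the record to disk before returning


def nodes_to_detach(path, all_nodes):
    """After a crash, return nodes that missed a commit some other node has."""
    committed = {}  # tx id -> set of nodes known to have committed it
    with open(path, encoding="utf-8") as f:
        for line in f:
            try:
                rec = json.loads(line)
            except ValueError:
                continue  # skip a record truncated by the crash
            committed.setdefault(rec["tx"], set()).add(rec["node"])

    victims = set()
    for nodes in committed.values():
        victims |= set(all_nodes) - nodes
    return victims
```

For example, if the log shows tx1 committed on nodes 0 and 1 but tx2
only on node 0, nodes_to_detach(path, [0, 1]) returns {1}.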

What do you think?

> 2. More strict write ordering in replication mode
> 
> Currently, pgpool-II can suffer from inconsistencies
> between nodes due to complex transactions getting
> different snapshots on different nodes:
> 
> http://lists.pgfoundry.org/pipermail/pgpool-general/2007-May/000641.html
> 
> The simple solution is to serialize all writing
> transactions completely, which would seriously hurt
> write performance. 

Yeah, I agree that it's not acceptable.

> I think the better solution is to transform all
> writing transactions into explicit transactions with a
> BEGIN ... COMMIT. Then, make sure all statements (even
> from different connections) are executed in the same
> order on each node and in the same order across nodes
> (to prevent deadlock). I think this can be done in a
> safe way with a shared counter for the pgpool
> processes. 
> 
> This should be configurable or perhaps replace
> replication_strict 

We think that the following is enough to prevent the problem you
described.

1) transform all writing transactions into explicit transactions with a
   BEGIN ... COMMIT (as you suggested)

2) acquire a table lock if the statement is INSERT

3) for UPDATE/DELETE, pgpool does not need to acquire any lock since
   PostgreSQL already does

4) any DML should be done in the order DB node 0, DB node 1... DB node
   n.

5) Note that in #4, we could issue DML in parallel *except* on
   node 0. In theory, the write performance of pgpool would then be no
   worse than 1/2 that of PostgreSQL, regardless of the number of
   nodes.

6) the order of COMMIT should be node n, node n-1, ... node 1, node 0
   (the reverse of the order in #4) so that the locks are kept.
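
The ordering in #4 to #6 can be sketched as follows. This is a toy
model, not pgpool code: the function names are hypothetical, and the
throughput estimate assumes every node takes the same time per
statement and ignores pgpool's own overhead.

```python
def execution_plan(num_nodes):
    """Return (dml_phases, commit_order) for nodes numbered 0..num_nodes-1.

    dml_phases:   lists of nodes; each phase runs after the previous one,
                  and nodes within a phase may run in parallel.
    commit_order: nodes in the order COMMIT is sent.
    """
    nodes = list(range(num_nodes))
    # Node 0 always executes the DML first, alone.
    dml_phases = [[0]]
    # All remaining nodes can then execute the same DML in parallel.
    if num_nodes > 1:
        dml_phases.append(nodes[1:])
    # COMMIT goes in reverse order, so node 0 commits last.
    commit_order = list(reversed(nodes))
    return dml_phases, commit_order


def relative_write_throughput(num_nodes):
    """Estimated write throughput relative to a single PostgreSQL server."""
    dml_phases, _ = execution_plan(num_nodes)
    return 1 / len(dml_phases)


if __name__ == "__main__":
    phases, commits = execution_plan(4)
    print(phases)   # DML phases
    print(commits)  # COMMIT order
    print(relative_write_throughput(4))
```

With four nodes the plan has two DML phases ([0], then [1, 2, 3]) and
the COMMIT order 3, 2, 1, 0, so the estimate stays at 1/2 however many
nodes are added.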

What do you think?

> 3. Unification of replication mode and parallel query
> mode
> 
> It doesn't seem like these modes are mutually
> exclusive. It would be nice if both modes could be
> used together; even having replicated partitions, etc.
> Is there a reason this won't work or is it just
> challenging?

This is in our plan.

> Thoughts? Have these things been discussed already? If
> one of these things seem promising, I'll take a look
> into the code.

Thanks. For 2 and 3, we could start the implementation. I would
appreciate it if you took care of 1 (of course, if you agree with my
idea).
--
Tatsuo Ishii
SRA OSS, Inc. Japan