[Pgpool-hackers] pgpool-II ideas

Wed Jun 20 01:16:53 UTC 2007

> --- Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> > I think the essential problem here is that pgpool does not have
> > information about node status (which node has committed the tx and
> > which node has not) that could survive a pgpool crash. Once
> > we have such status, pgpool could keep the nodes consistent even
> > without 2PC, i.e. by detaching victim nodes.
> > 
> > So what we really need is a durable node commit status, which is
> > very much like pg_clog in PostgreSQL, I think.
> > 
> > What do you think?
> 
> Node status would prevent node inconsistency. However, detaching victim
> nodes could be a problem. If pgpool dies at an arbitrary time while
> writes are happening, on average half of the nodes would need to be
> detached (and worst case, all but one), correct?

Yes.
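
To make the "durable node commit status" idea concrete, here is a minimal
sketch in Python (not pgpool code). It assumes an append-only log that is
fsync'ed after each node acknowledges a COMMIT. The names record_commit,
victim_nodes and the log file name are hypothetical, chosen only for
illustration. A commit that succeeded on a node but was not yet logged when
pgpool died is not handled here.

```python
import json
import os
import tempfile

def record_commit(log_path, txid, node_id):
    """Durably record that node `node_id` has committed transaction `txid`."""
    with open(log_path, "a") as f:
        f.write(json.dumps({"tx": txid, "node": node_id}) + "\n")
        f.flush()
        os.fsync(f.fileno())  # the record must reach disk before we go on

def victim_nodes(log_path, all_nodes):
    """After a crash: nodes that miss a commit some other node already has."""
    committed = {}  # txid -> set of nodes that committed it
    with open(log_path) as f:
        for line in f:
            rec = json.loads(line)
            committed.setdefault(rec["tx"], set()).add(rec["node"])
    victims = set()
    for nodes in committed.values():
        victims |= set(all_nodes) - nodes
    return victims

# Demo with three nodes: tx 1 commits everywhere, tx 2 only reaches node 2
# before the (simulated) pgpool crash.
log_path = os.path.join(tempfile.mkdtemp(), "node_commit_status.log")
for node in (2, 1, 0):
    record_commit(log_path, 1, node)
record_commit(log_path, 2, 2)

print(sorted(victim_nodes(log_path, [0, 1, 2])))  # [0, 1]
```

In this demo two of three nodes must be detached, which is the "all but
one" worst case mentioned above.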

> I don't think that would be too much of a problem, except that we don't
> have a good way for a node to catch up while the server is active.
> They'd have to restore all the nodes to a consistent state before
> resuming replication (PITR from one of the good nodes?). 

Recently one of our developers has started to implement "online
recovery", which is similar to the one PGCluster already has. Alpha-quality
code is currently available. Do you want to check it out?

> The main thing 2PC gives us is that we don't have to degrade the nodes.
> If we see inconsistencies we can just COMMIT/ROLLBACK PREPARED and then
> start up with all nodes active. If we don't have 2PC, we can only
> detach the nodes that haven't committed all the transactions.

Sounds like a reasonable idea.
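
As a rough illustration of that recovery step, here is a Python sketch
that decides what to send to each node for one prepared transaction. It
assumes pgpool can learn, per node, whether the transaction is still
prepared or already committed (for example from a durable node status plus
the pg_prepared_xacts view). The function name resolve_prepared and the
state labels are hypothetical.

```python
def resolve_prepared(gid, node_states):
    """Return {node: SQL} for every node still holding `gid` prepared.

    node_states maps node id -> 'prepared', 'committed' or 'aborted'.
    If any node already committed, the rest must commit as well;
    otherwise it is safe to roll back everywhere.
    """
    decision = "COMMIT" if "committed" in node_states.values() else "ROLLBACK"
    quoted = gid.replace("'", "''")  # escape quotes inside the identifier
    return {
        node: f"{decision} PREPARED '{quoted}'"
        for node, state in node_states.items()
        if state == "prepared"
    }

# Node 0 committed before the crash, nodes 1 and 2 are still prepared.
print(resolve_prepared("tx42", {0: "committed", 1: "prepared", 2: "prepared"}))
```

No node has to be detached in either case, which is the advantage being
described above.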

> I agree that we should separate the issues though. I'll start
> investigating the node status now, because that solves the
> inconsistency problem. If we want to add 2PC after we have node status
> it would not be much additional code (I don't think), and 2PC can be
> optional.

Great.

> > We think that the following are enough to prevent the problem you
> > described.
> > 
> > 1) transform all writing transactions into explicit transactions with
> >    a BEGIN ... COMMIT (as you suggested)
> > 
> > 2) acquire a table lock if the statement is INSERT
> > 
> > 3) for UPDATE/DELETE, pgpool need not acquire any lock since
> >    PostgreSQL already does
> > 
> > 4) any DML should be done in the order DB node 0, DB node 1, ...
> >    DB node n.
> > 
> > 5) Note that in #4, we could issue DML in parallel on every node
> >    *except* node 0. In theory, the WRITE performance of pgpool should
> >    be no worse than 1/2 that of plain PostgreSQL, regardless of the
> >    number of nodes.
> > 
> > 6) the order of COMMIT should be node n, node n-1, ... node 1, node 0
> >    (the reverse of the order in #4) so that the locks stay held.
> > 
> > What do you think?
> > 
> 
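
The ordering in #1, #4 and #6 can be written out as a small Python sketch
that only builds the list of (node, statement) steps. It does not talk to
any database, and the INSERT table lock from #2 is left out. The function
name write_schedule is hypothetical.

```python
def write_schedule(num_nodes, sql):
    """Steps for one write: DML on node 0..n, then COMMIT on node n..0."""
    steps = []
    for node in range(num_nodes):             # node 0 first, then 1..n
        steps.append((node, "BEGIN"))
        steps.append((node, sql))
    for node in range(num_nodes - 1, -1, -1):  # COMMIT in reverse order
        steps.append((node, "COMMIT"))
    return steps

for node, stmt in write_schedule(3, "UPDATE t SET x = 1"):
    print(f"node {node}: {stmt}")
```

On the 1/2 figure in #5: if one statement takes time T on a node, node 0
costs T and nodes 1..n together cost about T when run in parallel, so the
total is roughly 2T however many nodes there are (ignoring COMMIT and
network overhead).
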
> I think that there's still a problem if the INSERT/UPDATE/DELETE only
> lock the destination table, and not the source table.
> 
> Let t1 be an empty relation, and let t2 be a relation with 5M records.
> 
> Client1=> insert into t1 select i from t2; -- statement1
> Client2=> insert into t2 values(-1); -- statement2
> 
> Now, here's what could happen following those steps:
> 
> (1) statement1 is started on node0 getting a snapshot of t2 that does
> not include the value -1. Table t1 on node0 is locked due to the
> INSERT.
> (2) statement2 is executed on node0, locking t2 on node0, which does
> not conflict with the lock on t1.
> (3) statement2 is executed on node1, locking t2 on node1
> (4) statement2 commits on node1, then commits on node0
> (5) statement1 is started on node1 getting a different snapshot of t2
> that does include the value -1.
> 
> The basic problem is that there's nothing to prevent statement2 from
> being committed on all nodes before statement1 finishes on node0,
> meaning that statement1 will get different snapshots on different
> nodes.
> 
> If my thinking is correct, we need:
> (1) All COMMITs to be ordered exactly the same on all nodes
> (2) each statement to be started in the same relative position between
> COMMITs of other transactions
> 
> for any transactions that could affect each other.
> 
> I'm also worried that, if we start reordering transactions, we could
> run into FK problems, etc. 
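
The divergence in steps (1) to (5) can be reproduced with a toy Python
model. Each node is just two lists, and "taking a snapshot" is simplified
to "seeing whatever has been committed so far in the event order". The
function run_node and the three-row t2 are stand-ins for illustration.

```python
def run_node(events):
    """Apply events in order on one node and return the final (t1, t2)."""
    t1, t2 = [], [1, 2, 3]        # small stand-in for the 5M-row t2
    for ev in events:
        if ev == "statement1":    # insert into t1 select i from t2
            t1.extend(t2)         # sees only rows committed so far
        elif ev == "statement2":  # insert into t2 values(-1), committed
            t2.append(-1)
    return t1, t2

# node0: statement1's snapshot is taken before statement2 commits.
node0 = run_node(["statement1", "statement2"])
# node1: statement2 has already committed when statement1 starts.
node1 = run_node(["statement2", "statement1"])

print("node0 t1:", node0[0])  # [1, 2, 3]
print("node1 t1:", node1[0])  # [1, 2, 3, -1]
```

t2 ends up identical on both nodes, but t1 does not, even though neither
statement ever waited on the other's table lock.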

Yoshiyuki, any comment?
--
Tatsuo Ishii
SRA OSS, Inc. Japan