[Pgpool-general] Parallel query order by / limit

Bob Lunney bob_lunney at yahoo.com
Thu Jun 2 16:48:37 UTC 2011


Drew,

PgPool has to fetch all of the rows in order to sort them; only then can it limit how many are returned.  Unfortunately, there is no magic distributed index that can help with this operation.  (Imagine fetching a row through an index on one backend database and comparing it against every other backend database to determine whether that row belongs in the result set.  Yikes!)  You will find similar issues with GROUP BY and other clauses that require a sort.  Subqueries can be problematic, too.
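As a minimal sketch of the behavior described above (in Python, with made-up backend names and rows standing in for what each pgpool worker node would return), the coordinator effectively has to do this when ORDER BY / LIMIT cannot be pushed down:

```python
# Simulated per-backend result sets (unsorted), standing in for the rows each
# worker node would return for "SELECT user_id, date_created FROM users".
# Backend names and data here are hypothetical, for illustration only.
backend_rows = {
    "node0": [(1, "2011-05-03"), (4, "2011-01-20"), (7, "2011-06-01")],
    "node1": [(2, "2011-02-14"), (5, "2011-04-09")],
    "node2": [(3, "2011-03-30"), (6, "2011-01-02")],
}

def naive_order_limit(backends, limit):
    """What the coordinator must do without pushdown: fetch every row
    from every backend, sort the combined set, then apply LIMIT."""
    all_rows = [row for rows in backends.values() for row in rows]  # full fetch
    all_rows.sort(key=lambda r: r[1])  # ORDER BY date_created
    return all_rows[:limit]            # LIMIT

print(naive_order_limit(backend_rows, 2))
```

The full fetch into `all_rows` is the memory and performance cost being discussed: it grows with the whole table, not with the LIMIT.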

Note that the data is returned to the PostgreSQL cluster running pgpool for the sort and limit operations.  This operation uses dblink, which allocates memory to hold the rows and can exhaust memory if the amount of data is too large.  If you want to run a partitioned data warehouse, I suggest looking at Greenplum.  If you have targeted data to retrieve, pgpool can do that.


PgPool works well in parallel mode if you remember these limitations.  I'm running several ginormous databases with it and get excellent response.


Good luck!

Bob Lunney
________________________________
From: Drew Robb <drewrobb at gmail.com>
To: pgpool-general at pgfoundry.org
Sent: Thursday, June 2, 2011 11:17 AM
Subject: [Pgpool-general] Parallel query order by / limit


Hello, I'm fairly new to pgpool (enjoying it so far) and I'm trying to
understand parallel query behavior. I'm using pgpool 3.0.4 and
PostgreSQL 9.0.

I have a parallel table USERS distributed via the primary key user_id.  When I execute the following query

select user_id, date_created from users order by date_created limit 10;
I get the following output from pgpool debug:

statement2: SELECT pool_parallel("SELECT user_id, users.date_created FROM users")

My understanding is that the parallel engine is unable to translate the
ORDER BY and LIMIT clauses into the query executed on the worker nodes.
This creates a big performance problem, since all records in the table
are seemingly being processed.  I read the documentation, but was unable
to figure out why this is the case. It seems to me that the ORDER BY and
LIMIT clauses could be applied on each of the workers (returning up to
N*10 records in total) and then reapplied to filter these results down
to just 10 records. Is the behavior I'm seeing expected, and does the
optimization I have in mind make sense?

Either way, is there a way I could specify exactly the parallel query
to be executed on the workers to avoid this problem? I know that
creating a particular view and registering it in the dist_def table might work, but this isn't very dynamic.

Thanks,
Drew
_______________________________________________
Pgpool-general mailing list
Pgpool-general at pgfoundry.org
http://pgfoundry.org/mailman/listinfo/pgpool-general
