[Pgpool-general] Parallel Query Mode - SELECT, Phppgadmin, Psql

calimlimvl at nationalbookstore.com.ph calimlimvl at nationalbookstore.com.ph
Wed Jul 16 04:12:54 UTC 2008


Hello,

> Hi.
>
>
>> >> It's now working in pgpool-II 2.1 RC1. No more problems. And yes I
>> >> use PHP PDO and AdoDB. My previous post has more details. So what's
>> >> the replicate_def table and replicate_select for? Thanks again. :)
>> >
>> > replicate_def is to replicate data, not to distribute.
>> >
>> > ex) partitioning data
>> >   node1: dist1 -> (1, 2, 3)  dist2 -> (4, 5, 6)
>> >   node2: dist1 -> (4, 5, 6)  dist2 -> (7, 8, 9)
>> >   node3: dist1 -> (7, 8, 9)  dist2 -> (10, 11, 12)
>> >
>> > When pgpool executes the following query, the query is not executed
>> > in parallel, because we would need to join the dist1 table on node2
>> > with the dist2 table on node1.
>> >
>> >   SELECT * FROM dist1 INNER JOIN dist2 on dist1.a = dist2.a;
>> >
>> >
>> > ex) partitioning data
>> >   node1: dist1 -> (1, 2, 3)  rep1 -> (4, 5, 6, 7, 8, 9, 10, 11, 12)
>> >   node2: dist1 -> (4, 5, 6)  rep1 -> (4, 5, 6, 7, 8, 9, 10, 11, 12)
>> >   node3: dist1 -> (7, 8, 9)  rep1 -> (4, 5, 6, 7, 8, 9, 10, 11, 12)
>> >
>> > pgpool can execute the following query in parallel, because the rep1
>> > table is the same on all nodes.
>> >
>> >   SELECT * FROM dist1 INNER JOIN rep1 ON dist1.a = rep1.a;
>> >
>> > I think small tables are defined in replicate_def.
>>
>> If only small tables are suited to replicate_def, then it may not be
>> applicable to our situation, because here is the structure of the whole
>> thing as planned:
>>
>> database: loadbal
>> schema: public
>> tables: salesdetail
>>         salesheader
>>
>> Tables salesdetail and salesheader both contain transaction date and
>> branchcode columns. We decided to partition the data by transaction
>> date because, as calculated, it partitions the data more evenly than
>> partitioning by branchcode. We then extract the MONTH part of the
>> transaction date and use it as an argument to the SELECT CASE function,
>> so the possible values are 1-12. The two tables are very large
>> (millions of records per table, because we have 100+ branches and
>> hundreds/thousands of transactions per day). That is why my function
>> looks like this:
>

I just found out that it is not only the salesdetail and salesheader
tables that we need to generate a sales report from; there is another
database (MySQL) that we need to migrate to PostgreSQL, called
loadbalance_lookup. It contains several tables, but some of the fields in
the loadbalance_lookup tables are the same as the salesdetail and
salesheader fields. The programmer said that he LEFT JOINs the
loadbalance_lookup fields to one of the main tables, but not to both
(salesdetail and salesheader). If that is the case, then I may have to
create a loadbalance_lookup table and place only the needed fields inside
it, such as branchcode and transaction_date. Am I right? And I guess the
partitioning should be:

node1: loadbalance_lookup -> (1, 2, 3, 4)
       salesheader -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
       salesdetail -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
node2: loadbalance_lookup -> (5, 6, 7, 8)
       salesheader -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
       salesdetail -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
node3: loadbalance_lookup -> (9, 10, 11, 12)
       salesheader -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
       salesdetail -> (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)

*** Assuming the "transaction_date" field is the partitioning key column
and is present in all three tables, we just EXTRACT(MONTH FROM $1) to get
the 1-12 values.
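
To make that concrete, the partitioning function I have in mind would be
roughly like this (only a sketch following the dist_def_* examples in the
tutorial; the function name is my own placeholder, and the ranges match
the node1/node2/node3 layout above):

  -- Hypothetical: route rows to node 0-2 by the month of transaction_date
  -- (months 1-4 -> node0, 5-8 -> node1, 9-12 -> node2).
  CREATE OR REPLACE FUNCTION pgpool_catalog.dist_def_loadbalance_lookup(val date)
  RETURNS integer AS $$
    SELECT CASE
             WHEN EXTRACT(MONTH FROM $1) BETWEEN 1 AND 4 THEN 0
             WHEN EXTRACT(MONTH FROM $1) BETWEEN 5 AND 8 THEN 1
             ELSE 2
           END;
  $$ LANGUAGE sql IMMUTABLE;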

Since the loadbalance_lookup fields are to be joined with either
salesheader or salesdetail fields, I distributed loadbalance_lookup and
replicated salesheader and salesdetail (following the INNER JOIN example
above for the right way of doing parallel query).
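
If I read the tutorial correctly, the registrations on the pgpool-II side
would then look roughly like the following. This is only my understanding;
the column lists and type lists are placeholders rather than our real
schema, and I am assuming the dist_def/replicate_def layouts shown in the
pgpool-II tutorial:

  -- Distribute loadbalance_lookup by transaction_date (placeholder columns).
  INSERT INTO pgpool_catalog.dist_def VALUES (
    'loadbal', 'public', 'loadbalance_lookup',
    'transaction_date',
    ARRAY['branchcode', 'transaction_date'],
    ARRAY['text', 'date'],
    'pgpool_catalog.dist_def_loadbalance_lookup'
  );

  -- Replicate the two large sales tables on every node (placeholder columns).
  INSERT INTO pgpool_catalog.replicate_def VALUES (
    'loadbal', 'public', 'salesheader',
    ARRAY['transaction_date', 'branchcode'], ARRAY['date', 'text']
  );
  INSERT INTO pgpool_catalog.replicate_def VALUES (
    'loadbal', 'public', 'salesdetail',
    ARRAY['transaction_date', 'branchcode'], ARRAY['date', 'text']
  );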

Sorry if it looks a little more complicated this time. All we want to do
is partition our very large data and make parallel query work to our
advantage.
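
Concretely, the kind of report query we hope will run in parallel under
this layout is something like the following (the join style copies the
INNER JOIN example above; the WHERE clause is just illustrative):

  -- loadbalance_lookup is distributed by month, while salesheader and
  -- salesdetail are replicated on every node, so the join should be able
  -- to stay local to each node.
  SELECT *
  FROM loadbalance_lookup
  INNER JOIN salesheader USING (transaction_date)
  INNER JOIN salesdetail USING (transaction_date)
  WHERE transaction_date BETWEEN DATE '2008-01-01' AND DATE '2008-04-30';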

Is my setup correct?

I appreciate all your help. Thank you very much.

Regards,
Viril Calimlim

> Do you want to execute the following query?
>
> SELECT * FROM salesdetail INNER JOIN salesheader using(transaction_date);
>
> ex) partitioning data
>    node1: salesdetail -> (1, 2, 3, 4)  salesheader -> (1, 2, 3, 4)
>    node2: salesdetail -> (5, 6, 7, 8)  salesheader -> (5, 6, 7, 8)
>    node3: salesdetail -> (9, 10, 11, 12)  salesheader -> (9, 10, 11, 12)
>
> If the JOIN can be completed within a single node, you can use a VIEW.
>
> Define the VIEW on each node.
>  CREATE VIEW detail_and_header as SELECT * FROM salesdetail INNER JOIN
> salesheader using(transaction_date);
>
> Register the VIEW as a partitioned table in the dist_def table.
>
> You can use parallel query execution.
>
>   SELECT * FROM detail_and_header;
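
If we end up trying this VIEW approach, I guess the registration on the
pgpool-II side would be something like the INSERT below. This is only my
understanding of "register the VIEW as a partitioned table in dist_def";
the partitioning function name and the column/type lists are placeholders:

  -- Register the view like a partitioned table (placeholder names/columns).
  INSERT INTO pgpool_catalog.dist_def VALUES (
    'loadbal', 'public', 'detail_and_header',
    'transaction_date',
    ARRAY['transaction_date'],
    ARRAY['date'],
    'pgpool_catalog.dist_def_detail_and_header'
  );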
>
>>
>> *** Note: $1 = transaction date
>>
>> SELECT CASE WHEN EXTRACT(MONTH FROM $1) >= 1
>>             AND EXTRACT(MONTH FROM $1) <= 4 THEN 0
>>
>> *** Months 1-4 go to node0
>>
>> Question: Is it safe to say that it is not advisable to use
>> replicate_def because the tables are very large, or should I use it to
>> replicate the records on all nodes?
>>
>> Whenever I enter a value in the transactiondate field, e.g. 2008-07-15,
>> it goes to node1 alone, with no copy on the others. Is this the right
>> behavior? Are we going to be able to take advantage of parallel query
>> if this is the setup? If not, what would be the best partitioning
>> scheme to take full advantage of parallel query given the
>> structure/setup above? Our goal, again, is to prove that pgpool-II
>> parallel query can make the SELECT query time shorter compared to a
>> single MySQL server with the same amount of data (millions of rows).
>> Ideally, the time taken should be divided by 3 because of the 3
>> backends.
>>
>> Thank you very much for all your help.
>>
>> Regards,
>> Viril Calimlim
>>
>> >
>> > replicate_select is a parameter for replication mode. So you don't
>> > need to set it.
>> >
>> > I'm not good at English. So if you cannot understand, feel free to ask
>> > more.
>> >
>> > Regards,
>> > --
>> > Yoshiyuki Asaba
>> > y-asaba at sraoss.co.jp
>> >
>> >>
>> >> Regards,
>> >> Viril Calimlim
>> >>
>> >> > Hi,
>> >> >
>> >> > From: calimlimvl at nationalbookstore.com.ph
>> >> > Subject: Re: [Pgpool-general] Parallel Query Mode - SELECT, Phppgadmin, Psql
>> >> > Date: Thu, 10 Jul 2008 17:25:41 +0800 (PHT)
>> >> >
>> >> >> This is really serious because we code using PHP and we wouldn't
>> >> >> be able to make use of pgpool-II if we couldn't solve this problem.
>> >> >
>> >> > Do you use PHP-PDO?
>> >> >
>> >> >> By the way, where can I get more information regarding other
>> >> >> parameters like replicate_select (yes, it replicates the SELECT on
>> >> >> all nodes, but what's the advantage? Did I miss anything in the
>> >> >> manual?), the replicate_def table, etc.? I am more than willing to
>> >> >> further test the software for you to eliminate more bugs. Thank
>> >> >> you so much for the speedy response.
>> >> >
>> >> > Could you try pgpool-II 2.1 RC1?
>> >> >
>> >> > Regards,
>> >> > --
>> >> > Yoshiyuki Asaba
>> >> > y-asaba at sraoss.co.jp
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> Best regards,
>> >> >> Viril Calimlim
>> >> >>
>> >> >> > Hi,
>> >> >> >
>> >> >> > Do you set parallel_mode to true?
>> >> >> >
>> >> >> > Regards,
>> >> >> > --
>> >> >> > Yoshiyuki Asaba
>> >> >> > y-asaba at sraoss.co.jp
>> >> >> >
>> >> >> >
>> >> >> > From: calimlimvl at nationalbookstore.com.ph
>> >> >> > Subject: [Pgpool-general] Parallel Query Mode - SELECT, Phppgadmin, Psql
>> >> >> > Date: Thu, 10 Jul 2008 15:45:11 +0800 (PHT)
>> >> >> >
>> >> >> >> Hello!
>> >> >> >>
>> >> >> >> First of all, thank you for taking the time to reply and
>> >> >> >> excellent tip regarding my last question about Parallel Query.
>> >> >> >> However, I've got a new problem.
>> >> >> >>
>> >> >> >> In Parallel Query Mode, as per the Tutorial guidelines, I
>> >> >> >> created the following:
>> >> >> >>
>> >> >> >> - Database "test"
>> >> >> >> - Schema "testschema"
>> >> >> >> - Table "testtable"
>> >> >> >> - Fields "id integer", "name varchar(20)"
>> >> >> >> - Function pgpool_catalog.dist_def_test with same ranges
>> >> >> >>   (1..100000 etc)
>> >> >> >> - Inserted a row in dist_def table with key "id"
>> >> >> >>
>> >> >> >> I have 3 backend servers 0, 1, 2 and 1 pgpool-II dedicated
>> >> >> >> server. I did not insert a row in replicate_def because I still
>> >> >> >> don't understand that part (sorry).
>> >> >> >>
>> >> >> >> My question is that whenever I issue the command:
>> >> >> >>
>> >> >> >> postgres at pgpool-server:~$ psql -p 9999 -c "INSERT INTO
>> >> >> >> testschema.testtable VALUES (100000,'testname')" test
>> >> >> >>
>> >> >> >> ... I checked all the backends and it sends the data to node0
>> >> >> >> ALONE, and when I do a SELECT *, it shows only 1 row.
>> >> >> >>
>> >> >> >> But when I use Phppgadmin or even run a PHP script (using PDO)
>> >> >> >> on port 9999 of the pgpool-II server to insert data, I checked
>> >> >> >> all the backends and all of them have a copy of the inserted
>> >> >> >> row. When I do SELECT *, I get 3 rows of the same data. Another
>> >> >> >> thing about Phppgadmin: when I click on the BROWSE button on
>> >> >> >> testtable, it just keeps on Loading... meaning it gets stuck
>> >> >> >> forever for some reason, and I had to kill all pgpool processes.
>> >> >> >>
>> >> >> >> tail postgresql log: lots of "LOG:  unexpected EOF on client
>> >> >> >> connection"
>> >> >> >> tail pgpool log:
>> >> >> >>
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: read_kind_from_backend: read kind from 0 th backend Z NUM_BACKENDS: 3
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: read_kind_from_backend: read kind from 1 th backend Z NUM_BACKENDS: 3
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: read_kind_from_backend: read kind from 2 th backend Z NUM_BACKENDS: 3
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: pool_process_query: kind from backend: Z
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: pool_read_message_length: slot: 0 length: 5
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: pool_read_message_length: slot: 1 length: 5
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: pool_read_message_length: slot: 2 length: 5
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: ReadyForQuery: message length: 5
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: ReadyForQuery: transaction state: I
>> >> >> >> 2008-07-10 15:41:28 DEBUG: pid 6621: pool_connection_pool_timer: set close time 1215675688
>> >> >> >>
>> >> >> >> Version: pgpool-II 2.1beta2
>> >> >> >> OS: Ubuntu 8.04 Server
>> >> >> >> PostgreSQL: 8.3.1
>> >> >> >>
>> >> >> >> Please help. I really want this to work. I've been after this
>> >> >> >> for 3 months now.
>> >> >> >>
>> >> >> >> Thank you very much for your help.
>> >> >> >>
>> >> >> >> Regards,
>> >> >> >> Viril
>> >> >> >>
>> >> >> >> _______________________________________________
>> >> >> >> Pgpool-general mailing list
>> >> >> >> Pgpool-general at pgfoundry.org
>> >> >> >> http://pgfoundry.org/mailman/listinfo/pgpool-general
>> >> >> > _______________________________________________
>> >> >> > Pgpool-general mailing list
>> >> >> > Pgpool-general at pgfoundry.org
>> >> >> > http://pgfoundry.org/mailman/listinfo/pgpool-general
>> >> >> >
>> >> >> >
>> >> >>
>> >> >>
>> >> >
>> >> >
>> >>
>> >>
>> >
>> >
>>
>>
>>
>> _______________________________________________
>> Pgpool-general mailing list
>> Pgpool-general at pgfoundry.org
>> http://pgfoundry.org/mailman/listinfo/pgpool-general
>>
>
>
> --
> SRA OSS, Inc. Japan
> Yoshiharu Mori <y-mori at sraoss.co.jp>
> http://www.sraoss.co.jp/
>
>




