[pgpool-general: 2402] PGPool tuning
jcooper at vendwatchtelematics.com
Tue Jan 14 01:40:05 JST 2014
We are having an issue with PGPool, and I wanted to post my analysis to this
list to see (a) whether my analysis seems correct to you all, and (b) whether
you folks have any advice on tuning.
For the last month plus, we have been experiencing an intermittent fault
state on our production cluster. When the fault occurs, any request to the
Apache+PHP web server will either time out connecting, or will connect but
return with a "Could not connect to DB" message from PHP. I've done some
analysis on the problem and this is what I've found.
First let me describe the cluster as it is configured today. We have one
web front end running Apache+PHP, which has a MaxClients setting of 256,
meaning that it's possible to have 256 concurrently running processes. The
PHP application is configured to connect to PGPool 3.2.1 for its database
connection. PGPool is configured with num_init_children of 32 and max_pool
of 8. The application runs on 10-12 different databases, all with the same
schema.
When the fault occurs, it looks like this: Apache has 256 running processes
and load on the web front end drops to near 0. PGPool has all 32 sockets
that face Apache filled, and all 256 sockets that face Postgres filled.
Postgres has 256 connections and its load goes to near 0. If you try to
connect to PGPool from the command line, it will either time out while
connecting or sometimes partially connect and then receive a "connection
closed" message.
Using our test cluster, I ran some tests that give me high confidence that
PGPool is actually working correctly, as are Apache and Postgres, and that
the fundamental problem is just a badly tuned configuration. This is the
test that shows this best:
1. Stop Apache, restart PGPool
2. Start up 100 psql command line clients, each connecting to PGPool
3. The first 32 psql clients connect and work fine
4. The 33rd and subsequent psql clients block waiting to connect (they will time out
after 30 seconds, but in this test we don't wait that long)
5. fg psql client #1, then exit the client, freeing up one of the 32 slots
6. One of the 68 blocked psql clients now gets through and can run queries
7. Any of the 32 connected psql clients can get through as well
This shows that PGPool is working as expected.
Now we try a test that is more like the real world:
1. Restart PGPool
2. Start up 10-20 psql command line clients. These are simulating long
running php processes.
3. Start siege web testing tool with 100-200 concurrent requests to
4. At 100 clients, the response time from Apache slows down and the time
taken to service each request goes up to around 15s (from < 1s). The psql
command line client can get through most of the time, but it takes some
time to connect as it is contending with all of the Apache processes for
one of the 32 slots to PGPool.
5. At 200 clients, response time goes up more and we start to see
failures in Apache, as well as "Could not connect to DB" responses. The
psql command line client will often time out before it gets a connection to
PGPool.
6. Once lots of failures are happening at the 200-client level, load on
Postgres drops to near 0, as does load on Apache.
7. Failure will also happen with 250 siege clients and no psql command
line clients running.
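(For concreteness, the siege runs above are invocations along these lines;
the URL is a placeholder for our app's entry point, -c sets the number of
concurrent simulated clients, and -t the duration of the run:

    siege -c 200 -t 2M http://test-web-frontend/app/

with -c varied between 100 and 250 as described.)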
In step 4, I believe the response time from Apache goes up due to PGPool
having to spend so much time managing incoming connections from Apache as
well as managing connections to Postgres. Database load is not high in
this case, so the slowness is not due to Postgres being overloaded.
I believe that on the live cluster the load is even more severe, as there
are more databases in use, plus occasional high load and long running
queries.
It's also notable that restarting Apache has been our fix to get everything
running again. I believe that this is because PGPool gets a chance to
catch up, which it does fairly quickly, and resumes with 32 available
sockets for Apache. If we do nothing, PGPool reaches a 10 minute timeout
specified in its config, and closes all 32 sockets, which causes everything
to resume working again.
In the end, I believe the problem is that Apache is just sending too many
requests to PGPool, and PGPool spends all of its time managing connections,
causing it to be slow at everything. That slowness and contention for 32
slots among up to 256 Apache processes leads to connection timeouts (it
should be noted that the PHP side seems to have no DB connect timeout
defined and will wait for a connection until the PHP max execution time is
reached).
Once a threshold is reached, we enter a state where no Apache process is
able to connect to PGPool in enough time and we see the browser requests
either timing out entirely or returning the "Could not connect to DB"
message.
The proposed solution to all of this is to adjust the configuration of
PGPool and Apache to ensure that we can never reach this overwhelmed state.
Specifically, we need to increase the number of PGPool processes and
decrease the maximum number of Apache processes. We need to be careful as
we do this, as there is surely an upper limit to how many PGPool processes
can be sustained and increasing that increases overhead on Postgres since
it increases the number of persistent open connections between it and
PGPool. The same goes for Apache: we need to lower MaxClients, but not so low
that it turns away requests that could have been handled.
There are a few other adjustments that I believe will help, which I'll
describe below.
Apache MaxClients:
This is how many concurrent Apache processes can run at once. The current
setting of 256 is clearly more than the system can handle. I suggest we
drop it down to 128 to begin with and monitor the results. I'd like to
make this change before the others.
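For concreteness, this is a sketch of that change in the prefork MPM
section of httpd.conf; every value other than MaxClients below is an
illustrative placeholder, not our current setting:

    <IfModule prefork.c>
        StartServers          8
        MinSpareServers       5
        MaxSpareServers      20
        # Drop from 256 to 128 concurrent Apache processes
        MaxClients          128
        MaxRequestsPerChild 4000
    </IfModule>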
Apache PHP DB connection timeout:
I can see that it's waiting as long as 150s before returning with 'Could
not connect to DB' at times, which indicates that no timeout is being
specified. This must be sent as part of the connection string, like:
"pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;connect_timeout=10"
(connect_timeout is the parameter name libpq expects). I'm not
sure at this point what a reasonable value would be, but I'm thinking 10
seconds is a good start.
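A minimal PHP sketch of what this would look like; the credentials here are
placeholders, and PDO_PGSQL passes DSN attributes through to libpq, where
the parameter is connect_timeout:

    <?php
    // connect_timeout=10 makes the connection attempt fail after ~10s
    // instead of hanging until PHP's max_execution_time is reached.
    $dsn = 'pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;connect_timeout=10';
    try {
        $db = new PDO($dsn, 'app_user', 'app_password'); // placeholder credentials
    } catch (PDOException $e) {
        // This is where the "Could not connect to DB" response gets produced.
        die('Could not connect to DB');
    }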
PGPool backends:
We currently have 2 backends specified in the config. One has
backend_weight of 1 and the other, which is not used, has backend_weight of
0. I have confirmed that whenever a client connects to PGPool and requests
a connection to a database, for example, PGPool opens a persistent
connection to both backends. We will comment out the backend that
specifies the backup server, which should help PGPool a lot.
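In pgpool.conf terms, the change is roughly this (the hostnames are
placeholders for our primary and backup DB servers):

    # Primary backend -- carries all traffic
    backend_hostname0 = 'db-primary'
    backend_port0 = 5432
    backend_weight0 = 1

    # Backup backend -- weight 0, but each PGPool child still opens a
    # persistent connection to it, so comment it out entirely:
    #backend_hostname1 = 'db-backup'
    #backend_port1 = 5432
    #backend_weight1 = 0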
num_init_children:
This is the config parameter that specifies how many PGPool processes can
run, and therefore how many sockets are available to Apache. Increasing
this number by one increases the number of persistent connections to the DB
by max_pool, currently 8. Postgres is currently configured to only allow
300 connections maximum, so that would need to be changed as well. More
research and testing is needed to find the sweet spot.
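The governing arithmetic, using our current values, is:

    persistent connections to Postgres = num_init_children * max_pool
                                       = 32 * 8 = 256   (Postgres limit: 300)

so any increase to num_init_children has to keep num_init_children *
max_pool under Postgres's max_connections.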
max_pool:
This parameter specifies how many different DBs each PGPool process keeps
in its cache of persistent connections to Postgres. It is currently set to
8, yet we have more than 8 different databases in production (I see 12
connected right now). If a connection to a database is requested of PGPool
by Apache, and the PGPool process servicing Apache's request does not have
a connection to that database, it will drop one and use the slot to make a
new connection to the requested DB on Postgres. If max_pool were set to 12,
this would stop happening: there would always be a persistent connection to
the requested DB ready to go whenever Apache asked for it, and Postgres
would ideally see no new connection churn. Increasing from 8 to 12 would
mean that total connections to Postgres would be 32*12 = 384, which is above
Postgres's connection limit. So this parameter, num_init_children, and
Postgres's connection limit must all be tuned to each other, and kept low
enough not to overwhelm Postgres.
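As one hypothetical combination that keeps max_pool at 12 while staying
under the limit:

    num_init_children = 24
    max_pool = 12
    # 24 * 12 = 288 persistent connections, under max_connections = 300

Whether 24 PGPool processes would be enough for our Apache traffic is
exactly the kind of thing the testing needs to determine.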
I suggest that we begin by commenting out the second backend in
pgpool.conf, and lowering MaxClients on Apache to 128. This should prevent
PGPool from being hammered past the point that it can handle. If PGPool does
fall behind, only 128 Apache connections will be hitting PGPool and it
seems to be able to handle that many in an orderly fashion.
I also think adding a PHP connection timeout will help keep the system from
grinding to a stop.
Thank you for reading and any help or insight you can provide!