[pgpool-general: 2403] Re: PGPool tuning
    Tatsuo Ishii 
    ishii at postgresql.org
       
    Tue Jan 14 10:55:09 JST 2014
    
    
  
Thanks for posting detailed analythis. It looks really interesting.
I need more time to understanding full details.
In the mean time I wonder if you care about listen queue
setting. Currently pgpool listens up to num_init_children*2 (which 64,
in your case). However Apache connects to pgpool up to 256, which is
way too low compared with 64. Also Linux allows max the listen queue
to up 128 by default on most systems. You can check it by looking at:
$ sysctl net.core.somaxconn
net.core.somaxconn = 128
128 is too low compared with 256, of course.
If the allowed listen queue length (backlog) is too low, lots of retry
happens in kernel's TCP layer.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
> Greetings!
> 
> We are having an issue with PGPool and I wanted to post my analysis to this
> list to see if: A). My analysis seems correct to you all and B). To see if
> you folks might have any advice on tuning.
> 
> 
> For the last month plus, we have been experiencing an intermittent fault
> state on our production cluster.  When the fault occurs, any request to the
> Apache+PHP web server will either time out connecting, or will connect but
> return with a "Could not connect to DB" message from PHP.  I've done some
> analysis on the problem and this is what I've found.
> 
> First let me describe the cluster as it is configured today.  We have one
> web front end running Apache+PHP, which has a MaxClients setting of 256,
> meaning that it's possible to have 256 concurrently running processes.  The
> PHP application is configured to connect to PGPool 3.2.1 for its database
> connection.  PGPool is configured with max_init_children of 32 and max_pool
> of 8.  The application runs on 10-12 different databases, all with the same
> Postgres username+password.
> 
> When the fault occurs, it looks like this: Apache has 256 running processes
> and load on the web front end drops to near 0.  PGPool has all 32 sockets
> that face Apache filled, and all 256 sockets that face Postgres filled.
>  Postgres has 256 connections and its load goes to near 0.  If you try to
> connect to PGPool from the command line, it will time out in connecting, or
> sometimes partially connect and then receive a connection closed message.
> 
> Using our test cluster, I ran some tests that give me high confidence that
> PGPool is actually working correctly, as are Apache and Postgres, and that
> the fundamental problem is just a badly tuned configuration.  This is the
> test that shows that best:
> 
> 
>    1. Stop Apache, restart PGPool
>    2. Start up 100 psql command line clients to connect to PGPool with a
>    single database
>    3. The first 32 psql clients connect and work fine
>    4. The 33rd psql client blocks waiting to connect (it will time out
>    after 30 seconds, but in this test we don't wait that long)
>    5. fg the psql client #1, then exit the client, freeing up one of
>    PGPool's connections
>    6. One of the 68 blocking psql clients now gets through and can run
>    queries
>    7. Any of the 32 connected psql clients can get through as well
> 
> This shows that PGPool is working as expected.
> 
> Now we try a test that is more like the real world:
> 
>    1. Restart PGPool
>    2. Start up 10-20 psql command line clients.  These are simulating long
>    running php processes.
>    3. Start siege web testing tool with 100-200 concurrent requests to
>    Apache.
>    4. At 100 clients, the response time from Apache slows down and the time
>    taken to service each request goes up to around 15s (from < 1s).  Psql
>    command line client can get through most of the time, but it takes some
>    time to connect as it is contending for one of the 32 slots to PGPool with
>    all of the Apache processes.
>    5. At 200 clients, response time goes up more and we start to see
>    failures in Apache, as well as "Could not connect to DB" responses.  Psql
>    command line client often will timeout before it gets a connection to
>    PGPool.
>    6. Once lots of failures are happening at the 200 clients level, load on
>    Postgres goes to near 0 as well as load on Apache.
>    7. Failure will also happen with 250 siege clients and no psql command
>    line clients running.
> 
> 
> In step 4, I believe the response time from Apache goes up due to PGPool
> having to spend so much time managing incoming connections from Apache as
> well as managing connections to Postgres.  Database load is not high in
> this case, so the slowness is not due to Postgres being overloaded.
> 
> I believe that on the live cluster the load is even more severe as there
> are more databases being used, and occasionally high load, long running
> queries.
> 
> It's also notable that restarting Apache has been our fix to get everything
> running again.  I believe that this is because PGPool gets a chance to
> catch up, which it does fairly quickly, and resumes with 32 available
> sockets for Apache.  If we do nothing, PGPool reaches a 10 minute timeout
> specified in its config, and closes all 32 sockets, which causes everything
> to resume working again.
> 
> 
> In the end, I believe the problem is that Apache is just sending too many
> requests to PGPool, and PGPool spends all of its time managing connections,
> causing it to be slow at everything.  That slowness and contention for 32
> slots among up to 256 Apache processes leads to connection timeouts (it
> should be noted that Apache seems to have no connect timeout defined and
> will wait for a connection until the PHP max execution time is reached).
>  Once a threshold is reached, we enter a state where no Apache process is
> able to connect to PGPool in enough time and we see the browser requests
> either timing out entirely or returning the "Could not connect to DB"
> message.
> 
> 
> The proposed solution to all of this is to adjust the configuration of
> PGPool and Apache to ensure that we can never reach this overwhelmed state.
>  Specifically, we need to increase the number of PGPool processes and
> decrease the maximum number of Apache processes.  We need to be careful as
> we do this, as there is surely an upper limit to how many PGPool processes
> can be sustained and increasing that increases overhead on Postgres since
> it increases the number of persistent open connections between it and
> PGPool.  The same for Apache, we need to lower MaxClients but not so low
> that it turns away requests that could have been handled.
> 
> 
> There are a few other adjustments that I believe will help that I'll
> describe below.
> 
> Apache MaxClients:
> This is how many concurrent Apache processes can run at once.  The current
> setting of 256 is clearly more than the system can handle.  I suggest we
> drop it down to 128 to begin with and monitor the results.  I'd like to
> make this change before the others.
> 
> Apache PHP DB connection timeout:
> I can see that it's waiting as long as 150s before returning with 'Could
> not connect to DB' at times, which indicates that no timeout is being
> specified.  This must be sent as part of the connection string, like:
> "pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;timeout=10".  I'm not
> sure at this point what a reasonable value would be, but I'm thinking 10
> seconds is a good start.
> 
> PGPool backends:
> We currently have 2 backends specified in the config.  One has
> backend_weight of 1 and the other, that is not used, has backedn_weight of
> 0.  I have confirmed that whenever a client connects to PGPool and requests
> a connection to a database, for example, PGPool opens a persistent
> connection to both backends.  We will comment out the backend that
> specifies the backup server, which should help PGPool a lot.
> 
> 
> PGPool max_init_children:
> This is the config parameter that specifies how many PGPool processes can
> run, and therefore how many sockets are available to Apache.  Increasing
> this number by one increases the number of persistent connections to the DB
> by max_pool, currently 8.  Postgres is currently configured to only allow
> 300 connections maximum, so that would need to be changed as well.  More
> research and testing is needed to find the sweet spot.
> 
> PGPool max_pool:
> This parameter specifies how many different DBs each PGPool process keeps
> in its cache of persistent connections to Postgres.  It is currently set to
> 8, yet we have more than 8 different databases in production (I see 12
> connected right now).  If a connection to a database is requested of PGPool
> by Apache, and the PGPool process servicing Apache's request does not have
> a connection to that database, it will drop one and use the slot to make a
> new connection to the requested DB on Postgres.  If max_pool was set to 12,
> this would stop happening and there would always be a persistent connection
> to the db requested ready to go when requested by apache.  Postgres would
> ideally get no new db connections.  Increasing from 8 to 12 would mean that
> total connections to Postgres would be 32*12 = 384, which is above
> Postgres's connection limit.  So this parameter, max_init_children, and
> Postgres's connection limit must all be tuned to eachother, and kept low
> enough to not overwhelm Postgres.
> 
> 
> I suggest that we begin by commenting out the second backend in
> pgpool.conf, and lowering MaxClients on Apache to 128.  This should prevent
> PGPool being hammered past the point that it can handle.  If PGPool does
> fall behind, only 128 Apache connections will be hitting PGPool and it
> seems to be able to handle that many in an orderly fashion.
> 
> I also think adding a PHP connection timeout will help keep the system from
> grinding to a stop.
> 
> 
> Thank you for reading and any help or insight you can provide!
> 
> Justin Cooper
    
    
More information about the pgpool-general
mailing list