[pgpool-general: 2443] Re: PGPool tuning
ishii at postgresql.org
Fri Jan 17 13:24:02 JST 2014
When the lock up happens, what "select * from pg_stat_activity"
and "select * from pg_locks" show?
SRA OSS, Inc. Japan
> Thank you, Tatsuo.
> We are still experiencing the problem once or twice per day. I am making
> incremental changes on our live cluster after testing them on the test
> cluster. So far we have done the following:
> -Comment out unused 2nd backend in pgpool.conf
> -Add a connect_timeout of 10 seconds to the pg_connect() connection string
> in the PHP application
> -set sysctl net.core.somaxconn = 1024
> We just did the last step today so we will see if there is any impact.
> When the fault happens, there is work being done in the database, yet
> "select * from pg_stat_activity;" shows only a few running queries at the
> time. To me, this says that Apache+PHP still has the connection open to
> I'll be sure to post back if we figure it out!
> On Mon, Jan 13, 2014 at 7:55 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> Thanks for posting detailed analythis. It looks really interesting.
>> I need more time to understanding full details.
>> In the mean time I wonder if you care about listen queue
>> setting. Currently pgpool listens up to num_init_children*2 (which 64,
>> in your case). However Apache connects to pgpool up to 256, which is
>> way too low compared with 64. Also Linux allows max the listen queue
>> to up 128 by default on most systems. You can check it by looking at:
>> $ sysctl net.core.somaxconn
>> net.core.somaxconn = 128
>> 128 is too low compared with 256, of course.
>> If the allowed listen queue length (backlog) is too low, lots of retry
>> happens in kernel's TCP layer.
>> Best regards,
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>> > Greetings!
>> > We are having an issue with PGPool and I wanted to post my analysis to
>> > list to see if: A). My analysis seems correct to you all and B). To see
>> > you folks might have any advice on tuning.
>> > For the last month plus, we have been experiencing an intermittent fault
>> > state on our production cluster. When the fault occurs, any request to
>> > Apache+PHP web server will either time out connecting, or will connect
>> > return with a "Could not connect to DB" message from PHP. I've done some
>> > analysis on the problem and this is what I've found.
>> > First let me describe the cluster as it is configured today. We have one
>> > web front end running Apache+PHP, which has a MaxClients setting of 256,
>> > meaning that it's possible to have 256 concurrently running processes.
>> > PHP application is configured to connect to PGPool 3.2.1 for its database
>> > connection. PGPool is configured with max_init_children of 32 and
>> > of 8. The application runs on 10-12 different databases, all with the
>> > Postgres username+password.
>> > When the fault occurs, it looks like this: Apache has 256 running
>> > and load on the web front end drops to near 0. PGPool has all 32 sockets
>> > that face Apache filled, and all 256 sockets that face Postgres filled.
>> > Postgres has 256 connections and its load goes to near 0. If you try to
>> > connect to PGPool from the command line, it will time out in connecting,
>> > sometimes partially connect and then receive a connection closed message.
>> > Using our test cluster, I ran some tests that give me high confidence
>> > PGPool is actually working correctly, as are Apache and Postgres, and
>> > the fundamental problem is just a badly tuned configuration. This is the
>> > test that shows that best:
>> > 1. Stop Apache, restart PGPool
>> > 2. Start up 100 psql command line clients to connect to PGPool with a
>> > single database
>> > 3. The first 32 psql clients connect and work fine
>> > 4. The 33rd psql client blocks waiting to connect (it will time out
>> > after 30 seconds, but in this test we don't wait that long)
>> > 5. fg the psql client #1, then exit the client, freeing up one of
>> > PGPool's connections
>> > 6. One of the 68 blocking psql clients now gets through and can run
>> > queries
>> > 7. Any of the 32 connected psql clients can get through as well
>> > This shows that PGPool is working as expected.
>> > Now we try a test that is more like the real world:
>> > 1. Restart PGPool
>> > 2. Start up 10-20 psql command line clients. These are simulating
>> > running php processes.
>> > 3. Start siege web testing tool with 100-200 concurrent requests to
>> > Apache.
>> > 4. At 100 clients, the response time from Apache slows down and the
>> > taken to service each request goes up to around 15s (from < 1s). Psql
>> > command line client can get through most of the time, but it takes
>> > time to connect as it is contending for one of the 32 slots to PGPool
>> > all of the Apache processes.
>> > 5. At 200 clients, response time goes up more and we start to see
>> > failures in Apache, as well as "Could not connect to DB" responses.
>> > command line client often will timeout before it gets a connection to
>> > PGPool.
>> > 6. Once lots of failures are happening at the 200 clients level, load
>> > Postgres goes to near 0 as well as load on Apache.
>> > 7. Failure will also happen with 250 siege clients and no psql command
>> > line clients running.
>> > In step 4, I believe the response time from Apache goes up due to PGPool
>> > having to spend so much time managing incoming connections from Apache as
>> > well as managing connections to Postgres. Database load is not high in
>> > this case, so the slowness is not due to Postgres being overloaded.
>> > I believe that on the live cluster the load is even more severe as there
>> > are more databases being used, and occasionally high load, long running
>> > queries.
>> > It's also notable that restarting Apache has been our fix to get
>> > running again. I believe that this is because PGPool gets a chance to
>> > catch up, which it does fairly quickly, and resumes with 32 available
>> > sockets for Apache. If we do nothing, PGPool reaches a 10 minute timeout
>> > specified in its config, and closes all 32 sockets, which causes
>> > to resume working again.
>> > In the end, I believe the problem is that Apache is just sending too many
>> > requests to PGPool, and PGPool spends all of its time managing
>> > causing it to be slow at everything. That slowness and contention for 32
>> > slots among up to 256 Apache processes leads to connection timeouts (it
>> > should be noted that Apache seems to have no connect timeout defined and
>> > will wait for a connection until the PHP max execution time is reached).
>> > Once a threshold is reached, we enter a state where no Apache process is
>> > able to connect to PGPool in enough time and we see the browser requests
>> > either timing out entirely or returning the "Could not connect to DB"
>> > message.
>> > The proposed solution to all of this is to adjust the configuration of
>> > PGPool and Apache to ensure that we can never reach this overwhelmed
>> > Specifically, we need to increase the number of PGPool processes and
>> > decrease the maximum number of Apache processes. We need to be careful
>> > we do this, as there is surely an upper limit to how many PGPool
>> > can be sustained and increasing that increases overhead on Postgres since
>> > it increases the number of persistent open connections between it and
>> > PGPool. The same for Apache, we need to lower MaxClients but not so low
>> > that it turns away requests that could have been handled.
>> > There are a few other adjustments that I believe will help that I'll
>> > describe below.
>> > Apache MaxClients:
>> > This is how many concurrent Apache processes can run at once. The
>> > setting of 256 is clearly more than the system can handle. I suggest we
>> > drop it down to 128 to begin with and monitor the results. I'd like to
>> > make this change before the others.
>> > Apache PHP DB connection timeout:
>> > I can see that it's waiting as long as 150s before returning with 'Could
>> > not connect to DB' at times, which indicates that no timeout is being
>> > specified. This must be sent as part of the connection string, like:
>> > "pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;timeout=10". I'm not
>> > sure at this point what a reasonable value would be, but I'm thinking 10
>> > seconds is a good start.
>> > PGPool backends:
>> > We currently have 2 backends specified in the config. One has
>> > backend_weight of 1 and the other, that is not used, has backedn_weight
>> > 0. I have confirmed that whenever a client connects to PGPool and
>> > a connection to a database, for example, PGPool opens a persistent
>> > connection to both backends. We will comment out the backend that
>> > specifies the backup server, which should help PGPool a lot.
>> > PGPool max_init_children:
>> > This is the config parameter that specifies how many PGPool processes can
>> > run, and therefore how many sockets are available to Apache. Increasing
>> > this number by one increases the number of persistent connections to the
>> > by max_pool, currently 8. Postgres is currently configured to only allow
>> > 300 connections maximum, so that would need to be changed as well. More
>> > research and testing is needed to find the sweet spot.
>> > PGPool max_pool:
>> > This parameter specifies how many different DBs each PGPool process keeps
>> > in its cache of persistent connections to Postgres. It is currently set
>> > 8, yet we have more than 8 different databases in production (I see 12
>> > connected right now). If a connection to a database is requested of
>> > by Apache, and the PGPool process servicing Apache's request does not
>> > a connection to that database, it will drop one and use the slot to make
>> > new connection to the requested DB on Postgres. If max_pool was set to
>> > this would stop happening and there would always be a persistent
>> > to the db requested ready to go when requested by apache. Postgres would
>> > ideally get no new db connections. Increasing from 8 to 12 would mean
>> > total connections to Postgres would be 32*12 = 384, which is above
>> > Postgres's connection limit. So this parameter, max_init_children, and
>> > Postgres's connection limit must all be tuned to eachother, and kept low
>> > enough to not overwhelm Postgres.
>> > I suggest that we begin by commenting out the second backend in
>> > pgpool.conf, and lowering MaxClients on Apache to 128. This should
>> > PGPool being hammered past the point that it can handle. If PGPool does
>> > fall behind, only 128 Apache connections will be hitting PGPool and it
>> > seems to be able to handle that many in an orderly fashion.
>> > I also think adding a PHP connection timeout will help keep the system
>> > grinding to a stop.
>> > Thank you for reading and any help or insight you can provide!
>> > Justin Cooper
More information about the pgpool-general