[pgpool-general: 2472] Re: PGPool tuning

Tatsuo Ishii ishii at postgresql.org
Thu Jan 23 10:21:33 JST 2014


Justin,

Thank you for the follow up.

It would be nice if we could avoid something like "ping" client to
fill up all available sockets of pgpool-II. Any idea anyone?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Just to follow up on this, it turns out it was a problem in our
> application...
> 
> We had a long running job that was tying up the user's session, paired with
> a second browser window that was making a "ping" type call every 10
> seconds.  These pings were stacking up and after 300s, they would tie up
> all of PGPool's available sockets to apache.
> 
> 
> Thanks for the help.  I did a lot of testing with PGPool and thought it was
> the source of the problem, but the more I tested the more I became
> convinced that PGPool is actually working great!
> 
> 
> Cheers,
> Justin
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Thu, Jan 16, 2014 at 10:24 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> When the lock up happens, what "select * from pg_stat_activity"
>> and "select * from pg_locks" show?
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Thank you, Tatsuo.
>> >
>> > We are still experiencing the problem once or twice per day.  I am making
>> > incremental changes on our live cluster after testing them on the test
>> > cluster.  So far we have done the following:
>> >
>> > -Comment out unused 2nd backend in pgpool.conf
>> > -Add a connect_timeout of 10 seconds to the pg_connect() connection
>> string
>> > in the PHP application
>> > -set sysctl net.core.somaxconn = 1024
>> >
>> > We just did the last step today so we will see if there is any impact.
>> >
>> > When the fault happens, there is work being done in the database, yet
>> > "select * from pg_stat_activity;" shows only a few running queries at the
>> > time.  To me, this says that Apache+PHP still has the connection open to
>> > pgpool.
>> >
>> > I'll be sure to post back if we figure it out!
>> >
>> > Justin
>> >
>> >
>> >
>> >
>> >
>> >
>> >
>> > On Mon, Jan 13, 2014 at 7:55 PM, Tatsuo Ishii <ishii at postgresql.org>
>> wrote:
>> >
>> >> Thanks for posting detailed analythis. It looks really interesting.
>> >> I need more time to understanding full details.
>> >>
>> >> In the mean time I wonder if you care about listen queue
>> >> setting. Currently pgpool listens up to num_init_children*2 (which 64,
>> >> in your case). However Apache connects to pgpool up to 256, which is
>> >> way too low compared with 64. Also Linux allows max the listen queue
>> >> to up 128 by default on most systems. You can check it by looking at:
>> >>
>> >> $ sysctl net.core.somaxconn
>> >> net.core.somaxconn = 128
>> >>
>> >> 128 is too low compared with 256, of course.
>> >>
>> >> If the allowed listen queue length (backlog) is too low, lots of retry
>> >> happens in kernel's TCP layer.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS, Inc. Japan
>> >> English: http://www.sraoss.co.jp/index_en.php
>> >> Japanese: http://www.sraoss.co.jp
>> >>
>> >> > Greetings!
>> >> >
>> >> > We are having an issue with PGPool and I wanted to post my analysis to
>> >> this
>> >> > list to see if: A). My analysis seems correct to you all and B). To
>> see
>> >> if
>> >> > you folks might have any advice on tuning.
>> >> >
>> >> >
>> >> > For the last month plus, we have been experiencing an intermittent
>> fault
>> >> > state on our production cluster.  When the fault occurs, any request
>> to
>> >> the
>> >> > Apache+PHP web server will either time out connecting, or will connect
>> >> but
>> >> > return with a "Could not connect to DB" message from PHP.  I've done
>> some
>> >> > analysis on the problem and this is what I've found.
>> >> >
>> >> > First let me describe the cluster as it is configured today.  We have
>> one
>> >> > web front end running Apache+PHP, which has a MaxClients setting of
>> 256,
>> >> > meaning that it's possible to have 256 concurrently running processes.
>> >>  The
>> >> > PHP application is configured to connect to PGPool 3.2.1 for its
>> database
>> >> > connection.  PGPool is configured with max_init_children of 32 and
>> >> max_pool
>> >> > of 8.  The application runs on 10-12 different databases, all with the
>> >> same
>> >> > Postgres username+password.
>> >> >
>> >> > When the fault occurs, it looks like this: Apache has 256 running
>> >> processes
>> >> > and load on the web front end drops to near 0.  PGPool has all 32
>> sockets
>> >> > that face Apache filled, and all 256 sockets that face Postgres
>> filled.
>> >> >  Postgres has 256 connections and its load goes to near 0.  If you
>> try to
>> >> > connect to PGPool from the command line, it will time out in
>> connecting,
>> >> or
>> >> > sometimes partially connect and then receive a connection closed
>> message.
>> >> >
>> >> > Using our test cluster, I ran some tests that give me high confidence
>> >> that
>> >> > PGPool is actually working correctly, as are Apache and Postgres, and
>> >> that
>> >> > the fundamental problem is just a badly tuned configuration.  This is
>> the
>> >> > test that shows that best:
>> >> >
>> >> >
>> >> >    1. Stop Apache, restart PGPool
>> >> >    2. Start up 100 psql command line clients to connect to PGPool
>> with a
>> >> >    single database
>> >> >    3. The first 32 psql clients connect and work fine
>> >> >    4. The 33rd psql client blocks waiting to connect (it will time out
>> >> >    after 30 seconds, but in this test we don't wait that long)
>> >> >    5. fg the psql client #1, then exit the client, freeing up one of
>> >> >    PGPool's connections
>> >> >    6. One of the 68 blocking psql clients now gets through and can run
>> >> >    queries
>> >> >    7. Any of the 32 connected psql clients can get through as well
>> >> >
>> >> > This shows that PGPool is working as expected.
>> >> >
>> >> > Now we try a test that is more like the real world:
>> >> >
>> >> >    1. Restart PGPool
>> >> >    2. Start up 10-20 psql command line clients.  These are simulating
>> >> long
>> >> >    running php processes.
>> >> >    3. Start siege web testing tool with 100-200 concurrent requests to
>> >> >    Apache.
>> >> >    4. At 100 clients, the response time from Apache slows down and the
>> >> time
>> >> >    taken to service each request goes up to around 15s (from < 1s).
>>  Psql
>> >> >    command line client can get through most of the time, but it takes
>> >> some
>> >> >    time to connect as it is contending for one of the 32 slots to
>> PGPool
>> >> with
>> >> >    all of the Apache processes.
>> >> >    5. At 200 clients, response time goes up more and we start to see
>> >> >    failures in Apache, as well as "Could not connect to DB" responses.
>> >>  Psql
>> >> >    command line client often will timeout before it gets a connection
>> to
>> >> >    PGPool.
>> >> >    6. Once lots of failures are happening at the 200 clients level,
>> load
>> >> on
>> >> >    Postgres goes to near 0 as well as load on Apache.
>> >> >    7. Failure will also happen with 250 siege clients and no psql
>> command
>> >> >    line clients running.
>> >> >
>> >> >
>> >> > In step 4, I believe the response time from Apache goes up due to
>> PGPool
>> >> > having to spend so much time managing incoming connections from
>> Apache as
>> >> > well as managing connections to Postgres.  Database load is not high
>> in
>> >> > this case, so the slowness is not due to Postgres being overloaded.
>> >> >
>> >> > I believe that on the live cluster the load is even more severe as
>> there
>> >> > are more databases being used, and occasionally high load, long
>> running
>> >> > queries.
>> >> >
>> >> > It's also notable that restarting Apache has been our fix to get
>> >> everything
>> >> > running again.  I believe that this is because PGPool gets a chance to
>> >> > catch up, which it does fairly quickly, and resumes with 32 available
>> >> > sockets for Apache.  If we do nothing, PGPool reaches a 10 minute
>> timeout
>> >> > specified in its config, and closes all 32 sockets, which causes
>> >> everything
>> >> > to resume working again.
>> >> >
>> >> >
>> >> > In the end, I believe the problem is that Apache is just sending too
>> many
>> >> > requests to PGPool, and PGPool spends all of its time managing
>> >> connections,
>> >> > causing it to be slow at everything.  That slowness and contention
>> for 32
>> >> > slots among up to 256 Apache processes leads to connection timeouts
>> (it
>> >> > should be noted that Apache seems to have no connect timeout defined
>> and
>> >> > will wait for a connection until the PHP max execution time is
>> reached).
>> >> >  Once a threshold is reached, we enter a state where no Apache
>> process is
>> >> > able to connect to PGPool in enough time and we see the browser
>> requests
>> >> > either timing out entirely or returning the "Could not connect to DB"
>> >> > message.
>> >> >
>> >> >
>> >> > The proposed solution to all of this is to adjust the configuration of
>> >> > PGPool and Apache to ensure that we can never reach this overwhelmed
>> >> state.
>> >> >  Specifically, we need to increase the number of PGPool processes and
>> >> > decrease the maximum number of Apache processes.  We need to be
>> careful
>> >> as
>> >> > we do this, as there is surely an upper limit to how many PGPool
>> >> processes
>> >> > can be sustained and increasing that increases overhead on Postgres
>> since
>> >> > it increases the number of persistent open connections between it and
>> >> > PGPool.  The same for Apache, we need to lower MaxClients but not so
>> low
>> >> > that it turns away requests that could have been handled.
>> >> >
>> >> >
>> >> > There are a few other adjustments that I believe will help that I'll
>> >> > describe below.
>> >> >
>> >> > Apache MaxClients:
>> >> > This is how many concurrent Apache processes can run at once.  The
>> >> current
>> >> > setting of 256 is clearly more than the system can handle.  I suggest
>> we
>> >> > drop it down to 128 to begin with and monitor the results.  I'd like
>> to
>> >> > make this change before the others.
>> >> >
>> >> > Apache PHP DB connection timeout:
>> >> > I can see that it's waiting as long as 150s before returning with
>> 'Could
>> >> > not connect to DB' at times, which indicates that no timeout is being
>> >> > specified.  This must be sent as part of the connection string, like:
>> >> > "pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;timeout=10".  I'm
>> not
>> >> > sure at this point what a reasonable value would be, but I'm thinking
>> 10
>> >> > seconds is a good start.
>> >> >
>> >> > PGPool backends:
>> >> > We currently have 2 backends specified in the config.  One has
>> >> > backend_weight of 1 and the other, that is not used, has
>> backedn_weight
>> >> of
>> >> > 0.  I have confirmed that whenever a client connects to PGPool and
>> >> requests
>> >> > a connection to a database, for example, PGPool opens a persistent
>> >> > connection to both backends.  We will comment out the backend that
>> >> > specifies the backup server, which should help PGPool a lot.
>> >> >
>> >> >
>> >> > PGPool max_init_children:
>> >> > This is the config parameter that specifies how many PGPool processes
>> can
>> >> > run, and therefore how many sockets are available to Apache.
>>  Increasing
>> >> > this number by one increases the number of persistent connections to
>> the
>> >> DB
>> >> > by max_pool, currently 8.  Postgres is currently configured to only
>> allow
>> >> > 300 connections maximum, so that would need to be changed as well.
>>  More
>> >> > research and testing is needed to find the sweet spot.
>> >> >
>> >> > PGPool max_pool:
>> >> > This parameter specifies how many different DBs each PGPool process
>> keeps
>> >> > in its cache of persistent connections to Postgres.  It is currently
>> set
>> >> to
>> >> > 8, yet we have more than 8 different databases in production (I see 12
>> >> > connected right now).  If a connection to a database is requested of
>> >> PGPool
>> >> > by Apache, and the PGPool process servicing Apache's request does not
>> >> have
>> >> > a connection to that database, it will drop one and use the slot to
>> make
>> >> a
>> >> > new connection to the requested DB on Postgres.  If max_pool was set
>> to
>> >> 12,
>> >> > this would stop happening and there would always be a persistent
>> >> connection
>> >> > to the db requested ready to go when requested by apache.  Postgres
>> would
>> >> > ideally get no new db connections.  Increasing from 8 to 12 would mean
>> >> that
>> >> > total connections to Postgres would be 32*12 = 384, which is above
>> >> > Postgres's connection limit.  So this parameter, max_init_children,
>> and
>> >> > Postgres's connection limit must all be tuned to eachother, and kept
>> low
>> >> > enough to not overwhelm Postgres.
>> >> >
>> >> >
>> >> > I suggest that we begin by commenting out the second backend in
>> >> > pgpool.conf, and lowering MaxClients on Apache to 128.  This should
>> >> prevent
>> >> > PGPool being hammered past the point that it can handle.  If PGPool
>> does
>> >> > fall behind, only 128 Apache connections will be hitting PGPool and it
>> >> > seems to be able to handle that many in an orderly fashion.
>> >> >
>> >> > I also think adding a PHP connection timeout will help keep the system
>> >> from
>> >> > grinding to a stop.
>> >> >
>> >> >
>> >> > Thank you for reading and any help or insight you can provide!
>> >> >
>> >> > Justin Cooper
>> >>
>>


More information about the pgpool-general mailing list