[pgpool-general: 2473] Re: PGPool tuning

Justin Cooper jcooper at vendwatchtelematics.com
Thu Jan 23 12:46:18 JST 2014


Oh, it's entirely our application framework's fault!

It shouldn't be waiting for a session to become free if it is also possible
for sessions to be running for 5 minutes...

Our fix is to make the "pinger" not require a session, since it is only
looking for a lockfile and doesn't need session data to do that.
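As an illustrative sketch (in shell rather than our PHP, with a hypothetical lockfile path and helper name), the sessionless check boils down to a plain existence test that never touches session state:

```shell
# Hypothetical helper: "is the job still running?" is just a lockfile
# existence test -- no session needs to be opened or locked to answer it.
job_running() {
    [ -e "$1" ]
}

lock=$(mktemp)                   # simulate a running job's lockfile
job_running "$lock" && echo "job still running"

rm -f "$lock"                    # job finishes and removes its lock
job_running "$lock" || echo "job finished"
```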


Thanks again for your helpful suggestions, Tatsuo!

Justin



On Wed, Jan 22, 2014 at 7:21 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> Justin,
>
> Thank you for the follow up.
>
> It would be nice if we could prevent something like a "ping" client
> from filling up all available sockets of pgpool-II. Any ideas, anyone?
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Just to follow up on this: it turns out it was a problem in our
> > application...
> >
> > We had a long running job that was tying up the user's session, paired
> > with a second browser window that was making a "ping" type call every 10
> > seconds.  These pings were stacking up, and after 300s they would tie up
> > all of PGPool's available sockets facing Apache.
> >
> >
> > Thanks for the help.  I did a lot of testing with PGPool and thought it
> > was the source of the problem, but the more I tested the more I became
> > convinced that PGPool is actually working great!
> >
> >
> > Cheers,
> > Justin
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > On Thu, Jan 16, 2014 at 10:24 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >
> >> When the lock up happens, what "select * from pg_stat_activity"
> >> and "select * from pg_locks" show?
> >>
> >> Best regards,
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese: http://www.sraoss.co.jp
> >>
> >> > Thank you, Tatsuo.
> >> >
> >> > We are still experiencing the problem once or twice per day.  I am
> >> > making incremental changes on our live cluster after testing them on
> >> > the test cluster.  So far we have done the following:
> >> >
> >> > - Comment out the unused 2nd backend in pgpool.conf
> >> > - Add a connect_timeout of 10 seconds to the pg_connect() connection
> >> >   string in the PHP application
> >> > - Set sysctl net.core.somaxconn = 1024
> >> >
> >> > We just did the last step today so we will see if there is any impact.
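For reference, a read-only sketch of checking the backlog cap on Linux (the 1024 value is the one from the steps above, not a universally recommended number):

```shell
# Read the kernel's cap on TCP listen backlogs (128 by default on most
# Linux systems); /proc mirrors the sysctl net.core.somaxconn value.
cat /proc/sys/net/core/somaxconn

# To raise it (as root):  sysctl -w net.core.somaxconn=1024
# To persist across reboots, add this line to /etc/sysctl.conf:
#   net.core.somaxconn = 1024
```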
> >> >
> >> > When the fault happens, there is work being done in the database, yet
> >> > "select * from pg_stat_activity;" shows only a few running queries at
> >> > the time.  To me, this says that Apache+PHP still has the connection
> >> > open to pgpool.
> >> >
> >> > I'll be sure to post back if we figure it out!
> >> >
> >> > Justin
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Jan 13, 2014 at 7:55 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> >> >
> >> >> Thanks for posting the detailed analysis. It looks really interesting.
> >> >> I need more time to understand the full details.
> >> >>
> >> >> In the meantime, I wonder if you have looked at the listen queue
> >> >> setting. Currently pgpool's listen backlog is num_init_children*2
> >> >> (64 in your case), yet Apache connects to pgpool with up to 256
> >> >> clients; 64 is way too low compared with 256. Also, Linux caps the
> >> >> listen queue at 128 by default on most systems. You can check it by
> >> >> looking at:
> >> >>
> >> >> $ sysctl net.core.somaxconn
> >> >> net.core.somaxconn = 128
> >> >>
> >> >> 128 is too low compared with 256, of course.
> >> >>
> >> >> If the allowed listen queue length (backlog) is too low, lots of
> >> >> retries happen in the kernel's TCP layer.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS, Inc. Japan
> >> >> English: http://www.sraoss.co.jp/index_en.php
> >> >> Japanese: http://www.sraoss.co.jp
> >> >>
> >> >> > Greetings!
> >> >> >
> >> >> > We are having an issue with PGPool and I wanted to post my analysis
> >> >> > to this list to see if: A) my analysis seems correct to you all, and
> >> >> > B) you folks might have any advice on tuning.
> >> >> >
> >> >> >
> >> >> > For the last month plus, we have been experiencing an intermittent
> >> >> > fault state on our production cluster.  When the fault occurs, any
> >> >> > request to the Apache+PHP web server will either time out
> >> >> > connecting, or will connect but return a "Could not connect to DB"
> >> >> > message from PHP.  I've done some analysis on the problem and this
> >> >> > is what I've found.
> >> >> >
> >> >> > First let me describe the cluster as it is configured today.  We
> >> >> > have one web front end running Apache+PHP, which has a MaxClients
> >> >> > setting of 256, meaning that it's possible to have 256 concurrently
> >> >> > running processes.  The PHP application is configured to connect to
> >> >> > PGPool 3.2.1 for its database connection.  PGPool is configured with
> >> >> > num_init_children of 32 and max_pool of 8.  The application runs on
> >> >> > 10-12 different databases, all with the same Postgres
> >> >> > username+password.
> >> >> >
> >> >> > When the fault occurs, it looks like this: Apache has 256 running
> >> >> > processes and load on the web front end drops to near 0.  PGPool has
> >> >> > all 32 sockets that face Apache filled, and all 256 sockets that
> >> >> > face Postgres filled.  Postgres has 256 connections and its load
> >> >> > goes to near 0.  If you try to connect to PGPool from the command
> >> >> > line, it will time out connecting, or sometimes partially connect
> >> >> > and then receive a connection closed message.
> >> >> >
> >> >> > Using our test cluster, I ran some tests that give me high
> >> >> > confidence that PGPool is actually working correctly, as are Apache
> >> >> > and Postgres, and that the fundamental problem is just a badly tuned
> >> >> > configuration.  This is the test that shows that best:
> >> >> >
> >> >> >
> >> >> >    1. Stop Apache, restart PGPool
> >> >> >    2. Start up 100 psql command line clients to connect to PGPool
> >> >> >       with a single database
> >> >> >    3. The first 32 psql clients connect and work fine
> >> >> >    4. The 33rd psql client blocks waiting to connect (it will time
> >> >> >       out after 30 seconds, but in this test we don't wait that long)
> >> >> >    5. fg psql client #1, then exit the client, freeing up one of
> >> >> >       PGPool's connections
> >> >> >    6. One of the 68 blocked psql clients now gets through and can
> >> >> >       run queries
> >> >> >    7. Any of the 32 connected psql clients can get through as well
> >> >> >
> >> >> > This shows that PGPool is working as expected.
> >> >> >
> >> >> > Now we try a test that is more like the real world:
> >> >> >
> >> >> >    1. Restart PGPool
> >> >> >    2. Start up 10-20 psql command line clients.  These simulate
> >> >> >       long running PHP processes.
> >> >> >    3. Start the siege web testing tool with 100-200 concurrent
> >> >> >       requests to Apache.
> >> >> >    4. At 100 clients, the response time from Apache slows down and
> >> >> >       the time taken to service each request goes up to around 15s
> >> >> >       (from < 1s).  A psql command line client can get through most
> >> >> >       of the time, but it takes some time to connect as it is
> >> >> >       contending for one of the 32 slots to PGPool with all of the
> >> >> >       Apache processes.
> >> >> >    5. At 200 clients, response time goes up more and we start to
> >> >> >       see failures in Apache, as well as "Could not connect to DB"
> >> >> >       responses.  A psql command line client will often time out
> >> >> >       before it gets a connection to PGPool.
> >> >> >    6. Once lots of failures are happening at the 200-client level,
> >> >> >       load on Postgres goes to near 0, as does load on Apache.
> >> >> >    7. Failure will also happen with 250 siege clients and no psql
> >> >> >       command line clients running.
> >> >> >
> >> >> >
> >> >> > In step 4, I believe the response time from Apache goes up due to
> >> >> > PGPool having to spend so much time managing incoming connections
> >> >> > from Apache as well as managing connections to Postgres.  Database
> >> >> > load is not high in this case, so the slowness is not due to
> >> >> > Postgres being overloaded.
> >> >> >
> >> >> > I believe that on the live cluster the load is even more severe, as
> >> >> > there are more databases being used, and occasionally high-load,
> >> >> > long-running queries.
> >> >> >
> >> >> > It's also notable that restarting Apache has been our fix to get
> >> >> > everything running again.  I believe that this is because PGPool
> >> >> > gets a chance to catch up, which it does fairly quickly, and resumes
> >> >> > with 32 available sockets for Apache.  If we do nothing, PGPool
> >> >> > reaches a 10 minute timeout specified in its config and closes all
> >> >> > 32 sockets, which causes everything to resume working again.
> >> >> >
> >> >> >
> >> >> > In the end, I believe the problem is that Apache is just sending
> >> >> > too many requests to PGPool, and PGPool spends all of its time
> >> >> > managing connections, causing it to be slow at everything.  That
> >> >> > slowness, and contention for 32 slots among up to 256 Apache
> >> >> > processes, leads to connection timeouts (it should be noted that
> >> >> > Apache seems to have no connect timeout defined and will wait for a
> >> >> > connection until the PHP max execution time is reached).  Once a
> >> >> > threshold is reached, we enter a state where no Apache process is
> >> >> > able to connect to PGPool in time, and we see the browser requests
> >> >> > either timing out entirely or returning the "Could not connect to
> >> >> > DB" message.
> >> >> >
> >> >> >
> >> >> > The proposed solution to all of this is to adjust the configuration
> >> >> > of PGPool and Apache to ensure that we can never reach this
> >> >> > overwhelmed state.  Specifically, we need to increase the number of
> >> >> > PGPool processes and decrease the maximum number of Apache
> >> >> > processes.  We need to be careful as we do this: there is surely an
> >> >> > upper limit to how many PGPool processes can be sustained, and
> >> >> > increasing that number increases overhead on Postgres, since it
> >> >> > increases the number of persistent open connections between Postgres
> >> >> > and PGPool.  The same goes for Apache: we need to lower MaxClients,
> >> >> > but not so low that it turns away requests that could have been
> >> >> > handled.
> >> >> >
> >> >> >
> >> >> > There are a few other adjustments that I believe will help, which
> >> >> > I'll describe below.
> >> >> >
> >> >> > Apache MaxClients:
> >> >> > This is how many concurrent Apache processes can run at once.  The
> >> >> > current setting of 256 is clearly more than the system can handle.
> >> >> > I suggest we drop it down to 128 to begin with and monitor the
> >> >> > results.  I'd like to make this change before the others.
> >> >> >
> >> >> > Apache PHP DB connection timeout:
> >> >> > I can see that it's waiting as long as 150s before returning with
> >> >> > 'Could not connect to DB' at times, which indicates that no timeout
> >> >> > is being specified.  This must be sent as part of the connection
> >> >> > string, like:
> >> >> > "pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;connect_timeout=10".
> >> >> > I'm not sure at this point what a reasonable value would be, but
> >> >> > I'm thinking 10 seconds is a good start.
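A sketch of where the timeout can live (the host/dbname values are the ones quoted above; `connect_timeout` is the libpq keyword, and `PGCONNECT_TIMEOUT` is the equivalent environment variable honored by any libpq client such as psql):

```shell
# libpq-style conninfo, as used by pg_connect() in PHP:
conninfo='host=127.0.0.1 port=5432 dbname=vw_bepensa connect_timeout=10'

# PDO_PGSQL-style DSN with the same keyword:
dsn='pgsql:host=127.0.0.1;port=5432;dbname=vw_bepensa;connect_timeout=10'

# For ad-hoc psql tests, the environment variable form avoids editing the
# connection string at all:
export PGCONNECT_TIMEOUT=10

echo "$conninfo"
echo "$dsn"
```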
> >> >> >
> >> >> > PGPool backends:
> >> >> > We currently have 2 backends specified in the config.  One has
> >> >> > backend_weight of 1, and the other, which is not used, has
> >> >> > backend_weight of 0.  I have confirmed that whenever a client
> >> >> > connects to PGPool and requests a connection to a database, PGPool
> >> >> > opens a persistent connection to both backends.  We will comment out
> >> >> > the backend that specifies the backup server, which should help
> >> >> > PGPool a lot.
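A sketch of what the change might look like in pgpool.conf (the hostnames are placeholders; the backendN parameter names are pgpool-II's):

```
# pgpool.conf -- keep only the active backend; the unused backup backend
# is commented out so no persistent connections are opened to it.
backend_hostname0 = 'db-master'      # placeholder hostname
backend_port0     = 5432
backend_weight0   = 1

#backend_hostname1 = 'db-backup'     # placeholder hostname
#backend_port1     = 5432
#backend_weight1   = 0
```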
> >> >> >
> >> >> >
> >> >> > PGPool num_init_children:
> >> >> > This is the config parameter that specifies how many PGPool
> >> >> > processes can run, and therefore how many sockets are available to
> >> >> > Apache.  Increasing this number by one increases the number of
> >> >> > persistent connections to the DB by max_pool, currently 8.  Postgres
> >> >> > is currently configured to allow only 300 connections maximum, so
> >> >> > that would need to be changed as well.  More research and testing is
> >> >> > needed to find the sweet spot.
> >> >> >
> >> >> > PGPool max_pool:
> >> >> > This parameter specifies how many different DBs each PGPool process
> >> >> > keeps in its cache of persistent connections to Postgres.  It is
> >> >> > currently set to 8, yet we have more than 8 different databases in
> >> >> > production (I see 12 connected right now).  If a connection to a
> >> >> > database is requested of PGPool by Apache, and the PGPool process
> >> >> > servicing Apache's request does not have a connection to that
> >> >> > database, it will drop one and use the slot to make a new connection
> >> >> > to the requested DB on Postgres.  If max_pool were set to 12, this
> >> >> > would stop happening, and there would always be a persistent
> >> >> > connection to the requested DB ready to go when Apache asks for it.
> >> >> > Postgres would ideally get no new DB connections.  Increasing from 8
> >> >> > to 12 would mean that total connections to Postgres would be 32*12 =
> >> >> > 384, which is above Postgres's connection limit.  So this parameter,
> >> >> > num_init_children, and Postgres's connection limit must all be tuned
> >> >> > to each other, and kept low enough to not overwhelm Postgres.
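The sizing constraint above can be checked with quick shell arithmetic (values are the ones discussed in the thread):

```shell
num_init_children=32   # pgpool processes, i.e. sockets facing Apache
max_pool=12            # cached backend connections per pgpool process
max_connections=300    # Postgres's configured connection limit

total=$((num_init_children * max_pool))
echo "pgpool -> Postgres connections: $total"

if [ "$total" -gt "$max_connections" ]; then
    echo "over Postgres max_connections by $((total - max_connections))"
fi
```

With these numbers it reports 384 total, 84 over the limit, matching the 32*12 = 384 figure above.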
> >> >> >
> >> >> >
> >> >> > I suggest that we begin by commenting out the second backend in
> >> >> > pgpool.conf and lowering MaxClients on Apache to 128.  This should
> >> >> > keep PGPool from being hammered past the point that it can handle.
> >> >> > If PGPool does fall behind, only 128 Apache connections will be
> >> >> > hitting it, and it seems to be able to handle that many in an
> >> >> > orderly fashion.
> >> >> >
> >> >> > I also think adding a PHP connection timeout will help keep the
> >> >> > system from grinding to a stop.
> >> >> >
> >> >> >
> >> >> > Thank you for reading and any help or insight you can provide!
> >> >> >
> >> >> > Justin Cooper
> >> >>
> >>
>