[pgpool-general: 2152] Re: pgpool latency, hesitation under load

Andre Felipe Machado andremachado at techforce.com.br
Mon Sep 23 22:56:05 JST 2013


Hello,
Thanks for your message.
Yes, we were cascading pgpool-II.
At first, we tried only one, but when connections exceeded around 10% of the
available pool, latency for new connections increased to up to 30 seconds.
By cascading the pooling, the maximum latency was reduced to around 3 seconds.
But even 3 seconds was too much trouble at peak times.
The web application imposes around 2,000 transactions per second, and even
more at peak times.
From my limited knowledge of pgpool-II, it seems that each forked process has its
own queue. Therefore, a given connection could be assigned by the kernel to a long
queue composed of long-running queries, while at the same moment a neighbouring
process holding only short queries has its queue flushed.
As we observed many queries running at 0.5 seconds, it takes only 6 queries
queued at the same process to add up to 3 seconds.
Our team gave pgbouncer a try: as far as I know it runs as a single event-driven
process, so the kernel hands every TCP connection to the same process and the
clients share one queue instead of being spread across forked children.
The results were astounding for pooling, with clear visibility of the queues via
pgbouncer's admin queries. Latencies were reduced to around 0.5 seconds most of
the time; at some peak hours one or another frontend rarely reaches 2 seconds.
(We have graphs of this difference, but this list may not allow attachments.)
It seems that the kernel logic for assigning TCP connections to per-process
queues is causing this behaviour.
The monitoring queries in pgbouncer show the waiting connections clearly at peak
hours, and they are flushed very fast. I guess this is because of
the single queue.
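For reference, the pgbouncer visibility mentioned above comes from its admin
console; a minimal sketch of the checks we run there (the exact column set
depends on the pgbouncer version):

  -- psql -p 6432 -U pgbouncer pgbouncer   (admin console)
  SHOW POOLS;   -- per pool: cl_active / cl_waiting clients and maxwait (seconds)
  SHOW STATS;   -- aggregate request and timing counters per database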
A feature our application uses heavily is reserved connections for slow queries.
Is there a way to configure pgpool to behave similarly, and to show waiting
connections per process?
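What we can see from pgpool today, as far as I can tell, is the per-child
connection info but not the clients queued behind each child; a sketch, assuming
the SHOW commands of pgpool-II 3.x (pcp_proc_info from the shell reports similar
data):

  -- run in a psql session connected through pgpool
  SHOW pool_processes;  -- one row per pgpool child with the connection it holds
  SHOW pool_pools;      -- one row per pool slot in each child process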

We use pg_activity, Zabbix, the slow query log threshold, and queries against the
pg_stat views to monitor the backend. This way we see that the application issues
thousands of short queries, plus many queries between 0.1 and 0.5 seconds, all
the time. Sometimes it issues very long queries of around 30 seconds.
The server has enough CPU for the present load (after much tuning and
indexing), as it is now running at around 12%.
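As an example of the kind of probe we run against the backend (PostgreSQL 9.2,
so pg_stat_activity already exposes the state and query columns), a sketch of the
slow-query check; the 0.5 s threshold is just the value we care about here:

  -- sessions whose current query has been running longer than 0.5 s, longest first
  SELECT pid,
         now() - query_start AS duration,
         state,
         left(query, 80)     AS query
  FROM   pg_stat_activity
  WHERE  state <> 'idle'
    AND  now() - query_start > interval '500 milliseconds'
  ORDER  BY duration DESC;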

It has been some weeks and a few app versions since we blacklisted some tables,
but as far as I can remember, some query cache data was invalidated during the
execution of some queries, before another portion of the query had executed. The
queries were complex and their execution plans were not simple, so it was
reasonable that pgpool, which does not have the execution plan, did not process
the query the same way the backend does.
Since then, as our cache hit ratio had dropped to 0.55, we disabled the query
cache (as stated in the docs when it is below 0.7), and the penalty on the
backend, given our present environment (database running in RAM), was negligible.
I have one such query recorded, along with its present execution plan, and may
attach the txt file if the list allows it.
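For reference, the 0.55 hit ratio above was read from pgpool's on-memory query
cache statistics; a minimal sketch, assuming the SHOW pool_cache command that
comes with the shared-memory cache:

  -- run in a psql session connected through pgpool
  SHOW pool_cache;   -- reports num_cache_hits, num_selects and cache_hit_ratio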

Regards.
Andre Felipe
http://www.techforce.com.br


 


On 18/Sep/2013 20:09 Tatsuo Ishii <ishii at postgresql.org> wrote ..
> I am a little bit confused.
> 
> You said:
> > Using pgpool 3.3.0 as connection pooling (1 backend) with shared memory query
> and:
> > The pgpool at each web frontend does not have query cache enabled.
> 
> So you are "cascading" pgpool-II?
> 
> Also I am confused by this:
> > Most queries are below 0.5 seconds.
> and this:
> > answers psql requests under 100 ms, at about 12% CPU load (Zabbix monitoring).
> 
> In the mean time I'm interested in this:
> > complex queries, because postgresql used a non literal execution and query cache
> > caused errors.
> 
> What kind of errors exactly do you have?
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
> 
> > Hello,
> > Using pgpool 3.3.0 as connection pooling (1 backend) with shared memory query
> > cache, and connection pooling only, we are trying to spot a misconfiguration
> > that is causing a hesitation, latency, staggering under load.
> > During these hesitation periods, DBMS transactions per second drop.
> > If we connect the web servers directly to the DBMS backend (too many processes open),
> > %sys CPU is obviously too high, but there are no staggering latencies.
> > Sometimes there is a delay of up to 2.5 seconds to get a new connection for the web app.
> > Also, we had to use 2-level pooling (one level at each web server, the other in
> > front of the DBMS), because when connections exceeded about 10% of num_init_children,
> > it hesitated for about 30 seconds.
> > num_init_children = 160
> > max_pool = 1
> > When leaving child_max_connections at 0, the problem is reduced, but still 
> > happens.
> > connection_life_time = 360
> > client_idle_limit = 360
> > The backend is executing about 2000 transactions per second, around 300 
> > connections per second.
> > Most queries are below 0.5 seconds.
> > Query cache hit at same machine is 0.55. We had to blacklist some tables with
> > complex queries, because postgresql used a non literal execution and query cache
> > caused errors.
> > The pgpool at each web frontend does not have query cache enabled.
> > The PostgreSQL 9.2.4 backend on Debian Linux 7.1, kernel 3.2, 40 cores / 80
> > threads, 512 GB RAM, RAID controller with 1 GB NVRAM, is running the db in RAM and always
> > answers psql requests under 100 ms, at about 12% CPU load (Zabbix monitoring).
> > max_connections = 256
> > 
> > How could we track down the misconfiguration that causes this pgpool latency staggering?
> > 
> > Regards.
> > Andre Felipe Machado
> > http://www.techforce.com.br

