[Pgpool-general] Current problems in pgpool2.1b2+segfault patch

Thu Jun 5 06:51:33 UTC 2008

Nico -telmich- Schottelius wrote:
> Hello Simone,
> 
> Simone Tregnago [Wed, Jun 04, 2008 at 09:31:12AM +0200]:
> 
> Sorry, should have told those not familiar with 2.1b2:
> 
> - it requires -d (DEBUG) to get any log output. But I am not interested
>   in debug, I want only LOG
> - it requires -n (no fork) to keep on logging after startup; normally
>   in setups with real init systems and supervisors I am a big fan of
>   this. But under traditional Linux systems it's a mess, if you have to
>   run a process in foreground.

I don't understand what the point is. Do you need debug-level logging?
Do you need better logging than the standard one provides?
Compiling pgpool without debug produces a version that does standard
logging to syslog.

> I rsync'd it just 4 hours before, including the final PITR.
> 
> 
> That's the problem: Those limitations break with existing (read in:
> proprietary, closed source) applications.
> Perhaps this is even a design problem of pgpool, as it would have to be
> intelligent to find out which queries do modify data and which don't.

Of course, there are limits with pgpool.

> The idea is nice, what I am somehow missing is a clean disconnect
> process like:
> 
> - begin PITR
> - do not allow new connections
> - begin to close idle connections
> - wait $hard_timeout for the running queries to terminate
> - kill all still running queries
> - sync database
> - accept connections again
> 
> 
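
The drain sequence proposed above could be sketched roughly as follows.
This is only a simulation of the decision logic, not pgpool code: Conn,
drain and hard_timeout are made-up names for illustration, and the PITR,
"refuse new connections" and sync steps are left out.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    name: str
    idle: bool
    remaining: float  # hypothetical: seconds the running query still needs


def drain(conns, hard_timeout):
    """Classify connections the way the quoted steps would treat them.

    Returns three lists of names: idle connections closed right away,
    busy connections whose query finishes within hard_timeout, and
    busy connections that would have to be killed.
    """
    closed_idle = [c.name for c in conns if c.idle]
    busy = [c for c in conns if not c.idle]
    finished = [c.name for c in busy if c.remaining <= hard_timeout]
    killed = [c.name for c in busy if c.remaining > hard_timeout]
    return closed_idle, finished, killed
```

For example, drain([Conn("a", True, 0), Conn("b", False, 2),
Conn("c", False, 60)], hard_timeout=10) closes "a", waits for "b" and
kills "c".
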
> So if it is, can you tell me the correct values for num_init_children
> and max_pool for the following situation:
> 
> - 2 postgres backends, both accepting up to 400 connections
> - 3 webservers (one connect per site access, almost always the same
>   connection parameters) and one streaming server (one permanent connection)
> 
> So I would set it to:
> 
> num_init_children = 800
> max_pool = 1
> 
> The website says (http://pgpool.projects.postgresql.org/):
> "If you want to ensure that queries can be cancelled, set this value to
> twice the expected connections"
> 
> This would make
> 
> num_init_children = 1600
> max_pool = 1
> 
> Now comes the first problem: What happens, if the three webservers open
> 801 connections?
> 
> The next problem: What happens if one database server is disconnected
> (which is quite often the case here)?
> 
> Then 800 or 1600 are 400 or 800 (depends on whether you count it twice
> or not, see above) too much.
> 

Increase the values until they meet your web site's requirements. If
the number of concurrent connections keeps growing, you will have to
size your system to support the load. This includes hardware, software,
and configuration.
Personally I use num_init_children = 1000 and max_pool = 4, with
postgresql max_connections = 1500. This supports my maximum of 1500
connections. If tomorrow the number of my clients increases, well, I
will increase the parameters.
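
A rough sketch of the sizing arithmetic, for illustration only. It
assumes the rule of thumb from the pgpool documentation that, in the
worst case, every child fills every pool slot with a connection to each
backend, and that the product is doubled if query cancelling must always
work. The function names and the "reserved" parameter are made up here.

```python
def backend_connections_needed(num_init_children, max_pool, allow_cancel=False):
    """Worst-case number of connections pgpool may hold to ONE backend."""
    factor = 2 if allow_cancel else 1  # assumption: cancelling doubles the need
    return num_init_children * max_pool * factor


def fits(num_init_children, max_pool, max_connections,
         reserved=0, allow_cancel=False):
    """True if the worst case stays within the backend's connection limit.

    reserved stands for connections the backend keeps back for
    superusers (hypothetical parameter, default 0).
    """
    needed = backend_connections_needed(num_init_children, max_pool, allow_cancel)
    return needed <= max_connections - reserved
```

Under these assumptions the limit applies to each backend separately,
not to the sum of both: fits(800, 1, 400) is False, fits(400, 1, 400)
is True, and with cancelling enabled fits(200, 1, 400,
allow_cancel=True) is True.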

> 
> Why is the lack of reuse related to a configuration error?
> 
> If the following happens:
> 
>  webserver1 [user=test,db=test] -> pgpool
>  pgpool_child_1 -> backend (persistent)
>  webserver1 closes the connection after some seconds.
> 
> And now the following happens:
>  webserver1 [user=test,db=test] -> pgpool
>  pgpool_child_2 -> backend (persistent)
>  webserver1 closes the connection after some seconds.
> 
> Why does pgpool not reuse the first child with the first connection?
> 

I think that PgPool has the right to do what it wants with the
connection pool. Maybe tomorrow the pooling/load balancing
implementation will change and PgPool will reuse connections in a
different way. But that's not something the user has to worry about.
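
A toy model of the behaviour described in the question, as a sketch
only. It assumes each child keeps its own private cache of up to
max_pool backend connections keyed by (user, database), and that a new
client may be served by any idle child. Child and serve are made-up
names, not pgpool internals.

```python
class Child:
    """One pre-forked pool process with its own connection cache."""

    def __init__(self, max_pool):
        self.max_pool = max_pool
        self.pool = {}  # (user, db) -> placeholder for a backend connection

    def serve(self, user, db):
        """Return "reused" on a cache hit, otherwise "new"."""
        key = (user, db)
        if key in self.pool:
            return "reused"
        if len(self.pool) >= self.max_pool:
            # Cache full: drop the oldest entry (dicts keep insertion order).
            self.pool.pop(next(iter(self.pool)))
        self.pool[key] = object()
        return "new"
```

With children = [Child(1), Child(1)], the first request for
("test", "test") on children[0] gives "new", the same request landing on
children[1] also gives "new", and only a later request that happens to
reach children[0] again gives "reused".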

> That's clear and that's why I give feedback to the devs to incorporate
> into future work.
> 
> 
> There was a reason why I couldn't use it, but I'm not sure which feature
> prevented me from using it.
> 
> I think maybe I chose the wrong tool for the right job, as I wanted
> more something that does not need query rewriting or allow data
> inconsistency. I know that this is a non-trivial requirement, especially
> when trying to do load balancing.
> 
> I also had a look at Slony-I, but as it replicates asynchronously it
> is not an option for me.

The purpose of Slony is completely different. If you want data-based
replication instead of query-based replication, you could try pgcluster
or cybercluster (a more commercial implementation).

Regards,
Simone Tregnago