[Pgpool-general] pcp_child: pcp_read() failed. reason: Success

Steven Crandell steven.crandell at gmail.com
Tue Nov 10 15:47:47 UTC 2009


In general, I think the most desirable behavior would be for pgpool to refuse
additional connections once num_init_children connections have been
exhausted. Right now it appears that pgpool accepts arbitrarily large
numbers of connections and queues them for later processing (once resources
become available). The net effect is that connections over the
num_init_children limit appear to be black-holed, causing client
connections to hang.  Having pgpool refuse connections over
num_init_children outright would create a much saner situation that is
easier to monitor with existing tools and that helps Postgres-specific
applications throw useful error messages.
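
In the meantime, here is a minimal client-side sketch (assuming libpq; the
host, port, and credentials are placeholders from my setup) of how setting a
connect_timeout at least turns a black-holed connection into a visible
failure instead of an indefinite hang:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* connect_timeout bounds the whole connection attempt, so a connection
     * that pgpool accepts but never services errors out after ~5 seconds. */
    PGconn *conn = PQconnectdb(
        "host=10.xxx.xxx.xxx port=5432 dbname=postgres user=nobody "
        "connect_timeout=5");

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    PQfinish(conn);
    return 0;
}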

I'm guessing this patch will do something along these lines.
I will give this patch a try when I'm able.
Thank you very much.

-s

On Tue, Nov 10, 2009 at 2:15 AM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> After thinking about it a little more, I came across the idea that the
> kernel's listen queue might be exhausted by pgpool. Currently pgpool asks
> the kernel for a listen queue of length 10000, and maybe this is too much
> for the kernel. Included is a patch to fix this: rather than requesting
> 10000, it requests only num_init_children * 2, which is similar to the
> calculation done by PostgreSQL (actually max_connections * 2).
>
> Please try, if you like.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
>
> > I think this is normal. We can easily reproduce your problem.
> >
> > 1) set num_init_children = 1
> >
> > 2) connect to pgpool via psql
> >
> > 3) fire up more psql
> >
> > 4) the psql from (3) will appear "frozen" until the psql from (2) disconnects its session.
> >
> > This behavior is perfectly expected (a minimal libpq sketch of the same
> > reproduction follows this message).
> >
> > Is this what you meant?
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
> >
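
For what it's worth, here is a minimal sketch of that reproduction in C
(assuming libpq and num_init_children = 1; host and credentials are
placeholders): the first connection occupies the only child, so the second
PQconnectdb() call simply blocks until the first session disconnects.

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    const char *conninfo =
        "host=10.xxx.xxx.xxx port=5432 dbname=postgres user=nobody";

    PGconn *first = PQconnectdb(conninfo);   /* occupies the only child */
    PGconn *second;

    if (PQstatus(first) != CONNECTION_OK)
    {
        fprintf(stderr, "first connection failed: %s", PQerrorMessage(first));
        PQfinish(first);
        return 1;
    }

    printf("first connection established; attempting a second...\n");
    second = PQconnectdb(conninfo);          /* blocks until `first` closes */

    printf("second connection returned: %s\n",
           PQstatus(second) == CONNECTION_OK ? "OK" : PQerrorMessage(second));

    PQfinish(second);
    PQfinish(first);
    return 0;
}
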
> > > I've done extensive testing since my last message.
> > > The problem appears to be that I am getting more connections to pgpool
> > > than I have num_init_children.
> > > Per my tests, as soon as pgpool gets connection num_init_children + 1, it
> > > locks up and takes at least one of the backend nodes with it.
> > >
> > > Can anyone else confirm this?
> > > I'd like to make sure it is not just my particular configuration causing
> > > the issue.
> > >
> > > In the meantime I have simply increased the num_init_children parameter
> > > significantly in an effort to stay well ahead of the number of incoming
> > > connections, and this appears to be working.
> > >
> > > -s
> > >
> > > On Sun, Nov 8, 2009 at 3:23 AM, Tatsuo Ishii <ishii at sraoss.co.jp>
> wrote:
> > >
> > > > I would like to know what the condition of pgpool is. What does ps
> > > > show for the pgpool processes? Even better, can you attach a debugger
> > > > to one of the pgpool processes and get a back trace?
> > > > --
> > > > Tatsuo Ishii
> > > > SRA OSS, Inc. Japan
> > > >
> > > > > I've tried setting the local backend_hostname = ''
> > > > > Same problems are occurring.
> > > > > Pgpool has actually failed something like 4 separate times today, all
> > > > > but one of them using this local socket configuration.
> > > > >
> > > > > Any other thoughts?
> > > > >
> > > > > thx
> > > > > -s
> > > > >
> > > > > On Thu, Nov 5, 2009 at 10:52 PM, Tatsuo Ishii <ishii at sraoss.co.jp>
> > > > wrote:
> > > > >
> > > > > > > Actually I was running pgpool on db2 (backend_hostname1) and am now
> > > > > > > running it on db3 (backend_hostname2).
> > > > > > > I have actually suspected that pgpool might be opting for some sort
> > > > > > > of socket connection to the local instance of postgres instead of
> > > > > > > using the TCP/IP connection parameters in an effort to speed things up.
> > > > > > >
> > > > > > > I have done my best to ensure that pgpool has completely separate
> > > > > > > socket directories, but it wouldn't be hard for pgpool to find a
> > > > > > > local postgres socket if it wanted.  If I end up with another outage
> > > > > > > and this time db3 is the postgres instance that locks up, I'll be
> > > > > > > fairly certain that this is the problem, but for the moment I can
> > > > > > > only speculate.
> > > > > > >
> > > > > > > I'm assuming you're suggesting I set backend_hostname0 = '' because
> > > > > > > it is already weighted to 0.0 anyway?
> > > > > >
> > > > > > No, it's because I thought you were running pgpool on db1. '' means
> > > > > > forcing pgpool to use a UNIX domain socket. So if you are running
> > > > > > pgpool on db3, you could set:
> > > > > >
> > > > > > backend_hostname2 = ''
> > > > > >
> > > > > > > I have db1 (backend_hostname0) weighted to 0.0 in an effort to direct
> > > > > > > all SELECTs to the two slave hosts (db2 and db3) but still benefit
> > > > > > > from pgpool intelligently sending writes to db1.
> > > > > > > db1 is the Mammoth master host and needs all available I/O to deal
> > > > > > > with writes.
> > > > > > > My understanding is that this is how "master_slave_mode = true" works:
> > > > > > > writes are always directed to backend_hostname0.
> > > > > > >
> > > > > > > If I need to reevaluate that thinking, please advise, but that has
> > > > > > > been working for me for months now.
> > > > > > >
> > > > > > > thx
> > > > > > > -s
> > > > > > >
> > > > > > > On Thu, Nov 5, 2009 at 9:26 PM, Tatsuo Ishii <
> ishii at sraoss.co.jp>
> > > > wrote:
> > > > > > >
> > > > > > > > Besides the useless error message from pcp_child (it seems someone
> > > > > > > > believed that EOF would set some error number in the global errno
> > > > > > > > variable; I will fix this anyway), it seems to me that the socket
> > > > > > > > files are going dead. I suspect some network stack bug could cause
> > > > > > > > this, but I'm not sure. One thing you might want to try is changing
> > > > > > > > this:
> > > > > > > >
> > > > > > > > backend_hostname0 = 'db1.xxx.xxx'
> > > > > > > >
> > > > > > > > to:
> > > > > > > >
> > > > > > > > backend_hostname0 = ''
> > > > > > > >
> > > > > > > > This will make pgpool use a UNIX domain socket for the communication
> > > > > > > > channel to PostgreSQL, rather than TCP/IP. It may or may not affect
> > > > > > > > the problem you have, since the network code in the kernel will be
> > > > > > > > different.
> > > > > > > >
> > > > > > > > (I assume you are running pgpool on db1.xxx.xxx)
> > > > > > > > --
> > > > > > > > Tatsuo Ishii
> > > > > > > > SRA OSS, Inc. Japan
> > > > > > > >
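
Regarding the "reason: Success" text in the subject line, here is a small
sketch (my own illustration, not pgpool source) of the pattern described
above: read() returns 0 at EOF without setting errno, so a log line that
unconditionally prints strerror(errno) reports whatever stale value is
there, typically "Success".

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static ssize_t read_or_log(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n <= 0)
    {
        /* Buggy pattern: errno is meaningless when n == 0 (EOF). */
        fprintf(stderr, "pcp_read() failed. reason: %s\n", strerror(errno));
        /* A correct version would distinguish EOF from a real error: */
        if (n == 0)
            fprintf(stderr, "pcp_read(): unexpected EOF on fd %d\n", fd);
    }
    return n;
}

int main(void)
{
    int p[2];
    char buf[16];

    if (pipe(p) == 0)
    {
        close(p[1]);                 /* close the write end: reader sees EOF */
        read_or_log(p[0], buf, sizeof(buf));
        close(p[0]);
    }
    return 0;
}
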
> > > > > > > > > Has anyone else run into this:
> > > > > > > > >
> > > > > > > > > My pgpool instance runs without problems for days on end and then
> > > > > > > > > suddenly stops responding to all requests.
> > > > > > > > > At the same moment, one of my three backend db hosts becomes
> > > > > > > > > completely inaccessible.
> > > > > > > > > Pgpool will not respond to shutdown, or even kill, and must be
> > > > > > > > > kill -9'd.
> > > > > > > > > Once all pgpool processes are out of the way, the inaccessible
> > > > > > > > > postgres server once again becomes responsive.
> > > > > > > > > I restart pgpool and everything works properly for a few more days.
> > > > > > > > >
> > > > > > > > > At the moment the problem occurs, pgpool's log output, which
> > > > > > > > > typically consists of just connection logging, turns into a steady
> > > > > > > > > stream of this:
> > > > > > > > > Nov  5 11:33:18 src at obfuscated pgpool: 2009-11-05 11:33:18 ERROR:
> > > > > > > > > pid 12811: pcp_child: pcp_read() failed. reason: Success
> > > > > > > > > These errors show up sporadically in my pgpool logs all the time
> > > > > > > > > but don't appear to have any adverse effects until the whole thing
> > > > > > > > > takes a dive.
> > > > > > > > > I would desperately like to know what this error message is trying
> > > > > > > > > to tell me.
> > > > > > > > >
> > > > > > > > > I have not been able to correlate any given query/connection/process
> > > > > > > > > to the timing of the outages.
> > > > > > > > > Sometimes they happen at peak usage periods, sometimes they happen
> > > > > > > > > in the middle of the night.
> > > > > > > > >
> > > > > > > > > I experienced this problem using pgpool-II v1.3 and have recently
> > > > > > > > > upgraded to pgpool-II v2.2.5 but am still seeing the same issue.
> > > > > > > > >
> > > > > > > > > It may be relevant to point out that I am running pgpool on one of
> > > > > > > > > the machines that is also acting as a postgres backend, and it is
> > > > > > > > > always the postgres instance on the pgpool host that locks up.
> > > > > > > > > This morning I moved the pgpool instance onto another one of the
> > > > > > > > > postgres backend hosts in an effort to see whether the cohabitation
> > > > > > > > > of pgpool and postgres is causing problems, whether there is simply
> > > > > > > > > an issue with postgres on that host, or whether this is just a
> > > > > > > > > coincidence.
> > > > > > > > > I likely won't gain anything from this test for a day or more.
> > > > > > > > >
> > > > > > > > > Also relevant is that I am running Mammoth Replicator and am only
> > > > > > > > > using pgpool for connection load balancing and high availability.
> > > > > > > > >
> > > > > > > > > Below is my pgpool.conf.
> > > > > > > > >
> > > > > > > > > Any thoughts appreciated.
> > > > > > > > >
> > > > > > > > > -steve crandell
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > #
> > > > > > > > > # pgpool-II configuration file sample
> > > > > > > > > # $Header: /cvsroot/pgpool/pgpool-II/pgpool.conf.sample,v 1.4.2.3
> > > > > > > > > # 2007/10/12 09:15:02 y-asaba Exp $
> > > > > > > > >
> > > > > > > > > # Host name or IP address to listen on: '*' for all, '' for no
> > > > > > > > > # TCP/IP connections
> > > > > > > > > #listen_addresses = 'localhost'
> > > > > > > > > listen_addresses = '10.xxx.xxx.xxx'
> > > > > > > > >
> > > > > > > > > # Port number for pgpool
> > > > > > > > > port = 5432
> > > > > > > > >
> > > > > > > > > # Port number for pgpool communication manager
> > > > > > > > > pcp_port = 9898
> > > > > > > > >
> > > > > > > > > # Unix domain socket path.  (The Debian package defaults to
> > > > > > > > > # /var/run/postgresql.)
> > > > > > > > > socket_dir = '/usr/local/pgpool'
> > > > > > > > >
> > > > > > > > > # Unix domain socket path for pgpool communication manager.
> > > > > > > > > pcp_socket_dir = '/usr/local/pgpool'
> > > > > > > > >
> > > > > > > > > # Unix domain socket path for the backend. Debian package defaults to
> > > > > > > > > # /var/run/postgresql!
> > > > > > > > > backend_socket_dir = '/usr/local/pgpool'
> > > > > > > > >
> > > > > > > > > # pgpool communication manager timeout. 0 means no timeout, but
> > > > > > > > > # strongly not recommended!
> > > > > > > > > pcp_timeout = 10
> > > > > > > > >
> > > > > > > > > # number of pre-forked child processes
> > > > > > > > > num_init_children = 32
> > > > > > > > >
> > > > > > > > > # Number of connection pools allowed for a child process
> > > > > > > > > max_pool = 4
> > > > > > > > >
> > > > > > > > > # If idle for this many seconds, child exits.  0 means no timeout.
> > > > > > > > > child_life_time = 30
> > > > > > > > >
> > > > > > > > > # If idle for this many seconds, connection to PostgreSQL closes.
> > > > > > > > > # 0 means no timeout.
> > > > > > > > > #connection_life_time = 0
> > > > > > > > > connection_life_time = 30
> > > > > > > > >
> > > > > > > > > # If child_max_connections connections were received, child exits.
> > > > > > > > > # 0 means no exit.
> > > > > > > > > # change
> > > > > > > > > child_max_connections = 0
> > > > > > > > >
> > > > > > > > > # Maximum time in seconds to complete client authentication.
> > > > > > > > > # 0 means no timeout.
> > > > > > > > > authentication_timeout = 60
> > > > > > > > >
> > > > > > > > > # Logging directory (more accurately, the directory for the PID file)
> > > > > > > > > logdir = '/usr/local/pgpool'
> > > > > > > > >
> > > > > > > > > # Replication mode
> > > > > > > > > replication_mode = false
> > > > > > > > >
> > > > > > > > > # Set this to true if you want to avoid deadlock situations when
> > > > > > > > > # replication is enabled.  There will, however, be a noticeable
> > > > > > > > > # performance degradation.  A workaround is to set this to false and
> > > > > > > > > # insert a /*STRICT*/ comment at the beginning of the SQL command.
> > > > > > > > > replication_strict = false
> > > > > > > > >
> > > > > > > > > # When replication_strict is set to false, there will be a chance for
> > > > > > > > > # deadlocks.  Set this to nonzero (in milliseconds) to detect this
> > > > > > > > > # situation and resolve the deadlock by aborting the current session.
> > > > > > > > > replication_timeout = 5000
> > > > > > > > >
> > > > > > > > > # Load balancing mode, i.e., all SELECTs except in a transaction block
> > > > > > > > > # are load balanced.  This is ignored if replication_mode is false.
> > > > > > > > > # change
> > > > > > > > > load_balance_mode = true
> > > > > > > > >
> > > > > > > > > # if there's a data mismatch between master and secondary
> > > > > > > > > # start degeneration to stop replication mode
> > > > > > > > > replication_stop_on_mismatch = false
> > > > > > > > >
> > > > > > > > > # If true, replicate SELECT statements when load balancing is disabled.
> > > > > > > > > # If false, they are only sent to the master node.
> > > > > > > > > # change
> > > > > > > > > replicate_select = true
> > > > > > > > >
> > > > > > > > > # Semicolon separated list of queries to be issued at the end of a session
> > > > > > > > > reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'
> > > > > > > > >
> > > > > > > > > # If true, print a timestamp on each log line.
> > > > > > > > > print_timestamp = true
> > > > > > > > >
> > > > > > > > > # If true, operate in master/slave mode.
> > > > > > > > > # change
> > > > > > > > > master_slave_mode = true
> > > > > > > > >
> > > > > > > > > # If true, cache connection pool.
> > > > > > > > > connection_cache = false
> > > > > > > > >
> > > > > > > > > # Health check timeout.  0 means no timeout.
> > > > > > > > > health_check_timeout = 20
> > > > > > > > >
> > > > > > > > > # Health check period.  0 means no health check.
> > > > > > > > > health_check_period = 0
> > > > > > > > >
> > > > > > > > > # Health check user
> > > > > > > > > health_check_user = 'nobody'
> > > > > > > > >
> > > > > > > > > # If true, automatically lock the table with INSERT statements to keep
> > > > > > > > > # SERIAL data consistency.  An /*INSERT LOCK*/ comment has the same
> > > > > > > > > # effect.  A /*NO INSERT LOCK*/ comment disables the effect.
> > > > > > > > > insert_lock = false
> > > > > > > > >
> > > > > > > > > # If true, ignore leading white space in each query while pgpool judges
> > > > > > > > > # whether the query is a SELECT so that it can be load balanced.  This
> > > > > > > > > # is useful for certain APIs such as DBI/DBD, which are known to add an
> > > > > > > > > # extra leading white space.
> > > > > > > > > ignore_leading_white_space = false
> > > > > > > > >
> > > > > > > > > # If true, print all statements to the log.  Like the log_statement option
> > > > > > > > > # to PostgreSQL, this allows for observing queries without engaging in full
> > > > > > > > > # debugging.
> > > > > > > > > log_statement = false
> > > > > > > > >
> > > > > > > > > # If true, incoming connections will be printed to the log.
> > > > > > > > > # change
> > > > > > > > > log_connections = true
> > > > > > > > >
> > > > > > > > > # If true, hostname will be shown in ps status. Also shown in
> > > > > > > > > # connection log if log_connections = true.
> > > > > > > > > # Be warned that this feature will add overhead to look up the hostname.
> > > > > > > > > log_hostname = false
> > > > > > > > >
> > > > > > > > > # if non 0, run in parallel query mode
> > > > > > > > > parallel_mode = false
> > > > > > > > >
> > > > > > > > > # if non 0, use query cache
> > > > > > > > > enable_query_cache = 0
> > > > > > > > >
> > > > > > > > > # set pgpool2 hostname
> > > > > > > > > pgpool2_hostname = ''
> > > > > > > > >
> > > > > > > > > # system DB info
> > > > > > > > > #system_db_hostname = 'localhost'
> > > > > > > > > #system_db_port = 5432
> > > > > > > > > #system_db_dbname = 'pgpool'
> > > > > > > > > #system_db_schema = 'pgpool_catalog'
> > > > > > > > > #system_db_user = 'pgpool'
> > > > > > > > > #system_db_password = ''
> > > > > > > > >
> > > > > > > > > # backend_hostname, backend_port, backend_weight
> > > > > > > > > # here are examples
> > > > > > > > > backend_hostname0 = 'db1.xxx.xxx'
> > > > > > > > > backend_port0 = 5433
> > > > > > > > > backend_weight0 = 0.0
> > > > > > > > >
> > > > > > > > > backend_hostname1 = 'db2.xxx.xxx'
> > > > > > > > > backend_port1 = 5433
> > > > > > > > > backend_weight1 = 0.4
> > > > > > > > >
> > > > > > > > > backend_hostname2 = 'db3.xxx.xxx'
> > > > > > > > > backend_port2 = 5433
> > > > > > > > > backend_weight2 = 0.6
> > > > > > > > >
> > > > > > > > > # - HBA -
> > > > > > > > >
> > > > > > > > > # If true, use pool_hba.conf for client authentication. In pgpool-II
> > > > > > > > > # 1.1, the default value is false. The default value will be true in
> > > > > > > > > # 1.2.
> > > > > > > > > enable_pool_hba = false
> > > > > > > >
> > > > > >
> > > >
> > _______________________________________________
> > Pgpool-general mailing list
> > Pgpool-general at pgfoundry.org
> > http://pgfoundry.org/mailman/listinfo/pgpool-general
>
> Index: main.c
> ===================================================================
> RCS file: /cvsroot/pgpool/pgpool-II/main.c,v
> retrieving revision 1.45.2.8
> diff -c -r1.45.2.8 main.c
> *** main.c      10 Nov 2009 02:24:00 -0000      1.45.2.8
> --- main.c      10 Nov 2009 09:01:36 -0000
> ***************
> *** 853,858 ****
> --- 853,859 ----
>        int status;
>        int one = 1;
>        int len;
> +       int backlog;
>
>        fd = socket(AF_INET, SOCK_STREAM, 0);
>        if (fd == -1)
> ***************
> *** 902,908 ****
>                myexit(1);
>        }
>
> !       status = listen(fd, PGPOOLMAXLITSENQUEUELENGTH);
>        if (status < 0)
>        {
>                pool_error("listen() failed. reason: %s", strerror(errno));
> --- 903,913 ----
>                myexit(1);
>        }
>
> !       backlog = pool_config->num_init_children * 2;
> !       if (backlog > PGPOOLMAXLITSENQUEUELENGTH)
> !               backlog = PGPOOLMAXLITSENQUEUELENGTH;
> !
> !       status = listen(fd, backlog);
>        if (status < 0)
>        {
>                pool_error("listen() failed. reason: %s", strerror(errno));
>
>
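
For reference, here is a standalone illustration (my own sketch, not pgpool
source) of the backlog sizing the patch introduces: request
num_init_children * 2 slots, capped at the previous fixed ceiling. Note that
the kernel may clamp whatever value is requested (e.g. to SOMAXCONN or
net.core.somaxconn), so the effective queue can be shorter still.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_LISTEN_QUEUE 10000   /* stands in for PGPOOLMAXLITSENQUEUELENGTH */

int main(void)
{
    int num_init_children = 32;            /* value from my pgpool.conf */
    int backlog = num_init_children * 2;
    int one = 1;
    int fd;
    struct sockaddr_in addr;

    if (backlog > MAX_LISTEN_QUEUE)
        backlog = MAX_LISTEN_QUEUE;

    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd == -1)
    {
        perror("socket");
        return 1;
    }
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(15432);          /* arbitrary test port */

    if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 ||
        listen(fd, backlog) < 0)
    {
        perror("bind/listen");
        close(fd);
        return 1;
    }

    printf("listening with backlog %d\n", backlog);
    close(fd);
    return 0;
}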

