0000364: PGPool hangs when then number of connections are greater than num_init_children

ID	Project	Category	View Status	Date Submitted	Last Update

0000364	Pgpool-II	Bug	public	2017-11-10 01:27	2018-06-17 21:29

Reporter	samuelgomes	Assigned To	pengbo
Priority	normal	Severity	block	Reproducibility	always
Status	assigned	Resolution	open
Platform	Linux	OS	Ubuntu	OS Version	14.04
Product Version	3.6.6

Summary	0000364: PGPool hangs when then number of connections are greater than num_init_children
Description	This is more a support request than a bug. I've searched everywhere and found that something similar happens to PGBench. In my company, we have an asp.net application (using Npgsql) connecting to a PostgreSQL 9.3 DB being pooled by PGPool. I want to move forward with the solution and add load balancing but the problem below is blocking me. When the number of incoming connections are greater than num_init_children, PGPool freezes. New and current connections hangs, I have to force stoping it using pcp_stop_pgpool -m i, since also PGAdmin stops responding. Since PGPool is widely used, It's probably a problem with my setup that I can't figure it out. Also, I'm not sure if this is a Npgsql driver problem. I've tried many versions, since 3.5.
Steps To Reproduce	Connect to PGPool using > num_init_children
Additional Information	DB max connections is set to 400. listen_backlog_multiplier = 20
Tags	No tags attached.

samuelgomes 2017-11-10 01:27 reporter	Screen Shot 2017-11-09 at 17.22.02.png (112,870 bytes) Screen Shot 2017-11-09 at 17.22.02.png (112,870 bytes)

pengbo 2017-11-10 09:10 developer ~0001820	I will look into it.

pengbo 2017-11-20 16:17 developer ~0001835	Do you mean that not only the new connection, but also all the others connections hang, when the num_init_children is reached?

samuelgomes 2017-11-20 16:36 reporter ~0001836	That's right. It seems that pgpool hangs when all workers are busy. It does not respond to pcp commands also.

pengbo 2017-11-24 23:20 developer ~0001843	I could not reproduce this issue. Could you tell me how can I reproduce this? And could you provide pgpool.log with debug mode? Does this problem also happen by Pgpool-II 3.6.7?

jhiemer 2018-01-26 07:37 reporter ~0001915	Hi pengbo, we have experienced exactly the same issue the last couple of days. We are currently running on the latest fix from the git repository (because of the segmentation fault). What we have experienced is as follows: - Starting a two cluster with streaming replication and having set the number of max connections to 500 on Postgres - Initially we had max_pool and num_init_children with the default settings configured. Means 4 * 32 connections = 128 connections - Every time we hit 123 connections the applications using PgPool ran into a timeout and all connections got stuck - For a few minutes it was not able to connect to the PgPool cluster and it was unusable - After some connections timed out (at least this is what we think happens), then connectivity is there again until we run into the problem again What we did then is: - We configured max_pool = 8 and left num_init_children at the default value of 32 - The problem is: it absolutely showed not effect - We ran again into the 123 number of connections and got stuck Then we changed num_init_children to 4 - The number of connections increased to 147 in maximum the applications need in sum 150 connections and problem reduced drastically So we have two assumptions of problems we ran into: 1. When the maximum number of connections reaches the limit of the formula max_pool * num_init_children (-reserved admin connections) the whole connection pooling effectively stalls 2. The change of the value for max_pool has no effect on the stalling Is it possible that we provide additional logs to troubleshoot this?

jhiemer 2018-01-26 07:40 reporter ~0001916	Small correction: we changed num_init_children to 40 not 4 as stated above.

pengbo 2018-02-26 14:10 developer ~0001939	Sorry for late response. I reproduced this issue by using pgbench. But it's a feature, not a bug. With pgbench if the number of concurrent transactions specified by "-c" exceeds num_init_children, pgbench will stuck because it will wait for Pgpool-II accepting connections forever. Pgpool-II accepts up to num_init_children concurrent sessions. I think it is the application's problem, if the number of incoming connections are greater than num_init_children, new and current connections hangs. To make sure if it is the pgpool's problem or application's problem, could you try the following steps? 1. Connect to Pgpool-II by using psql 2. Use application to connect to Pgpool-II to make new and current connections hang. 3. At the connection of step 1, try to execute some query like ("SELECT 1"). If you can execute the query, it is the application's problem. \| So we have two assumptions of problems we ran into: \| 1. When the maximum number of connections reaches the limit of the formula max_pool * num_init_children (-reserved admin connections) the whole connection pooling effectively stalls \| 2. The change of the value for max_pool has no effect on the stalling You have to set "num_init_children >= application connects" rather than "num_init_children >= max_pool * application connections". But by PostgreSQL you have to set "max_connections - superuser_reserved_connections >= (num_init_children * max_pool)".

pengbo 2018-04-02 09:16 developer ~0001985	How about the progress? May I close this issue, if you have resolved this issue?

jhiemer 2018-04-08 16:09 reporter ~0001993	Hi Pengbo, not sure if we can close this as I want to follow up on it with more detailed analysis here. We are running a setup with the following configuration: 1. We have a setup with multiple databases (7 in total) 2. Each of these databases up to 60 connections to the database 3. Postgres is configured to take 750 connections 4. We have two database nodes with streaming replication where only PgPool and Postgres are running 5. CPU/Memory consumption are looking good So our configuration is as follows: # pgpool configuration socket_dir = '/var/vcap/sys/run/pgpool' pcp_socket_dir = '/var/vcap/sys/run/pgpool' serialize_accept = on pid_file_name = '/var/vcap/sys/run/pgpool/pgpool.pid' logdir = '/var/vcap/sys/log/pgpool' log_line_prefix = '%t %a %P[%p] %u@%d: ' connection_cache = on reset_query_list = 'ABORT; DISCARD ALL' pool_passwd = 'pool_passwd' enable_pool_hba = on #------------------------------------------------------------------------------ # FAILOVER AND FAILBACK #------------------------------------------------------------------------------ master_slave_mode = on master_slave_sub_mode = 'stream' sr_check_user = 'admin' sr_check_password = 'A484GiuRHLstM4zAnDdcxaeP3VrtKT8qTwmeWqDjZq6baMZqUwCViXVR37bPR' sr_check_database = 'admin' #------------------------------------------------------------------------------ # FAILOVER AND FAILBACK #------------------------------------------------------------------------------ failover_command = '/var/vcap/jobs/pgpool/config/failover.sh %d %h %p %D %m %M %H %P %R' fail_over_on_backend_error = on #------------------------------------------------------------------------------ # Instances #------------------------------------------------------------------------------ backend_hostname0 = '10.0.X.Z' backend_port0 = 6432 backend_hostname1 = '10.0.X.Y' backend_port1 = 6432 #------------------------------------------------------------------------------ # Config #------------------------------------------------------------------------------ enable_pool_hba = on listen_backlog_multiplier = 5 load_balance_mode = on max_pool = 8 num_init_children = 80 port = 5432 listen_addresses = '*' pcp_port = 9898 When we start the setup with PgPool everything runs fine up until a point where everything is going bazooka. Why is it going bazooka? Look at the the overflows of somaxconn, after several failovers: Node 1: 144471 times the listen queue of a socket overflowed Node 2: 124638 times the listen queue of a socket overflowed The interesting thing is though, that the queue fills up due to some point and the everything is screwed. At that point Postgres is handling 400 connections in parallel, so it is below the 750 configured. But everything is stalled (due to the somaxconn) I think. On both machines the somax/netdev_max_backlog configuration is as follows: sysctl net.core.netdev_max_backlog 16384 sysctl net.core.somaxconn 16384 Also the number of connections is very high, though it shouldn't be. I have attached several monitoring screenshots showing the drop of connection (application is running fine), after I have removed PgPool out of the communication chain. There you can see that the number of connections is much lower. Do you have suggestion how we can debug it in more detail? We really want to use PgPool, but this a big blocker for us atm. Screen Shot 2018-04-08 at 08.53.33.png (1,134,658 bytes) Screen Shot 2018-04-08 at 08.53.53.png (1,161,128 bytes) Screen Shot 2018-04-08 at 08.54.11.png (1,160,648 bytes)

pengbo 2018-06-17 21:29 developer ~0002047	Sorry, because I couldn't reproduce the environment with the application, I couldn't give you any suggestion. I think this is not pgpool's problem, it is due to the application or OS configuration.

Date Modified	Username	Field	Change
2017-11-10 01:27	samuelgomes	New Issue
2017-11-10 01:27	samuelgomes	File Added: Screen Shot 2017-11-09 at 17.22.02.png
2017-11-10 09:10	pengbo	Assigned To	=> pengbo
2017-11-10 09:10	pengbo	Status	new => assigned
2017-11-10 09:10	pengbo	Note Added: 0001820
2017-11-20 16:17	pengbo	Note Added: 0001835
2017-11-20 16:36	samuelgomes	Note Added: 0001836
2017-11-24 23:20	pengbo	Note Added: 0001843
2018-01-26 07:37	jhiemer	Note Added: 0001915
2018-01-26 07:40	jhiemer	Note Added: 0001916
2018-02-26 14:10	pengbo	Note Added: 0001939
2018-04-02 09:16	pengbo	Note Added: 0001985
2018-04-08 16:09	jhiemer	File Added: Screen Shot 2018-04-08 at 08.53.33.png
2018-04-08 16:09	jhiemer	File Added: Screen Shot 2018-04-08 at 08.53.53.png
2018-04-08 16:09	jhiemer	File Added: Screen Shot 2018-04-08 at 08.54.11.png
2018-04-08 16:09	jhiemer	Note Added: 0001993
2018-06-17 21:29	pengbo	Note Added: 0002047

View Issue Details

Activities

Issue History