View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000364 | Pgpool-II | Bug | public | 2017-11-10 01:27 | 2018-06-17 21:29 |
| Reporter | samuelgomes | Assigned To | pengbo | ||
| Priority | normal | Severity | block | Reproducibility | always |
| Status | assigned | Resolution | open | ||
| Platform | Linux | OS | Ubuntu | OS Version | 14.04 |
| Product Version | 3.6.6 | ||||
| Summary | 0000364: PGPool hangs when then number of connections are greater than num_init_children | ||||
| Description | This is more a support request than a bug. I've searched everywhere and found that something similar happens to PGBench. In my company, we have an asp.net application (using Npgsql) connecting to a PostgreSQL 9.3 DB being pooled by PGPool. I want to move forward with the solution and add load balancing but the problem below is blocking me. When the number of incoming connections are greater than num_init_children, PGPool freezes. New and current connections hangs, I have to force stoping it using pcp_stop_pgpool -m i, since also PGAdmin stops responding. Since PGPool is widely used, It's probably a problem with my setup that I can't figure it out. Also, I'm not sure if this is a Npgsql driver problem. I've tried many versions, since 3.5. | ||||
| Steps To Reproduce | Connect to PGPool using > num_init_children | ||||
| Additional Information | DB max connections is set to 400. listen_backlog_multiplier = 20 | ||||
| Tags | No tags attached. | ||||
|
|
|
|
|
I will look into it. |
|
|
Do you mean that not only the new connection, but also all the others connections hang, when the num_init_children is reached? |
|
|
That's right. It seems that pgpool hangs when all workers are busy. It does not respond to pcp commands also. |
|
|
I could not reproduce this issue. Could you tell me how can I reproduce this? And could you provide pgpool.log with debug mode? Does this problem also happen by Pgpool-II 3.6.7? |
|
|
Hi pengbo, we have experienced exactly the same issue the last couple of days. We are currently running on the latest fix from the git repository (because of the segmentation fault). What we have experienced is as follows: - Starting a two cluster with streaming replication and having set the number of max connections to 500 on Postgres - Initially we had max_pool and num_init_children with the default settings configured. Means 4 * 32 connections = 128 connections - Every time we hit 123 connections the applications using PgPool ran into a timeout and all connections got stuck - For a few minutes it was not able to connect to the PgPool cluster and it was unusable - After some connections timed out (at least this is what we think happens), then connectivity is there again until we run into the problem again What we did then is: - We configured max_pool = 8 and left num_init_children at the default value of 32 - The problem is: it absolutely showed not effect - We ran again into the 123 number of connections and got stuck Then we changed num_init_children to 4 - The number of connections increased to 147 in maximum the applications need in sum 150 connections and problem reduced drastically So we have two assumptions of problems we ran into: 1. When the maximum number of connections reaches the limit of the formula max_pool * num_init_children (-reserved admin connections) the whole connection pooling effectively stalls 2. The change of the value for max_pool has no effect on the stalling Is it possible that we provide additional logs to troubleshoot this? |
|
|
Small correction: we changed num_init_children to 40 not 4 as stated above. |
|
|
Sorry for late response. I reproduced this issue by using pgbench. But it's a feature, not a bug. With pgbench if the number of concurrent transactions specified by "-c" exceeds num_init_children, pgbench will stuck because it will wait for Pgpool-II accepting connections forever. Pgpool-II accepts up to num_init_children concurrent sessions. I think it is the application's problem, if the number of incoming connections are greater than num_init_children, new and current connections hangs. To make sure if it is the pgpool's problem or application's problem, could you try the following steps? 1. Connect to Pgpool-II by using psql 2. Use application to connect to Pgpool-II to make new and current connections hang. 3. At the connection of step 1, try to execute some query like ("SELECT 1"). If you can execute the query, it is the application's problem. | So we have two assumptions of problems we ran into: | 1. When the maximum number of connections reaches the limit of the formula max_pool * num_init_children (-reserved admin connections) the whole connection pooling effectively stalls | 2. The change of the value for max_pool has no effect on the stalling You have to set "num_init_children >= application connects" rather than "num_init_children >= max_pool * application connections". But by PostgreSQL you have to set "max_connections - superuser_reserved_connections >= (num_init_children * max_pool)". |
|
|
How about the progress? May I close this issue, if you have resolved this issue? |
|
|
Hi Pengbo, not sure if we can close this as I want to follow up on it with more detailed analysis here. We are running a setup with the following configuration: 1. We have a setup with multiple databases (7 in total) 2. Each of these databases up to 60 connections to the database 3. Postgres is configured to take 750 connections 4. We have two database nodes with streaming replication where only PgPool and Postgres are running 5. CPU/Memory consumption are looking good So our configuration is as follows: # pgpool configuration socket_dir = '/var/vcap/sys/run/pgpool' pcp_socket_dir = '/var/vcap/sys/run/pgpool' serialize_accept = on pid_file_name = '/var/vcap/sys/run/pgpool/pgpool.pid' logdir = '/var/vcap/sys/log/pgpool' log_line_prefix = '%t %a %P[%p] %u@%d: ' connection_cache = on reset_query_list = 'ABORT; DISCARD ALL' pool_passwd = 'pool_passwd' enable_pool_hba = on #------------------------------------------------------------------------------ # FAILOVER AND FAILBACK #------------------------------------------------------------------------------ master_slave_mode = on master_slave_sub_mode = 'stream' sr_check_user = 'admin' sr_check_password = 'A484GiuRHLstM4zAnDdcxaeP3VrtKT8qTwmeWqDjZq6baMZqUwCViXVR37bPR' sr_check_database = 'admin' #------------------------------------------------------------------------------ # FAILOVER AND FAILBACK #------------------------------------------------------------------------------ failover_command = '/var/vcap/jobs/pgpool/config/failover.sh %d %h %p %D %m %M %H %P %R' fail_over_on_backend_error = on #------------------------------------------------------------------------------ # Instances #------------------------------------------------------------------------------ backend_hostname0 = '10.0.X.Z' backend_port0 = 6432 backend_hostname1 = '10.0.X.Y' backend_port1 = 6432 #------------------------------------------------------------------------------ # Config #------------------------------------------------------------------------------ enable_pool_hba = on listen_backlog_multiplier = 5 load_balance_mode = on max_pool = 8 num_init_children = 80 port = 5432 listen_addresses = '*' pcp_port = 9898 When we start the setup with PgPool everything runs fine up until a point where everything is going bazooka. Why is it going bazooka? Look at the the overflows of somaxconn, after several failovers: Node 1: 144471 times the listen queue of a socket overflowed Node 2: 124638 times the listen queue of a socket overflowed The interesting thing is though, that the queue fills up due to some point and the everything is screwed. At that point Postgres is handling 400 connections in parallel, so it is below the 750 configured. But everything is stalled (due to the somaxconn) I think. On both machines the somax/netdev_max_backlog configuration is as follows: sysctl net.core.netdev_max_backlog 16384 sysctl net.core.somaxconn 16384 Also the number of connections is very high, though it shouldn't be. I have attached several monitoring screenshots showing the drop of connection (application is running fine), after I have removed PgPool out of the communication chain. There you can see that the number of connections is much lower. Do you have suggestion how we can debug it in more detail? We really want to use PgPool, but this a big blocker for us atm. |
|
|
Sorry, because I couldn't reproduce the environment with the application, I couldn't give you any suggestion. I think this is not pgpool's problem, it is due to the application or OS configuration. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2017-11-10 01:27 | samuelgomes | New Issue | |
| 2017-11-10 01:27 | samuelgomes | File Added: Screen Shot 2017-11-09 at 17.22.02.png | |
| 2017-11-10 09:10 | pengbo | Assigned To | => pengbo |
| 2017-11-10 09:10 | pengbo | Status | new => assigned |
| 2017-11-10 09:10 | pengbo | Note Added: 0001820 | |
| 2017-11-20 16:17 | pengbo | Note Added: 0001835 | |
| 2017-11-20 16:36 | samuelgomes | Note Added: 0001836 | |
| 2017-11-24 23:20 | pengbo | Note Added: 0001843 | |
| 2018-01-26 07:37 | jhiemer | Note Added: 0001915 | |
| 2018-01-26 07:40 | jhiemer | Note Added: 0001916 | |
| 2018-02-26 14:10 | pengbo | Note Added: 0001939 | |
| 2018-04-02 09:16 | pengbo | Note Added: 0001985 | |
| 2018-04-08 16:09 | jhiemer | File Added: Screen Shot 2018-04-08 at 08.53.33.png | |
| 2018-04-08 16:09 | jhiemer | File Added: Screen Shot 2018-04-08 at 08.53.53.png | |
| 2018-04-08 16:09 | jhiemer | File Added: Screen Shot 2018-04-08 at 08.54.11.png | |
| 2018-04-08 16:09 | jhiemer | Note Added: 0001993 | |
| 2018-06-17 21:29 | pengbo | Note Added: 0002047 |