View Issue Details

IDProjectCategoryView StatusLast Update
0000364Pgpool-IIBugpublic2018-06-17 21:29
ReportersamuelgomesAssigned Topengbo 
PrioritynormalSeverityblockReproducibilityalways
Status assignedResolutionopen 
PlatformLinuxOSUbuntu OS Version14.04
Product Version3.6.6 
Target VersionFixed in Version 
Summary0000364: PGPool hangs when then number of connections are greater than num_init_children
DescriptionThis is more a support request than a bug.

I've searched everywhere and found that something similar happens to PGBench.

In my company, we have an asp.net application (using Npgsql) connecting to a PostgreSQL 9.3 DB being pooled by PGPool. I want to move forward with the solution and add load balancing but the problem below is blocking me.

When the number of incoming connections are greater than num_init_children, PGPool freezes. New and current connections hangs, I have to force stoping it using pcp_stop_pgpool -m i, since also PGAdmin stops responding.

Since PGPool is widely used, It's probably a problem with my setup that I can't figure it out. Also, I'm not sure if this is a Npgsql driver problem.

I've tried many versions, since 3.5.

Steps To ReproduceConnect to PGPool using > num_init_children
Additional InformationDB max connections is set to 400.

listen_backlog_multiplier = 20
TagsNo tags attached.

Activities

samuelgomes

2017-11-10 01:27

reporter  

pengbo

2017-11-10 09:10

developer   ~0001820

I will look into it.

pengbo

2017-11-20 16:17

developer   ~0001835

Do you mean that not only the new connection, but also all the others connections hang, when the num_init_children is reached?

samuelgomes

2017-11-20 16:36

reporter   ~0001836

That's right. It seems that pgpool hangs when all workers are busy. It does not respond to pcp commands also.

pengbo

2017-11-24 23:20

developer   ~0001843

I could not reproduce this issue. Could you tell me how can I reproduce this?
And could you provide pgpool.log with debug mode?

Does this problem also happen by Pgpool-II 3.6.7?

jhiemer

2018-01-26 07:37

reporter   ~0001915

Hi pengbo,
we have experienced exactly the same issue the last couple of days. We are currently running on the latest fix from the git repository (because of the segmentation fault). What we have experienced is as follows:

- Starting a two cluster with streaming replication and having set the number of max connections to 500 on Postgres
- Initially we had max_pool and num_init_children with the default settings configured. Means 4 * 32 connections = 128 connections
- Every time we hit 123 connections the applications using PgPool ran into a timeout and all connections got stuck
- For a few minutes it was not able to connect to the PgPool cluster and it was unusable
- After some connections timed out (at least this is what we think happens), then connectivity is there again until we run into the problem again

What we did then is:
- We configured max_pool = 8 and left num_init_children at the default value of 32
- The problem is: it absolutely showed not effect
- We ran again into the 123 number of connections and got stuck

Then we changed num_init_children to 4
- The number of connections increased to 147 in maximum the applications need in sum 150 connections and problem reduced drastically

So we have two assumptions of problems we ran into:
1. When the maximum number of connections reaches the limit of the formula max_pool * num_init_children (-reserved admin connections) the whole connection pooling effectively stalls
2. The change of the value for max_pool has no effect on the stalling

Is it possible that we provide additional logs to troubleshoot this?

jhiemer

2018-01-26 07:40

reporter   ~0001916

Small correction: we changed num_init_children to 40 not 4 as stated above.

pengbo

2018-02-26 14:10

developer   ~0001939

Sorry for late response.

I reproduced this issue by using pgbench.
But it's a feature, not a bug.
With pgbench if the number of concurrent transactions specified by "-c" exceeds num_init_children,
pgbench will stuck because it will wait for Pgpool-II accepting connections forever.
Pgpool-II accepts up to num_init_children concurrent sessions.

I think it is the application's problem, if the number of incoming connections are greater than num_init_children, new and current connections hangs.

To make sure if it is the pgpool's problem or application's problem, could you try the following steps?

1. Connect to Pgpool-II by using psql
2. Use application to connect to Pgpool-II to make new and current connections hang.
3. At the connection of step 1, try to execute some query like ("SELECT 1").
    If you can execute the query, it is the application's problem.


| So we have two assumptions of problems we ran into:
| 1. When the maximum number of connections reaches the limit of the formula max_pool * num_init_children (-reserved admin connections) the whole connection pooling effectively stalls
| 2. The change of the value for max_pool has no effect on the stalling

You have to set "num_init_children >= application connects" rather than "num_init_children >= max_pool * application connections".
But by PostgreSQL you have to set "max_connections - superuser_reserved_connections >= (num_init_children * max_pool)".

pengbo

2018-04-02 09:16

developer   ~0001985

How about the progress?
May I close this issue, if you have resolved this issue?

jhiemer

2018-04-08 16:09

reporter   ~0001993

Hi Pengbo,
not sure if we can close this as I want to follow up on it with more detailed analysis here. We are running a setup with the following configuration:

1. We have a setup with multiple databases (7 in total)
2. Each of these databases up to 60 connections to the database
3. Postgres is configured to take 750 connections
4. We have two database nodes with streaming replication where only PgPool and Postgres are running
5. CPU/Memory consumption are looking good

So our configuration is as follows:

# pgpool configuration
socket_dir = '/var/vcap/sys/run/pgpool'
pcp_socket_dir = '/var/vcap/sys/run/pgpool'
serialize_accept = on
pid_file_name = '/var/vcap/sys/run/pgpool/pgpool.pid'
logdir = '/var/vcap/sys/log/pgpool'
log_line_prefix = '%t %a %P[%p] %u@%d: '
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
pool_passwd = 'pool_passwd'
enable_pool_hba = on
#------------------------------------------------------------------------------
# FAILOVER AND FAILBACK
#------------------------------------------------------------------------------
master_slave_mode = on
master_slave_sub_mode = 'stream'
sr_check_user = 'admin'
sr_check_password = 'A484GiuRHLstM4zAnDdcxaeP3VrtKT8qTwmeWqDjZq6baMZqUwCViXVR37bPR'
sr_check_database = 'admin'
#------------------------------------------------------------------------------
# FAILOVER AND FAILBACK
#------------------------------------------------------------------------------
failover_command = '/var/vcap/jobs/pgpool/config/failover.sh %d %h %p %D %m %M %H %P %R'
fail_over_on_backend_error = on
#------------------------------------------------------------------------------
# Instances
#------------------------------------------------------------------------------
backend_hostname0 = '10.0.X.Z'
backend_port0 = 6432
backend_hostname1 = '10.0.X.Y'
backend_port1 = 6432
#------------------------------------------------------------------------------
# Config
#------------------------------------------------------------------------------
enable_pool_hba = on
listen_backlog_multiplier = 5
load_balance_mode = on
max_pool = 8
num_init_children = 80
port = 5432
listen_addresses = '*'
pcp_port = 9898

When we start the setup with PgPool everything runs fine up until a point where everything is going bazooka. Why is it going bazooka? Look at the the overflows of somaxconn, after several failovers:

Node 1: 144471 times the listen queue of a socket overflowed

Node 2: 124638 times the listen queue of a socket overflowed
 
The interesting thing is though, that the queue fills up due to some point and the everything is screwed. At that point Postgres is handling 400 connections in parallel, so it is below the 750 configured. But everything is stalled (due to the somaxconn) I think.

On both machines the somax/netdev_max_backlog configuration is as follows:
sysctl net.core.netdev_max_backlog
16384

sysctl net.core.somaxconn
16384

Also the number of connections is very high, though it shouldn't be. I have attached several monitoring screenshots showing the drop of connection (application is running fine), after I have removed PgPool out of the communication chain. There you can see that the number of connections is much lower.

Do you have suggestion how we can debug it in more detail? We really want to use PgPool, but this a big blocker for us atm.

Screen Shot 2018-04-08 at 08.54.11.png (1,160,648 bytes)
Screen Shot 2018-04-08 at 08.53.53.png (1,161,128 bytes)
Screen Shot 2018-04-08 at 08.53.33.png (1,134,658 bytes)

pengbo

2018-06-17 21:29

developer   ~0002047

Sorry, because I couldn't reproduce the environment with the application,
I couldn't give you any suggestion.
I think this is not pgpool's problem, it is due to the application or OS configuration.

Issue History

Date Modified Username Field Change
2017-11-10 01:27 samuelgomes New Issue
2017-11-10 01:27 samuelgomes File Added: Screen Shot 2017-11-09 at 17.22.02.png
2017-11-10 09:10 pengbo Assigned To => pengbo
2017-11-10 09:10 pengbo Status new => assigned
2017-11-10 09:10 pengbo Note Added: 0001820
2017-11-20 16:17 pengbo Note Added: 0001835
2017-11-20 16:36 samuelgomes Note Added: 0001836
2017-11-24 23:20 pengbo Note Added: 0001843
2018-01-26 07:37 jhiemer Note Added: 0001915
2018-01-26 07:40 jhiemer Note Added: 0001916
2018-02-26 14:10 pengbo Note Added: 0001939
2018-04-02 09:16 pengbo Note Added: 0001985
2018-04-08 16:09 jhiemer File Added: Screen Shot 2018-04-08 at 08.53.33.png
2018-04-08 16:09 jhiemer File Added: Screen Shot 2018-04-08 at 08.53.53.png
2018-04-08 16:09 jhiemer File Added: Screen Shot 2018-04-08 at 08.54.11.png
2018-04-08 16:09 jhiemer Note Added: 0001993
2018-06-17 21:29 pengbo Note Added: 0002047