View Issue Details

ID: 0000765
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2022-08-27 19:27
Reporter: Darvids0n
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Platform: Linux
OS: Red Hat
OS Version: 8.6 Ootpa
Product Version: 4.2.9
Summary: 0000765: Certain workers become degenerate with SSL_read errors
Description: Note: the issue also occurs in 4.1.12.

I have a config set up that is stable when using socket connections or when bypassing the connection cache (i.e. connecting to the postgres dbname). However, when clients connect to their database via SSL, the following happens:

- Connections are healthy for a couple of minutes
- A health check or worker process begins failing all of its connection attempts with:
  pool_ssl: "SSL_read": "no SSL error reported"
- This triggers failover and so on; after auto-failback the connections are healthy for a couple more minutes, then the issue repeats
Steps To Reproduce:
1. Deploy 1-2 pgpool and 1-2 postgres nodes, with SSL on, connection_cache on, and pool_hba off.
2. Create a user and an empty database owned by that user. Ensure postgres uses md5 authentication; having the password in pool_passwd is optional.
3. Start a test psql (or other) connection to pgpool as the new user, connecting to the new database and selecting from anything, e.g. SHOW POOL_POOLS. I use a watch command with a delay of 0.1 s.
4. Wait a few minutes; the issue should occur.
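Step 3 above can be sketched as a shell one-liner. This is a minimal sketch: the host, port, database, and user names are placeholders for illustration, not values taken from this deployment.

```shell
# Placeholder connection details; sslmode=require forces the connection
# to pgpool over SSL, which is the condition that triggers the issue.
CONN='host=pgpool-host port=9999 dbname=testdb user=testuser sslmode=require'

# Re-run the query every 0.1 s, as described in step 3.
watch -n 0.1 "psql '$CONN' -c 'SHOW POOL_POOLS;'"
```

With the connection cache on, each pgpool child keeps the backend connection it opens for this user/database pair, so the repeated polling steadily populates the pool.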
Additional Information: I am not sure whether the Postgres version changes the behaviour, but I am using Postgres 12.11.
Originally child_max_connections was set to 0, but it is now set to 50 as part of troubleshooting this issue.
I also disabled failover_on_backend_error, as it was causing more frequent failovers when a client connection worker hit the SSL issue.
Tags: connection cache, networking, ssl

Activities

Darvids0n

2022-08-22 12:23

reporter  

pgpool.conf (3,513 bytes)   
#
# required_config generated by pgpool_conf resource
#
port = 9999
pcp_port = 9898

backend_hostname0 = 'dld220822093114'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/edodata/pgsql/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'dld220822093114'
backend_hostname1 = 'dld220822093119'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/edodata/pgsql/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = 'dld220822093119'

use_watchdog = on
wd_lifecheck_method = 'heartbeat'


hostname0 = 'dld220822093125'
wd_port0 = 9000
pgpool_port00 = 9999

heartbeat_hostname0 = 'dld220822093125'
heartbeat_port0 = 9694
heartbeat_device0 = ''

hostname1 = 'dld220822093130'
wd_port1 = 9000
pgpool_port01 = 9999

heartbeat_hostname1 = 'dld220822093130'
heartbeat_port1 = 9694
heartbeat_device1 = ''


# additional_config generated by pgpool_conf resource
#
backend_clustering_mode = 'streaming_replication'
follow_primary_command = '/etc/pgpool/follow_primary.sh %d %h %p %D %m %H %M %P %r %R %N %S'
listen_addresses = '*'
pcp_listen_addresses = '*'
pid_file_name = '/var/run/pgpool/pgpool.pid'
socket_dir = '/var/run/pgpool'
pcp_socket_dir = '/var/run/pgpool'
wd_ipc_socket_dir = '/var/run/pgpool'
sr_check_user = 'pgpool'
recovery_user = 'pgsql'
health_check_user = 'pgpool'
enable_pool_hba = off
pool_passwd = 'pool_passwd'
authentication_timeout = 30
allow_clear_text_frontend_auth = off
ssl = on
ssl_key = '/etc/pgpool/ssl/pgpool.key'
ssl_cert = '/etc/pgpool/ssl/pgpool.pem'
ssl_ca_cert = '/etc/ssl/certs/ca-bundle.crt'
ssl_ciphers = 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-DSS-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:DHE-RSA-AES128-SHA256:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK'
ssl_prefer_server_ciphers = on
ssl_ecdh_curve = 'prime256v1'
num_init_children = 32
max_pool = 4
listen_backlog_multiplier = 2
serialize_accept = off
reserved_connections = 0
child_life_time = 300
child_max_connections = 50
connection_life_time = 0
client_idle_limit = 10
log_destination = 'syslog'
log_line_prefix = '%a: '
log_connections = off
log_hostname = on
log_statement = off
log_per_node_statement = off
log_client_messages = off
syslog_facility = 'LOCAL2'
syslog_ident = 'pgpool'
connection_cache = on
load_balance_mode = on
ignore_leading_white_space = on
allow_sql_comments = on
disable_load_balance_on_write = 'transaction'
statement_level_load_balance = on
sr_check_period = 10
delay_threshold = 10000000
log_standby_delay = 'if_over_threshold'
health_check_period = 5
health_check_timeout = 20
health_check_max_retries = 2
health_check_retry_delay = 1
failover_command = '/etc/pgpool/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
failback_command = '/etc/pgpool/failback.sh %d %h %p %D %m %H %M %P %r %R %N %S'
failover_on_backend_error = off
detach_false_primary = on
search_primary_node_timeout = 300
auto_failback = on
recovery_1st_stage_command = 'recovery_1st_stage'
recovery_timeout = 900
ping_path = '/bin'
wd_priority = 1
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
enable_consensus_with_half_votes = on
wd_interval = 10
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
postgresql.conf (3,982 bytes)   
max_connections = 50                    # (change requires restart)
superuser_reserved_connections = 5      # (change requires restart)
ssl = on                                # (change requires restart)
ssl_ciphers = 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-DSS-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:DHE-RSA-AES128-SHA256:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK' # allowed SSL ciphers
ssl_prefer_server_ciphers = on          # (change requires restart)
#ssl_ecdh_curve = 'prime256v1'          # (change requires restart)
ssl_cert_file = '/edodata/pgsql/data/ssl/postgres.pem'       # (change requires restart)
ssl_key_file = '/edodata/pgsql/data/ssl/postgres.key'# (change requires restart)
ssl_ca_file = '/etc/ssl/certs/ca-bundle.crt'              # (change requires restart)
shared_buffers = 128MB                  # min 128kB
huge_pages = try                        # on, off, or try
temp_buffers = 8MB                      # min 800kB
work_mem = 64MB                                 # min 64kB
maintenance_work_mem = 128MB            # min 1MB
dynamic_shared_memory_type = posix      # the default is the first option
effective_io_concurrency = 200          # 1-1000; 0 disables prefetching
max_worker_processes = 8                # (change requires restart)
max_parallel_workers_per_gather = 4     # taken from max_worker_processes
synchronous_commit = off                # synchronization level;
wal_buffers = 16MB                      # min 32kB, -1 sets based on shared_buffers
max_wal_size = 2GB
min_wal_size = 1GB
checkpoint_completion_target = 0.7      # checkpoint target duration, 0.0 - 1.0
track_commit_timestamp = on     # collect timestamp of transaction commit
synchronous_standby_names = 'walreceiver'       # standby servers that provide sync rep
random_page_cost = 1.1                  # same scale as above
effective_cache_size = 128MB
default_statistics_target = 100 # range 1-10000
log_destination = 'csvlog'              # Valid values are combinations of
log_file_mode = 0640                    # creation mode for log files,
client_min_messages = notice            # values in order of decreasing detail:
log_min_messages = warning              # values in order of decreasing detail:
log_min_error_statement = error # values in order of decreasing detail:
log_error_verbosity = default           # terse, default, or verbose messages
log_replication_commands = off
log_temp_files = 0                      # log temporary files equal or larger
log_timezone = 'Australia/NSW'
autovacuum = on                 # Enable autovacuum subprocess?  'on'
autovacuum_max_workers = 3              # max number of autovacuum subprocesses
datestyle = 'iso, dmy'
timezone = 'Australia/NSW'
lc_messages = 'en_AU.UTF-8'                     # locale for system error message
lc_monetary = 'en_AU.UTF-8'                     # locale for monetary formatting
lc_numeric = 'en_AU.UTF-8'                      # locale for number formatting
lc_time = 'en_AU.UTF-8'                         # locale for time formatting
default_text_search_config = 'pg_catalog.english'

archive_command = 'cp "%p" "/edodbtranslogs/pgsql/%f"'
log_directory = '/edologs/pgsql'

# additional_config generated by postgresql_ha_conf resource
#
wal_keep_segments = 10
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 12
wal_level = 'replica'
hot_standby = on
archive_mode = on
wal_log_hints = on
logging_collector = on
log_filename = 'postgresql.log'
log_line_prefix = '%m [%p] %q%u@%d '
log_checkpoints = on
log_connections = off
log_disconnections = off
log_lock_waits = on
log_duration = on
log_statement = 'none'
log_min_duration_statement = 5000
log_rotation_age = 0
log_rotation_size = 0
pgpool.log (9,074 bytes)   
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [23-1] health_check0: LOG:  pool_ssl: "SSL_read": "no SSL error reported"
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [24-1] health_check0: LOG:  received degenerate backend request for node_id: 0 from pid [44725]
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [35-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [23-1] health_check1: LOG:  pool_ssl: "SSL_read": "no SSL error reported"
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [24-1] health_check1: LOG:  received degenerate backend request for node_id: 1 from pid [44726]
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [36-1] watchdog: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [36-2] watchdog: DETAIL:  waiting for the reply...
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [37-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [38-1] watchdog: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [38-2] watchdog: DETAIL:  waiting for the reply...
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [25-1] health_check0: LOG:  degenerate backend request for 1 node(s) from pid [44725], will be handled by watchdog
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [25-1] health_check1: LOG:  degenerate backend request for 1 node(s) from pid [44726], will be handled by watchdog
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [27-1] health_check0: LOG:  health check retrying on DB node: 0 (round:1)
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [27-1] health_check1: LOG:  health check retrying on DB node: 1 (round:1)
Aug 22 11:49:50 dld220822093130 pgpool[44725]: [28-1] health_check0: LOG:  health check retrying on DB node: 0 succeeded
Aug 22 11:49:50 dld220822093130 pgpool[44726]: [28-1] health_check1: LOG:  health check retrying on DB node: 1 succeeded
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [39-1] psql: LOG:  connection closed.
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [39-2] psql: DETAIL:  retry to create new connection pool
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [40-1] psql: LOG:  pool_read_kind: error message from main backend:the database system is starting up
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [42-1] psql: LOG:  pool_read_kind: error message from main backend:the database system is starting up
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [39-1] watchdog: LOG:  signal_user1_to_parent_with_reason(2)
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [25-1] main: LOG:  Pgpool-II parent process received SIGUSR1
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [26-1] main: LOG:  Pgpool-II parent process received sync backend signal from watchdog
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [40-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [27-1] main: LOG:  leader watchdog has performed failover
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [27-2] main: DETAIL:  syncing the backend states from the LEADER watchdog node
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [41-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [42-1] watchdog: LOG:  received the get data request from local pgpool-II on IPC interface
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [43-1] watchdog: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [43-2] watchdog: DETAIL:  waiting for the reply...
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [28-1] main: LOG:  leader watchdog node "dld220822093125:9999 Linux dld220822093125" returned status for 2 backend nodes
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [29-1] main: LOG:  backend:0 is set to down status
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [29-2] main: DETAIL:  backend:0 is DOWN on cluster leader "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [30-1] main: LOG:  backend:1 is set to down status
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [30-2] main: DETAIL:  backend:1 is DOWN on cluster leader "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [31-1] main: LOG:  primary node:1 on leader watchdog node "dld220822093125:9999 Linux dld220822093125" is different from local primary node:0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [32-1] main: LOG:  primary node was changed after the sync from "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [32-2] main: DETAIL:  all children needs to be restarted
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [33-1] main: LOG:  child process with pid: 44693 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [34-1] main: LOG:  child process with pid: 44694 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [35-1] main: LOG:  child process with pid: 44695 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [36-1] main: LOG:  child process with pid: 44699 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [37-1] main: LOG:  child process with pid: 44700 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [38-1] main: LOG:  child process with pid: 44701 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44724]: [23-1] sr_check_worker: LOG:  worker process received restart request
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [39-1] main: LOG:  child process with pid: 44691 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [41-1] main: LOG:  child process with pid: 44696 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [42-1] main: LOG:  child process with pid: 44697 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [43-1] main: LOG:  child process with pid: 44698 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [44-1] main: LOG:  child process with pid: 44703 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [49-1] main: LOG:  child process with pid: 44711 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [50-1] main: LOG:  child process with pid: 44712 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [51-1] main: LOG:  child process with pid: 44713 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [52-1] main: LOG:  child process with pid: 44714 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [53-1] main: LOG:  child process with pid: 44715 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [55-1] main: LOG:  child process with pid: 44717 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [59-1] main: LOG:  child process with pid: 44721 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [60-1] main: LOG:  child process with pid: 44722 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [61-1] main: LOG:  worker child process with pid: 44724 exits with status 256
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [62-1] main: LOG:  fork a new worker child process with pid: 47177
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [63-1] main: LOG:  child process with pid: 44710 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [65-1] main: LOG:  child process with pid: 44705 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[47177]: [62-1] sr_check_worker: LOG:  process started
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [44-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [45-1] watchdog: LOG:  processing follow primary looking[acquire] request from remote node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [67-1] main: LOG:  child process with pid: 47162 exits with status 256
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [68-1] main: LOG:  fork a new child process with pid: 47180
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [46-1] watchdog: LOG:  signal_user1_to_parent_with_reason(2)
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [69-1] main: LOG:  Pgpool-II parent process received SIGUSR1
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [70-1] main: LOG:  Pgpool-II parent process received sync backend signal from watchdog
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [47-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [71-1] main: LOG:  leader watchdog has performed failover
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [71-2] main: DETAIL:  syncing the backend states from the LEADER watchdog node

t-ishii

2022-08-24 09:00

developer   ~0004115

I think SSL in the health check process is not supported (yet).
https://pgpool.net/mediawiki/index.php/TODO#Allow_SSL_in_health_check_etc.

t-ishii

2022-08-27 16:56

developer   ~0004116

Sorry, I was wrong. Pgpool-II's health check has actually supported SSL at least since 2010-01-26 (commit c00f8be7f).
I confirmed this using the master branch. After enabling the log_connections parameter, I find the following in the PostgreSQL log:

796117 2022-08-27 16:44:35.831 JST LOG: connection authorized: user=t-ishii database=postgres SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)

Regarding your issue, I suspect PostgreSQL's max_connections is not large enough to support pgpool's num_init_children. Can you share those parameters?

Darvids0n

2022-08-27 17:32

reporter   ~0004117

max_connections = 50 # (change requires restart)
for each of the 2 DB nodes.

And my pgpool.conf parameters are (see also the attached files):

num_init_children = 32
max_pool = 4
listen_backlog_multiplier = 2
serialize_accept = off
reserved_connections = 0
child_life_time = 300
child_max_connections = 50
connection_life_time = 0
client_idle_limit = 10

t-ishii

2022-08-27 19:11

developer   ~0004118

Do you have connections with multiple users/databases? If so, the system has to satisfy the following formula:

max_pool * num_init_children <= (max_connections - superuser_reserved_connections)      (no query canceling needed)
max_pool * num_init_children * 2 <= (max_connections - superuser_reserved_connections)  (query canceling needed)
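Plugging the values from the attached configs into the first formula (num_init_children = 32, max_pool = 4, max_connections = 50, superuser_reserved_connections = 5) shows the constraint is violated by a wide margin; a quick check:

```shell
# Values taken from the pgpool.conf and postgresql.conf attached to this report.
num_init_children=32
max_pool=4
max_connections=50
superuser_reserved_connections=5

# Worst case: every pgpool child ends up caching max_pool backend connections.
needed=$((num_init_children * max_pool))
available=$((max_connections - superuser_reserved_connections))

echo "needed=$needed available=$available"   # needed=128 available=45
[ "$needed" -le "$available" ] || echo "formula violated: raise max_connections or lower num_init_children/max_pool"
```

Up to 128 cached backend connections can be demanded against only 45 available slots, which is consistent with backends rejecting new connections once the pool saturates.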

Darvids0n

2022-08-27 19:27

reporter   ~0004119

Ah, I can see the formula in the docs as well. Sorry, I must have missed that.

We have several long-running system connections and automation processes, as well as 100 or more user sessions for the webapp, so I think raising num_init_children and max_connections is the right idea.

I will try that and report back whether it solves the problem. Thanks!

Issue History

Date Modified Username Field Change
2022-08-22 12:23 Darvids0n New Issue
2022-08-22 12:23 Darvids0n File Added: pgpool.conf
2022-08-22 12:23 Darvids0n File Added: postgresql.conf
2022-08-22 12:23 Darvids0n File Added: pgpool.log
2022-08-22 12:42 Darvids0n Tag Attached: ssl
2022-08-22 12:42 Darvids0n Tag Attached: networking
2022-08-22 12:42 Darvids0n Tag Attached: connection cache
2022-08-24 09:00 t-ishii Note Added: 0004115
2022-08-27 16:56 t-ishii Note Added: 0004116
2022-08-27 16:57 t-ishii Assigned To => t-ishii
2022-08-27 16:57 t-ishii Status new => assigned
2022-08-27 16:57 t-ishii Status assigned => feedback
2022-08-27 17:32 Darvids0n Note Added: 0004117
2022-08-27 17:32 Darvids0n Status feedback => assigned
2022-08-27 19:11 t-ishii Note Added: 0004118
2022-08-27 19:27 Darvids0n Note Added: 0004119