View Issue Details

ID: 0000765
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2022-08-27 19:27
Reporter: Darvids0n
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Platform: Linux
OS: Red Hat
OS Version: 8.6 Ootpa
Product Version: 4.2.9
Summary: 0000765: Certain workers become degenerate with SSL_read errors
Description: Note: the issue also occurs in 4.1.12.

I have a config set up that is stable when using socket connections or when bypassing the connection cache (i.e. connecting to the postgres dbname). However, when clients connect to their database via SSL, the following happens:

- Connections are healthy for a couple of minutes
- A health check or worker process begins failing all of its connection attempts with:
  pool_ssl: "SSL_read": "no SSL error reported"
- This triggers failover and so on; after auto-failback the connections are healthy for a couple more minutes, then the issue repeats
Steps To Reproduce:
1. Deploy 1-2 pgpool and 1-2 postgres nodes, with SSL on, connection_cache on, and pool_hba off.
2. Create a user and an empty database owned by that user. Ensure postgres uses md5 authentication; having the password in pool_passwd is optional.
3. Start a test psql (or other) connection to pgpool as the new user, connecting to the new database and selecting from anything, e.g. SHOW POOL_POOLS. I use a watch command with a delay of 0.1 s.
4. Wait a few minutes; the issue should occur.
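Step 3 above can be sketched as a shell one-liner. This is a minimal sketch: the host, port, database, and user names are placeholders for illustration, not values taken from this deployment.

```shell
# Placeholder connection details; sslmode=require forces the connection
# to pgpool over SSL, which is the condition that triggers the issue.
CONN='host=pgpool-host port=9999 dbname=testdb user=testuser sslmode=require'

# Re-run the query every 0.1 s, as described in step 3.
watch -n 0.1 "psql '$CONN' -c 'SHOW POOL_POOLS;'"
```

With the connection cache on, each pgpool child keeps the backend connection it opens for this user/database pair, so the repeated polling steadily populates the pool.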
Additional Information: I am not sure whether the Postgres version changes the behaviour, but I am using Postgres 12.11.
Originally child_max_connections was set to 0, but it is now set to 50 as part of troubleshooting this issue.
I also disabled failover_on_backend_error, as it was causing more frequent failovers when a client connection worker hit the SSL issue.
Tags: connection cache, networking, ssl

Activities

Darvids0n

2022-08-22 12:23

reporter  

pgpool.conf (3,513 bytes)   
#
# required_config generated by pgpool_conf resource
#
port = 9999
pcp_port = 9898

backend_hostname0 = 'dld220822093114'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/edodata/pgsql/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'dld220822093114'
backend_hostname1 = 'dld220822093119'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/edodata/pgsql/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = 'dld220822093119'

use_watchdog = on
wd_lifecheck_method = 'heartbeat'


hostname0 = 'dld220822093125'
wd_port0 = 9000
pgpool_port00 = 9999

heartbeat_hostname0 = 'dld220822093125'
heartbeat_port0 = 9694
heartbeat_device0 = ''

hostname1 = 'dld220822093130'
wd_port1 = 9000
pgpool_port01 = 9999

heartbeat_hostname1 = 'dld220822093130'
heartbeat_port1 = 9694
heartbeat_device1 = ''


# additional_config generated by pgpool_conf resource
#
backend_clustering_mode = 'streaming_replication'
follow_primary_command = '/etc/pgpool/follow_primary.sh %d %h %p %D %m %H %M %P %r %R %N %S'
listen_addresses = '*'
pcp_listen_addresses = '*'
pid_file_name = '/var/run/pgpool/pgpool.pid'
socket_dir = '/var/run/pgpool'
pcp_socket_dir = '/var/run/pgpool'
wd_ipc_socket_dir = '/var/run/pgpool'
sr_check_user = 'pgpool'
recovery_user = 'pgsql'
health_check_user = 'pgpool'
enable_pool_hba = off
pool_passwd = 'pool_passwd'
authentication_timeout = 30
allow_clear_text_frontend_auth = off
ssl = on
ssl_key = '/etc/pgpool/ssl/pgpool.key'
ssl_cert = '/etc/pgpool/ssl/pgpool.pem'
ssl_ca_cert = '/etc/ssl/certs/ca-bundle.crt'
ssl_ciphers = 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-DSS-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:DHE-RSA-AES128-SHA256:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK'
ssl_prefer_server_ciphers = on
ssl_ecdh_curve = 'prime256v1'
num_init_children = 32
max_pool = 4
listen_backlog_multiplier = 2
serialize_accept = off
reserved_connections = 0
child_life_time = 300
child_max_connections = 50
connection_life_time = 0
client_idle_limit = 10
log_destination = 'syslog'
log_line_prefix = '%a: '
log_connections = off
log_hostname = on
log_statement = off
log_per_node_statement = off
log_client_messages = off
syslog_facility = 'LOCAL2'
syslog_ident = 'pgpool'
connection_cache = on
load_balance_mode = on
ignore_leading_white_space = on
allow_sql_comments = on
disable_load_balance_on_write = 'transaction'
statement_level_load_balance = on
sr_check_period = 10
delay_threshold = 10000000
log_standby_delay = 'if_over_threshold'
health_check_period = 5
health_check_timeout = 20
health_check_max_retries = 2
health_check_retry_delay = 1
failover_command = '/etc/pgpool/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
failback_command = '/etc/pgpool/failback.sh %d %h %p %D %m %H %M %P %r %R %N %S'
failover_on_backend_error = off
detach_false_primary = on
search_primary_node_timeout = 300
auto_failback = on
recovery_1st_stage_command = 'recovery_1st_stage'
recovery_timeout = 900
ping_path = '/bin'
wd_priority = 1
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
enable_consensus_with_half_votes = on
wd_interval = 10
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
postgresql.conf (3,982 bytes)   
max_connections = 50                    # (change requires restart)
superuser_reserved_connections = 5      # (change requires restart)
ssl = on                                # (change requires restart)
ssl_ciphers = 'ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:DHE-DSS-AES256-GCM-SHA384:DHE-RSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:DHE-RSA-AES128-SHA256:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK' # allowed SSL ciphers
ssl_prefer_server_ciphers = on          # (change requires restart)
#ssl_ecdh_curve = 'prime256v1'          # (change requires restart)
ssl_cert_file = '/edodata/pgsql/data/ssl/postgres.pem'       # (change requires restart)
ssl_key_file = '/edodata/pgsql/data/ssl/postgres.key'# (change requires restart)
ssl_ca_file = '/etc/ssl/certs/ca-bundle.crt'              # (change requires restart)
shared_buffers = 128MB                  # min 128kB
huge_pages = try                        # on, off, or try
temp_buffers = 8MB                      # min 800kB
work_mem = 64MB                                 # min 64kB
maintenance_work_mem = 128MB            # min 1MB
dynamic_shared_memory_type = posix      # the default is the first option
effective_io_concurrency = 200          # 1-1000; 0 disables prefetching
max_worker_processes = 8                # (change requires restart)
max_parallel_workers_per_gather = 4     # taken from max_worker_processes
synchronous_commit = off                # synchronization level;
wal_buffers = 16MB                      # min 32kB, -1 sets based on shared_buffers
max_wal_size = 2GB
min_wal_size = 1GB
checkpoint_completion_target = 0.7      # checkpoint target duration, 0.0 - 1.0
track_commit_timestamp = on     # collect timestamp of transaction commit
synchronous_standby_names = 'walreceiver'       # standby servers that provide sync rep
random_page_cost = 1.1                  # same scale as above
effective_cache_size = 128MB
default_statistics_target = 100 # range 1-10000
log_destination = 'csvlog'              # Valid values are combinations of
log_file_mode = 0640                    # creation mode for log files,
client_min_messages = notice            # values in order of decreasing detail:
log_min_messages = warning              # values in order of decreasing detail:
log_min_error_statement = error # values in order of decreasing detail:
log_error_verbosity = default           # terse, default, or verbose messages
log_replication_commands = off
log_temp_files = 0                      # log temporary files equal or larger
log_timezone = 'Australia/NSW'
autovacuum = on                 # Enable autovacuum subprocess?  'on'
autovacuum_max_workers = 3              # max number of autovacuum subprocesses
datestyle = 'iso, dmy'
timezone = 'Australia/NSW'
lc_messages = 'en_AU.UTF-8'                     # locale for system error message
lc_monetary = 'en_AU.UTF-8'                     # locale for monetary formatting
lc_numeric = 'en_AU.UTF-8'                      # locale for number formatting
lc_time = 'en_AU.UTF-8'                         # locale for time formatting
default_text_search_config = 'pg_catalog.english'

archive_command = 'cp "%p" "/edodbtranslogs/pgsql/%f"'
log_directory = '/edologs/pgsql'

# additional_config generated by postgresql_ha_conf resource
#
wal_keep_segments = 10
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 12
wal_level = 'replica'
hot_standby = on
archive_mode = on
wal_log_hints = on
logging_collector = on
log_filename = 'postgresql.log'
log_line_prefix = '%m [%p] %q%u@%d '
log_checkpoints = on
log_connections = off
log_disconnections = off
log_lock_waits = on
log_duration = on
log_statement = 'none'
log_min_duration_statement = 5000
log_rotation_age = 0
log_rotation_size = 0
pgpool.log (9,074 bytes)   
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [23-1] health_check0: LOG:  pool_ssl: "SSL_read": "no SSL error reported"
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [24-1] health_check0: LOG:  received degenerate backend request for node_id: 0 from pid [44725]
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [35-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [23-1] health_check1: LOG:  pool_ssl: "SSL_read": "no SSL error reported"
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [24-1] health_check1: LOG:  received degenerate backend request for node_id: 1 from pid [44726]
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [36-1] watchdog: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [36-2] watchdog: DETAIL:  waiting for the reply...
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [37-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [38-1] watchdog: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:49 dld220822093130 pgpool[44685]: [38-2] watchdog: DETAIL:  waiting for the reply...
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [25-1] health_check0: LOG:  degenerate backend request for 1 node(s) from pid [44725], will be handled by watchdog
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [25-1] health_check1: LOG:  degenerate backend request for 1 node(s) from pid [44726], will be handled by watchdog
Aug 22 11:49:49 dld220822093130 pgpool[44725]: [27-1] health_check0: LOG:  health check retrying on DB node: 0 (round:1)
Aug 22 11:49:49 dld220822093130 pgpool[44726]: [27-1] health_check1: LOG:  health check retrying on DB node: 1 (round:1)
Aug 22 11:49:50 dld220822093130 pgpool[44725]: [28-1] health_check0: LOG:  health check retrying on DB node: 0 succeeded
Aug 22 11:49:50 dld220822093130 pgpool[44726]: [28-1] health_check1: LOG:  health check retrying on DB node: 1 succeeded
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [39-1] psql: LOG:  connection closed.
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [39-2] psql: DETAIL:  retry to create new connection pool
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [40-1] psql: LOG:  pool_read_kind: error message from main backend:the database system is starting up
Aug 22 11:49:50 dld220822093130 pgpool[44711]: [42-1] psql: LOG:  pool_read_kind: error message from main backend:the database system is starting up
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [39-1] watchdog: LOG:  signal_user1_to_parent_with_reason(2)
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [25-1] main: LOG:  Pgpool-II parent process received SIGUSR1
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [26-1] main: LOG:  Pgpool-II parent process received sync backend signal from watchdog
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [40-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [27-1] main: LOG:  leader watchdog has performed failover
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [27-2] main: DETAIL:  syncing the backend states from the LEADER watchdog node
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [41-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [42-1] watchdog: LOG:  received the get data request from local pgpool-II on IPC interface
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [43-1] watchdog: LOG:  get data request from local pgpool-II node received on IPC interface is forwarded to leader watchdog node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [43-2] watchdog: DETAIL:  waiting for the reply...
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [28-1] main: LOG:  leader watchdog node "dld220822093125:9999 Linux dld220822093125" returned status for 2 backend nodes
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [29-1] main: LOG:  backend:0 is set to down status
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [29-2] main: DETAIL:  backend:0 is DOWN on cluster leader "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [30-1] main: LOG:  backend:1 is set to down status
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [30-2] main: DETAIL:  backend:1 is DOWN on cluster leader "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [31-1] main: LOG:  primary node:1 on leader watchdog node "dld220822093125:9999 Linux dld220822093125" is different from local primary node:0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [32-1] main: LOG:  primary node was changed after the sync from "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [32-2] main: DETAIL:  all children needs to be restarted
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [33-1] main: LOG:  child process with pid: 44693 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [34-1] main: LOG:  child process with pid: 44694 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [35-1] main: LOG:  child process with pid: 44695 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [36-1] main: LOG:  child process with pid: 44699 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [37-1] main: LOG:  child process with pid: 44700 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [38-1] main: LOG:  child process with pid: 44701 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44724]: [23-1] sr_check_worker: LOG:  worker process received restart request
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [39-1] main: LOG:  child process with pid: 44691 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [41-1] main: LOG:  child process with pid: 44696 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [42-1] main: LOG:  child process with pid: 44697 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [43-1] main: LOG:  child process with pid: 44698 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [44-1] main: LOG:  child process with pid: 44703 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [49-1] main: LOG:  child process with pid: 44711 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [50-1] main: LOG:  child process with pid: 44712 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [51-1] main: LOG:  child process with pid: 44713 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [52-1] main: LOG:  child process with pid: 44714 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [53-1] main: LOG:  child process with pid: 44715 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [55-1] main: LOG:  child process with pid: 44717 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [59-1] main: LOG:  child process with pid: 44721 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [60-1] main: LOG:  child process with pid: 44722 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [61-1] main: LOG:  worker child process with pid: 44724 exits with status 256
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [62-1] main: LOG:  fork a new worker child process with pid: 47177
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [63-1] main: LOG:  child process with pid: 44710 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [65-1] main: LOG:  child process with pid: 44705 exits with status 0
Aug 22 11:49:50 dld220822093130 pgpool[47177]: [62-1] sr_check_worker: LOG:  process started
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [44-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [45-1] watchdog: LOG:  processing follow primary looking[acquire] request from remote node "dld220822093125:9999 Linux dld220822093125"
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [67-1] main: LOG:  child process with pid: 47162 exits with status 256
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [68-1] main: LOG:  fork a new child process with pid: 47180
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [46-1] watchdog: LOG:  signal_user1_to_parent_with_reason(2)
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [69-1] main: LOG:  Pgpool-II parent process received SIGUSR1
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [70-1] main: LOG:  Pgpool-II parent process received sync backend signal from watchdog
Aug 22 11:49:50 dld220822093130 pgpool[44685]: [47-1] watchdog: LOG:  new IPC connection received
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [71-1] main: LOG:  leader watchdog has performed failover
Aug 22 11:49:50 dld220822093130 pgpool[44682]: [71-2] main: DETAIL:  syncing the backend states from the LEADER watchdog node

t-ishii

2022-08-24 09:00

developer   ~0004115

I think SSL in the health check process is not supported (yet).
https://pgpool.net/mediawiki/index.php/TODO#Allow_SSL_in_health_check_etc.

t-ishii

2022-08-27 16:56

developer   ~0004116

Sorry, I was wrong. Pgpool-II's health check has actually supported SSL at least since 2010-01-26 (commit c00f8be7f).
I confirmed this using the master branch. After enabling the log_connections parameter, I find the following in the PostgreSQL log:

796117 2022-08-27 16:44:35.831 JST LOG: connection authorized: user=t-ishii database=postgres SSL enabled (protocol=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384, bits=256)

Regarding your issue, I suspect PostgreSQL's max_connections is not large enough to support pgpool's num_init_children. Can you share those parameters?

Darvids0n

2022-08-27 17:32

reporter   ~0004117

max_connections = 50 # (change requires restart)
for each of the 2 DB nodes.

And my pgpool.conf parameters are (see also the attached files):

num_init_children = 32
max_pool = 4
listen_backlog_multiplier = 2
serialize_accept = off
reserved_connections = 0
child_life_time = 300
child_max_connections = 50
connection_life_time = 0
client_idle_limit = 10

t-ishii

2022-08-27 19:11

developer   ~0004118

Do you have connections with multiple users/databases? If so, the system has to satisfy the following formula:

max_pool * num_init_children <= (max_connections - superuser_reserved_connections)      (no query canceling needed)
max_pool * num_init_children * 2 <= (max_connections - superuser_reserved_connections)  (query canceling needed)
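Plugging the values from the attached configs into the first formula (num_init_children = 32, max_pool = 4, max_connections = 50, superuser_reserved_connections = 5) shows the constraint is violated by a wide margin; a quick check:

```shell
# Values taken from the pgpool.conf and postgresql.conf attached to this report.
num_init_children=32
max_pool=4
max_connections=50
superuser_reserved_connections=5

# Worst case: every pgpool child ends up caching max_pool backend connections.
needed=$((num_init_children * max_pool))
available=$((max_connections - superuser_reserved_connections))

echo "needed=$needed available=$available"   # needed=128 available=45
[ "$needed" -le "$available" ] || echo "formula violated: raise max_connections or lower num_init_children/max_pool"
```

Up to 128 cached backend connections can be demanded against only 45 available slots, which is consistent with backends rejecting new connections once the pool saturates.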

Darvids0n

2022-08-27 19:27

reporter   ~0004119

Ah, I can see the formula in the docs as well. Sorry, I must have missed that.

We have several long-running system connections and automation processes, as well as 100 or more user sessions for the webapp, so I think raising num_init_children and max_connections is the right idea.

I will try that and report back whether it solves the problem. Thanks!

Issue History

Date Modified Username Field Change
2022-08-22 12:23 Darvids0n New Issue
2022-08-22 12:23 Darvids0n File Added: pgpool.conf
2022-08-22 12:23 Darvids0n File Added: postgresql.conf
2022-08-22 12:23 Darvids0n File Added: pgpool.log
2022-08-22 12:42 Darvids0n Tag Attached: ssl
2022-08-22 12:42 Darvids0n Tag Attached: networking
2022-08-22 12:42 Darvids0n Tag Attached: connection cache
2022-08-24 09:00 t-ishii Note Added: 0004115
2022-08-27 16:56 t-ishii Note Added: 0004116
2022-08-27 16:57 t-ishii Assigned To => t-ishii
2022-08-27 16:57 t-ishii Status new => assigned
2022-08-27 16:57 t-ishii Status assigned => feedback
2022-08-27 17:32 Darvids0n Note Added: 0004117
2022-08-27 17:32 Darvids0n Status feedback => assigned
2022-08-27 19:11 t-ishii Note Added: 0004118
2022-08-27 19:27 Darvids0n Note Added: 0004119