View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000599 | Pgpool-II | General | public | 2020-03-22 21:50 | 2020-04-02 15:00 |
| Reporter | maiquelet | Assigned To | hoshiai | | |
| Priority | normal | Severity | major | Reproducibility | N/A |
| Status | feedback | Resolution | open | | |
| Platform | Docker | OS | Alpine Linux | OS Version | 3.11 |
**Summary**

0000599: Fortinet discards lots of TCP packets with reason: Not a Valid SYN Packet

**Description**

Hello,

We are running Pgpool-II 4.0.8 inside an Alpine Linux container (Docker). Access to the backend server goes through a firewall. It is a one-Pgpool, one-PostgreSQL setup.

The problem is that we keep seeing a lot of discarded TCP packets with the message "Not a Valid SYN Packet". From this link: https://help.stonesoft.com/onlinehelp/StoneGate/SMC/6.4.0/GUID-F62E7E1A-7A2B-4B35-B70A-183A8C11FFB6.html, we get the meaning of the message:

> "Not a Valid SYN Packet" is a TCP packet that is not the first packet of a TCP connection (the packet does not have the SYN flag set), but is not part of an existing connection either (there is no connection tracking entry on the firewall matching this packet).

We thought this could be caused by keepalive intervals that are longer than what the firewall waits before dropping an idle connection, so we set these settings in the container:

    net.ipv4.tcp_keepalive_time = 60
    net.ipv4.tcp_keepalive_intvl = 5
    net.ipv4.tcp_keepalive_probes = 3

Those net settings were picked from a container (running another technology) where they are actually working. But they don't seem to work with Pgpool, as we still get packets discarded for the same reason.

We are using a small set of known user/database/client_application tuples and, as we are still at the preproduction stage, we run no more than 10 queries per minute, all of them short and fast. You can find our Pgpool configuration attached, and I can tell that we are not reaching the "max_pool * num_init_children" limit (we use about half of that).

Also, we are using a Python script to check the pg_stat_activity table; sometimes this script takes almost 30 seconds to complete (and sometimes I have been waiting for minutes), while at other times it takes only a second.

The thing is that we have run out of ideas and don't know where else to look. If someone has any idea, we would really appreciate it. We are planning to use Pgpool for production purposes, so we'll keep trying.

Thank you very much,
Miquel.
**Additional Information**

Pgpool configuration (note: `sr_check_user` was missing its opening quote in the attached file; restored here):

    listen_addresses = '*'
    port = 5432
    socket_dir = '/tmp'
    pcp_listen_addresses = '*'
    pcp_port = 9898
    pcp_socket_dir = '/tmp'
    listen_backlog_multiplier = 2
    serialize_accept = off
    backend_hostname0 = 'some_database'
    backend_port0 = 5432
    backend_weight0 = 10
    backend_data_directory0 = '/data'
    backend_flag0 = DISALLOW_TO_FAILOVER
    enable_pool_hba = on
    pool_passwd = 'pool_passwd'
    authentication_timeout = 60
    ssl = off
    num_init_children = 40
    max_pool = 5
    child_life_time = 300
    child_max_connections = 600
    connection_life_time = 0
    client_idle_limit = 0
    log_destination = 'stderr'
    log_line_prefix = '%t: pid %p: '   # printf-style string to output at beginning of each log line
    log_connections = off
    log_hostname = off
    log_statement = off
    log_per_node_statement = off
    log_standby_delay = 'if_over_threshold'
    syslog_facility = 'LOCAL0'
    syslog_ident = 'pgpool'
    pid_file_name = '/var/run/pgpool/pgpool.pid'
    logdir = '/tmp'
    connection_cache = on
    reset_query_list = 'ABORT; DISCARD ALL'
    replication_mode = off
    replicate_select = off
    insert_lock = off
    lobj_lock_table = ''
    replication_stop_on_mismatch = off
    failover_if_affected_tuples_mismatch = off
    load_balance_mode = off
    ignore_leading_white_space = on
    white_function_list = ''
    black_function_list = 'currval,lastval,nextval,setval'
    database_redirect_preference_list = ''
    app_name_redirect_preference_list = ''
    allow_sql_comments = off
    master_slave_mode = off
    master_slave_sub_mode = 'stream'
    sr_check_period = 10
    sr_check_user = 'some_user'
    sr_check_password = 'some_pass'
    sr_check_database = 'postgres'
    delay_threshold = 10000000
    follow_master_command = ''
    health_check_period = 0
    health_check_timeout = 20
    health_check_user = 'nobody'
    health_check_password = ''
    health_check_database = ''
    health_check_max_retries = 0
    health_check_retry_delay = 1
    connect_timeout = 10000
    health_check_period0 = 0
    health_check_timeout0 = 20
    health_check_user0 = 'nobody'
    health_check_password0 = ''
    health_check_database0 = ''
    health_check_max_retries0 = 0
    health_check_retry_delay0 = 1
    connect_timeout0 = 10000
    failover_command = ''
    failback_command = ''
    fail_over_on_backend_error = on
    search_primary_node_timeout = 300
    recovery_user = 'nobody'
    recovery_password = ''
    recovery_1st_stage_command = ''
    recovery_2nd_stage_command = ''
    recovery_timeout = 90
    client_idle_limit_in_recovery = 0
    use_watchdog = off
    trusted_servers = ''
    ping_path = '/bin'
    wd_hostname = ''
    wd_port = 9000
    wd_priority = 1
    wd_authkey = ''
    wd_ipc_socket_dir = '/tmp'
    delegate_IP = ''
    if_cmd_path = '/sbin'
    if_up_cmd = 'ip addr add $_IP_$/24 dev eth0 label eth0:0'
    if_down_cmd = 'ip addr del $_IP_$/24 dev eth0'
    arping_path = '/usr/sbin'
    arping_cmd = 'arping -U $_IP_$ -w 1'
    clear_memqcache_on_escalation = on
    wd_escalation_command = ''
    wd_de_escalation_command = ''
    failover_when_quorum_exists = on
    failover_require_consensus = on
    allow_multiple_failover_requests_from_node = off
    wd_monitoring_interfaces_list = ''   # comma-separated list of interface names to monitor
    wd_lifecheck_method = 'heartbeat'
    wd_interval = 10
    wd_heartbeat_port = 9694
    wd_heartbeat_keepalive = 2
    wd_heartbeat_deadtime = 30
    heartbeat_destination0 = 'host0_ip1'
    heartbeat_destination_port0 = 9694
    heartbeat_device0 = ''
    wd_life_point = 3
    wd_lifecheck_query = 'SELECT 1'
    wd_lifecheck_dbname = 'template1'
    wd_lifecheck_user = 'nobody'
    wd_lifecheck_password = ''
    relcache_expire = 0
    relcache_size = 256
    check_temp_table = on
    check_unlogged_table = on
    memory_cache_enabled = off
    memqcache_method = 'shmem'
    memqcache_memcached_host = 'localhost'
    memqcache_memcached_port = 11211
    memqcache_total_size = 67108864
    memqcache_max_num_cache = 1000000
    memqcache_expire = 0
    memqcache_auto_cache_invalidation = on
    memqcache_maxcache = 409600
    memqcache_cache_block_size = 1048576
    memqcache_oiddir = '/var/log/pgpool/oiddir'
    white_memqcache_table_list = ''
    black_memqcache_table_list = ''
**Tags:** Alpine, firewall, networking, TCP discards
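An aside worth ruling out (not part of the original report): sysctls written on the Docker host do not necessarily propagate into a container's network namespace, so the `net.ipv4.tcp_keepalive_*` values should be confirmed from inside the pgpool container itself. A minimal sketch in Python, assuming a Linux `/proc` filesystem:

```python
# Minimal sketch: read the kernel's TCP keepalive defaults from /proc
# (Linux) to confirm the values actually in effect inside the
# container's network namespace.
def read_keepalive_sysctls():
    values = {}
    for name in ("tcp_keepalive_time",
                 "tcp_keepalive_intvl",
                 "tcp_keepalive_probes"):
        with open("/proc/sys/net/ipv4/" + name) as f:
            values[name] = int(f.read().strip())
    return values

if __name__ == "__main__":
    print(read_keepalive_sysctls())
```

Running this inside the container (e.g. via `docker exec`) shows whether the 60/5/3 settings from the report really took effect for pgpool's namespace.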
**Note 0003283 — hoshiai (2020-03-25 16:09)**

Pgpool-II does not have keepalive parameters of its own, so it ordinarily relies on the OS parameters, as you have already done. Do your Pgpool and PostgreSQL output any error messages? Please share your log file.
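One caveat about relying on the OS parameters: the kernel's `tcp_keepalive_*` defaults only apply to sockets on which the owning application has enabled `SO_KEEPALIVE`, and they can also be overridden per socket. A minimal Python illustration of the mechanism (this is not pgpool's actual code, just a sketch of the socket options involved; Linux-only constants):

```python
import socket

def make_keepalive_socket(idle=60, intvl=5, probes=3):
    """Create a TCP socket with keepalive enabled and per-socket
    overrides of the kernel defaults (Linux: TCP_KEEPIDLE/INTVL/CNT).

    Without SO_KEEPALIVE, the net.ipv4.tcp_keepalive_* sysctls have
    no effect on the connection at all.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, intvl)
    s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return s
```

So even with the sysctls set correctly, keepalive probes are only sent on connections where the process has turned `SO_KEEPALIVE` on.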
**Note 0003284 — maiquelet (2020-03-25 19:38)**

Hello, you will find attached one log fragment. I couldn't get the ones generated at the service's start, as this instance has been running for days. We get a bunch of those firewall errors every minute, so it might be enough; if not, just let me know. This is the strangest behaviour we have seen so far, and it happens every now and then:

    2020-03-25 06:27:48: pid 2380: WARNING: write on backend 0 failed with error :"No error information"
    2020-03-25 06:27:48: pid 2380: DETAIL: while trying to write data from offset: 0 wlen: 5
    2020-03-25 06:27:48: pid 1: DEBUG: reaper handler
    2020-03-25 06:27:48: pid 1: LOG: child process with pid: 2380 exits with status 256
    2020-03-25 06:27:48: pid 1: LOG: fork a new child process with pid: 2406
    2020-03-25 06:27:48: pid 1: DEBUG: reaper handler: exiting normally
    2020-03-25 06:27:48: pid 2406: DEBUG: initializing backend status

Thanks again!
**Note 0003285 — maiquelet (2020-03-27 19:08)**

Hello, from now on my colleague 'juananglobalia' will take my place in troubleshooting this issue.
**Note 0003296 — hoshiai (2020-04-02 15:00)**

Could you get a tcpdump capture between pgpool and PostgreSQL on this server? We can confirm whether keepalive is working from that dump: when keepalive is enabled, zero-length TCP segments (length 0) appear.
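To sift a saved text capture for those probes: keepalive probes appear in `tcpdump -n` output as pure ACK segments ending in `length 0`. A small hypothetical helper to count such candidate lines (the regex is an assumption about tcpdump's default text format, and an idle connection's ordinary ACKs match it too, so treat the count as an upper bound):

```python
import re

# Zero-length pure-ACK segments ("Flags [.]" ... "length 0") are what
# TCP keepalive probes look like in tcpdump's default text output.
KEEPALIVE_RE = re.compile(r"Flags \[\.\],.*\blength 0$")

def count_keepalive_candidates(lines):
    """Count tcpdump text lines that look like keepalive probes."""
    return sum(1 for line in lines if KEEPALIVE_RE.search(line.strip()))
```

With the 60/5/3 settings in effect and keepalive enabled on the sockets, an idle pgpool-to-PostgreSQL connection should produce such a segment roughly once a minute.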
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2020-03-22 21:50 | maiquelet | New Issue | |
| 2020-03-22 21:50 | maiquelet | Tag Attached: Alpine | |
| 2020-03-22 21:50 | maiquelet | Tag Attached: firewall | |
| 2020-03-22 21:50 | maiquelet | Tag Attached: networking | |
| 2020-03-22 21:50 | maiquelet | Tag Attached: TCP discards | |
| 2020-03-25 16:09 | hoshiai | Assigned To | => hoshiai |
| 2020-03-25 16:09 | hoshiai | Status | new => feedback |
| 2020-03-25 16:09 | hoshiai | Note Added: 0003283 | |
| 2020-03-25 19:38 | maiquelet | File Added: logsSample.txt | |
| 2020-03-25 19:38 | maiquelet | Note Added: 0003284 | |
| 2020-03-25 19:38 | maiquelet | Status | feedback => assigned |
| 2020-03-27 19:08 | maiquelet | Note Added: 0003285 | |
| 2020-04-02 15:00 | hoshiai | Status | assigned => feedback |
| 2020-04-02 15:00 | hoshiai | Note Added: 0003296 |