View Issue Details

IDProjectCategoryView StatusLast Update
0000599Pgpool-II[All Projects] Generalpublic2020-04-02 15:00
ReportermaiqueletAssigned Tohoshiai 
PrioritynormalSeveritymajorReproducibilityN/A
Status feedbackResolutionopen 
PlatformDockerOSAlpine LinuxOS Version3.11
Product Version 
Target VersionFixed in Version 
Summary0000599: Fortinet discards lots of TCP packets with reason: Not a Valid SYN Packet
DescriptionHello,

We are running PgPool 4.0.8 inside an Alpine Linux container (Docker). Access to the backend server goes through a firewall. It's a one PgPool, one Postgresql setup.

The problem is that we keep seeing a lot of discarded TCP packets with the message "Not a Valid SYN Packet". From this link: https://help.stonesoft.com/onlinehelp/StoneGate/SMC/6.4.0/GUID-F62E7E1A-7A2B-4B35-B70A-183A8C11FFB6.html, we get the meaning of the message:

“Not a Valid SYN Packet” is a TCP packet that is not the first packet of a TCP connection (the packet does not have the SYN flag set), but is not part of an existing connection either (there is no connection tracking entry on the Firewall matching this packet)

We thought that this could be caused by keepalive intervals that are longer that what the firewall is set to wait for an idle connection to be reused, so we set this settings in the container:
          net.ipv4.tcp_keepalive_time = 60
          net.ipv4.tcp_keepalive_intvl = 5
          net.ipv4.tcp_keepalive_probes = 3

Those net settings were picked from a container (running another technology) that is actually using them. But they don't seem to work with PgPool as we still got discarded packets for the same reason.

We are using a set of few known tuples: user/database/client_application and, as we are still at the preproduction stage, the number of queries over time are no more than 10 per minute and they are short and fast queries. You can find attached our PgPool configuration and I can tell that we are not reaching the "max_pool * num_init_children" limit (we use half of that). Also, we are using a Python script to check the pg_stat_activity table and, sometimes, this script can take almost 30 seconds to complete (and sometimes I have been waiting for minutes), while sometimes it only takes a second.

The thing is that we ran out of ideas and we don't know where else to look.

If someone has any idea, we would really appreciate it. We are planning on using PgPool for production purposes, so we'll keep trying.

Thank you very much,

Miquel.
Additional Information----- PgPool configuration
listen_addresses = '*'
port = 5432
socket_dir = '/tmp'
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/tmp'
listen_backlog_multiplier = 2
serialize_accept = off
backend_hostname0 = 'some_database'
backend_port0 = 5432
backend_weight0 = 10
backend_data_directory0 = '/data'
backend_flag0 = DISALLOW_TO_FAILOVER
enable_pool_hba = on
pool_passwd = 'pool_passwd'
authentication_timeout = 60
ssl = off
num_init_children = 40
max_pool = 5
child_life_time = 300
child_max_connections = 600
connection_life_time = 0
client_idle_limit = 0
log_destination = 'stderr'
log_line_prefix = '%t: pid %p: ' # printf-style string to output at beginning of each log line.
log_connections = off
log_hostname = off
log_statement = off
log_per_node_statement = off
log_standby_delay = 'if_over_threshold'
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
pid_file_name = '/var/run/pgpool/pgpool.pid'
logdir = '/tmp'
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
replication_mode = off
replicate_select = off
insert_lock = off
lobj_lock_table = ''
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off
load_balance_mode = off
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'currval,lastval,nextval,setval'
database_redirect_preference_list = ''
app_name_redirect_preference_list = ''
allow_sql_comments = off
master_slave_mode = off
master_slave_sub_mode = 'stream'
sr_check_period = 10
sr_check_user = some_user'
sr_check_password = 'some_pass'
sr_check_database = 'postgres'
delay_threshold = 10000000
follow_master_command = ''
health_check_period = 0
health_check_timeout = 20
health_check_user = 'nobody'
health_check_password = ''
health_check_database = ''
health_check_max_retries = 0
health_check_retry_delay = 1
connect_timeout = 10000
health_check_period0 = 0
health_check_timeout0 = 20
health_check_user0 = 'nobody'
health_check_password0 = ''
health_check_database0 = ''
health_check_max_retries0 = 0
health_check_retry_delay0 = 1
connect_timeout0 = 10000
failover_command = ''
failback_command = ''
fail_over_on_backend_error = on
search_primary_node_timeout = 300
recovery_user = 'nobody'
recovery_password = ''
recovery_1st_stage_command = ''
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0
use_watchdog = off
trusted_servers = ''
ping_path = '/bin'
wd_hostname = ''
wd_port = 9000
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/tmp'
delegate_IP = ''
if_cmd_path = '/sbin'
if_up_cmd = 'ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = 'ip addr del $_IP_$/24 dev eth0'
arping_path = '/usr/sbin'
arping_cmd = 'arping -U $_IP_$ -w 1'
clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
wd_monitoring_interfaces_list = '' # Comma separated list of interfaces names to monitor.
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
heartbeat_destination0 = 'host0_ip1'
heartbeat_destination_port0 = 9694
heartbeat_device0 = ''
wd_life_point = 3
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''
relcache_expire = 0
relcache_size = 256
check_temp_table = on
check_unlogged_table = on
memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 67108864
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 409600
memqcache_cache_block_size = 1048576
memqcache_oiddir = '/var/log/pgpool/oiddir'
white_memqcache_table_list = ''
black_memqcache_table_list = ''
TagsAlpine, firewall, networking, TCP discards

Activities

hoshiai

2020-03-25 16:09

developer   ~0003283

Pgpool-II don't has keep_alive paramters, so ordinarily use OS parameters. as was already done by you.

Do your Pgpool and PostgreSQL output any error meesages? Please share your log file.

maiquelet

2020-03-25 19:38

reporter   ~0003284

Hello, you will find attached one log fragment, I couldn't get the ones generated at the sservice' start as this instance has been running for days. We have a bunch of those firewall errors every minute, so it might be enough. If not, just let me know.

This is the most strange behaviour we have seen so far, it happens wevery now and then:

2020-03-25 06:27:48: pid 2380: WARNING: write on backend 0 failed with error :"No error information"
2020-03-25 06:27:48: pid 2380: DETAIL: while trying to write data from offset: 0 wlen: 5
2020-03-25 06:27:48: pid 1: DEBUG: reaper handler
2020-03-25 06:27:48: pid 1: LOG: child process with pid: 2380 exits with status 256
2020-03-25 06:27:48: pid 1: LOG: fork a new child process with pid: 2406
2020-03-25 06:27:48: pid 1: DEBUG: reaper handler: exiting normally
2020-03-25 06:27:48: pid 2406: DEBUG: initializing backend status

Thanks again!

logsSample.txt (393,921 bytes)

maiquelet

2020-03-27 19:08

reporter   ~0003285

Hello, from now on, my colleague 'juananglobalia' will take my place in troubleshooting the issue.

hoshiai

2020-04-02 15:00

developer   ~0003296

Could you get tcpdump between pgpool and postgresql on this server?
We can confirm that keep alive working by this dump. If enable keep alive, length 0 tcp message is appeared.

Issue History

Date Modified Username Field Change
2020-03-22 21:50 maiquelet New Issue
2020-03-22 21:50 maiquelet Tag Attached: Alpine
2020-03-22 21:50 maiquelet Tag Attached: firewall
2020-03-22 21:50 maiquelet Tag Attached: networking
2020-03-22 21:50 maiquelet Tag Attached: TCP discards
2020-03-25 16:09 hoshiai Assigned To => hoshiai
2020-03-25 16:09 hoshiai Status new => feedback
2020-03-25 16:09 hoshiai Note Added: 0003283
2020-03-25 19:38 maiquelet File Added: logsSample.txt
2020-03-25 19:38 maiquelet Note Added: 0003284
2020-03-25 19:38 maiquelet Status feedback => assigned
2020-03-27 19:08 maiquelet Note Added: 0003285
2020-04-02 15:00 hoshiai Status assigned => feedback
2020-04-02 15:00 hoshiai Note Added: 0003296