View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000601 | Pgpool-II | Bug | public | 2020-03-31 17:28 | 2020-04-14 12:46 |
| Reporter | mvineza | Assigned To | pengbo | | |
| Priority | normal | Severity | major | Reproducibility | random |
| Status | feedback | Resolution | open | | |
| Platform | Container | OS | CentOS | OS Version | 7.7.1908 |
| Product Version | 4.1.0 | | | | |
| Summary | 0000601: Some SQL queries are missed | | | | |

**Description**

I have AWX running inside Kubernetes. The components (web and task/scheduler) connect to PostgreSQL. All of them are containers running as Kubernetes pods. For two months that setup ran fine, with no connection issues between the pods:

```
web (pod)            --> postgres (pod)
task/scheduler (pod) --> postgres (pod)
```

Due to the lack of persistent storage for PostgreSQL in the Kubernetes cluster, I decided to move PostgreSQL out of Kubernetes, so it now runs on traditional virtual machines (1 master and 1 slave) instead of in a container. I then deployed pgpool inside Kubernetes and removed the postgres pod. The current setup is:

```
web (pod)            --> pgpool (pod) --> postgres master/slave (virtual machines)
task/scheduler (pod) --> pgpool (pod) --> postgres master/slave (virtual machines)
```

With this setup I encounter connection issues: some SQL statements are not sent through to the backend PostgreSQL servers. I notice the problem when the task/scheduler pod raises this error:

```
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 408, in get
    self.model._meta.object_name
awx.main.models.jobs.Job.DoesNotExist: Job matching query does not exist.
```

I looked closely and found that some SQL queries are not being handled by pgpool; in effect, those queries never reach the backend PostgreSQL servers. I uploaded an image of the missed job in AWX that corresponds to the missed SQL query (miss_awx_job.png). In the image, job 2214 ran successfully, job 2216 was missed, job 2217 ran successfully, and so on.

Checking the pgpool logs, job ID 2217 is there:

```
➜ ~ ☿ kubectl logs dev-pgpool-77949cf588-mx25h | egrep '2020-03-31' | grep dev_execute_command | egrep 'INSERT INTO "main_activitystream"' | egrep '"id": 2217'
2020-03-31 05:04:28: pid 36: LOG: statement: INSERT INTO "main_activitystream" ("actor_id", "operation", "timestamp", "changes", "deleted_actor", "action_node", "object_relationship_type", "object1", "object2", "setting") VALUES (NULL, 'create', '2020-03-31T05:04:28.282003+00:00'::timestamptz, '{"name": "dev_execute_command", "description": "", "job_type": "run", "inventory": "awx-playbooks-dev-1", "project": "awx-playbooks-6", "playbook": "playbooks/execute_command.yaml", "scm_branch": "", "forks": 0, "limit": "", "verbosity": 0, "extra_vars": "{\"command\": \"date\"}", "job_tags": "", "force_handlers": false, "skip_tags": "", "start_at_task": "", "timeout": 0, "use_fact_cache": false, "job_template": "dev_execute_command-18", "allow_simultaneous": false, "instance_group": null, "diff_mode": false, "job_slice_number": 0, "job_slice_count": 1, "id": 2217, "credentials": [], "labels": []}', NULL, 'awx', '', 'job', '', '{}') RETURNING "main_activitystream"."id"
➜ ~ ☿
```

But job ID 2216 is not:

```
➜ ~ ☿ kubectl logs dev-pgpool-77949cf588-mx25h | egrep '2020-03-31' | grep dev_execute_command | egrep 'INSERT INTO "main_activitystream"' | egrep '"id": 2216'
➜ ~ ☿
```

This is a major issue, since our AWX application runs several batch jobs regularly and there is a high chance that a job will be missed. I tried playing around with some pgpool connection settings, such as setting `child_max_connections` to `1`, but had no luck.

Versions of the software involved:

- AWX: 7.0.0
- Pgpool: 4.1.1
- Postgres: 12
**Tags:** No tags attached.
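A note on method: the per-ID greps above can be generalized. Below is a minimal sketch, assuming AWX job IDs are roughly contiguous and that the pgpool statement log carries the `"id": N` field shown above; the sample `log` variable is a hypothetical stand-in for real `kubectl logs` output.

```shell
#!/bin/sh
# Hypothetical sample standing in for: kubectl logs <pgpool-pod> | grep main_activitystream
log='2020-03-31 05:03:59: pid 30: LOG: statement: INSERT INTO "main_activitystream" ... "id": 2214, ...
2020-03-31 05:04:28: pid 36: LOG: statement: INSERT INTO "main_activitystream" ... "id": 2217, ...'

# Collect every job ID that pgpool actually logged, sorted numerically.
seen=$(printf '%s\n' "$log" | grep -o '"id": [0-9]*' | grep -o '[0-9]*' | sort -n)
first=$(printf '%s\n' "$seen" | head -n 1)
last=$(printf '%s\n' "$seen" | tail -n 1)

# Any ID in [first, last] that never appears was missed by pgpool.
for id in $(seq "$first" "$last"); do
    printf '%s\n' "$seen" | grep -qx "$id" || echo "missing job id: $id"
done
```

With the two sample lines above, this prints `missing job id: 2215` and `missing job id: 2216`.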

---

**Note 0003290 (pengbo):**

If you are using PostgreSQL master/slave, you should set `master_slave_mode = on` and `replication_mode = off` in pgpool.conf, or copy the sample config file for streaming replication:

```
# cp /etc/pgpool-II/pgpool.conf.sample-stream /etc/pgpool-II/pgpool.conf
```

Could you share more of the pgpool log and your pgpool.conf?

```
# kubectl logs dev-pgpool-77949cf588-mx25h | egrep '2020-03-31'
```

---

**Note 0003291 (mvineza):**

Here is the pgpool.conf:

```
# - pgpool Connection Settings -
listen_addresses = '*'
port = '5432'
socket_dir = '/tmp'
reserved_connections = 0
# - pgpool Communication Manager Connection Settings -
pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/tmp'
listen_backlog_multiplier = 2
serialize_accept = off
# - Backend Connection Settings -
backend_hostname0 = 'awxpostgresql0'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/12/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'awxpostgresql0'
backend_hostname1 = 'awxpostgresql1'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/pgsql/12/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = 'awxpostgresql1'
# - Authentication -
enable_pool_hba = off
pool_passwd = 'pool_passwd'
authentication_timeout = '30'
allow_clear_text_frontend_auth = off
# - SSL Connections -
ssl = off
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
ssl_prefer_server_ciphers = off
ssl_ecdh_curve = 'prime256v1'
ssl_dh_params_file = ''
# - Concurrent session and pool size -
num_init_children = 32
max_pool = '15'
# - Life time -
child_life_time = 300
child_max_connections = 1
connection_life_time = 0
client_idle_limit = 0
# - Where to log -
log_destination = 'stderr'
# - What to log -
log_line_prefix = '%t: pid %p: '
log_connections = off
log_hostname = off
log_statement = off
log_per_node_statement = off
log_client_messages = off
log_standby_delay = 'if_over_threshold'
# - Syslog specific -
syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'
# - File Locations -
pid_file_name = '/tmp/pgpool.pid'
logdir = '/var/log'
# - Connection Pooling -
connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'
# - Replication Mode -
replication_mode = off
replicate_select = off
insert_lock = off
lobj_lock_table = ''
# - Degenerate handling -
replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off
# - Load Balancing Mode -
load_balance_mode = 'on'
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'nextval,setval'
black_query_pattern_list = ''
database_redirect_preference_list = ''
app_name_redirect_preference_list = ''
allow_sql_comments = off
disable_load_balance_on_write = 'transaction'
statement_level_load_balance = off
# - Master/Slave Mode -
master_slave_mode = on
master_slave_sub_mode = 'stream'
# - Streaming -
sr_check_period = '30'
sr_check_user = 'postgres'
sr_check_password = 'postgres'
sr_check_database = 'postgres'
delay_threshold = 10000000
# - Special commands -
follow_master_command = ''
# - Healthcheck -
health_check_period = '30'
health_check_timeout = '10'
health_check_user = 'postgres'
health_check_password = 'postgres'
health_check_database = ''
health_check_max_retries = '5'
health_check_retry_delay = '5'
connect_timeout = 10000
# - Failover -
failover_command = '/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
failback_command = ''
failover_on_backend_error = 'on'
detach_false_primary = off
search_primary_node_timeout = '0'
# - Online Recovery -
recovery_user = 'nobody'
recovery_password = ''
recovery_1st_stage_command = 'recovery_1st_stage'
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0
auto_failback = off
auto_failback_interval = 60
# - Watchdog Settings -
use_watchdog = off
trusted_servers = ''
ping_path = '/bin'
wd_hostname = ''
wd_port = 9000
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/tmp'
delegate_IP = ''
if_cmd_path = '/sbin'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
arping_path = '/usr/sbin'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''
failover_when_quorum_exists = off
failover_require_consensus = off
allow_multiple_failover_requests_from_node = off
enable_consensus_with_half_votes = off
wd_monitoring_interfaces_list = ''
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
heartbeat_destination0 = 'host0_ip1'
heartbeat_destination_port0 = 9694
heartbeat_device0 = ''
wd_life_point = 3
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''
# - Other Settings -
relcache_expire = 0
relcache_size = 256
check_temp_table = catalog
check_unlogged_table = on
enable_shared_relcache = on
relcache_query_target = master
memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 67108864
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 409600
memqcache_cache_block_size = 1048576
memqcache_oiddir = '/var/log/pgpool/oiddir'
white_memqcache_table_list = ''
black_memqcache_table_list = ''
```
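As a quick cross-check of a configuration like the one above, here is a minimal sketch; the `want` values are the settings this thread identifies as required for a streaming-replication master/slave setup, and the stand-in config written to a temp file is an assumption for illustration.

```shell
#!/bin/sh
# Write a stand-in config holding the three settings this thread cares about;
# in a real check you would point $conf at /etc/pgpool-II/pgpool.conf instead.
conf=$(mktemp)
cat > "$conf" <<'EOF'
replication_mode = off
master_slave_mode = on
master_slave_sub_mode = 'stream'
EOF

# Fail loudly if any required line is absent or has a different value.
status=ok
for want in "replication_mode = off" \
            "master_slave_mode = on" \
            "master_slave_sub_mode = 'stream'"; do
    grep -qx "$want" "$conf" || { echo "MISSING: $want"; status=bad; }
done
echo "check: $status"
```

For the stand-in config this prints `check: ok`; a config still in native replication mode would report the mismatched lines instead.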

---

**Note 0003292 (mvineza):**

The old pgpool logs are no longer available because our Kubernetes cluster limits the log size. I will generate them again and post them here.

---

**Note 0003293 (pengbo):**

I checked your pgpool.conf; it is fine.

> The old pgpool logs are no longer available because our Kubernetes cluster limits the log size. I will generate them again and post them here.

OK. I will wait for your logs.

---

**Note 0003294 (mvineza):**

There is confidential data in the logs, and it is hard to filter the output because there is so much of it. I think this excerpt should work for our investigation:

```
2020-04-01 05:17:14: pid 11: DETAIL: connecting host=a.b.c.d port=44468
2020-04-01 05:17:14: pid 11: LOG: statement: SET TIME ZONE 'UTC'
2020-04-01 05:17:14: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:14: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:14: pid 11: LOG: statement: SELECT xyz
....
2020-04-01 05:17:15: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:15: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:15: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:15: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: INSERT INTO "main_activitystream" ...  # --> the job ID will appear here
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: DECLARE xyz
2020-04-01 05:17:16: pid 11: LOG: statement: FETCH FORWARD 2000 FROM "_django_curs_140286434461504_333"
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: CLOSE "_django_curs_140286434461504_333"
```

In general, once a job in AWX is launched, you will see something similar to the above when `log_statement` is set to `on`. With `log_statement = 'off'` and `child_max_connections = 1`, the pgpool logs look like this whenever a job is launched in AWX:

```
2020-04-01 08:15:02: pid 1: LOG: Backend status file /var/log/pgpool_status does not exist
2020-04-01 08:15:02: pid 1: LOG: memory cache initialized
2020-04-01 08:15:02: pid 1: DETAIL: memcache blocks :64
2020-04-01 08:15:02: pid 1: LOG: pool_discard_oid_maps: discarded memqcache oid maps
2020-04-01 08:15:02: pid 1: LOG: Setting up socket for 0.0.0.0:5432
2020-04-01 08:15:02: pid 1: LOG: Setting up socket for :::5432
2020-04-01 08:15:02: pid 1: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-04-01 08:15:02: pid 1: LOG: find_primary_node: primary node is 0
2020-04-01 08:15:02: pid 1: LOG: find_primary_node: standby node is 1
2020-04-01 08:15:02: pid 40: LOG: PCP process: 40 started
2020-04-01 08:15:02: pid 1: LOG: pgpool-II successfully started. version 4.1.1 (karasukiboshi)
2020-04-01 08:15:02: pid 1: LOG: node status[0]: 1
2020-04-01 08:15:02: pid 1: LOG: node status[1]: 2
2020-04-01 08:15:37: pid 11: LOG: pool_reuse_block: blockid: 0
2020-04-01 08:15:37: pid 11: CONTEXT: while searching system catalog, When relcache is missed
2020-04-01 08:15:37: pid 11: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: child process with pid: 11 exits with status 256
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 44
2020-04-01 08:15:37: pid 17: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: child process with pid: 17 exits with status 256
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 45
2020-04-01 08:15:37: pid 14: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: child process with pid: 14 exits with status 256
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 46
2020-04-01 08:15:38: pid 44: LOG: child exiting, 1 connections reached
2020-04-01 08:15:38: pid 1: LOG: child process with pid: 44 exits with status 256
2020-04-01 08:15:38: pid 1: LOG: fork a new child process with pid: 47
2020-04-01 08:15:38: pid 23: LOG: child exiting, 1 connections reached
2020-04-01 08:15:38: pid 1: LOG: child process with pid: 23 exits with status 256
2020-04-01 08:15:38: pid 1: LOG: fork a new child process with pid: 48
2020-04-01 08:15:40: pid 22: LOG: child exiting, 1 connections reached
2020-04-01 08:15:40: pid 1: LOG: child process with pid: 22 exits with status 256
2020-04-01 08:15:40: pid 1: LOG: fork a new child process with pid: 49
2020-04-01 08:15:40: pid 47: LOG: child exiting, 1 connections reached
2020-04-01 08:15:40: pid 1: LOG: child process with pid: 47 exits with status 256
2020-04-01 08:15:40: pid 1: LOG: fork a new child process with pid: 50
2020-04-01 08:15:45: pid 26: LOG: child exiting, 1 connections reached
2020-04-01 08:15:45: pid 1: LOG: child process with pid: 26 exits with status 256
2020-04-01 08:15:45: pid 1: LOG: fork a new child process with pid: 51
```

But in the case of a missed job, nothing shows up in the pgpool logs at all. It seems the query never even reached pgpool, because I cannot find the job ID.
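The restart churn in that log is what `child_max_connections = 1` produces: each pgpool child serves one connection, exits, and the parent forks a replacement. A small sketch to quantify the churn; the sample `log` lines are a stand-in for the real `kubectl logs` output.

```shell
#!/bin/sh
# Hypothetical excerpt standing in for: kubectl logs <pgpool-pod>
log='2020-04-01 08:15:37: pid 11: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 44
2020-04-01 08:15:37: pid 17: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 45'

# Count exits versus re-forks; with child_max_connections = 1 they track 1:1.
exits=$(printf '%s\n' "$log" | grep -c 'child exiting, 1 connections reached')
forks=$(printf '%s\n' "$log" | grep -c 'fork a new child process')
echo "child exits: $exits, re-forks: $forks"
```

For the sample excerpt this prints `child exits: 2, re-forks: 2`.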

---

**Note 0003309 (pengbo):**

Thank you for the logs. Yes, it seems pgpool did not receive the query. Could you enable the following parameters and share the logs?

```
log_statement = on
log_client_messages = on
```
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2020-03-31 17:28 | mvineza | New Issue | |
| 2020-03-31 17:28 | mvineza | File Added: missed_awx_job.png | |
| 2020-04-01 09:19 | pengbo | Assigned To | => pengbo |
| 2020-04-01 09:19 | pengbo | Status | new => assigned |
| 2020-04-01 09:19 | pengbo | Description Updated | |
| 2020-04-01 09:45 | pengbo | Note Added: 0003290 | |
| 2020-04-01 10:41 | pengbo | Note Edited: 0003290 | |
| 2020-04-01 13:43 | mvineza | Note Added: 0003291 | |
| 2020-04-01 13:46 | mvineza | Note Added: 0003292 | |
| 2020-04-01 14:50 | pengbo | Note Added: 0003293 | |
| 2020-04-01 17:19 | mvineza | Note Added: 0003294 | |
| 2020-04-07 15:54 | pengbo | Note Added: 0003309 | |
| 2020-04-14 12:46 | pengbo | Status | assigned => feedback |