View Issue Details

ID: 0000601
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2020-04-14 12:46
Reporter: mvineza
Assigned To: pengbo
Priority: normal
Severity: major
Reproducibility: random
Status: feedback
Resolution: open
Platform: Container
OS: CentOS
OS Version: 7.7.1908
Product Version: 4.1.0
Target Version:
Fixed in Version:
Summary: 0000601: Some SQL queries are missed
Description: I have AWX running inside Kubernetes. The components (web and task/scheduler) connect to PostgreSQL. All of them are containers running as Kubernetes pods. For 2 months that setup has been running fine without connection issues between the pods.

web (pod) --> postgres (pod)
task/scheduler (pod) --> postgres (pod)

Due to the lack of persistent storage for Postgres in the Kubernetes cluster, I decided to move PostgreSQL outside of Kubernetes, so it now runs on traditional virtual machines (1 master and 1 slave) instead of in a container. Then I deployed pgpool inside Kubernetes and removed the postgres pod. The current setup is:

web (pod) --> pgpool (pod) --> postgres master/slave (virtual machine)
task/scheduler (pod) --> pgpool (pod) --> postgres master/slave (virtual machine)

With that setup, I encounter connection issues: some SQL statements are not sent properly to the backend Postgres. I noticed the problem when I hit this error from the task/scheduler pod:

  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/manager.py", line 82, in manager_method
    return getattr(self.get_queryset(), name)(*args, **kwargs)
  File "/var/lib/awx/venv/awx/lib64/python3.6/site-packages/django/db/models/query.py", line 408, in get
    self.model._meta.object_name
awx.main.models.jobs.Job.DoesNotExist: Job matching query does not exist.
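The traceback itself is ordinary Django behavior: `QuerySet.get()` raises `DoesNotExist` when no row matches. A minimal sketch in plain Python (no Django; the class names are illustrative, not AWX internals) shows why a silently dropped INSERT surfaces later as this exception:

```python
# Sketch of the failure mode: the scheduler inserts a Job row, another
# component later looks it up by id. If the INSERT never reached the
# database, the lookup raises, mirroring QuerySet.get().
class DoesNotExist(Exception):
    pass

class FakeJobTable:
    """Stand-in for the jobs table behind Job.objects (illustrative only)."""
    def __init__(self):
        self.rows = {}

    def insert(self, job_id, name):
        self.rows[job_id] = name

    def get(self, job_id):
        # Like QuerySet.get(): raise instead of returning None.
        if job_id not in self.rows:
            raise DoesNotExist("Job matching query does not exist.")
        return self.rows[job_id]

jobs = FakeJobTable()
jobs.insert(2217, "dev_execute_command")   # this INSERT reached the backend
# jobs.insert(2216, ...)                   # this one was silently dropped

print(jobs.get(2217))                      # "dev_execute_command"
try:
    jobs.get(2216)
except DoesNotExist as e:
    print(e)                               # "Job matching query does not exist."
```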

Looking closely, I found that some SQL queries are not being handled by pgpool. As a result, those queries never reach the backend Postgres servers.

I uploaded an image of the missed job in AWX that corresponds to the missed SQL query (missed_awx_job.png). In the image, job 2214 ran successfully, job 2216 is missed, job 2217 ran successfully, and so on.

Checking the pgpool logs, I can see that job ID 2217 is there.

➜ ~ ☿ kubectl logs dev-pgpool-77949cf588-mx25h | egrep '2020-03-31' | grep dev_execute_command | egrep 'INSERT INTO "main_activitystream"' | egrep '"id": 2217'
2020-03-31 05:04:28: pid 36: LOG: statement: INSERT INTO "main_activitystream" ("actor_id", "operation", "timestamp", "changes", "deleted_actor", "action_node", "object_relationship_type", "object1", "object2", "setting") VALUES (NULL, 'create', '2020-03-31T05:04:28.282003+00:00'::timestamptz, '{"name": "dev_execute_command", "description": "", "job_type": "run", "inventory": "awx-playbooks-dev-1", "project": "awx-playbooks-6", "playbook": "playbooks/execute_command.yaml", "scm_branch": "", "forks": 0, "limit": "", "verbosity": 0, "extra_vars": "{\"command\": \"date\"}", "job_tags": "", "force_handlers": false, "skip_tags": "", "start_at_task": "", "timeout": 0, "use_fact_cache": false, "job_template": "dev_execute_command-18", "allow_simultaneous": false, "instance_group": null, "diff_mode": false, "job_slice_number": 0, "job_slice_count": 1, "id": 2217, "credentials": [], "labels": []}', NULL, 'awx', '', 'job', '', '{}') RETURNING "main_activitystream"."id"
➜ ~ ☿

But job ID 2216 is not there.

➜ ~ ☿ kubectl logs dev-pgpool-77949cf588-mx25h | egrep '2020-03-31' | grep dev_execute_command | egrep 'INSERT INTO "main_activitystream"' | egrep '"id": 2216'
➜ ~ ☿
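Rather than grepping for one job ID at a time, every `"id": N` in the activity-stream INSERTs can be pulled out and checked for gaps. A throwaway helper (hypothetical, not part of pgpool or AWX) might look like this:

```python
import re

def missing_job_ids(log_lines):
    """Collect "id": N values from main_activitystream INSERT statements
    and return the ids absent from the observed min..max range."""
    pat = re.compile(r'INSERT INTO "main_activitystream".*?"id": (\d+)')
    ids = set()
    for line in log_lines:
        m = pat.search(line)
        if m:
            ids.add(int(m.group(1)))
    if not ids:
        return []
    return sorted(set(range(min(ids), max(ids) + 1)) - ids)

# Made-up log lines shaped like the excerpt above:
sample = [
    'LOG: statement: INSERT INTO "main_activitystream" ... {"id": 2214} ...',
    'LOG: statement: INSERT INTO "main_activitystream" ... {"id": 2217} ...',
]
print(missing_job_ids(sample))  # [2215, 2216]
```

Feeding it the full `kubectl logs` output would list every job ID that never produced an INSERT at pgpool.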

That is a major issue, since our AWX application runs several batch jobs regularly and there is a high chance that a job will be missed.

I tried playing around with some pgpool connection settings, like setting "child_max_connections" to "1", but no luck.

Here are the versions of the software involved:

AWX: 7.0.0
Pgpool: 4.1.1
Postgres: 12
Tags: No tags attached.

Activities

mvineza

2020-03-31 17:28

reporter  

missed_awx_job.png (294,037 bytes)

pengbo

2020-04-01 09:45

developer   ~0003290

Last edited: 2020-04-01 10:41

View 2 revisions

If you are using postgres master/slave, you should set "master_slave_mode = on" and "replication_mode = off" in pgpool.conf.
Or you can copy the sample config file for streaming replication.

     # cp /etc/pgpool-II/pgpool.conf.sample-stream /etc/pgpool-II/pgpool.conf
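The two settings can also be verified against the running config before restarting. A quick throwaway check (illustrative only, not a pgpool tool) that parses simple `key = value` lines:

```python
def check_streaming_settings(conf_text):
    """Parse key = value lines from a pgpool.conf snippet and confirm
    the streaming-replication combination described above."""
    settings = {}
    for line in conf_text.splitlines():
        line = line.split('#', 1)[0].strip()  # drop comments
        if '=' in line:
            key, _, value = line.partition('=')
            settings[key.strip()] = value.strip().strip("'")
    return (settings.get('master_slave_mode') == 'on'
            and settings.get('replication_mode') == 'off')

conf = """
master_slave_mode = on
master_slave_sub_mode = 'stream'
replication_mode = off
"""
print(check_streaming_settings(conf))  # True
```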

Could you share more pgpool log and pgpool.conf?

    # kubectl logs dev-pgpool-77949cf588-mx25h | egrep '2020-03-31'

mvineza

2020-04-01 13:43

reporter   ~0003291

Here is the pgpool.conf.

# - pgpool Connection Settings -

listen_addresses = '*'
port = '5432'
socket_dir = '/tmp'
reserved_connections = 0

# - pgpool Communication Manager Connection Settings -

pcp_listen_addresses = '*'
pcp_port = 9898
pcp_socket_dir = '/tmp'
listen_backlog_multiplier = 2
serialize_accept = off

# - Backend Connection Settings -

backend_hostname0 = 'awxpostgresql0'
backend_port0 = 5432
backend_weight0 = 1
backend_data_directory0 = '/var/lib/pgsql/12/data'
backend_flag0 = 'ALLOW_TO_FAILOVER'
backend_application_name0 = 'awxpostgresql0'
backend_hostname1 = 'awxpostgresql1'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/var/lib/pgsql/12/data'
backend_flag1 = 'ALLOW_TO_FAILOVER'
backend_application_name1 = 'awxpostgresql1'

# - Authentication -

enable_pool_hba = off
pool_passwd = 'pool_passwd'
authentication_timeout = '30'
allow_clear_text_frontend_auth = off

# - SSL Connections -

ssl = off
ssl_ciphers = 'HIGH:MEDIUM:+3DES:!aNULL'
ssl_prefer_server_ciphers = off
ssl_ecdh_curve = 'prime256v1'
ssl_dh_params_file = ''

# - Concurrent session and pool size -

num_init_children = 32
max_pool = '15'

# - Life time -

child_life_time = 300
child_max_connections = 1
connection_life_time = 0
client_idle_limit = 0

# - Where to log -

log_destination = 'stderr'

# - What to log -

log_line_prefix = '%t: pid %p: '
log_connections = off
log_hostname = off
log_statement = off
log_per_node_statement = off
log_client_messages = off
log_standby_delay = 'if_over_threshold'

# - Syslog specific -

syslog_facility = 'LOCAL0'
syslog_ident = 'pgpool'

# - File Locations -

pid_file_name = '/tmp/pgpool.pid'
logdir = '/var/log'

# - Connection Pooling -

connection_cache = on
reset_query_list = 'ABORT; DISCARD ALL'

# - Replication Mode -

replication_mode = off
replicate_select = off
insert_lock = off
lobj_lock_table = ''

# - Degenerate handling -

replication_stop_on_mismatch = off
failover_if_affected_tuples_mismatch = off

# - Load Balancing Mode -

load_balance_mode = 'on'
ignore_leading_white_space = on
white_function_list = ''
black_function_list = 'nextval,setval'
black_query_pattern_list = ''
database_redirect_preference_list = ''
app_name_redirect_preference_list = ''
allow_sql_comments = off
disable_load_balance_on_write = 'transaction'
statement_level_load_balance = off

# - Master/Slave Mode -

master_slave_mode = on
master_slave_sub_mode = 'stream'

# - Streaming -

sr_check_period = '30'
sr_check_user = 'postgres'
sr_check_password = 'postgres'
sr_check_database = 'postgres'
delay_threshold = 10000000

# - Special commands -

follow_master_command = ''

# - Healthcheck -

health_check_period = '30'
health_check_timeout = '10'
health_check_user = 'postgres'
health_check_password = 'postgres'
health_check_database = ''
health_check_max_retries = '5'
health_check_retry_delay = '5'
connect_timeout = 10000

# - Failover -

failover_command = '/failover.sh %d %h %p %D %m %H %M %P %r %R %N %S'
failback_command = ''
failover_on_backend_error = 'on'
detach_false_primary = off
search_primary_node_timeout = '0'

# - Online Recovery -

recovery_user = 'nobody'
recovery_password = ''
recovery_1st_stage_command = 'recovery_1st_stage'
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0
auto_failback = off
auto_failback_interval = 60

# - Watchdog Settings -

use_watchdog = off
trusted_servers = ''
ping_path = '/bin'
wd_hostname = ''
wd_port = 9000
wd_priority = 1
wd_authkey = ''
wd_ipc_socket_dir = '/tmp'
delegate_IP = ''
if_cmd_path = '/sbin'
if_up_cmd = '/usr/bin/sudo /sbin/ip addr add $_IP_$/24 dev eth0 label eth0:0'
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/24 dev eth0'
arping_path = '/usr/sbin'
arping_cmd = '/usr/bin/sudo /usr/sbin/arping -U $_IP_$ -w 1 -I eth0'
clear_memqcache_on_escalation = on
wd_escalation_command = ''
wd_de_escalation_command = ''
failover_when_quorum_exists = off
failover_require_consensus = off
allow_multiple_failover_requests_from_node = off
enable_consensus_with_half_votes = off
wd_monitoring_interfaces_list = ''
wd_lifecheck_method = 'heartbeat'
wd_interval = 10
wd_heartbeat_port = 9694
wd_heartbeat_keepalive = 2
wd_heartbeat_deadtime = 30
heartbeat_destination0 = 'host0_ip1'
heartbeat_destination_port0 = 9694
heartbeat_device0 = ''
wd_life_point = 3
wd_lifecheck_query = 'SELECT 1'
wd_lifecheck_dbname = 'template1'
wd_lifecheck_user = 'nobody'
wd_lifecheck_password = ''

# - Other Settings -

relcache_expire = 0
relcache_size = 256
check_temp_table = catalog
check_unlogged_table = on
enable_shared_relcache = on
relcache_query_target = master
memory_cache_enabled = off
memqcache_method = 'shmem'
memqcache_memcached_host = 'localhost'
memqcache_memcached_port = 11211
memqcache_total_size = 67108864
memqcache_max_num_cache = 1000000
memqcache_expire = 0
memqcache_auto_cache_invalidation = on
memqcache_maxcache = 409600
memqcache_cache_block_size = 1048576
memqcache_oiddir = '/var/log/pgpool/oiddir'
white_memqcache_table_list = ''
black_memqcache_table_list = ''

mvineza

2020-04-01 13:46

reporter   ~0003292

The old pgpool logs are no longer there because our Kubernetes cluster has limits on log size. I will generate them again and post them here.

pengbo

2020-04-01 14:50

developer   ~0003293

I checked your pgpool.conf; it is fine.

> The old pgpool logs is no longer there because our kubernetes cluster has limits on the size. So I will generate again and put it here.

OK. I will wait for your logs.

mvineza

2020-04-01 17:19

reporter   ~0003294

There is confidential data in the logs, and it is hard to filter the output since there is so much of it. So I think this excerpt should work for our investigation:

2020-04-01 05:17:14: pid 11: DETAIL: connecting host=a.b.c.d port=44468
2020-04-01 05:17:14: pid 11: LOG: statement: SET TIME ZONE 'UTC'
2020-04-01 05:17:14: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:14: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:14: pid 11: LOG: statement: SELECT xyz
....
2020-04-01 05:17:15: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:15: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:15: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:15: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: INSERT INTO "main_activitystream" ... # --> the job ID will appear here
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: DECLARE xyz
2020-04-01 05:17:16: pid 11: LOG: statement: FETCH FORWARD 2000 FROM "_django_curs_140286434461504_333"
2020-04-01 05:17:16: pid 38: LOG: statement: SELECT xyz
2020-04-01 05:17:16: pid 11: LOG: statement: CLOSE "_django_curs_140286434461504_333"

But in general, once a job in AWX is launched, you will see something similar to the above when "log_statement" is set to "on".

With "log_statement = 'off'" and "child_max_connections = 1", the pgpool logs would look like this whenever a job is launched in AWX.

2020-04-01 08:15:02: pid 1: LOG: Backend status file /var/log/pgpool_status does not exist
2020-04-01 08:15:02: pid 1: LOG: memory cache initialized
2020-04-01 08:15:02: pid 1: DETAIL: memcache blocks :64
2020-04-01 08:15:02: pid 1: LOG: pool_discard_oid_maps: discarded memqcache oid maps
2020-04-01 08:15:02: pid 1: LOG: Setting up socket for 0.0.0.0:5432
2020-04-01 08:15:02: pid 1: LOG: Setting up socket for :::5432
2020-04-01 08:15:02: pid 1: LOG: find_primary_node_repeatedly: waiting for finding a primary node
2020-04-01 08:15:02: pid 1: LOG: find_primary_node: primary node is 0
2020-04-01 08:15:02: pid 1: LOG: find_primary_node: standby node is 1
2020-04-01 08:15:02: pid 40: LOG: PCP process: 40 started
2020-04-01 08:15:02: pid 1: LOG: pgpool-II successfully started. version 4.1.1 (karasukiboshi)
2020-04-01 08:15:02: pid 1: LOG: node status[0]: 1
2020-04-01 08:15:02: pid 1: LOG: node status[1]: 2
2020-04-01 08:15:37: pid 11: LOG: pool_reuse_block: blockid: 0
2020-04-01 08:15:37: pid 11: CONTEXT: while searching system catalog, When relcache is missed
2020-04-01 08:15:37: pid 11: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: child process with pid: 11 exits with status 256
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 44
2020-04-01 08:15:37: pid 17: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: child process with pid: 17 exits with status 256
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 45
2020-04-01 08:15:37: pid 14: LOG: child exiting, 1 connections reached
2020-04-01 08:15:37: pid 1: LOG: child process with pid: 14 exits with status 256
2020-04-01 08:15:37: pid 1: LOG: fork a new child process with pid: 46
2020-04-01 08:15:38: pid 44: LOG: child exiting, 1 connections reached
2020-04-01 08:15:38: pid 1: LOG: child process with pid: 44 exits with status 256
2020-04-01 08:15:38: pid 1: LOG: fork a new child process with pid: 47
2020-04-01 08:15:38: pid 23: LOG: child exiting, 1 connections reached
2020-04-01 08:15:38: pid 1: LOG: child process with pid: 23 exits with status 256
2020-04-01 08:15:38: pid 1: LOG: fork a new child process with pid: 48
2020-04-01 08:15:40: pid 22: LOG: child exiting, 1 connections reached
2020-04-01 08:15:40: pid 1: LOG: child process with pid: 22 exits with status 256
2020-04-01 08:15:40: pid 1: LOG: fork a new child process with pid: 49
2020-04-01 08:15:40: pid 47: LOG: child exiting, 1 connections reached
2020-04-01 08:15:40: pid 1: LOG: child process with pid: 47 exits with status 256
2020-04-01 08:15:40: pid 1: LOG: fork a new child process with pid: 50
2020-04-01 08:15:45: pid 26: LOG: child exiting, 1 connections reached
2020-04-01 08:15:45: pid 1: LOG: child process with pid: 26 exits with status 256
2020-04-01 08:15:45: pid 1: LOG: fork a new child process with pid: 51

But a missed job is entirely invisible in the pgpool logs. It seems the query never even reached pgpool, because I can't see the job ID at all.
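The log excerpts above all follow the `log_line_prefix = '%t: pid %p: '` format from the config. A small parser (a sketch assuming that prefix; not a pgpool utility) makes it easier to filter and group these lines per child process when checking whether a statement ever reached pgpool:

```python
import re

# Matches lines like: "2020-04-01 08:15:37: pid 11: LOG: child exiting, ..."
LINE = re.compile(
    r'^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}): '
    r'pid (?P<pid>\d+): (?P<level>\w+):\s+(?P<msg>.*)$'
)

def parse_pgpool_line(line):
    """Split one pgpool log line into (timestamp, pid, level, message),
    or return None if the line does not match the expected prefix."""
    m = LINE.match(line)
    if not m:
        return None
    return (m.group('ts'), int(m.group('pid')), m.group('level'), m.group('msg'))

line = "2020-04-01 08:15:37: pid 11: LOG: child exiting, 1 connections reached"
print(parse_pgpool_line(line))
# ('2020-04-01 08:15:37', 11, 'LOG', 'child exiting, 1 connections reached')
```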

pengbo

2020-04-07 15:54

developer   ~0003309

Thank you for the logs.

Yes. It seems pgpool did not receive the query.

Could you enable the following parameters and share the logs?

---------------------------------
log_statement = on
log_client_messages = on
---------------------------------

Issue History

Date Modified Username Field Change
2020-03-31 17:28 mvineza New Issue
2020-03-31 17:28 mvineza File Added: missed_awx_job.png
2020-04-01 09:19 pengbo Assigned To => pengbo
2020-04-01 09:19 pengbo Status new => assigned
2020-04-01 09:19 pengbo Description Updated View Revisions
2020-04-01 09:45 pengbo Note Added: 0003290
2020-04-01 10:41 pengbo Note Edited: 0003290 View Revisions
2020-04-01 13:43 mvineza Note Added: 0003291
2020-04-01 13:46 mvineza Note Added: 0003292
2020-04-01 14:50 pengbo Note Added: 0003293
2020-04-01 17:19 mvineza Note Added: 0003294
2020-04-07 15:54 pengbo Note Added: 0003309
2020-04-14 12:46 pengbo Status assigned => feedback