View Issue Details

ID: 0000795    Project: Pgpool-II    Category: Bug    View Status: public    Last Update: 2023-06-14 09:17
Reporter: Anton_Tupitsyn    Assigned To: t-ishii
Priority: normal    Severity: major    Reproducibility: random
Status: assigned    Resolution: open
Platform: OpenShift
Product Version: 4.4.2
Summary: 0000795: invalid memory alloc request size 18446744073709551592
Description:
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: I am 1105 accept fd 7
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.302: [unknown] pid 1105: DETAIL: Protocol Major: 1234 Minor: 5679 database: user:
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: selecting backend connection
2023-04-19 09:58:07.302: [unknown] pid 1105: DETAIL: SSLRequest from client
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: DateStyle value: ISO
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: TimeZone value: Etc/UTC
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: client_encoding value: UTF8
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: extra_float_digits value: 2
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: Protocol Major: 3 Minor: 0 database: excerpt user: excerpt_service_user
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: creating new connection to backend
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: connecting 0 backend
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: authentication backend
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: auth kind:5
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: authentication backend: 0
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: trying md5 authentication
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: performing md5 authentication
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: DB node id: 0 salt: b31d1d4
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"application_name" value:""
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"client_encoding" value:"UTF8"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"DateStyle" value:"ISO, MDY"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"default_transaction_read_only" value:"off"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"in_hot_standby" value:"off"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"integer_datetimes" value:"on"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"IntervalStyle" value:"postgres"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"is_superuser" value:"off"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"server_encoding" value:"UTF8"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"server_version" value:"14.3"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"session_authorization" value:"excerpt_service_user"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"standard_conforming_strings" value:"on"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"TimeZone" value:"Etc/UTC"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: authentication backend
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: cp->info[i]:0x7f884762fc08 pid:10047
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: sending backend key data
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: send pid 10047 to frontend
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: selecting load balance node
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: selected backend id is 0
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: Parse: statement name <>
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: invoking the minimal parser
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: username: "excerpt_service_user" database_name: "excerpt"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: query: "SELECT version()"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: `excerpt_service_userSELECT version()excerpt' -> `1a7e8e4add217fd8849a3e966a36bbb2'
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: fetching from cache storage
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: search key "1a7e8e4add217fd8849a3e966a36bbb2"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: ERROR: invalid memory alloc request size 18446744073709551592
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: invoking the minimal parser
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: username: "excerpt_service_user" database_name: "excerpt"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: query: "SELECT version()"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: `excerpt_service_userSELECT version()excerpt' -> `1a7e8e4add217fd8849a3e966a36bbb2'
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: fetching from cache storage
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: search key "1a7e8e4add217fd8849a3e966a36bbb2"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: RESET ALL CONFIG VARIABLES
Additional Information:
We are running dozens of Pgpool-II deployments across a number of OpenShift clusters using the pgpool/pgpool:4.4.2 Docker image. From time to time some of them enter a state in which they regularly return "invalid memory alloc" errors to clients. In this state the same query can complete successfully or fail with the error. After a pod restart the error goes away; the pool can then run for days until the problem recurs. This happens on all the clusters, and with almost all the deployments, at least once a week.

Please note that the size in the error is always 18446744073709551592, which is 0xFFFFFFFFFFFFFFE8 in hex, i.e. -24 reinterpreted as an unsigned 64-bit integer.

From looking at the pgpool sources, it seems like something goes wrong in pool_fetch_cache when it calculates the cache size via pool_get_item_shmem_cache.
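
For illustration, a negative length cast to an unsigned 64-bit allocation size reproduces the exact value from the error message. This is a hypothetical standalone snippet, not pgpool's actual code path:

/*
 * Hypothetical illustration: if a cached item length ends up negative
 * (here, 0 minus an assumed 24 bytes of bookkeeping) and is then cast
 * to an unsigned 64-bit size for allocation, the result is exactly
 * the value in the error message.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t  stored_len = 0;                    /* stale/garbage cached length */
    int64_t  data_len = stored_len - 24;        /* minus assumed header overhead */
    uint64_t alloc_size = (uint64_t) data_len;  /* what the allocator is asked for */

    printf("%llu\n", (unsigned long long) alloc_size);  /* prints 18446744073709551592 */
    return 0;
}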

The ConfigMap with pgpool.conf is attached.
A log excerpt for one pid is in the description above.
Tags: No tags attached.

Activities

Anton_Tupitsyn

2023-04-19 21:15

reporter  

operational-pool-configmap.yaml (1,702 bytes)   
apiVersion: v1
kind: ConfigMap
metadata:
  name: operational-pool-config
  labels:
    name: operational-pool-config
data:
  pgpool.conf: |-
    listen_addresses = '*'
    port = 5432
    # pool_passwd = /config/pool_passwd
    socket_dir = '/var/run/pgpool'
    pcp_listen_addresses = '*'
    pcp_port = 9898
    pcp_socket_dir = '/var/run/pgpool'
    backend_hostname0 = 'operational-primary'
    backend_port0 = 5432
    backend_weight0 = 0
    backend_flag0 = 'ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER'
    {{- $operational_replicas := .Values.global.crunchyPostgresOperator.instances.operational.replicas | int -}}
    {{- if gt $operational_replicas 1 }}
    backend_hostname1 = 'operational-replicas'
    backend_port1 = 5432
    backend_weight1 = 1
    backend_flag1 = 'DISALLOW_TO_FAILOVER'
    {{- else }}
    #backend_hostname1 = 'operational-replicas'
    #backend_port1 = 5432
    #backend_weight1 = 1
    #backend_flag1 = 'DISALLOW_TO_FAILOVER'
    {{- end }}
    sr_check_period = 0
    enable_pool_hba = off
    backend_clustering_mode = 'streaming_replication'
    num_init_children = 200
    max_pool = 1
    reserved_connections = 0
    child_life_time = 300
    child_max_connections = 0
    connection_life_time = 0
    client_idle_limit = 0
    connection_cache = on
    load_balance_mode = on
    statement_level_load_balance = off
    ssl = off
    failover_on_backend_error = off
    logging_collector = off
  pool_hba.conf: |-
    local   all         all                               trust
    host    all         all         127.0.0.1/32          trust
    host    all         all         ::1/128               trust
    host    all         all         0.0.0.0/0             md5

t-ishii

2023-04-20 09:19

developer   ~0004345

I suspect this is related to the bug found on Jan 30, after 4.4.2 was released:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=93f55858c6ba8b5243f5dc79808926961d78d826
You can try either of the following:
1) Grab the patch from the git repository, apply it, and build pgpool from source.
2) Turn off enable_shared_relcache. The affected code path is only taken when enable_shared_relcache is on, so I expect the bug to go away if the parameter is disabled.
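
For reference, option 2 is a one-line change in pgpool.conf (in the reporter's setup, inside the ConfigMap shown above):

# Workaround: disable the shared relation cache so relcache lookups
# bypass the affected shared-memory query cache path.
enable_shared_relcache = off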

Anton_Tupitsyn

2023-04-20 16:33

reporter   ~0004346

Thanks t-ishii.

Building from source doesn't fit well into our CI/CD process, so unfortunately I won't be able to test your patch at scale. I'll be able to do that once it lands in the pgpool/pgpool image.
As for turning off enable_shared_relcache, I had figured the same. We did that yesterday for some environments, and so far there have been no "invalid memory alloc" errors on them.

neverous

2023-05-16 06:33

reporter   ~0004380

I think I'm hitting the same issue. I was unable to reproduce it reliably with my config, but while troubleshooting I found another setting that seems to make it much easier to hit (it can still take a number of attempts), and the troubleshooting led me to what I think may be other missing shmem exclusive locks.

My reproduction steps are:
1. Run the attached docker compose file: docker compose up (or set up a default pgpool with one postgres and relcache_expire=1).
2. Run pgbench in a loop with any simple SELECT query, at least 2 clients running simultaneously, and a reconnect for each transaction (the more simultaneous new connections at the same time, the better it seems): pgbench -h localhost -U postgres postgres -f query.sql -C -c 2 -j 2 -t 10000
3. pgpool children start to die with segfaults or other weird memory/shmem-related errors (like the alloc size) and stay that way until a complete pgpool restart.
4. If not, retry from scratch; a loop like the sketch below automates this. (Quite often I seem to get a "lucky" memory layout and everything keeps working fine for a long while, but with pgpool/pgpool:4.4.2 roughly one in every 3 tries crashes children with segfaults this way.)
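
A minimal driver for the retries, assuming the attached compose stack is running and pgpool is mapped to localhost:5432 as in docker-compose.yml (the grep pattern used to detect a crash is an assumption):

# Hypothetical retry loop, not part of the attached repro files: rerun
# pgbench with reconnecting clients until a pgpool child has crashed.
while pgbench -h localhost -p 5432 -U postgres postgres \
              -f query.sql -C -c 2 -j 2 -t 10000
do
    # The pattern is an assumption; adjust it to whatever your crash logs show.
    if docker compose logs pgpool | grep -Eqi 'segfault|invalid memory alloc'
    then
        echo "pgpool children crashed; the state persists until a full restart"
        break
    fi
done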

As you mentioned, enable_shared_relcache=off seems to mitigate all of these issues for me.

Attached are docker-compose.yml and query.sql for reproduction; maybe they're useful.

While troubleshooting, I found the shared-memory optimizations in https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blobdiff;f=src/utils/pool_relcache.c;h=1b0324656a9522577f6dec8e82d7b73baa8c5f42;hp=6f577e6f7728f6f4afe1021dcc113c996f50b985;hb=dc559c07ee5affc7035efa6e0f00185e211079a0;hpb=f8fba590c2795aeec1afc5ef654000bd1332591e, which switched this code to a shared lock. But in that relcache case, pool_fetch_cache can occasionally hit a path that, if I read it correctly, writes to shared memory (exactly on cache expiry):
- pool_fetch_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l481
- pool_get_item_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2697
- pool_find_item_on_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2741
- pool_delete_item_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2758

It might be worth rechecking whether more of the places touched by https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=dc559c07ee5affc7035efa6e0f00185e211079a0 should actually have stayed exclusive locks (or perhaps extending the locking mechanism with some state so that writes can check for themselves whether the process holds an exclusive lock, and warn otherwise; see the sketch below).
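
A hypothetical sketch of such a guard (the names and per-process state are invented; this is not pgpool's actual lock API, just an outline of the idea):

/*
 * Each process records the shmem lock mode it currently holds (set from
 * the lock/unlock wrappers); code that writes to shared memory calls the
 * assert helper and gets a warning when no exclusive lock is held.
 */
#include <stdio.h>

typedef enum
{
    SHMEM_LOCK_NONE,
    SHMEM_LOCK_SHARED,
    SHMEM_LOCK_EXCLUSIVE
} ShmemLockMode;

static ShmemLockMode my_shmem_lock_mode = SHMEM_LOCK_NONE;  /* per-process */

static void
record_shmem_lock(ShmemLockMode mode)   /* call from the lock/unlock wrappers */
{
    my_shmem_lock_mode = mode;
}

static void
assert_shmem_exclusive(const char *func)    /* call before any shmem write */
{
    if (my_shmem_lock_mode != SHMEM_LOCK_EXCLUSIVE)
        fprintf(stderr, "WARNING: %s writes to shared memory without an exclusive lock\n",
                func);
}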

I might be wrong, though: testing now with a freshly built V4_4_STABLE (8c3250ed35cb1d9713cb2515d2ab8230c51fe3ac), I was unable to reproduce the issues. So it might just have been the fix you mentioned, and the writes I'm pointing out are somehow protected? (Or it just got much harder to trigger, or something else was also fixed in the meantime. By the way, when is the next official bugfix release expected to land?)
docker-compose.yml (678 bytes)   
services:
  postgres:
    image: postgres:14.8-alpine
    restart: always
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
      - POSTGRES_USERNAME=postgres
      - POSTGRES_PASSWORD=

  pgpool:
    image: pgpool/pgpool:4.4.2
    #image: pgpool:4.4.stable
    #build: pgpool
    restart: always
    ports:
      - 5432:9999
    environment:
      - PGPOOL_PARAMS_BACKEND_HOSTNAME0=postgres
      - POSTGRES_USERNAME=postgres
      - POSTGRES_PASSWORD=
      - PGPOOL_PARAMS_SR_CHECK_USER=postgres
      - PGPOOL_PARAMS_SR_CHECK_PASSWORD=
      - PGPOOL_PARAMS_RELCACHE_EXPIRE=1 # makes it much easier to trigger issues
      #- PGPOOL_PARAMS_ENABLE_SHARED_RELCACHE=off
query.sql (39 bytes)   
SELECT COUNT(*) FROM pg_stat_activity;

t-ishii

2023-05-17 09:23

developer   ~0004387

Thank you for the analysis. I will look into them.

> I might be wrong, though: testing now with a freshly built V4_4_STABLE (8c3250ed35cb1d9713cb2515d2ab8230c51fe3ac), I was unable to reproduce the issues. So it might just have been the fix you mentioned, and the writes I'm pointing out are somehow protected?

I believe the issue was fixed by this commit:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=93f55858c6ba8b5243f5dc79808926961d78d826

> when is the next official bugfix release expected to land?
Tomorrow! Stay tuned...
https://pgpool.net/mediawiki/index.php/Roadmap

t-ishii

2023-05-29 14:21

developer   ~0004392

Attached is a patch that tries to fix the bug you found. In pool_find_item_on_shmem_cache, I release the shared lock and acquire an exclusive lock before calling pool_delete_item_shmem_cache. Releasing and re-acquiring the lock creates a window, so I fetch the cache pointer again.
What do you think?
fix-query-cache.patch (1,603 bytes)   
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index 89510f73..c1ae800d 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -2734,12 +2734,37 @@ static POOL_CACHEID * pool_find_item_on_shmem_cache(POOL_QUERY_HASH * query_hash
 		now = time(NULL);
 		if (now > (cih->timestamp + cih->expire))
 		{
-			ereport(DEBUG1,
-					(errmsg("memcache finding item"),
-					 errdetail("cache expired: now: %ld timestamp: %ld",
-							   now, cih->timestamp + cih->expire)));
-			pool_delete_item_shmem_cache(c);
-			return NULL;
+			/*
+			 * We need to acquire an exclusive lock before removing the cache
+			 * entry.  Since a lock escalation from shared lock to exclusive
+			 * lock is not supported, we need to release the lock then acquire
+			 * an exclusive lock.
+			 */
+			pool_shmem_unlock();
+			pool_shmem_lock(POOL_MEMQ_EXCLUSIVE_LOCK);
+			/*
+			 * There is a window between pool_shmem_unlock() and
+			 * pool_shmem_lock().  We need to get POOL_CACHEID and
+			 * POOL_CACHE_ITEM_HEADER again because they could have been
+			 * modified by someone else.
+			 */
+			c = pool_hash_search(query_hash);
+			if (!c)
+			{
+				return NULL;
+			}
+
+			cih = item_header(block_address(c->blockid), c->itemid);
+			now = time(NULL);
+			if (now > (cih->timestamp + cih->expire))
+			{
+				ereport(DEBUG1,
+						(errmsg("memcache finding item"),
+						 errdetail("cache expired: now: %ld timestamp: %ld",
+								   now, cih->timestamp + cih->expire)));
+				pool_delete_item_shmem_cache(c);
+				return NULL;
+			}
 		}
 	}
 

neverous

2023-06-13 22:20

reporter   ~0004393

Looks good to me! And it seems to be working. As mentioned, I'm unable to cause the crash now even without this particular patch, but the changes make sense given my general understanding of what's happening here. (I also looked around the code, and I think this was the last place left with potentially loose shmem locks.)

t-ishii

2023-06-14 09:17

developer   ~0004394

Thanks for reviewing the patch. I will push this today.

Issue History

Date Modified Username Field Change
2023-04-19 21:15 Anton_Tupitsyn New Issue
2023-04-19 21:15 Anton_Tupitsyn File Added: operational-pool-configmap.yaml
2023-04-20 09:14 t-ishii Assigned To => t-ishii
2023-04-20 09:14 t-ishii Status new => assigned
2023-04-20 09:19 t-ishii Note Added: 0004345
2023-04-20 09:36 t-ishii Status assigned => feedback
2023-04-20 16:33 Anton_Tupitsyn Note Added: 0004346
2023-04-20 16:33 Anton_Tupitsyn Status feedback => assigned
2023-05-16 06:33 neverous Note Added: 0004380
2023-05-16 06:33 neverous File Added: docker-compose.yml
2023-05-16 06:33 neverous File Added: query.sql
2023-05-17 09:23 t-ishii Note Added: 0004387
2023-05-29 14:21 t-ishii Note Added: 0004392
2023-05-29 14:21 t-ishii File Added: fix-query-cache.patch
2023-06-13 22:20 neverous Note Added: 0004393
2023-06-14 09:17 t-ishii Note Added: 0004394