View Issue Details

ID: 0000795    Project: Pgpool-II    Category: Bug    View Status: public    Last Update: 2023-06-14 09:17
Reporter: Anton_Tupitsyn    Assigned To: t-ishii
Priority: normal    Severity: major    Reproducibility: random
Status: assigned    Resolution: open
Platform: OpenShift
Product Version: 4.4.2
Summary: 0000795: invalid memory alloc request size 18446744073709551592
Description:
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: I am 1105 accept fd 7
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.302: [unknown] pid 1105: DETAIL: Protocol Major: 1234 Minor: 5679 database: user:
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: selecting backend connection
2023-04-19 09:58:07.302: [unknown] pid 1105: DETAIL: SSLRequest from client
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: DateStyle value: ISO
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: TimeZone value: Etc/UTC
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: client_encoding value: UTF8
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: extra_float_digits value: 2
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: Protocol Major: 3 Minor: 0 database: excerpt user: excerpt_service_user
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: creating new connection to backend
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: connecting 0 backend
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: authentication backend
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: auth kind:5
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: authentication backend: 0
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: trying md5 authentication
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: performing md5 authentication
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: DB node id: 0 salt: b31d1d4
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"application_name" value:""
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"client_encoding" value:"UTF8"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"DateStyle" value:"ISO, MDY"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"default_transaction_read_only" value:"off"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"in_hot_standby" value:"off"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"integer_datetimes" value:"on"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"IntervalStyle" value:"postgres"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"is_superuser" value:"off"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"server_encoding" value:"UTF8"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"server_version" value:"14.3"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"session_authorization" value:"excerpt_service_user"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"standard_conforming_strings" value:"on"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"TimeZone" value:"Etc/UTC"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: authentication backend
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: cp->info[i]:0x7f884762fc08 pid:10047
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: sending backend key data
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: send pid 10047 to frontend
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: selecting load balance node
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: selected backend id is 0
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: Parse: statement name <>
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: invoking the minimal parser
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: username: "excerpt_service_user" database_name: "excerpt"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: query: "SELECT version()"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: `excerpt_service_userSELECT version()excerpt' -> `1a7e8e4add217fd8849a3e966a36bbb2'
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: fetching from cache storage
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: search key "1a7e8e4add217fd8849a3e966a36bbb2"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: ERROR: invalid memory alloc request size 18446744073709551592
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: invoking the minimal parser
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: username: "excerpt_service_user" database_name: "excerpt"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: query: "SELECT version()"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: `excerpt_service_userSELECT version()excerpt' -> `1a7e8e4add217fd8849a3e966a36bbb2'
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: fetching from cache storage
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: search key "1a7e8e4add217fd8849a3e966a36bbb2"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: RESET ALL CONFIG VARIABLES
Additional Information:
We are running dozens of Pgpool-II deployments across a number of OpenShift clusters using the pgpool/pgpool:4.4.2 Docker image. From time to time some of them enter a state in which they regularly return "invalid memory alloc" errors to clients. In this state the same query can complete successfully or fail with the error. After a pod restart the error goes away; the pool can then run for days until the problem recurs. This happens on all the clusters, and with almost all the deployments, at least once a week.

Please note that the size in the error is always 18446744073709551592, which is 0xFFFFFFFFFFFFFFE8 in hex, i.e. -24 reinterpreted as an unsigned 64-bit integer.

From looking at the pgpool sources, it seems like something goes wrong in pool_fetch_cache when it calculates the cache size via pool_get_item_shmem_cache.
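
For illustration, a negative length cast to an unsigned 64-bit allocation size reproduces the exact value from the error message. This is a hypothetical standalone snippet, not pgpool's actual code path:

/*
 * Hypothetical illustration: if a cached item length ends up negative
 * (here, 0 minus an assumed 24 bytes of bookkeeping) and is then cast
 * to an unsigned 64-bit size for allocation, the result is exactly
 * the value in the error message.
 */
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    int64_t  stored_len = 0;                    /* stale/garbage cached length */
    int64_t  data_len = stored_len - 24;        /* minus assumed header overhead */
    uint64_t alloc_size = (uint64_t) data_len;  /* what the allocator is asked for */

    printf("%llu\n", (unsigned long long) alloc_size);  /* prints 18446744073709551592 */
    return 0;
}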

The ConfigMap with pgpool.conf is attached.
A log excerpt for one pid is in the description above.
Tags: No tags attached.

Activities

Anton_Tupitsyn

2023-04-19 21:15

reporter  

operational-pool-configmap.yaml (1,702 bytes)   
apiVersion: v1
kind: ConfigMap
metadata:
  name: operational-pool-config
  labels:
    name: operational-pool-config
data:
  pgpool.conf: |-
    listen_addresses = '*'
    port = 5432
    # pool_passwd = /config/pool_passwd
    socket_dir = '/var/run/pgpool'
    pcp_listen_addresses = '*'
    pcp_port = 9898
    pcp_socket_dir = '/var/run/pgpool'
    backend_hostname0 = 'operational-primary'
    backend_port0 = 5432
    backend_weight0 = 0
    backend_flag0 = 'ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER'
    {{- $operational_replicas := .Values.global.crunchyPostgresOperator.instances.operational.replicas | int -}}
    {{- if gt $operational_replicas 1 }}
    backend_hostname1 = 'operational-replicas'
    backend_port1 = 5432
    backend_weight1 = 1
    backend_flag1 = 'DISALLOW_TO_FAILOVER'
    {{- else }}
    #backend_hostname1 = 'operational-replicas'
    #backend_port1 = 5432
    #backend_weight1 = 1
    #backend_flag1 = 'DISALLOW_TO_FAILOVER'
    {{- end }}
    sr_check_period = 0
    enable_pool_hba = off
    backend_clustering_mode = 'streaming_replication'
    num_init_children = 200
    max_pool = 1
    reserved_connections = 0
    child_life_time = 300
    child_max_connections = 0
    connection_life_time = 0
    client_idle_limit = 0
    connection_cache = on
    load_balance_mode = on
    statement_level_load_balance = off
    ssl = off
    failover_on_backend_error = off
    logging_collector = off
  pool_hba.conf: |-
    local   all         all                               trust
    host    all         all         127.0.0.1/32          trust
    host    all         all         ::1/128               trust
    host    all         all         0.0.0.0/0             md5

t-ishii

2023-04-20 09:19

developer   ~0004345

I suspect this is related to the bug found on Jan 30, after 4.4.2 was released:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=93f55858c6ba8b5243f5dc79808926961d78d826
You can try either of the following:
1) Grab the patch from the git repository, apply it, and build pgpool from source.
2) Turn off enable_shared_relcache. The affected code path is only taken when enable_shared_relcache is on, so I expect the bug to go away if the parameter is disabled.
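
For reference, option 2 is a one-line change in pgpool.conf (in the reporter's setup, inside the ConfigMap shown above):

# Workaround: disable the shared relation cache so relcache lookups
# bypass the affected shared-memory query cache path.
enable_shared_relcache = off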

Anton_Tupitsyn

2023-04-20 16:33

reporter   ~0004346

Thanks t-ishii.

Building from source doesn't fit well into our CI/CD process, so unfortunately I won't be able to test your patch at scale. I'll be able to do that once it lands in the pgpool/pgpool image.
As for turning off enable_shared_relcache, I had figured the same. We did that yesterday for some environments, and so far there have been no "invalid memory alloc" errors on them.

neverous

2023-05-16 06:33

reporter   ~0004380

I think I'm hitting the same issue. I was unable to reproduce it reliably with my config, but while troubleshooting I found another setting that seems to make it much easier to hit (it can still take a number of attempts), and the troubleshooting led me to what I think may be other missing shmem exclusive locks.

My reproduction steps are:
1. Run the attached docker compose file: docker compose up (or set up a default pgpool with one postgres and relcache_expire=1).
2. Run pgbench in a loop with any simple SELECT query, at least 2 clients running simultaneously, and a reconnect for each transaction (the more simultaneous new connections at the same time, the better it seems): pgbench -h localhost -U postgres postgres -f query.sql -C -c 2 -j 2 -t 10000
3. pgpool children start to die with segfaults or other weird memory/shmem-related errors (like the alloc size) and stay that way until a complete pgpool restart.
4. If not, retry from scratch; a loop like the sketch below automates this. (Quite often I seem to get a "lucky" memory layout and everything keeps working fine for a long while, but with pgpool/pgpool:4.4.2 roughly one in every 3 tries crashes children with segfaults this way.)
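
A minimal driver for the retries, assuming the attached compose stack is running and pgpool is mapped to localhost:5432 as in docker-compose.yml (the grep pattern used to detect a crash is an assumption):

# Hypothetical retry loop, not part of the attached repro files: rerun
# pgbench with reconnecting clients until a pgpool child has crashed.
while pgbench -h localhost -p 5432 -U postgres postgres \
              -f query.sql -C -c 2 -j 2 -t 10000
do
    # The pattern is an assumption; adjust it to whatever your crash logs show.
    if docker compose logs pgpool | grep -Eqi 'segfault|invalid memory alloc'
    then
        echo "pgpool children crashed; the state persists until a full restart"
        break
    fi
done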

As you mentioned, enable_shared_relcache=off seems to mitigate all of these issues for me.

Attached are docker-compose.yml and query.sql for reproduction; maybe they're useful.

While troubleshooting, I found the shared-memory optimizations in https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blobdiff;f=src/utils/pool_relcache.c;h=1b0324656a9522577f6dec8e82d7b73baa8c5f42;hp=6f577e6f7728f6f4afe1021dcc113c996f50b985;hb=dc559c07ee5affc7035efa6e0f00185e211079a0;hpb=f8fba590c2795aeec1afc5ef654000bd1332591e, which switched this code to a shared lock. But in that relcache case, pool_fetch_cache can occasionally hit a path that, if I read it correctly, writes to shared memory (exactly on cache expiry):
- pool_fetch_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l481
- pool_get_item_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2697
- pool_find_item_on_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2741
- pool_delete_item_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2758

It might be worth rechecking whether more of the places touched by https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=dc559c07ee5affc7035efa6e0f00185e211079a0 should actually have stayed exclusive locks (or perhaps extending the locking mechanism with some state so that writes can check for themselves whether the process holds an exclusive lock, and warn otherwise; see the sketch below).
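
A hypothetical sketch of such a guard (the names and per-process state are invented; this is not pgpool's actual lock API, just an outline of the idea):

/*
 * Each process records the shmem lock mode it currently holds (set from
 * the lock/unlock wrappers); code that writes to shared memory calls the
 * assert helper and gets a warning when no exclusive lock is held.
 */
#include <stdio.h>

typedef enum
{
    SHMEM_LOCK_NONE,
    SHMEM_LOCK_SHARED,
    SHMEM_LOCK_EXCLUSIVE
} ShmemLockMode;

static ShmemLockMode my_shmem_lock_mode = SHMEM_LOCK_NONE;  /* per-process */

static void
record_shmem_lock(ShmemLockMode mode)   /* call from the lock/unlock wrappers */
{
    my_shmem_lock_mode = mode;
}

static void
assert_shmem_exclusive(const char *func)    /* call before any shmem write */
{
    if (my_shmem_lock_mode != SHMEM_LOCK_EXCLUSIVE)
        fprintf(stderr, "WARNING: %s writes to shared memory without an exclusive lock\n",
                func);
}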

I might be wrong, though: testing now with a freshly built V4_4_STABLE (8c3250ed35cb1d9713cb2515d2ab8230c51fe3ac), I was unable to reproduce the issues. So it might just have been the fix you mentioned, and the writes I'm pointing out are somehow protected? (Or it just got much harder to trigger, or something else was also fixed in the meantime. By the way, when is the next official bugfix release expected to land?)
docker-compose.yml (678 bytes)   
services:
  postgres:
    image: postgres:14.8-alpine
    restart: always
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
      - POSTGRES_USERNAME=postgres
      - POSTGRES_PASSWORD=

  pgpool:
    image: pgpool/pgpool:4.4.2
    #image: pgpool:4.4.stable
    #build: pgpool
    restart: always
    ports:
      - 5432:9999
    environment:
      - PGPOOL_PARAMS_BACKEND_HOSTNAME0=postgres
      - POSTGRES_USERNAME=postgres
      - POSTGRES_PASSWORD=
      - PGPOOL_PARAMS_SR_CHECK_USER=postgres
      - PGPOOL_PARAMS_SR_CHECK_PASSWORD=
      - PGPOOL_PARAMS_RELCACHE_EXPIRE=1 # makes it much easier to trigger issues
      #- PGPOOL_PARAMS_ENABLE_SHARED_RELCACHE=off
query.sql (39 bytes)   
SELECT COUNT(*) FROM pg_stat_activity;

t-ishii

2023-05-17 09:23

developer   ~0004387

Thank you for the analysis. I will look into them.

> I might be wrong, though: testing now with a freshly built V4_4_STABLE (8c3250ed35cb1d9713cb2515d2ab8230c51fe3ac), I was unable to reproduce the issues. So it might just have been the fix you mentioned, and the writes I'm pointing out are somehow protected?

I believe the issue was fixed by this commit:
https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=93f55858c6ba8b5243f5dc79808926961d78d826

> when is the next official bugfix release expected to land?
Tomorrow! Stay tuned...
https://pgpool.net/mediawiki/index.php/Roadmap

t-ishii

2023-05-29 14:21

developer   ~0004392

Attached is a patch that tries to fix the bug you found. In pool_find_item_on_shmem_cache, I release the shared lock and acquire an exclusive lock before calling pool_delete_item_shmem_cache. Releasing and re-acquiring the lock creates a window, so I fetch the cache pointer again.
What do you think?
fix-query-cache.patch (1,603 bytes)   
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index 89510f73..c1ae800d 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -2734,12 +2734,37 @@ static POOL_CACHEID * pool_find_item_on_shmem_cache(POOL_QUERY_HASH * query_hash
 		now = time(NULL);
 		if (now > (cih->timestamp + cih->expire))
 		{
-			ereport(DEBUG1,
-					(errmsg("memcache finding item"),
-					 errdetail("cache expired: now: %ld timestamp: %ld",
-							   now, cih->timestamp + cih->expire)));
-			pool_delete_item_shmem_cache(c);
-			return NULL;
+			/*
+			 * We need to acquire an exclusive lock before removing the cache
+			 * entry.  Since a lock escalation from shared lock to exclusive
+			 * lock is not supported, we need to release the lock then acquire
+			 * an exclusive lock.
+			 */
+			pool_shmem_unlock();
+			pool_shmem_lock(POOL_MEMQ_EXCLUSIVE_LOCK);
+			/*
+			 * There is a window between pool_shmem_unlock() and
+			 * pool_shmem_lock().  We need to get POOL_CACHEID and
+			 * POOL_CACHE_ITEM_HEADER again because they could have been
+			 * modified by someone else.
+			 */
+			c = pool_hash_search(query_hash);
+			if (!c)
+			{
+				return NULL;
+			}
+
+			cih = item_header(block_address(c->blockid), c->itemid);
+			now = time(NULL);
+			if (now > (cih->timestamp + cih->expire))
+			{
+				ereport(DEBUG1,
+						(errmsg("memcache finding item"),
+						 errdetail("cache expired: now: %ld timestamp: %ld",
+								   now, cih->timestamp + cih->expire)));
+				pool_delete_item_shmem_cache(c);
+				return NULL;
+			}
 		}
 	}
 

neverous

2023-06-13 22:20

reporter   ~0004393

Looks good to me! And it seems to be working. As mentioned, I'm unable to cause the crash now even without this particular patch, but the changes make sense given my general understanding of what's happening here. (I also looked around the code, and I think this was the last place left with potentially loose shmem locks.)

t-ishii

2023-06-14 09:17

developer   ~0004394

Thanks for reviewing the patch. I will push this today.

Issue History

Date Modified Username Field Change
2023-04-19 21:15 Anton_Tupitsyn New Issue
2023-04-19 21:15 Anton_Tupitsyn File Added: operational-pool-configmap.yaml
2023-04-20 09:14 t-ishii Assigned To => t-ishii
2023-04-20 09:14 t-ishii Status new => assigned
2023-04-20 09:19 t-ishii Note Added: 0004345
2023-04-20 09:36 t-ishii Status assigned => feedback
2023-04-20 16:33 Anton_Tupitsyn Note Added: 0004346
2023-04-20 16:33 Anton_Tupitsyn Status feedback => assigned
2023-05-16 06:33 neverous Note Added: 0004380
2023-05-16 06:33 neverous File Added: docker-compose.yml
2023-05-16 06:33 neverous File Added: query.sql
2023-05-17 09:23 t-ishii Note Added: 0004387
2023-05-29 14:21 t-ishii Note Added: 0004392
2023-05-29 14:21 t-ishii File Added: fix-query-cache.patch
2023-06-13 22:20 neverous Note Added: 0004393
2023-06-14 09:17 t-ishii Note Added: 0004394