View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000795 | Pgpool-II | Bug | public | 2023-04-19 21:15 | 2023-06-14 09:17 |
| Reporter | Anton_Tupitsyn | Assigned To | t-ishii | ||
| Priority | normal | Severity | major | Reproducibility | random |
| Status | assigned | Resolution | open | ||
| Platform | OpenShift | ||||
| Product Version | 4.4.2 | ||||
| Summary | 0000795: invalid memory alloc request size 18446744073709551592 | ||||
| Description | Log excerpt for one pid (reformatted below, one entry per line). | ||||

```
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: I am 1105 accept fd 7
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.302: [unknown] pid 1105: DETAIL: Protocol Major: 1234 Minor: 5679 database: user:
2023-04-19 09:58:07.302: [unknown] pid 1105: DEBUG: selecting backend connection
2023-04-19 09:58:07.302: [unknown] pid 1105: DETAIL: SSLRequest from client
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: DateStyle value: ISO
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: TimeZone value: Etc/UTC
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: client_encoding value: UTF8
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: guc name: extra_float_digits value: 2
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: reading startup packet
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: Protocol Major: 3 Minor: 0 database: excerpt user: excerpt_service_user
2023-04-19 09:58:07.303: [unknown] pid 1105: DEBUG: creating new connection to backend
2023-04-19 09:58:07.303: [unknown] pid 1105: DETAIL: connecting 0 backend
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: authentication backend
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: auth kind:5
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: authentication backend: 0
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: trying md5 authentication
2023-04-19 09:58:07.307: [unknown] pid 1105: DEBUG: performing md5 authentication
2023-04-19 09:58:07.307: [unknown] pid 1105: DETAIL: DB node id: 0 salt: b31d1d4
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"application_name" value:""
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"client_encoding" value:"UTF8"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"DateStyle" value:"ISO, MDY"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"default_transaction_read_only" value:"off"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"in_hot_standby" value:"off"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"integer_datetimes" value:"on"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.309: [unknown] pid 1105: DETAIL: backend:0 name:"IntervalStyle" value:"postgres"
2023-04-19 09:58:07.309: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"is_superuser" value:"off"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"server_encoding" value:"UTF8"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"server_version" value:"14.3"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"session_authorization" value:"excerpt_service_user"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"standard_conforming_strings" value:"on"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: process parameter status
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: backend:0 name:"TimeZone" value:"Etc/UTC"
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: authentication backend
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: cp->info[i]:0x7f884762fc08 pid:10047
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: sending backend key data
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: send pid 10047 to frontend
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: selecting load balance node
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: selected backend id is 0
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: Parse: statement name <>
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: invoking the minimal parser
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: username: "excerpt_service_user" database_name: "excerpt"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: query: "SELECT version()"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: `excerpt_service_userSELECT version()excerpt' -> `1a7e8e4add217fd8849a3e966a36bbb2'
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: DEBUG: fetching from cache storage
2023-04-19 09:58:07.310: [unknown] pid 1105: DETAIL: search key "1a7e8e4add217fd8849a3e966a36bbb2"
2023-04-19 09:58:07.310: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.310: [unknown] pid 1105: ERROR: invalid memory alloc request size 18446744073709551592
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: invoking the minimal parser
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: username: "excerpt_service_user" database_name: "excerpt"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: query: "SELECT version()"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: memcache encode key
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: `excerpt_service_userSELECT version()excerpt' -> `1a7e8e4add217fd8849a3e966a36bbb2'
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: fetching from cache storage
2023-04-19 09:58:07.311: [unknown] pid 1105: DETAIL: search key "1a7e8e4add217fd8849a3e966a36bbb2"
2023-04-19 09:58:07.311: [unknown] pid 1105: CONTEXT: while searching system catalog, When relcache is missed
2023-04-19 09:58:07.311: [unknown] pid 1105: DEBUG: RESET ALL CONFIG VARIABLES
```
| Additional Information | We are running dozens of Pgpool deployments across a number of OpenShift clusters using the pgpool/pgpool:4.4.2 docker image. From time to time some of them enter a state in which they regularly return "invalid memory alloc" errors to clients. In this state the same query can finish successfully or fail with the error. After a pod restart the error goes away; it can then work for days until the problem reoccurs. This happens on all the clusters, with almost all deployments, at least once a week. Note that the size in the error is always 18446744073709551592, which is 0xFFFFFFFFFFFFFFE8 in hex. From looking at the pgpool sources, it seems something goes wrong when pool_fetch_cache calculates the cache size in pool_get_item_shmem_cache. The configmap with pgpool.conf is attached; the log excerpt for one pid is in the description. | ||||
| Tags | No tags attached. | ||||
|
|
operational-pool-configmap.yaml (1,702 bytes)
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: operational-pool-config
  labels:
    name: operational-pool-config
data:
  pgpool.conf: |-
    listen_addresses = '*'
    port = 5432
    # pool_passwd = /config/pool_passwd
    socket_dir = '/var/run/pgpool'
    pcp_listen_addresses = '*'
    pcp_port = 9898
    pcp_socket_dir = '/var/run/pgpool'
    backend_hostname0 = 'operational-primary'
    backend_port0 = 5432
    backend_weight0 = 0
    backend_flag0 = 'ALWAYS_PRIMARY|DISALLOW_TO_FAILOVER'
    {{- $operational_replicas := .Values.global.crunchyPostgresOperator.instances.operational.replicas | int -}}
    {{- if gt $operational_replicas 1 }}
    backend_hostname1 = 'operational-replicas'
    backend_port1 = 5432
    backend_weight1 = 1
    backend_flag1 = 'DISALLOW_TO_FAILOVER'
    {{- else }}
    #backend_hostname1 = 'operational-replicas'
    #backend_port1 = 5432
    #backend_weight1 = 1
    #backend_flag1 = 'DISALLOW_TO_FAILOVER'
    {{- end }}
    sr_check_period = 0
    enable_pool_hba = off
    backend_clustering_mode = 'streaming_replication'
    num_init_children = 200
    max_pool = 1
    reserved_connections = 0
    child_life_time = 300
    child_max_connections = 0
    connection_life_time = 0
    client_idle_limit = 0
    connection_cache = on
    load_balance_mode = on
    statement_level_load_balance = off
    ssl = off
    failover_on_backend_error = off
    logging_collector = off
  pool_hba.conf: |-
    local all all trust
    host all all 127.0.0.1/32 trust
    host all all ::1/128 trust
    host all all 0.0.0.0/0 md5
```
|
|
|
I suspect this is related to the bug found on Jan 30 after 4.2.2 was released: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=93f55858c6ba8b5243f5dc79808926961d78d826 You can try either: 1) grab the patch from the git repository, apply it, and build pgpool from the source code; or 2) turn off enable_shared_relcache. The failing code path is only taken when enable_shared_relcache is on, so I expect the bug to go away if the parameter is disabled. |
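The second workaround, expressed as a pgpool.conf fragment in the same style as the attached configmap (note the trade-off: with shared relcache off, each pgpool child keeps its own relation cache instead of sharing one in shmem, so system catalog lookups may be repeated per child):

```
enable_shared_relcache = off
```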
|
|
Thanks t-ishii. Building from source doesn't fit well into our CI/CD process, so I won't be able to test your patch at scale, unfortunately; I'll be able to once it lands in the pgpool/pgpool image. As for turning off enable_shared_relcache, I had figured the same. We did that yesterday for some environments, and so far there have been no "invalid memory alloc" errors on them. |
|
|
I think I'm hitting the same issue. I was unable to reliably reproduce it with my config, but while troubleshooting I found another setting that seems to make it much easier to hit (it can still take a bunch of attempts), and the troubleshooting led me to what I think may be some other missing shmem exclusive locks. My reproduction steps:

1. Run the attached docker compose: `docker compose up` (or set up a default pgpool with one postgres and relcache_expire=1).
2. Run pgbench with any simple SELECT query, with at least 2 clients simultaneously, in a loop, reconnecting for each transaction (the more simultaneous new connections at the same time, the better it seems): `pgbench -h localhost -U postgres postgres -f query.sql -C -c 2 -j 2 -t 10000`
3. pgpool children start to die with segfaults or other weird memory/shmem-related errors (like the alloc size) and stay that way until a complete pgpool restart.
4. If not, retry from scratch. (Quite often I seem to get a "lucky" memory layout and it keeps working fine for a long while, but with pgpool/pgpool:4.4.2 roughly one in every three tries crashes children with segfaults this way.)

As you mentioned, enable_shared_relcache=off seems to mitigate all the issues for me. Attached docker-compose.yml and query.sql for reproduction, in case they are useful.

While troubleshooting, I found the shared memory optimizations in https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blobdiff;f=src/utils/pool_relcache.c;h=1b0324656a9522577f6dec8e82d7b73baa8c5f42;hp=6f577e6f7728f6f4afe1021dcc113c996f50b985;hb=dc559c07ee5affc7035efa6e0f00185e211079a0;hpb=f8fba590c2795aeec1afc5ef654000bd1332591e switching to a shared lock, while on that relcache path pool_fetch_cache can occasionally hit code that, if I see correctly, writes to shared memory (exactly on cache expiry):

- pool_fetch_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l481
- pool_get_item_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2697
- pool_find_item_on_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2741
- pool_delete_item_shmem_cache: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=blob;f=src/query_cache/pool_memqcache.c;h=89510f73ee05cd398962d885bda9bac630274830;hb=HEAD#l2758

It might be worth rechecking whether more places from https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=dc559c07ee5affc7035efa6e0f00185e211079a0 shouldn't actually have stayed exclusive locks (or maybe extending the locking mechanism with flags so that writes can check whether the process holds an exclusive lock and warn otherwise). I might be wrong, though: testing now with a freshly built V4_4_STABLE (8c3250ed35cb1d9713cb2515d2ab8230c51fe3ac), I was unable to reproduce the issues, so it might just have been the fix you mentioned and the writes I'm pointing out are somehow protected? (Or it just got much harder to trigger, or something else was also fixed in the meantime. By the way, when is the next official bugfix release expected to land?)

docker-compose.yml (678 bytes)
```yaml
services:
  postgres:
    image: postgres:14.8-alpine
    restart: always
    environment:
      - POSTGRES_HOST_AUTH_METHOD=trust
      - POSTGRES_USERNAME=postgres
      - POSTGRES_PASSWORD=
  pgpool:
    image: pgpool/pgpool:4.4.2
    #image: pgpool:4.4.stable
    #build: pgpool
    restart: always
    ports:
      - 5432:9999
    environment:
      - PGPOOL_PARAMS_BACKEND_HOSTNAME0=postgres
      - POSTGRES_USERNAME=postgres
      - POSTGRES_PASSWORD=
      - PGPOOL_PARAMS_SR_CHECK_USER=postgres
      - PGPOOL_PARAMS_SR_CHECK_PASSWORD=
      - PGPOOL_PARAMS_RELCACHE_EXPIRE=1 # makes it much easier to trigger issues
      #- PGPOOL_PARAMS_ENABLE_SHARED_RELCACHE=off
```
|
|
|
Thank you for the analysis. I will look into them.

> I might be wrong though as testing now with freshly built V4_4_STABLE (8c3250ed35cb1d9713cb2515d2ab8230c51fe3ac), I was unable to reproduce the issues so it might have just been the fix you mentioned and the writes I'm pointing out are somehow protected?

I believe the issue was fixed by this commit: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commit;h=93f55858c6ba8b5243f5dc79808926961d78d826

> when next official bugfix release is expected to land?

Tomorrow! Stay tuned... https://pgpool.net/mediawiki/index.php/Roadmap |
|
|
Attached is a patch that tries to fix the bug you found. In pool_find_item_on_shmem_cache, I release the shared lock and acquire an exclusive lock before calling pool_delete_item_shmem_cache. Releasing and re-acquiring the lock creates a window, so I fetch the cache pointer again. What do you think?

fix-query-cache.patch (1,603 bytes)
```diff
diff --git a/src/query_cache/pool_memqcache.c b/src/query_cache/pool_memqcache.c
index 89510f73..c1ae800d 100644
--- a/src/query_cache/pool_memqcache.c
+++ b/src/query_cache/pool_memqcache.c
@@ -2734,12 +2734,37 @@ static POOL_CACHEID * pool_find_item_on_shmem_cache(POOL_QUERY_HASH * query_hash
 		now = time(NULL);
 		if (now > (cih->timestamp + cih->expire))
 		{
-			ereport(DEBUG1,
-					(errmsg("memcache finding item"),
-					 errdetail("cache expired: now: %ld timestamp: %ld",
-							   now, cih->timestamp + cih->expire)));
-			pool_delete_item_shmem_cache(c);
-			return NULL;
+			/*
+			 * We need to acquire an exclusive lock before removing the cache
+			 * entry. Since a lock escalation from shared lock to exclusive
+			 * lock is not supported, we need to release the lock then acquire
+			 * an exclusive lock.
+			 */
+			pool_shmem_unlock();
+			pool_shmem_lock(POOL_MEMQ_EXCLUSIVE_LOCK);
+			/*
+			 * There is a window between pool_shmem_unlock() and
+			 * pool_shmem_lock(). We need to get POOL_CACHEID and
+			 * POOL_CACHE_ITEM_HEADER again because they could have been
+			 * modified by someone else.
+			 */
+			c = pool_hash_search(query_hash);
+			if (!c)
+			{
+				return NULL;
+			}
+
+			cih = item_header(block_address(c->blockid), c->itemid);
+			now = time(NULL);
+			if (now > (cih->timestamp + cih->expire))
+			{
+				ereport(DEBUG1,
+						(errmsg("memcache finding item"),
+						 errdetail("cache expired: now: %ld timestamp: %ld",
+								   now, cih->timestamp + cih->expire)));
+				pool_delete_item_shmem_cache(c);
+				return NULL;
+			}
 		}
 	}
```
|
|
|
Looks good to me! And it seems to be working. As mentioned, I'm unable to cause the crash now even without this particular patch, but the change makes sense in my general understanding of what's happening here. (I also looked around the code, and I think this was the last place left with potentially loose shmem locks.) |
|
|
Thanks for reviewing the patch. I will push this today. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2023-04-19 21:15 | Anton_Tupitsyn | New Issue | |
| 2023-04-19 21:15 | Anton_Tupitsyn | File Added: operational-pool-configmap.yaml | |
| 2023-04-20 09:14 | t-ishii | Assigned To | => t-ishii |
| 2023-04-20 09:14 | t-ishii | Status | new => assigned |
| 2023-04-20 09:19 | t-ishii | Note Added: 0004345 | |
| 2023-04-20 09:36 | t-ishii | Status | assigned => feedback |
| 2023-04-20 16:33 | Anton_Tupitsyn | Note Added: 0004346 | |
| 2023-04-20 16:33 | Anton_Tupitsyn | Status | feedback => assigned |
| 2023-05-16 06:33 | neverous | Note Added: 0004380 | |
| 2023-05-16 06:33 | neverous | File Added: docker-compose.yml | |
| 2023-05-16 06:33 | neverous | File Added: query.sql | |
| 2023-05-17 09:23 | t-ishii | Note Added: 0004387 | |
| 2023-05-29 14:21 | t-ishii | Note Added: 0004392 | |
| 2023-05-29 14:21 | t-ishii | File Added: fix-query-cache.patch | |
| 2023-06-13 22:20 | neverous | Note Added: 0004393 | |
| 2023-06-14 09:17 | t-ishii | Note Added: 0004394 |