[pgpool-general: 9032] Very high numbers of segfaults in PgPool-II node

Ian van der Linde ian at ivdl.co.za
Thu Feb 29 22:47:31 JST 2024


Good day

We run three nodes of PgPool 4.4.2 and three nodes of PostgreSQL 14. On Friday last week we started seeing large numbers of the following messages in dmesg on the PgPool node holding the VIP:

[11212159.856014] pgpool[342760]: segfault at 0 ip 000000000044c686 sp 00007ffd5329f080 error 4 in pgpool[400000+21d000]
[11212159.856026] Code: 34 48 01 c3 8b 43 10 85 c0 0f 8f ad 00 00 00 89 35 8f b9 3e 00 bf 00 80 83 00 be 04 80 83 00 89 0d 83 b9 3e 00 e8 aa bc ff ff <8b> 18 48 83 c0 18 49 89 c6 8d 7b e8 0f 84 4d 01 00 00 48 63 ff 48
[11212160.122781] pgpool[346613]: segfault at 0 ip 000000000044c686 sp 00007ffd532a1590 error 4 in pgpool[400000+21d000]
[11212160.122790] Code: 34 48 01 c3 8b 43 10 85 c0 0f 8f ad 00 00 00 89 35 8f b9 3e 00 bf 00 80 83 00 be 04 80 83 00 89 0d 83 b9 3e 00 e8 aa bc ff ff <8b> 18 48 83 c0 18 49 89 c6 8d 7b e8 0f 84 4d 01 00 00 48 63 ff 48
[11212160.260834] pgpool[345943]: segfault at 0 ip 000000000044c686 sp 00007ffd5329f080 error 4 in pgpool[400000+21d000]
[11212160.260845] Code: 34 48 01 c3 8b 43 10 85 c0 0f 8f ad 00 00 00 89 35 8f b9 3e 00 bf 00 80 83 00 be 04 80 83 00 89 0d 83 b9 3e 00 e8 aa bc ff ff <8b> 18 48 83 c0 18 49 89 c6 8d 7b e8 0f 84 4d 01 00 00 48 63 ff 48
...
Many more of these have been trimmed; there were probably hundreds or thousands per minute.
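
For reference, the fields in these kernel lines can be decoded mechanically. The following is a minimal sketch in Python, assuming the usual x86-64 Linux "segfault at ... ip ... sp ... error ... in name[base+size]" layout and the standard x86 page-fault error-code bits; the function and key names are my own and purely illustrative.

```python
import re

# One kernel line from the dmesg output above, copied verbatim.
LINE = ("pgpool[342760]: segfault at 0 ip 000000000044c686 "
        "sp 00007ffd5329f080 error 4 in pgpool[400000+21d000]")

# All numeric fields in this record are printed in hexadecimal.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Split a segfault record into its fields and decode the error code."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("no segfault record found in line")
    err = int(m["err"], 16)
    ip = int(m["ip"], 16)
    base = int(m["base"], 16)
    return {
        "object": m["obj"],
        "fault_addr": int(m["addr"], 16),
        "ip": ip,
        # Position of the faulting instruction inside the mapped object.
        "offset_in_mapping": ip - base,
        # x86 page-fault error code: bit 0 = protection violation (else page
        # not present), bit 1 = write (else read), bit 2 = user mode,
        # bit 4 = instruction fetch.
        "cause": "protection violation" if err & 0x1 else "page not present",
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
        "instruction_fetch": bool(err & 0x10),
    }


if __name__ == "__main__":
    for key, value in decode_segfault(LINE).items():
        print(f"{key}: {value}")
```

Read this way, "segfault at 0 ... error 4" is a user-mode read of an unmapped page at address 0, which is what a NULL pointer dereference looks like, and every quoted line faults at the same instruction pointer.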

When trying to query the pool nodes from another machine, we saw the following:

WARNING:  error while getting cache item header, invalid item id: 30
SSL SYSCALL error: EOF detected
connection to server was lost

The PgPool logs themselves looked fairly normal during this time, although none of the applications were working correctly:

...
child pid 232684: LOCATION:  pool_auth.c:1502
child pid 234178: LOG:  SSL certificate authentication for user "<redacted>" with Pgpool-II is successful
child pid 234178: LOCATION:  pool_auth.c:1502
main pid 2029505: LOG:  reaper handler
main pid 2029505: LOCATION:  pgpool_main.c:1825
main pid 2029505: LOG:  reaper handler: exiting normally
main pid 2029505: LOCATION:  pgpool_main.c:2045
[unknown] pid 232762: LOG:  frontend disconnection: session time: 0:00:28.524 user=<redacted> database=<redacted> host=<redacted> port=46650
[unknown] pid 232762: LOCATION:  child.c:2086
main pid 2029505: LOG:  reaper handler
main pid 2029505: LOCATION:  pgpool_main.c:1825
main pid 2029505: LOG:  reaper handler: exiting normally
main pid 2029505: LOCATION:  pgpool_main.c:2045
child pid 229847: LOG:  new connection received
child pid 229847: DETAIL:  connecting host=localhost port=53120
child pid 229847: LOCATION:  child.c:1870
child pid 234552: LOG:  new connection received
child pid 234552: DETAIL:  connecting host=<redacted> port=49158
child pid 234552: LOCATION:  child.c:1870
...

After a restart of the broken PgPool node, the services all started working again. Afterwards, I tried to determine the cause by resolving the faulting instruction pointer against the binary:

# file /bin/pgpool
/bin/pgpool: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=49218c772bf871538ab5a8f37ab4383b482eb1b9, stripped
# addr2line -e /bin/pgpool 000000000044c686
??:0

Unfortunately, the binaries we are using are stripped, so addr2line cannot resolve the address. Given this, is there any way to determine what happened? I have read that the "Code:" lines can be used to recover the bytes surrounding the failing instruction; would that be helpful in this case? This is the first time it has happened, and we cannot correlate the event with any changes or external factors. Is there some way to track down such a transient issue?
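
To illustrate the "Code:" approach, here is a rough sketch in Python that turns one of the lines above into a raw binary file for a disassembler. It assumes the byte wrapped in <..> is the first byte of the faulting instruction (the usual kernel convention); the output file name pgpool_code.bin and the function name are hypothetical, and the printed objdump command is only a suggestion.

```python
# One "Code:" line from the dmesg output above, copied verbatim.
CODE_LINE = (
    "34 48 01 c3 8b 43 10 85 c0 0f 8f ad 00 00 00 89 35 8f b9 3e 00 "
    "bf 00 80 83 00 be 04 80 83 00 89 0d 83 b9 3e 00 e8 aa bc ff ff "
    "<8b> 18 48 83 c0 18 49 89 c6 8d 7b e8 0f 84 4d 01 00 00 48 63 ff 48"
)

# Instruction pointer reported in the matching "segfault at" line.
FAULT_IP = 0x44C686


def parse_code_line(text):
    """Return (raw bytes, index of the byte that was marked with <..>)."""
    data = bytearray()
    fault_index = None
    for token in text.split():
        if token.startswith("<") and token.endswith(">"):
            fault_index = len(data)
            token = token[1:-1]
        data.append(int(token, 16))
    if fault_index is None:
        raise ValueError("no <..> marker found in Code: line")
    return bytes(data), fault_index


if __name__ == "__main__":
    blob, fault_index = parse_code_line(CODE_LINE)
    # Address of the first dumped byte, so the listing shows real addresses.
    start = FAULT_IP - fault_index
    with open("pgpool_code.bin", "wb") as fh:  # hypothetical file name
        fh.write(blob)
    print(f"faulting instruction starts at byte {fault_index} of {len(blob)}")
    print("objdump -D -b binary -m i386:x86-64 "
          f"--adjust-vma={start:#x} pgpool_code.bin")
```

Because the dump starts at an arbitrary byte, the first few disassembled instructions may be misaligned; the part from the marked byte onward is the reliable one. In this dump the marked bytes 8b 18 appear to encode a 32-bit load through %rax, and the five bytes before them (e8 aa bc ff ff) look like a relative call, which would fit a function returning NULL that is then dereferenced. That reading should be confirmed with the disassembler, ideally against an unstripped build or matching debug symbols.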

Thank you for your time.

Kind regards
Ian van der Linde

