[pgpool-general: 9506] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

Wed Jun 4 09:18:20 JST 2025

Hi,

> Hi, Gave it a try, although we went with option #2 (RemoveIPC=no) as
> it was quicker to do.  After 20+ minutes, the semaphores are still
> present and there are no errors in the log.  Tomorrow I will discuss
> with our sysadmins how we can implement option #1, as the UID of
> postgres user is above SYS_UID_MAX.  Maybe we won't go through
> changing postgres user's UID, changing the SYS_UID_MAX to higher
> number might be ok as we only have a couple users above that number.

Thank you for sharing your thought.

> Also, I forgot to mention - our repo showed the newer (4.6.2)
> version as available so I upgraded the packages and restarted
> pgpool, but that ipv6 error at start up is still present.  It's more
> of a nuisance more than anything else, but figured I would let you
> know: 2025-06-03 18:56:39: pid 1950700: WARNING: failed to create
> watchdog heartbeat receive socket as much intended 2025-06-03
> 18:56:39: pid 1950700: DETAIL: only 1 out of 2 socket(s) had been
> created 2025-06-03 18:56:39: pid 1950700: LOCATION:
> wd_heartbeat.c:428 2025-06-03 18:56:39: pid 1950704: LOG: failed to
> create watchdog heartbeat receive socket 2025-06-03 18:56:39: pid
> 1950704: LOCATION: wd_heartbeat.c:408 2025-06-03 18:56:39: pid
> 1950704: WARNING: failed to create watchdog heartbeat receive socket
> as much intended 2025-06-03 18:56:39: pid 1950704: DETAIL: only 1
> out of 2 socket(s) had been created 2025-06-03 18:56:39: pid
> 1950704: LOCATION: wd_heartbeat.c:428 Earlier, when I had debugging
> enabled, it was still showing unable to create this socket:
> 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: creating
> watchdog heartbeat receive socket.  2025-06-03 11:57:24.833:
> heart_beat_receiver pid 4005: DETAIL: creating socket on
> 127.0.0.1:9694 2025-06-03 11:57:24.833: heart_beat_receiver pid
> 4005: LOCATION: wd_heartbeat.c:385 2025-06-03 11:57:24.833:
> heart_beat_receiver pid 4005: LOG: set SO_REUSEPORT option to the
> socket 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005:
> LOCATION: wd_heartbeat.c:891 2025-06-03 11:57:24.833:
> heart_beat_receiver pid 4005: LOG: creating watchdog heartbeat
> receive socket.  2025-06-03 11:57:24.833: heart_beat_receiver pid
> 4005: DETAIL: *creating socket on ::1:9694* 2025-06-03 11:57:24.833:
> heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:385
> 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: failed
> to create watchdog heartbeat receive socket. retrying...  2025-06-03
> 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: bind failed with
> reason: "Cannot assign requested address" Doesn't seem to affect the
> pgpool functionality, I'd prefer to disable it altogether if
> possible to have 'clean' logs, if possible.  So far, tried to change
> listen_addresses and pcp_listen_addresses in pgpool.conf from "*" to
> "0.0.0.0" but that didn't resolve it.

pcp_listen_addresses is used for pcp commands only, not relevant to
heartbeart.

BTW, I think I found possible cause of heartbeat problem: before 4.6,
the bind address of heartbeat receive socket was specified as
INADDR_ANY, which binds all available network devices. But from 4.6.0,
it only binds to loopback address (127.0.0.1 for IPv4, ::1 for
IPv6). We should have bound to all available network devices in 4.6.
Because of this, watchdog won't start lifecheck using heartbeat as
other heartbeat sender process send heartbeat packet to hostnameX,
which is not loopback address of course. We have alreay created a
patch to fix it, and doing an internal test.

Note that if lifecheck_method = 'query', you are not bitten by the
bug.

> Just curious if there's a way to
> skip ipv6 functionality?

No.

> On Tue, Jun 3, 2025 at 1:57 PM Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> > Hi,
>> >
>> > Thank you for the great information!
>> >
>> >> Hi,
>> >> We've made some progress here and I'd like to get your feedback
>> regarding
>> >> the fix/workaround.  Basically, after some research (incl.
>> google/chatgpt
>> >> etc), we came across this information:
>> >>
>> >>   1. systemd-logind manages user sessions
>> >>
>> >> In Ubuntu 24.04 (and since ~15.04), systemd manages user sessions. When
>> a
>> >> user session ends (e.g. when a background process or login shell
>> finishes),
>> >> the following happens:
>> >>
>> >>    -
>> >>
>> >>    The session scope (session-XXXX.scope) is stopped.
>> >>    -
>> >>
>> >>    The user manager (user at UID.service) is also stopped *unless
>> lingering is
>> >>    enabled*.
>> >>
>> >>   If no processes remain and the user is not lingering:
>> >>
>> >>    -
>> >>
>> >>    Shared memory and *System V IPC resources* (like semaphores) *may be
>> >>    destroyed* automatically if no one is using them.
>> >>    -
>> >>
>> >>    If a process (like pgpool2 or PostgreSQL) relied on those semaphores,
>> >>    you may later see:
>> >>
>> >>    failed to lock semaphore
>> >>
>> >>
>> >> This is a strong sign that the user session ended and IPC resources
>> (like
>> >> semaphores) were destroyed.
>> >>
>> >> This does seem to be happening here, as I see these in /var/log/syslog
>> >> around that time we discussed in last posting, just prior to semaphore
>> >> errors showing in pgpool logs:
>> >> 2025-06-02T12:45:17.173486+02:00 db-replica3 systemd[1]:
>> user at 3000.service:
>> >> Deactivated successfully.
>> >> 2025-06-02T12:45:17.173997+02:00 db-replica3 systemd[1]: Stopped
>> >> user at 3000.service - User Manager for UID 3000.
>> >> 2025-06-02T12:45:17.240314+02:00 db-replica3 systemd[1]: Stopping
>> >> user-runtime-dir at 3000.service - User Runtime Directory
>> /run/user/3000...
>> >> 2025-06-02T12:45:17.248017+02:00 db-replica3 systemd[1]:
>> >> run-user-3000.mount: Deactivated successfully.
>> >> 2025-06-02T12:45:17.249107+02:00 db-replica3 systemd[1]:
>> >> user-runtime-dir at 3000.service: Deactivated successfully.
>> >> 2025-06-02T12:45:17.249353+02:00 db-replica3 systemd[1]: Stopped
>> >> user-runtime-dir at 3000.service - User Runtime Directory /run/user/3000.
>> >> 2025-06-02T12:45:17.251672+02:00 db-replica3 systemd[1]: Removed slice
>> >> user-3000.slice - User Slice of UID 3000.
>> >> 2025-06-02T12:45:23.581108+02:00 db-replica3 strace[1692927]: [pid
>> 1693495]
>> >> 12:45:23.580360 semtimedop(41, [{sem_num=6, sem_op=-1,
>> sem_flg=SEM_UNDO}],
>> >> 1, NULL) = -1 EINVAL (Invalid argument)
>> >>
>> >> It also advises to add Type=forking to the [Service] section of the
>> >> pgpool.service file, which we don't currently have, I think I posted it
>> >> earlier, but just in case here is relevant part of our service file:
>> >>
>> >> [Unit]
>> >> Description=Pgpool-II
>> >> After=syslog.target network.target postgresql.service
>> >> Wants=postgresql.service
>> >>
>> >> [Service]
>> >> User=postgres
>> >> Group=postgres
>> >>
>> >> EnvironmentFile=-/etc/default/pgpool2
>> >>
>> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
>> >> ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
>> >> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
>> >>
>> >> -------------------
>> >> One bit I read that is different is the Type=forking and adding -n is
>> >> required (as you can see above, we don't have that right now:
>> >> -
>> >>
>> >> User=postgres: Assumes pgpool should run as the postgres user. Adjust if
>> >> different.
>> >> -
>> >>
>> >> ExecStart=/usr/sbin/pgpool -n: Use the *correct path and arguments* for
>> >> your installation.
>> >> -
>> >>
>> >> pgpool must be configured to *not daemonize* (-n flag), or systemd will
>> >> lose track of it.
>> >> -
>> >>
>> >> You can add environment variables for config files, ports, etc.
>> >> -----------------------
>> >>
>> >> Anyways, our sysadmins suggested changing the user postgres to have
>> Linger
>> >> option on:
>> >> $ sudo loginctl show-user postgres -p Linger
>> >> Linger=yes
>> >> Once we did that, the semaphore issue disappeared, it's been running for
>> >> over an hour now and ipcs shows them:
>> >> $ ipcs -s
>> >>
>> >> ------ Semaphore Arrays --------
>> >> key        semid      owner      perms      nsems
>> >> 0x00000000 6          postgres   600        8
>> >>
>> >>
>> >> So, what do you think about this fix? I read that letting the user have
>> >> linger on is not generally recommended, but that seems to be the only
>> way
>> >> to keep the semaphores from disappearing right now...
>> >
>> > I am not familiar with this area. I will discuss with other
>> > pgpool/PostgreSQL developers and reply back to you.
>>
>> I found this in the PostgreSQL manual:
>>
>> https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC
>>
>> If my reading is correct, we can do one of:
>>
>> 1) register the user to start pgpool as a "system user".
>>
>> 2) change the line "RemoveIPC=no" in /etc/systemd/logind.conf .
>>
>> For me, #1 is better, since #2 affects to other servers in the system.
>>
>> BTW, I think the doc does not suggest to use Linger option.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese:http://www.sraoss.co.jp
>>