[pgpool-general: 9512] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

Tatsuo Ishii ishii at postgresql.org
Sun Jun 8 10:15:31 JST 2025


Hi,

Thank you for the report. I am happy to hear that!

> Hi, sorry for the late reply, things were pretty busy.  I am very happy to
> report that your suggested solution did the trick (we ended up changing the
> UID/GID of the postgres user from 3000 to a lower number, which made it a
> system user).  It has been a couple of days and it is working fine, no
> issues with semaphores.
> 
> On the other issues I mentioned, the reason it was complaining about user
> "nobody" was the other setting you advised me to change (wd_lifecheck_method
> = 'query'), as the lifecheck user defaults to 'nobody'.  Once I changed that
> user to our default pgpool user, the error was gone.
> 
> Also, the other issue I mentioned with pcp_node_info also is now gone and
> all nodes show status 2, so everything seems to be working properly now.
> 
> Once again, thank you very much for your assistance with this problem, much
> appreciated!
> 
> 
> On Thu, Jun 5, 2025 at 8:33 AM Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> > Hi, I changed the wd_lifecheck_method to query, saw this in the log
>> > afterwards:
>> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  create socket files[0]:
>> > /var/run/postgresql/.s.PGSQL.9999
>> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:4888
>> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  listen address[0]: *
>> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:4920
>> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  Setting up socket for
>> > 0.0.0.0:9999
>> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:999
>> >
>> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  Setting up socket for :::9999
>> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:999
>> > 2025-06-04 13:32:42.932: child pid 120332: FATAL:  SCRAM authentication failed
>> > 2025-06-04 13:32:42.932: child pid 120332: DETAIL:  pool_passwd file does not contain an entry for "nobody"
>> > 2025-06-04 13:32:42.932: child pid 120332: LOCATION:  pool_hba.c:1272
>> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  create socket files[0]:
>> > /var/run/postgresql/.s.PGSQL.9898
>> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:4888
>> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  listen address[0]: *
>> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:4920
>> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  Setting up socket for
>> > 0.0.0.0:9898
>> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:999
>> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  Setting up socket for
>> > :::9898
>> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:999
>> > 2025-06-04 13:32:43.376: pcp_main pid 121356: LOG:  PCP process: 121356 started
>> >
>> > Any idea about that pool_passwd error
>>
>> Because you set wd_lifecheck_password = '' and did not set a password for
>> "nobody" in pool_passwd.
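>>
>> If you want the lifecheck to authenticate instead, a minimal sketch
>> (assuming your normal pgpool user is named 'pgpool'; adjust to your
>> setup):
>>
>> # pgpool.conf
>> wd_lifecheck_user = 'pgpool'     # instead of the default 'nobody'
>> wd_lifecheck_password = ''       # empty: looked up in pool_passwd
>>
>> # add the matching AES-encrypted pool_passwd entry (for SCRAM setups):
>> $ pg_enc -m -k ~/.pgpoolkey -u pgpool -p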
>>
>> > attempting to set up socket for
>> > :::9999?
>>
>> You mean 9898? Because you set pcp_port = 9898.
>>
>> > Again, not a big deal, just wondering if it's something to be concerned
>> > about.
>> >
>> > Also, I'm wondering about the status of that pcp_node_info command I
>> > mentioned earlier... even though semaphores seem to be present and we
>> > have plenty of queries/connections, the status is still showing as
>> > 1/waiting where I suspect it should be 2... I might be stressing over
>> > nothing, but none of the nodes are showing in status 2 at this point,
>> > example:
>> > $ pcp_node_info -h localhost -p 9898 -U pcpadmin --verbose -n 2
>> > Hostname               : db-replica3
>> > Port                   : 5434
>> > Status                 : 1
>> > Weight                 : 0.250000
>> > Status Name            : waiting
>> > Backend Status Name    : up
>> > Role                   : standby
>> > Backend Role           : standby
>> > Replication Delay      : 0
>> > Replication State      : streaming
>> > Replication Sync State : async
>> > Last Status Change     : 2025-06-03 11:59:27
>>
>> As I explained upthread, you must have run the pcp_node_info command
>> against a non-leader watchdog node. In that case no client has connected
>> to the node yet, so the status remains 1.
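>>
>> To double-check, a quick sketch (ports and user taken from your command
>> above; the leader host is whatever pcp_watchdog_info reports):
>>
>> # find the watchdog leader first
>> $ pcp_watchdog_info -h localhost -p 9898 -U pcpadmin --verbose
>> # then ask the leader for the node status
>> $ pcp_node_info -h <leader-host> -p 9898 -U pcpadmin --verbose -n 2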
>>
>> > On Wed, Jun 4, 2025 at 2:18 AM Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >
>> >> Hi,
>> >>
>> >> > Hi, gave it a try, although we went with option #2 (RemoveIPC=no) as
>> >> > it was quicker to do.  After 20+ minutes, the semaphores are still
>> >> > present and there are no errors in the log.  Tomorrow I will discuss
>> >> > with our sysadmins how we can implement option #1, as the UID of the
>> >> > postgres user is above SYS_UID_MAX.  Maybe we won't go through with
>> >> > changing the postgres user's UID; raising SYS_UID_MAX might be OK,
>> >> > as we only have a couple of users above that number.
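>> >> >
>> >> > (To see where we stand, assuming the usual login.defs location:
>> >> > $ grep '^SYS_UID_MAX' /etc/login.defs
>> >> > $ id -u postgres
>> >> > )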
>> >>
>> >> Thank you for sharing your thoughts.
>> >>
>> >> > Also, I forgot to mention - our repo showed the newer (4.6.2)
>> >> > version as available, so I upgraded the packages and restarted
>> >> > pgpool, but that ipv6 error at startup is still present.  It's more
>> >> > of a nuisance than anything else, but I figured I would let you
>> >> > know:
>> >> > 2025-06-03 18:56:39: pid 1950700: WARNING: failed to create watchdog heartbeat receive socket as much intended
>> >> > 2025-06-03 18:56:39: pid 1950700: DETAIL: only 1 out of 2 socket(s) had been created
>> >> > 2025-06-03 18:56:39: pid 1950700: LOCATION: wd_heartbeat.c:428
>> >> > 2025-06-03 18:56:39: pid 1950704: LOG: failed to create watchdog heartbeat receive socket
>> >> > 2025-06-03 18:56:39: pid 1950704: LOCATION: wd_heartbeat.c:408
>> >> > 2025-06-03 18:56:39: pid 1950704: WARNING: failed to create watchdog heartbeat receive socket as much intended
>> >> > 2025-06-03 18:56:39: pid 1950704: DETAIL: only 1 out of 2 socket(s) had been created
>> >> > 2025-06-03 18:56:39: pid 1950704: LOCATION: wd_heartbeat.c:428
>> >> > Earlier, when I had debugging enabled, it was still showing that it
>> >> > was unable to create this socket:
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: creating watchdog heartbeat receive socket.
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: creating socket on 127.0.0.1:9694
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:385
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: set SO_REUSEPORT option to the socket
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:891
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: creating watchdog heartbeat receive socket.
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: creating socket on ::1:9694
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:385
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: failed to create watchdog heartbeat receive socket. retrying...
>> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: bind failed with reason: "Cannot assign requested address"
>> >> > It doesn't seem to affect pgpool functionality, but I'd prefer to
>> >> > disable it altogether to have 'clean' logs, if possible.  So far I
>> >> > tried changing listen_addresses and pcp_listen_addresses in
>> >> > pgpool.conf from "*" to "0.0.0.0", but that didn't resolve it.
>> >>
>> >> pcp_listen_addresses is used for pcp commands only; it is not relevant
>> >> to heartbeat.
>> >>
>> >> BTW, I think I found a possible cause of the heartbeat problem: before
>> >> 4.6, the bind address of the heartbeat receive socket was specified as
>> >> INADDR_ANY, which binds to all available network devices. But from
>> >> 4.6.0, it binds only to the loopback address (127.0.0.1 for IPv4, ::1
>> >> for IPv6). We should have kept binding to all available network
>> >> devices in 4.6. Because of this, the watchdog won't start lifecheck
>> >> using heartbeat, as the other heartbeat sender processes send their
>> >> heartbeat packets to hostnameX, which of course is not a loopback
>> >> address. We have already created a patch to fix it, and are doing
>> >> internal testing.
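>> >>
>> >> As a quick check on an affected node, you can see which addresses the
>> >> heartbeat receive socket is actually bound to (a sketch, assuming the
>> >> default heartbeat port 9694):
>> >>
>> >> $ ss -ulpn | grep 9694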
>> >>
>> >> Note that if wd_lifecheck_method = 'query', you are not bitten by the
>> >> bug.
>> >>
>> >> > Just curious if there's a way to
>> >> > skip ipv6 functionality?
>> >>
>> >> No.
>> >>
>> >> > On Tue, Jun 3, 2025 at 1:57 PM Tatsuo Ishii <ishii at postgresql.org> wrote:
>> >> >
>> >> >> > Hi,
>> >> >> >
>> >> >> > Thank you for the great information!
>> >> >> >
>> >> >> >> Hi,
>> >> >> >> We've made some progress here and I'd like to get your feedback
>> >> >> >> regarding the fix/workaround.  Basically, after some research
>> >> >> >> (incl. google/chatgpt etc), we came across this information:
>> >> >> >>
>> >> >> >>   1. systemd-logind manages user sessions
>> >> >> >>
>> >> >> >> In Ubuntu 24.04 (and since ~15.04), systemd manages user
>> >> >> >> sessions. When a user session ends (e.g. when a background
>> >> >> >> process or login shell finishes), the following happens:
>> >> >> >>
>> >> >> >>    - The session scope (session-XXXX.scope) is stopped.
>> >> >> >>    - The user manager (user@UID.service) is also stopped *unless
>> >> >> >>      lingering is enabled*.
>> >> >> >>
>> >> >> >> If no processes remain and the user is not lingering:
>> >> >> >>
>> >> >> >>    - Shared memory and *System V IPC resources* (like semaphores)
>> >> >> >>      *may be destroyed* automatically if no one is using them.
>> >> >> >>    - If a process (like pgpool2 or PostgreSQL) relied on those
>> >> >> >>      semaphores, you may later see:
>> >> >> >>
>> >> >> >>      failed to lock semaphore
>> >> >> >>
>> >> >> >> This is a strong sign that the user session ended and IPC
>> >> >> >> resources (like semaphores) were destroyed.
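>> >> >> >>
>> >> >> >> (One way to watch this happening, using the standard loginctl
>> >> >> >> commands, is to check the user's session state while the
>> >> >> >> semaphores are still alive:
>> >> >> >> $ loginctl list-sessions
>> >> >> >> $ loginctl user-status postgres )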
>> >> >> >>
>> >> >> >> This does seem to be happening here, as I see these in
>> >> >> >> /var/log/syslog around the time we discussed in the last posting,
>> >> >> >> just prior to the semaphore errors showing up in the pgpool logs:
>> >> >> >> 2025-06-02T12:45:17.173486+02:00 db-replica3 systemd[1]: user@3000.service: Deactivated successfully.
>> >> >> >> 2025-06-02T12:45:17.173997+02:00 db-replica3 systemd[1]: Stopped user@3000.service - User Manager for UID 3000.
>> >> >> >> 2025-06-02T12:45:17.240314+02:00 db-replica3 systemd[1]: Stopping user-runtime-dir@3000.service - User Runtime Directory /run/user/3000...
>> >> >> >> 2025-06-02T12:45:17.248017+02:00 db-replica3 systemd[1]: run-user-3000.mount: Deactivated successfully.
>> >> >> >> 2025-06-02T12:45:17.249107+02:00 db-replica3 systemd[1]: user-runtime-dir@3000.service: Deactivated successfully.
>> >> >> >> 2025-06-02T12:45:17.249353+02:00 db-replica3 systemd[1]: Stopped user-runtime-dir@3000.service - User Runtime Directory /run/user/3000.
>> >> >> >> 2025-06-02T12:45:17.251672+02:00 db-replica3 systemd[1]: Removed slice user-3000.slice - User Slice of UID 3000.
>> >> >> >> 2025-06-02T12:45:23.581108+02:00 db-replica3 strace[1692927]: [pid 1693495] 12:45:23.580360 semtimedop(41, [{sem_num=6, sem_op=-1, sem_flg=SEM_UNDO}], 1, NULL) = -1 EINVAL (Invalid argument)
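>> >> >> >>
>> >> >> >> (A quick way to confirm the semaphore set is gone: query it by id
>> >> >> >> with ipcs, e.g. for the id 41 from the strace line above:
>> >> >> >> $ ipcs -s -i 41
>> >> >> >> If the set was removed this fails, which points the same way as
>> >> >> >> the EINVAL from semtimedop.)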
>> >> >> >>
>> >> >> >> It also advises adding Type=forking to the [Service] section of
>> >> >> >> the pgpool.service file, which we don't currently have.  I think
>> >> >> >> I posted it earlier, but just in case here is the relevant part
>> >> >> >> of our service file:
>> >> >> >>
>> >> >> >> [Unit]
>> >> >> >> Description=Pgpool-II
>> >> >> >> After=syslog.target network.target postgresql.service
>> >> >> >> Wants=postgresql.service
>> >> >> >>
>> >> >> >> [Service]
>> >> >> >> User=postgres
>> >> >> >> Group=postgres
>> >> >> >>
>> >> >> >> EnvironmentFile=-/etc/default/pgpool2
>> >> >> >>
>> >> >> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
>> >> >> >> ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
>> >> >> >> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
>> >> >> >>
>> >> >> >> -------------------
>> >> >> >> One bit I read that is different is that Type=forking and adding
>> >> >> >> -n are required (as you can see above, we don't have that right
>> >> >> >> now):
>> >> >> >>
>> >> >> >> - User=postgres: Assumes pgpool should run as the postgres user.
>> >> >> >>   Adjust if different.
>> >> >> >> - ExecStart=/usr/sbin/pgpool -n: Use the *correct path and
>> >> >> >>   arguments* for your installation.
>> >> >> >> - pgpool must be configured to *not daemonize* (-n flag), or
>> >> >> >>   systemd will lose track of it.
>> >> >> >> - You can add environment variables for config files, ports, etc.
>> >> >> >> -----------------------
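>> >> >> >>
>> >> >> >> (If we go the -n route, a minimal sketch of the unit, reusing the
>> >> >> >> paths from our service file above:
>> >> >> >>
>> >> >> >> [Service]
>> >> >> >> User=postgres
>> >> >> >> Group=postgres
>> >> >> >> EnvironmentFile=-/etc/default/pgpool2
>> >> >> >> # -n keeps pgpool in the foreground so systemd tracks it directly;
>> >> >> >> # with -n there is no need for Type=forking (default Type=simple)
>> >> >> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf -n $OPTS
>> >> >> >> )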
>> >> >> >>
>> >> >> >> Anyway, our sysadmins suggested turning the Linger option on for
>> >> >> >> the postgres user:
>> >> >> >> $ sudo loginctl show-user postgres -p Linger
>> >> >> >> Linger=yes
>> >> >> >> Once we did that, the semaphore issue disappeared; it's been
>> >> >> >> running for over an hour now and ipcs shows them:
>> >> >> >> $ ipcs -s
>> >> >> >>
>> >> >> >> ------ Semaphore Arrays --------
>> >> >> >> key        semid      owner      perms      nsems
>> >> >> >> 0x00000000 6          postgres   600        8
>> >> >> >>
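>> >> >> >> (For reference, a sketch of how lingering is toggled with the
>> >> >> >> standard loginctl commands:
>> >> >> >> $ sudo loginctl enable-linger postgres
>> >> >> >> $ sudo loginctl disable-linger postgres    # to revert later
>> >> >> >> )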
>> >> >> >>
>> >> >> >> So, what do you think about this fix? I read that letting the
>> >> >> >> user have linger on is not generally recommended, but that seems
>> >> >> >> to be the only way to keep the semaphores from disappearing
>> >> >> >> right now...
>> >> >> >
>> >> >> > I am not familiar with this area. I will discuss with other
>> >> >> > pgpool/PostgreSQL developers and reply back to you.
>> >> >>
>> >> >> I found this in the PostgreSQL manual:
>> >> >>
>> >> >> https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC
>> >> >>
>> >> >> If my reading is correct, we can do one of:
>> >> >>
>> >> >> 1) register the user that starts pgpool as a "system user".
>> >> >>
>> >> >> 2) set "RemoveIPC=no" in /etc/systemd/logind.conf.
>> >> >>
>> >> >> For me, #1 is better, since #2 affects other servers on the
>> >> >> system.
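>> >> >>
>> >> >> (If you go with #2 anyway, a minimal sketch, per the manual page
>> >> >> above:
>> >> >>
>> >> >> # /etc/systemd/logind.conf
>> >> >> RemoveIPC=no
>> >> >>
>> >> >> # then restart logind so the change takes effect:
>> >> >> $ sudo systemctl restart systemd-logind
>> >> >> )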
>> >> >>
>> >> >> BTW, I think the doc does not suggest using the Linger option.
>> >> >>
>> >> >> Best regards,
>> >> >> --
>> >> >> Tatsuo Ishii
>> >> >> SRA OSS K.K.
>> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> Japanese: http://www.sraoss.co.jp
>> >> >>
>> >>
>>


More information about the pgpool-general mailing list