[pgpool-general: 9511] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

TV tvfan2014 at gmail.com
Sat Jun 7 20:23:57 JST 2025


Hi, sorry for the late reply, things were pretty busy.  I am very happy to
report that your suggested solution did the trick (we ended up changing the
UID/GID of the postgres user from 3000 to a lower number, which made it a
system user).  It has been a couple of days and it is working fine, no
issues with semaphores.
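
For anyone hitting the same thing: per the PostgreSQL kernel-resources page
linked further down in this thread, systemd decides what counts as a "system
user" from SYS_UID_MAX in /etc/login.defs, so a quick first check is to
compare that value against the UID in question.  The output below is only a
sketch with typical Debian/Ubuntu defaults, not copied from our hosts:

$ grep -E '^#?SYS_UID_(MIN|MAX)' /etc/login.defs
#SYS_UID_MIN              100
#SYS_UID_MAX              999
$ id -u postgres
3000

If the UID is above SYS_UID_MAX, the choices are the ones discussed below:
move the user into the system range (usermod -u plus re-chowning its files)
or fall back to RemoveIPC=no.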

On the other issues I mentioned: the reason it was complaining about user
"nobody" was the other setting you advised me to change (wd_lifecheck_method
= 'query'), since the lifecheck user defaults to 'nobody'.  Once I changed
that user to our default pgpool user, the error was gone.
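
For reference, the change boils down to pointing the lifecheck settings at a
user that actually has a pool_passwd entry.  A minimal sketch (the user name
'pgpool' and the key file path are placeholders, not our real values; pg_enc
adds an AES-encrypted pool_passwd entry, and pg_md5 works similarly for md5
auth):

# pgpool.conf
wd_lifecheck_user     = 'pgpool'
wd_lifecheck_password = ''   # empty: pgpool looks the password up in pool_passwd

# add the matching pool_passwd entry so SCRAM authentication can succeed
$ pg_enc -m -k /etc/pgpool2/.pgpoolkey -u pgpool -p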

Also, the other issue I mentioned with pcp_node_info is gone as well: all
nodes now show status 2, so everything seems to be working properly.

Once again, thank you very much for your assistance with this problem, much
appreciated!


On Thu, Jun 5, 2025 at 8:33 AM Tatsuo Ishii <ishii at postgresql.org> wrote:

> > Hi, I changed the wd_lifecheck_method to query, saw this in the log
> > afterwards:
> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  create socket files[0]:
> > /var/run/postgresql/.s.PGSQL.9999
> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:4888
> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  listen address[0]: *
> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:4920
> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  Setting up socket for
> > 0.0.0.0:9999
> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:999
> > 2025-06-04 13:32:42.930: main pid 120326: LOG:  Setting up socket for :::9999
> > 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:999
> > 2025-06-04 13:32:42.932: child pid 120332: FATAL:  SCRAM authentication failed
> > 2025-06-04 13:32:42.932: child pid 120332: DETAIL:  pool_passwd file does not contain an entry for "nobody"
> > 2025-06-04 13:32:42.932: child pid 120332: LOCATION:  pool_hba.c:1272
> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  create socket files[0]:
> > /var/run/postgresql/.s.PGSQL.9898
> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:4888
> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  listen address[0]: *
> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:4920
> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  Setting up socket for
> > 0.0.0.0:9898
> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:999
> > 2025-06-04 13:32:43.375: main pid 120326: LOG:  Setting up socket for
> > :::9898
> > 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:999
> > 2025-06-04 13:32:43.376: pcp_main pid 121356: LOG:  PCP process: 121356
> > started
> >
> > Any idea about that pool_passwd error
>
> Because you set wd_lifecheck_password = '' and there is no password entry
> for "nobody" in pool_passwd.
>
> > attempting to set up socket for
> > :::9999?
>
> You mean 9898? That is because you set pcp_port = 9898.
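>
> (Side note: the :::9999 and :::9898 entries are just the IPv6 "any address"
> counterparts of 0.0.0.0, which pgpool sets up when listen_addresses /
> pcp_listen_addresses is '*'.  If you want to double-check what is actually
> bound, something like
>
> $ ss -lnt | grep -E ':(9999|9898)'
>
> should show both the IPv4 and IPv6 listeners.)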
>
> > Again, not a big deal, just wondering if it's something to be concerned
> > about.
> >
> > Also, I'm wondering about the status of that pcp_node_info command I
> > mentioned earlier... even though semaphores seem to be present and we have
> > plenty of queries/connections, the status is still showing as 1/waiting
> > where I suspect it should be 2... I might be stressing over nothing, but
> > none of the nodes are showing in status 2 at this point, example:
> > $ pcp_node_info -h localhost -p 9898 -U pcpadmin --verbose -n 2
> > Hostname               : db-replica3
> > Port                   : 5434
> > Status                 : 1
> > Weight                 : 0.250000
> > Status Name            : waiting
> > Backend Status Name    : up
> > Role                   : standby
> > Backend Role           : standby
> > Replication Delay      : 0
> > Replication State      : streaming
> > Replication Sync State : async
> > Last Status Change     : 2025-06-03 11:59:27
>
> As I explained upthread, you must have run the pcp_node_info command
> against a non-leader watchdog node. In that case no clients have connected
> to that node, so its status remains 1.
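>
> If you want to double-check which node is the current watchdog leader
> before running pcp_node_info, pcp_watchdog_info takes the same connection
> options, e.g.:
>
> $ pcp_watchdog_info -h localhost -p 9898 -U pcpadmin --verbose
>
> Running pcp_node_info against the leader node should then show status 2 for
> backends that are actually serving client connections.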
>
> > On Wed, Jun 4, 2025 at 2:18 AM Tatsuo Ishii <ishii at postgresql.org>
> wrote:
> >
> >> Hi,
> >>
> >> > Hi, we gave it a try, although we went with option #2 (RemoveIPC=no) as
> >> > it was quicker to do.  After 20+ minutes, the semaphores are still
> >> > present and there are no errors in the log.  Tomorrow I will discuss
> >> > with our sysadmins how we can implement option #1, as the UID of the
> >> > postgres user is above SYS_UID_MAX.  Maybe we won't go through with
> >> > changing the postgres user's UID; raising SYS_UID_MAX to a higher
> >> > number might be OK, as we only have a couple of users above that number.
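> >> >
> >> > For reference, option #2 was just this change (sketched below); the
> >> > setting lives in /etc/systemd/logind.conf, and logind has to be
> >> > restarted, or the host rebooted, for it to take effect:
> >> >
> >> > # /etc/systemd/logind.conf  (under the [Login] section)
> >> > RemoveIPC=no
> >> >
> >> > $ sudo systemctl restart systemd-logind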
> >>
> >> Thank you for sharing your thought.
> >>
> >> > Also, I forgot to mention: our repo showed the newer (4.6.2) version as
> >> > available, so I upgraded the packages and restarted pgpool, but that
> >> > IPv6 error at start-up is still present.  It's more of a nuisance than
> >> > anything else, but I figured I would let you know:
> >> >
> >> > 2025-06-03 18:56:39: pid 1950700: WARNING: failed to create watchdog heartbeat receive socket as much intended
> >> > 2025-06-03 18:56:39: pid 1950700: DETAIL: only 1 out of 2 socket(s) had been created
> >> > 2025-06-03 18:56:39: pid 1950700: LOCATION: wd_heartbeat.c:428
> >> > 2025-06-03 18:56:39: pid 1950704: LOG: failed to create watchdog heartbeat receive socket
> >> > 2025-06-03 18:56:39: pid 1950704: LOCATION: wd_heartbeat.c:408
> >> > 2025-06-03 18:56:39: pid 1950704: WARNING: failed to create watchdog heartbeat receive socket as much intended
> >> > 2025-06-03 18:56:39: pid 1950704: DETAIL: only 1 out of 2 socket(s) had been created
> >> > 2025-06-03 18:56:39: pid 1950704: LOCATION: wd_heartbeat.c:428
> >> >
> >> > Earlier, when I had debugging enabled, it was still showing that it was
> >> > unable to create this socket:
> >> >
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: creating watchdog heartbeat receive socket.
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: creating socket on 127.0.0.1:9694
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:385
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: set SO_REUSEPORT option to the socket
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:891
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: creating watchdog heartbeat receive socket.
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: creating socket on ::1:9694
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:385
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: failed to create watchdog heartbeat receive socket. retrying...
> >> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: bind failed with reason: "Cannot assign requested address"
> >> >
> >> > It doesn't seem to affect pgpool functionality; I'd just prefer to
> >> > disable it altogether, if possible, to have 'clean' logs.  So far I
> >> > tried changing listen_addresses and pcp_listen_addresses in pgpool.conf
> >> > from "*" to "0.0.0.0", but that didn't resolve it.
> >>
> >> pcp_listen_addresses is used for pcp commands only; it is not relevant to
> >> heartbeat.
> >>
> >> BTW, I think I found a possible cause of the heartbeat problem: before
> >> 4.6, the bind address of the heartbeat receive socket was specified as
> >> INADDR_ANY, which binds to all available network devices. But from 4.6.0,
> >> it only binds to the loopback address (127.0.0.1 for IPv4, ::1 for
> >> IPv6). We should have kept binding to all available network devices in
> >> 4.6. Because of this, the watchdog won't start lifecheck using heartbeat,
> >> as the heartbeat sender processes on the other nodes send their heartbeat
> >> packets to hostnameX, which is of course not a loopback address. We have
> >> already created a patch to fix it and are doing an internal test.
> >>
> >> Note that if wd_lifecheck_method = 'query', you are not bitten by the
> >> bug.
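> >>
> >> For completeness, the query-based workaround only needs the lifecheck
> >> parameters below in pgpool.conf; 'SELECT 1' is the default query, and the
> >> user name here is just a placeholder for an account with a pool_passwd
> >> entry:
> >>
> >> wd_lifecheck_method   = 'query'
> >> wd_lifecheck_query    = 'SELECT 1'
> >> wd_lifecheck_user     = 'pgpool'
> >> wd_lifecheck_password = ''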
> >>
> >> > Just curious if there's a way to
> >> > skip ipv6 functionality?
> >>
> >> No.
> >>
> >> > On Tue, Jun 3, 2025 at 1:57 PM Tatsuo Ishii <ishii at postgresql.org>
> >> wrote:
> >> >
> >> >> > Hi,
> >> >> >
> >> >> > Thank you for the great information!
> >> >> >
> >> >> >> Hi,
> >> >> >> We've made some progress here and I'd like to get your feedback
> >> >> >> regarding the fix/workaround.  Basically, after some research (incl.
> >> >> >> google/chatgpt etc.), we came across this information:
> >> >> >>
> >> >> >>   1. systemd-logind manages user sessions
> >> >> >>
> >> >> >> In Ubuntu 24.04 (and since ~15.04), systemd manages user sessions.
> >> >> >> When a user session ends (e.g. when a background process or login
> >> >> >> shell finishes), the following happens:
> >> >> >>
> >> >> >>    - The session scope (session-XXXX.scope) is stopped.
> >> >> >>    - The user manager (user@UID.service) is also stopped *unless
> >> >> >>      lingering is enabled*.
> >> >> >>
> >> >> >>   If no processes remain and the user is not lingering:
> >> >> >>
> >> >> >>    - Shared memory and *System V IPC resources* (like semaphores)
> >> >> >>      *may be destroyed* automatically if no one is using them.
> >> >> >>    - If a process (like pgpool2 or PostgreSQL) relied on those
> >> >> >>      semaphores, you may later see:
> >> >> >>
> >> >> >>      failed to lock semaphore
> >> >> >>
> >> >> >> This is a strong sign that the user session ended and IPC resources
> >> >> >> (like semaphores) were destroyed.
> >> >> >>
> >> >> >> This does seem to be happening here, as I see these entries in
> >> >> >> /var/log/syslog around the time we discussed in the last posting,
> >> >> >> just before the semaphore errors start showing up in the pgpool logs:
> >> >> >>
> >> >> >> 2025-06-02T12:45:17.173486+02:00 db-replica3 systemd[1]: user@3000.service: Deactivated successfully.
> >> >> >> 2025-06-02T12:45:17.173997+02:00 db-replica3 systemd[1]: Stopped user@3000.service - User Manager for UID 3000.
> >> >> >> 2025-06-02T12:45:17.240314+02:00 db-replica3 systemd[1]: Stopping user-runtime-dir@3000.service - User Runtime Directory /run/user/3000...
> >> >> >> 2025-06-02T12:45:17.248017+02:00 db-replica3 systemd[1]: run-user-3000.mount: Deactivated successfully.
> >> >> >> 2025-06-02T12:45:17.249107+02:00 db-replica3 systemd[1]: user-runtime-dir@3000.service: Deactivated successfully.
> >> >> >> 2025-06-02T12:45:17.249353+02:00 db-replica3 systemd[1]: Stopped user-runtime-dir@3000.service - User Runtime Directory /run/user/3000.
> >> >> >> 2025-06-02T12:45:17.251672+02:00 db-replica3 systemd[1]: Removed slice user-3000.slice - User Slice of UID 3000.
> >> >> >> 2025-06-02T12:45:23.581108+02:00 db-replica3 strace[1692927]: [pid 1693495] 12:45:23.580360 semtimedop(41, [{sem_num=6, sem_op=-1, sem_flg=SEM_UNDO}], 1, NULL) = -1 EINVAL (Invalid argument)
> >> >> >>
> >> >> >> It also advises adding Type=forking to the [Service] section of the
> >> >> >> pgpool.service file, which we don't currently have. I think I posted
> >> >> >> it earlier, but just in case, here is the relevant part of our
> >> >> >> service file:
> >> >> >>
> >> >> >> [Unit]
> >> >> >> Description=Pgpool-II
> >> >> >> After=syslog.target network.target postgresql.service
> >> >> >> Wants=postgresql.service
> >> >> >>
> >> >> >> [Service]
> >> >> >> User=postgres
> >> >> >> Group=postgres
> >> >> >>
> >> >> >> EnvironmentFile=-/etc/default/pgpool2
> >> >> >>
> >> >> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> >> >> >> ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS
> stop
> >> >> >> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
> >> >> >>
> >> >> >> -------------------
> >> >> >> One bit I read that is different is the Type=forking and that adding
> >> >> >> -n is required (as you can see above, we don't have that right now):
> >> >> >>
> >> >> >>    - User=postgres: assumes pgpool should run as the postgres user.
> >> >> >>      Adjust if different.
> >> >> >>    - ExecStart=/usr/sbin/pgpool -n: use the *correct path and
> >> >> >>      arguments* for your installation.
> >> >> >>    - pgpool must be configured to *not daemonize* (-n flag), or
> >> >> >>      systemd will lose track of it.
> >> >> >>    - You can add environment variables for config files, ports, etc.
> >> >> >> -----------------------
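> >> >> >>
> >> >> >> As a sketch of what that could look like for our setup (assuming the
> >> >> >> unit is named pgpool2, as in the Debian/Ubuntu packaging; note that
> >> >> >> with -n pgpool stays in the foreground, so the systemd default
> >> >> >> Type=simple applies and Type=forking would not be needed), a drop-in
> >> >> >> created via "sudo systemctl edit pgpool2" might contain:
> >> >> >>
> >> >> >> [Service]
> >> >> >> # clear the packaged ExecStart, then run pgpool in the foreground
> >> >> >> ExecStart=
> >> >> >> ExecStart=/usr/sbin/pgpool -n -f /etc/pgpool2/pgpool.conf $OPTS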
> >> >> >>
> >> >> >> Anyway, our sysadmins suggested turning the Linger option on for the
> >> >> >> postgres user:
> >> >> >>
> >> >> >> $ sudo loginctl show-user postgres -p Linger
> >> >> >> Linger=yes
> >> >> >>
> >> >> >> Once we did that, the semaphore issue disappeared; it has been
> >> >> >> running for over an hour now and ipcs shows them:
> >> >> >>
> >> >> >> $ ipcs -s
> >> >> >>
> >> >> >> ------ Semaphore Arrays --------
> >> >> >> key        semid      owner      perms      nsems
> >> >> >> 0x00000000 6          postgres   600        8
> >> >> >>
> >> >> >> So, what do you think about this fix? I read that leaving lingering
> >> >> >> on for a user is not generally recommended, but that seems to be the
> >> >> >> only way to keep the semaphores from disappearing right now...
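> >> >> >>
> >> >> >> Enabling it is a one-liner with loginctl (shown here as a sketch;
> >> >> >> our sysadmins may have done it slightly differently), and
> >> >> >> "loginctl disable-linger postgres" would revert it:
> >> >> >>
> >> >> >> $ sudo loginctl enable-linger postgres
> >> >> >> $ sudo loginctl show-user postgres -p Linger
> >> >> >> Linger=yes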
> >> >> >
> >> >> > I am not familiar with this area. I will discuss with other
> >> >> > pgpool/PostgreSQL developers and reply back to you.
> >> >>
> >> >> I found this in the PostgreSQL manual:
> >> >>
> >> >>
> >> >> https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC
> >> >>
> >> >> If my reading is correct, we can do one of:
> >> >>
> >> >> 1) register the user that starts pgpool as a "system user".
> >> >>
> >> >> 2) set "RemoveIPC=no" in /etc/systemd/logind.conf.
> >> >>
> >> >> For me, #1 is better, since #2 affects other servers on the system.
> >> >>
> >> >> BTW, I think the doc does not suggest using the Linger option.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese: http://www.sraoss.co.jp
> >> >>
> >>
>