[pgpool-general: 9502] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

TV tvfan2014 at gmail.com
Tue Jun 3 19:46:30 JST 2025


Hi,
We've made some progress here and I'd like to get your feedback regarding
the fix/workaround.  Basically, after some research (including Google/ChatGPT,
etc.), we came across this information:

🧠 1. systemd-logind manages user sessions

In Ubuntu 24.04 (and since ~15.04), systemd manages user sessions. When a
user session ends (e.g. when a background process or login shell finishes),
the following happens:

   - The session scope (session-XXXX.scope) is stopped.
   - The user manager (user@UID.service) is also stopped *unless lingering
     is enabled*.

🧨 If no processes remain and the user is not lingering:

   - Shared memory and *System V IPC resources* (like semaphores) *may be
     destroyed* automatically if no one is using them.
   - If a process (like pgpool2 or PostgreSQL) relied on those semaphores,
     you may later see:

     failed to lock semaphore

This is a strong sign that the user session ended and IPC resources (like
semaphores) were destroyed.
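
For anyone else hitting this, the kind of quick checks that should show
whether this is what's happening are along these lines (assuming the pgpool
semaphores are owned by postgres, and that UID 3000 is the postgres user, as
in our syslog below):

$ # is lingering enabled, and is the per-user manager still running?
$ loginctl show-user postgres -p Linger -p State
$ systemctl status user@3000.service
$ # do the System V semaphores still exist?
$ ipcs -s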

This does seem to be happening here, as I see these entries in /var/log/syslog
around the time we discussed in the last posting, just prior to the semaphore
errors showing up in the pgpool logs (a small watch-loop sketch follows the
excerpt):
2025-06-02T12:45:17.173486+02:00 db-replica3 systemd[1]: user@3000.service: Deactivated successfully.
2025-06-02T12:45:17.173997+02:00 db-replica3 systemd[1]: Stopped user@3000.service - User Manager for UID 3000.
2025-06-02T12:45:17.240314+02:00 db-replica3 systemd[1]: Stopping user-runtime-dir@3000.service - User Runtime Directory /run/user/3000...
2025-06-02T12:45:17.248017+02:00 db-replica3 systemd[1]: run-user-3000.mount: Deactivated successfully.
2025-06-02T12:45:17.249107+02:00 db-replica3 systemd[1]: user-runtime-dir@3000.service: Deactivated successfully.
2025-06-02T12:45:17.249353+02:00 db-replica3 systemd[1]: Stopped user-runtime-dir@3000.service - User Runtime Directory /run/user/3000.
2025-06-02T12:45:17.251672+02:00 db-replica3 systemd[1]: Removed slice user-3000.slice - User Slice of UID 3000.
2025-06-02T12:45:23.581108+02:00 db-replica3 strace[1692927]: [pid 1693495] 12:45:23.580360 semtimedop(41, [{sem_num=6, sem_op=-1, sem_flg=SEM_UNDO}], 1, NULL) = -1 EINVAL (Invalid argument)
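
The watch loop mentioned above is nothing fancy; something like this (run as
postgres; the log path is arbitrary) is enough to timestamp when the semaphore
array vanishes so it can be lined up with the systemd messages:

#!/bin/bash
# Append a timestamped "ipcs -s" snapshot every second so the moment the
# semaphore array disappears can be matched against syslog.
while true; do
    echo "=== $(date --iso-8601=seconds) ===" >> /tmp/sem-watch.log
    ipcs -s >> /tmp/sem-watch.log
    sleep 1
done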

It also advises adding Type=forking to the [Service] section of the
pgpool.service file, which we don't currently have. I think I posted it
earlier, but just in case, here is the relevant part of our service file:

[Unit]
Description=Pgpool-II
After=syslog.target network.target postgresql.service
Wants=postgresql.service

[Service]
User=postgres
Group=postgres

EnvironmentFile=-/etc/default/pgpool2

ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload

-------------------
One bit I read that is different is that Type=forking and adding -n are said
to be required (as you can see above, we don't have either right now); the
points were (see the sketch after this list):

- User=postgres: Assumes pgpool should run as the postgres user. Adjust if
  different.
- ExecStart=/usr/sbin/pgpool -n: Use the *correct path and arguments* for
  your installation.
- pgpool must be configured to *not daemonize* (-n flag), or systemd will
  lose track of it.
- You can add environment variables for config files, ports, etc.
-----------------------
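
If we end up trying that, I imagine the change would look roughly like this
(just a sketch based on our unit above, not something we've tested yet; my
understanding is that with -n pgpool stays in the foreground, so the default
Type=simple would apply rather than Type=forking):

[Service]
User=postgres
Group=postgres

EnvironmentFile=-/etc/default/pgpool2

# -n keeps pgpool in the foreground so systemd tracks the parent process directly
ExecStart=/usr/sbin/pgpool -n -f /etc/pgpool2/pgpool.conf $OPTS
ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload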

Anyway, our sysadmins suggested turning the Linger option on for the postgres
user:
$ sudo loginctl show-user postgres -p Linger
Linger=yes
Once we did that, the semaphore issue disappeared; it's been running for
over an hour now, and ipcs shows the semaphores:
$ ipcs -s

------ Semaphore Arrays --------
key        semid      owner      perms      nsems
0x00000000 6          postgres   600        8


So, what do you think about this fix? I read that leaving lingering enabled
for a user is not generally recommended, but that seems to be the only way
to keep the semaphores from disappearing right now...
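
For completeness, enabling lingering boils down to a one-liner (verification
is the same show-user command as above):

$ sudo loginctl enable-linger postgres
$ loginctl show-user postgres -p Linger
Linger=yes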




On Tue, Jun 3, 2025 at 2:30 AM Tatsuo Ishii <ishii at postgresql.org> wrote:

> Thanks for the strace log.
>
> 1756952 16:45:14.448431 semtimedop(46, [{sem_num=6, sem_op=1,
> sem_flg=SEM_UNDO}], 1, NULL) = 0
> 1756952 16:45:24.482757 semtimedop(46, [{sem_num=6, sem_op=-1,
> sem_flg=SEM_UNDO}], 1, NULL) = -1 EINVAL (Invalid argument)
>
> So the semaphore was removed between 16:45:14.448431 and
> 16:45:24.482757.  I was expecting "semctl(46, 0, IPC_RMID, NULL) = 0"
> (which removes semaphore) was executed there, but actually it was
> not. Nothing was recorded in the period.
>
> I think this definitely means that a process which is not a child of
> pgpool executed semctl(46, 0, IPC_RMID, NULL) (or ipcrm command). Do
> you have any idea such a program other than pgpool?
>
> > Here is the strace result, hope this helps in any way, it takes about 10
> > minutes for the semaphores to 'go away'...
> >
> >
> > On Mon, Jun 2, 2025 at 2:53 PM Tatsuo Ishii <ishii at postgresql.org>
> wrote:
> >
> >> You have 4 pgpool nodes and you issue the queries through one of
> >> pgpools. Suppose you issued the queries through node 0, the "status"
> >> field is only updated on node 0. On other 3 nodes the status remains
> >> "waiting". However the status on all 4 nodes will be synched when the
> >> status becomes "down", or becomes "up" and the former status is
> >> "down".
> >>
> >> > That's where I wonder if this is working properly, because there seem to
> >> > be clients/connections happening. I ran the backend status command; this
> >> > is what it showed:
> >> >
> >> > node_id | hostname        | port | status | role    | select_cnt | insert_cnt | update_cnt | delete_cnt | ddl_cnt | other_cnt | panic_cnt | fatal_cnt | error_cnt
> >> > --------+-----------------+------+--------+---------+------------+------------+------------+------------+---------+-----------+-----------+-----------+-----------
> >> > 0       | db-primarypriv  | 5434 | up     | primary | 7552       | 47         | 332        | 27         | 0       | 13678     | 0         | 0         | 6
> >> > 1       | db-replica1priv | 5434 | up     | standby | 6817       | 0          | 0          | 0          | 0       | 13682     | 0         | 0         | 0
> >> > 2       | db-replica2priv | 5434 | up     | standby | 6514       | 0          | 0          | 0          | 0       | 13693     | 0         | 0         | 1
> >> > 3       | db-replica3priv | 5434 | up     | standby | 6910       | 0          | 0          | 0          | 0       | 13679     | 0         | 0         | 1
> >> >
> >> > (4 rows)
> >> >
> >> > So plenty of queries supposedly happening on them, but the nodes still
> >> > show status 1.  What gives?
> >> >
> >> >
> >> > On Mon, Jun 2, 2025 at 1:43 PM Tatsuo Ishii <ishii at postgresql.org>
> >> wrote:
> >> >
> >> >> > Hi, another thing I noticed - the status of all those nodes in
> >> >> > pcp_node_info seems to be wrong, for example:
> >> >> > $ pcp_node_info -h localhost -p 9898 -U pcpadmin --verbose -n 3
> >> >> > Hostname               : db-replica2
> >> >> > Port                   : 5434
> >> >> > Status                 : 1
> >> >> > Weight                 : 0.250000
> >> >> > Status Name            : waiting
> >> >> > Backend Status Name    : up
> >> >> > Role                   : standby
> >> >> > Backend Role           : standby
> >> >> > Replication Delay      : 0
> >> >> > Replication State      : streaming
> >> >> > Replication Sync State : async
> >> >> > Last Status Change     : 2025-06-02 12:31:46
> >> >>
> >> >> Are you saying this is wrong?
> >> >>
> >> >> > Status                 : 1
> >> >>
> >> >> Seems normal to me because it just says no client has connected to
> the
> >> >> pgpool server until now.
> >> >>
> >> >> > Do you think this is because of the semaphore problem, or does it
> >> suggest
> >> >> > something else completely?
> >> >> >
> >> >> > On Mon, Jun 2, 2025 at 12:59 PM Tatsuo Ishii <ishii at postgresql.org
> >
> >> >> wrote:
> >> >> >
> >> >> >> > Hi, I changed the pgpool.service file for startup as you said,
> >> >> replaced
> >> >> >> my
> >> >> >> > start with this:
> >> >> >> > ExecStart=strace -f -tt --trace=ipc /usr/sbin/pgpool -f
> >> >> >> > /etc/pgpool2/pgpool.conf $OPTS
> >> >> >> > I'm attaching the log from replica again, but I'm not sure this
> >> >> actually
> >> >> >> > worked (as in, it doesn't look much different, to me at least).
> >> >> >>
> >> >> >> Yeah. It seems there's strace output in the log.
> >> >> >> >> Can you try "-o" of strace, which tells strace a file name to print
> >> >> >> >> the output to, e.g.
> >> >> >>
> >> >> >> ExecStart=strace -f -tt --trace=ipc -o /tmp/strace.out
> >> /usr/sbin/pgpool
> >> >> -f
> >> >> >> /etc/pgpool2/pgpool.conf $OPTS
> >> >> >>
> >> >> >> > On Mon, Jun 2, 2025 at 11:47 AM Tatsuo Ishii <
> ishii at postgresql.org
> >> >
> >> >> >> wrote:
> >> >> >> >
> >> >> >> >> > Hi,
> >> >> >> >> > I tried the strace as you showed but it didn't really output anything:
> >> >> >> >> >
> >> >> >> >> > $ sudo strace -p 1662563 -f -tt --trace=ipc
> >> >> >> >> > strace: Process 1662563 attached
> >> >> >> >> >
> >> >> >> >> > ^Cstrace: Process 1662563 detached.   <--- this is me getting out of it,
> >> >> >> >> > since I saw in another window the semaphores were gone already, without
> >> >> >> >> > anything outputting here.  Unless something's wrong with the syntax of the
> >> >> >> >> > command, it didn't really output anything.  I changed the log line you
> >> >> >> >> > suggested as well.
> >> >> >> >> >
> >> >> >> >> > Also, FYI - I tried to downgrade to a lower (4.3.7) version as it was
> >> >> >> >> > available in my package repository, and the same semaphore issue
> >> >> >> >> > persists,
> >> >> >> >>
> >> >> >> >> Okay. I think you need to start pgpool from strace. I mean, modify
> >> >> >> >> pgpool.service from:
> >> >> >> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> >> >> >> >> to:
> >> >> >> >> ExecStart=strace -f -tt --trace=ipc /usr/sbin/pgpool -f
> >> >> >> >> /etc/pgpool2/pgpool.conf $OPTS
> >> >> >> >>
> >> >> >> >> > On Mon, Jun 2, 2025 at 2:40 AM Tatsuo Ishii <
> >> ishii at postgresql.org>
> >> >> >> >> wrote:
> >> >> >> >> >
> >> >> >> >> >> > Hi,
> >> >> >> >> >> > Not sure I can patch the source, as it was installed via apt-get and
> >> >> >> >> >> > not compiled from source.  If that is doable with a package-built one
> >> >> >> >> >> > and you can give me some instructions, I can certainly try.
> >> >> >> >> >>
> >> >> >> >> >> Unfortunately I'm not familiar with debian packaging.
> >> >> >> >> >>
> >> >> >> >> >> > As for the pgpool log, I'm attaching it; hopefully it will go through.
> >> >> >> >> >> > It's about 5 MB as I enabled higher debug options.
> >> >> >> >> >>
> >> >> >> >> >> It was sent as a direct mail to me. Not sure who/what removed the
> >> >> >> >> >> semaphore. I observe the following from the log anyway:
> >> >> >> >> >>
> >> >> >> >> >> - pgpool started on 16:10:21. At that time the semaphore definitely
> >> >> >> >> >>   existed.
> >> >> >> >> >> - Until 16:15:13, everything worked great.
> >> >> >> >> >> - On 16:15:23, the problem started.
> >> >> >> >> >>
> >> >> >> >> >> So it seems the semaphore was removed by someone between 16:15:13
> >> >> >> >> >> and 16:15:23.
> >> >> >> >> >>
> >> >> >> >> >> Unfortunately the removal of the semaphore was not recorded in the
> >> >> >> >> >> log. To determine who removed the semaphore, can you please attach
> >> >> >> >> >> strace to pgpool main and share the log? This time a debug log is not
> >> >> >> >> >> necessary.
> >> >> >> >> >>
> >> >> >> >> >> - Find the pgpool parent process pid (in the debug log it was 1483510)
> >> >> >> >> >> - Become root (attaching strace usually requires root privilege)
> >> >> >> >> >> - Attach strace to the process $pid
> >> >> >> >> >>
> >> >> >> >> >> strace -p $pid -f -tt --trace=ipc
> >> >> >> >> >>
> >> >> >> >> >> -f: follow the child processes of pgpool main
> >> >> >> >> >> -tt: add timestamps with microseconds
> >> >> >> >> >> --trace=ipc: limit the traced system calls to ipc to reduce the log
> >> >> >> >> >>
> >> >> >> >> >> BTW, I recommend you set log_line_prefix to something like this so
> >> >> >> >> >> that you can get millisecond timestamps in the log (%m) and the process
> >> >> >> >> >> application name (%a). They are usable in 4.6.
> >> >> >> >> >>
> >> >> >> >> >> log_line_prefix = '%m: %a pid %p: '
> >> >> >> >> >>
> >> >> >> >> >> > On Sun, Jun 1, 2025 at 4:20 PM Tatsuo Ishii <
> >> >> ishii at postgresql.org>
> >> >> >> >> >> wrote:
> >> >> >> >> >> >
> >> >> >> >> >> >> > To your points/questions:
> >> >> >> >> >> >> > 1) version is 4.6.1 - the source I'm using (via apt get) doesn't
> >> >> >> >> >> >> > have 4.6.2 yet
> >> >> >> >> >> >>
> >> >> >> >> >> >> Ok.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > 2) ipcs -s shows... nothing, literally.  Checked all
> >> nodes,
> >> >> i.e.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > postgres at db-replica3:~$ ipcs -s
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > ------ Semaphore Arrays --------
> >> >> >> >> >> >> > key        semid      owner      perms      nsems
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Also, it kinda seems like "failed to unlock semaphore" shows during
> >> >> >> >> >> >> > startup of pgpool and while it's running, but maybe I'm not reading
> >> >> >> >> >> >> > the log properly.
> >> >> >> >> >> >>
> >> >> >> >> >> >> So pgpool lost the semaphore soon after starting. Can you share the
> >> >> >> >> >> >> full pgpool log before "failed to unlock semaphore" appears?
> >> >> >> >> >> >>
> >> >> >> >> >> >> > 3) regarding heartbeats, my pgpool.conf has it roughly
> >> like
> >> >> >> this:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > heartbeat_hostname0 = 'db-primary-priv'
> >> >> >> >> >> >> > heartbeat_port0 = 9694
> >> >> >> >> >> >> > heartbeat_device0 = ''
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > heartbeat_hostname1 = 'db-replica1-priv'
> >> >> >> >> >> >> > heartbeat_port1 = 9694
> >> >> >> >> >> >> > heartbeat_device1 = ''
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > heartbeat_hostname2 = 'db-replica2-priv'
> >> >> >> >> >> >> > heartbeat_port2 = 9694
> >> >> >> >> >> >> > heartbeat_device2 = ''
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > heartbeat_hostname3 = 'db-replica3-priv'
> >> >> >> >> >> >> > heartbeat_port3 = 9694
> >> >> >> >> >> >> > heartbeat_device3 = ''
> >> >> >> >> >> >>
> >> >> >> >> >> >> Looks normal.
> >> >> >> >> >> >>
> >> >> >> >> >> >> I have checked the pgpool source code and found suspicious
> >> >> >> >> >> >> code. Before 4.6, the pgpool heartbeat receiver process binds to
> >> >> >> >> >> >> any address on the host. However, from 4.6 pgpool binds only to
> >> >> >> >> >> >> the loopback address, which is plain wrong. Is it possible for you
> >> >> >> >> >> >> to patch the pgpool source code and build a pgpool binary?
> >> >> >> >> >> >>
> >> >> >> >> >> >> > So it seems likely that the inability to lock/unlock semaphores is
> >> >> >> >> >> >> > the problem; the question is why, and how do I resolve it? :(
> >> >> >> >> >> >>
> >> >> >> >> >> >> Currently I have no idea. That's why I am requesting the pgpool log.
> >> >> >> >> >> >>
> >> >> >> >> >> >> > I'll c/p parts of my pgpool.service file as well, in case you may see
> >> >> >> >> >> >> > something 'off' in it?
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > [Service]
> >> >> >> >> >> >> > User=postgres
> >> >> >> >> >> >> > Group=postgres
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > EnvironmentFile=-/etc/default/pgpool2
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf
> >> $OPTS
> >> >> >> >> >> >> > ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf
> >> >> $STOP_OPTS
> >> >> >> >> stop
> >> >> >> >> >> >> > ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf
> >> >> reload
> >> >> >> >> >> >> > LimitNOFILE=65536:65536
> >> >> >> >> >> >> > LimitNPROC=16384:16384
> >> >> >> >> >> >> > LimitSTACK=10485760:33554432
> >> >> >> >> >> >> > LimitMEMLOCK=infinity
> >> >> >> >> >> >> > LimitDATA=infinity
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > Thanks in advance.
> >> >> >> >> >> >> >
> >> >> >> >> >> >> > On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <
> >> >> >> ishii at postgresql.org>
> >> >> >> >> >> >> wrote:
> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Hi,
> >> >> >> >> >> >> >> > So long story short, I'm migrating a PG13 cluster from RHEL to
> >> >> >> >> >> >> >> > Ubuntu 24.04. In the process, the pgpool version is also being
> >> >> >> >> >> >> >> > upgraded to 4.6 (from 4.2).
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> What is the exact version of 4.6? I am asking because 4.6.0 has
> >> >> >> >> >> >> >> a vulnerability.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Also 4.6.1 (and 4.6.0) have an issue with IPv6.
> >> >> >> >> >> >> >>
> >> >> https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> If possible, I recommend to upgrade to 4.6.2.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > There is also repmgr running on this 4-node cluster, 1 primary
> >> >> >> >> >> >> >> > and 3 replicas.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> I am not familiar with repmgr, so I will ignore this part.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > All the migration seems to have completed successfully,
> >> >> >> >> >> >> >> > but I've run into a weird problem with pgpool.  I have quite a
> >> >> >> >> >> >> >> > few errors that look like the entries below, and they seem to
> >> >> >> >> >> >> >> > "come and go" when checking the pgpool log.  The process itself
> >> >> >> >> >> >> >> > seems to work (I think), but every now and then, these errors
> >> >> >> >> >> >> >> > (warnings?) show up:
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628043: DEBUG:  watchdog heartbeat: send heartbeat
> >> >> >> >> >> >> >> > signal to db-replica3-priv:9694
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628043: LOCATION:  wd_heartbeat.c:771
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  exit handler called (signal: 2)
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1272
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING:  failed to lock semaphore
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:129
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING:  failed to unlock semaphore
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:202
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  shutting down by signal 2
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1324
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  terminating all child processes
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1334
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628047: WARNING:  failed to lock semaphore
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628047: DETAIL:  Invalid argument
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628047: LOCATION:  pool_sema.c:129
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628042: WARNING:  failed to lock semaphore
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628042: DETAIL:  Invalid argument
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628042: LOCATION:  pool_sema.c:129
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628039: WARNING:  failed to lock semaphore
> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628039: DETAIL:  Invalid argument
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Can you share the result of "ipcs -s" so that I can confirm pgpool
> >> >> >> >> >> >> >> properly allocates the semaphore.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> By the way, do you see "WARNING: failed to unlock semaphore" only
> >> >> >> >> >> >> >> when pgpool is shutting down?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > Also, earlier in the log, I see errors about being unable to create a
> >> >> >> >> >> >> >> > socket; for some reason it's trying (and failing) to create a socket on
> >> >> >> >> >> >> >> > localhost (127.0.0.1:9694) and "creating socket on ::1:9694".  Are
> >> >> >> >> >> >> >> > these "ok to ignore"?
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> First of all, "creating socket on ::1:9694" looks unusual because it
> >> >> >> >> >> >> >> means that heartbeat_hostname is set to "::1", which is the IPv6
> >> >> >> >> >> >> >> loopback address. heartbeat_hostname is expected to be set to an
> >> >> >> >> >> >> >> IPv4 or IPv6 inet address (not a loopback address).
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> For now, I would like to wait for your reply regarding my
> >> >> >> >> >> >> >> questions above.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> Best regards,
> >> >> >> >> >> >> >> --
> >> >> >> >> >> >> >> Tatsuo Ishii
> >> >> >> >> >> >> >> SRA OSS K.K.
> >> >> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >>
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  creating
> >> socket
> >> >> on
> >> >> >> >> >> >> >> 127.0.0.1:9694
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
> >> >> >> wd_heartbeat.c:385
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  set
> SO_REUSEPORT
> >> >> >> option
> >> >> >> >> to
> >> >> >> >> >> the
> >> >> >> >> >> >> >> socket
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
> >> >> >> wd_heartbeat.c:891
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  creating
> >> watchdog
> >> >> >> >> heartbeat
> >> >> >> >> >> >> >> receive
> >> >> >> >> >> >> >> > socket.
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  creating
> >> socket
> >> >> on
> >> >> >> >> >> ::1:9694
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
> >> >> >> wd_heartbeat.c:385
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  failed to
> create
> >> >> >> watchdog
> >> >> >> >> >> >> >> heartbeat
> >> >> >> >> >> >> >> > receive socket. retrying...
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  bind
> failed
> >> with
> >> >> >> >> reason:
> >> >> >> >> >> >> >> "Cannot
> >> >> >> >> >> >> >> > assign requested address"
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
> >> >> >> wd_heartbeat.c:394
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:
> >> >> >> >> >> verify_backend_node_status:
> >> >> >> >> >> >> >> > true_primary 0
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
> >> >> >> pgpool_main.c:2704
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
> >> status[0]: 1
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
> >> >> >> >> pool_worker_child.c:267
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
> >> status[1]: 2
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
> >> >> >> >> pool_worker_child.c:267
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
> >> status[2]: 2
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
> >> >> >> >> pool_worker_child.c:267
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
> >> status[3]: 2
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
> >> >> >> >> pool_worker_child.c:267
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:
> >> >> >> >> >> >> pool_release_follow_primary_lock
> >> >> >> >> >> >> >> > called
> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
> >> >> >> pgpool_main.c:4106
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  set
> SO_REUSEPORT
> >> >> >> option
> >> >> >> >> to
> >> >> >> >> >> the
> >> >> >> >> >> >> >> socket
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:
> >> >> >> wd_heartbeat.c:891
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  creating
> >> watchdog
> >> >> >> >> heartbeat
> >> >> >> >> >> >> >> receive
> >> >> >> >> >> >> >> > socket.
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  creating
> >> socket
> >> >> on
> >> >> >> >> >> >> >> 127.0.0.1:9694
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:
> >> >> >> wd_heartbeat.c:385
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  set
> SO_REUSEPORT
> >> >> >> option
> >> >> >> >> to
> >> >> >> >> >> the
> >> >> >> >> >> >> >> socket
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:
> >> >> >> wd_heartbeat.c:891
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  creating
> >> watchdog
> >> >> >> >> heartbeat
> >> >> >> >> >> >> >> receive
> >> >> >> >> >> >> >> > socket.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  creating socket on ::1:9694
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:385
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  failed to create watchdog heartbeat receive socket. retrying...
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  bind failed with reason: "Cannot assign requested address"
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:394
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOG:  set
> SO_REUSEPORT
> >> >> >> option
> >> >> >> >> to
> >> >> >> >> >> the
> >> >> >> >> >> >> >> socket
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOCATION:
> >> >> >> wd_heartbeat.c:891
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOG:  creating
> socket
> >> for
> >> >> >> >> sending
> >> >> >> >> >> >> >> heartbeat
> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: DETAIL:  set
> >> SO_REUSEPORT
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > Other odd errors I see in the logs later on are when pgpool tries to
> >> >> >> >> >> >> >> > run the escalation script - can't quite find where it's failing to find
> >> >> >> >> >> >> >> > the ipv4 address:
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II
> parent
> >> >> >> process
> >> >> >> >> >> >> received
> >> >> >> >> >> >> >> > SIGUSR1
> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II
> parent
> >> >> >> process
> >> >> >> >> >> >> received
> >> >> >> >> >> >> >> > watchdog quorum change signal from watchdog
> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 396973: LOG:  watchdog:
> >> >> escalation
> >> >> >> >> started
> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395786: LOG:  Setting
> failover
> >> >> >> command
> >> >> >> >> >> >> timeout
> >> >> >> >> >> >> >> to
> >> >> >> >> >> >> >> > 135
> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  watchdog
> cluster
> >> >> now
> >> >> >> >> holds
> >> >> >> >> >> the
> >> >> >> >> >> >> >> quorum
> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: DETAIL:  updating
> the
> >> >> state
> >> >> >> of
> >> >> >> >> >> >> >> quarantine
> >> >> >> >> >> >> >> > backend nodes
> >> >> >> >> >> >> >> > Sun Jun  1 12:07:15 AM CEST 2025
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > escalation.sh: - remove VIP IP from all nodes
> >> >> >> >> >> >> >> > Node: db-primary-priv
> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
> >> >> >> >> >> >> >> > Node: db-replica1-priv
> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
> >> >> >> >> >> >> >> > Node: db-replica2-priv
> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
> >> >> >> >> >> >> >> > Node: db-replica3-priv
> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
> >> >> >> >> >> >> >> > 2025-06-01 00:07:20: pid 396973: LOG:  watchdog escalation successful
> >> >> >> >> >> >> >> > 2025-06-01 00:07:23: pid 396973: LOG:  successfully
> >> >> acquired
> >> >> >> the
> >> >> >> >> >> >> delegate
> >> >> >> >> >> >> >> > IP:"192.168.77.130"
> >> >> >> >> >> >> >> > 2025-06-01 00:07:23: pid 396973: DETAIL:
> 'if_up_cmd'
> >> >> >> returned
> >> >> >> >> with
> >> >> >> >> >> >> >> success
> >> >> >> >> >> >> >> > 2025-06-01 00:07:23: pid 395786: LOG:  watchdog
> >> escalation
> >> >> >> >> process
> >> >> >> >> >> >> with
> >> >> >> >> >> >> >> > pid: 396973 exit with SUCCESS.
> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING:  failed to
> >> lock
> >> >> >> >> semaphore
> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid
> >> argument
> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING:  failed to
> >> >> unlock
> >> >> >> >> >> semaphore
> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid
> >> argument
> >> >> >> >> >> >> >> > 2025-06-01 00:07:28: pid 396924: WARNING:  failed to
> >> lock
> >> >> >> >> semaphore
> >> >> >> >> >> >> >> > 2025-06-01 00:07:28: pid 396924: DETAIL:  Invalid
> >> argument
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > The way I started was I copied pgpool.conf from the "old" and
> >> >> >> >> >> >> >> > working pgpool 4.2 configuration and modified it slightly, though
> >> >> >> >> >> >> >> > nothing major: changed hostnames, and e.g. the delegate_ip
> >> >> >> >> >> >> >> > variable name was changed from delegate_IP.  Also added this
> >> >> >> >> >> >> >> > variable, as without it I didn't see a socket created for port 9898:
> >> >> >> >> >> >> >> > unix_socket_directories = '/var/run/postgresql'
> >> >> >> >> >> >> >> > I had to change network interfaces as well; on RHEL it was using
> >> >> >> >> >> >> >> > eth0 where on Ubuntu it's ens18, so I changed lines like:
> >> >> >> >> >> >> >> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0
> >> >> >> >> >> >> >> > The other thing I had to change was the path to arping, as on
> >> >> >> >> >> >> >> > Ubuntu 24 it's in /usr/bin (not sbin):
> >> >> >> >> >> >> >> >  arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > The other thing I'm confused about: I have two network interfaces
> >> >> >> >> >> >> >> > on each host, i.e.
> >> >> >> >> >> >> >> > 192.168.77.135/23 brd 192.168.77.255 scope global ens18
> >> >> >> >> >> >> >> > and
> >> >> >> >> >> >> >> > inet 192.168.78.135/24 brd 192.168.78.255 scope global ens19
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > In pgpool.conf, it is referencing the "priv" hosts (i.e.
> >> >> >> >> >> >> >> > heartbeat_hostname2 = 'db-replica2-priv'), which theoretically
> >> >> >> >> >> >> >> > should be on the ens19 network interface as shown above, but in my
> >> >> >> >> >> >> >> > old pgpool.conf, it was only referencing eth0 on RHEL, even though
> >> >> >> >> >> >> >> > the "priv" hosts were on eth1.  I'm wondering if that needs to be
> >> >> >> >> >> >> >> > changed in Ubuntu or it makes no difference here?
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> >  Right now, I can't quite figure out whether it is something with
> >> >> >> >> >> >> >> > pgpool.conf or with the Ubuntu settings, and the semaphore error is
> >> >> >> >> >> >> >> > quite cryptic.  If you need me to post some parts of my pgpool.conf,
> >> >> >> >> >> >> >> > please advise which.  I've seen an older version of the FAQ that said
> >> >> >> >> >> >> >> > repmgr and pgpool don't work together, but then I saw mention that
> >> >> >> >> >> >> >> > this was no longer the case.
> >> >> >> >> >> >> >> >
> >> >> >> >> >> >> >> > I would much appreciate any help you can provide in solving this
> >> >> >> >> >> >> >> > problem.
> >> >> >> >> >> >> >>
> >> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >>
> >> >> >>
> >> >>
> >> _______________________________________________
> >> pgpool-general mailing list
> >> pgpool-general at pgpool.net
> >> http://www.pgpool.net/mailman/listinfo/pgpool-general
> >>
>