[pgpool-general: 9503] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

Tatsuo Ishii ishii at postgresql.org
Tue Jun 3 20:16:46 JST 2025


Hi,

Thank you for the great information!

> Hi,
> We've made some progress here and I'd like to get your feedback regarding
> the fix/workaround.  Basically, after some research (incl. google/chatgpt
> etc), we came across this information:
> 
> 🧠 1. systemd-logind manages user sessions
> 
> In Ubuntu 24.04 (and since ~15.04), systemd manages user sessions. When a
> user session ends (e.g. when a background process or login shell finishes),
> the following happens:
> 
>    - The session scope (session-XXXX.scope) is stopped.
>    - The user manager (user@UID.service) is also stopped *unless lingering
>      is enabled*.
> 
> 🧨 If no processes remain and the user is not lingering:
> 
>    - Shared memory and *System V IPC resources* (like semaphores) *may be
>      destroyed* automatically if no one is using them.
>    - If a process (like pgpool2 or PostgreSQL) relied on those semaphores,
>      you may later see:
> 
>    failed to lock semaphore
> 
> 
> This is a strong sign that the user session ended and IPC resources (like
> semaphores) were destroyed.
> 
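> The logind setting behind this behaviour is RemoveIPC, which typically
> defaults to "yes" for regular (non-system) users. A quick check, assuming
> the stock Ubuntu file layout:
> 
> $ grep RemoveIPC /etc/systemd/logind.conf
> #RemoveIPC=yes
> 
> An alternative to lingering would be setting RemoveIPC=no there and then
> restarting logind:
> 
> $ sudo systemctl restart systemd-logind
> 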
> This does seem to be happening here, as I see these in /var/log/syslog
> around the time we discussed in the last posting, just prior to the
> semaphore errors showing up in the pgpool logs:
> 2025-06-02T12:45:17.173486+02:00 db-replica3 systemd[1]: user@3000.service: Deactivated successfully.
> 2025-06-02T12:45:17.173997+02:00 db-replica3 systemd[1]: Stopped user@3000.service - User Manager for UID 3000.
> 2025-06-02T12:45:17.240314+02:00 db-replica3 systemd[1]: Stopping user-runtime-dir@3000.service - User Runtime Directory /run/user/3000...
> 2025-06-02T12:45:17.248017+02:00 db-replica3 systemd[1]: run-user-3000.mount: Deactivated successfully.
> 2025-06-02T12:45:17.249107+02:00 db-replica3 systemd[1]: user-runtime-dir@3000.service: Deactivated successfully.
> 2025-06-02T12:45:17.249353+02:00 db-replica3 systemd[1]: Stopped user-runtime-dir@3000.service - User Runtime Directory /run/user/3000.
> 2025-06-02T12:45:17.251672+02:00 db-replica3 systemd[1]: Removed slice user-3000.slice - User Slice of UID 3000.
> 2025-06-02T12:45:23.581108+02:00 db-replica3 strace[1692927]: [pid 1693495] 12:45:23.580360 semtimedop(41, [{sem_num=6, sem_op=-1, sem_flg=SEM_UNDO}], 1, NULL) = -1 EINVAL (Invalid argument)
> 
> It also advises adding Type=forking to the [Service] section of the
> pgpool.service file, which we don't currently have. I think I posted it
> earlier, but just in case, here is the relevant part of our service file:
> 
> [Unit]
> Description=Pgpool-II
> After=syslog.target network.target postgresql.service
> Wants=postgresql.service
> 
> [Service]
> User=postgres
> Group=postgres
> 
> EnvironmentFile=-/etc/default/pgpool2
> 
> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
> 
> -------------------
> One bit I read that differs from our setup is that Type=forking, or
> alternatively running with -n, is required (as you can see above, we have
> neither right now); see the unit-file sketch just after this list:
> 
> - User=postgres: assumes pgpool should run as the postgres user. Adjust if
>   different.
> - ExecStart=/usr/sbin/pgpool -n: use the *correct path and arguments* for
>   your installation.
> - pgpool must be configured to *not daemonize* (-n flag), or systemd will
>   lose track of it.
> - You can add environment variables for config files, ports, etc.
> -----------------------
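> 
> For reference, a minimal non-daemonizing variant of the [Service] section
> might look like this (just a sketch; paths are taken from our current unit
> file, and Type=simple is what systemd assumes when Type= is omitted):
> 
> [Service]
> User=postgres
> Group=postgres
> EnvironmentFile=-/etc/default/pgpool2
> # -n keeps pgpool in the foreground so systemd tracks the main process
> ExecStart=/usr/sbin/pgpool -n -f /etc/pgpool2/pgpool.conf $OPTS
> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
> # with -n, systemd's default SIGTERM on "systemctl stop" is usually enough,
> # so an explicit ExecStop line becomes optional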
> 
> Anyway, our sysadmins suggested enabling the Linger option for the
> postgres user:
> $ sudo loginctl show-user postgres -p Linger
> Linger=yes
> Once we did that, the semaphore issue disappeared; it's been running for
> over an hour now and ipcs still shows the semaphores:
> $ ipcs -s
> 
> ------ Semaphore Arrays --------
> key        semid      owner      perms      nsems
> 0x00000000 6          postgres   600        8
> 
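> For reference, enabling it persistently is the usual one-liner (assuming
> our sysadmins used the standard mechanism):
> 
> $ sudo loginctl enable-linger postgres
> $ loginctl show-user postgres -p Linger    # now reports Linger=yes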
> 
> So, what do you think about this fix? I read that letting the user have
> linger on is not generally recommended, but that seems to be the only way
> to keep the semaphores from disappearing right now...

I am not familiar with this area. I will discuss with other
pgpool/PostgreSQL developers and reply back to you.

Best regards,
--
Tatsuo Ishii
SRA OSS K.K.
English: http://www.sraoss.co.jp/index_en/
Japanese:http://www.sraoss.co.jp

> 
> On Tue, Jun 3, 2025 at 2:30 AM Tatsuo Ishii <ishii at postgresql.org> wrote:
> 
>> Thanks for the strace log.
>>
>> 1756952 16:45:14.448431 semtimedop(46, [{sem_num=6, sem_op=1,
>> sem_flg=SEM_UNDO}], 1, NULL) = 0
>> 1756952 16:45:24.482757 semtimedop(46, [{sem_num=6, sem_op=-1,
>> sem_flg=SEM_UNDO}], 1, NULL) = -1 EINVAL (Invalid argument)
>>
>> So the semaphore was removed between 16:45:14.448431 and
>> 16:45:24.482757.  I was expecting "semctl(46, 0, IPC_RMID, NULL) = 0"
>> (which removes semaphore) was executed there, but actually it was
>> not. Nothing was recorded in the period.
>>
>> I think this definitely means that a process which is not a child of
>> pgpool executed semctl(46, 0, IPC_RMID, NULL) (or ipcrm command). Do
>> you have any idea such a program other than pgpool?
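>>
>> If nothing in the strace output shows the removal, one system-wide way to
>> catch the culprit (assuming the auditd package is available) is a syscall
>> audit rule, for example:
>>
>> # log every semctl() call (ipcrm goes through it as well), then search:
>> sudo auditctl -a always,exit -F arch=b64 -S semctl -k sysvsem
>> sudo ausearch -k sysvsem -i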
>>
>> > Here is the strace result, hope this helps in any way, it takes about 10
>> > minutes for the semaphores to 'go away'...
>> >
>> >
>> > On Mon, Jun 2, 2025 at 2:53 PM Tatsuo Ishii <ishii at postgresql.org>
>> wrote:
>> >
>> >> You have 4 pgpool nodes and you issue the queries through one of
>> >> pgpools. Suppose you issued the queries through node 0, the "status"
>> >> field is only updated on node 0. On other 3 nodes the status remains
>> >> "waiting". However the status on all 4 nodes will be synched when the
>> >> status becomes "down", or becomes "up" and the former status is
>> >> "down".
>> >>
>> >> > That's where I wonder if this is working properly, because there seem
>> >> > to be clients/connections happening. I ran the backend status command
>> >> > and this is what it showed:
>> >> >
>> >> >  node_id | hostname        | port | status | role    | select_cnt | insert_cnt | update_cnt | delete_cnt | ddl_cnt | other_cnt | panic_cnt | fatal_cnt | error_cnt
>> >> > ---------+-----------------+------+--------+---------+------------+------------+------------+------------+---------+-----------+-----------+-----------+-----------
>> >> >  0       | db-primarypriv  | 5434 | up     | primary | 7552       | 47         | 332        | 27         | 0       | 13678     | 0         | 0         | 6
>> >> >  1       | db-replica1priv | 5434 | up     | standby | 6817       | 0          | 0          | 0          | 0       | 13682     | 0         | 0         | 0
>> >> >  2       | db-replica2priv | 5434 | up     | standby | 6514       | 0          | 0          | 0          | 0       | 13693     | 0         | 0         | 1
>> >> >  3       | db-replica3priv | 5434 | up     | standby | 6910       | 0          | 0          | 0          | 0       | 13679     | 0         | 0         | 1
>> >> >
>> >> > (4 rows)
>> >> >
>> >> > So plenty of queries supposedly happening on them, but the nodes still
>> >> > show status 1.  What gives?
>> >> >
>> >> >
>> >> > On Mon, Jun 2, 2025 at 1:43 PM Tatsuo Ishii <ishii at postgresql.org>
>> >> wrote:
>> >> >
>> >> >> > Hi, another thing I noticed - the status of all those nodes in
>> >> >> > pcp_node_info seems to be wrong, for example:
>> >> >> > $ pcp_node_info -h localhost -p 9898 -U pcpadmin --verbose -n 3
>> >> >> > Hostname               : db-replica2
>> >> >> > Port                   : 5434
>> >> >> > Status                 : 1
>> >> >> > Weight                 : 0.250000
>> >> >> > Status Name            : waiting
>> >> >> > Backend Status Name    : up
>> >> >> > Role                   : standby
>> >> >> > Backend Role           : standby
>> >> >> > Replication Delay      : 0
>> >> >> > Replication State      : streaming
>> >> >> > Replication Sync State : async
>> >> >> > Last Status Change     : 2025-06-02 12:31:46
>> >> >>
>> >> >> Are you saying this is wrong?
>> >> >>
>> >> >> > Status                 : 1
>> >> >>
>> >> >> Seems normal to me because it just says no client has connected to
>> the
>> >> >> pgpool server until now.
>> >> >>
>> >> >> > Do you think this is because of the semaphore problem, or does it
>> >> suggest
>> >> >> > something else completely?
>> >> >> >
>> >> >> > On Mon, Jun 2, 2025 at 12:59 PM Tatsuo Ishii <ishii at postgresql.org
>> >
>> >> >> wrote:
>> >> >> >
>> >> >> >> > Hi, I changed the pgpool.service file for startup as you said,
>> >> >> replaced
>> >> >> >> my
>> >> >> >> > start with this:
>> >> >> >> > ExecStart=strace -f -tt --trace=ipc /usr/sbin/pgpool -f
>> >> >> >> > /etc/pgpool2/pgpool.conf $OPTS
>> >> >> >> > I'm attaching the log from replica again, but I'm not sure this
>> >> >> actually
>> >> >> >> > worked (as in, it doesn't look much different, to me at least).
>> >> >> >>
>> >> >> >> Yeah. It seems there's strace output in the log.
>> >> >> >> Can you try "-o" of strace, which instructs strace a file name to
>> >> print
>> >> >> >> the output. e.g.
>> >> >> >>
>> >> >> >> ExecStart=strace -f -tt --trace=ipc -o /tmp/strace.out
>> >> /usr/sbin/pgpool
>> >> >> -f
>> >> >> >> /etc/pgpool2/pgpool.conf $OPTS
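>> >> >> >>
>> >> >> >> Once the problem reproduces, something like this should show whether
>> >> >> >> any traced process removed the semaphore (a sketch; path as above).
>> >> >> >> If nothing matches, the remover is outside pgpool's process tree:
>> >> >> >>
>> >> >> >> grep -n 'IPC_RMID\|EINVAL' /tmp/strace.out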
>> >> >> >>
>> >> >> >> > On Mon, Jun 2, 2025 at 11:47 AM Tatsuo Ishii <
>> ishii at postgresql.org
>> >> >
>> >> >> >> wrote:
>> >> >> >> >
>> >> >> >> >> > Hi,
>> >> >> >> >> > I tried the strace as you shown but it didn't really output
>> >> >> anything:
>> >> >> >> >> >
>> >> >> >> >> > $ sudo strace -p 1662563 -f -tt --trace=ipc
>> >> >> >> >> > strace: Process 1662563 attached
>> >> >> >> >> >
>> >> >> >> >> > ^Cstrace: Process 1662563 detached.   <--- this is me getting
>> >> out
>> >> >> of
>> >> >> >> it,
>> >> >> >> >> > since I saw in another window the semaphores were gone
>> already,
>> >> >> >> without
>> >> >> >> >> > anything outputting here.  Unless something's wrong with the
>> >> >> syntax of
>> >> >> >> >> the
>> >> >> >> >> > command, it didn't really output anything.  I changed the log
>> >> line
>> >> >> you
>> >> >> >> >> > suggested as well.
>> >> >> >> >> >
>> >> >> >> >> > Also, FYI - I tried to downgrade to a lower (4.3.7) version as
>> >> >> >> >> > it was available in my package repository, and the same
>> >> >> >> >> > semaphore issue persists,
>> >> >> >> >>
>> >> >> >> >> Okay. I think you need to start pgpool from strace. I mean,
>> >> >> >> >> modify pgpool.service from:
>> >> >> >> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
>> >> >> >> >> to:
>> >> >> >> >> ExecStart=strace -f -tt --trace=ipc /usr/sbin/pgpool -f
>> >> >> >> >> /etc/pgpool2/pgpool.conf $OPTS
>> >> >> >> >>
>> >> >> >> >> > On Mon, Jun 2, 2025 at 2:40 AM Tatsuo Ishii <
>> >> ishii at postgresql.org>
>> >> >> >> >> wrote:
>> >> >> >> >> >
>> >> >> >> >> >> > Hi,
>> >> >> >> >> >> > Not sure I can patch the source, as it was installed via
>> apt
>> >> get
>> >> >> >> and
>> >> >> >> >> not
>> >> >> >> >> >> > compiled from source.  If that is doable to a
>> package-built
>> >> one
>> >> >> and
>> >> >> >> >> you
>> >> >> >> >> >> can
>> >> >> >> >> >> > give me some instructions, I can certainly try.
>> >> >> >> >> >>
>> >> >> >> >> >> Unfortunately I'm not familiar with debian packaging.
>> >> >> >> >> >>
>> >> >> >> >> >> > As for the pgpool log, I'm attaching it, hopefully it
>> will go
>> >> >> >> through,
>> >> >> >> >> >> it's
>> >> >> >> >> >> > about 5mb as I enabled higher debug options.
>> >> >> >> >> >>
>> >> >> >> >> >> It was sent as a direct mail to me. Not sure who/what
>> removed
>> >> the
>> >> >> >> >> >> semaphore. I observe following from the log anyway:
>> >> >> >> >> >>
>> >> >> >> >> >> - pgpool started on 16:10:21. At that time the semaphore
>> >> >> definitely
>> >> >> >> >> >> existed.
>> >> >> >> >> >> - Until 16:15:13, everything worked great.
>> >> >> >> >> >> - On 16:15:23, the problem started.
>> >> >> >> >> >>
>> >> >> >> >> >> So it seems the semaphore was removed by someone between
>> >> 16:15:13
>> >> >> and
>> >> >> >> >> >> 16:15:23.
>> >> >> >> >> >>
>> >> >> >> >> >> Unfortunately the removal of semaphore was not recorded in
>> the
>> >> >> log.
>> >> >> >> To
>> >> >> >> >> >> determine who removed the semaphore, can you please attach
>> >> strace
>> >> >> to
>> >> >> >> >> >> pgpool main and share the log? This time debug log is not
>> >> >> necessary.
>> >> >> >> >> >>
>> >> >> >> >> >> - Find the pgpool parent process pid (in the debug log it
>> was
>> >> >> >> 1483510)
>> >> >> >> >> >> - Become root (attaching strace usually requires root
>> >> privilege)
>> >> >> >> >> >> - Attach strace to the process $pid
>> >> >> >> >> >>
>> >> >> >> >> >> strace -p $pid -f -tt --trace=ipc
>> >> >> >> >> >>
>> >> >> >> >> >> -f: (follow the child process of pgpool main)
>> >> >> >> >> >> -tt: add timestamp with milliseconds
>> >> >> >> >> >> --trace=ipc: limit the tracing system call to ipc to reduce
>> the
>> >> >> log
>> >> >> >> >> >>
>> >> >> >> >> >> BTW, I recommend you set log_line_prefix to something like
>> >> >> >> >> >> this, so that you get millisecond timestamps (%m) and the
>> >> >> >> >> >> process application name (%a) in the log. They are usable in 4.6.
>> >> >> >> >> >>
>> >> >> >> >> >> log_line_prefix = '%m: %a pid %p: '
>> >> >> >> >> >>
>> >> >> >> >> >> > On Sun, Jun 1, 2025 at 4:20 PM Tatsuo Ishii <
>> >> >> ishii at postgresql.org>
>> >> >> >> >> >> wrote:
>> >> >> >> >> >> >
>> >> >> >> >> >> >> > To your points/questions:
>> >> >> >> >> >> >> > 1) version is 4.6.1 - the source I'm using (via apt
>> get)
>> >> >> doesn't
>> >> >> >> >> have
>> >> >> >> >> >> >> 4.6.2
>> >> >> >> >> >> >> > yet
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Ok.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > 2) ipcs -s shows... nothing, literally.  Checked all
>> >> nodes,
>> >> >> i.e.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > postgres at db-replica3:~$ ipcs -s
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > ------ Semaphore Arrays --------
>> >> >> >> >> >> >> > key        semid      owner      perms      nsems
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Also, it kinda seems like "failed to unlock semaphore"
>> >> shows
>> >> >> >> during
>> >> >> >> >> >> >> startup
>> >> >> >> >> >> >> > of pgpool and while it's running, but maybe I'm not
>> >> reading
>> >> >> the
>> >> >> >> log
>> >> >> >> >> >> >> > properly.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> So pgpool lost the semaphore soon after starting. Can you
>> >> >> >> >> >> >> share the full pgpool log from before "failed to unlock
>> >> >> >> >> >> >> semaphore" appears?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > 3) regarding heartbeats, my pgpool.conf has it roughly
>> >> like
>> >> >> >> this:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > heartbeat_hostname0 = 'db-primary-priv'
>> >> >> >> >> >> >> > heartbeat_port0 = 9694
>> >> >> >> >> >> >> > heartbeat_device0 = ''
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > heartbeat_hostname1 = 'db-replica1-priv'
>> >> >> >> >> >> >> > heartbeat_port1 = 9694
>> >> >> >> >> >> >> > heartbeat_device1 = ''
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > heartbeat_hostname2 = 'db-replica2-priv'
>> >> >> >> >> >> >> > heartbeat_port2 = 9694
>> >> >> >> >> >> >> > heartbeat_device2 = ''
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > heartbeat_hostname3 = 'db-replica3-priv'
>> >> >> >> >> >> >> > heartbeat_port3 = 9694
>> >> >> >> >> >> >> > heartbeat_device3 = ''
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Looks normal.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> I have checked the pgpool source code and found suspicious
>> >> >> >> >> >> >> code. Before 4.6, the pgpool heartbeat receiver process binds
>> >> >> >> >> >> >> to any address on the host. However, from 4.6 pgpool binds
>> >> >> >> >> >> >> only to the loopback address, which is plain wrong. Is it
>> >> >> >> >> >> >> possible for you to patch the pgpool source code and build a
>> >> >> >> >> >> >> pgpool binary?
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > So it seems likely that the inability to lock/unlock
>> >> >> semaphores
>> >> >> >> is
>> >> >> >> >> the
>> >> >> >> >> >> >> > problem, question is why and how do I resolve it? :(
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> Currently I have no idea. That's why I am requesting the
>> >> pgpool
>> >> >> >> log.
>> >> >> >> >> >> >>
>> >> >> >> >> >> >> > I'll c/p parts of my pgpool.service file as well, in
>> case
>> >> you
>> >> >> >> may
>> >> >> >> >> see
>> >> >> >> >> >> >> > something 'off' in it?
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > [Service]
>> >> >> >> >> >> >> > User=postgres
>> >> >> >> >> >> >> > Group=postgres
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > EnvironmentFile=-/etc/default/pgpool2
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf
>> >> $OPTS
>> >> >> >> >> >> >> > ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf
>> >> >> $STOP_OPTS
>> >> >> >> >> stop
>> >> >> >> >> >> >> > ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf
>> >> >> reload
>> >> >> >> >> >> >> > LimitNOFILE=65536:65536
>> >> >> >> >> >> >> > LimitNPROC=16384:16384
>> >> >> >> >> >> >> > LimitSTACK=10485760:33554432
>> >> >> >> >> >> >> > LimitMEMLOCK=infinity
>> >> >> >> >> >> >> > LimitDATA=infinity
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > Thanks in advance.
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> > On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <
>> >> >> >> ishii at postgresql.org>
>> >> >> >> >> >> >> wrote:
>> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > Hi,
>> >> >> >> >> >> >> >> > So long story short, I'm migrating a PG13 cluster
>> from
>> >> >> RHEL
>> >> >> >> to
>> >> >> >> >> >> Ubuntu
>> >> >> >> >> >> >> >> > 24.04. In the process, the pgpool version is also
>> being
>> >> >> >> >> upgraded to
>> >> >> >> >> >> >> 4.6
>> >> >> >> >> >> >> >> > (from 4.2).
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> What is the exact version of 4.6? I am asking because
>> >> >> >> >> >> >> >> 4.6.0 has a vulnerability.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >>
>> https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Also 4.6.1 (and 4.6.0) have an issue with IPv6.
>> >> >> >> >> >> >> >>
>> >> >> https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> If possible, I recommend upgrading to 4.6.2.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > There is also repmgr running on this 4 node
>> cluster, 1
>> >> >> >> primary
>> >> >> >> >> >> >> >> > and 3 replicas.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> I am not familiar with repmgr, so I will ignore this part.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > All the migration seems to have completed
>> successfully,
>> >> >> >> >> >> >> >> > but I've ran into a weird problem with pgpool.  I
>> have
>> >> >> quite
>> >> >> >> a
>> >> >> >> >> few
>> >> >> >> >> >> >> errors
>> >> >> >> >> >> >> >> > that look like the below entries, and it seems to
>> "come
>> >> >> and
>> >> >> >> go"
>> >> >> >> >> >> when
>> >> >> >> >> >> >> >> > checking pgpool log.  The process itself, seems to
>> >> work (I
>> >> >> >> >> think),
>> >> >> >> >> >> but
>> >> >> >> >> >> >> >> > every now and then, these errors (warnings?) show
>> up:
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628043: DEBUG:  watchdog
>> >> >> heartbeat:
>> >> >> >> >> send
>> >> >> >> >> >> >> >> heartbeat
>> >> >> >> >> >> >> >> > signal to db-replica3-priv:9694
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628043: LOCATION:
>> >> >> >> wd_heartbeat.c:771
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  exit handler
>> >> called
>> >> >> >> >> >> (signal: 2)
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:
>> >> >> >> pgpool_main.c:1272
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING:  failed to lock semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:129
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING:  failed to
>> >> >> unlock
>> >> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:
>> >> >> pool_sema.c:202
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  shutting
>> down by
>> >> >> >> signal 2
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:
>> >> >> >> pgpool_main.c:1324
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  terminating
>> all
>> >> >> child
>> >> >> >> >> >> processes
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:
>> >> >> >> pgpool_main.c:1334
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628047: WARNING:  failed to
>> >> lock
>> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628047: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628047: LOCATION:
>> >> >> pool_sema.c:129
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628042: WARNING:  failed to
>> >> lock
>> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628042: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628042: LOCATION:
>> >> >> pool_sema.c:129
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628039: WARNING:  failed to
>> >> lock
>> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:00: pid 628039: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Can you share the result of "ipcs -s" so that I can
>> >> confirm
>> >> >> >> pgpool
>> >> >> >> >> >> >> >> properly allocates the semaphore.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> By the way do you see "WARNING: failed to unlock
>> >> semaphore"
>> >> >> >> only
>> >> >> >> >> when
>> >> >> >> >> >> >> >> pgpool is shutting down?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > Also, earlier in the log, I see errors about being
>> >> unable
>> >> >> to
>> >> >> >> >> >> create a
>> >> >> >> >> >> >> >> > socket, for some reason it's trying (and failing) to
>> >> >> create
>> >> >> >> >> socket
>> >> >> >> >> >> on
>> >> >> >> >> >> >> >> > localhost (127.0.0.1:9694) and  "creating socket on
>> >> >> >> ::1:9694".
>> >> >> >> >> >> Are
>> >> >> >> >> >> >> >> these
>> >> >> >> >> >> >> >> > "ok to ignore"?
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> First of all, "creating socket on ::1:9694" looks unusual
>> >> >> >> >> >> >> >> because it means that heartbeat_hostname is set to "::1",
>> >> >> >> >> >> >> >> which is the IPv6 loopback address. heartbeat_hostname is
>> >> >> >> >> >> >> >> expected to be set to an IPv4 or IPv6 inet address (not a
>> >> >> >> >> >> >> >> loopback address).
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> For now, I would like to wait for your reply regarding my
>> >> >> >> >> >> >> >> questions above.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> Best regards,
>> >> >> >> >> >> >> >> --
>> >> >> >> >> >> >> >> Tatsuo Ishii
>> >> >> >> >> >> >> >> SRA OSS K.K.
>> >> >> >> >> >> >> >> English: http://www.sraoss.co.jp/index_en/
>> >> >> >> >> >> >> >> Japanese:http://www.sraoss.co.jp
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  creating
>> >> socket
>> >> >> on
>> >> >> >> >> >> >> >> 127.0.0.1:9694
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
>> >> >> >> wd_heartbeat.c:385
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  set
>> SO_REUSEPORT
>> >> >> >> option
>> >> >> >> >> to
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> socket
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
>> >> >> >> wd_heartbeat.c:891
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  creating
>> >> watchdog
>> >> >> >> >> heartbeat
>> >> >> >> >> >> >> >> receive
>> >> >> >> >> >> >> >> > socket.
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  creating
>> >> socket
>> >> >> on
>> >> >> >> >> >> ::1:9694
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
>> >> >> >> wd_heartbeat.c:385
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  failed to
>> create
>> >> >> >> watchdog
>> >> >> >> >> >> >> >> heartbeat
>> >> >> >> >> >> >> >> > receive socket. retrying...
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  bind
>> failed
>> >> with
>> >> >> >> >> reason:
>> >> >> >> >> >> >> >> "Cannot
>> >> >> >> >> >> >> >> > assign requested address"
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:
>> >> >> >> wd_heartbeat.c:394
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:
>> >> >> >> >> >> verify_backend_node_status:
>> >> >> >> >> >> >> >> > true_primary 0
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
>> >> >> >> pgpool_main.c:2704
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
>> >> status[0]: 1
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
>> >> >> >> >> pool_worker_child.c:267
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
>> >> status[1]: 2
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
>> >> >> >> >> pool_worker_child.c:267
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
>> >> status[2]: 2
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
>> >> >> >> >> pool_worker_child.c:267
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node
>> >> status[3]: 2
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
>> >> >> >> >> pool_worker_child.c:267
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:
>> >> >> >> >> >> >> pool_release_follow_primary_lock
>> >> >> >> >> >> >> >> > called
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:
>> >> >> >> pgpool_main.c:4106
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  set
>> SO_REUSEPORT
>> >> >> >> option
>> >> >> >> >> to
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> socket
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:
>> >> >> >> wd_heartbeat.c:891
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  creating
>> >> watchdog
>> >> >> >> >> heartbeat
>> >> >> >> >> >> >> >> receive
>> >> >> >> >> >> >> >> > socket.
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  creating
>> >> socket
>> >> >> on
>> >> >> >> >> >> >> >> 127.0.0.1:9694
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:
>> >> >> >> wd_heartbeat.c:385
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  set
>> SO_REUSEPORT
>> >> >> >> option
>> >> >> >> >> to
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> socket
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:
>> >> >> >> wd_heartbeat.c:891
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  creating
>> >> watchdog
>> >> >> >> >> heartbeat
>> >> >> >> >> >> >> >> receive
>> >> >> >> >> >> >> >> > socket.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  creating socket on ::1:9694
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:385
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  failed to create watchdog heartbeat receive socket. retrying...
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  bind failed with reason: "Cannot assign requested address"
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:394
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOG:  set
>> SO_REUSEPORT
>> >> >> >> option
>> >> >> >> >> to
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> socket
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOCATION:
>> >> >> >> wd_heartbeat.c:891
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOG:  creating
>> socket
>> >> for
>> >> >> >> >> sending
>> >> >> >> >> >> >> >> heartbeat
>> >> >> >> >> >> >> >> > 2025-06-01 09:11:23: pid 634767: DETAIL:  set
>> >> SO_REUSEPORT
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > Other odd errors I see in the logs later on are when
>> >> >> pgpool
>> >> >> >> >> tries
>> >> >> >> >> >> to
>> >> >> >> >> >> >> run
>> >> >> >> >> >> >> >> > escalation script - can't quite find where it's
>> >> failing to
>> >> >> >> find
>> >> >> >> >> >> ipv4
>> >> >> >> >> >> >> >> > address:
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II
>> parent
>> >> >> >> process
>> >> >> >> >> >> >> received
>> >> >> >> >> >> >> >> > SIGUSR1
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II
>> parent
>> >> >> >> process
>> >> >> >> >> >> >> received
>> >> >> >> >> >> >> >> > watchdog quorum change signal from watchdog
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 396973: LOG:  watchdog:
>> >> >> escalation
>> >> >> >> >> started
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395786: LOG:  Setting
>> failover
>> >> >> >> command
>> >> >> >> >> >> >> timeout
>> >> >> >> >> >> >> >> to
>> >> >> >> >> >> >> >> > 135
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  watchdog
>> cluster
>> >> >> now
>> >> >> >> >> holds
>> >> >> >> >> >> the
>> >> >> >> >> >> >> >> quorum
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:15: pid 395780: DETAIL:  updating
>> the
>> >> >> state
>> >> >> >> of
>> >> >> >> >> >> >> >> quarantine
>> >> >> >> >> >> >> >> > backend nodes
>> >> >> >> >> >> >> >> > Sun Jun  1 12:07:15 AM CEST 2025
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > escalation.sh: - remove VIP IP from all nodes
>> >> >> >> >> >> >> >> > Node: db-primary-priv
>> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
>> >> >> >> >> >> >> >> > Node: db-replica1-priv
>> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
>> >> >> >> >> >> >> >> > Node: db-replica2-priv
>> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
>> >> >> >> >> >> >> >> > Node: db-replica3-priv
>> >> >> >> >> >> >> >> > Error: ipv4: Address not found.
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:20: pid 396973: LOG:  watchdog escalation successful
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:23: pid 396973: LOG:  successfully
>> >> >> acquired
>> >> >> >> the
>> >> >> >> >> >> >> delegate
>> >> >> >> >> >> >> >> > IP:"192.168.77.130"
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:23: pid 396973: DETAIL:
>> 'if_up_cmd'
>> >> >> >> returned
>> >> >> >> >> with
>> >> >> >> >> >> >> >> success
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:23: pid 395786: LOG:  watchdog
>> >> escalation
>> >> >> >> >> process
>> >> >> >> >> >> >> with
>> >> >> >> >> >> >> >> > pid: 396973 exit with SUCCESS.
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING:  failed to
>> >> lock
>> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING:  failed to
>> >> >> unlock
>> >> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:28: pid 396924: WARNING:  failed to
>> >> lock
>> >> >> >> >> semaphore
>> >> >> >> >> >> >> >> > 2025-06-01 00:07:28: pid 396924: DETAIL:  Invalid
>> >> argument
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > The way I started was I copied pgpool.conf from the
>> >> "old"
>> >> >> and
>> >> >> >> >> >> working
>> >> >> >> >> >> >> >> > pgpool 4.2 configuration and modified it slightly,
>> >> though
>> >> >> >> >> nothing
>> >> >> >> >> >> >> >> > major,changed hostnames and i.e. delegate_ip
>> variable
>> >> name
>> >> >> >> was
>> >> >> >> >> >> changed
>> >> >> >> >> >> >> >> > from delegate_IP.  Also added this variable as
>> without
>> >> >> it, I
>> >> >> >> >> didn't
>> >> >> >> >> >> >> see a
>> >> >> >> >> >> >> >> > socket created for port 9898:
>> >> >> >> >> >> >> >> > unix_socket_directories = '/var/run/postgresql'
>> >> >> >> >> >> >> >> > I had to change network interfaces as well, in the
>> >> RHEL it
>> >> >> >> was
>> >> >> >> >> >> using
>> >> >> >> >> >> >> eth0
>> >> >> >> >> >> >> >> > where on Ubuntu it's ens18, so I changed lines like:
>> >> >> >> >> >> >> >> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18
>> >> label
>> >> >> >> >> ens18:0
>> >> >> >> >> >> >> >> > Other thing I had to change was path to arping as on
>> >> >> Ubuntu
>> >> >> >> 24
>> >> >> >> >> >> it's in
>> >> >> >> >> >> >> >> > /usr/bin (not sbin)
>> >> >> >> >> >> >> >> >  arping_cmd = '/usr/bin/sudo /usr/bin/arping -U
>> $_IP_$
>> >> -w
>> >> >> 1
>> >> >> >> -I
>> >> >> >> >> >> ens18'
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > The other thing I'm confused about, I have two
>> network
>> >> >> >> >> interfaces
>> >> >> >> >> >> on
>> >> >> >> >> >> >> each
>> >> >> >> >> >> >> >> > host, i.e.
>> >> >> >> >> >> >> >> > 192.168.77.135/23 brd 192.168.77.255 scope global
>> >> ens18
>> >> >> >> >> >> >> >> > and
>> >> >> >> >> >> >> >> > inet 192.168.78.135/24 brd 192.168.78.255 scope
>> global
>> >> >> ens19
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > In pgpool.conf, it is referencing the "priv" hosts
>> >> (i.e.
>> >> >> >> >> >> >> >> > heartbeat_hostname2 = 'db-replica2-priv'), which
>> >> >> >> theoretically
>> >> >> >> >> >> should
>> >> >> >> >> >> >> be
>> >> >> >> >> >> >> >> on
>> >> >> >> >> >> >> >> > ens19 network interface as shown above, but in my
>> old
>> >> >> >> >> pgpool.conf,
>> >> >> >> >> >> it
>> >> >> >> >> >> >> was
>> >> >> >> >> >> >> >> > only referencing eth0 on RHEL, even though "priv"
>> hosts
>> >> >> were
>> >> >> >> on
>> >> >> >> >> >> eth1.
>> >> >> >> >> >> >> >> I'm
>> >> >> >> >> >> >> >> > wondering if that needs to be changed in Ubuntu or
>> it
>> >> >> makes
>> >> >> >> no
>> >> >> >> >> >> >> difference
>> >> >> >> >> >> >> >> > here?
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> >  Right now, I can't quite figure out whether it is
>> >> >> something
>> >> >> >> >> with
>> >> >> >> >> >> >> >> > pgpool.conf or with the Ubuntu settings and the
>> >> semaphore
>> >> >> >> error
>> >> >> >> >> is
>> >> >> >> >> >> >> quite
>> >> >> >> >> >> >> >> > cryptic.  If you need me to post some parts of my
>> >> >> >> pgpool.conf,
>> >> >> >> >> >> please
>> >> >> >> >> >> >> >> > advise which.  I've seen an older version of the FAQ
>> >> that
>> >> >> >> said
>> >> >> >> >> >> repmgr
>> >> >> >> >> >> >> and
>> >> >> >> >> >> >> >> > pgpool don't work together but then I saw mention
>> that
>> >> >> this
>> >> >> >> was
>> >> >> >> >> no
>> >> >> >> >> >> >> longer
>> >> >> >> >> >> >> >> > the case.
>> >> >> >> >> >> >> >> >
>> >> >> >> >> >> >> >> > I would much appreciate any help you can provide in
>> >> >> solving
>> >> >> >> this
>> >> >> >> >> >> >> problem.
>> >> >> >> >> >> >> >>
>> >> >> >> >> >> >>
>> >> >> >> >> >>
>> >> >> >> >>
>> >> >> >>
>> >> >>
>> >> _______________________________________________
>> >> pgpool-general mailing list
>> >> pgpool-general at pgpool.net
>> >> http://www.pgpool.net/mailman/listinfo/pgpool-general
>> >>
>>

