[pgpool-general: 9488] Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

Z S tvfan2014 at gmail.com
Sun Jun 1 17:50:37 JST 2025


Hi,
So, long story short: I'm migrating a PG13 cluster from RHEL to Ubuntu
24.04. In the process, the pgpool version is also being upgraded to 4.6
(from 4.2). There is also repmgr running on this 4-node cluster (1 primary
and 3 replicas). The migration itself seems to have completed successfully,
but I've run into a weird problem with pgpool. I have quite a few errors
that look like the entries below, and they seem to come and go in the
pgpool log. The process itself seems to work (I think), but every now and
then these errors (warnings?) show up:

2025-06-01 09:11:00: pid 628043: DEBUG:  watchdog heartbeat: send heartbeat
signal to db-replica3-priv:9694
2025-06-01 09:11:00: pid 628043: LOCATION:  wd_heartbeat.c:771
2025-06-01 09:11:00: pid 627958: LOG:  exit handler called (signal: 2)
2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1272


2025-06-01 09:11:00: pid 627958: WARNING:  failed to lock semaphore
2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:129
2025-06-01 09:11:00: pid 627958: WARNING:  failed to unlock semaphore
2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:202
2025-06-01 09:11:00: pid 627958: LOG:  shutting down by signal 2
2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1324
2025-06-01 09:11:00: pid 627958: LOG:  terminating all child processes
2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1334
2025-06-01 09:11:00: pid 628047: WARNING:  failed to lock semaphore
2025-06-01 09:11:00: pid 628047: DETAIL:  Invalid argument
2025-06-01 09:11:00: pid 628047: LOCATION:  pool_sema.c:129
2025-06-01 09:11:00: pid 628042: WARNING:  failed to lock semaphore
2025-06-01 09:11:00: pid 628042: DETAIL:  Invalid argument
2025-06-01 09:11:00: pid 628042: LOCATION:  pool_sema.c:129
2025-06-01 09:11:00: pid 628039: WARNING:  failed to lock semaphore
2025-06-01 09:11:00: pid 628039: DETAIL:  Invalid argument
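
In case it helps with diagnosing this, here is how I've been inspecting
the semaphores on the host while pgpool is running (I'm assuming pgpool
4.6 still uses System V semaphores, which the pool_sema.c references
suggest):

# list the SysV semaphore sets currently allocated on the host
ipcs -s

# show the owner and number of semaphores for one set, by its id
ipcs -s -i <semid>

Since the "Invalid argument" warnings appear right around the shutdown
lines above, my (possibly wrong) guess is that the semaphore set had
already been removed by the time the child processes tried to lock it.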


Also, earlier in the log, I see errors about being unable to create a
socket: for some reason it's trying (and failing) to create the heartbeat
receive socket on localhost, first "creating socket on 127.0.0.1:9694" and
then "creating socket on ::1:9694". Are these OK to ignore?

2025-06-01 09:11:23: pid 634772: DETAIL:  creating socket on 127.0.0.1:9694
2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:385
2025-06-01 09:11:23: pid 634772: LOG:  set SO_REUSEPORT option to the socket
2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:891
2025-06-01 09:11:23: pid 634772: LOG:  creating watchdog heartbeat receive
socket.
2025-06-01 09:11:23: pid 634772: DETAIL:  creating socket on ::1:9694
2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:385
2025-06-01 09:11:23: pid 634772: LOG:  failed to create watchdog heartbeat
receive socket. retrying...
2025-06-01 09:11:23: pid 634772: DETAIL:  bind failed with reason: "Cannot
assign requested address"
2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:394
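
For reference, this is how I've been checking what the heartbeat hostnames
actually resolve to on each node (on the assumption that a hostname
resolving to a loopback entry in /etc/hosts would explain why pgpool tries
to bind the heartbeat socket on 127.0.0.1 and ::1):

# show every address (IPv4 and IPv6) the name resolves to
getent ahosts db-replica3-priv

# look for loopback aliases Ubuntu may have added for the host's own name
grep -E '^(127\.|::1)' /etc/hosts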


2025-06-01 09:11:22: pid 635274: DEBUG:  verify_backend_node_status:
true_primary 0
2025-06-01 09:11:22: pid 635274: LOCATION:  pgpool_main.c:2704
2025-06-01 09:11:22: pid 635274: DEBUG:  node status[0]: 1
2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
2025-06-01 09:11:22: pid 635274: DEBUG:  node status[1]: 2
2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
2025-06-01 09:11:22: pid 635274: DEBUG:  node status[2]: 2
2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
2025-06-01 09:11:22: pid 635274: DEBUG:  node status[3]: 2
2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
2025-06-01 09:11:22: pid 635274: DEBUG:  pool_release_follow_primary_lock
called
2025-06-01 09:11:22: pid 635274: LOCATION:  pgpool_main.c:4106
2025-06-01 09:11:23: pid 634765: LOG:  set SO_REUSEPORT option to the socket
2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:891
2025-06-01 09:11:23: pid 634765: LOG:  creating watchdog heartbeat receive
socket.
2025-06-01 09:11:23: pid 634765: DETAIL:  creating socket on 127.0.0.1:9694
2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:385
2025-06-01 09:11:23: pid 634765: LOG:  set SO_REUSEPORT option to the socket
2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:891
2025-06-01 09:11:23: pid 634765: LOG:  creating watchdog heartbeat receive
socket.



2025-06-01 09:11:23: pid 634765: DETAIL:  creating socket on ::1:9694
2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:385
2025-06-01 09:11:23: pid 634765: LOG:  failed to create watchdog heartbeat
receive socket. retrying...
2025-06-01 09:11:23: pid 634765: DETAIL:  bind failed with reason: "Cannot
assign requested address"
2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:394
2025-06-01 09:11:23: pid 634767: LOG:  set SO_REUSEPORT option to the socket
2025-06-01 09:11:23: pid 634767: LOCATION:  wd_heartbeat.c:891
2025-06-01 09:11:23: pid 634767: LOG:  creating socket for sending heartbeat
2025-06-01 09:11:23: pid 634767: DETAIL:  set SO_REUSEPORT


Other odd errors show up later in the logs when pgpool runs the escalation
script; I can't quite figure out where it is failing to find an IPv4
address:

2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II parent process received
SIGUSR1
2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II parent process received
watchdog quorum change signal from watchdog
2025-06-01 00:07:15: pid 396973: LOG:  watchdog: escalation started
2025-06-01 00:07:15: pid 395786: LOG:  Setting failover command timeout to
135
2025-06-01 00:07:15: pid 395780: LOG:  watchdog cluster now holds the quorum
2025-06-01 00:07:15: pid 395780: DETAIL:  updating the state of quarantine
backend nodes
Sun Jun  1 12:07:15 AM CEST 2025
escalation.sh: - remove VIP IP from all nodes
Node: db-primary-priv
Error: ipv4: Address not found.
Node: db-replica1-priv
Error: ipv4: Address not found.
Node: db-replica2-priv
Error: ipv4: Address not found.
Node: db-replica3-priv
Error: ipv4: Address not found.
2025-06-01 00:07:20: pid 396973: LOG:  watchdog escalation successful
2025-06-01 00:07:23: pid 396973: LOG:  successfully acquired the delegate
IP:"192.168.77.130"
2025-06-01 00:07:23: pid 396973: DETAIL:  'if_up_cmd' returned with success
2025-06-01 00:07:23: pid 395786: LOG:  watchdog escalation process with
pid: 396973 exit with SUCCESS.
2025-06-01 00:07:27: pid 395962: WARNING:  failed to lock semaphore
2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid argument
2025-06-01 00:07:27: pid 395962: WARNING:  failed to unlock semaphore
2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid argument
2025-06-01 00:07:28: pid 396924: WARNING:  failed to lock semaphore
2025-06-01 00:07:28: pid 396924: DETAIL:  Invalid argument
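
If it helps, this is the kind of thing I can run by hand to reproduce the
VIP removal step (192.168.77.130 is my delegate IP and ens18 my interface;
the exact invocation inside my escalation.sh may differ slightly):

# check whether the delegate IP is currently assigned on this node
ip addr show dev ens18 | grep 192.168.77.130

# remove it the way the escalation script would
/usr/bin/sudo /sbin/ip addr del 192.168.77.130/23 dev ens18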

The way I started was to copy pgpool.conf from the "old" and working
pgpool 4.2 configuration and modify it slightly, though nothing major: I
changed hostnames, and e.g. the delegate_IP variable name was changed to
delegate_ip. I also added this variable, as without it I didn't see a
socket created for port 9898:
unix_socket_directories = '/var/run/postgresql'
I had to change the network interfaces as well; on RHEL it was using eth0,
whereas on Ubuntu it's ens18, so I changed lines like:
/usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0
The other thing I had to change was the path to arping, as on Ubuntu 24
it's in /usr/bin (not /usr/sbin):
arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'
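
For completeness, the VIP-related commands in my pgpool.conf now look
roughly like this (I'm showing if_down_cmd as the mirror of if_up_cmd; my
actual line may differ slightly):

if_up_cmd   = '/usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0'
if_down_cmd = '/usr/bin/sudo /sbin/ip addr del $_IP_$/23 dev ens18'
arping_cmd  = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'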

The other thing I'm confused about: I have two network interfaces on each
host, e.g.
inet 192.168.77.135/23 brd 192.168.77.255 scope global ens18
and
inet 192.168.78.135/24 brd 192.168.78.255 scope global ens19

pgpool.conf references the "priv" hosts (e.g.
heartbeat_hostname2 = 'db-replica2-priv'), which in theory should be on
the ens19 interface as shown above, but my old pgpool.conf only referenced
eth0 on RHEL, even though the "priv" hosts were on eth1. I'm wondering
whether that needs to be changed on Ubuntu, or whether it makes no
difference here?
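
If the heartbeat should be pinned to the private interface, I assume I
could do it with the heartbeat_device parameters, something like the
following (guessing at the values for my node list; these fields are
currently empty in my config):

heartbeat_hostname0 = 'db-primary-priv'
heartbeat_device0 = 'ens19'
heartbeat_hostname1 = 'db-replica1-priv'
heartbeat_device1 = 'ens19'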

Right now I can't quite figure out whether this is something in
pgpool.conf or in the Ubuntu settings, and the semaphore error is quite
cryptic. If you need me to post some parts of my pgpool.conf, please
advise which. I've seen an older version of the FAQ that said repmgr and
pgpool don't work together, but then I saw a mention that this is no
longer the case.

I would much appreciate any help you can provide in solving this problem.