[pgpool-general: 9494] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore

TV tvfan2014 at gmail.com
Mon Jun 2 19:39:50 JST 2025


Hi, I changed the pgpool.service file for startup as you said, and replaced
my ExecStart line with this:
ExecStart=strace -f -tt --trace=ipc /usr/sbin/pgpool -f
/etc/pgpool2/pgpool.conf $OPTS
I'm attaching the log from the replica again, but I'm not sure this actually
worked (as in, it doesn't look much different, to me at least).
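
One thing I'm not sure about (the pgpool2 unit name below is my assumption):
systemd only picks up an edited unit file after a daemon-reload, and strace
writes its trace to stderr, so the output may land in the journal rather
than in the pgpool log file. Roughly:

sudo systemctl daemon-reload
sudo systemctl restart pgpool2
sudo journalctl -u pgpool2 -f | grep -E 'sem(get|op|ctl)'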


On Mon, Jun 2, 2025 at 11:47 AM Tatsuo Ishii <ishii at postgresql.org> wrote:

> > Hi,
> > I tried the strace as you showed but it didn't really output anything:
> >
> > $ sudo strace -p 1662563 -f -tt --trace=ipc
> > strace: Process 1662563 attached
> >
> > ^Cstrace: Process 1662563 detached.   <--- this is me getting out of it,
> > since I saw in another window that the semaphores were gone already,
> > without anything being output here.  Unless something's wrong with the
> > syntax of the command, it didn't really output anything.  I changed the
> > log_line_prefix you suggested as well.
> >
> > Also, FYI - I tried downgrading to a lower version (4.3.7) that was
> > available in my package repository, and the same semaphore error
> > persists.
>
> Okay. I think you need to start pgpool from strace. I mean, modify
> pgpool.service from:
> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> to:
> ExecStart=strace -f -tt --trace=ipc /usr/sbin/pgpool -f
> /etc/pgpool2/pgpool.conf $OPTS
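>
> (If the trace is hard to pick out of the output, strace's -o option
> writes it to a file instead of stderr; the path here is just an example:
>
> ExecStart=strace -f -tt --trace=ipc -o /tmp/pgpool-ipc.trace /usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> )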
>
> > On Mon, Jun 2, 2025 at 2:40 AM Tatsuo Ishii <ishii at postgresql.org>
> wrote:
> >
> >> > Hi,
> >> > Not sure I can patch the source, as it was installed via apt-get and
> >> > not compiled from source.  If that is doable with a package-built
> >> > install and you can give me some instructions, I can certainly try.
> >>
> >> Unfortunately I'm not familiar with debian packaging.
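> >>
> >> (A rough sketch of the generic Debian source-rebuild flow - untested
> >> on my side, and the source package name pgpool2 is an assumption:
> >>
> >> sudo apt-get build-dep pgpool2
> >> apt-get source pgpool2
> >> cd pgpool2-*/        # apply the patch here, then rebuild:
> >> dpkg-buildpackage -us -uc -b
> >> )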
> >>
> >> > As for the pgpool log, I'm attaching it; hopefully it will go
> >> > through, as it's about 5 MB since I enabled higher debug options.
> >>
> >> It was sent as a direct mail to me. Not sure who/what removed the
> >> semaphore. I observe the following from the log anyway:
> >>
> >> - pgpool started at 16:10:21. At that time the semaphore definitely
> >> existed.
> >> - Until 16:15:13, everything worked great.
> >> - At 16:15:23, the problem started.
> >>
> >> So it seems the semaphore was removed by someone between 16:15:13 and
> >> 16:15:23.
> >>
> >> Unfortunately the removal of the semaphore was not recorded in the
> >> log. To determine who removed the semaphore, can you please attach
> >> strace to pgpool main and share the log? This time a debug log is not
> >> necessary.
> >>
> >> - Find the pgpool parent process pid (in the debug log it was 1483510)
> >> - Become root (attaching strace usually requires root privilege)
> >> - Attach strace to the process $pid
> >>
> >> strace -p $pid -f -tt --trace=ipc
> >>
> >> -f: follow the child processes of pgpool main
> >> -tt: add timestamps with microseconds
> >> --trace=ipc: trace only ipc system calls to keep the log small
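> >>
> >> For example, a sketch (how you find the pid depends on your setup;
> >> pgrep -o picks the oldest, i.e. the parent, pgpool process):
> >>
> >> pid=$(pgrep -o -x pgpool)
> >> strace -p "$pid" -f -tt --trace=ipc -o /tmp/pgpool-ipc.trace
> >>
> >> Note that strace only sees the pgpool process tree. If some other
> >> process removes the semaphore, a system-wide audit rule would catch
> >> it instead, assuming auditd is installed:
> >>
> >> auditctl -a always,exit -F arch=b64 -S semctl -k sem-rm
> >> ausearch -k sem-rm -i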
> >>
> >> BTW, I recommend you set log_line_prefix to something like this so
> >> that you get millisecond timestamps (%m) and the process application
> >> name (%a). They are available in 4.6.
> >>
> >> log_line_prefix = '%m: %a pid %p: '
> >>
> >> > On Sun, Jun 1, 2025 at 4:20 PM Tatsuo Ishii <ishii at postgresql.org>
> >> wrote:
> >> >
> >> >> > To your points/questions:
> >> >> > 1) version is 4.6.1 - the source I'm using (via apt-get) doesn't
> >> >> > have 4.6.2 yet
> >> >>
> >> >> Ok.
> >> >>
> >> >> > 2) ipcs -s shows... nothing, literally.  Checked all nodes, i.e.
> >> >> >
> >> >> > postgres@db-replica3:~$ ipcs -s
> >> >> >
> >> >> > ------ Semaphore Arrays --------
> >> >> > key        semid      owner      perms      nsems
> >> >> >
> >> >> > Also, it kinda seems like "failed to unlock semaphore" shows during
> >> >> > startup of pgpool and while it's running, but maybe I'm not reading
> >> >> > the log properly.
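> >> >> >
> >> >> > (To pin down exactly when they disappear, I could watch them with
> >> >> > something like:
> >> >> >
> >> >> > watch -n 1 ipcs -s
> >> >> > )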
> >> >>
> >> >> So pgpool lost the semaphore soon after starting. Can you share the
> >> >> full pgpool log from before "failed to unlock semaphore" appears?
> >> >>
> >> >> > 3) regarding heartbeats, my pgpool.conf has it roughly like this:
> >> >> >
> >> >> > heartbeat_hostname0 = 'db-primary-priv'
> >> >> > heartbeat_port0 = 9694
> >> >> > heartbeat_device0 = ''
> >> >> >
> >> >> > heartbeat_hostname1 = 'db-replica1-priv'
> >> >> > heartbeat_port1 = 9694
> >> >> > heartbeat_device1 = ''
> >> >> >
> >> >> > heartbeat_hostname2 = 'db-replica2-priv'
> >> >> > heartbeat_port2 = 9694
> >> >> > heartbeat_device2 = ''
> >> >> >
> >> >> > heartbeat_hostname3 = 'db-replica3-priv'
> >> >> > heartbeat_port3 = 9694
> >> >> > heartbeat_device3 = ''
> >> >>
> >> >> Looks normal.
> >> >>
> >> >> I have checked the pgpool source code and found suspicious code:
> >> >> before 4.6, the pgpool heartbeat receiver process binds to any
> >> >> address on the host, but from 4.6 on it binds only to the loopback
> >> >> address, which is plainly wrong. Is it possible for you to patch the
> >> >> pgpool source code and build a pgpool binary?
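> >> >>
> >> >> (You can check what the receiver actually bound to with something
> >> >> like this - assuming the heartbeat receive socket is UDP port 9694:
> >> >>
> >> >> sudo ss -ulpn | grep 9694
> >> >> )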
> >> >>
> >> >> > So it seems likely that the inability to lock/unlock semaphores is
> >> >> > the problem; the question is why, and how do I resolve it? :(
> >> >>
> >> >> Currently I have no idea. That's why I am requesting the pgpool log.
> >> >>
> >> >> > I'll c/p parts of my pgpool.service file as well, in case you see
> >> >> > something 'off' in it:
> >> >> >
> >> >> >
> >> >> > [Service]
> >> >> > User=postgres
> >> >> > Group=postgres
> >> >> >
> >> >> > EnvironmentFile=-/etc/default/pgpool2
> >> >> >
> >> >> > ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> >> >> > ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
> >> >> > ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
> >> >> > LimitNOFILE=65536:65536
> >> >> > LimitNPROC=16384:16384
> >> >> > LimitSTACK=10485760:33554432
> >> >> > LimitMEMLOCK=infinity
> >> >> > LimitDATA=infinity
> >> >> >
> >> >> >
> >> >> > Thanks in advance.
> >> >> >
> >> >> > On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <ishii at postgresql.org>
> >> >> wrote:
> >> >> >
> >> >> >> > Hi,
> >> >> >> > So long story short, I'm migrating a PG13 cluster from RHEL to
> >> >> >> > Ubuntu 24.04. In the process, the pgpool version is also being
> >> >> >> > upgraded to 4.6 (from 4.2).
> >> >> >>
> >> >> >> What is the exact version of 4.6? I am asking because 4.6.0 has a
> >> >> >> vulnerability.
> >> >> >>
> >> >> >>
> >> >> >> https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29
> >> >> >>
> >> >> >> Also 4.6.1 (and 4.6.0) have an issue with IPv6.
> >> >> >> https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html
> >> >> >>
> >> >> >> If possible, I recommend upgrading to 4.6.2.
> >> >> >>
> >> >> >> > There is also repmgr running on this 4 node cluster, 1 primary
> >> >> >> > and 3 replicas.
> >> >> >>
> >> >> >> I am not familiar with repmgr, so I will ignore this part.
> >> >> >>
> >> >> >> > All the migration seems to have completed successfully, but I've
> >> >> >> > run into a weird problem with pgpool.  I have quite a few errors
> >> >> >> > that look like the entries below, and they seem to "come and go"
> >> >> >> > when checking the pgpool log.  The process itself seems to work
> >> >> >> > (I think), but every now and then these errors (warnings?) show up:
> >> >> >> >
> >> >> >> > 2025-06-01 09:11:00: pid 628043: DEBUG:  watchdog heartbeat: send heartbeat signal to db-replica3-priv:9694
> >> >> >> > 2025-06-01 09:11:00: pid 628043: LOCATION:  wd_heartbeat.c:771
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  exit handler called (signal: 2)
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1272
> >> >> >> >
> >> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING:  failed to lock semaphore
> >> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:129
> >> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING:  failed to unlock semaphore
> >> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL:  Invalid argument
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pool_sema.c:202
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  shutting down by signal 2
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1324
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOG:  terminating all child processes
> >> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION:  pgpool_main.c:1334
> >> >> >> > 2025-06-01 09:11:00: pid 628047: WARNING:  failed to lock semaphore
> >> >> >> > 2025-06-01 09:11:00: pid 628047: DETAIL:  Invalid argument
> >> >> >> > 2025-06-01 09:11:00: pid 628047: LOCATION:  pool_sema.c:129
> >> >> >> > 2025-06-01 09:11:00: pid 628042: WARNING:  failed to lock semaphore
> >> >> >> > 2025-06-01 09:11:00: pid 628042: DETAIL:  Invalid argument
> >> >> >> > 2025-06-01 09:11:00: pid 628042: LOCATION:  pool_sema.c:129
> >> >> >> > 2025-06-01 09:11:00: pid 628039: WARNING:  failed to lock semaphore
> >> >> >> > 2025-06-01 09:11:00: pid 628039: DETAIL:  Invalid argument
> >> >> >>
> >> >> >> Can you share the result of "ipcs -s" so that I can confirm pgpool
> >> >> >> properly allocates the semaphore.
> >> >> >>
> >> >> >> By the way, do you see "WARNING: failed to unlock semaphore" only
> >> >> >> when pgpool is shutting down?
> >> >> >>
> >> >> >> > Also, earlier in the log, I see errors about being unable to
> >> >> >> > create a socket; for some reason it's trying (and failing) to
> >> >> >> > create a socket on localhost (127.0.0.1:9694) and "creating
> >> >> >> > socket on ::1:9694".  Are these "ok to ignore"?
> >> >> >>
> >> >> >> First of all, "creating socket on ::1:9694" looks unusual because
> >> >> >> it means that heartbeat_hostname is set to "::1", which is the
> >> >> >> IPv6 loopback address. heartbeat_hostname is expected to be set to
> >> >> >> an IPv4 or IPv6 inet address (not a loopback address).
> >> >> >>
> >> >> >> For now, I would like to wait for your reply regarding my
> >> >> >> questions above.
> >> >> >>
> >> >> >> Best regards,
> >> >> >> --
> >> >> >> Tatsuo Ishii
> >> >> >> SRA OSS K.K.
> >> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> >> Japanese: http://www.sraoss.co.jp
> >> >> >>
> >> >> >>
> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  creating socket on 127.0.0.1:9694
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:385
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  set SO_REUSEPORT option to the socket
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:891
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  creating watchdog heartbeat receive socket.
> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  creating socket on ::1:9694
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:385
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOG:  failed to create watchdog heartbeat receive socket. retrying...
> >> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL:  bind failed with reason: "Cannot assign requested address"
> >> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION:  wd_heartbeat.c:394
> >> >> >> >
> >> >> >> >
> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  verify_backend_node_status: true_primary 0
> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:  pgpool_main.c:2704
> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node status[0]: 1
> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node status[1]: 2
> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node status[2]: 2
> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  node status[3]: 2
> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:  pool_worker_child.c:267
> >> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:  pool_release_follow_primary_lock called
> >> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION:  pgpool_main.c:4106
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  set SO_REUSEPORT option to the socket
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:891
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  creating watchdog heartbeat receive socket.
> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  creating socket on 127.0.0.1:9694
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:385
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  set SO_REUSEPORT option to the socket
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:891
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  creating watchdog heartbeat receive socket.
> >> >> >> >
> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  creating socket on ::1:9694
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:385
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOG:  failed to create watchdog heartbeat receive socket. retrying...
> >> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL:  bind failed with reason: "Cannot assign requested address"
> >> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION:  wd_heartbeat.c:394
> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOG:  set SO_REUSEPORT option to the socket
> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOCATION:  wd_heartbeat.c:891
> >> >> >> > 2025-06-01 09:11:23: pid 634767: LOG:  creating socket for sending heartbeat
> >> >> >> > 2025-06-01 09:11:23: pid 634767: DETAIL:  set SO_REUSEPORT
> >> >> >> >
> >> >> >> >
> >> >> >> > Other odd errors I see in the logs later on are when pgpool
> >> >> >> > tries to run the escalation script - I can't quite tell where
> >> >> >> > it's failing to find the ipv4 address:
> >> >> >> >
> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II parent process received SIGUSR1
> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  Pgpool-II parent process received watchdog quorum change signal from watchdog
> >> >> >> > 2025-06-01 00:07:15: pid 396973: LOG:  watchdog: escalation started
> >> >> >> > 2025-06-01 00:07:15: pid 395786: LOG:  Setting failover command timeout to 135
> >> >> >> > 2025-06-01 00:07:15: pid 395780: LOG:  watchdog cluster now holds the quorum
> >> >> >> > 2025-06-01 00:07:15: pid 395780: DETAIL:  updating the state of quarantine backend nodes
> >> >> >> > Sun Jun  1 12:07:15 AM CEST 2025
> >> >> >> >
> >> >> >> > escalation.sh: - remove VIP IP from all nodes
> >> >> >> > Node: db-primary-priv
> >> >> >> > Error: ipv4: Address not found.
> >> >> >> > Node: db-replica1-priv
> >> >> >> > Error: ipv4: Address not found.
> >> >> >> > Node: db-replica2-priv
> >> >> >> > Error: ipv4: Address not found.
> >> >> >> > Node: db-replica3-priv
> >> >> >> > Error: ipv4: Address not found.
> >> >> >> > 2025-06-01 00:07:20: pid 396973: LOG:  watchdog escalation successful
> >> >> >> > 2025-06-01 00:07:23: pid 396973: LOG:  successfully acquired the delegate IP:"192.168.77.130"
> >> >> >> > 2025-06-01 00:07:23: pid 396973: DETAIL:  'if_up_cmd' returned with success
> >> >> >> > 2025-06-01 00:07:23: pid 395786: LOG:  watchdog escalation process with pid: 396973 exit with SUCCESS.
> >> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING:  failed to lock semaphore
> >> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid argument
> >> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING:  failed to unlock semaphore
> >> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL:  Invalid argument
> >> >> >> > 2025-06-01 00:07:28: pid 396924: WARNING:  failed to lock semaphore
> >> >> >> > 2025-06-01 00:07:28: pid 396924: DETAIL:  Invalid argument
> >> >> >> >
> >> >> >> > The way I started was I copied pgpool.conf from the "old" and
> >> >> >> > working pgpool 4.2 configuration and modified it slightly,
> >> >> >> > though nothing major: changed hostnames, and e.g. the
> >> >> >> > delegate_ip variable name was changed from delegate_IP.  I also
> >> >> >> > added this variable, as without it I didn't see a socket created
> >> >> >> > for port 9898:
> >> >> >> > unix_socket_directories = '/var/run/postgresql'
> >> >> >> > I had to change network interfaces as well; on RHEL it was using
> >> >> >> > eth0 whereas on Ubuntu it's ens18, so I changed lines like:
> >> >> >> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0
> >> >> >> > The other thing I had to change was the path to arping, as on
> >> >> >> > Ubuntu 24 it's in /usr/bin (not sbin):
> >> >> >> > arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'
> >> >> >> >
> >> >> >> > The other thing I'm confused about: I have two network
> >> >> >> > interfaces on each host, i.e.
> >> >> >> > 192.168.77.135/23 brd 192.168.77.255 scope global ens18
> >> >> >> > and
> >> >> >> > inet 192.168.78.135/24 brd 192.168.78.255 scope global ens19
> >> >> >> >
> >> >> >> > In pgpool.conf, it is referencing the "priv" hosts (i.e.
> >> >> >> > heartbeat_hostname2 = 'db-replica2-priv'), which theoretically
> >> >> >> > should be on the ens19 network interface as shown above, but in
> >> >> >> > my old pgpool.conf it was only referencing eth0 on RHEL, even
> >> >> >> > though the "priv" hosts were on eth1.  I'm wondering if that
> >> >> >> > needs to be changed on Ubuntu, or does it make no difference
> >> >> >> > here?
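> >> >> >> >
> >> >> >> > (For what it's worth, I can check which address a "priv" name
> >> >> >> > resolves to, and which interface carries it, with something
> >> >> >> > like:
> >> >> >> >
> >> >> >> > getent hosts db-replica2-priv
> >> >> >> > ip -4 addr show ens19
> >> >> >> > )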
> >> >> >> >
> >> >> >> > Right now, I can't quite figure out whether it is something
> >> >> >> > with pgpool.conf or with the Ubuntu settings, and the semaphore
> >> >> >> > error is quite cryptic.  If you need me to post some parts of my
> >> >> >> > pgpool.conf, please advise which.  I've seen an older version of
> >> >> >> > the FAQ that said repmgr and pgpool don't work together, but
> >> >> >> > then I saw mention that this was no longer the case.
> >> >> >> >
> >> >> >> > I would much appreciate any help you can provide in solving this
> >> >> >> > problem.
> >> >> >>
> >> >>
> >>
>