[pgpool-general: 9492] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore
TV
tvfan2014 at gmail.com
Mon Jun 2 17:19:49 JST 2025
Hi,
I tried strace as you showed, but it didn't really output anything:
$ sudo strace -p 1662563 -f -tt --trace=ipc
strace: Process 1662563 attached
^Cstrace: Process 1662563 detached. <--- this is me getting out of it,
since I saw in another window that the semaphores were already gone, with
nothing printed here. Unless something is wrong with the syntax of the
command, there was no output at all. I also changed log_line_prefix as you
suggested.
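For the next attempt, something along these lines might work better, so the
output goes to a file and I also catch the exact moment the semaphore array
disappears (just a rough sketch, started after pgpool is up; it assumes the
semaphore shows up in ipcs as owned by postgres):
sudo strace -p <pgpool-parent-pid> -f -tt --trace=ipc -o /tmp/pgpool-ipc.log &
while ipcs -s | grep -q postgres; do sleep 1; done; date   # prints when it vanishes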
Also, FYI - I tried downgrading to a lower version (4.3.7) that was available
in my package repository, and the same semaphore error persists.
On Mon, Jun 2, 2025 at 2:40 AM Tatsuo Ishii <ishii at postgresql.org> wrote:
> > Hi,
> > Not sure I can patch the source, as it was installed via apt-get and not
> > compiled from source. If that is doable for a package-built one and you can
> > give me some instructions, I can certainly try.
>
> Unfortunately I'm not familiar with Debian packaging.
>
> > As for the pgpool log, I'm attaching it - hopefully it will go through;
> > it's about 5 MB as I enabled higher debug options.
>
> It was sent as a direct mail to me. Not sure who/what removed the
> semaphore. Anyway, I observe the following from the log:
>
> - pgpool started on 16:10:21. At that time the semaphore definitely
> existed.
> - Until 16:15:13, everything worked great.
> - On 16:15:23, the problem started.
>
> So it seems the semaphore was removed by someone between 16:15:13 and
> 16:15:23.
>
> Unfortunately the removal of the semaphore was not recorded in the log. To
> determine who removed it, can you please attach strace to the pgpool main
> process and share the output? This time the debug log is not necessary.
>
> - Find the pgpool parent process pid (in the debug log it was 1483510)
> - Become root (attaching strace usually requires root privilege)
> - Attach strace to the process $pid
>
> strace -p $pid -f -tt --trace=ipc
>
> -f: follow the child processes of pgpool main
> -tt: print timestamps with microsecond precision
> --trace=ipc: limit tracing to IPC-related system calls to reduce the log size
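>
> For reference, a rough end-to-end example (the pid file path below is only a
> guess; it depends on pid_file_name in your pgpool.conf):
>
> pid=$(pgrep -o -x pgpool)              # oldest pgpool process = the parent
> # or: pid=$(cat /var/run/pgpool/pgpool.pid)
> sudo strace -p "$pid" -f -tt --trace=ipc -o /tmp/pgpool-ipc.strace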
>
> BTW, I recommend you set log_line_prefix to something like this so
> that the log includes millisecond timestamps (%m) and the process
> application name (%a). Both are available in 4.6.
>
> log_line_prefix = '%m: %a pid %p: '
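>
> With that prefix, each line should come out roughly like this (made-up
> values, just to illustrate the format):
>
> 2025-06-02 17:19:49.123: psql pid 12345: LOG: ...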
>
> > On Sun, Jun 1, 2025 at 4:20 PM Tatsuo Ishii <ishii at postgresql.org> wrote:
> >
> >> > To your points/questions:
> >> > 1) version is 4.6.1 - the source I'm using (via apt-get) doesn't have
> >> > 4.6.2 yet
> >>
> >> Ok.
> >>
> >> > 2) ipcs -s shows... nothing, literally. Checked all nodes, i.e.
> >> >
> >> > postgres at db-replica3:~$ ipcs -s
> >> >
> >> > ------ Semaphore Arrays --------
> >> > key semid owner perms nsems
> >> >
> >> > Also, it kinda seems like "failed to unlock semaphore" shows up during
> >> > startup of pgpool and while it's running, but maybe I'm not reading the
> >> > log properly.
> >>
> >> So pgpool lost the semaphore soon after starting. Can you share the full
> >> pgpool log from before "failed to unlock semaphore" appears?
> >>
> >> > 3) regarding heartbeats, my pgpool.conf has it roughly like this:
> >> >
> >> > heartbeat_hostname0 = 'db-primary-priv'
> >> > heartbeat_port0 = 9694
> >> > heartbeat_device0 = ''
> >> >
> >> > heartbeat_hostname1 = 'db-replica1-priv'
> >> > heartbeat_port1 = 9694
> >> > heartbeat_device1 = ''
> >> >
> >> > heartbeat_hostname2 = 'db-replica2-priv'
> >> > heartbeat_port2 = 9694
> >> > heartbeat_device2 = ''
> >> >
> >> > heartbeat_hostname3 = 'db-replica3-priv'
> >> > heartbeat_port3 = 9694
> >> > heartbeat_device3 = ''
> >>
> >> Looks normal.
> >>
> >> I have checked the pgpool source code and found suspicious code. Before
> >> 4.6, the pgpool heartbeat receiver process binds to any address on the
> >> host. However, from 4.6 pgpool binds only to the loopback address, which
> >> is plain wrong. Is it possible for you to patch the pgpool source code
> >> and build a pgpool binary?
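> >>
> >> If you want to double-check which local address the heartbeat receive
> >> socket actually ends up bound to while pgpool is running, something like
> >> this should show it (assuming ss is installed):
> >>
> >> sudo ss -ulpn | grep 9694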
> >>
> >> > So it seems likely that the inability to lock/unlock semaphores is the
> >> > problem; the question is why, and how do I resolve it? :(
> >>
> >> Currently I have no idea. That's why I am requesting the pgpool log.
> >>
> >> > I'll copy/paste parts of my pgpool.service file as well, in case you
> >> > see something 'off' in it:
> >> >
> >> >
> >> > [Service]
> >> > User=postgres
> >> > Group=postgres
> >> >
> >> > EnvironmentFile=-/etc/default/pgpool2
> >> >
> >> > ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> >> > ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
> >> > ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
> >> > LimitNOFILE=65536:65536
> >> > LimitNPROC=16384:16384
> >> > LimitSTACK=10485760:33554432
> >> > LimitMEMLOCK=infinity
> >> > LimitDATA=infinity
> >> >
> >> >
> >> > Thanks in advance.
> >> >
> >> > On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <ishii at postgresql.org> wrote:
> >> >
> >> >> > Hi,
> >> >> > So long story short, I'm migrating a PG13 cluster from RHEL to Ubuntu
> >> >> > 24.04. In the process, the pgpool version is also being upgraded to
> >> >> > 4.6 (from 4.2).
> >> >>
> >> >> What is the exact version of 4.6? I am asking because 4.6.0 has a
> >> >> vulnerability.
> >> >>
> >> >>
> >> >> https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29
> >> >>
> >> >> Also 4.6.1 (and 4.6.0) have an issue with IPv6.
> >> >> https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html
> >> >>
> >> >> If possible, I recommend upgrading to 4.6.2.
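> >> >>
> >> >> To see which versions your configured repository offers, something like
> >> >> this should work (the package name is typically pgpool2 on
> >> >> Debian/Ubuntu, but may differ):
> >> >>
> >> >> apt list -a pgpool2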
> >> >>
> >> >> > There is also repmgr running on this 4 node cluster, 1 primary
> >> >> > and 3 replicas.
> >> >>
> >> >> I am not familiar with repmgr, so I will ignore this part.
> >> >>
> >> >> > All the migration seems to have completed successfully, but I've run
> >> >> > into a weird problem with pgpool. I have quite a few errors that look
> >> >> > like the entries below, and they seem to come and go when checking the
> >> >> > pgpool log. The process itself seems to work (I think), but every now
> >> >> > and then, these errors (warnings?) show up:
> >> >> >
> >> >> > 2025-06-01 09:11:00: pid 628043: DEBUG: watchdog heartbeat: send heartbeat signal to db-replica3-priv:9694
> >> >> > 2025-06-01 09:11:00: pid 628043: LOCATION: wd_heartbeat.c:771
> >> >> > 2025-06-01 09:11:00: pid 627958: LOG: exit handler called (signal: 2)
> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1272
> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to lock semaphore
> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument
> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:129
> >> >> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to unlock semaphore
> >> >> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument
> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:202
> >> >> > 2025-06-01 09:11:00: pid 627958: LOG: shutting down by signal 2
> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1324
> >> >> > 2025-06-01 09:11:00: pid 627958: LOG: terminating all child processes
> >> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1334
> >> >> > 2025-06-01 09:11:00: pid 628047: WARNING: failed to lock semaphore
> >> >> > 2025-06-01 09:11:00: pid 628047: DETAIL: Invalid argument
> >> >> > 2025-06-01 09:11:00: pid 628047: LOCATION: pool_sema.c:129
> >> >> > 2025-06-01 09:11:00: pid 628042: WARNING: failed to lock semaphore
> >> >> > 2025-06-01 09:11:00: pid 628042: DETAIL: Invalid argument
> >> >> > 2025-06-01 09:11:00: pid 628042: LOCATION: pool_sema.c:129
> >> >> > 2025-06-01 09:11:00: pid 628039: WARNING: failed to lock semaphore
> >> >> > 2025-06-01 09:11:00: pid 628039: DETAIL: Invalid argument
> >> >>
> >> >> Can you share the result of "ipcs -s" so that I can confirm pgpool
> >> >> properly allocates the semaphore.
> >> >>
> >> >> By the way do you see "WARNING: failed to unlock semaphore" only when
> >> >> pgpool is shutting down?
> >> >>
> >> >> > Also, earlier in the log, I see errors about being unable to create a
> >> >> > socket; for some reason it's trying (and failing) to create a socket on
> >> >> > localhost (127.0.0.1:9694) and "creating socket on ::1:9694". Are these
> >> >> > "ok to ignore"?
> >> >>
> >> >> First of all, "creating socket on ::1:9694" looks unusual because it
> >> >> means that heartbeat_hostname is set to "::1", which is the IPv6
> >> >> loopback address. heartbeat_hostname is expected to be set to an IPv4
> >> >> or IPv6 inet address (not a loopback address).
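> >> >>
> >> >> Just to rule out name resolution, it may also be worth checking what
> >> >> the -priv names resolve to on each node, e.g.:
> >> >>
> >> >> getent ahosts db-replica3-priv
> >> >>
> >> >> They should resolve to the real inet addresses, not 127.0.0.1 or ::1.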
> >> >>
> >> >> For now, I would like to wait for your reply regarding my questions
> >> >> above.
> >> >>
> >> >> Best regards,
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS K.K.
> >> >> English: http://www.sraoss.co.jp/index_en/
> >> >> Japanese: http://www.sraoss.co.jp
> >> >>
> >> >>
> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on 127.0.0.1:9694
> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385
> >> >> > 2025-06-01 09:11:23: pid 634772: LOG: set SO_REUSEPORT option to the socket
> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:891
> >> >> > 2025-06-01 09:11:23: pid 634772: LOG: creating watchdog heartbeat receive socket.
> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on ::1:9694
> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385
> >> >> > 2025-06-01 09:11:23: pid 634772: LOG: failed to create watchdog heartbeat receive socket. retrying...
> >> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: bind failed with reason: "Cannot assign requested address"
> >> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:394
> >> >> >
> >> >> >
> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: verify_backend_node_status: true_primary 0
> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:2704
> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[0]: 1
> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[1]: 2
> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[2]: 2
> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[3]: 2
> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
> >> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: pool_release_follow_primary_lock called
> >> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:4106
> >> >> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the socket
> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891
> >> >> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat receive socket.
> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on 127.0.0.1:9694
> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385
> >> >> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the socket
> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891
> >> >> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat receive socket.
> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on ::1:9694
> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385
> >> >> > 2025-06-01 09:11:23: pid 634765: LOG: failed to create watchdog heartbeat receive socket. retrying...
> >> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: bind failed with reason: "Cannot assign requested address"
> >> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:394
> >> >> > 2025-06-01 09:11:23: pid 634767: LOG: set SO_REUSEPORT option to the socket
> >> >> > 2025-06-01 09:11:23: pid 634767: LOCATION: wd_heartbeat.c:891
> >> >> > 2025-06-01 09:11:23: pid 634767: LOG: creating socket for sending heartbeat
> >> >> > 2025-06-01 09:11:23: pid 634767: DETAIL: set SO_REUSEPORT
> >> >> >
> >> >> >
> >> >> > Other odd errors I see in the logs later on are when pgpool tries to
> >> >> > run the escalation script - I can't quite figure out where it's
> >> >> > failing to find an ipv4 address (a quick manual check is sketched
> >> >> > after the excerpt below):
> >> >> >
> >> >> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process received SIGUSR1
> >> >> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process received watchdog quorum change signal from watchdog
> >> >> > 2025-06-01 00:07:15: pid 396973: LOG: watchdog: escalation started
> >> >> > 2025-06-01 00:07:15: pid 395786: LOG: Setting failover command timeout to 135
> >> >> > 2025-06-01 00:07:15: pid 395780: LOG: watchdog cluster now holds the quorum
> >> >> > 2025-06-01 00:07:15: pid 395780: DETAIL: updating the state of quarantine backend nodes
> >> >> > Sun Jun 1 12:07:15 AM CEST 2025
> >> >> > escalation.sh: - remove VIP IP from all nodes
> >> >> > Node: db-primary-priv
> >> >> > Error: ipv4: Address not found.
> >> >> > Node: db-replica1-priv
> >> >> > Error: ipv4: Address not found.
> >> >> > Node: db-replica2-priv
> >> >> > Error: ipv4: Address not found.
> >> >> > Node: db-replica3-priv
> >> >> > Error: ipv4: Address not found.
> >> >> > 2025-06-01 00:07:20: pid 396973: LOG: watchdog escalation successful
> >> >> > 2025-06-01 00:07:23: pid 396973: LOG: successfully acquired the delegate IP:"192.168.77.130"
> >> >> > 2025-06-01 00:07:23: pid 396973: DETAIL: 'if_up_cmd' returned with success
> >> >> > 2025-06-01 00:07:23: pid 395786: LOG: watchdog escalation process with pid: 396973 exit with SUCCESS.
> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to lock semaphore
> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument
> >> >> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to unlock semaphore
> >> >> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument
> >> >> > 2025-06-01 00:07:28: pid 396924: WARNING: failed to lock semaphore
> >> >> > 2025-06-01 00:07:28: pid 396924: DETAIL: Invalid argument
> >> >> >
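> >> >> > For what it's worth, a quick way to check by hand whether the VIP is
> >> >> > actually present on a node (using the delegate IP 192.168.77.130 from
> >> >> > the log above; just a sketch):
> >> >> >
> >> >> > ip -4 addr show dev ens18 | grep 192.168.77.130
> >> >> >
> >> >> > The "Error: ipv4: Address not found." text looks like what recent
> >> >> > iproute2 prints when asked to delete an address that isn't configured,
> >> >> > so those lines may simply mean the VIP wasn't present on those nodes.
> >> >> >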
> >> >> > The way I started was I copied pgpool.conf from the "old" and working
> >> >> > pgpool 4.2 configuration and modified it slightly - nothing major: I
> >> >> > changed hostnames, and e.g. the delegate_ip variable name was changed
> >> >> > from delegate_IP. I also added this variable, as without it I didn't
> >> >> > see a socket created for port 9898:
> >> >> > unix_socket_directories = '/var/run/postgresql'
> >> >> > I had to change network interfaces as well: on RHEL it was using eth0,
> >> >> > whereas on Ubuntu it's ens18, so I changed lines like:
> >> >> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0
> >> >> > The other thing I had to change was the path to arping, as on Ubuntu
> >> >> > 24 it's in /usr/bin (not sbin):
> >> >> > arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'
> >> >> >
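> >> >> > For reference, the if_up_cmd / arping commands can also be exercised
> >> >> > by hand as the postgres user, roughly like this (192.168.77.130 is the
> >> >> > delegate IP from the log above; just a sketch to confirm the sudo
> >> >> > permissions and interface name):
> >> >> >
> >> >> > sudo -u postgres sudo /sbin/ip addr add 192.168.77.130/23 dev ens18 label ens18:0
> >> >> > sudo -u postgres sudo /usr/bin/arping -U 192.168.77.130 -w 1 -I ens18
> >> >> > sudo -u postgres sudo /sbin/ip addr del 192.168.77.130/23 dev ens18
> >> >> >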
> >> >> > The other thing I'm confused about: I have two network interfaces on
> >> >> > each host, i.e.
> >> >> > 192.168.77.135/23 brd 192.168.77.255 scope global ens18
> >> >> > and
> >> >> > inet 192.168.78.135/24 brd 192.168.78.255 scope global ens19
> >> >> >
> >> >> > In pgpool.conf, it is referencing the "priv" hosts (i.e.
> >> >> > heartbeat_hostname2 = 'db-replica2-priv'), which theoretically should
> >> >> > be on the ens19 network interface as shown above, but my old
> >> >> > pgpool.conf was only referencing eth0 on RHEL, even though the "priv"
> >> >> > hosts were on eth1. I'm wondering whether that needs to be changed on
> >> >> > Ubuntu, or it makes no difference here?
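> >> >> >
> >> >> > One way to check which interface the heartbeat traffic would actually
> >> >> > use, assuming the -priv names resolve to the 192.168.78.x addresses
> >> >> > (the address below is just an example for a peer node):
> >> >> >
> >> >> > getent ahostsv4 db-replica1-priv
> >> >> > ip route get 192.168.78.136
> >> >> >
> >> >> > The second command should report "dev ens19" if that is the interface
> >> >> > the kernel would use to reach it.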
> >> >> >
> >> >> > Right now, I can't quite figure out whether it is something with
> >> >> > pgpool.conf or with the Ubuntu settings, and the semaphore error is
> >> >> > quite cryptic. If you need me to post some parts of my pgpool.conf,
> >> >> > please advise which. I've seen an older version of the FAQ that said
> >> >> > repmgr and pgpool don't work together, but then I saw a mention that
> >> >> > this was no longer the case.
> >> >> >
> >> >> > I would much appreciate any help you can provide in solving this
> >> >> > problem.
> >> >>
> >>
>