[pgpool-general: 9491] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore
Tatsuo Ishii
ishii at postgresql.org
Mon Jun 2 09:40:10 JST 2025
> Hi,
> Not sure I can patch the source, as it was installed via apt-get and not
> compiled from source. If that is doable on a package-built install and you
> can give me some instructions, I can certainly try.
Unfortunately I'm not familiar with Debian packaging.
> As for the pgpool log, I'm attaching it; hopefully it will go through. It's
> about 5 MB as I enabled higher debug options.
It was sent as a direct mail to me. I am not sure who or what removed
the semaphore, but I observe the following from the log:
- pgpool started at 16:10:21. At that time the semaphore definitely existed.
- Until 16:15:13, everything worked fine.
- At 16:15:23, the problem started.
So it seems the semaphore was removed by someone between 16:15:13 and
16:15:23.
Unfortunately the removal of the semaphore was not recorded in the log.
To determine who removed it, can you please attach strace to the pgpool
main process and share the output? A debug log is not necessary this
time.
- Find the pgpool parent process pid (in the debug log it was 1483510)
- Become root (attaching strace usually requires root privileges)
- Attach strace to the process $pid:
strace -p $pid -f -tt --trace=ipc
-f: follow the child processes of pgpool main
-tt: add timestamps with microsecond precision
--trace=ipc: limit the traced system calls to the ipc class to reduce the log size
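Putting it together, something like this should work (a sketch; the
pgrep trick for finding the parent pid is an assumption, you can also
take the pid from the pgpool log or pid file):

pid=$(pgrep -o -x pgpool)   # -o picks the oldest pgpool process, i.e. the parent
sudo strace -p "$pid" -f -tt --trace=ipc -o /tmp/pgpool-ipc.log

The -o option writes the trace to a file that you can attach to your
reply.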
BTW, I recommend you set log_line_prefix to something like this so
that you get timestamps with milliseconds (%m) and the process
application name (%a) in the log. They are usable in 4.6.
log_line_prefix = '%m: %a pid %p: '
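With that prefix, log lines would look roughly like this (the values
are illustrative):

2025-06-01 09:11:00.123: psql pid 628043: LOG: ...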
> On Sun, Jun 1, 2025 at 4:20 PM Tatsuo Ishii <ishii at postgresql.org> wrote:
>
>> > To your points/questions:
>> > 1) version is 4.6.1 - the source I'm using (via apt-get) doesn't have
>> > 4.6.2 yet
>>
>> Ok.
>>
>> > 2) ipcs -s shows... nothing, literally. Checked all nodes, i.e.
>> >
>> > postgres@db-replica3:~$ ipcs -s
>> >
>> > ------ Semaphore Arrays --------
>> > key semid owner perms nsems
>> >
>> > Also, it kinda seems like "failed to unlock semaphore" shows during
>> > startup of pgpool and while it's running, but maybe I'm not reading
>> > the log properly.
>>
>> So pgpool lost the semaphore soon after starting. Can you share the
>> full pgpool log from before "failed to unlock semaphore" appears?
>>
>> > 3) regarding heartbeats, my pgpool.conf has it roughly like this:
>> >
>> > heartbeat_hostname0 = 'db-primary-priv'
>> > heartbeat_port0 = 9694
>> > heartbeat_device0 = ''
>> >
>> > heartbeat_hostname1 = 'db-replica1-priv'
>> > heartbeat_port1 = 9694
>> > heartbeat_device1 = ''
>> >
>> > heartbeat_hostname2 = 'db-replica2-priv'
>> > heartbeat_port2 = 9694
>> > heartbeat_device2 = ''
>> >
>> > heartbeat_hostname3 = 'db-replica3-priv'
>> > heartbeat_port3 = 9694
>> > heartbeat_device3 = ''
>>
>> Looks normal.
>>
>> I have checked the pgpool source code and found suspicious code.
>> Before 4.6, the pgpool heartbeat receiver process binds to any
>> address on the host. However, from 4.6 pgpool binds only to the
>> loopback address, which is plain wrong. Is it possible for you to
>> patch the pgpool source code and build a pgpool binary?
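>>
>> You can confirm which address the heartbeat receive socket is bound
>> to with something like this (a sketch; ss is from iproute2, and root
>> is needed to see the owning process):
>>
>> sudo ss -ulpn 'sport = :9694'
>>
>> If the local address column shows 127.0.0.1 or ::1 rather than
>> 0.0.0.0 or a host address, the receiver is hit by this issue.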
>>
>> > So it seems likely that the inability to lock/unlock semaphores is the
>> > problem; the question is why, and how do I resolve it? :(
>>
>> Currently I have no idea. That's why I am requesting the pgpool log.
>>
>> > I'll c/p parts of my pgpool.service file as well, in case you see
>> > something 'off' in it.
>> >
>> >
>> > [Service]
>> > User=postgres
>> > Group=postgres
>> >
>> > EnvironmentFile=-/etc/default/pgpool2
>> >
>> > ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
>> > ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
>> > ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
>> > LimitNOFILE=65536:65536
>> > LimitNPROC=16384:16384
>> > LimitSTACK=10485760:33554432
>> > LimitMEMLOCK=infinity
>> > LimitDATA=infinity
>> >
>> >
>> > Thanks in advance.
>> >
>> > On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <ishii at postgresql.org>
>> wrote:
>> >
>> >> > Hi,
>> >> > So long story short, I'm migrating a PG13 cluster from RHEL to Ubuntu
>> >> > 24.04. In the process, the pgpool version is also being upgraded to
>> >> > 4.6 (from 4.2).
>> >>
>> >> What is the exact version of 4.6? I am asking because 4.6.0 has a
>> >> vulnerability.
>> >>
>> >>
>> >> https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29
>> >>
>> >> Also 4.6.1 (and 4.6.0) have an issue with IPv6.
>> >> https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html
>> >>
>> >> If possible, I recommend upgrading to 4.6.2.
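>> >> Once your apt source ships it, the upgrade should be something like
>> >> this (the package name pgpool2 is an assumption based on the Debian
>> >> packaging):
>> >>
>> >> sudo apt update && sudo apt install --only-upgrade pgpool2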
>> >>
>> >> > There is also repmgr running on this 4 node cluster, 1 primary
>> >> > and 3 replicas.
>> >>
>> >> I am not familiar with repmgr, so I will ignore this part.
>> >>
>> >> > All the migration seems to have completed successfully,
>> >> > but I've run into a weird problem with pgpool. I have quite a few
>> >> > errors that look like the entries below, and they seem to "come and
>> >> > go" when checking the pgpool log. The process itself seems to work
>> >> > (I think), but every now and then, these errors (warnings?) show up:
>> >> >
>> >> > 2025-06-01 09:11:00: pid 628043: DEBUG: watchdog heartbeat: send
>> >> heartbeat
>> >> > signal to db-replica3-priv:9694
>> >> > 2025-06-01 09:11:00: pid 628043: LOCATION: wd_heartbeat.c:771
>> >> > 2025-06-01 09:11:00: pid 627958: LOG: exit handler called (signal: 2)
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1272
>> >> >
>> >> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to lock semaphore
>> >> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:129
>> >> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to unlock semaphore
>> >> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:202
>> >> > 2025-06-01 09:11:00: pid 627958: LOG: shutting down by signal 2
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1324
>> >> > 2025-06-01 09:11:00: pid 627958: LOG: terminating all child processes
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1334
>> >> > 2025-06-01 09:11:00: pid 628047: WARNING: failed to lock semaphore
>> >> > 2025-06-01 09:11:00: pid 628047: DETAIL: Invalid argument
>> >> > 2025-06-01 09:11:00: pid 628047: LOCATION: pool_sema.c:129
>> >> > 2025-06-01 09:11:00: pid 628042: WARNING: failed to lock semaphore
>> >> > 2025-06-01 09:11:00: pid 628042: DETAIL: Invalid argument
>> >> > 2025-06-01 09:11:00: pid 628042: LOCATION: pool_sema.c:129
>> >> > 2025-06-01 09:11:00: pid 628039: WARNING: failed to lock semaphore
>> >> > 2025-06-01 09:11:00: pid 628039: DETAIL: Invalid argument
>> >>
>> >> Can you share the result of "ipcs -s" so that I can confirm that
>> >> pgpool properly allocates the semaphore?
>> >>
>> >> By the way, do you see "WARNING: failed to unlock semaphore" only
>> >> when pgpool is shutting down?
>> >>
>> >> > Also, earlier in the log, I see errors about being unable to create a
>> >> > socket; for some reason it's trying (and failing) to create a socket
>> >> > on localhost (127.0.0.1:9694) and "creating socket on ::1:9694". Are
>> >> > these "ok to ignore"?
>> >>
>> >> First of all, "creating socket on ::1:9694" looks unusual because it
>> >> means that heartbeat_hostname is set to "::1", which is the IPv6
>> >> loopback address. heartbeat_hostname is expected to be set to an IPv4
>> >> or IPv6 inet address (not a loopback address).
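>> >>
>> >> You can check what each configured heartbeat_hostname actually
>> >> resolves to on every node, for example (hostnames taken from your
>> >> configuration above):
>> >>
>> >> getent hosts db-primary-priv db-replica1-priv db-replica2-priv db-replica3-priv
>> >>
>> >> None of them should resolve to 127.0.0.1 or ::1.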
>> >>
>> >> For now, I would like to wait for your reply regarding my questions
>> >> above.
>> >>
>> >> Best regards,
>> >> --
>> >> Tatsuo Ishii
>> >> SRA OSS K.K.
>> >> English: http://www.sraoss.co.jp/index_en/
>> >> Japanese:http://www.sraoss.co.jp
>> >>
>> >>
>> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on
>> >> 127.0.0.1:9694
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385
>> >> > 2025-06-01 09:11:23: pid 634772: LOG: set SO_REUSEPORT option to the
>> >> socket
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:891
>> >> > 2025-06-01 09:11:23: pid 634772: LOG: creating watchdog heartbeat
>> >> receive
>> >> > socket.
>> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on ::1:9694
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385
>> >> > 2025-06-01 09:11:23: pid 634772: LOG: failed to create watchdog
>> >> heartbeat
>> >> > receive socket. retrying...
>> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: bind failed with reason:
>> >> "Cannot
>> >> > assign requested address"
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:394
>> >> >
>> >> >
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: verify_backend_node_status:
>> >> > true_primary 0
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:2704
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[0]: 1
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[1]: 2
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[2]: 2
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[3]: 2
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:
>> pool_release_follow_primary_lock
>> >> > called
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:4106
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the
>> >> socket
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat
>> >> receive
>> >> > socket.
>> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on
>> >> 127.0.0.1:9694
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the
>> >> socket
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat
>> >> receive
>> >> > socket.
>> >> >
>> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on ::1:9694
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: failed to create watchdog heartbeat receive socket. retrying...
>> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: bind failed with reason: "Cannot assign requested address"
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:394
>> >> > 2025-06-01 09:11:23: pid 634767: LOG: set SO_REUSEPORT option to the
>> >> socket
>> >> > 2025-06-01 09:11:23: pid 634767: LOCATION: wd_heartbeat.c:891
>> >> > 2025-06-01 09:11:23: pid 634767: LOG: creating socket for sending
>> >> heartbeat
>> >> > 2025-06-01 09:11:23: pid 634767: DETAIL: set SO_REUSEPORT
>> >> >
>> >> >
>> >> > Other odd errors I see in the logs later on are when pgpool tries to
>> >> > run the escalation script - I can't quite tell where it's failing to
>> >> > find the ipv4 address:
>> >> >
>> >> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process
>> received
>> >> > SIGUSR1
>> >> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process
>> received
>> >> > watchdog quorum change signal from watchdog
>> >> > 2025-06-01 00:07:15: pid 396973: LOG: watchdog: escalation started
>> >> > 2025-06-01 00:07:15: pid 395786: LOG: Setting failover command
>> timeout
>> >> to
>> >> > 135
>> >> > 2025-06-01 00:07:15: pid 395780: LOG: watchdog cluster now holds the
>> >> quorum
>> >> > 2025-06-01 00:07:15: pid 395780: DETAIL: updating the state of
>> >> quarantine
>> >> > backend nodes
>> >> > Sun Jun 1 12:07:15 AM CEST 2025
>> >> >
>> >> > escalation.sh: - remove VIP IP from all nodes
>> >> > Node: db-primary-priv
>> >> > Error: ipv4: Address not found.
>> >> > Node: db-replica1-priv
>> >> > Error: ipv4: Address not found.
>> >> > Node: db-replica2-priv
>> >> > Error: ipv4: Address not found.
>> >> > Node: db-replica3-priv
>> >> > Error: ipv4: Address not found.
>> >> > 2025-06-01 00:07:20: pid 396973: LOG: watchdog escalation successful
>> >> > 2025-06-01 00:07:23: pid 396973: LOG: successfully acquired the
>> delegate
>> >> > IP:"192.168.77.130"
>> >> > 2025-06-01 00:07:23: pid 396973: DETAIL: 'if_up_cmd' returned with
>> >> success
>> >> > 2025-06-01 00:07:23: pid 395786: LOG: watchdog escalation process
>> with
>> >> > pid: 396973 exit with SUCCESS.
>> >> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to lock semaphore
>> >> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument
>> >> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to unlock semaphore
>> >> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument
>> >> > 2025-06-01 00:07:28: pid 396924: WARNING: failed to lock semaphore
>> >> > 2025-06-01 00:07:28: pid 396924: DETAIL: Invalid argument
>> >> >
>> >> > The way I started was I copied pgpool.conf from the "old" and working
>> >> > pgpool 4.2 configuration and modified it slightly, though nothing
>> >> > major: changed hostnames, and e.g. the delegate_ip variable name was
>> >> > changed from delegate_IP. I also added this variable, as without it I
>> >> > didn't see a socket created for port 9898:
>> >> > unix_socket_directories = '/var/run/postgresql'
>> >> > I had to change the network interfaces as well; on RHEL it was using
>> >> > eth0, whereas on Ubuntu it's ens18, so I changed lines like:
>> >> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0
>> >> > The other thing I had to change was the path to arping, as on Ubuntu
>> >> > 24 it's in /usr/bin (not sbin):
>> >> > arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'
>> >> >
>> >> > The other thing I'm confused about: I have two network interfaces on
>> >> > each host, e.g.
>> >> > 192.168.77.135/23 brd 192.168.77.255 scope global ens18
>> >> > and
>> >> > inet 192.168.78.135/24 brd 192.168.78.255 scope global ens19
>> >> >
>> >> > In pgpool.conf, it is referencing the "priv" hosts (i.e.
>> >> > heartbeat_hostname2 = 'db-replica2-priv'), which theoretically should
>> >> > be on the ens19 network interface as shown above, but in my old
>> >> > pgpool.conf, it was only referencing eth0 on RHEL, even though the
>> >> > "priv" hosts were on eth1. I'm wondering if that needs to be changed
>> >> > on Ubuntu, or if it makes no difference here?
>> >> >
>> >> > Right now, I can't quite figure out whether it is something with
>> >> > pgpool.conf or with the Ubuntu settings, and the semaphore error is
>> >> > quite cryptic. If you need me to post some parts of my pgpool.conf,
>> >> > please advise which. I've seen an older version of the FAQ that said
>> >> > repmgr and pgpool don't work together, but then I saw mention that
>> >> > this was no longer the case.
>> >> >
>> >> > I would much appreciate any help you can provide in solving this
>> >> > problem.
>> >>
>>