[pgpool-general: 9490] Re: Weird error in pgpool 4.6 logs: WARNING: failed to lock semaphore
Tatsuo Ishii
ishii at postgresql.org
Sun Jun 1 23:20:20 JST 2025
> To your points/questions:
> 1) version is 4.6.1 - the source I'm using (via apt-get) doesn't have
> 4.6.2 yet
Ok.
> 2) ipcs -s shows... nothing, literally. Checked all nodes, i.e.
>
> postgres at db-replica3:~$ ipcs -s
>
> ------ Semaphore Arrays --------
> key semid owner perms nsems
>
> Also, it kinda seems like "failed to unlock semaphore" shows up during
> startup of pgpool and while it's running, but maybe I'm not reading the
> log properly.
So pgpool lost the semaphore soon after starting. Can you share the full
pgpool log from before "failed to unlock semaphore" appears?
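For what it's worth, "Invalid argument" from semop() is exactly what you
get once a SysV semaphore set has been removed. Here is a minimal
standalone sketch (not the actual pool_sema.c code) that reproduces the
message:

    /* sem_demo.c - minimal sketch, not pgpool code: semop() on a removed
     * SysV semaphore set fails with errno == EINVAL ("Invalid argument"),
     * matching the WARNING/DETAIL pairs in the quoted log. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    int main(void)
    {
        /* Create a private set with one semaphore, as a daemon might. */
        int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
        if (semid < 0)
        {
            perror("semget");
            return 1;
        }

        /* Remove it behind our own back, which is what "ipcrm -s <semid>"
         * or any external IPC cleanup would do. */
        if (semctl(semid, 0, IPC_RMID) < 0)
        {
            perror("semctl");
            return 1;
        }

        /* Now try to lock it, as pgpool's child processes keep doing. */
        struct sembuf op = {0, -1, SEM_UNDO};
        if (semop(semid, &op, 1) < 0)
            perror("semop");    /* prints: semop: Invalid argument */

        return 0;
    }

If something on the system removed the semaphore set pgpool allocated at
startup, every later lock/unlock attempt would fail exactly like in your
log, and "ipcs -s" would show nothing, which matches what you reported.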
> 3) regarding heartbeats, my pgpool.conf has it roughly like this:
>
> heartbeat_hostname0 = 'db-primary-priv'
> heartbeat_port0 = 9694
> heartbeat_device0 = ''
>
> heartbeat_hostname1 = 'db-replica1-priv'
> heartbeat_port1 = 9694
> heartbeat_device1 = ''
>
> heartbeat_hostname2 = 'db-replica2-priv'
> heartbeat_port2 = 9694
> heartbeat_device2 = ''
>
> heartbeat_hostname3 = 'db-replica3-priv'
> heartbeat_port3 = 9694
> heartbeat_device3 = ''
Looks normal.
I have checked the pgpool source code and found suspicious code: before
4.6, the pgpool heartbeat receiver process binds to any address on the
host, but from 4.6 pgpool binds only to the loopback address, which is
plainly wrong. Is it possible for you to patch the pgpool source code
and build a pgpool binary?
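To illustrate what I mean (a simplified sketch only, not the actual
wd_heartbeat.c code): a UDP receive socket bound to the wildcard address
accepts heartbeat packets arriving on any interface, while one bound to
loopback binds successfully but can never see packets from the other
watchdog nodes:

    /* bind_demo.c - simplified sketch, not the actual wd_heartbeat.c
     * code. Shows the difference between binding the heartbeat receive
     * socket to the wildcard address (pre-4.6 behavior) and to loopback. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Hypothetical helper: open a UDP socket bound to the given address. */
    static int open_udp(in_addr_t bind_addr, unsigned short port)
    {
        int one = 1;
        int sock = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr;

        if (sock < 0)
            return -1;

        /* pgpool's log shows it sets SO_REUSEPORT on these sockets. */
        setsockopt(sock, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));
        setsockopt(sock, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(port);
        addr.sin_addr.s_addr = htonl(bind_addr);

        if (bind(sock, (struct sockaddr *) &addr, sizeof(addr)) < 0)
        {
            close(sock);
            return -1;
        }
        return sock;
    }

    int main(void)
    {
        /* Pre-4.6: receives heartbeats arriving on any interface. */
        int any = open_udp(INADDR_ANY, 9694);

        /* Loopback only: the bind succeeds, which is the trap; packets
         * sent by the db-*-priv hosts simply never reach this socket. */
        int lo = open_udp(INADDR_LOOPBACK, 9694);

        printf("wildcard bind: %s, loopback bind: %s\n",
               any >= 0 ? "ok" : "failed", lo >= 0 ? "ok" : "failed");
        return 0;
    }

With the receiver bound to loopback only, heartbeat packets from the
other nodes would be silently dropped even though everything appears to
start up normally.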
> So it seems likely that the inability to lock/unlock semaphores is the
> problem, question is why and how do I resolve it? :(
Currently I have no idea. That's why I am requesting the pgpool log.
> I'll c/p parts of my pgpool.service file as well, in case you see
> something 'off' in it?
>
>
> [Service]
> User=postgres
> Group=postgres
>
> EnvironmentFile=-/etc/default/pgpool2
>
> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS
> ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop
> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload
> LimitNOFILE=65536:65536
> LimitNPROC=16384:16384
> LimitSTACK=10485760:33554432
> LimitMEMLOCK=infinity
> LimitDATA=infinity
>
>
> Thanks in advance.
>
> On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <ishii at postgresql.org> wrote:
>
>> > Hi,
>> > So long story short, I'm migrating a PG13 cluster from RHEL to Ubuntu
>> > 24.04. In the process, the pgpool version is also being upgraded to 4.6
>> > (from 4.2).
>>
>> What is the exact version of 4.6? I am asking because 4.6.0 has a
>> vulnerability.
>>
>> https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29
>>
>> Also 4.6.1 (and 4.6.0) have an issue with IPv6.
>> https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html
>>
>> If possible, I recommend upgrading to 4.6.2.
>>
>> > There is also repmgr running on this 4-node cluster, 1 primary
>> > and 3 replicas.
>>
>> I am not familiar with repmgr, so I will ignore this part.
>>
>> > All the migration seems to have completed successfully,
>> > but I've run into a weird problem with pgpool. I have quite a few errors
>> > that look like the entries below, and they seem to "come and go" when
>> > checking the pgpool log. The process itself seems to work (I think), but
>> > every now and then, these errors (warnings?) show up:
>> >
>> > 2025-06-01 09:11:00: pid 628043: DEBUG: watchdog heartbeat: send heartbeat signal to db-replica3-priv:9694
>> > 2025-06-01 09:11:00: pid 628043: LOCATION: wd_heartbeat.c:771
>> > 2025-06-01 09:11:00: pid 627958: LOG: exit handler called (signal: 2)
>> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1272
>> >
>> >
>> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to lock semaphore
>> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument
>> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:129
>> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to unlock semaphore
>> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument
>> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:202
>> > 2025-06-01 09:11:00: pid 627958: LOG: shutting down by signal 2
>> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1324
>> > 2025-06-01 09:11:00: pid 627958: LOG: terminating all child processes
>> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1334
>> > 2025-06-01 09:11:00: pid 628047: WARNING: failed to lock semaphore
>> > 2025-06-01 09:11:00: pid 628047: DETAIL: Invalid argument
>> > 2025-06-01 09:11:00: pid 628047: LOCATION: pool_sema.c:129
>> > 2025-06-01 09:11:00: pid 628042: WARNING: failed to lock semaphore
>> > 2025-06-01 09:11:00: pid 628042: DETAIL: Invalid argument
>> > 2025-06-01 09:11:00: pid 628042: LOCATION: pool_sema.c:129
>> > 2025-06-01 09:11:00: pid 628039: WARNING: failed to lock semaphore
>> > 2025-06-01 09:11:00: pid 628039: DETAIL: Invalid argument
>>
>> Can you share the result of "ipcs -s" so that I can confirm that pgpool
>> properly allocates the semaphore?
>>
>> By the way, do you see "WARNING: failed to unlock semaphore" only when
>> pgpool is shutting down?
>>
>> > Also, earlier in the log, I see errors about being unable to create a
>> > socket; for some reason it's trying (and failing) to create a socket on
>> > localhost (127.0.0.1:9694) and "creating socket on ::1:9694". Are these
>> > "ok to ignore"?
>>
>> First of all, "creating socket on ::1:9694" looks unusual because it
>> means that heartbeat_hostname is set to "::1", which is the IPv6
>> loopback address. heartbeat_hostname is expected to be set to an IPv4
>> or IPv6 inet address (not a loopback address).
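(As a quick check, here is a small hypothetical diagnostic, not part of
pgpool, that prints every address a heartbeat_hostname resolves to. If
127.0.0.1 or ::1 appears for one of the "-priv" names, that would
explain the loopback sockets in the log quoted below:)

    /* resolve.c - hypothetical diagnostic: show what addresses a
     * heartbeat_hostname resolves to via getaddrinfo(). */
    #include <stdio.h>
    #include <string.h>
    #include <netdb.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        struct addrinfo hints, *res, *ai;
        char host[NI_MAXHOST];

        if (argc != 2)
        {
            fprintf(stderr, "usage: %s <heartbeat_hostname>\n", argv[0]);
            return 1;
        }

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;     /* both IPv4 and IPv6 */
        hints.ai_socktype = SOCK_DGRAM;  /* heartbeat is UDP */

        int rc = getaddrinfo(argv[1], "9694", &hints, &res);
        if (rc != 0)
        {
            fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
            return 1;
        }

        /* Print each candidate address in numeric form. */
        for (ai = res; ai != NULL; ai = ai->ai_next)
        {
            if (getnameinfo(ai->ai_addr, ai->ai_addrlen, host, sizeof(host),
                            NULL, 0, NI_NUMERICHOST) == 0)
                printf("%s resolves to %s\n", argv[1], host);
        }
        freeaddrinfo(res);
        return 0;
    }

(Running it as "./resolve db-replica3-priv" on each node would show
whether something, e.g. an /etc/hosts entry, maps the "-priv" names to a
loopback address.)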
>>
>> For now, I would like to wait for your reply regarding my questions
>> above.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS K.K.
>> English: http://www.sraoss.co.jp/index_en/
>> Japanese: http://www.sraoss.co.jp
>>
>>
>> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on 127.0.0.1:9694
>> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385
>> > 2025-06-01 09:11:23: pid 634772: LOG: set SO_REUSEPORT option to the socket
>> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:891
>> > 2025-06-01 09:11:23: pid 634772: LOG: creating watchdog heartbeat receive socket.
>> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on ::1:9694
>> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385
>> > 2025-06-01 09:11:23: pid 634772: LOG: failed to create watchdog heartbeat receive socket. retrying...
>> > 2025-06-01 09:11:23: pid 634772: DETAIL: bind failed with reason: "Cannot assign requested address"
>> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:394
>> >
>> >
>> > 2025-06-01 09:11:22: pid 635274: DEBUG: verify_backend_node_status: true_primary 0
>> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:2704
>> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[0]: 1
>> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[1]: 2
>> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[2]: 2
>> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[3]: 2
>> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267
>> > 2025-06-01 09:11:22: pid 635274: DEBUG: pool_release_follow_primary_lock called
>> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:4106
>> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the socket
>> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891
>> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat receive socket.
>> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on 127.0.0.1:9694
>> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385
>> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the socket
>> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891
>> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat receive socket.
>> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on ::1:9694
>> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385
>> > 2025-06-01 09:11:23: pid 634765: LOG: failed to create watchdog heartbeat receive socket. retrying...
>> > 2025-06-01 09:11:23: pid 634765: DETAIL: bind failed with reason: "Cannot assign requested address"
>> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:394
>> > 2025-06-01 09:11:23: pid 634767: LOG: set SO_REUSEPORT option to the socket
>> > 2025-06-01 09:11:23: pid 634767: LOCATION: wd_heartbeat.c:891
>> > 2025-06-01 09:11:23: pid 634767: LOG: creating socket for sending heartbeat
>> > 2025-06-01 09:11:23: pid 634767: DETAIL: set SO_REUSEPORT
>> >
>> >
>> > Other odd errors I see in the logs later on are when pgpool tries to run
>> > the escalation script - I can't quite tell where it's failing to find the
>> > ipv4 address:
>> >
>> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process received SIGUSR1
>> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process received watchdog quorum change signal from watchdog
>> > 2025-06-01 00:07:15: pid 396973: LOG: watchdog: escalation started
>> > 2025-06-01 00:07:15: pid 395786: LOG: Setting failover command timeout to 135
>> > 2025-06-01 00:07:15: pid 395780: LOG: watchdog cluster now holds the quorum
>> > 2025-06-01 00:07:15: pid 395780: DETAIL: updating the state of quarantine backend nodes
>> > Sun Jun 1 12:07:15 AM CEST 2025
>> >
>> > escalation.sh: - remove VIP IP from all nodes
>> > Node: db-primary-priv
>> > Error: ipv4: Address not found.
>> > Node: db-replica1-priv
>> > Error: ipv4: Address not found.
>> > Node: db-replica2-priv
>> > Error: ipv4: Address not found.
>> > Node: db-replica3-priv
>> > Error: ipv4: Address not found.
>> > 2025-06-01 00:07:20: pid 396973: LOG: watchdog escalation successful
>> > 2025-06-01 00:07:23: pid 396973: LOG: successfully acquired the delegate IP:"192.168.77.130"
>> > 2025-06-01 00:07:23: pid 396973: DETAIL: 'if_up_cmd' returned with success
>> > 2025-06-01 00:07:23: pid 395786: LOG: watchdog escalation process with pid: 396973 exit with SUCCESS.
>> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to lock semaphore
>> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument
>> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to unlock semaphore
>> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument
>> > 2025-06-01 00:07:28: pid 396924: WARNING: failed to lock semaphore
>> > 2025-06-01 00:07:28: pid 396924: DETAIL: Invalid argument
>> >
>> > The way I started was I copied pgpool.conf from the "old" and working
>> > pgpool 4.2 configuration and modified it slightly, though nothing
>> > major: changed hostnames, and e.g. the delegate_ip variable name was
>> > changed from delegate_IP. I also added this variable, as without it I
>> > didn't see a socket created for port 9898:
>> > unix_socket_directories = '/var/run/postgresql'
>> > I had to change network interfaces as well; on RHEL it was using eth0,
>> > whereas on Ubuntu it's ens18, so I changed lines like:
>> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0
>> > The other thing I had to change was the path to arping, as on Ubuntu 24
>> > it's in /usr/bin (not sbin):
>> > arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'
>> >
>> > The other thing I'm confused about: I have two network interfaces on
>> > each host, e.g.
>> > 192.168.77.135/23 brd 192.168.77.255 scope global ens18
>> > and
>> > inet 192.168.78.135/24 brd 192.168.78.255 scope global ens19
>> >
>> > In pgpool.conf, it is referencing the "priv" hosts (e.g.
>> > heartbeat_hostname2 = 'db-replica2-priv'), which theoretically should
>> > be on the ens19 network interface as shown above, but my old
>> > pgpool.conf was only referencing eth0 on RHEL, even though the "priv"
>> > hosts were on eth1. I'm wondering if that needs to be changed on
>> > Ubuntu or if it makes no difference here?
>> >
>> > Right now, I can't quite figure out whether it is something with
>> > pgpool.conf or with the Ubuntu settings, and the semaphore error is quite
>> > cryptic. If you need me to post some parts of my pgpool.conf, please
>> > advise which. I've seen an older version of the FAQ that said repmgr and
>> > pgpool don't work together, but then I saw a mention that this is no
>> > longer the case.
>> >
>> > I would much appreciate any help you can provide in solving this problem.
>>