<div dir="ltr"><div>Hi,</div><div>I tried the strace as you showed, but it didn't really output anything:</div><div><br></div><div>$ sudo strace -p 1662563 -f -tt --trace=ipc<br>strace: Process 1662563 attached<br><br>^Cstrace: Process 1662563 detached. &lt;--- this is me getting out of it: I saw in another window that the semaphores were already gone, yet nothing was printed here. So unless something is wrong with the syntax of the command, it really produced no output. I changed the log_line_prefix as you suggested, too.</div><div><br></div><div>Also, FYI - I tried downgrading to a lower version (4.3.7) that was available in my package repository, and the same semaphore issue persists.</div><div><br></div>
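<div>Since strace attached to the pgpool parent only shows syscalls made by pgpool itself (and, with -f, its children), I guess it cannot catch a different process removing the semaphore. Unless you have a better idea, I'll try a system-wide audit rule next to catch whoever issues the semctl() call - something like this, assuming auditd is installed ("semwatch" is just a label I picked):</div><div><br></div><div>$ sudo auditctl -a always,exit -F arch=b64 -S semctl -k semwatch<br>$ sudo ausearch -k semwatch &lt;--- run this after the semaphores disappear</div><div><br></div><div>I'll also rule out systemd-logind: I've read that with RemoveIPC=yes (the default in /etc/systemd/logind.conf for non-system users) logind removes a user's System V semaphores when that user's last session ends, which sounds a lot like what I'm seeing with the postgres user - though that's just a guess on my part.</div><div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Mon, Jun 2, 2025 at 2:40 AM Tatsuo Ishii <<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> Hi,<br>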
> Not sure I can patch the source, as it was installed via apt get and not<br>
> compiled from source. If that is doable with a package-built one and you can<br>
> give me some instructions, I can certainly try.<br>
<br>
Unfortunately I'm not familiar with Debian packaging.<br>
<br>
> As for the pgpool log, I'm attaching it, hopefully it will go through, it's<br>
> about 5mb as I enabled higher debug options.<br>
<br>
It was sent as a direct mail to me. I'm not sure who/what removed the<br>
semaphore, but I observe the following from the log:<br>
<br>
- pgpool started at 16:10:21. At that time the semaphore definitely existed.<br>
- Until 16:15:13, everything worked great.<br>
- At 16:15:23, the problem started.<br>
<br>
So it seems the semaphore was removed by someone between 16:15:13 and<br>
16:15:23.<br>
<br>
Unfortunately the removal of the semaphore was not recorded in the log. To<br>
determine who removed the semaphore, can you please attach strace to<br>
pgpool main and share the log? This time the debug log is not necessary.<br>
<br>
- Find the pgpool parent process pid (in the debug log it was 1483510)<br>
- Become root (attaching strace usually requires root privilege)<br>
- Attach strace to the process $pid<br>
<br>
strace -p $pid -f -tt --trace=ipc<br>
<br>
-f: follow the child processes of pgpool main<br>
-tt: add timestamps with microseconds<br>
--trace=ipc: limit the traced system calls to the IPC family to reduce the log<br>
<br>
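If it helps, one way to find the parent pid in the first step is<br>
something like this ("-o" picks the oldest matching process, which<br>
should be the pgpool main process):<br>
<br>
pgrep -o pgpool<br>
<br>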
BTW, I recommend you set log_line_prefix to something like this so<br>
that you get millisecond timestamps (%m) and the process application<br>
name (%a). They are available in 4.6.<br>
<br>
log_line_prefix = '%m: %a pid %p: '<br>
<br>
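With that prefix each log line should come out roughly like this (the<br>
timestamp, application name and pid below are only placeholders):<br>
<br>
2025-06-02 10:15:01.123: psql pid 12345: LOG: ...<br>
<br>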
> On Sun, Jun 1, 2025 at 4:20 PM Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>> wrote:<br>
> <br>
>> > To your points/questions:<br>
>> > 1) version is 4.6.1 - the source I'm using (via apt get) doesn't have<br>
>> 4.6.2<br>
>> > yet<br>
>><br>
>> Ok.<br>
>><br>
>> > 2) ipcs -s shows... nothing, literally. Checked all nodes, e.g.<br>
>> ><br>
>> > postgres@db-replica3:~$ ipcs -s<br>
>> ><br>
>> > ------ Semaphore Arrays --------<br>
>> > key semid owner perms nsems<br>
>> ><br>
>> > Also, it kinda seems like "failed to unlock semaphore" shows up during<br>
>> startup<br>
>> > of pgpool and while it's running, but maybe I'm not reading the log<br>
>> > properly.<br>
>><br>
>> So pgpool lost the semaphore soon after starting. Can you share the<br>
>> full pgpool log from before "failed to unlock semaphore" appears?<br>
>><br>
>> > 3) regarding heartbeats, my pgpool.conf has it roughly like this:<br>
>> ><br>
>> > heartbeat_hostname0 = 'db-primary-priv'<br>
>> > heartbeat_port0 = 9694<br>
>> > heartbeat_device0 = ''<br>
>> ><br>
>> > heartbeat_hostname1 = 'db-replica1-priv'<br>
>> > heartbeat_port1 = 9694<br>
>> > heartbeat_device1 = ''<br>
>> ><br>
>> > heartbeat_hostname2 = 'db-replica2-priv'<br>
>> > heartbeat_port2 = 9694<br>
>> > heartbeat_device2 = ''<br>
>> ><br>
>> > heartbeat_hostname3 = 'db-replica3-priv'<br>
>> > heartbeat_port3 = 9694<br>
>> > heartbeat_device3 = ''<br>
>><br>
>> Looks normal.<br>
>><br>
>> I have checked the pgpool source code and found suspicious<br>
>> code. Before 4.6, the pgpool heartbeat receiver process binds to any<br>
>> address on the host. However, from 4.6 pgpool binds only to the<br>
>> loopback address, which is plain wrong. Is it possible for you to<br>
>> patch the pgpool source code and build a pgpool binary?<br>
>><br>
>> > So it seems likely that the inability to lock/unlock semaphores is the<br>
>> > problem; the question is why, and how do I resolve it? :(<br>
>><br>
>> Currently I have no idea. That's why I am requesting the pgpool log.<br>
>><br>
>> > I'll c/p parts of my pgpool.service file as well, in case you see<br>
>> > something 'off' in it?<br>
>> ><br>
>> ><br>
>> > [Service]<br>
>> > User=postgres<br>
>> > Group=postgres<br>
>> ><br>
>> > EnvironmentFile=-/etc/default/pgpool2<br>
>> ><br>
>> > ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS<br>
>> > ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop<br>
>> > ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload<br>
>> > LimitNOFILE=65536:65536<br>
>> > LimitNPROC=16384:16384<br>
>> > LimitSTACK=10485760:33554432<br>
>> > LimitMEMLOCK=infinity<br>
>> > LimitDATA=infinity<br>
>> ><br>
>> ><br>
>> > Thanks in advance.<br>
>> ><br>
>> > On Sun, Jun 1, 2025 at 2:06 PM Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>><br>
>> wrote:<br>
>> ><br>
>> >> > Hi,<br>
>> >> > So long story short, I'm migrating a PG13 cluster from RHEL to Ubuntu<br>
>> >> > 24.04. In the process, the pgpool version is also being upgraded to<br>
>> 4.6<br>
>> >> > (from 4.2).<br>
>> >><br>
>> >> What is the exact version of 4.6? I am asking because 4.6.0 has a<br>
>> >> vulnerability:<br>
>> >><br>
>> >><br>
>> <a href="https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29" rel="noreferrer" target="_blank">https://pgpool.net/mediawiki/index.php/Main_Page#Pgpool-II_4.6.1.2C_4.5.7.2C_4.4.12.2C_4.3.15_and_4.2.22_officially_released_.282025.2F05.2F15.29</a><br>
>> >><br>
>> >> Also 4.6.1 (and 4.6.0) have an issue with IPv6.<br>
>> >> <a href="https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html" rel="noreferrer" target="_blank">https://www.pgpool.net/docs/latest/en/html/release-4-6-2.html</a><br>
>> >><br>
>> >> If possible, I recommend upgrading to 4.6.2.<br>
>> >><br>
>> >> > There is also repmgr running on this 4 node cluster, 1 primary<br>
>> >> > and 3 replicas.<br>
>> >><br>
>> >> I am not familiar with repmgr, so I will ignore this part.<br>
>> >><br>
>> >> > All the migration seems to have completed successfully,<br>
>> >> > but I've run into a weird problem with pgpool. I have quite a few<br>
>> errors<br>
>> >> > that look like the below entries, and it seems to "come and go" when<br>
>> >> > checking pgpool log. The process itself, seems to work (I think), but<br>
>> >> > every now and then, these errors (warnings?) show up:<br>
>> >> ><br>
>> >> > 2025-06-01 09:11:00: pid 628043: DEBUG: watchdog heartbeat: send<br>
>> >> heartbeat<br>
>> >> > signal to db-replica3-priv:9694<br>
>> >> > 2025-06-01 09:11:00: pid 628043: LOCATION: wd_heartbeat.c:771<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOG: exit handler called (signal: 2)<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1272<br>
>> >> ><br>
>> >> ><br>
>> >> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to lock semaphore<br>
>> >> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:129<br>
>> >> > 2025-06-01 09:11:00: pid 627958: WARNING: failed to unlock semaphore<br>
>> >> > 2025-06-01 09:11:00: pid 627958: DETAIL: Invalid argument<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pool_sema.c:202<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOG: shutting down by signal 2<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1324<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOG: terminating all child processes<br>
>> >> > 2025-06-01 09:11:00: pid 627958: LOCATION: pgpool_main.c:1334<br>
>> >> > 2025-06-01 09:11:00: pid 628047: WARNING: failed to lock semaphore<br>
>> >> > 2025-06-01 09:11:00: pid 628047: DETAIL: Invalid argument<br>
>> >> > 2025-06-01 09:11:00: pid 628047: LOCATION: pool_sema.c:129<br>
>> >> > 2025-06-01 09:11:00: pid 628042: WARNING: failed to lock semaphore<br>
>> >> > 2025-06-01 09:11:00: pid 628042: DETAIL: Invalid argument<br>
>> >> > 2025-06-01 09:11:00: pid 628042: LOCATION: pool_sema.c:129<br>
>> >> > 2025-06-01 09:11:00: pid 628039: WARNING: failed to lock semaphore<br>
>> >> > 2025-06-01 09:11:00: pid 628039: DETAIL: Invalid argument<br>
>> >><br>
>> >> Can you share the result of "ipcs -s" so that I can confirm pgpool<br>
>> >> properly allocates the semaphore?<br>
>> >><br>
>> >> By the way do you see "WARNING: failed to unlock semaphore" only when<br>
>> >> pgpool is shutting down?<br>
>> >><br>
>> >> > Also, earlier in the log, I see errors about being unable to create a<br>
>> >> > socket, for some reason it's trying (and failing) to create a socket on<br>
>> >> > localhost (<a href="http://127.0.0.1:9694" rel="noreferrer" target="_blank">127.0.0.1:9694</a>) and "creating socket on ::1:9694". Are<br>
>> >> these<br>
>> >> > "ok to ignore"?<br>
>> >><br>
>> >> First of all "creating socket on ::1:9694" looks unusual because it<br>
>> >> means that heartbeat_hostname is set to "::1", which is the IPv6<br>
>> >> loopback address. heartbeat_hostname is expected to be set to an IPv4<br>
>> >> or IPv6 inet address (not a loopback address).<br>
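>> >><br>
>> >> For example, something along these lines (the exact address is only<br>
>> >> an illustration; use a non-loopback address on your heartbeat<br>
>> >> network):<br>
>> >><br>
>> >> heartbeat_hostname0 = '192.168.78.131'<br>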
>> >><br>
>> >> For now, I would like to wait for your reply regarding my questions<br>
>> >> above.<br>
>> >><br>
>> >> Best regards,<br>
>> >> --<br>
>> >> Tatsuo Ishii<br>
>> >> SRA OSS K.K.<br>
>> >> English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>
>> >> Japanese: <a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>
>> >><br>
>> >><br>
>> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on<br>
>> >> <a href="http://127.0.0.1:9694" rel="noreferrer" target="_blank">127.0.0.1:9694</a><br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385<br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOG: set SO_REUSEPORT option to the<br>
>> >> socket<br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:891<br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOG: creating watchdog heartbeat<br>
>> >> receive<br>
>> >> > socket.<br>
>> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: creating socket on ::1:9694<br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:385<br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOG: failed to create watchdog<br>
>> >> heartbeat<br>
>> >> > receive socket. retrying...<br>
>> >> > 2025-06-01 09:11:23: pid 634772: DETAIL: bind failed with reason:<br>
>> >> "Cannot<br>
>> >> > assign requested address"<br>
>> >> > 2025-06-01 09:11:23: pid 634772: LOCATION: wd_heartbeat.c:394<br>
>> >> ><br>
>> >> ><br>
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: verify_backend_node_status:<br>
>> >> > true_primary 0<br>
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:2704<br>
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[0]: 1<br>
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267<br>
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[1]: 2<br>
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267<br>
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[2]: 2<br>
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267<br>
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG: node status[3]: 2<br>
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pool_worker_child.c:267<br>
>> >> > 2025-06-01 09:11:22: pid 635274: DEBUG:<br>
>> pool_release_follow_primary_lock<br>
>> >> > called<br>
>> >> > 2025-06-01 09:11:22: pid 635274: LOCATION: pgpool_main.c:4106<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the<br>
>> >> socket<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat<br>
>> >> receive<br>
>> >> > socket.<br>
>> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on<br>
>> >> <a href="http://127.0.0.1:9694" rel="noreferrer" target="_blank">127.0.0.1:9694</a><br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: set SO_REUSEPORT option to the<br>
>> >> socket<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:891<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: creating watchdog heartbeat<br>
>> >> receive<br>
>> >> > socket.<br>
>> >> ><br>
>> >> ><br>
>> >> ><br>
>> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: creating socket on ::1:9694<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:385<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOG: failed to create watchdog<br>
>> >> > heartbeat receive socket. retrying...<br>
>> >> > 2025-06-01 09:11:23: pid 634765: DETAIL: bind failed with reason:<br>
>> >> > "Cannot assign requested address"<br>
>> >> > 2025-06-01 09:11:23: pid 634765: LOCATION: wd_heartbeat.c:394<br>
>> >> > 2025-06-01 09:11:23: pid 634767: LOG: set SO_REUSEPORT option to the<br>
>> >> socket<br>
>> >> > 2025-06-01 09:11:23: pid 634767: LOCATION: wd_heartbeat.c:891<br>
>> >> > 2025-06-01 09:11:23: pid 634767: LOG: creating socket for sending<br>
>> >> heartbeat<br>
>> >> > 2025-06-01 09:11:23: pid 634767: DETAIL: set SO_REUSEPORT<br>
>> >> ><br>
>> >> ><br>
>> >> > Other odd errors I see in the logs later on are when pgpool tries to<br>
>> >> > run the escalation script - I can't quite tell where it's failing to<br>
>> >> > find the ipv4 address:<br>
>> >> ><br>
>> >> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process<br>
>> received<br>
>> >> > SIGUSR1<br>
>> >> > 2025-06-01 00:07:15: pid 395780: LOG: Pgpool-II parent process<br>
>> received<br>
>> >> > watchdog quorum change signal from watchdog<br>
>> >> > 2025-06-01 00:07:15: pid 396973: LOG: watchdog: escalation started<br>
>> >> > 2025-06-01 00:07:15: pid 395786: LOG: Setting failover command<br>
>> timeout<br>
>> >> to<br>
>> >> > 135<br>
>> >> > 2025-06-01 00:07:15: pid 395780: LOG: watchdog cluster now holds the<br>
>> >> quorum<br>
>> >> > 2025-06-01 00:07:15: pid 395780: DETAIL: updating the state of<br>
>> >> quarantine<br>
>> >> > backend nodes<br>
>> >> > Sun Jun 1 12:07:15 AM CEST 2025<br>
>> >> ><br>
>> >> > escalation.sh: - remove VIP IP from all nodes<br>
>> >> > Node: db-primary-priv<br>
>> >> > Error: ipv4: Address not found.<br>
>> >> > Node: db-replica1-priv<br>
>> >> > Error: ipv4: Address not found.<br>
>> >> > Node: db-replica2-priv<br>
>> >> > Error: ipv4: Address not found.<br>
>> >> > Node: db-replica3-priv<br>
>> >> > Error: ipv4: Address not found.<br>
>> >> > 2025-06-01 00:07:20: pid 396973: LOG: watchdog escalation successful<br>
>> >> > 2025-06-01 00:07:23: pid 396973: LOG: successfully acquired the<br>
>> delegate<br>
>> >> > IP:"192.168.77.130"<br>
>> >> > 2025-06-01 00:07:23: pid 396973: DETAIL: 'if_up_cmd' returned with<br>
>> >> success<br>
>> >> > 2025-06-01 00:07:23: pid 395786: LOG: watchdog escalation process<br>
>> with<br>
>> >> > pid: 396973 exit with SUCCESS.<br>
>> >> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to lock semaphore<br>
>> >> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument<br>
>> >> > 2025-06-01 00:07:27: pid 395962: WARNING: failed to unlock semaphore<br>
>> >> > 2025-06-01 00:07:27: pid 395962: DETAIL: Invalid argument<br>
>> >> > 2025-06-01 00:07:28: pid 396924: WARNING: failed to lock semaphore<br>
>> >> > 2025-06-01 00:07:28: pid 396924: DETAIL: Invalid argument<br>
>> >> ><br>
>> >> > The way I started was I copied pgpool.conf from the "old" and working<br>
>> >> > pgpool 4.2 configuration and modified it slightly, though nothing<br>
>> >> > major: changed hostnames, and e.g. the delegate_ip variable name was<br>
>> >> > changed from delegate_IP. Also added this variable since without it, I didn't<br>
>> see a<br>
>> >> > socket created for port 9898:<br>
>> >> > unix_socket_directories = '/var/run/postgresql'<br>
>> >> > I had to change network interfaces as well; on RHEL it was using<br>
>> eth0<br>
>> >> > where on Ubuntu it's ens18, so I changed lines like:<br>
>> >> > /usr/bin/sudo /sbin/ip addr add $_IP_$/23 dev ens18 label ens18:0<br>
>> >> > Other thing I had to change was path to arping as on Ubuntu 24 it's in<br>
>> >> > /usr/bin (not sbin)<br>
>> >> > arping_cmd = '/usr/bin/sudo /usr/bin/arping -U $_IP_$ -w 1 -I ens18'<br>
>> >> ><br>
>> >> > The other thing I'm confused about: I have two network interfaces on<br>
>> each<br>
>> >> > host, e.g.<br>
>> >> > <a href="http://192.168.77.135/23" rel="noreferrer" target="_blank">192.168.77.135/23</a> brd 192.168.77.255 scope global ens18<br>
>> >> > and<br>
>> >> > inet <a href="http://192.168.78.135/24" rel="noreferrer" target="_blank">192.168.78.135/24</a> brd 192.168.78.255 scope global ens19<br>
>> >> ><br>
>> >> > In pgpool.conf, it is referencing the "priv" hosts (e.g.<br>
>> >> > heartbeat_hostname2 = 'db-replica2-priv'), which theoretically should<br>
>> be<br>
>> >> on<br>
>> >> > ens19 network interface as shown above, but in my old pgpool.conf, it<br>
>> was<br>
>> >> > only referencing eth0 on RHEL, even though "priv" hosts were on eth1.<br>
>> >> I'm<br>
>> >> > wondering if that needs to be changed in Ubuntu or it makes no<br>
>> difference<br>
>> >> > here?<br>
>> >> ><br>
>> >> > Right now, I can't quite figure out whether it is something with<br>
>> >> > pgpool.conf or with the Ubuntu settings, and the semaphore error is<br>
>> quite<br>
>> >> > cryptic. If you need me to post some parts of my pgpool.conf, please<br>
>> >> > advise which. I've seen an older version of the FAQ that said repmgr<br>
>> and<br>
>> >> > pgpool don't work together but then I saw mention that this was no<br>
>> longer<br>
>> >> > the case.<br>
>> >> ><br>
>> >> > I would much appreciate any help you can provide in solving this<br>
>> problem.<br>
>> >><br>
>><br>
</blockquote></div>