<div dir="ltr"><div>Hi, sorry to reply late, things were pretty busy.  I am very happy to report your suggested solution did the trick (we ended up changing the UID/GUID of postgres user to a lower number from 3000, which made it a system user).  It has been a couple of days and it is working fine, no issues with semaphores.</div><div><br></div><div>On the other issues I mentioned, the reason it was complaining about user "nobody" was that other setting you advised me to change ( lifecheck_method = 'query'), as that had user 'nobody' by default.  Once I changed that user to our default pgpool user, the error was gone.</div><div><br></div><div>Also, the other issue I mentioned with pcp_node_info also is now gone and all nodes show status 2, so everything seems to be working properly now.</div><div><br></div><div>Once again, thank you very much for your assistance with this problem, much appreciated!</div><div><br></div></div><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">On Thu, Jun 5, 2025 at 8:33 AM Tatsuo Ishii <<a href="mailto:ishii@postgresql.org">ishii@postgresql.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">> Hi, I changed the wd_lifecheck_method to query, saw this in the log<br>

> afterwards:<br>

> 2025-06-04 13:32:42.930: main pid 120326: LOG:  create socket files[0]:<br>

> /var/run/postgresql/.s.PGSQL.9999<br>

> 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:4888<br>

> 2025-06-04 13:32:42.930: main pid 120326: LOG:  listen address[0]: *<br>

> 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:4920<br>

> 2025-06-04 13:32:42.930: main pid 120326: LOG:  Setting up socket for<br>

> <a href="http://0.0.0.0:9999" rel="noreferrer" target="_blank">0.0.0.0:9999</a><br>

> 2025-06-04 13:32:42.930: main pid 120326: LOCATION:  pgpool_main.c:999<br>

> <br>

> <br>

> <br>

> *2025-06-04 13:32:42.930: main pid 120326: LOG:  Setting up socket for<br>

> :::99992025-06-04 13:32:42.930: main pid 120326: LOCATION:<br>

>  pgpool_main.c:9992025-06-04 13:32:42.932: child pid 120332: FATAL:  SCRAM<br>

> authentication failed2025-06-04 13:32:42.932: child pid 120332: DETAIL:<br>

>  pool_passwd file does not contain an entry for "nobody"*<br>

> 2025-06-04 13:32:42.932: child pid 120332: LOCATION:  pool_hba.c:1272<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOG:  create socket files[0]:<br>

> /var/run/postgresql/.s.PGSQL.9898<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:4888<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOG:  listen address[0]: *<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:4920<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOG:  Setting up socket for<br>

> <a href="http://0.0.0.0:9898" rel="noreferrer" target="_blank">0.0.0.0:9898</a><br>

> 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:999<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOG:  Setting up socket for<br>

> :::9898<br>

> 2025-06-04 13:32:43.375: main pid 120326: LOCATION:  pgpool_main.c:999<br>

> 2025-06-04 13:32:43.376: pcp_main pid 121356: LOG:  PCP process: 121356<br>

> started<br>

> <br>

> Any idea about that pool_passwd error<br>

<br>

Because you set wd_lifecheck_password = '' and do not set password for<br>

nobody in pool_passwd.<br>

<br>

> attempting to set up socket for<br>

> :::9999?<br>

<br>

You mean 9898? because you set pcp_port = 9898.<br>

<br>

> Again, not a big deal, just wondering if it's something to be concerned<br>

> about.<br>

> <br>

> Also, I'm wondering about the status of that pcp_node_info command I<br>

> mentioned earlier... even though semaphores seem to be present and we have<br>

> plenty of queries/connections, the status is still showing as 1/waiting<br>

> where I suspect it should be 2... I might be stressing over nothing, but<br>

> none of the nodes are showing in status 2 at this point, example:<br>

> $ pcp_node_info -h localhost -p 9898 -U pcpadmin --verbose -n 2<br>

> Hostname               : db-replica3<br>

> Port                   : 5434<br>

> Status                 : 1<br>

> Weight                 : 0.250000<br>

> Status Name            : waiting<br>

> Backend Status Name    : up<br>

> Role                   : standby<br>

> Backend Role           : standby<br>

> Replication Delay      : 0<br>

> Replication State      : streaming<br>

> Replication Sync State : async<br>

> Last Status Change     : 2025-06-03 11:59:27<br>

<br>

As I explained upthread, you must have run pcp_node_info command<br>

against non-leader watchdog node. In this case no clients has been<br>

connected the node and the status remains 1.<br>

<br>

> On Wed, Jun 4, 2025 at 2:18 AM Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>> wrote:<br>

> <br>

>> Hi,<br>

>><br>

>> > Hi, Gave it a try, although we went with option #2 (RemoveIPC=no) as<br>

>> > it was quicker to do.  After 20+ minutes, the semaphores are still<br>

>> > present and there are no errors in the log.  Tomorrow I will discuss<br>

>> > with our sysadmins how we can implement option #1, as the UID of<br>

>> > postgres user is above SYS_UID_MAX.  Maybe we won't go through<br>

>> > changing postgres user's UID, changing the SYS_UID_MAX to higher<br>

>> > number might be ok as we only have a couple users above that number.<br>

>><br>

>> Thank you for sharing your thought.<br>

>><br>

>> > Also, I forgot to mention - our repo showed the newer (4.6.2)<br>

>> > version as available so I upgraded the packages and restarted<br>

>> > pgpool, but that ipv6 error at start up is still present.  It's more<br>

>> > of a nuisance more than anything else, but figured I would let you<br>

>> > know: 2025-06-03 18:56:39: pid 1950700: WARNING: failed to create<br>

>> > watchdog heartbeat receive socket as much intended 2025-06-03<br>

>> > 18:56:39: pid 1950700: DETAIL: only 1 out of 2 socket(s) had been<br>

>> > created 2025-06-03 18:56:39: pid 1950700: LOCATION:<br>

>> > wd_heartbeat.c:428 2025-06-03 18:56:39: pid 1950704: LOG: failed to<br>

>> > create watchdog heartbeat receive socket 2025-06-03 18:56:39: pid<br>

>> > 1950704: LOCATION: wd_heartbeat.c:408 2025-06-03 18:56:39: pid<br>

>> > 1950704: WARNING: failed to create watchdog heartbeat receive socket<br>

>> > as much intended 2025-06-03 18:56:39: pid 1950704: DETAIL: only 1<br>

>> > out of 2 socket(s) had been created 2025-06-03 18:56:39: pid<br>

>> > 1950704: LOCATION: wd_heartbeat.c:428 Earlier, when I had debugging<br>

>> > enabled, it was still showing unable to create this socket:<br>

>> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: creating<br>

>> > watchdog heartbeat receive socket.  2025-06-03 11:57:24.833:<br>

>> > heart_beat_receiver pid 4005: DETAIL: creating socket on<br>

>> > <a href="http://127.0.0.1:9694" rel="noreferrer" target="_blank">127.0.0.1:9694</a> 2025-06-03 11:57:24.833: heart_beat_receiver pid<br>

>> > 4005: LOCATION: wd_heartbeat.c:385 2025-06-03 11:57:24.833:<br>

>> > heart_beat_receiver pid 4005: LOG: set SO_REUSEPORT option to the<br>

>> > socket 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005:<br>

>> > LOCATION: wd_heartbeat.c:891 2025-06-03 11:57:24.833:<br>

>> > heart_beat_receiver pid 4005: LOG: creating watchdog heartbeat<br>

>> > receive socket.  2025-06-03 11:57:24.833: heart_beat_receiver pid<br>

>> > 4005: DETAIL: *creating socket on ::1:9694* 2025-06-03 11:57:24.833:<br>

>> > heart_beat_receiver pid 4005: LOCATION: wd_heartbeat.c:385<br>

>> > 2025-06-03 11:57:24.833: heart_beat_receiver pid 4005: LOG: failed<br>

>> > to create watchdog heartbeat receive socket. retrying...  2025-06-03<br>

>> > 11:57:24.833: heart_beat_receiver pid 4005: DETAIL: bind failed with<br>

>> > reason: "Cannot assign requested address" Doesn't seem to affect the<br>

>> > pgpool functionality, I'd prefer to disable it altogether if<br>

>> > possible to have 'clean' logs, if possible.  So far, tried to change<br>

>> > listen_addresses and pcp_listen_addresses in pgpool.conf from "*" to<br>

>> > "0.0.0.0" but that didn't resolve it.<br>

>><br>

>> pcp_listen_addresses is used for pcp commands only, not relevant to<br>

>> heartbeart.<br>

>><br>

>> BTW, I think I found possible cause of heartbeat problem: before 4.6,<br>

>> the bind address of heartbeat receive socket was specified as<br>

>> INADDR_ANY, which binds all available network devices. But from 4.6.0,<br>

>> it only binds to loopback address (127.0.0.1 for IPv4, ::1 for<br>

>> IPv6). We should have bound to all available network devices in 4.6.<br>

>> Because of this, watchdog won't start lifecheck using heartbeat as<br>

>> other heartbeat sender process send heartbeat packet to hostnameX,<br>

>> which is not loopback address of course. We have alreay created a<br>

>> patch to fix it, and doing an internal test.<br>

>><br>

>> Note that if lifecheck_method = 'query', you are not bitten by the<br>

>> bug.<br>

>><br>

>> > Just curious if there's a way to<br>

>> > skip ipv6 functionality?<br>

>><br>

>> No.<br>

>><br>

>> > On Tue, Jun 3, 2025 at 1:57 PM Tatsuo Ishii <<a href="mailto:ishii@postgresql.org" target="_blank">ishii@postgresql.org</a>><br>

>> wrote:<br>

>> ><br>

>> >> > Hi,<br>

>> >> ><br>

>> >> > Thank you for the great information!<br>

>> >> ><br>

>> >> >> Hi,<br>

>> >> >> We've made some progress here and I'd like to get your feedback<br>

>> >> regarding<br>

>> >> >> the fix/workaround.  Basically, after some research (incl.<br>

>> >> google/chatgpt<br>

>> >> >> etc), we came across this information:<br>

>> >> >><br>

>> >> >>   1. systemd-logind manages user sessions<br>

>> >> >><br>

>> >> >> In Ubuntu 24.04 (and since ~15.04), systemd manages user sessions.<br>

>> When<br>

>> >> a<br>

>> >> >> user session ends (e.g. when a background process or login shell<br>

>> >> finishes),<br>

>> >> >> the following happens:<br>

>> >> >><br>

>> >> >>    -<br>

>> >> >><br>

>> >> >>    The session scope (session-XXXX.scope) is stopped.<br>

>> >> >>    -<br>

>> >> >><br>

>> >> >>    The user manager (user@UID.service) is also stopped *unless<br>

>> >> lingering is<br>

>> >> >>    enabled*.<br>

>> >> >><br>

>> >> >>   If no processes remain and the user is not lingering:<br>

>> >> >><br>

>> >> >>    -<br>

>> >> >><br>

>> >> >>    Shared memory and *System V IPC resources* (like semaphores) *may<br>

>> be<br>

>> >> >>    destroyed* automatically if no one is using them.<br>

>> >> >>    -<br>

>> >> >><br>

>> >> >>    If a process (like pgpool2 or PostgreSQL) relied on those<br>

>> semaphores,<br>

>> >> >>    you may later see:<br>

>> >> >><br>

>> >> >>    failed to lock semaphore<br>

>> >> >><br>

>> >> >><br>

>> >> >> This is a strong sign that the user session ended and IPC resources<br>

>> >> (like<br>

>> >> >> semaphores) were destroyed.<br>

>> >> >><br>

>> >> >> This does seem to be happening here, as I see these in<br>

>> /var/log/syslog<br>

>> >> >> around that time we discussed in last posting, just prior to<br>

>> semaphore<br>

>> >> >> errors showing in pgpool logs:<br>

>> >> >> 2025-06-02T12:45:17.173486+02:00 db-replica3 systemd[1]:<br>

>> >> user@3000.service:<br>

>> >> >> Deactivated successfully.<br>

>> >> >> 2025-06-02T12:45:17.173997+02:00 db-replica3 systemd[1]: Stopped<br>

>> >> >> user@3000.service - User Manager for UID 3000.<br>

>> >> >> 2025-06-02T12:45:17.240314+02:00 db-replica3 systemd[1]: Stopping<br>

>> >> >> user-runtime-dir@3000.service - User Runtime Directory<br>

>> >> /run/user/3000...<br>

>> >> >> 2025-06-02T12:45:17.248017+02:00 db-replica3 systemd[1]:<br>

>> >> >> run-user-3000.mount: Deactivated successfully.<br>

>> >> >> 2025-06-02T12:45:17.249107+02:00 db-replica3 systemd[1]:<br>

>> >> >> user-runtime-dir@3000.service: Deactivated successfully.<br>

>> >> >> 2025-06-02T12:45:17.249353+02:00 db-replica3 systemd[1]: Stopped<br>

>> >> >> user-runtime-dir@3000.service - User Runtime Directory<br>

>> /run/user/3000.<br>

>> >> >> 2025-06-02T12:45:17.251672+02:00 db-replica3 systemd[1]: Removed<br>

>> slice<br>

>> >> >> user-3000.slice - User Slice of UID 3000.<br>

>> >> >> 2025-06-02T12:45:23.581108+02:00 db-replica3 strace[1692927]: [pid<br>

>> >> 1693495]<br>

>> >> >> 12:45:23.580360 semtimedop(41, [{sem_num=6, sem_op=-1,<br>

>> >> sem_flg=SEM_UNDO}],<br>

>> >> >> 1, NULL) = -1 EINVAL (Invalid argument)<br>

>> >> >><br>

>> >> >> It also advises to add Type=forking to the [Service] section of the<br>

>> >> >> pgpool.service file, which we don't currently have, I think I posted<br>

>> it<br>

>> >> >> earlier, but just in case here is relevant part of our service file:<br>

>> >> >><br>

>> >> >> [Unit]<br>

>> >> >> Description=Pgpool-II<br>

>> >> >> After=syslog.target network.target postgresql.service<br>

>> >> >> Wants=postgresql.service<br>

>> >> >><br>

>> >> >> [Service]<br>

>> >> >> User=postgres<br>

>> >> >> Group=postgres<br>

>> >> >><br>

>> >> >> EnvironmentFile=-/etc/default/pgpool2<br>

>> >> >><br>

>> >> >> ExecStart=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $OPTS<br>

>> >> >> ExecStop=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf $STOP_OPTS stop<br>

>> >> >> ExecReload=/usr/sbin/pgpool -f /etc/pgpool2/pgpool.conf reload<br>

>> >> >><br>

>> >> >> -------------------<br>

>> >> >> One bit I read that is different is the Type=forking and adding -n is<br>

>> >> >> required (as you can see above, we don't have that right now:<br>

>> >> >> -<br>

>> >> >><br>

>> >> >> User=postgres: Assumes pgpool should run as the postgres user.<br>

>> Adjust if<br>

>> >> >> different.<br>

>> >> >> -<br>

>> >> >><br>

>> >> >> ExecStart=/usr/sbin/pgpool -n: Use the *correct path and arguments*<br>

>> for<br>

>> >> >> your installation.<br>

>> >> >> -<br>

>> >> >><br>

>> >> >> pgpool must be configured to *not daemonize* (-n flag), or systemd<br>

>> will<br>

>> >> >> lose track of it.<br>

>> >> >> -<br>

>> >> >><br>

>> >> >> You can add environment variables for config files, ports, etc.<br>

>> >> >> -----------------------<br>

>> >> >><br>

>> >> >> Anyways, our sysadmins suggested changing the user postgres to have<br>

>> >> Linger<br>

>> >> >> option on:<br>

>> >> >> $ sudo loginctl show-user postgres -p Linger<br>

>> >> >> Linger=yes<br>

>> >> >> Once we did that, the semaphore issue disappeared, it's been running<br>

>> for<br>

>> >> >> over an hour now and ipcs shows them:<br>

>> >> >> $ ipcs -s<br>

>> >> >><br>

>> >> >> ------ Semaphore Arrays --------<br>

>> >> >> key        semid      owner      perms      nsems<br>

>> >> >> 0x00000000 6          postgres   600        8<br>

>> >> >><br>

>> >> >><br>

>> >> >> So, what do you think about this fix? I read that letting the user<br>

>> have<br>

>> >> >> linger on is not generally recommended, but that seems to be the only<br>

>> >> way<br>

>> >> >> to keep the semaphores from disappearing right now...<br>

>> >> ><br>

>> >> > I am not familiar with this area. I will discuss with other<br>

>> >> > pgpool/PostgreSQL developers and reply back to you.<br>

>> >><br>

>> >> I found this in the PostgreSQL manual:<br>

>> >><br>

>> >><br>

>> <a href="https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC" rel="noreferrer" target="_blank">https://www.postgresql.org/docs/current/kernel-resources.html#SYSTEMD-REMOVEIPC</a><br>

>> >><br>

>> >> If my reading is correct, we can do one of:<br>

>> >><br>

>> >> 1) register the user to start pgpool as a "system user".<br>

>> >><br>

>> >> 2) change the line "RemoveIPC=no" in /etc/systemd/logind.conf .<br>

>> >><br>

>> >> For me, #1 is better, since #2 affects to other servers in the system.<br>

>> >><br>

>> >> BTW, I think the doc does not suggest to use Linger option.<br>

>> >><br>

>> >> Best regards,<br>

>> >> --<br>

>> >> Tatsuo Ishii<br>

>> >> SRA OSS K.K.<br>

>> >> English: <a href="http://www.sraoss.co.jp/index_en/" rel="noreferrer" target="_blank">http://www.sraoss.co.jp/index_en/</a><br>

>> >> Japanese:<a href="http://www.sraoss.co.jp" rel="noreferrer" target="_blank">http://www.sraoss.co.jp</a><br>

>> >><br>

>><br>

</blockquote></div>