[pgpool-general: 7991] Re: pcp_node_info does not return when host is lost on 4.3.0

Tue Jan 25 09:04:23 JST 2022

> Hi,
> 
> I've managed to do some bisecting today, and I can say with quite high
> confidence that the issue is introduced with these 2 commits:
> 1ae1f159b89f4d18a8f7b737929e9a6448ad63ab Add new fields to show pool_nodes
> command and friends.
> 6de0d264be66ce145d3ed726235920401cf74ebe Fix pcp_node_info failure when
> backend is down.
> 
> When running the tests with the first commit, pgpool fails to start at that
> point in the test. I suspect the second commit fixes that, but since that
> point, all builds got stuck on the pcp_node_info call. In between there's
> another commit that updates some docs
> (8e8ecaced44ec9e6322023729c427bcfa732deda), but that should not be relevant
> to this issue. I hope this info helps you pinpoint the exact problem.

I am the author of the commits.

The question is why the issue does not happen in other environments.
Peng and I failed to reproduce the problem, and the build farm has not
reported any failure caused by this either. Is there anything special
about your environment?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Best regards,
> Emond
> 
> On Mon, Jan 24, 2022 at 8:23 AM Emond Papegaaij <emond.papegaaij at gmail.com>
> wrote:
> 
>> Hi,
>>
>> Unfortunately, the patch doesn't help. The call to pcp_node_info still
>> hangs. I do however see a difference in the pgpool log. The pcp worker only
>> logs a single line:
>>
>> 2022-01-24 05:26:37: pid 81: LOG:  forked new pcp worker, pid=211 socket=7
>> 2022-01-24 05:26:47: pid 211: LOG:  failed to connect to PostgreSQL server
>> on "172.29.30.2:5432", timed out
>>
>> After this, there's no mention of pid 211. There are no log messages from
>> that pid, and none from pid 81 either (which I would expect to log that
>> the PCP worker process exited).
>>
>> Best regards,
>> Emond
>>
>> On Sat, Jan 22, 2022 at 2:15 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
>>
>>> Hello,
>>>
>>> Thank you for your reply.
>>>
>>> I think this issue is specific to 4.3.0.
>>> Another developer, Tatsuo Ishii, has created a patch that fixes this
>>> issue.
>>> Could you try the attached patch, if you are able to apply it?
>>>
>>> Best regards,
>>>
>>> On Fri, 21 Jan 2022 14:10:05 +0100
>>> Emond Papegaaij <emond.papegaaij at gmail.com> wrote:
>>>
>>> > >
>>> > >> > We are working on the upgrade from 4.2.6 to 4.3.0 and we are
>>> > >> > facing a test that is failing consistently. In one of our tests
>>> > >> > we power down 2 of the 3 hosts with a hard poweroff. Prior to the
>>> > >> > poweroff, we configure the cluster
>>> > >>
>>> > >> Thank you for reporting this issue.
>>> > >> I am going to look into it.
>>> > >> Does this issue only occur in 4.3.0?
>>> > >
>>> > > Thanks for looking into this. As is often the case with these kinds
>>> > > of errors, I cannot be absolutely sure, but I haven't seen this
>>> > > error before with 4.2.6 or earlier. We skipped 4.2.7, as the release
>>> > > notes state it was only for PG14 support, which we don't need at the
>>> > > moment.
>>> >
>>> > To report back on this: we've run 11 consecutive builds with 4.3.0, all
>>> > failing on this issue. I've checked the past 40 or so builds with 4.2.6
>>> > and none of them failed. So this is definitely a regression in 4.3.0. Do
>>> > you already have an idea of the cause? If not, I can try to perform a
>>> > bisect on the diff between 4.2.6 and 4.3.0. This will, however, take me
>>> > some time, as every build takes about 2 hours. Git expects about 8
>>> > revisions to check, so that's 2 whole working days.
>>> >
>>> > Best regards,
>>> > Emond
>>>
>>>
>>> --
>>> Bo Peng <pengbo at sraoss.co.jp>
>>> SRA OSS, Inc. Japan
>>> http://www.sraoss.co.jp/
>>>
>>