[pgpool-general: 5929] Re: [pgpool-hackers: 2694] Re: Re: Pgpool-3.7.1 segmentation fault

Tatsuo Ishii ishii at sraoss.co.jp
Tue Feb 27 13:45:47 JST 2018


It turns out the patch badly broke the case where the ALWAYS_MASTER flag is
not set.

With the commit, write queries are always sent to node 0 even if the
primary node is not node 0, because the PRIMARY_NODE_ID macro returns
REAL_MASTER_NODE_ID, which is usually 0. Thus write queries fail
with:

    ERROR:  cannot execute INSERT in a read-only transaction

So I am going to revert the patch (and think about a saner fix for
the original problem).
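
To make the misrouting concrete, here is a minimal, self-contained sketch
(made-up node states and simplified logic; this is not the actual Pgpool-II
source): REAL_MASTER_NODE_ID is essentially the youngest alive node id, so
when the real primary is node 1, routing write queries by that value sends
them to the standby on node 0.

    /* Hedged sketch, not Pgpool-II code: why using REAL_MASTER_NODE_ID
     * (the youngest alive node id) as the write target is wrong when the
     * primary is some other node. */
    #include <stdio.h>

    #define NUM_BACKENDS 2
    enum { NODE_UP, NODE_DOWN };

    static int status[NUM_BACKENDS] = { NODE_UP, NODE_UP };
    static int recorded_primary = 1;          /* the real (writer) node */

    static int real_master_node_id(void)      /* youngest alive node id */
    {
        for (int i = 0; i < NUM_BACKENDS; i++)
            if (status[i] == NODE_UP)
                return i;
        return -1;
    }

    int main(void)
    {
        int write_target = real_master_node_id();  /* what the committed macro gave back */
        printf("recorded primary: %d, write target: %d\n",
               recorded_primary, write_target);
        /* prints "recorded primary: 1, write target: 0": the INSERT goes to
         * the standby and fails with "cannot execute INSERT in a read-only
         * transaction". */
        return 0;
    }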

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

From: Tatsuo Ishii <ishii at sraoss.co.jp>
Subject: [pgpool-hackers: 2694] Re: [pgpool-general: 5885] Re: Pgpool-3.7.1 segmentation fault
Date: Mon, 29 Jan 2018 13:18:34 +0900 (JST)
Message-ID: <20180129.131834.1179560355095753654.t-ishii at sraoss.co.jp>

> I have committed the proposed patch since there was no objection.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
>> Proposed patch attached.
>> 
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>> 
>>> Hi Pgpool-II developers,
>>> 
>>> It has been reported that a Pgpool-II child process can crash with a segmentation fault:
>>> 
>>> #0  0x000000000042478d in select_load_balancing_node () at
>>> protocol/child.c:1680
>>> 1680                    char *database = MASTER_CONNECTION(ses->backend)->sp->database;
>>> (gdb) backtrace
>>> #0  0x000000000042478d in select_load_balancing_node () at
>>> protocol/child.c:1680
>>> 
>>> To reproduce the problem following conditions should be all met:
>>> 
>>> 1) Streaming replication mode.
>>> 
>>> 2) fail_over_on_backend_error is off.
>>> 
>>> 3) The ALWAYS_MASTER flag is set on the master (writer) node.
>>> 
>>> 4) The pgpool_status file indicates that the node mentioned in #3 is in
>>>    down status (a rough configuration sketch follows below).
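>>> 
>>> As a rough illustration only (values invented for this example; please
>>> double-check parameter names against your pgpool.conf.sample), the
>>> combination looks something like this in pgpool.conf, together with a
>>> stale pgpool_status file that still records backend 0 as down:
>>> 
>>>     # streaming replication mode
>>>     master_slave_mode = on
>>>     master_slave_sub_mode = 'stream'
>>> 
>>>     # do not trigger failover on a backend communication error
>>>     fail_over_on_backend_error = off
>>> 
>>>     # node 0 is the writer; ALWAYS_MASTER skips the status check on it
>>>     backend_hostname0 = 'writer.example.com'
>>>     backend_port0 = 5432
>>>     backend_flag0 = 'ALWAYS_MASTER'
>>> 
>>>     # stale pgpool_status contents (simplified): backend 0 recorded as
>>>     # "down", the other backends as "up"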
>>> 
>>> What happens here is,
>>> 
>>> 1) find_primary_node() returns node id 0 without checking the status
>>>    of node 0, since ALWAYS_MASTER is set. That node id is remembered as
>>>    the primary node id in Req_info->primary_node_id.
>>> 
>>> 2) The connection to backend 0 is not created since pgpool_status says
>>>    it's in down status.
>>> 
>>> 3) Upon session start, select_load_balancing_node() is called and it
>>>    tries to determine the database name from the client's startup
>>>    packet.
>>> 
>>> 4) Since the MASTER_CONNECTION macro points to the primary node's
>>>    connection slot, MASTER_CONNECTION(ses->backend) is NULL and the
>>>    dereference results in a segfault (illustrated below).
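>>> 
>>> A minimal, self-contained illustration (hypothetical struct names, not the
>>> actual child.c types) of the crash pattern and the kind of guard that
>>> would avoid it:
>>> 
>>>     #include <stdio.h>
>>>     #include <stddef.h>
>>> 
>>>     /* stand-ins for pgpool's startup packet / connection slot */
>>>     typedef struct { char *database; } StartupPacket;
>>>     typedef struct { StartupPacket *sp; } ConnectionSlot;
>>> 
>>>     static const char *database_name(ConnectionSlot *master)
>>>     {
>>>         /* guard: the slot is NULL when backend 0 was skipped at
>>>          * connection time because pgpool_status marked it down */
>>>         if (master == NULL || master->sp == NULL)
>>>             return NULL;
>>>         return master->sp->database;
>>>     }
>>> 
>>>     int main(void)
>>>     {
>>>         ConnectionSlot *master = NULL;   /* no connection to backend 0 */
>>>         const char *db = database_name(master);
>>>         printf("database: %s\n", db ? db : "(no master connection)");
>>>         return 0;
>>>     }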
>>> 
>>> The fix I propose is to change the PRIMARY_NODE_ID macro so that it
>>> returns REAL_MASTER_NODE_ID (that is, the youngest node id which is
>>> alive) if the node id in Req_info->primary_node_id is in down status.
>>> So we would have the "true" primary node id in Req_info->primary_node_id,
>>> and a "fake" primary node id returned by the PRIMARY_NODE_ID macro.
>>> 
>>> I am afraid this is confusing and may have unintended side effects
>>> somewhere in Pgpool-II. Note, however, that we already let PRIMARY_NODE_ID
>>> return REAL_MASTER_NODE_ID if find_primary_node() cannot find a
>>> primary node. So maybe I am too worried... but I don't know.
>>> 
>>> So I would like to hear opinions from Pgpool-II developers.
>>> 
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese:http://www.sraoss.co.jp
>>> 
>>> From: Tatsuo Ishii <ishii at sraoss.co.jp>
>>> Subject: [pgpool-general: 5885] Re: Pgpool-3.7.1 segmentation fault
>>> Date: Wed, 24 Jan 2018 12:00:40 +0900 (JST)
>>> Message-ID: <20180124.120040.507189908198617602.t-ishii at sraoss.co.jp>
>>> 
>>>>> Thanks for the quick reply! I realized that I ended up in this state,
>>>>> because I was using indexed health checks, and the primary's health checks
>>>>> had been disabled. I've gone back to a single health_check config, to avoid
>>>>> this issue.
>>>> 
>>>> Do you have an issue with "indexed health checks"? I thought it was
>>>> fixed in 3.7.1.
>>>> 
>>>>> I've also added an extra pre-start step, which removes the
>>>>> pgpool_status file.
>>>> 
>>>> That might be a solution, but I would like to add a guard to Pgpool-II
>>>> against the segfault. The segfault occurs when the conditions below are
>>>> all met:
>>>> 
>>>> 1) fail_over_on_backend_error is off.
>>>> 2) The ALWAYS_MASTER flag is set on the master (writer) node.
>>>> 
>>>> The attached patch implements the guard against the segfault. Developers
>>>> will start a discussion regarding the patch on pgpool-hackers.
>>>> 
>>>> Best regards,
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese:http://www.sraoss.co.jp
>>>> 
>>>>> On Tue, Jan 23, 2018 at 5:49 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>>>> 
>>>>>> Hi Philip,
>>>>>>
>>>>>> > Hello poolers,
>>>>>> >
>>>>>> > I've compiled pgpool-3.7.1 (./configure --with-openssl; libpq.5.9) for
>>>>>> > Ubuntu 14.04, to connect to RDS Aurora Postgres (9.6.3). When I try to
>>>>>> > authenticate, a pgpool child process segfaults. My config file follows
>>>>>> > the Aurora example instructions
>>>>>> > <http://www.pgpool.net/docs/latest/en/html/example-aurora.html>, I think?
>>>>>> > Have I misconfigured something to cause this segfault?
>>>>>> >
>>>>>> > Any guidance would be appreciated!
>>>>>> >
>>>>>> > Thanks,
>>>>>> > Philip
>>>>>> >
>>>>>> >
>>>>>> > $ psql -h localhost -U user staging
>>>>>> > Password for user user:
>>>>>> > psql: server closed the connection unexpectedly
>>>>>> >         This probably means the server terminated abnormally
>>>>>> >         before or while processing the request.
>>>>>>
>>>>>> It seems your status file (/var/log/pgpool/pool_status) is out of
>>>>>> date.
>>>>>>
>>>>>> > 2018-01-23 19:23:42: pid 19872: DEBUG:  creating new connection to
>>>>>> backend
>>>>>> > 2018-01-23 19:23:42: pid 19872: DETAIL:  skipping backend slot 0 because
>>>>>> > backend_status = 3
>>>>>>
>>>>>> So Pgpool-II fails to create a connection to backend 0, which causes
>>>>>> the segfault later on. Pgpool-II certainly needs a guard for this
>>>>>> situation, but for now you could work around it by shutting down
>>>>>> pgpool, removing /var/log/pgpool/pool_status, and restarting pgpool.
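>>>>>>
>>>>>> Roughly (exact paths and options depend on your installation, so adjust
>>>>>> as needed):
>>>>>>
>>>>>>     pgpool -m fast stop
>>>>>>     rm /var/log/pgpool/pool_status
>>>>>>     pgpool -n &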
>>>>>>
>>>>>> Once a proper pool_status file has been created, you don't need to
>>>>>> repeat the steps above, i.e. you can skip removing pool_status.
>>>>>>
>>>>>> Best regards,
>>>>>> --
>>>>>> Tatsuo Ishii
>>>>>> SRA OSS, Inc. Japan
>>>>>> English: http://www.sraoss.co.jp/index_en.php
>>>>>> Japanese:http://www.sraoss.co.jp
>>>>>>
>>> _______________________________________________
>>> pgpool-hackers mailing list
>>> pgpool-hackers at pgpool.net
>>> http://www.pgpool.net/mailman/listinfo/pgpool-hackers
> _______________________________________________
> pgpool-hackers mailing list
> pgpool-hackers at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-hackers

