[pgpool-general: 769] Re: pgpool dropping backends too much

Tatsuo Ishii ishii at postgresql.org
Thu Jul 19 14:10:52 JST 2012


After thinking about it more, I decided not to try to read error
messages from the backend, because there is no guarantee that the whole
message (in this case, the error packet) can be read from a broken
connection. Your case actually seems to confirm this: pgpool was able
to read the first byte, which indicates that the packet is an error
message ('E'). It then tried to read the rest of the message, failed,
and so triggered failover. However, without the full contents of the
error message, it is not possible to decide what to do. Probably the
best we can do is:

- If SSL_read() returns an unknown error code, assume the connection is
  broken (your log shows this is exactly your case).  Do not trigger
  failover, but terminate the session.

This is essentially the same as what libpq does. Attached is a patch
that tries to do this.
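
For reference, the idea looks roughly like the sketch below. Only the
OpenSSL calls (SSL_read(), SSL_get_error()) and their error codes are
real; the function name and surrounding structure are illustrative, not
the actual code in the attached pool_ssl.patch:

    /* Sketch only -- not the attached pool_ssl.patch. */
    #include <openssl/ssl.h>

    static int ssl_read_or_give_up(SSL *ssl, void *buf, int len)
    {
        for (;;)
        {
            int n = SSL_read(ssl, buf, len);

            if (n > 0)
                return n;                /* got (part of) a packet */

            switch (SSL_get_error(ssl, n))
            {
                case SSL_ERROR_WANT_READ:
                case SSL_ERROR_WANT_WRITE:
                    continue;            /* transient; retry the read */

                case SSL_ERROR_ZERO_RETURN:
                    return 0;            /* clean shutdown by the peer */

                default:
                    /*
                     * Unknown or unreported error: assume the connection
                     * is broken.  The caller should terminate this
                     * session only, not degenerate the backend.
                     */
                    return -1;
            }
        }
    }

The relevant part is the default branch: an unrecognized SSL error ends
the current session instead of requesting failover.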
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Absolutely, thank you very much. We will be ready to test the patch
> whenever you are. We will work around this in the meantime by marking
> the back-ends as not subject to failover!
> 
> On 19/07/2012 4:31 p.m., Tatsuo Ishii wrote:
>> Thanks. This helps me a lot.
>>
>> It seems the cause of error is this:
>>> 2012-07-17 00:11:03 NZST [15696]: [366-1] FATAL: terminating
>>> connection due to conflict with recovery
>> SSL library is not smart enough to detect the connection shutdown
>> event. Thus it returns:
>>> pool_ssl: SSL_read: no SSL error reported
>> Probably in this case pgpool should inspect the error event and avoid
>> calling SSL_read(). Then pgpool should disconnect the client
>> connection, hopefully with the error message returned from the
>> backend, rather than triggering failover. Please give us some time to
>> develop a patch.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>>> Right, sorry, I should have investigated that! Here is a snippet from
>>> the logs for that moment in time showing the ERROR due to conflict
>>> with recovery that I mentioned; there is also a FATAL in there. We see
>>> a lot of the conflicts with recovery... however, they don't always
>>> come with the FATAL; more below.
>>>
>>> 2012-07-17 00:11:03 NZST [15692]: [257-1] ERROR: canceling statement
>>> due to conflict with recovery
>>> 2012-07-17 00:11:03 NZST [15692]: [258-1] DETAIL: User query might
>>> have needed to see row versions that must be removed.
>>> 2012-07-17 00:11:03 NZST [15692]: [259-1] STATEMENT: <SNIP>
>>> 2012-07-17 00:11:03 NZST [15696]: [366-1] FATAL: terminating
>>> connection due to conflict with recovery
>>> 2012-07-17 00:11:03 NZST [15696]: [367-1] DETAIL: User query might
>>> have needed to see row versions that must be removed.
>>> 2012-07-17 00:11:03 NZST [15696]: [368-1] HINT: In a moment you should
>>> be able to reconnect to the database and repeat your command.
>>> 2012-07-17 00:11:03 NZST [15682]: [99-1] ERROR: canceling statement
>>> due to conflict with recovery
>>> 2012-07-17 00:11:03 NZST [15682]: [100-1] DETAIL: User query might
>>> have needed to see row versions that must be removed.
>>> 2012-07-17 00:11:03 NZST [15682]: [101-1] STATEMENT:  <SNIP>
>>> 2012-07-17 00:11:04 NZST [16731]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:06 NZST [16734]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:07 NZST [16737]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:09 NZST [16789]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:16 NZST [16806]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:17 NZST [16822]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:17 NZST [16826]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:18 NZST [16829]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:21 NZST [16835]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:22 NZST [16837]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:22 NZST [16852]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>> 2012-07-17 00:11:23 NZST [16855]: [1-1] LOG: could not receive data
>>> from client: Connection reset by peer
>>>
>>> I presume that all of the "connection reset by peer" messages are
>>> pgpool dropping all of its connections to this instance.
>>>
>>> Here is another time there was a conflict with recovery. This time it
>>> did not coincide with the backend being degraded:
>>>
>>> 2012-07-19 07:16:58 NZST [22343]: [18-1] ERROR: canceling statement
>>> due to conflict with recovery
>>> 2012-07-19 07:16:58 NZST [22343]: [19-1] DETAIL: User query might have
>>> needed to see row versions that must be removed.
>>> 2012-07-19 07:16:58 NZST [22343]: [20-1] STATEMENT:  <SNIP>
>>> 2012-07-19 07:16:58 NZST [22343]: [21-1] LOG: duration: 64.141 ms
>>> execute <unnamed>: <SNIP>
>>> 2012-07-19 07:16:58 NZST [22343]: [22-1] DETAIL:  parameters: <SNIP>
>>>
>>> However, looking through the logs for instances of the FATAL, it also
>>> happens more often than we degenerate back-ends. I'm not sure if this
>>> helps!
>>>
>>> k
>>>
>>> On 19/07/2012 3:44 p.m., Tatsuo Ishii wrote:
>>>> When you see this;
>>>>> pool_process_query: discard E packet from backend 1
>>>> Is there any error reported in backend 1's log?
>>>> I'm asking because E corresponds to "ERROR" of PostgreSQL.
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>>
>>>>> We are running pgpool with 3 backend servers (9.0, streaming
>>>>> replication). We are running non-SSL client-to-pgpool connections
>>>>> and SSL pgpool-to-server connections (my previous email was in
>>>>> error; we do not appear to be using SSL client-to-pgpool).
>>>>>
>>>>> I have set the primary server to not support failover, and that
>>>>> works: it doesn't fail over. However, our slaves fail over once or
>>>>> twice a day, when the slave has not in fact failed. I have to
>>>>> reattach the node, and it then continues happily.
>>>>>
>>>>> The syslog always contains this note about E packets:
>>>>>
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 31871:
>>>>> pool_process_query: discard E packet from backend 1
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 ERROR: pid 31871:
>>>>> pool_ssl: SSL_read: no SSL error reported
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 ERROR: pid 31871:
>>>>> pool_read: read failed (Success)
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 31871:
>>>>> degenerate_backend_set: 1 fail over request from pid 31871
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> starting degeneration. shutdown host db2(5432)
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> Restart all children
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> find_primary_node_repeatedly: waiting for finding a primary node
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> find_primary_node: primary node id is 0
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> failover: set new primary node: 0
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> failover: set new master node: 0
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 923: worker
>>>>> process received restart request
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>>>> failover done. shutdown host db2(5432)
>>>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346:
>>>>> worker child 923 exits with status 256
>>>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 924: pcp
>>>>> child process received restart request
>>>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346: fork
>>>>> a new worker child pid 9434
>>>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346: PCP
>>>>> child 924 exits with status 256
>>>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346: fork
>>>>> a new PCP child pid 9435
>>>>>
>>>>> Sometimes the preceding syslog entry is a LOG notice about a
>>>>> statement that failed, e.g.:
>>>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 26664:
>>>>> pool_send_and_wait: Error or notice message from backend: : DB node
>>>>> id: 1 backend pid: 15682 statement: <SNIP> message: canceling
>>>>> statement due to conflict with recovery
>>>>>
>>>>> I don't want to mark our slaves as no-failover, but it seems that
>>>>> pgpool is either experiencing a fault and interpreting it as a
>>>>> backend failure, or is a bit too sensitive? I'm happy to test
>>>>> patches!
>>>>>
>>>>> Best regards,
>>>>> Karl
>>>>> _______________________________________________
>>>>> pgpool-general mailing list
>>>>> pgpool-general at pgpool.net
>>>>> http://www.pgpool.net/mailman/listinfo/pgpool-general
>>> _______________________________________________
>>> pgpool-general mailing list
>>> pgpool-general at pgpool.net
>>> http://www.pgpool.net/mailman/listinfo/pgpool-general
> 
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pool_ssl.patch
Type: text/x-patch
Size: 534 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20120719/f751be0d/attachment.bin>
