[pgpool-general: 767] Re: pgpool dropping backends too much

Tatsuo Ishii ishii at postgresql.org
Thu Jul 19 13:31:22 JST 2012


Thanks. This helps me a lot.

It seems the cause of the error is this:
> 2012-07-17 00:11:03 NZST [15696]: [366-1] FATAL: terminating
> connection due to conflict with recovery

The SSL library is not smart enough to detect the connection shutdown
event. Thus it returns:
> pool_ssl: SSL_read: no SSL error reported

Probably in this case pgpool should inspect the error event and avoid
calling SSL_read(). Then pgpool should disconnect the client
connection, ideally passing along the error message returned from the
backend, rather than trigger failover. Please give us some time to
develop a patch.
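
To show what I mean by inspecting the error instead of failing over,
here is a rough OpenSSL sketch only, not the actual patch; the function
and type names are just for illustration. The point is to tell a
backend that terminated the connection (for example after a FATAL)
apart from a genuine I/O failure, and to treat only the latter as a
failover candidate.

#include <openssl/ssl.h>
#include <openssl/err.h>
#include <errno.h>

typedef enum
{
    BACKEND_READ_OK,    /* got data */
    BACKEND_CLOSED,     /* backend closed the connection (e.g. after a FATAL) */
    BACKEND_IO_ERROR    /* genuine I/O error: a candidate for failover */
} backend_read_status;

static backend_read_status
read_from_backend(SSL *ssl, char *buf, int len, int *nread)
{
    int n = SSL_read(ssl, buf, len);
    int saved_errno = errno;

    if (n > 0)
    {
        *nread = n;
        return BACKEND_READ_OK;
    }

    switch (SSL_get_error(ssl, n))
    {
        case SSL_ERROR_ZERO_RETURN:
            /* The peer sent close_notify: a clean shutdown, not a node failure. */
            return BACKEND_CLOSED;

        case SSL_ERROR_SYSCALL:
            /*
             * No SSL-level error is queued and errno is 0 ("Success"):
             * the socket hit EOF without close_notify, which matches
             * "SSL_read: no SSL error reported" / "read failed (Success)"
             * in the log above.  Treat it as a closed connection rather
             * than a reason to degenerate the backend.
             */
            if (ERR_get_error() == 0 && saved_errno == 0)
                return BACKEND_CLOSED;
            return BACKEND_IO_ERROR;

        default:
            return BACKEND_IO_ERROR;
    }
}

In the BACKEND_CLOSED case pgpool could forward the backend's last
ERROR/FATAL message to the client and close only that session, instead
of detaching the node.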
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Right, sorry, I should have investigated that! Here is a snippet from
> the logs for that moment in time showing the ERROR due to conflict
> with recovery that I mentioned; there's also a FATAL in there. We see
> a lot of these conflicts with recovery... however they don't always
> come with the FATAL, more below.
> 
> 2012-07-17 00:11:03 NZST [15692]: [257-1] ERROR: canceling statement
> due to conflict with recovery
> 2012-07-17 00:11:03 NZST [15692]: [258-1] DETAIL: User query might
> have needed to see row versions that must be removed.
> 2012-07-17 00:11:03 NZST [15692]: [259-1] STATEMENT: <SNIP>
> 2012-07-17 00:11:03 NZST [15696]: [366-1] FATAL: terminating
> connection due to conflict with recovery
> 2012-07-17 00:11:03 NZST [15696]: [367-1] DETAIL: User query might
> have needed to see row versions that must be removed.
> 2012-07-17 00:11:03 NZST [15696]: [368-1] HINT: In a moment you should
> be able to reconnect to the database and repeat your command.
> 2012-07-17 00:11:03 NZST [15682]: [99-1] ERROR: canceling statement
> due to conflict with recovery
> 2012-07-17 00:11:03 NZST [15682]: [100-1] DETAIL: User query might
> have needed to see row versions that must be removed.
> 2012-07-17 00:11:03 NZST [15682]: [101-1] STATEMENT:  <SNIP>
> 2012-07-17 00:11:04 NZST [16731]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:06 NZST [16734]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:07 NZST [16737]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:09 NZST [16789]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:16 NZST [16806]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:17 NZST [16822]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:17 NZST [16826]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:18 NZST [16829]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:21 NZST [16835]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:22 NZST [16837]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:22 NZST [16852]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 2012-07-17 00:11:23 NZST [16855]: [1-1] LOG: could not receive data
> from client: Connection reset by peer
> 
> I presume that all of the "connection reset by peer" messages are
> pgpool dropping all of its connections to this instance.
> 
> Here is another time there was a conflict with recovery. This time it
> did not coincide with the backend being degraded:
> 
> 2012-07-19 07:16:58 NZST [22343]: [18-1] ERROR: canceling statement
> due to conflict with recovery
> 2012-07-19 07:16:58 NZST [22343]: [19-1] DETAIL: User query might have
> needed to see row versions that must be removed.
> 2012-07-19 07:16:58 NZST [22343]: [20-1] STATEMENT:  <SNIP>
> 2012-07-19 07:16:58 NZST [22343]: [21-1] LOG: duration: 64.141 ms
> execute <unnamed>: <SNIP>
> 2012-07-19 07:16:58 NZST [22343]: [22-1] DETAIL:  parameters: <SNIP>
> 
> However, looking through the logs for instances of the FATAL, it also
> happens more often than we degenerate back-ends. I'm not sure if this
> helps!
> 
> k
> 
> On 19/07/2012 3:44 p.m., Tatsuo Ishii wrote:
>> When you see this;
>>> pool_process_query: discard E packet from backend 1
>> Is there any error reported in backend 1's log?
>> I'm asking because E corresponds to "ERROR" of PostgreSQL.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>>> We are running pgpool with 3 backend servers (9.0, streaming
>>> replication). We use non-SSL connections between client and pgpool,
>>> and SSL between pgpool and the servers (my previous email was in
>>> error, we do not appear to be using SSL between client and pgpool).
>>>
>>> I have set the primary server to not allow fail-over, and that
>>> works: it never fails over. However, our slaves fail over once or
>>> twice a day when the slave has not in fact failed. I have to
>>> reattach the node, and it then continues happily.
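>>>
>>> For reference, a minimal sketch of how this kind of setup looks in
>>> pgpool.conf (hostnames are illustrative; I am assuming backend_flag,
>>> the per-node parameter in recent pgpool-II versions that controls
>>> whether pgpool may detach a node, is what sets the no-failover
>>> primary):
>>>
>>> backend_hostname0 = 'db1'                 # primary
>>> backend_port0 = 5432
>>> backend_flag0 = 'DISALLOW_TO_FAILOVER'    # never detach this node
>>>
>>> backend_hostname1 = 'db2'                 # slave
>>> backend_port1 = 5432
>>> backend_flag1 = 'ALLOW_TO_FAILOVER'       # may be detached on error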
>>>
>>> The syslog always contains this note about E packets:
>>>
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 31871:
>>> pool_process_query: discard E packet from backend 1
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 ERROR: pid 31871:
>>> pool_ssl: SSL_read: no SSL error reported
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 ERROR: pid 31871:
>>> pool_read: read failed (Success)
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 31871:
>>> degenerate_backend_set: 1 fail over request from pid 31871
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> starting degeneration. shutdown host db2(5432)
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> Restart all children
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> find_primary_node_repeatedly: waiting for finding a primary node
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> find_primary_node: primary node id is 0
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> failover: set new primary node: 0
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> failover: set new master node: 0
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 923: worker
>>> process received restart request
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 30346:
>>> failover done. shutdown host db2(5432)
>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346:
>>> worker child 923 exits with status 256
>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 924: pcp
>>> child process received restart request
>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346: fork
>>> a new worker child pid 9434
>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346: PCP
>>> child 924 exits with status 256
>>> Jul 17 00:11:04 app2 pgpool: 2012-07-17 00:11:04 LOG: pid 30346: fork
>>> a new PCP child pid 9435
>>>
>>> Sometimes the preceding syslog entry is a LOG notice about a
>>> statement that failed, e.g.:
>>> Jul 17 00:11:03 app2 pgpool: 2012-07-17 00:11:03 LOG: pid 26664:
>>> pool_send_and_wait: Error or notice message from backend: : DB node
>>> id: 1 backend pid: 15682 statement: <SNIP> message: canceling
>>> statement due to conflict with recovery
>>>
>>> I don't want to mark our slaves as no-failover, but it seems that
>>> pgpool is either experiencing a fault and interpreting it as a node
>>> failure, or is a bit too sensitive? I'm happy to test patches!
>>>
>>> Best regards,
>>> Karl
>>> _______________________________________________
>>> pgpool-general mailing list
>>> pgpool-general at pgpool.net
>>> http://www.pgpool.net/mailman/listinfo/pgpool-general
> 
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general

