[pgpool-general: 5707] Re: log from regular queries that got stuck
AviW at gilat.com
Tue Aug 29 17:38:19 JST 2017
Hi Tatsuo Ishii,
From the tests we ran it seems like the pool_unread patch fixed our problem. We will run more tests to be sure, but so far it looks very good.
Thank you very much for all your assistance.
From: Tatsuo Ishii [mailto:ishii at sraoss.co.jp]
Sent: Tuesday, August 15, 2017 4:16 AM
To: Avi Weinberg <AviW at gilat.com>
Cc: pgpool-general at pgpool.net
Subject: Re: [pgpool-general: 5696] Re: log from regular queries that got stuck
> I have been looking into the log. I think there is nothing unexpected
> in the log except the last line:
> <= BE NoticeResponse(S DEBUG C XX000 M reading backend data packet
> kind D master node id: 0 F pooltext: 0x8487f8 F pool_process_quea )
> The latter half of the line seems screwed up. I am not sure whether the
> message was actually screwed up or JDBC just screwed up the received
> message. Can you please share the pgpool log so that I can check whether
> the log was actually screwed up?
> If so, that could be a sign of memory corruption in Pgpool-II (or
> maybe not).
I think I have found a possible cause of the problem.
When the frontend sends a Sync message, Pgpool-II forwards it to the backend and then reads messages from the backend until it sees the "ready for query" message. After this, the messages Pgpool-II has read are returned to the internal buffer using pool_unread(). There was a critical bug in pool_unread(): it did not record the new buffer size when the buffer size was changed by realloc(). This could cause memory corruption on subsequent use of the buffer. The screwed up message above could be a victim of this. The hang you are seeing could be as well, because Pgpool-II tries to read from the corrupted buffer.
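To illustrate the class of bug described above, here is a minimal sketch of an "unread" operation that pushes bytes back in front of pending data. It is not the actual Pgpool-II code: the ReadBuf struct, its fields, and buf_unread() are hypothetical names made up for this example, and the growth policy is an assumption. The point is only the line marked below, which records the new capacity after realloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical read buffer; NOT Pgpool-II's real connection struct. */
typedef struct {
    char  *data;  /* heap storage */
    size_t size;  /* allocated capacity in bytes */
    size_t pos;   /* offset of the first unread byte */
    size_t len;   /* number of unread bytes */
} ReadBuf;

/*
 * Push n bytes from src back in front of the pending unread data.
 * Returns 0 on success, -1 if memory could not be allocated.
 */
int buf_unread(ReadBuf *b, const void *src, size_t n)
{
    size_t need = b->len + n;

    if (need > b->size) {
        size_t new_size = need * 2;  /* growth policy is an assumption */
        char *p = realloc(b->data, new_size);
        if (p == NULL)
            return -1;
        b->data = p;
        /* Forgetting this assignment leaves b->size stale, so later
         * capacity checks are made against the wrong value. */
        b->size = new_size;
    }

    /* Shift the pending bytes right to make room at the front. */
    if (b->len > 0)
        memmove(b->data + n, b->data + b->pos, b->len);
    if (n > 0)
        memcpy(b->data, src, n);

    b->pos = 0;
    b->len = need;
    assert(b->len <= b->size);  /* invariant that a stale size would break */
    return 0;
}
```

If the size field is not updated after realloc(), any later code that trusts it is working with a capacity that no longer matches the allocation, which is how the corruption described above can arise.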
The chance of being bitten by the bug depends on how many messages are read before the "ready for query" message is found. The more messages, the greater the chance, because realloc() is more likely to be called. This depends purely on the timing of messages arriving from the backend. My guess is that with a slow or high-latency network, more pending messages accumulate while Pgpool-II is doing its job.
BTW, the bug has been there for 10 years. Probably the reason we did not find it until today is that Pgpool-II has only recently started to use pool_unread() extensively.
Could you please try the attached one-line patch?
SRA OSS, Inc. Japan