[Pgpool-general] pgpool 3.0.4 wedging up

Wed Aug 24 21:15:52 UTC 2011

I just raised health_check_timeout and authentication_timeout in pgpool.conf.

I'm really concerned that pgpool stopped working without logging a
failover message for the backend.

Aleksey

On Wed, Aug 24, 2011 at 11:45 AM, Aleksey Tsalolikhin
<atsaloli.tech at gmail.com> wrote:
> Situation: pgpool stops working and stops accepting requests, but logs
> no failover message (even though it is configured to log failover events).
>
> 09:11:38.325349 IP pgpool.44937 > database.5432: P
> 2447077764:2447077849(85) ack 240034773 win 501 <nop,nop,timestamp
> 2748616563 2731686531>
> 09:11:38.325819 IP database.5432 > pgpool.44937: P 1:58(57) ack 85 win
> 501 <nop,nop,timestamp 2731751942 2748616563>
> 09:11:38.325941 IP pgpool.44937 > database.5432: . ack 58 win 501
> <nop,nop,timestamp 2748616564 2731751942>
> 09:11:38.328699 IP pgpool.44937 > database.5432: P 85:418(333) ack 58
> win 501 <nop,nop,timestamp 2748616564 2731751942>
> 09:11:38.329098 IP database.5432 > pgpool.44937: P 58:78(20) ack 418
> win 501 <nop,nop,timestamp 2731751945 2748616564>
> 09:11:38.329666 IP pgpool.44937 > database.5432: P 418:430(12) ack 78
> win 501 <nop,nop,timestamp 2748616564 2731751945>
> 09:11:38.330222 IP database.5432 > pgpool.44937: P 78:96(18) ack 430
> win 501 <nop,nop,timestamp 2731751946 2748616564>
> 09:11:38.361913 IP pgpool.44937 > database.5432: P 430:489(59) ack 96
> win 501 <nop,nop,timestamp 2748616573 2731751946>
> 09:11:38.362110 IP database.5432 > pgpool.44937: P 96:122(26) ack 489
> win 501 <nop,nop,timestamp 2731751978 2748616573>
> 09:11:38.362692 IP pgpool.44937 > database.5432: P 489:642(153) ack
> 122 win 501 <nop,nop,timestamp 2748616573 2731751978>
> 09:11:38.363169 IP database.5432 > pgpool.44937: P 122:179(57) ack 642
> win 501 <nop,nop,timestamp 2731751979 2748616573>
> 09:11:38.365298 IP pgpool.44937 > database.5432: P 642:1049(407) ack
> 179 win 501 <nop,nop,timestamp 2748616573 2731751979>
> 09:11:38.365950 IP database.5432 > pgpool.44937: P 179:650(471) ack
> 1049 win 501 <nop,nop,timestamp 2731751982 2748616573>
> 09:11:38.405640 IP pgpool.44937 > database.5432: . ack 650 win 501
> <nop,nop,timestamp 2748616584 2731751982>
>
>
> # lsof -i:44937
> COMMAND   PID USER   FD   TYPE     DEVICE SIZE NODE NAME
> pgpool  20830 root    7u  IPv4 1025702434       TCP
> ddc-gate-dev02:44937->ddc-db-dev02:postgres (ESTABLISHED)
> # strace -f -p 20830
>
>
> tcpdump shows SQL traffic between the pgpool and database servers, and
> lsof and strace confirm it is a pgpool process exchanging SQL with the
> database.
>
>
> I can run psql on the pgpool server, targeting the database server,
> and it connects just fine, so the backend is up.
>
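A manual psql test shows the backend is reachable at one moment. A minimal sketch of a repeatable probe follows, assuming it is run from the pgpool host. The function name tcp_connect_time and the 5 second default timeout are hypothetical, not pgpool settings. It only times the TCP handshake and does not authenticate, so it is a weaker check than psql.

```python
import socket
import time

def tcp_connect_time(host, port, timeout=5.0):
    """Return the seconds taken to open a TCP connection, or None on failure.

    Hypothetical helper: it times only the TCP handshake, not PostgreSQL
    authentication, so a "fast" result does not prove the backend is healthy.
    """
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None
```

Calling tcp_connect_time("ddc-db-dev02", 5432) from cron or a loop, and logging the result with a timestamp, would give something to compare against the pgpool log the next time it wedges.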
>
> The pgpool log contains messages like this:
>
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 05:31:23 ERROR: pid 13731: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:48 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 07:49:48 ERROR: pid 27235: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
> (Connection reset by peer)
> 2011-08-24 09:09:20 LOG:   pid 19949: authentication is timeout
>
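The errors above arrive in bursts from a handful of pids. A rough sketch for tallying them per minute and pid is below. It assumes each log entry sits on one line (the mail client wrapped them here), and the regular expression is inferred only from the lines quoted in this message, not from any pgpool format specification. The name summarize is hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the quoted lines, e.g.
# "2011-08-24 09:09:20 LOG:   pid 19949: authentication is timeout"
LOG_LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+):\s+pid (?P<pid>\d+): (?P<msg>.*)$"
)

def summarize(lines):
    """Count log entries per (minute, pid, level, message)."""
    counts = Counter()
    for line in lines:
        m = LOG_LINE.match(line.strip())
        if m is None:
            continue  # skip anything that does not look like a log entry
        minute = m.group("ts")[:16]  # "YYYY-MM-DD HH:MM"
        key = (minute, int(m.group("pid")), m.group("level"), m.group("msg"))
        counts[key] += 1
    return counts

if __name__ == "__main__":
    # Sample entries copied from the log excerpt above, unwrapped by hand.
    sample = [
        "2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed (Connection reset by peer)",
        "2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed (Connection reset by peer)",
        "2011-08-24 09:09:20 LOG:   pid 19949: authentication is timeout",
    ]
    for key, n in sorted(summarize(sample).items()):
        print(n, *key)
```

Grouping by pid makes it easier to see whether one child process is failing repeatedly or many are failing at once.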
> I stopped pgpool (/etc/init.d/pgpool stop).
>
> I tried to start it but got this:
>
> # /etc/init.d/pgpool start
> Starting pgpool service: pid file found but it seems bogus. Trying to
> start pgpool anyway...
> 2011-08-24 09:20:16 LOG:   pid 16492: read_status_file: 0 th backend
> is set to down status
> 2011-08-24 09:20:16 ERROR: pid 16492: bind(/tmp/.s.PGSQL.9999) failed.
> reason: Address already in use
> 2011-08-24 09:20:16 ERROR: pid 16492: unlink() failed: No such file or directory
>                                                           [FAILED]
> #
>
> I tried again and got this:
>
> Starting pgpool service: 2011-08-24 09:21:29 LOG:   pid 16658:
> pgpool-II successfully started. version 3.0.4 (umiyameboshi)
> 2011-08-24 09:21:29 ERROR: pid 16658: bind(/tmp/.s.PGSQL.9898) failed.
> reason: Address already in use
>
> No pgpool process running...
>
> # lsof -i:9898
> # lsof -i:9999
> #
>
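Note that lsof -i lists only Internet (TCP/UDP) sockets, so empty output there says nothing about the Unix-domain sockets /tmp/.s.PGSQL.9999 and /tmp/.s.PGSQL.9898 that the bind() errors name (lsof -U lists those). One possible way to tell a leftover socket file from one that a process is still listening on is sketched below. The function name socket_state and its return labels are hypothetical, and the "live" probe opens and immediately closes a real connection, which the listening process may log.

```python
import os
import socket
import stat

def socket_state(path):
    """Classify a Unix-domain socket path.

    Returns 'missing', 'not-a-socket', 'stale' (file exists but nothing
    accepts connections), or 'live' (a process accepted the connection).
    """
    try:
        mode = os.stat(path).st_mode
    except FileNotFoundError:
        return "missing"
    if not stat.S_ISSOCK(mode):
        return "not-a-socket"
    probe = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        probe.connect(path)
    except ConnectionRefusedError:
        return "stale"  # socket file left behind, no listener
    finally:
        probe.close()
    return "live"

if __name__ == "__main__":
    # Paths taken from the bind() errors quoted above.
    for p in ("/tmp/.s.PGSQL.9999", "/tmp/.s.PGSQL.9898"):
        print(p, socket_state(p))
```

A "live" result right after a stop would suggest an old pgpool process had not finished exiting yet, while "stale" would point to a socket file that was never cleaned up.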
>
> I tried again a third time and it started.
>
>
> The first sign of trouble was at 2:19 AM, when some queries through
> pgpool failed.  I came in at 9 AM and restarted pgpool.
>
> Then a couple of hours later, queries through pgpool started failing
> again, and I restarted it again, and it worked fine afterwards.
>
> I do monitor the database server.  The load is higher than usual now.
> Could a query taking a longer time to return cause this kind of
> instability in pgpool?  If so, is there any way to compensate for it?
>
> Please let me know what other information I could gather if this
> happens again -- I did grab a core dump of the pgpool parent at 9 AM,
> and an lsof on that process.
>
> We are running pgpool on CentOS 5.5, 64-bit.
>
> Best,
> Aleksey
>