[Pgpool-general] pgpool 3.0.4 wedging up

Wed Aug 24 18:45:37 UTC 2011

Situation:  pgpool stops working.  stops accepting requests.  but no
failover message!  (even though configured to log failover events)

09:11:38.325349 IP pgpool.44937 > database.5432: P
2447077764:2447077849(85) ack 240034773 win 501 <nop,nop,timestamp
2748616563 2731686531>
09:11:38.325819 IP database.5432 > pgpool.44937: P 1:58(57) ack 85 win
501 <nop,nop,timestamp 2731751942 2748616563>
09:11:38.325941 IP pgpool.44937 > database.5432: . ack 58 win 501
<nop,nop,timestamp 2748616564 2731751942>
09:11:38.328699 IP pgpool.44937 > database.5432: P 85:418(333) ack 58
win 501 <nop,nop,timestamp 2748616564 2731751942>
09:11:38.329098 IP database.5432 > pgpool.44937: P 58:78(20) ack 418
win 501 <nop,nop,timestamp 2731751945 2748616564>
09:11:38.329666 IP pgpool.44937 > database.5432: P 418:430(12) ack 78
win 501 <nop,nop,timestamp 2748616564 2731751945>
09:11:38.330222 IP database.5432 > pgpool.44937: P 78:96(18) ack 430
win 501 <nop,nop,timestamp 2731751946 2748616564>
09:11:38.361913 IP pgpool.44937 > database.5432: P 430:489(59) ack 96
win 501 <nop,nop,timestamp 2748616573 2731751946>
09:11:38.362110 IP database.5432 > pgpool.44937: P 96:122(26) ack 489
win 501 <nop,nop,timestamp 2731751978 2748616573>
09:11:38.362692 IP pgpool.44937 > database.5432: P 489:642(153) ack
122 win 501 <nop,nop,timestamp 2748616573 2731751978>
09:11:38.363169 IP database.5432 > pgpool.44937: P 122:179(57) ack 642
win 501 <nop,nop,timestamp 2731751979 2748616573>
09:11:38.365298 IP pgpool.44937 > database.5432: P 642:1049(407) ack
179 win 501 <nop,nop,timestamp 2748616573 2731751979>
09:11:38.365950 IP database.5432 > pgpool.44937: P 179:650(471) ack
1049 win 501 <nop,nop,timestamp 2731751982 2748616573>
09:11:38.405640 IP pgpool.44937 > database.5432: . ack 650 win 501
<nop,nop,timestamp 2748616584 2731751982>

# lsof -i:44937
COMMAND   PID USER   FD   TYPE     DEVICE SIZE NODE NAME
pgpool  20830 root    7u  IPv4 1025702434       TCP
ddc-gate-dev02:44937->ddc-db-dev02:postgres (ESTABLISHED)
# strace -f -p 20830

tcpdump shows SQL traffic between pgpool and database servers, and
lsof and strace confirm it is a pgpool process exchanging SQL with the
database

I can run psql on the pgpool server, targeting the database server,
and it connects just fine, so the backend is up.

pgpool log contains messages like this:

2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:07:48 ERROR: pid 14589: pool_read: read failed
(Connection reset by peer)
2011-08-24 05:31:23 ERROR: pid 13731: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:47 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:48 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 07:49:48 ERROR: pid 27235: pool_read: read failed
(Connection reset by peer)
2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
(Connection reset by peer)
2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
(Connection reset by peer)
2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
(Connection reset by peer)
2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
(Connection reset by peer)
2011-08-24 08:56:19 ERROR: pid 24482: pool_read: read failed
(Connection reset by peer)
2011-08-24 09:09:20 LOG:   pid 19949: authentication is timeout

I stopped pgpool (/etc/init.d/pgpool stop).

I tried to start it but got this:

# /etc/init.d/pgpool start
Starting pgpool service: pid file found but it seems bogus. Trying to
start pgpool anyway...
2011-08-24 09:20:16 LOG:   pid 16492: read_status_file: 0 th backend
is set to down status
2011-08-24 09:20:16 ERROR: pid 16492: bind(/tmp/.s.PGSQL.9999) failed.
reason: Address already in use
2011-08-24 09:20:16 ERROR: pid 16492: unlink() failed: No such file or directory
                                                           [FAILED]
#

I tried again and got this:

Starting pgpool service: 2011-08-24 09:21:29 LOG:   pid 16658:
pgpool-II successfully started. version 3.0.4 (umiyameboshi)
2011-08-24 09:21:29 ERROR: pid 16658: bind(/tmp/.s.PGSQL.9898) failed.
reason: Address already in use

No pgpool process running...

# lsof -i:9898
# lsof -i:9999
#

I tried again a third time and it started.

The first sign of trouble was at 2:19 AM, when some queries through
pgpool failed.  I came in at 9 AM and restarted pgpool.

Then a couple of hours later, queries through pgpool started failing
again, and I restarted it again, and it worked fine afterwards.

I do monitor the database server.  The load is higher than usual now.
Could a query taking a longer time to return cause this kind of
instability in pgpool?  If so, is there any way to compensate for it?

Please let me know what other information I could gather if this
happens again -- I did grab a core dump of the pgpool parent at 9 AM,
and an lsof on that process.

We are running pgpool on CentOS 5.5, 64 bit.

Best,
Aleksey