[Pgpool-general] INSERT slow on one system not on another
Derek Jones
scunacc at yahoo.com
Wed Mar 4 15:15:32 UTC 2009
Hi folks,
I've been using pgpool II with Postgres 8.2.5 successfully for quite a
while in load-balanced mode on a cluster, but I've just noticed a problem.
I have two identical DB setups on two different clusters.
4 nodes + one node for pgpool. Clients connect from other nodes.
On one cluster, the INSERTs happen in reasonable time: several hundred
thousand inserts complete over a period of a few minutes (with other
processing happening too).
On the other, the same workload might take, say, 12 hours...
It seems to get stuck waiting in a LOCK TABLE with an INSERT maxed out
on one CPU core. I have tried turning locking off, because these tables
have no serial data type, and it doesn't make much difference: the
process just waits idle instead. It's the INSERT that's taking the
time, not a locking issue per se.
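To double-check the not-a-lock theory, pg_locks can show whether any session is actually waiting on an ungranted lock. A minimal sketch, assuming psql access to the backend database ("mydb" is a placeholder name):

```shell
# Query to spot sessions blocked on a lock (the pg_locks columns used
# here exist in 8.2). No ungranted rows while the INSERT backend pegs
# a CPU core means the time is going somewhere other than lock waits.
QUERY='SELECT pid, relation::regclass, mode FROM pg_locks WHERE NOT granted;'
# Print the command to run on each backend node:
echo "psql -d mydb -c \"$QUERY\""
```

Rows coming back would mean real lock contention; an empty result while the backend burns CPU points at I/O or per-row overhead instead.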
There is a difference in the way the two systems (Linux) are
configured; that is the only difference between them, but even
accounting for it I can't explain the long INSERT times.
I am the cluster admin: I have a significant UNIX & Linux background as
a sysadmin, programmer, system configurator, etc., and set up these
clusters myself, so I have a reasonable idea of what's going on under
the hood in terms of locking semantics and actions, disk I/O, caching,
etc., and still can't reasonably explain this at the moment.
OK. The difference:
One cluster (the good one) has each node booting off of a local disk,
and the postgres partition (on the postgres backend nodes) mounted from
a local disk partition. The Linux kernel rev is: 2.6.22.9
df
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       226G   45G  170G  21% /
....
df /postgres/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       226G   45G  170G  21% /
The other cluster (the bad one), has each node NFS root booting via PXE,
so, the root partition is NFS.
df
Filesystem                       Size  Used Avail Use% Mounted on
master2:/tftpboot/192.168.5.209   26G   23G  2.2G  92% /
master2:/usr                      26G   23G  2.2G  92% /usr
master2:/home                     26G   23G  2.2G  92% /home
.....
/dev/sda3                         51G   41G  7.6G  85% /postgres
and is running kernel rev: 2.6.12-12 with some customizations for NFS
booting, high BW net connections, etc.
The disk on the good cluster is SATA. The disk on the bad cluster is SCSI.
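To separate raw disk behaviour from anything Postgres- or pgpool-related, a crude synchronous-write probe with dd can be compared across the two clusters. A sketch, assuming GNU dd's oflag=dsync is available (TESTDIR defaults to /tmp here; point it at the /postgres partition on each DB node):

```shell
# Write 100 x 8 kB blocks, each synced to disk (oflag=dsync), roughly
# the WAL flush pattern; dd reports elapsed time and throughput on
# stderr. Run on both clusters and compare the numbers.
TESTDIR=${TESTDIR:-/tmp}
dd if=/dev/zero of="$TESTDIR/dd_probe" bs=8k count=100 oflag=dsync
rm -f "$TESTDIR/dd_probe"
```

If the bad cluster's local SCSI partition turns in comparable numbers to the good cluster's SATA disk, the data disk itself is off the hook.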
Does anyone have any idea what might cause the one to be soooo slow?
I am wondering whether, because /postgres in the second example is
mounted on top of a mount point that is itself NFS-mounted, I am
actually seeing some weird interaction with NFS that slows things
down. I don't see any sign of that in terms of disk or net activity,
though.
On the postgres nodes of the bad cluster, of course, /lib, /usr/lib,
etc. are all NFS-mounted too, as you can see above, but unless
something really kooky is happening with shared libs being endlessly
pulled over NFS (which I don't see), or there's continual NFS access
some other way, I don't see the connection.
An strace shows lots of __llseek SEEK_END activity while the backend
is "stuck".
I've tried changing things like mount options (turning off atime
updates, mounting sync vs. async) and turning off WAL fsync in
postgres, etc. Nothing changes.
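One cheaper experiment than rebuilding the cluster to boot locally: lengthen the NFS attribute cache and drop atime updates on the NFS library mounts, which cuts GETATTR round trips to master2. Illustrative only; noatime and actimeo are standard Linux NFS mount options, but the 60-second value is a guess, not something tested on this setup:

```shell
# Illustrative: remount /usr over NFS with a 60 s attribute cache and
# no atime updates (put the same options in /etc/fstab to survive a
# reboot). Needs root; older kernels may refuse an NFS option change
# on remount, in which case unmount/mount or reboot with fstab edited.
mount -o remount,noatime,actimeo=60 /usr
```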
Given the two different system configs, I would understand if the NFS
setup on the bad one (NFS booting / root FS) were the issue, but I
wonder if there's a pgpool / postgres config tweak I could make that
would fix this, instead of having to completely reconfigure the
cluster to boot locally (which, it being a production cluster, would
be pretty prohibitive).
Any help much appreciated. (I have trawled lists and googled and come up
with nothing that seems focused on this issue).
Kind regards
Derek.