[pgpool-general: 1253] Strange issue with ports remaining open after switchover

Wed Dec 12 17:58:41 JST 2012

Hi,

I'm having a really strange issue with pgpool. Let me first explain the setup.

We have two physical servers that both run Postgres in a master/slave
setup using streaming replication, which is controlled by pgpool.
Pgpool itself also runs on both machines, using the watchdog
functionality.

The servers have a dedicated interconnect used for the replication
between Postgres and by pgpool to connect to the Postgres backends.
The servers also have a frontend connection used by the clients to
connect to pgpool.

In the steady state, one server is the Postgres master and the other
is the pgpool master.

Now the issue: we have been testing the redundancy of this setup, and
in one scenario we see this strange behavior.

The scenario is this. We kill the master Postgres server. The master
pgpool server, which up to that moment is a Postgres slave, then
becomes the Postgres master. This all works fine. The problem appears
when we want to do the failback.

So we restore the server and then let pgpool do a node recovery so
that the failed server becomes a Postgres slave. Then we also start
pgpool on the recovered server, which becomes a pgpool standby.
Finally, to return to our steady state, we want to fail over the
pgpool master to the recovered node.

We do this by simply stopping the current pgpool master so that the
standby takes over. Up to here everything works fine.

It is only when we want to start pgpool again on the node where we
stopped it, so that it can become a standby, that we see the issue.

Pgpool refuses to start up again because its ports (9898 and 9999)
are in use. We double-checked that there are no pgpool processes
left, and that is the case. Using netstat we can see that the ports
are now held by an ssh process that is part of the recovery scripts,
the one that starts Postgres on the other node.
Postgres is running fine on the other node, so I kill that ssh
process. But pgpool still fails to start: netstat now shows that the
ports are held by the local postgres process.

And that is the behavior that I don't understand.
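
For anyone who wants to reproduce the port check before starting
pgpool, here is a minimal sketch, assuming Python 3 is available on
the node. The helper name port_is_free and the default bind address
0.0.0.0 are hypothetical choices for illustration and are not part of
pgpool. It only reports whether a listener could be bound to the
port; netstat is still needed to see which process is holding it.

```python
import socket


def port_is_free(port, host="0.0.0.0"):
    """Return True if a TCP listener could be bound to (host, port).

    host defaults to all interfaces; this is an assumption, adjust it
    to whatever address pgpool is configured to listen on.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # SO_REUSEADDR keeps lingering TIME_WAIT connections from
        # counting as "in use"; a socket that is still listening on
        # the port makes bind() fail on Linux regardless.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Example: check the two pgpool ports mentioned above.
# for port in (9898, 9999):
#     print(port, "free" if port_is_free(port) else "in use")
```

If this reports a port as in use while no pgpool process exists, some
other process is still holding the listening socket, which matches
what netstat shows in the situation described above.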

The only way to recover from this is to stop and restart the whole
stack (both Postgres and pgpool on both nodes), which is of course
not what you want in a redundant clustered setup.

Does anyone have any idea what exactly is going on here?

Thank you,
Tim

--
Tim Verhoeven - tim.verhoeven.be at gmail.com - 0479 / 88 11 83

Hoping the problem magically goes away by ignoring it is the
"microsoft approach to programming" and should never be allowed.
(Linus Torvalds)