[pgpool-general: 1455] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

ning chan ninchan8328 at gmail.com
Tue Mar 5 07:08:28 JST 2013


Hi Tatsuo,
I shut down one watchdog instead of both, and I can't reproduce the problem.

Here are the details:
server0: pgpool watchdog is disabled.
server1: pgpool watchdog is enabled, and it hosts the primary database for
streaming replication. Failover & failback work just fine, except that the
virtual IP will not be migrated to the other pgpool server because the
watchdog on server0 is not running.
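
For reference, the relevant watchdog settings in pgpool.conf on server1 look
roughly like this (the delegate IP and port values shown here are only
illustrative, not my exact configuration; server0 simply has use_watchdog = off):

use_watchdog = on
delegate_IP = '172.16.6.200'          # virtual IP held by the active pgpool (illustrative)
wd_hostname = 'server1'
wd_port = 9000
other_pgpool_hostname0 = 'server0'    # the peer pgpool
other_pgpool_port0 = 9999
other_wd_port0 = 9000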

FYI: as I reported in the other email thread, running the watchdog on both
servers will not let me fail over & fail back more than once; I am still
looking for the root cause.

1) both nodes show pool_nodes state 2
2) shut down the database on server1 to cause the DB to fail over to server0;
server0 is now the primary
3) execute pcp_recovery_node on server0 (command sketched below) to bring the
failed server1 database back online and have it connect to server0 as a
standby; however, pool_nodes on server1 shows the following:
[root at server1 data]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 3      | 0.500000  | standby
(2 rows)

As shown, server1's pgpool considers its own node to be in state 3.
Replication, however, is working fine.
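
For reference, the recovery command in step 3 was along these lines, mirroring
the pcp_attach_node call in step 5 (the timeout value is illustrative and the
password is omitted):

/usr/local/bin/pcp_recovery_node 10 server0 9898 pgpool [passwd] 1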

4) I have to execute pcp_attach_node on server1 to bring its pool_nodes
state to 2; however, server0's pool_nodes info about server1 then becomes 3.
See below for both servers' output:
[root at server1 data]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby

[root at server0 ~]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 3      | 0.500000  | standby
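
The attach command run on server1 in step 4 was along these lines (shown here
pointing at server1's own pcp port; password omitted):

/usr/local/bin/pcp_attach_node 10 server1 9898 pgpool [passwd] 1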


5) executing the following command on server1 brings the server1 status
to 2 on both nodes:
/usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1

[root at server1 data]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby

[root at server0 ~]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby

Please advise on the next step.

Thanks~
Ning


On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
>
> This error message seems pretty strange. ":" should be something like
> "/tmp/.s.PGSQL.9898". It is also weird because of "failed. reason:
> Success". To isolate the problem, can you please disable the watchdog
> and try again?
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
>
> > Hi All,
> > After upgrading to pgPool-II 3.2.3, I tested my failover/failback setup
> > and started/stopped pgpool multiple times, and I see one of the pgpool
> > instances go into an unrecoverable state.
> >
> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown request
> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip] doesn't have sticky bit
> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory
> >
> >
> > netstat shows the following:
> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > tcp        0      0 0.0.0.0:9898                0.0.0.0:*                   LISTEN
> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*                   LISTEN
> > tcp        0      0 172.16.6.154:46650          172.16.6.153:9999           TIME_WAIT
> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51868          CLOSE_WAIT
> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51906          CLOSE_WAIT
> > tcp        0      0 172.16.6.154:9999           172.16.6.154:50624          TIME_WAIT
> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51946          CLOSE_WAIT
> > unix  2      [ ACC ]     STREAM     LISTENING     18698   /tmp/.s.PGSQL.9898
> > unix  2      [ ACC ]     STREAM     LISTENING     18685   /tmp/.s.PGSQL.9999
> >
> > Is this a known issue?
> >
> > I will have to reboot the server in order to bring pgpool back online.
> >
> > My cluster has two servers (server0 & server1), each of them running
> > pgpool and PostgreSQL with a streaming replication setup.
> >
> > Thanks~
> > Ning
>