[pgpool-general: 1470] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

ning chan ninchan8328 at gmail.com
Fri Mar 8 15:28:20 JST 2013


Hi Yugo,
Thanks for looking at the issue. Here are the exact steps I followed to get
into the problem state.
1) Make sure replication is set up and pgpool on both servers shows the
backend status value as 2.
2) Shut down postgresql on the primary; this promotes the standby (server1)
to become the new primary.
3) Execute pcp_recovery on server1, which recovers the failed node
(server0) and connects it to the new primary (server1). Check the backend
status value.
4) Shut down postgresql on server1 (the new primary); this should promote
server0 to become primary again.
5) Execute pcp_recovery on server0, which recovers the failed node
(server1) and connects it to the new primary (server0 again). Check the
backend status value.
6) Go to server1, shut down pgpool, and start it up again. At this point
pgpool will not be able to start anymore; a server reboot is required to
bring pgpool back online.
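
The "check backend status value" steps above can be scripted. What follows is
only a minimal sketch, assuming the output of `psql -c "show pool_nodes" -p 9999`
has already been captured as text. The function names parse_pool_nodes and
nodes_not_up are made up for illustration and are not part of pgpool; status 2
is treated as the expected value, as in the steps above.

```python
# Sketch only: parse the text printed by `psql -c "show pool_nodes" -p 9999`
# and list the backends whose status is not the expected one.
# parse_pool_nodes / nodes_not_up are hypothetical helper names.

def parse_pool_nodes(text):
    """Return one dict per data row of a `show pool_nodes` table."""
    header = None
    rows = []
    for line in text.splitlines():
        if "|" not in line:
            # Skips the ---+--- separator, the "(2 rows)" footer, blank lines.
            continue
        cells = [cell.strip() for cell in line.split("|")]
        if header is None:
            header = cells  # first line with "|" holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def nodes_not_up(text, up_status="2"):
    """Return (node_id, hostname, status) for rows whose status differs."""
    return [
        (row["node_id"], row["hostname"], row["status"])
        for row in parse_pool_nodes(text)
        if row["status"] != up_status
    ]


# Sample copied from the pool_nodes output quoted later in this thread.
SAMPLE = """\
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 3      | 0.500000  | standby
(2 rows)
"""

if __name__ == "__main__":
    print(nodes_not_up(SAMPLE))
```

Running the same check against the output from both server0 and server1 would
show when the two pgpool instances disagree about a backend.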

I attached db-server0.log and db-server1.log. I redirected every command I
executed in the steps above to these log files as well (search for
'Issue command'), so you should be able to follow along easily.
I also attached my postgresql and pgpool conf files, as well as my
basebackup.sh and remote start script, in case you need them to reproduce
the problem.

Thanks~
Ning


On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:

> Hi ning,
>
> I tried to reproduce the bind error by repeatedly starting/stopping pgpool
> on both servers with watchdog enabled, but I cannot see the error.
>
> Can you tell me a reliable way to reproduce it?
>
>
> On Wed, 6 Mar 2013 11:21:01 -0600
> ning chan <ninchan8328 at gmail.com> wrote:
>
> > Hi Tatsuo,
> >
> > Do you need any more data for your investigation?
> >
> > Thanks~
> > Ning
> >
> >
> > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com> wrote:
> >
> > > Hi Tatsuo,
> > > > I shut down one watchdog instead of both, and I can't reproduce the problem.
> > >
> > > Here is the details:
> > > server0 pgpool watchdog is disabled
> > > > server1 pgpool watchdog is enabled and it is the primary database for
> > > > streaming replication. Failover & failback work just fine, except that
> > > > the virtual IP will not be migrated to the other pgpool server because
> > > > watchdog on server0 is not running.
> > >
> > > > FYI: as I reported on the other email thread, running watchdog on both
> > > > servers will not allow me to failover & failback more than once; I am
> > > > still looking for the root cause.
> > >
> > > > 1) both nodes show state 2 in pool_nodes
> > > > 2) shut down the database on server1, which causes the DB to fail over
> > > > to server0; server0 is now primary
> > > > 3) execute pcp_recovery on server0 to bring the failed server1 database
> > > > back online and connect it to server0 as a standby; however, pool_nodes
> > > > on server1 shows the following:
> > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > >  node_id | hostname | port | status | lb_weight |  role
> > > ---------+----------+------+--------+-----------+---------
> > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > (2 rows)
> > >
> > > > As shown, pgpool on server1 thinks server1 is in state 3.
> > > > Replication, however, is working fine.
> > >
> > > > 4) I have to execute pcp_attach_node on server1 to bring its pool_nodes
> > > > state to 2; however, server0's pool_nodes status for server1 then
> > > > becomes 3. See below for the output from both servers:
> > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > >  node_id | hostname | port | status | lb_weight |  role
> > > ---------+----------+------+--------+-----------+---------
> > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > >
> > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > >  node_id | hostname | port | status | lb_weight |  role
> > > ---------+----------+------+--------+-----------+---------
> > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > >
> > >
> > > > 5) executing the following command on server1 brings the server1
> > > > status to 2 on both nodes:
> > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
> > >
> > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > >  node_id | hostname | port | status | lb_weight |  role
> > > ---------+----------+------+--------+-----------+---------
> > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > >
> > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > >  node_id | hostname | port | status | lb_weight |  role
> > > ---------+----------+------+--------+-----------+---------
> > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > >
> > > Please advise the next step.
> > >
> > > Thanks~
> > > Ning
> > >
> > >
> > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> > >
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > >>
> > > >> This error message seems pretty strange. ":" should be something like
> > > >> "/tmp/.s.PGSQL.9898". Also it's weird because of "failed. reason:
> > > >> Success". To isolate the problem, can you please disable watchdog and
> > > >> try again?
> > >> --
> > >> Tatsuo Ishii
> > >> SRA OSS, Inc. Japan
> > >> English: http://www.sraoss.co.jp/index_en.php
> > >> Japanese: http://www.sraoss.co.jp
> > >>
> > >>
> > >> > Hi All,
> > > >> > After upgrading to pgPool-II 3.2.3, I tested my failover/failback setup
> > > >> > and started/stopped pgpool multiple times. I see one of the pgpool
> > > >> > instances go into an unrecoverable state.
> > >> >
> > > >> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown request
> > > >> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > > >> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip] doesn't have sticky bit
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory
> > >> >
> > >> >
> > >> > netstat shows the following:
> > >> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > > >> > tcp        0      0 0.0.0.0:9898                0.0.0.0:*                   LISTEN
> > > >> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*                   LISTEN
> > > >> > tcp        0      0 172.16.6.154:46650          172.16.6.153:9999           TIME_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51868          CLOSE_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51906          CLOSE_WAIT
> > > >> > tcp        0      0 172.16.6.154:9999           172.16.6.154:50624          TIME_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51946          CLOSE_WAIT
> > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18698  /tmp/.s.PGSQL.9898
> > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18685  /tmp/.s.PGSQL.9999
> > >> >
> > >> > Is this a known issue?
> > >> >
> > > >> > I have to reboot the server in order to bring pgpool back online.
> > > >> >
> > > >> > My cluster has two servers (server0 & server1), each of which runs
> > > >> > pgpool and PostgreSQL with streaming replication set up.
> > >> >
> > >> > Thanks~
> > >> > Ning
> > >>
> > >
> > >
>
>
> --
> Yugo Nagata <nagata at sraoss.co.jp>
>
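
The netstat output quoted above shows listeners on 9898 and 9999 and both Unix
socket files. Below is a rough sketch, assuming Python is available on the
server, for checking the same thing before starting pgpool again. The function
names tcp_port_in_use and existing_paths are hypothetical, and the probe only
reports whether a bind attempt fails; it says nothing about why pgpool's own
bind() call fails.

```python
# Sketch only: probe the ports and socket files seen in the quoted netstat
# output. tcp_port_in_use / existing_paths are hypothetical helper names.
import os
import socket


def tcp_port_in_use(port, host="0.0.0.0"):
    """Return True if binding a TCP socket to (host, port) raises OSError.

    SO_REUSEADDR is deliberately not set, so leftover connections on the
    port (for example in TIME_WAIT) may also make this report True.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
    except OSError:
        return True
    else:
        return False
    finally:
        sock.close()


def existing_paths(paths):
    """Return the subset of paths that currently exist on disk."""
    return [path for path in paths if os.path.exists(path)]


if __name__ == "__main__":
    # Ports and socket paths taken from the netstat output in this thread.
    for port in (9898, 9999):
        print(port, "in use" if tcp_port_in_use(port) else "free")
    print(existing_paths(["/tmp/.s.PGSQL.9898", "/tmp/.s.PGSQL.9999"]))
```

If a port is reported in use while no pgpool process is running, that would
match the state described in this thread.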
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool-server1.conf
Type: application/octet-stream
Size: 25586 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0009.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool-server0.conf
Type: application/octet-stream
Size: 25375 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0010.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postgresql-server1.conf
Type: application/octet-stream
Size: 19700 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0011.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool_remote_start
Type: application/octet-stream
Size: 866 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0012.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: basebackup.sh
Type: application/x-sh
Size: 2615 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0001.sh>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postgresql-server0.conf
Type: application/octet-stream
Size: 19698 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0013.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: db-server0.log
Type: application/octet-stream
Size: 43273 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0014.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: db-server1.log
Type: application/octet-stream
Size: 40688 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0015.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failback.py
Type: application/octet-stream
Size: 10531 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0016.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: failover.py
Type: application/octet-stream
Size: 6234 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130308/da35e0fe/attachment-0017.obj>

