[pgpool-general: 1474] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

Sat Mar 9 02:05:04 JST 2013

Hi Yugo,

I am using Centos 6.3x64 version.

[root at server1 ~]# uname -a
Linux server1 2.6.32-279.el6.x86_64 #1 SMP Fri Jun 22 12:19:21 UTC 2012
x86_64 x86_64 x86_64 GNU/Linux

After I shut down pgpool, i output the process (server1) to the db.log, i
don't see any process left over.
I also output the netstat on both server to the db.log, that you may find
interesting about.

Thanks~
Ning

On Fri, Mar 8, 2013 at 5:44 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:

> Hi ning,
>
> Thanks for detailed information. I'll try to reproduce the problem.
>
> In addition ,could you please provide me some more information?
>
> 1. What's OS version? (just to be sure)
>
> 2. After shutdown of the pgpool on server1, are there any pgpool process
> left?
> In the log of server1, I see port 9999 is still opend to listen while port
> 9898 is closed. It might mean there are some problem in exiting the pgpool.
>
> > Mar  7 23:57:42 server1 pgpool[2555]: received smart shutdown request
> > Mar  7 23:57:42 server1 pgpool[2555]: watchdog_pid: 2558
> > Mar  7 23:57:49 server1 pgpool[4407]: wd_chk_sticky: ifup[/sbin/ip]
> doesn't have > sticky bit
> > Mar  7 23:57:49 server1 pgpool[4408]: bind(:) failed. reason: Success
> > Mar  7 23:57:49 server1 pgpool[4408]: unlink(/tmp/.s.PGSQL.9898) failed:
> No such file or directory
> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*
>     LISTEN
> > tcp        8      0 172.16.6.154:9999           172.16.6.153:34048
>      ESTABLISHED
> > tcp        0      0 172.16.6.154:9999           172.16.6.153:33924
>      TIME_WAIT
> > tcp        0      0 172.16.6.154:9999           172.16.6.154:36458
>      TIME_WAIT
> > tcp        0      0 172.16.6.154:9999           172.16.6.154:36514
>      TIME_WAIT
> > tcp        0      0 172.16.6.154:44297          172.16.6.153:9999
>     TIME_WAIT
> > tcp        9      0 172.16.6.154:9999           172.16.6.153:34008
>      CLOSE_WAIT
> > tcp        0      0 172.16.6.154:9999           172.16.6.154:36486
>      TIME_WAIT
> > unix  2      [ ACC ]     STREAM     LISTENING     15867
>  /tmp/.s.PGSQL.9999
>
>
> On Fri, 8 Mar 2013 00:28:20 -0600
> ning chan <ninchan8328 at gmail.com> wrote:
>
> > Hi Yugo,
> > Thanks for looking at the issue, here is the exact steps i did to get in
> to
> > the problem.
> > 1) make sure replication is setup and pgpool on both server have the
> > backend value set to 2
> > 2) shutdown postgresql on the primary, this will promote the
> > standby(server1)  to become new primary
> > 3) execute pcp_recovery on server1 which will  recover the failed node
> > (server0) and connect to the new primary (server1), check backend status
> > value
> > 4) shudown postfresql on the server1 (new Primary), this should promote
> > server0 to become primary again
> > 5) execute pcp_recovery on server0 which will recover the failed node
> > (server1) and connect to the new primary (server0 again), check backend
> > status value
> > 6) go to server1, shutdown pgpool, and start it up again, pgpool at the
> > point will not be able to start anymore, server reboot is required in
> order
> > to bring pgpool online.
> >
> > I attached you the db-server0 and db-server1.log which i redirected all
> the
> > command (search for 'Issue command') I executed in above steps to the log
> > file as well, you should be able to follow it very easily.
> > I also attached you my postgresql and pgpool conf files as well as my
> > basebackup.sh and remote start script just in case you need them for
> > reproduce.
> >
> > Thanks~
> > Ning
> >
> >
> > On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> >
> > > Hi ning,
> > >
> > > I tried to reproduce the bind error by repeatedly starting/stopping
> pgpools
> > > with both watchdog enabled. But I cannot see the error.
> > >
> > > Can you tell me a reliable way to to reproduce it?
> > >
> > >
> > > On Wed, 6 Mar 2013 11:21:01 -0600
> > > ning chan <ninchan8328 at gmail.com> wrote:
> > >
> > > > Hi Tatsuo,
> > > >
> > > > Do you need any more data for your investigation?
> > > >
> > > > Thanks~
> > > > Ning
> > > >
> > > >
> > > > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com>
> wrote:
> > > >
> > > > > Hi Tatsuo,
> > > > > I shutdown one watchdog instead of both, I can't reproduce the
> problem.
> > > > >
> > > > > Here is the details:
> > > > > server0 pgpool watchdog is disabled
> > > > > server1 pgpool watchdog is enabled and it is a primary database for
> > > > > streaming replication, failover & failback works just fine; except
> > > that the
> > > > > virtual ip will not be migrated to the other pgpool server because
> > > > > watchdog on server0 is not running.
> > > > >
> > > > > FYI: as i reported on the other email thread, running watchdog on
> both
> > > > > server will not allow me to failover & failback more than once
> which I
> > > am
> > > > > still looking for root cause.
> > > > >
> > > > > 1) both node shows pool_nodes as state 2
> > > > > 2) shutdown database on server1, then cause the DB to failover to
> > > server0,
> > > > > server0 is now primary
> > > > > 3) execute pcp_recovery on server0 to bring the server1 failed
> database
> > > > > back online and connects to server0 as a standby; however,
> pool_nodes
> > > on
> > > > > server1 shows the following:
> > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > ---------+----------+------+--------+-----------+---------
> > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > > (2 rows)
> > > > >
> > > > > As shows, server1 pgpool think itself as in state 3.
> > > > > Replication however is working fine.
> > > > >
> > > > > 4) i have to execute pcp_attach_node on server1 to bring its
> pool_nodes
> > > > > state to 2, however, server0 pool_nodes info about server1 becomes
> 3.
> > > see
> > > > > below for both servers output:
> > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > ---------+----------+------+--------+-----------+---------
> > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > >
> > > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > ---------+----------+------+--------+-----------+---------
> > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > >
> > > > >
> > > > > 5) execute the following command on server1 will bring the server1
> > > status
> > > > > to 2 on both node:
> > > > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
> > > > >
> > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > ---------+----------+------+--------+-----------+---------
> > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > >
> > > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > ---------+----------+------+--------+-----------+---------
> > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > >
> > > > > Please advise the next step.
> > > > >
> > > > > Thanks~
> > > > > Ning
> > > > >
> > > > >
> > > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org
> >
> > > wrote:
> > > > >
> > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason:
> > > Success
> > > > >>
> > > > >> This error messge seems pretty strange. ":" should be something
> like
> > > > >> "/tmp/.s.PGSQL.9898". Also it's weired because 2failed. reason:
> > > > >> Success". To isolate the problem, can please disable watchdog and
> try
> > > > >> again?
> > > > >> --
> > > > >> Tatsuo Ishii
> > > > >> SRA OSS, Inc. Japan
> > > > >> English: http://www.sraoss.co.jp/index_en.php
> > > > >> Japanese: http://www.sraoss.co.jp
> > > > >>
> > > > >>
> > > > >> > Hi All,
> > > > >> > After upgrade to pgPool-II 3.2.3 and I tested my failover/
> failback
> > > > >> setup,
> > > > >> > and start / stop pgpool mutlip times, I see one of the pgpool
> goes
> > > in
> > > > >> to an
> > > > >> > unrecoverable state.
> > > > >> >
> > > > >> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown
> > > request
> > > > >> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > > > >> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky:
> ifup[/sbin/ip]
> > > > >> doesn't
> > > > >> > have sticky bit
> > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason:
> > > Success
> > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898)
> > > > >> failed: No
> > > > >> > such file or directory
> > > > >> >
> > > > >> >
> > > > >> > netstat shows the following:
> > > > >> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > > > >> > tcp        0      0 0.0.0.0:9898                0.0.0.0:*
> > > > >> > LISTEN
> > > > >> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*
> > > > >> > LISTEN
> > > > >> > tcp        0      0 172.16.6.154:46650
> 172.16.6.153:9999
> > > > >> > TIME_WAIT
> > > > >> > tcp        9      0 172.16.6.154:9999
> 172.16.6.153:51868
> > > > >> > CLOSE_WAIT
> > > > >> > tcp        9      0 172.16.6.154:9999
> 172.16.6.153:51906
> > > > >> > CLOSE_WAIT
> > > > >> > tcp        0      0 172.16.6.154:9999
> 172.16.6.154:50624
> > > > >> > TIME_WAIT
> > > > >> > tcp        9      0 172.16.6.154:9999
> 172.16.6.153:51946
> > > > >> > CLOSE_WAIT
> > > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18698
> > > > >>  /tmp/.s.PGSQL.9898
> > > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18685
> > > > >>  /tmp/.s.PGSQL.9999
> > > > >> >
> > > > >> > Is this a known issue?
> > > > >> >
> > > > >> > I will have to reboot the server in order to start pgpool back
> > > online.
> > > > >> >
> > > > >> > My cluster has two servers (server0 & server1) which each of
> them
> > > are
> > > > >> > running pgpool, and postgreSQL with streaming Replication setup.
> > > > >> >
> > > > >> > Thanks~
> > > > >> > Ning
> > > > >>
> > > > >
> > > > >
> > >
> > >
> > > --
> > > Yugo Nagata <nagata at sraoss.co.jp>
> > >
>
>
> --
> Yugo Nagata <nagata at sraoss.co.jp>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.pgpool.net/pipermail/pgpool-general/attachments/20130308/de54baa8/attachment.htm>