[pgpool-general: 1511] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

Yugo Nagata nagata at sraoss.co.jp
Mon Mar 18 20:13:48 JST 2013


Hi ning,

Sorry for the delayed response, but unfortunately I haven't been able to reproduce it.

When running failover.py/failback.py, the following error message occurs:
 ImportError: No module named dbllib

I can't find a dbllib module in the yum repositories.
What should I install on my machine, and how?

By the way, could you please tell me what your scripts are for?
I guess failover.py just touches a trigger file, but I cannot tell
what failback.py is for.
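For reference, a failover script that only touches a trigger file can be as small as the sketch below. This is my own sketch, not taken from your scripts: the trigger path and the placeholder arguments are assumptions, and the path must match the trigger_file setting in the standby's recovery.conf.

```shell
#!/bin/sh
# Hypothetical minimal failover script: promote the standby by touching
# a trigger file. The path below is an assumption; it must match the
# trigger_file setting in the standby's recovery.conf.
# pgpool typically passes values via failover_command placeholders,
# e.g. %d (failed node id) and %H (new master host).
FAILED_NODE_ID=$1
NEW_MASTER_HOST=$2
TRIGGER_FILE=/tmp/trigger_file.0

# Creating an empty file is enough; PostgreSQL on the standby notices
# it, exits recovery, and becomes the new primary.
touch "$TRIGGER_FILE"
```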

In addition, in your pgpool_remote_start the backend DB is stopped before
being started. However, pgpool_remote_start doesn't need to stop the backend
DB, because the DB is already stopped after the base backup. Likewise,
basebackup.sh doesn't need to stop and start the backend DB. Is there any
special intent behind these stops and starts?
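In other words, a pgpool_remote_start could consist of just a start step. The sketch below is hypothetical (paths, user, and hostnames are assumptions, and it echoes the command instead of executing it):

```shell
#!/bin/sh
# Hypothetical sketch of a pgpool_remote_start that only starts the
# remote backend. Paths, user, and default hostnames are assumptions;
# the command is echoed rather than executed.
DEST_HOST=${1:-server0}
DEST_DIR=${2:-/var/lib/pgsql/data}
PGCTL=/usr/bin/pg_ctl
PGUSER=postgres

# No stop step: after basebackup.sh has restored the base backup, the
# recovered backend is not running, so it only needs to be started.
echo ssh "$PGUSER@$DEST_HOST" "$PGCTL -D $DEST_DIR -w start"
```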

Information about these scripts might help in solving the problem.


On Fri, 8 Mar 2013 00:28:20 -0600
ning chan <ninchan8328 at gmail.com> wrote:

> Hi Yugo,
> Thanks for looking at the issue; here are the exact steps I took to get
> into the problem.
> 1) make sure replication is set up and pgpool on both servers has the
> backend value set to 2
> 2) shutdown postgresql on the primary; this will promote the
> standby (server1) to become the new primary
> 3) execute pcp_recovery on server1, which will recover the failed node
> (server0) and connect it to the new primary (server1); check the backend
> status value
> 4) shutdown postgresql on server1 (the new primary); this should promote
> server0 to become primary again
> 5) execute pcp_recovery on server0, which will recover the failed node
> (server1) and connect it to the new primary (server0 again); check the
> backend status value
> 6) go to server1, shutdown pgpool, and start it up again; pgpool at this
> point will not be able to start anymore, and a server reboot is required
> to bring pgpool back online.
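For what it's worth, the recovery half of the cycle in the steps above can be scripted with the pcp tools. In this sketch the hostnames and node ids follow the steps, while the timeout, pcp port, user, and password are placeholders; the commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch of the recovery commands from the steps above, using the pcp
# tools. Timeout (10), pcp port (9898), user, and password are
# placeholders; node ids 0/1 correspond to server0/server1.
PCP_TIMEOUT=10
PCP_PORT=9898
PCP_USER=pgpool
PCP_PASS=password

# Step 3: after failover to server1, recover node 0 (server0):
echo pcp_recovery_node $PCP_TIMEOUT server1 $PCP_PORT $PCP_USER $PCP_PASS 0
# Step 5: after failback to server0, recover node 1 (server1):
echo pcp_recovery_node $PCP_TIMEOUT server0 $PCP_PORT $PCP_USER $PCP_PASS 1
```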
> 
> I attached the db-server0 and db-server1 logs, to which I redirected all
> the commands I executed in the above steps (search for 'Issue command'),
> so you should be able to follow them very easily.
> I also attached my postgresql and pgpool conf files as well as my
> basebackup.sh and remote start script, in case you need them to
> reproduce.
> 
> Thanks~
> Ning
> 
> 
> On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> 
> > Hi ning,
> >
> > I tried to reproduce the bind error by repeatedly starting/stopping
> > pgpool with watchdog enabled on both servers, but I cannot see the error.
> >
> > Can you tell me a reliable way to reproduce it?
> >
> >
> > On Wed, 6 Mar 2013 11:21:01 -0600
> > ning chan <ninchan8328 at gmail.com> wrote:
> >
> > > Hi Tatsuo,
> > >
> > > Do you need any more data for your investigation?
> > >
> > > Thanks~
> > > Ning
> > >
> > >
> > > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com> wrote:
> > >
> > > > Hi Tatsuo,
> > > > I shut down one watchdog instead of both, and I can't reproduce the problem.
> > > >
> > > > Here is the details:
> > > > server0 pgpool watchdog is disabled
> > > > server1 pgpool watchdog is enabled, and it is the primary database for
> > > > streaming replication; failover & failback work just fine, except that
> > > > the virtual IP will not be migrated to the other pgpool server because
> > > > watchdog on server0 is not running.
> > > >
> > > > FYI: as I reported in the other email thread, running watchdog on both
> > > > servers will not allow me to failover & failback more than once; I am
> > > > still looking for the root cause.
> > > >
> > > > 1) both nodes show pool_nodes state 2
> > > > 2) shutdown the database on server1; the DB then fails over to server0,
> > > > and server0 is now primary
> > > > 3) execute pcp_recovery on server0 to bring server1's failed database
> > > > back online and connect it to server0 as a standby; however, pool_nodes
> > > > on server1 shows the following:
> > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > (2 rows)
> > > >
> > > > As shown, server1's pgpool considers itself to be in state 3.
> > > > Replication however is working fine.
> > > >
> > > > 4) I have to execute pcp_attach_node on server1 to bring its pool_nodes
> > > > state to 2; however, server0's pool_nodes info about server1 then
> > > > becomes 3. See below for both servers' output:
> > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > >
> > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > >
> > > >
> > > > 5) executing the following command on server1 brings the server1 status
> > > > to 2 on both nodes:
> > > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
> > > >
> > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > >
> > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
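A small sketch of the workaround implied by steps 4 and 5: pcp_attach_node has to be issued against each pgpool instance so that both report node 1 with status 2. The password is a placeholder, and the commands are echoed rather than executed:

```shell
#!/bin/sh
# Sketch: re-attach node 1 on both pgpool instances so their views of
# its status agree. The password is a placeholder; the commands are
# echoed rather than executed.
NODE_ID=1
for PCP_HOST in server0 server1; do
    echo /usr/local/bin/pcp_attach_node 10 "$PCP_HOST" 9898 pgpool password "$NODE_ID"
done
```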
> > > >
> > > > Please advise the next step.
> > > >
> > > > Thanks~
> > > > Ning
> > > >
> > > >
> > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org>
> > wrote:
> > > >
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > >>
> > > >> This error message seems pretty strange. ":" should be something like
> > > >> "/tmp/.s.PGSQL.9898". It is also weird that it says "failed. reason:
> > > >> Success". To isolate the problem, can you please disable watchdog and
> > > >> try again?
> > > >> --
> > > >> Tatsuo Ishii
> > > >> SRA OSS, Inc. Japan
> > > >> English: http://www.sraoss.co.jp/index_en.php
> > > >> Japanese: http://www.sraoss.co.jp
> > > >>
> > > >>
> > > >> > Hi All,
> > > >> > After upgrading to pgPool-II 3.2.3, I tested my failover/failback
> > > >> > setup, and after starting/stopping pgpool multiple times, I see one
> > > >> > of the pgpool instances go into an unrecoverable state.
> > > >> >
> > > >> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown request
> > > >> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > > >> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip] doesn't have sticky bit
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory
> > > >> >
> > > >> >
> > > >> > netstat shows the following:
> > > >> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > > >> > tcp        0      0 0.0.0.0:9898                0.0.0.0:*                   LISTEN
> > > >> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*                   LISTEN
> > > >> > tcp        0      0 172.16.6.154:46650          172.16.6.153:9999           TIME_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51868          CLOSE_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51906          CLOSE_WAIT
> > > >> > tcp        0      0 172.16.6.154:9999           172.16.6.154:50624          TIME_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51946          CLOSE_WAIT
> > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18698   /tmp/.s.PGSQL.9898
> > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18685   /tmp/.s.PGSQL.9999
> > > >> >
> > > >> > Is this a known issue?
> > > >> >
> > > >> > I have to reboot the server in order to bring pgpool back online.
> > > >> >
> > > >> > My cluster has two servers (server0 & server1), each running pgpool
> > > >> > and PostgreSQL with streaming replication.
> > > >> >
> > > >> > Thanks~
> > > >> > Ning
> > > >>
> > > >
> > > >
> >
> >
> > --
> > Yugo Nagata <nagata at sraoss.co.jp>
> >


-- 
Yugo Nagata <nagata at sraoss.co.jp>

