[pgpool-general: 1518] Re: pgPool-II 3.2.3 going into an unrecoverable state after starting/stopping pgpool multiple times

Yugo Nagata nagata at sraoss.co.jp
Thu Mar 21 18:43:25 JST 2013


Hi ning,

Sample pgpool.conf and scripts (failover.sh, recovery_1st_stage and
pgpool_remote_start) are available in the following document:

"pgpool-II Tutorial [watchdog in master-slave mode]".
http://www.pgpool.net/pgpool-web/contrib_docs/watchdog_master_slave/en.html

Could you please try to reproduce the problem using these scripts?

You will have to edit the scripts, because some values such as the port
number and the install directory are hard-coded in them.
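
For reference, the core of the failover script is just touching a trigger
file on the new primary. A minimal sketch (not the exact tutorial script;
the ssh user and trigger file path are assumptions you would adapt):

  #!/bin/sh
  # Called by pgpool via failover_command, e.g.
  #   failover_command = '/path/to/failover.sh %d %H /tmp/trigger_file'
  failed_node_id=$1     # %d: id of the backend node that went down
  new_master_host=$2    # %H: hostname of the new master candidate
  trigger_file=$3       # trigger file that promotes the standby

  # Touch the trigger file on the new master so its standby PostgreSQL
  # promotes itself to primary.
  ssh -T postgres@$new_master_host "touch $trigger_file"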

On Tue, 19 Mar 2013 12:57:57 -0500
ning chan <ninchan8328 at gmail.com> wrote:

> Hi Yugo,
> 
> You are correct: failover.py is simply used to detect whether the current
> node will be the new primary, and it creates the trigger file if it is.
> failback.py is used to fail back the failed server by running pg_start_backup,
> rsyncing the files from the primary to the local node, touching recovery.conf,
> and running pg_stop_backup.
> And pgpool_remote_start basically just starts up the remote DB.
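> 
> In shell terms the failback sequence is roughly the following (the primary
> hostname, data directory and replication user here are placeholders rather
> than the actual failback.py code):
> 
>   #!/bin/sh
>   PRIMARY=server0                      # current primary (placeholder)
>   PGDATA=/var/lib/pgsql/data           # local data directory (placeholder)
> 
>   # Put the primary into backup mode
>   psql -h $PRIMARY -U postgres -c "SELECT pg_start_backup('failback', true)"
> 
>   # Copy the primary's data directory over this (failed) node's
>   rsync -av --delete --exclude=pg_xlog --exclude=postmaster.pid \
>         $PRIMARY:$PGDATA/ $PGDATA/
> 
>   # Make this node follow the primary as a streaming-replication standby
>   {
>     echo "standby_mode = 'on'"
>     echo "primary_conninfo = 'host=$PRIMARY user=replicator'"
>   } > $PGDATA/recovery.conf
> 
>   # Take the primary out of backup mode
>   psql -h $PRIMARY -U postgres -c "SELECT pg_stop_backup()"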
> 
> The additional start/stop of the DB engine in the above scripts may not be
> necessary, but it shouldn't hurt at all, and it is database-engine related,
> I would think.
> 
> Sorry, dbllib is an in-house library that I may not be able to share.
> 
> Meanwhile, do you have sample failover/failback scripts you can share so
> that I can reproduce the problem for you? I will also try to look for some
> online.
> 
> Thanks~
> Ning
> 
> 
> On Mon, Mar 18, 2013 at 6:13 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> 
> > Hi ning,
> >
> > Sorry for the delay in responding, but unfortunately I haven't been able
> > to reproduce it.
> >
> > In failover.py/failback.py, the following error occurs:
> >  ImportError: No module named dbllib
> >
> > I can't find the dbllib module in the yum repositories.
> > What should I install on my machine, and how?
> >
> > BTW, could you please tell me what your scripts' purposes are?
> > I guess all failover.py does is touch a trigger file.
> > However, I cannot understand what failback.py is for.
> >
> > In addition, in your pgpool_remote_start the backend DB is stopped before
> > it is started. However, pgpool_remote_start doesn't have to stop the
> > backend DB, because the DB should already be stopped after the base backup.
> > Also, basebackup.sh doesn't have to stop & start the backend DB. Is there
> > any special intent behind these stops & starts?
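> >
> > For comparison, a typical pgpool_remote_start only needs to start the
> > recovered node, roughly like the following sketch (the ssh user and the
> > pg_ctl invocation details are assumptions):
> >
> >   #!/bin/sh
> >   DEST=$1        # recovery target's hostname, passed by pgpool
> >   PGDATA=$2      # recovery target's data directory, passed by pgpool
> >
> >   # Just start the recovered standby; no stop is needed because the node
> >   # is already down when online recovery reaches this step.
> >   ssh -T postgres@$DEST pg_ctl -w -D $PGDATA start \
> >       2>/dev/null 1>/dev/null < /dev/null &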
> >
> > Information about the scripts might help in solving the problem.
> >
> >
> > On Fri, 8 Mar 2013 00:28:20 -0600
> > ning chan <ninchan8328 at gmail.com> wrote:
> >
> > > Hi Yugo,
> > > Thanks for looking at the issue; here are the exact steps I took to get
> > > into the problem.
> > > 1) make sure replication is set up and pgpool on both servers shows the
> > > backend status value as 2
> > > 2) shut down postgresql on the primary; this will promote the standby
> > > (server1) to become the new primary
> > > 3) execute pcp_recovery on server1, which will recover the failed node
> > > (server0) and connect it to the new primary (server1), then check the
> > > backend status value (an example invocation is sketched after this list)
> > > 4) shut down postgresql on server1 (the new primary); this should promote
> > > server0 to become primary again
> > > 5) execute pcp_recovery on server0, which will recover the failed node
> > > (server1) and connect it to the new primary (server0 again), then check
> > > the backend status value
> > > 6) go to server1, shut down pgpool, and start it up again; pgpool at this
> > > point will not be able to start anymore, and a server reboot is required
> > > in order to bring pgpool back online.
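> > >
> > > (The pcp_recovery calls in steps 3 and 5 are pcp_recovery_node commands;
> > > roughly, using the same pcp port and user as the pcp_attach_node command
> > > quoted further down:
> > >
> > >   /usr/local/bin/pcp_recovery_node <timeout> <pcp_host> <pcp_port> <user> <passwd> <node_id>
> > >   # e.g. in step 3, on server1, to recover node 0 (server0):
> > >   /usr/local/bin/pcp_recovery_node 10 server1 9898 pgpool [passwd] 0
> > >   # and in step 5, on server0, to recover node 1 (server1):
> > >   /usr/local/bin/pcp_recovery_node 10 server0 9898 pgpool [passwd] 1
> > >
> > > The first argument is the pcp timeout in seconds and the last is the id
> > > of the node to recover.)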
> > >
> > > I have attached the db-server0 and db-server1 logs, into which I also
> > > redirected all the commands I executed in the above steps (search for
> > > 'Issue command'), so you should be able to follow them very easily.
> > > I also attached my postgresql and pgpool conf files, as well as my
> > > basebackup.sh and remote start script, in case you need them to
> > > reproduce the issue.
> > >
> > > Thanks~
> > > Ning
> > >
> > >
> > > On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> > >
> > > > Hi ning,
> > > >
> > > > I tried to reproduce the bind error by repeatedly starting/stopping the
> > > > pgpools with watchdog enabled on both, but I cannot see the error.
> > > >
> > > > Can you tell me a reliable way to reproduce it?
> > > >
> > > >
> > > > On Wed, 6 Mar 2013 11:21:01 -0600
> > > > ning chan <ninchan8328 at gmail.com> wrote:
> > > >
> > > > > Hi Tatsuo,
> > > > >
> > > > > Do you need any more data for your investigation?
> > > > >
> > > > > Thanks~
> > > > > Ning
> > > > >
> > > > >
> > > > > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com> wrote:
> > > > >
> > > > > > Hi Tatsuo,
> > > > > > I shut down one watchdog instead of both, and I can't reproduce the
> > > > > > problem.
> > > > > >
> > > > > > Here are the details:
> > > > > > server0: pgpool watchdog is disabled
> > > > > > server1: pgpool watchdog is enabled, and it is the primary database for
> > > > > > streaming replication; failover & failback work just fine, except that
> > > > > > the virtual IP will not be migrated to the other pgpool server because
> > > > > > the watchdog on server0 is not running.
> > > > > >
> > > > > > FYI: as I reported in the other email thread, running watchdog on both
> > > > > > servers will not allow me to failover & failback more than once, and I
> > > > > > am still looking for the root cause.
> > > > > >
> > > > > > 1) both nodes show pool_nodes status as 2
> > > > > > 2) shut down the database on server1, which causes the DB to fail over
> > > > > > to server0; server0 is now primary
> > > > > > 3) execute pcp_recovery on server0 to bring server1's failed database
> > > > > > back online and connect it to server0 as a standby; however, pool_nodes
> > > > > > on server1 shows the following:
> > > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > ---------+----------+------+--------+-----------+---------
> > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > > > (2 rows)
> > > > > >
> > > > > > As shown, server1's pgpool thinks it is in state 3.
> > > > > > Replication, however, is working fine.
> > > > > >
> > > > > > 4) I have to execute pcp_attach_node on server1 to bring its pool_nodes
> > > > > > status to 2; however, server0's pool_nodes info about server1 then
> > > > > > becomes 3. See below for both servers' output:
> > > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > ---------+----------+------+--------+-----------+---------
> > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > > >
> > > > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > ---------+----------+------+--------+-----------+---------
> > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > > >
> > > > > >
> > > > > > 5) executing the following command on server1 will bring the server1
> > > > > > status to 2 on both nodes:
> > > > > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
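> > > > > > (For reference, the arguments here are, in order: timeout, pcp host, pcp
> > > > > > port, pcp user, password, and the id of the node to attach.)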
> > > > > >
> > > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > ---------+----------+------+--------+-----------+---------
> > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > > >
> > > > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > ---------+----------+------+--------+-----------+---------
> > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > > >
> > > > > > Please advise the next step.
> > > > > >
> > > > > > Thanks~
> > > > > > Ning
> > > > > >
> > > > > >
> > > > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> > > > > >
> > > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > > > >>
> > > > > >> This error message seems pretty strange. ":" should be something like
> > > > > >> "/tmp/.s.PGSQL.9898". Also it's weird that it says "failed. reason:
> > > > > >> Success". To isolate the problem, can you please disable watchdog and
> > > > > >> try again?
> > > > > >> --
> > > > > >> Tatsuo Ishii
> > > > > >> SRA OSS, Inc. Japan
> > > > > >> English: http://www.sraoss.co.jp/index_en.php
> > > > > >> Japanese: http://www.sraoss.co.jp
> > > > > >>
> > > > > >>
> > > > > >> > Hi All,
> > > > > >> > After upgrading to pgPool-II 3.2.3, I tested my failover/failback setup
> > > > > >> > and started/stopped pgpool multiple times, and I see one of the pgpools
> > > > > >> > go into an unrecoverable state.
> > > > > >> >
> > > > > >> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown request
> > > > > >> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > > > > >> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip] doesn't have sticky bit
> > > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory
> > > > > >> >
> > > > > >> >
> > > > > >> > netstat shows the following:
> > > > > >> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > > > > >> > tcp        0      0 0.0.0.0:9898            0.0.0.0:*             LISTEN
> > > > > >> > tcp        0      0 0.0.0.0:9999            0.0.0.0:*             LISTEN
> > > > > >> > tcp        0      0 172.16.6.154:46650      172.16.6.153:9999     TIME_WAIT
> > > > > >> > tcp        9      0 172.16.6.154:9999       172.16.6.153:51868    CLOSE_WAIT
> > > > > >> > tcp        9      0 172.16.6.154:9999       172.16.6.153:51906    CLOSE_WAIT
> > > > > >> > tcp        0      0 172.16.6.154:9999       172.16.6.154:50624    TIME_WAIT
> > > > > >> > tcp        9      0 172.16.6.154:9999       172.16.6.153:51946    CLOSE_WAIT
> > > > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18698  /tmp/.s.PGSQL.9898
> > > > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18685  /tmp/.s.PGSQL.9999
> > > > > >> >
> > > > > >> > Is this a known issue?
> > > > > >> >
> > > > > >> > I will have to reboot the server in order to bring pgpool back online.
> > > > > >> >
> > > > > >> > My cluster has two servers (server0 & server1), each of which runs
> > > > > >> > pgpool and PostgreSQL with a streaming replication setup.
> > > > > >> >
> > > > > >> > Thanks~
> > > > > >> > Ning
> > > > > >>
> > > > > >
> > > > > >
> > > >
> > > >
> > > > --
> > > > Yugo Nagata <nagata at sraoss.co.jp>
> > > >
> >
> >
> > --
> > Yugo Nagata <nagata at sraoss.co.jp>
> >


-- 
Yugo Nagata <nagata at sraoss.co.jp>

