[pgpool-general: 1562] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

ning chan ninchan8328 at gmail.com
Sat Mar 30 12:56:18 JST 2013


Hi Yugo,
Sorry for the late reply, as I was busy on another project.
Actually, I cannot reproduce the problem anymore after taking out
failback.py on both pgpool nodes.
Now I can fail over, and recover the failed node, at will.

However, when I moved on to the next test case, shutting down the primary
server completely (executing a reboot on it), failover failed.
Looking at the log on the standby, I see that the failover script is being
called; however, the %H parameter is missing from the failover command.

Mar 29 22:52:02 server0 pgpool[35742]: execute command:
/home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
-m -1 -H  -M 0 -P 0 -r  -R
Fri Mar 29 22:52:02 2013 failover  DEBUG:  -->
Fri Mar 29 22:52:02 2013 failover  DEBUG:  Invalid node ID
Fri Mar 29 22:52:02 2013 failover  DEBUG:  <--

The options correspond to the parameters listed in pgpool.conf.
In this case, the -H value is missing; -H carries %H, the hostname of the
new master node.
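To show the symptom concretely: when %H expands to an empty string, the script receives "-H" immediately followed by the next flag ("-M"), so the parsed new-master hostname is empty. Note that -m (%m, the new master node ID) is -1 in the log, which may mean pgpool could not pick a new master at all. Below is only an illustrative sketch of how such an argument string parses (a hypothetical parser, not the actual failover.py):

```python
def is_flag(tok):
    # "-H" is a flag; "-1" (e.g. the %m value when no new master exists) is not
    return tok.startswith("-") and not tok.lstrip("-").isdigit()

def parse_failover_args(argv):
    """Parse pgpool-style '-flag value' pairs. A flag immediately followed
    by another flag (e.g. '-H -M') gets an empty value, which is exactly
    what an empty %H substitution produces."""
    args, i = {}, 0
    while i < len(argv):
        flag = argv[i]
        nxt = argv[i + 1] if i + 1 < len(argv) else ""
        if nxt and is_flag(nxt):
            args[flag] = ""      # value missing, as with the empty %H
            i += 1
        else:
            args[flag] = nxt
            i += 2
    return args

def should_promote(args, my_hostname):
    """Touch-the-trigger-file decision; refuses when no new master host."""
    new_master = args.get("-H", "")
    if not new_master:
        # no new master host was passed in; refuse to promote
        return False
    return new_master == my_hostname

if __name__ == "__main__":
    # the exact argument string from the log above
    argv = ("-d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data "
            "-m -1 -H -M 0 -P 0 -r -R").split()
    args = parse_failover_args(argv)
    print(repr(args["-H"]), args["-m"])  # -H is empty, -m is -1
```

A script written this way at least fails closed (no promotion) instead of acting on an empty hostname, but the underlying question remains why pgpool produced an empty %H in the first place.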

Do you have any idea why?

Thanks~
Ning


On Thu, Mar 21, 2013 at 4:43 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:

> Hi ning,
>
> The samples of pgpool.conf and the scripts (failover.sh, recovery_1st_stage
> and pgpool_remote_start) are available in the following document:
>
> "pgpool-II Tutorial [watchdog in master-slave mode]".
> http://www.pgpool.net/pgpool-web/contrib_docs/watchdog_master_slave/en.html
>
> Could you please try to reproduce the problem using these scripts?
>
> You would have to edit the scripts, because some variables, such as the
> port number and install directory, are hard-coded in them.
>
> On Tue, 19 Mar 2013 12:57:57 -0500
> ning chan <ninchan8328 at gmail.com> wrote:
>
> > HI Yugo,
> >
> > You are correct: failover.py is simply used to detect whether the current
> > node will be the new primary, and it creates the trigger file if it is.
> > failback.py is used to bring the failed server back: pg_start_backup,
> > rsyncing the files from the primary to local, touching recovery.conf,
> > then pg_stop_backup.
> > And pgpool_remote_start basically just starts up the remote DB.
> >
> > The additional start/stop of the DB engine in the above scripts may not
> > be necessary, but it shouldn't hurt at all, and it is database-engine
> > related, I would think.
> >
> > Sorry, dbllib is an in-house library that I may not be able to share.
> >
> > Meanwhile, do you have sample failover/failback scripts that you can
> > share so I can reproduce the problem for you? I will also try to look
> > for some online.
> >
> > Thanks~
> > Ning
> >
> >
> > On Mon, Mar 18, 2013 at 6:13 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> >
> > > Hi ning,
> > >
> > > Sorry for the delay in response, but unfortunately I haven't been able
> > > to reproduce it.
> > >
> > > In failover.py/failback.py, the following error message occurs:
> > >  ImportError: No module named dbllib
> > >
> > > I can't find the module dbllib in yum repositories.
> > > What should I install on my machine, and how?
> > >
> > > BTW, could you please tell me what your scripts' purposes are?
> > > I guess failover.py just touches a trigger file.
> > > However, I cannot understand what failback.py is for.
> > >
> > > In addition, in your pgpool_remote_start, the backend DB is stopped
> > > before being started. However, pgpool_remote_start doesn't have to stop
> > > the backend DB, because the DB would already be stopped after
> > > basebackup. Also, basebackup.sh doesn't have to stop & start the
> > > backend DB. Is there any special intent behind these stops & starts?
> > >
> > > Information about the scripts might help in solving the problem.
> > >
> > >
> > > On Fri, 8 Mar 2013 00:28:20 -0600
> > > ning chan <ninchan8328 at gmail.com> wrote:
> > >
> > > > Hi Yugo,
> > > > Thanks for looking at the issue; here are the exact steps I did to
> > > > get into the problem:
> > > > 1) make sure replication is set up and pgpool on both servers has the
> > > > backend value set to 2
> > > > 2) shut down postgresql on the primary; this will promote the
> > > > standby (server1) to become the new primary
> > > > 3) execute pcp_recovery on server1, which will recover the failed node
> > > > (server0) and connect it to the new primary (server1); check the
> > > > backend status value
> > > > 4) shut down postgresql on server1 (the new primary); this should
> > > > promote server0 to become primary again
> > > > 5) execute pcp_recovery on server0, which will recover the failed node
> > > > (server1) and connect it to the new primary (server0 again); check the
> > > > backend status value
> > > > 6) go to server1, shut down pgpool, and start it up again; at this
> > > > point pgpool will not be able to start anymore, and a server reboot is
> > > > required to bring pgpool online.
> > > >
> > > > I attached the db-server0 and db-server1 logs, into which I redirected
> > > > all the commands (search for 'Issue command') I executed in the above
> > > > steps, so you should be able to follow them very easily.
> > > > I also attached my postgresql and pgpool conf files, as well as my
> > > > basebackup.sh and remote start script, in case you need them to
> > > > reproduce.
> > > >
> > > > Thanks~
> > > > Ning
> > > >
> > > >
> > > > On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> > > >
> > > > > Hi ning,
> > > > >
> > > > > I tried to reproduce the bind error by repeatedly starting/stopping
> > > > > pgpool with both watchdogs enabled, but I could not see the error.
> > > > >
> > > > > Can you tell me a reliable way to reproduce it?
> > > > >
> > > > >
> > > > > On Wed, 6 Mar 2013 11:21:01 -0600
> > > > > ning chan <ninchan8328 at gmail.com> wrote:
> > > > >
> > > > > > Hi Tatsuo,
> > > > > >
> > > > > > Do you need any more data for your investigation?
> > > > > >
> > > > > > Thanks~
> > > > > > Ning
> > > > > >
> > > > > >
> > > > > > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com> wrote:
> > > > > >
> > > > > > > Hi Tatsuo,
> > > > > > > I shut down one watchdog instead of both, and I can't reproduce
> > > > > > > the problem.
> > > > > > >
> > > > > > > Here are the details:
> > > > > > > server0's pgpool watchdog is disabled.
> > > > > > > server1's pgpool watchdog is enabled, and server1 is the primary
> > > > > > > database for streaming replication; failover & failback work just
> > > > > > > fine, except that the virtual IP will not be migrated to the other
> > > > > > > pgpool server, because the watchdog on server0 is not running.
> > > > > > >
> > > > > > > FYI: as I reported in the other email thread, running the watchdog
> > > > > > > on both servers does not allow me to failover & failback more than
> > > > > > > once; I am still looking for the root cause.
> > > > > > >
> > > > > > > 1) both nodes show pool_nodes state 2
> > > > > > > 2) shut down the database on server1, causing the DB to fail over
> > > > > > > to server0; server0 is now primary
> > > > > > > 3) execute pcp_recovery on server0 to bring server1's failed
> > > > > > > database back online and connect it to server0 as a standby;
> > > > > > > however, pool_nodes on server1 shows the following:
> > > > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > > ---------+----------+------+--------+-----------+---------
> > > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > > > > (2 rows)
> > > > > > >
> > > > > > > As shown, server1's pgpool thinks it is in state 3.
> > > > > > > Replication, however, is working fine.
> > > > > > >
> > > > > > > 4) I have to execute pcp_attach_node on server1 to bring its
> > > > > > > pool_nodes state to 2; however, server0's pool_nodes info about
> > > > > > > server1 becomes 3. See below for both servers' output:
> > > > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > > ---------+----------+------+--------+-----------+---------
> > > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > > > >
> > > > > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > > ---------+----------+------+--------+-----------+---------
> > > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > > > >
> > > > > > >
> > > > > > > 5) executing the following command on server1 brings the server1
> > > > > > > status to 2 on both nodes:
> > > > > > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
> > > > > > >
> > > > > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > > ---------+----------+------+--------+-----------+---------
> > > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > > > >
> > > > > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > > > > >  node_id | hostname | port | status | lb_weight |  role
> > > > > > > ---------+----------+------+--------+-----------+---------
> > > > > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > > > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > > > > >
> > > > > > > Please advise the next step.
> > > > > > >
> > > > > > > Thanks~
> > > > > > > Ning
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> > > > > > >
> > > > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > > > > >>
> > > > > > >> This error message seems pretty strange: ":" should be something
> > > > > > >> like "/tmp/.s.PGSQL.9898". It's also weird because of "failed.
> > > > > > >> reason: Success". To isolate the problem, can you please disable
> > > > > > >> the watchdog and try again?
> > > > > > >> --
> > > > > > >> Tatsuo Ishii
> > > > > > >> SRA OSS, Inc. Japan
> > > > > > >> English: http://www.sraoss.co.jp/index_en.php
> > > > > > >> Japanese: http://www.sraoss.co.jp
> > > > > > >>
> > > > > > >>
> > > > > > >> > Hi All,
> > > > > > >> > After upgrading to pgpool-II 3.2.3, I tested my failover/failback
> > > > > > >> > setup and started/stopped pgpool multiple times, and I see one of
> > > > > > >> > the pgpool instances go into an unrecoverable state.
> > > > > > >> >
> > > > > > >> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown request
> > > > > > >> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > > > > > >> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip] doesn't have sticky bit
> > > > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
> > > > > > >> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > netstat shows the following:
> > > > > > >> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > > > > > >> > tcp        0      0 0.0.0.0:9898          0.0.0.0:*            LISTEN
> > > > > > >> > tcp        0      0 0.0.0.0:9999          0.0.0.0:*            LISTEN
> > > > > > >> > tcp        0      0 172.16.6.154:46650    172.16.6.153:9999    TIME_WAIT
> > > > > > >> > tcp        9      0 172.16.6.154:9999     172.16.6.153:51868   CLOSE_WAIT
> > > > > > >> > tcp        9      0 172.16.6.154:9999     172.16.6.153:51906   CLOSE_WAIT
> > > > > > >> > tcp        0      0 172.16.6.154:9999     172.16.6.154:50624   TIME_WAIT
> > > > > > >> > tcp        9      0 172.16.6.154:9999     172.16.6.153:51946   CLOSE_WAIT
> > > > > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18698  /tmp/.s.PGSQL.9898
> > > > > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18685  /tmp/.s.PGSQL.9999
> > > > > > >> >
> > > > > > >> > Is this a known issue?
> > > > > > >> >
> > > > > > >> > I will have to reboot the server in order to bring pgpool back
> > > > > > >> > online.
> > > > > > >> >
> > > > > > >> > My cluster has two servers (server0 & server1), each running
> > > > > > >> > pgpool and PostgreSQL with a streaming replication setup.
> > > > > > >> >
> > > > > > >> > Thanks~
> > > > > > >> > Ning
> > > > > > >>
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Yugo Nagata <nagata at sraoss.co.jp>
> > > > >
> > >
> > >
> > > --
> > > Yugo Nagata <nagata at sraoss.co.jp>
> > >
>
>
> --
> Yugo Nagata <nagata at sraoss.co.jp>
>

