[pgpool-general: 1473] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

Fri Mar 8 20:44:34 JST 2013

Hi ning,

Thanks for detailed information. I'll try to reproduce the problem.

In addition ,could you please provide me some more information?

1. What's OS version? (just to be sure)

2. After shutdown of the pgpool on server1, are there any pgpool process left?
In the log of server1, I see port 9999 is still opend to listen while port
9898 is closed. It might mean there are some problem in exiting the pgpool.

> Mar  7 23:57:42 server1 pgpool[2555]: received smart shutdown request
> Mar  7 23:57:42 server1 pgpool[2555]: watchdog_pid: 2558
> Mar  7 23:57:49 server1 pgpool[4407]: wd_chk_sticky: ifup[/sbin/ip] doesn't have > sticky bit
> Mar  7 23:57:49 server1 pgpool[4408]: bind(:) failed. reason: Success
> Mar  7 23:57:49 server1 pgpool[4408]: unlink(/tmp/.s.PGSQL.9898) failed: No such file or directory
> tcp        0      0 0.0.0.0:9999                0.0.0.0:*                   LISTEN      
> tcp        8      0 172.16.6.154:9999           172.16.6.153:34048          ESTABLISHED 
> tcp        0      0 172.16.6.154:9999           172.16.6.153:33924          TIME_WAIT   
> tcp        0      0 172.16.6.154:9999           172.16.6.154:36458          TIME_WAIT   
> tcp        0      0 172.16.6.154:9999           172.16.6.154:36514          TIME_WAIT   
> tcp        0      0 172.16.6.154:44297          172.16.6.153:9999           TIME_WAIT   
> tcp        9      0 172.16.6.154:9999           172.16.6.153:34008          CLOSE_WAIT  
> tcp        0      0 172.16.6.154:9999           172.16.6.154:36486          TIME_WAIT   
> unix  2      [ ACC ]     STREAM     LISTENING     15867  /tmp/.s.PGSQL.9999

On Fri, 8 Mar 2013 00:28:20 -0600
ning chan <ninchan8328 at gmail.com> wrote:

> Hi Yugo,
> Thanks for looking at the issue, here is the exact steps i did to get in to
> the problem.
> 1) make sure replication is setup and pgpool on both server have the
> backend value set to 2
> 2) shutdown postgresql on the primary, this will promote the
> standby(server1)  to become new primary
> 3) execute pcp_recovery on server1 which will  recover the failed node
> (server0) and connect to the new primary (server1), check backend status
> value
> 4) shudown postfresql on the server1 (new Primary), this should promote
> server0 to become primary again
> 5) execute pcp_recovery on server0 which will recover the failed node
> (server1) and connect to the new primary (server0 again), check backend
> status value
> 6) go to server1, shutdown pgpool, and start it up again, pgpool at the
> point will not be able to start anymore, server reboot is required in order
> to bring pgpool online.
> 
> I attached you the db-server0 and db-server1.log which i redirected all the
> command (search for 'Issue command') I executed in above steps to the log
> file as well, you should be able to follow it very easily.
> I also attached you my postgresql and pgpool conf files as well as my
> basebackup.sh and remote start script just in case you need them for
> reproduce.
> 
> Thanks~
> Ning
> 
> 
> On Thu, Mar 7, 2013 at 6:01 AM, Yugo Nagata <nagata at sraoss.co.jp> wrote:
> 
> > Hi ning,
> >
> > I tried to reproduce the bind error by repeatedly starting/stopping pgpools
> > with both watchdog enabled. But I cannot see the error.
> >
> > Can you tell me a reliable way to to reproduce it?
> >
> >
> > On Wed, 6 Mar 2013 11:21:01 -0600
> > ning chan <ninchan8328 at gmail.com> wrote:
> >
> > > Hi Tatsuo,
> > >
> > > Do you need any more data for your investigation?
> > >
> > > Thanks~
> > > Ning
> > >
> > >
> > > On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com> wrote:
> > >
> > > > Hi Tatsuo,
> > > > I shutdown one watchdog instead of both, I can't reproduce the problem.
> > > >
> > > > Here is the details:
> > > > server0 pgpool watchdog is disabled
> > > > server1 pgpool watchdog is enabled and it is a primary database for
> > > > streaming replication, failover & failback works just fine; except
> > that the
> > > > virtual ip will not be migrated to the other pgpool server because
> > > > watchdog on server0 is not running.
> > > >
> > > > FYI: as i reported on the other email thread, running watchdog on both
> > > > server will not allow me to failover & failback more than once which I
> > am
> > > > still looking for root cause.
> > > >
> > > > 1) both node shows pool_nodes as state 2
> > > > 2) shutdown database on server1, then cause the DB to failover to
> > server0,
> > > > server0 is now primary
> > > > 3) execute pcp_recovery on server0 to bring the server1 failed database
> > > > back online and connects to server0 as a standby; however, pool_nodes
> > on
> > > > server1 shows the following:
> > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > > (2 rows)
> > > >
> > > > As shows, server1 pgpool think itself as in state 3.
> > > > Replication however is working fine.
> > > >
> > > > 4) i have to execute pcp_attach_node on server1 to bring its pool_nodes
> > > > state to 2, however, server0 pool_nodes info about server1 becomes 3.
> > see
> > > > below for both servers output:
> > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > >
> > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 3      | 0.500000  | standby
> > > >
> > > >
> > > > 5) execute the following command on server1 will bring the server1
> > status
> > > > to 2 on both node:
> > > > /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
> > > >
> > > > [root at server1 data]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > >
> > > > [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> > > >  node_id | hostname | port | status | lb_weight |  role
> > > > ---------+----------+------+--------+-----------+---------
> > > >  0       | server0  | 5432 | 2      | 0.500000  | primary
> > > >  1       | server1  | 5432 | 2      | 0.500000  | standby
> > > >
> > > > Please advise the next step.
> > > >
> > > > Thanks~
> > > > Ning
> > > >
> > > >
> > > > On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org>
> > wrote:
> > > >
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason:
> > Success
> > > >>
> > > >> This error messge seems pretty strange. ":" should be something like
> > > >> "/tmp/.s.PGSQL.9898". Also it's weired because 2failed. reason:
> > > >> Success". To isolate the problem, can please disable watchdog and try
> > > >> again?
> > > >> --
> > > >> Tatsuo Ishii
> > > >> SRA OSS, Inc. Japan
> > > >> English: http://www.sraoss.co.jp/index_en.php
> > > >> Japanese: http://www.sraoss.co.jp
> > > >>
> > > >>
> > > >> > Hi All,
> > > >> > After upgrade to pgPool-II 3.2.3 and I tested my failover/ failback
> > > >> setup,
> > > >> > and start / stop pgpool mutlip times, I see one of the pgpool goes
> > in
> > > >> to an
> > > >> > unrecoverable state.
> > > >> >
> > > >> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown
> > request
> > > >> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
> > > >> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip]
> > > >> doesn't
> > > >> > have sticky bit
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason:
> > Success
> > > >> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898)
> > > >> failed: No
> > > >> > such file or directory
> > > >> >
> > > >> >
> > > >> > netstat shows the following:
> > > >> > [root at server1 ~]# netstat -na |egrep "9898|9999"
> > > >> > tcp        0      0 0.0.0.0:9898                0.0.0.0:*
> > > >> > LISTEN
> > > >> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*
> > > >> > LISTEN
> > > >> > tcp        0      0 172.16.6.154:46650          172.16.6.153:9999
> > > >> > TIME_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51868
> > > >> > CLOSE_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51906
> > > >> > CLOSE_WAIT
> > > >> > tcp        0      0 172.16.6.154:9999           172.16.6.154:50624
> > > >> > TIME_WAIT
> > > >> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51946
> > > >> > CLOSE_WAIT
> > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18698
> > > >>  /tmp/.s.PGSQL.9898
> > > >> > unix  2      [ ACC ]     STREAM     LISTENING     18685
> > > >>  /tmp/.s.PGSQL.9999
> > > >> >
> > > >> > Is this a known issue?
> > > >> >
> > > >> > I will have to reboot the server in order to start pgpool back
> > online.
> > > >> >
> > > >> > My cluster has two servers (server0 & server1) which each of them
> > are
> > > >> > running pgpool, and postgreSQL with streaming Replication setup.
> > > >> >
> > > >> > Thanks~
> > > >> > Ning
> > > >>
> > > >
> > > >
> >
> >
> > --
> > Yugo Nagata <nagata at sraoss.co.jp>
> >

-- 
Yugo Nagata <nagata at sraoss.co.jp>