[pgpool-general: 1463] Re: pgPool-II 3.2.3 going in to an unrecoverable state after multiple starting stopping pgpool

Tatsuo Ishii ishii at postgresql.org
Thu Mar 7 07:26:23 JST 2013


I think Nagata is investigating this. Nagata?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hi Tatsuo,
> 
> Do you need any more data for your investigation?
> 
> Thanks~
> Ning
> 
> 
> On Mon, Mar 4, 2013 at 4:08 PM, ning chan <ninchan8328 at gmail.com> wrote:
> 
>> Hi Tatsuo,
>> I shut down one watchdog instead of both, and I can't reproduce the problem.
>>
>> Here is the details:
>> server0: pgpool watchdog is disabled.
>> server1: pgpool watchdog is enabled, and it hosts the primary database for
>> streaming replication. Failover & failback work just fine, except that the
>> virtual IP will not be migrated to the other pgpool server because the
>> watchdog on server0 is not running.
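>>
>> (The virtual IP here is whatever delegate_IP is set to in pgpool.conf on
>> both nodes; the address below is only a placeholder, not my real setting:
>>
>> delegate_IP = '172.16.6.200'
>>
>> With the watchdog down on server0, nothing is left there to take over that
>> address.)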
>>
>> FYI: as I reported in the other email thread, running the watchdog on both
>> servers will not allow me to fail over & fail back more than once; I am
>> still looking for the root cause.
>>
>> 1) Both nodes show state 2 in pool_nodes.
>> 2) Shut down the database on server1, causing the DB to fail over to
>> server0; server0 is now primary.
>> 3) Execute pcp_recovery on server0 to bring the failed database on server1
>> back online; it connects to server0 as a standby. However, pool_nodes on
>> server1 shows the following:
>> [root at server1 data]# psql -c "show pool_nodes" -p 9999
>>  node_id | hostname | port | status | lb_weight |  role
>> ---------+----------+------+--------+-----------+---------
>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>  1       | server1  | 5432 | 3      | 0.500000  | standby
>> (2 rows)
>>
>> As shown, pgpool on server1 thinks it is in state 3.
>> Replication, however, is working fine.
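>>
>> For reference, the step 3 command was of this form (pcp_recovery_node is
>> the actual pcp command name; the timeout, PCP port and credentials are
>> assumed to match the pcp_attach_node call in step 5, and 1 is server1's
>> node id):
>>
>> /usr/local/bin/pcp_recovery_node 10 server0 9898 pgpool [passwd] 1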
>>
>> 4) I have to execute pcp_attach_node on server1 to bring its pool_nodes
>> state to 2; however, server0's pool_nodes info about server1 then becomes
>> 3. See below for both servers' output:
>> [root at server1 data]# psql -c "show pool_nodes" -p 9999
>>  node_id | hostname | port | status | lb_weight |  role
>> ---------+----------+------+--------+-----------+---------
>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>>
>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>>  node_id | hostname | port | status | lb_weight |  role
>> ---------+----------+------+--------+-----------+---------
>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>  1       | server1  | 5432 | 3      | 0.500000  | standby
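>>
>> The step 4 attach was the same shape as the command in step 5, just
>> pointed at server1's own PCP port (the host name here is my assumption):
>>
>> /usr/local/bin/pcp_attach_node 10 server1 9898 pgpool [passwd] 1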
>>
>>
>> 5) Executing the following command on server1 brings the server1 status to
>> 2 on both nodes:
>> /usr/local/bin/pcp_attach_node 10 server0 9898 pgpool [passwd] 1
>>
>> [root at server1 data]# psql -c "show pool_nodes" -p 9999
>>  node_id | hostname | port | status | lb_weight |  role
>> ---------+----------+------+--------+-----------+---------
>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>>
>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>>  node_id | hostname | port | status | lb_weight |  role
>> ---------+----------+------+--------+-----------+---------
>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>>
>> Please advise the next step.
>>
>> Thanks~
>> Ning
>>
>>
>> On Sun, Mar 3, 2013 at 6:03 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>
>>> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
>>>
>>> This error message seems pretty strange. ":" should be something like
>>> "/tmp/.s.PGSQL.9898". It's also weird that it says "failed. reason:
>>> Success". To isolate the problem, can you please disable the watchdog and
>>> try again?
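>>>
>>> Disabling it should just be a matter of this in pgpool.conf, followed by
>>> a pgpool restart:
>>>
>>> use_watchdog = off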
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>>
>>> > Hi All,
>>> > After upgrading to pgPool-II 3.2.3, I tested my failover/failback setup
>>> > and started/stopped pgpool multiple times. I see one of the pgpool
>>> > instances go into an unrecoverable state.
>>> >
>>> > Mar  1 10:45:25 server1 pgpool[3007]: received smart shutdown request
>>> > Mar  1 10:45:25 server1 pgpool[3007]: watchdog_pid: 3010
>>> > Mar  1 10:45:31 server1 pgpool[3338]: wd_chk_sticky: ifup[/sbin/ip]
>>> > doesn't have sticky bit
>>> > Mar  1 10:45:31 server1 pgpool[3339]: bind(:) failed. reason: Success
>>> > Mar  1 10:45:31 server1 pgpool[3339]: unlink(/tmp/.s.PGSQL.9898)
>>> > failed: No such file or directory
>>> >
>>> >
>>> > netstat shows the following:
>>> > [root at server1 ~]# netstat -na |egrep "9898|9999"
>>> > tcp        0      0 0.0.0.0:9898                0.0.0.0:*                   LISTEN
>>> > tcp        0      0 0.0.0.0:9999                0.0.0.0:*                   LISTEN
>>> > tcp        0      0 172.16.6.154:46650          172.16.6.153:9999           TIME_WAIT
>>> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51868          CLOSE_WAIT
>>> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51906          CLOSE_WAIT
>>> > tcp        0      0 172.16.6.154:9999           172.16.6.154:50624          TIME_WAIT
>>> > tcp        9      0 172.16.6.154:9999           172.16.6.153:51946          CLOSE_WAIT
>>> > unix  2      [ ACC ]     STREAM     LISTENING     18698  /tmp/.s.PGSQL.9898
>>> > unix  2      [ ACC ]     STREAM     LISTENING     18685  /tmp/.s.PGSQL.9999
>>> >
>>> > Is this a known issue?
>>> >
>>> > I have to reboot the server in order to bring pgpool back online.
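>>> >
>>> > I suspect killing any leftover pgpool processes and removing the stale
>>> > unix sockets would also clear this without a reboot, e.g.:
>>> >
>>> > pkill pgpool
>>> > rm -f /tmp/.s.PGSQL.9898 /tmp/.s.PGSQL.9999
>>> >
>>> > but I have not verified that.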
>>> >
>>> > My cluster has two servers (server0 & server1), each of which runs
>>> > pgpool and PostgreSQL with a streaming replication setup.
>>> >
>>> > Thanks~
>>> > Ning
>>>
>>
>>

