[pgpool-general-jp: 1214] watchdogの死活監視について

2013年 10月 31日 (木) 18:48:42 JST

to:pgpool-general-jp メーリングリスト 御中

下記の環境を構築しPostgreSQL / pgpool-IIの障害テストを実施しています。
その工程で動きがおかしいのではないかと思われるところがありましたのでML宛てに問合せさせていただきます。

[Server1]
 OS : CentOS 5.6
 hostname : DB01
 IP : 192.168.10.30
 DB : PostgreSQL 9.1.6 ( Primary : 5433 )
 M/L : pgpool-II 3.3.1 ( Standby :5432 )
[Server2]
 OS : CentOS 5.6
 hostname : DB02
 IP 192.168.10.33
 DB : PostgreSQL 9.1.6 ( Hot Standby : 5434 )
 M/L : pgpool-II 3.3.1 ( Active : 5432 )
[Server3]
 OS : CentOS 5.6
 hostname : DB03
 IP : 192.168.10.34
 DB : PostgreSQL 9.1.6 ( Hot Standby : 5435 / potentioal )
 M/L : none

障害テストは下記をランダムに繰り返していました。
1.DBのPrimaryを停止し、復旧を確認して再度、DBを停止する。
2.DBのHot Standbyを停止し、手動復旧を行う。
3.pgpool-IIのActiveを停止し、StandbyがActiveに昇格後、旧Activeを再起動する。

上記3.を実施したなかで、1度旧Activeの再起動（復旧）を新Activeが受信していないのではないかと思われる動きがありました。

[Server2のpgpool-II ( Active )を停止した時のログ]
2013-10-31 16:51:24 LOG:   pid 23015: worker child 25156 exits with status 0
2013-10-31 16:51:24 LOG:   pid 23015: fork a new worker child pid 25156
2013-10-31 16:51:24 LOG:   pid 23015: PCP child 25155 exits with status 0
2013-10-31 16:51:24 LOG:   pid 23015: fork a new PCP child pid 25327
2013-10-31 16:51:24 LOG:   pid 23015: fork a new watchdog heartbeat sender:
pid 25328
2013-10-31 16:51:24 LOG:   pid 23015: fork a new watchdog heartbeat
receiver: pid 25329
2013-10-31 16:51:24 LOG:   pid 23015: fork a new watchdog child pid 25330
2013-10-31 16:51:24 LOG:   pid 23015: received smart shutdown request
2013-10-31 16:51:27 LOG:   pid 23026: wd_IP_down: ifconfig down succeeded

[上記の際のServer1のログ]
2013-10-31 16:51:27 LOG:   pid 18970: wd_escalation: escalating to master
pgpool
2013-10-31 16:51:29 LOG:   pid 18970: wd_escalation: escalated to master
pgpool successfully
2013-10-31 16:51:36 LOG:   pid 18973: check_pgpool_status_by_hb: pgpool 1
(DB02:5432) is in down status
2013-10-31 16:51:48 LOG:   pid 18973: check_pgpool_status_by_hb: pgpool 1
(DB02:5432) is in down status

[Server2のpgpool-IIを再起動した時のログ]
2013-10-31 16:54:33 LOG:   pid 25392: wd_chk_setuid: ifup[/sbin/ifconfig]
doesn't have setuid bit
2013-10-31 16:54:33 LOG:   pid 25392: Backend status file
/var/log/pgpool-II/pgpool_status discarded
2013-10-31 16:54:35 LOG:   pid 25392: wd_init: start watchdog
2013-10-31 16:54:35 LOG:   pid 25392: pgpool-II successfully started.
version 3.3.1 (tokakiboshi)
2013-10-31 16:54:35 LOG:   pid 25392: find_primary_node: primary node id is
0
2013-10-31 16:54:36 LOG:   pid 25400: wd_create_hb_recv_socket: bind
receive socket to device: eth1
2013-10-31 16:54:36 LOG:   pid 25401: wd_create_hb_send_socket: bind send
socket to device: eth1
2013-10-31 16:56:16 LOG:   pid 25402: watchdog: lifecheck started

[Server2のpgpool-IIを再起動した時のServer1のログ]
2013-10-31 16:54:35 LOG:   pid 18970: wd_send_response: receive add request
from DB02:5432 and accept it
2013-10-31 16:54:36 LOG:   pid 18973: check_pgpool_status_by_hb: lifecheck
failed. pgpool 1 (DB02:5432) seems not to be working
2013-10-31 16:54:36 LOG:   pid 18973: pgpool_down: DB02:5432 is going down
2013-10-31 16:54:48 LOG:   pid 18973: check_pgpool_status_by_hb: pgpool 1
(DB02:5432) is in down status
2013-10-31 16:55:00 LOG:   pid 18973: check_pgpool_status_by_hb: pgpool 1
(DB02:5432) is in down status

ここで出ているメッセージではServer1はServer2のpgpool-IIが起動していない（ダウン中）をいう事ですが、この時にpcp_watchdog_infoコマンドでステータスを確認したら下記のようになりました。

[postgres ＠ DB01 ~]$ pcp_watchdog_info 60 DB01 9898 postgres postgres
DB01 5432 9000 3
[postgres ＠ DB01 ~]$ pcp_watchdog_info 60 DB02 9898 postgres postgres
DB02 5432 9000 2

なぜ、この様な状態がおきるのでしょうか？
宜しくお願いします。
-------------- next part --------------
HTMLの添付ファイルを保管しました...
URL: <http://www.sraoss.jp/pipermail/pgpool-general-jp/attachments/20131031/c3d4663f/attachment-0001.html>