[pgpool-general-jp: 1547] failover時に渡されるold primary node idの値について

2018年 6月 6日 (水) 14:11:29 JST

はじめまして。
畑と申します。よろしくお願い致します。

failover_commandにて指定する引数"%P"に渡される値について
質問をさせてください。

node:0がstandby、node:1がprimaryの状態でnode:1の
PostgreSQLを停止させると、"%P"に"0"が指定されています。

以下の手順でfailoverを発生させた場合に上記のようになりました。

1) node:0がprimary(up)、node:1がstandby(up)でPostgreSQLを停止
2) failoverスクリプトがnode:1のPostgreSQLをpromote
   →%Pには"0"が指定
   →node:0がstandby(down)、node:1がprimary(up)の状態を確認
3) node:0でpg_basebackupを実行しnode:1と同期
   →同期後、recovery.confを作成
   →postgresql.confのhot_standby_feedbackをonに変更
4) node:0のPostgreSQLを起動
5) pcp_attach_nodeコマンドでnode:0をpgpool-IIの管理ノードに追加
   →node:0がstandby(up)、node:1がprimary(up)の状態を確認
6) node:1のPostgreSQLを停止
7) failoverスクリプトで取得した%Pの値が"0"
   →%dの値は"1"が指定されておりました

7)では%Pは"1"が指定されるものと考えておりましたが、
そもそもの認識の間違いなのでしょうか？
それとも、設定もしくは環境の問題なのでしょうか？

ご教示いただけますと幸いです。

OS、PostgreSQL、pgpool-IIの構成
--------------
OS : CentOS 7.4
PostgreSQL : 10.3
pgpool-II : 3.7.3
--------------

サーバ構成
--------------
・node:0、node:1は別筐体のサーバ
　　node:0 → 10.252.0.212
　　node:1 → 10.252.0.213
・PostgreSQLは
　　node:0 → primary
　　node:1 → standby
　となるストリーミングレプリケーション構成でセットアップ
・pgpool-IIはwatchdogで両ノードで稼働（node:0がmaster）
--------------

pgpool.confのfailover_commandの設定
--------------
failover_command = '/bin/sh /home/pgpool/replication/failover.sh %d %P %H %R'
--------------

PostgreSQLの停止コマンド
--------------
pg_ctl stop -D <cluster path> -m immediate
--------------

failover時のpgpoolのログ
●node:0がprimary、node:1がstandbyの場合
--------------
2018-06-06 13:07:18: pid 23755: LOG:  new IPC connection received
2018-06-06 13:07:18: pid 23755: LOG:  watchdog received the failover
command from local pgpool-II on IPC interface
2018-06-06 13:07:18: pid 23755: LOG:  watchdog is processing the
failover command [DEGENERATE_BACKEND_REQUEST] received from local
pgpool-II on IPC interface
2018-06-06 13:07:18: pid 23755: LOG:  we have got the consensus to
perform the failover
2018-06-06 13:07:18: pid 23755: DETAIL:  1 node(s) voted in the favor
2018-06-06 13:07:18: pid 23752: LOG:  Pgpool-II parent process has
received failover request
2018-06-06 13:07:18: pid 23755: LOG:  new IPC connection received
2018-06-06 13:07:18: pid 23755: LOG:  received the failover indication
from Pgpool-II on IPC interface
2018-06-06 13:07:18: pid 23755: LOG:  watchdog is informed of failover
end by the main process
2018-06-06 13:07:18: pid 23752: LOG:  starting degeneration. shutdown
host 10.252.0.212(5432)
2018-06-06 13:07:18: pid 23755: LOG:  watchdog received the failover
command from remote pgpool-II node "10.252.0.213:9999 Linux
tk9-n5z-568-18-r330.directorz.internal"
2018-06-06 13:07:18: pid 23755: LOG:  watchdog is processing the
failover command [DEGENERATE_BACKEND_REQUEST] received from
10.252.0.213:9999 Linux tk9-n5z-568-18-r330.directorz.internal
2018-06-06 13:07:18: pid 23755: LOG:  we have got the consensus to
perform the failover
2018-06-06 13:07:18: pid 23755: DETAIL:  1 node(s) voted in the favor
2018-06-06 13:07:18: pid 23755: LOG:  invalid degenerate backend
request, node id : 0 status: [3] is not valid for failover
2018-06-06 13:07:18: pid 23752: LOG:  Restart all children
2018-06-06 13:07:18: pid 23752: LOG:  execute command: /bin/sh
/home/pgpool/replication/failover.sh 0 0 10.252.0.213 /database/pgdata
2018/06/06 13:07:18 : start failover script(failover.sh)

Failed node : 0
Currenr master node : 0
New master server : 10.252.0.213
New master cluster path : /database/pgdata


2018/06/06 13:07:18 : node:0 status
2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30038 socket=7
Hostname   : 10.252.0.212
Port       : 5432
Status     : 3
Weight     : 0.500000
Status Name: down
Role       : primary
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30038 exit
with SUCCESS.
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30038
exits with status 0
2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30040 socket=7
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30040 exit
with SUCCESS.
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30040
exits with status 0
2018/06/06 13:07:18 : node:1 status
2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30054 socket=7
Hostname   : 10.252.0.213
Port       : 5432
Status     : 2
Weight     : 0.500000
Status Name: up
Role       : standby
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30054 exit
with SUCCESS.
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30054
exits with status 0
2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30056 socket=7
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30056 exit
with SUCCESS.
2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30056
exits with status 0
2018/06/06 13:07:18 : node:0 is down
2018/06/06 13:07:18 : promote command : /usr/bin/ssh -i
/home/pgpool/.ssh/replication.pk -T postgres ＠ 10.252.0.213
/usr/local/pgsql/bin/pg_ctl promote -D /database/pgdata
2018/06/06 13:07:20 : 10.252.0.213 has been promoted from slave to temp master
2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node_repeatedly:
waiting for finding a primary node
2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node: checking backend no 0
2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node: checking backend no 1
2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node: primary node id is 1
2018-06-06 13:07:20: pid 23752: LOG:  failover: set new primary node: 1
2018-06-06 13:07:20: pid 23752: LOG:  failover: set new master node: 1
--------------

●node:0がstandby、node:1がprimaryの場合
--------------
2018-06-06 13:19:43: pid 23755: LOG:  new IPC connection received
2018-06-06 13:19:43: pid 23755: LOG:  received the failover indication
from Pgpool-II on IPC interface
2018-06-06 13:19:43: pid 23755: LOG:  watchdog is informed of failover
end by the main process
2018-06-06 13:19:43: pid 23752: LOG:  starting degeneration. shutdown
host 10.252.0.213(5432)
2018-06-06 13:19:43: pid 23752: LOG:  Do not restart children because
we are switching over node id 1 host: 10.252.0.213 port: 432 and we
are in streaming replication mode
2018-06-06 13:19:43: pid 23752: LOG:  execute command: /bin/sh
/home/pgpool/replication/failover.sh 1 0 10.252.0.212 /database/gdata
2018/06/06 13:19:43 : start failover script(failover.sh)

Failed node : 1
Currenr master node : 0
New master server : 10.252.0.212
New master cluster path : /database/pgdata


2018/06/06 13:19:43 : node:0 status
2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31634 socket=7
Hostname   : 10.252.0.212
Port       : 5432
Status     : 2
Weight     : 0.500000
Status Name: up
Role       : standby
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31634 exit
with SUCCESS.
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31634
exits with status 0
2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31636 socket=7
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31636 exit
with SUCCESS.
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31636
exits with status 0
2018/06/06 13:19:43 : node:1 status
2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31650 socket=7
Hostname   : 10.252.0.213
Port       : 5432
Status     : 3
Weight     : 0.500000
Status Name: down
Role       : primary
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31650 exit
with SUCCESS.
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31650
exits with status 0
2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31652 socket=7
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31652 exit
with SUCCESS.
2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31652
exits with status 0
2018/06/06 13:19:43 : node:1 is down
2018/06/06 13:19:43 : promote command : /usr/bin/ssh -i
/home/pgpool/.ssh/replication.pk -T postgres ＠ 10.252.0.212
/usr/local/pgql/bin/pg_ctl promote -D /database/pgdata
2018-06-06 13:19:43: pid 23846: LOG:  failed to connect to PostgreSQL
server on "10.252.0.213:5432", getsockopt() detected erro "Connection
refused"
2018-06-06 13:19:43: pid 23846: ERROR:  failed to make persistent db connection
2018-06-06 13:19:43: pid 23846: DETAIL:  connection to
host:"10.252.0.213:5432" failed
2018-06-06 13:19:43: pid 23846: LOG:  health check failed on node 1 (timeout:0)
2018-06-06 13:19:43: pid 23846: LOG:  received degenerate backend
request for node_id: 1 from pid [23846]
2018-06-06 13:19:43: pid 23755: LOG:  new IPC connection received
2018-06-06 13:19:43: pid 23755: LOG:  watchdog received the failover
command from local pgpool-II on IPC interface
2018-06-06 13:19:43: pid 23755: LOG:  watchdog is processing the
failover command [DEGENERATE_BACKEND_REQUEST] received from loal
pgpool-II on IPC interface
2018-06-06 13:19:43: pid 23755: LOG:  we have got the consensus to
perform the failover
2018-06-06 13:19:43: pid 23755: DETAIL:  1 node(s) voted in the favor
2018-06-06 13:19:44: pid 31383: LOG:  failed to connect to PostgreSQL
server on "10.252.0.213:5432", getsockopt() detected erro "Connection
refused"
2018-06-06 13:19:44: pid 31383: ERROR:  failed to make persistent db connection
2018-06-06 13:19:44: pid 31383: DETAIL:  connection to
host:"10.252.0.213:5432" failed
2018/06/06 13:19:45 : 10.252.0.212 has been promoted from slave to temp master
2018-06-06 13:19:45: pid 23752: LOG:  find_primary_node_repeatedly:
waiting for finding a primary node
2018-06-06 13:19:45: pid 23752: LOG:  find_primary_node: checking backend no 0
2018-06-06 13:19:45: pid 23752: LOG:  find_primary_node: primary node id is 0
2018-06-06 13:19:45: pid 23752: LOG:  failover: set new primary node: 0
2018-06-06 13:19:45: pid 23752: LOG:  failover: set new master node: 0
--------------

※failover.sh内でpcp_node_infoコマンドで各ノードの状態をログに出力しております。
※failover.shでは引数の値ではなく、node:0、node:1の状態をpcp_node_infoで取得して
　promoteが必要かを判断するようにしております。

以上、よろしくお願いします。

-- 
Katsuay HATA <hata ＠ cellant.jp>
CELLANT Corp.