[pgpool-general-jp: 1548] Re: failover時に渡されるold primary node idの値について

Bo Peng pengbo @ sraoss.co.jp
2018年 6月 6日 (水) 18:31:51 JST


彭です。

ご報告ありがとうございます。

> 7)では%Pは"1"が指定されるものと考えておりましたが、
> そもそもの認識の間違いなのでしょうか?
> それとも、設定もしくは環境の問題なのでしょうか?

ご報告頂いている事象は認識済の不具合です。
修正はすでにコミット済で、6/12(火) に 3.7.4 として
リリースされる予定です。

  https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=92cbc38ea4679af7ca5e6a7e4ca0d667995c0ac7

以上、よろしくお願いします。

On Wed, 6 Jun 2018 14:11:29 +0900
Katsuya Hata <hata @ cellant.jp> wrote:

> はじめまして。
> 畑と申します。よろしくお願い致します。
> 
> failover_commandにて指定する引数"%P"に渡される値について
> 質問をさせてください。
> 
> node:0がstandby、node:1がprimaryの状態でnode:1の
> PostgreSQLを停止させると、"%P"に"0"が指定されています。
> 
> 以下の手順でfailoverを発生させた場合に上記のようになりました。
> 
> 1) node:0がprimary(up)、node:1がstandby(up)でPostgreSQLを停止
> 2) failoverスクリプトがnode:1のPostgreSQLをpromote
>    →%Pには"0"が指定
>    →node:0がstandby(down)、node:1がprimary(up)の状態を確認
> 3) node:0でpg_basebackupを実行しnode:1と同期
>    →同期後、recovery.confを作成
>    →postgresql.confのhot_standby_feedbackをonに変更
> 4) node:0のPostgreSQLを起動
> 5) pcp_attach_nodeコマンドでnode:0をpgpool-IIの管理ノードに追加
>    →node:0がstandby(up)、node:1がprimary(up)の状態を確認
> 6) node:1のPostgreSQLを停止
> 7) failoverスクリプトで取得した%Pの値が"0"
>    →%dの値は"1"が指定されておりました
> 
> 7)では%Pは"1"が指定されるものと考えておりましたが、
> そもそもの認識の間違いなのでしょうか?
> それとも、設定もしくは環境の問題なのでしょうか?
> 
> ご教示いただけますと幸いです。
> 
> OS、PostgreSQL、pgpool-IIの構成
> --------------
> OS : CentOS 7.4
> PostgreSQL : 10.3
> pgpool-II : 3.7.3
> --------------
> 
> サーバ構成
> --------------
> ・node:0、node:1は別筐体のサーバ
>   node:0 → 10.252.0.212
>   node:1 → 10.252.0.213
> ・PostgreSQLは
>   node:0 → primary
>   node:1 → standby
>  となるストリーミングレプリケーション構成でセットアップ
> ・pgpool-IIはwatchdogで両ノードで稼働(node:0がmaster)
> --------------
> 
> pgpool.confのfailover_commandの設定
> --------------
> failover_command = '/bin/sh /home/pgpool/replication/failover.sh %d %P %H %R'
> --------------
> 
> PostgreSQLの停止コマンド
> --------------
> pg_ctl stop -D <cluster path> -m immediate
> --------------
> 
> failover時のpgpoolのログ
> ●node:0がprimary、node:1がstandbyの場合
> --------------
> 2018-06-06 13:07:18: pid 23755: LOG:  new IPC connection received
> 2018-06-06 13:07:18: pid 23755: LOG:  watchdog received the failover
> command from local pgpool-II on IPC interface
> 2018-06-06 13:07:18: pid 23755: LOG:  watchdog is processing the
> failover command [DEGENERATE_BACKEND_REQUEST] received from local
> pgpool-II on IPC interface
> 2018-06-06 13:07:18: pid 23755: LOG:  we have got the consensus to
> perform the failover
> 2018-06-06 13:07:18: pid 23755: DETAIL:  1 node(s) voted in the favor
> 2018-06-06 13:07:18: pid 23752: LOG:  Pgpool-II parent process has
> received failover request
> 2018-06-06 13:07:18: pid 23755: LOG:  new IPC connection received
> 2018-06-06 13:07:18: pid 23755: LOG:  received the failover indication
> from Pgpool-II on IPC interface
> 2018-06-06 13:07:18: pid 23755: LOG:  watchdog is informed of failover
> end by the main process
> 2018-06-06 13:07:18: pid 23752: LOG:  starting degeneration. shutdown
> host 10.252.0.212(5432)
> 2018-06-06 13:07:18: pid 23755: LOG:  watchdog received the failover
> command from remote pgpool-II node "10.252.0.213:9999 Linux
> tk9-n5z-568-18-r330.directorz.internal"
> 2018-06-06 13:07:18: pid 23755: LOG:  watchdog is processing the
> failover command [DEGENERATE_BACKEND_REQUEST] received from
> 10.252.0.213:9999 Linux tk9-n5z-568-18-r330.directorz.internal
> 2018-06-06 13:07:18: pid 23755: LOG:  we have got the consensus to
> perform the failover
> 2018-06-06 13:07:18: pid 23755: DETAIL:  1 node(s) voted in the favor
> 2018-06-06 13:07:18: pid 23755: LOG:  invalid degenerate backend
> request, node id : 0 status: [3] is not valid for failover
> 2018-06-06 13:07:18: pid 23752: LOG:  Restart all children
> 2018-06-06 13:07:18: pid 23752: LOG:  execute command: /bin/sh
> /home/pgpool/replication/failover.sh 0 0 10.252.0.213 /database/pgdata
> 2018/06/06 13:07:18 : start failover script(failover.sh)
> 
> Failed node : 0
> Currenr master node : 0
> New master server : 10.252.0.213
> New master cluster path : /database/pgdata
> 
> 
> 2018/06/06 13:07:18 : node:0 status
> 2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30038 socket=7
> Hostname   : 10.252.0.212
> Port       : 5432
> Status     : 3
> Weight     : 0.500000
> Status Name: down
> Role       : primary
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30038 exit
> with SUCCESS.
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30038
> exits with status 0
> 2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30040 socket=7
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30040 exit
> with SUCCESS.
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30040
> exits with status 0
> 2018/06/06 13:07:18 : node:1 status
> 2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30054 socket=7
> Hostname   : 10.252.0.213
> Port       : 5432
> Status     : 2
> Weight     : 0.500000
> Status Name: up
> Role       : standby
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30054 exit
> with SUCCESS.
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30054
> exits with status 0
> 2018-06-06 13:07:18: pid 24351: LOG:  forked new pcp worker, pid=30056 socket=7
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30056 exit
> with SUCCESS.
> 2018-06-06 13:07:18: pid 24351: LOG:  PCP process with pid: 30056
> exits with status 0
> 2018/06/06 13:07:18 : node:0 is down
> 2018/06/06 13:07:18 : promote command : /usr/bin/ssh -i
> /home/pgpool/.ssh/replication.pk -T postgres @ 10.252.0.213
> /usr/local/pgsql/bin/pg_ctl promote -D /database/pgdata
> 2018/06/06 13:07:20 : 10.252.0.213 has been promoted from slave to temp master
> 2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node_repeatedly:
> waiting for finding a primary node
> 2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node: checking backend no 0
> 2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node: checking backend no 1
> 2018-06-06 13:07:20: pid 23752: LOG:  find_primary_node: primary node id is 1
> 2018-06-06 13:07:20: pid 23752: LOG:  failover: set new primary node: 1
> 2018-06-06 13:07:20: pid 23752: LOG:  failover: set new master node: 1
> --------------
> 
> ●node:0がstandby、node:1がprimaryの場合
> --------------
> 2018-06-06 13:19:43: pid 23755: LOG:  new IPC connection received
> 2018-06-06 13:19:43: pid 23755: LOG:  received the failover indication
> from Pgpool-II on IPC interface
> 2018-06-06 13:19:43: pid 23755: LOG:  watchdog is informed of failover
> end by the main process
> 2018-06-06 13:19:43: pid 23752: LOG:  starting degeneration. shutdown
> host 10.252.0.213(5432)
> 2018-06-06 13:19:43: pid 23752: LOG:  Do not restart children because
> we are switching over node id 1 host: 10.252.0.213 port: 432 and we
> are in streaming replication mode
> 2018-06-06 13:19:43: pid 23752: LOG:  execute command: /bin/sh
> /home/pgpool/replication/failover.sh 1 0 10.252.0.212 /database/gdata
> 2018/06/06 13:19:43 : start failover script(failover.sh)
> 
> Failed node : 1
> Currenr master node : 0
> New master server : 10.252.0.212
> New master cluster path : /database/pgdata
> 
> 
> 2018/06/06 13:19:43 : node:0 status
> 2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31634 socket=7
> Hostname   : 10.252.0.212
> Port       : 5432
> Status     : 2
> Weight     : 0.500000
> Status Name: up
> Role       : standby
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31634 exit
> with SUCCESS.
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31634
> exits with status 0
> 2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31636 socket=7
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31636 exit
> with SUCCESS.
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31636
> exits with status 0
> 2018/06/06 13:19:43 : node:1 status
> 2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31650 socket=7
> Hostname   : 10.252.0.213
> Port       : 5432
> Status     : 3
> Weight     : 0.500000
> Status Name: down
> Role       : primary
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31650 exit
> with SUCCESS.
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31650
> exits with status 0
> 2018-06-06 13:19:43: pid 31382: LOG:  forked new pcp worker, pid=31652 socket=7
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31652 exit
> with SUCCESS.
> 2018-06-06 13:19:43: pid 31382: LOG:  PCP process with pid: 31652
> exits with status 0
> 2018/06/06 13:19:43 : node:1 is down
> 2018/06/06 13:19:43 : promote command : /usr/bin/ssh -i
> /home/pgpool/.ssh/replication.pk -T postgres @ 10.252.0.212
> /usr/local/pgql/bin/pg_ctl promote -D /database/pgdata
> 2018-06-06 13:19:43: pid 23846: LOG:  failed to connect to PostgreSQL
> server on "10.252.0.213:5432", getsockopt() detected erro "Connection
> refused"
> 2018-06-06 13:19:43: pid 23846: ERROR:  failed to make persistent db connection
> 2018-06-06 13:19:43: pid 23846: DETAIL:  connection to
> host:"10.252.0.213:5432" failed
> 2018-06-06 13:19:43: pid 23846: LOG:  health check failed on node 1 (timeout:0)
> 2018-06-06 13:19:43: pid 23846: LOG:  received degenerate backend
> request for node_id: 1 from pid [23846]
> 2018-06-06 13:19:43: pid 23755: LOG:  new IPC connection received
> 2018-06-06 13:19:43: pid 23755: LOG:  watchdog received the failover
> command from local pgpool-II on IPC interface
> 2018-06-06 13:19:43: pid 23755: LOG:  watchdog is processing the
> failover command [DEGENERATE_BACKEND_REQUEST] received from loal
> pgpool-II on IPC interface
> 2018-06-06 13:19:43: pid 23755: LOG:  we have got the consensus to
> perform the failover
> 2018-06-06 13:19:43: pid 23755: DETAIL:  1 node(s) voted in the favor
> 2018-06-06 13:19:44: pid 31383: LOG:  failed to connect to PostgreSQL
> server on "10.252.0.213:5432", getsockopt() detected erro "Connection
> refused"
> 2018-06-06 13:19:44: pid 31383: ERROR:  failed to make persistent db connection
> 2018-06-06 13:19:44: pid 31383: DETAIL:  connection to
> host:"10.252.0.213:5432" failed
> 2018/06/06 13:19:45 : 10.252.0.212 has been promoted from slave to temp master
> 2018-06-06 13:19:45: pid 23752: LOG:  find_primary_node_repeatedly:
> waiting for finding a primary node
> 2018-06-06 13:19:45: pid 23752: LOG:  find_primary_node: checking backend no 0
> 2018-06-06 13:19:45: pid 23752: LOG:  find_primary_node: primary node id is 0
> 2018-06-06 13:19:45: pid 23752: LOG:  failover: set new primary node: 0
> 2018-06-06 13:19:45: pid 23752: LOG:  failover: set new master node: 0
> --------------
> 
> ※failover.sh内でpcp_node_infoコマンドで各ノードの状態をログに出力しております。
> ※failover.shでは引数の値ではなく、node:0、node:1の状態をpcp_node_infoで取得して
>  promoteが必要かを判断するようにしております。
> 
> 以上、よろしくお願いします。
> 
> -- 
> Katsuay HATA <hata @ cellant.jp>
> CELLANT Corp.
> _______________________________________________
> pgpool-general-jp mailing list
> pgpool-general-jp @ sraoss.jp
> http://www.sraoss.jp/mailman/listinfo/pgpool-general-jp
> 


-- 
Bo Peng <pengbo @ sraoss.co.jp>
SRA OSS, Inc. Japan



pgpool-general-jp メーリングリストの案内