[pgpool-general-jp: 1332] Re: failover_commandの中でpcp_attach_nodeするとBackendError

sai kato tmi.katosai @ gmail.com
2014年 11月 28日 (金) 16:06:07 JST


石井様

お世話になっております。加藤です。

回答ありがとうございます。

>> →failover_commandの中でpcp_attach_nodeを実行してよいのか?
>
>結論から言うと、これがNGでした。

詳しい説明をありがとうございます。


>ただ、仮にフェイルオーバスクリプトの中でpcp_attach_nodeを呼び出すことが
>できたとしても、それはおすすめしません。理由は、以下です。

基本的には、pcp_attach_nodeは手動で実施した方がよいということですね。
石井様が書かれたようなリスクがあるとして、やはり自動でノードを復帰させたい場合、一連処理の中でノードを復帰させる方法はあるでしょうか?


お忙しいところすみませんが、よろしくお願い致します。



2014年11月26日 9:51 Tatsuo Ishii <ishii @ sraoss.co.jp>:

> 石井です。
>
> > →failover_commandの中でpcp_attach_nodeを実行してよいのか?
>
> 結論から言うと、これがNGでした。
>
> フェルオーバ/フェイルバックの処理を行う際、どのノードを処理するかのパラ
> メータを格納するエリアは、共有メモリに置かれているのですが、このエリア
> はリクエスト1個分の領域しか確保されておらず、
>
> 1) フェイルオーバリクエストによってこのエリアにノードIDが設定
> 2) failover scriptから起動されたpcp_attach_nodeがこのエリアにノードIDを設定、実行待ちに入る
> 3) 処理が終了した#1がこのエリアに-1を設定
> 4) #2が実行開始、しかしその頃にはノードIDが-1になっているのでエラー
>
> という流れでうまく動いていません。
>
> ただ、仮にフェイルオーバスクリプトの中でpcp_attach_nodeを呼び出すことが
> できたとしても、それはおすすめしません。理由は、以下です。
>
>
> a) 一度フェイルしたDBノードが再起動でうまく立ち上がるとは限らない
>
> b) 仮に立ち上がったとしても、プライマリノードと同期が取れているかどうか
>    疑問
>
> 要するに、再起動したDBノードの健全性が担保されていない限り、仕組みとし
> ては動いたとしても、業務システムとして目的を達成できるかどうかの保障が
> ない、というのが難点かと思います。
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
> > 石井様
> >
> > お世話になっております。加藤です。
> >
> > 以下は   PC2:DB3(スレーブ)を落とした場合のログです。
> > ■failover2.log
> > ---------------------------------
> >
> > 2014年 11月 25日 火曜日 10:04:14 JST
> > user[postgres] failed_node_id[2] failed_host_name[10.10.32.14]
> > failed_port[5434] failed_db_cluster[/home/pgsql3/data] ###
> old_master_id[0]
> > old_primary_node_id[1] ### new_master_id[0]
> > new_master_host_name[10.10.32.14] new_master_port_number[5432]
> > new_master_db_cluster[/home/pgsql/data]
> > ssh -l postgres -T 10.10.32.14 /usr/pgsql-9.3/bin/postmaster -i -p 5434
> -D
> > /home/pgsql3/data &
> > /usr/local/pgpool-3.3/bin/pcp_attach_node 5 localhost 9898 postgres
> > postgres 2
> > BackendError
> > ---------------------------------
> >
> > その際のpgpoolログは以下の通りです。
> > ■pgpool.log
> > ---------------------------------
> > 2014-11-25 10:04:05 DEBUG: pid 1890: num_fileds: 1
> > 2014-11-25 10:04:05 DEBUG: pid 1890: do_query: kind: D
> > 2014-11-25 10:04:05 DEBUG: pid 1890: do_query: data row received
> > 2014-11-25 10:04:05 DEBUG: pid 1890: do_query: kind: C
> > 2014-11-25 10:04:05 DEBUG: pid 1890: do_query: Command complete received
> > 2014-11-25 10:04:05 DEBUG: pid 1890: do_query: kind: Z
> > 2014-11-25 10:04:05 DEBUG: pid 1890: do_query: Ready for query
> > 2014-11-25 10:04:05 DEBUG: pid 1890: check_replication_time_lag: DB node
> is
> > valid but no persistent connection
> > 2014-11-25 10:04:05 ERROR: pid 1890: check_replication_time_lag: could
> not
> > connect to DB node 2, check sr_check_user and sr_check_password
> > 2014-11-25 10:04:14 DEBUG: pid 1651: starting health checking
> > 2014-11-25 10:04:14 DEBUG: pid 1651: health check: clearing alarm
> > 2014-11-25 10:04:14 DEBUG: pid 1651: health_check: 0 th DB node status: 2
> > 2014-11-25 10:04:14 DEBUG: pid 1651: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:14 DEBUG: pid 1651: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:14 DEBUG: pid 1651: s_do_auth: backend key data received
> > 2014-11-25 10:04:14 DEBUG: pid 1651: s_do_auth: transaction state: I
> > 2014-11-25 10:04:14 DEBUG: pid 1651: health_check: 1 th DB node status: 2
> > 2014-11-25 10:04:14 DEBUG: pid 1651: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:14 DEBUG: pid 1651: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:14 DEBUG: pid 1651: s_do_auth: backend key data received
> > 2014-11-25 10:04:14 DEBUG: pid 1651: s_do_auth: transaction state: I
> > 2014-11-25 10:04:14 DEBUG: pid 1651: health_check: 2 th DB node status: 1
> > 2014-11-25 10:04:14 ERROR: pid 1651: connect_inet_domain_socket:
> > getsockopt() detected error: Connection refused
> > 2014-11-25 10:04:14 ERROR: pid 1651: make_persistent_db_connection:
> > connection to 10.10.32.14(5434) failed
> > 2014-11-25 10:04:14 ERROR: pid 1651: health check failed. 2 th host
> > 10.10.32.14 at port 5434 is down
> > 2014-11-25 10:04:14 DEBUG: pid 1651: health check: clearing alarm
> > 2014-11-25 10:04:14 DEBUG: pid 1651: health check: clearing alarm
> > 2014-11-25 10:04:14 LOG:   pid 1651: set 2 th backend down status
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler called
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: starting to select
> > new master node
> > 2014-11-25 10:04:14 LOG:   pid 1651: starting degeneration. shutdown host
> > 10.10.32.14(5434)
> > 2014-11-25 10:04:14 LOG:   pid 1651: Restart all children
> > 2014-11-25 10:04:14 DEBUG: pid 1831: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1831
> > 2014-11-25 10:04:14 DEBUG: pid 1832: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1832
> > 2014-11-25 10:04:14 DEBUG: pid 1833: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1833
> > 2014-11-25 10:04:14 DEBUG: pid 1834: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1834
> > 2014-11-25 10:04:14 DEBUG: pid 1835: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1835
> > 2014-11-25 10:04:14 DEBUG: pid 1836: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1836
> > 2014-11-25 10:04:14 DEBUG: pid 1827: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1827
> > 2014-11-25 10:04:14 DEBUG: pid 1837: child received shutdown request
> signal
> > 3
> > 2014-11-25 10:04:14 DEBUG: pid 1651: failover_handler: kill 1837
> > 2014-11-25 10:04:14 LOG:   pid 1651: execute command:
> > /usr/local/pgpool-3.3/bin/failover.sh 2 10.10.32.14 5434
> /home/pgsql3/data
> > 0 0 10.10.32.14 1 5432 /home/pgsql/data
> > < 2014-11-25 10:04:14.644 JST >LOG:  could not create IPv6 socket:
> Address
> > family not supported by protocol
> > < 2014-11-25 10:04:14.694 JST >LOG:  redirecting log output to logging
> > collector process
> > < 2014-11-25 10:04:14.694 JST >HINT:  Future log output will appear in
> > directory "pg_log".
> > 2014-11-25 10:04:15 DEBUG: pid 1890: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: backend key data received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: transaction state: I
> > 2014-11-25 10:04:15 DEBUG: pid 1890: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: backend key data received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: transaction state: I
> > 2014-11-25 10:04:15 DEBUG: pid 1890: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: backend key data received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: s_do_auth: transaction state: I
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: extended:0 query:SELECT
> > pg_last_xlog_replay_location()
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: T
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: row description received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: num_fileds: 1
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: D
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: data row received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: C
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: Command complete received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: Z
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: Ready for query
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: extended:0 query:SELECT
> > pg_current_xlog_location()
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: T
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: row description received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: num_fileds: 1
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: D
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: data row received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: C
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: Command complete received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: Z
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: Ready for query
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: extended:0 query:SELECT
> > pg_last_xlog_replay_location()
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: T
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: row description received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: num_fileds: 1
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: D
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: data row received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: C
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: Command complete received
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: kind: Z
> > 2014-11-25 10:04:15 DEBUG: pid 1890: do_query: Ready for query
> > 2014-11-25 10:04:16 LOG:   pid 1889: send_failback_request: fail back 2
> th
> > node request from pid 1889
> > 2014-11-25 10:04:16 LOG:   pid 1651: find_primary_node_repeatedly:
> waiting
> > for finding a primary node
> > 2014-11-25 10:04:16 DEBUG: pid 1651: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:16 DEBUG: pid 1651: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:16 DEBUG: pid 1651: s_do_auth: backend key data received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: s_do_auth: transaction state: I
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: extended:0 query:SELECT
> > pg_is_in_recovery()
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: T
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: row description received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: num_fileds: 1
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: D
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: data row received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: C
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: Command complete received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: Z
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: Ready for query
> > 2014-11-25 10:04:16 DEBUG: pid 1651: find_primary_node: 0 node is standby
> > 2014-11-25 10:04:16 DEBUG: pid 1651: pool_ssl: SSL requested but SSL
> > support is not available
> > 2014-11-25 10:04:16 DEBUG: pid 1651: s_do_auth: auth kind: 0
> > 2014-11-25 10:04:16 DEBUG: pid 1651: s_do_auth: backend key data received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: s_do_auth: transaction state: I
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: extended:0 query:SELECT
> > pg_is_in_recovery()
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: T
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: row description received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: num_fileds: 1
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: D
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: data row received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: C
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: Command complete received
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: kind: Z
> > 2014-11-25 10:04:16 DEBUG: pid 1651: do_query: Ready for query
> > 2014-11-25 10:04:16 LOG:   pid 1651: find_primary_node: primary node id
> is 1
> > 2014-11-25 10:04:16 LOG:   pid 1651: failover: set new primary node: 1
> > 2014-11-25 10:04:16 LOG:   pid 1651: failover: set new master node: 0
> > 2014-11-25 10:04:16 DEBUG: pid 1907: I am 1907
> > 2014-11-25 10:04:16 DEBUG: pid 1907:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1907: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 DEBUG: pid 1906: I am 1906
> > 2014-11-25 10:04:16 DEBUG: pid 1906:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1906: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 DEBUG: pid 1908: I am 1908
> > 2014-11-25 10:04:16 DEBUG: pid 1908:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1908: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 DEBUG: pid 1905: I am 1905
> > 2014-11-25 10:04:16 DEBUG: pid 1905:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1905: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 DEBUG: pid 1910: I am 1910
> > 2014-11-25 10:04:16 DEBUG: pid 1910:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1910: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 DEBUG: pid 1911: I am 1911
> > 2014-11-25 10:04:16 DEBUG: pid 1911:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1911: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 DEBUG: pid 1912: I am 1912
> > 2014-11-25 10:04:16 DEBUG: pid 1912:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1912: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:16 LOG:   pid 1890: worker process received restart
> request
> > 2014-11-25 10:04:16 LOG:   pid 1651: failback done. reconnect host
> > 10.10.32.14(5434)
> > 2014-11-25 10:04:16 DEBUG: pid 1909: I am 1909
> > 2014-11-25 10:04:16 DEBUG: pid 1909:
> > pool_initialize_private_backend_status: initialize backend status
> > 2014-11-25 10:04:16 LOG:   pid 1909: do_child: failback event found.
> > restart myself.
> > 2014-11-25 10:04:17 LOG:   pid 1889: pcp child process received restart
> > request
> > 2014-11-25 10:04:17 LOG:   pid 1651: PCP child 1889 exits with status 256
> > in failover()
> > 2014-11-25 10:04:17 LOG:   pid 1651: fork a new PCP child pid 1913 in
> > failover()
> > 2014-11-25 10:04:17 DEBUG: pid 1651: health check: clearing alarm
> > 2014-11-25 10:04:17 DEBUG: pid 1651: failover_handler called
> > 2014-11-25 10:04:17 DEBUG: pid 1651: failover_handler: starting to select
> > new master node
> > 2014-11-25 10:04:17 ERROR: pid 1651: failover_handler: invalid node_id -1
> > MAX_NUM_BACKENDS: 128
> > 2014-11-25 10:04:17 DEBUG: pid 1651: reap_handler called
> > 2014-11-25 10:04:17 DEBUG: pid 1651: reap_handler: call wait3
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1827 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1831 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1832 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1833 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1834 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1835 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1836 exits with status 0
> > 2014-11-25 10:04:17 DEBUG: pid 1651: child 1837 exits with status 0
> > 2014-11-25 10:04:17 LOG:   pid 1651: worker child 1890 exits with status
> 256
> > 2014-11-25 10:04:17 LOG:   pid 1651: fork a new worker child pid 1914
> > ---------------------------------
> >
> >
> > お忙しいところすみませんが、よろしくお願い致します。
> >
> >
> > 2014年11月19日 9:11 Tatsuo Ishii <ishii @ sraoss.co.jp>:
> >
> >> 加藤様
> >>
> >> スタンバイノードが落ちた時のログ(/var/log/pgpool/failover2.log)を見せて
> >> いただけますか?
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese:http://www.sraoss.co.jp
> >>
> >> > 石井様
> >> >
> >> > お世話になっております。加藤です。
> >> >
> >> > ご回答ありがとうございます。
> >> > failover.shは以下の通りです。
> >> > ------------------------------------------------
> >> > #! /bin/sh
> >> > # Execute command by failover.
> >> > # special values:  %d = node id
> >> > #                  %h = host name
> >> > #                  %p = port number
> >> > #                  %D = database cluster path
> >> > #                  %m = new master node id
> >> > #                  %M = old master node id
> >> > #                  %H = new master node host name
> >> > #                  %P = old primary node id
> >> > #                  %R = new master database cluster path
> >> > #                  %r = new master port number
> >> > #                  %% = '%' character
> >> > failed_node_id=$1
> >> > failed_host_name=$2
> >> > failed_port=$3
> >> > failed_db_cluster=$4
> >> >
> >> > new_master_id=$5
> >> > old_master_id=$6
> >> > new_master_host_name=$7
> >> > old_primary_node_id=$8
> >> > new_master_port_number=$9
> >> > new_master_db_cluster=${10}
> >> >
> >> > log=/var/log/pgpool/failover2.log
> >> >
> >> > PGPOOL_HOME=/usr/pgsql-9.3
> >> > PG_CTL=$PGPOOL_HOME/bin/pg_ctl
> >> > POSTMASTER=$PGPOOL_HOME/bin/postmaster
> >> >
> >> > PCP_PORT=9898
> >> > PCP_SUPER_USER=postgres
> >> >
> >> >
> >> > echo "" >> $log
> >> > date >> $log
> >> > echo "user[`whoami`] failed_node_id[$failed_node_id]
> >> > failed_host_name[$failed_host_name] failed_port[$failed_port]
> >> > failed_db_cluster[$failed_db_cluster] ###
> old_master_id[$old_master_id]
> >> > old_primary_node_id[$old_primary_node_id] ###
> >> new_master_id[$new_master_id]
> >> > new_master_host_name[$new_master_host_name]
> >> > new_master_port_number[$new_master_port_number]
> >> > new_master_db_cluster[$new_master_db_cluster]" >> $log
> >> >
> >> > # masterが落ちた場合
> >> > if [ a"$failed_node_id" = a"$old_primary_node_id" ];then # master
> failed
> >> > echo "ssh -l postgres -T $new_master_host_name $PG_CTL -D
> >> > $new_master_db_cluster promote" >> $log
> >> > ssh -l postgres -T $new_master_host_name $PG_CTL -D
> >> $new_master_db_cluster
> >> > promote
> >> > else
> >> >
> >> > ############################
> >> > # スタンバイノードの起動
> >> > ############################
> >> > echo "ssh -l postgres -T $failed_host_name $POSTMASTER -i -p
> $failed_port
> >> > -D $failed_db_cluster &" >> $log
> >> > ssh -l postgres -T $failed_host_name $POSTMASTER -i -p $failed_port -D
> >> > $failed_db_cluster & >> $log 2>&1 < /dev/null
> >> > sleep 2
> >> >
> >> >
> >> > ############################
> >> > # スタンバイノードの復帰
> >> > ############################
> >> > echo "/usr/local/pgpool-3.3/bin/pcp_attach_node 5 localhost $PCP_PORT
> >> > $PCP_SUPER_USER $PCP_SUPER_USER $failed_node_id" >> $log
> >> > /usr/local/pgpool-3.3/bin/pcp_attach_node 5 localhost $PCP_PORT
> >> > $PCP_SUPER_USER $PCP_SUPER_USER ${failed_node_id} >> $log 2>&1
> >> >
> >> > fi
> >> > ------------------------------------------------
> >> >
> >> > お忙しいところすみませんが、ご確認いただけないでしょうか。
> >> > よろしくお願い致します。
> >> >
> >> >
> >> > 2014年11月13日 8:14 Tatsuo Ishii <ishii @ sraoss.co.jp>:
> >> >
> >> >> 石井です。
> >> >>
> >> >> 添付してあるfailover.shは抜粋だと思います。
> >> >> 全部見られたらなにか分かるかもしれません。
> >> >> --
> >> >> Tatsuo Ishii
> >> >> SRA OSS, Inc. Japan
> >> >> English: http://www.sraoss.co.jp/index_en.php
> >> >> Japanese:http://www.sraoss.co.jp
> >> >>
> >> >> From: sai kato <tmi.katosai @ gmail.com>
> >> >> Subject: [pgpool-general-jp: 1324]
> >> >> failover_commandの中でpcp_attach_nodeするとBackendError
> >> >> Date: Tue, 11 Nov 2014 10:58:34 +0900
> >> >> Message-ID: <
> >> >> CAEn1G_G712mFK0LFvxiZ+8c7t+NCk_wUqzWpmLG2FJCRTm+rqA @ mail.gmail.com>
> >> >>
> >> >> > はじめまして、加藤と申します。
> >> >> >
> >> >> >
> >> >> >
> >> >> > (文字化けしておりましたので、再送させていただきます)
> >> >> >
> >> >> > PgpoolⅡ3.3.4でエラーが発生しましたが、原因がわからないため、
> >> >> >
> >> >> > 困っております。
> >> >> >
> >> >> > ■環境
> >> >> >
> >> >> > OS:全てCentOS6.4
> >> >> >
> >> >> > DB:PostgreSQL9.3.4
> >> >> >
> >> >> > Pgpool:PgpoolⅡ3.3.4
> >> >> >
> >> >> >
> >> >> >
> >> >> > PC1:pgpool
> >> >> >
> >> >> >    |-------PC2:DB1(マスタ)
> >> >> >
> >> >> >    |-------PC2:DB2(スレーブ)
> >> >> >
> >> >> >    |-------PC2:DB3(スレーブ)
> >> >> >
> >> >> >
> >> >> >
> >> >> > DB1、DB2、DB3は同期レプリケーションをしており、
> >> >> >
> >> >> > DB1がマスタ、DB2,DB3はスレーブとなっております。
> >> >> >
> >> >> >
> >> >> >
> >> >> > ■現象
> >> >> >
> >> >> > マスタのDB1がフェールオーバした場合は、以下のコマンドが実行されました。
> >> >> >
> >> >> > ①failover_command
> >> >> >
> >> >> > ②follow_master_command
> >> >> >
> >> >> >
> >> >> >
> >> >> > コマンドの内容は以下の通りです。
> >> >> >
> >> >> > ①:スレーブを新マスタへ昇格
> >> >> >
> >> >> > ②:・旧マスタのrecovery.conf設定とDB起動、オンライン復帰
> >> >> >
> >> >> > ・他のスレーブのrecovery.conf設定とDB再起動、オンライン復帰
> >> >> >
> >> >> >
> >> >> >
> >> >> > 上記コマンドにより、DB1は自動でオンライン復帰することが出来ました。
> >> >> >
> >> >> >
> >> >> >
> >> >> > 次に、スレーブのDB2がフェールオーバした場合は、以下のコマンドが実行されました。
> >> >> >
> >> >> > ①failover_command
> >> >> >
> >> >> >
> >> >> >
> >> >> > failover_commandしか実行されないため、このコマンドにスレーブDBのリカバリの
> >> >> >
> >> >> > 処理をさせようと考えました。
> >> >> >
> >> >> > コマンドの内容は以下の通りです。
> >> >> >
> >> >> > ①:スレーブDBの起動、オンライン復帰
> >> >> >
> >> >> >
> >> >> >
> >> >> > ①のコマンドの中でオンライン復帰(pcp_attach_node)が実行されると、
> >> >> >
> >> >> > BackendErrorとなりました。
> >> >> >
> >> >> >
> >> >> >
> >> >> > エラーの後に、①で実施したpcp_attach_nodeをプロンプト上から実行すると、
> >> >> >
> >> >> > 成功しました。
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > ■pgpool.log
> >> >> >
> >> >> > ログは以下の通りです。
> >> >> >
> >> >> > 2014-10-31 16:04:56 LOG:   pid 1732: pcp child process received
> >> restart
> >> >> > request
> >> >> >
> >> >> > 2014-10-31 16:04:56 LOG:   pid 1573: PCP child 1732 exits with
> status
> >> 256
> >> >> > in failover()
> >> >> >
> >> >> > 2014-10-31 16:04:56 LOG:   pid 1573: fork a new PCP child pid 1788
> in
> >> >> > failover()
> >> >> >
> >> >> > 2014-10-31 16:04:56 DEBUG: pid 1573: health check: clearing alarm
> >> >> >
> >> >> > 2014-10-31 16:04:56 DEBUG: pid 1573: failover_handler called
> >> >> >
> >> >> > 2014-10-31 16:04:56 DEBUG: pid 1573: failover_handler: starting to
> >> select
> >> >> > new master node
> >> >> >
> >> >> > 2014-10-31 16:04:56 ERROR: pid 1573: failover_handler: invalid
> >> node_id -1
> >> >> > MAX_NUM_BACKENDS: 128
> >> >> >
> >> >> > 2014-10-31 16:04:56 DEBUG: pid 1573: reap_handler called
> >> >> >
> >> >> > 2014-10-31 16:04:56 DEBUG: pid 1573: reap_handler: call wait3
> >> >> >
> >> >> > 2014-10-31 16:04:56 LOG:   pid 1573: worker child 1733 exits with
> >> status
> >> >> 256
> >> >> >
> >> >> > 2014-10-31 16:04:56 LOG:   pid 1573: fork a new worker child pid
> 1789
> >> >> >
> >> >> >
> >> >> >
> >> >> > ログのエラー内容からみると、pcp_attach_nodeに指定したノードIDが不正のようです。
> >> >> >
> >> >> >
> >> >> >
> >> >> > ■pgpool.conf
> >> >> >
> >> >> > 設定は以下の通りです。
> >> >> >
> >> >> > replication_mode = off
> >> >> >
> >> >> > load_balance_mode = on
> >> >> >
> >> >> > connection_cache = off
> >> >> >
> >> >> > master_slave_mode = on
> >> >> >
> >> >> > master_slave_sub_mode = 'stream'
> >> >> >
> >> >> > sr_check_period = 10
> >> >> >
> >> >> > sr_check_user = 'postgres'
> >> >> >
> >> >> > sr_check_password = 'postgres'
> >> >> >
> >> >> > delay_threshold = 10000000
> >> >> >
> >> >> > backend_hostname0 = 'xxxx' ←PC2のIP
> >> >> >
> >> >> > backend_port0 = 5432
> >> >> >
> >> >> > backend_weight0 = 1
> >> >> >
> >> >> > backend_hostname1 = 'xxxx' ←PC2のIP
> >> >> >
> >> >> > backend_port1 = 5433
> >> >> >
> >> >> > backend_weight1 = 1
> >> >> >
> >> >> > backend_hostname2 = 'xxxx' ←PC2のIP
> >> >> >
> >> >> > backend_port2 = 5434
> >> >> >
> >> >> > backend_weight2 = 1
> >> >> >
> >> >> > follow_master_command = '/usr/local/pgpool-3.3/bin/follow_master.sh
> >> %d %h
> >> >> > %p %D %m %M %H %P %r %R'
> >> >> >
> >> >> > failover_command = '/usr/local/pgpool-3.3/bin/failover.sh %d %h %p
> %D
> >> %m
> >> >> %M
> >> >> > %H %P %r %R'
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > ■failover.sh
> >> >> >
> >> >> > ############################
> >> >> >
> >> >> > # スタンバイノードの起動
> >> >> >
> >> >> > ############################
> >> >> >
> >> >> > ssh -l postgres -T $failed_host_name $POSTMASTER -i -p
> $failed_port -D
> >> >> > $failed_db_cluster & >> $log 2>&1 < /dev/null
> >> >> >
> >> >> > sleep 2
> >> >> >
> >> >> >
> >> >> >
> >> >> > ############################
> >> >> >
> >> >> > # スタンバイノードの復帰
> >> >> >
> >> >> > ############################
> >> >> >
> >> >> > /usr/local/pgpool-3.3/bin/pcp_attach_node 5 localhost $PCP_PORT
> >> >> > $PCP_SUPER_USER $PCP_SUPER_USER $failed_node_id >> $log 2>&1
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > ■質問
> >> >> >
> >> >> > ①pcp_attach_nodeが失敗する原因は何でしょうか?対応方法はあるでしょうか?
> >> >> >
> >> >> > ②スレーブDBのフェールオーバ時の対応は上記で合っているでしょうか?
> >> >> >
> >> >> > →failover_commandの中でpcp_attach_nodeを実行してよいのか?
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> > ご存知の方がおられましたらご教授お願い致します。
> >> >> >
> >> >> > お手数ですが、よろしくお願いします。
> >> >>
> >>
>
-------------- next part --------------
HTML$B$NE:IU%U%!%$%k$rJ]4I$7$^$7$?(B...
URL: <http://www.sraoss.jp/pipermail/pgpool-general-jp/attachments/20141128/12626cad/attachment-0001.html>


pgpool-general-jp メーリングリストの案内