[pgpool-general-jp: 1281] pgpool-II watchdog: unable to shut down the active server
taro tanaka
super.fool.1984 @ gmail.com
Wed Jun 11 17:04:59 JST 2014
Hello, my name is Kamata.
I am running pgpool-II in the following two-server environment.
-------------------------------------------
■ Server 1
ip:192.168.100.76
pgpool-II 3.3.3
postgresql 9.3.4
CentOS 6.5 (64bit)
■ Server 2
ip:192.168.100.79
pgpool-II 3.3.3
postgresql 9.3.4
CentOS 6.5 (64bit)
Virtual IP: 192.168.100.72
-------------------------------------------
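On each server, pgpool is started as root with "use_watchdog" set to on, and the two pgpool instances monitor each other using the heartbeat lifecheck method.
When I try to stop pgpool on the active server with the "pgpool stop" command,
the message "stop request sent to pgpool. waiting for termination......" is printed and the command never finishes, no matter how long I wait.
The logs on each server are shown below (in this run, server 1 was the active server).
■ Server 1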
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2014-06-11 15:54:48 LOG: pid 6310: wd_chk_setuid all commands have setuid bit
2014-06-11 15:54:48 LOG: pid 6310: watchdog might call network commands which using setuid bit.
2014-06-11 15:54:48 LOG: pid 6310: wd_create_send_socket: connect() reports failure (Connection refused). You can safely ignore this while starting up.
2014-06-11 15:54:48 LOG: pid 6310: send_packet_4_nodes: packet for 192.168.100.79:9000 is canceled
2014-06-11 15:55:00 LOG: pid 6310: wd_escalation: escalating to master pgpool
2014-06-11 15:55:04 LOG: pid 6310: wd_IP_up: ifconfig up succeeded
2014-06-11 15:55:04 LOG: pid 6310: wd_escalation: escalated to master pgpool successfully
2014-06-11 15:55:04 LOG: pid 6310: wd_init: start watchdog
2014-06-11 15:55:04 LOG: pid 6310: pgpool-II successfully started. version 3.3.3 (tokakiboshi)
2014-06-11 15:55:05 LOG: pid 6319: wd_create_hb_send_socket: set SO_REUSEPORT
2014-06-11 15:55:05 LOG: pid 6319: wd_create_hb_recv_socket: set SO_REUSEPORT
2014-06-11 15:55:05 LOG: pid 6320: wd_create_hb_send_socket: set SO_REUSEPORT
2014-06-11 15:55:05 LOG: pid 6320: wd_create_hb_send_socket: set SO_REUSEPORT
2014-06-11 15:55:28 LOG: pid 6318: wd_send_response: receive add request from 192.168.100.79:9999 and accept it
2014-06-11 15:56:45 LOG: pid 6321: watchdog: lifecheck started
(pgpool on server 1 was stopped here)
2014-06-11 16:02:11 LOG: pid 6310: received smart shutdown request
2014-06-11 16:02:11 LOG: pid 6310: pgpool main: close listen socket
2014-06-11 16:02:11 LOG: pid 6326: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6328: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6341: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6324: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6352: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6325: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6329: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6342: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6340: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6333: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6335: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6330: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6351: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6337: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6343: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6339: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6334: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6336: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6331: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6350: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6338: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6357: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6356: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6347: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6346: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6332: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6349: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6345: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6358: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6348: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6360: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6361: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6366: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6364: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6371: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6368: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6365: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6363: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6370: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6369: die: close listen socket
2014-06-11 16:02:11 LOG: pid 6367: die: close listen socket
2014-06-11 16:02:14 LOG: pid 6321: wd_IP_down: ifconfig down succeeded
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
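For reference, the shutdown sequence in the server 1 log can be summarized with a short script. This is only a minimal Python sketch and not part of pgpool-II: the names `parse_line` and `last_message_per_pid` are made up for illustration, and it assumes each entry starts with `date time LEVEL: pid N:` on one physical line.

```python
import re
from datetime import datetime

# Assumed entry layout: "YYYY-MM-DD HH:MM:SS LEVEL: pid N: message"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+): pid (\d+): (.*)$"
)

def parse_line(line):
    """Return (timestamp, level, pid, message), or None for non-log lines."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
    return ts, m.group(2), int(m.group(3)), m.group(4)

def last_message_per_pid(lines, marker="received smart shutdown request"):
    """Map each pid to its last message logged at or after the marker entry."""
    seen_marker = False
    last = {}
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # annotation or wrapped fragment, not a log entry
        _, _, pid, message = entry
        if marker in message:
            seen_marker = True
        if seen_marker:
            last[pid] = message
    return last

# A few entries copied from the server 1 log above.
SAMPLE = """\
2014-06-11 15:56:45 LOG: pid 6321: watchdog: lifecheck started
2014-06-11 16:02:11 LOG: pid 6310: received smart shutdown request
2014-06-11 16:02:11 LOG: pid 6310: pgpool main: close listen socket
2014-06-11 16:02:11 LOG: pid 6326: die: close listen socket
2014-06-11 16:02:14 LOG: pid 6321: wd_IP_down: ifconfig down succeeded
""".splitlines()

summary = last_message_per_pid(SAMPLE)
for pid, message in summary.items():
    print(pid, message)
```

In the log above, the last entry after the stop request comes from pid 6321, the same process that logged "watchdog: lifecheck started".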
■ Server 2
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
2014-06-11 15:55:27 LOG: pid 12619: wd_chk_setuid all commands have setuid bit
2014-06-11 15:55:27 LOG: pid 12619: watchdog might call network commands which using setuid bit.
2014-06-11 15:55:28 LOG: pid 12619: wd_init: start watchdog
2014-06-11 15:55:28 LOG: pid 12619: pgpool-II successfully started. version 3.3.3 (tokakiboshi)
2014-06-11 15:55:29 LOG: pid 12622: wd_create_hb_send_socket: set SO_REUSEPORT
2014-06-11 15:55:29 LOG: pid 12622: wd_create_hb_recv_socket: set SO_REUSEPORT
2014-06-11 15:55:29 LOG: pid 12623: wd_create_hb_send_socket: set SO_REUSEPORT
2014-06-11 15:55:29 LOG: pid 12623: wd_create_hb_send_socket: set SO_REUSEPORT
2014-06-11 15:57:09 LOG: pid 12624: watchdog: lifecheck started
(pgpool on server 1 was stopped here)
2014-06-11 16:02:14 LOG: pid 12621: wd_escalation: escalating to master pgpool
2014-06-11 16:02:18 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:19 LOG: pid 12624: check_pgpool_status_by_hb: pgpool 1 ( 192.168.100.76:9999) is in down status
2014-06-11 16:02:20 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:22 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:24 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:26 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:28 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:29 LOG: pid 12624: check_pgpool_status_by_hb: pgpool 1 ( 192.168.100.76:9999) is in down status
2014-06-11 16:02:30 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:32 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:34 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:36 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
2014-06-11 16:02:38 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
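To make the pattern in the server 2 log easier to see, the spacing of the repeated exec_ping errors can be computed from the timestamps. The following is a minimal Python sketch under my own assumptions: `error_intervals` is a hypothetical helper, and it relies only on the 19-character timestamp at the start of each entry.

```python
from datetime import datetime

def error_intervals(lines, needle="exec_ping: wait() failed"):
    """Seconds between consecutive log entries whose text contains needle."""
    stamps = [
        datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        for line in lines
        if needle in line
    ]
    return [(b - a).total_seconds() for a, b in zip(stamps, stamps[1:])]

# A few entries copied from the server 2 log above.
SAMPLE = [
    "2014-06-11 16:02:14 LOG: pid 12621: wd_escalation: escalating to master pgpool",
    "2014-06-11 16:02:18 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes",
    "2014-06-11 16:02:19 LOG: pid 12624: check_pgpool_status_by_hb: pgpool 1 ( 192.168.100.76:9999) is in down status",
    "2014-06-11 16:02:20 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes",
    "2014-06-11 16:02:22 ERROR: pid 12621: exec_ping: wait() failed. reason: No child processes",
]

print(error_intervals(SAMPLE))
```

For the entries shown in the log, the error repeats every 2 seconds from pid 12621, starting 4 seconds after "escalating to master pgpool".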
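In this state, the "pgpool: lifecheck" process is still alive on server 1 (it disappears once pgpool on server 2 is stopped).
Even so, the virtual IP appears to have switched over correctly (it is up on server 2 and responds to ping).
Conversely, when server 2 is the active server, the same operation leads to exactly the same state.
Stopping pgpool on the standby server, on the other hand, causes no problems, and it can be brought back normally.
I have not been able to identify the cause and would appreciate any help.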
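As a way to confirm which pgpool processes remain after the stop request, the output of `ps` can be filtered by process title. This is a minimal Python sketch, not something pgpool provides: the function names are hypothetical, `ps -eo pid,args` assumes a procps-style `ps`, and the sample output is invented except for the "pgpool: lifecheck" title reported above.

```python
import subprocess

def pgpool_processes(ps_output):
    """Return (pid, command) pairs for lines whose command starts with 'pgpool'."""
    found = []
    for line in ps_output.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # header line or blank line
        pid, command = int(parts[0]), parts[1]
        if command.startswith("pgpool"):
            found.append((pid, command))
    return found

def running_pgpool_processes():
    """Run `ps -eo pid,args` (assumed to be available) and filter its output."""
    out = subprocess.run(
        ["ps", "-eo", "pid,args"], capture_output=True, text=True, check=True
    ).stdout
    return pgpool_processes(out)

# Hypothetical ps output, for illustration only.
SAMPLE_PS = """\
  PID COMMAND
    1 /sbin/init
 6321 pgpool: lifecheck
 7000 postgres: writer process
"""

print(pgpool_processes(SAMPLE_PS))
```

Note that a process whose command line begins with a full path to the pgpool binary would not match this simple prefix test.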
(Excerpts from each server's configuration file follow.)
■ Server 1
------------------------------
-------------------------------------------------------------------------------------------------------
# - pgpool Connection Settings -
listen_addresses = '*'
# Host name or IP address to listen on:
# '*' for all, '' for no TCP/IP connections
# (change requires restart)
port = 9999
# Port number
# (change requires restart)
socket_dir = '/tmp'
# Unix domain socket path
# The Debian package defaults to
# /var/run/postgresql
# (change requires restart)
# - pgpool Communication Manager Connection Settings -
pcp_port = 9898
# Port number for pcp
# (change requires restart)
pcp_socket_dir = '/tmp'
# Unix domain socket path for pcp
# The Debian package defaults to
# /var/run/postgresql
# (change requires restart)
# - Backend Connection Settings -
backend_hostname0 = '192.168.100.76'
# Host name or IP address to connect to for backend 0
backend_port0 = 5432
# Port number for backend 0
backend_weight0 = 1
# Weight for backend 0 (only in load balancing mode)
backend_data_directory0 = '/common/data'
# Data directory for backend 0
#backend_flag0 = 'ALLOW_TO_FAILOVER'
# Controls various backend behavior
# ALLOW_TO_FAILOVER or DISALLOW_TO_FAILOVER
backend_hostname1 = '192.168.100.79'
backend_port1 = 5432
backend_weight1 = 1
backend_data_directory1 = '/common/data'
#backend_flag1 = 'ALLOW_TO_FAILOVER'
# - Authentication -
enable_pool_hba = on
# Use pool_hba.conf for client authentication
pool_passwd = 'pool_passwd'
# File name of pool_passwd for md5 authentication.
# "" disables pool_passwd.
# (change requires restart)
authentication_timeout = 60
# Delay in seconds to complete client authentication
# 0 means no timeout.
# - SSL Connections -
ssl = off
#------------------------------------------------------------------------------
# POOLS
#------------------------------------------------------------------------------
# - Pool size -
num_init_children = 50
# Number of pools
# (change requires restart)
max_pool = 10
# Number of connections per pool
# (change requires restart)
# - Life time -
child_life_time = 300
# Pool exits after being idle for this many seconds
child_max_connections = 0
# Pool exits after receiving that many connections
# 0 means no exit
connection_life_time = 0
# Connection to backend closes after being idle for this many seconds
# 0 means no close
client_idle_limit = 0
# Client is disconnected after being idle for that many seconds
# (even inside an explicit transactions!)
# 0 means no disconnection
#------------------------------------------------------------------------------
# CONNECTION POOLING
#------------------------------------------------------------------------------
connection_cache = on
# Activate connection pools
# (change requires restart)
# Semicolon separated list of queries
# to be issued at the end of a session
# The default is for 8.3 and later
reset_query_list = 'ABORT; DISCARD ALL'
# The following one is for 8.2 and before
#reset_query_list = 'ABORT; RESET ALL; SET SESSION AUTHORIZATION DEFAULT'
#------------------------------------------------------------------------------
# REPLICATION MODE
#------------------------------------------------------------------------------
replication_mode = on
# Activate replication mode
# (change requires restart)
replicate_select = off
# Replicate SELECT statements
# when in replication or parallel mode
# replicate_select is higher priority than
# load_balance_mode.
insert_lock = on
# Automatically locks a dummy row or a table
# with INSERT statements to keep SERIAL data
# consistency
# Without SERIAL, no lock will be issued
lobj_lock_table = ''
# When rewriting lo_creat command in
# replication mode, specify table name to
# lock
# - Degenerate handling -
replication_stop_on_mismatch = off
# On disagreement with the packet kind
# sent from backend, degenerate the node
# which is most likely "minority"
# If off, just force to exit this session
failover_if_affected_tuples_mismatch = off
# On disagreement with the number of affected
# tuples in UPDATE/DELETE queries, then
# degenerate the node which is most likely
# "minority".
# If off, just abort the transaction to
# keep the consistency
#------------------------------------------------------------------------------
# LOAD BALANCING MODE
#------------------------------------------------------------------------------
load_balance_mode = on
# Activate load balancing mode
# (change requires restart)
ignore_leading_white_space = on
# Ignore leading white spaces of each query
white_function_list = ''
# Comma separated list of function names
# that don't write to database
# Regexp are accepted
black_function_list = 'nextval,setval'
# Comma separated list of function names
# that write to database
# Regexp are accepted
#------------------------------------------------------------------------------
# MASTER/SLAVE MODE
#------------------------------------------------------------------------------
master_slave_mode = off
#------------------------------------------------------------------------------
# PARALLEL MODE
#------------------------------------------------------------------------------
parallel_mode = off
#------------------------------------------------------------------------------
# HEALTH CHECK
#------------------------------------------------------------------------------
health_check_period = 0
# Health check period
# Disabled (0) by default
health_check_timeout = 20
# Health check timeout
# 0 means no timeout
health_check_user = 'postgres'
# Health check user
health_check_password = ''
# Password for health check user
health_check_max_retries = 0
# Maximum number of times to retry a failed health check before giving up.
health_check_retry_delay = 1
# Amount of time to wait (in seconds) between retries.
#------------------------------------------------------------------------------
# FAILOVER AND FAILBACK
#------------------------------------------------------------------------------
failover_command = ''
# Executes this command at failover
# Special values:
# %d = node id
# %h = host name
# %p = port number
# %D = database cluster path
# %m = new master node id
# %H = hostname of the new master node
# %M = old master node id
# %P = old primary node id
# %r = new master port number
# %R = new master database cluster path
# %% = '%' character
failback_command = ''
# Executes this command at failback.
# Special values:
# %d = node id
# %h = host name
# %p = port number
# %D = database cluster path
# %m = new master node id
# %H = hostname of the new master node
# %M = old master node id
# %P = old primary node id
# %r = new master port number
# %R = new master database cluster path
# %% = '%' character
fail_over_on_backend_error = on
# Initiates failover when reading/writing to the
# backend communication socket fails
# If set to off, pgpool will report an
# error and disconnect the session.
search_primary_node_timeout = 10
# Timeout in seconds to search for the
# primary node when a failover occurs.
# 0 means no timeout, keep searching
# for a primary node forever.
#------------------------------------------------------------------------------
# WATCHDOG
#------------------------------------------------------------------------------
# - Enabling -
use_watchdog = on
# Activates watchdog
# (change requires restart)
# -Connection to up stream servers -
trusted_servers = ''
# trusted server list which are used
# to confirm network connection
# (hostA,hostB,hostC,...)
# (change requires restart)
ping_path = '/home/www/bin'
# ping command path
# (change requires restart)
# - Watchdog communication Settings -
wd_hostname = '192.168.100.76'
# Host name or IP address of this watchdog
# (change requires restart)
wd_port = 9000
# port number for watchdog service
# (change requires restart)
wd_authkey = ''
# Authentication key for watchdog communication
# (change requires restart)
# - Virtual IP control Setting -
delegate_IP = '192.168.100.72'
# delegate IP address
# If this is empty, virtual IP never bring up.
# (change requires restart)
ifconfig_path = '/home/www/sbin'
# ifconfig command path
# (change requires restart)
if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.255.0'
# startup delegate IP command
# (change requires restart)
if_down_cmd = 'ifconfig eth0:0 down'
# shutdown delegate IP command
# (change requires restart)
arping_path = '/home/www/sbin' # arping command path
# (change requires restart)
arping_cmd = 'arping -U $_IP_$ -w 1'
# arping command
# (change requires restart)
# - Behavior on escalation Setting -
clear_memqcache_on_escalation = on
# Clear all the query cache on shared memory
# when standby pgpool escalate to active pgpool
# (= virtual IP holder).
# This should be off if client connects to pgpool
# not using virtual IP.
# (change requires restart)
wd_escalation_command = ''
# Executes this command at escalation on new active pgpool.
# (change requires restart)
# - Lifecheck Setting -
# -- common --
wd_lifecheck_method = 'heartbeat'
# Method of watchdog lifecheck ('heartbeat' or 'query')
# (change requires restart)
wd_interval = 10
# lifecheck interval (sec) > 0
# (change requires restart)
# -- heartbeat mode --
wd_heartbeat_port = 9694
# Port number for receiving heartbeat signal
# (change requires restart)
wd_heartbeat_keepalive = 2
# Interval time of sending heartbeat signal (sec)
# (change requires restart)
wd_heartbeat_deadtime = 30
# Deadtime interval for heartbeat signal (sec)
# Host name or IP address of destination 0
# for sending heartbeat signal.
# (change requires restart)
heartbeat_destination0 = '192.168.100.79'
# Host name or IP address of destination 0
# for sending heartbeat signal.
# (change requires restart)
heartbeat_destination_port0 = 9694
# Port number of destination 0 for sending
# heartbeat signal. Usually this is the
# same as wd_heartbeat_port.
# (change requires restart)
heartbeat_device0 = ''
# Name of NIC device (such like 'eth0')
# used for sending/receiving heartbeat
# signal to/from destination 0.
# This works only when this is not empty
# and pgpool has root privilege.
# (change requires restart)
# - Other pgpool Connection Settings -
other_pgpool_hostname0 = '192.168.100.79'
# Host name or IP address to connect to for other pgpool 0
# (change requires restart)
other_pgpool_port0 = 9999
# Port number for other pgpool 0
# (change requires restart)
other_wd_port0 = 9000
# Port number for other watchdog 0
# (change requires restart)
-------------------------------------------------------------------------------------------------------------------------------------
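Since the watchdog runs the external commands found under ping_path, ifconfig_path and arping_path, and the startup log mentions a setuid check, it may be worth confirming what is actually installed in those directories. The following is a minimal Python sketch under stated assumptions: `command_status` is a hypothetical helper, the pairing of directories with command names is taken from the comments next to each setting, and the log does not show which commands pgpool itself inspects.

```python
import os
import stat

def command_status(directory, name):
    """Report whether directory/name exists, is executable, and has the setuid bit."""
    path = os.path.join(directory, name)
    try:
        mode = os.stat(path).st_mode
    except OSError:
        # Missing file or unreadable directory: report everything as False.
        return {"path": path, "exists": False, "executable": False, "setuid": False}
    return {
        "path": path,
        "exists": True,
        "executable": bool(mode & stat.S_IXUSR),
        "setuid": bool(mode & stat.S_ISUID),
    }

# Directory values copied from the configuration above; the command names
# are assumed from the comments beside each *_path setting.
CHECKS = [
    ("/home/www/bin", "ping"),
    ("/home/www/sbin", "ifconfig"),
    ("/home/www/sbin", "arping"),
]

for directory, name in CHECKS:
    print(command_status(directory, name))
```

Running this on both servers would show whether the three commands exist in the configured directories and which of them carry the setuid bit.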
■ Server 2 (identical to server 1 except for the excerpt below)
-------------------------------------------------------------------------------------------------------------------------------------
#------------------------------------------------------------------------------
# WATCHDOG
#------------------------------------------------------------------------------
# - Enabling -
use_watchdog = on
# Activates watchdog
# (change requires restart)
# -Connection to up stream servers -
trusted_servers = ''
# trusted server list which are used
# to confirm network connection
# (hostA,hostB,hostC,...)
# (change requires restart)
ping_path = '/home/www/bin'
# ping command path
# (change requires restart)
# - Watchdog communication Settings -
wd_hostname = '192.168.100.79'
# Host name or IP address of this watchdog
# (change requires restart)
wd_port = 9000
# port number for watchdog service
# (change requires restart)
wd_authkey = ''
# Authentication key for watchdog communication
# (change requires restart)
# - Virtual IP control Setting -
delegate_IP = '192.168.100.72'
# delegate IP address
# If this is empty, virtual IP never bring up.
# (change requires restart)
ifconfig_path = '/home/www/sbin'
# ifconfig command path
# (change requires restart)
if_up_cmd = 'ifconfig eth0:0 inet $_IP_$ netmask 255.255.255.0'
# startup delegate IP command
# (change requires restart)
if_down_cmd = 'ifconfig eth0:0 down'
# shutdown delegate IP command
# (change requires restart)
arping_path = '/home/www/sbin' # arping command path
# (change requires restart)
arping_cmd = 'arping -U $_IP_$ -w 1'
# arping command
# (change requires restart)
# - Behavior on escalation Setting -
clear_memqcache_on_escalation = on
# Clear all the query cache on shared memory
# when standby pgpool escalate to active pgpool
# (= virtual IP holder).
# This should be off if client connects to pgpool
# not using virtual IP.
# (change requires restart)
wd_escalation_command = ''
# Executes this command at escalation on new active pgpool.
# (change requires restart)
# - Lifecheck Setting -
# -- common --
wd_lifecheck_method = 'heartbeat'
# Method of watchdog lifecheck ('heartbeat' or 'query')
# (change requires restart)
wd_interval = 10
# lifecheck interval (sec) > 0
# (change requires restart)
# -- heartbeat mode --
wd_heartbeat_port = 9694
# Port number for receiving heartbeat signal
# (change requires restart)
wd_heartbeat_keepalive = 2
# Interval time of sending heartbeat signal (sec)
# (change requires restart)
wd_heartbeat_deadtime = 30
# Deadtime interval for heartbeat signal (sec)
# Host name or IP address of destination 0
# for sending heartbeat signal.
# (change requires restart)
heartbeat_destination0 = '192.168.100.76'
# Host name or IP address of destination 0
# for sending heartbeat signal.
# (change requires restart)
heartbeat_destination_port0 = 9694
# Port number of destination 0 for sending
# heartbeat signal. Usually this is the
# same as wd_heartbeat_port.
# (change requires restart)
heartbeat_device0 = ''
# Name of NIC device (such like 'eth0')
# used for sending/receiving heartbeat
# signal to/from destination 0.
# This works only when this is not empty
# and pgpool has root privilege.
# (change requires restart)
# - Other pgpool Connection Settings -
other_pgpool_hostname0 = '192.168.100.76'
# Host name or IP address to connect to for other pgpool 0
# (change requires restart)
other_pgpool_port0 = 9999
# Port number for other pgpool 0
# (change requires restart)
other_wd_port0 = 9000
# Port number for other watchdog 0
# (change requires restart)
-------------------------------------------------------------------------------------------------------------------------------------
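To double-check that the two configuration files differ only where intended, the `name = value` settings can be compared mechanically. This is a minimal Python sketch, not a pgpool tool: `parse_conf` and `diff_conf` are hypothetical names, and the parser is a simplification that treats everything after `#` on a setting line as a comment and strips surrounding single quotes.

```python
import re

def parse_conf(text):
    """Parse `name = value` lines; comments and wrapped fragments are skipped."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        name = name.strip()
        if not re.fullmatch(r"\w+", name):
            continue  # not a plain parameter name
        value = value.split("#", 1)[0].strip()  # drop a trailing inline comment
        settings[name] = value.strip("'")
    return settings

def diff_conf(a, b):
    """Return {name: (value_a, value_b)} for settings that differ or are missing."""
    names = sorted(set(a) | set(b))
    return {n: (a.get(n), b.get(n)) for n in names if a.get(n) != b.get(n)}

# A few settings copied from the excerpts above.
SERVER1 = """\
wd_hostname = '192.168.100.76'
wd_port = 9000
heartbeat_destination0 = '192.168.100.79'
other_pgpool_hostname0 = '192.168.100.79'
arping_path = '/home/www/sbin' # arping command path
"""
SERVER2 = """\
wd_hostname = '192.168.100.79'
wd_port = 9000
heartbeat_destination0 = '192.168.100.76'
other_pgpool_hostname0 = '192.168.100.76'
arping_path = '/home/www/sbin' # arping command path
"""

for name, values in diff_conf(parse_conf(SERVER1), parse_conf(SERVER2)).items():
    print(name, values)
```

In the excerpts posted here, the settings that differ between the two servers are wd_hostname, heartbeat_destination0 and other_pgpool_hostname0, each pointing at the other server as expected.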
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general-jp/attachments/20140611/86498297/attachment-0001.html>