[pgpool-general: 3493] pgpool 3.3.1 - unexplained failover without health-check retries

Boaz Goldstein boaz.goldstein at sizmek.com
Thu Mar 5 17:59:52 JST 2015


Hello Pgpool support,
I am using Pgpool 3.3.1 in front of our production database, in master-slave mode with health checks enabled.
Today, and not for the first time, Pgpool "decided" to fail over as soon as it lost connectivity to the master node, without going through the retry logic that should apply when health checks are enabled.
Here is the Pgpool log from the failure time:
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: C message
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: B message
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 1 backend pid: 26169 statement: B message
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: Execute: ROLLBACK
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 1 backend pid: 26169 statement: Execute: ROLLBACK
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: Parse: SELECT 1
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: B message
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: D message
Mar  4 16:35:06 pgpool pgpool[22795]: DB node id: 0 backend pid: 44867 statement: Execute: SELECT 1
Mar  4 16:35:09 pgpool pgpool[2306]: connect_inet_domain_socket: gethostbyname() failed: Unknown host host: pnmpg3.sj.peer39.com
Mar  4 16:35:09 pgpool pgpool[2306]: connection to pnmpg3.sj.peer39.com(5432) failed
Mar  4 16:35:09 pgpool pgpool[2306]: new_connection: create_cp() failed
Mar  4 16:35:09 pgpool pgpool[2306]: degenerate_backend_set: 0 fail over request from pid 2306
Mar  4 16:35:09 pgpool pgpool[32118]: starting degeneration. shutdown host pnmpg3.sj.peer39.com(5432)
Mar  4 16:35:09 pgpool pgpool[32118]: Restart all children
Mar  4 16:35:09 pgpool pgpool[32118]: execute command: /etc/pgpool-II/failover.sh 0 "pnmpg3.sj.peer39.com" 5432 /var/lib/pgsql/9.2/data 1 0 "pnmpg4.sj.peer39.com" 0
Mar  4 16:35:19 pgpool pgpool[32118]: find_primary_node_repeatedly: waiting for finding a primary node
Mar  4 16:35:35 pgpool pgpool[32118]: failover: set new primary node: -1
Mar  4 16:35:35 pgpool pgpool[32118]: failover: set new master node: 1
Mar  4 16:35:35 pgpool pgpool[32233]: worker process received restart request
Mar  4 16:35:35 pgpool pgpool[32118]: failover done. shutdown host pnmpg3.sj.peer39.com(5432)
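
To make the timing in the log easier to see, here is a minimal Python sketch that picks out which pid asked for the failover and how long after the first connection error the request was logged. The regular expression and the helper names (parse, to_seconds) are only illustrative, and the sketch assumes all lines fall on the same day, since syslog timestamps carry no year.

```python
import re

# Lines copied from the log above (syslog format).
LOG = """\
Mar  4 16:35:09 pgpool pgpool[2306]: connect_inet_domain_socket: gethostbyname() failed: Unknown host host: pnmpg3.sj.peer39.com
Mar  4 16:35:09 pgpool pgpool[2306]: connection to pnmpg3.sj.peer39.com(5432) failed
Mar  4 16:35:09 pgpool pgpool[2306]: new_connection: create_cp() failed
Mar  4 16:35:09 pgpool pgpool[2306]: degenerate_backend_set: 0 fail over request from pid 2306
Mar  4 16:35:09 pgpool pgpool[32118]: starting degeneration. shutdown host pnmpg3.sj.peer39.com(5432)
Mar  4 16:35:35 pgpool pgpool[32118]: failover done. shutdown host pnmpg3.sj.peer39.com(5432)
"""

# month, day, HH:MM:SS, host, "pgpool[pid]:", message
LINE_RE = re.compile(r"^\w{3}\s+\d+ (\d\d:\d\d:\d\d) \S+ pgpool\[(\d+)\]: (.*)$")


def to_seconds(hms):
    """Convert HH:MM:SS to seconds since midnight."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def parse(log_text):
    """Return a list of (seconds_since_midnight, pid, message) tuples."""
    events = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            hms, pid, msg = match.groups()
            events.append((to_seconds(hms), int(pid), msg))
    return events


events = parse(LOG)
first_failure = next(e for e in events if "failed" in e[2])
request = next(e for e in events if e[2].startswith("degenerate_backend_set"))
done = next(e for e in events if e[2].startswith("failover done"))

requesting_pid = request[1]
delay_before_request = request[0] - first_failure[0]
failover_duration = done[0] - request[0]

print("failover requested by pid", requesting_pid)
print("seconds between first error and request:", delay_before_request)
print("seconds from request to 'failover done':", failover_duration)
```

On these lines the request comes from pid 2306 in the same second as the gethostbyname() error, and the failover itself takes 26 seconds.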

Is this normal behavior, or is it a bug? If it is normal, is there a way to change it?

Pgpool enabled features:

[ mode ] Master Slave mode

[ healthcheck ] every 40 seconds / retry upto 3 counts
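
As a rough sanity check of the delay the retry settings should produce, the arithmetic can be sketched in Python using the health_check_* values from the pgpool.conf excerpt below. This is only an estimate based on a plain reading of the parameter names (each attempt either fails at once or runs to health_check_timeout, with health_check_retry_delay between attempts); it is not taken from the pgpool source, and retry_window is a made-up helper name.

```python
def retry_window(max_retries, retry_delay, timeout):
    """Return (fastest, slowest) seconds from the first failed check
    until all retries are used up, under the assumptions stated above."""
    # Fastest case: every attempt fails immediately, so only the delays count.
    fastest = max_retries * retry_delay
    # Slowest case: the first check and every retry each run to the timeout.
    slowest = (max_retries + 1) * timeout + max_retries * retry_delay
    return fastest, slowest


# health_check_max_retries = 3, health_check_retry_delay = 20,
# health_check_timeout = 10
fastest, slowest = retry_window(max_retries=3, retry_delay=20, timeout=10)
print(f"expected retry window: {fastest} to {slowest} seconds")
```

With these settings the estimate is 60 to 100 seconds, while the log shows the failover request in the same second as the connection error.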

Partial pgpool.conf:
#------------------------------------------------------------------------------
# POOLS
#------------------------------------------------------------------------------
# - Pool size -
num_init_children = 50
max_pool = 1
# - Life time -
child_life_time = 50
child_max_connections = 0
connection_life_time = 1800
client_idle_limit = 0
#------------------------------------------------------------------------------
# MASTER/SLAVE MODE
#------------------------------------------------------------------------------
master_slave_mode = on
master_slave_sub_mode = 'stream'
# - Streaming -
sr_check_period = 30
sr_check_user = 'username'
sr_check_password = 'password'
delay_threshold = 0
# - Special commands -
follow_master_command = ''
#------------------------------------------------------------------------------
# HEALTH CHECK
#------------------------------------------------------------------------------
health_check_period = 40
health_check_timeout = 10
health_check_user = 'username'
health_check_password = 'password'
health_check_max_retries = 3
health_check_retry_delay = 20
#------------------------------------------------------------------------------
# FAILOVER AND FAILBACK
#------------------------------------------------------------------------------
failover_command = '/etc/pgpool-II/failover.sh %d "%h" %p %D %m %M "%H" %P'
failback_command = ''
fail_over_on_backend_error = on
search_primary_node_timeout = 10
#------------------------------------------------------------------------------
# ONLINE RECOVERY
#------------------------------------------------------------------------------
recovery_user = ''
recovery_password = ''
recovery_1st_stage_command = 'basebackup.sh'
recovery_2nd_stage_command = ''
recovery_timeout = 90
client_idle_limit_in_recovery = 0
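
For reference, the arguments passed to failover.sh in the log can be matched against the placeholders in failover_command with a short Python sketch. The placeholder descriptions follow my understanding of the pgpool-II documentation and should be treated as assumptions; the command strings are copied from the config and log above.

```python
import shlex

# failover_command from pgpool.conf and the command line pgpool logged.
TEMPLATE = '/etc/pgpool-II/failover.sh %d "%h" %p %D %m %M "%H" %P'
LOGGED = ('/etc/pgpool-II/failover.sh 0 "pnmpg3.sj.peer39.com" 5432 '
          '/var/lib/pgsql/9.2/data 1 0 "pnmpg4.sj.peer39.com" 0')

# Placeholder meanings as I understand them from the pgpool-II docs.
MEANING = {
    "%d": "failed node id",
    "%h": "failed node host name",
    "%p": "failed node port",
    "%D": "failed node database cluster path",
    "%m": "new master node id",
    "%M": "old master node id",
    "%H": "new master host name",
    "%P": "old primary node id",
}

# shlex.split honours the double quotes, so both lists line up one to one.
placeholders = shlex.split(TEMPLATE)[1:]
values = shlex.split(LOGGED)[1:]
args = dict(zip(placeholders, values))

for name in placeholders:
    print(f"{name}  {MEANING[name]:<36} {args[name]}")
```

Read this way, node 0 (pnmpg3) was detached and node 1 (pnmpg4) was passed to the script as the new master.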

Thanks

Boaz Goldstein
DBA, Deployment DBA Team
boaz.goldstein at sizmek.com
M +972.524.695731
T +972.9.778.2910
Israel
