[pgpool-general: 6407] health_check_max_retries is not honored

Alexander Dorogensky amazinglifetime at gmail.com
Tue Feb 12 06:32:55 JST 2019

I'm running 4 app nodes (Pgpool-II 3.6.10) and 2 db nodes (PostgreSQL 9.6.9)
in a primary/standby configuration with streaming replication. All 6 nodes
are separate machines.

A client has had too many failovers caused by a flaky network, and in an
effort to remedy the issue I set the following parameters:

health_check_max_retries = 7
health_check_retry_delay = 15
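
As a rough sanity check of what these two settings should buy, here is a
minimal sketch of the arithmetic. It assumes pgpool waits
health_check_retry_delay seconds between attempts and it ignores
health_check_timeout and the time each connection attempt takes, so the
result is only an approximate lower bound. The function name is made up
for illustration.

```python
def retry_window_seconds(max_retries: int, retry_delay: int) -> int:
    """Approximate seconds spent retrying before failover is requested.

    Assumption: one retry_delay pause per retry; connect time and
    health_check_timeout are deliberately left out.
    """
    return max_retries * retry_delay


# Values from the configuration above.
print(retry_window_seconds(max_retries=7, retry_delay=15))  # 105
```

So with these values I would expect a node to stay attached for well over
a minute of failed health checks before any failover is requested.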

I have the client's environment and a lab environment in which to reproduce
the issue. The Pgpool configuration and version are identical in both.

To simulate a flaky network, I use iptables to block PostgreSQL connections to
one of the db nodes. In the lab, pgpool on all app nodes retries the connection
according to the configured number of retries and retry delay, i.e.

> 2019-02-11 14:22:51: pid 7825: LOG:  failed to connect to PostgreSQL
> server on "", getsockopt() detected error "No route to
> host"
> ...
> 2019-02-11 14:23:23: pid 6458: LOG:  health checking retry count 1
> ...
> 2019-02-11 14:23:38: pid 6458: LOG:  health checking retry count 2
> ...
> 2019-02-11 14:42:45: pid 6458: LOG:  health checking retry count 3
> ...
> 2019-02-11 14:43:00: pid 6458: LOG:  health checking retry count 4
> ...
> 2019-02-11 14:43:15: pid 6458: LOG:  health checking retry count 5
> ...
> 2019-02-11 14:43:30: pid 6458: LOG:  health checking retry count 6
> ...
> 2019-02-11 14:43:30: pid 6460: LOG:  failover request from local pgpool-II
> node received on IPC interface is forwarded to master watchdog node "
> 2019-02-11 14:43:30: pid 4565: LOG:  watchdog received the failover
> command from remote pgpool-II node ""
> ...
> 2019-02-11 14:43:30: pid 4563: LOG:  execute command:
> /etc/pgpool-II/failover.sh 0 5433 /opt/redsky/db/data 1 0
> 1 5433 /opt/redsky/db/data
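
To compare excerpts like this one across environments, a small script can
report how many health-check retries were logged before the failover command
ran. This is only a sketch: it assumes the message wording shown in the logs
in this post ("health checking retry count N" and "execute command:"), and
the function name is hypothetical.

```python
import re

# Matches the retry message as it appears in the log excerpts in this post.
RETRY_RE = re.compile(r"health checking retry count (\d+)")


def retries_before_failover(lines):
    """Return the highest retry count logged before the first failover
    command, or None if no failover command appears in the lines."""
    highest = 0
    for line in lines:
        match = RETRY_RE.search(line)
        if match:
            highest = max(highest, int(match.group(1)))
        elif "execute command:" in line:
            return highest
    return None


lab = [
    "2019-02-11 14:43:15: pid 6458: LOG:  health checking retry count 5",
    "2019-02-11 14:43:30: pid 6458: LOG:  health checking retry count 6",
    "2019-02-11 14:43:30: pid 4563: LOG:  execute command:",
]
client = [
    "2019-02-09 05:17:47: pid 19402: LOG:  watchdog received the failover",
    "2019-02-09 05:17:47: pid 19400: LOG:  execute command:",
]

print(retries_before_failover(lab))     # 6
print(retries_before_failover(client))  # 0
```

A result of 0 means the failover command was logged without any preceding
health-check retry message in the lines that were scanned.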

However, in the client's environment failover gets initiated before the
configured number of retries is reached, i.e.

> 2019-02-09 05:17:47: pid 19402: LOG:  watchdog received the failover
> command from local pgpool-II on IPC interface
> 2019-02-09 05:17:47: pid 19402: LOG:  watchdog is processing the failover
> command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC
> interface
> 2019-02-09 05:17:47: pid 19402: LOG:  forwarding the failover request
> [DEGENERATE_BACKEND_REQUEST] to all alive nodes
> 2019-02-09 05:17:47: pid 19402: DETAIL:  watchdog cluster currently has 3
> connected remote nodes
> 2019-02-09 05:17:47: pid 19276: ERROR:  unable to read data from DB node 1
> 2019-02-09 05:17:47: pid 19276: DETAIL:  socket read failed with an error
> "Success"
> 2019-02-09 05:17:47: pid 19400: LOG:  Pgpool-II parent process has
> received failover request
> 2019-02-09 05:17:47: pid 19402: LOG:  new IPC connection received
> 2019-02-09 05:17:47: pid 19402: LOG:  received the failover command lock
> request from local pgpool-II on IPC interface
> 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node "
>" is requesting to become a lock holder for failover ID:
> 19880
> 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node "
>" is the lock holder
> 2019-02-09 05:17:47: pid 19400: LOG:  starting degeneration. shutdown host
> 2019-02-09 05:17:47: pid 19400: LOG:  Restart all children
> 2019-02-09 05:17:47: pid 19400: LOG:  execute command:
> /etc/pgpool-II/failover.sh 1 5433 /opt/redsky/db/data 0 0
> 1 5433 /opt/redsky/db/data

I ran the following command on all app nodes:

psql -c 'pgpool show health_check_max_retries'
(1 row)

The number it reports is different from what I have in the configuration
file. It is more than 1, though, and I expect it to be honored.

Can you guys help me out? I'm out of ideas.
