[pgpool-general: 6424] Re: health_check_max_retries is not honored

Bo Peng pengbo at sraoss.co.jp
Tue Feb 19 11:15:33 JST 2019


Hi,

Sorry, I couldn't quite follow your problem.

Do you mean that if you set the health check parameters as below:
----
health_check_period = 20                              
health_check_timeout = 20
health_check_max_retries = 16
health_check_retry_delay = 8
connect_timeout = 60000
----

failover did not occur even after 16 retries?
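
Just to set a baseline (rough arithmetic only -- the exact timing depends on
how quickly each connection attempt fails), with the settings above pgpool
should keep retrying for roughly a few minutes before degenerating the node:

  best case : 16 retries * 8s retry delay            ~= 128s (about 2 minutes)
  worst case: 16 retries * (8s delay + 20s timeout)  ~= 448s (about 7.5 minutes)

So failover should not be triggered right after the first failed health check.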

Could you give us more information about your health check configuration
and the full pgpool.log, including all of the health check retries and the
failover request?
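
For example, something like the following on each pgpool node would be enough
(the log path is only a placeholder -- use wherever your pgpool.log actually
lives):
----
psql -c "pgpool show health_check_period"
psql -c "pgpool show health_check_timeout"
psql -c "pgpool show health_check_max_retries"
psql -c "pgpool show health_check_retry_delay"
grep -E "health check|failover|degeneration" /path/to/pgpool.log
----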


On Mon, 18 Feb 2019 08:00:03 -0600
Alexander Dorogensky <amazinglifetime at gmail.com> wrote:

> Any idea why pgpool doesn't retry?
> On Sun, Feb 17, 2019 at 6:19 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
> 
> > Hi,
> >
> > I checked your pgpool.conf, and you set
> >
> >   health_check_max_retries = 16
> >
> > So I think the following result is correct.
> >
> > > >> > psql -c 'pgpool show health_check_max_retries'
> > > >> > health_check_max_retries
> > > >> > --------------------------
> > > >> > 16
> > > >> > (1 row)
> >
> > On Fri, 15 Feb 2019 07:12:56 -0500
> > Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Do you have any idea what's going on?
> > >
> > > On Mon, Feb 11, 2019 at 8:30 PM Alexander Dorogensky <
> > > amazinglifetime at gmail.com> wrote:
> > >
> > > > Pgpool.conf from one of the app nodes is attached
> > > >
> > > > Thanks
> > > >
> > > > On Mon, Feb 11, 2019 at 6:59 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >> On Mon, 11 Feb 2019 15:32:55 -0600
> > > >> Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
> > > >>
> > > >> > I'm running 4 app (pgpool) nodes (3.6.10) and 2 db (postgres) nodes (9.6.9)
> > > >> > in a primary/standby configuration with streaming replication. All 6 nodes
> > > >> > are separate machines.
> > > >> >
> > > >> > A client has had too many failovers caused by a flaky network, and in an
> > > >> > effort to remedy the issue I set the following parameters:
> > > >> >
> > > >> > health_check_max_retries = 7
> > > >> > health_check_retry_delay = 15
> > > >> >
> > > >> > Now, I have the client's environment and a lab environment in which to
> > > >> > reproduce the issue. The pgpool configuration and version are identical in
> > > >> > both.
> > > >> >
> > > >> > To simulate a flaky network, I use iptables to deny postgres connections to
> > > >> > one of the db nodes, and I see that pgpool on all app nodes tries to
> > > >> > reconnect according to the configured number of retries and retry delay,
> > > >> >
> > > >> > > i.e.
> > > >> > > 2019-02-11 14:22:51: pid 7825: LOG:  failed to connect to PostgreSQL server on "10.0.10.133:5433", getsockopt() detected error "No route to host"
> > > >> > > ...
> > > >> > > 2019-02-11 14:23:23: pid 6458: LOG:  health checking retry count 1
> > > >> > > ...
> > > >> > > 2019-02-11 14:23:38: pid 6458: LOG:  health checking retry count 2
> > > >> > > ...
> > > >> > > 2019-02-11 14:42:45: pid 6458: LOG:  health checking retry count 3
> > > >> > > ...
> > > >> > > 2019-02-11 14:43:00: pid 6458: LOG:  health checking retry count 4
> > > >> > > ...
> > > >> > > 2019-02-11 14:43:15: pid 6458: LOG:  health checking retry count 5
> > > >> > > ...
> > > >> > > 2019-02-11 14:43:30: pid 6458: LOG:  health checking retry count 6
> > > >> > > ...
> > > >> > > 2019-02-11 14:43:30: pid 6460: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "172.20.20.173:5432"
> > > >> > > 2019-02-11 14:43:30: pid 4565: LOG:  watchdog received the failover command from remote pgpool-II node "172.20.20.172:5432"
> > > >> > > ...
> > > >> > > 2019-02-11 14:43:30: pid 4563: LOG:  execute command: /etc/pgpool-II/failover.sh 0 10.0.10.133 5433 /opt/redsky/db/data 1 0 10.0.10.134 1 5433 /opt/redsky/db/data
> > > >> > >
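
For context: the exact iptables rule used for the simulation above is not
shown in this thread, so the following is only an illustrative sketch of that
kind of test. Note that DROP makes connection attempts time out, while REJECT
with icmp-host-unreachable produces an immediate "No route to host" like the
one logged above, so which rule you pick changes how long each health check
attempt takes.
----
# on the db node, block incoming postgres connections (port 5433 here)
iptables -A INPUT -p tcp --dport 5433 -j DROP
# or, to fail fast with "No route to host":
# iptables -A INPUT -p tcp --dport 5433 -j REJECT --reject-with icmp-host-unreachable
# delete the rule again with the matching -D INPUT ... command when done
----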
> > > >> > However, in the client's environment failover gets initiated before the
> > > >> > configured number of retries, i.e.
> > > >> >
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  forwarding the failover request [DEGENERATE_BACKEND_REQUEST] to all alive nodes
> > > >> > > 2019-02-09 05:17:47: pid 19402: DETAIL:  watchdog cluster currently has 3 connected remote nodes
> > > >> > > 2019-02-09 05:17:47: pid 19276: ERROR:  unable to read data from DB node 1
> > > >> > > 2019-02-09 05:17:47: pid 19276: DETAIL:  socket read failed with an error "Success"
> > > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  Pgpool-II parent process has received failover request
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  new IPC connection received
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  received the failover command lock request from local pgpool-II on IPC interface
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node "10.15.35.35:5432" is requesting to become a lock holder for failover ID: 19880
> > > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node "10.15.35.35:5432" is the lock holder
> > > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  starting degeneration. shutdown host 10.38.135.137(5433)
> > > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  Restart all children
> > > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  execute command: /etc/pgpool-II/failover.sh 1 10.38.135.137 5433 /opt/redsky/db/data 0 0 10.15.35.39 1 5433 /opt/redsky/db/data
> > > >> > >
> > > >> > >
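
A side note on reading the "execute command" lines above: pgpool-II expands
placeholders in failover_command before running the script, e.g. %d (failed
node id), %h (failed node host), %p (failed node port), %D (failed node
database cluster path), %m (new master node id), %H (new master host),
%M (old master node id), %P (old primary node id), %r (new master port) and
%R (new master database cluster path). The order of the arguments in the log
depends entirely on how failover_command is written in pgpool.conf, so the
list above is only a reference, not necessarily the order this failover.sh
uses.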
> > > >> > I ran the following command on all app nodes
> > > >> >
> > > >> > psql -c 'pgpool show health_check_max_retries'
> > > >> > health_check_max_retries
> > > >> > --------------------------
> > > >> > 16
> > > >> > (1 row)
> > > >> >
> > > >> > and the number is different from what I have in the configuration file.
> > > >> > It's more than 1, though, and I expect it to be honored.
> > > >>
> > > >> I could not reproduce this issue by using pgpool_setup.
> > > >> Could you share the whole pgpool.conf?
> > > >>
> > > >> > Can you guys help me out? I'm out of ideas...
> > > >> >
> > > >> > pgpool-II-pg96-3.6.10-1pgdg.rhel6.x86_64
> > > >>
> > > >>
> > > >> --
> > > >> Bo Peng <pengbo at sraoss.co.jp>
> > > >> SRA OSS, Inc. Japan
> > > >>
> > > >>
> >
> >
> > --
> > Bo Peng <pengbo at sraoss.co.jp>
> > SRA OSS, Inc. Japan
> >
> >


-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan


