[pgpool-general: 6417] Re: health_check_max_retries is not honored

Alexander Dorogensky amazinglifetime at gmail.com
Fri Feb 15 21:12:56 JST 2019


Hi,

Do you have any idea what’s going on?

On Mon, Feb 11, 2019 at 8:30 PM Alexander Dorogensky <
amazinglifetime at gmail.com> wrote:

> Pgpool.conf from one of the app nodes is attached
>
> Thanks
>
> On Mon, Feb 11, 2019 at 6:59 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
>
>> Hi,
>>
>> On Mon, 11 Feb 2019 15:32:55 -0600
>> Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
>>
>> > I'm running 4 app (pgpool) nodes (3.6.10) and 2 db (postgres) nodes
>> > (9.6.9) in a primary/standby configuration with streaming replication.
>> > All 6 nodes are separate machines.
>> >
>> > A client has had too many failovers caused by a flaky network, and in
>> > an effort to remedy the issue I set the following parameters:
>> >
>> > health_check_max_retries = 7
>> > health_check_retry_delay = 15
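
Taken together, these two settings bound how long pgpool keeps probing a backend after the first failed health check before it gives up: roughly health_check_max_retries × health_check_retry_delay seconds (health_check_period and health_check_timeout add on top of this and are not counted here). A quick sketch with the values above:

```shell
# Rough retry window before failover, using the settings quoted above.
max_retries=7
retry_delay=15
echo "retry window: $(( max_retries * retry_delay ))s"
```

With these values, pgpool should keep retrying for about 105 seconds before requesting failover, which matches the roughly 14:22:51 → 14:43:30 span in the lab log below (allowing for the timeout on each probe).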
>> >
>> > Now, I have the client's environment and a lab environment to reproduce
>> > the issue. The pgpool configuration and version are identical in both.
>> >
>> > To simulate a flaky network, I use iptables to deny postgres connections
>> > to one of the db nodes and see that pgpool on all app nodes is trying to
>> > reconnect according to the configured number of retries and retry delay,
>> >
>> > > i.e.
>> > > 2019-02-11 14:22:51: pid 7825: LOG:  failed to connect to PostgreSQL
>> > > server on "10.0.10.133:5433", getsockopt() detected error "No route
>> > > to host"
>> > > ...
>> > > 2019-02-11 14:23:23: pid 6458: LOG:  health checking retry count 1
>> > > ...
>> > > 2019-02-11 14:23:38: pid 6458: LOG:  health checking retry count 2
>> > > ...
>> > > 2019-02-11 14:42:45: pid 6458: LOG:  health checking retry count 3
>> > > ...
>> > > 2019-02-11 14:43:00: pid 6458: LOG:  health checking retry count 4
>> > > ...
>> > > 2019-02-11 14:43:15: pid 6458: LOG:  health checking retry count 5
>> > > ...
>> > > 2019-02-11 14:43:30: pid 6458: LOG:  health checking retry count 6
>> > > ...
>> > > 2019-02-11 14:43:30: pid 6460: LOG:  failover request from local
>> > > pgpool-II node received on IPC interface is forwarded to master
>> > > watchdog node "172.20.20.173:5432"
>> > > 2019-02-11 14:43:30: pid 4565: LOG:  watchdog received the failover
>> > > command from remote pgpool-II node "172.20.20.172:5432"
>> > > ...
>> > > 2019-02-11 14:43:30: pid 4563: LOG:  execute command:
>> > > /etc/pgpool-II/failover.sh 0 10.0.10.133 5433 /opt/redsky/db/data 1 0
>> > > 10.0.10.134 1 5433 /opt/redsky/db/data
>> > >
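
The simulation described above can be sketched as follows. This is an assumed reconstruction, not the poster's exact command: the port matches the backend in the log (5433), and REJECT with icmp-host-unreachable is chosen because it produces the "No route to host" error seen by getsockopt(). Run as root on the affected db node:

```shell
# Assumed sketch: reject incoming PostgreSQL traffic on the backend port so
# connecting pgpool nodes see "No route to host", as in the log above.
iptables -A INPUT -p tcp --dport 5433 -j REJECT --reject-with icmp-host-unreachable

# Remove the rule again once the test is finished:
iptables -D INPUT -p tcp --dport 5433 -j REJECT --reject-with icmp-host-unreachable
```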
>> > However, in the client's environment failover gets initiated before the
>> > configured number of retries is reached, i.e.
>> >
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog received the failover
>> > > command from local pgpool-II on IPC interface
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog is processing the
>> > > failover command [DEGENERATE_BACKEND_REQUEST] received from local
>> > > pgpool-II on IPC interface
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  forwarding the failover request
>> > > [DEGENERATE_BACKEND_REQUEST] to all alive nodes
>> > > 2019-02-09 05:17:47: pid 19402: DETAIL:  watchdog cluster currently
>> > > has 3 connected remote nodes
>> > > 2019-02-09 05:17:47: pid 19276: ERROR:  unable to read data from DB
>> > > node 1
>> > > 2019-02-09 05:17:47: pid 19276: DETAIL:  socket read failed with an
>> > > error "Success"
>> > > 2019-02-09 05:17:47: pid 19400: LOG:  Pgpool-II parent process has
>> > > received failover request
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  new IPC connection received
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  received the failover command
>> > > lock request from local pgpool-II on IPC interface
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node
>> > > "10.15.35.35:5432" is requesting to become a lock holder for
>> > > failover ID: 19880
>> > > 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node
>> > > "10.15.35.35:5432" is the lock holder
>> > > 2019-02-09 05:17:47: pid 19400: LOG:  starting degeneration. shutdown
>> > > host 10.38.135.137(5433)
>> > > 2019-02-09 05:17:47: pid 19400: LOG:  Restart all children
>> > > 2019-02-09 05:17:47: pid 19400: LOG:  execute command:
>> > > /etc/pgpool-II/failover.sh 1 10.38.135.137 5433 /opt/redsky/db/data 0
>> > > 0 10.15.35.39 1 5433 /opt/redsky/db/data
>> > >
>> > >
>> > I ran the following command on all app nodes
>> >
>> > psql -c 'pgpool show health_check_max_retries'
>> > health_check_max_retries
>> > --------------------------
>> > 16
>> > (1 row)
>> >
>> > and the number is different from what I have in the configuration file.
>> > It's more than 1, though, and I expect it to be honored.
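
Since the runtime value (16) differs from what is in pgpool.conf (7), it is worth confirming that every app node both reports the same runtime value and is actually reading the expected configuration file. A hypothetical sketch (host names, the pgpool port 9999, and the config path are assumptions; adjust to the environment):

```shell
# Compare the runtime setting on each app node with the on-disk config.
# Hosts, port, and config path below are placeholders.
for host in app1 app2 app3 app4; do
  echo -n "$host: "
  psql -h "$host" -p 9999 -t -c 'pgpool show health_check_max_retries'
done
grep '^health_check_max_retries' /etc/pgpool-II/pgpool.conf
```

A mismatch between the two usually means pgpool was started with a different config file, or was not reloaded/restarted after the file was edited.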
>>
>> I could not reproduce this issue by using pgpool_setup.
>> Could you share the whole pgpool.conf?
>>
>> > Can you guys help me out? I'm out of ideas.
>> >
>> > pgpool-II-pg96-3.6.10-1pgdg.rhel6.x86_64
>>
>>
>> --
>> Bo Peng <pengbo at sraoss.co.jp>
>> SRA OSS, Inc. Japan
>>
>>

