[pgpool-general: 6422] Re: health_check_max_retries is not honored

Alexander Dorogensky amazinglifetime at gmail.com
Tue Feb 19 04:35:18 JST 2019


...
2019-02-09 05:17:35: pid 19275: LOG:  PCP process with pid: 9642 exit with SUCCESS.
2019-02-09 05:17:35: pid 19275: LOG:  PCP process with pid: 9642 exits with status 0
2019-02-09 05:17:35: pid 19275: LOG:  forked new pcp worker, pid=9644 socket=7
2019-02-09 05:17:35: pid 19275: LOG:  PCP process with pid: 9644 exit with SUCCESS.
2019-02-09 05:17:35: pid 19275: LOG:  PCP process with pid: 9644 exits with status 0
2019-02-09 05:17:40: pid 19275: LOG:  forked new pcp worker, pid=9647 socket=7
2019-02-09 05:17:40: pid 19275: LOG:  PCP process with pid: 9647 exit with SUCCESS.
2019-02-09 05:17:40: pid 19275: LOG:  PCP process with pid: 9647 exits with status 0
2019-02-09 05:17:40: pid 19275: LOG:  forked new pcp worker, pid=9649 socket=7
2019-02-09 05:17:40: pid 19275: LOG:  PCP process with pid: 9649 exit with SUCCESS.
2019-02-09 05:17:40: pid 19275: LOG:  PCP process with pid: 9649 exits with status 0
2019-02-09 05:17:40: pid 19275: LOG:  forked new pcp worker, pid=9651 socket=7
2019-02-09 05:17:40: pid 19275: LOG:  PCP process with pid: 9651 exit with SUCCESS.
2019-02-09 05:17:40: pid 19275: LOG:  PCP process with pid: 9651 exits with status 0
2019-02-09 05:17:46: pid 19275: LOG:  forked new pcp worker, pid=9655 socket=7
2019-02-09 05:17:46: pid 19275: LOG:  PCP process with pid: 9655 exit with SUCCESS.
2019-02-09 05:17:46: pid 19275: LOG:  PCP process with pid: 9655 exits with status 0
2019-02-09 05:17:46: pid 19275: LOG:  forked new pcp worker, pid=9657 socket=7
2019-02-09 05:17:46: pid 19275: LOG:  PCP process with pid: 9657 exit with SUCCESS.
2019-02-09 05:17:46: pid 19275: LOG:  PCP process with pid: 9657 exits with status 0
2019-02-09 05:17:46: pid 19275: LOG:  forked new pcp worker, pid=9659 socket=7
2019-02-09 05:17:46: pid 19275: LOG:  PCP process with pid: 9659 exit with SUCCESS.
2019-02-09 05:17:46: pid 19275: LOG:  PCP process with pid: 9659 exits with status 0
2019-02-09 05:17:47: pid 19276: LOG:  received degenerate backend request for node_id: 1 from pid [19276]
2019-02-09 05:17:47: pid 19402: LOG:  new IPC connection received


On Mon, Feb 18, 2019 at 12:47 PM Pierre Timmermans <ptim007 at yahoo.com> wrote:

> Alexander, you should share more logs from the production set-up, in
> particular the lines before these:
>
> > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
> > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog is processing the failover
>
> Maybe there is an indication of what caused the failover.
>
> Pierre
>
>
> On Monday, February 18, 2019, 3:00:22 PM GMT+1, Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
>
>
> Any idea why pgpool doesn't retry?
>
> On Sun, Feb 17, 2019 at 6:19 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
>
> Hi,
>
> I checked your pgpool.conf; you have set
>
>   health_check_max_retries = 16
>
> so I think the following result is correct:
>
> > psql -c 'pgpool show health_check_max_retries'
> >  health_check_max_retries
> > --------------------------
> >  16
> > (1 row)
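>
> As a sanity check, you could also compare the file against the running
> value on each node; a minimal sketch, assuming the stock config path:
>
>   grep '^health_check_' /etc/pgpool-II/pgpool.conf
>   psql -c 'pgpool show health_check_max_retries'
>
> If the two disagree, the running pgpool has not re-read the file.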
>
> On Fri, 15 Feb 2019 07:12:56 -0500
> Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
>
> > Hi,
> >
> > Do you have any idea what's going on?
> >
> > On Mon, Feb 11, 2019 at 8:30 PM Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
> >
> > > Pgpool.conf from one of the app nodes is attached
> > >
> > > Thanks
> > >
> > > On Mon, Feb 11, 2019 at 6:59 PM Bo Peng <pengbo at sraoss.co.jp> wrote:
> > >
> > >> Hi,
> > >>
> > >> On Mon, 11 Feb 2019 15:32:55 -0600
> > >> Alexander Dorogensky <amazinglifetime at gmail.com> wrote:
> > >>
> > >> > I'm running 4 app (pgpool) nodes (3.6.10) and 2 db (postgres)
> > >> > nodes (9.6.9) in a primary/standby configuration with streaming
> > >> > replication. All 6 nodes are separate machines.
> > >> >
> > >> > A client has had too many failovers caused by a flaky network, and
> > >> > in an effort to remedy the issue I set the following parameters:
> > >> >
> > >> > health_check_max_retries = 7
> > >> > health_check_retry_delay = 15
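> > >> >
> > >> > For context, these sit alongside the other health-check knobs; a
> > >> > minimal sketch with illustrative values (the real ones are in the
> > >> > attached pgpool.conf):
> > >> >
> > >> > health_check_period = 30       # seconds between checks; 0 disables health checking
> > >> > health_check_timeout = 20      # timeout for each check attempt
> > >> > health_check_max_retries = 7   # failed checks tolerated before failover
> > >> > health_check_retry_delay = 15  # seconds between retries
> > >> >
> > >> > With these numbers a dead node should survive roughly 7 x 15 = 105
> > >> > seconds of failed checks before failover kicks in.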
> > >> >
> > >> > Now I have the client's environment and a lab environment in which
> > >> > to reproduce the issue. The pgpool configuration and version are
> > >> > identical in both.
> > >> >
> > >> > To simulate a flaky network, I use iptables to deny postgres
> > >> > connections to one of the db nodes and see that pgpool on all app
> > >> > nodes is trying to reconnect according to the configured number of
> > >> > retries and retry delay, i.e.:
> > >> >
> > >> > > 2019-02-11 14:22:51: pid 7825: LOG:  failed to connect to PostgreSQL server on "10.0.10.133:5433", getsockopt() detected error "No route to host"
> > >> > > ...
> > >> > > 2019-02-11 14:23:23: pid 6458: LOG:  health checking retry count 1
> > >> > > ...
> > >> > > 2019-02-11 14:23:38: pid 6458: LOG:  health checking retry count 2
> > >> > > ...
> > >> > > 2019-02-11 14:42:45: pid 6458: LOG:  health checking retry count 3
> > >> > > ...
> > >> > > 2019-02-11 14:43:00: pid 6458: LOG:  health checking retry count 4
> > >> > > ...
> > >> > > 2019-02-11 14:43:15: pid 6458: LOG:  health checking retry count 5
> > >> > > ...
> > >> > > 2019-02-11 14:43:30: pid 6458: LOG:  health checking retry count 6
> > >> > > ...
> > >> > > 2019-02-11 14:43:30: pid 6460: LOG:  failover request from local pgpool-II node received on IPC interface is forwarded to master watchdog node "172.20.20.173:5432"
> > >> > > 2019-02-11 14:43:30: pid 4565: LOG:  watchdog received the failover command from remote pgpool-II node "172.20.20.172:5432"
> > >> > > ...
> > >> > > 2019-02-11 14:43:30: pid 4563: LOG:  execute command: /etc/pgpool-II/failover.sh 0 10.0.10.133 5433 /opt/redsky/db/data 1 0 10.0.10.134 1 5433 /opt/redsky/db/data
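> > >> >
> > >> > A rule of roughly this shape reproduces that symptom (a sketch, not
> > >> > the exact command; run on the db node, 5433 being the postgres
> > >> > port):
> > >> >
> > >> > iptables -A INPUT -p tcp --dport 5433 -j REJECT --reject-with icmp-host-prohibited
> > >> >
> > >> > REJECT with icmp-host-prohibited is what getsockopt() surfaces as
> > >> > "No route to host".
> > >> >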
> > >> > However, in the client's environment failover gets initiated before
> > >> > the configured number of retries, i.e.:
> > >> >
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog received the failover command from local pgpool-II on IPC interface
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  forwarding the failover request [DEGENERATE_BACKEND_REQUEST] to all alive nodes
> > >> > > 2019-02-09 05:17:47: pid 19402: DETAIL:  watchdog cluster currently has 3 connected remote nodes
> > >> > > 2019-02-09 05:17:47: pid 19276: ERROR:  unable to read data from DB node 1
> > >> > > 2019-02-09 05:17:47: pid 19276: DETAIL:  socket read failed with an error "Success"
> > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  Pgpool-II parent process has received failover request
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  new IPC connection received
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  received the failover command lock request from local pgpool-II on IPC interface
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node "10.15.35.35:5432" is requesting to become a lock holder for failover ID: 19880
> > >> > > 2019-02-09 05:17:47: pid 19402: LOG:  local pgpool-II node "10.15.35.35:5432" is the lock holder
> > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  starting degeneration. shutdown host 10.38.135.137(5433)
> > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  Restart all children
> > >> > > 2019-02-09 05:17:47: pid 19400: LOG:  execute command: /etc/pgpool-II/failover.sh 1 10.38.135.137 5433 /opt/redsky/db/data 0 0 10.15.35.39 1 5433 /opt/redsky/db/data
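> > >> >
> > >> > A hedged guess rather than a diagnosis: the degenerate request above
> > >> > comes from a child process hitting a socket read error, not from the
> > >> > health-check loop, and pgpool degenerates a backend immediately on
> > >> > such errors when fail_over_on_backend_error is on (the default),
> > >> > regardless of health_check_max_retries. The knob in question:
> > >> >
> > >> > fail_over_on_backend_error = off  # check what the attached pgpool.conf actually has
> > >> >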
> > >> > I ran the following command on all app nodes:
> > >> >
> > >> > psql -c 'pgpool show health_check_max_retries'
> > >> >  health_check_max_retries
> > >> > --------------------------
> > >> >  16
> > >> > (1 row)
> > >> >
> > >> > and the number is different from what I have in the configuration
> > >> > file. It's more than 1, though, and I expect it to be honored.
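> > >> >
> > >> > If the file on a node really does say 7 while SHOW reports 16, a
> > >> > hedged guess is that the running pgpool has not re-read the file;
> > >> > health_check_max_retries is reloadable, so something like
> > >> >
> > >> >   pgpool reload
> > >> >
> > >> > (or a full restart) should bring 'pgpool show' back in line with
> > >> > the file.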
> > >>
> > >> I could not reproduce this issue by using pgpool_setup.
> > >> Could you share the whole pgpool.conf?
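> > >>
> > >> For reference, my reproduction attempt was along these lines (a
> > >> sketch; pgpool_setup ships with the pgpool-II source tree):
> > >>
> > >>   pgpool_setup -m s -n 2
> > >>
> > >> which builds a two-node streaming-replication cluster under the
> > >> current directory.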
> > >>
> > >> > Can you guys help me out? I'm out of ideas.
> > >> >
> > >> > pgpool-II-pg96-3.6.10-1pgdg.rhel6.x86_64
> > >>
> > >>
> > >> --
> > >> Bo Peng <pengbo at sraoss.co.jp>
> > >> SRA OSS, Inc. Japan
> > >>
> > >>
>
>
> --
> Bo Peng <pengbo at sraoss.co.jp>
> SRA OSS, Inc. Japan
>
> _______________________________________________
> pgpool-general mailing list
> pgpool-general at pgpool.net
> http://www.pgpool.net/mailman/listinfo/pgpool-general
>

