[pgpool-general: 5873] Re: Troubleshooting assistance on node failure and connections blocking

Mike Koepke mike.koepke at gmail.com
Fri Jan 19 07:27:29 JST 2018


Thank you very much for the information.

I see pgpool on node 0 detecting that it cannot connect to node 1 (an
"Interrupted system call" message, then "failed to make persistent db
connection"), and then pgpool on node 0 being restarted.

A couple of clarifying questions....

1) What happens after the restart of pgpool if the slave node (1) remains
down for an extended time?  Will pgpool on node 0 stop attempting any
connections to node 1 postgres until a health check passes, or will it keep
trying regardless?  It looks like it continues to try to establish
connections to node 1 postgres.
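(A side note on how we are observing this, in case the method matters: a
minimal check, assuming pgpool is listening on its usual port, is to ask
pgpool itself for its view of the backends.  Host, port, and user below are
just placeholders for our environment.)

    # Ask pgpool (not postgres directly) what it thinks each backend's status is
    psql -h localhost -p 9999 -U postgres postgres -c "SHOW pool_nodes;"
    # the status column reports whether pgpool considers each backend up or down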

2) Maybe related to question 1: we noticed that prior developers configured
our /etc/sysconfig/pgpool with OPTS="-n -D".  Since pgpool appears to use the
pgpool_status file to carry node status across restarts, and -D discards that
file, will this cause failover issues because the node 1 down status is lost
on restart?
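For reference, this is roughly what the file looks like today, and what I
understand the change would be if we wanted pgpool to restore the previous
node status instead of discarding it (a sketch of our /etc/sysconfig/pgpool;
the exact layout may differ on other setups):

    # /etc/sysconfig/pgpool (current)
    OPTS="-n -D"    # -n: do not run as a daemon, -D: discard pgpool_status at startup

    # Possible change: drop -D so the saved node status survives a restart
    OPTS="-n"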

3) Throughout this whole troubleshooting experience we have found that,
without any pgpool.conf changes, if node 1 is disconnected via that hard
failure, node 0 will "hang", blocking all DB connections, and then at about
the 16 minute mark free up and just start processing db requests.  Where is
this "16 minute" hang originating from?  It seems pretty consistent and
repeatable when we repeat the failure scenario without pgpool.conf changes.

Thank you for your help.  This has been a challenging issue to get corrected
so that operation is seamless when the standby/slave node has hard failed.

On Wed, Jan 17, 2018 at 11:47 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > We have 2 nodes running.   The stack on each node is:
> >
> >
> >
> > Our app
> >
> > HikariCP
> >
> > Jdbc
> >
> > Pg-pool 3.6.7
> >
> > Postgres 9.6.6
> >
> >
> >
> > Pgpool is set for Master-Slave with streaming replication; no load
> > balancing.
> >
> >
> >
> > We are testing our disaster recovery and failover capabilities.  If we
> > gracefully shut down node 1 (2nd node), the 1st node proceeds as if
> > nothing happened.  The app continues to run without missing a beat, as
> > you would expect.
> >
> >
> >
> > Our problem is when we encounter a “hard” error.  If node 1 becomes
> > disconnected (network is removed), node 0 becomes impacted.  The app
> > freezes up as it can no longer get database connections.  We see the
> > app/Spring talk to Hikari, Hikari talks to JDBC, JDBC cannot get a
> > connection, eventually Hikari times out (with its 30 sec connection
> > wait) and replies to the app, and we get exceptions.  This repeats as
> > the app continues to try to talk to the database.  Pgpool is aware that
> > node 1 is gone, as it is in recovery mode, and node 0 pgpool retries
> > establishing connectivity to pgpool on node 1 per the pgpool.conf
> > intervals.
> >
> >
> >
> > So the thing that really has us stumped is: if node 0 is only talking
> > through its stack to node 0 postgres, why is this failure on node 1
> > having any impact on node 0 and freezing the db connections?  Obviously
> > when a graceful shutdown occurs, pgpool handles it gracefully and things
> > work as you expect.  With a hard failure, it does not.  I have attached
> > our pgpool.conf file.  Can someone provide some guidance into the
> > internals of pgpool and why this node 1 hard failure causes node 0
> > impacts?
>
> Pgpool-II connects to all PostgreSQL backends even if load_balance_mode = off.
> There have been ongoing discussions about making Pgpool-II connect to only
> one backend, but it is not implemented yet.
>
> If you want to shorten the "black period" (that is, while Pgpool-II is
> working on failover), you can adjust the health check and failover related
> parameters.
>
> Changing fail_over_on_backend_error from off to on will cause an immediate
> failover if there is a problem connecting to a backend or reading/writing
> its sockets.
>
> health_check_period = 40 means it may take up to 40 seconds before
> Pgpool-II notices the error. So you might want to shorten this.
>
> health_check_timeout = 10 means it may take up to 10 seconds before
> Pgpool-II notices the error. So you might want to shorten this.
>
> health_check_max_retries = 3 means it may retry before giving up, for up to
> health_check_timeout * health_check_max_retries = 30 seconds.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
>
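Noting my reading of the advice above as a rough pgpool.conf sketch, with
illustrative values only (not a recommendation, to be tuned per environment):

    # pgpool.conf -- example of tightening failure detection per the reply above
    fail_over_on_backend_error = on   # fail over right away on backend socket errors
    health_check_period = 10          # probe the backends every 10 seconds
    health_check_timeout = 5          # give up an individual health check after 5 seconds
    health_check_max_retries = 2      # retry a couple of times before declaring a node down
    health_check_retry_delay = 1      # wait 1 second between retries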