[pgpool-general: 5875] Re: Troubleshooting assistance on node failure and connections blocking

Tatsuo Ishii ishii at sraoss.co.jp
Fri Jan 19 07:59:55 JST 2018


> Thank you very much for the information.

You are welcome.

> I see pg-pool on node 0 detecting that it cannot connect to node 1,
> Interrupted system call message
> failed to make persistent db connection
> and then pg-pool on node 0 being restarted
> 
> A couple of clarifying questions....
> 
> 1) What happens after a restart of pg-pool if the slave node (1) remains
> down for an extended time?  Will pg-pool on node 0 stop attempting any
> connections to node 1 postgres until a healthcheck passes, or will it try
> regardless?   It looks like it continues to try to establish connections
> to node 1 postgres.

As your question #2 suggests, that depends on whether Pgpool-II is
started with or without the -D option: if -D is not given, Pgpool-II uses
the status file to know which nodes are already dead. For example, if
node 1 is marked dead in the status file, Pgpool-II will skip node 1
in health checking and connection establishment.
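For reference, the status file (pgpool_status, kept under the directory set
by logdir) is plain text with one line per backend in node-id order. A
hypothetical example for your 2-node cluster, written to /tmp purely for
illustration:

```shell
# Hypothetical pgpool_status for a 2-node cluster (the real file lives
# under $logdir; /tmp is used here only for illustration).
cat > /tmp/pgpool_status <<'EOF'
up
down
EOF
# One line per backend in node-id order: node 0 is up, node 1 is down.
# When started without -D, Pgpool-II reads this and skips node 1.
grep -n 'down' /tmp/pgpool_status   # prints 2:down
```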

> 2) Maybe related to question 1: we notice that prior developers have our
> /etc/sysconfig/pgpool configured with OPTS="-n -D".  Since it looks like
> pg-pool uses the status file to carry node status across restarts, and -D
> discards it, will this cause failover issues because the node 1 down
> status is lost on restart?

Yes.
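If you want the node 1 down status to survive restarts, drop -D from the
options. A minimal sketch (path and variable name taken from your message;
verify against your init script):

```shell
# /etc/sysconfig/pgpool (path as quoted in the question above)
# Without -D, Pgpool-II reads pgpool_status at startup, so a node
# marked down stays down across a pgpool restart.
OPTS="-n"
```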

> 3) Throughout this whole troubleshooting experience, we have found that,
> without any pgpool.conf changes, if node 1 is disconnected via that hard
> failure, node 0 will "hang", blocking all DB connections, then at about
> the 16-minute mark free up and just start processing db requests.   Where
> is this "16 minute" hang originating from?  It seems pretty consistent and
> repeatable when we repeat the failure scenario without pgpool.conf changes.

16 minutes is too long. By my calculation, it should be:

(health_check_timeout + health_check_retry_delay) * (health_check_max_retries + 1) = (10 + 1) * (3 + 1) = 44 seconds

before Pgpool-II starts failover.
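That worst case counts the initial health check plus each retry, with every
attempt paying the timeout and the retry delay. As shell arithmetic, using
the parameter values quoted above:

```shell
# Worst-case time to declare node 1 dead: the initial health check
# plus max_retries retries, each attempt costing timeout + retry_delay.
timeout=10       # health_check_timeout
retry_delay=1    # health_check_retry_delay
max_retries=3    # health_check_max_retries
echo $(( (timeout + retry_delay) * (max_retries + 1) ))   # prints 44
```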

Can you share the Pgpool-II log with debugging on (restart with the -d
option) covering the 16-minute hang?
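If you later want to shorten that detection window, a pgpool.conf sketch
along the lines of my earlier reply (illustrative values only, not tuned
recommendations):

```ini
# pgpool.conf fragment -- illustrative values only
fail_over_on_backend_error = on   # fail over immediately on backend socket errors
health_check_period = 5           # probe every 5 seconds instead of 40
health_check_timeout = 5          # abandon a single probe after 5 seconds
health_check_max_retries = 2
health_check_retry_delay = 1      # worst case: (5 + 1) * (2 + 1) = 18 seconds
```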

> Thank you for your help.  This has been a challenging issue to get
> corrected for seamless operation in the standby/slave node hard failed
> case.
>
> On Wed, Jan 17, 2018 at 11:47 PM, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
> 
>> > We have 2 nodes running.   The stack on each node is:
>> >
>> >
>> >
>> > Our app
>> >
>> > HikariCP
>> >
>> > Jdbc
>> >
>> > Pg-pool 3.6.7
>> >
>> > Postgres 9.6.6
>> >
>> >
>> >
>> > Pgpool is set for Master-Slave with streaming replication; no load
>> > balancing.
>> >
>> >
>> >
>> > We are testing our disaster recovery and failover capabilities.    If we
>> > gracefully shut down node 1 (2nd node), the 1st node proceeds as if
>> > nothing happened.  The app continues to run without missing a beat, as
>> > you would expect.
>> >
>> >
>> >
>> > Our problem is when we encounter a “hard” error.  If node 1 becomes
>> > disconnected (network is removed), node 0 becomes impacted.   The app
>> > will freeze up as it can no longer get database connections.   We see
>> > the app/spring talk to Hikari, Hikari talks to jdbc, jdbc cannot get a
>> > connection, eventually Hikari times out (with a 30 sec connection wait)
>> > and replies to the app, and we get exceptions.  This repeats as the app
>> > continues to try to talk to the database.    Pgpool is aware that node 1
>> > is gone, as it is in recovery mode, and node 0 pgpool retries to
>> > establish connectivity to pgpool on node 1 per pgpool.conf intervals.
>> >
>> >
>> >
>> > So the thing that really has us stumped is: if node 0 is only talking
>> > through its stack to node 0 postgres, why is this failure on node 1
>> > having any impact on node 0 and freezing the db connections?    Obviously
>> > when a graceful shutdown occurs, pgpool gracefully handles this and
>> > things work as you expect.  With a hard failure, it does not.    I have
>> > attached our pgpool.conf file.   Can someone provide some guidance into
>> > the internals of pgpool and why this node 1 hard failure causes node 0
>> > impacts?
>>
>> Pgpool-II connects to all PostgreSQL backends even if load_balance_mode =
>> off.  There have been ongoing discussions about making Pgpool-II connect
>> to only one backend, but it has not been implemented yet.
>>
>> If you want to shorten the "black period" (that is, the period while
>> Pgpool-II is working on the failover), you can adjust the health check
>> parameters and failover-related parameters.
>>
>> Changing fail_over_on_backend_error from off to on will cause immediate
>> failover if there is a problem connecting to, or reading/writing sockets
>> to, a backend.
>>
>> With health_check_period = 40, it may take up to 40 seconds before
>> Pgpool-II notices the error, so you might want to shorten this.
>>
>> With health_check_timeout = 10, it may take up to 10 seconds before
>> Pgpool-II notices the error, so you might want to shorten this.
>>
>> With health_check_max_retries = 3, Pgpool-II retries before giving up,
>> which can add up to health_check_timeout * health_check_max_retries = 30
>> seconds.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>>

