[pgpool-general: 5871] Re: Troubleshooting assistance on node failure and connections blocking

Thu Jan 18 14:47:09 JST 2018

> We have 2 nodes running.  The stack on each node is:
>
> Our app
> HikariCP
> JDBC
> Pgpool-II 3.6.7
> Postgres 9.6.6
>
> Pgpool is set for Master-Slave with streaming replication; no load
> balancing.
> 
> 
> 
> We are testing our disaster recovery and failover capabilities.  If we
> gracefully shut down node 1 (the 2nd node), the 1st node proceeds as if
> nothing happened.  The app continues to run without missing a beat, as you
> would expect.
> 
> 
> 
> Our problem is when we encounter a “hard” error.  If node 1 becomes
> disconnected (network is removed), node 0 becomes impacted.  The app will
> freeze up as it can no longer get database connections.  We see the
> app/Spring talk to Hikari, Hikari talks to JDBC, JDBC cannot get a
> connection, eventually Hikari times out (with a 30 sec connection wait)
> and replies to the app, and we get exceptions.  This repeats as the app
> continues to try to talk to the database.  Pgpool is aware that node 1 is
> gone, as it is in recovery mode, and the node 0 pgpool retries to establish
> connectivity to pgpool on node 1 per the pgpool.conf intervals.
> 
> 
> 
> So the thing that really has us stumped is: if node 0 is only talking
> through its stack to node 0 postgres, why is this failure on node 1 having
> any impact on node 0 and freezing the db connections?  Obviously when a
> graceful shutdown occurs, pgpool gracefully handles this and things work as
> you expect.  With a hard failure, it does not.  I have attached our
> pgpool.conf file.  Can someone provide some guidance into the internals of
> pgpool and why this node 1 hard failure causes node 0 impacts?

Pgpool-II connects to all PostgreSQL backends even if load_balance_mode
= off. There have been ongoing discussions about making Pgpool-II
connect to only 1 backend, but that is not implemented yet.

If you want to shorten the "black period" (that is, the time during
which Pgpool-II is working on failover), you can adjust the health
check parameters and the failover related parameter.

Changing fail_over_on_backend_error from off to on will cause immediate
failover if there's a problem connecting to a backend or reading
from/writing to its socket.

With health_check_period = 40, it may take up to 40 seconds before
Pgpool-II starts a health check and notices the error. So you might
want to shorten this.

With health_check_timeout = 10, each health check may take up to 10
seconds before Pgpool-II notices the error. So you might want to
shorten this.

With health_check_max_retries = 3, Pgpool-II could keep retrying before
it gives up, for up to
health_check_timeout*health_check_max_retries = 30 seconds.
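
For illustration only, the settings discussed above might look something
like this in pgpool.conf. The numbers are assumptions picked just to
show the direction of the change, not values I am recommending; you need
to choose them for your own network and workload.

```
# pgpool.conf fragment (illustrative values only, not a recommendation)

# Fail over as soon as connecting to a backend, or reading/writing its
# socket, fails. The poster's current setting is off.
fail_over_on_backend_error = on

# Seconds between health checks (currently 40).
health_check_period = 5

# Seconds before a single health check attempt times out (currently 10).
health_check_timeout = 3

# Number of retries before the backend is given up on (unchanged).
health_check_max_retries = 3
```

By the same arithmetic as above, the retry phase with these example
values would be bounded by roughly 3*3 = 9 seconds instead of 30, not
counting any health_check_retry_delay between retries. Note that values
which are too short can cause unnecessary failovers on a temporary
network glitch.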

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp