[pgpool-hackers: 186] pgpool health check failsafe mechanism

Wed Apr 3 15:43:58 JST 2013

Hi,

We are facing an issue with the pgpool health check failsafe mechanism in
our production environment. I previously posted this issue at
http://www.pgpool.net/mantisbt/view.php?id=50. I have observed the following
issues with pgpool-II version 3.2.3 (built from the latest source code).

Versions used:

> pgpool-II version 3.2.3
> postgresql 9.2.3 (Master + Slave)

1. In a master/slave configuration with health check and failover enabled:

pgpool.conf

> backend_flag0 = 'ALLOW_TO_FAILOVER'
> backend_flag1 = 'ALLOW_TO_FAILOVER'
>
> health_check_period = 5
> health_check_timeout = 1
> health_check_max_retries = 2
> health_check_retry_delay = 10
>
> load_balance_mode = off

On Linux64, the master server is running fine without load balancing when
a network interruption or some other failure suddenly prevents pgpool from
connecting to the slave server (I mimicked this by forcefully shutting down
the slave dbserver in immediate mode). After that, the first connection
attempt to pgpool returns without any error or warning message, and pgpool
performs a failover and kills all child processes. Does it make sense that,
when there is no load balancing and the master dbserver is serving queries
well, a disconnection of the slave server triggers a failover?

pgpool.log

> ....
> 2013-04-02 17:24:36 DEBUG: pid 65431: I am 65431 accept fd 6
> 2013-04-02 17:24:36 DEBUG: pid 65431: read_startup_packet: application_name: psql
> 2013-04-02 17:24:36 DEBUG: pid 65431: Protocol Major: 3 Minor: 0 database: postgres user: asif
> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 0 backend
> 2013-04-02 17:24:36 DEBUG: pid 65431: new_connection: connecting 1 backend
> 2013-04-02 17:24:36 ERROR: pid 65431: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> 2013-04-02 17:24:36 ERROR: pid 65431: connection to localhost(7445) failed
> 2013-04-02 17:24:36 ERROR: pid 65431: new_connection: create_cp() failed
> 2013-04-02 17:24:36 LOG:   pid 65431: degenerate_backend_set: 1 fail over request from pid 65431
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler called
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: starting to select new master node
> 2013-04-02 17:24:36 LOG:   pid 65417: starting degeneration. shutdown host localhost(7445)
> 2013-04-02 17:24:36 LOG:   pid 65417: Restart all children
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65418
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65419
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65420
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65421
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65422
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65423
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65424
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65425
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65426
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65427
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65428
> 2013-04-02 17:24:36 DEBUG: pid 65417: failover_handler: kill 65429
> ...
> ...

2. With the same configuration, but with failover disabled:

pgpool.conf

> backend_flag0 = 'DISALLOW_TO_FAILOVER'
> backend_flag1 = 'DISALLOW_TO_FAILOVER'
>
> health_check_period = 5
> health_check_timeout = 1
> health_check_max_retries = 2
> health_check_retry_delay = 10
>
> load_balance_mode = off

On Linux64, the master server is running fine, with no load balancing and
no failover, when the slave server suddenly appears to be disconnected
because of a network interruption or some other reason (I mimicked this by
forcefully shutting down the slave dbserver in immediate mode). After that,
no connection attempt to pgpool succeeds until the health check completes,
and the master database server log shows the following messages:

dbserver.log
  ...
  ...
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  ...

3. While testing this scenario on my Mac OS X machine (built with gcc), the
health check never seems to complete: it retries endlessly with the pgpool
configuration settings from issue #2 above, and it completely prevents me
from connecting to pgpool any more:

pgpool.log

> ...
> ...
> 2013-04-03 11:29:29 DEBUG: pid 44263: retrying *679* th health checking
> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 0 th DB node status: 2
> 2013-04-03 11:29:29 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: auth kind: 0
> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: backend key data received
> 2013-04-03 11:29:29 DEBUG: pid 44263: s_do_auth: transaction state: I
> 2013-04-03 11:29:29 DEBUG: pid 44263: health_check: 1 th DB node status: 2
> 2013-04-03 11:29:29 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> 2013-04-03 11:29:29 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
> 2013-04-03 11:29:29 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
> 2013-04-03 11:29:29 LOG:   pid 44263: health_check: 1 failover is canceld because failover is disallowed
> 2013-04-03 11:29:34 DEBUG: pid 44263: retrying *680* th health checking
> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 0 th DB node status: 2
> 2013-04-03 11:29:34 DEBUG: pid 44263: pool_ssl: SSL requested but SSL support is not available
> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: auth kind: 0
> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: backend key data received
> 2013-04-03 11:29:34 DEBUG: pid 44263: s_do_auth: transaction state: I
> 2013-04-03 11:29:34 DEBUG: pid 44263: health_check: 1 th DB node status: 2
> 2013-04-03 11:29:34 ERROR: pid 44263: connect_inet_domain_socket: getsockopt() detected error: Connection refused
> 2013-04-03 11:29:34 ERROR: pid 44263: make_persistent_db_connection: connection to localhost(7445) failed
> 2013-04-03 11:29:34 ERROR: pid 44263: health check failed. 1 th host localhost at port 7445 is down
> 2013-04-03 11:29:34 LOG:   pid 44263: health_check: 1 failover is canceld because failover is disallowed
> ...
> ...

I will try it on a Linux64 machine too. Thanks.

Best Regards,
Asif Naeem