View Issue Details

ID: 0000050
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2013-04-04 11:20
Reporter: Asif Naeem
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: always
Status: resolved
Resolution: open
Platform: Linux64
OS: 
OS Version: 
Product Version: 
Target Version: 
Fixed in Version: 
Summary: 0000050: pgpool connectivity issue when health check enabled
Description: In a master-slave configuration, if the slave dbserver shuts down unexpectedly or via immediate/fast mode, pgpool does not allow access to the master database until the health check completes. Any attempt to connect via pgpool fails without any error message, and the dbserver log complains about "invalid length of startup packet". The issue is easy to reproduce with higher health check parameter values, e.g. health_check_retry_delay = 10. Please find attached the pgpool.conf that I used to write a test case for this issue.
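
For reference, a minimal sketch of the health-check settings involved looks like this (illustrative values only, not the contents of the attached pgpool.conf):

     health_check_period = 5          # seconds between health checks
     health_check_max_retries = 3     # retries before a backend is marked down
     health_check_retry_delay = 10    # seconds to wait between retries
     health_check_user = 'postgres'   # user pgpool connects as for the check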

The PostgreSQL dbserver log contains messages such as the following:
LOG: database system is ready to accept connections
LOG: autovacuum launcher started
LOG: invalid length of startup packet
LOG: invalid length of startup packet
LOG: invalid length of startup packet
...


Steps To Reproduce:
1. Run two PostgreSQL dbservers (master + slave).
2. Run pgpool with a master-slave configuration, health check enabled, and a high health_check_retry_delay value, e.g. 100 for testing purposes.
3. Use psql to connect to the master server, e.g. psql -p 9999 -c "select 1;". It should work.
4. Stop the slave dbserver with "-m immediate", e.g. "pg_ctl -D data_slave -m immediate stop".
5. Use psql to connect to the master server again, e.g. psql -p 9999 -c "select 1;". It does not work and psql returns without any message or warning (a condensed command sketch follows).
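
A condensed command-line sketch of these steps (the master's port and the data directory names are assumptions, not values taken from the attached configuration):

     # assumed layout: master on port 7444, slave on 7445, pgpool on 9999
     pg_ctl -D data_master start
     pg_ctl -D data_slave start
     pgpool -n -f pgpool.conf &                 # health check enabled, high retry delay
     psql -p 9999 -c "select 1;"                # works
     pg_ctl -D data_slave -m immediate stop
     psql -p 9999 -c "select 1;"                # returns silently until the health check completes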
Additional Information: Versions used:
pgpool-II version 3.2.2 (namameboshi)
postgresql 9.2.3 (Master + Slave)
Tags: No tags attached.

Activities

Asif Naeem

2013-02-15 20:03

reporter  

pgpool.conf (26,508 bytes)

t-ishii

2013-02-19 15:01

developer   ~0000226

As you might have already noticed, we found a problem with pgpool-II 3.2.2 and released 3.2.3. The fix was related to the process of connecting to the backend. I suspect your problem might be related to this. Can you please try 3.2.3 out?

Asif Naeem

2013-02-19 20:06

reporter   ~0000227

Thank you t-ishii.

I have used the latest source code (yesterday's release, pgpool-II version 3.2.3 (namameboshi)) to test the scenario. It still performs failover when the master is available and the slave is forcefully shut down (via immediate mode etc.) before the health check has completed. The first connection attempt fails without an error message (it triggers failover); subsequent connections succeed because a complete failover has happened, i.e.

                new_connection()->notice_backend_error()->degenerate_backend_set()->kill(parent, SIGUSR1); that triggers failover_handler handler.

Following is the log from the first connection attempt after the slave was forcefully shut down:

-------------------
....
....
2013-02-19 14:40:46 DEBUG: pid 13780: I am 13780 accept fd 6
2013-02-19 14:40:46 DEBUG: pid 13780: read_startup_packet: application_name: psql
2013-02-19 14:40:46 DEBUG: pid 13780: Protocol Major: 3 Minor: 0 database: DUDB user: asif
connect_backend : called
2013-02-19 14:40:46 DEBUG: pid 13780: new_connection: connecting 0 backend
2013-02-19 14:40:46 DEBUG: pid 13780: new_connection: connecting 1 backend
2013-02-19 14:40:46 ERROR: pid 13780: connect_inet_domain_socket: getsockopt() detected error: Connection refused
2013-02-19 14:40:46 ERROR: pid 13780: connection to localhost(7445) failed
2013-02-19 14:40:46 ERROR: pid 13780: new_connection: create_cp() failed
2013-02-19 14:40:46 LOG: pid 13780: degenerate_backend_set: 1 fail over request from pid 13780
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler called
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: starting to select new master node
2013-02-19 14:40:46 LOG: pid 13539: starting degeneration. shutdown host localhost(7445)
2013-02-19 14:40:46 LOG: pid 13539: Restart all children
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13540
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13541
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13542
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13543
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13544
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13545
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13546
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13547
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13548
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13549
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13550
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13551
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13552
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13553
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13554
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13555
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13556
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13557
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13558
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13559
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13560
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13561
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13562
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13563
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13564
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13565
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13566
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13567
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13568
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13780
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13570
2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13571
2013-02-19 14:40:46 LOG: pid 13539: find_primary_node_repeatedly: waiting for finding a primary node
2013-02-19 14:40:46 DEBUG: pid 13539: pool_ssl: SSL requested but SSL support is not available
2013-02-19 14:40:46 DEBUG: pid 13541: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13559: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13550: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13560: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13545: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13561: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13544: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13562: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13549: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13548: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13563: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13552: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13565: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13542: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13566: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13554: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13567: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13558: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13568: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13547: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13571: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13556: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13540: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13555: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13553: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13546: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13780: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13551: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13557: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13564: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13570: child received shutdown request signal 3
2013-02-19 14:40:46 DEBUG: pid 13539: s_do_auth: auth kind: 0
2013-02-19 14:40:46 DEBUG: pid 13539: s_do_auth: backend key data received
2013-02-19 14:40:46 DEBUG: pid 13539: s_do_auth: transaction state: I
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: extended:0 query:SELECT pg_is_in_recovery()
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: T
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: row description received
2013-02-19 14:40:46 DEBUG: pid 13539: num_fileds: 1
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: D
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: data row received
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: C
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: Command complete received
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: Z
2013-02-19 14:40:46 DEBUG: pid 13539: do_query: Ready for query
2013-02-19 14:40:46 LOG: pid 13539: find_primary_node: primary node id is 0
2013-02-19 14:40:46 LOG: pid 13539: failover: set new primary node: 0
2013-02-19 14:40:46 LOG: pid 13539: failover: set new master node: 0
...
...

----------------------

I put some effort into debugging this issue and came up with pgpool2_3.2.3_healthcheck_issue.patch, which seems to fix it. You can probably come up with a better approach. Thanks.

Regards,
Asif Naeem

Asif Naeem

2013-02-19 20:06

reporter  

pgpool2_3.2.3_healthcheck_issue.patch (804 bytes)
diff --git a/pool_connection_pool.c b/pool_connection_pool.c
index cc0ba21..c9855d7 100644
--- a/pool_connection_pool.c
+++ b/pool_connection_pool.c
@@ -725,6 +725,16 @@ static POOL_CONNECTION_POOL *new_connection(POOL_CONNECTION_POOL *p)
 			/* connection failed. mark this backend down */
 			pool_error("new_connection: create_cp() failed");
 
+			/*If it is not master node and there is connection failure, don't quit child*/
+			if( i != REAL_MASTER_NODE_ID )
+			{
+				pool_debug("new_connection: skipping slot %d because connection failed", i);
+				/* change node status */
+				BACKEND_INFO(i).backend_status = CON_DOWN;
+				*(my_backend_status[i]) = CON_DOWN;
+				continue;
+			}
+
 			/* If fail_over_on_backend_error is true, do failover.
 			 * Otherwise, just exit this session.
 			 */

t-ishii

2013-02-20 08:08

developer   ~0000228

I think your patch breaks the case where health check is disabled, because then pgpool would not trigger failover when a connection error is detected.
I think what you want is to turn off fail_over_on_backend_error.
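
For illustration, that suggestion corresponds to a pgpool.conf fragment like the following (a sketch only; the rest of the configuration stays as it is):

     # do not trigger failover when a connection attempt to a backend fails;
     # rely on the health check to detect and detach dead nodes instead
     fail_over_on_backend_error = off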

Asif Naeem

2013-02-21 20:22

reporter   ~0000229

I do feel that it may break functionality in some scenarios. I will work on it and share my findings. Thanks.

Asif Naeem

2013-02-26 20:56

reporter   ~0000230

I would like to present two scenarios in which pgpool (3.2.3) behaves in an undesirable or unexpected way:

1. In a master-slave configuration, with health check and failover enabled, i.e.
     backend_flag0 = 'ALLOW_TO_FAILOVER'
     backend_flag1 = 'ALLOW_TO_FAILOVER'
     health_check_period = 5
     health_check_retry_delay = 20
     load_balance_mode = off

The master server is running fine, there is no load balancing, and suddenly the slave server gets disconnected (forcefully shut down via immediate mode etc.). After that, the first connection attempt to pgpool returns without any error/warning message, and pgpool performs failover and kills all child processes. Does it make sense that, when there is no load balancing and the master dbserver is serving queries well, the disconnection of the slave server triggers failover?

2. In the same configuration, if I disable failover, i.e.
     backend_flag0 = 'DISALLOW_TO_FAILOVER'
     backend_flag1 = 'DISALLOW_TO_FAILOVER'
     health_check_period = 5
     health_check_retry_delay = 20
     load_balance_mode = off

The master server is running fine, there is no load balancing and no failover, and suddenly the slave server gets disconnected (forcefully shut down via immediate mode etc.). After that, no connection attempt to pgpool succeeds until the health check completes, and the master database server log shows the following messages:

  ...
  ...
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  LOG: incomplete startup packet
  ...

I believe both of the issues are valid and need to be fixed. Thanks.

t-ishii

2013-02-27 08:02

developer   ~0000231

Load balancing off does not mean that pgpool does not take care of standbys. In your case you should teach your DBA to execute pcp_detach_node to detach the node before shutting it down.
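
For illustration, a planned standby shutdown would then look something like this (PCP host, port, credentials, and node id are assumptions for the sketch):

     # detach backend node 1 (the standby) from pgpool first ...
     pcp_detach_node 10 localhost 9898 pcpuser pcppassword 1
     # ... then shut the standby down
     pg_ctl -D data_slave -m fast stop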

Asif Naeem

2013-02-27 17:47

reporter   ~0000232

Last edited: 2013-02-27 17:52

View 2 revisions

Thank you t-ishii. pcp_detach_node handles it gracefully when the DBA shuts down a node and detaches it properly. My case, however, was about pgpool's fail-safe mechanism. We have observed this issue in a production environment and I am not sure how we can work around it. If pgpool cannot make a connection to the slave server because of a network interruption or any other reason, even when the master server is up and working well, it does not handle it properly and shows unexpected behavior. Is it not recommended for production environments? Thanks.

Asif Naeem

2013-03-06 21:28

reporter   ~0000234

Can someone from the community guide me on how to get past this issue? Don't you think it is a serious issue? Thanks.

Asif Naeem

2013-03-13 18:19

reporter  

pgpool2_3.2.3_healthcheck_issue.patch.v2 (869 bytes)

Asif Naeem

2013-03-13 18:26

reporter   ~0000239

Hi t-ishii,

I have added a check for health_check in the v2 patch as you suggested earlier. Do you have any other concerns or suggestions? Thanks.

t-ishii

2013-03-13 23:35

developer   ~0000241

This patch will break pgpool child processes in many places. They assume that the number of nodes which are not down does not decrease while they are alive.
In your patch:
  BACKEND_INFO(i).backend_status = CON_DOWN;
will suddenly decrease the number of non-down nodes.
The only safe way to decrease it is restarting the pgpool child processes, which is done in failover.

BTW, the bug tracker is not a great place to discuss hacking on the pgpool code.
Can you please move this to the pgpool-hackers list?

Asif Naeem

2013-03-14 17:26

reporter   ~0000242

Sure. I will post the issue on pgpool-hackers. Thank you.

t-ishii

2013-04-04 11:20

developer   ~0000250

Moved to pgpool-hackers list.

Issue History

Date Modified Username Field Change
2013-02-15 20:03 Asif Naeem New Issue
2013-02-15 20:03 Asif Naeem File Added: pgpool.conf
2013-02-19 15:01 t-ishii Note Added: 0000226
2013-02-19 15:02 t-ishii Assigned To => t-ishii
2013-02-19 15:02 t-ishii Status new => assigned
2013-02-19 20:06 Asif Naeem Note Added: 0000227
2013-02-19 20:06 Asif Naeem File Added: pgpool2_3.2.3_healthcheck_issue.patch
2013-02-20 08:08 t-ishii Note Added: 0000228
2013-02-21 20:22 Asif Naeem Note Added: 0000229
2013-02-26 20:56 Asif Naeem Note Added: 0000230
2013-02-27 08:02 t-ishii Note Added: 0000231
2013-02-27 17:47 Asif Naeem Note Added: 0000232
2013-02-27 17:52 Asif Naeem Note Edited: 0000232 View Revisions
2013-03-06 21:28 Asif Naeem Note Added: 0000234
2013-03-13 18:19 Asif Naeem File Added: pgpool2_3.2.3_healthcheck_issue.patch.v2
2013-03-13 18:26 Asif Naeem Note Added: 0000239
2013-03-13 23:35 t-ishii Note Added: 0000241
2013-03-14 17:26 Asif Naeem Note Added: 0000242
2013-04-04 11:20 t-ishii Note Added: 0000250
2013-04-04 11:20 t-ishii Status assigned => resolved