View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000050 | Pgpool-II | Bug | public | 2013-02-15 20:03 | 2013-04-04 11:20 |
| Reporter | Asif Naeem | Assigned To | t-ishii | ||
| Priority | normal | Severity | major | Reproducibility | always |
| Status | resolved | Resolution | open | ||
| Platform | Linux64 | ||||
| Summary | 0000050: pgpool connectivity issue when health check enabled | ||||
| Description | In master slave configuration, If slave dbserver shutdown unexpectedly or via immediate/fast mode, pgpool don't allow access to the master database until health check is complete. Any attempt to connect via pgpool fail with out any error message, dbserver log complain about "invalid length of startup packet". It can be reproducible easily with higher values of health check parameter values i.e. health_check_retry_delay = 10 etc. PFA pgpool.conf that I used to write test case for this issue. Postgresql dbserver log contains warnings as following i.e. LOG: database system is ready to accept connections LOG: autovacuum launcher started LOG: invalid length of startup packet LOG: invalid length of startup packet LOG: invalid length of startup packet ... | ||||
| Steps To Reproduce | 1. Run 2 postgresql dbserver (Master + Slave) 2. Run pg_pool with master slave configuration with health check enabled and high health_check_retry_delay value e.g. 100 for testing purpose. 3. Use psql to connect to Master server e.g psql -p 9999 -c "select 1;" . It should work. 4. Stop slave dbserver with "-m immediate" e.g. "pg_ctl -D data_slave -m immediate stop" 5. Use psql to connect to Master server e.g psql -p 9999 -c "select 1;" . It is not going to work and psql will return without any message or warning. | ||||
| Additional Information | Used versions i.e. pgpool-II version 3.2.2 (namameboshi) postgresql 9.2.3 (Master + Slave) | ||||
| Tags | No tags attached. | ||||
|
|
|
|
|
As you might already notice, we found a problem with pgpool-II 3.2.2 and released 3.2.3. The fix was relating to a process of connection to backend. I suspect your problem might be related to this. Can you please try 3.2.3 out? |
|
|
Thank you t-ishii. I have used latest source code (yesterday's release, pgpool-II version 3.2.3 (namameboshi)) to test the scenario. It is doing fail over if master is available and slave system forcefully shutdown (via immediate mode etc) if health check is not complete. First connection attempt fail without error message (it triggers fail over) , next connections because complete fail over happend i.e. new_connection()->notice_backend_error()->degenerate_backend_set()->kill(parent, SIGUSR1); that triggers failover_handler handler. Following is log related to first connection attempt after slave forcefully shutdown i.e. ------------------- .... .... 2013-02-19 14:40:46 DEBUG: pid 13780: I am 13780 accept fd 6 2013-02-19 14:40:46 DEBUG: pid 13780: read_startup_packet: application_name: psql 2013-02-19 14:40:46 DEBUG: pid 13780: Protocol Major: 3 Minor: 0 database: DUDB user: asif connect_backend : called 2013-02-19 14:40:46 DEBUG: pid 13780: new_connection: connecting 0 backend 2013-02-19 14:40:46 DEBUG: pid 13780: new_connection: connecting 1 backend 2013-02-19 14:40:46 ERROR: pid 13780: connect_inet_domain_socket: getsockopt() detected error: Connection refused 2013-02-19 14:40:46 ERROR: pid 13780: connection to localhost(7445) failed 2013-02-19 14:40:46 ERROR: pid 13780: new_connection: create_cp() failed 2013-02-19 14:40:46 LOG: pid 13780: degenerate_backend_set: 1 fail over request from pid 13780 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler called 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: starting to select new master node 2013-02-19 14:40:46 LOG: pid 13539: starting degeneration. shutdown host localhost(7445) 2013-02-19 14:40:46 LOG: pid 13539: Restart all children 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13540 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13541 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13542 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13543 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13544 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13545 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13546 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13547 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13548 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13549 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13550 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13551 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13552 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13553 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13554 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13555 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13556 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13557 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13558 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13559 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13560 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13561 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13562 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13563 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13564 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13565 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13566 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13567 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13568 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13780 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13570 2013-02-19 14:40:46 DEBUG: pid 13539: failover_handler: kill 13571 2013-02-19 14:40:46 LOG: pid 13539: find_primary_node_repeatedly: waiting for finding a primary node 2013-02-19 14:40:46 DEBUG: pid 13539: pool_ssl: SSL requested but SSL support is not available 2013-02-19 14:40:46 DEBUG: pid 13541: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13559: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13550: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13560: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13545: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13561: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13544: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13562: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13549: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13548: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13563: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13552: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13565: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13542: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13566: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13554: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13567: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13558: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13568: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13547: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13571: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13556: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13540: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13555: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13553: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13546: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13780: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13551: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13557: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13564: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13570: child received shutdown request signal 3 2013-02-19 14:40:46 DEBUG: pid 13539: s_do_auth: auth kind: 0 2013-02-19 14:40:46 DEBUG: pid 13539: s_do_auth: backend key data received 2013-02-19 14:40:46 DEBUG: pid 13539: s_do_auth: transaction state: I 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: extended:0 query:SELECT pg_is_in_recovery() 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: T 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: row description received 2013-02-19 14:40:46 DEBUG: pid 13539: num_fileds: 1 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: D 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: data row received 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: C 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: Command complete received 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: kind: Z 2013-02-19 14:40:46 DEBUG: pid 13539: do_query: Ready for query 2013-02-19 14:40:46 LOG: pid 13539: find_primary_node: primary node id is 0 2013-02-19 14:40:46 LOG: pid 13539: failover: set new primary node: 0 2013-02-19 14:40:46 LOG: pid 13539: failover: set new master node: 0 ... ... ---------------------- I did put some effort to debug this issue, I come up with pgpool2_3.2.3_healthcheck_issue.patch, that seems fix this issue. Probably you can come up with better approach. Thanks. Regards, Asif Naeem |
|
|
|
|
|
I think your patch breaks the case when health_check is disabled because pgpool does not trigger failover when connection error is detected. I think what you want is turning off fail_over_on_backend_error. |
|
|
I do feeling that it may break functionality for in some scenarios that I will work on it and share my findings. Thanks. |
|
|
I would like to present two scenarios that pgpool (3.2.3) behavior is not desirable or unexpected i.e. 1. In master slave configuration, if health check and failover is enabled i.e. backend_flag0 = 'ALLOW_TO_FAILOVER' backend_flag1 = 'ALLOW_TO_FAILOVER' health_check_period = 5 health_check_retry_delay = 20 load_balance_mode = off When master server is running fine and there is no load balancing and suddenly slave server got disconnected (forcefully shutdown via immediate mode etc). After that first connection attempt to pgpool return without error/warning message and pgpool do fail over and kill all child processes. Does that make sense that when there is no load balancing and master dbserver is serving the queries well and disconnection of slave server trigger failover ?. 2. In the same previous configuration, If I disable failover i.e. backend_flag0 = 'DISALLOW_TO_FAILOVER' backend_flag1 = 'DISALLOW_TO_FAILOVER' health_check_period = 5 health_check_retry_delay = 20 load_balance_mode = off When master server is running fine and there is no load balancing and no failover and suddenly slave server got disconnected (forcefully shutdown via immediate mode etc). After that no connection attempt got successful to pgpool until health check complete and master database server log shows the following messages i.e. ... ... LOG: incomplete startup packet LOG: incomplete startup packet LOG: incomplete startup packet LOG: incomplete startup packet LOG: incomplete startup packet ... I believe both of the issues are valid and need to be fixed. Thanks. |
|
|
Load balacing off does not means that pgpool does not take care of standbys. In your case you should teach your DBA to execute pcp_detach_node to detach the node before shutting down it. |
|
|
Thank you t-ishii, pcp_detach_node handle it gracefully when DBA shutdown node and de attach it properly. I presented my case about pgpool fail safe mechanism. We have observed this issue in production environment, I am not sure how we can workaround it. If pgpool was not able to make connection to slave server because of network interruption happen or any other reason, even when master server is up and working well, it do not handle it properly and shows unexpected behavior. Is it not recommended for production environment ?. Thanks. |
|
|
Can someone from community guide me how to get over this issue, Dont you think its serious issue ?. Thanks. |
|
|
|
|
|
Hi t-ishii, I have added check for health_check in v2 patch as you suggested earlier, do you have any other concern or suggestion. Thanks. |
|
|
This patch will break pgpool child process in many places. They assume that the number of nodes which are not down is not descreased while they are alive. In your patch: BACKEND_INFO(i).backend_status = CON_DOWN; will suddenly descrease the number of non down nodes. The only safe way to descrease it is, restarting the pgpool child process, which is done in fail over. BTW, bug track is not a great place to discuss on hacking pgpool code. Can you please move to pgpool-hackers list? |
|
|
Sure. I will post the issue on pgpool-hackers. Thank you. |
|
|
Moved to pgpool-hackers list. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2013-02-15 20:03 | Asif Naeem | New Issue | |
| 2013-02-15 20:03 | Asif Naeem | File Added: pgpool.conf | |
| 2013-02-19 15:01 | t-ishii | Note Added: 0000226 | |
| 2013-02-19 15:02 | t-ishii | Assigned To | => t-ishii |
| 2013-02-19 15:02 | t-ishii | Status | new => assigned |
| 2013-02-19 20:06 | Asif Naeem | Note Added: 0000227 | |
| 2013-02-19 20:06 | Asif Naeem | File Added: pgpool2_3.2.3_healthcheck_issue.patch | |
| 2013-02-20 08:08 | t-ishii | Note Added: 0000228 | |
| 2013-02-21 20:22 | Asif Naeem | Note Added: 0000229 | |
| 2013-02-26 20:56 | Asif Naeem | Note Added: 0000230 | |
| 2013-02-27 08:02 | t-ishii | Note Added: 0000231 | |
| 2013-02-27 17:47 | Asif Naeem | Note Added: 0000232 | |
| 2013-02-27 17:52 | Asif Naeem | Note Edited: 0000232 | |
| 2013-03-06 21:28 | Asif Naeem | Note Added: 0000234 | |
| 2013-03-13 18:19 | Asif Naeem | File Added: pgpool2_3.2.3_healthcheck_issue.patch.v2 | |
| 2013-03-13 18:26 | Asif Naeem | Note Added: 0000239 | |
| 2013-03-13 23:35 | t-ishii | Note Added: 0000241 | |
| 2013-03-14 17:26 | Asif Naeem | Note Added: 0000242 | |
| 2013-04-04 11:20 | t-ishii | Note Added: 0000250 | |
| 2013-04-04 11:20 | t-ishii | Status | assigned => resolved |