View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000227 | Pgpool-II | Bug | public | 2016-07-30 09:19 | 2017-08-29 09:34 |
| Reporter | supp_k | Assigned To | Muhammad Usama | ||
| Priority | immediate | Severity | major | Reproducibility | random |
| Status | closed | Resolution | open | ||
| OS | CentOS | OS Version | 6 & 7 | ||
| Product Version | 3.5.3 | ||||
| Summary | 0000227: failover not performed by standby node | ||||
| Description | In my production environment the following case does not always lead to the failover procedure. I have 2 pgpool nodes in a watchdog cluster, and every pgpool node also hosts a running PostgreSQL instance. Sometimes we face network trouble and the server hosting the master pgpool and master postgresql becomes unavailable. In this case the standby pgpool node switches to master, which is fine. But the problem is that it does not trigger the failover command: I see the standby pgpool being promoted to master, yet it does not run its failover command. Why? The case reproduces randomly. | ||||
| Steps To Reproduce | Set up an environment of 2 servers: 1) pgpool + postgres 2) pgpool + postgres. Make the pgpool on server 1 the master, and its postgres the master as well; postgres should perform synchronous replication to the 2nd server. Unplug the network cable on the 1st server. Result: the 2nd pgpool switches to master but does not trigger the failover procedure, which in my scenario should promote the 2nd postgres to master. Please consider the issue; to me it is very critical. | ||||
| Additional Information | The configuration of the 1st pgpool server is attached. The 2nd server's configuration is identical. | ||||
| Tags | master slave, streaming replication, watchdog | ||||
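For readers trying to reconstruct this setup, the moving parts referenced throughout the thread are all pgpool.conf settings. Below is a minimal sketch of the relevant section with illustrative values only: the hostnames, IP, script path, and placeholder order are assumptions for the example, not values taken from the attached configuration.

```ini
# Watchdog: two pgpool nodes monitoring each other (illustrative values).
use_watchdog = on
wd_hostname = 'pgpool1.example'   ; this node; 'pgpool2.example' on the peer
delegate_IP = '192.0.2.10'        ; virtual IP brought up on escalation

# Backend failure detection: either periodic health checks...
health_check_period = 10
health_check_max_retries = 3
health_check_retry_delay = 1
# ...or connection errors seen by client sessions.
fail_over_on_backend_error = on

# Script run when a backend is degenerated. Placeholders include
# %d = failed node id, %H = new master host, %P = old primary node id.
failover_command = '/etc/pgpool-II/failover.sh %d %h %p %D %m %H %M %P'
```

The bug discussed below is precisely that after a watchdog escalation, the degeneration path stops reaching `failover_command`.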
|
|
I am facing this issue very often. I do a hard power-off of the (master pgpool + postgres) node to simulate a crash instead of bringing the interface down. The STANDBY pgpool is escalated to master, but failover does not happen and postgres remains in replication slave mode. My environment is CentOS 7.0, Pgpool 3.5.3, Postgresql 9.4.7 |
|
|
I see I am not the only one facing the issue (((. Moreover, it can be considered further proof of the bug. For my production environment it is very critical, because pgpool becomes a significant bottleneck here. |
|
|
I am looking into the issue. Meanwhile, I recently committed a fix for a watchdog heartbeat problem, so could you also try building from the latest source code to verify whether you still face the issue? If the issue still persists, please share the complete pgpool-II log file with debug enabled; that would help solve the problem more quickly. Kind regards |
|
|
Hi, I have verified the patch. It works now and I think I can apply it elsewhere. One more question: when will an RPM package (with the issue solved) be available via pgpool's download link? Thank you! |
|
|
Sorry, but this is not solved and I can reliably reproduce the issue (built from the 3.5.3 tree at 2016-08-04 16:00:00 UTC). Our setup is a 3-way watchdog handling 3 PG backends. wd_escalation works flawlessly; however, after any watchdog failover/escalation has occurred, "degenerate backend request"s have no effect whatsoever (the backend slot stays in state "2" and the failover_command is not called). This detection issue ONLY occurs for backend TCP timeouts that happen AFTER a watchdog standby was promoted: backend failover happens regularly if the original watchdog master is still in charge or the "failed" backend can respond instantly with TCP resets/ICMP "unreachable" messages. My gut feeling tells me this could be some race condition. Same results for both fail_over_on_backend_error and health_check_period based fail-over settings. I attached a tarball with my configuration files and logs (pgpool_47.conf is the original master watchdog that got taken down at the beginning of the test, so there is no corresponding pgpool_47.log). The configuration files are generated by a script and should be absolutely identical, obviously except for the list of watchdog siblings and the wd_priority. Sequence of events in the log:
- the current watchdog master (47) is unplugged
- one of the remaining pgpools wins the election and wd_escalation happens regularly
- I kill one of the backends at the network level ("unplug" the cord)
- the message "setting backend node 1 status to NODE DOWN" appears in the logs of both remaining nodes, however all of the configured nodes stay "up" forever and no scripts are called
- I get fired due to repeated downtime-inducing failures
TIA and best regards F |
|
|
I forgot to mention: not sure if it's clear from the logs, but I'm terminating SSL at pgpool and re-initiating it towards the nodes (sorry, silly security requirements). |
|
|
I built from the latest code. I am not facing the issue when I disable the health check. Failover of the DB (promoting the slave postgres to master) takes around 3 minutes. Is it possible to reduce this time further? Which parameters determine the failover trigger time when the health check is disabled? I will check the behavior with the health check enabled and share the logs. |
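The ~3-minute figure above is worth pinning down. With the health check disabled, failover can only be triggered by a connection error from a client session (fail_over_on_backend_error), so the delay is largely governed by TCP-level connect timeouts rather than any pgpool timer. With the health check enabled, an upper bound on detection time can be estimated from the health-check parameters. A rough sketch of that arithmetic follows; the parameter values are hypothetical, and the exact formula may differ slightly between pgpool versions.

```shell
#!/bin/sh
# Hypothetical health-check settings; substitute your pgpool.conf values.
HEALTH_CHECK_PERIOD=10      # seconds between check rounds
HEALTH_CHECK_TIMEOUT=20     # seconds before a single check gives up
HEALTH_CHECK_MAX_RETRIES=3  # retries before the node is degenerated
HEALTH_CHECK_RETRY_DELAY=1  # seconds between retries

# Rough upper bound: one full period before the next check fires, the
# initial failed check plus each retry timing out, and the delays
# between retries.
WORST_CASE=$((HEALTH_CHECK_PERIOD \
  + (HEALTH_CHECK_MAX_RETRIES + 1) * HEALTH_CHECK_TIMEOUT \
  + HEALTH_CHECK_MAX_RETRIES * HEALTH_CHECK_RETRY_DELAY))
echo "worst-case detection time: ${WORST_CASE}s"
```

With these example values the bound is 93 seconds; shrinking health_check_timeout and health_check_period is the usual lever for faster detection, at the cost of more false positives on a flaky network.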
|
|
Hi, I am facing this issue as well (pgpool 3.5.3 with postgres 9.3.5). Cluster with 2 nodes (pgpool and postgres on each node); pgpool 3.5.3 - trunk (pgpool2-f2b5d17.tar.gz), postgresql 9.3.5. When restarting:
1. primary postgres (with the secondary pgpool) - no postgres failover!
2. primary postgres (with primary pgpool) - pgpool failover + postgresql failover to the 2nd node. But when the faulty (1st) node boots up, its pgpool becomes primary again (demoting the 2nd pgpool) and shows the 1st postgres as primary and the 2nd postgres as secondary (even though the 2nd's status was primary) - should it not be detached? Split brain! Checked this scenario again - pgpool failover + NO postgresql failover: when the faulty node boots up, its pgpool becomes primary again (demoting the 2nd pgpool) and shows both postgres nodes as "up" in their current (old) state!
3. secondary postgres (with primary pgpool) - pgpool failover! But when the faulty node boots up, its pgpool becomes primary and demotes the 2nd pgpool (after 2 min - duplicate IP for 2 min)! But it shows the secondary postgres as "up" - why?
 node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 1.1.1.84 | 5432 | up     | 0.500000  | primary | 801        | true              | 0
 1       | 1.1.1.85 | 5432 | up     | 0.500000  | standby | 911        | false             | 0
4. secondary pgpool & secondary postgres - no failover, as expected. But the faulty postgres is shown as "up" - should it not be detached?
Please check all these scenarios - this is a blocker for pgpool HA! |
|
|
@Muhammad Not sure if this info helps in reproducing the issue, but here it goes: - I tried disabling TLS between pools and backends, to no effect (still no failover_command is run after a NODE DOWN event) - Keep in mind that our monitoring scripts/API/SNMP presentation layer issue quite a few pcp_* commands on a regular basis against the surviving pgpools, especially during a failover. Thank you |
|
|
I can confirm this behavior on Debian jessie (8.5) Pgpool 3.5.3 Postgresql 9.5.4 |
|
|
Hi, thanks for the information. Apparently the problem occurs when the backend node and the pgpool-II node become unreachable at the same instant; then the failover function becomes confused. I am looking into it as a priority and will update on this as soon as I have something. Kind regards, Muhammad Usama |
|
|
I can also confirm this behavior on CentOS 7, Pgpool 3.5.3, Postgresql 9.5.4. This is a quite important issue; any news? |
|
|
I have also created a similar issue: http://www.pgpool.net/mantisbt/view.php?id=251 In my case failover is not performed by the master pgpool. |
|
|
Guys, please update the status of the issue! When will you be able to solve it? It is very critical for our environment!! |
|
|
Hi, sorry for the delayed response. The issue was a little more complex than originally anticipated. Can you please try the attached patch (failover_standby_fix.diff) to see if it solves the issue? Note that you need to apply the patch on the current head of the master branch; if it fixes the problem I will also back-port it to the pgpool-II 3.5 branch. Thanks |
|
|
Yes, the issue is fixed. Quick tests show the problem has disappeared. Can you please apply it to the 3.5 branch? Another problem appears on the new build "pgpool-II version 3.5.4 (ekieboshi)". It seems something has changed in authentication. The new build returns an error: "Caused by: org.postgresql.util.PSQLException: ERROR: MD5 authentication is unsupported in replication and master-slave modes." Could this be because the version is an alpha? |
|
|
Sorry for the mistake - the auth error is seen in "pgpool-II version 3.6-alpha1 (subaruboshi)". |
|
|
Regarding the auth problem: yes, I recently changed the auth module. Please file it as a new bug report. |
|
|
Thanks for fixing this issue. Our customer is also facing it. When will this patch be applied to the STABLE versions? I also think this is a critical problem. Should there be a minor version release for it (for example 3.5.5)? |
|
|
There is one remark regarding the fix. I now see that in an environment with 3 Pgpool installations, the failover_command is executed by both surviving Pgpool instances. Should that happen, or should only one Pgpool instance trigger the failover_command? |
|
|
Hi, thanks for taking the time to test. No, the failover_command should be executed by only one node. Can you share the steps to reproduce and the log files for that scenario? |
|
|
Please see the attached file. The environment consists of: Server 1) Head Pgpool server Server 2) Pgpool A + PostgreSQL Server 3) Pgpool A + PostgreSQL. Server 3 is killed, and from the Head and A servers' logs one can see that the failover_command is executed twice. |
|
|
Many thanks. Can you please try out the revised attached patch (failover_standby_fix_v2)? |
|
|
Hi, the patch doesn't solve the problem. The failover_command is still triggered by both surviving nodes. In my scenario the Pgpool master node was killed. See the logs:
Server 1:
LOG: server socket of Linux_srv-2181113.aqa.int.zone_5432 is closed
LOG: remote node "Linux_srv-2181113.aqa.int.zone_5432" is not reachable
DETAIL: marking the node as lost
LOG: remote node "Linux_srv-2181113.aqa.int.zone_5432" is lost
LOG: watchdog cluster has lost the coordinator node
LOG: watchdog node state changed from [STANDBY] to [JOINING]
LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
LOG: watchdog node state changed from [INITIALIZING] to [STANDING FOR MASTER]
LOG: watchdog node state changed from [STANDING FOR MASTER] to [PARTICIPATING IN ELECTION]
LOG: watchdog node state changed from [PARTICIPATING IN ELECTION] to [INITIALIZING]
LOG: watchdog node state changed from [INITIALIZING] to [STANDBY]
LOG: successfully joined the watchdog cluster as standby node
DETAIL: our join coordinator request is accepted by cluster leader node "Linux_srv-2181107.aqa.int.zone_5432"
LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
DETAIL: select() system call failed with an error "Interrupted system call"
ERROR: failed to make persistent db connection
DETAIL: connection to host:"b.db.node:15432" failed
LOG: received degenerate backend request for node_id: 1 from pid [13113]
LOG: connect_inet_domain_socket: select() interrupted by certain signal. retrying...
LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
DETAIL: select() system call failed with an error "Interrupted system call"
ERROR: failed to make persistent db connection
DETAIL: connection to host:"b.db.node:15432" failed
LOG: setting backend node 1 status to NODE DOWN
LOG: received degenerate backend request for node_id: 1 from pid [13111]
LOG: new IPC connection received
LOG: new IPC connection received
LOG: processing sync request from IPC socket
LOG: sync request from IPC socket is forwarded to master watchdog node "Linux_srv-2181107.aqa.int.zone_5432"
DETAIL: waiting for the reply from master node...
LOG: starting degeneration. shutdown host b.db.node(15432)
LOG: Restart all children
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: execute command: /etc/pgpool-II/failover.sh 1 b.db.node 15432 /var/lib/pgsql/9.5/data 0 a.db.node 0 0 15432 /var/lib/pgsql/9.5/data
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
Server 2:
LOG: read from socket failed, remote end closed the connection
LOG: server socket of Linux_srv-2181113.aqa.int.zone_5432 is closed
LOG: remote node "Linux_srv-2181113.aqa.int.zone_5432" is not reachable
DETAIL: marking the node as lost
LOG: remote node "Linux_srv-2181113.aqa.int.zone_5432" is lost
LOG: watchdog cluster has lost the coordinator node
LOG: watchdog node state changed from [STANDBY] to [JOINING]
LOG: watchdog node state changed from [JOINING] to [INITIALIZING]
LOG: watchdog node state changed from [INITIALIZING] to [STANDING FOR MASTER]
LOG: watchdog node state changed from [STANDING FOR MASTER] to [MASTER]
LOG: I am announcing my self as master/coordinator watchdog node
LOG: I am the cluster leader node
DETAIL: our declare coordinator message is accepted by all nodes
LOG: I am the cluster leader node. Starting escalation process
LOG: escalation process started with PID:9137
LOG: watchdog: escalation started
WARNING: interface is ignored: Operation not permitted
LOG: failback event detected
DETAIL: restarting myself
LOG: selecting backend connection
DETAIL: failback event detected, discarding existing connections
LOG: child process with pid: 8806 exits with status 256
LOG: fork a new child process with pid: 9141
LOG: watchdog bringing up delegate IP, 'if_up_cmd' succeeded
LOG: watchdog escalation process with pid: 9137 exit with SUCCESS.
LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
DETAIL: select() system call failed with an error "Interrupted system call"
ERROR: failed to make persistent db connection
DETAIL: connection to host:"b.db.node:15432" failed
LOG: trying connecting to PostgreSQL server on "b.db.node:15432" by INET socket
DETAIL: timed out. retrying...
LOG: failed to connect to PostgreSQL server on "b.db.node:15432" using INET socket
DETAIL: select() system call failed with an error "Interrupted system call"
ERROR: failed to make persistent db connection
DETAIL: connection to host:"b.db.node:15432" failed
LOG: setting backend node 1 status to NODE DOWN
LOG: received degenerate backend request for node_id: 1 from pid [8799]
LOG: new IPC connection received
LOG: new IPC connection received
LOG: processing sync request from IPC socket
LOG: local pgpool-II node "Linux_srv-2181107.aqa.int.zone_5432" is requesting to become a lock holder
LOG: local pgpool-II node "Linux_srv-2181107.aqa.int.zone_5432" is the lock holder
LOG: starting degeneration. shutdown host b.db.node(15432)
LOG: Restart all children
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: execute command: /etc/pgpool-II/failover.sh 1 b.db.node 15432 /var/lib/pgsql/9.5/data 0 a.db.node 0 0 15432 /var/lib/pgsql/9.5/data
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3
LOG: child process received shutdown request signal 3 |
|
|
@supp_k: are failover_command's %d and %P set to the same values in both pgpools? You may be experiencing a double-detection problem. Speaking of which, is there any particular reason all the pgpools perform the health checks to begin with, and not just the current watchdog master? |
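For anyone untangling which node should actually promote: the decision is usually made inside the failover script itself, by comparing the failed node id (%d) with the old primary id (%P). Below is a minimal sketch of that guard. The argument positions assume a placeholder order of `%d %h %p %D %m %H %M %P`, and the promotion command in the comment is illustrative; both are assumptions to check against your own pgpool.conf and data directory, not the reporter's attached script.

```shell
#!/bin/sh
# failover.sh sketch; positions assume
#   failover_command = '.../failover.sh %d %h %p %D %m %H %M %P'
FAILED_NODE_ID=$1     # %d: id of the node that was degenerated
NEW_MASTER_HOST=$6    # %H: host of the new master candidate
OLD_PRIMARY_ID=$8     # %P: id of the old primary node

# Promote only when the node that failed was the primary; if a
# standby died there is nothing to promote.
should_promote() {
    failed=$1
    old_primary=$2
    if [ "$failed" = "$old_primary" ]; then echo yes; else echo no; fi
}

if [ "$(should_promote "$FAILED_NODE_ID" "$OLD_PRIMARY_ID")" = yes ]; then
    # Illustrative promotion step, e.g.:
    #   ssh "$NEW_MASTER_HOST" 'pg_ctl promote -D /var/lib/pgsql/9.5/data'
    echo "primary (node $FAILED_NODE_ID) lost; promote $NEW_MASTER_HOST"
else
    echo "standby (node $FAILED_NODE_ID) lost; no promotion needed"
fi
```

Note that this guard only decides what one node does; it cannot by itself stop two pgpool instances from both running the script, which is the duplication being reported here. With the watchdog behaving correctly, only the coordinator (the "lock holder" seen in the logs) should end up invoking it.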
|
|
z0rb1n0, yes, the commands are identical, as can be seen from the log information in my previous message: identical commands are triggered. |
|
|
Guys, do you need any additional information? When will you be able to fix the issue? |
|
|
I think this is a quite important issue. When will you be able to fix the issue? |
|
|
Guys, kindly asking you to review this problem. Without this issue fixed we can't go live, and even our automatic tests are failing; full verification of our solution, which is built on top of Pgpool, can't be performed. If you need any assistance or help with testing, we are ready and willing to help. We would also kindly ask you to refresh the yum repository version, because our customers need a fresh official build. Best Regards, Sergey. |
|
|
|
Hi, first of all sorry for the late reply. Basically the issue was more deeply rooted and needed a design change, so it took me a long time to fix. Can you please try out the latest attached patch (wd_rewamp_failover.diff) to see if it behaves as expected? I am not finished testing it yet, and the patch might still contain some extra debug info, but I want to share this early version with you to make sure we can get the fix out as soon as possible. Best regards |
|
|
Hi Muhammad, we have an environment that consists of 3 pgpool nodes. A quick verification on CentOS 6.8 x86_64 reveals that the patch doesn't solve the problem: the failover action is only occasionally performed by one node. We tested several cases, including a full shutdown of a pgpool node (poweroff). When one node disappears it works, but when all three pgpool nodes are active and one postgres backend goes down, it doesn't work. |
|
|
Hi, thanks for getting back on this. I am also testing with a three-node cluster using CentOS on OpenStack, but I am not able to reproduce the mentioned problem. Do you mean that when one of the backend PostgreSQL servers goes down, the pgpool-II failover does not happen? Or does the failover actually happen, but not on all pgpool-II nodes? Also, can you please share the pgpool.confs and logs of all nodes for the failing scenario? |
|
|
Hi Muhammad, we have 3 servers: 1) Pgpool 2) Pgpool Master + Postgres Master A 3) Pgpool + Postgres B. Emulated cases: 1) Power off server 2 or server 3. Result: a new Pgpool master is elected; failover performed OK. 2) Kill Postgres A or B. Result: failover not performed. |
|
|
I have also tested the same scenario, but somehow it is working on my side. Can you please share the pgpool-II log and configuration files? Thanks and regards! |
|
|
If I kill the master postgres process, I see that the failover works. But the pgpool cluster doesn't answer any SQL queries, despite the failover being complete. Please see the attached log files. |
|
|
Hi, I have looked at the attached log and everything looks just fine; I also tested the said scenario and it is working at my end. What behavior are you experiencing after the failover? Do new connections get stuck, or are connections successful but query results never come back? It is also strange that you are seeing client connection issues after this patch, since the patch does not touch the area of client connection and query handling. Thanks, Regards |
|
|
Hi, maybe these problems are issues with my build environment. OK, I take it that the fix for problem 0000227 is working on your side. Can you now share the patched 3.5.X version or commit it to the corresponding branch? |
|
|
Thanks for the verification. I will commit it to the master branch after some more testing, and will discuss back-porting the fix with the pgpool-II development group and update you accordingly. Best regards! |
|
|
Do you have an estimate of when the patched build will be available? We need it, since this impacts our environment. |
|
|
I have checked-in the fix to the master branch, and we have decided to port it to the 3.5 branch as well. I will try to push it to the 3.5 branch by the end of the week. |
|
|
Our tests revealed no problems with the new failover mechanism. I think it is OK now. |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2016-07-30 09:19 | supp_k | New Issue | |
| 2016-07-30 09:19 | supp_k | File Added: pgpool.conf | |
| 2016-08-01 21:25 | guptesh.cg4@gmail.com | Tag Attached: master slave | |
| 2016-08-01 21:25 | guptesh.cg4@gmail.com | Tag Attached: streaming replication | |
| 2016-08-01 23:42 | guptesh.cg4@gmail.com | Note Added: 0000952 | |
| 2016-08-02 00:09 | supp_k | Note Added: 0000954 | |
| 2016-08-02 10:28 | t-ishii | Assigned To | => Muhammad Usama |
| 2016-08-02 10:28 | t-ishii | Status | new => assigned |
| 2016-08-02 13:43 | t-ishii | Tag Attached: watchdog | |
| 2016-08-03 05:23 | Muhammad Usama | Note Added: 0000959 | |
| 2016-08-03 20:57 | supp_k | Note Added: 0000962 | |
| 2016-08-05 19:32 | z0rb1n0 | File Added: pgpool_no_failover_fabio.tar.bz2 | |
| 2016-08-05 19:32 | z0rb1n0 | Note Added: 0000968 | |
| 2016-08-05 21:44 | z0rb1n0 | Note Added: 0000969 | |
| 2016-08-10 16:09 | guptesh.cg4@gmail.com | Note Added: 0000977 | |
| 2016-08-19 04:03 | cohavisi | Note Added: 0001015 | |
| 2016-08-23 20:21 | z0rb1n0 | Note Added: 0001020 | |
| 2016-08-24 23:20 | tscheuren | Note Added: 0001021 | |
| 2016-08-25 23:40 | Muhammad Usama | Note Added: 0001024 | |
| 2016-09-21 00:53 | gabrimonfa | Note Added: 0001073 | |
| 2016-09-27 21:39 | supp_k | Note Added: 0001088 | |
| 2016-09-27 22:09 | supp_k | Note Added: 0001089 | |
| 2016-09-27 23:23 | Muhammad Usama | File Added: failover_standby_fix.diff | |
| 2016-09-27 23:30 | Muhammad Usama | Status | assigned => feedback |
| 2016-09-27 23:30 | Muhammad Usama | Note Added: 0001090 | |
| 2016-09-28 00:46 | supp_k | Note Added: 0001092 | |
| 2016-09-28 00:46 | supp_k | Status | feedback => assigned |
| 2016-09-28 01:15 | supp_k | Note Added: 0001093 | |
| 2016-09-28 09:44 | t-ishii | Note Added: 0001096 | |
| 2016-09-28 15:49 | Dang Minh Huong | Note Added: 0001104 | |
| 2016-09-28 16:09 | supp_k | Note Added: 0001105 | |
| 2016-09-28 17:15 | Muhammad Usama | Note Added: 0001106 | |
| 2016-09-28 22:01 | supp_k | File Added: data.tar.gz | |
| 2016-09-28 22:01 | supp_k | Note Added: 0001107 | |
| 2016-09-28 22:37 | Muhammad Usama | File Added: failover_standby_fix_v2.diff | |
| 2016-09-28 22:37 | Muhammad Usama | Note Added: 0001108 | |
| 2016-09-30 00:11 | supp_k | Note Added: 0001111 | |
| 2016-09-30 01:24 | z0rb1n0 | Note Added: 0001112 | |
| 2016-09-30 16:23 | supp_k | Note Added: 0001114 | |
| 2016-10-05 20:26 | supp_k | Note Added: 0001116 | |
| 2016-10-10 20:00 | gabrimonfa | Note Added: 0001117 | |
| 2016-10-13 18:07 | supp_k | Note Added: 0001118 | |
| 2016-11-01 00:33 | Muhammad Usama | File Added: wd_rewamp_failover.diff | |
| 2016-11-01 00:43 | Muhammad Usama | Status | assigned => feedback |
| 2016-11-01 00:43 | Muhammad Usama | Note Added: 0001143 | |
| 2016-11-01 20:01 | supp_k | Note Added: 0001145 | |
| 2016-11-01 20:01 | supp_k | Status | feedback => assigned |
| 2016-11-01 22:29 | Muhammad Usama | Note Added: 0001146 | |
| 2016-11-02 17:39 | supp_k | Note Added: 0001150 | |
| 2016-11-02 19:36 | Muhammad Usama | Note Added: 0001151 | |
| 2016-11-10 21:16 | supp_k | File Added: server_1 | |
| 2016-11-10 21:16 | supp_k | Note Added: 0001162 | |
| 2016-11-10 21:16 | supp_k | File Added: server_1-2 | |
| 2016-11-10 21:16 | supp_k | File Added: server_1-3 | |
| 2016-11-10 21:17 | supp_k | File Added: server_2 | |
| 2016-11-10 22:47 | Muhammad Usama | Note Added: 0001163 | |
| 2016-11-10 22:57 | supp_k | Note Added: 0001164 | |
| 2016-11-11 00:12 | Muhammad Usama | Note Added: 0001165 | |
| 2016-11-11 17:11 | supp_k | Note Added: 0001167 | |
| 2016-11-15 06:42 | Muhammad Usama | Note Added: 0001170 | |
| 2016-11-23 17:55 | supp_k | Note Added: 0001187 | |
| 2017-08-29 09:34 | pengbo | Status | assigned => closed |