View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000406 | Pgpool-II | Bug | public | 2018-06-20 02:36 | 2018-08-06 09:59 |
| Reporter | ancoron | Assigned To | t-ishii | ||
| Priority | normal | Severity | minor | Reproducibility | always |
| Status | closed | Resolution | fixed | ||
| Platform | Linux | OS | Debian | OS Version | 9.4 |
| Product Version | 3.7.4 | ||||
| Summary | 0000406: failover called with old-primary = -1 | ||||
**Description**

The failover_command placeholder "%P" is set to -1 when the failed backend is the primary. The setup: three machines, each running PostgreSQL 10 and Pgpool-II, with streaming replication configured and working in PostgreSQL and Pgpool configured appropriately as well. A virtual IP is used for the Pgpool master; watchdog and health checks are enabled.

Testing failover, the situation before is as follows (pcp_node_info):

```
pg-cluster-01.dev.mycorp.net 5432 2 0.333333 up primary
pg-cluster-02.dev.mycorp.net 5432 2 0.333333 up standby
pg-cluster-03.dev.mycorp.net 5432 2 0.333333 up standby
```

State from the Pgpool master log:

```
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [18-1] LOG: find_primary_node_repeatedly: waiting for finding a primary node
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [19-1] LOG: find_primary_node: checking backend no 0
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [20-1] LOG: find_primary_node: primary node id is 0
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [21-1] LOG: failover: set new primary node: 0
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [22-1] LOG: failover: set new master node: 0
```

During node-by-node setup of the cluster, I saw failback of the standbys being called correctly, e.g.:

```
2018-06-19T18:34:32+0200 pg-cluster-01 pgpool[42714]: [30-1] LOG: starting fail back. reconnect host pg-cluster-03.dev.mycorp.net(5432)
2018-06-19T18:34:32+0200 pg-cluster-01 pgpool[42714]: [31-1] LOG: Node 0 is not down (status: 2)
2018-06-19T18:34:32+0200 pg-cluster-01 pgpool[42714]: [32-1] LOG: execute command: /var/lib/postgresql/10/main/failback.sh --failed-node 2 --failed-host pg-cluster-03.dev.mycorp.net --new-master-node 0 --new-master-host pg-cluster-01.dev.mycorp.net --old-master-node 0 --old-primary-node 0
```

Executed at node pg-cluster-01.dev.mycorp.net:

```
sudo systemctl stop postgresql
```

Failover was expected to be called with old-primary = 0, but:

```
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [565-1] LOG: Pgpool-II parent process has received failover request
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [566-1] LOG: starting quarantine. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [567-1] LOG: Restart all children
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [568-1] LOG: failover: set new primary node: -1
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [569-1] LOG: failover: set new master node: 1
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [570-1] LOG: quarantine done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
...
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [875-1] LOG: Pgpool-II parent process has received failover request
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [876-1] LOG: starting degeneration. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [877-1] LOG: Do not restart children because we are switching over node id 0 host: pg-cluster-01.dev.mycorp.net port: 5432 and we are in streaming replication mode
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [878-1] LOG: execute command: /var/lib/postgresql/10/main/failover.sh --failed-node 0 --failed-host pg-cluster-01.dev.mycorp.net --new-master-node 1 --new-master-host pg-cluster-02.dev.mycorp.net --old-master-node 1 --old-primary-node -1
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: quarantine done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
...
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [882-1] LOG: failover done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
```

...leading to:

```
pg-cluster-01.dev.mycorp.net 5432 3 0.333333 quarantine standby
pg-cluster-02.dev.mycorp.net 5432 2 0.333333 up standby
pg-cluster-03.dev.mycorp.net 5432 2 0.333333 up standby
```

This means that failover scripts written in accordance with the example scripts provided in the documentation do not work as expected.

**Steps To Reproduce**

Set up a cluster with streaming replication. Stop the primary backend.

**Additional Information**

pgpool.conf entries:

```
follow_master_command = '/var/lib/postgresql/10/main/follow-master.sh --failed-node %d --failed-host %h --new-master-node %m --new-master-host %H --old-master-node %M --old-primary-node %P'
failover_command = '/var/lib/postgresql/10/main/failover.sh --failed-node %d --failed-host %h --new-master-node %m --new-master-host %H --old-master-node %M --old-primary-node %P'
failback_command = '/var/lib/postgresql/10/main/failback.sh --failed-node %d --failed-host %h --new-master-node %m --new-master-host %H --old-master-node %M --old-primary-node %P'
```

**Tags**: consensus, failover, quarantine, streaming replication
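For context, a minimal, hypothetical failover script compatible with the failover_command above might look like the following sketch. The option names mirror the reporter's configuration; the `decide_failover` function name, the promotion command, and the guard against an old-primary-node of -1 are illustrative assumptions (the guard is only a workaround for the behavior described in this report, not part of the official Pgpool-II sample scripts):

```shell
#!/bin/bash
# Hypothetical failover.sh sketch. In the real script the last line
# would be:  decide_failover "$@"

decide_failover() {
    local failed_node="" failed_host="" new_master_node="" new_master_host=""
    local old_master_node="" old_primary_node=""

    # Parse the arguments produced by the failover_command template.
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --failed-node)      failed_node="$2";      shift 2 ;;
            --failed-host)      failed_host="$2";      shift 2 ;;
            --new-master-node)  new_master_node="$2";  shift 2 ;;
            --new-master-host)  new_master_host="$2";  shift 2 ;;
            --old-master-node)  old_master_node="$2";  shift 2 ;;
            --old-primary-node) old_primary_node="$2"; shift 2 ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
    done

    # Workaround for the bug in this report: when %P arrives as -1 even
    # though a primary existed, fall back to treating the failed node as
    # the old primary.
    if [ "$old_primary_node" = "-1" ]; then
        old_primary_node="$failed_node"
    fi

    if [ "$failed_node" = "$old_primary_node" ]; then
        echo "primary failed: promoting node $new_master_node ($new_master_host)"
        # e.g. ssh "$new_master_host" pg_ctlcluster 10 main promote
    else
        echo "standby $failed_node failed: nothing to promote"
    fi
}
```

Such a script only promotes a standby when the failed node was the primary, which is exactly the decision that a `-1` in `%P` breaks.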
|
|
Reading what I just posted, it really looks like the new quarantine functionality gets in the way here, and the final failover after reaching consensus will never see the original values for old-master and old-primary.
|
|
> Reading what I just posted, it really looks like the new quarantine functionality gets in the way here, and the final failover after reaching consensus will never see the original values for old-master and old-primary.

Assuming that you have enabled failover_when_quorum_exists and failover_require_consensus (they are enabled by default), I suspect there is something wrong with the primary node detection mechanism in the new quarantine functionality. I will look into this...
|
|
Yes, both options are active:

```
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
```

Also, the delay between quarantine and failover is a full health check failure cycle, as configured:

```
health_check_period = 30
health_check_timeout = 20
health_check_max_retries = 10
health_check_retry_delay = 1
connect_timeout = 10000
```

To a certain degree I can understand the logic of quarantine resetting the primary, as the failed backend apparently was the primary. On the other hand, it is questionable why quarantine actually sets the old primary and the new master at all: as long as there is no consensus (if enabled), that should not take effect, since only the failover itself is responsible for promoting a standby, right?
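A back-of-the-envelope reading of those health check settings explains the delay between the backend stopping and the failover being requested. The formula below is an approximation (not pgpool's exact internal accounting): the initial probe may block for up to health_check_timeout, and each of the health_check_max_retries retries adds health_check_retry_delay plus another timeout.

```shell
#!/bin/bash
# Approximate worst-case failure-detection window implied by the
# health check settings quoted above.
health_check_timeout=20
health_check_max_retries=10
health_check_retry_delay=1

worst_case=$(( health_check_timeout + health_check_max_retries * (health_check_timeout + health_check_retry_delay) ))
echo "worst-case detection window: ~${worst_case}s"   # ~230s
```

A window of roughly 230 seconds is consistent with the logs in the Description, where the backend was stopped shortly after 18:34:32 and quarantine started at 18:38:25.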
|
|
> Yes, both options are active:

Is it possible for you to try to reproduce the problem with failover_when_quorum_exists = off and failover_require_consensus = off?

> On the other hand, it would be questionable why quarantine actually sets old primary and new master because as long as there is no
> consensus (if active), it wouldn't come into effect as only the failover is responsible to promote a standby, or?

I think that's a reasonable argument. I have already pinged the author of the functionality and am expecting a comment from him.
|
|
Without consensus, the failover gets called with the correct old primary = 0:

```
2018-06-20T22:20:42+0200 pgpool[26552]: [497-1] LOG: execute command: /var/lib/postgresql/10/main/failover.sh --failed-node 0 --failed-host pg-cluster-01.dev.mycorp.net --new-master-node 1 --new-master-host pg-cluster-02.dev.mycorp.net --old-master-node 0 --old-primary-node 0
2018-06-20T22:20:43+0200 pgpool[26552]: [498-1] LOG: find_primary_node_repeatedly: waiting for finding a primary node
2018-06-20T22:20:43+0200 pgpool[26552]: [499-1] LOG: find_primary_node: checking backend no 0
2018-06-20T22:20:43+0200 pgpool[26552]: [500-1] LOG: find_primary_node: checking backend no 1
2018-06-20T22:20:43+0200 pgpool[26552]: [501-1] LOG: find_primary_node: primary node id is 1
2018-06-20T22:20:43+0200 pgpool[26552]: [502-1] LOG: starting follow degeneration. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-20T22:20:43+0200 pgpool[26552]: [503-1] LOG: starting follow degeneration. shutdown host pg-cluster-03.dev.mycorp.net(5432)
2018-06-20T22:20:43+0200 pgpool[26552]: [504-1] LOG: failover: 2 follow backends have been degenerated
2018-06-20T22:20:43+0200 pgpool[26552]: [505-1] LOG: failover: set new primary node: 1
2018-06-20T22:20:43+0200 pgpool[26552]: [506-1] LOG: failover: set new master node: 1
2018-06-20T22:20:43+0200 pgpool[31890]: [505-1] LOG: start triggering follow command.
2018-06-20T22:20:43+0200 pgpool[31890]: [506-1] LOG: execute command: /var/lib/postgresql/10/main/follow-master.sh --failed-node 0 --failed-host pg-cluster-01.dev.mycorp.net --new-master-node 1 --new-master-host pg-cluster-02.dev.mycorp.net --old-master-node 0 --old-primary-node 0
```
|
|
That's what I expected. I am sure there's something wrong in the code path taken when failover_when_quorum_exists = on and/or failover_require_consensus = on. Will look into this.
|
|
ancoron, here is a proposed patch to fix the problem. It just prevents the primary node from being set to -1 in the quarantine state. As far as I know, we don't need to set it to -1 in this case, as you said. Can you please try it out? All regression tests, including the watchdog tests, have passed here.
|
|
Sorry for the late reply. I just tested with the patch, but got the following behavior for a primary failover:

```
2018-06-27T13:54:21+0200 pgpool[15208]: [121-1] LOG: failed to connect to PostgreSQL server on "pg-cluster-01.dev.mycorp.net:5432", getsockopt() detected error "Connection refused"
2018-06-27T13:54:21+0200 pgpool[15208]: [122-1] LOG: received degenerate backend request for node_id: 0 from pid [15208]
2018-06-27T13:54:21+0200 pgpool[14788]: [71-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [72-1] LOG: watchdog received the failover command from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [73-1] LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [74-1] LOG: failover requires the majority vote, waiting for consensus
2018-06-27T13:54:21+0200 pgpool[14788]: [74-2] DETAIL: failover request noted
2018-06-27T13:54:21+0200 pgpool[14788]: [75-1] LOG: failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "pg-cluster-01.dev.mycorp.net:5433 Linux pg-cluster-01" is queued, waiting for the confirmation from other nodes
2018-06-27T13:54:21+0200 pgpool[14785]: [145-1] LOG: child process with pid: 15208 exits with status 256
2018-06-27T13:54:21+0200 pgpool[14785]: [146-1] LOG: fork a new child process with pid: 15265
2018-06-27T13:54:21+0200 pgpool[15193]: [98-1] LOG: new connection received
2018-06-27T13:54:21+0200 pgpool[15193]: [98-2] DETAIL: connecting host=192.168.123.45 port=49736
2018-06-27T13:54:21+0200 pgpool[15193]: [99-1] LOG: failed to connect to PostgreSQL server on "pg-cluster-01.dev.mycorp.net:5432", getsockopt() detected error "Connection refused"
2018-06-27T13:54:21+0200 pgpool[15193]: [100-1] LOG: received degenerate backend request for node_id: 0 from pid [15193]
2018-06-27T13:54:21+0200 pgpool[14788]: [76-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [77-1] LOG: watchdog received the failover command from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [78-1] LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [79-1] LOG: Duplicate failover request from "pg-cluster-01.dev.mycorp.net:5433 Linux pg-cluster-01" node
2018-06-27T13:54:21+0200 pgpool[14788]: [79-2] DETAIL: request ignored
2018-06-27T13:54:21+0200 pgpool[14788]: [80-1] LOG: failover requires the majority vote, waiting for consensus
2018-06-27T13:54:21+0200 pgpool[14788]: [80-2] DETAIL: failover request noted
2018-06-27T13:54:21+0200 pgpool[14785]: [147-1] LOG: Pgpool-II parent process has received failover request
2018-06-27T13:54:21+0200 pgpool[14788]: [81-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [82-1] LOG: received the failover indication from Pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [83-1] LOG: watchdog is informed of failover end by the main process
2018-06-27T13:54:21+0200 pgpool[14785]: [148-1] LOG: starting quarantine. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-27T13:54:21+0200 pgpool[14785]: [149-1] LOG: Restart all children
2018-06-27T13:54:21+0200 pgpool[14785]: [150-1] LOG: failover: set new primary node: 1868963872
2018-06-27T13:54:21+0200 pgpool[14785]: [151-1] LOG: failover: set new master node: 1
2018-06-27T13:54:21+0200 pgpool[14788]: [84-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [85-1] LOG: received the failover indication from Pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [86-1] LOG: watchdog is informed of failover start by the main process
2018-06-27T13:54:21+0200 pgpool[14785]: [152-1] LOG: quarantine done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-27T13:54:22+0200 pgpool[15115]: [51-1] LOG: restart request received in pcp child process
2018-06-27T13:54:22+0200 pgpool[14785]: [153-1] LOG: PCP child 15115 exits with status 0 in failover()
2018-06-27T13:54:22+0200 pgpool[14785]: [154-1] LOG: fork a new PCP child pid 15366 in failover()
2018-06-27T13:54:22+0200 pgpool[14785]: [155-1] LOG: child process with pid: 14813 exits with status 0
2018-06-27T13:54:22+0200 pgpool[14785]: [156-1] LOG: child process with pid: 14813 exited with success and will not be restarted
...
2018-06-27T13:54:22+0200 pgpool[14785]: [455-1] WARNING: worker child process with pid: 15367 was terminated by segmentation fault
2018-06-27T13:54:22+0200 pgpool[14785]: [456-1] LOG: fork a new worker child process with pid: 15417
2018-06-27T13:54:22+0200 pgpool[14785]: [457-1] WARNING: worker child process with pid: 15417 was terminated by segmentation fault
2018-06-27T13:54:22+0200 pgpool[14785]: [458-1] LOG: fork a new worker child process with pid: 15418
2018-06-27T13:54:22+0200 pgpool[14785]: [459-1] WARNING: worker child process with pid: 15418 was terminated by segmentation fault
2018-06-27T13:54:22+0200 pgpool[14785]: [460-1] LOG: fork a new worker child process with pid: 15419
...
2018-06-27T13:54:26+0200 pgpool[14785]: [1033-1] WARNING: worker child process with pid: 15707 was terminated by segmentation fault
2018-06-27T13:54:26+0200 pgpool[14785]: [1034-1] LOG: fork a new worker child process with pid: 15708
2018-06-27T13:54:26+0200 pgpool[14788]: [87-1] LOG: watchdog received the failover command from remote pgpool-II node "pg-cluster-02.dev.mycorp.net:5433 Linux pg-cluster-02"
2018-06-27T13:54:26+0200 pgpool[14788]: [88-1] LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from pg-cluster-02.dev.mycorp.net:5433 Linux pg-cluster-02
2018-06-27T13:54:26+0200 pgpool[14788]: [89-1] LOG: we have got the consensus to perform the failover
2018-06-27T13:54:26+0200 pgpool[14788]: [89-2] DETAIL: 2 node(s) voted in the favor
2018-06-27T13:54:26+0200 pgpool[14788]: [90-1] LOG: received degenerate backend request for node_id: 0 from pid [14788]
2018-06-27T13:54:26+0200 pgpool[14788]: [91-1] LOG: new IPC connection received
2018-06-27T13:54:26+0200 pgpool[14788]: [92-1] LOG: received the failover indication from Pgpool-II on IPC interface
2018-06-27T13:54:26+0200 pgpool[14788]: [93-1] LOG: watchdog is informed of failover end by the main process
2018-06-27T13:54:26+0200 systemd[1]: pgpool2.service: Main process exited, code=killed, status=11/SEGV
2018-06-27T13:54:26+0200 pgpool[14788]: [94-1] LOG: Watchdog is shutting down
2018-06-27T13:54:26+0200 systemd[1]: pgpool2.service: Unit entered failed state.
2018-06-27T13:54:26+0200 systemd[1]: pgpool2.service: Failed with result 'signal'.
```

Note the implausible value in "failover: set new primary node: 1868963872" during quarantine, which looks like an uninitialized value and would fit the segmentation faults that follow. So, sorry, but that patch makes it worse. After VIP takeover, I could see that indeed no actual failover occurred and the node is left in the quarantine state:

```
pg-cluster-01.dev.mycorp.net 5432 3 0.333333 quarantine standby
pg-cluster-02.dev.mycorp.net 5432 2 0.333333 up standby
pg-cluster-03.dev.mycorp.net 5432 2 0.333333 up standby
```
|
|
Thank you for the testing. I think we need help from Usama, who is the authority on the watchdog code. I will ping him.
|
|
Hi, I have pushed the fix for it. You can test it by building from the head of the 3.7 branch: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=1648cb8c552a19b2f90de1fab120929dd5be06e4 Thanks
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2018-06-20 02:36 | ancoron | New Issue | |
| 2018-06-20 02:36 | ancoron | Tag Attached: failover | |
| 2018-06-20 02:36 | ancoron | Tag Attached: streaming replication | |
| 2018-06-20 02:47 | ancoron | Note Added: 0002050 | |
| 2018-06-20 10:21 | t-ishii | Note Added: 0002051 | |
| 2018-06-21 03:45 | ancoron | Note Added: 0002052 | |
| 2018-06-21 12:44 | t-ishii | Assigned To | => t-ishii |
| 2018-06-21 12:44 | t-ishii | Status | new => assigned |
| 2018-06-21 12:48 | t-ishii | Note Added: 0002055 | |
| 2018-06-21 15:36 | ancoron | Note Added: 0002067 | |
| 2018-06-25 10:33 | t-ishii | Note Added: 0002071 | |
| 2018-06-25 11:29 | t-ishii | File Added: watchdog.diff | |
| 2018-06-25 11:29 | t-ishii | Note Added: 0002073 | |
| 2018-06-25 11:29 | t-ishii | Status | assigned => feedback |
| 2018-06-27 21:43 | ancoron | Note Added: 0002074 | |
| 2018-06-27 21:43 | ancoron | Status | feedback => assigned |
| 2018-06-28 09:18 | t-ishii | Note Added: 0002075 | |
| 2018-07-02 16:20 | Muhammad Usama | Status | assigned => resolved |
| 2018-07-02 16:20 | Muhammad Usama | Resolution | open => fixed |
| 2018-07-02 16:20 | Muhammad Usama | Note Added: 0002076 | |
| 2018-07-05 03:14 | ancoron | Tag Attached: consensus | |
| 2018-07-05 03:14 | ancoron | Tag Attached: quarantine | |
| 2018-08-06 09:59 | administrator | Status | resolved => closed |