View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000406 | Pgpool-II | Bug | public | 2018-06-20 02:36 | 2018-08-06 09:59 |
| Reporter | ancoron | Assigned To | t-ishii | ||
| Priority | normal | Severity | minor | Reproducibility | always |
| Status | closed | Resolution | fixed | ||
| Platform | Linux | OS | Debian | OS Version | 9.4 |
| Product Version | 3.7.4 | ||||
| Summary | 0000406: failover called with old-primary = -1 | ||||
**Description**

The failover_command placeholder "%P" is set to -1 when the failed backend is the primary. The setup: three machines, each running PostgreSQL 10 and Pgpool-II, with streaming replication configured and working in PostgreSQL and Pgpool configured appropriately as well. A virtual IP is used for the Pgpool master; watchdog and health checks are enabled.

Testing failover, the situation before is as follows (pcp_node_info):

```
pg-cluster-01.dev.mycorp.net 5432 2 0.333333 up primary
pg-cluster-02.dev.mycorp.net 5432 2 0.333333 up standby
pg-cluster-03.dev.mycorp.net 5432 2 0.333333 up standby
```

State from the Pgpool master log:

```
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [18-1] LOG: find_primary_node_repeatedly: waiting for finding a primary node
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [19-1] LOG: find_primary_node: checking backend no 0
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [20-1] LOG: find_primary_node: primary node id is 0
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [21-1] LOG: failover: set new primary node: 0
2018-06-19T18:33:42+0200 pg-cluster-01 pgpool[42714]: [22-1] LOG: failover: set new master node: 0
```

During node-by-node setup of the cluster, I saw failback of the standbys being called correctly, e.g.:

```
2018-06-19T18:34:32+0200 pg-cluster-01 pgpool[42714]: [30-1] LOG: starting fail back. reconnect host pg-cluster-03.dev.mycorp.net(5432)
2018-06-19T18:34:32+0200 pg-cluster-01 pgpool[42714]: [31-1] LOG: Node 0 is not down (status: 2)
2018-06-19T18:34:32+0200 pg-cluster-01 pgpool[42714]: [32-1] LOG: execute command: /var/lib/postgresql/10/main/failback.sh --failed-node 2 --failed-host pg-cluster-03.dev.mycorp.net --new-master-node 0 --new-master-host pg-cluster-01.dev.mycorp.net --old-master-node 0 --old-primary-node 0
```

Executed at node pg-cluster-01.dev.mycorp.net:

```
sudo systemctl stop postgresql
```

Failover was expected to be called with old-primary = 0, but:

```
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [565-1] LOG: Pgpool-II parent process has received failover request
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [566-1] LOG: starting quarantine. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [567-1] LOG: Restart all children
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [568-1] LOG: failover: set new primary node: -1
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [569-1] LOG: failover: set new master node: 1
2018-06-19T18:38:25+0200 pg-cluster-01 pgpool[42714]: [570-1] LOG: quarantine done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
...
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [875-1] LOG: Pgpool-II parent process has received failover request
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [876-1] LOG: starting degeneration. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [877-1] LOG: Do not restart children because we are switching over node id 0 host: pg-cluster-01.dev.mycorp.net port: 5432 and we are in streaming replication mode
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [878-1] LOG: execute command: /var/lib/postgresql/10/main/failover.sh --failed-node 0 --failed-host pg-cluster-01.dev.mycorp.net --new-master-node 1 --new-master-host pg-cluster-02.dev.mycorp.net --old-master-node 1 --old-primary-node -1
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: quarantine done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
...
2018-06-19T18:38:37+0200 pg-cluster-01 pgpool[42714]: [882-1] LOG: failover done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
```

...leading to:

```
pg-cluster-01.dev.mycorp.net 5432 3 0.333333 quarantine standby
pg-cluster-02.dev.mycorp.net 5432 2 0.333333 up standby
pg-cluster-03.dev.mycorp.net 5432 2 0.333333 up standby
```

This means that failover scripts written in accordance with the example scripts provided in the documentation do not work as expected.

**Steps To Reproduce**

Set up a cluster with streaming replication. Stop the primary backend.

**Additional Information**

pgpool.conf entries:

```
follow_master_command = '/var/lib/postgresql/10/main/follow-master.sh --failed-node %d --failed-host %h --new-master-node %m --new-master-host %H --old-master-node %M --old-primary-node %P'
failover_command = '/var/lib/postgresql/10/main/failover.sh --failed-node %d --failed-host %h --new-master-node %m --new-master-host %H --old-master-node %M --old-primary-node %P'
failback_command = '/var/lib/postgresql/10/main/failback.sh --failed-node %d --failed-host %h --new-master-node %m --new-master-host %H --old-master-node %M --old-primary-node %P'
```

**Tags**: consensus, failover, quarantine, streaming replication
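For context, a minimal, hypothetical failover script compatible with the failover_command above might look like the following sketch. The option names mirror the reporter's configuration; the `decide_failover` function name, the promotion command, and the guard against an old-primary-node of -1 are illustrative assumptions (the guard is only a workaround for the behavior described in this report, not part of the official Pgpool-II sample scripts):

```shell
#!/bin/bash
# Hypothetical failover.sh sketch. In the real script the last line
# would be:  decide_failover "$@"

decide_failover() {
    local failed_node="" failed_host="" new_master_node="" new_master_host=""
    local old_master_node="" old_primary_node=""

    # Parse the arguments produced by the failover_command template.
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --failed-node)      failed_node="$2";      shift 2 ;;
            --failed-host)      failed_host="$2";      shift 2 ;;
            --new-master-node)  new_master_node="$2";  shift 2 ;;
            --new-master-host)  new_master_host="$2";  shift 2 ;;
            --old-master-node)  old_master_node="$2";  shift 2 ;;
            --old-primary-node) old_primary_node="$2"; shift 2 ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
    done

    # Workaround for the bug in this report: when %P arrives as -1 even
    # though a primary existed, fall back to treating the failed node as
    # the old primary.
    if [ "$old_primary_node" = "-1" ]; then
        old_primary_node="$failed_node"
    fi

    if [ "$failed_node" = "$old_primary_node" ]; then
        echo "primary failed: promoting node $new_master_node ($new_master_host)"
        # e.g. ssh "$new_master_host" pg_ctlcluster 10 main promote
    else
        echo "standby $failed_node failed: nothing to promote"
    fi
}
```

Such a script only promotes a standby when the failed node was the primary, which is exactly the decision that a `-1` in `%P` breaks.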
|
|
Reading what I just posted, it really looks like the new quarantine functionality gets in the way here, and the final failover after reaching consensus will never see the original values for old-master and old-primary.
|
|
> Reading what I just posted, it really looks like the new quarantine functionality gets in the way here, and the final failover after reaching consensus will never see the original values for old-master and old-primary.

Assuming that you have enabled failover_when_quorum_exists and failover_require_consensus (they are enabled by default), I suspect there is something wrong with the primary node detection mechanism in the new quarantine functionality. I will look into this...
|
|
Yes, both options are active:

```
failover_when_quorum_exists = on
failover_require_consensus = on
allow_multiple_failover_requests_from_node = off
```

Also, the delay between quarantine and failover is a full health check failure cycle, as configured:

```
health_check_period = 30
health_check_timeout = 20
health_check_max_retries = 10
health_check_retry_delay = 1
connect_timeout = 10000
```

To a certain degree I can understand the logic of quarantine resetting the primary, as the failed backend apparently was the primary. On the other hand, it is questionable why quarantine actually sets the old primary and the new master at all: as long as there is no consensus (if enabled), that should not take effect, since only the failover itself is responsible for promoting a standby, right?
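A back-of-the-envelope reading of those health check settings explains the delay between the backend stopping and the failover being requested. The formula below is an approximation (not pgpool's exact internal accounting): the initial probe may block for up to health_check_timeout, and each of the health_check_max_retries retries adds health_check_retry_delay plus another timeout.

```shell
#!/bin/bash
# Approximate worst-case failure-detection window implied by the
# health check settings quoted above.
health_check_timeout=20
health_check_max_retries=10
health_check_retry_delay=1

worst_case=$(( health_check_timeout + health_check_max_retries * (health_check_timeout + health_check_retry_delay) ))
echo "worst-case detection window: ~${worst_case}s"   # ~230s
```

A window of roughly 230 seconds is consistent with the logs in the Description, where the backend was stopped shortly after 18:34:32 and quarantine started at 18:38:25.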
|
|
> Yes, both options are active:

Is it possible for you to try to reproduce the problem with failover_when_quorum_exists = off and failover_require_consensus = off?

> On the other hand, it would be questionable why quarantine actually sets old primary and new master because as long as there is no
> consensus (if active), it wouldn't come into effect as only the failover is responsible to promote a standby, or?

I think that's a reasonable argument. I have already pinged the author of the functionality and am expecting a comment from him.
|
|
Without consensus, the failover gets called with the correct old primary = 0:

```
2018-06-20T22:20:42+0200 pgpool[26552]: [497-1] LOG: execute command: /var/lib/postgresql/10/main/failover.sh --failed-node 0 --failed-host pg-cluster-01.dev.mycorp.net --new-master-node 1 --new-master-host pg-cluster-02.dev.mycorp.net --old-master-node 0 --old-primary-node 0
2018-06-20T22:20:43+0200 pgpool[26552]: [498-1] LOG: find_primary_node_repeatedly: waiting for finding a primary node
2018-06-20T22:20:43+0200 pgpool[26552]: [499-1] LOG: find_primary_node: checking backend no 0
2018-06-20T22:20:43+0200 pgpool[26552]: [500-1] LOG: find_primary_node: checking backend no 1
2018-06-20T22:20:43+0200 pgpool[26552]: [501-1] LOG: find_primary_node: primary node id is 1
2018-06-20T22:20:43+0200 pgpool[26552]: [502-1] LOG: starting follow degeneration. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-20T22:20:43+0200 pgpool[26552]: [503-1] LOG: starting follow degeneration. shutdown host pg-cluster-03.dev.mycorp.net(5432)
2018-06-20T22:20:43+0200 pgpool[26552]: [504-1] LOG: failover: 2 follow backends have been degenerated
2018-06-20T22:20:43+0200 pgpool[26552]: [505-1] LOG: failover: set new primary node: 1
2018-06-20T22:20:43+0200 pgpool[26552]: [506-1] LOG: failover: set new master node: 1
2018-06-20T22:20:43+0200 pgpool[31890]: [505-1] LOG: start triggering follow command.
2018-06-20T22:20:43+0200 pgpool[31890]: [506-1] LOG: execute command: /var/lib/postgresql/10/main/follow-master.sh --failed-node 0 --failed-host pg-cluster-01.dev.mycorp.net --new-master-node 1 --new-master-host pg-cluster-02.dev.mycorp.net --old-master-node 0 --old-primary-node 0
```
|
|
That's what I expected. I am sure there's something wrong in the code path taken when failover_when_quorum_exists = on and/or failover_require_consensus = on. Will look into this.
|
|
ancoron, here is a proposed patch to fix the problem. It just prevents the primary node from being set to -1 in the quarantine state. As far as I know, we don't need to set it to -1 in this case, as you said. Can you please try it out? All regression tests, including the watchdog tests, have passed here.
|
|
Sorry for the late reply. I just tested with the patch, but got the following behavior for a primary failover:

```
2018-06-27T13:54:21+0200 pgpool[15208]: [121-1] LOG: failed to connect to PostgreSQL server on "pg-cluster-01.dev.mycorp.net:5432", getsockopt() detected error "Connection refused"
2018-06-27T13:54:21+0200 pgpool[15208]: [122-1] LOG: received degenerate backend request for node_id: 0 from pid [15208]
2018-06-27T13:54:21+0200 pgpool[14788]: [71-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [72-1] LOG: watchdog received the failover command from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [73-1] LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [74-1] LOG: failover requires the majority vote, waiting for consensus
2018-06-27T13:54:21+0200 pgpool[14788]: [74-2] DETAIL: failover request noted
2018-06-27T13:54:21+0200 pgpool[14788]: [75-1] LOG: failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "pg-cluster-01.dev.mycorp.net:5433 Linux pg-cluster-01" is queued, waiting for the confirmation from other nodes
2018-06-27T13:54:21+0200 pgpool[14785]: [145-1] LOG: child process with pid: 15208 exits with status 256
2018-06-27T13:54:21+0200 pgpool[14785]: [146-1] LOG: fork a new child process with pid: 15265
2018-06-27T13:54:21+0200 pgpool[15193]: [98-1] LOG: new connection received
2018-06-27T13:54:21+0200 pgpool[15193]: [98-2] DETAIL: connecting host=192.168.123.45 port=49736
2018-06-27T13:54:21+0200 pgpool[15193]: [99-1] LOG: failed to connect to PostgreSQL server on "pg-cluster-01.dev.mycorp.net:5432", getsockopt() detected error "Connection refused"
2018-06-27T13:54:21+0200 pgpool[15193]: [100-1] LOG: received degenerate backend request for node_id: 0 from pid [15193]
2018-06-27T13:54:21+0200 pgpool[14788]: [76-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [77-1] LOG: watchdog received the failover command from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [78-1] LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [79-1] LOG: Duplicate failover request from "pg-cluster-01.dev.mycorp.net:5433 Linux pg-cluster-01" node
2018-06-27T13:54:21+0200 pgpool[14788]: [79-2] DETAIL: request ignored
2018-06-27T13:54:21+0200 pgpool[14788]: [80-1] LOG: failover requires the majority vote, waiting for consensus
2018-06-27T13:54:21+0200 pgpool[14788]: [80-2] DETAIL: failover request noted
2018-06-27T13:54:21+0200 pgpool[14785]: [147-1] LOG: Pgpool-II parent process has received failover request
2018-06-27T13:54:21+0200 pgpool[14788]: [81-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [82-1] LOG: received the failover indication from Pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [83-1] LOG: watchdog is informed of failover end by the main process
2018-06-27T13:54:21+0200 pgpool[14785]: [148-1] LOG: starting quarantine. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-27T13:54:21+0200 pgpool[14785]: [149-1] LOG: Restart all children
2018-06-27T13:54:21+0200 pgpool[14785]: [150-1] LOG: failover: set new primary node: 1868963872
2018-06-27T13:54:21+0200 pgpool[14785]: [151-1] LOG: failover: set new master node: 1
2018-06-27T13:54:21+0200 pgpool[14788]: [84-1] LOG: new IPC connection received
2018-06-27T13:54:21+0200 pgpool[14788]: [85-1] LOG: received the failover indication from Pgpool-II on IPC interface
2018-06-27T13:54:21+0200 pgpool[14788]: [86-1] LOG: watchdog is informed of failover start by the main process
2018-06-27T13:54:21+0200 pgpool[14785]: [152-1] LOG: quarantine done. shutdown host pg-cluster-01.dev.mycorp.net(5432)
2018-06-27T13:54:22+0200 pgpool[15115]: [51-1] LOG: restart request received in pcp child process
2018-06-27T13:54:22+0200 pgpool[14785]: [153-1] LOG: PCP child 15115 exits with status 0 in failover()
2018-06-27T13:54:22+0200 pgpool[14785]: [154-1] LOG: fork a new PCP child pid 15366 in failover()
2018-06-27T13:54:22+0200 pgpool[14785]: [155-1] LOG: child process with pid: 14813 exits with status 0
2018-06-27T13:54:22+0200 pgpool[14785]: [156-1] LOG: child process with pid: 14813 exited with success and will not be restarted
...
2018-06-27T13:54:22+0200 pgpool[14785]: [455-1] WARNING: worker child process with pid: 15367 was terminated by segmentation fault
2018-06-27T13:54:22+0200 pgpool[14785]: [456-1] LOG: fork a new worker child process with pid: 15417
2018-06-27T13:54:22+0200 pgpool[14785]: [457-1] WARNING: worker child process with pid: 15417 was terminated by segmentation fault
2018-06-27T13:54:22+0200 pgpool[14785]: [458-1] LOG: fork a new worker child process with pid: 15418
2018-06-27T13:54:22+0200 pgpool[14785]: [459-1] WARNING: worker child process with pid: 15418 was terminated by segmentation fault
2018-06-27T13:54:22+0200 pgpool[14785]: [460-1] LOG: fork a new worker child process with pid: 15419
...
2018-06-27T13:54:26+0200 pgpool[14785]: [1033-1] WARNING: worker child process with pid: 15707 was terminated by segmentation fault
2018-06-27T13:54:26+0200 pgpool[14785]: [1034-1] LOG: fork a new worker child process with pid: 15708
2018-06-27T13:54:26+0200 pgpool[14788]: [87-1] LOG: watchdog received the failover command from remote pgpool-II node "pg-cluster-02.dev.mycorp.net:5433 Linux pg-cluster-02"
2018-06-27T13:54:26+0200 pgpool[14788]: [88-1] LOG: watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from pg-cluster-02.dev.mycorp.net:5433 Linux pg-cluster-02
2018-06-27T13:54:26+0200 pgpool[14788]: [89-1] LOG: we have got the consensus to perform the failover
2018-06-27T13:54:26+0200 pgpool[14788]: [89-2] DETAIL: 2 node(s) voted in the favor
2018-06-27T13:54:26+0200 pgpool[14788]: [90-1] LOG: received degenerate backend request for node_id: 0 from pid [14788]
2018-06-27T13:54:26+0200 pgpool[14788]: [91-1] LOG: new IPC connection received
2018-06-27T13:54:26+0200 pgpool[14788]: [92-1] LOG: received the failover indication from Pgpool-II on IPC interface
2018-06-27T13:54:26+0200 pgpool[14788]: [93-1] LOG: watchdog is informed of failover end by the main process
2018-06-27T13:54:26+0200 systemd[1]: pgpool2.service: Main process exited, code=killed, status=11/SEGV
2018-06-27T13:54:26+0200 pgpool[14788]: [94-1] LOG: Watchdog is shutting down
2018-06-27T13:54:26+0200 systemd[1]: pgpool2.service: Unit entered failed state.
2018-06-27T13:54:26+0200 systemd[1]: pgpool2.service: Failed with result 'signal'.
```

Note the implausible value in "failover: set new primary node: 1868963872" during quarantine, which looks like an uninitialized value and would fit the segmentation faults that follow. So, sorry, but that patch makes it worse. After VIP takeover, I could see that indeed no actual failover occurred and the node is left in the quarantine state:

```
pg-cluster-01.dev.mycorp.net 5432 3 0.333333 quarantine standby
pg-cluster-02.dev.mycorp.net 5432 2 0.333333 up standby
pg-cluster-03.dev.mycorp.net 5432 2 0.333333 up standby
```
|
|
Thank you for the testing. I think we need help from Usama, who is the authority on the watchdog code. I will ping him.
|
|
Hi, I have pushed the fix for it. You can test it by building from the head of the 3.7 branch: https://git.postgresql.org/gitweb/?p=pgpool2.git;a=commitdiff;h=1648cb8c552a19b2f90de1fab120929dd5be06e4 Thanks
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2018-06-20 02:36 | ancoron | New Issue | |
| 2018-06-20 02:36 | ancoron | Tag Attached: failover | |
| 2018-06-20 02:36 | ancoron | Tag Attached: streaming replication | |
| 2018-06-20 02:47 | ancoron | Note Added: 0002050 | |
| 2018-06-20 10:21 | t-ishii | Note Added: 0002051 | |
| 2018-06-21 03:45 | ancoron | Note Added: 0002052 | |
| 2018-06-21 12:44 | t-ishii | Assigned To | => t-ishii |
| 2018-06-21 12:44 | t-ishii | Status | new => assigned |
| 2018-06-21 12:48 | t-ishii | Note Added: 0002055 | |
| 2018-06-21 15:36 | ancoron | Note Added: 0002067 | |
| 2018-06-25 10:33 | t-ishii | Note Added: 0002071 | |
| 2018-06-25 11:29 | t-ishii | File Added: watchdog.diff | |
| 2018-06-25 11:29 | t-ishii | Note Added: 0002073 | |
| 2018-06-25 11:29 | t-ishii | Status | assigned => feedback |
| 2018-06-27 21:43 | ancoron | Note Added: 0002074 | |
| 2018-06-27 21:43 | ancoron | Status | feedback => assigned |
| 2018-06-28 09:18 | t-ishii | Note Added: 0002075 | |
| 2018-07-02 16:20 | Muhammad Usama | Status | assigned => resolved |
| 2018-07-02 16:20 | Muhammad Usama | Resolution | open => fixed |
| 2018-07-02 16:20 | Muhammad Usama | Note Added: 0002076 | |
| 2018-07-05 03:14 | ancoron | Tag Attached: consensus | |
| 2018-07-05 03:14 | ancoron | Tag Attached: quarantine | |
| 2018-08-06 09:59 | administrator | Status | resolved => closed |