[pgpool-hackers: 3441] 013.watchdog_failover_require_consensus test keeps failing

Tatsuo Ishii ishii at sraoss.co.jp
Sun Sep 22 19:34:36 JST 2019


Hi Usama,

Recently the 013.watchdog_failover_require_consensus test keeps
failing.

* V4_0_STABLE  PostgreSQL 10  CentOS6
testing 013.watchdog_failover_require_consensus...failed.

* V4_0_STABLE  PostgreSQL 11  CentOS6
testing 013.watchdog_failover_require_consensus...failed.

* V3_7_STABLE  PostgreSQL 9.5  CentOS6
testing 013.watchdog_failover_require_consensus...failed.

* V3_7_STABLE  PostgreSQL 10  CentOS6
testing 013.watchdog_failover_require_consensus...failed.

* V3_7_STABLE  PostgreSQL 10  CentOS7
testing 013.watchdog_failover_require_consensus...failed.

* V3_7_STABLE  PostgreSQL 11  CentOS7
testing 013.watchdog_failover_require_consensus...failed.

It seems only the 4.0 and 3.7 branches are failing; 4.1 and master are
not. The test does not fail on my laptop either. I have looked into
the buildfarm log and found some differences in the master watchdog log.
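For reference, the consensus behavior this test exercises is controlled by
the watchdog settings below. This is only a minimal sketch with illustrative
values, not an exact copy of what the regression harness writes into each
node's pgpool.conf:

# Watchdog consensus settings exercised by this test (illustrative values,
# not necessarily what the regression harness generates).
use_watchdog = on
failover_when_quorum_exists = on          # perform failover only when quorum exists
failover_require_consensus = on           # require a majority vote before failover
allow_multiple_failover_requests_from_node = off  # one vote per node per failure

With failover_require_consensus = on, a degenerate backend request is only
queued until a majority of the watchdog nodes report the same failure, which
is what the log excerpts below show.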

My laptop:
2019-09-22 18:48:01: pid 27041: LOG:  watchdog received the failover command from remote pgpool-II node "localhost:11200 Linux tishii-CFSV7-1"
2019-09-22 18:48:01: pid 27041: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from localhost:11200 Linux tishii-CFSV7-1
2019-09-22 18:48:01: pid 27041: LOG:  failover requires the majority vote, waiting for consensus
2019-09-22 18:48:01: pid 27041: DETAIL:  failover request noted
2019-09-22 18:48:01: pid 27041: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "localhost:11200 Linux tishii-CFSV7-1" is queued, waiting for the confirmation from other nodes
2019-09-22 18:48:01: pid 27036: LOG:  Pgpool-II parent process received inform quarantine nodes signal from watchdog
2019-09-22 18:48:01: pid 27041: LOG:  watchdog received the failover command from remote pgpool-II node "localhost:11100 Linux tishii-CFSV7-1"
2019-09-22 18:48:01: pid 27041: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received fr

Buildfarm:
2019-09-13 08:36:07: pid 15599: LOG:  watchdog received the failover command from remote pgpool-II node "localhost:11200 Linux b37695d81f29"
2019-09-13 08:36:07: pid 15599: LOG:  watchdog is processing the failover command [DEGENERATE_BACKEND_REQUEST] received from localhost:11200 Linux b37695d81f29
2019-09-13 08:36:07: pid 15599: LOG:  failover requires the majority vote, waiting for consensus
2019-09-13 08:36:07: pid 15599: DETAIL:  failover request noted
2019-09-13 08:36:07: pid 15599: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "localhost:11200 Linux b37695d81f29" is queued, waiting for the confirmation from other nodes
2019-09-13 08:36:07: pid 15594: LOG:  Pgpool-II parent process received inform quarantine nodes signal from watchdog
2019-09-13 08:36:08: pid 15641: LOG:  DB node id: 0 backend pid: 15659 statement:  DISCARD ALL
2019-09-13 08:36:03: pid 15643: LOG:  DB node id: 1 backend pid: 15763 statement:  DISCARD ALL
2019-09-13 08:36:06: pid 15643: LOG:  DB node id: 0 backend pid: 15764 statement:  DISCARD ALL
2019-09-13 08:36:06: pid 15643: LOG:  DB node id: 1 backend pid: 15763 statement:  DISCARD ALL

As you can see, on my laptop the master receives the "watchdog received
the failover command from remote pgpool-II node" message from the
standby2 node, while in the buildfarm case it did not. I guess there is
some timing-dependent error here. Note that this difference probably
caused the failure at [1].
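For context, the failing check at [1] is essentially a retry loop over the
master's pgpool log, so if the expected message never arrives the loop simply
exhausts its attempts. A rough shell sketch of that kind of check follows;
the log path, message text and retry count here are assumptions for
illustration, not the actual test.sh:

# Rough sketch of the kind of retry loop the test performs (hypothetical
# log path, message and retry count; the real test.sh may differ).
for i in $(seq 1 10)
do
    echo "[check] $i times"
    if grep -q "watchdog received the failover command from remote pgpool-II node" master/log/pgpool.log; then
        echo "failover request from the standby reached the master."
        exit 0
    fi
    sleep 1
done
echo "failover request never reached the master: regression test failed here."
exit 1

The regression log from the buildfarm run is below.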

------------------------------ regression log ------------------------------------
creating master pgpool and PostgreSQL clusters...PostgreSQL major version: 115
Starting set up in streaming replication mode
creating startall and shutdownall
creating failover script
creating database cluster /var/volum/CentOS7/11/V4_0_STABLE/src/test/regression/tests/013.watchdog_failover_require_consensus/master/data0...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
creating database cluster /var/volum/CentOS7/11/V4_0_STABLE/src/test/regression/tests/013.watchdog_failover_require_consensus/master/data1...done.
update postgresql.conf
creating pgpool_remote_start
creating basebackup.sh
creating recovery.conf
temporarily start data0 cluster to create extensions
temporarily start pgpool-II to create standby nodes
 node_id |      hostname       | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change  
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /var/run/postgresql | 11002 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-09-13 08:35:14
 1       | /var/run/postgresql | 11003 | down   | 0.500000  | standby | 0          | false             | 0                 | 1970-01-01 00:00:00
(2 rows)

recovery node 1...pcp_recovery_node -- Command Successful
done.
creating follow master script
 node_id |      hostname       | port  | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change  
---------+---------------------+-------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | /var/run/postgresql | 11002 | up     | 0.500000  | primary | 0          | true              | 0                 | 2019-09-13 08:35:14
 1       | /var/run/postgresql | 11003 | up     | 0.500000  | standby | 0          | false             | 0                 | 2019-09-13 08:35:20
(2 rows)

shutdown all

pgpool-II setting for streaming replication mode is done.
To start the whole system, use /var/volum/CentOS7/11/V4_0_STABLE/src/test/regression/tests/013.watchdog_failover_require_consensus/master/startall.
To shutdown the whole system, use /var/volum/CentOS7/11/V4_0_STABLE/src/test/regression/tests/013.watchdog_failover_require_consensus/master/shutdownall.
pcp command user name is "postgres", password is "postgres".
Each PostgreSQL, pgpool-II and pcp port is as follows:
#1 port is 11002
#2 port is 11003
pgpool port is 11000
pcp port is 11001
The info above is in README.port.
master setup done.
waiting for server to start....15575 2019-09-13 08:35:21 UTC LOG:  listening on IPv4 address "0.0.0.0", port 11002
15575 2019-09-13 08:35:21 UTC LOG:  listening on IPv6 address "::", port 11002
15575 2019-09-13 08:35:21 UTC LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.11002"
15575 2019-09-13 08:35:21 UTC LOG:  redirecting log output to logging collector process
15575 2019-09-13 08:35:21 UTC HINT:  Future log output will appear in directory "log".
 done
server started
waiting for server to start....15586 2019-09-13 08:35:22 UTC LOG:  listening on IPv4 address "0.0.0.0", port 11003
15586 2019-09-13 08:35:22 UTC LOG:  listening on IPv6 address "::", port 11003
15586 2019-09-13 08:35:22 UTC LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.11003"
15586 2019-09-13 08:35:22 UTC LOG:  redirecting log output to logging collector process
15586 2019-09-13 08:35:22 UTC HINT:  Future log output will appear in directory "log".
 done
server started
creating standby pgpool...creating standby2 pgpool...Waiting for the pgpool master...
Master brought up successfully.
Waiting for the standby1 to join cluster...
[check] 1 times
Standby successfully connected.
Waiting for the standby2 to join cluster...
Standby2 successfully connected.
Checking if the standby successfuly process the failover request...
[check] 1 times
[check] 2 times
[check] 3 times
[check] 4 times
[check] 5 times
2019-09-13 08:35:47: pid 15695: LOG:  degenerate backend request for node_id: 1 from pid [15695], will be handled by watchdog, which is building consensus for request
Fake DB error generated and master is waiting for consensus
Checking if the Master receives the failover request and waiting for consensus...
2019-09-13 08:35:47: pid 15599: DETAIL:  failover request noted
Fake DB error delivered to master. and master is waiting for consensus
Checking if the node is quarantined on standby...
[check] 1 times
[check] 2 times
[check] 3 times
[check] 4 times
[check] 5 times
2019-09-13 08:35:57: pid 15695: LOG:  degenerate backend request for 1 node(s) from pid [15695], is changed to quarantine node request by watchdog
Node is quarantined because of second error but no consensus
Checking if all Pgpool-II agrees that the failover request is executed
[check] 1 times
[check] 2 times
[check] 3 times
[check] 4 times
[check] 5 times
[check] 6 times
[check] 7 times
[check] 8 times
[check] 9 times
[check] 10 times [1: regression test failed here]
2019-09-13 08:36:24: pid 15915: LOG:  stop request sent to pgpool. waiting for termination...
.done.
2019-09-13 08:36:25: pid 15916: LOG:  stop request sent to pgpool. waiting for termination...
.done.
2019-09-13 08:36:26: pid 15920: LOG:  stop request sent to pgpool. waiting for termination...
.done.
waiting for server to shut down.... done
server stopped
waiting for server to shut down.... done
server stopped
6 out of 7 successfull

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

