[pgpool-general: 7908] Re: PgPool Failover ran twice resulting in outage

Tatsuo Ishii ishii at sraoss.co.jp
Tue Nov 30 09:31:05 JST 2021


Looks like you are using repmgr and I suspect that's is the source of
problem. Pgpool-II and repmgr are different products to manage
failover but they are not designed to work together (at least from
Pgpool-II side).

Best reagards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Hi All,
> 
> I have an issue where PgPool correctly detected the failover of a database server, and failed over the primary to the new standby. Unfortunately the failover was then triggered a 2nd resulting in the pgpool considering the new primary database as degenerated.
> 
> I have attached the full log output but the parts of interest are:
> 
> Nov 17 00:00:20 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:20: pid 719026: LOG:  trying connecting to PostgreSQL server on "10.51.4.25:5432" by INET socket
> Nov 17 00:00:20 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:20: pid 719026: DETAIL:  timed out. retrying...
> 
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:22: pid 1896062: DETAIL:  failover request noted
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:22: pid 1896062: DETAIL:  2 node(s) voted in the favor
> 
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:22: pid 1896060: LOG:  execute command: /etc/pgpool-II/failover.sh 0 10.51.4.26 10.51.4.25 5432 /var/lib/pgsql/12/data/ 1 0 0 5432 /var/lib/pgsql/12/data/ 10.51.4.25 5432
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Failed node id: 0
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Failed node hostname: 10.51.4.25
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Failed node port: 5432
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Failed node path: /var/lib/pgsql/12/data/
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: New Main node id: 1
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: New Main node hostname: 10.51.4.26
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: New main port number: 5432
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: New main Database Path: /var/lib/pgsql/12/data/
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old Main node id: 0
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old Primary Node id: 0
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old Primary Node Hostname: 10.51.4.25
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old Primary Node Port Number: 5432
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Running failover script at Wed 17 Nov 00:00:22 UTC 2021
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Failed node: 0
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: + /usr/bin/ssh -T -l postgres 10.51.4.26 '/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote'
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Warning: This system is restricted to authorized users only. Unauthorized access or use is a violation of company policy and the law. This system may be monitored for administrative and security reasons. By proceeding, you acknowledge that (1) you have read and understand this notice and (2) you consent to the system monitoring.
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: NOTICE: promoting standby to primary
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: DETAIL: promoting server "10.51.4.26" (ID: 2) using pg_promote()
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: NOTICE: waiting up to 60 seconds (parameter "promote_check_timeout") for promotion to complete
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896273: ERROR:  failed to authenticate
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896273: DETAIL:  the database system is starting up
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896273: LOG:  health check retrying on DB node: 0 (round:3)
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896272: LOG:  trying connecting to PostgreSQL server on "10.51.4.25:5432" by INET socket
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896272: DETAIL:  timed out. retrying...
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: NOTICE: STANDBY PROMOTE successful
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: DETAIL: server "10.51.4.26" (ID: 2) was successfully promoted to primary
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: + exit 0
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896060: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
> 
> Nov 17 00:00:33 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:33: pid 1896272: LOG:  trying connecting to PostgreSQL server on "10.51.4.25:5432" by INET socket
> Nov 17 00:00:33 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:33: pid 1896272: DETAIL:  timed out. retrying...
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  get_query_result falied: status: -2
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: CONTEXT:  while checking replication time lag
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  pgpool_worker_child: invalid node found 1
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  received degenerate backend request for node_id: 1 from pid [1896272]
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896062: LOG:  new IPC connection received
> 
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896062: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from pgpool-II node "10.51.4.29:9999 Linux service-lb-01.prod.region.service.local.int" is queued, waiting for the confirmation from other nodes
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  degenerate backend request for node_id: 1 from pid [1896272], will be handled by watchdog, which is building consensus for request
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  find_primary_node: primary node is 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  failover: set new primary node: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  failover: set new main node: 1
> 
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: failover done. shutdown host 10.51.4.25(5432)2021-11-17 00:00:35: pid 1896060: LOG:  failover done. shutdown host 10.51.4.25(5432)
> 
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  execute command: /etc/pgpool-II/failover.sh 0 "" 10.51.4.25 5432 /var/lib/pgsql/12/data/ -1 1 1 "" "" 10.51.4.26 5432
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node id: 0
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node hostname: 10.51.4.25
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node port: 5432
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node path: /var/lib/pgsql/12/data/
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node id: -1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node hostname:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main port number:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main Database Path:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Main node id: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node id: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node Hostname: 10.51.4.26
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node Port Number: 5432
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Running failover script at Wed 17 Nov 00:00:35 UTC 2021
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node: 0
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: + /usr/bin/ssh -T -l postgres '/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote'
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: ssh: Could not resolve hostname /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote: Name or service not known
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: + exit 0
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  execute command: /etc/pgpool-II/failover.sh 1 "" 10.51.4.26 5432 /var/lib/pgsql/12/data/ -1 1 1 "" "" 10.51.4.26 5432
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node id: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node hostname: 10.51.4.26
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node port: 5432
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node path: /var/lib/pgsql/12/data/
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node id: -1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node hostname:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main port number:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main Database Path:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Main node id: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node id: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node Hostname: 10.51.4.26
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node Port Number: 5432
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Running failover script at Wed 17 Nov 00:00:35 UTC 2021
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node: 1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: + /usr/bin/ssh -T -l postgres '/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote'
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: ssh: Could not resolve hostname /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote: Name or service not known
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: + exit 0
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  find_primary_node_repeatedly: all of the backends are down. Giving up finding primary node
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  failover: set new primary node: -1
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896062: LOG:  new IPC connection received
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896062: LOG:  received the failover indication from Pgpool-II on IPC interface
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896062: LOG:  watchdog is informed of failover end by the main process
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: failover done. shutdown host 10.51.4.26(5432)2021-11-17 00:00:35: pid 1896060: LOG:  failover done. shutdown host 10.51.4.26(5432)
> Nov 17 00:00:36 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:36: pid 732115: FATAL:  pgpool is not accepting any new connections
> Nov 17 00:00:36 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:36: pid 732115: DETAIL:  all backend nodes are down, pgpool requires at least one valid node
> Nov 17 00:00:36 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:36: pid 732115: HINT:  repair the backend nodes and restart pgpool
> Nov 17 00:00:36 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:36: pid 732297: FATAL:  pgpool is not accepting any new connections
> 
> 
> It looks like, it failed over node 1 before the failover was complete? Has anyone seen anything like this before?
> 
> Thanks
> 
> Joe.
> 


More information about the pgpool-general mailing list