[pgpool-general: 7910] Re: PgPool Failover ran twice resulting in outage

Tue Nov 30 17:32:33 JST 2021

Hi Tatsuo,

We don't have repmgrd installed or configured, repmgr is only used to start the replication from initial configuration. If this is not recommended can you confirm the correct approach?

Thanks You,

Joe.

Joe Madden
Senior Systems Engineer
D 01412224666      
Joe.Madden at mottmac.com

-----Original Message-----
From: Tatsuo Ishii <ishii at sraoss.co.jp> 
Sent: 30 November 2021 00:31
To: Joe Madden <Joe.Madden at mottmac.com>
Cc: pgpool-general at pgpool.net; pgpool-hackers at pgpool.net
Subject: Re: [pgpool-general: 7879] PgPool Failover ran twice resulting in outage

Looks like you are using repmgr and I suspect that's is the source of problem. Pgpool-II and repmgr are different products to manage failover but they are not designed to work together (at least from Pgpool-II side).

Best reagards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.sraoss.co.jp%2Findex_en.php&data=04%7C01%7CJoe.Madden%40mottmac.com%7Cbd104ea635654cbf6bae08d9b398b6a6%7Ca2bed0c459574f73b0c2a811407590fb%7C0%7C0%7C637738290748814926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=aXQyJs%2BfzVNNY4imBGxblxkXRSRDvo23uxkr9ywOOtM%3D&reserved=0
Japanese:https://eur01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.sraoss.co.jp%2F&data=04%7C01%7CJoe.Madden%40mottmac.com%7Cbd104ea635654cbf6bae08d9b398b6a6%7Ca2bed0c459574f73b0c2a811407590fb%7C0%7C0%7C637738290748814926%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&sdata=uLOL04hHBrElbej1ZaObWTwFc0p0kifa7S5C8p3ckT4%3D&reserved=0

> Hi All,
> 
> I have an issue where PgPool correctly detected the failover of a database server, and failed over the primary to the new standby. Unfortunately the failover was then triggered a 2nd resulting in the pgpool considering the new primary database as degenerated.
> 
> I have attached the full log output but the parts of interest are:
> 
> Nov 17 00:00:20 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:20: 
> pid 719026: LOG:  trying connecting to PostgreSQL server on "10.51.4.25:5432" by INET socket Nov 17 00:00:20 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:20: pid 719026: DETAIL:  timed out. retrying...
> 
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:22: 
> pid 1896062: DETAIL:  failover request noted Nov 17 00:00:22 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:22: pid 1896062: 
> DETAIL:  2 node(s) voted in the favor
> 
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:22: 
> pid 1896060: LOG:  execute command: /etc/pgpool-II/failover.sh 0 
> 10.51.4.26 10.51.4.25 5432 /var/lib/pgsql/12/data/ 1 0 0 5432 
> /var/lib/pgsql/12/data/ 10.51.4.25 5432 Nov 17 00:00:22 service-lb-01 
> pgpool[1896060]: Failed node id: 0 Nov 17 00:00:22 service-lb-01 
> pgpool[1896060]: Failed node hostname: 10.51.4.25 Nov 17 00:00:22 
> service-lb-01 pgpool[1896060]: Failed node port: 5432 Nov 17 00:00:22 
> service-lb-01 pgpool[1896060]: Failed node path: 
> /var/lib/pgsql/12/data/ Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 
> New Main node id: 1 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: New 
> Main node hostname: 10.51.4.26 Nov 17 00:00:22 service-lb-01 
> pgpool[1896060]: New main port number: 5432 Nov 17 00:00:22 
> service-lb-01 pgpool[1896060]: New main Database Path: 
> /var/lib/pgsql/12/data/ Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 
> Old Main node id: 0 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old 
> Primary Node id: 0 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old Primary Node Hostname: 10.51.4.25 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Old Primary Node Port Number: 5432 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Running failover script at Wed 17 Nov 00:00:22 UTC 2021 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Failed node: 0 Nov 17 00:00:22 service-lb-01 pgpool[1896060]: + /usr/bin/ssh -T -l postgres 10.51.4.26 '/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote'
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: Warning: This system is restricted to authorized users only. Unauthorized access or use is a violation of company policy and the law. This system may be monitored for administrative and security reasons. By proceeding, you acknowledge that (1) you have read and understand this notice and (2) you consent to the system monitoring.
> Nov 17 00:00:22 service-lb-01 pgpool[1896060]: NOTICE: promoting 
> standby to primary Nov 17 00:00:22 service-lb-01 pgpool[1896060]: 
> DETAIL: promoting server "10.51.4.26" (ID: 2) using pg_promote() Nov 
> 17 00:00:22 service-lb-01 pgpool[1896060]: NOTICE: waiting up to 60 
> seconds (parameter "promote_check_timeout") for promotion to complete 
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: 
> pid 1896273: ERROR:  failed to authenticate Nov 17 00:00:23 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896273: DETAIL:  the database system is starting up Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896273: LOG:  health check retrying on DB node: 0 (round:3) Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896272: LOG:  trying connecting to PostgreSQL server on "10.51.4.25:5432" by INET socket Nov 17 00:00:23 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896272: DETAIL:  timed out. retrying...
> Nov 17 00:00:23 service-lb-01 pgpool[1896060]: NOTICE: STANDBY PROMOTE 
> successful Nov 17 00:00:23 service-lb-01 pgpool[1896060]: DETAIL: 
> server "10.51.4.26" (ID: 2) was successfully promoted to primary Nov 
> 17 00:00:23 service-lb-01 pgpool[1896060]: + exit 0 Nov 17 00:00:23 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:23: pid 1896060: LOG:  
> find_primary_node_repeatedly: waiting for finding a primary node
> 
> Nov 17 00:00:33 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:33: 
> pid 1896272: LOG:  trying connecting to PostgreSQL server on "10.51.4.25:5432" by INET socket Nov 17 00:00:33 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:33: pid 1896272: DETAIL:  timed out. retrying...
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: 
> pid 1896272: LOG:  get_query_result falied: status: -2 Nov 17 00:00:34 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: 
> CONTEXT:  while checking replication time lag Nov 17 00:00:34 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  
> pgpool_worker_child: invalid node found 1 Nov 17 00:00:34 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  
> received degenerate backend request for node_id: 1 from pid [1896272] 
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: 
> pid 1896062: LOG:  new IPC connection received
> 
> Nov 17 00:00:34 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:34: 
> pid 1896062: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] 
> request from pgpool-II node "10.51.4.29:9999 Linux 
> service-lb-01.prod.region.service.local.int" is queued, waiting for 
> the confirmation from other nodes Nov 17 00:00:34 service-lb-01 
> pgpool[1896060]: 2021-11-17 00:00:34: pid 1896272: LOG:  degenerate 
> backend request for node_id: 1 from pid [1896272], will be handled by 
> watchdog, which is building consensus for request Nov 17 00:00:35 
> service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  
> find_primary_node: primary node is 1 Nov 17 00:00:35 service-lb-01 
> pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  failover: set 
> new primary node: 1 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 
> 2021-11-17 00:00:35: pid 1896060: LOG:  failover: set new main node: 1
> 
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: failover done. shutdown 
> host 10.51.4.25(5432)2021-11-17 00:00:35: pid 1896060: LOG:  failover 
> done. shutdown host 10.51.4.25(5432)
> 
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: 
> pid 1896060: LOG:  execute command: /etc/pgpool-II/failover.sh 0 "" 
> 10.51.4.25 5432 /var/lib/pgsql/12/data/ -1 1 1 "" "" 10.51.4.26 5432 
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node id: 0 Nov 
> 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node hostname: 
> 10.51.4.25 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node 
> port: 5432 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node path: /var/lib/pgsql/12/data/ Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node id: -1 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node hostname:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main port number:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main Database Path:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Main node id: 1 Nov 
> 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node id: 1 Nov 
> 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node Hostname: 
> 10.51.4.26 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary 
> Node Port Number: 5432 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 
> Running failover script at Wed 17 Nov 00:00:35 UTC 2021 Nov 17 
> 00:00:35 service-lb-01 pgpool[1896060]: Failed node: 0 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: + /usr/bin/ssh -T -l postgres '/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote'
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: ssh: Could not resolve 
> hostname /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf 
> standby promote: Name or service not known Nov 17 00:00:35 
> service-lb-01 pgpool[1896060]: + exit 0 Nov 17 00:00:35 service-lb-01 
> pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  execute 
> command: /etc/pgpool-II/failover.sh 1 "" 10.51.4.26 5432 
> /var/lib/pgsql/12/data/ -1 1 1 "" "" 10.51.4.26 5432 Nov 17 00:00:35 
> service-lb-01 pgpool[1896060]: Failed node id: 1 Nov 17 00:00:35 
> service-lb-01 pgpool[1896060]: Failed node hostname: 10.51.4.26 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node port: 5432 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Failed node path: /var/lib/pgsql/12/data/ Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node id: -1 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New Main node hostname:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main port number:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: New main Database Path:
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Main node id: 1 Nov 
> 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node id: 1 Nov 
> 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary Node Hostname: 
> 10.51.4.26 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: Old Primary 
> Node Port Number: 5432 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 
> Running failover script at Wed 17 Nov 00:00:35 UTC 2021 Nov 17 
> 00:00:35 service-lb-01 pgpool[1896060]: Failed node: 1 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: + /usr/bin/ssh -T -l postgres '/usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf standby promote'
> Nov 17 00:00:35 service-lb-01 pgpool[1896060]: ssh: Could not resolve 
> hostname /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf 
> standby promote: Name or service not known Nov 17 00:00:35 
> service-lb-01 pgpool[1896060]: + exit 0 Nov 17 00:00:35 service-lb-01 
> pgpool[1896060]: 2021-11-17 00:00:35: pid 1896060: LOG:  
> find_primary_node_repeatedly: all of the backends are down. Giving up 
> finding primary node Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 
> 2021-11-17 00:00:35: pid 1896060: LOG:  failover: set new primary 
> node: -1 Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 
> 00:00:35: pid 1896062: LOG:  new IPC connection received Nov 17 
> 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:35: pid 
> 1896062: LOG:  received the failover indication from Pgpool-II on IPC 
> interface Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 2021-11-17 
> 00:00:35: pid 1896062: LOG:  watchdog is informed of failover end by 
> the main process Nov 17 00:00:35 service-lb-01 pgpool[1896060]: 
> failover done. shutdown host 10.51.4.26(5432)2021-11-17 00:00:35: pid 
> 1896060: LOG:  failover done. shutdown host 10.51.4.26(5432) Nov 17 
> 00:00:36 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:36: pid 
> 732115: FATAL:  pgpool is not accepting any new connections Nov 17 
> 00:00:36 service-lb-01 pgpool[1896060]: 2021-11-17 00:00:36: pid 
> 732115: DETAIL:  all backend nodes are down, pgpool requires at least 
> one valid node Nov 17 00:00:36 service-lb-01 pgpool[1896060]: 
> 2021-11-17 00:00:36: pid 732115: HINT:  repair the backend nodes and 
> restart pgpool Nov 17 00:00:36 service-lb-01 pgpool[1896060]: 
> 2021-11-17 00:00:36: pid 732297: FATAL:  pgpool is not accepting any 
> new connections
> 
> 
> It looks like, it failed over node 1 before the failover was complete? Has anyone seen anything like this before?
> 
> Thanks
> 
> Joe.
>