[pgpool-general: 6756] Watchdog does not get consensus for failover

shacky shacky83 at gmail.com
Tue Oct 29 02:57:37 JST 2019


Hi,
I have a three nodes PgPool 4.0.2 cluster on proxy1, proxy2, proxy3
(master) with a three nodes PostgresSQL 11 on postgres1 (master),
postgres2, postgres3.

Today I was making some failover-recovery tests and everything was running
fine, but since about 3 hours the PgPool master node does not get the
consensus from other PgPool node to make the failover when the primary
PostgreSQL node goes down even if the PgPool cluster is up and running and
healthy.

When the primary PostgreSQL node is up and running, the cluster is healthy:
=================================== 8<
?============================================
postgres at proxy3:~$ psql -h 127.0.0.1 -p 5432 -U repmgr -c "show pool_nodes;"
DEBUG:  SimpleQuery
DETAIL:  nodes reporting
 node_id | hostname  | port | status | lb_weight |  role   | select_cnt |
load_balance_node | replication_delay | last_status_change
---------+-----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
 0       | postgres1 | 5432 | up     | 0.333333  | primary | 0          |
true              | 0                 | 2019-10-28 17:34:05
 1       | postgres2 | 5432 | up     | 0.333333  | standby | 11         |
false             | 0                 | 2019-10-28 18:23:39
 2       | postgres3 | 5432 | up   | 0.333333  | standby | 0          |
false             | 0                 | 2019-10-28 17:35:27
(3 rows)
=================================== 8<
?============================================

The PgPool cluster is healthy and the VIP IP address is on the master node:
=================================== 8<
?============================================
postgres at proxy3:~$ pcp_watchdog_info -h 192.168.175.30 -p 9898 -w -v
Watchdog Cluster Information
Total Nodes          : 3
Remote Nodes         : 2
Quorum state         : QUORUM EXIST
Alive Remote Nodes   : 2
VIP up on local node : YES
Master Node Name     : pgpool3:5432 Linux proxy3
Master Host Name     : pgpool3

Watchdog Node Information
Node Name      : pgpool3:5432 Linux proxy3
Host Name      : pgpool3
Delegate IP    : 192.168.175.30
Pgpool port    : 5432
Watchdog port  : 9000
Node priority  : 1
Status         : 4
Status Name    : MASTER

Node Name      : pgpool1:5432 Linux proxy1
Host Name      : pgpool1
Delegate IP    : 192.168.175.30
Pgpool port    : 5432
Watchdog port  : 9000
Node priority  : 1
Status         : 7
Status Name    : STANDBY

Node Name      : pgpool2:5432 Linux proxy2
Host Name      : pgpool2
Delegate IP    : 192.168.175.30
Pgpool port    : 5432
Watchdog port  : 9000
Node priority  : 1
Status         : 7
Status Name    : STANDBY
=================================== 8<
?============================================

If I stop the primary PostgreSQL server, the maser PgPool node realises
this, but it cannot execute the failover because it does not get the
consensus from other nodes:
=================================== 8<
?============================================
Oct 28 18:47:34 proxy3 pgpool[25364]: [348-1] 2019-10-28 18:47:34: pid
25364: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:47:34 proxy3 pgpool[25364]: [348-2] 2019-10-28 18:47:34: pid
25364: LOCATION:  pool_connection_pool.c:680
Oct 28 18:47:34 proxy3 pgpool[25364]: [349-1] 2019-10-28 18:47:34: pid
25364: ERROR:  failed to make persistent db connection
Oct 28 18:47:34 proxy3 pgpool[25364]: [349-2] 2019-10-28 18:47:34: pid
25364: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:47:34 proxy3 pgpool[25364]: [349-3] 2019-10-28 18:47:34: pid
25364: LOCATION:  child.c:1328
Oct 28 18:47:35 proxy3 pgpool[25036]: [634-1] 2019-10-28 18:47:35: pid
25036: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:47:35 proxy3 pgpool[25036]: [634-2] 2019-10-28 18:47:35: pid
25036: LOCATION:  pool_connection_pool.c:680
Oct 28 18:47:35 proxy3 pgpool[25036]: [635-1] 2019-10-28 18:47:35: pid
25036: ERROR:  failed to make persistent db connection
Oct 28 18:47:35 proxy3 pgpool[25036]: [635-2] 2019-10-28 18:47:35: pid
25036: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:47:35 proxy3 pgpool[25036]: [635-3] 2019-10-28 18:47:35: pid
25036: LOCATION:  child.c:1328
Oct 28 18:47:35 proxy3 pgpool[25036]: [636-1] 2019-10-28 18:47:35: pid
25036: LOG:  health check retrying on DB node: 0 (round:1)
Oct 28 18:47:35 proxy3 pgpool[25036]: [636-2] 2019-10-28 18:47:35: pid
25036: LOCATION:  health_check.c:298
Oct 28 18:47:36 proxy3 pgpool[25036]: [637-1] 2019-10-28 18:47:36: pid
25036: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:47:36 proxy3 pgpool[25036]: [637-2] 2019-10-28 18:47:36: pid
25036: LOCATION:  pool_connection_pool.c:680
Oct 28 18:47:36 proxy3 pgpool[25036]: [638-1] 2019-10-28 18:47:36: pid
25036: ERROR:  failed to make persistent db connection
Oct 28 18:47:36 proxy3 pgpool[25036]: [638-2] 2019-10-28 18:47:36: pid
25036: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:47:36 proxy3 pgpool[25036]: [638-3] 2019-10-28 18:47:36: pid
25036: LOCATION:  child.c:1328
Oct 28 18:47:36 proxy3 pgpool[25036]: [639-1] 2019-10-28 18:47:36: pid
25036: LOG:  health check retrying on DB node: 0 (round:2)
Oct 28 18:47:36 proxy3 pgpool[25036]: [639-2] 2019-10-28 18:47:36: pid
25036: LOCATION:  health_check.c:298
Oct 28 18:47:37 proxy3 pgpool[25036]: [640-1] 2019-10-28 18:47:37: pid
25036: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:47:37 proxy3 pgpool[25036]: [640-2] 2019-10-28 18:47:37: pid
25036: LOCATION:  pool_connection_pool.c:680
Oct 28 18:47:37 proxy3 pgpool[25036]: [641-1] 2019-10-28 18:47:37: pid
25036: ERROR:  failed to make persistent db connection
Oct 28 18:47:37 proxy3 pgpool[25036]: [641-2] 2019-10-28 18:47:37: pid
25036: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:47:37 proxy3 pgpool[25036]: [641-3] 2019-10-28 18:47:37: pid
25036: LOCATION:  child.c:1328
Oct 28 18:47:37 proxy3 pgpool[25036]: [642-1] 2019-10-28 18:47:37: pid
25036: LOG:  health check retrying on DB node: 0 (round:3)
Oct 28 18:47:37 proxy3 pgpool[25036]: [642-2] 2019-10-28 18:47:37: pid
25036: LOCATION:  health_check.c:298
Oct 28 18:47:38 proxy3 pgpool[25036]: [643-1] 2019-10-28 18:47:38: pid
25036: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:47:38 proxy3 pgpool[25036]: [643-2] 2019-10-28 18:47:38: pid
25036: LOCATION:  pool_connection_pool.c:680
Oct 28 18:47:38 proxy3 pgpool[25036]: [644-1] 2019-10-28 18:47:38: pid
25036: ERROR:  failed to make persistent db connection
Oct 28 18:47:38 proxy3 pgpool[25036]: [644-2] 2019-10-28 18:47:38: pid
25036: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:47:38 proxy3 pgpool[25036]: [644-3] 2019-10-28 18:47:38: pid
25036: LOCATION:  child.c:1328
Oct 28 18:47:38 proxy3 pgpool[25036]: [645-1] 2019-10-28 18:47:38: pid
25036: LOG:  health check failed on node 0 (timeout:0)
Oct 28 18:47:38 proxy3 pgpool[25036]: [645-2] 2019-10-28 18:47:38: pid
25036: LOCATION:  health_check.c:201
Oct 28 18:47:38 proxy3 pgpool[25036]: [646-1] 2019-10-28 18:47:38: pid
25036: LOG:  received degenerate backend request for node_id: 0 from pid
[25036]
Oct 28 18:47:38 proxy3 pgpool[25036]: [646-2] 2019-10-28 18:47:38: pid
25036: LOCATION:  pgpool_main.c:1125
Oct 28 18:47:38 proxy3 pgpool[22900]: [341-1] 2019-10-28 18:47:38: pid
22900: LOG:  new IPC connection received
Oct 28 18:47:38 proxy3 pgpool[22900]: [341-2] 2019-10-28 18:47:38: pid
22900: LOCATION:  watchdog.c:3147
Oct 28 18:47:38 proxy3 pgpool[22900]: [342-1] 2019-10-28 18:47:38: pid
22900: LOG:  watchdog received the failover command from local pgpool-II on
IPC interface
Oct 28 18:47:38 proxy3 pgpool[22900]: [342-2] 2019-10-28 18:47:38: pid
22900: LOCATION:  watchdog.c:2570
Oct 28 18:47:38 proxy3 pgpool[22900]: [343-1] 2019-10-28 18:47:38: pid
22900: LOG:  watchdog is processing the failover command
[DEGENERATE_BACKEND_REQUEST] received from local pgpool-II on IPC interface
Oct 28 18:47:38 proxy3 pgpool[22900]: [343-2] 2019-10-28 18:47:38: pid
22900: LOCATION:  watchdog.c:2491
Oct 28 18:47:38 proxy3 pgpool[22900]: [344-1] 2019-10-28 18:47:38: pid
22900: LOG:  failover requires the majority vote, waiting for consensus
Oct 28 18:47:38 proxy3 pgpool[22900]: [344-2] 2019-10-28 18:47:38: pid
22900: DETAIL:  failover request noted
Oct 28 18:47:38 proxy3 pgpool[22900]: [344-3] 2019-10-28 18:47:38: pid
22900: LOCATION:  watchdog.c:2352
Oct 28 18:47:38 proxy3 pgpool[22900]: [345-1] 2019-10-28 18:47:38: pid
22900: LOG:  failover command [DEGENERATE_BACKEND_REQUEST] request from
pgpool-II node "pgpool3:5432 Linux proxy3" is queued, waiting for the
confirmation from other nodes
Oct 28 18:47:38 proxy3 pgpool[22900]: [345-2] 2019-10-28 18:47:38: pid
22900: LOCATION:  watchdog.c:2545
Oct 28 18:47:38 proxy3 pgpool[25036]: [647-1] 2019-10-28 18:47:38: pid
25036: LOG:  degenerate backend request for node_id: 0 from pid [25036],
will be handled by watchdog, which is building consensus for request
Oct 28 18:47:38 proxy3 pgpool[25036]: [647-2] 2019-10-28 18:47:38: pid
25036: LOCATION:  pgpool_main.c:1186
Oct 28 18:47:44 proxy3 pgpool[25364]: [350-1] 2019-10-28 18:47:44: pid
25364: ERROR:  Failed to check replication time lag
Oct 28 18:47:44 proxy3 pgpool[25364]: [350-2] 2019-10-28 18:47:44: pid
25364: DETAIL:  No persistent db connection for the node 0
Oct 28 18:47:44 proxy3 pgpool[25364]: [350-3] 2019-10-28 18:47:44: pid
25364: HINT:  check sr_check_user and sr_check_password
Oct 28 18:47:44 proxy3 pgpool[25364]: [350-4] 2019-10-28 18:47:44: pid
25364: CONTEXT:  while checking replication time lag
Oct 28 18:47:44 proxy3 pgpool[25364]: [350-5] 2019-10-28 18:47:44: pid
25364: LOCATION:  pool_worker_child.c:326
Oct 28 18:47:44 proxy3 pgpool[25364]: [351-1] 2019-10-28 18:47:44: pid
25364: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:47:44 proxy3 pgpool[25364]: [351-2] 2019-10-28 18:47:44: pid
25364: LOCATION:  pool_connection_pool.c:680
Oct 28 18:47:44 proxy3 pgpool[25364]: [352-1] 2019-10-28 18:47:44: pid
25364: ERROR:  failed to make persistent db connection
Oct 28 18:47:44 proxy3 pgpool[25364]: [352-2] 2019-10-28 18:47:44: pid
25364: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:47:44 proxy3 pgpool[25364]: [352-3] 2019-10-28 18:47:44: pid
25364: LOCATION:  child.c:1328
[...]
=================================== 8<
?============================================

Logs are almost the same on other PgPool nodes (proxy1 and proxy2).
=================================== 8<
?============================================
Oct 28 18:50:23 proxy2 pgpool[11661]: [762-5] 2019-10-28 18:50:23: pid
11661: LOCATION:  pool_worker_child.c:326
Oct 28 18:50:23 proxy2 pgpool[11661]: [763-1] 2019-10-28 18:50:23: pid
11661: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:50:23 proxy2 pgpool[11661]: [763-2] 2019-10-28 18:50:23: pid
11661: LOCATION:  pool_connection_pool.c:680
Oct 28 18:50:23 proxy2 pgpool[11661]: [764-1] 2019-10-28 18:50:23: pid
11661: ERROR:  failed to make persistent db connection
Oct 28 18:50:23 proxy2 pgpool[11661]: [764-2] 2019-10-28 18:50:23: pid
11661: DETAIL:  connection to host:"postgres1:5432" failed
Oct 28 18:50:23 proxy2 pgpool[11661]: [764-3] 2019-10-28 18:50:23: pid
11661: LOCATION:  child.c:1328
Oct 28 18:50:30 proxy2 pgpool[10358]: [123-1] 2019-10-28 18:50:30: pid
10358: LOG:  remote node "pgpool3:5432 Linux proxy3" is asking to inform
about quarantined backend nodes
Oct 28 18:50:30 proxy2 pgpool[10358]: [123-2] 2019-10-28 18:50:30: pid
10358: LOCATION:  watchdog.c:3745
Oct 28 18:50:30 proxy2 pgpool[10355]: [717-1] 2019-10-28 18:50:30: pid
10355: LOG:  Pgpool-II parent process received inform quarantine nodes
signal from watchdog
Oct 28 18:50:30 proxy2 pgpool[10355]: [717-2] 2019-10-28 18:50:30: pid
10355: LOCATION:  pgpool_main.c:1550
Oct 28 18:50:33 proxy2 pgpool[11661]: [765-1] 2019-10-28 18:50:33: pid
11661: ERROR:  Failed to check replication time lag
Oct 28 18:50:33 proxy2 pgpool[11661]: [765-2] 2019-10-28 18:50:33: pid
11661: DETAIL:  No persistent db connection for the node 0
Oct 28 18:50:33 proxy2 pgpool[11661]: [765-3] 2019-10-28 18:50:33: pid
11661: HINT:  check sr_check_user and sr_check_password
Oct 28 18:50:33 proxy2 pgpool[11661]: [765-4] 2019-10-28 18:50:33: pid
11661: CONTEXT:  while checking replication time lag
Oct 28 18:50:33 proxy2 pgpool[11661]: [765-5] 2019-10-28 18:50:33: pid
11661: LOCATION:  pool_worker_child.c:326
Oct 28 18:50:33 proxy2 pgpool[11661]: [766-1] 2019-10-28 18:50:33: pid
11661: LOG:  failed to connect to PostgreSQL server on "postgres1:5432",
getsockopt() detected error "Connection refused"
Oct 28 18:50:33 proxy2 pgpool[11661]: [766-2] 2019-10-28 18:50:33: pid
11661: LOCATION:  pool_connection_pool.c:680
=================================== 8<
?============================================

Pool nodes status during the failover try:
=================================== 8<
?============================================
postgres at proxy3:~$ psql -h 127.0.0.1 -p 5432 -U repmgr -c "show pool_nodes;"
psql: FATAL:  failed to create a backend 0 connection
DETAIL:  not executing failover because failover_on_backend_error is off
=================================== 8<
?============================================

This is the failover configuration on PgPool nodes:
=================================== 8<
?============================================
#------------------------------------------------------------------------------
# FAILOVER AND FAILBACK
#------------------------------------------------------------------------------

failover_command = '/usr/local/bin/pgpool-failover.sh %d %h %P %m %H %R'
failback_command = 'echo failback %d %h %p %D %m %H %M %P'
failover_on_backend_error = off
detach_false_primary = off
search_primary_node_timeout = 300
=================================== 8<
?============================================

And this is the health check configuration on PgPool nodes:
=================================== 8<
?============================================
#------------------------------------------------------------------------------
# HEALTH CHECK GLOBAL PARAMETERS
#------------------------------------------------------------------------------

health_check_period = 40
health_check_timeout = 10
health_check_user = 'hc'
health_check_password = 'password'
health_check_database = 'postgres'
health_check_max_retries = 3
health_check_retry_delay = 1
connect_timeout = 10000
=================================== 8<
?============================================

Could you help me please?
Thank you very much!
Bye
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20191028/49f1a352/attachment-0001.html>


More information about the pgpool-general mailing list