[pgpool-general: 6432] Re: follow_master_command executed on node shown as down (one of unrecovered masters from previous failover)

Andre Piwoni apiwoni at webmd.net
Sat Feb 23 03:58:28 JST 2019


I have added pg_ctl status check to ensure no action is taken when node is
down but I'll check 3.7.8 version.

Here's the Pgpool log from the time node2 is shutdown to time node1(already
dead old primary) received follow master command.
Sorry for double date logging. I'm also including self-explanatory
failover.log that I my failover and follow_master scripts generated.

Arguments passed to scripts for your reference.
failover.sh %d %h %p %D %M %P %m %H %r %R
follow_master.sh %d %h %p %D %M %P %m %H %r %R

Pool status before shutdown of node 2:
postgres=> show pool_nodes;
 node_id |          hostname          | port | status | lb_weight |  role
 | select_cnt | load_balance_node | replication_delay
---------+----------------------------+------+--------+-----------+---------+------------+-------------------+-------------------
 0       | pg-hdp-node1.kitchen.local | 5432 | down   | 0.333333  | standby
| 0          | false             | 0
 1       | pg-hdp-node2.kitchen.local | 5432 | up     | 0.333333  | primary
| 0          | false             | 0
 2       | pg-hdp-node3.kitchen.local | 5432 | up     | 0.333333  | standby
| 0          | true              | 0
(3 rows)

Pgpool log
Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [126-1] 2019-02-22 10:43:27:
pid 12437: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [127-1] 2019-02-22 10:43:27:
pid 12437: ERROR:  failed to make persistent db connection
Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [127-2] 2019-02-22 10:43:27:
pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
failed
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-1] 2019-02-22 10:43:37:
pid 12437: ERROR:  Failed to check replication time lag
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-2] 2019-02-22 10:43:37:
pid 12437: DETAIL:  No persistent db connection for the node 1
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-3] 2019-02-22 10:43:37:
pid 12437: HINT:  check sr_check_user and sr_check_password
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-4] 2019-02-22 10:43:37:
pid 12437: CONTEXT:  while checking replication time lag
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [129-1] 2019-02-22 10:43:37:
pid 12437: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [130-1] 2019-02-22 10:43:37:
pid 12437: ERROR:  failed to make persistent db connection
Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [130-2] 2019-02-22 10:43:37:
pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
failed
Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [6-1] 2019-02-22 10:43:45: pid
7786: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [7-1] 2019-02-22 10:43:45: pid
7786: ERROR:  failed to make persistent db connection
Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [7-2] 2019-02-22 10:43:45: pid
7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [8-1] 2019-02-22 10:43:45: pid
7786: LOG:  health check retrying on DB node: 1 (round:1)
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-1] 2019-02-22 10:43:47:
pid 12437: ERROR:  Failed to check replication time lag
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-2] 2019-02-22 10:43:47:
pid 12437: DETAIL:  No persistent db connection for the node 1
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-3] 2019-02-22 10:43:47:
pid 12437: HINT:  check sr_check_user and sr_check_password
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-4] 2019-02-22 10:43:47:
pid 12437: CONTEXT:  while checking replication time lag
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [132-1] 2019-02-22 10:43:47:
pid 12437: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [133-1] 2019-02-22 10:43:47:
pid 12437: ERROR:  failed to make persistent db connection
Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [133-2] 2019-02-22 10:43:47:
pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
failed
Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [9-1] 2019-02-22 10:43:48: pid
7786: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [10-1] 2019-02-22 10:43:48: pid
7786: ERROR:  failed to make persistent db connection
Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [10-2] 2019-02-22 10:43:48: pid
7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [11-1] 2019-02-22 10:43:48: pid
7786: LOG:  health check retrying on DB node: 1 (round:2)
Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [12-1] 2019-02-22 10:43:51: pid
7786: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [13-1] 2019-02-22 10:43:51: pid
7786: ERROR:  failed to make persistent db connection
Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [13-2] 2019-02-22 10:43:51: pid
7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [14-1] 2019-02-22 10:43:51: pid
7786: LOG:  health check retrying on DB node: 1 (round:3)
Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [15-1] 2019-02-22 10:43:54: pid
7786: LOG:  failed to connect to PostgreSQL server on
"pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
refused"
Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [16-1] 2019-02-22 10:43:54: pid
7786: ERROR:  failed to make persistent db connection
Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [16-2] 2019-02-22 10:43:54: pid
7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [17-1] 2019-02-22 10:43:54: pid
7786: LOG:  health check failed on node 1 (timeout:0)
Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [18-1] 2019-02-22 10:43:54: pid
7786: LOG:  received degenerate backend request for node_id: 1 from pid
[7786]
Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [253-1] 2019-02-22 10:43:54: pid
7746: LOG:  Pgpool-II parent process has received failover request
Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [254-1] 2019-02-22 10:43:54: pid
7746: LOG:  starting degeneration. shutdown host
pg-hdp-node2.kitchen.local(5432)
Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [255-1] 2019-02-22 10:43:54: pid
7746: LOG:  Restart all children
Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [256-1] 2019-02-22 10:43:54: pid
7746: LOG:  execute command: /etc/pgpool-II/failover.sh 1
pg-hdp-node2.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [257-1] 2019-02-22 10:43:55: pid
7746: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [258-1] 2019-02-22 10:43:55: pid
7746: LOG:  find_primary_node: checking backend no 0
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [259-1] 2019-02-22 10:43:55: pid
7746: LOG:  find_primary_node: checking backend no 1
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [260-1] 2019-02-22 10:43:55: pid
7746: LOG:  find_primary_node: checking backend no 2
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [261-1] 2019-02-22 10:43:55: pid
7746: LOG:  find_primary_node: primary node id is 2
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [262-1] 2019-02-22 10:43:55: pid
7746: LOG:  starting follow degeneration. shutdown host
pg-hdp-node1.kitchen.local(5432)
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [263-1] 2019-02-22 10:43:55: pid
7746: LOG:  starting follow degeneration. shutdown host
pg-hdp-node2.kitchen.local(5432)
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [264-1] 2019-02-22 10:43:55: pid
7746: LOG:  failover: 2 follow backends have been degenerated
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [265-1] 2019-02-22 10:43:55: pid
7746: LOG:  failover: set new primary node: 2
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [266-1] 2019-02-22 10:43:55: pid
7746: LOG:  failover: set new master node: 2
Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [267-1] 2019-02-22 10:43:55: pid
7746: LOG:  failover done. shutdown host pg-hdp-node2.kitchen.local(5432)
Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-1] 2019-02-22 10:43:55:
pid 12437: ERROR:  Failed to check replication time lag
Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-2] 2019-02-22 10:43:55:
pid 12437: DETAIL:  No persistent db connection for the node 1
Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-3] 2019-02-22 10:43:55:
pid 12437: HINT:  check sr_check_user and sr_check_password
Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-4] 2019-02-22 10:43:55:
pid 12437: CONTEXT:  while checking replication time lag
Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [135-1] 2019-02-22 10:43:55:
pid 12437: LOG:  worker process received restart request
Feb 22 10:43:55 pg-hdp-node3 pgpool[12774]: [267-1] 2019-02-22 10:43:55:
pid 12774: LOG:  failback event detected
Feb 22 10:43:55 pg-hdp-node3 pgpool[12774]: [267-2] 2019-02-22 10:43:55:
pid 12774: DETAIL:  restarting myself
Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [265-1] 2019-02-22 10:43:55:
pid 12742: LOG:  start triggering follow command.
Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [266-1] 2019-02-22 10:43:55:
pid 12742: LOG:  execute command: /etc/pgpool-II/follow_master.sh 0
pg-hdp-node1.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [267-1] 2019-02-22 10:43:55:
pid 12742: LOG:  execute command: /etc/pgpool-II/follow_master.sh 1
pg-hdp-node2.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
Feb 22 10:43:56 pg-hdp-node3 pgpool[12436]: [60-1] 2019-02-22 10:43:56: pid
12436: LOG:  restart request received in pcp child process
Feb 22 10:43:56 pg-hdp-node3 pgpool[7746]: [268-1] 2019-02-22 10:43:56: pid
7746: LOG:  PCP child 12436 exits with status 0 in failover()

Pgpool self-explanatory failover.log

2019-02-22 10:43:54.893 PST Executing failover script ...
2019-02-22 10:43:54.895 PST Script arguments:
failed_node_id           1
failed_node_host         pg-hdp-node2.kitchen.local
failed_node_port         5432
failed_node_pgdata       /var/lib/pgsql/10/data
old_primary_node_id      1
old_master_node_id       1
new_master_node_id       2
new_master_node_host     pg-hdp-node3.kitchen.local
new_master_node_port     5432
new_master_node_pgdata   /var/lib/pgsql/10/data
2019-02-22 10:43:54.897 PST Primary node running on
pg-hdp-node2.kitchen.local host is unresponsive or have died
2019-02-22 10:43:54.898 PST Attempting to stop primary node running on
pg-hdp-node2.kitchen.local host before promoting slave as the new primary
2019-02-22 10:43:54.899 PST ssh -o StrictHostKeyChecking=no -i
/var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node2.kitchen.local -T
/usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data stop -m fast
2019-02-22 10:43:55.151 PST Promoting pg-hdp-node3.kitchen.local host as
the new primary
2019-02-22 10:43:55.153 PST ssh -o StrictHostKeyChecking=no -i
/var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node3.kitchen.local -T
/usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data promote
waiting for server to promote.... done
server promoted
2019-02-22 10:43:55.532 PST Completed executing failover

2019-02-22 10:43:55.564 PST Executing follow master script ...
2019-02-22 10:43:55.566 PST Script arguments
detached_node_id         0
detached_node_host       pg-hdp-node1.kitchen.local
detached_node_port       5432
detached_node_pgdata     /var/lib/pgsql/10/data
old_primary_node_id      1
old_master_node_id       1
new_master_node_id       2
new_master_node_host     pg-hdp-node3.kitchen.local
new_master_node_port     5432
new_master_node_pgdata   /var/lib/pgsql/10/data
2019-02-22 10:43:55.567 PST Checking if server is running on
pg-hdp-node1.kitchen.local host
2019-02-22 10:43:55.569 PST ssh -o StrictHostKeyChecking=no -i
/var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node1.kitchen.local -T
/usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data status


pg_ctl: no server running
2019-02-22 10:43:55.823 PST Node on pg-hdp-node1.kitchen.local host is not
running. It could be old slave or primary that needs to be recovered.
2019-02-22 10:43:55.824 PST Completed executing follow master script

2019-02-22 10:43:55.829 PST Executing follow master script ...
2019-02-22 10:43:55.830 PST Script arguments
detached_node_id         1
detached_node_host       pg-hdp-node2.kitchen.local
detached_node_port       5432
detached_node_pgdata     /var/lib/pgsql/10/data
old_primary_node_id      1
old_master_node_id       1
new_master_node_id       2
new_master_node_host     pg-hdp-node3.kitchen.local
new_master_node_port     5432
new_master_node_pgdata   /var/lib/pgsql/10/data
2019-02-22 10:43:55.831 PST Detached node on pg-hdp-node2.kitchen.local
host is the the old primary node
2019-02-22 10:43:55.833 PST Slave can be created from old primary node by
deleting PG_DATA directory under /var/lib/pgsql/10/data on
pg-hdp-node2.kitchen.local host and re-running Chef client
2019-02-22 10:43:55.834 PST Slave can be recovered from old primary node by
running /usr/pgsql-10/bin/pg_rewind -D /var/lib/pgsql/10/data
--source-server="port=5432 host=pg-hdp-node3.kitchen.local" command on
pg-hdp-node2.kitchen.local host as postgres user
2019-02-22 10:43:55.835 PST After successful pg_rewind run cp
/var/lib/pgsql/10/data/recovery.done /var/lib/pgsql/10/data/recovery.conf,
ensure host connection string points to pg-hdp-node3.kitchen.local, start
PostgreSQL and attach it to pgpool
2019-02-22 10:43:55.836 PST Completed executing follow master script

On Thu, Feb 21, 2019 at 4:47 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:

> > Is this correct behavior?
> >
> > In 3-node setup, node1(primary) is shutdown, failover is executed and
> node2
> > becomes new primary and node3 follows new primary on node2.
> > Now, node2(new primary) is shutdown, failover is executed and node3
> becomes
> > new primary but fallow_master_command is executed on node1 even though it
> > is reported as down.
>
> No. follow master command should not be executed on an already-down
> node (in this case node1).
>
> > It happens that my script repoints node1 and restarts it which breaks
> hell
> > because node1 was never recovered after being shutdown.
> >
> > I'm on PgPool 3.7.4.
>
> Can you share the log from when node2 was shutdown to when node1 was
> recovered by your follow master command?
>
> In the mean time 3.7.4 is not the latest one. Can you try with the
> latest one? (3.7.8).
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>


-- 

*Andre Piwoni*

Sr. Software Developer, BI/Database

*Web*MD Health Services

Mobile: 801.541.4722

www.webmdhealthservices.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20190222/c06504f8/attachment-0001.html>


More information about the pgpool-general mailing list