[pgpool-general: 6451] Re: follow_master_command executed on node shown as down (one of unrecovered masters from previous failover)

Tatsuo Ishii ishii at sraoss.co.jp
Wed Mar 6 21:00:45 JST 2019


> Tatsuo, the link to the github page is not correct, did you mean this one ?
> https://github.com/pgpool/pgpool2/blob/cad8a512d63dab7659c7288356f9cb5ffc331b92/doc/src/sgml/failover.sgml

Oops. I meant this one:
http://tatsuo-ishii.github.io/pgpool-II/current/runtime-config-failover.html#RUNTIME-CONFIG-FAILOVER-SETTINGS

> I will have a look. 
> In the follow_master description it is mentioned that it might be useful to check that the node is running (with pg_ctl), but I believe it might also be useful to check that the node is in recovery, by looking at recovery.conf or by executing the function pg_is_in_recovery(); the reason is to avoid an issue with a degenerated master. 

I am not sure I understand what you mean. What is "issue with a
degenerated master"?

> Maybe there is another issue? Suppose I have 3 nodes: A, B and C
> and suppose A is in status down (detached from pgpool) and the other
> two are up. B is primary and C is standby. When B is stopped, then
> follow_master is executed on A and on C.

Yes.

> But I think it should not be executed on node A, because node A was
> detached from pgpool; maybe this node was a degenerated master or
> maybe it was detached on purpose.

It's a limitation of the design of the follow master command: the
follow master command is a kind of "automation" tool and is not very
smart. It does not understand whether a node was detached on purpose
or not. It simply tries to make all nodes "follow" the new primary.

For example, if a user wants to detach a standby node to take a backup
from it, then the follow master command should not be used.
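
For users who do want follow_master_command to skip such nodes, a guard
at the top of the script (along the lines of what is discussed later in
this thread) can help. Below is a minimal bash sketch, assuming the
script receives %h as its second argument, passwordless ssh as the
postgres user, and default PostgreSQL 10 paths; it is illustrative, not
a drop-in implementation:

#!/bin/bash
# Guard sketch for the top of follow_master.sh (illustrative only).
DETACHED_NODE_HOST="$2"   # %h as passed by Pgpool-II

# Skip the node entirely if PostgreSQL is not running on it.
if ! ssh -o StrictHostKeyChecking=no postgres@${DETACHED_NODE_HOST} \
     "/usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data status" >/dev/null 2>&1; then
    echo "node ${DETACHED_NODE_HOST} is not running, skipping follow_master"
    exit 0
fi

# Skip the node if it is not in recovery (i.e. possibly a degenerated
# former primary that was detached on purpose or never recovered).
in_recovery=$(ssh -o StrictHostKeyChecking=no postgres@${DETACHED_NODE_HOST} \
     "psql -At -c 'select pg_is_in_recovery();'")
if [ "${in_recovery}" != "t" ]; then
    echo "node ${DETACHED_NODE_HOST} is not in recovery, skipping follow_master"
    exit 0
fi

# ... re-point the standby to the new primary from here on ...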

> Also in the doc of follow_master it says:
> "Typically follow_master_command command is used to recover the
> slave from the new primary by calling the pcp_recovery_node
> command", but this is not true. I believe the script will simply
> re-points the standby to the new primary, this does not require a
> full backup. In my case I use a script from repmgr, so the command
> is something like /usr/pgsql-10/bin/repmgr --log-to-file -f
> /etc/repmgr/10/repmgr.conf -h ${NEW_MASTER_HOST} -D ${PGDATA} -U
> repmgr -d repmgr standby follow -v

I don't know anything about repmgr so I cannot comment on this.

The reason why pcp_recovery_node is recommended is that it is the
safest method. Depending on the replication settings, it is possible
that a standby cannot replicate from the new primary because the new
primary has already removed the WAL segments which are necessary for
the standby to be recovered.
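
A minimal sketch of a follow_master script body that takes this route,
assuming a pcp server listening on localhost:9898, a pcp user named
"pgpool" with a matching .pcppass entry, and %d passed as the first
argument (all of these are illustrative assumptions):

#!/bin/bash
# follow_master.sh sketch using online recovery (illustrative only).
DETACHED_NODE_ID="$1"   # %d as passed by Pgpool-II

# Re-sync the detached standby from the new primary via online recovery.
# pcp_recovery_node runs the configured recovery_1st_stage_command on the
# new primary (typically a pg_basebackup based script), so it works even
# if the WAL segments needed for streaming have already been removed.
if pcp_recovery_node -h localhost -p 9898 -U pgpool -w -n ${DETACHED_NODE_ID}; then
    echo "node ${DETACHED_NODE_ID} recovered from the new primary"
else
    echo "pcp_recovery_node failed for node ${DETACHED_NODE_ID}" >&2
    exit 1
fi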

> I would like to propose an improvement to the FAILOVER_ON_BACKEND_ERROR documentation; is that something I can do via a pull request?

Can you please provide a git diff as an email attachment to the mailing
list? Our official git repository is not on GitHub.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

> Regards, 
> Pierre 
> 
>     On Tuesday, March 5, 2019, 10:47:49 PM GMT+1, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:  
>  
> I have updated the follow master command description in the Pgpool-II
> document to clarify what it actually does (it will appear in the next
> Pgpool-II 4.0.4 release).
> 
> In the meantime I have uploaded the HTML-compiled version to my GitHub
> page. Please take a look and give comments if you like.
> 
> http://localhost/~t-ishii/pgpool-II/html/runtime-config-failover.html
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> From: Pierre Timmermans <ptim007 at yahoo.com>
> Subject: Re: [pgpool-general: 6435] Re: follow_master_command executed on node shown as down (one of unrecovered masters from previous failover)
> Date: Fri, 1 Mar 2019 21:54:17 +0000 (UTC)
> Message-ID: <654470371.7835532.1551477257916 at mail.yahoo.com>
> 
>> It is probably a good idea to force the old primary to shut down, but it is not always possible: if, for example, the primary node gets shut down, then the failover script will not be able to ssh into it and kill the old primary. If the old server comes back online then there is a degenerated master. I have a cron job that checks for a degenerated master (and for a detached standby) and reinstates it if possible, but I am sure there is always a risk of edge cases...
>> Pierre 
>> 
>>    On Friday, March 1, 2019, 7:35:12 PM GMT+1, Andre Piwoni <apiwoni at webmd.net> wrote:  
>>  
>> I just realized that I already handled the case of a restart that triggered failover in another way. Mainly, before promoting the new node to master in the failover script I am forcing the old primary to be shut down. So even if I do a restart of the primary and failover occurs, it will shut down the restarted old primary. Anyway, it doesn't hurt to have that check in the follow_master script in case rebooting the machine restarts the old primary, etc.
>> On Fri, Mar 1, 2019 at 9:58 AM Andre Piwoni <apiwoni at webmd.net> wrote:
>> 
>> I agree. This shouldn't be so complicated.
>> Since I'm using sed to repoint the slave in the follow_master script by updating recovery.conf, if the command fails I'm not re-starting and re-attaching the node. Kill two birds with one stone :-)
>> Here's what I'm testing now:
>> 
>> ssh -o StrictHostKeyChecking=no -i /var/lib/pgsql/.ssh/id_rsa postgres@${detached_node_host} -T "sed -i 's/host=.*sslmode=/host=${new_master_node_host} port=5432 sslmode=/g' /var/lib/pgsql/10/data/recovery.conf" >> $LOGFILE
>> repoint_status=$?
>> if [ ${repoint_status} -eq 0 ]; then
>>     # restart
>>     # reattach
>> else
>>     # WARNING: this could be a restarted master so there's no recovery.conf
>>     # CONSIDERATION: Should I shut it down since I don't want to have two masters running even though Pgpool load balances one???
>> fi
>> On Fri, Mar 1, 2019 at 9:44 AM Pierre Timmermans <ptim007 at yahoo.com> wrote:
>> 
>> Thank you, it makes sense indeed, and I also like to have a relatively long "grace" delay via the health check interval so that if the primary restarts quickly enough there is no failover.
>> For the case where there is a degenerated master, I have added this code in the follow_master script; it seems to work fine in my tests:
>> 
>> ssh_options="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
>> 
>> in_reco=$( $ssh_options postgres@${HOSTNAME} 'psql -t -c "select pg_is_in_recovery();"' | head -1 | awk '{print $1}' )
>> if [ "a${in_reco}" != "at" ] ; then
>>   echo "Node $HOSTNAME is not in recovery, probably a degenerated master, skip it" | tee -a $LOGFILE
>>   exit 0
>> fi
>> 
>> In the end I believe that pgpool's algorithm for choosing a primary node (always the node with the lowest id) is the root cause of the problem: pgpool should select the most suitable node (the node that is in recovery and has the lowest replication gap). Unfortunately I cannot code in "C", otherwise I would contribute.
>> Pierre 
>> 
>>    On Friday, March 1, 2019, 5:07:06 PM GMT+1, Andre Piwoni <apiwoni at webmd.net> wrote:  
>>  
>>  FYI,
>> One of the things that I have done to minimize the impact of restarting the primary is using the health check, where max_retries x retry_delay_interval allows enough time for the primary to be restarted without triggering a failover, which may take more time than the restart itself. This is with fail_over_on_backend_error disabled.
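
A minimal pgpool.conf sketch of such a grace window, with illustrative
values (not taken from the poster's setup):

# Failover is only triggered after roughly
# health_check_max_retries * health_check_retry_delay seconds of
# consecutive health check failures, so a quick restart of the
# primary does not degenerate the node.
health_check_period        = 10
health_check_timeout       = 20
health_check_max_retries   = 3
health_check_retry_delay   = 10

# Rely on the health check rather than on backend connection errors.
fail_over_on_backend_error = off
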
>> Andre
>> On Fri, Mar 1, 2019 at 7:58 AM Andre Piwoni <apiwoni at webmd.net> wrote:
>> 
>> Hi Pierre,
>> Hmmm? I have not covered the case you described, which is a restart of the primary on node 0, resulting in failover and a subsequent restart of the new primary on node 1, which results in calling follow_master on node 0. In my case I was shutting down node 0, which resulted in follow_master being called on it after the second failover, since I was not checking if node 0 was running. In your case, node 0 is running since it has been restarted.
>> Here's part of my script that I have to improve given your case:
>> ssh -o StrictHostKeyChecking=no -i /var/lib/pgsql/.ssh/id_rsa postgres@${detached_node_host} -T "/usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data status" | grep "is running"
>> running_status=$?
>> if [ ${running_status} -eq 0 ]; then
>>     # TODO: Check if recovery.conf exists or pg_is_in_recovery() on ${detached_node_host} and exit if this is not a slave node
>>     # repoint to new master ${new_master_node_host}
>>     # restart ${detached_node_host}
>>     # reattach restarted node with pcp_attach_node
>> else
>>     # do nothing since this could be an old slave or primary that needs to be recovered, or a node in maintenance mode etc.
>> fi
>> 
>> 
>> On Fri, Mar 1, 2019 at 3:28 AM Pierre Timmermans <ptim007 at yahoo.com> wrote:
>> 
>> Hi
>> Same issue for me, but I am not sure how to fix it. Andre, can you tell me exactly how you check?
>> I cannot add a test using pcp_node_info to check that the status is up, because then follow_master never does anything. Indeed, in my case, when follow_master is executed the status of the target node is always down, so my script does the standby follow command and then a pcp_attach_node.
>> To solve the issue for now I added a check that the command select pg_is_in_recovery(); returns "t" on the node; if it returns "f" then I can assume it is a degenerated master and I don't execute the follow_master command.
>> 
>> 
>> So my use case is this
>> 
>> 1. node 0 is primary, node 1 and node 2 are standby
>> 2. node 0 is restarted, node 1 becomes primary and node 2 follows the new primary (thanks to follow_master). In the follow_master of node 2 I have to do pcp_attach_node afterwards because the status of the node is down
>> 3. in the meantime node 0 has rebooted, the db is started on node 0 but it is down in pgpool and its role is standby (it is a degenerated master)
>> 4. node 1 is restarted, pgpool executes failover on node 2 and follow_master on node 0 => the follow_master on node 0 breaks everything because after that node 0 becomes a primary again
>> Thanks and regards
>> Pierre 
>> 
>>    On Monday, February 25, 2019, 5:35:11 PM GMT+1, Andre Piwoni <apiwoni at webmd.net> wrote:  
>>  
>>  I have already put that check in place.
>> Thank you for confirming.
>> On Sat, Feb 23, 2019 at 11:56 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> 
>> Sorry, I was wrong. A follow_master_command will be executed against
>> the down node as well. So you need to check whether the target PostgreSQL
>> node is running in the follow_master_command. If it's not, you can skip
>> the node.
>> 
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>> 
>>> I have added a pg_ctl status check to ensure no action is taken when the node is
>>> down, but I'll check the 3.7.8 version.
>>> 
>>> Here's the Pgpool log from the time node2 is shut down to the time node1 (the
>>> already dead old primary) received the follow master command.
>>> Sorry for the double date logging. I'm also including the self-explanatory
>>> failover.log that my failover and follow_master scripts generated.
>>> 
>>> Arguments passed to scripts for your reference.
>>> failover.sh %d %h %p %D %M %P %m %H %r %R
>>> follow_master.sh %d %h %p %D %M %P %m %H %r %R
>>> 
>>> Pool status before shutdown of node 2:
>>> postgres=> show pool_nodes;
>>>  node_id |          hostname          | port | status | lb_weight |  role
>>>  | select_cnt | load_balance_node | replication_delay
>>> ---------+----------------------------+------+--------+-----------+---------+------------+-------------------+-------------------
>>>  0       | pg-hdp-node1.kitchen.local | 5432 | down   | 0.333333  | standby
>>> | 0          | false             | 0
>>>  1       | pg-hdp-node2.kitchen.local | 5432 | up     | 0.333333  | primary
>>> | 0          | false             | 0
>>>  2       | pg-hdp-node3.kitchen.local | 5432 | up     | 0.333333  | standby
>>> | 0          | true              | 0
>>> (3 rows)
>>> 
>>> Pgpool log
>>> Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [126-1] 2019-02-22 10:43:27:
>>> pid 12437: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [127-1] 2019-02-22 10:43:27:
>>> pid 12437: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [127-2] 2019-02-22 10:43:27:
>>> pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
>>> failed
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-1] 2019-02-22 10:43:37:
>>> pid 12437: ERROR:  Failed to check replication time lag
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-2] 2019-02-22 10:43:37:
>>> pid 12437: DETAIL:  No persistent db connection for the node 1
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-3] 2019-02-22 10:43:37:
>>> pid 12437: HINT:  check sr_check_user and sr_check_password
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-4] 2019-02-22 10:43:37:
>>> pid 12437: CONTEXT:  while checking replication time lag
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [129-1] 2019-02-22 10:43:37:
>>> pid 12437: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [130-1] 2019-02-22 10:43:37:
>>> pid 12437: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [130-2] 2019-02-22 10:43:37:
>>> pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
>>> failed
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [6-1] 2019-02-22 10:43:45: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [7-1] 2019-02-22 10:43:45: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [7-2] 2019-02-22 10:43:45: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [8-1] 2019-02-22 10:43:45: pid
>>> 7786: LOG:  health check retrying on DB node: 1 (round:1)
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-1] 2019-02-22 10:43:47:
>>> pid 12437: ERROR:  Failed to check replication time lag
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-2] 2019-02-22 10:43:47:
>>> pid 12437: DETAIL:  No persistent db connection for the node 1
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-3] 2019-02-22 10:43:47:
>>> pid 12437: HINT:  check sr_check_user and sr_check_password
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-4] 2019-02-22 10:43:47:
>>> pid 12437: CONTEXT:  while checking replication time lag
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [132-1] 2019-02-22 10:43:47:
>>> pid 12437: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [133-1] 2019-02-22 10:43:47:
>>> pid 12437: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [133-2] 2019-02-22 10:43:47:
>>> pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
>>> failed
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [9-1] 2019-02-22 10:43:48: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [10-1] 2019-02-22 10:43:48: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [10-2] 2019-02-22 10:43:48: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [11-1] 2019-02-22 10:43:48: pid
>>> 7786: LOG:  health check retrying on DB node: 1 (round:2)
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [12-1] 2019-02-22 10:43:51: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [13-1] 2019-02-22 10:43:51: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [13-2] 2019-02-22 10:43:51: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [14-1] 2019-02-22 10:43:51: pid
>>> 7786: LOG:  health check retrying on DB node: 1 (round:3)
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [15-1] 2019-02-22 10:43:54: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [16-1] 2019-02-22 10:43:54: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [16-2] 2019-02-22 10:43:54: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [17-1] 2019-02-22 10:43:54: pid
>>> 7786: LOG:  health check failed on node 1 (timeout:0)
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [18-1] 2019-02-22 10:43:54: pid
>>> 7786: LOG:  received degenerate backend request for node_id: 1 from pid
>>> [7786]
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [253-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  Pgpool-II parent process has received failover request
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [254-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  starting degeneration. shutdown host
>>> pg-hdp-node2.kitchen.local(5432)
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [255-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  Restart all children
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [256-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  execute command: /etc/pgpool-II/failover.sh 1
>>> pg-hdp-node2.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
>>> pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [257-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [258-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: checking backend no 0
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [259-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: checking backend no 1
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [260-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: checking backend no 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [261-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: primary node id is 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [262-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  starting follow degeneration. shutdown host
>>> pg-hdp-node1.kitchen.local(5432)
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [263-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  starting follow degeneration. shutdown host
>>> pg-hdp-node2.kitchen.local(5432)
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [264-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover: 2 follow backends have been degenerated
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [265-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover: set new primary node: 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [266-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover: set new master node: 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [267-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover done. shutdown host pg-hdp-node2.kitchen.local(5432)
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-1] 2019-02-22 10:43:55:
>>> pid 12437: ERROR:  Failed to check replication time lag
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-2] 2019-02-22 10:43:55:
>>> pid 12437: DETAIL:  No persistent db connection for the node 1
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-3] 2019-02-22 10:43:55:
>>> pid 12437: HINT:  check sr_check_user and sr_check_password
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-4] 2019-02-22 10:43:55:
>>> pid 12437: CONTEXT:  while checking replication time lag
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [135-1] 2019-02-22 10:43:55:
>>> pid 12437: LOG:  worker process received restart request
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12774]: [267-1] 2019-02-22 10:43:55:
>>> pid 12774: LOG:  failback event detected
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12774]: [267-2] 2019-02-22 10:43:55:
>>> pid 12774: DETAIL:  restarting myself
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [265-1] 2019-02-22 10:43:55:
>>> pid 12742: LOG:  start triggering follow command.
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [266-1] 2019-02-22 10:43:55:
>>> pid 12742: LOG:  execute command: /etc/pgpool-II/follow_master.sh 0
>>> pg-hdp-node1.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
>>> pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [267-1] 2019-02-22 10:43:55:
>>> pid 12742: LOG:  execute command: /etc/pgpool-II/follow_master.sh 1
>>> pg-hdp-node2.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
>>> pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
>>> Feb 22 10:43:56 pg-hdp-node3 pgpool[12436]: [60-1] 2019-02-22 10:43:56: pid
>>> 12436: LOG:  restart request received in pcp child process
>>> Feb 22 10:43:56 pg-hdp-node3 pgpool[7746]: [268-1] 2019-02-22 10:43:56: pid
>>> 7746: LOG:  PCP child 12436 exits with status 0 in failover()
>>> 
>>> Pgpool self-explanatory failover.log
>>> 
>>> 2019-02-22 10:43:54.893 PST Executing failover script ...
>>> 2019-02-22 10:43:54.895 PST Script arguments:
>>> failed_node_id           1
>>> failed_node_host         pg-hdp-node2.kitchen.local
>>> failed_node_port         5432
>>> failed_node_pgdata       /var/lib/pgsql/10/data
>>> old_primary_node_id      1
>>> old_master_node_id       1
>>> new_master_node_id       2
>>> new_master_node_host     pg-hdp-node3.kitchen.local
>>> new_master_node_port     5432
>>> new_master_node_pgdata   /var/lib/pgsql/10/data
>>> 2019-02-22 10:43:54.897 PST Primary node running on
>>> pg-hdp-node2.kitchen.local host is unresponsive or have died
>>> 2019-02-22 10:43:54.898 PST Attempting to stop primary node running on
>>> pg-hdp-node2.kitchen.local host before promoting slave as the new primary
>>> 2019-02-22 10:43:54.899 PST ssh -o StrictHostKeyChecking=no -i
>>> /var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node2.kitchen.local -T
>>> /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data stop -m fast
>>> 2019-02-22 10:43:55.151 PST Promoting pg-hdp-node3.kitchen.local host as
>>> the new primary
>>> 2019-02-22 10:43:55.153 PST ssh -o StrictHostKeyChecking=no -i
>>> /var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node3.kitchen.local -T
>>> /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data promote
>>> waiting for server to promote.... done
>>> server promoted
>>> 2019-02-22 10:43:55.532 PST Completed executing failover
>>> 
>>> 2019-02-22 10:43:55.564 PST Executing follow master script ...
>>> 2019-02-22 10:43:55.566 PST Script arguments
>>> detached_node_id         0
>>> detached_node_host       pg-hdp-node1.kitchen.local
>>> detached_node_port       5432
>>> detached_node_pgdata     /var/lib/pgsql/10/data
>>> old_primary_node_id      1
>>> old_master_node_id       1
>>> new_master_node_id       2
>>> new_master_node_host     pg-hdp-node3.kitchen.local
>>> new_master_node_port     5432
>>> new_master_node_pgdata   /var/lib/pgsql/10/data
>>> 2019-02-22 10:43:55.567 PST Checking if server is running on
>>> pg-hdp-node1.kitchen.local host
>>> 2019-02-22 10:43:55.569 PST ssh -o StrictHostKeyChecking=no -i
>>> /var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node1.kitchen.local -T
>>> /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data status
>>> 
>>> 
>>> pg_ctl: no server running
>>> 2019-02-22 10:43:55.823 PST Node on pg-hdp-node1.kitchen.local host is not
>>> running. It could be old slave or primary that needs to be recovered.
>>> 2019-02-22 10:43:55.824 PST Completed executing follow master script
>>> 
>>> 2019-02-22 10:43:55.829 PST Executing follow master script ...
>>> 2019-02-22 10:43:55.830 PST Script arguments
>>> detached_node_id         1
>>> detached_node_host       pg-hdp-node2.kitchen.local
>>> detached_node_port       5432
>>> detached_node_pgdata     /var/lib/pgsql/10/data
>>> old_primary_node_id      1
>>> old_master_node_id       1
>>> new_master_node_id       2
>>> new_master_node_host     pg-hdp-node3.kitchen.local
>>> new_master_node_port     5432
>>> new_master_node_pgdata   /var/lib/pgsql/10/data
>>> 2019-02-22 10:43:55.831 PST Detached node on pg-hdp-node2.kitchen.local
>>> host is the the old primary node
>>> 2019-02-22 10:43:55.833 PST Slave can be created from old primary node by
>>> deleting PG_DATA directory under /var/lib/pgsql/10/data on
>>> pg-hdp-node2.kitchen.local host and re-running Chef client
>>> 2019-02-22 10:43:55.834 PST Slave can be recovered from old primary node by
>>> running /usr/pgsql-10/bin/pg_rewind -D /var/lib/pgsql/10/data
>>> --source-server="port=5432 host=pg-hdp-node3.kitchen.local" command on
>>> pg-hdp-node2.kitchen.local host as postgres user
>>> 2019-02-22 10:43:55.835 PST After successful pg_rewind run cp
>>> /var/lib/pgsql/10/data/recovery.done /var/lib/pgsql/10/data/recovery.conf,
>>> ensure host connection string points to pg-hdp-node3.kitchen.local, start
>>> PostgreSQL and attach it to pgpool
>>> 2019-02-22 10:43:55.836 PST Completed executing follow master script
>>> 
>>> On Thu, Feb 21, 2019 at 4:47 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>> 
>>>> > Is this correct behavior?
>>>> >
>>>> > In a 3-node setup, node1 (primary) is shut down, failover is executed and node2
>>>> > becomes the new primary and node3 follows the new primary on node2.
>>>> > Now, node2 (new primary) is shut down, failover is executed and node3 becomes
>>>> > the new primary, but follow_master_command is executed on node1 even though it
>>>> > is reported as down.
>>>>
>>>> No. follow master command should not be executed on an already-down
>>>> node (in this case node1).
>>>>
>>>> > It happens that my script repoints node1 and restarts it, which breaks
>>>> > everything because node1 was never recovered after being shut down.
>>>> >
>>>> > I'm on PgPool 3.7.4.
>>>>
>>>> Can you share the log from when node2 was shut down to when node1 was
>>>> recovered by your follow master command?
>>>>
>>>> In the meantime, 3.7.4 is not the latest one. Can you try with the
>>>> latest one (3.7.8)?
>>>>
>>>> Best regards,
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese:http://www.sraoss.co.jp
>>>>
>>> 
>>> 
>>> -- 
>>> 
>>> *Andre Piwoni*
>> 
>> 

