[pgpool-general: 6452] Re: follow_master_command executed on node shown as down (one of unrecovered masters from previous failover)

Pierre Timmermans ptim007 at yahoo.com
Fri Mar 8 20:38:57 JST 2019


Hi
I tried to reply in-line; it is difficult with Yahoo! Mail...


Pierre 

    On Wednesday, March 6, 2019, 1:00:49 PM GMT+1, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:  
 
 > I am not sure I understand what you mean. What is "issue with a
> degenerated master"?

This is a fairly common use case that can lead to the "split-brain" scenario: a failed master comes back, so there are two databases in read-write mode. Because the failed node stays ejected from pgpool, pgpool normally offers protection against this scenario. But my issue was that, since follow_master ignores the fact that this node is down, the failed master came back into the pool and, what's more, it became primary again!

The scenario is this:

1. Node 1 (primary) is powered off.
2. Node 2 becomes the new primary and node 3 follows the new primary.
3. Node 1 is started again, so postgres is started on it in read-write mode (I call this a degenerated master; maybe that is the correct wording). We are still protected from the split-brain scenario because pgpool will not use it.
4. Node 2 is stopped. pgpool does the failover and node 3 becomes primary, then pgpool executes follow_master on node 1. Because my script did not check that node 1 was not in recovery mode, it executed on node 1 and this caused issues (my script does not do a pcp_recovery_node), and then, for some strange reason, pgpool decided that node 1 would be the primary (I believe because it is not in recovery and because its id is lower).

As you said, it is a limitation of follow_master: it does not know that node 1 was detached before the failure of node 2. My solution is to check in follow_master that the target node is in recovery mode (see the sketch below).
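A minimal sketch of that check at the top of follow_master.sh (HOSTNAME, LOGFILE and passwordless ssh/psql access come from the surrounding script and are assumptions, not pgpool features):

    # Skip the whole follow_master logic if the target node is not a standby.
    # pg_is_in_recovery() returns 't' on a standby and 'f' on a (degenerated) master.
    ssh_options="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
    in_reco=$( $ssh_options postgres@${HOSTNAME} 'psql -At -c "select pg_is_in_recovery();"' )
    if [ "${in_reco}" != "t" ]; then
      echo "Node ${HOSTNAME} is not in recovery, probably a degenerated master, skipping it" | tee -a $LOGFILE
      exit 0
    fi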

>The reason why pcp_recovery_node is recommended is that it's the
>safest method. Depending on the replication setting, it is possible
> that a standby cannot be replicated from the new primary because it
> already removed the wal segments which is necessary for the standby
> to be recovered.

OK. I suppose it all depends on the size of the database and the network. I was not aware that simply following a new master was not 100% reliable; indeed, I believe that in my tests I was sometimes stuck with a standby whose replication status stayed at "waiting" (I never understood why).
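For reference, a follow_master_command built around pcp_recovery_node could then be as small as this sketch (the pcp port 9898, the pcpadmin user and a ~/.pcppass file for the password are assumptions about the environment, not pgpool requirements):

    #!/bin/bash
    # follow_master.sh %d ... : pgpool passes the detached node id as the first argument
    detached_node_id=$1

    # Re-create the detached standby from the current primary via the
    # recovery_1st_stage script; pgpool re-attaches the node when it succeeds.
    # -w makes pcp_recovery_node read the password from ~/.pcppass instead of prompting.
    pcp_recovery_node -h localhost -p 9898 -U pcpadmin -w -n ${detached_node_id}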

>Can you please provide git diff as an email attachment to the mailing
>list? Because our official git repository is not in GitHub.

About the parameter failover_on_backend_error, I believe the documentation should give a bit more context on what it does and on its relation with health checks. What do you think of something along the lines of the following?
"One of the main reasons to use pgpool is to automate failover when the primary database fails in streaming replication mode. There are two mechanisms that can trigger an automatic failover: the parameter failover_on_backend_error and the health check mechanism. Normally you will use one mechanism and not both at the same time.
If failover_on_backend_error is set to "on", failover is triggered when a pgpool backend process detects an error on its postgres connection. So when a client application tries to connect to postgres via pgpool, or a client already connected via pgpool executes a query, pgpool will notice that the connection to postgres is broken and will trigger the failover. If failover_on_backend_error is off, pgpool will not trigger the failover and will log something like "not execution failover because failover_on_backend_error is off". Note that as long as there is no activity on postgres, no failover will be triggered, because pgpool will not notice that the primary is down. Note also that with failover_on_backend_error there is no retry period, so if you prefer a "grace" period before triggering a failover, the health check mechanism is better suited."
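To make the second mechanism concrete, the documentation could show a small pgpool.conf fragment like this (the numbers are only an example of a roughly one-minute grace period, not recommended values):

    # Rely on health checks rather than on backend errors to trigger failover
    failover_on_backend_error = off

    # Probe each backend every 10 seconds; on failure, retry 6 times with a
    # 10 second delay, i.e. about one minute of grace before failover starts
    health_check_period      = 10
    health_check_timeout     = 20
    health_check_max_retries = 6
    health_check_retry_delay = 10
    health_check_user        = 'pgpool'
    health_check_password    = 'secret'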
Kind regards, 
Pierre


> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp

> Regards, 
> Pierre 
> 
>    On Tuesday, March 5, 2019, 10:47:49 PM GMT+1, Tatsuo Ishii <ishii at sraoss.co.jp> wrote:  
>  
>  I have updated follow master command description in the Pgpool-II
> document to clarify what it actually does (will appear in next
> Pgpool-II 4.0.4 release).
> 
> In the meantime I have uploaded the HTML compiled version to my Github
> page. Please take a look and give comments if you like.
> 
> http://localhost/~t-ishii/pgpool-II/html/runtime-config-failover.html
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
> 
> From: Pierre Timmermans <ptim007 at yahoo.com>
> Subject: Re: [pgpool-general: 6435] Re: follow_master_command executed on node shown as down (one of unrecovered masters from previous failover)
> Date: Fri, 1 Mar 2019 21:54:17 +0000 (UTC)
> Message-ID: <654470371.7835532.1551477257916 at mail.yahoo.com>
> 
>> It is probably a good idea to force the old primary to shut down, but it is not always possible: if, for example, the primary node gets shut down, the failover script will not be able to ssh into it and kill the old primary. If the old server comes back online, there is a degenerated master. I have a cron job that checks for a degenerated master (and for a detached standby) and re-instates it if possible, but I am sure there is always a risk of edge cases...
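A simplified sketch of what such a cron check can do, using pcp_node_info and pcp_attach_node (the pcp port and user, the pcp_node_info output format and the status value for "down" may differ per pgpool version; treat this as pseudo-code rather than a drop-in script):

    #!/bin/bash
    # Re-attach a node that pgpool shows as down but that is actually a healthy standby.
    NODE_ID=$1
    NODE_HOST=$2

    # Third field of pcp_node_info is assumed to be the backend status; 3 means "down" here.
    status=$(pcp_node_info -h localhost -p 9898 -U pcpadmin -w -n ${NODE_ID} | awk '{print $3}')
    if [ "$status" = "3" ]; then
        in_reco=$(psql -At -h ${NODE_HOST} -U postgres -c "select pg_is_in_recovery();")
        if [ "$in_reco" = "t" ]; then
            # Detached standby that is streaming again: safe to re-attach.
            pcp_attach_node -h localhost -p 9898 -U pcpadmin -w -n ${NODE_ID}
        else
            # Degenerated master: leave it detached, it needs pg_rewind or a new base backup first.
            echo "node ${NODE_ID} (${NODE_HOST}) looks like a degenerated master, not re-attaching"
        fi
    fi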
>> Pierre 
>> 
>>    On Friday, March 1, 2019, 7:35:12 PM GMT+1, Andre Piwoni <apiwoni at webmd.net> wrote:  
>>  
>> I just realized that I already handled the case of a restart that triggered failover in another way. Mainly, before promoting the new node to master in the failover script, I force the old primary to be shut down. So even if I restart the primary and a failover occurs, it will shut down the restarted old primary. Anyway, it doesn't hurt to also have that check in the follow_master script, in case rebooting the machine restarts the old primary, etc.
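A condensed sketch of that ordering in failover.sh, reusing the script arguments and paths that appear in the logs further down (the || true is only so an unreachable host does not abort the script):

    # 1) Try to stop the old primary first, so a restarted old primary cannot
    #    come back in read-write mode behind pgpool's back.
    ssh -o StrictHostKeyChecking=no -i /var/lib/pgsql/.ssh/id_rsa postgres@${failed_node_host} -T \
      "/usr/pgsql-10/bin/pg_ctl -D ${failed_node_pgdata} stop -m fast" || true

    # 2) Only then promote the chosen standby to be the new primary.
    ssh -o StrictHostKeyChecking=no -i /var/lib/pgsql/.ssh/id_rsa postgres@${new_master_node_host} -T \
      "/usr/pgsql-10/bin/pg_ctl -D ${new_master_node_pgdata} promote"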
>> On Fri, Mar 1, 2019 at 9:58 AM Andre Piwoni <apiwoni at webmd.net> wrote:
>> 
>> I agree. This shouldn't be so complicated.
>> Since I'm using sed in the follow_master script to repoint the slave by updating recovery.conf, if that command fails I'm not restarting and re-attaching the node. Kill two birds with one stone :-)
>> Here's what I'm testing now:
>> 
>> ssh -o StrictHostKeyChecking=no -i /var/lib/pgsql/.ssh/id_rsa postgres@${detached_node_host} -T "sed -i 's/host=.*sslmode=/host=${new_master_node_host} port=5432 sslmode=/g' /var/lib/pgsql/10/data/recovery.conf" >> $LOGFILE
>> repoint_status=$?
>> if [ ${repoint_status} -eq 0 ]; then
>>     # restart
>>     # reattach
>> else
>>     # WARNING: this could be a restarted master, so there's no recovery.conf
>>     # CONSIDERATION: Should I shut it down since I don't want to have two masters running even though Pgpool load balances one???
>> fi
>> On Fri, Mar 1, 2019 at 9:44 AM Pierre Timmermans <ptim007 at yahoo.com> wrote:
>> 
>> Thank you, it makes sense indeed. I also like to have a relatively long "grace" delay via the health check interval, so that if the primary restarts quickly enough there is no failover.
>> For the case where there is a degenerated master, I have added this code in the follow_master script, it seems to work fine in my tests:
>> 
>> ssh_options="ssh -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no"
>> 
>> in_reco=$( $ssh_options postgres@${HOSTNAME} 'psql -t -c "select pg_is_in_recovery();"' | head -1 | awk '{print $1}' )
>> if [ "${in_reco}" != "t" ]; then
>>   echo "Node $HOSTNAME is not in recovery, probably a degenerated master, skip it" | tee -a $LOGFILE
>>   exit 0
>> fi
>> 
>> In the end I believe that pgpool's algorithm for choosing a primary node (always the node with the lowest id) is the root cause of the problem: pgpool should select the most suitable node (a node that is in recovery and has the lowest replication gap). Unfortunately I cannot code in "C", otherwise I would contribute.
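Just to illustrate the idea (PostgreSQL 10 function names; the host list, the postgres user and passwordless access are placeholders), picking the most up-to-date standby could look like this:

    # Among the nodes that really are standbys, pick the one that has received the most WAL.
    best_node=""
    best_lsn=0
    for host in node0 node1 node2; do
        in_reco=$(psql -At -h $host -U postgres -c "select pg_is_in_recovery();") || continue
        [ "$in_reco" = "t" ] || continue    # skip degenerated masters
        # WAL received so far, expressed as a plain byte count
        lsn=$(psql -At -h $host -U postgres -c \
              "select pg_wal_lsn_diff(coalesce(pg_last_wal_receive_lsn(), '0/0'), '0/0');")
        if [ -n "$lsn" ] && [ "$lsn" -gt "$best_lsn" ]; then
            best_lsn=$lsn
            best_node=$host
        fi
    done
    echo "best candidate for promotion: $best_node"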
>> Pierre 
>> 
>>    On Friday, March 1, 2019, 5:07:06 PM GMT+1, Andre Piwoni <apiwoni at webmd.net> wrote:  
>>  
>>  FYI,
>> One of the things that I have done to minimize the impact of restarting the primary is using health checks, where health_check_max_retries x health_check_retry_delay allows enough time for the primary to be restarted without triggering a failover, which may take more time than the restart itself. This is with failover_on_backend_error disabled.
>> Andre
>> On Fri, Mar 1, 2019 at 7:58 AM Andre Piwoni <apiwoni at webmd.net> wrote:
>> 
>> Hi Pierre,
>> Hmmm? I have not covered the case you described, which is a restart of the primary on node 0, the resulting failover, and a subsequent restart of the new primary on node 1, which results in calling follow_master on node 0. In my case I was shutting down node 0, which resulted in follow_master being called on it after the second failover, since I was not checking whether node 0 was running. In your case, node 0 is running since it has been restarted.
>> Here's part of my script that I have to improve given your case:
>> ssh -o StrictHostKeyChecking=no -i /var/lib/pgsql/.ssh/id_rsa postgres@${detached_node_host} -T "/usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data status" | grep "is running"
>> running_status=$?
>> if [ ${running_status} -eq 0 ]; then
>>     # TODO: Check if recovery.conf exists or pg_is_in_recovery() on ${detached_node_host} and exit if this is not a slave node
>>     # repoint to new master ${new_master_node_host}
>>     # restart ${detached_node_host}
>>     # reattach restarted node with pcp_attach_node
>> else
>>     # do nothing since this could be an old slave or primary that needs to be recovered, or a node in maintenance mode, etc.
>> fi
>> 
>> 
>> On Fri, Mar 1, 2019 at 3:28 AM Pierre Timmermans <ptim007 at yahoo.com> wrote:
>> 
>> Hi
>> Same issue for me, but I am not sure how to fix it. Andre, can you tell exactly how you check?
>> I cannot add a test using pcp_node_info to check that the status is up, because then follow_master would never do anything. Indeed, in my case, when follow_master is executed the status of the target node is always down, so my script does the standby follow command and then a pcp_attach_node.
>> To solve the issue, for now I added a check that the command select pg_is_in_recovery(); returns "t" on the node; if it returns "f" then I can assume it is a degenerated master and I don't execute the follow_master command.
>> 
>> 
>> So my use case is this
>> 
>> 1. node 0 is primary, node 1 and node 2 are standbys
>> 2. node 0 is restarted, node 1 becomes primary and node 2 follows the new primary (thanks to follow_master). In the follow_master of node 2 I have to do a pcp_attach_node afterwards because the status of the node is down
>> 3. in the meantime node 0 has rebooted; the db is started on node 0 but it is down in pgpool and its role is standby (it is a degenerated master)
>> 4. node 1 is restarted, pgpool executes failover on node 2 and follow_master on node 0 => the follow_master on node 0 breaks everything, because after that node 0 becomes a primary again
>> Thanks and regards
>> Pierre 
>> 
>>    On Monday, February 25, 2019, 5:35:11 PM GMT+1, Andre Piwoni <apiwoni at webmd.net> wrote:  
>>  
>>  I have already put that check in place.
>> Thank you for confirming.
>> On Sat, Feb 23, 2019 at 11:56 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>> 
>> Sorry, I was wrong. A follow_master_command will be executed against
>> the down node as well. So you need to check whether the target PostgreSQL
>> node is running in the follow_master_command. If it's not, you can skip
>> the node.
>> 
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>> 
>>> I have added pg_ctl status check to ensure no action is taken when node is
>>> down but I'll check 3.7.8 version.
>>> 
>>> Here's the Pgpool log from the time node2 is shutdown to time node1(already
>>> dead old primary) received follow master command.
>>> Sorry for double date logging. I'm also including the self-explanatory
>>> failover.log that my failover and follow_master scripts generated.
>>> 
>>> Arguments passed to scripts for your reference.
>>> failover.sh %d %h %p %D %M %P %m %H %r %R
>>> follow_master.sh %d %h %p %D %M %P %m %H %r %R
>>> 
>>> Pool status before shutdown of node 2:
>>> postgres=> show pool_nodes;
>>>  node_id |          hostname          | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
>>> ---------+----------------------------+------+--------+-----------+---------+------------+-------------------+-------------------
>>>  0       | pg-hdp-node1.kitchen.local | 5432 | down   | 0.333333  | standby | 0          | false             | 0
>>>  1       | pg-hdp-node2.kitchen.local | 5432 | up     | 0.333333  | primary | 0          | false             | 0
>>>  2       | pg-hdp-node3.kitchen.local | 5432 | up     | 0.333333  | standby | 0          | true              | 0
>>> (3 rows)
>>> 
>>> Pgpool log
>>> Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [126-1] 2019-02-22 10:43:27:
>>> pid 12437: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [127-1] 2019-02-22 10:43:27:
>>> pid 12437: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:27 pg-hdp-node3 pgpool[12437]: [127-2] 2019-02-22 10:43:27:
>>> pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
>>> failed
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-1] 2019-02-22 10:43:37:
>>> pid 12437: ERROR:  Failed to check replication time lag
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-2] 2019-02-22 10:43:37:
>>> pid 12437: DETAIL:  No persistent db connection for the node 1
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-3] 2019-02-22 10:43:37:
>>> pid 12437: HINT:  check sr_check_user and sr_check_password
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [128-4] 2019-02-22 10:43:37:
>>> pid 12437: CONTEXT:  while checking replication time lag
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [129-1] 2019-02-22 10:43:37:
>>> pid 12437: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [130-1] 2019-02-22 10:43:37:
>>> pid 12437: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:37 pg-hdp-node3 pgpool[12437]: [130-2] 2019-02-22 10:43:37:
>>> pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
>>> failed
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [6-1] 2019-02-22 10:43:45: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [7-1] 2019-02-22 10:43:45: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [7-2] 2019-02-22 10:43:45: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:45 pg-hdp-node3 pgpool[7786]: [8-1] 2019-02-22 10:43:45: pid
>>> 7786: LOG:  health check retrying on DB node: 1 (round:1)
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-1] 2019-02-22 10:43:47:
>>> pid 12437: ERROR:  Failed to check replication time lag
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-2] 2019-02-22 10:43:47:
>>> pid 12437: DETAIL:  No persistent db connection for the node 1
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-3] 2019-02-22 10:43:47:
>>> pid 12437: HINT:  check sr_check_user and sr_check_password
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [131-4] 2019-02-22 10:43:47:
>>> pid 12437: CONTEXT:  while checking replication time lag
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [132-1] 2019-02-22 10:43:47:
>>> pid 12437: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [133-1] 2019-02-22 10:43:47:
>>> pid 12437: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:47 pg-hdp-node3 pgpool[12437]: [133-2] 2019-02-22 10:43:47:
>>> pid 12437: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432"
>>> failed
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [9-1] 2019-02-22 10:43:48: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [10-1] 2019-02-22 10:43:48: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [10-2] 2019-02-22 10:43:48: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:48 pg-hdp-node3 pgpool[7786]: [11-1] 2019-02-22 10:43:48: pid
>>> 7786: LOG:  health check retrying on DB node: 1 (round:2)
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [12-1] 2019-02-22 10:43:51: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [13-1] 2019-02-22 10:43:51: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [13-2] 2019-02-22 10:43:51: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:51 pg-hdp-node3 pgpool[7786]: [14-1] 2019-02-22 10:43:51: pid
>>> 7786: LOG:  health check retrying on DB node: 1 (round:3)
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [15-1] 2019-02-22 10:43:54: pid
>>> 7786: LOG:  failed to connect to PostgreSQL server on
>>> "pg-hdp-node2.kitchen.local:5432", getsockopt() detected error "Connection
>>> refused"
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [16-1] 2019-02-22 10:43:54: pid
>>> 7786: ERROR:  failed to make persistent db connection
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [16-2] 2019-02-22 10:43:54: pid
>>> 7786: DETAIL:  connection to host:"pg-hdp-node2.kitchen.local:5432" failed
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [17-1] 2019-02-22 10:43:54: pid
>>> 7786: LOG:  health check failed on node 1 (timeout:0)
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7786]: [18-1] 2019-02-22 10:43:54: pid
>>> 7786: LOG:  received degenerate backend request for node_id: 1 from pid
>>> [7786]
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [253-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  Pgpool-II parent process has received failover request
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [254-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  starting degeneration. shutdown host
>>> pg-hdp-node2.kitchen.local(5432)
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [255-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  Restart all children
>>> Feb 22 10:43:54 pg-hdp-node3 pgpool[7746]: [256-1] 2019-02-22 10:43:54: pid
>>> 7746: LOG:  execute command: /etc/pgpool-II/failover.sh 1
>>> pg-hdp-node2.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
>>> pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [257-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [258-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: checking backend no 0
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [259-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: checking backend no 1
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [260-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: checking backend no 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [261-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  find_primary_node: primary node id is 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [262-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  starting follow degeneration. shutdown host
>>> pg-hdp-node1.kitchen.local(5432)
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [263-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  starting follow degeneration. shutdown host
>>> pg-hdp-node2.kitchen.local(5432)
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [264-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover: 2 follow backends have been degenerated
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [265-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover: set new primary node: 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [266-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover: set new master node: 2
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[7746]: [267-1] 2019-02-22 10:43:55: pid
>>> 7746: LOG:  failover done. shutdown host pg-hdp-node2.kitchen.local(5432)
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-1] 2019-02-22 10:43:55:
>>> pid 12437: ERROR:  Failed to check replication time lag
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-2] 2019-02-22 10:43:55:
>>> pid 12437: DETAIL:  No persistent db connection for the node 1
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-3] 2019-02-22 10:43:55:
>>> pid 12437: HINT:  check sr_check_user and sr_check_password
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [134-4] 2019-02-22 10:43:55:
>>> pid 12437: CONTEXT:  while checking replication time lag
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12437]: [135-1] 2019-02-22 10:43:55:
>>> pid 12437: LOG:  worker process received restart request
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12774]: [267-1] 2019-02-22 10:43:55:
>>> pid 12774: LOG:  failback event detected
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12774]: [267-2] 2019-02-22 10:43:55:
>>> pid 12774: DETAIL:  restarting myself
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [265-1] 2019-02-22 10:43:55:
>>> pid 12742: LOG:  start triggering follow command.
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [266-1] 2019-02-22 10:43:55:
>>> pid 12742: LOG:  execute command: /etc/pgpool-II/follow_master.sh 0
>>> pg-hdp-node1.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
>>> pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
>>> Feb 22 10:43:55 pg-hdp-node3 pgpool[12742]: [267-1] 2019-02-22 10:43:55:
>>> pid 12742: LOG:  execute command: /etc/pgpool-II/follow_master.sh 1
>>> pg-hdp-node2.kitchen.local 5432 /var/lib/pgsql/10/data 1 1 2
>>> pg-hdp-node3.kitchen.local 5432 /var/lib/pgsql/10/data
>>> Feb 22 10:43:56 pg-hdp-node3 pgpool[12436]: [60-1] 2019-02-22 10:43:56: pid
>>> 12436: LOG:  restart request received in pcp child process
>>> Feb 22 10:43:56 pg-hdp-node3 pgpool[7746]: [268-1] 2019-02-22 10:43:56: pid
>>> 7746: LOG:  PCP child 12436 exits with status 0 in failover()
>>> 
>>> Pgpool self-explanatory failover.log
>>> 
>>> 2019-02-22 10:43:54.893 PST Executing failover script ...
>>> 2019-02-22 10:43:54.895 PST Script arguments:
>>> failed_node_id           1
>>> failed_node_host         pg-hdp-node2.kitchen.local
>>> failed_node_port         5432
>>> failed_node_pgdata       /var/lib/pgsql/10/data
>>> old_primary_node_id      1
>>> old_master_node_id       1
>>> new_master_node_id       2
>>> new_master_node_host     pg-hdp-node3.kitchen.local
>>> new_master_node_port     5432
>>> new_master_node_pgdata   /var/lib/pgsql/10/data
>>> 2019-02-22 10:43:54.897 PST Primary node running on
>>> pg-hdp-node2.kitchen.local host is unresponsive or have died
>>> 2019-02-22 10:43:54.898 PST Attempting to stop primary node running on
>>> pg-hdp-node2.kitchen.local host before promoting slave as the new primary
>>> 2019-02-22 10:43:54.899 PST ssh -o StrictHostKeyChecking=no -i
>>> /var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node2.kitchen.local -T
>>> /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data stop -m fast
>>> 2019-02-22 10:43:55.151 PST Promoting pg-hdp-node3.kitchen.local host as
>>> the new primary
>>> 2019-02-22 10:43:55.153 PST ssh -o StrictHostKeyChecking=no -i
>>> /var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node3.kitchen.local -T
>>> /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data promote
>>> waiting for server to promote.... done
>>> server promoted
>>> 2019-02-22 10:43:55.532 PST Completed executing failover
>>> 
>>> 2019-02-22 10:43:55.564 PST Executing follow master script ...
>>> 2019-02-22 10:43:55.566 PST Script arguments
>>> detached_node_id         0
>>> detached_node_host       pg-hdp-node1.kitchen.local
>>> detached_node_port       5432
>>> detached_node_pgdata     /var/lib/pgsql/10/data
>>> old_primary_node_id      1
>>> old_master_node_id       1
>>> new_master_node_id       2
>>> new_master_node_host     pg-hdp-node3.kitchen.local
>>> new_master_node_port     5432
>>> new_master_node_pgdata   /var/lib/pgsql/10/data
>>> 2019-02-22 10:43:55.567 PST Checking if server is running on
>>> pg-hdp-node1.kitchen.local host
>>> 2019-02-22 10:43:55.569 PST ssh -o StrictHostKeyChecking=no -i
>>> /var/lib/pgsql/.ssh/id_rsa postgres at pg-hdp-node1.kitchen.local -T
>>> /usr/pgsql-10/bin/pg_ctl -D /var/lib/pgsql/10/data status
>>> 
>>> 
>>> pg_ctl: no server running
>>> 2019-02-22 10:43:55.823 PST Node on pg-hdp-node1.kitchen.local host is not
>>> running. It could be old slave or primary that needs to be recovered.
>>> 2019-02-22 10:43:55.824 PST Completed executing follow master script
>>> 
>>> 2019-02-22 10:43:55.829 PST Executing follow master script ...
>>> 2019-02-22 10:43:55.830 PST Script arguments
>>> detached_node_id         1
>>> detached_node_host       pg-hdp-node2.kitchen.local
>>> detached_node_port       5432
>>> detached_node_pgdata     /var/lib/pgsql/10/data
>>> old_primary_node_id      1
>>> old_master_node_id       1
>>> new_master_node_id       2
>>> new_master_node_host     pg-hdp-node3.kitchen.local
>>> new_master_node_port     5432
>>> new_master_node_pgdata   /var/lib/pgsql/10/data
>>> 2019-02-22 10:43:55.831 PST Detached node on pg-hdp-node2.kitchen.local
>>> host is the the old primary node
>>> 2019-02-22 10:43:55.833 PST Slave can be created from old primary node by
>>> deleting PG_DATA directory under /var/lib/pgsql/10/data on
>>> pg-hdp-node2.kitchen.local host and re-running Chef client
>>> 2019-02-22 10:43:55.834 PST Slave can be recovered from old primary node by
>>> running /usr/pgsql-10/bin/pg_rewind -D /var/lib/pgsql/10/data
>>> --source-server="port=5432 host=pg-hdp-node3.kitchen.local" command on
>>> pg-hdp-node2.kitchen.local host as postgres user
>>> 2019-02-22 10:43:55.835 PST After successful pg_rewind run cp
>>> /var/lib/pgsql/10/data/recovery.done /var/lib/pgsql/10/data/recovery.conf,
>>> ensure host connection string points to pg-hdp-node3.kitchen.local, start
>>> PostgreSQL and attach it to pgpool
>>> 2019-02-22 10:43:55.836 PST Completed executing follow master script
>>> 
>>> On Thu, Feb 21, 2019 at 4:47 PM Tatsuo Ishii <ishii at sraoss.co.jp> wrote:
>>> 
>>>> > Is this correct behavior?
>>>> >
>>>> > In 3-node setup, node1(primary) is shutdown, failover is executed and
>>>> node2
>>>> > becomes new primary and node3 follows new primary on node2.
>>>> > Now, node2(new primary) is shutdown, failover is executed and node3
>>>> becomes
>>>> > new primary but follow_master_command is executed on node1 even though it
>>>> > is reported as down.
>>>>
>>>> No. follow master command should not be executed on an already-down
>>>> node (in this case node1).
>>>>
>>>> > It happens that my script repoints node1 and restarts it which breaks
>>>> hell
>>>> > because node1 was never recovered after being shutdown.
>>>> >
>>>> > I'm on PgPool 3.7.4.
>>>>
>>>> Can you share the log from when node2 was shutdown to when node1 was
>>>> recovered by your follow master command?
>>>>
>>>> In the mean time 3.7.4 is not the latest one. Can you try with the
>>>> latest one? (3.7.8).
>>>>
>>>> Best regards,
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese:http://www.sraoss.co.jp
>>>>
>>> 
>>> 
>>> -- 
>>> 
>>> *Andre Piwoni*
>> 
>> 
>>  
>> 
>> 
>> -- 
>> 
>> Andre Piwoni
>> 
>> Sr. Software Developer,BI/Database
>> 
>> WebMD Health Services
>> 
>> Mobile: 801.541.4722
>> 
>> www.webmdhealthservices.com
>> 
>> 
>> 