[pgpool-general: 6256] Re: failiover fails sometimes with failover :falling node is alive and set new primary node:-1

Bo Peng pengbo at sraoss.co.jp
Sat Oct 27 23:27:42 JST 2018


Hi

It seems that after the first failover, pgpool could not
connect to the new primary node.

Could you share the pgpool.conf?
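
If the full file is long, the settings that matter most here could be extracted with something like the following. This is only a minimal sketch: the function name show_failover_settings and the selection of parameter prefixes are my own assumptions about what is relevant, and the path to pgpool.conf is whatever your installation uses. Lines containing "_password" are filtered out so that credentials are not posted to the list.

```shell
#!/bin/sh
# Print the active (non-comment) pgpool.conf lines related to backends,
# health checks, streaming replication checks, and failover.
# show_failover_settings is a hypothetical helper, not part of pgpool-II.
show_failover_settings() {
    conf_file="$1"
    # Keep only lines that start with one of the chosen parameter prefixes,
    # then drop any password settings before sharing the output.
    grep -E '^[[:space:]]*(backend_|health_check_|sr_check_|failover_|follow_master_command|search_primary_node_timeout)' "$conf_file" \
        | grep -v '_password'
}
```

It would be called as `show_failover_settings /path/to/pgpool.conf`, where the path is a placeholder.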

On Fri, 26 Oct 2018 18:19:17 +0800
"mandy" <dw_qiuchunxiao at sina.com> wrote:

> Hi,
>
> Sometimes I fail to activate the standby node as the new primary node
> when the old primary node goes down, using the failover feature of
> pgpool-II.
>
> I have 2 servers running CentOS 7: one hosts pgpool-II 4.0.0 and
> PostgreSQL 11.0, the other hosts PostgreSQL 11.0. I installed pgpool-II
> and PostgreSQL from source.
>
> As shown below, I have 2 nodes using streaming replication: pgsrv14 is
> the primary node and pgsrv13 is the standby node. pgpool-II is
> installed on the pgsrv13 server.
>
>     [postgres@pgsrv13 replscript]$ psql -p 9999
>     psql (11.0)
>     Type "help" for help.
>
>     postgres=# show pool_nodes;
>      node_id | hostname | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay | last_status_change
>     ---------+----------+------+--------+-----------+---------+------------+-------------------+-------------------+---------------------
>      0       | pgsrv13  | 5432 | up     | 0.500000  | standby | 0          | true              | 0                 | 2018-10-26 07:42:46
>      1       | pgsrv14  | 5432 | up     | 0.500000  | primary | 0          | false             | 0                 | 2018-10-26 07:41:15
>     (2 rows)
>
> When I use the command "pg_ctl stop" to stop the database on pgsrv14,
> pgpool should activate pgsrv13 (the standby node) as the new primary
> node by executing the failover.sh script. However, it failed.
>
> I have specified 4 special characters in failover_command in
> pgpool.conf:
>
>     failover_command = '/usr/local/pgpool-4.0.0/replscript/failover.sh %d %h %P %H'
>                                        # Executes this command at failover
>                                        # Special values:
>                                        #   %d = node id
>                                        #   %h = host name
>                                        #   %H = hostname of the new master node
>                                        #   %P = old primary node id
>
> The log of failover.sh said:
>
>     .......
>     failover.sh FALLING_NODE: 1; FALLING_HOST: pgsrv14; OLDPRIMARY_NODE: 1; NEW_PRIMARY: pgsrv13; at Fri Oct 26 07:43:06 EDT 2018
>     ssh -f -n -T postgres@pgsrv13 /usr/local/pgpool-4.0.0/replscript/promote.sh -d pgsrv14
>     failover done!
>
>     failover.sh FALLING_NODE: 0; FALLING_HOST: pgsrv13; OLDPRIMARY_NODE: 0; NEW_PRIMARY: ; at Fri Oct 26 07:43:11 EDT 2018
>     ssh -f -n -T postgres@ /usr/local/pgpool-4.0.0/replscript/promote.sh -d pgsrv13
>     failover done!
>     .......
>
> The first entry in the failover.sh log is correct: it identifies the
> failed node and the old primary node as 1 (pgsrv14) and the new primary
> node as pgsrv13. The result of "select pg_is_in_recovery()" on pgsrv13
> is 'f', which shows that pgsrv13 is now the primary node and is alive.
>
> However, because I cannot run the recovery command to restore pgsrv14
> as a normal node within such a short period (5 seconds), pgpool
> executes a second failover command, as the second entry shows. This
> time pgpool takes the failed node to be pgsrv13, although pgsrv13 is
> actually up and is the primary node.
>
> When the second entry was logged, pgpool.log said that pgsrv13 was
> shut down by administrative command (it was actually alive) and that
> all DB nodes were in down status, and pgpool set the new primary node
> to -1. Maybe this is why NEW_PRIMARY is empty in the second entry.
>
> pgpool.log said:
>
>     Oct 26 07:43:11 pgsrv13 pgpool[3827]: [438-1] 2018-10-26 07:43:11: pid 3827: LOG:  reading and processing packets
>     Oct 26 07:43:11 pgsrv13 pgpool[3827]: [438-2] 2018-10-26 07:43:11: pid 3827: DETAIL:  postmaster on DB node 0 was shutdown by administrative command
>     Oct 26 07:43:11 pgsrv13 pgpool[3827]: [439-1] 2018-10-26 07:43:11: pid 3827: LOG:  received degenerate backend request for node_id: 0 from pid [3827]
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [467-1] 2018-10-26 07:43:11: pid 3036: LOG:  Pgpool-II parent process has received failover request
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [468-1] 2018-10-26 07:43:11: pid 3036: LOG:  starting degeneration. shutdown host pgsrv13(5432)
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [469-1] 2018-10-26 07:43:11: pid 3036: WARNING:  All the DB nodes are in down status and skip writing status file.
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [470-1] 2018-10-26 07:43:11: pid 3036: LOG:  failover: no valid backend node found
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [471-1] 2018-10-26 07:43:11: pid 3036: LOG:  Restart all children
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [472-1] 2018-10-26 07:43:11: pid 3036: LOG:  execute command: /usr/local/pgpool-4.0.0/replscript/failover.sh 0 pgsrv13 0 ""
>     Oct 26 07:43:11 pgsrv13 pgpool[3036]: [473-1] 2018-10-26 07:43:11: pid 3036: LOG:  find_primary_node_repeatedly: waiting for finding a primary node
>     Oct 26 07:48:11 pgsrv13 pgpool[3036]: [474-1] 2018-10-26 07:48:11: pid 3036: LOG:  failover: set new primary node: -1
>     Oct 26 07:48:11 pgsrv13 pgpool[4029]: [475-1] 2018-10-26 07:48:11: pid 4029: LOG:  failback event detected
>     Oct 26 07:48:11 pgsrv13 pgpool[4029]: [475-2] 2018-10-26 07:48:11: pid 4029: DETAIL:  restarting myself
>
> I don't know what's wrong! Any help is welcome, and I am glad to offer
> more information if it helps to solve the problem.
>
> Thank you!
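
In the second invocation shown in the logs above, failover.sh was called with an empty %H and still ran "ssh ... postgres@ ...". A guard along the following lines might make the script safer in that case. This is a sketch under assumptions: decide_failover_action is a hypothetical helper (the real failover.sh was not posted), and it only decides what to do from the four arguments; it does not explain why pgpool considered node 0 to be down.

```shell
#!/bin/sh
# Decide what a failover script should do, given the four values that
# failover_command passes in the order %d %h %P %H.
# Prints "promote <host>", "skip", or "abort" and performs no other action.
# decide_failover_action is a hypothetical helper, not part of pgpool-II.
decide_failover_action() {
    failed_node_id="$1"      # %d: id of the node that went down
    failed_host="$2"         # %h: its host name (unused here, kept for logging)
    old_primary_id="$3"      # %P: id of the old primary node
    new_primary_host="$4"    # %H: host name of the new master node

    # No surviving node: %H is empty, so there is nothing to promote.
    if [ -z "$new_primary_host" ]; then
        echo "abort"
        return 0
    fi

    # A standby went down: the primary is untouched, so no promotion.
    if [ "$failed_node_id" != "$old_primary_id" ]; then
        echo "skip"
        return 0
    fi

    # The primary went down and a candidate exists: promote it.
    echo "promote $new_primary_host"
}
```

With the arguments from the first invocation, `decide_failover_action 1 pgsrv14 1 pgsrv13` prints "promote pgsrv13"; with those from the second, `decide_failover_action 0 pgsrv13 0 ""` prints "abort", so no ssh command with an empty host would be run.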

-- 
Bo Peng <pengbo at sraoss.co.jp>
SRA OSS, Inc. Japan


