[pgpool-general: 5737] pgpool failover scenario ERROR - pgpool 3.6.6 (postgres 9.3.5)

Shay Cohavi cohavisi at gmail.com
Thu Sep 21 17:10:12 JST 2017


Hi,
I have two PostgreSQL servers (9.3.5), each with pgpool-II (3.6.6) installed
on the same node:


server1:    192.168.15.55
server2:    192.168.15.56
pgpool VIP: 192.168.15.59





# psql -U postgres -h 192.168.15.59 -p 9999 -c 'show pool_nodes'
 node_id |   hostname    | port | status | lb_weight |  role   | select_cnt | load_balance_node | replication_delay
---------+---------------+------+--------+-----------+---------+------------+-------------------+-------------------
 0       | 192.168.15.55 | 5432 | up     | 0.500000  | standby | 0          | false             | 0
 1       | 192.168.15.56 | 5432 | up     | 0.500000  | primary | 0          | true              | 0
(2 rows)



My question is about a disaster recovery scenario:

1. *The primary PostgreSQL and the MASTER pgpool are on server1.* The pgpool
VIP (192.168.15.59) is established on this node.

2. server1 goes down due to a power outage (server1 is dead!).

3. pgpool on server2 is promoted and establishes the VIP (192.168.15.59).

4. pgpool triggers the PostgreSQL failover action, and PostgreSQL on server2
is promoted to primary.

5. The faulty node is started up again: its pgpool joins as STANDBY, and its
PostgreSQL is recovered via the *pcp_recovery_node* tool.
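For reference, the recovery in step 5 uses the pcp_recovery_node tool, roughly
like this (a dry-run sketch that only prints the command; the pcp port 9898 and
pcp user here are placeholder assumptions, not taken from my actual config):

```shell
#!/bin/sh
# Dry-run sketch of the step-5 recovery call: prints the pcp_recovery_node
# command instead of executing it. The pcp port and user are assumptions.
PCP_HOST=${PCP_HOST:-192.168.15.59}   # pgpool delegate IP (VIP)
PCP_PORT=${PCP_PORT:-9898}            # default pcp port (assumed)
PCP_USER=${PCP_USER:-postgres}        # pcp user (site-specific assumption)
NODE_ID=${1:-0}                       # backend node id from 'show pool_nodes'

CMD="pcp_recovery_node -h $PCP_HOST -p $PCP_PORT -U $PCP_USER -n $NODE_ID"
echo "$CMD"                           # dry run only, never executed here
```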

*The problem appears when performing this cycle multiple times (crashing the
MASTER pgpool and primary PostgreSQL).*

*Example:*
*server1 (STANDBY pgpool and standby PostgreSQL)*
*server2 (MASTER pgpool and primary PostgreSQL)*

*The STANDBY pgpool node (server1) is not elected as MASTER, and no
PostgreSQL failover takes place.*


*Only when the faulty node (server2) is started up again is the STANDBY
pgpool (server1) elected as MASTER, with server2 becoming STANDBY, but no
primary PostgreSQL is available due to the lack of a failback command!*
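As far as I understand, the missing piece would be the failover hook that
pgpool runs on backend failure, configured via failover_command in
pgpool.conf. Something like the following (the script path is a hypothetical
example, not from my config; the %-placeholders are pgpool's own):

```
# pgpool.conf -- failover hook (script path is an example, not from my setup)
failover_command = '/etc/pgpool-II/failover.sh %d %P %H %R'
#   %d = failed node id        %P = old primary node id
#   %H = new master hostname   %R = new master database cluster directory
```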


*The logs are attached, as well as the pgpool.conf files of both servers!*


Log of the STANDBY pgpool that was not elected as MASTER:

2017-09-21 09:53:08: pid 2014: WARNING:  checking setuid bit of if_up_cmd
2017-09-21 09:53:08: pid 2014: DETAIL:  ifup[/sbin/ifconfig] doesn't have
setuid bit
2017-09-21 09:53:08: pid 2014: WARNING:  checking setuid bit of if_down_cmd
2017-09-21 09:53:08: pid 2014: DETAIL:  ifdown[/sbin/ifconfig] doesn't have
setuid bit
2017-09-21 09:53:08: pid 2014: WARNING:  checking setuid bit of arping
command
2017-09-21 09:53:08: pid 2014: DETAIL:  arping[/usr/sbin/arping] doesn't
have setuid bit
2017-09-21 09:53:08: pid 2014: LOG:  waiting for watchdog to initialize
2017-09-21 09:53:08: pid 2016: LOG:  setting the local watchdog node name
to "1.1.1.55:9999 Linux mgrdb-55"
2017-09-21 09:53:08: pid 2016: LOG:  watchdog cluster is configured with 1
remote nodes
2017-09-21 09:53:08: pid 2016: LOG:  watchdog remote node:0 on 1.1.1.56:9000
2017-09-21 09:53:08: pid 2016: LOG:  interface monitoring is disabled in
watchdog
2017-09-21 09:53:08: pid 2016: FATAL:  failed to create watchdog command
server socket
2017-09-21 09:53:08: pid 2016: DETAIL:  bind on "/tmp/.s.PGPOOLWD_CMD.9000"
failed with reason: "Address already in use"
2017-09-21 09:53:08: pid 2014: LOG:  watchdog child process with pid: 2016
exits with status 768
2017-09-21 09:53:08: pid 2014: FATAL:  watchdog child process exit with
fatal error. exiting pgpool-II
2017-09-21 09:53:08: pid 2018: LOG:  setting the local watchdog node name
to "1.1.1.55:9999 Linux mgrdb-55"
2017-09-21 09:53:08: pid 2018: LOG:  watchdog cluster is configured with 1
remote nodes
2017-09-21 09:53:08: pid 2018: LOG:  watchdog remote node:0 on 1.1.1.56:9000
2017-09-21 09:53:08: pid 2018: LOG:  interface monitoring is disabled in
watchdog
2017-09-21 09:53:08: pid 2018: LOG:  watchdog node state changed from
[DEAD] to [LOADING]
2017-09-21 09:53:08: pid 2018: LOG:  new outbond connection to 1.1.1.56:9000
2017-09-21 09:53:08: pid 2018: LOG:  setting the remote node "1.1.1.56:9999
Linux mgrdb-56" as watchdog cluster master
2017-09-21 09:53:08: pid 2018: LOG:  watchdog node state changed from
[LOADING] to [INITIALIZING]
2017-09-21 09:53:09: pid 2018: LOG:  watchdog node state changed from
[INITIALIZING] to [STANDBY]
2017-09-21 09:53:09: pid 2018: LOG:  successfully joined the watchdog
cluster as standby node
2017-09-21 09:53:09: pid 2018: DETAIL:  our join coordinator request is
accepted by cluster leader node "1.1.1.56:9999 Linux mgrdb-56"
2017-09-21 09:53:10: pid 2018: LOG:  new watchdog node connection is
received from "1.1.1.56:40168"
2017-09-21 09:53:10: pid 2018: LOG:  new node joined the cluster
hostname:"1.1.1.56" port:9000 pgpool_port:9999
2017-09-21 09:55:01: pid 2018: LOG:  watchdog received the failover command
from "1.1.1.56:9999 Linux mgrdb-56"
2017-09-21 09:55:01: pid 2018: LOG:  received failback request for node_id:
0 from pid [2018] wd_failover_id [42]




*When the faulty node (server2) is started again, this node (server1) is
elected as MASTER:*

2017-09-21 10:49:11: pid 2018: LOG:  new watchdog node connection is
received from "1.1.1.56:30363"
2017-09-21 10:49:11: pid 2018: LOG:  new node joined the cluster
hostname:"1.1.1.56" port:9000 pgpool_port:9999
2017-09-21 10:49:11: pid 2018: LOG:  failed to write watchdog packet to
socket
2017-09-21 10:49:11: pid 2018: DETAIL:  Connection reset by peer
2017-09-21 10:49:11: pid 2018: WARNING:  the coordinator as per our record
is not coordinator anymore
2017-09-21 10:49:11: pid 2018: DETAIL:  re-initializing the cluster
2017-09-21 10:49:11: pid 2018: LOG:  watchdog node state changed from
[STANDBY] to [JOINING]
2017-09-21 10:49:11: pid 2018: LOG:  unassigning the remote node "
1.1.1.56:9999 Linux mgrdb-56" from watchdog cluster master
2017-09-21 10:49:11: pid 2018: LOG:  new outbond connection to 1.1.1.56:9000
2017-09-21 10:49:11: pid 2018: LOG:  watchdog node state changed from
[JOINING] to [INITIALIZING]
2017-09-21 10:49:11: pid 2018: LOG:  remote node "1.1.1.56:9999 Linux
mgrdb-56" is shutting down
2017-09-21 10:49:12: pid 2018: LOG:  I am the only alive node in the
watchdog cluster
2017-09-21 10:49:12: pid 2018: HINT:  skiping stand for coordinator state
2017-09-21 10:49:12: pid 2018: LOG:  watchdog node state changed from
[INITIALIZING] to [MASTER]
2017-09-21 10:49:12: pid 2018: LOG:  I am announcing my self as
master/coordinator watchdog node
2017-09-21 10:49:16: pid 2018: LOG:  I am the cluster leader node
2017-09-21 10:49:16: pid 2018: DETAIL:  our declare coordinator message is
accepted by all nodes
2017-09-21 10:49:16: pid 2018: LOG:  setting the local node "1.1.1.55:9999
Linux mgrdb-55" as watchdog cluster master
2017-09-21 10:49:16: pid 2018: LOG:  I am the cluster leader node. Starting
escalation process
2017-09-21 10:49:16: pid 2018: LOG:  escalation process started with
PID:10980
2017-09-21 10:49:16: pid 10980: LOG:  watchdog: escalation started
2017-09-21 10:49:20: pid 10980: LOG:  successfully acquired the delegate
IP:"192.168.15.59"
2017-09-21 10:49:20: pid 10980: DETAIL:  'if_up_cmd' returned with success
2017-09-21 10:49:20: pid 2018: LOG:  watchdog escalation process with pid:
10980 exit with SUCCESS.
2017-09-21 10:49:25: pid 2018: LOG:  new watchdog node connection is
received from "1.1.1.56:32923"
2017-09-21 10:49:25: pid 2018: LOG:  new node joined the cluster
hostname:"1.1.1.56" port:9000 pgpool_port:9999
2017-09-21 10:49:25: pid 2018: LOG:  new outbond connection to 1.1.1.56:9000
2017-09-21 10:49:26: pid 2018: LOG:  adding watchdog node "1.1.1.56:9999
Linux mgrdb-56" to the standby list
2017-09-21 10:50:15: pid 2018: LOG:  remote node "1.1.1.56:9999 Linux
mgrdb-56" is shutting down
2017-09-21 10:50:15: pid 2018: LOG:  removing watchdog node "1.1.1.56:9999
Linux mgrdb-56" from the standby list



Thanks,
cohavisi
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Untitled drawing.jpg
Type: image/jpeg
Size: 41138 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20170921/be4fd17e/attachment-0001.jpg>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool_error.tar.gz
Type: application/x-gzip
Size: 25422 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20170921/be4fd17e/attachment-0001.gz>

