View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000348 | Pgpool-II | Bug | public | 2017-10-05 15:43 | 2017-11-09 16:51 |
| Reporter | cohavisi | Assigned To | Muhammad Usama | ||
| Priority | urgent | Severity | major | Reproducibility | always |
| Status | feedback | Resolution | open | ||
| Platform | Oracle VirtualBox | OS | CentOS | OS Version | 6.8 |
| Product Version | 3.6.6 | ||||
| Summary | 0000348: pgpool failover scenario ERROR - pgpool 3.6.6 (postgres 9.3.5) | ||||
| Description | Hi, I have 2 postgres servers (postgres 9.3.5) installed with pgpool(3.6.6) on the same node. server1 192.168.15.55 server 2 192.168.15.56 pgpool vip 192.168.15.59 # psql -U postgres -h 192.168.15.59 -p 9999 -c 'show pool_nodes' node_id | hostname | port | status | lb_weight | role | select_cnt | load_balance_node | replication_delay ---------+---------------+------+--------+-----------+---------+------------+-------------------+------------------- 0 | 192.168.15.55 | 5432 | up | 0.500000 | standby | 0 | false | 0 1 | 192.168.15.56 | 5432 | up | 0.500000 | primary | 0 | true | 0 (2 rows) error appear in a disaster recovery scenario: 1. "primary postgres" and the "master pgpool" is on server1 . pgpool vip established on this node (192.168.15.59) 2. server1 went down due to power break. (server1 is dead!!!) 3. pgpool on server server2 get promoted (vip established 192.168.15.59) . 4. postgres failover action occurs by the pgpool. postgres is being promoted to primary. 5. staring up the faulty node , pgpool started as STANDBY. and postgres is being recovered vis pcp_recovery_node tool! when performing this action multiple times...(crashing the MASTER pgpool & primary postgres) example: server1 (STANDBY pgpool and secondary postgres) server2 (MASTER pgpool and primary postgres) the STANDBY pgpool node (server1) is not being elected as MASTER and no failback to the postgres. only when starting up again the faulty node... (server2) the STANBY pgpool (server1) is being elected as MASTER and server2 is being elected as STANDBY but no primary postgres is available due to lack of failback command !!! logs attached as well as pgpool.conf file of both servers! | ||||
| Steps To Reproduce | 1. "primary postgres" and the "master pgpool" is on server1 . pgpool vip established on this node (192.168.15.59) 2. server1 went down due to power break. (server1 is dead!!!) 3. pgpool on server server2 get promoted (vip established 192.168.15.59) . 4. postgres failover action occurs by the pgpool. postgres is being promoted to primary. 5. staring up the faulty node , pgpool started as STANDBY. and postgres is being recovered vis pcp_recovery_node tool! when performing this action multiple times...(crashing the MASTER pgpool & primary postgres) example: server1 (STANDBY pgpool and secondary postgres) server2 (MASTER pgpool and primary postgres) the STANDBY pgpool node (server1) is not being elected as MASTER and no failback to the postgres. | ||||
| Additional Information | log of the STANDBY pgpool which has not been elected as MASTER: 2017-09-21 09:53:08: pid 2014: WARNING: checking setuid bit of if_up_cmd 2017-09-21 09:53:08: pid 2014: DETAIL: ifup[/sbin/ifconfig] doesn't have setuid bit 2017-09-21 09:53:08: pid 2014: WARNING: checking setuid bit of if_down_cmd 2017-09-21 09:53:08: pid 2014: DETAIL: ifdown[/sbin/ifconfig] doesn't have setuid bit 2017-09-21 09:53:08: pid 2014: WARNING: checking setuid bit of arping command 2017-09-21 09:53:08: pid 2014: DETAIL: arping[/usr/sbin/arping] doesn't have setuid bit 2017-09-21 09:53:08: pid 2014: LOG: waiting for watchdog to initialize 2017-09-21 09:53:08: pid 2016: LOG: setting the local watchdog node name to "1.1.1.55:9999 Linux mgrdb-55" 2017-09-21 09:53:08: pid 2016: LOG: watchdog cluster is configured with 1 remote nodes 2017-09-21 09:53:08: pid 2016: LOG: watchdog remote node:0 on 1.1.1.56:9000 2017-09-21 09:53:08: pid 2016: LOG: interface monitoring is disabled in watchdog 2017-09-21 09:53:08: pid 2016: FATAL: failed to create watchdog command server socket 2017-09-21 09:53:08: pid 2016: DETAIL: bind on "/tmp/.s.PGPOOLWD_CMD.9000" failed with reason: "Address already in use" 2017-09-21 09:53:08: pid 2014: LOG: watchdog child process with pid: 2016 exits with status 768 2017-09-21 09:53:08: pid 2014: FATAL: watchdog child process exit with fatal error. exiting pgpool-II 2017-09-21 09:53:08: pid 2018: LOG: setting the local watchdog node name to "1.1.1.55:9999 Linux mgrdb-55" 2017-09-21 09:53:08: pid 2018: LOG: watchdog cluster is configured with 1 remote nodes 2017-09-21 09:53:08: pid 2018: LOG: watchdog remote node:0 on 1.1.1.56:9000 2017-09-21 09:53:08: pid 2018: LOG: interface monitoring is disabled in watchdog 2017-09-21 09:53:08: pid 2018: LOG: watchdog node state changed from [DEAD] to [LOADING] 2017-09-21 09:53:08: pid 2018: LOG: new outbond connection to 1.1.1.56:9000 2017-09-21 09:53:08: pid 2018: LOG: setting the remote node "1.1.1.56:9999 Linux mgrdb-56" as watchdog cluster master 2017-09-21 09:53:08: pid 2018: LOG: watchdog node state changed from [LOADING] to [INITIALIZING] 2017-09-21 09:53:09: pid 2018: LOG: watchdog node state changed from [INITIALIZING] to [STANDBY] 2017-09-21 09:53:09: pid 2018: LOG: successfully joined the watchdog cluster as standby node 2017-09-21 09:53:09: pid 2018: DETAIL: our join coordinator request is accepted by cluster leader node "1.1.1.56:9999 Linux mgrdb-56" 2017-09-21 09:53:10: pid 2018: LOG: new watchdog node connection is received from "1.1.1.56:40168" 2017-09-21 09:53:10: pid 2018: LOG: new node joined the cluster hostname:"1.1.1.56" port:9000 pgpool_port:9999 2017-09-21 09:55:01: pid 2018: LOG: watchdog received the failover command from "1.1.1.56:9999 Linux mgrdb-56" 2017-09-21 09:55:01: pid 2018: LOG: received failback request for node_id: 0 from pid [2018] wd_failover_id [42] when starting the faulty node again(server2).. this node elected as MASTER: 2017-09-21 10:49:11: pid 2018: LOG: new watchdog node connection is received from "1.1.1.56:30363" 2017-09-21 10:49:11: pid 2018: LOG: new node joined the cluster hostname:"1.1.1.56" port:9000 pgpool_port:9999 2017-09-21 10:49:11: pid 2018: LOG: failed to write watchdog packet to socket 2017-09-21 10:49:11: pid 2018: DETAIL: Connection reset by peer 2017-09-21 10:49:11: pid 2018: WARNING: the coordinator as per our record is not coordinator anymore 2017-09-21 10:49:11: pid 2018: DETAIL: re-initializing the cluster 2017-09-21 10:49:11: pid 2018: LOG: watchdog node state changed from [STANDBY] to [JOINING] 2017-09-21 10:49:11: pid 2018: LOG: unassigning the remote node "1.1.1.56:9999 Linux mgrdb-56" from watchdog cluster master 2017-09-21 10:49:11: pid 2018: LOG: new outbond connection to 1.1.1.56:9000 2017-09-21 10:49:11: pid 2018: LOG: watchdog node state changed from [JOINING] to [INITIALIZING] 2017-09-21 10:49:11: pid 2018: LOG: remote node "1.1.1.56:9999 Linux mgrdb-56" is shutting down 2017-09-21 10:49:12: pid 2018: LOG: I am the only alive node in the watchdog cluster 2017-09-21 10:49:12: pid 2018: HINT: skiping stand for coordinator state 2017-09-21 10:49:12: pid 2018: LOG: watchdog node state changed from [INITIALIZING] to [MASTER] 2017-09-21 10:49:12: pid 2018: LOG: I am announcing my self as master/coordinator watchdog node 2017-09-21 10:49:16: pid 2018: LOG: I am the cluster leader node 2017-09-21 10:49:16: pid 2018: DETAIL: our declare coordinator message is accepted by all nodes 2017-09-21 10:49:16: pid 2018: LOG: setting the local node "1.1.1.55:9999 Linux mgrdb-55" as watchdog cluster master 2017-09-21 10:49:16: pid 2018: LOG: I am the cluster leader node. Starting escalation process 2017-09-21 10:49:16: pid 2018: LOG: escalation process started with PID:10980 2017-09-21 10:49:16: pid 10980: LOG: watchdog: escalation started 2017-09-21 10:49:20: pid 10980: LOG: successfully acquired the delegate IP:"192.168.15.59" 2017-09-21 10:49:20: pid 10980: DETAIL: 'if_up_cmd' returned with success 2017-09-21 10:49:20: pid 2018: LOG: watchdog escalation process with pid: 10980 exit with SUCCESS. 2017-09-21 10:49:25: pid 2018: LOG: new watchdog node connection is received from "1.1.1.56:32923" 2017-09-21 10:49:25: pid 2018: LOG: new node joined the cluster hostname:"1.1.1.56" port:9000 pgpool_port:9999 2017-09-21 10:49:25: pid 2018: LOG: new outbond connection to 1.1.1.56:9000 2017-09-21 10:49:26: pid 2018: LOG: adding watchdog node "1.1.1.56:9999 Linux mgrdb-56" to the standby list 2017-09-21 10:50:15: pid 2018: LOG: remote node "1.1.1.56:9999 Linux mgrdb-56" is shutting down 2017-09-21 10:50:15: pid 2018: LOG: removing watchdog node "1.1.1.56:9999 Linux mgrdb-56" from the standby list | ||||
| Tags | No tags attached. | ||||