View Issue Details

ID: 0000121
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2014-12-02 23:37
Reporter: eldad
Assigned To: t-ishii
Priority: high
Severity: major
Reproducibility: always
Status: assigned
Resolution: open
Platform: Linux
OS: RHEL
OS Version: 6.3
Product Version:
Target Version:
Fixed in Version:
Summary: 0000121: Failback process ignores node status and lag limit
Description: I have a PostgreSQL 9.3.3 cluster with 4 nodes managed by pgpool-II 3.3.3 (active-standby),
using streaming replication (master_slave_mode = on, master_slave_sub_mode = 'stream').
node0 is the primary.
node0 was unavailable for a few minutes and pgpool performed a failover using the script we wrote. The failover was OK: node1 became the master, and node2 and node3 are slaves following node1. node0 is up but in status 3 (down).
12 hours later we attached node0 without first restoring it as a slave of node1.
pgpool immediately failed back to node0 and set it as primary, without checking its lag (delay_threshold = 1048576) against the current master (node1) and without checking whether this node is a slave of the current master.
This resulted in a loss of data until we realized what had happened.
We don't have any failback or follow-master script, only a failover script.

I would like to know whether this is supposed to happen and, if so, how to prevent it.
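
For context on the delay_threshold check the reporter expected: pgpool compares the byte difference between the primary's current WAL location and a standby's replay location against delay_threshold. A minimal sketch of that comparison for PostgreSQL 9.3 (host names are placeholders, and this only approximates the check; it is not pgpool's code):

#!/bin/sh
# Approximate pgpool's delay_threshold comparison (PostgreSQL 9.3 function
# names). MASTER and STANDBY are placeholder host names for this cluster.
MASTER=node1
STANDBY=node0
THRESHOLD=1048576   # bytes, as in delay_threshold

MASTER_LOC=$(psql -h "$MASTER" -At -c "SELECT pg_current_xlog_location()")
STANDBY_LOC=$(psql -h "$STANDBY" -At -c "SELECT pg_last_xlog_replay_location()")

# pg_xlog_location_diff() returns the difference in bytes (PostgreSQL 9.2+).
LAG=$(psql -h "$MASTER" -At -c "SELECT pg_xlog_location_diff('$MASTER_LOC', '$STANDBY_LOC')")

if [ "$LAG" -gt "$THRESHOLD" ]; then
  echo "standby $STANDBY lags $LAG bytes behind $MASTER (over threshold)"
fi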

here is the output from the log:
2014-11-12 03:58:09 LOG: pid 25764: send_failback_request: fail back 0 th node request from pid 25764
2014-11-12 03:58:09 LOG: pid 24452: wd_start_interlock: start interlocking
2014-11-12 03:58:09 LOG: pid 24452: wd_assume_lock_holder: become a new lock holder
2014-11-12 03:58:10 LOG: pid 24452: starting fail back. reconnect host ptast001ppdb10(5432)
2014-11-12 03:58:10 LOG: pid 24452: Do not restart children because we are failbacking node id 0 host node0 port:5432 and we are in streaming replication mode
2014-11-12 03:58:10 LOG: pid 24452: find_primary_node_repeatedly: waiting for finding a primary node
2014-11-12 03:58:10 LOG: pid 24452: find_primary_node: primary node id is 0
2014-11-12 03:58:10 LOG: pid 24452: wd_end_interlock: end interlocking
2014-11-12 03:58:10 LOG: pid 24452: failover: set new primary node: 0
2014-11-12 03:58:10 LOG: pid 24452: failover: set new master node: 0
2014-11-12 03:58:10 LOG: pid 24452: failback done. reconnect host node0 (5432)
2014-11-12 03:58:10 LOG: pid 25765: worker process received restart request
2014-11-12 03:58:11 LOG: pid 25764: pcp child process received restart request
2014-11-12 03:58:11 LOG: pid 24452: PCP child 25764 exits with status 256 in failover()
2014-11-12 03:58:11 LOG: pid 24452: fork a new PCP child pid 13729 in failover()
2014-11-12 03:58:11 LOG: pid 24452: worker child 25765 exits with status 256
2014-11-12 03:58:11 LOG: pid 24452: fork a new worker child pid 13730

Steps To Reproduce: In streaming replication mode:
1. Take the master offline to trigger a failover.
2. After the failover completes and a new master with synced slaves is in place, attach the old master back without restoring it as a slave (see the sketch below).
3. Failback runs without checking the lag on this node or its replication status.
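
The re-attach in step 2 goes through pcp_attach_node; a minimal sketch, assuming pgpool's PCP port is 9898 and placeholder credentials (node id 0 matches the old master above):

# Re-attach the old master (node id 0) without rebuilding it as a slave;
# pgpool then initiates the failback described above. 10 is the timeout
# in seconds; host, port, and PCP user/password are placeholders.
pcp_attach_node 10 localhost 9898 pcpuser pcppass 0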
Additional Information: PostgreSQL 9.3.3 cluster with 4 nodes managed by pgpool-II 3.3.3 (active-standby).
master_slave_mode = on
master_slave_sub_mode = 'stream'
delay_threshold = 1048576
follow_master_command = ''
failover_command = '/etc/pgpool-II/failover_stream.sh %d %h %m %H %P'
failback_command = ''
fail_over_on_backend_error = on
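
For reference, the placeholders in failover_command expand to: %d = failed node id, %h = failed node host name, %m = new master node id, %H = new master host name, %P = old primary node id. A minimal sketch of what a failover_stream.sh along those lines might contain (the trigger-file path and ssh access are assumptions; the actual script is site-specific):

#!/bin/sh
# failover_stream.sh %d %h %m %H %P -- a sketch only, not the reporter's script.
FAILED_NODE_ID=$1   # %d
FAILED_HOST=$2      # %h
NEW_MASTER_ID=$3    # %m
NEW_MASTER_HOST=$4  # %H
OLD_PRIMARY_ID=$5   # %P

# Promote the new master only if the failed node was the primary.
if [ "$FAILED_NODE_ID" = "$OLD_PRIMARY_ID" ]; then
  # Assumes recovery.conf on each standby points trigger_file at this path.
  ssh postgres@"$NEW_MASTER_HOST" "touch /var/lib/pgsql/trigger_file"
fi
exit 0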
Tags: No tags attached.

Activities

t-ishii

2014-11-14 08:06

developer   ~0000492

This is expected behavior. When attaching a node, users must be very careful about the state of the node (that's why pgpool won't attach a node automatically). In your situation, you already knew that there was a master node, but you added another master.

If you are not sure about the state of the node you are going to attach, you should shut down the node and use online recovery.
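
Online recovery is driven through pcp_recovery_node; a minimal sketch, assuming the recovery commands (recovery_1st_stage_command etc.) are already configured in pgpool.conf, with placeholder host, port, and credentials:

# Rebuild node 0 from the current master via online recovery instead of
# attaching it directly. 300 is the timeout in seconds; host, port, and
# PCP credentials are placeholders.
pcp_recovery_node 300 localhost 9898 pcpuser pcppass 0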

eldad

2014-12-02 23:37

reporter   ~0000499

Attaching the failed master was a human error (it should have been rebuilt as a slave first), but the question is why the failback was triggered at all, and why pgpool has no safety mechanism to avoid failing back to an unsynchronized node.
Is there an option to disable failback completely?
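
Absent such a built-in check, one operator-side guard is to verify that a node is actually running as a standby before attaching it; a minimal sketch (not a pgpool feature; host names, node id, and PCP credentials are placeholders):

#!/bin/sh
# Refuse to attach a node unless it reports itself in recovery (i.e. a standby).
NODE_HOST=node0
IN_RECOVERY=$(psql -h "$NODE_HOST" -At -c "SELECT pg_is_in_recovery()")

if [ "$IN_RECOVERY" = "t" ]; then
  pcp_attach_node 10 localhost 9898 pcpuser pcppass 0
else
  echo "refusing to attach: $NODE_HOST is not in recovery" >&2
  exit 1
fi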

Issue History

Date Modified Username Field Change
2014-11-13 22:57 eldad New Issue
2014-11-14 08:06 t-ishii Note Added: 0000492
2014-11-14 08:06 t-ishii Assigned To => t-ishii
2014-11-14 08:06 t-ishii Status new => feedback
2014-12-02 23:37 eldad Note Added: 0000499
2014-12-02 23:37 eldad Status feedback => assigned