[pgpool-general: 3039] Re: Last node in pgpool chain failover

Yugo Nagata nagata at sraoss.co.jp
Thu Jul 17 20:33:05 JST 2014


Hi,

I looked at your pgpool.conf. If you use streaming replication,
master_slave_mode should be on and master_slave_sub_mode should be 'stream':

 master_slave_mode = on
 master_slave_sub_mode = 'stream'
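
With master_slave_sub_mode = 'stream', pgpool also uses the sr_check_*
parameters to connect to each backend and run pg_is_in_recovery(), which is
how it decides which node is the primary; this is probably why your log shows
"set new primary node: -1" (no primary found). A minimal sketch, where the
user and period are only example values for your environment:

 # streaming replication check: pgpool connects with this user and runs
 # pg_is_in_recovery() on every backend to tell the primary from the standbys
 sr_check_period = 10
 sr_check_user = 'postgres'
 sr_check_password = ''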
 
Also, when you recover node01 as a standby, you have to take a backup of the
data from the current primary (node02) and set up recovery.conf on node01
before attaching it to pgpool. Otherwise, node01 would come up not as a
standby but as a primary, since recovery.conf doesn't exist.
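
For example, the manual steps on node01 could look roughly like the
following. This is only a sketch: the data directory, the replication user
"replicator" and the pcp port/credentials are placeholders for your
environment, and pg_basebackup's -X stream option needs PostgreSQL 9.2 or
later.

 # on node01, after stopping postgres and clearing the old data directory
 node01$ pg_basebackup -h node02 -p 5432 -U replicator \
             -D /var/lib/postgresql/9.3/main -X stream -P

 # /var/lib/postgresql/9.3/main/recovery.conf
 standby_mode = 'on'
 primary_conninfo = 'host=node02 port=5432 user=replicator'

 node01$ pg_ctl -D /var/lib/postgresql/9.3/main start
 # then tell pgpool that backend 0 (node01) is back, now as a standby
 $ pcp_attach_node 10 localhost 9898 postgres <pcp password> 0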

You can use the "online recovery" functionality of pgpool to automate the
backup and the recovery.conf setup. In that case, the pcp_recovery_node
command is used rather than pcp_attach_node.
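
Roughly, that means setting the recovery_* parameters in pgpool.conf and
placing a recovery script in the primary's data directory; the script name
and pcp credentials below are placeholders, and the pgpool_recovery functions
also need to be installed on the backend, as the tutorial below describes.

 recovery_user = 'postgres'
 recovery_password = '<password>'
 # basebackup.sh lives in the primary's data directory; it takes the base
 # backup and writes recovery.conf on the node being recovered
 recovery_1st_stage_command = 'basebackup.sh'

Recovering node 0 (node01) is then a single command:

 $ pcp_recovery_node 10 localhost 9898 postgres <pcp password> 0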

The following tutorial would be helpful for setting up streaming replication with watchdog:
http://www.pgpool.net/pgpool-web/contrib_docs/watchdog_master_slave_3.3/en.html

There are sample conf files and scripts in the tutorial. Although pgpoolAdmin
is used there instead of the pcp commands, pgpoolAdmin's "Recovery" button is
equivalent to pcp_recovery_node.

On Wed, 16 Jul 2014 23:41:44 -0400
Long On <on.long.on at gmail.com> wrote:

> Thanks for looking into this Yugo. Sorry for the long message.
> 
> Start with node01 primary, node02 standby
> 
> -- pgpool.log --
> 2014-07-17 02:05:53 LOG:   pid 18498: wd_chk_setuid all commands have
> setuid bit
> 2014-07-17 02:05:53 LOG:   pid 18498: watchdog might call network commands
> which using setuid bit.
> 2014-07-17 02:05:53 LOG:   pid 18498: wd_create_send_socket: connect()
> reports failure (Connection refused). You can safely ignore this while
> starting up.
> 2014-07-17 02:05:53 LOG:   pid 18498: send_packet_4_nodes: packet for
> node02:9000 is canceled
> 2014-07-17 02:05:56 LOG:   pid 18498: wd_escalation: escalating to master
> pgpool
> 2014-07-17 02:05:58 LOG:   pid 18498: wd_escalation: escalated to master
> pgpool successfully
> 2014-07-17 02:05:58 LOG:   pid 18498: wd_init: start watchdog
> 2014-07-17 02:05:58 LOG:   pid 18498: pgpool-II successfully started.
> version 3.3.2 (tokakiboshi)
> 2014-07-17 02:05:59 LOG:   pid 18507: wd_create_hb_recv_socket: set
> SO_REUSEPORT
> 2014-07-17 02:05:59 LOG:   pid 18508: wd_create_hb_send_socket: set
> SO_REUSEPORT
> 2014-07-17 02:05:59 LOG:   pid 18509: wd_create_hb_recv_socket: set
> SO_REUSEPORT
> 2014-07-17 02:05:59 LOG:   pid 18510: wd_create_hb_send_socket: set
> SO_REUSEPORT
> 
> ...
> 
> Stop postgres on node01 to trigger failover to node02
> 
> 2014-07-17 02:27:10 LOG:   pid 18599: connection closed. retry to create
> new connection pool.
> 2014-07-17 02:27:10 ERROR: pid 18599: connect_inet_domain_socket:
> getsockopt() detected error: Connection refused
> 2014-07-17 02:27:10 ERROR: pid 18599: connection to node01(5432) failed
> 2014-07-17 02:27:10 ERROR: pid 18599: new_connection: create_cp() failed
> 2014-07-17 02:27:10 LOG:   pid 18599: degenerate_backend_set: 0 fail over
> request from pid 18599
> 2014-07-17 02:27:10 LOG:   pid 18498: wd_start_interlock: start interlocking
> 2014-07-17 02:27:10 LOG:   pid 18498: wd_assume_lock_holder: become a new
> lock holder
> 2014-07-17 02:27:11 LOG:   pid 18498: starting degeneration. shutdown host
> node01(5432)
> 2014-07-17 02:27:11 LOG:   pid 18498: Restart all children
> 2014-07-17 02:27:11 LOG:   pid 18498: execute command: /usr/bin/sudo -u
> postgres /var/lib/postgresql/failover_cmd.sh node01 node02
> 2014-07-17 02:27:11 LOG:   pid 18498: wd_end_interlock: end interlocking
> 2014-07-17 02:27:12 LOG:   pid 18498: failover: set new primary node: -1
> 2014-07-17 02:27:12 LOG:   pid 18498: failover: set new master node: 1
> 2014-07-17 02:27:12 LOG:   pid 18498: failover done. shutdown host
> node01(5432)
> 2014-07-17 02:27:12 LOG:   pid 18546: worker process received restart
> request
> 2014-07-17 02:27:13 LOG:   pid 18545: pcp child process received restart
> request
> 2014-07-17 02:27:13 LOG:   pid 18498: PCP child 18545 exits with status 256
> in failover()
> 2014-07-17 02:27:13 LOG:   pid 18498: fork a new PCP child pid 18745 in
> failover()
> 2014-07-17 02:27:13 LOG:   pid 18498: worker child 18546 exits with status
> 256
> 2014-07-17 02:27:13 LOG:   pid 18498: fork a new worker child pid 18746
> 
> ...
> 
> Failover to node02 is successful.
> node01 gets a backup of node02's postgresql and starts replicating.
> pgpool still thinks node01 is shut down. However, if pcp_attach_node is run
> here, then pgpool will make node01's postgresql the primary, but it is a
> standby of node02 right now.
> pcp_attach_node is not run.
> 
> Stop postgres on node02 to trigger failover to node01
> 
> 2014-07-17 02:37:12 LOG:   pid 18742: connection closed. retry to create
> new connection pool.
> 2014-07-17 02:37:12 ERROR: pid 18742: connect_inet_domain_socket:
> getsockopt() detected error: Connection refused
> 2014-07-17 02:37:12 ERROR: pid 18742: connection to node02(5432) failed
> 2014-07-17 02:37:12 ERROR: pid 18742: new_connection: create_cp() failed
> 2014-07-17 02:37:12 LOG:   pid 18742: degenerate_backend_set: 1 fail over
> request from pid 18742
> 2014-07-17 02:37:12 LOG:   pid 18498: wd_start_interlock: start interlocking
> 2014-07-17 02:37:12 LOG:   pid 18498: wd_assume_lock_holder: become a new
> lock holder
> 2014-07-17 02:37:13 LOG:   pid 18498: starting degeneration. shutdown host
> node02(5432)
> 2014-07-17 02:37:13 ERROR: pid 18498: failover_handler: no valid DB node
> found
> 2014-07-17 02:37:13 LOG:   pid 18498: Restart all children
> 2014-07-17 02:37:13 LOG:   pid 18498: execute command: /usr/bin/sudo -u
> postgres /var/lib/postgresql/failover_cmd.sh node02
> 
> Tailing the log shows that logging stopped here, so failover did not
> complete. Note the failover command is incomplete. It should be
> ... failover_cmd.sh node02 node01
> 
> Listing the pgpool processes shows all of them with a <defunct> tag. This
> makes sense since pgpool doesn't know node01 is now a standby and node02 has
> died. It thinks there are no "good" nodes left in the cluster.
> 
> The remaining log entries were logged when I exited my shell session and
> closed open jobs.
> 
> 2014-07-17 02:45:42 LOG:   pid 18498: wd_end_interlock: end interlocking
> 2014-07-17 02:45:42 LOG:   pid 18498: failover: set new primary node: -1
> 2014-07-17 02:45:43 LOG:   pid 18498: failover done. shutdown host
> node02(5432)
> 2014-07-17 02:45:44 LOG:   pid 18498: PCP child 18745 exits with status 0
> in failover()
> 2014-07-17 02:45:44 LOG:   pid 18498: fork a new PCP child pid 18977 in
> failover()
> 2014-07-17 02:45:44 LOG:   pid 18498: received smart shutdown request
> 2014-07-17 02:45:45 LOG:   pid 18511: wd_IP_down: ifconfig down succeeded
> -- /pgpool.log --
> 
> 
> I think the real question is how to re-attach node01 as a standby so pgpool
> will know about it.
> 
> On a related note, Muhammad Usama had pointed out in another thread that
> pgpool looks for specific conditions to determine if a node is primary. I
> think satisfying these conditions may help.
> 
> Long


-- 
Yugo Nagata <nagata at sraoss.co.jp>

