[Pgpool-general] Able to do online recovery for failed nodes but not to attach new node

Tue Apr 21 07:14:50 UTC 2009

Hi,

I have pgpool configured in a load-balancing replication setup. I set it up with two backends, disconnected one backend, then used pcp_recovery_node to bring the degenerated node back into the cluster. That works fine.

However, I'm not able to use online recovery to dynamically add a node to the cluster. For example, I set up pgpool with only one backend defined in pgpool.conf and start pgpool. Then I edit pgpool.conf to add a new backend node and run pgpool reload. Then I run pcp_recovery_node in exactly the same way, with the same scripts that worked for recovering a degenerated node, except this time against the new node I defined. In this case it fails. Here is what I notice is different:

Recovering degenerated node:

2009-04-21 06:20:29 LOG:   pid 2101: starting recovery command: "SELECT pgpool_recovery('copy_base_backup', 'domU-12-31-39-02-B5-C2.compute-1.internal', '/vol/postgresdata')"
2009-04-21 06:20:29 DEBUG: pid 2101: exec_recovery: start recovery
2009-04-21 06:20:34 DEBUG: pid 2101: exec_recovery: finish recovery
2009-04-21 06:20:34 LOG:   pid 2101: 1st stage is done
2009-04-21 06:20:34 LOG:   pid 2101: starting 2nd stage
2009-04-21 06:20:34 LOG:   pid 2101: all connections from clients have been closed
2009-04-21 06:20:34 DEBUG: pid 2101: exec_checkpoint: start checkpoint
2009-04-21 06:20:34 DEBUG: pid 2101: exec_checkpoint: finish checkpoint
2009-04-21 06:20:34 LOG:   pid 2101: CHECKPOINT in the 2nd stage done
2009-04-21 06:20:34 LOG:   pid 2101: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr', 'domU-12-31-39-02-B5-C2.compute-1.internal', '/vol/postgresdata')"
2009-04-21 06:20:34 DEBUG: pid 2101: exec_recovery: start recovery
2009-04-21 06:20:34 DEBUG: pid 2101: exec_recovery: finish recovery
2009-04-21 06:20:34 DEBUG: pid 2101: exec_remote_start: start 

Trying recovery on a new node:

2009-04-21 06:25:04 LOG:   pid 2258: starting recovering node 1
2009-04-21 06:25:04 DEBUG: pid 2258: exec_checkpoint: start checkpoint
2009-04-21 06:25:04 DEBUG: pid 2258: exec_checkpoint: finish checkpoint
2009-04-21 06:25:04 LOG:   pid 2258: CHECKPOINT in the 1st stage done
2009-04-21 06:25:04 LOG:   pid 2258: starting recovery command: "SELECT pgpool_recovery('copy_base_backup', 'localhost', '/vol/postgresdata')"
2009-04-21 06:25:04 DEBUG: pid 2258: exec_recovery: start recovery
2009-04-21 06:25:05 DEBUG: pid 2258: exec_recovery: finish recovery
2009-04-21 06:25:05 LOG:   pid 2258: 1st stage is done
2009-04-21 06:25:05 LOG:   pid 2258: starting 2nd stage
2009-04-21 06:25:05 LOG:   pid 2258: all connections from clients have been closed
2009-04-21 06:25:05 DEBUG: pid 2258: exec_checkpoint: start checkpoint
2009-04-21 06:25:05 DEBUG: pid 2258: exec_checkpoint: finish checkpoint
2009-04-21 06:25:05 LOG:   pid 2258: CHECKPOINT in the 2nd stage done
2009-04-21 06:25:05 LOG:   pid 2258: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr', 'localhost', '/vol/postgresdata')"
2009-04-21 06:25:05 DEBUG: pid 2258: exec_recovery: start recovery
2009-04-21 06:25:05 DEBUG: pid 2258: exec_recovery: finish recovery
2009-04-21 06:25:05 DEBUG: pid 2258: exec_remote_start: start pgpool_remote_start
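
To compare the two runs, the host that pgpool passes to the recovery script can be pulled out of "starting recovery command" lines like the ones above. This is only a rough sketch: the regular expression is written against the message format shown in these logs, and recovery_target is a hypothetical helper name, not something that ships with pgpool.

```python
import re

# Matches the SQL call as it appears in the pgpool log lines above:
#   SELECT pgpool_recovery('<script>', '<host>', '<data dir>')
RECOVERY_CMD = re.compile(
    r"pgpool_recovery\('([^']*)',\s*'([^']*)',\s*'([^']*)'\)"
)


def recovery_target(log_line):
    """Return the host argument of a pgpool_recovery() call, or None."""
    match = RECOVERY_CMD.search(log_line)
    if match is None:
        return None
    _script, host, _data_dir = match.groups()
    return host
```

Run over both logs, this would return the real backend hostname for the degenerated-node case and "localhost" for the newly added node.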

Notice that in the second case it is using "localhost" as the host parameter (and localhost is not the node I'm trying to recover). If I start pgpool with only one backend, then add a backend to the conf file and run pgpool reload, pgpool does read the new setting correctly; e.g., I see

2009-04-21 07:09:42 DEBUG: pid 2904: key: backend_hostname1
2009-04-21 07:09:42 DEBUG: pid 2904: value: domU-12-31-39-02-B5-C2.compute-1.internal kind: 5

in the output, and the health check registers the new node as disconnected:

2009-04-21 07:10:38 DEBUG: pid 2904: health_check: 0 th DB node status: 1
2009-04-21 07:10:38 DEBUG: pid 2904: health_check: 1 th DB node status: 3

But, when I use pcp_node_info on node 1, here is the output:

 5432 3 1073741823.500000

There is no hostname printed!
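
To check for this from a script, something like the following might work. It is a minimal sketch that assumes pcp_node_info prints hostname, port, status and weight on one line separated by single spaces (inferred from the output above); parse_node_info is a hypothetical helper name, not part of pgpool.

```python
def parse_node_info(line):
    """Split one line of pcp_node_info output into its four fields."""
    # Split on single spaces rather than arbitrary whitespace, so that an
    # empty hostname (a line that starts with a space) is kept as "".
    fields = line.rstrip("\n").split(" ")
    if len(fields) != 4:
        raise ValueError("expected 4 fields, got %d: %r" % (len(fields), line))
    hostname, port, status, weight = fields
    return {
        "hostname": hostname,
        "port": int(port),
        "status": int(status),
        "weight": float(weight),
    }


info = parse_node_info(" 5432 3 1073741823.500000")
if not info["hostname"]:
    print("node has no hostname registered")
```

For the line shown above, the hostname field comes back empty while the port and status parse normally.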

Does anyone have any ideas? Given all this, I suspect pgpool does not fully recognize the new host after a reload, and that this is why recovering it fails.
