[pgpool-general: 1451] Re: pgPool Online Recovery Streaming Replication

ning chan ninchan8328 at gmail.com
Fri Mar 1 14:42:23 JST 2013


I know that failover.py did not execute because it did not print anything
to failover.log. When it is executed, I see the activity logged
to the file; below is a sample:

Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  FAILED NODEID: 0
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  FAILED HOSTNAME: server0
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  FAILED PORTNUMBER: 5432
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  FAILED DBPath:
/opt/postgres/9.2/data
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  NEW_MASTER_NODEID: 1
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  NEW_MASTER_HOSTNAME: server1
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  OLD_MASTER_NODEID: 0
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  OLD_PRIMARY_NODEID: 0
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  NEW_MASTER_PORT: 5432
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover  DEBUG:  -->
Thu Feb 28 16:04:04 2013 failover  DEBUG:  NEW_MASTER_DBPATH:
/opt/postgres/9.2/data
Thu Feb 28 16:04:04 2013 failover  DEBUG:  <--
Thu Feb 28 16:04:04 2013 failover.run  ERROR:  Failed server server0 is an
old Primary, and localhost is the new primary, let's failover.
Thu Feb 28 16:04:04 2013 failover.run  DEBUG:  Executing /bin/touch
/opt/postgres/9.2/data/trigger_file0
Thu Feb 28 16:04:04 2013 failover.run  DEBUG:  Failover has been triggered
successfully
Thu Feb 28 16:04:04 2013 failover.run  DEBUG:  <--
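For reference, the contents of failover.py are not reproduced in this thread. A minimal shell sketch of what the log above suggests it does is given below (the real script is Python; the option letters follow the command line pgpool logs and the trigger-file name is taken from the DEBUG output, so both are assumptions rather than the actual implementation):

#! /bin/sh
# Hypothetical sketch only -- the real failover.py is not shown here.
# Options mirror the arguments pgpool passes on failover:
#   -d failed node id   -h failed host      -p failed port   -D failed data dir
#   -m new master id    -H new master host  -M old master id -P old primary id
#   -r new master port  -R new master data dir
while getopts "d:h:p:D:m:H:M:P:r:R:" opt; do
    case $opt in
        d) FAILED_NODEID=$OPTARG ;;
        P) OLD_PRIMARY_NODEID=$OPTARG ;;
        R) NEW_MASTER_DBPATH=$OPTARG ;;
        *) ;;   # the remaining arguments would only be logged
    esac
done

# Promote only when the node that failed was the old primary; pgpool runs
# this script on the surviving node, so the trigger file is touched locally.
if [ "$FAILED_NODEID" = "$OLD_PRIMARY_NODEID" ]; then
    /bin/touch "$NEW_MASTER_DBPATH/trigger_file0"
fi
exit 0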


On Thu, Feb 28, 2013 at 11:38 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> > Hi Tatsuo,
> > I reproduced the problem on pgpool 3.2.3.
> > Basically, server1 is originally the primary and server0 is the standby. When
> > server1 kernel panics, server0 fails over and becomes the primary, which is
> > expected.
> > When the failed server1 comes back online, I execute pcp_recovery on server0 to
> > recover it and connect it to the primary as a standby, which also works as expected.
> > Now, if server0 kernel panics, pgpool on server1 triggers the failover
> > script; however, the script does not really get executed, although it is
> > logged in the pgpool log.
>
> How did you determine that the failover script did not get executed?
> I see this line:
>
> Feb 28 21:47:41 server1 pgpool[2054]: execute command:
> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
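For reference, the -d/-h/-p/-D/-m/-H/-M/-P/-r/-R arguments in the command above correspond to pgpool-II's failover_command placeholders. A pgpool.conf entry along these lines (an illustrative sketch, not necessarily the poster's actual setting) would produce exactly such a command:

failover_command = '/home/pgpool/failover.py -d %d -h %h -p %p -D %D -m %m -H %H -M %M -P %P -r %r -R %R'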
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > See below for the trace:
> >
> > Feb 28 21:47:40 server1 pgpool[3593]: connect_inet_domain_socket:
> select()
> > timedout. retrying...
> > Feb 28 21:47:40 server1 pgpool[3679]: connect_inet_domain_socket:
> select()
> > timedout. retrying...
> > Feb 28 21:47:41 server1 pgpool[3594]: wd_create_send_socket: connect()
> > reports failure (No route to host). You can safely ignore this while
> > starting up.
> > Feb 28 21:47:41 server1 pgpool[2054]: pool_flush_it: write failed to
> > backend (0). reason: No route to host offset: 0 wlen: 42
> > Feb 28 21:47:41 server1 pgpool[2054]: notice_backend_error: called from
> > pgpool main. ignored.
> > Feb 28 21:47:41 server1 pgpool[2054]: child_exit: called from pgpool
> main.
> > ignored.
> > Feb 28 21:47:41 server1 pgpool[2054]: pool_read: EOF encountered with
> > backend
> > Feb 28 21:47:41 server1 pgpool[2054]: s_do_auth: error while reading
> > message kind
> > Feb 28 21:47:41 server1 pgpool[2054]: make_persistent_db_connection:
> > s_do_auth failed
> > Feb 28 21:47:41 server1 pgpool[2054]: health check failed. 0 th host
> > server0 at port 5432 is down
> > Feb 28 21:47:41 server1 pgpool[2054]: health check retry sleep time: 5
> > second(s)
> > Feb 28 21:47:41 server1 pgpool[2054]: starting degeneration. shutdown
> host
> > server0(5432)
> > Feb 28 21:47:41 server1 pgpool[2054]: Restart all children
> > Feb 28 21:47:41 server1 pgpool[2054]: execute command:
> > /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D
> /opt/postgres/9.2/data
> > -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
> > Feb 28 21:47:41 server1 pgpool[2054]: find_primary_node_repeatedly:
> waiting
> > for finding a primary node
> > Feb 28 21:47:41 server1 pgpool[2054]: connect_inet_domain_socket_by_port:
> > health check timer expired
> > Feb 28 21:47:41 server1 pgpool[2054]: make_persistent_db_connection:
> > connection to server1(5432) failed
> > Feb 28 21:47:41 server1 pgpool[2054]: find_primary_node:
> > make_persistent_connection failed
> > Feb 28 21:47:41 server1 pgpool[3679]: connect_inet_domain_socket:
> select()
> > timedout. retrying...
> > Feb 28 21:47:42 server1 pgpool[2054]: connect_inet_domain_socket_by_port:
> > health check timer expired
> >
> > The above logs repeat a few times, and after a few seconds, I see this:
> > Feb 28 21:47:47 server1 pgpool[3679]: check_replication_time_lag: could
> not
> > connect to DB node 0, check sr_check_user and sr_check_password
> >
> > Feb 28 21:48:24 server1 pgpool[2054]: find_primary_node:
> > make_persistent_connection failed
> > Feb 28 21:48:25 server1 pgpool[3679]: connect_inet_domain_socket:
> select()
> > timedout. retrying...
> > Feb 28 21:48:25 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 0 (server1:9999) seems not to be working
> > Feb 28 21:48:25 server1 pgpool[3679]: connect_inet_domain_socket:
> > getsockopt() detected error: No route to host
> > Feb 28 21:48:25 server1 pgpool[3679]: make_persistent_db_connection:
> > connection to server0(5432) failed
> > Feb 28 21:48:25 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 1 (server0:9999) seems not to be working
> > Feb 28 21:48:25 server1 pgpool[2057]: wd_lifecheck: watchdog status is
> > DOWN. You need to restart this for recovery.
> > Feb 28 21:48:25 server1 pgpool[3679]: check_replication_time_lag: could
> not
> > connect to DB node 0, check sr_check_user and sr_check_password
> >
> > Feb 28 21:48:44 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 0 (server1:9999) seems not to be working
> > Feb 28 21:48:44 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 1 (server0:9999) seems not to be working
> >
> > Feb 28 21:49:10 server1 pgpool[2054]: connect_inet_domain_socket_by_port:
> > health check timer expired
> > Feb 28 21:49:10 server1 pgpool[2054]: make_persistent_db_connection:
> > connection to server1(5432) failed
> > Feb 28 21:49:10 server1 pgpool[2054]: find_primary_node:
> > make_persistent_connection failed
> > Feb 28 21:49:11 server1 pgpool[2054]: failover: no follow backends are
> > degenerated
> > Feb 28 21:49:11 server1 pgpool[2054]: failover: set new primary node: -1
> > Feb 28 21:49:11 server1 pgpool[2054]: failover: set new master node: 1
> > Feb 28 21:49:11 server1 pgpool[4012]: do_child: failback event found.
> > restart myself.
> > Feb 28 21:49:11 server1 pgpool[4013]: do_child: failback event found.
> > restart myself.
> > Feb 28 21:49:11 server1 pgpool[4014]: do_child: failback event found.
> > restart myself.
> > Feb 28 21:49:11 server1 pgpool[4015]: do_child: failback event found.
> > restart myself.
> >
> > Feb 28 21:49:11 server1 pgpool[4030]: do_child: failback event found.
> > restart myself.
> > Feb 28 21:49:11 server1 pgpool[4039]: do_child: failback event found.
> > restart myself.
> > Feb 28 21:49:11 server1 pgpool[4040]: connection received:
> > host=server1.local port=41474
> > Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect
> packet
> > length (0)
> > Feb 28 21:49:11 server1 pgpool[4040]: connection received:
> > host=server1.local port=41481
> > Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect
> packet
> > length (0)
> > Feb 28 21:49:11 server1 pgpool[4040]: connection received:
> > host=server1.local port=41485
> > Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect
> packet
> > length (0)
> > Feb 28 21:49:11 server1 pgpool[4040]: connection received:
> > host=server1.local port=41491
> > Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect
> packet
> > length (0)
> > Feb 28 21:49:11 server1 pgpool[2054]: failover done. shutdown host
> > server0(5432)
> > Feb 28 21:49:11 server1 pgpool[4029]: do_child: failback event found.
> > restart myself.
> > Feb 28 21:49:12 server1 pgpool[3678]: pcp child process received restart
> > request
> > Feb 28 21:49:12 server1 pgpool[2054]: PCP child 3678 exits with status
> 256
> > in failover()
> > Feb 28 21:49:12 server1 pgpool[2054]: fork a new PCP child pid 4042 in
> > failover()
> > Feb 28 21:49:12 server1 pgpool[2054]: worker child 3679 exits with status
> > 256
> > Feb 28 21:49:12 server1 pgpool[2054]: fork a new worker child pid 4043
> > Feb 28 21:49:12 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
> > health check timer expired
> > Feb 28 21:49:12 server1 pgpool[4043]: make_persistent_db_connection:
> > connection to server1(5432) failed
> > Feb 28 21:49:12 server1 pgpool[2054]: after some retrying backend
> returned
> > to healthy state
> > Feb 28 21:49:13 server1 pgpool[4073]: connection received:
> > host=server1.local port=41496
> > Feb 28 21:49:13 server1 pgpool[4073]: connect_inet_domain_socket_by_port:
> > health check timer expired
> > Feb 28 21:49:13 server1 pgpool[4073]: connection to server1(5432) failed
> > Feb 28 21:49:13 server1 pgpool[4073]: new_connection: create_cp() failed
> > Feb 28 21:49:13 server1 pgpool[4073]: degenerate_backend_set: 1 fail over
> > request from pid 4073
> > Feb 28 21:49:13 server1 pgpool[2054]: starting degeneration. shutdown
> host
> > server1(5432)
> > Feb 28 21:49:13 server1 pgpool[2054]: failover_handler: no valid DB node
> > found
> > Feb 28 21:49:13 server1 pgpool[2054]: Restart all children
> > Feb 28 21:49:13 server1 pgpool[2054]: execute command:
> > /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D
> /opt/postgres/9.2/data
> > -m -1 -H  -M 1 -P 1 -r  -R
> > Feb 28 21:49:13 server1 pgpool[2054]: find_primary_node_repeatedly:
> waiting
> > for finding a primary node
> > Feb 28 21:49:16 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 0 (server1:9999) seems not to be working
> > Feb 28 21:49:16 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 1 (server0:9999) seems not to be working
> > Feb 28 21:49:16 server1 pgpool[2057]: wd_lifecheck: watchdog status is
> > DOWN. You need to restart this for recovery.
> > Feb 28 21:49:22 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
> > health check timer expired
> > Feb 28 21:49:22 server1 pgpool[4043]: make_persistent_db_connection:
> > connection to server1(5432) failed
> > Feb 28 21:49:32 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
> > health check timer expired
> > Feb 28 21:49:32 server1 pgpool[4043]: make_persistent_db_connection:
> > connection to server1(5432) failed
> > Feb 28 21:49:35 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 0 (server1:9999) seems not to be working
> > Feb 28 21:49:35 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
> > times. pgpool 1 (server0:9999) seems not to be working
> > Feb 28 21:49:35 server1 pgpool[2057]: wd_lifecheck: watchdog status is
> > DOWN. You need to restart this for recovery.
> > Feb 28 21:49:42 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
> > health check timer expired
> >
> >
> > Question 1: Why did the failover script not execute? Is it because the
> > primary is -1?
> > Question 2: Why did the failover script trigger again?
> > Question 3: Why did pgpool on server1 fail? Why is the watchdog on
> > server1 down?
> >
> > Thanks~
> > Ning
> >
> >
> >
> >
> >
> > On Wed, Feb 27, 2013 at 10:44 PM, ning chan <ninchan8328 at gmail.com>
> wrote:
> >
> >> Hi Tatsuo,
> >> I am on 3.2.1. I will try 3.2.3 and report back to you.
> >>
> >> Meanwhile, I have a question about the backend server settings in
> >> pgpool.conf.
> >> I have two pgpools, each pointing to the same backend servers;
> >> should both pgpool.conf files have the same backend_hostname configuration?
> >>
> >> *****pgpool-A****
> >> backend_hostname0 = 'se032c-94-30'
> >> backend_port0 = 5432
> >> backend_weight0 = 1
> >> backend_data_directory0 = '/opt/postgres/9.2/data'
> >> backend_flag0 = 'ALLOW_TO_FAILOVER'
> >>
> >> backend_hostname1 = 'se032c-94-31'
> >> backend_port1 = 5432
> >> backend_weight1 = 1
> >> backend_data_directory1 = '/opt/postgres/9.2/data'
> >> backend_flag1 = 'ALLOW_TO_FAILOVER'
> >>
> >>
> >> ****pgpool-B****
> >> backend_hostname0 = 'se032c-94-30'
> >> backend_port0 = 5432
> >> backend_weight0 = 1
> >> backend_data_directory0 = '/opt/postgres/9.2/data'
> >> backend_flag0 = 'ALLOW_TO_FAILOVER'
> >>
> >> backend_hostname1 = 'se032c-94-31'
> >> backend_port1 = 5432
> >> backend_weight1 = 1
> >> backend_data_directory1 = '/opt/postgres/9.2/data'
> >> backend_flag1 = 'ALLOW_TO_FAILOVER'
> >>
> >> Thanks~
> >> Ning
> >>
> >>
> >> On Wed, Feb 27, 2013 at 6:12 PM, Tatsuo Ishii <ishii at postgresql.org
> >wrote:
> >>
> >>> What version of pgpool-II are you using?  We have found a failover
> >>> handling problem with 3.2.2 and released 3.2.3 afterwards.
> >>> --
> >>> Tatsuo Ishii
> >>> SRA OSS, Inc. Japan
> >>> English: http://www.sraoss.co.jp/index_en.php
> >>> Japanese: http://www.sraoss.co.jp
> >>>
> >>> > Hi Tatsuo,
> >>> > Do you have any chance to look at this problem yet?
> >>> >
> >>> > Thanks~
> >>> > Ning
> >>> >
> >>> >
> >>> > On Tue, Feb 19, 2013 at 3:02 PM, ning chan <ninchan8328 at gmail.com>
> >>> wrote:
> >>> >
> >>> >> Hi Tatsuo,
> >>> >> I traced the problem to a scripting error in
> >>> >> pgpool_remote_start, which had an incorrect path to pg_ctl.
> >>> >> As soon as I corrected it, I could recover the failed server and bring it
> >>> >> online.
> >>> >> I am now, however, facing another problem.
> >>> >> After I bring the failed master back into pgpool as a standby server,
> >>> >> I then shut down the current primary server, expecting the standby server
> >>> >> to be promoted to become the new primary; however, it did not
> >>> >> happen.
> >>> >> I check the pgpool log and see that the failover command is called, but it
> >>> >> did not execute. I have checked and confirmed that my failover script works just fine.
> >>> >>
> >>> >> Here is the log:
> >>> >> Feb 19 14:51:40 server0 pgpool[3519]: set 1 th backend down status
> >>> >> Feb 19 14:51:43 server0 pgpool[3554]: wd_create_send_socket:
> connect()
> >>> >> reports failure (No route to host). You can safely ignore this while
> >>> >> starting up.
> >>> >> Feb 19 14:51:43 server0 pgpool[3522]: wd_lifecheck: lifecheck
> failed 3
> >>> >> times. pgpool seems not to be working
> >>> >> Feb 19 14:51:43 server0 pgpool[3522]: wd_IP_down: ifconfig down
> >>> succeeded
> >>> >> Feb 19 14:51:43 server0 pgpool[3519]: starting degeneration.
> shutdown
> >>> host
> >>> >> server1(5432)
> >>> >> Feb 19 14:51:43 server0 pgpool[3519]: Restart all children
> >>> >> Feb 19 14:51:43 server0 pgpool[3519]: execute command:
> >>> >> /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D
> >>> /opt/postgres/9.2/data
> >>> >> -m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
> >>> >> Feb 19 14:51:43 server0 postgres[3939]: [2-1] LOG:  incomplete
> startup
> >>> >> packet
> >>> >> Feb 19 14:51:43 server0 postgres[3931]: [2-1] LOG:  incomplete
> startup
> >>> >> packet
> >>> >> Feb 19 14:51:43 server0 postgres[3935]: [2-1] LOG:  incomplete
> startup
> >>> >> packet
> >>> >> Feb 19 14:51:43 server0 pgpool[3519]: find_primary_node_repeatedly:
> >>> >> waiting for finding a primary node
> >>> >> Feb 19 14:51:46 server0 pgpool[3522]: wd_create_send_socket:
> connect()
> >>> >> reports failure (No route to host). You can safely ignore this while
> >>> >> starting up.
> >>> >> Feb 19 14:51:46 server0 pgpool[3522]: wd_lifecheck: lifecheck
> failed 3
> >>> >> times. pgpool seems not to be working
> >>> >> Feb 19 14:51:54 server0 pgpool[3556]: connect_inet_domain_socket:
> >>> >> connect() failed: Connection timed out
> >>> >> Feb 19 14:51:54 server0 pgpool[3556]: make_persistent_db_connection:
> >>> >> connection to server1(5432) failed
> >>> >> Feb 19 14:51:54 server0 pgpool[3556]: check_replication_time_lag:
> could
> >>> >> not connect to DB node 1, check sr_check_user and sr_check_password
> >>> >> Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck
> failed 3
> >>> >> times. pgpool seems not to be working
> >>> >> Feb 19 14:52:05 server0 pgpool[3556]: connect_inet_domain_socket:
> >>> >> connect() failed: No route to host
> >>> >> Feb 19 14:52:05 server0 pgpool[3556]: make_persistent_db_connection:
> >>> >> connection to server1(5432) failed
> >>> >> Feb 19 14:52:05 server0 pgpool[3556]: check_replication_time_lag:
> could
> >>> >> not connect to DB node 1, check sr_check_user and sr_check_password
> >>> >> Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck
> failed 3
> >>> >> times. pgpool seems not to be working
> >>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover: no follow backends
> are
> >>> >> degenerated
> >>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover: set new primary
> node:
> >>> -1
> >>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover: set new master
> node: 0
> >>> >> Feb 19 14:52:14 server0 pgpool[3980]: connection received:
> >>> >> host=server0.local port=45361
> >>> >> Feb 19 14:52:14 server0 pgpool[3556]: worker process received
> restart
> >>> >> request
> >>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover done. shutdown host
> >>> >> server1(5432)
> >>> >>
> >>> >> As you can see, PostgreSQL was not restarted.
> >>> >>
> >>> >> From a successful failover, the log should look like this:
> >>> >> Feb 19 13:47:01 server0 pgpool[4391]: execute command:
> >>> >> /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D
> >>> /opt/postgres/9.2/data
> >>> >> -m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
> >>> >> Feb 19 13:47:01 server0 postgres[4786]: [2-1] LOG:  incomplete
> startup
> >>> >> packet
> >>> >> Feb 19 13:47:02 server0 pgpool[4391]: find_primary_node_repeatedly:
> >>> >> waiting for finding a primary node
> >>> >> Feb 19 13:47:06 server0 postgres[3083]: [6-1] LOG:  trigger file
> found:
> >>> >> /opt/postgres/9.2/data/trigger_file0
> >>> >> Feb 19 13:47:06 server0 postgres[3083]: [7-1] LOG:  redo done at
> >>> 0/91000020
> >>> >> Feb 19 13:47:06 server0 postgres[3083]: [8-1] LOG:  selected new
> >>> timeline
> >>> >> ID: 18
> >>> >> Feb 19 13:47:06 server0 postgres[3083]: [9-1] LOG:  archive recovery
> >>> >> complete
> >>> >> Feb 19 13:47:06 server0 postgres[3081]: [2-1] LOG:  database system
> is
> >>> >> ready to accept connections
> >>> >> Feb 19 13:47:06 server0 postgres[4804]: [2-1] LOG:  autovacuum
> launcher
> >>> >> started
> >>> >> Feb 19 13:47:07 server0 pgpool[4391]: find_primary_node: primary
> node
> >>> id
> >>> >> is 0
> >>> >> Feb 19 13:47:07 server0 pgpool[4391]: starting follow degeneration.
> >>> >> shutdown host server1(5432)
> >>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover: 1 follow backends
> have
> >>> >> been degenerated
> >>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover: set new primary
> node: 0
> >>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover: set new master
> node: 0
> >>> >> Feb 19 13:47:07 server0 pgpool[4817]: connection received:
> >>> >> host=server1.local port=54619
> >>> >> Feb 19 13:47:07 server0 pgpool[4816]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4818]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4819]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4815]: start triggering follow
> command.
> >>> >> Feb 19 13:47:07 server0 pgpool[4815]: execute command: echo 1
> server1
> >>> 5432
> >>> >> /opt/postgres/9.2/data 0 server0 0 1 5432 /opt/postgres/9.2/data %
> >>> >> Feb 19 13:47:07 server0 pgpool[4821]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4822]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4823]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4820]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4827]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4828]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4826]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4829]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4830]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4831]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4832]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4835]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4836]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4837]: do_child: failback event
> found.
> >>> >> restart myself.
> >>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover done. shutdown host
> >>> >> server1(5432)
> >>> >>
> >>> >> Any idea why failover did not work after the recovery? Do I need to
> >>> >> restart the remote pgpool after recovery?
> >>> >>
> >>> >>
> >>> >> On Mon, Feb 18, 2013 at 4:24 PM, ning chan <ninchan8328 at gmail.com>
> >>> wrote:
> >>> >>
> >>> >>>
> >>> >>> Hi Tatsuo,
> >>> >>> Sorry for the late reply as I was traveling.
> >>> >>> I was able to reproduce the problem, and here is what I did:
> >>> >>>
> >>> >>> *1) Make sure primary (server0) and standby (server1) connections
> are
> >>> >>> good.*
> >>> >>> [root at server0 tmp]# psql -c "show pool_nodes" -p 9999
> >>> >>>  node_id | hostname | port | status | lb_weight |  role
> >>> >>> ---------+----------+------+--------+-----------+---------
> >>> >>>  0       | server0  | 5432 | 2      | 0.500000  | primary
> >>> >>>  1       | server1  | 5432 | 2      | 0.500000  | standby
> >>> >>> (2 rows)
> >>> >>>
> >>> >>> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
> >>> >>>  node_id | hostname | port | status | lb_weight |  role
> >>> >>> ---------+----------+------+--------+-----------+---------
> >>> >>>  0       | server0  | 5432 | 2      | 0.500000  | primary
> >>> >>>  1       | server1  | 5432 | 2      | 0.500000  | standby
> >>> >>> (2 rows)
> >>> >>>
> >>> >>> *2) Force shutdown of the primary (server0); the standby (server1) was
> >>> >>> promoted. Below is the pgpool log from server1 during the
> >>> >>> failover:*
> >>> >>> Feb 18 15:31:45 server1 pgpool[2691]: connect_inet_domain_socket:
> >>> >>> connect() failed: No route to host
> >>> >>> Feb 18 15:31:45 server1 pgpool[2691]:
> make_persistent_db_connection:
> >>> >>> connection to server0(5432) failed
> >>> >>> Feb 18 15:31:45 server1 pgpool[2691]: check_replication_time_lag:
> >>> could
> >>> >>> not connect to DB node 0, check sr_check_user and sr_check_password
> >>> >>> Feb 18 15:31:45 server1 pgpool[2689]: connect_inet_domain_socket:
> >>> >>> connect() failed: No route to host
> >>> >>> Feb 18 15:31:45 server1 pgpool[2689]: connection to server0(5432)
> >>> failed
> >>> >>> Feb 18 15:31:45 server1 pgpool[2689]: new_connection: create_cp()
> >>> failed
> >>> >>> Feb 18 15:31:45 server1 pgpool[2689]: degenerate_backend_set: 0
> fail
> >>> over
> >>> >>> request from pid 2689
> >>> >>> Feb 18 15:31:48 server1 pgpool[2689]: wd_create_send_socket:
> connect()
> >>> >>> reports failure (No route to host). You can safely ignore this
> while
> >>> >>> starting up.
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]: connect_inet_domain_socket:
> >>> >>> connect() failed: No route to host
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]:
> make_persistent_db_connection:
> >>> >>> connection to server0(5432) failed
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]: health check failed. 0 th
> host
> >>> >>> server0 at port 5432 is down
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]: health check retry sleep
> time: 5
> >>> >>> second(s)
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]: starting degeneration.
> shutdown
> >>> >>> host server0(5432)
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]: Restart all children
> >>> >>> Feb 18 15:31:48 server1 pgpool[2653]: execute command:
> >>> >>> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D
> >>> /opt/postgres/9.2/data
> >>> >>> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
> >>> >>> Feb 18 15:31:49 server1 pgpool[2653]: find_primary_node_repeatedly:
> >>> >>> waiting for finding a primary node
> >>> >>> Feb 18 15:31:53 server1 postgres[1970]: [7-1] LOG:  trigger file
> >>> found:
> >>> >>> /tmp/trigger_file0
> >>> >>> Feb 18 15:31:53 server1 postgres[2597]: [3-1] FATAL:  terminating
> >>> >>> walreceiver process due to administrator command
> >>> >>> Feb 18 15:31:53 server1 postgres[1970]: [8-1] LOG:  record with
> zero
> >>> >>> length at 0/64000418
> >>> >>> Feb 18 15:31:53 server1 postgres[1970]: [9-1] LOG:  redo done at
> >>> >>> 0/640003B8
> >>> >>> Feb 18 15:31:53 server1 postgres[1970]: [10-1] LOG:  selected new
> >>> >>> timeline ID: 15
> >>> >>> Feb 18 15:31:53 server1 postgres[1970]: [11-1] LOG:  archive
> recovery
> >>> >>> complete
> >>> >>> Feb 18 15:31:53 server1 postgres[1968]: [2-1] LOG:  database
> system is
> >>> >>> ready to accept connections
> >>> >>> Feb 18 15:31:53 server1 postgres[3168]: [2-1] LOG:  autovacuum
> >>> launcher
> >>> >>> started
> >>> >>> Feb 18 15:31:54 server1 pgpool[2653]: find_primary_node: primary
> node
> >>> id
> >>> >>> is 1
> >>> >>> Feb 18 15:31:54 server1 pgpool[2653]: starting follow degeneration.
> >>> >>> shutdown host server0(5432)
> >>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover: 1 follow backends
> have
> >>> >>> been degenerated
> >>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new primary
> node:
> >>> 1
> >>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new master
> node: 1
> >>> >>> Feb 18 15:31:54 server1 pgpool[3171]: connection received:
> >>> >>> host=server1.local port=58346
> >>> >>> Feb 18 15:31:54 server1 pgpool[3170]: start triggering follow
> command.
> >>> >>> Feb 18 15:31:54 server1 pgpool[3170]: execute command: echo 0
> server0
> >>> >>> 5432 /opt/postgres/9.2/data 1 server1 0 0 5432
> /opt/postgres/9.2/data
> >>> %
> >>> >>> Feb 18 15:31:54 server1 pgpool[2691]: worker process received
> restart
> >>> >>> request
> >>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover done. shutdown host
> >>> >>> server0(5432)
> >>> >>>
> >>> >>> *3) Boot up the failed server (server0) as usual; pgpool and the database
> >>> >>> start up as usual.*
> >>> >>> *After server0 boots up, I check the pool_nodes info; server0's pgpool
> >>> >>> still thinks server0 is the primary:*
> >>> >>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> >>> >>>  node_id | hostname | port | status | lb_weight |  role
> >>> >>> ---------+----------+------+--------+-----------+---------
> >>> >>>  0       | server0  | 5432 | 2      | 0.500000  | primary
> >>> >>>  1       | server1  | 5432 | 2      | 0.500000  | standby
> >>> >>> (2 rows)
> >>> >>>
> >>> >>> server1's pgpool still thinks server1 is the primary:
> >>> >>> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
> >>> >>>  node_id | hostname | port | status | lb_weight |  role
> >>> >>> ---------+----------+------+--------+-----------+---------
> >>> >>>  0       | server0  | 5432 | 3      | 0.500000  | standby
> >>> >>>  1       | server1  | 5432 | 2      | 0.500000  | primary
> >>> >>> (2 rows)
> >>> >>>
> >>> >>> *4) Kill the database on server0 to prepare for the recovery.*
> >>> >>> *As soon as I kill the server0 database, the pgpool log shows the
> >>> >>> following:*
> >>> >>> Feb 18 15:33:56 server0 pgpool[2153]: execute command:
> >>> >>> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D
> >>> /opt/postgres/9.2/data
> >>> >>> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node_repeatedly:
> >>> >>> waiting for finding a primary node
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node: primary
> node
> >>> id
> >>> >>> is 1
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: starting follow degeneration.
> >>> >>> shutdown host server0(5432)
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover: 1 follow backends
> have
> >>> >>> been degenerated
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new primary
> node:
> >>> 1
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new master
> node: 1
> >>> >>> Feb 18 15:33:57 server0 pgpool[2534]: start triggering follow
> command.
> >>> >>> Feb 18 15:33:57 server0 pgpool[2534]: execute command: echo 0
> server0
> >>> >>> 5432 /opt/postgres/9.2/data 1 server1 0 0 5432
> /opt/postgres/9.2/data
> >>> %
> >>> >>> Feb 18 15:33:57 server0 pgpool[2190]: worker process received
> restart
> >>> >>> request
> >>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover done. shutdown host
> >>> >>> server0(5432)
> >>> >>>
> >>> >>> pool_nodes shows:
> >>> >>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
> >>> >>>  node_id | hostname | port | status | lb_weight |  role
> >>> >>> ---------+----------+------+--------+-----------+---------
> >>> >>>  0       | server0  | 5432 | 3      | 0.500000  | standby
> >>> >>>  1       | server1  | 5432 | 2      | 0.500000  | primary
> >>> >>> (2 rows)
> >>> >>>
> >>> >>>
> >>> >>> *5) On server1, execute /usr/local/bin/pcp_recovery_node 10 localhost
> >>> >>> 9898 pgpool password 0.*
> >>> >>> The server1 pgpool log shows the following:
> >>> >>> Feb 18 15:34:52 server1 pgpool[3209]: starting recovery command:
> >>> "SELECT
> >>> >>> pgpool_recovery('basebackup.sh', 'server0',
> '/opt/postgres/9.2/data')"
> >>> >>> Feb 18 15:34:53 server1 pgpool[3209]: 1st stage is done
> >>> >>> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started:
> try to
> >>> >>> connect to postmaster on hostname:server0 database:postgres
> >>> user:postgres
> >>> >>> (retry 0 times)
> >>> >>> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started:
> >>> failed to
> >>> >>> connect to postmaster on hostname:server0 database:postgres
> >>> user:postgres
> >>> >>>
> >>> >>> *The message "fail to connect to postmaster on hostname server0"
> >>> >>> continue until the command timeout.
> >>> >>> *
> >>> >>> *it looks like the 1st stage command is done but pgpool on server1
> did
> >>> >>> not call the pgpool_remote_start to start the server0 database
> engine
> >>> >>> although the script is there under postgresql data folder
> >>> >>> *
> >>> >>> [root at server1 data]# ls
> >>> >>> backup_label.old     data       pg_hba.conf    pg_notify
> >>> >>> pg_stat_tmp  PG_VERSION       postmaster.pid
> >>> >>> base                 global     pg_ident.conf  pgpool_remote_start
> >>> >>> pg_subtrans  pg_xlog          recovery.done
> >>> >>> basebackup.sh        ning.test  pg_log         pg_serial
> >>> >>> pg_tblspc    postgresql.conf  recovery.standby1
> >>> >>> copy-base-backup.sh  pg_clog    pg_multixact   pg_snapshots
> >>> >>> pg_twophase  postmaster.opts
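For reference, pgpool_remote_start (visible in the listing above) is the script pgpool-II runs after the first recovery stage to start PostgreSQL on the recovered node. The poster's version is not reproduced in this thread; a minimal sketch modelled on the sample shipped with pgpool-II would look like the following, where the pg_ctl path is an assumption and must match the actual installation (an incorrect pg_ctl path was the poster's earlier bug):

#! /bin/sh
# Hypothetical sketch of pgpool_remote_start; the actual script is not shown.
DEST_HOST=$1                          # node to start, passed by pgpool
DEST_DIR=$2                           # its data directory
PGCTL=/opt/postgres/9.2/bin/pg_ctl    # assumed path -- adjust to the install

# Start PostgreSQL on the remote node over ssh (passwordless ssh required).
ssh -T $DEST_HOST $PGCTL -w -D $DEST_DIR start 2>/dev/null 1>/dev/null < /dev/null &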
> >>> >>>
> >>> >>> *During this period, the pgpool log on server0 shows the following:*
> >>> >>> Feb 18 15:34:42 server0 pgpool[2567]: connection received:
> >>> >>> host=server0.local port=45366
> >>> >>> Feb 18 15:34:46 server0 pgpool[2567]: connection received:
> >>> >>> host=server1.local port=39535
> >>> >>> Feb 18 15:34:52 server0 pgpool[2567]: connection received:
> >>> >>> host=server0.local port=45381
> >>> >>> Feb 18 15:34:56 server0 pgpool[2565]: connection received:
> >>> >>> host=server1.local port=39557
> >>> >>> Feb 18 15:35:02 server0 pgpool[2565]: connection received:
> >>> >>> host=server0.local port=45396
> >>> >>> Feb 18 15:35:06 server0 pgpool[2565]: connection received:
> >>> >>> host=server1.local port=39576
> >>> >>> Feb 18 15:35:12 server0 pgpool[2565]: connection received:
> >>> >>> host=server0.local port=45410
> >>> >>> Feb 18 15:35:16 server0 pgpool[2565]: connection received:
> >>> >>> host=server1.local port=39593
> >>> >>> Feb 18 15:35:22 server0 pgpool[2565]: connection received:
> >>> >>> host=server0.local port=45425
> >>> >>> Feb 18 15:35:26 server0 pgpool[2565]: connection received:
> >>> >>> host=server1.local port=39612
> >>> >>> Feb 18 15:35:32 server0 pgpool[2565]: connection received:
> >>> >>> host=server0.local port=45440
> >>> >>> Feb 18 15:35:36 server0 pgpool[2565]: connection received:
> >>> >>> host=server1.local port=39630
> >>> >>> Feb 18 15:35:42 server0 pgpool[2565]: connection received:
> >>> >>> host=server0.local port=45455
> >>> >>> Feb 18 15:35:46 server0 pgpool[2565]: connection received:
> >>> >>> host=server1.local port=39648
> >>> >>>
> >>> >>> *Below is the script for recovery_1st_stage_command = 'basebackup.sh':*
> >>> >>> [root at server1 ~]# more /opt/postgres/9.2/data/basebackup.sh
> >>> >>> #! /bin/sh
> >>> >>> # Recovery script for streaming replication.
> >>> >>> # This script assumes that DB node 0 is primary, and 1 is standby.
> >>> >>> #
> >>> >>> datadir=$1
> >>> >>> desthost=$2
> >>> >>> destdir=$3
> >>> >>>
> >>> >>> echo "datadir=$datadir"
> >>> >>> echo "desthost=$desthost"
> >>> >>> echo "destdir=$destdir"
> >>> >>>
> >>> >>> echo "Executing basebackup.sh" >> /var/log/replication.log
> >>> >>>
> >>> >>> echo "Executing start backup" >> /var/log/replication.log
> >>> >>> psql -c "SELECT pg_start_backup('Streaming Replication', true)"
> >>> postgres
> >>> >>>
> >>> >>> echo "Executing rsync" >> /var/log/replication.log
> >>> >>> echo "rsync -C -a --progress --delete -e ssh --exclude
> postgresql.conf
> >>> >>> --exclude postmaster.pid --exclude postmaster.
> >>> >>> opts --exclude pg_log --exclude pg_xlog --exclude recovery.conf
> >>> $datadir/
> >>> >>> $desthost:$destdir/"
> >>> >>>
> >>> >>> echo "Renaming recovery conf file" >>/var/log/replication.log
> >>> >>> echo "ssh -T $desthost mv $destdir/recovery.done
> >>> $destdir/recovery.conf"
> >>> >>> ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf
> >>> >>>
> >>> >>> echo "Removing trigger file"
> >>> >>> ssh -T $desthost rm -f /tmp/trigger_file0
> >>> >>>
> >>> >>> echo "Executing stop backup" >> /var/log/replication.log
> >>> >>> psql -c "SELECT pg_stop_backup()" postgres
> >>> >>>
> >>> >>> echo "existing basebackup.sh" >> /var/log/replication.log
> >>> >>>
> >>> >>> *Manually executing the command on server1 returns the following
> >>> >>> result:*
> >>> >>>
> >>> >>> [root at server1 data]# ./basebackup.sh /opt/postgres/9.2/data/
> server0
> >>> >>> /opt/postgres/9.2/data/
> >>> >>> datadir=/opt/postgres/9.2/data/
> >>> >>> desthost=server0
> >>> >>> destdir=/opt/postgres/9.2/data/
> >>> >>>  pg_start_backup
> >>> >>> -----------------
> >>> >>>  0/76000020
> >>> >>> (1 row)
> >>> >>>
> >>> >>> rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
> >>> >>> --exclude postmaster.pid --exclude postmaster.opts --exclude pg_log
> >>> >>> --exclude pg_xlog --exclude recovery.conf /opt/postgres/9.2/data//
> >>> >>> server0:/opt/postgres/9.2/data//
> >>> >>> ssh -T server0 mv /opt/postgres/9.2/data//recovery.done
> >>> >>> /opt/postgres/9.2/data//recovery.conf
> >>> >>> Removing trigger file
> >>> >>> NOTICE:  WAL archiving is not enabled; you must ensure that all
> >>> required
> >>> >>> WAL segments are copied through other means to complete the backup
> >>> >>>  pg_stop_backup
> >>> >>> ----------------
> >>> >>>  0/760000E0
> >>> >>> (1 row)
> >>> >>>
> >>> >>> Thanks and please advise.
> >>> >>>
> >>> >>> Ning
> >>> >>>
> >>> >>>
> >>> >>> On Sun, Feb 17, 2013 at 7:05 AM, Tatsuo Ishii <
> ishii at postgresql.org
> >>> >wrote:
> >>> >>>
> >>> >>>> Hi,
> >>> >>>>
> >>> >>>> It seems the standby was unable to start up. Can you show the standby
> >>> >>>> PostgreSQL's log? Maybe we can find the cause of the problem.
> >>> >>>> --
> >>> >>>> Tatsuo Ishii
> >>> >>>> SRA OSS, Inc. Japan
> >>> >>>> English: http://www.sraoss.co.jp/index_en.php
> >>> >>>> Japanese: http://www.sraoss.co.jp
> >>> >>>>
> >>> >>>> > Hi Tatsuo,
> >>> >>>> > Thank you so much for your reply.
> >>> >>>> > Actually, in my case, I was using the pcp_recovery command and
> >>> >>>> > executing it on
> >>> >>>> > the current primary server.
> >>> >>>> > However, if the remote node's (recovery node's) database is off, I
> >>> >>>> > got the
> >>> >>>> > following message in the primary server's pgpool log:
> >>> >>>> >
> >>> >>>> > Jan 31 16:58:10 server0 pgpool[2723]: starting recovery command:
> >>> >>>> "SELECT
> >>> >>>> > pgpool_recovery('basebackup.sh', 'server1',
> >>> '/opt/postgres/9.2/data')"
> >>> >>>> > Jan 31 16:58:11 server0 pgpool[2723]: 1st stage is done
> >>> >>>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > (retry 0 times)
> >>> >>>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:13 server0 pgpool[2719]: connection received:
> >>> >>>> > host=server0.local port=58446
> >>> >>>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > (retry 1 times)
> >>> >>>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:14 server0 pgpool[2719]: connection received:
> >>> >>>> > host=server1.local port=39928
> >>> >>>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > (retry 2 times)
> >>> >>>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > (retry 3 times)
> >>> >>>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:postgres
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:23 server0 pgpool[2719]: connection received:
> >>> >>>> > host=server0.local port=58464
> >>> >>>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > (retry 0 times)
> >>> >>>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > (retry 1 times)
> >>> >>>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:26 server0 pgpool[2719]: connection received:
> >>> >>>> > host=server1.local port=39946
> >>> >>>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > (retry 2 times)
> >>> >>>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > (retry 3 times)
> >>> >>>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:33 server0 pgpool[2719]: connection received:
> >>> >>>> > host=server0.local port=58483
> >>> >>>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > (retry 4 times)
> >>> >>>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started:
> >>> try to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> > (retry 5 times)
> >>> >>>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started:
> >>> failed
> >>> >>>> to
> >>> >>>> > connect to postmaster on hostname:server1 database:template1
> >>> >>>> user:postgres
> >>> >>>> >
> >>> >>>> > Here is the exact command I execute on server0 to recover server1:
> >>> >>>> > /usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1
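For reference, with the old-style pcp syntax used here, pcp_recovery_node takes its arguments positionally: timeout, pcp host, pcp port, pcp user, pcp password, and the backend node id to recover. An annotated form of the command above (the same invocation, shown only for readability):

# pcp_recovery_node <timeout> <pcp_host> <pcp_port> <pcp_user> <pcp_password> <node_id>
#   10        - give up after 10 seconds
#   localhost - host where pgpool's pcp server listens
#   9898      - pcp port
#   pgpool    - pcp user name (from pcp.conf)
#   password  - that user's pcp password
#   1         - backend node id to recover (server1 here)
/usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1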
> >>> >>>> >
> >>> >>>> > Do you have any idea why?
> >>> >>>> >
> >>> >>>> > Just FYI, we cannot use pgpoolAdmin in our environment.
> >>> >>>> >
> >>> >>>> >
> >>> >>>> > On Sun, Feb 17, 2013 at 12:13 AM, Tatsuo Ishii <
> >>> ishii at postgresql.org>
> >>> >>>> wrote:
> >>> >>>> >
> >>> >>>> >> > Hi all,
> >>> >>>> >> > I have the following questions regarding the recovery of a failed
> >>> >>>> >> > primary database server.
> >>> >>>> >> >
> >>> >>>> >> > Question 1: in the documentation, under Streaming Replication
> >>> Online
> >>> >>>> >> > Recovery section.
> >>> >>>> >> >
> >>> >>>> >> > http://www.pgpool.net/docs/latest/pgpool-en.html#stream
> >>> >>>> >> >
> >>> >>>> >> > in steps 6:
> >>> >>>> >> >
> >>> >>>> >> >    1. After completing online recovery, pgpool-II will start
> >>> >>>> PostgreSQL
> >>> >>>> >> on
> >>> >>>> >> >    the standby node. Install the script for this purpose on
> >>> each DB
> >>> >>>> >> > nodes. Sample
> >>> >>>> >> >    script <
> >>> http://www.pgpool.net/docs/latest/pgpool_remote_start>
> >>> >>>> is
> >>> >>>> >> >    included in "sample" directory of the source code. This
> >>> script
> >>> >>>> uses
> >>> >>>> >> ssh.
> >>> >>>> >> >    You need to allow recovery_user to login from the primary
> >>> node
> >>> >>>> to the
> >>> >>>> >> >    standby node without being asked password.
> >>> >>>> >> >
> >>> >>>> >> > To my understanding, PostgreSQL does not need to be online
> >>> >>>> >> > for the recovery process, right? Later on, the documentation mentions
> >>> >>>> >> > that pgpool_remote_start will start up the DB on the failed node.
> >>> >>>> >>
> >>> >>>> >> Actually, the standby PostgreSQL node should not be started.
> >>> >>>> >>
> >>> >>>> >> > Question 2: In my configuration, I have 2 pgpool servers with two
> >>> >>>> >> > backends.
> >>> >>>> >> > Will it work for online recovery?
> >>> >>>> >>
> >>> >>>> >> Yes, but the online recovery process should be initiated by one of the
> >>> >>>> >> pgpools, not both. If you enable pgpool-II 3.2's watchdog, it will take
> >>> >>>> >> care of the necessary interlocking.
> >>> >>>> >>
> >>> >>>> >> > Question 3: When the failed node comes back online, should I use
> >>> >>>> >> > pcp_recovery from the DB primary, or should I use pcp_attach on the
> >>> >>>> >> > failed node
> >>> >>>> >> > to recover the failed system? Actually, in my case, neither method
> >>> >>>> >> > recovers my system every time.
> >>> >>>> >>
> >>> >>>> >> I'm confused. Didn't you start the online recovery process by using
> >>> >>>> >> pcp_recovery_node? (Of course, you could do it via pgpoolAdmin.)
> >>> >>>> >>
> >>> >>>> >> Anyway, pcp_recovery_node automatically attaches the recovered node, and
> >>> >>>> >> you don't need to execute pcp_attach_node.
> >>> >>>> >>
> >>> >>>> >> I suggest you read the tutorial:
> >>> >>>> >>
> >>> >>>>
> >>>
> http://www.pgpool.net/pgpool-web/contrib_docs/simple_sr_setting2/index.html
> >>> >>>> >> --
> >>> >>>> >> Tatsuo Ishii
> >>> >>>> >> SRA OSS, Inc. Japan
> >>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
> >>> >>>> >> Japanese: http://www.sraoss.co.jp
> >>> >>>> >>
> >>> >>>>
> >>> >>>
> >>> >>>
> >>> >>
> >>>
> >>
> >>
>