[pgpool-general: 1441] Re: pgPool Online Recovery Streaming Replication

Tatsuo Ishii ishii at postgresql.org
Thu Feb 28 09:12:56 JST 2013


What version of pgpool-II are you using?  We found a failover-handling
problem in 3.2.2 and subsequently released 3.2.3 to fix it.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hi Tatsuo,
> Have you had a chance to look at this problem yet?
> 
> Thanks~
> Ning
> 
> 
> On Tue, Feb 19, 2013 at 3:02 PM, ning chan <ninchan8328 at gmail.com> wrote:
> 
>> Hi Tatsuo,
>> I figured out the problem: a scripting error in pgpool_remote_start
>> used an incorrect path to pg_ctl.
>> As soon as I corrected it, I was able to recover the failed server and
>> bring it online.
>> I am now, however, facing another problem.
>> After I brought the failed master back into pgpool as a standby server,
>> I shut down the current primary server, expecting the standby to be
>> promoted to the new primary; however, that did not happen.
>> The pgpool log shows that the failover command was called, but it did
>> not take effect. I checked and confirmed that my failover script works
>> fine on its own.
>>
>> Here is the log:
>> Feb 19 14:51:40 server0 pgpool[3519]: set 1 th backend down status
>> Feb 19 14:51:43 server0 pgpool[3554]: wd_create_send_socket: connect()
>> reports failure (No route to host). You can safely ignore this while
>> starting up.
>> Feb 19 14:51:43 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> times. pgpool seems not to be working
>> Feb 19 14:51:43 server0 pgpool[3522]: wd_IP_down: ifconfig down succeeded
>> Feb 19 14:51:43 server0 pgpool[3519]: starting degeneration. shutdown host
>> server1(5432)
>> Feb 19 14:51:43 server0 pgpool[3519]: Restart all children
>> Feb 19 14:51:43 server0 pgpool[3519]: execute command:
>> /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D /opt/postgres/9.2/data
>> -m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
>> Feb 19 14:51:43 server0 postgres[3939]: [2-1] LOG:  incomplete startup
>> packet
>> Feb 19 14:51:43 server0 postgres[3931]: [2-1] LOG:  incomplete startup
>> packet
>> Feb 19 14:51:43 server0 postgres[3935]: [2-1] LOG:  incomplete startup
>> packet
>> Feb 19 14:51:43 server0 pgpool[3519]: find_primary_node_repeatedly:
>> waiting for finding a primary node
>> Feb 19 14:51:46 server0 pgpool[3522]: wd_create_send_socket: connect()
>> reports failure (No route to host). You can safely ignore this while
>> starting up.
>> Feb 19 14:51:46 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> times. pgpool seems not to be working
>> Feb 19 14:51:54 server0 pgpool[3556]: connect_inet_domain_socket:
>> connect() failed: Connection timed out
>> Feb 19 14:51:54 server0 pgpool[3556]: make_persistent_db_connection:
>> connection to server1(5432) failed
>> Feb 19 14:51:54 server0 pgpool[3556]: check_replication_time_lag: could
>> not connect to DB node 1, check sr_check_user and sr_check_password
>> Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> times. pgpool seems not to be working
>> Feb 19 14:52:05 server0 pgpool[3556]: connect_inet_domain_socket:
>> connect() failed: No route to host
>> Feb 19 14:52:05 server0 pgpool[3556]: make_persistent_db_connection:
>> connection to server1(5432) failed
>> Feb 19 14:52:05 server0 pgpool[3556]: check_replication_time_lag: could
>> not connect to DB node 1, check sr_check_user and sr_check_password
>> Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> times. pgpool seems not to be working
>> Feb 19 14:52:14 server0 pgpool[3519]: failover: no follow backends are
>> degenerated
>> Feb 19 14:52:14 server0 pgpool[3519]: failover: set new primary node: -1
>> Feb 19 14:52:14 server0 pgpool[3519]: failover: set new master node: 0
>> Feb 19 14:52:14 server0 pgpool[3980]: connection received:
>> host=server0.local port=45361
>> Feb 19 14:52:14 server0 pgpool[3556]: worker process received restart
>> request
>> Feb 19 14:52:14 server0 pgpool[3519]: failover done. shutdown host
>> server1(5432)
>>
>> As you can see, PostgreSQL was not restarted.
>>
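For reference, pgpool's failover_command expands placeholders such as %d (failed node id), %h (failed host), %m (new master node id), %H (new master host) and %P (old primary node id) into the command line shown in the log above. A minimal trigger-file failover script along those lines might look like the sketch below; the argument order and trigger file path are assumptions inferred from the logged failover.py invocation, not the poster's actual script:

```shell
#!/bin/sh
# Sketch of a trigger-file failover script (hypothetical; mirrors the
# argument order seen in the logged failover.py invocation).
FAILED_NODE_ID=$1    # %d: id of the failed backend
FAILED_HOST=$2       # %h: hostname of the failed backend
NEW_MASTER_ID=$3     # %m: id of the new master candidate
NEW_MASTER_HOST=$4   # %H: hostname of the new master candidate
OLD_PRIMARY_ID=$5    # %P: id of the old primary
TRIGGER_FILE=/opt/postgres/9.2/data/trigger_file0  # path assumed from the log

# Promote the standby only when the node that went down was the primary.
if [ "$FAILED_NODE_ID" = "$OLD_PRIMARY_ID" ]; then
    ssh -T postgres@"$NEW_MASTER_HOST" touch "$TRIGGER_FILE"
fi
```

This is installed via the failover_command parameter in pgpool.conf; the PostgreSQL standby must name the same file in its recovery.conf trigger_file setting for the promotion to take effect.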
>> For comparison, from a successful failover the log looks like this:
>> Feb 19 13:47:01 server0 pgpool[4391]: execute command:
>> /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D /opt/postgres/9.2/data
>> -m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
>> Feb 19 13:47:01 server0 postgres[4786]: [2-1] LOG:  incomplete startup
>> packet
>> Feb 19 13:47:02 server0 pgpool[4391]: find_primary_node_repeatedly:
>> waiting for finding a primary node
>> Feb 19 13:47:06 server0 postgres[3083]: [6-1] LOG:  trigger file found:
>> /opt/postgres/9.2/data/trigger_file0
>> Feb 19 13:47:06 server0 postgres[3083]: [7-1] LOG:  redo done at 0/91000020
>> Feb 19 13:47:06 server0 postgres[3083]: [8-1] LOG:  selected new timeline
>> ID: 18
>> Feb 19 13:47:06 server0 postgres[3083]: [9-1] LOG:  archive recovery
>> complete
>> Feb 19 13:47:06 server0 postgres[3081]: [2-1] LOG:  database system is
>> ready to accept connections
>> Feb 19 13:47:06 server0 postgres[4804]: [2-1] LOG:  autovacuum launcher
>> started
>> Feb 19 13:47:07 server0 pgpool[4391]: find_primary_node: primary node id
>> is 0
>> Feb 19 13:47:07 server0 pgpool[4391]: starting follow degeneration.
>> shutdown host server1(5432)
>> Feb 19 13:47:07 server0 pgpool[4391]: failover: 1 follow backends have
>> been degenerated
>> Feb 19 13:47:07 server0 pgpool[4391]: failover: set new primary node: 0
>> Feb 19 13:47:07 server0 pgpool[4391]: failover: set new master node: 0
>> Feb 19 13:47:07 server0 pgpool[4817]: connection received:
>> host=server1.local port=54619
>> Feb 19 13:47:07 server0 pgpool[4816]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4818]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4819]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4815]: start triggering follow command.
>> Feb 19 13:47:07 server0 pgpool[4815]: execute command: echo 1 server1 5432
>> /opt/postgres/9.2/data 0 server0 0 1 5432 /opt/postgres/9.2/data %
>> Feb 19 13:47:07 server0 pgpool[4821]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4822]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4823]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4820]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4827]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4828]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4826]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4829]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4830]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4831]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4832]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4835]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4836]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4837]: do_child: failback event found.
>> restart myself.
>> Feb 19 13:47:07 server0 pgpool[4391]: failover done. shutdown host
>> server1(5432)
>>
>> Any idea why failover did not work after the recovery? Do I need to
>> restart the remote pgpool after recovery?
>>
>>
>> On Mon, Feb 18, 2013 at 4:24 PM, ning chan <ninchan8328 at gmail.com> wrote:
>>
>>>
>>> Hi Tatsuo,
>>> Sorry for the late reply; I was traveling.
>>> I was able to reproduce the problem, and here is what I did:
>>>
>>> *1) Make sure primary (server0) and standby (server1) connections are
>>> good.*
>>> [root at server0 tmp]# psql -c "show pool_nodes" -p 9999
>>>  node_id | hostname | port | status | lb_weight |  role
>>> ---------+----------+------+--------+-----------+---------
>>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>>> (2 rows)
>>>
>>> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
>>>  node_id | hostname | port | status | lb_weight |  role
>>> ---------+----------+------+--------+-----------+---------
>>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>>> (2 rows)
>>>
>>> *2) Force a shutdown of the primary (server0); the standby (server1) was
>>> promoted. Below is the pgpool log from server1 during the failover:*
>>> Feb 18 15:31:45 server1 pgpool[2691]: connect_inet_domain_socket:
>>> connect() failed: No route to host
>>> Feb 18 15:31:45 server1 pgpool[2691]: make_persistent_db_connection:
>>> connection to server0(5432) failed
>>> Feb 18 15:31:45 server1 pgpool[2691]: check_replication_time_lag: could
>>> not connect to DB node 0, check sr_check_user and sr_check_password
>>> Feb 18 15:31:45 server1 pgpool[2689]: connect_inet_domain_socket:
>>> connect() failed: No route to host
>>> Feb 18 15:31:45 server1 pgpool[2689]: connection to server0(5432) failed
>>> Feb 18 15:31:45 server1 pgpool[2689]: new_connection: create_cp() failed
>>> Feb 18 15:31:45 server1 pgpool[2689]: degenerate_backend_set: 0 fail over
>>> request from pid 2689
>>> Feb 18 15:31:48 server1 pgpool[2689]: wd_create_send_socket: connect()
>>> reports failure (No route to host). You can safely ignore this while
>>> starting up.
>>> Feb 18 15:31:48 server1 pgpool[2653]: connect_inet_domain_socket:
>>> connect() failed: No route to host
>>> Feb 18 15:31:48 server1 pgpool[2653]: make_persistent_db_connection:
>>> connection to server0(5432) failed
>>> Feb 18 15:31:48 server1 pgpool[2653]: health check failed. 0 th host
>>> server0 at port 5432 is down
>>> Feb 18 15:31:48 server1 pgpool[2653]: health check retry sleep time: 5
>>> second(s)
>>> Feb 18 15:31:48 server1 pgpool[2653]: starting degeneration. shutdown
>>> host server0(5432)
>>> Feb 18 15:31:48 server1 pgpool[2653]: Restart all children
>>> Feb 18 15:31:48 server1 pgpool[2653]: execute command:
>>> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
>>> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
>>> Feb 18 15:31:49 server1 pgpool[2653]: find_primary_node_repeatedly:
>>> waiting for finding a primary node
>>> Feb 18 15:31:53 server1 postgres[1970]: [7-1] LOG:  trigger file found:
>>> /tmp/trigger_file0
>>> Feb 18 15:31:53 server1 postgres[2597]: [3-1] FATAL:  terminating
>>> walreceiver process due to administrator command
>>> Feb 18 15:31:53 server1 postgres[1970]: [8-1] LOG:  record with zero
>>> length at 0/64000418
>>> Feb 18 15:31:53 server1 postgres[1970]: [9-1] LOG:  redo done at
>>> 0/640003B8
>>> Feb 18 15:31:53 server1 postgres[1970]: [10-1] LOG:  selected new
>>> timeline ID: 15
>>> Feb 18 15:31:53 server1 postgres[1970]: [11-1] LOG:  archive recovery
>>> complete
>>> Feb 18 15:31:53 server1 postgres[1968]: [2-1] LOG:  database system is
>>> ready to accept connections
>>> Feb 18 15:31:53 server1 postgres[3168]: [2-1] LOG:  autovacuum launcher
>>> started
>>> Feb 18 15:31:54 server1 pgpool[2653]: find_primary_node: primary node id
>>> is 1
>>> Feb 18 15:31:54 server1 pgpool[2653]: starting follow degeneration.
>>> shutdown host server0(5432)
>>> Feb 18 15:31:54 server1 pgpool[2653]: failover: 1 follow backends have
>>> been degenerated
>>> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new primary node: 1
>>> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new master node: 1
>>> Feb 18 15:31:54 server1 pgpool[3171]: connection received:
>>> host=server1.local port=58346
>>> Feb 18 15:31:54 server1 pgpool[3170]: start triggering follow command.
>>> Feb 18 15:31:54 server1 pgpool[3170]: execute command: echo 0 server0
>>> 5432 /opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %
>>> Feb 18 15:31:54 server1 pgpool[2691]: worker process received restart
>>> request
>>> Feb 18 15:31:54 server1 pgpool[2653]: failover done. shutdown host
>>> server0(5432)
>>>
>>> *3) Boot up the failed server (server0); pgpool and the database
>>> started up as usual.*
>>> *After server0 booted up, I checked the pool_nodes info. The server0
>>> pgpool still thinks server0 is the primary:*
>>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>>>  node_id | hostname | port | status | lb_weight |  role
>>> ---------+----------+------+--------+-----------+---------
>>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>>> (2 rows)
>>>
>>> The server1 pgpool still thinks server1 is the primary:
>>> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
>>>  node_id | hostname | port | status | lb_weight |  role
>>> ---------+----------+------+--------+-----------+---------
>>>  0       | server0  | 5432 | 3      | 0.500000  | standby
>>>  1       | server1  | 5432 | 2      | 0.500000  | primary
>>> (2 rows)
>>>
>>> *4) Kill the database on server0 to prepare for the recovery.*
>>> *As soon as I kill the server0 database, the pgpool log shows the following:*
>>> Feb 18 15:33:56 server0 pgpool[2153]: execute command:
>>> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
>>> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
>>> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node_repeatedly:
>>> waiting for finding a primary node
>>> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node: primary node id
>>> is 1
>>> Feb 18 15:33:57 server0 pgpool[2153]: starting follow degeneration.
>>> shutdown host server0(5432)
>>> Feb 18 15:33:57 server0 pgpool[2153]: failover: 1 follow backends have
>>> been degenerated
>>> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new primary node: 1
>>> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new master node: 1
>>> Feb 18 15:33:57 server0 pgpool[2534]: start triggering follow command.
>>> Feb 18 15:33:57 server0 pgpool[2534]: execute command: echo 0 server0
>>> 5432 /opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %
>>> Feb 18 15:33:57 server0 pgpool[2190]: worker process received restart
>>> request
>>> Feb 18 15:33:57 server0 pgpool[2153]: failover done. shutdown host
>>> server0(5432)
>>>
>>> pool_nodes shows:
>>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>>>  node_id | hostname | port | status | lb_weight |  role
>>> ---------+----------+------+--------+-----------+---------
>>>  0       | server0  | 5432 | 3      | 0.500000  | standby
>>>  1       | server1  | 5432 | 2      | 0.500000  | primary
>>> (2 rows)
>>>
>>>
>>> *5) On server1, execute /usr/local/bin/pcp_recovery_node 10 localhost
>>> 9898 pgpool password 0*
>>> The server1 pgpool log shows the following:
>>> Feb 18 15:34:52 server1 pgpool[3209]: starting recovery command: "SELECT
>>> pgpool_recovery('basebackup.sh', 'server0', '/opt/postgres/9.2/data')"
>>> Feb 18 15:34:53 server1 pgpool[3209]: 1st stage is done
>>> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: try to
>>> connect to postmaster on hostname:server0 database:postgres user:postgres
>>> (retry 0 times)
>>> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: failed to
>>> connect to postmaster on hostname:server0 database:postgres user:postgres
>>>
>>> *The message "failed to connect to postmaster on hostname:server0"
>>> continues until the command times out.*
>>> *It looks like the 1st stage command completed, but pgpool on server1
>>> did not call pgpool_remote_start to start the server0 database engine,
>>> although the script is there in the PostgreSQL data folder:*
>>> [root at server1 data]# ls
>>> backup_label.old     data       pg_hba.conf    pg_notify
>>> pg_stat_tmp  PG_VERSION       postmaster.pid
>>> base                 global     pg_ident.conf  pgpool_remote_start
>>> pg_subtrans  pg_xlog          recovery.done
>>> basebackup.sh        ning.test  pg_log         pg_serial
>>> pg_tblspc    postgresql.conf  recovery.standby1
>>> copy-base-backup.sh  pg_clog    pg_multixact   pg_snapshots
>>> pg_twophase  postmaster.opts
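For comparison, the sample pgpool_remote_start shipped with pgpool-II essentially does an ssh to the recovery target and runs pg_ctl there. A condensed sketch follows; the pg_ctl path is an assumption, and since a wrong pg_ctl path in this very script was the cause of the earlier problem in this thread, it is worth double-checking on the target host:

```shell
#! /bin/sh
# Condensed sketch of pgpool_remote_start, modeled on the sample script
# shipped with pgpool-II. pgpool invokes it after the 1st stage as:
#   pgpool_remote_start <remote host> <remote data directory>
DEST=$1
DESTDIR=$2
PGCTL=/opt/postgres/9.2/bin/pg_ctl   # assumed install path; verify on $DEST

# Start the remote postmaster in the background over ssh.
ssh -T "$DEST" "$PGCTL" -w -D "$DESTDIR" start 2>/dev/null 1>/dev/null < /dev/null &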
>>>
>>> *During this period, the pgpool log on server0 shows the following:*
>>> Feb 18 15:34:42 server0 pgpool[2567]: connection received:
>>> host=server0.local port=45366
>>> Feb 18 15:34:46 server0 pgpool[2567]: connection received:
>>> host=server1.local port=39535
>>> Feb 18 15:34:52 server0 pgpool[2567]: connection received:
>>> host=server0.local port=45381
>>> Feb 18 15:34:56 server0 pgpool[2565]: connection received:
>>> host=server1.local port=39557
>>> Feb 18 15:35:02 server0 pgpool[2565]: connection received:
>>> host=server0.local port=45396
>>> Feb 18 15:35:06 server0 pgpool[2565]: connection received:
>>> host=server1.local port=39576
>>> Feb 18 15:35:12 server0 pgpool[2565]: connection received:
>>> host=server0.local port=45410
>>> Feb 18 15:35:16 server0 pgpool[2565]: connection received:
>>> host=server1.local port=39593
>>> Feb 18 15:35:22 server0 pgpool[2565]: connection received:
>>> host=server0.local port=45425
>>> Feb 18 15:35:26 server0 pgpool[2565]: connection received:
>>> host=server1.local port=39612
>>> Feb 18 15:35:32 server0 pgpool[2565]: connection received:
>>> host=server0.local port=45440
>>> Feb 18 15:35:36 server0 pgpool[2565]: connection received:
>>> host=server1.local port=39630
>>> Feb 18 15:35:42 server0 pgpool[2565]: connection received:
>>> host=server0.local port=45455
>>> Feb 18 15:35:46 server0 pgpool[2565]: connection received:
>>> host=server1.local port=39648
>>>
>>> *Below is the recovery_1st_stage_command script, 'basebackup.sh':*
>>> [root at server1 ~]# more /opt/postgres/9.2/data/basebackup.sh
>>> #! /bin/sh
>>> # Recovery script for streaming replication.
>>> # This script assumes that DB node 0 is primary, and 1 is standby.
>>> #
>>> datadir=$1
>>> desthost=$2
>>> destdir=$3
>>>
>>> echo "datadir=$datadir"
>>> echo "desthost=$desthost"
>>> echo "destdir=$destdir"
>>>
>>> echo "Executing basebackup.sh" >> /var/log/replication.log
>>>
>>> echo "Executing start backup" >> /var/log/replication.log
>>> psql -c "SELECT pg_start_backup('Streaming Replication', true)" postgres
>>>
>>> echo "Executing rsync" >> /var/log/replication.log
>>> echo "rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
>>> --exclude postmaster.pid --exclude postmaster.
>>> opts --exclude pg_log --exclude pg_xlog --exclude recovery.conf $datadir/
>>> $desthost:$destdir/"
>>>
>>> echo "Renaming recovery conf file" >>/var/log/replication.log
>>> echo "ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf"
>>> ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf
>>>
>>> echo "Removing trigger file"
>>> ssh -T $desthost rm -f /tmp/trigger_file0
>>>
>>> echo "Executing stop backup" >> /var/log/replication.log
>>> psql -c "SELECT pg_stop_backup()" postgres
>>>
>>> echo "exiting basebackup.sh" >> /var/log/replication.log
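One thing worth noting about the script above: the rsync command is only echoed, never executed, so no files are actually copied to the recovery target. If that echo is not intentional debugging, the real command would need to run as well, along these lines (same exclude list as the echoed command; paths come from the script's own variables):

```shell
# Execute the rsync rather than only printing it (sketch; the variables
# $datadir, $desthost and $destdir are set earlier in basebackup.sh).
rsync -C -a --progress --delete -e ssh \
    --exclude postgresql.conf --exclude postmaster.pid \
    --exclude postmaster.opts --exclude pg_log \
    --exclude pg_xlog --exclude recovery.conf \
    "$datadir"/ "$desthost":"$destdir"/
```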
>>>
>>> *Manually executing the command on server1 returns the following result:*
>>>
>>> [root at server1 data]# ./basebackup.sh /opt/postgres/9.2/data/ server0
>>> /opt/postgres/9.2/data/
>>> datadir=/opt/postgres/9.2/data/
>>> desthost=server0
>>> destdir=/opt/postgres/9.2/data/
>>>  pg_start_backup
>>> -----------------
>>>  0/76000020
>>> (1 row)
>>>
>>> rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
>>> --exclude postmaster.pid --exclude postmaster.opts --exclude pg_log
>>> --exclude pg_xlog --exclude recovery.conf /opt/postgres/9.2/data//
>>> server0:/opt/postgres/9.2/data//
>>> ssh -T server0 mv /opt/postgres/9.2/data//recovery.done
>>> /opt/postgres/9.2/data//recovery.conf
>>> Removing trigger file
>>> NOTICE:  WAL archiving is not enabled; you must ensure that all required
>>> WAL segments are copied through other means to complete the backup
>>>  pg_stop_backup
>>> ----------------
>>>  0/760000E0
>>> (1 row)
>>>
>>> Thanks and please advise.
>>>
>>> Ning
>>>
>>>
>>> On Sun, Feb 17, 2013 at 7:05 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>
>>>> Hi,
>>>>
>>>> It seems the standby was unable to start up. Can you show the standby
>>>> PostgreSQL's log? Maybe we can find the cause of the problem there.
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese: http://www.sraoss.co.jp
>>>>
>>>> > Hi Tatsuo,
>>>> > Thank you so much for your reply.
>>>> > Actually, in my case I was using the pcp_recovery_node command and
>>>> > executing it on the current primary server.
>>>> > However, when the remote (recovery-target) node's database is off, I
>>>> > get the following messages in the primary server's pgpool log:
>>>> >
>>>> > Jan 31 16:58:10 server0 pgpool[2723]: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', 'server1', '/opt/postgres/9.2/data')"
>>>> > Jan 31 16:58:11 server0 pgpool[2723]: 1st stage is done
>>>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:postgres user:postgres (retry 0 times)
>>>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:postgres user:postgres
>>>> > Jan 31 16:58:13 server0 pgpool[2719]: connection received: host=server0.local port=58446
>>>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:postgres user:postgres (retry 1 times)
>>>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:postgres user:postgres
>>>> > Jan 31 16:58:14 server0 pgpool[2719]: connection received: host=server1.local port=39928
>>>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:postgres user:postgres (retry 2 times)
>>>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:postgres user:postgres
>>>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:postgres user:postgres (retry 3 times)
>>>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:postgres user:postgres
>>>> > Jan 31 16:58:23 server0 pgpool[2719]: connection received: host=server0.local port=58464
>>>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:template1 user:postgres (retry 0 times)
>>>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:template1 user:postgres
>>>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:template1 user:postgres (retry 1 times)
>>>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:template1 user:postgres
>>>> > Jan 31 16:58:26 server0 pgpool[2719]: connection received: host=server1.local port=39946
>>>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:template1 user:postgres (retry 2 times)
>>>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:template1 user:postgres
>>>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:template1 user:postgres (retry 3 times)
>>>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:template1 user:postgres
>>>> > Jan 31 16:58:33 server0 pgpool[2719]: connection received: host=server0.local port=58483
>>>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:template1 user:postgres (retry 4 times)
>>>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:template1 user:postgres
>>>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: try to connect to postmaster on hostname:server1 database:template1 user:postgres (retry 5 times)
>>>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: failed to connect to postmaster on hostname:server1 database:template1 user:postgres
>>>> >
>>>> > Here is the exact command I execute on server0 to recover server1:
>>>> > /usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1
>>>> >
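For readers unfamiliar with the tool, the positional arguments of pcp_recovery_node in pgpool-II 3.x are, in order, the command timeout, the pcp host, the pcp port, the pcp user, the pcp password, and the backend node id:

```shell
# pcp_recovery_node <timeout> <pcp host> <pcp port> <pcp user> <pcp password> <node id>
# The command above asks the local pgpool (pcp port 9898) to run online
# recovery for backend node 1, giving up after 10 seconds:
/usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1
```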
>>>> > Do you have any idea why?
>>>> >
>>>> > Just FYI, we cannot use pgpoolAdmin in our environment.
>>>> >
>>>> >
>>>> > On Sun, Feb 17, 2013 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>> >
>>>> >> > Hi all,
>>>> >> > I have the following questions regarding the recovery of a failed
>>>> >> > primary database server.
>>>> >> >
>>>> >> > Question 1: in the documentation, under the Streaming Replication
>>>> >> > Online Recovery section,
>>>> >> >
>>>> >> > http://www.pgpool.net/docs/latest/pgpool-en.html#stream
>>>> >> >
>>>> >> > in step 6:
>>>> >> >
>>>> >> >    1. After completing online recovery, pgpool-II will start
>>>> >> >    PostgreSQL on the standby node. Install the script for this
>>>> >> >    purpose on each DB node. A sample
>>>> >> >    script <http://www.pgpool.net/docs/latest/pgpool_remote_start>
>>>> >> >    is included in the "sample" directory of the source code. This
>>>> >> >    script uses ssh. You need to allow recovery_user to log in from
>>>> >> >    the primary node to the standby node without being asked for a
>>>> >> >    password.
>>>> >> >
>>>> >> > To my understanding, PostgreSQL does not need to be online for the
>>>> >> > recovery process, right? Later on it mentions that
>>>> >> > pgpool_remote_start will start up the DB on the failed node.
>>>> >>
>>>> >> Actually, the standby PostgreSQL node should not be started.
>>>> >>
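For context, the online recovery behaviour discussed here is driven by a handful of pgpool.conf parameters; a representative fragment (values are illustrative, chosen to match this thread's setup, and the 1st-stage script must live in the primary's database cluster directory):

```
# pgpool.conf -- online recovery settings (illustrative values)
recovery_user = 'postgres'                   # PostgreSQL user used for recovery
recovery_password = 'postgres'               # password for recovery_user, if any
recovery_1st_stage_command = 'basebackup.sh' # script in the primary's data directory
recovery_timeout = 90                        # seconds to wait for the node to start
```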
>>>> >> > Question 2: in my configuration I have two pgpool servers with two
>>>> >> > backends. Will it work for online recovery?
>>>> >>
>>>> >> Yes, but the online recovery process should be initiated by only one
>>>> >> of the pgpool instances, not both. If you enable pgpool-II 3.2's
>>>> >> watchdog, it will take care of the necessary interlocking.
>>>> >>
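A minimal watchdog fragment for pgpool-II 3.2 along those lines might look like this (the hostnames match this thread; the virtual IP and port numbers are purely illustrative assumptions):

```
# pgpool.conf -- watchdog settings on server0 (illustrative sketch)
use_watchdog = on
wd_hostname = 'server0'
wd_port = 9000
delegate_IP = '192.168.1.100'        # hypothetical virtual IP for clients
other_pgpool_hostname0 = 'server1'   # the peer pgpool instance
other_pgpool_port0 = 9999
other_wd_port0 = 9000
```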
>>>> >> > Question 3: when the failed node comes back online, should I use
>>>> >> > pcp_recovery_node from the DB primary, or pcp_attach_node on the
>>>> >> > failed node, to recover the failed system? In my case, neither
>>>> >> > method recovers my system every time.
>>>> >>
>>>> >> I'm confused. Didn't you start the online recovery process by using
>>>> >> pcp_recovery_node? (Of course, you could also do it via pgpoolAdmin.)
>>>> >>
>>>> >> In any case, pcp_recovery_node automatically attaches the recovered
>>>> >> node, so you don't need to execute pcp_attach_node.
>>>> >>
>>>> >> I suggest you read the tutorial:
>>>> >> http://www.pgpool.net/pgpool-web/contrib_docs/simple_sr_setting2/index.html
>>>> >> --
>>>> >> Tatsuo Ishii
>>>> >> SRA OSS, Inc. Japan
>>>> >> English: http://www.sraoss.co.jp/index_en.php
>>>> >> Japanese: http://www.sraoss.co.jp
>>>> >>
>>>>
>>>
>>>
>>

