[pgpool-general: 1405] Re: pgPool Online Recovery Streaming Replication

ning chan ninchan8328 at gmail.com
Tue Feb 19 07:24:31 JST 2013


Hi Tatsuo,
Sorry for the late reply as I was traveling.
I am able to reproduce the problem and here is what I did:

*1) Make sure primary (server0) and standby (server1) connections are good.*
[root at server0 tmp]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby
(2 rows)

[root at server1 tmp]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby
(2 rows)

*2) force shutdown primary (server0), and standby (server1) server was
promoted, below is the pgpool log from server1 during the failover:*
Feb 18 15:31:45 server1 pgpool[2691]: connect_inet_domain_socket: connect()
failed: No route to host
Feb 18 15:31:45 server1 pgpool[2691]: make_persistent_db_connection:
connection to server0(5432) failed
Feb 18 15:31:45 server1 pgpool[2691]: check_replication_time_lag: could not
connect to DB node 0, check sr_check_user and sr_check_password
Feb 18 15:31:45 server1 pgpool[2689]: connect_inet_domain_socket: connect()
failed: No route to host
Feb 18 15:31:45 server1 pgpool[2689]: connection to server0(5432) failed
Feb 18 15:31:45 server1 pgpool[2689]: new_connection: create_cp() failed
Feb 18 15:31:45 server1 pgpool[2689]: degenerate_backend_set: 0 fail over
request from pid 2689
Feb 18 15:31:48 server1 pgpool[2689]: wd_create_send_socket: connect()
reports failure (No route to host). You can safely ignore this while
starting up.
Feb 18 15:31:48 server1 pgpool[2653]: connect_inet_domain_socket: connect()
failed: No route to host
Feb 18 15:31:48 server1 pgpool[2653]: make_persistent_db_connection:
connection to server0(5432) failed
Feb 18 15:31:48 server1 pgpool[2653]: health check failed. 0 th host
server0 at port 5432 is down
Feb 18 15:31:48 server1 pgpool[2653]: health check retry sleep time: 5
second(s)
Feb 18 15:31:48 server1 pgpool[2653]: starting degeneration. shutdown host
server0(5432)
Feb 18 15:31:48 server1 pgpool[2653]: Restart all children
Feb 18 15:31:48 server1 pgpool[2653]: execute command:
/home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
-m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
Feb 18 15:31:49 server1 pgpool[2653]: find_primary_node_repeatedly: waiting
for finding a primary node
Feb 18 15:31:53 server1 postgres[1970]: [7-1] LOG:  trigger file found:
/tmp/trigger_file0
Feb 18 15:31:53 server1 postgres[2597]: [3-1] FATAL:  terminating
walreceiver process due to administrator command
Feb 18 15:31:53 server1 postgres[1970]: [8-1] LOG:  record with zero length
at 0/64000418
Feb 18 15:31:53 server1 postgres[1970]: [9-1] LOG:  redo done at 0/640003B8
Feb 18 15:31:53 server1 postgres[1970]: [10-1] LOG:  selected new timeline
ID: 15
Feb 18 15:31:53 server1 postgres[1970]: [11-1] LOG:  archive recovery
complete
Feb 18 15:31:53 server1 postgres[1968]: [2-1] LOG:  database system is
ready to accept connections
Feb 18 15:31:53 server1 postgres[3168]: [2-1] LOG:  autovacuum launcher
started
Feb 18 15:31:54 server1 pgpool[2653]: find_primary_node: primary node id is
1
Feb 18 15:31:54 server1 pgpool[2653]: starting follow degeneration.
shutdown host server0(5432)
Feb 18 15:31:54 server1 pgpool[2653]: failover: 1 follow backends have been
degenerated
Feb 18 15:31:54 server1 pgpool[2653]: failover: set new primary node: 1
Feb 18 15:31:54 server1 pgpool[2653]: failover: set new master node: 1
Feb 18 15:31:54 server1 pgpool[3171]: connection received:
host=server1.local port=58346
Feb 18 15:31:54 server1 pgpool[3170]: start triggering follow command.
Feb 18 15:31:54 server1 pgpool[3170]: execute command: echo 0 server0 5432
/opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %
Feb 18 15:31:54 server1 pgpool[2691]: worker process received restart
request
Feb 18 15:31:54 server1 pgpool[2653]: failover done. shutdown host
server0(5432)

*3) boot up the failed server (server0) as usual, pgpool and database
started up as usual.
*
*after server0 boot up, i check the pool_nodes info, server0 pgpool still
think server0 is the primary:*
[root at server0 ~]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 2      | 0.500000  | primary
 1       | server1  | 5432 | 2      | 0.500000  | standby
(2 rows)

server1 pgpool still think server1 is the primary:
[root at server1 tmp]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 3      | 0.500000  | standby
 1       | server1  | 5432 | 2      | 0.500000  | primary
(2 rows)

*4) kill the database on server0 to prepare of the recovery*
*as soon as i kill the server0 database, pgpool log shows following:*
Feb 18 15:33:56 server0 pgpool[2153]: execute command:
/home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
-m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node_repeatedly: waiting
for finding a primary node
Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node: primary node id is
1
Feb 18 15:33:57 server0 pgpool[2153]: starting follow degeneration.
shutdown host server0(5432)
Feb 18 15:33:57 server0 pgpool[2153]: failover: 1 follow backends have been
degenerated
Feb 18 15:33:57 server0 pgpool[2153]: failover: set new primary node: 1
Feb 18 15:33:57 server0 pgpool[2153]: failover: set new master node: 1
Feb 18 15:33:57 server0 pgpool[2534]: start triggering follow command.
Feb 18 15:33:57 server0 pgpool[2534]: execute command: echo 0 server0 5432
/opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data %
Feb 18 15:33:57 server0 pgpool[2190]: worker process received restart
request
Feb 18 15:33:57 server0 pgpool[2153]: failover done. shutdown host
server0(5432)

pool_nodes shows:
[root at server0 ~]# psql -c "show pool_nodes" -p 9999
 node_id | hostname | port | status | lb_weight |  role
---------+----------+------+--------+-----------+---------
 0       | server0  | 5432 | 3      | 0.500000  | standby
 1       | server1  | 5432 | 2      | 0.500000  | primary
(2 rows)


*5) on server1, execute /usr/local/bin/pcp_recovery_node 10 localhost 9898
pgpool password 0*
server 1 pgpool log shows following:
Feb 18 15:34:52 server1 pgpool[3209]: starting recovery command: "SELECT
pgpool_recovery('basebackup.sh', 'server0', '/opt/postgres/9.2/data')"
Feb 18 15:34:53 server1 pgpool[3209]: 1st stage is done
Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: try to
connect to postmaster on hostname:server0 database:postgres user:postgres
(retry 0 times)
Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: failed to
connect to postmaster on hostname:server0 database:postgres user:postgres

*The message "fail to connect to postmaster on hostname server0" continue
until the command timeout.
*
*it looks like the 1st stage command is done but pgpool on server1 did not
call the pgpool_remote_start to start the server0 database engine although
the script is there under postgresql data folder
*
[root at server1 data]# ls
backup_label.old     data       pg_hba.conf    pg_notify
pg_stat_tmp  PG_VERSION       postmaster.pid
base                 global     pg_ident.conf  pgpool_remote_start
pg_subtrans  pg_xlog          recovery.done
basebackup.sh        ning.test  pg_log         pg_serial
pg_tblspc    postgresql.conf  recovery.standby1
copy-base-backup.sh  pg_clog    pg_multixact   pg_snapshots
pg_twophase  postmaster.opts

*during this period, pgpool log on server0 shows the following:*
Feb 18 15:34:42 server0 pgpool[2567]: connection received:
host=server0.local port=45366
Feb 18 15:34:46 server0 pgpool[2567]: connection received:
host=server1.local port=39535
Feb 18 15:34:52 server0 pgpool[2567]: connection received:
host=server0.local port=45381
Feb 18 15:34:56 server0 pgpool[2565]: connection received:
host=server1.local port=39557
Feb 18 15:35:02 server0 pgpool[2565]: connection received:
host=server0.local port=45396
Feb 18 15:35:06 server0 pgpool[2565]: connection received:
host=server1.local port=39576
Feb 18 15:35:12 server0 pgpool[2565]: connection received:
host=server0.local port=45410
Feb 18 15:35:16 server0 pgpool[2565]: connection received:
host=server1.local port=39593
Feb 18 15:35:22 server0 pgpool[2565]: connection received:
host=server0.local port=45425
Feb 18 15:35:26 server0 pgpool[2565]: connection received:
host=server1.local port=39612
Feb 18 15:35:32 server0 pgpool[2565]: connection received:
host=server0.local port=45440
Feb 18 15:35:36 server0 pgpool[2565]: connection received:
host=server1.local port=39630
Feb 18 15:35:42 server0 pgpool[2565]: connection received:
host=server0.local port=45455
Feb 18 15:35:46 server0 pgpool[2565]: connection received:
host=server1.local port=39648

*Below is the script of recovery_1st_stage_command = 'basebackup.sh'*
[root at server1 ~]# more /opt/postgres/9.2/data/basebackup.sh
#! /bin/sh
# Recovery script for streaming replication.
# This script assumes that DB node 0 is primary, and 1 is standby.
#
datadir=$1
desthost=$2
destdir=$3

echo "datadir=$datadir"
echo "desthost=$desthost"
echo "destdir=$destdir"

echo "Executing basebackup.sh" >> /var/log/replication.log

echo "Executing start backup" >> /var/log/replication.log
psql -c "SELECT pg_start_backup('Streaming Replication', true)" postgres

echo "Executing rsync" >> /var/log/replication.log
echo "rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
--exclude postmaster.pid --exclude postmaster.
opts --exclude pg_log --exclude pg_xlog --exclude recovery.conf $datadir/
$desthost:$destdir/"

echo "Renaming recovery conf file" >>/var/log/replication.log
echo "ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf"
ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf

echo "Removing trigger file"
ssh -T $desthost rm -f /tmp/trigger_file0

echo "Executing stop backup" >> /var/log/replication.log
psql -c "SELECT pg_stop_backup()" postgres

echo "existing basebackup.sh" >> /var/log/replication.log

*Manually execute the command on server1 return the following result:*

[root at server1 data]# ./basebackup.sh /opt/postgres/9.2/data/ server0
/opt/postgres/9.2/data/
datadir=/opt/postgres/9.2/data/
desthost=server0
destdir=/opt/postgres/9.2/data/
 pg_start_backup
-----------------
 0/76000020
(1 row)

rsync -C -a --progress --delete -e ssh --exclude postgresql.conf --exclude
postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude pg_xlog
--exclude recovery.conf /opt/postgres/9.2/data//
server0:/opt/postgres/9.2/data//
ssh -T server0 mv /opt/postgres/9.2/data//recovery.done
/opt/postgres/9.2/data//recovery.conf
Removing trigger file
NOTICE:  WAL archiving is not enabled; you must ensure that all required
WAL segments are copied through other means to complete the backup
 pg_stop_backup
----------------
 0/760000E0
(1 row)

Thanks and please advise.

Ning


On Sun, Feb 17, 2013 at 7:05 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:

> Hi,
>
> It seems the standby was unable to start up. Can you show standby
> PostgreSQL's log? Maybe we could find the cause of the problem.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Hi Tatsuo,
> > Thank you so much for your reply.
> > Actually in my case, i was using the pcp_recovery command and execute it
> on
> > the current primary server.
> > However, if the remote node (recover node) database is off, i got the
> > following message on the primary server pgpool log:
> >
> > Jan 31 16:58:10 server0 pgpool[2723]: starting recovery command: "SELECT
> > pgpool_recovery('basebackup.sh', 'server1', '/opt/postgres/9.2/data')"
> > Jan 31 16:58:11 server0 pgpool[2723]: 1st stage is done
> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > (retry 0 times)
> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > Jan 31 16:58:13 server0 pgpool[2719]: connection received:
> > host=server0.local port=58446
> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > (retry 1 times)
> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > Jan 31 16:58:14 server0 pgpool[2719]: connection received:
> > host=server1.local port=39928
> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > (retry 2 times)
> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > (retry 3 times)
> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:postgres user:postgres
> > Jan 31 16:58:23 server0 pgpool[2719]: connection received:
> > host=server0.local port=58464
> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > (retry 0 times)
> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > (retry 1 times)
> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > Jan 31 16:58:26 server0 pgpool[2719]: connection received:
> > host=server1.local port=39946
> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > (retry 2 times)
> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > (retry 3 times)
> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > Jan 31 16:58:33 server0 pgpool[2719]: connection received:
> > host=server0.local port=58483
> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > (retry 4 times)
> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: try to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> > (retry 5 times)
> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started: failed to
> > connect to postmaster on hostname:server1 database:template1
> user:postgres
> >
> > here is the exact command i execute on server0 to recover server1,
> > /usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1
> >
> > Do you have any idea why?
> >
> > Just FYI, we cannot pgpoolAdmin in our environment.
> >
> >
> > On Sun, Feb 17, 2013 at 12:13 AM, Tatsuo Ishii <ishii at postgresql.org>
> wrote:
> >
> >> > Hi all,
> >> > I have the following question regarding the recovery of a filed
> Primary
> >> > Database Server.
> >> >
> >> > Question 1: in the documentation, under Streaming Replication Online
> >> > Recovery section.
> >> >
> >> > http://www.pgpool.net/docs/latest/pgpool-en.html#stream
> >> >
> >> > in steps 6:
> >> >
> >> >    1. After completing online recovery, pgpool-II will start
> PostgreSQL
> >> on
> >> >    the standby node. Install the script for this purpose on each DB
> >> > nodes. Sample
> >> >    script <http://www.pgpool.net/docs/latest/pgpool_remote_start> is
> >> >    included in "sample" directory of the source code. This script uses
> >> ssh.
> >> >    You need to allow recovery_user to login from the primary node to
> the
> >> >    standby node without being asked password.
> >> >
> >> > To my understanding, i think the postgreSQL doesn't not need to be
> online
> >> > for the recovery process right? Since later on it mentions that
> >> > pgpool_remote_start will start up the DB on the failed node.
> >>
> >> Acually standby PostgreSQL node should not be started.
> >>
> >> > Question 2: in my configuration, i have 2 pgpool server with two
> >> backends.
> >> > Will it work for oneline recovery?
> >>
> >> Yes, but online recovery process should be initiated by one of pgpool,
> >> not both. If you enable pgpool-II 3.2's watchdog, it will take care of
> >> neccessary interlocking.
> >>
> >> > Question 3: when the failed node comes back online, should i use
> >> > pcp_recovery from DB primary or should i use pcp_attach on the failed
> >> node
> >> > to recover the failed system? Actually in my case, both methods do not
> >> > recover my system every time.
> >>
> >> I'm confused. Didn't you start the online recovery process by using
> >> pcp_recovery_node?(of course you could do it via pgpoolAdmin).
> >>
> >> Anyway pcp_recovery_node automatically attach recovered node, and you
> >> don't need to execute pcp_attach_node.
> >>
> >> I suggest you read tutorial:
> >>
> http://www.pgpool.net/pgpool-web/contrib_docs/simple_sr_setting2/index.html
> >> --
> >> Tatsuo Ishii
> >> SRA OSS, Inc. Japan
> >> English: http://www.sraoss.co.jp/index_en.php
> >> Japanese: http://www.sraoss.co.jp
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130218/8ed1729b/attachment-0001.html>


More information about the pgpool-general mailing list