[pgpool-general: 1449] Re: pgPool Online Recovery Streaming Replication

ning chan ninchan8328 at gmail.com
Fri Mar 1 14:30:29 JST 2013


Hi Tatsuo,
I reproduced the problem on pgpool 3.2.3.
Basically, server1 was originally the primary and server0 the standby. When
server1 kernel panicked, pgpool failed over and server0 became the primary,
which is expected.
When the failed server1 came back online, I executed pcp_recovery_node on
server0 to recover server1 and reconnect it to the primary as a standby,
which also worked as expected.
Now, when server0 kernel panicked, pgpool on server1 triggered the failover
script; however, the script did not actually get executed, although it is
logged in the pgpool log.
See below for the trace:

Feb 28 21:47:40 server1 pgpool[3593]: connect_inet_domain_socket: select()
timedout. retrying...
Feb 28 21:47:40 server1 pgpool[3679]: connect_inet_domain_socket: select()
timedout. retrying...
Feb 28 21:47:41 server1 pgpool[3594]: wd_create_send_socket: connect()
reports failure (No route to host). You can safely ignore this while
starting up.
Feb 28 21:47:41 server1 pgpool[2054]: pool_flush_it: write failed to
backend (0). reason: No route to host offset: 0 wlen: 42
Feb 28 21:47:41 server1 pgpool[2054]: notice_backend_error: called from
pgpool main. ignored.
Feb 28 21:47:41 server1 pgpool[2054]: child_exit: called from pgpool main.
ignored.
Feb 28 21:47:41 server1 pgpool[2054]: pool_read: EOF encountered with
backend
Feb 28 21:47:41 server1 pgpool[2054]: s_do_auth: error while reading
message kind
Feb 28 21:47:41 server1 pgpool[2054]: make_persistent_db_connection:
s_do_auth failed
Feb 28 21:47:41 server1 pgpool[2054]: health check failed. 0 th host
server0 at port 5432 is down
Feb 28 21:47:41 server1 pgpool[2054]: health check retry sleep time: 5
second(s)
Feb 28 21:47:41 server1 pgpool[2054]: starting degeneration. shutdown host
server0(5432)
Feb 28 21:47:41 server1 pgpool[2054]: Restart all children
Feb 28 21:47:41 server1 pgpool[2054]: execute command:
/home/pgpool/failover.py -d 0 -h server0 -p 5432 -D /opt/postgres/9.2/data
-m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
Feb 28 21:47:41 server1 pgpool[2054]: find_primary_node_repeatedly: waiting
for finding a primary node
Feb 28 21:47:41 server1 pgpool[2054]: connect_inet_domain_socket_by_port:
health check timer expired
Feb 28 21:47:41 server1 pgpool[2054]: make_persistent_db_connection:
connection to server1(5432) failed
Feb 28 21:47:41 server1 pgpool[2054]: find_primary_node:
make_persistent_connection failed
Feb 28 21:47:41 server1 pgpool[3679]: connect_inet_domain_socket: select()
timedout. retrying...
Feb 28 21:47:42 server1 pgpool[2054]: connect_inet_domain_socket_by_port:
health check timer expired

The above logs repeat a few times, and after a few seconds I see this:
Feb 28 21:47:47 server1 pgpool[3679]: check_replication_time_lag: could not
connect to DB node 0, check sr_check_user and sr_check_password

Feb 28 21:48:24 server1 pgpool[2054]: find_primary_node:
make_persistent_connection failed
Feb 28 21:48:25 server1 pgpool[3679]: connect_inet_domain_socket: select()
timedout. retrying...
Feb 28 21:48:25 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 0 (server1:9999) seems not to be working
Feb 28 21:48:25 server1 pgpool[3679]: connect_inet_domain_socket:
getsockopt() detected error: No route to host
Feb 28 21:48:25 server1 pgpool[3679]: make_persistent_db_connection:
connection to server0(5432) failed
Feb 28 21:48:25 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 1 (server0:9999) seems not to be working
Feb 28 21:48:25 server1 pgpool[2057]: wd_lifecheck: watchdog status is
DOWN. You need to restart this for recovery.
Feb 28 21:48:25 server1 pgpool[3679]: check_replication_time_lag: could not
connect to DB node 0, check sr_check_user and sr_check_password

Feb 28 21:48:44 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 0 (server1:9999) seems not to be working
Feb 28 21:48:44 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 1 (server0:9999) seems not to be working

Feb 28 21:49:10 server1 pgpool[2054]: connect_inet_domain_socket_by_port:
health check timer expired
Feb 28 21:49:10 server1 pgpool[2054]: make_persistent_db_connection:
connection to server1(5432) failed
Feb 28 21:49:10 server1 pgpool[2054]: find_primary_node:
make_persistent_connection failed
Feb 28 21:49:11 server1 pgpool[2054]: failover: no follow backends are
degenerated
Feb 28 21:49:11 server1 pgpool[2054]: failover: set new primary node: -1
Feb 28 21:49:11 server1 pgpool[2054]: failover: set new master node: 1
Feb 28 21:49:11 server1 pgpool[4012]: do_child: failback event found.
restart myself.
Feb 28 21:49:11 server1 pgpool[4013]: do_child: failback event found.
restart myself.
Feb 28 21:49:11 server1 pgpool[4014]: do_child: failback event found.
restart myself.
Feb 28 21:49:11 server1 pgpool[4015]: do_child: failback event found.
restart myself.

Feb 28 21:49:11 server1 pgpool[4030]: do_child: failback event found.
restart myself.
Feb 28 21:49:11 server1 pgpool[4039]: do_child: failback event found.
restart myself.
Feb 28 21:49:11 server1 pgpool[4040]: connection received:
host=server1.local port=41474
Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect packet
length (0)
Feb 28 21:49:11 server1 pgpool[4040]: connection received:
host=server1.local port=41481
Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect packet
length (0)
Feb 28 21:49:11 server1 pgpool[4040]: connection received:
host=server1.local port=41485
Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect packet
length (0)
Feb 28 21:49:11 server1 pgpool[4040]: connection received:
host=server1.local port=41491
Feb 28 21:49:11 server1 pgpool[4040]: read_startup_packet: incorrect packet
length (0)
Feb 28 21:49:11 server1 pgpool[2054]: failover done. shutdown host
server0(5432)
Feb 28 21:49:11 server1 pgpool[4029]: do_child: failback event found.
restart myself.
Feb 28 21:49:12 server1 pgpool[3678]: pcp child process received restart
request
Feb 28 21:49:12 server1 pgpool[2054]: PCP child 3678 exits with status 256
in failover()
Feb 28 21:49:12 server1 pgpool[2054]: fork a new PCP child pid 4042 in
failover()
Feb 28 21:49:12 server1 pgpool[2054]: worker child 3679 exits with status
256
Feb 28 21:49:12 server1 pgpool[2054]: fork a new worker child pid 4043
Feb 28 21:49:12 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
health check timer expired
Feb 28 21:49:12 server1 pgpool[4043]: make_persistent_db_connection:
connection to server1(5432) failed
Feb 28 21:49:12 server1 pgpool[2054]: after some retrying backend returned
to healthy state
Feb 28 21:49:13 server1 pgpool[4073]: connection received:
host=server1.local port=41496
Feb 28 21:49:13 server1 pgpool[4073]: connect_inet_domain_socket_by_port:
health check timer expired
Feb 28 21:49:13 server1 pgpool[4073]: connection to server1(5432) failed
Feb 28 21:49:13 server1 pgpool[4073]: new_connection: create_cp() failed
Feb 28 21:49:13 server1 pgpool[4073]: degenerate_backend_set: 1 fail over
request from pid 4073
Feb 28 21:49:13 server1 pgpool[2054]: starting degeneration. shutdown host
server1(5432)
Feb 28 21:49:13 server1 pgpool[2054]: failover_handler: no valid DB node
found
Feb 28 21:49:13 server1 pgpool[2054]: Restart all children
Feb 28 21:49:13 server1 pgpool[2054]: execute command:
/home/pgpool/failover.py -d 1 -h server1 -p 5432 -D /opt/postgres/9.2/data
-m -1 -H  -M 1 -P 1 -r  -R
Feb 28 21:49:13 server1 pgpool[2054]: find_primary_node_repeatedly: waiting
for finding a primary node
Feb 28 21:49:16 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 0 (server1:9999) seems not to be working
Feb 28 21:49:16 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 1 (server0:9999) seems not to be working
Feb 28 21:49:16 server1 pgpool[2057]: wd_lifecheck: watchdog status is
DOWN. You need to restart this for recovery.
Feb 28 21:49:22 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
health check timer expired
Feb 28 21:49:22 server1 pgpool[4043]: make_persistent_db_connection:
connection to server1(5432) failed
Feb 28 21:49:32 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
health check timer expired
Feb 28 21:49:32 server1 pgpool[4043]: make_persistent_db_connection:
connection to server1(5432) failed
Feb 28 21:49:35 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 0 (server1:9999) seems not to be working
Feb 28 21:49:35 server1 pgpool[2057]: wd_lifecheck: lifecheck failed 3
times. pgpool 1 (server0:9999) seems not to be working
Feb 28 21:49:35 server1 pgpool[2057]: wd_lifecheck: watchdog status is
DOWN. You need to restart this for recovery.
Feb 28 21:49:42 server1 pgpool[4043]: connect_inet_domain_socket_by_port:
health check timer expired


Question 1: Why did the failover script not execute? Is it because the primary
is -1?
Question 2: Why was the failover script triggered again?
Question 3: Why did pgpool on server1 fail? Why is the watchdog on server1 down?

Thanks~
Ning
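
P.S. For context on Question 1: the failover_command here is just a small script
that pgpool invokes with the failed-node parameters; judging from the "execute
command:" lines above, pgpool.conf presumably wires it up as something like
failover_command = '/home/pgpool/failover.py -d %d -h %h -p %p -D %D -m %m -H %H -M %M -P %P -r %r -R %R'.
Below is a minimal sketch of what such a script usually does. The real
failover.py is not shown in this thread, so the trigger-file path, ssh user, and
flag handling are assumptions, not the actual script:

#! /bin/sh
# Minimal failover_command sketch (illustrative only, not the actual failover.py).
failed_node_id=$1    # %d: id of the node that went down
failed_host=$2       # %h: host of the node that went down
old_primary_id=$3    # %P: node id of the old primary
new_master_host=$4   # %H: host of the new master candidate
trigger_file=/tmp/trigger_file0   # assumed; must match trigger_file in the standby's recovery.conf

# Promote only when the node that failed was the primary; otherwise do nothing.
if [ "$failed_node_id" = "$old_primary_id" ]; then
    # Touching the trigger file on the surviving node makes its standby promote itself.
    ssh -T postgres@"$new_master_host" "touch $trigger_file"
fi
exit 0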





On Wed, Feb 27, 2013 at 10:44 PM, ning chan <ninchan8328 at gmail.com> wrote:

> Hi Tatsuo,
> I am on 3.2.1. I will try 3.2.3 and report back to you.
>
> Meanwhile, I have a question about the backend server settings in
> pgpool.conf.
> I have two pgpools, each pointing to the same backend servers;
> should both pgpool.conf files have the same backend_hostname configuration?
>
> *****pgpool-A****
> backend_hostname0 = 'se032c-94-30'
> backend_port0 = 5432
> backend_weight0 = 1
> backend_data_directory0 = '/opt/postgres/9.2/data'
> backend_flag0 = 'ALLOW_TO_FAILOVER'
>
> backend_hostname1 = 'se032c-94-31'
> backend_port1 = 5432
> backend_weight1 = 1
> backend_data_directory1 = '/opt/postgres/9.2/data'
> backend_flag1 = 'ALLOW_TO_FAILOVER'
>
>
> ****pgpool-B****
> backend_hostname0 = 'se032c-94-30'
> backend_port0 = 5432
> backend_weight0 = 1
> backend_data_directory0 = '/opt/postgres/9.2/data'
> backend_flag0 = 'ALLOW_TO_FAILOVER'
>
> backend_hostname1 = 'se032c-94-31'
> backend_port1 = 5432
> backend_weight1 = 1
> backend_data_directory1 = '/opt/postgres/9.2/data'
> backend_flag1 = 'ALLOW_TO_FAILOVER'
>
> Thanks~
> Ning
>
>
> On Wed, Feb 27, 2013 at 6:12 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>
>> What version of pgpool-II are you using?  We have found a failover
>> handling problem with 3.2.2 and released 3.2.3 afterwards.
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese: http://www.sraoss.co.jp
>>
>> > Hi Tatsuo,
>> > Have you had a chance to look at this problem yet?
>> >
>> > Thanks~
>> > Ning
>> >
>> >
>> > On Tue, Feb 19, 2013 at 3:02 PM, ning chan <ninchan8328 at gmail.com>
>> wrote:
>> >
>> >> Hi Tatsuo,
>> >> I traced the problem to a scripting error in
>> >> pgpool_remote_start, which had an incorrect path to pg_ctl.
>> >> As soon as I corrected it, I could recover the failed server and bring it
>> >> online.
>> >> However, I am now facing another problem.
>> >> After I brought the failed master back into pgpool as a standby server,
>> >> I then shut down the current primary server, expecting the standby server
>> >> to be promoted to the new primary; however, it did not happen.
>> >> I checked the pgpool log and see the failover command is called, but it did
>> >> not execute. I have checked and confirmed that my failover script works fine
>> >> on its own.
>> >>
>> >> Here is the log:
>> >> Feb 19 14:51:40 server0 pgpool[3519]: set 1 th backend down status
>> >> Feb 19 14:51:43 server0 pgpool[3554]: wd_create_send_socket: connect()
>> >> reports failure (No route to host). You can safely ignore this while
>> >> starting up.
>> >> Feb 19 14:51:43 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> >> times. pgpool seems not to be working
>> >> Feb 19 14:51:43 server0 pgpool[3522]: wd_IP_down: ifconfig down
>> succeeded
>> >> Feb 19 14:51:43 server0 pgpool[3519]: starting degeneration. shutdown
>> host
>> >> server1(5432)
>> >> Feb 19 14:51:43 server0 pgpool[3519]: Restart all children
>> >> Feb 19 14:51:43 server0 pgpool[3519]: execute command:
>> >> /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D
>> /opt/postgres/9.2/data
>> >> -m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
>> >> Feb 19 14:51:43 server0 postgres[3939]: [2-1] LOG:  incomplete startup
>> >> packet
>> >> Feb 19 14:51:43 server0 postgres[3931]: [2-1] LOG:  incomplete startup
>> >> packet
>> >> Feb 19 14:51:43 server0 postgres[3935]: [2-1] LOG:  incomplete startup
>> >> packet
>> >> Feb 19 14:51:43 server0 pgpool[3519]: find_primary_node_repeatedly:
>> >> waiting for finding a primary node
>> >> Feb 19 14:51:46 server0 pgpool[3522]: wd_create_send_socket: connect()
>> >> reports failure (No route to host). You can safely ignore this while
>> >> starting up.
>> >> Feb 19 14:51:46 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> >> times. pgpool seems not to be working
>> >> Feb 19 14:51:54 server0 pgpool[3556]: connect_inet_domain_socket:
>> >> connect() failed: Connection timed out
>> >> Feb 19 14:51:54 server0 pgpool[3556]: make_persistent_db_connection:
>> >> connection to server1(5432) failed
>> >> Feb 19 14:51:54 server0 pgpool[3556]: check_replication_time_lag: could
>> >> not connect to DB node 1, check sr_check_user and sr_check_password
>> >> Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> >> times. pgpool seems not to be working
>> >> Feb 19 14:52:05 server0 pgpool[3556]: connect_inet_domain_socket:
>> >> connect() failed: No route to host
>> >> Feb 19 14:52:05 server0 pgpool[3556]: make_persistent_db_connection:
>> >> connection to server1(5432) failed
>> >> Feb 19 14:52:05 server0 pgpool[3556]: check_replication_time_lag: could
>> >> not connect to DB node 1, check sr_check_user and sr_check_password
>> >> Feb 19 14:52:05 server0 pgpool[3522]: wd_lifecheck: lifecheck failed 3
>> >> times. pgpool seems not to be working
>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover: no follow backends are
>> >> degenerated
>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover: set new primary node:
>> -1
>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover: set new master node: 0
>> >> Feb 19 14:52:14 server0 pgpool[3980]: connection received:
>> >> host=server0.local port=45361
>> >> Feb 19 14:52:14 server0 pgpool[3556]: worker process received restart
>> >> request
>> >> Feb 19 14:52:14 server0 pgpool[3519]: failover done. shutdown host
>> >> server1(5432)
>> >>
>> >> As you can see, PostgreSQL was not restarted.
>> >>
>> >> For comparison, the log from a successful failover looks like this:
>> >> Feb 19 13:47:01 server0 pgpool[4391]: execute command:
>> >> /home/pgpool/failover.py -d 1 -h server1 -p 5432 -D
>> /opt/postgres/9.2/data
>> >> -m 0 -H server0 -M 0 -P 1 -r 5432 -R /opt/postgres/9.2/data
>> >> Feb 19 13:47:01 server0 postgres[4786]: [2-1] LOG:  incomplete startup
>> >> packet
>> >> Feb 19 13:47:02 server0 pgpool[4391]: find_primary_node_repeatedly:
>> >> waiting for finding a primary node
>> >> Feb 19 13:47:06 server0 postgres[3083]: [6-1] LOG:  trigger file found:
>> >> /opt/postgres/9.2/data/trigger_file0
>> >> Feb 19 13:47:06 server0 postgres[3083]: [7-1] LOG:  redo done at
>> 0/91000020
>> >> Feb 19 13:47:06 server0 postgres[3083]: [8-1] LOG:  selected new
>> timeline
>> >> ID: 18
>> >> Feb 19 13:47:06 server0 postgres[3083]: [9-1] LOG:  archive recovery
>> >> complete
>> >> Feb 19 13:47:06 server0 postgres[3081]: [2-1] LOG:  database system is
>> >> ready to accept connections
>> >> Feb 19 13:47:06 server0 postgres[4804]: [2-1] LOG:  autovacuum launcher
>> >> started
>> >> Feb 19 13:47:07 server0 pgpool[4391]: find_primary_node: primary node
>> id
>> >> is 0
>> >> Feb 19 13:47:07 server0 pgpool[4391]: starting follow degeneration.
>> >> shutdown host server1(5432)
>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover: 1 follow backends have
>> >> been degenerated
>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover: set new primary node: 0
>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover: set new master node: 0
>> >> Feb 19 13:47:07 server0 pgpool[4817]: connection received:
>> >> host=server1.local port=54619
>> >> Feb 19 13:47:07 server0 pgpool[4816]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4818]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4819]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4815]: start triggering follow command.
>> >> Feb 19 13:47:07 server0 pgpool[4815]: execute command: echo 1 server1
>> 5432
>> >> /opt/postgres/9.2/data 0 server0 0 1 5432 /opt/postgres/9.2/data %
>> >> Feb 19 13:47:07 server0 pgpool[4821]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4822]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4823]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4820]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4827]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4828]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4826]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4829]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4830]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4831]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4832]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4833]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4834]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4835]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4836]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4837]: do_child: failback event found.
>> >> restart myself.
>> >> Feb 19 13:47:07 server0 pgpool[4391]: failover done. shutdown host
>> >> server1(5432)
>> >>
>> >> Any idea why failover did not work after the recovery? Do I need to
>> >> restart the remote pgpool after recovery?
>> >>
>> >>
>> >> On Mon, Feb 18, 2013 at 4:24 PM, ning chan <ninchan8328 at gmail.com>
>> wrote:
>> >>
>> >>>
>> >>> Hi Tatsuo,
>> >>> Sorry for the late reply; I was traveling.
>> >>> I was able to reproduce the problem, and here is what I did:
>> >>>
>> >>> *1) Make sure primary (server0) and standby (server1) connections are
>> >>> good.*
>> >>> [root at server0 tmp]# psql -c "show pool_nodes" -p 9999
>> >>>  node_id | hostname | port | status | lb_weight |  role
>> >>> ---------+----------+------+--------+-----------+---------
>> >>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>> >>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>> >>> (2 rows)
>> >>>
>> >>> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
>> >>>  node_id | hostname | port | status | lb_weight |  role
>> >>> ---------+----------+------+--------+-----------+---------
>> >>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>> >>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>> >>> (2 rows)
>> >>>
>> >>> *2) Force-shut down the primary (server0); the standby server (server1) was
>> >>> promoted. Below is the pgpool log from server1 during the failover:*
>> >>> Feb 18 15:31:45 server1 pgpool[2691]: connect_inet_domain_socket:
>> >>> connect() failed: No route to host
>> >>> Feb 18 15:31:45 server1 pgpool[2691]: make_persistent_db_connection:
>> >>> connection to server0(5432) failed
>> >>> Feb 18 15:31:45 server1 pgpool[2691]: check_replication_time_lag:
>> could
>> >>> not connect to DB node 0, check sr_check_user and sr_check_password
>> >>> Feb 18 15:31:45 server1 pgpool[2689]: connect_inet_domain_socket:
>> >>> connect() failed: No route to host
>> >>> Feb 18 15:31:45 server1 pgpool[2689]: connection to server0(5432)
>> failed
>> >>> Feb 18 15:31:45 server1 pgpool[2689]: new_connection: create_cp()
>> failed
>> >>> Feb 18 15:31:45 server1 pgpool[2689]: degenerate_backend_set: 0 fail
>> over
>> >>> request from pid 2689
>> >>> Feb 18 15:31:48 server1 pgpool[2689]: wd_create_send_socket: connect()
>> >>> reports failure (No route to host). You can safely ignore this while
>> >>> starting up.
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: connect_inet_domain_socket:
>> >>> connect() failed: No route to host
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: make_persistent_db_connection:
>> >>> connection to server0(5432) failed
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: health check failed. 0 th host
>> >>> server0 at port 5432 is down
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: health check retry sleep time: 5
>> >>> second(s)
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: starting degeneration. shutdown
>> >>> host server0(5432)
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: Restart all children
>> >>> Feb 18 15:31:48 server1 pgpool[2653]: execute command:
>> >>> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D
>> /opt/postgres/9.2/data
>> >>> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
>> >>> Feb 18 15:31:49 server1 pgpool[2653]: find_primary_node_repeatedly:
>> >>> waiting for finding a primary node
>> >>> Feb 18 15:31:53 server1 postgres[1970]: [7-1] LOG:  trigger file
>> found:
>> >>> /tmp/trigger_file0
>> >>> Feb 18 15:31:53 server1 postgres[2597]: [3-1] FATAL:  terminating
>> >>> walreceiver process due to administrator command
>> >>> Feb 18 15:31:53 server1 postgres[1970]: [8-1] LOG:  record with zero
>> >>> length at 0/64000418
>> >>> Feb 18 15:31:53 server1 postgres[1970]: [9-1] LOG:  redo done at
>> >>> 0/640003B8
>> >>> Feb 18 15:31:53 server1 postgres[1970]: [10-1] LOG:  selected new
>> >>> timeline ID: 15
>> >>> Feb 18 15:31:53 server1 postgres[1970]: [11-1] LOG:  archive recovery
>> >>> complete
>> >>> Feb 18 15:31:53 server1 postgres[1968]: [2-1] LOG:  database system is
>> >>> ready to accept connections
>> >>> Feb 18 15:31:53 server1 postgres[3168]: [2-1] LOG:  autovacuum
>> launcher
>> >>> started
>> >>> Feb 18 15:31:54 server1 pgpool[2653]: find_primary_node: primary node
>> id
>> >>> is 1
>> >>> Feb 18 15:31:54 server1 pgpool[2653]: starting follow degeneration.
>> >>> shutdown host server0(5432)
>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover: 1 follow backends have
>> >>> been degenerated
>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new primary node:
>> 1
>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover: set new master node: 1
>> >>> Feb 18 15:31:54 server1 pgpool[3171]: connection received:
>> >>> host=server1.local port=58346
>> >>> Feb 18 15:31:54 server1 pgpool[3170]: start triggering follow command.
>> >>> Feb 18 15:31:54 server1 pgpool[3170]: execute command: echo 0 server0
>> >>> 5432 /opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data
>> %
>> >>> Feb 18 15:31:54 server1 pgpool[2691]: worker process received restart
>> >>> request
>> >>> Feb 18 15:31:54 server1 pgpool[2653]: failover done. shutdown host
>> >>> server0(5432)
>> >>>
>> >>> *3) Boot up the failed server (server0) as usual; pgpool and the database
>> >>> start up as usual.*
>> >>> *After server0 boots up, I check the pool_nodes info; pgpool on server0
>> >>> still thinks server0 is the primary:*
>> >>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>> >>>  node_id | hostname | port | status | lb_weight |  role
>> >>> ---------+----------+------+--------+-----------+---------
>> >>>  0       | server0  | 5432 | 2      | 0.500000  | primary
>> >>>  1       | server1  | 5432 | 2      | 0.500000  | standby
>> >>> (2 rows)
>> >>>
>> >>> pgpool on server1 still thinks server1 is the primary:
>> >>> [root at server1 tmp]# psql -c "show pool_nodes" -p 9999
>> >>>  node_id | hostname | port | status | lb_weight |  role
>> >>> ---------+----------+------+--------+-----------+---------
>> >>>  0       | server0  | 5432 | 3      | 0.500000  | standby
>> >>>  1       | server1  | 5432 | 2      | 0.500000  | primary
>> >>> (2 rows)
>> >>>
>> >>> *4) Kill the database on server0 to prepare for the recovery.*
>> >>> *As soon as I kill the server0 database, the pgpool log shows the following:*
>> >>> Feb 18 15:33:56 server0 pgpool[2153]: execute command:
>> >>> /home/pgpool/failover.py -d 0 -h server0 -p 5432 -D
>> /opt/postgres/9.2/data
>> >>> -m 1 -H server1 -M 0 -P 0 -r 5432 -R /opt/postgres/9.2/data
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node_repeatedly:
>> >>> waiting for finding a primary node
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: find_primary_node: primary node
>> id
>> >>> is 1
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: starting follow degeneration.
>> >>> shutdown host server0(5432)
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover: 1 follow backends have
>> >>> been degenerated
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new primary node:
>> 1
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover: set new master node: 1
>> >>> Feb 18 15:33:57 server0 pgpool[2534]: start triggering follow command.
>> >>> Feb 18 15:33:57 server0 pgpool[2534]: execute command: echo 0 server0
>> >>> 5432 /opt/postgres/9.2/data 1 server1 0 0 5432 /opt/postgres/9.2/data
>> %
>> >>> Feb 18 15:33:57 server0 pgpool[2190]: worker process received restart
>> >>> request
>> >>> Feb 18 15:33:57 server0 pgpool[2153]: failover done. shutdown host
>> >>> server0(5432)
>> >>>
>> >>> pool_nodes shows:
>> >>> [root at server0 ~]# psql -c "show pool_nodes" -p 9999
>> >>>  node_id | hostname | port | status | lb_weight |  role
>> >>> ---------+----------+------+--------+-----------+---------
>> >>>  0       | server0  | 5432 | 3      | 0.500000  | standby
>> >>>  1       | server1  | 5432 | 2      | 0.500000  | primary
>> >>> (2 rows)
>> >>>
>> >>>
>> >>> *5) On server1, execute /usr/local/bin/pcp_recovery_node 10 localhost
>> >>> 9898 pgpool password 0*
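>> >>>
>> >>> (For reference, the argument order of this PCP command in pgpool-II 3.x is,
>> >>> as far as I know:
>> >>>
>> >>> # pcp_recovery_node <timeout> <pgpool_host> <pcp_port> <pcp_user> <pcp_password> <node_id>
>> >>> pcp_recovery_node 10 localhost 9898 pgpool password 0
>> >>>
>> >>> i.e. ask the local pgpool, through PCP port 9898, to run online recovery for
>> >>> node 0 with a 10-second timeout.)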
>> >>> The server1 pgpool log shows the following:
>> >>> Feb 18 15:34:52 server1 pgpool[3209]: starting recovery command:
>> "SELECT
>> >>> pgpool_recovery('basebackup.sh', 'server0', '/opt/postgres/9.2/data')"
>> >>> Feb 18 15:34:53 server1 pgpool[3209]: 1st stage is done
>> >>> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started: try to
>> >>> connect to postmaster on hostname:server0 database:postgres
>> user:postgres
>> >>> (retry 0 times)
>> >>> Feb 18 15:34:53 server1 pgpool[3209]: check_postmaster_started:
>> failed to
>> >>> connect to postmaster on hostname:server0 database:postgres
>> user:postgres
>> >>>
>> >>> *The message "failed to connect to postmaster on hostname:server0"
>> >>> continues until the command times out.*
>> >>> *It looks like the 1st stage command completed, but pgpool on server1 did
>> >>> not call pgpool_remote_start to start the server0 database engine, although
>> >>> the script is there under the PostgreSQL data directory (a sketch of that
>> >>> script follows the listing below):*
>> >>> [root at server1 data]# ls
>> >>> backup_label.old     data       pg_hba.conf    pg_notify
>> >>> pg_stat_tmp  PG_VERSION       postmaster.pid
>> >>> base                 global     pg_ident.conf  pgpool_remote_start
>> >>> pg_subtrans  pg_xlog          recovery.done
>> >>> basebackup.sh        ning.test  pg_log         pg_serial
>> >>> pg_tblspc    postgresql.conf  recovery.standby1
>> >>> copy-base-backup.sh  pg_clog    pg_multixact   pg_snapshots
>> >>> pg_twophase  postmaster.opts
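>> >>>
>> >>> For reference, the sample pgpool_remote_start shipped with pgpool is roughly
>> >>> the sketch below; pgpool runs it after the 1st stage to start PostgreSQL on
>> >>> the recovered node over ssh. The pg_ctl path is an assumption for this setup
>> >>> and must match the installation; an incorrect pg_ctl path in this script was
>> >>> exactly the problem I fixed earlier in this thread.
>> >>>
>> >>> #! /bin/sh
>> >>> # pgpool_remote_start <destination host> <destination data directory>
>> >>> DEST=$1
>> >>> DESTDIR=$2
>> >>> PGCTL=/opt/postgres/9.2/bin/pg_ctl   # assumed path; must exist on $DEST
>> >>>
>> >>> # Start PostgreSQL on the recovered node (password-less ssh required).
>> >>> ssh -T $DEST $PGCTL -w -D $DESTDIR start 2>/dev/null 1>/dev/null < /dev/null &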
>> >>>
>> >>> *During this period, the pgpool log on server0 shows the following:*
>> >>> Feb 18 15:34:42 server0 pgpool[2567]: connection received:
>> >>> host=server0.local port=45366
>> >>> Feb 18 15:34:46 server0 pgpool[2567]: connection received:
>> >>> host=server1.local port=39535
>> >>> Feb 18 15:34:52 server0 pgpool[2567]: connection received:
>> >>> host=server0.local port=45381
>> >>> Feb 18 15:34:56 server0 pgpool[2565]: connection received:
>> >>> host=server1.local port=39557
>> >>> Feb 18 15:35:02 server0 pgpool[2565]: connection received:
>> >>> host=server0.local port=45396
>> >>> Feb 18 15:35:06 server0 pgpool[2565]: connection received:
>> >>> host=server1.local port=39576
>> >>> Feb 18 15:35:12 server0 pgpool[2565]: connection received:
>> >>> host=server0.local port=45410
>> >>> Feb 18 15:35:16 server0 pgpool[2565]: connection received:
>> >>> host=server1.local port=39593
>> >>> Feb 18 15:35:22 server0 pgpool[2565]: connection received:
>> >>> host=server0.local port=45425
>> >>> Feb 18 15:35:26 server0 pgpool[2565]: connection received:
>> >>> host=server1.local port=39612
>> >>> Feb 18 15:35:32 server0 pgpool[2565]: connection received:
>> >>> host=server0.local port=45440
>> >>> Feb 18 15:35:36 server0 pgpool[2565]: connection received:
>> >>> host=server1.local port=39630
>> >>> Feb 18 15:35:42 server0 pgpool[2565]: connection received:
>> >>> host=server0.local port=45455
>> >>> Feb 18 15:35:46 server0 pgpool[2565]: connection received:
>> >>> host=server1.local port=39648
>> >>>
>> >>> *Below is the script configured as recovery_1st_stage_command = 'basebackup.sh':*
>> >>> [root at server1 ~]# more /opt/postgres/9.2/data/basebackup.sh
>> >>> #! /bin/sh
>> >>> # Recovery script for streaming replication.
>> >>> # This script assumes that DB node 0 is primary, and 1 is standby.
>> >>> #
>> >>> datadir=$1
>> >>> desthost=$2
>> >>> destdir=$3
>> >>>
>> >>> echo "datadir=$datadir"
>> >>> echo "desthost=$desthost"
>> >>> echo "destdir=$destdir"
>> >>>
>> >>> echo "Executing basebackup.sh" >> /var/log/replication.log
>> >>>
>> >>> echo "Executing start backup" >> /var/log/replication.log
>> >>> psql -c "SELECT pg_start_backup('Streaming Replication', true)"
>> postgres
>> >>>
>> >>> echo "Executing rsync" >> /var/log/replication.log
>> >>> echo "rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
>> >>> --exclude postmaster.pid --exclude postmaster.
>> >>> opts --exclude pg_log --exclude pg_xlog --exclude recovery.conf
>> $datadir/
>> >>> $desthost:$destdir/"
>> >>>
>> >>> echo "Renaming recovery conf file" >>/var/log/replication.log
>> >>> echo "ssh -T $desthost mv $destdir/recovery.done
>> $destdir/recovery.conf"
>> >>> ssh -T $desthost mv $destdir/recovery.done $destdir/recovery.conf
>> >>>
>> >>> echo "Removing trigger file"
>> >>> ssh -T $desthost rm -f /tmp/trigger_file0
>> >>>
>> >>> echo "Executing stop backup" >> /var/log/replication.log
>> >>> psql -c "SELECT pg_stop_backup()" postgres
>> >>>
>> >>> echo "existing basebackup.sh" >> /var/log/replication.log
>> >>>
>> >>> *Manually executing the command on server1 returns the following result:*
>> >>>
>> >>> [root at server1 data]# ./basebackup.sh /opt/postgres/9.2/data/ server0
>> >>> /opt/postgres/9.2/data/
>> >>> datadir=/opt/postgres/9.2/data/
>> >>> desthost=server0
>> >>> destdir=/opt/postgres/9.2/data/
>> >>>  pg_start_backup
>> >>> -----------------
>> >>>  0/76000020
>> >>> (1 row)
>> >>>
>> >>> rsync -C -a --progress --delete -e ssh --exclude postgresql.conf
>> >>> --exclude postmaster.pid --exclude postmaster.opts --exclude pg_log
>> >>> --exclude pg_xlog --exclude recovery.conf /opt/postgres/9.2/data//
>> >>> server0:/opt/postgres/9.2/data//
>> >>> ssh -T server0 mv /opt/postgres/9.2/data//recovery.done
>> >>> /opt/postgres/9.2/data//recovery.conf
>> >>> Removing trigger file
>> >>> NOTICE:  WAL archiving is not enabled; you must ensure that all
>> required
>> >>> WAL segments are copied through other means to complete the backup
>> >>>  pg_stop_backup
>> >>> ----------------
>> >>>  0/760000E0
>> >>> (1 row)
>> >>>
>> >>> Thanks and please advise.
>> >>>
>> >>> Ning
>> >>>
>> >>>
>> >>> On Sun, Feb 17, 2013 at 7:05 AM, Tatsuo Ishii <ishii at postgresql.org
>> >wrote:
>> >>>
>> >>>> Hi,
>> >>>>
>> >>>> It seems the standby was unable to start up. Can you show standby
>> >>>> PostgreSQL's log? Maybe we could find the cause of the problem.
>> >>>> --
>> >>>> Tatsuo Ishii
>> >>>> SRA OSS, Inc. Japan
>> >>>> English: http://www.sraoss.co.jp/index_en.php
>> >>>> Japanese: http://www.sraoss.co.jp
>> >>>>
>> >>>> > Hi Tatsuo,
>> >>>> > Thank you so much for your reply.
>> >>>> > Actually, in my case, I was using the pcp_recovery_node command and
>> >>>> > executing it on the current primary server.
>> >>>> > However, if the remote node's (the node being recovered) database is off,
>> >>>> > I get the following messages in the primary server's pgpool log:
>> >>>> >
>> >>>> > Jan 31 16:58:10 server0 pgpool[2723]: starting recovery command:
>> >>>> "SELECT
>> >>>> > pgpool_recovery('basebackup.sh', 'server1',
>> '/opt/postgres/9.2/data')"
>> >>>> > Jan 31 16:58:11 server0 pgpool[2723]: 1st stage is done
>> >>>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > (retry 0 times)
>> >>>> > Jan 31 16:58:11 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:13 server0 pgpool[2719]: connection received:
>> >>>> > host=server0.local port=58446
>> >>>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > (retry 1 times)
>> >>>> > Jan 31 16:58:14 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:14 server0 pgpool[2719]: connection received:
>> >>>> > host=server1.local port=39928
>> >>>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > (retry 2 times)
>> >>>> > Jan 31 16:58:17 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > (retry 3 times)
>> >>>> > Jan 31 16:58:20 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:postgres
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:23 server0 pgpool[2719]: connection received:
>> >>>> > host=server0.local port=58464
>> >>>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > (retry 0 times)
>> >>>> > Jan 31 16:58:23 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > (retry 1 times)
>> >>>> > Jan 31 16:58:26 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:26 server0 pgpool[2719]: connection received:
>> >>>> > host=server1.local port=39946
>> >>>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > (retry 2 times)
>> >>>> > Jan 31 16:58:29 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > (retry 3 times)
>> >>>> > Jan 31 16:58:32 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:33 server0 pgpool[2719]: connection received:
>> >>>> > host=server0.local port=58483
>> >>>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > (retry 4 times)
>> >>>> > Jan 31 16:58:35 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started:
>> try to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> > (retry 5 times)
>> >>>> > Jan 31 16:58:38 server0 pgpool[2723]: check_postmaster_started:
>> failed
>> >>>> to
>> >>>> > connect to postmaster on hostname:server1 database:template1
>> >>>> user:postgres
>> >>>> >
>> >>>> > Here is the exact command I execute on server0 to recover server1:
>> >>>> > /usr/local/bin/pcp_recovery_node 10 localhost 9898 pgpool password 1
>> >>>> >
>> >>>> > Do you have any idea why?
>> >>>> >
>> >>>> > Just FYI, we cannot use pgpoolAdmin in our environment.
>> >>>> >
>> >>>> >
>> >>>> > On Sun, Feb 17, 2013 at 12:13 AM, Tatsuo Ishii <
>> ishii at postgresql.org>
>> >>>> wrote:
>> >>>> >
>> >>>> >> > Hi all,
>> >>>> >> > I have the following questions regarding the recovery of a failed
>> >>>> >> > primary database server.
>> >>>> >> >
>> >>>> >> > Question 1: In the documentation, under the Streaming Replication
>> >>>> >> > Online Recovery section.
>> >>>> >> >
>> >>>> >> > http://www.pgpool.net/docs/latest/pgpool-en.html#stream
>> >>>> >> >
>> >>>> >> > in step 6:
>> >>>> >> >
>> >>>> >> >    1. After completing online recovery, pgpool-II will start PostgreSQL on
>> >>>> >> >    the standby node. Install the script for this purpose on each DB nodes.
>> >>>> >> >    Sample script <http://www.pgpool.net/docs/latest/pgpool_remote_start> is
>> >>>> >> >    included in "sample" directory of the source code. This script uses ssh.
>> >>>> >> >    You need to allow recovery_user to login from the primary node to the
>> >>>> >> >    standby node without being asked password.
>> >>>> >> >
>> >>>> >> > To my understanding, PostgreSQL does not need to be online on the
>> >>>> >> > failed node for the recovery process, right? Later on, it mentions that
>> >>>> >> > pgpool_remote_start will start up the DB on the failed node.
>> >>>> >>
>> >>>> >> Actually, the standby PostgreSQL node should not be started.
>> >>>> >>
>> >>>> >> > Question 2: In my configuration, I have 2 pgpool servers with two
>> >>>> >> > backends. Will it work for online recovery?
>> >>>> >>
>> >>>> >> Yes, but the online recovery process should be initiated by one of the
>> >>>> >> pgpools, not both. If you enable pgpool-II 3.2's watchdog, it will take
>> >>>> >> care of the necessary interlocking.
>> >>>> >>
>> >>>> >> > Question 3: When the failed node comes back online, should I use
>> >>>> >> > pcp_recovery_node from the DB primary, or should I use pcp_attach_node
>> >>>> >> > on the failed node to recover the failed system? Actually, in my case,
>> >>>> >> > neither method recovers my system every time.
>> >>>> >>
>> >>>> >> I'm confused. Didn't you start the online recovery process by using
>> >>>> >> pcp_recovery_node? (Of course, you could do it via pgpoolAdmin.)
>> >>>> >>
>> >>>> >> Anyway, pcp_recovery_node automatically attaches the recovered node,
>> >>>> >> and you don't need to execute pcp_attach_node.
>> >>>> >>
>> >>>> >> I suggest you read the tutorial:
>> >>>> >>
>> >>>>
>> http://www.pgpool.net/pgpool-web/contrib_docs/simple_sr_setting2/index.html
>> >>>> >> --
>> >>>> >> Tatsuo Ishii
>> >>>> >> SRA OSS, Inc. Japan
>> >>>> >> English: http://www.sraoss.co.jp/index_en.php
>> >>>> >> Japanese: http://www.sraoss.co.jp
>> >>>> >>
>> >>>>
>> >>>
>> >>>
>> >>
>>
>
>

