[Pgpool-general] pgpool-II and online recovery process

Thu Sep 18 21:52:22 UTC 2008

Thanks for the suggestion but that did not appear to solve the issue.

I started pgpool with the following timeout-related options, hoping
they would help, but no luck there.

child_life_time = 30
connection_life_time = 10
client_idle_limit = 0
recovery_timeout = 1200

* I have not set "client_idle_limit" because that would also close
idle-in-transaction connections, and on production I cannot allow that.

Basically, stage 1 runs just fine, but when stage 2 starts it takes
forever for the checkpoint to start.
The checkpoint itself takes about 3 minutes, which I think is still a
lot (granted, I'm using a VM without a lot of resources, so that may be
misleading).

So, again I have started a pgbench process and stopped db3 and then  
used pcp_recovery_node to have that node recovered.

  pgbench -C -h 10.1.100.213 -p 5432 -c 5 -t 200 -U postgres pgbench

LOG: db1 backend (where stage1/stage2 scripts run from)
========================================================================
....
global/2845
global/2846
global/2847
global/pg_auth
global/pg_control
global/pg_database
global/pgstat.stat
pg_clog/
pg_clog/0000
pg_multixact/
pg_multixact/members/
pg_multixact/members/0000
pg_multixact/offsets/
pg_multixact/offsets/0000
pg_subtrans/
pg_subtrans/0000
pg_tblspc/
pg_twophase/

sent 11648717 bytes  received 16540 bytes  158710.98 bytes/sec
total size is 172764067  speedup is 14.81
building file list ... done
000000030000000000000048
  pg_stop_backup
----------------
  0/488C002C
(1 row)

sent 16779407 bytes  received 42 bytes  4794128.29 bytes/sec
total size is 16777216  speedup is 1.00
building file list ... done
000000030000000000000048.002F255C.backup

sent 414 bytes  received 42 bytes  912.00 bytes/sec
total size is 251  speedup is 0.55
2008-09-18 16:26:22 CDT LOG:  checkpoint starting: time
2008-09-18 16:28:22 CDT LOG:  checkpoint complete: wrote 774 buffers (25.2%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=119.383 s, sync=0.209 s, total=119.644 s

========================================================================
* It's now 16:50 and stage 2 still seems stuck!

If I press Ctrl-C on the pgbench process that I started, the
CHECKPOINT in stage 2 finishes and then recovery completes.
It looks like, for some reason, pgpool is not able to close the
connections on the backends, and therefore stage 2 gets stuck.
Am I missing some other timeout or parameter?

PGpool-II LOG:

Sep 18 16:21:55 debian-db1 pgpool: 2008-09-18 16:21:55 LOG:   pid 6876: starting recovering node 2
Sep 18 16:22:00 debian-db1 pgpool: 2008-09-18 16:22:00 LOG:   pid 6876: CHECKPOINT in the 1st stage done
Sep 18 16:22:00 debian-db1 pgpool: 2008-09-18 16:22:00 LOG:   pid 6876: starting recovery command: "SELECT pgpool_recovery('copy-base-backup', 'debian-db4', '/var/lib/postgresql/8.3/main')"
Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid 6876: 1st stage is done
Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid 6876: starting 2nd stage
Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid 6958: ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited
Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid 6955: ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited
Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid 6919: ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited
Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid 6908: ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited
Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid 6876: all connections from clients have been closed
Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid 6876: CHECKPOINT in the 2nd stage done
Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid 6876: starting recovery command: "SELECT pgpool_recovery('pgpool_recovery_pitr', 'debian-db4', '/var/lib/postgresql/8.3/main')"
Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid 6876: 2 node restarted
Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid 6876: send_failback_request: fail back 2 th node request from pid 6876
Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid 6876: recovery done
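
To put a number on how long stage 2 sat waiting, here is a rough Python sketch that reads the two relevant pgpool log lines and computes the gap between "starting 2nd stage" and "all connections from clients have been closed". It is only an illustration: the helper names (find_timestamp, stage2_wait_seconds) are made up for this example, and it assumes the wrapped log lines have been joined back into single lines in the syslog format shown above.

```python
from datetime import datetime

# Excerpt of the pgpool-II log above, with wrapped lines joined (assumption).
LOG = """\
Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid 6876: starting 2nd stage
Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid 6958: ProcessFrontendResponse: failed to read kind from frontend. frontend abnormally exited
Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid 6876: all connections from clients have been closed
"""


def find_timestamp(log, message):
    """Return the pgpool timestamp of the first line containing `message`.

    Hypothetical helper: it relies on the "pgpool: YYYY-MM-DD HH:MM:SS"
    layout seen in the log above.
    """
    for line in log.splitlines():
        if message in line:
            # The pgpool timestamp follows "pgpool: " and is 19 characters long.
            stamp = line.split("pgpool: ", 1)[1][:19]
            return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
    raise ValueError(f"message not found: {message!r}")


def stage2_wait_seconds(log):
    """Seconds between the start of stage 2 and all clients disconnecting."""
    start = find_timestamp(log, "starting 2nd stage")
    closed = find_timestamp(log, "all connections from clients have been closed")
    return (closed - start).total_seconds()


if __name__ == "__main__":
    print(stage2_wait_seconds(LOG))
```

Running it against the excerpt gives the time pgpool spent waiting for client connections to go away before the 2nd stage CHECKPOINT could run.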

Any help is much appreciated.

Marcelo
Linux/Solaris System Administrator
http://www.zeroaccess.org

On Sep 10, 2008, at 8:34 PM, Tatsuo Ishii wrote:

>> Hi,
>>
>> I have recently setup a VM environment to test out an online recovery
>> process which works great. Basically I have the following.
>>
>> pgpool-II 2.1 server:
>> 1 - pgpool
>>
>> 3 postgresql 8.3.3 servers:
>> 1 - db1  (backend_node0)
>> 2 - db2  (backend_node1)
>> 3 - db3  (backend_node2)
>>
>> The 3rd PG server is kept in detached status so that I can use it for
>> online recovery tests.
>> I have created the scripts for 1st/2nd stage and also the
>> pg_remote_start script.
>> I can use pcp_recovery_node to bring a new node online (3rd PG server)
>> or recover one of the existing ones without any issues.
>>
>> Now, until today all my previous tests on performing online recovery
>> involved calling pcp_recovery_node and during that time no clients
>> were using the database servers through pgpool of course. There was no
>> activity going on on the database servers at all. So everything
>> worked great, pgpool went through 1st and 2nd stage and then called
>> the remote start and finished the recovery process. So server got
>> online and sync'd.
>>
>> Then, today I tried to do the same but right before calling
>> pcp_recovery_node I started a pgbench process pointing to the pgpool
>> server to create some activity on the database. I was under the
>> impression that during 2nd stage pgpool would perhaps start to queue
>> some of the transactions along with not allowing new clients to
>> connect to it. Allowing the 2nd stage to occur and then bring the new
>> node online. Once the online recovery was finished, pgpool would go
>> through its queue and send those transactions to all nodes.
>> Is that not the case ? Cause basically it went through 1st stage and
>> then pcp_recovery_node timed out.
>
> Perhaps you need -C option to pgbench?
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan