[Pgpool-general] pgpool-II and online recovery process

Fri Sep 19 02:51:27 UTC 2008

First of all you need to set -c 1. This is due to the architecure of
pgbench (it tries to establish all the connections before proceeding
in your case 5, which gives pgpool virtually no chance to block new
connections from clients). Another point is, you might want to give
some sleep time between pgbench sessions. I use following script for
demonstration.

\set naccounts 100000 * :scale
\setrandom delta -5000 5000
\setrandom aid 1 :naccounts
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT pg_sleep(0.5);

then run pgbench as follows:

pgbench -d -p 5432 -C -t 10000 -f pgbench.sql test

Hope this helps,
--
Tatsuo Ishii
SRA OSS, Inc. Japan

> Thanks for the suggestion but that did not appear to solve the issue.
> 
> I started pgpool with the following timout related options hoping it  
> would help out but no go there.
> 
> child_life_time = 30
> connection_life_time = 10
> client_idle_limit = 0
> recovery_timeout = 1200
> 
> * I have not set "lient_idle_limit" because that will also close idle  
> in transaction connections and
> on production I cannot allow that.
> 
> 
> Basically stage 1 runs just fine and then when it starts stage 2 it  
> takes forever for the checkpoint to start ..
> The checkpoint itself takes about 3 minutes which I think it's still a  
> lot (ok that I'm using a VM with not a lot of resources so that may be  
> misleading)
> 
> 
> So, again I have started a pgbench process and stopped db3 and then  
> used pcp_recovery_node to have that node recovered.
> 
>   pgbench -C -h 10.1.100.213 -p 5432 -c 5 -t 200 -U postgres pgbench
> 
> 
> LOG: db1 backend (where stage1/stage2 scripts run from)
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> ========================================================================
> ....
> global/2845
> global/2846
> global/2847
> global/pg_auth
> global/pg_control
> global/pg_database
> global/pgstat.stat
> pg_clog/
> pg_clog/0000
> pg_multixact/
> pg_multixact/members/
> pg_multixact/members/0000
> pg_multixact/offsets/
> pg_multixact/offsets/0000
> pg_subtrans/
> pg_subtrans/0000
> pg_tblspc/
> pg_twophase/
> 
> sent 11648717 bytes  received 16540 bytes  158710.98 bytes/sec
> total size is 172764067  speedup is 14.81
> building file list ... done
> 000000030000000000000048
>   pg_stop_backup
> ----------------
>   0/488C002C
> (1 row)
> 
> 
> sent 16779407 bytes  received 42 bytes  4794128.29 bytes/sec
> total size is 16777216  speedup is 1.00
> building file list ... done
> 000000030000000000000048.002F255C.backup
> 
> sent 414 bytes  received 42 bytes  912.00 bytes/sec
> total size is 251  speedup is 0.55
> 2008-09-18 16:26:22 CDT LOG:  checkpoint starting: time
> 2008-09-18 16:28:22 CDT LOG:  checkpoint complete: wrote 774 buffers  
> (25.2%); 0 transaction log file(s) added, 0 removed, 0 recycled;  
> write=119.383 s, sync=0.209 s, total=119.644 s
> 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> = 
> ========================================================================
> * it's now 16:50 PM and stage 2 still seems stuck !
> 
> 
> So if I do a Ctrl -C on the pgbench process that I started then the  
> CHECKPOINT on stage 2 finishes and then recovery is finished.
> It looks like for some reason pgpool is not able to close the  
> connections on the backends and therefore stage 2 gets stuck.
> Am I missing some other timeout  or parameter ?
> 
> 
> PGpool-II LOG:
> 
> Sep 18 16:21:55 debian-db1 pgpool: 2008-09-18 16:21:55 LOG:   pid  
> 6876: starting recovering node 2
> Sep 18 16:22:00 debian-db1 pgpool: 2008-09-18 16:22:00 LOG:   pid  
> 6876: CHECKPOINT in the 1st stage done
> Sep 18 16:22:00 debian-db1 pgpool: 2008-09-18 16:22:00 LOG:   pid  
> 6876: starting recovery command: "SELECT pgpool_recovery('copy-base- 
> backup', 'debian-db4', '/var/lib/po
> stgresql/8.3/main')"
> Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid  
> 6876: 1st stage is done
> Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid  
> 6876: starting 2nd stage
> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid  
> 6958: ProcessFrontendResponse: failed to read kind from frontend.  
> frontend abnormally exited
> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid  
> 6955: ProcessFrontendResponse: failed to read kind from frontend.  
> frontend abnormally exited
> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid  
> 6919: ProcessFrontendResponse: failed to read kind from frontend.  
> frontend abnormally exited
> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid  
> 6908: ProcessFrontendResponse: failed to read kind from frontend.  
> frontend abnormally exited
> Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid  
> 6876: all connections from clients have been closed
> Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid  
> 6876: CHECKPOINT in the 2nd stage done
> Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid  
> 6876: starting recovery command: "SELECT  
> pgpool_recovery('pgpool_recovery_pitr', 'debian-db4', '/var/lib/ 
> postgresql/8.3/main')"
> Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid  
> 6876: 2 node restarted
> Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid  
> 6876: send_failback_request: fail back 2 th node request from pid 6876
> Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid  
> 6876: recovery done
> 
> 
> 
> Any help much appreciated
> 
> Marcelo
> Linux/Solaris System Administrator
> http://www.zeroaccess.org
> 
> 
> On Sep 10, 2008, at 8:34 PM, Tatsuo Ishii wrote:
> 
> >> Hi,
> >>
> >> I have recently setup a VM environment to test out an online recovery
> >> process which works great. Basically I have the following.
> >>
> >> pgpool-II 2.1 server:
> >> 1 - pgpool
> >>
> >> 3 postgresql 8.3.3 servers:
> >> 1 - db1  (backend_node0)
> >> 2 - db2  (backend_node1)
> >> 3 - db3  (backend_node2)
> >>
> >> The 3rd PG server is kept in detached status so that I can use it for
> >> online recovery tests.
> >> I have created the scripts for 1st/2nd stage and also the
> >> pg_remote_start script.
> >> I can use pcp_recovery_node to bring a new node online (3rd PG  
> >> server)
> >> or recover one of the existing ones without any issues.
> >>
> >> Now, until today all my previous tests on performing online recovery
> >> involved calling pcp_recovery_node and during that time no clients
> >> were using the database servers through pgpool of course. There was  
> >> no
> >> activity  going on on the database servers at all. So everything
> >> worked great, pgpool went through 1st and 2nd stage and then called
> >> the remote start and finished the recovery process. So server got
> >> online and sync'd.
> >>
> >> Then, today I tried to do the same but right before calling
> >> pcp_recovery_node I started a pgbench process pointing to the pgpool
> >> server to create some activity on the database. I was under the
> >> impression that during 2nd stage pgpool would perhaps start to queue
> >> some of the transactions along with not allowing new clients to
> >> connect to it. Allowing the 2nd stage to occur and then bring the new
> >> node online. Once the online recovery was finished, pgpool would go
> >> through its queue and send those transactions to all nodes.
> >> Is that not the case ? Cause basically it went through 1st stage and
> >> then pcp_recovery_node timed out.
> >
> > Perhaps you need -C option to pgbench?
> > --
> > Tatsuo Ishii
> > SRA OSS, Inc. Japan
>