[Pgpool-general] pgpool-II and online recovery process

Fri Sep 19 12:31:12 UTC 2008

Cool Thanks I will try that in a bit. Indeed right after I sent the  
previous email I tried pgbench with only 1 client and it did work but  
anything higher did not.
Really appreciate the help here

Marcelo
Linux/Solaris System Administrator
pglists at zeroaccess.org
http://www.zeroaccess.org

On Sep 18, 2008, at 9:51 PM, Tatsuo Ishii wrote:

> First of all you need to set -c 1. This is due to the architecure of
> pgbench (it tries to establish all the connections before proceeding
> in your case 5, which gives pgpool virtually no chance to block new
> connections from clients). Another point is, you might want to give
> some sleep time between pgbench sessions. I use following script for
> demonstration.
>
> \set naccounts 100000 * :scale
> \setrandom delta -5000 5000
> \setrandom aid 1 :naccounts
> UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
> SELECT pg_sleep(0.5);
>
> then run pgbench as follows:
>
> pgbench -d -p 5432 -C -t 10000 -f pgbench.sql test
>
> Hope this helps,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
>
>> Thanks for the suggestion but that did not appear to solve the issue.
>>
>> I started pgpool with the following timout related options hoping it
>> would help out but no go there.
>>
>> child_life_time = 30
>> connection_life_time = 10
>> client_idle_limit = 0
>> recovery_timeout = 1200
>>
>> * I have not set "lient_idle_limit" because that will also close idle
>> in transaction connections and
>> on production I cannot allow that.
>>
>>
>> Basically stage 1 runs just fine and then when it starts stage 2 it
>> takes forever for the checkpoint to start ..
>> The checkpoint itself takes about 3 minutes which I think it's  
>> still a
>> lot (ok that I'm using a VM with not a lot of resources so that may  
>> be
>> misleading)
>>
>>
>> So, again I have started a pgbench process and stopped db3 and then
>> used pcp_recovery_node to have that node recovered.
>>
>>  pgbench -C -h 10.1.100.213 -p 5432 -c 5 -t 200 -U postgres pgbench
>>
>>
>> LOG: db1 backend (where stage1/stage2 scripts run from)
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> = 
>> = 
>> = 
>> =====================================================================
>> ....
>> global/2845
>> global/2846
>> global/2847
>> global/pg_auth
>> global/pg_control
>> global/pg_database
>> global/pgstat.stat
>> pg_clog/
>> pg_clog/0000
>> pg_multixact/
>> pg_multixact/members/
>> pg_multixact/members/0000
>> pg_multixact/offsets/
>> pg_multixact/offsets/0000
>> pg_subtrans/
>> pg_subtrans/0000
>> pg_tblspc/
>> pg_twophase/
>>
>> sent 11648717 bytes  received 16540 bytes  158710.98 bytes/sec
>> total size is 172764067  speedup is 14.81
>> building file list ... done
>> 000000030000000000000048
>>  pg_stop_backup
>> ----------------
>>  0/488C002C
>> (1 row)
>>
>>
>> sent 16779407 bytes  received 42 bytes  4794128.29 bytes/sec
>> total size is 16777216  speedup is 1.00
>> building file list ... done
>> 000000030000000000000048.002F255C.backup
>>
>> sent 414 bytes  received 42 bytes  912.00 bytes/sec
>> total size is 251  speedup is 0.55
>> 2008-09-18 16:26:22 CDT LOG:  checkpoint starting: time
>> 2008-09-18 16:28:22 CDT LOG:  checkpoint complete: wrote 774 buffers
>> (25.2%); 0 transaction log file(s) added, 0 removed, 0 recycled;
>> write=119.383 s, sync=0.209 s, total=119.644 s
>>
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> =
>> = 
>> = 
>> = 
>> =====================================================================
>> * it's now 16:50 PM and stage 2 still seems stuck !
>>
>>
>> So if I do a Ctrl -C on the pgbench process that I started then the
>> CHECKPOINT on stage 2 finishes and then recovery is finished.
>> It looks like for some reason pgpool is not able to close the
>> connections on the backends and therefore stage 2 gets stuck.
>> Am I missing some other timeout  or parameter ?
>>
>>
>> PGpool-II LOG:
>>
>> Sep 18 16:21:55 debian-db1 pgpool: 2008-09-18 16:21:55 LOG:   pid
>> 6876: starting recovering node 2
>> Sep 18 16:22:00 debian-db1 pgpool: 2008-09-18 16:22:00 LOG:   pid
>> 6876: CHECKPOINT in the 1st stage done
>> Sep 18 16:22:00 debian-db1 pgpool: 2008-09-18 16:22:00 LOG:   pid
>> 6876: starting recovery command: "SELECT pgpool_recovery('copy-base-
>> backup', 'debian-db4', '/var/lib/po
>> stgresql/8.3/main')"
>> Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid
>> 6876: 1st stage is done
>> Sep 18 16:23:34 debian-db1 pgpool: 2008-09-18 16:23:34 LOG:   pid
>> 6876: starting 2nd stage
>> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid
>> 6958: ProcessFrontendResponse: failed to read kind from frontend.
>> frontend abnormally exited
>> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid
>> 6955: ProcessFrontendResponse: failed to read kind from frontend.
>> frontend abnormally exited
>> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid
>> 6919: ProcessFrontendResponse: failed to read kind from frontend.
>> frontend abnormally exited
>> Sep 18 16:41:19 debian-db1 pgpool: 2008-09-18 16:41:19 LOG:   pid
>> 6908: ProcessFrontendResponse: failed to read kind from frontend.
>> frontend abnormally exited
>> Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid
>> 6876: all connections from clients have been closed
>> Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid
>> 6876: CHECKPOINT in the 2nd stage done
>> Sep 18 16:41:21 debian-db1 pgpool: 2008-09-18 16:41:21 LOG:   pid
>> 6876: starting recovery command: "SELECT
>> pgpool_recovery('pgpool_recovery_pitr', 'debian-db4', '/var/lib/
>> postgresql/8.3/main')"
>> Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid
>> 6876: 2 node restarted
>> Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid
>> 6876: send_failback_request: fail back 2 th node request from pid  
>> 6876
>> Sep 18 16:41:38 debian-db1 pgpool: 2008-09-18 16:41:38 LOG:   pid
>> 6876: recovery done
>>
>>
>>
>> Any help much appreciated
>>
>> Marcelo
>> Linux/Solaris System Administrator
>> http://www.zeroaccess.org
>>
>>
>> On Sep 10, 2008, at 8:34 PM, Tatsuo Ishii wrote:
>>
>>>> Hi,
>>>>
>>>> I have recently setup a VM environment to test out an online  
>>>> recovery
>>>> process which works great. Basically I have the following.
>>>>
>>>> pgpool-II 2.1 server:
>>>> 1 - pgpool
>>>>
>>>> 3 postgresql 8.3.3 servers:
>>>> 1 - db1  (backend_node0)
>>>> 2 - db2  (backend_node1)
>>>> 3 - db3  (backend_node2)
>>>>
>>>> The 3rd PG server is kept in detached status so that I can use it  
>>>> for
>>>> online recovery tests.
>>>> I have created the scripts for 1st/2nd stage and also the
>>>> pg_remote_start script.
>>>> I can use pcp_recovery_node to bring a new node online (3rd PG
>>>> server)
>>>> or recover one of the existing ones without any issues.
>>>>
>>>> Now, until today all my previous tests on performing online  
>>>> recovery
>>>> involved calling pcp_recovery_node and during that time no clients
>>>> were using the database servers through pgpool of course. There was
>>>> no
>>>> activity  going on on the database servers at all. So everything
>>>> worked great, pgpool went through 1st and 2nd stage and then called
>>>> the remote start and finished the recovery process. So server got
>>>> online and sync'd.
>>>>
>>>> Then, today I tried to do the same but right before calling
>>>> pcp_recovery_node I started a pgbench process pointing to the  
>>>> pgpool
>>>> server to create some activity on the database. I was under the
>>>> impression that during 2nd stage pgpool would perhaps start to  
>>>> queue
>>>> some of the transactions along with not allowing new clients to
>>>> connect to it. Allowing the 2nd stage to occur and then bring the  
>>>> new
>>>> node online. Once the online recovery was finished, pgpool would go
>>>> through its queue and send those transactions to all nodes.
>>>> Is that not the case ? Cause basically it went through 1st stage  
>>>> and
>>>> then pcp_recovery_node timed out.
>>>
>>> Perhaps you need -C option to pgbench?
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>