[pgpool-general: 2477] Re: native replication PITR problems

Fri Jan 24 03:17:19 JST 2014

Hi all, 

Just some clarifications:

So, what I understand from here is that when I perform online recovery I
should stop the stand-by pgpool server. Also, on the primary server there
should be no connections left open, and until the recovery is performed,
no other connections will be opened. The entire cluster will basically be
down while the 2nd stage of the recovery process is running.
Are these assumptions correct?

Regards,
Adrian Videanu

________________________________
 From: Videanu Adrian <videanuadrian at yahoo.com>
To: Videanu Adrian <videanuadrian at yahoo.com>; "pgpool-general at pgpool.net" <pgpool-general at pgpool.net> 
Sent: Saturday, January 11, 2014 10:18 AM
Subject: Re: [pgpool-general: 2395] native replication PITR problems

Hi all,

After further reading of the pgpool tutorial I think I found the problem, but I want to know whether I understood this correctly:

"Data synchronization is finalized during what is called "second stage".
Before entering the second stage, pgpool-II waits until all clients have disconnected.
It blocks any new incoming connection until the second stage is over. 
After all connections have terminated, pgpool-II merges updated data between
the first stage and the second stage. This is the final data
synchronization step. 
Note that there is a restriction about online recovery. If pgpool-II itself
is installed on multiple hosts, online recovery does not work correctly,
because pgpool-II has to stop all clients during the 2nd stage of
online recovery. If there are several pgpool hosts, only one will have received
the online recovery command and will block connections. "  
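
The wait described in the quoted passage can be observed, roughly, by listing the sessions still open on the backend that is up. The following is a minimal sketch, not something taken from the pgpool documentation. It assumes PostgreSQL 9.2 column names in pg_stat_activity (pid, state), a local login as a user allowed to see all sessions, and the `postgres` database used by the recovery script trace later in this thread. The function names are illustrative only.

```shell
#!/bin/sh
# List sessions on the surviving backend, excluding this psql session itself.
# Assumption: PostgreSQL 9.2 column names (pid, state); older releases differ.
list_backend_sessions() {
    psql -At -F '|' \
        -c "SELECT pid, usename, client_addr, state FROM pg_stat_activity WHERE pid <> pg_backend_pid()" \
        postgres
}

# Count the sessions whose state is anything other than plain "idle",
# i.e. sessions that are running a query or sitting inside a transaction.
count_busy_sessions() {
    list_backend_sessions | awk -F '|' '$4 != "idle" { n++ } END { print n + 0 }'
}
```

Sessions relayed through pgpool-II show up with the pgpool host as client_addr, and connections cached by pgpool may stay idle on the backend after the client has gone, so this is only a rough indicator of what the second stage is waiting for.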

So, what I understand from here is that when I perform online recovery I should stop the stand-by pgpool server. Also, on the primary server there should be no connections left open, and until the recovery is performed, no other connections will be opened. The entire cluster will basically be down while the 2nd stage of the recovery process is running.
Are these assumptions correct?
Also, it sometimes happens that my standby pgpool node detects one PostgreSQL backend as down and degenerates it.
My question here is: why would the standby pgpool node detach PostgreSQL backends when it is the STAND-BY and not the ACTIVE node?

Regards,
Adrian Videanu

________________________________
 From: Videanu Adrian <videanuadrian at yahoo.com>
To: "pgpool-general at pgpool.net" <pgpool-general at pgpool.net> 
Sent: Friday, January 10, 2014 12:21 PM
Subject: [pgpool-general: 2395] native replication PITR problems

Hello, 

I have a pgpool 3.3.2 cluster using native replication with 2 PostgreSQL 9.2 nodes, with online recovery using PITR.

The problem is that from time to time one of the nodes gets disconnected (I do not know why, because the load is very low and the machines are in the same subnet), and when I try to recover it with the pgpool-admin recovery button, the recovery process apparently freezes after the first stage and nothing happens. During this time pgpool cannot be accessed; I guess the connections are made but pgpool somehow waits for something.

Active Pgpool machine logs :

// The first PostgreSQL node is declared dead (I have no idea why; how can I debug this kind of issue?)

Jan 10 09:36:51 pgpool133 pgpool[26722]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
Jan 10 09:36:51 pgpool133 pgpool[26722]: degenerate_backend_set: 0 fail over request from pid 26722
Jan 10 09:36:51 pgpool133 pgpool[26703]: wd_start_interlock: start interlocking
Jan 10 09:36:53 pgpool133 pgpool[26703]: starting degeneration. shutdown host 192.168.91.33(5432)
Jan 10 09:36:53 pgpool133 pgpool[26703]: Restart all children
Jan 10 09:37:00 pgpool133 pgpool[26703]: wd_end_interlock: end interlocking
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover: set new primary node: -1
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover: set new master node: 1
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover done. shutdown host 192.168.91.33(5432)
Jan 10 09:37:01 pgpool133 pgpool[27029]: worker process received restart request
Jan 10 09:37:02 pgpool133 pgpool[27028]: pcp child process received restart request
Jan 10 09:37:02 pgpool133 pgpool[26703]: PCP child 27028 exits with status 256 in failover()
Jan 10 09:37:02 pgpool133 pgpool[26703]: fork a new PCP child pid 32576 in failover()
Jan 10 09:37:02 pgpool133 pgpool[26703]: worker child 27029 exits with status 256
Jan 10 09:37:02 pgpool133 pgpool[26703]: fork a new worker child pid 32577
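
As a starting point for debugging why a backend was detached, the failover-related lines can be pulled out of the pgpool log so that the messages around them are easier to find. This is only a sketch that assumes log lines in the format shown above. The function names are illustrative, and the log file path in the usage line is a guess that depends on the local syslog setup.

```shell
#!/bin/sh
# Print only the degeneration/failover lines from pgpool log text on stdin.
failover_events() {
    grep -E 'degenerate_backend_set|starting degeneration|failover'
}

# Print the backend node ids named in "degenerate_backend_set" requests.
degenerated_nodes() {
    sed -n 's/.*degenerate_backend_set: \([0-9][0-9]*\) fail over request.*/\1/p'
}

# Usage (path is an assumption):
#   failover_events < /var/log/syslog
```

Once the timestamp of a degenerate_backend_set request is known, the log lines from the same pgpool pid just before it are the place to look for the error that triggered the request.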

Before starting the recovery process I deleted everything in the archive directory and in the data directory on the node that was about to be recovered.

// start the recovery process
Jan 10 09:43:07 pgpool133 pgpool[32576]: starting recovering node 0
Jan 10 09:43:08 pgpool133 pgpool[32576]: CHECKPOINT in the 1st stage done
Jan 10 09:43:08 pgpool133 pgpool[32576]: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', '192.168.91.33', '/var/lib/postgresql/9.2/data')"
Jan 10 09:43:22 pgpool133 pgpool[32576]: 1st stage is done
Jan 10 09:43:22 pgpool133 pgpool[32576]: starting 2nd stage
... after that nothing happens
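
To confirm where a recovery attempt stopped, the last recovery milestone can be extracted from the log. A small sketch follows, assuming the message texts shown in the excerpt above; the function name is illustrative.

```shell
#!/bin/sh
# Print the most recent online-recovery milestone found in pgpool log text
# on stdin, with the syslog prefix ("... pgpool[pid]: ") stripped.
last_recovery_milestone() {
    grep -E 'starting recovering node|CHECKPOINT in the 1st stage done|1st stage is done|starting 2nd stage' \
        | tail -n 1 \
        | sed 's/^.*pgpool\[[0-9]*]: //'
}
```

If the result stays at "starting 2nd stage", that matches the behaviour described in the tutorial, where pgpool-II waits for all clients to disconnect before the second stage proceeds. pgpool.conf has a client_idle_limit_in_recovery parameter meant for that situation; whether it is set in the configuration attached to this message is not visible here.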

Online postgresql node logs : 
+ DATA=/var/lib/postgresql/9.2/data
+ RECOVERY_TARGET=192.168.91.33
+ RECOVERY_DATA=/var/lib/postgresql/9.2/data
+ ARCHIVE_DIR=/var/lib/postgresql/9.2/archive
+ psql -c 'SELECT pg_start_backup('\''pgpoo-recovery'\'')' postgres
 pg_start_backup
-----------------
 1/36000020
(1 row)

+ rsync -C -a -c -e 'ssh -p 2022' --delete --exclude postmaster.log --exclude postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude recovery.conf --
+ cat
+ scp -P 2022 recovery.conf 192.168.91.33:/var/lib/postgresql/9.2/data/
+ rm -f recovery.conf
+ psql -c 'SELECT pg_stop_backup()' postgres
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
 pg_stop_backup
----------------
 1/360000E0
(1 row)

P.S. I have experienced this kind of problem in the past, but it worked after I retried several times. Now it seems that it doesn't want to work anymore :)

............
Just as I was writing this email, the second Postgres node (the last one up) was declared down and pgpool stopped accepting connections because no backend was online. After a complete shutdown of both the PostgreSQL servers and the pgpool servers I was able to recover node 1 as well....

I have attached my relevant conf files.

Regards,
Adrian Videanu