[pgpool-general: 2395] native replication PITR problems

Videanu Adrian videanuadrian at yahoo.com
Fri Jan 10 19:21:21 JST 2014


Hello, 

I have a pgpool 3.3.2 cluster using native replication with two PostgreSQL 9.2 nodes, with online recovery using PITR.

The problem is that from time to time one of the nodes gets disconnected (I do not know why, because the load is very low and the machines are in the same subnet). When I try to recover it with the pgpool-admin recovery button, the recovery process apparently freezes after the first stage and nothing happens. During this time pgpool cannot be accessed; my guess is that the connections are made but pgpool is waiting for something.

Active pgpool machine logs:


// the first PostgreSQL node is declared dead (I have no idea why; how can I debug this kind of issue?)

Jan 10 09:36:51 pgpool133 pgpool[26722]: wd_send_response: WD_STAND_FOR_LOCK_HOLDER received it
Jan 10 09:36:51 pgpool133 pgpool[26722]: degenerate_backend_set: 0 fail over request from pid 26722
Jan 10 09:36:51 pgpool133 pgpool[26703]: wd_start_interlock: start interlocking
Jan 10 09:36:53 pgpool133 pgpool[26703]: starting degeneration. shutdown host 192.168.91.33(5432)
Jan 10 09:36:53 pgpool133 pgpool[26703]: Restart all children
Jan 10 09:37:00 pgpool133 pgpool[26703]: wd_end_interlock: end interlocking
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover: set new primary node: -1
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover: set new master node: 1
Jan 10 09:37:01 pgpool133 pgpool[26703]: failover done. shutdown host 192.168.91.33(5432)
Jan 10 09:37:01 pgpool133 pgpool[27029]: worker process received restart request
Jan 10 09:37:02 pgpool133 pgpool[27028]: pcp child process received restart request
Jan 10 09:37:02 pgpool133 pgpool[26703]: PCP child 27028 exits with status 256 in failover()
Jan 10 09:37:02 pgpool133 pgpool[26703]: fork a new PCP child pid 32576 in failover()
Jan 10 09:37:02 pgpool133 pgpool[26703]: worker child 27029 exits with status 256
Jan 10 09:37:02 pgpool133 pgpool[26703]: fork a new worker child pid 32577
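
One low-tech starting point for debugging the disconnects is to count how often each backend gets degenerated, so the timestamps can then be compared with the PostgreSQL and system logs on that node. This is only a minimal sketch: the function name `degenerated_hosts` is made up for illustration, and the pattern is simply the "starting degeneration" message text from the excerpt above. It does not explain why a node was declared dead, it only shows how often it happened and to which host.

```shell
# Count backend degeneration events per host in a pgpool log read from stdin.
# Matches the "starting degeneration. shutdown host <host>(<port>)" lines
# shown in the excerpt above; the host is the last field on such lines.
degenerated_hosts() {
    awk '/starting degeneration\. shutdown host/ { count[$NF]++ }
         END { for (h in count) print count[h], h }'
}
```

For example, `degenerated_hosts < /var/log/syslog` (the log path is an assumption; use wherever pgpool actually logs on the machine).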

 

Before starting the recovery process I deleted everything in the archive directory and in the data directory on the node that was about to be recovered.

// start the recovery process
Jan 10 09:43:07 pgpool133 pgpool[32576]: starting recovering node 0
Jan 10 09:43:08 pgpool133 pgpool[32576]: CHECKPOINT in the 1st stage done
Jan 10 09:43:08 pgpool133 pgpool[32576]: starting recovery command: "SELECT pgpool_recovery('basebackup.sh', '192.168.91.33', '/var/lib/postgresql/9.2/data')"
Jan 10 09:43:22 pgpool133 pgpool[32576]: 1st stage is done
Jan 10 09:43:22 pgpool133 pgpool[32576]: starting 2nd stage
... after that nothing happens
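
To show exactly where a recovery attempt stops, the log can be filtered for the stage messages seen above. The following is a minimal sketch, not part of my setup: the function name `last_recovery_stage` is hypothetical, and the patterns are just the message texts from this excerpt, so they may need adjusting for other pgpool versions.

```shell
# Print the last online-recovery milestone found in a pgpool log read from stdin.
# The alternatives are the message texts from the log excerpt above.
last_recovery_stage() {
    grep -E -o 'starting recovering node [0-9]+|CHECKPOINT in the 1st stage done|1st stage is done|starting 2nd stage' \
        | tail -n 1
}
```

Run as `last_recovery_stage < /var/log/syslog` (log path assumed). For the attempt above it would print "starting 2nd stage", which matches the point where nothing more is logged.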


Online PostgreSQL node logs:
+ DATA=/var/lib/postgresql/9.2/data
+ RECOVERY_TARGET=192.168.91.33
+ RECOVERY_DATA=/var/lib/postgresql/9.2/data
+ ARCHIVE_DIR=/var/lib/postgresql/9.2/archive
+ psql -c 'SELECT pg_start_backup('\''pgpoo-recovery'\'')' postgres
 pg_start_backup
-----------------
 1/36000020
(1 row)

+ rsync -C -a -c -e 'ssh -p 2022' --delete --exclude postmaster.log --exclude postmaster.pid --exclude postmaster.opts --exclude pg_log --exclude recovery.conf --
+ cat
+ scp -P 2022 recovery.conf 192.168.91.33:/var/lib/postgresql/9.2/data/
+ rm -f recovery.conf
+ psql -c 'SELECT pg_stop_backup()' postgres
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived
 pg_stop_backup
----------------
 1/360000E0
(1 row)

P.S. I have experienced this kind of problem in the past, but recovery worked if I tried multiple times. Now it seems that it doesn't want to work anymore :)

............
Just as I was writing this email, the second PostgreSQL node (the last one up) was declared down, and pgpool stopped accepting connections because no backend was online. After a complete shutdown of both PostgreSQL servers and the pgpool servers, I could recover node 1 as well.


I have attached my relevant conf files.



Regards,
Adrian Videanu
-------------- next part --------------
A non-text attachment was scrubbed...
Name: basebackup.sh
Type: text/x-sh
Size: 749 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20140110/3fb73168/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool.conf
Type: application/octet-stream
Size: 30838 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20140110/3fb73168/attachment-0003.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool_recovery_pitr.sh
Type: text/x-sh
Size: 1035 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20140110/3fb73168/attachment-0003.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: pgpool_remote_start
Type: application/octet-stream
Size: 343 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20140110/3fb73168/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: postgresql.conf
Type: application/octet-stream
Size: 19717 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20140110/3fb73168/attachment-0005.obj>

