[Pgpool-general] Problem with failback
Daniel Codina
dcodina at laigu.net
Wed Apr 21 16:31:26 UTC 2010
Hi there,
I think I found something strange. Again, it could be that I am doing something
wrong, but I don't see where.
My problem is with the failback.
OK, starting from the beginning.
I have two PostgreSQL nodes and a pgpool working perfectly, with no failover or
failback commands configured.
I have a script which keeps retrying pcp_recovery_node until it notices the
node is up again.
If I stop one of the nodes, pgpool notices it as it should. I launch my
script and it works perfectly; after a few seconds I have both PostgreSQL
nodes up and running again.
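The real failover_loop.sh is not shown here, but from the ps output below (pcp_node_info piped to awk, sleep 5) a hypothetical reconstruction of such a retry loop might look like this; the host, timeouts, and MAX_TRIES cap are all assumptions:

```shell
#!/bin/bash
# Hypothetical sketch of a failover_loop.sh-style retry script.
# It keeps calling pcp_recovery_node until pcp_node_info reports
# the node as up again.
PCP_HOST=${PCP_HOST:-192.168.1.33}   # host with pgpool's PCP port (from the ps output)
PCP_PORT=${PCP_PORT:-9898}

node_is_up() {
    # The third field of pcp_node_info output is the node status:
    # 1 or 2 = up, 3 = down.
    local status
    status=$(pcp_node_info 5 "$PCP_HOST" "$PCP_PORT" postgres postgres "$1" 2>/dev/null | awk '{print $3}')
    [ "$status" = "1" ] || [ "$status" = "2" ]
}

recover_node() {
    local tries=0
    until node_is_up "$1"; do
        tries=$((tries + 1))
        # Give up after a while so a hopeless node does not loop forever.
        [ "$tries" -gt "${MAX_TRIES:-60}" ] && return 1
        pcp_recovery_node 20 "$PCP_HOST" "$PCP_PORT" postgres postgres "$1"
        sleep "${RETRY_INTERVAL:-5}"
    done
}

# Example: recover pgpool node 1 (matches the node id in the log below).
# recover_node 1
```

The cap on retries is only there so the loop terminates if the node never comes back; the script described in the post apparently retries until success.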
Now I want failover_command to execute it, so I put that script in place
with failover_command = 'script'.
In that case, just after a node fails, pgpool should try to recover it
(through my script).
I try it. It starts the recovery, and it does it perfectly.
The problem is that pgpool suddenly stops... completely stops.
The trace pgpool in debug mode gives me is:
------------------------------
2010-04-21 12:30:05 ERROR: pid 1886: connect_inet_domain_socket: connect()
failed: Connection refused
2010-04-21 12:30:05 ERROR: pid 1886: health check failed. 1 th host
192.168.1.34 at port 5432 is down
2010-04-21 12:30:05 LOG: pid 1886: set 1 th backend down status
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler called
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: starting to select
new master node
2010-04-21 12:30:05 LOG: pid 1886: starting degeneration. shutdown host
192.168.1.34(5432)
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1887
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1888
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1889
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1890
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1891
2010-04-21 12:30:05 LOG: pid 1886: execute command:
/etc/pgpool-II/scripts/failover_loop.sh 0 1 192.168.1.34 &>
/tmp/failover_recovery.log
2010-04-21 12:30:05 DEBUG: pid 1887: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1888: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1889: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1890: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1891: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1892: I am PCP 1892 accept fd 5
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'M'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: salt sent to the client
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'R'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: authentication OK
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'I'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: retrieved node information
from shared memory
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'X'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: client disconnecting. close
connection
2010-04-21 12:30:05 DEBUG: pid 1892: I am PCP 1892 accept fd 5
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'M'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: salt sent to the client
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'R'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: authentication OK
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'O'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: start online recovery
2010-04-21 12:30:05 LOG: pid 1892: starting recovering node 1
2010-04-21 12:30:05 DEBUG: pid 1892: exec_checkpoint: start checkpoint
2010-04-21 12:30:05 DEBUG: pid 1892: exec_checkpoint: finish checkpoint
2010-04-21 12:30:05 LOG: pid 1892: CHECKPOINT in the 1st stage done
2010-04-21 12:30:05 LOG: pid 1892: starting recovery command: "SELECT
pgpool_recovery('base-backup', '192.168.1.34', '/var/lib/pgsql/data')"
2010-04-21 12:30:05 DEBUG: pid 1892: exec_recovery: start recovery
2010-04-21 12:30:08 DEBUG: pid 1892: exec_recovery: finish recovery
2010-04-21 12:30:08 LOG: pid 1892: 1st stage is done
2010-04-21 12:30:08 LOG: pid 1892: starting 2nd stage
2010-04-21 12:30:08 LOG: pid 1892: all connections from clients have been
closed
2010-04-21 12:30:08 DEBUG: pid 1892: exec_checkpoint: start checkpoint
2010-04-21 12:30:08 DEBUG: pid 1892: exec_checkpoint: finish checkpoint
2010-04-21 12:30:08 LOG: pid 1892: CHECKPOINT in the 2nd stage done
2010-04-21 12:30:08 LOG: pid 1892: starting recovery command: "SELECT
pgpool_recovery('base-backup', '192.168.1.34', '/var/lib/pgsql/data')"
2010-04-21 12:30:08 DEBUG: pid 1892: exec_recovery: start recovery
2010-04-21 12:30:11 DEBUG: pid 1892: exec_recovery: finish recovery
2010-04-21 12:30:11 DEBUG: pid 1892: exec_remote_start: start
pgpool_remote_start
2010-04-21 12:30:11 DEBUG: pid 1892: exec_remote_start: finish
pgpool_remote_start
2010-04-21 12:30:11 LOG: pid 1892: 1 node restarted
2010-04-21 12:30:11 LOG: pid 1892: send_failback_request: fail back 1 th
node request from pid 1892
-------------------------------------
And it stops there forever.
The strace of PID 1892 is:
select(0, NULL, NULL, NULL, {0, 33000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}
It keeps repeating that same line forever.
I thought it could be the failback command, so I tried a few... It does work in
that case with:
failback_command = 'exit 0'
But I also tried:
failback_command = 'echo host:%h, new master id:%m, old master id:%M >
/tmp/failback.log'
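For reference, the same logging failback_command can live in a small script file instead of an inline one-liner. Everything here (the path, the timestamp format) is an assumption; pgpool only substitutes the %h/%m/%M placeholders on the command line:

```shell
#!/bin/bash
# Hypothetical /etc/pgpool-II/scripts/failback.sh, wired up as:
#   failback_command = '/etc/pgpool-II/scripts/failback.sh %h %m %M'
# %h = host of the failing-back node, %m = new master node id,
# %M = old master node id.
host=$1
new_master=$2
old_master=$3
# Append a timestamped line; pgpool just needs the command to exit 0.
echo "$(date '+%F %T') failback: host=$host new_master=$new_master old_master=$old_master" >> /tmp/failback.log
```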
I have no access at all to the pcp commands.
If I try to stop pgpool (Ctrl+C) I get:
2010-04-21 12:30:11 LOG: pid 1892: send_failback_request: fail back 1 th
node request from pid 1892
2010-04-21 12:30:50 DEBUG: pid 1892: PCP child receives shutdown request
signal 2
And again it stays there forever. At that moment the running processes are:
1886 pts/0 S+ 0:00 pgpool -d -n
1887 pts/0 Z+ 0:00 [pgpool] <defunct>
1888 pts/0 Z+ 0:00 [pgpool] <defunct>
1889 pts/0 Z+ 0:00 [pgpool] <defunct>
1890 pts/0 Z+ 0:00 [pgpool] <defunct>
1891 pts/0 Z+ 0:00 [pgpool] <defunct>
1892 pts/0 Z+ 0:00 [pgpool] <defunct>
1929 pts/0 S+ 0:00 sh -c /etc/pgpool-II/scripts/failover_loop.sh 0 1
192.168.1.34 &
1930 pts/0 S+ 0:00 /bin/bash /etc/pgpool-II/scripts/failover_loop.sh
0 1 192.168.1.
2037 ? S 0:00 ssh -T 192.168.1.34 /usr/bin/pg_ctl -w -D
/var/lib/pgsql/data st
2064 pts/0 S+ 0:00 /bin/bash /etc/pgpool-II/scripts/failover_loop.sh
0 1 192.168.1.
2065 pts/0 S+ 0:00 pcp_node_info 5 192.168.1.33 9898 postgres
postgres 1
2066 pts/0 S+ 0:00 awk {print $3}
2072 ? S 0:00 sleep 5
2073 pts/1 R+ 0:00 ps ax
Do you know what it could be?
Thanks!