[Pgpool-general] Problem with failback

Daniel Codina dcodina at laigu.net
Wed Apr 21 16:31:26 UTC 2010


Hi there,

I think I found something strange. Again, it could be I am doing something
wrong, but I don't se where.

My problem is with the failback.
Ok, starting from the start.

I have 2 postgres and a pgpool working perfectly. With no failover neither
failback commands.
I have a script wich, start retrying a pcp_recovery_node since it notices it
is already up.

If I stop one of the nodes, pgpool notices it as it has to do. I launch my
script and it works perfectly, after some secconds I have both postgres up
and running again.

Now, I want faiulover_command to execute it, so, I put that script in place
at failover_command = 'script' .

In that case, just after a node falls, pgpool would have to try to recover
it (trouhg my script).
I try it. It starts recovery, and it does it perfectly.

The problem is that pgpool suddenly stops.. completly stops.
Tha trace pgpool i debug mode gives to me is:

------------------------------
2010-04-21 12:30:05 ERROR: pid 1886: connect_inet_domain_socket: connect()
failed: Connection refused
2010-04-21 12:30:05 ERROR: pid 1886: health check failed. 1 th host
192.168.1.34 at port 5432 is down
2010-04-21 12:30:05 LOG:   pid 1886: set 1 th backend down status
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler called
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: starting to select
new master node
2010-04-21 12:30:05 LOG:   pid 1886: starting degeneration. shutdown host
192.168.1.34(5432)
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1887
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1888
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1889
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1890
2010-04-21 12:30:05 DEBUG: pid 1886: failover_handler: kill 1891
2010-04-21 12:30:05 LOG:   pid 1886: execute command:
/etc/pgpool-II/scripts/failover_loop.sh 0 1 192.168.1.34 &>
/tmp/failover_recovery.log
2010-04-21 12:30:05 DEBUG: pid 1887: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1888: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1889: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1890: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1891: child received shutdown request signal
3
2010-04-21 12:30:05 DEBUG: pid 1892: I am PCP 1892 accept fd 5
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'M'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: salt sent to the client
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'R'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: authentication OK
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'I'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: retrieved node information
from shared memory
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'X'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: client disconnecting. close
connection
2010-04-21 12:30:05 DEBUG: pid 1892: I am PCP 1892 accept fd 5
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'M'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: salt sent to the client
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'R'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: authentication OK
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: received PCP packet type of
service 'O'
2010-04-21 12:30:05 DEBUG: pid 1892: pcp_child: start online recovery
2010-04-21 12:30:05 LOG:   pid 1892: starting recovering node 1
2010-04-21 12:30:05 DEBUG: pid 1892: exec_checkpoint: start checkpoint
2010-04-21 12:30:05 DEBUG: pid 1892: exec_checkpoint: finish checkpoint
2010-04-21 12:30:05 LOG:   pid 1892: CHECKPOINT in the 1st stage done
2010-04-21 12:30:05 LOG:   pid 1892: starting recovery command: "SELECT
pgpool_recovery('base-backup', '192.168.1.34', '/var/lib/pgsql/data')"
2010-04-21 12:30:05 DEBUG: pid 1892: exec_recovery: start recovery
2010-04-21 12:30:08 DEBUG: pid 1892: exec_recovery: finish recovery
2010-04-21 12:30:08 LOG:   pid 1892: 1st stage is done
2010-04-21 12:30:08 LOG:   pid 1892: starting 2nd stage
2010-04-21 12:30:08 LOG:   pid 1892: all connections from clients have been
closed
2010-04-21 12:30:08 DEBUG: pid 1892: exec_checkpoint: start checkpoint
2010-04-21 12:30:08 DEBUG: pid 1892: exec_checkpoint: finish checkpoint
2010-04-21 12:30:08 LOG:   pid 1892: CHECKPOINT in the 2nd stage done
2010-04-21 12:30:08 LOG:   pid 1892: starting recovery command: "SELECT
pgpool_recovery('base-backup', '192.168.1.34', '/var/lib/pgsql/data')"
2010-04-21 12:30:08 DEBUG: pid 1892: exec_recovery: start recovery
2010-04-21 12:30:11 DEBUG: pid 1892: exec_recovery: finish recovery
2010-04-21 12:30:11 DEBUG: pid 1892: exec_remote_start: start
pgpool_remote_start
2010-04-21 12:30:11 DEBUG: pid 1892: exec_remote_start: finish
pgpool_remote_start
2010-04-21 12:30:11 LOG:   pid 1892: 1 node restarted
2010-04-21 12:30:11 LOG:   pid 1892: send_failback_request: fail back 1 th
node request from pid 1892

-------------------------------------

And it stops there forever.

The trace of the PID 1892 is:
select(0, NULL, NULL, NULL, {0, 33000}) = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0})     = 0 (Timeout)
select(0, NULL, NULL, NULL, {1, 0}

Also forever repeating that same line.

Ithought it could be the failback command, so I tryied some ... In that case
is working with:

failback_command = 'exit 0'

But I tryed also:
failback_command = 'echo host:%h, new master id:%m, old master id:%M >
/tmp/failback.log'


I have no acces at all to pcp commands.

If I try to stop pgpool (Ctr+C) I get:
2010-04-21 12:30:11 LOG:   pid 1892: send_failback_request: fail back 1 th
node request from pid 1892
2010-04-21 12:30:50 DEBUG: pid 1892: PCP child receives shutdown request
signal 2

And again it stays there for ever. In that moment the processes runing are:

 1886 pts/0    S+     0:00 pgpool -d -n
 1887 pts/0    Z+     0:00 [pgpool] <defunct>
 1888 pts/0    Z+     0:00 [pgpool] <defunct>
 1889 pts/0    Z+     0:00 [pgpool] <defunct>
 1890 pts/0    Z+     0:00 [pgpool] <defunct>
 1891 pts/0    Z+     0:00 [pgpool] <defunct>
 1892 pts/0    Z+     0:00 [pgpool] <defunct>
 1929 pts/0    S+     0:00 sh -c /etc/pgpool-II/scripts/failover_loop.sh 0 1
192.168.1.34 &
 1930 pts/0    S+     0:00 /bin/bash /etc/pgpool-II/scripts/failover_loop.sh
0 1 192.168.1.
 2037 ?        S      0:00 ssh -T 192.168.1.34 /usr/bin/pg_ctl -w -D
/var/lib/pgsql/data st
 2064 pts/0    S+     0:00 /bin/bash /etc/pgpool-II/scripts/failover_loop.sh
0 1 192.168.1.
 2065 pts/0    S+     0:00 pcp_node_info 5 192.168.1.33 9898 postgres
postgres 1
 2066 pts/0    S+     0:00 awk {print $3}
 2072 ?        S      0:00 sleep 5
 2073 pts/1    R+     0:00 ps ax


Do you know what it could be?


Thanks!
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://pgfoundry.org/pipermail/pgpool-general/attachments/20100421/0480c1c9/attachment.html>


More information about the Pgpool-general mailing list