View Issue Details

ID: 0000105
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2015-01-08 16:12
Reporter: qian
Assigned To: nagata
Priority: normal
Severity: block
Reproducibility: always
Status: closed
Resolution: fixed
Platform: x86
OS: CentOS
OS Version: 6.5
Product Version:
Target Version:
Fixed in Version:
Summary: 0000105: Failover deadlock
Description: For HA, a database cluster is built from three servers (59, 61, 63).
Each node (server) runs Tomcat 6, Pgpool-II (3.3.3), and PostgreSQL (9.2).

When a node fails, the other nodes cannot fail over.
Steps To Reproduce: Streaming replication is set up between the three nodes.

psql -c "show pool_nodes" -d rcd -h 127.0.0.1 -p 9999
 node_id | hostname      | port | status | lb_weight | role
---------+---------------+------+--------+-----------+---------
 0       | 172.24.128.59 | 5432 | 2      | 0.333333  | standby
 1       | 172.24.128.61 | 5432 | 2      | 0.333333  | primary
 2       | 172.24.128.63 | 5432 | 2      | 0.333333  | standby
(3 rows)

1. Reboot node 59 (to simulate a failure).
2. After a little while, nodes 61 and 63 have not performed a failover and cannot respond to SQL client requests.
Additional Information: pgpool process information on node 61:

[root@localhost pgpool2-V3_3_STABLE-ac397a1]# ps -ef | grep pgpool
root 36785 1 0 16:40 ? 00:00:00 /var/lib/pgsql/bin/pgpool
root 36787 36785 0 16:40 ? 00:00:00 pgpool: watchdog
root 36788 36785 0 16:40 ? 00:00:00 pgpool: heartbeat receiver
root 36789 36785 0 16:40 ? 00:00:00 pgpool: heartbeat sender
root 36790 36785 0 16:40 ? 00:00:00 pgpool: heartbeat receiver
root 36791 36785 0 16:40 ? 00:00:00 pgpool: heartbeat sender
root 36792 36785 0 16:40 ? 00:00:00 pgpool: lifecheck

[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 36785
(gdb) bt
#0 0x00007f93d2961197 in semop () from /lib64/libc.so.6
#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128
#2 0x0000000000406204 in failover () at main.c:1742
#3 0x00000000004081c7 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:694
(gdb)

[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 36787
(gdb) bt
#0 0x00007f93d2961197 in semop () from /lib64/libc.so.6
#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128
#2 0x0000000000404bc9 in degenerate_backend_set (node_id_set=0x7fffa6f19228, count=1) at main.c:1480
#3 0x000000000047ca22 in wd_node_request_signal (fork_wait_time=<value optimized out>) at wd_child.c:435
#4 wd_send_response (fork_wait_time=<value optimized out>) at wd_child.c:420
#5 wd_child (fork_wait_time=<value optimized out>) at wd_child.c:109
#6 0x000000000047c0ed in wd_main (fork_wait_time=1) at watchdog.c:147
#7 0x00000000004086a2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:630
(gdb)

-----------------------
pgpool process information on node 63:

[root@localhost pgpool2-V3_3_STABLE-ac397a1]# ps -ef | grep pgpool
root 7291 1 0 16:39 ? 00:00:00 /var/lib/pgsql/bin/pgpool
root 7294 7291 0 16:39 ? 00:00:00 pgpool: watchdog
root 7295 7291 0 16:39 ? 00:00:00 pgpool: heartbeat receiver
root 7296 7291 0 16:39 ? 00:00:00 pgpool: heartbeat sender
root 7297 7291 0 16:39 ? 00:00:00 pgpool: heartbeat receiver
root 7298 7291 0 16:39 ? 00:00:00 pgpool: heartbeat sender
root 7299 7291 0 16:39 ? 00:00:00 pgpool: lifecheck

[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 7291
(gdb) bt
#0 0x00007f758d8b0197 in semop () from /lib64/libc.so.6
#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128
#2 0x0000000000406204 in failover () at main.c:1742
#3 0x00000000004081c7 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:694
(gdb)

[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 7294
(gdb) bt
#0 0x00007f758d8b0197 in semop () from /lib64/libc.so.6
#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128
#2 0x0000000000404bc9 in degenerate_backend_set (node_id_set=0x7fffda01cfe8, count=1) at main.c:1480
#3 0x000000000047ca22 in wd_node_request_signal (fork_wait_time=<value optimized out>) at wd_child.c:435
#4 wd_send_response (fork_wait_time=<value optimized out>) at wd_child.c:420
#5 wd_child (fork_wait_time=<value optimized out>) at wd_child.c:109
#6 0x000000000047c0ed in wd_main (fork_wait_time=1) at watchdog.c:147
#7 0x00000000004086a2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:630
(gdb)


==> They are both blocked at the same point.
Tags: No tags attached.

Activities

qian

2014-06-24 18:42

reporter  

failover.rar (10,567 bytes)

qian

2014-06-24 19:05

reporter   ~0000431

I only had to kill the pgpool process on node 61, and then node 63 could continue to perform the failover.

This issue looks like 0000054; I don't know whether it is the same.

Thank you very much!

nagata

2014-06-30 19:09

developer  

main.c.patch (944 bytes)
diff --git a/main.c b/main.c
index 1845c4d..5a62ec8 100644
--- a/main.c
+++ b/main.c
@@ -1739,6 +1739,14 @@ static void failover(void)
 		return;
 	}
 
+	/*
+	 * if not in replication mode/master slave mode, we treat this a restart request.
+	 * otherwise we need to check if we have already failovered.
+	 */
+	pool_debug("failover_handler: starting to select new master node");
+	switching = 1;
+	Req_info->switching = true;
+
 	pool_semaphore_lock(REQUEST_INFO_SEM);
 
 	if (Req_info->kind == CLOSE_IDLE_REQUEST)
@@ -1749,13 +1757,6 @@ static void failover(void)
 		return;
 	}
 
-	/*
-	 * if not in replication mode/master slave mode, we treat this a restart request.
-	 * otherwise we need to check if we have already failovered.
-	 */
-	pool_debug("failover_handler: starting to select new master node");
-	switching = 1;
-	Req_info->switching = true;
 	node_id = Req_info->node_id[0];
 
 	/* start of command inter-lock with watchdog */

nagata

2014-06-30 19:09

developer   ~0000433

Thanks for your report.

I can't reproduce the hang, but I found a possible cause.

Pgpool checks a flag before acquiring the semaphore lock to see whether
a failover has already been started by another process. However, depending
on timing, the lock can be acquired even while the flag is still false, and
this causes a deadlock.

I have fixed this. Could you please try the attached patch?
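
To make the ordering change in the patch above easier to follow, here is a minimal, self-contained C sketch of the pattern. The names Req_info, switching and REQUEST_INFO_SEM come from the patch; the lock/unlock helpers and the shared-state setup are simplified stand-ins, not pgpool's actual implementation, and no real failover logic is shown.

#include <stdbool.h>
#include <stdio.h>

#define REQUEST_INFO_SEM 0

/* Stand-in for pgpool's shared request info; in pgpool this lives in
 * shared memory and is visible to the main process and the watchdog. */
struct request_info { volatile bool switching; };
static struct request_info req_info_storage;
static struct request_info *Req_info = &req_info_storage;
static volatile int switching = 0;

/* Stand-ins for pool_semaphore_lock()/pool_semaphore_unlock(),
 * which wrap semop() in pgpool. */
static void sem_lock(int sem_num)   { (void)sem_num; /* semop(..., -1) */ }
static void sem_unlock(int sem_num) { (void)sem_num; /* semop(..., +1) */ }

/* Before the patch: the "failover in progress" flag is raised only after
 * the semaphore is taken.  Another pgpool process that checks the flag
 * before locking can still see it as false in that window, conclude that
 * no failover is running, and then block on the same semaphore. */
static void failover_before_patch(void)
{
	sem_lock(REQUEST_INFO_SEM);   /* window: flag is still false here */
	switching = 1;
	Req_info->switching = true;
	/* ... select a new master node, restart children, etc. ... */
	Req_info->switching = false;
	switching = 0;
	sem_unlock(REQUEST_INFO_SEM);
}

/* After the patch: the flag is raised before the semaphore is acquired,
 * so a process that tests Req_info->switching before locking sees the
 * failover already in progress and backs off instead of queueing on
 * the lock. */
static void failover_after_patch(void)
{
	switching = 1;
	Req_info->switching = true;
	sem_lock(REQUEST_INFO_SEM);
	/* ... select a new master node, restart children, etc. ... */
	Req_info->switching = false;
	switching = 0;
	sem_unlock(REQUEST_INFO_SEM);
}

int main(void)
{
	failover_before_patch();
	failover_after_patch();
	puts("ordering sketch only; no real failover is performed");
	return 0;
}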

qian

2014-07-01 11:20

reporter  

failover0701.rar (6,488 bytes)

qian

2014-07-01 11:20

reporter   ~0000434

Thanks for your patch.
I applied the patch and tried to reproduce the problem, but the problem still exists. (This time node 59 was the primary node, and I rebooted it.)
I have uploaded the log and hope this information helps you locate the problem.
Thanks for your help again!

melerz

2014-08-21 03:47

reporter   ~0000464

Hey.
I believe I have a similar problem.
http://www.sraoss.jp/pipermail/pgpool-general/2014-August/003145.html

How can I produce a backtrace with symbols?
Thank you.
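
The backtraces earlier in this report show "<value optimized out>" because pgpool was built with compiler optimization enabled. One common way to get a backtrace with full symbols is to rebuild pgpool-II from source with optimization off and debug information on, then attach gdb to the stuck process. This is only a sketch, assuming a standard autoconf source build; the process id is a placeholder:

./configure CFLAGS="-O0 -g"
make
make install        # restart pgpool from the new build, then reproduce the hang
gdb -p <pgpool_pid>
(gdb) bt full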

qian

2014-12-24 10:40

reporter   ~0000504

Hi all,

I tested the latest pgpool (commit aace3fd8fe964dee7fe9c23734c7fb8b4141591d) and found that the deadlock could not be reproduced.

I think it has been resolved. Thanks!

nagata

2015-01-08 16:11

developer   ~0000507

Thanks for your report!

Issue History

Date Modified Username Field Change
2014-06-24 18:42 qian New Issue
2014-06-24 18:42 qian File Added: failover.rar
2014-06-24 19:05 qian Note Added: 0000431
2014-06-30 11:13 nagata Assigned To => nagata
2014-06-30 11:13 nagata Status new => assigned
2014-06-30 19:09 nagata File Added: main.c.patch
2014-06-30 19:09 nagata Note Added: 0000433
2014-06-30 19:10 nagata Status assigned => feedback
2014-07-01 11:20 qian File Added: failover0701.rar
2014-07-01 11:20 qian Note Added: 0000434
2014-07-01 11:20 qian Status feedback => assigned
2014-08-21 03:47 melerz Note Added: 0000464
2014-12-24 10:40 qian Note Added: 0000504
2015-01-08 16:11 nagata Note Added: 0000507
2015-01-08 16:12 nagata Status assigned => closed
2015-01-08 16:12 nagata Resolution open => fixed