View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000105 | Pgpool-II | Bug | public | 2014-06-24 18:42 | 2015-01-08 16:12 |
| Reporter | qian | Assigned To | nagata | ||
| Priority | normal | Severity | block | Reproducibility | always |
| Status | closed | Resolution | fixed | ||
| Platform | x86 | OS | CentOS | OS Version | 6.5 |
| Summary | 0000105: Failover dead lock | ||||
| Description | For HA, three servers (59, 61, 63) form a database cluster. Each node (server) runs Tomcat 6, Pgpool-II (3.3.3) and PostgreSQL (9.2). When one node fails, the other nodes cannot fail over. | ||||
| Steps To Reproduce | Streaming replication is set up between the three nodes:<br>psql -c "show pool_nodes" -d rcd -h 127.0.0.1 -p 9999<br>node_id \| hostname \| port \| status \| lb_weight \| role<br>---------+---------------+------+--------+-----------+---------<br>0 \| 172.24.128.59 \| 5432 \| 2 \| 0.333333 \| standby<br>1 \| 172.24.128.61 \| 5432 \| 2 \| 0.333333 \| primary<br>2 \| 172.24.128.63 \| 5432 \| 2 \| 0.333333 \| standby<br>(3 rows)<br>1. Reboot node 59 (simulated failure).<br>2. After a short while, nodes 61 and 63 have not performed failover and cannot respond to SQL client requests. | ||||
| Additional Information | Node 61 pgpool process information:<br>[root@localhost pgpool2-V3_3_STABLE-ac397a1]# ps -ef \| grep pgpool<br>root 36785 1 0 16:40 ? 00:00:00 /var/lib/pgsql/bin/pgpool<br>root 36787 36785 0 16:40 ? 00:00:00 pgpool: watchdog<br>root 36788 36785 0 16:40 ? 00:00:00 pgpool: heartbeat receiver<br>root 36789 36785 0 16:40 ? 00:00:00 pgpool: heartbeat sender<br>root 36790 36785 0 16:40 ? 00:00:00 pgpool: heartbeat receiver<br>root 36791 36785 0 16:40 ? 00:00:00 pgpool: heartbeat sender<br>root 36792 36785 0 16:40 ? 00:00:00 pgpool: lifecheck<br>[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 36785<br>(gdb) bt<br>#0 0x00007f93d2961197 in semop () from /lib64/libc.so.6<br>#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128<br>#2 0x0000000000406204 in failover () at main.c:1742<br>#3 0x00000000004081c7 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:694<br>(gdb)<br>[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 36787<br>(gdb) bt<br>#0 0x00007f93d2961197 in semop () from /lib64/libc.so.6<br>#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128<br>#2 0x0000000000404bc9 in degenerate_backend_set (node_id_set=0x7fffa6f19228, count=1) at main.c:1480<br>#3 0x000000000047ca22 in wd_node_request_signal (fork_wait_time=<value optimized out>) at wd_child.c:435<br>#4 wd_send_response (fork_wait_time=<value optimized out>) at wd_child.c:420<br>#5 wd_child (fork_wait_time=<value optimized out>) at wd_child.c:109<br>#6 0x000000000047c0ed in wd_main (fork_wait_time=1) at watchdog.c:147<br>#7 0x00000000004086a2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:630<br>(gdb)<br>-----------------------<br>Node 63 pgpool process information:<br>[root@localhost pgpool2-V3_3_STABLE-ac397a1]# ps -ef \| grep pgpool<br>root 7291 1 0 16:39 ? 00:00:00 /var/lib/pgsql/bin/pgpool<br>root 7294 7291 0 16:39 ? 00:00:00 pgpool: watchdog<br>root 7295 7291 0 16:39 ? 00:00:00 pgpool: heartbeat receiver<br>root 7296 7291 0 16:39 ? 00:00:00 pgpool: heartbeat sender<br>root 7297 7291 0 16:39 ? 00:00:00 pgpool: heartbeat receiver<br>root 7298 7291 0 16:39 ? 00:00:00 pgpool: heartbeat sender<br>root 7299 7291 0 16:39 ? 00:00:00 pgpool: lifecheck<br>[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 7291<br>(gdb) bt<br>#0 0x00007f758d8b0197 in semop () from /lib64/libc.so.6<br>#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128<br>#2 0x0000000000406204 in failover () at main.c:1742<br>#3 0x00000000004081c7 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:694<br>(gdb)<br>[root@localhost pgpool2-V3_3_STABLE-ac397a1]# gdb -p 7294<br>(gdb) bt<br>#0 0x00007f758d8b0197 in semop () from /lib64/libc.so.6<br>#1 0x0000000000423d83 in pool_semaphore_lock (semNum=<value optimized out>) at pool_sema.c:128<br>#2 0x0000000000404bc9 in degenerate_backend_set (node_id_set=0x7fffda01cfe8, count=1) at main.c:1480<br>#3 0x000000000047ca22 in wd_node_request_signal (fork_wait_time=<value optimized out>) at wd_child.c:435<br>#4 wd_send_response (fork_wait_time=<value optimized out>) at wd_child.c:420<br>#5 wd_child (fork_wait_time=<value optimized out>) at wd_child.c:109<br>#6 0x000000000047c0ed in wd_main (fork_wait_time=1) at watchdog.c:147<br>#7 0x00000000004086a2 in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:630<br>(gdb)<br>==> Both nodes are blocked at the same point. | ||||
| Tags | No tags attached. | ||||
| (0000431) qian, 2014-06-24 19:05 | If I kill the pgpool process on node 61, node 63 can then continue and perform the failover. This issue looks similar to 0000054, but I don't know whether it is the same problem. Thank you very much! |
| (0000433) nagata, 2014-06-30 19:09 | Thanks for your report. I can't reproduce the hang, but I found a possible cause. Before acquiring the semaphore lock, pgpool checks a flag to see whether a failover has already been started by another process. However, depending on timing, the lock can be acquired even if the flag is incorrect, and this causes a deadlock. I have fixed this. Could you please try the attached patch? |
| (0000434) qian, 2014-07-01 11:20 | Thanks for your patch. I applied it and tried to reproduce the problem, but the problem still exists. (This time node 59 was the primary node, and I rebooted it.) I have uploaded the log and hope this information helps you locate the problem. Thanks again for your help! |
| (0000464) melerz, 2014-08-21 03:47 | Hey. I believe I have a similar problem: http://www.sraoss.jp/pipermail/pgpool-general/2014-August/003145.html How can I produce a backtrace with symbols? Thank you. |
| (0000504) qian, 2014-12-24 10:40 | Hi all, I tested the latest pgpool (commit aace3fd8fe964dee7fe9c23734c7fb8b4141591d) and found that the deadlock could no longer be reproduced. I think it has been resolved. Thanks! |
| (0000507) nagata, 2015-01-08 16:11 | Thanks for reporting! |
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2014-06-24 18:42 | qian | New Issue | |
| 2014-06-24 18:42 | qian | File Added: failover.rar | |
| 2014-06-24 19:05 | qian | Note Added: 0000431 | |
| 2014-06-30 11:13 | nagata | Assigned To | => nagata |
| 2014-06-30 11:13 | nagata | Status | new => assigned |
| 2014-06-30 19:09 | nagata | File Added: main.c.patch | |
| 2014-06-30 19:09 | nagata | Note Added: 0000433 | |
| 2014-06-30 19:10 | nagata | Status | assigned => feedback |
| 2014-07-01 11:20 | qian | File Added: failover0701.rar | |
| 2014-07-01 11:20 | qian | Note Added: 0000434 | |
| 2014-07-01 11:20 | qian | Status | feedback => assigned |
| 2014-08-21 03:47 | melerz | Note Added: 0000464 | |
| 2014-12-24 10:40 | qian | Note Added: 0000504 | |
| 2015-01-08 16:11 | nagata | Note Added: 0000507 | |
| 2015-01-08 16:12 | nagata | Status | assigned => closed |
| 2015-01-08 16:12 | nagata | Resolution | open => fixed |