View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000481 | Pgpool-II | Bug | public | 2019-03-26 21:03 | 2019-05-21 16:11 |
| Reporter | nagata | Assigned To | t-ishii | | |
| Priority | normal | Severity | major | Reproducibility | have not tried |
| Status | closed | Resolution | open | | |
| Product Version | 3.6.15 | | | | |
| Fixed in Version | 3.6.17 | | | | |
| Summary | 0000481: a race condition causing a segfault in replication mode | | | | |

**Description**

This segfault occurs when a new connection arrives during a failover in native replication mode. It occurs in pool_do_auth() (see the backtrace and log below). I guess pool_do_auth() was called before Req_info->master_node_id was updated in failover(), so MASTER_CONNECTION(cp) was still referring to the downed connection, and dereferencing MASTER_CONNECTION(cp)->sp caused the segfault.

Here is the backtrace from the core file:

```
Core was generated by `pgpool: accept connection '.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18) at auth/pool_auth.c:77
77          protoMajor = MASTER_CONNECTION(cp)->sp->major;
Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
(gdb) bt
#0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18) at auth/pool_auth.c:77
#1  0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28) at protocol/child.c:954
#2  0x0000000000423fdd in get_backend_connection (frontend=0x1678f28) at protocol/child.c:2396
#3  0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
#4  0x000000000040682d in fork_a_child (fds=0x16584f0, id=372) at main/pgpool_main.c:758
#5  0x0000000000409941 in failover () at main/pgpool_main.c:2102
#6  0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>, clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
#7  0x0000000000405c44 in main (argc=<value optimized out>, argv=<value optimized out>) at main/main.c:317
(gdb) l
72          int authkind;
73          int i;
74          StartupPacket *sp;
75
76
77          protoMajor = MASTER_CONNECTION(cp)->sp->major;
78
79          kind = pool_read_kind(cp);
80          if (kind < 0)
81              ereport(ERROR,
```

Here is a snippet of the pgpool log. PID 5067 is the child that hit the segfault:

```
(snip)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: starting degeneration. shutdown host xxxxxxxx(xxxx)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: Restart all children
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG: new connection received
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL: connecting host=xxxxxx port=xxxx
(snip)
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING: child process with pid: 5067 was terminated by segmentation fault
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5067 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exited with success and will not be restarted
(snip)
```
**Steps To Reproduce**

I can reproduce the segfault by repeatedly executing pcp_detach_node and pcp_attach_node for node 0 while running pgbench with the -C option. I used child_max_connections = 5 to make pgpool create new connections frequently.

**Additional Information**

Discussed in [pgpool-hackers: 3252] Re: Segfault in a race condition. See this thread for details.

**Tags**

No tags attached.
**Notes**

Attached is a patch that tries to fix the problem by checking whether failover/failback is ongoing in the MASTER* macros (actually in pool_virtual_master_db_node_id(), which is called by those macros). If failover/failback is ongoing, the Pgpool-II child emits a FATAL error and exits. Probably bug 482 can be fixed by this as well.
Attached is a revised patch. Issuing FATAL in pool_virtual_master_db_node_id() was not good, since it raises an exception, and inside the exception-handling code the MASTER macro is called again, which leads to infinite recursion until the stack depth limit is hit. So in this patch a WARNING is issued instead of FATAL, and then the process exits. Also, the ereport call is protected by signal masking, because SIGUSR1 will be issued and the interruption is just a waste of time.
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-03-26 21:03 | nagata | New Issue | |
| 2019-04-01 16:33 | administrator | Assigned To | => t-ishii |
| 2019-04-01 16:33 | administrator | Status | new => assigned |
| 2019-04-01 17:51 | t-ishii | File Added: failover-check.diff | |
| 2019-04-01 17:51 | t-ishii | Note Added: 0002495 | |
| 2019-04-02 11:15 | t-ishii | File Added: master-macro-segfault.diff | |
| 2019-04-02 11:15 | t-ishii | Note Added: 0002497 | |
| 2019-04-02 11:16 | t-ishii | Status | assigned => feedback |
| 2019-05-17 13:13 | t-ishii | Status | feedback => resolved |
| 2019-05-17 13:13 | t-ishii | Fixed in Version | => 3.6.17 |
| 2019-05-21 16:11 | administrator | Status | resolved => closed |