View Issue Details
| ID | Project | Category | View Status | Date Submitted | Last Update |
|---|---|---|---|---|---|
| 0000481 | Pgpool-II | Bug | public | 2019-03-26 21:03 | 2019-05-21 16:11 |
| Reporter | nagata | Assigned To | t-ishii | | |
| Priority | normal | Severity | major | Reproducibility | have not tried |
| Status | closed | Resolution | open | | |
| Product Version | 3.6.15 | | | | |
| Fixed in Version | 3.6.17 | | | | |
| Summary | 0000481: a race condition causing a segfault in replication mode | | | | |

**Description**

This segfault occurs when a new connection arrives during a failover in native replication mode. It occurs in pool_do_auth() (see the backtrace and log below). I guess pool_do_auth() was called before Req_info->master_node_id was updated in failover(), so MASTER_CONNECTION(cp) was still referring to the downed connection, and dereferencing MASTER_CONNECTION(cp)->sp caused the segfault.

Here is the backtrace from the core file:

```
Core was generated by `pgpool: accept connection '.
Program terminated with signal 11, Segmentation fault.
#0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18) at auth/pool_auth.c:77
77          protoMajor = MASTER_CONNECTION(cp)->sp->major;
Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
(gdb) bt
#0  0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18) at auth/pool_auth.c:77
#1  0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28) at protocol/child.c:954
#2  0x0000000000423fdd in get_backend_connection (frontend=0x1678f28) at protocol/child.c:2396
#3  0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
#4  0x000000000040682d in fork_a_child (fds=0x16584f0, id=372) at main/pgpool_main.c:758
#5  0x0000000000409941 in failover () at main/pgpool_main.c:2102
#6  0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>, clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
#7  0x0000000000405c44 in main (argc=<value optimized out>, argv=<value optimized out>) at main/main.c:317
(gdb) l
72          int authkind;
73          int i;
74          StartupPacket *sp;
75
76
77          protoMajor = MASTER_CONNECTION(cp)->sp->major;
78
79          kind = pool_read_kind(cp);
80          if (kind < 0)
81              ereport(ERROR,
```

Here is a snippet of the pgpool log. PID 5067 is the child that hit the segfault:

```
(snip)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: starting degeneration. shutdown host xxxxxxxx(xxxx)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: Restart all children
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG: new connection received
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL: connecting host=xxxxxx port=xxxx
(snip)
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING: child process with pid: 5067 was terminated by segmentation fault
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5067 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exited with success and will not be restarted
(snip)
```
**Steps To Reproduce**

I can reproduce the segfault by repeatedly executing pcp_detach_node and pcp_attach_node for node 0 while running pgbench with the -C option. I used child_max_connections = 5 to make pgpool create new connections frequently.

**Additional Information**

Discussed in [pgpool-hackers: 3252] Re: Segfault in a race condition. See this thread for details.

**Tags**

No tags attached.
**Notes**

Attached is a patch that tries to fix the problem by checking whether failover/failback is ongoing in the MASTER* macros (actually in pool_virtual_master_db_node_id(), which is called by those macros). If failover/failback is ongoing, the Pgpool-II child emits a FATAL error and exits. Probably bug 482 can be fixed by this as well.
Attached is a revised patch. Issuing FATAL in pool_virtual_master_db_node_id() was not good, since it raises an exception, and inside the exception-handling code the MASTER macro is called again, which leads to infinite recursion until the stack depth limit is hit. So in this patch a WARNING is issued instead of FATAL, and then the process exits. Also, the ereport call is protected by signal masking, because SIGUSR1 will be issued and the interruption is just a waste of time.
| Date Modified | Username | Field | Change |
|---|---|---|---|
| 2019-03-26 21:03 | nagata | New Issue | |
| 2019-04-01 16:33 | administrator | Assigned To | => t-ishii |
| 2019-04-01 16:33 | administrator | Status | new => assigned |
| 2019-04-01 17:51 | t-ishii | File Added: failover-check.diff | |
| 2019-04-01 17:51 | t-ishii | Note Added: 0002495 | |
| 2019-04-02 11:15 | t-ishii | File Added: master-macro-segfault.diff | |
| 2019-04-02 11:15 | t-ishii | Note Added: 0002497 | |
| 2019-04-02 11:16 | t-ishii | Status | assigned => feedback |
| 2019-05-17 13:13 | t-ishii | Status | feedback => resolved |
| 2019-05-17 13:13 | t-ishii | Fixed in Version | => 3.6.17 |
| 2019-05-21 16:11 | administrator | Status | resolved => closed |