View Issue Details

ID: 0000481
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2019-05-21 16:11
Reporter: nagata
Assigned To: t-ishii
Priority: normal
Severity: major
Reproducibility: have not tried
Status: closed
Resolution: open
Product Version: 3.6.15
Target Version:
Fixed in Version: 3.6.17
Summary: 0000481: a race condition causing a segfault in replication mode.
Description: This segfault occurs when a new connection arrives during failover in native replication mode.

This occurs in pool_do_auth(). (See backtrace and log below)

I guess pool_do_auth() was called before Req_info->master_node_id was updated
in failover(), so MASTER_CONNECTION(cp) was still referring to the downed
connection, and dereferencing MASTER_CONNECTION(cp)->sp caused the segfault.

Here is the backtrace from core:
=================================
Core was generated by `pgpool: accept connection '.
Program terminated with signal 11, Segmentation fault.
#0 0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
    at auth/pool_auth.c:77
77 protoMajor = MASTER_CONNECTION(cp)->sp->major;
Missing separate debuginfos, use: debuginfo-install libmemcached-0.31-1.1.el6.x86_64
(gdb) bt
#0 0x000000000041b993 in pool_do_auth (frontend=0x1678f28, cp=0x1668f18)
    at auth/pool_auth.c:77
#1  0x000000000042377f in connect_backend (sp=0x167ae78, frontend=0x1678f28)
    at protocol/child.c:954
#2  0x0000000000423fdd in get_backend_connection (frontend=0x1678f28)
    at protocol/child.c:2396
#3  0x0000000000424b94 in do_child (fds=0x16584f0) at protocol/child.c:337
#4  0x000000000040682d in fork_a_child (fds=0x16584f0, id=372)
    at main/pgpool_main.c:758
#5  0x0000000000409941 in failover () at main/pgpool_main.c:2102
#6  0x000000000040cb40 in PgpoolMain (discard_status=<value optimized out>,
    clear_memcache_oidmaps=<value optimized out>) at main/pgpool_main.c:476
#7  0x0000000000405c44 in main (argc=<value optimized out>,
    argv=<value optimized out>) at main/main.c:317
(gdb) l
72 int authkind;
73 int i;
74 StartupPacket *sp;
75
76
77 protoMajor = MASTER_CONNECTION(cp)->sp->major;
78
79 kind = pool_read_kind(cp);
80 if (kind < 0)
81 ereport(ERROR,
=================================

Here is a snippet of the pgpool log. PID 5067 has a segfault.
==================
(snip)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: starting degeneration. shutdown host xxxxxxxx(xxxx)
2019-02-23 18:41:35:MAIN(2743):[No Connection]:[No Connection]: LOG: Restart all children
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: LOG: new connection received
2019-02-23 18:41:35:CHILD(5067):[No Connection]:[No Connection]: DETAIL: connecting host=xxxxxx port=xxxx
(snip)
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5066 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: WARNING: child process with pid: 5067 was terminated by segmentation fault
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5067 exited with success and will not be restarted
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exits with status 0
2019-02-23 18:41:37:MAIN(2743):[No Connection]:[No Connection]: LOG: child process with pid: 5068 exited with success and will not be restarted
(snip)
===================
Steps To Reproduce
I can reproduce the segfault by executing pcp_detach_node and pcp_attach_node for node 0 repeatedly while running pgbench with the -C option.

I used child_max_connections = 5 to make pgpool create new connections frequently.
Additional Information
Discussed in [pgpool-hackers: 3252] Re: Segfault in a race condition.
See this thread for details.
Tags: No tags attached.

Activities

t-ishii

2019-04-01 17:51

developer   ~0002495

Attached is a patch that tries to fix the problem by checking whether failover/failback is ongoing in the MASTER* macros (actually in pool_virtual_master_db_node_id(), which is called by these macros). If failover/failback is ongoing, the Pgpool-II child emits a FATAL error and exits. Probably bug 482 can be fixed by this as well.

failover-check.diff (539 bytes)
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index b82da454..0cf158b6 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -317,6 +317,14 @@ pool_virtual_master_db_node_id(void)
 		return REAL_MASTER_NODE_ID;
 	}
 
+	/*
+	 * Check whether failover is in progress
+	 */
+	if (Req_info->switching)
+	{
+		elog(FATAL, "failover/failback is in progress");
+	}
+
 	if (sc->in_progress && sc->query_context)
 	{
 		int			node_id = sc->query_context->virtual_master_node_id;

t-ishii

2019-04-02 11:15

developer   ~0002497

Attached is a revised patch. Issuing FATAL in pool_virtual_master_db_node_id() was not good, since it raises an exception, and inside the exception handling code the MASTER macro is called again, leading to infinite recursion until the stack depth limit is hit.
So in this patch a WARNING is issued instead of FATAL, and then the process exits. Also, the ereport call is protected by signal masking, because SIGUSR1 may be delivered during it and the interruption would just be a waste of time.

master-macro-segfault.diff (829 bytes)
diff --git a/src/context/pool_query_context.c b/src/context/pool_query_context.c
index b82da454..0d9bb221 100644
--- a/src/context/pool_query_context.c
+++ b/src/context/pool_query_context.c
@@ -317,6 +317,20 @@ pool_virtual_master_db_node_id(void)
 		return REAL_MASTER_NODE_ID;
 	}
 
+	/*
+	 * Check whether failover is in progress. If so, just abort this session.
+	 */
+	if (Req_info->switching)
+	{
+		POOL_SETMASK(&BlockSig);
+		ereport(WARNING,
+				(errmsg("failover/failback is in progress"),
+						errdetail("executing failover or failback on backend"),
+				 errhint("In a moment you should be able to reconnect to the database")));
+		POOL_SETMASK(&UnBlockSig);
+		child_exit(POOL_EXIT_AND_RESTART);
+	}
+
 	if (sc->in_progress && sc->query_context)
 	{
 		int			node_id = sc->query_context->virtual_master_node_id;

Issue History

Date Modified Username Field Change
2019-03-26 21:03 nagata New Issue
2019-04-01 16:33 administrator Assigned To => t-ishii
2019-04-01 16:33 administrator Status new => assigned
2019-04-01 17:51 t-ishii File Added: failover-check.diff
2019-04-01 17:51 t-ishii Note Added: 0002495
2019-04-02 11:15 t-ishii File Added: master-macro-segfault.diff
2019-04-02 11:15 t-ishii Note Added: 0002497
2019-04-02 11:16 t-ishii Status assigned => feedback
2019-05-17 13:13 t-ishii Status feedback => resolved
2019-05-17 13:13 t-ishii Fixed in Version => 3.6.17
2019-05-21 16:11 administrator Status resolved => closed