View Issue Details

ID: 0000407
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2018-06-25 11:19
Reporter: nagata
Assigned To:
Priority: normal
Severity: minor
Reproducibility: sometimes
Status: resolved
Resolution: open
Product Version: 3.7.4
Target Version: 3.7.5
Fixed in Version:
Summary: 0000407: health check process doesn't start in some cases
Description:
case 1)
When pgpool reads a pgpool_status file indicating that a node is down at startup, the health check process corresponding to that node doesn't start.

case 2)
When watchdog is enabled and the standby pgpool, at startup, receives backend status information from the master pgpool indicating that a node is down, the health check process corresponding to that node doesn't start.

case 3)
When a health check process whose corresponding backend is down is killed, it is never restarted.

==

Currently, a health check process is forked only if the corresponding backend's status is CON_WAIT or CON_UP. Usually (when there is no pgpool_status file or the -D option is used), this condition is true before the health checks are started, so all health check processes are started. However, in the cases mentioned above, the condition is false for a certain backend and the corresponding health check process never starts.

To fix this, we just have to start all health check processes regardless of the backend status, as in the usual cases. In addition, we can also ignore the "!switching" condition in reaper(); otherwise a health check process killed during failover is never restarted. I think this is safe because the !switching condition is needed only for child processes that may be restarted (or marked to be restarted) in failover().

Another possible design is to restart the health check process when the corresponding backend is attached or recovered. However, this is more complex than the above, and I think the simple solution is better.
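For illustration, a minimal sketch of that alternative design (hypothetical, not part of the attached patch) could look like the following, assuming it would run on the attach/recovery path and that health_check_pids[i] is 0 when no worker is running for node i:

	/* Hypothetical sketch: after a node is attached or recovered, start
	 * its health check worker if it is not already running.  Reuses
	 * worker_fork_a_child() exactly as the existing fork loop does. */
	for (i = 0; i < NUM_BACKENDS; i++)
	{
		if (VALID_BACKEND(i) && health_check_pids[i] == 0)
			health_check_pids[i] = worker_fork_a_child(PT_HEALTH_CHECK,
			                                           do_health_check_child, &i);
	}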

A patch implementing the former approach is attached.
Steps To Reproduce:
You can always reproduce cases 1 and 3.

You can reproduce case 2 by starting two pgpool nodes with an interval between them after shutting down one of the backend nodes, but it may be hard to reproduce because it depends on timing.
Tags: No tags attached.

Activities

nagata

2018-06-20 19:46

developer  

fix_do_health_check_child.patch (914 bytes)
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index 0efe4be..9769df0 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -412,10 +412,7 @@ int PgpoolMain(bool discard_status, bool clear_memcache_oidmaps)
 
 	/* Fork health check process */
 	for (i=0;i<NUM_BACKENDS;i++)
-	{
-		if (VALID_BACKEND(i))
-			health_check_pids[i] = worker_fork_a_child(PT_HEALTH_CHECK, do_health_check_child, &i);
-	}
+		health_check_pids[i] = worker_fork_a_child(PT_HEALTH_CHECK, do_health_check_child, &i);
 
 	if (sigsetjmp(local_sigjmp_buf, 1) != 0)
 	{
@@ -2455,10 +2452,8 @@ static void reaper(void)
 					found = true;
 
 					/* Fork new health check worker */
-					if (!switching && !exiting && VALID_BACKEND(i))
-					{
+					if (!exiting)
 						health_check_pids[i] = worker_fork_a_child(PT_HEALTH_CHECK, do_health_check_child, &i);
-					}
 					else
 						health_check_pids[i] = 0;
 				}

pengbo

2018-06-21 09:22

developer   ~0002053

Thank you for your patch.
I will look into it.

t-ishii

2018-06-21 12:02

developer   ~0002054

Last edited: 2018-06-21 12:05


> case 1)
> When pgpool reads pgpool_status file indicating that a node is down at starting, the health check process corresponding to the node doesn't start.
I don't see this as a problem; rather, it's expected behavior. The health check process never checks a node that is already in down status, so there's no point in starting a health check process for such a node.

> case 3)
> When a health check process whose corresponding backend is down is killed, this is never restarted.

Ditto as case 1.

I guess the proper fix would be starting the health check process when the backend node is re-attached.

t-ishii

2018-06-21 13:03

developer  

start_health_check_on_failback.diff (766 bytes)
diff --git a/src/main/pgpool_main.c b/src/main/pgpool_main.c
index 619ae7b..08e60c4 100644
--- a/src/main/pgpool_main.c
+++ b/src/main/pgpool_main.c
@@ -2094,6 +2094,19 @@ static void failover(void)
 					 BACKEND_INFO(node_id).backend_hostname,
 					 BACKEND_INFO(node_id).backend_port)));
 
+			/* Fork health check process if needed */
+			for (i=0;i<NUM_BACKENDS;i++)
+			{
+				if (health_check_pids[i] == 0)
+				{
+					ereport(LOG,
+							(errmsg("start health check process for host %s(%d)",
+									BACKEND_INFO(node_id).backend_hostname,
+									BACKEND_INFO(node_id).backend_port)));
+
+					health_check_pids[i] = worker_fork_a_child(PT_HEALTH_CHECK, do_health_check_child, &i);
+				}
+			}
 		}
 		else if (reqkind == PROMOTE_NODE_REQUEST)
 		{

t-ishii

2018-06-21 13:03

developer   ~0002056

Attached is the patch to start the health check process when the backend node is re-attached.

nagata

2018-06-21 14:18

developer   ~0002060

OK. I think this is a simple fix that will resolve all the problems I mentioned, because the health check process is restarted after pcp_attach_node.

By the way, do we need to kill the health check process when a backend node is degenerated? As you said, there is no point in running a health check process for a node that is down, although it is harmless even if we leave it alone.
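For illustration, stopping the worker on degeneration might look something like the hypothetical sketch below (not part of any attached patch; it assumes SIGTERM is an acceptable way to terminate a health check worker and that the pid bookkeeping in reaper() would be checked accordingly):

	/* Hypothetical sketch (not applied): stop the health check worker of
	 * a node that has just been degenerated. */
	if (health_check_pids[node_id] > 0)
	{
		kill(health_check_pids[node_id], SIGTERM);
		health_check_pids[node_id] = 0;
	}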

t-ishii

2018-06-21 14:23

developer   ~0002061

I think we can safely leave it running when a node is degenerated. Of course it wastes a few CPU cycles, but I think the effort to eliminate it is not worth the trouble.

nagata

2018-06-21 14:38

developer   ~0002062

OK. Agreed.

Thanks.

t-ishii

2018-06-22 17:32

developer   ~0002070

Fix pushed. BTW, Pgpool-II 3.7.5 does not exist yet.

t-ishii

2018-06-25 11:18

developer   ~0002072

> BTW, Pgpool-II 3.7.5 does not exist yet.
Description fixed.

Issue History

Date Modified Username Field Change
2018-06-20 19:46 nagata New Issue
2018-06-20 19:46 nagata File Added: fix_do_health_check_child.patch
2018-06-21 09:22 pengbo Note Added: 0002053
2018-06-21 12:02 t-ishii Note Added: 0002054
2018-06-21 12:02 t-ishii Note Edited: 0002054
2018-06-21 12:05 t-ishii Note Edited: 0002054
2018-06-21 13:03 t-ishii File Added: start_health_check_on_failback.diff
2018-06-21 13:03 t-ishii Note Added: 0002056
2018-06-21 14:18 nagata Note Added: 0002060
2018-06-21 14:23 t-ishii Note Added: 0002061
2018-06-21 14:38 nagata Note Added: 0002062
2018-06-22 17:32 t-ishii Note Added: 0002070
2018-06-25 11:18 t-ishii Product Version 3.7.5 => 3.7.4
2018-06-25 11:18 t-ishii Target Version => 3.7.5
2018-06-25 11:18 t-ishii Note Added: 0002072
2018-06-25 11:19 t-ishii Status new => resolved