[Pgpool-general] Issues with health checking in 2.5

Jacques Caron jc at directinfos.com
Tue Feb 8 18:42:18 GMT 2005


Hi,

I'm trying to use the wonderful new features of pgpool 2.5, but I have a few 
issues.

I have the following settings in my configuration file:
health_check_timeout = 5
health_check_period = 60

So I would expect pgpool to check every 60 seconds, and to consider that 
things went wrong if a check did not go through within 5 seconds.

However, the pgpool log says (only relevant bits kept):

2005-02-08 18:33:19 LOG: pid 93656: starting health checking
2005-02-08 18:34:19 LOG: pid 93656: starting health checking
2005-02-08 18:35:19 LOG: pid 93656: starting health checking
2005-02-08 18:36:19 LOG: pid 93656: starting health checking
2005-02-08 18:37:19 LOG: pid 93656: starting health checking
2005-02-08 18:38:19 LOG: pid 93656: starting health checking
2005-02-08 18:38:19 LOG: pid 93656: starting health checking
2005-02-08 18:38:19 LOG: pid 93656: starting health checking
2005-02-08 18:39:19 LOG: pid 93656: starting health checking
2005-02-08 18:40:19 LOG: pid 93656: starting health checking
2005-02-08 18:41:19 LOG: pid 93656: starting health checking
2005-02-08 18:41:57 LOG: pid 93656: starting health checking
2005-02-08 18:42:57 LOG: pid 93656: starting health checking
2005-02-08 18:43:19 LOG: pid 93656: starting health checking
2005-02-08 18:43:19 ERROR: pid 93656: connect_inet_domain_socket: connect() 
failed: Interrupted system call
2005-02-08 18:43:19 ERROR: pid 93656: health check failed. master 10.0.1.66 
at port 5432 is down
2005-02-08 18:43:19 LOG: pid 93656: starting failover from 10.0.1.66(5432) 
to 10.0.1.65(5432)
2005-02-08 18:43:19 LOG: pid 93656: failover from 10.0.1.66(5432) to 
10.0.1.65(5432) done.

First issue: after doing checks every minute at :19, it does a couple at 
:57, then goes back to :19. Not too much of an issue for me, but I guess it 
hides some other problem.

Second issue: it thinks the master went down when that is not the case. And 
since the error happens exactly at :19, it's obviously not a matter of 
connect() exceeding the 5-second timeout. I tried again with debugging 
enabled, and it happened again. Interestingly enough, it always seems to 
happen on the first health check at the "right" time after the two 
consecutive ones at the "wrong" time.

Apparently connect() is interrupted by another signal: right at that moment 
I get lots of "child life 300 seconds expired" messages (and the 
corresponding reap_handler calls in the parent), and no distinction is made 
between such an interruption and a real health_check_timeout. I guess the 
solution is probably to shield the process doing the health checking from 
those other timeouts, or to retry connect() when the interruption was not 
the health-check alarm? Workarounds include disabling child_life_time or 
(probably) setting it to a value that is not a multiple of 
health_check_period.
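
For illustration, something along these lines is what I have in mind (just a 
sketch, not code from the pgpool source; do_health_check is a placeholder 
for the real connect-and-check logic): block SIGCHLD around the health check 
so that an EINTR there can only mean the health_check_timeout alarm really 
fired. Retrying connect() on EINTR when that alarm has not fired would be 
another option.

#include <signal.h>
#include <stddef.h>

/*
 * Sketch only: run the health check with SIGCHLD blocked, so the only
 * signal that can interrupt its connect() is the SIGALRM armed for
 * health_check_timeout.  do_health_check stands in for the real
 * connect-and-check logic and is not a pgpool identifier.
 */
int health_check_shielded(int (*do_health_check)(void))
{
    sigset_t block, saved;
    int ret;

    sigemptyset(&block);
    sigaddset(&block, SIGCHLD);
    sigprocmask(SIG_BLOCK, &block, &saved);   /* hold SIGCHLD from dying children */

    ret = do_health_check();                  /* an EINTR here means the alarm really fired */

    sigprocmask(SIG_SETMASK, &saved, NULL);   /* restore mask; pending SIGCHLD delivered now */
    return ret;
}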

Here is the log with debugging enabled (this time there was no health 
checking at the "wrong" times):

2005-02-08 19:23:59 LOG: pid 95144: starting health checking
2005-02-08 19:24:59 LOG: pid 95144: starting health checking
2005-02-08 19:25:59 LOG: pid 95144: starting health checking
2005-02-08 19:26:59 LOG: pid 95144: starting health checking
2005-02-08 19:27:58 DEBUG: pid 95199: child life 300 seconds expired
2005-02-08 19:27:58 DEBUG: pid 95151: child life 300 seconds expired
2005-02-08 19:27:58 DEBUG: pid 95144: reap_handler called
2005-02-08 19:27:58 DEBUG: pid 95144: child 95151 exits with status 512 by 
signal 0
2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95262
2005-02-08 19:27:58 LOG: pid 95144: starting health checking
2005-02-08 19:27:58 DEBUG: pid 95175: child life 300 seconds expired
2005-02-08 19:27:58 DEBUG: pid 95262: I am 95262
2005-02-08 19:27:58 DEBUG: pid 95200: child life 300 seconds expired
2005-02-08 19:27:58 DEBUG: pid 95144: reap_handler called
2005-02-08 19:27:58 DEBUG: pid 95144: child 95200 exits with status 512 by 
signal 0
2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95263
2005-02-08 19:27:58 DEBUG: pid 95144: child 95199 exits with status 512 by 
signal 0
2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95264
2005-02-08 19:27:58 DEBUG: pid 95144: child 95175 exits with status 512 by 
signal 0
2005-02-08 19:27:58 DEBUG: pid 95264: I am 95264
2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95265
2005-02-08 19:27:58 ERROR: pid 95144: connect_inet_domain_socket: connect() 
failed: Interrupted system call
2005-02-08 19:27:58 DEBUG: pid 95263: I am 95263
2005-02-08 19:27:58 ERROR: pid 95144: health check failed. master 10.0.1.66 
at port 5432 is down
2005-02-08 19:27:58 DEBUG: pid 95265: I am 95265
2005-02-08 19:27:58 DEBUG: pid 95144: failover_handler called
2005-02-08 19:27:58 LOG: pid 95144: starting failover from 10.0.1.66(5432) 
to 10.0.1.65(5432)
2005-02-08 19:27:58 DEBUG: pid 95144: kill 95145
2005-02-08 19:27:58 DEBUG: pid 95144: kill 95146
2005-02-08 19:27:58 DEBUG: pid 95144: kill 95147
2005-02-08 19:27:58 DEBUG: pid 95144: kill 95148
[... lots more kills and child life 300 seconds expired and I am xxx ...]
2005-02-08 19:27:59 DEBUG: pid 95307: I am 95307
2005-02-08 19:27:59 DEBUG: pid 95302: I am 95302
2005-02-08 19:27:59 DEBUG: pid 95304: I am 95304
2005-02-08 19:27:59 DEBUG: pid 95299: I am 95299
2005-02-08 19:27:59 LOG: pid 95144: failover from 10.0.1.66(5432) to 
10.0.1.65(5432) done.
2005-02-08 19:27:59 DEBUG: pid 95144: reap_handler called
2005-02-08 19:27:59 DEBUG: pid 95365: I am 95365
2005-02-08 19:27:59 DEBUG: pid 95315: I am 95315
2005-02-08 19:27:59 DEBUG: pid 95349: I am 95349
2005-02-08 19:27:59 DEBUG: pid 95303: I am 95303

It also seems that child_life_time is a bit harsh: it kills even children 
that have never done anything (the above log is from a totally idle pgpool). 
I would think the timer should only start once a child has handled at least 
one connection. Otherwise every 300 seconds all the children die and are 
respawned for no reason.
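
To make the idea concrete, here is a sketch of the child loop I have in mind 
(hypothetical names, not the actual pgpool code): the child_life_time alarm 
is only armed once the child has served at least one connection, so the 
children of an idle pool are left alone.

#include <unistd.h>

void handle_connection(int listen_fd);   /* placeholder: accept and serve one client */

/*
 * Sketch only: arm the child_life_time alarm only after the child has
 * handled at least one connection, so completely idle children are not
 * killed and respawned every child_life_time seconds.
 */
void child_main_loop(int listen_fd, unsigned int child_life_time)
{
    int served_one = 0;

    for (;;)
    {
        /* arm the idle timer only once the child has done some real work */
        if (served_one && child_life_time > 0)
            alarm(child_life_time);   /* SIGALRM handler would make the child exit */

        handle_connection(listen_fd); /* blocks until a client shows up, then serves it */

        alarm(0);                     /* a client was served: cancel the idle timer */
        served_one = 1;
    }
}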

Also, please find below a trivial patch to make sure all debugging/logging 
is done "atomically", one line at a time; otherwise the output gets hard to 
read with all the processes "talking" at the same time. It adds a dependency 
on asprintf, though. I haven't yet delved into the pgpool source to fix the 
other issues :-(

Thanks,

Jacques.

 > diff -u pool_error.c.orig pool_error.c
--- pool_error.c.orig   Tue Feb  8 19:18:11 2005
+++ pool_error.c        Tue Feb  8 19:22:33 2005
@@ -38,49 +38,64 @@
  void pool_error(const char *fmt,...)
  {
         va_list         ap;
+       char            *fmt2;

         if (pool_config.print_timestamp)
-         fprintf(stderr, "%s ERROR: pid %d: ", nowsec(), (int)getpid());
+         asprintf(&fmt2, "%s ERROR: pid %d: %s\n", nowsec(), (int)getpid(), fmt);
         else
-         fprintf(stderr, "ERROR: pid %d: ", (int)getpid());
+         asprintf(&fmt2, "ERROR: pid %d: %s\n", (int)getpid(), fmt);

-       va_start(ap, fmt);
-       vfprintf(stderr, fmt, ap);
-       va_end(ap);
-       fprintf(stderr, "\n");
+       if (fmt2)
+       {
+               va_start(ap, fmt);
+               vfprintf(stderr, fmt2, ap);
+               va_end(ap);
+               fflush(stderr);
+               free(fmt2);
+       }
  }

  void pool_debug(const char *fmt,...)
  {
         va_list         ap;
+       char            *fmt2;

         if (!debug)
                 return;

         if (pool_config.print_timestamp)
-         fprintf(stderr, "%s DEBUG: pid %d: ", nowsec(), (int)getpid());
+         asprintf(&fmt2, "%s DEBUG: pid %d: %s\n", nowsec(), (int)getpid(), fmt);
         else
-         fprintf(stderr, "DEBUG: pid %d: ", (int)getpid());
+         asprintf(&fmt2, "DEBUG: pid %d: %s\n", (int)getpid(), fmt);

-       va_start(ap, fmt);
-       vfprintf(stderr, fmt, ap);
-       va_end(ap);
-       fprintf(stderr, "\n");
+       if (fmt2)
+       {
+               va_start(ap, fmt);
+               vfprintf(stderr, fmt2, ap);
+               va_end(ap);
+               fflush(stderr);
+               free(fmt2);
+       }
  }

  void pool_log(const char *fmt,...)
  {
         va_list         ap;
+       char            *fmt2;

         if (pool_config.print_timestamp)
-         fprintf(stderr, "%s LOG: pid %d: ", nowsec(), (int)getpid());
+         asprintf(&fmt2, "%s LOG: pid %d: %s\n", nowsec(), (int)getpid(), fmt);
         else
-         fprintf(stderr, "LOG: pid %d: ", (int)getpid());
+         asprintf(&fmt2, "LOG: pid %d: %s\n", (int)getpid(), fmt);

-       va_start(ap, fmt);
-       vfprintf(stderr, fmt, ap);
-       va_end(ap);
-       fprintf(stderr, "\n");
+       if (fmt2)
+       {
+               va_start(ap, fmt);
+               vfprintf(stderr, fmt2, ap);
+               va_end(ap);
+               fflush(stderr);
+               free(fmt2);
+       }
  }

  static char *nowsec(void)



