[Pgpool-general] Issues with health checking in 2.5

Tatsuo Ishii t-ishii at sra.co.jp
Wed Feb 9 06:58:55 GMT 2005


> Hi,
> 
> I'm trying to use the wonderful new features of pgpool 2.5 but I have a few 
> issues.
> 
> I have the following settings in my configuration file:
> health_check_timeout = 5
> health_check_period = 60
> 
> So I would expect pgpool to check every 60 seconds, and consider that 
> things went wrong if the check did not go through within 5 seconds.
> 
> However, the pgpool log says (only relevant bits kept):
> 
> 2005-02-08 18:33:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:34:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:35:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:36:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:37:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:38:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:38:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:38:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:39:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:40:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:41:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:41:57 LOG: pid 93656: starting health checking
> 2005-02-08 18:42:57 LOG: pid 93656: starting health checking
> 2005-02-08 18:43:19 LOG: pid 93656: starting health checking
> 2005-02-08 18:43:19 ERROR: pid 93656: connect_inet_domain_socket: connect() 
> failed: Interrupted system call
> 2005-02-08 18:43:19 ERROR: pid 93656: health check failed. master 10.0.1.66 
> at port 5432 is down
> 2005-02-08 18:43:19 LOG: pid 93656: starting failover from 10.0.1.66(5432) 
> to 10.0.1.65(5432)
> 2005-02-08 18:43:19 LOG: pid 93656: failover from 10.0.1.66(5432) to 
> 10.0.1.65(5432) done.
> 
> First issue: after doing checks every minute at :19, it does a couple at 
> :57, then back at :19. Not too much of an issue for me, but I guess it 
> hints at some other problem.
> 
> Second issue: it thinks the master went down when this is not the case. And 
> since the error happens exactly at :19, it's not a matter of connect() 
> exceeding a 5-second timeout, obviously. I tried it another time with 
> debugging info enabled, and it happened again. Interestingly enough, it 
> seems to always happen on the first health check at the "right" time after 
> the two consecutive ones at the "wrong" time.
> 
> Apparently connect() is interrupted by another alarm; in my case I have 
> lots of "child life 300 seconds expired" right at that time, and no 
> distinction is made between that and a real timeout. I guess the 
> solution is probably to disable those other timeouts in the process doing 
> the health checking? Workarounds include disabling child_life_time or 
> (probably) setting it to a value which is not a multiple of 
> health_check_period.

The health checking is done in the pgpool parent process, and there are
no other alarms in that process (the child-life alarm is set in the
children). So my guess is that connect() is interrupted by SIGCHLD,
not SIGALRM. So the fix would be either 1) block SIGCHLD while
connect() is in progress, or 2) make sure that connect() was really
interrupted by SIGALRM before treating the health check as failed.
What do you think?
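
For 1), something like this might work (just an untested sketch, not the
actual pgpool code; identifiers and error handling are simplified):

/* Sketch for 1): block SIGCHLD for the duration of connect() so that
 * only the health check alarm (SIGALRM) can interrupt it. */
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>

static int connect_for_health_check(int fd, const struct sockaddr *addr, socklen_t len)
{
	sigset_t block, oldmask;
	int ret, saved_errno;

	sigemptyset(&block);
	sigaddset(&block, SIGCHLD);
	sigprocmask(SIG_BLOCK, &block, &oldmask);

	ret = connect(fd, addr, len);
	saved_errno = errno;

	/* restore the previous mask; any pending SIGCHLD is delivered
	 * here, after connect() has finished */
	sigprocmask(SIG_SETMASK, &oldmask, NULL);

	errno = saved_errno;
	return ret;   /* EINTR now really means the health check timed out */
}

With SIGCHLD blocked during connect(), an EINTR there can only come from
the health check alarm, which would also take care of 2).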

> Here is the log with debugging (this time no health checking at the "wrong" 
> times):
> 
> 2005-02-08 19:23:59 LOG: pid 95144: starting health checking
> 2005-02-08 19:24:59 LOG: pid 95144: starting health checking
> 2005-02-08 19:25:59 LOG: pid 95144: starting health checking
> 2005-02-08 19:26:59 LOG: pid 95144: starting health checking
> 2005-02-08 19:27:58 DEBUG: pid 95199: child life 300 seconds expired
> 2005-02-08 19:27:58 DEBUG: pid 95151: child life 300 seconds expired
> 2005-02-08 19:27:58 DEBUG: pid 95144: reap_handler called
> 2005-02-08 19:27:58 DEBUG: pid 95144: child 95151 exits with status 512 by 
> signal 0
> 2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95262
> 2005-02-08 19:27:58 LOG: pid 95144: starting health checking
> 2005-02-08 19:27:58 DEBUG: pid 95175: child life 300 seconds expired
> 2005-02-08 19:27:58 DEBUG: pid 95262: I am 95262
> 2005-02-08 19:27:58 DEBUG: pid 95200: child life 300 seconds expired
> 2005-02-08 19:27:58 DEBUG: pid 95144: reap_handler called
> 2005-02-08 19:27:58 DEBUG: pid 95144: child 95200 exits with status 512 by 
> signal 0
> 2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95263
> 2005-02-08 19:27:58 DEBUG: pid 95144: child 95199 exits with status 512 by 
> signal 0
> 2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95264
> 2005-02-08 19:27:58 DEBUG: pid 95144: child 95175 exits with status 512 by 
> signal 0
> 2005-02-08 19:27:58 DEBUG: pid 95264: I am 95264
> 2005-02-08 19:27:58 DEBUG: pid 95144: fork a new child pid 95265
> 2005-02-08 19:27:58 ERROR: pid 95144: connect_inet_domain_socket: connect() 
> failed: Interrupted system call
> 2005-02-08 19:27:58 DEBUG: pid 95263: I am 95263
> 2005-02-08 19:27:58 ERROR: pid 95144: health check failed. master 10.0.1.66 
> at port 5432 is down
> 2005-02-08 19:27:58 DEBUG: pid 95265: I am 95265
> 2005-02-08 19:27:58 DEBUG: pid 95144: failover_handler called
> 2005-02-08 19:27:58 LOG: pid 95144: starting failover from 10.0.1.66(5432) 
> to 10.0.1.65(5432)
> 2005-02-08 19:27:58 DEBUG: pid 95144: kill 95145
> 2005-02-08 19:27:58 DEBUG: pid 95144: kill 95146
> 2005-02-08 19:27:58 DEBUG: pid 95144: kill 95147
> 2005-02-08 19:27:58 DEBUG: pid 95144: kill 95148
> [... lots more kills and child life 300 seconds expired and I am xxx ...]
> 2005-02-08 19:27:59 DEBUG: pid 95307: I am 95307
> 2005-02-08 19:27:59 DEBUG: pid 95302: I am 95302
> 2005-02-08 19:27:59 DEBUG: pid 95304: I am 95304
> 2005-02-08 19:27:59 DEBUG: pid 95299: I am 95299
> 2005-02-08 19:27:59 LOG: pid 95144: failover from 10.0.1.66(5432) to 
> 10.0.1.65(5432) done.
> 2005-02-08 19:27:59 DEBUG: pid 95144: reap_handler called
> 2005-02-08 19:27:59 DEBUG: pid 95365: I am 95365
> 2005-02-08 19:27:59 DEBUG: pid 95315: I am 95315
> 2005-02-08 19:27:59 DEBUG: pid 95349: I am 95349
> 2005-02-08 19:27:59 DEBUG: pid 95303: I am 95303
> 
> It also seems that child_life_time is a bit harsh, and kills even children 
> that have never done anything... (the above log is from a totally idle 
> pgpool). I would think the timer should only start once a child has handled 
> at least one connection. Otherwise every 300 seconds all children die and 
> are respawned for no reason.

Right, this should be fixed.
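Something along these lines, perhaps (just a sketch; the identifiers are
made up, not the real pgpool ones):

/* Sketch: arm the child life timer only after the child has served at
 * least one connection, so idle children are not killed and respawned
 * for no reason. */
#include <unistd.h>

extern int child_life_time;          /* seconds; 0 disables the timer */
static int connections_served = 0;   /* incremented after each frontend connection */

static void maybe_arm_child_life_timer(void)
{
	if (child_life_time > 0 && connections_served > 0)
		alarm(child_life_time);   /* start counting only once the child has been used */
	else
		alarm(0);                 /* never-used child: no expiration */
}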

> Also, please find below a trivial patch to make sure all debugging/logging 
> is done "atomically" line by line; otherwise it gets hard to read with all 
> the processes "talking" at the same time. It adds a dependency on asprintf, 
> though. I haven't yet delved into the pgpool source to fix the other issues :-(

Thanks!
--
Tatsuo Ishii

> Thanks,
> 
> Jacques.
> 
> diff -u pool_error.c.orig pool_error.c
> --- pool_error.c.orig   Tue Feb  8 19:18:11 2005
> +++ pool_error.c        Tue Feb  8 19:22:33 2005
> @@ -38,49 +38,61 @@
>   void pool_error(const char *fmt,...)
>   {
>          va_list         ap;
> +       char            *fmt2;
> 
>          if (pool_config.print_timestamp)
> -         fprintf(stderr, "%s ERROR: pid %d: ", nowsec(), (int)getpid());
> +         asprintf(&fmt2, "%s ERROR: pid %d: %s\n", nowsec(), (int)getpid(), fmt);
>          else
> -         fprintf(stderr, "ERROR: pid %d: ", (int)getpid());
> +         asprintf(&fmt2, "ERROR: pid %d: %s\n", (int)getpid(), fmt);
> 
> -       va_start(ap, fmt);
> -       vfprintf(stderr, fmt, ap);
> -       va_end(ap);
> -       fprintf(stderr, "\n");
> +       if (fmt2)
> +       {
> +               va_start(ap, fmt);
> +               vfprintf(stderr, fmt2, ap);
> +               va_end(ap);
> +               fflush(stderr);
> +       }
>   }
> 
>   void pool_debug(const char *fmt,...)
>   {
>          va_list         ap;
> +       char            *fmt2;
> 
>          if (!debug)
>                  return;
> 
>          if (pool_config.print_timestamp)
> -         fprintf(stderr, "%s DEBUG: pid %d: ", nowsec(), (int)getpid());
> +         asprintf(&fmt2, "%s DEBUG: pid %d: %s\n", nowsec(), (int)getpid(), fmt);
>          else
> -         fprintf(stderr, "DEBUG: pid %d: ", (int)getpid());
> +         asprintf(&fmt2, "DEBUG: pid %d: %s\n", (int)getpid(), fmt);
> 
> -       va_start(ap, fmt);
> -       vfprintf(stderr, fmt, ap);
> -       va_end(ap);
> -       fprintf(stderr, "\n");
> +       if (fmt2)
> +       {
> +               va_start(ap, fmt);
> +               vfprintf(stderr, fmt2, ap);
> +               va_end(ap);
> +               fflush(stderr);
> +       }
>   }
> 
>   void pool_log(const char *fmt,...)
>   {
>          va_list         ap;
> +       char            *fmt2;
> 
>          if (pool_config.print_timestamp)
> -         fprintf(stderr, "%s LOG: pid %d: ", nowsec(), (int)getpid());
> +         asprintf(&fmt2, "%s LOG: pid %d: %s\n", nowsec(), (int)getpid(), fmt);
>          else
> -         fprintf(stderr, "LOG: pid %d: ", (int)getpid());
> +         asprintf(&fmt2, "LOG: pid %d: %s\n", (int)getpid(), fmt);
> 
> -       va_start(ap, fmt);
> -       vfprintf(stderr, fmt, ap);
> -       va_end(ap);
> -       fprintf(stderr, "\n");
> +       if (fmt2)
> +       {
> +               va_start(ap, fmt);
> +               vfprintf(stderr, fmt2, ap);
> +               va_end(ap);
> +               fflush(stderr);
> +       }
>   }
> 
>   static char *nowsec(void)
> 
> 
> _______________________________________________
> Pgpool-general mailing list
> Pgpool-general at pgfoundry.org
> http://pgfoundry.org/mailman/listinfo/pgpool-general
> 

