[pgpool-general: 6056] Re: "health check timer expired" on local machine
Tatsuo Ishii
ishii at sraoss.co.jp
Fri Apr 27 09:40:09 JST 2018
> 2018-04-26 20:38:10.225 CEST [23537] [unknown]@[unknown] LOG: could not
> accept SSL connection: EOF detected
> 2018-04-26 20:59:34.856 CEST [27744] LOG: trigger file found:
> /var/lib/postgresql/9.6/main/trigger
> 2018-04-26 20:59:34.856 CEST [27746] FATAL: terminating walreceiver
> process due to administrator command
> 2018-04-26 20:59:34.857 CEST [27744] LOG: invalid record length at
> 3/2133FD18: wanted 24, got 0
> 2018-04-26 20:59:34.857 CEST [27744] LOG: redo done at 3/2133FCF0
> 2018-04-26 20:59:34.857 CEST [27744] LOG: last completed transaction was
> at log time 2018-04-26 20:59:29.852716+02
> 2018-04-26 20:59:34.873 CEST [27744] LOG: selected new timeline ID: 94
> 2018-04-26 20:59:34.994 CEST [27744] LOG: archive recovery complete
> 2018-04-26 20:59:35.025 CEST [27744] LOG: MultiXact member wraparound
> protections are now enabled
> 2018-04-26 20:59:35.034 CEST [25506] LOG: autovacuum launcher started
> 2018-04-26 20:59:35.034 CEST [27743] LOG: database system is ready to
> accept connections
>
>> 2018-04-26 20:59:34.856 CEST [27744] LOG: trigger file found:
> /var/lib/postgresql/9.6/main/trigger
> -> On this line I assume it is the standby that is talking, because there
> is no /var/lib/postgresql/9.6/main directory on the master; the data are
> mounted somewhere else. The failover process started at 20:59:29 on
> pgpool, and the standby got promoted.
Yes, that's my understanding too. So I assume there are no logs emitted
on the master around 2018-04-26 20:59:34.856 CEST.
>> 2018-04-26 20:38:10.225 CEST [23537] [unknown]@[unknown] LOG: could not
> accept SSL connection: EOF detected
> This could be the odd one out. But it happened 20 minutes before the bug,
> and it has little to do with the health check process.
No idea for this part.
> No more relevant entries in the Postgres logs
Ok.
>> there's no reason for the health check process to not accept 127.0.0.1.
>
> Like I said, the health check process reaches PostgreSQL through the
> public IP. So it goes through a different interface.
Still I don't understand. Pgpool-II and the PostgreSQL master are on the
same machine, which means you could set backend_hostname0 =
'127.0.0.1'. But apparently you preferred not to do it that way. The
health check process simply uses the same hostname/IP as
backend_hostname0.
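To illustrate the suggestion above, here is a hypothetical pgpool.conf fragment; the hostname/port values are assumptions for this setup, while the timeout values are the ones mentioned in this thread (note that connect_timeout is in milliseconds in pgpool.conf):

```
# Hypothetical fragment: with Pgpool-II and the PostgreSQL master on the
# same machine, the backend can be addressed over loopback, so the health
# check does not depend on the public interface.
backend_hostname0 = '127.0.0.1'
backend_port0 = 5432

health_check_period  = 10       # assumed value
health_check_timeout = 6        # 6 seconds, as in this thread
connect_timeout      = 10000    # 10 seconds (milliseconds in pgpool.conf)
```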
> At this time PostgreSQL was receiving ~5 inserts/second and that's all.
> No errors detected on the application side.
Yeah, no big load.
> So the only reason I could find is a problem on the public interface of
> this server, but this would be really unusual as it comes from a
> dedicated server provider.
From the error message of the health check process:
> 2018-04-26 20:59:29: pid 2153:LOG: failed to connect to PostgreSQL server
> on "x.x.x.x:xxx" using INET socket
> 2018-04-26 20:59:29: pid 2153:DETAIL: health check timer expired
> 2018-04-26 20:59:29: pid 2153:ERROR: failed to make persistent db
Pgpool-II's health check process uses a non-blocking socket to connect
to PostgreSQL. After issuing the connect() system call, it waits for
completion using the select() system call with a timeout of
connect_timeout from pgpool.conf (in your case 10 seconds). On the
other hand, health_check_timeout is 6 seconds. So after 6 seconds an
alarm interrupted the select() call, which returned with errno ==
EINTR, and the log message was emitted. I am not sure why the connect()
call did not complete within 6 seconds.
That's all I can tell from the log.
Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp