[pgpool-general: 6056] Re: "health check timer expired" on local machine

Tatsuo Ishii ishii at sraoss.co.jp
Fri Apr 27 09:40:09 JST 2018


> 2018-04-26 20:38:10.225 CEST [23537] [unknown]@[unknown] LOG:  could not
> accept SSL connection: EOF detected
> 2018-04-26 20:59:34.856 CEST [27744] LOG:  trigger file found:
> /var/lib/postgresql/9.6/main/trigger
> 2018-04-26 20:59:34.856 CEST [27746] FATAL:  terminating walreceiver
> process due to administrator command
> 2018-04-26 20:59:34.857 CEST [27744] LOG:  invalid record length at
> 3/2133FD18: wanted 24, got 0
> 2018-04-26 20:59:34.857 CEST [27744] LOG:  redo done at 3/2133FCF0
> 2018-04-26 20:59:34.857 CEST [27744] LOG:  last completed transaction was
> at log time 2018-04-26 20:59:29.852716+02
> 2018-04-26 20:59:34.873 CEST [27744] LOG:  selected new timeline ID: 94
> 2018-04-26 20:59:34.994 CEST [27744] LOG:  archive recovery complete
> 2018-04-26 20:59:35.025 CEST [27744] LOG:  MultiXact member wraparound
> protections are now enabled
> 2018-04-26 20:59:35.034 CEST [25506] LOG:  autovacuum launcher started
> 2018-04-26 20:59:35.034 CEST [27743] LOG:  database system is ready to
> accept connections
> 
>> 2018-04-26 20:59:34.856 CEST [27744] LOG:  trigger file found:
> /var/lib/postgresql/9.6/main/trigger
> -> On this line I assume it is the standby that is talking, because there
> is no /var/lib/postgresql/9.6/main directory on the master; the data are
> mounted somewhere else. The failover process started at 20:59:29 on
> pgpool, and the standby got promoted.

Yes, that's my understanding too. So I assume there was no log emitted on
the master around 2018-04-26 20:59:34.856 CEST.

>> 2018-04-26 20:38:10.225 CEST [23537] [unknown]@[unknown] LOG:  could not
> accept SSL connection: EOF detected
> This could be the odd one out. But it happened 20 minutes before the bug,
> and it has not much to do with the health check process.

No idea for this part.

> No more relevant things in the Postgres logs

Ok.

>> there's no reason for the health check process to not accept 127.0.0.1.
> 
> Like I said, the health check process reaches PostgreSQL through the
> public IP. So it goes through a different interface.

Still I don't understand. Pgpool-II and the PostgreSQL master are on the
same machine, which means you could set backend_hostname0 = '127.0.0.1',
but apparently you preferred not to do it that way. The health check
process just uses the same hostname/IP given in backend_hostname0.
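
For example, a minimal pgpool.conf excerpt along those lines could look
like this (the port number and the millisecond value of connect_timeout
are assumptions on my side, matching the timeouts discussed here):

# hypothetical excerpt, assuming Pgpool-II runs on the same host as the
# PostgreSQL master
backend_hostname0 = '127.0.0.1'   # loopback instead of the public IP
backend_port0 = 5432              # assumed default PostgreSQL port
health_check_timeout = 6          # seconds, as in your setup
connect_timeout = 10000           # milliseconds (10 seconds)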

> At this time PostgreSQL was receiving ~5 inserts / second and that's all.
> No errors were detected on the apps.

Yeah, no big load.

> So the only reason I could find is a problem with the public interface of
> this server, but this is really, really unusual as it comes from a
> dedicated server provider.

From the error message of the health check process:
> 2018-04-26 20:59:29: pid 2153:LOG:  failed to connect to PostgreSQL server
> on "x.x.x.x:xxx" using INET socket
> 2018-04-26 20:59:29: pid 2153:DETAIL:  health check timer expired
> 2018-04-26 20:59:29: pid 2153:ERROR:  failed to make persistent db

The Pgpool-II health check process uses a non-blocking socket for
connecting to PostgreSQL. After issuing the connect system call, it waits
for its completion using the select system call with a timeout of
connect_timeout from pgpool.conf (in your case 10 seconds). On the other
hand, health_check_timeout is 6 seconds. So after 6 seconds an alarm
interrupted the select system call, it returned with errno == EINTR, and
the log message was emitted. I am not sure why the connect system call did
not respond within 6 seconds.
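
For reference, below is a minimal C sketch of that pattern (not the actual
Pgpool-II source): a non-blocking connect() followed by select() with a
timeout, interrupted by an alarm() set to health_check_timeout. The target
address and port are placeholders.

/* Minimal sketch of the pattern described above: a non-blocking connect()
 * followed by select() with a timeout, interrupted by an alarm signal. */
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/select.h>
#include <netinet/in.h>
#include <arpa/inet.h>

static void alarm_handler(int sig)
{
    /* Nothing to do; signal delivery alone interrupts select(). */
    (void) sig;
}

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    /* Put the socket into non-blocking mode so connect() returns at once. */
    fcntl(fd, F_SETFL, O_NONBLOCK);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(5432);                       /* assumed port */
    inet_pton(AF_INET, "192.0.2.10", &addr.sin_addr);  /* placeholder IP */

    /* health_check_timeout: raise SIGALRM after 6 seconds. */
    signal(SIGALRM, alarm_handler);
    alarm(6);

    if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0 &&
        errno == EINPROGRESS)
    {
        fd_set wfds;
        FD_ZERO(&wfds);
        FD_SET(fd, &wfds);

        /* connect_timeout: wait up to 10 seconds for connect completion. */
        struct timeval tv = {10, 0};

        if (select(fd + 1, NULL, &wfds, NULL, &tv) < 0 && errno == EINTR)
            fprintf(stderr, "health check timer expired\n");
    }

    alarm(0);
    close(fd);
    return 0;
}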

That's all I know from the logs.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

