[pgpool-general: 6058] Re: pgpool-general Digest, Vol 78, Issue 19

Fri Apr 27 17:42:06 JST 2018

Thanks for your support :)

> Still I don't understand. Pgpool-II and PostgreSQL master are on the same
> machine, that means you could set like "backend_hostname0 = "127.0.0.1".

Because I need the public address so that pgpool_recovery() can run online
recovery from remote nodes. Pgpool-II, like the health check process, uses
backend_hostname0 for that.

A health_check_hostname0 setting doesn't exist, though, so this is not a
workaround.

So according to the log, is the timeout error triggered by this
"health_check_timeout = 6" or this "connect_timeout = 10000" ?

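Going by the explanation quoted further down (the wait on connect is bounded by
connect_timeout, while an alarm armed for health_check_timeout can interrupt it
earlier), the shorter of the two should be the one that fires. Here is a small
Python sketch of that reasoning only. The function name and the returned labels
are made up for illustration, it is not Pgpool-II code, and special values such
as 0 are not modeled.

```python
def first_timer_to_fire(health_check_timeout_s, connect_timeout_ms):
    """Return (name, seconds) of the timer that expires first on a hung connect.

    Assumption taken from the quoted explanation, not from the Pgpool-II
    source: select() waits up to connect_timeout (milliseconds), and an
    alarm armed for health_check_timeout (seconds) can interrupt that wait.
    """
    connect_timeout_s = connect_timeout_ms / 1000.0
    if health_check_timeout_s < connect_timeout_s:
        return "health_check_timeout", health_check_timeout_s
    return "connect_timeout", connect_timeout_s


# Values from my pgpool.conf at the time of the failover.
print(first_timer_to_fire(6, 10000))  # -> ('health_check_timeout', 6)
```

If that reading is right, the "health check timer expired" line comes from
health_check_timeout = 6, because 6 seconds is shorter than 10000 ms.
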
I lowered both timeouts to 2 seconds and am monitoring network packets to find
some details. I'll keep you posted.
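
Next to the packet capture, a small probe that times plain TCP connects to the
backend address could show whether the public interface stalls independently of
Pgpool-II. This is only a sketch in Python: time_connect is a hypothetical
helper, and the host, port and timeout you pass are whatever matches
backend_hostname0 and the backend port in your own pgpool.conf.

```python
import socket
import time


def time_connect(host, port, timeout_s=2.0):
    """Try one TCP connect; return (ok, elapsed_seconds, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


# Example use (not executed here): call it in a loop with the public address
# and log every attempt that fails or takes unusually long, e.g.
#   ok, elapsed, err = time_connect(public_ip, backend_port, 2.0)
```

A failed attempt with an elapsed time close to the timeout would point at the
connect itself hanging, which is what the health check log suggests.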

2018-04-27 5:00 GMT+02:00 <pgpool-general-request at pgpool.net>:

> Today's Topics:
>
>    1. [pgpool-general: 6056] Re: "health check timer expired" on
>       local machine (Tatsuo Ishii)
>    2. [pgpool-general: 6057] Re: "health check timer expired" on
>       local machine (Tatsuo Ishii)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Fri, 27 Apr 2018 09:40:09 +0900 (JST)
> From: Tatsuo Ishii <ishii at sraoss.co.jp>
> To: psyckow.prod at gmail.com
> Cc: pgpool-general at pgpool.net
> Subject: [pgpool-general: 6056] Re: "health check timer expired" on
>         local machine
> Message-ID: <20180427.094009.1280111065989297836.t-ishii at sraoss.co.jp>
> Content-Type: Text/Plain; charset=us-ascii
>
> > 2018-04-26 20:38:10.225 CEST [23537] [unknown]@[unknown] LOG:  could not
> > accept SSL connection: EOF detected
> > 2018-04-26 20:59:34.856 CEST [27744] LOG:  trigger file found:
> > /var/lib/postgresql/9.6/main/trigger
> > 2018-04-26 20:59:34.856 CEST [27746] FATAL:  terminating walreceiver
> > process due to administrator command
> > 2018-04-26 20:59:34.857 CEST [27744] LOG:  invalid record length at
> > 3/2133FD18: wanted 24, got 0
> > 2018-04-26 20:59:34.857 CEST [27744] LOG:  redo done at 3/2133FCF0
> > 2018-04-26 20:59:34.857 CEST [27744] LOG:  last completed transaction was
> > at log time 2018-04-26 20:59:29.852716+02
> > 2018-04-26 20:59:34.873 CEST [27744] LOG:  selected new timeline ID: 94
> > 2018-04-26 20:59:34.994 CEST [27744] LOG:  archive recovery complete
> > 2018-04-26 20:59:35.025 CEST [27744] LOG:  MultiXact member wraparound
> > protections are now enabled
> > 2018-04-26 20:59:35.034 CEST [25506] LOG:  autovacuum launcher started
> > 2018-04-26 20:59:35.034 CEST [27743] LOG:  database system is ready to
> > accept connections
> >
> >> 2018-04-26 20:59:34.856 CEST [27744] LOG:  trigger file found:
> > /var/lib/postgresql/9.6/main/trigger
> > -> On this line I assume it is the standby that is talking, because there
> > is no /var/lib/postgresql/9.6/main directory on the master; the data is
> > mounted somewhere else. The failover process started at 20:59:29 on pgpool,
> > and the standby got promoted.
>
> Yes, that's my understanding too. So there's no emitted log on the
> master around 2018-04-26 20:59:34.856 CEST, I assume.
>
> >> 2018-04-26 20:38:10.225 CEST [23537] [unknown]@[unknown] LOG:  could not
> > accept SSL connection: EOF detected
> > This could be the odd one. But it happened 20 minutes before the bug and
> > does not have much to do with the health check process.
>
> No idea for this part.
>
> > Nothing more relevant in the Postgres logs.
>
> Ok.
>
> >> there's no reason for the health check process to not accept 127.0.0.1.
> >
> > Like I said, the health check process reaches PostgreSQL through the public
> > IP, so it goes through a different interface.
>
> Still I don't understand. Pgpool-II and PostgreSQL master are on the
> same machine, that means you could set like "backend_hostname0 =
> "127.0.0.1". But actually you did not prefer that way. The health check
> process just uses the same hostname/IP as backend_hostname0.
>
> > At this time PostgreSQL was receiving ~5 inserts / second and that's all.
> > No error detected on the apps.
>
> Yeah, no big load.
>
> > So the only reason I could find is a problem on the public interface of
> > this server, but this is really, really unusual as it comes from a dedicated
> > server provider.
>
> From the error message of the health check process:
> > 2018-04-26 20:59:29: pid 2153:LOG:  failed to connect to PostgreSQL server
> > on "x.x.x.x:xxx" using INET socket
> > 2018-04-26 20:59:29: pid 2153:DETAIL:  health check timer expired
> > 2018-04-26 20:59:29: pid 2153:ERROR:  failed to make persistent db
>
> The Pgpool-II health check process uses a non-blocking socket for connecting
> to PostgreSQL. After issuing the connect system call it waits for its
> completion using the select system call with a timeout: connect_timeout in
> pgpool.conf (in your case 10 seconds). On the other hand,
> health_check_timeout is 6 seconds. So after 6 seconds, an alarm interrupted
> the select system call and it returned with errno == EINTR, and then the log
> was emitted. Not sure why the connect system call did not respond for 6
> seconds.
>
> That's all I know from the log.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
>
> ------------------------------
>
> Message: 2
> Date: Fri, 27 Apr 2018 10:15:54 +0900 (JST)
> From: Tatsuo Ishii <ishii at sraoss.co.jp>
> To: psyckow.prod at gmail.com
> Cc: pgpool-general at pgpool.net
> Subject: [pgpool-general: 6057] Re: "health check timer expired" on
>         local machine
> Message-ID: <20180427.101554.1927607666615743161.t-ishii at sraoss.co.jp>
> Content-Type: Text/Plain; charset=us-ascii
>
> > The Pgpool-II health check process uses a non-blocking socket for connecting
> > to PostgreSQL. After issuing the connect system call it waits for its
> > completion using the select system call with a timeout: connect_timeout in
> > pgpool.conf (in your case 10 seconds). On the other hand,
> > health_check_timeout is 6 seconds. So after 6 seconds, an alarm interrupted
> > the select system call and it returned with errno == EINTR, and then the log
> > was emitted. Not sure why the connect system call did not respond for 6
> > seconds.
> >
> > That's all I know from the log.
>
> If you want to investigate this further, a packet dump is required.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp
>
>
> ------------------------------
>