[pgpool-general: 6062] Re: "health check timer expired" on local machine

Bud Curly psyckow.prod at gmail.com
Fri Apr 27 23:48:04 JST 2018


So I have reduced both timeouts to 2 seconds each, like this:
health_check_timeout = 2
connect_timeout = 2000

The problem seems to appear more frequently now (twice since I made the
change, 4 hours ago). Same logs on pgpool and PostgreSQL, and the same
conditions (~5 inserts/second) while the problem occurs.

On pgpool:

2018-04-27 16:26:27: pid 5983:LOG:  failed to connect to PostgreSQL server
on "x.x.x.x:xxxx" using INET socket
2018-04-27 16:26:27: pid 5983:DETAIL:  health check timer expired
2018-04-27 16:26:27: pid 5983:ERROR:  failed to make persistent db
connection
2018-04-27 16:26:27: pid 5983:DETAIL:  connection to host:" x.x.x.x:xxxx"
failed
2018-04-27 16:26:27: pid 5983:LOG:  health check failed on node 0
(timeout:1)
2018-04-27 16:26:27: pid 5983:LOG:  received degenerate backend request for
node_id: 0 from pid [5983]
2018-04-27 16:26:27: pid 5949:LOG:  Pgpool-II parent process has received
failover request
2018-04-27 16:26:27: pid 5949:LOG:  starting degeneration. shutdown host
x.x.x.x (xxxx)
2018-04-27 16:26:27: pid 5949:LOG:  Restart all children

On PostgreSQL:

2018-04-27 16:26:32.079 CEST [30525] LOG:  trigger file found:
/var/lib/postgresql/9.6/main/trigger
2018-04-27 16:26:32.079 CEST [30527] FATAL:  terminating walreceiver
process due to administrator command
2018-04-27 16:26:32.080 CEST [30525] LOG:  invalid record length at
3/32229D10: wanted 24, got 0
2018-04-27 16:26:32.080 CEST [30525] LOG:  redo done at 3/32229CE8
2018-04-27 16:26:32.080 CEST [30525] LOG:  last completed transaction was
at log time 2018-04-27 16:26:27.093816+02
2018-04-27 16:26:32.090 CEST [30525] LOG:  selected new timeline ID: 98
2018-04-27 16:26:32.215 CEST [30525] LOG:  archive recovery complete
2018-04-27 16:26:32.230 CEST [30525] LOG:  MultiXact member wraparound
protections are now enabled
2018-04-27 16:26:32.237 CEST [30524] LOG:  database system is ready to
accept connections
2018-04-27 16:26:32.238 CEST [31170] LOG:  autovacuum launcher started

On the PostgreSQL master I set "log_min_error_statement = debug5", so if
there were a problem on the PostgreSQL side I should have noticed it.

There was nothing weird in the TCP packets while I was monitoring.

I also monitored the network connection with a looped "ping x.x.x.x" (public
address) from the machine itself; there is no variation in latency while the
problem occurs...

For a moment I thought it could be linked to the number of pooled connections
allowed on pgpool and on the backend, because of the connection monopolized
by the health check process (quick math below):

- On pgpool:

num_init_children = 30
max_pool = 3

- On the PostgreSQL master:

max_connections = 100

I tried to increase these settings, but it changed nothing...
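
Just to spell out the math (my own back-of-the-envelope estimate, assuming each
of the num_init_children processes caches at most max_pool backend connections,
plus one extra connection per health check attempt):

  num_init_children * max_pool = 30 * 3 = 90 pooled backend connections at most
  + 1 health check connection  = 91, still below max_connections = 100

So the health check should not be starved of connection slots, if I understand
the pooling correctly.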

I will try to simulate the health check process with a single connection and
the same timeout, and check whether I can reproduce the error.
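
Roughly what I have in mind is something like this (only a minimal sketch, not
the actual pgpool health check code; it just mimics the non-blocking connect +
select-with-timeout sequence that Tatsuo described below, with placeholder host
and port):

import socket, select, time

HOST, PORT = "x.x.x.x", 5432   # placeholders: backend address and port
TIMEOUT = 2                    # same value as health_check_timeout

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setblocking(False)
start = time.time()
try:
    s.connect((HOST, PORT))    # non-blocking: returns or raises immediately
except BlockingIOError:
    pass                       # connection attempt is now in progress
# Wait until the socket becomes writable (connect finished) or the timeout
# hits, similar to what the health check does with select() before giving up.
_, writable, _ = select.select([], [s], [], TIMEOUT)
elapsed = time.time() - start
if writable and s.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0:
    print("connected in %.3f s" % elapsed)
else:
    print("would have hit the health check timeout after %.3f s" % elapsed)
s.close()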

But I am running out of ideas right now... If someone has any suggestions, I
will gladly take them.

Thanks

2018-04-27 11:33 GMT+02:00 Bud Curly <psyckow.prod at gmail.com>:

> > So if we had health_check_hostname0, would it help you?
>
> This could be a workaround for my case, but I believe this is a network
> issue on my server provider's side. I don't really know their network
> structure, but the public IP address is assigned at a higher level through
> NAT and is not configured on the network interface of the server itself.
> With the tracepath command from the machine to its own public IP, I found
> out that the path goes through 8 hops.
>
> So in general, using the public IP instead of the loopback address is not
> good for the performance of local services.
>
> A setting that would interest me would be recovery_hostname0,
> recovery_hostname1, etc., as I only need the public IP for the standbys to
> perform pgpool_recovery().
>
> Thanks :)
>
> Tatsuo Ishii  ishii at sraoss.co.jp
> Fri Apr 27 17:47:36 JST 2018
>
> > Thanks for your support :)
>
> You are welcome :-)
>
> >> Still I don't understand. Pgpool-II and PostgreSQL master are on the same
> >> machine, that means you could set something like "backend_hostname0 = '127.0.0.1'".
> >
> > Because I need the public address for the pgpool_recovery() method to permit
> > online recovery from remote nodes. And pgpool processes such as the
> > health_check process use backend_hostname0 to do so.
>
> Oh that makes sense.
>
> > The setting health_check_hostname0 doesn't exist though, so for now this
> > is not a workaround.
>
> So if we had health_check_hostname0, would it help you?
>
> > So according to the log, is the timeout error triggered by
> > "health_check_timeout = 6" or by "connect_timeout = 10000"?
>
> I believe "health_check_timeout = 6". The connect system call waits up to
> 10 seconds, but health_check_timeout expires before that.
>
> > I lowered both timeouts to 2 seconds and am monitoring network packets to
> > find more details... I will keep you in touch.
>
> Thanks.
>
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
>
>
>
> 2018-04-27 10:44 GMT+02:00 Bud Curly <psyckow.prod at gmail.com>:
>
>> Thanks for your support :)
>>
>> > Still I don't understand. Pgpool-II and PostgreSQL master are on the same
>> > machine, that means you could set something like "backend_hostname0 = '127.0.0.1'".
>>
>> Because I need the public address for the pgpool_recovery() method to permit
>> online recovery from remote nodes. And pgpool processes such as the
>> health_check process use backend_hostname0 to do so.
>>
>> The setting health_check_hostname0 doesn't exist though, so for now this is
>> not a workaround.
>>
>> So according to the log, is the timeout error triggered by
>> "health_check_timeout = 6" or by "connect_timeout = 10000"?
>>
>> I lowered both timeouts to 2 seconds and am monitoring network packets to
>> find more details... I will keep you in touch.
>>
>>
>> 2018-04-27 3:15 GMT+02:00 Tatsuo Ishii <ishii at sraoss.co.jp>:
>>
>>> > The Pgpool-II health check process uses a non-blocking socket to connect
>>> > to PostgreSQL. After issuing the connect system call it waits for its
>>> > completion using the select system call, with a timeout of connect_timeout
>>> > from pgpool.conf (in your case 10 seconds). On the other hand,
>>> > health_check_timeout is 6 seconds. So after 6 seconds an alarm interrupted
>>> > the select system call, it returned with errno == EINTR, and the log
>>> > message was emitted. I am not sure why the connect system call did not
>>> > complete within 6 seconds.
>>> >
>>> > That's all I know from the log.
>>>
>>> If you want to research this further, a packet dump is required.
>>>
>>> Best regards,
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>
>>
>