[pgpool-general: 6063] Re: "health check timer expired" on local machine

Bud Curly psyckow.prod at gmail.com
Sat Apr 28 01:45:23 JST 2018


OK, so the problem showed up again; here is what's new:

I made a small script (remote, from my PC) that performs "SELECT 1" on the
same DB with the same credentials, at a 2-second interval and with a
2-second timeout, using one connection, everything just like the health
check process.
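
Roughly, the script does this (just a sketch, using psycopg2; the connection
values are placeholders for my real ones):

import time
import psycopg2

# placeholders -- same backend, DB and user as the health check in pgpool.conf
DSN = "host=x.x.x.x port=xxxx dbname=mydb user=myuser connect_timeout=2"

while True:
    start = time.time()
    try:
        conn = psycopg2.connect(DSN)   # one fresh connection per check
        cur = conn.cursor()
        cur.execute("SELECT 1")
        cur.fetchone()
        conn.close()
        print("ok %.3fs" % (time.time() - start))
    except Exception as exc:
        print("FAILED after %.3fs: %s" % (time.time() - start, exc))
    time.sleep(2)                      # 2-second interval between checks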

During the bug, my script didn't show anything: everything worked fine on
it. Same for the network monitoring, and nothing more relevant in the
pgpool and PostgreSQL logs...

I changed the health check process's credentials and DB; nothing changed.

Could someone tell me how to make pgpool's health check process verbose?
(even if it requires changing the source)

The starting point is still this:

2018-04-27 18:32:16: pid 5983:LOG:  failed to connect to PostgreSQL server
on "x.x.x.x:xxxx" using INET socket
2018-04-27 18:32:16: pid 5983:DETAIL:  health check timer expired

And I would like to have more details about this one.

Thanks.

2018-04-27 16:48 GMT+02:00 Bud Curly <psyckow.prod at gmail.com>:

> So I have reduced both timeouts to 2 seconds, like this:
> health_check_timeout = 2
> connect_timeout = 2000
>
> The problem seems to appear more frequently (twice since I made the
> update, 4 hours ago). Same logs on pgpool and PostgreSQL, and the same
> conditions (~5 inserts/second) during the problem.
>
> On pgpool:
>
> 2018-04-27 16:26:27: pid 5983:LOG:  failed to connect to PostgreSQL server
> on "x.x.x.x:xxxx" using INET socket
> 2018-04-27 16:26:27: pid 5983:DETAIL:  health check timer expired
> 2018-04-27 16:26:27: pid 5983:ERROR:  failed to make persistent db
> connection
> 2018-04-27 16:26:27: pid 5983:DETAIL:  connection to host:" x.x.x.x:xxxx"
> failed
> 2018-04-27 16:26:27: pid 5983:LOG:  health check failed on node 0
> (timeout:1)
> 2018-04-27 16:26:27: pid 5983:LOG:  received degenerate backend request
> for node_id: 0 from pid [5983]
> 2018-04-27 16:26:27: pid 5949:LOG:  Pgpool-II parent process has received
> failover request
> 2018-04-27 16:26:27: pid 5949:LOG:  starting degeneration. shutdown host
> x.x.x.x (xxxx)
> 2018-04-27 16:26:27: pid 5949:LOG:  Restart all children
>
> On PostgreSQL:
>
> 2018-04-27 16:26:32.079 CEST [30525] LOG:  trigger file found:
> /var/lib/postgresql/9.6/main/trigger
> 2018-04-27 16:26:32.079 CEST [30527] FATAL:  terminating walreceiver
> process due to administrator command
> 2018-04-27 16:26:32.080 CEST [30525] LOG:  invalid record length at
> 3/32229D10: wanted 24, got 0
> 2018-04-27 16:26:32.080 CEST [30525] LOG:  redo done at 3/32229CE8
> 2018-04-27 16:26:32.080 CEST [30525] LOG:  last completed transaction was
> at log time 2018-04-27 16:26:27.093816+02
> 2018-04-27 16:26:32.090 CEST [30525] LOG:  selected new timeline ID: 98
> 2018-04-27 16:26:32.215 CEST [30525] LOG:  archive recovery complete
> 2018-04-27 16:26:32.230 CEST [30525] LOG:  MultiXact member wraparound
> protections are now enabled
> 2018-04-27 16:26:32.237 CEST [30524] LOG:  database system is ready to
> accept connections
> 2018-04-27 16:26:32.238 CEST [31170] LOG:  autovacuum launcher started
>
> On the PostgreSQL master I set "log_min_error_statement = debug5", so if
> there were a problem on the PostgreSQL side, I should have noticed.
>
> There was nothing weird in the TCP packets while I was monitoring.
>
> I also monitored the network connection with a looped ping to x.x.x.x (the
> public address) from the machine; there was no variation in latency during
> the problem...
>
> I thought for a second it could be linked to the number of pooled
> connections allowed on pgpool and on the backend, because of the connection
> monopolized by the health check process:
>
> - On pgpool:
>
> num_init_children = 30
> max_pool = 3
>
> - On the PostgreSQL master:
>
> max_connections = 100
>
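> As a quick sanity check on those numbers: if I read the sizing rule in the
> pgpool documentation correctly (max_pool * num_init_children should stay
> within max_connections minus superuser_reserved_connections), that gives
> 3 * 30 = 90 backend connections at most, which still fits under
> max_connections = 100 with some margin left for the health check and
> sr_check connections.
>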
> I tried to increase these settings; it changed nothing...
>
> I will try to simulate the health check process with one connection and the
> same timeout, and see whether I get anything.
>
> But I'm running out of ideas right now... If someone has something, I'll take it.
>
> Thanks
>
> 2018-04-27 11:33 GMT+02:00 Bud Curly <psyckow.prod at gmail.com>:
>
>> > So if we had health_check_hostname0, would it help you?
>>
>> This could be a workaround for my case, but I believe this is a network
>> issue on my server provider's side. I don't really know their network
>> structure, but the public IP address used is set at a higher level through
>> NAT and is not bound to the network interface of the server itself.
>> With the tracepath command from the machine to its public IP, I found out
>> that the route goes through 8 hops.
>>
>> So in general, using the public IP instead of the loopback is not good in
>> terms of performance for local services.
>>
>> A setting that would interest me would be recovery_hostname0,
>> recovery_hostname1, etc., as I only need the public IP for the standbys to
>> perform pgpool_recovery().
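>>
>> Just to illustrate the idea (recovery_hostname0 does not exist today; this
>> is only what I would like to be able to write in pgpool.conf):
>>
>> backend_hostname0 = '127.0.0.1'    # health check and client traffic stay local
>> recovery_hostname0 = 'x.x.x.x'     # hypothetical: public address, used only by pgpool_recovery()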
>>
>> Thanks :)
>>
>> Tatsuo Ishii  ishii at sraoss.co.jp
>> Fri Apr 27 17:47:36 JST 2018
>>
>> > Thanks for your support :)
>>
>> You are welcome :-)
>>
>> >> Still I don't understand. Pgpool-II and the PostgreSQL master are on the
>> >> same machine, which means you could set something like
>> >> "backend_hostname0 = '127.0.0.1'".
>> >
>> > Because I need the public address for the pgpool_recovery() method to
>> > permit online recovery from remote nodes. And pgpool processes like the
>> > health check use backend_hostname0 to do so.
>>
>> Oh, that makes sense.
>>
>> > The setting health_check_hostname0 doesn't exist, though, so this is not
>> > a workaround.
>>
>> So if we had health_check_hostname0, would it help you?
>>
>> > So according to the log, is the timeout error triggered by
>> > "health_check_timeout = 6" or by "connect_timeout = 10000"?
>>
>> I believe "health_check_timeout = 6". The connect system call waits up to
>> 10 seconds, but health_check_timeout expires before that.
>>
>> > I lowered both timeouts to 2 seconds and am monitoring network packets to
>> > find some details... I'll keep you posted.
>>
>> Thanks.
>>
>> Best regards,
>> --
>> Tatsuo Ishii
>> SRA OSS, Inc. Japan
>> English: http://www.sraoss.co.jp/index_en.php
>> Japanese:http://www.sraoss.co.jp
>>
>>
>>
>>
>> 2018-04-27 10:44 GMT+02:00 Bud Curly <psyckow.prod at gmail.com>:
>>
>>> Thanks for your support :)
>>>
>>> > Still I don't understand. Pgpool-II and the PostgreSQL master are on the
>>> > same machine, which means you could set something like
>>> > "backend_hostname0 = '127.0.0.1'".
>>>
>>> Because I need the public address for the pgpool_recovery() method to
>>> permit online recovery from remote nodes. And pgpool processes like the
>>> health check use backend_hostname0 to do so.
>>>
>>> The setting health_check_hostname0 doesn't exist, though, so this is not
>>> a workaround.
>>>
>>> So according to the log, is the timeout error triggered by
>>> "health_check_timeout = 6" or by "connect_timeout = 10000"?
>>>
>>> I lowered both timeouts to 2 seconds and am monitoring network packets to
>>> find some details... I'll keep you posted.
>>>
>>>
>>> 2018-04-27 3:15 GMT+02:00 Tatsuo Ishii <ishii at sraoss.co.jp>:
>>>
>>>> > The Pgpool-II health check process uses a non-blocking socket to connect
>>>> > to PostgreSQL. After issuing the connect system call it waits for its
>>>> > completion using the select system call, with a timeout of connect_timeout
>>>> > from pgpool.conf (in your case 10 seconds). On the other hand,
>>>> > health_check_timeout is 6 seconds. So after 6 seconds an alarm interrupted
>>>> > the select system call, it returned with errno == EINTR, and the log was
>>>> > emitted. I am not sure why the connect system call did not respond for 6
>>>> > seconds.
>>>> >
>>>> > That's all I know from the log.
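>>>>
>>>> If it helps to visualize the sequence, here is a rough Python sketch of it
>>>> (illustration only -- Pgpool-II itself is written in C, and the function
>>>> name here is made up):
>>>>
>>>> import select, signal, socket
>>>>
>>>> def on_alarm(signum, frame):
>>>>     # analogue of health_check_timeout expiring
>>>>     raise TimeoutError("health check timer expired")
>>>>
>>>> def check_backend(host, port, health_check_timeout=6, connect_timeout=10):
>>>>     signal.signal(signal.SIGALRM, on_alarm)
>>>>     signal.alarm(health_check_timeout)      # 6 seconds in your config
>>>>     sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
>>>>     sock.setblocking(False)                 # non-blocking socket
>>>>     try:
>>>>         try:
>>>>             sock.connect((host, port))      # returns immediately, connect in progress
>>>>         except BlockingIOError:
>>>>             pass
>>>>         # wait for the connect to finish; this is the connect_timeout part
>>>>         _, writable, _ = select.select([], [sock], [], connect_timeout)
>>>>         # if the backend has not answered after health_check_timeout seconds
>>>>         # (6 < 10), SIGALRM fires first, interrupts the select() above and
>>>>         # on_alarm raises -- that is the failing case in the log
>>>>         if not writable:
>>>>             raise TimeoutError("connect_timeout expired")
>>>>         err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
>>>>         if err != 0:
>>>>             raise OSError(err, "connect failed")
>>>>     finally:
>>>>         signal.alarm(0)
>>>>         sock.close()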
>>>>
>>>> If you want to research this further, a packet dump is required.
>>>>
>>>> Best regards,
>>>> --
>>>> Tatsuo Ishii
>>>> SRA OSS, Inc. Japan
>>>> English: http://www.sraoss.co.jp/index_en.php
>>>> Japanese:http://www.sraoss.co.jp
>>>>
>>>
>>>
>>
>