[pgpool-general: 1336] Re: WatchDog and pgool sudden stop working

Wed Jan 23 11:00:23 JST 2013

I suspect that your num_init_children is too small.  In this case it
is possible that watch dog life check cannot connect to pgpool and
timed out, and goes into error status. You really should manage
incoming connections from pgpool clients is not beyond
num_init_children -1 (1 is the room for watch dog life check).
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Hello Tatsuo,
> 
> today my active pgpool again stopped working and finally I have full debug log of this situation, I hope it will help to find the root cause of the troubles with spontanous watchdog fail.
> 
> wwwpri is the server with master pgpool and the log file is from the same wwwpri server:
> 
> 
> 
> 2013-01-22 16:45:31 DEBUG: pid 26847: starting health checking
> 2013-01-22 16:45:31 DEBUG: pid 26847: health_check: 0 th DB node status: 2
> 2013-01-22 16:45:31 DEBUG: pid 26847: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: auth kind: 0
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: backend key data received
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: transaction state: I
> 2013-01-22 16:45:31 DEBUG: pid 26847: health_check: 1 th DB node status: 2
> 2013-01-22 16:45:31 DEBUG: pid 26847: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: auth kind: 0
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: backend key data received
> 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: transaction state: I
> 2013-01-22 16:45:32 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:32 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_current_xlog_location()
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:32 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_last_xlog_replay_location()
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:32 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: checking pgpool 0 (wwwpri:54320)
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:1
> 2013-01-22 16:45:33 LOG: pid 26855: wd_lifecheck: lifecheck failed 5 times. pgpool seems not to be working
> 2013-01-22 16:45:33 LOG: pid 26855: wd_IP_down: ifconfig down succeeded
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: checking pgpool 1 (wwwsec:54320)
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: OK, status: 2
> 2013-01-22 16:45:40 DEBUG: pid 26855: get_result: ping data: PING 172.31.2.1 (172.31.2.1) 56(84) bytes of data.
> 
> --- 172.31.2.1 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 2000ms
> rtt min/avg/max/mdev = 0.342/0.393/0.485/0.068 ms
> 
> 2013-01-22 16:45:42 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:42 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: auth kind: 0
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: backend key data received
> 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: transaction state: I
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_current_xlog_location()
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:42 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_last_xlog_replay_location()
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: T
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: row description received
> 2013-01-22 16:45:42 DEBUG: pid 26890: num_fileds: 1
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: D
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: data row received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: C
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Command complete received
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: Z
> 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Ready for query
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: checking pgpool 0 (wwwpri:54320)
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:45:43 LOG: pid 26855: wd_lifecheck: lifecheck failed 5 times. pgpool seems not to be working
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: checking pgpool 1 (wwwsec:54320)
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: OK, status: 2
> 2013-01-22 16:45:50 DEBUG: pid 26855: get_result: ping data: PING 172.31.2.1 (172.31.2.1) 56(84) bytes of data.
> 
> --- 172.31.2.1 ping statistics ---
> 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> rtt min/avg/max/mdev = 0.300/0.335/0.382/0.037 ms
> 
> 
> Here is log file grep only for "wd_lifecheck: NG, status: "
> 
> 2013-01-22 16:44:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:5
> 2013-01-22 16:45:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:4
> 2013-01-22 16:45:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:3
> 2013-01-22 16:45:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:2
> 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:1
> 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:45:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:46:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 2013-01-22 16:47:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> 
> 
> 
> Any help is highly appreciated.
> Thank you,
> 
> Tomas
> 
> 
> 
> 
> 
> On 22, Jan, 2013, at 7:42 , Tomas Halgas <tomas at halgas.sk> wrote:
> 
>> Hi Tatsuo,
>> 
>> thank you for your reply in 1st place.
>> The symptom is that once "wd_life_check failed" occurs then pgpool switches virtual ip to second pgpool node. 
>> 
>> Original pgpool node then still accept new connection on TCP level from client (on physical IP), but times out (not providing even prompt for psql client).
>> 
>> I'm not able to test already established sessions as those are established by php web application.. Should I try to keep one open connection from psql client just for this case to test next time it happen?
>> I'm now restarting both pgpool daemons every night as a workaround as this regular restart seems to avoid the hang after 2-3 days period.
>> 
>> Regrads,
>> 
>> Tomas
>> 
>> 
>> On 21, Jan, 2013, at 23:51 , Tatsuo Ishii <ishii at postgresql.org> wrote:
>> 
>>> What do you mean by "pgpools stop working"?
>>> 
>>> I see many clients already connect to pgpool. Can you issue a query
>>> through one of the sessions?
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>> 
>>>> I have repeated issue on my pgpool servers (two servers monitoring each other via pgpool watchdog).
>>>> 
>>>> About each 2nd or 3rd day one (or both) pgpools stop working, most of times the log starts with:
>>>> read_startup_packet: incorrect packet length (0)
>>>> 
>>>> and then follow many of those messages on both servers:
>>>> wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 
>>>> 
>>>> Today it happen only on server which was pgpool master and correctly switched to the other pgpool server.
>>>> This is log from the master:
>>>> 2013-01-18 08:18:20 ERROR: pid 3160: read_startup_packet: incorrect packet length (0)
>>>> 2013-01-18 08:18:23 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 2013-01-18 08:18:23 LOG:   pid 3152: wd_IP_down: ifconfig down succeeded
>>>> 2013-01-18 08:18:32 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 2013-01-18 08:18:39 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 2013-01-18 08:18:46 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>>>> 
>>>> 
>>>> There was no postgresql/network outage, nor I was able to find any corellation to anything on the systems
>>>> which could trigger this.
>>>> 
>>>> Usualy this happen on both servers in same time so I have quickly to solve the situation by pkill -9 pgpool - which then gets
>>>> restarted by DJBs svscan (pkill is not able to actually stop the pgpool when it's in this state).
>>>> But today as the other pgpool was working I was doing a bit more troublehooting on the former pgpool master and have found that
>>>> it was accepting connections from db clients, but the connection never finished (it got hanged).
>>>> Connection to PCP was working all the time OK (attaching pcp_pool_status to this email).
>>>> 
>>>> I have executed strace in pid 3090 to see what's happening but i'm not really able to read the output properly so I would appreciate if someone
>>>> can have a look into attached strace output if she/he can identify what could be the reason for the hang.
>>>> Intereseting enough that after about 40-45 minutes the pgpool on the former pgpool master started to work again properly.
>>>> There was nothing in logfile, just the log entries "wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working" stopped to apear.
>>>> This event should be included inside the strace output too as the strace was running during that time.
>>>> 
>>>> One thing I have also noticed shorty AFTER the pgpool on former master started working again by itself is that
>>>> all processes on former pgpool master were started on Jan 15 , but one pgpool process had other started time shown via the ps axuwf
>>>> and by coincidence it is the time when the pgpool started working again:
>>>> 
>>>> pgsql    23021  0.0  0.0 110308  1876 ?        S    09:02   0:00  |   |               \_ pgpool: wait for connection request
>>>> 
>>>> Maybe it is normal, I just noticed the time of the process is other than other pgpool processes, here is the whole snapshot of ps axuwf related to pgpool:
>>>> 
>>>> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
>>>> root      1975  0.0  0.0   4164   336 ?        S     2012   0:00  |   \_ supervise pgpool
>>>> root      3065  0.0  0.0  17740  1424 ?        S    Jan15   0:00  |   |   \_ /bin/sh ./run
>>>> pgsql     3078  0.0  0.0  20296  1788 ?        S    Jan15   0:00  |   |       \_ -su -c /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n >> /var/log/pgpool.log
>>>> pgsql     3090  0.0  0.0 110184  4640 ?        S    Jan15   0:12  |   |           \_ /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n
>>>> pgsql     3151  0.0  0.0 110184   740 ?        S    Jan15   0:00  |   |               \_ pgpool: watchdog
>>>> pgsql     3152  0.0  0.0 184048  1592 ?        Sl   Jan15   0:43  |   |               \_ pgpool: lifecheck
>>>> pgsql    24999  0.0  0.0   4392   352 ?        S    09:08   0:00  |   |               |   \_ ping -q -c3 172.31.2.1
>>>> pgsql     3153  0.0  0.1 119348 11176 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50064) idle
>>>> pgsql     3154  0.0  0.1 119484 11280 ?        S    Jan15   1:55  |   |               \_ pgpool: az az 172.31.2.159(50068) idle
>>>> pgsql     3155  0.0  0.1 119340 11088 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(49960) idle
>>>> pgsql     3156  0.0  0.1 119008 10744 ?        S    Jan15   1:36  |   |               \_ pgpool: az az 172.31.2.159(50045) idle
>>>> pgsql     3157  0.0  0.1 119148 10856 ?        S    Jan15   1:42  |   |               \_ pgpool: az az 172.31.2.159(50075) idle
>>>> pgsql     3158  0.0  0.1 119208 10964 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50072) idle
>>>> pgsql     3159  0.0  0.1 118836 10608 ?        S    Jan15   1:37  |   |               \_ pgpool: az az 172.31.2.159(50006) idle
>>>> pgsql     3160  0.0  0.1 118360 10112 ?        S    Jan15   1:25  |   |               \_ pgpool: az az 172.31.2.158(57783) idle
>>>> pgsql     3161  0.0  0.1 119352 11128 ?        S    Jan15   1:50  |   |               \_ pgpool: az az 172.31.2.159(50007) idle
>>>> pgsql     3162  0.0  0.1 118772 10508 ?        S    Jan15   1:34  |   |               \_ pgpool: az az 172.31.2.159(50046) idle
>>>> pgsql     3163  0.0  0.1 119620 11388 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50062) idle
>>>> pgsql     3164  0.0  0.1 120384 12092 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50047) idle
>>>> pgsql     3165  0.0  0.1 119384 11220 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50010) idle
>>>> pgsql     3166  0.0  0.1 120084 11824 ?        S    Jan15   2:08  |   |               \_ pgpool: az az 172.31.2.159(50042) idle
>>>> pgsql     3167  0.0  0.1 120444 12116 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50074) idle
>>>> pgsql     3168  0.0  0.1 119732 11492 ?        S    Jan15   1:53  |   |               \_ pgpool: az az 172.31.2.159(50070) idle
>>>> pgsql     3170  0.0  0.1 119868 11680 ?        S    Jan15   1:58  |   |               \_ pgpool: az az 172.31.2.159(50067) idle
>>>> pgsql     3171  0.0  0.1 120224 11980 ?        S    Jan15   2:04  |   |               \_ pgpool: az az 172.31.2.159(49972) idle
>>>> pgsql     3172  0.0  0.1 119880 11676 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50048) idle
>>>> pgsql     3173  0.0  0.1 119440 11136 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50008) idle
>>>> pgsql     3174  0.0  0.1 119296 11108 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50011) idle
>>>> pgsql     3175  0.0  0.1 118952 10700 ?        S    Jan15   1:38  |   |               \_ pgpool: az az 172.31.2.159(50049) idle
>>>> pgsql     3176  0.0  0.1 119200 11008 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(50066) idle
>>>> pgsql     3178  0.0  0.1 119616 11376 ?        S    Jan15   1:55  |   |               \_ pgpool: az az 172.31.2.159(49973) idle
>>>> pgsql     3179  0.0  0.1 118920 10712 ?        S    Jan15   1:40  |   |               \_ pgpool: az az 172.31.2.159(50043) idle
>>>> pgsql     3181  0.0  0.1 120188 11872 ?        S    Jan15   1:48  |   |               \_ pgpool: az az 172.31.2.158(57784) idle
>>>> pgsql     3183  0.0  0.1 119688 11460 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50076) idle
>>>> pgsql     3185  0.0  0.1 119444 11160 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(49959) idle
>>>> pgsql     3186  0.0  0.1 119428 11216 ?        S    Jan15   1:54  |   |               \_ pgpool: az az 172.31.2.159(50069) idle
>>>> pgsql     3187  0.0  0.1 119888 11708 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50044) idle
>>>> pgsql     3188  0.0  0.1 119344 11092 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50009) idle
>>>> pgsql     3189  0.0  0.0 110772  1764 ?        S    Jan15   0:00  |   |               \_ pgpool: PCP: wait for connection request
>>>> pgsql     3190  0.0  0.0 110296  1320 ?        S    Jan15   0:25  |   |               \_ pgpool: worker process
>>>> pgsql    23021  0.0  0.0 110308  1876 ?        S    09:02   0:00  |   |               \_ pgpool: wait for connection request
>>>> 
>>>> 
>>>> My setup:
>>>> 172.31.2.158    - pgpool1 (former master - all above logs were done within this server)
>>>> 172.31.2.159    - pgpool2 (former slave)
>>>> 172.31.2.160    - virtual IP managed by pgpool where db clients connects to (originally assigned to pgpool1 server)
>>>> 
>>>> And all above hosts i'm having included in /etc/hosts to avoid any DNS influence..
>>>> 
>>>> I have later today enabled debug_level = 1 in pgpool.conf and limited logfile to 2GB (which limits the log only for abou 10 minutes)
>>>> 
>>>> Also have included my pgpool.conf files which I believe are setup correctly, but one never knows...
>>>> 
>>>> Many thanks for any help, although I like pgpool very much it's not stable in my environment which is quite frustrating as I opted for pgpool to improve
>>>> service availability but till now it's rather exactly opposite.
>>>> 
>>>> Tomas
>>> 
>> 
>