[pgpool-general: 1351] Re: WatchDog and pgool sudden stop working
Tomas Halgas
tomas at halgas.sk
Wed Jan 30 04:40:46 JST 2013
I think I can now safely confirm this really did help (I have increased
number of num_init_children to double (64) of default size (32) and
pgpool is now stable for 6 days).
Many thanks Tatsuo!
May I suggest that this configuration option would be
either renamed to num_max_children, or maybe pgpool should log a warning
when all childrens are used so the administrator would get a clue in situation like I did have?
Or maybe I just overlooked this in documentation?
Anyways I am very happy with pgpool now (and of course your support) and can
only recommend it to anyone looking for HA solution for streaming postgres setup :)
Best regards,
Tomas
On Wed, Jan 23, 2013 at 11:00:23AM +0900, Tatsuo Ishii wrote:
> I suspect that your num_init_children is too small. In this case it
> is possible that watch dog life check cannot connect to pgpool and
> timed out, and goes into error status. You really should manage
> incoming connections from pgpool clients is not beyond
> num_init_children -1 (1 is the room for watch dog life check).
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > Hello Tatsuo,
> >
> > today my active pgpool again stopped working and finally I have full debug log of this situation, I hope it will help to find the root cause of the troubles with spontanous watchdog fail.
> >
> > wwwpri is the server with master pgpool and the log file is from the same wwwpri server:
> >
> >
> >
> > 2013-01-22 16:45:31 DEBUG: pid 26847: starting health checking
> > 2013-01-22 16:45:31 DEBUG: pid 26847: health_check: 0 th DB node status: 2
> > 2013-01-22 16:45:31 DEBUG: pid 26847: pool_ssl: SSL requested but SSL support is not available
> > 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: auth kind: 0
> > 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: backend key data received
> > 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: transaction state: I
> > 2013-01-22 16:45:31 DEBUG: pid 26847: health_check: 1 th DB node status: 2
> > 2013-01-22 16:45:31 DEBUG: pid 26847: pool_ssl: SSL requested but SSL support is not available
> > 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: auth kind: 0
> > 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: backend key data received
> > 2013-01-22 16:45:31 DEBUG: pid 26847: s_do_auth: transaction state: I
> > 2013-01-22 16:45:32 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> > 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: auth kind: 0
> > 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: backend key data received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: transaction state: I
> > 2013-01-22 16:45:32 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> > 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: auth kind: 0
> > 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: backend key data received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: s_do_auth: transaction state: I
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_current_xlog_location()
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: T
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: row description received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: num_fileds: 1
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: D
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: data row received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: C
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Command complete received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: Z
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Ready for query
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_last_xlog_replay_location()
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: T
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: row description received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: num_fileds: 1
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: D
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: data row received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: C
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Command complete received
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: kind: Z
> > 2013-01-22 16:45:32 DEBUG: pid 26890: do_query: Ready for query
> > 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: checking pgpool 0 (wwwpri:54320)
> > 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:1
> > 2013-01-22 16:45:33 LOG: pid 26855: wd_lifecheck: lifecheck failed 5 times. pgpool seems not to be working
> > 2013-01-22 16:45:33 LOG: pid 26855: wd_IP_down: ifconfig down succeeded
> > 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: checking pgpool 1 (wwwsec:54320)
> > 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: OK, status: 2
> > 2013-01-22 16:45:40 DEBUG: pid 26855: get_result: ping data: PING 172.31.2.1 (172.31.2.1) 56(84) bytes of data.
> >
> > --- 172.31.2.1 ping statistics ---
> > 3 packets transmitted, 3 received, 0% packet loss, time 2000ms
> > rtt min/avg/max/mdev = 0.342/0.393/0.485/0.068 ms
> >
> > 2013-01-22 16:45:42 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> > 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: auth kind: 0
> > 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: backend key data received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: transaction state: I
> > 2013-01-22 16:45:42 DEBUG: pid 26890: pool_ssl: SSL requested but SSL support is not available
> > 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: auth kind: 0
> > 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: backend key data received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: s_do_auth: transaction state: I
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_current_xlog_location()
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: T
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: row description received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: num_fileds: 1
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: D
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: data row received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: C
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Command complete received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: Z
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Ready for query
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: extended:0 query:SELECT pg_last_xlog_replay_location()
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: T
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: row description received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: num_fileds: 1
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: D
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: data row received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: C
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Command complete received
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: kind: Z
> > 2013-01-22 16:45:42 DEBUG: pid 26890: do_query: Ready for query
> > 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: checking pgpool 0 (wwwpri:54320)
> > 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:45:43 LOG: pid 26855: wd_lifecheck: lifecheck failed 5 times. pgpool seems not to be working
> > 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: checking pgpool 1 (wwwsec:54320)
> > 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: OK, status: 2
> > 2013-01-22 16:45:50 DEBUG: pid 26855: get_result: ping data: PING 172.31.2.1 (172.31.2.1) 56(84) bytes of data.
> >
> > --- 172.31.2.1 ping statistics ---
> > 3 packets transmitted, 3 received, 0% packet loss, time 1999ms
> > rtt min/avg/max/mdev = 0.300/0.335/0.382/0.037 ms
> >
> >
> > Here is log file grep only for "wd_lifecheck: NG, status: "
> >
> > 2013-01-22 16:44:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:5
> > 2013-01-22 16:45:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:4
> > 2013-01-22 16:45:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:3
> > 2013-01-22 16:45:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:2
> > 2013-01-22 16:45:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 3 life:1
> > 2013-01-22 16:45:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:45:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:46:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:46:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:46:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:46:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:46:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:46:53 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:47:03 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:47:13 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:47:23 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:47:33 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> > 2013-01-22 16:47:43 DEBUG: pid 26855: wd_lifecheck: NG, status: 4 life:0
> >
> >
> >
> > Any help is highly appreciated.
> > Thank you,
> >
> > Tomas
> >
> >
> >
> >
> >
> > On 22, Jan, 2013, at 7:42 , Tomas Halgas <tomas at halgas.sk> wrote:
> >
> >> Hi Tatsuo,
> >>
> >> thank you for your reply in 1st place.
> >> The symptom is that once "wd_life_check failed" occurs then pgpool switches virtual ip to second pgpool node.
> >>
> >> Original pgpool node then still accept new connection on TCP level from client (on physical IP), but times out (not providing even prompt for psql client).
> >>
> >> I'm not able to test already established sessions as those are established by php web application.. Should I try to keep one open connection from psql client just for this case to test next time it happen?
> >> I'm now restarting both pgpool daemons every night as a workaround as this regular restart seems to avoid the hang after 2-3 days period.
> >>
> >> Regrads,
> >>
> >> Tomas
> >>
> >>
> >> On 21, Jan, 2013, at 23:51 , Tatsuo Ishii <ishii at postgresql.org> wrote:
> >>
> >>> What do you mean by "pgpools stop working"?
> >>>
> >>> I see many clients already connect to pgpool. Can you issue a query
> >>> through one of the sessions?
> >>> --
> >>> Tatsuo Ishii
> >>> SRA OSS, Inc. Japan
> >>> English: http://www.sraoss.co.jp/index_en.php
> >>> Japanese: http://www.sraoss.co.jp
> >>>
> >>>> I have repeated issue on my pgpool servers (two servers monitoring each other via pgpool watchdog).
> >>>>
> >>>> About each 2nd or 3rd day one (or both) pgpools stop working, most of times the log starts with:
> >>>> read_startup_packet: incorrect packet length (0)
> >>>>
> >>>> and then follow many of those messages on both servers:
> >>>> wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> >>>>
> >>>>
> >>>> Today it happen only on server which was pgpool master and correctly switched to the other pgpool server.
> >>>> This is log from the master:
> >>>> 2013-01-18 08:18:20 ERROR: pid 3160: read_startup_packet: incorrect packet length (0)
> >>>> 2013-01-18 08:18:23 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> >>>> 2013-01-18 08:18:23 LOG: pid 3152: wd_IP_down: ifconfig down succeeded
> >>>> 2013-01-18 08:18:32 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> >>>> 2013-01-18 08:18:39 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> >>>> 2013-01-18 08:18:46 LOG: pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> >>>>
> >>>>
> >>>> There was no postgresql/network outage, nor I was able to find any corellation to anything on the systems
> >>>> which could trigger this.
> >>>>
> >>>> Usualy this happen on both servers in same time so I have quickly to solve the situation by pkill -9 pgpool - which then gets
> >>>> restarted by DJBs svscan (pkill is not able to actually stop the pgpool when it's in this state).
> >>>> But today as the other pgpool was working I was doing a bit more troublehooting on the former pgpool master and have found that
> >>>> it was accepting connections from db clients, but the connection never finished (it got hanged).
> >>>> Connection to PCP was working all the time OK (attaching pcp_pool_status to this email).
> >>>>
> >>>> I have executed strace in pid 3090 to see what's happening but i'm not really able to read the output properly so I would appreciate if someone
> >>>> can have a look into attached strace output if she/he can identify what could be the reason for the hang.
> >>>> Intereseting enough that after about 40-45 minutes the pgpool on the former pgpool master started to work again properly.
> >>>> There was nothing in logfile, just the log entries "wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working" stopped to apear.
> >>>> This event should be included inside the strace output too as the strace was running during that time.
> >>>>
> >>>> One thing I have also noticed shorty AFTER the pgpool on former master started working again by itself is that
> >>>> all processes on former pgpool master were started on Jan 15 , but one pgpool process had other started time shown via the ps axuwf
> >>>> and by coincidence it is the time when the pgpool started working again:
> >>>>
> >>>> pgsql 23021 0.0 0.0 110308 1876 ? S 09:02 0:00 | | \_ pgpool: wait for connection request
> >>>>
> >>>> Maybe it is normal, I just noticed the time of the process is other than other pgpool processes, here is the whole snapshot of ps axuwf related to pgpool:
> >>>>
> >>>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> >>>> root 1975 0.0 0.0 4164 336 ? S 2012 0:00 | \_ supervise pgpool
> >>>> root 3065 0.0 0.0 17740 1424 ? S Jan15 0:00 | | \_ /bin/sh ./run
> >>>> pgsql 3078 0.0 0.0 20296 1788 ? S Jan15 0:00 | | \_ -su -c /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n >> /var/log/pgpool.log
> >>>> pgsql 3090 0.0 0.0 110184 4640 ? S Jan15 0:12 | | \_ /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n
> >>>> pgsql 3151 0.0 0.0 110184 740 ? S Jan15 0:00 | | \_ pgpool: watchdog
> >>>> pgsql 3152 0.0 0.0 184048 1592 ? Sl Jan15 0:43 | | \_ pgpool: lifecheck
> >>>> pgsql 24999 0.0 0.0 4392 352 ? S 09:08 0:00 | | | \_ ping -q -c3 172.31.2.1
> >>>> pgsql 3153 0.0 0.1 119348 11176 ? S Jan15 1:44 | | \_ pgpool: az az 172.31.2.159(50064) idle
> >>>> pgsql 3154 0.0 0.1 119484 11280 ? S Jan15 1:55 | | \_ pgpool: az az 172.31.2.159(50068) idle
> >>>> pgsql 3155 0.0 0.1 119340 11088 ? S Jan15 1:47 | | \_ pgpool: az az 172.31.2.159(49960) idle
> >>>> pgsql 3156 0.0 0.1 119008 10744 ? S Jan15 1:36 | | \_ pgpool: az az 172.31.2.159(50045) idle
> >>>> pgsql 3157 0.0 0.1 119148 10856 ? S Jan15 1:42 | | \_ pgpool: az az 172.31.2.159(50075) idle
> >>>> pgsql 3158 0.0 0.1 119208 10964 ? S Jan15 1:44 | | \_ pgpool: az az 172.31.2.159(50072) idle
> >>>> pgsql 3159 0.0 0.1 118836 10608 ? S Jan15 1:37 | | \_ pgpool: az az 172.31.2.159(50006) idle
> >>>> pgsql 3160 0.0 0.1 118360 10112 ? S Jan15 1:25 | | \_ pgpool: az az 172.31.2.158(57783) idle
> >>>> pgsql 3161 0.0 0.1 119352 11128 ? S Jan15 1:50 | | \_ pgpool: az az 172.31.2.159(50007) idle
> >>>> pgsql 3162 0.0 0.1 118772 10508 ? S Jan15 1:34 | | \_ pgpool: az az 172.31.2.159(50046) idle
> >>>> pgsql 3163 0.0 0.1 119620 11388 ? S Jan15 1:51 | | \_ pgpool: az az 172.31.2.159(50062) idle
> >>>> pgsql 3164 0.0 0.1 120384 12092 ? S Jan15 1:52 | | \_ pgpool: az az 172.31.2.159(50047) idle
> >>>> pgsql 3165 0.0 0.1 119384 11220 ? S Jan15 1:51 | | \_ pgpool: az az 172.31.2.159(50010) idle
> >>>> pgsql 3166 0.0 0.1 120084 11824 ? S Jan15 2:08 | | \_ pgpool: az az 172.31.2.159(50042) idle
> >>>> pgsql 3167 0.0 0.1 120444 12116 ? S Jan15 1:52 | | \_ pgpool: az az 172.31.2.159(50074) idle
> >>>> pgsql 3168 0.0 0.1 119732 11492 ? S Jan15 1:53 | | \_ pgpool: az az 172.31.2.159(50070) idle
> >>>> pgsql 3170 0.0 0.1 119868 11680 ? S Jan15 1:58 | | \_ pgpool: az az 172.31.2.159(50067) idle
> >>>> pgsql 3171 0.0 0.1 120224 11980 ? S Jan15 2:04 | | \_ pgpool: az az 172.31.2.159(49972) idle
> >>>> pgsql 3172 0.0 0.1 119880 11676 ? S Jan15 1:57 | | \_ pgpool: az az 172.31.2.159(50048) idle
> >>>> pgsql 3173 0.0 0.1 119440 11136 ? S Jan15 1:51 | | \_ pgpool: az az 172.31.2.159(50008) idle
> >>>> pgsql 3174 0.0 0.1 119296 11108 ? S Jan15 1:52 | | \_ pgpool: az az 172.31.2.159(50011) idle
> >>>> pgsql 3175 0.0 0.1 118952 10700 ? S Jan15 1:38 | | \_ pgpool: az az 172.31.2.159(50049) idle
> >>>> pgsql 3176 0.0 0.1 119200 11008 ? S Jan15 1:47 | | \_ pgpool: az az 172.31.2.159(50066) idle
> >>>> pgsql 3178 0.0 0.1 119616 11376 ? S Jan15 1:55 | | \_ pgpool: az az 172.31.2.159(49973) idle
> >>>> pgsql 3179 0.0 0.1 118920 10712 ? S Jan15 1:40 | | \_ pgpool: az az 172.31.2.159(50043) idle
> >>>> pgsql 3181 0.0 0.1 120188 11872 ? S Jan15 1:48 | | \_ pgpool: az az 172.31.2.158(57784) idle
> >>>> pgsql 3183 0.0 0.1 119688 11460 ? S Jan15 1:57 | | \_ pgpool: az az 172.31.2.159(50076) idle
> >>>> pgsql 3185 0.0 0.1 119444 11160 ? S Jan15 1:47 | | \_ pgpool: az az 172.31.2.159(49959) idle
> >>>> pgsql 3186 0.0 0.1 119428 11216 ? S Jan15 1:54 | | \_ pgpool: az az 172.31.2.159(50069) idle
> >>>> pgsql 3187 0.0 0.1 119888 11708 ? S Jan15 1:57 | | \_ pgpool: az az 172.31.2.159(50044) idle
> >>>> pgsql 3188 0.0 0.1 119344 11092 ? S Jan15 1:44 | | \_ pgpool: az az 172.31.2.159(50009) idle
> >>>> pgsql 3189 0.0 0.0 110772 1764 ? S Jan15 0:00 | | \_ pgpool: PCP: wait for connection request
> >>>> pgsql 3190 0.0 0.0 110296 1320 ? S Jan15 0:25 | | \_ pgpool: worker process
> >>>> pgsql 23021 0.0 0.0 110308 1876 ? S 09:02 0:00 | | \_ pgpool: wait for connection request
> >>>>
> >>>>
> >>>> My setup:
> >>>> 172.31.2.158 - pgpool1 (former master - all above logs were done within this server)
> >>>> 172.31.2.159 - pgpool2 (former slave)
> >>>> 172.31.2.160 - virtual IP managed by pgpool where db clients connects to (originally assigned to pgpool1 server)
> >>>>
> >>>> And all above hosts i'm having included in /etc/hosts to avoid any DNS influence..
> >>>>
> >>>> I have later today enabled debug_level = 1 in pgpool.conf and limited logfile to 2GB (which limits the log only for abou 10 minutes)
> >>>>
> >>>> Also have included my pgpool.conf files which I believe are setup correctly, but one never knows...
> >>>>
> >>>> Many thanks for any help, although I like pgpool very much it's not stable in my environment which is quite frustrating as I opted for pgpool to improve
> >>>> service availability but till now it's rather exactly opposite.
> >>>>
> >>>> Tomas
> >>>
> >>
> >
>
More information about the pgpool-general
mailing list