[pgpool-general: 1333] Re: WatchDog and pgool sudden stop working

Tatsuo Ishii ishii at postgresql.org
Tue Jan 22 07:51:05 JST 2013


What do you mean by "pgpools stop working"?

I see many clients already connect to pgpool. Can you issue a query
through one of the sessions?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> I have repeated issue on my pgpool servers (two servers monitoring each other via pgpool watchdog).
> 
> About each 2nd or 3rd day one (or both) pgpools stop working, most of times the log starts with:
>  read_startup_packet: incorrect packet length (0)
> 
> and then follow many of those messages on both servers:
>  wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 
> 
> Today it happen only on server which was pgpool master and correctly switched to the other pgpool server.
> This is log from the master:
>  2013-01-18 08:18:20 ERROR: pid 3160: read_startup_packet: incorrect packet length (0)
>  2013-01-18 08:18:23 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>  2013-01-18 08:18:23 LOG:   pid 3152: wd_IP_down: ifconfig down succeeded
>  2013-01-18 08:18:32 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>  2013-01-18 08:18:39 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
>  2013-01-18 08:18:46 LOG:   pid 3152: wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working
> 
> 
> There was no postgresql/network outage, nor I was able to find any corellation to anything on the systems
> which could trigger this.
> 
> Usualy this happen on both servers in same time so I have quickly to solve the situation by pkill -9 pgpool - which then gets
> restarted by DJBs svscan (pkill is not able to actually stop the pgpool when it's in this state).
> But today as the other pgpool was working I was doing a bit more troublehooting on the former pgpool master and have found that
> it was accepting connections from db clients, but the connection never finished (it got hanged).
> Connection to PCP was working all the time OK (attaching pcp_pool_status to this email).
> 
> I have executed strace in pid 3090 to see what's happening but i'm not really able to read the output properly so I would appreciate if someone
> can have a look into attached strace output if she/he can identify what could be the reason for the hang.
> Intereseting enough that after about 40-45 minutes the pgpool on the former pgpool master started to work again properly.
> There was nothing in logfile, just the log entries "wd_lifecheck: lifecheck failed 3 times. pgpool seems not to be working" stopped to apear.
> This event should be included inside the strace output too as the strace was running during that time.
> 
> One thing I have also noticed shorty AFTER the pgpool on former master started working again by itself is that
> all processes on former pgpool master were started on Jan 15 , but one pgpool process had other started time shown via the ps axuwf
> and by coincidence it is the time when the pgpool started working again:
> 
> pgsql    23021  0.0  0.0 110308  1876 ?        S    09:02   0:00  |   |               \_ pgpool: wait for connection request
> 
> Maybe it is normal, I just noticed the time of the process is other than other pgpool processes, here is the whole snapshot of ps axuwf related to pgpool:
> 
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root      1975  0.0  0.0   4164   336 ?        S     2012   0:00  |   \_ supervise pgpool
> root      3065  0.0  0.0  17740  1424 ?        S    Jan15   0:00  |   |   \_ /bin/sh ./run
> pgsql     3078  0.0  0.0  20296  1788 ?        S    Jan15   0:00  |   |       \_ -su -c /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n >> /var/log/pgpool.log
> pgsql     3090  0.0  0.0 110184  4640 ?        S    Jan15   0:12  |   |           \_ /usr/local/bin/pgpool -f /usr/local/etc/pgpool.conf -n
> pgsql     3151  0.0  0.0 110184   740 ?        S    Jan15   0:00  |   |               \_ pgpool: watchdog
> pgsql     3152  0.0  0.0 184048  1592 ?        Sl   Jan15   0:43  |   |               \_ pgpool: lifecheck
> pgsql    24999  0.0  0.0   4392   352 ?        S    09:08   0:00  |   |               |   \_ ping -q -c3 172.31.2.1
> pgsql     3153  0.0  0.1 119348 11176 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50064) idle
> pgsql     3154  0.0  0.1 119484 11280 ?        S    Jan15   1:55  |   |               \_ pgpool: az az 172.31.2.159(50068) idle
> pgsql     3155  0.0  0.1 119340 11088 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(49960) idle
> pgsql     3156  0.0  0.1 119008 10744 ?        S    Jan15   1:36  |   |               \_ pgpool: az az 172.31.2.159(50045) idle
> pgsql     3157  0.0  0.1 119148 10856 ?        S    Jan15   1:42  |   |               \_ pgpool: az az 172.31.2.159(50075) idle
> pgsql     3158  0.0  0.1 119208 10964 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50072) idle
> pgsql     3159  0.0  0.1 118836 10608 ?        S    Jan15   1:37  |   |               \_ pgpool: az az 172.31.2.159(50006) idle
> pgsql     3160  0.0  0.1 118360 10112 ?        S    Jan15   1:25  |   |               \_ pgpool: az az 172.31.2.158(57783) idle
> pgsql     3161  0.0  0.1 119352 11128 ?        S    Jan15   1:50  |   |               \_ pgpool: az az 172.31.2.159(50007) idle
> pgsql     3162  0.0  0.1 118772 10508 ?        S    Jan15   1:34  |   |               \_ pgpool: az az 172.31.2.159(50046) idle
> pgsql     3163  0.0  0.1 119620 11388 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50062) idle
> pgsql     3164  0.0  0.1 120384 12092 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50047) idle
> pgsql     3165  0.0  0.1 119384 11220 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50010) idle
> pgsql     3166  0.0  0.1 120084 11824 ?        S    Jan15   2:08  |   |               \_ pgpool: az az 172.31.2.159(50042) idle
> pgsql     3167  0.0  0.1 120444 12116 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50074) idle
> pgsql     3168  0.0  0.1 119732 11492 ?        S    Jan15   1:53  |   |               \_ pgpool: az az 172.31.2.159(50070) idle
> pgsql     3170  0.0  0.1 119868 11680 ?        S    Jan15   1:58  |   |               \_ pgpool: az az 172.31.2.159(50067) idle
> pgsql     3171  0.0  0.1 120224 11980 ?        S    Jan15   2:04  |   |               \_ pgpool: az az 172.31.2.159(49972) idle
> pgsql     3172  0.0  0.1 119880 11676 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50048) idle
> pgsql     3173  0.0  0.1 119440 11136 ?        S    Jan15   1:51  |   |               \_ pgpool: az az 172.31.2.159(50008) idle
> pgsql     3174  0.0  0.1 119296 11108 ?        S    Jan15   1:52  |   |               \_ pgpool: az az 172.31.2.159(50011) idle
> pgsql     3175  0.0  0.1 118952 10700 ?        S    Jan15   1:38  |   |               \_ pgpool: az az 172.31.2.159(50049) idle
> pgsql     3176  0.0  0.1 119200 11008 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(50066) idle
> pgsql     3178  0.0  0.1 119616 11376 ?        S    Jan15   1:55  |   |               \_ pgpool: az az 172.31.2.159(49973) idle
> pgsql     3179  0.0  0.1 118920 10712 ?        S    Jan15   1:40  |   |               \_ pgpool: az az 172.31.2.159(50043) idle
> pgsql     3181  0.0  0.1 120188 11872 ?        S    Jan15   1:48  |   |               \_ pgpool: az az 172.31.2.158(57784) idle
> pgsql     3183  0.0  0.1 119688 11460 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50076) idle
> pgsql     3185  0.0  0.1 119444 11160 ?        S    Jan15   1:47  |   |               \_ pgpool: az az 172.31.2.159(49959) idle
> pgsql     3186  0.0  0.1 119428 11216 ?        S    Jan15   1:54  |   |               \_ pgpool: az az 172.31.2.159(50069) idle
> pgsql     3187  0.0  0.1 119888 11708 ?        S    Jan15   1:57  |   |               \_ pgpool: az az 172.31.2.159(50044) idle
> pgsql     3188  0.0  0.1 119344 11092 ?        S    Jan15   1:44  |   |               \_ pgpool: az az 172.31.2.159(50009) idle
> pgsql     3189  0.0  0.0 110772  1764 ?        S    Jan15   0:00  |   |               \_ pgpool: PCP: wait for connection request
> pgsql     3190  0.0  0.0 110296  1320 ?        S    Jan15   0:25  |   |               \_ pgpool: worker process
> pgsql    23021  0.0  0.0 110308  1876 ?        S    09:02   0:00  |   |               \_ pgpool: wait for connection request
> 
> 
> My setup:
> 172.31.2.158    - pgpool1 (former master - all above logs were done within this server)
> 172.31.2.159    - pgpool2 (former slave)
> 172.31.2.160    - virtual IP managed by pgpool where db clients connects to (originally assigned to pgpool1 server)
> 
> And all above hosts i'm having included in /etc/hosts to avoid any DNS influence..
> 
> I have later today enabled debug_level = 1 in pgpool.conf and limited logfile to 2GB (which limits the log only for abou 10 minutes)
> 
> Also have included my pgpool.conf files which I believe are setup correctly, but one never knows...
> 
> Many thanks for any help, although I like pgpool very much it's not stable in my environment which is quite frustrating as I opted for pgpool to improve
> service availability but till now it's rather exactly opposite.
> 
> Tomas


More information about the pgpool-general mailing list