[pgpool-general: 1460] Re: Fwd: Pgpool 3.2.2 issue on AIX

Daniele Di Vito adivitog at gmail.com
Wed Mar 6 03:16:16 JST 2013


2013/3/4 Tatsuo Ishii <ishii at postgresql.org>

> > Hi Tatsuo, as promised I've tested pgpool 3.2.3 on AIX 5.2.
> >
> > It behaves in a really strange way. At the beginning, when I started
> > pgpool, it seemed to be working fine;
> > after some stops and starts it began to behave really strangely.
> >
> > It seems to be confused by some cached status. Does a cache_status
> > file exist? I removed the files .s.PGSQL.9898 and .s.PGSQL.9999, but it
> > doesn't change anything.
>
> I don't think so. The cached status file is located at
> /tmp/pgpool_status in your case. From the log, it says:
>
> 2013-03-04 17:44:18 ERROR: pid 72000: Could not read backend status file
> as /tmp/pgpool_status. reason: No such file or directory
>
> So pgpool did not read cached status and this is normal.  (BTW, if you
> want to be sure to ignore the status file, you could start pgpool with
> -D option).
>
>
>
Yes, what you say is right. The strange thing is that the
/tmp/pgpool_status file exists but pgpool is not able to read it, which is
odd because the user has read and write permission on it. Anyway, after I
started using the -D option the file was discarded automatically by the
pgpool process, but this didn't help.
My suspicion that the strange behaviour is caused by some kind of cached
status comes from the fact that at the very beginning, with the same
configuration I have now, everything was working fine.
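
For anyone who wants to double-check the status file independently of pgpool, here is a minimal sketch in Python. The function name status_file_report is made up for this example and is not part of pgpool; the default path is the one quoted from the log above. It only reports what the operating system says about the file for the current user, so it has to be run as the same user that starts pgpool.

```python
import os


def status_file_report(path="/tmp/pgpool_status"):
    """Report what the OS says about the file for the current user.

    Hypothetical helper for illustration; not part of pgpool.
    """
    return {
        "exists": os.path.exists(path),        # False matches "No such file or directory"
        "readable": os.access(path, os.R_OK),  # read permission for the real uid/gid
        "writable": os.access(path, os.W_OK),  # write permission for the real uid/gid
    }


if __name__ == "__main__":
    print(status_file_report())
```

If "exists" comes back False while ls shows the file, the path being opened probably differs from the one being listed.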

>
> From the log file I noticed that the health check function failed to
> connect to backend 1. Are you sure that backend1(devlam0) allows
> connections with user = postgres without password from the host where
> pgpool is running?
>

Yes, backend1(devlam0) allows connections normally. It was down at the
moment I launched pgpool in debug mode.
I get this trouble only when a failover happens: pgpool does not seem to
handle the failover correctly, and the main process goes into defunct
status.

I repeat that until yesterday it was working pretty well; only a few times
did I get strange behaviour, but I was not analyzing the processes properly
at those moments.
The only thing I changed yesterday is that I installed the
pgpool-recovery module compiled from the source tarball, but I do not think
this could make pgpool unstable: logically it should be used only when
calling the pcp_recovery_node command.

Sorry that I didn't write all these details before.

To help you understand better, I attach the debug log and list the ps
output for this scenario:

1) Both backends are working at pgpool startup.

2) I simulate a node 1 failure by sending the stop command to postgres on
the devlam0 server.

This is the process list sequence.

The failover simulation time is 2013-03-05 18:32:10.


process list before failover

[postgres at bldlam:/home/postgres]# ps -fu postgres | grep pgpool
postgres   7580 123522   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres  44862  50512   1 18:30:32  pts/3  0:00 grep pgpool
postgres  48162 123522   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres  49986 123522   0 18:29:46  pts/1  0:00 pgpool: PCP: wait for connection request
postgres  83912 123522   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres  84228 123522   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres 104292 123522   0 18:29:46  pts/1  0:00 pgpool: worker process
postgres 104468 123522   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres 116214 123522   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres 123522  25022   0 18:29:46  pts/1  0:00 bin/pgpool -d -D -n

process list after failover

[postgres at bldlam:/home/postgres]# ps -fu postgres | grep pgpool
postgres   7580      1   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres  41166  50512   1 18:34:05  pts/3  0:00 grep pgpool
postgres  48162      1   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres  49986      1   0 18:29:46  pts/1  0:00 pgpool: PCP: wait for connection request
postgres  83912      1   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres  84228      1   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres 104292      1   0 18:29:46  pts/1  0:00 pgpool: worker process
postgres 104468      1   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
postgres 116214      1   0 18:29:46  pts/1  0:00 pgpool: wait for connection request
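
The symptom in the two listings (children re-parented to pid 1 once the main process is gone) can be spotted automatically. Below is a rough sketch in Python; parse_ps and orphaned_pgpool are names invented for this example. It assumes the eight-column layout shown above (UID PID PPID C STIME TTY TIME CMD) with a single-token STIME, and it treats any command starting with "pgpool:" as a pgpool child, which matches these listings but is otherwise an assumption.

```python
def parse_ps(text):
    """Parse `ps -fu <user>` output into (pid, ppid, command) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 7)
        # Skip headers, prompts and wrapped continuation lines: a data row
        # needs eight columns with numeric PID and PPID.
        if len(fields) < 8 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        rows.append((int(fields[1]), int(fields[2]), fields[7]))
    return rows


def orphaned_pgpool(rows):
    """Return pids of pgpool children whose parent is now pid 1."""
    return [pid for pid, ppid, cmd in rows
            if ppid == 1 and cmd.startswith("pgpool:")]


if __name__ == "__main__":
    import sys
    # Example: ps -fu postgres | python this_script.py
    print(orphaned_pgpool(parse_ps(sys.stdin.read())))
```

Run against the "after failover" listing, this would return every pgpool child pid, while the "before failover" listing would give an empty list.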


--
Daniele Di Vito



> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
>
> > No changes in pgpool.conf nor in postgresql.conf.
> > It has gone back to losing communication with the main process, as
> > described for the 3.2.2 version: the main process exits abnormally.
> >
> > I've tried to change some settings in pgpool.conf, but they have no
> > effect on the pool's behaviour.
> >
> > For example, I tried to increase the child life time because I suspected
> > some issue with child process destruction:
> >
> > child_life_time = 100000000000
> >                                    # Pool exits after being idle for this many seconds
> > child_max_connections = 0
> >                                    # Pool exits after receiving that many connections
> >                                    # 0 means no exit
> > connection_life_time = 0
> >                                    # Connection to backend closes after being idle for this many seconds
> >                                    # 0 means no close
> > client_idle_limit = 0
> >
> >
> > I attach the debug output, the truss output and pgpool.conf.
> >
> > Hoping that someone can help me to solve this trouble on AIX.
> >
> > Thanks
> >
> >> Great!
> >>
> >> Yes, I've just installed 3.2.3 and it seems to be working great!
> >>
> >> Obviously I will test it properly and let you know how this
> >> version works on AIX.
> >>
> >> Thanks again
> >>
> >>
> >>
> >> 2013/2/22 Tatsuo Ishii <ishii at postgresql.org>
> >>
> >>> I'm going to check the data you posted.
> >>>
> >>> In the meantime, I think it is possible your problem is caused by the
> >>> bug fixed in pgpool-II 3.2.3, especially if the problem goes away by
> >>> disabling health checking. Can you try 3.2.3?
> >>> --
> >>> Tatsuo Ishii
> >>> SRA OSS, Inc. Japan
> >>> English: http://www.sraoss.co.jp/index_en.php
> >>> Japanese: http://www.sraoss.co.jp
> >>>
> >>> > Thanks for your quick reply Tatsuo,
> >>> >
> >>> > Before getting your reply I tried to understand where the main process
> >>> > stops by adding a print on the line right above every exit call in the
> >>> > main.c file.
> >>> > I found out that it never stops by calling exit explicitly.
> >>> >
> >>> > After your reply I tracked its behaviour using truss.
> >>> > The truss output is in the attached file.
> >>> >
> >>> >
> >>> > By commenting out the dup2 call in the daemonize function, I was able to
> >>> > see which error happens just before the process "death".
> >>> >
> >>> > This is the error:
> >>> >
> >>> > pool_flush_it: write failed to backend (0). reason: Socket is not connected
> >>> > offset: 0 wlen: 41
> >>> >
> >>> > It happens after the first health_check call.
> >>> >
> >>> > It seems that the socket is not connected to the local backend (backend 0
> >>> > is on the same host where pgpool is running), but strangely
> >>> > replication on backend 0 works normally, so I think that it is connected.
> >>> >
> >>> > The real trouble is that, having lost the main process, pgpool will never
> >>> > check for failover or failback.
> >>> >
> >>> > The pgsql version of every backend is 8.3. I attach the pgpool config
> >>> > file too.
> >>> >
> >>> >
> >>> > Thanks again for your help
> >>> >
> >>> >
> >>> > --
> >>> >
> >>> > Daniele Di vito
> >>> >
> >>> >
> >>> > 2013/2/21 Tatsuo Ishii <ishii at postgresql.org>
> >>> >
> >>> >> > Hi everybody, I've compiled pgpool 3.2.2 on AIX 5.2.
> >>> >> >
> >>> >> > I configured the pool to use replication mode. The configuration
> >>> >> > works really well on some Linux virtual machines, but when I try to
> >>> >> > use pgpool with the same configuration on AIX I have big trouble.
> >>> >> >
> >>> >> > Starting with "pgpool -d", the server seems to start normally:
> >>> >> > it creates the pcp process and the pool connections waiting for
> >>> >> > connection requests.
> >>> >> >
> >>> >> > When I launch "ps -fu postgres | grep pgpool" I get this output:
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > postgres  62164      1   0 10:05:48      -  0:00 pgpool: wait for connection request
> >>> >> > postgres  75470      1   0 10:05:48      -  0:00 pgpool: PCP: wait for connection request
> >>> >> > postgres  84072      1   0 10:05:48      -  0:00 pgpool: wait for connection request
> >>> >> > postgres  96828      1   0 10:05:48      -  0:00 pgpool: wait for connection request
> >>> >> > postgres 100026      1   0 10:05:48      -  0:00 pgpool: wait for connection request
> >>> >> > postgres 106670      1   0 10:05:47      -  0:00 pgpool: wait for connection request
> >>> >> > postgres 109864      1   0 10:05:48      -  0:00 pgpool: worker process
> >>> >> > postgres 116412      1   0 10:05:48      -  0:00 pgpool: wait for connection request
> >>> >> >
> >>> >> > But, as you can see from the output listed above, no pgpool daemon
> >>> >> > is running, and every subprocess created by it now has ppid 1.
> >>> >> >
> >>> >> > If I look into pgpool.pid I get a pid that is not running on the
> >>> >> > AIX machine.
> >>> >> > Obviously, if I try to stop pgpool it says that the process is not
> >>> >> > running, so I have to kill every process and remove every temporary
> >>> >> > file manually.
> >>> >> >
> >>> >> > If I run it without daemonizing, using "pgpool -n",
> >>> >> >
> >>> >> > the pgpool -n process is listed for some minutes in the "ps -fu
> >>> >> > postgres | grep pgpool" output and every subprocess has the right ppid.
> >>> >> > Some minutes later I get the same output I listed for the "pgpool -d"
> >>> >> > start.
> >>> >> >
> >>> >> > Any idea how to solve this trouble?
> >>> >> >
> >>> >> > I've already tried to find errors in debug mode, but no
> >>> >> > error is listed.
> >>> >>
> >>> >> Does AIX have something like "strace" or "truss"? If so, taking a
> >>> >> system call trace with it may provide valuable information. You
> >>> >> should take the system call trace until the pgpool-II parent process
> >>> >> disappears.
> >>> >> --
> >>> >> Tatsuo Ishii
> >>> >> SRA OSS, Inc. Japan
> >>> >> English: http://www.sraoss.co.jp/index_en.php
> >>> >> Japanese: http://www.sraoss.co.jp
> >>> >>
> >>>
> >>
> >>
>