[pgpool-general: 1897] Re: Pgpool is unable to connect backend PostgreSQL

Tatsuo Ishii ishii at postgresql.org
Tue Jul 16 12:08:05 JST 2013


Oops. Small correction to the patch.

+		if (health_check_timer_expired && getpid() != mypid)		/* has health check timer expired */

should be:

+		if (health_check_timer_expired && getpid() == mypid)		/* has health check timer expired */

Included is the new patch.
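
To make the intent of the guard clearer, here is a rough sketch
(illustration only, not the actual pgpool source); mypid is assumed to
hold the pid of the pgpool main process, recorded at startup:

    #include <signal.h>
    #include <unistd.h>

    extern pid_t mypid;            /* main process pid, recorded at startup */
    extern volatile sig_atomic_t health_check_timer_expired;

    static int timer_guard(int fd)
    {
        /* Only the main process owns the health check timer, so only
         * it should honor the flag; a forked child may still carry a
         * stale non-zero value inherited from its parent. */
        if (health_check_timer_expired && getpid() == mypid)
        {
            close(fd);
            return -1;      /* abort this connection attempt */
        }
        return 0;
    }
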
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

> Ok. I think I finally understand what's going on here.
> 
> Pgpool main process (14317) started health checking at Jul 12 09:17:04.
> 
> Jul 12 09:17:04 purple1-node1-ps pgpool[14317]: starting health checking
> 
> The pgpool main process set the timer to fire at 09:17:14 because you
> set health_check_timeout to 10.  This time the health check completed
> successfully, and the timer for 09:17:14 was blocked by calling
> signal(SIGALRM, SIG_IGN).
> 
> Unfortunately the child life time expired at 09:17:14, and the pgpool
> main process was busy at that moment handling it.
> 
> Jul 12 09:17:14 purple1-node1-ps pgpool[16789]: child life 300 seconds expired
> Jul 12 09:17:14 purple1-node1-ps pgpool[14317]: reap_handler called
> 
> Jul 12 09:17:14 purple1-node1-ps pgpool[14317]: starting health checking
> 
> The pgpool main process re-enabled the timer and reset the timer
> variable (health_check_timer_expired = 0). But when the timer was
> re-enabled, the signal handler for the timer set
> health_check_timer_expired back to 1.  As a result pgpool thought that
> the health check timer had expired.
> 
> Jul 12 09:17:14 purple1-node1-ps pgpool[14317]: health_check: health check timer has been already expired before attempting to connect to 0 th backend
> 
> Thus failover happened even though the backend was running fine.
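> 
> (For illustration only, and not necessarily what the attached patch
> does: one way to avoid this kind of race is to cancel any pending
> alarm before clearing the flag, and only then re-arm the timer.  The
> handler name below is made up.)
> 
> 	signal(SIGALRM, SIG_IGN);           /* ignore further timer signals  */
> 	alarm(0);                           /* cancel any alarm still armed  */
> 	health_check_timer_expired = 0;     /* now safe to clear the flag    */
> 	signal(SIGALRM, health_check_timer_handler);
> 	alarm(health_check_timeout);        /* re-arm for the next check     */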
> 
> Besides this,
> 
>>> This seems very strange. The error comes here:
> 
> I can now think of an explanation. When the child life time expires,
> pgpool forks off a new process. If the global variable
> health_check_timer_expired is 1 for the reason above, the problem you
> saw could happen, because the child process inherits this value.
> 
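> (As a sketch of that second point, again for illustration only; the
> child entry point name is made up:)
> 
> 	pid_t pid = fork();
> 	if (pid == 0)
> 	{
> 		/* In the child: clear state inherited from the main
> 		 * process.  A stale health_check_timer_expired = 1 would
> 		 * otherwise make the child's very first backend connect
> 		 * fail immediately. */
> 		health_check_timer_expired = 0;
> 		child_main();	/* made-up child entry point */
> 	}
> 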
> The included patch addresses the problems above. Could you try it out?
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp
> 
>> Hi, thanks for your reply.
>> 
>>>> What kind of pgpool installation are you using?
>> I compiled pgpool from the source code, pgpool-II-3.2.3.tar.gz
>>>>  What kind of platform are you using?
>> We use Ubuntu 12.04 on HP Cloud server.
>> 
>>>> What does your pgpool.conf look like?
>> Attaching pgpool.conf
>> 
>> Also attached is the syslog file from the time of the problem. You can
>> look for the line
>> "Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: I am 11465 accept fd 7"
>> 
>> By the way, postgres was up and running at that time; there is nothing
>> in its logs, and its processes' uptime shows they started a week ago...
>>
>> Thanks in advance for your help
>> Larisa.
>> 
>> 
>> 
>> On Mon, Jul 15, 2013 at 5:00 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> 
>>> > Hi
>>> > I am hitting the same issue as described in the mail [pgpool-general: 1815]
>>> > Pgpool is unable to connect backend PostgreSQL.
>>>
>>> I guess [pgpool-general: 1815] is different from your case (my guess
>>> is that case is somewhat related to an Amazon EC2 environment
>>> problem). Moreover, your case seems very unique and strange.
>>>
>>> > While connected to a single postgres node, after a while pgpool loses
>>> > its connection to the running postgres db, restarts all child processes
>>> > and stays in a running state, unable to connect to the db.
>>> >
>>> > Pgpool version 3.2.3
>>> > Postgres version 9.2.4
>>> >
>>> > Part of the log:
>>> > --------------------
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: connection received:
>>> > host=10.4.225.120 port=41090
>>>
>>> Process 11465 is a pgpool child process and is responsible for actual
>>> pgpool functions.
>>>
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: Protocol Major: 3 Minor: 0 database: hpadb user: hpauser
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: new_connection: connecting 0 backend
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: connect_inet_domain_socket_by_port: health check timer expired
>>>
>>> This seems very strange. The error comes here:
>>>
>>>                 if (health_check_timer_expired)         /* has health check timer expired */
>>>                 {
>>>                         pool_log("connect_inet_domain_socket_by_port: health check timer expired");
>>>                         close(fd);
>>>                         return -1;
>>>                 }
>>>
>>> "health_check_timer_expired" is a global variable used in pgpool main
>>> process, which is responsible for management of pgpool, including:
>>> health check, failover etc. The variable is only meaningful in the
>>> main process and should not be set to non 0 in pgpool child. Moreover,
>>> the only place set the variable to non 0 is the signal handler which
>>> is set by main process.
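>>>
>>> (As a sketch of that arrangement, with illustrative names rather
>>> than the exact pgpool source:)
>>>
>>>     /* set to 1 only by the SIGALRM handler in the main process */
>>>     volatile sig_atomic_t health_check_timer_expired = 0;
>>>
>>>     static void health_check_timer_handler(int sig)
>>>     {
>>>             health_check_timer_expired = 1;
>>>     }
>>>
>>>     /* main process only, just before a health check starts */
>>>     signal(SIGALRM, health_check_timer_handler);
>>>     alarm(health_check_timeout);    /* the configured timeout */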
>>>
>>> Once the error occurs, pgpool starts failover, as you see.
>>>
>>> I've never seen this kind of report before. What kind of pgpool
>>> installation are you using? (compiled from source code or from
>>> packages?) What kind of platform are you using? What does your
>>> pgpool.conf look like?
>>> --
>>> Tatsuo Ishii
>>> SRA OSS, Inc. Japan
>>> English: http://www.sraoss.co.jp/index_en.php
>>> Japanese: http://www.sraoss.co.jp
>>>
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: connection to
>>> > purple1_node1_ps(5432) failed
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: new_connection: create_cp() failed
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11465]: degenerate_backend_set: 0
>>> > fail over request from pid 11465
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler called
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: starting to select new master node
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: starting degeneration.
>>> > shutdown host purple1_node1_ps(5432)
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: no valid DB node found
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: Restart all children
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 4388
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 9597
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[18648]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[4388]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps rsyslogd-2177: imuxsock lost 85 messages
>>> > from pid 9597 due to rate-limiting
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[9597]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 18648
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[29409]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 29409
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 11454
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14323]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[11454]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 14323
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 22349
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[22349]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 23617
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 29410
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[31511]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[29410]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[14317]: failover_handler: kill 31511
>>> > Jul 12 09:32:14 purple1-node1-ps pgpool[4385]: child received shutdown
>>> > request signal 3
>>> > Jul 12 09:32:14 purple1-node1-ps rsyslogd-2177: imuxsock lost 757 messages from pid 23617 due to rate-limiting
>>> >
>>> > Could you please explain?
>>> > Thanks
>>> > Larisa.
>>>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: health_check.patch
Type: text/x-patch
Size: 2460 bytes
Desc: not available
URL: <http://www.sraoss.jp/pipermail/pgpool-general/attachments/20130716/c5ca6185/attachment.bin>

