[pgpool-general: 695] Re: Transaction never finishes

Luiz Pasqual luiz at pasquall.com
Tue Jul 3 20:39:26 JST 2012


On Tue, Jul 3, 2012 at 7:40 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
> Luiz,
>
>>> On Fri, Jun 29, 2012 at 8:54 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>> On Fri, Jun 29, 2012 at 12:09 PM, Luiz Pasqual <luiz at pasquall.com> wrote:
>>>>>> On Thu, Jun 28, 2012 at 8:06 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>>>> Hello,
>>>>>>>>
>>>>>>>> I'm using pgpool-II 3.1.2 with streaming replication and it's working
>>>>>>>> pretty well. But I'm dealing with a weird situation that I don't know
>>>>>>>> how to debug:
>>>>>>>>
>>>>>>>> Sometimes, some transactions never finish on the master. Here is an
>>>>>>>> example; the following query:
>>>>>>>> select * from pg_stat_activity where xact_start < current_timestamp -
>>>>>>>> '10 minutes'::interval
>>>>>>>>
>>>>>>>> Results:
>>>>>>>> 20994;"****";2445;16385;"****";"";"192.168.**.**";"";44083;"2012-06-27
>>>>>>>> 05:55:39.525881-03";"2012-06-27 11:17:46.475347-03";"2012-06-27
>>>>>>>> 11:18:10.044718-03";f;"<IDLE> in transaction"
>>>>>>>>
>>>>>>>> This transaction holds AccessShareLock on the following relations:
>>>>>>>> pg_class_relname_nsp_index
>>>>>>>> pg_class
>>>>>>>> pg_class_oid_index
>>>>>>>> pg_namespace
>>>>>>>> pg_namespace_oid_index
>>>>>>>> pg_namespace_nspname_index
>>>>>>>>
>>>>>>>> And one ExclusiveLock for which I couldn't identify the relation.
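
(Something like the query below can resolve which relation each lock belongs
to; pid 2445 is the backend shown in pg_stat_activity above, and the database
name is just a placeholder. An ExclusiveLock with a NULL relation is usually
the transaction's own transactionid/virtualxid lock, which every transaction
holds on itself.)

  psql -d mydb -c "
    SELECT l.locktype, l.mode, l.granted, c.relname
    FROM pg_locks l
    LEFT JOIN pg_class c ON c.oid = l.relation
    WHERE l.pid = 2445;"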
>>>>>>>>
>>>>>>>> Sometimes, depending on which relations are locked, everything fails
>>>>>>>> (times out) and pgpool must be restarted. Does anyone know what is
>>>>>>>> going on?
>>>>>>>
>>>>>>> You want to identify the process id of the pgpool child which is
>>>>>>> dealing with the PostgreSQL backend process id (in the case above it's
>>>>>>> 2445).
>>>>>>>
>>>>>>> Here are the steps to find the pgpool process id:
>>>>>>>
>>>>>>> 1) Execute pcp_proc_count to get the pgpool process list. This command
>>>>>>>   returns all pgpool process ids.
>>>>>>>
>>>>>>> 2) For each process id in #1, execute pcp_proc_info. This will tell
>>>>>>>   you which PostgreSQL backend process ids it is handling. Note that
>>>>>>>   the command returns multiple rows sorted by node id. Usually node
>>>>>>>   id 0 (thus the first line) is the primary.
>>>>>>>
>>>>>>> 3) Look for 2445 in #2 to find the pgpool process id.
>>>>>>>
>>>>>>> 4) Once you have found the pgpool process id (say 12345), grep the
>>>>>>>   pgpool log for 12345. This will show what's going on with that
>>>>>>>   process.
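
(Putting steps 1 through 4 together, a rough shell sketch; this assumes the
positional pcp syntax of pgpool-II 3.1, i.e. timeout, host, pcp port, user,
password, and the credentials and log path below are placeholders.)

  # 1) list all pgpool child process ids
  pcp_proc_count 10 localhost 9898 admin secret

  # 2) for each pgpool pid, print the PostgreSQL backends it is attached to
  for p in $(pcp_proc_count 10 localhost 9898 admin secret); do
      echo "=== pgpool pid $p ==="
      pcp_proc_info 10 localhost 9898 admin secret "$p"
  done

  # 3) look for the PostgreSQL backend pid (2445 here) in the output above,
  # 4) then grep the pgpool log for the matching pgpool pid (say 12345)
  grep 12345 /var/log/pgpool/pgpool.log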
>>>>>>
>>>>>> Well, they are all "idle in transaction".
>>>>>
>>>>> With a little more debugging, I discovered that the client using
>>>>> this specific process has already disconnected from pgpool:
>>>>>
>>>>> # ps aux | grep 25003
>>>>> pgpool: *** *** 127.0.0.1(51456) idle in transaction
>>>>>
>>>>> # lsof -i -P -n | grep 25003
>>>>> pgpool    25003       root    5u  IPv4 39609135      0t0  TCP *:5432 (LISTEN)
>>>>> pgpool    25003       root    6u  IPv4 39621031      0t0  TCP
>>>>> 127.0.0.1:5432->127.0.0.1:51456 (CLOSE_WAIT)
>>>>> pgpool    25003       root    7u  IPv4 39621043      0t0  TCP
>>>>> 192.168.10.55:33684->192.168.10.101:5432 (ESTABLISHED)
>>>>> pgpool    25003       root    8u  IPv4 39621044      0t0  TCP
>>>>> 192.168.10.55:51761->192.168.10.11:5432 (CLOSE_WAIT)
>>>>>
>>>>> There is no longer a connection between the client (127.0.0.1:51456)
>>>>> and pgpool (127.0.0.1:5432); that socket is in CLOSE_WAIT.
>>>>>
>>>>> For some reason, this transaction was aborted: the client disconnected
>>>>> from pgpool, and pgpool aborted the transaction on the slave server
>>>>> (192.168.10.11) but not on the master (192.168.10.101).
>>>>>
>>>>> I'm running out of ideas.
>>>>
>>>>> 192.168.10.55:33684->192.168.10.101:5432 (ESTABLISHED)
>>>>
>>>> What is on 192.168.10.101:5432?
>>>
>>> It's my database, PostgreSQL 9.1.3, with streaming replication to 192.168.10.11.
>>>
>>>
>>>> Also, can you attach gdb to 25003 and take a backtrace?
>>>
>>> Sure:
>>>
>>> (gdb) bt
>>> #0  0x00007f801e2c5073 in __select_nocancel () from /lib64/libc.so.6
>>> #1  0x000000000041661e in pool_check_fd (cp=<value optimized out>) at
>>> pool_process_query.c:1050
>>> #2  0x0000000000419e64 in pool_read (cp=0x70f1d0, buf=0x7fff24248d3f,
>>> len=1) at pool_stream.c:138
>>> #3  0x00000000004158db in read_kind_from_backend (frontend=<value
>>> optimized out>, backend=0x70b420, decided_kind=<value optimized out>)
>>> at pool_process_query.c:3490
>>> #4  0x0000000000443cf9 in ProcessBackendResponse (frontend=0x70c450,
>>> backend=0x70b420, state=0x7fff24248fb8, num_fields=0x7fff24248fbc) at
>>> pool_proto_modules.c:2164
>>> #5  0x00000000004188b0 in pool_process_query (frontend=0x70c450,
>>> backend=0x70b420, reset_request=0) at pool_process_query.c:416
>>> #6  0x000000000040a4d0 in do_child (unix_fd=4, inet_fd=5) at child.c:354
>>> #7  0x0000000000404245 in fork_a_child (unix_fd=4, inet_fd=5, id=29)
>>> at main.c:1072
>>> #8  0x000000000040726f in main (argc=<value optimized out>,
>>> argv=<value optimized out>) at main.c:549
>>> (gdb)
>>>
>>> Thanks for your support.
>>
>> I'm looking into how to reproduce your problem. Is there any easy way to
>> reproduce it?
>
> I think I found a cause of the problem.
> Can you try the attached patch?

Tatsuo,

I deployed the patch. Since I can't reproduce the problem, I'll wait
and see. In general, it happens 3 or 4 times a day, so I think
it will be easy to tell the results.
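
In the meantime, a check along these lines should make a recurrence easy to
spot (the database name is a placeholder; on 9.1 the pg_stat_activity columns
are procpid and current_query):

  psql -d mydb -c "
    SELECT procpid, client_addr, xact_start, current_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
      AND xact_start < current_timestamp - interval '10 minutes';"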

Thank you very much for this great software.

-- 
Luiz Fernando Pasqual S. Souza
mail: luiz at pasquall.com

