[pgpool-general: 698] Re: Transaction never finishes

Thu Jul 5 20:54:42 JST 2012

On Tue, Jul 3, 2012 at 8:39 AM, Luiz Pasqual <luiz at pasquall.com> wrote:
> On Tue, Jul 3, 2012 at 7:40 AM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>> Luiz,
>>
>>>> On Fri, Jun 29, 2012 at 8:54 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>> On Fri, Jun 29, 2012 at 12:09 PM, Luiz Pasqual <luiz at pasquall.com> wrote:
>>>>>>> On Thu, Jun 28, 2012 at 8:06 PM, Tatsuo Ishii <ishii at postgresql.org> wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> I'm using pgpool-II 3.1.2 with streaming replication and it's working
>>>>>>>>> pretty well. But I'm dealing with a weird situation and I don't know
>>>>>>>>> how to debug:
>>>>>>>>>
>>>>>>>>> Sometimes, some transactions never finishes in the master, here is an
>>>>>>>>> example, the following query:
>>>>>>>>> select * from pg_stat_activity where xact_start < current_timestamp -
>>>>>>>>> '10 minutes'::interval
>>>>>>>>>
>>>>>>>>> Results:
>>>>>>>>> 20994;"****";2445;16385;"****";"";"192.168.**.**";"";44083;"2012-06-27
>>>>>>>>> 05:55:39.525881-03";"2012-06-27 11:17:46.475347-03";"2012-06-27
>>>>>>>>> 11:18:10.044718-03";f;"<IDLE> in transaction"
>>>>>>>>>
>>>>>>>>> This transaction gets AccessShareLock in the relations:
>>>>>>>>> pg_class_relname_nsp_index
>>>>>>>>> pg_class
>>>>>>>>> pg_class_oid_index
>>>>>>>>> pg_namespace
>>>>>>>>> pg_namespace_oid_index
>>>>>>>>> pg_namespace_nspname_index
>>>>>>>>>
>>>>>>>>> And one ExclusiveLock that I couldn't identify the relation.
>>>>>>>>>
>>>>>>>>> Sometimes, depending on the relations that are locked everything fails
>>>>>>>>> (timeout) and a pgpool restart must be done. Anyone knows what is
>>>>>>>>> going on?
>>>>>>>>
>>>>>>>> You want to identify the process id of pgpool which is dealing with
>>>>>>>> PostgreSQL backend process id(in the case above it's 2445).
>>>>>>>>
>>>>>>>> Here is the step to find pgpool process id:
>>>>>>>>
>>>>>>>> 1) Execute pcp_proc_count to find pgpool process list. This command
>>>>>>>>   returns all pgpool process ids.
>>>>>>>>
>>>>>>>> 2) For each process id in #1, execute pcp_proc_info. This will tell
>>>>>>>>   what are the PostgreSQL process ids. Note that the command returns
>>>>>>>>   multiple rows sorting by node id. Usually node id 0 (thus the first
>>>>>>>>   line) is for primary.
>>>>>>>>
>>>>>>>> 3) Look for 2445 in #2 to find the pgpool process id.
>>>>>>>>
>>>>>>>> 4) If you could find the pgpool process id (say 12345), grep pgpool
>>>>>>>>   log by using 12345. This will show what's going on with 12345.
>>>>>>>
>>>>>>> Well, they are all "idle in transaction".
>>>>>>
>>>>>> With a little more debugging, I discovered that the client who uses
>>>>>> this specific process already disconnects from pgpool:
>>>>>>
>>>>>> # ps aux | grep 25003
>>>>>> pgpool: *** *** 127.0.0.1(51456) idle in transaction
>>>>>>
>>>>>> # lsof -i -P -n | grep 25003
>>>>>> pgpool    25003       root    5u  IPv4 39609135      0t0  TCP *:5432 (LISTEN)
>>>>>> pgpool    25003       root    6u  IPv4 39621031      0t0  TCP
>>>>>> 127.0.0.1:5432->127.0.0.1:51456 (CLOSE_WAIT)
>>>>>> pgpool    25003       root    7u  IPv4 39621043      0t0  TCP
>>>>>> 192.168.10.55:33684->192.168.10.101:5432 (ESTABLISHED)
>>>>>> pgpool    25003       root    8u  IPv4 39621044      0t0  TCP
>>>>>> 192.168.10.55:51761->192.168.10.11:5432 (CLOSE_WAIT)
>>>>>>
>>>>>> There is no connection between the client (127.0.0.1:51456) and ( <- )
>>>>>> pgpool (127.0.0.1:5432).
>>>>>>
>>>>>> For some reason, this transaction was aborted, client disconnected
>>>>>> from pgpool, pgpool aborted the transaction on the slave server
>>>>>> (192.168.10.11) but not on the master (192.168.10.55).
>>>>>>
>>>>>> I'm running out of ideas.
>>>>>
>>>>>> 192.168.10.55:33684->192.168.10.101:5432 (ESTABLISHED)
>>>>>
>>>>> What is on 192.168.10.101:5432?
>>>>
>>>> It's my database, a PostgresSQL 9.1.3, streaming replication to 192.168.10.11
>>>>
>>>>
>>>>> Also can you attach gdb to 25003 and take back trace?
>>>>
>>>> Sure:
>>>>
>>>> (gdb) bt
>>>> #0  0x00007f801e2c5073 in __select_nocancel () from /lib64/libc.so.6
>>>> #1  0x000000000041661e in pool_check_fd (cp=<value optimized out>) at
>>>> pool_process_query.c:1050
>>>> #2  0x0000000000419e64 in pool_read (cp=0x70f1d0, buf=0x7fff24248d3f,
>>>> len=1) at pool_stream.c:138
>>>> #3  0x00000000004158db in read_kind_from_backend (frontend=<value
>>>> optimized out>, backend=0x70b420, decided_kind=<value optimized out>)
>>>> at pool_process_query.c:3490
>>>> #4  0x0000000000443cf9 in ProcessBackendResponse (frontend=0x70c450,
>>>> backend=0x70b420, state=0x7fff24248fb8, num_fields=0x7fff24248fbc) at
>>>> pool_proto_modules.c:2164
>>>> #5  0x00000000004188b0 in pool_process_query (frontend=0x70c450,
>>>> backend=0x70b420, reset_request=0) at pool_process_query.c:416
>>>> #6  0x000000000040a4d0 in do_child (unix_fd=4, inet_fd=5) at child.c:354
>>>> #7  0x0000000000404245 in fork_a_child (unix_fd=4, inet_fd=5, id=29)
>>>> at main.c:1072
>>>> #8  0x000000000040726f in main (argc=<value optimized out>,
>>>> argv=<value optimized out>) at main.c:549
>>>> (gdb)
>>>>
>>>> Thanks for your support.
>>>
>>> I'm looking for how reproduce your problem. Is there any easy way to
>>> reproduce?
>>
>> I think I found a cause of the problem.
>> Can you try with attached patch?
>
> Tatsuo,
>
> I deployed the patch, since I can't reproduce the problem, I'll wait
> and see. In general, I see it happens 3 or 4 times a day, so I think
> it's going to be easy to know the results.
>
> I really thank you for this great software.

Tatsuo,

After 48hrs of heavy load without a single problem I think we can say
that your patch solve the problem.

Thank you again.

-- 
Luiz Fernando Pasqual S. Souza
mail: luiz at pasquall.com