[pgpool-general: 2784] Re: Child process hangs in active state
Yugo Nagata
nagata at sraoss.co.jp
Mon Apr 21 16:40:57 JST 2014
Hi, thank you for your report and analysis.
> `len` and `bufsz2` being over 800MB looks extremely odd. Could this be a
> bug in pgpool?
> The ps output shows that this process alone has allocated over 800MB of
> address space.
> This has happened to us several times, but we have no idea what causes it,
> and we have had to kill the process each time.
I agree that this looks odd. For some reason, the length of the
ParseComplete ('1') message from the backend might be corrupted, or
pgpool might fail to read it correctly.
I'll investigate this. Could you please send your pgpool.conf and the
log messages of both pgpool and postgres?
On Mon, 21 Apr 2014 15:28:52 +0900
Junegunn Choi <junegunn.c at gmail.com> wrote:
> Hi, we've run into a very strange situation where a child process hangs in
> active state.
>
>
> *** Versions
>
> - os: RHEL 5.7
> - pgpool: 3.3.2
> - backend: EnterpriseDB 9.3.1.3 on x86_64
> - edb jdbc driver
>
>
> *** show pool_pools;
>
> pool_pid | start_time          | pool_id | backend_id | database | username     | create_time         | majorversion | minorversion | pool_counter | pool_backendpid | pool_connected
> ----------+---------------------+---------+------------+----------+--------------+---------------------+--------------+--------------+--------------+-----------------+----------------
> 26462    | 2014-04-18 17:10:58 | 0       | 0          | cloud    | enterprisedb | 2014-04-18 17:10:58 | 3            | 0            | 11518        | 20546           | 1
> 26462    | 2014-04-18 17:10:58 | 0       | 1          | cloud    | enterprisedb | 2014-04-18 17:10:58 | 3            | 0            | 11518        | 29160           | 1
> 26462    | 2014-04-18 17:10:58 | 1       | 0          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 1       | 1          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 2       | 0          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 2       | 1          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 3       | 0          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 3       | 1          |          |              |                     | 0            | 0            | 0            | 0               | 0
>
>
> *** pg_stat_activity
>
> edb=# select * from pg_stat_activity where pid = 20546;
> -[ RECORD 1 ]----+---------------------------------
> datid            | 16391
> datname          | cloud
> pid              | 20546
> usesysid         | 10
> usename          | enterprisedb
> application_name |
> client_addr      | 10.12.240.94
> client_hostname  |
> client_port      | 30104
> backend_start    | 18-APR-14 17:10:58.329551 +09:00
> xact_start       | 18-APR-14 17:14:36.83873 +09:00
> query_start      | 18-APR-14 17:14:36.83873 +09:00
> state_change     | 18-APR-14 17:14:36.838732 +09:00
> waiting          | f
> state            | active
> query            | SET extra_float_digits = 3
>
>
> *** ps output (pgpool server)
>
> pgpool   32395  0.0  0.0  24160  4848 ?  S  Apr15  0:11  \_ /db/pgpool/bin/pgpool -n -D -f /db/pgpool/conf/pgm/pgpool.conf -F /db/pgpool/conf/pgm/pcp.conf -a /db/pgpool/conf/pgm/pool_hba.conf
> pgpool   18534  0.0  0.0  24160   704 ?  S  Apr18  0:00  |  \_ pgpool: PCP: wait for connection request
> pgpool   18535  0.0  0.0  24160  1076 ?  S  Apr18  0:03  |  \_ pgpool: worker process
> pgpool   26462  0.0  0.0 830632  9032 ?  S  Apr18  0:10  |  \_ pgpool: enterprisedb cloud ---------(redacted)---------(15878) idle
> pgpool   11259  0.0  0.0  24160   876 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> pgpool   11261  0.0  0.0  24160   864 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> pgpool   11262  0.0  0.0  24160   864 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> pgpool   11263  0.0  0.0  24160   864 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> ...
>
>
> *** ps output (backend)
>
> $ ps auxf | grep 20546
> hanadmin 12046  0.0  0.0   61188  744 pts/8 S+ 15:07 0:00 \_ grep 20546
> 541      20546  0.0  0.0 8761172 7448 ?     Ss Apr18 0:07 \_ postgres: enterprisedb cloud 10.12.240.94[30104] PARSE
>
>
>
> *** gdb trace
>
> (gdb) bt
> #0  0x00000033308cd6d3 in __select_nocancel () from /lib64/libc.so.6
> #1  0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951
> #2  0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264
> #3  0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136
> #4  0x00000000004499bb in ProcessBackendResponse (frontend=0xbaf1180, backend=0xbad8330, state=0x7fff94626060, num_fields=0x7fff94626066) at pool_proto_modules.c:2670
> #5  0x000000000041b69d in pool_process_query (frontend=0xbaf1180, backend=0xbad8330, reset_request=0) at pool_process_query.c:289
> #6  0x000000000040ad8a in do_child (unix_fd=4, inet_fd=5) at child.c:355
> #7  0x000000000040456f in fork_a_child (unix_fd=4, inet_fd=5, id=0) at main.c:1258
> #8  0x0000000000404897 in reaper () at main.c:2482
> #9  0x0000000000404c25 in pool_sleep (second=<value optimized out>) at main.c:2679
> #10 0x0000000000407a0a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:856
> (gdb) up
> #1  0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951
> 951             fds = select(fd+1, &readmask, NULL, &exceptmask, timeoutp);
> (gdb)
> #2  0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264
> 264             if (pool_check_fd(cp))
> (gdb)
> #3  0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136
> 1136                    p = pool_read2(CONNECTION(backend, i), len);
> (gdb) p len
> $1 = 822083580
> (gdb) p *backend->slots[i].con
> $3 = {fd = 10, wbuf = 0xbae01d0 "P", wbufsz = 8192, wbufpo = 0, ssl_active = -1, hp = 0xbae21e0 "1", po = 0, bufsz = 1024, len = 0, sbuf = 0x0, sbufsz = 0, buf2 = 0x2aaf32c92010 "\004OMMIT",
>   bufsz2 = 822083584, buf3 = 0x0, bufsz3 = 0, isbackend = 1, db_node_id = 1, tstate = 73 'I', is_internal_transaction_started = 0 '\000', auth_kind = 0, pwd_size = 0,
>   password = '\000' <repeats 1023 times>, salt = "\000\000\000", params = {num = 0, names = 0xbae25f0, values = 0xbae2a00}, no_forward = 0, kind = 0 '\000', protoVersion = 0, raddr = {addr = {
>     ss_family = 0, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 0}, auth_method = uaReject, auth_arg = 0x0, database = 0x0, username = 0x0}
>
>
> ***
>
> `len` and `bufsz2` being over 800MB looks extremely odd. Could this be a
> bug in pgpool?
> The ps output shows that this process alone has allocated over 800MB of
> address space.
> This has happened to us several times, but we have no idea what causes it,
> and we have had to kill the process each time.
>
> Any idea?
>
> --
> cheers,
> junegunn choi.
--
Yugo Nagata <nagata at sraoss.co.jp>