[pgpool-general: 2784] Re: Child process hangs in active state
Yugo Nagata
nagata at sraoss.co.jp
Mon Apr 21 16:40:57 JST 2014
Hi, thank you for your report and analysis.
> `len` and `bufsz2` being over 800MB looks extremely odd. Could this be a
> bug in pgpool?
> The ps output shows that this process alone has allocated over 800MB of
> address space.
> This has happened to us several times, but we have no idea what causes it,
> and we have had to kill the process each time.
I agree that this looks odd. For some reason, the length of the
ParseComplete ('1') message from the backend might be corrupted, or
pgpool might fail to read it correctly.
I'll investigate this. Could you please send your pgpool.conf and the
log messages of both pgpool and postgres?
On Mon, 21 Apr 2014 15:28:52 +0900
Junegunn Choi <junegunn.c at gmail.com> wrote:
> Hi, we've run into a very strange situation where a child process hangs in
> active state.
>
>
> *** Versions
>
> - os: RHEL 5.7
> - pgpool: 3.3.2
> - backend: EnterpriseDB 9.3.1.3 on x86_64
> - edb jdbc driver
>
>
> *** show pool_pools;
>
> pool_pid | start_time          | pool_id | backend_id | database | username     | create_time         | majorversion | minorversion | pool_counter | pool_backendpid | pool_connected
> ----------+---------------------+---------+------------+----------+--------------+---------------------+--------------+--------------+--------------+-----------------+----------------
> 26462    | 2014-04-18 17:10:58 | 0       | 0          | cloud    | enterprisedb | 2014-04-18 17:10:58 | 3            | 0            | 11518        | 20546           | 1
> 26462    | 2014-04-18 17:10:58 | 0       | 1          | cloud    | enterprisedb | 2014-04-18 17:10:58 | 3            | 0            | 11518        | 29160           | 1
> 26462    | 2014-04-18 17:10:58 | 1       | 0          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 1       | 1          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 2       | 0          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 2       | 1          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 3       | 0          |          |              |                     | 0            | 0            | 0            | 0               | 0
> 26462    | 2014-04-18 17:10:58 | 3       | 1          |          |              |                     | 0            | 0            | 0            | 0               | 0
>
>
> *** pg_stat_activity
>
> edb=# select * from pg_stat_activity where pid = 20546;
> -[ RECORD 1 ]----+---------------------------------
> datid            | 16391
> datname          | cloud
> pid              | 20546
> usesysid         | 10
> usename          | enterprisedb
> application_name |
> client_addr      | 10.12.240.94
> client_hostname  |
> client_port      | 30104
> backend_start    | 18-APR-14 17:10:58.329551 +09:00
> xact_start       | 18-APR-14 17:14:36.83873 +09:00
> query_start      | 18-APR-14 17:14:36.83873 +09:00
> state_change     | 18-APR-14 17:14:36.838732 +09:00
> waiting          | f
> state            | active
> query            | SET extra_float_digits = 3
>
>
> *** ps output (pgpool server)
>
> pgpool   32395  0.0  0.0  24160  4848 ?  S  Apr15  0:11  \_ /db/pgpool/bin/pgpool -n -D -f /db/pgpool/conf/pgm/pgpool.conf -F /db/pgpool/conf/pgm/pcp.conf -a /db/pgpool/conf/pgm/pool_hba.conf
> pgpool   18534  0.0  0.0  24160   704 ?  S  Apr18  0:00  |  \_ pgpool: PCP: wait for connection request
> pgpool   18535  0.0  0.0  24160  1076 ?  S  Apr18  0:03  |  \_ pgpool: worker process
> pgpool   26462  0.0  0.0 830632  9032 ?  S  Apr18  0:10  |  \_ pgpool: enterprisedb cloud ---------(redacted)---------(15878) idle
> pgpool   11259  0.0  0.0  24160   876 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> pgpool   11261  0.0  0.0  24160   864 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> pgpool   11262  0.0  0.0  24160   864 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> pgpool   11263  0.0  0.0  24160   864 ?  S  13:31  0:00  |  \_ pgpool: wait for connection request
> ...
>
>
> *** ps output (backend)
>
> $ ps auxf | grep 20546
> hanadmin 12046  0.0  0.0   61188  744 pts/8 S+ 15:07 0:00 \_ grep 20546
> 541      20546  0.0  0.0 8761172 7448 ?     Ss Apr18 0:07 \_ postgres: enterprisedb cloud 10.12.240.94[30104] PARSE
>
>
>
> *** gdb trace
>
> (gdb) bt
> #0  0x00000033308cd6d3 in __select_nocancel () from /lib64/libc.so.6
> #1  0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951
> #2  0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264
> #3  0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136
> #4  0x00000000004499bb in ProcessBackendResponse (frontend=0xbaf1180, backend=0xbad8330, state=0x7fff94626060, num_fields=0x7fff94626066) at pool_proto_modules.c:2670
> #5  0x000000000041b69d in pool_process_query (frontend=0xbaf1180, backend=0xbad8330, reset_request=0) at pool_process_query.c:289
> #6  0x000000000040ad8a in do_child (unix_fd=4, inet_fd=5) at child.c:355
> #7  0x000000000040456f in fork_a_child (unix_fd=4, inet_fd=5, id=0) at main.c:1258
> #8  0x0000000000404897 in reaper () at main.c:2482
> #9  0x0000000000404c25 in pool_sleep (second=<value optimized out>) at main.c:2679
> #10 0x0000000000407a0a in main (argc=<value optimized out>, argv=<value optimized out>) at main.c:856
> (gdb) up
> #1  0x0000000000418c71 in pool_check_fd (cp=<value optimized out>) at pool_process_query.c:951
> 951             fds = select(fd+1, &readmask, NULL, &exceptmask, timeoutp);
> (gdb)
> #2  0x000000000041d0e4 in pool_read2 (cp=0xbadfc70, len=822083579) at pool_stream.c:264
> 264             if (pool_check_fd(cp))
> (gdb)
> #3  0x0000000000418803 in SimpleForwardToFrontend (kind=49 '1', frontend=0xbaf1180, backend=0xbad8330) at pool_process_query.c:1136
> 1136                    p = pool_read2(CONNECTION(backend, i), len);
> (gdb) p len
> $1 = 822083580
> (gdb) p *backend->slots[i].con
> $3 = {fd = 10, wbuf = 0xbae01d0 "P", wbufsz = 8192, wbufpo = 0, ssl_active = -1, hp = 0xbae21e0 "1", po = 0, bufsz = 1024, len = 0, sbuf = 0x0, sbufsz = 0, buf2 = 0x2aaf32c92010 "\004OMMIT",
>   bufsz2 = 822083584, buf3 = 0x0, bufsz3 = 0, isbackend = 1, db_node_id = 1, tstate = 73 'I', is_internal_transaction_started = 0 '\000', auth_kind = 0, pwd_size = 0,
>   password = '\000' <repeats 1023 times>, salt = "\000\000\000", params = {num = 0, names = 0xbae25f0, values = 0xbae2a00}, no_forward = 0, kind = 0 '\000', protoVersion = 0, raddr = {addr = {
>     ss_family = 0, __ss_align = 0, __ss_padding = '\000' <repeats 111 times>}, salen = 0}, auth_method = uaReject, auth_arg = 0x0, database = 0x0, username = 0x0}
>
>
> ***
>
> `len` and `bufsz2` being over 800MB looks extremely odd. Could this be a
> bug in pgpool?
> The ps output shows that this process alone has allocated over 800MB of
> address space.
> This has happened to us several times, but we have no idea what causes it,
> and we have had to kill the process each time.
>
> Any idea?
>
> --
> cheers,
> junegunn choi.
--
Yugo Nagata <nagata at sraoss.co.jp>