View Issue Details

ID: 0000129
Project: Pgpool-II
Category: Bug
View Status: public
Last Update: 2016-01-29 23:43
Reporter: eldad
Assigned To: Muhammad Usama
Priority: high
Severity: crash
Reproducibility: always
Status: closed
Resolution: unable to reproduce
Platform: Linux
OS: RHEL
OS Version: 6.3
Summary: 0000129: pgpool child out of memory
Description: After upgrading from 3.3.3 to 3.4.0-3, queries fail with an out-of-memory exception.
The same hardware and configuration are used.
We have a dedicated host with 16 GB of memory; only 3 GB is in use by pgpool when these errors occur (see Additional Information).
We have 2 pgpool and 4 postgres servers.
I'm attaching the conf file.
Additional Information:
2015-02-15 01:03:01: pid 25048: DETAIL: connecting host=postgres-01.lab port=60184
TopMemoryContext: 234560 total in 25 blocks; 8392 free (7 chunks); 226168 used
  pgpool_child_main_loop: 24576 total in 2 blocks; 12080 free (8 chunks); 12496 used
    child_query_process: 0 total in 0 blocks; 0 free (0 chunks); 0 used
      QueryContextMemoryContext: 8192 total in 3 blocks; 3280 free (0 chunks); 4912 used
      pool_process_query: 8192 total in 1 blocks; 7872 free (0 chunks); 320 used
    SessionContext: 1024 total in 1 blocks; 664 free (0 chunks); 360 used
  pgpool_main_loop: 8192 total in 1 blocks; 8160 free (5 chunks); 32 used
  ErrorContext: 8192 total in 1 blocks; 8160 free (5 chunks); 32 used
2015-02-15 01:03:01: pid 30653: ERROR: out of memory
2015-02-15 01:03:01: pid 30653: DETAIL: Failed on request of size 2121728.
2015-02-15 01:03:01: pid 30653: LOG: garbage data from frontend after receiving terminate message ignored
2015-02-15 01:03:02: pid 29341: DETAIL: connecting host=postgres-02.lab port=43964
TopMemoryContext: 13202336 total in 79 blocks; 10872 free (12 chunks); 13191464 used
  pgpool_child_main_loop: 24576 total in 2 blocks; 12080 free (8 chunks); 12496 used
    child_query_process: 0 total in 0 blocks; 0 free (0 chunks); 0 used
      QueryContextMemoryContext: 4096 total in 2 blocks; 1408 free (1 chunks); 2688 used
      pool_process_query: 0 total in 0 blocks; 0 free (0 chunks); 0 used
      QueryContextMemoryContext: 4096 total in 2 blocks; 152 free (1 chunks); 3944 used
      pool_process_query: 8192 total in 1 blocks; 8160 free (1 chunks); 32 used
    SessionContext: 1024 total in 1 blocks; 592 free (0 chunks); 432 used
  pgpool_main_loop: 8192 total in 1 blocks; 8160 free (5 chunks); 32 used
  ErrorContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
2015-02-15 01:03:03: pid 30489: DETAIL: connecting host=postgres-01.lab port=60197
TopMemoryContext: 289600 total in 35 blocks; 6848 free (10 chunks); 282752 used
  pgpool_child_main_loop: 24576 total in 2 blocks; 12080 free (8 chunks); 12496 used
    child_query_process: 0 total in 0 blocks; 0 free (0 chunks); 0 used
      QueryContextMemoryContext: 4096 total in 2 blocks; 152 free (1 chunks); 3944 used
      pool_process_query: 8192 total in 1 blocks; 8000 free (0 chunks); 192 used
    SessionContext: 1024 total in 1 blocks; 728 free (0 chunks); 296 used
  pgpool_main_loop: 8192 total in 1 blocks; 8160 free (5 chunks); 32 used
  ErrorContext: 8192 total in 1 blocks; 8160 free (4 chunks); 32 used
2015-02-15 01:03:03: pid 30489: ERROR: out of memory
2015-02-15 01:03:03: pid 30489: DETAIL: Failed on request of size 2121728.
2015-02-15 01:03:03: pid 30489: LOG: garbage data from frontend after receiving terminate message ignored
2015-02-15 01:03:03: pid 30621: DETAIL: connecting host=postgres-02.lab port=43971
TopMemoryContext: 391488 total in 54 blocks; 5000 free (19 chunks); 386488 used
  pgpool_child_main_loop: 24576 total in 2 blocks; 12816 free (4 chunks); 11760 used
  pgpool_main_loop: 8192 total in 1 blocks; 8160 free (5 chunks); 32 used
  ErrorContext: 8192 total in 1 blocks; 8160 free (4 chunks); 32 used
2015-02-15 01:03:03: pid 30621: ERROR: out of memory
2015-02-15 01:03:03: pid 30621: DETAIL: Failed on request of size 1024.
2015-02-15 01:03:03: pid 26997: WARNING: child process with pid: 30621 was terminated by segmentation fault
2015-02-15 01:03:03: pid 26997: FATAL: failed to fork a child
2015-02-15 01:03:03: pid 26997: DETAIL: system call fork() failed with reason: Cannot allocate memory
Tags: No tags attached.

Activities

eldad

2015-02-15 20:31

reporter  

pgpool.conf (30,619 bytes)

t-ishii

2015-02-15 20:42

developer   ~0000514

pgpool-II 3.4.1 fixed memory leak bugs. Please try 3.4.1.
http://pgpool.net/mediawiki/index.php/Yum_Repository

eldad

2015-02-15 22:47

reporter   ~0000515

Hi,

Just upgraded to 3.4.1-2, same problem.

2015-02-15 03:44:30: pid 12557: ERROR: out of memory
2015-02-15 03:44:30: pid 12557: DETAIL: Failed on request of size 2121728.
2015-02-15 03:44:43: pid 12516: ERROR: out of memory
2015-02-15 03:44:43: pid 12516: DETAIL: Failed on request of size 1472000.
2015-02-15 03:44:44: pid 12244: FATAL: failed to fork a child
2015-02-15 03:44:44: pid 12244: DETAIL: system call fork() failed with reason: Cannot allocate memory

After the last message, pgpool shut itself down.
It seems that when it reaches 3-3.5 GB of usage, it crashes.
The host has 16 GB.
I checked limits.conf and sysctl.conf; they look OK, see below:
limits.conf:
postgres soft nproc 4500
postgres hard nproc 4500
postgres soft nofile 65536
postgres hard nofile 65536
postgres soft memlock 15728640
postgres hard memlock 15728640

sysctl.conf:
kernel.shmmax=16106127360
kernel.shmall=8244661
kernel.sem=4010 569420 4010 142

No errors found in the postgres logs.

Please advise, it's urgent.

Many thanks,
Eldad

Muhammad Usama

2015-02-16 23:51

developer   ~0000516

Hello Eldad!

I have been trying to reproduce the issue, but with no luck so far. I see no memory leak, and pgpool-II 3.3 and 3.4.1 show almost identical memory consumption in my testing.

From the details above, this does not look like a memory leak: the log you shared suggests the process is holding only ~1 MB of memory, yet it reports the "out of memory" error:

"2015-02-15 01:03:03: pid 30489: ERROR: out of memory
2015-02-15 01:03:03: pid 30489: DETAIL: Failed on request of size 2121728.
2015-02-15 01:03:03: pid 30489: LOG: garbage data from frontend after receiving terminate message ignored
2015-02-15 01:03:03: pid 30621: DETAIL: connecting host=postgres-02.lab port=43971
TopMemoryContext: 391488 total in 54 blocks; 5000 free (19 chunks); 386488 used"

For further investigation, can you please share some more information: which OS you are using, and how you hit the out-of-memory issue. Specifically, the ps output of all the pgpool-II processes, the output of top, the contents of /proc/meminfo, and how long it takes before pgpool-II runs out of memory. Also, is there any specific type of query that causes this issue?
Lastly, as you mentioned, pgpool-II 3.4 starts reporting "out of memory" when it reaches around 3 to 3.5 GB of memory. Do you have any data on how much total memory pgpool-II 3.3 used with a similar configuration and load?
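The diagnostics requested above could be collected with something like the following sketch (the output file names and the process-name pattern are illustrative assumptions, not part of the original report):

```shell
# Collect the pgpool-II memory diagnostics requested above.
# The '[p]gpool' pattern avoids matching the grep process itself.
ps aux | grep '[p]gpool' > pgpool_processes.txt || true  # empty if pgpool is not running
top -b -n 1 > top.txt            # one batch-mode snapshot of top
cat /proc/meminfo > meminfo.txt  # kernel memory statistics
```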

Thanks
Best regards!
Muhammad Usama

eldad

2015-02-17 00:18

reporter   ~0000517

Hi Muhammad,

I downgraded pgpool to the 3.3.3-1 version I had before, but the problem persists, so I'm not sure what the root cause is.
In this version I get these error messages for all the child processes:
2015-02-16 07:09:00 ERROR: pid 7318: pool_create_relcache: cannot allocate memory 2121728
2015-02-16 07:09:00 ERROR: pid 7318: is_system_catalog: pool_create_relcache error

It happens 2-3 minutes after the load tests start on the application side.
I'm attaching the information you requested in the files.
I use RHEL 6.3.
Previously, the pgpool processes usually took 2-3 GB overall.

Thanks for your help,
Eldad

eldad

2015-02-17 00:18

reporter  

processes.txt (34,889 bytes)   

eldad

2015-02-17 00:18

reporter  

meminfo.txt (1,212 bytes)   

eldad

2015-02-17 00:19

reporter  

top.txt (3,329 bytes)   

eldad

2015-02-17 00:25

reporter   ~0000518

From the error log I see that the memory exception always occurs in pool_create_relcache.

I think this query is related:
 SELECT n.nspname as "Schema",
          c.relname as "Name",
          CASE c.relkind WHEN 'r' THEN 'table' WHEN 'v' THEN 'view' WHEN 'm' THEN 'materialized view' WHEN 'i' THEN 'index' WHEN 'S' THEN 'sequence' WHEN 's' THEN 'special' WHEN 'f' THEN 'foreign table' END as "Type",
          pg_catalog.pg_get_userbyid(c.relowner) as "Owner"
        FROM pg_catalog.pg_class c
             LEFT JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
        WHERE c.relkind IN ('r','v','m','S','f','')
              AND n.nspname <> 'pg_catalog'
              AND n.nspname <> 'information_schema'
              AND n.nspname !~ '^pg_toast'
          AND pg_catalog.pg_table_is_visible(c.oid)
        ORDER BY 1,2;

Muhammad Usama

2015-02-17 01:07

developer   ~0000519

Hi
Thanks for providing the information. This appears to be an OS limit setting issue.

Can you please share the output of "ulimit -a"?

eldad

2015-02-17 23:53

reporter   ~0000520

Hi,

here is the output:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 127456
max locked memory (kbytes, -l) 15728640
max memory size (kbytes, -m) unlimited
open files (-n) 65536
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 10240
cpu time (seconds, -t) unlimited
max user processes (-u) 4500
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

eldad

2015-02-18 22:41

reporter   ~0000521

Hi,

I think I found the root cause of this problem; it looks very strange.
The trusted_servers parameter in pgpool.conf can hold a list of servers used for connectivity checks by the watchdog process.
Currently, each pgpool has the other pgpool node in this parameter.
But if the other pgpool is completely down (rebooting, for example), then the active pgpool fails the watchdog lifecheck and the ping to the trusted server, so it thinks it has lost the network and shuts itself down.
To fix it, I added more servers to the list (our DB servers), so even if 1-2 servers are down the ping will still work (I have a total of 5 servers in the list).

Apparently this is what caused pgpool to break apart; at some point we got an out-of-memory exception for all child processes,
even though this parameter is relevant only to the watchdog process.
My guess is that this is a memory leak in the process that runs over the child process memory.
The bug exists in both the old version (3.3.3) and the new version (3.4.1).
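For reference, the workaround described above would correspond to a pgpool.conf fragment along these lines (the hostnames below are placeholders, not the reporter's actual hosts; trusted_servers is the comma-separated list the watchdog lifecheck pings):

```
# Watchdog connectivity-check targets. Listing several hosts means the
# lifecheck ping still succeeds when one or two of them are down.
# Hostnames here are hypothetical examples.
trusted_servers = 'pgpool-02.lab,postgres-01.lab,postgres-02.lab,postgres-03.lab,postgres-04.lab'
```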

Regards,
Eldad

Muhammad Usama

2016-01-29 23:43

developer   ~0000643

I am not able to reproduce the issue, and according to the reporter, he is able to run pgpool-II without any problem after the configuration changes. Also, the system trace shared by the user does not suggest a memory leak in pgpool-II. So I am closing this issue.

Issue History

Date Modified Username Field Change
2015-02-15 20:31 eldad New Issue
2015-02-15 20:31 eldad File Added: pgpool.conf
2015-02-15 20:39 t-ishii Assigned To => t-ishii
2015-02-15 20:39 t-ishii Status new => feedback
2015-02-15 20:42 t-ishii Note Added: 0000514
2015-02-15 22:47 eldad Note Added: 0000515
2015-02-15 22:47 eldad Status feedback => assigned
2015-02-16 23:51 Muhammad Usama Note Added: 0000516
2015-02-17 00:18 eldad Note Added: 0000517
2015-02-17 00:18 eldad File Added: processes.txt
2015-02-17 00:18 eldad File Added: meminfo.txt
2015-02-17 00:19 eldad File Added: top.txt
2015-02-17 00:25 eldad Note Added: 0000518
2015-02-17 01:07 Muhammad Usama Note Added: 0000519
2015-02-17 23:53 eldad Note Added: 0000520
2015-02-18 22:41 eldad Note Added: 0000521
2016-01-20 13:38 t-ishii Assigned To t-ishii => Muhammad Usama
2016-01-29 23:43 Muhammad Usama Note Added: 0000643
2016-01-29 23:43 Muhammad Usama Status assigned => closed
2016-01-29 23:43 Muhammad Usama Resolution open => unable to reproduce