Javacpp-presets: Mat in opencv leaks off-heap memory and causes OutOfMemoryError

Created on 18 May 2017  Â·  29Comments  Â·  Source: bytedeco/javacpp-presets

I'm using opencv in javacpp-presets and I found the deallocator of Mat is not working properly.

I test this by creating Mat objects (without keeping their references)

import org.bytedeco.javacpp.Pointer;
import org.bytedeco.javacpp.opencv_core;

public class MemoryTest {

    public static void main(String args[]) {
        for (int i = 0; i < 10000000; ++i) {
            opencv_core.Mat mat = new opencv_core.Mat(200, 100, opencv_core.CV_8UC3);
            if (i % 10000 == 0) {
                long maxPhysicalBytes = Pointer.maxPhysicalBytes();
                long physicalBytes = Pointer.physicalBytes();
                Runtime runtime = Runtime.getRuntime();
                long usedBytes = runtime.totalMemory() - runtime.freeMemory();
                System.out.printf("%d %d %d %d\n", i, maxPhysicalBytes, physicalBytes, usedBytes);
            }
        }
    }

}

The output

0 932184064 56340480 10609104
10000 932184064 100937728 11909384
......
100000 932184064 500686848 22312256
......
180000 932184064 856911872 32505856
2017-05-18T19:29:44.556+0800: 0.574: [GC (Allocation Failure) [PSYoungGen: 31744K->5088K(36864K)] 31744K->22919K(121856K), 0.0309626 secs] [Times: user=0.07 sys=0.02, real=0.03 secs] 
190000 932184064 896753664 26030728
......
250000 932184064 913297408 33923344
2017-05-18T19:29:44.760+0800: 0.778: [GC (System.gc()) [PSYoungGen: 15913K->5104K(68608K)] 33744K->31671K(153600K), 0.0131685 secs] [Times: user=0.04 sys=0.03, real=0.02 secs] 
2017-05-18T19:29:44.773+0800: 0.791: [Full GC (System.gc()) [PSYoungGen: 5104K->0K(68608K)] [ParOldGen: 26567K->22753K(84992K)] 31671K->22753K(153600K), [Metaspace: 5864K->5864K(1056768K)], 0.1087988 secs] [Times: user=0.72 sys=0.00, real=0.11 secs] 
2017-05-18T19:29:44.982+0800: 1.000: [GC (System.gc()) [PSYoungGen: 0K->0K(68608K)] 22753K->22753K(153600K), 0.0023905 secs] [Times: user=0.02 sys=0.00, real=0.00 secs] 
......
2017-05-18T19:29:45.842+0800: 1.860: [Full GC (System.gc()) [PSYoungGen: 0K->0K(68608K)] [ParOldGen: 773K->773K(84992K)] 773K->773K(153600K), [Metaspace: 5864K->5864K(1056768K)], 0.0033012 secs] [Times: user=0.01 sys=0.00, real=0.00 secs] 
Exception in thread "main" java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes = 913M > maxPhysicalBytes = 889M
    at org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:562)
    at org.bytedeco.javacpp.Pointer.init(Pointer.java:121)
    at org.bytedeco.javacpp.opencv_core$Mat.allocate(Native Method)
    at org.bytedeco.javacpp.opencv_core$Mat.<init>(opencv_core.java:15395)
    at MemoryTest.main(MemoryTest.java:8)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Heap
 PSYoungGen      total 68608K, used 1655K [0x00000000f5980000, 0x00000000fa180000, 0x0000000100000000)
  eden space 63488K, 2% used [0x00000000f5980000,0x00000000f5b1dcf0,0x00000000f9780000)
  from space 5120K, 0% used [0x00000000f9780000,0x00000000f9780000,0x00000000f9c80000)
  to   space 5120K, 0% used [0x00000000f9c80000,0x00000000f9c80000,0x00000000fa180000)
 ParOldGen       total 84992K, used 773K [0x00000000e0c00000, 0x00000000e5f00000, 0x00000000f5980000)
  object space 84992K, 0% used [0x00000000e0c00000,0x00000000e0cc15a8,0x00000000e5f00000)
 Metaspace       used 5881K, capacity 6302K, committed 6656K, reserved 1056768K
  class space    used 640K, capacity 654K, committed 768K, reserved 1048576K

One can see that java heap memory is almost not used (30MB) and the mat objects are all collected by gc.
However, the value of Pointer.physicalBytes() keeps increasing until the limitation of Pointer.maxPhysicalBytes() (930MB), and finally throws OutOfMemoryError.

Is this a bug of the deallocator?

package version (I'm using sbt): "org.bytedeco.javacpp-presets" % "opencv" % "3.2.0-1.3" classifier "" classifier "linux-x86_64".
extra java options: -Xmx500m -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps

bug

Most helpful comment

Fix released in JavaCPP 1.3.3. Thanks for reporting and testing this out!

All 29 comments

If you need to allocate more than that amount of Mat objects, increase either the Runtime.maxMemory() value with for example the java -Xmx command line option, or set the "org.bytedeco.javacpp.maxphysicalbytes" system property:
http://bytedeco.org/javacpp/apidocs/org/bytedeco/javacpp/Pointer.html#maxPhysicalBytes

@saudet Thanks for your reply.

My point is: I didn't keep references to the Mat objects I allocated, so their memory should be deallocated, both the java heap memory and off-heap memory. But the testing shows that the off-heap memory is not correctly released.

I see, does it help if you increase http://bytedeco.org/javacpp/apidocs/org/bytedeco/javacpp/Pointer.html#maxRetries (via the "org.bytedeco.javacpp.maxretries" system property) to something like 1000?

I tried but no help.

The gc log shows that the java heap is clean enough after the third full gc.

2017-05-19T10:56:47.596+0800: 1.129: [Full GC (System.gc()) [PSYoungGen: 0K->0K(68608K)] [ParOldGen: 4907K->773K(84992K)] 4907K->773K(153600K), [Metaspace: 5887K->5887K(1056768K)], 0.0051643 secs] [Times: user=0.02 sys=0.00, real=0.01 secs]

I double checked this using jmap -J-d64 -histo <pid>, which shows no more 'org.bytedeco.javacpp.opencv_core.Mat' or 'org.bytedeco.javacpp.Pointer$NativeDeallocator' objects live at that moment.

So even manually calling Mat.deallocate() right after every allocation also ends up running out of memory?

Manually calling Mat.deallocate() right after every allocation solves out of memory problem.

It is also the work-around solution currently using in our application.

Strange, the deallocator thread just basically calls Mat.deallocate() on each object...

I think I found the problem.

The deallocator in jni code use delete to release off-heap memory

static void JavaCPP_org_bytedeco_javacpp_opencv_1core_00024Mat_deallocate(void *p) { delete (::cv::Mat*)p; }

and use resident set size to evaluate the size of used off-heap memory.

/** Returns the amount of physical memory currently used by the whole process, or 0 if unknown.
     * Also known as "resident set size" (Linux, Mac OS X, etc) or "working set size" (Windows). */
    @Name("JavaCPP_physicalBytes") public static synchronized native long physicalBytes();

However this link discussed that delete does not return the memory to operating system immediately, which means resident set size would not decrease right after System.gc().
That's why javacpp says the Physical memory usage is too high, even after deleting many Mat objects.

The solution on linux could be calling malloc_trim(0); after System.gc() in

while (count++ < maxRetries && ((maxBytes > 0 && DeallocatorReference.totalBytes + r.bytes > maxBytes) || (maxPhysicalBytes > 0 && (lastPhysicalBytes = physicalBytes()) > maxPhysicalBytes))) {
  // try to get some more memory back
  System.gc();
  Thread.sleep(100);
}

That's interesting, thanks for investigating this!

The behavior of those calls are pretty complex. Have you been able to confirm that a call to malloc_trim(0) helps?

Yes, I have tested it on my PC ubuntu 14.04, x86_64, glibc 2.19, g++4.8.2, opencv 2.4.13

The testing code

#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>
#include <malloc.h>

#include <opencv2/core/core.hpp>
using namespace cv;

const int MAXN = 10000;

void info(int iter)
{   
    int fd = open("/proc/self/statm", O_RDONLY, 0);
    if (fd >= 0) {
        char line[256];
        char* space;
        lseek(fd, 0, SEEK_SET);     
        if (read(fd, line, sizeof(line)) > 0 && (space = strchr(line, ' ')) != NULL) {
            printf("%d: %s\n", iter, line);
        }
    }   
    close(fd);
}

int main()
{
    Mat* mats[MAXN];
    for (int i = 0; i < MAXN; ++i) {
        mats[i] = new Mat(200, 100, CV_8UC3);
        if (i % 1000 == 0) {
            info(i);    
        }
    }
    for (int i = 0; i < MAXN; ++i) {
        delete mats[i];
        if (i % 1000 == 0) {
            info(i);    
        }
    }
    info(MAXN + 1);
    malloc_trim(0);
    info(MAXN + 2);

    return 0;
}

compile and run

g++ -g -O2 --std=c++11 test.cpp -o test `pkg-config --cflags --libs opencv`
./test

the output

0: 6910 495 435 2 0 116 0

1000: 21587 2156 1019 2 0 14793 0

2000: 36304 3212 1019 2 0 29510 0

3000: 50976 4268 1019 2 0 44182 0

4000: 65649 5324 1019 2 0 58855 0

5000: 80366 6380 1019 2 0 73572 0

6000: 95039 7436 1019 2 0 88245 0

7000: 109712 8492 1019 2 0 102918 0

8000: 124429 9548 1019 2 0 117635 0

9000: 139101 10670 1019 2 0 132307 0

0: 153774 11726 1019 2 0 146980 0

1000: 153774 11726 1019 2 0 146980 0

2000: 153774 11726 1019 2 0 146980 0

3000: 153774 11726 1019 2 0 146980 0

4000: 153774 11726 1019 2 0 146980 0

5000: 153774 11726 1019 2 0 146980 0

6000: 153774 11726 1019 2 0 146980 0

7000: 153774 11726 1019 2 0 146980 0

8000: 153774 11726 1019 2 0 146980 0

9000: 153774 11726 1019 2 0 146980 0

10001: 153774 11726 1019 2 0 146980 0

10002: 6882 1115 1019 2 0 88 0
6980 0

You can see that number of pages keeps 11726 while deleting 10000 Mat objects at one moment.
After calling malloc_trim(0), the number of pages decreases to 1115 immediately.

However, this behavior depends on the implementation of glibc and should be tested on more environments if possible.

I wonder what happens if we forget to delete just the last Mat?

I tested it and it still releases most of the memory, interesting. What would be the equivalent on Mac OS X and Windows? Does Android implement it?

You worry about the hole problem and you want to know "what happens if we forget to delete the Mat with the highest address?"

I tested it too and calling malloc_trim(0) in that case:

  1. does not change the number of virtual pages (the first integer in /proc/self/statm)
  2. does not change the location of program break (could be obtained by sbrk(0))
  3. does decrease the number of resident pages (the second integer in /proc/self/statm), similar with what you tested

As it claims here, malloc_trim can release only the free (virtual) memory at the top of the heap, which matches observation 1 & 2. However, to my best knowledge, no document about observation 3 could be found. So, I had to regard "releasing most of the (physical) memory, even the one not at the top of the heap" as some side effect of malloc_trim(0). Some discussion could be found here.

Whether to use this magic trick is on your decision.

About other platforms, should try to reproduce this bug first?

Ok, I verified that the same thing happens on Mac (though what is the equivalent of malloc_trim(0) there?), but it does not happen on Windows. However, Windows has other issues. When too much memory gets allocated, deallocation becomes very slow, so we shouldn't be allocating more memory then we really need anyway...

What's your Mac OS version?

I tested (with the code I posted in the very beginning) on OS X Yosemite 10.10.5 and found no OutOfMemoryError. When javacpp reaches the bound of maxPhysicalBytes() and calls System.gc();, the value of physicalBytes() does decreases from 500MB to 400MB.

This link shortly describes the memory management of Mac OS and the things are quite different from those on linux. I have no idea whether the equivalent of malloc_trim(0) on Mac exists or not.

Yosemite here as well. I only tried your second "testing code" though. When things get called across threads, maybe it acts differently?

Yes, the second "testing code" shows that the physical memory is not immediately freed after delete on Mac. But then, the behavior in different platforms and different use cases is complicated and beyond my knowledge. It would be nice if some operating system expert gives some suggestions.

@mythly In the case of multi-threaded high concurrent request, whether malloc_trim(0) should be locked by std::mutex?At the end of each request need to be called malloc_trim(0) ?

@youzp malloc_trim is thread safe and calling malloc_trim(0) after System.gc() would be a better choice.

Finally got around to adding the call. Let me know if you still have issues with that! Thanks

Fix released in JavaCPP 1.3.3. Thanks for reporting and testing this out!

Hi guys, I am still having this issue... https://stackoverflow.com/q/47112338/2046728

java.lang.OutOfMemoryError: Physical memory usage is too high: physicalBytes = 1G > maxPhysicalBytes = 1G
org.bytedeco.javacpp.Pointer.deallocator(Pointer.java:576)

Already downloaded the latest version of JavaCPP (1.3.3) but kept the same behavior :-/

What's your platform?

Linux (Ubuntu 16.04 64bits)

You're probably just corrupting your memory. For example, you're calling cvClearMemStorage(storage) before accessing that region of memory...

@saudet, so you are saying that I should't call cvClearMemStorage(storage) there? I understand as it is cleaning the storage before calling the loop of cvGetSeqElem(...) in order to place the rectangles in the image - and even if I place cvClearMemStorage(...) after all or before all it still accumulating... what would be the best practice here then?

Just do it like shown here:
https://github.com/bytedeco/javacv#sample-usage

Hi @saudet ,
I've rewritten my code to something alike the sample - it works basically the same way - monitoring the physical memory it increases over time although jvm heap size is under control - after about 650 image analysis it crashes with the same error.
Currently I am working around this by running the OpenCV part in a parallel process -> ObjectDetection is a separate executable jar and I call it from my application as a command line, so it runs in a separate jvm and has a much shorter lifecycle - the collateral effect is performance, of course, but memory is not a problem in this case. My application used to analyze 8 to 10 images per second (crashing after a minute or so) with the previous design, and now, with this workaround the application is performing 3 to 5 image analysis per second, but still up and running.

Was this page helpful?
0 / 5 - 0 ratings