Incubator-mxnet: Scala Module API resize is leaking memory on the native size.

Created on 9 May 2018 · 18Comments · Source: apache/incubator-mxnet

Description

Create and bind a MXNet Module with batch size N+1 and proceed to loop and pass DataBatches to it that require the Module to resize before performing the forward pass. Monitor the system resources (With htop, nvidia-smi, jvmtop) and you will notice the used system memory in htop will start to grow, but not the jvm heap size (the system memory usages grows beyond the set max JVM heap size) or GPU memory usage. This will continue until your system runs out of memory and there is a crash or the JVM is killed clearing all of the leaked used system memory with it.

Environment info (Required)

----------Python Info----------
Version      : 3.4.3
Compiler     : GCC 4.8.4
Build        : ('default', 'Nov 17 2016 01:08:31')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /usr/local/lib/python3.4/dist-packages/pip
----------MXNet Info-----------
Version      : 1.1.0
Directory    : /home/paperspace/src/mxnet/python/mxnet
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.4.0-31-generic-x86_64-with-Ubuntu-14.04-trusty
system       : Linux
node         : psplnlbg
release      : 4.4.0-31-generic
version      : #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    8
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Stepping:              1
CPU MHz:               2600.072
BogoMIPS:              5200.14
Hypervisor vendor:     Microsoft
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              10240K
NUMA node0 CPU(s):     0-7
----------Network Test----------
Setting timeout: 10
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0884 sec, LOAD: 0.4182 sec.
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0131 sec, LOAD: 0.4660 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.0957 sec, LOAD: 0.6263 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0099 sec, LOAD: 0.3049 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0134 sec, LOAD: 0.6473 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0126 sec, LOAD: 0.0343 sec.

Package used (Python/R/Scala/Julia):
Scala. This seems to be specific to the Scala API. Can not reproduce in Python.

For Scala user, please provide:

Java version: (java -version)
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

Maven version: (mvn -version)
Apache Maven 3.0.5
Maven home: /usr/share/maven
Java version: 1.8.0_131, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-31-generic", arch: "amd64", family: "unix"
Scala runtime if applicable: (scala -version)
2.11.11

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): GCC 4.8.4

MXNet commit hash:
07a83a0325a3d782513a04f47d711710972cb144

Build config:
make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1
make scalainstall -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1

Error Message:

None

Minimum reproducible example

link to simple Scala project/code to reproduce issue
https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala

Steps to reproduce

(Paste the commands you ran that produced the error.)

Pull this project and run it with SBT to test this issue.

What have you tried to solve it?

Avoid passing in DataBatch that requires resize by padding smaller DataBatch with 0's.
Not using the Scala API

Bug Memory Scala

Source

jessebrizzi

Most helpful comment

I've finally come up with a good fix for this. I'll be submitting a PR shortly. I've run your test on my local laptop and am able to get through all passes without issue and memory looks stable. I've also fixed it so that this code can be wrapped in a ResourceScope block without it crashing.

I'm going to start up an instance to run the test over the weekend just to be thorough but everything's looking good.

andrewfayres on 8 Mar 2019

🎉5 🚀3 ❤3

All 18 comments

Also tested on master branch 10ac52993aeaf2fa589a6b3636c8e23a65c8e639 and bug is present

jessebrizzi on 21 May 2018

Hey @jessebrizzi - thanks for reporting the issue!
I noticed that your example code is using the old namespace... can you reproduce on MXNet latest (1.2.0) and with the new package that is available on Maven?

Adding @nswamy who contributes to the Scala package

lupesko on 14 Jun 2018

@lupesko I updated the example code to use the mxnet 1.2.0 package released on Maven and reran the test and the memory leak is still present.

I have also put together a docker container that you can use with nvidia-docker to run my example code with sbt/scala/java/cudnn/cuda all installed here https://hub.docker.com/r/jessebrizzi/dl-dev/ to control for environment differences.

jessebrizzi on 19 Jun 2018

I spent some time digging into this to see if I could find a fix.

Them memory leak seems to be exclusive to when an Executor has to reshape to a size larger than the previously binded size. When you size down there is no noticeable memory leak. I have updated my git repo that as the bug reproduction to show this.

This could be related to the new empty NDArrays being created here https://github.com/apache/incubator-mxnet/blob/master/scala-package/core/src/main/scala/org/apache/mxnet/Executor.scala#L105 and then the previously binded arrays that they are replacing are not properly being freed somewhere. I am assuming the problem is avoided in the resize to a smaller size case since the backend implementation of reshape on the NDArray has no problem reusing the old native memory space for the NDarrays since it can fit the new size in the same memory space as the old size. I tired disposing the old NDArrays for the old size of the Executor in various places but could not find a place that worked (or any proof that disposing the old NDArrays wold fix the memory leak)

I looked into the python API to see if it was doing anything different that could be adapted into the Scala interface to solve the problem. It looks like the Python API has changed from resizing the Executor in Python to using a new backend function to resize the Executor. (Link to the PR for that https://github.com/apache/incubator-mxnet/pull/10882). I figured that this new backend implementation avoids the memory leak issue I made a Jira ticket for the same change (https://issues.apache.org/jira/browse/MXNET-747) and attempted to make the same change in the Scala interface (https://github.com/jessebrizzi/incubator-mxnet/tree/MXNET-747), but I am having trouble writing the new JNI interface for the native resize call and I could use some eyes on it that are more familiar with the JNI interface for the Scala API (getting a SIGSEGV on the call to the MXExecutorReshape method here https://github.com/jessebrizzi/incubator-mxnet/blob/MXNET-747/scala-package/native/src/main/native/org_apache_mxnet_native_c_api.cc#L1764 and I'm not sure why).

Any help or eyes on this problem would be greatly appreciated, I am currently using MXNet in a production environment and have a pretty hacky work around in place currently to avoid this issue.

jessebrizzi on 30 Jul 2018

@jessebrizzi thanks for your efforts in trying to solve this memory issue. @andrewfayres started looking into this part of the code(JNI) recently, he might be able to help you.

@lanking520 as FYI.

nswamy on 1 Aug 2018

@jessebrizzi I actually started looking into this a few days ago. I got a bit sidetracked with something else but I think I should have a solution for you in a day or two.

andrewfayres on 1 Aug 2018

@jessebrizzi I've been looking at the JNI code you wrote and figured out that the sigsegv is happening here. However, I've been unable to figure out exactly what's causing the error thus far. Presumably at least one of the variables on that line is invalid but I'm not seeing why.

I'm going to change gears a bit and attempt to fix the memory leak in the existing scala code to at least resolve the issue for you.

andrewfayres on 7 Aug 2018

@jessebrizzi I've got what I believe to be a working fix in my repo. I need to do some more thorough testing and add some automated testing to this before submitting a PR but my preliminary testing is looking good. Feel free to take a look and let me know if you find any issues.

I'll work on moving all of this over to the native code as soon as I get a few other things off my backlog.

andrewfayres on 15 Aug 2018

Thanks @andrewfayres !

Just catching up on all of this after taking some time off. I'm going to pull down your change and do some testing myself and get back to you.

jessebrizzi on 20 Aug 2018

Sorry for taking so long to get back to you @andrewfayres

I pulled down your fix from your repo and ran it both on my OSX machine in CPU mode and in a nvidia-docker container (https://hub.docker.com/r/jessebrizzi/dl-dev/) in linux for GPU mode and I still think a memory leak issue is still present.

I have update my bug reproduction repository here to show the exact behavior I am testing.

The memory leak that I am still observing seems to be independent of the binded network size. I tested this by running through various max batch sizes that I would randomly sample from for my test input and regardless of what I set after 10000 iterations the native memory growth seems to be around the same.

The fix does seem to address the upsize vs downlise issue observed earlier.

I would still suggest, to maintain parity with the python interface, that the resize logic should be switched to the backend method that the python interface uses, but I know now messing with a JNI interface change is a pain.

Could you post an example of the code you where running to test the change?

jessebrizzi on 30 Aug 2018

@jessebrizzi Sorry for the slow reply, I've been out on vacation.

As soon as I get a little bit of free time I'll pull down the update you've made to your repository and see if I can reproduce. I mainly checked for the resize issue when I was looking earlier. It's possible there's also a different leak. I definitely agree with moving it to native is our best long term solution but at the moment I don't have the bandwidth to dedicate to this.

The example code that I've been running is actually the same as what you posted originally.

As a side note, there is ongoing work being done by @nswamy to provide a much better solution to scala memory management. Feel free to take a look at the design. Hopefully, once this work is finished all the leaking issues will be resolved.

andrewfayres on 10 Sep 2018

@jessebrizzi The NativeResource Management was added as part of https://github.com/apache/incubator-mxnet/pull/12647. You can find some documentation about it https://github.com/apache/incubator-mxnet/blob/master/scala-package/memory-management.md. Does this look like it fixes your problem?

zachgk on 12 Jan 2019

@andrewfayres , could you please take a look and test again whether this memory leak issue is fixed with Phantom references implementation?

ddavydenko on 28 Jan 2019

@zachgk Sorry for the late response, I had to find some time to run through my tests.

I modified my repo (https://github.com/jessebrizzi/MXNet-Bug) with the example code for the memory leak bug to use the snapshots published in the Nightly Repo https://repository.apache.org/#nexus-search;gav~org.apache.mxnet
and include a Dockerfile that sets up an image to test my code with the nightly builds (You have to change the Cuda version in the docker image to switch between 1.4.0-SNAPSHOT and 1.5.0-SNAPSHOT)

I think I am still seeing the bug (native memory climbing when JVM and GPU memory stays stable), but I am running into a crashing issue when running my "bugged" reproduction code https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala

In CPU mode the script will freeze after a few hundred loops with all CPU's pegged and in GPU mode it freezes with the GPU/CPU idling.

In the non-bugged version (https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestNoBug.scala) of my repoduction code this does not happen, it runs fine since it avoids the resize call.

I'm currently trying to get a repo/test using the current 1.3.1 published release, but I have to work through some other blocking issues with that (Mainly the 1.3.1 published scala jar for Linux-GPU is built against opencv 3.4 instead of the opencv 2.4 provided by Ubuntu 16.04 https://packages.ubuntu.com/xenial/libopencv-dev), and I'll comment again when I have that working.

jessebrizzi on 28 Jan 2019

Updated https://github.com/jessebrizzi/MXNet-Bug to support testing with the released MXNet 1.3.1, 1.4.0-SNAPSHOT, and 1.5.0-SNAPSHOT

The bug exists in 1.3.1 (this was already known), running https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestBug.scala for 10000 forward passes of random input resizes leads to over of 6gb of system memory used and climbing when using the GPU.
Htop attributes the memory usage to the java/sbt process but JVMTop does not show this usage in the Heap inside the JVM (The default heapmax is not that large regardless).

Both 1.4.0-SNAPSHOT and 1.5.0-SNAPSHOT cannot finish the 10000 forward passes random resize test as they crash/freeze as described in my previous comment.

for all versions, the https://github.com/jessebrizzi/MXNet-Bug/blob/master/scala/TestNoBug.scala test runs fine with only 1.8ish GB of mem used on my machine (no random resizes on the input) for both GPU and CPU backed nets.

The dockerfile/README in my bug repo https://github.com/jessebrizzi/MXNet-Bug has the needed instructions if someone wants to try and reproduce my observations.

jessebrizzi on 28 Jan 2019

Thanks @jessebrizzi, I'll use your tests on my end to reproduce and try to resolve this.

andrewfayres on 28 Jan 2019

👍1

I'm going to start up an instance to run the test over the weekend just to be thorough but everything's looking good.

andrewfayres on 8 Mar 2019

🎉5 🚀3 ❤3

@andrewfayres Fantastic! Really excited about this 💯