Operating System: Deep Learning Ubuntu AMI, EC2
Package used: Python
MXNet version: 1.1, Gluon
I am running DeepRL code that closely follows the tutorial at
https://github.com/zackchase/mxnet-the-straight-dope/tree/master/chapter17_deep-reinforcement-learning
In the main block of the code ("Train the model"), I removed everything except these two lines:
```python
data = nd.array(state.reshape([1, opt.frame_len, opt.image_size, opt.image_size]), opt.ctx)
action = int(nd.argmax(dqn(data), axis=1).as_in_context(mx.cpu()).asscalar())
```
where state is the stack of 4 frames, stored as an mx.nd.array with CPU context, and opt.ctx is the GPU.
Then I run the code while using htop to track CPU memory usage, and I observe that memory usage keeps increasing. It seems that the feed-forward pass leaks memory.
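For context, the two lines above run inside a loop roughly like the sketch below (`dqn` and `opt` come from the tutorial code; `get_next_state()` is a hypothetical placeholder for the environment step, not a real function from the tutorial):

```python
import mxnet as mx
from mxnet import nd

while True:
    # Placeholder for however the tutorial produces the stacked frames;
    # it returns a CPU ndarray containing 4 frames.
    state = get_next_state()
    data = nd.array(state.reshape([1, opt.frame_len, opt.image_size, opt.image_size]), opt.ctx)
    action = int(nd.argmax(dqn(data), axis=1).as_in_context(mx.cpu()).asscalar())
    # Resident memory shown in htop keeps growing while this loop runs.
```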
I've got the same problem during training. When there is no time gap between each mod.forward, backward, and update, memory usage keeps increasing; but when I sleep 0.3 s after each forward/backward/update, memory doesn't increase at all. I think something in the underlying C++ structure of MXNet causes this problem; I'm still working on how to release the memory fast enough during each forward.
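The workaround looks roughly like this (a sketch only; `net`, `trainer`, `loss_fn`, and `train_iter` are placeholder names, not taken from the actual code):

```python
import time
import mxnet as mx

for data, label in train_iter:
    with mx.autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    time.sleep(0.3)  # give the backend engine time to drain its queue
```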
Hey, are you still running it in a Jupyter notebook? If so, convert it with
jupyter nbconvert --to script [YOUR_NOTEBOOK].ipynb
and run it as python3 DQN.py.
I suspect the problem comes from running it as a Jupyter notebook.
(I no longer see any leak now that I'm not running it in a notebook.)
Thanks for the info, but my memory still keeps increasing when the speed is too high (around 140 samples/s) ^_^ @kazizzad
forward/backward/update are all asynchronous operations: they just push work to the backend engine and return immediately. https://mxnet.incubator.apache.org/versions/master/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization
You can use mx.nd.waitall() to make sure all operations are complete.
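For example (a small standalone sketch; the shapes and iteration count are arbitrary):

```python
import mxnet as mx
from mxnet import nd

ctx = mx.gpu(0)  # assumes a GPU is available
a = nd.ones((1000, 1000), ctx=ctx)

# Each call below returns immediately; the work is only queued on the
# backend engine (lazy evaluation), so the Python loop can run far ahead.
for _ in range(1000):
    a = nd.dot(a, a) * 0.001

# Block until every queued operation has finished, so intermediate
# buffers can actually be freed.
mx.nd.waitall()
print(a[0, 0].asscalar())
```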
@eric-haibin-lin thank you ^_^
@eric-haibin-lin After I call mx.nd.waitall(), the memory-growth problem no longer exists, but the speed is only 2.5 samples/s. I think my former code just kept pushing work into the backend engine faster than the engine could process it, so memory kept increasing until it crashed. For now I'll accept that the speed can only reach 2.5 samples/s. Thanks for the tips.
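If syncing after every sample is too slow, one possible middle ground is to synchronize only every N iterations (a sketch; `net`, `trainer`, `loss_fn`, `train_iter`, and the interval are placeholders):

```python
import mxnet as mx

SYNC_EVERY = 50  # hypothetical interval; tune it against your memory budget

for i, (data, label) in enumerate(train_iter):
    with mx.autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    if (i + 1) % SYNC_EVERY == 0:
        mx.nd.waitall()  # bound the backend queue without paying the sync cost every batch
```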
After adding mx.nd.waitall(), my CPU memory usage still goes up. It should be around 100 GB, but it sometimes reaches 200 GB.
I have the same problem in R, using a small batch size of 10. Memory always increases to the machine's maximum (32 GB in my case). I have set up a CNN, and even seemingly small training examples cause huge increases in memory.
Is there a workaround for an R user?
I'm using MXNet R version 0.10.1.
I am still experiencing this. If you don't run mx.nd.waitall(), memory leaks on whichever context you are using. I am using MXNet version 1.3.1.
The memory leak happens at the module's backward step; I believe there are some bugs in it. nd.waitall() can fix it but leads to poor performance. @eric-haibin-lin
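A lighter-weight alternative to a full mx.nd.waitall() (an assumption on my part, not something confirmed in this thread) is to block only on a value you need anyway, such as the scalar loss; that keeps the frontend from running arbitrarily far ahead of the backend:

```python
import mxnet as mx

for data, label in train_iter:  # placeholder names as in the sketches above
    with mx.autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    # Copying the mean loss to the CPU waits only on that dependency chain,
    # which in practice keeps the queue of pending operations bounded.
    running_loss = loss.mean().asscalar()
```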