Incubator-mxnet: Feed forward pass memory leaks (using htop)

Created on 9 Oct 2017 · 10 comments · Source: apache/incubator-mxnet

Operating System: Deep Learning Ubuntu AMI, EC2
Package used: Python
MXNet version: 0.11 (Gluon)

I am running deep RL code that is close to the existing code in this tutorial:
https://github.com/zackchase/mxnet-the-straight-dope/tree/master/chapter17_deep-reinforcement-learning
In the main block of the code ("Train the model"), I removed everything except these two lines:
data = nd.array(state.reshape([1,opt.frame_len,opt.image_size,opt.image_size]),opt.ctx)
action = int(nd.argmax(dqn(data),axis=1).as_in_context(mx.cpu()).asscalar())
where state is the stack of 4 frames, stored as an mx.nd.array in CPU context; opt.ctx is a GPU context.

Then I run the code while using htop to track my CPU memory usage, and I observe that the memory usage keeps increasing. It seems that the feed-forward pass leaks memory.

[Screenshot: memory usage in htop]
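
For reproducibility, below is a minimal sketch of the loop described above. The network, frame shape, and random frames are placeholders (standing in for the tutorial's dqn, the opt values, and the environment state), not the actual tutorial code; only the forward pass and argmax from the two lines quoted above are kept.

    import mxnet as mx
    from mxnet import nd, gluon
    import numpy as np

    ctx = mx.gpu()                  # stands in for opt.ctx
    frame_len, image_size = 4, 84   # placeholder values for opt.frame_len / opt.image_size

    dqn = gluon.nn.Dense(6)         # toy stand-in for the trained DQN (6 actions assumed)
    dqn.initialize(ctx=ctx)

    for step in range(100000):
        # random frames stand in for the environment state (a CPU-side numpy array)
        state = np.random.uniform(size=(frame_len, image_size, image_size))
        data = nd.array(state.reshape([1, frame_len, image_size, image_size]), ctx)
        action = int(nd.argmax(dqn(data), axis=1).as_in_context(mx.cpu()).asscalar())
        # watching htop while this loop runs is where the growing CPU memory shows up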

All 10 comments

I've got the same problem during training. When there is no time gap between each mod.forward, backward, update, memory usage keeps increasing; but when I sleep 0.3 s after each forward/backward/update, memory doesn't increase at all. I think it's the underlying C++ structure of MXNet that causes this problem; I'm still working on how to release the memory fast enough during each forward pass.

Hey, are you still running it in the Jupyter notebook? If yes, use this:
jupyter nbconvert --to script [YOUR_NOTEBOOK].ipynb
and run it as python3 DQN.py

I guess the problem was caused by running it as a Jupyter notebook.
(Now I do not have any leak, since I am not running it in a notebook anymore.)

Thanks for the info, but my memory still keeps increasing when the speed is too fast (like 140 samples/s) ^_^ @kazizzad

forward/backward/update are all asynchronous operations. They just push the work to the backend engine and return immediately. https://mxnet.incubator.apache.org/versions/master/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization
You can use mx.nd.waitall() to make sure all operations are complete.
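
To make this concrete, here is a small self-contained sketch of where an mx.nd.waitall() call would go in a training loop. It uses Gluon with a toy Dense network as a stand-in, rather than the actual Module-based code discussed above:

    import mxnet as mx
    from mxnet import nd, gluon, autograd

    ctx = mx.cpu()
    net = gluon.nn.Dense(1)
    net.initialize(ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
    loss_fn = gluon.loss.L2Loss()

    for step in range(1000):
        x = nd.random.uniform(shape=(32, 10), ctx=ctx)
        y = nd.random.uniform(shape=(32, 1), ctx=ctx)
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()      # only enqueues work on the backend engine
        trainer.step(32)     # also enqueued; returns immediately
        mx.nd.waitall()      # block until everything queued so far has finished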

@eric-haibin-lin thank you ^_^

@eric-haibin-lin After I call mx.nd.waitall(), the memory-growth problem no longer exists, but the speed is only 2.5 samples/s. I think my former code just kept pushing data into the backend engine, but the engine couldn't process it fast enough, so memory kept increasing until it crashed. Now I think I just have to accept that the speed can only reach 2.5 samples/s. Thanks for the tips.
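
A possible compromise (an assumption on my part, not something verified in this thread) is to synchronize only every N iterations, so the engine's queue stays bounded without paying a full sync on every sample. Reusing the same toy setup as the sketch above:

    import mxnet as mx
    from mxnet import nd, gluon, autograd

    ctx = mx.cpu()
    net = gluon.nn.Dense(1)
    net.initialize(ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
    loss_fn = gluon.loss.L2Loss()
    N = 50  # sync interval (tunable): smaller bounds memory tighter, larger gives more throughput

    for step in range(1000):
        x = nd.random.uniform(shape=(32, 10), ctx=ctx)
        y = nd.random.uniform(shape=(32, 1), ctx=ctx)
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()
        trainer.step(32)
        if step % N == 0:
            mx.nd.waitall()  # drain the backend engine only every N steps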

After adding mx.nd.waitall() I still see CPU memory usage go up. It should be around 100 GB, but it sometimes gets to 200 GB.

I have the same problem in R, using a small batch size of 10. The memory always increases to the computer's maximum (32 GB in my case). I have set up a CNN, and even seemingly small training examples seem to cause huge increases in memory.

Is there a workaround for an R user?

Using MXNet R version 0.10.1

I am still experiencing this. If you don't run mx.nd.waitall(), the memory leaks on whichever context you are using. I am using MXNet version 1.3.1.

The memory leak happens at the module's backward step. I believe there are some bugs in it. nd.waitall() can fix it but leads to poor performance. @eric-haibin-lin
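
A lighter-weight alternative that may be worth trying (again an assumption, not confirmed in this thread): block only on the current loss value instead of draining the whole engine with nd.waitall(). Since each forward pass reads the parameters written by the previous update, waiting for the loss should keep the engine's queue roughly one step deep:

    import mxnet as mx
    from mxnet import nd, gluon, autograd

    ctx = mx.cpu()
    net = gluon.nn.Dense(1)
    net.initialize(ctx=ctx)
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {'learning_rate': 0.01})
    loss_fn = gluon.loss.L2Loss()

    for step in range(1000):
        x = nd.random.uniform(shape=(32, 10), ctx=ctx)
        y = nd.random.uniform(shape=(32, 1), ctx=ctx)
        with autograd.record():
            loss = loss_fn(net(x), y)
        loss.backward()
        trainer.step(32)
        # copying the scalar loss back to Python waits for this step's forward computation
        current_loss = loss.mean().asscalar()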

