Operating System: Deep Learning Ubuntu AMI, EC2
Package used: Python
MXNet version: 1.1, Gluon
I am running DeepRL code that closely follows the tutorial at
https://github.com/zackchase/mxnet-the-straight-dope/tree/master/chapter17_deep-reinforcement-learning
In the main block of the code ("Train the model"), I removed everything except these two lines:
```python
data = nd.array(state.reshape([1, opt.frame_len, opt.image_size, opt.image_size]), opt.ctx)
action = int(nd.argmax(dqn(data), axis=1).as_in_context(mx.cpu()).asscalar())
```
where state is the stack of 4 frames, stored as an mx.nd.array with CPU context, and opt.ctx is the GPU.
Then I run the code while using htop to track CPU memory usage, and I observe that memory usage keeps increasing. It seems that the feed-forward pass leaks memory.
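For context, the two lines above run inside a loop roughly like the sketch below (`dqn` and `opt` come from the tutorial code; `get_next_state()` is a hypothetical placeholder for the environment step, not a real function from the tutorial):

```python
import mxnet as mx
from mxnet import nd

while True:
    # Placeholder for however the tutorial produces the stacked frames;
    # it returns a CPU ndarray containing 4 frames.
    state = get_next_state()
    data = nd.array(state.reshape([1, opt.frame_len, opt.image_size, opt.image_size]), opt.ctx)
    action = int(nd.argmax(dqn(data), axis=1).as_in_context(mx.cpu()).asscalar())
    # Resident memory shown in htop keeps growing while this loop runs.
```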
I've got the same problem during training. When there is no time gap between each mod.forward, backward, and update, memory usage keeps increasing; but when I sleep 0.3 s after each forward/backward/update, memory doesn't increase at all. I think something in the underlying C++ structure of MXNet causes this problem; I'm still working on how to release the memory fast enough during each forward.
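The workaround looks roughly like this (a sketch only; `net`, `trainer`, `loss_fn`, and `train_iter` are placeholder names, not taken from the actual code):

```python
import time
import mxnet as mx

for data, label in train_iter:
    with mx.autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    time.sleep(0.3)  # give the backend engine time to drain its queue
```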
Hey, are you still running it in a Jupyter notebook? If so, convert it with
jupyter nbconvert --to script [YOUR_NOTEBOOK].ipynb
and run it as python3 DQN.py.
I suspect the problem comes from running it as a Jupyter notebook.
(I no longer see any leak now that I'm not running it in a notebook.)
Thanks for the info, but my memory still keeps increasing when the speed is too high (around 140 samples/s) ^_^ @kazizzad
forward/backward/update are all asynchronous operations: they just push work to the backend engine and return immediately. https://mxnet.incubator.apache.org/versions/master/tutorials/basic/ndarray.html#lazy-evaluation-and-automatic-parallelization
You can use mx.nd.waitall() to make sure all operations are complete.
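For example (a small standalone sketch; the shapes and iteration count are arbitrary):

```python
import mxnet as mx
from mxnet import nd

ctx = mx.gpu(0)  # assumes a GPU is available
a = nd.ones((1000, 1000), ctx=ctx)

# Each call below returns immediately; the work is only queued on the
# backend engine (lazy evaluation), so the Python loop can run far ahead.
for _ in range(1000):
    a = nd.dot(a, a) * 0.001

# Block until every queued operation has finished, so intermediate
# buffers can actually be freed.
mx.nd.waitall()
print(a[0, 0].asscalar())
```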
@eric-haibin-lin thank you ^_^
@eric-haibin-lin After I call mx.nd.waitall(), the memory-growth problem no longer exists, but the speed is only 2.5 samples/s. I think my former code just kept pushing work into the backend engine faster than the engine could process it, so memory kept increasing until it crashed. For now I'll accept that the speed can only reach 2.5 samples/s. Thanks for the tips.
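If syncing after every sample is too slow, one possible middle ground is to synchronize only every N iterations (a sketch; `net`, `trainer`, `loss_fn`, `train_iter`, and the interval are placeholders):

```python
import mxnet as mx

SYNC_EVERY = 50  # hypothetical interval; tune it against your memory budget

for i, (data, label) in enumerate(train_iter):
    with mx.autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    if (i + 1) % SYNC_EVERY == 0:
        mx.nd.waitall()  # bound the backend queue without paying the sync cost every batch
```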
After adding mx.nd.waitall(), my CPU memory usage still goes up. It should be around 100 GB, but it sometimes reaches 200 GB.
I have the same problem in R, using a small batch size of 10. Memory always increases to the machine's maximum (32 GB in my case). I have set up a CNN, and even seemingly small training examples cause huge increases in memory.
Is there a workaround for an R user?
I'm using MXNet R version 0.10.1.
I am still experiencing this. If you don't run mx.nd.waitall(), memory leaks on whichever context you are using. I am using MXNet version 1.3.1.
The memory leak happens at the module's backward step; I believe there are some bugs in it. nd.waitall() can fix it but leads to poor performance. @eric-haibin-lin
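A lighter-weight alternative to a full mx.nd.waitall() (an assumption on my part, not something confirmed in this thread) is to block only on a value you need anyway, such as the scalar loss; that keeps the frontend from running arbitrarily far ahead of the backend:

```python
import mxnet as mx

for data, label in train_iter:  # placeholder names as in the sketches above
    with mx.autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()
    trainer.step(data.shape[0])
    # Copying the mean loss to the CPU waits only on that dependency chain,
    # which in practice keeps the queue of pending operations bounded.
    running_loss = loss.mean().asscalar()
```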