Incubator-mxnet: Fix memory management

Created on 5 Jan 2016 · 13 Comments · Source: apache/incubator-mxnet

As discussed in #1130, the pooled memory manager has a hard threshold of 4GB. On a video card with less than 4GB of memory, this results in OOM even when plenty of memory is cached.

The following changes are proposed:

Make the threshold a parameter that defaults to the GPU memory available when mxnet starts, and make it configurable via an environment variable or a function call.
Change the check kThreshold <= used_memory_ + size (see discussion in #1130).
Maintain an LRU list of memory chunks and, when there is not enough memory, free only enough for the new allocation instead of freeing everything.
Add a function that reports the amount of cached memory.
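
To make the proposal concrete, here is a minimal Python simulation of the idea, not the actual C++ storage manager. It models a configurable threshold, LRU eviction that frees only as much as the new allocation needs, and a cached-memory query. The class name, the `POOL_THRESHOLD_BYTES` environment variable, and all method names are hypothetical. The sketch takes the threshold as a plain number rather than querying the GPU, and only reuses cached chunks whose size matches exactly.

```python
import os
from collections import OrderedDict


class PooledManagerSketch:
    """Toy model of a pooled memory manager; no real GPU calls are made."""

    def __init__(self, threshold_bytes=None):
        if threshold_bytes is None:
            # Hypothetical variable name; falls back to the current 4GB value.
            threshold_bytes = int(os.environ.get("POOL_THRESHOLD_BYTES", 4 << 30))
        self.threshold = threshold_bytes
        self.used = 0                 # bytes held by the pool (live + cached)
        self.cached = OrderedDict()   # chunk id -> size, least recently freed first
        self._next_id = 0

    def cached_bytes(self):
        """Report how much memory is cached and could be released."""
        return sum(self.cached.values())

    def alloc(self, size):
        # Reuse a cached chunk of exactly this size when one exists.
        for chunk_id, chunk_size in self.cached.items():
            if chunk_size == size:
                del self.cached[chunk_id]
                return chunk_id, size
        # Otherwise evict least recently freed chunks, but only until it fits.
        while self.used + size > self.threshold and self.cached:
            _, freed = self.cached.popitem(last=False)
            self.used -= freed
        if self.used + size > self.threshold:
            raise MemoryError("cannot allocate %d bytes" % size)
        self.used += size
        self._next_id += 1
        return self._next_id, size

    def free(self, chunk):
        # Freed chunks go back to the pool; the newest sits at the end.
        chunk_id, size = chunk
        self.cached[chunk_id] = size
```

With a threshold of 100, two cached 40-byte chunks, and a request for 50 bytes, this sketch evicts only the older chunk and keeps the other one cached.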

If the plan sounds good, I will start working on it after the ICML deadline (which is Feb 5th).

SkidanovAlex

All 13 comments

Looks good to me.

piiswrong on 6 Jan 2016

To give an update, I never got a chance to finish the change I described above.

As a quick fix for the actual problem (failing with a CUDA OOM error), one good option would be to change the pooled manager to catch that error once, call ReleaseAll, and make a second attempt to allocate. I've had this change in my local branch for a while now, and it seems to work both on my machine and on AWS (without it, my models OOM on both), but I don't have a test for it. I can send a PR without a test.
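
The retry logic described above can be sketched in Python as a simulation. This is not the code from the branch: `FakeDevice` stands in for the CUDA allocator, and every name except the idea of a ReleaseAll step is made up for illustration.

```python
class DeviceOutOfMemory(Exception):
    """Stand-in for the CUDA out-of-memory error."""


class FakeDevice:
    """Hypothetical device with a fixed capacity, used only for this sketch."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def malloc(self, size):
        if self.in_use + size > self.capacity:
            raise DeviceOutOfMemory(size)
        self.in_use += size
        return size

    def free(self, size):
        self.in_use -= size


class RetryingPool:
    def __init__(self, device):
        self.device = device
        self.cache = {}  # chunk size -> number of cached chunks of that size

    def release_all(self):
        # Return every cached chunk to the device.
        for size, count in self.cache.items():
            for _ in range(count):
                self.device.free(size)
        self.cache.clear()

    def alloc(self, size):
        if self.cache.get(size, 0) > 0:
            self.cache[size] -= 1
            return size
        try:
            return self.device.malloc(size)
        except DeviceOutOfMemory:
            # Catch the error once, drop the cache, and try a second time.
            # A second failure propagates to the caller.
            self.release_all()
            return self.device.malloc(size)

    def free(self, size):
        # Freed chunks stay on the device and are only cached by the pool.
        self.cache[size] = self.cache.get(size, 0) + 1
```

In this model, a request that fails only because cached chunks are occupying the device succeeds on the second attempt, while a request that is too large for the device still raises.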

SkidanovAlex on 15 Apr 2016

This sounds like a good idea to me. A PR is welcome.

tqchen on 26 Apr 2016

@SkidanovAlex Any updates on this?

tqchen on 7 May 2016

@tqchen, sorry, I missed your comment from April 26th and was still waiting for approval. I will send a PR within a couple of days.

SkidanovAlex on 13 May 2016

sounds great, thanks!

tqchen on 13 May 2016

I'm on commit 92a2d03740af5cd11d0599c8fc16d8a52132761e from Sep 3rd, and I can confirm that I still hit this issue, so #3055 didn't solve it.
I'm running the neural-style sample and I'm getting:

(venv) ➜ neural-style git:(master) ✗ python run.py
INFO:root:load the content image, size = (1000, 1500)
INFO:root:resize the content image to (400, 600)
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:57] /Users/yosit/git/mxnet/dmlc-core/include/dmlc/logging.h:235: [18:38:57] src/storage/./pooled_storage_manager.h:67: Memory allocation failed.
[18:38:57] /Users/yosit/git/mxnet/dmlc-core/include/dmlc/logging.h:235: [18:38:57] src/engine/./threaded_engine.h:306: [18:38:57] src/storage/./pooled_storage_manager.h:67: Memory allocation failed.
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
libc++abi.dylib: terminating with uncaught exception of type dmlc::Error: [18:38:57] src/engine/./threaded_engine.h:306: [18:38:57] src/storage/./pooled_storage_manager.h:67: Memory allocation failed.
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
[1] 34843 abort python run.py
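
The error message above suggests switching to the synchronous engine while debugging. A minimal Python sketch of doing that for a single run, without touching the shell environment, might look like this. MXNET_ENGINE_TYPE and NaiveEngine come from the message itself; the helper names and the run.py default are only illustrative.

```python
import os
import subprocess
import sys


def naive_engine_env(base_env=None):
    """Return a copy of the environment with the synchronous engine selected."""
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env


def run_synchronously(script="run.py"):
    """Run a script in a child process with MXNET_ENGINE_TYPE=NaiveEngine."""
    return subprocess.call([sys.executable, script], env=naive_engine_env())
```

Because the variable is set only for the child process, there is nothing to reset after debugging.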

yosit on 5 Sep 2016

@yosit

I got the same error when running the neural art demo. This demo runs fine in CPU mode, and the other demos run fine in both CPU mode and GPU mode.

After reviewing the source code, I think this error is thrown while processing the function MXExecutorBindEX in C_api.cc.

However, I have not found any solution to fix it. Can anyone help?

xiesiyuan on 5 Sep 2016

@yosit

Solved. This error is actually equivalent to an OOM (out of memory) failure. Change the --max-long-edge argument when running the Python command so that the source JPG is resized to a smaller image. Problem solved.
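
As a rough illustration of why a smaller --max-long-edge helps, the sketch below computes the resized shape for a given bound on the long edge. It is a guess at the arithmetic, not the code from run.py, and the function name is hypothetical. With a bound of 600 it reproduces the (1000, 1500) to (400, 600) resize shown in the log above.

```python
def resize_to_long_edge(shape, max_long_edge):
    """Scale (height, width) so that the longer side is at most max_long_edge."""
    height, width = shape
    long_edge = max(height, width)
    if long_edge <= max_long_edge:
        return shape
    # Integer arithmetic keeps the aspect ratio up to rounding down.
    return (height * max_long_edge // long_edge,
            width * max_long_edge // long_edge)


def pixel_count(shape):
    """Number of pixels, a rough proxy for per-layer activation memory."""
    height, width = shape
    return height * width
```

Since the pixel count shrinks with the square of the scale factor, a modest reduction of the long edge cuts memory use considerably.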

xiesiyuan on 6 Sep 2016

That's what I did, but the result is less than impressive. I managed to go up to as much as 350 pixels; with more than that, it crashed.

yosit on 6 Sep 2016

@yosit
Correct, with more than 350 pixels it will fail. My GPU is a GTX940M, which can only support 350 pixels for this demo.

xiesiyuan on 11 Sep 2016

I get the same issue with the rnn sample.