Incubator-mxnet: Fix memory management

Created on 5 Jan 2016  ·  13 Comments  ·  Source: apache/incubator-mxnet

As discussed in #1130, the pooled memory manager has a hard-coded threshold of 4GB. On a video card with less than 4GB of memory, this results in OOM even when plenty of memory is cached.

The following changes are proposed:

  1. Make the threshold a parameter that defaults to the GPU memory available when mxnet starts. Make it configurable via an environment variable or a function call.
  2. Change the check kThreshold <= used_memory_ + size (see discussion in #1130)
  3. Maintain an LRU list of memory chunks, and when there's not enough memory, free only enough for the new allocation instead of freeing everything.
  4. Add a function that reports the amount of cached memory.
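
To make the proposal concrete, here is a minimal sketch of items 1, 3 and 4 as a toy model in Python. It is not the mxnet implementation: the class name `PooledAllocator`, its methods, and the exact comparison against the threshold are all hypothetical, sizes are plain integers, and no GPU memory is touched. The form of the check from item 2 depends on the discussion in #1130 and is not reproduced here.

```python
from collections import OrderedDict


class PooledAllocator:
    """Toy model of a pooling allocator with a configurable threshold.

    Hypothetical names throughout; sizes are plain integers (bytes).
    """

    def __init__(self, threshold):
        self.threshold = threshold   # item 1: a parameter, not a 4GB constant
        self.used = 0                # bytes held by live and cached chunks
        self.cache = OrderedDict()   # chunk id -> size, least recently freed first
        self._next_id = 0

    def cached_bytes(self):
        """Item 4: report how much memory is sitting in the cache."""
        return sum(self.cache.values())

    def alloc(self, size):
        # Reuse a cached chunk of exactly this size when one exists.
        for chunk_id, chunk_size in self.cache.items():
            if chunk_size == size:
                del self.cache[chunk_id]
                return chunk_id, size
        # Item 3: evict least recently used chunks only until the request fits.
        while self.used + size > self.threshold and self.cache:
            _, evicted = self.cache.popitem(last=False)
            self.used -= evicted
        if self.used + size > self.threshold:
            raise MemoryError("allocation of %d bytes exceeds threshold" % size)
        self.used += size
        self._next_id += 1
        return self._next_id, size

    def free(self, chunk):
        # Freed chunks go to the most recently used end of the cache.
        chunk_id, size = chunk
        self.cache[chunk_id] = size
```

With a threshold of 100, allocating and freeing two 40-byte chunks and then requesting 50 bytes evicts only the older chunk, leaving the other 40 bytes cached.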

If the plan sounds good, I will start working on it after the ICML deadline (which is Feb 5th).

All 13 comments

Looks good to me.

To give an update, I never got a chance to finish the change I described above.

As a quick fix for the actual problem (failing with a CUDA OOM error), the pooled manager could catch that error once, call ReleaseAll, and make a second attempt to allocate. I've had this change in my local branch for a while now, and it seems to work both on my machine and on AWS (without it my models OOM on both), but I don't have a test for it. I can send a PR without a test.
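
A rough sketch of that retry-once idea, written in Python rather than the C++ of the storage manager. The names `alloc_with_retry`, `raw_alloc`, `release_all` and `FakeDevice` are hypothetical stand-ins for illustration, and `MemoryError` stands in for the CUDA OOM failure; this is not the code from the branch mentioned above.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU with a fixed capacity in bytes."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.in_use = 0

    def malloc(self, size):
        if self.in_use + size > self.capacity:
            raise MemoryError("device out of memory")
        self.in_use += size
        return size

    def free(self, size):
        self.in_use -= size


def alloc_with_retry(raw_alloc, release_all, size):
    """Try raw_alloc(size); on failure, release the cache and retry once."""
    try:
        return raw_alloc(size)
    except MemoryError:
        release_all()            # hand every cached chunk back to the device
        return raw_alloc(size)   # a second failure propagates to the caller


# Usage: 60 bytes are cached, so an 80-byte request fails the first time.
device = FakeDevice(capacity=100)
cached = [device.malloc(60)]


def release_all():
    while cached:
        device.free(cached.pop())


chunk = alloc_with_retry(device.malloc, release_all, 80)
```

The retry happens exactly once, so a request that genuinely does not fit still surfaces as an error instead of looping.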

This sounds like a good idea to me. A PR is welcome.

@SkidanovAlex Any updates on this?

@tqchen, sorry, I missed your comment from April 26th and was still waiting for approval. I will send a PR within a couple of days.

sounds great, thanks!

I'm on commit 92a2d03740af5cd11d0599c8fc16d8a52132761e from Sep 3rd, and I can confirm I still hit this problem, so #3055 didn't solve the issue.
I'm running the neural-style sample and I'm getting:

(venv) ➜ neural-style git:(master) ✗ python run.py
INFO:root:load the content image, size = (1000, 1500)
INFO:root:resize the content image to (400, 600)
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:55] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:56] src/operator/./reshape-inl.h:249: Using target_shape will be deprecated.
[18:38:57] /Users/yosit/git/mxnet/dmlc-core/include/dmlc/logging.h:235: [18:38:57] src/storage/./pooled_storage_manager.h:67: Memory allocation failed.
[18:38:57] /Users/yosit/git/mxnet/dmlc-core/include/dmlc/logging.h:235: [18:38:57] src/engine/./threaded_engine.h:306: [18:38:57] src/storage/./pooled_storage_manager.h:67: Memory allocation failed.
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
libc++abi.dylib: terminating with uncaught exception of type dmlc::Error: [18:38:57] src/engine/./threaded_engine.h:306: [18:38:57] src/storage/./pooled_storage_manager.h:67: Memory allocation failed.
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPE to NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
[1] 34843 abort python run.py
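
The error message above suggests setting MXNET_ENGINE_TYPE to NaiveEngine while debugging. One way to do that for a single run without changing the shell environment is sketched below; the helper name `naive_engine_env` is hypothetical, and only the variable name and value come from the log.

```python
import os


def naive_engine_env(base_env=None):
    """Return a copy of the environment with the synchronous engine selected."""
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_ENGINE_TYPE"] = "NaiveEngine"
    return env
```

The returned dictionary can be passed as the `env` argument of `subprocess.run(["python", "run.py"], env=naive_engine_env())`, so the setting disappears once the debugging run ends.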

@yosit

I get the same error when I run the neural art demo. This demo runs fine in CPU mode, and the other demos run fine in both CPU and GPU mode.

After reviewing the source code, I think this error is thrown while executing the function MXExecutorBindEX in C_api.cc.

However, I have not found a way to fix it. Can anyone help?

@yosit

Solved. This error is really just an OOM (out of memory) failure. Pass a smaller value for the --max-long-edge argument on the python command line so that the source JPG is resized to a smaller image. Problem solved.
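
For a sense of the arithmetic, the log earlier in the thread shows a (1000, 1500) image resized to (400, 600), which is consistent with capping the long edge at 600 and scaling both sides by the same factor. The sketch below assumes that is how the resize works; the function name `fit_long_edge` is hypothetical and is not taken from the demo's source.

```python
def fit_long_edge(height, width, max_long_edge):
    """Scale (height, width) so the longer side is at most max_long_edge."""
    long_edge = max(height, width)
    if long_edge <= max_long_edge:
        return height, width                     # already small enough
    scale = max_long_edge / float(long_edge)     # same factor for both sides
    return int(round(height * scale)), int(round(width * scale))


size_600 = fit_long_edge(1000, 1500, 600)
size_350 = fit_long_edge(1000, 1500, 350)
```

Since the pixel count falls with the square of the scale factor, lowering the cap shrinks the image much faster than the edge length alone suggests.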

That's what I did, but the result is less than impressive. I managed to go up to 350 pixels; anything larger crashed.

@yosit
Correct, anything above 350 pixels fails. My GPU is a GTX940M, which can only handle 350 pixels for this demo.

I get the same issue with the rnn sample.

This issue is closed due to lack of activity in the last 90 days. Feel free to reopen if this is still an active issue. Thanks!
