After adding WaitAll() support, the ResNet example is failing with a cudaMalloc out-of-memory error.
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Example
I wonder how much GPU memory is available in CI.
The input shape is (50, 3, 224, 224), which may trigger the OOM.
In addition, the model in cpp-package does not seem to converge.
I think it's running on a p3.8xlarge, which should be sufficient to run this test. @marcoabreu can you confirm?
> In addition, the model in cpp-package does not seem to converge.

Yes, I observed that too.
Since the input shape of ResNet is (3, 224, 224), I resized the MNIST images from (1, 28, 28) to (3, 224, 224).
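For context, here is roughly what that resize step could look like done imperatively with the cpp-package operator API. This is only a sketch: the helper name `ResizeMnistBatch` and the choice of the `_contrib_BilinearResize2D` and `tile` operators are my assumptions, not necessarily what resnet.cpp actually does.

```cpp
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Hypothetical helper: turn a (batch, 1, 28, 28) MNIST batch into (batch, 3, 224, 224).
NDArray ResizeMnistBatch(const NDArray &batch) {
  // Bilinear upsampling from 28x28 to 224x224.
  NDArray resized = Operator("_contrib_BilinearResize2D")
                        .SetParam("height", 224)
                        .SetParam("width", 224)
                        .SetInput("data", batch)
                        .Invoke()[0];
  // Replicate the single grayscale channel across 3 channels.
  return Operator("tile")
             .SetParam("reps", Shape(1, 3, 1, 1))
             .SetInput("data", resized)
             .Invoke()[0];
}
```

Either way, each sample grows from 28x28 = 784 values to 3x224x224 = 150,528 values, which is why the memory footprint grows so quickly with batch size.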
We run on a g3.8xlarge
Changing the batch size to a smaller value will address the OOM issue.
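For reference, a minimal sketch of where the batch size would be lowered, assuming the example builds its data iterator with `MXDataIter("MNISTIter")` like the other cpp-package examples do; the file paths and the value 10 are placeholders:

```cpp
#include <map>
#include <string>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

int main() {
  const int batch_size = 10;  // was 50; smaller batches keep activation memory within the GPU limit

  // The same batch size has to be used by the data iterator...
  auto train_iter = MXDataIter("MNISTIter")
      .SetParam("image", "./data/mnist_data/train-images-idx3-ubyte")
      .SetParam("label", "./data/mnist_data/train-labels-idx1-ubyte")
      .SetParam("batch_size", batch_size)
      .SetParam("shuffle", 1)
      .SetParam("flat", 0)
      .CreateDataIter();

  // ...and by the input NDArray the ResNet symbol is bound with.
  std::map<std::string, NDArray> args;
  args["data"] = NDArray(Shape(batch_size, 3, 224, 224), Context::gpu(0));
  args["label"] = NDArray(Shape(batch_size), Context::gpu(0));
  // ... infer the remaining argument shapes, bind, and train as before ...

  MXNotifyShutdown();
  return 0;
}
```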
@marcoabreu There have been no recent changes to alexnet.cpp, resnet.cpp, or the cpp-package.
Have there been any changes to the underlying CUDA or MXNet implementation?
These tests are part of the CI suite and have been passing before. We could change the examples so that they pass on lower-capacity instances, but in my opinion that wouldn't be the right solution.
Has the infrastructure that these tests run on changed recently? It seems the test would run fine on a p3.8xl but fail on a g3.8x (legacy hardware)... @marcoabreu
As I said, this started happening with the WaitAll() change. WaitAll() earlier used to hide exceptions, but with PR https://github.com/apache/incubator-mxnet/pull/14397 they are now thrown. These problems would have existed before, but they are only surfacing now.
I tried these examples with the recent "WaitAll()" code change on p2.16x and c5.18x instances. I did not see the crash.
However, we still need to add the missing exception handling to the example so that we can prevent crashes due to unhandled exceptions.
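Agreed. For what it's worth, here is a minimal sketch of that handling, assuming the example's training loop lives in main() and that engine failures (such as the cudaMalloc error) surface as dmlc::Error once WaitAll() propagates them:

```cpp
#include <iostream>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

int main() {
  try {
    // ... build the ResNet symbol, bind the executor, run the training loop ...
    NDArray::WaitAll();  // now rethrows exceptions recorded by the engine
  } catch (const dmlc::Error &err) {
    std::cerr << "Training failed: " << err.what() << std::endl;
    std::cerr << "MXNet error: " << MXGetLastError() << std::endl;
    MXNotifyShutdown();
    return 1;
  }
  MXNotifyShutdown();
  return 0;
}
```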
Hi @leleamol. To reproduce this, you will have to use a g3.8xlarge. I was able to reproduce it on a g3.8xlarge.
Could someone please look at the GPU memory used by the model?
The last time I observed it, it was around 11 GB. For now I am going to use a smaller batch_size for the tests, and later @leleamol will revisit and improve the cpp tests.
@anirudh2290
I could reproduce this issue on p2.8 as well when I changed the batch size to 100.
The example uses only one GPU. With a batch size of 50, the GPU memory usage reaches 11 GB.
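If it helps to confirm the 11 GB figure from inside the example, here is a small sketch that queries per-device memory; it assumes the `MXGetGPUMemoryInformation64` C API is available in the MXNet build being tested:

```cpp
#include <cstdint>
#include <iostream>
#include <mxnet/c_api.h>

// Print used/total memory for one GPU, e.g. right after Bind() and after each epoch.
void PrintGpuMemory(int device_id) {
  uint64_t free_bytes = 0, total_bytes = 0;
  if (MXGetGPUMemoryInformation64(device_id, &free_bytes, &total_bytes) == 0) {
    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::cout << "GPU " << device_id << ": "
              << (total_bytes - free_bytes) / gib << " GiB used of "
              << total_bytes / gib << " GiB" << std::endl;
  } else {
    std::cerr << MXGetLastError() << std::endl;
  }
}
```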
This issue can be closed since the PR has been merged. @lanking520