After adding WaitAll() support, the ResNet example is failing with a cudaMalloc out-of-memory error.
Hey, this is the MXNet Label Bot.
Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it.
Here are my recommended labels: Example
I wonder how much GPU memory is available in CI.
The input shape is (50, 3, 224, 224), which may trigger the OOM.
In addition, the model in cpp-package does not seem to converge.
I think it's running on a p3.8xlarge, which should be sufficient to run this test. @marcoabreu can you confirm?
> In addition, the model in cpp-package does not seem to converge.

Yes, I observed that too.
Since the input shape of ResNet is (3, 224, 224), I resized the MNIST images from (1, 28, 28) to (3, 224, 224).
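For context, here is roughly what that resize step could look like done imperatively with the cpp-package operator API. This is only a sketch: the helper name `ResizeMnistBatch` and the choice of the `_contrib_BilinearResize2D` and `tile` operators are my assumptions, not necessarily what resnet.cpp actually does.

```cpp
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

// Hypothetical helper: turn a (batch, 1, 28, 28) MNIST batch into (batch, 3, 224, 224).
NDArray ResizeMnistBatch(const NDArray &batch) {
  // Bilinear upsampling from 28x28 to 224x224.
  NDArray resized = Operator("_contrib_BilinearResize2D")
                        .SetParam("height", 224)
                        .SetParam("width", 224)
                        .SetInput("data", batch)
                        .Invoke()[0];
  // Replicate the single grayscale channel across 3 channels.
  return Operator("tile")
             .SetParam("reps", Shape(1, 3, 1, 1))
             .SetInput("data", resized)
             .Invoke()[0];
}
```

Either way, each sample grows from 28x28 = 784 values to 3x224x224 = 150,528 values, which is why the memory footprint grows so quickly with batch size.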
We run on a g3.8xlarge
Changing the batch size to a smaller value will address the OOM issue.
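For reference, a minimal sketch of where the batch size would be lowered, assuming the example builds its data iterator with `MXDataIter("MNISTIter")` like the other cpp-package examples do; the file paths and the value 10 are placeholders:

```cpp
#include <map>
#include <string>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

int main() {
  const int batch_size = 10;  // was 50; smaller batches keep activation memory within the GPU limit

  // The same batch size has to be used by the data iterator...
  auto train_iter = MXDataIter("MNISTIter")
      .SetParam("image", "./data/mnist_data/train-images-idx3-ubyte")
      .SetParam("label", "./data/mnist_data/train-labels-idx1-ubyte")
      .SetParam("batch_size", batch_size)
      .SetParam("shuffle", 1)
      .SetParam("flat", 0)
      .CreateDataIter();

  // ...and by the input NDArray the ResNet symbol is bound with.
  std::map<std::string, NDArray> args;
  args["data"] = NDArray(Shape(batch_size, 3, 224, 224), Context::gpu(0));
  args["label"] = NDArray(Shape(batch_size), Context::gpu(0));
  // ... infer the remaining argument shapes, bind, and train as before ...

  MXNotifyShutdown();
  return 0;
}
```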
@marcoabreu There have been no recent changes to alexnet.cpp, resnet.cpp, or the cpp-package.
Have there been any changes to the underlying CUDA or MXNet implementation?
These tests are part of the CI suite and have been passing before. We could change the examples so that they pass on lower-capacity instances, but in my opinion that wouldn't be the right solution.
Has the infrastructure that these tests run on changed recently? It seems the test would run fine on a p3.8xl but fail on a g3.8x (legacy hardware)... @marcoabreu
As I said, this started happening with the WaitAll() change. WaitAll() earlier used to hide exceptions, but with PR https://github.com/apache/incubator-mxnet/pull/14397 they are now thrown. These problems would have existed before, but they are only surfacing now.
I tried these examples with the recent "WaitAll()" code change on p2.16x and c5.18x instances. I did not see the crash.
However, we still need to add the missing exception handling to the example so that we can prevent crashes due to unhandled exceptions.
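Agreed. For what it's worth, here is a minimal sketch of that handling, assuming the example's training loop lives in main() and that engine failures (such as the cudaMalloc error) surface as dmlc::Error once WaitAll() propagates them:

```cpp
#include <iostream>
#include "mxnet-cpp/MxNetCpp.h"

using namespace mxnet::cpp;

int main() {
  try {
    // ... build the ResNet symbol, bind the executor, run the training loop ...
    NDArray::WaitAll();  // now rethrows exceptions recorded by the engine
  } catch (const dmlc::Error &err) {
    std::cerr << "Training failed: " << err.what() << std::endl;
    std::cerr << "MXNet error: " << MXGetLastError() << std::endl;
    MXNotifyShutdown();
    return 1;
  }
  MXNotifyShutdown();
  return 0;
}
```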
Hi @leleamol. To reproduce this, you will have to use a g3.8xlarge. I was able to reproduce it on a g3.8xlarge.
Could someone please look at the GPU memory used by the model?
The last time I observed it, it was around 11 GB. For now I am going to use a smaller batch_size for the tests, and later @leleamol will revisit and improve the cpp tests.
@anirudh2290
I could reproduce this issue on p2.8 as well when I changed the batch size to 100.
The example uses only one GPU. With a batch size of 50, the GPU memory usage reaches 11 GB.
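If it helps to confirm the 11 GB figure from inside the example, here is a small sketch that queries per-device memory; it assumes the `MXGetGPUMemoryInformation64` C API is available in the MXNet build being tested:

```cpp
#include <cstdint>
#include <iostream>
#include <mxnet/c_api.h>

// Print used/total memory for one GPU, e.g. right after Bind() and after each epoch.
void PrintGpuMemory(int device_id) {
  uint64_t free_bytes = 0, total_bytes = 0;
  if (MXGetGPUMemoryInformation64(device_id, &free_bytes, &total_bytes) == 0) {
    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::cout << "GPU " << device_id << ": "
              << (total_bytes - free_bytes) / gib << " GiB used of "
              << total_bytes / gib << " GiB" << std::endl;
  } else {
    std::cerr << MXGetLastError() << std::endl;
  }
}
```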
This issue can be closed since the PR has been merged. @lanking520