Incubator-mxnet: Can't run YOLOv3 on GPU, not enough workspace size

Created on 9 Jun 2020 · 3 Comments · Source: apache/incubator-mxnet

Description

Hi all, I recently tried to run the YOLOv3 example code from "07. Train YOLOv3 on PASCAL VOC" with my own dataset. The CPU version runs smoothly (with the GPU context commented out), but the GPU version fails with an error saying that the workspace size is not enough. I reduced batch_size from 16 to 8, 4, 2, and 1, and the error occurred every time. According to watch -n 1 nvidia-smi, GPU memory usage stayed below 1 GB out of 12 GB for the whole run. To check whether my mxnet-cu80 installation was working, I ran the validation example a = mx.nd.ones((2, 3), mx.gpu()); it took about 10 minutes before I could enter the next line, b = a * 2 + 1. The result showed that my code can run in the GPU context, so I don't know what is going wrong.
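The validation step described above can be turned into a small timed check. This is only a sketch: the function name gpu_sanity_check is made up for illustration, it assumes GPU 0 is the device of interest, and it returns None when MXNet or a GPU is not available. Because MXNet executes operations asynchronously, the call to asnumpy() is what forces the computation to finish, so the timing covers the real GPU work.

```python
import time

try:
    import mxnet as mx
except (ImportError, OSError):  # MXNet missing, or its shared libraries fail to load
    mx = None


def gpu_sanity_check():
    """Run a tiny computation on GPU 0 and time it (hypothetical helper).

    Returns None if MXNet or a GPU is not usable in this environment.
    """
    if mx is None:
        return None
    try:
        if mx.context.num_gpus() == 0:
            return None
    except Exception:  # querying the GPU count can itself fail on a broken driver setup
        return None

    start = time.perf_counter()
    a = mx.nd.ones((2, 3), ctx=mx.gpu(0))
    b = a * 2 + 1
    values = b.asnumpy()  # blocks until the GPU computation has finished
    elapsed = time.perf_counter() - start
    return {"seconds": elapsed, "values_ok": bool((values == 3).all())}


result = gpu_sanity_check()
print(result)
```

A first call that takes minutes rather than a fraction of a second, as reported here, shows up directly in the "seconds" field.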

Error Message

[18:29:52] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
  File "train_yolo3.py", line 402, in <module>
    train(net, train_data, val_data, eval_metric, ctx, args)
  File "train_yolo3.py", line 313, in train
    obj_metrics.update(0, obj_losses)
  File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/metric.py", line 1636, in update
    loss = ndarray.sum(pred).asscalar()
  File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2014, in asscalar
    return self.asnumpy()[0]
  File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:29:52] src/operator/nn/./cudnn/cudnn_convolution-inl.h:948: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Stack trace:
[bt] (0) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4958fb) [0x7fd86c5468fb]
[bt] (1) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319b5b7) [0x7fd86f24c5b7]
[bt] (2) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319f335) [0x7fd86f250335]
[bt] (3) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d514) [0x7fd86f23e514]
[bt] (4) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d9ce) [0x7fd86f23e9ce]
[bt] (5) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318e352) [0x7fd86f23f352]
[bt] (6) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318fa43) [0x7fd86f240a43]
[bt] (7) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3195159) [0x7fd86f246159]
[bt] (8) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function > const&, std::vector > const&, std::vector > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector > const&, std::vector > const&, std::vector > const&, std::vector > const&, std::vector > const&, std::vector > const&, std::vector > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x307) [0x7fd86e6fe597]
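The workspace figure in the error is easier to interpret once converted. The sketch below pulls the byte count out of the message and expresses it in MiB; the helper name workspace_bytes is hypothetical. The result, 1024 MiB, appears to be the workspace limit offered to cuDNN for the convolution rather than the total GPU memory, which would be consistent with nvidia-smi showing under 1 GB in use, though that reading is an assumption and not something the log states.

```python
import re

# Error text copied from the log above.
ERROR_LINE = (
    "Failed to find any forward convolution algorithm. with workspace size of "
    "1073741824 bytes, please consider reducing batch/model size or "
    "increasing the workspace size"
)


def workspace_bytes(message):
    """Extract the workspace size in bytes from the error text, or None if absent."""
    match = re.search(r"workspace size of (\d+) bytes", message)
    return int(match.group(1)) if match else None


size = workspace_bytes(ERROR_LINE)
size_mib = size / 2**20  # bytes -> MiB
print(f"{size} bytes = {size_mib:.0f} MiB")  # 1073741824 bytes = 1024 MiB
```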

What have you tried to solve it?

Changed batch_size from 16 to 8, 4, 2, and 1
Resized the images from 608 to 416
Checked the MXNet environment and tried older versions with pip install mxnet-cu80==1.6 (likewise 1.5, 1.4, and 1.0)

I also looked up similar issues reporting approximately the same error. The causes and fixes suggested there were:
Two MXNet packages installed at once
NVIDIA driver not matching the MXNet version
Trying an older version of MXNet
None of these solved my problem, and I have been stuck on it for two days, so please help if you have encountered the same error. Thank you!
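One of the causes listed above, two MXNet packages installed side by side, can be checked from Python. The following is a minimal sketch, assuming Python 3.8 or newer for importlib.metadata (the environment in this report uses Python 3.6, where pip list gives the same information). The function name find_mxnet_distributions is made up for illustration.

```python
from importlib import metadata  # standard library, Python 3.8+


def find_mxnet_distributions():
    """Return sorted (name, version) pairs for installed packages whose name starts with 'mxnet'."""
    found = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name and name.lower().startswith("mxnet"):
            found.add((name, dist.version))
    return sorted(found)


installed = find_mxnet_distributions()
for name, version in installed:
    print(name, version)
if len(installed) > 1:
    print("More than one MXNet package is installed; they may conflict.")
```

Seeing both a CPU package and a CUDA package (for example mxnet and mxnet-cu80) in the output would point to the conflict described in those issues.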

Environment

Ubuntu 16.04, TITAN V, Nvidia driver 396.24.10, CUDA 8.0.61, cuDNN 6.0.21, mxnet-cu80

Bug

smileyzyw

All 3 comments

I think it may be related to the driver.
Would it be possible to update the NVIDIA driver and CUDA to version 10.1?

wkcn on 9 Jun 2020

I think it may be related to the driver.
Would it be possible to update the NVIDIA driver and CUDA to version 10.1?

Thank you for replying. It is not easy to persuade my boss to give me access, but it will be my last resort if nothing else can be done. I plan to update CUDA from 8.0 to 9.0 first, keeping the current driver.

smileyzyw on 9 Jun 2020

I updated CUDA to 9.0 and cuDNN to 7.6 (following the instructions here).

Then I ran the MXNet example code a = mx.nd.ones((2, 3), mx.gpu()), which produced only a warning:
this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
I ignored this warning and ran my yolo.py, which produced another error:
Check failed: compileResult == NVRTC_SUCCESS (7 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0
I ran export MXNET_USE_FUSION=0, ran my .py script again, and:
PROBLEM SOLVED!
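For anyone who prefers to keep the setting inside the training script rather than in the shell, the same variable can be set from Python. This is a sketch, not part of the original fix: the helper apply_mxnet_env and the MXNET_ENV dictionary are hypothetical names, and setting the variables before importing mxnet is a conservative choice rather than a documented requirement.

```python
import os

# Variables named in this thread. Only MXNET_USE_FUSION was part of the reported fix;
# the autotune switch is the one the log mentions and is left commented out.
MXNET_ENV = {
    "MXNET_USE_FUSION": "0",
    # "MXNET_CUDNN_AUTOTUNE_DEFAULT": "0",
}


def apply_mxnet_env(settings=MXNET_ENV):
    """Write the given settings into os.environ and return what was applied."""
    for key, value in settings.items():
        os.environ[key] = value
    return {key: os.environ[key] for key in settings}


applied = apply_mxnet_env()
print(applied)

# import mxnet as mx  # import MXNet only after the variables are set
```

Child processes started from the script inherit these values, but other shells do not, so the export form is still needed when launching from a fresh terminal.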

smileyzyw on 10 Jun 2020
