hi, guys, I recently tried to run the yolov3 example code provided in 07. Train YOLOv3 on PASCAL VOC with my own dataset. The CPU version runs pretty smoothly (comment the GPU context), but the GPU version runs with problems, saying that the workspace size is not enough. I tried to reduce the batch_size from 16 to 8,4,2,1 and this error occurs constantly. In fact, using watch -n 1 nvidia-smi, the GPU memory was only less than 1GB out of 12GB during the whole running process. I was wondering whether my mxnet-cu80 installation was alright, so I run the validation example code a = mx.nd.ones((2, 3), mx.gpu()), it took me like 10min(very long) to able to input the next line 'b = a * 2 + 1'. Anyway, the results showed that my code can run in GPU context and I don't know what is wrong with this whole situation.
[18:29:52] src/operator/nn/./cudnn/./cudnn_algoreg-inl.h:97: Running performance tests to find the best convolution algorithm, this can take a while... (set the environment variable MXNET_CUDNN_AUTOTUNE_DEFAULT to 0 to disable)
Traceback (most recent call last):
File "train_yolo3.py", line 402, in
train(net, train_data, val_data, eval_metric, ctx, args)
File "train_yolo3.py", line 313, in train
obj_metrics.update(0, obj_losses)
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/metric.py", line 1636, in update
loss = ndarray.sum(pred).asscalar()
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 2014, in asscalar
return self.asnumpy()[0]
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/base.py", line 253, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [18:29:52] src/operator/nn/./cudnn/cudnn_convolution-inl.h:948: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size
Stack trace:
[bt] (0) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4958fb) [0x7fd86c5468fb]
[bt] (1) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319b5b7) [0x7fd86f24c5b7]
[bt] (2) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x319f335) [0x7fd86f250335]
[bt] (3) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d514) [0x7fd86f23e514]
[bt] (4) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318d9ce) [0x7fd86f23e9ce]
[bt] (5) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318e352) [0x7fd86f23f352]
[bt] (6) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x318fa43) [0x7fd86f240a43]
[bt] (7) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x3195159) [0x7fd86f246159]
[bt] (8) /home/disk2/zhangyingwen/anaconda2/envs/newMx/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function
pip install mxnet-cu80=1.6,1.5,1.4,1.0I also look up some similar issues with approximately the same error
install two mxnet
Nvidia driver don't match mxnet version
try old version of Mxnet
None of these solutions above solve my problem and I have been stuck in this problem for two days, so please help me if you happen to encounter the same error, THANK YOU!
Ubuntu 16.04, TITAN V, Nvidia driver 396.24.10, CUDA 8.0.61, cuDNN 6.0.21, mxnet-cu80
I think it may be related to the driver.
Is it convenient to update the nvidia driver and cuda into the version 10.1 ?
I think it may be related to the driver.
Is it convenient to update the nvidia driver and cuda into the version 10.1 ?
Thank you for replying, it is not easy to persuade my boss to give me the access, but it would be my final choice if nothing more can be done. I plan to update cuda8.0 to 9.0 first with the current driver.
I update my cuda to 9.0 version and also cudnn to 7.6 (follow the instruction here )
Then I run the mxnet example code a = mx.nd.ones((2, 3), mx.gpu()), only a warning goes like
this mxnet has been built against cuda library version 9000, which is older than the oldest version tested by CI (7600). Set MXNET_CUDNN_LIB_CHECKING=0 to quiet this warning.
I overlook this warning and try to run my yolo.py, and there is another error
Check failed: compileResult == NVRTC_SUCCESS (7 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0
I type export MXNET_USE_FUSION=0 and run my .py again and it turns out
PROBLEM SOLVED!