This test is failing on CI: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/915/pipeline/
with the following error:
src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.
I have also encountered this in the wild, though very rarely.
@DickJC123
Hi @nswamy, can you add the 'CI' label to this one?
This is not CI related
Is it possible that memory is exhausted on CI?
Also encountered this error on Windows, CUDA 9.2, cudnn 7.1.4.
Ran the following command:

    python train_imagenet.py --benchmark 1 --gpus 0 --network inception-v3 --batch-size 64 --image-shape 3,299,299 --num-epochs 1 --kv-store device

Reducing the batch size to 16 resolved the issue.
I'm currently facing this issue, and reducing the batch size does not seem to fix it.
I'm trying to train fast neural style transfer.
The issue seems to arise when calling mod.save_params(), which throws the following error:
mxnet.base.MXNetError: [23:59:34] src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.
Update: I've managed to find a rather bizarre workaround to this issue.
I was hitting this error when calling mod.save_checkpoint(). However, if I caught the exception and saved again in the except block, it seemed to work flawlessly:

    try:
        mod.save_checkpoint(model_save_path, epoch)
    except Exception as excep:
        print("Exception caught: ", excep)
        mod.save_checkpoint(model_save_path, epoch)

Update: Sleeping for 0.5 seconds before saving the checkpoint also seems to help.

    import time

    time.sleep(0.5)
    mod.save_checkpoint(model_save_path, epoch)

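For anyone who wants to try both workarounds together (retry on failure, with a short pause in between), here is a minimal sketch. `retry_call` is a hypothetical helper name, not part of MXNet, and it assumes the failure is transient; it does nothing about the underlying cause.

```python
import time


def retry_call(fn, *args, attempts=2, delay=0.5, **kwargs):
    """Call fn(*args, **kwargs), pausing `delay` seconds between failed attempts.

    Hypothetical helper for illustration only; not an MXNet API.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn(*args, **kwargs)
        except Exception as err:  # broad catch, mirroring the workaround above
            last_error = err
            print("Attempt %d failed: %s" % (attempt, err))
            if attempt < attempts:
                time.sleep(delay)
    # Every attempt failed: re-raise the most recent error.
    raise last_error
```

With the names from the snippets above, the call would look like `retry_call(mod.save_checkpoint, model_save_path, epoch)`.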
@ThomasDelteil Is this still occurring on CI now? If it's not appearing again would you mind closing this?
@aluo-x @codewithsk Usually this is due to a lack of GPU memory. Reducing GPU memory consumption, for example by reducing the batch size or using a smaller model, should help. If you run into further problems with this, please create a separate issue with a title like "GPU memory overflow on xxx model with yyy batch size and zzz dataset". Meanwhile, I'll look for ways to improve this error message so that it indicates the actual root cause.
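As a rough illustration of the "reduce the batch size" advice, here is a sketch that halves the batch size until a run succeeds. `find_working_batch_size` and `run_fn` are hypothetical names: `run_fn` stands in for whatever launches a short training or benchmark run with a given batch size, and nothing here is an MXNet API. Whether a process recovers cleanly after a GPU memory failure is not established in this thread, so launching each attempt as a fresh process may be the safer choice.

```python
def find_working_batch_size(run_fn, start, minimum=1):
    """Halve the batch size from `start` until run_fn(batch_size) succeeds.

    Hypothetical helper for illustration only. `run_fn` is assumed to raise
    an exception when the run fails (for example, when GPU memory runs out).
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    batch_size = start
    while batch_size >= minimum:
        try:
            run_fn(batch_size)
            return batch_size
        except Exception as err:
            print("Batch size %d failed: %s" % (batch_size, err))
            batch_size //= 2
    raise RuntimeError("No batch size >= %d worked" % minimum)
```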