Incubator-mxnet: Failed to find any forward convolution algorithm.

Created on 6 Jun 2018 · 9 Comments · Source: apache/incubator-mxnet

See this test failing: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/915/pipeline/

with

 src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

I have also encountered this in the wild, though very rarely.

Bug CUDA Operator

ThomasDelteil

All 9 comments

@DickJC123

marcoabreu on 6 Jun 2018

Hi @nswamy, can you add the 'CI' label to this one?

lanking520 on 7 Jun 2018

This is not CI related

marcoabreu on 8 Jun 2018

Is it possible that memory is exhausted on CI?

eric-haibin-lin on 9 Jun 2018

Also encountered this error on Windows, CUDA 9.2, cudnn 7.1.4.

Ran the following command:

    python train_imagenet.py --benchmark 1 --gpus 0 --network inception-v3 --batch-size 64 --image-shape 3,299,299 --num-epochs 1 --kv-store device

Reducing the batch size to 16 resolved the issue.

aluo-x on 19 Jun 2018

👍2

I am currently facing this issue, and reducing the batch size does not seem to fix it.

I am trying to train fast neural style transfer.

The issue seems to arise when calling mod.save_params(), which throws the following error:

mxnet.base.MXNetError: [23:59:34] src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

codewithsk on 7 Jul 2018

Update: I've managed to find a rather bizarre workaround to this issue.

I was facing this issue when trying to call model.save_checkpoint(). However, if I caught the exception and saved again in the except block, it seemed to work flawlessly:

    try:
        mod.save_checkpoint(model_save_path, epoch)
    except Exception as excep:
        print("Exception caught: ", excep)
        mod.save_checkpoint(model_save_path, epoch)

codewithsk on 9 Jul 2018

Update: Sleeping for 0.5 seconds before saving the checkpoint also seems to help.

    import time

    time.sleep(0.5)
    mod.save_checkpoint(model_save_path, epoch)
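
The two workarounds above (retrying after the exception, and pausing briefly before saving) can be combined into one small helper. This is only a minimal sketch: `call_with_retry` and its `attempts` and `delay` parameters are hypothetical names, not part of MXNet, and it does not address whatever actually causes the error.

```python
import time


def call_with_retry(fn, attempts=2, delay=0.5):
    """Call fn(); if it raises, wait `delay` seconds and try again.

    Hypothetical helper, not an MXNet API. Gives up after `attempts`
    calls and re-raises the last exception.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:  # broad on purpose, like the workaround above
            last_error = err
            print("Attempt %d failed: %s" % (attempt, err))
            if attempt < attempts:
                time.sleep(delay)  # short pause before the next try
    raise last_error


# Assumed usage with the names from the snippets above:
# call_with_retry(lambda: mod.save_checkpoint(model_save_path, epoch))
```

Catching every exception mirrors the original workaround, but it will also retry on unrelated failures, so a narrower exception type may be preferable in real code.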

codewithsk on 10 Jul 2018

@ThomasDelteil Is this still occurring on CI? If it has not appeared again, would you mind closing this?
@aluo-x @codewithsk This is usually due to a lack of GPU memory. Reducing GPU memory consumption, for example by reducing the batch size or using a smaller model, should help. If you run into more issues with this, please create a separate issue with a title like "GPU memory overflow on xxx model with yyy batch size and zzz dataset". Meanwhile, I'll look for ways to improve this error message so that it indicates the actual root cause.
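
If the cause is indeed GPU memory, the batch-size advice can be automated by halving the batch size until one training step succeeds. The sketch below is illustrative only: `find_working_batch_size` and `run_step` are hypothetical names, `run_step` stands in for whatever code runs one forward/backward pass, and the helper assumes that any exception means the batch was too large, which may not hold in practice.

```python
def find_working_batch_size(run_step, start=64, minimum=1):
    """Halve the batch size until run_step(batch_size) stops raising.

    Hypothetical helper, not an MXNet API. `run_step` is any callable
    that runs one step at the given batch size and raises on failure.
    """
    if minimum < 1:
        raise ValueError("minimum must be at least 1")
    batch_size = start
    while batch_size >= minimum:
        try:
            run_step(batch_size)
            return batch_size
        except Exception as err:  # assumption: failure means too little memory
            print("batch size %d failed: %s" % (batch_size, err))
            batch_size //= 2
    raise RuntimeError("no batch size >= %d succeeded" % minimum)


def fake_step(batch_size):
    # Stand-in for a real training step: pretend anything above 16 runs out of memory.
    if batch_size > 16:
        raise MemoryError("simulated out-of-memory")


working = find_working_batch_size(fake_step, start=64)
```

In a real training process a failed attempt may leave GPU memory allocated, so running each trial in a fresh process is likely more reliable than retrying in place.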