Incubator-mxnet: Failed to find any forward convolution algorithm.

Created on 6 Jun 2018 · 9 comments · Source: apache/incubator-mxnet

See this test failing: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/915/pipeline/

with

 src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

I have also encountered this in the wild, though very rarely.

Labels: Bug, CUDA, Operator

All 9 comments

@DickJC123

Hi @nswamy, can you add the 'CI' label to this one?

This is not CI-related.

Is it possible that memory is exhausted on CI?

Also encountered this error on Windows with CUDA 9.2 and cuDNN 7.1.4.

Ran the following command:

    python train_imagenet.py --benchmark 1 --gpus 0 --network inception-v3 --batch-size 64 --image-shape 3,299,299 --num-epochs 1 --kv-store device

Reducing the batch size to 16 resolved the issue.
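
For reference, that is the same benchmark command with only the --batch-size flag lowered, everything else unchanged:

    python train_imagenet.py --benchmark 1 --gpus 0 --network inception-v3 --batch-size 16 --image-shape 3,299,299 --num-epochs 1 --kv-store device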

Currently facing this issue; reducing the batch size does not seem to fix it.

Trying to train a fast neural style transfer model.

The issue seems to arise when calling mod.save_params(), which throws the following error:

mxnet.base.MXNetError: [23:59:34] src/operator/nn/./cudnn/cudnn_convolution-inl.h:744: Failed to find any forward convolution algorithm.

Update: I've managed to find a rather bizarre workaround to this issue.

I was facing this issue when calling mod.save_checkpoint(). However, if I caught the exception and retried the save in the except block, it seemed to work flawlessly:

    try:
        mod.save_checkpoint(model_save_path, epoch)
    except Exception as excep:
        # The first save fails with the cuDNN convolution error above;
        # retrying the same call inside the except block succeeds.
        print("Exception caught: ", excep)
        mod.save_checkpoint(model_save_path, epoch)

Update: Sleeping for 0.5 seconds before saving the checkpoint also seems to help.

    import time

    time.sleep(0.5)  # sleeping briefly before saving seems to avoid the error
    mod.save_checkpoint(model_save_path, epoch)
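
If the sleep helps because asynchronous GPU work is still in flight when the checkpoint is written (an assumption, not something confirmed in this thread), MXNet's mx.nd.waitall() would be a more deterministic way to wait than a fixed delay:

    import mxnet as mx

    mx.nd.waitall()  # block until all pending asynchronous operations have completed
    mod.save_checkpoint(model_save_path, epoch)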

@ThomasDelteil Is this still occurring on CI? If it's not appearing again, would you mind closing this?
@aluo-x @codewithsk Usually this is due to a lack of GPU memory; reducing GPU memory consumption, for example by lowering the batch size or using a smaller model, should help. If you run into further problems, please create a separate issue with a title like "GPU memory overflow on xxx model with yyy batch size and zzz dataset". Meanwhile, I'll look for ways to improve this error message so that it points to the actual root cause.
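
As an illustration of the advice above, here is a minimal sketch (a hypothetical helper, not part of MXNet or this thread) of how one could probe for a batch size that fits in GPU memory by running a throwaway forward pass and catching mxnet.base.MXNetError, the exception type shown in the traceback above:

    import mxnet as mx
    from mxnet.base import MXNetError

    def largest_workable_batch(sym, data_shape, start=64, ctx=mx.gpu(0)):
        """Halve the batch size until a forward pass succeeds; return 0 if none fits."""
        batch = start
        while batch >= 1:
            try:
                mod = mx.mod.Module(symbol=sym, data_names=['data'],
                                    label_names=None, context=ctx)
                mod.bind(data_shapes=[('data', (batch,) + data_shape)],
                         for_training=False)
                mod.init_params()
                data = mx.nd.zeros((batch,) + data_shape, ctx=ctx)
                mod.forward(mx.io.DataBatch([data]), is_train=False)
                mx.nd.waitall()  # force execution so deferred cuDNN errors surface here
                return batch
            except MXNetError:
                batch //= 2  # likely out of GPU memory; retry with half the batch
        return 0

Called with, for example, the inception-v3 symbol and data_shape=(3, 299, 299), this would report the largest batch size in the halving sequence starting at 64 that completes a forward pass without exhausting GPU memory.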
