Incubator-mxnet: (Nvidia GTX 1080) Runtime error for CUDA 8.0 with cudnn-8.0-v5.0-ga

Created on 3 Jun 2016 · 4 comments · Source: apache/incubator-mxnet

When I compile mxnet with CUDA 8.0 but without cuDNN, it works. But when I use the latest cuDNN version, libcudnn.so.5.0.5, I get the following error from batch normalization:

[13:38:34] /home/yey1/Work/mxnet/dmlc-core/include/dmlc/logging.h:235: [13:38:34] src/operator/./cudnn_batch_norm-inl.h:138: Check failed: (cudnnBatchNormalizationForwardInference(s->dnn_handle_, CUDNN_BATCHNORM_SPATIAL, &a, &b, io_desc_, x.dptr_, io_desc_, y.dptr_, mean_desc_, gamma.dptr_, beta.dptr_, moving_mean.dptr_, moving_inv_var.dptr_, param_.eps)) == (CUDNN_STATUS_SUCCESS) 
[13:38:34] /home/yey1/Work/mxnet/dmlc-core/include/dmlc/logging.h:235: [13:38:34] src/engine/./threaded_engine.h:306: [13:38:34] src/operator/./cudnn_batch_norm-inl.h:138: Check failed: (cudnnBatchNormalizationForwardInference(s->dnn_handle_, CUDNN_BATCHNORM_SPATIAL, &a, &b, io_desc_, x.dptr_, io_desc_, y.dptr_, mean_desc_, gamma.dptr_, beta.dptr_, moving_mean.dptr_, moving_inv_var.dptr_, param_.eps)) == (CUDNN_STATUS_SUCCESS) 
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPEto NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.
terminate called after throwing an instance of 'dmlc::Error'
  what():  [13:38:34] src/engine/./threaded_engine.h:306: [13:38:34] src/operator/./cudnn_batch_norm-inl.h:138: Check failed: (cudnnBatchNormalizationForwardInference(s->dnn_handle_, CUDNN_BATCHNORM_SPATIAL, &a, &b, io_desc_, x.dptr_, io_desc_, y.dptr_, mean_desc_, gamma.dptr_, beta.dptr_, moving_mean.dptr_, moving_inv_var.dptr_, param_.eps)) == (CUDNN_STATUS_SUCCESS) 
An fatal error occurred in asynchronous engine operation. If you do not know what caused this error, you can try set environment variable MXNET_ENGINE_TYPEto NaiveEngine and run with debugger (i.e. gdb). This will force all operations to be synchronous and backtrace will give you the series of calls that lead to this error. Remember to set MXNET_ENGINE_TYPE back to empty after debugging.

All 4 comments

I am also using CUDA 8 + cuDNN v5, though not a 1080. Sometimes cuDNN v5 crashes with a wrong eps parameter. You may try eps = [1e-3, 1e-4, 1e-5, 1e-6] to see whether this is the issue.
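A minimal way to sweep those eps values is to bind a BatchNorm-only symbol on the GPU and run one inference pass per candidate. This is only a sketch, assuming a build with USE_CUDA=1 and USE_CUDNN=1; the shapes and names are placeholders, not from the original report:

```python
import os
# Run the engine synchronously so a failing eps surfaces at the call site
# instead of aborting from the asynchronous engine (see the advice in the log).
os.environ['MXNET_ENGINE_TYPE'] = 'NaiveEngine'

import mxnet as mx

shape = (1, 3, 32, 32)  # arbitrary NCHW shape, just to exercise the operator
for eps in [1e-3, 1e-4, 1e-5, 1e-6]:
    data = mx.sym.Variable('data')
    bn = mx.sym.BatchNorm(data=data, eps=eps, name='bn')
    exe = bn.simple_bind(ctx=mx.gpu(0), data=shape)
    exe.forward(is_train=False, data=mx.nd.ones(shape, ctx=mx.gpu(0)))
    exe.outputs[0].wait_to_read()  # blocks until the cuDNN call has actually run
    print('eps = %g ok' % eps)
```

Remember to unset MXNET_ENGINE_TYPE afterwards, as the log message suggests.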

@antinucleon
I just tried a Titan X with the same settings, and it throws the same errors.
BTW, what do you mean by eps?
Thank you so much.

@antinucleon
Thank you so much!!
1e-3 and 1e-4 work; 1e-5 and 1e-6 do not.
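For reference, cuDNN enforces a lower bound on this parameter: cudnn.h defines CUDNN_BN_MIN_EPSILON (1e-5 in cuDNN v5), and the batch-normalization entry points reject smaller values, which would be consistent with only the larger eps values working here. If that is indeed the cause, a small guard before building the symbol avoids tripping it (hypothetical helper, not part of mxnet):

```python
CUDNN_BN_MIN_EPSILON = 1e-5  # lower bound from cudnn.h (cuDNN v5); assumption, check your header

def safe_bn_eps(eps, margin=1e-6):
    """Clamp a requested BatchNorm eps to a value cuDNN should accept.

    The small margin guards against a requested 1e-5 possibly landing just
    below the bound after float conversion.
    """
    return max(eps, CUDNN_BN_MIN_EPSILON + margin)
```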
