Hi.
SoftmaxActivation (the cuDNN version) appears to be broken in MXNet 1.2.0: calling it raises an error.
In issue #9823, chinakook suggests the bug was introduced by PR #9677.
I found that mx.nd.Softmax and mx.nd.SoftmaxActivation seem to compute the same thing; the difference is that Softmax uses a pure CUDA implementation while SoftmaxActivation uses cuDNN.
Is it necessary to merge them?
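To make the "same function" claim concrete, here is a minimal NumPy sketch of the row-wise softmax that both operators are expected to compute on 2-D input (the `softmax` helper below is mine for illustration, not MXNet's API):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis: subtract the
    # per-row max before exponentiating to avoid overflow.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
y = softmax(x)
print(y)  # each row sums to 1
```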
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '10.0.1')
('Directory :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version :', '1.2.0')
('Directory :', '/usr/local/lib/python2.7/dist-packages/mxnet')
('Commit Hash :', '5088ca9a65641ddf905b60deae00fa6006f5e431')
----------System Info----------
('Platform :', 'Linux-4.13.9-coreos-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('release :', '4.13.9-coreos')
('version :', '#1 SMP Thu Oct 26 03:21:00 UTC 2017')
Package used (Python/R/Scala/Julia):
Python
Installed by pip:
pip install mxnet-cu80 --pre
Traceback (most recent call last):
File "/home/wkcn/proj/incubator-mxnet/test_softmaxat.py", line 12, in
print (a.grad.asnumpy())
File "/usr/local/lib/python2.7/dist-packages/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
ctypes.c_size_t(data.size)))
File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [07:18:14] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM
import mxnet as mx

ctx = mx.gpu(0)
a = mx.nd.array([[1, 2, 3]], ctx=ctx)
a.attach_grad()
with mx.autograd.record():
    y = mx.nd.SoftmaxActivation(data=a)
y.backward()
mx.nd.waitall()
print(a.grad.asnumpy())
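For reference, when the cuDNN path works, the backward pass above should compute the softmax Jacobian-vector product with a head gradient of ones, which is identically zero. A NumPy sketch of that expected result (the helper names are mine, and the backward formula is the standard softmax gradient, assumed to match what the cuDNN kernel evaluates):

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(y, dy):
    # Standard softmax gradient: dL/dx = y * (dL/dy - sum_j(dL/dy_j * y_j)).
    return y * (dy - (dy * y).sum(axis=-1, keepdims=True))

x = np.array([[1.0, 2.0, 3.0]])
y = softmax(x)
# Head gradient of ones, as y.backward() uses by default.
grad = softmax_backward(y, np.ones_like(y))
print(grad)  # numerically zero: each softmax row sums to 1
```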
@zheng-da
@szha @zheng-da Thank you!
Hi, I'm also training with incubator-mxnet/example/rcnn, at commit bea5fd13c5445647a1aeedddd3c5be4406d8fb9c.
Single GPU, Windows 10.
I was able to train the Faster R-CNN network on my own dataset correctly earlier today (morning through afternoon), but now (midnight, Beijing time) it has failed several times without any modification to my code. It seems that choosing the best convolution algorithm gives unstable results, producing nearly the same error each time:
File "e:\soft\Anaconda2\lib\site-packages\mxnet\base.py", line 210, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [00:13:25] c:\jenkins\workspace\mxnet\mxnet\src\operator\nn\./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM
Luckily, after I closed Chrome (which had ~30 tabs open; does that affect the algorithm selection?) and tried again, training works.
Most helpful comment
https://github.com/apache/incubator-mxnet/pull/10918