Hi.
SoftmaxActivation (the cuDNN version) appears to be broken in MXNet 1.2.0: calling it raises an error.
In issue #9823, chinakook suggests the bug was introduced by PR #9677.
I found that mx.nd.Softmax and mx.nd.SoftmaxActivation seem to compute the same thing; the difference is that Softmax uses a pure CUDA implementation while SoftmaxActivation uses cuDNN.
Is it necessary to merge them?
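To make the "same function" claim concrete, here is a minimal NumPy sketch of the row-wise softmax that both operators are expected to compute on 2-D input (the `softmax` helper below is mine for illustration, not MXNet's API):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis: subtract the
    # per-row max before exponentiating to avoid overflow.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
y = softmax(x)
print(y)  # each row sums to 1
```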
----------Python Info----------
('Version :', '2.7.12')
('Compiler :', 'GCC 5.4.0 20160609')
('Build :', ('default', 'Dec 4 2017 14:50:18'))
('Arch :', ('64bit', 'ELF'))
------------Pip Info-----------
('Version :', '10.0.1')
('Directory :', '/usr/local/lib/python2.7/dist-packages/pip')
----------MXNet Info-----------
('Version :', '1.2.0')
('Directory :', '/usr/local/lib/python2.7/dist-packages/mxnet')
('Commit Hash :', '5088ca9a65641ddf905b60deae00fa6006f5e431')
----------System Info----------
('Platform :', 'Linux-4.13.9-coreos-x86_64-with-Ubuntu-16.04-xenial')
('system :', 'Linux')
('release :', '4.13.9-coreos')
('version :', '#1 SMP Thu Oct 26 03:21:00 UTC 2017')
Package used (Python/R/Scala/Julia):
Python
Installed by pip:
pip install mxnet-cu80 --pre
Traceback (most recent call last):
File "/home/wkcn/proj/incubator-mxnet/test_softmaxat.py", line 12, in
print (a.grad.asnumpy())
File "/usr/local/lib/python2.7/dist-packages/mxnet/ndarray/ndarray.py", line 1894, in asnumpy
ctypes.c_size_t(data.size)))
File "/usr/local/lib/python2.7/dist-packages/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [07:18:14] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM
import mxnet as mx

ctx = mx.gpu(0)
a = mx.nd.array([[1, 2, 3]], ctx=ctx)
a.attach_grad()
with mx.autograd.record():
    y = mx.nd.SoftmaxActivation(data=a)
y.backward()
mx.nd.waitall()
print(a.grad.asnumpy())
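For reference, when the cuDNN path works, the backward pass above should compute the softmax Jacobian-vector product with a head gradient of ones, which is identically zero. A NumPy sketch of that expected result (the helper names are mine, and the backward formula is the standard softmax gradient, assumed to match what the cuDNN kernel evaluates):

```python
import numpy as np

def softmax(x):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(y, dy):
    # Standard softmax gradient: dL/dx = y * (dL/dy - sum_j(dL/dy_j * y_j)).
    return y * (dy - (dy * y).sum(axis=-1, keepdims=True))

x = np.array([[1.0, 2.0, 3.0]])
y = softmax(x)
# Head gradient of ones, as y.backward() uses by default.
grad = softmax_backward(y, np.ones_like(y))
print(grad)  # numerically zero: each softmax row sums to 1
```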
@zheng-da
@szha @zheng-da Thank you!
Hi, I'm also training with incubator-mxnet/example/rcnn, at commit bea5fd13c5445647a1aeedddd3c5be4406d8fb9c.
Single GPU, Windows 10.
I was able to train the Faster R-CNN network on my own dataset correctly earlier today (morning through afternoon), but now (midnight, Beijing time) it has failed several times without any modification to my code. It seems that choosing the best convolution algorithm gives unstable results, producing nearly the same error each time:
File "e:\soft\Anaconda2\lib\site-packages\mxnet\base.py", line 210, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [00:13:25] c:\jenkins\workspace\mxnet\mxnet\src\operator\nn\./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM
Luckily, after I closed Chrome (which had ~30 tabs open; does that affect the algorithm selection?) and tried again, training works.
Most helpful comment
https://github.com/apache/incubator-mxnet/pull/10918