Incubator-mxnet: RCNN example fails when using the latest MXNet

Created on 19 Feb 2018 · 26 comments · Source: apache/incubator-mxnet

I am using MXNet with CUDA 9 + cuDNN 7 and distributed training enabled. However, when I re-run the RCNN code in the example, I get the following error:

Traceback (most recent call last):
File "train_end2end.py", line 199, in
main()
File "train_end2end.py", line 196, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 158, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
self.update_metric(eval_metric, data_batch.label)
File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
self._curr_module.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
self._exec_group.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
eval_metric.update_dict(labels_, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
metric.update_dict(labels, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
self.update(label, pred)
File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
ctypes.c_size_t(data.size)))
File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
[bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
[bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
[bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector > const&, std::vector > const&, std::vector > const&)+0xd4c) [0x2adc0f5c2eac]
[bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
[bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
[bt] (6)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock)+0x2c4) [0x2adc0ec2fcd4]
[bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>
, std::shared_ptr const&)+0x103) [0x2adc0ec34253]
[bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr)+0x3e) [0x2adc0ec3448e]
[bt] (9)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl)> (std::shared_ptr)> >::_M_run()+0x3b) [0x2adc0ec2e36b]

Can anyone help me with it? Thanks very much!

Labels: Bug, Example, Operator

Most helpful comment

I'm sure that the problem is caused by this line:
https://github.com/apache/incubator-mxnet/blob/7c28089749287f42ea8f41abd1358e6dbac54187/example/rcnn/rcnn/symbol/symbol_resnet.py#L187
When I changed the line to

rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")

the problem was solved. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, on the other hand, is a native implementation) has had a bug since #9677. @zheng-da

All 26 comments

I somehow found a workaround for this. Since I observed that the issue is caused by the cudnn_softmax_activation function, either disabling cuDNN or dropping the cuDNN implementation of softmax solves the problem. The error mainly surfaces when calling asnumpy() on the softmax results. Maybe someone can help track down the real cause and fix it. Thanks!
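
For reference, here is a minimal repro sketch; it is not taken from the report, and the shape and autograd-based driver are illustrative assumptions, but it exercises the same cuDNN SoftmaxActivation backward path on a GPU build:

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.random.uniform(shape=(1, 2, 38, 50), ctx=ctx)  # NCHW, shaped roughly like rpn_cls_score_reshape
x.attach_grad()
with mx.autograd.record():
    y = mx.nd.SoftmaxActivation(x, mode='channel')
y.backward()
# MXNet executes asynchronously, so a failing cuDNN backward call only raises here,
# when the result is synchronized and copied to NumPy.
print(y.asnumpy().sum())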

I encountered this problem in RCNN too. I've tested with CUDA 8.0 and cuDNN 6.0.2 / cuDNN 7.1.2, and both fail today. However, it ran successfully on an MXNet version from two months ago.
I think there may be a bug in the MXNet backend.

@marcoabreu It's not only a bug in RCNN; it also affects mx.sym.SoftmaxOutput and mx.sym.SoftmaxActivation when their results are used in a metric, e.g. via pred.asnumpy().
It may occur in the multi-GPU case.
So I suggest reopening this issue until it's solved.

It's solved when I roll back to MXNet v1.1.0.

Thanks a lot for providing more detail! This indeed sounds like quite a serious issue. Just to clarify, does this only happen in a multi-GPU or distributed training environment?

@szha @rahul003 could you check this please?

pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')

File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:18:51] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5b) [0x7f3943b0efab]
[bt] (1) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x1bf5) [0x7f3947f52885]
[bt] (2) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector > const&, std::vector > const&, std::vector > const&)+0x1e1b) [0x7f3947f4dd8b]
[bt] (3) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f39462912d0]
[bt] (4) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(+0x330c7f8) [0x7f39462587f8]
[bt] (5) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock)+0x8e5) [0x7f394689d2c5]
[bt] (6) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>
, std::shared_ptr const&)+0xeb) [0x7f39468b2e4b]
[bt] (7) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&)+0x4e) [0x7f39468b30ae]
[bt] (8) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::thread::_Impl)> (std::shared_ptr)> >::_M_run()+0x4a) [0x7f39468acf5a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f3975ab5c80]

@marcoabreu It also happened occasionally on a single GPU.

I'm sure that the problem is caused by this line:
https://github.com/apache/incubator-mxnet/blob/7c28089749287f42ea8f41abd1358e6dbac54187/example/rcnn/rcnn/symbol/symbol_resnet.py#L187
When I changed the line to

rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")

the problem was solved. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, on the other hand, is a native implementation) has had a bug since #9677. @zheng-da
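
As a quick sanity check (a sketch added for illustration, not from the thread), the native operator with axis=1 produces the same values as the cuDNN-backed channel-mode activation, so the swap above is a drop-in replacement:

import mxnet as mx
import numpy as np

x = mx.nd.random.uniform(shape=(2, 4, 8, 8))    # arbitrary NCHW tensor on CPU
a = mx.nd.SoftmaxActivation(x, mode='channel')  # cuDNN-backed on GPU builds
b = mx.nd.softmax(x, axis=1)                    # native implementation
np.testing.assert_allclose(a.asnumpy(), b.asnumpy(), rtol=1e-5, atol=1e-6)
print('softmax(axis=1) matches SoftmaxActivation(mode="channel")')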

It happened occasionally on a single GPU on my machine with MXNet 1.2.0.

It seems the reason is that a cuDNN call fails.

When the compile options include USE_CUDNN=1, the softmax activation operator uses the cuDNN softmax.

And the convolution operator prints this log:
src/operator/nn/convolution.cu:140: This convolution is not supported by cudnn, MXNET convolution is applied. The old CUDNN doesn't support dilated conv.

The Softmax operator doesn't use cuDNN, so it doesn't cause any error.

Solutions:
1. Compile MXNet against the latest cuDNN.
2. Replace mx.sym.SoftmaxActivation (cuDNN) with mx.sym.softmax (pure CUDA) if your cuDNN doesn't support SoftmaxActivation; the equivalence check above shows the two operators compute the same values.

I had the same problem. I have just installed the MXNet GPU build, version 1.1.0.
Before, I had mxnet-cu80 1.2.0.

And it worked.

@Ram124 The latest MXNet has fixed the bug.
https://github.com/apache/incubator-mxnet/pull/10918

@wkcn Oh cool, I will check that.

I have my custom dataset in Pascal VOC format.
What changes need to be made to get started with training?
I have 2 classes (pedestrian + bicycle), and I need to classify them within a single image.
I have changed pascal.py, updating the class names and numbers.

Is there anything else that I need to change?

Has anybody done training on their own dataset? Please help me out.

@Ram124
You also need to change num_classes in config.py.

@wkcn
I should make it 3, right, including the background? Currently it is:
config.NUM_CLASSES = 21

And where should I specify my dataset? I have my dataset in
./data/my_own_data/
Annotations
Imagesets
Images

In which files should I give this path so that my custom data can be read?

The num_classes includes background, so 3 is right.
For the dataset path, you could check config.py, pascal_voc.py and pascal_voc_eval.py
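
A rough sketch of the edits being discussed; attribute and file names vary between mx-rcnn forks, so the lines below (the classes list, self.data_path, and the dataset layout) are assumptions to adapt rather than exact code:

# rcnn/config.py -- the background counts as a class
config.NUM_CLASSES = 3  # __background__ + pedestrian + bicycle

# rcnn/dataset/pascal_voc.py (or your pascal.py) -- class list used by the imdb;
# the attribute names here are assumptions and may differ in your fork
self.classes = ['__background__', 'pedestrian', 'bicycle']
self.num_classes = len(self.classes)

# dataset root, laid out as ./data/my_own_data/{Annotations, Imagesets, Images}
self.data_path = './data/my_own_data'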

@wkcn
Thank you.
When I run demo.py, I get the following. What does this mean?

(mxnet_p27) ubuntu@ip-172-31-10-202:~/mx-rcnn-1$ python demo.py --prefix model/vgg16 --epoch 0 --image myimage.jpg --gpu 0 --vis
Traceback (most recent call last):
File "demo.py", line 143, in
main()
File "demo.py", line 138, in main
predictor = get_net(symbol, args.prefix, args.epoch, ctx)
File "demo.py", line 49, in get_net
assert k in arg_params, k + ' not initialized'
AssertionError: rpn_conv_3x3_weight not initialized

It seems that you loaded a pretrained classification model rather than a detection model.
It has no rpn_conv parameters.
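
One way to confirm this is to load the checkpoint and list its parameter names (a hedged sketch; 'model/vgg16' and epoch 0 are just the demo's arguments). An ImageNet-pretrained VGG16 has no rpn_* weights, while a trained detector does:

import mxnet as mx

# Load the checkpoint that demo.py points at and list its RPN parameters.
sym, arg_params, aux_params = mx.model.load_checkpoint('model/vgg16', 0)
rpn_keys = sorted(k for k in arg_params if k.startswith('rpn'))
print(rpn_keys if rpn_keys else 'no rpn_* parameters: a classification checkpoint, not a detector')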

I solved that problem. Thank you for that.

I am trying to train on my own dataset.
I have changed the number of classes in config.py.

I have modified pascal.py (changed the class names; I have only 2 classes + 1 background).

But now I am getting this error. What is the problem, @wkcn?

INFO:root:voc_radar_train append flipped images to roidb
Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 39, in train_net
for image_set in image_sets]
File "/home/ubuntu/mx-rcnn-1/rcnn/utils/load_data.py", line 13, in load_gt_roidb
roidb = imdb.append_flipped_images(roidb)
File "/home/ubuntu/mx-rcnn-1/rcnn/dataset/imdb.py", line 168, in append_flipped_images
assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError

It seems the dataset is wrong.
The box coordinates xmin, ymin, xmax, ymax should start at 1.
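
A small, hypothetical sanity check (not part of mx-rcnn) that catches such annotations before append_flipped_images asserts:

import numpy as np

def check_voc_boxes(boxes, width, height):
    """boxes: (N, 4) array of [xmin, ymin, xmax, ymax] as read from the XML (1-based VOC coordinates)."""
    boxes = np.asarray(boxes, dtype=np.float32)
    assert (boxes[:, :2] >= 1).all(), 'coordinates should start at 1, not 0'
    assert (boxes[:, 2] >= boxes[:, 0]).all() and (boxes[:, 3] >= boxes[:, 1]).all(), \
        'xmax/ymax must not be smaller than xmin/ymin'
    assert (boxes[:, 2] <= width).all() and (boxes[:, 3] <= height).all(), \
        'boxes must stay inside the image'

# A box whose xmin were 0 would trip the first assertion; this one passes.
check_voc_boxes([[1, 10, 50, 60]], width=640, height=480)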

MATLAB starts at 1.

Yeah, I corrected it.

Now I'm getting this error after 1 epoch. How do I solve it? Is it related to the MXNet version?

Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 137, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 517, in fit
self.set_params(arg_params, aux_params)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 652, in set_params
allow_extra=allow_extra)
TypeError: init_params() got an unexpected keyword argument 'allow_extra'

@wkcn @chinakook @ysfalo @ijkguo I have a question regarding batch size. Can we use a batch size of more than 1 in mxnet-rcnn training?
I have a large dataset of 15,000 images.
When I train on them, the speed is 2.35 samples/sec, so it takes almost 4 hours per epoch.
Is there any other way I could increase the speed?

Any help is really appreciated.

So the original issue has been fixed in https://github.com/apache/incubator-mxnet/pull/10918.

As for the unexpected kwarg 'allow_extra' and multi-batch training, both are solved in https://github.com/apache/incubator-mxnet/pull/11373.
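
For older checkouts that cannot pick up that fix, here is a hedged workaround sketch; it assumes the example's custom module is rcnn.core.module.MutableModule, whose init_params predates the allow_extra keyword that newer MXNet passes through set_params:

from rcnn.core.module import MutableModule  # assumption: the example's custom module class

_orig_init_params = MutableModule.init_params

def _patched_init_params(self, *args, **kwargs):
    # Newer MXNet forwards allow_extra from set_params(); the old module does not accept it.
    kwargs.pop('allow_extra', None)
    return _orig_init_params(self, *args, **kwargs)

MutableModule.init_params = _patched_init_params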

@Ram124 If you need a large batch_size, please use SNIPER. It can be trained with a large batch_size.

Closing this after merging #11373; feel free to ping me to reopen if it's not fixed.
