Incubator-mxnet: RCNN example fails when using the latest MXNet

Created on 19 Feb 2018 · 26 comments · Source: apache/incubator-mxnet

I am using MXNet with CUDA 9 + cuDNN 7 and distributed training enabled. However, when I re-run the RCNN code in the example, I get the following error:

Traceback (most recent call last):
File "train_end2end.py", line 199, in
main()
File "train_end2end.py", line 196, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 158, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/----/libs/incubator-mxnet/python/mxnet/module/base_module.py", line 496, in fit
self.update_metric(eval_metric, data_batch.label)
File "/----/mx-rcnn/rcnn/core/module.py", line 227, in update_metric
self._curr_module.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/module.py", line 749, in update_metric
self._exec_group.update_metric(eval_metric, labels)
File "/----/libs/incubator-mxnet/python/mxnet/module/executor_group.py", line 616, in update_metric
eval_metric.update_dict(labels_, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 280, in update_dict
metric.update_dict(labels, preds)
File "/----/libs/incubator-mxnet/python/mxnet/metric.py", line 108, in update_dict
self.update(label, pred)
File "/----/mx-rcnn/rcnn/core/metric.py", line 51, in update
pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')
File "/----/libs/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1801, in asnumpy
ctypes.c_size_t(data.size)))
File "/----/libs/incubator-mxnet/python/mxnet/base.py", line 148, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [17:08:44] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::StackTrace()+0x3d) [0x2adc0c3395cd]
[bt] (1) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(dmlc::LogMessageFatal::~LogMessageFatal()+0x18) [0x2adc0c339a58]
[bt] (2) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x10b9) [0x2adc0f5c7669]
[bt] (3) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector > const&, std::vector > const&, std::vector > const&)+0xd4c) [0x2adc0f5c2eac]
[bt] (4) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x2adc0ec4cc40]
[bt] (5) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(+0x3284653) [0x2adc0ec54653]
[bt] (6)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock)+0x2c4) [0x2adc0ec2fcd4]
[bt] (7) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>
, std::shared_ptr const&)+0x103) [0x2adc0ec34253]
[bt] (8) /----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr)+0x3e) [0x2adc0ec3448e]
[bt] (9)/----/libs/incubator-mxnet/python/mxnet/../../lib/libmxnet.so(std::thread::_Impl)> (std::shared_ptr)> >::_M_run()+0x3b) [0x2adc0ec2e36b]

Can anyone help me with it? Thanks very much!

Labels: Bug, Example, Operator

Most helpful comment

I'm sure that the problem is caused by this line:
https://github.com/apache/incubator-mxnet/blob/7c28089749287f42ea8f41abd1358e6dbac54187/example/rcnn/rcnn/symbol/symbol_resnet.py#L187
When I changed the line to

rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")

the problem was solved. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, on the other hand, is a native implementation) has had a bug since #9677. @zheng-da

All 26 comments

I somehow found a workaround for this. Since I observed that the issue is caused by the cudnn_softmax_activation function, either disabling cuDNN or dropping the cuDNN implementation of softmax solves the problem. The error mainly surfaces when calling asnumpy() on the softmax results. Maybe someone can help track down the real cause and fix it. Thanks!
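
For reference, here is a minimal repro sketch; it is not taken from the report, and the shape and autograd-based driver are illustrative assumptions, but it exercises the same cuDNN SoftmaxActivation backward path on a GPU build:

import mxnet as mx

ctx = mx.gpu(0)
x = mx.nd.random.uniform(shape=(1, 2, 38, 50), ctx=ctx)  # NCHW, shaped roughly like rpn_cls_score_reshape
x.attach_grad()
with mx.autograd.record():
    y = mx.nd.SoftmaxActivation(x, mode='channel')
y.backward()
# MXNet executes asynchronously, so a failing cuDNN backward call only raises here,
# when the result is synchronized and copied to NumPy.
print(y.asnumpy().sum())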

I encountered this problem in RCNN too. I've tested with CUDA 8.0 and cuDNN 6.0.2 / cuDNN 7.1.2, and both fail today. However, it ran successfully on an MXNet version from two months ago.
I think there may be a bug in the MXNet backend.

@marcoabreu It's not only a bug in RCNN; it also affects mx.sym.SoftmaxOutput and mx.sym.SoftmaxActivation when their results are used in a metric, e.g. via pred.asnumpy().
It may occur in the multi-GPU case.
So I suggest reopening this issue until it's solved.

It's solved when I roll back to MXNet v1.1.0.

Thanks a lot for providing more detail! This indeed sounds like quite a serious issue. Just to clarify, does this only happen in a multi-GPU or distributed training environment?

@szha @rahul003 could you check this please?

pred_label = mx.ndarray.argmax_channel(pred).asnumpy().astype('int32')

File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/ndarray/ndarray.py", line 1826, in asnumpy
ctypes.c_size_t(data.size)))
File "/home/ABCDEFG/dev/incubator-mxnet/python/mxnet/base.py", line 149, in check_call
raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [11:18:51] src/operator/nn/./cudnn/cudnn_softmax_activation-inl.h:154: Check failed: e == CUDNN_STATUS_SUCCESS (3 vs. 0) cuDNN: CUDNN_STATUS_BAD_PARAM

Stack trace returned 10 entries:
[bt] (0) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(dmlc::StackTraceabi:cxx11+0x5b) [0x7f3943b0efab]
[bt] (1) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::op::CuDNNSoftmaxActivationOp::Backward(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType const&, mxnet::TBlob const&)+0x1bf5) [0x7f3947f52885]
[bt] (2) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::op::SoftmaxActivationGradCompute(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector > const&, std::vector > const&, std::vector > const&)+0x1e1b) [0x7f3947f4dd8b]
[bt] (3) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::exec::FComputeExecutor::Run(mxnet::RunContext, bool)+0x50) [0x7f39462912d0]
[bt] (4) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(+0x330c7f8) [0x7f39462587f8]
[bt] (5) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock)+0x8e5) [0x7f394689d2c5]
[bt] (6) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(void mxnet::engine::ThreadedEnginePerDevice::GPUWorker<(dmlc::ConcurrentQueueType)0>(mxnet::Context, bool, mxnet::engine::ThreadedEnginePerDevice::ThreadWorkerBlock<(dmlc::ConcurrentQueueType)0>
, std::shared_ptr const&)+0xeb) [0x7f39468b2e4b]
[bt] (7) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::_Function_handler), mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::{lambda()#3}::operator()() const::{lambda(std::shared_ptr)#1}>::_M_invoke(std::_Any_data const&, std::shared_ptr&&)+0x4e) [0x7f39468b30ae]
[bt] (8) /home/ABCDEFG/dev/incubator-mxnet/lib/libmxnet.so(std::thread::_Impl)> (std::shared_ptr)> >::_M_run()+0x4a) [0x7f39468acf5a]
[bt] (9) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xb8c80) [0x7f3975ab5c80]

@marcoabreu It also happened occasionally on a single GPU.

I'm sure that the problem is caused by this line:
https://github.com/apache/incubator-mxnet/blob/7c28089749287f42ea8f41abd1358e6dbac54187/example/rcnn/rcnn/symbol/symbol_resnet.py#L187
When I changed the line to

rpn_cls_prob = mx.symbol.softmax(data=rpn_cls_score_reshape, axis=1, name="rpn_cls_prob")

the problem was solved. So I'm sure the mx.symbol.SoftmaxActivation operator (which depends on cuDNN; the mx.symbol.softmax operator, on the other hand, is a native implementation) has had a bug since #9677. @zheng-da
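
As a quick sanity check (a sketch added for illustration, not from the thread), the native operator with axis=1 produces the same values as the cuDNN-backed channel-mode activation, so the swap above is a drop-in replacement:

import mxnet as mx
import numpy as np

x = mx.nd.random.uniform(shape=(2, 4, 8, 8))    # arbitrary NCHW tensor on CPU
a = mx.nd.SoftmaxActivation(x, mode='channel')  # cuDNN-backed on GPU builds
b = mx.nd.softmax(x, axis=1)                    # native implementation
np.testing.assert_allclose(a.asnumpy(), b.asnumpy(), rtol=1e-5, atol=1e-6)
print('softmax(axis=1) matches SoftmaxActivation(mode="channel")')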

It happened occasionally on a single GPU on my machine with MXNet 1.2.0.

It seems the reason is that a cuDNN call fails.

When the compile options include USE_CUDNN=1, the softmax activation operator uses the cuDNN softmax.

And the convolution operator prints this log:
src/operator/nn/convolution.cu:140: This convolution is not supported by cudnn, MXNET convolution is applied. The old CUDNN doesn't support dilated conv.

The Softmax operator doesn't use cuDNN, so it doesn't cause any error.

Solutions:
1. Compile MXNet against the latest cuDNN.
2. Replace mx.sym.SoftmaxActivation (cuDNN) with mx.sym.softmax (pure CUDA) if your cuDNN doesn't support SoftmaxActivation; the equivalence check above shows the two operators compute the same values.

I had the same problem. I have just installed the MXNet GPU build, version 1.1.0.
Before, I had mxnet-cu80 1.2.0.

And it worked.

@Ram124 The latest MXNet has fixed the bug.
https://github.com/apache/incubator-mxnet/pull/10918

@wkcn Oh cool, I will check that.

I have my custom dataset in Pascal VOC format.
What changes need to be made to get started with training?
I have 2 classes (pedestrian + bicycle), and I need to classify them within a single image.
I have changed pascal.py, updating the class names and numbers.

Is there anything else that I need to change?

Has anybody done training on their own dataset? Please help me out.

@Ram124
You also need to change num_classes in config.py.

@wkcn
I should make it 3, right, including the background? Currently it is:
config.NUM_CLASSES = 21

And where should I specify my dataset? I have my dataset in
./data/my_own_data/
Annotations
Imagesets
Images

In which files should I give this path so that my custom data can be read?

The num_classes includes background, so 3 is right.
For the dataset path, you could check config.py, pascal_voc.py and pascal_voc_eval.py
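
A rough sketch of the edits being discussed; attribute and file names vary between mx-rcnn forks, so the lines below (the classes list, self.data_path, and the dataset layout) are assumptions to adapt rather than exact code:

# rcnn/config.py -- the background counts as a class
config.NUM_CLASSES = 3  # __background__ + pedestrian + bicycle

# rcnn/dataset/pascal_voc.py (or your pascal.py) -- class list used by the imdb;
# the attribute names here are assumptions and may differ in your fork
self.classes = ['__background__', 'pedestrian', 'bicycle']
self.num_classes = len(self.classes)

# dataset root, laid out as ./data/my_own_data/{Annotations, Imagesets, Images}
self.data_path = './data/my_own_data'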

@wkcn
Thank you.
When I run demo.py, I get the following. What does this mean?

(mxnet_p27) ubuntu@ip-172-31-10-202:~/mx-rcnn-1$ python demo.py --prefix model/vgg16 --epoch 0 --image myimage.jpg --gpu 0 --vis
Traceback (most recent call last):
File "demo.py", line 143, in
main()
File "demo.py", line 138, in main
predictor = get_net(symbol, args.prefix, args.epoch, ctx)
File "demo.py", line 49, in get_net
assert k in arg_params, k + ' not initialized'
AssertionError: rpn_conv_3x3_weight not initialized

It seems that you loaded a pretrained classification model rather than a detection model.
It has no rpn_conv parameters.
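
One way to confirm this is to load the checkpoint and list its parameter names (a hedged sketch; 'model/vgg16' and epoch 0 are just the demo's arguments). An ImageNet-pretrained VGG16 has no rpn_* weights, while a trained detector does:

import mxnet as mx

# Load the checkpoint that demo.py points at and list its RPN parameters.
sym, arg_params, aux_params = mx.model.load_checkpoint('model/vgg16', 0)
rpn_keys = sorted(k for k in arg_params if k.startswith('rpn'))
print(rpn_keys if rpn_keys else 'no rpn_* parameters: a classification checkpoint, not a detector')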

I solved that problem. Thank you for that.

I am trying to train on my own dataset.
I have changed the number of classes in config.py.

I have modified pascal.py (changed the class names; I have only 2 classes + 1 background).

But now I am getting this error. What is the problem, @wkcn?

INFO:root:voc_radar_train append flipped images to roidb
Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 39, in train_net
for image_set in image_sets]
File "/home/ubuntu/mx-rcnn-1/rcnn/utils/load_data.py", line 13, in load_gt_roidb
roidb = imdb.append_flipped_images(roidb)
File "/home/ubuntu/mx-rcnn-1/rcnn/dataset/imdb.py", line 168, in append_flipped_images
assert (boxes[:, 2] >= boxes[:, 0]).all()
AssertionError

It seems the dataset is wrong.
The box coordinates xmin, ymin, xmax, ymax should start at 1.
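
A small, hypothetical sanity check (not part of mx-rcnn) that catches such annotations before append_flipped_images asserts:

import numpy as np

def check_voc_boxes(boxes, width, height):
    """boxes: (N, 4) array of [xmin, ymin, xmax, ymax] as read from the XML (1-based VOC coordinates)."""
    boxes = np.asarray(boxes, dtype=np.float32)
    assert (boxes[:, :2] >= 1).all(), 'coordinates should start at 1, not 0'
    assert (boxes[:, 2] >= boxes[:, 0]).all() and (boxes[:, 3] >= boxes[:, 1]).all(), \
        'xmax/ymax must not be smaller than xmin/ymin'
    assert (boxes[:, 2] <= width).all() and (boxes[:, 3] <= height).all(), \
        'boxes must stay inside the image'

# A box whose xmin were 0 would trip the first assertion; this one passes.
check_voc_boxes([[1, 10, 50, 60]], width=640, height=480)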

MATLAB starts at 1.

Yeah, I corrected it.

Now I'm getting this error after 1 epoch. How do I solve it? Is it related to the MXNet version?

Traceback (most recent call last):
File "train_end2end.py", line 178, in
main()
File "train_end2end.py", line 175, in main
lr=args.lr, lr_step=args.lr_step)
File "train_end2end.py", line 137, in train_net
arg_params=arg_params, aux_params=aux_params, begin_epoch=begin_epoch, num_epoch=end_epoch)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 517, in fit
self.set_params(arg_params, aux_params)
File "/home/ubuntu/anaconda3/envs/mxnet_p27/lib/python2.7/site-packages/mxnet/module/base_module.py", line 652, in set_params
allow_extra=allow_extra)
TypeError: init_params() got an unexpected keyword argument 'allow_extra'

@wkcn @chinakook @ysfalo @ijkguo I have a question regarding batch size. Can we use a batch size of more than 1 in mxnet-rcnn training?
I have a large dataset of 15,000 images.
When I train on them, the speed is 2.35 samples/sec, so it takes almost 4 hours per epoch.
Is there any other way I could increase the speed?

Any help is really appreciated.

So the original issue has been fixed in https://github.com/apache/incubator-mxnet/pull/10918.

As for the unexpected kwarg 'allow_extra' and multi-batch training, both are solved in https://github.com/apache/incubator-mxnet/pull/11373.
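
For older checkouts that cannot pick up that fix, here is a hedged workaround sketch; it assumes the example's custom module is rcnn.core.module.MutableModule, whose init_params predates the allow_extra keyword that newer MXNet passes through set_params:

from rcnn.core.module import MutableModule  # assumption: the example's custom module class

_orig_init_params = MutableModule.init_params

def _patched_init_params(self, *args, **kwargs):
    # Newer MXNet forwards allow_extra from set_params(); the old module does not accept it.
    kwargs.pop('allow_extra', None)
    return _orig_init_params(self, *args, **kwargs)

MutableModule.init_params = _patched_init_params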

@Ram124 If you need a large batch_size, please use SNIPER. It can be trained with a large batch_size.

Closing this after merging #11373; feel free to ping me to reopen if it's not fixed.
