🐛 Bug
Validation step on a custom dataset fails because num_classes is wrongly initialized here:
https://github.com/open-mmlab/mmdetection/blob/master/mmdet/core/evaluation/mean_ap.py#L342
To Reproduce
- Follow steps in Custom Dataset preparation
- Distributed training on a single GPU (Link to model config gist)
./tools/dist_train.sh configs/my_config_faster_rcnn_r101_fpn_1x.py 1 --validate
Error message (during the validate step after the first epoch):
Traceback (most recent call last):
File "./tools/train.py", line 90, in <module>
main()
File "./tools/train.py", line 86, in main
logger=logger)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmdet-0.6.0+11e9c74-py3.6.egg/mmdet/apis/train.py", line 58, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmdet-0.6.0+11e9c74-py3.6.egg/mmdet/apis/train.py", line 99, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 272, in train
self.call_hook('after_train_epoch')
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmcv/runner/runner.py", line 229, in call_hook
getattr(hook, fn_name)(self)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmdet-0.6.0+11e9c74-py3.6.egg/mmdet/core/evaluation/eval_hooks.py", line 65, in after_train_epoch
self.evaluate(runner, results)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmdet-0.6.0+11e9c74-py3.6.egg/mmdet/core/evaluation/eval_hooks.py", line 110, in evaluate
print_summary=True)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmdet-0.6.0+11e9c74-py3.6.egg/mmdet/core/evaluation/mean_ap.py", line 327, in eval_map
print_map_summary(mean_ap, eval_results, dataset)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/mmdet-0.6.0+11e9c74-py3.6.egg/mmdet/core/evaluation/mean_ap.py", line 370, in print_map_summary
label_names[j], num_gts[i, j], results[j]['num_dets'],
IndexError: tuple index out of range
Traceback (most recent call last):
File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/ubuntu/anaconda3/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 235, in <module>
main()
File "/home/ubuntu/anaconda3/lib/python3.6/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
Expected Behavior
Printing the AP table:
| class | gts | dets | recall | precision | ap |
+----------+-----+------+--------+-----------+-------+
| class_1 | 110 | 781 | 0.091 | 0.013 | 0.002 |
| class_2 | 125 | 891 | 0.208 | 0.029 | 0.009 |
| class_3 | 98 | 1446 | 0.316 | 0.021 | 0.008 |
| class_4 | 0 | 0 | 0.000 | 0.000 | 0.000 |
| class_5 | 118 | 1578 | 0.339 | 0.025 | 0.020 |
| class_6 | 0 | 0 | 0.000 | 0.000 | 0.000 |
+----------+-----+------+--------+-----------+-------+
| mAP | | | | | 0.019 |
+----------+-----+------+--------+-----------+-------+
Moving line 342 to after label_names is initialized resolved the issue:
num_classes = len(label_names)
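In other words, the reordering looks roughly like this (an illustrative sketch, not the verbatim print_map_summary body in mean_ap.py; default_num_classes is a name assumed here for the pre-existing fallback):

import mmcv
from mmdet.core.evaluation.class_names import get_classes

def resolve_label_names(dataset, default_num_classes):
    # 'dataset' may be None, a known dataset name such as 'voc07', or
    # (for CustomDataset) a tuple of class names passed by the eval hook.
    if dataset is None:
        label_names = [str(i) for i in range(default_num_classes)]
    elif mmcv.is_str(dataset):
        label_names = get_classes(dataset)
    else:
        label_names = dataset
    # The workaround: size everything off the resolved names instead of
    # a COCO-sized default.
    num_classes = len(label_names)
    return label_names, num_classes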
Thanks for reporting! Usually L342 should work well; we will look into this bug.
BTW, L342 cannot simply be moved to after label_names is initialized, as that may cause issues at L357.
I am having exactly the same issue.
Traceback (most recent call last):
File "./tools/train.py", line 90, in
main()
File "./tools/train.py", line 86, in main
logger=logger)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 58, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 99, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 272, in train
self.call_hook('after_train_epoch')
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 229, in call_hook
getattr(hook, fn_name)(self)
File "/datadrive/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 65, in after_train_epoch
self.evaluate(runner, results)
File "/datadrive/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 110, in evaluate
print_summary=True)
File "/datadrive/mmdetection/mmdet/core/evaluation/mean_ap.py", line 327, in eval_map
print_map_summary(mean_ap, eval_results, dataset)
File "/datadrive/mmdetection/mmdet/core/evaluation/mean_ap.py", line 370, in print_map_summary
label_names[j], num_gts[i, j], results[j]['num_dets'],
IndexError: list index out of range
Traceback (most recent call last):
File "/data/anaconda/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in
main()
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
Thanks for looking into the fix.
Thanks for sharing the trace, @fboylu. Are you using CustomDataset as well?
I am trying to submit a PR for this and have a question for @hellock: could you share the use case for dataset being None here? If we are using a custom dataset, it is usually of type tuple.
Yes, I am using a custom dataset. Thanks.
@domarps where is your PR for this fix? I would like to take a look, thank you.
same issue using CustomDataset:
Traceback (most recent call last):
File "./tools/train.py", line 94, in
main()
File "./tools/train.py", line 90, in main
logger=logger)
File "/home/lzhang/PycharmProjects/mmdetection/mmdet/apis/train.py", line 59, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/home/lzhang/PycharmProjects/mmdetection/mmdet/apis/train.py", line 171, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/home/lzhang/anaconda3/envs/FCOS/lib/python3.6/site-packages/mmcv-0.2.7-py3.6.egg/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/home/lzhang/anaconda3/envs/FCOS/lib/python3.6/site-packages/mmcv-0.2.7-py3.6.egg/mmcv/runner/runner.py", line 272, in train
self.call_hook('after_train_epoch')
File "/home/lzhang/anaconda3/envs/FCOS/lib/python3.6/site-packages/mmcv-0.2.7-py3.6.egg/mmcv/runner/runner.py", line 229, in call_hook
getattr(hook, fn_name)(self)
File "/home/lzhang/PycharmProjects/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 65, in after_train_epoch
self.evaluate(runner, results)
File "/home/lzhang/PycharmProjects/mmdetection/mmdet/core/evaluation/eval_hooks.py", line 110, in evaluate
print_summary=True)
File "/home/lzhang/PycharmProjects/mmdetection/mmdet/core/evaluation/mean_ap.py", line 327, in eval_map
print_map_summary(mean_ap, eval_results, dataset)
File "/home/lzhang/PycharmProjects/mmdetection/mmdet/core/evaluation/mean_ap.py", line 378, in print_map_summary
label_names[j], num_gts[i, j], results[j]['num_dets'],
IndexError: tuple index out of range
@hellock @domarps
It seems that the bug can be solved. If your CustomDataset is COCO style, you should modify this line to:
if issubclass(dataset_type, datasets.yourCustomDataset):
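For context, the eval-hook registration in mmdet/apis/train.py looks roughly like this (an approximation, not verbatim upstream code), which is why a COCO-style CustomDataset otherwise falls through to the plain eval_map path:

from mmdet import datasets
from mmdet.core import CocoDistEvalmAPHook, DistEvalmAPHook

def register_eval_hook(runner, val_dataset_cfg):
    # The eval hook is chosen from the class of the validation dataset;
    # widening the issubclass check (as suggested above) routes a
    # COCO-style custom dataset through the COCO evaluation hook instead.
    dataset_type = getattr(datasets, val_dataset_cfg.type)
    if issubclass(dataset_type, datasets.CocoDataset):
        runner.register_hook(CocoDistEvalmAPHook(val_dataset_cfg))
    else:
        runner.register_hook(DistEvalmAPHook(val_dataset_cfg))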
@domarps @hellock @fboylu
Unfortunately, this does not seem to be a trivial fix. The cause of the IndexError: tuple index out of range is that num_classes has been initialized to the default of 80 thing classes from MS-COCO.
One quick way to verify this is to do the following:
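A sanity check along these lines (a sketch assuming the 0.6-era get_dataset API and the config path from the top of the thread; adjust to your setup):

from mmcv import Config
from mmdet.datasets import get_dataset

# Compare the classes the validation dataset reports against the number
# of classes configured in the box head.
cfg = Config.fromfile('configs/my_config_faster_rcnn_r101_fpn_1x.py')
val_dataset = get_dataset(cfg.data.val)
print(len(val_dataset.CLASSES))          # foreground classes in your data
print(cfg.model.bbox_head.num_classes)   # should be len(CLASSES) + 1 here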
Check that num_classes in bbox_head is set to the right number for your CustomDataset before eval_map runs. @leochangzliao - I tried it but I ran into some other issue. Did you notice any changes with respect to the above observations after this fix?
No, my code runs smoothly after modifying the code mentioned above.
@domarps
I have only 1 class, and it seems I left num_classes in bbox_head at the Pascal VOC default (21). But when I change it to 1, I get the error below, so maybe I am hitting some other issue with having just 1 class. I am not able to train without --validate when setting num_classes=1 in bbox_head. My training results are very bad anyway with num_classes=21; could that be the reason? @hellock can you comment? @domarps any ideas?
2019-05-20 15:57:51,876 - INFO - Start running, host: fboylu@fboylulinuxgpu, work_dir: /datadrive/mmdetection/work_dirs/cust_faster_rcnn_r50_fpn_1x_voc0712
2019-05-20 15:57:51,876 - INFO - workflow: [('train', 1)], max: 4 epochs
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensorcur_target >= 0 && cur_target < n_classes failed.
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensorcur_target >= 0 && cur_target < n_classes failed.
Traceback (most recent call last):
File "./tools/train.py", line 90, in
main()
File "./tools/train.py", line 86, in main
logger=logger)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 58, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 99, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/datadrive/mmdetection/mmdet/models/detectors/base.py", line 84, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/datadrive/mmdetection/mmdet/models/detectors/two_stage.py", line 152, in forward_train
*bbox_targets)
File "/datadrive/mmdetection/mmdet/models/bbox_heads/bbox_head.py", line 102, in loss
4)[pos_inds, labels[pos_inds]]
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653215914/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7faf23bd1dc5 in /data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7faf23bc1640 in /data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
[frames #1 and #3-#19 omitted: symbol names lost in formatting]
frame #20: PyDict_SetItem + 0x4d2 (0x5640b8a81792 in /data/anaconda/envs/open-mmlab/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x5640b8a8223f in /data/anaconda/envs/open-mmlab/bin/python)
frame #22: PyImport_Cleanup + 0x9e (0x5640b8ab927e in /data/anaconda/envs/open-mmlab/bin/python)
frame #23: Py_FinalizeEx + 0x67 (0x5640b8b2c8a7 in /data/anaconda/envs/open-mmlab/bin/python)
frame #25: _Py_UnixMain + 0x3c (0x5640b8b44f7c in /data/anaconda/envs/open-mmlab/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7faf5f9d3830 in /lib/x86_64-linux-gnu/libc.so.6)
[frames #24 and #27 omitted: symbol names lost in formatting]
/opt/conda/conda-bld/pytorch_1556653215914/work/aten/src/THCUNN/ClassNLLCriterion.cu:56: void ClassNLLCriterion_updateOutput_no_reduce_kernel(int, THCDeviceTensorcur_target >= 0 && cur_target < n_classes failed.
[the assertion line above is repeated once per failing CUDA thread]
Traceback (most recent call last):
File "./tools/train.py", line 90, in
main()
File "./tools/train.py", line 86, in main
logger=logger)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 58, in train_detector
_dist_train(model, dataset, cfg, validate=validate)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 99, in _dist_train
runner.run(data_loaders, cfg.workflow, cfg.total_epochs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 356, in run
epoch_runner(data_loaders[i], **kwargs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/runner/runner.py", line 262, in train
self.model, data_batch, train_mode=True, **kwargs)
File "/datadrive/mmdetection/mmdet/apis/train.py", line 38, in batch_processor
losses = model(**data)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/mmcv/parallel/distributed.py", line 50, in forward
return self.module(*inputs[0], **kwargs[0])
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/datadrive/mmdetection/mmdet/models/detectors/base.py", line 84, in forward
return self.forward_train(img, img_meta, **kwargs)
File "/datadrive/mmdetection/mmdet/models/detectors/two_stage.py", line 152, in forward_train
*bbox_targets)
File "/datadrive/mmdetection/mmdet/models/bbox_heads/bbox_head.py", line 102, in loss
4)[pos_inds, labels[pos_inds]]
RuntimeError: copy_if failed to synchronize: device-side assert triggered
terminate called after throwing an instance of 'c10::Error'
what(): CUDA error: device-side assert triggered (insert_events at /opt/conda/conda-bld/pytorch_1556653215914/work/c10/cuda/CUDACachingAllocator.cpp:564)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x45 (0x7fcf0d12cdc5 in /data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #2: c10::TensorImpl::release_resources() + 0x50 (0x7fcf0d11c640 in /data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/lib/libc10.so)
[frames #1 and #3-#19 omitted: symbol names lost in formatting]
frame #20: PyDict_SetItem + 0x4d2 (0x563624165792 in /data/anaconda/envs/open-mmlab/bin/python)
frame #21: PyDict_SetItemString + 0x4f (0x56362416623f in /data/anaconda/envs/open-mmlab/bin/python)
frame #22: PyImport_Cleanup + 0x9e (0x56362419d27e in /data/anaconda/envs/open-mmlab/bin/python)
frame #23: Py_FinalizeEx + 0x67 (0x5636242108a7 in /data/anaconda/envs/open-mmlab/bin/python)
frame #25: _Py_UnixMain + 0x3c (0x563624228f7c in /data/anaconda/envs/open-mmlab/bin/python)
frame #26: __libc_start_main + 0xf0 (0x7fcf48f2e830 in /lib/x86_64-linux-gnu/libc.so.6)
[frames #24 and #27 omitted: symbol names lost in formatting]
Traceback (most recent call last):
File "/data/anaconda/envs/open-mmlab/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 235, in
main()
File "/data/anaconda/envs/open-mmlab/lib/python3.7/site-packages/torch/distributed/launch.py", line 231, in main
cmd=process.args)
subprocess.CalledProcessError: Command '['/data/anaconda/envs/open-mmlab/bin/python', '-u', './tools/train.py', '--local_rank=0', 'configs/pascal_voc/my_faster_rcnn_r50_fpn_1x_voc0712.py', '--launcher', 'pytorch']' died with
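The ClassNLLCriterion assertion (cur_target >= 0 && cur_target < n_classes) means a ground-truth label index was out of range for the head's configured number of classes. Because the assert fires asynchronously, the Python traceback can point at an unrelated line; a standard PyTorch debugging step (general advice, not specific to this thread) is to force synchronous kernel launches:

import os
# Must be set before CUDA is initialized (i.e., before importing torch,
# or on the command line at process launch). Kernels then run
# synchronously, so the traceback points at the op that actually failed.
os.environ['CUDA_LAUNCH_BLOCKING'] = '1'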
I think this is related to https://github.com/open-mmlab/mmdetection/issues/344, so I am using 1+1 (for background) and training seems to go fine now; --validate works as well. Just not getting good results.
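For reference, under the 1.x-era convention the relevant field looks like this (a minimal sketch; the surrounding Faster R-CNN config is elided):

# num_classes here counts the background slot, so K foreground classes
# need num_classes = K + 1.
model = dict(
    bbox_head=dict(
        num_classes=2))  # 1 foreground class + 1 background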
Thanks, I solved my issue using your approach.
I had the same error while trying to train Faster R-CNN. Initially there was no model configuration in the config file, but after adding the following, the error was fixed:
model = dict(
roi_head=dict(
bbox_head=dict(
num_classes=len(classes))))
It's important to override num_classes from where we are calling the base config...
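For example, a minimal sketch of such a derived config (the base config path and class names are placeholders):

# mmdetection 2.x convention: num_classes counts only foreground classes.
_base_ = './faster_rcnn_r50_fpn_1x_coco.py'  # placeholder base config

classes = ('class_1', 'class_2', 'class_3')  # your dataset's classes

data = dict(
    train=dict(classes=classes),
    val=dict(classes=classes),
    test=dict(classes=classes))

model = dict(
    roi_head=dict(
        bbox_head=dict(
            num_classes=len(classes))))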